kvcache: Use Cast instead of Copy for flash attention masks

Flash attention kernels require the mask of the KV cache to be F16
rather than F32. We can use the GGML operation ggml_cast to do this
conversion rather than doing it ourselves, which allows reuse of a
preallocated buffer in the graph rather than allocating a new one
for each batch. This improves token generation performance with
flash attention by 10-30% (with gpt-oss). It also makes performance
with flash attention better than without it, as expected.
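As a rough illustration, the Go sketch below mirrors the Cast method this
commit adds to the Tensor interface (see the diff further down). The DType
constants, Context type, and maskForAttention helper are simplified
stand-ins for ollama's ml package, not its actual code:

	package kvcache

	// Minimal stand-ins for the ml package types referenced by this commit.
	type DType int

	const (
		DTypeF32 DType = iota
		DTypeF16
	)

	type Context interface{}

	type Tensor interface {
		Shape() []int
		DType() DType
		// Cast records a dtype conversion as a graph operation (ggml_cast
		// in the GGML backend), so the result can reuse a preallocated
		// graph buffer instead of a fresh per-batch allocation.
		Cast(ctx Context, dtype DType) Tensor
	}

	// maskForAttention is a hypothetical helper: the mask is built once as
	// F32, and when flash attention is enabled it is cast to F16 inside the
	// graph rather than being converted on the host for every batch.
	func maskForAttention(ctx Context, mask Tensor, flashAttention bool) Tensor {
		if flashAttention && mask.DType() != DTypeF16 {
			return mask.Cast(ctx, DTypeF16)
		}
		return mask
	}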
Author: Jesse Gross
Date: 2025-08-19 09:52:18 -07:00
Committed by: Jesse Gross
Parent: f804e8a460
Commit: 05ccb17c6e
3 changed files with 29 additions and 20 deletions


@@ -396,6 +396,7 @@ type Tensor interface {
 	Shape() []int
 	DType() DType
+	Cast(ctx Context, dtype DType) Tensor
 	Bytes() []byte
 	Floats() []float32