kvcache: Use Cast instead of Copy for flash attention masks

Flash attention kernels require the mask of the KV cache to be F16
rather than F32. We can use the GGML operation ggml_cast to do this
conversion rather than doing it ourselves, which allows reuse of a
preallocated buffer in the graph rather than allocating a new one
for each batch. This improves token generation performance with
flash attention by 10-30% (with gpt-oss). It also makes performance
with flash attention better than without it, as expected.
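As a rough illustration, the Go sketch below mirrors the Cast method this
commit adds to the Tensor interface (see the diff further down). The DType
constants, Context type, and maskForAttention helper are simplified
stand-ins for ollama's ml package, not its actual code:

	package kvcache

	// Minimal stand-ins for the ml package types referenced by this commit.
	type DType int

	const (
		DTypeF32 DType = iota
		DTypeF16
	)

	type Context interface{}

	type Tensor interface {
		Shape() []int
		DType() DType
		// Cast records a dtype conversion as a graph operation (ggml_cast
		// in the GGML backend), so the result can reuse a preallocated
		// graph buffer instead of a fresh per-batch allocation.
		Cast(ctx Context, dtype DType) Tensor
	}

	// maskForAttention is a hypothetical helper: the mask is built once as
	// F32, and when flash attention is enabled it is cast to F16 inside the
	// graph rather than being converted on the host for every batch.
	func maskForAttention(ctx Context, mask Tensor, flashAttention bool) Tensor {
		if flashAttention && mask.DType() != DTypeF16 {
			return mask.Cast(ctx, DTypeF16)
		}
		return mask
	}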
Author: Jesse Gross
Date: 2025-08-19 09:52:18 -07:00
Committed by: Jesse Gross
Parent: f804e8a460
Commit: 05ccb17c6e
3 changed files with 29 additions and 20 deletions


@@ -396,6 +396,7 @@ type Tensor interface {
 	Shape() []int
 	DType() DType
+	Cast(ctx Context, dtype DType) Tensor
 	Bytes() []byte
 	Floats() []float32