ollama-for-amd

mirror of https://github.com/likelovewant/ollama-for-amd.git synced 2025-12-21 14:26:30 +00:00

Jesse Gross 4183bb0574 kvcache: Enable SWA to retain additional entries

Models that use sliding window attention can only resume a sequence
from the cache if it falls within the saved windows. This works well
if the next message picks up where the old one left off. However, it
generally prevents a partial prefix match unless the entire conversation
falls within the sliding window.

This can be a problem with reasoning models where the traces are
supposed to be removed from future messages, forcing the entire
history to be re-evaluated.

This change allows models to specify that a larger amount of the
history be retained in memory, to allow more partial resumption.
It still respects the window that the model was trained on for
token generation.
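
The resumption rule described above can be illustrated with a minimal sketch. It assumes a simplified model in which the cache keeps only the most recent `memory` positions and a token at position p attends to at most the `window` positions before it. The names `canResume`, `cached`, `prefix`, `window`, and `memory` are hypothetical and are not taken from causal.go; the real cache may differ in its exact boundaries.

```go
package main

import "fmt"

// maxInt returns the larger of two ints.
func maxInt(a, b int) int {
	if a > b {
		return a
	}
	return b
}

// canResume is a hypothetical helper (not part of the actual kvcache API).
// It reports whether a sequence that has already processed `cached` tokens
// can be resumed after a prefix of `prefix` tokens, given:
//   - window: the sliding window size the model was trained with
//   - memory: how many of the most recent positions the cache retains
//
// Resuming at `prefix` needs the positions [prefix-window, prefix) to still
// be in the cache, while the cache holds only [cached-memory, cached).
func canResume(cached, prefix, window, memory int) bool {
	if prefix < 0 || prefix > cached {
		return false
	}
	// The cache can never retain less than the attention window itself.
	memory = maxInt(memory, window)

	oldestRetained := maxInt(0, cached-memory)
	oldestNeeded := maxInt(0, prefix-window)
	return oldestNeeded >= oldestRetained
}

func main() {
	const window = 128

	// Retaining only the window: resuming works at the end of the
	// sequence, but not at an earlier prefix.
	fmt.Println(canResume(1000, 1000, window, window)) // true
	fmt.Println(canResume(1000, 900, window, window))  // false

	// Retaining extra history: the same partial prefix can now be resumed.
	fmt.Println(canResume(1000, 900, window, 512)) // true

	// A conversation shorter than the window can resume at any prefix.
	fmt.Println(canResume(100, 50, window, window)) // true
}
```

In this model, attention during token generation is still limited to `window` positions; the larger `memory` value only widens the range of prefixes from which a sequence can resume.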

2025-07-31 14:48:01 -07:00

File            Last commit                                             Date
cache.go        ollamarunner: Preallocate worst case graph at startup   2025-04-08 10:01:28 -07:00
causal_test.go  kvcache: Enable SWA to retain additional entries        2025-07-31 14:48:01 -07:00
causal.go       kvcache: Enable SWA to retain additional entries        2025-07-31 14:48:01 -07:00
encoder.go      ollamarunner: Preallocate worst case graph at startup   2025-04-08 10:01:28 -07:00
wrapper.go      ollamarunner: Preallocate worst case graph at startup   2025-04-08 10:01:28 -07:00