Michael Yang
333203d871
chore: update models to use slice/chunk/chunksections (#12934)
* use slice/chunks
* bert
* llama4
* gemma3n
* gptoss
* mistral3
* qwen3vl
* qwen25vl
* deepseek2
* remove unused ops
2025-11-13 15:20:12 -08:00
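The slice/chunk pattern this commit applies across the listed models replaces hand-computed tensor views with a helper that splits a fused tensor along one dimension. Below is a minimal sketch of that pattern in Go, with `Tensor` and `Chunk` as illustrative stand-ins rather than ollama's actual `ml` API:

```go
package main

import "fmt"

// Tensor is a stand-in for a backend tensor handle; only the shape matters here.
type Tensor struct {
	shape []int
}

// Chunk splits t into n equal parts along dim. The name mirrors the
// slice/chunk helpers the commit title mentions, but the signature is an
// assumption for illustration.
func Chunk(t Tensor, n, dim int) []Tensor {
	parts := make([]Tensor, n)
	for i := range parts {
		shape := append([]int(nil), t.shape...)
		shape[dim] = t.shape[dim] / n
		parts[i] = Tensor{shape: shape}
	}
	return parts
}

func main() {
	// A fused QKV projection: Q, K, and V stacked along dimension 0.
	qkv := Tensor{shape: []int{3 * 4096, 8}}
	parts := Chunk(qkv, 3, 0)
	q, k, v := parts[0], parts[1], parts[2]
	fmt.Println(q.shape, k.shape, v.shape) // [4096 8] [4096 8] [4096 8]
}
```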
Michael Yang
1188f408dd
s/From*Slice/From*s/ (#12255)
2025-10-28 12:08:49 -07:00
Daniel Hiltgen
bc1a818fdc
contiguous input per layer (#12686)
Co-authored-by: Michael Yang <git@mxy.ng>
2025-10-17 18:39:18 -07:00
Michael Yang
564b558c92
fix(llama): other llama flavours (#12308)
* fix(llama): rope scale
* spm llama
* skip moe models
* cleanup
2025-09-17 12:12:21 -07:00
Michael Yang
ad95d5b30b
use split activations when possible (#12293)
* use ggml_*_split activations when possible
* forward qkv
2025-09-16 09:51:19 -07:00
Michael Yang
6f7117145f
batch: use tensors for outputs (#12185)
This cleans up the model interface slightly without much impact on other areas.
2025-09-15 14:33:06 -07:00
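The idea is that the batch's output selection travels as a tensor, so a model can gather exactly the rows whose logits are needed with a single backend op instead of converting index slices inside each model. A hedged sketch, with `Batch`, `IntTensor`, and `gatherRows` as illustrative assumptions rather than ollama's actual interface:

```go
package main

import "fmt"

// IntTensor stands in for a backend tensor of row indices.
type IntTensor struct{ data []int32 }

// Batch carries its output selection as a tensor rather than a plain Go
// slice, so models can gather the needed rows with one backend op.
type Batch struct {
	Inputs  []int32
	Outputs IntTensor
}

// gatherRows mimics a Rows-style gather: keep only the hidden states whose
// logits the caller asked for.
func gatherRows(hidden [][]float32, outputs IntTensor) [][]float32 {
	out := make([][]float32, 0, len(outputs.data))
	for _, i := range outputs.data {
		out = append(out, hidden[i])
	}
	return out
}

func main() {
	hidden := [][]float32{{0.1}, {0.2}, {0.3}}
	b := Batch{Inputs: []int32{11, 22, 33}, Outputs: IntTensor{data: []int32{2}}}
	fmt.Println(gatherRows(hidden, b.Outputs)) // [[0.3]]: only the last position
}
```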
Oliver Simons
ea85e27bbd
Increase performance for Gemma3n models on NVGPUs by enabling CUDA Graph execution (#11525)
* Enable CUDA Graphs for gemma3n.
Similar to https://github.com/ggml-org/llama.cpp/pull/14741, though ollama has a slightly different model graph than llama.cpp, which requires different workaround checks.
* Remove the residual check by reshaping differently in the gemma3n model; this should make the heuristics more robust.
2025-07-29 12:37:06 -07:00
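CUDA Graph execution replays a pre-captured sequence of kernel launches, which is only valid while the graph's topology stays the same; the "workaround checks" above are heuristics deciding when a captured graph may be reused. A minimal sketch of that reuse gate, assuming a hypothetical shape-signature key (not llama.cpp's or ollama's actual implementation):

```go
package main

import "fmt"

// graphKey is the shape signature a captured graph is valid for.
type graphKey struct {
	nTokens, nSeqs int
}

// graphCache holds one captured graph and the key it was captured with.
type graphCache struct {
	key      graphKey
	captured bool
}

// replayOrCapture replays the cached graph when the key matches and
// re-captures otherwise (here it just records the new key).
func (c *graphCache) replayOrCapture(k graphKey) string {
	if c.captured && c.key == k {
		return "replay"
	}
	c.key, c.captured = k, true
	return "capture"
}

func main() {
	var c graphCache
	fmt.Println(c.replayOrCapture(graphKey{nTokens: 512, nSeqs: 1})) // capture (prefill)
	fmt.Println(c.replayOrCapture(graphKey{nTokens: 1, nSeqs: 1}))   // capture (decode)
	fmt.Println(c.replayOrCapture(graphKey{nTokens: 1, nSeqs: 1}))   // replay
}
```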
Michael Yang
73b642e6f3
add new gemma model (#11204)
* update patches
* cherry pick metal mean kernel
* cherry pick cuda mean kernel
* gemma3n
2025-06-25 21:47:09 -07:00