Commit Graph

8 Commits

Author SHA1 Message Date
Michael Yang
333203d871 chore: update models to use slice/chunk/chunksections (#12934)
* use slice/chunks

* bert

* llama4

* gemma3n

* gptoss

* mistral3

* qwen3vl

* qwen25vl

* deepseek2

* remove unused ops
2025-11-13 15:20:12 -08:00
Michael Yang
1188f408dd s/From*Slice/From*s/ (#12255) 2025-10-28 12:08:49 -07:00
Daniel Hiltgen
bc1a818fdc contiguous input per layer (#12686)
Co-authored-by: Michael Yang <git@mxy.ng>
2025-10-17 18:39:18 -07:00
Michael Yang
564b558c92 fix(llama): other llama flavours (#12308)
* fix(llama): rope scale

* spm llama

* skip moe models

* cleanup
2025-09-17 12:12:21 -07:00
Michael Yang
ad95d5b30b use split activations when possible (#12293)
* use ggml_*_split activations when possible

* forward qkv
2025-09-16 09:51:19 -07:00
Michael Yang
6f7117145f batch: use tensors for outputs (#12185)
this cleans up the model interface slightly without too much impact in
other areas
2025-09-15 14:33:06 -07:00
Oliver Simons
ea85e27bbd Increase performance for Gemma3n models on NVGPUs by enabling CUDA Graph execution (#11525)
* Enable CUDA Graphs for gemma3n.

Similar to
https://github.com/ggml-org/llama.cpp/pull/14741,
though ollama has a slightly different model graph
than llama.cpp which requires different workaround
checks.

* Remove residual check by reshaping differently in gemma3n model

This should make the heuristics more robust
2025-07-29 12:37:06 -07:00
Michael Yang
73b642e6f3 add new gemma model (#11204)
* update patches

* cherry pick metal mean kernel

* cherry pick cuda mean kernel

* gemma3n
2025-06-25 21:47:09 -07:00