update vendored llama.cpp and ggml (#11823)

* TEMPORARY: Update the llama.cpp upstream to my fork's Granite Four branch

This will be redone once my branch is merged upstream in llama.cpp

* feat: Update all patches

There are a number that are no longer needed at all:

- 0003-embeddings: Embeddings entirely overhauled on master
- 0008-ensure-KV-cache-is-fully-defragmented: KV caching entirely
    overhauled on master
- 0019-metal-add-mean-kernel-14267: Merged upstream
- 0020-CUDA-add-mean-operation-14313: Merged upstream

* feat: Sync llama.cpp and ggml

* fix: Update rsync-filter for all moved/new/removed files

* fix: Add files missing from sync

* fix: Update ggml rsync-filter for new ggml-cpu/arch subdirs

* fix: Add ggml files missing from sync

* fix: Narrow llama.cpp rsync-filter to not include mtmd main tool cpp files

* fix: Remove mtmd main cpp files

* fix: Add missing include in sampling_ext.cpp

* fix: Update llama.go to use mtmd instead of clip/llava

* fix: Add patch for mtmd_input_text

* chore: Ignore *.patched in the patch directory

* fix: Fix support for arch-specific ggml-cpu source files with new arrangement

In https://github.com/ggml-org/llama.cpp/pull/13892, all arch-specific
implementations were split out into a nested tree structure under
ggml-cpu/arch. This conflicts with the standard CGO layout, where all
arch-specific source files are expected to live in the same directory as
the parent Go module and use suffixes based on GOOS and GOARCH. As such,
there were really two options for getting this to work:

1. Add a patch on top of the GGML sync to rearrange the files to match the
Go layout convention
2. Use CGO directives to conditionally include the nested source files in
the compilation units

This commit does (2) in order to minimize the set of changes needed on top
of the upstream file layout. To get this to work, there are two key things
needed:

1. In cpu.go, #cgo directives are added to explicitly define __${GOARCH}__ for
the C preprocessor
2. In arch-impls.c|cpp, an #ifdef | #elif defined | #endif chain explicitly
includes the .c|.cpp files for the given architecture from the nested
directory (see the sketch after this list)

* fix: Use mtmd_helper to correctly load the bitmap for the image

* fix: Apply patch for mtmd_text_input

* fix: Add missing stb to llama.cpp rsync-filter

* fix: Add sync'ed stb vendored header

* fix: Use C++17 and include vendor for Go wrapper modules

* fix: Update patch 0015 for upstream implementation of uuid

* feat: Bump to the latest tip of the branch

* fix: Update patches for bump

* feat: Bump back to the central repo and point at the latest master

This includes Granite 4 and a number of other model architectures!

* fix: Revert changes to ggml export GPU UUID patch

* fix: Add patch for GGML_VERSION and GGML_COMMIT constants

* feat: Sync all patched code

* build: Include cmake/common.cmake in ggml sync

* build: Add top-level include for GNUInstallDirs in CMakeLists.txt

This is used to populate CMAKE_INSTALL_BINDIR

* fix: Add a patch to avoid power throttling API on non-msvc windows builds

* fix: Sync patch changes for ggml-cpu.c

* feat: Bump llama.cpp to 4a4f42

This picks up support for Kimi K2 and PLaMO-2

* feat: Sync llama.cpp

* fix: Handle multi-chunk image encodings from mtmd

* fix: Re-number patches after merge with `main`

* feat: Bump to 41e78c in the makefile

* fix: Fix Solar and argsort/copy patches after bump

* fix: Remove Gemma3n CUDA Graphs patch

It was implemented upstream:
https://github.com/ggml-org/llama.cpp/pull/14741

* feat: Sync llama.cpp / ggml after latest bump

* build: Remove unnecessary CFLAGS definitions in cpu.go

* fix: Remove unnecessary additions in the rsync-filter

* fix: Remove unused vendored code for chat template parsing

* Revert "fix: Remove Gemma3n CUDA Graphs patch"

This reverts commit d724caced3ce21f08924d4b7801f94ce6638f6ea.

* fix: Update 0020 CUDA Graphs for gemma3n to keep both llama.cpp and ollama fixes

https://github.com/ollama/ollama/pull/11195#issuecomment-3137312394

* fix: Sync ggml-cuda.cu after keeping both style cuda graph fixes for gemma3n

* unwind mxfp4 patch

Prepare to bump ggml with their impl for mxfp4

* bump

* fix windows build error

* Convert tensors at load time

Repack the mxfp4 tensors into the layout ggml's kernels expect (see the sketch below).
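
As a purely illustrative sketch of the load-time repack idea (not the converter
code in this change), assume the checkpoint keeps the quantized blocks and their
per-block scales as two separate byte streams, and that the target layout wants
each block's scale byte immediately followed by its packed values:

    package convert // hypothetical package name

    // repackMXFP4 interleaves one scale byte with each block's packed 4-bit
    // values; the block size and any within-block reordering are placeholders
    // for the real mxfp4 layout.
    func repackMXFP4(blocks, scales []byte, blockBytes int) []byte {
        out := make([]byte, 0, len(scales)*(blockBytes+1))
        for i := range scales {
            out = append(out, scales[i])                                // per-block scale
            out = append(out, blocks[i*blockBytes:(i+1)*blockBytes]...) // packed values
        }
        return out
    }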

* convert mlp bf16 to f32

* buffer the conversion better

* reshape earlier

* openai swiglu

* add ids

* split qkv, gate_up

* fix nested alt tags

* fast attention

* remove debug messages

* fix lint

* remove redundant test

* remap values only if source/target are different

* add back i32->i32 copy

* refactor cpu quants

* clean up vendor

* update patch instructions

* clean up patches

* remove webgpu

* update mem

* also handle gpt-oss

* revert convert changes

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Michael Yang
2025-08-14 14:42:58 -07:00
committed by GitHub
parent 7ccfd97a93
commit 1a19df1f3a
243 changed files with 151610 additions and 43145 deletions

llama/patches/.gitignore vendored Normal file
View File

@@ -0,0 +1 @@
*.patched

View File

@@ -12,19 +12,18 @@ MSVC and freed by Clang, which can cause problems.
This moves freeing of the buffers into the backends to avoid the
problem.
---
ggml/src/ggml-backend.cpp | 9 +++++++--
ggml/src/ggml-cann/ggml-cann.cpp | 2 ++
ggml/src/ggml-cuda/ggml-cuda.cu | 3 +++
ggml/src/ggml-kompute/ggml-kompute.cpp | 1 +
ggml/src/ggml-metal/ggml-metal.m | 1 +
ggml/src/ggml-opencl/ggml-opencl.cpp | 1 +
ggml/src/ggml-rpc/ggml-rpc.cpp | 1 +
ggml/src/ggml-sycl/ggml-sycl.cpp | 3 +++
ggml/src/ggml-vulkan/ggml-vulkan.cpp | 2 ++
9 files changed, 21 insertions(+), 2 deletions(-)
ggml/src/ggml-backend.cpp | 9 +++++++--
ggml/src/ggml-cann/ggml-cann.cpp | 2 ++
ggml/src/ggml-cuda/ggml-cuda.cu | 3 +++
ggml/src/ggml-metal/ggml-metal.m | 1 +
ggml/src/ggml-opencl/ggml-opencl.cpp | 1 +
ggml/src/ggml-rpc/ggml-rpc.cpp | 1 +
ggml/src/ggml-sycl/ggml-sycl.cpp | 3 +++
ggml/src/ggml-vulkan/ggml-vulkan.cpp | 2 ++
8 files changed, 20 insertions(+), 2 deletions(-)
diff --git a/ggml/src/ggml-backend.cpp b/ggml/src/ggml-backend.cpp
index b30b4cb3..0ce73a99 100644
index 1b9d29e9..97f47abd 100644
--- a/ggml/src/ggml-backend.cpp
+++ b/ggml/src/ggml-backend.cpp
@@ -107,7 +107,6 @@ void ggml_backend_buffer_free(ggml_backend_buffer_t buffer) {
@@ -35,7 +34,7 @@ index b30b4cb3..0ce73a99 100644
}
size_t ggml_backend_buffer_get_size(ggml_backend_buffer_t buffer) {
@@ -544,6 +543,7 @@ static void ggml_backend_multi_buffer_free_buffer(ggml_backend_buffer_t buffer)
@@ -529,6 +528,7 @@ static void ggml_backend_multi_buffer_free_buffer(ggml_backend_buffer_t buffer)
free(ctx->buffers);
free(ctx);
@@ -43,7 +42,7 @@ index b30b4cb3..0ce73a99 100644
}
static void ggml_backend_multi_buffer_clear(ggml_backend_buffer_t buffer, uint8_t value) {
@@ -1871,6 +1871,11 @@ static void * ggml_backend_cpu_buffer_get_base(ggml_backend_buffer_t buffer) {
@@ -1890,6 +1890,11 @@ static void * ggml_backend_cpu_buffer_get_base(ggml_backend_buffer_t buffer) {
static void ggml_backend_cpu_buffer_free_buffer(ggml_backend_buffer_t buffer) {
ggml_aligned_free(buffer->context, buffer->size);
@@ -55,7 +54,7 @@ index b30b4cb3..0ce73a99 100644
}
static void ggml_backend_cpu_buffer_memset_tensor(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor, uint8_t value, size_t offset, size_t size) {
@@ -1918,7 +1923,7 @@ static const struct ggml_backend_buffer_i ggml_backend_cpu_buffer_i = {
@@ -1937,7 +1942,7 @@ static const struct ggml_backend_buffer_i ggml_backend_cpu_buffer_i = {
};
static const struct ggml_backend_buffer_i ggml_backend_cpu_buffer_from_ptr_i = {
@@ -65,10 +64,10 @@ index b30b4cb3..0ce73a99 100644
/* .init_tensor = */ NULL, // no initialization required
/* .memset_tensor = */ ggml_backend_cpu_buffer_memset_tensor,
diff --git a/ggml/src/ggml-cann/ggml-cann.cpp b/ggml/src/ggml-cann/ggml-cann.cpp
index e2617b06..242e50a7 100644
index cf575b36..ca1addfa 100755
--- a/ggml/src/ggml-cann/ggml-cann.cpp
+++ b/ggml/src/ggml-cann/ggml-cann.cpp
@@ -800,6 +800,7 @@ static void ggml_backend_cann_buffer_free_buffer(
@@ -826,6 +826,7 @@ static void ggml_backend_cann_buffer_free_buffer(
ggml_backend_cann_buffer_context* ctx =
(ggml_backend_cann_buffer_context*)buffer->context;
delete ctx;
@@ -76,7 +75,7 @@ index e2617b06..242e50a7 100644
}
/**
@@ -1472,6 +1473,7 @@ static const char * ggml_backend_cann_host_buffer_name(ggml_backend_buffer_t buf
@@ -1572,6 +1573,7 @@ static const char * ggml_backend_cann_host_buffer_name(ggml_backend_buffer_t buf
*/
static void ggml_backend_cann_host_buffer_free(ggml_backend_buffer_t buffer) {
ACL_CHECK(aclrtFreeHost(buffer->context));
@@ -85,10 +84,10 @@ index e2617b06..242e50a7 100644
/**
diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index b4b85abc..cb0d8528 100644
index d9110491..37ee2a6d 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -534,6 +534,7 @@ struct ggml_backend_cuda_buffer_context {
@@ -567,6 +567,7 @@ struct ggml_backend_cuda_buffer_context {
static void ggml_backend_cuda_buffer_free_buffer(ggml_backend_buffer_t buffer) {
ggml_backend_cuda_buffer_context * ctx = (ggml_backend_cuda_buffer_context *)buffer->context;
delete ctx;
@@ -96,7 +95,7 @@ index b4b85abc..cb0d8528 100644
}
static bool ggml_backend_buffer_is_cuda(ggml_backend_buffer_t buffer) {
@@ -790,6 +791,7 @@ struct ggml_backend_cuda_split_buffer_context {
@@ -822,6 +823,7 @@ struct ggml_backend_cuda_split_buffer_context {
static void ggml_backend_cuda_split_buffer_free_buffer(ggml_backend_buffer_t buffer) {
ggml_backend_cuda_split_buffer_context * ctx = (ggml_backend_cuda_split_buffer_context *)buffer->context;
delete ctx;
@@ -104,7 +103,7 @@ index b4b85abc..cb0d8528 100644
}
static void * ggml_backend_cuda_split_buffer_get_base(ggml_backend_buffer_t buffer) {
@@ -1067,6 +1069,7 @@ static const char * ggml_backend_cuda_host_buffer_type_name(ggml_backend_buffer_
@@ -1103,6 +1105,7 @@ static bool ggml_backend_buft_is_cuda_host(ggml_backend_buffer_type_t buft) {
static void ggml_backend_cuda_host_buffer_free_buffer(ggml_backend_buffer_t buffer) {
CUDA_CHECK(cudaFreeHost(buffer->context));
@@ -112,23 +111,11 @@ index b4b85abc..cb0d8528 100644
}
static void * ggml_cuda_host_malloc(size_t size) {
diff --git a/ggml/src/ggml-kompute/ggml-kompute.cpp b/ggml/src/ggml-kompute/ggml-kompute.cpp
index 50579227..2799a0a5 100644
--- a/ggml/src/ggml-kompute/ggml-kompute.cpp
+++ b/ggml/src/ggml-kompute/ggml-kompute.cpp
@@ -1911,6 +1911,7 @@ static void ggml_backend_kompute_buffer_free_buffer(ggml_backend_buffer_t buffer
ggml_vk_free_memory(*memory);
}
delete memory;
+ delete buffer;
}
static void * ggml_backend_kompute_buffer_get_base(ggml_backend_buffer_t buffer) {
diff --git a/ggml/src/ggml-metal/ggml-metal.m b/ggml/src/ggml-metal/ggml-metal.m
index 576f9581..1b56f858 100644
index cb8eff4a..7bccc7bf 100644
--- a/ggml/src/ggml-metal/ggml-metal.m
+++ b/ggml/src/ggml-metal/ggml-metal.m
@@ -5214,6 +5214,7 @@ static void ggml_backend_metal_buffer_free_buffer(ggml_backend_buffer_t buffer)
@@ -6032,6 +6032,7 @@ static void ggml_backend_metal_buffer_free_buffer(ggml_backend_buffer_t buffer)
}
free(ctx);
@@ -137,10 +124,10 @@ index 576f9581..1b56f858 100644
static void * ggml_backend_metal_buffer_get_base(ggml_backend_buffer_t buffer) {
diff --git a/ggml/src/ggml-opencl/ggml-opencl.cpp b/ggml/src/ggml-opencl/ggml-opencl.cpp
index 05a2f4e6..392cc18d 100644
index 8ba1e00d..8163e8dc 100644
--- a/ggml/src/ggml-opencl/ggml-opencl.cpp
+++ b/ggml/src/ggml-opencl/ggml-opencl.cpp
@@ -1940,6 +1940,7 @@ struct ggml_backend_opencl_buffer_context {
@@ -2745,6 +2745,7 @@ struct ggml_backend_opencl_buffer_context {
static void ggml_backend_opencl_buffer_free_buffer(ggml_backend_buffer_t buffer) {
ggml_backend_opencl_buffer_context * ctx = (ggml_backend_opencl_buffer_context *) buffer->context;
delete ctx;
@@ -149,22 +136,22 @@ index 05a2f4e6..392cc18d 100644
static void * ggml_backend_opencl_buffer_get_base(ggml_backend_buffer_t buffer) {
diff --git a/ggml/src/ggml-rpc/ggml-rpc.cpp b/ggml/src/ggml-rpc/ggml-rpc.cpp
index 4f0abb5a..de1ec184 100644
index df6ba540..2e395968 100644
--- a/ggml/src/ggml-rpc/ggml-rpc.cpp
+++ b/ggml/src/ggml-rpc/ggml-rpc.cpp
@@ -483,6 +483,7 @@ static void ggml_backend_rpc_buffer_free_buffer(ggml_backend_buffer_t buffer) {
@@ -486,6 +486,7 @@ static void ggml_backend_rpc_buffer_free_buffer(ggml_backend_buffer_t buffer) {
bool status = send_rpc_cmd(ctx->sock, RPC_CMD_FREE_BUFFER, &request, sizeof(request), nullptr, 0);
GGML_ASSERT(status);
RPC_STATUS_ASSERT(status);
delete ctx;
+ delete buffer;
}
static void * ggml_backend_rpc_buffer_get_base(ggml_backend_buffer_t buffer) {
diff --git a/ggml/src/ggml-sycl/ggml-sycl.cpp b/ggml/src/ggml-sycl/ggml-sycl.cpp
index 0ea72994..ae3a3c33 100644
index 3992dad0..67503951 100644
--- a/ggml/src/ggml-sycl/ggml-sycl.cpp
+++ b/ggml/src/ggml-sycl/ggml-sycl.cpp
@@ -320,6 +320,7 @@ ggml_backend_sycl_buffer_free_buffer(ggml_backend_buffer_t buffer) try {
@@ -331,6 +331,7 @@ ggml_backend_sycl_buffer_free_buffer(ggml_backend_buffer_t buffer) try {
ggml_sycl_set_device(ctx->device);
delete ctx;
@@ -172,7 +159,7 @@ index 0ea72994..ae3a3c33 100644
}
catch (sycl::exception const &exc) {
std::cerr << exc.what() << "Exception caught at file:" << __FILE__
@@ -765,6 +766,7 @@ struct ggml_backend_sycl_split_buffer_context {
@@ -792,6 +793,7 @@ struct ggml_backend_sycl_split_buffer_context {
static void ggml_backend_sycl_split_buffer_free_buffer(ggml_backend_buffer_t buffer) {
ggml_backend_sycl_split_buffer_context * ctx = (ggml_backend_sycl_split_buffer_context *)buffer->context;
delete ctx;
@@ -180,7 +167,7 @@ index 0ea72994..ae3a3c33 100644
}
static void * ggml_backend_sycl_split_buffer_get_base(ggml_backend_buffer_t buffer) {
@@ -1099,6 +1101,7 @@ static const char * ggml_backend_sycl_host_buffer_type_name(ggml_backend_buffer_
@@ -1134,6 +1136,7 @@ static const char * ggml_backend_sycl_host_buffer_type_name(ggml_backend_buffer_
static void ggml_backend_sycl_host_buffer_free_buffer(ggml_backend_buffer_t buffer) {
ggml_sycl_host_free(buffer->context);
@@ -189,10 +176,10 @@ index 0ea72994..ae3a3c33 100644
static ggml_backend_buffer_t ggml_backend_sycl_host_buffer_type_alloc_buffer(ggml_backend_buffer_type_t buft, size_t size) {
diff --git a/ggml/src/ggml-vulkan/ggml-vulkan.cpp b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
index e2b357fd..68768029 100644
index 4070e248..394a2839 100644
--- a/ggml/src/ggml-vulkan/ggml-vulkan.cpp
+++ b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
@@ -8962,6 +8962,7 @@ static void ggml_backend_vk_buffer_free_buffer(ggml_backend_buffer_t buffer) {
@@ -10209,6 +10209,7 @@ static void ggml_backend_vk_buffer_free_buffer(ggml_backend_buffer_t buffer) {
ggml_backend_vk_buffer_context * ctx = (ggml_backend_vk_buffer_context *)buffer->context;
ggml_vk_destroy_buffer(ctx->dev_buffer);
delete ctx;
@@ -200,7 +187,7 @@ index e2b357fd..68768029 100644
}
static void * ggml_backend_vk_buffer_get_base(ggml_backend_buffer_t buffer) {
@@ -9105,6 +9106,7 @@ static const char * ggml_backend_vk_host_buffer_name(ggml_backend_buffer_t buffe
@@ -10352,6 +10353,7 @@ static const char * ggml_backend_vk_host_buffer_name(ggml_backend_buffer_t buffe
static void ggml_backend_vk_host_buffer_free_buffer(ggml_backend_buffer_t buffer) {
VK_LOG_MEMORY("ggml_backend_vk_host_buffer_free_buffer()");
ggml_vk_host_free(vk_instance.devices[0], buffer->context);

View File

@@ -10,10 +10,10 @@ logs instead of throwing an error
1 file changed, 3 insertions(+), 11 deletions(-)
diff --git a/src/llama-vocab.cpp b/src/llama-vocab.cpp
index 9389ca80..806c1b3d 100644
index f7e03e70..8ebe11cf 100644
--- a/src/llama-vocab.cpp
+++ b/src/llama-vocab.cpp
@@ -1503,16 +1503,7 @@ void llama_vocab::impl::load(llama_model_loader & ml, const LLM_KV & kv) {
@@ -1804,16 +1804,7 @@ void llama_vocab::impl::load(llama_model_loader & ml, const LLM_KV & kv) {
if (type == LLAMA_VOCAB_TYPE_BPE) {
add_space_prefix = false;
clean_spaces = true;
@@ -31,8 +31,8 @@ index 9389ca80..806c1b3d 100644
pre_type = LLAMA_VOCAB_PRE_TYPE_DEFAULT;
} else if (
tokenizer_pre == "llama3" ||
@@ -1651,7 +1642,8 @@ void llama_vocab::impl::load(llama_model_loader & ml, const LLM_KV & kv) {
pre_type = LLAMA_VOCAB_PRE_TYPE_SEED_CODER;
@@ -1975,7 +1966,8 @@ void llama_vocab::impl::load(llama_model_loader & ml, const LLM_KV & kv) {
pre_type = LLAMA_VOCAB_PRE_TYPE_KIMI_K2;
clean_spaces = false;
} else {
- throw std::runtime_error(format("unknown pre-tokenizer type: '%s'", tokenizer_pre.c_str()));

View File

@@ -10,10 +10,10 @@ filesystems for paths that include wide characters
1 file changed, 39 insertions(+)
diff --git a/tools/mtmd/clip.cpp b/tools/mtmd/clip.cpp
index 41ba45a7..cdd8ca44 100644
index 20c21733..f4f69cfc 100644
--- a/tools/mtmd/clip.cpp
+++ b/tools/mtmd/clip.cpp
@@ -31,6 +31,19 @@
@@ -28,6 +28,19 @@
#include <numeric>
#include <functional>
@@ -33,7 +33,7 @@ index 41ba45a7..cdd8ca44 100644
struct clip_logger_state g_logger_state = {GGML_LOG_LEVEL_CONT, clip_log_callback_default, NULL};
enum ffn_op_type {
@@ -2190,7 +2203,29 @@ struct clip_model_loader {
@@ -2597,7 +2610,29 @@ struct clip_model_loader {
{
std::vector<uint8_t> read_buf;
@@ -63,7 +63,7 @@ index 41ba45a7..cdd8ca44 100644
if (!fin) {
throw std::runtime_error(string_format("%s: failed to open %s\n", __func__, fname.c_str()));
}
@@ -2217,7 +2252,11 @@ struct clip_model_loader {
@@ -2624,7 +2659,11 @@ struct clip_model_loader {
ggml_backend_tensor_set(cur, read_buf.data(), 0, num_bytes);
}
}

View File

@@ -1,43 +0,0 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: jmorganca <jmorganca@gmail.com>
Date: Tue, 8 Apr 2025 15:28:34 -0700
Subject: [PATCH] embeddings
allow a loaded model in llama.cpp to be used for
both embeddings and causal attention text generation
instead of forcing one or the error
---
src/llama-context.cpp | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/src/llama-context.cpp b/src/llama-context.cpp
index 62246c10..dca22d8b 100644
--- a/src/llama-context.cpp
+++ b/src/llama-context.cpp
@@ -901,7 +901,7 @@ int llama_context::decode(llama_batch & inp_batch) {
int64_t n_outputs_all = 0;
// count outputs
- if (batch.logits && !embd_pooled) {
+ if (batch.logits) {
for (uint32_t i = 0; i < n_tokens_all; ++i) {
n_outputs_all += batch.logits[i] != 0;
}
@@ -982,7 +982,7 @@ int llama_context::decode(llama_batch & inp_batch) {
// ggml_graph_dump_dot(gf, NULL, "llama.dot");
//}
- auto * t_logits = cparams.embeddings ? nullptr : res->get_logits();
+ auto * t_logits = cparams.causal_attn ? res->get_logits() : nullptr;
auto * t_embd = cparams.embeddings ? res->get_embd() : nullptr;
if (t_embd && res->get_embd_pooled()) {
@@ -1151,7 +1151,7 @@ int32_t llama_context::output_reserve(int32_t n_outputs) {
const auto n_embd = hparams.n_embd;
// TODO: use a per-batch flag for logits presence instead
- bool has_logits = !cparams.embeddings;
+ bool has_logits = cparams.causal_attn;
bool has_embd = cparams.embeddings && (cparams.pooling_type == LLAMA_POOLING_TYPE_NONE);
// TODO: hacky enc-dec support

View File

@@ -15,18 +15,18 @@ adds support for the Solar Pro architecture
7 files changed, 248 insertions(+)
diff --git a/src/llama-arch.cpp b/src/llama-arch.cpp
index f2bc8ca7..5ab3f572 100644
index 18dcc6dd..4b285646 100644
--- a/src/llama-arch.cpp
+++ b/src/llama-arch.cpp
@@ -69,6 +69,7 @@ static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
{ LLM_ARCH_GRANITE, "granite" },
@@ -78,6 +78,7 @@ static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
{ LLM_ARCH_GRANITE_MOE, "granitemoe" },
{ LLM_ARCH_GRANITE_HYBRID, "granitehybrid" },
{ LLM_ARCH_CHAMELEON, "chameleon" },
+ { LLM_ARCH_SOLAR, "solar" },
{ LLM_ARCH_WAVTOKENIZER_DEC, "wavtokenizer-dec" },
{ LLM_ARCH_PLM, "plm" },
{ LLM_ARCH_BAILINGMOE, "bailingmoe" },
@@ -142,6 +143,7 @@ static const std::map<llm_kv, const char *> LLM_KV_NAMES = {
@@ -164,6 +165,7 @@ static const std::map<llm_kv, const char *> LLM_KV_NAMES = {
{ LLM_KV_ATTENTION_RELATIVE_BUCKETS_COUNT, "%s.attention.relative_buckets_count" },
{ LLM_KV_ATTENTION_SLIDING_WINDOW, "%s.attention.sliding_window" },
{ LLM_KV_ATTENTION_SCALE, "%s.attention.scale" },
@@ -34,7 +34,7 @@ index f2bc8ca7..5ab3f572 100644
{ LLM_KV_ATTENTION_KEY_LENGTH_MLA, "%s.attention.key_length_mla" },
{ LLM_KV_ATTENTION_VALUE_LENGTH_MLA, "%s.attention.value_length_mla" },
@@ -1502,6 +1504,24 @@ static const std::map<llm_arch, std::map<llm_tensor, const char *>> LLM_TENSOR_N
@@ -1794,6 +1796,24 @@ static const std::map<llm_arch, std::map<llm_tensor, const char *>> LLM_TENSOR_N
{ LLM_TENSOR_ATTN_K_NORM, "blk.%d.attn_k_norm" },
},
},
@@ -59,8 +59,8 @@ index f2bc8ca7..5ab3f572 100644
{
LLM_ARCH_WAVTOKENIZER_DEC,
{
@@ -1680,6 +1700,7 @@ static const std::map<llm_tensor, llm_tensor_info> LLM_TENSOR_INFOS = {
{LLM_TENSOR_FFN_EXP_PROBS_B, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_ADD}},
@@ -2219,6 +2239,7 @@ static const std::map<llm_tensor, llm_tensor_info> LLM_TENSOR_INFOS = {
{LLM_TENSOR_LAUREL_POST_NORM, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL}},
// this tensor is loaded for T5, but never used
{LLM_TENSOR_DEC_CROSS_ATTN_REL_B, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_NONE}},
+ {LLM_TENSOR_BSKCN_TV, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL}},
@@ -68,18 +68,18 @@ index f2bc8ca7..5ab3f572 100644
{LLM_TENSOR_POS_NET_NORM, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL}},
{LLM_TENSOR_POS_NET_NORM1, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL}},
diff --git a/src/llama-arch.h b/src/llama-arch.h
index 41a023da..525c1b7d 100644
index 7af587e7..3ea994c7 100644
--- a/src/llama-arch.h
+++ b/src/llama-arch.h
@@ -73,6 +73,7 @@ enum llm_arch {
LLM_ARCH_GRANITE,
@@ -82,6 +82,7 @@ enum llm_arch {
LLM_ARCH_GRANITE_MOE,
LLM_ARCH_GRANITE_HYBRID,
LLM_ARCH_CHAMELEON,
+ LLM_ARCH_SOLAR,
LLM_ARCH_WAVTOKENIZER_DEC,
LLM_ARCH_PLM,
LLM_ARCH_BAILINGMOE,
@@ -146,6 +147,7 @@ enum llm_kv {
@@ -168,6 +169,7 @@ enum llm_kv {
LLM_KV_ATTENTION_RELATIVE_BUCKETS_COUNT,
LLM_KV_ATTENTION_SLIDING_WINDOW,
LLM_KV_ATTENTION_SCALE,
@@ -87,7 +87,7 @@ index 41a023da..525c1b7d 100644
LLM_KV_ATTENTION_KEY_LENGTH_MLA,
LLM_KV_ATTENTION_VALUE_LENGTH_MLA,
@@ -346,6 +348,7 @@ enum llm_tensor {
@@ -394,6 +396,7 @@ enum llm_tensor {
LLM_TENSOR_ENC_OUTPUT_NORM,
LLM_TENSOR_CLS,
LLM_TENSOR_CLS_OUT,
@@ -96,11 +96,11 @@ index 41a023da..525c1b7d 100644
LLM_TENSOR_CONVNEXT_DW,
LLM_TENSOR_CONVNEXT_NORM,
diff --git a/src/llama-hparams.cpp b/src/llama-hparams.cpp
index 90dfe7a7..8a667960 100644
index 7a06368d..35fc054f 100644
--- a/src/llama-hparams.cpp
+++ b/src/llama-hparams.cpp
@@ -70,6 +70,14 @@ uint32_t llama_hparams::n_embd_v_s() const {
return ssm_d_state * ssm_d_inner;
@@ -146,6 +146,14 @@ uint32_t llama_hparams::n_pos_per_embd() const {
return rope_type == LLAMA_ROPE_TYPE_MROPE ? 4 : 1;
}
+bool llama_hparams::n_bskcn(uint32_t n, uint32_t il) const {
@@ -113,12 +113,12 @@ index 90dfe7a7..8a667960 100644
+
bool llama_hparams::is_swa(uint32_t il) const {
if (il < n_layer) {
return n_swa > 0 && n_swa_pattern > 0 && il % n_swa_pattern < (n_swa_pattern - 1);
return swa_layers[il];
diff --git a/src/llama-hparams.h b/src/llama-hparams.h
index 7ee6a5b7..48dce407 100644
index bd231224..29bd9056 100644
--- a/src/llama-hparams.h
+++ b/src/llama-hparams.h
@@ -55,6 +55,8 @@ struct llama_hparams {
@@ -62,6 +62,8 @@ struct llama_hparams {
std::array<uint32_t, LLAMA_MAX_LAYERS> n_head_kv_arr;
std::array<uint32_t, LLAMA_MAX_LAYERS> n_ff_arr;
@@ -127,9 +127,9 @@ index 7ee6a5b7..48dce407 100644
uint32_t n_layer_dense_lead = 0;
uint32_t n_lora_q = 0;
uint32_t n_lora_kv = 0;
@@ -154,6 +156,9 @@ struct llama_hparams {
// dimension of the recurrent state embeddings
uint32_t n_embd_v_s() const;
@@ -220,6 +222,9 @@ struct llama_hparams {
uint32_t n_pos_per_embd() const;
+ // Block skip connection
+ bool n_bskcn(uint32_t n, uint32_t il) const;
@@ -138,10 +138,10 @@ index 7ee6a5b7..48dce407 100644
};
diff --git a/src/llama-model-loader.cpp b/src/llama-model-loader.cpp
index 4cce5166..7f6617fa 100644
index f71c40f8..7eab9b68 100644
--- a/src/llama-model-loader.cpp
+++ b/src/llama-model-loader.cpp
@@ -439,6 +439,7 @@ namespace GGUFMeta {
@@ -465,6 +465,7 @@ namespace GGUFMeta {
// TODO: this is not very clever - figure out something better
template bool llama_model_loader::get_key_or_arr<std::array<int, 4>>(enum llm_kv kid, std::array<int, 4> & result, uint32_t n, bool required);
template bool llama_model_loader::get_key_or_arr<std::array<uint32_t, 512>>(enum llm_kv kid, std::array<uint32_t, 512> & result, uint32_t n, bool required);
@@ -150,10 +150,10 @@ index 4cce5166..7f6617fa 100644
llama_model_loader::llama_model_loader(
const std::string & fname,
diff --git a/src/llama-model.cpp b/src/llama-model.cpp
index 3a4e72a3..db62973f 100644
index 58ca7df7..280129e1 100644
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -1402,6 +1402,21 @@ void llama_model::load_hparams(llama_model_loader & ml) {
@@ -1706,6 +1706,21 @@ void llama_model::load_hparams(llama_model_loader & ml) {
default: type = LLM_TYPE_UNKNOWN;
}
} break;
@@ -175,7 +175,7 @@ index 3a4e72a3..db62973f 100644
case LLM_ARCH_WAVTOKENIZER_DEC:
{
ml.get_key(LLM_KV_ATTENTION_LAYERNORM_EPS, hparams.f_norm_eps);
@@ -3774,6 +3789,34 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
@@ -4793,6 +4808,34 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
layer.ffn_norm = create_tensor(tn(LLM_TENSOR_FFN_NORM, "weight", i), {n_embd}, 0);
@@ -210,12 +210,12 @@ index 3a4e72a3..db62973f 100644
layer.ffn_gate = create_tensor(tn(LLM_TENSOR_FFN_GATE, "weight", i), {n_embd, n_ff}, 0);
layer.ffn_down = create_tensor(tn(LLM_TENSOR_FFN_DOWN, "weight", i), { n_ff, n_embd}, 0);
layer.ffn_up = create_tensor(tn(LLM_TENSOR_FFN_UP, "weight", i), {n_embd, n_ff}, 0);
@@ -12397,6 +12440,165 @@ struct llm_build_chameleon : public llm_graph_context {
@@ -15495,6 +15538,165 @@ struct llm_build_granite_hybrid : public llm_graph_context_mamba {
}
};
+struct llm_build_solar : public llm_graph_context {
+ llm_build_solar(const llama_model & model, const llm_graph_params & params, ggml_cgraph * gf) : llm_graph_context(params) {
+ llm_build_solar(const llama_model & model, const llm_graph_params & params) : llm_graph_context(params) {
+ const int64_t n_embd_head = hparams.n_embd_head_v;
+ GGML_ASSERT(n_embd_head == hparams.n_embd_head_k);
+ GGML_ASSERT(n_embd_head == hparams.n_rot);
@@ -270,7 +270,7 @@ index 3a4e72a3..db62973f 100644
+ // self-attention
+ {
+ // rope freq factors for llama3; may return nullptr for llama2 and other models
+ ggml_tensor * rope_factors = model.get_rope_factors(n_ctx_per_seq, il);
+ ggml_tensor * rope_factors = model.get_rope_factors(cparams, il);
+
+ // compute Q and K and RoPE them
+ ggml_tensor * Qcur = build_lora_mm(model.layers[il].wq, cur);
@@ -314,7 +314,7 @@ index 3a4e72a3..db62973f 100644
+ cb(Kcur, "Kcur", il);
+ cb(Vcur, "Vcur", il);
+
+ cur = build_attn(inp_attn, gf,
+ cur = build_attn(inp_attn,
+ model.layers[il].wo, model.layers[il].bo,
+ Qcur, Kcur, Vcur, nullptr, nullptr, kq_scale, il);
+ cb(cur, "attn_out", il);
@@ -373,33 +373,33 @@ index 3a4e72a3..db62973f 100644
+ }
+};
+
struct llm_build_wavtokenizer_dec : public llm_graph_context {
llm_build_wavtokenizer_dec(const llama_model & model, const llm_graph_params & params, ggml_cgraph * gf) : llm_graph_context(params) {
ggml_tensor * cur;
@@ -13157,6 +13359,10 @@ llm_graph_result_ptr llama_model::build_graph(
// ref: https://github.com/facebookresearch/chameleon
// based on the original build_llama() function, changes:
// * qk-norm
@@ -18439,6 +18641,10 @@ ggml_cgraph * llama_model::build_graph(const llm_graph_params & params) const {
{
llm = std::make_unique<llm_build_chameleon>(*this, params, gf);
llm = std::make_unique<llm_build_chameleon>(*this, params);
} break;
+ case LLM_ARCH_SOLAR:
+ {
+ llm = std::make_unique<llm_build_solar>(*this, params, gf);
+ llm = std::make_unique<llm_build_solar>(*this, params);
+ } break;
case LLM_ARCH_WAVTOKENIZER_DEC:
{
llm = std::make_unique<llm_build_wavtokenizer_dec>(*this, params, gf);
@@ -13301,6 +13507,7 @@ llama_rope_type llama_model_rope_type(const llama_model * model) {
case LLM_ARCH_GRANITE:
llm = std::make_unique<llm_build_wavtokenizer_dec>(*this, params);
@@ -18652,6 +18858,7 @@ llama_rope_type llama_model_rope_type(const llama_model * model) {
case LLM_ARCH_GRANITE_MOE:
case LLM_ARCH_GRANITE_HYBRID:
case LLM_ARCH_CHAMELEON:
+ case LLM_ARCH_SOLAR:
case LLM_ARCH_BAILINGMOE:
return LLAMA_ROPE_TYPE_NORM;
case LLM_ARCH_NEO_BERT:
case LLM_ARCH_SMOLLM3:
diff --git a/src/llama-model.h b/src/llama-model.h
index 6bdec263..43746c7d 100644
index 6fcd74d5..09964533 100644
--- a/src/llama-model.h
+++ b/src/llama-model.h
@@ -65,6 +65,7 @@ enum llm_type {
@@ -70,6 +70,7 @@ enum llm_type {
LLM_TYPE_15B,
LLM_TYPE_16B,
LLM_TYPE_20B,
@@ -407,9 +407,9 @@ index 6bdec263..43746c7d 100644
LLM_TYPE_27B,
LLM_TYPE_30B,
LLM_TYPE_32B,
@@ -315,6 +316,8 @@ struct llama_layer {
struct ggml_tensor * ffn_up_scale = nullptr;
struct ggml_tensor * ffn_down_scale = nullptr;
@@ -367,6 +368,8 @@ struct llama_layer {
// openai-moe
struct ggml_tensor * attn_sinks = nullptr;
+ struct ggml_tensor * bskcn_tv = nullptr;
+

View File

@@ -12,10 +12,10 @@ regex
2 files changed, 22 insertions(+), 1 deletion(-)
diff --git a/src/llama-vocab.cpp b/src/llama-vocab.cpp
index 806c1b3d..10f34d33 100644
index 8ebe11cf..c011008f 100644
--- a/src/llama-vocab.cpp
+++ b/src/llama-vocab.cpp
@@ -298,7 +298,7 @@ struct llm_tokenizer_bpe : llm_tokenizer {
@@ -299,7 +299,7 @@ struct llm_tokenizer_bpe : llm_tokenizer {
case LLAMA_VOCAB_PRE_TYPE_DEEPSEEK_LLM:
regex_exprs = {
"[\r\n]",
@@ -25,7 +25,7 @@ index 806c1b3d..10f34d33 100644
"\\s+$",
"[一-龥ࠀ-一가-퟿]+",
diff --git a/src/unicode.cpp b/src/unicode.cpp
index e63bb4ab..73cb2b1a 100644
index 65f36651..ce336a22 100644
--- a/src/unicode.cpp
+++ b/src/unicode.cpp
@@ -2,6 +2,11 @@
@@ -62,7 +62,7 @@ index e63bb4ab..73cb2b1a 100644
#if defined(__clang__)
// disable C++17 deprecation warning for std::codecvt_utf8
# pragma clang diagnostic push
@@ -213,6 +233,7 @@ static inline std::wstring unicode_wstring_from_utf8(const std::string & s) {
@@ -218,6 +238,7 @@ static inline std::wstring unicode_wstring_from_utf8(const std::string & s) {
#endif
return conv.from_bytes(s);

View File

@@ -8,10 +8,10 @@ Subject: [PATCH] maintain ordering for rules for grammar
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/common/json-schema-to-grammar.cpp b/common/json-schema-to-grammar.cpp
index 5b3059c2..656b3eca 100644
index 637891f5..98b8280f 100644
--- a/common/json-schema-to-grammar.cpp
+++ b/common/json-schema-to-grammar.cpp
@@ -349,7 +349,7 @@ private:
@@ -307,7 +307,7 @@ private:
friend std::string build_grammar(const std::function<void(const common_grammar_builder &)> & cb, const common_grammar_options & options);
std::function<json(const std::string &)> _fetch_json;
bool _dotall;

View File

@@ -11,10 +11,10 @@ with the fastest acceleration is loaded
1 file changed, 13 insertions(+), 8 deletions(-)
diff --git a/ggml/src/ggml-backend-reg.cpp b/ggml/src/ggml-backend-reg.cpp
index 405d8e31..4e67d243 100644
index 6c315137..3040b2aa 100644
--- a/ggml/src/ggml-backend-reg.cpp
+++ b/ggml/src/ggml-backend-reg.cpp
@@ -157,7 +157,7 @@ struct ggml_backend_reg_entry {
@@ -162,7 +162,7 @@ struct ggml_backend_reg_entry {
struct ggml_backend_registry {
std::vector<ggml_backend_reg_entry> backends;
@@ -23,7 +23,7 @@ index 405d8e31..4e67d243 100644
ggml_backend_registry() {
#ifdef GGML_USE_CUDA
@@ -202,7 +202,7 @@ struct ggml_backend_registry {
@@ -207,7 +207,7 @@ struct ggml_backend_registry {
}
}
@@ -32,7 +32,7 @@ index 405d8e31..4e67d243 100644
if (!reg) {
return;
}
@@ -213,15 +213,20 @@ struct ggml_backend_registry {
@@ -218,15 +218,20 @@ struct ggml_backend_registry {
#endif
backends.push_back({ reg, std::move(handle) });
for (size_t i = 0; i < ggml_backend_reg_dev_count(reg); i++) {
@@ -56,7 +56,7 @@ index 405d8e31..4e67d243 100644
}
ggml_backend_reg_t load_backend(const fs::path & path, bool silent) {
@@ -265,7 +270,7 @@ struct ggml_backend_registry {
@@ -270,7 +275,7 @@ struct ggml_backend_registry {
GGML_LOG_INFO("%s: loaded %s backend from %s\n", __func__, ggml_backend_reg_name(reg), path_str(path).c_str());
@@ -65,7 +65,7 @@ index 405d8e31..4e67d243 100644
return reg;
}
@@ -288,7 +293,7 @@ struct ggml_backend_registry {
@@ -293,7 +298,7 @@ struct ggml_backend_registry {
// remove devices
devices.erase(
std::remove_if(devices.begin(), devices.end(),
@@ -74,7 +74,7 @@ index 405d8e31..4e67d243 100644
devices.end());
// remove backend
@@ -346,7 +351,7 @@ size_t ggml_backend_dev_count() {
@@ -351,7 +356,7 @@ size_t ggml_backend_dev_count() {
ggml_backend_dev_t ggml_backend_dev_get(size_t index) {
GGML_ASSERT(index < ggml_backend_dev_count());

View File

@@ -8,22 +8,22 @@ Subject: [PATCH] add phony target ggml-cpu for all cpu variants
1 file changed, 2 insertions(+)
diff --git a/ggml/src/CMakeLists.txt b/ggml/src/CMakeLists.txt
index ddea5ad3..45918bf6 100644
index 177fb282..f5a5079a 100644
--- a/ggml/src/CMakeLists.txt
+++ b/ggml/src/CMakeLists.txt
@@ -279,6 +279,7 @@ function(ggml_add_cpu_backend_variant tag_name)
endforeach()
@@ -304,6 +304,7 @@ function(ggml_add_cpu_backend_variant tag_name)
endif()
ggml_add_cpu_backend_variant_impl(${tag_name})
+ add_dependencies(ggml-cpu ggml-cpu-${tag_name})
endfunction()
ggml_add_backend(CPU)
@@ -287,6 +288,7 @@ if (GGML_CPU_ALL_VARIANTS)
if (NOT GGML_BACKEND_DL)
message(FATAL_ERROR "GGML_CPU_ALL_VARIANTS requires GGML_BACKEND_DL")
@@ -314,6 +315,7 @@ if (GGML_CPU_ALL_VARIANTS)
elseif (GGML_CPU_ARM_ARCH)
message(FATAL_ERROR "Cannot use both GGML_CPU_ARM_ARCH and GGML_CPU_ALL_VARIANTS")
endif()
+ add_custom_target(ggml-cpu)
ggml_add_cpu_backend_variant(x64)
ggml_add_cpu_backend_variant(sse42 SSE42)
ggml_add_cpu_backend_variant(sandybridge SSE42 AVX)
if (GGML_SYSTEM_ARCH STREQUAL "x86")
ggml_add_cpu_backend_variant(x64)
ggml_add_cpu_backend_variant(sse42 SSE42)

View File

@@ -1,352 +0,0 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: jmorganca <jmorganca@gmail.com>
Date: Tue, 15 Apr 2025 14:27:40 -0400
Subject: [PATCH] ensure KV cache is fully defragmented
Sometimes the KV cache requires defragmentation even without
triggering the threshold heuristic. In this case, decoding
will not being able to find a KV cache slot. This is particularly
difficult for the caller to handle if it happens in between
ubatches. To avoid this, we should immediately trigger a defrag.
In addition, a heavily fragmented cache can require more than
max_moves to defragment. Currently, we stop when we hit the limit
but this can leave a cache that still does not have adequate space
even after defragmentation is triggered. Instead, we should do
multiple batches of processing until everything is complete.
---
src/llama-context.cpp | 18 ++++---
src/llama-context.h | 1 +
src/llama-kv-cache.cpp | 107 ++++++++++++++---------------------------
src/llama-kv-cache.h | 12 ++++-
4 files changed, 59 insertions(+), 79 deletions(-)
diff --git a/src/llama-context.cpp b/src/llama-context.cpp
index dca22d8b..1f3a3956 100644
--- a/src/llama-context.cpp
+++ b/src/llama-context.cpp
@@ -947,9 +947,12 @@ int llama_context::decode(llama_batch & inp_batch) {
// find KV slot
if (!kv_self->find_slot(ubatch)) {
- LLAMA_LOG_WARN("%s: failed to find KV cache slot for ubatch of size %d\n", __func__, ubatch.n_tokens);
-
- return 1;
+ kv_self->defrag_sched(-1.0f);
+ kv_self->update(*this);
+ if (!kv_self->find_slot(ubatch)) {
+ LLAMA_LOG_WARN("%s: failed to find KV cache slot for ubatch of size %d\n", __func__, ubatch.n_tokens);
+ return 1;
+ }
}
ggml_backend_sched_reset(sched.get());
@@ -1965,9 +1968,12 @@ void llama_context::opt_epoch_iter(
// TODO: not sure if this is needed
if (!kv_self->find_slot(ubatch)) {
- LLAMA_LOG_WARN("%s: failed to find KV cache slot for ubatch of size %d\n", __func__, ubatch.n_tokens);
-
- GGML_ABORT("TODO: handle this error");
+ kv_self->defrag_sched(-1.0f);
+ kv_self->update(*this);
+ if (!kv_self->find_slot(ubatch)) {
+ LLAMA_LOG_WARN("%s: failed to find KV cache slot for ubatch of size %d\n", __func__, ubatch.n_tokens);
+ GGML_ABORT("TODO: handle this error");
+ }
}
auto * gf = graph_init();
diff --git a/src/llama-context.h b/src/llama-context.h
index c0ceacb1..0264e937 100644
--- a/src/llama-context.h
+++ b/src/llama-context.h
@@ -5,6 +5,7 @@
#include "llama-cparams.h"
#include "llama-graph.h"
#include "llama-adapter.h"
+#include "llama-kv-cache.h"
#include "ggml-cpp.h"
#include "ggml-opt.h"
diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
index 3dcad65b..60e67b03 100644
--- a/src/llama-kv-cache.cpp
+++ b/src/llama-kv-cache.cpp
@@ -364,8 +364,6 @@ void llama_kv_cache_unified::commit() {
}
bool llama_kv_cache_unified::update(llama_context & lctx) {
- bool need_reserve = false;
-
auto * sched = lctx.get_sched();
if (has_shift) {
@@ -388,8 +386,6 @@ bool llama_kv_cache_unified::update(llama_context & lctx) {
res->set_inputs(nullptr);
lctx.graph_compute(gf, false);
-
- need_reserve = true;
}
{
@@ -403,27 +399,36 @@ bool llama_kv_cache_unified::update(llama_context & lctx) {
if (do_defrag) {
LLAMA_LOG_DEBUG("%s: defragmenting KV cache\n", __func__);
+ const uint32_t n_max_nodes = lctx.graph_max_nodes();
+ const uint32_t max_moves = (n_max_nodes - 2*model.hparams.n_layer)/(6*model.hparams.n_layer);
+ if (!defrag_prepare(n_max_nodes)) {
+ LLAMA_LOG_ERROR("%s: failed to prepare defragmentation\n", __func__);
+ return false;
+ }
+
+ for (std::size_t i = 0; i < defrag_info.moves.size(); i += max_moves) {
+ std::vector<struct llama_kv_defrag_move> chunk;
+ auto end = std::min(i + max_moves, defrag_info.moves.size());
+ chunk.assign(defrag_info.moves.begin() + i, defrag_info.moves.begin() + end);
- if (defrag_prepare(lctx.graph_max_nodes())) {
ggml_backend_sched_reset(sched);
auto * gf = lctx.graph_init();
- auto res = build_graph_defrag(lctx.get_cparams(), lctx.get_ctx_compute(), gf);
+ auto res = build_graph_defrag(lctx.get_cparams(), lctx.get_ctx_compute(), gf, chunk);
ggml_backend_sched_alloc_graph(sched, gf);
res->set_inputs(nullptr);
lctx.graph_compute(gf, false);
-
- need_reserve = true;
}
do_defrag = false;
}
- return need_reserve;
+ // we never need to reserve a worst case graph
+ return false;
}
void llama_kv_cache_unified::defrag_sched(float thold) {
@@ -707,11 +712,10 @@ llm_graph_result_ptr llama_kv_cache_unified::build_graph_shift(
llm_graph_result_ptr llama_kv_cache_unified::build_graph_defrag(
const llama_cparams & cparams,
ggml_context * ctx,
- ggml_cgraph * gf) const {
+ ggml_cgraph * gf,
+ const std::vector<struct llama_kv_defrag_move> & moves) const {
auto res = std::make_unique<llm_graph_result>();
- const auto & ids = defrag_info.ids;
-
#if 0
// CPU defrag
//
@@ -783,32 +787,20 @@ llm_graph_result_ptr llama_kv_cache_unified::build_graph_defrag(
ggml_backend_tensor_set(v_l[il], buf_v.data(), 0, buf_v.size());
}
#else
- for (uint32_t i = 0; i < ids.size(); ++i) {
- const uint32_t id = ids[i];
-
- if (i == id || id == ids.size()) {
- continue;
- }
-
- uint32_t nm = 1;
-
- while (i + nm < ids.size() && ids[i + nm] == id + nm) {
- nm++;
- }
-
+ for (const auto & move : moves) {
for (uint32_t il = 0; il < hparams.n_layer; ++il) { // NOLINT
const int64_t n_embd_k_gqa = hparams.n_embd_k_gqa(il);
const int64_t n_embd_v_gqa = hparams.n_embd_v_gqa(il);
ggml_tensor * view_k_src = ggml_view_2d(ctx, k_l[il],
- n_embd_k_gqa, nm,
+ n_embd_k_gqa, move.len,
ggml_row_size(k_l[il]->type, n_embd_k_gqa),
- ggml_row_size(k_l[il]->type, n_embd_k_gqa*i));
+ ggml_row_size(k_l[il]->type, n_embd_k_gqa*move.src));
ggml_tensor * view_k_dst = ggml_view_2d(ctx, k_l[il],
- n_embd_k_gqa, nm,
+ n_embd_k_gqa, move.len,
ggml_row_size(k_l[il]->type, n_embd_k_gqa),
- ggml_row_size(k_l[il]->type, n_embd_k_gqa*id));
+ ggml_row_size(k_l[il]->type, n_embd_k_gqa*move.dst));
ggml_tensor * view_v_src;
ggml_tensor * view_v_dst;
@@ -816,31 +808,29 @@ llm_graph_result_ptr llama_kv_cache_unified::build_graph_defrag(
if (cparams.flash_attn) {
// NOTE: the V cache is not transposed when using flash attention
view_v_src = ggml_view_2d(ctx, v_l[il],
- n_embd_v_gqa, nm,
+ n_embd_v_gqa, move.len,
ggml_row_size(v_l[il]->type, n_embd_v_gqa),
- ggml_row_size(v_l[il]->type, n_embd_v_gqa*i));
+ ggml_row_size(v_l[il]->type, n_embd_v_gqa*move.dst));
view_v_dst = ggml_view_2d(ctx, v_l[il],
- n_embd_v_gqa, nm,
+ move.len, n_embd_v_gqa,
ggml_row_size(v_l[il]->type, n_embd_v_gqa),
- ggml_row_size(v_l[il]->type, n_embd_v_gqa*id));
+ ggml_row_size(v_l[il]->type, move.src));
} else {
view_v_src = ggml_view_2d(ctx, v_l[il],
- nm, n_embd_v_gqa,
+ move.len, n_embd_v_gqa,
ggml_row_size(v_l[il]->type, size),
- ggml_row_size(v_l[il]->type, i));
+ ggml_row_size(v_l[il]->type, move.src));
view_v_dst = ggml_view_2d(ctx, v_l[il],
- nm, n_embd_v_gqa,
+ move.len, n_embd_v_gqa,
ggml_row_size(v_l[il]->type, size),
- ggml_row_size(v_l[il]->type, id));
+ ggml_row_size(v_l[il]->type, move.dst));
}
ggml_build_forward_expand(gf, ggml_cpy(ctx, view_k_src, view_k_dst));
ggml_build_forward_expand(gf, ggml_cpy(ctx, view_v_src, view_v_dst));
}
-
- i += nm - 1;
}
//LLAMA_LOG_INFO("gf->n_nodes = %d\n", gf->n_nodes);
@@ -857,17 +847,7 @@ bool llama_kv_cache_unified::defrag_prepare(int32_t n_max_nodes) {
assert(n_used <= n_kv);
- //const int64_t t_start = ggml_time_us();
-
- // number of cells moved
- uint32_t n_moves = 0;
-
- // each move requires 6*n_layer tensors (see graph_build_kv_self_defrag)
- // - source view, destination view, copy operation
- // - x2 for keys and values
- //const uint32_t max_moves = max_nodes()/(6*n_layer);
- // TODO: tmp fix https://github.com/ggerganov/llama.cpp/issues/6685#issuecomment-2057579516
- const uint32_t max_moves = (n_max_nodes - 2*n_layer)/(6*n_layer);
+ defrag_info.moves.clear();
// determine which KV cells to move where
//
@@ -875,10 +855,7 @@ bool llama_kv_cache_unified::defrag_prepare(int32_t n_max_nodes) {
//
// if ids[i] == i || ids[i] == n_kv, then cell i is not moved
//
- auto & ids = defrag_info.ids;
-
- ids.clear();
- ids.resize(n_kv, n_kv);
+ std::vector<uint32_t> ids(n_kv, n_kv);
for (uint32_t i0 = 0; i0 < n_used; ++i0) {
const auto & cell0 = cells[i0];
@@ -927,19 +904,11 @@ bool llama_kv_cache_unified::defrag_prepare(int32_t n_max_nodes) {
// are we moving a continuous block of memory?
bool cont = false;
- // should we stop searching for the next move?
- bool stop = false;
-
// go back and move the nf cells to the hole
for (; i1 < n_kv; ++i1) {
auto & cell1 = cells[i1];
if (cell1.is_empty() || ids[i1] != n_kv) {
- if (n_moves == max_moves) {
- stop = true;
- break;
- }
-
cont = false;
continue;
}
@@ -955,8 +924,10 @@ bool llama_kv_cache_unified::defrag_prepare(int32_t n_max_nodes) {
head = n_used;
if (!cont) {
- n_moves++;
+ defrag_info.moves.push_back({i1, i0 + nf, 1});
cont = true;
+ } else {
+ defrag_info.moves.back().len++;
}
nf++;
@@ -966,22 +937,16 @@ bool llama_kv_cache_unified::defrag_prepare(int32_t n_max_nodes) {
}
}
- if (stop || n_moves == max_moves) {
- break;
- }
-
//LLAMA_LOG_INFO("(tmp log) KV defrag: move [%u, %u) to [%u, %u)\n", is, i1 + 1, i0, i0 + nh);
i0 += nh - 1;
}
- if (n_moves == 0) {
+ if (defrag_info.moves.size() == 0) {
return false;
}
- LLAMA_LOG_DEBUG("%s: (tmp log) KV defrag cell moves: %u\n", __func__, n_moves);
-
- LLAMA_LOG_DEBUG("%s: expected gf nodes: %u\n", __func__, 6*n_moves*n_layer);
+ // LLAMA_LOG_DEBUG("(tmp log) KV defrag cell moves: %u\n", n_moves);
return true;
}
diff --git a/src/llama-kv-cache.h b/src/llama-kv-cache.h
index bf3b4b6a..928b9712 100644
--- a/src/llama-kv-cache.h
+++ b/src/llama-kv-cache.h
@@ -82,6 +82,13 @@ struct llama_kv_cache_guard {
private:
llama_kv_cache * kv;
};
+
+// block of KV slots to move when defragging
+struct llama_kv_defrag_move {
+ uint32_t src;
+ uint32_t dst;
+ uint32_t len;
+};
//
// llama_kv_cache_unified
@@ -207,7 +214,7 @@ private:
// defrag
struct {
- std::vector<uint32_t> ids;
+ std::vector<llama_kv_defrag_move> moves;
} defrag_info;
// return true if cells have been moved
@@ -249,7 +256,8 @@ private:
llm_graph_result_ptr build_graph_defrag(
const llama_cparams & cparams,
ggml_context * ctx,
- ggml_cgraph * gf) const;
+ ggml_cgraph * gf,
+ const std::vector<llama_kv_defrag_move> & moves) const;
void state_write_meta(llama_io_write_i & io, const std::vector<std::pair<uint32_t, uint32_t>> & cell_ranges, llama_seq_id seq_id = -1) const;
void state_write_data(llama_io_write_i & io, const std::vector<std::pair<uint32_t, uint32_t>> & cell_ranges) const;

View File

@@ -0,0 +1,25 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: jmorganca <jmorganca@gmail.com>
Date: Thu, 1 May 2025 15:05:08 -0700
Subject: [PATCH] remove amx
disable amx as it reduces performance on some systems
---
ggml/src/CMakeLists.txt | 4 ----
1 file changed, 4 deletions(-)
diff --git a/ggml/src/CMakeLists.txt b/ggml/src/CMakeLists.txt
index f5a5079a..5158acd6 100644
--- a/ggml/src/CMakeLists.txt
+++ b/ggml/src/CMakeLists.txt
@@ -324,10 +324,6 @@ if (GGML_CPU_ALL_VARIANTS)
ggml_add_cpu_backend_variant(skylakex SSE42 AVX F16C AVX2 BMI2 FMA AVX512)
ggml_add_cpu_backend_variant(icelake SSE42 AVX F16C AVX2 BMI2 FMA AVX512 AVX512_VBMI AVX512_VNNI)
ggml_add_cpu_backend_variant(alderlake SSE42 AVX F16C AVX2 BMI2 FMA AVX_VNNI)
- if (NOT MSVC)
- # MSVC doesn't support AMX
- ggml_add_cpu_backend_variant(sapphirerapids SSE42 AVX F16C AVX2 BMI2 FMA AVX512 AVX512_VBMI AVX512_VNNI AVX512_BF16 AMX_TILE AMX_INT8)
- endif()
elseif(GGML_SYSTEM_ARCH STREQUAL "ARM")
if (CMAKE_SYSTEM_NAME MATCHES "Linux")
# Many of these features are optional so we build versions with popular

View File

@@ -25,10 +25,10 @@ index 79ee2020..3efb22f0 100644
// get ith C string from array with given key_id
GGML_API const char * gguf_get_arr_str (const struct gguf_context * ctx, int64_t key_id, size_t i);
diff --git a/ggml/src/gguf.cpp b/ggml/src/gguf.cpp
index 381a9c7d..e45b453d 100644
index 53504399..0f71d5f3 100644
--- a/ggml/src/gguf.cpp
+++ b/ggml/src/gguf.cpp
@@ -777,10 +777,14 @@ enum gguf_type gguf_get_arr_type(const struct gguf_context * ctx, int64_t key_id
@@ -805,10 +805,14 @@ enum gguf_type gguf_get_arr_type(const struct gguf_context * ctx, int64_t key_id
const void * gguf_get_arr_data(const struct gguf_context * ctx, int64_t key_id) {
GGML_ASSERT(key_id >= 0 && key_id < gguf_get_n_kv(ctx));
@@ -44,7 +44,7 @@ index 381a9c7d..e45b453d 100644
const char * gguf_get_arr_str(const struct gguf_context * ctx, int64_t key_id, size_t i) {
GGML_ASSERT(key_id >= 0 && key_id < gguf_get_n_kv(ctx));
GGML_ASSERT(ctx->kv[key_id].get_type() == GGUF_TYPE_STRING);
@@ -874,7 +878,6 @@ const char * gguf_get_val_str(const struct gguf_context * ctx, int64_t key_id) {
@@ -902,7 +906,6 @@ const char * gguf_get_val_str(const struct gguf_context * ctx, int64_t key_id) {
const void * gguf_get_val_data(const struct gguf_context * ctx, int64_t key_id) {
GGML_ASSERT(key_id >= 0 && key_id < gguf_get_n_kv(ctx));
GGML_ASSERT(ctx->kv[key_id].get_ne() == 1);
@@ -53,10 +53,10 @@ index 381a9c7d..e45b453d 100644
}
diff --git a/src/llama-vocab.cpp b/src/llama-vocab.cpp
index 10f34d33..9f5fd57b 100644
index c011008f..fa388b03 100644
--- a/src/llama-vocab.cpp
+++ b/src/llama-vocab.cpp
@@ -1469,9 +1469,7 @@ void llama_vocab::impl::load(llama_model_loader & ml, const LLM_KV & kv) {
@@ -1760,9 +1760,7 @@ void llama_vocab::impl::load(llama_model_loader & ml, const LLM_KV & kv) {
const int precompiled_charsmap_keyidx = gguf_find_key(ctx, kv(LLM_KV_TOKENIZER_PRECOMPILED_CHARSMAP).c_str());
if (precompiled_charsmap_keyidx != -1) {
const gguf_type pc_type = gguf_get_arr_type(ctx, precompiled_charsmap_keyidx);

View File

@@ -8,7 +8,7 @@ Subject: [PATCH] ollama debug tensor
1 file changed, 6 insertions(+)
diff --git a/ggml/src/ggml-cpu/ggml-cpu.c b/ggml/src/ggml-cpu/ggml-cpu.c
index a30e67f2..2462d2b8 100644
index d89cd8f4..a5689c18 100644
--- a/ggml/src/ggml-cpu/ggml-cpu.c
+++ b/ggml/src/ggml-cpu/ggml-cpu.c
@@ -15,6 +15,8 @@
@@ -20,7 +20,7 @@ index a30e67f2..2462d2b8 100644
#if defined(_MSC_VER) || defined(__MINGW32__)
#include <malloc.h> // using malloc.h with MSC/MINGW
#elif !defined(__FreeBSD__) && !defined(__NetBSD__) && !defined(__OpenBSD__)
@@ -2841,6 +2843,10 @@ static thread_ret_t ggml_graph_compute_thread(void * data) {
@@ -2858,6 +2860,10 @@ static thread_ret_t ggml_graph_compute_thread(void * data) {
ggml_compute_forward(&params, node);

View File

@@ -1,25 +0,0 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: jmorganca <jmorganca@gmail.com>
Date: Thu, 1 May 2025 15:05:08 -0700
Subject: [PATCH] remove amx
disable amx as it reduces performance on some systems
---
ggml/src/CMakeLists.txt | 4 ----
1 file changed, 4 deletions(-)
diff --git a/ggml/src/CMakeLists.txt b/ggml/src/CMakeLists.txt
index 45918bf6..0beaed86 100644
--- a/ggml/src/CMakeLists.txt
+++ b/ggml/src/CMakeLists.txt
@@ -296,10 +296,6 @@ if (GGML_CPU_ALL_VARIANTS)
ggml_add_cpu_backend_variant(skylakex SSE42 AVX F16C AVX2 BMI2 FMA AVX512)
ggml_add_cpu_backend_variant(icelake SSE42 AVX F16C AVX2 BMI2 FMA AVX512 AVX512_VBMI AVX512_VNNI)
ggml_add_cpu_backend_variant(alderlake SSE42 AVX F16C AVX2 BMI2 FMA AVX_VNNI)
- if (NOT MSVC)
- # MSVC doesn't support AMX
- ggml_add_cpu_backend_variant(sapphirerapids SSE42 AVX F16C AVX2 BMI2 FMA AVX512 AVX512_VBMI AVX512_VNNI AVX512_BF16 AMX_TILE AMX_INT8)
- endif()
elseif (GGML_CPU)
ggml_add_cpu_backend_variant_impl("")
endif()

View File

@@ -10,7 +10,7 @@ Subject: [PATCH] add ollama vocab for grammar support
3 files changed, 58 insertions(+), 9 deletions(-)
diff --git a/src/llama-grammar.cpp b/src/llama-grammar.cpp
index 973b47ae..60d58236 100644
index bed706bb..b51cee09 100644
--- a/src/llama-grammar.cpp
+++ b/src/llama-grammar.cpp
@@ -907,6 +907,7 @@ llama_grammar_candidates llama_grammar_reject_candidates_for_stack(
@@ -90,7 +90,7 @@ index 973b47ae..60d58236 100644
if (grammar.awaiting_trigger) {
if (std::find(grammar.trigger_tokens.begin(), grammar.trigger_tokens.end(), token) != grammar.trigger_tokens.end()) {
@@ -1191,13 +1200,14 @@ void llama_grammar_accept_impl(struct llama_grammar & grammar, llama_token token
@@ -1201,13 +1210,14 @@ void llama_grammar_accept_impl(struct llama_grammar & grammar, llama_token token
}
}
@@ -107,7 +107,7 @@ index 973b47ae..60d58236 100644
}
llama_grammar_accept_str(grammar, piece);
@@ -1217,3 +1227,28 @@ void llama_grammar_accept_str(struct llama_grammar & grammar, const std::string
@@ -1227,3 +1237,28 @@ void llama_grammar_accept_str(struct llama_grammar & grammar, const std::string
throw std::runtime_error("Unexpected empty grammar stack after accepting piece: " + piece);
}
}
@@ -184,7 +184,7 @@ index f8c291de..2a3a62db 100644
const char * grammar_root,
bool lazy,
diff --git a/src/llama-sampling.cpp b/src/llama-sampling.cpp
index 804b11e0..15a10ca8 100644
index bfbf5fa2..11f93f42 100644
--- a/src/llama-sampling.cpp
+++ b/src/llama-sampling.cpp
@@ -1466,7 +1466,7 @@ static void llama_sampler_grammar_reset(struct llama_sampler * smpl) {

View File

@@ -4,16 +4,17 @@ Date: Thu, 1 May 2025 13:45:12 -0700
Subject: [PATCH] add argsort and cuda copy for i32
---
ggml/src/ggml-cpu/ops.cpp | 43 ++++++++++++++
ggml/src/ggml-cuda/argsort.cu | 102 +++++++++++++++++++++++++++++++++-
ggml/src/ggml-cuda/cpy.cu | 49 ++++++++++++++++
3 files changed, 192 insertions(+), 2 deletions(-)
ggml/src/ggml-cpu/ops.cpp | 43 +++++++++++++
ggml/src/ggml-cuda/argsort.cu | 102 ++++++++++++++++++++++++++++++-
ggml/src/ggml-cuda/cpy-utils.cuh | 6 ++
ggml/src/ggml-cuda/cpy.cu | 43 +++++++++++++
4 files changed, 192 insertions(+), 2 deletions(-)
diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
index 955fec59..654e2f28 100644
index 854f1c2b..a2924757 100644
--- a/ggml/src/ggml-cpu/ops.cpp
+++ b/ggml/src/ggml-cpu/ops.cpp
@@ -6822,6 +6822,45 @@ static void ggml_compute_forward_argsort_f32(
@@ -8146,6 +8146,45 @@ static void ggml_compute_forward_argsort_f32(
}
}
@@ -59,7 +60,7 @@ index 955fec59..654e2f28 100644
void ggml_compute_forward_argsort(
const ggml_compute_params * params,
ggml_tensor * dst) {
@@ -6833,6 +6872,10 @@ void ggml_compute_forward_argsort(
@@ -8157,6 +8196,10 @@ void ggml_compute_forward_argsort(
{
ggml_compute_forward_argsort_f32(params, dst);
} break;
@@ -194,84 +195,78 @@ index 607ded85..53b02634 100644
+ argsort_f32_i32_cuda(src0_d, (int *)dst_d, ncols, nrows, order, stream);
+ }
}
diff --git a/ggml/src/ggml-cuda/cpy-utils.cuh b/ggml/src/ggml-cuda/cpy-utils.cuh
index 410c12b7..b8e9e107 100644
--- a/ggml/src/ggml-cuda/cpy-utils.cuh
+++ b/ggml/src/ggml-cuda/cpy-utils.cuh
@@ -223,3 +223,9 @@ template<typename src_t, typename dst_t>
static __device__ void cpy_1_flt(const char * cxi, char * cdsti) {
convert_flt((const src_t *)cxi, (dst_t *)cdsti);
}
+
+static __device__ void cpy_1_i32_i32(const char * cxi, char * cdsti) {
+ const int32_t * src = (const int32_t *)cxi;
+ int32_t * dst = (int32_t *)cdsti;
+ *dst = *src;
+}
diff --git a/ggml/src/ggml-cuda/cpy.cu b/ggml/src/ggml-cuda/cpy.cu
index d027271f..4abd01d7 100644
index f9bb0256..9c3774e5 100644
--- a/ggml/src/ggml-cuda/cpy.cu
+++ b/ggml/src/ggml-cuda/cpy.cu
@@ -38,6 +38,13 @@ static __device__ void cpy_1_f16_f32(const char * cxi, char * cdsti) {
*dsti = *xi;
@@ -278,6 +278,47 @@ static void ggml_cpy_f32_iq4_nl_cuda(
(cx, cdst, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13, cdst_indirect, graph_cpynode_index++);
}
+static __device__ void cpy_1_i32_i32(const char * cxi, char * cdsti) {
+ const int32_t * xi = (const int32_t *) cxi;
+ int32_t * dsti = (int32_t *) cdsti;
+
+ *dsti = *xi;
+}
+
template <cpy_kernel_t cpy_1>
static __global__ void cpy_f32_f16(const char * cx, char * cdst_direct, const int ne,
const int ne00, const int ne01, const int ne02, const int nb00, const int nb01, const int nb02,
@@ -68,6 +75,44 @@ static __global__ void cpy_f32_f16(const char * cx, char * cdst_direct, const in
cpy_1(cx + x_offset, cdst + dst_offset);
}
+// First, add this template function after the other template functions
+template <cpy_kernel_t cpy_1>
+static __global__ void cpy_i32_i32(const char * cx, char * cdst, const int ne,
+    const int ne00, const int ne01, const int ne02, const int nb00, const int nb01, const int nb02, const int nb03,
+    const int ne10, const int ne11, const int ne12, const int nb10, const int nb11, const int nb12, const int nb13,
+    char ** cdst_indirect, int graph_cpynode_index) {
+    const int64_t i = blockDim.x*blockIdx.x + threadIdx.x;
+
+    if (i >= ne) {
+        return;
+    }
+
+    const int64_t i03 = i/(ne00 * ne01 * ne02);
+    const int64_t i02 = (i - i03*ne00*ne01*ne02) / (ne00*ne01);
+    const int64_t i01 = (i - i03*ne00*ne01*ne02 - i02*ne01*ne00) / ne00;
+    const int64_t i00 = i - i03*ne00*ne01*ne02 - i02*ne01*ne00 - i01*ne00;
+    const int64_t x_offset = i00*nb00 + i01*nb01 + i02*nb02 + i03*nb03;
+
+    const int64_t i13 = i/(ne10 * ne11 * ne12);
+    const int64_t i12 = (i - i13*ne10*ne11*ne12) / (ne10*ne11);
+    const int64_t i11 = (i - i13*ne10*ne11*ne12 - i12*ne10*ne11) / ne10;
+    const int64_t i10 = i - i13*ne10*ne11*ne12 - i12*ne10*ne11 - i11*ne10;
+    const int64_t dst_offset = i10*nb10 + i11*nb11 + i12*nb12 + i13*nb13;
+
+    // When CUDA graphs are in use, the destination is looked up in the per-graph
+    // indirection table instead of being baked into the kernel arguments.
+    char * cdst_ptr = (cdst_indirect != nullptr) ? cdst_indirect[graph_cpynode_index] : cdst;
+    cpy_1(cx + x_offset, cdst_ptr + dst_offset);
+}
+
+// Then modify the ggml_cpy_i32_i32_cuda function to use the new template
+
+static void ggml_cpy_i32_i32_cuda(
+    const char * cx, char * cdst, const int ne,
+    const int ne00, const int ne01, const int ne02, const int nb00, const int nb01, const int nb02, const int nb03,
+    const int ne10, const int ne11, const int ne12, const int nb10, const int nb11, const int nb12, const int nb13,
+    cudaStream_t stream, char ** cdst_indirect, int graph_cpynode_index) {
+
+    const int num_blocks = (ne + CUDA_CPY_BLOCK_SIZE - 1) / CUDA_CPY_BLOCK_SIZE;
+    cpy_i32_i32<cpy_1_i32_i32><<<num_blocks, CUDA_CPY_BLOCK_SIZE, 0, stream>>>
+        (cx, cdst, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13, cdst_indirect, graph_cpynode_index);
+}
+
static __device__ void cpy_blck_f32_q8_0(const char * cxi, char * cdsti) {
const float * xi = (const float *) cxi;
block_q8_0 * dsti = (block_q8_0 *) cdsti;
@@ -633,6 +678,8 @@ void ggml_cuda_cpy(ggml_backend_cuda_context & ctx, const ggml_tensor * src0, gg
ggml_cpy_f16_f16_cuda (src0_ddc, src1_ddc, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13, main_stream, dest_ptrs_d, graph_cpynode_index);
void ggml_cuda_cpy(ggml_backend_cuda_context & ctx, const ggml_tensor * src0, ggml_tensor * src1, bool disable_indirection_for_this_node) {
const int64_t ne = ggml_nelements(src0);
GGML_ASSERT(ne == ggml_nelements(src1));
@@ -369,6 +410,8 @@ void ggml_cuda_cpy(ggml_backend_cuda_context & ctx, const ggml_tensor * src0, gg
ggml_cpy_flt_cuda<half, nv_bfloat16> (src0_ddc, src1_ddc, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13, main_stream, dest_ptrs_d, graph_cpynode_index);
} else if (src0->type == GGML_TYPE_F16 && src1->type == GGML_TYPE_F32) {
ggml_cpy_f16_f32_cuda (src0_ddc, src1_ddc, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13, main_stream, dest_ptrs_d, graph_cpynode_index);
ggml_cpy_flt_cuda<half, float> (src0_ddc, src1_ddc, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13, main_stream, dest_ptrs_d, graph_cpynode_index);
+ } else if (src0->type == GGML_TYPE_I32 && src1->type == GGML_TYPE_I32) {
+ ggml_cpy_i32_i32_cuda(src0_ddc, src1_ddc, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13, main_stream, dest_ptrs_d, graph_cpynode_index);
} else {
GGML_ABORT("%s: unsupported type combination (%s to %s)\n", __func__,
ggml_type_name(src0->type), ggml_type_name(src1->type));
@@ -688,6 +735,8 @@ void* ggml_cuda_cpy_fn(const ggml_tensor * src0, ggml_tensor * src1) {
return (void*) cpy_f32_f16<cpy_1_f32_f16>;
} else if (src0->type == GGML_TYPE_F16 && src1->type == GGML_TYPE_F32) {
return (void*) cpy_f32_f16<cpy_1_f16_f32>;
+ } else if (src0->type == GGML_TYPE_I32 && src1->type == GGML_TYPE_I32) {
+ return (void*) cpy_i32_i32<cpy_1_i32_i32>;
} else {
GGML_ABORT("%s: unsupported type combination (%s to %s)\n", __func__,
ggml_type_name(src0->type), ggml_type_name(src1->type));
} else if (src0->type == GGML_TYPE_BF16 && src1->type == GGML_TYPE_BF16) {
ggml_cpy_flt_cuda<nv_bfloat16, nv_bfloat16> (src0_ddc, src1_ddc, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13, main_stream, dest_ptrs_d, graph_cpynode_index);
} else if (src0->type == GGML_TYPE_BF16 && src1->type == GGML_TYPE_F16) {

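For readers unfamiliar with the cdst_indirect/graph_cpynode_index pair that now runs through every copy helper above: when the backend records a CUDA graph, the copy kernels cannot bake a destination pointer into their arguments, so each recorded copy node reads its destination from a per-graph table that the host refreshes between launches. A minimal host-side C++ sketch of that idea, using hypothetical names (copy_dest_table, resolve) rather than the real ggml-cuda structures:

#include <vector>

// Per-graph table of destination pointers for recorded copy nodes. Before a
// captured graph is relaunched, the host rewrites these slots instead of
// re-recording the kernels; each copy node indexes its own slot.
struct copy_dest_table {
    std::vector<char *> slots;

    char * resolve(char * direct_dest, int node_index) const {
        // No table means indirection is disabled; fall back to the direct pointer.
        return slots.empty() ? direct_dest : slots[node_index];
    }
};

This mirrors the (cdst_indirect != nullptr) ? cdst_indirect[graph_cpynode_index] : cdst check inside the kernels.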

@@ -28,7 +28,7 @@ index 2cb150fd..781b1e10 100644
// Create a buffer and allocate all the tensors in a ggml_context
GGML_API struct ggml_backend_buffer * ggml_backend_alloc_ctx_tensors_from_buft(struct ggml_context * ctx, ggml_backend_buffer_type_t buft);
diff --git a/ggml/include/ggml-backend.h b/ggml/include/ggml-backend.h
index 778927f6..74e46716 100644
index a2977ea2..8a91b381 100644
--- a/ggml/include/ggml-backend.h
+++ b/ggml/include/ggml-backend.h
@@ -304,6 +304,12 @@ extern "C" {
@@ -45,10 +45,10 @@ index 778927f6..74e46716 100644
GGML_API ggml_backend_t ggml_backend_sched_get_tensor_backend(ggml_backend_sched_t sched, struct ggml_tensor * node);
diff --git a/ggml/src/ggml-alloc.c b/ggml/src/ggml-alloc.c
index 5fd379f6..04812990 100644
index 8b6e6028..41c8c4a2 100644
--- a/ggml/src/ggml-alloc.c
+++ b/ggml/src/ggml-alloc.c
@@ -364,6 +364,7 @@ struct node_alloc {
@@ -350,6 +350,7 @@ struct node_alloc {
struct ggml_gallocr {
ggml_backend_buffer_type_t * bufts; // [n_buffers]
ggml_backend_buffer_t * buffers; // [n_buffers]
@@ -56,7 +56,7 @@ index 5fd379f6..04812990 100644
struct ggml_dyn_tallocr ** buf_tallocs; // [n_buffers]
int n_buffers;
@@ -387,6 +388,9 @@ ggml_gallocr_t ggml_gallocr_new_n(ggml_backend_buffer_type_t * bufts, int n_bufs
@@ -373,6 +374,9 @@ ggml_gallocr_t ggml_gallocr_new_n(ggml_backend_buffer_type_t * bufts, int n_bufs
galloc->buffers = calloc(n_bufs, sizeof(ggml_backend_buffer_t));
GGML_ASSERT(galloc->buffers != NULL);
@@ -66,7 +66,7 @@ index 5fd379f6..04812990 100644
galloc->buf_tallocs = calloc(n_bufs, sizeof(struct ggml_dyn_tallocr *));
GGML_ASSERT(galloc->buf_tallocs != NULL);
@@ -453,6 +457,7 @@ void ggml_gallocr_free(ggml_gallocr_t galloc) {
@@ -439,6 +443,7 @@ void ggml_gallocr_free(ggml_gallocr_t galloc) {
ggml_hash_set_free(&galloc->hash_set);
free(galloc->hash_values);
free(galloc->bufts);
@@ -74,7 +74,7 @@ index 5fd379f6..04812990 100644
free(galloc->buffers);
free(galloc->buf_tallocs);
free(galloc->node_allocs);
@@ -748,6 +753,8 @@ bool ggml_gallocr_reserve_n(ggml_gallocr_t galloc, struct ggml_cgraph * graph, c
@@ -734,6 +739,8 @@ bool ggml_gallocr_reserve_n(ggml_gallocr_t galloc, struct ggml_cgraph * graph, c
}
}
@@ -83,7 +83,7 @@ index 5fd379f6..04812990 100644
// reallocate buffers if needed
for (int i = 0; i < galloc->n_buffers; i++) {
// if the buffer type is used multiple times, we reuse the same buffer
@@ -769,15 +776,20 @@ bool ggml_gallocr_reserve_n(ggml_gallocr_t galloc, struct ggml_cgraph * graph, c
@@ -755,15 +762,20 @@ bool ggml_gallocr_reserve_n(ggml_gallocr_t galloc, struct ggml_cgraph * graph, c
ggml_backend_buffer_free(galloc->buffers[i]);
galloc->buffers[i] = ggml_backend_buft_alloc_buffer(galloc->bufts[i], new_size);
@@ -108,7 +108,7 @@ index 5fd379f6..04812990 100644
}
bool ggml_gallocr_reserve(ggml_gallocr_t galloc, struct ggml_cgraph *graph) {
@@ -934,6 +946,24 @@ size_t ggml_gallocr_get_buffer_size(ggml_gallocr_t galloc, int buffer_id) {
@@ -920,6 +932,24 @@ size_t ggml_gallocr_get_buffer_size(ggml_gallocr_t galloc, int buffer_id) {
return ggml_backend_buffer_get_size(galloc->buffers[buffer_id]);
}
@@ -134,10 +134,10 @@ index 5fd379f6..04812990 100644
static void free_buffers(ggml_backend_buffer_t ** buffers, const size_t * n_buffers) {
diff --git a/ggml/src/ggml-backend.cpp b/ggml/src/ggml-backend.cpp
index 0ce73a99..be335e8c 100644
index 97f47abd..eded0291 100644
--- a/ggml/src/ggml-backend.cpp
+++ b/ggml/src/ggml-backend.cpp
@@ -1629,6 +1629,16 @@ size_t ggml_backend_sched_get_buffer_size(ggml_backend_sched_t sched, ggml_backe
@@ -1631,6 +1631,16 @@ size_t ggml_backend_sched_get_buffer_size(ggml_backend_sched_t sched, ggml_backe
return ggml_gallocr_get_buffer_size(sched->galloc, backend_index);
}

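The allocator bookkeeping added above ultimately surfaces through ggml_backend_sched_get_buffer_size, visible in the ggml-backend.cpp hunk. A rough usage sketch, assuming a scheduler and its backends already exist; the logging helper itself is illustrative, not part of the patch:

#include <cstdio>
#include "ggml-backend.h"

// Report how much buffer memory the scheduler reserved for each backend.
static void log_sched_buffer_sizes(ggml_backend_sched_t sched, ggml_backend_t * backends, int n_backends) {
    for (int i = 0; i < n_backends; i++) {
        size_t size = ggml_backend_sched_get_buffer_size(sched, backends[i]);
        fprintf(stderr, "%s: scheduler buffer size = %zu bytes\n", ggml_backend_name(backends[i]), size);
    }
}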

@@ -12,7 +12,7 @@ with tools (e.g. nvidia-smi) and system management libraries (e.g. nvml).
3 files changed, 63 insertions(+), 6 deletions(-)
diff --git a/ggml/include/ggml-backend.h b/ggml/include/ggml-backend.h
index 74e467163..48839339d 100644
index 8a91b381..9424394e 100644
--- a/ggml/include/ggml-backend.h
+++ b/ggml/include/ggml-backend.h
@@ -152,6 +152,7 @@ extern "C" {
@@ -24,17 +24,17 @@ index 74e467163..48839339d 100644
size_t memory_total;
enum ggml_backend_dev_type type;
diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index cb0d8528d..1492368de 100644
index 37ee2a6d..57eae461 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -173,6 +173,51 @@ static int ggml_cuda_parse_id(char devName[]) {
@@ -179,6 +179,51 @@ static int ggml_cuda_parse_id(char devName[]) {
}
#endif // defined(GGML_USE_HIP) && defined(__HIP_PLATFORM_AMD__)
#endif // defined(GGML_USE_HIP)
+static std::string ggml_cuda_parse_uuid(cudaDeviceProp prop, int device_num) {
+ char id[64];
+
+ #if !defined(GGML_USE_HIP)
+#if !defined(GGML_USE_HIP)
+ snprintf(id, sizeof(id),
+ "GPU-%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-%02x%02x%02x%02x%02x%02x",
+ (unsigned char)prop.uuid.bytes[0],
@@ -54,10 +54,10 @@ index cb0d8528d..1492368de 100644
+ (unsigned char)prop.uuid.bytes[14],
+ (unsigned char)prop.uuid.bytes[15]
+ );
+ #else
+ #ifdef _WIN32
+#else
+#ifdef _WIN32
+ snprintf(id, sizeof(id), "%d", device_num);
+ #else
+#else
+ try {
+ std::string uuid = std::string(prop.uuid.bytes, 16);
+
@@ -70,16 +70,16 @@ index cb0d8528d..1492368de 100644
+ } catch (const std::exception &e) {
+ snprintf(id, sizeof(id), "%d", device_num);
+ }
+ #endif
+ #endif
+#endif
+#endif
+
+ return id;
+}
+
static ggml_cuda_device_info ggml_cuda_init() {
#ifdef __HIP_PLATFORM_AMD__
#if defined(GGML_USE_HIP)
// Workaround for a rocBLAS bug when using multiple graphics cards:
@@ -261,22 +306,24 @@ static ggml_cuda_device_info ggml_cuda_init() {
@@ -267,22 +312,24 @@ static ggml_cuda_device_info ggml_cuda_init() {
info.devices[id].cc += prop.minor * 0x10;
}
}
@@ -107,10 +107,10 @@ index cb0d8528d..1492368de 100644
+ GGML_LOG_INFO(" Device %d: %s, compute capability %d.%d, VMM: %s, ID: %s\n",
+ id, prop.name, prop.major, prop.minor, device_vmm ? "yes" : "no",
+ ggml_cuda_parse_uuid(prop, id).c_str());
#endif // defined(GGML_USE_HIP) && defined(__HIP_PLATFORM_AMD__)
#endif // defined(GGML_USE_HIP)
}
@@ -2884,6 +2931,7 @@ struct ggml_backend_cuda_device_context {
@@ -3144,6 +3191,7 @@ struct ggml_backend_cuda_device_context {
int device;
std::string name;
std::string description;
@@ -118,7 +118,7 @@ index cb0d8528d..1492368de 100644
};
static const char * ggml_backend_cuda_device_get_name(ggml_backend_dev_t dev) {
@@ -2896,6 +2944,11 @@ static const char * ggml_backend_cuda_device_get_description(ggml_backend_dev_t
@@ -3156,6 +3204,11 @@ static const char * ggml_backend_cuda_device_get_description(ggml_backend_dev_t
return ctx->description.c_str();
}
@@ -130,7 +130,7 @@ index cb0d8528d..1492368de 100644
static void ggml_backend_cuda_device_get_memory(ggml_backend_dev_t dev, size_t * free, size_t * total) {
ggml_backend_cuda_device_context * ctx = (ggml_backend_cuda_device_context *)dev->context;
ggml_cuda_set_device(ctx->device);
@@ -2910,6 +2963,7 @@ static enum ggml_backend_dev_type ggml_backend_cuda_device_get_type(ggml_backend
@@ -3170,6 +3223,7 @@ static enum ggml_backend_dev_type ggml_backend_cuda_device_get_type(ggml_backend
static void ggml_backend_cuda_device_get_props(ggml_backend_dev_t dev, ggml_backend_dev_props * props) {
props->name = ggml_backend_cuda_device_get_name(dev);
props->description = ggml_backend_cuda_device_get_description(dev);
@@ -138,7 +138,7 @@ index cb0d8528d..1492368de 100644
props->type = ggml_backend_cuda_device_get_type(dev);
ggml_backend_cuda_device_get_memory(dev, &props->memory_free, &props->memory_total);
@@ -3457,6 +3511,7 @@ ggml_backend_reg_t ggml_backend_cuda_reg() {
@@ -3767,6 +3821,7 @@ ggml_backend_reg_t ggml_backend_cuda_reg() {
cudaDeviceProp prop;
CUDA_CHECK(cudaGetDeviceProperties(&prop, i));
dev_ctx->description = prop.name;
@@ -147,10 +147,10 @@ index cb0d8528d..1492368de 100644
ggml_backend_dev_t dev = new ggml_backend_device {
/* .iface = */ ggml_backend_cuda_device_interface,
diff --git a/ggml/src/ggml-metal/ggml-metal.m b/ggml/src/ggml-metal/ggml-metal.m
index 1b56f858c..a9eeebc6a 100644
index 7bccc7bf..fe7b2f0a 100644
--- a/ggml/src/ggml-metal/ggml-metal.m
+++ b/ggml/src/ggml-metal/ggml-metal.m
@@ -5703,6 +5703,7 @@ static enum ggml_backend_dev_type ggml_backend_metal_device_get_type(ggml_backen
@@ -6522,6 +6522,7 @@ static enum ggml_backend_dev_type ggml_backend_metal_device_get_type(ggml_backen
static void ggml_backend_metal_device_get_props(ggml_backend_dev_t dev, struct ggml_backend_dev_props * props) {
props->name = ggml_backend_metal_device_get_name(dev);
props->description = ggml_backend_metal_device_get_description(dev);

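ggml_cuda_parse_uuid above renders the 16 raw bytes of cudaDeviceProp::uuid in the GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx form that nvidia-smi and NVML report, falling back to the device index where no UUID is available. The formatting step on its own, as a standalone C++ sketch (the function name is illustrative):

#include <cstdio>
#include <string>

// Format 16 raw UUID bytes the way nvidia-smi prints device UUIDs.
static std::string format_gpu_uuid(const unsigned char bytes[16]) {
    char id[64];
    std::snprintf(id, sizeof(id),
        "GPU-%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-%02x%02x%02x%02x%02x%02x",
        bytes[0],  bytes[1],  bytes[2],  bytes[3],
        bytes[4],  bytes[5],  bytes[6],  bytes[7],
        bytes[8],  bytes[9],  bytes[10], bytes[11],
        bytes[12], bytes[13], bytes[14], bytes[15]);
    return id;
}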

@@ -8,10 +8,10 @@ Subject: [PATCH] temporary prevent rocm+cuda mixed loading
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/ggml/src/ggml-backend-reg.cpp b/ggml/src/ggml-backend-reg.cpp
index 4e67d243..8f49f084 100644
index 3040b2aa..f1e9c180 100644
--- a/ggml/src/ggml-backend-reg.cpp
+++ b/ggml/src/ggml-backend-reg.cpp
@@ -573,8 +573,16 @@ void ggml_backend_load_all_from_path(const char * dir_path) {
@@ -581,8 +581,16 @@ void ggml_backend_load_all_from_path(const char * dir_path) {
ggml_backend_load_best("blas", silent, dir_path);
ggml_backend_load_best("cann", silent, dir_path);
@@ -20,13 +20,13 @@ index 4e67d243..8f49f084 100644
+
+ // Avoid mixed hip+cuda configurations
+ const char * hip_devices = std::getenv("HIP_VISIBLE_DEVICES");
+ const char * rocr_devices = std::getenv("ROCR_VISIBLE_DEVICES");
+ const char * rocr_devices = std::getenv("ROCR_VISIBLE_DEVICES");
+ if (!hip_devices && !rocr_devices) {
+ ggml_backend_load_best("cuda", silent, dir_path);
+ } else {
+ ggml_backend_load_best("hip", silent, dir_path);
+ }
+
ggml_backend_load_best("kompute", silent, dir_path);
+
ggml_backend_load_best("metal", silent, dir_path);
ggml_backend_load_best("rpc", silent, dir_path);
ggml_backend_load_best("sycl", silent, dir_path);

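The intent of the hunk above is to never load the CUDA and HIP backends into the same process: if the user has scoped devices with HIP_VISIBLE_DEVICES or ROCR_VISIBLE_DEVICES, only the HIP backend is tried, otherwise only CUDA. The same decision in a standalone C++ sketch, where load_backend stands in for ggml_backend_load_best:

#include <cstdlib>
#include <functional>
#include <string>

// Pick one GPU backend family based on the AMD device-visibility env vars.
static void load_gpu_backend(const std::function<void(const std::string &)> & load_backend) {
    const char * hip_devices  = std::getenv("HIP_VISIBLE_DEVICES");
    const char * rocr_devices = std::getenv("ROCR_VISIBLE_DEVICES");
    if (!hip_devices && !rocr_devices) {
        load_backend("cuda");
    } else {
        load_backend("hip");
    }
}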

@@ -0,0 +1,46 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Gabe Goodhart <ghart@us.ibm.com>
Date: Tue, 24 Jun 2025 16:55:31 -0600
Subject: [PATCH] add C API for mtmd_input_text
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---
tools/mtmd/mtmd.cpp | 10 ++++++++++
tools/mtmd/mtmd.h | 3 +++
2 files changed, 13 insertions(+)
diff --git a/tools/mtmd/mtmd.cpp b/tools/mtmd/mtmd.cpp
index a05373d5..6f70f7f4 100644
--- a/tools/mtmd/mtmd.cpp
+++ b/tools/mtmd/mtmd.cpp
@@ -79,6 +79,16 @@ enum mtmd_slice_tmpl {
// TODO @ngxson : add support for idefics (SmolVLM)
};
+mtmd_input_text* mtmd_input_text_init(const char * text, bool add_special, bool parse_special) {
+ return new mtmd_input_text{text, add_special, parse_special};
+}
+
+void mtmd_input_text_free(mtmd_input_text* input_text) {
+ if (input_text) {
+ delete input_text;
+ }
+}
+
const char * mtmd_default_marker() {
return "<__media__>";
}
diff --git a/tools/mtmd/mtmd.h b/tools/mtmd/mtmd.h
index f4ea07d3..cf287224 100644
--- a/tools/mtmd/mtmd.h
+++ b/tools/mtmd/mtmd.h
@@ -75,6 +75,9 @@ typedef struct mtmd_input_chunk mtmd_input_chunk;
typedef struct mtmd_input_chunks mtmd_input_chunks;
typedef struct mtmd_input_text mtmd_input_text;
+MTMD_API mtmd_input_text* mtmd_input_text_init(const char * text, bool add_special, bool parse_special);
+MTMD_API void mtmd_input_text_free(mtmd_input_text* input_text);
+
struct mtmd_context_params {
bool use_gpu;
bool print_timings;

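A quick C++ usage sketch for the two functions this patch exports; the call into the tokenizer is left as a comment since its exact signature is not shown here:

#include "mtmd.h"

// Allocate an mtmd_input_text on the heap so it can be passed across the C/CGO
// boundary, then release it once tokenization is done.
static void tokenize_prompt_example(const char * prompt) {
    mtmd_input_text * text = mtmd_input_text_init(prompt, /*add_special=*/true, /*parse_special=*/true);

    // ... hand `text` to mtmd_tokenize(...) along with any image or audio bitmaps ...

    mtmd_input_text_free(text);
}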

@@ -0,0 +1,22 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Gabe Goodhart <ghart@us.ibm.com>
Date: Fri, 11 Jul 2025 15:59:19 -0600
Subject: [PATCH] no power throttling win32 with gnuc
---
ggml/src/ggml-cpu/ggml-cpu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/ggml/src/ggml-cpu/ggml-cpu.c b/ggml/src/ggml-cpu/ggml-cpu.c
index a5689c18..85af19a3 100644
--- a/ggml/src/ggml-cpu/ggml-cpu.c
+++ b/ggml/src/ggml-cpu/ggml-cpu.c
@@ -2412,7 +2412,7 @@ static bool ggml_thread_apply_priority(int32_t prio) {
// Newer Windows 11 versions aggressively park (offline) CPU cores and often place
// all our threads onto the first 4 cores which results in terrible performance with
// n_threads > 4
- #if _WIN32_WINNT >= 0x0602
+ #if (_WIN32_WINNT >= 0x0602) && !defined(__GNUC__)
THREAD_POWER_THROTTLING_STATE t;
ZeroMemory(&t, sizeof(t));
t.Version = THREAD_POWER_THROTTLING_CURRENT_VERSION;

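The guard change above keeps the power-throttling call on the MSVC-only path, since MinGW/GCC Windows toolchains may not ship the THREAD_POWER_THROTTLING_* declarations. A reduced C++ sketch of the resulting pattern, not the actual ggml function:

#if defined(_WIN32) && !defined(__GNUC__)
#include <windows.h>
#endif

// Opt the current thread out of Windows power throttling where the API is
// available; compiles to a no-op on GCC/MinGW and non-Windows targets.
static bool thread_disable_power_throttling(void) {
#if defined(_WIN32) && (_WIN32_WINNT >= 0x0602) && !defined(__GNUC__)
    THREAD_POWER_THROTTLING_STATE t;
    ZeroMemory(&t, sizeof(t));
    t.Version     = THREAD_POWER_THROTTLING_CURRENT_VERSION;
    t.ControlMask = THREAD_POWER_THROTTLING_EXECUTION_SPEED;
    t.StateMask   = 0; // 0 = do not throttle this thread
    return SetThreadInformation(GetCurrentThread(), ThreadPowerThrottling, &t, sizeof(t)) != 0;
#else
    return true;
#endif
}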

@@ -9,10 +9,10 @@ Only enable BF16 on supported MacOS versions (v14+)
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/ggml/src/ggml-metal/ggml-metal.m b/ggml/src/ggml-metal/ggml-metal.m
index 110c9ece..ab46f6e3 100644
index fe7b2f0a..e4c31268 100644
--- a/ggml/src/ggml-metal/ggml-metal.m
+++ b/ggml/src/ggml-metal/ggml-metal.m
@@ -89,7 +89,11 @@ static id<MTLDevice> ggml_backend_metal_device_acq(struct ggml_backend_metal_dev
@@ -106,7 +106,11 @@ static id<MTLDevice> ggml_backend_metal_device_acq(struct ggml_backend_metal_dev
ctx->has_bfloat |= [ctx->mtl_device supportsFamily:MTLGPUFamilyApple6];
#if defined(GGML_METAL_USE_BF16)


@@ -1,169 +0,0 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Georgi Gerganov <ggerganov@gmail.com>
Date: Thu, 19 Jun 2025 08:05:21 +0300
Subject: [PATCH] metal : add mean kernel (#14267)
* metal : add mean kernel
ggml-ci
* cont : dedup implementation
ggml-ci
---
ggml/src/ggml-metal/ggml-metal.m | 33 ++++++++++++++++---
ggml/src/ggml-metal/ggml-metal.metal | 48 ++++++++++++++++++++++------
2 files changed, 67 insertions(+), 14 deletions(-)
diff --git a/ggml/src/ggml-metal/ggml-metal.m b/ggml/src/ggml-metal/ggml-metal.m
index a9eeebc6..110c9ece 100644
--- a/ggml/src/ggml-metal/ggml-metal.m
+++ b/ggml/src/ggml-metal/ggml-metal.m
@@ -489,6 +489,7 @@ enum ggml_metal_kernel_type {
GGML_METAL_KERNEL_TYPE_COS,
GGML_METAL_KERNEL_TYPE_NEG,
GGML_METAL_KERNEL_TYPE_SUM_ROWS,
+ GGML_METAL_KERNEL_TYPE_MEAN,
GGML_METAL_KERNEL_TYPE_POOL_2D_AVG_F32,
GGML_METAL_KERNEL_TYPE_POOL_2D_MAX_F32,
GGML_METAL_KERNEL_TYPE_ARGMAX,
@@ -1436,6 +1437,7 @@ static struct ggml_backend_metal_context * ggml_metal_init(ggml_backend_dev_t de
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_COS, cos, true);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_NEG, neg, true);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_SUM_ROWS, sum_rows, true);
+ GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_MEAN, mean, true);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_ARGMAX, argmax, true);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_POOL_2D_AVG_F32, pool_2d_avg_f32, true);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_POOL_2D_MAX_F32, pool_2d_max_f32, true);
@@ -1634,6 +1636,7 @@ static bool ggml_metal_supports_op(const struct ggml_backend_metal_device_contex
case GGML_OP_LOG:
return false; // TODO: implement
case GGML_OP_SUM_ROWS:
+ case GGML_OP_MEAN:
case GGML_OP_SOFT_MAX:
case GGML_OP_GROUP_NORM:
return has_simdgroup_reduction && ggml_is_contiguous(op->src[0]);
@@ -2362,11 +2365,30 @@ static bool ggml_metal_encode_node(
[encoder dispatchThreadgroups:MTLSizeMake(n, 1, 1) threadsPerThreadgroup:MTLSizeMake(1, 1, 1)];
} break;
case GGML_OP_SUM_ROWS:
+ case GGML_OP_MEAN:
{
GGML_ASSERT(src0->nb[0] == ggml_type_size(src0->type));
- id<MTLComputePipelineState> pipeline = ctx->kernels[GGML_METAL_KERNEL_TYPE_SUM_ROWS].pipeline;
+ id<MTLComputePipelineState> pipeline = nil;
+
+ switch (dst->op) {
+ case GGML_OP_SUM_ROWS:
+ pipeline = ctx->kernels[GGML_METAL_KERNEL_TYPE_SUM_ROWS].pipeline;
+ break;
+ case GGML_OP_MEAN:
+ pipeline = ctx->kernels[GGML_METAL_KERNEL_TYPE_MEAN].pipeline;
+ break;
+ default:
+ GGML_ABORT("fatal error");
+ }
+
+ int nth = 32; // SIMD width
+
+ while (nth < ne00 && nth < (int) pipeline.maxTotalThreadsPerThreadgroup) {
+ nth *= 2;
+ }
+ nth = MIN(nth, ne00);
ggml_metal_kargs_sum_rows args = {
/*.ne00 =*/ ne00,
@@ -2396,11 +2418,12 @@ static bool ggml_metal_encode_node(
};
[encoder setComputePipelineState:pipeline];
- [encoder setBuffer:id_src0 offset:offs_src0 atIndex:0];
- [encoder setBuffer:id_dst offset:offs_dst atIndex:1];
- [encoder setBytes:&args length:sizeof(args) atIndex:2];
+ [encoder setBytes:&args length:sizeof(args) atIndex:0];
+ [encoder setBuffer:id_src0 offset:offs_src0 atIndex:1];
+ [encoder setBuffer:id_dst offset:offs_dst atIndex:2];
+ [encoder setThreadgroupMemoryLength:32*sizeof(float) atIndex:0];
- [encoder dispatchThreadgroups:MTLSizeMake(ne01, ne02, ne03) threadsPerThreadgroup:MTLSizeMake(1, 1, 1)];
+ [encoder dispatchThreadgroups:MTLSizeMake(ne01, ne02, ne03) threadsPerThreadgroup:MTLSizeMake(nth, 1, 1)];
} break;
case GGML_OP_SOFT_MAX:
{
diff --git a/ggml/src/ggml-metal/ggml-metal.metal b/ggml/src/ggml-metal/ggml-metal.metal
index 9cfddf45..08e8d807 100644
--- a/ggml/src/ggml-metal/ggml-metal.metal
+++ b/ggml/src/ggml-metal/ggml-metal.metal
@@ -956,31 +956,61 @@ kernel void kernel_neg(
dst[tpig] = -src0[tpig];
}
+template <bool norm>
kernel void kernel_sum_rows(
+ constant ggml_metal_kargs_sum_rows & args,
device const float * src0,
device float * dst,
- constant ggml_metal_kargs_sum_rows & args,
- uint3 tpig[[thread_position_in_grid]]) {
- int64_t i3 = tpig.z;
- int64_t i2 = tpig.y;
- int64_t i1 = tpig.x;
+ threadgroup float * shmem_f32 [[threadgroup(0)]],
+ uint3 tgpig[[threadgroup_position_in_grid]],
+ ushort3 tpitg[[thread_position_in_threadgroup]],
+ ushort sgitg[[simdgroup_index_in_threadgroup]],
+ ushort tiisg[[thread_index_in_simdgroup]],
+ ushort3 ntg[[threads_per_threadgroup]]) {
+ int64_t i3 = tgpig.z;
+ int64_t i2 = tgpig.y;
+ int64_t i1 = tgpig.x;
if (i3 >= args.ne03 || i2 >= args.ne02 || i1 >= args.ne01) {
return;
}
+ if (sgitg == 0) {
+ shmem_f32[tiisg] = 0.0f;
+ }
+
device const float * src_row = (device const float *) ((device const char *) src0 + i1*args.nb01 + i2*args.nb02 + i3*args.nb03);
device float * dst_row = (device float *) ((device char *) dst + i1*args.nb1 + i2*args.nb2 + i3*args.nb3);
- float row_sum = 0;
+ float sumf = 0;
- for (int64_t i0 = 0; i0 < args.ne00; i0++) {
- row_sum += src_row[i0];
+ for (int64_t i0 = tpitg.x; i0 < args.ne00; i0 += ntg.x) {
+ sumf += src_row[i0];
}
- dst_row[0] = row_sum;
+ sumf = simd_sum(sumf);
+
+ threadgroup_barrier(mem_flags::mem_threadgroup);
+
+ if (tiisg == 0) {
+ shmem_f32[sgitg] = sumf;
+ }
+
+ threadgroup_barrier(mem_flags::mem_threadgroup);
+
+ sumf = shmem_f32[tiisg];
+ sumf = simd_sum(sumf);
+
+ if (tpitg.x == 0) {
+ dst_row[0] = norm ? sumf / args.ne00 : sumf;
+ }
}
+typedef decltype(kernel_sum_rows<false>) kernel_sum_rows_t;
+
+template [[host_name("kernel_sum_rows")]] kernel kernel_sum_rows_t kernel_sum_rows<false>;
+template [[host_name("kernel_mean")]] kernel kernel_sum_rows_t kernel_sum_rows<true>;
+
template<typename T>
kernel void kernel_soft_max(
device const char * src0,

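The dropped patch above shared one Metal kernel between GGML_OP_SUM_ROWS and GGML_OP_MEAN by templating it on a norm flag and instantiating it under two host names. The same dedup pattern, reduced to plain C++ as a sketch (the real kernel additionally performs a two-stage simdgroup reduction):

#include <cstdint>

// One row reduction instantiated twice: <false> produces the row sum,
// <true> divides by the row length to produce the mean.
template <bool norm>
static void reduce_row(const float * src_row, float * dst_row, int64_t ne00) {
    float sumf = 0.0f;
    for (int64_t i0 = 0; i0 < ne00; i0++) {
        sumf += src_row[i0];
    }
    dst_row[0] = norm ? sumf / ne00 : sumf;
}

// Mirrors the [[host_name("kernel_sum_rows")]] / [[host_name("kernel_mean")]] instantiations.
template void reduce_row<false>(const float *, float *, int64_t);
template void reduce_row<true>(const float *, float *, int64_t);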
File diff suppressed because it is too large.


@@ -0,0 +1,56 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Oliver Simons <osimons@nvidia.com>
Date: Tue, 22 Jul 2025 11:02:28 +0200
Subject: [PATCH] Enable CUDA Graphs for gemma3n.
Similar to
https://github.com/ggml-org/llama.cpp/pull/14741,
though ollama has a slightly different model graph
than llama.cpp which requires different workaround
checks.
---
ggml/src/ggml-cuda/ggml-cuda.cu | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index 57eae461..9db0c8b5 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -2671,12 +2671,24 @@ static bool check_node_graph_compatibility_and_refresh_copy_ops(ggml_backend_cud
// Loop over nodes in GGML graph to obtain info needed for CUDA graph
cuda_ctx->cuda_graph->cpy_dest_ptrs.clear();
+ // This fix was added in llama.cpp and Ollama in parallel, but with
+ // different tensor names.
+ // llama.cpp: https://github.com/ggml-org/llama.cpp/pull/14741
+ // ollama: https://github.com/ollama/ollama/pull/11525
+
+ const std::string gemma3n_per_layer_proj_src1_name_ollama = " (reshaped)";
+ const std::string gemma3n_node_name_ollama = "node_";
+
const std::string gemma3n_per_layer_proj_src0_name = "inp_per_layer_selected";
const std::string gemma3n_per_layer_proj_src1_name = "per_layer_proj";
+
+ const std::string ffn_moe_bias_suffix = "_exps.bias";
+
const std::string ffn_moe_gate_bias_prefix = "ffn_moe_gate_biased";
const std::string ffn_moe_up_bias_prefix = "ffn_moe_up_biased";
const std::string ffn_moe_down_bias_prefix = "ffn_moe_down_biased";
+
for (int i = 0; i < cgraph->n_nodes; i++) {
ggml_tensor * node = cgraph->nodes[i];
@@ -2700,6 +2712,12 @@ static bool check_node_graph_compatibility_and_refresh_copy_ops(ggml_backend_cud
if (node->op == GGML_OP_ADD &&
node->src[1] && node->src[1]->ne[1] > 1 &&
+ // ollama
+ // workarounds to exclude Gemma3n's `project_per_layer_input` operation from the batch-size heuristic, specific to ollama's implementation of gemma3n
+ // number of layers is different for per_layer_proj between gemma3n:2b and gemma3n:4b, which is why we don't check that value here
+ !(node->ne[0] == 256 && node->ne[2] == 1 && node->ne[3] == 1 && node->src[0] ? std::string(node->src[0]->name).find(gemma3n_node_name_ollama) != std::string::npos : false && node->src[1] ? node->src[1]->name == gemma3n_per_layer_proj_src1_name_ollama : false) &&
+ node->src[1] ? std::string(node->src[1]->name).find(ffn_moe_bias_suffix) == std::string::npos : false &&
+ // upstream
(node->src[0] ? node->src[0]->name != gemma3n_per_layer_proj_src0_name : true) &&
(node->src[1] ? node->src[1]->name != gemma3n_per_layer_proj_src1_name : true) &&
strncmp(node->name, ffn_moe_gate_bias_prefix.c_str(), ffn_moe_gate_bias_prefix.size()) != 0 &&

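The added condition above is hard to read inline; what it encodes is: an ADD node is excluded from the batch-size>1 CUDA-graph disable when it looks like Gemma3n's per-layer projection input (shape 256 x n x 1 x 1, src0 named like an ollama graph node, src1 being the " (reshaped)" view), and MoE bias ADDs (src1 name ending in "_exps.bias") are likewise tolerated. A hedged C++ sketch of the shape/name test in isolation, with explicit null checks; the helper itself is illustrative:

#include <string>
#include "ggml.h"

// True when an ADD node matches ollama's Gemma3n per-layer projection input and
// therefore should not disable CUDA graphs despite having ne[1] > 1.
static bool is_gemma3n_per_layer_proj_add(const ggml_tensor * node) {
    const std::string src1_name_ollama = " (reshaped)";
    const std::string node_name_prefix = "node_";

    if (node->src[0] == nullptr || node->src[1] == nullptr) {
        return false;
    }
    return node->ne[0] == 256 && node->ne[2] == 1 && node->ne[3] == 1 &&
           std::string(node->src[0]->name).find(node_name_prefix) != std::string::npos &&
           src1_name_ollama == node->src[1]->name;
}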

@@ -8,7 +8,7 @@ Subject: [PATCH] Disable ggml-blas on macos v13 and older
1 file changed, 5 insertions(+)
diff --git a/ggml/src/ggml-blas/ggml-blas.cpp b/ggml/src/ggml-blas/ggml-blas.cpp
index ec158dfa..22926d75 100644
index aeac2e57..40738d5b 100644
--- a/ggml/src/ggml-blas/ggml-blas.cpp
+++ b/ggml/src/ggml-blas/ggml-blas.cpp
@@ -505,6 +505,11 @@ static const struct ggml_backend_reg_i ggml_backend_blas_reg_i = {


@@ -1,50 +0,0 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Oliver Simons <osimons@nvidia.com>
Date: Tue, 22 Jul 2025 11:02:28 +0200
Subject: [PATCH] Enable CUDA Graphs for gemma3n.
Similar to
https://github.com/ggml-org/llama.cpp/pull/14741,
though ollama has a slightly different model graph
than llama.cpp which requires different workaround
checks.
---
ggml/src/ggml-cuda/ggml-cuda.cu | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index 2b9fabf4..28ccf4be 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -2474,6 +2474,9 @@ static bool check_node_graph_compatibility_and_refresh_copy_ops(ggml_backend_cud
// Loop over nodes in GGML graph to obtain info needed for CUDA graph
cuda_ctx->cuda_graph->cpy_dest_ptrs.clear();
+ const std::string gemma3n_per_layer_proj_src1_name = " (reshaped)";
+ const std::string gemma3n_node_name = "node_";
+
for (int i = 0; i < cgraph->n_nodes; i++) {
ggml_tensor * node = cgraph->nodes[i];
@@ -2495,12 +2498,17 @@ static bool check_node_graph_compatibility_and_refresh_copy_ops(ggml_backend_cud
#endif
}
- if (node->op == GGML_OP_ADD && node->src[1] && node->src[1]->ne[1] > 1) {
- // disable CUDA graphs for batch size > 1 for now.
- // Changes in batch size or context size can cause changes to the grid size of some kernels.
+ // workarounds to exclude Gemma3n's `project_per_layer_input` operation from the batch-size heuristic, specific to ollama's implementation of gemma3n
+ // number of layers is different for per_layer_proj between gemma3n:2b and gemma3n:4b, which is why we don't check that value here
+ if (node->op == GGML_OP_ADD && node->src[1] && node->src[1]->ne[1] > 1 && !(node->ne[0] == 256
+ && node->ne[2] == 1
+ && node->ne[3] == 1
+ && node->src[0] ? std::string(node->src[0]->name).find(gemma3n_node_name) != std::string::npos : false
+ && node->src[1] ? node->src[1]->name == gemma3n_per_layer_proj_src1_name : false)) {
+ // Generally, changes in batch size or context size can cause changes to the grid size of some kernels.
use_cuda_graph = false;
#ifndef NDEBUG
- GGML_LOG_DEBUG("%s: disabling CUDA graphs due to batch size > 1 [%s] [%ld %ld %ld %ld]\n", __func__, node->name, node->ne[0], node->ne[1], node->ne[2], node->ne[3]);
+ GGML_LOG_INFO("%s: disabling CUDA graphs due to batch size > 1 [%s] [%ld %ld %ld %ld]\n", __func__, node->name, node->ne[0], node->ne[1], node->ne[2], node->ne[3]);
#endif
}


@@ -0,0 +1,21 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Daniel Hiltgen <daniel@ollama.com>
Date: Wed, 6 Aug 2025 12:35:29 -0700
Subject: [PATCH] fix mtmd-audio.cpp build on windows
---
tools/mtmd/mtmd-audio.cpp | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/mtmd/mtmd-audio.cpp b/tools/mtmd/mtmd-audio.cpp
index 4d053895..84bdc277 100644
--- a/tools/mtmd/mtmd-audio.cpp
+++ b/tools/mtmd/mtmd-audio.cpp
@@ -1,6 +1,6 @@
+#define _USE_MATH_DEFINES // for M_PI
#include "mtmd-audio.h"
-#define _USE_MATH_DEFINES // for M_PI
#include <cmath>
#include <cstdint>
#include <cstring>

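The swap above matters because on MSVC, M_PI is only defined if _USE_MATH_DEFINES is visible when the math header is first included; presumably mtmd-audio.h (or something it includes) already pulls in the math header, so defining the macro afterwards had no effect. Minimal illustration of the required ordering:

// The macro must precede the first math header included anywhere in the
// translation unit, otherwise MSVC leaves M_PI undefined.
#define _USE_MATH_DEFINES
#include <cmath>

double half_turn() {
    return M_PI / 2.0;
}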
File diff suppressed because it is too large.


@@ -16,7 +16,7 @@ Defaults to false for newly created backend buffer types.
3 files changed, 21 insertions(+), 1 deletion(-)
diff --git a/ggml/include/ggml-backend.h b/ggml/include/ggml-backend.h
index 48839339..3903c3cb 100644
index 9424394e..b602a7c7 100644
--- a/ggml/include/ggml-backend.h
+++ b/ggml/include/ggml-backend.h
@@ -35,6 +35,7 @@ extern "C" {
@@ -48,7 +48,7 @@ index c36c12d6..81749a5a 100644
GGML_API ggml_backend_buffer_t ggml_backend_buffer_init(
diff --git a/ggml/src/ggml-backend.cpp b/ggml/src/ggml-backend.cpp
index be335e8c..84928bc3 100644
index eded0291..05a842ed 100644
--- a/ggml/src/ggml-backend.cpp
+++ b/ggml/src/ggml-backend.cpp
@@ -35,12 +35,22 @@ const char * ggml_backend_buft_name(ggml_backend_buffer_type_t buft) {


@@ -1,34 +0,0 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Michael Yang <git@mxy.ng>
Date: Thu, 31 Jul 2025 12:31:58 -0700
Subject: [PATCH] cuda: disable graph compat check for OP_ADD
---
ggml/src/ggml-cuda/ggml-cuda.cu | 14 --------------
1 file changed, 14 deletions(-)
diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index bb19b06e..080e7467 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -2509,20 +2509,6 @@ static bool check_node_graph_compatibility_and_refresh_copy_ops(ggml_backend_cud
#endif
}
- // workarounds to exclude Gemma3n's `project_per_layer_input` operation from the batch-size heuristic, specific to ollama's implementation of gemma3n
- // number of layers is different for per_layer_proj between gemma3n:2b and gemma3n:4b, which is why we don't check that value here
- if (node->op == GGML_OP_ADD && node->src[1] && node->src[1]->ne[1] > 1 && !(node->ne[0] == 256
- && node->ne[2] == 1
- && node->ne[3] == 1
- && node->src[0] ? std::string(node->src[0]->name).find(gemma3n_node_name) != std::string::npos : false
- && node->src[1] ? node->src[1]->name == gemma3n_per_layer_proj_src1_name : false)) {
- // Generally, changes in batch size or context size can cause changes to the grid size of some kernels.
- use_cuda_graph = false;
-#ifndef NDEBUG
- GGML_LOG_INFO("%s: disabling CUDA graphs due to batch size > 1 [%s] [%ld %ld %ld %ld]\n", __func__, node->name, node->ne[0], node->ne[1], node->ne[2], node->ne[3]);
-#endif
- }
-
if (node->op == GGML_OP_CPY) {
// Store the pointers which are updated for each token, such that these can be sent