* feat: Bump llama.cpp to df1b612
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(mtmd): Correctly encode text chunks during mtmd tokenization
For some models, text chunks containing template delimiter tokens can appear
interspersed with the image embeddings. These chunks need to be correctly
translated to text tokens.
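A minimal sketch of that chunk-walking loop, using hypothetical `chunk` and `tokenize` stand-ins (the real implementation lives in the llama.cpp mtmd bindings, not these names):

```go
package main

import "fmt"

// chunkType distinguishes the kinds of chunks mtmd tokenization can produce.
type chunkType int

const (
	chunkText chunkType = iota
	chunkImage
)

// chunk is a hypothetical stand-in for an mtmd input chunk: either raw
// template/delimiter text or a pre-computed image embedding.
type chunk struct {
	kind chunkType
	text string    // set for text chunks
	embd []float32 // set for image chunks
}

// tokenize is a placeholder text tokenizer; real code would call into llama.cpp.
func tokenize(s string) []int {
	toks := make([]int, 0, len(s))
	for _, r := range s {
		toks = append(toks, int(r)) // placeholder: one "token" per rune
	}
	return toks
}

// encodeChunks walks the chunk list and translates interspersed text chunks
// (which may carry template delimiter tokens) into text tokens, while passing
// image embeddings through unchanged.
func encodeChunks(chunks []chunk) (tokens []int, images [][]float32) {
	for _, c := range chunks {
		switch c.kind {
		case chunkText:
			tokens = append(tokens, tokenize(c.text)...)
		case chunkImage:
			images = append(images, c.embd)
		}
	}
	return tokens, images
}

func main() {
	chunks := []chunk{
		{kind: chunkText, text: "<img>"},
		{kind: chunkImage, embd: []float32{0.1, 0.2}},
		{kind: chunkText, text: "</img>"},
	}
	toks, imgs := encodeChunks(chunks)
	fmt.Println(len(toks), len(imgs))
}
```

The point of the fix is only the branch on chunk type: text chunks must go through the tokenizer rather than being dropped or passed through raw.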
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* tests: Use MtmdChunk in image_test
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: Fix unnecessary conversion linting
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(ggml): Revert changes to ggml_hip.cpp
These changes were made largely by our code assistant and are likely wrong.
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Revert changes in mem_nvml.cpp
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Update sync point to 1deee0
This brings in several more optimization commits and model support for
EmbeddingGemma
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Update patches for 1deee0
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: sync for bump to 1deee0
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Bad patch updates with errant `+`
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Bump llama.cpp/ggml to 7049736
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: format-patches after latest bump
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This provides integration with the new Ollama engine
(5824541 next ollama runner (#7913)) and the rest of the Ollama
infrastructure, such as the runner and the Ollama server.
It also builds out the KV cache infrastructure to support the
requirements of how Ollama runs models, such as:
- Parallel processing
- Memory management for defragmentation and shifting
- Multi-modal models
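The per-sequence bookkeeping behind the first two items can be sketched with a toy cache, assuming hypothetical types (the real KV cache lives in the llama.cpp wrappers, not this code):

```go
package main

import "fmt"

// kvEntry is a toy key/value cache cell tagged with its position in a sequence.
type kvEntry struct {
	pos int
	key []float32
	val []float32
}

// kvCache keeps an independent entry list per sequence, which is what lets
// several requests be processed in parallel without interfering.
type kvCache struct {
	seqs map[int][]kvEntry
}

func newKVCache() *kvCache {
	return &kvCache{seqs: make(map[int][]kvEntry)}
}

// add appends an entry for a sequence at the given position.
func (c *kvCache) add(seq, pos int, key, val []float32) {
	c.seqs[seq] = append(c.seqs[seq], kvEntry{pos: pos, key: key, val: val})
}

// shift drops entries before keepFrom and renumbers the rest: the basic
// operation behind context shifting when a sequence outgrows its window.
func (c *kvCache) shift(seq, keepFrom int) {
	kept := c.seqs[seq][:0]
	for _, e := range c.seqs[seq] {
		if e.pos >= keepFrom {
			e.pos -= keepFrom
			kept = append(kept, e)
		}
	}
	c.seqs[seq] = kept
}

func main() {
	c := newKVCache()
	for pos := 0; pos < 4; pos++ {
		c.add(0, pos, []float32{1}, []float32{2})
	}
	c.shift(0, 2) // keep the last two positions, renumbered from 0
	fmt.Println(len(c.seqs[0]), c.seqs[0][0].pos)
}
```

Defragmentation is the same idea applied across sequences: compacting the surviving entries so freed cells can be reused.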
Both the old and new engines continue to be supported. By default, only
the old engine is used. To enable the new engine:
- Start the server with the OLLAMA_NEW_ENGINE environment variable set:
  OLLAMA_NEW_ENGINE=1 ./ollama serve
- Run a model supported by the new engine, such as Llama 3.1 8B Q4_K_M:
  ./ollama run jessegross/llama3.1