Commit Graph

80 Commits

Author SHA1 Message Date
nicole pardal
3475d915cb embeddings: modified batch size (#13429)
This PR detects embedding models and sets batch_size = context_size so the full input fits in a single batch.
Previously, if batch size was smaller than the input, tokens could be split across batches and cause a SIGTRAP crash.
This change ensures all tokens stay in one batch and prevents crashes.
Fixes: #12938 #13054

Co-authored-by: Jesse Gross <jesse@ollama.com>
2025-12-11 15:36:31 -08:00
nicole pardal
e082d60a24 truncation: fixed runner truncation logic + removed server truncation (#12839)
This PR consolidates all embedding prompt-length checking, truncation, and prompt token counting into the runner to ensure a single source of truth.
2025-12-08 11:20:28 -08:00
Daniel Hiltgen
18b5958d46 test: avoid ministral tools test on low vram (#13302)
Avoid hitting test timeouts
2025-12-02 13:18:55 -08:00
Daniel Hiltgen
d771043e88 test: add ministral-3 (#13300) 2025-12-02 09:52:16 -08:00
Parth Sareen
c114987523 logprob: add bytes to logprobs (#13068) 2025-11-13 13:49:25 -08:00
Baptiste Jamin
59241c5bee server: add logprobs and top_logprobs support to Ollama's API (#12899)
Adds logprobs support to Ollama's API including support for Ollama's
OpenAI-compatible API. By specifying the new 'logprobs' boolean parameter
in the API, Ollama will return the log probabilities for each token generated.
'top_logprobs', an integer value can also be specified up to the value 20.
When specified, the API will also provide the number of most likely tokens to
return at each token position

Co-authored-by: Baptiste Jamin <baptiste@crisp.chat>
2025-11-11 08:49:50 -08:00
nicole pardal
7dd4862a89 embeddings: removed redundant TestAPIEmbeddings test (#12863)
This PR removes a redundant test from TestAPIEmbeddings
Contents of this test already exists in embed_test.go and model_arch_test.go
2025-10-30 17:12:33 -07:00
Patrick Devine
76eb7d0fff testing: test more models with tool calling (#12867) 2025-10-30 13:19:21 -07:00
Daniel Hiltgen
c88647104d int: harden server lifecycle (#12835)
this should reduce zombies during integration runs
2025-10-29 11:50:56 -07:00
Patrick Devine
05aff4a4f1 tests: fix embeddinggemma integration test (#12830) 2025-10-29 11:07:28 -07:00
Michael Yang
7d25b9e194 feat(model): add qwen3vl (#12665) 2025-10-28 17:39:47 -07:00
Patrick Devine
36d64fb531 embed: add distance correlation test for library embed models (#12796) 2025-10-28 16:57:27 -07:00
Patrick Devine
29f63f37c8 Revert "server: Consolidate embedding truncation in runner (#12730)" (#12810)
This reverts commit 5d347f6d6f.
2025-10-28 14:49:14 -07:00
nicole pardal
5d347f6d6f server: Consolidate embedding truncation in runner (#12730)
Currently, checking the length of prompts for embeddings to ensure
they fit in the context window (and possible truncation) occurs in
two places - the Ollama server and runner. This can lead to
inconsistencies in both the checks and reported number of tokens
processed. Since we have to do this processing in the runner, this
consolidates all of the logic there.
2025-10-27 11:59:12 -07:00
Jeffrey Morgan
5fe7ba1b9b runner: always truncate embeddings requests (#12714) 2025-10-20 16:47:05 -07:00
Daniel Hiltgen
68e04c7ff8 test: harden scheduler tests (#12662)
* test: harden scheduler tests

This removes reschedDelay which was stale code, and adds
a new configurable timeout for the waitForVRAMRecovery so
tests can now set the timeout to be very short to avoid the
scheduler getting stuck and hitting a test timeout.

* test: tune tests for partial loads

Give stress tests more time when the model is split between CPU/GPU
2025-10-17 08:56:44 -07:00
Daniel Hiltgen
b531777a66 test: add a few missing embedding models (#12661) 2025-10-16 09:36:25 -07:00
Daniel Hiltgen
4e5d862ec4 Integration test tuning (#12492)
Remove some flaky scenarios, and switch to chat for better reliability
2025-10-08 09:51:25 -07:00
Daniel Hiltgen
c68f367ef6 Update GGML to b6646 (#12245)
Notable EOLs with this change:
- MacOS v12 and v13 are no longer supported (v14+ required)
- AMD gfx900 and gfx906 are no longer supported
2025-10-02 14:47:10 -07:00
Daniel Hiltgen
bc8909fb38 Use runners for GPU discovery (#12090)
This revamps how we discover GPUs in the system by leveraging the Ollama
runner.  This should eliminate inconsistency between our GPU discovery and the
runners capabilities at runtime, particularly for cases where we try to filter
out unsupported GPUs.  Now the runner does that implicitly based on the actual
device list.  In some cases free VRAM reporting can be unreliable which can
leaad to scheduling mistakes, so this also includes a patch to leverage more
reliable VRAM reporting libraries if available.

Automatic workarounds have been removed as only one GPU leveraged this, which
is now documented. This GPU will soon fall off the support matrix with the next
ROCm bump.

Additional cleanup of the scheduler and discovery packages can be done in the
future once we have switched on the new memory management code, and removed
support for the llama runner.
2025-10-01 15:12:32 -07:00
Daniel Hiltgen
c23e6f4cae tests: add single threaded history test (#12295)
* tests: add single threaded history test

Also tidies up some existing tests to handle more model output variation

* test: add support for testing specific architectures
2025-09-22 11:23:14 -07:00
Michael Yang
ceac416ec2 fix(integration): check truncated length (#12337) 2025-09-18 14:00:21 -07:00
Daniel Hiltgen
44a6792873 tests: tighten up a few flaky tests (#12271)
Sometimes the context test results are pure emoji's
Thanksgiving has too much variability, so swap for a more straight forward prompt.
2025-09-12 13:59:34 -07:00
Parth Sareen
20b53eaa72 tests: add tool calling integration test (#12232) 2025-09-09 14:01:11 -07:00
Daniel Hiltgen
6745182885 tests: reduce stress on CPU to 2 models (#12161)
* tests: reduce stress on CPU to 2 models

This should avoid flakes due to systems getting overloaded with 3 (or more) models running concurrently

* tests: allow slow systems to pass on timeout

If a slow system is still streaming a response, and the response
will pass validation, don't fail just because the system is slow.

* test: unload embedding models more quickly
2025-09-09 09:32:15 -07:00
Jesse Gross
e119783e66 llm: Clamp batch size to context size
The context must always be able to store the current batch, so
if the user requests a small context then we should also shrink
the batch to match. This also fixes the TestLongInputContext
test on the new engine. (The old engine already has this behavior.)
2025-09-08 20:40:11 -07:00
Daniel Hiltgen
517807cdf2 perf: build graph for next batch async to keep GPU busy (#11863)
* perf: build graph for next batch in parallel to keep GPU busy

This refactors the main run loop of the ollama runner to perform the main GPU
intensive tasks (Compute+Floats) in a go routine so we can prepare the next
batch in parallel to reduce the amount of time the GPU stalls waiting for the
next batch of work.

* tests: tune integration tests for ollama engine

This tunes the integration tests to focus more on models supported
by the new engine.
2025-08-29 14:20:28 -07:00
Daniel Hiltgen
d6f7233a1c test: improve scheduler/concurrency stress tests (#11906)
* test: improve scheduler/concurrency stress tests

The scheduler test used to use approximate memory figures and would often
over or under shoot a systems capcity leading to flaky test results.
This should improve the reliability of this scenario by leveraging
ps output to determinie exactly how many models it takes to
trigger thrashing.

The concurrency test is also refined to target num_parallel + 1 and handle
timeouts better.

With these refinements, TestMultiModelConcurrency was redundant

* test: add parallel generate with history

TestGenerateWithHistory will help verify caching and context
are properly handled while making requests

* test: focus embed tests on embedding models

remove non-embedding models from the embedding tests
2025-08-15 14:37:54 -07:00
Daniel Hiltgen
c385ca8672 test: add valid responses (#11902)
some of the new models need a few more valid responses to pass
2025-08-14 11:07:13 -07:00
Daniel Hiltgen
a24f90604f int: adjust a few models for integration tests (#11872) 2025-08-13 15:42:36 -07:00
Daniel Hiltgen
114c3f2265 tests: add integration coverage for oss-gpt (#11696)
Also wires up support to override the default "smol" model
2025-08-07 15:06:57 -07:00
Daniel Hiltgen
f8a6e88819 Only load supported models on new engine (#11362)
* Only load supported models on new engine

Verify the model is supported before trying to load

* int: testcase for all library models
2025-07-11 12:21:54 -07:00
Daniel Hiltgen
4f473e224c int: add performance integration tests (#11173)
usage example:
  go test --tags=integration,perf -count 1 ./integration -v -timeout 1h -run TestModelsPerf 2>&1 | tee int.log
  cat int.log | grep MODEL_PERF_HEADER | cut -f2- -d: > perf.csv
  cat int.log | grep MODEL_PERF_DATA | cut -f2- -d: >> perf.csv
2025-07-05 16:07:09 -07:00
Daniel Hiltgen
f2527b08fb int: add coverage for older models (#11137)
Verified these fail on 0.9.1 and pass on HEAD.
2025-06-19 12:10:19 -07:00
Daniel Hiltgen
2307fc2bcd tests: drop llama3.2-vision embedding tests (#10837) 2025-05-24 13:17:53 -07:00
Daniel Hiltgen
fdd4d479a3 integration: add qwen2.5-vl (#10815)
Replace the older llava model with qwen2.5 for vision tests
Skip split-batch test on small VRAM systems to avoid excessive test time
2025-05-22 09:12:32 -07:00
Daniel Hiltgen
424810450f Move quantization to new backend (#10363)
* Move quantization logic to GGML via new backend

This moves the model aware logic to Go code and calls GGMLs quantization code for model creation.

* Remove "add model quantizations"

This is no longer needed now that quantization is implemented in Go+GGML code directly.
2025-05-06 11:20:48 -07:00
湛露先生
7e5c8eee5c file close check and close. (#10554)
Signed-off-by: zhanluxianshen <zhanluxianshen@163.com>
2025-05-04 15:37:59 -07:00
Daniel Hiltgen
7bec2724a5 integration: fix embedding tests error handling (#10478)
The cleanup routine from InitServerconnection should run in the defer of the test case to properly detect failures and report the server logs
2025-04-29 11:57:54 -07:00
Daniel Hiltgen
ed4e139314 Integration test improvements (#9654)
Add some new test coverage for various model architectures,
and switch from orca-mini to the small llama model.
2025-04-16 14:25:55 -07:00
CYJiang
e7019c9455 fix(integration): move waitgroup Add(1) outside goroutine to avoid potential issue (#10070)
Signed-off-by: googs1025 <googs1025@gmail.com>
2025-04-08 15:17:40 -07:00
Bruce MacDonald
9876c9faa4 chore(all): replace instances of interface with any (#10067)
Both interface{} and any (which is just an alias for interface{} introduced in Go 1.18) represent the empty interface that all types satisfy.
2025-04-02 09:44:27 -07:00
Jesse Gross
9679f40146 ml: Allow models to constrain inputs to a single batch
Models may require that a set of inputs all be processed as part
of the same batch. For example, if an image has multiple patches
with fully connected attention between them, we should not split
the batch in the middle of an image.

Fixes #9697
2025-03-14 15:38:54 -07:00
Stefan Weil
abfdc4710f all: fix typos in documentation, code, and comments (#7021) 2024-12-10 12:58:06 -08:00
Daniel Hiltgen
f0a351810c tests: fix max queue integration test (#7782)
This had fallen out of sync with the envconfig behavior, where max queue default was not zero.
2024-11-22 08:05:45 -08:00
Jesse Gross
7121dfa309 runner.go: Retry decoding after defragmentation if needed
Fragmentation of the KV cache can occur due to cache shifting or
different sequences getting processed. Decode uses a heuristic to
decide if it should defrag. However, this heuristic isn't 100%
accurate, so decoding can sometimes fail by surprise.

For these cases, if decode indicates that there is no KV cache space,
we should defrag and then try again.
2024-11-20 12:49:24 -08:00
Daniel Hiltgen
8a9bb0d000 Add basic mllama integration tests (#7455) 2024-10-31 17:25:48 -07:00
Daniel Hiltgen
921779bb10 Give unicode test more time to run (#7437)
* Give unicode test more time to run

Some slower GPUs (or partial CPU/GPU loads) can take more than the default 30s to complete this test

* Give more time for concurrency test

CPU inference can be very slow under stress
2024-10-31 13:35:31 -07:00
Jesse Gross
078f666f73 tests: Add test for Unicode processing 2024-10-28 18:12:29 -07:00
Daniel Hiltgen
dc6fe82051 integration: harden embedding test (#7306)
Use cosine similarity to make the embeddings tests more robust
2024-10-22 15:25:22 -07:00