llm: Perform eviction when num_gpu is set with new estimates

Currently, if you set num_gpu, the model is forced to load with that
number of layers in the current configuration. This happens regardless
of any other information, which means that no eviction is performed
even if another model is already loaded.

This behavior is different from the old estimates (and still happens
for models that run on the llama engine). In those cases, models would
be evicted if needed so the load could proceed at the requested number
of layers. That behavior is more useful and less surprising, so this
changes the new estimates to match.

Fixes #12580
Author: Jesse Gross
Date: 2025-10-14 17:21:16 -07:00
Committed by: Jesse Gross
Parent: 53a969d509
Commit: 3dcfd5f69e
2 changed files with 12 additions and 4 deletions

@@ -127,6 +127,14 @@ func TestLLMServerFitGPU(t *testing.T) {
 			requireFull: true,
 			expectedErr: ErrLoadRequiredFull,
 		},
+		{
+			name: "requireFull numGPU",
+			gpus: []gpu{{id: ml.DeviceID{ID: "gpu0"}, free: 256 * format.MebiByte}},
+			layers: []int{100 * format.MebiByte, 100 * format.MebiByte, 100 * format.MebiByte, 100 * format.MebiByte},
+			numGPU: 4,
+			requireFull: true,
+			expectedErr: ErrLoadRequiredFull,
+		},
 	}
 	for _, tt := range tests {
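
The added case mirrors the scenario from the fix: gpu0 has only 256 MiB free
while num_gpu forces four 100 MiB layers onto it, so the load cannot complete
in the current configuration and the fit check is now expected to return
ErrLoadRequiredFull, allowing the scheduler to evict another model rather than
forcing an oversized load.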