This patch modifies Ollama to select a group of GPUs whose combined memory fits the requested model, instead of the former algorithm of using either one GPU or distributing the model over all available GPUs.
Benefits:
- Less (PCIe) bus communication between GPUs, especially when the bus links are not very fast.
- Unallocated GPUs can enter power-saving mode.
- Significantly reduced VRAM allocation in systems with more than 2 GPUs.
- Due to the reduced memory allocation, more models can run simultaneously.
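
The grouping idea can be illustrated with a minimal sketch. It is not the code from this patch: the `GPU` type, the `pickGPUs` function, and the "largest free memory first" selection rule are assumptions made for illustration, and real scheduling would also have to account for per-GPU overhead and layer placement.

```go
package main

import (
	"fmt"
	"sort"
)

// GPU is a hypothetical, simplified device record (not Ollama's real type).
type GPU struct {
	ID        string
	FreeBytes uint64
}

// pickGPUs returns a group of GPUs whose combined free memory covers
// required bytes. It takes GPUs in order of descending free memory, which
// yields the fewest devices that can hold the model. It returns nil when
// even all GPUs together are too small.
func pickGPUs(gpus []GPU, required uint64) []GPU {
	// Sort a copy so the caller's slice is left untouched.
	sorted := make([]GPU, len(gpus))
	copy(sorted, gpus)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].FreeBytes > sorted[j].FreeBytes
	})

	var total uint64
	for i, g := range sorted {
		total += g.FreeBytes
		if total >= required {
			return sorted[:i+1]
		}
	}
	return nil
}

func main() {
	const gib = 1 << 30
	gpus := []GPU{
		{ID: "gpu0", FreeBytes: 24 * gib},
		{ID: "gpu1", FreeBytes: 8 * gib},
		{ID: "gpu2", FreeBytes: 16 * gib},
		{ID: "gpu3", FreeBytes: 8 * gib},
	}

	// A model needing 30 GiB fits on gpu0 + gpu2; gpu1 and gpu3 stay idle.
	for _, g := range pickGPUs(gpus, 30*gib) {
		fmt.Println(g.ID)
	}
}
```

In this example the two remaining GPUs receive no allocation, so they stay free for another model or can drop into a low-power state, which is the effect the benefits above describe.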