mirror of https://github.com/likelovewant/ollama-for-amd.git synced 2025-12-21 22:33:56 +00:00

Daniel Hiltgen 517807cdf2 perf: build graph for next batch async to keep GPU busy (#11863)

* perf: build graph for next batch in parallel to keep GPU busy

This refactors the main run loop of the ollama runner to perform the main
GPU-intensive tasks (Compute+Floats) in a goroutine, so the next batch can be
prepared in parallel. This reduces the time the GPU stalls waiting for the
next batch of work.

* tests: tune integration tests for ollama engine

This tunes the integration tests to focus more on models supported
by the new engine.

2025-08-29 14:20:28 -07:00

common

chore: fix some inconsistent function name in comment

2025-08-13 09:50:27 -07:00

llamarunner

chore: remove redundant words in comment (#12028)

2025-08-22 11:00:27 -07:00

ollamarunner

perf: build graph for next batch async to keep GPU busy (#11863)

2025-08-29 14:20:28 -07:00

README.md

Runner for Ollama engine

2025-02-13 17:09:26 -08:00

runner.go

Runner for Ollama engine

2025-02-13 17:09:26 -08:00

README.md

`runner`

Note: this is a work in progress

A minimal runner for loading a model and running inference via an HTTP web server.

./runner -model <model binary>

Completion

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion

Embeddings

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embedding
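
Go client sketch

The two curl calls above differ only in the endpoint path, so a Go client could build both requests with one helper. This is a minimal sketch, not part of the runner itself: `promptRequest`, `encodePrompt`, and `newPromptRequest` are hypothetical names, and the base URL is assumed from the curl examples. The response format is not documented here, so the sketch stops at building the request.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// promptRequest mirrors the JSON body used in the curl examples: {"prompt": "..."}.
// (Hypothetical type; the runner's own request struct may carry more fields.)
type promptRequest struct {
	Prompt string `json:"prompt"`
}

// encodePrompt serializes a prompt into the JSON body shown in the curl examples.
func encodePrompt(prompt string) ([]byte, error) {
	return json.Marshal(promptRequest{Prompt: prompt})
}

// newPromptRequest builds a POST request for a runner endpoint such as
// "completion" or "embedding". baseURL is an assumption taken from the
// curl examples (http://localhost:8080).
func newPromptRequest(baseURL, endpoint, prompt string) (*http.Request, error) {
	body, err := encodePrompt(prompt)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/"+endpoint, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newPromptRequest("http://localhost:8080", "completion", "hi")
	if err != nil {
		panic(err)
	}
	// Prints the method and target URL without sending anything.
	fmt.Println(req.Method, req.URL.String())
}
```

With the runner started as shown above, passing the request to `http.DefaultClient.Do(req)` would send it; the response body can then be read and inspected as raw bytes.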