mirror of https://github.com/likelovewant/ollama-for-amd.git synced 2025-12-21 22:33:56 +00:00

Files

Jesse Gross 26465fb85f ollamarunner: Worst case batch for token generation

We currently allocate the worst case batch for max sized
batches, which corresponds to prompt processing. However,
there are some cases where the generated graph is different
for small and large batches. To ensure that we don't need
to allocate memory later after layout has taken place, we
should run the worst case batch both ways and take the larger
amount of memory.

This does not noticeably affect loading speed as the most expensive
part of this logic is from image processing and that does not
occur during token generation.

2025-10-30 13:53:10 -07:00

common

chore: fix some inconsistent function name in comment

2025-08-13 09:50:27 -07:00

llamarunner

Revert "server: Consolidate embedding truncation in runner (#12730)" (#12810)

2025-10-28 14:49:14 -07:00

ollamarunner

ollamarunner: Worst case batch for token generation

2025-10-30 13:53:10 -07:00

README.md

Runner for Ollama engine

2025-02-13 17:09:26 -08:00

runner.go

Runner for Ollama engine

2025-02-13 17:09:26 -08:00

README.md

`runner`

Note: this is a work in progress

A minimal runner for loading a model and running inference via an HTTP web server.

./runner -model <model binary>

Completion

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion

Embeddings

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embedding