mirror of
https://github.com/likelovewant/ollama-for-amd.git
synced 2025-12-21 22:33:56 +00:00
We currently allocate the worst case batch for max sized batches, which corresponds to prompt processing. However, there are some cases where the generated graph is different for small and large batches. To ensure that we don't need to allocate memory later after layout has taken place, we should run the worst case batch both ways and take the larger amount of memory. This does not noticeably affect loading speed as the most expensive part of this logic is from image processing and that does not occur during token generation.
runner
Note: this is a work in progress
A minimial runner for loading a model and running inference via a http web server.
./runner -model <model binary>
Completion
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion
Embeddings
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embedding