This change updates how metrics are collected. Until now, the performance metrics, specifically the initial input processing and subsequent generation durations, were derived from three timestamps: sequence creation, first token generation, and generation completion. The processing duration was computed as first token generation minus sequence creation, and the generation duration as generation completion minus first token generation. While this approach gives an accurate end-to-end measure of processing and generation, it is not comparable to other tools, which measure only the active (i.e. decode) duration. This change updates the metrics to capture only the decode duration so they can be compared more directly to other tools. A minimal sketch of the difference between the two measurement styles follows.
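The Go sketch below illustrates the two measurement styles under stated assumptions; the type and field names are hypothetical and do not match ollama's internal code. The decode-only totals accumulate time spent inside decode calls and exclude the gaps between steps (scheduling, queuing), which is what makes them comparable across tools.

package main

import (
	"fmt"
	"time"
)

// sequenceMetrics is a hypothetical type for illustration only.
type sequenceMetrics struct {
	// End-to-end style: wall-clock timestamps taken at lifecycle events.
	createdAt    time.Time
	firstTokenAt time.Time
	doneAt       time.Time

	// Decode-only style: accumulated time actually spent in decode calls.
	promptEvalDuration time.Duration // prompt (prefill) processing
	evalDuration       time.Duration // token generation
}

// timeDecode wraps a single decode step and adds its active duration to
// the running total, so idle time between steps is excluded.
func timeDecode(total *time.Duration, decode func()) {
	start := time.Now()
	decode()
	*total += time.Since(start)
}

func main() {
	m := sequenceMetrics{createdAt: time.Now()}

	// Prompt processing; time.Sleep stands in for a real decode call.
	timeDecode(&m.promptEvalDuration, func() { time.Sleep(30 * time.Millisecond) })
	m.firstTokenAt = time.Now()

	// Token generation loop.
	for i := 0; i < 5; i++ {
		timeDecode(&m.evalDuration, func() { time.Sleep(10 * time.Millisecond) })
	}
	m.doneAt = time.Now()

	// Old end-to-end metrics vs. new decode-only metrics.
	fmt.Println("end-to-end processing:", m.firstTokenAt.Sub(m.createdAt))
	fmt.Println("end-to-end generation:", m.doneAt.Sub(m.firstTokenAt))
	fmt.Println("decode-only processing:", m.promptEvalDuration)
	fmt.Println("decode-only generation:", m.evalDuration)
}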
runner
Note: this is a work in progress
A minimal runner for loading a model and running inference via an HTTP web server.
./runner -model <model binary>
Completion
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion
Embeddings
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embedding