This change updates how metrics are collected. Until now, the performance metrics, specifically the initial input processing and subsequent generation durations, were derived from three timestamps: sequence creation, first token generation, and generation completion. The processing duration was computed as first token generation minus sequence creation, and the generation duration as generation completion minus first token generation. While this approach gives an accurate end-to-end measure of processing and generation, it is not comparable to other tools, which measure only the active (i.e. decode) duration. This change updates the metrics to capture only the decode duration so they can be compared more directly to other tools. A minimal sketch of the difference between the two measurement styles follows.
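The Go sketch below illustrates the two measurement styles under stated assumptions; the type and field names are hypothetical and do not match ollama's internal code. The decode-only totals accumulate time spent inside decode calls and exclude the gaps between steps (scheduling, queuing), which is what makes them comparable across tools.

package main

import (
	"fmt"
	"time"
)

// sequenceMetrics is a hypothetical type for illustration only.
type sequenceMetrics struct {
	// End-to-end style: wall-clock timestamps taken at lifecycle events.
	createdAt    time.Time
	firstTokenAt time.Time
	doneAt       time.Time

	// Decode-only style: accumulated time actually spent in decode calls.
	promptEvalDuration time.Duration // prompt (prefill) processing
	evalDuration       time.Duration // token generation
}

// timeDecode wraps a single decode step and adds its active duration to
// the running total, so idle time between steps is excluded.
func timeDecode(total *time.Duration, decode func()) {
	start := time.Now()
	decode()
	*total += time.Since(start)
}

func main() {
	m := sequenceMetrics{createdAt: time.Now()}

	// Prompt processing; time.Sleep stands in for a real decode call.
	timeDecode(&m.promptEvalDuration, func() { time.Sleep(30 * time.Millisecond) })
	m.firstTokenAt = time.Now()

	// Token generation loop.
	for i := 0; i < 5; i++ {
		timeDecode(&m.evalDuration, func() { time.Sleep(10 * time.Millisecond) })
	}
	m.doneAt = time.Now()

	// Old end-to-end metrics vs. new decode-only metrics.
	fmt.Println("end-to-end processing:", m.firstTokenAt.Sub(m.createdAt))
	fmt.Println("end-to-end generation:", m.doneAt.Sub(m.firstTokenAt))
	fmt.Println("decode-only processing:", m.promptEvalDuration)
	fmt.Println("decode-only generation:", m.evalDuration)
}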
runner
Note: this is a work in progress
A minimal runner for loading a model and running inference via an HTTP web server.
./runner -model <model binary>
Completion
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion
Embeddings
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embedding