docs: add docs for docs.ollama.com (#12805)
docs/faq.mdx (132 changes)
@@ -1,4 +1,6 @@
# FAQ
---
title: FAQ
---

## How can I upgrade Ollama?

@@ -20,9 +22,9 @@ Please refer to the [GPU docs](./gpu.md).

## How can I specify the context window size?

By default, Ollama uses a context window size of 4096 tokens for most models. The `gpt-oss` model has a default context window size of 8192 tokens.

By default, Ollama uses a context window size of 2048 tokens.

This can be overridden in Settings in the Windows and macOS App, or with the `OLLAMA_CONTEXT_LENGTH` environment variable. For example, to set the default context window to 8K, use:

This can be overridden with the `OLLAMA_CONTEXT_LENGTH` environment variable. For example, to set the default context window to 8K, use:

```shell
OLLAMA_CONTEXT_LENGTH=8192 ollama serve
```

@@ -46,8 +48,6 @@ curl http://localhost:11434/api/generate -d '{
}'
```

Setting the context length higher may cause the model to no longer fit on the GPU, which makes the model run more slowly.

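For context, the `curl` example truncated by the hunk above sets the context window per request; the sketch below is illustrative, and the model name and `num_ctx` value are assumptions rather than part of the diff:

```shell
# Set a per-request context window via the options object (values are illustrative)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "options": {
    "num_ctx": 8192
  }
}'
```
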
## How can I tell if my model was loaded onto the GPU?

Use the `ollama ps` command to see what models are currently loaded into memory.

@@ -56,17 +56,16 @@ Use the `ollama ps` command to see what models are currently loaded into memory.
ollama ps
```

> **Output**:
>
> ```
> NAME           ID              SIZE    PROCESSOR  CONTEXT  UNTIL
> gpt-oss:20b    05afbac4bad6    16 GB   100% GPU   8192     4 minutes from now
> ```

<Info>
**Output**:

```
NAME          ID              SIZE     PROCESSOR    UNTIL
llama3:70b    bcfb190ca3a7    42 GB    100% GPU     4 minutes from now
```
</Info>

The `Processor` column will show which memory the model was loaded into:

* `100% GPU` means the model was loaded entirely into the GPU
* `100% CPU` means the model was loaded entirely in system memory
* `48%/52% CPU/GPU` means the model was loaded partially onto both the GPU and into system memory

- `100% GPU` means the model was loaded entirely into the GPU
- `100% CPU` means the model was loaded entirely in system memory
- `48%/52% CPU/GPU` means the model was loaded partially onto both the GPU and into system memory

## How do I configure Ollama server?

@@ -78,9 +77,9 @@ If Ollama is run as a macOS application, environment variables should be set usi

1. For each environment variable, call `launchctl setenv`.

   ```bash
   launchctl setenv OLLAMA_HOST "0.0.0.0:11434"
   ```

2. Restart Ollama application.

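As an optional sanity check (assuming the variable was set as above), `launchctl getenv` prints the value that GUI apps such as Ollama will inherit after a restart:

```shell
# Print the value launchd will pass to GUI applications
launchctl getenv OLLAMA_HOST
```
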
@@ -92,10 +91,10 @@ If Ollama is run as a systemd service, environment variables should be set using

2. For each environment variable, add a line `Environment` under section `[Service]`:

   ```ini
   [Service]
   Environment="OLLAMA_HOST=0.0.0.0:11434"
   ```

3. Save and exit.

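After saving, systemd needs to pick up the change; a typical sequence (assuming the service is named `ollama`, as in the standard Linux install) is:

```shell
# Reload unit files and restart the service so the new Environment= line takes effect
sudo systemctl daemon-reload
sudo systemctl restart ollama
```
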
@@ -126,8 +125,10 @@ On Windows, Ollama inherits your user and system environment variables.

Ollama pulls models from the Internet and may require a proxy server to access the models. Use `HTTPS_PROXY` to redirect outbound requests through the proxy. Ensure the proxy certificate is installed as a system certificate. Refer to the section above for how to use environment variables on your platform.

> [!NOTE]
> Avoid setting `HTTP_PROXY`. Ollama does not use HTTP for model pulls, only HTTPS. Setting `HTTP_PROXY` may interrupt client connections to the server.

<Note>
Avoid setting `HTTP_PROXY`. Ollama does not use HTTP for model pulls, only HTTPS. Setting `HTTP_PROXY` may interrupt client connections to the server.
</Note>

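For a server started manually on Linux, the variable can also be set inline; the proxy URL below is a placeholder:

```shell
# Route Ollama's outbound HTTPS traffic (model pulls) through a proxy
HTTPS_PROXY=https://my.proxy.example.com ollama serve
```
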
### How do I use Ollama behind a proxy in Docker?

@@ -150,11 +151,9 @@ docker build -t ollama-with-ca .
docker run -d -e HTTPS_PROXY=https://my.proxy.example.com -p 11434:11434 ollama-with-ca
```

## Does Ollama send my prompts and responses back to ollama.com?

## Does Ollama send my prompts and answers back to ollama.com?

If you're running a model locally, your prompts and responses will always stay on your machine. Ollama Turbo in the App allows you to run your queries on Ollama's servers if you don't have a powerful enough GPU. Web search lets a model query the web, giving you more accurate and up-to-date information. Both Turbo and web search require sending your prompts and responses to Ollama.com. This data is neither logged nor stored.

If you don't want to see the Turbo and web search options in the app, you can disable them in Settings by turning on Airplane mode. In Airplane mode, all models will run locally, and your prompts and responses will stay on your machine.

No. Ollama runs locally, and conversation data does not leave your machine.

## How can I expose Ollama on my network?

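The body of this answer falls outside the hunk, but the mechanism is the same `OLLAMA_HOST` setting used in the configuration examples above; a minimal sketch:

```shell
# Bind the server to all interfaces instead of the default 127.0.0.1
OLLAMA_HOST=0.0.0.0:11434 ollama serve
```
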
@@ -216,7 +215,9 @@ Refer to the section [above](#how-do-i-configure-ollama-server) for how to set e

If a different directory needs to be used, set the environment variable `OLLAMA_MODELS` to the chosen directory.

> Note: on Linux using the standard installer, the `ollama` user needs read and write access to the specified directory. To assign the directory to the `ollama` user run `sudo chown -R ollama:ollama <directory>`.

<Note>
On Linux using the standard installer, the `ollama` user needs read and write access to the specified directory. To assign the directory to the `ollama` user run `sudo chown -R ollama:ollama <directory>`.
</Note>

Refer to the section [above](#how-do-i-configure-ollama-server) for how to set environment variables on your platform.

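For example, to point Ollama at models stored on another drive (the path below is a placeholder), set the variable before starting the server:

```shell
# Store and load models from a custom directory
OLLAMA_MODELS=/mnt/data/ollama/models ollama serve
```
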
@@ -235,7 +236,7 @@ GPU acceleration is not available for Docker Desktop in macOS due to the lack of

This can impact both installing Ollama and downloading models.

Open `Control Panel > Networking and Internet > View network status and tasks` and click on `Change adapter settings` on the left panel. Find the `vEthernet (WSL)` adapter, right-click and select `Properties`.

Click on `Configure` and open the `Advanced` tab. Search through each of the properties until you find `Large Send Offload Version 2 (IPv4)` and `Large Send Offload Version 2 (IPv6)`. *Disable* both of these properties.

Click on `Configure` and open the `Advanced` tab. Search through each of the properties until you find `Large Send Offload Version 2 (IPv4)` and `Large Send Offload Version 2 (IPv6)`. _Disable_ both of these properties.

## How can I preload a model into Ollama to get faster response times?

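The body of this answer lies outside the hunk; a common approach, consistent with the `keep_alive` discussion below, is to send an empty request so the model is loaded ahead of time. A sketch, with the model name as a placeholder:

```shell
# Load the model into memory without generating a response
curl http://localhost:11434/api/generate -d '{"model": "llama3.2"}'
```
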
@@ -269,10 +270,11 @@ ollama stop llama3.2
```

If you're using the API, use the `keep_alive` parameter with the `/api/generate` and `/api/chat` endpoints to set the amount of time that a model stays in memory. The `keep_alive` parameter can be set to:

* a duration string (such as "10m" or "24h")
* a number in seconds (such as 3600)
* any negative number which will keep the model loaded in memory (e.g. -1 or "-1m")
* '0' which will unload the model immediately after generating a response

- a duration string (such as "10m" or "24h")
- a number in seconds (such as 3600)
- any negative number which will keep the model loaded in memory (e.g. -1 or "-1m")
- '0' which will unload the model immediately after generating a response

For example, to preload a model and leave it in memory use:

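The concrete request is truncated by the hunk above; a sketch of the call it describes, with the model name as a placeholder, is:

```shell
# keep_alive: -1 keeps the model resident until it is explicitly stopped
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "keep_alive": -1}'
```
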
@@ -292,31 +294,31 @@ The `keep_alive` API parameter with the `/api/generate` and `/api/chat` API endp

## How do I manage the maximum number of requests the Ollama server can queue?

If too many requests are sent to the server, it will respond with a 503 error indicating the server is overloaded. You can adjust how many requests may be queued by setting `OLLAMA_MAX_QUEUE`.

## How does Ollama handle concurrent requests?

Ollama supports two levels of concurrent processing. If your system has sufficient available memory (system memory when using CPU inference, or VRAM for GPU inference) then multiple models can be loaded at the same time. For a given model, if there is sufficient available memory when the model is loaded, it can be configured to allow parallel request processing.

Ollama supports two levels of concurrent processing. If your system has sufficient available memory (system memory when using CPU inference, or VRAM for GPU inference) then multiple models can be loaded at the same time. For a given model, if there is sufficient available memory when the model is loaded, it is configured to allow parallel request processing.

If there is insufficient available memory to load a new model request while one or more models are already loaded, all new requests will be queued until the new model can be loaded. As prior models become idle, one or more will be unloaded to make room for the new model. Queued requests will be processed in order. When using GPU inference, new models must be able to completely fit in VRAM to allow concurrent model loads.

Parallel request processing for a given model results in increasing the context size by the number of parallel requests. For example, a 2K context with 4 parallel requests will result in an 8K context and additional memory allocation.

The following server settings may be used to adjust how Ollama handles concurrent requests on most platforms (a combined example follows the list):

- `OLLAMA_MAX_LOADED_MODELS` - The maximum number of models that can be loaded concurrently provided they fit in available memory. The default is 3 * the number of GPUs or 3 for CPU inference.
- `OLLAMA_NUM_PARALLEL` - The maximum number of parallel requests each model will process at the same time. The default is 1, and will handle 1 request per model at a time.
- `OLLAMA_MAX_LOADED_MODELS` - The maximum number of models that can be loaded concurrently provided they fit in available memory. The default is 3 \* the number of GPUs or 3 for CPU inference.
- `OLLAMA_NUM_PARALLEL` - The maximum number of parallel requests each model will process at the same time. The default will auto-select either 4 or 1 based on available memory.
- `OLLAMA_MAX_QUEUE` - The maximum number of requests Ollama will queue when busy before rejecting additional requests. The default is 512.
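
A combined example of these settings (the values are illustrative, not recommendations):

```shell
# Allow two resident models, 4 parallel requests per model, and a smaller queue
OLLAMA_MAX_LOADED_MODELS=2 OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_QUEUE=256 ollama serve
```
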
Note: Windows with Radeon GPUs currently default to 1 model maximum due to limitations in ROCm v5.7 for available VRAM reporting. Once ROCm v6.2 is available, Windows Radeon will follow the defaults above. You may enable concurrent model loads on Radeon on Windows, but ensure you don't load more models than will fit into your GPUs' VRAM.

## How does Ollama load models on multiple GPUs?

When loading a new model, Ollama evaluates the required VRAM for the model against what is currently available. If the model will entirely fit on any single GPU, Ollama will load the model on that GPU. This typically provides the best performance as it reduces the amount of data transferring across the PCI bus during inference. If the model does not fit entirely on one GPU, then it will be spread across all the available GPUs.

## How can I enable Flash Attention?

Flash Attention is a feature of most modern models that can significantly reduce memory usage as the context size grows. To enable Flash Attention, set the `OLLAMA_FLASH_ATTENTION` environment variable to `1` when starting the Ollama server.

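For example, following the same pattern used for the other environment variables in this FAQ:

```shell
# Enable Flash Attention for the server
OLLAMA_FLASH_ATTENTION=1 ollama serve
```
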
## How can I set the quantization type for the K/V cache?
@@ -324,9 +326,12 @@ The K/V context cache can be quantized to significantly reduce memory usage when

To use quantized K/V cache with Ollama you can set the following environment variable:

- `OLLAMA_KV_CACHE_TYPE` - The quantization type for the K/V cache. Default is `f16`.

> Note: Currently this is a global option - meaning all models will run with the specified quantization type.

<Note>
Currently this is a global option - meaning all models will run with the specified quantization type.
</Note>

The currently available K/V cache quantization types are:

@@ -334,19 +339,40 @@ The currently available K/V cache quantization types are:
- `q8_0` - 8-bit quantization, uses approximately 1/2 the memory of `f16` with a very small loss in precision; this usually has no noticeable impact on the model's quality (recommended if not using f16).
- `q4_0` - 4-bit quantization, uses approximately 1/4 the memory of `f16` with a small-medium loss in precision that may be more noticeable at higher context sizes.

How much the cache quantization impacts the model's response quality will depend on the model and the task. Models that have a high GQA count (e.g. Qwen2) may see a larger impact on precision from quantization than models with a low GQA count.

You may need to experiment with different quantization types to find the best balance between memory usage and quality.

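Putting the pieces together, a server started with Flash Attention and a quantized K/V cache might be launched like this; `q8_0` is chosen here per the recommendation above, and the combination with `OLLAMA_FLASH_ATTENTION` reflects the section's note that cache quantization applies when Flash Attention is enabled:

```shell
# Enable Flash Attention and use 8-bit K/V cache quantization for all models
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
```
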
## How can I stop Ollama from starting when I log in to my computer?

## Where can I find my Ollama Public Key?

Ollama for Windows and macOS registers as a login item during installation. You can disable this if you prefer not to have Ollama automatically start. Ollama will respect this setting across upgrades, unless you uninstall the application.

Your **Ollama Public Key** is the public part of the key pair that lets your local Ollama instance talk to [ollama.com](https://ollama.com).

**Windows**
- Remove `%APPDATA%\Microsoft\Windows\Start Menu\Programs\Startup\Ollama.lnk`

You'll need it to:
* Push models to Ollama
* Pull private models from Ollama to your machine
* Run models hosted in [Ollama Cloud](https://ollama.com/cloud)

**MacOS Monterey (v12)**
- Open `Settings` -> `Users & Groups` -> `Login Items` and find the `Ollama` entry, then click the `-` (minus) to remove

### How to Add the Key

**MacOS Ventura (v13) and later**
- Open `Settings` and search for "Login Items", find the `Ollama` entry under "Allow in the Background", then click the slider to disable.

* **Sign-in via the Settings page** in the **Mac** and **Windows App**

* **Sign‑in via CLI**

  ```shell
  ollama signin
  ```

* **Manually copy & paste** the key on the **Ollama Keys** page:
  [https://ollama.com/settings/keys](https://ollama.com/settings/keys)

### Where the Ollama Public Key lives

| OS | Path to `id_ed25519.pub` |
| :- | :- |
| macOS | `~/.ollama/id_ed25519.pub` |
| Linux | `/usr/share/ollama/.ollama/id_ed25519.pub` |
| Windows | `C:\Users\<username>\.ollama\id_ed25519.pub` |

<Note>
Replace `<username>` with your actual Windows user name.
</Note>

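To view the key itself, print the file from the table above; for example, on macOS or Linux:

```shell
# Print the public key so it can be pasted at https://ollama.com/settings/keys
cat ~/.ollama/id_ed25519.pub
```
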