From 83a0cb8d88561b4302baa8b6ea0623c426483e5d Mon Sep 17 00:00:00 2001
From: Michael Yang 
Date: Tue, 2 Jul 2024 14:52:18 -0700
Subject: [PATCH 01/12] docs

---
 docs/template.md | 173 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 173 insertions(+)
 create mode 100644 docs/template.md

diff --git a/docs/template.md b/docs/template.md
new file mode 100644
index 00000000..8f41e8fb
--- /dev/null
+++ b/docs/template.md
@@ -0,0 +1,173 @@
+# Template
+
+Ollama provides a powerful templating engine, backed by Go's built-in `text/template` package, to construct prompts for your large language model. This feature is a valuable tool for getting the most out of your models.
+
+## Basic Template Structure
+
+A basic Go template consists of three main parts:
+
+* **Layout**: The overall structure of the template.
+* **Variables**: Placeholders for dynamic data that will be replaced with actual values when the template is rendered.
+* **Functions**: Custom functions or logic that can be used to manipulate the template's content.
+
+Here's an example of a simple chat template:
+
+```gotmpl
+{{- range .Messages }}
+{{ .Role }}: {{ .Content }}
+{{- end }}
+```
+
+In this example, we have:
+
+* A basic messages structure (layout)
+* Three variables: `Messages`, `Role`, and `Content` (variables)
+* A custom function (action) that iterates over an array of items (`range .Messages`) and displays each item
+
+## Adding Templates to Your Model
+
+By default, models imported into Ollama have a default template of `{{ .Prompt }}`, i.e. user inputs are sent verbatim to the LLM. This is appropriate for text or code completion models but lacks essential markers for chat or instruction models.
+
+Omitting a template in these models puts the responsibility of correctly templating input onto the user. Adding a template allows users to easily get the best results from the model.
+
+To add a template to your model, you'll need to add a `TEMPLATE` command to the Modelfile. Here's an example using Meta's Llama 3.
+
+```dockerfile
+FROM llama3
+
+TEMPLATE """{{- if .System }}<|start_header_id|>system<|end_header_id|>
+
+{{ .System }}<|eot_id|>
+{{- end }}
+{{- range .Messages }}<|start_header_id|>{{ .Role }}<|end_header_id|>
+
+{{ .Content }}<|eot_id|>
+{{- end }}<|start_header_id|>assistant<|end_header_id|>
+
+"""
+```
+
+## Variables
+
+`System` (string): system prompt
+
+`Prompt` (string): user prompt
+
+`Response` (string): assistant response
+
+`Suffix` (string): text inserted after the assistant's response
+
+`Messages` (list): list of messages
+
+`Messages[].Role` (string): role which can be one of `system`, `user`, `assistant`, or `tool`
+
+`Messages[].Content` (string): message content
+
+`Messages[].ToolCalls` (list): list of tools the model wants to call
+
+`Messages[].ToolCalls[].Function` (object): function to call
+
+`Messages[].ToolCalls[].Function.Name` (string): function name
+
+`Messages[].ToolCalls[].Function.Arguments` (map): mapping of argument name to argument value
+
+`Tools` (list): list of tools the model can access
+
+`Tools[].Type` (string): schema type. `type` is always `function`
+
+`Tools[].Function` (object): function definition
+
+`Tools[].Function.Name` (string): function name
+
+`Tools[].Function.Description` (string): function description
+
+`Tools[].Function.Parameters` (object): function parameters
+
+`Tools[].Function.Parameters.Type` (string): schema type. `type` is always `object`
+
+`Tools[].Function.Parameters.Required` (list): list of required properties
+
+`Tools[].Function.Parameters.Properties` (map): mapping of property name to property definition
+
+`Tools[].Function.Parameters.Properties[].Type` (string): property type
+
+`Tools[].Function.Parameters.Properties[].Description` (string): property description
+
+`Tools[].Function.Parameters.Properties[].Enum` (list): list of valid values
+
+## Tips and Best Practices
+
+Keep the following tips and best practices in mind when working with Go templates:
+
+* **Be mindful of dot**: Control flow structures like `range` and `with` change the value of `.`
+* **Out-of-scope variables**: Use `$.` to reference variables not currently in scope, starting from the root
+* **Whitespace control**: Use `-` to trim leading (`{{-`) and trailing (`-}}`) whitespace
+
+## Examples
+
+### Example Messages
+
+#### ChatML
+
+ChatML is a popular template format. It can be used for models such as Databricks' DBRX, Intel's Neural Chat, and Microsoft's Orca 2.
+
+```gotmpl
+{{- if .System }}<|im_start|>system
+{{ .System }}<|im_end|>
+{{ end }}
+{{- range .Messages }}<|im_start|>{{ .Role }}
+{{ .Content }}<|im_end|>
+{{ end }}<|im_start|>assistant
+```
+
+### Example Tools
+
+Tool support can be added to a model by adding a `{{ .Tools }}` node to the template. This feature is useful for models trained to call external tools and can be a powerful way to retrieve real-time data or perform complex tasks.
+
+#### Mistral
+
+Mistral v0.3 and Mixtral 8x22B support tool calling.
+
+```gotmpl
+{{- range $index, $_ := .Messages }}
+{{- if eq .Role "user" }}
+{{- if and (le (len (slice $.Messages $index)) 2) $.Tools }}[AVAILABLE_TOOLS] {{ json $.Tools }}[/AVAILABLE_TOOLS]
+{{- end }}[INST] {{ if and (eq (len (slice $.Messages $index)) 1) $.System }}{{ $.System }}
+
+{{ end }}{{ .Content }}[/INST]
+{{- else if eq .Role "assistant" }}
+{{- if .Content }} {{ .Content }}
+{{- else if .ToolCalls }}[TOOL_CALLS] [
+{{- range .ToolCalls }}{"name": "{{ .Function.Name }}", "arguments": {{ json .Function.Arguments }}}
+{{- end }}]
+{{- end }}
+{{- else if eq .Role "tool" }}[TOOL_RESULTS] {"content": {{ .Content }}}[/TOOL_RESULTS]
+{{- end }}
+{{- end }}
+```
+
+### Example Fill-in-Middle
+
+Fill-in-middle support can be added to a model by adding a `{{ .Suffix }}` node to the template. This feature is useful for models that are trained to generate text in the middle of user input, such as code completion models.
+
+#### CodeLlama
+
+CodeLlama [7B](https://ollama.com/library/codellama:7b-code) and [13B](https://ollama.com/library/codellama:13b-code) code completion models support fill-in-middle.
+
+```gotmpl
+<PRE> {{ .Prompt }} <SUF>{{ .Suffix }} <MID>
+```
+
+> [!NOTE]
+> CodeLlama 34B and 70B code completion and all instruct and Python fine-tuned models do not support fill-in-middle.
+
+#### Codestral
+
+Codestral [22B](https://ollama.com/library/codestral:22b) supports fill-in-middle.
+
+```gotmpl
+[SUFFIX]{{ .Suffix }}[PREFIX] {{ .Prompt }}
+```
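+
+If you want to check what a template renders to, you can execute it directly with Go's standard `text/template` package, outside of Ollama. The following is a minimal, illustrative sketch (the `Message` struct is a hypothetical stand-in for the real message type) that renders the simple chat template from the beginning of this document:
+
+```go
+package main
+
+import (
+	"os"
+	"text/template"
+)
+
+// Message is a hypothetical stand-in for the Messages[] entries described above.
+type Message struct {
+	Role    string
+	Content string
+}
+
+func main() {
+	// The simple chat template from the "Basic Template Structure" section.
+	const chat = `{{- range .Messages }}
+{{ .Role }}: {{ .Content }}
+{{- end }}
+`
+	tmpl := template.Must(template.New("chat").Parse(chat))
+
+	// Render against sample data to see the prompt the model would receive.
+	data := map[string]any{
+		"Messages": []Message{
+			{Role: "system", Content: "You are a helpful assistant."},
+			{Role: "user", Content: "Why is the sky blue?"},
+		},
+	}
+	if err := tmpl.Execute(os.Stdout, data); err != nil {
+		panic(err)
+	}
+}
+```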

From 9b60a038e5169c4a69bc513ae6e7ea1816f9fc11 Mon Sep 17 00:00:00 2001
From: Michael Yang 
Date: Mon, 22 Jul 2024 13:34:56 -0700
Subject: [PATCH 02/12] update api.md

---
 README.md         |   3 +-
 docs/api.md       | 117 +++++++++++++++++++++++++++++++++++++++++++++-
 docs/modelfile.md |   3 +-
 3 files changed, 119 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index b96f4c16..02ab7051 100644
--- a/README.md
+++ b/README.md
@@ -64,7 +64,8 @@ Here are some example models that can be downloaded:
 | LLaVA              | 7B         | 4.5GB | `ollama run llava`             |
 | Solar              | 10.7B      | 6.1GB | `ollama run solar`             |
 
-> Note: You should have at least 8 GB of RAM available to run the 7B models, 16 GB to run the 13B models, and 32 GB to run the 33B models.
+> [!NOTE]
+> You should have at least 8 GB of RAM available to run the 7B models, 16 GB to run the 13B models, and 32 GB to run the 33B models.
 
 ## Customize a model
 
diff --git a/docs/api.md b/docs/api.md
index c577bb1a..bf4c8ce8 100644
--- a/docs/api.md
+++ b/docs/api.md
@@ -40,6 +40,7 @@ Generate a response for a given prompt with a provided model. This is a streamin
 
 - `model`: (required) the [model name](#model-names)
 - `prompt`: the prompt to generate a response for
+- `suffix`: the text after the model response
 - `images`: (optional) a list of base64-encoded images (for multimodal models such as `llava`)
 
 Advanced parameters (optional):
@@ -57,7 +58,8 @@ Advanced parameters (optional):
 
 Enable JSON mode by setting the `format` parameter to `json`. This will structure the response as a valid JSON object. See the JSON mode [example](#request-json-mode) below.
 
-> Note: it's important to instruct the model to use JSON in the `prompt`. Otherwise, the model may generate large amounts whitespace.
+> [!IMPORTANT]
+> It's important to instruct the model to use JSON in the `prompt`. Otherwise, the model may generate large amounts of whitespace.
 
 ### Examples
 
@@ -148,8 +150,44 @@ If `stream` is set to `false`, the response will be a single JSON object:
 }
 ```
 
+#### Request (with suffix)
+
+##### Request
+
+```shell
+curl http://localhost:11434/api/generate -d '{
+  "model": "codellama:code",
+  "prompt": "def compute_gcd(a, b):",
+  "suffix": "    return result",
+  "options": {
+    "temperature": 0
+  },
+  "stream": false
+}'
+```
+
+##### Response
+
+```json
+{
+  "model": "codellama:code",
+  "created_at": "2024-07-22T20:47:51.147561Z",
+  "response": "\n  if a == 0:\n    return b\n  else:\n    return compute_gcd(b % a, a)\n\ndef compute_lcm(a, b):\n  result = (a * b) / compute_gcd(a, b)\n",
+  "done": true,
+  "done_reason": "stop",
+  "context": [...],
+  "total_duration": 1162761250,
+  "load_duration": 6683708,
+  "prompt_eval_count": 17,
+  "prompt_eval_duration": 201222000,
+  "eval_count": 63,
+  "eval_duration": 953997000
+}
+```
+
 #### Request (JSON mode)
 
+> [!IMPORTANT]
 > When `format` is set to `json`, the output will always be a well-formed JSON object. It's important to also instruct the model to respond in JSON.
 
 ##### Request
@@ -383,9 +421,10 @@ Generate the next message in a chat with a provided model. This is a streaming e
 
 The `message` object has the following fields:
 
-- `role`: the role of the message, either `system`, `user` or `assistant`
+- `role`: the role of the message, either `system`, `user`, `assistant`, or `tool`
 - `content`: the content of the message
 - `images` (optional): a list of images to include in the message (for multimodal models such as `llava`)
+- `tool_calls` (optional): a list of tools the model wants to use
 
 Advanced parameters (optional):
 
@@ -393,6 +432,7 @@ Advanced parameters (optional):
 - `options`: additional model parameters listed in the documentation for the [Modelfile](./modelfile.md#valid-parameters-and-values) such as `temperature`
 - `stream`: if `false` the response will be returned as a single response object, rather than a stream of objects
 - `keep_alive`: controls how long the model will stay loaded into memory following the request (default: `5m`)
+- `tools`: external tools the model can use. Not all models support this feature.
 
 ### Examples
 
@@ -622,6 +662,79 @@ curl http://localhost:11434/api/chat -d '{
 }
 ```
 
+#### Chat request (with tools)
+
+##### Request
+
+```shell
+curl http://localhost:11434/api/chat -d '{
+  "model": "mistral",
+  "messages": [
+    {
+      "role": "user",
+      "content": "What is the weather today in Paris?"
+    }
+  ],
+  "stream": false,
+  "tools": [
+    {
+      "type": "function",
+      "function": {
+        "name": "get_current_weather",
+        "description": "Get the current weather for a location",
+        "parameters": {
+          "type": "object",
+          "properties": {
+            "location": {
+              "type": "string",
+              "description": "The location to get the weather for, e.g. San Francisco, CA"
+            },
+            "format": {
+              "type": "string",
+              "description": "The format to return the weather in, e.g. 'celsius' or 'fahrenheit'",
+              "enum": ["celsius", "fahrenheit"]
+            }
+          },
+          "required": ["location", "format"]
+        }
+      }
+    }
+  ]
+}'
+```
+
+##### Response
+
+```json
+{
+  "model": "mistral:7b-instruct-v0.3-q4_K_M",
+  "created_at": "2024-07-22T20:33:28.123648Z",
+  "message": {
+    "role": "assistant",
+    "content": "",
+    "tool_calls": [
+      {
+        "function": {
+          "name": "get_current_weather",
+          "arguments": {
+            "format": "celsius",
+            "location": "Paris, FR"
+          }
+        }
+      }
+    ]
+  },
+  "done_reason": "stop",
+  "done": true,
+  "total_duration": 885095291,
+  "load_duration": 3753500,
+  "prompt_eval_count": 122,
+  "prompt_eval_duration": 328493000,
+  "eval_count": 33,
+  "eval_duration": 552222000
+}
+```
+
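+To complete the tool loop, an application runs the requested function itself and sends the result back as a `tool` role message so the model can produce a final natural-language answer. The following is a minimal, illustrative sketch using only Go's standard library (the weather value is a placeholder, and the message shapes mirror the request and response shown above):
+
+```go
+package main
+
+import (
+	"bytes"
+	"encoding/json"
+	"fmt"
+	"net/http"
+)
+
+func main() {
+	// Replay the conversation above, appending the assistant's tool call and a
+	// `tool` message carrying the (placeholder) result of get_current_weather.
+	body := map[string]any{
+		"model":  "mistral",
+		"stream": false,
+		"messages": []map[string]any{
+			{"role": "user", "content": "What is the weather today in Paris?"},
+			{
+				"role":    "assistant",
+				"content": "",
+				"tool_calls": []map[string]any{
+					{"function": map[string]any{
+						"name":      "get_current_weather",
+						"arguments": map[string]any{"format": "celsius", "location": "Paris, FR"},
+					}},
+				},
+			},
+			{"role": "tool", "content": "22 degrees celsius"},
+		},
+	}
+
+	payload, err := json.Marshal(body)
+	if err != nil {
+		panic(err)
+	}
+
+	resp, err := http.Post("http://localhost:11434/api/chat", "application/json", bytes.NewReader(payload))
+	if err != nil {
+		panic(err)
+	}
+	defer resp.Body.Close()
+
+	// Print just the assistant's final message content.
+	var out struct {
+		Message struct {
+			Role    string `json:"role"`
+			Content string `json:"content"`
+		} `json:"message"`
+	}
+	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
+		panic(err)
+	}
+	fmt.Println(out.Message.Content)
+}
+```
+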
 ## Create a Model
 
 ```shell
diff --git a/docs/modelfile.md b/docs/modelfile.md
index 21ee1826..c3645b06 100644
--- a/docs/modelfile.md
+++ b/docs/modelfile.md
@@ -1,6 +1,7 @@
 # Ollama Model File
 
-> Note: `Modelfile` syntax is in development
+> [!NOTE]
+> `Modelfile` syntax is in development
 
 A model file is the blueprint to create and share models with Ollama.
 

From a6cd8f6169c029c92105962017562274bd90626b Mon Sep 17 00:00:00 2001
From: Ajay Chintala 
Date: Tue, 23 Jul 2024 11:40:23 -0700
Subject: [PATCH 03/12] Update README.md to add LLMStack integration (#5799)

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index b96f4c16..6a06b819 100644
--- a/README.md
+++ b/README.md
@@ -296,6 +296,7 @@ See the [API documentation](./docs/api.md) for all endpoints.
 - [Kerlig AI](https://www.kerlig.com/) (AI writing assistant for macOS)
 - [AI Studio](https://github.com/MindWorkAI/AI-Studio)
 - [Sidellama](https://github.com/gyopak/sidellama) (browser-based LLM client)
+- [LLMStack](https://github.com/trypromptly/LLMStack) (No-code multi-agent framework to build LLM agents and workflows)
 
 ### Terminal
 

From ac33aa7d3782887878e6e24fb4a6238356a489a6 Mon Sep 17 00:00:00 2001
From: royjhan <65097070+royjhan@users.noreply.github.com>
Date: Wed, 24 Jul 2024 11:15:46 -0700
Subject: [PATCH 04/12] Fix Embed Test Flakes (#5893)

* float cmp

* increase tolerance
---
 integration/embed_test.go | 59 +++++++++++++++++++++++++++++++++++----
 1 file changed, 54 insertions(+), 5 deletions(-)

diff --git a/integration/embed_test.go b/integration/embed_test.go
index aeafa57b..61b36fa2 100644
--- a/integration/embed_test.go
+++ b/integration/embed_test.go
@@ -4,12 +4,45 @@ package integration
 
 import (
 	"context"
+	"math"
 	"testing"
 	"time"
 
 	"github.com/ollama/ollama/api"
 )
 
+// floatsEqual32 reports whether two float32 values are equal within a small
+// absolute tolerance, so embedding checks don't flake on minor numeric drift.
+func floatsEqual32(a, b float32) bool {
+	return math.Abs(float64(a-b)) <= 1e-4
+}
+
+// floatsEqual64 is the float64 counterpart of floatsEqual32.
+func floatsEqual64(a, b float64) bool {
+	return math.Abs(a-b) <= 1e-4
+}
+
+func TestAllMiniLMEmbeddings(t *testing.T) {
+	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
+	defer cancel()
+
+	req := api.EmbeddingRequest{
+		Model:  "all-minilm",
+		Prompt: "why is the sky blue?",
+	}
+
+	res, err := embeddingTestHelper(ctx, t, req)
+
+	if err != nil {
+		t.Fatalf("error: %v", err)
+	}
+
+	if len(res.Embedding) != 384 {
+		t.Fatalf("expected 384 floats, got %d", len(res.Embedding))
+	}
+
+	if !floatsEqual64(res.Embedding[0], 0.06642947345972061) {
+		t.Fatalf("expected 0.06642947345972061, got %.16f", res.Embedding[0])
+	}
+}
+
 func TestAllMiniLMEmbed(t *testing.T) {
 	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
 	defer cancel()
@@ -33,8 +66,8 @@ func TestAllMiniLMEmbed(t *testing.T) {
 		t.Fatalf("expected 384 floats, got %d", len(res.Embeddings[0]))
 	}
 
-	if res.Embeddings[0][0] != 0.010071031 {
-		t.Fatalf("expected 0.010071031, got %f", res.Embeddings[0][0])
+	if !floatsEqual32(res.Embeddings[0][0], 0.010071031) {
+		t.Fatalf("expected 0.010071031, got %.8f", res.Embeddings[0][0])
 	}
 }
 
@@ -61,12 +94,12 @@ func TestAllMiniLMBatchEmbed(t *testing.T) {
 		t.Fatalf("expected 384 floats, got %d", len(res.Embeddings[0]))
 	}
 
-	if res.Embeddings[0][0] != 0.010071031 || res.Embeddings[1][0] != -0.009802706 {
-		t.Fatalf("expected 0.010071031 and -0.009802706, got %f and %f", res.Embeddings[0][0], res.Embeddings[1][0])
+	if !floatsEqual32(res.Embeddings[0][0], 0.010071031) || !floatsEqual32(res.Embeddings[1][0], -0.009802706) {
+		t.Fatalf("expected 0.010071031 and -0.009802706, got %.8f and %.8f", res.Embeddings[0][0], res.Embeddings[1][0])
 	}
 }
 
-func TestAllMiniLmEmbedTruncate(t *testing.T) {
+func TestAllMiniLMEmbedTruncate(t *testing.T) {
 	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
 	defer cancel()
 
@@ -135,6 +168,22 @@ func TestAllMiniLmEmbedTruncate(t *testing.T) {
 	}
 }
 
+func embeddingTestHelper(ctx context.Context, t *testing.T, req api.EmbeddingRequest) (*api.EmbeddingResponse, error) {
+	client, _, cleanup := InitServerConnection(ctx, t)
+	defer cleanup()
+	if err := PullIfMissing(ctx, client, req.Model); err != nil {
+		t.Fatalf("failed to pull model %s: %v", req.Model, err)
+	}
+
+	response, err := client.Embeddings(ctx, &req)
+
+	if err != nil {
+		return nil, err
+	}
+
+	return response, nil
+}
+
 func embedTestHelper(ctx context.Context, t *testing.T, req api.EmbedRequest) (*api.EmbedResponse, error) {
 	client, _, cleanup := InitServerConnection(ctx, t)
 	defer cleanup()

From bb46bbcf5e90e5efab5ff946a6c798131907ba2d Mon Sep 17 00:00:00 2001
From: Michael Yang 
Date: Wed, 24 Jul 2024 13:05:59 -0700
Subject: [PATCH 05/12] llm(llama): pass rope factors (#5924)

---
 llm/patches/0001-llama-3.1-rope-scaling.diff | 71 ++++++++++++++++++++
 1 file changed, 71 insertions(+)
 create mode 100644 llm/patches/0001-llama-3.1-rope-scaling.diff

diff --git a/llm/patches/0001-llama-3.1-rope-scaling.diff b/llm/patches/0001-llama-3.1-rope-scaling.diff
new file mode 100644
index 00000000..45dcb4f5
--- /dev/null
+++ b/llm/patches/0001-llama-3.1-rope-scaling.diff
@@ -0,0 +1,71 @@
+From 2f872f294fb6f5c6e8f983b68c40ea656053dd92 Mon Sep 17 00:00:00 2001
+From: Michael Yang 
+Date: Tue, 23 Jul 2024 14:33:29 -0700
+Subject: [PATCH] llama 3.1 rope scaling
+
+---
+ src/llama.cpp | 14 ++++++++++++--
+ 1 file changed, 12 insertions(+), 2 deletions(-)
+
+diff --git a/src/llama.cpp b/src/llama.cpp
+index 8fe51971..a9969df8 100644
+--- a/src/llama.cpp
++++ b/src/llama.cpp
+@@ -2472,6 +2472,7 @@ struct llama_layer {
+     // long rope factors
+     struct ggml_tensor * rope_long  = nullptr;
+     struct ggml_tensor * rope_short = nullptr;
++    struct ggml_tensor * rope_freqs = nullptr;
+ 
+     // bitnet scale
+     struct ggml_tensor * wq_scale;
+@@ -6143,6 +6144,8 @@ static bool llm_load_tensors(
+ 
+                         layer.ffn_norm = ml.create_tensor(ctx_layer, tn(LLM_TENSOR_FFN_NORM, "weight", i), {n_embd});
+ 
++                        layer.rope_freqs  = ml.create_tensor(ctx_layer, tn(LLM_TENSOR_ROPE_FREQS,  "weight"), { n_embd/n_head/2 }, llama_model_loader::TENSOR_NOT_REQUIRED | (i != 0 ? llama_model_loader::TENSOR_DUPLICATED : 0));
++
+                         if (n_expert == 0) {
+                             layer.ffn_gate = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_GATE, "weight", i), {n_embd,   n_ff});
+                             layer.ffn_down = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_DOWN, "weight", i), {  n_ff, n_embd});
+@@ -8620,6 +8623,10 @@ struct llm_build_context {
+         // choose long/short freq factors based on the context size
+         const auto n_ctx_pre_seq = cparams.n_ctx / cparams.n_seq_max;
+ 
++        if (model.layers[il].rope_freqs != nullptr) {
++            return model.layers[il].rope_freqs;
++        }
++
+         if (n_ctx_pre_seq > hparams.n_ctx_orig_yarn) {
+             return model.layers[il].rope_long;
+         }
+@@ -8814,6 +8821,9 @@ struct llm_build_context {
+ 
+             // self-attention
+             {
++                // rope freq factors for llama3; may return nullptr for llama2 and other models
++                struct ggml_tensor * rope_factors = build_rope_factors(il);
++
+                 // compute Q and K and RoPE them
+                 struct ggml_tensor * Qcur = llm_build_lora_mm(lctx, ctx0, model.layers[il].wq, cur);
+                 cb(Qcur, "Qcur", il);
+@@ -8837,14 +8847,14 @@ struct llm_build_context {
+                 }
+ 
+                 Qcur = ggml_rope_ext(
+-                    ctx0, ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens), inp_pos, nullptr,
++                    ctx0, ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens), inp_pos, rope_factors,
+                     n_rot, rope_type, n_ctx_orig, freq_base, freq_scale,
+                     ext_factor, attn_factor, beta_fast, beta_slow
+                 );
+                 cb(Qcur, "Qcur", il);
+ 
+                 Kcur = ggml_rope_ext(
+-                    ctx0, ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens), inp_pos, nullptr,
++                    ctx0, ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens), inp_pos, rope_factors,
+                     n_rot, rope_type, n_ctx_orig, freq_base, freq_scale,
+                     ext_factor, attn_factor, beta_fast, beta_slow
+                 );
+-- 
+2.45.2
+

From bbf8f102ee06bd6b149e4999571c0844aa47b12f Mon Sep 17 00:00:00 2001
From: Jeffrey Morgan 
Date: Thu, 25 Jul 2024 18:24:55 -0400
Subject: [PATCH 06/12] Revert "llm(llama): pass rope factors (#5924)" (#5963)

This reverts commit bb46bbcf5e90e5efab5ff946a6c798131907ba2d.
---
 llm/patches/0001-llama-3.1-rope-scaling.diff | 71 --------------------
 1 file changed, 71 deletions(-)
 delete mode 100644 llm/patches/0001-llama-3.1-rope-scaling.diff

diff --git a/llm/patches/0001-llama-3.1-rope-scaling.diff b/llm/patches/0001-llama-3.1-rope-scaling.diff
deleted file mode 100644
index 45dcb4f5..00000000
--- a/llm/patches/0001-llama-3.1-rope-scaling.diff
+++ /dev/null
@@ -1,71 +0,0 @@
-From 2f872f294fb6f5c6e8f983b68c40ea656053dd92 Mon Sep 17 00:00:00 2001
-From: Michael Yang 
-Date: Tue, 23 Jul 2024 14:33:29 -0700
-Subject: [PATCH] llama 3.1 rope scaling
-
----
- src/llama.cpp | 14 ++++++++++++--
- 1 file changed, 12 insertions(+), 2 deletions(-)
-
-diff --git a/src/llama.cpp b/src/llama.cpp
-index 8fe51971..a9969df8 100644
---- a/src/llama.cpp
-+++ b/src/llama.cpp
-@@ -2472,6 +2472,7 @@ struct llama_layer {
-     // long rope factors
-     struct ggml_tensor * rope_long  = nullptr;
-     struct ggml_tensor * rope_short = nullptr;
-+    struct ggml_tensor * rope_freqs = nullptr;
- 
-     // bitnet scale
-     struct ggml_tensor * wq_scale;
-@@ -6143,6 +6144,8 @@ static bool llm_load_tensors(
- 
-                         layer.ffn_norm = ml.create_tensor(ctx_layer, tn(LLM_TENSOR_FFN_NORM, "weight", i), {n_embd});
- 
-+                        layer.rope_freqs  = ml.create_tensor(ctx_layer, tn(LLM_TENSOR_ROPE_FREQS,  "weight"), { n_embd/n_head/2 }, llama_model_loader::TENSOR_NOT_REQUIRED | (i != 0 ? llama_model_loader::TENSOR_DUPLICATED : 0));
-+
-                         if (n_expert == 0) {
-                             layer.ffn_gate = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_GATE, "weight", i), {n_embd,   n_ff});
-                             layer.ffn_down = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_DOWN, "weight", i), {  n_ff, n_embd});
-@@ -8620,6 +8623,10 @@ struct llm_build_context {
-         // choose long/short freq factors based on the context size
-         const auto n_ctx_pre_seq = cparams.n_ctx / cparams.n_seq_max;
- 
-+        if (model.layers[il].rope_freqs != nullptr) {
-+            return model.layers[il].rope_freqs;
-+        }
-+
-         if (n_ctx_pre_seq > hparams.n_ctx_orig_yarn) {
-             return model.layers[il].rope_long;
-         }
-@@ -8814,6 +8821,9 @@ struct llm_build_context {
- 
-             // self-attention
-             {
-+                // rope freq factors for llama3; may return nullptr for llama2 and other models
-+                struct ggml_tensor * rope_factors = build_rope_factors(il);
-+
-                 // compute Q and K and RoPE them
-                 struct ggml_tensor * Qcur = llm_build_lora_mm(lctx, ctx0, model.layers[il].wq, cur);
-                 cb(Qcur, "Qcur", il);
-@@ -8837,14 +8847,14 @@ struct llm_build_context {
-                 }
- 
-                 Qcur = ggml_rope_ext(
--                    ctx0, ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens), inp_pos, nullptr,
-+                    ctx0, ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens), inp_pos, rope_factors,
-                     n_rot, rope_type, n_ctx_orig, freq_base, freq_scale,
-                     ext_factor, attn_factor, beta_fast, beta_slow
-                 );
-                 cb(Qcur, "Qcur", il);
- 
-                 Kcur = ggml_rope_ext(
--                    ctx0, ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens), inp_pos, nullptr,
-+                    ctx0, ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens), inp_pos, rope_factors,
-                     n_rot, rope_type, n_ctx_orig, freq_base, freq_scale,
-                     ext_factor, attn_factor, beta_fast, beta_slow
-                 );
--- 
-2.45.2
-

From 4de1370a9dcc88b79ddc2d4af2e8c954bdfa67a1 Mon Sep 17 00:00:00 2001
From: royjhan <65097070+royjhan@users.noreply.github.com>
Date: Thu, 25 Jul 2024 15:34:06 -0700
Subject: [PATCH 07/12] openai tools doc (#5617)

---
 docs/openai.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/openai.md b/docs/openai.md
index 248ba74a..e51d3194 100644
--- a/docs/openai.md
+++ b/docs/openai.md
@@ -79,7 +79,7 @@ curl http://localhost:11434/v1/chat/completions \
 - [x] JSON mode
 - [x] Reproducible outputs
 - [ ] Vision
-- [ ] Function calling
+- [x] Tools
 - [ ] Logprobs
 
 #### Supported request fields
@@ -97,9 +97,9 @@ curl http://localhost:11434/v1/chat/completions \
 - [x] `temperature`
 - [x] `top_p`
 - [x] `max_tokens`
-- [ ] `logit_bias`
-- [ ] `tools`
+- [x] `tools`
 - [ ] `tool_choice`
+- [ ] `logit_bias`
 - [ ] `user`
 - [ ] `n`
 

From 455e61170d12d2b29ac2dfe5fa6444ae40a9ef7f Mon Sep 17 00:00:00 2001
From: Jeffrey Morgan 
Date: Thu, 25 Jul 2024 18:34:47 -0400
Subject: [PATCH 08/12] Update openai.md

---
 docs/openai.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/openai.md b/docs/openai.md
index e51d3194..04d56bd6 100644
--- a/docs/openai.md
+++ b/docs/openai.md
@@ -78,8 +78,8 @@ curl http://localhost:11434/v1/chat/completions \
 - [x] Streaming
 - [x] JSON mode
 - [x] Reproducible outputs
-- [ ] Vision
 - [x] Tools
+- [ ] Vision
 - [ ] Logprobs
 
 #### Supported request fields

From c8af3c2d969a99618eecf169bd75aa112573ac27 Mon Sep 17 00:00:00 2001
From: Blake Mizerany 
Date: Thu, 25 Jul 2024 15:58:30 -0700
Subject: [PATCH 09/12] server: reuse original download URL for images (#5962)

This changes the registry client to reuse the original download URL
it gets on the first redirect response for all subsequent requests,
preventing thundering herd issues when hot new LLMs are released.
---
 server/download.go | 75 +++++++++++++++++++++++++++++++++++++++++++++-
 server/images.go   |  6 +++-
 2 files changed, 79 insertions(+), 2 deletions(-)

diff --git a/server/download.go b/server/download.go
index d93cd3b4..8b5b577f 100644
--- a/server/download.go
+++ b/server/download.go
@@ -8,6 +8,7 @@ import (
 	"io"
 	"log/slog"
 	"math"
+	"math/rand/v2"
 	"net/http"
 	"net/url"
 	"os"
@@ -141,6 +142,32 @@ func (b *blobDownload) Run(ctx context.Context, requestURL *url.URL, opts *regis
 	b.err = b.run(ctx, requestURL, opts)
 }
 
+func newBackoff(maxBackoff time.Duration) func(ctx context.Context) error {
+	var n int
+	return func(ctx context.Context) error {
+		if ctx.Err() != nil {
+			return ctx.Err()
+		}
+
+		n++
+
+		// n^2 backoff timer is a little smoother than the
+		// common choice of 2^n.
+		d := min(time.Duration(n*n)*10*time.Millisecond, maxBackoff)
+		// Randomize the delay between 0.5-1.5 x msec, in order
+		// to prevent accidental "thundering herd" problems.
+		d = time.Duration(float64(d) * (rand.Float64() + 0.5))
+		t := time.NewTimer(d)
+		defer t.Stop()
+		select {
+		case <-ctx.Done():
+			return ctx.Err()
+		case <-t.C:
+			return nil
+		}
+	}
+}
+
 func (b *blobDownload) run(ctx context.Context, requestURL *url.URL, opts *registryOptions) error {
 	defer blobDownloadManager.Delete(b.Digest)
 	ctx, b.CancelFunc = context.WithCancel(ctx)
@@ -153,6 +180,52 @@ func (b *blobDownload) run(ctx context.Context, requestURL *url.URL, opts *regis
 
 	_ = file.Truncate(b.Total)
 
+	directURL, err := func() (*url.URL, error) {
+		ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
+		defer cancel()
+
+		backoff := newBackoff(10 * time.Second)
+		for {
+			// shallow clone opts to be used in the closure
+			// without affecting the outer opts.
+			newOpts := new(registryOptions)
+			*newOpts = *opts
+
+			newOpts.CheckRedirect = func(req *http.Request, via []*http.Request) error {
+				if len(via) > 10 {
+					return errors.New("maximum redirects exceeded (10) for directURL")
+				}
+
+				// if the hostname is the same, allow the redirect
+				if req.URL.Hostname() == requestURL.Hostname() {
+					return nil
+				}
+
+				// stop at the first redirect that is not
+				// the same hostname as the original
+				// request.
+				return http.ErrUseLastResponse
+			}
+
+			resp, err := makeRequestWithRetry(ctx, http.MethodGet, requestURL, nil, nil, newOpts)
+			if err != nil {
+				slog.Warn("failed to get direct URL; backing off and retrying", "err", err)
+				if err := backoff(ctx); err != nil {
+					return nil, err
+				}
+				continue
+			}
+			defer resp.Body.Close()
+			if resp.StatusCode != http.StatusTemporaryRedirect {
+				return nil, fmt.Errorf("unexpected status code %d", resp.StatusCode)
+			}
+			return resp.Location()
+		}
+	}()
+	if err != nil {
+		return err
+	}
+
 	g, inner := errgroup.WithContext(ctx)
 	g.SetLimit(numDownloadParts)
 	for i := range b.Parts {
@@ -165,7 +238,7 @@ func (b *blobDownload) run(ctx context.Context, requestURL *url.URL, opts *regis
 			var err error
 			for try := 0; try < maxRetries; try++ {
 				w := io.NewOffsetWriter(file, part.StartsAt())
-				err = b.downloadChunk(inner, requestURL, w, part, opts)
+				err = b.downloadChunk(inner, directURL, w, part, opts)
 				switch {
 				case errors.Is(err, context.Canceled), errors.Is(err, syscall.ENOSPC):
 					// return immediately if the context is canceled or the device is out of space
diff --git a/server/images.go b/server/images.go
index 574dec19..836dbcc2 100644
--- a/server/images.go
+++ b/server/images.go
@@ -54,6 +54,8 @@ type registryOptions struct {
 	Username string
 	Password string
 	Token    string
+
+	CheckRedirect func(req *http.Request, via []*http.Request) error
 }
 
 type Model struct {
@@ -1131,7 +1133,9 @@ func makeRequest(ctx context.Context, method string, requestURL *url.URL, header
 		req.ContentLength = contentLength
 	}
 
-	resp, err := http.DefaultClient.Do(req)
+	resp, err := (&http.Client{
+		CheckRedirect: regOpts.CheckRedirect,
+	}).Do(req)
 	if err != nil {
 		return nil, err
 	}

From 997c903884b08aef53d0f92634f74bdb64f05c0a Mon Sep 17 00:00:00 2001
From: Michael Yang 
Date: Thu, 25 Jul 2024 16:23:40 -0700
Subject: [PATCH 10/12] Update docs/template.md

Co-authored-by: Jeffrey Morgan 
---
 docs/template.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/template.md b/docs/template.md
index 8f41e8fb..f6ce06ba 100644
--- a/docs/template.md
+++ b/docs/template.md
@@ -24,7 +24,7 @@ In this example, we have:
 * Three variables: `Messages`, `Role`, and `Content` (variables)
 * A custom function (action) that iterates over an array of items (`range .Messages`) and displays each item
 
-## Adding Templates to Your Model
+## Adding templates to your model
 
 By default, models imported into Ollama have a default template of `{{ .Prompt }}`, i.e. user inputs are sent verbatim to the LLM. This is appropriate for text or code completion models but lacks essential markers for chat or instruction models.
 

From ae27d9dcfd32b7fbaa0d5a1fb0126106873332bf Mon Sep 17 00:00:00 2001
From: Jeffrey Morgan 
Date: Thu, 25 Jul 2024 20:27:33 -0400
Subject: [PATCH 11/12] Update openai.md

---
 docs/openai.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/openai.md b/docs/openai.md
index 04d56bd6..fee30f71 100644
--- a/docs/openai.md
+++ b/docs/openai.md
@@ -78,7 +78,7 @@ curl http://localhost:11434/v1/chat/completions \
 - [x] Streaming
 - [x] JSON mode
 - [x] Reproducible outputs
-- [x] Tools
+- [x] Tools (streaming support coming soon)
 - [ ] Vision
 - [ ] Logprobs
 

From f5e3939220e9cd3d7a636708bc9df031ebfd4854 Mon Sep 17 00:00:00 2001
From: Jeffrey Morgan 
Date: Thu, 25 Jul 2024 23:10:18 -0400
Subject: [PATCH 12/12] Update api.md (#5968)

---
 docs/api.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/api.md b/docs/api.md
index 0ab70383..2d4fe28f 100644
--- a/docs/api.md
+++ b/docs/api.md
@@ -418,6 +418,7 @@ Generate the next message in a chat with a provided model. This is a streaming e
 
 - `model`: (required) the [model name](#model-names)
 - `messages`: the messages of the chat, this can be used to keep a chat memory
+- `tools`: tools for the model to use if supported. Requires `stream` to be set to `false`
 
 The `message` object has the following fields:
 
@@ -432,7 +433,6 @@ Advanced parameters (optional):
 - `options`: additional model parameters listed in the documentation for the [Modelfile](./modelfile.md#valid-parameters-and-values) such as `temperature`
 - `stream`: if `false` the response will be returned as a single response object, rather than a stream of objects
 - `keep_alive`: controls how long the model will stay loaded into memory following the request (default: `5m`)
-- `tools`: external tools the model can use. Not all models support this feature.
 
 ### Examples
 
@@ -1286,4 +1286,4 @@ curl http://localhost:11434/api/embeddings -d '{
     0.8785552978515625, -0.34576427936553955, 0.5742510557174683, -0.04222835972905159, -0.137906014919281
   ]
 }
-```
\ No newline at end of file
+```