
Commit 226c1eb

[AI Gallery] Add quantized LLMs with Ollama (skypilot-org#3422)
* WIP
* arm64 support
* wip
* wip
* add ollama to ai gallery
* minor edits
* minor edits
* Updates
* comments
* Add 'new' tag
1 parent d0f20ab commit 226c1eb

File tree

7 files changed: +391 -3 lines

README.md (+3 -1)

```diff
@@ -27,6 +27,7 @@
 ----
 :fire: *News* :fire:
+- [Apr, 2024] Using [**Ollama**](https://github.com/ollama/ollama) to deploy quantized LLMs on CPUs and GPUs: [**example**](./llm/ollama/)
 - [Mar, 2024] Serve and deploy [**Databricks DBRX**](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm) on your infra: [**example**](./llm/dbrx/)
 - [Feb, 2024] Deploying and scaling [**Gemma**](https://blog.google/technology/developers/gemma-open-models/) with SkyServe: [**example**](./llm/gemma/)
 - [Feb, 2024] Speed up your LLM deployments with [**SGLang**](https://github.com/sgl-project/sglang) for 5x throughput on SkyServe: [**example**](./llm/sglang/)
@@ -159,14 +160,15 @@ Runnable examples:
 - [Vicuna chatbots: Training & Serving](./llm/vicuna/) (from official Vicuna team)
 - [Train your own Vicuna on Llama-2](./llm/vicuna-llama-2/)
 - [Self-Hosted Llama-2 Chatbot](./llm/llama-2/)
+- [Ollama: Quantized LLMs on CPUs](./llm/ollama/)
 - [LoRAX](./llm/lorax/)
 - [QLoRA](https://github.com/artidoro/qlora/pull/132)
 - [LLaMA-LoRA-Tuner](https://github.com/zetavg/LLaMA-LoRA-Tuner#run-on-a-cloud-service-via-skypilot)
 - [Tabby: Self-hosted AI coding assistant](https://github.com/TabbyML/tabby/blob/bed723fcedb44a6b867ce22a7b1f03d2f3531c1e/experimental/eval/skypilot.yaml)
 - [LocalGPT](./llm/localgpt)
 - [Falcon](./llm/falcon)
 - Add yours here & see more in [`llm/`](./llm)!
-- Framework examples: [PyTorch DDP](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_torch.yaml), [DeepSpeed](./examples/deepspeed-multinode/sky.yaml), [JAX/Flax on TPU](https://github.com/skypilot-org/skypilot/blob/master/examples/tpu/tpuvm_mnist.yaml), [Stable Diffusion](https://github.com/skypilot-org/skypilot/tree/master/examples/stable_diffusion), [Detectron2](https://github.com/skypilot-org/skypilot/blob/master/examples/detectron2_docker.yaml), [Distributed](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_tf_app.py) [TensorFlow](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_app_storage.yaml), [Ray Train](examples/distributed_ray_train/ray_train.yaml), [NeMo](https://github.com/skypilot-org/skypilot/blob/master/examples/nemo/nemo.yaml), [programmatic grid search](https://github.com/skypilot-org/skypilot/blob/master/examples/huggingface_glue_imdb_grid_search_app.py), [Docker](https://github.com/skypilot-org/skypilot/blob/master/examples/docker/echo_app.yaml), [Cog](https://github.com/skypilot-org/skypilot/blob/master/examples/cog/), [Unsloth](https://github.com/skypilot-org/skypilot/blob/master/examples/unsloth/unsloth.yaml), and [many more (`examples/`)](./examples).
+- Framework examples: [PyTorch DDP](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_torch.yaml), [DeepSpeed](./examples/deepspeed-multinode/sky.yaml), [JAX/Flax on TPU](https://github.com/skypilot-org/skypilot/blob/master/examples/tpu/tpuvm_mnist.yaml), [Stable Diffusion](https://github.com/skypilot-org/skypilot/tree/master/examples/stable_diffusion), [Detectron2](https://github.com/skypilot-org/skypilot/blob/master/examples/detectron2_docker.yaml), [Distributed](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_tf_app.py) [TensorFlow](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_app_storage.yaml), [Ray Train](examples/distributed_ray_train/ray_train.yaml), [NeMo](https://github.com/skypilot-org/skypilot/blob/master/examples/nemo/nemo.yaml), [programmatic grid search](https://github.com/skypilot-org/skypilot/blob/master/examples/huggingface_glue_imdb_grid_search_app.py), [Docker](https://github.com/skypilot-org/skypilot/blob/master/examples/docker/echo_app.yaml), [Cog](https://github.com/skypilot-org/skypilot/blob/master/examples/cog/), [Unsloth](https://github.com/skypilot-org/skypilot/blob/master/examples/unsloth/unsloth.yaml), [Ollama](https://github.com/skypilot-org/skypilot/blob/master/llm/ollama) and [many more (`examples/`)](./examples).

 Follow updates:
 - [Twitter](https://twitter.com/skypilot_org)
```
New file (+1)

```diff
@@ -0,0 +1 @@
+../../../../llm/ollama/README.md
```

docs/source/_gallery_original/index.rst (+2 -1)

```diff
@@ -19,10 +19,11 @@ Contents

 .. toctree::
    :maxdepth: 1
-   :caption: Inference Engines
+   :caption: Inference Frameworks

    vLLM <frameworks/vllm>
    Hugging Face TGI <frameworks/tgi>
+   Ollama <frameworks/ollama>
    SGLang <frameworks/sglang>
    LoRAX <frameworks/lorax>
```

docs/source/_static/custom.js (+1)

```diff
@@ -28,6 +28,7 @@ document.addEventListener('DOMContentLoaded', () => {
     { selector: '.caption-text', text: 'SkyServe: Model Serving' },
     { selector: '.toctree-l1 > a', text: 'Running on Kubernetes' },
     { selector: '.toctree-l1 > a', text: 'DBRX (Databricks)' },
+    { selector: '.toctree-l1 > a', text: 'Ollama' },
   ];
   newItems.forEach(({ selector, text }) => {
     document.querySelectorAll(selector).forEach((el) => {
```

docs/source/docs/index.rst (+2 -1)

```diff
@@ -78,6 +78,7 @@ Runnable examples:
 * `Vicuna chatbots: Training & Serving <https://github.com/skypilot-org/skypilot/tree/master/llm/vicuna>`_ (from official Vicuna team)
 * `Train your own Vicuna on Llama-2 <https://github.com/skypilot-org/skypilot/blob/master/llm/vicuna-llama-2>`_
 * `Self-Hosted Llama-2 Chatbot <https://github.com/skypilot-org/skypilot/tree/master/llm/llama-2>`_
+* `Ollama: Quantized LLMs on CPUs <https://github.com/skypilot-org/skypilot/tree/master/llm/ollama>`_
 * `LoRAX <https://github.com/skypilot-org/skypilot/tree/master/llm/lorax/>`_
 * `QLoRA <https://github.com/artidoro/qlora/pull/132>`_
 * `LLaMA-LoRA-Tuner <https://github.com/zetavg/LLaMA-LoRA-Tuner#run-on-a-cloud-service-via-skypilot>`_
@@ -86,7 +87,7 @@ Runnable examples:
 * `Falcon <https://github.com/skypilot-org/skypilot/tree/master/llm/falcon>`_
 * Add yours here & see more in `llm/ <https://github.com/skypilot-org/skypilot/tree/master/llm>`_!

-* Framework examples: `PyTorch DDP <https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_torch.yaml>`_, `DeepSpeed <https://github.com/skypilot-org/skypilot/blob/master/examples/deepspeed-multinode/sky.yaml>`_, `JAX/Flax on TPU <https://github.com/skypilot-org/skypilot/blob/master/examples/tpu/tpuvm_mnist.yaml>`_, `Stable Diffusion <https://github.com/skypilot-org/skypilot/tree/master/examples/stable_diffusion>`_, `Detectron2 <https://github.com/skypilot-org/skypilot/blob/master/examples/detectron2_docker.yaml>`_, `Distributed <https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_tf_app.py>`_ `TensorFlow <https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_app_storage.yaml>`_, `NeMo <https://github.com/skypilot-org/skypilot/blob/master/examples/nemo/nemo.yaml>`_, `programmatic grid search <https://github.com/skypilot-org/skypilot/blob/master/examples/huggingface_glue_imdb_grid_search_app.py>`_, `Docker <https://github.com/skypilot-org/skypilot/blob/master/examples/docker/echo_app.yaml>`_, `Cog <https://github.com/skypilot-org/skypilot/blob/master/examples/cog/>`_, `Unsloth <https://github.com/skypilot-org/skypilot/blob/master/examples/unsloth/unsloth.yaml>`_, and `many more <https://github.com/skypilot-org/skypilot/tree/master/examples>`_.
+* Framework examples: `PyTorch DDP <https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_torch.yaml>`_, `DeepSpeed <https://github.com/skypilot-org/skypilot/blob/master/examples/deepspeed-multinode/sky.yaml>`_, `JAX/Flax on TPU <https://github.com/skypilot-org/skypilot/blob/master/examples/tpu/tpuvm_mnist.yaml>`_, `Stable Diffusion <https://github.com/skypilot-org/skypilot/tree/master/examples/stable_diffusion>`_, `Detectron2 <https://github.com/skypilot-org/skypilot/blob/master/examples/detectron2_docker.yaml>`_, `Distributed <https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_tf_app.py>`_ `TensorFlow <https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_app_storage.yaml>`_, `NeMo <https://github.com/skypilot-org/skypilot/blob/master/examples/nemo/nemo.yaml>`_, `programmatic grid search <https://github.com/skypilot-org/skypilot/blob/master/examples/huggingface_glue_imdb_grid_search_app.py>`_, `Docker <https://github.com/skypilot-org/skypilot/blob/master/examples/docker/echo_app.yaml>`_, `Cog <https://github.com/skypilot-org/skypilot/blob/master/examples/cog/>`_, `Unsloth <https://github.com/skypilot-org/skypilot/blob/master/examples/unsloth/unsloth.yaml>`_, `Ollama <https://github.com/skypilot-org/skypilot/blob/master/llm/ollama>`_ and `many more <https://github.com/skypilot-org/skypilot/tree/master/examples>`_.

 Follow updates:
```

llm/ollama/README.md (+299, new file)

# Ollama: Run quantized LLMs on CPUs and GPUs

<p align="center">
<img src="https://i.imgur.com/HfqnGVA.png" width="400">
</p>

[Ollama](https://github.com/ollama/ollama) is a popular library for running LLMs on both CPUs and GPUs.
It supports a wide range of models, including quantized versions of `llama2`, `llama2:70b`, `mistral`, `phi`, `gemma:7b` and many [more](https://ollama.com/library).
You can use SkyPilot to run these models on CPU instances on any cloud provider, Kubernetes cluster, or even on your local machine.
And if your instance has GPUs, Ollama will automatically use them for faster inference.

In this example, you will run a quantized version of Llama2 on 4 CPUs with 8GB of memory, and then scale it up to more replicas with SkyServe.

## Prerequisites
To get started, install the latest version of SkyPilot:

```bash
pip install "skypilot-nightly[all]"
```

For detailed installation instructions, please refer to the [installation guide](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html).

Once installed, run `sky check` to verify you have cloud access.
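For example:

```console
sky check
```

The output lists the clouds (and Kubernetes, if configured) that you currently have credentials for.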
### [Optional] Running locally on your machine
If you do not have cloud access, you can also run this recipe on your local machine by creating a local Kubernetes cluster with `sky local up`.

Make sure you have KinD installed and Docker running with 5 or more CPUs and 10GB or more of memory allocated to the [Docker runtime](https://kind.sigs.k8s.io/docs/user/quick-start/#settings-for-docker-desktop).
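If you are unsure how much CPU and memory is allocated to Docker, one quick way to check (assuming a standard Docker setup; the exact output format may differ across versions) is:

```console
docker info | grep -E "CPUs|Total Memory"  # expect at least 5 CPUs and 10GiB of memory
```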
To create a local Kubernetes cluster, run:

```console
sky local up
```

<details>
<summary>Example outputs:</summary>

```console
$ sky local up
Creating local cluster...
To view detailed progress: tail -n100 -f ~/sky_logs/sky-2024-04-09-19-14-03-599730/local_up.log
I 04-09 19:14:33 log_utils.py:79] Kubernetes is running.
I 04-09 19:15:33 log_utils.py:117] SkyPilot CPU image pulled.
I 04-09 19:15:49 log_utils.py:123] Nginx Ingress Controller installed.
⠸ Running sky check...
Local Kubernetes cluster created successfully with 16 CPUs.
`sky launch` can now run tasks locally.
Hint: To change the number of CPUs, change your docker runtime settings. See https://kind.sigs.k8s.io/docs/user/quick-start/#settings-for-docker-desktop for more info.
```
</details>

After running this, `sky check` should show that you have access to a Kubernetes cluster.

## SkyPilot YAML
To run Ollama with SkyPilot, create a YAML file with the following content:

<details>
<summary>Click to see the full recipe YAML</summary>

```yaml
envs:
  MODEL_NAME: llama2 # mistral, phi, other ollama supported models
  OLLAMA_HOST: 0.0.0.0:8888 # Host and port for Ollama to listen on

resources:
  cpus: 4+
  memory: 8+ # 8 GB+ for 7B models, 16 GB+ for 13B models, 32 GB+ for 33B models
  # accelerators: L4:1 # No GPUs necessary for Ollama, but you can use them to run inference faster
  ports: 8888

service:
  replicas: 2
  # An actual request for readiness probe.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_tokens: 1

setup: |
  # Install Ollama
  if [ "$(uname -m)" == "aarch64" ]; then
    # For apple silicon support
    sudo curl -L https://ollama.com/download/ollama-linux-arm64 -o /usr/bin/ollama
  else
    sudo curl -L https://ollama.com/download/ollama-linux-amd64 -o /usr/bin/ollama
  fi
  sudo chmod +x /usr/bin/ollama

  # Start `ollama serve` and capture PID to kill it after pull is done
  ollama serve &
  OLLAMA_PID=$!

  # Wait for ollama to be ready
  IS_READY=false
  for i in {1..20}; do
    ollama list && IS_READY=true && break
    sleep 5
  done
  if [ "$IS_READY" = false ]; then
    echo "Ollama was not ready after 100 seconds. Exiting."
    exit 1
  fi

  # Pull the model
  ollama pull $MODEL_NAME
  echo "Model $MODEL_NAME pulled successfully."

  # Kill `ollama serve` after pull is done
  kill $OLLAMA_PID

run: |
  # Run `ollama serve` in the foreground
  echo "Serving model $MODEL_NAME"
  ollama serve
```
</details>

You can also get the full YAML [here](https://github.com/skypilot-org/skypilot/tree/master/llm/ollama/ollama.yaml).
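If you prefer not to copy the YAML by hand, you can also download it directly (assuming the `llm/ollama/ollama.yaml` path in the repository linked above):

```console
# URL assumed from the repository path above
wget https://raw.githubusercontent.com/skypilot-org/skypilot/master/llm/ollama/ollama.yaml
```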
## Serving Llama2 with a CPU instance
Start serving Llama2 on a 4 CPU instance with the following command:

```console
sky launch ollama.yaml -c ollama --detach-run
```

Wait until the `sky launch` command returns successfully.

<details>
<summary>Example outputs:</summary>

```console
...
== Optimizer ==
Target: minimizing cost
Estimated cost: $0.0 / hour

Considered resources (1 node):
-------------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE            vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN
-------------------------------------------------------------------------------------------------------
 Kubernetes   4CPU--8GB           4       8         -              kubernetes      0.00          ✔
 AWS          c6i.xlarge          4       8         -              us-east-1       0.17
 Azure        Standard_F4s_v2     4       8         -              eastus          0.17
 GCP          n2-standard-4       4       16        -              us-central1-a   0.19
 Fluidstack   rec3pUyh6pNkIjCaL   6       24        RTXA4000:1     norway_4_eu     0.64
-------------------------------------------------------------------------------------------------------
...
```

</details>
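Since the run command is detached, the Ollama server keeps running in the background after `sky launch` returns. You can stream its output at any time with SkyPilot's log streaming:

```console
sky logs ollama
```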
**💡Tip:** You can further reduce costs by using the `--use-spot` flag to run on spot instances.
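For example, the same launch on a spot instance (assuming your cloud offers spot capacity for this instance type) would be:

```console
sky launch ollama.yaml -c ollama --detach-run --use-spot
```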
To launch a different model, use the `MODEL_NAME` environment variable:

```console
sky launch ollama.yaml -c ollama --detach-run --env MODEL_NAME=mistral
```

Ollama supports `llama2`, `llama2:70b`, `mistral`, `phi`, `gemma:7b` and many more models.
See the full list [here](https://ollama.com/library).

Once the `sky launch` command returns successfully, you can interact with the model via
- Standard OpenAI-compatible endpoints (e.g., `/v1/chat/completions`)
- [Ollama API](https://github.com/ollama/ollama/blob/main/docs/api.md)

To curl `/v1/chat/completions`:
```console
ENDPOINT=$(sky status --endpoint 8888 ollama)
curl $ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Who are you?"
      }
    ]
  }'
```

<details>
<summary>Example curl response:</summary>

```json
{
  "id": "chatcmpl-322",
  "object": "chat.completion",
  "created": 1712015174,
  "model": "llama2",
  "system_fingerprint": "fp_ollama",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello there! *adjusts glasses* I am Assistant, your friendly and helpful AI companion. My purpose is to assist you in any way possible, from answering questions to providing information on a wide range of topics. Is there something specific you would like to know or discuss? Feel free to ask me anything!"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 29,
    "completion_tokens": 68,
    "total_tokens": 97
  }
}
```
</details>
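The Ollama-native API linked above is served on the same port. As a minimal sketch (see the Ollama API docs for the full request schema), a non-streaming generation request looks like:

```console
# Native Ollama endpoint; request fields per the Ollama API docs linked above
ENDPOINT=$(sky status --endpoint 8888 ollama)
curl $ENDPOINT/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "prompt": "Why is the sky blue?",
    "stream": false
  }'
```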
**💡Tip:** To speed up inference, you can use GPUs by specifying the `accelerators` field in the YAML.
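For example, uncomment the `accelerators: L4:1` line in the YAML, or override it at launch time (assuming L4 GPUs are available on your clouds):

```console
sky launch ollama.yaml -c ollama --detach-run --gpus L4:1
```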
To stop the instance:
```console
sky stop ollama
```

To shut down all resources:
```console
sky down ollama
```

If you are using a local Kubernetes cluster created with `sky local up`, shut it down with:
```console
sky local down
```

## Serving LLMs on CPUs at scale with SkyServe

After experimenting with the model, you can deploy multiple replicas of the model with autoscaling and load-balancing using SkyServe.

With no change to the YAML, launch a fully managed service on your infra:
```console
sky serve up ollama.yaml -n ollama
```

Wait until the service is ready:
```console
watch -n10 sky serve status ollama
```

<details>
<summary>Example outputs:</summary>

```console
Services
NAME    VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
ollama  1        3m 15s  READY   2/2       34.171.202.102:30001

Service Replicas
SERVICE_NAME  ID  VERSION  IP              LAUNCHED    RESOURCES       STATUS  REGION
ollama        1   1        34.69.185.170   4 mins ago  1x GCP(vCPU=4)  READY   us-central1
ollama        2   1        35.184.144.198  4 mins ago  1x GCP(vCPU=4)  READY   us-central1
```
</details>


Get a single endpoint that load-balances across replicas:
```console
ENDPOINT=$(sky serve status --endpoint ollama)
```

**💡Tip:** SkyServe fully manages the lifecycle of your replicas. For example, if a spot replica is preempted, the controller will automatically replace it. This significantly reduces the operational burden while saving costs.

To curl the endpoint:
```console
curl -L $ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Who are you?"
      }
    ]
  }'
```

To shut down all resources:
```console
sky serve down ollama
```

See more details in [SkyServe docs](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html).
