With the rise of AI PC capabilities, hosting your own Visual Studio Code assistant is within reach. In this demo, we showcase how to deploy local LLM serving with OVMS and integrate it with the Continue extension, using iGPU or NPU acceleration.
- Windows (for standalone app) or Linux (using Docker)
- Python installed (for model preparation only)
- Intel Meteor Lake, Lunar Lake, Arrow Lake or newer Intel CPU.
We need to use a medium-size model to keep latency around 50 ms/word so that the chat feels responsive. Chat works in streaming mode, meaning the response or code diff rolls out gradually in real time.
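For reference, once the model server from the later steps is running, a streaming chat request against its OpenAI-compatible endpoint looks roughly like this (a sketch; the port and model name match the ones used in this demo, and the prompt is arbitrary):

curl -s http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d '{"model": "codellama/CodeLlama-7b-Instruct-hf", "stream": true, "max_tokens": 100, "messages": [{"role": "user", "content": "Write a function that reverses a string"}]}'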
Download the export script, install its dependencies and create a directory for the models:
curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/1/demos/common/export_models/export_model.py -o export_model.py
pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/1/demos/common/export_models/requirements.txt
mkdir models
Note: Users in China need to set the environment variable HF_ENDPOINT="https://hf-mirror.com" before running the export script to connect to the HF Hub.
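For example, before running the export script, on Windows (Command Prompt):

set HF_ENDPOINT=https://hf-mirror.com

or on Linux:

export HF_ENDPOINT=https://hf-mirror.com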
Export codellama/CodeLlama-7b-Instruct-hf:
python export_model.py text_generation --source_model codellama/CodeLlama-7b-Instruct-hf --weight-format int4 --config_file_path models/config_all.json --model_repository_path models --target_device NPU --overwrite_models
Note: Use --target_device GPU for Intel GPU, or omit this parameter to run on Intel CPU.
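For example, the same export targeting the integrated GPU would look like this (only the device parameter changes):

python export_model.py text_generation --source_model codellama/CodeLlama-7b-Instruct-hf --weight-format int4 --config_file_path models/config_all.json --model_repository_path models --target_device GPU --overwrite_models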
For code completion we need a smaller, lighter model that produces code faster than the chat task, since we do not want to wait for the code to appear. It should be responsive enough to generate multi-line blocks of code ahead of time as we type. Code completion works in non-streaming, unary mode. Do not use an instruct model; there is no chat involved in the process.
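As a quick sanity check once the server from the later steps is running, a unary (non-streaming) request to the OpenAI-compatible completions endpoint can be sent like this (a sketch; the model name matches the one exported below, and the prompt and max_tokens are arbitrary):

curl -s http://localhost:8000/v3/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-Coder-1.5B", "stream": false, "prompt": "def fibonacci(n):", "max_tokens": 64}'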
Export Qwen/Qwen2.5-Coder-1.5B:
python export_model.py text_generation --source_model Qwen/Qwen2.5-Coder-1.5B --weight-format int4 --config_file_path models/config_all.json --model_repository_path models --target_device NPU --overwrite_models
Check that the workspace is set up properly. models/config_all.json should look like this:
{
    "mediapipe_config_list": [
        {
            "name": "codellama/CodeLlama-7b-Instruct-hf",
            "base_path": "codellama/CodeLlama-7b-Instruct-hf"
        },
        {
            "name": "Qwen/Qwen2.5-Coder-1.5B",
            "base_path": "Qwen/Qwen2.5-Coder-1.5B"
        }
    ],
    "model_config_list": []
}
tree models
models
├── codellama
│ └── CodeLlama-7b-Instruct-hf
│ ├── config.json
│ ├── generation_config.json
│ ├── graph.pbtxt
│ ├── openvino_detokenizer.bin
│ ├── openvino_detokenizer.xml
│ ├── openvino_model.bin
│ ├── openvino_model.xml
│ ├── openvino_tokenizer.bin
│ ├── openvino_tokenizer.xml
│ ├── special_tokens_map.json
│ ├── tokenizer_config.json
│ ├── tokenizer.json
│ └── tokenizer.model
├── config_all.json
└── Qwen
└── Qwen2.5-Coder-1.5B
├── added_tokens.json
├── config.json
├── generation_config.json
├── graph.pbtxt
├── merges.txt
├── openvino_detokenizer.bin
├── openvino_detokenizer.xml
├── openvino_model.bin
├── openvino_model.xml
├── openvino_tokenizer.bin
├── openvino_tokenizer.xml
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── vocab.json
4 directories, 29 files
Run OpenVINO Model Server with both models loaded at the same time:
Please refer to the OpenVINO Model Server installation guide first: link
On Windows (standalone app):

ovms --rest_port 8000 --config_path ./models/config_all.json

On Linux (via Docker):
docker run -d --rm --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -u $(id -u):$(id -g) \
-p 8000:8000 -v $(pwd)/:/workspace/ openvino/model_server:2025.1 --rest_port 8000 --config_path /workspace/models/config_all.json
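To verify that the server started correctly and that both models are loaded, you can query the model server configuration status endpoint; both models should be reported in the AVAILABLE state:

curl http://localhost:8000/v1/config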
Download Continue plugin
We need to point the Continue plugin to our OpenVINO Model Server instance. Open the Continue configuration file.
Add both models and specify their roles:
name: Local Assistant
version: 1.0.0
schema: v1
models:
  - name: OVMS CodeLlama-7b-Instruct-hf
    provider: openai
    model: codellama/CodeLlama-7b-Instruct-hf
    apiKey: unused
    apiBase: http://localhost:8000/v3
    roles:
      - chat
      - edit
      - apply
  - name: OVMS Qwen2.5-Coder-1.5B
    provider: openai
    model: Qwen/Qwen2.5-Coder-1.5B
    apiKey: unused
    apiBase: http://localhost:8000/v3
    roles:
      - autocomplete
context:
  - provider: code
  - provider: docs
  - provider: diff
  - provider: terminal
  - provider: problems
  - provider: folder
  - provider: codebase
- To use the chat feature, click the Continue button in the left sidebar.
- Use CTRL+I to select and include source in a chat message.
- Use CTRL+L to select and edit the source via a chat request.
- Simply write code to see code autocompletion (NOTE: this is turned off by default).
OpenVINO Model Server uses Python to apply chat templates. If you get an error during model loading, enable Unicode UTF-8 support in your system settings.