Ensure you have the Hugging Face pre-trained LLM directory, containing the tokenizer, model weights, and config files, before deployment. You can download the model with the following Python code:
from transformers import AutoTokenizer, AutoModelForCausalLM
# Model ID
model_id = "airesearch/LLaMa3-8b-WangchanX-sft-Demo"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
# Save tokenizer and model
path = "LLaMa3-8b-WangchanX-sft-Demo"
tokenizer.save_pretrained(path)
model.save_pretrained(path)
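To double-check that the saved directory is complete before building any image, a minimal sketch like the one below can list its contents. The file names in `expected` are only typical examples of a Hugging Face checkpoint and may differ for other models:

```python
import os

path = "LLaMa3-8b-WangchanX-sft-Demo"

# Typical files in a saved Hugging Face checkpoint; exact names can vary
# (for example, weights may be stored as several sharded .safetensors files).
expected = ["config.json", "tokenizer_config.json"]

saved = os.listdir(path)
print("Saved files:", sorted(saved))
for name in expected:
    if name not in saved:
        print(f"Warning: {name} not found in {path}")
```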
Text Generation Inference
Text Generation Inference (TGI) is a toolkit that simplifies the deployment and serving of Large Language Models (LLMs). It offers advanced features such as tensor parallelism, quantization, watermarking, and custom prompt generation, making it easy to deploy and use LLMs in a variety of applications. You can find more details in the official TGI documentation (https://huggingface.co/docs/text-generation-inference).
- At the current working directory location, prepare the following:
  - The directory containing the pre-trained LLM model from Hugging Face. For example, if you are using the `LLaMa3-8b-WangchanX-sft-Demo` model, the directory should be named `LLaMa3-8b-WangchanX-sft-Demo`.
- Create a `Dockerfile` with the following content to build a Docker image:
FROM ghcr.io/huggingface/text-generation-inference:2.0
COPY LLaMa3-8b-WangchanX-sft-Demo /data/LLaMa3-8b-WangchanX-sft-Demo
- Build the image using the following command:
docker build -t text-generation-inference -f <Dockerfile> .
- Alternatively, you can simply build the image using the Dockerfile we already provide in the deployment directory:
docker build -t text-generation-inference -f deployment/TGI/Dockerfile.TextGenerationInference .
- Run the image using this command:
docker run --gpus all -p 8888:80 text-generation-inference --model-id /data/LLaMa3-8b-WangchanX-sft-Demo # add the -d flag to run in the background
- And then you can make requests like this:
curl 127.0.0.1:8888/generate_stream \
-X POST \
-d '{"inputs":"<|user|>ลิเก กับ งิ้ว ต่างกันอย่างไร<|end_of_text|>\n<|assistant|>\n","parameters":{"max_new_tokens":2048}}' \
-H 'Content-Type: application/json'
NOTE
Don't forget to wrap your message in the chat template `<|user|>message ...<|end_of_text|>\n<|assistant|>\n` in the request inputs to get better results.
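For reference, the same streaming request can also be made from Python. This is a minimal sketch, assuming the `requests` package is installed and the container started above is reachable at 127.0.0.1:8888; it applies the chat template from the note and parses the server-sent events returned by `/generate_stream`:

```python
import json
import requests

TGI_URL = "http://127.0.0.1:8888/generate_stream"  # container started above

def build_prompt(message: str) -> str:
    # Chat template expected by LLaMa3-8b-WangchanX-sft-Demo (see NOTE above).
    return f"<|user|>{message}<|end_of_text|>\n<|assistant|>\n"

payload = {
    "inputs": build_prompt("ลิเก กับ งิ้ว ต่างกันอย่างไร"),
    "parameters": {"max_new_tokens": 2048},
}

# /generate_stream answers with server-sent events ("data: {...}" lines).
with requests.post(TGI_URL, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data:"):
            continue
        event = json.loads(line[len(b"data:"):])
        print(event.get("token", {}).get("text", ""), end="", flush=True)
print()
```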
LocalAI
LocalAI is a free, open-source alternative to OpenAI. It provides a drop-in REST API compatible with the OpenAI API specification for local or on-premises inference with LLMs, as well as image and audio generation, across multiple model families on consumer-grade hardware without requiring a GPU. You can find more details in the official LocalAI documentation (https://localai.io/).
- At the current working directory location, prepare the following:
  - The directory containing the pre-trained LLM model from Hugging Face. For example, if you are using the `LLaMa3-8b-WangchanX-sft-Demo` model, the directory should be named `LLaMa3-8b-WangchanX-sft-Demo`.
  - The model YAML file. This file can be found in the `deployment/LocalAI` directory. For the `LLaMa3-8b-WangchanX-sft-Demo` model, the YAML file would be named `LLaMa3-8b-WangchanX-sft-Demo.yaml`.
- Create a `Dockerfile` with the following content to build a Docker image:
FROM localai/localai:latest-aio-gpu-nvidia-cuda-12
COPY LLaMa3-8b-WangchanX-sft-Demo.yaml /build/models
- Build the image using the following command:
docker build -t localai -f <Dockerfile> .
- Alternatively, you can simply build the image using the Dockerfile we already provide in the deployment directory:
docker build -t localai -f deployment/LocalAi/Dockerfile.LocalAi .
- Run the image using this command:
docker run --gpus all -p 8888:8080 localai # add the -d flag to run in the background
- And then you can make requests like this:
curl http://localhost:8888/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{ "model": "LLaMa3-8b-WangchanX-sft-Demo", "messages": [{"role": "user", "content": "ลิเก กับ งิ้ว ต่างกันอย่างไร", "temperature": 0.1}] }'
Ollama
Ollama is an open-source, user-friendly platform that allows you to run large language models (LLMs) locally on your machine. You can find more details in the official Ollama repository (https://github.com/ollama/ollama).
- At the current working directory location, prepare the following:
  - The directory containing the pre-trained LLM model from Hugging Face. For example, if you are using the `LLaMa3-8b-WangchanX-sft-Demo` model, the directory should be named `LLaMa3-8b-WangchanX-sft-Demo`.
- Create a `Dockerfile` with the following content to build a Docker image:
FROM ollama/ollama
COPY LLaMa3-8b-WangchanX-sft-Demo /root/LLaMa3-8b-WangchanX-sft-Demo
RUN apt-get update && apt-get install -y python3 python3-pip python3-venv git
# Clone the ollama repository first
RUN git clone https://github.com/ollama/ollama.git /root/ollama
# Change to the cloned ollama directory
WORKDIR /root/ollama
# Initialize and update git submodules
RUN git submodule update --init --recursive
# Create a virtual environment and install the conversion requirements into it
# (each RUN starts a new shell, so "activate" would not persist; call the venv's binaries directly)
RUN python3 -m venv .venv
RUN .venv/bin/pip install -r llm/llama.cpp/requirements.txt
# Build the submodule
RUN make -C llm/llama.cpp quantize
# Convert the Hugging Face checkpoint to a GGUF file in f16 precision
RUN .venv/bin/python llm/llama.cpp/convert-hf-to-gguf.py /root/LLaMa3-8b-WangchanX-sft-Demo --outtype f16 --outfile /root/LLaMa3-8b-WangchanX-sft-Demo.gguf
- Build the image using the following command:
docker build -t ollama -f <Dockerfile> .
- Alternatively, you can simply build the image using the Dockerfile we already provide in the deployment directory:
docker build -t ollama -f deployment/Ollama/Dockerfile.Ollama .
- Run the image using this command:
docker run -d --gpus all -p 11434:11434 ollama # the -d flag runs the container in the background
- Create the model from the converted GGUF file (see the Python sketch at the end of this section for a more readable version of the Modelfile):
curl http://localhost:11434/api/create -d '{
"name": "LLaMa3-8b-WangchanX-sft-Demo",
"modelfile":"FROM /root/LLaMa3-8b-WangchanX-sft-Demo.gguf\n\n\nTEMPLATE \"\"\"\n{{ if .System }}<|system|>\n{{.System}}<|end_of_text|>\n{{ end }}{{ if .Prompt }}<|user|>\n{{ .Prompt }}<|end_of_text|>\n{{ end }}<|assistant|>\n\"\"\"\n\nPARAMETER stop \"<|end_of_text|>\"\nPARAMETER stop \"<|assistant|>\"\nPARAMETER stop \"<|user|>\"\nPARAMETER stop \"<|system|>\""
}'
- And then you can make requests like this:
curl http://localhost:11434/api/chat -d '{
"model": "LLaMa3-8b-WangchanX-sft-Demo",
"messages": [
{
"role": "user",
"content": "ลิเก กับ งิ้ว ต่างกันอย่างไร"
}
]
}'
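For reference, the two requests above can also be made from Python. The sketch below is a minimal example, assuming the `requests` package is installed and the container is listening on port 11434: it first registers the model with the Modelfile written out as a readable multi-line string, then sends a chat request and prints the reply, which Ollama streams as newline-delimited JSON.

```python
import json
import requests

OLLAMA_URL = "http://localhost:11434"  # container started above

# Same Modelfile as in the /api/create curl command, as a readable multi-line string.
MODELFILE = """FROM /root/LLaMa3-8b-WangchanX-sft-Demo.gguf


TEMPLATE \"\"\"
{{ if .System }}<|system|>
{{.System}}<|end_of_text|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end_of_text|>
{{ end }}<|assistant|>
\"\"\"

PARAMETER stop "<|end_of_text|>"
PARAMETER stop "<|assistant|>"
PARAMETER stop "<|user|>"
PARAMETER stop "<|system|>"
"""

# Register the model with Ollama (equivalent to the /api/create curl above).
create = requests.post(
    f"{OLLAMA_URL}/api/create",
    json={"name": "LLaMa3-8b-WangchanX-sft-Demo", "modelfile": MODELFILE},
)
create.raise_for_status()

# Chat with the model; Ollama streams the answer as JSON lines by default.
payload = {
    "model": "LLaMa3-8b-WangchanX-sft-Demo",
    "messages": [{"role": "user", "content": "ลิเก กับ งิ้ว ต่างกันอย่างไร"}],
}
with requests.post(f"{OLLAMA_URL}/api/chat", json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("message", {}).get("content", ""), end="", flush=True)
        if chunk.get("done"):
            break
print()
```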