
# Persistent Deployment Examples

The `serve.py` script can be used to create an inference server for any of the supported models. Provide the HuggingFace model name and the tensor-parallelism degree (or use the default values and run `$ python serve.py` for a single-GPU `mistralai/Mistral-7B-v0.1` deployment):

```bash
$ python serve.py --model "mistralai/Mistral-7B-v0.1" --tensor-parallel 1
```
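For reference, here is a minimal sketch of what `serve.py` might do internally, assuming it is a thin wrapper around the DeepSpeed-MII persistent-deployment API (`mii.serve`) and parses CLI flags mirroring the ones above with `argparse`:

```python
# Minimal sketch of a persistent-deployment server, assuming serve.py
# wraps DeepSpeed-MII's mii.serve (flag names mirror the CLI above).
import argparse

import mii

parser = argparse.ArgumentParser()
parser.add_argument("--model", default="mistralai/Mistral-7B-v0.1")
parser.add_argument("--tensor-parallel", type=int, default=1)
args = parser.parse_args()

# mii.serve starts a persistent inference server for the given
# HuggingFace model, sharded across `tensor_parallel` GPUs.
mii.serve(args.model, tensor_parallel=args.tensor_parallel)
```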

Connect to the persistent deployment and generate text with `client.py`. Provide the HuggingFace model name, the maximum number of generated tokens, and one or more prompts (or, if you are using the default values, run `$ python client.py`):

```bash
$ python client.py --model "mistralai/Mistral-7B-v0.1" --max-new-tokens 128 --prompts "DeepSpeed is" "Seattle is"
```
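The client side can be sketched in a few lines as well, assuming `client.py` connects with `mii.client` and calls its `generate` method (the prompts and token limit below match the example command):

```python
# Minimal sketch of a client for the persistent deployment, assuming
# client.py uses DeepSpeed-MII's mii.client to connect by model name.
import mii

# Connect to the already-running deployment.
client = mii.client("mistralai/Mistral-7B-v0.1")

# Generate up to 128 new tokens for each prompt.
responses = client.generate(
    ["DeepSpeed is", "Seattle is"], max_new_tokens=128
)
print(responses)
```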

Shut down the persistent deployment with `terminate.py`. Provide the HuggingFace model name (or, if you are using the default values, run `$ python terminate.py`):

```bash
$ python terminate.py --model "mistralai/Mistral-7B-v0.1"
```
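Shutdown follows the same pattern; this sketch assumes `terminate.py` reconnects to the deployment by model name and calls the client's `terminate_server` method:

```python
# Minimal sketch of shutting down the persistent deployment, assuming
# terminate.py uses mii.client and its terminate_server() call.
import mii

client = mii.client("mistralai/Mistral-7B-v0.1")
client.terminate_server()
```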