
# Persistent Deployment Examples

The `serve.py` script can be used to create an inference server for any of the supported models. Provide the HuggingFace model name and the tensor-parallelism degree (or use the default values and run `$ python serve.py` for a single-GPU `mistralai/Mistral-7B-v0.1` deployment):

```bash
$ python serve.py --model "mistralai/Mistral-7B-v0.1" --tensor-parallel 1
```
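For reference, here is a minimal sketch of what `serve.py` might do internally, assuming it is a thin wrapper around the DeepSpeed-MII persistent-deployment API (`mii.serve`) and parses CLI flags mirroring the ones above with `argparse`:

```python
# Minimal sketch of a persistent-deployment server, assuming serve.py
# wraps DeepSpeed-MII's mii.serve (flag names mirror the CLI above).
import argparse

import mii

parser = argparse.ArgumentParser()
parser.add_argument("--model", default="mistralai/Mistral-7B-v0.1")
parser.add_argument("--tensor-parallel", type=int, default=1)
args = parser.parse_args()

# mii.serve starts a persistent inference server for the given
# HuggingFace model, sharded across `tensor_parallel` GPUs.
mii.serve(args.model, tensor_parallel=args.tensor_parallel)
```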

Connect to the persistent deployment and generate text with `client.py`. Provide the HuggingFace model name, the maximum number of generated tokens, and one or more prompts (or, if you are using the default values, run `$ python client.py`):

```bash
$ python client.py --model "mistralai/Mistral-7B-v0.1" --max-new-tokens 128 --prompts "DeepSpeed is" "Seattle is"
```
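The client side can be sketched in a few lines as well, assuming `client.py` connects with `mii.client` and calls its `generate` method (the prompts and token limit below match the example command):

```python
# Minimal sketch of a client for the persistent deployment, assuming
# client.py uses DeepSpeed-MII's mii.client to connect by model name.
import mii

# Connect to the already-running deployment.
client = mii.client("mistralai/Mistral-7B-v0.1")

# Generate up to 128 new tokens for each prompt.
responses = client.generate(
    ["DeepSpeed is", "Seattle is"], max_new_tokens=128
)
print(responses)
```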

Shut down the persistent deployment with `terminate.py`. Provide the HuggingFace model name (or, if you are using the default values, run `$ python terminate.py`):

```bash
$ python terminate.py --model "mistralai/Mistral-7B-v0.1"
```
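Shutdown follows the same pattern; this sketch assumes `terminate.py` reconnects to the deployment by model name and calls the client's `terminate_server` method:

```python
# Minimal sketch of shutting down the persistent deployment, assuming
# terminate.py uses mii.client and its terminate_server() call.
import mii

client = mii.client("mistralai/Mistral-7B-v0.1")
client.terminate_server()
```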