This repo contains examples showing how to use containerized, locally hosted, free and open large language models (e.g. llama2, mixtral, etc.) for various use-cases, on CPU, GPU, or both. The project includes an easy-to-use Streamlit user interface for LLM inference across several use-cases (e.g. chatbot, RAG, text processing, and more) and a JupyterLab environment (with example notebooks) for interacting with the LLMs programmatically in Python, along with some auxiliary services (e.g. a vector store).
Follow the setup steps below to get started.
YOU MUST HAVE DOCKER & DOCKER-COMPOSE INSTALLED TO RUN THIS PROJECT. If you do not have Docker installed, follow the installation instructions for your operating system here - https://docs.docker.com/engine/install/
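One quick sanity check (not required by the project) is to ask both tools for their versions before going further:

```bash
# Both commands should print version information; if either fails,
# (re)install Docker / the compose plugin using the link above.
docker --version
docker compose version
```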
FOR GPU USERS ONLY: In order to share your host machine's CUDA-capable GPU with containers, you must install the Nvidia container toolkit. For installation instructions, see https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html.
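Once the toolkit is installed, you can optionally confirm that containers can see your GPU before starting the stack. The snippet below is a quick sanity check along the lines of the sample workload in NVIDIA's install guide, not something this repo requires; it assumes the toolkit has been registered with Docker and a working NVIDIA driver is present on the host.

```bash
# Launch a throwaway container with GPU access and print the GPU table.
# If the toolkit is configured correctly, the output matches running
# nvidia-smi directly on the host.
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```

If this errors out, fix the toolkit setup first, since the GPU compose file will hit the same problem.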
- Copy or rename ".env.template" to ".env" and add your HuggingFace Hub API token.
- The "docker-compose.yml" (or "docker-compose-gpu.yml") file contains a set of model configuration environment variables under the "inference-api" service. Update those variables to use a different model or to change model settings. YOU MUST CHOOSE A MODEL THAT CAN FIT ON YOUR CPU/GPU OR THE inference-api SERVICE WILL FAIL TO START (check the logs with `docker compose logs -f inference-api` to identify any errors loading the model).
- Run `docker compose up -d` (or `docker compose -f docker-compose-gpu.yml up -d` if using an Nvidia GPU) in the root level of the project (the folder containing the "docker-compose.yml" file).
- The inference-api service will take a while to start up the first time you use a new model, because the model must first be downloaded to your computer. You can run `docker compose logs -f inference-api` to follow the logs and see when the model is ready (or script the check; see the health-check sketch after these steps). The logs will appear to hang at the lines below while the model is downloading:

  ```
  inference-api-1 | INFO: Will watch for changes in these directories: ['/app']
  inference-api-1 | INFO: Uvicorn running on http://0.0.0.0:5000 (Press CTRL+C to quit)
  inference-api-1 | INFO: Started reloader process [1] using StatReload
  ```

  You should see log lines like the below once the model has loaded successfully:

  ```
  inference-api-1 | INFO: Started server process [28]
  inference-api-1 | INFO: Waiting for application startup.
  inference-api-1 | INFO: Application startup complete.
  inference-api-1 | INFO: 127.0.0.1:52106 - "GET /api/health HTTP/1.1" 200 OK
  ```
- Navigate to http://localhost:8501 in your web browser to play with models and generative AI use-cases.
- Navigate to http://localhost:8888 in your web browser to run the Python example notebooks.
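As an alternative to watching the logs in the step above, you can poll the health endpoint that appears in the startup log output ("GET /api/health ... 200 OK"). The snippet below is a rough sketch that assumes the inference-api service publishes port 5000 to the host at http://localhost:5000; check the ports mapping in "docker-compose.yml" and adjust the URL if yours differs.

```bash
# Keep polling the inference API's health endpoint until the model has
# finished downloading and loading, then report that it is ready.
until curl -sf http://localhost:5000/api/health > /dev/null; do
  echo "model still loading..."
  sleep 15
done
echo "model ready"
```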
- [ ] Update README
- Update available generation parameters in streamlit app
- Move model inference functionality to FastAPI
- [ ] Implement SQLAlchemy (or other) in streamlit app to save prompt configurations and store documents
- [ ] Update layout of streamlit app to improve user experience/flow
- [ ] Add multi-prompting to Basic Prompting in streamlit app
- [ ] Add natural-language to SQL use-case to streamlit app
- [ ] Add LLM Agent example (either as separate use-case or as part of chatbot) to streamlit app
- [ ] Add additional model metadata and model-specific prompts, and automatically update default prompts and kwargs on a per-model basis
- [ ] Address bug where llama-cpp-python gguf models don't release GPU VRAM when using n_gpu_layers