AWS GenAI Gateway

Project ACTIVE as of Feb 15, 2025

Project Overview

This project provides a simple Terraform deployment of LiteLLM into ECS or EKS on AWS. It aims to be pre-configured with defaults that will allow most users to quickly get started with LiteLLM.

It also provides additional features on top of LiteLLM, such as an AWS Bedrock interface (instead of the default OpenAI interface), support for AWS Bedrock Managed Prompts, Chat History, and support for Okta OAuth 2.0 JWT Token Auth.

If you are unfamiliar with LiteLLM, it provides a consistent interface to access all LLM providers, so you do not need to edit your code to try out different models. It allows you to centrally manage and track LLM usage across your company at the user, team, and API key level. You can configure budgets and rate limits, restrict access to specific models, and set up retry/fallback routing logic across multiple providers. It provides cost-saving measures like prompt caching, and security features like support for AWS Bedrock Guardrails for all LLM providers. Finally, it provides a UI where administrators can configure their users and teams, and users can generate their API keys and test out different LLMs in a chat interface.

Architecture

Architecture Diagram

How to deploy

Prerequisites

  1. Docker
  2. AWS CLI
  3. Terraform
  4. yq (install with brew on macOS, or download the binary on Linux; see Installing yq below)

If you have DEPLOYMENT_PLATFORM set to EKS:

  1. kubectl

Installing kubectl

On Mac

brew install kubectl

On Linux

https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/

Installing yq

On Mac

brew install yq

On Linux

# Download the binary
VERSION="v4.40.5"  # Replace with desired version
BINARY="yq_linux_amd64"
sudo wget https://github.com/mikefarah/yq/releases/download/${VERSION}/${BINARY} -O /usr/bin/yq
# Make it executable
sudo chmod +x /usr/bin/yq

Environment tested and confirmed

Docker: version 27.3.1
AWS CLI: version 2.19.5
yq: version 4.40.5
Terraform: v1.5.7
kubectl Client Version: v1.32.1
kubectl Kustomize Version: v5.5.0

Deploying from AWS Cloud9 (Optional)

If it's easier for you, you can deploy from an AWS Cloud9 environment using the following steps:

  1. Go to Cloud9 in the console
  2. Click Create environment
  3. Change Instance Type to t3.small (need to upgrade from micro for Docker to run effectively)
  4. Leave the rest as default, and click Create
  5. Once the environment is deployed, click Open under Cloud9 IDE
  6. In the terminal, run the following commands:
  7. git clone https://github.com/aws-samples/genai-gateway.git
  8. Run sudo ./install-cloud9-prerequisites.sh (this will install jq, terraform, and kubectl for you; all other dependencies are pre-installed on Cloud9)
  9. Due to a limitation in Cloud9, the built-in credentials only last 15 minutes, which will always cause your deployments to fail. To avoid this, you MUST do the following:
  • Open up the credentials file via vi ~/.aws/credentials
  • Paste in your own credentials that have admin access and will last at least an hour.
  • Cloud9 will ask you to re-enable managed credentials. Leave them disabled. This is the ONLY way you'll get through this deployment successfully.
  10. Run the Deployment Steps described below

Creating your certificate

Domain and Certificate, AWS Internal

  1. Reach out to mirodrr for instructions on this

Domain and Certificate, AWS Customer

  1. Follow the instructions here to create a certificate with AWS Certificate Manager.

  2. Follow the instructions here to validate your domain ownership for your certificate.

Deployment Steps

  1. Run cp .env.template .env
  2. In .env, set the TERRAFORM_S3_BUCKET_NAME to a globally unique bucket name. This will store your .tfstate files for your terraform deployment.
  3. In .env, set the CERTIFICATE_ARN to the ARN of the certificate you created in the Creating your certificate section of this README.
  4. In .env, set the DOMAIN_NAME to the subdomain you created in the Creating your certificate section of this README.
  5. In .env, set the HOSTED_ZONE_NAME to the hosted zone you created in the Creating your certificate section of this README.
  6. In .env, fill in any API keys you need for third-party providers. If you only want to use Amazon Bedrock, you can leave the .env file as-is.
  7. By default, this solution is deployed with Redis caching enabled and the most popular model providers enabled. If you want to remove support for certain models, or add more models, you can create and edit your own config/config.yaml file. Otherwise, the deployment will automatically use config/default-config.yaml. Make sure you enable model access on Amazon Bedrock.
  8. Make sure you have valid AWS credentials configured in your environment before running the next step. The credentials must last at least an hour to ensure the deployment can complete successfully
  9. Run ./deploy.sh
  10. After the deployment is done, you can visit the UI by going to the URL in the stack output ServiceURL, which is the DOMAIN_NAME you configured earlier.
  11. The master API key is stored in AWS Secrets Manager in the LiteLLMMasterSalt secret. This API key can be used to call the LiteLLM API, and is also the default password for the LiteLLM UI (see the sketch below for retrieving it programmatically).
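
If you'd rather fetch the master key from a script than from the console, the sketch below shows one way to do it with boto3. It assumes the secret is reachable under the LiteLLMMasterSalt name mentioned above (your deployment may prefix the secret name) and that AWS credentials and a region are already configured in your environment.

import json

import boto3

# Secret name as described above; adjust if your deployment prefixes it
SECRET_NAME = "LiteLLMMasterSalt"

client = boto3.client("secretsmanager")
secret_value = client.get_secret_value(SecretId=SECRET_NAME)["SecretString"]

# The secret may be stored as a plain string or as a JSON object; handle both
try:
    print(json.loads(secret_value))
except json.JSONDecodeError:
    print(secret_value)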

Optional Deployment Configurations

  • If you'd like to provide an existing VPC, set EXISTING_VPC_ID to the ID of your existing VPC. The VPC is expected to have both private and public subnets. The public subnets should have Auto-assign public IPv4 address set to yes. The VPC should also have at least one NAT Gateway. This has been tested with the following VPC specification:

  • 2 Availability Zones

  • 1 NAT Gateway

  • 1 Internet Gateway

  • 2 Public Subnets (Auto-assign public IPv4 address enabled, route table pointing to the Internet Gateway)

  • 2 Private Subnets (route table pointing to the NAT Gateway)

  • If you'd like to use EKS instead of ECS, switch DEPLOYMENT_PLATFORM="ECS" to DEPLOYMENT_PLATFORM="EKS"

  • If you'd like to bring your own EKS cluster, set EXISTING_EKS_CLUSTER_NAME to your EKS cluster name. You must also set EXISTING_VPC_ID, and the VPC must be the one your EKS cluster exists in. This has been tested with a new EKS cluster created with EKS Auto Mode disabled and EKS API authentication enabled (see the images below). The tested cluster also had the following add-ons pre-installed:

    • Amazon EKS Pod Identity Agent
    • Amazon VPC CNI
    • kube-proxy
    • CoreDNS

If you'd like Terraform to install the required add-ons for you, you can set INSTALL_ADD_ONS_IN_EXISTING_EKS_CLUSTER="true"

If you're bringing an existing cluster, it is assumed that the role or user deploying this has access to the cluster.

Other configurations are not guaranteed to work. You may have to tweak the Terraform yourself to work with your particular existing EKS cluster configuration.

Verified EKS Cluster Configuration

Required add-ons

  • If you'd like to run LiteLLM in subnets without outbound internet access, set DISABLE_OUTBOUND_NETWORK_ACCESS="true". Due to lack of internet access, this configuration will only work with AWS Bedrock or SageMaker models. Because of this, you must remove all non-Bedrock/non-SageMaker models from your config/config.yaml file. If you do not do this, LiteLLM will fail to start as it will attempt to call third party models over the internet.

  • If you'd like the Application Load Balancer to be private to your VPC, set PUBLIC_LOAD_BALANCER="false"

Usage Instructions

Using LiteLLM is practically identical to using OpenAI; you just need to replace the base_url and the API key with your LiteLLM ones.

import openai # openai v1.0.0+
client = openai.OpenAI(api_key="anything",base_url="https://<Your-Proxy-Endpoint>") # set proxy to base_url
response = client.chat.completions.create(model="anthropic.claude-3-5-sonnet-20240620-v1:0", messages = [
    {
        "role": "user",
        "content": "this is a test request, write a short poem"
    }
])

print(response)

Compare Models

If you would like to compare different models, you can use the scripts/benchmark.py script. To do so, do the following:

  1. cd scripts
  2. python3 -m venv myenv
  3. source myenv/bin/activate
  4. pip3 install -r requirements.txt
  5. cp .env.template .env
  6. Update the .env file with your LiteLLM base URL and a valid LiteLLM API key. Also change the list of model IDs you would like to benchmark if needed.
  7. benchmark.py has a list of questions to use for the benchmark at the top of the file. You can edit this list to try out different questions.
  8. Run python3 benchmark.py
  9. The script will output the response from each model for each question, as well as the response time and the cost

Config.yaml (the values pre-populated in config.yaml, what they do, and their defaults)

model_list: within this field, many different models are already configured for you. If you would like to add more models, or remove models, edit this field. Some model providers (such as Databricks and Azure OpenAI) will need you to add additional configuration to function, so they are commented out by default.

model_name: this is the model's public name. You can set it to whatever you like. When someone is calling your model using the OpenAI client, they will use this value for the model id. By default, the model_name is set to the model id from each provider.

If a model id is used by two different providers, we instead use <provider>/<model_id> as the model_name. For example, the github provider shares a model id with the groq provider, so to avoid the conflict we use github/llama-3.1-8b-instant instead of just llama-3.1-8b-instant.

litellm_params: This is the full list of additional parameters sent to the model. For most models, this will only be model which is the model id used by the provider. Some providers such as azure need additional parameters, which are documented in config/default-config.yaml.

You can also use this to set default parameters for the model such as temperature and top_p.

You can also use this to override the default region for a model. For example, if you deploy LiteLLM to us-east-1 but want to use an AWS Bedrock model in us-west-2, you would set aws_region_name to us-west-2. The parameter to adjust the default region will vary by LLM provider.

You can also use this to set a weight for an LLM to load balance between two different models with the same model_name

litellm_settings: These are additional settings for the litellm proxy

litellm_settings.cache: Whether to enable or disable prompt caching. Caches prompt results to save time and money for repeated prompts. Set to True by default.

litellm_settings.cache_params.type: The type of cache to use. Set to Redis by default. Redis is currently the only cache type that will work out of the box with this deployment

litellm_settings.max_budget: (float) sets the max budget in dollars across the entire proxy, across all API keys. Note: the budget does not apply to the master key; that is the only exception. Set to an extremely large value by default (1000000000.0).

litellm_settings.budget_duration: (str) frequency of budget reset for entire proxy. - You can set duration as seconds ("30s"), minutes ("30m"), hours ("30h"), days ("30d"), months ("1mo"). Set to 1mo (1 month) by default

litellm_settings.max_internal_user_budget: (float) sets default budget in dollars for each internal user. (Doesn't apply to Admins. Doesn't apply to Teams. Doesn't apply to master key). Set to an extremely large value by default (1000000000.0)

litellm_settings.internal_user_budget_duration: (str) frequency of budget reset for each internal user - You can set duration as seconds ("30s"), minutes ("30m"), hours ("30h"), days ("30d"), months ("1mo"). Set to 1mo (1 month) by default

litellm_settings.success_callback: defines where success logs are sent to. Defaults to s3. s3 is the only destination that works out of the box with this deployment

litellm_settings.failure_callback: defines where failure logs are sent to. Defaults to s3. s3 is the only destination that works out of the box with this deployment

litellm_settings.service_callback: defines where service logs (such as rds and redis) are sent to. Defaults to s3. s3 is the only destination that works out of the box with this deployment

litellm_settings.s3_callback_params.s3_bucket_name: defines the bucket where the logs will be sent to. Is automatically populated with the name of an S3 bucket that is created during deployment

litellm_settings.s3_callback_params.s3_region_name: defines the bucket region where the logs will be sent to. Is automatically populated with the current region used during deployment

Routing
A/B testing and Load Balancing

To do A/B testing with two different models, do the following:

  1. Define two different models with the same model_name
  2. Point them each to one of the two different models you want to A/B test
  3. Set the weight for each of them to determine the percentage of traffic you want going to each model

Example: Let's say you're using OpenAI, but you want to migrate to Anthropic Claude on AWS Bedrock. You want to send 10% of your traffic there to see if you're getting comparable speed and quality before you fully commit. You can do so like this in your config.yaml:

- model_name: gpt-4o
  litellm_params:
    model: openai/gpt-4o
    weight: 9
- model_name: gpt-4o
  litellm_params:
    model: bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0
    weight: 1

You can list several models under the same model_name, and LiteLLM will automatically distribute traffic between them; if one begins failing, it will fall back to the others.

Example:

model_list:
  - model_name: claude-3-5-sonnet-20240620-v1:0
    litellm_params:
      model: bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0 
  - model_name: claude-3-5-sonnet-20240620-v1:0
    litellm_params:
      model: bedrock/anthropic.claude-3-haiku-20240307-v1:0
  - model_name: claude-3-5-sonnet-20240620-v1:0
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20240620

To see more possible routing options, check the full documentation: https://docs.litellm.ai/docs/routing

Routing Strategies

By default, the solution is configured with usage-based-routing. This will always route to the model with the lowest TPM (Tokens Per Minute) usage for that minute for a given model_name.

This routing will also respect tpm (tokens per minute) and rpm (requests per minute) limits, and will stop sending traffic to a model if it exceeds that limit.

Example of setting multiple models to load balance between, with different tpm and rpm limits:

model_list:
  - model_name: claude-3-5-sonnet-20240620-v1:0
    tpm: 100000
    rpm: 1000
    litellm_params:
      model: bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0 
  - model_name: claude-3-5-sonnet-20240620-v1:0
    tpm: 200000
    rpm: 2000
    litellm_params:
      model: bedrock/anthropic.claude-3-haiku-20240307-v1:0
  - model_name: claude-3-5-sonnet-20240620-v1:0
    tpm: 300000
    rpm: 3000
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20240620

The routing strategy is configured like:

router_settings:
  routing_strategy: usage-based-routing-v2
  enable_pre_call_check: true

You can explore alternative routing strategies here: https://docs.litellm.ai/docs/routing#advanced---routing-strategies-%EF%B8%8F

Fallbacks

You can also configure fallbacks for an entire model_name. If all models in a given model_name are failing, you can configure a final fallback.

Let's say you love Claude 3.5 Sonnet, but occasionally your users perform a query that overwhelms its context window. You can configure a fallback for that scenario. All requests will go to Claude 3.5 Sonnet, but if they are too large, they will go to Gemini, which has a larger context window.

router_settings:
    context_window_fallbacks: [{"anthropic.claude-3-5-sonnet-20240620-v1:0": ["gemini-1.5-pro"]}]

If a model_name fails for any other reason, you can configure a generic fallback

router_settings:
    fallbacks: [{"gpt-4o": ["anthropic.claude-3-5-sonnet-20240620-v1:0"]}]

And finally you can set a fallback for all model_name as a global fallback in case of unexpected failures:

router_settings:
  default_fallbacks: ["anthropic.claude-3-haiku-20240307-v1:0"]

More details here https://docs.litellm.ai/docs/routing and here https://docs.litellm.ai/docs/proxy/reliability

Guardrails

To set Guardrails for your LLM calls, do the following:

  1. Create a Guardrail in AWS Bedrock
  2. Get the Guardrail ID and guardrail version
  3. Define the Guardrail like the example below in your config.yaml
guardrails:
   - guardrail_name: "bedrock-pre-guard"
     litellm_params:
       guardrail: bedrock
       mode: "during_call" # supported values: "pre_call", "post_call", "during_call"
       guardrailIdentifier: ff6ujrregl1q # your guardrail ID on bedrock
       guardrailVersion: "1"         # your guardrail version on bedrock
       default_on: true # enforces the guardrail serverside for all models. Caller does not need to pass in the name of the guardrail for it to be enforced.

If you set default_on to true, the guardrail will be enforced at all times. If you set it to false, enforcement is optional.

In the case that default_on is false, in order to make use of the Guardrail, you must specify its name in the client call. Example:

curl -X POST "https://<Your-Proxy-Endpoint>/user/new" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <Your-Master-Key-Or-Admin-Key>" \
-d '{
    "model": "anthropic.claude-3-5-sonnet-20240620-v1:0",
    "messages": [
        {
            "role": "user",
            "content": "prohibited topic"
        }
    ],
    "guardrails": ["bedrock-pre-guard"]
}'
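
If you're calling the proxy through the OpenAI Python client instead of curl, you can request a non-default_on guardrail per call by passing the guardrails list in the request body via extra_body. The sketch below assumes the bedrock-pre-guard guardrail configured above; the endpoint and API key are placeholders to replace with your own values:

import openai

# Placeholders: replace with your proxy endpoint and a valid LiteLLM API key
client = openai.OpenAI(
    api_key="<Your-LiteLLM-API-Key>",
    base_url="https://<Your-Proxy-Endpoint>",
)

response = client.chat.completions.create(
    model="anthropic.claude-3-5-sonnet-20240620-v1:0",
    messages=[{"role": "user", "content": "this is a test request, write a short poem"}],
    # Apply the guardrail defined in config.yaml to this call
    extra_body={"guardrails": ["bedrock-pre-guard"]},
)

print(response.choices[0].message.content)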

More details on guardrails here: https://docs.litellm.ai/docs/proxy/guardrails/bedrock

Common Operations

See full documentation for all Operations here: https://litellm-api.up.railway.app/#/Internal%20User%20management/new_user_user_new_post

Create new user

Use this to create a new INTERNAL user. Internal users can access the LiteLLM Admin UI to make keys. This creates a new user and generates a new API key for the new user. The new API key is returned.

If you don't specify a budget, the values in litellm_settings.max_internal_user_budget and litellm_settings.internal_user_budget_duration are applied to the user.

Create User with default budget defined in your config.yaml:
curl -X POST "https://<Your-Proxy-Endpoint>/user/new" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <Your-Master-Key-Or-Admin-Key>" \
-d '{
     "user_email": "[email protected]",
     "user_role": "internal_user"
 }'
Create User with a budget that overrides the default (in this example we give a budget of 1000 dollars of spend per month)
curl -X POST "https://<Your-Proxy-Endpoint>/user/new" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <Your-Master-Key-Or-Admin-Key>" \
-d '{
     "user_email": "[email protected]",
     "user_role": "internal_user"
     "max_budget": 1000.0,
     "budget_duration": "1mo"
 }'
Create a user with a limit on TPM (Tokens Per Minute), RPM (Requests Per Minute), and max parallel requests. In this case we give our user 10000 tokens per minute, 10 requests per minute, and 2 parallel requests.

Note: There is currently a bug where max_parallel_requests is not returned in the create user response. However, it is still taking effect, and you can confirm that by doing a GET on the user (see the sketch at the end of this section).

curl -X POST "https://<Your-Proxy-Endpoint>/user/new" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <Your-Master-Key-Or-Admin-Key>" \
-d '{
     "user_email": "[email protected]",
     "user_role": "internal_user"
     "tpm_limit": 10000,
     "rpm_limit": 10,
     "max_parallel_requests": 2
 }'
Create a user that can only access Bedrock Claude 3.5 Sonnet and Claude 3 Haiku
curl -X POST "https://<Your-Proxy-Endpoint>/user/new" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <Your-Master-Key-Or-Admin-Key>" \
-d '{
     "user_email": "[email protected]",
     "user_role": "internal_user"
     "models": ["anthropic.claude-3-5-sonnet-20240620-v1:0", "anthropic.claude-3-haiku-20240307-v1:0"],
 }'
Create a user that has separate TPM (Tokens Per Minute) and RPM (Requests Per Minute) limits for different models

In this case: for Claude 3.5 Sonnet, 10000 tokens per minute and 5 requests per minute; for Claude 3 Haiku, 20000 tokens per minute and 10 requests per minute.

Note: There is currently a bug where model_rpm_limit and model_tpm_limit are not returned in the create user response. However, they are still taking effect, and you can confirm that by doing a GET on the user (see the sketch at the end of this section).

curl -X POST "https://<Your-Proxy-Endpoint>/user/new" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <Your-Master-Key-Or-Admin-Key>" \
-d '{
     "user_email": "[email protected]",
     "user_role": "internal_user"
     "model_rpm_limit": {"anthropic.claude-3-5-sonnet-20240620-v1:0": 1, "anthropic.claude-3-haiku-20240307-v1:0": 1},
     "model_tpm_limit": {"anthropic.claude-3-5-sonnet-20240620-v1:0": 10000, "anthropic.claude-3-haiku-20240307-v1:0": 20000},
 }'
Create User Return value

The return value of user/new will look something like this:

{"key_alias":null,"duration":null,"models":[],"spend":0.0,"max_budget":1000.0,"user_id":"22bfb70a-fdda-49ce-8447-807149aba3d3","team_id":null,"max_parallel_requests":null,"metadata":{"model_rpm_limit":{"anthropic.claude-3-5-sonnet-20240620-v1:0":1,"anthropic.claude-3-haiku-20240307-v1:0":1},"model_tpm_limit":{"anthropic.claude-3-5-sonnet-20240620-v1:0":10000,"anthropic.claude-3-haiku-20240307-v1:0":20000}},"tpm_limit":null,"rpm_limit":null,"budget_duration":"1mo","allowed_cache_controls":[],"soft_budget":null,"config":{},"permissions":{},"model_max_budget":{},"send_invite_email":null,"model_rpm_limit":null,"model_tpm_limit":null,"guardrails":null,"blocked":null,"aliases":{},"key":"sk-UJwU0Mu_Rs3Iq6ag","key_name":null,"expires":null,"token_id":null,"user_email":"[email protected]","user_role":"internal_user","teams":null,"user_alias":null}

Copy the key value and provide it to your user to begin using the gateway with the configured models, budgets, and quotas.
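
As noted above, some fields (such as max_parallel_requests, model_rpm_limit, and model_tpm_limit) are not echoed back in the create user response even though they take effect. Below is a minimal sketch of confirming them with a GET on the user via the proxy's /user/info endpoint (see the API documentation linked above); the endpoint, key, and user_id are placeholders to replace with your own values:

import requests

# Placeholders: replace with your proxy endpoint, an admin or master key,
# and the user_id returned by the user/new call
base_url = "https://<Your-Proxy-Endpoint>"
api_key = "<Your-Master-Key-Or-Admin-Key>"
user_id = "<user_id-from-user-new-response>"

response = requests.get(
    f"{base_url}/user/info",
    params={"user_id": user_id},
    headers={"Authorization": f"Bearer {api_key}"},
)

# The returned user object includes limits such as max_parallel_requests,
# model_rpm_limit, and model_tpm_limit
print(response.status_code)
print(response.json())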

Set Priority of request (currently broken: BerriAI/litellm#7144)

To set the priority of a request on the client side, you can do the following:

curl -X POST 'https://<Your-Proxy-Endpoint>/v1/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <Your-Master-Key-Or-Admin-Key>' \
-d '{
    "model": "gpt-3.5-turbo-fake-model",
    "messages": [
        {
        "role": "user",
        "content": "what is the meaning of the universe? 1234"
        }],
    "priority": 0 πŸ‘ˆ SET VALUE HERE
}'

Priority - The lower the number, the higher the priority: e.g. priority=0 > priority=2000

So if you have traffic you want prioritized over all others, set those calls to priority=0, and the other calls to priority>0

There is currently no way to set this priority on the server side. So you must handle this on the client side for now.

Bedrock interface

This deployment has a middleware layer that allows you to use the Bedrock interface via boto3 instead of the OpenAI interface. This requires overriding the endpoint_url and injecting your API key into the Authorization header of the request. There are example scripts showing how to do this: test-middleware-synchronous.py (for synchronous requests) and test-middleware-streaming.py (for streaming requests).

To use these scripts:

Set the required environment variables:

export API_ENDPOINT="your-bedrock-endpoint" #Should be https://<Your-Proxy-Endpoint>/bedrock
export API_KEY="your-api-key" #Should be your litellm api key you normally use
export AWS_REGION="your-region" #Should be your deployment region

Install dependencies:

pip install boto3

Run the script:

python test-middleware-synchronous.py (for synchronous requests)

python test-middleware-streaming.py (for streaming requests)

The key part of this script is the initialization of the boto3 client, like this:

import os

import boto3
from botocore import UNSIGNED
from botocore.config import Config


def create_bedrock_client():
    """
    Creates a Bedrock client with custom endpoint and authorization header.
    Uses environment variables for configuration.

    Required environment variables:
    - API_ENDPOINT: Custom Bedrock endpoint URL
    - API_KEY: Authorization bearer token
    - AWS_REGION: AWS region

    Returns:
        boto3.client: Configured Bedrock client
    """

    # Get configuration from environment variables
    endpoint = os.getenv("API_ENDPOINT")
    api_key = os.getenv("API_KEY")
    region = os.getenv("AWS_REGION")

    if not all([endpoint, api_key, region]):
        raise ValueError(
            "Missing required environment variables: API_ENDPOINT, API_KEY, AWS_REGION"
        )

    # Initialize session and configure client
    session = boto3.Session()
    client_config = Config(
        signature_version=UNSIGNED,  # Disable SigV4 signing
        retries={"max_attempts": 10, "mode": "standard"},
    )

    # Create the Bedrock client
    client = session.client(
        "bedrock-runtime",
        endpoint_url=endpoint,
        config=client_config,
        region_name=region,
    )

    # Define authorization header handler
    def add_authorization_header(request, **kwargs):
        request.headers["Authorization"] = f"Bearer {api_key}"

    # Register the event handler
    client.meta.events.register("request-created.*", add_authorization_header)

    return client

Once that client is initialized, you can use it exactly as you would use boto3 to call AWS Bedrock directly (currently only supports converse and converse_stream)
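
For example, here is a minimal sketch of a synchronous converse call through the gateway, assuming the create_bedrock_client function above and an illustrative Bedrock model ID:

# Build the client configured above, then call Bedrock through the gateway
client = create_bedrock_client()

model_id = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # illustrative model ID

response = client.converse(
    modelId=model_id,
    messages=[{"role": "user", "content": [{"text": "this is a test request, write a short poem"}]}],
)

# Print the first text block of the model's reply (Converse API response shape)
print(response["output"]["message"]["content"][0]["text"])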

Bedrock Managed Prompts

The middleware layer also has support for Bedrock Managed Prompts. It works the same as documented here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-runtime/client/converse.html#

You can use a managed prompt like this:

model_id = "arn:aws:bedrock:us-west-2:235614385815:prompt/6LE1KDKISG" #Put the arn of your prompt as the model_id
response = client.converse(
    modelId=model_id,
    promptVariables={ #specify any variables you need for your prompt
        "topic": {"text": "fruit"},
    })

The OpenAI Interface also has support for Bedrock Managed Prompts.

You can use a managed prompt like this:

model = "arn:aws:bedrock:us-west-2:235614385815:prompt/6LE1KDKISG:2" #Put the arn of your prompt as the model_id

response = client.chat.completions.create(
    model=model,
    messages=[], #Messages is required to be passed in, but it will not be used. Your managed prompt will be used instead
    stream=False,
    extra_body={"promptVariables": {"topic": {"text": "fruit"}}},
)
print(response.choices[0].message.content)

Chat History

The middleware layer also supports chat history via a session_id.

Note: A session_id is tied to a specific API key. Only that API key can access the chat history associated with the session. We may eventually make an exception for admins, and may eventually allow a single user to own a session_id across multiple API keys.

To use this with the OpenAI Interface when not using streaming, do the following:

response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            stream=False,
            extra_body={"enable_history": True}
        )

session_id = response.model_extra.get("session_id")

response_2 = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt2}],
            stream=False,
            extra_body={"session_id": session_id}
        )

The session_id is returned as part of the response.model_extra dictionary. And you pass that session_id in the extra_body parameter to continue the same conversation

To use this with the OpenAI Interface with streaming, do the following:

stream = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            stream=True,
            extra_body={"enable_history": True}
        )

session_id = None
first_chunk = True

for chunk in stream:
    # Get session_id from first chunk
    if first_chunk:
        session_id = getattr(chunk, "session_id", None)
        first_chunk = False
    
    #Do normal processing on all chunks

stream2 = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            stream=True,
            extra_body={"session_id": session_id}
        )

The session_id is returned as part of the first chunk of the response stream. And you pass that session_id in the extra_body parameter to continue the same conversation

To use this with the Bedrock interface, do the following:

response = client.converse(
                modelId=model_id,
                messages=[{"role": "user", "content": [{"text": message}]}],
                additionalModelRequestFields={"enable_history": True}
            )

session_id = response["ResponseMetadata"]["HTTPHeaders"].get("x-session-id")

response2 = client.converse(
                modelId=model_id,
                messages=[{"role": "user", "content": [{"text": message2}]}],
                additionalModelRequestFields={"session_id": session_id},
            )

The session_id is returned as a header in response["ResponseMetadata"]["HTTPHeaders"]. And you pass that session_id in the additionalModelRequestFields parameter to continue the same conversation

The approach for the Bedrock interface with streaming is identical, but it is included here for completeness:

response = client.converse_stream(
                modelId=model_id,
                messages=[{"role": "user", "content": [{"text": message}]}],
                additionalModelRequestFields={"enable_history": True},
            )
session_id = response["ResponseMetadata"]["HTTPHeaders"].get("x-session-id")

response2 = client.converse_stream(
                modelId=model_id,
                messages=[{"role": "user", "content": [{"text": message2}]}],
                additionalModelRequestFields={"session_id": session_id},
            )

The session_id is returned as a header in response["ResponseMetadata"]["HTTPHeaders"]. And you pass that session_id in the additionalModelRequestFields parameter to continue the same conversation

You can get the chat history for a given session id by calling POST /chat-history for history in OpenAI format, or POST /bedrock/chat-history for history in AWS Bedrock Converse API format, like this:

import requests

# Common headers, including authorization
headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}

# Request body containing session_id
payload = {"session_id": session_id}

# Endpoint to get chat history in OpenAI format
openai_url = f"{base_url}/chat-history"
response_openai = requests.post(openai_url, json=payload, headers=headers)

if response_openai.status_code == 200:
    print("OpenAI Format History:")
    print(response_openai.json())
else:
    print("Failed to retrieve OpenAI format history")
    print("Status code:", response_openai.status_code)
    print("Response:", response_openai.text)

# Endpoint to get chat history in Bedrock format
bedrock_url = f"{base_url}/bedrock/chat-history"
response_bedrock = requests.post(bedrock_url, json=payload, headers=headers)

if response_bedrock.status_code == 200:
    print("\nBedrock Format History:")
    print(response_bedrock.json())
else:
    print("Failed to retrieve Bedrock format history")
    print("Status code:", response_bedrock.status_code)
    print("Response:", response_bedrock.text)

You can get all session ids for an api key by calling POST /session-ids like this:

endpoint = f"{base_url}/session-ids"
headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}

response = requests.post(endpoint, headers=headers, json={})
if response.status_code == 200:
    print("Success!")
    print("Session IDs:", response.json().get("session_ids", []))
else:
    print(f"Error: {response.status_code}")
    print(response.text)

Okta OAuth 2.0 JWT Token Auth Support

This solution supports creating LiteLLM users using an Okta OAuth 2.0 JWT.

In your .env file, you must provide your OKTA_ISSUER (something like https://dev-12345.okta.com/oauth2/default) and your OKTA_AUDIENCE (default is api://default, but set it to whatever makes sense for your Okta setup)

Any user created with an Okta JWT will have the non-admin internal_user role. Only someone with the master key (or admin users/keys derived from the master key) will be able to perform admin operations. At a later point, we may make it so that someone with a specific Okta claim can act as an admin and bypass these restrictions without needing the master key.

Their user_id will be the sub of the Okta User's claims.

Right now, these users can give themselves any max_budget, tpm_limit, rpm_limit, max_parallel_requests, or teams. At a later point, we may lock these down more, or make a default configurable in the deployment.

Once you have configured your Okta settings, you can create a user like this:

Request

curl -X POST "https://<Your-Proxy-Endpoint>/user/new" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <Okta Oauth 2.0 JWT>" \
-d '{
 }'

Response

{"key_alias":null,"duration":null,"models":[],"spend":0.0,"max_budget":1000000000.0,"user_id":"[email protected]","team_id":null,"max_parallel_requests":null,"metadata":{},"tpm_limit":null,"rpm_limit":null,"budget_duration":"1mo","allowed_cache_controls":[],"soft_budget":null,"config":{},"permissions":{},"model_max_budget":{},"send_invite_email":null,"model_rpm_limit":null,"model_tpm_limit":null,"guardrails":null,"blocked":null,"aliases":{},"key":"<New_Api_Key_Tied_To_Okta_User>","key_name":null,"expires":null,"token_id":null,"user_email":"[email protected]","user_role":"internal_user","teams":null,"user_alias":null}

With the returned API key, you use LiteLLM as you normally would.

You can also create additional API keys tied to your user:

curl -X POST "https://<Your-Proxy-Endpoint>/key/generate" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <New_Api_Key_Tied_To_Okta_User>" \
-d '{"user_id": "[email protected]" }'

Response

{"key_alias":null,"duration":null,"models":[],"spend":0.0,"max_budget":null,"user_id":"[email protected]","team_id":null,"max_parallel_requests":null,"metadata":{},"tpm_limit":null,"rpm_limit":null,"budget_duration":null,"allowed_cache_controls":[],"soft_budget":null,"config":{},"permissions":{},"model_max_budget":{},"send_invite_email":null,"model_rpm_limit":null,"model_tpm_limit":null,"guardrails":null,"blocked":null,"aliases":{},"key":"<Second_Api_Key_Tied_To_Okta_User>","key_name":"sk-...fbcg","expires":null,"token_id":"8bb9cb70ce3ed3b7907dfbaae525e06a2fec6601dbe930b5571c0aca12552378"}     

LangSmith support

To use LangSmith, provide your LANGSMITH_API_KEY, LANGSMITH_PROJECT, and LANGSMITH_DEFAULT_RUN_NAME in your .env file.

Open Source Library

For detailed information about the open source libraries used in this application, please refer to the ATTRIBUTION file.
