RAG chat: Productionizing the app

This sample is designed to be a starting point for your own production application, but you should do a thorough review of the security and performance before deploying to production. Here are some things to consider:

Azure resource configuration

OpenAI Capacity

The default TPM (tokens per minute) is set to 30K. That is equivalent to approximately 30 conversations per minute (assuming 1K tokens per user message/response). You can increase the capacity by changing the chatGptDeploymentCapacity and embeddingDeploymentCapacity parameters in infra/main.bicep to your account's maximum capacity. You can also view the Quotas tab in Azure OpenAI Studio to see how much capacity you have.
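The estimate above is plain division, which the following Python sketch makes explicit. The function names are illustrative and not part of this sample, and the 1K-tokens-per-exchange default is only the assumption stated above; measure your own token usage before relying on these numbers.

```python
def conversations_per_minute(tpm: int, tokens_per_exchange: int = 1_000) -> int:
    """Estimate how many message/response exchanges a TPM quota supports per minute."""
    if tpm < 0 or tokens_per_exchange <= 0:
        raise ValueError("tpm must be >= 0 and tokens_per_exchange must be > 0")
    return tpm // tokens_per_exchange


def required_tpm(exchanges_per_minute: int, tokens_per_exchange: int = 1_000) -> int:
    """Estimate the TPM quota needed for an expected number of exchanges per minute."""
    if exchanges_per_minute < 0 or tokens_per_exchange <= 0:
        raise ValueError("exchanges_per_minute must be >= 0 and tokens_per_exchange must be > 0")
    return exchanges_per_minute * tokens_per_exchange


if __name__ == "__main__":
    # Default deployment capacity from this document: 30K TPM.
    print(conversations_per_minute(30_000))
    # Hypothetical target load of 120 exchanges per minute.
    print(required_tpm(120))
```

Comparing the second figure against your quota is a quick way to tell whether the default capacity is likely to be enough.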

If the maximum TPM isn't enough for your expected load, you have a few options:

Azure Storage

The default storage account uses the Standard_LRS SKU. To improve your resiliency, we recommend using Standard_ZRS for production deployments, which you can specify using the sku property under the storage module in infra/main.bicep.

Azure AI Search

The default search service uses the "Basic" SKU with the free semantic ranker option, which gives you 1000 free queries a month. After 1000 queries, you will get an error message about exceeding the semantic ranker free capacity.

  • If you expect your app to receive more than 1000 questions per month, you should upgrade the semantic ranker from the "free" SKU to the "standard" SKU:

    azd env set AZURE_SEARCH_SEMANTIC_RANKER standard

    Or disable semantic search entirely:

    azd env set AZURE_SEARCH_SEMANTIC_RANKER disabled
  • The search service can handle fairly large indexes, but it does have per-SKU limits on storage size, maximum vector dimensions, etc. Depending on your expected load, you may want to use either a Standard or Storage Optimized SKU. However, you cannot change the SKU of an existing search service, so switching SKUs means you will need to re-index the data or manually copy it over. You can select the SKU by setting the AZURE_SEARCH_SERVICE_SKU azd environment variable to an allowed SKU.

    azd env set AZURE_SEARCH_SERVICE_SKU standard

    See the Azure AI Search service limits documentation for more details.

  • If you see errors about search service capacity being exceeded, you may find it helpful to increase the number of replicas by changing replicaCount in infra/core/search/search-services.bicep or manually scaling it from the Azure Portal.
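To decide whether the free semantic ranker tier described above is enough, a rough monthly projection helps. The sketch below is illustrative only: the function name is hypothetical, and it assumes each user question triggers exactly one semantic ranker query, which may not match how your app behaves.

```python
FREE_SEMANTIC_QUERIES_PER_MONTH = 1000  # free tier limit stated in this document


def exceeds_free_semantic_quota(questions_per_day: float, days_in_month: int = 30) -> bool:
    """Return True if the projected monthly question count exceeds the free quota.

    Assumption: one semantic ranker query per user question.
    """
    if questions_per_day < 0 or days_in_month <= 0:
        raise ValueError("questions_per_day must be >= 0 and days_in_month must be > 0")
    projected = questions_per_day * days_in_month
    return projected > FREE_SEMANTIC_QUERIES_PER_MONTH


if __name__ == "__main__":
    for daily in (10, 30, 40):
        verdict = "upgrade or disable" if exceeds_free_semantic_quota(daily) else "free tier is enough"
        print(f"{daily} questions/day: {verdict}")
```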

Azure App Service

The default app service plan uses the Basic SKU with 1 CPU core and 1.75 GB RAM. We recommend using a Premium level SKU, starting with 1 CPU core. You can use auto-scaling rules or scheduled scaling rules, and scale up the maximum/minimum based on load.

Additional security measures

Load testing

We recommend running a load test for your expected number of users. You can use the locust tool with the locustfile.py in this sample or set up a load test with Azure Load Testing.

To use locust, first install the dev requirements that includes locust:

python -m pip install -r requirements-dev.txt

Or manually install locust:

python -m pip install locust

Then run the locust command, specifying the name of the User class to use from locustfile.py. We've provided a ChatUser class that simulates a user asking questions and receiving answers, as well as a ChatVisionUser to simulate a user asking questions with the GPT-4 vision mode enabled.

locust ChatUser

Open the locust UI at http://localhost:8089/, the URI displayed in the terminal.

Start a new test with the URI of your website, e.g. https://my-chat-app.azurewebsites.net. Do not end the URI with a slash. You can start by pointing at your localhost if you're concerned more about load on OpenAI/AI Search than the host platform.
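If you script your test setup, a small helper can guard against the trailing-slash mistake mentioned above. This is a minimal sketch using only the Python standard library; the function name normalize_host is hypothetical and is not part of locust or this sample.

```python
from urllib.parse import urlsplit


def normalize_host(uri: str) -> str:
    """Strip whitespace and trailing slashes, then check the URI looks like an HTTP(S) host."""
    cleaned = uri.strip().rstrip("/")
    parts = urlsplit(cleaned)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"Expected an http(s) URI with a host name, got: {uri!r}")
    return cleaned


if __name__ == "__main__":
    # Example host from this document, written with the trailing slash to remove.
    print(normalize_host("https://my-chat-app.azurewebsites.net/"))
```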

For the number of users and spawn rate, we recommend starting with 20 users and 1 user/second. From there, you can keep increasing the number of users to simulate your expected load.
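When planning successive runs, it can help to write down the user counts you intend to test and how long each ramp-up takes at a given spawn rate. The following sketch is only a planning aid with hypothetical function names; it does not call locust, and the step size and target are assumptions you should replace with your own.

```python
def ramp_up_seconds(users: int, spawn_rate: float) -> float:
    """Seconds until all simulated users have been started at the given spawn rate."""
    if users < 0 or spawn_rate <= 0:
        raise ValueError("users must be >= 0 and spawn_rate must be > 0")
    return users / spawn_rate


def user_steps(start: int, target: int, step: int) -> list[int]:
    """User counts for successive test runs, increasing by step and ending at target."""
    if step <= 0:
        raise ValueError("step must be > 0")
    steps = list(range(start, target, step))
    steps.append(target)  # always finish with the expected load itself
    return steps


if __name__ == "__main__":
    # Start at the recommended 20 users and work up to a hypothetical target of 100.
    for users in user_steps(start=20, target=100, step=20):
        print(f"{users} users: ramp-up takes {ramp_up_seconds(users, 1):.0f} s")
```

Letting each run continue well past its ramp-up time gives the charts a chance to reflect steady-state load rather than the ramp itself.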

Here's an example load test for 50 users and a spawn rate of 1 per second:

Screenshot of Locust charts showing 5 requests per second

After each test, check the local or App Service logs to see if there are any errors.

Evaluation

Before you make your chat app available to users, you'll want to rigorously evaluate the answer quality. You can use tools in the AI RAG Chat evaluator repository to run evaluations, review results, and compare answers across runs.