This repository now includes an example of integrating a GPT vision model with Azure AI Search. This feature enables indexing and searching images and graphs, such as those found in financial documents, in addition to text-based content, and then sending the retrieved content to the GPT vision model for response generation.
- Document Handling: Source documents are split into pages and saved as PNG files in Blob storage. Each file's name and page number are recorded for citation purposes.
- Data Extraction: Text data is extracted using OCR.
- Data Indexing: Text and image embeddings, generated using the Azure AI Vision multimodal embeddings API, are indexed in Azure AI Search along with the raw text.
- Search and Response: Searches can be conducted using vector or hybrid methods (see the sketch after this list). Responses are generated by the GPT vision model based on the retrieved content.
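To make the retrieval step concrete, here is a minimal sketch of a hybrid query against the index. It assumes the field and index names used by this sample (`imageEmbedding`, `sourcepage`, `gptkbindex`) and placeholder endpoints and keys; it is an illustration, not the app's actual code. The query text is embedded into the same multimodal space as the page images via the Azure AI Vision `retrieval:vectorizeText` API, then combined with full-text search:

```python
import requests
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

# Placeholder endpoints/keys -- substitute your own resources.
AI_VISION_ENDPOINT = "https://<your-computer-vision>.cognitiveservices.azure.com"
AI_VISION_KEY = "<computer-vision-key>"
SEARCH_ENDPOINT = "https://<your-search-service>.search.windows.net"
SEARCH_KEY = "<search-admin-key>"

def vectorize_text(query: str) -> list[float]:
    """Embed query text into the same multimodal space as the indexed page images."""
    response = requests.post(
        f"{AI_VISION_ENDPOINT}/computervision/retrieval:vectorizeText",
        params={"api-version": "2024-02-01", "model-version": "2023-04-15"},
        headers={"Ocp-Apim-Subscription-Key": AI_VISION_KEY},
        json={"text": query},
    )
    response.raise_for_status()
    return response.json()["vector"]

search_client = SearchClient(SEARCH_ENDPOINT, "gptkbindex", AzureKeyCredential(SEARCH_KEY))
query = "What was the operating margin trend across 2023?"

# Hybrid retrieval: keyword search over the OCR text plus a vector query
# against the image embedding field; Azure AI Search fuses both rankings.
results = search_client.search(
    search_text=query,
    vector_queries=[
        VectorizedQuery(
            vector=vectorize_text(query),
            k_nearest_neighbors=3,
            fields="imageEmbedding",
        )
    ],
    top=3,
)
for doc in results:
    print(doc["sourcepage"])
```

A pure vector search would simply omit `search_text`; the hybrid form shown here lets Azure AI Search combine keyword and vector rankings before the results are handed to the GPT vision model.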
- Create a Computer Vision account in the Azure Portal first, so that you can agree to the Responsible AI terms for that resource. You can delete the account after agreeing.
- The ability to deploy a gpt-4o model in one of the supported regions. If you're not sure, try creating a gpt-4o deployment from your Azure OpenAI deployments page.
- Ensure that you can deploy the Azure OpenAI resource group in a region where all required components are available:
  - Azure OpenAI models
    - gpt-35-turbo
    - text-embedding-ada-002
    - gpt-4o
  - Azure AI Vision
- Update repository: Pull the latest changes.
- Enable GPT vision approach: First, make sure you do not have integrated vectorization enabled, since that is currently incompatible:

  ```shell
  azd env set USE_FEATURE_INT_VECTORIZATION false
  ```

  Then set the environment variable for enabling vision support:

  ```shell
  azd env set USE_GPT4V true
  ```

  When set, that flag will provision a Computer Vision resource and gpt-4o model, upload image versions of PDFs to Blob storage, upload embeddings of images in a new `imageEmbedding` field, and enable the vision approach in the UI (a sketch of how that field is populated follows this list).
- Clean old deployments (optional): Run `azd down --purge` for a fresh setup.
- Start the application: Execute `azd up` to build, provision, deploy, and initiate document preparation.
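For illustration, here is a hedged sketch of what populating the new `imageEmbedding` field involves: each page image is embedded with the Azure AI Vision `retrieval:vectorizeImage` API and uploaded alongside the OCR text. The file name, endpoints, and exact schema below are assumptions for this example; the repo's data preparation scripts handle this automatically when `USE_GPT4V` is set.

```python
import requests
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Placeholder endpoints/keys -- substitute your own resources.
AI_VISION_ENDPOINT = "https://<your-computer-vision>.cognitiveservices.azure.com"
AI_VISION_KEY = "<computer-vision-key>"
SEARCH_ENDPOINT = "https://<your-search-service>.search.windows.net"
SEARCH_KEY = "<search-admin-key>"

def vectorize_image(png_bytes: bytes) -> list[float]:
    """Return a multimodal embedding for one PNG page image."""
    response = requests.post(
        f"{AI_VISION_ENDPOINT}/computervision/retrieval:vectorizeImage",
        params={"api-version": "2024-02-01", "model-version": "2023-04-15"},
        headers={
            "Ocp-Apim-Subscription-Key": AI_VISION_KEY,
            "Content-Type": "application/octet-stream",
        },
        data=png_bytes,
    )
    response.raise_for_status()
    return response.json()["vector"]

# Hypothetical page image produced by the document-splitting step.
with open("financial-report-page-12.png", "rb") as f:
    embedding = vectorize_image(f.read())

search_client = SearchClient(SEARCH_ENDPOINT, "gptkbindex", AzureKeyCredential(SEARCH_KEY))
search_client.upload_documents([
    {
        "id": "financial-report-page-12",
        "content": "<OCR text extracted from this page>",  # raw text field
        "imageEmbedding": embedding,  # vector from Azure AI Vision
        "sourcepage": "financial-report.pdf#page=12",  # citation reference
    }
])
```

Because `retrieval:vectorizeImage` and `retrieval:vectorizeText` map into the same embedding space, text queries can retrieve visually relevant pages directly.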
- Access the developer options in the web app and select "Use GPT vision model".
- The sample questions in the UI are updated to ones suited for testing the vision approach.
- Ask these questions (or your own) to view the generated responses.
- The 'Thought Process' tab shows the retrieved content and how it was processed by the GPT vision model.
Feel free to explore and contribute to enhancing this feature. For questions or feedback, use the repository's issue tracker.