Voyage-AI LLM-embedded document search and BGE reranker
Reach out to me with any suggestions/questions on LinkedIn (https://www.linkedin.com/in/vishalkannan/)
Organizations across industries generate and store vast amounts of data in the form of documents. Unfortunately, the valuable information within these documents often goes underutilized, particularly once the employees who authored them leave the company. Even when the information itself remains valuable, the business context around it can change over time. To address this, I propose a convolutional search algorithm enhanced with a predefined semantic context layer that retrieves relevant information in the appropriate context.
To illustrate this approach, I develop a large language model (LLM)-powered retriever designed to extract risk factors from public IPO filings in India.
An end user (an employee of a company) describes a specific risk faced by their company. From the risk factors listed in past public IPO filings, find the 10 that are semantically closest to the risk described.
How can we avoid/mitigate false positives (verbose chunks that appear for almost any/all prompts)?
How can we allow manual training by domain experts for the RAG application?
How can we use both bi-encoder cosine similarity and cross-encoder similarity to search through a huge dataset without sacrificing speed or accuracy?
There is wide variance in the verbosity and the topics covered across the risk factors, and verbose risk factors have been observed to be retrieved more often for any given query (false positives). A convolutional search layer is therefore needed so that these false positives can be mitigated through training.
For technical subjects, the vector embeddings generated by a large language model trained on an unknown corpus of publicly available data are not reliable on their own. To mitigate this, an additional knowledge layer calculates the distance of each text statement in the vector database to a set of mutually exclusive, collectively exhaustive (MECE) domain knowledge statements set up by an expert in the field.
This allows end users to understand the intrinsic biases and behaviors built into the large language model and to mitigate them by using fine-tuned distances (learned through training) instead of relying on the raw cosine similarity scores between the vector embeddings. For this project, a simple set of risk category statements was authored and vectorized; they may not be strictly MECE.
OBJECTIVES
- Develop a complete retrieval functionality using cosine similarities against each Risk Factor
- Develop a quick retrieval functionality with the ability to finetune the output
- Compare the quick and complete retrieval processes on accuracy and time for an input text prompt
DATA PREP
- Vectorize the risk factors found in the IPO documents and store them in a vector database (CSV)
- Write detailed descriptions of commonly understood risk categories, then vectorize and store them
- Construct a finetuned distance matrix between the risk factors and the risk categories
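A minimal sketch of the vectorization step, assuming the voyageai Python client, the voyage-2 embedding model, and pandas-based CSV storage (the model name, file names, and the risk_factor_texts / risk_category_texts variables are illustrative assumptions, not the project's exact setup):

```python
import os
import pandas as pd
import voyageai

# Assumes VOYAGE_API_KEY is set in the environment.
vo = voyageai.Client(api_key=os.environ["VOYAGE_API_KEY"])

def embed_and_store(texts: list[str], out_csv: str) -> pd.DataFrame:
    """Embed a list of statements with Voyage AI and persist them to a CSV 'vector database'."""
    embeddings = []
    for start in range(0, len(texts), 128):  # embed in modest batches to respect API limits
        batch = texts[start:start + 128]
        result = vo.embed(batch, model="voyage-2", input_type="document")
        embeddings.extend(result.embeddings)
    df = pd.DataFrame({"text": texts, "embedding": embeddings})
    df.to_csv(out_csv, index=False)
    return df

# risk_factor_texts / risk_category_texts are the scraped statements (assumed to exist).
risk_factors_df = embed_and_store(risk_factor_texts, "risk_factors_vectors.csv")
risk_categories_df = embed_and_store(risk_category_texts, "risk_categories_vectors.csv")
```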
Risk Factors
Risk Categories
Finetuned Risk Matrix
SEARCH ALGORITHMS
Complete Search Algorithm
- User inputs the risk description
- Vectorize the risk description using the voyage-ai embedder
- Calculate the cosine similarity between the risk embedding and each embedding in the risk factors database (~8,000 Risk Factors)
- Select the 10 closest risk factors and record the time taken for the complete search function
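A sketch of the complete search, assuming the risk factor embeddings have been loaded from the CSV into a NumPy matrix (factor_embeddings, factor_texts, and the model name are illustrative):

```python
import time
import numpy as np
import voyageai

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

def complete_search(query: str, factor_embeddings: np.ndarray,
                    factor_texts: list[str], k: int = 10):
    """Embed the query, score it against every risk factor embedding, and return the top-k matches."""
    start = time.perf_counter()
    q = np.array(vo.embed([query], model="voyage-2", input_type="query").embeddings[0])
    # Cosine similarity between the query and all ~8,000 risk factor embeddings.
    sims = factor_embeddings @ q / (np.linalg.norm(factor_embeddings, axis=1) * np.linalg.norm(q))
    top_k = np.argsort(-sims)[:k]
    elapsed = time.perf_counter() - start
    return [(factor_texts[i], float(sims[i])) for i in top_k], elapsed
```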
Matrix Search Algorithm
- User inputs the risk description
- Vectorize the risk description using the voyage-ai embedder
- Calculate the cosine similarity between the risk embedding and each embedding in the risk categories database (17 Risk Categories)
- Multiply the finetuned risk matrix by the cosine similarity values from the previous step, matching columns by risk category
- For each risk factor, take the maximum value across the risk category columns as its closeness score
- Select the 10 closest risk factors and record the time taken for the matrix search function
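A sketch of the matrix search with the same illustrative names; finetuned_matrix is the (n_factors x 17) finetuned risk matrix, with its columns in the same category order as the rows of category_embeddings:

```python
import time
import numpy as np

def matrix_search(query_embedding: np.ndarray,
                  category_embeddings: np.ndarray,  # shape (17, dim)
                  finetuned_matrix: np.ndarray,     # shape (n_factors, 17)
                  factor_texts: list[str],
                  k: int = 10):
    """Score the query against the 17 risk categories only, then project through the finetuned matrix."""
    start = time.perf_counter()
    # Cosine similarity between the query and each of the 17 risk categories.
    cat_sims = category_embeddings @ query_embedding / (
        np.linalg.norm(category_embeddings, axis=1) * np.linalg.norm(query_embedding))
    # Column-wise multiplication (same category columns); the max per risk factor is its closeness score.
    closeness = (finetuned_matrix * cat_sims).max(axis=1)
    top_k = np.argsort(-closeness)[:k]
    elapsed = time.perf_counter() - start
    return [(factor_texts[i], float(closeness[i])) for i in top_k], elapsed
```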
Hybrid Search Algorithm
Combining the best of both worlds: use the fast matrix search to shortlist candidates, then rerank the shortlist with the BGE cross-encoder reranker (see the sketch below).
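One way the hybrid step could be implemented, reusing the matrix_search helper from the sketch above and assuming the BAAI/bge-reranker-base checkpoint via sentence-transformers and a shortlist size of 50 (both assumptions):

```python
from sentence_transformers import CrossEncoder

# BGE cross-encoder reranker; the exact checkpoint is an assumption.
reranker = CrossEncoder("BAAI/bge-reranker-base")

def hybrid_search(query, query_embedding, category_embeddings,
                  finetuned_matrix, factor_texts, shortlist: int = 50, k: int = 10):
    """Fast matrix search to shortlist candidates, then cross-encoder reranking for the final top-k."""
    candidates, _ = matrix_search(query_embedding, category_embeddings,
                                  finetuned_matrix, factor_texts, k=shortlist)
    pairs = [(query, text) for text, _ in candidates]
    scores = reranker.predict(pairs)  # cross-encoder relevance score per (query, risk factor) pair
    reranked = sorted(zip(candidates, scores), key=lambda pair: -pair[1])
    return [(text, float(score)) for (text, _), score in reranked[:k]]
```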
PERFORMANCE METRICS
Semantic Accuracy
- The Rouge-L score (longest common subsequence) is used to assess the similarity between each of the 10 selected risk factors and the input risk statement
- Compute the Rouge-L metric between the input risk statement and the risk factors returned by each search type
- Plot the Rouge-L metric on a graph for display
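A sketch of the Rouge-L scoring using the rouge_score package (reporting the F-measure is an assumption; precision or recall could be plotted instead):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def rouge_l_scores(query: str, retrieved_texts: list[str]) -> list[float]:
    """Rouge-L F1 (longest common subsequence) between the input risk statement and each retrieved risk factor."""
    return [scorer.score(query, text)["rougeL"].fmeasure for text in retrieved_texts]
```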
Compute Time
- Record the time taken to execute each form of search (complete, matrix, and hybrid)
- Plot the time taken on a graph
The semantically closest risk factors (and their metadata) are available for review
FINETUNING ALGORITHM
1. Create a training set
Author a statement and identify the 10 Risk Factors from the database that are semantically closest to it. For this project, the training statements were drawn from the Risk Factors in the database themselves, and direct cosine similarity was used to identify the 10 closest Risk Factors.
2. Initialize the Baseline Cosine Similarity Matrix
Calculate the cosine similarities between the embeddings of the Risk Factors and the Risk Categories to form the baseline cosine similarity matrix.
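A minimal sketch of this initialization, assuming the embedding matrices have been loaded from the CSVs produced during data prep:

```python
import numpy as np

def baseline_matrix(factor_embeddings: np.ndarray, category_embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarities: rows are Risk Factors, columns are Risk Categories."""
    f = factor_embeddings / np.linalg.norm(factor_embeddings, axis=1, keepdims=True)
    c = category_embeddings / np.linalg.norm(category_embeddings, axis=1, keepdims=True)
    return f @ c.T  # shape (n_factors, n_categories)
```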
3. Determine Real Ranks
Use the complete search functionality to rank the Risk Factors by their distance to the training statement. These ranks represent the ground truth (real ranks).
4. Determine Initial Calculated Ranks
Use the matrix search functionality with the initial baseline cosine similarity matrix to rank the Risk Factors by distance. These are the initial calculated ranks.
5. Compute Closeness Scores
Multiply the baseline cosine similarity matrix by the cosine similarity scores between the risk statement embedding and the Risk Category embeddings.
6. Identify Maximum Closeness Scores
For each Risk Factor, identify the Risk Category column with the maximum closeness score and select that score.
7. Determine Calculated Ranks
Rank the Risk Factors by their maximum closeness scores to obtain the calculated ranks.
8. Optimization Model
- A mathematical optimization model is developed using the cvxpy library to adjust the baseline cosine similarity matrix; the model variables represent the adjusted similarity scores.
- The goal is to minimize the difference between the real ranks and the calculated ranks for the top 10 ranks.
- Gradient descent or a similar optimization technique can be used to solve the problem (see the sketch below).
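The exact cvxpy formulation is not spelled out here, and rank differences are not directly expressible as a convex objective, so the sketch below is only one possible relaxation: keep the adjusted matrix close to the baseline while requiring the ground-truth top-10 Risk Factors to out-score a set of hard negatives in their baseline-best category columns. The margin, the choice of hard negatives, and the least-squares objective are all assumptions, not the author's exact model.

```python
import cvxpy as cp
import numpy as np

def finetune_matrix(baseline: np.ndarray,       # (n_factors, n_categories) baseline cosine matrix
                    cat_sims: np.ndarray,        # (n_categories,) query-vs-category cosine similarities
                    true_top10: list[int],       # indices of the ground-truth top-10 risk factors
                    hard_negatives: list[int],   # indices of factors wrongly ranked above them
                    margin: float = 0.01) -> np.ndarray:
    """One convex relaxation of the rank-matching objective (an assumption, not the exact model)."""
    n_factors, n_categories = baseline.shape
    W = cp.Variable((n_factors, n_categories))

    # For each factor, fix the category column that maximised its baseline closeness score.
    best_cat = np.argmax(baseline * cat_sims, axis=1)

    constraints = []
    for i in true_top10:
        for j in hard_negatives:
            # Require each true top-10 factor to out-score each hard negative by a margin.
            constraints.append(
                W[i, best_cat[i]] * cat_sims[best_cat[i]]
                >= W[j, best_cat[j]] * cat_sims[best_cat[j]] + margin
            )

    # Stay close to the baseline matrix while satisfying the ranking constraints.
    objective = cp.Minimize(cp.sum_squares(W - baseline))
    cp.Problem(objective, constraints).solve()
    return W.value
```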
9. Update the Baseline Cosine Similarity Matrix
Replace the old baseline cosine similarity matrix values with the optimized variables obtained from the mathematical model.
10. Fine-Tune over the Dataset
Move to the next risk statement in the training dataset and repeat steps 4-9 for each statement to iteratively fine-tune the baseline cosine similarity matrix.