diff --git a/rag/open-source/chunking/chunking_advisor/ChunkingAdvisor.ipynb b/rag/open-source/chunking/chunking_advisor/ChunkingAdvisor.ipynb
deleted file mode 100644
index b5b361a4..00000000
--- a/rag/open-source/chunking/chunking_advisor/ChunkingAdvisor.ipynb
+++ /dev/null
@@ -1,903 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "id": "90c9d089-3132-48cf-88f9-99f7b20227a7",
-   "metadata": {},
-   "source": [
-    "## Important note: Pre-requisite\n",
-    "\n",
-    "\n",
-    "#### This notebook requires an existing Knowledge Base on Bedrock. To create a Knowledge Base, you can execute the code provided in: https://github.com/aws-samples/amazon-bedrock-workshop/blob/main/02_KnowledgeBases_and_RAG/0_create_ingest_documents_test_kb.ipynb"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "d4a64fff-2c1e-4f34-9c7f-0c8180bb039a",
-   "metadata": {},
-   "source": [
-    "# Automated Document Processing Pipeline: Using Foundation Models for Smart Document Chunking and Knowledge Base Integration\n",
-    "\n",
-    "This notebook demonstrates an end-to-end automated solution for intelligent document processing using Foundation Models (FMs). The pipeline performs three key functions:\n",
-    "\n",
-    "##### 1. Document Structure Analysis: the FM automatically analyzes document structure and content to determine the optimal chunking strategy\n",
-    "##### 2. Configuration Generation: creates customized chunking parameters based on the analysis\n",
-    "##### 3. Knowledge Base Integration: processes and loads the chunked documents into a Bedrock Knowledge Base powered by OpenSearch Serverless\n",
-    "\n",
-    "#### Notebook Walkthrough\n",
-    "The pipeline streamlines the entire process from document analysis to knowledge base population, making it efficient to prepare documents for advanced query and retrieval operations.\n",
-    "\n",
-    "![data_ingestion](./img/chunkingAdvs.jpg)\n",
-    "\n",
-    "#### Steps: \n",
-    "\n",
-    "1. Setup Access and Permissions:\n",
-    "   Create the Amazon Bedrock Knowledge Base execution role\n",
-    "   Configure the necessary IAM policies for:\n",
-    "   S3 data access\n",
-    "   OpenSearch Serverless write permissions\n",
-    "   A reference template is available in the \"0_create_ingest_documents_test_kb\" notebook\n",
-    "\n",
-    "2. Document Analysis Using Claude Sonnet:\n",
-    "   Process the files within the target folder. For each document, Claude analyzes and recommends:\n",
-    "   The optimal chunking strategy (FIXED_SIZE/NONE/HIERARCHICAL/SEMANTIC)\n",
-    "   Specific configuration parameters\n",
-    "   Custom processing requirements\n",
-    "\n",
-    "3. Data Preparation and Storage:\n",
-    "   Upload the analyzed files to designated S3 buckets\n",
-    "   Configure the buckets as data sources for the Bedrock KB DataStore\n",
-    "\n",
-    "4. Ingestion Process:\n",
-    "   Initiate ingestion jobs via the Knowledge Base APIs\n",
-    "\n",
-    "5. Validation:\n",
-    "   Test ingestion completion\n",
-    "   Verify data accessibility and accuracy\n",
-    "\n",
-    "#### Pre-requisites\n",
-    "\n",
-    "This notebook requires permissions to:\n",
-    "\n",
-    "1. Create and delete Amazon IAM roles\n",
-    "2. Create, update and delete Amazon S3 buckets\n",
-    "3. Access Amazon Bedrock\n",
-    "4. Allow Bedrock roles to access S3 buckets (3 buckets)\n",
-    "5. Access Amazon OpenSearch Serverless\n",
-    "\n",
-    "If running on SageMaker Studio, you should add the following managed policies to your role:\n",
-    "\n",
-    "- IAMFullAccess\n",
-    "- AWSLambda_FullAccess\n",
-    "- AmazonS3FullAccess\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "d066b04e-bc6f-42e6-8836-817d2e0854b1",
-   "metadata": {},
-   "source": [
-    "### Install required libraries"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "4f36a15a-6cc3-464a-a2f1-fcb2d45ea7b9",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "%pip install --force-reinstall -q -r ./requirements.txt"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "5f694d1b-c537-4bf2-a5b1-d4f5c7025620",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# restart the kernel so the freshly installed packages are picked up\n",
-    "from IPython.core.display import HTML\n",
-    "HTML(\"<script>Jupyter.notebook.kernel.restart()</script>\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "25239d0e-972d-4fff-b200-f20c39714a9e",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import boto3\n",
-    "\n",
-    "import json\n",
-    "\n",
-    "# create a boto3 session to dynamically get and set the region name\n",
-    "session = boto3.Session() \n",
-    "\n",
-    "AWS_REGION = session.region_name\n",
-    "bedrock = boto3.client('bedrock-runtime', region_name=AWS_REGION)\n",
-    "bedrock_agent_client = session.client('bedrock-agent', region_name=AWS_REGION)\n",
-    "\n",
-    "MODEL_NAME = \"anthropic.claude-3-5-sonnet-20241022-v2:0\"\n",
-    "datasources = []\n",
-    "# folder holding the documents to analyze\n",
-    "path = \"data\"\n",
-    "# To get the knowledgeBaseId, look in the Amazon Bedrock console under Knowledge bases\n",
-    "# This ID should have been created by the first notebook:\n",
-    "# https://github.com/aws-samples/amazon-bedrock-workshop/blob/main/02_KnowledgeBases_and_RAG/0_create_ingest_documents_test_kb.ipynb\n",
-    "\n",
-    "kb_id = \"XXXXX\" # Retrieve the KB first\n",
-    "kb = bedrock_agent_client.get_knowledge_base(knowledgeBaseId = kb_id)\n",
-    "print(kb)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "f02e1241-fa8d-4dc2-a298-959a61ae2665",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "bedrock_kb_execution_role_arn = kb['knowledgeBase']['roleArn']\n",
-    "# the role name is the last segment of the role ARN\n",
-    "bedrock_kb_execution_role = bedrock_kb_execution_role_arn.split('/')[-1]\n",
-    "print(bedrock_kb_execution_role_arn)\n",
-    "print(bedrock_kb_execution_role)"
-   ]
-  },
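-  {
-   "cell_type": "markdown",
-   "id": "0aa10001-0000-4000-8000-000000000001",
-   "metadata": {},
-   "source": [
-    "##### Optional sanity check (an added sketch, not part of the original workshop flow): `get_knowledge_base` returns a `status` field, and a usable Knowledge Base should report `ACTIVE` before ingestion is attempted."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "0aa10001-0000-4000-8000-000000000002",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# a usable Knowledge Base should report status ACTIVE\n",
-    "kb_status = kb['knowledgeBase']['status']\n",
-    "print(f\"Knowledge Base {kb_id} status: {kb_status}\")\n",
-    "if kb_status != 'ACTIVE':\n",
-    "    print('Warning: the KB is not ACTIVE yet; the steps below may fail.')"
-   ]
-  },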
\n", - " Args:\n", - " bucket_name (str): Name of the S3 bucket to create (must be globally unique)\n", - " Raises:\n", - " ClientError: If there's an error accessing or creating the bucket\n", - " \"\"\"\n", - " try:\n", - " s3_client = boto3.client('s3')\n", - " s3_client.head_bucket(Bucket=bucketname)\n", - " print(f'Bucket {bucketname} Exists')\n", - " except ClientError as e:\n", - " print(f'Creating bucket {bucketname}')\n", - " if AWS_REGION == \"us-east-1\":\n", - " s3bucket = s3_client.create_bucket(\n", - " Bucket=bucketname)\n", - " else:\n", - " s3bucket = s3_client.create_bucket(\n", - " Bucket=bucketname,\n", - " CreateBucketConfiguration={ 'LocationConstraint': AWS_REGION }\n", - " )\n", - "\n", - "def upload_file(file_name, bucket, object_name=None):\n", - " \"\"\"Upload a file to an S3 bucket\n", - "\n", - " :param file_name: File to upload\n", - " :param bucket: Bucket to upload to\n", - " :param object_name: S3 object name. If not specified then file_name is used\n", - " :return: True if file was uploaded, else False\n", - " \"\"\"\n", - "\n", - " # If S3 object_name was not specified, use file_name\n", - " if object_name is None:\n", - " object_name = os.path.basename(file_name)\n", - "\n", - " # Upload the file\n", - " s3_client = boto3.client('s3')\n", - " try:\n", - " response = s3_client.upload_file(file_name, bucket, object_name)\n", - " except ClientError as e:\n", - " logging.error(e)\n", - " return False\n", - " return True\n", - "\n", - "def listfile (folder):\n", - " \"\"\"\n", - " List all files in a specified directory.\n", - " \n", - " Args:\n", - " folder (str): Path to the directory to list files from\n", - " \n", - " Returns:\n", - " list: A list of filenames in the specified directory\n", - " \"\"\"\n", - " dir_list = os.listdir(folder)\n", - " return dir_list\n", - "import boto3\n", - "from botocore.exceptions import ClientError\n", - "import logging\n", - "\n", - "def delete_bucket_and_objects(bucket_name):\n", - " # Create an S3 client\n", - " s3_client = boto3.client('s3')\n", - " # Create an S3 resource\n", - " s3 = boto3.resource('s3')\n", - " bucket = s3.Bucket(bucket_name)\n", - " bucket.objects.all().delete()\n", - " # Delete the bucket itself\n", - " bucket.delete()" - ] - }, - { - "cell_type": "markdown", - "id": "c549427c-9d3d-485c-a542-93ef49b540fe", - "metadata": {}, - "source": [ - "#### Download and prepare dataset\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "217702d0-42bb-4da0-b14b-37b4d0c0b503", - "metadata": {}, - "outputs": [], - "source": [ - "# if not yet created, create folder already\n", - "#!mkdir -p ./data\n", - "\n", - "from urllib.request import urlretrieve\n", - "urls = [\n", - " 'https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/2022-Shareholder-Letter.pdf',\n", - " 'https://s2.q4cdn.com/299287126/files/doc_financials/2022/ar/2021-Shareholder-Letter.pdf',\n", - " 'https://www.apple.com/newsroom/pdfs/Q3FY18ConsolidatedFinancialStatements.pdf'\n", - "]\n", - "\n", - "filenames = [\n", - " 'AMZN-2022-Shareholder-Letter.pdf',\n", - " 'AMZN-2021-Shareholder-Letter.pdf',\n", - " 'Q3FY18ConsolidatedFinancialStatements.pdf'\n", - "]\n", - "\n", - "data_root = \"./data/\"\n", - "\n", - "for idx, url in enumerate(urls):\n", - " file_path = data_root + filenames[idx]\n", - " urlretrieve(url, file_path)" - ] - }, - { - "cell_type": "markdown", - "id": "48551cb4-02a8-4426-88aa-089516a4c1d9", - "metadata": {}, - "source": [ - "#### Create 3 S3 buckets for 3 data sources \n", - "##### Check if 
-  {
-   "cell_type": "markdown",
-   "id": "48551cb4-02a8-4426-88aa-089516a4c1d9",
-   "metadata": {},
-   "source": [
-    "#### Create 3 S3 buckets for 3 data sources \n",
-    "##### Check whether each bucket exists, and if not, create one S3 bucket per knowledge base data source; each bucket will be used to load the files chunked with the corresponding strategy: SEMANTIC, FIXED_SIZE, HIERARCHICAL"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "67a84784-fbfe-4ba5-88a5-d2e2a382bfac",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import random\n",
-    "# random suffix to keep the bucket names globally unique\n",
-    "suffix = random.randrange(200, 900)\n",
-    "s3_client = boto3.client('s3')\n",
-    "bucket_name_semantic = 'kb-dataset-bucket-semantic-' + str(suffix)\n",
-    "bucket_name_fixed = 'kb-dataset-bucket-fixed-' + str(suffix)\n",
-    "bucket_name_hierarchical = 'kb-dataset-bucket-hierarchical-' + str(suffix)\n",
-    "s3_policy_name = 'AmazonBedrockS3PolicyForKnowledgeBase_' + str(suffix)\n",
-    "\n",
-    "createbucket(bucket_name_semantic)\n",
-    "createbucket(bucket_name_fixed)\n",
-    "createbucket(bucket_name_hierarchical)\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "2a54bc11-8de7-4a67-b8ed-0ba2944fb9e4",
-   "metadata": {},
-   "source": [
-    "##### Create a read-and-list S3 policy on the 3 buckets and attach it to the existing Bedrock role \"bedrock_kb_execution_role\"\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "6230aaf0-2aa6-429e-9824-0ef2f9b12579",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "account_number = boto3.client('sts').get_caller_identity().get('Account')\n",
-    "iam_client = session.client('iam')\n",
-    "s3_policy_document = {\n",
-    "    \"Version\": \"2012-10-17\",\n",
-    "    \"Statement\": [\n",
-    "        {\n",
-    "            \"Effect\": \"Allow\",\n",
-    "            \"Action\": [\n",
-    "                \"s3:GetObject\",\n",
-    "                \"s3:ListBucket\"\n",
-    "            ],\n",
-    "            \"Resource\": [\n",
-    "                f\"arn:aws:s3:::{bucket_name_semantic}\",\n",
-    "                f\"arn:aws:s3:::{bucket_name_semantic}/*\",\n",
-    "                f\"arn:aws:s3:::{bucket_name_fixed}\",\n",
-    "                f\"arn:aws:s3:::{bucket_name_fixed}/*\",\n",
-    "                f\"arn:aws:s3:::{bucket_name_hierarchical}\",\n",
-    "                f\"arn:aws:s3:::{bucket_name_hierarchical}/*\"\n",
-    "            ],\n",
-    "            \"Condition\": {\n",
-    "                \"StringEquals\": {\n",
-    "                    \"aws:ResourceAccount\": f\"{account_number}\"\n",
-    "                }\n",
-    "            }\n",
-    "        }\n",
-    "    ]\n",
-    "}\n",
-    "s3_policy = iam_client.create_policy(\n",
-    "    PolicyName=s3_policy_name,\n",
-    "    PolicyDocument=json.dumps(s3_policy_document),\n",
-    "    Description='Policy for reading documents from s3')\n",
-    "\n",
-    "# fetch the ARN of this policy and attach it to the KB execution role\n",
-    "s3_policy_arn = s3_policy[\"Policy\"][\"Arn\"]\n",
-    "iam_client.attach_role_policy(\n",
-    "    RoleName=bedrock_kb_execution_role,\n",
-    "    PolicyArn=s3_policy_arn\n",
-    ")"
-   ]
-  },
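-  {
-   "cell_type": "markdown",
-   "id": "0aa10003-0000-4000-8000-000000000001",
-   "metadata": {},
-   "source": [
-    "##### IAM changes are eventually consistent, so it can help to confirm the policy is attached before ingestion starts. A minimal check (an added sketch) using the standard `list_attached_role_policies` call:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "0aa10003-0000-4000-8000-000000000002",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# list the managed policies attached to the KB execution role\n",
-    "attached = iam_client.list_attached_role_policies(RoleName=bedrock_kb_execution_role)\n",
-    "for p in attached['AttachedPolicies']:\n",
-    "    print(p['PolicyName'])"
-   ]
-  },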
\"content\": prompt}],\n", - " \"temperature\": 0.0,\n", - " \"top_p\": 1,\n", - " \"system\": ''\n", - " }\n", - " )\n", - " response = bedrock.invoke_model(body=body, modelId=MODEL_NAME)\n", - " response_body = json.loads(response.get('body').read())\n", - " return response_body.get('content')[0].get('text')" - ] - }, - { - "cell_type": "markdown", - "id": "339eb2ae-e825-435f-b77b-0524144f081c", - "metadata": {}, - "source": [ - " #### Chunkingadvise function\n", - " ##### Analyzes a PDF document and recommends optimal LLM chunking strategy with parameters. This function loads a PDF file, analyzes its content using LLM, and provides recommendations for chunking strategy (FIXED_SIZE, NONE, HIERARCHICAL, or SEMANTIC) along with specific configuration parameters.\n", - " - Args:\n", - " - file (str): Name of the PDF file located in the 'data' directory\n", - " - Returns:\n", - " - dict: JSON containing recommended chunking strategy and parameters:\n", - " For HIERARCHICAL:\n", - " - Recommend only one Strategy\n", - " - Maximum Parent chunk token size\n", - " - Maximum child chunk token size\n", - " - Overlap Tokens\n", - " - Rational\n", - " - For SEMANTIC:\n", - " - Recommend only one Strategy\n", - " - Maximum tokens\n", - " - Buffer size\n", - " - Breakpoint percentile threshold\n", - " - Rational:\n", - " - For FIXED-SIZE\n", - " - overlapPercentage'\n", - " - parsed_data['maxTokens'\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7f99efd3-d16a-411a-9ad7-70522b9e1641", - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c5801051-5411-4659-a303-c06aed74af04", - "metadata": {}, - "outputs": [], - "source": [ - "def Chunkingadvise (file):\n", - " \"\"\"\n", - " Analyzes a PDF document and recommends optimal LLM chunking strategy with parameters.\n", - "\n", - " This function loads a PDF file, analyzes its content using LLM, and provides \n", - " recommendations for chunking strategy (FIXED_SIZE, NONE, HIERARCHICAL, or SEMANTIC)\n", - " along with specific configuration parameters.\n", - " Args:\n", - " file (str): Name of the PDF file located in the 'data' directory\n", - " Returns:\n", - " dict: JSON containing recommended chunking strategy and parameters:\n", - " For HIERARCHICAL:\n", - " - Recommend only one Strategy\n", - " - Maximum Parent chunk token size\n", - " - Maximum child chunk token size\n", - " - Overlap Tokens\n", - " - Rational\n", - " For SEMANTIC:\n", - " - Recommend only one Strategy\n", - " - Maximum tokens\n", - " - Buffer size\n", - " - Breakpoint percentile threshold\n", - " - Rational\n", - " \"\"\"\n", - " my_docs = []\n", - " my_strategies =[]\n", - " strategy=\"\"\n", - " strategytext=\"\"\n", - " path=\"data\"\n", - " strategylist =[]\n", - " metadata = [\n", - " dict(year=2023, source=file)]\n", - " from langchain.document_loaders import PyPDFLoader\n", - " file = path +\"/\"+ file\n", - " loader = PyPDFLoader(file)\n", - " document = loader.load()\n", - " loader = PyPDFLoader(file)\n", - " # print (\"path + file :: \", file)\n", - " document = loader.load()\n", - " # print (\"path + file :: \", document)\n", - " \n", - " prompt = f\"\"\"SYSTEM you are an advisor expert in LLM chunking strategies,\n", - " USER can you analyze the type,content, format, structure and size of {document}. \n", - " Can you advise on best LLM chunking Strategy based on this analysis. 
-  {
-   "cell_type": "markdown",
-   "id": "339eb2ae-e825-435f-b77b-0524144f081c",
-   "metadata": {},
-   "source": [
-    "#### Chunkingadvise function\n",
-    "##### Analyzes a PDF document and recommends an optimal chunking strategy with parameters. This function loads a PDF file, analyzes its content using an LLM, and provides a recommendation for a chunking strategy (FIXED_SIZE, NONE, HIERARCHICAL, or SEMANTIC) along with specific configuration parameters.\n",
-    "- Args:\n",
-    "  - file (str): Name of the PDF file located in the 'data' directory\n",
-    "- Returns:\n",
-    "  - dict: JSON containing the recommended chunking strategy and parameters:\n",
-    "    - For HIERARCHICAL: Recommend only one Strategy, Maximum Parent chunk token size, Maximum child chunk token size, Overlap Tokens, Rationale\n",
-    "    - For SEMANTIC: Recommend only one Strategy, Maximum tokens, Buffer size, Breakpoint percentile threshold, Rationale\n",
-    "    - For FIXED_SIZE: Recommend only one Strategy, maxTokens, overlapPercentage, Rationale\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "7f99efd3-d16a-411a-9ad7-70522b9e1641",
-   "metadata": {},
-   "outputs": [],
-   "source": []
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "c5801051-5411-4659-a303-c06aed74af04",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "def Chunkingadvise(file):\n",
-    "    \"\"\"\n",
-    "    Analyzes a PDF document and recommends an optimal LLM chunking strategy with parameters.\n",
-    "\n",
-    "    This function loads a PDF file, analyzes its content using an LLM, and provides\n",
-    "    a recommendation for a chunking strategy (FIXED_SIZE, NONE, HIERARCHICAL, or SEMANTIC)\n",
-    "    along with specific configuration parameters.\n",
-    "    Args:\n",
-    "        file (str): Name of the PDF file located in the 'data' directory\n",
-    "    Returns:\n",
-    "        dict: JSON containing the recommended chunking strategy and parameters\n",
-    "    \"\"\"\n",
-    "    from langchain.document_loaders import PyPDFLoader\n",
-    "    path = \"data\"\n",
-    "    file = path + \"/\" + file\n",
-    "    loader = PyPDFLoader(file)\n",
-    "    document = loader.load()\n",
-    "\n",
-    "    prompt = f\"\"\"SYSTEM you are an advisor expert in LLM chunking strategies,\n",
-    "    USER can you analyze the type, content, format, structure and size of {document}.\n",
-    "    Can you advise on the best LLM chunking strategy based on this analysis. Recommend only one strategy, but show the preference ratio for the recommended strategy.\n",
-    "    Available strategies to recommend from are: FIXED_SIZE or NONE or HIERARCHICAL or SEMANTIC.\n",
-    "    Decide on the recommendation first, and then state what the recommendation is. \"\"\"\n",
-    "    res = get_completion(prompt)\n",
-    "    print(res)\n",
-    "    prompt = f\"\"\" USER based on the recommendation provided in {res},\n",
-    "    if you recommend HIERARCHICAL chunking then provide a recommendation for:\n",
-    "    Parent: Maximum parent chunk token size.\n",
-    "    Child: Maximum child chunk token size and Overlap Tokens: Number of overlap tokens between each parent chunk and between each parent and its children.\n",
-    "    If the recommendation is HIERARCHICAL then provide the response using JSON format\n",
-    "    with the keys \\\"Recommend only one Strategy\\\", \\\"Maximum Parent chunk token size\\\", \\\"Maximum child chunk token size\\\", \\\"Overlap Tokens\\\",\n",
-    "    \\\"Rationale: please explain the rationale for the decision and why each other choice is not preferred; keep the rationale to 100 words maximum.\\\" Provide a crisp and clear answer.\n",
-    "    If you recommend SEMANTIC then provide the response using JSON format with\n",
-    "    the keys \\\"Recommend only one Strategy\\\", \\\"Maximum tokens\\\", \\\"Buffer size\\\", \\\"Breakpoint percentile threshold\\\",\n",
-    "    Buffer size should be less than or equal to 1, Breakpoint percentile threshold should be >= 50,\n",
-    "    \\\"Rationale: please explain the rationale for the decision and why each other choice is not preferred; keep the rationale to 100 words maximum.\\\" Provide a crisp and clear answer.\n",
-    "    Do not provide a recommendation if there are not enough data inputs; instead say sorry I need more data.\n",
-    "    If you recommend FIXED_SIZE then provide the response using JSON format with\n",
-    "    the keys \\\"Recommend only one Strategy\\\", \\\"maxTokens\\\", \\\"overlapPercentage\\\",\n",
-    "    \\\"Rationale: please explain the rationale for the decision and why each other choice is not preferred; keep the rationale to 100 words maximum.\\\" Provide a crisp and clear answer.\n",
-    "    Do not provide a recommendation if there are not enough data inputs; instead say sorry I need more data.\"\"\"\n",
-    "    res = get_completion(prompt)\n",
-    "    print(res)\n",
-    "    parsed_data = json.loads(res)\n",
-    "    return parsed_data"
-   ]
-  },
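-  {
-   "cell_type": "markdown",
-   "id": "0aa10005-0000-4000-8000-000000000001",
-   "metadata": {},
-   "source": [
-    "##### `json.loads` raises if the model wraps its JSON answer in extra prose. A defensive parsing sketch (an addition, not part of the original flow) that extracts the outermost JSON object before decoding; it could replace the bare `json.loads(res)` call above:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "0aa10005-0000-4000-8000-000000000002",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "def parse_json_response(text):\n",
-    "    \"\"\"Best-effort extraction of the first top-level JSON object in a model response.\"\"\"\n",
-    "    # take the slice between the first '{' and the last '}'\n",
-    "    start, end = text.find('{'), text.rfind('}')\n",
-    "    if start == -1 or end == -1:\n",
-    "        raise ValueError('no JSON object found in model response')\n",
-    "    return json.loads(text[start:end + 1])"
-   ]
-  },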
-  {
-   "cell_type": "markdown",
-   "id": "579229ec-23b4-4e9f-99db-cb78c7453e4e",
-   "metadata": {},
-   "source": [
-    "#### ingestbystrategy: function that configures chunking strategy parameters for Bedrock Knowledge Base ingestion based on the recommended strategy.\n",
-    "- Args:\n",
-    "  - parsed_data (dict): Dictionary containing the chunking strategy recommendation and parameters\n",
-    "- Returns:\n",
-    "  - tuple: Contains:\n",
-    "    - chunkingStrategyConfiguration (dict): Configuration for the chosen chunking strategy\n",
-    "    - bucket_name (str): S3 bucket name for storage\n",
-    "    - name (str): Knowledge base data source name\n",
-    "    - description (str): Knowledge base data source description\n",
-    "    - s3Configuration (dict): S3 configuration with bucket ARN\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "a016a3c8-6759-40fe-a997-91357a3f48e9",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "def ingestbystrategy(parsed_data):\n",
-    "    \"\"\"\n",
-    "    Configures chunking strategy parameters for Bedrock Knowledge Base ingestion based on the recommended strategy.\n",
-    "\n",
-    "    Args:\n",
-    "        parsed_data (dict): Dictionary containing the chunking strategy recommendation and parameters\n",
-    "\n",
-    "    Returns:\n",
-    "        tuple: Contains:\n",
-    "            - chunkingStrategyConfiguration (dict): Configuration for the chosen chunking strategy\n",
-    "            - bucket_name (str): S3 bucket name for storage\n",
-    "            - name (str): Knowledge base data source name\n",
-    "            - description (str): Knowledge base data source description\n",
-    "            - s3Configuration (dict): S3 configuration with bucket ARN\n",
-    "\n",
-    "    Example:\n",
-    "        >>> strategy_config, bucket, kb_name, desc, s3_config = ingestbystrategy(strategy_data)\n",
-    "    \"\"\"\n",
-    "    chunkingStrategyConfiguration = {}\n",
-    "    strategy = parsed_data['Recommend only one Strategy']\n",
-    "\n",
-    "    # HIERARCHICAL chunking\n",
-    "    if strategy == 'HIERARCHICAL':\n",
-    "        p1 = parsed_data['Maximum Parent chunk token size']\n",
-    "        p2 = parsed_data['Maximum child chunk token size']\n",
-    "        p3 = parsed_data['Overlap Tokens']\n",
-    "        bucket_name = bucket_name_hierarchical\n",
-    "        name = \"bedrock-sample-knowledge-base-HIERARCHICAL\"\n",
-    "        description = \"Bedrock Knowledge Base data source for S3, HIERARCHICAL chunking\"\n",
-    "        chunkingStrategyConfiguration = {\n",
-    "            \"chunkingStrategy\": \"HIERARCHICAL\",\n",
-    "            \"hierarchicalChunkingConfiguration\": {\n",
-    "                'levelConfigurations': [\n",
-    "                    {\n",
-    "                        'maxTokens': p1\n",
-    "                    },\n",
-    "                    {\n",
-    "                        'maxTokens': p2\n",
-    "                    }\n",
-    "                ],\n",
-    "                'overlapTokens': p3\n",
-    "            }\n",
-    "        }\n",
-    "\n",
-    "    # SEMANTIC chunking\n",
-    "    if strategy == 'SEMANTIC':\n",
-    "        p3 = parsed_data['Maximum tokens']\n",
-    "        p2 = int(parsed_data['Buffer size'])\n",
-    "        p1 = parsed_data['Breakpoint percentile threshold']\n",
-    "        bucket_name = bucket_name_semantic\n",
-    "        name = \"bedrock-sample-knowledge-base-SEMANTIC\"\n",
-    "        description = \"Bedrock Knowledge Base data source for S3, SEMANTIC chunking\"\n",
-    "        chunkingStrategyConfiguration = {\n",
-    "            \"chunkingStrategy\": \"SEMANTIC\",\n",
-    "            \"semanticChunkingConfiguration\": {\n",
-    "                'breakpointPercentileThreshold': p1,\n",
-    "                'bufferSize': p2,\n",
-    "                'maxTokens': p3\n",
-    "            }\n",
-    "        }\n",
-    "\n",
-    "    # FIXED_SIZE chunking\n",
-    "    if strategy == 'FIXED_SIZE':\n",
-    "        p2 = int(parsed_data['overlapPercentage'])\n",
-    "        p1 = int(parsed_data['maxTokens'])\n",
-    "        bucket_name = bucket_name_fixed\n",
-    "        name = \"bedrock-sample-knowledge-base-FIXED\"\n",
-    "        description = \"Bedrock Knowledge Base data source for S3, FIXED_SIZE chunking\"\n",
-    "        chunkingStrategyConfiguration = {\n",
-    "            \"chunkingStrategy\": \"FIXED_SIZE\",\n",
-    "            \"fixedSizeChunkingConfiguration\": {\n",
-    "                \"maxTokens\": p1,\n",
-    "                \"overlapPercentage\": p2\n",
-    "            }\n",
-    "        }\n",
-    "\n",
-    "    s3Configuration = {\n",
-    "        \"bucketArn\": f\"arn:aws:s3:::{bucket_name}\",\n",
-    "    }\n",
-    "    return chunkingStrategyConfiguration, bucket_name, name, description, s3Configuration"
-   ]
-  },
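-  {
-   "cell_type": "markdown",
-   "id": "0aa10006-0000-4000-8000-000000000001",
-   "metadata": {},
-   "source": [
-    "##### For illustration (an added sketch), a hand-written recommendation dict with hypothetical values, not actual model output, run through `ingestbystrategy` to show the shape of the resulting configuration:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "0aa10006-0000-4000-8000-000000000002",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "sample_recommendation = {\n",
-    "    \"Recommend only one Strategy\": \"FIXED_SIZE\",\n",
-    "    \"maxTokens\": 512,          # hypothetical value\n",
-    "    \"overlapPercentage\": 20    # hypothetical value\n",
-    "}\n",
-    "cfg, bucket, ds_name, ds_desc, s3cfg = ingestbystrategy(sample_recommendation)\n",
-    "print(json.dumps(cfg, indent=2))\n",
-    "print(bucket, '->', ds_name)"
-   ]
-  },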
-  {
-   "cell_type": "markdown",
-   "id": "4cc202b3-12b9-45e7-a610-e76c28b142ad",
-   "metadata": {},
-   "source": [
-    "#### Function to create or retrieve a data source in an Amazon Bedrock Knowledge Base\n",
-    "\n",
-    "#### First checks if a data source with the given name exists. If found, returns the existing data source. Otherwise creates a new one with the specified configurations.\n",
-    "- Args:\n",
-    "  - name (str): Name of the data source\n",
-    "  - description (str): Description of the data source\n",
-    "  - knowledgeBaseId (str): ID of the knowledge base to create the data source in\n",
-    "  - s3Configuration (dict): S3 bucket configuration for the data source\n",
-    "  - chunkingStrategyConfiguration (dict): Configuration for the text chunking strategy\n",
-    "- Returns:\n",
-    "  - dict: Response containing the data source details from Bedrock"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "90628e2d-913d-4290-8d40-78dacd3d0e9b",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "def createDS(name, description, knowledgeBaseId, s3Configuration, chunkingStrategyConfiguration):\n",
-    "    \"\"\"\n",
-    "    Creates or retrieves a data source in an Amazon Bedrock Knowledge Base.\n",
-    "    \n",
-    "    First checks if a data source with the given name exists. If found, returns the existing\n",
-    "    data source. Otherwise creates a new one with the specified configurations.\n",
-    "\n",
-    "    Args:\n",
-    "        name (str): Name of the data source\n",
-    "        description (str): Description of the data source\n",
-    "        knowledgeBaseId (str): ID of the knowledge base to create the data source in\n",
-    "        s3Configuration (dict): S3 bucket configuration for the data source\n",
-    "        chunkingStrategyConfiguration (dict): Configuration for the text chunking strategy\n",
-    "\n",
-    "    Returns:\n",
-    "        dict: Response containing the data source details from Bedrock\n",
-    "\n",
-    "    Raises:\n",
-    "        ClientError: If there's an error accessing or creating the data source\n",
-    "    \"\"\"\n",
-    "    response = bedrock_agent_client.list_data_sources(\n",
-    "        knowledgeBaseId=knowledgeBaseId,\n",
-    "        maxResults=12\n",
-    "    )\n",
-    "    for i in range(len(response[\"dataSourceSummaries\"])):\n",
-    "        print(response[\"dataSourceSummaries\"][i][\"name\"], \"::\", name)\n",
-    "        print(response[\"dataSourceSummaries\"][i][\"dataSourceId\"])\n",
-    "        if response[\"dataSourceSummaries\"][i][\"name\"] == name:\n",
-    "            # a data source with this name already exists; return it\n",
-    "            ds = bedrock_agent_client.get_data_source(knowledgeBaseId = knowledgeBaseId, dataSourceId = response[\"dataSourceSummaries\"][i][\"dataSourceId\"])\n",
-    "            return ds\n",
-    "    \n",
-    "    ds = bedrock_agent_client.create_data_source(\n",
-    "        name = name,\n",
-    "        description = description,\n",
-    "        knowledgeBaseId = knowledgeBaseId,\n",
-    "        dataDeletionPolicy = 'DELETE',\n",
-    "        dataSourceConfiguration = {\n",
-    "            # For S3\n",
-    "            \"type\": \"S3\",\n",
-    "            \"s3Configuration\": s3Configuration\n",
-    "            # For a Web URL data source instead:\n",
-    "            # \"type\": \"WEB\",\n",
-    "            # \"webConfiguration\": webConfiguration\n",
-    "        },\n",
-    "        vectorIngestionConfiguration = {\n",
-    "            \"chunkingConfiguration\": chunkingStrategyConfiguration\n",
-    "        })\n",
-    "    \n",
-    "    return ds"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "8e60ea6a-a1dc-47f8-9922-b7c70be750d1",
-   "metadata": {},
-   "source": [
-    "### Process PDF files by analyzing content, creating data sources, and uploading to S3.\n",
-    "\n",
-    "#### Workflow:\n",
-    "1. Lists all files in the specified directory\n",
-    "2. For each PDF:\n",
-    "   - Analyzes it for the optimal chunking strategy\n",
-    "   - Creates a data source with the recommended configuration\n",
-    "   - Uploads the file to the appropriate S3 bucket\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "cee5b445-2945-448d-8bb9-250d47f63672",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "s3_client = boto3.client('s3')\n",
-    "dir_list1 = listfile(\"data\")\n",
-    "print(dir_list1)\n",
-    "for file in dir_list1:\n",
-    "    print(\"Processing file:\", file)\n",
-    "    if file.endswith(\".pdf\"):\n",
-    "        strategy = Chunkingadvise(file)\n",
-    "        chunkingStrategyConfiguration, bucket_name, name, description, s3Configuration = ingestbystrategy(strategy)\n",
-    "        print(\"name\", name)\n",
-    "        datasources = createDS(name, description, kb_id, s3Configuration, chunkingStrategyConfiguration)\n",
-    "        print(datasources)\n",
-    "        with open(path + \"/\" + file, \"rb\") as f:\n",
-    "            print(bucket_name)\n",
-    "            s3_client.upload_fileobj(f, bucket_name, file)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "ceaa41da-ecab-4592-8d78-59815e0dfb62",
-   "metadata": {},
-   "source": [
-    "#### Starts and monitors ingestion jobs for all data sources in the knowledge base."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "43d0d10e-40e3-4769-a5e1-d115fce38041",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import time\n",
-    "\"\"\"\n",
-    "    Starts and monitors ingestion jobs for all data sources in a knowledge base.\n",
-    "\"\"\"\n",
-    "sources = bedrock_agent_client.list_data_sources(knowledgeBaseId = kb_id)\n",
-    "for i in range(len(sources[\"dataSourceSummaries\"])):\n",
-    "    data_source_id = sources[\"dataSourceSummaries\"][i][\"dataSourceId\"]\n",
-    "    print(\"ds [dataSourceId]\", data_source_id)\n",
-    "    start_job_response = bedrock_agent_client.start_ingestion_job(knowledgeBaseId = kb_id, dataSourceId = data_source_id)\n",
-    "    job = start_job_response[\"ingestionJob\"]\n",
-    "    print(job)\n",
-    "    # poll until the job reaches a terminal state\n",
-    "    while job['status'] not in ('COMPLETE', 'FAILED'):\n",
-    "        get_job_response = bedrock_agent_client.get_ingestion_job(\n",
-    "            knowledgeBaseId = kb_id, dataSourceId = data_source_id, ingestionJobId = job[\"ingestionJobId\"]\n",
-    "        )\n",
-    "        job = get_job_response[\"ingestionJob\"]\n",
-    "        time.sleep(10)"
-   ]
-  },
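-  {
-   "cell_type": "markdown",
-   "id": "0aa10007-0000-4000-8000-000000000001",
-   "metadata": {},
-   "source": [
-    "##### Once a job finishes, `get_ingestion_job` also reports per-job counts. An added sketch (the `statistics` field is assumed from the ingestion job response shape and guarded with `.get`) that prints the final status and statistics of the last job:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "0aa10007-0000-4000-8000-000000000002",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# final status plus the job's statistics block, if present\n",
-    "print(job['status'])\n",
-    "print(json.dumps(job.get('statistics', {}), indent=2, default=str))"
-   ]
-  },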
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cee5b445-2945-448d-8bb9-250d47f63672", - "metadata": {}, - "outputs": [], - "source": [ - "s3_client = boto3.client('s3')\n", - "dir_list1= listfile (\"data\")\n", - "print(dir_list1)\n", - "strategylist= []\n", - "for file in dir_list1:\n", - " print (\" print(f)\" , file)\n", - " if \".pdf\" in file:\n", - " chunkingStrategyConfiguration=[]\n", - " strategy = Chunkingadvise (file)\n", - " chunkingStrategyConfiguration ,bucket_name , name , description ,s3Configuration = ingestbystrategy(strategy)\n", - " print (\"name\", name)\n", - " datasources = createDS (name, description,kb_id, s3Configuration , chunkingStrategyConfiguration )\n", - " print (datasources)\n", - " #ds_id = datasources[0][\"dataSource\"][\"dataSourceId\"]\n", - " with open( path +\"/\"+ file, \"rb\") as f:\n", - " print(bucket_name)\n", - " print(f)\n", - " s3_client.upload_fileobj(f, bucket_name, file)\n", - " #print (strategylist) " - ] - }, - { - "cell_type": "markdown", - "id": "ceaa41da-ecab-4592-8d78-59815e0dfb62", - "metadata": {}, - "source": [ - "#### Starts Ingestin and monitors ingestion jobs for all data sources in a knowledge base." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "43d0d10e-40e3-4769-a5e1-d115fce38041", - "metadata": {}, - "outputs": [], - "source": [ - "from datetime import datetime \n", - "import time\n", - "\"\"\"\n", - " Starts and monitors ingestion jobs for all data sources in a knowledge base.\n", - "\"\"\"\n", - "sources = bedrock_agent_client.list_data_sources(knowledgeBaseId = kb_id )\n", - "for i in range(len(sources[\"dataSourceSummaries\"])):\n", - " print (\"ds [dataSourceId]\", sources[\"dataSourceSummaries\"] [i-1] [\"dataSourceId\"])\n", - " start_job_response = bedrock_agent_client.start_ingestion_job(knowledgeBaseId = kb_id, dataSourceId = sources[\"dataSourceSummaries\"] [i-1] [\"dataSourceId\"])\n", - " job = start_job_response[\"ingestionJob\"]\n", - " print (job)\n", - " # Get job \n", - " while(job['status']!='COMPLETE' ):\n", - " get_job_response = bedrock_agent_client.get_ingestion_job(\n", - " knowledgeBaseId = kb_id, dataSourceId = sources[\"dataSourceSummaries\"][i-1] [\"dataSourceId\"], ingestionJobId = job[\"ingestionJobId\"]\n", - " )\n", - " job = get_job_response[\"ingestionJob\"]\n", - " time.sleep(10)" - ] - }, - { - "cell_type": "markdown", - "id": "30c5e219-97bd-4219-8f97-cd7be339cc5e", - "metadata": {}, - "source": [ - "#### Try out KB using RetrieveAndGenerate API" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ba77eec6-d83e-4263-ab09-1eefaefa8b03", - "metadata": {}, - "outputs": [], - "source": [ - "model_id = \"anthropic.claude-3-5-sonnet-20241022-v2:0\" # \n", - "model_arn = f'arn:aws:bedrock:us-west-2::foundation-model/{model_id}'\n", - "bedrock_agent_runtime_client = boto3.client(\"bedrock-agent-runtime\", region_name=AWS_REGION)\n", - "# ucomment to test \n", - "#query = \"what is AWS annualized revenue run rate\"\n", - "query = \"what is iphone sales in 2018\"\n", - "response = bedrock_agent_runtime_client.retrieve_and_generate(\n", - " input={\n", - " 'text': query\n", - " },\n", - " retrieveAndGenerateConfiguration={\n", - " 'type': 'KNOWLEDGE_BASE',\n", - " 'knowledgeBaseConfiguration': {\n", - " 'knowledgeBaseId': kb_id,\n", - " 'modelArn': model_arn\n", - " }\n", - " },\n", - ")\n", - "\n", - "generated_text = response['output']['text']\n", - "\n", - "print(generated_text)" - ] - }, - { - "cell_type": "markdown", - 
"id": "bec94bfe-c99e-4e1c-9e97-bad5b3d0c09e", - "metadata": {}, - "source": [ - "##### Clean buckets \n", - "##### NOTE : please delete also Bedrock KB if not required by other works and data sources \n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e67d50d0-8963-40d6-90be-be4c654c015f", - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "delete_bucket_and_objects(bucket_name_semantic)\n", - "delete_bucket_and_objects(bucket_name_fixed)\n", - "delete_bucket_and_objects(bucket_name_hierachical)" - ] - }, - { - "cell_type": "markdown", - "id": "f157092f-7534-41f5-b664-e9f1ca67d6bc", - "metadata": {}, - "source": [ - "#### Conclusion: \n", - "\n", - "This notebook demonstrates an experimental approach using Foundation Models to determine optimal chunking strategies for different document types. While showing promising initial results, the current methodology is exploratory and requires further refinement.\n", - "\n", - "#### Disclaimer:\n", - "\n", - "Recommendations are based on base Foundation Models without fine-tuning\n", - "Results, validation against ground truth data would be required to validate results accurracy \n", - "\n", - "#### Proposed Next Steps:\n", - "\n", - "1.Model Enhancement\n", - " Implement fine-tuning on domain-specific data\n", - " Experiment with different prompt engineering mechanisms\n", - "\n", - "2.Validation Framework\n", - " Establish ground truth dataset for testing\n", - " Develop evaluation metrics for chunking quality\n", - " Create a systematic testing methodology\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1443821d-ad4c-4361-883f-002682160108", - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.10" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/rag/open-source/chunking/chunking_advisor/data/empty.txt b/rag/open-source/chunking/chunking_advisor/data/empty.txt deleted file mode 100644 index 307a9857..00000000 --- a/rag/open-source/chunking/chunking_advisor/data/empty.txt +++ /dev/null @@ -1 +0,0 @@ -This is an empty file \ No newline at end of file diff --git a/rag/open-source/chunking/chunking_advisor/requirements.txt b/rag/open-source/chunking/chunking_advisor/requirements.txt deleted file mode 100644 index f9994574..00000000 --- a/rag/open-source/chunking/chunking_advisor/requirements.txt +++ /dev/null @@ -1,7 +0,0 @@ -boto3 -opensearch-py -botocore -awscli -retrying -opensearch-py==2.3.1 -pypdf \ No newline at end of file