diff --git a/rag/knowledge-bases/features-examples/08-chunking-recommender/ChunkingRecommender.ipynb b/rag/knowledge-bases/features-examples/08-chunking-recommender/ChunkingRecommender.ipynb
new file mode 100644
index 00000000..5a14a5d4
--- /dev/null
+++ b/rag/knowledge-bases/features-examples/08-chunking-recommender/ChunkingRecommender.ipynb
@@ -0,0 +1,1002 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "d4a64fff-2c1e-4f34-9c7f-0c8180bb039a",
+   "metadata": {},
+   "source": [
+    "## Automated Document Processing Pipeline: Using Foundation Models for Smart Document Chunking and Knowledge Base Integration\n",
+    "\n",
+    "\n",
+    "##### This notebook requires an existing knowledge base on Amazon Bedrock. To create one, you can execute the code provided in:\n",
+    "https://github.com/aws-samples/amazon-bedrock-workshop/blob/main/02_KnowledgeBases_and_RAG/0_create_ingest_documents_test_kb.ipynb\n",
+    "### Challenge\n",
+    "\n",
+    "\n",
+    "Chunking is the process of dividing documents into smaller sections, or \"chunks,\" before embedding them into a knowledge base. This process enhances retrieval efficiency and precision. There are several chunking strategies available, each suited to different types of content and document structures. Examples of chunking strategies supported by Amazon Bedrock are: \n",
+    "- FIXED_SIZE: Splitting documents into chunks of the approximate size that you set.\n",
+    "- HIERARCHICAL: Splitting documents into layers of chunks, where the first layer contains large chunks and the second layer contains smaller chunks derived from the first layer.\n",
+    "- SEMANTIC: Splitting documents into chunks based on groups of similar content identified with natural language processing.\n",
+    "\n",
+    "FIXED_SIZE chunking is useful in scenarios requiring predictable chunk sizes for processing. HIERARCHICAL chunking is appropriate when dealing with complex, nested data structures, whereas SEMANTIC chunking is useful for complex, contextual information and for documents where meaning across sentences is highly interconnected. \n",
+    "The main drawbacks of semantic chunking include higher computational requirements, limited effectiveness across different languages, and scalability challenges with large datasets. The main drawbacks of hierarchical chunking include higher computational overhead, difficulty in managing deep hierarchies, and slower query performance at deeper levels.\n",
+    "\n",
+    "Selecting the right chunking strategy requires understanding the benefits and limitations of each strategy in the context of the analyzed documents, business requirements, and SLAs. To determine an adequate chunking strategy, a developer typically has to assess each document manually before choosing one. The final choice is a balance between efficiency, accuracy, and the practical constraints of the specific use case.\n",
+    "\n",
+    "\n",
+    "### Approach presented in this notebook\n",
+    "\n",
+    "The approach presented in this notebook leverages Foundation Models (FMs) to automate document analysis and ingestion into an Amazon Bedrock Knowledge Base, replacing manual human assessment. 
The system automatically:\n", + "- Analyzes document structure and content\n", + "- Determines the optimal chunking strategy for each document\n", + "- Generates appropriate chunking configurations\n", + "- Executes the document ingestion process\n", + "\n", + "The solution recognizes that different documents require different chunking approaches, and therefore performs individual assessments to optimize content segmentation for each document type. This automation streamlines the process of building and maintaining knowledge bases while ensuring optimal document processing for better retrieval and usage.\n", + "\n", + "The key idea in this work is using FMs to intelligently analyze and process documents, rather than relying on predetermined or manual chunking strategies.\n", + "\n", + "### Notebook Walkthrough\n", + "The pipeline streamlines the entire process from document analysis to knowledge base population, making it efficient to prepare documents for advanced query and retrieval operations.\n", + "\n", + "![data_ingestion](./img/chunkingAdvs.jpg)\n", + "### Steps: \n", + "\n", + "1. Create Amazon Bedrock Knowledge Base execution role and S3 bucket used as data sources and configure necessary IAM policies \n", + "2. Process files within target folder. For each document, analyze and recommends an optimal chunking strategy (FIXED_SIZE/NONE/HIERARCHICAL/SEMANTIC) and specific configuration parameters\n", + "3. Upload analyzed files to designated S3 buckets and configure buckets as data source for Bedrock KB\n", + "4. Initiate ingestion job \n", + "5. Verify data accessibility and accuracy\n" + ] + }, + { + "cell_type": "markdown", + "id": "d066b04e-bc6f-42e6-8836-817d2e0854b1", + "metadata": {}, + "source": [ + "### Setup" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4f36a15a-6cc3-464a-a2f1-fcb2d45ea7b9", + "metadata": {}, + "outputs": [], + "source": [ + "%pip install --force-reinstall -q -r ./requirements.txt --quiet" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "id": "5f694d1b-c537-4bf2-a5b1-d4f5c7025620", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 53, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# restart kernel\n", + "from IPython.core.display import HTML\n", + "\n", + "HTML(\"\")" + ] + }, + { + "cell_type": "markdown", + "id": "9eb01263-04ab-4471-b73a-366055027873", + "metadata": {}, + "source": [ + "### Initiate parameters \n", + "\n", + "##### Knowledge base ID should have been created from first notebook (https://github.com/aws-samples/amazon-bedrock-workshop/blob/main/02_KnowledgeBases_and_RAG/0_create_ingest_documents_test_kb.ipynb) or similar\n", + "- To get knowledge Base Id using Bedrock console, look int Amazon Bedrock > knowledgebase> knowledgebase \n" + ] + }, + { + "cell_type": "code", + "execution_count": 80, + "id": "25239d0e-972d-4fff-b200-f20c39714a9e", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "AmazonBedrockExecutionRoleForKnowledgeBase_758\n" + ] + } + ], + "source": [ + "import boto3\n", + "import json\n", + "\n", + "# create a boto3 session to dynamically get and set the region name\n", + "session = boto3.Session()\n", + "\n", + "AWS_REGION = session.region_name\n", + "bedrock = boto3.client(\"bedrock-runtime\", region_name=AWS_REGION)\n", + "bedrock_agent_client = session.client(\"bedrock-agent\", region_name=AWS_REGION)\n", + "# model was run 
in us-west-2 , if you are using us-east-1 then change model ID to \"us.anthropic.claude-3-5-sonnet-20241022-v2:0\" #\n", + "MODEL_NAME = \"anthropic.claude-3-5-sonnet-20241022-v2:0\"\n", + "datasources = []\n", + "\n", + "# create a folder data if not yet done and\n", + "path = \"data\"\n", + "\n", + "kb_id = \"XXXX\" # Retrieve KB First # update value here with your KB ID\n", + "kb = bedrock_agent_client.get_knowledge_base(knowledgeBaseId=kb_id)\n", + "# get bedrock_kb_execution ID - this role should have been create from notebook creating KB\n", + "bedrock_kb_execution_role_arn = kb[\"knowledgeBase\"][\"roleArn\"]\n", + "bedrock_kb_execution_role = bedrock_kb_execution_role_arn.split(\"/\")[-1]\n", + "account_id = boto3.client(\"sts\").get_caller_identity()[\"Account\"]\n", + "print(bedrock_kb_execution_role)" + ] + }, + { + "cell_type": "markdown", + "id": "b9291ec4-e2dc-47c1-950b-9fa7e737bee3", + "metadata": {}, + "source": [ + "### Supporting functions\n", + "##### Function 1 - Createbucket: Checks if an S3 bucket exists and creates it if it doesn't. \n", + "##### Function 2 - Upload_file: Upload_files to bucket: Upload a file to an S3 bucket\n", + "##### Function 3 - List all files in a specified directory\n", + "##### Function 4 - Delete a S3 bucket and all objects included within" + ] + }, + { + "cell_type": "code", + "execution_count": 81, + "id": "7578f6f2-1cd2-4683-b150-1b6900ff77ee", + "metadata": {}, + "outputs": [], + "source": [ + "import logging\n", + "import boto3\n", + "from botocore.exceptions import ClientError\n", + "import os\n", + "\n", + "\n", + "def createbucket(bucketname):\n", + " \"\"\"\n", + " Checks if an S3 bucket exists and creates it if it doesn't.\n", + " \"\"\"\n", + " try:\n", + " s3_client = boto3.client(\"s3\")\n", + " s3_client.head_bucket(Bucket=bucketname)\n", + " print(f\"Bucket {bucketname} Exists\")\n", + " except ClientError as e:\n", + " print(f\"Creating bucket {bucketname}\")\n", + " if AWS_REGION == \"us-east-1\":\n", + " s3bucket = s3_client.create_bucket(Bucket=bucketname)\n", + " else:\n", + " s3bucket = s3_client.create_bucket(\n", + " Bucket=bucketname,\n", + " CreateBucketConfiguration={\"LocationConstraint\": AWS_REGION},\n", + " )\n", + "\n", + "\n", + "def upload_file(file_name, bucket, object_name=None):\n", + " \"\"\"\n", + " Upload a file to an S3 bucket\n", + " \"\"\"\n", + " # If S3 object_name was not specified, use file_name\n", + " if object_name is None:\n", + " object_name = os.path.basename(file_name)\n", + "\n", + " # Upload the file\n", + " s3_client = boto3.client(\"s3\")\n", + " try:\n", + " response = s3_client.upload_file(file_name, bucket, object_name)\n", + " except ClientError as e:\n", + " logging.error(e)\n", + " return False\n", + " return True\n", + "\n", + "\n", + "def listfile(folder):\n", + " \"\"\"\n", + " List all files in a specified directory.\n", + " \"\"\"\n", + " dir_list = os.listdir(folder)\n", + " return dir_list\n", + "\n", + "\n", + "def delete_bucket_and_objects(bucket_name):\n", + " \"\"\"\n", + " Delete a S3 bucket and all objects included in\n", + " \"\"\"\n", + " # Create an S3 client\n", + " s3_client = boto3.client(\"s3\")\n", + " # Create an S3 resource\n", + " s3 = boto3.resource(\"s3\")\n", + " bucket = s3.Bucket(bucket_name)\n", + " bucket.objects.all().delete()\n", + " # Delete the bucket itself\n", + " bucket.delete()" + ] + }, + { + "cell_type": "markdown", + "id": "5d4b8fcc-5789-4df1-b72d-aef328b1a6c2", + "metadata": {}, + "source": [ + "### Standard prompt completion function" + 
]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 82,
+   "id": "0ee11cad-ea50-4e87-8bd2-b8f159bf4d91",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def get_completion(prompt):\n",
+    "    body = json.dumps(\n",
+    "        {\n",
+    "            # the Bedrock Messages API expects this version string for Anthropic models\n",
+    "            \"anthropic_version\": \"bedrock-2023-05-31\",\n",
+    "            \"max_tokens\": 1500,\n",
+    "            \"messages\": [{\"role\": \"user\", \"content\": prompt}],\n",
+    "            \"temperature\": 0.0,\n",
+    "            \"top_p\": 1,\n",
+    "            \"system\": \"\",\n",
+    "        }\n",
+    "    )\n",
+    "    response = bedrock.invoke_model(body=body, modelId=MODEL_NAME)\n",
+    "    response_body = json.loads(response.get(\"body\").read())\n",
+    "    return response_body.get(\"content\")[0].get(\"text\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c549427c-9d3d-485c-a542-93ef49b540fe",
+   "metadata": {},
+   "source": [
+    "### Download and prepare datasets \n",
+    "The test dataset consists of two documents; these files will serve as test cases to validate the model's ability to correctly identify and recommend the most appropriate chunking strategy for each document type. An optional cell below takes a quick look at the downloaded files."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 83,
+   "id": "217702d0-42bb-4da0-b14b-37b4d0c0b503",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# create the data folder first if it does not exist yet\n",
+    "#!mkdir -p ./data\n",
+    "\n",
+    "from urllib.request import urlretrieve\n",
+    "\n",
+    "urls = [\n",
+    "    \"https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/2022-Shareholder-Letter.pdf\",\n",
+    "    \"https://www.apple.com/newsroom/pdfs/Q3FY18ConsolidatedFinancialStatements.pdf\",\n",
+    "]\n",
+    "filenames = [\n",
+    "    \"AMZN-2022-Shareholder-Letter.pdf\",\n",
+    "    \"Q3FY18ConsolidatedFinancialStatements.pdf\",\n",
+    "]\n",
+    "data_root = \"./data/\"\n",
+    "for idx, url in enumerate(urls):\n",
+    "    file_path = data_root + filenames[idx]\n",
+    "    urlretrieve(url, file_path)"
+   ]
+  },
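+  {
+   "cell_type": "markdown",
+   "id": "b7c2f4e1-1a2b-4c3d-9e8f-0a1b2c3d4e5f",
+   "metadata": {},
+   "source": [
+    "#### Optional: a quick look at the downloaded files\n",
+    "A minimal, illustrative check of file size and page count for each download. It is not required by the rest of the pipeline and assumes the `filenames` and `data_root` variables from the cell above."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c8d3a5f2-2b3c-4d4e-8f9a-1b2c3d4e5f6a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Illustrative only: print size and page count for each downloaded PDF\n",
+    "import os\n",
+    "\n",
+    "from pypdf import PdfReader\n",
+    "\n",
+    "for fname in filenames:\n",
+    "    fpath = os.path.join(data_root, fname)\n",
+    "    reader = PdfReader(fpath)\n",
+    "    print(f\"{fname}: {os.path.getsize(fpath) / 1024:.0f} KB, {len(reader.pages)} pages\")"
+   ]
+  },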
+  {
+   "cell_type": "markdown",
+   "id": "48551cb4-02a8-4426-88aa-089516a4c1d9",
+   "metadata": {},
+   "source": [
+    "### Create 3 S3 buckets, one per chunking strategy\n",
+    "##### Important note on Amazon Bedrock Knowledge Base configuration:\n",
+    "\n",
+    "The chunking strategy for a data source is permanent and cannot be modified after initial configuration. To address this constraint, we implement the following structure:\n",
+    "\n",
+    "Three separate S3 buckets will be created, each dedicated to a specific chunking strategy:\n",
+    "- Bucket for semantic chunking\n",
+    "- Bucket for hierarchical chunking\n",
+    "- Bucket for fixed-size chunking\n",
+    "\n",
+    "This separate-bucket approach allows us to maintain different chunking strategies for different document types within the same knowledge base, ensuring optimal processing for each document category.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "67a84784-fbfe-4ba5-88a5-d2e2a382bfac",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import random\n",
+    "\n",
+    "suffix = random.randrange(200, 900)\n",
+    "s3_client = boto3.client(\"s3\")\n",
+    "bucket_name_semantic = \"kb-dataset-bucket-semantic-\" + str(suffix)\n",
+    "bucket_name_fixed = \"kb-dataset-bucket-fixed-\" + str(suffix)\n",
+    "bucket_name_hierachical = \"kb-dataset-bucket-hierarchical-\" + str(suffix)\n",
+    "s3_policy_name = \"AmazonBedrockS3PolicyForKnowledgeBase_\" + str(suffix)\n",
+    "createbucket(bucket_name_semantic)\n",
+    "createbucket(bucket_name_fixed)\n",
+    "createbucket(bucket_name_hierachical)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2a54bc11-8de7-4a67-b8ed-0ba2944fb9e4",
+   "metadata": {},
+   "source": [
+    "### Create S3 policies and attach to the existing Bedrock role\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6230aaf0-2aa6-429e-9824-0ef2f9b12579",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "account_number = boto3.client(\"sts\").get_caller_identity().get(\"Account\")\n",
+    "iam_client = session.client(\"iam\")\n",
+    "s3_policy_document = {\n",
+    "    \"Version\": \"2012-10-17\",\n",
+    "    \"Statement\": [\n",
+    "        {\n",
+    "            \"Effect\": \"Allow\",\n",
+    "            \"Action\": [\"s3:GetObject\", \"s3:ListBucket\"],\n",
+    "            \"Resource\": [\n",
+    "                f\"arn:aws:s3:::{bucket_name_semantic}\",\n",
+    "                f\"arn:aws:s3:::{bucket_name_semantic}/*\",\n",
+    "                f\"arn:aws:s3:::{bucket_name_fixed}\",\n",
+    "                f\"arn:aws:s3:::{bucket_name_fixed}/*\",\n",
+    "                f\"arn:aws:s3:::{bucket_name_hierachical}\",\n",
+    "                f\"arn:aws:s3:::{bucket_name_hierachical}/*\",\n",
+    "            ],\n",
+    "            \"Condition\": {\"StringEquals\": {\"aws:ResourceAccount\": f\"{account_number}\"}},\n",
+    "        }\n",
+    "    ],\n",
+    "}\n",
+    "s3_policy = iam_client.create_policy(\n",
+    "    PolicyName=s3_policy_name,\n",
+    "    PolicyDocument=json.dumps(s3_policy_document),\n",
+    "    Description=\"Policy for reading documents from s3\",\n",
+    ")\n",
+    "\n",
+    "# attach the new policy to the knowledge base execution role\n",
+    "s3_policy_arn = s3_policy[\"Policy\"][\"Arn\"]\n",
+    "iam_client.attach_role_policy(\n",
+    "    RoleName=bedrock_kb_execution_role, PolicyArn=s3_policy_arn\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "339eb2ae-e825-435f-b77b-0524144f081c",
+   "metadata": {},
+   "source": [
+    "### Document Analysis\n",
+    "\n",
+    "Purpose: analyze PDF documents using an LLM to recommend the optimal chunking strategy and its associated parameters.\n",
+    "\n",
+    "Input: a PDF document.\n",
+    "\n",
+    "Output: a recommendation for one of the following chunking strategies, with specific parameters:\n",
+    "- HIERARCHICAL chunking:\n",
+    "  - Maximum parent chunk token size\n",
+    "  - Maximum child chunk token size\n",
+    "  - Overlap tokens\n",
+    "  - Rationale for recommendation\n",
+    "- SEMANTIC chunking:\n",
+    "  - Maximum tokens\n",
+    "  - Buffer size\n",
+    "  - Breakpoint percentile threshold\n",
+    "  - Rationale for recommendation\n",
+    "- FIXED_SIZE chunking:\n",
+    "  - Maximum tokens\n",
+    "  - Overlap percentage\n",
+    "  - Rationale for recommendation\n",
+    "\n",
+    "An illustrative example of the expected recommendation format is shown below.\n"
+   ]
+  },
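+  {
+   "cell_type": "markdown",
+   "id": "7f3e9a10-5c6d-4e7f-8a9b-2c3d4e5f6a7b",
+   "metadata": {},
+   "source": [
+    "For illustration only, a HIERARCHICAL recommendation returned by `chunking_configuration` (defined below) could look like the following. The keys are the ones requested in the prompt; the values shown here are hypothetical:\n",
+    "\n",
+    "```json\n",
+    "{\n",
+    "  \"Recommend only one Strategy\": \"HIERARCHICAL\",\n",
+    "  \"Maximum Parent chunk token size\": 1500,\n",
+    "  \"Maximum child chunk token size\": 300,\n",
+    "  \"Overlap Tokens\": 60,\n",
+    "  \"Rational\": \"The document has a clear section and sub-section structure ...\"\n",
+    "}\n",
+    "```\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9a4b8c21-6d7e-4f8a-9b0c-3d4e5f6a7b8c",
+   "metadata": {},
+   "source": [
+    "#### Optional: more robust parsing of the model response\n",
+    "`chunking_configuration` (defined below) calls `json.loads` directly on the model output, which fails if the model wraps the JSON in prose or code fences. The helper below is a sketch of one way to harden that step; it is optional and is not used by the pipeline cells that follow."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0b5c9d32-7e8f-4a9b-8c1d-4e5f6a7b8c9d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import re\n",
+    "\n",
+    "\n",
+    "def parse_model_json(text):\n",
+    "    \"\"\"Extract and parse the first JSON object found in a model response (illustrative helper).\"\"\"\n",
+    "    match = re.search(r\"\\{.*\\}\", text, re.DOTALL)\n",
+    "    if match is None:\n",
+    "        raise ValueError(\"No JSON object found in the model response\")\n",
+    "    return json.loads(match.group(0))"
+   ]
+  },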
+  {
+   "cell_type": "code",
+   "execution_count": 86,
+   "id": "c5801051-5411-4659-a303-c06aed74af04",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def chunking_advise(file):\n",
+    "    from langchain_community.document_loaders import PyPDFLoader\n",
+    "\n",
+    "    path = \"data\"\n",
+    "    print(\"I am now analyzing the file:\", file)\n",
+    "    file = path + \"/\" + file\n",
+    "    loader = PyPDFLoader(file)\n",
+    "    document = loader.load()\n",
+    "    # print(document)\n",
+    "    prompt = f\"\"\"SYSTEM you are an advisor expert in LLM chunking strategies,\n",
+    "    USER can you analyze the type, content, format, structure and size of {document}. \n",
+    "    1. See the actual document content\n",
+    "    2. Analyze its structure\n",
+    "    3. Examine the text format\n",
+    "    4. Understand the document length\n",
+    "    5. Review any hierarchical elements and assess the semantic relationships within the content\n",
+    "    6. Evaluate the formatting and section breaks\n",
+    "    then advise on the best LLM chunking strategy based on this analysis. Recommend only one strategy, however show the recommended strategy preference ratio. \n",
+    "    Available strategies to recommend from are: FIXED_SIZE or NONE or HIERARCHICAL or SEMANTIC\n",
+    "    Decide on the recommendation first and then, what is the recommendation? \"\"\"\n",
+    "\n",
+    "    resultA = get_completion(prompt)\n",
+    "    print(\"my recommendation is:\", resultA)\n",
+    "    return resultA\n",
+    "\n",
+    "\n",
+    "def chunking_configuration(strategy, file):\n",
+    "\n",
+    "    prompt = f\"\"\" USER based on the recommendation provided in {strategy} , provide for {file} a recommended chunking configuration. \n",
+    "    If you recommend HIERARCHICAL chunking then provide a recommendation for: \n",
+    "    Parent: Maximum parent chunk token size. \n",
+    "    Child: Maximum child chunk token size and Overlap Tokens: Number of overlap tokens between each parent chunk and between each parent and its children.\n",
+    "    If the recommendation is HIERARCHICAL then provide the response using JSON format\n",
+    "    with the keys as \"Recommend only one Strategy\", \"Maximum Parent chunk token size\", \"Maximum child chunk token size\", \"Overlap Tokens\",\n",
+    "    \"Rational: please explain the rationale for the decision and explain why each other choice is not preferred, keep the rationale to 100 words maximum.\" \n",
+    "    Provide a crisp and clear answer. \n",
+    "    If you recommend SEMANTIC then provide the response using JSON format with\n",
+    "    the keys as \"Recommend only one Strategy\", \"Maximum tokens\", \"Buffer size\", \"Breakpoint percentile threshold\", \n",
+    "    Buffer size should be less than or equal to 1 , Breakpoint percentile threshold should be >= 50\n",
+    "    \"Rational: please explain the rationale for the decision and explain why each other choice is not preferred, keep the rationale to 100 words maximum.\" \n",
+    "    Provide a crisp and clear answer. \n",
+    "    Do not provide a recommendation if there are not enough data inputs and say sorry I need more data.\n",
+    "    If you recommend FIXED_SIZE then provide the response using JSON format with\n",
+    "    the keys as \"Recommend only one Strategy\", \"maxTokens\", \"overlapPercentage\",\n",
+    "    \"Rational: please explain the rationale for the decision and explain why each other choice is not preferred, keep the rationale to 100 words maximum.\" \n",
+    "    Provide a crisp and clear answer. \n",
+    "    Do not provide a recommendation if there are not enough data inputs and say sorry I need more data\"\"\"\n",
+    "\n",
+    "    res = get_completion(prompt)\n",
+    "    # print(res)\n",
+    "    parsed_data = json.loads(res)\n",
+    "    return parsed_data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "579229ec-23b4-4e9f-99db-cb78c7453e4e",
+   "metadata": {},
+   "source": [
+    "### Ingest Documents By Strategy\n",
+    "Purpose: configure the Amazon Bedrock Knowledge Base ingestion settings based on the recommended chunking strategy.\n",
+    "- Interprets the recommended strategy from parsed_data\n",
+    "- Applies the corresponding parameters to create the appropriate configuration\n",
+    "- Selects the matching S3 bucket for the strategy\n",
+    "- Generates knowledge base metadata\n",
+    "- Returns all necessary components for Bedrock KB ingestion\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 87,
+   "id": "a016a3c8-6759-40fe-a997-91357a3f48e9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def ingestbystrategy(parsed_data):\n",
+    "\n",
+    "    chunkingStrategyConfiguration = {}\n",
+    "    strategy = parsed_data.get(\"Recommend only one Strategy\")\n",
+    "\n",
+    "    # HIERARCHICAL chunking\n",
+    "    if strategy == \"HIERARCHICAL\":\n",
+    "        p1 = parsed_data[\"Maximum Parent chunk token size\"]\n",
+    "        p2 = parsed_data[\"Maximum child chunk token size\"]\n",
+    "        p3 = parsed_data[\"Overlap Tokens\"]\n",
+    "        bucket_name = bucket_name_hierachical\n",
+    "        name = \"bedrock-sample-knowledge-base-HIERARCHICAL\"\n",
+    "        description = \"Bedrock Knowledge Base data source for S3 with HIERARCHICAL chunking\"\n",
+    "        chunkingStrategyConfiguration = {\n",
+    "            \"chunkingStrategy\": \"HIERARCHICAL\",\n",
+    "            \"hierarchicalChunkingConfiguration\": {\n",
+    "                \"levelConfigurations\": [{\"maxTokens\": p1}, {\"maxTokens\": p2}],\n",
+    "                \"overlapTokens\": p3,\n",
+    "            },\n",
+    "        }\n",
+    "\n",
+    "    # SEMANTIC chunking\n",
+    "    if strategy == \"SEMANTIC\":\n",
+    "        p3 = parsed_data[\"Maximum tokens\"]\n",
+    "        p2 = int(parsed_data[\"Buffer size\"])\n",
+    "        p1 = parsed_data[\"Breakpoint percentile threshold\"]\n",
+    "        bucket_name = bucket_name_semantic\n",
+    "        name = \"bedrock-sample-knowledge-base-SEMANTIC\"\n",
+    "        description = \"Bedrock Knowledge Base data source for S3 with SEMANTIC chunking\"\n",
+    "        chunkingStrategyConfiguration = {\n",
+    "            \"chunkingStrategy\": \"SEMANTIC\",\n",
+    "            \"semanticChunkingConfiguration\": {\n",
+    "                \"breakpointPercentileThreshold\": p1,\n",
+    "                \"bufferSize\": p2,\n",
+    "                \"maxTokens\": p3,\n",
+    "            },\n",
+    "        }\n",
+    "\n",
+    "    # FIXED_SIZE chunking\n",
+    "    if strategy == \"FIXED_SIZE\":\n",
+    "        p2 = int(parsed_data[\"overlapPercentage\"])\n",
+    "        p1 = int(parsed_data[\"maxTokens\"])\n",
+    "        bucket_name = bucket_name_fixed\n",
+    "        name = \"bedrock-sample-knowledge-base-FIXED\"\n",
+    "        description = \"Bedrock Knowledge Base data source for S3 with FIXED_SIZE chunking\"\n",
+    "\n",
+    "        chunkingStrategyConfiguration = {\n",
+    "            \"chunkingStrategy\": \"FIXED_SIZE\",\n",
+    "            \"fixedSizeChunkingConfiguration\": {\"maxTokens\": p1, \"overlapPercentage\": p2},\n",
+    "        }\n",
+    "\n",
+    "    # fail fast if the model recommended a strategy this helper does not handle (e.g. NONE)\n",
+    "    if not chunkingStrategyConfiguration:\n",
+    "        raise ValueError(f\"Unsupported chunking strategy: {strategy}\")\n",
+    "\n",
+    "    
s3Configuration = {\n", + " \"bucketArn\": f\"arn:aws:s3:::{bucket_name}\",\n", + " }\n", + " return (\n", + " chunkingStrategyConfiguration,\n", + " bucket_name,\n", + " name,\n", + " description,\n", + " s3Configuration,\n", + " )" + ] + }, + { + "cell_type": "markdown", + "id": "4cc202b3-12b9-45e7-a610-e76c28b142ad", + "metadata": {}, + "source": [ + "### Create or retrieve data source from Amazon Bedrock Knowledge Base\n" + ] + }, + { + "cell_type": "code", + "execution_count": 88, + "id": "90628e2d-913d-4290-8d40-78dacd3d0e9b", + "metadata": {}, + "outputs": [], + "source": [ + "def createDS(\n", + " name, description, knowledgeBaseId, s3Configuration, chunkingStrategyConfiguration\n", + "):\n", + " response = bedrock_agent_client.list_data_sources(\n", + " knowledgeBaseId=kb_id, maxResults=12\n", + " )\n", + " for i in range(len(response[\"dataSourceSummaries\"])):\n", + " if response[\"dataSourceSummaries\"][i][\"name\"] == name:\n", + " ds = bedrock_agent_client.get_data_source(\n", + " knowledgeBaseId=knowledgeBaseId,\n", + " dataSourceId=response[\"dataSourceSummaries\"][i - 1][\"dataSourceId\"],\n", + " )\n", + " return ds\n", + " ds = bedrock_agent_client.create_data_source(\n", + " name=name,\n", + " description=description,\n", + " knowledgeBaseId=knowledgeBaseId,\n", + " dataDeletionPolicy=\"DELETE\",\n", + " dataSourceConfiguration={\n", + " # # For S3\n", + " \"type\": \"S3\",\n", + " \"s3Configuration\": s3Configuration,\n", + " },\n", + " vectorIngestionConfiguration={\n", + " \"chunkingConfiguration\": chunkingStrategyConfiguration\n", + " },\n", + " )\n", + " return ds" + ] + }, + { + "cell_type": "markdown", + "id": "8e60ea6a-a1dc-47f8-9922-b7c70be750d1", + "metadata": {}, + "source": [ + "### Process PDF files by analyzing content, creating data sources, and uploading to S3.\n", + "\n", + "#### Workflow:\n", + "1. Lists all files in specified directory\n", + "2. For each PDF:\n", + " - Analyzes for optimal chunking strategy\n", + " - Creates data source with recommended configuration\n", + " - Uploads file to appropriate S3 bucket " + ] + }, + { + "cell_type": "code", + "execution_count": 90, + "id": "ee575c20-388b-4119-bd0d-080a71a5cbd0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['.ipynb_checkpoints', 'AMZN-2022-Shareholder-Letter.pdf', 'Q3FY18ConsolidatedFinancialStatements.pdf']\n", + "I am now analyzing the file: AMZN-2022-Shareholder-Letter.pdf\n", + "my recommendation is: Let me analyze the document according to your requirements:\n", + "\n", + "1. Content Analysis:\n", + "- This is Amazon's 2022 Shareholder Letter plus the 1997 letter\n", + "- Contains formal business communication with clear sections\n", + "- Mix of financial data, strategic information, and company updates\n", + "\n", + "2. Structure Analysis:\n", + "- Clear hierarchical organization\n", + "- Main sections with subtopics\n", + "- Natural paragraph breaks\n", + "- Distinct thematic segments\n", + "\n", + "3. Text Format:\n", + "- Professional business letter format\n", + "- Consistent paragraph structure\n", + "- Mixed content types (narrative, numerical, strategic)\n", + "- Contains lists and bullet points\n", + "\n", + "4. Document Length:\n", + "- Multiple pages (10 pages)\n", + "- Substantial content length\n", + "- Well-distributed content across pages\n", + "\n", + "5. 
Hierarchical/Semantic Elements:\n", + "- Strong thematic organization\n", + "- Clear topic transitions\n", + "- Natural semantic boundaries between business areas\n", + "- Logical flow between sections\n", + "\n", + "6. Formatting/Breaks:\n", + "- Clear section breaks\n", + "- Paragraph spacing\n", + "- Topic-based divisions\n", + "- Natural content groupings\n", + "\n", + "Strategy Preference Ratio:\n", + "HIERARCHICAL: 70%\n", + "SEMANTIC: 20%\n", + "FIXED_SIZE: 8%\n", + "NONE: 2%\n", + "\n", + "RECOMMENDED STRATEGY: HIERARCHICAL\n", + "\n", + "Reasoning:\n", + "The HIERARCHICAL chunking strategy is most appropriate because:\n", + "1. The document has clear hierarchical organization\n", + "2. Natural section breaks exist\n", + "3. Content is logically structured by topics\n", + "4. Information flow follows a clear hierarchy\n", + "5. Business sections are well-defined\n", + "6. Maintains context within related sections\n", + "7. Preserves the relationship between main topics and subtopics\n", + "\n", + "This strategy will:\n", + "- Keep related information together\n", + "- Maintain context within sections\n", + "- Preserve the logical flow of information\n", + "- Enable better question-answering\n", + "- Support more accurate information retrieval\n", + "- Respect the natural structure of the document\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Ignoring wrong pointing object 8 0 (offset 0)\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "I am now analyzing the file: Q3FY18ConsolidatedFinancialStatements.pdf\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Ignoring wrong pointing object 8 0 (offset 0)\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "my recommendation is: I'll analyze the document according to your criteria and then make a recommendation.\n", + "\n", + "1. Content Analysis:\n", + "- Financial statements from Apple Inc.\n", + "- Contains 3 main documents: Income Statement, Balance Sheet, and Cash Flow Statement\n", + "- Highly structured numerical data with clear hierarchical organization\n", + "- Contains headers, sub-headers, and detailed line items\n", + "- Tabular format with columns for different time periods\n", + "\n", + "2. Structure Analysis:\n", + "- Clear document sections with distinct titles\n", + "- Hierarchical organization (Main statements → Categories → Line items)\n", + "- Consistent indentation patterns\n", + "- Well-defined parent-child relationships in financial data\n", + "\n", + "3. Format Analysis:\n", + "- Tabular data with aligned columns\n", + "- Consistent spacing and indentation\n", + "- Clear section breaks between statements\n", + "- Standardized number formatting\n", + "- Footnotes and annotations\n", + "\n", + "4. Length:\n", + "- Three pages of financial statements\n", + "- Moderate document length\n", + "- Dense with numerical data\n", + "- Consistent formatting throughout\n", + "\n", + "5. Hierarchical/Semantic Elements:\n", + "- Strong natural hierarchy in financial statements\n", + "- Clear parent-child relationships\n", + "- Logical grouping of related items\n", + "- Natural semantic connections between financial concepts\n", + "\n", + "6. 
Formatting/Section Breaks:\n", + "- Clear separation between major statements\n", + "- Consistent sub-section formatting\n", + "- Well-defined category breaks\n", + "- Standardized presentation\n", + "\n", + "RECOMMENDATION: HIERARCHICAL\n", + "\n", + "Strategy Preference Ratio:\n", + "- HIERARCHICAL: 70%\n", + "- SEMANTIC: 20%\n", + "- FIXED_SIZE: 8%\n", + "- NONE: 2%\n", + "\n", + "Reasoning for HIERARCHICAL:\n", + "1. Financial statements have natural hierarchical structure\n", + "2. Clear parent-child relationships in data\n", + "3. Logical grouping of information\n", + "4. Maintains context within financial statement sections\n", + "5. Preserves relationships between numbers and their categories\n", + "6. Allows for meaningful chunking based on statement sections and sub-sections\n", + "7. Helps maintain calculation integrity\n", + "8. Keeps related financial concepts together\n", + "\n", + "This will allow the LLM to:\n", + "- Maintain the integrity of financial statements\n", + "- Keep related items together\n", + "- Preserve calculation relationships\n", + "- Respect the natural structure of financial data\n", + "- Enable better question-answering about specific sections\n", + "- Maintain context when processing financial information\n" + ] + } + ], + "source": [ + "s3_client = boto3.client(\"s3\")\n", + "dir_list1 = listfile(\"data\")\n", + "print(dir_list1)\n", + "strategylist = []\n", + "for file in dir_list1:\n", + " if \".pdf\" in file:\n", + " chunkingStrategyConfiguration = []\n", + "\n", + " strategy = chunking_advise(file)\n", + " strategy_conf = chunking_configuration(strategy, file)\n", + "\n", + "chunkingStrategyConfiguration, bucket_name, name, description, s3Configuration = (\n", + " ingestbystrategy(strategy_conf)\n", + ")\n", + "datasources = createDS(\n", + " name, description, kb_id, s3Configuration, chunkingStrategyConfiguration\n", + ")\n", + "with open(path + \"/\" + file, \"rb\") as f:\n", + " s3_client.upload_fileobj(f, bucket_name, file)" + ] + }, + { + "cell_type": "markdown", + "id": "ceaa41da-ecab-4592-8d78-59815e0dfb62", + "metadata": {}, + "source": [ + "### Ingestion jobs\n", + "##### please ensure that Knowledge base role have the permission to InvokeModel on resource: arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v1" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "43d0d10e-40e3-4769-a5e1-d115fce38041", + "metadata": {}, + "outputs": [], + "source": [ + "from datetime import datetime\n", + "import time\n", + "\n", + "\"\"\"\n", + " Starts and monitors ingestion jobs for all data sources in a knowledge base.\n", + "\"\"\"\n", + "sources = bedrock_agent_client.list_data_sources(knowledgeBaseId=kb_id)\n", + "for i in range(len(sources[\"dataSourceSummaries\"])):\n", + " print(\"ds [dataSourceId]\", sources[\"dataSourceSummaries\"][i - 1][\"dataSourceId\"])\n", + " start_job_response = bedrock_agent_client.start_ingestion_job(\n", + " knowledgeBaseId=kb_id,\n", + " dataSourceId=sources[\"dataSourceSummaries\"][i - 1][\"dataSourceId\"],\n", + " )\n", + " job = start_job_response[\"ingestionJob\"]\n", + " print(job)\n", + " # Get job\n", + " while job[\"status\"] != \"COMPLETE\":\n", + " get_job_response = bedrock_agent_client.get_ingestion_job(\n", + " knowledgeBaseId=kb_id,\n", + " dataSourceId=sources[\"dataSourceSummaries\"][i - 1][\"dataSourceId\"],\n", + " ingestionJobId=job[\"ingestionJobId\"],\n", + " )\n", + " job = get_job_response[\"ingestionJob\"]\n", + " time.sleep(10)" + ] + }, + { + "cell_type": 
"markdown", + "id": "30c5e219-97bd-4219-8f97-cd7be339cc5e", + "metadata": {}, + "source": [ + "### Try out KB and evaluate result score \n", + "##### try both queries below" + ] + }, + { + "cell_type": "code", + "execution_count": 76, + "id": "fb36c3b2-3b4e-4fca-8a95-3ae4d46aee98", + "metadata": {}, + "outputs": [], + "source": [ + "bedrock_agent_runtime_client = boto3.client(\"bedrock-agent-runtime\")\n", + "def response_print(retrieve_resp):\n", + " # structure 'retrievalResults': list of contents. Each list has content, location, score, metadata\n", + " for num, chunk in enumerate(retrieve_resp[\"retrievalResults\"], 1):\n", + " print(f\"Chunk -length : \", len(chunk[\"content\"][\"text\"]), end=\"\\n\" * 2)\n", + " print(f\"Chunk {num} Location: \", chunk[\"location\"], end=\"\\n\" * 2)\n", + " print(f\"Chunk {num} length: \", chunk[\"location\"], end=\"\\n\" * 2)\n", + " print(f\"Chunk {num} Score: \", chunk[\"score\"], end=\"\\n\" * 2)\n", + " #print(f\"Chunk {num} Metadata: \", chunk[\"metadata\"], end=\"\\n\" * 2)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8dec69f1-826c-4dd1-b812-c5ded0b55d05", + "metadata": {}, + "outputs": [], + "source": [ + "query1 = \"what is AWS annual revenue increase\"\n", + "\n", + "response_ret = bedrock_agent_runtime_client.retrieve(\n", + " knowledgeBaseId=kb_id,\n", + " nextToken=\"string\",\n", + " retrievalConfiguration={\n", + " \"vectorSearchConfiguration\": {\n", + " \"numberOfResults\": 1,\n", + " }\n", + " },\n", + " retrievalQuery={\"text\": query1},\n", + ")\n", + "print(\"Response should come from semantic chunked document: to verify let us check data source uri\")\n", + "response_print(response_ret)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5cdcc7e2-0238-452c-aa3c-2e1ea06e5499", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "query2 = \"what is iphone sales in 2018?\"\n", + "response_ret2 = bedrock_agent_runtime_client.retrieve(\n", + " knowledgeBaseId=kb_id,\n", + " nextToken=\"string\",\n", + " retrievalConfiguration={\n", + " \"vectorSearchConfiguration\": {\n", + " \"numberOfResults\": 1,\n", + " }\n", + " },\n", + " retrievalQuery={\"text\": query2},\n", + ")\n", + "print(\"Response should come from hierarchical chunked document:\")\n", + "response_print(response_ret2)" + ] + }, + { + "cell_type": "markdown", + "id": "bec94bfe-c99e-4e1c-9e97-bad5b3d0c09e", + "metadata": {}, + "source": [ + "##### Clean buckets \n", + "##### NOTE : please delete also Bedrock KB if not required by other works and data sources \n" + ] + }, + { + "cell_type": "code", + "execution_count": 78, + "id": "e67d50d0-8963-40d6-90be-be4c654c015f", + "metadata": {}, + "outputs": [], + "source": [ + "delete_bucket_and_objects(bucket_name_semantic)\n", + "delete_bucket_and_objects(bucket_name_fixed)\n", + "delete_bucket_and_objects(bucket_name_hierachical)" + ] + }, + { + "cell_type": "markdown", + "id": "f157092f-7534-41f5-b664-e9f1ca67d6bc", + "metadata": {}, + "source": [ + "## Conclusion: \n", + "\n", + "This notebook presents a proof-of-concept approach that uses Foundation Models to automate chunking strategy selection for document processing. 
Please note:\n",
+    "- This is an experimental implementation\n",
+    "- Results should be validated before production use\n",
+    "\n",
+    "This work serves as a starting point for automating chunking strategy decisions, but additional research and validation are needed to ensure reliability across diverse document types and use cases.\n",
+    "\n",
+    "Suggested next steps:\n",
+    "- Expand testing across more document types\n",
+    "- Validate recommendations against human expert decisions\n",
+    "- Refine the model's decision-making criteria\n",
+    "- Gather performance metrics in real-world applications\n",
+    "- Build a validation framework with a ground-truth database covering varied document types and structures, using a proven evaluation framework such as RAGAS"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1443821d-ad4c-4361-883f-002682160108",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/rag/knowledge-bases/features-examples/08-chunking-recommender/img/chunkingAdvs.jpg b/rag/knowledge-bases/features-examples/08-chunking-recommender/img/chunkingAdvs.jpg
new file mode 100644
index 00000000..e968c984
Binary files /dev/null and b/rag/knowledge-bases/features-examples/08-chunking-recommender/img/chunkingAdvs.jpg differ
diff --git a/rag/knowledge-bases/features-examples/08-chunking-recommender/requirements.txt b/rag/knowledge-bases/features-examples/08-chunking-recommender/requirements.txt
new file mode 100644
index 00000000..f9994574
--- /dev/null
+++ b/rag/knowledge-bases/features-examples/08-chunking-recommender/requirements.txt
@@ -0,0 +1,7 @@
+boto3
+botocore
+awscli
+retrying
+opensearch-py==2.3.1
+langchain-community
+pypdf
\ No newline at end of file