From 63b03345b467f5bd1191ab4ad4a1db9a7c3307e2 Mon Sep 17 00:00:00 2001
From: Mustapha Tawbi
Date: Tue, 21 Jan 2025 09:41:42 +0400
Subject: [PATCH 1/5] chunking recommender

---
 .../ChunkingRecommender.ipynb                 | 808 ++++++++++++++++++
 .../img/chunkingAdvs.jpg                      | Bin 0 -> 65285 bytes
 .../08-chunking-recommender/requirements.txt  |   7 +
 3 files changed, 815 insertions(+)
 create mode 100644 rag/knowledge-bases/features-examples/08-chunking-recommender/ChunkingRecommender.ipynb
 create mode 100644 rag/knowledge-bases/features-examples/08-chunking-recommender/img/chunkingAdvs.jpg
 create mode 100644 rag/knowledge-bases/features-examples/08-chunking-recommender/requirements.txt

diff --git a/rag/knowledge-bases/features-examples/08-chunking-recommender/ChunkingRecommender.ipynb b/rag/knowledge-bases/features-examples/08-chunking-recommender/ChunkingRecommender.ipynb
new file mode 100644
index 00000000..c4523e2e
--- /dev/null
+++ b/rag/knowledge-bases/features-examples/08-chunking-recommender/ChunkingRecommender.ipynb
@@ -0,0 +1,808 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "d4a64fff-2c1e-4f34-9c7f-0c8180bb039a",
+   "metadata": {},
+   "source": [
+    "## Automated Document Processing Pipeline: Using Foundation Models for Smart Document Chunking and Knowledge Base Integration\n",
+    "\n",
+    "\n",
+    "##### This notebook requires an existing Amazon Bedrock Knowledge Base. To create one, you can execute the code provided in:\n",
+    "https://github.com/aws-samples/amazon-bedrock-workshop/blob/main/02_KnowledgeBases_and_RAG/0_create_ingest_documents_test_kb.ipynb\n",
+    "### Challenge\n",
+    "\n",
+    "\n",
+    "Chunking is the process of dividing documents into smaller sections, or \"chunks,\" before embedding them into a Knowledge Base. This process enhances retrieval efficiency and precision. There are several chunking strategies available, each suited to different types of content and document structures. Examples of chunking strategies supported by Amazon Bedrock are:\n",
+    "- FIXED_SIZE: Splitting documents into chunks of approximately the size that you set.\n",
+    "- HIERARCHICAL: Splitting documents into layers of chunks, where the first layer contains large chunks and the second layer contains smaller chunks derived from the first layer.\n",
+    "- SEMANTIC: Splitting documents into chunks based on groups of similar content derived with natural language processing.\n",
+    "\n",
+    "FIXED_SIZE is useful in scenarios requiring predictable chunk sizes for processing. HIERARCHICAL chunking is appropriate when dealing with complex, nested data structures, whereas SEMANTIC chunking is useful when dealing with complex, contextual information and processing documents where the meaning across sentences is highly interconnected.\n",
+    "The main drawbacks of semantic chunking are higher computational requirements, limited effectiveness across different languages, and scalability challenges with large datasets. The main drawbacks of hierarchical chunking are higher computational overhead, difficulty in managing deep hierarchies, and slower query performance at deeper levels.\n",
+    "\n",
+    "Selecting the right chunking strategy requires an understanding of the benefits and limitations of each strategy in the context of the analyzed documents, the business requirements, and the SLAs. To determine the adequate chunking strategy, a developer needs to manually assess each document before selecting a strategy.\n",
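+    "\n",
+    "For orientation, the `chunkingConfiguration` payloads that this notebook assembles later and passes to the Bedrock `create_data_source` call look roughly like the sketch below; the key names mirror the Bedrock chunking configuration fields used further down in this notebook, and the parameter values are illustrative placeholders, not tuned recommendations:\n",
+    "\n",
+    "```python\n",
+    "fixed_size_config = {\n",
+    "    \"chunkingStrategy\": \"FIXED_SIZE\",\n",
+    "    \"fixedSizeChunkingConfiguration\": {\"maxTokens\": 300, \"overlapPercentage\": 20},\n",
+    "}\n",
+    "hierarchical_config = {\n",
+    "    \"chunkingStrategy\": \"HIERARCHICAL\",\n",
+    "    \"hierarchicalChunkingConfiguration\": {\n",
+    "        \"levelConfigurations\": [{\"maxTokens\": 1500}, {\"maxTokens\": 300}],  # parent, child\n",
+    "        \"overlapTokens\": 60,\n",
+    "    },\n",
+    "}\n",
+    "semantic_config = {\n",
+    "    \"chunkingStrategy\": \"SEMANTIC\",\n",
+    "    \"semanticChunkingConfiguration\": {\"maxTokens\": 300, \"bufferSize\": 1, \"breakpointPercentileThreshold\": 95},\n",
+    "}\n",
+    "```\n",
+    "Rather than hard-coding one of these payloads, the recommender built below asks an FM to pick the strategy and fill in the values for each document.\n",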
+    "The final choice is a balance between efficiency, accuracy, and the practical constraints of the specific use case.\n",
+    "\n",
+    "\n",
+    "### Approach presented in this notebook\n",
+    "\n",
+    "The approach presented in this notebook leverages Foundation Models (FMs) to automate document analysis and ingestion into an Amazon Bedrock Knowledge Base, replacing manual human assessment. The system automatically:\n",
+    "- Analyzes document structure and content\n",
+    "- Determines the optimal chunking strategy for each document\n",
+    "- Generates appropriate chunking configurations\n",
+    "- Executes the document ingestion process\n",
+    "\n",
+    "The solution recognizes that different documents require different chunking approaches, and therefore performs an individual assessment to optimize content segmentation for each document type. This automation streamlines the process of building and maintaining knowledge bases while ensuring optimal document processing for better retrieval and usage.\n",
+    "\n",
+    "The key idea in this work is using FMs to intelligently analyze and process documents, rather than relying on predetermined or manual chunking strategies.\n",
+    "\n",
+    "### Notebook Walkthrough\n",
+    "The pipeline streamlines the entire process from document analysis to knowledge base population, making it efficient to prepare documents for advanced query and retrieval operations.\n",
+    "\n",
+    "![data_ingestion](./img/chunkingAdvs.jpg)\n",
+    "### Steps: \n",
+    "\n",
+    "1. Create the Amazon Bedrock Knowledge Base execution role and the S3 buckets used as data sources, and configure the necessary IAM policies\n",
+    "2. Process the files within the target folder. For each document, analyze it and recommend an optimal chunking strategy (FIXED_SIZE/NONE/HIERARCHICAL/SEMANTIC) and specific configuration parameters\n",
+    "3. Upload the analyzed files to the designated S3 buckets and configure the buckets as data sources for the Bedrock KB\n",
+    "4. Initiate the ingestion jobs\n",
+    "5. 
Verify data accessibility and accuracy\n" + ] + }, + { + "cell_type": "markdown", + "id": "d066b04e-bc6f-42e6-8836-817d2e0854b1", + "metadata": {}, + "source": [ + "### Setup" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4f36a15a-6cc3-464a-a2f1-fcb2d45ea7b9", + "metadata": {}, + "outputs": [], + "source": [ + "%pip install --force-reinstall -q -r ./requirements.txt --quiet" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5f694d1b-c537-4bf2-a5b1-d4f5c7025620", + "metadata": {}, + "outputs": [], + "source": [ + "# restart kernel\n", + "from IPython.core.display import HTML\n", + "\n", + "HTML(\"\")" + ] + }, + { + "cell_type": "markdown", + "id": "9eb01263-04ab-4471-b73a-366055027873", + "metadata": {}, + "source": [ + "### Initiate parameters \n", + "\n", + "##### Knowledge base ID should have been created from first notebook (https://github.com/aws-samples/amazon-bedrock-workshop/blob/main/02_KnowledgeBases_and_RAG/0_create_ingest_documents_test_kb.ipynb) or similar\n", + "- To get knowledge Base Id using Bedrock console, look int Amazon Bedrock > knowledgebase> knowledgebase \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "25239d0e-972d-4fff-b200-f20c39714a9e", + "metadata": {}, + "outputs": [], + "source": [ + "import boto3\n", + "import json\n", + "\n", + "# create a boto3 session to dynamically get and set the region name\n", + "session = boto3.Session()\n", + "\n", + "AWS_REGION = session.region_name\n", + "bedrock = boto3.client(\"bedrock-runtime\", region_name=AWS_REGION)\n", + "bedrock_agent_client = session.client(\"bedrock-agent\", region_name=AWS_REGION)\n", + "# model was run in us-west-2 , if you are using us-east-1 then change model ID to \"us.anthropic.claude-3-5-sonnet-20241022-v2:0\" #\n", + "MODEL_NAME = \"anthropic.claude-3-5-sonnet-20241022-v2:0\"\n", + "datasources = []\n", + "\n", + "# create a folder data if not yet done and\n", + "path = \"data\"\n", + "\n", + "kb_id = \"xxxx\" # Retrieve KB First # update value here with your KB ID\n", + "kb = bedrock_agent_client.get_knowledge_base(knowledgeBaseId=kb_id)\n", + "# get bedrock_kb_execution ID - this role should have been create from notebook creating KB\n", + "bedrock_kb_execution_role_arn = kb[\"knowledgeBase\"][\"roleArn\"]\n", + "bedrock_kb_execution_role = bedrock_kb_execution_role_arn.split(\"/\")[-1]\n", + "account_id = boto3.client(\"sts\").get_caller_identity()[\"Account\"]\n", + "print(bedrock_kb_execution_role)" + ] + }, + { + "cell_type": "markdown", + "id": "b9291ec4-e2dc-47c1-950b-9fa7e737bee3", + "metadata": {}, + "source": [ + "### Supporting functions\n", + "##### Function 1 - Createbucket: Checks if an S3 bucket exists and creates it if it doesn't. 
\n", + "##### Function 2 - Upload_file: Upload_files to bucket: Upload a file to an S3 bucket\n", + "##### Function 3 - List all files in a specified directory\n", + "##### Function 4 - Delete a S3 bucket and all objects included within" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7578f6f2-1cd2-4683-b150-1b6900ff77ee", + "metadata": {}, + "outputs": [], + "source": [ + "import logging\n", + "import boto3\n", + "from botocore.exceptions import ClientError\n", + "import os\n", + "\n", + "\n", + "def createbucket(bucketname):\n", + " \"\"\"\n", + " Checks if an S3 bucket exists and creates it if it doesn't.\n", + " \"\"\"\n", + " try:\n", + " s3_client = boto3.client(\"s3\")\n", + " s3_client.head_bucket(Bucket=bucketname)\n", + " print(f\"Bucket {bucketname} Exists\")\n", + " except ClientError as e:\n", + " print(f\"Creating bucket {bucketname}\")\n", + " if AWS_REGION == \"us-east-1\":\n", + " s3bucket = s3_client.create_bucket(Bucket=bucketname)\n", + " else:\n", + " s3bucket = s3_client.create_bucket(\n", + " Bucket=bucketname,\n", + " CreateBucketConfiguration={\"LocationConstraint\": AWS_REGION},\n", + " )\n", + "\n", + "\n", + "def upload_file(file_name, bucket, object_name=None):\n", + " \"\"\"\n", + " Upload a file to an S3 bucket\n", + " \"\"\"\n", + " # If S3 object_name was not specified, use file_name\n", + " if object_name is None:\n", + " object_name = os.path.basename(file_name)\n", + "\n", + " # Upload the file\n", + " s3_client = boto3.client(\"s3\")\n", + " try:\n", + " response = s3_client.upload_file(file_name, bucket, object_name)\n", + " except ClientError as e:\n", + " logging.error(e)\n", + " return False\n", + " return True\n", + "\n", + "\n", + "def listfile(folder):\n", + " \"\"\"\n", + " List all files in a specified directory.\n", + " \"\"\"\n", + " dir_list = os.listdir(folder)\n", + " return dir_list\n", + "\n", + "\n", + "def delete_bucket_and_objects(bucket_name):\n", + " \"\"\"\n", + " Delete a S3 bucket and all objects included in\n", + " \"\"\"\n", + " # Create an S3 client\n", + " s3_client = boto3.client(\"s3\")\n", + " # Create an S3 resource\n", + " s3 = boto3.resource(\"s3\")\n", + " bucket = s3.Bucket(bucket_name)\n", + " bucket.objects.all().delete()\n", + " # Delete the bucket itself\n", + " bucket.delete()" + ] + }, + { + "cell_type": "markdown", + "id": "5d4b8fcc-5789-4df1-b72d-aef328b1a6c2", + "metadata": {}, + "source": [ + "### Standard prompt completion function" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0ee11cad-ea50-4e87-8bd2-b8f159bf4d91", + "metadata": {}, + "outputs": [], + "source": [ + "def get_completion(prompt):\n", + " body = json.dumps(\n", + " {\n", + " \"anthropic_version\": \"\",\n", + " \"max_tokens\": 2000,\n", + " \"messages\": [{\"role\": \"user\", \"content\": prompt}],\n", + " \"temperature\": 0.0,\n", + " \"top_p\": 1,\n", + " \"system\": \"\",\n", + " }\n", + " )\n", + " response = bedrock.invoke_model(body=body, modelId=MODEL_NAME)\n", + " response_body = json.loads(response.get(\"body\").read())\n", + " return response_body.get(\"content\")[0].get(\"text\")" + ] + }, + { + "cell_type": "markdown", + "id": "c549427c-9d3d-485c-a542-93ef49b540fe", + "metadata": {}, + "source": [ + "### Download and prepare datasets \n", + "The test dataset consists of two documents, these files will serve as test cases to validate the model's ability to correctly identify and recommend the most appropriate chunking strategy for each document type." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "217702d0-42bb-4da0-b14b-37b4d0c0b503", + "metadata": {}, + "outputs": [], + "source": [ + "# if not yet created, create folder already\n", + "#!mkdir -p ./data\n", + "\n", + "from urllib.request import urlretrieve\n", + "\n", + "urls = [\n", + " \"https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/2022-Shareholder-Letter.pdf\",\n", + " \"https://www.apple.com/newsroom/pdfs/Q3FY18ConsolidatedFinancialStatements.pdf\",\n", + "]\n", + "filenames = [\n", + " \"AMZN-2022-Shareholder-Letter.pdf\",\n", + " \"Q3FY18ConsolidatedFinancialStatements.pdf\",\n", + "]\n", + "data_root = \"./data/\"\n", + "for idx, url in enumerate(urls):\n", + " file_path = data_root + filenames[idx]\n", + " urlretrieve(url, file_path)" + ] + }, + { + "cell_type": "markdown", + "id": "48551cb4-02a8-4426-88aa-089516a4c1d9", + "metadata": {}, + "source": [ + "### Create 3 S3 buckets, one per chunking strategy\n", + "##### Important Note on AWS Bedrock Knowledge Base Configuration:\n", + "\n", + "The chunking strategy for a data source is permanent and cannot be modified after initial configuration. To address this challenge, we are implementing the following structure:\n", + "\n", + "Three separate S3 buckets will be created, each dedicated to a specific chunking strategy:\n", + "- Bucket for semantic chunking\n", + "- Bucket for hierarchical chunking\n", + "- Bucket for hybrid chunking\n", + "\n", + "These separate buckets approach allows us to maintain different chunking strategies for different document types within the same knowledge base system, ensuring optimal processing for each document category.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "67a84784-fbfe-4ba5-88a5-d2e2a382bfac", + "metadata": {}, + "outputs": [], + "source": [ + "import random\n", + "\n", + "suffix = random.randrange(200, 900)\n", + "s3_client = boto3.client(\"s3\")\n", + "bucket_name_semantic = \"kb-dataset-bucket-semantic-\" + str(suffix)\n", + "bucket_name_fixed = \"kb-dataset-bucket-fixed-\" + str(suffix)\n", + "bucket_name_hierachical = \"kb-dataset-bucket-hierarchical-\" + str(suffix)\n", + "s3_policy_name = \"AmazonBedrockS3PolicyForKnowledgeBase_\" + str(suffix)\n", + "createbucket(bucket_name_semantic)\n", + "createbucket(bucket_name_fixed)\n", + "createbucket(bucket_name_hierachical)" + ] + }, + { + "cell_type": "markdown", + "id": "2a54bc11-8de7-4a67-b8ed-0ba2944fb9e4", + "metadata": {}, + "source": [ + "### Create S3 policies and attach to existing Bedrock role\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6230aaf0-2aa6-429e-9824-0ef2f9b12579", + "metadata": {}, + "outputs": [], + "source": [ + "account_number = boto3.client(\"sts\").get_caller_identity().get(\"Account\")\n", + "iam_client = session.client(\"iam\")\n", + "iam_client = session.client(\"iam\")\n", + "s3_policy_document = {\n", + " \"Version\": \"2012-10-17\",\n", + " \"Statement\": [\n", + " {\n", + " \"Effect\": \"Allow\",\n", + " \"Action\": [\"s3:GetObject\", \"s3:ListBucket\"],\n", + " \"Resource\": [\n", + " f\"arn:aws:s3:::{bucket_name_semantic}\",\n", + " f\"arn:aws:s3:::{bucket_name_semantic}/*\",\n", + " f\"arn:aws:s3:::{bucket_name_fixed}\",\n", + " f\"arn:aws:s3:::{bucket_name_fixed}/*\",\n", + " f\"arn:aws:s3:::{bucket_name_hierachical}\",\n", + " f\"arn:aws:s3:::{bucket_name_hierachical}/*\",\n", + " ],\n", + " \"Condition\": {\"StringEquals\": {\"aws:ResourceAccount\": f\"{account_number}\"}},\n", + " }\n", + " ],\n", 
+ "}\n", + "s3_policy = iam_client.create_policy(\n", + " PolicyName=s3_policy_name,\n", + " PolicyDocument=json.dumps(s3_policy_document),\n", + " Description=\"Policy for reading documents from s3\",\n", + ")\n", + "\n", + "# fetch arn of this policy\n", + "s3_policy_arn = s3_policy[\"Policy\"][\"Arn\"]\n", + "iam_client = session.client(\"iam\")\n", + "fm_policy_arn = f\"arn:aws:iam::{account_number}:policy/{s3_policy_name}\"\n", + "iam_client.attach_role_policy(\n", + " RoleName=bedrock_kb_execution_role, PolicyArn=fm_policy_arn\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "339eb2ae-e825-435f-b77b-0524144f081c", + "metadata": {}, + "source": [ + "### Document Analysis\n", + "\n", + "Purpose: analyzes PDF documents using an LLM to recommend the optimal chunking strategy and its associated parameters.\n", + "Input: PDF document\n", + "Output: The function recommends one of the following chunking strategies with specific parameters:\n", + "- HIERARCHICAL Chunking:\n", + " - Maximum parent chunk token size\n", + " - Maximum child chunk token size\n", + " - Overlap tokens\n", + " - Rationale for recommendation\n", + "- SEMANTIC Chunking:\n", + " - Maximum tokens\n", + " - Buffer size\n", + " - Breakpoint percentile threshold\n", + " - Rationale for recommendation\n", + "- FIXED-SIZE Chunking:\n", + " - Maximum tokens\n", + " - Overlap percentage\n", + " - Rationale for recommendation\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c5801051-5411-4659-a303-c06aed74af04", + "metadata": {}, + "outputs": [], + "source": [ + "def chunking_advise(file):\n", + " from langchain.document_loaders import PyPDFLoader\n", + "\n", + " my_docs = []\n", + " my_strategies = []\n", + " strategy = \"\"\n", + " strategytext = \"\"\n", + " path = \"data\"\n", + " strategylist = []\n", + " metadata = [dict(year=2023, source=file)]\n", + " print(\"I am now analyzing the file:\", file)\n", + " file = path + \"/\" + file\n", + " loader = PyPDFLoader(file)\n", + " document = loader.load()\n", + " loader = PyPDFLoader(file)\n", + " document = loader.load()\n", + " # print (document)\n", + " prompt = f\"\"\"SYSTEM you are an advisor expert in LLM chunking strategies,\n", + " USER can you analyze the type, content, format, structure and size of {document}. \n", + " 1. See the actual document content\n", + " 2. Analyze its structure\n", + " 3. Examine the text format\n", + " 4. Understand the document length\n", + " 5. Review any hierarchical elements and Assess the semantic relationships within the content\n", + " 6. Evaluate the formatting and section breaks\n", + " then advise on best LLM chunking Strategy based on this analysis. Recommend only one Strategy, however show recommended strategy preference ratio \n", + " Available strategies to recommend from are: FIXED_SIZE or NONE or HIERARCHICAL or SEMANTIC\n", + " Decide on recommendation first and then, what is the recommendation? \"\"\"\n", + "\n", + " res = get_completion(prompt)\n", + " print(\"my recommendation is:\", res)\n", + " return res\n", + "\n", + "\n", + "def chunking_configuration(strategy, file):\n", + "\n", + " prompt = f\"\"\" USER based on recommendation provide in {strategy} , provide for {file} a recommended chunking configuration, \n", + " if you recommend HIERARCHICAL chunking then provide recommendation for: \n", + " Parent: Maximum parent chunk token size. 
\n", + " Child: Maximum child chunk token size and Overlap Tokens: Number of overlap tokens between each parent chunk and between each parent and its children.\n", + " If recommendation is HIERARCHICAL then provide response using JSON format\n", + " with the keys as \\”Recommend only one Strategy\\”, \\”Maximum Parent chunk token size\\”, \\”Maximum child chunk token size\\”,\\”Overlap Tokens\\”,\n", + " \\\"Rational: please explain rational for decision and explain why each other choice is not preferred, keep rational to 100 words maximum. \\” . \n", + " provide crisp and clear answer, \n", + " if you recommend SEMANTIC then provide response using JSON format with\n", + " the keys as \\”Recommend only one Strategy\\”,\\”Maximum tokens\\”, \\”Buffer size\\”,\\”Breakpoint percentile threshold\\”, \n", + " Buffer size should be less or equal than 1 , Breakpoint percentile threshold should >= 50\n", + " \\”Rational: please explain rational for decision and explain why each other choice is not preferred, keep rational to 100 words maximum. \\” . \n", + " provide crisp and clear answer, \n", + " do not provide recommendation if not enough data inputs and say sorry I need more data,\n", + " if you recommend FIXED_SIZE then provide response using JSON format with\n", + " the keys as \\”Recommend only one Strategy\\”,\\”maxTokens\\”, \\”overlapPercentage \\”,\n", + " \\”Rational: please explain rational for decision and explain why each other choice is not preferred, keep rational to 100 words maximum. \\” .\n", + " provide crisp and clear answer, \n", + " do not provide recommendation if not enough data inputs and say sorry I need more data\"\"\"\n", + "\n", + " res = get_completion(prompt)\n", + " print(res)\n", + " parsed_data = json.loads(res)\n", + " return parsed_data" + ] + }, + { + "cell_type": "markdown", + "id": "579229ec-23b4-4e9f-99db-cb78c7453e4e", + "metadata": {}, + "source": [ + "### Ingest Documents By Strategy\n", + "Purpose: Configures AWS Bedrock Knowledge Base ingestion settings based on the recommended chunking strategy analysis.\n", + "- Interprets the recommended strategy from parsed_data\n", + "- Applies corresponding parameters to create appropriate configuration\n", + "- Selects the matching S3 bucket for the strategy\n", + "- Generates knowledge base metadata\n", + "- Returns all necessary components for Bedrock KB ingestion\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a016a3c8-6759-40fe-a997-91357a3f48e9", + "metadata": {}, + "outputs": [], + "source": [ + "def ingestbystrategy(parsed_data):\n", + "\n", + " chunkingStrategyConfiguration = {}\n", + " strategy = parsed_data.get(\"Recommend only one Strategy\")\n", + "\n", + " # HIERARCHICAL Chunking\n", + " if strategy == \"HIERARCHICAL\":\n", + " p1 = parsed_data[\"Maximum Parent chunk token size\"]\n", + " p2 = parsed_data[\"Maximum child chunk token size\"]\n", + " p3 = parsed_data[\"Overlap Tokens\"]\n", + " bucket_name = bucket_name_hierachical\n", + " name = f\"bedrock-sample-knowledge-base-HIERARCHICAL\"\n", + " description = \"Bedrock Knowledge Bases for and S3 HIERARCHICAL\"\n", + " chunkingStrategyConfiguration = {\n", + " \"chunkingStrategy\": \"HIERARCHICAL\",\n", + " \"hierarchicalChunkingConfiguration\": {\n", + " \"levelConfigurations\": [{\"maxTokens\": p1}, {\"maxTokens\": p2}],\n", + " \"overlapTokens\": p3,\n", + " },\n", + " }\n", + "\n", + " # SEMANTIC Chunking\n", + " if strategy == \"SEMANTIC\":\n", + " p3 = parsed_data[\"Maximum tokens\"]\n", + " p2 = 
int(parsed_data[\"Buffer size\"])\n",
+    "        p1 = parsed_data[\"Breakpoint percentile threshold\"]\n",
+    "        bucket_name = bucket_name_semantic\n",
+    "        name = f\"bedrock-sample-knowledge-base-SEMANTIC\"\n",
+    "        description = \"Bedrock Knowledge Bases for S3 SEMANTIC\"\n",
+    "        chunkingStrategyConfiguration = {\n",
+    "            \"chunkingStrategy\": \"SEMANTIC\",\n",
+    "            \"semanticChunkingConfiguration\": {\n",
+    "                \"breakpointPercentileThreshold\": p1,\n",
+    "                \"bufferSize\": p2,\n",
+    "                \"maxTokens\": p3,\n",
+    "            },\n",
+    "        }\n",
+    "    # FIXED_SIZE Chunking\n",
+    "    if strategy == \"FIXED_SIZE\":\n",
+    "        p2 = int(parsed_data[\"overlapPercentage\"])\n",
+    "        p1 = int(parsed_data[\"maxTokens\"])\n",
+    "        bucket_name = bucket_name_fixed\n",
+    "        name = f\"bedrock-sample-knowledge-base-FIXED\"\n",
+    "        description = \"Bedrock Knowledge Bases for S3 FIXED\"\n",
+    "\n",
+    "        # FIXED_SIZE uses fixedSizeChunkingConfiguration, not the semantic key\n",
+    "        chunkingStrategyConfiguration = {\n",
+    "            \"chunkingStrategy\": \"FIXED_SIZE\",\n",
+    "            \"fixedSizeChunkingConfiguration\": {\"maxTokens\": p1, \"overlapPercentage\": p2},\n",
+    "        }\n",
+    "\n",
+    "    s3Configuration = {\n",
+    "        \"bucketArn\": f\"arn:aws:s3:::{bucket_name}\",\n",
+    "    }\n",
+    "    return (\n",
+    "        chunkingStrategyConfiguration,\n",
+    "        bucket_name,\n",
+    "        name,\n",
+    "        description,\n",
+    "        s3Configuration,\n",
+    "    )"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4cc202b3-12b9-45e7-a610-e76c28b142ad",
+   "metadata": {},
+   "source": [
+    "### Create or retrieve a data source from the Amazon Bedrock Knowledge Base\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "90628e2d-913d-4290-8d40-78dacd3d0e9b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def createDS(\n",
+    "    name, description, knowledgeBaseId, s3Configuration, chunkingStrategyConfiguration\n",
+    "):\n",
+    "    response = bedrock_agent_client.list_data_sources(\n",
+    "        knowledgeBaseId=kb_id, maxResults=12\n",
+    "    )\n",
+    "    print(response)\n",
+    "    # reuse an existing data source with the same name, if one is found\n",
+    "    for i in range(len(response[\"dataSourceSummaries\"])):\n",
+    "        print(response[\"dataSourceSummaries\"][i][\"name\"], \"::\", name)\n",
+    "        print(response[\"dataSourceSummaries\"][i][\"dataSourceId\"])\n",
+    "        if response[\"dataSourceSummaries\"][i][\"name\"] == name:\n",
+    "            ds = bedrock_agent_client.get_data_source(\n",
+    "                knowledgeBaseId=knowledgeBaseId,\n",
+    "                dataSourceId=response[\"dataSourceSummaries\"][i][\"dataSourceId\"],\n",
+    "            )\n",
+    "            return ds\n",
+    "    ds = bedrock_agent_client.create_data_source(\n",
+    "        name=name,\n",
+    "        description=description,\n",
+    "        knowledgeBaseId=knowledgeBaseId,\n",
+    "        dataDeletionPolicy=\"DELETE\",\n",
+    "        dataSourceConfiguration={\n",
+    "            # For S3\n",
+    "            \"type\": \"S3\",\n",
+    "            \"s3Configuration\": s3Configuration,\n",
+    "        },\n",
+    "        vectorIngestionConfiguration={\n",
+    "            \"chunkingConfiguration\": chunkingStrategyConfiguration\n",
+    "        },\n",
+    "    )\n",
+    "    return ds"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8e60ea6a-a1dc-47f8-9922-b7c70be750d1",
+   "metadata": {},
+   "source": [
+    "### Process PDF files by analyzing content, creating data sources, and uploading to S3\n",
+    "\n",
+    "#### Workflow:\n",
+    "1. Lists all files in the specified directory\n",
+    "2. 
For each PDF:\n",
+    "   - Analyzes it for the optimal chunking strategy\n",
+    "   - Creates a data source with the recommended configuration\n",
+    "   - Uploads the file to the appropriate S3 bucket "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f6dc6d08-bc1d-4e72-8c9a-c6c73e092ebe",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "s3_client = boto3.client(\"s3\")\n",
+    "dir_list1 = listfile(\"data\")\n",
+    "print(dir_list1)\n",
+    "strategylist = []\n",
+    "for file in dir_list1:\n",
+    "    if \".pdf\" in file:\n",
+    "        chunkingStrategyConfiguration = []\n",
+    "\n",
+    "        strategy = chunking_advise(file)\n",
+    "        strategy_conf = chunking_configuration(strategy, file)\n",
+    "\n",
+    "        # keep the remaining steps inside the loop so every PDF is ingested,\n",
+    "        # not just the last one\n",
+    "        chunkingStrategyConfiguration, bucket_name, name, description, s3Configuration = (\n",
+    "            ingestbystrategy(strategy_conf)\n",
+    "        )\n",
+    "        print(\"name\", name)\n",
+    "        datasources = createDS(\n",
+    "            name, description, kb_id, s3Configuration, chunkingStrategyConfiguration\n",
+    "        )\n",
+    "        with open(path + \"/\" + file, \"rb\") as f:\n",
+    "            s3_client.upload_fileobj(f, bucket_name, file)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ceaa41da-ecab-4592-8d78-59815e0dfb62",
+   "metadata": {},
+   "source": [
+    "### Ingestion jobs\n",
+    "##### Please ensure that the Knowledge Base execution role has permission to InvokeModel on the resource: arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v1"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "43d0d10e-40e3-4769-a5e1-d115fce38041",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from datetime import datetime\n",
+    "import time\n",
+    "\n",
+    "\"\"\"\n",
+    "    Starts and monitors ingestion jobs for all data sources in a knowledge base.\n",
+    "\"\"\"\n",
+    "sources = bedrock_agent_client.list_data_sources(knowledgeBaseId=kb_id)\n",
+    "for i in range(len(sources[\"dataSourceSummaries\"])):\n",
+    "    print(\"ds [dataSourceId]\", sources[\"dataSourceSummaries\"][i][\"dataSourceId\"])\n",
+    "    start_job_response = bedrock_agent_client.start_ingestion_job(\n",
+    "        knowledgeBaseId=kb_id,\n",
+    "        dataSourceId=sources[\"dataSourceSummaries\"][i][\"dataSourceId\"],\n",
+    "    )\n",
+    "    job = start_job_response[\"ingestionJob\"]\n",
+    "    print(job)\n",
+    "    # poll the job until it completes\n",
+    "    while job[\"status\"] != \"COMPLETE\":\n",
+    "        get_job_response = bedrock_agent_client.get_ingestion_job(\n",
+    "            knowledgeBaseId=kb_id,\n",
+    "            dataSourceId=sources[\"dataSourceSummaries\"][i][\"dataSourceId\"],\n",
+    "            ingestionJobId=job[\"ingestionJobId\"],\n",
+    "        )\n",
+    "        job = get_job_response[\"ingestionJob\"]\n",
+    "        time.sleep(10)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "30c5e219-97bd-4219-8f97-cd7be339cc5e",
+   "metadata": {},
+   "source": [
+    "### Try out the KB and evaluate the result score\n",
+    "##### Try both queries below"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fb36c3b2-3b4e-4fca-8a95-3ae4d46aee98",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# print (response_ret )\n",
+    "def response_print(retrieve_resp):\n",
+    "    # structure 'retrievalResults': list of contents. 
Each list has content, location, score, and metadata\n",
+    "    for num, chunk in enumerate(retrieve_resp[\"retrievalResults\"], 1):\n",
+    "        print(f\"Chunk {num} length: \", len(chunk[\"content\"][\"text\"]), end=\"\\n\" * 2)\n",
+    "        print(f\"Chunk {num} Location: \", chunk[\"location\"], end=\"\\n\" * 2)\n",
+    "        print(f\"Chunk {num} Score: \", chunk[\"score\"], end=\"\\n\" * 2)\n",
+    "        print(f\"Chunk {num} Metadata: \", chunk[\"metadata\"], end=\"\\n\" * 2)\n",
+    "\n",
+    "\n",
+    "query1 = \"what is AWS annual revenue increase\"\n",
+    "\n",
+    "query2 = \"what is iphone sales in 2018?\"\n",
+    "\n",
+    "bedrock_agent_runtime_client = boto3.client(\"bedrock-agent-runtime\")\n",
+    "response_ret = bedrock_agent_runtime_client.retrieve(\n",
+    "    knowledgeBaseId=kb_id,\n",
+    "    retrievalConfiguration={\n",
+    "        \"vectorSearchConfiguration\": {\n",
+    "            \"numberOfResults\": 1,\n",
+    "        }\n",
+    "    },\n",
+    "    retrievalQuery={\"text\": query1},\n",
+    ")\n",
+    "print(\"Response should come from the semantically chunked document:\")\n",
+    "response_print(response_ret)\n",
+    "\n",
+    "response_ret2 = bedrock_agent_runtime_client.retrieve(\n",
+    "    knowledgeBaseId=kb_id,\n",
+    "    retrievalConfiguration={\n",
+    "        \"vectorSearchConfiguration\": {\n",
+    "            \"numberOfResults\": 1,\n",
+    "        }\n",
+    "    },\n",
+    "    retrievalQuery={\"text\": query2},\n",
+    ")\n",
+    "print(\"Response should come from the hierarchically chunked document:\")\n",
+    "response_print(response_ret2)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bec94bfe-c99e-4e1c-9e97-bad5b3d0c09e",
+   "metadata": {},
+   "source": [
+    "##### Clean up the buckets\n",
+    "##### NOTE: please also delete the Bedrock KB and its data sources if they are not required by other work\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e67d50d0-8963-40d6-90be-be4c654c015f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "delete_bucket_and_objects(bucket_name_semantic)\n",
+    "delete_bucket_and_objects(bucket_name_fixed)\n",
+    "delete_bucket_and_objects(bucket_name_hierachical)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f157092f-7534-41f5-b664-e9f1ca67d6bc",
+   "metadata": {},
+   "source": [
+    "## Conclusion: \n",
+    "\n",
+    "This notebook presents a proof-of-concept approach that uses Foundation Models to automate chunking strategy selection for document processing. 
Please note:\n", + "- This is an experimental implementation\n", + "- Results should be validated before production use\n", + "\n", + "This work serves as a starting point for automating chunking strategy decisions, but additional research and validation are needed to ensure reliability across diverse document types and use cases.\n", + "\n", + "Suggested Next Steps:\n", + "- Expand testing across more document types\n", + "- Validate recommendations against human expert decisions\n", + "- Refine the model's decision-making criteria\n", + "- Gather performance metrics in real-world applications\n", + "- Build a validation Framework having a Ground Truth Database and including varied document types and structures using proven validation framework such as RAGA" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1443821d-ad4c-4361-883f-002682160108", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/rag/knowledge-bases/features-examples/08-chunking-recommender/img/chunkingAdvs.jpg b/rag/knowledge-bases/features-examples/08-chunking-recommender/img/chunkingAdvs.jpg new file mode 100644 index 0000000000000000000000000000000000000000..e968c984a036045d64c6302ba10f14655dd227dc GIT binary patch literal 65285 zcmdqJ2Q-}9+c-K()F66^79?7f=%P#zorsbUfqK2qJM3)c~y+uTc zUPnT-m{G=H%6;XW^E>DF`@XyW-@WUub=UfK^2+w^{XYBY``P6Qaf-MEI%lAxuLB|> z0fF>@KOo{V=$dvAZInU zjYK5rHGM5D`kfa^zn|890V04t`=|X~e%DE< zPU{YT)VWXVkDP-}#|3G5c?bJKTwEXV%ScO0^J_SHIXLs12m3v8_VDA^_jL5~_4e|; z|H#=1==%qLr(b~>|LAziF_A0E%1WZY-~NC2<5UiRfZH?u&0i9>&ws!Bdmu-D-+(`V zBO#Rqf&Ndw{|zID0|YWD27$64KNfJhlh$QVh8T_8Sy)n`b4m%q>=AtfU}LqSPJO+yQGXgCKVB_SgtB_}&`h8$2> zl5n6MM9z4I>B1Ea3TD&$lmb31GEb5Vs01}@+F8v;P(rc}z7f3tQO}}dV&N&NY3Ui6 zFJ5L9zAh>*DJ?61Q(ITx(Ad=c{zFG+S9ecuU;n`9*!Z`J$*Jj?W#r2DAFFFW*Ei6+ zd;15NL+sJA^pAi=E|BAAI5%xD-vmh-H>F+{D zN=kNyjEw9I#TlSbP*I!~DjKTah33z4_IIH_Ee!uEL;wg00E3*IoD%qZj+UDC-2dZ3 zoCn;KJaGnemW%{YCNf447=*_a!bL#;%N{b~*Z-w-jr=dAYxI99U1NA$^$Lu@4(mxg z=EG?aL6J5Hg0lVe85rZievTT;(XSS0Y3>W~(*p*qEfK^8{s}`<66idMAnXr5R0~0l z2>MC{AsKL7^F$DOVE0&jT`et?*Gu+SNHk882uk4F2V*0NAc`g!&NuNl2-rVB1V>#o zEavU-I8@UM^TLi0L0X3~V3d0x5!5Sc(zBC5NpRUY2V0uhA%dP;W|&37qYc!azL6bg z0NMlzE=15@jia!Ro!{RRL4N@d|6EIV-Yq;Wl=bsT-b*5g7O`>z{QYnA<#1luvBh2j zKM~YkwMYcKj@9sCV>D5@0wJJi|YrUlEl10x8q2ku>{@H{ic{6Oe@S#(taq8yoIw>B9WF#QT12 zYb8Wb2MYTQV9wvfi4LMWS1kVlpSxO0u;CKby4kv9`UsPKm>L1VdYzhpq2Dhizrv1KOKP9Da*&NhFwE5 z{Gkxr*P)8Ihv2=NssN&=dZc3aP)D@rL1ryLM%^fQIXLVg-`}muHvojz(A5MuWeg1E zaVm51&oA8sn%A{TdE192*UAC9cd-+&)mK3PYfm*OAPcG#=Dv+4>jM9Y&X1S9Ea)K@ zG4UD$ulMV{D~UbB$0EEJ{t{^k-*fPJ72IzW-=6AhaEWMfxTvr0J3sZc;?qX(&sW$} z=#v2SR04bE>G$fV-#7fJ5)IODwVY5&m0uw+oD?`AfB$CKDS7UIndtuK3p*RQf!!Ci z=syK8yyAjq;X6kJEzO)Nuok9vy_?(YB@m+?KfKM(tNT=M?| z7wOlWr=sHCIVCi+r;`Yh{}-hCVXPhO-vG1@dMFf(*+%9bs(}gY{{ib2&JQ~aU_$#B zCJjHjA6@G=czUk3sw)R`<`4D_J)HCIVjJK6Vq`x{r-QO5x4}4h22qf|Ig;&If?j zKIQ5#pfnFGH;&umS%vK=aZSNJmnv+ zj()d=C+N=Y1L)ckFt#n9nwv8GxX83ZY>&h~l;Q!gKAZ5#vKO6#{#2 zFXJSV?-bi$KnWo$E4@9t;O~C}Xf1mjGDZ_(+SkJRdDznHc|JFTNlcy~-y!>NaAaWt z*zZ?y*U@|gx*affJ2=Q!{Qw*x#v?za{Bj+JYExGtf<}HfR|Cy*xFGDxEK)dz2ts+C z4(R^bTFT&Qyx9*Oc){dvaE`SNW 
z)G5bJ%|shgOC8b@Ex>$G6#xx_&|herx;;F4mH-FGf>A!d`8`NAq%*jFr@1gK(GqAq z`L|X`>B>$obdeFK4=^z9l!3UHe$xivZr;r|bjpe)brP28>j&3xQSot!Y2V9V3mz*KZ=_hOGg_0rOcQz;$C` zf6(Rn-+b{sNba^-0$k0Hz(uywi)Z;8A7F~2Bl)-Q-TQu)lTCt`U#Y9YM)c59fAE$M z^_7no{PQ0~K8J%2hOw>;iG2HJs)m1ZNf#$a*_(JQ`Zod9f`yO&r{lh2xcgs`uZMMB z3p-`_-yFLt^8Y7-|09M!x=DFRdm^>Le%!P*g37%0&C|K{V=L-;q}<-;ZK2x1OZl`a zEj5=vf7X8>lybv3G&hKHE1w9uREP2-bPY4W)<4i39)pPEKwyScM>Nq81fNtFFYh$% z;!RZeFxrK1@%6XCjWpt$i_`fSX>9h1>3Ya&)6yVbMZf0j>a+PD<91GWiqr25=zj0M zWFTTT-MettCC3|9@3pnEk~p2&Gtowk?(Qz$nw-!NKQk1kH?gbp3!QW}_Ppn9t!Bsg z$gl8j2gv&qxdZX(fw_sVN5^oHsw`Po(NAOUvr=FeaKZ;z=XGpn@UnOrjtkZ6D}TAi zVfTR#m-^Qqm^x|B*+n&R(SC_3k+$eK|DsvrQnF30m`fKgaF*glsdmg{WUd&Oi&9af zKaPQW7{Wb8?K)hYS&A$hur3&1?EH!p$TrLz)%*1A0RzpnuFvJkkr%iI!)r#p5by8& zluG7f@QX@SHcXMByceI3k1Cgr7#HfSO*{^V=&{UM)T=yS8f9F!f6G@Vg=1b^&hOco z9yY=DBN-SsFkHDaZkH!8C5wH0j(Jw8=aCP->rfwkNQL8)zPRUQ2*~=zhn&BWs;@Zi ze1Y?Ab^F3uv`H5q(^Qa3yS3ACPeXa1KvP4L?U(KGW6&Ft%bzvN$WuU<5I4evi6HWL z>~saB#tvK8eO2L!kJz&lnc&G3-Q+{tWHYtSwjQ6K#;i4R>p0JjC{Dv=-S0Iw4mi!k ze=*F&VYsoeClWuoyx=&)s`adixi0(PFl&kO>`lidTxw(QwJH$w*=5Kb!kUw7NmjCkH`DO5Y8bw6YgH4y-xXGF z^p&O$UDsg!I{Ltp4<27#r4m?mwV=CTE)r6bUP1ogg|>K93|03<=rck)qLwlO?6!{c zV2?k@)z#zWuvLC_Lh>{?|Dhtu;ZY63M97h2!_QutFtSLypC1W6kF!B_YPRHIIy(;? z7C7Gq9DON~3v2sSk<)ZfoYh*=Gi#pfs)x4zR!|h{3{DXR4?01zKMSLi*bi|CiZ-ts zo7u2`5Bl)ZK}PWgdq7BH74#W&F8^I%1k8Cjx{VJ#8QE$Y`sVz}{+OqSKc8@@N~ciU zBVMzTbHPE*7h>kr-EN6%>udb#OAao&(F3YI9167K37X*oLbg}=^d$KOg?Iw*FcGj&K6bI8&* z8c7}ssqRR8V$as5w6M62pix!yL8>H3%4MwcyG{=1Q{NKm>r>`m;Z4gn^wVqk8O+F8 zlH^{1R(#g5NQsR$L*>VA<=^HbtqnZRo? zUHN;DAw>>FlD7oaAWs>0Zo?xF%HYWw`pNz!jPSIwZR6un*P=dnq-kx)EjWatpes;} zaUmP%W-8-Qn?`H&v`n5;Q!KnBOV_Qxt7u@~qFbYEAo)&n+KUcZV?mZy=KHW~=T@upu1EJU znO9^M-)ZMX1f98wb#Y%Aeqzng!$T$UIgk0hPVnpR>e=O6n$tYj(lm8nJy37JFl=k1 zLfiRxRY^QhgFSWYYtOIlUW`z_WX1Z;QCvZ1==1p}-?SeofN5HVkqpz@gfk@# zL=ZLo^A8uk((SeS>D}UI)Y)=43hV2c;MpbF916RoY7oC-ziqqt72|ANzdYKzw6n#0 zA`qsGFE2!QVu>L0Z}lt78DrW_7^ipIi0z0q*vmBy1)uM4=R7w2@~(Uj$^XPaJq4XE zKT)na*u(m_>QW?Qa?!Mlb)PqAy>@X3X^PDsE2`Vu_OAgapzp=7k;gpIrWj;n6W2HX zD0Jx3msMFDv&~1Guhi0jc32bg#ib@9D5A8J`=_JhimlqG`;iTwjY@x7M5rZ%4zeoR zReBX#+WRW0rF@TTGM`Xa1VT4;byA!pHWr2Gc-<57Hu*JnV$Uq{=D55~wpzxZi>phV zWAHhF`VR?aIbxh?o*uB$qiR~QiO{XbK}R?_O9rStY=kz zeqPWFeq%X5c6Yf1X;K|VX^xi1Rup|F{otjn{Ft=%ibxUk?{aAJV?*ZhWD*u7GBiIbdu4Xqj9Up<7G{2Gda z6RE-bNW@t;bN4pdZ>0(&L*8)7%3lhw3Y;B|}}MeK)6a$jk1)5%(`^Qgctu0H&!DS2aF z>L^dw4+sR>@h?=vmt@Pr1kvHRCZ^YMYIfqA;WxhMTdW*bFo|7_kiPEulo#hvN8p9E zD-c0WD&?N4=gY9iE8PUinmEO;({8tu`bj7uJ}&1{tkRe$cQ}uBWXS zwyboKHa~9O&`hke&SB6yu&(J+V2AMe)FKIq3ylG=LbDqE>X$-g09Y> z-1ndGWcBzM@KRnIOG{iQK@wY%nKj19tnG)SQ=iB6ERmdbt9+?O#(W?Tu&!NGoyDDx z8~hOQjE`nt{Aqp5L<-igj4Gc#k!p$l*atdQc`|p*z^*x znFdQ7GvX7NvH^T1D7ynH^r03NfT*5woq0c=CmgPzGOd6wzckDBB3a9m$+wiehI||E zhP6exEN4CeKgh?L6L=?QZ=xQ4@GK_(@Gg&Z_{^TsBr|ook)X?qS(`H!B451S*QBtb zD(kz1p3(kdriXTW`qIqJ`6BP8bBQ2@fR$?XhH4bOC=mNac-nN*WCpA9RlRu=pJFf% z0`cg-z$NLDntb^4cvaQ{$9 zS$B!`^m*@;ceT$idVa@=eCN|VsjmTQj{F5OaF_VDanhQ+mph7jbwv0b5SooOg; z`2Kupsq>-0k}{>kJKqq}-s)ID4tx|j>il8X_h2hT(23+QFuS}}4-H0O4>Rp`!u-)| zIOeraFunt06XfKx1~zP5$^4AnP)K=MlFM_Y?@bwMb(PMyG(Bcof_lW?e6)U3TD$5w zSh9Mql-@2>A~gR6tO2&=jFg%_K~6^FsFw|6TlGt@(;cq9GGcT6idi-Cl~17N0&>Z? 
zC9!wj-t^b-tY5)PFNmPJz)llT=UeF4Ho2_x7;VFriE;kX{XUn}q@?=(6y0l1ehAJf z#3RLS`v!2k{frH!1fN&gN-00o;mXxmD$zMVdp>ORO zWhcCM&p)>Gv;SNoYyHF9&EGt564%!Q4xkGFKLS?yNsF{{TFW~eVXm!>d;aFZ_-Kf@ z?}cYdubcB-c${Q(sQ2&ESLIXSYY|@638`UDXsJ$M4gUex#){gB@^Tp?sk5-jF5k8p z$h&jt!V3m@E1h#?vk@Aqlau)(#JC>7?q@6ftA7Oc|}N6vo<&l8VD=s!9~rHwFd+6zcX9(8wKHAv=uFcQ)cysT|1M@N_hSXH!|IhdHN2d;Rb3DS zw!p&PY;wi7@xO0*=2Jf8CH7{v=wY#HT#%m7`!D+2qxHC3Uh+*i)*Dp61h3Sf8}pW= z+hE?Kq!Y~I#6 zhTQs#%>ge%ysR7@tAg=H~i8U@1k_3 z*+JO(r5tWg^XF%Rp&w89-2v<0UPkMvX&NSow_lYkN_gWY9mT}v9yDNMlX+W9_CTnd z3L#5sP!Y}tAiqikr9Q!z!M5(j5kZFjSMj2XY*$O8Tkkugp6{46qFKxqI(FqkUq@PB zA*~9Ci*}$+$X2Zn^-1p*!^UdXXU!KMpSdb#fy~!JF=Nv3X+)6oVx&DwD<#UkUHu}O ztpd5j=zjGP7u?wY0eby7d~T4_!LdQJiJ@8US_)~x=VsBC8h|!9VF1d9ZmvozDfI+Z!}rzBBNZY~?gfP!VH=ysy3s+M z##d8qc}DwH7P@+j=Q;15>l@rn=db3doey;tXDI!rOxlpMVC-nu2AmxX3*ubK%XExV{NBrjdFN%@^re6Z@iB! z4+6)ui7`g6JrQcb+W#f-JAh&NpBmO48^?t3P3Ufc)-Ul1*iy+Hp_i+f2r8WyCxUdV zf-P~3nBdtZaNKsDj%y)WD2gSIr6IR*i+=jkkWc8JBB%c?^3ULDU5qfAa{CNAunS6s zX5IWb!~Uc+NWP`TfAPZ0i>^^0-}}_-Ur6YCKrU8(QNJU9B#{M}y?`Lmp&~b6art>H zT3|&J{ozP=%WR|3lD$s!!k0~j>*2$^0b4T^S|0bL!=nk*D+)Oa0{DVgaRl&K?@XK_ z5p?&?oHh}Zcsczj^tCJW@;g901-cNVA%qC3=j*MVr`zaisLV9JEK`k_m1cETBVBp- zHr66%LtoUouV3>DeDn}lq&4kqd7;KI7GtfffhZ!#;@bf`TBr#3v6&qX^kJ>4^te1GQz#H$Ckxy$ z`koR*`T-^DTU0EV&B{_T3JyXnCGk+mtqHsJ?%@Xv^E9dlWEh0sJzSM}`DCSHor^uEi;ML4+5|pDon!ePZH-VxBLWl1zdZ$)EEr-;8 zHX(dAiCtB?Q<7>!{1;#J;csoiIL!zpE0R%>$ zH83!2I@-V?Ayty4Arx3HH!{2Jbf?hy+M8B-zpRtc<@{*a@)Ov7B8X|jdbR!Sx^elU zveK+i+1(o+4=J1dZM;L!siSRRK*JU1=iPWOW1C>%;wFB66zP|1l?PQ?n~Xkp9SG&5 z+I*=mdSk4{)?-@Cvdq??`nrP;Wg?-7tlkbp%Pyy2v&|CQ^FS%xJN~q#&)05!NWJ;Q zSr0c9%X=tb<|Gmj1*%}qIu5kMKL$dnT$G!o9Z4$EbOdMWa`OH{YS(R2jI!R6zL%Ik zw{w7K`*xE1=T?mq`}PxI!Xkp>2bw-pp<)@vYFwfX%_xSB(aubF-Ou9xaSIb`dxut1 z%O${)a?{piX>m&#&@1Xuc@wT6NxjLg80PJ)=UH66=1mzl!|H-P1fntnI>`DGnnkIF zIT>&Jv*u(Q4BqcJ7pN~jhqK3L5kZc&E|?VT2vUj!8)(Ir$BZiODm<~??fUeoPVQ!< zL8-f;l3D-W8JfyzB@%??FpilPbII5g%4N91V@_aN7ucJa+f>l9xU&5P?fyOdvtPb= zX~+gZ^2^1uz3mOy5fs9-*>K3c6Dn+|zotD?XLGSxVf2hp`T}huwbB=vkjWnut?K-+ z&cQWNOd?hfyzD5IItiPG)l;SuK@Zyoo2q#&?Zkd4fAJ8#BV^>rp{?+C3b=-Wc$e`S zB@28Jj_R}>U;U=ail(#zF2H@`*P#3KEVOwkkL@fE8&W7qTFL3&z|=_}qQLqP$5_4W zv>-H+-KnlH;i0M?>+P@IsxZl(0{%t#(m2G^dZhJ!vL_+XFLr0}F9@>hL=b~I-~0x9h{u;W_cE0YnQpm-M`I?; zpb>4R?Jk3BvpK0>q$%G*KP|qyoqJ+#6(Y2?gpcJ?4g-c3h_BhAq;MKKs_SD3G*|QTkXyjVanLVlDYXH6YG!c z==o`plu(t@gbPXaMwXuwM&-VYy$?WDwU9N|KPaD(+*&IYc|X4uZoE=GVdGh!Mm{PC7!`6weI=m~-r${8lAu7&~w z$9et;9>@o^i5jLM^0<;O?^+Sg)NFPNmM_0OjTU;fbZ=5|#bjy?BZqHNJr6^Jm;C1& zlPm31U%7oyaxFOV{Q5(xl3V1qN~h8^OOdok7h^sUGDfJn9xm{XK&69V7B`!+9@IFy z%S0v#MG5PI4!?8lm0ns$1h3C+#(T&YtRk9(xBO1P#thHfgu~R(xeHBbce{!0r|P`M zCMIcWn9?$B@^=!2noVGl^ceF?mYfdPwZSNiNa6$oPpq&3ht`nSt;S+{Y}oE|=8U=ejafqU(^_Awp8OOm746NS{-~)C zn4UNYo&|z}MiKrH5hx@RlxV_UGoW)V zWx=9Hm(v#R9qPnIfVJAco_6IGpQx9R`Ds55nQT({dbgb1m_RL|ud3z# zY&nht9|c}23{kIG@gnbM!oDO8uQMGZvvSMWGhoQ@k|?@a`)lf2u%Jh~p1x8a!AD&d zcV(P#Hp~L8+cD&@EF~ExA1+al%kt)Pj30M_{>8w8+q0?hE5cQfsO&)FmZQ`={<^8@ z`$g;R6=!q1bu7m%@24aK#}Z59Pr1?h+xmqucl=CQdaY}3XX#cwK*|j~<+{<}oZ4J= zM^8SjJnbd5hG~+vRrdwfRP7SW!XnXi({ZJApXU9N!*Y=_A1>u0;Qs<-O`Z@1qCJA+Dg^M!UjX1in@TaU4>^XrrP{8B889t`GA}9r@;hcm!(B~ zY#uKWH1>#f=U|Di7KY0jZXZ5jR#hW{N`8tj{v;z#{P!mZmh-T8`P;0hIE)KAKL+QA z7B~3Tpr(l#P~lNpPn!KKknnv^(kL$s>+6g4y*2O-_zR61nT4yc%8hX^XoWA^C+jUj4ZkVbo_Qx(YSl9qJS6G`zpHm*n zRxfcoB@=9Zu(`apjYq{T?`&(Y498~CM^@-AdBvW)qlS~wCbgS&Q-BYy73r(=px*~d zs(zaG)itjAML4zQ48p*SOUx^DQ$!o@5YmK=4X*f2_+3L?u0*pX>3ZCpteAIsz35LX z8XSG`ke`X@YvDlXY*sJ3e8I8hNiYGJFM%C8H#O$=W0y|6l;zm>uac6h%^qgwgCc>@ zL4z+@^$2|7OTbFJOVen69T9|MnO(jSg$mWix}!@wKg3UF@#eF=7@clEGhemc(CQ%a 
zIref*dcZZZ^0Sp19-S(>NH%Q=i`tnRSw#~uYyx-Z`tV%nrtPL>dj{y3HA64M^kj!t zXV#_c*CnZ;MMLClqGdWRS^8mJ^^$PCAWWmxvjlzevwW?+IF{EKW%S(-F7aW`OTcae zR{xPRo!jBnTgCH3_4z?@HmRawWRD*9n3z#|tqs6$$$XGCz^it18+{x8$}nq*S!%SFtqO^E|Peu zvj>5-Co3D=E8;}2XQHfoJu5u;1CI*dr|B(VI48sKsg@uWf9&iQbA2wZ_$jH-O zkM8xFE$y_Ez?Vb$2wxGjkK*>5P#$n=WA`^jHx0dcZ4%X3+AnXsL4CPpmGkU-Ia&U@ zR^T}93@JWFT>$dI60YjCWHZd{J|pc?TlaBH;ztU#6{+P=iaB3~)!E&7*>^cKIJ&}j zd6xRSJFh0hAPCumUQ81nL0~0(B?5U8onG`op?5~*H_Nv?FS3h1d{`*WGg@fOy{ZdB z#ossXOkpp+i%&o-73NW5Wl@EL%M9HK>^GYmNAi3eLraUypG6>bUi7N^>T{m4oe~#) zSPm+iO+BKZQzHE%~u!5U88rDMw^Oabr2!{*$b|TR`&TnvE=p zyuzO*my-sB#Z;b0lg5^vW%%aQNlXo znE5)_OFuNFSzAwgH3##j2UIIqW;qkdQjs#u381B^UhhxlodMC@!skig zE|y^(20vx3MtjEkxR=gO%}(KSuJy-Er`^4NK`8Y-<6V*OvMyvOnWAnv_#|Cal=!Z{ zY{hN!hT%YbF2WVY10AvhoA6+T2b%W^6($=k&&RU|Y_-D`Mrq!O(J1{eDZ6O0O_2zU zW`WOAwJq*-K~r{o_!wVIW_Y$SV-izGoq*6VqG&+5d? zY2-4U!uQ3QIln%t{zMD{6^!gzpvDF(#;RHynYLA^MOEdjdvrA0lq`CG zF7GJ$T3Uh(SSNwr@j*Z?jw$3TY*q?pyA z(9#Wtt%4J;FflA6+HgfUq1C2n(=vmPQT}?{J3|hMqAZb#)n6WUl}%T~8fT+U;#c_h z;+a1K%X4ft;Lx^xkUI&y91U&RpLi8owaV6IVx8pAO)yR^2wQ^9(#1| zJuNF6Fb@(7^r-sww1&CXu;6&|ME`WksSWH|WD62R2E;BSm?3SlD9`GWIedws3X9=K z^_lgHSqo;FmEGxDa>CpS7D;dKc!Rz8W}}x)wzW`eodkLmMRJvu?YeatBK<%~qAi`=mdXa`{mjJgSnMAS2U5+9DrqZ^Um zl(61F4wy>dKUM||G>%fjvc${ng*o_yCs{jzyFpD19r@FN%abvxd@FcC>v75ddj?Bf z9lduOW)=}y`n9@pFgw5eF!7W9j2AMn(|(4zcagpA9Mg;ZGOT50~_QjYg3HGL8z&Vzn5w?*^wcC5=ZgP z_Va=jg7E>>qqZglTH*WnyhT6&v=>kOB-w|*We(`{nm5k+fkMOHAF%evcPbmsN8x^X zMOUUw+&OlCx6S$x_Q8GE4-@VoZIUXI+pbNUyXWv;PaOcA=jM!vw=JD67*$qwW)KvnUtp!{}Wy4dJP zsT-DhNxdv$j!lhCi7FSOs2$4Q3#aG{)iHHt?OwL@v}~+YcVZ9NV(1QQ^=3T=J?0w^nMmq7Akg1Q+in+!Qp~YPw#I{bfiO+{XPy) zz-HVDG$g3a-h3t~R~Q6)zG#!nC0_N6+Q4W3&`{T^Ut5wE6itp^ezNDF{yprZI{a&_ z2bW3#O9tbU)h`6u4HDQ&JLh#RS{w0)nDB{g$MR0rPoF-^n242Qe9OGA(h$N$Dla}R zb!y?d44mY=KZ!;A5olm26QtJ`9*KAspw2icXX|c1>RG=Tl%b-3nO{cqxzs}J2!_0! 
zUt>2vjT9ujSAmhlzper{BN!5H6I3T&efWqv_>tTFtkmm4*F~pBFwr+oFzaeJZHddPtxK4qR=*l-vzoRYc57rCBD*{>+BcH4HGL$c1o93I7e1XF+3+z?+Misxy`QQ{j44XX z#ad?8k8Ru1xp#H*B|KdyFPQr|y6HCwyMCYC+t$-@7rl4yP*SN}@Ymv#{HDwS~|!C-F_on%vq;ESZl3qe`)&$p{Ri;6;PX_!h=o1yu9Ska~}gX0dIVQTwK zKJH>adrFh8cJ0+yiRZ&={MYbv+>S#%DwcbtaF1I?JjyCHTC7SHe6PUgf#rOz;Y#K< zC$O823}`AbF2+hNW2&7)U5Y!9kQoulmAv4bn(OYdx20b>2|TXIZ{7a6*tE0Vje3nK zNAW!i6Gt8DP3)iKo;M4f&bNq|;g~)v!9jKJeNM(;Ajt6})P}W$b-(n$>QdHE!)kH4 z=4v0^*_AxMk?MBEt}z!#Lab@tN&3C-5fe@`F5B$ecL2Fk;&Y%vLFdg)A_)v_ zHP{5zV~_d7SV_6r^TFGf-nB-2mu|k}l+!o`cjQ>(6wf-S1{AQ}0{j5*lu)V$)a17O zlBnIjhvJp+h{fBZ&dC(yWvPlUp5`Q#D9nCU@vmOPy%m#rHI6r6tAX7~{{?+d_-5a* zvlY6+7s0jD0iNDl{_xZK+53yH6$O!zt20zYWKc>-5xM8dR$;p zm^A(gj%NiQYDAbBdn84=ZmFnXFLiuf)RArfsNFnW_!);99Zb;1>QVK0f+2zC_HpJ` zWq#ciY=D3hY90}dXh1a7ULmJu$j5lq((B#b+6x+9sB~ZENlElGxOL#EJCgzIc32>$ zDVIwlcPQHt(Sb9buLED|WP6lMtt;&K#8WhRsa}f#xUiwlplSdlK6N5l(lFv{>^s#(YDy4{_|kL`5#rEm%fX6KJ?I6_`c1Z zg=5DSpc(A24QPfgd!|;Cx)YPG%sRU-H4%dG1JjJJgmPn=Z}kg2cCQLf@Ip29s0 zfOIO0=Co_(Z^f9JD}v-YUQKuk8dbp|H{E>G87XSgWyUUu!=3LSUs!Mr6{f{NsD@O_ zS!SNWn>$3sfMHKl6G%)9;uEcF;F=_5J|dO6As#m@fI(F( zODE!G?Jt*Tka@0ouqoW2Z;2;Lc!yx-=D*7ah$djtwS2?u@y24T<%GE9U7XjGH$)1PgN;v(0XGz(vY^ zNLnDx-TK;w*R-D)3$lvxrp;uY6D$%pV>Ip&yOzjNR4&@$N-ER9X`Pi6KXOkzdDCb{ zceGMwj3(>ZOO09QqtK=IsQdpQZYMZMgBley3B8R-yX6V#T*uqS^mcryPzSp>+K$TX0R$ zEA~dg&(Hd^%ejw`WFwiLk9&@mzp?_@QUMFC1x@zRxph z_L0TqBg-YaGc2Y?NriCHD{?HavJAxTYTsY_s>^LI>g&d>p?5{!WNGxCMVgMbBCDim zbJ15uhir3?5@b>qpa^gex&C(!FSOO+T<1Rd7%)^xo{@uRJbQ7hDQo;i^waUE*ou#_ zpuB_}oP2~t!!)fz?y1>3_$y>?vYCv9w%%&0euQIrak$9*4NCd!IPJSntK^=4Q|2s5 zFZOnQ9RH|fe5&LeoywUWQ^4-U9-iJVzI*L~=#@<~GsaU5sxDS27GxQfHT!zDWV@WI zWR&!(rkD|p(M|G;ZZcxsa#fcezqJS`i9akK7^5n8g(TnOiZA6XH`wdGxy4P2x*S=awxiyYteFI#EF0~U?$jW`a zp{*SuBB2L35kSY%;>Yy5FTraV_sa(83rjzhB*ULY{BV$+c%cJ89)H-puA!e~9xz_4 z;|pBYj$?Vr$jU6TCDu!k>Y)`g;0&R&=I4q{qIVa)RxH3ICX<(bg=f7XE>c3~BdhXF zH<{+jIRm2yvPR-2g~f%bf@K5#-8Zhujh^OMIr~zTB;TeQwobgi4gBiFViBF?IWyE% ze4*Y{`@zcn{Mq3@^J={Q&Z|K&{MUU{ei#wRTCYZmt4?Az?Yq>%73uMa%nfhHysk8J zjf3L8>JlYS@_?f6H2y5qJuA8f{uek|az$i(9LtI*L}bz%Ga5E=6xwebmeeq zTNPbewF_^c0ZZC>pSi96Y*#&j zJV;u~Eu*|f%Nzx&A#L!?VfL^$4twdD^W5^*SNU+}LBe>WtLkXGuSUO-C$~#-aOGezO^9Y;06gSpni*|AI53Ua z^r-$RR2HSGS-5MJ9*t;-seXanMG^J-G>dw1D;P!6bJ6-O^pIA}*!j_<`0*Sw)^bk3 z`HeRh^+0@|u!%N2u)6f5ap=5`-WQH@-p~eW&SD=q8Tef;gTl)fNV6kN3RAC1b9o;Y z^~y!7i9M9-Wn4&#A0^dRk%Q0O=Sh&OGK6a?zcOYWK*Qr$9|B2>cSjjPs-tUi$owd6 z{q8GsP+24Xu_p71OO?0!6vhkyYdkimLEk{u&|Ngpm{hK7u%rWNQ{}EyDA8v5OX9^}TcrZu&e*w8`<%%weiG}Pv5XN3ZhC0i-mh|g zw`hvH3YhYK4M!|(Bk5!bL{L8AQgvx3o$sE`$4NQE3S(;?f=h7)&i)mYl8w*YVSsY@brSRIptoZnwW=={J z*^y@3C3$q>?7mE|05A2XpQS_1ja%oXIZNI*r?(*%M$MJZq-ik*k*4VarGhN!sj>R$ zKg$is#^CW0SI3poUd++l=Kv00{>T7f6Yut0i9tp!(11TbwG5$3_b zg`fYDl|dNb+jX}Gf*s>Oax$QA1MU!k{{x~L$%|-$6E6Iflo5UYD`5V|j>A8=>8`#m zekFB(-&~{grb=w0fhS}?jcIub@DQV-vfUIOR^cqB*CMK>?2Jq8Z8q4JI6aE)ut@g} zaL1qc+!Fs9MurSO-|4rYUVSzZ6BQ%adsen~D@j&zx|!rtyI`_04~Ml+t~RG(dZ41* z&jv4^IcqnCo=$zjmu9_tam?u{!PBc`1nWG5oRn*<>j+9H|9q}t_ipIdILkV7LVK`8iT3_kC&81R7h#mBbQFdNVy+r7 zf$Uct_#80y3INI3Pu)ouxEL6labt+sBs(y1K*_@F%zee5yFS}vwfy)C?_RENq z&09cRd>q@Eh5m-`7WN2Gg_;~B9LNn|G|=W9i~B~Zz%{pUw|rc_T@w-(y+DqQTF$3I zJbBa{CchGHZV-MA@^w)Bj@!Mv!_@HEZ66-bZ5w9@ZXC>&t$~ z)yQ_P;0MzZOt)6!goRR`iaHlJKxA_|k7FES-3Ig%1OM!mX<0@eK*wx4cZw%1NBk7{ zIToe+*@ZrW0sj#kACPVRu;5)&FLWC2oB85_Ja^mjml%ysRsz3QCb^;0FmjwEdV3u> zsrA-ckYD|z>t+M}_%L76z}W};RB&?BeetB*+iW`Cpn6!C_v)Y)fAIlDSg^MQjSSQB+bf>qfwn1uFaD|;~MRMR84 zxg6f@l9q<&Gx5~xfJSK7d9s$)f?-$OI63rI#5#xq8`LM4zTHb1W&T>Ox`u5erKfO= zCLr{UZ=IV+Gw89auqcnB4Rj1=vU7AGP!sHlw%N}|uXk}!9|^ya%(44p*f{erd-zIc 
zCOuZ=jX}UYsCS&`VpBrU_RS7at;?54*3WLsgVia%8nHmcA-a%VKl^gjkE;G-(bv-C z9|b%)8cK!Zq~sw~Ya&!GK*>qE#=oxnfUdY!xY5DUEsL}`$s#*o!+H8=V!b$l1>-1j zpvh>jU0a>*`dJ{NWB)b4aGf;UyHlYpuJy(bC!yER`IkTx^OxXlMjp;F3+9{`XODNh z@qe#dU0n~^{J5TXp4_J~AQR}BTzLwV#6U>BIrR&zEvREi0D660xD7=~ic`)+xxyP` z>tor(X0k?6E2O5keaM`xkXesrP#-xPhKPN0^COM0yuF|(Up!l5Jq8+Hy=aDZ`~FL? zMk^wN&&VxEyPk>4vY(Rkuo*d?F4=)DhEXa;tk>MZhFLbC_nQy1p||SK%iPC5*|s<4 zB)5;zjusZWCZ;kmj`*!oo~%+}d4h&nRzIvTB7&tW%X`YfUgTzWP0_k9(<5`A8}@5D z7wM-#UkX-5q**vunQP{8rw?ek=>#$KtwzZ~-^qJ=Q{HhY=nYiMzXd=1%jedEKqZWN^1&==W*{-RWoh@s=3CYihl{d{5!`iup>iH1U3sM_%J`q+c(?Do-tw}sJu%&plV`i9)+lU_1Gy|{KnJ~p?NyutH_L@NX*}1 z&WH%{@M29_nO{-%d2K`oG2K>CgxWkgsWfUPAb=X#<8wyLwD)1ZREa@5(|J2qbfp6& zr`Swo-U56FXC>w3GNJ9#j97a3t(LS@HiB{BM0l%VV1$OPD%|i+%t>Cfxm$!ojj|WA z#Z2&xkAQ{<{fZ`0G0kIHC8P;I={n~^E4rqNtEP;cZj3hkwNW9nEeRrN`h(V##sG2g zw*2Hh(l79b6}%oAj0tdi+kVKd1!dGK37!TE=~)rS4ol9)O6F1TV0N3Kk>33fja!AI zJG5B|U-p`weq2148XQ&HGW}!7#|5SXn7gbvKH%Z%%Q3JL=rosTPBkyd_E9tz(AH=f z_VMB~eHCY)nLHh|RT->_6~RydWxeZ4qo!6rbr-SlxdEerKC~B%Wp$CCH{0Ol zs^yx!aU~S2g^7&CJnhaS=WM%j3d7k}gBzF|VuP5ib{gH6EeGyK928SBrb5`hms0a> zni@-|SA&1n*l(u%D8^Xj#qSHAtO9~b*Kj)l3muhR#-!H~39@D{mF3X{tg}25fiF{v z9F6<9gE~4BqpT^g?<%obdh9DU^S|2}#-Wl;9d-0P&43VtnsjRPf;jls#-+&wJHvHu zD6Rj>mHw1G_Zlqb626kuFcv`9p7ZL+tugtcVM~d^WF&F>`Ss?8hdzI_n=9iI4ll3+ z0M_?*nrxgFL@gXCZdR?M_aPzVB8%EQA7fsY$HX+Z(dzK{QcISR) z^NM_l3CofqHFbOA55;zGpO+iOJl!$A6KLcR;{T=SAzOReQs8u$n{faF726LSN*Ge$6XE*YWi(GtLT72^7@?X2AK?wp=KuI&fhBnxJ z+()xg@MLR|Kfw5&P(4Uc%&XRaXpERyh?{qAd1n3iTX$x$qHGAR|PW9%*e#qJMU)g()4rOhL2JD z8&J|{*6dt1Mpnk>UMmWKAHK01asMu$$u9$H8_+GdkFZ$)IoVwN!AM`@0Fu(31dGJ^ z?O_+!U{O zh2%Sm{XpnaJiwtJ&jgAKYt7pLb$J4~_C!?2_3gk_JXtjiyXf=lRBQp^Tphn^|_LqR= zOE4$48nmSifIGC6*I$DEy)+=XF1=W};v#S{Z@7pnr~X0C(ioG;Uht8B}cxM$L8$7m$t)-+j& z4PJh)yg>+{`ab5GSV}TelEHR|aTT=r9ApM^uA%e~L;6OS`n&Y>n?t)f2<8_0coK_u z%;egeJRcfU>)P7O8uAirGV)`Nv$jJif>p2+>oxBH%|=BvR&BM?Vp$&jP#bgw&o#j1LtLLxYdpRO?^e9M6B6JVw$M?6 z#B}h}imtPlx%gG1Apa7OW8&S~4|9(e-yz_YM+Tw^I|@~sOQTuARUHFc!5BZZ_FC12 zVaS!rj>#>>qfhpw?czHhM?$RDI_4p3_04+W!6Ht`CY;|FY>iG^>vZTt^JN_8jB{Z{#qW%t|BSMu>j)1lMeGJg>QaqfDTf5|bB?O~?RM`e zf3+rjD$SF~;}zE%!{5*>s;W-BGbPW0f>K*qD>h6ewNJf^cp>7PYUg$ou=ZEJCo$n=L#ZQ!?@>U{ZH0qkw2Z^hwPX z4hZ*6QDy6&S8LN~1FeNRy0wq_)4t)_FVk^Ze+f8?K%}7pc^4`04jcnu4~1#g!y19; zLBIbqXD$9SXFn-EzyPCt{O{Z^1iolgFX)7(`7c4o7+!K@|Bv1_Kt6a2XtPoz+G#Ev z&TEe4pjaatU=*!?GygxM>;~zD+0eTk=GU`RARda#6<J^WI#pUKFRI=0U&XW zVCB4v+sm+3U_oNeo$9mZscnutTQzE;UThu=|M^S$!QdcQ*x+v$r@OQ1$D=ua33`B) z;vw1(A$B!;!D6idp#8+iz@({Q;I}MkNqLwQKmLovkhLPsZJuDmDM_GuY5-@CfCKsx z?nHdbn6(e8h7nuRe_61nC^W-6Bf;WvR)Zcx(^&=+T{nx;)t?6~fuFHa>I=hjyhD>`9it2Eu;M&+p+eZ)K^d9N+VR~L z8yn=}fQ{B{!a8b!V_CbJvcM9-78!*FU7NhnKzhgoWYDO3Auh+kU`=;}k0H*$z zfX?RG&QKp%WN*V`ymY#4_MEm5T@vrM6%>6Eq5^)+{b^yv*wKE7D}9I;QRl-(VmUFv zHB-MA%bk`K>7VXJ*&drKh7%5e@rFL|yxhL>GoZG9{;@fYyf8U_ z8`LXj(u5trs~<(Ko{9VhUFyQuvQR1uhecb8V<%;>D0!%?0wg0vww%*YH)T{UtuI7! 
zdbcToZ*ed_Az=6-D(Q(wR_qJ8XC}|1E3DKzfA+V#2-qi8J6LtH-{^A;iBbi2hgiz1 zS1@e(?}7<$`nsh_%DM}Sk)1a(nINyBWvqQoFsg;s8=h2FeDXN1`2`W6$yt3(0dfau z$*a4+B?Pjg@$A@ddICS`U1j2b=_-QTAUSf-sIwGFr%@RWM-a{miVXmvR{dKz{w(SO z`(Sakd903B)}re1UxFz4tyBV_-a>#%eV5)@rKVo=oWgm5Uy6)wa73C;*4$Bc>+N(* zjEA|Ks8-LN69pQ)W=#xi?@9|8$89gcio-ZnE5!4OY_@KKKH1WTx$f`DNpaW(#;~N@ zL)8UHT>FDeb5{YS>Cw!y(&CH&u|@d`ja)y$^4qSZaY_WA(Q*i3{gmJuy9@Kc0x1Qp zTQA6Qo$vk<$RjF?Yi^!SmvEd7L0W-1n7jAS3l*2y1Wa(B+I&lhHFt-rxE2Oy<9yF@ ziBe7$mFQY2^5oDrjvrSVz{?J=#E1pe{3_Q=wI4&;iu7>Y(_sAJ3ekP?p-cAtpA8zHkiL~*D{IOme-uO`9xj;-dk%1wVWlUTd8~=oP zT$T0*mK7>AG&VS-dJ${WWv;N&QenlhyfEMW=VwpT!rP>R7IuP-B!N9ZO=G{doj4HL zN{LdC>lT9OpkWd9*DMx!n$3;VQ`*x)g*lt=(epzhNk($Gu?Vg;(fcvm?f0>>y?A0E zzM;XYclnLvgL5BG_^=#+1z)-rPH*ct;^)biCf*7D{?_}UwuSbF8iV8?^)Pv_b-gI& z6Ga|058Us|ZUauJuds^dKtaFZvzRo7^JbyDGTqpbv^VruL4hcoJjinTq)l9P8d}`+ zZ6d{V|BIgxTeHrC4ApqOJIrc=9@8lhc3G}Sqg)2jAr@j|!$@P2`|stbY+hnH)}?zC zf>529)dRw{VIkelos>mQ;>41wRxcN#RJQ%IB+Geq&d}nK%@U! zZ-C?vQ(Ro%rvy#;6g>low(6uLjLHR(MI7cJRZ!+aT}Y2WVCw611U~aWe8l}Rm*8fH zHGFlhVr@2VYZ}L~_BI%VQgHhTqTR{*+{oa^teTNs=Hl50wtt{ur4ch>q^WqOFSYrH z$tVkVJ{~HCb3`*lZlXED+$#5lgfe!2xiE*O+DQ_!j^h0?%d0vTA76cXdi&j+?|X)u zA7%`&YCDTwyQ|@OcUnc)*eFqnJ!HxvaiPMyiexjl4jM9aMsicWYTWpy$LrD3E@<@g z#YfbFJ%;!~VR|(B8pGB8Bwy>l4trNh}`=-x+nJa_RO^EJ5kcMmuV7 z&g1oShhL+R{UxE9f#m61IWGA7(#B%g1dvqk*wzrMh1T(wEC+jJlZ$E@i}LT_x+|jv zNNq4LTB@YRIWQ`!2!XR%AJJmbF_=FP&R99L64Mw~TVs!6Vqzj8n(cn(*nV_fQ6r2~ zMjv$s0vTKH9FR7KuDt~{LXF*t_jD~}Hr&~bQ2q72k(}#ud=t@Y51TvRsVpY2nxEkh z1_Iq~l!@+XsDdG}o)1NJ_{p_3{@J{>@Q_qJGntC1SCyy+lZ%xx0;fDTN~rH{PophV zub;XTW{@i*5f0F3j9(w=qpFf)Y+pcRpMlfCcft(Q`3jU%z*i*0+_*)GyRx5TgN0U zT0Q_K{!>j1ZEwoCOJd22tb(?t%Mgonh5U0pPZxTObmwANJB)C5T>AXgEGOAtf@DEc zJREhwnPrKv?pGkp$y33xHleRq2ZWemWOHLB;NGFl9xwMsv9f!LUJnQE$;w}MC44?D z2i>W{^N$p+NWTr{bw!aWKc*g5OoxW*pP}=&;(x|@z3Vf$&h6N!{tY?;ki%E&rbLq8 zxx~}^+nIfe0wFiO-Kn|1XKV+qAGyV?dG(V!Nmf|rh79XH3n4$_SWw4D7X$Dnd>N3% zF_veAgX$Hk;`>I2THHc%=`E05}2LwUH8s{PMqVS&O5p&s(u+?rp;_#o(|=9QJgInQyPkpi|VF~ zL6D=Tu4^qV6|@IN$5>(rvf%UY7<-L$CDx0S}F+lOZ^a%s=C&&cj`V*BN;0 zTkgEj=E6^lVXr#Ryw1~VH`OOH1Fx^;7=o;iYG*>nn-hTu3%W1~jcaF|5?oZBDbCu~ z5hZd}1RVY17(@vfeu^=rzunXv6#(FP>?*tWDu=(eCf08aM!M+F+udxnahY8AK0w$! 
zC_w&M;99Q{DO*^3@53}Us{z1(2j#I}@yCn6biw{!;1>;S#+$SU+`DlAZ2OE9LYT0y9};C2Jjd1(?!6h`QF#}rcbb5 z1L~2VDLJ5Eg#j&G{zD7MW{W)<8Ts#Hl`zjD46WsL3bG}Z(CKgF;Q&-{nP9q~fa$`p z#`q8j1gwkahw9B6B&fF(j4xa61ajaWK$+keOb<=$@5A^*WFwhcJoGPpiV>zwai;ng zxtL4zz+4EnzHfF%A$uY|SChQ0c$LX7hc?d`*VQr`j}0{caQ zhxN;38VOP&|CAD}eK-X8)PI@SdXSpC5O|LMFV)@sVK6`2ZN)=ln`8HA0)7|+kO6im z!=H8K2r`7mEdoW|IRR+f!9QmFmF}*AJVBWXf7)Fr!KA4#`6B-@me@Zu|F0|mt7XAOI^6URU#Q5)NgW<-F8x=i#ffCx)F&GR6 zB#pLIU~yMvI)49t#*xcJL_|$2$19Z?9`zHTr9WJf|3<$_3E-7~qwyTNjKO$F37HK4 NEyp$dZ#u4b{{c?_u5kbW literal 0 HcmV?d00001 diff --git a/rag/knowledge-bases/features-examples/08-chunking-recommender/requirements.txt b/rag/knowledge-bases/features-examples/08-chunking-recommender/requirements.txt new file mode 100644 index 00000000..f9994574 --- /dev/null +++ b/rag/knowledge-bases/features-examples/08-chunking-recommender/requirements.txt @@ -0,0 +1,7 @@ +boto3 +opensearch-py +botocore +awscli +retrying +opensearch-py==2.3.1 +pypdf \ No newline at end of file From 8318287cde3d438cfa1084db9ddf43a43f09fd34 Mon Sep 17 00:00:00 2001 From: Mustapha Tawbi Date: Fri, 24 Jan 2025 10:54:35 +0400 Subject: [PATCH 2/5] =?UTF-8?q?=E2=80=98chunking=5Frecommender=E2=80=99?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../ChunkingRecommender.ipynb | 808 ------------------ 1 file changed, 808 deletions(-) delete mode 100644 rag/knowledge-bases/features-examples/08-chunking-recommender/ChunkingRecommender.ipynb diff --git a/rag/knowledge-bases/features-examples/08-chunking-recommender/ChunkingRecommender.ipynb b/rag/knowledge-bases/features-examples/08-chunking-recommender/ChunkingRecommender.ipynb deleted file mode 100644 index c4523e2e..00000000 --- a/rag/knowledge-bases/features-examples/08-chunking-recommender/ChunkingRecommender.ipynb +++ /dev/null @@ -1,808 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "d4a64fff-2c1e-4f34-9c7f-0c8180bb039a", - "metadata": {}, - "source": [ - "## Automated Document Processing Pipeline: Using Foundation Models for Smart Documents Chunking and Knowledge Base Integration\n", - "\n", - "\n", - "##### This Notebook requires an existing Knowledge base on bedrock. To create a knowledge base you can execute the code provided in :\n", - "https://github.com/aws-samples/amazon-bedrock-workshop/blob/main/02_KnowledgeBases_and_RAG/0_create_ingest_documents_test_kb.ipynb\n", - "### Challenge\n", - "\n", - "\n", - "Chunking is the process of dividing documents into smaller sections, or \"chunks,\" before embedding them into a Knowledge Base. This process enhances retrieval efficiency and precision. There are several chunking strategies available, each suited to different types of content and document structures. Examples of chunking strategies supported by Amazon Bedrock are: \n", - "- FIXED_SIZE: Splitting documents into chunks of the approximate size that set.\n", - "- HIERARCHICAL: Splitting documents into layers of chunks where the first layer contains large chunks, and the second layer contains smaller chunks derived from the first layer.\n", - "- SEMANTIC: Split documents into chunks based on groups of similar content derived with natural language processing.\n", - "\n", - "FIXED_SIZE is useful in scenarios requiring predictable chunk sizes for processing. HIERARCHICAL chunking is appropriate when dealing with complex, nested data structures. 
Whereas Semantic Chunking is useful when dealing with complex, contextual information and processing documents where meaning across sentences is highly interconnected. \n", - "The main drawbacks of semantic chunking include higher computational requirements, limited effectiveness across different languages and scalability challenges with large datasets. The main drawbacks of hierarchical chunking include higher computational overhead, difficulty in managing deep hierarchies and slower query performance at deeper levels.\n", - "\n", - "Selecting the right chunking strategy require understanding of benefits and limitations of each strategy in the context of analyzed documents, business requirements and SLAs. To determine the adequate chunking strategy, developer needs to manually assess document before selecting a strategy. The final choice is a balance between efficiency, accuracy, and practical constraints of the specific use case\n", - "\n", - "\n", - "### Approach presented in this notebook\n", - "\n", - "The approach presented in this notebook leverages Foundation Models (FMs) to automate document analysis and ingestion into an Amazon Bedrock Knowledge Base, replacing manual human assessment. The system automatically:\n", - "- Analyzes document structure and content\n", - "- Determines the optimal chunking strategy for each document\n", - "- Generates appropriate chunking configurations\n", - "- Executes the document ingestion process\n", - "\n", - "The solution recognizes that different documents require different chunking approaches, and therefore performs individual assessments to optimize content segmentation for each document type. This automation streamlines the process of building and maintaining knowledge bases while ensuring optimal document processing for better retrieval and usage.\n", - "\n", - "The key idea in this work is using FMs to intelligently analyze and process documents, rather than relying on predetermined or manual chunking strategies.\n", - "\n", - "### Notebook Walkthrough\n", - "The pipeline streamlines the entire process from document analysis to knowledge base population, making it efficient to prepare documents for advanced query and retrieval operations.\n", - "\n", - "![data_ingestion](./img/chunkingAdvs.jpg)\n", - "### Steps: \n", - "\n", - "1. Create Amazon Bedrock Knowledge Base execution role and S3 bucket used as data sources and configure necessary IAM policies \n", - "2. Process files within target folder. For each document, analyze and recommends an optimal chunking strategy (FIXED_SIZE/NONE/HIERARCHICAL/SEMANTIC) and specific configuration parameters\n", - "3. Upload analyzed files to designated S3 buckets and configure buckets as data source for Bedrock KB\n", - "4. Initiate ingestion job \n", - "5. 
Verify data accessibility and accuracy\n" - ] - }, - { - "cell_type": "markdown", - "id": "d066b04e-bc6f-42e6-8836-817d2e0854b1", - "metadata": {}, - "source": [ - "### Setup" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4f36a15a-6cc3-464a-a2f1-fcb2d45ea7b9", - "metadata": {}, - "outputs": [], - "source": [ - "%pip install --force-reinstall -q -r ./requirements.txt --quiet" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5f694d1b-c537-4bf2-a5b1-d4f5c7025620", - "metadata": {}, - "outputs": [], - "source": [ - "# restart kernel\n", - "from IPython.core.display import HTML\n", - "\n", - "HTML(\"\")" - ] - }, - { - "cell_type": "markdown", - "id": "9eb01263-04ab-4471-b73a-366055027873", - "metadata": {}, - "source": [ - "### Initiate parameters \n", - "\n", - "##### Knowledge base ID should have been created from first notebook (https://github.com/aws-samples/amazon-bedrock-workshop/blob/main/02_KnowledgeBases_and_RAG/0_create_ingest_documents_test_kb.ipynb) or similar\n", - "- To get knowledge Base Id using Bedrock console, look int Amazon Bedrock > knowledgebase> knowledgebase \n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "25239d0e-972d-4fff-b200-f20c39714a9e", - "metadata": {}, - "outputs": [], - "source": [ - "import boto3\n", - "import json\n", - "\n", - "# create a boto3 session to dynamically get and set the region name\n", - "session = boto3.Session()\n", - "\n", - "AWS_REGION = session.region_name\n", - "bedrock = boto3.client(\"bedrock-runtime\", region_name=AWS_REGION)\n", - "bedrock_agent_client = session.client(\"bedrock-agent\", region_name=AWS_REGION)\n", - "# model was run in us-west-2 , if you are using us-east-1 then change model ID to \"us.anthropic.claude-3-5-sonnet-20241022-v2:0\" #\n", - "MODEL_NAME = \"anthropic.claude-3-5-sonnet-20241022-v2:0\"\n", - "datasources = []\n", - "\n", - "# create a folder data if not yet done and\n", - "path = \"data\"\n", - "\n", - "kb_id = \"xxxx\" # Retrieve KB First # update value here with your KB ID\n", - "kb = bedrock_agent_client.get_knowledge_base(knowledgeBaseId=kb_id)\n", - "# get bedrock_kb_execution ID - this role should have been create from notebook creating KB\n", - "bedrock_kb_execution_role_arn = kb[\"knowledgeBase\"][\"roleArn\"]\n", - "bedrock_kb_execution_role = bedrock_kb_execution_role_arn.split(\"/\")[-1]\n", - "account_id = boto3.client(\"sts\").get_caller_identity()[\"Account\"]\n", - "print(bedrock_kb_execution_role)" - ] - }, - { - "cell_type": "markdown", - "id": "b9291ec4-e2dc-47c1-950b-9fa7e737bee3", - "metadata": {}, - "source": [ - "### Supporting functions\n", - "##### Function 1 - Createbucket: Checks if an S3 bucket exists and creates it if it doesn't. 
\n", - "##### Function 2 - Upload_file: Upload_files to bucket: Upload a file to an S3 bucket\n", - "##### Function 3 - List all files in a specified directory\n", - "##### Function 4 - Delete a S3 bucket and all objects included within" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7578f6f2-1cd2-4683-b150-1b6900ff77ee", - "metadata": {}, - "outputs": [], - "source": [ - "import logging\n", - "import boto3\n", - "from botocore.exceptions import ClientError\n", - "import os\n", - "\n", - "\n", - "def createbucket(bucketname):\n", - " \"\"\"\n", - " Checks if an S3 bucket exists and creates it if it doesn't.\n", - " \"\"\"\n", - " try:\n", - " s3_client = boto3.client(\"s3\")\n", - " s3_client.head_bucket(Bucket=bucketname)\n", - " print(f\"Bucket {bucketname} Exists\")\n", - " except ClientError as e:\n", - " print(f\"Creating bucket {bucketname}\")\n", - " if AWS_REGION == \"us-east-1\":\n", - " s3bucket = s3_client.create_bucket(Bucket=bucketname)\n", - " else:\n", - " s3bucket = s3_client.create_bucket(\n", - " Bucket=bucketname,\n", - " CreateBucketConfiguration={\"LocationConstraint\": AWS_REGION},\n", - " )\n", - "\n", - "\n", - "def upload_file(file_name, bucket, object_name=None):\n", - " \"\"\"\n", - " Upload a file to an S3 bucket\n", - " \"\"\"\n", - " # If S3 object_name was not specified, use file_name\n", - " if object_name is None:\n", - " object_name = os.path.basename(file_name)\n", - "\n", - " # Upload the file\n", - " s3_client = boto3.client(\"s3\")\n", - " try:\n", - " response = s3_client.upload_file(file_name, bucket, object_name)\n", - " except ClientError as e:\n", - " logging.error(e)\n", - " return False\n", - " return True\n", - "\n", - "\n", - "def listfile(folder):\n", - " \"\"\"\n", - " List all files in a specified directory.\n", - " \"\"\"\n", - " dir_list = os.listdir(folder)\n", - " return dir_list\n", - "\n", - "\n", - "def delete_bucket_and_objects(bucket_name):\n", - " \"\"\"\n", - " Delete a S3 bucket and all objects included in\n", - " \"\"\"\n", - " # Create an S3 client\n", - " s3_client = boto3.client(\"s3\")\n", - " # Create an S3 resource\n", - " s3 = boto3.resource(\"s3\")\n", - " bucket = s3.Bucket(bucket_name)\n", - " bucket.objects.all().delete()\n", - " # Delete the bucket itself\n", - " bucket.delete()" - ] - }, - { - "cell_type": "markdown", - "id": "5d4b8fcc-5789-4df1-b72d-aef328b1a6c2", - "metadata": {}, - "source": [ - "### Standard prompt completion function" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0ee11cad-ea50-4e87-8bd2-b8f159bf4d91", - "metadata": {}, - "outputs": [], - "source": [ - "def get_completion(prompt):\n", - " body = json.dumps(\n", - " {\n", - " \"anthropic_version\": \"\",\n", - " \"max_tokens\": 2000,\n", - " \"messages\": [{\"role\": \"user\", \"content\": prompt}],\n", - " \"temperature\": 0.0,\n", - " \"top_p\": 1,\n", - " \"system\": \"\",\n", - " }\n", - " )\n", - " response = bedrock.invoke_model(body=body, modelId=MODEL_NAME)\n", - " response_body = json.loads(response.get(\"body\").read())\n", - " return response_body.get(\"content\")[0].get(\"text\")" - ] - }, - { - "cell_type": "markdown", - "id": "c549427c-9d3d-485c-a542-93ef49b540fe", - "metadata": {}, - "source": [ - "### Download and prepare datasets \n", - "The test dataset consists of two documents, these files will serve as test cases to validate the model's ability to correctly identify and recommend the most appropriate chunking strategy for each document type." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "217702d0-42bb-4da0-b14b-37b4d0c0b503", - "metadata": {}, - "outputs": [], - "source": [ - "# if not yet created, create folder already\n", - "#!mkdir -p ./data\n", - "\n", - "from urllib.request import urlretrieve\n", - "\n", - "urls = [\n", - " \"https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/2022-Shareholder-Letter.pdf\",\n", - " \"https://www.apple.com/newsroom/pdfs/Q3FY18ConsolidatedFinancialStatements.pdf\",\n", - "]\n", - "filenames = [\n", - " \"AMZN-2022-Shareholder-Letter.pdf\",\n", - " \"Q3FY18ConsolidatedFinancialStatements.pdf\",\n", - "]\n", - "data_root = \"./data/\"\n", - "for idx, url in enumerate(urls):\n", - " file_path = data_root + filenames[idx]\n", - " urlretrieve(url, file_path)" - ] - }, - { - "cell_type": "markdown", - "id": "48551cb4-02a8-4426-88aa-089516a4c1d9", - "metadata": {}, - "source": [ - "### Create 3 S3 buckets, one per chunking strategy\n", - "##### Important Note on AWS Bedrock Knowledge Base Configuration:\n", - "\n", - "The chunking strategy for a data source is permanent and cannot be modified after initial configuration. To address this challenge, we are implementing the following structure:\n", - "\n", - "Three separate S3 buckets will be created, each dedicated to a specific chunking strategy:\n", - "- Bucket for semantic chunking\n", - "- Bucket for hierarchical chunking\n", - "- Bucket for hybrid chunking\n", - "\n", - "These separate buckets approach allows us to maintain different chunking strategies for different document types within the same knowledge base system, ensuring optimal processing for each document category.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "67a84784-fbfe-4ba5-88a5-d2e2a382bfac", - "metadata": {}, - "outputs": [], - "source": [ - "import random\n", - "\n", - "suffix = random.randrange(200, 900)\n", - "s3_client = boto3.client(\"s3\")\n", - "bucket_name_semantic = \"kb-dataset-bucket-semantic-\" + str(suffix)\n", - "bucket_name_fixed = \"kb-dataset-bucket-fixed-\" + str(suffix)\n", - "bucket_name_hierachical = \"kb-dataset-bucket-hierarchical-\" + str(suffix)\n", - "s3_policy_name = \"AmazonBedrockS3PolicyForKnowledgeBase_\" + str(suffix)\n", - "createbucket(bucket_name_semantic)\n", - "createbucket(bucket_name_fixed)\n", - "createbucket(bucket_name_hierachical)" - ] - }, - { - "cell_type": "markdown", - "id": "2a54bc11-8de7-4a67-b8ed-0ba2944fb9e4", - "metadata": {}, - "source": [ - "### Create S3 policies and attach to existing Bedrock role\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6230aaf0-2aa6-429e-9824-0ef2f9b12579", - "metadata": {}, - "outputs": [], - "source": [ - "account_number = boto3.client(\"sts\").get_caller_identity().get(\"Account\")\n", - "iam_client = session.client(\"iam\")\n", - "iam_client = session.client(\"iam\")\n", - "s3_policy_document = {\n", - " \"Version\": \"2012-10-17\",\n", - " \"Statement\": [\n", - " {\n", - " \"Effect\": \"Allow\",\n", - " \"Action\": [\"s3:GetObject\", \"s3:ListBucket\"],\n", - " \"Resource\": [\n", - " f\"arn:aws:s3:::{bucket_name_semantic}\",\n", - " f\"arn:aws:s3:::{bucket_name_semantic}/*\",\n", - " f\"arn:aws:s3:::{bucket_name_fixed}\",\n", - " f\"arn:aws:s3:::{bucket_name_fixed}/*\",\n", - " f\"arn:aws:s3:::{bucket_name_hierachical}\",\n", - " f\"arn:aws:s3:::{bucket_name_hierachical}/*\",\n", - " ],\n", - " \"Condition\": {\"StringEquals\": {\"aws:ResourceAccount\": f\"{account_number}\"}},\n", - " }\n", - " ],\n", 
- "}\n", - "s3_policy = iam_client.create_policy(\n", - " PolicyName=s3_policy_name,\n", - " PolicyDocument=json.dumps(s3_policy_document),\n", - " Description=\"Policy for reading documents from s3\",\n", - ")\n", - "\n", - "# fetch arn of this policy\n", - "s3_policy_arn = s3_policy[\"Policy\"][\"Arn\"]\n", - "iam_client = session.client(\"iam\")\n", - "fm_policy_arn = f\"arn:aws:iam::{account_number}:policy/{s3_policy_name}\"\n", - "iam_client.attach_role_policy(\n", - " RoleName=bedrock_kb_execution_role, PolicyArn=fm_policy_arn\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "339eb2ae-e825-435f-b77b-0524144f081c", - "metadata": {}, - "source": [ - "### Document Analysis\n", - "\n", - "Purpose: analyzes PDF documents using an LLM to recommend the optimal chunking strategy and its associated parameters.\n", - "Input: PDF document\n", - "Output: The function recommends one of the following chunking strategies with specific parameters:\n", - "- HIERARCHICAL Chunking:\n", - " - Maximum parent chunk token size\n", - " - Maximum child chunk token size\n", - " - Overlap tokens\n", - " - Rationale for recommendation\n", - "- SEMANTIC Chunking:\n", - " - Maximum tokens\n", - " - Buffer size\n", - " - Breakpoint percentile threshold\n", - " - Rationale for recommendation\n", - "- FIXED-SIZE Chunking:\n", - " - Maximum tokens\n", - " - Overlap percentage\n", - " - Rationale for recommendation\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c5801051-5411-4659-a303-c06aed74af04", - "metadata": {}, - "outputs": [], - "source": [ - "def chunking_advise(file):\n", - " from langchain.document_loaders import PyPDFLoader\n", - "\n", - " my_docs = []\n", - " my_strategies = []\n", - " strategy = \"\"\n", - " strategytext = \"\"\n", - " path = \"data\"\n", - " strategylist = []\n", - " metadata = [dict(year=2023, source=file)]\n", - " print(\"I am now analyzing the file:\", file)\n", - " file = path + \"/\" + file\n", - " loader = PyPDFLoader(file)\n", - " document = loader.load()\n", - " loader = PyPDFLoader(file)\n", - " document = loader.load()\n", - " # print (document)\n", - " prompt = f\"\"\"SYSTEM you are an advisor expert in LLM chunking strategies,\n", - " USER can you analyze the type, content, format, structure and size of {document}. \n", - " 1. See the actual document content\n", - " 2. Analyze its structure\n", - " 3. Examine the text format\n", - " 4. Understand the document length\n", - " 5. Review any hierarchical elements and Assess the semantic relationships within the content\n", - " 6. Evaluate the formatting and section breaks\n", - " then advise on best LLM chunking Strategy based on this analysis. Recommend only one Strategy, however show recommended strategy preference ratio \n", - " Available strategies to recommend from are: FIXED_SIZE or NONE or HIERARCHICAL or SEMANTIC\n", - " Decide on recommendation first and then, what is the recommendation? \"\"\"\n", - "\n", - " res = get_completion(prompt)\n", - " print(\"my recommendation is:\", res)\n", - " return res\n", - "\n", - "\n", - "def chunking_configuration(strategy, file):\n", - "\n", - " prompt = f\"\"\" USER based on recommendation provide in {strategy} , provide for {file} a recommended chunking configuration, \n", - " if you recommend HIERARCHICAL chunking then provide recommendation for: \n", - " Parent: Maximum parent chunk token size. 
\n", - " Child: Maximum child chunk token size and Overlap Tokens: Number of overlap tokens between each parent chunk and between each parent and its children.\n", - " If recommendation is HIERARCHICAL then provide response using JSON format\n", - " with the keys as \\”Recommend only one Strategy\\”, \\”Maximum Parent chunk token size\\”, \\”Maximum child chunk token size\\”,\\”Overlap Tokens\\”,\n", - " \\\"Rational: please explain rational for decision and explain why each other choice is not preferred, keep rational to 100 words maximum. \\” . \n", - " provide crisp and clear answer, \n", - " if you recommend SEMANTIC then provide response using JSON format with\n", - " the keys as \\”Recommend only one Strategy\\”,\\”Maximum tokens\\”, \\”Buffer size\\”,\\”Breakpoint percentile threshold\\”, \n", - " Buffer size should be less or equal than 1 , Breakpoint percentile threshold should >= 50\n", - " \\”Rational: please explain rational for decision and explain why each other choice is not preferred, keep rational to 100 words maximum. \\” . \n", - " provide crisp and clear answer, \n", - " do not provide recommendation if not enough data inputs and say sorry I need more data,\n", - " if you recommend FIXED_SIZE then provide response using JSON format with\n", - " the keys as \\”Recommend only one Strategy\\”,\\”maxTokens\\”, \\”overlapPercentage \\”,\n", - " \\”Rational: please explain rational for decision and explain why each other choice is not preferred, keep rational to 100 words maximum. \\” .\n", - " provide crisp and clear answer, \n", - " do not provide recommendation if not enough data inputs and say sorry I need more data\"\"\"\n", - "\n", - " res = get_completion(prompt)\n", - " print(res)\n", - " parsed_data = json.loads(res)\n", - " return parsed_data" - ] - }, - { - "cell_type": "markdown", - "id": "579229ec-23b4-4e9f-99db-cb78c7453e4e", - "metadata": {}, - "source": [ - "### Ingest Documents By Strategy\n", - "Purpose: Configures AWS Bedrock Knowledge Base ingestion settings based on the recommended chunking strategy analysis.\n", - "- Interprets the recommended strategy from parsed_data\n", - "- Applies corresponding parameters to create appropriate configuration\n", - "- Selects the matching S3 bucket for the strategy\n", - "- Generates knowledge base metadata\n", - "- Returns all necessary components for Bedrock KB ingestion\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a016a3c8-6759-40fe-a997-91357a3f48e9", - "metadata": {}, - "outputs": [], - "source": [ - "def ingestbystrategy(parsed_data):\n", - "\n", - " chunkingStrategyConfiguration = {}\n", - " strategy = parsed_data.get(\"Recommend only one Strategy\")\n", - "\n", - " # HIERARCHICAL Chunking\n", - " if strategy == \"HIERARCHICAL\":\n", - " p1 = parsed_data[\"Maximum Parent chunk token size\"]\n", - " p2 = parsed_data[\"Maximum child chunk token size\"]\n", - " p3 = parsed_data[\"Overlap Tokens\"]\n", - " bucket_name = bucket_name_hierachical\n", - " name = f\"bedrock-sample-knowledge-base-HIERARCHICAL\"\n", - " description = \"Bedrock Knowledge Bases for and S3 HIERARCHICAL\"\n", - " chunkingStrategyConfiguration = {\n", - " \"chunkingStrategy\": \"HIERARCHICAL\",\n", - " \"hierarchicalChunkingConfiguration\": {\n", - " \"levelConfigurations\": [{\"maxTokens\": p1}, {\"maxTokens\": p2}],\n", - " \"overlapTokens\": p3,\n", - " },\n", - " }\n", - "\n", - " # SEMANTIC Chunking\n", - " if strategy == \"SEMANTIC\":\n", - " p3 = parsed_data[\"Maximum tokens\"]\n", - " p2 = 
int(parsed_data[\"Buffer size\"])\n", - " p1 = parsed_data[\"Breakpoint percentile threshold\"]\n", - " bucket_name = bucket_name_semantic\n", - " name = f\"bedrock-sample-knowledge-base-SEMANTIC\"\n", - " description = \"Bedrock Knowledge Bases for and S3 SEMANTIC\"\n", - " chunkingStrategyConfiguration = {\n", - " \"chunkingStrategy\": \"SEMANTIC\",\n", - " \"semanticChunkingConfiguration\": {\n", - " \"breakpointPercentileThreshold\": p1,\n", - " \"bufferSize\": p2,\n", - " \"maxTokens\": p3,\n", - " },\n", - " }\n", - " # FIXED_SIZE Chunking\n", - " if strategy == \"FIXED_SIZE\":\n", - " p2 = int(parsed_data[\"overlapPercentage\"])\n", - " p1 = int(parsed_data[\"maxTokens\"])\n", - " bucket_name = bucket_name_fixed\n", - " name = f\"bedrock-sample-knowledge-base-FIXED\"\n", - " description = \"Bedrock Knowledge Bases for and S3 FIXED\"\n", - "\n", - " chunkingStrategyConfiguration = {\n", - " \"chunkingStrategy\": \"FIXED_SIZE\",\n", - " \"semanticChunkingConfiguration\": {\"maxTokens\": p1, \"overlapPercentage\": p2},\n", - " }\n", - "\n", - " s3Configuration = {\n", - " \"bucketArn\": f\"arn:aws:s3:::{bucket_name}\",\n", - " }\n", - " return (\n", - " chunkingStrategyConfiguration,\n", - " bucket_name,\n", - " name,\n", - " description,\n", - " s3Configuration,\n", - " )" - ] - }, - { - "cell_type": "markdown", - "id": "4cc202b3-12b9-45e7-a610-e76c28b142ad", - "metadata": {}, - "source": [ - "### Create or retrieve data source from Amazon Bedrock Knowledge Base\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "90628e2d-913d-4290-8d40-78dacd3d0e9b", - "metadata": {}, - "outputs": [], - "source": [ - "def createDS(\n", - " name, description, knowledgeBaseId, s3Configuration, chunkingStrategyConfiguration\n", - "):\n", - " response = bedrock_agent_client.list_data_sources(\n", - " knowledgeBaseId=kb_id, maxResults=12\n", - " )\n", - " print(response)\n", - " for i in range(len(response[\"dataSourceSummaries\"])):\n", - " print(response[\"dataSourceSummaries\"][i][\"name\"], \"::\", name)\n", - " print(response[\"dataSourceSummaries\"][i][\"dataSourceId\"])\n", - " if response[\"dataSourceSummaries\"][i][\"name\"] == name:\n", - " ds = bedrock_agent_client.get_data_source(\n", - " knowledgeBaseId=knowledgeBaseId,\n", - " dataSourceId=response[\"dataSourceSummaries\"][i - 1][\"dataSourceId\"],\n", - " )\n", - " return ds\n", - " ds = bedrock_agent_client.create_data_source(\n", - " name=name,\n", - " description=description,\n", - " knowledgeBaseId=knowledgeBaseId,\n", - " dataDeletionPolicy=\"DELETE\",\n", - " dataSourceConfiguration={\n", - " # # For S3\n", - " \"type\": \"S3\",\n", - " \"s3Configuration\": s3Configuration,\n", - " },\n", - " vectorIngestionConfiguration={\n", - " \"chunkingConfiguration\": chunkingStrategyConfiguration\n", - " },\n", - " )\n", - " return ds" - ] - }, - { - "cell_type": "markdown", - "id": "8e60ea6a-a1dc-47f8-9922-b7c70be750d1", - "metadata": {}, - "source": [ - "### Process PDF files by analyzing content, creating data sources, and uploading to S3.\n", - "\n", - "#### Workflow:\n", - "1. Lists all files in specified directory\n", - "2. 
For each PDF:\n", - " - Analyzes for optimal chunking strategy\n", - " - Creates data source with recommended configuration\n", - " - Uploads file to appropriate S3 bucket " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f6dc6d08-bc1d-4e72-8c9a-c6c73e092ebe", - "metadata": {}, - "outputs": [], - "source": [ - "s3_client = boto3.client(\"s3\")\n", - "dir_list1 = listfile(\"data\")\n", - "print(dir_list1)\n", - "strategylist = []\n", - "for file in dir_list1:\n", - " if \".pdf\" in file:\n", - " chunkingStrategyConfiguration = []\n", - "\n", - " strategy = chunking_advise(file)\n", - " strategy_conf = chunking_configuration(strategy, file)\n", - "\n", - "chunkingStrategyConfiguration, bucket_name, name, description, s3Configuration = (\n", - " ingestbystrategy(strategy_conf)\n", - ")\n", - "print(\"name\", name)\n", - "datasources = createDS(\n", - " name, description, kb_id, s3Configuration, chunkingStrategyConfiguration\n", - ")\n", - "with open(path + \"/\" + file, \"rb\") as f:\n", - " s3_client.upload_fileobj(f, bucket_name, file)" - ] - }, - { - "cell_type": "markdown", - "id": "ceaa41da-ecab-4592-8d78-59815e0dfb62", - "metadata": {}, - "source": [ - "### Ingestion jobs\n", - "##### please ensure that Knowledge base role have the permission to InvokeModel on resource: arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v1" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "43d0d10e-40e3-4769-a5e1-d115fce38041", - "metadata": {}, - "outputs": [], - "source": [ - "from datetime import datetime\n", - "import time\n", - "\n", - "\"\"\"\n", - " Starts and monitors ingestion jobs for all data sources in a knowledge base.\n", - "\"\"\"\n", - "sources = bedrock_agent_client.list_data_sources(knowledgeBaseId=kb_id)\n", - "for i in range(len(sources[\"dataSourceSummaries\"])):\n", - " print(\"ds [dataSourceId]\", sources[\"dataSourceSummaries\"][i - 1][\"dataSourceId\"])\n", - " start_job_response = bedrock_agent_client.start_ingestion_job(\n", - " knowledgeBaseId=kb_id,\n", - " dataSourceId=sources[\"dataSourceSummaries\"][i - 1][\"dataSourceId\"],\n", - " )\n", - " job = start_job_response[\"ingestionJob\"]\n", - " print(job)\n", - " # Get job\n", - " while job[\"status\"] != \"COMPLETE\":\n", - " get_job_response = bedrock_agent_client.get_ingestion_job(\n", - " knowledgeBaseId=kb_id,\n", - " dataSourceId=sources[\"dataSourceSummaries\"][i - 1][\"dataSourceId\"],\n", - " ingestionJobId=job[\"ingestionJobId\"],\n", - " )\n", - " job = get_job_response[\"ingestionJob\"]\n", - " time.sleep(10)" - ] - }, - { - "cell_type": "markdown", - "id": "30c5e219-97bd-4219-8f97-cd7be339cc5e", - "metadata": {}, - "source": [ - "### Try out KB and evaluate result score \n", - "##### try both queries below" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fb36c3b2-3b4e-4fca-8a95-3ae4d46aee98", - "metadata": {}, - "outputs": [], - "source": [ - "# print (response_ret )\n", - "def response_print(retrieve_resp):\n", - " # structure 'retrievalResults': list of contents. 
Each list has content, location, score, metadata\n", - " for num, chunk in enumerate(retrieve_resp[\"retrievalResults\"], 1):\n", - " print(f\"Chunk -length : \", len(chunk[\"content\"][\"text\"]), end=\"\\n\" * 2)\n", - " print(f\"Chunk {num} Location: \", chunk[\"location\"], end=\"\\n\" * 2)\n", - " print(f\"Chunk {num} length: \", chunk[\"location\"], end=\"\\n\" * 2)\n", - " print(f\"Chunk {num} Score: \", chunk[\"score\"], end=\"\\n\" * 2)\n", - " print(f\"Chunk {num} Metadata: \", chunk[\"metadata\"], end=\"\\n\" * 2)\n", - "\n", - "\n", - "query1 = \"what is AWS annual revenue increase\"\n", - "\n", - "query2 = \"what is iphone sales in 2018?\"\n", - "\n", - "bedrock_agent_runtime_client = boto3.client(\"bedrock-agent-runtime\")\n", - "response_ret = bedrock_agent_runtime_client.retrieve(\n", - " knowledgeBaseId=kb_id,\n", - " nextToken=\"string\",\n", - " retrievalConfiguration={\n", - " \"vectorSearchConfiguration\": {\n", - " \"numberOfResults\": 1,\n", - " }\n", - " },\n", - " retrievalQuery={\"text\": query1},\n", - ")\n", - "print(\"Response shpould come from semantic chunked document:\")\n", - "response_print(response_ret)\n", - "\n", - "response_ret2 = bedrock_agent_runtime_client.retrieve(\n", - " knowledgeBaseId=kb_id,\n", - " nextToken=\"string\",\n", - " retrievalConfiguration={\n", - " \"vectorSearchConfiguration\": {\n", - " \"numberOfResults\": 1,\n", - " }\n", - " },\n", - " retrievalQuery={\"text\": query2},\n", - ")\n", - "print(\"Response shpould come from hierarchical chunked document:\")\n", - "response_print(response_ret2)" - ] - }, - { - "cell_type": "markdown", - "id": "bec94bfe-c99e-4e1c-9e97-bad5b3d0c09e", - "metadata": {}, - "source": [ - "##### Clean buckets \n", - "##### NOTE : please delete also Bedrock KB if not required by other works and data sources \n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e67d50d0-8963-40d6-90be-be4c654c015f", - "metadata": {}, - "outputs": [], - "source": [ - "delete_bucket_and_objects(bucket_name_semantic)\n", - "delete_bucket_and_objects(bucket_name_fixed)\n", - "delete_bucket_and_objects(bucket_name_hierachical)" - ] - }, - { - "cell_type": "markdown", - "id": "f157092f-7534-41f5-b664-e9f1ca67d6bc", - "metadata": {}, - "source": [ - "## Conclusion: \n", - "\n", - "This notebook presents a proof-of-concept approach that uses Foundation Models to automate chunking strategy selection for document processing. 
Please note:\n", - "- This is an experimental implementation\n", - "- Results should be validated before production use\n", - "\n", - "This work serves as a starting point for automating chunking strategy decisions, but additional research and validation are needed to ensure reliability across diverse document types and use cases.\n", - "\n", - "Suggested Next Steps:\n", - "- Expand testing across more document types\n", - "- Validate recommendations against human expert decisions\n", - "- Refine the model's decision-making criteria\n", - "- Gather performance metrics in real-world applications\n", - "- Build a validation Framework having a Ground Truth Database and including varied document types and structures using proven validation framework such as RAGA" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1443821d-ad4c-4361-883f-002682160108", - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.10" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} From 721443f8f5b20abc28186994c6f0dc1e3a9c57b2 Mon Sep 17 00:00:00 2001 From: Mustapha Tawbi Date: Fri, 24 Jan 2025 10:55:53 +0400 Subject: [PATCH 3/5] =?UTF-8?q?=E2=80=98chunking=5Frecommender=E2=80=99?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../ChunkingRecommender.ipynb | 808 ++++++++++++++++++ 1 file changed, 808 insertions(+) create mode 100644 rag/knowledge-bases/features-examples/08-chunking-recommender/ChunkingRecommender.ipynb diff --git a/rag/knowledge-bases/features-examples/08-chunking-recommender/ChunkingRecommender.ipynb b/rag/knowledge-bases/features-examples/08-chunking-recommender/ChunkingRecommender.ipynb new file mode 100644 index 00000000..c4523e2e --- /dev/null +++ b/rag/knowledge-bases/features-examples/08-chunking-recommender/ChunkingRecommender.ipynb @@ -0,0 +1,808 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "d4a64fff-2c1e-4f34-9c7f-0c8180bb039a", + "metadata": {}, + "source": [ + "## Automated Document Processing Pipeline: Using Foundation Models for Smart Documents Chunking and Knowledge Base Integration\n", + "\n", + "\n", + "##### This Notebook requires an existing Knowledge base on bedrock. To create a knowledge base you can execute the code provided in :\n", + "https://github.com/aws-samples/amazon-bedrock-workshop/blob/main/02_KnowledgeBases_and_RAG/0_create_ingest_documents_test_kb.ipynb\n", + "### Challenge\n", + "\n", + "\n", + "Chunking is the process of dividing documents into smaller sections, or \"chunks,\" before embedding them into a Knowledge Base. This process enhances retrieval efficiency and precision. There are several chunking strategies available, each suited to different types of content and document structures. 
Examples of chunking strategies supported by Amazon Bedrock are: \n", + "- FIXED_SIZE: Splitting documents into chunks of the approximate size that you set.\n", + "- HIERARCHICAL: Splitting documents into layers of chunks where the first layer contains large chunks, and the second layer contains smaller chunks derived from the first layer.\n", + "- SEMANTIC: Splitting documents into chunks based on groups of similar content derived with natural language processing.\n", + "\n", + "FIXED_SIZE is useful in scenarios requiring predictable chunk sizes for processing. HIERARCHICAL chunking is appropriate when dealing with complex, nested data structures, whereas SEMANTIC chunking is useful when dealing with complex, contextual information and processing documents where meaning across sentences is highly interconnected. \n", + "The main drawbacks of semantic chunking include higher computational requirements, limited effectiveness across different languages, and scalability challenges with large datasets. The main drawbacks of hierarchical chunking include higher computational overhead, difficulty in managing deep hierarchies, and slower query performance at deeper levels.\n", + "\n", + "Selecting the right chunking strategy requires an understanding of the benefits and limitations of each strategy in the context of the analyzed documents, business requirements, and SLAs. To determine the adequate chunking strategy, a developer needs to manually assess each document before selecting a strategy. The final choice is a balance between efficiency, accuracy, and the practical constraints of the specific use case.\n", + "\n", + "\n", + "### Approach presented in this notebook\n", + "\n", + "The approach presented in this notebook leverages Foundation Models (FMs) to automate document analysis and ingestion into an Amazon Bedrock Knowledge Base, replacing manual human assessment. The system automatically:\n", + "- Analyzes document structure and content\n", + "- Determines the optimal chunking strategy for each document\n", + "- Generates appropriate chunking configurations\n", + "- Executes the document ingestion process\n", + "\n", + "The solution recognizes that different documents require different chunking approaches, and therefore performs individual assessments to optimize content segmentation for each document type. This automation streamlines the process of building and maintaining knowledge bases while ensuring optimal document processing for better retrieval and usage.\n", + "\n", + "The key idea in this work is using FMs to intelligently analyze and process documents, rather than relying on predetermined or manual chunking strategies.\n", + "\n", + "### Notebook Walkthrough\n", + "The pipeline streamlines the entire process from document analysis to knowledge base population, making it efficient to prepare documents for advanced query and retrieval operations.\n", + "\n", + "![data_ingestion](./img/chunkingAdvs.jpg)\n", + "### Steps: \n", + "\n", + "1. Create the Amazon Bedrock Knowledge Base execution role and the S3 buckets used as data sources, and configure the necessary IAM policies \n", + "2. Process the files within the target folder. For each document, analyze and recommend an optimal chunking strategy (FIXED_SIZE/NONE/HIERARCHICAL/SEMANTIC) and specific configuration parameters\n", + "3. Upload the analyzed files to the designated S3 buckets and configure the buckets as data sources for the Bedrock KB\n", + "4. Initiate the ingestion jobs \n", + "5. 
Verify data accessibility and accuracy\n" + ] + }, + { + "cell_type": "markdown", + "id": "d066b04e-bc6f-42e6-8836-817d2e0854b1", + "metadata": {}, + "source": [ + "### Setup" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4f36a15a-6cc3-464a-a2f1-fcb2d45ea7b9", + "metadata": {}, + "outputs": [], + "source": [ + "%pip install --force-reinstall -q -r ./requirements.txt --quiet" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5f694d1b-c537-4bf2-a5b1-d4f5c7025620", + "metadata": {}, + "outputs": [], + "source": [ + "# restart kernel\n", + "from IPython.core.display import HTML\n", + "\n", + "HTML(\"\")" + ] + }, + { + "cell_type": "markdown", + "id": "9eb01263-04ab-4471-b73a-366055027873", + "metadata": {}, + "source": [ + "### Initiate parameters \n", + "\n", + "##### Knowledge base ID should have been created from first notebook (https://github.com/aws-samples/amazon-bedrock-workshop/blob/main/02_KnowledgeBases_and_RAG/0_create_ingest_documents_test_kb.ipynb) or similar\n", + "- To get knowledge Base Id using Bedrock console, look int Amazon Bedrock > knowledgebase> knowledgebase \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "25239d0e-972d-4fff-b200-f20c39714a9e", + "metadata": {}, + "outputs": [], + "source": [ + "import boto3\n", + "import json\n", + "\n", + "# create a boto3 session to dynamically get and set the region name\n", + "session = boto3.Session()\n", + "\n", + "AWS_REGION = session.region_name\n", + "bedrock = boto3.client(\"bedrock-runtime\", region_name=AWS_REGION)\n", + "bedrock_agent_client = session.client(\"bedrock-agent\", region_name=AWS_REGION)\n", + "# model was run in us-west-2 , if you are using us-east-1 then change model ID to \"us.anthropic.claude-3-5-sonnet-20241022-v2:0\" #\n", + "MODEL_NAME = \"anthropic.claude-3-5-sonnet-20241022-v2:0\"\n", + "datasources = []\n", + "\n", + "# create a folder data if not yet done and\n", + "path = \"data\"\n", + "\n", + "kb_id = \"xxxx\" # Retrieve KB First # update value here with your KB ID\n", + "kb = bedrock_agent_client.get_knowledge_base(knowledgeBaseId=kb_id)\n", + "# get bedrock_kb_execution ID - this role should have been create from notebook creating KB\n", + "bedrock_kb_execution_role_arn = kb[\"knowledgeBase\"][\"roleArn\"]\n", + "bedrock_kb_execution_role = bedrock_kb_execution_role_arn.split(\"/\")[-1]\n", + "account_id = boto3.client(\"sts\").get_caller_identity()[\"Account\"]\n", + "print(bedrock_kb_execution_role)" + ] + }, + { + "cell_type": "markdown", + "id": "b9291ec4-e2dc-47c1-950b-9fa7e737bee3", + "metadata": {}, + "source": [ + "### Supporting functions\n", + "##### Function 1 - Createbucket: Checks if an S3 bucket exists and creates it if it doesn't. 
\n", + "##### Function 2 - Upload_file: Upload_files to bucket: Upload a file to an S3 bucket\n", + "##### Function 3 - List all files in a specified directory\n", + "##### Function 4 - Delete a S3 bucket and all objects included within" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7578f6f2-1cd2-4683-b150-1b6900ff77ee", + "metadata": {}, + "outputs": [], + "source": [ + "import logging\n", + "import boto3\n", + "from botocore.exceptions import ClientError\n", + "import os\n", + "\n", + "\n", + "def createbucket(bucketname):\n", + " \"\"\"\n", + " Checks if an S3 bucket exists and creates it if it doesn't.\n", + " \"\"\"\n", + " try:\n", + " s3_client = boto3.client(\"s3\")\n", + " s3_client.head_bucket(Bucket=bucketname)\n", + " print(f\"Bucket {bucketname} Exists\")\n", + " except ClientError as e:\n", + " print(f\"Creating bucket {bucketname}\")\n", + " if AWS_REGION == \"us-east-1\":\n", + " s3bucket = s3_client.create_bucket(Bucket=bucketname)\n", + " else:\n", + " s3bucket = s3_client.create_bucket(\n", + " Bucket=bucketname,\n", + " CreateBucketConfiguration={\"LocationConstraint\": AWS_REGION},\n", + " )\n", + "\n", + "\n", + "def upload_file(file_name, bucket, object_name=None):\n", + " \"\"\"\n", + " Upload a file to an S3 bucket\n", + " \"\"\"\n", + " # If S3 object_name was not specified, use file_name\n", + " if object_name is None:\n", + " object_name = os.path.basename(file_name)\n", + "\n", + " # Upload the file\n", + " s3_client = boto3.client(\"s3\")\n", + " try:\n", + " response = s3_client.upload_file(file_name, bucket, object_name)\n", + " except ClientError as e:\n", + " logging.error(e)\n", + " return False\n", + " return True\n", + "\n", + "\n", + "def listfile(folder):\n", + " \"\"\"\n", + " List all files in a specified directory.\n", + " \"\"\"\n", + " dir_list = os.listdir(folder)\n", + " return dir_list\n", + "\n", + "\n", + "def delete_bucket_and_objects(bucket_name):\n", + " \"\"\"\n", + " Delete a S3 bucket and all objects included in\n", + " \"\"\"\n", + " # Create an S3 client\n", + " s3_client = boto3.client(\"s3\")\n", + " # Create an S3 resource\n", + " s3 = boto3.resource(\"s3\")\n", + " bucket = s3.Bucket(bucket_name)\n", + " bucket.objects.all().delete()\n", + " # Delete the bucket itself\n", + " bucket.delete()" + ] + }, + { + "cell_type": "markdown", + "id": "5d4b8fcc-5789-4df1-b72d-aef328b1a6c2", + "metadata": {}, + "source": [ + "### Standard prompt completion function" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0ee11cad-ea50-4e87-8bd2-b8f159bf4d91", + "metadata": {}, + "outputs": [], + "source": [ + "def get_completion(prompt):\n", + " body = json.dumps(\n", + " {\n", + " \"anthropic_version\": \"\",\n", + " \"max_tokens\": 2000,\n", + " \"messages\": [{\"role\": \"user\", \"content\": prompt}],\n", + " \"temperature\": 0.0,\n", + " \"top_p\": 1,\n", + " \"system\": \"\",\n", + " }\n", + " )\n", + " response = bedrock.invoke_model(body=body, modelId=MODEL_NAME)\n", + " response_body = json.loads(response.get(\"body\").read())\n", + " return response_body.get(\"content\")[0].get(\"text\")" + ] + }, + { + "cell_type": "markdown", + "id": "c549427c-9d3d-485c-a542-93ef49b540fe", + "metadata": {}, + "source": [ + "### Download and prepare datasets \n", + "The test dataset consists of two documents, these files will serve as test cases to validate the model's ability to correctly identify and recommend the most appropriate chunking strategy for each document type." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "217702d0-42bb-4da0-b14b-37b4d0c0b503", + "metadata": {}, + "outputs": [], + "source": [ + "# if not yet created, create folder already\n", + "#!mkdir -p ./data\n", + "\n", + "from urllib.request import urlretrieve\n", + "\n", + "urls = [\n", + " \"https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/2022-Shareholder-Letter.pdf\",\n", + " \"https://www.apple.com/newsroom/pdfs/Q3FY18ConsolidatedFinancialStatements.pdf\",\n", + "]\n", + "filenames = [\n", + " \"AMZN-2022-Shareholder-Letter.pdf\",\n", + " \"Q3FY18ConsolidatedFinancialStatements.pdf\",\n", + "]\n", + "data_root = \"./data/\"\n", + "for idx, url in enumerate(urls):\n", + " file_path = data_root + filenames[idx]\n", + " urlretrieve(url, file_path)" + ] + }, + { + "cell_type": "markdown", + "id": "48551cb4-02a8-4426-88aa-089516a4c1d9", + "metadata": {}, + "source": [ + "### Create 3 S3 buckets, one per chunking strategy\n", + "##### Important Note on AWS Bedrock Knowledge Base Configuration:\n", + "\n", + "The chunking strategy for a data source is permanent and cannot be modified after initial configuration. To address this challenge, we are implementing the following structure:\n", + "\n", + "Three separate S3 buckets will be created, each dedicated to a specific chunking strategy:\n", + "- Bucket for semantic chunking\n", + "- Bucket for hierarchical chunking\n", + "- Bucket for hybrid chunking\n", + "\n", + "These separate buckets approach allows us to maintain different chunking strategies for different document types within the same knowledge base system, ensuring optimal processing for each document category.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "67a84784-fbfe-4ba5-88a5-d2e2a382bfac", + "metadata": {}, + "outputs": [], + "source": [ + "import random\n", + "\n", + "suffix = random.randrange(200, 900)\n", + "s3_client = boto3.client(\"s3\")\n", + "bucket_name_semantic = \"kb-dataset-bucket-semantic-\" + str(suffix)\n", + "bucket_name_fixed = \"kb-dataset-bucket-fixed-\" + str(suffix)\n", + "bucket_name_hierachical = \"kb-dataset-bucket-hierarchical-\" + str(suffix)\n", + "s3_policy_name = \"AmazonBedrockS3PolicyForKnowledgeBase_\" + str(suffix)\n", + "createbucket(bucket_name_semantic)\n", + "createbucket(bucket_name_fixed)\n", + "createbucket(bucket_name_hierachical)" + ] + }, + { + "cell_type": "markdown", + "id": "2a54bc11-8de7-4a67-b8ed-0ba2944fb9e4", + "metadata": {}, + "source": [ + "### Create S3 policies and attach to existing Bedrock role\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6230aaf0-2aa6-429e-9824-0ef2f9b12579", + "metadata": {}, + "outputs": [], + "source": [ + "account_number = boto3.client(\"sts\").get_caller_identity().get(\"Account\")\n", + "iam_client = session.client(\"iam\")\n", + "iam_client = session.client(\"iam\")\n", + "s3_policy_document = {\n", + " \"Version\": \"2012-10-17\",\n", + " \"Statement\": [\n", + " {\n", + " \"Effect\": \"Allow\",\n", + " \"Action\": [\"s3:GetObject\", \"s3:ListBucket\"],\n", + " \"Resource\": [\n", + " f\"arn:aws:s3:::{bucket_name_semantic}\",\n", + " f\"arn:aws:s3:::{bucket_name_semantic}/*\",\n", + " f\"arn:aws:s3:::{bucket_name_fixed}\",\n", + " f\"arn:aws:s3:::{bucket_name_fixed}/*\",\n", + " f\"arn:aws:s3:::{bucket_name_hierachical}\",\n", + " f\"arn:aws:s3:::{bucket_name_hierachical}/*\",\n", + " ],\n", + " \"Condition\": {\"StringEquals\": {\"aws:ResourceAccount\": f\"{account_number}\"}},\n", + " }\n", + " ],\n", 
+ "}\n", + "s3_policy = iam_client.create_policy(\n", + " PolicyName=s3_policy_name,\n", + " PolicyDocument=json.dumps(s3_policy_document),\n", + " Description=\"Policy for reading documents from s3\",\n", + ")\n", + "\n", + "# fetch arn of this policy\n", + "s3_policy_arn = s3_policy[\"Policy\"][\"Arn\"]\n", + "iam_client = session.client(\"iam\")\n", + "fm_policy_arn = f\"arn:aws:iam::{account_number}:policy/{s3_policy_name}\"\n", + "iam_client.attach_role_policy(\n", + " RoleName=bedrock_kb_execution_role, PolicyArn=fm_policy_arn\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "339eb2ae-e825-435f-b77b-0524144f081c", + "metadata": {}, + "source": [ + "### Document Analysis\n", + "\n", + "Purpose: analyzes PDF documents using an LLM to recommend the optimal chunking strategy and its associated parameters.\n", + "Input: PDF document\n", + "Output: The function recommends one of the following chunking strategies with specific parameters:\n", + "- HIERARCHICAL Chunking:\n", + " - Maximum parent chunk token size\n", + " - Maximum child chunk token size\n", + " - Overlap tokens\n", + " - Rationale for recommendation\n", + "- SEMANTIC Chunking:\n", + " - Maximum tokens\n", + " - Buffer size\n", + " - Breakpoint percentile threshold\n", + " - Rationale for recommendation\n", + "- FIXED-SIZE Chunking:\n", + " - Maximum tokens\n", + " - Overlap percentage\n", + " - Rationale for recommendation\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c5801051-5411-4659-a303-c06aed74af04", + "metadata": {}, + "outputs": [], + "source": [ + "def chunking_advise(file):\n", + " from langchain.document_loaders import PyPDFLoader\n", + "\n", + " my_docs = []\n", + " my_strategies = []\n", + " strategy = \"\"\n", + " strategytext = \"\"\n", + " path = \"data\"\n", + " strategylist = []\n", + " metadata = [dict(year=2023, source=file)]\n", + " print(\"I am now analyzing the file:\", file)\n", + " file = path + \"/\" + file\n", + " loader = PyPDFLoader(file)\n", + " document = loader.load()\n", + " loader = PyPDFLoader(file)\n", + " document = loader.load()\n", + " # print (document)\n", + " prompt = f\"\"\"SYSTEM you are an advisor expert in LLM chunking strategies,\n", + " USER can you analyze the type, content, format, structure and size of {document}. \n", + " 1. See the actual document content\n", + " 2. Analyze its structure\n", + " 3. Examine the text format\n", + " 4. Understand the document length\n", + " 5. Review any hierarchical elements and Assess the semantic relationships within the content\n", + " 6. Evaluate the formatting and section breaks\n", + " then advise on best LLM chunking Strategy based on this analysis. Recommend only one Strategy, however show recommended strategy preference ratio \n", + " Available strategies to recommend from are: FIXED_SIZE or NONE or HIERARCHICAL or SEMANTIC\n", + " Decide on recommendation first and then, what is the recommendation? \"\"\"\n", + "\n", + " res = get_completion(prompt)\n", + " print(\"my recommendation is:\", res)\n", + " return res\n", + "\n", + "\n", + "def chunking_configuration(strategy, file):\n", + "\n", + " prompt = f\"\"\" USER based on recommendation provide in {strategy} , provide for {file} a recommended chunking configuration, \n", + " if you recommend HIERARCHICAL chunking then provide recommendation for: \n", + " Parent: Maximum parent chunk token size. 
\n", + " Child: Maximum child chunk token size and Overlap Tokens: Number of overlap tokens between each parent chunk and between each parent and its children.\n", + " If recommendation is HIERARCHICAL then provide response using JSON format\n", + " with the keys as \\”Recommend only one Strategy\\”, \\”Maximum Parent chunk token size\\”, \\”Maximum child chunk token size\\”,\\”Overlap Tokens\\”,\n", + " \\\"Rational: please explain rational for decision and explain why each other choice is not preferred, keep rational to 100 words maximum. \\” . \n", + " provide crisp and clear answer, \n", + " if you recommend SEMANTIC then provide response using JSON format with\n", + " the keys as \\”Recommend only one Strategy\\”,\\”Maximum tokens\\”, \\”Buffer size\\”,\\”Breakpoint percentile threshold\\”, \n", + " Buffer size should be less or equal than 1 , Breakpoint percentile threshold should >= 50\n", + " \\”Rational: please explain rational for decision and explain why each other choice is not preferred, keep rational to 100 words maximum. \\” . \n", + " provide crisp and clear answer, \n", + " do not provide recommendation if not enough data inputs and say sorry I need more data,\n", + " if you recommend FIXED_SIZE then provide response using JSON format with\n", + " the keys as \\”Recommend only one Strategy\\”,\\”maxTokens\\”, \\”overlapPercentage \\”,\n", + " \\”Rational: please explain rational for decision and explain why each other choice is not preferred, keep rational to 100 words maximum. \\” .\n", + " provide crisp and clear answer, \n", + " do not provide recommendation if not enough data inputs and say sorry I need more data\"\"\"\n", + "\n", + " res = get_completion(prompt)\n", + " print(res)\n", + " parsed_data = json.loads(res)\n", + " return parsed_data" + ] + }, + { + "cell_type": "markdown", + "id": "579229ec-23b4-4e9f-99db-cb78c7453e4e", + "metadata": {}, + "source": [ + "### Ingest Documents By Strategy\n", + "Purpose: Configures AWS Bedrock Knowledge Base ingestion settings based on the recommended chunking strategy analysis.\n", + "- Interprets the recommended strategy from parsed_data\n", + "- Applies corresponding parameters to create appropriate configuration\n", + "- Selects the matching S3 bucket for the strategy\n", + "- Generates knowledge base metadata\n", + "- Returns all necessary components for Bedrock KB ingestion\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a016a3c8-6759-40fe-a997-91357a3f48e9", + "metadata": {}, + "outputs": [], + "source": [ + "def ingestbystrategy(parsed_data):\n", + "\n", + " chunkingStrategyConfiguration = {}\n", + " strategy = parsed_data.get(\"Recommend only one Strategy\")\n", + "\n", + " # HIERARCHICAL Chunking\n", + " if strategy == \"HIERARCHICAL\":\n", + " p1 = parsed_data[\"Maximum Parent chunk token size\"]\n", + " p2 = parsed_data[\"Maximum child chunk token size\"]\n", + " p3 = parsed_data[\"Overlap Tokens\"]\n", + " bucket_name = bucket_name_hierachical\n", + " name = f\"bedrock-sample-knowledge-base-HIERARCHICAL\"\n", + " description = \"Bedrock Knowledge Bases for and S3 HIERARCHICAL\"\n", + " chunkingStrategyConfiguration = {\n", + " \"chunkingStrategy\": \"HIERARCHICAL\",\n", + " \"hierarchicalChunkingConfiguration\": {\n", + " \"levelConfigurations\": [{\"maxTokens\": p1}, {\"maxTokens\": p2}],\n", + " \"overlapTokens\": p3,\n", + " },\n", + " }\n", + "\n", + " # SEMANTIC Chunking\n", + " if strategy == \"SEMANTIC\":\n", + " p3 = parsed_data[\"Maximum tokens\"]\n", + " p2 = 
int(parsed_data[\"Buffer size\"])\n", + " p1 = parsed_data[\"Breakpoint percentile threshold\"]\n", + " bucket_name = bucket_name_semantic\n", + " name = f\"bedrock-sample-knowledge-base-SEMANTIC\"\n", + " description = \"Bedrock Knowledge Bases for and S3 SEMANTIC\"\n", + " chunkingStrategyConfiguration = {\n", + " \"chunkingStrategy\": \"SEMANTIC\",\n", + " \"semanticChunkingConfiguration\": {\n", + " \"breakpointPercentileThreshold\": p1,\n", + " \"bufferSize\": p2,\n", + " \"maxTokens\": p3,\n", + " },\n", + " }\n", + " # FIXED_SIZE Chunking\n", + " if strategy == \"FIXED_SIZE\":\n", + " p2 = int(parsed_data[\"overlapPercentage\"])\n", + " p1 = int(parsed_data[\"maxTokens\"])\n", + " bucket_name = bucket_name_fixed\n", + " name = f\"bedrock-sample-knowledge-base-FIXED\"\n", + " description = \"Bedrock Knowledge Bases for and S3 FIXED\"\n", + "\n", + " chunkingStrategyConfiguration = {\n", + " \"chunkingStrategy\": \"FIXED_SIZE\",\n", + " \"semanticChunkingConfiguration\": {\"maxTokens\": p1, \"overlapPercentage\": p2},\n", + " }\n", + "\n", + " s3Configuration = {\n", + " \"bucketArn\": f\"arn:aws:s3:::{bucket_name}\",\n", + " }\n", + " return (\n", + " chunkingStrategyConfiguration,\n", + " bucket_name,\n", + " name,\n", + " description,\n", + " s3Configuration,\n", + " )" + ] + }, + { + "cell_type": "markdown", + "id": "4cc202b3-12b9-45e7-a610-e76c28b142ad", + "metadata": {}, + "source": [ + "### Create or retrieve data source from Amazon Bedrock Knowledge Base\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "90628e2d-913d-4290-8d40-78dacd3d0e9b", + "metadata": {}, + "outputs": [], + "source": [ + "def createDS(\n", + " name, description, knowledgeBaseId, s3Configuration, chunkingStrategyConfiguration\n", + "):\n", + " response = bedrock_agent_client.list_data_sources(\n", + " knowledgeBaseId=kb_id, maxResults=12\n", + " )\n", + " print(response)\n", + " for i in range(len(response[\"dataSourceSummaries\"])):\n", + " print(response[\"dataSourceSummaries\"][i][\"name\"], \"::\", name)\n", + " print(response[\"dataSourceSummaries\"][i][\"dataSourceId\"])\n", + " if response[\"dataSourceSummaries\"][i][\"name\"] == name:\n", + " ds = bedrock_agent_client.get_data_source(\n", + " knowledgeBaseId=knowledgeBaseId,\n", + " dataSourceId=response[\"dataSourceSummaries\"][i - 1][\"dataSourceId\"],\n", + " )\n", + " return ds\n", + " ds = bedrock_agent_client.create_data_source(\n", + " name=name,\n", + " description=description,\n", + " knowledgeBaseId=knowledgeBaseId,\n", + " dataDeletionPolicy=\"DELETE\",\n", + " dataSourceConfiguration={\n", + " # # For S3\n", + " \"type\": \"S3\",\n", + " \"s3Configuration\": s3Configuration,\n", + " },\n", + " vectorIngestionConfiguration={\n", + " \"chunkingConfiguration\": chunkingStrategyConfiguration\n", + " },\n", + " )\n", + " return ds" + ] + }, + { + "cell_type": "markdown", + "id": "8e60ea6a-a1dc-47f8-9922-b7c70be750d1", + "metadata": {}, + "source": [ + "### Process PDF files by analyzing content, creating data sources, and uploading to S3.\n", + "\n", + "#### Workflow:\n", + "1. Lists all files in specified directory\n", + "2. 
For each PDF:\n", + " - Analyzes for optimal chunking strategy\n", + " - Creates data source with recommended configuration\n", + " - Uploads file to appropriate S3 bucket " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f6dc6d08-bc1d-4e72-8c9a-c6c73e092ebe", + "metadata": {}, + "outputs": [], + "source": [ + "s3_client = boto3.client(\"s3\")\n", + "dir_list1 = listfile(\"data\")\n", + "print(dir_list1)\n", + "strategylist = []\n", + "for file in dir_list1:\n", + " if \".pdf\" in file:\n", + " chunkingStrategyConfiguration = []\n", + "\n", + " strategy = chunking_advise(file)\n", + " strategy_conf = chunking_configuration(strategy, file)\n", + "\n", + "chunkingStrategyConfiguration, bucket_name, name, description, s3Configuration = (\n", + " ingestbystrategy(strategy_conf)\n", + ")\n", + "print(\"name\", name)\n", + "datasources = createDS(\n", + " name, description, kb_id, s3Configuration, chunkingStrategyConfiguration\n", + ")\n", + "with open(path + \"/\" + file, \"rb\") as f:\n", + " s3_client.upload_fileobj(f, bucket_name, file)" + ] + }, + { + "cell_type": "markdown", + "id": "ceaa41da-ecab-4592-8d78-59815e0dfb62", + "metadata": {}, + "source": [ + "### Ingestion jobs\n", + "##### please ensure that Knowledge base role have the permission to InvokeModel on resource: arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v1" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "43d0d10e-40e3-4769-a5e1-d115fce38041", + "metadata": {}, + "outputs": [], + "source": [ + "from datetime import datetime\n", + "import time\n", + "\n", + "\"\"\"\n", + " Starts and monitors ingestion jobs for all data sources in a knowledge base.\n", + "\"\"\"\n", + "sources = bedrock_agent_client.list_data_sources(knowledgeBaseId=kb_id)\n", + "for i in range(len(sources[\"dataSourceSummaries\"])):\n", + " print(\"ds [dataSourceId]\", sources[\"dataSourceSummaries\"][i - 1][\"dataSourceId\"])\n", + " start_job_response = bedrock_agent_client.start_ingestion_job(\n", + " knowledgeBaseId=kb_id,\n", + " dataSourceId=sources[\"dataSourceSummaries\"][i - 1][\"dataSourceId\"],\n", + " )\n", + " job = start_job_response[\"ingestionJob\"]\n", + " print(job)\n", + " # Get job\n", + " while job[\"status\"] != \"COMPLETE\":\n", + " get_job_response = bedrock_agent_client.get_ingestion_job(\n", + " knowledgeBaseId=kb_id,\n", + " dataSourceId=sources[\"dataSourceSummaries\"][i - 1][\"dataSourceId\"],\n", + " ingestionJobId=job[\"ingestionJobId\"],\n", + " )\n", + " job = get_job_response[\"ingestionJob\"]\n", + " time.sleep(10)" + ] + }, + { + "cell_type": "markdown", + "id": "30c5e219-97bd-4219-8f97-cd7be339cc5e", + "metadata": {}, + "source": [ + "### Try out KB and evaluate result score \n", + "##### try both queries below" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fb36c3b2-3b4e-4fca-8a95-3ae4d46aee98", + "metadata": {}, + "outputs": [], + "source": [ + "# print (response_ret )\n", + "def response_print(retrieve_resp):\n", + " # structure 'retrievalResults': list of contents. 
Each list has content, location, score, metadata\n", + " for num, chunk in enumerate(retrieve_resp[\"retrievalResults\"], 1):\n", + " print(f\"Chunk -length : \", len(chunk[\"content\"][\"text\"]), end=\"\\n\" * 2)\n", + " print(f\"Chunk {num} Location: \", chunk[\"location\"], end=\"\\n\" * 2)\n", + " print(f\"Chunk {num} length: \", chunk[\"location\"], end=\"\\n\" * 2)\n", + " print(f\"Chunk {num} Score: \", chunk[\"score\"], end=\"\\n\" * 2)\n", + " print(f\"Chunk {num} Metadata: \", chunk[\"metadata\"], end=\"\\n\" * 2)\n", + "\n", + "\n", + "query1 = \"what is AWS annual revenue increase\"\n", + "\n", + "query2 = \"what is iphone sales in 2018?\"\n", + "\n", + "bedrock_agent_runtime_client = boto3.client(\"bedrock-agent-runtime\")\n", + "response_ret = bedrock_agent_runtime_client.retrieve(\n", + " knowledgeBaseId=kb_id,\n", + " nextToken=\"string\",\n", + " retrievalConfiguration={\n", + " \"vectorSearchConfiguration\": {\n", + " \"numberOfResults\": 1,\n", + " }\n", + " },\n", + " retrievalQuery={\"text\": query1},\n", + ")\n", + "print(\"Response shpould come from semantic chunked document:\")\n", + "response_print(response_ret)\n", + "\n", + "response_ret2 = bedrock_agent_runtime_client.retrieve(\n", + " knowledgeBaseId=kb_id,\n", + " nextToken=\"string\",\n", + " retrievalConfiguration={\n", + " \"vectorSearchConfiguration\": {\n", + " \"numberOfResults\": 1,\n", + " }\n", + " },\n", + " retrievalQuery={\"text\": query2},\n", + ")\n", + "print(\"Response shpould come from hierarchical chunked document:\")\n", + "response_print(response_ret2)" + ] + }, + { + "cell_type": "markdown", + "id": "bec94bfe-c99e-4e1c-9e97-bad5b3d0c09e", + "metadata": {}, + "source": [ + "##### Clean buckets \n", + "##### NOTE : please delete also Bedrock KB if not required by other works and data sources \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e67d50d0-8963-40d6-90be-be4c654c015f", + "metadata": {}, + "outputs": [], + "source": [ + "delete_bucket_and_objects(bucket_name_semantic)\n", + "delete_bucket_and_objects(bucket_name_fixed)\n", + "delete_bucket_and_objects(bucket_name_hierachical)" + ] + }, + { + "cell_type": "markdown", + "id": "f157092f-7534-41f5-b664-e9f1ca67d6bc", + "metadata": {}, + "source": [ + "## Conclusion: \n", + "\n", + "This notebook presents a proof-of-concept approach that uses Foundation Models to automate chunking strategy selection for document processing. 
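A lightweight way to check those recommendations before relying on them is to compare the model's output against a small set of hand-labelled expectations. A minimal sketch, assuming the `chunking_advise` and `chunking_configuration` helpers defined earlier are available and that the expected labels are supplied by a human reviewer (the values shown are placeholders, not ground truth):

```python
# Hypothetical spot check: compare recommended strategies against a small,
# human-labelled expectation table. The labels below are placeholders.
expected_strategies = {
    "AMZN-2022-Shareholder-Letter.pdf": "SEMANTIC",
    "Q3FY18ConsolidatedFinancialStatements.pdf": "HIERARCHICAL",
}

agreement = 0
for doc, expected in expected_strategies.items():
    parsed = chunking_configuration(chunking_advise(doc), doc)
    recommended = parsed.get("Recommend only one Strategy")
    print(f"{doc}: expected={expected}, recommended={recommended}")
    agreement += int(recommended == expected)

print(f"Agreement: {agreement}/{len(expected_strategies)}")
```

A fuller evaluation would extend this idea to more documents and to retrieval-quality metrics, as outlined in the next steps below.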
Please note:\n", + "- This is an experimental implementation\n", + "- Results should be validated before production use\n", + "\n", + "This work serves as a starting point for automating chunking strategy decisions, but additional research and validation are needed to ensure reliability across diverse document types and use cases.\n", + "\n", + "Suggested Next Steps:\n", + "- Expand testing across more document types\n", + "- Validate recommendations against human expert decisions\n", + "- Refine the model's decision-making criteria\n", + "- Gather performance metrics in real-world applications\n", + "- Build a validation Framework having a Ground Truth Database and including varied document types and structures using proven validation framework such as RAGA" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1443821d-ad4c-4361-883f-002682160108", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From dbe65278de01413088bd8805d79f7ea6ba82ae84 Mon Sep 17 00:00:00 2001 From: Mustapha Tawbi Date: Fri, 24 Jan 2025 11:24:00 +0400 Subject: [PATCH 4/5] =?UTF-8?q?=E2=80=98chunking=5Frecommender=E2=80=99?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../ChunkingRecommender.ipynb | 252 +++++++++++++++--- 1 file changed, 218 insertions(+), 34 deletions(-) diff --git a/rag/knowledge-bases/features-examples/08-chunking-recommender/ChunkingRecommender.ipynb b/rag/knowledge-bases/features-examples/08-chunking-recommender/ChunkingRecommender.ipynb index c4523e2e..5e09561f 100644 --- a/rag/knowledge-bases/features-examples/08-chunking-recommender/ChunkingRecommender.ipynb +++ b/rag/knowledge-bases/features-examples/08-chunking-recommender/ChunkingRecommender.ipynb @@ -69,10 +69,24 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 53, "id": "5f694d1b-c537-4bf2-a5b1-d4f5c7025620", "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/html": [ + "" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 53, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "# restart kernel\n", "from IPython.core.display import HTML\n", @@ -114,7 +128,7 @@ "# create a folder data if not yet done and\n", "path = \"data\"\n", "\n", - "kb_id = \"xxxx\" # Retrieve KB First # update value here with your KB ID\n", + "kb_id = \"XXXXX\" # Retrieve KB First # update value here with your KB ID\n", "kb = bedrock_agent_client.get_knowledge_base(knowledgeBaseId=kb_id)\n", "# get bedrock_kb_execution ID - this role should have been create from notebook creating KB\n", "bedrock_kb_execution_role_arn = kb[\"knowledgeBase\"][\"roleArn\"]\n", @@ -137,7 +151,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 55, "id": "7578f6f2-1cd2-4683-b150-1b6900ff77ee", "metadata": {}, "outputs": [], @@ -217,7 +231,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 56, "id": "0ee11cad-ea50-4e87-8bd2-b8f159bf4d91", "metadata": {}, "outputs": [], @@ -226,7 +240,7 @@ " body = json.dumps(\n", " {\n", " \"anthropic_version\": \"\",\n", - " 
\"max_tokens\": 2000,\n", + " \"max_tokens\": 1500,\n", " \"messages\": [{\"role\": \"user\", \"content\": prompt}],\n", " \"temperature\": 0.0,\n", " \"top_p\": 1,\n", @@ -249,7 +263,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 57, "id": "217702d0-42bb-4da0-b14b-37b4d0c0b503", "metadata": {}, "outputs": [], @@ -390,7 +404,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 65, "id": "c5801051-5411-4659-a303-c06aed74af04", "metadata": {}, "outputs": [], @@ -424,9 +438,9 @@ " Available strategies to recommend from are: FIXED_SIZE or NONE or HIERARCHICAL or SEMANTIC\n", " Decide on recommendation first and then, what is the recommendation? \"\"\"\n", "\n", - " res = get_completion(prompt)\n", - " print(\"my recommendation is:\", res)\n", - " return res\n", + " resultA = get_completion(prompt)\n", + " print(\"my recommendation is:\", resultA)\n", + " return resultA\n", "\n", "\n", "def chunking_configuration(strategy, file):\n", @@ -452,7 +466,7 @@ " do not provide recommendation if not enough data inputs and say sorry I need more data\"\"\"\n", "\n", " res = get_completion(prompt)\n", - " print(res)\n", + " #print(res)\n", " parsed_data = json.loads(res)\n", " return parsed_data" ] @@ -473,7 +487,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 66, "id": "a016a3c8-6759-40fe-a997-91357a3f48e9", "metadata": {}, "outputs": [], @@ -550,7 +564,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 69, "id": "90628e2d-913d-4290-8d40-78dacd3d0e9b", "metadata": {}, "outputs": [], @@ -561,10 +575,7 @@ " response = bedrock_agent_client.list_data_sources(\n", " knowledgeBaseId=kb_id, maxResults=12\n", " )\n", - " print(response)\n", " for i in range(len(response[\"dataSourceSummaries\"])):\n", - " print(response[\"dataSourceSummaries\"][i][\"name\"], \"::\", name)\n", - " print(response[\"dataSourceSummaries\"][i][\"dataSourceId\"])\n", " if response[\"dataSourceSummaries\"][i][\"name\"] == name:\n", " ds = bedrock_agent_client.get_data_source(\n", " knowledgeBaseId=knowledgeBaseId,\n", @@ -605,10 +616,170 @@ }, { "cell_type": "code", - "execution_count": null, - "id": "f6dc6d08-bc1d-4e72-8c9a-c6c73e092ebe", - "metadata": {}, - "outputs": [], + "execution_count": 70, + "id": "ee575c20-388b-4119-bd0d-080a71a5cbd0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['.ipynb_checkpoints', 'AMZN-2022-Shareholder-Letter.pdf', 'Q3FY18ConsolidatedFinancialStatements.pdf']\n", + "I am now analyzing the file: AMZN-2022-Shareholder-Letter.pdf\n", + "my recommendation is: Let me analyze the document according to your criteria:\n", + "\n", + "1. Content Analysis:\n", + "- This is Amazon's 2022 Shareholder Letter plus the 1997 letter\n", + "- Contains business updates, strategy, financial information, and future outlook\n", + "- Includes both narrative and some structured sections\n", + "\n", + "2. Structure Analysis:\n", + "- Clear sections with topics and subtopics\n", + "- Natural thematic breaks between different business areas\n", + "- Contains both narrative paragraphs and bullet-pointed lists\n", + "- Logical flow from one topic to another\n", + "\n", + "3. Text Format:\n", + "- Mixed format with paragraphs, lists, and quotes\n", + "- Consistent formatting within sections\n", + "- Clear section breaks and topic transitions\n", + "- Some numerical data and statistics embedded\n", + "\n", + "4. 
Document Length:\n", + "- Multiple pages (10 pages approximately)\n", + "- Substantial content per page\n", + "- Medium to long paragraphs\n", + "- Varied section lengths\n", + "\n", + "5. Hierarchical/Semantic Elements:\n", + "- Clear topic hierarchy present\n", + "- Strong semantic relationships between sections\n", + "- Natural topical boundaries\n", + "- Coherent thematic groupings\n", + "\n", + "6. Formatting/Section Breaks:\n", + "- Well-defined section breaks\n", + "- Clear paragraph separation\n", + "- Distinct topic transitions\n", + "- Logical content organization\n", + "\n", + "Strategy Preference Ratio:\n", + "SEMANTIC: 85%\n", + "HIERARCHICAL: 70%\n", + "FIXED_SIZE: 40%\n", + "NONE: 5%\n", + "\n", + "Recommended Strategy: SEMANTIC\n", + "\n", + "Reasoning:\n", + "A SEMANTIC chunking strategy would be most effective because:\n", + "1. The document has strong natural semantic relationships\n", + "2. Content is organized by distinct business topics and themes\n", + "3. Information flow follows logical topic boundaries\n", + "4. Related concepts and ideas are grouped together\n", + "5. The narrative structure maintains context within topical sections\n", + "\n", + "This approach would allow:\n", + "- Preservation of context within related content\n", + "- More meaningful chunks for LLM processing\n", + "- Better maintenance of topical relationships\n", + "- More natural Q&A capabilities\n", + "- Improved information retrieval accuracy\n", + "\n", + "The semantic chunking would help maintain the integrity of related concepts while creating meaningful, context-aware segments that an LLM can process effectively.\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Ignoring wrong pointing object 8 0 (offset 0)\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "I am now analyzing the file: Q3FY18ConsolidatedFinancialStatements.pdf\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Ignoring wrong pointing object 8 0 (offset 0)\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "my recommendation is: I'll analyze the document following your structure and then provide a chunking strategy recommendation.\n", + "\n", + "1. Document Content Analysis:\n", + "- Financial statements from Apple Inc.\n", + "- Contains 3 major financial reports:\n", + " * Consolidated Statements of Operations\n", + " * Consolidated Balance Sheets\n", + " * Consolidated Statements of Cash Flows\n", + "\n", + "2. Structure Analysis:\n", + "- Each page represents a distinct financial statement\n", + "- Highly structured tabular data\n", + "- Clear hierarchical organization within each statement\n", + "- Consistent formatting across documents\n", + "\n", + "3. Text Format:\n", + "- Numerical data in columns\n", + "- Section headers and subheaders\n", + "- Consistent indentation for subcategories\n", + "- Mixed content (text and numbers)\n", + "\n", + "4. Document Length:\n", + "- 3 pages total\n", + "- Each page contains a complete financial statement\n", + "- Moderate length per page\n", + "- Natural breaks between statements\n", + "\n", + "5. 
Hierarchical Elements & Semantic Relationships:\n", + "- Strong hierarchical organization:\n", + " * Main categories (Assets, Liabilities, etc.)\n", + " * Subcategories (Current assets, Long-term assets, etc.)\n", + " * Line items under each category\n", + "- Clear semantic relationships between financial items\n", + "- Parent-child relationships in data structure\n", + "\n", + "6. Formatting and Section Breaks:\n", + "- Clear section demarcation\n", + "- Consistent indentation levels\n", + "- Natural breaks between major categories\n", + "- Well-defined table structure\n", + "\n", + "RECOMMENDATION: HIERARCHICAL\n", + "\n", + "Strategy Preference Ratio:\n", + "- HIERARCHICAL: 70%\n", + "- SEMANTIC: 20%\n", + "- FIXED_SIZE: 8%\n", + "- NONE: 2%\n", + "\n", + "Reasoning for HIERARCHICAL recommendation:\n", + "1. The content has natural hierarchical structure in financial statements\n", + "2. Each statement has clear parent-child relationships\n", + "3. The data is organized in logical nested categories\n", + "4. Maintaining hierarchical relationships is crucial for financial data interpretation\n", + "5. This approach would preserve the contextual relationship between headers and their associated data\n", + "6. Would allow for more meaningful query responses by maintaining the financial statement structure\n", + "\n", + "The hierarchical chunking would allow:\n", + "- Chunks based on major financial statement sections\n", + "- Preservation of header-data relationships\n", + "- Maintenance of financial context\n", + "- Better handling of nested categories\n", + "- More accurate responses to financial queries\n" + ] + } + ], "source": [ "s3_client = boto3.client(\"s3\")\n", "dir_list1 = listfile(\"data\")\n", @@ -624,7 +795,6 @@ "chunkingStrategyConfiguration, bucket_name, name, description, s3Configuration = (\n", " ingestbystrategy(strategy_conf)\n", ")\n", - "print(\"name\", name)\n", "datasources = createDS(\n", " name, description, kb_id, s3Configuration, chunkingStrategyConfiguration\n", ")\n", @@ -685,12 +855,12 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 76, "id": "fb36c3b2-3b4e-4fca-8a95-3ae4d46aee98", "metadata": {}, "outputs": [], "source": [ - "# print (response_ret )\n", + "bedrock_agent_runtime_client = boto3.client(\"bedrock-agent-runtime\")\n", "def response_print(retrieve_resp):\n", " # structure 'retrievalResults': list of contents. 
Each list has content, location, score, metadata\n", " for num, chunk in enumerate(retrieve_resp[\"retrievalResults\"], 1):\n", @@ -698,14 +868,18 @@ " print(f\"Chunk {num} Location: \", chunk[\"location\"], end=\"\\n\" * 2)\n", " print(f\"Chunk {num} length: \", chunk[\"location\"], end=\"\\n\" * 2)\n", " print(f\"Chunk {num} Score: \", chunk[\"score\"], end=\"\\n\" * 2)\n", - " print(f\"Chunk {num} Metadata: \", chunk[\"metadata\"], end=\"\\n\" * 2)\n", - "\n", - "\n", + " #print(f\"Chunk {num} Metadata: \", chunk[\"metadata\"], end=\"\\n\" * 2)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8dec69f1-826c-4dd1-b812-c5ded0b55d05", + "metadata": {}, + "outputs": [], + "source": [ "query1 = \"what is AWS annual revenue increase\"\n", "\n", - "query2 = \"what is iphone sales in 2018?\"\n", - "\n", - "bedrock_agent_runtime_client = boto3.client(\"bedrock-agent-runtime\")\n", "response_ret = bedrock_agent_runtime_client.retrieve(\n", " knowledgeBaseId=kb_id,\n", " nextToken=\"string\",\n", @@ -716,9 +890,19 @@ " },\n", " retrievalQuery={\"text\": query1},\n", ")\n", - "print(\"Response shpould come from semantic chunked document:\")\n", - "response_print(response_ret)\n", + "print(\"Response should come from semantic chunked document: to verify let us check data source uri\")\n", + "response_print(response_ret)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5cdcc7e2-0238-452c-aa3c-2e1ea06e5499", + "metadata": {}, + "outputs": [], + "source": [ "\n", + "query2 = \"what is iphone sales in 2018?\"\n", "response_ret2 = bedrock_agent_runtime_client.retrieve(\n", " knowledgeBaseId=kb_id,\n", " nextToken=\"string\",\n", @@ -729,7 +913,7 @@ " },\n", " retrievalQuery={\"text\": query2},\n", ")\n", - "print(\"Response shpould come from hierarchical chunked document:\")\n", + "print(\"Response should come from hierarchical chunked document:\")\n", "response_print(response_ret2)" ] }, @@ -744,7 +928,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 78, "id": "e67d50d0-8963-40d6-90be-be4c654c015f", "metadata": {}, "outputs": [], From 4ac268098f8db326d12e588d302f6246c31340a6 Mon Sep 17 00:00:00 2001 From: Mustapha Tawbi Date: Fri, 24 Jan 2025 11:55:20 +0400 Subject: [PATCH 5/5] =?UTF-8?q?=E2=80=98chunking=5Frecommender=E2=80=99?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../ChunkingRecommender.ipynb | 204 +++++++++--------- 1 file changed, 107 insertions(+), 97 deletions(-) diff --git a/rag/knowledge-bases/features-examples/08-chunking-recommender/ChunkingRecommender.ipynb b/rag/knowledge-bases/features-examples/08-chunking-recommender/ChunkingRecommender.ipynb index 5e09561f..5a14a5d4 100644 --- a/rag/knowledge-bases/features-examples/08-chunking-recommender/ChunkingRecommender.ipynb +++ b/rag/knowledge-bases/features-examples/08-chunking-recommender/ChunkingRecommender.ipynb @@ -107,10 +107,18 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 80, "id": "25239d0e-972d-4fff-b200-f20c39714a9e", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "AmazonBedrockExecutionRoleForKnowledgeBase_758\n" + ] + } + ], "source": [ "import boto3\n", "import json\n", @@ -128,7 +136,7 @@ "# create a folder data if not yet done and\n", "path = \"data\"\n", "\n", - "kb_id = \"XXXXX\" # Retrieve KB First # update value here with your KB ID\n", + "kb_id = \"XXXX\" # Retrieve KB First # update value here 
with your KB ID\n", "kb = bedrock_agent_client.get_knowledge_base(knowledgeBaseId=kb_id)\n", "# get bedrock_kb_execution ID - this role should have been create from notebook creating KB\n", "bedrock_kb_execution_role_arn = kb[\"knowledgeBase\"][\"roleArn\"]\n", @@ -151,7 +159,7 @@ }, { "cell_type": "code", - "execution_count": 55, + "execution_count": 81, "id": "7578f6f2-1cd2-4683-b150-1b6900ff77ee", "metadata": {}, "outputs": [], @@ -231,7 +239,7 @@ }, { "cell_type": "code", - "execution_count": 56, + "execution_count": 82, "id": "0ee11cad-ea50-4e87-8bd2-b8f159bf4d91", "metadata": {}, "outputs": [], @@ -263,7 +271,7 @@ }, { "cell_type": "code", - "execution_count": 57, + "execution_count": 83, "id": "217702d0-42bb-4da0-b14b-37b4d0c0b503", "metadata": {}, "outputs": [], @@ -404,7 +412,7 @@ }, { "cell_type": "code", - "execution_count": 65, + "execution_count": 86, "id": "c5801051-5411-4659-a303-c06aed74af04", "metadata": {}, "outputs": [], @@ -487,7 +495,7 @@ }, { "cell_type": "code", - "execution_count": 66, + "execution_count": 87, "id": "a016a3c8-6759-40fe-a997-91357a3f48e9", "metadata": {}, "outputs": [], @@ -564,7 +572,7 @@ }, { "cell_type": "code", - "execution_count": 69, + "execution_count": 88, "id": "90628e2d-913d-4290-8d40-78dacd3d0e9b", "metadata": {}, "outputs": [], @@ -616,7 +624,7 @@ }, { "cell_type": "code", - "execution_count": 70, + "execution_count": 90, "id": "ee575c20-388b-4119-bd0d-080a71a5cbd0", "metadata": {}, "outputs": [ @@ -626,67 +634,67 @@ "text": [ "['.ipynb_checkpoints', 'AMZN-2022-Shareholder-Letter.pdf', 'Q3FY18ConsolidatedFinancialStatements.pdf']\n", "I am now analyzing the file: AMZN-2022-Shareholder-Letter.pdf\n", - "my recommendation is: Let me analyze the document according to your criteria:\n", + "my recommendation is: Let me analyze the document according to your requirements:\n", "\n", "1. Content Analysis:\n", "- This is Amazon's 2022 Shareholder Letter plus the 1997 letter\n", - "- Contains business updates, strategy, financial information, and future outlook\n", - "- Includes both narrative and some structured sections\n", + "- Contains formal business communication with clear sections\n", + "- Mix of financial data, strategic information, and company updates\n", "\n", "2. Structure Analysis:\n", - "- Clear sections with topics and subtopics\n", - "- Natural thematic breaks between different business areas\n", - "- Contains both narrative paragraphs and bullet-pointed lists\n", - "- Logical flow from one topic to another\n", + "- Clear hierarchical organization\n", + "- Main sections with subtopics\n", + "- Natural paragraph breaks\n", + "- Distinct thematic segments\n", "\n", "3. Text Format:\n", - "- Mixed format with paragraphs, lists, and quotes\n", - "- Consistent formatting within sections\n", - "- Clear section breaks and topic transitions\n", - "- Some numerical data and statistics embedded\n", + "- Professional business letter format\n", + "- Consistent paragraph structure\n", + "- Mixed content types (narrative, numerical, strategic)\n", + "- Contains lists and bullet points\n", "\n", "4. Document Length:\n", - "- Multiple pages (10 pages approximately)\n", - "- Substantial content per page\n", - "- Medium to long paragraphs\n", - "- Varied section lengths\n", + "- Multiple pages (10 pages)\n", + "- Substantial content length\n", + "- Well-distributed content across pages\n", "\n", "5. 
Hierarchical/Semantic Elements:\n", - "- Clear topic hierarchy present\n", - "- Strong semantic relationships between sections\n", - "- Natural topical boundaries\n", - "- Coherent thematic groupings\n", + "- Strong thematic organization\n", + "- Clear topic transitions\n", + "- Natural semantic boundaries between business areas\n", + "- Logical flow between sections\n", "\n", - "6. Formatting/Section Breaks:\n", - "- Well-defined section breaks\n", - "- Clear paragraph separation\n", - "- Distinct topic transitions\n", - "- Logical content organization\n", + "6. Formatting/Breaks:\n", + "- Clear section breaks\n", + "- Paragraph spacing\n", + "- Topic-based divisions\n", + "- Natural content groupings\n", "\n", "Strategy Preference Ratio:\n", - "SEMANTIC: 85%\n", "HIERARCHICAL: 70%\n", - "FIXED_SIZE: 40%\n", - "NONE: 5%\n", + "SEMANTIC: 20%\n", + "FIXED_SIZE: 8%\n", + "NONE: 2%\n", "\n", - "Recommended Strategy: SEMANTIC\n", + "RECOMMENDED STRATEGY: HIERARCHICAL\n", "\n", "Reasoning:\n", - "A SEMANTIC chunking strategy would be most effective because:\n", - "1. The document has strong natural semantic relationships\n", - "2. Content is organized by distinct business topics and themes\n", - "3. Information flow follows logical topic boundaries\n", - "4. Related concepts and ideas are grouped together\n", - "5. The narrative structure maintains context within topical sections\n", + "The HIERARCHICAL chunking strategy is most appropriate because:\n", + "1. The document has clear hierarchical organization\n", + "2. Natural section breaks exist\n", + "3. Content is logically structured by topics\n", + "4. Information flow follows a clear hierarchy\n", + "5. Business sections are well-defined\n", + "6. Maintains context within related sections\n", + "7. Preserves the relationship between main topics and subtopics\n", "\n", - "This approach would allow:\n", - "- Preservation of context within related content\n", - "- More meaningful chunks for LLM processing\n", - "- Better maintenance of topical relationships\n", - "- More natural Q&A capabilities\n", - "- Improved information retrieval accuracy\n", - "\n", - "The semantic chunking would help maintain the integrity of related concepts while creating meaningful, context-aware segments that an LLM can process effectively.\n" + "This strategy will:\n", + "- Keep related information together\n", + "- Maintain context within sections\n", + "- Preserve the logical flow of information\n", + "- Enable better question-answering\n", + "- Support more accurate information retrieval\n", + "- Respect the natural structure of the document\n" ] }, { @@ -714,46 +722,45 @@ "name": "stdout", "output_type": "stream", "text": [ - "my recommendation is: I'll analyze the document following your structure and then provide a chunking strategy recommendation.\n", + "my recommendation is: I'll analyze the document according to your criteria and then make a recommendation.\n", "\n", - "1. Document Content Analysis:\n", + "1. Content Analysis:\n", "- Financial statements from Apple Inc.\n", - "- Contains 3 major financial reports:\n", - " * Consolidated Statements of Operations\n", - " * Consolidated Balance Sheets\n", - " * Consolidated Statements of Cash Flows\n", + "- Contains 3 main documents: Income Statement, Balance Sheet, and Cash Flow Statement\n", + "- Highly structured numerical data with clear hierarchical organization\n", + "- Contains headers, sub-headers, and detailed line items\n", + "- Tabular format with columns for different time periods\n", "\n", "2. 
Structure Analysis:\n", - "- Each page represents a distinct financial statement\n", - "- Highly structured tabular data\n", - "- Clear hierarchical organization within each statement\n", - "- Consistent formatting across documents\n", + "- Clear document sections with distinct titles\n", + "- Hierarchical organization (Main statements → Categories → Line items)\n", + "- Consistent indentation patterns\n", + "- Well-defined parent-child relationships in financial data\n", "\n", - "3. Text Format:\n", - "- Numerical data in columns\n", - "- Section headers and subheaders\n", - "- Consistent indentation for subcategories\n", - "- Mixed content (text and numbers)\n", + "3. Format Analysis:\n", + "- Tabular data with aligned columns\n", + "- Consistent spacing and indentation\n", + "- Clear section breaks between statements\n", + "- Standardized number formatting\n", + "- Footnotes and annotations\n", "\n", - "4. Document Length:\n", - "- 3 pages total\n", - "- Each page contains a complete financial statement\n", - "- Moderate length per page\n", - "- Natural breaks between statements\n", + "4. Length:\n", + "- Three pages of financial statements\n", + "- Moderate document length\n", + "- Dense with numerical data\n", + "- Consistent formatting throughout\n", "\n", - "5. Hierarchical Elements & Semantic Relationships:\n", - "- Strong hierarchical organization:\n", - " * Main categories (Assets, Liabilities, etc.)\n", - " * Subcategories (Current assets, Long-term assets, etc.)\n", - " * Line items under each category\n", - "- Clear semantic relationships between financial items\n", - "- Parent-child relationships in data structure\n", + "5. Hierarchical/Semantic Elements:\n", + "- Strong natural hierarchy in financial statements\n", + "- Clear parent-child relationships\n", + "- Logical grouping of related items\n", + "- Natural semantic connections between financial concepts\n", "\n", - "6. Formatting and Section Breaks:\n", - "- Clear section demarcation\n", - "- Consistent indentation levels\n", - "- Natural breaks between major categories\n", - "- Well-defined table structure\n", + "6. Formatting/Section Breaks:\n", + "- Clear separation between major statements\n", + "- Consistent sub-section formatting\n", + "- Well-defined category breaks\n", + "- Standardized presentation\n", "\n", "RECOMMENDATION: HIERARCHICAL\n", "\n", @@ -763,20 +770,23 @@ "- FIXED_SIZE: 8%\n", "- NONE: 2%\n", "\n", - "Reasoning for HIERARCHICAL recommendation:\n", - "1. The content has natural hierarchical structure in financial statements\n", - "2. Each statement has clear parent-child relationships\n", - "3. The data is organized in logical nested categories\n", - "4. Maintaining hierarchical relationships is crucial for financial data interpretation\n", - "5. This approach would preserve the contextual relationship between headers and their associated data\n", - "6. Would allow for more meaningful query responses by maintaining the financial statement structure\n", + "Reasoning for HIERARCHICAL:\n", + "1. Financial statements have natural hierarchical structure\n", + "2. Clear parent-child relationships in data\n", + "3. Logical grouping of information\n", + "4. Maintains context within financial statement sections\n", + "5. Preserves relationships between numbers and their categories\n", + "6. Allows for meaningful chunking based on statement sections and sub-sections\n", + "7. Helps maintain calculation integrity\n", + "8. 
Keeps related financial concepts together\n", "\n", - "The hierarchical chunking would allow:\n", - "- Chunks based on major financial statement sections\n", - "- Preservation of header-data relationships\n", - "- Maintenance of financial context\n", - "- Better handling of nested categories\n", - "- More accurate responses to financial queries\n" + "This will allow the LLM to:\n", + "- Maintain the integrity of financial statements\n", + "- Keep related items together\n", + "- Preserve calculation relationships\n", + "- Respect the natural structure of financial data\n", + "- Enable better question-answering about specific sections\n", + "- Maintain context when processing financial information\n" ] } ],