diff --git a/rag/open-source/chunking/chunking_advisor/ChunkingAdvisor.ipynb b/rag/open-source/chunking/chunking_advisor/ChunkingAdvisor.ipynb deleted file mode 100644 index b5b361a4..00000000 --- a/rag/open-source/chunking/chunking_advisor/ChunkingAdvisor.ipynb +++ /dev/null @@ -1,903 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "90c9d089-3132-48cf-88f9-99f7b20227a7", - "metadata": {}, - "source": [ - "## Important note: Pre-requisite\n", - "\n", - "\n", - "#### This notebook requires an existing Knowledge Base on Amazon Bedrock. To create a Knowledge Base, you can execute the code provided in: https://github.com/aws-samples/amazon-bedrock-workshop/blob/main/02_KnowledgeBases_and_RAG/0_create_ingest_documents_test_kb.ipynb" - ] - }, - { - "cell_type": "markdown", - "id": "d4a64fff-2c1e-4f34-9c7f-0c8180bb039a", - "metadata": {}, - "source": [ - "# Automated Document Processing Pipeline: Using Foundation Models for Smart Document Chunking and Knowledge Base Integration\n", - "\n", - "This notebook demonstrates an end-to-end automated solution for intelligent document processing using Foundation Models (FMs). The pipeline performs three key functions:\n", - "\n", - "\n", - "#### Notebook Walkthrough\n", - "The pipeline streamlines the entire process from document analysis to knowledge base population, making it efficient to prepare documents for advanced query and retrieval operations.\n", - "\n", - "##### 1. Document Structure Analysis: The FM automatically analyzes document structure and content to determine the optimal chunking strategy\n", - "##### 2. Configuration Generation: Creates customized chunking parameters based on the analysis\n", - "##### 3. Knowledge Base Integration: Processes and loads the chunked documents into a Bedrock Knowledge Base powered by OpenSearch Serverless\n", - "\n", - "\n", - "\n", - "#### Steps:\n", - "\n", - "1. Setup Access and Permissions:\n", - "    Create the Amazon Bedrock Knowledge Base execution role\n", - "    Configure the necessary IAM policies for:\n", - "        S3 data access\n", - "        OpenSearch Serverless writing permissions\n", - "    Reference template available in the \"0_create_ingest_documents_test_kb\" notebook\n", - "\n", - "2. Document Analysis Using Claude Sonnet:\n", - "    Process files within the target folder. For each document, Claude analyzes and recommends:\n", - "        Optimal chunking strategy (FIXED_SIZE/NONE/HIERARCHICAL/SEMANTIC)\n", - "        Specific configuration parameters\n", - "        Custom processing requirements\n", - "\n", - "3. Data Preparation and Storage:\n", - "    Upload analyzed files to the designated S3 buckets\n", - "    Configure the buckets as data sources for the Bedrock KB DataStore\n", - "\n", - "4. Ingestion Process:\n", - "    Initiate an ingestion job via the Knowledge Base APIs\n", - "5. Validation:\n", - "    Test ingestion completion\n", - "    Verify data accessibility and accuracy\n", - "\n", - "#### Pre-requisites\n", - "\n", - "This notebook requires permissions to:\n", - "\n", - "1. Create and delete AWS IAM roles\n", - "2. Create, update and delete Amazon S3 buckets\n", - "3. Access Amazon Bedrock\n", - "4. Grant the Bedrock execution role access to the S3 buckets (3 buckets)\n", - "5. Access Amazon OpenSearch Serverless\n", - "\n", - "If running on SageMaker Studio, you should add the following managed policies to your role:\n", - "\n", - "- IAMFullAccess\n", - "- AWSLambda_FullAccess\n", - "- AmazonS3FullAccess\n" - ] - },
- { - "cell_type": "markdown", - "id": "d066b04e-bc6f-42e6-8836-817d2e0854b1", - "metadata": {}, - "source": [ - "### Install required libraries" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4f36a15a-6cc3-464a-a2f1-fcb2d45ea7b9", - "metadata": {}, - "outputs": [], - "source": [ - "%pip install --force-reinstall -q -r ./requirements.txt" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5f694d1b-c537-4bf2-a5b1-d4f5c7025620", - "metadata": {}, - "outputs": [], - "source": [ - "# restart the kernel so the freshly installed packages are picked up\n", - "from IPython.core.display import HTML\n", - "HTML(\"<script>Jupyter.notebook.kernel.restart()</script>\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "25239d0e-972d-4fff-b200-f20c39714a9e", - "metadata": {}, - "outputs": [], - "source": [ - "import boto3\n", - "import json\n", - "\n", - "# create a boto3 session to dynamically get and set the region name\n", - "session = boto3.Session()\n", - "\n", - "AWS_REGION = session.region_name\n", - "bedrock = boto3.client('bedrock-runtime', region_name=AWS_REGION)\n", - "bedrock_agent_client = session.client('bedrock-agent', region_name=AWS_REGION)\n", - "\n", - "MODEL_NAME = \"anthropic.claude-3-5-sonnet-20241022-v2:0\"\n", - "datasources = []\n", - "# local folder that holds the documents to analyze (created below if missing)\n", - "path = \"data\"\n", - "# To get the knowledgeBaseId, look in the console under Amazon Bedrock > Knowledge Bases\n", - "# This ID should have been created by the first notebook:\n", - "# https://github.com/aws-samples/amazon-bedrock-workshop/blob/main/02_KnowledgeBases_and_RAG/0_create_ingest_documents_test_kb.ipynb\n", - "\n", - "kb_id = \"XXXXX\" # Retrieve the KB first\n", - "kb = bedrock_agent_client.get_knowledge_base(knowledgeBaseId=kb_id)\n", - "print(kb)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f02e1241-fa8d-4dc2-a298-959a61ae2665", - "metadata": {}, - "outputs": [], - "source": [ - "bedrock_kb_execution_role_arn = kb['knowledgeBase']['roleArn']\n", - "bedrock_kb_execution_role = bedrock_kb_execution_role_arn.split('/')[-1]\n", - "print(bedrock_kb_execution_role_arn)\n", - "print(bedrock_kb_execution_role)" - ] - }, - { - "cell_type": "markdown", - "id": "b9291ec4-e2dc-47c1-950b-9fa7e737bee3", - "metadata": {}, - "source": [ - "### Supporting functions\n", - "##### createbucket: Checks if an S3 bucket exists and creates it if it doesn't.\n", - "##### upload_file: Uploads a file to an S3 bucket.\n", - "##### listfile: Lists all files in a specified directory.\n", - "##### delete_bucket_and_objects: Deletes an S3 bucket and all objects it contains." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7578f6f2-1cd2-4683-b150-1b6900ff77ee", - "metadata": {}, - "outputs": [], - "source": [ - "import logging\n", - "import boto3\n", - "from botocore.exceptions import ClientError\n", - "import os\n", - "\n", - "def createbucket(bucketname):\n", - "    \"\"\"\n", - "    Checks if an S3 bucket exists and creates it if it doesn't.\n", - "    Args:\n", - "        bucketname (str): Name of the S3 bucket to create (must be globally unique)\n", - "    Raises:\n", - "        ClientError: If there's an error accessing or creating the bucket\n", - "    \"\"\"\n", - "    try:\n", - "        s3_client = boto3.client('s3')\n", - "        s3_client.head_bucket(Bucket=bucketname)\n", - "        print(f'Bucket {bucketname} exists')\n", - "    except ClientError:\n", - "        print(f'Creating bucket {bucketname}')\n", - "        if AWS_REGION == \"us-east-1\":\n", - "            s3bucket = s3_client.create_bucket(\n", - "                Bucket=bucketname)\n", - "        else:\n", - "            s3bucket = s3_client.create_bucket(\n", - "                Bucket=bucketname,\n", - "                CreateBucketConfiguration={ 'LocationConstraint': AWS_REGION }\n", - "            )\n", - "\n", - "def upload_file(file_name, bucket, object_name=None):\n", - "    \"\"\"Upload a file to an S3 bucket\n", - "\n", - "    :param file_name: File to upload\n", - "    :param bucket: Bucket to upload to\n", - "    :param object_name: S3 object name. If not specified then file_name is used\n", - "    :return: True if file was uploaded, else False\n", - "    \"\"\"\n", - "\n", - "    # If S3 object_name was not specified, use file_name\n", - "    if object_name is None:\n", - "        object_name = os.path.basename(file_name)\n", - "\n", - "    # Upload the file\n", - "    s3_client = boto3.client('s3')\n", - "    try:\n", - "        response = s3_client.upload_file(file_name, bucket, object_name)\n", - "    except ClientError as e:\n", - "        logging.error(e)\n", - "        return False\n", - "    return True\n", - "\n", - "def listfile(folder):\n", - "    \"\"\"\n", - "    List all files in a specified directory.\n", - "    \n", - "    Args:\n", - "        folder (str): Path to the directory to list files from\n", - "    \n", - "    Returns:\n", - "        list: A list of filenames in the specified directory\n", - "    \"\"\"\n", - "    dir_list = os.listdir(folder)\n", - "    return dir_list\n", - "\n", - "def delete_bucket_and_objects(bucket_name):\n", - "    \"\"\"Delete all objects in an S3 bucket, then delete the bucket itself.\"\"\"\n", - "    # Create an S3 resource\n", - "    s3 = boto3.resource('s3')\n", - "    bucket = s3.Bucket(bucket_name)\n", - "    bucket.objects.all().delete()\n", - "    # Delete the bucket itself\n", - "    bucket.delete()" - ] - },
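- { - "cell_type": "markdown", - "id": "9f0e1d2c-0001-4abc-8def-000000000001", - "metadata": {}, - "source": [ - "##### Optional sanity check: the cell below is a minimal, illustrative sketch of the helpers above. The demo bucket name is a hypothetical placeholder and is not used anywhere else in this notebook." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9f0e1d2c-0001-4abc-8def-000000000002", - "metadata": {}, - "outputs": [], - "source": [ - "# Illustrative sketch only: create a throwaway bucket, show that createbucket\n", - "# is idempotent (the second call detects the bucket and skips creation),\n", - "# then clean it up with delete_bucket_and_objects.\n", - "import uuid\n", - "demo_bucket = 'kb-dataset-bucket-demo-' + uuid.uuid4().hex[:8]  # hypothetical name\n", - "createbucket(demo_bucket)   # first call: creates the bucket\n", - "createbucket(demo_bucket)   # second call: prints 'Bucket ... exists'\n", - "delete_bucket_and_objects(demo_bucket)  # remove the demo bucket again" - ] - },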
\n", - " Args:\n", - " bucket_name (str): Name of the S3 bucket to create (must be globally unique)\n", - " Raises:\n", - " ClientError: If there's an error accessing or creating the bucket\n", - " \"\"\"\n", - " try:\n", - " s3_client = boto3.client('s3')\n", - " s3_client.head_bucket(Bucket=bucketname)\n", - " print(f'Bucket {bucketname} Exists')\n", - " except ClientError as e:\n", - " print(f'Creating bucket {bucketname}')\n", - " if AWS_REGION == \"us-east-1\":\n", - " s3bucket = s3_client.create_bucket(\n", - " Bucket=bucketname)\n", - " else:\n", - " s3bucket = s3_client.create_bucket(\n", - " Bucket=bucketname,\n", - " CreateBucketConfiguration={ 'LocationConstraint': AWS_REGION }\n", - " )\n", - "\n", - "def upload_file(file_name, bucket, object_name=None):\n", - " \"\"\"Upload a file to an S3 bucket\n", - "\n", - " :param file_name: File to upload\n", - " :param bucket: Bucket to upload to\n", - " :param object_name: S3 object name. If not specified then file_name is used\n", - " :return: True if file was uploaded, else False\n", - " \"\"\"\n", - "\n", - " # If S3 object_name was not specified, use file_name\n", - " if object_name is None:\n", - " object_name = os.path.basename(file_name)\n", - "\n", - " # Upload the file\n", - " s3_client = boto3.client('s3')\n", - " try:\n", - " response = s3_client.upload_file(file_name, bucket, object_name)\n", - " except ClientError as e:\n", - " logging.error(e)\n", - " return False\n", - " return True\n", - "\n", - "def listfile (folder):\n", - " \"\"\"\n", - " List all files in a specified directory.\n", - " \n", - " Args:\n", - " folder (str): Path to the directory to list files from\n", - " \n", - " Returns:\n", - " list: A list of filenames in the specified directory\n", - " \"\"\"\n", - " dir_list = os.listdir(folder)\n", - " return dir_list\n", - "import boto3\n", - "from botocore.exceptions import ClientError\n", - "import logging\n", - "\n", - "def delete_bucket_and_objects(bucket_name):\n", - " # Create an S3 client\n", - " s3_client = boto3.client('s3')\n", - " # Create an S3 resource\n", - " s3 = boto3.resource('s3')\n", - " bucket = s3.Bucket(bucket_name)\n", - " bucket.objects.all().delete()\n", - " # Delete the bucket itself\n", - " bucket.delete()" - ] - }, - { - "cell_type": "markdown", - "id": "c549427c-9d3d-485c-a542-93ef49b540fe", - "metadata": {}, - "source": [ - "#### Download and prepare dataset\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "217702d0-42bb-4da0-b14b-37b4d0c0b503", - "metadata": {}, - "outputs": [], - "source": [ - "# if not yet created, create folder already\n", - "#!mkdir -p ./data\n", - "\n", - "from urllib.request import urlretrieve\n", - "urls = [\n", - " 'https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/2022-Shareholder-Letter.pdf',\n", - " 'https://s2.q4cdn.com/299287126/files/doc_financials/2022/ar/2021-Shareholder-Letter.pdf',\n", - " 'https://www.apple.com/newsroom/pdfs/Q3FY18ConsolidatedFinancialStatements.pdf'\n", - "]\n", - "\n", - "filenames = [\n", - " 'AMZN-2022-Shareholder-Letter.pdf',\n", - " 'AMZN-2021-Shareholder-Letter.pdf',\n", - " 'Q3FY18ConsolidatedFinancialStatements.pdf'\n", - "]\n", - "\n", - "data_root = \"./data/\"\n", - "\n", - "for idx, url in enumerate(urls):\n", - " file_path = data_root + filenames[idx]\n", - " urlretrieve(url, file_path)" - ] - }, - { - "cell_type": "markdown", - "id": "48551cb4-02a8-4426-88aa-089516a4c1d9", - "metadata": {}, - "source": [ - "#### Create 3 S3 buckets for 3 data sources \n", - "##### Check if 
- { - "cell_type": "markdown", - "id": "48551cb4-02a8-4426-88aa-089516a4c1d9", - "metadata": {}, - "source": [ - "#### Create 3 S3 buckets for 3 data sources\n", - "##### Check if each bucket exists, and if not, create one S3 bucket per knowledge base data source. Each bucket will be used to load files chunked with the corresponding strategy: SEMANTIC, FIXED_SIZE, HIERARCHICAL." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "67a84784-fbfe-4ba5-88a5-d2e2a382bfac", - "metadata": {}, - "outputs": [], - "source": [ - "import random\n", - "suffix = random.randrange(200, 900)\n", - "s3_client = boto3.client('s3')\n", - "# bucket names must be globally unique, hence the random suffix\n", - "bucket_name_semantic = 'kb-dataset-bucket-semantic-' + str(suffix)\n", - "bucket_name_fixed = 'kb-dataset-bucket-fixed-' + str(suffix)\n", - "bucket_name_hierarchical = 'kb-dataset-bucket-hierarchical-' + str(suffix)\n", - "s3_policy_name = 'AmazonBedrockS3PolicyForKnowledgeBase_' + str(suffix)\n", - "\n", - "createbucket(bucket_name_semantic)\n", - "createbucket(bucket_name_fixed)\n", - "createbucket(bucket_name_hierarchical)\n" - ] - }, - { - "cell_type": "markdown", - "id": "2a54bc11-8de7-4a67-b8ed-0ba2944fb9e4", - "metadata": {}, - "source": [ - "##### Create a read-and-list S3 policy on the 3 buckets and attach it to the existing Bedrock role \"bedrock_kb_execution_role\"\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6230aaf0-2aa6-429e-9824-0ef2f9b12579", - "metadata": {}, - "outputs": [], - "source": [ - "account_number = boto3.client('sts').get_caller_identity().get('Account')\n", - "iam_client = session.client('iam')\n", - "s3_policy_document = {\n", - "    \"Version\": \"2012-10-17\",\n", - "    \"Statement\": [\n", - "        {\n", - "            \"Effect\": \"Allow\",\n", - "            \"Action\": [\n", - "                \"s3:GetObject\",\n", - "                \"s3:ListBucket\"\n", - "            ],\n", - "            \"Resource\": [\n", - "                f\"arn:aws:s3:::{bucket_name_semantic}\",\n", - "                f\"arn:aws:s3:::{bucket_name_semantic}/*\",\n", - "                f\"arn:aws:s3:::{bucket_name_fixed}\",\n", - "                f\"arn:aws:s3:::{bucket_name_fixed}/*\",\n", - "                f\"arn:aws:s3:::{bucket_name_hierarchical}\",\n", - "                f\"arn:aws:s3:::{bucket_name_hierarchical}/*\"\n", - "            ],\n", - "            \"Condition\": {\n", - "                \"StringEquals\": {\n", - "                    \"aws:ResourceAccount\": f\"{account_number}\"\n", - "                }\n", - "            }\n", - "        }\n", - "    ]\n", - "}\n", - "s3_policy = iam_client.create_policy(\n", - "    PolicyName=s3_policy_name,\n", - "    PolicyDocument=json.dumps(s3_policy_document),\n", - "    Description='Policy for reading documents from s3')\n", - "\n", - "# fetch the ARN of this policy and attach it to the KB execution role\n", - "s3_policy_arn = s3_policy[\"Policy\"][\"Arn\"]\n", - "iam_client.attach_role_policy(\n", - "    RoleName=bedrock_kb_execution_role,\n", - "    PolicyArn=s3_policy_arn\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "55725547-428a-4568-8b4c-c7a8abab6e3d", - "metadata": {}, - "source": [ - "##### Sends a prompt to Claude Sonnet via Amazon Bedrock and returns the generated response.\n", - "    " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0ee11cad-ea50-4e87-8bd2-b8f159bf4d91", - "metadata": {}, - "outputs": [], - "source": [ - "def get_completion(prompt):\n", - "    \"\"\"\n", - "    Sends a prompt to Claude Sonnet via Amazon Bedrock and returns the generated response.\n", - "    Args:\n", - "        prompt (str): The input text prompt to send to the AI model.\n", - "\n", - "    Returns:\n", - "        str: The generated text response\n", - "    \"\"\"\n", - "\n", - "    body = json.dumps(\n", - "        {\n", - "            \"anthropic_version\": \"bedrock-2023-05-31\",\n", - "            \"max_tokens\": 2000,\n", - "            \"messages\": [{\"role\": \"user\", \"content\": prompt}],\n", - "            \"temperature\": 0.0,\n", - "            \"top_p\": 1,\n", - "            \"system\": ''\n", - "        }\n", - "    )\n", - "    response = bedrock.invoke_model(body=body, modelId=MODEL_NAME)\n", - "    response_body = json.loads(response.get('body').read())\n", - "    return response_body.get('content')[0].get('text')" - ] - },
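- { - "cell_type": "markdown", - "id": "9f0e1d2c-0003-4abc-8def-000000000001", - "metadata": {}, - "source": [ - "##### Quick smoke test for get_completion (illustrative; the prompt below is arbitrary and not part of the pipeline)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9f0e1d2c-0003-4abc-8def-000000000002", - "metadata": {}, - "outputs": [], - "source": [ - "# one-off call to confirm Bedrock access and the model ID are working\n", - "print(get_completion(\"In one short sentence, what is document chunking?\"))" - ] - },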
\"content\": prompt}],\n", - " \"temperature\": 0.0,\n", - " \"top_p\": 1,\n", - " \"system\": ''\n", - " }\n", - " )\n", - " response = bedrock.invoke_model(body=body, modelId=MODEL_NAME)\n", - " response_body = json.loads(response.get('body').read())\n", - " return response_body.get('content')[0].get('text')" - ] - }, - { - "cell_type": "markdown", - "id": "339eb2ae-e825-435f-b77b-0524144f081c", - "metadata": {}, - "source": [ - " #### Chunkingadvise function\n", - " ##### Analyzes a PDF document and recommends optimal LLM chunking strategy with parameters. This function loads a PDF file, analyzes its content using LLM, and provides recommendations for chunking strategy (FIXED_SIZE, NONE, HIERARCHICAL, or SEMANTIC) along with specific configuration parameters.\n", - " - Args:\n", - " - file (str): Name of the PDF file located in the 'data' directory\n", - " - Returns:\n", - " - dict: JSON containing recommended chunking strategy and parameters:\n", - " For HIERARCHICAL:\n", - " - Recommend only one Strategy\n", - " - Maximum Parent chunk token size\n", - " - Maximum child chunk token size\n", - " - Overlap Tokens\n", - " - Rational\n", - " - For SEMANTIC:\n", - " - Recommend only one Strategy\n", - " - Maximum tokens\n", - " - Buffer size\n", - " - Breakpoint percentile threshold\n", - " - Rational:\n", - " - For FIXED-SIZE\n", - " - overlapPercentage'\n", - " - parsed_data['maxTokens'\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7f99efd3-d16a-411a-9ad7-70522b9e1641", - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c5801051-5411-4659-a303-c06aed74af04", - "metadata": {}, - "outputs": [], - "source": [ - "def Chunkingadvise (file):\n", - " \"\"\"\n", - " Analyzes a PDF document and recommends optimal LLM chunking strategy with parameters.\n", - "\n", - " This function loads a PDF file, analyzes its content using LLM, and provides \n", - " recommendations for chunking strategy (FIXED_SIZE, NONE, HIERARCHICAL, or SEMANTIC)\n", - " along with specific configuration parameters.\n", - " Args:\n", - " file (str): Name of the PDF file located in the 'data' directory\n", - " Returns:\n", - " dict: JSON containing recommended chunking strategy and parameters:\n", - " For HIERARCHICAL:\n", - " - Recommend only one Strategy\n", - " - Maximum Parent chunk token size\n", - " - Maximum child chunk token size\n", - " - Overlap Tokens\n", - " - Rational\n", - " For SEMANTIC:\n", - " - Recommend only one Strategy\n", - " - Maximum tokens\n", - " - Buffer size\n", - " - Breakpoint percentile threshold\n", - " - Rational\n", - " \"\"\"\n", - " my_docs = []\n", - " my_strategies =[]\n", - " strategy=\"\"\n", - " strategytext=\"\"\n", - " path=\"data\"\n", - " strategylist =[]\n", - " metadata = [\n", - " dict(year=2023, source=file)]\n", - " from langchain.document_loaders import PyPDFLoader\n", - " file = path +\"/\"+ file\n", - " loader = PyPDFLoader(file)\n", - " document = loader.load()\n", - " loader = PyPDFLoader(file)\n", - " # print (\"path + file :: \", file)\n", - " document = loader.load()\n", - " # print (\"path + file :: \", document)\n", - " \n", - " prompt = f\"\"\"SYSTEM you are an advisor expert in LLM chunking strategies,\n", - " USER can you analyze the type,content, format, structure and size of {document}. \n", - " Can you advise on best LLM chunking Strategy based on this analysis. 
- "    prompt = f\"\"\" USER based on the recommendation provided in {res},\n", - "    if you recommend HIERARCHICAL chunking then provide recommendations for:\n", - "    Parent: Maximum parent chunk token size.\n", - "    Child: Maximum child chunk token size and Overlap Tokens: Number of overlap tokens between each parent chunk and between each parent and its children.\n", - "    If the recommendation is HIERARCHICAL then provide the response using JSON format\n", - "    with the keys as \\\"Recommend only one Strategy\\\", \\\"Maximum Parent chunk token size\\\", \\\"Maximum child chunk token size\\\", \\\"Overlap Tokens\\\",\n", - "    \\\"Rationale: please explain the rationale for the decision and explain why each other choice is not preferred; keep the rationale to 100 words maximum.\\\". Provide a crisp and clear answer.\n", - "    If you recommend SEMANTIC then provide the response using JSON format with\n", - "    the keys as \\\"Recommend only one Strategy\\\", \\\"Maximum tokens\\\", \\\"Buffer size\\\", \\\"Breakpoint percentile threshold\\\",\n", - "    Buffer size should be less than or equal to 1, Breakpoint percentile threshold should be >= 50,\n", - "    \\\"Rationale: please explain the rationale for the decision and explain why each other choice is not preferred; keep the rationale to 100 words maximum.\\\". Provide a crisp and clear answer;\n", - "    do not provide a recommendation if there are not enough data inputs, and instead say: sorry, I need more data.\n", - "    If you recommend FIXED_SIZE then provide the response using JSON format with\n", - "    the keys as \\\"Recommend only one Strategy\\\", \\\"maxTokens\\\", \\\"overlapPercentage\\\",\n", - "    \\\"Rationale: please explain the rationale for the decision and explain why each other choice is not preferred; keep the rationale to 100 words maximum.\\\".\n",
- "    Provide a crisp and clear answer;\n", - "    do not provide a recommendation if there are not enough data inputs, and instead say: sorry, I need more data.\"\"\"\n", - "    res = get_completion(prompt)\n", - "    print(res)\n", - "    parsed_data = json.loads(res)\n", - "    return parsed_data" - ] - },
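- { - "cell_type": "markdown", - "id": "9f0e1d2c-0004-4abc-8def-000000000001", - "metadata": {}, - "source": [ - "##### Example invocation (illustrative): ask the advisor for a recommendation on one of the PDFs downloaded earlier. The returned dict carries the keys requested in the prompt above." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9f0e1d2c-0004-4abc-8def-000000000002", - "metadata": {}, - "outputs": [], - "source": [ - "# analyze one of the downloaded PDFs and inspect the recommended strategy\n", - "advice = Chunkingadvise('AMZN-2022-Shareholder-Letter.pdf')\n", - "print(advice['Recommend only one Strategy'])" - ] - },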
- { - "cell_type": "markdown", - "id": "579229ec-23b4-4e9f-99db-cb78c7453e4e", - "metadata": {}, - "source": [ - "#### ingestbystrategy: function that configures the chunking strategy parameters for Bedrock Knowledge Base ingestion based on the recommended strategy.\n", - "- Args:\n", - "    - parsed_data (dict): Dictionary containing the chunking strategy recommendation and parameters\n", - "- Returns:\n", - "    - tuple: Contains:\n", - "        - chunking_strategy_config (dict): Configuration for the chosen chunking strategy\n", - "        - bucket_name (str): S3 bucket name for storage\n", - "        - name (str): Knowledge base data source name\n", - "        - description (str): Knowledge base data source description\n", - "        - s3_configuration (dict): S3 configuration with bucket ARN\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a016a3c8-6759-40fe-a997-91357a3f48e9", - "metadata": {}, - "outputs": [], - "source": [ - "def ingestbystrategy(parsed_data):\n", - "    \"\"\"\n", - "    Configures chunking strategy parameters for Bedrock Knowledge Base ingestion based on the recommended strategy.\n", - "\n", - "    Args:\n", - "        parsed_data (dict): Dictionary containing the chunking strategy recommendation and parameters\n", - "\n", - "    Returns:\n", - "        tuple: Contains:\n", - "            - chunking_strategy_config (dict): Configuration for the chosen chunking strategy\n", - "            - bucket_name (str): S3 bucket name for storage\n", - "            - name (str): Knowledge base data source name\n", - "            - description (str): Knowledge base data source description\n", - "            - s3_configuration (dict): S3 configuration with bucket ARN\n", - "\n", - "    Example:\n", - "        >>> strategy_config, bucket, ds_name, desc, s3_config = ingestbystrategy(strategy_data)\n", - "    \"\"\"\n", - "    chunkingStrategyConfiguration = {}\n", - "    strategy = parsed_data['Recommend only one Strategy']\n", - "\n", - "    if strategy == 'HIERARCHICAL':\n", - "        p1 = parsed_data['Maximum Parent chunk token size']\n", - "        p2 = parsed_data['Maximum child chunk token size']\n", - "        p3 = parsed_data['Overlap Tokens']\n", - "        bucket_name = bucket_name_hierarchical\n", - "        name = \"bedrock-sample-knowledge-base-HIERARCHICAL\"\n", - "        description = \"Bedrock Knowledge Base data source for S3 (HIERARCHICAL)\"\n", - "        # HIERARCHICAL chunking\n", - "        chunkingStrategyConfiguration = {\n", - "            \"chunkingStrategy\": \"HIERARCHICAL\",\n", - "            \"hierarchicalChunkingConfiguration\": {\n", - "                'levelConfigurations': [\n", - "                    {\n", - "                        'maxTokens': p1\n", - "                    },\n", - "                    {\n", - "                        'maxTokens': p2\n", - "                    }\n", - "                ],\n", - "                'overlapTokens': p3\n", - "            }\n", - "        }\n", - "\n", - "    # SEMANTIC chunking\n", - "    if strategy == 'SEMANTIC':\n", - "        p3 = parsed_data['Maximum tokens']\n", - "        p2 = int(parsed_data['Buffer size'])\n", - "        p1 = parsed_data['Breakpoint percentile threshold']\n", - "        bucket_name = bucket_name_semantic\n", - "        name = \"bedrock-sample-knowledge-base-SEMANTIC\"\n", - "        description = \"Bedrock Knowledge Base data source for S3 (SEMANTIC)\"\n", - "        chunkingStrategyConfiguration = { \"chunkingStrategy\": \"SEMANTIC\",\n", - "            \"semanticChunkingConfiguration\": {\n", - "                'breakpointPercentileThreshold': p1,\n", - "                'bufferSize': p2,\n", - "                'maxTokens': p3\n", - "            }\n", - "        }\n", - "\n", - "    # FIXED_SIZE chunking (uses fixedSizeChunkingConfiguration, not semanticChunkingConfiguration)\n", - "    if strategy == 'FIXED_SIZE':\n", - "        p2 = int(parsed_data['overlapPercentage'])\n", - "        p1 = int(parsed_data['maxTokens'])\n", - "        bucket_name = bucket_name_fixed\n", - "        name = \"bedrock-sample-knowledge-base-FIXED\"\n", - "        description = \"Bedrock Knowledge Base data source for S3 (FIXED_SIZE)\"\n", - "        chunkingStrategyConfiguration = { \"chunkingStrategy\": \"FIXED_SIZE\",\n", - "            \"fixedSizeChunkingConfiguration\": {\n", - "                \"maxTokens\": p1,\n", - "                \"overlapPercentage\": p2\n", - "            }\n", - "        }\n", - "\n", - "    s3Configuration = {\n", - "        \"bucketArn\": f\"arn:aws:s3:::{bucket_name}\",\n", - "    }\n", - "    return chunkingStrategyConfiguration, bucket_name, name, description, s3Configuration" - ] - },
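- { - "cell_type": "markdown", - "id": "9f0e1d2c-0005-4abc-8def-000000000001", - "metadata": {}, - "source": [ - "##### Illustrative example: feeding a hypothetical SEMANTIC recommendation (hand-written below, not produced by the advisor) through ingestbystrategy yields the matching Bedrock chunking configuration." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9f0e1d2c-0005-4abc-8def-000000000002", - "metadata": {}, - "outputs": [], - "source": [ - "# hypothetical recommendation, shaped like the JSON the advisor prompt requests\n", - "sample_advice = {\n", - "    'Recommend only one Strategy': 'SEMANTIC',\n", - "    'Maximum tokens': 300,\n", - "    'Buffer size': 1,\n", - "    'Breakpoint percentile threshold': 95\n", - "}\n", - "config, bucket, ds_name, desc, s3_conf = ingestbystrategy(sample_advice)\n", - "print(json.dumps(config, indent=2))  # the semanticChunkingConfiguration block" - ] - },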
- { - "cell_type": "markdown", - "id": "4cc202b3-12b9-45e7-a610-e76c28b142ad", - "metadata": {}, - "source": [ - "#### Function to create or retrieve a data source in an Amazon Bedrock Knowledge Base\n", - "\n", - "#### First checks if a data source with the given name exists. If found, returns the existing data source. Otherwise creates a new one with the specified configurations.\n", - "- Args:\n", - "    - name (str): Name of the data source\n", - "    - description (str): Description of the data source\n", - "    - knowledgeBaseId (str): ID of the knowledge base to create the data source in\n", - "    - s3Configuration (dict): S3 bucket configuration for the data source\n", - "    - chunkingStrategyConfiguration (dict): Configuration for the text chunking strategy\n", - "- Returns:\n", - "    - dict: Response containing the data source details from Bedrock" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "90628e2d-913d-4290-8d40-78dacd3d0e9b", - "metadata": {}, - "outputs": [], - "source": [ - "def createDS(name, description, knowledgeBaseId, s3Configuration, chunkingStrategyConfiguration):\n", - "    \"\"\"\n", - "    Creates or retrieves a data source in an Amazon Bedrock Knowledge Base.\n", - "    \n", - "    First checks if a data source with the given name exists. If found, returns the existing\n", - "    data source. Otherwise creates a new one with the specified configurations.\n", - "\n", - "    Args:\n", - "        name (str): Name of the data source\n", - "        description (str): Description of the data source\n", - "        knowledgeBaseId (str): ID of the knowledge base to create the data source in\n", - "        s3Configuration (dict): S3 bucket configuration for the data source\n", - "        chunkingStrategyConfiguration (dict): Configuration for the text chunking strategy\n", - "\n", - "    Returns:\n", - "        dict: Response containing the data source details from Bedrock\n", - "\n", - "    Raises:\n", - "        ClientError: If there's an error accessing or creating the data source\n", - "    \"\"\"\n", - "    response = bedrock_agent_client.list_data_sources(\n", - "        knowledgeBaseId=knowledgeBaseId,\n", - "        maxResults=12\n", - "    )\n", - "    for i in range(len(response[\"dataSourceSummaries\"])):\n", - "        print(response[\"dataSourceSummaries\"][i][\"name\"], \"::\", name)\n", - "        print(response[\"dataSourceSummaries\"][i][\"dataSourceId\"])\n", - "        if response[\"dataSourceSummaries\"][i][\"name\"] == name:\n", - "            # return the matching data source (index i, not i-1)\n", - "            ds = bedrock_agent_client.get_data_source(knowledgeBaseId=knowledgeBaseId, dataSourceId=response[\"dataSourceSummaries\"][i][\"dataSourceId\"])\n", - "            return ds\n", - "    \n", - "    ds = bedrock_agent_client.create_data_source(\n", - "        name=name,\n", - "        description=description,\n", - "        knowledgeBaseId=knowledgeBaseId,\n", - "        dataDeletionPolicy='DELETE',\n", - "        dataSourceConfiguration={\n", - "            # For S3\n", - "            \"type\": \"S3\",\n", - "            \"s3Configuration\": s3Configuration\n", - "            # For Web URL\n", - "            # \"type\": \"WEB\",\n", - "            # \"webConfiguration\": webConfiguration\n", - "        },\n", - "        vectorIngestionConfiguration={\n", - "            \"chunkingConfiguration\": chunkingStrategyConfiguration\n", - "        })\n", - "    \n", - "    return ds" - ] - }, - { - "cell_type": "markdown", - "id": "8e60ea6a-a1dc-47f8-9922-b7c70be750d1", - "metadata": {}, - "source": [ - "### Process PDF files by analyzing content, creating data sources, and uploading to S3.\n", - "\n", - "#### Workflow:\n", - "1. Lists all files in the specified directory\n", - "2. For each PDF:\n", - "    - Analyzes for the optimal chunking strategy\n", - "    - Creates a data source with the recommended configuration\n", - "    - Uploads the file to the appropriate S3 bucket (see the sketch after this list)\n" - ] - },
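- { - "cell_type": "markdown", - "id": "9f0e1d2c-0006-4abc-8def-000000000001", - "metadata": {}, - "source": [ - "##### The original processing-loop code is not part of this diff; the cell below is a minimal, hedged sketch of the workflow described above, assuming the helpers defined earlier (listfile, Chunkingadvise, ingestbystrategy, createDS, upload_file) and the bedrock-agent start_ingestion_job API." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9f0e1d2c-0006-4abc-8def-000000000002", - "metadata": {}, - "outputs": [], - "source": [ - "# Hedged sketch (not the notebook's original code): for each local PDF,\n", - "# get a chunking recommendation, build the matching data source, stage the\n", - "# file in the strategy's bucket, then trigger an ingestion job.\n", - "for f in listfile(path):\n", - "    advice = Chunkingadvise(f)                      # 1. LLM recommends a strategy\n", - "    config, bucket, ds_name, desc, s3_conf = ingestbystrategy(advice)  # 2. map to a KB config\n", - "    ds = createDS(ds_name, desc, kb_id, s3_conf, config)               # 3. create or fetch the data source\n", - "    upload_file(path + '/' + f, bucket)             # 4. stage the file in S3\n", - "    job = bedrock_agent_client.start_ingestion_job( # 5. ingest into the Knowledge Base\n", - "        knowledgeBaseId=kb_id,\n", - "        dataSourceId=ds['dataSource']['dataSourceId'])\n", - "    print(job['ingestionJob']['status'])" - ] - },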