All classes are under active development and subject to non-backward compatible changes or removal in any future version. These are not subject to the Semantic Versioning model. This means that while you may use them, you may need to update your source code when upgrading to a newer version of this package.
Language | Package |
---|---|
TypeScript | @cdklabs/generative-ai-cdk-constructs |
Python | cdklabs.generative_ai_cdk_constructs |
- Overview
- Initializer
- Pattern Construct Props
- Pattern Properties
- Default properties
- Crawler Output
- Architecture
- Web Crawler Agent for Amazon Bedrock
- Cost
- Security
- Supported AWS Regions
- Quotas
- Clean up
The WebCrawler construct provided here simplifies website crawling. It can crawl websites and RSS feeds on a schedule and store changeset data on Amazon S3. The construct uses the Crawlee library to crawl websites.
Crawling will begin shortly after the stack deployment. Please allow up to 15 minutes for it to start the first time.
All job status changes are published to the SNS topic available through the snsTopic property; a subscription sketch follows the deployment examples below.
Here is a minimal deployable pattern definition:
TypeScript
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { CrawlerTargetType, WebCrawler } from '@cdklabs/generative-ai-cdk-constructs';

export class SampleStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    new WebCrawler(this, 'WebCrawler', {
      targets: [
        {
          url: 'https://cloudprice.net/aws/ec2',
          targetType: CrawlerTargetType.WEBSITE,
          crawlIntervalHours: 24,
          maxRequests: 15000,
        },
        {
          url: 'https://aws.amazon.com/about-aws/whats-new/recent/feed',
          targetType: CrawlerTargetType.RSS_FEED,
          crawlIntervalHours: 24,
        },
      ],
    });
  }
}
Python
from aws_cdk import Stack
from constructs import Construct
from cdklabs.generative_ai_cdk_constructs import WebCrawler, CrawlerTargetType

class SampleStack(Stack):
    def __init__(self, scope: Construct, id: str, **kwargs):
        super().__init__(scope, id, **kwargs)

        WebCrawler(self, 'WebCrawler',
            targets=[
                {
                    'url': 'https://cloudprice.net/aws/ec2',
                    'targetType': CrawlerTargetType.WEBSITE,
                    'crawlIntervalHours': 24,
                    'maxRequests': 15000,
                },
                {
                    'url': 'https://aws.amazon.com/about-aws/whats-new/recent/feed',
                    'targetType': CrawlerTargetType.RSS_FEED,
                    'crawlIntervalHours': 24,
                }
            ]
        )
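As noted earlier, job status changes are published to the construct's SNS topic. The following sketch subscribes an address to the topic exposed by the snsTopic property; it assumes the aws-cdk-lib SNS subscriptions module and uses a placeholder e-mail address.

import * as cdk from 'aws-cdk-lib';
import * as subscriptions from 'aws-cdk-lib/aws-sns-subscriptions';
import { Construct } from 'constructs';
import { CrawlerTargetType, WebCrawler } from '@cdklabs/generative-ai-cdk-constructs';

export class NotifiedCrawlerStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const crawler = new WebCrawler(this, 'WebCrawler', {
      targets: [
        {
          url: 'https://aws.amazon.com/about-aws/whats-new/recent/feed',
          targetType: CrawlerTargetType.RSS_FEED,
          crawlIntervalHours: 24,
        },
      ],
    });

    // Receive job status notifications by e-mail (the address is a placeholder).
    crawler.snsTopic.addSubscription(
      new subscriptions.EmailSubscription('ops@example.com'),
    );
  }
}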
new WebCrawler(scope: Construct, id: string, props: WebCrawlerProps)
Parameters
- scope Construct
- id string
- props WebCrawlerProps
Name | Type | Required | Description |
---|---|---|---|
existingVpc | ec2.IVpc | Optional | An existing VPC in which to deploy the construct. |
vpcProps | ec2.VpcProps | Optional | Custom properties for the VPC the construct will create. This VPC is used by the compute resources the construct creates. Providing both this and existingVpc will cause an error. |
existingOutputBucketObj | s3.IBucket | Optional | Existing instance of an S3 Bucket object. Providing both this and bucketOutputProps will cause an error. |
bucketOutputProps | s3.BucketProps | Optional | User-provided props to override the default props for the S3 Bucket. Providing both this and existingOutputBucketObj will cause an error. |
observability | boolean | Optional | Enables observability on all services used. Warning: there are costs associated with the services used; it is a best practice to keep this enabled. Defaults to true. |
stage | string | Optional | Value appended to resource names. |
targets | CrawlerTarget[] | Optional | Target websites and RSS feeds to be crawled. |
enableLambdaCrawler | boolean | Optional | Specifies whether a separate crawler, available as a Lambda function, should be deployed. |
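As a sketch of how several of the optional props above can be combined (the vpcProps and bucketOutputProps values are illustrative placeholders, not recommended settings):

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { CrawlerTargetType, WebCrawler } from '@cdklabs/generative-ai-cdk-constructs';

export class CustomizedCrawlerStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    new WebCrawler(this, 'WebCrawler', {
      // Customize the VPC the construct creates (do not combine with existingVpc).
      vpcProps: { maxAzs: 2, natGateways: 1 },
      // Override defaults of the output bucket (do not combine with existingOutputBucketObj).
      bucketOutputProps: { enforceSSL: true },
      observability: true, // default; set to false to reduce logging-related costs
      stage: 'dev',        // appended to resource names
      targets: [
        {
          url: 'https://aws.amazon.com/about-aws/whats-new/recent/feed',
          targetType: CrawlerTargetType.RSS_FEED,
          crawlIntervalHours: 24,
        },
      ],
    });
  }
}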
Name | Type | Description |
---|---|---|
vpc | ec2.IVpc | The VPC used by the construct (whether created by the construct or provided by the client). |
dataBucket | s3.IBucket | Returns an instance of s3.Bucket created by the construct. |
snsTopic | sns.ITopic | Returns an instance of SNS Topic created by the construct. |
targetsTable | dynamodb.ITable | Returns the instance of Targets DynamoDB table created by the construct. |
jobsTable | dynamodb.ITable | Returns the instance of Jobs DynamoDB table created by the construct. |
jobQueue | batch.IJobQueue | Returns the instance of batch.IJobQueue created by the construct. |
webCrawlerJobDefinition | batch.IJobDefinition | Returns the instance of JobDefinition created by the construct. |
lambdaCrawler | lambda.IFunction | Returns the instance of the lambda function created by the construct if enableLambdaCrawler was set to true in the construct properties. |
lambdaCrawlerApiSchemaPath | string | Returns the path to the OpenAPI specification for the Lambda Crawler. |
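These properties let you wire the construct into the rest of your stack. The sketch below assumes a hypothetical downstream Lambda function that post-processes crawler output; it grants that function access to the output bucket and the jobs table and exports the SNS topic ARN.

import * as cdk from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { Construct } from 'constructs';
import { CrawlerTargetType, WebCrawler } from '@cdklabs/generative-ai-cdk-constructs';

export class CrawlerConsumerStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const crawler = new WebCrawler(this, 'WebCrawler', {
      targets: [
        {
          url: 'https://cloudprice.net/aws/ec2',
          targetType: CrawlerTargetType.WEBSITE,
          crawlIntervalHours: 24,
          maxRequests: 15000,
        },
      ],
    });

    // Hypothetical downstream function that post-processes crawler output.
    const processorFn = new lambda.Function(this, 'Processor', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromInline('exports.handler = async () => {};'),
    });

    // Grant the consumer access to the crawler's output bucket and job metadata table.
    crawler.dataBucket.grantRead(processorFn);
    crawler.jobsTable.grantReadData(processorFn);

    // Expose the SNS topic ARN so other stacks can subscribe to job status changes.
    new cdk.CfnOutput(this, 'CrawlerTopicArn', { value: crawler.snsTopic.topicArn });
  }
}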
Out-of-the-box implementation of the construct without any overrides will set the following defaults:
- Set up a VPC
  - Uses the existing VPC if provided, otherwise creates a new one
  - Sets up VPC endpoints for Amazon S3 and Amazon ECR
  - Sets up VPC flow logs
- Set up an AWS Batch compute environment with a Fargate job definition
  - When an infrastructure update is triggered, any running jobs are allowed to run for up to 30 minutes before being stopped
  - 2 vCPUs
  - 6144 MB as the container memory hard limit
  - Hard limits of 48 hours for container runtime and 100,000 pages crawled
- Set up one Amazon S3 bucket
  - Uses the existing bucket if provided, otherwise creates a new one
  - Sets up an S3 access logging bucket
The S3 output bucket name is available via the dataBucket property. Data is stored on S3 in the following format:
- /{target_key}/files/{file_name}
- /{target_key}/jobs/{job_id}/{data_file}
File Name | Description |
---|---|
crawl_data.jsonl | Raw HTML crawler output |
crawl_files.jsonl | Links to files found on crawled pages |
crawl_errors.jsonl | Crawler errors |
crawl_sitemaps.jsonl | Links found in sitemaps |
pages_changeset.jsonl | Cleaned data in text format from the pages, including information on whether the page was created, updated, deleted or not changed after a previous crawl. |
files_changeset.jsonl | Data for all downloaded files, including the S3 key and information on whether each file was added, changed, or deleted after a previous crawl. |
All files are stored in jsonl format (JSON Lines) and can be processed line by line. Here is a Python snippet.
import json

with open(FILE_NAME, 'r') as file:
    for line in file:
        try:
            json_data = json.loads(line)
            # TODO: Process the JSON data
        except json.JSONDecodeError as e:
            print(f"Error decoding JSON: {e}")
Page records have the following format:

{
  content_type: string;
  url: string;
  base_url?: string;
  file_type?: string;
  canonical?: string;
  meta?: {
    title?: string;
    description?: string;
    keywords?: string;
    author?: string;
  };
  content?: string;
  fingerprint?: string;
  operation?: string;
}
File records have the following format:

{
  url: string;
  file_type: string;
  file_size: number;
  last_modified: string;
  checksum?: string;
  s3_key?: string;
  operation?: string;
}
Operations:
- not_changed
- created
- updated
- deleted
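Downstream consumers typically read these changeset files and act on the operation field. The sketch below (outside of CDK) shows one way to do that with the AWS SDK for JavaScript v3; the bucket name, target key, and job id are placeholder assumptions, and the record shape is a subset of the page record format above.

import { GetObjectCommand, S3Client } from '@aws-sdk/client-s3';

// Minimal shape of a pages_changeset.jsonl record (subset of the page record format above).
interface PageChangesetRecord {
  url: string;
  content?: string;
  operation?: 'not_changed' | 'created' | 'updated' | 'deleted';
}

const s3 = new S3Client({});

// bucket, targetKey, and jobId are placeholders; the key follows the layout described above.
async function processPagesChangeset(bucket: string, targetKey: string, jobId: string): Promise<void> {
  const { Body } = await s3.send(new GetObjectCommand({
    Bucket: bucket,
    Key: `${targetKey}/jobs/${jobId}/pages_changeset.jsonl`,
  }));
  const body = (await Body?.transformToString()) ?? '';

  for (const line of body.split('\n')) {
    if (!line.trim()) {
      continue;
    }
    const record = JSON.parse(line) as PageChangesetRecord;
    switch (record.operation) {
      case 'created':
      case 'updated':
        // TODO: (re)index record.content for record.url
        break;
      case 'deleted':
        // TODO: remove record.url from your downstream index
        break;
      default:
        // 'not_changed': nothing to do
        break;
    }
  }
}

// Example invocation with placeholder values:
// processPagesChangeset('my-crawler-data-bucket', 'example-target-key', 'example-job-id').catch(console.error);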
The WebCrawler construct also provides a Lambda Crawler function that can be used as an action group executor for an Amazon Bedrock agent. Here is an example of an agent capable of accessing the web:
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { WebCrawler, bedrock } from '@cdklabs/generative-ai-cdk-constructs';

export class WebCrawlerAgentStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const crawler = new WebCrawler(this, 'WebCrawler', {
      enableLambdaCrawler: true,
    });

    const agent = new bedrock.Agent(this, 'WebAgent', {
      foundationModel: bedrock.BedrockFoundationModel.ANTHROPIC_CLAUDE_SONNET_V1_0,
      instruction: `You are a helpful and friendly agent with access to the internet.
You can get content from web pages provided by an URL.
You can summarize web pages.`,
    });

    agent.addActionGroup(
      new bedrock.AgentActionGroup(this, 'WebCrawlerActionGroup', {
        actionGroupName: 'web-crawler',
        description: 'Use this function to get content from a web page by HTTP or HTTPS URL',
        actionGroupExecutor: {
          lambda: crawler.lambdaCrawler,
        },
        actionGroupState: 'ENABLED',
        apiSchema: bedrock.ApiSchema.fromAsset(crawler.lambdaCrawlerApiSchemaPath),
      }),
    );
  }
}
When deploying this architecture, you as the customer are responsible for the costs of the AWS services utilized. Based on current pricing in the US East (N. Virginia) Region, operating this infrastructure with the default configuration to crawl websites for 4 hours every day is estimated to cost approximately $52.75 per month. This estimate includes usage of the AWS services leveraged in this architecture, such as AWS Lambda, AWS Batch, AWS Fargate, and Amazon S3.
We recommend creating a budget through AWS Cost Explorer to help manage costs. Prices are subject to change. For full details, refer to the pricing webpage for each AWS service used in this solution.
The following table provides a sample cost breakdown for deploying this solution with the default parameters in the US East (N. Virginia) Region for one month.
AWS Service | Dimensions | Cost [USD] |
---|---|---|
Amazon Virtual Private Cloud | | 37.35 |
Amazon Simple Storage Service (S3) | 100 GB / month | 2.30 |
AWS Fargate | 4 hours per day | 13.10 |
When you build systems on AWS infrastructure, security responsibilities are shared between you and AWS. This shared responsibility model reduces your operational burden because AWS operates, manages, and controls the components including the host operating system, virtualization layer, and physical security of the facilities in which the services operate. For more information about AWS security, visit AWS Cloud Security.
Optionally, you can provide existing resources to the construct (marked optional in the construct pattern props). If you choose to do so, please refer to the official documentation on best practices for securing each service.
If you grant access to a user to your account where this construct is deployed, this user may access information stored by the construct (Amazon Simple Storage Service buckets, Amazon CloudWatch logs). To help secure your AWS resources, please follow the best practices for AWS Identity and Access Management (IAM).
Note
This construct stores documents in the output assets bucket. You should consider validating each file stored in the output bucket, following file validation best practices, before using it in your application.
If you use the data generated by this construct in a knowledge base, you should consider ingesting only the documents you deem appropriate. Depending on the service and LLM you use, results returned by the knowledge base may be eligible for inclusion into the prompt; and therefore, may be sent to the LLM or underlying service. If you use a third-party LLM, you should consider auditing the documents contained within your knowledge base.
The Web Crawler construct can be used to index your own web pages, or web pages that you have authorization to index. Using the Web Crawler construct to aggressively crawl websites or web pages you don't own could violate others' rights, applicable laws, and terms of service. Some forms of web scraping, such as extracting copyrighted content or personal information without consent, or engaging in activities that interfere with the normal operation of websites, may constitute unlawful behavior. Accordingly, you should review the laws and regulations that apply to you and your intended web crawling and web scraping activities before using this construct.
This construct provides several configurable options for logging. Please consider security best practices when enabling or disabling logging and related features. Verbose logging, for instance, may log the content of API calls. You can disable this functionality by setting the observability flag to false.
This solution uses multiple AWS services, which might not be currently available in all AWS Regions. You must launch this construct in an AWS Region where these services are available. For the most current availability of AWS services by Region, see the AWS Regional Services List.
Service quotas, also referred to as limits, are the maximum number of service resources or operations for your AWS account.
Make sure you have sufficient quota for each of the services implemented in this solution and the associated instance types. For more information, refer to AWS service quotas.
To view the service quotas for all AWS services in the documentation without switching pages, view the information in the Service endpoints and quotas page in the PDF instead.
When deleting a stack that uses this construct, do not forget to follow the instructions below to avoid unexpected charges:
- Delete the logs uploaded to the account
© Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.