Crawl well-known resources introduced by The Privacy Sandbox:

- `/.well-known/privacy-sandbox-attestations.json`
- To obtain the file, a site submits a form and the JSON file is then sent by Google.
- There is no public list of who participates.
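As an illustration only, the attestation file for a single origin can be fetched directly; `example.com` below is a placeholder domain, and this sketch does not prescribe how the crawler itself performs the requests.

```sh
# Sketch only: fetch the Privacy Sandbox attestation file for one origin.
# "example.com" is a placeholder; the crawler iterates over the CrUX list instead.
curl -fsSL --max-time 10 \
  "https://example.com/.well-known/privacy-sandbox-attestations.json" \
  -o attestations-example.com.json \
  || echo "no attestation file published for this origin"
```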
A Dockerfile is provided under `.devcontainer/`; for direct integration with VS Code, or to manually build the image and deploy the Docker container, follow the instructions in this guide.
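For a manual build outside VS Code, the commands below are a minimal sketch. The image tag `ps-crawler` is a hypothetical name, and it is assumed here that the dev-container Dockerfile builds standalone from the repository root; the crawl configuration is passed as the environment variables listed below.

```sh
# Sketch only: build the image from the dev-container Dockerfile and run it.
# The tag "ps-crawler" is a hypothetical name chosen for this example.
docker build -t ps-crawler -f .devcontainer/Dockerfile .

# Pass the crawl configuration as environment variables (see the list below).
docker run --rm \
  -e CRUX_URL="https://github.com/zakird/crux-top-lists/raw/main/data/global/current.csv.gz" \
  -e CRUX_TOP=1000000 \
  -e RWS_URL="https://raw.githubusercontent.com/GoogleChrome/related-website-sets/main/related_website_sets.JSON" \
  ps-crawler
```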
Required:

- `CRUX_URL`: The URL of the cached CrUX list to use (https://github.com/zakird/crux-top-lists/raw/main/data/global/current.csv.gz)
- `CRUX_TOP`: How many of the top origins to crawl (1000000)
- `RWS_URL`: The URL of the RWS canonical set (https://raw.githubusercontent.com/GoogleChrome/related-website-sets/main/related_website_sets.JSON)

Optional:

- `S3_DATA_BUCKET`: The S3 bucket where the raw crawl results are saved; if undefined, a local run is assumed.
Run the crawl with `./crawl_crux.sh`.
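A local run (no `S3_DATA_BUCKET` set) might look like the following; the values shown are the ones listed above, and the commented bucket name is purely illustrative.

```sh
# Sketch of a local run: results stay on disk because S3_DATA_BUCKET is undefined.
export CRUX_URL="https://github.com/zakird/crux-top-lists/raw/main/data/global/current.csv.gz"
export CRUX_TOP=1000000
export RWS_URL="https://raw.githubusercontent.com/GoogleChrome/related-website-sets/main/related_website_sets.JSON"
# export S3_DATA_BUCKET=my-crawl-results   # hypothetical bucket name; uncomment to save raw results to S3
./crawl_crux.sh
```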
Define the following CI variables so that GitLab CI builds and pushes the Docker image automatically, keeping the ECS task up to date:

- `AWS_ACCOUNT_ID`: the AWS account ID
- `AWS_REGION`: the AWS region to use
- `AWS_ACCESS_KEY_ID`: of an IAM user with the `AmazonEC2ContainerRegistryPowerUser` policy
- `AWS_SECRET_ACCESS_KEY`: of an IAM user with the `AmazonEC2ContainerRegistryPowerUser` policy
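For reference, a build-and-push job with these variables would typically do the equivalent of the following; the ECR repository name `ps-crawler` and the `latest` tag are assumptions for illustration, and the actual `.gitlab-ci.yml` in the repository is authoritative.

```sh
# Sketch of what the CI pipeline is expected to do with these variables.
# The ECR repository name "ps-crawler" is hypothetical.
aws ecr get-login-password --region "$AWS_REGION" \
  | docker login --username AWS --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com"

docker build -t "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/ps-crawler:latest" \
  -f .devcontainer/Dockerfile .
docker push "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/ps-crawler:latest"
```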