well-known-crawler

Crawl the well-known resources introduced by the Privacy Sandbox:

  • Related Website Sets

    • /.well-known/related-website-set.json
    • Served over HTTPS only
    • Served on the eTLD+1 only, where the Public Suffix List (PSL) is the authoritative source for eTLDs
    • The JSON format differs depending on whether the site is the set primary or another member of the set
    • There is a canonical public list of sets, but some sets (for instance google.com/youtube.com) are missing from it
    • Generator
  • Attestation File

    • /.well-known/privacy-sandbox-attestations.json
    • Participants submit a form; the JSON file is then sent by Google
    • No public list of who participates
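As an illustration of the two paths listed above, here is a minimal Python sketch that builds the HTTPS URLs to request for a given site. The function name `well_known_urls` is hypothetical and not taken from this repository; the sketch assumes the caller already passes an eTLD+1 (no PSL lookup is done here), and that the attestation file is requested on the same host, which the list above does not state.

```python
from urllib.parse import urlsplit

# Paths taken from the list above.
RWS_PATH = "/.well-known/related-website-set.json"
ATTESTATION_PATH = "/.well-known/privacy-sandbox-attestations.json"


def well_known_urls(site: str) -> dict[str, str]:
    """Return the well-known URLs for a site given as a bare domain or a URL.

    Hypothetical helper: it does not reduce the host to its eTLD+1,
    so the caller is assumed to have done that against the PSL.
    """
    # Accept both "example.com" and "http://example.com/some/path".
    host = urlsplit(site if "://" in site else f"https://{site}").hostname
    if not host:
        raise ValueError(f"cannot extract a hostname from {site!r}")
    # Always build https:// URLs, whatever scheme the input used.
    return {
        "related_website_set": f"https://{host}{RWS_PATH}",
        "attestation": f"https://{host}{ATTESTATION_PATH}",
    }
```

Calling `well_known_urls("example.com")` returns a dictionary with one URL per resource; the original path and scheme of the input are discarded.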

Dependencies

A Dockerfile is provided under .devcontainer/; for direct integration with VS Code or to manually build the image and deploy the Docker container, follow the instructions in this guide.

Environment Variables

Required:

Optional:

  • S3_DATA_BUCKET: the S3 bucket where the raw crawl results are saved. If it is undefined, a local run is assumed.
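The local-versus-S3 switch described above could look roughly like the following Python sketch. It is not the crawler's actual code: the function name `output_location`, the `results` directory, and the choice to treat an empty value as undefined are all assumptions made for illustration.

```python
import os
from pathlib import Path


def output_location(key: str, env=None, local_dir: str = "results") -> str:
    """Decide where a raw crawl result should be written.

    Hypothetical helper: returns an s3:// URI when S3_DATA_BUCKET is set,
    otherwise a path under local_dir (an assumed default, not the
    repository's real layout).
    """
    env = os.environ if env is None else env
    # Assumption: an empty or whitespace-only value counts as undefined.
    bucket = env.get("S3_DATA_BUCKET", "").strip()
    if bucket:
        return f"s3://{bucket}/{key}"
    return str(Path(local_dir) / key)
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.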

Usage

./crawl_crux.sh

GitLab CI/CD Variables

Define the following CI variables so that GitLab CI builds and pushes the Docker image automatically, keeping the ECS task up to date:

  • AWS_ACCOUNT_ID: the AWS account ID
  • AWS_REGION: the AWS region to use
  • AWS_ACCESS_KEY_ID: the access key ID of an IAM user with the AmazonEC2ContainerRegistryPowerUser policy
  • AWS_SECRET_ACCESS_KEY: the secret access key of the same IAM user
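To show how these variables fit together, the Python sketch below checks that all four are present and derives the ECR registry hostname from the account ID and region, using the standard `<account>.dkr.ecr.<region>.amazonaws.com` form. The function name `ecr_registry` is hypothetical, and the project's actual CI job may compute the registry differently.

```python
import os

# Variable names taken from the list above.
REQUIRED_CI_VARS = (
    "AWS_ACCOUNT_ID",
    "AWS_REGION",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
)


def ecr_registry(env=None) -> str:
    """Validate the CI variables and return the ECR registry hostname.

    Hypothetical helper for illustration; it only inspects the
    environment and makes no AWS calls.
    """
    env = os.environ if env is None else env
    # Report every missing or empty variable at once.
    missing = [name for name in REQUIRED_CI_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing CI variables: " + ", ".join(missing))
    return f"{env['AWS_ACCOUNT_ID']}.dkr.ecr.{env['AWS_REGION']}.amazonaws.com"
```

Failing early with the full list of missing names tends to be more helpful in a CI log than a later authentication error from the registry.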