Crawl well-known resources introduced by The Privacy Sandbox:

- `/.well-known/privacy-sandbox-attestations.json`
- To obtain the file, a site submits a form and the JSON file is then sent by Google.
- There is no public list of who participates.
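As an illustration only, the attestation file for a single origin can be fetched directly; `example.com` below is a placeholder domain, and this sketch does not prescribe how the crawler itself performs the requests.

```sh
# Sketch only: fetch the Privacy Sandbox attestation file for one origin.
# "example.com" is a placeholder; the crawler iterates over the CrUX list instead.
curl -fsSL --max-time 10 \
  "https://example.com/.well-known/privacy-sandbox-attestations.json" \
  -o attestations-example.com.json \
  || echo "no attestation file published for this origin"
```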
A Dockerfile is provided under `.devcontainer/`; for direct integration with VS Code, or to manually build the image and deploy the Docker container, follow the instructions in this guide.
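For a manual build outside VS Code, the commands below are a minimal sketch. The image tag `ps-crawler` is a hypothetical name, and it is assumed here that the dev-container Dockerfile builds standalone from the repository root; the crawl configuration is passed as the environment variables listed below.

```sh
# Sketch only: build the image from the dev-container Dockerfile and run it.
# The tag "ps-crawler" is a hypothetical name chosen for this example.
docker build -t ps-crawler -f .devcontainer/Dockerfile .

# Pass the crawl configuration as environment variables (see the list below).
docker run --rm \
  -e CRUX_URL="https://github.com/zakird/crux-top-lists/raw/main/data/global/current.csv.gz" \
  -e CRUX_TOP=1000000 \
  -e RWS_URL="https://raw.githubusercontent.com/GoogleChrome/related-website-sets/main/related_website_sets.JSON" \
  ps-crawler
```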
Required:

- `CRUX_URL`: The URL of the cached CrUX list to use (https://github.com/zakird/crux-top-lists/raw/main/data/global/current.csv.gz)
- `CRUX_TOP`: How many of the top origins to crawl (1000000)
- `RWS_URL`: The URL of the RWS canonical set (https://raw.githubusercontent.com/GoogleChrome/related-website-sets/main/related_website_sets.JSON)

Optional:

- `S3_DATA_BUCKET`: The S3 bucket where the raw crawl results are saved; if undefined, a local run is assumed.
Run the crawl with `./crawl_crux.sh`.
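A local run (no `S3_DATA_BUCKET` set) might look like the following; the values shown are the ones listed above, and the commented bucket name is purely illustrative.

```sh
# Sketch of a local run: results stay on disk because S3_DATA_BUCKET is undefined.
export CRUX_URL="https://github.com/zakird/crux-top-lists/raw/main/data/global/current.csv.gz"
export CRUX_TOP=1000000
export RWS_URL="https://raw.githubusercontent.com/GoogleChrome/related-website-sets/main/related_website_sets.JSON"
# export S3_DATA_BUCKET=my-crawl-results   # hypothetical bucket name; uncomment to save raw results to S3
./crawl_crux.sh
```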
Define the following CI variables so that GitLab CI builds and pushes the Docker image automatically, keeping the ECS task up to date:

- `AWS_ACCOUNT_ID`: the AWS account ID
- `AWS_REGION`: the AWS region to use
- `AWS_ACCESS_KEY_ID`: of an IAM user with the `AmazonEC2ContainerRegistryPowerUser` policy
- `AWS_SECRET_ACCESS_KEY`: of an IAM user with the `AmazonEC2ContainerRegistryPowerUser` policy
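For reference, a build-and-push job with these variables would typically do the equivalent of the following; the ECR repository name `ps-crawler` and the `latest` tag are assumptions for illustration, and the actual `.gitlab-ci.yml` in the repository is authoritative.

```sh
# Sketch of what the CI pipeline is expected to do with these variables.
# The ECR repository name "ps-crawler" is hypothetical.
aws ecr get-login-password --region "$AWS_REGION" \
  | docker login --username AWS --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com"

docker build -t "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/ps-crawler:latest" \
  -f .devcontainer/Dockerfile .
docker push "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/ps-crawler:latest"
```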