Scripts for crawling the 500 most visited websites in Thailand, according to Alexa, for th and en parallel texts.
get_alexa_rankings.py
- get the top 500 most visited domains in Thailand according to Alexa
scrape_robots.py
- scrape all sitemaps from https://domain/robots.txt
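
As an illustration of the robots.txt step, here is a minimal sketch of how sitemap URLs could be pulled out of a robots.txt body that has already been downloaded. The function name extract_sitemaps is hypothetical and not taken from scrape_robots.py; the actual script may work differently.

```python
def extract_sitemaps(robots_txt: str) -> list[str]:
    """Return the sitemap URLs declared in the text of a robots.txt file.

    Hypothetical helper for illustration; not the code in scrape_robots.py.
    """
    sitemaps = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split "Field: value" on the first
        # colon only, because the value (a URL) contains colons itself.
        field, sep, value = line.split("#", 1)[0].partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            sitemaps.append(value.strip())
    return sitemaps
```

The field name is matched case-insensitively, so both "Sitemap:" and "sitemap:" lines are picked up.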
scrape_urls.py
- scrape all URLs from https://domain/sitemap.xml
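
The sitemap step can be sketched roughly as follows, assuming the sitemap XML has already been fetched as a string. The helper extract_locs is a hypothetical name, not necessarily what scrape_urls.py defines. It reads the loc entries of a standard sitemap, with or without the sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
LOC_TAGS = {"loc", "{%s}loc" % SITEMAP_NS}


def extract_locs(sitemap_xml: str) -> list[str]:
    """Return every <loc> value found in a sitemap or sitemap index.

    Hypothetical helper for illustration; not the code in scrape_urls.py.
    """
    root = ET.fromstring(sitemap_xml)
    locs = []
    for element in root.iter():
        # Accept <loc> both with and without the sitemap namespace.
        if element.tag in LOC_TAGS and element.text:
            locs.append(element.text.strip())
    return locs
```

For a sitemap index, the returned values are themselves sitemap URLs, so the same function can be applied again to each downloaded child sitemap.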
clean_urls.py
- send HEAD requests to verify that URLs return 200
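
A minimal sketch of the HEAD check, using only the standard library. The names head_status and keep_ok are hypothetical, as is the 10 second timeout; clean_urls.py may use a different HTTP client and different settings.

```python
import urllib.error
import urllib.request


def head_status(url, timeout=10.0):
    """Return the HTTP status of a HEAD request, or None if none arrived.

    Hypothetical helper for illustration; not the code in clean_urls.py.
    """
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The server answered with an error status such as 404 or 500.
        return error.code
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return None


def keep_ok(urls, get_status=head_status):
    """Keep only the URLs whose status is exactly 200."""
    return [url for url in urls if get_status(url) == 200]
```

Passing the status function as a parameter makes it possible to try the filter without touching the network, for example keep_ok(urls, get_status=lambda url: 200). Note that urlopen follows redirects, so the status reported is that of the final response.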
scrape_requests.py
- scrape URL contents and align by tags
check_health.py
- simple health check script for scraped datasets
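
The "align by tags" step of scrape_requests.py is not described in detail here. One possible reading, sketched below under that assumption, is to collect the text found under each HTML tag on the th page and on the en page, then pair the segments position by position within each tag. All names (TagTextCollector, texts_by_tag, align_by_tag) are hypothetical, and the real script may align differently.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag or text content.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Elements whose text is code or styling, not page content.
SKIPPED_TAGS = {"script", "style"}


class TagTextCollector(HTMLParser):
    """Collect non-empty text segments, grouped by the innermost open tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.texts = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text and self.open_tags and self.open_tags[-1] not in SKIPPED_TAGS:
            self.texts[self.open_tags[-1]].append(text)


def texts_by_tag(html):
    """Map each tag name to the list of text segments found directly in it."""
    collector = TagTextCollector()
    collector.feed(html)
    collector.close()
    return collector.texts


def align_by_tag(th_html, en_html):
    """Pair the i-th text under each tag of the th page with the en page."""
    th_texts, en_texts = texts_by_tag(th_html), texts_by_tag(en_html)
    pairs = []
    for tag, th_segments in th_texts.items():
        # zip() stops at the shorter list, so unmatched segments are dropped.
        for th_text, en_text in zip(th_segments, en_texts.get(tag, [])):
            pairs.append((tag, th_text, en_text))
    return pairs
```

This positional pairing only gives sensible results when both language versions of a page share the same layout; pages whose structure differs would need a more robust alignment method.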