The main idea of this task is to create a high-performance web crawler backed by a database. The crawler must:
- start from a given starting URL and download the page
- parse links (local and external)
- parse the page content, filtering out non-displayed content such as scripts, hidden parameters, and HTML tags
- save the processed page content to the database
- save the extracted links as new starting points for later processing
- save its own state to the database (a minimal entity sketch follows this list)
- run as a multi-threaded instance
- be able to run as several application instances at the same time
- never visit the same page twice
- support a maximum depth limit, for example 20 external hops from the starting point
- support HTTPS
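A minimal JPA entity sketch for one unit of crawl work might look as follows. The `CrawlTask` name, column set, and Jakarta Persistence imports (Hibernate 6+; older stacks would use `javax.persistence`) are illustrative assumptions, not prescribed by the task. The unique constraint on `url` backs the no-revisit rule, `depth` backs the hop limit, and `status`/`claimedBy`/`claimedAt` carry the crawler's persisted state:

```java
import jakarta.persistence.*;
import java.time.Instant;

// Hypothetical entity for one unit of crawl work; names are illustrative.
@Entity
@Table(name = "crawl_task",
       uniqueConstraints = @UniqueConstraint(columnNames = "url")) // enforces "never visit the same page twice"
public class CrawlTask {

    public enum Status { PENDING, IN_PROGRESS, DONE, FAILED }

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @Column(nullable = false, length = 2048)
    private String url;

    @Column(nullable = false)
    private int depth;                        // external hops from the starting point

    @Enumerated(EnumType.STRING)
    @Column(nullable = false)
    private Status status = Status.PENDING;

    private String claimedBy;                 // id of the crawler instance working on the task
    private Instant claimedAt;                // lets other instances detect a stalled or dead owner

    protected CrawlTask() { }                 // no-arg constructor required by JPA

    public CrawlTask(String url, int depth) {
        this.url = url;
        this.depth = depth;
    }

    public Status getStatus() { return status; }
    public void setStatus(Status status) { this.status = status; }
    public void setClaimedBy(String claimedBy) { this.claimedBy = claimedBy; }
    public void setClaimedAt(Instant claimedAt) { this.claimedAt = claimedAt; }
    // remaining getters/setters omitted for brevity
}
```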
If the crawler has finished all available tasks, it should wait for new ones.
If a crawler instance dies or is stopped, another instance must pick up and finish its unfinished work.
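One way to let several instances share the queue and take over after a dead crawler is an atomic, transactional claim in the database. The sketch below assumes Spring and the `CrawlTask` entity above; the pessimistic row lock and the 5-minute stall timeout are illustrative choices, not requirements:

```java
import jakarta.persistence.EntityManager;
import jakarta.persistence.LockModeType;
import jakarta.persistence.PersistenceContext;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

import java.time.Instant;
import java.util.List;
import java.util.Optional;

// Hypothetical service: the row lock guarantees that only one instance claims a given task.
@Service
public class TaskClaimService {

    @PersistenceContext
    private EntityManager em;

    @Transactional
    public Optional<CrawlTask> claimNext(String workerId) {
        // Candidates: tasks nobody has claimed yet, or tasks whose owner stopped reporting progress.
        List<CrawlTask> candidates = em.createQuery(
                "select t from CrawlTask t " +
                "where t.status = :pending " +
                "   or (t.status = :inProgress and t.claimedAt < :stale) " +
                "order by t.id",
                CrawlTask.class)
            .setParameter("pending", CrawlTask.Status.PENDING)
            .setParameter("inProgress", CrawlTask.Status.IN_PROGRESS)
            .setParameter("stale", Instant.now().minusSeconds(300)) // assumed 5-minute stall timeout
            .setLockMode(LockModeType.PESSIMISTIC_WRITE)            // block concurrent claimers on the same row
            .setMaxResults(1)
            .getResultList();

        if (candidates.isEmpty()) {
            return Optional.empty();                                 // nothing to do: caller waits and polls again
        }

        CrawlTask task = candidates.get(0);
        task.setStatus(CrawlTask.Status.IN_PROGRESS);
        task.setClaimedBy(workerId);
        task.setClaimedAt(Instant.now());
        return Optional.of(task);                                    // changes are flushed on commit
    }
}
```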
Each downloaded page must be split into separate words, so for every parsed page we need to save the list of words and how many times each one was used.
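Word counting might then look like the sketch below. It assumes Jsoup for extracting the visible text (one possible library choice, not mandated here); the resulting map would afterwards be persisted per page, for example as (page, word, count) rows:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.HashMap;
import java.util.Map;

public final class WordCounter {

    // Returns how many times each word occurs in the visible text of a page.
    public static Map<String, Integer> countWords(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("script, style, noscript").remove();    // drop non-displayed content
        String text = doc.body() != null ? doc.body().text() : doc.text();

        Map<String, Integer> counts = new HashMap<>();
        // Split on anything that is not a letter or digit; lower-case so "Word" and "word" merge.
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isBlank()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```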
Use JPA and Hibernate as the ORM, a microservice architecture, and REST.
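As a REST entry point for the microservice, one could expose an endpoint that enqueues a new starting URL. The Spring Boot controller below is only a sketch: the path, the request parameter, and the empty Spring Data repository (assumed so that `save()` is available) are all hypothetical:

```java
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

// Assumed Spring Data repository for the CrawlTask entity sketched earlier.
interface CrawlTaskRepository extends JpaRepository<CrawlTask, Long> { }

// Hypothetical controller: path, parameter, and response shape are illustrative.
@RestController
@RequestMapping("/api/crawl")
public class CrawlController {

    private final CrawlTaskRepository tasks;

    public CrawlController(CrawlTaskRepository tasks) {
        this.tasks = tasks;
    }

    // Enqueue a new starting point at depth 0; crawler workers pick it up from the database.
    @PostMapping("/seeds")
    public ResponseEntity<Void> addSeed(@RequestParam String url) {
        tasks.save(new CrawlTask(url, 0));
        return ResponseEntity.accepted().build();
    }
}
```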