Web Crawler

The main idea of this task is to create a high-performance web crawler, backed by a database.

The crawler itself is a simple tool which

  • starts from a given starting URL and downloads the page
  • parses links (both local and external)
  • parses the page content, filtering out non-displayed content such as scripts, hidden parameters, and HTML tags (see the fetch-and-parse sketch after this list)
  • saves the processed page content to the database
  • saves the links as new starting points for later processing.
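
A minimal sketch of the fetch-and-parse step, assuming jsoup for HTML parsing (the library choice is an assumption, not something this repository confirms). jsoup's Element.text() collects only text nodes, so script bodies, styles, and markup are dropped from the saved content automatically.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** One crawl step: download a page, extract its links and its visible text. */
public class CrawlStep {

    public static class PageResult {
        public final String text;        // visible page text (scripts, styles, tags stripped)
        public final List<String> links; // absolute URLs found on the page

        PageResult(String text, List<String> links) {
            this.text = text;
            this.links = links;
        }
    }

    public static PageResult fetch(String url) throws IOException {
        Document doc = Jsoup.connect(url).get();   // download and parse the page (HTTP or HTTPS)

        List<String> links = new ArrayList<>();
        for (Element a : doc.select("a[href]")) {
            links.add(a.attr("abs:href"));         // resolve relative links against the page URL
        }

        // Element.text() returns only text nodes, so script/style bodies and markup are excluded.
        String visibleText = doc.body() != null ? doc.body().text() : "";
        return new PageResult(visibleText, links);
    }
}
```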

The crawler must

  • save its state to the database.
  • be multi-threaded.
  • support being run as several application instances (see the task-claiming sketch after this list).
  • not visit the same page twice.
  • support a maximum depth limit, for example 20 external hops from the starting point.
  • support HTTPS.
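
The multi-instance, no-duplicate-visits, and persisted-state requirements point to a database-backed task queue. Below is a minimal sketch of that idea in plain JDBC; the task table, its columns (url, depth, status, worker_id, claimed_at), and the PostgreSQL-specific ON CONFLICT and FOR UPDATE SKIP LOCKED clauses are assumptions for illustration, not details taken from this repository.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Optional;

/** Claims pending crawl tasks so that several instances never process the same URL twice. */
public class TaskQueue {

    private final Connection db;
    private final String workerId;
    private final int maxDepth;

    public TaskQueue(Connection db, String workerId, int maxDepth) {
        this.db = db;
        this.workerId = workerId;
        this.maxDepth = maxDepth;
    }

    /** Enqueue a URL; a unique constraint on task.url prevents visiting the same page twice. */
    public void submit(String url, int depth) throws SQLException {
        if (depth > maxDepth) {
            return; // respect the hop limit, e.g. 20 external hops from the starting point
        }
        try (PreparedStatement ps = db.prepareStatement(
                "INSERT INTO task (url, depth, status) VALUES (?, ?, 'PENDING') " +
                "ON CONFLICT (url) DO NOTHING")) {
            ps.setString(1, url);
            ps.setInt(2, depth);
            ps.executeUpdate();
        }
    }

    /** Atomically claim one pending task; concurrent instances skip rows already locked. */
    public Optional<Long> claimNext() throws SQLException {
        try (PreparedStatement ps = db.prepareStatement(
                "UPDATE task SET status = 'IN_PROGRESS', worker_id = ?, claimed_at = now() " +
                "WHERE id = (SELECT id FROM task WHERE status = 'PENDING' " +
                "            ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED) " +
                "RETURNING id")) {
            ps.setString(1, workerId);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? Optional.of(rs.getLong(1)) : Optional.empty();
            }
        }
    }
}
```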

If the crawler has finished all available tasks, it should wait for new tasks.

If a crawler instance dies or is stopped, another instance must continue its unfinished work.
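
One way to cover both points is a polling worker loop combined with a lease timeout: when the queue is empty the worker sleeps and retries, and tasks that were claimed long ago but never completed are treated as abandoned and returned to the queue so another instance can pick them up. A sketch follows, reusing the hypothetical TaskQueue from the previous example; the 10-minute lease is an arbitrary choice.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Optional;

/** Worker loop: claim and process tasks, wait when the queue is empty,
 *  and requeue work abandoned by crawler instances that died mid-crawl. */
public class CrawlerWorker implements Runnable {

    private final TaskQueue queue; // hypothetical helper from the previous sketch
    private final Connection db;

    public CrawlerWorker(TaskQueue queue, Connection db) {
        this.queue = queue;
        this.db = db;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                Optional<Long> task = queue.claimNext();
                if (task.isPresent()) {
                    process(task.get());     // download, parse, store words and links, mark DONE
                } else {
                    requeueStaleTasks();     // take over unfinished work from dead instances
                    Thread.sleep(5_000);     // queue is empty: wait for new tasks
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (Exception e) {
                e.printStackTrace();         // log and keep the worker alive
            }
        }
    }

    /** Tasks claimed more than 10 minutes ago without finishing are considered abandoned. */
    private void requeueStaleTasks() throws SQLException {
        try (Statement st = db.createStatement()) {
            st.executeUpdate(
                "UPDATE task SET status = 'PENDING', worker_id = NULL " +
                "WHERE status = 'IN_PROGRESS' AND claimed_at < now() - interval '10 minutes'");
        }
    }

    private void process(long taskId) {
        // look up the task's URL, fetch and parse the page, save word counts and new links
    }
}
```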

Each downloaded page must be parsed into separate words, so for every parsed page we need to save the list of words and how many times each of them was used.
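
A per-page word count can be produced with a simple tokenizer. The sketch below lower-cases the visible text and splits on anything that is not a letter or a digit; the splitting rule is an assumption, not a requirement stated above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Splits visible page text into words and counts how many times each one occurs. */
public class WordCounter {

    public static Map<String, Long> count(String pageText) {
        Map<String, Long> counts = new LinkedHashMap<>();
        // Lower-case the text and split on anything that is not a letter or a digit.
        for (String word : pageText.toLowerCase().split("[^\\p{L}\\p{Nd}]+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("Crawler saves words; the crawler counts the words."));
        // prints {crawler=2, saves=1, words=2, the=2, counts=1}
    }
}
```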

Use JPA and Hibernate as the ORM, a microservice architecture, and REST.
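
Each word/count pair for a page could then be persisted through a JPA entity managed by Hibernate. The entity below is only illustrative: the class, table, and column names are hypothetical, and it uses the javax.persistence API (newer stacks use jakarta.persistence instead).

```java
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.Table;
import javax.persistence.UniqueConstraint;

/** A word and its number of occurrences on one parsed page (illustrative entity). */
@Entity
@Table(name = "page_word",
       uniqueConstraints = @UniqueConstraint(columnNames = {"page_id", "word"}))
public class PageWord {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @Column(name = "page_id", nullable = false)
    private Long pageId;   // the parsed page this count belongs to

    @Column(nullable = false)
    private String word;   // a single word extracted from the page text

    @Column(nullable = false)
    private int count;     // how many times the word occurred on that page

    protected PageWord() { } // no-arg constructor required by JPA

    public PageWord(Long pageId, String word, int count) {
        this.pageId = pageId;
        this.word = word;
        this.count = count;
    }

    public Long getId() { return id; }
    public String getWord() { return word; }
    public int getCount() { return count; }
}
```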
