Crawlers for various websites, mostly news providers.

Basic idea: fetch the content of a web page and examine the text present, extracting whatever matches the given keywords or patterns, e.g. links selected by file extension or domain.
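The filtering step described above can be sketched roughly as follows. This is a minimal illustration using only the Python 3 standard library, not the code in `fileDownloader.py` or `utils.py`; the names `LinkCollector` and `extract_links` and the example URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import posixpath


class LinkCollector(HTMLParser):
    """Collects href/src attribute values found in a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def extract_links(html, base_url, extensions=(), domains=()):
    """Return absolute links whose file extension or host matches a filter.

    `extensions` is a tuple of lowercase extensions without the dot,
    `domains` a tuple of lowercase host names; subdomains also match.
    """
    collector = LinkCollector()
    collector.feed(html)
    matches = []
    for raw in collector.links:
        url = urljoin(base_url, raw)  # resolve relative links against the page URL
        parsed = urlparse(url)
        ext = posixpath.splitext(parsed.path)[1].lstrip(".").lower()
        host = (parsed.hostname or "").lower()
        by_ext = ext in extensions
        by_domain = any(host == d or host.endswith("." + d) for d in domains)
        if (by_ext or by_domain) and url not in matches:
            matches.append(url)
    return matches


# Assumed sample page, for illustration only.
page = (
    '<a href="/docs/a.pdf">a</a>'
    '<a href="http://example.org/page">b</a>'
    '<img src="pic.PNG">'
    '<a href="/about">c</a>'
)
base = "http://news.example.com/index.html"
pdf_and_png = extract_links(page, base, extensions=("pdf", "png"))
from_org = extract_links(page, base, domains=("example.org",))
```

Here `pdf_and_png` holds the two file links resolved against `base`, and `from_org` holds only the link hosted on `example.org`.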

Once links are extracted, those that point to files are either downloaded directly or queued up in the cloud for workers to perform the downloads.
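The choice between downloading immediately and handing work to a queue might look like the sketch below. It is an assumption-laden illustration, not the repository's implementation: `dispatch`, `download_locally`, and `fetch_bytes` are hypothetical names, and an in-process `queue.Queue` stands in for whatever cloud job queue `targetForCloud.py` actually talks to.

```python
import os
import queue
from urllib.parse import urlparse
from urllib.request import urlopen


def fetch_bytes(url):
    """Default fetcher: read the whole response body over HTTP."""
    with urlopen(url) as response:
        return response.read()


def download_locally(url, dest_dir, fetch=fetch_bytes):
    """Save one URL into dest_dir and return the path written."""
    name = os.path.basename(urlparse(url).path) or "index.html"
    path = os.path.join(dest_dir, name)
    with open(path, "wb") as f:
        f.write(fetch(url))
    return path


def dispatch(urls, dest_dir=None, job_queue=None, fetch=fetch_bytes):
    """Download each URL now, or enqueue it for a worker if a queue is given."""
    results = []
    for url in urls:
        if job_queue is not None:
            # Stand-in for submitting a job to a remote queue (assumption).
            job_queue.put({"url": url})
            results.append(("queued", url))
        else:
            results.append(("downloaded", download_locally(url, dest_dir, fetch)))
    return results
```

Passing `fetch` as a parameter keeps the network call replaceable, which makes the local path easy to exercise without touching a real site.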

To use the local downloader:

++ Works on any version of Python >= 2.X

python fileDownloader.py

To use the cloud-based job queuer:

++ So far built for Python 3.X only

python3 targetForCloud.py

Latest commit: Emmanuel Odeke, "merge from master", a8722a1, Oct 11, 2015 (83 commits)

Name	Last commit message	Last commit date
classifier	Fresh categories	Jul 25, 2014
resty @ b0769d8	Refactoring	Sep 6, 2014
routing @ 277248c	Now using cleaner and latest resty + fixed routeUtils	Sep 6, 2014
solos	Renamed sitereader	Jul 21, 2014
.gitignore	Added .gitignore	Apr 14, 2014
.gitmodules	Started use of load balancer	Jun 22, 2014
README.md	Adding in newlines for formatting of README.md	May 25, 2014
RobotParser.py	Refactoring	Sep 6, 2014
acmDl.py	Compiling nameSearch regex	Jul 15, 2014
fileDownloader.py	added mechanism for extracting capable urls	Oct 11, 2015
oxy	Syntax fix	Jul 25, 2014
routeUtils.py	Now using cleaner and latest resty + fixed routeUtils	Sep 6, 2014
shardy.py	merge from master	Oct 11, 2015
utils.py	added mechanism for extracting capable urls	Oct 11, 2015
