Scrapes AO3 for metadata about fanfictions and pushes the data into a MongoDB database.
Download Leiningen and a MongoDB server.
Make sure the MongoDB server is running
$ mongod --dbpath C:\data\db
Run unit tests with
$ lein test
Write a YAML config file, then run the scraper with either the standalone jar or Leiningen:
$ java -jar fanfic-scrape-0.1.0-standalone.jar config.yaml
$ lein run config.yaml
where config.yaml is the path to your YAML config file.
Each document stored in MongoDB has fields for the title, author, summary, tags, URL, and the date it was added to the database. You can query them from the mongo shell:
$ mongo
> use fanfics
switched to db fanfics
> db.works.find({ "tags" : "Kururugi Suzaku" }).count()
> db.works.aggregate([{ $unwind : { path : "$author" } },
{ $group : { _id : "$author", count: { $sum: 1 } } },
{ $sort : { count : -1 }} ])
{ "_id" : "Divano_Messiah", "count" : 80 }
{ "_id" : "orphan_account", "count" : 45 }
{ "_id" : "NeoDiji", "count" : 37 }
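For readers less familiar with aggregation pipelines, the per-author count in the shell session can be mirrored in plain Python. This is only an illustrative sketch: the sample documents are made up, the function name is hypothetical, and it assumes `author` may be stored as a list (which the `$unwind` stage suggests).

```python
from collections import Counter

# Hypothetical sample documents; real ones would come from the works collection.
sample_works = [
    {"title": "Work A", "author": ["alice"]},
    {"title": "Work B", "author": ["alice", "bob"]},
    {"title": "Work C", "author": ["bob"]},
    {"title": "Work D", "author": ["alice"]},
]


def count_works_per_author(works):
    """Mirror $unwind + $group + $sort: count one entry per (work, author) pair."""
    counts = Counter()
    for work in works:
        authors = work.get("author", [])
        # $unwind treats a non-array value as a single-element array.
        if not isinstance(authors, list):
            authors = [authors]
        counts.update(authors)
    # most_common() orders by count descending, like { $sort : { count : -1 } }.
    return [{"_id": name, "count": n} for name, n in counts.most_common()]


result = count_works_per_author(sample_works)
```

A work with two authors contributes one count to each of them, which is why the pipeline unwinds `author` before grouping.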
# base URL it uses
root_url: "{tag}/works"
# list of searches and values to use in the root_url
- name: "Code Geass fics"
- key: "tag"
value: "Code Geass"
# the name of the page parameter it will add to the url
page_num_parameter: "page"
# the script will save .html files to the cache_folder as it scrapes
# the site. Each URL gets a unique cache file name.
cache_folder: "C:\\path\\to\\cache\\folder"
# MongoDB server connection info
connection_string: "mongodb://localhost:27017"
db: "fanfics"
collection: "works"
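To make these settings more concrete, here is a rough Python sketch of how a scraper might turn them into a page URL and a cache file name. It is not the project's actual code: the function names, the example.org base URL, and the SHA-256 naming scheme are all assumptions made for illustration.

```python
import hashlib
from urllib.parse import quote, urlencode


def build_page_url(root_url, values, page_num_parameter, page):
    """Fill the {key} placeholders in root_url and append the page parameter."""
    # Percent-encode each value as UTF-8 so spaces and non-ASCII tags are URL-safe.
    encoded = {key: quote(value, safe="") for key, value in values.items()}
    url = root_url.format(**encoded)
    return url + "?" + urlencode({page_num_parameter: page})


def cache_file_name(url):
    """Derive a stable, practically unique file name from a URL (assumed scheme)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"


# Placeholder base URL, not the real site address.
url = build_page_url(
    "https://example.org/tags/{tag}/works", {"tag": "Code Geass"}, "page", 2
)
name = cache_file_name(url)
```

Hashing the full URL, page parameter included, means each page of a search gets its own cache file while repeated runs reuse the same name.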
- Tags with non-ASCII characters aren't downloaded correctly.
- When downloading many pages, the server may respond with errors because requests are sent too quickly.
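
One common way to cope with errors caused by sending requests too quickly is to retry with increasing delays. The sketch below is a generic Python illustration, not part of this project: `fetch_with_backoff` and its parameters are hypothetical, and `fetch` stands for whatever function downloads a page and raises `IOError` on failure.

```python
import time


def fetch_with_backoff(fetch, url, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), waiting longer after each failure before retrying."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except IOError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            # Exponential backoff: 1s, 2s, 4s, ... with the default base_delay.
            sleep(base_delay * 2 ** attempt)
```

The `sleep` argument is injectable so the delay logic can be exercised without actually waiting.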
