An application for managing and executing crawl jobs and piping the results to Slack channels.
This repo is meant to be cloned or forked and then modified to fit your needs.
-
[PREREQUISITES]
-
Get the code
git clone https://github.com/vire/hlad
-
Install dependencies
npm install
-
Configure by creating a .env file in the project root and make sure it is listed in .gitignore
API_URL=https://<url-to-your-slack>/api/chat.postMessage
API_TOKEN=guid-blah-foo-bar
API_CHANEL_ID=some-channel-id
-
Run
npm start
This starts the agent, which listens for data changes in Firebase. Optionally, set DEBUG=hlad-* to get console output for debugging.
To be able to execute a CRAWL_JOB you need at least one recipe in Firebase plus a valid config for publishing the result to an endpoint.
-
Back up your existing recipes
curl -X GET -o ./tmp/$(date +%s).json https://<your-firebase-id>.firebaseio.com/recipes.json
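The three variables from the .env file map naturally onto a Slack chat.postMessage call. The following is a minimal sketch, assuming the values have already been loaded into an object such as process.env and that token, channel and text are sent as form fields. buildSlackRequest is a hypothetical helper for illustration, not a function from this repo.

```javascript
// Hypothetical helper: turns the .env values into a request description
// for Slack's chat.postMessage method. How hlad itself builds the request
// is not shown in this README, so treat this as an illustration only.
function buildSlackRequest(env, text) {
  const required = ["API_URL", "API_TOKEN", "API_CHANEL_ID"];
  const missing = required.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error("Missing .env variables: " + missing.join(", "));
  }
  // Assumption: credentials and message are sent as form-encoded fields.
  const body = new URLSearchParams({
    token: env.API_TOKEN,
    channel: env.API_CHANEL_ID,
    text: text,
  });
  return {
    url: env.API_URL,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/x-www-form-urlencoded" },
      body: body.toString(),
    },
  };
}
```

With Node 18 or later, the returned url and options could be passed straight to the built-in fetch. Failing early on a missing variable keeps a misconfigured agent from silently dropping results.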
Recipe: the minimal abstraction and data structure required to define a remote resource (an HTML page containing a lunch menu) that can be crawled, transformed and published.
- There are 2 ways to define recipes
- the hlad-ui webapp (with editor + live test/debugger)
- define by hand and POST to Firebase via its REST API
- YAML example of a recipe structure defined via the hlad-ui webapp (without the name and URL props)
- 2 root sections: main and soup
- each can have 0-n locator properties containing a CSS selector string
- [NOTICE] locators must use quotes in YAML
main:
- locator: '#lunch1 > div:nth-child(3)'
- locator: '#lunch1 > div:nth-child(4)'
- locator: '#lunch1 > div:nth-child(5)'
soup:
- locator: '#lunch1 > div:nth-child(1)'
- locator: '#lunch1 > div:nth-child(2)'
- JSON example of a complete recipe
{
  "URL": "http://example.com",
  "name": "Foo Bar",
  "structure": {
    "main": [
      {
        "locator": "#lunch1 > div:nth-child(3)"
      },
      {
        "locator": "#lunch1 > div:nth-child(4)"
      },
      {
        "locator": "#lunch1 > div:nth-child(5)"
      }
    ],
    "soup": [
      {
        "locator": "#lunch1 > div:nth-child(1)"
      },
      {
        "locator": "#lunch1 > div:nth-child(2)"
      }
    ]
  }
}
- You could POST the above example to Firebase like this:
curl -X POST -d @example-recipe.json https://<your-firebase-id>.firebaseio.com/recipes.json
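Before POSTing a hand-written recipe, it can help to check that it has the shape described above. The following is a minimal sketch of such a check. validateRecipe is a hypothetical helper, and the rules it enforces (non-empty URL and name, main and soup as lists of locator strings) are an assumption drawn from the examples, not a schema published by this repo.

```javascript
// Hypothetical validator for the recipe shape shown in the examples:
// { URL, name, structure: { main: [{ locator }], soup: [{ locator }] } }.
// Returns a list of problems; an empty list means the recipe looks usable.
function validateRecipe(recipe) {
  if (recipe === null || typeof recipe !== "object") {
    return ["recipe must be an object"];
  }
  const errors = [];
  for (const key of ["URL", "name"]) {
    if (typeof recipe[key] !== "string" || recipe[key].trim() === "") {
      errors.push(key + " must be a non-empty string");
    }
  }
  const structure = recipe.structure || {};
  for (const section of ["main", "soup"]) {
    // A section may hold 0-n locators, so a missing section counts as empty.
    const entries = structure[section] === undefined ? [] : structure[section];
    if (!Array.isArray(entries)) {
      errors.push(section + " must be a list");
      continue;
    }
    entries.forEach((entry, index) => {
      const ok = entry && typeof entry.locator === "string" && entry.locator.trim() !== "";
      if (!ok) {
        errors.push(section + "[" + index + "].locator must be a non-empty CSS selector string");
      }
    });
  }
  return errors;
}
```

Returning a list of messages rather than throwing on the first problem lets you fix a recipe in one pass.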
Firebase: remote data store that holds recipes, jobs (in queues) and results.
The goal was the simplest, most lightweight solution for storing data. Firebase handles this use case very well and offers real-time updates out of the box.
Agent: runs locally, observes Firebase for changes and reacts to them.
Executors: stream pipelines that handle a job end to end: when an event arrives, they execute pre-defined steps and publish the result to the desired endpoint (Firebase, Slack). There are 2 basic types of executors:
TEST_JOB
real-time test for newly created recipes
CRAWL_JOB
crawls all recipes in the configuration (defined in Firebase) and tries to publish the result to Slack.
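The steps of a CRAWL_JOB can be sketched roughly as fetch, extract, format and publish. The example below uses plain async functions instead of streams, so it illustrates the flow rather than the repo's actual pipeline. runCrawlJob and the injected fetchPage, select and publish functions are hypothetical names, and the message layout (name, then soups, then mains) is an assumption.

```javascript
// Sketch of a CRAWL_JOB executor as a chain of async steps.
// fetchPage, select and publish are injected, hypothetical dependencies:
//   fetchPage(url)        -> Promise resolving to the page HTML
//   select(html, locator) -> text matched by the CSS selector
//   publish(message)      -> Promise resolving once the result is delivered
async function runCrawlJob(recipes, { fetchPage, select, publish }) {
  const results = [];
  for (const recipe of recipes) {
    try {
      const html = await fetchPage(recipe.URL);
      const structure = recipe.structure || {};
      // Apply every locator of a section to the fetched page.
      const pick = (entries) => (entries || []).map((entry) => select(html, entry.locator));
      const soup = pick(structure.soup);
      const main = pick(structure.main);
      const message = [recipe.name, ...soup, ...main].join("\n");
      await publish(message);
      results.push({ name: recipe.name, ok: true });
    } catch (error) {
      // One failing recipe should not stop the remaining ones.
      results.push({ name: recipe.name, ok: false, error: error.message });
    }
  }
  return results;
}
```

Injecting the three dependencies keeps the flow testable without network access: a TEST_JOB could reuse the same steps with a publish function that writes back to Firebase instead of Slack.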
Triggers: events that trigger jobs
-
planned/automated triggers like cron, heroku scheduler
-
manual via the hlad-ui webapp or by calling the Firebase REST API directly
curl -X PUT -d '{ "execute" : true }' \
https://<your-firebase-id>.firebaseio.com/crawl_jobs.json
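For an automated trigger, a scheduler could run a small script that sends the same request as the curl command. The following is a minimal sketch: buildCrawlTrigger is a hypothetical helper, and the Firebase ID passed to it is a placeholder you would replace with your own.

```javascript
// Hypothetical helper mirroring the curl command:
// a PUT of { "execute": true } to the crawl_jobs node.
function buildCrawlTrigger(firebaseId) {
  if (typeof firebaseId !== "string" || firebaseId === "") {
    throw new Error("firebaseId must be a non-empty string");
  }
  return {
    url: "https://" + firebaseId + ".firebaseio.com/crawl_jobs.json",
    options: {
      method: "PUT",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ execute: true }),
    },
  };
}
```

With Node 18 or later, a cron-driven script could call the built-in fetch with the returned url and options.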
- TODO
- [REQUIREMENTS] docker (tested v1.12), .env file containing the proper variables (endpoint, token, channelID)
-
build the image
docker build -t hlad .
<-- don't forget the dot!
-
run a container from image
docker run --rm -it --env DEBUG="hlad-*" --env-file ./.env hlad npm start