Redis-based components for Scrapy
=================================
This is an initial work on Scrapy-Redis integration and is not production-tested.
Use it at your own risk!

Features:

* Distributed crawling/scraping
* Distributed post-processing

Requirements:

* Scrapy 0.17
* redis-py (tested on 2.4.9)
* redis server (tested on 2.2-2.4)

Available Scrapy components:

* Scheduler
* Duplication Filter
* Item Pipeline

Usage
-----

In your settings.py:

    # enable the redis-backed scheduler, which stores the requests queue in redis
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"

    # don't clean up the redis queues, allowing crawls to be paused and resumed
    SCHEDULER_PERSIST = True

    # store scraped items in redis for post-processing
    ITEM_PIPELINES = [
        'scrapy_redis.pipelines.RedisPipeline',
    ]
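
The RedisPipeline leaves scraped items in redis, where separate worker
processes can pick them up. As a rough sketch of such a worker (in the spirit
of the example project's process_items.py, not a copy of it), the snippet
below assumes items are JSON-serialized onto a redis list named
"<spidername>:items" and that redis listens on localhost:6379; adjust the key,
connection details and item field names to your setup.

    import json
    import redis

    def process_items(key="dmoz:items", host="localhost", port=6379):
        server = redis.Redis(host, port)
        while True:
            # blpop blocks until the pipeline pushes another item
            _, data = server.blpop(key)
            item = json.loads(data)
            # 'name' and 'url' are assumed field names from the example spider
            print("Processing: %s (%s)" % (item.get("name"), item.get("url")))
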
Running the example project
---------------------------
You can test the functionality by following these steps:

1. Set up the scrapy_redis package in your PYTHONPATH

2. Run the crawler for the first time, then stop it

    $ cd example-project
    $ scrapy crawl dmoz
    ... [dmoz] ...
    ^C

3. Run the crawler again to resume the stopped crawl

    $ scrapy crawl dmoz
    ... [dmoz] DEBUG: Resuming crawl (9019 requests scheduled)

4. Start one or more additional scrapy crawlers

    $ scrapy crawl dmoz
    ... [dmoz] DEBUG: Resuming crawl (8712 requests scheduled)

5. Start one or more post-processing workers

    $ python process_items.py
    Processing: Kilani Giftware (http://www.dmoz.org/Computers/Shopping/Gifts/)
    Processing: NinjaGizmos.com (http://www.dmoz.org/Computers/Shopping/Gifts/)
    ...

That's it.