
Commit 5ab92b3

V1 of notebooks for Ch2,3,4,7,8,11 in Ch2 Snorkel still in progress
1 parent 3e76782 commit 5ab92b3


44 files changed (+55298, -44690 lines)

Diff for: Ch11/AutoML_Demo.ipynb

+2,032
Large diffs are not rendered by default.

Diff for: Ch11/README.md

+1
@@ -0,0 +1 @@
#### README for Ch11

Diff for: Ch2/Extracting_text_from_images_tesseract.ipynb

+229
Large diffs are not rendered by default.

Diff for: Ch2/Tokenization_Stemming_lemmatization_stopword_postagging.ipynb

+566
Large diffs are not rendered by default.

Diff for: Ch2/WebScraping_using_BeautifulSoup.ipynb

+5,188
Large diffs are not rendered by default.

Diff for: Ch2/WebScraping_using_scrapy/.README.md.un~

3.43 KB
Binary file not shown.

Diff for: Ch2/WebScraping_using_scrapy/README.md

+266
@@ -0,0 +1,266 @@
# Building a book scraper using Scrapy

Scrapy is a Python framework for scraping and crawling websites. This tutorial demonstrates the use
of Scrapy to quickly mine a large amount of data from a demo website,
[books.toscrape.com](http://books.toscrape.com/).

## Installation

```bash
pip install Scrapy
```

Refer to the [documentation](https://docs.scrapy.org/en/latest/intro/install.html) for advanced
installation instructions.

## Tutorial

### Set-Up

First, we need to start a new Scrapy project.

```
scrapy startproject tutorial
```

Scrapy auto-generates a few boilerplate files. The directory structure should look like the
following.

```
.
└── tutorial
    ├── scrapy.cfg
    └── tutorial
        ├── __init__.py
        ├── items.py
        ├── middlewares.py
        ├── pipelines.py
        ├── settings.py
        └── spiders
            └── __init__.py
```

Now enter the base project directory with `cd tutorial`.

### A Spider

We will now create a new spider inside `tutorial/spiders`. A spider is nothing but a Python class
that inherits from `scrapy.Spider`. Each spider has a unique `name`, which is used when running the
spider, and a `start_urls` list of URLs that the spider starts crawling as soon as it is run; we
will set `start_urls` to `["http://books.toscrape.com/"]`, the website we want to scrape. Create
a new file at `tutorial/spiders/books_spider.py`, which will hold our spider, with the following
content.

```python
import scrapy

class BookSpider(scrapy.Spider):
    name = "books"

    start_urls = ["http://books.toscrape.com/"]
```

When a spider is run, it starts by executing its `parse` method on all of the `start_urls`. The
`parse` method has a parameter named `response` which stores all information about the fetched
page; `response.url` contains the URL of the currently fetched page. Let's print the URL in the
`parse` method and see what it does. (The following code should be inside the `BookSpider` class
defined earlier.)

```python
def parse(self, response):
    print(response.url)
```

Let's now run the spider. Go to the base project directory (`tutorial`) and run the command below
(note that `books` is the name we gave the spider inside the `BookSpider` class).

```bash
scrapy crawl books
```

You will see lots of log output; try to find `http://books.toscrape.com/` in one of the lines.

### CSS and XPath selectors

Now, let's focus on actually extracting content out of the website. If you have ever used CSS, you
will know about selectors, which are used to apply styles to specific elements. We can use the same
type of selectors to extract data from specific elements. This is done using
`response.css("your css selector")` and then calling the `get()` or `getall()` method of the
returned object to actually get the element. You can use `::text` in the selector to get the text
contained in the required element. To get a specific attribute of the element, say `href`, use
`attrib["href"]` on the object returned by `response.css`.
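
For instance, the following lines can be run in the Scrapy shell (introduced below) to get a feel
for CSS selectors. This is a minimal sketch; the exact selectors assume the markup of the demo site
and may need adjusting if the page changes.

```python
# Text of every <h3><a> element (the shortened book titles on the front page).
titles = response.css("article.product_pod h3 a::text").getall()

# Just the first price, as text.
first_price = response.css("p.price_color::text").get()

# The href attribute of the first book link.
first_link = response.css("article.product_pod h3 a").attrib["href"]
```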

A more powerful type of selector is the XPath selector. [Here](https://devhints.io/xpath) is
a cheat-sheet for XPath and [here](https://www.guru99.com/xpath-selenium.html) is a more complete
tutorial. A summary of the XPath selectors used in this tutorial (see the short example after this
list):

- `//element`: Get all elements of type `element` anywhere in the DOM.
- `element[@class='classname']`: Get only elements of type `element` that have class `classname`.
  Note that the class is matched in its entirety: if an element has class `class1 class2`,
  `@class='class1'` will not match it.
- `element[contains(@class, 'classname')]`: Get elements of type `element` whose `class` contains
  `classname` anywhere.
- `element/text()`: Get the text contained inside elements.
- `element/@href`: Get the `href` attribute (can be replaced with any attribute).
- `element1/element2`: Get every `element2` that is a direct child of `element1`.
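
As a quick illustration, here is how a couple of these patterns translate into `response.xpath`
calls. This is a minimal sketch to try in the Scrapy shell; the element and class names are only
placeholders for whatever you find with *Inspect Element*.

```python
# Text of every <a> anywhere under a <div class="page_inner">.
link_texts = response.xpath("//div[@class='page_inner']//a/text()").getall()

# The class attribute of every <p> whose class contains "price".
price_classes = response.xpath("//p[contains(@class, 'price')]/@class").getall()
```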

### Extracting Data

Let's take a look at the website we want to scrape. It has a sidebar which shows the list of all
categories, and on the right we can see all the books.

The approach used in this tutorial is to go to each of the categories and scrape all the books
inside it.

A great tool for testing and experimenting with selectors in real time is the **Scrapy shell**.
Launch the Scrapy shell for the website of interest using the following command.

```bash
scrapy shell "http://books.toscrape.com/"
```

(Pro tip: have IPython installed to get a better Scrapy shell experience.)

From the shell, you can view the exact response it received using `view(response)`. Make sure you
do this for any website you want to scrape before proceeding: the page may look different to your
browser and to Scrapy, since Scrapy does not execute JavaScript by default.

Now, let's see how to select the list of categories. We will use the ever-useful *Inspect Element*
tool. It can be seen in the source that the category list is in a `div` of class `side_categories`.
This div contains an unordered list whose only element contains another unordered list! The second
`ul` is the one we want, since each of its elements contains an anchor tag with the URL to a
category. Thus, the XPaths are as follows (you can experiment with XPaths in the Scrapy shell till
you get the output you need).

```python
cat_names = response.xpath("//div[@class='side_categories']/ul/li/ul/li/a/text()").getall()
cat_urls = response.xpath("//div[@class='side_categories']/ul/li/ul/li/a/@href").getall()
```

`cat_names` stores the list of all categories and `cat_urls` the corresponding URLs. We can iterate
over these using `zip`. There is one issue, though: the URLs are relative to the current URL; to fix
this, we use `response.urljoin` to get the absolute URL. Now that we have the URLs, we need a way to
scrape them separately. The `parse` function is specifically made to get the list of categories;
hence, we need a separate function which will parse all the books in a category - we call this
function `parse_category`. To tell Scrapy to parse a particular URL, we need to create a
`scrapy.Request` object and return it. Since we have a list of URLs to return, we use `yield`
instead of `return` so that multiple Requests can be handed to Scrapy. A `scrapy.Request` object
requires two parameters - the URL and the function to pass the response to once it is received,
called the `callback`. We use another parameter, `cb_kwargs`, to pass additional arguments to the
callback function besides the response. Here is the `parse` function after adding all the features
mentioned above.

```python
def parse(self, response):
    num_cats_to_parse = 5
    cat_names = response.xpath("//div[@class='side_categories']/ul/li/ul/li/a/text()").getall()
    cat_urls = response.xpath("//div[@class='side_categories']/ul/li/ul/li/a/@href").getall()
    for _, name, url in zip(range(num_cats_to_parse), cat_names, cat_urls):
        name = name.strip()
        url = response.urljoin(url)
        yield scrapy.Request(url,
                             callback=self.parse_category,
                             cb_kwargs=dict(cat_name=name))
```

Since there are quite a few categories, we limit the number of categories crawled.…

Now, let's scrape all the books in a category, which will be done inside the `parse_category`
function. You can use the Scrapy shell again on one of the category URLs to build selectors
interactively. Here, we can see that each book is inside an `article` of class `product_pod`, which
has an `h3` containing an anchor tag linking to the book's URL. Thus, the line to get all the book
URLs is as follows.

```python
book_urls = response.xpath("//article[@class='product_pod']/h3/a/@href").getall()
```

Now we can loop through the book URLs and `yield` a `scrapy.Request` for each URL, with the callback
set to a new function `parse_book` which we will define later. You may have noticed that some
categories have too many books to be displayed on one page, so they are paginated, i.e., split
across pages with a `next` button near the bottom of each page. After `yield`ing all the requests
for books, we find the `next` button and get its URL, which is then used to `yield` another request
with the callback again being `parse_category` (which is nothing but recursion). The entire code for
this function is as follows.

```python
def parse_category(self, response, cat_name):
    book_urls = response.xpath("//article[@class='product_pod']/h3/a/@href").getall()

    for book_url in book_urls:
        book_url = response.urljoin(book_url)
        yield scrapy.Request(book_url, callback=self.parse_book,
                             cb_kwargs=dict(cat_name=cat_name))

    next_button = response.css(".next a")
    if next_button:
        next_url = next_button.attrib["href"]
        next_url = response.urljoin(next_url)
        yield scrapy.Request(next_url,
                             callback=self.parse_category,
                             cb_kwargs=dict(cat_name=cat_name))
```

Finally, let's write the `parse_book` function to scrape each book. For the purposes of this
tutorial, we will only scrape the title, price, stock information and star rating. You can use the
Scrapy shell to build selectors for them. Getting the title and price is trivial. The `instock`
selector gives a list of strings consisting mostly of spaces and newlines, with the actual
information somewhere in the middle, so we use `strip` and `join` to get the required text.
Obtaining the star rating is tricky since that information is contained only in the class name, so
we use XPath to get the class attribute and then select the second word in it. Finally, we return
the required information in a dictionary; the reason for this is that Scrapy treats these
dictionaries as `items`. This allows you to export the data directly to JSON using the
`-o output.json` option while running the spider. The main reason to return `items`, though, is to
use Scrapy pipelines to programmatically process the data. For example, you could write a pipeline
to automatically insert the data into a database or write it to a file.
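
Putting the above together, a minimal sketch of `parse_book` might look like the following. The
exact XPaths are assumptions based on the demo site's markup and should be verified in the Scrapy
shell; the dictionary keys match the field names used by the CSV pipeline in the next section.

```python
def parse_book(self, response, cat_name):
    # Book title and price (assumed locations on the demo site's book page).
    title = response.xpath("//div[contains(@class, 'product_main')]/h1/text()").get()
    price = response.xpath("//p[@class='price_color']/text()").get()

    # The availability text is padded with whitespace and newlines,
    # so strip each piece and join the non-empty ones.
    instock = response.xpath("//p[contains(@class, 'instock')]/text()").getall()
    stock = " ".join(s.strip() for s in instock if s.strip())

    # The rating is encoded only in the class name, e.g. "star-rating Three";
    # the second word of the class attribute is the rating itself.
    rating_class = response.xpath("//p[contains(@class, 'star-rating')]/@class").get()
    rating = rating_class.split()[1] if rating_class else None

    yield {
        "title": title,
        "price": price,
        "stock": stock,
        "rating": rating,
        "category": cat_name,
    }
```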

### Saving Data using a Pipeline

After a spider yields an item, the item goes to the item pipeline, where it is processed
sequentially by all the pipeline objects. Each object in the pipeline is just a Python class that
implements a few special methods, including `process_item`, `open_spider` (optional) and
`close_spider` (optional).

Let us now create a pipeline to save the data to a CSV file. We will be using Python's built-in
`csv` module, specifically the `DictWriter` object from that module. Refer to the
[documentation](https://docs.python.org/3/library/csv.html#csv.DictWriter) for more details.

Open `tutorial/pipelines.py` (it will have been auto-generated), delete the existing example
pipeline and replace it with the following.

```python
import csv

class BookCsvPipeline():
    def open_spider(self, spider):
        self.file = open("output.csv", "at")
        fieldnames = ["title",
                      "price",
                      "stock",
                      "rating",
                      "category"]
        self.writer = csv.DictWriter(self.file, fieldnames=fieldnames)
        # Only write the header if the file is empty (it is opened in append mode).
        if self.file.tell() == 0:
            self.writer.writeheader()

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.writer.writerow(item)
        return item
```

Note that the pipeline's `process_item` needs to return the `item` it processes so that the next
pipeline in the chain can process it.

Before running the spider, the pipeline that we just created needs to be enabled. Open
`tutorial/settings.py` (go through this file once, it contains some very useful options),
look for `ITEM_PIPELINES`, uncomment it and replace it with the following.

```python
ITEM_PIPELINES = {
    'tutorial.pipelines.BookCsvPipeline': 300,
}
```

The `300` indicates its priority in the pipeline, with `0` being the highest priority and `1000`
being the lowest.

Now we are ready to run the spider.

```bash
scrapy crawl books -o output.json
```

This will take a few seconds and produce lots of output. After it has finished executing, you can
find `output.json` and `output.csv` in the folder you ran it from. The CSV should contain 163 rows
(excluding the header) if `num_cats_to_parse` was set to 5.

Now you have a Scrapy crawler which uses most of the useful functionality provided by Scrapy.

Diff for: Ch2/WebScraping_using_scrapy/tutorial/.hmm.py.un~

1.77 KB
Binary file not shown.
26.2 KB
Binary file not shown.
