# Building a book scraper using Scrapy

Scrapy is a Python framework for scraping and crawling websites. This tutorial demonstrates how to
use Scrapy to quickly mine a large amount of data from the demo website
[books.toscrape.com](http://books.toscrape.com/).

## Installation

```bash
pip install Scrapy
```

Refer to the [documentation](https://docs.scrapy.org/en/latest/intro/install.html) for advanced
installation instructions.

## Tutorial

### Set-Up

First, we need to start a new Scrapy project.

```bash
scrapy startproject tutorial
```

Scrapy auto-generates a few boilerplate files. The directory structure should look like the
following.

```
.
└── tutorial
    ├── scrapy.cfg
    └── tutorial
        ├── __init__.py
        ├── items.py
        ├── middlewares.py
        ├── pipelines.py
        ├── settings.py
        └── spiders
            └── __init__.py
```

Now enter the base project directory with `cd tutorial`.

### A Spider

We will now create a new spider inside `tutorial/spiders`. A spider is simply a Python class that
inherits from `scrapy.Spider`. Each spider has a unique `name`, which is used when running the
spider, and a `start_urls` list containing the URLs the spider starts crawling as soon as it is run;
we will set `start_urls` to `["http://books.toscrape.com/"]`, the website we want to scrape. Create
a new file at `tutorial/spiders/books_spider.py` with the following content.

```python
import scrapy

class BookSpider(scrapy.Spider):
    name = "books"

    start_urls = ["http://books.toscrape.com/"]
```

When a spider is run, it fetches each URL in `start_urls` and calls its `parse` method on the
result. The `parse` method takes a parameter named `response`, which stores all information about
the fetched page; `response.url` contains the URL of the currently fetched page. Let's print the URL
in the `parse` method and see what it does. (The following code should go inside the `BookSpider`
class defined earlier.)

```python
    def parse(self, response):
        print(response.url)
```

Let's now run the spider. Go to the base project directory (`tutorial`) and run the following.
(Note that `books` is the name we gave the spider inside the `BookSpider` class.)

```bash
scrapy crawl books
```

You will see lots of text output; try to find `http://books.toscrape.com/` in one of the lines.

### CSS and XPath selectors

Now, let's focus on actually extracting content from the website. If you have ever used CSS, you
will know about selectors, which are used to apply styles to specific elements. We can use the same
type of selectors to extract data from specific elements. This is done by calling
`response.css("css selector here")` and then using the `getall()` or `get()` method of the returned
selector to actually get the elements. You can add `::text` to the selector to get the text
contained in the matched element. To get a specific attribute of the element, say `href`, use
`attrib["href"]` on the object returned by `response.css`.
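
For instance, a `parse` method could use CSS selectors like the following. This is only an
illustrative sketch; the `h3 a` selector is an assumption about the front page's markup, not
something the rest of the tutorial depends on.

```python
    def parse(self, response):
        # Text of every matching element (a list of strings)
        titles = response.css("h3 a::text").getall()
        # Text of the first matching element only
        first_title = response.css("h3 a::text").get()
        # The href attribute of the first matching element
        first_url = response.css("h3 a").attrib["href"]
        print(titles, first_title, first_url)
```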

A more powerful type of selector is the XPath selector. [Here](https://devhints.io/xpath) is
a cheat-sheet for XPath and [here](https://www.guru99.com/xpath-selenium.html) is a more complete
tutorial. A summary of the XPath selectors used in this tutorial:
 - `//element`: Get all elements of type `element` anywhere in the DOM.
 - `element[@class='classname']`: Get only elements of type `element` and of class `classname`.
   Note that the class is matched entirely; if an element is of class `class1 class2`,
   `@class='class1'` will not match it.
 - `element[contains(@class, 'classname')]`: Get elements of type `element` whose `class` contains
   `classname` anywhere.
 - `element/text()`: Get the text contained inside elements.
 - `element/@href`: Get the `href` attribute (can be replaced with any attribute).
 - `element1/element2`: Get `element2` elements that are direct children of `element1`.
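
To make these patterns concrete, here is a short sketch of how they map onto `response.xpath`
calls. The element and class names used below (`a`, `p`, `price_color`, `article`, `product_pod`,
`h3`) are only meant as examples; verify any selector you use against the actual page.

```python
# All <a> elements anywhere in the DOM (returns a SelectorList)
links = response.xpath("//a")
# Text of <p> elements whose class is exactly 'price_color'
prices = response.xpath("//p[@class='price_color']/text()").getall()
# <article> elements whose class merely contains 'product_pod'
pods = response.xpath("//article[contains(@class, 'product_pod')]")
# href attributes of <a> elements that are direct children of an <h3>
urls = response.xpath("//h3/a/@href").getall()
```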

### Extracting Data

Let's take a look at the website we want to scrape. It has a sidebar listing all the categories, and
on the right we can see all the books.

The approach used in this tutorial is to go into each category and scrape all the books inside it.

A great tool to test and experiment with selectors in real time is the **Scrapy shell**. Launch the
Scrapy shell for the website of interest using the following command.

```bash
scrapy shell "http://books.toscrape.com/"
```

(Pro tip: install IPython to get a better Scrapy shell experience.)

From the shell, you can view the exact response it received using `view(response)`. Make sure you do
this for any website you want to scrape before proceeding: the page may look different to your
browser and to Scrapy, since Scrapy does not execute JavaScript by default.
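
A quick shell session might look something like this; the `title` selector is just an example to
confirm the page was fetched, not something used later in the tutorial.

```python
# Typed at the scrapy shell prompt after the page has been fetched
view(response)                           # opens the downloaded HTML in your browser
response.css("title::text").get()        # text of the page's <title> element
response.xpath("//title/text()").get()   # the same thing, using an XPath selector
```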

Now, let's see how to select the list of categories. We will use the ever-useful *Inspect Element*
tool. From the source, we can see that the category list is in a `div` of class `side_categories`.
This div contains an unordered list whose only element contains another unordered list! The second
`ul` is the one we want, since each of its elements contains an anchor tag with the URL of a
category. Thus, the XPaths are as follows (you can experiment with XPaths in the scrapy shell until
you get the output you need).

```python
    cat_names = response.xpath("//div[@class='side_categories']/ul/li/ul/li/a/text()").getall()
    cat_urls = response.xpath("//div[@class='side_categories']/ul/li/ul/li/a/@href").getall()
```
`cat_names` stores the list of all categories and `cat_urls` the corresponding URLs. We can iterate
over these using `zip`. There is one issue though: the URLs are relative to the current URL; to fix
this, we use `response.urljoin` to get the absolute URL. Now that we have the URLs, we need a way to
scrape them separately. The `parse` function is specifically made to get the list of categories, so
we need a separate function that will parse all books in a category - we call this function
`parse_category`. To tell scrapy to parse a particular URL, we need to create a `scrapy.Request`
object and return it. Since we have a list of URLs to return, we use `yield` instead of `return` to
hand multiple Requests to scrapy. A `scrapy.Request` object requires two parameters - the URL and
a `callback`, the function that will handle the response once it is received. We use another
parameter, `cb_kwargs`, to pass additional arguments to the callback function besides the response.
Here is the `parse` function after adding all of the above.

```python
    def parse(self, response):
        num_cats_to_parse = 5
        cat_names = response.xpath("//div[@class='side_categories']/ul/li/ul/li/a/text()").getall()
        cat_urls = response.xpath("//div[@class='side_categories']/ul/li/ul/li/a/@href").getall()
        for _, name, url in zip(range(num_cats_to_parse), cat_names, cat_urls):
            name = name.strip()
            url = response.urljoin(url)
            yield scrapy.Request(url,
                                 callback=self.parse_category,
                                 cb_kwargs=dict(cat_name=name))
```
Since there are quite a few categories, we limit the number of categories crawled.

Now, let's scrape all the books in a category, which will be done inside the `parse_category`
function. You can use the scrapy shell again on one of the category URLs to build selectors
interactively. Here, we can see that each book is inside an `article` of class `product_pod`, which
has an `h3` containing an anchor tag linking to the book's URL. Thus, the line to get all book URLs
is as follows.

```python
    book_urls = response.xpath("//article[@class='product_pod']/h3/a/@href").getall()
```
Now we can loop through the book URLs and `yield` a `scrapy.Request` for each URL with the callback
set to a new function, `parse_book`, which we will define later. You may have noticed that some
categories have too many books to be displayed on one page, so they are paginated, i.e., split into
pages with a `next` button near the bottom of each page. After `yield`ing all the requests for
books, we find the `next` button and get its URL, which is then used to `yield` another request with
the callback again being `parse_category` (which is effectively recursion). The entire code for this
function is as follows.

```python
    def parse_category(self, response, cat_name):
        book_urls = response.xpath("//article[@class='product_pod']/h3/a/@href").getall()

        for book_url in book_urls:
            book_url = response.urljoin(book_url)
            yield scrapy.Request(book_url,
                                 callback=self.parse_book,
                                 cb_kwargs=dict(cat_name=cat_name))

        next_button = response.css(".next a")
        if next_button:
            next_url = next_button.attrib["href"]
            next_url = response.urljoin(next_url)
            yield scrapy.Request(next_url,
                                 callback=self.parse_category,
                                 cb_kwargs=dict(cat_name=cat_name))
```

Finally, let's write the `parse_book` function to scrape each book. For the purposes of this
tutorial, we will only scrape the title, price, stock information and star rating. You can use the
scrapy shell to build selectors for them. Getting the title and price is trivial. The in-stock
selector gives a list of strings which mostly consist of spaces and newlines, with the actual
information contained in the middle, so we use `strip` and `join` to extract it. Obtaining the star
rating is tricky since that information is only contained in the class name, so we use XPath to get
the class name and then select the second word in it. Finally, we return the required information in
a dictionary; the reason for this is that scrapy treats these dictionaries as `items`. This allows
you to directly export the data to JSON using the `-o output.json` option when running the spider.
The main reason to return `items`, however, is to use scrapy pipelines to process the data
programmatically. For example, you could write a pipeline to automatically insert the data into
a database or write it to a file.
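
The tutorial text describes `parse_book` without reproducing it, so here is a minimal sketch of
what it could look like based on the description above. The selectors (`div.product_main h1`,
`p.price_color`, `p.instock.availability` and the `star-rating` class) are assumptions about the
book page's markup; verify them in the scrapy shell before relying on them.

```python
    def parse_book(self, response, cat_name):
        # Title and price are plain text nodes (selectors assumed, check them in the shell)
        title = response.css("div.product_main h1::text").get()
        price = response.css("p.price_color::text").get()

        # The availability text is split across several whitespace-heavy strings,
        # so strip each piece and join the non-empty ones.
        instock = response.css("p.instock.availability::text").getall()
        stock = " ".join(s.strip() for s in instock if s.strip())

        # The star rating only appears in the class name, e.g. "star-rating Three",
        # so grab the class attribute with XPath and take its second word.
        rating_class = response.xpath("//p[contains(@class, 'star-rating')]/@class").get()
        rating = rating_class.split()[1] if rating_class else None

        # Yielding a dict makes scrapy treat it as an item.
        yield {
            "title": title,
            "price": price,
            "stock": stock,
            "rating": rating,
            "category": cat_name,
        }
```

The dictionary keys here are chosen to match the `fieldnames` used by the CSV pipeline in the next
section.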

### Saving Data using a Pipeline

After a spider yields an item, it goes to the item pipeline, where it is processed sequentially by
each of the pipeline components. Each component is just a Python class that implements a few special
methods, including `process_item`, `open_spider` (optional) and `close_spider` (optional).

Let us now create a pipeline to save the data to a CSV file. We will be using Python's built-in
`csv` module, specifically its `DictWriter` object. Refer to the
[documentation](https://docs.python.org/3/library/csv.html#csv.DictWriter) for more details.

Open `tutorial/pipelines.py` (it was auto-generated), delete the existing example pipeline and
replace it with the following.

```python
import csv

class BookCsvPipeline:
    def open_spider(self, spider):
        # Append so that repeated runs don't overwrite earlier output.
        self.file = open("output.csv", "at")
        fieldnames = ["title",
                      "price",
                      "stock",
                      "rating",
                      "category"]
        self.writer = csv.DictWriter(self.file, fieldnames=fieldnames)
        # Only write the header if the file is new (i.e. empty).
        if self.file.tell() == 0:
            self.writer.writeheader()

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.writer.writerow(item)
        return item
```
Note that the pipeline's `process_item` needs to return the `item` it processes so that the next
component in the pipeline can process it.

Before running the spider, the pipeline that we just created needs to be enabled. Open
`tutorial/settings.py` (go through this file once, it contains some very useful options),
look for `ITEM_PIPELINES`, uncomment it and replace it with the following.

```python
ITEM_PIPELINES = {
    'tutorial.pipelines.BookCsvPipeline': 300,
}
```

The `300` indicates the pipeline's priority, with `0` being the highest priority and `1000` being
the lowest.

Now we are ready to run the spider.

```bash
scrapy crawl books -o output.json
```

This will take a few seconds and produce lots of output. After it has finished executing, you can
find `output.json` and `output.csv` in the folder from which you ran it. The CSV should contain 163
rows (excluding the header) if `num_cats_to_parse` was set to 5.

Now you have a Scrapy crawler that uses most of the useful features Scrapy provides.