Commit 6ab9e8a (parent 86653e8)

add trustpilot scraper

File tree: 10 files changed (+8161, -4 lines)

.github/workflows/trustpilot.yaml (new file, +49 lines)

```yaml
name: Trustpilot.com Test
on:
  workflow_dispatch:
  schedule:
    # Run every Monday at 12:00 UTC.
    - cron: '0 12 * * MON'

env:
  PROJECT_DIR: trustpilot-scraper

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      # Let the remaining matrix jobs finish even if one test fails.
      fail-fast: false
      matrix:
        test: [test_company_scraping, test_search_scraping, test_review_scraping]

    steps:
      - uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: "3.10"

      - name: Install Poetry
        run: |
          curl -sSL https://install.python-poetry.org | python3 -

      - name: Cache Poetry virtual environment
        uses: actions/cache@v2
        id: cache
        with:
          path: ~/.cache/pypoetry/virtualenvs
          key: ${{ runner.os }}-poetry-${{ hashFiles('**/${{ env.PROJECT_DIR }}/pyproject.toml') }}
          restore-keys: |
            ${{ runner.os }}-poetry-

      - name: Install dependencies
        run: |
          cd ${{ env.PROJECT_DIR }}
          poetry install

      - name: Run test
        env:
          SCRAPFLY_KEY: ${{ secrets.SCRAPFLY_KEY }}
        run: |
          cd ${{ env.PROJECT_DIR }}
          poetry run pytest test.py -k ${{ matrix.test }}
```
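The cache step above keys the Poetry virtualenv cache on a hash of `pyproject.toml`, so a dependency change invalidates the cache. GitHub's `hashFiles()` documentation describes it as hashing each matched file and then hashing the combined per-file digests; a simplified stdlib sketch of that idea (not GitHub's exact implementation, and the `Linux-` prefix stands in for `${{ runner.os }}`):

```python
import hashlib
import tempfile
from pathlib import Path


def hash_files(*paths) -> str:
    """Simplified take on GitHub Actions' hashFiles():
    SHA-256 each file, then SHA-256 the concatenated per-file digests."""
    digests = [hashlib.sha256(Path(p).read_bytes()).digest() for p in sorted(paths)]
    return hashlib.sha256(b"".join(digests)).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    pyproject = Path(tmp) / "pyproject.toml"
    pyproject.write_text('[tool.poetry]\nname = "scrapfly-trustpilot"\n')
    # Mirrors the workflow's key: ${{ runner.os }}-poetry-${{ hashFiles(...) }}
    key = f"Linux-poetry-{hash_files(pyproject)}"

print(key)
```

Editing `pyproject.toml` changes the final hex digest, which is exactly what forces a fresh `poetry install` in CI.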

README.md (+20, -4 lines)

```diff
@@ -69,6 +69,11 @@ The following is the list of supported websites grouped by type.
 - [Zillow.com](#zillow)
 - [Zoopla.co.uk](#zoopla)
 
+### Reviews
+- [Yellowpages.com](#yellowpages)
+- [Yelp.com](#yelp)
+- [Trustpilot.com](#trustpilot)
+
 ### Social Media
 - [Instagram.com](#instagram)
 - [Threads.net](#threads)
@@ -78,10 +83,6 @@ The following is the list of supported websites grouped by type.
 - [Booking.com](#booking)
 - [Tripadvisor.com](#tripadvisor)
 
-### Other
-- [Yellowpages.com](#yellowpages)
-- [Yelp.com](#yelp)
-
 ------------
 
 ### Aliexpress
@@ -598,6 +599,21 @@ The [tripadvisor.com scraper](./tripadvisor-scraper/) can scrape the following data:
 
 For the full guide, refer to our blog article [How to Scrape TripAdvisor.com (2023 Updated)](https://scrapfly.io/blog/how-to-scrape-tripadvisor/)
 
+### Trustpilot
+The [trustpilot.com scraper](./trustpilot-scraper/) can scrape the following data:
+- Company pages for company data.
+- Review pages for company reviews.
+- Search pages for company listing data.
+
+<details class="is-code">
+<summary>View sample data</summary>
+
+- [Company pages](./trustpilot-scraper/results/companies.json)
+- [Reviews pages](./trustpilot-scraper/results/reviews.json)
+- [Search pages](./trustpilot-scraper/results/search.json)
+
+</details>
+
 ### Twitter(X)
 <!-- <p align="left">
 <img width="100" height="100" src="./.github/assets/x.svg">
```

trustpilot-scraper/README.md (new file, +48 lines)

# Trustpilot.com Scraper

This scraper uses [scrapfly.io](https://scrapfly.io/) and Python to scrape company, search, and review data from Trustpilot.com.

Full tutorial

The scraping code is located in the `trustpilot.py` file. It is fully documented and simplified for educational purposes, and the example scraper run code can be found in the `run.py` file.

This scraper scrapes:
- Trustpilot company pages for company data
- Trustpilot search pages for company listing data
- Trustpilot review pages for company reviews

For output examples see the `./results` directory.

## Fair Use Disclaimer

Note that this code is provided free of charge as is, and Scrapfly does __not__ provide free web scraping support or consultation. For any bugs, see the issue tracker.

## Setup and Use

This Trustpilot.com scraper uses __Python 3.10__ with the [scrapfly-sdk](https://pypi.org/project/scrapfly-sdk/) package, which is used to scrape and parse Trustpilot's data.

0. Ensure you have __Python 3.10__ and the [poetry Python package manager](https://python-poetry.org/docs/#installation) on your system.
1. Retrieve your Scrapfly API key from <https://scrapfly.io/dashboard> and set the `SCRAPFLY_KEY` environment variable:
   ```shell
   $ export SCRAPFLY_KEY="YOUR SCRAPFLY KEY"
   ```
2. Clone and install the Python environment:
   ```shell
   $ git clone https://github.com/scrapfly/scrapfly-scrapers.git
   $ cd scrapfly-scrapers/trustpilot-scraper
   $ poetry install
   ```
3. Run the example scrape:
   ```shell
   $ poetry run python run.py
   ```
4. Run tests:
   ```shell
   $ poetry install --with dev
   $ poetry run pytest test.py
   # or test specific scraping areas
   $ poetry run pytest test.py -k test_company_scraping
   $ poetry run pytest test.py -k test_search_scraping
   $ poetry run pytest test.py -k test_review_scraping
   ```
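Trustpilot is a Next.js application, and scrapers of such sites commonly pull the page state embedded in the `__NEXT_DATA__` script tag rather than parsing the visible HTML. A minimal stdlib-only sketch of that approach; the HTML snippet and the `businessUnit` field names are illustrative assumptions, not Trustpilot's exact payload:

```python
import json
import re

# Hypothetical, trimmed-down page: a Next.js site ships its state
# as JSON inside a <script id="__NEXT_DATA__"> tag.
HTML = """
<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"businessUnit": {"displayName": "Example Co",
 "trustScore": 4.4, "numberOfReviews": 1280}}}}
</script>
</body></html>
"""


def parse_hidden_data(html: str) -> dict:
    """Extract and decode the JSON embedded in the __NEXT_DATA__ script tag."""
    match = re.search(
        r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
        html,
        re.DOTALL,
    )
    if not match:
        raise ValueError("no __NEXT_DATA__ script tag found")
    return json.loads(match.group(1))


data = parse_hidden_data(HTML)
company = data["props"]["pageProps"]["businessUnit"]
print(company["displayName"], company["trustScore"])  # Example Co 4.4
```

The real scraper fetches pages through the Scrapfly API instead of holding HTML in a string, but the hidden-data extraction step looks much like this.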

trustpilot-scraper/pyproject.toml (new file, +33 lines)

```toml
[tool.poetry]
name = "scrapfly-trustpilot"
version = "0.1.0"
description = "demo web scraper for Trustpilot.com using Scrapfly"
authors = ["Mazen Ramadan <[email protected]>"]
license = "NPOS-3.0"
readme = "README.md"

[tool.poetry.dependencies]
python = "^3.10"
scrapfly-sdk = {extras = ["all"], version = "^0.8.5"}
loguru = "^0.7.0"

[tool.poetry.group.dev.dependencies]
black = "^23.3.0"
ruff = "^0.0.269"
cerberus = "^1.3.4"
pytest = "^7.3.1"
pytest-asyncio = "^0.21.0"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

[tool.pytest.ini_options]
python_files = "test.py"

[tool.black]
line-length = 120
target-version = ['py37', 'py38', 'py39', 'py310', 'py311']

[tool.ruff]
line-length = 120
```
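The caret constraints above (`^3.10`, `^0.8.5`, and so on) follow Poetry's documented rule: allow upgrades up to, but not including, the next bump of the leftmost non-zero version component. A small sketch of that rule as a (lower-inclusive, upper-exclusive) bound pair, written for illustration rather than taken from Poetry's source:

```python
def caret_range(spec: str) -> tuple[tuple[int, ...], tuple[int, ...]]:
    """Return (inclusive lower, exclusive upper) bounds for a caret constraint
    like Poetry's "^3.10" or "^0.8.5"."""
    parts = [int(p) for p in spec.lstrip("^").split(".")]
    lower = tuple(parts)
    # The upper bound bumps the leftmost non-zero component and zeroes the rest.
    for i, p in enumerate(parts):
        if p != 0:
            upper = tuple(parts[:i]) + (p + 1,) + (0,) * (len(parts) - i - 1)
            break
    else:  # all-zero version such as "^0.0": bump the last component
        upper = tuple(parts[:-1]) + (parts[-1] + 1,)
    return lower, upper


print(caret_range("^3.10"))   # ((3, 10), (4, 0))
print(caret_range("^0.8.5"))  # ((0, 8, 5), (0, 9, 0))
```

So `scrapfly-sdk = "^0.8.5"` accepts 0.8.x patch releases but not 0.9.0, while `python = "^3.10"` accepts any 3.x from 3.10 up but not 4.0.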
