
Commit ce2652b

ilyazub, Dimitry, Dimitry authored
Use pagination parameters from SerpApi instead of calculating on the client (#30)
* Use pagination parameters from SerpApi instead of calculating on the client

  `start` and `num` parameters are not suitable for token-based pagination. Such pagination is used on Google Maps, YouTube, Google Scholar Authors, and other search engines.

  This commit consumes URL query parameters for the next page. It stops paginating when the parameters do not change. Details: #22

  Some tests are failing because `start` and `num` parameters are not supported anymore. These tests will be fixed in the following commits.

* Add pagination tests for Bing, Baidu, and DuckDuckGo search API clients

* Fix typo in SerpApi name in documentation

* Add more pagination tests

  All of the tests follow the same pattern: limit the number of pages, iterate, and check for duplicates in the results. This is to make sure that pagination actually changes pages.

* Test pagination for Naver and HomeDepot

* Stop pagination when SerpApi backend doesn't update parameters

* Fix flake8 linting errors

  Example errors: https://github.com/serpapi/google-search-results-python/runs/6659757610?check_suite_focus=true#step:5:37

* Lint code via `make lint`

  Currently the linting script exists only in the GitHub Action: `.github/workflows/python-package.yml`. This commit wraps that script in the Makefile and invokes it in the Action.

* fix(tests): fix failing integration tests

  DuckDuckGo tests are failing because DuckDuckGo pagination doesn't take into account an offset of current results: serpapi/public-roadmap#619

  Co-authored-by: Dimitry <[email protected]>

* perf: run pytest in parallel

  Sample output:

      platform linux -- Python 3.10.9, pytest-7.2.1, pluggy-1.0.0
      rootdir: /home/ilyazub/Workspace/google-search-results-python
      plugins: parallel-0.1.1
      collected 48
      pytest-parallel: 8 workers (processes), 6 tests per worker (threads)

  The `py` dependency is used because pytest-parallel depends on it but doesn't require it 😕 kevlened/pytest-parallel#118

  Co-authored-by: Dimitry <[email protected]>

* style: don't lint vendor packages with Flake8

  Co-authored-by: Dimitry <[email protected]>

* docs: fix minor typos in documentation

  Co-authored-by: Dimitry <[email protected]>

* ci: cache pip dependencies

  Support Python 3.7+ based on the readme: https://github.com/serpapi/google-search-results-python/blob/35e51c94e7243c29650ed7b630db4e4e6d0c61aa/README.md#L18

  Co-authored-by: "dimitryzub <[email protected]>"

---------

Co-authored-by: Dimitry <[email protected]>
Co-authored-by: Dimitry <[email protected]>
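The approach described in the commit message, reading the next page's query parameters from a URL returned by the backend rather than computing `start` and `num` locally, can be sketched with the standard library alone. This is only an illustration: the helper name `params_from_next_url`, the example URL, and the `sp` token value are assumptions made up for this sketch.

```python
from urllib import parse


def params_from_next_url(next_url):
    """Return the query parameters of a next-page URL as a dict of strings."""
    query = parse.urlsplit(next_url).query
    return dict(parse.parse_qsl(query))


# Hypothetical next-page URL; the token value is invented for this example.
next_url = "https://serpapi.com/search.json?engine=youtube&search_query=coffee&sp=TOKEN123"

params = params_from_next_url(next_url)
print(params)
```

Because the parameters come from the URL itself, the same code handles offset-based and token-based pagination without knowing which one an engine uses.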
1 parent f69b884 commit ce2652b

25 files changed: +474 -132 lines changed

.flake8

+7
@@ -0,0 +1,7 @@
+[flake8]
+exclude =
+    build
+    dist
+    .git
+    .env
+    .pytest_cache

.github/workflows/python-package.yml

+10-14
@@ -11,30 +11,26 @@ on:
 
 jobs:
   build:
-
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: [3.5, 3.6, 3.7, 3.8, 3.9]
+        python-version: [3.7, 3.8, 3.9, '3.10']
 
     steps:
-    - uses: actions/checkout@v2
+    - uses: actions/checkout@v3
     - name: Set up Python ${{ matrix.python-version }}
-      uses: actions/setup-python@v2
+      uses: actions/setup-python@v4
       with:
         python-version: ${{ matrix.python-version }}
+        cache: pip
+
     - name: Install dependencies
-      run: |
-        python -m pip install --upgrade pip
-        pip install flake8 pytest
-        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
+      run: make install build_dep
+
     - name: Lint with flake8
-      run: |
-        # stop the build if there are Python syntax errors or undefined names
-        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
-        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
-        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
+      run: make lint
+
     - name: Test with pytest
-      run: pytest
+      run: make test
       env:
         API_KEY: ${{secrets.API_KEY}}

.gitignore

+2-1
@@ -13,4 +13,5 @@ build/
 google_search_results.egg-info/PKG-INFO
 google_search_results.egg-info/SOURCES.txt
 
-.env
+# Python virtual environment
+.env

Makefile

+16-5
@@ -5,27 +5,38 @@ version=$(shell grep version setup.py | cut -d"'" -f2)
 
 .PHONY: build
 
-all: clean install test
+all: clean install lint test
 
 clean:
 	find . -name '*.pyc' -delete
 	find . -type d -name "__pycache__" -delete
 	pip3 uninstall google_search_results
 
+create_env:
+	python -m venv .env
+	source .env/bin/activate
+
 install:
-	pip3 install -r requirements.txt
+	python3 -m pip install --upgrade pip
+	if [ -f requirements.txt ]; then pip3 install -r requirements.txt; fi
+
+lint: build_dep
+	# stop the build if there are Python syntax errors or undefined names
+	flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
+	# exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
+	flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
 
 # Test with Python 3
-test:
-	pytest
+test: build_dep
+	pytest --workers auto --tests-per-worker auto
 
 # run example only
 # and display output (-s)
 example:
 	pytest -s "tests/test_example.py::TestExample::test_async"
 
 build_dep:
-	pip3 install -U setuptools twine
+	pip3 install -U setuptools pytest py pytest-parallel flake8 twine
 
 # https://packaging.python.org/tutorials/packaging-projects/
 build: build_dep

README.md

+4-4
@@ -79,7 +79,7 @@ See the [playground to generate your code.](https://serpapi.com/playground)
 - [Quick start](#quick-start)
 - [Summary](#summary)
 - [Google Search API capability](#google-search-api-capability)
-- [How to set SERP API key](#how-to-set-serp-api-key)
+- [How to set SerpApi key](#how-to-set-serp-api-key)
 - [Example by specification](#example-by-specification)
 - [Location API](#location-api)
 - [Search Archive API](#search-archive-api)
@@ -116,7 +116,7 @@ params = {
     "safe": "Safe Search Flag",
     "num": "Number of Results",
     "start": "Pagination Offset",
-    "api_key": "Your SERP API Key",
+    "api_key": "Your SerpApi Key",
     # To be match
     "tbm": "nws|isch|shop",
     # To be search
@@ -145,7 +145,7 @@ object_result = search.get_object()
 
 See below for more hands-on examples.
 
-### How to set SERP API key
+### How to set SerpApi key
 
 You can get an API key here if you don't already have one: https://serpapi.com/users/sign_up
 
@@ -602,7 +602,7 @@ If it's a client error, then a SerpApiClientException is raised.
 - youtube
 - walmart
 - apple_app_store
-- naver
+- naver
 - raise SerpApiClientException instead of raw string in order to follow Python guideline 3.5+
 - add more unit error tests for serp_api_client
 

README.rst

+2-2
@@ -35,7 +35,7 @@ Quick start
 
 This example runs a search about "coffee" using your secret api key.
 
-The Serp API service (backend)
+The SerpApi service (backend)
 
 * searches on Google using the query: q = "coffee"
 * parses the messy HTML responses
@@ -44,7 +44,7 @@ The Serp API service (backend)
 The GoogleSearch class
 
 * Format the request
-* Execute GET http request against Serp API service
+* Execute GET http request against SerpApi service
 * Parse JSON response into a dictionary
 
 Et voila..

requirements.txt

+6
@@ -1 +1,7 @@
+-i https://pypi.org/simple
+certifi==2022.5.18
+chardet==4.0.0
+-e .
+idna==2.10
 requests==2.25.1
+urllib3==1.26.9

serpapi/constant.py

+3-2
@@ -1,8 +1,9 @@
 
 # Pagination constant
 DEFAULT_START = 0
-DEFAULT_END = 100
+DEFAULT_END = 1000000000
 DEFAULT_PAGE_SIZE = 10
+DEFAULT_LIMIT = 1000
 
 # Supported earch engine
 GOOGLE_ENGINE = 'google'
@@ -19,4 +20,4 @@
 NAVER_ENGINE = "naver"
 APPLE_APP_STORE_ENGINE = "apple_app_store"
 
-# from serpapi.constant import *
+# from serpapi.constant import *

serpapi/pagination.py

+49-50
@@ -1,67 +1,66 @@
-from serpapi.serp_api_client_exception import SerpApiClientException
+from urllib import parse
+from serpapi import constant
 
-DEFAULT_START = 0
-DEFAULT_END = 1000000000
-DEFAULT_num = 10
-
-# Paginate response in SearpApi
+# Paginate response in SerpApi
 class Pagination:
-
-    def __init__(self, client, start = DEFAULT_START, end = DEFAULT_END, num = DEFAULT_num):
-        # serp api client
+
+    def __init__(self, client, start = constant.DEFAULT_START, end = constant.DEFAULT_END, num = constant.DEFAULT_PAGE_SIZE, limit = constant.DEFAULT_LIMIT):
+        # SerpApi client
         self.client = client
-        # range
-        self.start = start
-        self.end = end
-        self.num = num
-
-        # use value from the client
-        if self.start == DEFAULT_START:
-            if 'start' in self.client.params_dict:
-                self.start = self.client.params_dict['start']
-        if self.end == DEFAULT_END:
-            if 'end' in self.client.params_dict:
-                self.end = self.client.params_dict['end']
-        if self.num == DEFAULT_num:
-            if 'num' in self.client.params_dict:
-                self.num = self.client.params_dict['num']
-
-        # basic check
-        if self.start > self.end:
-            raise SerpApiClientException("start: {} must be less than end: {}".format(self.start, self.end))
-        if(self.start + self.num) > self.end:
-            raise SerpApiClientException("start + num: {} + {} must be less than end: {}".format(self.start, self.num, self.end))
+
+        self.limit = limit
+
+        """Backwards-compatible workaround.
+        `start`, `num`, and `end` parameters to `Pagination#__init__` are deprecated.
+
+        Set `start` and `num` search parameters.
+        It works for Google Search API only.
+        A correct way to set an offset, limit, and page size is in search parameters directly.
+        (A hash that is passed to `SerpApi#__init__`.)
+        """
+        if start != constant.DEFAULT_START:
+            self.client.params_dict['start'] = start
+
+        if end != constant.DEFAULT_END:
+            self.client.params_dict['end'] = end
+
+        if num != constant.DEFAULT_PAGE_SIZE:
+            self.client.params_dict['num'] = num
+
+
+        self.page_number = 0
 
     def __iter__(self):
-        self.update()
         return self
 
-    def update(self):
-        self.client.params_dict['start'] = self.start
-        self.client.params_dict['num'] = self.num
-        if self.start > 0:
-            self.client.params_dict['start'] += 1
-
     def __next__(self):
-        # update parameter
-        self.update()
+        if self.page_number >= self.limit:
+            raise StopIteration
 
         # execute request
         result = self.client.get_dict()
 
-        # stop if backend miss to return serpapi_pagination
-        if not 'serpapi_pagination' in result:
+        pagination = result.get('serpapi_pagination', result.get('pagination'))
+
+        # stop if backend miss to return `serpapi_pagination` or `pagination`
+        if not pagination:
             raise StopIteration
 
         # stop if no next page
-        if not 'next' in result['serpapi_pagination']:
-            raise StopIteration
-
-        # ends if no next page
-        if self.start + self.num > self.end:
-            raise StopIteration
-
-        # increment start page
-        self.start += self.num
+        if not 'next' in pagination:
+            raise StopIteration
+
+        # Get actual parameters from next page of target website
+        params_from_target_website = dict(
+            parse.parse_qsl(parse.urlsplit(pagination['next']).query)
+        )
+
+        # stop if parameters from the target website were not changed
+        if params_from_target_website.items() <= self.client.params_dict.items():
+            raise StopIteration
+
+        self.client.params_dict.update(params_from_target_website)
+
+        self.page_number += 1
 
         return result
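
The stop condition in `__next__` relies on the fact that dictionary item views support subset comparison: `a.items() <= b.items()` is true when every key/value pair of `a` is already present in `b`. A minimal sketch of that check in isolation follows; the helper name `has_new_params` and the sample dictionaries are assumptions made up for illustration.

```python
def has_new_params(next_params, current_params):
    """True if next_params holds at least one key/value pair missing from current_params."""
    return not (next_params.items() <= current_params.items())


# Invented sample parameters; values are strings, as urllib.parse.parse_qsl returns them.
current = {"q": "coffee", "engine": "google", "start": "10"}

advanced = has_new_params({"q": "coffee", "start": "20"}, current)   # offset moved on
stalled = has_new_params({"q": "coffee", "start": "10"}, current)    # nothing changed
mixed_types = has_new_params({"start": "10"}, {"start": 10})         # "10" != 10
```

One consequence worth keeping in mind: `parse_qsl` always yields strings, so a parameter stored as an integer in the current parameters compares as different from the same number parsed out of a URL.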

serpapi/serp_api_client.py

+4-12
@@ -1,17 +1,9 @@
 import requests
 import json
+from serpapi.constant import *
 from serpapi.pagination import Pagination
 from serpapi.serp_api_client_exception import SerpApiClientException
 
-GOOGLE_ENGINE = 'google'
-BING_ENGINE = 'bing'
-BAIDU_ENGINE = 'baidu'
-GOOGLE_SCHOLAR_ENGINE = 'google_scholar'
-YANDEX_ENGINE = 'yandex'
-EBAY_ENGINE = 'ebay'
-YAHOO_ENGINE = 'yahoo'
-HOME_DEPOT_ENGINE = 'home_depot'
-YOUTUBE_ENGINE = 'youtube'
 
 class SerpApiClient(object):
     """SerpApiClient enables to query any search engines supported by SerpApi and parse the results.
@@ -71,7 +63,7 @@ def get_results(self, path='/search'):
 
     def get_html(self):
         """Returns:
-            Raw HTML search result from Gooogle
+            Raw HTML search result from Google
         """
         return self.get_results()
 
@@ -170,8 +162,8 @@ def get_location(self, q, limit = 5):
         buffer = self.get_results('/locations.json')
         return json.loads(buffer)
 
-    def pagination(self, start = 0, end = 1000000000, page_size = 10):
+    def pagination(self, start = DEFAULT_START, end = DEFAULT_END, page_size = DEFAULT_PAGE_SIZE, limit = DEFAULT_LIMIT):
         """Return:
             Generator to iterate the search results pagination
         """
-        return Pagination(self, start, end, page_size)
+        return Pagination(self, start, end, page_size, limit)

tests/test_apple_app_store_search.py

+17
@@ -22,5 +22,22 @@ def test_get_json(self):
         pp.pprint(data)
         print(data.keys())
 
+    @unittest.skipIf((os.getenv("API_KEY") == None), "no api_key provided")
+    def test_paginate(self):
+        page_size = 20
+        search = AppleAppStoreSearch({"term": "Coffee", "page": 0, "num": page_size})
+
+        limit = 4
+        pages = search.pagination(limit=limit)
+
+        page_count = 0
+        result_count = 0
+
+        for page in pages:
+            page_count += 1
+            result_count += len(page["organic_results"])
+
+        self.assertEqual(page_count, limit)
+
 if __name__ == '__main__':
     unittest.main()

tests/test_baidu_search.py

+17
@@ -21,5 +21,22 @@ def test_get_json(self):
         pp = pprint.PrettyPrinter(indent=2)
         pp.pprint(data)
 
+    @unittest.skipIf((os.getenv("API_KEY") == None), "no api_key provided")
+    def test_paginate(self):
+        page_size = 30
+        search = BaiduSearch({"q": "Coffee", "pn": 10, "m": page_size})
+
+        limit = 3
+        pages = search.pagination(limit=limit)
+
+        page_count = 0
+        result_count = 0
+
+        for page in pages:
+            page_count += 1
+            result_count += len(page["organic_results"])
+
+        self.assertEqual(page_count, limit)
+
 if __name__ == '__main__':
     unittest.main()
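
The test pattern used here (limit the number of pages, iterate, count) can be extended with the duplicate check that the commit message mentions. The sketch below runs offline against a stand-in generator instead of a real search client; `fake_pages`, the `link` field, and the example.com URLs are all assumptions made up for illustration, not part of the library.

```python
def fake_pages(limit, results_per_page=3):
    """Hypothetical stand-in for search.pagination(limit=...): yields result-page dicts."""
    for page_number in range(limit):
        yield {
            "organic_results": [
                {"link": "https://example.com/{}/{}".format(page_number, i)}
                for i in range(results_per_page)
            ]
        }


def count_pages_and_unique_results(pages):
    """Iterate pages, failing on any result link that was already seen on an earlier page."""
    seen_links = set()
    page_count = 0
    for page in pages:
        page_count += 1
        for item in page["organic_results"]:
            assert item["link"] not in seen_links, "duplicate result across pages"
            seen_links.add(item["link"])
    return page_count, len(seen_links)


page_count, unique_results = count_pages_and_unique_results(fake_pages(limit=3))
```

A repeated link across pages would indicate that the parameters did not actually advance between requests, which is the failure this kind of check is meant to catch.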
