Commit 882d256

Is verified user info (#198)
* Add fake-useragent to requirements
* Update version numbering in setup and changelog
* Update readme
* Additionally scrape for is_verified when scraping user profiles
1 parent 1fe473b commit 882d256

File tree

6 files changed (+51, -48 lines)


README.rst

+36 -16
@@ -37,20 +37,28 @@ Per Tweet it scrapes the following information:
 + Tweet text
 + Tweet html
 + Tweet timestamp
++ Tweet Epoch timestamp
 + Tweet No. of likes
 + Tweet No. of replies
 + Tweet No. of retweets
 + Username
 + User Full Name
 + User ID
++ Tweet is a retweet
++ Username retweeter
++ Userid retweeter
++ Retweet ID
+
+In addition it can scrape for the following user information:
 + Date user joined
 + User location (if filled in)
 + User blog (if filled in)
-+ User No. of tweets
++ User No. of tweets
 + User No. of following
 + User No. of followers
 + User No. of likes
 + User No. of lists
++ User is verified
 
 
 2. Installation and Usage
@@ -96,7 +104,12 @@ JSON right away. Twitterscraper takes several arguments:
   default value is set to today. This does not work in combination with ``--user``.
 
 - ``-u`` or ``--user`` Scrapes the tweets from that user's profile page.
-  This also includes all retweets by that user. See examples below.
+  This also includes all retweets by that user. See section 2.4 in the examples below
+  for more information.
+
+- ``--profiles`` twitterscraper will in addition to the tweets, also scrape for the profile
+  information of the users who have written these tweets. The results will be saved in the
+  file userprofiles_<filename>.
 
 - ``-p`` or ``--poolsize`` Set the number of parallel processes
   TwitterScraper should initiate while scraping for your query. Default
@@ -121,21 +134,18 @@ JSON right away. Twitterscraper takes several arguments:
 - ``-ow`` or ``--overwrite``: With this argument, if the output file already exists
   it will be overwritten. If this argument is not set (default) twitterscraper will
   exit with the warning that the output file already exists.
-
-- ``--profiles``: twitterscraper will in addition to the tweets, also scrape for the profile information of the users who have written these tweets.
-  The results will be saved in the file "userprofiles_<filename>".
 
 
 2.2.1 Examples of simple queries
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Below is an example of how twitterscraper can be used:
 
-``twitterscraper Trump --limit 100 --output=tweets.json``
+``twitterscraper Trump --limit 1000 --output=tweets.json``
 
-``twitterscraper Trump -l 100 -o tweets.json``
+``twitterscraper Trump -l 1000 -o tweets.json``
 
-``twitterscraper Trump -l 100 -bd 2017-01-01 -ed 2017-06-01 -o tweets.json``
+``twitterscraper Trump -l 1000 -bd 2017-01-01 -ed 2017-06-01 -o tweets.json``
 
 
@@ -149,9 +159,9 @@ as one single query.
 Here are some examples:
 
 - search for the occurrence of 'Bitcoin' or 'BTC':
-  ``twitterscraper "Bitcoin OR BTC " -o bitcoin_tweets.json -l 1000``
+  ``twitterscraper "Bitcoin OR BTC" -o bitcoin_tweets.json -l 1000``
 - search for the occurrence of 'Bitcoin' and 'BTC':
-  ``twitterscraper "Bitcoin AND BTC " -o bitcoin_tweets.json -l 1000``
+  ``twitterscraper "Bitcoin AND BTC" -o bitcoin_tweets.json -l 1000``
 - search for tweets from a specific user:
   ``twitterscraper "Blockchain from:VitalikButerin" -o blockchain_tweets.json -l 1000``
 - search for tweets to a specific user:
@@ -167,17 +177,19 @@ Also see `Twitter's Standard operators <https://developer.twitter.com/en/docs/tw
 2.2.3 Examples of scraping user pages
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-You can also scrape all tweets written or retweeted by a specific user. This can be done by adding the boolean ``-u / --user`` argument to the query.
-If this argument is used, the query should be equal to the username.
+You can also scrape all tweets written or retweeted by a specific user.
+This can be done by adding the boolean ``-u / --user`` argument.
+If this argument is used, the search term should be equal to the username.
 
 Here is an example of scraping a specific user:
 
-``twitterscraper realDonaldTrump -u -o tweets_username.json``
+``twitterscraper realDonaldTrump --user -o tweets_username.json``
 
 This does not work in combination with ``-p``, ``-bd``, or ``-ed``.
 
 The main difference with the example "search for tweets from a specific user" in section 2.2.2 is that this method really scrapes
-all tweets from a profile page (including retweets). The example in 2.2.2 scrapes the results from the search page (excluding retweets).
+all tweets from a profile page (including retweets).
+The example in 2.2.2 scrapes the results from the search page (excluding retweets).
 
 
 2.3 From within Python
@@ -206,15 +218,23 @@ You can easily use TwitterScraper from within python:
 2.4 Scraping for retweets
 ----------------------
 
-A regular search within Twitter will not show you any retweets. Twitterscraper therefore does not contain any retweets in the output. To give an example: If user1 has written a tweet containing ``#trump2020`` and user2 has retweeted this tweet, a search for ``#trump2020`` will only show the original tweet. The only way you can scrape for retweets is if you scrape for all tweets of a specific user with the ``-u / --user`` argument.
+A regular search within Twitter will not show you any retweets.
+Twitterscraper therefore does not contain any retweets in the output.
+
+To give an example: If user1 has written a tweet containing ``#trump2020`` and user2 has retweeted this tweet,
+a search for ``#trump2020`` will only show the original tweet.
+
+The only way you can scrape for retweets is if you scrape for all tweets of a specific user with the ``-u / --user`` argument.
 
 
 2.5 Scraping for User Profile information
 ----------------------
 By adding the argument ``--profiles`` twitterscraper will in addition to the tweets, also scrape for the profile information of the users who have written these tweets.
 The results will be saved in the file "userprofiles_<filename>".
+
 Try not to use this argument too much. If you have already scraped profile information for a set of users, there is no need to do it again :)
-It is also possible to scrape for profile information without scraping for tweets. Examples of this can be found in the examples folder.
+It is also possible to scrape for profile information without scraping for tweets.
+Examples of this can be found in the examples folder.
 
 
 3. Output
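
As a rough illustration of the new fields this commit adds to the JSON output (per the changelog: is_retweet, retweeter information, and timestamp_epochs), here is a sketch of filtering a loaded result set. The record layout below is a hypothetical assumption for illustration, not the library's documented schema:

```python
import json

# Hypothetical sample of two records; the field names is_retweet and
# timestamp_epochs come from the changelog, but the exact schema is assumed.
sample = '''[
  {"username": "user1", "text": "#trump2020", "timestamp_epochs": 1561161600, "is_retweet": false},
  {"username": "user2", "text": "#trump2020", "timestamp_epochs": 1561162200, "is_retweet": true}
]'''

tweets = json.loads(sample)

# Keep only original tweets, mirroring the README's point that a regular
# search result contains no retweets.
originals = [t for t in tweets if not t["is_retweet"]]
print([t["username"] for t in originals])  # ['user1']
```
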

changelog.txt

+6
@@ -1,5 +1,11 @@
 # twitterscraper changelog
 
+# 1.2.0 ( 2019-06-22 )
+### Added
+- PR #186: adds the fields is_retweet, retweeter related information, and timestamp_epochs to the output.
+- PR #184: use fake_useragent for generation of random user agent headers.
+- Additionally scrape for 'is_verified' when scraping user profile pages.
+
 # 1.1.0 ( 2019-06-15 )
 ### Added
 - PR #176: Using billiard library instead of multiprocessing to add the ability to use this library with Celery.
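
The fake-useragent dependency referenced in PR #184 is used to randomize the User-Agent header on outgoing requests. A minimal stdlib stand-in for that idea (not the library's actual code — the real package draws from a database of real-world browser agent strings):

```python
import random

# Small hardcoded list standing in for fake-useragent's agent database.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0",
]

def random_headers():
    """Return request headers with a randomly chosen User-Agent string."""
    return {"User-Agent": random.choice(USER_AGENTS)}

headers = random_headers()
```

Rotating the User-Agent per request makes the scraper's traffic look less uniform, which is why the library switched to generated headers.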

requirements.txt

+1-27
@@ -2,31 +2,5 @@ coala-utils~=0.5.0
 bs4
 lxml
 requests
-backports.functools-lru-cache==1.5
-BeautifulSoup==3.2.1
-beautifulsoup4==4.7.1
-bs4==0.0.1
-certifi==2019.3.9
-chardet==3.0.4
-coala-utils==0.5.1
-idna==2.8
-lxml==4.3.3
-requests==2.21.0
-soupsieve==1.9.1
-twitterscraper==0.9.3
-urllib3==1.24.3
-backports.functools-lru-cache==1.5
-BeautifulSoup==3.2.1
-beautifulsoup4==4.7.1
-bs4==0.0.1
-certifi==2019.3.9
-chardet==3.0.4
-coala-utils==0.5.1
-fake-useragent==0.1.11
-idna==2.8
-lxml==4.3.3
-requests==2.21.0
-soupsieve==1.9.1
-twitterscraper==0.9.3
-urllib3==1.24.3
 billiard
+fake-useragent

setup.py

+1-3
@@ -1,14 +1,12 @@
 #!/usr/bin/env python3
 
 from setuptools import setup, find_packages
-
-
 with open('requirements.txt') as requirements:
     required = requirements.read().splitlines()
 
 setup(
     name='twitterscraper',
-    version='1.1.0',
+    version='1.2.0',
     description='Tool for scraping Tweets',
     url='https://github.com/taspinar/twitterscraper',
     author=['Ahmet Taspinar', 'Lasse Schuirmann'],
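
The unchanged lines in this hunk show setup.py deriving its dependency list by splitting requirements.txt into lines. The same pattern can be sketched self-contained, using an in-memory file in place of the real requirements.txt:

```python
import io

# Stand-in for open('requirements.txt'): an in-memory file whose contents
# mirror the trimmed requirements list in this commit.
requirements = io.StringIO("bs4\nlxml\nrequests\nbilliard\nfake-useragent\n")

# read().splitlines() drops the trailing newline and yields one entry
# per requirement, suitable for passing to setup(install_requires=...).
required = requirements.read().splitlines()
print(required)  # ['bs4', 'lxml', 'requests', 'billiard', 'fake-useragent']
```

Keeping the pins out of requirements.txt (as this commit does) means `required` carries only loose package names, leaving version resolution to pip.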

twitterscraper/__init__.py

+1-1
@@ -5,7 +5,7 @@
 Twitter Scraper tool
 """
 
-__version__ = '1.0.0'
+__version__ = '1.2.0'
 __author__ = 'Ahmet Taspinar'
 __license__ = 'MIT'
 
twitterscraper/user.py

+6-1
@@ -3,7 +3,7 @@
 
 class User:
     def __init__(self, user="", full_name="", location="", blog="", date_joined="", id="", tweets=0,
-                 following=0, followers=0, likes=0, lists=0):
+                 following=0, followers=0, likes=0, lists=0, is_verified=0):
         self.user = user
         self.full_name = full_name
         self.location = location
@@ -15,6 +15,7 @@ def __init__(self, user="", full_name="", location="", blog="", date_joined="",
         self.followers = followers
         self.likes = likes
         self.lists = lists
+        self.is_verified = is_verified
 
     @classmethod
     def from_soup(self, tag_prof_header, tag_prof_nav):
@@ -47,6 +48,10 @@ def from_soup(self, tag_prof_header, tag_prof_nav):
         else:
             self.date_joined = date_joined.strip()
 
+        tag_verified = tag_prof_header.find('span', {'class': "ProfileHeaderCard-badges"})
+        if tag_verified is not None:
+            self.is_verified = 1
+
         self.id = tag_prof_nav.find('div',{'class':'ProfileNav'})['data-user-id']
         tweets = tag_prof_nav.find('span', {'class':"ProfileNav-value"})['data-count']
         if tweets is None:
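
The new is_verified logic keys off the presence of the ProfileHeaderCard-badges span in the profile header, via BeautifulSoup's find in the real code. A minimal sketch of the same check without BeautifulSoup (the helper name and HTML snippets are hypothetical):

```python
def detect_is_verified(profile_header_html: str) -> int:
    """Return 1 if the verified-badge span is present, else 0 (the default).

    Stand-in for tag_prof_header.find('span',
    {'class': "ProfileHeaderCard-badges"}) in User.from_soup, reduced to a
    plain substring check on the raw header HTML.
    """
    if '<span class="ProfileHeaderCard-badges"' in profile_header_html:
        return 1
    return 0

# Hypothetical header snippets for a verified and an unverified profile.
verified_html = '<div><span class="ProfileHeaderCard-badges">Verified</span></div>'
plain_html = '<div><h1>SomeUser</h1></div>'

print(detect_is_verified(verified_html), detect_is_verified(plain_html))  # 1 0
```

As in the diff, the flag stays at its default of 0 unless the badge element is found, so absence of the span is never an error.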
