<h1 align="center">OpenScraper</h1>

----

part 1/3 of the TADATA! software suite ([ApiViz](https://github.com/entrepreneur-interet-general/CIS-front) / [Solidata_backend](https://github.com/entrepreneur-interet-general/solidata_backend) / [Solidata_frontend](https://github.com/entrepreneur-interet-general/solidata_frontend) / [OpenScraper](https://github.com/entrepreneur-interet-general/OpenScraper))

---

<h4 align="center">a public service for webscraping</h4>

<h2 align="center">
OpenScraper is a minimalistic, open source webscraper with a simple interface ...
</h2>
<h4 align="center">OpenScraper is a project by <a href="https://entrepreneur-interet-general.etalab.gouv.fr/defi/2017/09/26/socialconnect/">SocialConnect</a></h4>

- vBeta : Scrapy parser based on spider configuration with XPaths

----

## ARGUMENT

#### **Which needs does this project aim to answer?**

1. **clone or [download](https://github.com/entrepreneur-interet-general/OpenScraper/archive/master.zip) the repo**

1. **[install MongoDB](https://docs.mongodb.com/manual/installation/) locally** or get the URI of the MongoDB instance you're using

1. **install chromedriver**

   - on MacOS :

     > $ brew tap caskroom/cask
     > $ brew cask install chromedriver

   - on Ubuntu :

     > $ sudo apt-get install chromium-chromedriver

1. **go to your openscraper folder**

1. **install the libraries (without a [virtual environment](https://pypi.python.org/pypi/virtualenv))**

   > $ pip install -r requirements.txt

1. **update the `config/settings.py` file** with your mongoDB URI (if you're not using the default mongoDB connection) : a sketch of what this could look like follows this list

1. **run the app** from `$ ~/../app_scrapnado`

   > $ python main.py

1. **check in your browser** at `localhost:8000`
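
For reference, here is a minimal sketch of what the MongoDB part of `config/settings.py` could look like. The variable names below (`MONGODB_URI`, `MONGODB_DB`) are assumptions for illustration, not necessarily the exact keys used by OpenScraper ; check the file shipped with the repo for the real ones.

```python
# config/settings.py -- hypothetical excerpt, variable names are illustrative only

# URI of the MongoDB instance OpenScraper writes scraped items to.
# The default below assumes MongoDB runs locally on the standard port.
MONGODB_URI = "mongodb://localhost:27017"

# Name of the database used by the app (assumed name).
MONGODB_DB = "openscraper"

# For a remote / authenticated MongoDB, the URI would look more like :
# MONGODB_URI = "mongodb://user:password@my-mongo-host:27017/openscraper"
```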

#### Tech stack

- _Language_ : **Python**... because let's be honest, I don't master that many languages for that kind of project
- _Backend_ : **[Tornado](http://www.tornadoweb.org/en/stable/)**... one of the few async/non-blocking Python frameworks
- _Scraping_ : **[Scrapy](https://scrapy.org/)**, with **[Selenium](http://selenium-python.readthedocs.io/) for Python** inside specific instances of the generic spider, or **Splash** for JavaScript/jQuery-rendered content... (see the sketch after this list)
- _Frontend_ : **[Bulma](https://bulma.io/)** (to make it nice) and then **[Vue.js](https://vuejs.org/)** (to make it even nicer and bi-directional)
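
To make the "generic spider configured with XPaths" idea concrete, here is a minimal, hypothetical Scrapy sketch. The class name and config keys (`item_xpath`, `fields`, ...) are invented for illustration and are not OpenScraper's actual masterspider code.

```python
# Hypothetical sketch of a generic Scrapy spider driven by an XPath configuration.
# Class name and config keys are illustrative, not OpenScraper's real code.
import scrapy


class GenericSpider(scrapy.Spider):
    name = "generic_spider"

    def __init__(self, config=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # 'config' would typically be loaded from MongoDB, e.g. :
        # {
        #   "start_urls": ["https://example.org/directory"],
        #   "item_xpath": "//div[@class='card']",
        #   "fields": {"title": ".//h2/text()", "link": ".//a/@href"},
        # }
        self.config = config or {}
        self.start_urls = self.config.get("start_urls", [])

    def parse(self, response):
        # Select every item node on the page, then extract each configured
        # field with an XPath expression relative to that node.
        for node in response.xpath(self.config["item_xpath"]):
            yield {
                field: node.xpath(xpath).extract_first()
                for field, xpath in self.config["fields"].items()
            }
```

Each "spider" created in the interface would then just be a different config dict stored in MongoDB, reused by the same class.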

#### Tech goals for an MVP
1. DONE - integrate generic spider + tests + run
1. DONE - make Tornado and a basic Scrapy spider work together (non-blocking)
1. DONE - make a nice front in Bulma
1. DONE - add Selenium to mimic navigation by clicks on reactive websites (a Selenium sketch follows this list)
1. DONE - add API points for JSON feeds (a Tornado endpoint sketch follows this list)
1. DONE - add an "export csv" button and function to download the dataset
1. deploy a demo at [http://www.cis-openscraper.com/](http://www.cis-openscraper.com/)
1. ... nicer front in vue.js
1. integrate JWT and hash private infos for users and API
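
And for the "Selenium to mimic navigation by clicks" item, here is a minimal sketch of how a script could drive headless Chrome (through the chromedriver installed above) and hand the rendered HTML back to Scrapy's selectors. The URL and the CSS/XPath selectors are purely illustrative.

```python
# Hypothetical sketch : use Selenium (headless Chrome + chromedriver) to click
# through a reactive page, then parse the rendered HTML with a Scrapy selector.
# The URL and the selectors are illustrative only.
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

try:
    driver.get("https://example.org/reactive-directory")

    # Mimic a user click on a "load more" button to trigger the AJAX call.
    driver.find_element(By.CSS_SELECTOR, "button.load-more").click()

    # Wait until the newly loaded cards are present in the DOM.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.card"))
    )

    # Hand the rendered DOM to Scrapy's XPath machinery, exactly as the
    # generic spider would do with a plain HTTP response.
    selector = Selector(text=driver.page_source)
    print(selector.xpath("//div[@class='card']//h2/text()").extract())
finally:
    driver.quit()
```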

#### Currently :

- adding documentation ...
- ...

#### Notes for later / issues :

- must migrate/copy data to Elasticsearch (not only MongoDB), see the sketch after this list
- containerize the app for simpler deployment (locally or in production)
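
On the Elasticsearch point, here is a minimal sketch of what copying the scraped items from MongoDB into an Elasticsearch index could look like, using `pymongo` and the official `elasticsearch` client. Hosts, database, collection and index names are assumptions for illustration.

```python
# Hypothetical sketch : bulk-copy scraped items from MongoDB to Elasticsearch.
# Hosts, database, collection and index names are illustrative only.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from pymongo import MongoClient

mongo_items = MongoClient("mongodb://localhost:27017")["openscraper"]["items"]
es = Elasticsearch("http://localhost:9200")


def actions():
    # Stream every MongoDB document as an Elasticsearch bulk action,
    # reusing the Mongo _id as the Elasticsearch document id.
    for doc in mongo_items.find():
        doc_id = str(doc.pop("_id"))
        yield {"_index": "openscraper-items", "_id": doc_id, "_source": doc}


indexed, errors = bulk(es, actions())
print("indexed %s documents, %s errors" % (indexed, len(errors)))
```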