Commit c96c6d8

Merge branch 'selenium'
inserting all Selenium changes + API scraper + new statics
2 parents: 35252fc + e8c4095 · commit c96c6d8

52 files changed: +4642 / -588 lines

.gitignore (+2)

@@ -26,6 +26,8 @@ config.cfg
 ### BACKUP FILES
 openscraper/_backups_collections/*.json
 
+### RUNNING SPIDERS
+openscraper/running_spiders
 
 ### BOWER
 bower_components

README.md (+36 / -11)
@@ -1,8 +1,11 @@
 
+<h1 align=center> OpenScraper</h1>
 
+----
 
-<h1 align=center> OpenScraper</h1>
+part 1/3 of the TADATA! software suite ([ApiViz](https://github.com/entrepreneur-interet-general/CIS-front) / [Solidata_backend](https://github.com/entrepreneur-interet-general/solidata_backend) / [Solidata_frontend](https://github.com/entrepreneur-interet-general/solidata_frontend) / [OpenScraper](https://github.com/entrepreneur-interet-general/OpenScraper) )
 
+---
 <h4 align=center> a public service for webscraping</h4>
 
 <h2 align=center>
@@ -17,9 +20,20 @@ OpenScraper is a minimalistic, open source webscraper with a simple interface, <
 
 <h4 align=center> OpenScraper is a project by <a href=https://entrepreneur-interet-general.etalab.gouv.fr/defi/2017/09/26/socialconnect/ >SocialConnect</a></h4>
 
-<p align=center>#python #tornado #scrapy #mongodb #bulma</p>
+<p align=center>#python #tornado #scrapy #selenium #mongodb #bulma</p>
+
+----
+## WHAT IS NEW ?
+
+
+- v1.3 - 20/10/2018 : added a first CSV converter and downloader for every spider's dataset, and first routes to the documentation
+- v1.2 - 18/10/2018 : a spider can adapt to follow pages whether they are API or HTML
+- v1.1 - 15/10/2018 : the parser adapts to REST APIs; the path inside the JSON is configured as a "/"-separated description (see the sketch below)
+- v1.0 - 10/06/2018 : the parser adapts to reactive websites (SPA, Vue, etc.)
+- vBeta : Scrapy parser based on spider configuration with XPaths
 
 ----
+
 ## ARGUMENT
 
 #### **To which needs this project aims to answer ?**
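The v1.1 line above configures where a field lives inside an API's JSON response as a "/"-separated path. A minimal sketch of that idea, assuming nothing about OpenScraper's actual configuration keys (the helper `get_by_path` and the sample payload are hypothetical):

```python
# Hedged sketch: resolve a "/"-separated path such as "results/0/title"
# inside a decoded JSON payload. Names are illustrative, not OpenScraper's API.
import json


def get_by_path(obj, path, sep="/"):
    """Walk nested dicts and lists following a '/'-separated path."""
    for key in path.split(sep):
        if isinstance(obj, list):
            obj = obj[int(key)]   # numeric segment -> list index
        elif isinstance(obj, dict):
            obj = obj.get(key)    # string segment -> dict key
        if obj is None:
            break
    return obj


payload = json.loads('{"results": [{"title": "hello", "tags": ["a", "b"]}]}')
print(get_by_path(payload, "results/0/title"))   # hello
print(get_by_path(payload, "results/0/tags/1"))  # b
```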
@@ -59,16 +73,29 @@ OpenScraper is developped in open source, and will provide a documentation as mu
 
 1. **clone or [download](https://github.com/entrepreneur-interet-general/OpenScraper/archive/master.zip) the repo**
 1. **[install MongoDB](https://docs.mongodb.com/manual/installation/) locally** or get the URI of the MongoDB you're using
+1. **install chromedriver**
+
+- on MacOS :
+>
+$ brew tap caskroom/cask
+$ brew cask install chromedriver
+
+- on Ubuntu :
+>
+$ sudo apt-get install chromium-chromedriver
+
 1. **go to your openscraper folder**
 1. **install the libraries (without [virtual environment](https://pypi.python.org/pypi/virtualenv))**
 
-> $ pip install -r requirements.txt
+>
+$ pip install -r requirements.txt
 
 1. **update the `config/settings.py` file** with your mongoDB URI (if you're not using default mongoDB connection)
 
 1. **run app** from `$ ~/../app_scrapnado`
 
-> $ python main.py
+>
+$ python main.py
 
 1. **check in your browser** at `localhost:8000`
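Once chromedriver and the requirements are installed, a throwaway script like the following can confirm that Selenium is able to drive a headless Chrome. This is a hedged sanity check, not part of the repo; it assumes chromedriver is on the PATH and a reasonably recent selenium is pulled in by requirements.txt:

```python
# Hedged sanity check, not part of OpenScraper: start a headless Chrome via
# Selenium and fetch a page. Assumes chromedriver is installed and on the PATH.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)  # finds chromedriver on the PATH
driver.get("http://localhost:8000")         # or any reachable URL
print(driver.title)
driver.quit()
```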

@@ -94,7 +121,7 @@ OpenScraper is developped in open source, and will provide a documentation as mu
 #### Tech stack
 - _Language_ : **Python**... because let's be honest, I don't manage so many languages for that kind of project
 - _Backend_ : **[Tornado](http://www.tornadoweb.org/en/stable/)**... one of the few async/non-blocking Python frameworks
-- _Scraping_ : **[Scrapy](https://scrapy.org/)**, perhaps with **[Selenium](http://selenium-python.readthedocs.io/) for Python** inside specific instances of the generic spider, or **Splash** for jquery follow-up...
+- _Scraping_ : **[Scrapy](https://scrapy.org/)**, with **[Selenium](http://selenium-python.readthedocs.io/) for Python** inside specific instances of the generic spider, or **Splash** for jquery follow-up...
 - _Frontend_ : **[Bulma](https://bulma.io/)** (to make it nice) and then **[Vue.js](https://vuejs.org/)** (to make it even nicer and bi-directional)
 
 #### Tech goals for a MVP
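The updated _Scraping_ line pairs Scrapy with Selenium inside specific instances of the generic spider. A hedged sketch of what that combination can look like, with a hypothetical spider, URL, and selectors rather than the repo's actual masterspider:

```python
# Hypothetical sketch, not OpenScraper's masterspider: a Scrapy spider that
# lets Selenium render a reactive page, then parses the rendered HTML with
# Scrapy selectors.
import scrapy
from scrapy import Selector
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


class RenderedSpider(scrapy.Spider):
    name = "rendered_example"
    start_urls = ["https://example.com"]  # placeholder

    def __init__(self, *args, **kwargs):
        super(RenderedSpider, self).__init__(*args, **kwargs)
        opts = Options()
        opts.add_argument("--headless")
        self.driver = webdriver.Chrome(options=opts)

    def parse(self, response):
        # Re-open the URL in the headless browser so JS-built content exists,
        # then hand the rendered source back to a Scrapy Selector.
        self.driver.get(response.url)
        rendered = Selector(text=self.driver.page_source)
        for title in rendered.css("h1::text").extract():
            yield {"title": title}

    def closed(self, reason):
        self.driver.quit()
```

Keeping the browser confined to specific spider instances, as the line above suggests, avoids paying the cost of a full browser round-trip on plain HTML or API sources.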
@@ -119,22 +146,20 @@ OpenScraper is developped in open source, and will provide a documentation as mu
 1. DONE - integrate generic spider + tests + run
 1. DONE - make Tornado and a basic scrapy spider work together (non-blocking)
 1. DONE - make a nice front in Bulma
-1. add API points for JSON feeds
+1. DONE - add Selenium to mimic navigation by clicks on reactive websites
+1. DONE - add API points for JSON feeds
+1. DONE - add an "export csv" button and function to download the dataset
 1. deploy a demo at [http://www.cis-openscraper.com/](http://www.cis-openscraper.com/)
 1. ... nicer front in vue.js
 1. integrate JWT and hash private infos for users and API
-1. add an "exposrt csv" button and function to download the dataset
 
 
 #### Currently :
-- currently at step(s) /10 and /11
-- working on the API points
+- adding documentation ...
 - ...
 
 #### Notes for later / issues :
-- must add logic to masterspider to request APIs
 - must migrate/copy data to a Elastic search (not only MongoDB)
-- must add a feature to use Selenium inside a spider if needed (ajax or .click function)
 - containerize the app for simpler deployment (locally or in production)
 - ...
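The roadmap now marks the "export csv" item as DONE. For a sense of what such a download endpoint can look like in a Tornado + MongoDB stack, here is a hedged Python 3 sketch; the handler, database, collection, and field names are hypothetical, not the repo's actual route:

```python
# Hedged sketch, not OpenScraper's actual handler: serialize a spider's
# MongoDB documents to CSV and send the result as a file download.
import csv
import io

import tornado.web
from pymongo import MongoClient


class ExportCsvHandler(tornado.web.RequestHandler):
    def get(self, spider_name):
        # Hypothetical database/collection layout.
        docs = MongoClient()["openscraper"]["data"].find({"spider_id": spider_name})
        buffer = io.StringIO()
        writer = csv.DictWriter(
            buffer, fieldnames=["title", "url", "date"], extrasaction="ignore"
        )
        writer.writeheader()
        for doc in docs:
            doc.pop("_id", None)  # ObjectId does not serialize nicely to CSV
            writer.writerow(doc)
        self.set_header("Content-Type", "text/csv")
        self.set_header(
            "Content-Disposition", "attachment; filename=%s.csv" % spider_name
        )
        self.write(buffer.getvalue())
```

It would be wired into the Tornado application with a route such as `(r"/export/([^/]+)\.csv", ExportCsvHandler)` (again, a hypothetical URL pattern).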

0 commit comments
