Update instructions
shaansubbaiah committed May 8, 2021
1 parent 91b8ca8 commit 94b92c8
Showing 3 changed files with 15 additions and 4 deletions.
10 changes: 7 additions & 3 deletions README.md
@@ -66,7 +66,7 @@ Data scraped for each recipe includes:


## Run

First run `pip install scrapy`
### Run the spider:

- Set `DEBUG = False` in `recipescrape/spiders/recipes.py`
@@ -78,10 +78,14 @@ Data scraped for each recipe includes:
- Run `scrapy crawl recipes -L WARN --logfile=scrapelog.txt -s JOBDIR=recipes/spider-1`

## Extras
Spider to extract all the nutrient names:

### Spider to extract all the nutrient names:

- Run `pip install pandas`
- Make sure `DEBUG = True` in `nutrient_list.py`
- Run `scrapy crawl nutrients -o nutrients.csv -L WARN -s JOBDIR=nutrients/spider-1`
- Outputs 'items_count', 'nutri_count', 'nutri_list', 'url' in `nutrients.csv` in which the last row contains the unified list of nutritients found so far.
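
Reading the unified list back out of `nutrients.csv` could look like the sketch below. It assumes the file has a header row with the four column names listed above, and that the `nutri_list` cell can be treated as an opaque string (its exact serialization is not shown in this commit). The helper name `latest_nutrient_list` is hypothetical.

```python
import pandas as pd


def latest_nutrient_list(csv_path="nutrients.csv"):
    """Return the 'nutri_list' cell of the last row, which holds the unified list.

    Hypothetical helper: assumes a header row containing a 'nutri_list' column.
    """
    df = pd.read_csv(csv_path)
    if df.empty:
        raise ValueError(f"{csv_path} contains no rows")
    # The last row accumulates every nutrient name seen so far.
    return df.iloc[-1]["nutri_list"]
```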

### Combine CSV files generated

<!-- https://www.health.harvard.edu/staying-healthy/listing_of_vitamins -->
- There are many ways to combine CSV files, a sample python file `/extras/combine_csv.py` is included for quick reference
7 changes: 7 additions & 0 deletions extras/combine_csv.py
@@ -1,5 +1,9 @@
import pandas as pd

# just included for reference
# this can be improved by using loops, reading the dir for CSV files.

# read the created CSV files
df1 = pd.read_csv('1.csv')

print(df1.head())
@@ -17,11 +21,13 @@

print(f'total: {len(df1)+len(df2)+len(df3)}')

# generate a single combined CSV
df4 = pd.concat([df1, df2, df3])
print(df4.head())
combined_len = len(df4)
print(combined_len)

# drop duplicated entries if any
df4 = df4.drop_duplicates()
print(df4.head())
unique_len = len(df4)
@@ -30,6 +36,7 @@

df4.to_csv('combined.csv', index=False)

# view the combined CSV with unique entries
df5 = pd.read_csv('combined.csv')
print(df5.head())
print(len(df5))
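
The comment at the top of `combine_csv.py` notes that the script could be improved with loops that read the directory for CSV files. A minimal sketch of that idea follows; it is not part of this commit. The function name `combine_csvs` and its parameters are hypothetical, and it assumes all input files share the same columns.

```python
from pathlib import Path

import pandas as pd


def combine_csvs(directory, output_name="combined.csv"):
    """Concatenate every CSV in `directory`, drop duplicate rows, write one file.

    Hypothetical helper: assumes all input files share the same columns.
    """
    directory = Path(directory)
    output_path = directory / output_name
    # Skip an earlier output file so it is not merged back into itself.
    csv_paths = sorted(p for p in directory.glob("*.csv") if p != output_path)
    if not csv_paths:
        raise FileNotFoundError(f"no CSV files found in {directory}")

    frames = [pd.read_csv(p) for p in csv_paths]
    combined = pd.concat(frames, ignore_index=True).drop_duplicates()
    combined.to_csv(output_path, index=False)
    return combined
```

Calling `combine_csvs(".")` in the folder holding `1.csv`, `2.csv`, and so on would write `combined.csv` next to them, without a hard-coded variable per file.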
2 changes: 1 addition & 1 deletion recipescrape/spiders/recipes.py
@@ -4,7 +4,7 @@
import re
import logging

-DEBUG = True
+DEBUG = False
FIELDS = [
# Recipe and related information
'name', 'url', 'category', 'author', 'summary', 'rating', 'rating_count', 'review_count', 'ingredients', 'directions', 'prep', 'cook', 'total', 'servings', 'yield',