Commit 69348b5

Add Beautiful Soup example
1 parent 78e4921 commit 69348b5

5 files changed: +1193 -0 lines changed

Diff for: source-code/web-scraping/README.md

+5
@@ -1,9 +1,14 @@
 # WebScraping
+
 Web scraping can be quite useful to gather data that is not available
 through an API. Here, some sample code is provided for Beautiful Soup,
 a web scraping library that is easy to use.
 
+
 ## What is it?
+
 1. `link_web.py`: script that uses Beautiful Soup and NetworkX to
 create a graph representing the links between web pages, starting
 from a given page.
+1. `preprocessing`: directory with a Python script that scrapes a web page
+containing FAQs and prints them in JSONL format.
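
The README above describes `link_web.py` only in one sentence, so the following is a minimal sketch of the general idea and not the script from this repository. It assumes a breadth-first crawl, and the names `PAGES`, `extract_links`, and `build_link_graph` are hypothetical. A small in-memory dictionary stands in for real HTTP requests so that the example runs without network access.

```python
from collections import deque
from urllib.parse import urljoin

import networkx as nx
from bs4 import BeautifulSoup

# Hypothetical in-memory "site" used instead of real HTTP requests.
PAGES = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.org/a.html": '<a href="/">home</a> <a href="b.html">B</a>',
    "http://example.org/b.html": "<p>no links here</p>",
}


def extract_links(base_url, html):
    """Return absolute URLs for every <a> element that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


def build_link_graph(start_url, fetch, max_pages=10):
    """Crawl breadth-first from start_url; fetch(url) returns HTML or None."""
    graph = nx.DiGraph()
    graph.add_node(start_url)
    to_visit = deque([start_url])
    visited = set()
    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)
        if html is None:  # page could not be retrieved
            continue
        for target in extract_links(url, html):
            graph.add_edge(url, target)  # directed edge: url links to target
            if target not in visited:
                to_visit.append(target)
    return graph


graph = build_link_graph("http://example.org/", PAGES.get)
for source, target in sorted(graph.edges()):
    print(source, "->", target)
```

For real pages, `fetch` would be replaced by a function that downloads the URL and returns its HTML; `max_pages` keeps the crawl bounded.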

Diff for: source-code/web-scraping/preprocessing/README.md

+13
@@ -0,0 +1,13 @@
+# Preprocessing
+
+This directory contains an example of data preprocessing using Beautiful Soup.
+
+
+## What is it?
+
+1. `preprocess_openmp_faq.py`: Python script that reads the OpenMP FAQ from an
+HTML file, and writes the output to standard output. The output is in JSONL
+format, with each question-answer pair on a single line.
+1. `openmp_faq.html`: HTML file containing the OpenMP FAQ.
+1. `openmp_faq.jsonl`: JSONL file containing the OpenMP FAQ in a structured
+format.
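
The contents of `preprocess_openmp_faq.py` are not shown in this part of the diff, so the following is only a sketch of the HTML-to-JSONL step described above. It assumes a hypothetical markup in which each question is an `<h3>` element followed by one or more `<p>` elements that hold the answer; the real `openmp_faq.html` may be structured differently. The names `SAMPLE_HTML`, `faq_to_records`, and `to_jsonl` and the keys `question` and `answer` are illustrative.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical FAQ markup; the real openmp_faq.html may differ.
SAMPLE_HTML = """
<html><body>
<h3>What is OpenMP?</h3>
<p>An API for shared-memory parallel programming.</p>
<h3>Which languages are supported?</h3>
<p>C and C++.</p>
<p>Fortran as well.</p>
</body></html>
"""


def faq_to_records(html):
    """Pair each <h3> question with the <p> elements that follow it."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for heading in soup.find_all("h3"):
        paragraphs = []
        for sibling in heading.find_next_siblings():
            if sibling.name == "h3":  # next question starts here
                break
            if sibling.name == "p":
                paragraphs.append(sibling.get_text(" ", strip=True))
        records.append({
            "question": heading.get_text(" ", strip=True),
            "answer": " ".join(paragraphs),
        })
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


jsonl = to_jsonl(faq_to_records(SAMPLE_HTML))
print(jsonl)
```

Writing one JSON object per line means the output can be processed line by line, without loading the whole file at once.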
