🤖 Manual web scraper for pages with hot reload and pagination
Web crawler for scraping web pages
A Search Object is initialized, setting the criteria needed for the scrape to run and for the results to be saved as CSV and JSON files.
An async function handles the scrape, drawing its arguments from the Search Object; the browser (Chromium) is then launched and navigates to the given URL.
If more than one page is to be scraped, Puppeteer emulates clicks to navigate through the pages and stores the newly scraped data in an array before returning.
The entire scraper is written in JavaScript; the overall flow is sketched below.
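For orientation, here is a minimal sketch of that flow, assuming `puppeteer` is installed. `handleScrape` and `scrapePage` are hypothetical names standing in for the scraper's actual handler and extraction step, not the exact contents of `scraper.js`:

```js
const puppeteer = require('puppeteer');

//--> sketch of the scrape handler (scrapePage is a hypothetical extraction helper)
const handleScrape = async ({ source__url, total__pages }) => {
  const browser = await puppeteer.launch(); //--> launches bundled Chromium
  const page = await browser.newPage();
  await page.goto(source__url, { waitUntil: 'networkidle2' });

  let results = [];
  for (let currentPage = 1; currentPage <= total__pages; currentPage++) {
    results = results.concat(await scrapePage(page)); //--> extract items on this page
    //--> when total__pages > 1, pagination clicks happen here (see the pagination step)
  }

  await browser.close();
  return results;
};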
- Clone the repository and install packages:

```
npm install
```
- Navigate to the `Search__Object` object and enter initial values:
```js
const path = require('path'); //--> needed for path.resolve below

const Search__Object = {
  data__source: 'oddFEELING PortFolio',
  source__url: 'https://odd-portfolio.web.app/',
  total__pages: 2,
  Scrape__cli: 'false',
  JSON__name: 'Links__json',
  JSON__path: path.resolve(__dirname, './Scrapped__Data/JSON__files'),
  CSV__name: 'Links__csv',
  CSV__path: path.resolve(__dirname, './Scrapped__Data/CSV__files'),
};
```
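A minimal sketch of how these fields might be consumed when the output files are written, assuming the `convert-array-to-csv` package listed under the dependencies below; `saveResults` is a hypothetical helper name, not necessarily how `scraper.js` structures this:

```js
const fs = require('fs');
const path = require('path');
const { convertArrayToCSV } = require('convert-array-to-csv');

//--> hypothetical helper: write results using the names/paths from Search__Object
const saveResults = (results, opts) => {
  fs.writeFileSync(
    path.join(opts.JSON__path, `${opts.JSON__name}.json`),
    JSON.stringify(results, null, 2)
  );
  fs.writeFileSync(
    path.join(opts.CSV__path, `${opts.CSV__name}.csv`),
    convertArrayToCSV(results) //--> derives CSV headers from the objects' keys
  );
};
```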
- Navigate to the item selector and input an element selector:
```js
//--> select main query element
let items = document.querySelectorAll(`## Selector`);
```
Replace `## Selector` with a selector, e.g. `div.sc-fKFxtB.ivoVis > h3`.

🚩 This selects all `h3` elements inside any `div` with the class names `sc-fKFxtB` and `ivoVis`.
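With the example selector filled in, the line would read:

```js
let items = document.querySelectorAll(`div.sc-fKFxtB.ivoVis > h3`);
```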
- Navigate to the loop that pushes an object to the final result, then enter the values that should be extracted from the element. The attributes are taken from the `item` object:
```js
//--> loop through items and add to result
items.forEach((item) => {
  results.push({
    source: `oddFEELING portfolio`, //--> ##Source
    url: item.getAttribute('href'),
    content: item.textContent,
  });
});
```
🚩 This gets the element's `href` attribute and its `textContent`.
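For context, browser-side snippets like the two above typically run inside Puppeteer's `page.evaluate`, which executes in the page and returns the results to Node. A sketch under that assumption, not necessarily the exact structure of `scraper.js`:

```js
//--> run the extraction in the browser context and return the results to Node
const newResults = await page.evaluate(() => {
  let results = [];
  let items = document.querySelectorAll(`div.sc-fKFxtB.ivoVis > h3`);
  items.forEach((item) => {
    results.push({
      source: `oddFEELING portfolio`,
      url: item.getAttribute('href'),
      content: item.textContent,
    });
  });
  return results;
});
```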
- If the site to be scraped is paginated (makes an API call and renders items into different pages), Puppeteer needs to auto-click and navigate the pages. Specify the element to click by setting its selector in `page.click(## Paginator)`:
```js
//--> puppeteer auto click next button (pagination)
if (currentPage < pagesToScrape) {
  await Promise.all([
    page.click(`div.sc-fKFxtB.ivoVis > h3`),
    page.waitForSelector(`## selector`),
  ]);
}
```
Then set an element to wait for before continuing the scrape process in `page.waitForSelector(## selector)`:
```js
//--> puppeteer auto click next button (pagination)
if (currentPage < pagesToScrape) {
  await Promise.all([
    page.click(`div.sc-fKFxtB.ivoVis > h3`),
    page.waitForSelector(`div`),
  ]);
}
```
🚩 This waits for a `div` to render before continuing.
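Putting the pagination pieces together, one plausible shape for the loop that accumulates each page's results before returning; `scrapeAllPages` and `extractItems` are hypothetical names (`extractItems` standing in for the `page.evaluate` sketch above):

```js
//--> sketch: scrape every page, clicking through pagination between pages
const scrapeAllPages = async (page, pagesToScrape) => {
  let results = [];
  for (let currentPage = 1; currentPage <= pagesToScrape; currentPage++) {
    results = results.concat(await extractItems(page)); //--> hypothetical extraction wrapper

    if (currentPage < pagesToScrape) {
      await Promise.all([
        page.click(`## Paginator`),          //--> next-page button
        page.waitForSelector(`## selector`), //--> element that signals the new page rendered
      ]);
    }
  }
  return results;
};
```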
- You're set! Navigate to your terminal (or press `ctrl` + `shift` + `~`), then run the scraper with:

```
node scraper.js
```
Options to be filled:

- `## Selector` --> element to query
- `## Paginator` --> element to click for pagination
- `## source` --> source of data
- `$$ Name` --> names given to the written files
- `$$ boolean` --> true or false
- NODE - Yep! Node...... not Python
- PUPPETEER - Headless browser automation library
- CONVERT-ARRAY-TO-CSV - npm package for converting arrays to CSV
- @oddFEELING - Author and Owner
See also the list of contributors who participated in this project.