web-scrapper



🤖 Manual web scraper for pages with hot reload and pagination

📝 Table of Contents

  • About
  • Demo / Working
  • How it works
  • Usage
  • Options
  • Built Using
  • Author

🧐 About

A web crawler for scraping web pages.

🎥 Demo / Working


💭 How it works

A Search Object is initialized, setting the criteria needed for the scrape to run and for the results to be saved as CSV and JSON files.

An async function that handles the scrape is called, drawing its arguments from the Search Object; the browser (Chromium) is then launched and navigates to the given URL.

If more than one page is to be scraped, Puppeteer emulates clicks to navigate between the pages and stores the newly scraped data in an array before returning.

The entire scraper is written in JavaScript.
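
As a rough sketch, the flow described above looks like this. This is illustrative only — the exact shape and names of the scrape function are assumptions, and the ## placeholders get filled in under Usage below.

const puppeteer = require('puppeteer');

//-->  minimal sketch of the scrape flow (not the full scraper)
const scrape = async (searchObject) => {
  const browser = await puppeteer.launch(); //-->  launch chromium
  const page = await browser.newPage();
  await page.goto(searchObject.source__url); //-->  navigate to the given url

  let results = [];
  for (let currentPage = 1; currentPage <= searchObject.total__pages; currentPage++) {
    //-->  query the page DOM and collect items into the result array
    const pageItems = await page.evaluate(() => {
      const items = document.querySelectorAll(`## Selector`);
      return [...items].map((item) => ({
        url: item.getAttribute('href'),
        content: item.textContent,
      }));
    });
    results = results.concat(pageItems);

    //-->  emulate a click to reach the next page, then wait for it to render
    if (currentPage < searchObject.total__pages) {
      await Promise.all([
        page.click(`## Paginator`),
        page.waitForSelector(`## selector`),
      ]);
    }
  }

  await browser.close();
  return results; //-->  later written out as CSV and JSON
};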

🎈 Usage

  1. Clone repository and install packages:
npm install
  2. Navigate to the Search__Object and enter its initial values

Example

const path = require('path');

const Search__Object = {
  data__source: 'oddFEELING PortFolio', //-->  ## source
  source__url: 'https://odd-portfolio.web.app/', //-->  url to scrape
  total__pages: 2, //-->  number of pages to scrape
  Scrape__cli: 'false', //-->  $$ boolean
  JSON__name: 'Links__json', //-->  $$ Name
  JSON__path: path.resolve(__dirname, './Scrapped__Data/JSON__files'),
  CSV__name: 'Links__csv', //-->  $$ Name
  CSV__path: path.resolve(__dirname, './Scrapped__Data/CSV__files'),
};
  3. Navigate to the item selector and input an element selector
//-->  select main query element
let items = document.querySelectorAll(`## Selector`);

Replace ## Selector with a selector, e.g. div.sc-fKFxtB.ivoVis > h3

🚩 This selects every h3 that is a direct child of a div element with the class names sc-fKFxtB and ivoVis
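
Note that both this query and the loop in the next step run inside the browser page, not in Node — Puppeteer executes them via page.evaluate. A minimal sketch of that wrapper (the results handling here is simplified):

//-->  selector queries run in the page context via page.evaluate
const scraped = await page.evaluate(() => {
  let results = [];
  let items = document.querySelectorAll(`div.sc-fKFxtB.ivoVis > h3`);
  items.forEach((item) => {
    results.push({ content: item.textContent });
  });
  return results; //-->  only serializable data crosses back to Node
});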
  4. Navigate to the loop that pushes an object to the final result, then enter the values that should be extracted from the element. The attributes are taken from the item object

Example

//-->  loop through items and add to result
items.forEach((item) => {
  results.push({
    source: `oddFEELING portfolio`, //-->  ##Source
    url: item.getAttribute('href'),
    content: item.textContent,
  });
});
🚩 This gets the element's href attribute and its textContent
  5. If the site to be scraped is paginated (it makes an API call and renders items across different pages), Puppeteer needs to auto-click to navigate between the pages. Specify the element to click by setting its selector in page.click(## Paginator)

    //-->  puppeteer auto click next button (pagination)
    if (currentPage < pagesToScrape) {
      await Promise.all([
        page.click(`div.sc-fKFxtB.ivoVis > h3`),
        page.waitForSelector(`## selector`),
      ]);
    }

    Then set an element to wait for before continuing the scrape process in page.waitForSelector(## selector)

    //-->  puppeteer auto click next button (pagination)
    if (currentPage < pagesToScrape) {
      await Promise.all([
        page.click(`div.sc-fKFxtB.ivoVis > h3`),
        page.waitForSelector(`div`),
      ]);
    }
🚩 This starts the click and the wait together inside Promise.all so the wait catches the render the click triggers; here it waits for a div to appear before continuing
  6. You're set!

    Navigate to your terminal or press ctrl + shift + ` then run the code using

node scraper.js
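
When the run finishes, the results are written to the paths configured in the Search Object. A minimal sketch of that step, assuming a flat results array like the one built above (the actual writer used by the repo may differ):

const fs = require('fs');
const path = require('path');

//-->  write results as JSON to the configured path
fs.writeFileSync(
  path.join(Search__Object.JSON__path, `${Search__Object.JSON__name}.json`),
  JSON.stringify(results, null, 2)
);

//-->  write results as CSV (hand-rolled; assumes flat objects without commas)
const header = Object.keys(results[0]).join(',');
const rows = results.map((row) => Object.values(row).join(','));
fs.writeFileSync(
  path.join(Search__Object.CSV__path, `${Search__Object.CSV__name}.csv`),
  [header, ...rows].join('\n')
);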

📦 Options

Options to be filled

  • ## Selector --> element to query
  • ## Paginator --> element to click for pagination
  • ## source --> source of the data
  • $$ Name --> name given to the written file
  • $$ boolean --> true or false

⛏️ Built Using

  • NodeJS - runtime environment
  • Puppeteer - browser automation

✍️ Author

  • @oddFEELING

See also the list of contributors who participated in this project.
