Web Scraping with Node.js

Clicking and Autoscrolling

In this activity we will look at two common problems, content that must be reached by clicking a link and content that only loads as the user scrolls, and how to resolve them.


Clicking on Links

Sometimes the content of a website is served in pages: you have to click a link to view the next page. Web scraping code won't be able to reach the second and subsequent pages without simulating a user clicking that link. This example uses a library called Puppeteer to simulate the mouse clicks. It's a much more complex example!

In scrape5-books-to-scrape.js we have code that scrapes a single page from the Books to Scrape site. Then in scrape6-books-to-scrape-click.js we extend the code to simulate the mouse click.

As I say, the code is quite complex and a full explanation is beyond the scope of this worksheet, but the sketch below gives a flavour of the technique.
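Here is a minimal sketch of the clicking technique, assuming puppeteer and cheerio have been installed with npm. The 'li.next a' and 'article.product_pod h3 a' selectors are assumptions about the Books to Scrape markup, and the real scrape6-books-to-scrape-click.js may be organised differently.

    const puppeteer = require('puppeteer');
    const cheerio = require('cheerio');

    async function scrapeAllPages() {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto('https://books.toscrape.com/');

      while (true) {
        // Hand the rendered HTML to Cheerio, as in the earlier activities.
        const $ = cheerio.load(await page.content());
        $('article.product_pod h3 a').each((i, el) => {
          console.log($(el).attr('title'));
        });

        // Stop when there is no "next" link left to click.
        const nextLink = await page.$('li.next a');
        if (!nextLink) break;

        // Simulate the user click and wait for the next page to load.
        await Promise.all([page.waitForNavigation(), nextLink.click()]);
      }

      await browser.close();
    }

    scrapeAllPages();

The important detail is the Promise.all: the navigation listener is set up before the click happens, so the new page load is not missed.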

Scrolling

Sometimes, rather than clicking to reveal more content, the user is required to scroll down the page. Again, we can use Puppeteer to simulate this scrolling.

In scrape7-reddit.js, we have some code that scrapes only the first page of content from Reddit. Then in scrape8-reddit-autoscroll.js we extend the code to simulate scrolling down the page.

Again, the code is quite complex and a full explanation is beyond the scope of this worksheet; the sketch below shows the general pattern.
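As a rough illustration, here is a minimal sketch of the autoscroll technique, again assuming Puppeteer is installed. The scroll distance and interval are arbitrary values to tune for the site; the real scrape8-reddit-autoscroll.js may be organised differently.

    const puppeteer = require('puppeteer');

    // Scroll a little at a time until we reach the bottom of the page,
    // giving lazily loaded content a chance to appear.
    async function autoScroll(page) {
      await page.evaluate(async () => {
        await new Promise((resolve) => {
          let totalHeight = 0;
          const distance = 200;  // pixels per step (an assumption; tune as needed)
          const timer = setInterval(() => {
            window.scrollBy(0, distance);
            totalHeight += distance;
            if (totalHeight >= document.body.scrollHeight) {
              clearInterval(timer);
              resolve();
            }
          }, 250);               // milliseconds between steps
        });
      });
    }

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto('https://www.reddit.com/');
      await autoScroll(page);            // trigger the lazily loaded posts
      const html = await page.content(); // rendered HTML after scrolling
      console.log(html.length);
      await browser.close();
    })();

Note that document.body.scrollHeight grows as new posts load, so on a genuinely infinite feed you would also want a cap on the number of scroll steps.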

Table of Contents

  1. Scrape data from a web page with Cheerio
  2. Activity 1: Modify the sample code
  3. Cheerio Selectors
  4. Activity 2: Trying out Cheerio Selectors
  5. Activity 3: Trying out some Tables
  6. Activity 4: Reading attributes
  7. Activity 5: Books to Scrape
  8. Clicking and Autoscrolling
  9. Links to Scrape Samples