In this activity we will look at some issues that arise when scraping websites, and how to resolve them.
Sometimes the content of a website is served in pages: you have to click a link on the page to view the next one. Scraping code won't be able to reach the second and subsequent pages without simulating a user click on that link. This example uses a library called Puppeteer to simulate the mouse clicks. It's a much more complex example!
In scrape5-books-to-scrape.js we have code that scrapes a single page from the Books to Scrape site. Then in scrape6-books-to-scrape-click.js we extend the code to simulate the mouse click.
As I say, the code is quite complex and an explanation is beyond the scope of this worksheet.
Sometimes, rather than clicking to reveal more content, the user is required to scroll down the page. Again, we can use Puppeteer to simulate this user action of scrolling down the page.
In scrape7-reddit.js, we have code that scrapes only the first page of content from Reddit. Then in scrape8-reddit-autoscroll.js we extend the code to simulate scrolling down the page.
The code is quite complex and an explanation is beyond the scope of this worksheet.