Web Scraping with Node.js

Scrape data from a web page with Cheerio

In this workshop we will automate the process of pulling data from a web page.


Table of Contents

  1. Scrape data from a web page with Cheerio
  2. Activity 1: Modify the sample code
  3. Cheerio Selectors
  4. Activity 2: Trying out Cheerio Selectors
  5. Activity 3: Trying out some Tables
  6. Activity 4: Reading attributes
  7. Activity 5: Books to Scrape
  8. Clicking and Autoscrolling
  9. Links to Scrape Samples

Web scraping

Take a look at this web page from OpenStreetMap:

Underground

There is a list of underground stations together with some useful information, such as the geo-coordinates of the stations. If we wanted to use this data in a program (e.g. to draw a map of the London Underground) we would need to get this data into a more useful form, such as a list in Python or a JSON object in a JavaScript program.
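For example, in a JavaScript program we might want each station to become an object in an array. The snippet below is purely illustrative (the station names and coordinates are made-up placeholders, not data taken from the page):

// Illustrative only: a possible target structure, one object per station
const stations = [
    { name: "Example Station A", lat: 51.50, lon: -0.10 },
    { name: "Example Station B", lat: 51.55, lon: -0.15 }
];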

Copy-and-Paste

A really simple approach would be to select the table on the page using your mouse, copy it, and paste it into your favourite spreadsheet program. This will work, but it has a few significant disadvantages: it is a slow, manual process, it is easy to make mistakes, and you have to repeat all the work whenever the data on the page changes.

A Really Simple Scrape

Here is a more automated, code-free approach: copy the HTML of the page (or just the part you are interested in) and paste it into an online converter that turns HTML into JSON.

Here is an HTML to JSON tool which you can use for this process:

HTML to JSON

The above approach works to some degree, but it is not fully automated. We still need to do some copying and pasting.

The rest of this workshop explores a coded approach, using Node.js.

The Source Code

The source code for the examples in this worksheet can be found here:

Source Code

Before we get started, download the code and open it up in Visual Studio Code.

Web Scraping with Node.js and Cheerio

Let's do a simple web-scraping exercise. We will scrape this test page:

Test Page

The test page looks like this:

Sample1

We will be using a Node.js module called Cheerio to help us.
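Cheerio (and axios, which we use to download the page) are Node.js modules installed from npm. The downloaded project should already include them, but if running the program reports a missing module you can usually fix it by installing the dependencies from the project folder:

npm install axios cheerio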

First, let's run this program. Open up a Terminal window in Visual Studio Code (menu Terminal > New Terminal). Type in the following command:

node scrape1-simple.js 

You should see the following message:

Found 8 rows
See data.csv for results

You should also see a file called data.csv containing the following:

country
Albania
Armenia
Austria
Azerbaijan
Belarus
Belgium
Bosnia and Herzegovina
Bulgaria

Open the code file scrape1-simple.js and take a look at it:

// Load the modules we need 
const axios = require('axios');                     // for sending web requests
const cheerio = require('cheerio');                 // for web scraping
const scrape_helper = require('./scrape-helper');   // for saving objects to csv

// Call the scrape function
scrape()


// Function to scrape a page
function scrape() {
  // Specify the URL of the page we want to scrape
  let url = "https://www.thinkcreatelearn.co.uk/resources/node/web-scraping/sample1.html";

  // Make the http request to the URL to get the data
  axios.get(url).then(response => {

    // Get the data from the response
    const data = response.data

    // Load the HTML into the Cheerio web scraper
    const $ = cheerio.load(data);

    // Create a list to receive the data we will scrape
    const results = []

    // Create a new csv file
    scrape_helper.initialiseCsv('data.csv')

    // Search for the elements we want
    const selection = $('h2')

    // Add the elements to the list
    selection.each((i, el) => {
      const text = $(el).text()
      results.push({country: text})
    })

    // Save the data to the csv
    scrape_helper.storeCsv('data.csv', results)
    console.log("See data.csv for results")

  }).catch((err) => {
    // Show any error message
    console.log("Error: " + err.message);
  });
}

Essentially, the code searches for particular elements in the HTML and uses them to build up a list of JavaScript objects. Take a look at the HTML of the web page we are scraping (in Chrome, visit the Test Page, right-click on the page, then select View Page Source).

Here's the HTML code for this page. Note the <h2> tags:

<!DOCTYPE html>
<html lang="en">

    <head>
        <meta charset="UTF-8">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <meta http-equiv="X-UA-Compatible" content="ie=edge">
        <title>Web scraping exercise</title>
        <style>
            .fpar {
                font-family: Georgia;
            }
        </style>
    </head>

    <body>
        <p>Sample for web scraping exercises</p>

        <h1>Countries beginning with A</h1>

        <h2>Albania</h2>
        <h3>Tirana</h3>

        <h2>Armenia</h2>
        <h3>Yerevan</h3>    

        <h2>Austria</h2>
        <h3>Vienna</h3> 

        <h2>Azerbaijan</h2>
        <h3>Baku</h3>

        <h1 id="b-countries">Countries beggining with B</h1>     

        <h2>Belarus</h2>   
        <h3>Minsk</h3>

        <h2>Belgium</h2>
        <h3>Brussels</h3>

        <h2>Bosnia and Herzegovina</h2>
        <h3>Sarajevo</h3>

        <h2>Bulgaria</h2>
        <h3>Sofia</h3>


    </body>
</html>

The web scraping code is looking for all the <h2> tags:

      // Search for the elements we want
      const selection = $('h2')

Then it builds up a list containing the text of each <h2> tag:

    // Add the elements to the list
    selection.each((i, el) => {
      const text = $(el).text()
      results.push({country: text})
    })
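At the end of this loop, results holds one object per <h2> element. Based on the code above and the data.csv output, it would look something like this (abbreviated):

// What results contains after the loop (abbreviated)
[
  { country: 'Albania' },
  { country: 'Armenia' },
  // ...
  { country: 'Bulgaria' }
]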

Finally, it saves the results as a CSV file:

    // Save the data to the csv
    scrape_helper.storeCsv('data.csv', results)
    console.log("See data.csv for results")
