Web Scraping with Node.js

Activity 2: Trying out Cheerio Selectors

In this activity we will try out some selectors.


Sample Page

Take a look at the sample web page here:

Sample Page 2

Inspect the HTML code of the page. To do this in most browsers, right-click on the page and select View Page Source. You should see something like this:

[Screenshot: HTML source of the sample page]

Browse through the HTML code and try to understand how the structure of the page relates to what is shown on the web page.

Template Code

We will start with the template code called template.js. Make a copy of this and call it activity2.js.

Locate the TODOs in activity2.js. This is where you will enter your code.

// Function to scrape a page
function scrape() {
  // Specify the URL of page we want to scrape
  let url = "TODO: ENTER URL HERE";

  // Make the http request to the URL to get the data
  axios.get(url).then(response => {

    // Get the data from the response
    data = response.data

    // Load the HTML into the Cheerio web scraper
    const $ = cheerio.load(data);

    // Create a list to receive the data we will scrape
    results = []

    // Create a new csv file
    scrape_helper.initialiseCsv('data.csv')

    // TODO: Enter your scraping code here

    // Save the data to the csv
    scrape_helper.storeCsv('data.csv', results)

    console.log("See data.csv for results")

  }).catch((err) => {
    // Show any error message
    console.log("Error: " + err.message);
  });
}

First, add the URL of the page we will be scraping in the first TODO:

  let url = "https://www.thinkcreatelearn.co.uk/resources/node/web-scraping/sample2.html";

You can run the program now:

node activity2.js

It will create an empty data.csv file, because nothing has been added to results yet.
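
The source of the scrape_helper functions is not shown in this activity. As a rough sketch only, assuming that initialiseCsv creates an empty file and that storeCsv writes a header row followed by one line per result, they might work something like the code below. The names initialiseCsvSketch, storeCsvSketch and csvEscape are hypothetical and are not the real helper functions.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical stand-in for scrape_helper.initialiseCsv:
// create the file, or empty it if it already exists
function initialiseCsvSketch(filename) {
  fs.writeFileSync(filename, "");
}

// Wrap a value in quotes if it contains a comma, quote or line break
function csvEscape(value) {
  const s = String(value);
  return /[",\r\n]/.test(s) ? '"' + s.replace(/"/g, '""') + '"' : s;
}

// Hypothetical stand-in for scrape_helper.storeCsv:
// the header row is taken from the keys of the first object
function storeCsvSketch(filename, rows) {
  if (rows.length === 0) return; // nothing to write, so the file stays empty
  const columns = Object.keys(rows[0]);
  const lines = [columns.join(",")];
  for (const row of rows) {
    lines.push(columns.map(c => csvEscape(row[c])).join(","));
  }
  fs.appendFileSync(filename, lines.join("\n") + "\n");
}

// Usage: write one result to a temporary file and read it back
const demoFile = path.join(os.tmpdir(), "sketch-data.csv");
initialiseCsvSketch(demoFile);
storeCsvSketch(demoFile, [{ exercise: "title", result: "List of foods" }]);
const csvText = fs.readFileSync(demoFile, "utf8");
console.log(csvText);
```

With an empty results list the sketch writes nothing, which matches the empty data.csv you see at this point.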

We will now add code after the second TODO: Enter your scraping code here.

1. Select the Title

Let's select the page title. This is the name that appears in the tab in your browser:

[Screenshot: the page title shown in the browser tab]

You can find this in the page's HTML code:

<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta http-equiv="X-UA-Compatible" content="ie=edge">
    <title>List of foods</title>
    <style>

Enter the following code after the second TODO in activity2.js:

    // 1. Get the title tag
    selection = $("title")
    text = $(selection).text()
    results.push({exercise:"title", result:text})     

Run the program again. The output in data.csv should change:

exercise,result
title,List of foods

2. Select all the H1 Tags

We will now select all the H1 tags on the page.

Enter the following code after the title selection code:

    // 2. Get all the H1 tags on the page
    selection = $("h1")
    selection.each((i,el) => {
      text = $(el).text()
      results.push({exercise:"h1", result:text})        
    })    

The output should change:

exercise,result
title,List of foods
h1,Ingredients
h1,Dishes

3. Select all the H2 tags

We can use the same approach to select the H2 tags. Add the following code:

    // 3. Search for the H2 tags and add to the results
    selection = $("h2")
    selection.each((i,el) => {
      text = $(el).text()
      results.push({exercise:"h2", result:text})        
    })   

4. Select all the H1 and H2 tags

If we want to select both the H1 and H2 tags at the same time we can simply list the selectors like this:

    // 4. Get all the H1 and H2 tags on the page
    selection = $("h1, h2")
    selection.each((i,el) => {
      text = $(el).text()
      results.push({exercise:"h1 and h2", result:text})        
    })        

If you have followed the above steps your output should look like this:

exercise,result
title,List of foods
h1,Ingredients
h1,Dishes
h2,Fruit
h2,Meat
h2,Cereal
h2,Vegetables
h1 and h2,Ingredients
h1 and h2,Fruit
h1 and h2,Meat
h1 and h2,Cereal
h1 and h2,Vegetables
h1 and h2,Dishes

5. Select based on an Element Id

HTML elements often have IDs which can be used to identify an element in the page:

    <body>
        <h1>Ingredients</h1>
        <div id="fruit">
            <h2>Fruit</h2>
            <p>List of fruit</p>
            <div id="berries">
                <p id="stberr">Strawberries</p>
                <p class="estyle">Blackberries</p>
            </div>
            <div id="autumn-fruit">
                <p>Apples</p>
                <p>Pears</p>
            </div>
        </div>

We can specify that we want to select based on an ID by using # followed by the ID name:

    // 5. Select based on an element id stberr
    selection = $("#stberr")
    selection.each((i,el) => {
      text = $(el).text()
      results.push({exercise:"strawberry", result:text})        
    })      

Adding the above code should give you another item in your output:

strawberry,Strawberries

Exercise: Select the Element with an Id

Try out the following exercise.


Can you select an element that has the id 'wht'?


Here is some code to help:

    // Exercise. Select based on an element id wht 
    selection = $("INSERT YOUR SELECTOR HERE")
    selection.each((i,el) => {
      text = $(el).text()
      results.push({exercise:"wheat", result:text})        
    })      

6. Select based on an Element Class

HTML elements can also have a class, which specifies the element's CSS style:

<body>
    <h1>Ingredients</h1>
    <div id="fruit">
        <h2>Fruit</h2>
        <p>List of fruit</p>
        <div id="berries">
            <p id="stberr">Strawberries</p>
            <p class="estyle">Blackberries</p>
        </div>
        <div id="autumn-fruit">
            <p>Apples</p>
            <p>Pears</p>
        </div>
    </div>

We can specify that we want to select based on a class by using . followed by the class name:

    // 6. Select based on a class estyle
    selection = $(".estyle")
    selection.each((i,el) => {
      text = $(el).text()
      results.push({exercise:"blackberry", result:text})        
    })    

Adding the above code should give you another item in your output:

blackberry,Blackberries

Exercise: Select the Element with a Class

Try out the following exercise.


Can you select an element that has the class 'dstyle'?


Here is some code to help:

    selection = $("INSERT YOUR SELECTOR HERE")
    selection.each((i,el) => {
      text = $(el).text()
      results.push({exercise:"oats", result:text})        
    }) 

7. Select Direct Children

What if we want all p elements that fall directly under a div element? We can list selectors separated by >:

    // 7. Select based on one selection within another  
    // All p tags under a div
    selection = $("div > p")
    selection.each((i,el) => {
      text = $(el).text()
      results.push({exercise:"div-p", result:text})        
    })     

Note how this selects only p elements that are directly under a div element:

div-p,List of fruit
div-p,Strawberries
div-p,Blackberries
div-p,Apples
div-p,Pears
div-p,List of meat
div-p,Beef
div-p,Chicken 
div-p,Turkey
div-p,List of cereal
div-p,Rice
div-p,Wild Rice
div-p,Wheat
div-p,Oats

Lamb is not selected because its p element sits directly under a span, not a div:

<div id="meat">
    <h2>Meat</h2>
    <p>List of meat</p>
    <div class="astyle">
        <span><p>Lamb</p></span>
        <p>Beef</p>
    </div>
    <div class="bstyle">
        <p>Chicken </p>
        <p>Turkey</p>
    </div>
</div>

8. Select Direct Grandchildren

We are not limited to two selectors in a compound selection. Here we have three. This selects only p elements that sit directly under a div which itself sits directly under another div:

    // 8. Select based on one selection within another which in turn is in another
    // All p tags under 2 divs
    selection = $("div > div > p")
    selection.each((i,el) => {
      text = $(el).text()
      results.push({exercise:"div-div-p", result:text})        
    })   

Compare the output with the output for the "div > p" selector above:

div-div-p,Strawberries
div-div-p,Blackberries
div-div-p,Apples
div-div-p,Pears
div-div-p,Beef
div-div-p,Chicken 
div-div-p,Turkey
div-div-p,Rice
div-div-p,Wild Rice
div-div-p,Wheat
div-div-p,Oats

9. Select Descendants (Indirect Children)

If we omit the > we get all descendants, not just direct children. Here we select all p elements that are anywhere under a div element:

    // 9. Select based on one selection anywhere within another
    // All p tags anywhere under a div
    selection = $("div p")
    selection.each((i,el) => {
        text = $(el).text()
        results.push({exercise:"div-p-indirect", result:text})        
    })   

Now we have Lamb appearing:

div-p-indirect,List of fruit
div-p-indirect,Strawberries
div-p-indirect,Blackberries
div-p-indirect,Apples
div-p-indirect,Pears
div-p-indirect,List of meat
div-p-indirect,Lamb
div-p-indirect,Beef
div-p-indirect,Chicken 
div-p-indirect,Turkey
div-p-indirect,List of cereal
div-p-indirect,Rice
div-p-indirect,Wild Rice
div-p-indirect,Wheat
div-p-indirect,Oats

10. Some More Compound Selectors

We can combine the different types of selectors, such as element selectors, class selectors and ID selectors, within compound selectors. Here are some examples:

a

    // 10a. Select based on one selection within another
    // Here we select all p elements directly within the element with an id of "berries"
    selection = $("#berries > p")
    selection.each((i,el) => {
        text = $(el).text()
        results.push({exercise:"berries", result:text})        
    })

berries,Strawberries
berries,Blackberries

b

    // 10b. Select based on one selection within another  
    // Here we select all p elements directly within the elements with a class of "bstyle"
    selection = $(".bstyle > p")
    selection.each((i,el) => {
        text = $(el).text()
        results.push({exercise:"poultry", result:text})        
    })

poultry,Chicken 
poultry,Turkey

c

    // 10c. Select based on one selection within another  
    // Here we select all p elements directly within the elements with a class of "astyle"
    selection = $(".astyle > p")
    selection.each((i,el) => {
        text = $(el).text()
        results.push({exercise:"meat-and-rice", result:text})        
    })

meat-and-rice,Beef
meat-and-rice,Rice
meat-and-rice,Wild Rice

d

    // 10d. Select based on one selection within another which in turn is in another
    // Here we select all p elements within div elements within the elements with id of "fruit"
    selection = $("#fruit > div > p")
    selection.each((i,el) => {
        text = $(el).text()
        results.push({exercise:"fruit", result:text})        
    })

fruit,Strawberries
fruit,Blackberries
fruit,Apples
fruit,Pears

e

    // 10e. Select based on a class within an element id
    selection = $("#cereal > .astyle > p")
    selection.each((i,el) => {
        text = $(el).text()
        results.push({exercise:"rice", result:text})        
    })

rice,Rice
rice,Wild Rice

Exercise: Select Just Meat

To exercise your skills, try out this task.


Can you write a selector that selects just the meat, i.e. Lamb, Beef, Chicken, Turkey?


Here is some code to help you:

    // Exercise Select just the meat
    selection = $("INSERT YOUR SELECTOR HERE")
    selection.each((i,el) => {
        text = $(el).text()
        results.push({exercise:"meat", result:text})        
    })         

11. Lists

We often want to pull items from lists on a web page. Our sample page has the following list:

<ul>
    <li>Leeks</li>
    <li>Cabbage</li>
    <li>Cauliflower</li>
</ul>

Here is some code that pulls the first element from the list:

    // 11a Get first list item
    selection = $("li").first()
    text = selection.text()
    results.push({exercise:"veg-first", result:text})  

This pulls the last element from the list:

    // 11b Get last list item
    selection = $("li").last()
    text = selection.text()
    results.push({exercise:"veg-last", result:text})  

And this pulls all the elements from the list:

    // 11c Select all list items
    selection = $("li")
    selection.each((i,el) => {
        text = $(el).text()
        results.push({exercise:"all-veg", result:text})        
    })    

12. Tables

Tables in an HTML page frequently contain desirable data. Take a look at this HTML table:

<table>
    <tr>
        <th>Dish</th>
        <th>Country</th>
    </tr>
    <tr>
        <td>Paella</td>
        <td>Spain</td>
    </tr>
    <tr>
        <td>Moules-frites</td>
        <td>Belgium</td>
    </tr>
    <tr>
        <td>Pastel de Nata</td>
        <td>Portugal</td>
    </tr>            
    </table>

Notice how we have <table>, <tr>, <th> and <td> tags. The <tr> tags represent table rows. The <th> tags represent header cells. The <td> tags represent data cells in the table.

Scraping data from these tables is a bit more involved. We want to preserve the structure of the table and that requires a little work. I've added a function called scrapeTable() in scrape-helper.js to help with scraping tables. You can scrape the table in the sample page using the following code:

    // Create a new csv file
    scrape_helper.initialiseCsv('table.csv')

    // Get the table
    table = $("table")[0]  //[0] to get first table  
    selection = scrape_helper.scrapeTable($, table)  

    // Save to csv
    scrape_helper.storeCsv('table.csv', selection)
    console.log("See table.csv for results")

When we run this code we get a nice, structured csv, preserving the table structure:

Dish,Country
Paella,Spain
Moules-frites,Belgium
Pastel de Nata,Portugal

Complete code

You can find the completed code in scrape2-selectors.js.
