Web Scraping with Node.js

Activity 4: Reading attributes

In this activity we will investigate how to extract attribute values from elements.


Attributes

Take a look at this page, which lists the London boroughs and their council websites:

London Boroughs

View the page source.

Look at the first link:

<a href="http://www.lbbd.gov.uk/">London Borough of Barking and Dagenham</a><br/>

We already know how to extract the text "London Borough of Barking and Dagenham". We would write a selector that selects the a elements and then use the text() function to extract the text.

But what about the link embedded in the href attribute? To extract that we can use the attr() function:

    // Assumes $ is the Cheerio object already loaded with the page HTML

    // Create a list to receive the data we will scrape
    const results = []

    // Find all the a elements and extract the name and link
    const links = $('a')
    links.each((i, el) => {
        const borough = {}

        // The name is in the element text
        borough.name = $(el).text().trim()

        // The link is in the element's href attribute
        borough.link = $(el).attr('href')

        // Add the borough to the array
        results.push(borough)
    })

Try it out


Use the above code in a new project to extract the London borough names and links.


Complete code

You can find the completed code in scrape4-attributes.js.
