In this activity we will try out some selectors.
Take a look at the sample web page here:
Sample Page 2
Inspect the HTML code of the page. To do this in most browsers you can right-click on the page and select the View Page Source menu.
Browse through the HTML code and try to understand how the structure of the page relates to what is shown on the web page.
We will start with the template code called template.js. Make a copy of this and call it activity2.js.
Locate the TODOs in the activity2.js code file. This is where you will enter your code:
// Function to scrape a page
function scrape() {
    // Specify the URL of page we want to scrape
    let url = "TODO: ENTER URL HERE";
    // Make the http request to the URL to get the data
    axios.get(url).then(response => {
        // Get the data from the response
        data = response.data
        // Load the HTML into the Cheerio web scraper
        const $ = cheerio.load(data);
        // Create a list to receive the data we will scrape
        results = []
        // Create a new csv file
        scrape_helper.initialiseCsv('data.csv')
        // TODO: Enter your scraping code here
        // Save the data to the csv
        scrape_helper.storeCsv('data.csv', results)
        console.log("See data.csv for results")
    }).catch((err) => {
        // Show any error message
        console.log("Error: " + err.message);
    });
}
First, add the URL of the page we will be scraping in the first TODO:
let url = "https://www.thinkcreatelearn.co.uk/resources/node/web-scraping/sample2.html";
You can run the program now:
node activity2.js
It will create an empty data.csv file.
We will now add code after the second TODO: Enter your scraping code here.
Let's select the page title. This is the name that appears in the tab in your browser.
You can find this in the page's HTML code:
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>List of foods</title>
<style>
Enter the following code after the TODO in activity2.js:
// 1. Get the title tag
selection = $("title")
text = $(selection).text()
results.push({exercise:"title", result:text})
Run the program again. The output in data.csv should change:
exercise,result
title,List of foods
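The code inside scrape_helper.storeCsv is not shown in this activity, so the following is only a sketch of how a list of result objects could be turned into the CSV lines above. The function name toCsv is hypothetical, and the sketch assumes that every row has the same keys and that no value contains a comma, quote or line break (real CSV writing would need to quote such values).

```javascript
// Hypothetical helper: NOT the real scrape_helper.storeCsv implementation.
// Assumes every row has the same keys and no value needs CSV quoting.
function toCsv(rows) {
  if (rows.length === 0) {
    return "";
  }
  // Use the keys of the first object as the header line
  const headers = Object.keys(rows[0]);
  const lines = [headers.join(",")];
  for (const row of rows) {
    // Emit the values in the same order as the headers
    lines.push(headers.map((key) => row[key]).join(","));
  }
  return lines.join("\n") + "\n";
}

const sampleResults = [{ exercise: "title", result: "List of foods" }];
console.log(toCsv(sampleResults));
```

Each object pushed onto results becomes one line, and the object keys (exercise and result) become the column headings.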
We will now select all the H1 tags on the page.
Enter the following code after the title selection code:
// 2. Get all the H1 tags on the page
selection = $("h1")
selection.each((i,el) => {
text = $(el).text()
results.push({exercise:"h1", result:text})
})
The output should change:
exercise,result
title,List of foods
h1,Ingredients
h1,Dishes
We can use the same approach to select the H2 tags. Add the following code:
// 3. Search for the H2 tags and add to the results
selection = $("h2")
selection.each((i,el) => {
text = $(el).text()
results.push({exercise:"h2", result:text})
})
If we want to select both the H1 and H2 tags at the same time we can simply list the selectors like this:
// 4. Get all the H1 and H2 tags on the page
selection = $("h1, h2")
selection.each((i,el) => {
text = $(el).text()
results.push({exercise:"h1 and h2", result:text})
})
If you have followed the above steps your output should look like this:
exercise,result
title,List of foods
h1,Ingredients
h1,Dishes
h2,Fruit
h2,Meat
h2,Cereal
h2,Vegetables
h1 and h2,Ingredients
h1 and h2,Fruit
h1 and h2,Meat
h1 and h2,Cereal
h1 and h2,Vegetables
h1 and h2,Dishes
HTML elements often have IDs which can be used to identify an element in the page:
<body>
<h1>Ingredients</h1>
<div id="fruit">
<h2>Fruit</h2>
<p>List of fruit</p>
<div id="berries">
<p id="stberr">Strawberries</p>
<p class="estyle">Blackberries</p>
</div>
<div id="autumn-fruit">
<p>Apples</p>
<p>Pears</p>
</div>
</div>
We can specify that we want to select based on an ID by using # followed by the ID name:
// 5. Select based on an element id stberr
selection = $("#stberr")
selection.each((i,el) => {
text = $(el).text()
results.push({exercise:"strawberry", result:text})
})
Adding the above code should give you another item in your output:
strawberry,Strawberries
Try out the following exercise: select the element whose id is wht.
Here is some code to help:
// Exercise. Select based on an element id wht
selection = $("INSERT YOUR SELECTOR HERE")
selection.each((i,el) => {
text = $(el).text()
results.push({exercise:"wheat", result:text})
})
HTML elements can also have a class, which specifies the element's CSS style:
<body>
<h1>Ingredients</h1>
<div id="fruit">
<h2>Fruit</h2>
<p>List of fruit</p>
<div id="berries">
<p id="stberr">Strawberries</p>
<p class="estyle">Blackberries</p>
</div>
<div id="autumn-fruit">
<p>Apples</p>
<p>Pears</p>
</div>
</div>
We can specify that we want to select based on a class by using . followed by the class name:
// 6. Select based on a class estyle
selection = $(".estyle")
selection.each((i,el) => {
text = $(el).text()
results.push({exercise:"blackberry", result:text})
})
Adding the above code should give you another item in your output:
blackberry,Blackberries
Try out the following exercise.
Here is some code to help:
selection = $("INSERT YOUR SELECTOR HERE")
selection.each((i,el) => {
text = $(el).text()
results.push({exercise:"oats", result:text})
})
What if we want all p elements that fall directly under a div element? We can list selectors separated by >:
// 7. Select based on one selection within another
// All p tags under a div
selection = $("div > p")
selection.each((i,el) => {
text = $(el).text()
results.push({exercise:"div-p", result:text})
})
Note how this selects only p elements that are directly under a div element:
div-p,List of fruit
div-p,Strawberries
div-p,Blackberries
div-p,Apples
div-p,Pears
div-p,List of meat
div-p,Beef
div-p,Chicken
div-p,Turkey
div-p,List of cereal
div-p,Rice
div-p,Wild Rice
div-p,Wheat
div-p,Oats
The Lamb element is not selected because it sits under a span:
<div id="meat">
<h2>Meat</h2>
<p>List of meat</p>
<div class="astyle">
<span><p>Lamb</p></span>
<p>Beef</p>
</div>
<div class="bstyle">
<p>Chicken </p>
<p>Turkey</p>
</div>
</div>
We are not limited to 2 selectors in a compound selection. Here we have 3. This selects only p elements that are direct children of a div which is itself a direct child of another div:
// 8. Select based on one selection within another which in turn is in another
// All p tags under 2 divs
selection = $("div > div > p")
selection.each((i,el) => {
text = $(el).text()
results.push({exercise:"div-div-p", result:text})
})
Compare the output with the output for the "div > p" selector above:
div-div-p,Strawberries
div-div-p,Blackberries
div-div-p,Apples
div-div-p,Pears
div-div-p,Beef
div-div-p,Chicken
div-div-p,Turkey
div-div-p,Rice
div-div-p,Wild Rice
div-div-p,Wheat
div-div-p,Oats
If we omit the > we get all descendants, not just direct children. Here we select all p elements that are somewhere under a div element, at any depth:
// 9. Select based on one selection within another
// All p tags under a div
selection = $("div p")
selection.each((i,el) => {
text = $(el).text()
results.push({exercise:"div-p-indirect", result:text})
})
Now we have Lamb appearing:
div-p-indirect,List of fruit
div-p-indirect,Strawberries
div-p-indirect,Blackberries
div-p-indirect,Apples
div-p-indirect,Pears
div-p-indirect,List of meat
div-p-indirect,Lamb
div-p-indirect,Beef
div-p-indirect,Chicken
div-p-indirect,Turkey
div-p-indirect,List of cereal
div-p-indirect,Rice
div-p-indirect,Wild Rice
div-p-indirect,Wheat
div-p-indirect,Oats
We can use all the different types of selectors, such as element selectors, class selectors and id selectors, within compound selectors. Here are some examples:
// 10a. Select based on one selection within another
// Here we select all p elements within the elements with id of "berries"
selection = $("#berries > p")
selection.each((i,el) => {
text = $(el).text()
results.push({exercise:"berries", result:text})
})
berries,Strawberries
berries,Blackberries
// 10b. Select based on one selection within another
// Here we select all p elements directly within the elements with a class of "bstyle"
selection = $(".bstyle > p")
selection.each((i,el) => {
text = $(el).text()
results.push({exercise:"poultry", result:text})
})
poultry,Chicken
poultry,Turkey
// 10c. Select based on one selection within another
// Here we select all p elements directly within the elements with a class of "astyle"
selection = $(".astyle > p")
selection.each((i,el) => {
text = $(el).text()
results.push({exercise:"meat-and-rice", result:text})
})
meat-and-rice,Beef
meat-and-rice,Rice
meat-and-rice,Wild Rice
// 10d. Select based on one selection within another which in turn is in another
// Here we select all p elements within div elements within the elements with id of "fruit"
selection = $("#fruit > div > p")
selection.each((i,el) => {
text = $(el).text()
results.push({exercise:"fruit", result:text})
})
fruit,Strawberries
fruit,Blackberries
fruit,Apples
fruit,Pears
// 10e. Select based on a class within an element id
selection = $("#cereal > .astyle > p")
selection.each((i,el) => {
text = $(el).text()
results.push({exercise:"rice", result:text})
})
rice,Rice
rice,Wild Rice
To exercise your skills, try out this task.
Here is some code to help you:
// Exercise Select just the meat
selection = $("INSERT YOUR SELECTOR HERE")
selection.each((i,el) => {
text = $(el).text()
results.push({exercise:"meat", result:text})
})
We often want to pull items from lists on a web page. Our sample page has the following list:
<ul>
<li>Leeks</li>
<li>Cabbage</li>
<li>Cauliflower</li>
</ul>
Here is some code that pulls the first element from the list:
// 11a Get first list item
selection = $("li").first()
text = selection.text()
results.push({exercise:"veg-first", result:text})
the last element from the list:
// 11b Get last list item
selection = $("li").last()
text = selection.text()
results.push({exercise:"veg-last", result:text})
and all elements from the list:
// 11c Select all list items
selection = $("li")
selection.each((i,el) => {
text = $(el).text()
results.push({exercise:"all-veg", result:text})
})
Tables in an HTML page frequently contain desirable data. Take a look at this HTML table:
<table>
<tr>
<th>Dish</th>
<th>Country</th>
</tr>
<tr>
<td>Paella</td>
<td>Spain</td>
</tr>
<tr>
<td>Moules-frites</td>
<td>Belgium</td>
</tr>
<tr>
<td>Pastel de Nata</td>
<td>Portugal</td>
</tr>
</table>
Notice how we have <table>, <tr>, <th> and <td> tags. The <tr> tags represent table rows. The <th> tags represent header cells. The <td> tags represent data cells in the table.
Scraping data from these tables is a bit more involved. We want to preserve the structure of the table and that requires a little work. I've added a function called scrapeTable() in scrape-helper.js to help with scraping tables. You can scrape the table in the sample page using the following code:
// Create a new csv file
scrape_helper.initialiseCsv('table.csv')
// Get the table
table = $("table")[0] //[0] to get first table
selection = scrape_helper.scrapeTable($, table)
// Save to csv
scrape_helper.storeCsv('table.csv', selection)
console.log("See table.csv for results")
When we run this code we get a nice, structured csv, preserving the table structure:
Dish,Country
Paella,Spain
Moules-frites,Belgium
Pastel de Nata,Portugal
You can find the completed code in scrape2-selectors.js.