Web Scraping with rvest
The internet contains vast amounts of data that never make it into tidy datasets. Web scraping lets you extract that data directly from HTML pages. The rvest package makes this process straightforward in R.
This guide teaches you how to scrape web pages using rvest. You will learn to navigate HTML documents, extract specific elements, handle different data types, and follow best practices that keep your scraping respectful.
Installing and Loading rvest
Install rvest from CRAN. Installing the tidyverse also installs rvest, but library(tidyverse) does not attach it, so you always load rvest explicitly:
install.packages("rvest")
install.packages("tidyverse")
Load the package in your R session:
library(rvest)
Reading HTML Pages
The first step in any scraping task is reading the HTML content of a page. Use read_html() for this:
# Read a web page into R
page <- read_html("https://example.com")
page
read_html() accepts a URL, a local file path, or a string of literal HTML, and returns an XML document object. This object contains the parsed HTML that you can query with selectors.
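Because read_html() also parses literal HTML strings, you can experiment with rvest entirely offline. A minimal sketch (the HTML snippet here is invented for illustration):

```r
library(rvest)

# read_html() also accepts a string of HTML, useful for offline experiments
html <- '<html><body><h1 id="title">Hello, rvest</h1></body></html>'
page <- read_html(html)

page |> html_element("#title") |> html_text2()
#> [1] "Hello, rvest"
```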
For pages whose content is rendered by JavaScript, read_html() cannot help: it only sees what the server sends in the initial HTML response. For JavaScript-heavy sites, consider a headless browser, such as rvest's read_html_live() (which drives Chrome through the chromote package) or the selenium package.
Selecting Elements
Once you have an HTML document, you need to find the elements containing your target data. rvest supports two selector systems: CSS selectors and XPath expressions.
CSS Selectors
CSS selectors are the more common choice. They use the same syntax that web developers use for styling:
# Read a page with movie data
starwars <- read_html("https://rvest.tidyverse.org/articles/starwars.html")
# Find all section elements
films <- starwars |> html_elements("section")
films
This returns a nodeset containing all matching elements. You can then extract data from each element.
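The most common CSS selector patterns can be tried on a small in-memory page built with rvest's minimal_html() helper (the page content below is made up):

```r
library(rvest)

page <- minimal_html('
  <div id="main">
    <p class="note">First note</p>
    <p class="note">Second note</p>
    <span>Not a note</span>
  </div>
')

page |> html_elements("p")      # by tag name: both paragraphs
page |> html_elements(".note")  # by class: both paragraphs
page |> html_elements("#main")  # by id: the single div
page |> html_elements("div p")  # descendant: paragraphs inside the div
```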
XPath Expressions
XPath offers more powerful selection capabilities. Use it when CSS selectors cannot express what you need:
# Select elements by attribute
films <- starwars |> html_elements(xpath = "//section")
# Select parent elements
titles <- starwars |> html_elements(xpath = "//h2/..")
# Select by position
first_film <- starwars |> html_elements(xpath = "//section[1]")
XPath expressions start with // to match nodes anywhere in the document. You can navigate the tree structure using / for children and .. for parents.
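These building blocks are easiest to see on a tiny offline document (the list below is invented for illustration):

```r
library(rvest)

page <- minimal_html("
  <ul>
    <li>alpha</li>
    <li>beta</li>
    <li>gamma</li>
  </ul>
")

# [n] selects by position within the parent element
page |> html_elements(xpath = "//li[2]") |> html_text2()
#> [1] "beta"

# .. climbs from each matched node to its parent (here, the <ul>);
# XPath returns unique nodes, so the shared parent appears once
page |> html_elements(xpath = "//li/..") |> html_name()
#> [1] "ul"
```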
Extracting Data from Elements
After selecting elements, extract the actual data you need. rvest provides several functions for this.
Extracting Text
Use html_text2() to get the text content of elements:
# Get film titles
titles <- films |>
html_element("h2") |>
html_text2()
titles
html_text2() handles whitespace better than the older html_text(). It normalizes spacing and removes extra newlines.
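The difference shows up on HTML that carries source-formatting whitespace (an invented snippet):

```r
library(rvest)

page <- minimal_html("<p>Hello\n      world</p>")

# html_text() preserves the raw whitespace from the source
page |> html_element("p") |> html_text()
#> [1] "Hello\n      world"

# html_text2() collapses it the way a browser would render it
page |> html_element("p") |> html_text2()
#> [1] "Hello world"
```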
Extracting Attributes
HTML elements have attributes that contain data. Use html_attr() to extract them:
# Get the data-id attribute from each heading
episode_ids <- films |>
html_element("h2") |>
html_attr("data-id")
episode_ids
This returns a character vector. Convert it to numeric if needed:
as.integer(episode_ids)
Common attributes include href for links, src for images, and class for CSS classes.
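For example, pulling every link's href from an invented page; xml2::url_absolute() then resolves relative links against the site's base URL:

```r
library(rvest)

page <- minimal_html('
  <a href="/about">About</a>
  <a href="https://example.com/docs">Docs</a>
')

hrefs <- page |> html_elements("a") |> html_attr("href")
hrefs
#> [1] "/about"                    "https://example.com/docs"

# Resolve relative links against the site root
xml2::url_absolute(hrefs, "https://example.com")
#> [1] "https://example.com/about" "https://example.com/docs"
```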
Extracting Tables
Many websites present data in HTML tables. The html_table() function converts them directly to data frames:
# Read Wikipedia page with a table
lego <- read_html("https://en.wikipedia.org/wiki/The_Lego_Movie")
# Extract the tracklist table
tracklist <- lego |>
html_element(".tracklist") |>
html_table()
tracklist
The resulting data frame has one column per table header. Clean column names and types as needed after extraction.
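The same mechanics work on any table. A self-contained sketch with an invented two-row table:

```r
library(rvest)

page <- minimal_html("
  <table>
    <tr><th>track</th><th>length</th></tr>
    <tr><td>Everything Is Awesome</td><td>2:43</td></tr>
    <tr><td>Untitled Self Portrait</td><td>1:21</td></tr>
  </table>
")

songs <- page |> html_element("table") |> html_table()
songs
# a tibble with one column per <th> header and one row per data <tr>
```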
Building a Complete Scraping Pipeline
Combine these functions into a complete pipeline that scrapes structured data:
scrape_films <- function() {
# Read the page
page <- read_html("https://rvest.tidyverse.org/articles/starwars.html")
# Extract films
films <- page |> html_elements("section")
# Extract multiple pieces of data
data.frame(
title = films |> html_element("h2") |> html_text2(),
episode = films |> html_element("h2") |> html_attr("data-id") |> as.integer(),
released = films |> html_element(xpath = ".//p[contains(., 'Released')]") |>
html_text2() |>
stringr::str_extract("\\d{4}-\\d{2}-\\d{2}")
)
}
film_data <- scrape_films()
film_data
This function returns a clean data frame ready for analysis. The XPath expression .//p[contains(., 'Released')] uses a relative path starting from each section element.
Handling Errors Gracefully
Web scraping involves many points of failure. Network timeouts, missing elements, and page changes can break your code. Handle these gracefully:
safe_scrape <- function(url, selector) {
tryCatch({
page <- read_html(url)
page |> html_elements(selector) |> html_text2()
},
error = function(e) {
message("Failed to scrape: ", url)
NA
})
}
For production pipelines, also handle HTTP errors that read_html() might throw. Check the HTTP status code before parsing:
check_and_read <- function(url) {
response <- httr::GET(url)
if (httr::status_code(response) != 200) {
stop("HTTP error: ", httr::status_code(response))
}
httr::content(response, as = "text") |>
read_html()
}
Respectful Scraping
Scraping puts load on web servers. Follow these practices to be a good citizen:
- Check robots.txt — This file tells you what a site allows bots to access. The polite package checks this automatically.
- Add delays between requests — Waiting 1-2 seconds between requests prevents overwhelming servers:
for (url in urls) {
# Scrape code here
Sys.sleep(2) # Wait 2 seconds
}
- Identify your scraper — Set a User-Agent header that describes your project. read_html() itself has no user_agent argument, so set the header when fetching, for example with httr:
response <- httr::GET(url, httr::user_agent("MyResearchProject (contact@example.com)"))
page <- read_html(httr::content(response, as = "text"))
- Cache your results — Once scraped, save the data locally. Do not re-scrape the same pages repeatedly.
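A minimal caching wrapper might look like the sketch below (the helper name and file handling are hypothetical, not part of rvest):

```r
library(rvest)

# Hypothetical helper: read from a local cache file when present,
# otherwise fetch the page once and save it for next time
cached_read <- function(url, cache_file) {
  if (file.exists(cache_file)) {
    return(read_html(cache_file))
  }
  page <- read_html(url)
  xml2::write_html(page, cache_file)
  page
}
```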
The polite package formalizes these practices:
library(polite)
session <- bow("https://example.com", user_agent = "MyProject/1.0")
session |>
scrape() |>
html_element(".content") |>
html_text2()
The bow() function checks robots.txt and creates a session that respects rate limits.
Common Problems
Selector finds nothing: The page structure likely changed. Inspect the page in your browser using F12 or View Source to find the correct selector.
Text extraction includes unwanted content: Use html_element() to drill down to the exact element before extracting text.
Table has merged cells: html_table() handles some cases but may need manual adjustment. Check the resulting data frame and fix issues in post-processing.
Page requires login: rvest can submit simple login forms using session(), html_form(), and session_submit(). For APIs that require tokens, use the httr2 package instead.
What Comes Next
You now know how to scrape web pages with rvest. From here, explore these areas:
- httr2 — For APIs and authenticated requests
- polite — For respectful scraping at scale
- xml2 — The underlying XML parsing library that powers rvest
- SelectorGadget — A browser extension that helps find CSS selectors
Web scraping opens up data sources that would otherwise require manual copying. With rvest, you can build reproducible pipelines that adapt when websites change.