Web Scraping with rvest

· 5 min read · Updated March 11, 2026 · intermediate
rvest web-scraping html tidyverse data-collection

The internet contains vast amounts of data that never make it into tidy datasets. Web scraping lets you extract that data directly from HTML pages. The rvest package makes this process straightforward in R.

This guide teaches you how to scrape web pages using rvest. You will learn to navigate HTML documents, extract specific elements, handle different data types, and follow best practices that keep your scraping respectful.

Installing and Loading rvest

Install rvest from CRAN. It is part of the tidyverse, so installing the tidyverse also installs rvest:

install.packages("rvest")
install.packages("tidyverse")

rvest is not attached by library(tidyverse), so load it explicitly in your R session:

library(rvest)

Reading HTML Pages

The first step in any scraping task is reading the HTML content of a page. Use read_html() for this:

# Read a web page into R
page <- read_html("https://example.com")
page

read_html() accepts a URL, a local file path, or a literal HTML string, and returns an XML document object. This object contains the parsed HTML that you can query with selectors.

For pages with JavaScript-rendered content, rvest cannot help. The package only sees what the server sends in the initial HTML response. For JavaScript-heavy sites, consider using a headless browser like {chromote} or {selenium}.
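Recent versions of rvest (1.0.4 and later) also ship read_html_live(), which renders the page in a headless Chrome session via chromote before you query it. This is a sketch: it requires Chrome installed locally, and the URL and class name below are placeholders.

```r
library(rvest)

# read_html_live() (rvest >= 1.0.4) renders the page in headless Chrome
# via chromote, so JavaScript-inserted content becomes visible.
# The URL and the ".loaded-by-js" class are hypothetical.
page <- read_html_live("https://example.com/dynamic")

page |>
  html_elements(".loaded-by-js") |>
  html_text2()
```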

Selecting Elements

Once you have an HTML document, you need to find the elements containing your target data. rvest supports two selector systems: CSS selectors and XPath expressions.

CSS Selectors

CSS selectors are the more common choice. They use the same syntax that web developers use for styling:

# Read a page with movie data
starwars <- read_html("https://rvest.tidyverse.org/articles/starwars.html")

# Find all section elements
films <- starwars |> html_elements("section")
films

This returns a nodeset containing all matching elements. You can then extract data from each element.
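A handful of selector patterns cover most scraping tasks. The class and id names below are illustrative, not taken from the starwars page:

```r
# Common CSS selector patterns (class/id names are illustrative)
starwars |> html_elements("h2")            # by tag name
starwars |> html_elements(".sidebar")      # by class
starwars |> html_elements("#main")         # by id
starwars |> html_elements("section > h2")  # direct children of <section>
starwars |> html_elements("a[href]")       # elements that carry an attribute
```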

XPath Expressions

XPath offers more powerful selection capabilities. Use it when CSS selectors cannot express what you need:

# Select all section elements
films <- starwars |> html_elements(xpath = "//section")

# Select parent elements
titles <- starwars |> html_elements(xpath = "//h2/..")

# Select by position
first_film <- starwars |> html_elements(xpath = "//section[1]")

XPath expressions start with // to match nodes anywhere in the document. You can navigate the tree structure using / for children and .. for parents.
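XPath predicates can also match elements by their text content, which plain CSS selectors cannot do. Reusing the starwars page from above:

```r
# Find every paragraph whose text mentions "Released"
# (these paragraphs hold the release dates on the starwars page)
released <- starwars |>
  html_elements(xpath = "//p[contains(., 'Released')]") |>
  html_text2()

released
```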

Extracting Data from Elements

After selecting elements, extract the actual data you need. rvest provides several functions for this.

Extracting Text

Use html_text2() to get the text content of elements:

# Get film titles
titles <- films |>
  html_element("h2") |>
  html_text2()

titles

html_text2() handles whitespace better than the older html_text(). It normalizes spacing and removes extra newlines.
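You can see the difference using rvest's minimal_html() helper, which builds a small document from a string:

```r
library(rvest)

# A fragment with messy internal whitespace
doc <- minimal_html("<p>One  two\n  three</p>")

doc |> html_element("p") |> html_text()   # keeps the raw whitespace
doc |> html_element("p") |> html_text2()  # collapses it: "One two three"
```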

Extracting Attributes

HTML elements have attributes that contain data. Use html_attr() to extract them:

# Get the data-id attribute from each heading
episode_ids <- films |>
  html_element("h2") |>
  html_attr("data-id")

episode_ids

This returns a character vector. Convert it to numeric if needed:

as.integer(episode_ids)

Common attributes include href for links, src for images, and class for CSS classes.
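Link extraction is the most common case. Pages often use relative URLs, which xml2::url_absolute() can resolve against the page's base URL:

```r
# Collect every link on the page and resolve relative URLs
base_url <- "https://rvest.tidyverse.org/articles/starwars.html"

links <- starwars |>
  html_elements("a") |>
  html_attr("href") |>
  xml2::url_absolute(base = base_url)

links
```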

Extracting Tables

Many websites present data in HTML tables. The html_table() function converts them directly to data frames:

# Read Wikipedia page with a table
lego <- read_html("https://en.wikipedia.org/wiki/The_Lego_Movie")

# Extract the tracklist table
tracklist <- lego |>
  html_element(".tracklist") |>
  html_table()

tracklist

The resulting data frame has one column per table header. Clean column names and types as needed after extraction.
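html_table() also has arguments that help with messy tables: na.strings controls which cell values become NA, and convert controls whether rvest guesses column types.

```r
# Treat empty cells as NA and let rvest guess column types
tracklist <- lego |>
  html_element(".tracklist") |>
  html_table(na.strings = c("NA", ""), convert = TRUE)
```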

Building a Complete Scraping Pipeline

Combine these functions into a complete pipeline that scrapes structured data:

scrape_films <- function() {
  # Read the page
  page <- read_html("https://rvest.tidyverse.org/articles/starwars.html")
  
  # Extract films
  films <- page |> html_elements("section")
  
  # Extract multiple pieces of data
  data.frame(
    title = films |> html_element("h2") |> html_text2(),
    episode = films |> html_element("h2") |> html_attr("data-id") |> as.integer(),
    released = films |> html_element(xpath = ".//p[contains(., 'Released')]") |> 
      html_text2() |> 
      stringr::str_extract("\\d{4}-\\d{2}-\\d{2}")
  )
}

film_data <- scrape_films()
film_data

This function returns a clean data frame ready for analysis. The XPath expression .//p[contains(., 'Released')] uses a relative path starting from each section element.

Handling Errors Gracefully

Web scraping involves many points of failure. Network timeouts, missing elements, and page changes can break your code. Handle these gracefully:

safe_scrape <- function(url, selector) {
  tryCatch({
    page <- read_html(url)
    page |> html_elements(selector) |> html_text2()
  },
  error = function(e) {
    message("Failed to scrape ", url, ": ", conditionMessage(e))
    NA_character_
  })
}
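safe_scrape() then composes cleanly over a list of URLs, with a pause between requests. The second URL is a deliberately broken placeholder:

```r
urls <- c(
  "https://rvest.tidyverse.org/articles/starwars.html",
  "https://example.invalid/missing-page"  # placeholder that will fail
)

results <- lapply(urls, function(url) {
  out <- safe_scrape(url, "h2")
  Sys.sleep(2)  # be gentle with the servers
  out
})
```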

For production pipelines, also handle HTTP errors that read_html() might throw. Check the HTTP status code before parsing:

check_and_read <- function(url) {
  response <- httr::GET(url)
  if (httr::status_code(response) != 200) {
    stop("HTTP error: ", httr::status_code(response))
  }
  httr::content(response, as = "text") |>
    read_html()
}

Respectful Scraping

Scraping puts load on web servers. Follow these practices to be a good citizen:

  1. Check robots.txt — This file tells you what a site allows bots to access. The polite package checks this automatically.

  2. Add delays between requests — Waiting 1-2 seconds between requests prevents overwhelming servers:

for (url in urls) {
  # Scrape code here
  Sys.sleep(2)  # Wait 2 seconds
}
  3. Identify your scraper — Set a User-Agent header that describes your project. read_html() has no user-agent argument, so make the request with httr and parse the response:

response <- httr::GET(url, httr::user_agent("MyResearchProject (contact@example.com)"))
page <- read_html(response)

  4. Cache your results — Once scraped, save the data locally. Do not re-scrape the same pages repeatedly.

The polite package formalizes these practices:

library(polite)

session <- bow("https://example.com", user_agent = "MyProject/1.0")

session |>
  scrape() |>
  html_element(".content") |>
  html_text2()

The bow() function checks robots.txt and creates a session that respects rate limits.
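When you scrape several paths on the same host, reuse the session with nod(), which re-checks permissions for the new path without a fresh handshake. The path below is hypothetical:

```r
# Reuse the polite session for another path on the same host
# ("another-page" is a placeholder path)
session |>
  nod(path = "another-page") |>
  scrape() |>
  html_element(".content") |>
  html_text2()
```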

Common Problems

Selector finds nothing: The page structure likely changed. Inspect the page in your browser's developer tools (F12) or View Source to find a selector that still matches.

Text extraction includes unwanted content: Use html_element() to drill down to the exact element before extracting text.

Table has merged cells: html_table() handles some cases but may need manual adjustment. Check the resulting data frame and fix issues in post-processing.

Page requires login: rvest's session() and html_form() helpers can handle simple form-based logins and keep cookies across requests. For APIs that require tokens, use the httr2 package instead.
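One caveat on authentication: while token-based APIs are out of scope for rvest, its session helpers can manage a simple form login. This is a sketch with hypothetical URLs and field names:

```r
library(rvest)

# Hypothetical login flow: the URL and the field names
# "username"/"password" are placeholders
login_page <- session("https://example.com/login")
form <- html_form(login_page)[[1]]
filled <- html_form_set(form, username = "me", password = "secret")
logged_in <- session_submit(login_page, filled)

# The session keeps cookies, so later requests stay authenticated
logged_in |>
  session_jump_to("https://example.com/private") |>
  html_elements(".data")
```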

What Comes Next

You now know how to scrape web pages with rvest. From here, explore these areas:

  • httr2 — For APIs and authenticated requests
  • polite — For respectful scraping at scale
  • xml2 — The underlying XML parsing library that powers rvest
  • SelectorGadget — A browser extension that helps find CSS selectors

Web scraping opens up data sources that would otherwise require manual copying. With rvest, you can build reproducible pipelines that are easy to rerun and to repair when websites change.
