Advanced Web Scraping with rvest and polite

· 4 min read · Updated March 13, 2026 · advanced

Web scraping is a technique for extracting data from websites programmatically. The rvest package provides R with web scraping capabilities modeled after Python’s Beautiful Soup, while polite adds a layer of respectfulness to your scraping workflow. This guide covers advanced techniques for building reliable, responsible web scrapers.

Setting Up rvest and polite

Install and load the required packages:

install.packages(c("rvest", "polite", "httr"))

library(rvest)
library(polite)
library(httr)

The polite package is built on three principles: seek permission, take slowly, and never ask twice. In practice it honors robots.txt, rate-limits requests, and caches responses so the same page is never fetched twice. This approach reduces the risk of IP bans and respects server resources.

Respectful Scraping with Delay and User-Agent

Always identify your scraper using a custom user-agent string. This tells site administrators who is making requests and gives them a way to contact you:

session <- bow("https://example.com",
  user_agent = "MyRScraper/1.0 (research purposes; contact: researcher@example.com)")

# Printing the session shows the user-agent, the robots.txt rules
# discovered for the site, and the crawl delay (polite defaults to
# a 5-second delay between requests)
session

Set a delay between requests to avoid overwhelming the server:

session <- bow("https://example.com",
  user_agent = "ResearchBot/1.0",
  delay = 2)  # 2 seconds between requests

# The effective delay is the larger of this value and any
# crawl-delay declared in the site's robots.txt

Parsing HTML with html_elements() and html_text2()

The rvest package uses CSS selectors and XPath to navigate HTML documents. The html_elements() function extracts nodes matching a selector, while html_text2() retrieves text with proper whitespace handling:

url <- "https://example.com"
page <- read_html(url)

# Extract all paragraph text
paragraphs <- page |>
  html_elements("p") |>
  html_text2()

# Extract headings by level
headings <- page |>
  html_elements("h2") |>
  html_text2()

# Extract links with their URLs
links <- page |>
  html_elements("a") |>
  html_attr("href")

# a character vector of href values, e.g. "/about"

For more complex selections, pass an XPath expression via the xpath argument:

# Extract all text within a specific div
content <- page |>
  html_elements(xpath = "//div[@class='content']") |>
  html_text2()
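The difference between html_text() and html_text2() is easiest to see on a self-contained document. The sketch below uses minimal_html(), rvest's helper for building small in-memory pages, so it runs without touching the network:

```r
library(rvest)

# Build a small in-memory document (no network needed)
doc <- minimal_html("<p>First   line<br>second line</p>")

# html_text() returns the raw text nodes: whitespace preserved as-is
html_elements(doc, "p") |> html_text()

# html_text2() renders text roughly the way a browser would:
# runs of whitespace collapse and <br> becomes a newline
html_elements(doc, "p") |> html_text2()
# [1] "First line\nsecond line"
```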

Handling Forms with html_form()

Many websites require form submission to access data. The rvest package provides functions to fill and submit forms:

# Discover forms on a page; html_form() returns a list with one
# entry per <form> element
page <- read_html("https://example.com/login")
form <- html_form(page)[[1]]

# Printing the form lists each field with its type and name
form

# Fill and submit a form
filled_form <- form |>
  html_form_set(username = "myuser", password = "mypassword")

response <- html_form_submit(filled_form)

For search forms, the pattern is similar:

search_page <- read_html("https://example.com/search")
search_form <- html_form(search_page)[[1]]

filled <- search_form |>
  html_form_set(q = "r programming")

# html_form_submit() returns an httr response; parse it with read_html()
results <- html_form_submit(filled) |> read_html()
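Because the snippets above point at placeholder URLs, the form workflow is easier to verify against an in-memory page. The field name q below is invented for the demo:

```r
library(rvest)

# A self-contained page with one search form
doc <- minimal_html('
  <form action="/search" method="get">
    <input type="text" name="q" value="">
    <input type="submit" name="go" value="Search">
  </form>
')

# html_form() finds the form; html_form_set() fills its fields
form <- html_form(doc)[[1]]
filled <- html_form_set(form, q = "r programming")

filled$fields$q$value
# [1] "r programming"
```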

Following Pagination Links

Scraping multiple pages requires detecting and following pagination links. A common pattern involves iterating through page URLs or following “next” links:

base_url <- "https://example.com/items?page="
all_items <- vector("list", 10)

for (i in 1:10) {
  url <- paste0(base_url, i)
  page <- nod(session, url) |> scrape()
  
  items <- page |>
    html_elements(".item") |>
    html_text2()
  
  all_items[[i]] <- items
  # No manual Sys.sleep() needed: scrape() already enforces the bow() delay
}

combined <- unlist(all_items)

Alternative approach using “next” button links:

scrape_page <- function(url) {
  page <- nod(session, url) |> scrape()
  
  items <- page |> html_elements(".product-title") |> html_text2()
  
  # Find the next page URL (character(0) when there is no next link)
  next_link <- page |>
    html_elements("a.next") |>
    html_attr("href")
  
  list(items = items,
       next_url = if (length(next_link) > 0) next_link[[1]] else NULL)
}
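A driver loop then calls this function repeatedly until next_url runs out. To keep the sketch runnable without a live site, the version below swaps nod()/scrape() for a lookup into three in-memory pages (the URLs, .item class, and fetch() helper are inventions for the demo); in practice, substitute the real scrape_page():

```r
library(rvest)

# Three in-memory "pages" stand in for a paginated site
pages <- list(
  "/items?page=1" = minimal_html('<p class="item">A</p><a class="next" href="/items?page=2">Next</a>'),
  "/items?page=2" = minimal_html('<p class="item">B</p><a class="next" href="/items?page=3">Next</a>'),
  "/items?page=3" = minimal_html('<p class="item">C</p>')
)

# In real use this would be: nod(session, url) |> scrape()
fetch <- function(url) pages[[url]]

all_items <- character(0)
url <- "/items?page=1"
while (!is.null(url)) {
  page <- fetch(url)
  all_items <- c(all_items, page |> html_elements(".item") |> html_text2())
  # Follow the "next" link until there isn't one
  next_link <- page |> html_elements("a.next") |> html_attr("href")
  url <- if (length(next_link) > 0) next_link[[1]] else NULL
}

all_items
# [1] "A" "B" "C"
```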

Error Handling for Failed Requests

Network requests fail for various reasons: timeouts, 404 errors, or server blocks. Wrap scraping code in tryCatch blocks:

safe_scrape <- function(url) {
  tryCatch({
    page <- nod(session, url) |> scrape()
    # polite::scrape() returns NULL (with a warning) when the path
    # is not scrapable or the request fails
    if (is.null(page)) {
      return(list(success = FALSE, error = "Request failed or path not scrapable"))
    }
    list(success = TRUE, data = page)
  }, error = function(e) {
    list(success = FALSE, error = conditionMessage(e))
  })
}

# Test with error handling
test <- safe_scrape("https://example.com")
test$success
# [1] TRUE

For transient errors, implement retry logic:

retry_scrape <- function(url, max_retries = 3) {
  for (attempt in 1:max_retries) {
    result <- safe_scrape(url)
    if (result$success) {
      return(result)
    }
    Sys.sleep(5 * 2 ^ (attempt - 1))  # Exponential backoff: 5, 10, 20 seconds
  }
  NULL
}
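The retry pattern itself can be tested without a network by injecting a fetcher that fails a set number of times before succeeding. The make_flaky() helper and the fetch argument are inventions for this demo, and the backoff is scaled down so it finishes quickly:

```r
# A fetcher that fails `fail_times` times, then succeeds
make_flaky <- function(fail_times) {
  attempts <- 0
  function(url) {
    attempts <<- attempts + 1
    if (attempts <= fail_times) {
      list(success = FALSE, error = "timeout")
    } else {
      list(success = TRUE, data = paste("contents of", url))
    }
  }
}

# Same loop as retry_scrape(), with the fetcher injected
retry_with <- function(url, fetch, max_retries = 3) {
  for (attempt in 1:max_retries) {
    result <- fetch(url)
    if (result$success) return(result)
    Sys.sleep(0.01 * 2 ^ (attempt - 1))  # exponential backoff (scaled down)
  }
  NULL
}

# Fails twice, succeeds on the third (and final) attempt
retry_with("https://example.com", make_flaky(2))$success
# [1] TRUE
```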

Practical Examples

Extracting Table Data

# Scrape and parse HTML tables
page <- read_html("https://example.com/data")
tables <- page |> html_table()

# First table as data frame
df <- tables[[1]]
#   Column1 Column2 Column3
# 1      A       B       C
# 2      D       E       F
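The same conversion can be checked offline against an in-memory table; the column names and values below are invented for the demo:

```r
library(rvest)

# html_table() turns <th> cells into column names and converts
# the remaining rows into a data frame (tibble)
doc <- minimal_html('
  <table>
    <tr><th>name</th><th>score</th></tr>
    <tr><td>alpha</td><td>1</td></tr>
    <tr><td>beta</td><td>2</td></tr>
  </table>
')

df <- html_table(doc)[[1]]
df$name
# [1] "alpha" "beta"
```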

Extracting JSON from Script Tags

Many sites embed data in JSON within script tags:

page <- read_html("https://example.com/dashboard")

json_data <- page |>
  html_elements(xpath = "//script[@id='data']") |>
  html_text()

parsed <- jsonlite::fromJSON(json_data)
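Here too, the pattern can be exercised offline; the script id "data" and the JSON payload below are invented for the demo:

```r
library(rvest)

# Embedded JSON in a script tag, as many dashboards ship it
doc <- minimal_html('
  <script id="data" type="application/json">
    {"users": 42, "active": true}
  </script>
')

# Select the script by id, take its text, and parse it as JSON
json_text <- doc |>
  html_elements(xpath = "//script[@id='data']") |>
  html_text()

parsed <- jsonlite::fromJSON(json_text)
parsed$users
# [1] 42
```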
