Advanced Web Scraping with rvest and polite
Web scraping is a technique for extracting data from websites programmatically. The rvest package provides R with web scraping capabilities modeled after Python’s Beautiful Soup, while polite adds a layer of respectfulness to your scraping workflow. This guide covers advanced techniques for building reliable, responsible web scrapers.
Setting Up rvest and polite
Install and load the required packages:
install.packages(c("rvest", "polite", "httr"))
library(rvest)
library(polite)
library(httr)
The polite package is built on three principles: seeking permission (honoring robots.txt), taking slowly (rate-limiting requests), and never asking twice (caching responses). This approach reduces the risk of IP bans and respects server resources.
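A minimal end-to-end sketch of that workflow (the URL is a placeholder): bow() once per host, then scrape() for each page.

```r
library(polite)
library(rvest)

# Introduce the scraper to the host; bow() reads robots.txt
session <- bow("https://example.com", user_agent = "MyRScraper/1.0")

# Fetch the page; scrape() honors the crawl delay automatically
page <- scrape(session)

# scrape() returns an xml_document (or NULL on failure) that rvest can parse
if (!is.null(page)) {
  title <- page |> html_element("title") |> html_text2()
}
```

Because scrape() handles rate limiting and caching itself, the calling code stays a simple pipeline.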
Respectful Scraping with Delay and User-Agent
Always identify your scraper using a custom user-agent string. This practice is professional and helps site administrators track traffic:
session <- bow("https://example.com",
user_agent = "MyRScraper/1.0 (research purposes; contact: researcher@example.com)")
# Verify the session: printing it shows the target URL, the user-agent,
# whether robots.txt permits scraping, and the crawl delay
# (5 seconds by default)
session
Set a delay between requests to avoid overwhelming the server:
session <- bow("https://example.com",
user_agent = "ResearchBot/1.0",
delay = 2) # 2 seconds between requests
# The effective delay is the maximum of this value and any
# crawl-delay declared in robots.txt
Parsing HTML with html_elements() and html_text2()
The rvest package uses CSS selectors and XPath to navigate HTML documents. The html_elements() function extracts nodes matching a selector, while html_text2() retrieves text with proper whitespace handling:
url <- "https://example.com"
page <- read_html(url)
# Extract all paragraph text
paragraphs <- page |>
html_elements("p") |>
html_text2()
# Extract headings by level
headings <- page |>
html_elements("h2") |>
html_text2()
# Extract links with their URLs
links <- page |>
html_elements("a") |>
html_attr("href")
# Returns a character vector of href values (possibly relative, e.g. "/about")
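These selectors can be exercised offline: rvest's minimal_html() builds a small in-memory page, which is handy for testing extraction code before pointing it at a live site.

```r
library(rvest)

doc <- minimal_html('
  <h2>Products</h2>
  <p>First paragraph.</p>
  <p>Second <a href="/about">link</a>.</p>
')

doc |> html_elements("p") |> html_text2()
# [1] "First paragraph." "Second link."

doc |> html_elements("a") |> html_attr("href")
# [1] "/about"
```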
For more complex selections, pass an XPath expression through the xpath argument of html_elements():
# Extract all text within a specific div
content <- page |>
html_elements(xpath = "//div[@class='content']") |>
html_text2()
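Extracted href values are often relative (e.g. "/about"). The url_absolute() function from xml2 (the package rvest is built on) resolves them against the page URL:

```r
library(xml2)

url_absolute("/about", "https://example.com")
# [1] "https://example.com/about"
```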
Handling Forms with html_form()
Many websites require form submission to access data. The rvest package provides functions to fill and submit forms. Note that submission goes through an rvest session() object, which is separate from the polite session created by bow():
# Discover forms on a page
login_session <- session("https://example.com/login")
form <- html_form(login_session)[[1]] # html_form() returns a list of forms
# Printing the form shows its fields
form
# Fields here: username, password, submit
# Fill and submit a form
filled_form <- form |>
html_form_set(username = "myuser", password = "mypassword")
response <- session_submit(login_session, filled_form)
For search forms, the pattern is similar:
search_session <- session("https://example.com/search")
search_form <- html_form(search_session)[[1]]
filled <- search_form |>
html_form_set(q = "r programming")
results <- session_submit(search_session, filled)
Navigating Pagination
Scraping multiple pages requires detecting and following pagination links. A common pattern involves iterating through page URLs or following “next” links:
base_url <- "https://example.com/items?page="
all_items <- vector("list", 10)
for (i in 1:10) {
url <- paste0(base_url, i)
page <- nod(session, url) |> scrape()
items <- page |>
html_elements(".item") |>
html_text2()
all_items[[i]] <- items
# No explicit Sys.sleep() needed: scrape() already enforces the delay set in bow()
}
combined <- unlist(all_items)
Alternative approach using “next” button links:
scrape_page <- function(url) {
page <- nod(session, url) |> scrape()
items <- page |> html_elements(".product-title") |> html_text2()
# Find next page URL; html_element() yields NA via html_attr() when absent
next_link <- page |>
html_element("a.next") |>
html_attr("href")
list(items = items, next_url = next_link)
}
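A driver loop for scrape_page() might look like this sketch; it follows next_url until no link remains, with a page cap as a safety stop (the cap value is an assumption):

```r
crawl <- function(start_url, max_pages = 50) {
  all_items <- list()
  url <- start_url
  pages_seen <- 0
  while (!is.null(url) && pages_seen < max_pages) {
    res <- scrape_page(url)
    all_items[[length(all_items) + 1]] <- res$items
    # Stop when no next link was found (character(0) or NA)
    url <- res$next_url
    if (length(url) != 1 || is.na(url)) url <- NULL
    pages_seen <- pages_seen + 1
  }
  unlist(all_items)
}
```

The cap prevents an infinite crawl if a site links its last page back to itself.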
Error Handling for Failed Requests
Network requests fail for various reasons: timeouts, 404 errors, or server blocks. Wrap scraping code in tryCatch blocks:
safe_scrape <- function(url) {
tryCatch({
page <- nod(session, url) |> scrape()
if (is.null(page)) {
# scrape() warns and returns NULL when the request is blocked or fails
list(success = FALSE, error = "request failed or was not permitted")
} else {
list(success = TRUE, data = page)
}
}, error = function(e) {
list(success = FALSE, error = e$message)
})
}
# Test with error handling
test <- safe_scrape("https://example.com")
# $success
# [1] TRUE
For transient errors, implement retry logic:
retry_scrape <- function(url, max_retries = 3) {
for (attempt in 1:max_retries) {
result <- safe_scrape(url)
if (result$success) {
return(result)
}
Sys.sleep(2^attempt) # Exponential backoff: 2, 4, 8 seconds
}
NULL # All attempts failed
}
Practical Examples
Extracting Table Data
# Scrape and parse HTML tables
page <- read_html("https://example.com/data")
tables <- page |> html_table()
# First table as data frame
df <- tables[[1]]
# Column1 Column2 Column3
# 1 A B C
# 2 D E F
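As with selectors, html_table() can be tried offline against an inline table built with minimal_html():

```r
library(rvest)

doc <- minimal_html('
  <table>
    <tr><th>Name</th><th>Score</th></tr>
    <tr><td>A</td><td>1</td></tr>
    <tr><td>B</td><td>2</td></tr>
  </table>
')

df <- html_table(doc)[[1]]
# A 2-row tibble with columns Name and Score; the header row
# becomes the column names and numeric columns are parsed
```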
Extracting JSON from Script Tags
Many sites embed data in JSON within script tags:
page <- read_html("https://example.com/dashboard")
json_data <- page |>
html_elements(xpath = "//script[@id='data']") |>
html_text()
parsed <- jsonlite::fromJSON(json_data)
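If the selector matches nothing, html_text() returns character(0) and fromJSON() fails with an unhelpful error, so a length check is a worthwhile guard. extract_embedded_json() below is a hypothetical wrapper, not part of rvest:

```r
library(rvest)

# Hypothetical helper: pull JSON embedded in a <script> tag by CSS selector
extract_embedded_json <- function(page, selector) {
  json_text <- page |>
    html_elements(selector) |>
    html_text()
  if (length(json_text) != 1) {
    warning("expected one matching <script> element, found ", length(json_text))
    return(NULL)
  }
  jsonlite::fromJSON(json_text)
}
```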
See Also
- base-readline - Reading interactive input
- stringr-gsub - Text replacement for cleaning scraped data
- dplyr-filter - Filtering extracted data frames