Advanced Web Scraping with rvest and polite
Web scraping is a technique for extracting data from websites programmatically. The rvest package provides R with web scraping capabilities modeled after Python’s Beautiful Soup, while polite adds a layer of respectfulness to your scraping workflow. This guide covers advanced techniques for building reliable, responsible web scrapers.
Setting up rvest and polite
Install and load the required packages:
install.packages(c("rvest", "polite", "httr"))
library(rvest)
library(polite)
library(httr)
The polite package is built on three principles: seeking permission (checking robots.txt before fetching), taking slowly (enforcing a delay between requests), and never asking twice (caching responses so the same page is not requested again). This approach reduces the risk of IP bans and respects server resources.
Respectful scraping with delay and user-agent
Always identify your scraper using a custom user-agent string. This practice is professional and helps site administrators track traffic:
session <- bow("https://example.com",
               user_agent = "MyRScraper/1.0 (research purposes; contact: researcher@example.com)")
# Print the session to verify it (output abbreviated)
session
# <polite session> https://example.com
#     User-agent: MyRScraper/1.0 (research purposes; contact: researcher@example.com)
#    Crawl delay: 5 sec (the default when delay is not set)
Set a delay between requests to avoid overwhelming the server:
session <- bow("https://example.com",
               user_agent = "ResearchBot/1.0",
               delay = 2) # at least 2 seconds between requests
# The effective delay is the larger of this value and any crawl delay in robots.txt
Parsing HTML with html_elements() and html_text2()
The rvest package uses CSS selectors and XPath to navigate HTML documents. The html_elements() function extracts nodes matching a selector, while html_text2() retrieves text with proper whitespace handling:
url <- "https://example.com"
page <- read_html(url)
# Extract all paragraph text
paragraphs <- page |>
html_elements("p") |>
html_text2()
# Extract headings by level
headings <- page |>
html_elements("h2") |>
html_text2()
# Extract links with their URLs
links <- page |>
html_elements("a") |>
html_attr("href")
# [1] "/about"
For more complex selections, pass an XPath expression to the xpath argument instead of a CSS selector:
# Extract all text within a specific div
content <- page |>
html_elements(xpath = "//div[@class='content']") |>
html_text2()
Handling forms with html_form()
Many websites require form submission to access data. The rvest package provides functions to fill and submit forms:
# Start an rvest session so cookies persist across requests
login_session <- rvest::session("https://example.com/login")
# html_form() returns a list of every form on the page; take the first
form <- html_form(login_session)[[1]]
# Print the form to see its fields (for example username, password, submit)
form
# Fill and submit the form
filled_form <- form |>
  html_form_set(username = "myuser", password = "mypassword")
response <- session_submit(login_session, filled_form)
For search forms, the pattern is similar:
search_session <- rvest::session("https://example.com/search")
search_form <- html_form(search_session)[[1]]
filled <- search_form |>
  html_form_set(q = "r programming")
results <- session_submit(search_session, filled)
Navigating pagination
Scraping multiple pages requires detecting and following pagination links. A common pattern involves iterating through page URLs or following “next” links:
base_url <- "https://example.com/items?page="
all_items <- vector("list", 10)
for (i in 1:10) {
  url <- paste0(base_url, i)
  # scrape() already waits for the session's delay, so no Sys.sleep() is needed
  page <- nod(session, url) |> scrape()
  items <- page |>
    html_elements(".item") |>
    html_text2()
  all_items[[i]] <- items
}
combined <- unlist(all_items)
Alternative approach using “next” button links:
scrape_page <- function(url) {
  page <- nod(session, url) |> scrape()
  items <- page |> html_elements(".product-title") |> html_text2()
  # Find the next page URL; html_element() (singular) gives NA when no link matches
  next_link <- page |>
    html_element("a.next") |>
    html_attr("href")
  list(items = items, next_url = next_link)
}
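The helper above returns one page of items plus the next URL, so a loop is still needed to walk the chain. The following is a minimal sketch of that loop, not a definitive implementation. It assumes a hypothetical fetch_page argument (a function that takes a URL and returns a parsed page, for example function(u) nod(session, u) |> scrape()) and reuses the .product-title and a.next selectors from the example above.

```r
library(rvest)

# Follow "next" links until none is found, a URL repeats, or max_pages is hit.
# `fetch_page` is a hypothetical function(url) that returns a parsed page.
scrape_all_pages <- function(start_url, fetch_page, max_pages = 50) {
  url <- start_url
  visited <- character()
  items <- character()
  while (!is.na(url) && !(url %in% visited) && length(visited) < max_pages) {
    page <- fetch_page(url)
    if (is.null(page)) break  # stop on a failed or disallowed request
    visited <- c(visited, url)
    items <- c(items, page |> html_elements(".product-title") |> html_text2())
    # html_element() (singular) yields NA when there is no matching link
    next_href <- page |> html_element("a.next") |> html_attr("href")
    # Resolve relative links such as "/items?page=2" against the current URL
    url <- if (is.na(next_href)) NA_character_ else xml2::url_absolute(next_href, url)
  }
  items
}
```

Passing the fetch function in as an argument keeps the loop independent of how pages are retrieved, so the same code can be exercised against saved or in-memory HTML.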
Error handling for failed requests
Network requests fail for various reasons: timeouts, 404 errors, or server blocks. Wrap scraping code in tryCatch blocks:
safe_scrape <- function(url) {
  tryCatch({
    page <- nod(session, url) |> scrape()
    # scrape() returns NULL (with a warning or message) when the path is
    # disallowed or the server responds with an HTTP error such as 404
    if (is.null(page)) {
      list(success = FALSE, error = paste("No content returned for", url))
    } else {
      list(success = TRUE, data = page)
    }
  }, error = function(e) {
    # Connection problems and timeouts surface as R errors
    list(success = FALSE, error = conditionMessage(e))
  })
}
# Test with error handling
test <- safe_scrape("https://example.com")
test$success
# [1] TRUE
For transient errors, implement retry logic:
retry_scrape <- function(url, max_retries = 3) {
  for (attempt in seq_len(max_retries)) {
    result <- safe_scrape(url)
    if (result$success) {
      return(result)
    }
    if (attempt < max_retries) {
      Sys.sleep(5 * 2^(attempt - 1)) # Exponential backoff: 5, 10, 20, ... seconds
    }
  }
  NULL
}
Practical examples
Extracting table data
# Scrape and parse HTML tables
page <- read_html("https://example.com/data")
tables <- page |> html_table()
# First table as data frame
df <- tables[[1]]
# Column1 Column2 Column3
# 1 A B C
# 2 D E F
Extracting JSON from script tags
Many sites embed data in JSON within script tags:
page <- read_html("https://example.com/dashboard")
# html_element() (singular) returns one node, so json_data is a single string
json_data <- page |>
  html_element(xpath = "//script[@id='data']") |>
  html_text()
parsed <- jsonlite::fromJSON(json_data)
Handling pagination
Paginated scraping requires a loop or recursion. For URL-based pagination (?page=1, ?page=2), build URLs with sprintf() and iterate: map(1:n_pages, ~ read_html(sprintf("https://site.com?page=%d", .x))). For “next page” link-based pagination, extract the next URL from the page and stop when no next link is found. rvest::session_follow_link() handles this naturally within a session that maintains cookies.
Respecting server limits
Add Sys.sleep(runif(1, 1, 3)) between requests to introduce a randomized delay. Fixed delays (e.g., exactly 1 second) are sometimes detected as bot traffic; variable delays appear more human. Check robots.txt with robotstxt::get_robotstxt("https://site.com") and robotstxt::paths_allowed() to verify your target URLs are permitted. Set a descriptive User-Agent header to identify your scraper: httr::user_agent("MyResearchBot/1.0 (contact@example.com)").
Storing intermediate results
Write scraped pages to disk immediately after fetching and parse later. This separates the I/O-heavy scraping phase from the parsing phase, so a parse error does not require re-fetching. Save raw HTML with writeLines(as.character(page), paste0("pages/page_", i, ".html")). read_html("pages/page_1.html") parses from disk during the parse pass. This pattern also allows restarting a failed scrape from the last saved page.
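A minimal sketch of the two-pass pattern described above, under stated assumptions: the helper names save_page() and parse_saved_pages() are hypothetical, and the pages/ directory and the "p" selector are placeholders. The fetch pass only writes files; the parse pass only reads them.

```r
library(rvest)

# Fetch pass: write a parsed page back out as raw HTML and return the path.
save_page <- function(page, i, dir = "pages") {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  path <- file.path(dir, paste0("page_", i, ".html"))
  writeLines(as.character(page), path)
  path
}

# Parse pass: read every saved file from disk and extract text.
parse_saved_pages <- function(dir = "pages", selector = "p") {
  files <- list.files(dir, pattern = "\\.html$", full.names = TRUE)
  lapply(files, function(f) {
    read_html(f) |> html_elements(selector) |> html_text2()
  })
}
```

Because the parse pass never touches the network, the selector can be changed and the pass rerun as often as needed without re-fetching anything.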
Parsing complex structures
Not all scraped data fits neatly into tables. rvest::html_elements() with nested CSS selectors extracts hierarchical data. For each container element, extract sub-elements with a second html_elements() call. Build the result row by row: map_dfr(containers, ~ tibble(title = html_text2(html_element(.x, ".title")), price = html_text2(html_element(.x, ".price")))) extracts multiple fields from each item container into a data frame.
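The container-then-field approach can also be written without purrr, because html_element() is vectorized over a set of containers. The sketch below uses in-memory HTML built with minimal_html(); the .item, .title and .price class names are assumptions made for illustration.

```r
library(rvest)

# Hypothetical markup: each .item container holds a .title and maybe a .price
page <- minimal_html('
  <div class="item"><span class="title">Widget</span><span class="price">$5</span></div>
  <div class="item"><span class="title">Gadget</span></div>
')

containers <- html_elements(page, ".item")

# html_element() (singular) returns exactly one result per container,
# with NA where a field is missing, so the columns stay aligned.
items <- data.frame(
  title = containers |> html_element(".title") |> html_text2(),
  price = containers |> html_element(".price") |> html_text2(),
  stringsAsFactors = FALSE
)
```

Calling html_elements() (plural) for each field instead would silently drop missing values and misalign the rows, which is why the singular form is used inside each container.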
Error recovery
Production scrapers must handle errors without crashing. Wrap individual page requests in tryCatch(): page <- tryCatch(read_html(url), error = function(e) { message("Failed: ", url, " - ", e$message); NULL }). Check the result before parsing: if (is.null(page)) return(NULL). Log failures to a file with cat(url, "FAILED", date(), "\n", file = "errors.log", append = TRUE). A failed scrape should save progress up to the failure point so the run can be resumed.
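The pieces above can be combined into one small helper. This is a sketch only: fetch_or_log() and fetch_all() are hypothetical names, and errors.log is an assumed log location.

```r
library(rvest)

# Fetch one URL (or local path); on error, append a line to the log
# and return NULL instead of stopping the run.
fetch_or_log <- function(url, log_file = "errors.log") {
  tryCatch(
    read_html(url),
    error = function(e) {
      cat(url, "FAILED", format(Sys.time()), conditionMessage(e), "\n",
          file = log_file, append = TRUE)
      NULL
    }
  )
}

# Process a vector of URLs and keep only the pages that were fetched.
fetch_all <- function(urls, log_file = "errors.log") {
  pages <- lapply(urls, fetch_or_log, log_file = log_file)
  names(pages) <- urls
  Filter(Negate(is.null), pages)
}
```

The returned list is named by URL, so the failures are simply the URLs missing from names(result), and the log file records why each one failed.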
Dynamic JavaScript pages
rvest handles static HTML well, but many modern web pages render content with JavaScript after the initial HTML loads. Tools like React and Vue produce minimal initial HTML and populate content via API calls. rvest cannot access this content because it downloads the initial HTML only.
RSelenium drives a real browser from R, executing JavaScript and waiting for dynamic content. rsDriver() starts a browser session: rD <- rsDriver(browser = "chrome", chromever = "latest"). client <- rD$client gives the WebDriver client. client$navigate(url) loads a page; client$findElement(using = "css selector", value = ".classname") finds elements; client$getPageSource() returns a list whose first element is the rendered HTML, so read_html(client$getPageSource()[[1]]) hands it to rvest for parsing.
chromote is a newer alternative that controls Chrome via the Chrome DevTools Protocol without the Selenium overhead. Because it needs no separate Selenium server or driver binary, it is often simpler to set up and keep working across Chrome updates.
For sites that load data via XHR/fetch requests, the network tab in browser DevTools reveals the actual API endpoint. Often the API returns JSON directly, and httr2::req_perform() accesses it more reliably than browser automation.
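A minimal sketch of calling such an endpoint with httr2. The URL and the page and per_page parameters are hypothetical stand-ins for whatever the network tab actually shows; the request is built here but only performed in the commented lines.

```r
library(httr2)

# Hypothetical endpoint and query parameters copied from the DevTools network tab
req <- request("https://example.com/api/items") |>
  req_url_query(page = 1, per_page = 50) |>
  req_headers(Accept = "application/json") |>
  req_user_agent("MyResearchBot/1.0 (contact@example.com)")

# The full URL that will be requested
req$url

# When ready, send the request and parse the JSON body:
# resp  <- req_perform(req)
# items <- resp_body_json(resp, simplifyVector = TRUE)
```

Building the request separately from performing it makes it easy to check the URL and headers before any traffic is sent.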
Rate limiting and polite scraping
Scraping without rate limiting can overwhelm servers and get your IP banned. polite wraps rvest with automatic polite behavior: it reads robots.txt and respects crawl delays, caches pages, and identifies the user agent.
session <- bow("https://example.com", user_agent = "mybot/1.0 contact@example.com") checks the site’s robots.txt for scraping permissions. scrape(session, query = list(page = 2)) then fetches the page with the configured delay; query takes a named list of URL parameters. The polite package makes it easy to do the right thing by default.
Manual rate limiting: Sys.sleep(runif(1, 1, 3)) between requests waits 1-3 random seconds, which is less detectable than a fixed delay and avoids synchronized request bursts.
Authentication and sessions
Some sites require login before allowing access to content. rvest::session() maintains a cookie-based session across requests. session_submit() submits a form, which is useful for login forms. After login, subsequent session_jump_to() and session_follow_link() calls reuse the established session.
For sites using modern authentication (OAuth, JWT), capture the access token from a browser session (using browser DevTools network tab) and pass it as an Authorization header in httr2 requests.
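One way this might look with httr2, as a sketch under stated assumptions: authorized_request() is a hypothetical helper, the endpoint is a placeholder, and the token is read from an environment variable named SCRAPER_API_TOKEN (an arbitrary name) so it never appears in the script.

```r
library(httr2)

# Build a request that carries a bearer token in the Authorization header.
# The environment variable name is an assumption; use whatever suits your setup.
authorized_request <- function(url, token = Sys.getenv("SCRAPER_API_TOKEN")) {
  if (!nzchar(token)) {
    stop("No token found: set SCRAPER_API_TOKEN or pass `token` explicitly")
  }
  request(url) |>
    req_auth_bearer_token(token) |>
    req_user_agent("MyResearchBot/1.0 (contact@example.com)")
}

# Usage (not run):
# resp <- authorized_request("https://example.com/api/protected") |> req_perform()
# data <- resp_body_json(resp)
```

Tokens captured from a browser session usually expire, so expect to refresh the value periodically.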
Parsing strategies
Complex HTML often requires multiple extraction passes. html_elements() selects a set of elements (it supersedes the older html_nodes()); map over them to extract structured data from each. For tables, html_table() converts a <table> element to a data frame directly.
xml2::xml_find_all(html, ".//td[position()=2]") uses XPath for more complex selections. XPath can select nodes based on sibling position, ancestor attributes, and text content, which makes it more powerful than CSS selectors for complex structural queries.
For extracting data from paginated lists, identify the next-page link pattern and loop: next_url <- page |> html_element(".next-page") |> html_attr("href"). Stop when next_url is NA (no next link was found) or is a URL that has already been visited.
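To make the position and sibling selections concrete, here is a small sketch run against in-memory HTML. The table contents are invented for illustration.

```r
library(rvest)

# Hypothetical two-column table
page <- minimal_html('
  <table>
    <tr><td>Alice</td><td>42</td></tr>
    <tr><td>Bob</td><td>17</td></tr>
  </table>
')

# Selection by position: the second cell of every row
second_cells <- xml2::xml_find_all(page, ".//td[position()=2]") |> html_text2()

# Selection by text content and sibling axis:
# the cell that follows the one whose text is "Bob"
bob_value <- page |>
  html_elements(xpath = "//td[text()='Bob']/following-sibling::td") |>
  html_text2()
```

The second query has no direct CSS equivalent, since CSS selectors cannot match on an element's text.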
Handling errors and retries
Network requests fail intermittently. tryCatch() wraps requests; on error, log the URL and continue. For systematic collection, store completed URLs in a local database and skip them on rerun. This makes the scraper resumable after interruption.
httr2::req_retry(req, max_tries = 3, is_transient = httr2::resp_is_error) retries failed requests automatically with exponential backoff. Pair with req_throttle(req, rate = 1) to limit to one request per second.
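A sketch of how retry and throttling might be combined on a single request object. The URL is a placeholder, and treating only 429 and 503 as transient (rather than every error status) is an assumption made here to avoid retrying permanent failures such as 404.

```r
library(httr2)

req <- request("https://example.com/items") |>
  req_user_agent("MyResearchBot/1.0 (contact@example.com)") |>
  # At most one request per second
  req_throttle(rate = 1) |>
  req_retry(
    max_tries = 3,
    # Retry only on "too many requests" and "service unavailable"
    is_transient = function(resp) resp_status(resp) %in% c(429, 503),
    # Wait 2^attempt seconds, where attempt counts the failures so far
    backoff = function(attempt) 2^attempt
  )

# The policies travel with the request object:
# resp <- req_perform(req)
```

Because both policies are attached to the request, every call to req_perform() on it is throttled and retried the same way.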
Save raw HTML to disk before parsing, so you can re-parse without re-downloading. This is especially valuable during development when your parsing code changes frequently.
Advanced scraping patterns
Basic web scraping with rvest handles static HTML pages. Advanced scraping addresses the complications that arise with real websites: JavaScript-rendered content that is not in the HTML source, rate limiting and bot detection that blocks automated requests, authentication requirements, and pagination across many pages.
JavaScript rendering is the most common obstacle. Many modern websites load content dynamically after the page loads: the initial HTML is a shell, and JavaScript fetches and renders the actual content. rvest reads the initial HTML and does not execute JavaScript, so dynamically rendered content is invisible to it. Tools like Selenium (through RSelenium) and headless Chrome (through chromote, which speaks the Chrome DevTools Protocol) drive a real browser that executes JavaScript, allowing access to fully rendered pages.
Session management and cookies
Some websites require login sessions to access content. Managing cookies lets you authenticate once and access protected pages in subsequent requests. httr2’s cookie handling stores cookies from a login response and sends them with subsequent requests. For more complex session management — CSRF tokens, session refresh, multi-step login flows — using a headless browser through chromote handles these automatically because it behaves exactly as a real browser does.
Rotating user-agent strings and adding realistic request headers reduces the likelihood of bot detection. Websites identify automated clients by their HTTP headers — the User-Agent string, Accept headers, and connection timing. Adding realistic headers that match what a real browser sends makes requests look more like organic traffic. However, this is in tension with websites’ legitimate interest in identifying and managing scrapers, so check the website’s terms of service before deploying aggressive anti-detection techniques.
Handling pagination and rate limits
Systematic scraping of many pages requires handling pagination and rate limits without crashing the process. Rate limiting with Sys.sleep between requests avoids triggering rate limit responses. Using httr2’s req_throttle as part of the request object applies the limit consistently across the session. For request failures with 429 (Too Many Requests) responses, exponential backoff — waiting progressively longer before retrying — gives the server time to reset the rate limit counter.
Saving scraped data incrementally to disk rather than accumulating in memory prevents losing all work if the process is interrupted. Writing one file per page, or appending to a running output file, allows resuming from where the process left off. Tracking which pages have been scraped in a log file prevents re-scraping the same pages after a restart.
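The incremental-save and log-file ideas can be sketched in base R as follows. The function name scrape_resumable(), the scrape_one argument (a hypothetical function that takes a URL and returns that page's results), and the output/ layout are all assumptions made for illustration.

```r
# Scrape each URL once, writing one file per page and logging completed URLs.
# `scrape_one` is a hypothetical function(url) returning that page's results.
scrape_resumable <- function(urls, scrape_one, out_dir = "output",
                             log_file = file.path(out_dir, "done.log")) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  done <- if (file.exists(log_file)) readLines(log_file) else character()
  todo <- setdiff(urls, done)  # skip anything finished in an earlier run
  for (url in todo) {
    result <- tryCatch(scrape_one(url), error = function(e) NULL)
    if (is.null(result)) next  # not logged, so it is retried on the next run
    out_file <- file.path(out_dir, paste0("page_", match(url, urls), ".rds"))
    saveRDS(result, out_file)
    # Log only after the result is safely on disk
    cat(url, "\n", sep = "", file = log_file, append = TRUE)
  }
  invisible(todo)
}
```

Writing the log entry after saveRDS() means an interruption between the two steps costs at most one repeated request, never a lost page.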
See also
- base-readline - Reading interactive input
- stringr-gsub - Text replacement for cleaning scraped data
- dplyr-filter - Filtering extracted data frames