
How to Build Web Scrapers with rvest

A web scraper automates data extraction from websites. While the rvest package makes basic scraping straightforward, building a reliable scraper that handles multiple pages, avoids getting blocked, and stores data properly requires more thought.

In this project, you will build web scrapers that extract structured data from real websites. You will learn to navigate multiple pages, handle pagination, respect rate limits, and save your scraped data in a usable format.

What you will build

By the end of this guide, you will have a scraper that:

  • Extracts structured data from multiple pages
  • Handles pagination automatically
  • Uses polite scraping practices to avoid overloading servers
  • Saves data to CSV and JSON formats
  • Includes error handling for reliable production use

This pattern applies to most static websites you need to scrape.

Project setup

First, create a new R script for your project and install the necessary packages:

# Install required packages
install.packages(c("rvest", "tidyverse", "polite", "jsonlite"))

# Load libraries
library(rvest)
library(tidyverse)
library(polite)
library(jsonlite)

The polite package handles the respectful side of scraping: checking robots.txt, setting appropriate delays, and identifying your scraper.

Inspecting the target website

Before writing any code, you need to understand the website structure. Open your browser and navigate to the page you want to scrape. Use the developer tools (F12) to inspect the HTML.

For this project, we will scrape a demo page. The principles apply to any real website you target:

# Test with rvest demo page
demo_url <- "https://rvest.tidyverse.org/articles/starwars.html"

# Check if we can access it
page <- read_html(demo_url)
page

You should see the parsed HTML document printed in the console. The page holds one <section> per film, each containing the title (an <h2> whose data-id attribute gives the episode number), the release date, the director, and the opening crawl. Mapping this DOM structure to R is the foundation of scraping: each selector you write typically becomes one column in your eventual data frame. The next step is to encapsulate this extraction logic inside a reusable function.

Building the web scraper function

The core of any web scraper is a function that extracts structured data from a single page:

scrape_films <- function(url) {
  # Use polite to bow to the site first
  session <- bow(url)
  
  # Scrape the page
  page <- session |>
    scrape()
  
  # Extract each film section
  films <- page |>
    html_elements("section")
  
  # Extract data from each film
  film_data <- map_dfr(films, function(film) {
    tibble(
      title = film |>
        html_element("h2") |>
        html_text2(),
      episode = film |>
        html_element("h2") |>
        html_attr("data-id") |>
        as.integer(),
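      # The first <p> in each section holds the release date,
      # so select the opening crawl by its class instead
      description = film |>
        html_element(".crawl") |>
        html_text2()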
    )
  })
  
  film_data
}

# Test the function
films <- scrape_films(demo_url)
head(films)

The function returns a tidy data frame with one row per film. The map_dfr() function loops over each section and combines the results into one data frame.

Handling pagination

Most websites spread data across multiple pages. You need to detect pagination and follow links to scrape everything.

First, identify how the pagination works. Common patterns include:

  • Next/Previous buttons with specific URLs
  • Page numbers that follow a predictable pattern
  • “Load More” functionality (requires JavaScript, not covered here)

Create a function that finds all page URLs:

find_page_urls <- function(base_url, max_pages = 5) {
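  # Assumes ?page=N pagination. The starwars demo is a single static page,
  # so it likely ignores the parameter and returns the same content each time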
  urls <- paste0(base_url, "?page=", 1:max_pages)
  
  # Check which pages actually exist
  valid_urls <- urls |>
    map_lgl(function(url) {
      tryCatch({
        page <- read_html(url)
        html_elements(page, "section") |>
          length() > 0
      }, error = function(e) FALSE)  # the handler must be a function
    })
  
  urls[valid_urls]
}

# Get all page URLs
all_pages <- find_page_urls(demo_url, max_pages = 3)
all_pages

For real websites, you would need to adapt this to their specific pagination structure; some sites use ?page=N query parameters, others rely on relative links inside Next buttons, and a few load data through JavaScript APIs that require a headless browser. With the page URL builder working, you can now feed those URLs into the main scraping pipeline that pulls structured data from every page and combines the results.

Building the complete pipeline

Combine everything into a single pipeline that scrapes all pages:

scrape_all_films <- function(base_url, max_pages = 10) {
  # Find all page URLs
  page_urls <- find_page_urls(base_url, max_pages)
  
  message("Found ", length(page_urls), " pages to scrape")
  
  # Scrape each page with a delay
  all_films <- page_urls |>
    map_dfr(function(url) {
      message("Scraping: ", url)
      scrape_films(url)
    })
  
  all_films
}

# Run the complete pipeline
all_films <- scrape_all_films(demo_url, max_pages = 3)
nrow(all_films)

The pipeline waits between requests through polite: bow() reads the site's robots.txt and sets the delay to the larger of its delay argument (5 seconds by default) and any crawl-delay directive, and scrape() enforces it. On a site with many pages this adds several seconds per request, which lowers the load on the server and the chance of hitting rate limits. Once the data frames are collected, the next step is persisting them to disk for later analysis.
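The polite documentation describes a pattern of calling bow() once per host and then nod() for each page, rather than bowing to every URL. A minimal sketch of that idea follows; the helper name scrape_paths_politely() is hypothetical, and it assumes all paths live on the same host:

```r
library(polite)
library(purrr)

# Hypothetical helper: introduce the scraper to the host once,
# then move the same session to each path before scraping it
scrape_paths_politely <- function(host, paths, delay = 5) {
  session <- bow(host, delay = delay)

  map(paths, function(path) {
    session |>
      nod(path) |>   # update the session to point at this path
      scrape()       # fetch it, honoring the session's delay
  })
}
```

Calling scrape_paths_politely("https://rvest.tidyverse.org", "/articles/starwars.html") should return a list with one parsed page, which you can then pass to your extraction code.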

Saving scraped data

After scraping, save the data in useful formats: CSV for analysis and JSON for web applications.

# Save to CSV
write_csv(all_films, "starwars_films.csv")

# Save to JSON
write_json(all_films, "starwars_films.json", pretty = TRUE)

# Verify the saved files
list.files(pattern = "starwars")

For larger projects, consider saving to a database instead; CSV and JSON work well for one-off collections, but a SQLite database lets you append new rows incrementally and resume interrupted scrapes from where they left off. Beyond storage concerns, no production scraper is complete without defenses against network failures and unexpected HTTP error codes.

Adding reliable error handling

Production scrapers must handle failures gracefully. Add retry logic and error tracking:

scrape_with_retry <- function(url, max_retries = 3) {
  result <- tryCatch({
    scrape_films(url)
  },
  error = function(e) {
    message("Error scraping ", url, ": ", e$message)
    
    # Retry after a fixed 2-second pause
    if (max_retries > 0) {
      Sys.sleep(2)
      scrape_with_retry(url, max_retries - 1)
    } else {
      NULL
    }
  })
  
  result
}

# Safe wrapper for multiple URLs
scrape_multiple_safe <- function(urls) {
  urls |>
    map(~scrape_with_retry(.x)) |>
    discard(is.null) |>
    bind_rows()
}

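This retries up to max_retries times after the first failure, pausing 2 seconds between attempts, and returns NULL if every attempt fails so that scrape_multiple_safe() can drop that URL.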

Scraping real websites

When scraping real websites, you need to adapt the selectors to match their HTML structure. Here’s a general workflow:

  1. Check robots.txt to see what’s allowed
  2. Identify your target elements using browser developer tools
  3. Test selectors in R before building the full pipeline
  4. Add delays between requests
  5. Handle errors for missing elements

For example, to scrape a typical e-commerce product listing:

scrape_products <- function(url) {
  session <- bow(url)
  page <- session |> scrape()
  
  # Adapt these selectors to your target site
  products <- page |> html_elements(".product-item")
  
  map_dfr(products, function(p) {
    tibble(
      name = p |> html_element(".product-name") |> html_text2(),
      price = p |> html_element(".product-price") |> html_text2(),
      link = p |> html_element("a") |> html_attr("href")
    )
  })
}

The CSS classes (.product-item, .product-name, etc.) would differ for each website.
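Steps 3 and 5 of the workflow above can be tried offline before any request is sent. The sketch below uses rvest's minimal_html() with a hand-written fragment standing in for a real listing; the product names and markup are invented for illustration. It shows that html_element() returns a missing node, and therefore NA, when an item lacks a field:

```r
library(rvest)
library(tibble)

# A hand-written fragment standing in for a real product listing;
# the second item deliberately has no price element
sample_page <- minimal_html('
  <div class="product-item">
    <a href="/p/1"><span class="product-name">Kettle</span></a>
    <span class="product-price">$25.00</span>
  </div>
  <div class="product-item">
    <a href="/p/2"><span class="product-name">Toaster</span></a>
  </div>
')

products <- sample_page |> html_elements(".product-item")

# html_element() is vectorized over the node set and keeps one result
# per product, so a missing price becomes NA instead of shifting rows
parsed <- tibble(
  name  = products |> html_element(".product-name") |> html_text2(),
  price = products |> html_element(".product-price") |> html_text2(),
  link  = products |> html_element("a") |> html_attr("href")
)

parsed
```

Once the selectors behave as expected on a small fragment, the same calls can be moved into the scraping function.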

Respectful scraping practices

Always scrape responsibly:

  1. Check robots.txt first; the polite package does this automatically
  2. Identify your scraper with a meaningful User-Agent
  3. Add delays between requests (1-2 seconds is reasonable)
  4. Cache results: save pages locally and don’t re-scrape the same page
  5. Respect rate limits: if you get 429 errors, slow down
  6. Don’t scrape behind a login without permission

# Proper polite session with identification
session <- bow(
  "https://example.com",
  user_agent = "MyResearchProject/1.0 (contact@example.com)"
)
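
Caching (item 4 in the list above) can be sketched as a small disk cache that stores each page's raw HTML the first time it is fetched. The helper name read_html_cached(), its fetch argument, and the file-naming scheme are all assumptions made for this illustration; in a polite pipeline you would pass a function that bows and scrapes instead of plain read_html():

```r
library(rvest)  # re-exports read_html()
library(xml2)   # write_html()

# Hypothetical helper: return the cached copy of a page if one exists,
# otherwise fetch it once with `fetch` and store the raw HTML on disk
read_html_cached <- function(url, cache_dir = "data/raw", fetch = read_html) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)

  # Turn the URL into a file-system-safe file name
  file_name <- paste0(gsub("[^A-Za-z0-9]+", "_", url), ".html")
  path <- file.path(cache_dir, file_name)

  # Only hit the network when there is no snapshot yet
  if (!file.exists(path)) {
    write_html(fetch(url), path)
  }

  read_html(path)
}
```

The simple naming scheme can map two different URLs to the same file, so a hash of the URL may be a safer key in a larger project.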

Here is a more realistic sketch that pulls data from GitHub’s trending repositories page, a real-world target that changes daily. The function extracts repository names, descriptions, and star counts into a structured data frame, showing how the patterns you built earlier adapt to live web pages. The CSS selectors and the page query parameter are assumptions about GitHub’s markup and URL scheme, so verify both in your browser before relying on them.

scrape_github_trending <- function(language = "r", pages = 1) {
  base_url <- paste0("https://github.com/trending/", language, "?since=weekly")
  
  # page_num is the loop counter; page (below) is the parsed HTML document
  all_repos <- map_dfr(seq_len(pages), function(page_num) {
    url <- if (page_num == 1) base_url else paste0(base_url, "&page=", page_num)
    
    session <- bow(url, user_agent = "RScraper/1.0")
    page <- session |> scrape()
    
    page |>
      html_elements(".Box-row") |>
      map_dfr(function(repo) {
        tibble(
          name = repo |>
            html_element("h2 a") |>
            html_text(trim = TRUE),
          description = repo |>
            html_element("p") |>
            html_text(trim = TRUE),
          stars = repo |>
            html_element(".Link--muted span") |>
            html_text(trim = TRUE) |>
            str_remove_all(",") |>
            as.integer()
        )
      })
  })
  
  all_repos
}

# This would scrape GitHub trending repositories
# Note: GitHub structure may change; adjust selectors as needed

Project structure

For larger scraping projects, keeping your code organized prevents technical debt as the scraper grows. Separating concerns into dedicated files (one for fetching pages, one for parsing HTML, and one for saving data) means a site redesign usually touches only the parsing code. A typical project layout looks like this:

web-scraper/
├── R/
│   ├── scrape.R        # Scraping functions
│   ├── parse.R         # HTML parsing logic
│   └── save.R          # Data saving functions
├── data/
│   └── raw/            # Raw HTML snapshots
├── output/             # Processed data
├── scripts/
│   └── run.R           # Main execution script
└── README.md

This separation makes it easier to maintain and update your scraper when websites change.

Error handling in scrapers

Production scrapers encounter errors: pages that do not exist, server errors, rate limiting, and structure changes when the target website updates. tryCatch() and purrr::possibly() handle these gracefully. Wrap the scraping function with possibly(scrape_page, otherwise = NULL) so that it returns NULL on error instead of stopping the whole run, then call compact(results) to remove the failed pages.
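
The possibly() and compact() combination can be sketched offline. Here scrape_page() is a hypothetical stand-in that fails on purpose for one URL, and the example.com addresses are placeholders:

```r
library(purrr)
library(tibble)
library(dplyr)

# Stand-in for a real scraping function: fails for one URL on purpose
scrape_page <- function(url) {
  if (grepl("missing", url)) stop("HTTP 404 for ", url)
  tibble(url = url, title = paste("Title of", basename(url)))
}

# possibly() returns a wrapped function that yields NULL on error
safe_scrape <- possibly(scrape_page, otherwise = NULL)

urls <- c(
  "https://example.com/page-1",
  "https://example.com/missing",
  "https://example.com/page-2"
)

results <- map(urls, safe_scrape)

# compact() drops the NULL left behind by the failed page
scraped <- results |>
  compact() |>
  bind_rows()
```

The failed URL simply disappears from the output, which is why logging it separately is worthwhile.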

Log errors with the URL and error message for later investigation: tryCatch(scrape(url), error = function(e) { message("Failed: ", url, " - ", conditionMessage(e)); NULL }). For large scraping projects, a retry mechanism with exponential backoff handles transient failures: retry after 1s, then 2s, then 4s, and give up if the third retry also fails.
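
One way to write the exponential backoff described above is a small wrapper around any zero-argument function. The name with_backoff() and its arguments are hypothetical; with the defaults it waits 1s, 2s, and 4s between attempts:

```r
# Hypothetical helper: call f() until it succeeds, doubling the wait
# after each failure; returns NULL once all retries are used up
with_backoff <- function(f, max_retries = 3, base_delay = 1) {
  for (attempt in 0:max_retries) {
    result <- tryCatch(
      list(value = f(), ok = TRUE),
      error = function(e) {
        message("Attempt ", attempt + 1, " failed: ", conditionMessage(e))
        list(value = NULL, ok = FALSE)
      }
    )

    if (result$ok) return(result$value)

    # Wait before the next attempt, but not after the last one
    if (attempt < max_retries) Sys.sleep(base_delay * 2^attempt)
  }

  NULL
}
```

A call such as with_backoff(function() scrape_films(url)) would wrap the scraper built earlier in this guide.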

Data storage patterns

For small scrapers (hundreds to thousands of pages), storing results in a data frame and writing to CSV or RDS works well. For large scrapers (tens of thousands of pages), SQLite via RSQLite provides a more reliable storage layer: insert each page’s results as you go, and queries let you inspect progress and resume interrupted jobs.

The append pattern: dbWriteTable(con, "results", new_rows, append = TRUE) inserts new rows without overwriting previous results. Combined with a scraped_urls table to track which pages have been processed, the scraper can resume from where it left off after an interruption.
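
A minimal sketch of the append-and-resume pattern follows, using an in-memory SQLite database so that it runs without leaving files behind. The helpers save_page() and pending_urls(), the column names, and the example.com URLs are assumptions made for illustration:

```r
library(DBI)
library(RSQLite)

# An in-memory database keeps the sketch self-contained; a real scraper
# would pass a file path here so results survive between runs
con <- dbConnect(SQLite(), ":memory:")

# Table recording which URLs have already been processed
dbExecute(con, "CREATE TABLE IF NOT EXISTS scraped_urls (url TEXT PRIMARY KEY)")

# Hypothetical helper: store one page's rows, then mark its URL as done
save_page <- function(con, url, new_rows) {
  dbWriteTable(con, "results", new_rows, append = TRUE)
  dbExecute(con, "INSERT INTO scraped_urls (url) VALUES (?)", params = list(url))
}

# Hypothetical helper: the subset of `urls` not yet scraped
pending_urls <- function(con, urls) {
  done <- dbGetQuery(con, "SELECT url FROM scraped_urls")$url
  setdiff(urls, done)
}

urls <- c("https://example.com/1", "https://example.com/2")
save_page(con, urls[1], data.frame(title = "First", episode = 1L))

remaining <- pending_urls(con, urls)   # only the second URL is left
n_results <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM results")$n

dbDisconnect(con)
```

On restart, the scraper would call pending_urls() first and loop only over what it returns.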

For scheduling regular scrapes, use cronR (for Linux/macOS) or taskscheduleR (for Windows) to run the scraper script on a schedule. Scheduled runs benefit most from database storage (SQLite or PostgreSQL), since repeated scrapes are easier to query and deduplicate there than in flat files.

Iterating over multiple pages

Scraping structured data from multiple pages follows a pattern: build a list of URLs, apply a scraping function to each, and bind the results. Using purrr::map() with a scraping function and bind_rows() on the result produces a single data frame from all pages. Adding Sys.sleep() between requests avoids overwhelming the server.

When the list of URLs is not known in advance (for example, when each page links to the next one), use a while loop: fetch the current page, extract its data, read the next-page link, and repeat until no link is found. Include a maximum iteration count to prevent infinite loops if the next-page detection fails.
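
The while-loop pattern can be sketched as follows. The helper name scrape_following_next(), the a.next selector, and the choice to collect h2 headings are assumptions; the fetch argument lets you pass a polite scraping function, or a fake one for testing:

```r
library(rvest)
library(xml2)  # url_absolute()

# Hypothetical helper: follow "next" links starting at start_url.
# `fetch` turns a URL into a parsed page; `next_selector` is an assumed
# CSS selector for the next-page link and will differ per site.
scrape_following_next <- function(start_url, fetch = read_html,
                                  next_selector = "a.next", max_pages = 50) {
  pages <- list()
  url <- start_url

  # max_pages caps the loop even if next-page detection misbehaves
  while (!is.na(url) && length(pages) < max_pages) {
    page <- fetch(url)
    pages[[length(pages) + 1]] <- page |> html_elements("h2") |> html_text2()

    # html_attr() yields NA when the link is missing, which ends the loop
    next_href <- page |> html_element(next_selector) |> html_attr("href")
    url <- if (is.na(next_href)) NA_character_ else url_absolute(next_href, url)
  }

  unlist(pages)
}
```

url_absolute() resolves relative links such as /page/2 against the current URL, which covers sites whose Next buttons do not carry full addresses.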

Summary

You now know how to build a web scraper with rvest, from single-page extraction to multi-page pipelines with retry logic. rvest provides a readable, CSS-selector-based interface for HTML scraping that integrates naturally with the tidyverse. The core workflow is: read_html() to fetch and parse, html_elements() to select nodes, and html_text2() or html_attr() to extract content. rvest::html_table() extracts all HTML tables from a page as a list of data frames, useful for scraping tabular data without manually selecting cells.

Always check the site’s robots.txt and terms of service before scraping, add delays between requests (through polite or Sys.sleep()), and handle errors with tryCatch() to make the scraper resilient to transient failures.

For JavaScript-rendered pages, whose content read_html() does not see, you need a tool that drives a browser: chromote (chromote::ChromoteSession provides programmatic browser control) or RSelenium.

See also