Build a Web Scraper with rvest

· 6 min read · Updated March 17, 2026 · intermediate
rvest web-scraping data-collection tidyverse automation

A web scraper automates data extraction from websites. While the rvest package makes basic scraping straightforward, building a robust scraper that handles multiple pages, avoids getting blocked, and stores data properly requires more thought.

In this project, you will build a complete web scraping application that extracts product data from an e-commerce style website. You will learn to navigate multiple pages, handle pagination, respect rate limits, and save your scraped data in a usable format.

What You Will Build

By the end of this guide, you will have a scraper that:

  • Extracts structured data from multiple pages
  • Handles pagination automatically
  • Uses polite scraping practices to avoid overloading servers
  • Saves data to CSV and JSON formats
  • Includes error handling for robust production use

This pattern applies to any website you need to scrape.

Project Setup

First, create a new R script for your project and install the necessary packages:

# Install required packages
install.packages(c("rvest", "tidyverse", "polite", "jsonlite"))

# Load libraries
library(rvest)
library(tidyverse)
library(polite)
library(jsonlite)

The polite package handles the respectful side of scraping—checking robots.txt, setting appropriate delays, and identifying your scraper.

Inspecting the Target Website

Before writing any code, you need to understand the website structure. Open your browser and navigate to the page you want to scrape. Use the developer tools (F12) to inspect the HTML.

For this project, we will scrape a demo page. The principles apply to any real website you target:

# Test with rvest demo page
demo_url <- "https://rvest.tidyverse.org/articles/starwars.html"

# Check if we can access it
page <- read_html(demo_url)
page

You should see the HTML content returned. The page contains sections with film information—title, episode number, and description.
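To see that nesting without hand-reading the raw HTML, xml2 (installed alongside rvest) can print an indented outline of a node:

```r
# Print an indented outline of the first film section
page |>
  html_element("section") |>
  xml2::html_structure()
```

This is a quick way to confirm which tags hold the title, episode number, and description before writing selectors.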

Building the Scraper Function

Create a function that extracts film data from a single page:

scrape_films <- function(url) {
  # Use polite to bow to the site first
  session <- bow(url)
  
  # Scrape the page
  page <- session |>
    scrape()
  
  # Extract each film section
  films <- page |>
    html_elements("section")
  
  # Extract data from each film
  film_data <- map_dfr(films, function(film) {
    tibble(
      title = film |>
        html_element("h2") |>
        html_text2(),
      episode = film |>
        html_element("h2") |>
        html_attr("data-id") |>
        as.integer(),
      description = film |>
        html_element("p") |>
        html_text2()
    )
  })
  
  film_data
}

# Test the function
films <- scrape_films(demo_url)
head(films)

The function returns a tidy data frame with one row per film. The map_dfr() function loops over each section and combines the results into one data frame.

Handling Pagination

Most websites spread data across multiple pages. You need to detect pagination and follow links to scrape everything.

First, identify how the pagination works. Common patterns include:

  • Next/Previous buttons with specific URLs
  • Page numbers that follow a predictable pattern
  • “Load More” functionality (requires JavaScript, not covered here)
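For the Next/Previous pattern, one approach is to follow the "next" link until it disappears. This is a sketch, not code for a specific site: the `a.next` selector is an assumption you would replace with whatever your target uses.

```r
follow_next_links <- function(start_url, max_pages = 20) {
  urls <- character()
  url <- start_url
  for (i in seq_len(max_pages)) {
    urls <- c(urls, url)
    page <- read_html(url)
    # Hypothetical selector for the "Next" button; adapt per site
    next_link <- page |>
      html_element("a.next") |>
      html_attr("href")
    if (is.na(next_link)) break
    # Resolve relative links against the current page
    url <- url_absolute(next_link, url)
    Sys.sleep(1)  # pause between requests
  }
  urls
}
```

`html_attr()` returns `NA` when the element is missing, which is what ends the loop on the last page.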

Create a function that finds all page URLs:

find_page_urls <- function(base_url, max_pages = 5) {
  # For the starwars demo, pages follow a pattern
  urls <- paste0(base_url, "?page=", 1:max_pages)
  
  # Check which pages actually exist
  valid_urls <- urls |>
    map_lgl(function(url) {
      tryCatch({
        page <- read_html(url)
        html_elements(page, "section") |>
          length() > 0
      }, error = function(e) FALSE)
    })
  
  urls[valid_urls]
}

# Get all page URLs
all_pages <- find_page_urls(demo_url, max_pages = 3)
all_pages

For real websites, you would need to adapt this to their specific pagination structure.


Building the Complete Pipeline

Combine everything into a single pipeline that scrapes all pages:

scrape_all_films <- function(base_url, max_pages = 10) {
  # Find all page URLs
  page_urls <- find_page_urls(base_url, max_pages)
  
  message("Found ", length(page_urls), " pages to scrape")
  
  # Scrape each page with a delay
  all_films <- page_urls |>
    map_dfr(function(url) {
      message("Scraping: ", url)
      scrape_films(url)
    })
  
  all_films
}

# Run the complete pipeline
all_films <- scrape_all_films(demo_url, max_pages = 3)
nrow(all_films)

The pipeline adds a polite delay between requests automatically through the bow() function.
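bow() defaults to a 5-second delay and will respect any longer crawl-delay declared in robots.txt. You can set the delay explicitly, and reuse a single session across paths on the same host with nod(), which avoids re-reading robots.txt for every page:

```r
# One session per host, with an explicit 2-second delay
session <- bow("https://rvest.tidyverse.org", delay = 2)

# nod() switches path within the same session
films_page <- session |>
  nod("articles/starwars.html") |>
  scrape()
```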

Saving Scraped Data

After scraping, save the data in useful formats. CSV for analysis, JSON for web applications:

# Save to CSV
write_csv(all_films, "starwars_films.csv")

# Save to JSON
write_json(all_films, "starwars_films.json", pretty = TRUE)

# Verify the saved files
list.files(pattern = "starwars")

For larger projects, consider saving to a database instead.
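A sketch of the database option, assuming the DBI and RSQLite packages are installed (they are not part of this project's setup):

```r
library(DBI)

# Open (or create) a local SQLite file
con <- dbConnect(RSQLite::SQLite(), "scraper.db")

# Append each scraping run to a films table
dbWriteTable(con, "films", all_films, append = TRUE)

# Read back for analysis later
dbGetQuery(con, "SELECT COUNT(*) AS n FROM films")

dbDisconnect(con)
```

Appending with a timestamp column also gives you a history of runs, which flat files make awkward.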

Adding Robust Error Handling

Production scrapers must handle failures gracefully. Add retry logic and error tracking:

scrape_with_retry <- function(url, max_retries = 3) {
  result <- tryCatch({
    scrape_films(url)
  },
  error = function(e) {
    message("Error scraping ", url, ": ", e$message)
    
    # Wait briefly, then retry
    if (max_retries > 0) {
      Sys.sleep(2)
      scrape_with_retry(url, max_retries - 1)
    } else {
      NULL
    }
  })
  
  result
}

# Safe wrapper for multiple URLs
scrape_multiple_safe <- function(urls) {
  urls |>
    map(~scrape_with_retry(.x)) |>
    discard(is.null) |>
    bind_rows()
}

This handles temporary network issues and gives up gracefully after too many failures.
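An alternative to hand-rolled tryCatch is purrr's safely(), which wraps a function so failures come back as values instead of errors. Using the scrape_films() function and all_pages URLs from earlier:

```r
safe_scrape <- safely(scrape_films)

results <- map(all_pages, safe_scrape)

# Keep the successes and combine them
films_ok <- results |>
  map("result") |>
  discard(is.null) |>
  bind_rows()

# Report which URLs failed
failed <- all_pages[map_lgl(results, ~ !is.null(.x$error))]
if (length(failed) > 0) {
  message("Failed URLs: ", paste(failed, collapse = ", "))
}
```

Each element of `results` is a list with `result` and `error` components, so nothing is silently dropped.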

Scraping Real Websites

When scraping real websites, you need to adapt the selectors to match their HTML structure. Here’s a general workflow:

  1. Check robots.txt to see what’s allowed
  2. Identify your target elements using browser developer tools
  3. Test selectors in R before building the full pipeline
  4. Add delays between requests
  5. Handle errors for missing elements
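Step 1 can be done from R as well. bow() checks robots.txt for you, but the standalone robotstxt package (an extra dependency, not used elsewhere in this project) lets you query specific paths directly:

```r
library(robotstxt)

# Is scraping this path allowed for generic user agents?
paths_allowed("https://rvest.tidyverse.org/articles/starwars.html")
```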

For example, to scrape a typical e-commerce product listing:

scrape_products <- function(url) {
  session <- bow(url)
  page <- session |> scrape()
  
  # Adapt these selectors to your target site
  products <- page |> html_elements(".product-item")
  
  map_dfr(products, function(p) {
    tibble(
      name = p |> html_element(".product-name") |> html_text2(),
      price = p |> html_element(".product-price") |> html_text2(),
      link = p |> html_element("a") |> html_attr("href")
    )
  })
}

The CSS classes (.product-item, .product-name, etc.) would differ for each website.

Respectful Scraping Practices

Always scrape responsibly:

  1. Check robots.txt first — the polite package does this automatically
  2. Identify your scraper with a meaningful User-Agent
  3. Add delays between requests (1-2 seconds is reasonable)
  4. Cache results — save locally and don’t re-scrape the same page
  5. Respect rate limits — if you get 429 errors, slow down
  6. Don’t scrape behind login without permission

Identification (point 2) is an argument to bow():

# Proper polite session with identification
session <- bow(
  "https://example.com",
  user_agent = "MyResearchProject/1.0 (contact@example.com)"
)
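Caching (point 4) can be as simple as saving the raw HTML to disk and re-parsing it locally on later runs, so each page is only fetched once. A minimal sketch using xml2's write_html(), with the cache directory name as an assumption:

```r
cache_read_html <- function(url, cache_dir = "data/raw") {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  # Derive a file name from the URL
  file <- file.path(cache_dir, paste0(make.names(url), ".html"))
  if (!file.exists(file)) {
    page <- read_html(url)
    xml2::write_html(page, file)
  }
  read_html(file)
}
```

Re-running your parsing code against cached files also makes selector debugging much faster.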

Here’s a more realistic example that scrapes GitHub’s trending page:

scrape_github_trending <- function(language = "r", pages = 1) {
  base_url <- paste0("https://github.com/trending/", language, "?since=weekly")
  
  all_repos <- map_dfr(1:pages, function(page_num) {
    url <- if (page_num == 1) base_url else paste0(base_url, "&page=", page_num)
    
    session <- bow(url, user_agent = "RScraper/1.0")
    page <- session |> scrape()
    
    page |>
      html_elements(".Box-row") |>
      map_dfr(function(repo) {
        tibble(
          name = repo |>
            html_element("h2 a") |>
            html_text(trim = TRUE),
          description = repo |>
            html_element("p") |>
            html_text(trim = TRUE),
          stars = repo |>
            html_element(".Link--muted span") |>
            html_text(trim = TRUE) |>
            str_remove_all(",") |>
            as.integer()
        )
      })
  })
  
  all_repos
}

# This would scrape GitHub trending repositories
# Note: GitHub structure may change; adjust selectors as needed

Project Structure

For larger scraping projects, organize your code:

web-scraper/
├── R/
│   ├── scrape.R        # Scraping functions
│   ├── parse.R         # HTML parsing logic
│   └── save.R          # Data saving functions
├── data/
│   └── raw/            # Raw HTML snapshots
├── output/             # Processed data
├── scripts/
│   └── run.R           # Main execution script
└── README.md

This separation makes it easier to maintain and update your scraper when websites change.
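With that layout, scripts/run.R stays short. A minimal sketch, reusing the functions and packages from this guide (it assumes the scraping functions live in R/scrape.R):

```r
# scripts/run.R -- main entry point
library(rvest)
library(tidyverse)
library(polite)
library(jsonlite)

source("R/scrape.R")

all_films <- scrape_all_films(
  "https://rvest.tidyverse.org/articles/starwars.html",
  max_pages = 3
)

write_csv(all_films, "output/starwars_films.csv")
write_json(all_films, "output/starwars_films.json", pretty = TRUE)
```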

Summary

You built a complete web scraper that:

  • Uses rvest for HTML parsing
  • Uses polite for respectful scraping
  • Handles multiple pages with pagination
  • Includes error handling and retry logic
  • Saves data to CSV and JSON

The key patterns—bowing to pages, extracting elements with CSS selectors, and looping over multiple pages—apply to scraping any website. Remember to always respect website policies and rate limits.