Build a Web Scraper with rvest
A web scraper automates data extraction from websites. While the rvest package makes basic scraping straightforward, building a robust scraper that handles multiple pages, avoids getting blocked, and stores data properly requires more thought.
In this project, you will build a complete web scraping application that extracts product data from an e-commerce style website. You will learn to navigate multiple pages, handle pagination, respect rate limits, and save your scraped data in a usable format.
What You Will Build
By the end of this guide, you will have a scraper that:
- Extracts structured data from multiple pages
- Handles pagination automatically
- Uses polite scraping practices to avoid overloading servers
- Saves data to CSV and JSON formats
- Includes error handling for robust production use
This pattern applies to any website you need to scrape.
Project Setup
First, create a new R script for your project and install the necessary packages:
# Install required packages
install.packages(c("rvest", "tidyverse", "polite", "jsonlite"))
# Load libraries
library(rvest)
library(tidyverse)
library(polite)
library(jsonlite)
The polite package handles the respectful side of scraping—checking robots.txt, setting appropriate delays, and identifying your scraper.
Inspecting the Target Website
Before writing any code, you need to understand the website structure. Open your browser and navigate to the page you want to scrape. Use the developer tools (F12) to inspect the HTML.
For this project, we will scrape a demo page. The principles apply to any real website you target:
# Test with rvest demo page
demo_url <- "https://rvest.tidyverse.org/articles/starwars.html"
# Check if we can access it
page <- read_html(demo_url)
page
You should see the HTML content returned. The page contains sections with film information—title, episode number, and description.
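Before writing the full extraction function, you can prototype your selectors offline on a small HTML snippet using rvest's minimal_html() helper (the snippet below is an invented stand-in that mimics the demo page's structure):

```r
library(rvest)

# A tiny stand-in for the demo page, for testing selectors without network access
snippet <- minimal_html('
  <section><h2 data-id="4">A New Hope</h2><p>First film.</p></section>
  <section><h2 data-id="5">The Empire Strikes Back</h2><p>Second film.</p></section>
')

# Pull the title out of each section, just as the real scraper will
snippet |>
  html_elements("section") |>
  html_element("h2") |>
  html_text2()
```

If the selectors behave on the snippet, they should behave the same way on any live page with that structure.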
Building the Scraper Function
Create a function that extracts film data from a single page:
scrape_films <- function(url) {
  # Use polite to bow to the site first
  session <- bow(url)

  # Scrape the page
  page <- session |>
    scrape()

  # Extract each film section
  films <- page |>
    html_elements("section")

  # Extract data from each film
  film_data <- map_dfr(films, function(film) {
    tibble(
      title = film |>
        html_element("h2") |>
        html_text2(),
      episode = film |>
        html_element("h2") |>
        html_attr("data-id") |>
        as.integer(),
      description = film |>
        html_element("p") |>
        html_text2()
    )
  })

  film_data
}
# Test the function
films <- scrape_films(demo_url)
head(films)
The function returns a tidy data frame with one row per film. The map_dfr() function loops over each section and combines the results into one data frame.
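If map_dfr() is new to you, its behavior is easy to see on a toy example with no scraping involved (pure purrr and tibble):

```r
library(purrr)
library(tibble)

# map_dfr() applies the function to each element and row-binds the results
map_dfr(1:3, function(i) tibble(n = i, square = i^2))
#> a 3-row tibble with columns n and square
```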
Handling Pagination
Most websites spread data across multiple pages. You need to detect pagination and follow links to scrape everything.
First, identify how the pagination works. Common patterns include:
- Next/Previous buttons with specific URLs
- Page numbers that follow a predictable pattern
- “Load More” functionality (requires JavaScript, not covered here)
Create a function that finds all page URLs:
find_page_urls <- function(base_url, max_pages = 5) {
  # For the starwars demo, pages follow a pattern
  urls <- paste0(base_url, "?page=", 1:max_pages)

  # Check which pages actually exist and contain film sections
  valid_urls <- urls |>
    map_lgl(function(url) {
      tryCatch({
        page <- read_html(url)
        length(html_elements(page, "section")) > 0
      }, error = function(e) FALSE)
    })

  urls[valid_urls]
}

# Get all page URLs
all_pages <- find_page_urls(demo_url, max_pages = 3)
all_pages
# Get all page URLs
all_pages <- find_page_urls(demo_url, max_pages = 3)
all_pages
For real websites, you would need to adapt this to their specific pagination structure.
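Query-string pagination is common enough that a small URL builder is worth having. build_page_urls() below is a hypothetical helper (not part of rvest or polite) that handles base URLs with or without an existing query string:

```r
# Hypothetical helper: generate page URLs for query-string pagination
build_page_urls <- function(base_url, n_pages, param = "page") {
  # Use "&" if the URL already carries a query string, "?" otherwise
  sep <- if (grepl("?", base_url, fixed = TRUE)) "&" else "?"
  paste0(base_url, sep, param, "=", seq_len(n_pages))
}

build_page_urls("https://example.com/products", 3)
build_page_urls("https://example.com/search?q=widgets", 2)
```

For sites that expose a "next" link instead of numbered pages, you would follow that link's href from each scraped page until it disappears.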
Building the Complete Pipeline
Combine everything into a single pipeline that scrapes all pages:
scrape_all_films <- function(base_url, max_pages = 10) {
  # Find all page URLs
  page_urls <- find_page_urls(base_url, max_pages)
  message("Found ", length(page_urls), " pages to scrape")

  # Scrape each page with a delay
  all_films <- page_urls |>
    map_dfr(function(url) {
      message("Scraping: ", url)
      scrape_films(url)
    })

  all_films
}
# Run the complete pipeline
all_films <- scrape_all_films(demo_url, max_pages = 3)
nrow(all_films)
The pipeline needs no explicit Sys.sleep(): the session created by bow() enforces a crawl delay (five seconds by default, or the value specified in robots.txt) every time scrape() fetches a page.
Saving Scraped Data
After scraping, save the data in useful formats. CSV for analysis, JSON for web applications:
# Save to CSV
write_csv(all_films, "starwars_films.csv")
# Save to JSON
write_json(all_films, "starwars_films.json", pretty = TRUE)
# Verify the saved files
list.files(pattern = "starwars")
For larger projects, consider saving to a database instead.
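As a sketch of the database route, here is what saving to SQLite might look like using the DBI and RSQLite packages (both are assumptions; they are not among the packages installed above):

```r
library(DBI)

# Append scraped rows to a local SQLite file, so repeated runs
# accumulate data instead of overwriting it
save_to_db <- function(data, db_path = "scraped.sqlite", table = "films") {
  con <- dbConnect(RSQLite::SQLite(), db_path)
  on.exit(dbDisconnect(con))
  dbWriteTable(con, table, data, append = TRUE)
  invisible(data)
}
```

save_to_db(all_films) would then append the scraped films to scraped.sqlite, and you could query them later with dbGetQuery().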
Adding Robust Error Handling
Production scrapers must handle failures gracefully. Add retry logic and error tracking:
scrape_with_retry <- function(url, max_retries = 3) {
  result <- tryCatch({
    scrape_films(url)
  },
  error = function(e) {
    message("Error scraping ", url, ": ", e$message)
    # Pause briefly, then retry
    if (max_retries > 0) {
      Sys.sleep(2)
      scrape_with_retry(url, max_retries - 1)
    } else {
      NULL
    }
  })

  result
}
# Safe wrapper for multiple URLs
scrape_multiple_safe <- function(urls) {
  urls |>
    map(~ scrape_with_retry(.x)) |>
    discard(is.null) |>
    bind_rows()
}
This handles temporary network issues and gives up gracefully after too many failures.
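The same retry pattern is useful beyond scraping, so you may prefer a generic helper. with_retry() below is a hypothetical base-R utility that retries any function, doubling the wait after each failure (true exponential backoff, unlike the fixed two-second pause above):

```r
# Hypothetical generic retry helper with exponential backoff
with_retry <- function(f, max_retries = 3, base_delay = 1) {
  last_error <- NULL
  for (attempt in seq_len(max_retries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    last_error <- result
    # Wait base_delay, then 2x, then 4x, ... before the next attempt
    if (attempt < max_retries) Sys.sleep(base_delay * 2^(attempt - 1))
  }
  stop("All ", max_retries, " attempts failed: ", conditionMessage(last_error))
}
```

with_retry(function() scrape_films(url)) would then wrap the scraper without duplicating the retry logic in every function.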
Scraping Real Websites
When scraping real websites, you need to adapt the selectors to match their HTML structure. Here’s a general workflow:
- Check robots.txt to see what’s allowed
- Identify your target elements using browser developer tools
- Test selectors in R before building the full pipeline
- Add delays between requests
- Handle errors for missing elements
For example, to scrape a typical e-commerce product listing:
scrape_products <- function(url) {
  session <- bow(url)
  page <- session |> scrape()

  # Adapt these selectors to your target site
  products <- page |> html_elements(".product-item")

  map_dfr(products, function(p) {
    tibble(
      name = p |> html_element(".product-name") |> html_text2(),
      price = p |> html_element(".product-price") |> html_text2(),
      link = p |> html_element("a") |> html_attr("href")
    )
  })
}
The CSS classes (.product-item, .product-name, etc.) would differ for each website.
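One reason this pattern is robust: html_element() returns a missing node rather than raising an error when a selector matches nothing, and html_text2() turns that into NA. You can verify the behavior offline on a small snippet (a sketch; the class names are invented):

```r
library(rvest)

# A product listing that has a name but no price element
item <- minimal_html('<div class="product-item">
  <span class="product-name">Widget</span>
</div>') |>
  html_element(".product-item")

item |> html_element(".product-name") |> html_text2()   # "Widget"
item |> html_element(".product-price") |> html_text2()  # NA: the row still builds
```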
Respectful Scraping Practices
Always scrape responsibly:
- Check robots.txt first — the polite package does this automatically
- Identify your scraper with a meaningful User-Agent
- Add delays between requests (1-2 seconds is reasonable)
- Cache results — save locally and don’t re-scrape the same page
- Respect rate limits — if you get 429 errors, slow down
- Don’t scrape behind login without permission
# Proper polite session with identification
session <- bow(
  "https://example.com",
  user_agent = "MyResearchProject/1.0 (contact@example.com)"
)
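The caching advice above takes only a few lines of base R to implement. read_html_cached() below is a hypothetical helper that downloads each page at most once and parses the local copy on later runs (file names are derived from the URL):

```r
# Map a URL to a safe local file name (hypothetical helper)
cache_path <- function(url, cache_dir = "data/raw") {
  file.path(cache_dir, paste0(gsub("[^A-Za-z0-9]+", "_", url), ".html"))
}

# Download the page only if no cached copy exists, then parse the local file
read_html_cached <- function(url, cache_dir = "data/raw") {
  path <- cache_path(url, cache_dir)
  if (!file.exists(path)) {
    dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
    download.file(url, path, quiet = TRUE)
  }
  rvest::read_html(path)
}
```

Note that download.file() bypasses polite, so keep your own delay between first-time downloads; once a page is cached, re-runs never touch the server at all.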
Complete Example: GitHub Trending Scraper
Here’s a more realistic example that scrapes GitHub’s trending page:
scrape_github_trending <- function(language = "r", pages = 1) {
  base_url <- paste0("https://github.com/trending/", language, "?since=weekly")

  all_repos <- map_dfr(1:pages, function(page_num) {
    url <- if (page_num == 1) base_url else paste0(base_url, "&page=", page_num)
    session <- bow(url, user_agent = "RScraper/1.0")
    html <- session |> scrape()

    html |>
      html_elements(".Box-row") |>
      map_dfr(function(repo) {
        tibble(
          name = repo |>
            html_element("h2 a") |>
            html_text(trim = TRUE),
          description = repo |>
            html_element("p") |>
            html_text(trim = TRUE),
          stars = repo |>
            html_element(".Link--muted span") |>
            html_text(trim = TRUE) |>
            str_remove_all(",") |>
            as.integer()
        )
      })
  })

  all_repos
}
# This would scrape GitHub trending repositories
# Note: GitHub structure may change; adjust selectors as needed
Project Structure
For larger scraping projects, organize your code:
web-scraper/
├── R/
│ ├── scrape.R # Scraping functions
│ ├── parse.R # HTML parsing logic
│ └── save.R # Data saving functions
├── data/
│ └── raw/ # Raw HTML snapshots
├── output/ # Processed data
├── scripts/
│ └── run.R # Main execution script
└── README.md
This separation makes it easier to maintain and update your scraper when websites change.
Summary
You built a complete web scraper that:
- Uses rvest for HTML parsing
- Uses polite for respectful scraping
- Handles multiple pages with pagination
- Includes error handling and retry logic
- Saves data to CSV and JSON
The key patterns—bowing to pages, extracting elements with CSS selectors, and looping over multiple pages—apply to scraping any website. Remember to always respect website policies and rate limits.
See Also
- Web Scraping with rvest — The foundational rvest guide
- HTTP Requests with httr2 — For APIs and authenticated requests
- Reading and Writing CSV Files in R — Work with your scraped data