Reshaping Data with tidyr
One of the most common data wrangling challenges is switching between wide and long data formats. Wide data has multiple columns representing different time points or categories, while long data has each observation in its own row. The tidyr package provides two powerful functions for this: pivot_longer() and pivot_wider().
Why Reshape Your Data?
Different data formats serve different purposes. Wide format is great for human reading and reporting:
# Wide format - easy to read
wide_data <- data.frame(
id = c(1, 2, 3),
name = c("Alice", "Bob", "Carol"),
score_2023 = c(85, 92, 78),
score_2024 = c(88, 95, 82),
score_2025 = c(91, 98, 86)
)
print(wide_data)
Long format is required for most statistical modeling and visualization in R:
# Long format - each row is one observation
long_data <- data.frame(
id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
name = c("Alice", "Alice", "Alice", "Bob", "Bob", "Bob", "Carol", "Carol", "Carol"),
year = c(2023, 2024, 2025, 2023, 2024, 2025, 2023, 2024, 2025),
score = c(85, 88, 91, 92, 95, 98, 78, 82, 86)
)
print(long_data)
Most ggplot2 visualizations and dplyr operations work best with long data.
Converting Wide to Long with pivot_longer()
The pivot_longer() function transforms wide data into long format. You specify which columns to gather and what to name the new columns.
Basic Usage
library(tidyr)
library(dplyr)
# Starting with wide data
wide <- data.frame(
id = 1:3,
name = c("Alice", "Bob", "Carol"),
score_2023 = c(85, 92, 78),
score_2024 = c(88, 95, 82),
score_2025 = c(91, 98, 86)
)
# Convert to long format
long <- wide %>%
pivot_longer(
cols = starts_with("score_"),
names_to = "year",
values_to = "score"
) %>%
mutate(year = parse_number(year))
print(long)
The key arguments:
- cols: Which columns to gather (use
starts_with(),contains(), or column names) - names_to: Name for the new column that holds the old column names
- values_to: Name for the new column that holds the values
Advanced pivot_longer()
You can use regular expressions and more complex patterns:
# Example with multiple value columns
wide2 <- data.frame(
id = 1:3,
product = c("Widget A", "Widget B", "Widget C"),
price_q1 = c(100, 150, 200),
price_q2 = c(105, 155, 210),
units_q1 = c(50, 30, 20),
units_q2 = c(55, 35, 22)
)
long2 <- wide2 %>%
pivot_longer(
cols = matches("_(q1|q2)$"),
names_to = c(".value", "quarter"),
names_sep = "_"
)
print(long2)
The .value special token tells tidyr to split the column names and use the first part as the column names for the values.
Converting Long to Wide with pivot_wider()
The pivot_wider() function does the opposite: it takes long data and spreads it into wide format.
Basic Usage
# Starting with long data
long <- data.frame(
id = c(1, 1, 1, 2, 2, 2),
name = c("Alice", "Alice", "Alice", "Bob", "Bob", "Bob"),
year = c(2023, 2024, 2025, 2023, 2024, 2025),
score = c(85, 88, 91, 92, 95, 98)
)
# Convert to wide format
wide <- long %>%
pivot_wider(
names_from = year,
values_from = score,
names_prefix = "score_"
)
print(wide)
Key arguments:
- names_from: Column containing the new column names
- values_from: Column containing the values
- names_prefix: Optional prefix for new column names
Handling Missing Values
When pivoting wider, you might encounter missing values. Use values_fill to specify a default:
long_with_na <- data.frame(
id = c(1, 1, 2, 2),
measure = c("height", "weight", "height", "weight"),
value = c(175, 70, 180, NA)
)
wide_filled <- long_with_na %>%
pivot_wider(
names_from = measure,
values_from = value,
values_fill = 0 # Fill missing with 0
)
print(wide_filled)
Practical Example: Time Series Analysis
Let’s walk through a real example. Imagine you have sales data in wide format and need to analyze trends:
# Wide sales data
sales_wide <- data.frame(
product_id = 1:4,
product_name = c("Widget", "Gadget", "Gizmo", "Thingamajig"),
jan = c(100, 150, 200, 50),
feb = c(110, 160, 190, 55),
mar = c(120, 170, 180, 60),
apr = c(130, 180, 170, 65)
)
# Convert to long for analysis
sales_long <- sales_wide %>%
pivot_longer(
cols = jan:apr,
names_to = "month",
values_to = "sales"
)
# Analyze monthly trends
monthly_summary <- sales_long %>%
group_by(month) %>%
summarise(total_sales = sum(sales), .groups = "drop_last")
print(monthly_summary)
Best Practices
- Plan your structure first: Know which format you need before reshaping
- Use clear column names: After pivoting, clean up the new column names
- Check for duplicates:
pivot_wider()will fail if you have duplicate identifier-value pairs - Handle NA values explicitly: Use
values_fillorvalues_drop_naas needed
Summary
| Function | Use Case | Key Arguments |
|---|---|---|
pivot_longer() | Wide to Long | cols, names_to, values_to |
pivot_wider() | Long to Wide | names_from, values_from |
These functions are essential tools in your data wrangling toolkit. Master them, and you’ll be able to transform any messy dataset into the format you need for analysis.