How to Join Two Data Frames in R with dplyr and merge()
To join two data frames in R, the most common approach is dplyr::left_join(). It keeps all rows from the left table and fills unmatched rows from the right with NA — exactly the behavior you want when enriching a primary dataset with lookup information from a secondary table.
library(dplyr)
employees <- data.frame(
id = c(1, 2, 3, 4),
name = c("Alice", "Bob", "Carol", "David"),
dept_id = c(10, 20, 10, 30)
)
departments <- data.frame(
dept_id = c(10, 20, 30),
department = c("Engineering", "Sales", "Marketing")
)
# Left join: keep all employees, match departments
employees %>%
left_join(departments, by = "dept_id")
# id name dept_id department
# 1 1 Alice 10 Engineering
# 2 2 Bob 20 Sales
# 3 3 Carol 10 Engineering
# 4 4 David 30 Marketing
Other join types include inner_join() (only matching rows), right_join() (all rows from right), and full_join() (all rows from both). When key columns have different names, use by = c("col_a" = "col_b").
# Base R equivalents
merge(employees, departments, by = "dept_id", all.x = TRUE) # left join
merge(employees, departments, by = "dept_id") # inner join
# data.table for large datasets
library(data.table)
employees_dt <- as.data.table(employees)
departments_dt <- as.data.table(departments)
employees_dt[departments_dt, on = "dept_id"] # left join
Watch for duplicate keys (causing many-to-many joins) and NA values in join columns, which never match. Use suffix = c("_x", "_y") to resolve column name conflicts when tables share non-key column names. For everyday analysis, left_join() covers 90% of join tasks — it reads like English, works inside a pipe, and handles mismatched key names with the by = c("col_a" = "col_b") syntax. If you need to check whether a join produced the expected number of rows, nrow() before and after the join catches accidental many-to-many expansions silently inflating your dataset.
See also
- left_join(), Left join documentation
- filter(), Filter rows
- arrange(), Sort results