rguides

How to Join Two Data Frames in R with dplyr and merge()

To join two data frames in R, the most common approach is dplyr::left_join(). It keeps all rows from the left table and fills unmatched rows from the right with NA — exactly the behavior you want when enriching a primary dataset with lookup information from a secondary table.

library(dplyr)

employees <- data.frame(
  id = c(1, 2, 3, 4),
  name = c("Alice", "Bob", "Carol", "David"),
  dept_id = c(10, 20, 10, 30)
)

departments <- data.frame(
  dept_id = c(10, 20, 30),
  department = c("Engineering", "Sales", "Marketing")
)

# Left join: keep all employees, match departments
employees %>%
  left_join(departments, by = "dept_id")
#   id  name dept_id  department
# 1  1 Alice      10 Engineering
# 2  2   Bob      20      Sales
# 3  3 Carol      10 Engineering
# 4  4 David      30   Marketing

Other join types include inner_join() (only matching rows), right_join() (all rows from right), and full_join() (all rows from both). When key columns have different names, use by = c("col_a" = "col_b").

# Base R equivalents
merge(employees, departments, by = "dept_id", all.x = TRUE)  # left join
merge(employees, departments, by = "dept_id")                 # inner join

# data.table for large datasets
library(data.table)
employees_dt <- as.data.table(employees)
departments_dt <- as.data.table(departments)
employees_dt[departments_dt, on = "dept_id"]  # left join

Watch for duplicate keys (causing many-to-many joins) and NA values in join columns, which never match. Use suffix = c("_x", "_y") to resolve column name conflicts when tables share non-key column names. For everyday analysis, left_join() covers 90% of join tasks — it reads like English, works inside a pipe, and handles mismatched key names with the by = c("col_a" = "col_b") syntax. If you need to check whether a join produced the expected number of rows, nrow() before and after the join catches accidental many-to-many expansions silently inflating your dataset.

See also