Spatial Joins in R: A Complete Guide with sf
Every GIS analyst eventually hits the same wall: you have a spreadsheet of 50,000 customer addresses and a shapefile of sales territories, and you need to know which territory each customer belongs to. There is no shared ID column to join on. The only thing the two datasets have in common is location. Spatial joins solve this problem by combining two datasets based on their geographic relationship rather than a common identifier. This technique is essential for tasks like finding all crimes within each police district, counting restaurants near each transit station, or assigning census tracts to neighborhood boundaries.
Prerequisites
Before you can run spatial joins, you need the core geospatial toolchain installed. The sf package provides the geometry types and spatial predicate functions, tidyverse gives you the pipe and dplyr verbs that make the workflow readable, and spData ships with example datasets that let you practice without hunting down shapefiles. Install all three if you haven’t already, then load them into your session:
install.packages(c("sf", "tidyverse", "spData"))
library(sf)
library(tidyverse)
library(spData)
Now that the packages are loaded, let’s look at the actual data. The us_states dataset contains state boundaries as multipolygon geometries with attributes like population and area. The us_cities dataset is a plain data frame with city names, population counts, and latitude/longitude columns. Notice that us_cities is not yet a spatial object: it has coordinates stored as regular columns, but sf doesn’t know they represent geography. Converting those columns to a proper geometry column is the first step before any join can happen:
# Load example datasets
data("us_states")
data("us_cities")
# Check the structure
print(head(us_states))
# Simple feature collection with 10 features and 14 fields
# Geometry type: MULTIPOLYGON
# CRS: 4326
print(head(us_cities))
# # A tibble: 6 × 10
# NAME pop2010 CAPITAL
# 1 Abilene 115500 0
# 2 Akron 198549 0
# 3 Albany 93920 1
Converting cities to sf
Cities need coordinates to become proper sf objects. The st_as_sf() function takes the longitude and latitude columns and turns them into a geometry column of POINT objects. Setting crs = 4326 tells sf that these coordinates use the WGS 84 geographic coordinate system, which matches the CRS of the states data. Without matching CRS values, spatial predicates like st_within would produce empty results because the coordinates live in different mathematical spaces:
# Convert cities to sf
cities_sf <- us_cities %>%
st_as_sf(coords = c("lng", "lat"), crs = 4326)
print(cities_sf)
# Simple feature collection with 249 features and 9 fields
# Geometry type: POINT
# CRS: 4326
Performing the join
With both datasets in the same coordinate system, the join itself is a single st_join() call. The function works like dplyr’s left_join() but instead of matching on equal column values, it matches on the spatial relationship you specify. Here we use st_within as the join predicate, which asks: for each city point, which state polygon contains it? The result is a combined sf object where each city row now carries the state name, area, and other attributes from the matching polygon:
# Spatial join: which state contains each city?
cities_with_states <- us_cities %>%
st_as_sf(coords = c("lng", "lat"), crs = 4326) %>%
st_join(us_states, join = st_within)
print(cities_with_states %>%
select(NAME.x, NAME.y, pop2010) %>%
head(10))
# # A tibble: 10 × 3
# NAME.x NAME.y pop2010
# 1 Abilene Texas 115500
# 2 Akron Ohio 198549
# 3 Albany New York 93920
How spatial joins match geometries
The join predicate you choose determines which rows match. st_intersects is the default and the most permissive: any overlap at all counts as a match, including points that sit exactly on polygon boundaries. st_within is stricter: it only matches when one geometry is entirely inside another, which makes it the safer choice for point-in-polygon joins when you don’t want boundary cases. st_nearest_feature doesn’t require any overlap at all: it finds the closest geometry from the target layer for every source feature, which is useful when your points fall just outside the polygons you’re interested in (a common situation with geocoded addresses near coastline boundaries).
st_intersects()
The most common join, returns all overlapping features. Here we create three sample points over the central United States, convert them to an sf object, and join them to state boundaries. Each point lands squarely inside one state, so st_intersects returns exactly one match per point. In real-world data, points near borders can match multiple polygons when boundaries are imprecise or when a point sits on a river that forms a border between two states:
# Create sample point data
points <- tibble(
id = 1:3,
value = c(10, 20, 30),
geometry = st_sfc(
st_point(c(-95, 40)), # Kansas
st_point(c(-85, 40)), # Indiana
st_point(c(-75, 40)), # Pennsylvania
crs = 4326
)
) %>% st_sf()
# Which states do these points intersect?
points %>%
st_join(us_states, join = st_intersects) %>%
select(id, value, NAME)
# Simple feature collection with 3 features
# id value NAME
# 1 1 10 Kansas
# 2 2 20 Ohio
# 3 3 30 Pennsylvania
st_within()
Only returns matches where one geometry is completely inside another. This predicate is stricter than st_intersects and is the go-to choice for point-in-polygon joins when you want to exclude points that lie exactly on boundaries. In the example below, we create a bounding box polygon covering the southeastern US and use st_within to find large cities that fall strictly inside it. The filter(!is.na(join)) step removes cities outside the box because st_within returns NA for non-matching rows in a left join:
# Create a bounding box polygon
bbox_polygon <- st_as_sfc(st_bbox(
c(xmin = -90, xmin = -80, ymin = 35, ymax = 45),
crs = 4326
))
# Filter cities strictly within the bounding box
us_cities %>%
filter(pop2010 > 500000) %>%
st_as_sf(coords = c("lng", "lat"), crs = 4326) %>%
st_join(bbox_polygon, join = st_within) %>%
filter(!is.na(join))
st_nearest_feature()
Finds the closest geometry regardless of overlap. This predicate is essential when your points don’t fall inside any polygon from the target layer. For example, if you geocode addresses against a coastline and some points end up in the water just offshore, st_nearest_feature still assigns each point to the nearest land-based region. In the example below, two facilities near Chicago are matched to the closest city in the us_cities dataset, which happens to be Chicago for both. The join adds a nearest_city column containing the matched city name:
# Create sample locations
facilities <- tibble(
name = c("Facility A", "Facility B"),
geometry = st_sfc(
st_point(c(-87.6, 41.9)), # Chicago area
st_point(c(-87.9, 41.8)),
crs = 4326
)
) %>% st_sf()
# Find nearest city to each facility
facilities %>%
st_join(us_cities %>% st_as_sf(coords = c("lng", "lat"), crs = 4326),
join = st_nearest_feature) %>%
select(name, nearest_city = NAME)
# Simple feature collection with 2 features
# name nearest_city
# 1 Facility A Chicago
# 2 Facility B Chicago
Practical example: restaurant analysis
Now let’s combine multiple datasets for a realistic analysis workflow. Imagine you’re analyzing restaurant distribution across New York State. You need the state boundary to filter cities, the cities themselves to provide geographic context, and a set of restaurant points to analyze. This example demonstrates how spatial joins chain together: first st_filter keeps only cities within New York, then st_join with st_nearest_feature assigns each restaurant to its closest city. The result tells you which city each restaurant belongs to, even for restaurants that sit just outside city boundaries:
# Get New York state boundary
ny_state <- us_states %>% filter(NAME == "New York")
# Get cities in or near NY (within 1 degree buffer)
ny_cities <- us_cities %>%
st_as_sf(coords = c("lng", "lat"), crs = 4326) %>%
st_filter(ny_state, .predicate = st_intersects)
# Create sample restaurant locations
restaurants <- tibble(
name = c("Joe's Pizza", "Le Bernardin", "Gramercy Tavern"),
type = c("Pizza", "French", "American"),
geometry = st_sfc(
st_point(c(-74.0, 40.7)), # Joe's Pizza
st_point(c(-73.9, 40.8)), # Le Bernardin
st_point(c(-73.98, 40.74)), # Gramercy Tavern
crs = 4326
)
) %>% st_sf()
# Find which city each restaurant is in
restaurants %>%
st_join(ny_cities, join = st_nearest_feature) %>%
select(restaurant = name, type, city = NAME.x)
# Simple feature collection with 3 features
# restaurant type city
# 1 Joe's Pizza Pizza New York
# 2 Le Bernardin French New York
# 3 Gramercy Tavern American New York
Grouping and summarizing joined data
Spatial joins become truly powerful when you pair them with dplyr’s grouping and summarization. After attaching state attributes to each city via st_join, the result is a flat table where each city row also carries its state name. Grouping by state and calling summarize() collapses those rows, counting how many cities fall in each state and summing their populations. The geometry column is automatically combined with st_union() so the result remains a valid sf object. This pattern, join then group then summarize, answers questions like “which state has the most large cities” in a single pipeline:
# Count cities per state
city_counts <- us_cities %>%
st_as_sf(coords = c("lng", "lat"), crs = 4326) %>%
st_join(us_states, join = st_within) %>%
group_by(NAME) %>%
summarize(city_count = n(),
total_pop = sum(pop2010, na.rm = TRUE))
print(city_counts %>%
arrange(desc(city_count)) %>%
head(10))
# # A tibble: 10 × 3
# NAME city_count total_pop
# 1 California 26 32000000
# 2 Texas 19 25000000
# 3 New York 16 15000000
Handling CRS mismatches
Coordinate reference system mismatches are the most common silent failure in spatial joins. When two layers use different CRS values, st_join still runs without error, but it produces an empty result because the coordinate spaces don’t align. Always verify that both layers share the same CRS before joining. If they differ, transform one layer to match the other using st_transform(). For distance-based joins and nearest-neighbor searches, consider projecting to a metric CRS like UTM: measurements in meters are more intuitive and computationally predictable than degrees of latitude and longitude:
# Check CRS of both layers
st_crs(us_cities)
# Coordinate Reference System:
# User input: EPSG:4326
# wkt: GEOGCRS["WGS 84", ...]
# If different, transform one layer to match
# Example: transform to projected CRS (meters) for more accurate distances
us_cities_proj <- us_cities %>%
st_as_sf(coords = c("lng", "lat"), crs = 4326) %>%
st_transform(32618) # UTM zone 18N
us_states_proj <- us_states %>%
st_transform(32618)
# Now distance calculations will be in meters
Common pitfalls
Two pitfalls trip up almost everyone who works with spatial joins: unexpected duplicate rows from multiple polygon matches, and silent empty results from CRS mismatches. The first happens because st_intersects returns every overlapping polygon for each source feature, so a point near a border can appear once for each adjacent state. The second happens when CRS values don’t match but the join still runs without warning. Knowing both failure modes and their fixes before you start will save you from puzzling over empty or inflated results.
1. join returns unexpected large results
When a point lands on or near a polygon border, st_intersects can match it to multiple adjacent polygons simultaneously. The join result then contains more rows than you started with, one row per source-feature/target-polygon pair. The fix depends on your analysis goal: switch to st_within if you only want points fully inside a polygon, or use st_filter to keep only the first match per source feature. In the example below, the initial st_intersects join produces 280 rows from 249 cities, 31 extra rows from boundary overlaps. Switching to st_within drops those duplicates:
# Problem: points join to multiple overlapping polygons
us_cities %>%
st_as_sf(coords = c("lng", "lat"), crs = 4326) %>%
st_join(us_states, join = st_intersects) %>%
nrow()
# [1] 280 # More than 249 cities!
# Solution: use st_filter or keep = FALSE
cities_in_states <- us_cities %>%
st_as_sf(coords = c("lng", "lat"), crs = 4326) %>%
st_filter(us_states, .predicate = st_within)
nrow(cities_in_states)
# [1] 248 # One city excluded (probably DC/MD issue)
2. empty results
The second common pitfall is an empty join result with no error message. This almost always means a CRS mismatch: the two layers use different coordinate reference systems, so the spatial predicate evaluates every pair as non-overlapping. The fix is straightforward, check CRS equality before joining and transform one layer if needed. Adding a stopifnot() guard in your script catches this early rather than letting an empty result propagate silently through downstream analysis steps:
# Problem: CRS mismatch causes empty join
# Always verify CRS before joining
st_crs(us_cities) == st_crs(us_states)
# [1] TRUE # Good
# If FALSE, transform:
# us_cities <- st_transform(us_cities, st_crs(us_states))
Point-in-Polygon
The most common spatial join is assigning polygon attributes to points. Given a points dataset and a regions polygon dataset, st_join(points, regions) adds region attributes to each point. If a point falls in multiple polygons (overlapping polygons), it appears multiple times in the result. Use st_join(points, regions, left = FALSE) to keep only points that fall within at least one polygon. For large point datasets, pre-filtering points to the bounding box of the polygon layer with st_crop() avoids testing pairs that can’t possibly match, cutting runtime by an order of magnitude.
Buffering for proximity analysis
st_buffer(geometry, dist = 1000) creates a polygon buffer dist units around each geometry (1000 meters if the CRS uses meters). Buffer points to find nearby features: buffered_points |> st_join(roads) returns roads within 1000m of each point. st_is_within_distance(x, y, dist = 500) returns a sparse matrix of which features are within distance without creating explicit buffers. When you need distances in meters but your data uses geographic coordinates (degrees), project to a local UTM zone first: a one-degree buffer at the equator spans roughly 111 km, far too large for proximity analysis.
Aggregating spatial data
Spatial aggregation computes statistics for features falling within zones. aggregate(points, zones, FUN = sum) sums point values for each zone. The dplyr equivalent chains a join with a group-and-summarize: st_join(zones, points) |> group_by(zone_id) |> summarise(total = sum(value)). st_intersection(x, y) computes the exact overlapping geometry, necessary for area-weighted aggregation where you need the fraction of each polygon that falls within each zone. When zones overlap, st_join returns multiple rows per point; aggregate by zone ID afterward to collapse duplicates.
How sf joins geometries
sf::st_join(x, y) performs a left join by default, keeping all rows from x and adding matched attributes from y. The join predicate defaults to st_intersects but accepts any binary spatial predicate: st_within, st_contains, st_overlaps, st_touches, st_crosses, or st_nearest_feature. Think of it as a geographic version of dplyr’s relational joins: rows match when their geometries satisfy the predicate, not when column values are equal.
Handling multiple matches
When one feature in x intersects multiple features in y, st_join() returns multiple rows per feature. This is correct but requires downstream aggregation with group_by() %>% summarise(). For overlapping boundaries, group_by(point_id) %>% slice_min(area) keeps the smallest containing polygon, typically the most precise administrative unit.
Distance-Based joins
st_is_within_distance(x, y, dist) returns pairs within a given distance. Use it as a join predicate with st_join(x, y, join = st_is_within_distance, dist = units::set_units(500, "m")). Always project to a metric CRS before distance joins: units::set_units() prevents silent errors where R treats a bare number as degrees. For large datasets, nngeo::st_nn() uses spatial indexes for faster nearest-neighbor searches.
# Example: find the 3 nearest cities within 500 km of each facility
# Requires nngeo package
library(nngeo)
facilities <- us_cities %>%
filter(pop2010 > 1000000) %>%
st_as_sf(coords = c("lng", "lat"), crs = 4326) %>%
st_transform(32618) # project to meters
all_cities <- us_cities %>%
st_as_sf(coords = c("lng", "lat"), crs = 4326) %>%
st_transform(32618)
nearest <- st_nn(facilities, all_cities, k = 3, maxdist = 500000, returnDist = TRUE)
Computing statistics from joined layers
The core pattern is join, group, summarize: st_join(polygons, points) %>% group_by(polygon_id) %>% summarise(count = n()). For weighted summaries like average income per district, the join attaches values and the summarize computes the mean. Use st_drop_geometry(df) to strip geometry when you only need tabular results.
Performance considerations
sf builds spatial indexes automatically, but simplifying complex polygons with rmapshaper::ms_simplify(polygons, keep = 0.05) speeds up intersection tests. For repeated lookups, cache the polygon object to avoid rebuilding the index on each call. Pre-filtering large point sets with st_crop(points, st_bbox(polygons)) eliminates impossible matches before the join starts.
Real-world applications
Customer locations get census tract data, restaurant addresses get neighborhood characteristics, and monitoring stations get watershed information, all through spatial joins. In every case the join is purely geographic: no shared identifier, only shared location.
Summary
You have learned how to:
- Perform spatial joins using
st_join()with different predicates - Choose the right join predicate for your use case
- Aggregate data after spatial joins
- Handle CRS mismatches between datasets
- Avoid common spatial join pitfalls
Spatial joins are fundamental to geospatial analysis in R, enabling you to combine datasets based on geographic relationships. Every GIS analysis that crosses data layers, whether assigning census demographics to store locations or measuring proximity to public transit, relies on performing spatial joins under the hood.
Next steps
Now that you can combine datasets by location, try applying spatial joins to your own data. Practice with st_nearest_feature for address-to-neighborhood matching, explore st_filter for subsetting one layer by another, and experiment with different predicates to see how results change. The CRS and pitfalls sections above will save you hours of debugging when joins don’t produce the output you expect.
See also
- Introduction to spatial data in R: Covers the sf object model, coordinate reference systems, and reading shapefiles before you attempt joins.
- Working with sf and terra in R: Compares the sf and terra packages, covering raster-vector integration workflows that often follow a spatial join.
- Geocoding and mapping in R: Learn how to turn street addresses into coordinate pairs, the typical upstream step before running a spatial join.
- Raster analysis in R: The next tutorial in this series, covering raster operations and how they integrate with vector data from spatial joins.
- dplyr joins reference: The relational join counterpart to spatial joins, essential for combining non-spatial attributes after a spatial match.
- R spatial sf guide: A broader guide to the sf package covering geometry creation, transformations, and coordinate reference systems.