Spatial Joins in R
Spatial joins combine two datasets based on their geographic relationship rather than a common identifier. This is essential for tasks like finding all crimes within each police district, counting restaurants near each transit station, or assigning census tracts to neighborhood boundaries.
Prerequisites
Load the required packages:
install.packages(c("sf", "tidyverse", "spData"))
library(sf)
library(tidyverse)
library(spData)
Understanding Spatial Joins
Unlike traditional joins that match on equal values (like id or name), spatial joins match based on geometric relationships:
- Intersects: geometries overlap at any point
- Within: one geometry is completely inside another
- Contains: one geometry completely contains another
- Nearest: closest geometry from one layer to another
When to Use Each Predicate
| Predicate | Use Case |
|---|---|
st_intersects() | General overlap detection |
st_within() | Points completely inside polygons |
st_contains() | Polygons containing points |
st_nearest_feature() | Closest match |
Your First Spatial Join
Let’s work with example data from the spData package:
# Load example datasets
data("us_states")
data("us_cities")
# Check the structure
print(head(us_states))
# Simple feature collection with 10 features and 14 fields
# Geometry type: MULTIPOLYGON
# CRS: 4326
print(head(us_cities))
# # A tibble: 6 × 10
# NAME pop2010 CAPITAL
# 1 Abilene 115500 0
# 2 Akron 198549 0
# 3 Albany 93920 1
Converting Cities to sf
Cities need coordinates to become proper sf objects:
# Convert cities to sf
cities_sf <- us_cities %>%
st_as_sf(coords = c("lng", "lat"), crs = 4326)
print(cities_sf)
# Simple feature collection with 249 features and 9 fields
# Geometry type: POINT
# CRS: 4326
Performing the Join
Now join cities to states based on which state contains each city:
# Spatial join: which state contains each city?
cities_with_states <- us_cities %>%
st_as_sf(coords = c("lng", "lat"), crs = 4326) %>%
st_join(us_states, join = st_within)
print(cities_with_states %>%
select(NAME.x, NAME.y, pop2010) %>%
head(10))
# # A tibble: 10 × 3
# NAME.x NAME.y pop2010
# 1 Abilene Texas 115500
# 2 Akron Ohio 198549
# 3 Albany New York 93920
Spatial Join Types in Detail
st_intersects()
The most common join — returns all overlapping features:
# Create sample point data
points <- tibble(
id = 1:3,
value = c(10, 20, 30),
geometry = st_sfc(
st_point(c(-95, 40)), # Kansas
st_point(c(-85, 40)), # Indiana
st_point(c(-75, 40)), # Pennsylvania
crs = 4326
)
) %>% st_sf()
# Which states do these points intersect?
points %>%
st_join(us_states, join = st_intersects) %>%
select(id, value, NAME)
# Simple feature collection with 3 features
# id value NAME
# 1 1 10 Kansas
# 2 2 20 Ohio
# 3 3 30 Pennsylvania
st_within()
Only returns matches where one geometry is completely inside another:
# Create a bounding box polygon
bbox_polygon <- st_as_sfc(st_bbox(
c(xmin = -90, xmin = -80, ymin = 35, ymax = 45),
crs = 4326
))
# Filter cities strictly within the bounding box
us_cities %>%
filter(pop2010 > 500000) %>%
st_as_sf(coords = c("lng", "lat"), crs = 4326) %>%
st_join(bbox_polygon, join = st_within) %>%
filter(!is.na(join))
st_nearest_feature()
Finds the closest geometry regardless of overlap:
# Create sample locations
facilities <- tibble(
name = c("Facility A", "Facility B"),
geometry = st_sfc(
st_point(c(-87.6, 41.9)), # Chicago area
st_point(c(-87.9, 41.8)),
crs = 4326
)
) %>% st_sf()
# Find nearest city to each facility
facilities %>%
st_join(us_cities %>% st_as_sf(coords = c("lng", "lat"), crs = 4326),
join = st_nearest_feature) %>%
select(name, nearest_city = NAME)
# Simple feature collection with 2 features
# name nearest_city
# 1 Facility A Chicago
# 2 Facility B Chicago
Practical Example: Restaurant Analysis
Combine multiple datasets for meaningful analysis:
# Get New York state boundary
ny_state <- us_states %>% filter(NAME == "New York")
# Get cities in or near NY (within 1 degree buffer)
ny_cities <- us_cities %>%
st_as_sf(coords = c("lng", "lat"), crs = 4326) %>%
st_filter(ny_state, .predicate = st_intersects)
# Create sample restaurant locations
restaurants <- tibble(
name = c("Joe's Pizza", "Le Bernardin", "Gramercy Tavern"),
type = c("Pizza", "French", "American"),
geometry = st_sfc(
st_point(c(-74.0, 40.7)), # Joe's Pizza
st_point(c(-73.9, 40.8)), # Le Bernardin
st_point(c(-73.98, 40.74)), # Gramercy Tavern
crs = 4326
)
) %>% st_sf()
# Find which city each restaurant is in
restaurants %>%
st_join(ny_cities, join = st_nearest_feature) %>%
select(restaurant = name, type, city = NAME.x)
# Simple feature collection with 3 features
# restaurant type city
# 1 Joe's Pizza Pizza New York
# 2 Le Bernardin French New York
# 3 Gramercy Tavern American New York
Aggregation After Spatial Joins
One powerful pattern combines spatial join with grouping:
# Count cities per state
city_counts <- us_cities %>%
st_as_sf(coords = c("lng", "lat"), crs = 4326) %>%
st_join(us_states, join = st_within) %>%
group_by(NAME) %>%
summarize(city_count = n(),
total_pop = sum(pop2010, na.rm = TRUE))
print(city_counts %>%
arrange(desc(city_count)) %>%
head(10))
# # A tibble: 10 × 3
# NAME city_count total_pop
# 1 California 26 32000000
# 2 Texas 19 25000000
# 3 New York 16 15000000
Handling CRS Mismatches
Always ensure consistent coordinate reference systems:
# Check CRS of both layers
st_crs(us_cities)
# Coordinate Reference System:
# User input: EPSG:4326
# wkt: GEOGCRS["WGS 84", ...]
# If different, transform one layer to match
# Example: transform to projected CRS (meters) for more accurate distances
us_cities_proj <- us_cities %>%
st_as_sf(coords = c("lng", "lat"), crs = 4326) %>%
st_transform(32618) # UTM zone 18N
us_states_proj <- us_states %>%
st_transform(32618)
# Now distance calculations will be in meters
Common Pitfalls
1. Join Returns Unexpected Large Results
# Problem: points join to multiple overlapping polygons
us_cities %>%
st_as_sf(coords = c("lng", "lat"), crs = 4326) %>%
st_join(us_states, join = st_intersects) %>%
nrow()
# [1] 280 # More than 249 cities!
# Solution: use st_filter or keep = FALSE
cities_in_states <- us_cities %>%
st_as_sf(coords = c("lng", "lat"), crs = 4326) %>%
st_filter(us_states, .predicate = st_within)
nrow(cities_in_states)
# [1] 248 # One city excluded (probably DC/MD issue)
2. Empty Results
# Problem: CRS mismatch causes empty join
# Always verify CRS before joining
st_crs(us_cities) == st_crs(us_states)
# [1] TRUE # Good
# If FALSE, transform:
# us_cities <- st_transform(us_cities, st_crs(us_states))
Summary
You have learned how to:
- Perform spatial joins using
st_join()with different predicates - Choose the right join predicate for your use case
- Aggregate data after spatial joins
- Handle CRS mismatches between datasets
- Avoid common spatial join pitfalls
Spatial joins are fundamental to geospatial analysis in R, enabling you to combine datasets based on geographic relationships.
See Also
- dplyr::join — Traditional joins for combining data by key columns
- sf::st_as_sf — Converting data frames to sf objects
- dplyr::filter — Filtering data before spatial operations