rguides

Spatial Data Analysis in R with sf

The sf package is the modern standard for working with spatial data in R. It provides a tidy approach to geographic information systems (GIS) by representing spatial features as simple features within data frames. This guide walks you through creating, manipulating, and visualizing spatial data using sf.

What are simple features?

Simple Features (SF) is a standard for representing geographic vector data. The sf package implements this standard in R, storing spatial data as a special type of data frame with a geometry list-column. Each row represents a feature (like a city, road, or country boundary), and the geometry column contains the spatial information.

The package connects to GDAL for reading and writing spatial file formats, GEOS for geometric operations on projected coordinates, and PROJ for coordinate reference system transformations. This makes sf powerful yet dependent on system libraries.

Installing and loading sf

Install sf from CRAN along with ggplot2 for visualization:

install.packages(c("sf", "ggplot2", "dplyr"))

library(sf)
library(ggplot2)
library(dplyr)

The package should load without errors. If you see warnings about missing system libraries, you will need to install GDAL, GEOS, and PROJ on your system first. On Ubuntu, apt install libgdal-dev libgeos-dev libproj-dev handles all three; on macOS with Homebrew, brew install gdal geos proj provides the same libraries. Once these system dependencies are in place, reinstall sf from source with install.packages("sf", type = "source").

Creating spatial data

The fundamental building blocks are simple feature geometries (sfc) and simple features (sf). You create them from coordinates using functions like st_point() for individual locations, st_linestring() for paths and routes, and st_polygon() for area boundaries. Each geometry type has its own constructor that validates the coordinate structure:

# Create a single point
point <- st_point(c(-122.4194, 37.7749))  # San Francisco
print(point)

# Create a linestring (a road, for example)
road <- st_linestring(matrix(c(
  -122.4194, -122.4050, -122.3900,
  37.7749, 37.7700, 37.7650
), ncol = 2, byrow = TRUE))
print(road)

# Create a polygon (a bounding box, for example)
bbox <- st_polygon(list(matrix(c(
  -122.5, -122.3, -122.3, -122.5, -122.5,
  37.7, 37.7, 37.85, 37.85, 37.7
), ncol = 2, byrow = TRUE)))
print(bbox)

The output shows the geometry type and coordinates. Each geometry gets a class attribute indicating its type (POINT, LINESTRING, POLYGON). At this stage, the geometries are standalone objects with no coordinate reference system attached — they are just lists of numbers. To make them meaningful as spatial data, you need to wrap them in an sfc (simple feature geometry column) and assign a CRS, which tells sf how to interpret those coordinates as locations on the Earth’s surface.

Now wrap these geometries in an sfc object with a coordinate reference system. The crs argument accepts an EPSG code, a PROJ string, or a WKT representation:

# Create sfc with WGS84 (EPSG:4326)
sfc_point <- st_sfc(point, crs = 4326)
sfc_linestring <- st_sfc(road, crs = 4326)
sfc_polygon <- st_sfc(bbox, crs = 4326)

print(sfc_point)

The crs argument sets the coordinate reference system using an EPSG code. The code 4326 represents WGS84, the standard system for latitude and longitude coordinates.

Finally, create an sf object by combining sfc with a data frame:

# Create sf object with attributes
cities <- data.frame(
  name = c("San Francisco", "Oakland", "Berkeley"),
  population = c(873965, 433031, 124321)
)

sf_cities <- st_sf(
  cities,
  geometry = st_sfc(
    st_point(c(-122.4194, 37.7749)),
    st_point(c(-122.2711, 37.8044)),
    st_point(c(-122.2727, 37.8716))
  ),
  crs = 4326
)

print(sf_cities)

The result looks like a regular data frame with an extra geometry column. This is the key insight of sf: spatial data is just data with a special column. Any dplyr operation that works on a data frame also works on an sf object — the geometry column follows along transparently. This design means you can go from importing CSV data to mapping it with ggplot2 without learning a separate spatial data structure.

Reading and writing spatial data

Sf connects to GDAL, giving you access to dozens of spatial file formats without needing to know the details of each format. The main functions are st_read() for importing and st_write() for exporting:

# Read a shapefile
# nc <- st_read("path/to/counties.shp")

# Read a GeoJSON file
# nc <- st_read("counties.geojson")

# Write to GeoJSON
# st_write(sf_cities, "cities.geojson", driver = "GeoJSON")

# Write to shapefile
# st_write(sf_cities, "cities.shp", driver = "ESRI Shapefile")

The package automatically detects the driver from the file extension. Common formats include GeoJSON, ESRI Shapefile, and PostGIS databases.

Geometry operations

The sf package provides many geometric operations through GEOS. These include buffer, union, intersection, and difference operations.

# Create a buffer around a point (1 degree radius)
buffered <- st_buffer(sfc_point, dist = 0.1)
plot(buffered)

# Find the intersection of two polygons
poly1 <- st_polygon(list(matrix(c(
  0, 0, 1, 1, 0, 0,
  0, 0, 0, 1, 1, 0
), ncol = 2, byrow = TRUE)))

poly2 <- st_polygon(list(matrix(c(
  0.5, 0.5, 1.5, 1.5, 0.5, 0.5,
  0.5, 0.5, 0.5, 1.5, 1.5, 0.5
), ncol = 2, byrow = TRUE)))

intersection <- st_intersection(
  st_sfc(poly1, crs = 4326),
  st_sfc(poly2, crs = 4326)
)
plot(intersection)

Other useful functions include st_union() to combine geometries, st_difference() to subtract one geometry from another, and st_sym_difference() for the symmetric difference. These operations form the core of GIS analysis: buffering creates catchment areas, intersection finds overlapping regions, and union merges adjacent polygons. All geometric operations in sf use the GEOS library under the hood, which implements the same algorithms used by PostGIS and QGIS, so results are consistent across tools.

Coordinate reference systems

Coordinate reference systems (CRS) define how coordinates map to locations on Earth. The sf package handles both projected (Cartesian) and geographic (spherical) coordinate systems through the PROJ library:

# Check the CRS of an object
st_crs(sf_cities)

# Transform to a projected CRS (UTM zone 10N for California)
sf_cities_utm <- st_transform(sf_cities, crs = 32610)

# Create a new CRS from scratch
st_crs("+proj=longlat +datum=WGS84")

# EPSG codes are the easiest way to specify CRS
# 4326 = WGS84 (lat/lon)
# 32610 = UTM zone 10N
# 3857 = Web Mercator (for mapping)

Transforming to a projected CRS is necessary for accurate distance and area calculations. Geographic coordinates (like WGS84) use degrees, not meters, so buffer distances in degrees are approximate.

Operations like st_distance() and st_area() return values in the CRS’s units. For WGS84, distances are in degrees, which are not meaningful for physical distances. Always transform to an appropriate projected CRS before computing areas or distances. For global datasets where transformation is impractical, lwgeom::st_geod_distance() computes geodesic distances in meters without transforming.

Working with dplyr

Because sf objects are data frames, you can use dplyr verbs to manipulate them. The geometry column is preserved through standard operations.

# Filter spatial data
sf_cities |> filter(population > 400000)

# Select columns (including geometry)
sf_cities |> select(name, population)

# Add new attributes with mutate
sf_cities |> mutate(
  size_category = ifelse(population > 500000, "large", "small")
)

# Group and summarize
# Note: geometry is dropped by default in summarise
sf_cities |> summarise(
  total_pop = sum(population),
  avg_pop = mean(population)
)

There is one gotcha: most dplyr summarise operations drop the geometry column. To keep geometry through aggregation, use st_union() or st_collect().

Visualizing with ggplot2

The ggplot2 package has built-in support for sf through geom_sf(). This makes creating maps straightforward.

# Basic point map
ggplot(sf_cities) +
  geom_sf(aes(size = population, color = population)) +
  theme_minimal()

# Map with custom fill
ggplot(sf_cities) +
  geom_sf(aes(fill = name), alpha = 0.7) +
  geom_sf_text(aes(label = name), nudge_y = 0.01) +
  theme_minimal()

The geom_sf() function automatically handles different geometry types. You can layer multiple geom_sf() calls for complex visualizations.

Common issues and solutions

A few issues come up frequently when working with sf:

The most common error is mixing coordinate systems. Buffer operations with geographic CRS give approximate results because degrees vary in size. Always transform to a projected CRS first for accurate measurements.

Invalid geometries cause unexpected results in spatial operations. Use st_is_valid() to check and st_make_valid() to repair them.

Large files can be slow to process. Use st_read() with the query argument for databases, or filter spatially with st_filter() before loading all data.

Spatial joins and predicates

st_join(x, y) joins two sf objects based on spatial relationships. The default is st_intersects — any geometries that overlap are joined. Other predicates: st_within (x inside y), st_contains (y inside x), st_nearest_feature (nearest geometry). st_join is a left join by default; include left = FALSE for an inner join.

st_filter(x, y, .predicate = st_intersects) is a spatial filter — it keeps only features in x that meet the spatial relationship with y. This is equivalent to x[st_intersects(x, y, sparse = FALSE), ] but more readable.

Visualization with tmap and ggplot2

ggplot2::geom_sf() plots sf objects with the standard ggplot2 syntax: ggplot() + geom_sf(data = countries, aes(fill = gdp_per_capita)) + scale_fill_viridis_c(). The coord_sf() function controls the projection for the plot. facet_wrap() creates small multiples of spatial data for different time periods or groups.

tmap provides a dedicated mapping API with both interactive (tmap_mode("view")) and static (tmap_mode("plot")) modes. tm_shape(data) + tm_polygons("variable") is the basic pattern. tmap handles legend formatting, north arrows, scale bars, and insets with less code than the equivalent ggplot2.

For web maps, leaflet creates interactive maps that embed in Shiny apps and HTML documents. leaflet(data) |> addTiles() |> addCircleMarkers(~lng, ~lat, color = ~color) creates a map with basemap tiles and custom markers.

For raster data (gridded spatial data like satellite imagery, elevation models, or climate data), the terra package provides fast raster operations and integrates with sf for vector-raster interactions. sf::st_extract() extracts raster values at point locations.

Summary

The sf package provides a complete framework for vector spatial data in R, integrating with the tidyverse through dplyr methods and with ggplot2 through geom_sf(). For raster data, use the terra package. CRS management — knowing when to transform, and to which CRS — is the most common source of errors in spatial analysis. Checking st_crs(layer1) == st_crs(layer2) before joins and operations prevents silent failures from mismatched coordinate systems.

Wrapping up

The sf package brings the power of GIS to R in a way that integrates naturally with the tidyverse. Spatial data becomes just another tibble with a geometry column. You can filter, transform, and visualize it using tools you already know.

For further learning, explore the stars package for raster data, the sfnetworks package for network analysis, and the leaflet package for interactive maps.

See also