tidyr::drop_na()
Overview
drop_na() removes every row of a data frame that contains at least one NA. It is a quick way to clean a dataset before analysis when you know that incomplete rows are not informative.
The function is straightforward: rows with any missing value in any column are removed. You can optionally restrict the check to specific columns.
Signature
drop_na(data, ...)
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| data | tibble / data frame | required | Input data. |
| ... | tidy-select | none | Optional columns to restrict the check to. If not supplied, all columns are checked. |
Basic usage
Called without column arguments, drop_na() removes every row that contains at least one NA in any column. This aggressive filtering is appropriate when the analysis requires complete cases.
Drop rows with any missing value
library(tidyr)
df <- tibble(
  name = c("Alice", "Bob", "Carol"),
  age = c(25, NA, 31),
  score = c(90, 85, NA)
)
df %>% drop_na()
# # A tibble: 1 x 3
# name age score
# <chr> <dbl> <dbl>
# 1 Alice 25 90
Only Alice has no missing values in any column, so she is the only row that remains.
Drop rows with missing values in specific columns
Name columns inside drop_na() to limit the missing-value check to those columns. Rows that are complete in the named columns are kept, even if other columns contain NA:
df %>% drop_na(age)
# # A tibble: 2 x 3
# name age score
# <chr> <dbl> <dbl>
# 1 Alice 25 90
# 2 Carol 31 NA # age is present, kept even though score is NA
Only Bob is dropped, because his age is missing. Carol stays because her age is not NA, even though her score is NA.
Drop across multiple specific columns
df %>% drop_na(age, score)
# # A tibble: 1 x 3
# name age score
# <chr> <dbl> <dbl>
# 1 Alice 25 90
Both age and score must be non-missing. Bob is dropped because his age is NA, and Carol is dropped because her score is NA.
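Because the column arguments use tidy-select syntax, the columns to check can also come from a range or from a character vector. The following is a minimal sketch that rebuilds the example data. The required vector is a hypothetical stand-in for column names that might come from a config file or function argument.

```r
library(tidyr)

df <- tibble(
  name  = c("Alice", "Bob", "Carol"),
  age   = c(25, NA, 31),
  score = c(90, 85, NA)
)

# Hypothetical vector of column names that must be complete
required <- c("age", "score")

# all_of() raises an error if a listed column does not exist;
# any_of() would silently skip names that are not in the data
by_vector <- df %>% drop_na(all_of(required))

# A column range selects every column from age through score
by_range <- df %>% drop_na(age:score)

by_vector$name
by_range$name
```

Both calls check the same two columns as drop_na(age, score), so only Alice's row should remain in each result.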
Common use cases
Cleaning before modelling
Many R functions return NA or stop with an error when their input contains missing values, unless you tell them to skip those values (for example with na.rm = TRUE). drop_na() is a quick way to clean a dataset before summarising or fitting a model:
library(dplyr)
survey <- tibble(
  id = 1:5,
  q1 = c("agree", NA, "neutral", "agree", NA),
  q2 = c("disagree", "agree", "neutral", NA, "agree"),
  result = c(10, 20, 30, 40, 50)
)
survey %>%
  drop_na() %>%
  summarise(mean_result = mean(result))
# # A tibble: 1 x 1
# mean_result
# <dbl>
# 1 20
Rows 2, 4, and 5 had at least one NA in the response columns and were dropped before summarising.
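To see why the clean-up matters when the summarised column itself has gaps, here is a small sketch with a hypothetical readings table (the names sensor and value are illustrative and not part of the survey example). It compares mean() on the raw column with two ways of handling the missing value.

```r
library(tidyr)
library(dplyr)

# Hypothetical data: one reading is missing
readings <- tibble(
  sensor = c("a", "b", "c", "d"),
  value  = c(10, NA, 30, 20)
)

# mean() propagates NA by default
naive_mean <- mean(readings$value)

# Option 1: drop incomplete rows first, then summarise
clean_mean <- readings %>%
  drop_na(value) %>%
  summarise(m = mean(value)) %>%
  pull(m)

# Option 2: keep all rows and skip NA inside the summary function
narm_mean <- mean(readings$value, na.rm = TRUE)

naive_mean  # NA
clean_mean  # 20
narm_mean   # 20
```

For a single summary the two options agree. drop_na() is the more convenient choice when several later steps all need the same complete rows.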
Removing incomplete observations from time series
prices <- tibble(
  date = as.Date(c("2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04")),
  open = c(100, NA, 102, 103),
  close = c(101, 101, NA, 104)
)
prices %>% drop_na()
# # A tibble: 2 x 3
# date open close
# <date> <dbl> <dbl>
# 1 2024-01-01 100 101
# 2 2024-01-04 103 104
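Dropping rows from a time series leaves gaps between the remaining dates. As a rough sketch, the spacing can be inspected with base R's diff() on the date column. The variable names complete_prices and gaps are illustrative.

```r
library(tidyr)

prices <- tibble(
  date  = as.Date(c("2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04")),
  open  = c(100, NA, 102, 103),
  close = c(101, 101, NA, 104)
)

complete_prices <- prices %>% drop_na()

# Number of days between consecutive remaining observations
gaps <- as.numeric(diff(complete_prices$date))
gaps  # 3
```

A gap larger than the original sampling interval (one day here) shows that observations were removed between those dates.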
Using with dplyr pipelines
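drop_na() fits naturally into a %>% pipeline. This example assumes that dplyr is loaded and that df has a status column and question columns whose names start with q:
df %>%
  filter(status == "complete") %>%
  drop_na(starts_with("q")) %>%
  mutate(total = rowSums(across(where(is.numeric))))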
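A self-contained version of this kind of pipeline might look like the sketch below. The responses table and its columns are hypothetical. The total is computed over the q columns only, because where(is.numeric) would also pick up the numeric id column.

```r
library(tidyr)
library(dplyr)

# Hypothetical survey responses; all names are illustrative
responses <- tibble(
  id     = 1:4,
  status = c("complete", "complete", "partial", "complete"),
  q1     = c(4, NA, 3, 5),
  q2     = c(2, 3, NA, 1)
)

scored <- responses %>%
  filter(status == "complete") %>%        # keeps ids 1, 2, 4
  drop_na(starts_with("q")) %>%           # drops id 2 (q1 is NA)
  mutate(total = rowSums(across(starts_with("q"))))

scored$id     # 1 4
scored$total  # 6 6
```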
Alternative approaches
Base R
df[complete.cases(df), ]
complete.cases() returns a logical vector that is TRUE for rows with no NA. Subsetting with it keeps the same rows as drop_na(), but it reads less naturally in a pipeline.
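As a quick check of that equivalence, the sketch below runs both approaches on the example data and also shows one way to restrict the base R version to a single column. The result names are illustrative.

```r
library(tidyr)

df <- tibble(
  name  = c("Alice", "Bob", "Carol"),
  age   = c(25, NA, 31),
  score = c(90, 85, NA)
)

# All columns checked
tidy_all <- drop_na(df)
base_all <- df[complete.cases(df), ]

# Only the age column checked
tidy_age <- drop_na(df, age)
base_age <- df[complete.cases(df[, "age"]), ]

tidy_all$name  # "Alice"
base_all$name  # "Alice"
tidy_age$name  # "Alice" "Carol"
base_age$name  # "Alice" "Carol"
```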
Using tidyr::fill() before dropping
If missing values should be filled rather than dropped, use fill() first:
df %>%
  fill(age, .direction = "down") %>%
  drop_na()
fill() replaces each NA in age with the previous non-missing value (or the next one, depending on .direction). drop_na() then removes any rows that still contain an NA, including rows where age could not be filled because no earlier value exists.
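The sketch below illustrates the fill-then-drop pattern on a hypothetical sales table, where a region label is only written on some rows. The table and its column names are assumptions made for the example.

```r
library(tidyr)

# Hypothetical data: region is missing on the first row and on one later row
sales <- tibble(
  region = c(NA, "North", NA, "South"),
  month  = c("Jan", "Feb", "Mar", "Apr"),
  units  = c(10, 12, 9, NA)
)

filled <- sales %>% fill(region, .direction = "down")
filled$region  # NA "North" "North" "South"

result <- filled %>% drop_na()
result$month   # "Feb" "Mar"
```

The January row is still dropped because there is no earlier region to carry down, and the April row is dropped because units was never filled.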
Using dplyr::filter() with is.na()
For more control over which rows to keep:
df %>%
  filter(!is.na(age), !is.na(score))
This is equivalent to drop_na(age, score) but lets you apply different conditions to each column.
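To illustrate both points, this sketch first confirms that the two forms keep the same rows, then applies a different rule per column. The rule itself (tolerate a missing score only when age is above 30) is arbitrary and only there to show the extra flexibility.

```r
library(tidyr)
library(dplyr)

df <- tibble(
  name  = c("Alice", "Bob", "Carol"),
  age   = c(25, NA, 31),
  score = c(90, 85, NA)
)

via_drop_na <- df %>% drop_na(age, score)
via_filter  <- df %>% filter(!is.na(age), !is.na(score))

# Illustrative per-column rule: age is always required,
# score may be missing only when age is above 30
custom <- df %>%
  filter(!is.na(age), !is.na(score) | age > 30)

via_drop_na$name  # "Alice"
via_filter$name   # "Alice"
custom$name       # "Alice" "Carol"
```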
Gotchas
Dropping drops the whole row. drop_na() never removes individual cells; it removes the entire row if any checked cell in that row is NA. If a column should not take part in the check and you do not need it afterwards, remove it with select() first:
df %>% select(-score) %>% drop_na()
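If the column should stay in the output and only be ignored by the check, a negative tidy-select expression inside drop_na() appears to do the job. The sketch below compares the two approaches on the example data and assumes dplyr is available for select().

```r
library(tidyr)
library(dplyr)

df <- tibble(
  name  = c("Alice", "Bob", "Carol"),
  age   = c(25, NA, 31),
  score = c(90, 85, NA)
)

# Removes the score column, then checks the remaining columns
without_score <- df %>% select(-score) %>% drop_na()

# Keeps the score column but leaves it out of the NA check
ignoring_score <- df %>% drop_na(-score)

names(without_score)   # "name" "age"
names(ignoring_score)  # "name" "age" "score"
ignoring_score$name    # "Alice" "Carol"
```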
Data-dependent dropping. Dropping rows changes your dataset’s structure. If different rows are missing on different runs (for example, when reading new data), the number of rows after drop_na() will vary. Check your row count after dropping to catch unexpected missingness.
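One simple way to run that check is to compare row counts before and after, and to look at per-column NA counts to see where the losses come from. This is a sketch only. The 50% threshold is an arbitrary assumption that you would tune for your own data.

```r
library(tidyr)

df <- tibble(
  name  = c("Alice", "Bob", "Carol"),
  age   = c(25, NA, 31),
  score = c(90, 85, NA)
)

n_before  <- nrow(df)
cleaned   <- drop_na(df)
n_dropped <- n_before - nrow(cleaned)

# Missing values per column, to see which columns cause the drops
na_per_column <- colSums(is.na(df))

# Hypothetical threshold: report when more than half the rows are lost
max_drop_share <- 0.5
if (n_dropped / n_before > max_drop_share) {
  message("drop_na() removed ", n_dropped, " of ", n_before, " rows")
}

na_per_column
```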
NA in non-numeric columns. drop_na() checks all columns by default, not just numeric ones. In a character column, only a real R NA (NA_character_) causes the row to be dropped; the literal string "NA" is ordinary text and the row is kept. The example below uses a real NA:
df <- tibble(
  name = c("Alice", "Bob", "Carol"),
  note = c("active", NA_character_, "inactive") # NA is R's NA, not string
)
df %>% drop_na()
# # A tibble: 2 x 2
# name note
# <chr> <chr>
# 1 Alice active
# 2 Carol inactive
Bob is dropped because his note is R’s NA, not the string "NA".
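For contrast, the sketch below uses the literal string "NA", which can appear when a file is read without declaring its missing-value marker. The notes table is hypothetical. Converting the string with dplyr::na_if() is one option before calling drop_na().

```r
library(tidyr)
library(dplyr)

# Hypothetical data where "NA" is plain text, not a missing value
notes <- tibble(
  name = c("Alice", "Bob", "Carol"),
  note = c("active", "NA", "inactive")
)

# Nothing is dropped: the string "NA" is not missing
kept <- notes %>% drop_na()
nrow(kept)  # 3

# Convert the marker string to a real NA, then drop
fixed <- notes %>%
  mutate(note = na_if(note, "NA")) %>%
  drop_na()
fixed$name  # "Alice" "Carol"
```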
See also
- /cookbooks/how-to-remove-na-values/: practical recipes for handling missing data
- /cookbooks/how-to-check-for-na-values/: detect and count NA values before deciding whether to drop
- /reference/tidyverse/dplyr-filter/: filter rows by condition, including missing value checks