Data Frames and Tibbles
Data frames are the workhorse of data analysis in R. They store tabular data (think of a spreadsheet or SQL table) in which each column can hold a different type. This tutorial covers creating data frames, understanding their structure, working with tibbles (the tidyverse’s modern alternative), and essential manipulation operations.
What Is a Data Frame?
A data frame is a list of vectors of equal length. Each vector is a column, and the shared length of the vectors is the number of rows. This rectangular structure makes data frames a natural fit for statistical analysis and data science.
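The list-of-vectors description can be checked directly. The following is a minimal sketch using a hypothetical data frame named people, which is not part of this tutorial's running example:

```r
# Hypothetical example data, used only in this sketch
people <- data.frame(
  name = c("Ann", "Ben", "Cy"),
  age = c(31, 45, 27)
)

is.list(people)        # TRUE: a data frame is a list under the hood
length(people)         # 2: list length is the number of columns
lengths(people)        # every column has length 3, the row count
sapply(people, class)  # columns may have different types

# Columns whose lengths cannot be recycled to a common length are rejected
result <- tryCatch(
  data.frame(x = 1:3, y = 1:2),
  error = function(e) "error: differing number of rows"
)
result
# [1] "error: differing number of rows"
```

Note that data.frame() does recycle shorter columns when the lengths divide evenly, so the equal-length rule is enforced on the final result rather than on the raw inputs.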
Creating Data Frames
Use data.frame() to create a data frame:
# Create a simple data frame
df <- data.frame(
name = c("Alice", "Bob", "Charlie", "Diana"),
age = c(25, 30, 35, 28),
score = c(85.5, 92.0, 78.5, 88.0),
passed = c(TRUE, TRUE, FALSE, TRUE)
)
df
# name age score passed
# 1 Alice 25 85.5 TRUE
# 2 Bob 30 92.0 TRUE
# 3 Charlie 35 78.5 FALSE
# 4 Diana 28 88.0 TRUE
Checking Data Frame Properties
# Dimensions
nrow(df)
# [1] 4
ncol(df)
# [1] 4
dim(df)
# [1] 4 4
# Column names
names(df)
# [1] "name" "age" "score" "passed"
# Structure
str(df)
# 'data.frame': 4 obs. of 4 variables:
# $ name : chr "Alice" "Bob" "Charlie" "Diana"
# $ age : num 25 30 35 28
# $ score : num 85.5 92 78.5 88
# $ passed: logi TRUE TRUE FALSE TRUE
Accessing Data in Data Frames
Accessing Columns
# By name (returns a vector)
df$name
# [1] "Alice" "Bob" "Charlie" "Diana"
df$age
# [1] 25 30 35 28
# Using double brackets (also returns vector)
df[["score"]]
# [1] 85.5 92.0 78.5 88.0
Accessing Rows and Cells
# Single cell: row 1, column 2
df[1, 2]
# [1] 25
# Entire first row
df[1, ]
# name age score passed
# 1 Alice 25 85.5 TRUE
# Entire first column (returns a vector; add drop = FALSE to keep a data frame)
df[, 1]
# [1] "Alice" "Bob" "Charlie" "Diana"
# Multiple rows and columns
df[c(1, 2), c("name", "score")]
# name score
# 1 Alice 85.5
# 2 Bob 92.0
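Selecting a single column with [row, col] simplifies the result to a vector, which can surprise code that expects a data frame. A short sketch of the ways to control this, using a hypothetical data frame named pets:

```r
# Hypothetical example data, used only in this sketch
pets <- data.frame(
  name = c("Rex", "Tom"),
  legs = c(4, 4)
)

v  <- pets[, 1]                # one column via [row, col]: simplified to a vector
d1 <- pets[, 1, drop = FALSE]  # drop = FALSE keeps the data frame
d2 <- pets["name"]             # list-style single brackets also keep it

is.data.frame(v)
# [1] FALSE
is.data.frame(d1)
# [1] TRUE
is.data.frame(d2)
# [1] TRUE
```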
Introducing Tibbles
Tibbles are the tidyverse’s modern take on data frames. They are stricter and more informative than base R data frames, which makes debugging easier.
Creating Tibbles
# Load the tibble package (also attached by library(tidyverse))
library(tibble)
# Create a tibble
tb <- tibble(
name = c("Alice", "Bob", "Charlie", "Diana"),
age = c(25, 30, 35, 28),
score = c(85.5, 92.0, 78.5, 88.0),
passed = c(TRUE, TRUE, FALSE, TRUE)
)
tb
# # A tibble: 4 × 4
# name age score passed
# <chr> <dbl> <dbl> <lgl>
# 1 Alice 25 85.5 TRUE
# 2 Bob 30 92 TRUE
# 3 Charlie 35 78.5 FALSE
# 4 Diana 28 88 TRUE
Key Differences: tibble vs data.frame
# tibble never changes column types unexpectedly
# tibble never creates row names
# tibble prints nicely
# Base R data.frame: $ partially matches column names
df$sc
# [1] 85.5 92.0 78.5 88.0
# tibble: no partial matching (safer)
# tb$sc # Returns NULL with a warning: Unknown or uninitialised column: `sc`
# tibble shows dimensions in print output
tb
# # A tibble: 4 × 4
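The differences in strictness are easiest to see side by side. This sketch assumes the tibble package is installed and uses hypothetical objects scores_df and scores_tb rather than the tutorial's df and tb:

```r
library(tibble)

# Hypothetical example data, used only in this sketch
scores_df <- data.frame(name = c("Ann", "Ben"), score = c(90, 75))
scores_tb <- as_tibble(scores_df)

# data.frame: $ partially matches "sc" to the "score" column
scores_df$sc
# [1] 90 75

# tibble: no partial matching; the result is NULL, with a warning
partial <- suppressWarnings(scores_tb$sc)
is.null(partial)
# [1] TRUE

# Single-column [ ]: data.frame simplifies to a vector, tibble stays a tibble
is.data.frame(scores_df[, "score"])
# [1] FALSE
is_tibble(scores_tb[, "score"])
# [1] TRUE
```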
Essential Data Frame Operations
Adding Columns
# Add column with $
df$grade <- c("B", "A", "C", "B")
df
# name age score passed grade
# 1 Alice 25 85.5 TRUE B
# 2 Bob 30 92.0 TRUE A
# 3 Charlie 35 78.5 FALSE C
# 4 Diana 28 88.0 TRUE B
# Using transform() (base R)
df <- transform(df,
bonus = ifelse(score > 85, 5, 0))
# Using mutate() (tidyverse)
library(dplyr)
df <- df %>%
mutate(letter = ifelse(score >= 90, "A",
ifelse(score >= 80, "B", "C")))
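Nested ifelse() calls become hard to read as the number of bands grows. One alternative, sketched here with a hypothetical data frame named marks, is base R's cut(), which bins a numeric vector in a single call:

```r
# Hypothetical example data, used only in this sketch
marks <- data.frame(score = c(85.5, 92.0, 78.5, 88.0))

# right = FALSE makes the intervals [-Inf, 80), [80, 90), [90, Inf),
# matching the >= comparisons in the nested ifelse() version
marks$letter <- cut(
  marks$score,
  breaks = c(-Inf, 80, 90, Inf),
  labels = c("C", "B", "A"),
  right = FALSE
)

# cut() returns a factor; convert if plain character values are wanted
marks$letter <- as.character(marks$letter)
marks$letter
# [1] "B" "A" "C" "B"
```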
Removing Columns
# Remove a column
df$grade <- NULL
# Remove multiple columns (list-style indexing, no comma)
df[c("bonus", "letter")] <- NULL
Adding Rows
# Add a row with rbind()
new_row <- data.frame(
name = "Eve",
age = 22,
score = 95.0,
passed = TRUE
)
df <- rbind(df, new_row)
df
# name age score passed
# 1 Alice 25 85.5 TRUE
# 2 Bob 30 92.0 TRUE
# 3 Charlie 35 78.5 FALSE
# 4 Diana 28 88.0 TRUE
# 5 Eve 22 95.0 TRUE
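For data frames, rbind() matches columns by name and expects both inputs to have the same set of columns. The sketch below uses hypothetical data frames roster and newcomer to show the failure mode and one possible workaround:

```r
# Hypothetical example data, used only in this sketch
roster <- data.frame(name = c("Ann", "Ben"), age = c(31, 45))
newcomer <- data.frame(name = "Cy")  # no age column

# rbind() on data frames needs the same columns in both inputs
outcome <- tryCatch(
  rbind(roster, newcomer),
  error = function(e) "rbind failed"
)
outcome
# [1] "rbind failed"

# One workaround: add the missing column as NA before binding
newcomer$age <- NA
roster <- rbind(roster, newcomer)
nrow(roster)
# [1] 3
```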
Subsetting Data Frames
Logical Subsetting
# Filter rows where condition is TRUE
df[df$score > 85, ]
# name age score passed
# 1 Alice 25 85.5 TRUE
# 2 Bob 30 92.0 TRUE
# 4 Diana 28 88.0 TRUE
# 5 Eve 22 95.0 TRUE
# With base R subset()
subset(df, score > 85)
# Same result
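The two approaches give the same result only when the condition contains no missing values. A small sketch with a hypothetical data frame named readings shows where they diverge:

```r
# Hypothetical example data, used only in this sketch
readings <- data.frame(
  id = 1:3,
  value = c(10, NA, 30)
)

# An NA in the condition yields an all-NA row with bracket subsetting
bracket <- readings[readings$value > 15, ]
nrow(bracket)
# [1] 2

# subset() treats NA as FALSE and drops that row
nrow(subset(readings, value > 15))
# [1] 1

# which() gives the same effect while staying with brackets
nrow(readings[which(readings$value > 15), ])
# [1] 1
```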
Selecting Columns
# Select specific columns (dplyr equivalent: select(df, name, score))
df[, c("name", "score")]
# name score
# 1 Alice 85.5
# 2 Bob 92.0
# 3 Charlie 78.5
# 4 Diana 88.0
# 5 Eve 95.0
Sorting Data Frames
Base R Approach
# Sort by score (descending)
df[order(df$score, decreasing = TRUE), ]
# name age score passed
# 5 Eve 22 95.0 TRUE
# 2 Bob 30 92.0 TRUE
# 4 Diana 28 88.0 TRUE
# 1 Alice 25 85.5 TRUE
# 3 Charlie 35 78.5 FALSE
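# Sort by multiple columns: passed ascending, then score descending
# (a vector for decreasing is only supported by the radix method)
df[order(df$passed, df$score, decreasing = c(FALSE, TRUE), method = "radix"), ]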
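Another common way to mix sort directions is to negate a numeric key, which works with any sort method but only for numeric columns. A minimal sketch with a hypothetical data frame named runs:

```r
# Hypothetical example data, used only in this sketch
runs <- data.frame(
  passed = c(TRUE, FALSE, TRUE, FALSE),
  score = c(70, 60, 90, 80)
)

# passed ascending (FALSE first), then score descending within each group
sorted <- runs[order(runs$passed, -runs$score), ]
sorted$score
# [1] 80 60 90 70
```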
Tidyverse Approach
library(dplyr)
# Arrange by score (ascending by default)
arrange(df, score)
# Arrange by multiple columns
arrange(df, desc(passed), score)
Summary
- Data frames are rectangular data structures: lists of equal-length vectors
- Use data.frame() for base R data frames, tibble() for tidyverse tibbles
- Access columns with $ or [[]], rows and cells with [row, col]
- Tibbles are stricter and print more informatively than data frames
- Add columns with $ or mutate(), remove with NULL
- Filter rows with logical conditions, sort with order() or arrange()
Next Steps
Continue your R fundamentals journey with Functions and Control Flow in R, where you’ll learn to write reusable code with functions and control program flow with conditionals and loops.
Practice data frame operations:
# Create your own data frame
students <- data.frame(
name = c("Alex", "Beth", "Carl", "Dina", "Eli"),
math = c(78, 92, 85, 88, 95),
english = c(82, 88, 79, 92, 90)
)
# Add an average column
students$average <- (students$math + students$english) / 2
# Filter students with average above 85
students[students$average > 85, ]