Data Frames and Tibbles

12 min read · beginner
data-frame tibble tidyverse data-wrangling
Part 3 of the r-fundamentals series

Data frames are the workhorse of data analysis in R. They store tabular data, much like a spreadsheet or SQL table, where each column can hold a different type. This tutorial covers creating data frames, inspecting their structure, working with tibbles (the tidyverse’s modern alternative), and essential manipulation operations: adding and removing columns and rows, subsetting, and sorting.

What Is a Data Frame?

A data frame is a list of equal-length vectors. Each vector is a column, so every column has the same number of elements, one per row. Values within a column share one type, but different columns can have different types. This rectangular structure makes data frames well suited to statistical analysis and data science.

Creating Data Frames

Use data.frame() to create a data frame:

# Create a simple data frame
df <- data.frame(
  name = c("Alice", "Bob", "Charlie", "Diana"),
  age = c(25, 30, 35, 28),
  score = c(85.5, 92.0, 78.5, 88.0),
  passed = c(TRUE, TRUE, FALSE, TRUE)
)

df
#      name age score passed
# 1   Alice  25  85.5   TRUE
# 2     Bob  30  92.0   TRUE
# 3 Charlie  35  78.5  FALSE
# 4   Diana  28  88.0   TRUE

Checking Data Frame Properties

# Dimensions
nrow(df)
# [1] 4

ncol(df)
# [1] 4

dim(df)
# [1] 4 4

# Column names
names(df)
# [1] "name"   "age"    "score"  "passed"

# Structure
str(df)
# 'data.frame': 4 obs. of  4 variables:
#  $ name  : chr  "Alice" "Bob" "Charlie" "Diana"
#  $ age   : num  25 30 35 28
#  $ score : num  85.5 92 78.5 88
#  $ passed: logi  TRUE TRUE FALSE TRUE
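
Because a data frame is a list of equal-length vectors, list tools work on it directly. The sketch below uses a small illustrative data frame (`people` is a made-up name, not part of the tutorial's data) to show this, and what happens when column lengths disagree. The column types shown assume R 4.0 or later, where strings are not converted to factors by default.

```r
# A small data frame for illustration (hypothetical data)
people <- data.frame(
  name = c("Alice", "Bob"),
  age = c(25, 30)
)

is.list(people)         # TRUE: a data frame is a list underneath
length(people)          # 2: the list length is the number of columns
sapply(people, class)   # the type of each column

# Columns whose lengths cannot be reconciled are rejected.
# The error is caught here so the script keeps running.
result <- tryCatch(
  data.frame(x = 1:3, y = 1:2),
  error = function(e) conditionMessage(e)
)
result
# [1] "arguments imply differing number of rows: 3, 2"
```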

Accessing Data in Data Frames

Accessing Columns

# By name (returns a vector)
df$name
# [1] "Alice"   "Bob"     "Charlie" "Diana"

df$age
# [1] 25 30 35 28

# Using double brackets (also returns vector)
df[["score"]]
# [1] 85.5 92.0 78.5 88.0

Accessing Rows and Cells

# Single cell: row 1, column 2
df[1, 2]
# [1] 25

# Entire first row
df[1, ]
#   name age score passed
# 1 Alice  25  85.5   TRUE

# Entire first column (returns a vector, not a data frame)
df[, 1]
# [1] "Alice"   "Bob"     "Charlie" "Diana"

# Multiple rows and columns
df[c(1, 2), c("name", "score")]
#    name score
# 1 Alice  85.5
# 2   Bob  92.0
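
Selecting a single column with `[row, col]` drops the result to a plain vector. The sketch below, using a cut-down copy of the example data, shows two ways to keep a one-column data frame instead.

```r
df <- data.frame(
  name = c("Alice", "Bob", "Charlie", "Diana"),
  age = c(25, 30, 35, 28)
)

col_vec <- df[, 1]                 # dimension dropped: a character vector
col_df  <- df[, 1, drop = FALSE]   # drop = FALSE keeps a one-column data frame
col_df2 <- df["name"]              # list-style indexing also keeps a data frame

class(col_vec)
# [1] "character"
class(col_df)
# [1] "data.frame"
```

Using `drop = FALSE` is mostly useful inside functions, where the number of selected columns is not known in advance.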

Introducing Tibbles

Tibbles are the tidyverse’s modern take on data frames. They are stricter about subsetting and print more informatively than base R data frames, which makes bugs easier to spot.

Creating Tibbles

# Load the tibble package (part of the tidyverse)
library(tibble)

# Create a tibble
tb <- tibble(
  name = c("Alice", "Bob", "Charlie", "Diana"),
  age = c(25, 30, 35, 28),
  score = c(85.5, 92.0, 78.5, 88.0),
  passed = c(TRUE, TRUE, FALSE, TRUE)
)

tb
# # A tibble: 4 × 4
#   name      age score passed
#   <chr>   <dbl> <dbl> <lgl>
# 1 Alice      25  85.5 TRUE 
# 2 Bob        30  92   TRUE 
# 3 Charlie    35  78.5 FALSE
# 4 Diana      28  88   TRUE

Key Differences: tibble vs data.frame

# tibble never changes column types unexpectedly
# tibble never creates row names
# tibble prints nicely

# Base R data.frame: $ partially matches column names
df$sc
# [1] 85.5 92.0 78.5 88.0

# tibble: no partial matching (safer)
tb$sc
# NULL
# Warning message:
# Unknown or uninitialised column: `sc`.

# tibble shows dimensions in print output
tb
# # A tibble: 4 × 4
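
Another difference worth seeing in code is single-column subsetting: a base data frame drops to a vector, while a tibble stays a tibble. This is a minimal sketch that assumes the tibble package is installed; the two-row data is illustrative.

```r
library(tibble)

df <- data.frame(name = c("Alice", "Bob"), score = c(85.5, 92))
tb <- tibble(name = c("Alice", "Bob"), score = c(85.5, 92))

class(df[, "score"])   # base R drops a single column to a vector
# [1] "numeric"
class(tb[, "score"])   # a tibble always returns a tibble from [
# [1] "tbl_df"     "tbl"        "data.frame"

# Converting between the two
tb_from_df <- as_tibble(df)
df_from_tb <- as.data.frame(tb)
```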

Essential Data Frame Operations

Adding Columns

# Add column with $
df$grade <- c("B", "A", "C", "B")
df
#      name age score passed grade
# 1   Alice  25  85.5   TRUE     B
# 2     Bob  30  92.0   TRUE     A
# 3 Charlie  35  78.5  FALSE     C
# 4   Diana  28  88.0   TRUE     B

# Using transform() (base R)
df <- transform(df, 
                bonus = ifelse(score > 85, 5, 0))

# Using mutate() (tidyverse)
library(dplyr)
df <- df %>% 
  mutate(letter = ifelse(score >= 90, "A",
                  ifelse(score >= 80, "B", "C")))
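
Nested `ifelse()` calls get hard to read as the number of bands grows. One base R alternative, sketched here on a standalone `scores` vector (a hypothetical name), is `cut()`, which bins a numeric vector in a single call.

```r
scores <- c(85.5, 92.0, 78.5, 88.0)

# Nested ifelse(), as in the mutate() example
nested <- ifelse(scores >= 90, "A",
          ifelse(scores >= 80, "B", "C"))

# cut() with left-closed intervals:
# [-Inf, 80) -> "C", [80, 90) -> "B", [90, Inf) -> "A"
binned <- cut(scores,
              breaks = c(-Inf, 80, 90, Inf),
              labels = c("C", "B", "A"),
              right = FALSE)

as.character(binned)
# [1] "B" "A" "C" "B"
identical(nested, as.character(binned))
# [1] TRUE
```

Note that `cut()` returns a factor, so wrap it in `as.character()` if a character column is wanted.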

Removing Columns

# Remove a column
df$grade <- NULL

# Remove multiple columns
df[, c("bonus", "letter")] <- NULL
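
Assigning `NULL` modifies the data frame in place. When the original should stay intact, a non-destructive option is to select the columns to keep. The sketch below uses a small stand-in data frame; `drop_cols` and `kept` are illustrative names.

```r
df <- data.frame(
  name = c("Alice", "Bob"),
  score = c(85.5, 92.0),
  bonus = c(5, 5),
  letter = c("B", "A")
)

drop_cols <- c("bonus", "letter")

# Keep every column whose name is not in drop_cols
kept <- df[, setdiff(names(df), drop_cols), drop = FALSE]

names(kept)
# [1] "name"  "score"
names(df)    # the original is unchanged
# [1] "name"   "score"  "bonus"  "letter"
```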

Adding Rows

# Add a row with rbind()
new_row <- data.frame(
  name = "Eve",
  age = 22,
  score = 95.0,
  passed = TRUE
)

df <- rbind(df, new_row)
df
#      name age score passed
# 1   Alice  25  85.5   TRUE
# 2     Bob  30  92.0   TRUE
# 3 Charlie  35  78.5  FALSE
# 4   Diana  28  88.0   TRUE
# 5     Eve  22  95.0   TRUE
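
When both arguments are data frames, `rbind()` matches columns by name rather than by position, and it fails if the names differ. A minimal sketch with illustrative data (`bad_row` and its `years` column are made up to trigger the error):

```r
df <- data.frame(name = c("Alice", "Bob"), age = c(25, 30))

# Columns in a different order are matched by name
new_row <- data.frame(age = 22, name = "Eve")
combined <- rbind(df, new_row)
combined$name
# [1] "Alice" "Bob"   "Eve"

# A row with a different column name is rejected.
# The error is caught here so the script keeps running.
bad_row <- data.frame(name = "Frank", years = 40)
msg <- tryCatch(
  rbind(df, bad_row),
  error = function(e) conditionMessage(e)
)
msg
# [1] "names do not match previous names"
```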

Subsetting Data Frames

Logical Subsetting

# Filter rows where condition is TRUE
df[df$score > 85, ]
#    name age score passed
# 1 Alice  25  85.5   TRUE
# 2   Bob  30  92.0   TRUE
# 4 Diana  28  88.0   TRUE
# 5   Eve  22  95.0   TRUE

# With base R subset()
subset(df, score > 85)
# Same result
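
The two approaches give the same result only when the condition contains no missing values. With an `NA` in the column, bracket subsetting returns an extra all-`NA` row, while `subset()` drops it. A short sketch with illustrative data:

```r
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  score = c(85.5, NA, 92.0)
)

with_brackets <- df[df$score > 85, ]         # NA condition yields an all-NA row
with_subset   <- subset(df, score > 85)      # NA condition is treated as FALSE
with_which    <- df[which(df$score > 85), ]  # which() also skips NA

nrow(with_brackets)
# [1] 3
nrow(with_subset)
# [1] 2
```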

Selecting Columns

# Select specific columns by name
df[, c("name", "score")]
#      name score
# 1   Alice  85.5
# 2     Bob  92.0
# 3 Charlie  78.5
# 4   Diana  88.0
# 5     Eve  95.0

# Same result with dplyr
select(df, name, score)

Sorting Data Frames

Base R Approach

# Sort by score (descending)
df[order(df$score, decreasing = TRUE), ]
#      name age score passed
# 5     Eve  22  95.0   TRUE
# 2     Bob  30  92.0   TRUE
# 4   Diana  28  88.0   TRUE
# 1   Alice  25  85.5   TRUE
# 3 Charlie  35  78.5  FALSE

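# Sort by multiple columns: passed ascending, then score descending
# (a per-column `decreasing` vector is only supported by the radix method)
df[order(df$passed, df$score, decreasing = c(FALSE, TRUE), method = "radix"), ]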
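
As the sorted output above shows, rows keep their original row labels after reordering. The sketch below, on a small illustrative data frame, resets them and also shows that negating a numeric column is another way to sort in descending order.

```r
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  score = c(85.5, 92.0, 78.5)
)

sorted <- df[order(df$score, decreasing = TRUE), ]
rownames(sorted)           # original labels travel with the rows
# [1] "2" "1" "3"

rownames(sorted) <- NULL   # reset to sequential labels
rownames(sorted)
# [1] "1" "2" "3"

# Negating a numeric column gives the same descending order
identical(order(-df$score), order(df$score, decreasing = TRUE))
# [1] TRUE
```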

Tidyverse Approach

library(dplyr)

# Arrange by score
arrange(df, score)

# Arrange by multiple columns
arrange(df, desc(passed), score)

Summary

  • Data frames are rectangular data structures—lists of equal-length vectors
  • Use data.frame() for base R data frames, tibble() for tidyverse tibbles
  • Access columns with $ or [[]], rows and cells with [row, col]
  • Tibbles are stricter and print more informatively than data frames
  • Add columns with $ or mutate(), remove them by assigning NULL
  • Filter rows with logical conditions, sort with order() or arrange()

Next Steps

Continue your r-fundamentals journey with Functions and Control Flow in R, where you’ll learn to write reusable code with functions and control program flow with conditionals and loops.

Practice data frame operations:

# Create your own data frame
students <- data.frame(
  name = c("Alex", "Beth", "Carl", "Dina", "Eli"),
  math = c(78, 92, 85, 88, 95),
  english = c(82, 88, 79, 92, 90)
)

# Add an average column
students$average <- (students$math + students$english) / 2

# Filter students with average above 85
students[students$average > 85, ]