Tidytext Basics

· 4 min read · Updated March 16, 2026 · beginner
text-mining tidytext r nlp beginner

Building on the fundamentals from the introduction to text mining, this tutorial dives deeper into tidytext functions. You will learn how to tokenize text in different ways, work with multiple documents, and apply common text transformations.

The Tokenization Workflow

Tokenization converts raw text into a tidy structure where each row represents one token. The unnest_tokens() function is your primary tool:

library(tidytext)
library(tidyverse)

# Simple example
text <- "Text mining unlocks insights from data"

tibble(text = text) %>%
  unnest_tokens(word, text)

The output shows each word on its own row. By default, the function strips punctuation, converts tokens to lowercase, and handles special characters for you.
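If those defaults get in the way, unnest_tokens() exposes arguments to control them; for example, to_lower = FALSE preserves the original casing. A minimal sketch using a made-up sentence:

```r
library(tidytext)
library(tibble)

df <- tibble(text = "Text Mining in R")

# Default behavior: tokens are lowercased
df %>% unnest_tokens(word, text)

# Preserve the original casing
df %>% unnest_tokens(word, text, to_lower = FALSE)
```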

Tokenizing by Different Units

Sometimes words are not the right unit. The tidytext package supports multiple tokenization strategies.

Sentences

For longer documents, sentence-level analysis can be more meaningful:

paragraph <- tibble(
  id = 1,
  text = "The first sentence makes a point. The second sentence agrees. Finally, the third concludes."
)

paragraph %>%
  unnest_tokens(sentence, text, token = "sentences")

N-grams

N-grams capture word sequences—useful for phrase analysis and context:

text <- tibble(
  id = 1:2,
  text = c(
    "Natural language processing",
    "Text mining in R"
  )
)

# Bigrams (pairs of consecutive words)
text %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

# Trigrams (triplets)
text %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3)
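Bigrams often pick up stop-word pairs such as "in the". A common follow-up, sketched here with the stop_words dataset that ships with tidytext, is to split each bigram using tidyr::separate() and filter on either position:

```r
library(tidytext)
library(dplyr)
library(tidyr)

bigrams <- tibble(text = "text mining in the tidyverse") %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

# Split each bigram into its two words, then drop pairs
# where either word is a stop word
bigrams %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word)
```

Rejoining the surviving pairs with tidyr::unite() restores the bigram column if you need it for counting.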

Characters

Character-level analysis helps with spelling errors and language identification:

text %>%
  unnest_tokens(char, text, token = "characters")

Working with Multiple Documents

Real text mining projects involve many documents. Here is how to structure and process them:

# Create a corpus of documents
documents <- tibble(
  doc_id = 1:3,
  title = c("Introduction", "Methods", "Results"),
  text = c(
    "This paper introduces a new method.",
    "We used statistical methods to analyze data.",
    "The results show significant findings."
  )
)

# Tokenize while preserving document metadata
tidy_docs <- documents %>%
  unnest_tokens(word, text)

print(tidy_docs)

Removing Stop Words

Stop words are common words that rarely carry meaningful information. The tidytext package includes built-in stop word lexicons:

# The default stop word list (Snowball lexicon)
get_stopwords()

# Remove stop words from your data
tidy_docs %>%
  anti_join(get_stopwords(), by = "word")
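get_stopwords() also accepts language and source arguments. The sketch below assumes the underlying stopwords package provides the English "smart" lexicon and a German Snowball list, which it does in current releases:

```r
library(tidytext)

# English "smart" lexicon (larger than the Snowball default)
get_stopwords(source = "smart")

# Snowball stop words for another language
get_stopwords(language = "de")
```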

You can also customize stop words:

# Custom stop words specific to your domain
custom_stop <- tibble(
  word = c("data", "analysis", "paper"),
  lexicon = "custom"
)

tidy_docs %>%
  anti_join(custom_stop, by = "word")

Counting and Analyzing Tokens

With tidy text, familiar dplyr operations become powerful text analysis tools:

# Word frequency across documents
tidy_docs %>%
  count(doc_id, word, sort = TRUE)

# Most common words overall
tidy_docs %>%
  count(word, sort = TRUE) %>%
  head(10)

# Words per document
tidy_docs %>%
  group_by(doc_id) %>%
  summarize(word_count = n())

Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF weights terms by their importance within a document collection. Terms that appear frequently in one document but rarely across all documents get higher scores:

# Calculate TF-IDF
doc_term_counts <- tidy_docs %>%
  count(doc_id, word)

tfidf <- doc_term_counts %>%
  bind_tf_idf(word, doc_id, n)

# Find distinctive terms per document
tfidf %>%
  group_by(doc_id) %>%
  slice_max(tf_idf, n = 5)

This reveals which terms are most characteristic of each document.
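To make the weighting concrete, here is a hand-rolled version of the same calculation on a toy count table: tf is a term's share of its document's tokens, and idf is the natural log of the total document count over the number of documents containing the term (bind_tf_idf() uses the same definitions):

```r
library(dplyr)
library(tibble)

toy_counts <- tribble(
  ~doc_id, ~word,    ~n,
  1,       "method", 2,
  1,       "data",   1,
  2,       "data",   3
)

n_docs <- n_distinct(toy_counts$doc_id)

toy_counts %>%
  group_by(doc_id) %>%
  mutate(tf = n / sum(n)) %>%                          # term's share of its document
  group_by(word) %>%
  mutate(idf = log(n_docs / n_distinct(doc_id))) %>%   # rarer terms score higher
  ungroup() %>%
  mutate(tf_idf = tf * idf)
```

Here "data" appears in both documents, so its idf (and hence tf-idf) is zero, while "method" is distinctive to document 1.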

Practical Example: Analyzing Book Chapters

Let us apply these concepts to a more realistic dataset:

library(janeaustenr)

# Get the text
book_text <- austen_books() %>%
  filter(book == "Pride & Prejudice")

# Tokenize, numbering chapters from their heading lines
tidy_chapters <- book_text %>%
  mutate(chapter = cumsum(str_detect(
    text, regex("^chapter [\\divxlc]", ignore_case = TRUE)
  ))) %>%
  filter(chapter > 0) %>%
  unnest_tokens(word, text) %>%
  anti_join(get_stopwords(), by = "word")

# Most distinctive words per chapter
chapter_tf <- tidy_chapters %>%
  count(chapter, word) %>%
  bind_tf_idf(word, chapter, n) %>%
  group_by(chapter) %>%
  slice_max(tf_idf, n = 3)

print(chapter_tf)

Pairwise Comparisons

Comparing word usage between two groups reveals distinctive vocabulary:

# Compare two books
two_books <- austen_books() %>%
  filter(book %in% c("Pride & Prejudice", "Sense & Sensibility")) %>%
  mutate(book = factor(book, levels = c("Pride & Prejudice", "Sense & Sensibility"))) %>%
  unnest_tokens(word, text) %>%
  anti_join(get_stopwords(), by = "word")

# Count words by book
word_counts <- two_books %>%
  count(book, word) %>%
  pivot_wider(names_from = book, values_from = n, values_fill = 0)

# Calculate log odds ratio
word_counts %>%
  mutate(
    pp_rate = (`Pride & Prejudice` + 1) / sum(`Pride & Prejudice` + 1),
    ss_rate = (`Sense & Sensibility` + 1) / sum(`Sense & Sensibility` + 1),
    log_odds = log(pp_rate / ss_rate)
  ) %>%
  arrange(desc(log_odds))

What You Have Learned

This tutorial covered essential tidytext techniques:

  • Word tokenization — the default; breaks text into words
  • N-grams — capture word pairs and phrases
  • Sentence tokenization — document-level analysis
  • Stop word removal — filter out common, low-information words
  • TF-IDF — find terms distinctive to each document
  • Pairwise comparison — compare vocabularies between texts

These tools form the foundation for more advanced text analysis, including sentiment analysis, topic modeling, and text classification.

Next Steps

Continue your text mining journey with related tutorials in this series:

  • Sentiment Analysis in R — Assign emotional scores to text
  • Topic Modeling with LDA in R — Discover hidden topics in document collections
  • Text Classification in R — Build models to categorize text automatically