# Tidytext Basics
Building on the fundamentals from the introduction to text mining, this tutorial dives deeper into tidytext functions. You will learn how to tokenize text in different ways, work with multiple documents, and apply common text transformations.
## The Tokenization Workflow

Tokenization converts raw text into a tidy structure where each row represents one token. The `unnest_tokens()` function is your primary tool:

```r
library(tidytext)
library(tidyverse)

# Simple example
text <- "Text mining unlocks insights from data"

tibble(text = text) %>%
  unnest_tokens(word, text)
```
The output shows each word on its own row. The function handles punctuation, lowercase conversion, and special characters automatically.
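To see that cleanup in action, here is a quick sketch with a messier string (the input sentence is invented for illustration):

```r
library(tidytext)
library(dplyr)

messy <- tibble(text = "Text mining, it's GREAT -- isn't it?")

tokens <- messy %>%
  unnest_tokens(word, text)

tokens
# punctuation such as "," and "--" is stripped,
# and "GREAT" is lowercased to "great"
```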
## Tokenizing by Different Units

Sometimes words are not the right unit. The tidytext package supports multiple tokenization strategies.
### Sentences

For longer documents, sentence-level analysis can be more meaningful:

```r
paragraph <- tibble(
  id = 1,
  text = "The first sentence makes a point. The second sentence agrees. Finally, the third concludes."
)

paragraph %>%
  unnest_tokens(sentence, text, token = "sentences")
```
### N-grams

N-grams capture word sequences, which is useful for phrase analysis and context:

```r
text <- tibble(
  id = 1:2,
  text = c(
    "Natural language processing",
    "Text mining in R"
  )
)

# Bigrams (pairs of consecutive words)
text %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

# Trigrams (triplets)
text %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3)
```
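Bigrams often need post-processing, for example dropping pairs that contain a stop word. A common pattern (a sketch using tidyr's `separate()` and `unite()`; the sample sentence is invented) splits each bigram into its two words, filters, and recombines:

```r
library(tidytext)
library(tidyverse)

bigrams <- tibble(id = 1, text = "text mining in r unlocks insights") %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

# Split each bigram, drop pairs containing a stop word, recombine
filtered <- bigrams %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  unite(bigram, word1, word2, sep = " ")

filtered
```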
### Characters

Character-level analysis helps with spelling errors and language identification:

```r
text %>%
  unnest_tokens(char, text, token = "characters")
```
## Working with Multiple Documents

Real text mining projects involve many documents. Here is how to structure and process them:

```r
# Create a corpus of documents
documents <- tibble(
  doc_id = 1:3,
  title = c("Introduction", "Methods", "Results"),
  text = c(
    "This paper introduces a new method.",
    "We used statistical methods to analyze data.",
    "The results show significant findings."
  )
)

# Tokenize while preserving document metadata
tidy_docs <- documents %>%
  unnest_tokens(word, text)

print(tidy_docs)
```
## Removing Stop Words

Stop words are common words that rarely carry meaningful information. The tidytext package includes built-in stop word lexicons:

```r
# get_stopwords() returns the Snowball lexicon by default
get_stopwords()

# Remove stop words from your data
tidy_docs %>%
  anti_join(get_stopwords(), by = "word")
```
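Beyond `get_stopwords()`, tidytext also ships a larger `stop_words` data frame that combines the onix, SMART, and snowball lexicons; here is a sketch of filtering with just one of them (the toy document is invented):

```r
library(tidytext)
library(dplyr)

data(stop_words)

# How many words each bundled lexicon contributes
count(stop_words, lexicon)

tidy_docs <- tibble(doc_id = 1, text = "The results show significant findings") %>%
  unnest_tokens(word, text)

# Use only the SMART lexicon, for example
kept <- tidy_docs %>%
  anti_join(filter(stop_words, lexicon == "SMART"), by = "word")

kept
```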
You can also customize stop words:

```r
# Custom stop words specific to your domain
custom_stop <- tibble(
  word = c("data", "analysis", "paper"),
  lexicon = "custom"
)

tidy_docs %>%
  anti_join(custom_stop, by = "word")
```
## Counting and Analyzing Tokens

With tidy text, familiar dplyr operations become powerful text analysis tools:

```r
# Word frequency across documents
tidy_docs %>%
  count(doc_id, word, sort = TRUE)

# Most common words overall
tidy_docs %>%
  count(word, sort = TRUE) %>%
  head(10)

# Words per document
tidy_docs %>%
  group_by(doc_id) %>%
  summarize(word_count = n())
```
## Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF weights terms by their importance within a document collection. Terms that appear frequently in one document but rarely across the rest of the collection get higher scores:

```r
# Calculate TF-IDF
doc_term_counts <- tidy_docs %>%
  count(doc_id, word)

tfidf <- doc_term_counts %>%
  bind_tf_idf(word, doc_id, n)

# Find distinctive terms per document
tfidf %>%
  group_by(doc_id) %>%
  slice_max(tf_idf, n = 5)
```

This reveals which terms are most characteristic of each document.
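For intuition, the weights `bind_tf_idf()` produces follow the textbook definitions: tf is a term's count divided by the document's total token count, and idf is the natural log of the number of documents divided by the number of documents containing the term. A base-R sketch with invented toy counts:

```r
# Toy counts: 2 documents, the term appears in 1 of them,
# 3 times within a 10-token document
n_docs <- 2
docs_with_term <- 1
term_count <- 3
doc_total <- 10

tf <- term_count / doc_total          # 0.3
idf <- log(n_docs / docs_with_term)   # log(2)
tf_idf <- tf * idf

tf_idf  # 0.3 * log(2), about 0.208
```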
## Practical Example: Analyzing Book Chapters

Let us apply these concepts to a more realistic dataset:

```r
library(janeaustenr)

# Get the text
book_text <- austen_books() %>%
  filter(book == "Pride & Prejudice")

# Assign chapter numbers by detecting chapter heading lines
tidy_chapters <- book_text %>%
  mutate(chapter = cumsum(str_detect(text, regex("^chapter", ignore_case = TRUE)))) %>%
  filter(chapter > 0) %>%
  unnest_tokens(word, text) %>%
  anti_join(get_stopwords(), by = "word")

# Most distinctive words per chapter
chapter_tf <- tidy_chapters %>%
  count(chapter, word) %>%
  bind_tf_idf(word, chapter, n) %>%
  group_by(chapter) %>%
  slice_max(tf_idf, n = 3)

print(chapter_tf)
```
## Pairwise Comparisons

Comparing word usage between two groups reveals distinctive vocabulary:

```r
# Compare two books
two_books <- austen_books() %>%
  filter(book %in% c("Pride & Prejudice", "Sense & Sensibility")) %>%
  unnest_tokens(word, text) %>%
  anti_join(get_stopwords(), by = "word")

# Count words by book, one column per book
word_counts <- two_books %>%
  count(book, word) %>%
  pivot_wider(names_from = book, values_from = n, values_fill = 0)

# Calculate log odds ratio (+1 smoothing avoids division by zero)
word_counts %>%
  mutate(
    pp_rate = (`Pride & Prejudice` + 1) / sum(`Pride & Prejudice` + 1),
    ss_rate = (`Sense & Sensibility` + 1) / sum(`Sense & Sensibility` + 1),
    log_odds = log(pp_rate / ss_rate)
  ) %>%
  arrange(desc(log_odds))
```
## What You Have Learned

This tutorial covered essential tidytext techniques:

| Technique | Use Case |
|---|---|
| Word tokenization | The default; breaks text into words |
| N-grams | Capture word pairs and phrases |
| Sentence tokenization | Sentence-level analysis of longer documents |
| Stop word removal | Filter out common, low-information words |
| TF-IDF | Find terms distinctive to each document |
| Pairwise comparison | Compare vocabularies between texts |
These tools form the foundation for more advanced text analysis, including sentiment analysis, topic modeling, and text classification.
## See Also
- dplyr::filter — Filter rows after tokenization
- dplyr::count — Essential for word frequency analysis
## Next Steps
Continue your text mining journey with related tutorials in this series:
- Sentiment Analysis in R — Assign emotional scores to text
- Topic Modeling with LDA in R — Discover hidden topics in document collections
- Text Classification in R — Build models to categorize text automatically