Tidytext Basics: Master the core tidytext functions for text
Building on the fundamentals from the introduction to text mining, this tutorial dives deeper into tidytext functions. You will learn how to tokenize text in different ways, work with multiple documents, and apply common text transformations.
What you’ll learn
This tutorial covers the key concepts and practical techniques for working with tidytext: tokenizing text at different levels, removing stop words, counting tokens, and weighting terms with TF-IDF. By the end, you will know how to apply the core functions in real data analysis workflows.
The tokenization workflow
Tokenization converts raw text into a tidy structure where each row represents one token. The unnest_tokens() function is your primary tool:
library(tidytext)
library(tidyverse)
# Simple example
text <- "Text mining unlocks insights from data"
tibble(text = text) %>%
unnest_tokens(word, text)
The output shows each word on its own row. By default, unnest_tokens() converts tokens to lowercase and strips punctuation, so no separate cleaning step is needed for either.
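The defaults can be switched off when casing matters. The following is a minimal sketch, assuming a made-up example sentence; it compares the default behaviour with to_lower = FALSE, an argument of unnest_tokens().

```r
library(dplyr)
library(tidytext)

# Illustrative input (not from the tutorial data)
raw <- tibble(text = "Text mining, in R, unlocks INSIGHTS!")

# Default behaviour: tokens are lowercased and punctuation is dropped
default_tokens <- raw %>%
  unnest_tokens(word, text)

# Keep the original casing by switching off lowercasing
cased_tokens <- raw %>%
  unnest_tokens(word, text, to_lower = FALSE)

default_tokens$word
cased_tokens$word
```

Keeping case can help when capitalization carries meaning, for example when "R" the language should stay distinct from a stray lowercase "r".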
Tokenizing by different units
Sometimes words are not the right unit. The tidytext package supports multiple tokenization strategies.
Sentences
For longer documents, sentence-level analysis can be more meaningful:
paragraph <- tibble(
id = 1,
text = "The first sentence makes a point. The second sentence agrees. Finally, the third concludes."
)
paragraph %>%
unnest_tokens(sentence, text, token = "sentences")
N-grams
N-grams capture word sequences—useful for phrase analysis and context:
text <- tibble(
id = 1:2,
text = c(
"Natural language processing",
"Text mining in R"
)
)
# Bigrams (pairs of consecutive words)
text %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)
# Trigrams (triplets)
text %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3)
Characters
Character-level analysis helps with spelling errors and language identification:
text %>%
unnest_tokens(char, text, token = "characters")
Working with multiple documents
Real text mining projects involve many documents. Here is how to structure and process them:
# Create a corpus of documents
documents <- tibble(
doc_id = 1:3,
title = c("Introduction", "Methods", "Results"),
text = c(
"This paper introduces a new method.",
"We used statistical methods to analyze data.",
"The results show significant findings."
)
)
# Tokenize while preserving document metadata
tidy_docs <- documents %>%
unnest_tokens(word, text)
print(tidy_docs)
Removing stop words
Stop words are common words that rarely carry meaningful information. The tidytext package includes built-in stop word lexicons:
# View the default stop word list (English, Snowball)
get_stopwords()
# Remove stop words from your data
tidy_docs %>%
anti_join(get_stopwords(), by = "word")
You can also customize stop words:
# Custom stop words specific to your domain
custom_stop <- tibble(
word = c("data", "analysis", "paper"),
lexicon = "custom"
)
tidy_docs %>%
anti_join(custom_stop, by = "word")
Counting and analyzing tokens
With tidy text, familiar dplyr operations become powerful text analysis tools:
# Word frequency across documents
tidy_docs %>%
count(doc_id, word, sort = TRUE)
# Most common words overall
tidy_docs %>%
count(word, sort = TRUE) %>%
head(10)
# Words per document
tidy_docs %>%
group_by(doc_id) %>%
summarize(word_count = n())
Term frequency-inverse document frequency (TF-IDF)
TF-IDF weights terms by their importance within a document collection. Terms that appear frequently in one document but rarely across all documents get higher scores:
# Calculate TF-IDF
doc_term_counts <- tidy_docs %>%
count(doc_id, word)
tfidf <- doc_term_counts %>%
bind_tf_idf(word, doc_id, n)
# Find distinctive terms per document
tfidf %>%
group_by(doc_id) %>%
slice_max(tf_idf, n = 5)
This reveals which terms are most characteristic of each document.
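To make the weighting concrete, the sketch below recomputes the scores by hand on a tiny made-up count table and compares them with bind_tf_idf(). It assumes one row per document-word pair, with tf as the word's share of its document and idf as the natural log of the number of documents divided by the number of documents containing the word.

```r
library(dplyr)
library(tidytext)

# Tiny illustrative count table: one row per document-word pair
counts <- tibble(
  doc_id = c(1, 1, 2, 2),
  word   = c("method", "paper", "method", "data"),
  n      = c(1, 1, 1, 1)
)

n_docs <- n_distinct(counts$doc_id)

manual <- counts %>%
  group_by(doc_id) %>%
  mutate(tf = n / sum(n)) %>%                       # share of the document
  group_by(word) %>%
  mutate(idf = log(n_docs / n_distinct(doc_id))) %>% # rarity across documents
  ungroup() %>%
  mutate(tf_idf = tf * idf)

packaged <- counts %>%
  bind_tf_idf(word, doc_id, n)

# Compare the hand-computed scores with the packaged ones
all.equal(as.numeric(manual$tf_idf), as.numeric(packaged$tf_idf))
```

A word that occurs in every document, such as "method" here, gets an idf of zero and therefore a tf_idf of zero.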
Practical example: analyzing book chapters
Let us apply these concepts to a more realistic dataset:
library(janeaustenr)
# Get the text
book_text <- austen_books() %>%
filter(book == "Pride & Prejudice")
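# Tokenize by chapter (chapter headings mark the boundaries)
# The headings are capitalized ("Chapter 1"), so match case-insensitively
tidy_chapters <- book_text %>%
mutate(chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE)))) %>%
filter(chapter > 0) %>%
unnest_tokens(word, text) %>%
anti_join(get_stopwords(), by = "word")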
# Most distinctive words per chapter
chapter_tf <- tidy_chapters %>%
count(chapter, word) %>%
bind_tf_idf(word, chapter, n) %>%
group_by(chapter) %>%
slice_max(tf_idf, n = 3)
print(chapter_tf)
Pairwise comparisons
Comparing word usage between two groups reveals distinctive vocabulary:
# Compare two books
two_books <- austen_books() %>%
filter(book %in% c("Pride & Prejudice", "Sense & Sensibility")) %>%
mutate(book = factor(book, levels = c("Pride & Prejudice", "Sense & Sensibility"))) %>%
unnest_tokens(word, text) %>%
anti_join(get_stopwords(), by = "word")
# Count words by book
word_counts <- two_books %>%
count(book, word) %>%
pivot_wider(names_from = book, values_from = n, values_fill = 0)
# Calculate log odds ratio
word_counts %>%
mutate(
pp_rate = (`Pride & Prejudice` + 1) / sum(`Pride & Prejudice` + 1),
ss_rate = (`Sense & Sensibility` + 1) / sum(`Sense & Sensibility` + 1),
log_odds = log(pp_rate / ss_rate)
) %>%
arrange(desc(log_odds))
What you have learned
This tutorial covered essential tidytext techniques:
| Technique | Use Case |
|---|---|
| Word tokenization | Default; breaks text into words |
| N-grams | Capture word pairs and phrases |
| Sentence tokenization | Sentence-level analysis of longer documents |
| Stop word removal | Filter common words |
| TF-IDF | Find distinctive terms |
| Pairwise comparison | Compare vocabularies between texts |
These tools form the foundation for more advanced text analysis, including sentiment analysis, topic modeling, and text classification.
Tokenization
Tokenization splits text into individual tokens (words, sentences, or n-grams). unnest_tokens(df, word, text_col) converts a data frame with a text column into one-row-per-word format. By default it lowercases all tokens and strips punctuation. token = "sentences" splits into sentences. token = "ngrams", n = 2 creates bigrams.
Stop words
The stop_words dataset in tidytext contains common English words to exclude (the, and, a, etc.). Remove them with anti_join(df, stop_words, by = "word"). Its lexicon column records which list each word comes from: "SMART", "snowball", or "onix"; filter on that column to use a single list. For domain-specific text, extend with custom stop words: bind_rows(stop_words, tibble(word = c("said", "also"), lexicon = "custom")).
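As a rough sketch of combining these steps, the example below keeps only the Snowball entries of stop_words, adds two custom words, and removes the combined list from a made-up sentence. The names snowball_only, custom_stop, and my_stop_words are illustrative, not part of tidytext.

```r
library(dplyr)
library(tidytext)

# Keep only the Snowball entries of the bundled stop_words dataset
snowball_only <- stop_words %>%
  filter(lexicon == "snowball")

# Hypothetical domain-specific additions
custom_stop <- tibble(word = c("said", "also"), lexicon = "custom")
my_stop_words <- bind_rows(snowball_only, custom_stop)

# Remove both the standard and the custom stop words
tokens <- tibble(text = "She said the model also works") %>%
  unnest_tokens(word, text) %>%
  anti_join(my_stop_words, by = "word")

tokens
```

Restricting to one list gives a smaller, more conservative filter than the full three-list dataset.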
Term frequency
count(df, word, sort = TRUE) gives word frequencies. bind_tf_idf(df, word, document, n) computes TF-IDF scores that down-weight words common across all documents. Visualize precomputed counts or scores with ggplot2::geom_col() (geom_bar() counts rows by default), or draw a word cloud using wordcloud2::wordcloud2(). TF-IDF identifies the most distinctive words per document, which is useful for summarizing document content or comparing topics.
Document-term matrix
Many machine learning models require a document-term matrix (DTM): one row per document, one column per term. cast_dtm(df, document, word, n) creates a DTM from tidy word counts. cast_sparse(df, document, word, n) creates a sparse matrix for memory efficiency with large vocabularies. The topicmodels package’s LDA() function accepts DTMs directly for topic modeling.
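The conversion can be sketched on a tiny made-up count table. This example uses cast_sparse(), which needs the Matrix package installed; cast_dtm() takes the same arguments but additionally requires the tm package.

```r
library(dplyr)
library(tidytext)

# Illustrative tidy counts: one row per document-term pair
counts <- tibble(
  doc_id = c("a", "a", "b"),
  word   = c("text", "mining", "text"),
  n      = c(2, 1, 1)
)

# One row per document, one column per term, counts as cell values
m <- counts %>%
  cast_sparse(doc_id, word, n)

dim(m)
m["a", "text"]
```

Cells for document-term pairs that never occur, such as "mining" in document "b", are implicit zeros and take no storage.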
The tidy text format
The tidytext package represents text data in a tidy format: one token per row. Tokenization is the process of splitting text into units such as words, sentences, n-grams, or characters. unnest_tokens(df, word, text_col) splits a text column into one-word-per-row format, lowercasing and stripping punctuation by default.
This tidy format lets standard tidyverse operations apply naturally to text. count(word, sort = TRUE) counts word frequencies. filter(!word %in% stop_words$word) removes stop words. left_join(sentiments, by = "word") attaches sentiment labels, leaving NA for words not in the lexicon. The tidyverse tools you already know work directly on text data.
N-grams capture multi-word phrases. unnest_tokens(df, bigram, text_col, token = "ngrams", n = 2) produces bigrams. tidyr::separate(bigram, c("word1", "word2"), sep = " ") splits the bigram into two columns for filtering: remove bigrams where either word is a stop word, then tidyr::unite() to rejoin.
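The separate, filter, and unite steps might look like the following sketch, which assumes a made-up sentence and the bundled stop_words dataset.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Illustrative input (not from the tutorial data)
docs <- tibble(id = 1, text = "The quick brown fox jumps over the lazy dog")

clean_bigrams <- docs %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  # Split each bigram so the two words can be checked separately
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  # Drop bigrams where either word is a stop word
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  # Rejoin the remaining pairs into a single column
  unite(bigram, word1, word2, sep = " ")

clean_bigrams
```

Filtering on the separated columns avoids discarding a phrase merely because a stop word appears as a substring of the bigram text.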
Term frequency and TF-IDF
Word frequency alone is not always the most useful statistic. Common words appear frequently across all documents. TF-IDF (term frequency-inverse document frequency) downweights words that appear in many documents and upweights words specific to particular documents.
bind_tf_idf(word, document, n) computes TF-IDF for a word frequency tibble. The n column is the count of each word in each document. The function adds tf, idf, and tf_idf columns. Words with high tf_idf are characteristic of specific documents and make good document keywords or topics.
For a document-term matrix (needed by many machine learning algorithms), cast_dtm(df, document, term, value) converts a tidy tibble to a DocumentTermMatrix from the tm package. cast_sparse(df, document, term, value) produces a sparse matrix, which is more memory-efficient.
Sentiment analysis approaches
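tidytext provides access to several sentiment lexicons through get_sentiments(), including:
- "bing": binary positive/negative classification for about 6,800 words
- "afinn": numeric scores from -5 to +5 for about 2,500 words
- "nrc": categorical emotions (anger, joy, sadness, etc.) for about 14,000 words
The "bing" lexicon ships with tidytext; "afinn" and "nrc" are downloaded through the textdata package the first time they are requested.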
Join the word tibble to a lexicon: inner_join(get_sentiments("bing"), by = "word") returns only words with sentiment labels. This approach is dictionary-based, so words not in the lexicon are ignored. Negation (“not happy”) is not handled; where that matters, consider the sentimentr package, which handles valence shifters.
For document-level sentiment, aggregate word scores: group_by(doc_id) %>% summarise(sentiment = sum(value)) for AFINN, or count positive and negative words and compute the difference for Bing.
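A minimal sketch of the Bing variant follows, assuming two made-up review snippets. It counts positive and negative words per document and takes the difference; the column name net is illustrative.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Illustrative documents (not from the tutorial data)
reviews <- tibble(
  doc_id = 1:2,
  text = c("A wonderful, happy ending", "A terrible and boring plot")
)

doc_sentiment <- reviews %>%
  unnest_tokens(word, text) %>%
  # Keep only words that appear in the Bing lexicon
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(doc_id, sentiment) %>%
  # One column per sentiment label, zero where a label is absent
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative)

doc_sentiment
```

Documents with no lexicon matches drop out of the result entirely, so a join back to the full document list may be needed when every document must be reported.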
Topic modeling
Latent Dirichlet Allocation (LDA) discovers latent topics in a collection of documents. topicmodels::LDA(dtm, k = 5) fits a 5-topic model to a document-term matrix. The k parameter requires domain knowledge or model selection (measure coherence across different values of k).
tidytext::tidy(lda_model, matrix = "beta") returns per-topic-per-word probabilities in a tidy tibble. slice_max(beta, n = 10, by = topic) extracts the top 10 words per topic. Visualize with ggplot2::geom_col() faceted by topic.
tidy(lda_model, matrix = "gamma") returns per-document-per-topic probabilities, that is, the estimated topic mixture for each document. This is useful for classifying documents by their dominant topic.
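The workflow can be sketched end to end. This example assumes the AssociatedPress document-term matrix bundled with topicmodels as stand-in data, uses only the first 100 documents to keep the fit quick, and picks k = 2 and the seed arbitrarily.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Stand-in data: a DocumentTermMatrix shipped with topicmodels
data("AssociatedPress", package = "topicmodels")
ap_small <- AssociatedPress[1:100, ]

# Fit a two-topic model; the seed makes the run reproducible
lda_model <- LDA(ap_small, k = 2, control = list(seed = 1234))

# Top five words per topic from the per-topic-per-word probabilities
top_terms <- tidy(lda_model, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 5, with_ties = FALSE) %>%
  ungroup()

# Per-document topic mixture, then the dominant topic for each document
doc_topics <- tidy(lda_model, matrix = "gamma")
dominant <- doc_topics %>%
  group_by(document) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  ungroup()
```

On a real corpus, k would be chosen by comparing several fitted models rather than fixed in advance.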
Text preprocessing decisions
The choices made during preprocessing significantly affect results. Stop word removal: the default stop_words dataset in tidytext contains 1,149 entries drawn from the Snowball, SMART, and Onix lists (a word can appear in more than one list). For domain-specific text, add custom stop words with bind_rows(stop_words, tibble(word = c("fig", "table", "et"), lexicon = "custom")).
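Stemming strips suffixes to reduce words to a common root: “running” and “runs” both become “run”, while an irregular form such as “ran” is left unchanged. SnowballC::wordStem(words, language = "english") applies the Snowball English stemmer, a refinement of the Porter algorithm. Lemmatization is more linguistically accurate (it maps to actual dictionary words, so it can relate “ran” to “run”) but slower; textstem::lemmatize_words() provides this.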
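A short sketch of stemming in a tidy pipeline follows, assuming the SnowballC package is installed. It shows that a suffix-stripping stemmer merges regular inflections but leaves an irregular form such as "ran" as it is.

```r
library(dplyr)
library(SnowballC)

# Illustrative word list
words <- tibble(word = c("running", "runs", "ran"))

stemmed <- words %>%
  mutate(stem = wordStem(word, language = "english"))

stemmed
```

Counting on the stem column instead of word then pools the regular variants into one term.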
Removing very rare words (appearing in only 1-2 documents) and very common words (appearing in >90% of documents) often improves topic model quality. Filter by document frequency before building the DTM.
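One way to sketch this filter is shown below on a tiny made-up count table, assuming one row per document-word pair. The thresholds (at least 2 documents, at most 90% of documents) and the column name doc_freq are illustrative.

```r
library(dplyr)

# Illustrative counts: one row per document-word pair
counts <- tibble(
  doc_id = c(1, 1, 2, 2, 3, 3),
  word   = c("model", "rare", "model", "topic", "model", "topic"),
  n      = 1
)

n_docs <- n_distinct(counts$doc_id)

filtered <- counts %>%
  # Number of documents each word appears in
  add_count(word, name = "doc_freq") %>%
  # Drop words that are too rare or too widespread
  filter(doc_freq >= 2, doc_freq / n_docs <= 0.9)

filtered
```

Here "rare" is dropped for appearing in a single document and "model" for appearing in all of them, leaving only "topic".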
Tidy data principles applied to text
The tidytext package applies tidy data principles to text analysis. In a tidy text representation, one row equals one token (a word, a sentence, or an n-gram), paired with its source document metadata. This one-token-per-row format works naturally with dplyr for filtering and counting, with ggplot2 for visualization, and with standard tidyverse workflows. The conversion from raw text to tidy format is the unnest_tokens step.
The tidy text format contrasts with the document-term matrix (DTM) format used by many traditional text mining packages. A DTM has one row per document and one column per term, with cell values being word frequencies. While DTMs are required for many modeling algorithms, they are awkward for exploratory analysis because dplyr verbs do not apply to them directly. tidytext converts between the two representations, using the tidy format for exploration and converting to DTM for modeling.
Term frequency and weighting
Raw word counts are dominated by common words that appear everywhere ("the", "and", "of") and carry little information about document content. TF-IDF (term frequency-inverse document frequency) weights terms by how often they appear in a document relative to how often they appear across all documents. Terms that are frequent in a specific document but rare in the corpus get high TF-IDF scores; they characterize what makes that document distinctive.
tidytext’s bind_tf_idf function takes a data frame of per-document term counts and adds tf, idf, and tf_idf columns. The resulting data frame identifies the most characteristic terms for each document. Visualizing the top TF-IDF terms per document as faceted bar charts gives an informative overview of a document collection’s content without reading every document.
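A faceted bar chart of top terms could be sketched as follows, assuming a small made-up count table. reorder_within() and scale_y_reordered() are tidytext helpers that order bars separately inside each facet.

```r
library(dplyr)
library(ggplot2)
library(tidytext)

# Illustrative per-document term counts
counts <- tibble(
  doc_id = c("A", "A", "A", "B", "B", "B"),
  word   = c("tokens", "tidy", "text", "topics", "model", "text"),
  n      = c(4, 3, 5, 6, 2, 5)
)

top_terms <- counts %>%
  bind_tf_idf(word, doc_id, n) %>%
  group_by(doc_id) %>%
  slice_max(tf_idf, n = 2, with_ties = FALSE) %>%
  ungroup()

# One panel per document, bars sorted by score within each panel
p <- ggplot(top_terms,
            aes(x = tf_idf, y = reorder_within(word, tf_idf, doc_id))) +
  geom_col() +
  scale_y_reordered() +
  facet_wrap(~ doc_id, scales = "free_y") +
  labs(x = "tf-idf", y = NULL)

p
```

The shared word "text" scores zero in both documents, so it never reaches the top terms.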
Stop words and text cleaning
Stop words are common words that contribute little meaning to text analysis. Removing them reduces the dataset size and focuses analysis on content words. The tidytext package provides English stop word lists from multiple sources. An anti-join between the token data frame and the stop word list removes the matched words. For domain-specific text, common technical terms that are uninformative in the specific context also benefit from removal.
Text cleaning before tokenization improves analysis quality. unnest_tokens() already lowercases text and strips punctuation by default, so “The” and “the” are treated as the same word; removing numbers and stripping HTML tags, if present, are the steps that usually need explicit handling. The order of operations matters: stripping HTML before tokenization prevents tag names from appearing as tokens.
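These cleaning steps might be sketched as below, assuming a made-up HTML snippet. The regular expression is a rough tag stripper for simple markup, not a full HTML parser.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Illustrative input containing markup and numbers
pages <- tibble(id = 1, text = "<p>Revenue grew <b>12%</b> in 2023.</p>")

tokens <- pages %>%
  # Replace tags with spaces first so tag names never become tokens
  mutate(text = str_replace_all(text, "<[^>]+>", " ")) %>%
  # Tokenize; lowercasing and punctuation stripping happen here
  unnest_tokens(word, text) %>%
  # Drop tokens made up only of digits and separators
  filter(!str_detect(word, "^[0-9.,]+$"))

tokens$word
```

For messy real-world pages, a dedicated HTML parser is usually a safer choice than a regular expression.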
Next steps
Continue your text mining journey with related tutorials in this series:
- Sentiment Analysis in R — Assign emotional scores to text
- Topic Modeling with LDA in R — Discover hidden topics in document collections
- Text Classification in R — Build models to categorize text automatically
See also
- dplyr::filter(): Filter rows after tokenization
- dplyr::count(): Essential for word frequency analysis