Introduction to Text Mining in R
Text mining is the process of extracting meaningful patterns and insights from text data. Whether you’re analyzing customer reviews, social media posts, or academic papers, turning unstructured text into structured data opens up powerful analytical possibilities. This tutorial introduces you to text mining in R using the tidytext package—a modern approach that integrates smoothly with the tidyverse.
What you’ll learn
This tutorial covers the key concepts and practical techniques for working with text data in R: tokenizing, removing stop words, counting word frequencies, and scoring sentiment. By the end, you will know how to apply the core tidytext functions in real data analysis workflows.
Why text mining matters
Traditional data analysis works with numbers and categories. But text data is everywhere: emails, survey responses, product reviews, news articles. Text mining lets you:
- Quantify qualitative data: Convert opinions into scores
- Discover themes: Identify topics across thousands of documents
- Find patterns: Spot trends in customer feedback
- Build features: Create variables for machine learning models
Installing required packages
You’ll need several packages for text mining in R:
# Install tidyverse (includes dplyr, tidyr, ggplot2)
install.packages("tidyverse")
# Install text mining packages
install.packages("tidytext")
install.packages("textdata")
# Install janeaustenr for practice data
install.packages("janeaustenr")
# Load the packages used below (textdata is only needed when downloading extra lexicons)
library(tidyverse)
library(tidytext)
library(janeaustenr)
The tidytext package by Julia Silge and David Robinson provides tools for converting text to and from tidy formats. The textdata package gives you access to sentiment lexicons.
The tidy text format
The core principle of tidytext is simple: one token per row. A token is a meaningful unit of text—usually a word. This structure mirrors tidy data principles you’ve already seen in dplyr.
Consider this simple example:
# Start with a text vector
sample_text <- c(
  "Text mining is fascinating",
  "R makes text analysis easy",
  "Tidy data helps everywhere"
)
# Convert to tidy format (one word per row)
tidy_words <- tibble(text = sample_text) %>%
  unnest_tokens(word, text)
print(tidy_words)
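The unnest_tokens() function does the heavy lifting: it tokenizes the text, lowercases it, strips punctuation, and returns a tidy data frame with each word on its own row. Because the input tibble here has only a text column, the output has only a word column. To track which sentence each word came from, add an index column before tokenizing, as the next section does with line.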
Tokenization: breaking text apart
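Tokenization is the first step in any text mining project. By default, unnest_tokens() splits on word boundaries, lowercases each token, drops punctuation, and keeps the other columns (here, line) so every token stays linked to its source: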
# Create a sample tibble
text_df <- tibble(
  line = 1:3,
  text = c(
    "Machine learning is powerful",
    "Text mining reveals patterns",
    "R is the tool for data science"
  )
)
# Tokenize by words
word_tokens <- text_df %>%
  unnest_tokens(word, text)
print(word_tokens)
You can also tokenize by other units:
# Tokenize by sentences
sentence_tokens <- text_df %>%
  unnest_tokens(sentence, text, token = "sentences")
print(sentence_tokens)
# Tokenize by n-grams (pairs of consecutive words)
bigram_tokens <- text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
print(bigram_tokens)
A real example: Jane Austen’s novels
The janeaustenr package contains the complete texts of Jane Austen’s novels. Let’s analyze them:
# Get the book texts
austen_books <- janeaustenr::austen_books()
# Examine the structure
head(austen_books)
Each row contains a line from one of Austen’s six completed novels. Now let’s tidy this data:
# Tidy the novels
tidy_austen <- austen_books %>%
  # Tokenize into words (the book column is carried along automatically)
  unnest_tokens(word, text) %>%
  # Remove stop words (common words like "the", "is", "at")
  # The result is left ungrouped so later counts pool all novels
  anti_join(get_stopwords(), by = "word")
print(tidy_austen)
The anti_join() with get_stopwords() removes common English words that don’t carry meaningful information.
Word frequency analysis
Once your text is tidy, analysis becomes straightforward:
# Find the most common words across all novels
word_counts <- tidy_austen %>%
  count(word, sort = TRUE)
print(word_counts)
This single pipeline counts word occurrences. Let’s visualize the top words:
# Plot top 15 words
word_counts %>%
  head(15) %>%
  ggplot(aes(n, fct_reorder(word, n))) +
  geom_col(fill = "steelblue") +
  labs(
    x = "Word Count",
    y = NULL,
    title = "Most Frequent Words in Jane Austen Novels"
  ) +
  theme_minimal()
Sentiment analysis basics
Sentiment analysis assigns emotional values to words. The tidytext package provides several sentiment lexicons:
# Load the Bing lexicon (positive/negative)
get_sentiments("bing") %>% head(10)
Let’s analyze sentiment in one of Austen’s books:
# Filter to Pride and Prejudice
pride_prejudice <- tidy_austen %>%
  filter(book == "Pride & Prejudice")
# Get sentiment for each word (words missing from the lexicon are dropped)
sentiment_pride <- pride_prejudice %>%
  inner_join(get_sentiments("bing"), by = "word")
# Count positive vs negative
sentiment_counts <- sentiment_pride %>%
  count(sentiment)
print(sentiment_counts)
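Raw positive and negative counts are often combined into a single net score per document. The sketch below shows one way to do that on two made-up review snippets (the reviews tibble and its id column are illustrative assumptions, not part of the Austen data); it counts Bing matches per review and subtracts negatives from positives.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical mini-corpus used only for illustration
reviews <- tibble(
  id = 1:2,
  text = c(
    "A wonderful, happy ending",
    "A terrible and boring plot"
  )
)

net_sentiment <- reviews %>%
  # One lowercase word per row
  unnest_tokens(word, text) %>%
  # Keep only words that appear in the Bing lexicon
  inner_join(get_sentiments("bing"), by = "word") %>%
  # Count positive and negative words per review
  count(id, sentiment) %>%
  # One column per sentiment; a review with no matches of a kind gets 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative)

print(net_sentiment)
```

A word-count difference like this ignores negation ("not happy" still counts as positive), so treat it as a rough first pass.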
Word cloud visualization
Word clouds provide an intuitive overview of term frequency:
# Install once if needed: install.packages("wordcloud")
library(wordcloud)
# Create word cloud of most common words
tidy_austen %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))
What you’ve learned
This tutorial covered the fundamentals of text mining in R:
| Concept | Description |
|---|---|
| Tidy text format | One token (typically word) per row |
| Tokenization | Breaking text into meaningful units |
| Stop words | Common words to filter out |
| Word frequency | Counting and ranking terms |
| Sentiment analysis | Scoring words as positive/negative |
The tidytext approach transforms messy text into data you can manipulate with dplyr, visualize with ggplot2, and model with R’s statistical tools.
Text preprocessing pipeline
A text preprocessing pipeline typically includes lowercasing, punctuation removal, tokenization, stop word removal, and optionally stemming or lemmatization. In tidytext, unnest_tokens(df, word, text) handles the first three steps automatically, anti_join(df, stop_words, by = "word") removes common words, and mutate(word = textstem::lemmatize_words(word)) lemmatizes each word (converts it to its base form).
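The steps above can be sketched end to end on a tiny example. The docs tibble and its doc_id column are made up for illustration, and the lemmatization step runs only if the textstem package happens to be installed.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-document corpus for illustration
docs <- tibble(
  doc_id = c(1, 2),
  text = c(
    "The cats are running, quickly!",
    "A dog ran across the gardens."
  )
)

clean_tokens <- docs %>%
  # Lowercases, strips punctuation, and splits into one word per row
  unnest_tokens(word, text) %>%
  # Drops common function words using tidytext's built-in stop_words table
  anti_join(stop_words, by = "word")

# Optional lemmatization, applied only when textstem is available
if (requireNamespace("textstem", quietly = TRUE)) {
  clean_tokens <- clean_tokens %>%
    mutate(word = textstem::lemmatize_words(word))
}

print(clean_tokens)
```

Keeping doc_id alongside each token means later steps (counts, TF-IDF, sentiment) can still be computed per document.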
Word frequencies and visualization
count(df, word, sort = TRUE) gives word frequencies. To visualize them, keep the top N rows with slice_max() and draw a bar chart with geom_col(), which is equivalent to geom_bar(stat = "identity"): freq_df %>% slice_max(n, n = 20) %>% ggplot(aes(n, fct_reorder(word, n))) + geom_col(). Word clouds from wordcloud2::wordcloud2(freq_df) are an alternative, but readers judge quantities less precisely from font size than from bar length.
TF-IDF analysis
Term frequency-inverse document frequency (TF-IDF) identifies words that are distinctive to each document. Given a table with one row per document-word pair and a count column n, bind_tf_idf(df, word, document_id, n) adds tf, idf, and tf_idf columns. High TF-IDF words appear frequently in a specific document but rarely in the rest of the corpus; a word that occurs in every document scores zero. This is useful for identifying keywords, summarizing document content, and building text classifiers.
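As a small worked example, the sketch below scores a made-up three-document corpus (the docs tibble and the doc_tf_idf name are illustrative). tidytext computes tf as a word’s count divided by the total words in its document and idf as the natural log of (number of documents / number of documents containing the word).

```r
library(dplyr)
library(tidytext)

# Hypothetical three-document corpus
docs <- tibble(
  document_id = c("a", "b", "c"),
  text = c(
    "whales swim in the ocean",
    "sharks swim in the ocean",
    "eagles fly in the sky"
  )
)

doc_tf_idf <- docs %>%
  unnest_tokens(word, text) %>%
  # bind_tf_idf() expects one row per document-word pair with a count
  count(document_id, word) %>%
  bind_tf_idf(word, document_id, n) %>%
  arrange(desc(tf_idf))

print(doc_tf_idf)
```

Words such as "in" and "the" occur in all three documents, so their idf (and therefore tf_idf) is zero, while words unique to one document, such as "whales", rise to the top.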
N-gram analysis
Bigrams (two-word sequences) capture phrases and context. unnest_tokens(df, bigram, text, token = "ngrams", n = 2) tokenizes into bigrams. To filter them, split each bigram with tidyr’s separate(bigram, c("word1", "word2"), sep = " "), then drop any pair containing a stop word with filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word). Bigram networks visualized with ggraph reveal common word associations in the corpus.
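The tokenize, split, and filter sequence described above might look like this on a toy corpus (the docs tibble is made up for illustration; the network plot is left out to keep the sketch short).

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical two-document corpus
docs <- tibble(
  doc_id = 1:2,
  text = c(
    "Text mining reveals hidden patterns in the data",
    "Machine learning and text mining work together"
  )
)

bigram_counts <- docs %>%
  # One bigram per row, e.g. "text mining"
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  # Split into two columns so each word can be tested separately
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  # Drop any bigram in which either word is a stop word
  filter(
    !word1 %in% stop_words$word,
    !word2 %in% stop_words$word
  ) %>%
  count(word1, word2, sort = TRUE)

print(bigram_counts)
```

The resulting word1/word2/n table is also the edge list a graph package would need if you go on to draw a bigram network.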
The text mining workflow
Text mining follows a standard pipeline: collect text, preprocess, extract features, apply analysis, and interpret results. Each step involves choices that affect downstream quality.
Collection: web scraping, API calls, database queries, or file reading. The source determines the format: HTML pages need tag stripping, PDFs need extraction, and social media APIs return JSON. Store raw text before any processing so you can rerun later steps without recollecting.
Preprocessing: lowercase conversion, punctuation removal, stop word removal, stemming or lemmatization. These steps reduce vocabulary size and normalize variation. The right level of preprocessing depends on the analysis: sentiment analysis benefits from keeping negations, topic modeling benefits from aggressive stop word removal, and named entity recognition requires minimal preprocessing to preserve proper nouns and punctuation.
Feature extraction: converting text to numbers. Common representations are bag of words (word counts or TF-IDF), n-grams (multi-word sequences), and embeddings (dense vectors from neural models). The choice affects what structure the downstream model can learn.
Document-term matrix operations
The document-term matrix (DTM) is the primary data structure for text analysis. Rows are documents, columns are terms, and values are frequencies or TF-IDF scores. DTMs are typically very sparse because most documents contain only a tiny fraction of the vocabulary.
tm::DocumentTermMatrix(corpus) builds a DTM from a tm corpus. as.matrix(dtm) converts it to a dense matrix; do this only for small DTMs, since a large one can exceed available memory. inspect(dtm) shows dimensions and sparsity.
tm::removeSparseTerms(dtm, sparse = 0.99) removes terms that are absent from more than 99% of documents, that is, terms appearing in roughly 1% of documents or fewer. This shrinks the vocabulary to terms common enough to compare across documents, at the cost of discarding rare terms that may be distinctive.
tidytext::cast_dtm(df, document, term, n) builds a DTM from a tidy word frequency tibble, which is more convenient when you’ve already done preprocessing in a tidyverse pipeline.
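A minimal sketch of the tidy-to-DTM route, assuming the tm package is installed (cast_dtm() needs it). The three fruit documents are made up, and the sparse = 0.5 threshold is chosen only so the effect is visible on such a small corpus.

```r
library(dplyr)
library(tidytext)
library(tm)

# Hypothetical three-document corpus
docs <- tibble(
  document = c("d1", "d2", "d3"),
  text = c("apples and oranges", "apples and pears", "plums and apples")
)

# Tidy counts: one row per document-term pair
term_counts <- docs %>%
  unnest_tokens(term, text) %>%
  count(document, term)

# Cast to a tm DocumentTermMatrix (3 documents x 5 terms)
dtm <- term_counts %>%
  cast_dtm(document, term, n)
print(dtm)

# Keep only terms present in more than half of the documents
dtm_small <- removeSparseTerms(dtm, sparse = 0.5)
print(Terms(dtm_small))
```

Here only the terms found in every document survive the filter; on a real corpus a threshold nearer 0.99 is more typical.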
Corpus management with tm
tm::VCorpus(VectorSource(texts)) creates an in-memory corpus from a character vector. tm::PCorpus() creates a permanent corpus backed by an on-disk database (it relies on the filehash package), suitable for large text collections that do not fit in memory.
Transformations apply to all documents in a corpus: tm_map(corpus, content_transformer(tolower)) lowercases. tm_map(corpus, removePunctuation) removes punctuation. tm_map(corpus, removeWords, stopwords("english")) removes English stop words. tm_map(corpus, stemDocument) applies Porter stemming and requires the SnowballC package.
Custom transformations wrap any function: tm_map(corpus, content_transformer(function(x) gsub("[0-9]+", "NUM", x))) replaces numbers with a placeholder token.
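Chained together, these transformations might look like the sketch below. The two input strings are made up, and stemming is omitted so the example does not depend on SnowballC. Order matters: the text is lowercased before numbers are replaced so the NUM placeholder keeps its capitals.

```r
library(tm)

# Hypothetical raw documents
texts <- c(
  "The 3 Quick Foxes jumped!",
  "Dogs barked at 12 foxes."
)

corpus <- VCorpus(VectorSource(texts))

# Built-in base function wrapped so it returns documents, not bare strings
corpus <- tm_map(corpus, content_transformer(tolower))
# Custom transformation: replace digit runs with a placeholder token
corpus <- tm_map(
  corpus,
  content_transformer(function(x) gsub("[0-9]+", "NUM", x))
)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Collapse the gaps left behind by removed words
corpus <- tm_map(corpus, stripWhitespace)

# Pull the cleaned text back out as a character vector
cleaned <- unname(sapply(corpus, as.character))
print(cleaned)
```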
Keyword in context (KWIC)
Keyword in context shows the words surrounding each occurrence of a search term, helping you understand how words are used in practice. quanteda::kwic(tokens(corp), pattern = "innovation", window = 5) returns a data frame with up to 5 tokens on each side of every occurrence of “innovation.” Recent quanteda releases expect a tokens object here rather than a raw corpus, hence the tokens() call.
This is valuable for exploratory analysis: before deciding how to handle a word, see how it is actually used in your corpus. A word with multiple meanings may need disambiguation; a technical term may have domain-specific collocations that inform feature engineering decisions.
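A short sketch of a KWIC lookup, assuming a recent quanteda release and two made-up documents (the names d1 and d2 and the window of 3 are arbitrary choices for illustration).

```r
library(quanteda)

# Hypothetical named documents
texts <- c(
  d1 = "Open innovation drives growth, and innovation needs funding.",
  d2 = "Policy can slow innovation in regulated markets."
)

# Tokenize first; punctuation is dropped so it does not fill the window
toks <- tokens(corpus(texts), remove_punct = TRUE)

# One row per match, with up to 3 tokens of context on each side
hits <- kwic(toks, pattern = "innovation", window = 3)
print(hits)
```

Each row records the document name, the match position, and the pre and post context, so the result can be filtered or joined like any other data frame.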
Readability and complexity metrics
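quanteda.textstats::textstat_readability(corp, measure = "Flesch") computes the Flesch Reading Ease score. Other measures include the Gunning Fog Index, Coleman-Liau, and SMOG. These estimate text complexity from sentence length and word length (counted in syllables or characters, depending on the measure).
textstat_lexdiv(x, measure = "TTR") computes lexical diversity as the type-token ratio, the ratio of unique words to total words. Here x is a quanteda tokens object or document-feature matrix (dfm), not a tm DocumentTermMatrix. High lexical diversity indicates varied vocabulary; low indicates repetitive text. TTR falls as documents get longer, so compare texts of similar length.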
These metrics are useful for quality control in automated text generation, for comparing writing styles across authors or time periods, and for matching document complexity to target audience reading levels.
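Both metrics can be tried on a pair of contrasting snippets. This sketch assumes the quanteda and quanteda.textstats packages are installed; the two texts and their names (simple, varied) are made up for illustration.

```r
library(quanteda)
library(quanteda.textstats)

# Hypothetical documents: one repetitive, one dense
texts <- c(
  simple = "The cat sat. The cat ran. The cat sat again.",
  varied = "Notwithstanding considerable meteorological uncertainty, the expedition proceeded."
)
corp <- corpus(texts)

# Flesch Reading Ease: higher scores mean easier text
readability <- textstat_readability(corp, measure = "Flesch")
print(readability)

# Type-token ratio computed from a document-feature matrix
doc_dfm <- dfm(tokens(corp, remove_punct = TRUE))
diversity <- textstat_lexdiv(doc_dfm, measure = "TTR")
print(diversity)
```

The repetitive text should score as easier to read but less lexically diverse than the dense one, which is the contrast these two measures are meant to capture.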
Next steps
Continue with the next tutorials in this series to deepen your text mining skills:
- Tidytext Basics: Master the core tidytext functions for deeper text analysis
- Sentiment Analysis in R: Explore advanced sentiment scoring techniques
- Topic Modeling with LDA in R: Discover latent topics in large text corpora
- Text Classification in R: Build machine learning models for text categorization
You’ll build on these foundations to analyze larger datasets, extract themes, and apply machine learning to text.
See also
- dplyr::filter(): Select rows by condition (used with sentiment analysis)
- dplyr::count(): Count observations by groups (essential for word frequency)
- purrr::map(): Apply functions to each element (useful for processing multiple texts)