Topic Modeling with LDA in R
Topic modeling discovers latent themes in large document collections without predefined categories. Latent Dirichlet Allocation (LDA) is the most popular approach—it treats each document as a mixture of topics, and each topic as a mixture of words. This tutorial shows you how to apply LDA to text data using R and tidytext.
Prerequisites
You should be comfortable with tidytext basics—tokenization, stop word removal, and working with multiple documents. If you need background, work through the Tidytext Basics and Sentiment Analysis tutorials first.
Understanding LDA
LDA assumes documents are generated by:
- Choosing a distribution over topics for each document
- For each word, choosing a topic from that distribution
- Choosing a word from that topic’s word distribution
The algorithm reverse-engineers this process to find the topics.
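The generative story above can be simulated in a few lines of base R. This is a toy sketch, not the topicmodels implementation: the vocabulary, the two hand-made topics, and all probabilities are invented for illustration. A Dirichlet draw is obtained by normalizing independent Gamma draws.

```r
set.seed(42)

# Toy vocabulary and two hand-made topic word distributions
vocab <- c("love", "marriage", "ball", "estate", "letter", "walk")
topics <- rbind(
  topic1 = c(0.35, 0.30, 0.15, 0.05, 0.10, 0.05),  # "romance" words
  topic2 = c(0.05, 0.05, 0.10, 0.35, 0.25, 0.20)   # "household" words
)

# 1. Draw a topic mixture for one document from a Dirichlet prior
alpha <- c(1, 1)
g <- rgamma(2, shape = alpha)
theta <- g / sum(g)  # document-topic proportions, sums to 1

# 2. For each word slot, pick a topic from theta, then
# 3. pick a word from that topic's word distribution
n_words <- 10
z <- sample(1:2, n_words, replace = TRUE, prob = theta)
doc <- vapply(z, function(k) sample(vocab, 1, prob = topics[k, ]), character(1))
print(theta)
print(doc)
```

LDA works this process backwards: given only the documents, it infers the `theta` and `topics` that most plausibly generated them.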
Key Concepts
- Document: A single text (review, article, book chapter)
- Topic: A cluster of words that tend to appear together
- Word-topic probabilities: How strongly each word belongs to each topic
- Document-topic proportions: How much each topic contributes to each document
Installing Required Packages
install.packages("topicmodels")
install.packages("tidytext")
install.packages("tidyverse")
install.packages("janeaustenr")
The topicmodels package provides the actual LDA implementation.
Preparing Text for Topic Modeling
Topic models work on word counts, not raw text:
library(tidytext)
library(tidyverse)
library(janeaustenr)
library(topicmodels)
# Get text from Jane Austen books
book_words <- austen_books() %>%
  filter(book %in% c("Pride & Prejudice", "Emma", "Sense & Sensibility")) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word) %>%
  count(book, word, sort = TRUE)
print(book_words)
Creating a Document-Term Matrix
LDA requires a document-term matrix (DTM):
# Create document-term matrix
book_dtm <- book_words %>%
  cast_dtm(document = book, term = word, value = n)
print(book_dtm)
# <<DocumentTermMatrix (documents: 3, terms: 6682)>>
The DTM rows are documents, columns are terms, and values are word counts.
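To make that structure concrete, here is a tiny hand-built DTM as a plain base-R matrix. The three two- and three-word "documents" and their counts are invented, not taken from the Austen data:

```r
# Rows = documents, columns = terms, cells = word counts
dtm_toy <- matrix(
  c(2, 1, 0,    # doc1: "love" x2, "marriage" x1
    0, 3, 1,    # doc2: "marriage" x3, "estate" x1
    1, 0, 2),   # doc3: "love" x1, "estate" x2
  nrow = 3, byrow = TRUE,
  dimnames = list(
    docs  = c("doc1", "doc2", "doc3"),
    terms = c("love", "marriage", "estate")
  )
)
print(dtm_toy)
rowSums(dtm_toy)  # total word count per document
```

A real DTM from `cast_dtm()` is stored sparsely because most counts are zero, but conceptually it is exactly this matrix.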
Fitting an LDA Model
# Fit LDA model with 2 topics
book_lda <- LDA(book_dtm, k = 2, control = list(seed = 1234))
print(book_lda)
The k parameter sets the number of topics. LDA cannot infer k for you; choose it based on domain knowledge and interpretability (see "Choosing the Right Number of Topics" below). The seed in control makes the stochastic fitting reproducible.
Extracting Topic-Word Probabilities
# Get word-topic probabilities (beta)
book_topics <- tidy(book_lda, matrix = "beta")
print(book_topics)
# # A tibble: 13,364 × 3
# topic term beta
# <int> <chr> <dbl>
# 1 1 abbess 0.00100
# 2 1 abbot 0.00125
Higher beta means the word is more strongly associated with that topic.
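Because beta is a probability distribution over the vocabulary, the betas within each topic sum to 1. A quick base-R sanity check on a made-up two-topic, three-word model (these numbers are illustrative, not model output):

```r
# Hypothetical word-topic probabilities: rows = topics, columns = words
beta <- rbind(
  topic1 = c(love = 0.6, marriage = 0.3, estate = 0.1),
  topic2 = c(love = 0.1, marriage = 0.2, estate = 0.7)
)
rowSums(beta)  # each topic's betas sum to 1

# Within a topic, the highest-beta word is its most characteristic term
apply(beta, 1, function(p) names(which.max(p)))
```

The same check applies to a fitted model: summing `beta` within each `topic` from `tidy(book_lda, matrix = "beta")` should give 1.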
Finding Top Words per Topic
top_terms <- book_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, -beta)
print(top_terms)
Extracting Document-Topic Probabilities
# Get document-topic probabilities (gamma)
doc_topics <- tidy(book_lda, matrix = "gamma")
print(doc_topics)
# # A tibble: 6 × 3
# document topic gamma
# <chr> <int> <dbl>
# 1 Pride & Prejudice 1 0.148
# 2 Pride & Prejudice 2 0.852
Higher gamma means the document is more associated with that topic.
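A common follow-up is to label each document with its dominant topic, the one with the highest gamma. A base-R sketch using the Pride & Prejudice gammas printed above, plus invented rows for the other two books:

```r
# Gamma matrix: rows = documents, columns = topics.
# The first row matches the output above; the other two are hypothetical.
gamma <- rbind(
  "Pride & Prejudice"   = c(0.148, 0.852),
  "Emma"                = c(0.703, 0.297),
  "Sense & Sensibility" = c(0.611, 0.389)
)
rowSums(gamma)  # each document's gammas sum to 1

# Dominant topic per document
dominant <- apply(gamma, 1, which.max)
print(dominant)
```

With a fitted model, the tidy equivalent is grouping `doc_topics` by `document` and taking `slice_max(gamma, n = 1)`.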
Choosing the Right Number of Topics
There is no single “correct” number of topics. Consider:
- Domain knowledge: How many themes do you expect?
- Interpretability: Can you label the topics meaningfully?
- Quantitative measures: Perplexity estimates how well the model predicts the data (lower is better)
# Try different numbers of topics
many_models <- tibble(k = c(2, 5, 10, 20)) %>%
  mutate(lda = map(k, ~ LDA(book_dtm, k = .x, control = list(seed = 1234))))

# Compare perplexity (lower is better; here computed on the training
# DTM, so it measures fit rather than true generalization)
many_models %>%
  mutate(perplexity = map_dbl(lda, ~ perplexity(.x, book_dtm)))
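Once you have a perplexity score per candidate k, picking the k with the lowest score is one mechanical heuristic. The values below are invented for illustration; in practice, interpretability should still break ties:

```r
# Hypothetical perplexity scores for candidate topic counts
scores <- c("2" = 1423.5, "5" = 1287.2, "10" = 1301.8, "20" = 1398.0)

# Lowest perplexity wins under this heuristic
best_k <- as.integer(names(which.min(scores)))
print(best_k)  # 5 in this made-up example
```

Perplexity often keeps improving as k grows, so a common compromise is to look for the "elbow" where improvement levels off rather than the absolute minimum.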
Visualizing Topic Words
library(ggplot2)
top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered() +
  labs(
    title = "Top Terms per Topic",
    x = "Beta (word-topic probability)",
    y = NULL
  ) +
  theme_minimal()
Assigning Topics to New Documents
Once fitted, assign topics to new documents:
# New document
new_doc <- c("Love and marriage are the main themes in this novel.")
# Tokenize and prepare
new_tokens <- tibble(doc = 1, text = new_doc) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  # Keep only terms the model was trained on
  filter(word %in% colnames(book_dtm)) %>%
  count(doc, word)
# Create DTM (posterior() expects terms from the training vocabulary)
new_dtm <- new_tokens %>%
  cast_dtm(document = doc, term = word, value = n)
# Get topic probabilities for the new document
new_doc_topics <- posterior(book_lda, new_dtm)$topics
print(new_doc_topics)
Practical Example: News Articles
Apply LDA to a larger corpus:
# Simulate with available data - expand to more documents
# In practice, load your own corpus
# First, get more text data
more_books <- austen_books() %>%
  filter(book != "Pride & Prejudice") %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word) %>%
  count(book, word, sort = TRUE)
# Create DTM
more_dtm <- more_books %>%
  cast_dtm(document = book, term = word, value = n)
# Fit with more topics
more_lda <- LDA(more_dtm, k = 3, control = list(seed = 1234))
# See what emerges
tidy(more_lda, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 8) %>%
  ungroup() %>%
  arrange(topic, -beta)
What You Have Learned
| Concept | Description |
|---|---|
| Document-Term Matrix | Word counts organized by documents |
| Topics (k) | Number of latent themes to discover |
| Beta | Probability of word given topic |
| Gamma | Probability of topic given document |
Key Takeaways
- LDA discovers topics without labeled training data
- Choose k based on domain knowledge and interpretability
- Higher beta = stronger word-topic association
- Higher gamma = stronger document-topic association
- Use tidy() to convert model output to tidy format
See Also
- dplyr::count — Count words per document
- dplyr::filter — Filter tokens and documents
Next Steps
Continue your text mining journey:
- Text Classification in R — Build supervised models to categorize text
- Sentiment Analysis in R — Assign emotional scores to text
- Tidytext Basics — Foundation for all text mining in R