Topic Modeling with LDA in R
Topic modeling discovers latent themes in large document collections without predefined categories. Latent Dirichlet Allocation (LDA) is one of the most widely used approaches: it treats each document as a mixture of topics, and each topic as a probability distribution over words. This tutorial shows you how to apply LDA to text data using R and tidytext.
Prerequisites
You should be comfortable with tidytext basics—tokenization, stop word removal, and working with multiple documents. If you need background, work through the Tidytext Basics and Sentiment Analysis tutorials first.
Understanding LDA
LDA assumes documents are generated by:
- Choosing a distribution over topics for each document
- For each word, choosing a topic from that distribution
- Choosing a word from that topic’s word distribution
The algorithm reverse-engineers this process to find the topics.
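The three steps above can be simulated directly. The following is a minimal base-R sketch, not part of any package: the vocabulary, the two topics, their probabilities, and the helper names `rdirichlet_one` and `generate_document` are all made up for illustration.

```r
# Toy simulation of LDA's generative story (all numbers are made up)
set.seed(42)

vocab <- c("love", "marriage", "ball", "estate", "income", "fortune")

# Two hypothetical topics: each row is a probability distribution over vocab
topic_word <- rbind(
  romance = c(0.40, 0.35, 0.20, 0.02, 0.02, 0.01),
  money   = c(0.02, 0.08, 0.02, 0.28, 0.30, 0.30)
)
colnames(topic_word) <- vocab

# Draw from a Dirichlet by normalizing independent gamma draws
rdirichlet_one <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}

generate_document <- function(n_words, alpha = c(1, 1)) {
  # Step 1: choose this document's distribution over topics
  doc_topics <- rdirichlet_one(alpha)
  # Step 2: for each word position, choose a topic from that distribution
  z <- sample(nrow(topic_word), n_words, replace = TRUE, prob = doc_topics)
  # Step 3: draw each word from its topic's word distribution
  words <- vapply(z, function(k) sample(vocab, 1, prob = topic_word[k, ]),
                  character(1))
  list(topic_mix = setNames(doc_topics, rownames(topic_word)), words = words)
}

doc <- generate_document(12)
round(doc$topic_mix, 2)
paste(doc$words, collapse = " ")
```

Fitting LDA runs this story backwards: it sees only the words and has to estimate both `topic_word` and each document's `topic_mix`.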
Key concepts
- Document: A single text (review, article, book chapter)
- Topic: A cluster of words that tend to appear together
- Word-topic probabilities: How strongly each word belongs to each topic
- Document-topic proportions: How much each topic contributes to each document
Installing required packages
install.packages("topicmodels")
install.packages("tidytext")
install.packages("tidyverse")
install.packages("janeaustenr")
The topicmodels package provides the actual LDA implementation.
Preparing text for topic modeling
Topic models work on word counts, not raw text:
library(tidytext)
library(tidyverse)
library(janeaustenr)
library(topicmodels)
# Get text from Jane Austen books
book_words <- austen_books() %>%
  filter(book %in% c("Pride & Prejudice", "Emma", "Sense & Sensibility")) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word) %>%
  count(book, word, sort = TRUE)
print(book_words)
Creating a document-term matrix
LDA requires a document-term matrix (DTM):
# Create document-term matrix
book_dtm <- book_words %>%
  cast_dtm(document = book, term = word, value = n)
print(book_dtm)
# <<DocumentTermMatrix (documents: 3, terms: 6682)>>
The DTM rows are documents, columns are terms, and values are word counts.
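To make that layout concrete without any packages, here is a small base-R sketch that pivots made-up tidy counts into a dense document-by-term table. The data and the name `toy_counts` are hypothetical; a real `DocumentTermMatrix` stores the same information in sparse form.

```r
# Made-up tidy counts: one row per (document, term) pair
toy_counts <- data.frame(
  document = c("doc_a", "doc_a", "doc_b", "doc_b", "doc_c"),
  term     = c("love", "estate", "love", "ball", "estate"),
  n        = c(3, 1, 2, 4, 5)
)

# Pivot to a dense document-by-term table; missing pairs become 0
toy_matrix <- xtabs(n ~ document + term, data = toy_counts)
print(toy_matrix)

# Share of cells that are zero; real DTMs are far sparser,
# which is why they are stored in a sparse format
mean(toy_matrix == 0)
```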
Fitting an LDA model
# Fit LDA model with 2 topics
book_lda <- LDA(book_dtm, k = 2, control = list(seed = 1234))
print(book_lda)
The k parameter specifies the number of topics—you decide this based on your domain knowledge.
Extracting topic-word probabilities
# Get word-topic probabilities (beta)
book_topics <- tidy(book_lda, matrix = "beta")
print(book_topics)
# # A tibble: 13,364 × 3
# topic term beta
# <int> <chr> <dbl>
# 1 1 abbess 0.00100
# 2 1 abbot 0.00125
Higher beta means the word is more strongly associated with that topic.
Finding top words per topic
top_terms <- book_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, -beta)
print(top_terms)
Extracting document-topic probabilities
# Get document-topic probabilities (gamma)
doc_topics <- tidy(book_lda, matrix = "gamma")
print(doc_topics)
# # A tibble: 6 × 3
# document topic gamma
# <chr> <int> <dbl>
# 1 Pride & Prejudice 1 0.148
# 2 Pride & Prejudice 2 0.852
Higher gamma means the document is more associated with that topic.
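Both beta and gamma are probability distributions, which is easy to check on a fitted model. The sketch below is a quick sanity check rather than part of the tutorial's pipeline: it assumes the `AssociatedPress` document-term matrix bundled with topicmodels, and the 50-document slice and `k = 3` are arbitrary choices to keep the fit fast.

```r
library(topicmodels)

# AssociatedPress is a DocumentTermMatrix bundled with topicmodels
data("AssociatedPress", package = "topicmodels")
ap_small <- AssociatedPress[1:50, ]  # small slice so the fit is quick

fit <- LDA(ap_small, k = 3, control = list(seed = 1234))
post <- posterior(fit)

# Beta: one row per topic, one column per term; each row sums to 1
dim(post$terms)
range(rowSums(post$terms))

# Gamma: one row per document, one column per topic; each row sums to 1
dim(post$topics)
range(rowSums(post$topics))
```

So a beta value is only comparable to other betas within the same topic, and a gamma value to other gammas within the same document.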
Choosing the right number of topics
There is no single “correct” number of topics. Consider:
- Domain knowledge: How many themes do you expect?
- Interpretability: Can you label the topics meaningfully?
- Quantitative measures: Perplexity measures how well the model fits
# Try different numbers of topics
many_models <- data.frame(
  k = c(2, 5, 10, 20)
) %>%
  mutate(
    lda = map(k, ~ LDA(book_dtm, k = .x, control = list(seed = 1234)))
  )
# Compare perplexity (lower is better)
many_models %>%
  mutate(
    perplexity = map_dbl(lda, ~ perplexity(.x, book_dtm))
  )
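To see what the perplexity number means, it helps to compute one by hand. The base-R sketch below uses made-up beta, gamma, and count matrices and the textbook definition (the exponential of the negative average log-likelihood per token). It is an illustration of the concept; `topicmodels::perplexity()` may arrive at its value through its own likelihood approximation.

```r
# Toy model with 2 topics and 3 terms (all numbers are made up)
beta <- rbind(c(0.7, 0.2, 0.1),    # P(term | topic 1)
              c(0.1, 0.3, 0.6))    # P(term | topic 2)
gamma <- rbind(c(0.9, 0.1),        # P(topic | doc 1)
               c(0.2, 0.8))        # P(topic | doc 2)
counts <- rbind(c(5, 2, 0),        # word counts for doc 1
                c(1, 1, 6))        # word counts for doc 2

# P(term | doc) is the topic mixture applied to the topic-word distributions
p_term_given_doc <- gamma %*% beta

# Perplexity = exp(-total log-likelihood / total number of tokens)
log_lik <- sum(counts * log(p_term_given_doc))
toy_perplexity <- exp(-log_lik / sum(counts))
toy_perplexity
```

A perplexity of 1 would mean the model predicts every token with certainty; larger values mean the model is, on average, more "surprised" by each word.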
Visualizing topic words
library(ggplot2)
top_terms %>%
  mutate(
    term = reorder_within(term, beta, topic)
  ) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered() +
  labs(
    title = "Top Terms per Topic",
    x = "Beta (word-topic probability)",
    y = NULL
  ) +
  theme_minimal()
Assigning topics to new documents
Once fitted, assign topics to new documents:
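# New document
new_doc <- c("Love and marriage are the main themes in this novel.")
# Tokenize, drop stop words, and keep only words the model was trained on
vocab <- colnames(book_dtm)
new_tokens <- tibble(doc = 1, text = new_doc) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  filter(word %in% vocab) %>%
  count(word)
# Create a one-row DTM whose columns match the training vocabulary, in the same order
new_mat <- matrix(0, nrow = 1, ncol = length(vocab), dimnames = list("new_doc", vocab))
new_mat[1, new_tokens$word] <- new_tokens$n
new_dtm <- tm::as.DocumentTermMatrix(new_mat, weighting = tm::weightTf)
# Get topic probabilities
new_doc_topics <- posterior(book_lda, new_dtm)$topics
print(new_doc_topics)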
Practical example: news articles
Apply LDA to a larger corpus. No news corpus ships with these packages, so the code uses five of the Austen novels as a stand-in; in practice, load your own collection of articles.
# Get more text data: every Austen novel except Pride & Prejudice
more_books <- austen_books() %>%
  filter(book != "Pride & Prejudice") %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word) %>%
  count(book, word, sort = TRUE)
# Create DTM
more_dtm <- more_books %>%
  cast_dtm(document = book, term = word, value = n)
# Fit with more topics
more_lda <- LDA(more_dtm, k = 3, control = list(seed = 1234))
# See what emerges
tidy(more_lda, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 8) %>%
  ungroup() %>%
  arrange(topic, -beta)
What you have learned
| Concept | Description |
|---|---|
| Document-Term Matrix | Word counts organized by documents |
| Topics (k) | Number of latent themes to discover |
| Beta | Probability of word given topic |
| Gamma | Probability of topic given document |
LDA theory
Latent Dirichlet Allocation (LDA) is a generative statistical model for text. It assumes each document is a mixture of topics, and each topic is a mixture of words. Training LDA finds topic-word distributions (which words characterize each topic) and document-topic distributions (how much each topic contributes to each document) that best explain the observed text.
Preparing data
LDA requires a document-term matrix (DTM). Using tidytext: tokenize the text, remove stop words, count word frequencies per document, then cast_dtm(df, document_id, word, count). Remove very rare terms (words appearing in fewer than 2 documents) and very common terms to reduce noise and improve interpretability.
Fitting and inspecting the model
topicmodels::LDA(dtm, k = 5) fits LDA with 5 topics. tidy(lda_model, matrix = "beta") returns word-topic probabilities in tidy format; the top words per topic (highest beta values) characterize the topic. tidy(lda_model, matrix = "gamma") returns document-topic probabilities, which show which topics dominate each document.
Choosing k
There is no definitive method for choosing the number of topics k. Practical approaches: perplexity on held-out documents typically decreases as k grows and then levels off; coherence scores measure how semantically related the top words of each topic are; visual inspection of top words assesses interpretability. ldatuning::FindTopicsNumber() computes multiple metrics across a range of k values.
Interpreting topics
Topic interpretation requires human judgment. List the top 10-15 words per topic and assign a descriptive label. Topics with incoherent or overlapping word sets may indicate K is too small or too large. The most useful topics are those that clearly correspond to a recognizable theme in the corpus. Visualize topic prevalence over time or across document subgroups to identify substantive patterns.
Latent Dirichlet allocation theory
LDA assumes documents are mixtures of topics, and topics are distributions over words. The generative model: to produce a document, choose a mixture of topics (e.g., 70% politics, 30% economics), then for each word, pick a topic from the mixture and draw a word from that topic’s word distribution.
The inference problem is to estimate the topic-word distributions and document-topic distributions given only the observed documents. It is solved approximately with variational EM (the default method in the topicmodels package) or Gibbs sampling (available in topicmodels via method = "Gibbs", and in the lda package). Both approximate the posterior distribution.
The key assumption is the “bag of words” model: word order does not matter. This simplification makes LDA tractable but means it misses syntax and phrase structure. For most topic discovery tasks, this tradeoff is acceptable.
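A two-line experiment shows what the bag-of-words assumption throws away. This base-R sketch uses two made-up sentences and a hypothetical helper, `bag_of_words`, that only counts words.

```r
# Two sentences with the same words in a different order
s1 <- "the dog bit the man"
s2 <- "the man bit the dog"

bag_of_words <- function(sentence) {
  words <- strsplit(tolower(sentence), "\\s+")[[1]]
  table(words)
}

# The counts are identical, so LDA sees these as the same document
identical(bag_of_words(s1), bag_of_words(s2))
```

The two sentences mean opposite things, yet their word counts match exactly, which is all a document-term matrix records.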
Fitting and evaluating LDA
library(topicmodels)
# dtm is a DocumentTermMatrix
lda_model <- LDA(dtm, k = 10, control = list(seed = 42))
Choosing k (number of topics) requires both statistical evaluation and human judgment. Perplexity measures how well the model predicts held-out documents (lower is better), but it tends to keep decreasing as k grows, so on its own it rarely singles out one value. Topic coherence (do the top words of each topic co-occur in real documents?) correlates better with human interpretability.
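Held-out perplexity needs a train/test split, which the earlier three-book example is too small to support. The sketch below assumes the `AssociatedPress` matrix bundled with topicmodels; the 100-document slice, the 80/20 split, and the candidate values of k are arbitrary choices made to keep the run short, not recommendations.

```r
library(topicmodels)

data("AssociatedPress", package = "topicmodels")
docs <- AssociatedPress[1:100, ]  # small slice so the fits are quick

# Hold out 20 documents for evaluation
set.seed(1234)
test_idx  <- sample(nrow(docs), 20)
train_idx <- setdiff(seq_len(nrow(docs)), test_idx)
train <- docs[train_idx, ]
test  <- docs[test_idx, ]

# Fit on the training documents, score on the held-out ones
ks <- c(2, 5, 10)
held_out <- sapply(ks, function(k) {
  fit <- LDA(train, k = k, control = list(seed = 1234))
  perplexity(fit, newdata = test)
})

data.frame(k = ks, perplexity = held_out)
```

With a corpus this small the numbers are noisy; on real data, a single split can be replaced by cross-validation before reading much into the curve.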
ldatuning::FindTopicsNumber(dtm, topics = c(5, 10, 15, 20, 25)) computes multiple metrics including Griffiths2004 (maximize), CaoJuan2009 (minimize), Arun2010 (minimize), and Deveaud2014 (maximize). Plot with FindTopicsNumber_plot() to find the elbow.
After fitting, inspect topics: tidytext::tidy(lda_model, matrix = "beta") gives per-topic-per-word probabilities, and slice_max(beta, n = 10) per topic extracts the top words. Label each topic with a descriptive name based on these words; this labeling requires human judgment.
Document-topic assignments
tidytext::tidy(lda_model, matrix = "gamma") gives per-document-per-topic probabilities. Each document gets a probability for each topic; these sum to 1 per document. slice_max(gamma, n = 1) per document assigns each document to its dominant topic.
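The dominant-topic step can be sketched with dplyr on a hand-written table. The `doc_gamma` values below are hypothetical; the table only mimics the three-column shape that the gamma output has.

```r
library(dplyr)

# Hypothetical gamma table: one row per (document, topic) pair
doc_gamma <- tibble(
  document = rep(c("doc_a", "doc_b", "doc_c"), each = 2),
  topic    = rep(1:2, times = 3),
  gamma    = c(0.85, 0.15, 0.40, 0.60, 0.52, 0.48)
)

# Keep the single highest-gamma topic for each document
dominant <- doc_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  ungroup()

dominant
```

Note that `doc_c` is assigned to topic 1 on a 0.52 to 0.48 margin, so the hard label hides an almost even split.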
Soft assignments (probabilities) are more informative than hard assignments for documents that span multiple topics. A news article about economic policy has meaningful probability mass on both “economics” and “politics” topics. Preserving this ambiguity is valuable for downstream analysis.
augment(lda_model, data = dtm) assigns each word to the topic it most likely came from (given the document’s topic mixture). This word-level assignment helps validate topic coherence and identify misclassified documents.
Preprocessing for better topics
Topic quality is highly sensitive to preprocessing. Over-preprocessing removes words that could distinguish topics; under-preprocessing creates topics dominated by stop words.
Domain-specific stop words matter more than generic ones. For scientific literature, remove generic academic words (“results”, “analysis”, “study”, “method”) that appear in every paper regardless of topic. For news articles, remove publication names and datelines. Add these to the stop word list before building the DTM.
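Filtering with a custom list is a one-line operation once the tokens are in a data frame. This base-R sketch uses made-up tokens and a hypothetical `domain_stop_words` vector; with tidytext you would typically do the same thing with `anti_join()`.

```r
# Toy tokens from a hypothetical scientific corpus
tokens <- data.frame(
  document = c("p1", "p1", "p1", "p2", "p2", "p2"),
  word     = c("results", "protein", "folding", "study", "galaxy", "redshift")
)

# Hypothetical domain stop list; extend it after inspecting your own top terms
domain_stop_words <- c("results", "analysis", "study", "method")

# Drop every token that appears in the stop list
filtered <- tokens[!tokens$word %in% domain_stop_words, ]
filtered
```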
Minimum document frequency threshold: remove terms appearing in fewer than 5 documents. Maximum threshold: remove terms appearing in more than 95% of documents. These cuts remove rare noise and universal stop words simultaneously.
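The two thresholds can be applied to tidy counts before casting to a DTM. The following base-R sketch uses a made-up ten-document corpus; the term names and counts are hypothetical, while the cutoffs (at least 5 documents, at most 95% of documents) are the ones given above.

```r
# Toy corpus: "common" is in all 10 documents, "useful" in 6, "rare" in 1
docs <- paste0("d", 1:10)
counts <- rbind(
  data.frame(document = docs,      term = "common", n = 3),
  data.frame(document = docs[1:6], term = "useful", n = 2),
  data.frame(document = docs[1],   term = "rare",   n = 1)
)

n_docs <- length(unique(counts$document))

# Document frequency: number of distinct documents containing each term
doc_freq <- tapply(counts$document, counts$term,
                   function(d) length(unique(d)))

# Keep terms in at least 5 documents and at most 95% of documents
keep <- names(doc_freq)[doc_freq >= 5 & doc_freq / n_docs <= 0.95]
filtered <- counts[counts$term %in% keep, ]
unique(filtered$term)
```

On a small corpus these cutoffs can remove almost everything, so it is worth checking how many terms survive before fitting.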
Bigrams capture multi-word concepts: “machine learning”, “climate change”, “stock market”. Include bigrams as single tokens in the vocabulary. tidytext::unnest_ngrams() or quanteda::tokens_ngrams() generate them. Mixing unigrams and bigrams sometimes improves topic interpretability.
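What "bigrams as single tokens" looks like can be shown without any tokenizer package. This base-R sketch builds them by hand from one made-up sentence; joining with an underscore is an assumption chosen for illustration, and package tokenizers may format bigrams differently.

```r
text <- "climate change affects the stock market"
words <- strsplit(tolower(text), "\\s+")[[1]]

# Adjacent word pairs joined with "_" so each bigram is a single token
bigrams <- paste(head(words, -1), tail(words, -1), sep = "_")

# Mixed vocabulary: unigrams plus bigrams
tokens <- c(words, bigrams)
tokens
```

Most of the generated pairs (such as "affects_the") are noise, which is one reason the document-frequency cutoffs matter even more once bigrams are added.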
Structural topic models
The stm package extends LDA with document-level metadata. Covariates (document date, author affiliation, source) can influence topic prevalence (how much of a document is about a topic) and topic content (which words are used within a topic).
stm::stm(documents, vocab, K = 20, prevalence = ~ year + source, data = meta) fits a model where topic prevalence varies with year and source; here meta stands for a data frame of document metadata containing those two columns. stm::estimateEffect() tests whether covariates significantly predict topic prevalence. This makes STM valuable for analyzing how topics change over time or vary across groups.
Key takeaways
- LDA discovers topics without labeled training data
- Choose k based on domain knowledge and interpretability
- Higher beta = stronger word-topic association
- Higher gamma = stronger document-topic association
- Use tidy() to convert model output to tidy format
Next steps
Continue your text mining journey:
- Text Classification in R: Build supervised models to categorize text
- Sentiment Analysis in R: Assign emotional scores to text
- Tidytext Basics: Foundation for all text mining in R
See also
- dplyr::count(): Count words per document
- dplyr::filter(): Filter tokens and documents