Topic Modeling with LDA in R

March 16, 2026 · 9 min read · Updated May 29, 2026 · intermediate

text-mining · topic-modeling · lda · nlp · r · tidytext

Topic modeling discovers latent themes in large document collections without predefined categories. Latent Dirichlet Allocation (LDA) is the most popular approach—it treats each document as a mixture of topics, and each topic as a mixture of words. This tutorial shows you how to apply LDA to text data using R and tidytext.

Prerequisites

You should be comfortable with tidytext basics—tokenization, stop word removal, and working with multiple documents. If you need background, work through the Tidytext Basics and Sentiment Analysis tutorials first.

Understanding LDA

LDA assumes documents are generated by:

Choosing a distribution over topics for each document
For each word, choosing a topic from that distribution
Choosing a word from that topic’s word distribution

The algorithm reverse-engineers this process to find the topics.
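To make the three steps concrete, here is a minimal base R sketch that generates one synthetic document. The vocabulary, the two topics ("sports" and "politics"), their word probabilities, and the helper names rdirichlet_one() and generate_document() are all made up for illustration; they are not part of any package.

```r
set.seed(42)

# Draw one sample from a Dirichlet distribution by normalising gamma draws
rdirichlet_one <- function(alpha) {
  draws <- rgamma(length(alpha), shape = alpha, rate = 1)
  draws / sum(draws)
}

# Hypothetical vocabulary and topics; each row of topic_word sums to 1
vocab <- c("ball", "goal", "team", "vote", "law", "party")
topic_word <- rbind(
  sports   = c(0.40, 0.30, 0.25, 0.02, 0.01, 0.02),
  politics = c(0.02, 0.01, 0.07, 0.35, 0.25, 0.30)
)
colnames(topic_word) <- vocab

generate_document <- function(n_words, alpha = c(0.5, 0.5)) {
  # Step 1: choose this document's distribution over topics
  theta <- rdirichlet_one(alpha)
  # Step 2: for each word position, choose a topic from that distribution
  z <- sample(seq_len(nrow(topic_word)), n_words, replace = TRUE, prob = theta)
  # Step 3: choose each word from the word distribution of its topic
  words <- vapply(
    z,
    function(t) sample(vocab, 1, prob = topic_word[t, ]),
    character(1)
  )
  list(theta = theta, words = words)
}

doc <- generate_document(20)
round(doc$theta, 2)  # topic proportions for this document
doc$words            # the generated "text"
```

Fitting LDA runs this in the opposite direction: it sees only the words and has to estimate both theta and topic_word.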

Key concepts

Document: A single text (review, article, book chapter)
Topic: A cluster of words that tend to appear together
Word-topic probabilities: How strongly each word belongs to each topic
Document-topic proportions: How much each topic contributes to each document

Installing required packages

install.packages("topicmodels")
install.packages("tidytext")
install.packages("tidyverse")
install.packages("janeaustenr")

The topicmodels package provides the actual LDA implementation.

Preparing text for topic modeling

Topic models work on word counts, not raw text:

library(tidytext)
library(tidyverse)
library(janeaustenr)
library(topicmodels)

# Get text from Jane Austen books
book_words <- austen_books() %>%
  filter(book %in% c("Pride & Prejudice", "Emma", "Sense & Sensibility")) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word) %>%
  count(book, word, sort = TRUE)

print(book_words)

Creating a document-term matrix

LDA requires a document-term matrix (DTM):

# Create document-term matrix
book_dtm <- book_words %>%
  cast_dtm(document = book, term = word, value = n)

print(book_dtm)
# <<DocumentTermMatrix (documents: 3, terms: 6682)>>

The DTM rows are documents, columns are terms, and values are word counts.
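A DTM this large is hard to inspect, so a toy example may help. The sketch below builds a two-document DTM from hand-written counts (the document names, words, and counts are invented) and converts it to a dense matrix to show the layout.

```r
library(dplyr)
library(tidytext)

# Hypothetical per-document word counts in tidy form
toy_counts <- tibble(
  document = c("doc_a", "doc_a", "doc_b", "doc_b"),
  word     = c("love", "house", "love", "money"),
  n        = c(2, 1, 3, 4)
)

# One row per document, one column per term
toy_dtm <- cast_dtm(toy_counts, document, word, n)

# Dense view; only sensible for small matrices
toy_matrix <- as.matrix(toy_dtm)
toy_matrix
```

Document-word pairs that never occur, such as "money" in doc_a, show up as zeros in the dense view. The sparse DTM format simply does not store them.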

Fitting an LDA model

# Fit LDA model with 2 topics
book_lda <- LDA(book_dtm, k = 2, control = list(seed = 1234))

print(book_lda)

The k argument sets the number of topics. LDA does not infer it from the data, so you choose it up front from domain knowledge and then check the result for interpretability.

Extracting topic-word probabilities

# Get word-topic probabilities (beta)
book_topics <- tidy(book_lda, matrix = "beta")

print(book_topics)
# # A tibble: 13,364 × 3
#   topic term         beta
#   <int> <chr>       <dbl>
# 1     1 abbess    0.00100
# 2     1 abbot     0.00125

Higher beta means the word is more strongly associated with that topic.

Finding top words per topic

top_terms <- book_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, -beta)

print(top_terms)
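With only two topics, top-word lists often overlap. One common complement is to look at the words whose beta differs most between the topics, using a log ratio. The sketch below uses a small hand-written tibble shaped like the output of tidy(model, matrix = "beta"); the terms and values are hypothetical, not results from the Austen model.

```r
library(dplyr)
library(tidyr)

# Hypothetical beta values in the same shape as tidy(model, matrix = "beta")
toy_beta <- tibble(
  topic = c(1, 2, 1, 2, 1, 2),
  term  = c("marriage", "marriage", "letter", "letter", "walk", "walk"),
  beta  = c(0.020, 0.005, 0.004, 0.016, 0.010, 0.010)
)

beta_spread <- toy_beta %>%
  mutate(topic = paste0("topic", topic)) %>%
  pivot_wider(names_from = topic, values_from = beta) %>%
  # Positive values lean towards topic 2, negative towards topic 1
  mutate(log_ratio = log2(topic2 / topic1)) %>%
  arrange(log_ratio)

beta_spread
```

Words with a log ratio near zero are shared by both topics. On a real model it usually helps to first filter out terms whose beta is tiny in both topics, since their ratios are noisy.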

Extracting document-topic probabilities

# Get document-topic probabilities (gamma)
doc_topics <- tidy(book_lda, matrix = "gamma")

print(doc_topics)
# # A tibble: 6 × 3
#   document                 topic     gamma
#   <chr>                    <int>     <dbl>
# 1 Pride & Prejudice            1   0.148
# 2 Pride & Prejudice            2   0.852

Higher gamma means the document is more associated with that topic.

Choosing the right number of topics

There is no single “correct” number of topics. Consider:

Domain knowledge: How many themes do you expect?
Interpretability: Can you label the topics meaningfully?
Quantitative measures: Perplexity measures how well the model predicts the observed words (lower is better)

# Try different numbers of topics
many_models <- data.frame(
  k = c(2, 5, 10, 20)
) %>%
  mutate(
    lda = map(k, ~ LDA(book_dtm, k = .x, control = list(seed = 1234)))
  )

# Compare perplexity (lower is better)
# Note: this scores the training DTM itself, which tends to favor larger k
many_models %>%
  mutate(
    perplexity = map_dbl(lda, ~ perplexity(.x, book_dtm))
  )

Visualizing topic words

library(ggplot2)

top_terms %>%
  mutate(
    term = reorder_within(term, beta, topic)
  ) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered() +
  labs(
    title = "Top Terms per Topic",
    x = "Beta (word-topic probability)",
    y = NULL
  ) +
  theme_minimal()

Assigning topics to new documents

Once fitted, assign topics to new documents:

# New document
new_doc <- c("Love and marriage are the main themes in this novel.")

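# Tokenize and prepare
new_tokens <- tibble(
  doc = 1,
  text = new_doc
) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  # Keep only words that are in the training vocabulary
  filter(word %in% colnames(book_dtm)) %>%
  # Count per document; count(word) alone would drop the doc column
  count(doc, word)

# Create DTM (restricted to the training terms by the filter above)
new_dtm <- new_tokens %>%
  cast_dtm(document = doc, term = word, value = n)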

# Get topic probabilities
new_doc_topics <- posterior(book_lda, new_dtm)$topics

print(new_doc_topics)

Practical example: a larger corpus

Apply LDA to a larger corpus:

# No news corpus ships with these packages, so five Austen novels
# stand in for one here. In practice, load your own corpus.

# First, get more text data
more_books <- austen_books() %>%
  filter(book != "Pride & Prejudice") %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word) %>%
  count(book, word, sort = TRUE)

# Create DTM
more_dtm <- more_books %>%
  cast_dtm(document = book, term = word, value = n)

# Fit with more topics
more_lda <- LDA(more_dtm, k = 3, control = list(seed = 1234))

# See what emerges
tidy(more_lda, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 8) %>%
  ungroup() %>%
  arrange(topic, -beta)

What you have learned

Concept	Description
Document-Term Matrix	Word counts organized by documents
Topics (k)	Number of latent themes to discover
Beta	Probability of word given topic
Gamma	Probability of topic given document

LDA theory

Latent Dirichlet Allocation (LDA) is a generative statistical model for text. It assumes each document is a mixture of topics, and each topic is a mixture of words. Training LDA finds topic-word distributions (which words characterize each topic) and document-topic distributions (how much each topic contributes to each document) that best explain the observed text.

Preparing data

LDA requires a document-term matrix (DTM). Using tidytext: tokenize the text, remove stop words, count word frequencies per document, then cast_dtm(df, document_id, word, count). Remove very rare terms (words appearing in fewer than 2 documents) and very common terms to reduce noise and improve interpretability.

Fitting and inspecting the model

topicmodels::LDA(dtm, k = 5) fits LDA with 5 topics. tidy(lda_model, matrix = "beta") returns word-topic probabilities in tidy format. The top words per topic (highest beta values) characterize the topic. tidy(lda_model, matrix = "gamma") returns document-topic probabilities, which show the topics that dominate each document.

Choosing k

There is no definitive method for choosing the number of topics k. Practical approaches: perplexity on held-out documents typically falls as k grows and then levels off; coherence scores measure how semantically related the top words of each topic are; visual inspection of top words assesses interpretability. ldatuning::FindTopicsNumber() can compute several metrics across a range of k values when you list them in its metrics argument.

Interpreting topics

Topic interpretation requires human judgment. List the top 10-15 words per topic and assign a descriptive label. Topics with incoherent or overlapping word sets may indicate k is too small or too large. The most useful topics are those that clearly correspond to a recognizable theme in the corpus. Visualize topic prevalence over time or across document subgroups to identify substantive patterns.

Latent Dirichlet allocation theory

LDA assumes documents are mixtures of topics, and topics are distributions over words. The generative model: to produce a document, choose a mixture of topics (e.g., 70% politics, 30% economics), then for each word, pick a topic from the mixture and draw a word from that topic’s word distribution.

The inference problem is the reverse: given the observed documents, estimate the topic-word distributions and the document-topic distributions. The topicmodels package solves it with variational EM by default (method = "VEM") or with Gibbs sampling (method = "Gibbs"); the lda package also uses Gibbs sampling. Both methods approximate the posterior distribution.
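As a rough sketch of how the two inference methods are selected in topicmodels, the code below fits the same small DTM both ways. It uses the first 100 documents of the AssociatedPress data set that ships with topicmodels; the subset size, k = 3, the seed, and the Gibbs burnin and iter values are arbitrary choices for a quick run, not recommendations.

```r
library(topicmodels)

# Small slice of the sample corpus bundled with topicmodels
data("AssociatedPress", package = "topicmodels")
ap_small <- AssociatedPress[1:100, ]

# Variational EM (the default method)
lda_vem <- LDA(ap_small, k = 3, method = "VEM",
               control = list(seed = 1234))

# Gibbs sampling; burnin and iter are kept small here for speed
lda_gibbs <- LDA(ap_small, k = 3, method = "Gibbs",
                 control = list(seed = 1234, burnin = 100, iter = 200))

# Top five terms per topic from each fit
terms(lda_vem, 5)
terms(lda_gibbs, 5)
```

The two fits will not agree exactly, and topic numbering is arbitrary in both, so compare topics by their top words, not by their index.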

The key assumption is the “bag of words” model: word order does not matter. This simplification makes LDA tractable but means it misses syntax and phrase structure. For most topic discovery tasks, this tradeoff is acceptable.
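A quick base R illustration of what the bag-of-words assumption discards: two sentences with opposite meanings reduce to the same word counts. The helper bag_of_words() is a hypothetical name written for this sketch, and its tokenizer is deliberately crude.

```r
# Lowercase, split on anything that is not a letter, then count tokens
bag_of_words <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens <- tokens[tokens != ""]
  table(tokens)
}

a <- bag_of_words("The dog bit the man")
b <- bag_of_words("The man bit the dog")

a
identical(names(a), names(b)) && all(a == b)
```

Both sentences yield the same counts, so LDA cannot tell them apart.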

Fitting and evaluating LDA

library(topicmodels)
# dtm is a DocumentTermMatrix
lda_model <- LDA(dtm, k = 10, control = list(seed = 42))

Choosing k (number of topics) requires both statistical evaluation and human judgment. Perplexity measures how well the model predicts held-out documents, and lower is better. Computed on the training data, however, perplexity tends to keep falling as k grows, so it rewards ever more topics; on held-out data it usually levels off or starts rising once the model overfits. Topic coherence (do the top words of each topic co-occur in real documents?) correlates better with human interpretability.
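A minimal sketch of held-out evaluation, assuming a simple split by row: fit on one set of documents and compute perplexity on another. It uses the AssociatedPress sample data from topicmodels; the split sizes, the candidate k values, and the seed are arbitrary, and a real analysis would use more documents and ideally cross-validation.

```r
library(topicmodels)

data("AssociatedPress", package = "topicmodels")

# Simple split: first 200 documents for training, next 50 held out
train_dtm <- AssociatedPress[1:200, ]
test_dtm  <- AssociatedPress[201:250, ]

ks <- c(2, 4, 8)

# Fit one model per k on the training set, score it on the held-out set
heldout <- sapply(ks, function(k) {
  fit <- LDA(train_dtm, k = k, control = list(seed = 1234))
  perplexity(fit, newdata = test_dtm)
})

data.frame(k = ks, heldout_perplexity = heldout)
```

Read the resulting table for where the curve flattens, and weigh that against how interpretable the topics are at each k.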

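ldatuning::FindTopicsNumber(dtm, topics = c(5, 10, 15, 20, 25), metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014")) computes four metrics: Griffiths2004 (maximize), CaoJuan2009 (minimize), Arun2010 (minimize), and Deveaud2014 (maximize). Without the metrics argument, only the default Griffiths2004 is computed. Plot with FindTopicsNumber_plot() to find the elbow.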
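A small runnable sketch of an ldatuning call, assuming the ldatuning package is installed. It uses a 50-document slice of the AssociatedPress sample data and a short range of k values so that it finishes quickly; the slice size, range, and seed are arbitrary.

```r
library(topicmodels)
library(ldatuning)

data("AssociatedPress", package = "topicmodels")
dtm_small <- AssociatedPress[1:50, ]

tuning <- FindTopicsNumber(
  dtm_small,
  topics  = seq(2, 10, by = 2),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method  = "Gibbs",
  control = list(seed = 77),
  verbose = FALSE
)

# One row per candidate k, one column per metric
tuning

# Plots the metrics, rescaled for comparison
FindTopicsNumber_plot(tuning)
```

The metrics rarely agree on a single k. Treat the plot as a way to narrow the range and then inspect the topics themselves.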

After fitting, inspect topics: tidytext::tidy(lda_model, matrix = "beta") gives per-topic-per-word probabilities. After group_by(topic), slice_max(beta, n = 10) extracts the top words of each topic. Label each topic with a descriptive name based on these words; this labeling requires human judgment.

Document-topic assignments

tidytext::tidy(lda_model, matrix = "gamma") gives per-document-per-topic probabilities. Each document gets a probability for each topic; these sum to 1 per document. slice_max(gamma, n = 1) per document assigns each document to its dominant topic.
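The two operations described above can be sketched on a hand-written tibble shaped like the output of tidy(model, matrix = "gamma"). The documents and gamma values are hypothetical.

```r
library(dplyr)

# Hypothetical gamma values: two documents, three topics
toy_gamma <- tibble(
  document = rep(c("doc_a", "doc_b"), each = 3),
  topic    = rep(1:3, times = 2),
  gamma    = c(0.70, 0.20, 0.10, 0.15, 0.30, 0.55)
)

# Check: gamma sums to 1 within each document
gamma_totals <- toy_gamma %>%
  group_by(document) %>%
  summarise(total = sum(gamma))

# Hard assignment: keep the single most probable topic per document
dominant <- toy_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  ungroup()

gamma_totals
dominant
```

Setting with_ties = FALSE guarantees exactly one row per document even when two topics tie.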

Soft assignments (probabilities) are more informative than hard assignments for documents that span multiple topics. A news article about economic policy has meaningful probability mass on both “economics” and “politics” topics, and preserving this ambiguity is valuable for downstream analysis.

augment(lda_model, data = dtm) assigns each word to the topic it most likely came from (given the document’s topic mixture). This word-level assignment helps validate topic coherence and identify misclassified documents.

Preprocessing for better topics

Topic quality is highly sensitive to preprocessing. Over-preprocessing removes words that could distinguish topics; under-preprocessing creates topics dominated by stop words.

Domain-specific stop words matter more than generic ones. For scientific literature, remove generic academic words (“results”, “analysis”, “study”, “method”) that appear in every paper regardless of topic. For news articles, remove publication names and datelines. Add these to the stop word list before building the DTM.

A common starting point is a minimum document frequency threshold (remove terms appearing in fewer than 5 documents) combined with a maximum threshold (remove terms appearing in more than 95% of documents). These cuts remove rare noise and universal stop words simultaneously; scale both thresholds to the size of your corpus.
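A minimal dplyr sketch of document-frequency filtering on tidy word counts. The toy corpus has only three documents, so the minimum threshold is set to 2 instead of 5; the data and the names min_docs and max_share are illustrative.

```r
library(dplyr)

# Hypothetical tidy word counts: one row per document-word pair
word_counts <- tibble(
  document = c("d1", "d1", "d2", "d2", "d3", "d3", "d3"),
  word     = c("model", "rare", "model", "topic", "model", "topic", "noise"),
  n        = c(4, 1, 3, 2, 5, 1, 1)
)

min_docs  <- 2     # keep terms found in at least this many documents
max_share <- 0.95  # drop terms found in more than this share of documents

n_docs <- n_distinct(word_counts$document)

# Document frequency: in how many documents does each word appear?
doc_freq <- word_counts %>%
  group_by(word) %>%
  summarise(docs = n_distinct(document)) %>%
  mutate(share = docs / n_docs)

keep <- doc_freq %>%
  filter(docs >= min_docs, share <= max_share)

# Apply the filter before casting to a DTM
filtered <- word_counts %>%
  semi_join(keep, by = "word")

filtered
```

Here "model" is dropped because it appears in every document, "rare" and "noise" because they appear in only one, and "topic" survives.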

Bigrams capture multi-word concepts: “machine learning”, “climate change”, “stock market”. Include bigrams as single tokens in the vocabulary. tidytext::unnest_ngrams() or quanteda::tokens_ngrams() generate them. Mixing unigrams and bigrams sometimes improves topic interpretability.
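One way to mix unigrams and bigrams, sketched with tidytext on two invented sentences. Bigrams are joined with an underscore so that each one becomes a single term in the vocabulary; the underscore convention and the column names w1 and w2 are choices made for this example.

```r
library(dplyr)
library(tidyr)
library(tidytext)

docs <- tibble(
  doc  = c(1, 2),
  text = c("Machine learning models predict climate change",
           "The stock market reacts to climate change")
)

# Unigrams with stop words removed
unigrams <- docs %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

# Bigrams, dropping any pair that contains a stop word
bigrams <- docs %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("w1", "w2"), sep = " ") %>%
  filter(!w1 %in% stop_words$word, !w2 %in% stop_words$word) %>%
  unite(word, w1, w2, sep = "_")

# Combined vocabulary, counted per document and ready for cast_dtm()
tokens <- bind_rows(unigrams, bigrams) %>%
  count(doc, word)

tokens
```

Adding bigrams enlarges the vocabulary quickly, so a document-frequency filter becomes more important once they are included.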

Structural topic models

The stm package extends LDA with document-level metadata. Covariates (document date, author affiliation, source) can influence topic prevalence (how much of a document is about a topic) and topic content (which words are used within a topic).

stm::stm(documents, vocab, K = 20, prevalence = ~ year + source, data = meta) fits a model where topic prevalence varies with year and source; meta is the data frame of document metadata that holds those two columns. stm::estimateEffect() estimates a regression of topic prevalence on the covariates, which lets you test whether they significantly predict prevalence. This makes STM valuable for analyzing how topics change over time or vary across groups.
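A short sketch of the workflow, assuming the stm package and its text-processing dependencies are installed. There is no year or source metadata to hand, so it uses the gadarian survey data bundled with stm, with the treatment and pid_rep columns standing in as covariates; K = 3, the seed, and the iteration cap are arbitrary.

```r
library(stm)

data("gadarian", package = "stm")

# Tokenize and clean the open-ended responses, keeping metadata aligned
processed <- textProcessor(
  gadarian$open.ended.response,
  metadata = gadarian,
  verbose = FALSE
)
out <- prepDocuments(
  processed$documents, processed$vocab, processed$meta,
  verbose = FALSE
)

# Topic prevalence is allowed to vary with the two covariates
fit <- stm(
  out$documents, out$vocab,
  K = 3,
  prevalence = ~ treatment + pid_rep,
  data = out$meta,
  max.em.its = 50,
  seed = 2138,
  verbose = FALSE
)

# Regress the prevalence of each topic on the treatment indicator
effects <- estimateEffect(1:3 ~ treatment, fit, metadata = out$meta)
summary(effects)
```

The metadata passed to estimateEffect() should be the version returned by prepDocuments(), because documents dropped during preprocessing are removed from it as well.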

Key takeaways

LDA discovers topics without labeled training data
Choose k based on domain knowledge and interpretability
Higher beta = stronger word-topic association
Higher gamma = stronger document-topic association
Use tidy() to convert model output to tidy format

Next steps

Continue your text mining journey:

Text Classification in R: Build supervised models to categorize text
Sentiment Analysis in R: Assign emotional scores to text
Tidytext Basics: Foundation for all text mining in R