Topic Modeling with LDA in R

· 5 min read · Updated March 16, 2026 · intermediate
text-mining topic-modeling lda nlp r tidytext

Topic modeling discovers latent themes in large document collections without predefined categories. Latent Dirichlet Allocation (LDA) is the most popular approach—it treats each document as a mixture of topics, and each topic as a mixture of words. This tutorial shows you how to apply LDA to text data using R and tidytext.

Prerequisites

You should be comfortable with tidytext basics—tokenization, stop word removal, and working with multiple documents. If you need background, work through the Tidytext Basics and Sentiment Analysis tutorials first.

Understanding LDA

LDA assumes documents are generated by:

  1. Choosing a distribution over topics for each document
  2. For each word, choosing a topic from that distribution
  3. Choosing a word from that topic’s word distribution

The algorithm reverse-engineers this process to find the topics.
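This generative story can be sketched in a few lines of base R. Everything below (the toy vocabulary, the two hand-made topic distributions, and the 70/30 topic mixture) is invented purely for illustration:

```r
set.seed(42)

# Toy vocabulary and two hand-made topic distributions over it
vocab <- c("love", "marriage", "estate", "letter", "ball", "income")
topics <- rbind(
  romance = c(0.40, 0.30, 0.05, 0.10, 0.10, 0.05),
  money   = c(0.05, 0.10, 0.35, 0.10, 0.05, 0.35)
)

# Step 1: this document's mixture over topics
doc_topic_mix <- c(romance = 0.7, money = 0.3)

# Steps 2-3: for each word slot, draw a topic, then a word from that topic
generate_word <- function() {
  z <- sample(rownames(topics), 1, prob = doc_topic_mix)
  sample(vocab, 1, prob = topics[z, ])
}

doc <- replicate(20, generate_word())
table(doc)
```

LDA's job is the inverse: given only documents like `doc`, recover `topics` and `doc_topic_mix`.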

Key Concepts

  • Document: A single text (review, article, book chapter)
  • Topic: A cluster of words that tend to appear together
  • Word-topic probabilities: How strongly each word belongs to each topic
  • Document-topic proportions: How much each topic contributes to each document

Installing Required Packages

install.packages("topicmodels")
install.packages("tidytext")
install.packages("tidyverse")
install.packages("janeaustenr")

The topicmodels package provides the actual LDA implementation.

Preparing Text for Topic Modeling

Topic models work on word counts, not raw text:

library(tidytext)
library(tidyverse)
library(janeaustenr)
library(topicmodels)

# Get text from Jane Austen books
book_words <- austen_books() %>%
  filter(book %in% c("Pride & Prejudice", "Emma", "Sense & Sensibility")) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word) %>%
  count(book, word, sort = TRUE)

print(book_words)

Creating a Document-Term Matrix

LDA requires a document-term matrix (DTM):

# Create document-term matrix
book_dtm <- book_words %>%
  cast_dtm(document = book, term = word, value = n)

print(book_dtm)
# <<DocumentTermMatrix (documents: 3, terms: 6682)>>

The DTM rows are documents, columns are terms, and values are word counts.
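You can sanity-check the matrix by converting it back to tidy form; calling tidy() on a DTM returns one row per non-zero document-term pair:

```r
# Round-trip the DTM into a tibble for inspection
dtm_tidy <- tidy(book_dtm)

dtm_tidy %>%
  arrange(desc(count)) %>%
  head()
```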

Fitting an LDA Model

# Fit LDA model with 2 topics
book_lda <- LDA(book_dtm, k = 2, control = list(seed = 1234))

print(book_lda)

The k parameter specifies the number of topics—you decide this based on your domain knowledge.

Extracting Topic-Word Probabilities

# Get word-topic probabilities (beta)
book_topics <- tidy(book_lda, matrix = "beta")

print(book_topics)
# # A tibble: 13,364 × 3
#   topic term         beta
#   <int> <chr>       <dbl>
# 1     1 abbess    0.00100
# 2     1 abbot     0.00125

Higher beta means the word is more strongly associated with that topic.

Finding Top Words per Topic

top_terms <- book_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, -beta)

print(top_terms)
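Top-word lists can overlap between topics. A common follow-up is to compare the two topics directly with a log ratio of betas; the 0.001 cutoff below is an arbitrary filter, chosen here just to drop very rare words:

```r
# Spread betas into one column per topic, then compare with a log ratio
beta_wide <- book_topics %>%
  mutate(topic = paste0("topic", topic)) %>%
  pivot_wider(names_from = topic, values_from = beta) %>%
  filter(topic1 > 0.001 | topic2 > 0.001) %>%
  mutate(log_ratio = log2(topic2 / topic1))

# Words most strongly tied to one topic or the other
beta_wide %>%
  arrange(desc(abs(log_ratio))) %>%
  head(10)
```

A large positive log_ratio means the word is characteristic of topic 2; a large negative value means topic 1.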

Extracting Document-Topic Probabilities

# Get document-topic probabilities (gamma)
doc_topics <- tidy(book_lda, matrix = "gamma")

print(doc_topics)
# # A tibble: 6 × 3
#   document                 topic     gamma
#   <chr>                    <int>     <dbl>
# 1 Pride & Prejudice            1   0.148
# 2 Pride & Prejudice            2   0.852

Higher gamma means the document is more associated with that topic.
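To see which topic dominates each book, keep only the highest-gamma row per document:

```r
# Dominant topic per document
dominant <- doc_topics %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%
  ungroup()

print(dominant)
```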

Choosing the Right Number of Topics

There is no single “correct” number of topics. Consider:

  1. Domain knowledge: How many themes do you expect?
  2. Interpretability: Can you label the topics meaningfully?
  3. Quantitative measures: Perplexity measures how well the model predicts held-out data (lower is better)

One common approach is to fit models at several values of k and compare their perplexity:

# Try different numbers of topics
many_models <- data.frame(
  k = c(2, 5, 10, 20)
) %>%
  mutate(
    lda = map(k, ~ LDA(book_dtm, k = .x, control = list(seed = 1234)))
  )

# Compare perplexity (lower is better)
many_models %>%
  mutate(
    perplexity = map_dbl(lda, ~ perplexity(.x, book_dtm))
  )
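Plotting perplexity against k makes it easier to spot a point of diminishing returns (an "elbow"); ggplot2 is already loaded with the tidyverse:

```r
# Visualize how perplexity changes with the number of topics
many_models %>%
  mutate(perplexity = map_dbl(lda, ~ perplexity(.x, book_dtm))) %>%
  ggplot(aes(k, perplexity)) +
  geom_line() +
  geom_point() +
  labs(
    x = "Number of topics (k)",
    y = "Perplexity (lower is better)"
  )
```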

Visualizing Topic Words

library(ggplot2)

top_terms %>%
  mutate(
    term = reorder_within(term, beta, topic)
  ) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered() +
  labs(
    title = "Top Terms per Topic",
    x = "Beta (word-topic probability)",
    y = NULL
  ) +
  theme_minimal()

Assigning Topics to New Documents

Once the model is fitted, you can estimate topic proportions for unseen documents:

# New document
new_doc <- "Love and marriage are the main themes in this novel."

# Tokenize, remove stop words, and keep only terms the model has seen;
# posterior() expects the new DTM's terms to match the training vocabulary
new_tokens <- tibble(
  doc = 1,
  text = new_doc
) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  filter(word %in% colnames(book_dtm)) %>%
  count(doc, word)

# Create DTM (same terms as training)
new_dtm <- new_tokens %>%
  cast_dtm(document = doc, term = word, value = n)

# Get topic probabilities
new_doc_topics <- posterior(book_lda, new_dtm)$topics

print(new_doc_topics)

Practical Example: News Articles

Apply LDA to a larger corpus:

# Simulate with available data - expand to more documents
# In practice, load your own corpus

# First, get more text data
more_books <- austen_books() %>%
  filter(book != "Pride & Prejudice") %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word) %>%
  count(book, word, sort = TRUE)

# Create DTM
more_dtm <- more_books %>%
  cast_dtm(document = book, term = word, value = n)

# Fit with more topics
more_lda <- LDA(more_dtm, k = 3, control = list(seed = 1234))

# See what emerges
tidy(more_lda, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 8) %>%
  ungroup() %>%
  arrange(topic, -beta)

What You Have Learned

  • Document-Term Matrix: Word counts organized by documents
  • Topics (k): Number of latent themes to discover
  • Beta: Probability of a word given a topic
  • Gamma: Probability of a topic given a document

Key Takeaways

  1. LDA discovers topics without labeled training data
  2. Choose k based on domain knowledge and interpretability
  3. Higher beta = stronger word-topic association
  4. Higher gamma = stronger document-topic association
  5. Use tidy() to convert model output to tidy format

Next Steps

Continue your text mining journey:

  • Text Classification in R — Build supervised models to categorize text
  • Sentiment Analysis in R — Assign emotional scores to text
  • Tidytext Basics — Foundation for all text mining in R