
Text Classification in R with tidymodels: A Complete Guide

Text classification assigns predefined categories to text documents. Unlike sentiment scoring, which assigns a numeric emotional score, classification puts documents into discrete buckets: spam versus not spam, urgent versus normal, topic A versus topic B. This tutorial shows you how to build text classification models in R using the tidymodels framework.

Prerequisites

This is an advanced tutorial. You should be comfortable with:

  • Text preprocessing with tidytext (tokenization, stop words)
  • Basic R programming (functions, pipes, data frames)
  • Machine learning concepts (training/test splits, accuracy)

If you need background, work through the earlier tutorials in this series first, starting with the introduction to text mining in R and the tidytext basics tutorial.

What you will build

By the end of this tutorial, you will have:

  1. A text preprocessing pipeline
  2. A document-term matrix (DTM) for machine learning
  3. Trained classifiers using tidymodels
  4. Model evaluation with confusion matrices

Installing required packages

Install the core packages for this tutorial: tidymodels provides the modeling framework, tidytext handles text tokenization, textrecipes extends the recipes pipeline for text data, and discrim and glmnet supply the classification algorithms:

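install.packages("tidymodels")
install.packages("tidytext")
install.packages("textrecipes")
install.packages("discrim")
install.packages("glmnet")
install.packages("tidyverse")  # loaded below for dplyr, stringr, forcats, and ggplot2
install.packages("tm")         # needed by tidytext::cast_dtm()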

Key packages:

  • tidymodels: Unified interface for model training
  • textrecipes: Text preprocessing steps for recipes
  • discrim: Discriminant analysis models
  • glmnet: Regularized regression (Lasso/Elastic Net)

Understanding text classification

Text classification is supervised learning, so you need labeled examples. Common applications:

| Application | Categories |
| --- | --- |
| Spam detection | spam, not_spam |
| Topic labeling | sports, politics, technology, entertainment |
| Sentiment categorization | positive, negative, neutral |
| Intent detection | question, complaint, compliment |

The workflow: preprocess text → create features → train model → evaluate → predict.

Loading and exploring data

For this tutorial, use the spam dataset from the textrecipes package:

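library(tidyverse)
library(tidymodels)
library(tidytext)
library(textrecipes)  # provides step_tokenize(), step_tokenfilter(), step_tfidf()

# Load spam data. If your installed textrecipes does not ship a "spam" dataset,
# substitute any data frame with a character column `text` and a factor column `type`.
data("spam", package = "textrecipes")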

# Explore the data
glimpse(spam)
# Rows: 3,581
# Columns: 2 (text, type)

# Check class distribution
spam %>%
  count(type)
# # A tibble: 2 × 2
#   type      n
#   <fct> <int>
# 1 ham    2772
# 2 spam    809

The dataset has 3,581 emails with imbalanced classes (more ham than spam).

Text preprocessing pipeline

Raw email text contains capitalisation inconsistencies, numbers, URLs, and extraneous whitespace — all noise for a classifier. A standard preprocessing pipeline normalises the text before feature extraction: lowercasing ensures “FREE” and “free” are the same token, removing numbers and URLs strips content that rarely discriminates between classes, and str_squish() collapses multiple spaces.

spam_clean <- spam %>%
  mutate(
    text = str_to_lower(text),
    text = str_remove_all(text, "[0-9]+"),
    text = str_remove_all(text, "http[^ ]*"),
    text = str_squish(text)
  )
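
As a quick sanity check, the same cleaning steps can be wrapped in a helper and run on a couple of made-up messages. This is a minimal sketch: clean_text() and the sample strings are illustrative names and data, not part of the dataset or of any package.

```r
library(stringr)

# Hypothetical helper mirroring the mutate() steps above
clean_text <- function(x) {
  x <- str_to_lower(x)                  # "FREE" and "free" become one token
  x <- str_remove_all(x, "[0-9]+")      # drop digit runs
  x <- str_remove_all(x, "http[^ ]*")   # drop anything that starts with "http"
  str_squish(x)                         # collapse repeated whitespace
}

# Invented example messages
sample_msgs <- c(
  "WIN a FREE prize!!  Visit http://example.com/claim NOW",
  "Call 08001234567   today"
)

cleaned <- clean_text(sample_msgs)
print(cleaned)
```

Punctuation survives this cleaning; the word tokenizers used later in the tutorial strip it by default.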

Creating train/test split

Splitting before any feature engineering prevents data leakage — information from the test set must not influence training. initial_split() with strata = type preserves the class proportions (ham vs. spam) in both the training and test partitions, which is critical for the evaluation metrics to be meaningful when classes are imbalanced.

set.seed(1234)
spam_split <- spam_clean %>%
  initial_split(strata = type, prop = 0.8)

spam_train <- training(spam_split)  # ~2,864 rows
spam_test <- testing(spam_split)    # ~717 rows
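
To see what strata = type buys you, the sketch below runs a stratified split on a toy data frame and compares class shares in the two partitions. The toy data and the 77/23 imbalance are assumptions chosen to resemble the spam data, not the real dataset.

```r
library(rsample)

set.seed(1234)

# Toy data with an imbalance similar to the spam data (assumed values)
toy <- data.frame(
  type = factor(rep(c("ham", "spam"), times = c(770, 230))),
  x = seq_len(1000)
)

toy_split <- initial_split(toy, prop = 0.8, strata = type)

# Share of each class in the training and test partitions
train_share <- prop.table(table(training(toy_split)$type))
test_share  <- prop.table(table(testing(toy_split)$type))

print(round(train_share, 3))
print(round(test_share, 3))
```

Both partitions should show close to the same spam share as the full toy data, which is what makes metrics computed on the test set comparable to what the model saw in training.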

Feature engineering: document-term matrix

With the data partitioned into training and test sets, the next step is converting the raw email text into numeric features that machine learning algorithms can process. The standard bag-of-words approach for traditional classifiers is a document-term matrix (DTM), where each row is a document, each column is a term, and each cell holds that term's count in that document. The code below adds a document id (the data has none), uses unnest_tokens() from tidytext to break each email into individual words, removes stop words, and tallies the frequency of each term per document:

# Tokenize and count words per document
spam_tokens <- spam_train %>%
  mutate(id = row_number()) %>%          # the data has no document id, so add one
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word) %>%
  count(id, word, sort = TRUE)

print(spam_tokens)
# A tibble with one row per (id, word) pair:
#   id   <int>  document identifier
#   word <chr>  token
#   n    <int>  number of times the token occurs in that document

Creating the DTM

Cast tokenized word counts into a document-term matrix using cast_dtm(). Each row represents a document, each column a term, and each cell the count or weight of that term in that document. This sparse numeric matrix is the input that classification models consume:

# Cast to document-term matrix
spam_dtm <- spam_tokens %>%
  cast_dtm(document = id, term = word, value = n)

print(spam_dtm)
# <<DocumentTermMatrix (documents: 2864, terms: 6281)>>

The DTM has 2,864 documents and 6,281 terms; this is high-dimensional.
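
To make the structure concrete, here is a tiny document-term matrix built by hand in base R. The three documents are invented for illustration, and the code deliberately avoids tidytext so it runs without extra packages; it is a sketch of the idea, not a replacement for cast_dtm().

```r
# Three made-up documents (illustrative only)
docs <- c(
  d1 = "free prize call now",
  d2 = "lunch tomorrow",
  d3 = "call me about the free lunch"
)

# Tokenize on spaces and reshape to one row per (document, word)
tokens <- strsplit(docs, " ", fixed = TRUE)
long <- data.frame(
  id   = rep(names(tokens), lengths(tokens)),
  word = unlist(tokens, use.names = FALSE)
)

# Cross-tabulate: rows are documents, columns are terms, cells are counts
dtm <- unclass(table(long$id, long$word))
print(dtm)
```

Even with three short documents most cells are zero, which is why real DTMs are stored as sparse matrices.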

Building a classification model

Use tidymodels for a consistent interface:

Step 1: define the model

Use tidymodels to specify a model type, engine, and mode. logistic_reg() with the "glm" engine and "classification" mode sets up a standard logistic regression classifier, which works well as a baseline for binary text classification tasks:

# Specify a logistic regression model
log_reg <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

print(log_reg)

Step 2: create a recipe

The recipe defines preprocessing steps using textrecipes, a specialized extension that adds tokenization, filtering, and TF-IDF steps to the standard tidymodels recipe pipeline. The formula type ~ text tells the recipe that type is the outcome and text is the predictor:

# Create preprocessing recipe
spam_rec <- recipe(type ~ text, data = spam_train) %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens = 1000) %>%
  step_tfidf(text)

print(spam_rec)

This recipe tokenizes the text, keeps the 1,000 most frequent tokens, and applies TF-IDF weighting.
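
The arithmetic behind TF-IDF is short enough to work through by hand. The sketch below uses the textbook definition (term frequency times log(N / document frequency)) on an invented count matrix. step_tfidf() has its own defaults for smoothing and normalization, so its numbers may differ from these.

```r
# Invented counts: rows are documents, columns are terms
counts <- matrix(
  c(2, 0, 1,
    0, 1, 1,
    0, 0, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("d1", "d2", "d3"), c("free", "lunch", "the"))
)

tf  <- counts / rowSums(counts)   # share of each document taken up by a term
df  <- colSums(counts > 0)        # number of documents containing the term
idf <- log(nrow(counts) / df)     # rarer terms get a larger weight

tfidf <- sweep(tf, 2, idf, `*`)   # multiply each column by its idf
print(round(tfidf, 3))
```

Here "the" appears in every document, so its idf is log(3/3) = 0 and it contributes nothing, while "free" in d1 gets the weight (2/3) × log(3).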

Step 3: create a workflow

Combine the recipe and model into a single workflow object. The workflow manages the full pipeline (preprocessing via the recipe, followed by model training), which ensures that resampling and prediction both apply the same preprocessing steps.

# Create workflow
spam_wf <- workflow() %>%
  add_recipe(spam_rec) %>%
  add_model(log_reg)

print(spam_wf)

Step 4: train the model

Fit the workflow to the training data. The fit() function triggers the full pipeline end to end: it applies the recipe’s preprocessing steps to the training data, converts raw text into TF-IDF features based on the vocabulary learned from the training set, and then trains the logistic regression model on the resulting numeric matrix. This single call replaces what would otherwise be several separate preprocessing and modeling steps, which reduces the risk of inconsistencies between how you prepared the training data and how you later prepare test data for prediction:

# Fit the model
spam_fit <- spam_wf %>%
  fit(data = spam_train)

print(spam_fit)

Evaluating the model

Training the model is only half the task. Before you can trust your classifier’s predictions on new emails, you need to measure how accurately it performs on data it has not seen during training. The test set set aside earlier provides an unbiased benchmark. Passing the test data through the fitted workflow and comparing the predicted classes against the true labels tells you whether the model generalizes or has simply memorized the training examples.

Assess performance on the test set:

# Generate predictions
spam_pred <- spam_fit %>%
  predict(spam_test) %>%
  bind_cols(spam_test %>% select(type, text))

print(spam_pred)

Confusion matrix

The predictions give you per-document class estimates, but a single print() output does not reveal the model’s overall error patterns. A confusion matrix tabulates every combination of predicted and actual class across all test documents, showing exactly how many ham messages were misclassified as spam and vice versa. This granular view is the foundation for computing precision, recall, and other diagnostic metrics.

# Confusion matrix
spam_pred %>%
  conf_mat(truth = type, estimate = .pred_class) %>%
  autoplot(type = "heatmap")

Performance metrics

The heatmap from autoplot() gives a quick visual overview of classification outcomes, but you also need numeric summaries to compare models quantitatively and track improvement over time. The metrics() function from yardstick computes standard evaluation measures in a single call, and its output is a tidy data frame that slots naturally into further analysis or reporting pipelines.

# Calculate metrics
spam_pred %>%
  metrics(truth = type, estimate = .pred_class)
#   .metric  .estimator .estimate
# 1 accuracy binary         0.923
# 2 kap     binary         0.783

About 92% accuracy with basic logistic regression. Not bad! However, accuracy alone can be deceptive when the classes are unevenly distributed. The spam dataset contains more than three times as many ham messages as spam, which means a naive classifier that predicts ham for every message would score around 77% accuracy without detecting a single unwanted email. The next section demonstrates how to account for this imbalance during model training.
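
The majority-class baseline mentioned above follows directly from the class counts reported earlier in this tutorial:

```r
# Class counts from the count(type) output earlier in this tutorial
class_counts <- c(ham = 2772, spam = 809)

# Accuracy of a classifier that always predicts the most common class
baseline_accuracy <- max(class_counts) / sum(class_counts)
print(round(baseline_accuracy, 3))  # 0.774
```

Any model worth keeping has to beat this number by a clear margin.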

Class-imbalanced data

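The dataset is imbalanced (more ham than spam). logistic_reg() has no class-weight argument, so this example swaps in a different remedy: upsampling the minority class inside the recipe with themis::step_upsample(). The step is skipped at prediction time by default, so the test set keeps its natural class distribution:

# install.packages("themis")
library(themis)

# Add an upsampling step so spam and ham are equally frequent during training
spam_rec_balanced <- spam_rec %>%
  step_upsample(type, over_ratio = 1)

# Swap the balanced recipe into the existing workflow
spam_wf_balanced <- spam_wf %>%
  update_recipe(spam_rec_balanced)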

spam_fit_balanced <- spam_wf_balanced %>%
  fit(data = spam_train)

# Evaluate
spam_pred_balanced <- spam_fit_balanced %>%
  predict(spam_test) %>%
  bind_cols(spam_test %>% select(type))

spam_pred_balanced %>%
  metrics(truth = type, estimate = .pred_class)

Alternative models

Logistic regression gave us a solid baseline, but different classification algorithms make different assumptions about the data and can perform better on text tasks where word interactions are nonlinear. The tidymodels workflow abstraction lets you swap model specifications without rebuilding the preprocessing pipeline, making it straightforward to compare several classifiers on the same feature set.

Try other classifiers:

Naive Bayes

library(discrim)
# The "naivebayes" engine also needs the naivebayes package:
# install.packages("naivebayes")

# Naive Bayes classifier
nb_spec <- naive_Bayes() %>%
  set_mode("classification") %>%
  set_engine("naivebayes")

nb_wf <- spam_wf %>%
  update_model(nb_spec)

nb_fit <- nb_wf %>% fit(data = spam_train)

nb_pred <- nb_fit %>%
  predict(spam_test) %>%
  bind_cols(spam_test %>% select(type))

nb_pred %>%
  metrics(truth = type, estimate = .pred_class)

Regularized regression (lasso)

Naive Bayes is fast and works well with high-dimensional text data, but it treats each word as an independent piece of evidence, which ignores word interactions and context. Lasso regression takes a fundamentally different approach: it performs automatic feature selection by shrinking irrelevant word coefficients all the way to zero. This built-in sparsity often produces a model that is both more accurate and easier to interpret than Naive Bayes, because you can read off the handful of surviving terms to understand what drives predictions.

# Lasso regression
lasso_spec <- logistic_reg(
  penalty = 0.1,
  mixture = 1
) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

lasso_wf <- spam_wf %>%
  update_model(lasso_spec)

lasso_fit <- lasso_wf %>% fit(data = spam_train)

lasso_pred <- lasso_fit %>%
  predict(spam_test) %>%
  bind_cols(spam_test %>% select(type))

lasso_pred %>%
  metrics(truth = type, estimate = .pred_class)

Cross-validation

A single train/test split gives one estimate of model performance, but that number depends on which specific examples happened to land in the test set. Cross-validation addresses this by repeatedly partitioning the training data, training and evaluating the model on different subsets each time, and averaging the results. This produces a more stable performance estimate, along with a standard error that captures the variability across folds.

For more reliable evaluation, use k-fold cross-validation:

# Create 5-fold cross-validation
spam_folds <- vfold_cv(spam_train, v = 5, strata = type)

# Fit with cross-validation
spam_cv_results <- spam_wf %>%
  fit_resamples(
    spam_folds,
    metrics = metric_set(accuracy, precision, recall, f_meas),
    control = control_resamples(save_pred = TRUE)
  )

# View results
spam_cv_results %>%
  collect_metrics()

This gives you average performance across multiple train/validation splits and a clearer picture of how the model will perform in production. However, the lasso model fitted earlier used a fixed penalty of 0.1, chosen by hand. A regularized model's performance depends heavily on the regularization strength and the mix between lasso and ridge penalties, so the next logical step is to let the data choose those values.
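
As a side note on reading the collect_metrics() output: it reports a mean and a standard error for each metric. A minimal sketch of that arithmetic follows, using five made-up fold accuracies. The values are assumptions for illustration, not results from this model.

```r
# Hypothetical accuracy from each of 5 folds
fold_accuracy <- c(0.91, 0.93, 0.92, 0.94, 0.90)

mean_accuracy <- mean(fold_accuracy)
std_err <- sd(fold_accuracy) / sqrt(length(fold_accuracy))

print(c(mean = mean_accuracy, std_err = std_err))
```

A small standard error relative to the gap between two models is a sign that the difference is not just fold-to-fold noise.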

Hyperparameter tuning

Optimize model parameters for better performance:

# Define tunable model
log_reg_tune <- logistic_reg(
  penalty = tune(),
  mixture = tune()
) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

# Update workflow
spam_wf_tune <- spam_wf %>%
  update_model(log_reg_tune)

# Grid search
spam_grid <- grid_regular(
  penalty(),
  mixture(),
  levels = 5
)

# Tune
tune_results <- spam_wf_tune %>%
  tune_grid(
    spam_folds,
    grid = spam_grid,
    metrics = metric_set(accuracy)
  )

# Best parameters
best_params <- tune_results %>%
  select_best(metric = "accuracy")

print(best_params)
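
It can help to look at the grid before tuning, since its size determines how many candidate models are evaluated. A small sketch using dials (loaded with tidymodels); grid_preview is just an illustrative name:

```r
library(dials)

# 5 levels for each of the 2 parameters gives 5 x 5 candidate combinations
grid_preview <- grid_regular(penalty(), mixture(), levels = 5)

print(nrow(grid_preview))
print(head(grid_preview))
```

With 5 folds, each of the 25 candidates is evaluated on every fold, so larger grids get expensive quickly.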

Feature importance

After tuning, you have a well-performing model, but accuracy numbers alone do not explain which words drive the classification decisions. Inspecting the model coefficients reveals exactly that: each term gets a weight that pushes the prediction toward spam (positive coefficient) or ham (negative coefficient). This interpretability is a major advantage of logistic regression over black-box models, and it helps you verify that the classifier is relying on sensible linguistic signals rather than spurious patterns in the training data.

Understand what the model learned:

# Extract model coefficients from the underlying glm fit
spam_coefs <- spam_fit %>%
  extract_fit_engine() %>%
  tidy() %>%
  filter(term != "(Intercept)") %>%
  arrange(desc(abs(estimate)))

print(spam_coefs)
# For the "glm" engine, tidy() returns one row per term with the columns
# term, estimate, std.error, statistic, and p.value.
# Terms are named after the recipe step, e.g. tfidf_text_<token>.

# Visualize top features
spam_coefs %>%
  head(20) %>%
  mutate(term = fct_reorder(term, estimate)) %>%
  ggplot(aes(estimate, term, fill = estimate > 0)) +
  geom_col() +
  labs(
    title = "Most Important Words for Spam Detection",
    x = "Coefficient (positive = spam)",
    y = NULL
  ) +
  theme_minimal() +
  scale_fill_manual(values = c("coral", "steelblue"), guide = "none")

Making predictions on new data

The coefficient chart lets you check whether the model has learned sensible feature weights. In a spam corpus you would expect words like “call” and “free” to push predictions toward spam. The final practical test is feeding the model brand-new messages it has never seen and checking whether the class assignments match your intuition. The examples below simulate three realistic emails that a spam filter might encounter, covering an obvious promotional message, a casual personal note, and an urgent security-themed phishing attempt.

Use the trained model to predict new emails:

# New emails to classify
new_emails <- tibble(
  text = c(
    "Congratulations! You have won a free iPhone. Click here to claim your prize!",
    "Hey, are we still meeting for lunch tomorrow?",
    "URGENT: Your account has been compromised. Verify your password immediately."
  )
)

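# Predict the class and the class probabilities
# (apply the same cleaning used for spam_clean first if you want identical preprocessing)
new_predictions <- bind_cols(
  predict(spam_fit, new_emails),
  predict(spam_fit, new_emails, type = "prob"),
  new_emails
)

print(new_predictions)
# A tibble with 3 rows and the columns
#   .pred_class, .pred_ham, .pred_spam, text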

What you have learned

| Step | Description |
| --- | --- |
| Text preprocessing | Clean and standardize text |
| Feature extraction | Create document-term matrix |
| Model training | Fit classifier with tidymodels |
| Evaluation | Confusion matrix and metrics |
| Tuning | Optimize hyperparameters |

Feature engineering for text

Text classification requires converting text to numeric features. The most common representations: bag-of-words (word frequency counts), TF-IDF (frequency weighted by inverse document frequency), and word embeddings (dense vectors from pre-trained models like word2vec or fasttext). tidytext::cast_dtm() creates a document-term matrix for traditional ML. text2vec::itoken() and create_dtm() provide a faster pipeline for large corpora.

Training a classifier

Fit a classifier on numeric features. glmnet::cv.glmnet(x = dtm, y = labels, family = "multinomial") fits a regularized multinomial logistic regression, which is a good baseline for text (use family = "binomial" for two classes). xgboost and ranger are also effective. The tidymodels + textrecipes pipeline handles text preprocessing: step_tokenize(), step_stopwords(), and step_tfidf() convert raw text to TF-IDF features within a recipe.

Evaluation

For text classification, check accuracy, precision, recall, and F1 per class. caret::confusionMatrix(predicted, actual) returns per-class metrics. For imbalanced classes (rare categories), macro-averaged F1 is more informative than accuracy. pROC::multiclass.roc() computes AUC for multi-class problems. Use stratified cross-validation (createFolds(y, k = 5) in caret) to ensure each fold contains all classes.

Deep learning for text

Pre-trained transformer models (BERT, RoBERTa) achieve state-of-the-art text classification. The transformers Python library, accessible via reticulate, provides fine-tuning APIs. The text R package wraps Hugging Face transformers for R: textClassify() applies a pretrained classification model to new text. For production, pre-compute embeddings with Python and store them for retrieval, which avoids the overhead of loading the model for each prediction.

Document representation

Text classification requires converting documents to numeric feature vectors. The three main representations: bag of words (TF or TF-IDF), word embeddings (dense vectors), and sequence models (preserve word order).

Bag of words: tidytext::cast_dtm(df, document, word, n) creates a sparse document-term matrix. textrecipes::step_tfidf(text_col) computes TF-IDF in a recipes preprocessing pipeline. The vocabulary can have thousands to millions of terms, making dimensionality reduction (PCA, SVD) important. For a deeper treatment of these preprocessing fundamentals, the topic modeling tutorial covers tokenization and TF-IDF weighting in more detail.

Word embeddings map words to dense low-dimensional vectors where semantic similarity corresponds to vector distance. Pre-trained embeddings (GloVe, Word2Vec, fastText) are available from various sources. word2vec::word2vec() trains embeddings on your corpus. Document vectors are typically the mean of word vectors.
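
Averaging word vectors into a document vector can be sketched in a few lines of base R. The two-dimensional "embeddings" below are invented values and doc_vector() is a hypothetical helper; real embeddings have tens to hundreds of dimensions.

```r
# Toy 2-dimensional embeddings (invented values)
embeddings <- rbind(
  free  = c(1.0, 0.0),
  prize = c(0.8, 0.2),
  lunch = c(0.0, 1.0)
)

# Hypothetical helper: average the vectors of the tokens found in the vocabulary
doc_vector <- function(tokens, emb) {
  known <- tokens[tokens %in% rownames(emb)]
  if (length(known) == 0) return(rep(0, ncol(emb)))  # no known words: zero vector
  colMeans(emb[known, , drop = FALSE])
}

v <- doc_vector(c("free", "prize", "unknownword"), embeddings)
print(v)
```

Out-of-vocabulary tokens are simply dropped here; returning a zero vector for documents with no known words is one possible convention among several.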

For the textrecipes approach in tidymodels: step_tokenize() tokenizes, step_stopwords() removes stop words, step_stem() stems, step_tokenfilter() removes rare tokens, step_tfidf() computes TF-IDF. This pipeline handles train/test consistently, applying the vocabulary learned from training to test data.

Training classification models

With a feature matrix ready, any classifier works. Logistic regression is interpretable and fast. Random forests handle nonlinear relationships. Linear support vector machines cope well with high-dimensional, sparse text features. Gradient boosting (XGBoost) often achieves top performance.

In tidymodels:

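library(tidymodels)
library(textrecipes)

# `train` and `test` are placeholder data frames with a text column `text`
# and a factor outcome `label`
recipe_spec <- recipe(label ~ text, data = train) %>%
  step_tokenize(text) %>%
  step_stopwords(text) %>%                       # needs the stopwords package
  step_tokenfilter(text, max_tokens = 1000) %>%  # keep the 1,000 most frequent tokens
  step_tfidf(text)

model_spec <- logistic_reg(penalty = 0.01) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

text_wf <- workflow() %>%
  add_recipe(recipe_spec) %>%
  add_model(model_spec)

text_fit <- fit(text_wf, data = train)
predictions <- predict(text_fit, new_data = test)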

Evaluation metrics

For balanced classes, accuracy is meaningful. For imbalanced classes, use precision, recall, F1 score, and AUC.

yardstick::f_meas(pred_df, truth, estimate) computes F1 score. yardstick::precision() and yardstick::recall() compute precision and recall. yardstick::metric_set(accuracy, f_meas, roc_auc) evaluates multiple metrics at once.

Micro-averaging (compute the metric globally across all classes) and macro-averaging (compute it per class, then average) give different perspectives on multi-class performance. Use macro-averaging to treat all classes equally regardless of frequency; use micro-averaging when every document should count equally, which lets frequent classes dominate the result.
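
The difference between the two averages is easiest to see on a small example. The confusion matrix below is invented, with one rare class ("tech") that the classifier handles poorly:

```r
# Invented confusion matrix: rows are true classes, columns are predictions
cm <- matrix(
  c(90,  5,  5,
    10, 80, 10,
     2,  3,  5),
  nrow = 3, byrow = TRUE,
  dimnames = list(truth = c("sports", "politics", "tech"),
                  predicted = c("sports", "politics", "tech"))
)

recall_per_class <- diag(cm) / rowSums(cm)

macro_recall <- mean(recall_per_class)     # every class counts equally
micro_recall <- sum(diag(cm)) / sum(cm)    # pooled over all documents

print(round(recall_per_class, 3))
print(round(c(macro = macro_recall, micro = micro_recall), 3))
```

The weak recall on the rare class pulls the macro average down to about 0.73, while the micro average stays near 0.83 because the two large classes dominate the pooled counts.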

Dealing with class imbalance

themis::step_upsample(label) in a recipe oversamples the minority class. step_downsample() undersamples the majority class. step_smote() generates synthetic examples using SMOTE. Apply these only to training data. The themis steps default to skip = TRUE, so they run when the recipe is trained but are skipped when the recipe is applied to new data at prediction time.

Adjusting the classification threshold: probably::threshold_perf(pred_df, truth, .pred_positive, thresholds = seq(0.1, 0.9, 0.05)) evaluates sensitivity (recall) and specificity at multiple thresholds. Choose a threshold that balances the cost of false positives and false negatives for your use case.
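
The effect of moving the threshold can be sketched in base R without any modeling packages. The probabilities and labels below are invented, and precision_recall_at() is a hypothetical helper written for this illustration:

```r
# Invented spam probabilities and true labels (1 = spam, 0 = ham)
prob_spam <- c(0.95, 0.80, 0.65, 0.40, 0.30, 0.10)
is_spam   <- c(1, 1, 0, 1, 0, 0)

# Hypothetical helper: precision and recall when predicting spam at or above a threshold
precision_recall_at <- function(threshold) {
  predicted <- prob_spam >= threshold
  tp <- sum(predicted & is_spam == 1)
  fp <- sum(predicted & is_spam == 0)
  fn <- sum(!predicted & is_spam == 1)
  c(threshold = threshold, precision = tp / (tp + fp), recall = tp / (tp + fn))
}

sweep_results <- t(sapply(c(0.35, 0.50, 0.70), precision_recall_at))
print(round(sweep_results, 3))
```

Lowering the threshold to 0.35 catches every spam message at the cost of a false alarm, while raising it to 0.70 removes the false alarm but misses one spam message.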

Pre-trained models via Hugging Face

The text package interfaces with HuggingFace transformer models from R. Pre-trained models like BERT and RoBERTa achieve state-of-the-art accuracy on text classification tasks with minimal fine-tuning.

textTrainRegression() and textTrainRandomForest() from the text package use transformer-generated embeddings as features for downstream models. This approach combines the representational power of transformers with the interpretability and speed of simple classifiers.

For production use, fine-tuning a transformer directly (via Python’s transformers library and reticulate) gives the best performance but requires GPU resources and more complex training code.

Text classification tasks

Text classification assigns a label to a piece of text from a predefined set of categories. Spam detection classifies email as spam or not spam. Sentiment analysis classifies reviews as positive, neutral, or negative. Topic categorization assigns news articles to sections. The R ecosystem has tools for each stage of a text classification pipeline: text preprocessing, feature extraction, model training, and evaluation.

The choice of classification approach depends on the task complexity, the amount of training data, and the required accuracy. Rule-based approaches (keyword lists, regular expression patterns) work for simple categorizations with clear rules. Machine learning with bag-of-words features works for more complex patterns with sufficient labeled data. Pre-trained language model embeddings work for tasks where meaning matters more than keyword matching and where labeled data is limited.

Evaluation beyond accuracy

Accuracy is a poor metric for imbalanced classification tasks. If 95% of emails are not spam, a classifier that labels everything as not spam achieves 95% accuracy while being completely useless. Precision, recall, and F1 score are more meaningful for imbalanced problems. Precision measures what fraction of predicted positives are truly positive. Recall measures what fraction of actual positives were predicted positive.

The confusion matrix makes these tradeoffs explicit. Each cell shows the count of a specific prediction-truth combination: true positives, false positives, true negatives, false negatives. Examining the confusion matrix identifies systematic errors (categories that are frequently confused with each other) that aggregate metrics hide. A model with good overall F1 but high confusion between two specific categories may need additional features or training data for those categories.
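
All of these metrics come straight from the four confusion-matrix cells. A worked example with invented counts for the spam class:

```r
# Invented counts for the "spam" class out of 1,000 test emails
tp <- 60   # spam correctly flagged
fp <- 20   # ham wrongly flagged as spam
fn <- 40   # spam that slipped through
tn <- 880  # ham correctly passed

accuracy  <- (tp + tn) / (tp + fp + fn + tn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

print(round(c(accuracy = accuracy, precision = precision,
              recall = recall, f1 = f1), 3))
```

Accuracy comes out at 0.94 even though the classifier misses 40% of the spam, which is exactly the gap that precision, recall, and F1 expose.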

Handling imbalanced training data

When training data is imbalanced, the model learns to bias toward the majority class. Addressing this requires either resampling (oversampling the minority class or undersampling the majority) or adjusting class weights in the loss function. The themis package provides resampling methods that integrate with tidymodels recipes. Class or case weights, where the model engine supports them, adjust the learning objective to penalize minority class errors more heavily.

The choice between resampling and class weights depends on the severity of the imbalance and the model type. For severe imbalance (1:100 or worse), combining both approaches (oversampling the minority class and setting class weights) often produces better results than either alone. Evaluate on a held-out test set that reflects the natural class distribution, not the resampled training distribution, to get realistic performance estimates.
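
When weights are the chosen remedy, one common heuristic is to weight each class inversely to its frequency. The sketch below applies that heuristic to the class counts from this tutorial; the formula is an assumption about what "balanced" should mean, not something a particular package enforces.

```r
# Class counts from the spam data used in this tutorial
class_counts <- c(ham = 2772, spam = 809)

# "Balanced" heuristic: weight = n_total / (n_classes * n_class)
class_weights <- sum(class_counts) / (length(class_counts) * class_counts)
print(round(class_weights, 3))

# With these weights, each class carries the same total weight
weighted_totals <- class_weights * class_counts
print(weighted_totals)
```

Each spam message ends up counting roughly 3.4 times as much as a ham message, which matches the ratio of the class counts.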

Key takeaways

  1. Text classification requires labeled training data
  2. TF-IDF weighting often improves performance
  3. Class imbalance requires special handling (weights, sampling)
  4. Cross-validation gives more reliable performance estimates
  5. Logistic regression works well as a baseline

Next steps

Continue building your text mining skills:

  • Text Regression: Predict continuous values from text
  • Deep Learning with torch: Neural networks for text
  • BERT embeddings: Modern transformer-based features

See also