Gradient Boosting with xgboost

Gradient boosting is one of the most powerful techniques in machine learning, and xgboost (Extreme Gradient Boosting) is one of its most widely used implementations. In this tutorial, you’ll learn how to build, train, and tune xgboost models in R for both classification and regression tasks.

What you’ll learn

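This tutorial covers preparing data as a numeric matrix, training classification and regression models, choosing the key hyperparameters, and evaluating models with cross-validation. It also covers tuning with caret, handling class imbalance, saving models, early stopping, and interpreting models with feature importance and SHAP values.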

What is gradient boosting?

Before diving into code, let’s understand what makes xgboost so effective. Gradient boosting is an ensemble method that builds models sequentially, with each new model correcting the errors made by the previous ones. Unlike random forests (which build independent trees in parallel), gradient boosting learns from mistakes in a focused, iterative way.
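
To make the sequential idea concrete, here is a minimal sketch of boosting written in base R, without xgboost. It assumes squared-error loss and uses one-split "stumps" as weak learners; fit_stump() and predict_stump() are illustrative helpers invented for this example, and the data are simulated. xgboost itself adds regularization, second-order gradients, and much faster tree building on top of this basic loop.

```r
# Hand-rolled gradient boosting for squared error, base R only.
# Each weak learner is a one-split stump fitted to the current residuals.
# fit_stump() and predict_stump() are illustrative helpers, not xgboost functions.

fit_stump <- function(x, r) {
  # Try every midpoint between sorted unique x values; keep the lowest-SSE split
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  sse <- sapply(cuts, function(cut) {
    left <- r[x <= cut]
    right <- r[x > cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  best <- cuts[which.min(sse)]
  list(cut = best, left = mean(r[x <= best]), right = mean(r[x > best]))
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$cut, stump$left, stump$right)
}

set.seed(42)
x <- runif(200, 0, 10)              # simulated feature
y <- sin(x) + rnorm(200, sd = 0.2)  # simulated noisy target

eta <- 0.1                          # learning rate (shrinkage)
nrounds <- 100
pred <- rep(mean(y), length(y))     # round 0: predict the mean
mse <- numeric(nrounds)

for (i in seq_len(nrounds)) {
  residuals <- y - pred             # negative gradient of squared error (up to a constant)
  stump <- fit_stump(x, residuals)  # the new learner targets the current errors
  pred <- pred + eta * predict_stump(stump, x)
  mse[i] <- mean((y - pred)^2)
}

mse[c(1, 10, 100)]  # training error after 1, 10, and 100 rounds
```

Each round only has to explain what the previous rounds got wrong, which is why the training error keeps falling as rounds accumulate.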

Why xgboost stands out:

  • Speed and scalability: designed for efficiency
  • Built-in regularization: helps prevent overfitting
  • Handles missing values automatically
  • Supports class weighting for imbalanced datasets
  • Strong track record in Kaggle competitions on tabular data

Installing and loading xgboost

First, install the package from CRAN if you haven’t already:

install.packages("xgboost")
library(xgboost)

For this tutorial, we’ll also use the tidyverse for data manipulation:

library(tidyverse)

Preparing your data

xgboost requires numeric input, so you’ll need to prepare your data accordingly. Let’s use the iris dataset as an example:

# Load and prepare data
data(iris)

# For classification, let's predict whether a flower is virginica.
# The "binary:logistic" objective used below needs a 0/1 label,
# so encode the target as 1 (virginica) or 0 (any other species).
iris_train <- iris %>%
  mutate(Species_num = as.numeric(Species == "virginica")) %>%
  select(-Species)

# Create training matrix (xgboost requires matrix input)
train_matrix <- iris_train %>%
  select(-Species_num) %>%
  as.matrix()

train_label <- iris_train$Species_num

For regression tasks, the process is similar but your target will be continuous:

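# Example: Predict petal length (regression)
# Species is a factor, so one-hot encode it with model.matrix();
# as.matrix() alone would produce a character matrix
train_matrix_reg <- model.matrix(~ . - 1, data = select(iris, -Petal.Length))

train_label_reg <- iris$Petal.Length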

Training your first xgboost model

Classification example

# Train an xgboost classifier
xgb_model <- xgboost(
  data = train_matrix,
  label = train_label,
  max_depth = 3,
  eta = 0.1,
  nrounds = 100,
  objective = "binary:logistic",
  verbose = 0
)

# View model details
xgb_model

Regression example

# Train an xgboost regressor
xgb_model_reg <- xgboost(
  data = train_matrix_reg,
  label = train_label_reg,
  max_depth = 4,
  eta = 0.1,
  nrounds = 100,
  objective = "reg:squarederror",
  verbose = 0
)

xgb_model_reg

Understanding key hyperparameters

The magic of xgboost lies in its hyperparameters. Here’s what you need to know:

Parameter            Description                                 Typical range
-------------------  ------------------------------------------  -------------
max_depth            Maximum tree depth                          3-10
eta (learning rate)  Step size shrinkage                         0.01-0.3
nrounds              Number of boosting rounds                   100-1000
subsample            Row sampling ratio                          0.5-1.0
colsample_bytree     Column sampling ratio                       0.5-1.0
min_child_weight     Minimum sum of instance weight in a child   1-10
gamma                Minimum loss reduction for split            0-5

Key insight: Lower eta values require more nrounds but often produce better models. This trade-off is known as shrinkage: each tree’s contribution is scaled down, so more trees are needed to reach the same fit.

Making predictions

Once trained, making predictions is straightforward:

# Classification predictions (probabilities of class 1, here on the training data)
pred_probs <- predict(xgb_model, train_matrix)

# Convert to class labels (threshold at 0.5)
pred_classes <- ifelse(pred_probs > 0.5, 1, 0)

# Regression predictions
pred_values <- predict(xgb_model_reg, train_matrix_reg)
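
Predictions on the training data flatter the model, so it helps to score rows the model has never seen. The following is a minimal, self-contained sketch of a hold-out evaluation. It assumes an illustrative binary target (virginica vs. other species), a 70/30 random split, and the xgb.train() interface with a params list; the object names are made up for this example.

```r
library(xgboost)

set.seed(123)
# Illustrative binary target: is the flower virginica?
X <- as.matrix(iris[, 1:4])
y <- as.integer(iris$Species == "virginica")

# Hold out 30% of rows for testing
test_idx <- sample(nrow(X), size = round(0.3 * nrow(X)))
dtrain_split <- xgb.DMatrix(X[-test_idx, ], label = y[-test_idx])
dtest_split  <- xgb.DMatrix(X[test_idx, ],  label = y[test_idx])

fit <- xgb.train(
  params = list(objective = "binary:logistic", max_depth = 3, eta = 0.1),
  data = dtrain_split,
  nrounds = 50
)

# Score only the held-out rows
test_probs <- predict(fit, dtest_split)
test_classes <- as.integer(test_probs > 0.5)

mean(test_classes == y[test_idx])                      # hold-out accuracy
table(predicted = test_classes, actual = y[test_idx])  # confusion matrix
```

A single split like this is still noisy on a dataset this small, which is the motivation for cross-validation.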

Cross-Validation for model selection

Always use cross-validation to estimate model performance:

# Create DMatrix object for efficient computation
dtrain <- xgb.DMatrix(data = train_matrix, label = train_label)

# Run cross-validation
cv_results <- xgb.cv(
  data = dtrain,
  max_depth = 3,
  eta = 0.1,
  nrounds = 100,
  objective = "binary:logistic",
  eval_metric = "error",
  nfold = 5,
  verbose = 0
)

# View results
cv_results$evaluation_log %>%
  tail()

The output shows training and test error at each round. Choose the number of rounds that minimizes test error.
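
Picking that round can be done programmatically from the evaluation log. This sketch is self-contained and assumes an illustrative binary target and the column naming used by the classic xgb.cv() output (test_error_mean for eval_metric = "error"); check names(cv$evaluation_log) if your version differs.

```r
library(xgboost)

set.seed(123)
X <- as.matrix(iris[, 1:4])
y <- as.integer(iris$Species == "virginica")  # illustrative binary target
dtrain_cv <- xgb.DMatrix(X, label = y)

cv <- xgb.cv(
  params = list(objective = "binary:logistic", eval_metric = "error",
                max_depth = 3, eta = 0.1),
  data = dtrain_cv,
  nrounds = 100,
  nfold = 5,
  verbose = 0
)

eval_log <- as.data.frame(cv$evaluation_log)

# First round that reaches the lowest mean test error across folds
best_round <- which.min(eval_log$test_error_mean)
eval_log[best_round, c("iter", "train_error_mean", "test_error_mean")]
```

The value of best_round can then be passed as nrounds when refitting on the full training data.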

Hyperparameter tuning with caret

For systematic tuning, use the caret package:

library(caret)

# Define tuning grid (caret's xgbTree method expects exactly these seven columns)
tune_grid <- expand.grid(
  nrounds = c(50, 100, 200),
  max_depth = c(3, 5, 7),
  eta = c(0.01, 0.1, 0.3),
  gamma = 0,
  colsample_bytree = c(0.7, 1),
  min_child_weight = 1,
  subsample = c(0.7, 1)
)

# Train with tuning
caret_model <- train(
  x = train_matrix,
  y = factor(train_label),
  method = "xgbTree",
  tuneGrid = tune_grid,
  trControl = trainControl(method = "cv", number = 5),
  verbose = FALSE
)

# Best parameters
caret_model$bestTune

Handling class imbalance

For imbalanced datasets, use the scale_pos_weight parameter:

# Calculate imbalance ratio (negatives per positive)
neg_count <- sum(train_label == 0)
pos_count <- sum(train_label == 1)
scale_weight <- neg_count / pos_count

# Train with class weights
imbalanced_model <- xgboost(
  data = train_matrix,
  label = train_label,
  max_depth = 3,
  eta = 0.1,
  nrounds = 100,
  scale_pos_weight = scale_weight,
  objective = "binary:logistic",
  verbose = 0
)

Saving and loading models

Save your trained model for later use:

# Save model
xgb.save(imbalanced_model, "xgboost_model.json")

# Load model later
loaded_model <- xgb.load("xgboost_model.json")
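
A quick way to gain confidence in a saved model is to check that it reproduces the original predictions. The sketch below is self-contained; it assumes an illustrative binary target and writes to a temporary directory rather than the working directory.

```r
library(xgboost)

X <- as.matrix(iris[, 1:4])
y <- as.integer(iris$Species == "virginica")  # illustrative binary target

fit <- xgb.train(
  params = list(objective = "binary:logistic", max_depth = 3, eta = 0.1),
  data = xgb.DMatrix(X, label = y),
  nrounds = 20
)

# Save to a temporary file, then load it back
path <- file.path(tempdir(), "xgboost_model.json")
xgb.save(fit, path)
reloaded <- xgb.load(path)

# Predictions from the original and reloaded model should agree
all.equal(predict(fit, X), predict(reloaded, X))
```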

Best practices

  1. Start simple: Begin with default parameters, then tune
  2. Use cross-validation: Never trust a single train/test split
  3. Monitor overfitting: Stop early once validation error stops improving
  4. Feature engineering matters: xgboost is powerful but can’t fix bad features
  5. Encode appropriately: xgboost handles raw numeric data well, but categorical features must be converted to numbers first

Frequently asked questions

How many trees should I use?

Start with 100-200 rounds and use early stopping. Watch the validation error: once it stops decreasing, you’ve found your optimal number.

Should I normalize my features?

Unlike distance-based algorithms (KNN, SVM), xgboost doesn’t require feature normalization. Tree-based methods are invariant to monotonic transformations of individual features, because a split depends only on the ordering of a feature’s values.
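
The invariance claim can be checked with a few lines of base R. This sketch assumes a single simulated feature and a one-split regression stump; best_split_groups() is an illustrative helper written for this example, not part of xgboost. It finds the best split on the raw feature and on its logarithm, then compares which rows land in the left child.

```r
# Base R only: the best single split sends the same rows left
# whether the feature is raw or log-transformed.
# best_split_groups() is an illustrative helper.

best_split_groups <- function(x, y) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # candidate thresholds
  sse <- sapply(cuts, function(cut) {
    sum((y[x <= cut] - mean(y[x <= cut]))^2) +
      sum((y[x > cut] - mean(y[x > cut]))^2)
  })
  x <= cuts[which.min(sse)]                   # TRUE = row goes to the left child
}

set.seed(1)
x <- rexp(100, rate = 0.1)                     # strictly positive, skewed feature
y <- as.numeric(x > 8) + rnorm(100, sd = 0.1)  # simulated target

same_partition <- identical(
  best_split_groups(x, y),
  best_split_groups(log(x), y)
)
same_partition
```

The threshold value changes with the transformation, but the rows on each side of it do not, so the fitted tree makes the same predictions.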

When should I use xgboost over random forests?

Use xgboost when you need maximum predictive accuracy and have time for tuning. Random forests are more reliable out-of-the-box and less prone to overfitting with default parameters.

Conclusion

xgboost is a versatile, powerful algorithm that belongs in every data scientist’s toolkit. With proper tuning, it consistently delivers state-of-the-art performance on structured data problems.

Start with the basics covered here, then explore advanced topics like custom objective functions, regularization parameters (alpha and lambda), and GPU acceleration for larger datasets.

Next steps: Try applying xgboost to a real dataset. The Titanic dataset from Kaggle is an excellent practice problem with missing values and categorical features.

Data preparation

XGBoost requires a numeric matrix and a label vector; factors and character columns are not accepted. xgb.DMatrix(as.matrix(train_features), label = train_labels) creates the input. model.matrix(~ . - 1, data = df) converts factors to dummy variables. The recipes package in tidymodels automates this: step_dummy() encodes factors, and step_normalize() scales numeric features (scaling is optional for tree-based models).
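
As a small illustration of the model.matrix() route, the sketch below encodes a made-up data frame with one numeric column and one factor. The data and column names are invented for the example.

```r
# Base R: one-hot encode a factor with model.matrix()
df <- data.frame(
  age = c(23, 35, 41),
  plan = factor(c("basic", "pro", "basic"))  # made-up example data
)

X <- model.matrix(~ . - 1, data = df)  # "- 1" drops the intercept column
colnames(X)
is.numeric(X)
```

Two caveats apply to this approach. With several factors, only the first one gets a column for every level; later factors drop one level. Also, model.matrix() drops rows containing NA by default, which matters if you want xgboost to handle missing values itself.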

Key hyperparameters

nrounds sets the number of boosting iterations. max_depth controls tree depth (3-6 is typical). eta (learning rate) controls how much each tree contributes; lower values with more rounds generally perform better. subsample randomly samples rows per tree; colsample_bytree samples features. min_child_weight sets the minimum sum of instance weights (Hessian) required in a child node, which for squared-error regression equals the minimum number of rows per leaf. Start with max_depth = 6, eta = 0.1, nrounds = 100 and tune from there.

Early stopping

xgb.train() with watchlist and early_stopping_rounds stops training when validation performance does not improve for a specified number of rounds. This limits overfitting and means nrounds only needs to be a generous upper bound: xgb.train(params, dtrain, nrounds = 1000, watchlist = list(eval = dval), early_stopping_rounds = 20) stops once the validation metric has not improved for 20 consecutive rounds. Keep the validation set (dval here) separate from the final test set, because the stopping point is itself tuned on it.
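
The following self-contained sketch shows early stopping end to end. It assumes an illustrative binary target, a random 45-row validation set, and the classic interface in which the evaluation sets are passed as watchlist and the best round is stored in the best_iteration field; some newer xgboost releases rename watchlist to evals and may expose the best round differently, so treat the argument and field names as assumptions tied to the version you run.

```r
library(xgboost)

set.seed(123)
X <- as.matrix(iris[, 1:4])
y <- as.integer(iris$Species == "virginica")  # illustrative binary target
val_idx <- sample(nrow(X), 45)                # rows held out for validation

dtrain_es <- xgb.DMatrix(X[-val_idx, ], label = y[-val_idx])
dval_es   <- xgb.DMatrix(X[val_idx, ],  label = y[val_idx])

fit_es <- xgb.train(
  params = list(objective = "binary:logistic", eval_metric = "logloss",
                max_depth = 3, eta = 0.1),
  data = dtrain_es,
  nrounds = 1000,                    # upper bound, rarely reached
  watchlist = list(eval = dval_es),  # validation set monitored each round
  early_stopping_rounds = 20,        # stop after 20 rounds without improvement
  verbose = 0
)

fit_es$best_iteration  # round with the best validation logloss
```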

Feature importance

xgb.importance(model = fit) returns a table with three importance metrics per feature, each normalized to sum to 1: Gain (the feature’s share of the total loss reduction from its splits), Cover (its share of the observations passing through those splits), and Frequency (its share of all splits). xgb.plot.importance(importance_matrix) visualizes the top features. Gain is usually the most informative metric for understanding which features drive the model’s predictions.
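
Here is a minimal, self-contained sketch of computing and plotting importance. It assumes an illustrative binary target on iris; the model and object names are made up for the example.

```r
library(xgboost)

X <- as.matrix(iris[, 1:4])
y <- as.integer(iris$Species == "virginica")  # illustrative binary target

fit_imp <- xgb.train(
  params = list(objective = "binary:logistic", max_depth = 3, eta = 0.1),
  data = xgb.DMatrix(X, label = y),
  nrounds = 50
)

importance_matrix <- xgb.importance(model = fit_imp)
importance_matrix            # columns: Feature, Gain, Cover, Frequency
sum(importance_matrix$Gain)  # Gain values are shares of the total

# Bar chart of the most important features (base graphics)
xgb.plot.importance(importance_matrix, top_n = 4)
```

Features that are never used in a split do not appear in the table at all.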

XGBoost in R

XGBoost is a gradient boosting algorithm that builds an ensemble of decision trees. Each tree corrects the errors of the previous ensemble, while a regularization term on tree complexity limits overfitting. It often achieves state-of-the-art results on structured tabular data.

The R xgboost package provides a fast C++ implementation. The interface requires numeric matrices — no data frames, no factors. xgb.DMatrix(as.matrix(X_train), label = y_train) creates the optimized data structure. Factor and character columns must be one-hot encoded or integer-encoded first.

xgboost(data = dtrain, nrounds = 100, params = list(objective = "binary:logistic", eta = 0.1, max_depth = 6)) trains a binary classifier. predict(model, xgb.DMatrix(as.matrix(X_test))) returns predicted probabilities.

Hyperparameter tuning

Key parameters: nrounds (number of trees), eta (learning rate, shrinkage), max_depth (maximum tree depth), subsample (fraction of data per tree), colsample_bytree (fraction of features per tree), min_child_weight (minimum sum of instance weights in a child node), gamma (minimum loss reduction for a split), lambda (L2 regularization), alpha (L1 regularization).

Smaller eta with more nrounds generally gives better results but takes longer. subsample and colsample_bytree below 1 add stochasticity that reduces overfitting. max_depth of 3-8 is typical; deeper trees overfit on small datasets.

xgb.cv(params, dtrain, nrounds = 200, nfold = 5, early_stopping_rounds = 10) uses cross-validation to find the optimal number of rounds — stopping when validation performance does not improve for 10 rounds. This prevents overfitting to the training set.

The tidymodels interface via parsnip::boost_tree() with engine = "xgboost" provides a consistent API: tune() marks parameters for tuning, tune_grid() or tune_bayes() searches the space. This integrates XGBoost into the full tidymodels workflow with recipes, cross-validation, and metrics.

Feature importance and SHAP

xgb.importance(model = model) returns feature importance. Gain (the share of total loss reduction contributed by splits on this feature) is the most informative metric. Cover (the share of observations routed through splits on this feature) and Frequency (the share of all splits that use this feature) are alternatives.

xgb.plot.importance(importance_matrix, top_n = 20) plots the top features. For SHAP (SHapley Additive exPlanations) values that explain individual predictions: xgb.plot.shap(data = as.matrix(X_test), model = model, top_n = 5).

The SHAPforxgboost package provides a cleaner interface: shap_values <- shap.values(xgb_model, X_train) computes SHAP values, and shap_long <- shap.prep(xgb_model = xgb_model, X_train = X_train) reshapes them into the long format that shap.plot.summary(shap_long) needs to draw a beeswarm plot showing each feature’s impact on each prediction. SHAP plots distinguish the direction of feature effects (high values of feature X increase or decrease the prediction), which standard importance plots do not.
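
SHAP values are also available from xgboost directly through predict() with predcontrib = TRUE, without any extra package. The sketch below assumes an illustrative binary target on iris and checks the additivity property: for each row, the per-feature contributions plus the bias term sum to the raw (log-odds) prediction.

```r
library(xgboost)

X <- as.matrix(iris[, 1:4])
y <- as.integer(iris$Species == "virginica")  # illustrative binary target

fit_shap <- xgb.train(
  params = list(objective = "binary:logistic", max_depth = 3, eta = 0.1),
  data = xgb.DMatrix(X, label = y),
  nrounds = 50
)

# One column per feature plus a bias column; values are on the log-odds scale
shap <- predict(fit_shap, X, predcontrib = TRUE)
dim(shap)

# Additivity: contributions sum to the raw margin prediction for each row
margin <- predict(fit_shap, X, outputmargin = TRUE)
max(abs(rowSums(shap) - margin))  # close to zero, up to floating point error
```

Because the contributions are on the log-odds scale for a logistic objective, they explain the margin rather than the predicted probability.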

Handling imbalanced classes and missing data

XGBoost handles missing values natively — it learns the best direction for missing values at each split. Leave missing values as NA in the input matrix rather than imputing them. The missing parameter in xgb.DMatrix specifies the missing value representation (default NA).

For imbalanced classes, scale_pos_weight = sum(y == 0) / sum(y == 1) up-weights the positive class by the negative-to-positive ratio, so errors on a minority positive class are penalized proportionally more. Also consider eval_metric = "auc" so that model selection and early stopping track AUC rather than accuracy, which can be misleading on imbalanced problems.

Multi-class and regression

Use objective = "multi:softmax" for multi-class classification when you want predicted class labels; "multi:softprob" returns probabilities for all classes instead. In both cases, set num_class to the number of classes and encode the labels as integers from 0 to num_class - 1.
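
A minimal multi-class sketch on the three iris species follows. It assumes the xgb.train() interface with a params list. Depending on the xgboost version, predict() with "multi:softprob" returns either a flat vector (row by row) or a matrix, so the code reshapes only when needed.

```r
library(xgboost)

X <- as.matrix(iris[, 1:4])
y <- as.integer(iris$Species) - 1    # labels must be 0, 1, ..., num_class - 1
num_class <- nlevels(iris$Species)

fit_multi <- xgb.train(
  params = list(objective = "multi:softprob", num_class = num_class,
                max_depth = 3, eta = 0.1),
  data = xgb.DMatrix(X, label = y),
  nrounds = 50
)

probs <- predict(fit_multi, X)
# Some versions return a flat vector ordered row by row; reshape if so
if (is.null(dim(probs))) {
  probs <- matrix(probs, ncol = num_class, byrow = TRUE)
}

# Pick the most probable class for each row and map back to species names
pred_class <- levels(iris$Species)[max.col(probs, ties.method = "first")]
head(pred_class)
```

These are predictions on the training rows, so they only confirm that the pipeline works; use held-out data to judge accuracy.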

Use objective = "reg:squarederror" for regression, "count:poisson" for count data, "reg:tweedie" for non-negative, right-skewed targets with many zeros, and "survival:cox" for survival analysis.
