Deep Learning with torch for R
This guide covers the fundamentals: creating tensors, building neural networks, training models, and leveraging GPU acceleration. By the end, you’ll have a solid foundation for implementing deep learning solutions in R.
Installation
Install torch from CRAN or the development version from GitHub:
# From CRAN
install.packages("torch")
# Latest development version
remotes::install_github("mlverse/torch")
You’ll also need the LibTorch backend libraries. The torch package downloads them for you:
library(torch)
install_torch()
The installation process downloads LibTorch (the same C++ backend that powers PyTorch) and supporting binaries; no Python installation is required. You’ll need about 2GB of free disk space, and first-time installation can take several minutes depending on your internet connection.
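Once installation finishes, a quick smoke test confirms the backend loads correctly:

```r
library(torch)

# If the backend installed correctly, this creates a tensor
# and computes on it without errors
x <- torch_tensor(c(1, 2, 3))
x$sum() # torch scalar: 6
```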
Creating Tensors
Tensors are the fundamental data structure in torch — multi-dimensional arrays similar to R’s matrices but with GPU support and automatic differentiation capabilities. Understanding tensors thoroughly is essential, as all operations in torch revolve around them.
Basic Tensor Creation
library(torch)
# From R objects
x <- torch_tensor(c(1, 2, 3, 4))
x <- torch_tensor(matrix(1:6, nrow = 2))
# Directly in torch
x <- torch_ones(3, 4) # all ones
x <- torch_zeros(3, 4) # all zeros
x <- torch_randn(3, 4) # standard normal distribution
x <- torch_rand(3, 4) # uniform [0, 1)
x <- torch_arange(1, 10) # 1D tensor from 1 to 10
x <- torch_eye(5) # identity matrix
# With specific dtype
x <- torch_tensor(1:10, dtype = torch_float32())
x <- torch_tensor(1:10, dtype = torch_int64())
Tensor Properties and Operations
x <- torch_tensor(matrix(1:6, nrow = 2))
# Shape and dtype
dim(x) # returns c(2, 3)
x$shape # torch_size object
x$dtype # torch_int64
# Reshaping
y <- x$view(c(3, 2)) # reshape to 3x2 (must have same total elements)
y <- x$t() # transpose
y <- x$unsqueeze(1) # add dimension at position 1
# Basic operations
y <- x * 2 # element-wise multiplication
z <- torch_matmul(x, x$t()) # matrix multiplication
sum_x <- x$sum()
mean_x <- x$mean()
max_x <- x$max()
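These element-wise operations also broadcast: when two tensors’ shapes differ, dimensions of size 1 (or missing leading dimensions) are expanded to match, as in NumPy. A quick sketch:

```r
library(torch)

a <- torch_ones(3, 4)            # shape (3, 4)
b <- torch_tensor(c(1, 2, 3, 4)) # shape (4)

# b is stretched across the rows of a
(a + b)$shape  # (3, 4); each row becomes 2 3 4 5

col <- torch_randn(3, 1)         # shape (3, 1)
(a * col)$shape # (3, 4); row i is scaled by col[i, 1]
```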
Slicing and Indexing
x <- torch_tensor(matrix(1:12, nrow = 3))
# Single element (returns tensor)
x[1, 1]
# Single element (returns as R scalar)
x[1, 1]$item()
# Slicing
x[1, ] # first row
x[, 2] # second column
x[1:2, 1:3] # rows 1-2, columns 1-3
# Boolean indexing
mask <- x > 5
x[mask] # elements greater than 5
Moving Between Devices
# Check if CUDA (GPU) is available
cuda_is_available()
# Create tensor on GPU if available
device <- if (cuda_is_available()) "cuda" else "cpu"
x <- torch_tensor(1:10, device = device)
# Move existing tensor to device
x <- x$to(device = "cpu")
# Check current device
x$device
Building Neural Networks
The nn_ prefix designates neural network modules — the building blocks for constructing models. Compose them using nn_sequential() for simple, straightforward networks, or subclass nn_module() for custom architectures requiring more control.
Using nn_sequential
For straightforward architectures, nn_sequential provides a clean, readable way to stack layers:
model <- nn_sequential(
nn_linear(784, 256),
nn_relu(),
nn_dropout(0.2),
nn_linear(256, 10)
)
This creates a feedforward network with 784 inputs, a 256-neuron hidden layer with ReLU activation, 20% dropout for regularization, and 10 outputs — suitable for MNIST digit classification.
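A dummy forward pass is a cheap way to confirm the layer shapes line up before training; the model can be called like a function on a batch:

```r
library(torch)

model <- nn_sequential(
  nn_linear(784, 256),
  nn_relu(),
  nn_dropout(0.2),
  nn_linear(256, 10)
)

batch <- torch_randn(32, 784) # batch of 32 flattened 28x28 images
out <- model(batch)
out$shape                     # (32, 10): one score per class per sample
```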
Custom Modules
For complex architectures requiring custom forward logic or multiple branches, subclass nn_module():
net <- nn_module(
"Net",
initialize = function() {
# First conv block
self$conv1 <- nn_conv2d(1, 32, kernel_size = 3, padding = 1)
self$bn1 <- nn_batch_norm2d(32)
# Second conv block
self$conv2 <- nn_conv2d(32, 64, kernel_size = 3, padding = 1)
self$bn2 <- nn_batch_norm2d(64)
# Fully connected layers
self$fc1 <- nn_linear(64 * 7 * 7, 256)
self$fc2 <- nn_linear(256, 10)
self$pool <- nn_max_pool2d(2)
self$relu <- nn_relu()
self$dropout <- nn_dropout(0.5)
},
forward = function(x) {
  x <- x$reshape(c(x$size(1), 1, 28, 28)) # reshape flat input to image
  x <- self$pool(self$relu(self$bn1(self$conv1(x))))
  x <- self$pool(self$relu(self$bn2(self$conv2(x))))
  x <- x$view(c(x$size(1), -1)) # flatten
  x <- self$relu(self$fc1(x))
  x <- self$dropout(x)
  self$fc2(x)
}
)
model <- net()
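Shape mismatches in a custom `forward()` only surface at run time, so it is worth pushing a dummy batch through the freshly constructed model (this assumes the `net` generator defined above):

```r
model <- net()

x <- torch_randn(8, 784) # 8 flattened 28x28 images
out <- model(x)
out$shape                # should be (8, 10): one logit per class
```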
Common Layer Types
| Layer | Description |
|---|---|
| `nn_linear(in, out)` | Fully connected (dense) layer |
| `nn_conv2d(in, out, kernel_size)` | 2D convolution for images |
| `nn_conv1d(in, out, kernel_size)` | 1D convolution for sequences |
| `nn_lstm(input_size, hidden_size, num_layers)` | Long Short-Term Memory layer |
| `nn_gru(input_size, hidden_size, num_layers)` | Gated Recurrent Unit layer |
| `nn_embedding(num_embeddings, embedding_dim)` | Word embedding lookup |
| `nn_dropout(p)` | Dropout regularization |
| `nn_batch_norm2d(num_features)` | Batch normalization |
| `nn_layer_norm(normalized_shape)` | Layer normalization |
Activation Functions
nn_relu() # Rectified Linear Unit
nn_sigmoid() # Sigmoid activation
nn_tanh() # Hyperbolic tangent
nn_softmax(dim) # Softmax (specify dimension)
nn_log_softmax(dim) # Log-softmax for numerical stability
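Each of these module constructors also has a functional counterpart in the `nnf_` namespace, convenient when you want to apply an activation directly without creating a layer object:

```r
library(torch)

x <- torch_randn(2, 3)

nnf_relu(x)             # equivalent to nn_relu()(x)
nnf_softmax(x, dim = 2) # softmax across columns; each row sums to 1
```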
Training Models
Training involves a repeated cycle: forward pass (compute predictions), loss computation (evaluate error), backward pass (compute gradients), and parameter updates (adjust weights). This is the fundamental training loop in deep learning.
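This cycle is powered by automatic differentiation: tensors created with `requires_grad = TRUE` record the operations applied to them, and calling `backward()` on a scalar result walks that graph to populate `$grad`. A minimal sketch of one cycle, without a model:

```r
library(torch)

w <- torch_tensor(2, requires_grad = TRUE)

loss <- (w * 3 - 12)^2 # forward pass: loss = (3w - 12)^2
loss$backward()        # backward pass: d(loss)/dw = 2 * (3w - 12) * 3

w$grad                 # -36 at w = 2
```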
The Training Loop
# Create model
model <- nn_sequential(
nn_linear(10, 32),
nn_relu(),
nn_linear(32, 1)
)
# Loss function and optimizer
criterion <- nn_mse_loss()
optimizer <- optim_adam(model$parameters, lr = 0.01)
# Training data
set.seed(42)
x_data <- torch_randn(100, 10)
y_data <- x_data$sum(dim = 2, keepdim = TRUE) + torch_randn(100, 1) * 0.1
# Training loop
epochs <- 100
for (epoch in 1:epochs) {
# Forward pass: compute predictions
predictions <- model(x_data)
# Compute loss
loss <- criterion(predictions, y_data)
# Backward pass: compute gradients
optimizer$zero_grad() # clear previous gradients
loss$backward() # compute new gradients
# Update weights
optimizer$step()
if (epoch %% 10 == 0) {
cat("Epoch:", epoch, "Loss:", loss$item(), "\n")
}
}
Loss Functions
Different tasks require different loss functions. Choosing the right one is crucial for training success:
# Regression losses
nn_mse_loss() # Mean Squared Error — good for smooth targets
nn_l1_loss() # Mean Absolute Error — robust to outliers
nn_smooth_l1_loss() # Smooth combination of MSE and MAE
# Classification losses
nn_cross_entropy_loss() # Cross-entropy for multi-class
nn_bce_loss() # Binary Cross-Entropy for binary classification
nn_bce_with_logits_loss() # BCE with sigmoid built-in (more numerically stable)
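One detail worth noting: `nn_cross_entropy_loss()` expects raw logits and integer class labels (1-based in R torch), not probabilities; it applies log-softmax internally. For example:

```r
library(torch)

criterion <- nn_cross_entropy_loss()

logits <- torch_randn(4, 3)                                  # 4 samples, 3 classes
targets <- torch_tensor(c(1, 3, 2, 1), dtype = torch_long()) # class indices, 1-based

loss <- criterion(logits, targets)
loss$item() # scalar loss value
```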
Optimizers
The optimizer adjusts model parameters based on computed gradients. Choosing the right optimizer and learning rate often requires experimentation:
# Adam (adaptive learning rate, usually good default)
optim_adam(model$parameters, lr = 0.001)
# AdamW (Adam with proper weight decay regularization)
optim_adamw(model$parameters, lr = 0.001, weight_decay = 0.01)
# SGD with momentum (traditional, often needs tuning)
optim_sgd(model$parameters, lr = 0.01, momentum = 0.9)
# RMSprop (adaptive learning rate)
optim_rmsprop(model$parameters, lr = 0.01)
# Learning rate schedulers
scheduler <- lr_step(optimizer, step_size = 10, gamma = 0.5)
scheduler <- lr_reduce_on_plateau(optimizer, mode = "min", factor = 0.5, patience = 5)
# In training loop:
for (epoch in 1:epochs) {
# ... training code ...
scheduler$step() # for step scheduler
# scheduler$step(loss) # for lr_reduce_on_plateau
}
Training with Validation
For production models, you should monitor validation loss to detect overfitting:
# Split data
n_train <- 800
train_loader <- dataloader(tensor_dataset(
x_train[1:n_train, ],
y_train[1:n_train, ]
), batch_size = 32, shuffle = TRUE)
val_loader <- dataloader(tensor_dataset(
x_train[(n_train + 1):nrow(x_train), ],
y_train[(n_train + 1):nrow(y_train), ]
), batch_size = 32)
train <- function(model, train_loader, val_loader, epochs) {
  for (epoch in 1:epochs) {
    # Training phase
    model$train()
    train_loss <- 0
    coro::loop(for (batch in train_loader) {
      optimizer$zero_grad()
      output <- model(batch[[1]])
      loss <- criterion(output, batch[[2]])
      loss$backward()
      optimizer$step()
      train_loss <- train_loss + loss$item()
    })
    # Validation phase
    model$eval()
    val_loss <- 0
    with_no_grad({
      coro::loop(for (batch in val_loader) {
        output <- model(batch[[1]])
        val_loss <- val_loss + criterion(output, batch[[2]])$item()
      })
    })
    cat("Epoch:", epoch,
        "Train Loss:", train_loss / length(train_loader),
        "Val Loss:", val_loss / length(val_loader), "\n")
  }
}
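A common extension of this loop is early stopping: checkpoint the weights whenever validation loss improves, and stop once it has stalled for a fixed number of epochs. A sketch of the bookkeeping to add inside the epoch loop (the `patience` value here is an arbitrary choice):

```r
best_val <- Inf
patience <- 5
stale_epochs <- 0

# ...inside the epoch loop, after val_loss is computed:
if (val_loss < best_val) {
  best_val <- val_loss
  stale_epochs <- 0
  torch_save(model$state_dict(), "best_model.pt") # keep the best weights
} else {
  stale_epochs <- stale_epochs + 1
  if (stale_epochs >= patience) {
    cat("Early stopping at epoch", epoch, "\n")
    break
  }
}
```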
Using GPU Acceleration
GPU acceleration dramatically speeds up training for large models and datasets. Modern GPUs have thousands of cores optimized for parallel tensor operations, making them ideal for deep learning.
Checking GPU Availability
cuda_is_available()   # Returns TRUE if CUDA is available
cuda_device_count()   # Number of available GPUs
cuda_current_device() # Index of the currently selected GPU
# For Apple Silicon (M1/M2/M3)
backends_mps_is_available() # Metal Performance Shaders
Moving Data and Models to GPU
device <- "cuda"
# Move model to GPU
model <- model$to(device = device)
# Move data to GPU
x_train <- x_train$to(device = device)
y_train <- y_train$to(device = device)
# Or use the $cuda() shorthand (returns a copy on the GPU)
x_train <- x_train$cuda()
Complete GPU Training Example
# Set device based on availability
device <- if (cuda_is_available()) "cuda" else "cpu"
cat("Using device:", device, "\n")
# Create model and move to GPU
model <- nn_sequential(
nn_linear(784, 256),
nn_relu(),
nn_dropout(0.2),
nn_linear(256, 10)
)$to(device = device)
optimizer <- optim_adam(model$parameters, lr = 0.001)
criterion <- nn_cross_entropy_loss()
# Training loop with GPU
for (epoch in 1:50) {
model$train()
optimizer$zero_grad()
# Data must be on same device as model
output <- model(x_data$to(device = device))
loss <- criterion(output, y_data$to(device = device))
loss$backward()
optimizer$step()
if (epoch %% 10 == 0) {
cat("Epoch:", epoch, "Loss:", loss$item(), "\n")
}
}
Mixed Precision Training
For even faster training on modern GPUs with Tensor Cores, use mixed precision. This uses float16 for most operations while keeping critical operations in float32:
model <- model$to(device = "cuda")
scaler <- cuda_amp_grad_scaler()
train_step <- function(x, y) {
  optimizer$zero_grad()
  # Operations in this block run in float16 where it is safe to do so
  with_autocast(device_type = "cuda", {
    output <- model(x)
    loss <- criterion(output, y)
  })
# Scale loss, backward, unscale, update
scaler$scale(loss)$backward()
scaler$step(optimizer)
scaler$update()
}
Mixed precision typically provides 1.5-3x speedup on modern GPUs with minimal accuracy impact.
Working with Data
Efficient data handling is crucial for training performance. The dataloader function provides batching, shuffling, and parallel data loading.
DataLoaders for Batching
# Create tensor dataset
dataset <- tensor_dataset(
x = torch_randn(1000, 784),
y = torch_randint(0, 10, c(1000, 1))
)
# Create dataloader with batching and shuffling
dl <- dataloader(dataset, batch_size = 32, shuffle = TRUE)
# Iterate over batches (use coro::loop(); a plain for loop does not work on dataloaders)
coro::loop(for (batch in dl) {
  x_batch <- batch[[1]]
  y_batch <- batch[[2]]
  # Process batch...
})
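For interactive inspection of a single batch, torch also exposes an explicit iterator interface via `dataloader_make_iter()` and `dataloader_next()`:

```r
library(torch)

ds <- tensor_dataset(torch_randn(1000, 784), torch_randint(0, 10, c(1000, 1)))
dl <- dataloader(ds, batch_size = 32, shuffle = TRUE)

it <- dataloader_make_iter(dl)
batch <- dataloader_next(it)
batch[[1]]$shape # (32, 784)
batch[[2]]$shape # (32, 1)
```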
Custom Datasets
For loading from files or applying custom transformations, create a custom dataset:
image_dataset <- dataset(
initialize = function(data_dir, transform = NULL) {
self$files <- list.files(data_dir, pattern = "\\.jpg$", full.names = TRUE)
self$transform <- transform
},
.getitem = function(i) {
img_path <- self$files[i]
img <- jpeg::readJPEG(img_path)
img_tensor <- torch_tensor(img)$permute(c(3, 1, 2)) # HWC to CHW
if (!is.null(self$transform)) {
img_tensor <- self$transform(img_tensor)
}
list(x = img_tensor, y = self$get_label(img_path)) # get_label() is a user-supplied helper
},
.length = function() {
length(self$files)
}
)
Data Augmentation
For image classification, apply augmentations during training:
augment <- function(x) {
  # Random horizontal flip (x is a C x H x W image tensor)
  if (runif(1) > 0.5) {
    x <- x$flip(3) # flip along the width dimension
  }
  # Random crops, color jitter, etc. can follow the same pattern
  x
}
Saving and Loading Models
Model persistence enables checkpointing during training and deployment:
# Save entire model (includes architecture)
torch_save(model, "model.pt")
# Save only state dict (recommended — architecture separate)
torch_save(model$state_dict(), "model_state.pt")
# Save checkpoint (includes optimizer state for resuming)
checkpoint <- list(
epoch = epoch,
model_state_dict = model$state_dict(),
optimizer_state_dict = optimizer$state_dict(),
loss = loss
)
torch_save(checkpoint, "checkpoint.pt")
# Load model
model <- nn_sequential(nn_linear(10, 32), nn_relu(), nn_linear(32, 1))
model$load_state_dict(torch_load("model_state.pt"))
model$eval()
# Load checkpoint and resume
checkpoint <- torch_load("checkpoint.pt")
model$load_state_dict(checkpoint$model_state_dict)
optimizer$load_state_dict(checkpoint$optimizer_state_dict)
start_epoch <- checkpoint$epoch + 1
Summary
The torch package brings PyTorch’s deep learning capabilities to R, enabling you to:
- Create tensors — the fundamental building blocks for all computations, with full GPU support
- Build neural networks — using predefined modules like `nn_linear` and `nn_conv2d`, or custom architectures via `nn_module()`
- Train models — with flexible training loops, diverse loss functions, and various optimizers
- Use GPU acceleration — for significant speedups on large-scale problems, with mixed precision for modern hardware
The API closely mirrors PyTorch’s Python interface, so adapting Python deep learning code or following PyTorch tutorials from R is straightforward. The main differences are R’s 1-based indexing (tensor dimensions and elements count from 1) and R syntax such as `$` for method access in place of Python’s `.`.
For production deployment, consider tracing trained models with `jit_trace()` and saving them with `jit_save()`; the resulting TorchScript files can be loaded from Python or C++ for interoperability with other deployment platforms.