Deep Learning with torch for R

This guide covers the fundamentals: creating tensors, building neural networks, training models, and using GPU acceleration. By the end, you’ll have a solid foundation for implementing deep learning solutions in R.

Installation

Install torch from CRAN or the development version from GitHub:

# From CRAN
install.packages("torch")

# Latest development version
remotes::install_github("mlverse/torch")

You do not need Python or PyTorch installed. The torch package is built directly on libtorch, the C++ library that also powers PyTorch. Behind the scenes, install_torch() downloads prebuilt libtorch binaries (plus the small bridging library torch uses to call them) that match your operating system. Because no Python is involved, there is nothing to conflict with a system Python installation.

library(torch)
install_torch()

The installation process downloads the libtorch binaries, so you’ll need about 2GB of free disk space. First-time installation can take several minutes depending on your internet connection.

Creating tensors

Tensors are the fundamental data structure in torch, multi-dimensional arrays similar to R’s matrices but with GPU support and automatic differentiation capabilities. Understanding tensors thoroughly is essential, as all operations in torch revolve around them.

Basic tensor creation

You can create tensors from R vectors and matrices with torch_tensor(), or generate them directly with factory functions like torch_ones(), torch_randn(), and torch_arange(). R integers become 64-bit integer tensors and R doubles become 32-bit float tensors, so specify dtype explicitly whenever the integer/float choice matters for your model.

library(torch)

# From R objects — vectors and matrices become tensors
x <- torch_tensor(c(1, 2, 3, 4))
x <- torch_tensor(matrix(1:6, nrow = 2))

# Factory functions — create tensors with known patterns
x <- torch_zeros(3, 4)       # all zeros, useful for bias terms
x <- torch_randn(3, 4)       # standard normal, good for weight initialisation
x <- torch_rand(3, 4)        # uniform [0, 1)
x <- torch_arange(1, 10)     # 1D sequence from 1 to 10

# Explicit dtype — torch_float32() for model parameters
x <- torch_tensor(1:10, dtype = torch_float32())

Tensor properties and operations

Inspecting tensor shape and dtype is essential for debugging dimension mismatches, the most common source of errors in deep learning code. Both dim() and $shape return the shape as an R integer vector, and $dtype reports the element type. Reshaping with $view() changes the dimensions without copying data, but the total number of elements must remain unchanged and the tensor must be contiguous in memory.

x <- torch_tensor(matrix(1:6, nrow = 2))

dim(x)                   # 2 3
x$shape                  # 2 3 (same integer vector as dim(x))
x$dtype                  # torch_Long (int64), because 1:6 is an R integer vector

# Reshaping — $view() is a view, not a copy
y <- x$view(c(3, 2))     # reshape to 3x2
y <- x$t()               # transpose

# Element-wise and matrix operations
y <- x * 2               # scalar broadcast
z <- torch_matmul(x, x$t())  # matrix multiply
x$sum()                  # sum of all elements
max_x <- x$max()
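
The point about $view() not copying data has two practical consequences that are easy to check. The sketch below uses illustrative variable names: it shows that an in-place change made through a view is visible in the source tensor, and that a transposed tensor is no longer contiguous, which is when $reshape() is the safer choice.

```r
library(torch)

# A contiguous 2x3 float tensor holding 1..6
x <- torch_arange(1, 6)$view(c(2, 3))

# A view shares storage with its source: an in-place change shows up in both
y <- x$view(c(3, 2))
y$mul_(2)                      # trailing underscore means "modify in place"
total <- x$sum()$item()        # x now holds 2, 4, ..., 12

# Transposing only changes strides, so the result is not contiguous
xt <- x$t()
contiguous <- xt$is_contiguous()

# $reshape() falls back to a copy when a view is impossible
flat <- xt$reshape(6)
```

If you specifically need a view of a non-contiguous tensor, calling $contiguous() first makes the copy explicit.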

Slicing and indexing

Tensor indexing follows similar patterns to R matrix indexing but with important differences. Individual elements are accessed with bracket notation, while boolean masks select elements based on a condition expressed as another tensor. Mastering indexing is necessary because data preprocessing and batch manipulation both depend on extracting the right slices from tensors. Operations like filtering rows by label or selecting specific feature columns are done through indexing before the data enters your model.

x <- torch_tensor(matrix(1:12, nrow = 3))

# Single element (returns tensor)
x[1, 1]

# Single element (returns as R scalar)
x[1, 1]$item()

# Slicing
x[1, ]                      # first row
x[, 2]                      # second column
x[1:2, 1:3]                 # rows 1-2, columns 1-3

# Boolean indexing
mask <- x > 5
x[mask]                     # elements greater than 5
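
The "important differences" from base R indexing are mostly about negative indices and dropped dimensions. The following sketch, with illustrative variable names, shows the behaviours that most often surprise R users.

```r
library(torch)

x <- torch_arange(1, 12)$view(c(3, 4))   # rows: 1..4, 5..8, 9..12

first       <- x[1, 1]$item()            # indexing is 1-based, as in R
last_row    <- x[-1, ]                   # negative index counts from the end
                                         # (it does NOT exclude, unlike base R)
row_dropped <- x[1, ]                    # single index drops that dimension
row_kept    <- x[1, , drop = FALSE]      # keep it as a 1 x 4 tensor
n_big       <- x[x > 5]$numel()          # a boolean mask returns a 1D tensor
```

Keeping the dimension with drop = FALSE is often what you want when a layer expects a batch dimension.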

Moving between devices

Tensors reside on a specific device (CPU or GPU) and operations between tensors on different devices will fail. Use $to() to move tensors between devices, and cuda_is_available() to check whether GPU acceleration is supported on your machine. Proper device management is necessary for multi-GPU workflows.

# Check if CUDA (GPU) is available
cuda_is_available()

# Create tensor on GPU if available
device <- if (cuda_is_available()) "cuda" else "cpu"
x <- torch_tensor(1:10, device = device)

# Move existing tensor to device
x <- x$to(device = "cpu")

# Check current device
x$device
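
A common way to avoid device-mismatch errors is to pick the device once and pass it everywhere. The sketch below assumes a single device for the whole script; the variable names are illustrative.

```r
library(torch)

# Choose the device once; fall back to the CPU when no GPU is present
device <- torch_device(if (cuda_is_available()) "cuda" else "cpu")

# Create one tensor directly on the device and move another one to it
a <- torch_ones(2, 2, device = device)
b <- torch_ones(2, 2)$to(device = device)

# Both operands live on the same device, so the operation succeeds
result <- a + b
device_type <- result$device$type   # "cuda" or "cpu"
```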

Building neural networks

The nn_ prefix designates neural network modules, the building blocks for constructing models. Compose them using nn_sequential() for simple stacks of layers, or define your own with nn_module() for custom architectures requiring more control.

Using nn_sequential

For straightforward architectures, nn_sequential provides a clean, readable way to stack layers:

model <- nn_sequential(
  nn_linear(784, 256),
  nn_relu(),
  nn_dropout(0.2),
  nn_linear(256, 10)
)

This creates a feedforward network with 784 inputs, a 256-neuron hidden layer with ReLU activation, 20% dropout for regularization, and 10 outputs, suitable for MNIST digit classification.
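
A quick way to confirm that the layer sizes line up is to push a dummy batch through the network and count its parameters. This sketch rebuilds the same model; the batch size of 32 is an arbitrary choice.

```r
library(torch)

model <- nn_sequential(
  nn_linear(784, 256),
  nn_relu(),
  nn_dropout(0.2),
  nn_linear(256, 10)
)
model$eval()                       # disable dropout for a deterministic check

dummy <- torch_randn(32, 784)      # 32 fake flattened 28x28 images
out_shape <- dim(model(dummy))     # one row of 10 scores per image

# Weights and biases of both linear layers:
# 784*256 + 256 + 256*10 + 10
n_params <- sum(sapply(model$parameters, function(p) p$numel()))
```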

Custom modules

For complex architectures requiring custom forward logic or multiple branches, subclass nn_module():

net <- nn_module(
  "Net",
  initialize = function() {
    # First conv block
    self$conv1 <- nn_conv2d(1, 32, kernel_size = 3, padding = 1)
    self$bn1 <- nn_batch_norm2d(32)
    
    # Second conv block
    self$conv2 <- nn_conv2d(32, 64, kernel_size = 3, padding = 1)
    self$bn2 <- nn_batch_norm2d(64)
    
    # Fully connected layers
    self$fc1 <- nn_linear(64 * 7 * 7, 256)
    self$fc2 <- nn_linear(256, 10)
    
    self$pool <- nn_max_pool2d(2)
    self$relu <- nn_relu()
    self$dropout <- nn_dropout(0.5)
  },
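  forward = function(x) {
    # Reshape flat input (batch, 784) to images (batch, 1, 28, 28)
    x <- x$reshape(c(x$size(1), 1, 28, 28))
    
    # Conv blocks: conv -> batch norm -> ReLU -> 2x2 max pool (28 -> 14 -> 7)
    x <- self$pool(self$relu(self$bn1(self$conv1(x))))
    x <- self$pool(self$relu(self$bn2(self$conv2(x))))
    
    # Flatten to (batch, 64 * 7 * 7)
    x <- x$view(c(x$size(1), -1))
    
    # Classifier head; the last expression is the return value
    x <- self$relu(self$fc1(x))
    x <- self$dropout(x)
    self$fc2(x)
  }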
)

model <- net()
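
Custom forward logic is most useful when data does not flow in a straight line. As a minimal sketch of a branching architecture, the hypothetical residual_block below (not part of torch) adds its input back to the output of two linear layers, something nn_sequential() cannot express.

```r
library(torch)

# Hypothetical module for illustration: a two-layer block with a skip connection
residual_block <- nn_module(
  "ResidualBlock",
  initialize = function(dim) {
    self$fc1 <- nn_linear(dim, dim)
    self$fc2 <- nn_linear(dim, dim)
  },
  forward = function(x) {
    h <- nnf_relu(self$fc1(x))   # main branch
    x + self$fc2(h)              # skip branch: add the input back
  }
)

block <- residual_block(16)
out <- block(torch_randn(4, 16))
out_shape <- dim(out)            # same shape as the input
```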

Common layer types

Layer                                           Description
nn_linear(in, out)                              Fully connected (dense) layer
nn_conv2d(in, out, kernel_size)                 2D convolution for images
nn_conv1d(in, out, kernel_size)                 1D convolution for sequences
nn_lstm(input_size, hidden_size, num_layers)    Long Short-Term Memory layer
nn_gru(input_size, hidden_size, num_layers)     Gated Recurrent Unit layer
nn_embedding(num_embeddings, embedding_dim)     Word embedding lookup
nn_dropout(p)                                   Dropout regularization
nn_batch_norm2d(num_features)                   Batch normalization
nn_layer_norm(normalized_shape)                 Layer normalization

Activation functions

Activation functions introduce non-linearity into neural networks, allowing them to learn complex patterns. Each activation function has specific use cases: ReLU is the default for hidden layers, sigmoid for binary output layers, and softmax for multi-class classification:

nn_relu()       # Rectified Linear Unit
nn_sigmoid()    # Sigmoid activation
nn_tanh()       # Hyperbolic tangent
nn_softmax(dim) # Softmax (specify dimension)
nn_log_softmax(dim) # Log-softmax for numerical stability
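
The dim argument decides which axis is normalised. The sketch below uses the functional forms nnf_softmax() and nnf_log_softmax() on made-up scores for two samples and three classes, with dim = 2 so that each row becomes a probability distribution.

```r
library(torch)

# Two samples (rows), three class scores each (columns)
logits <- torch_tensor(matrix(c(2, 1, 0.1,
                                0.5, 0.5, 3), nrow = 2, byrow = TRUE))

probs <- nnf_softmax(logits, dim = 2)          # normalise across classes
row_sums <- probs$sum(dim = 2)                 # each row sums to 1

# log-softmax gives the same values as log(softmax), computed more stably
log_probs <- nnf_log_softmax(logits, dim = 2)
same <- torch_allclose(log_probs, torch_log(probs))
```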

Training models

Training is a repeated cycle of four steps. The forward pass computes predictions, the loss function scores them against the ground truth, the backward pass (backpropagation) computes how much each parameter contributed to the error, and the optimizer uses those gradients to adjust the weights. Each iteration through this cycle moves the model parameters closer to values that minimise the loss.

The training loop

# Create model
model <- nn_sequential(
  nn_linear(10, 32),
  nn_relu(),
  nn_linear(32, 1)
)

# Loss function and optimizer
criterion <- nn_mse_loss()
optimizer <- optim_adam(model$parameters, lr = 0.01)

# Training data
torch_manual_seed(42)   # seeds torch's generator; set.seed() does not affect torch_randn()
x_data <- torch_randn(100, 10)
y_data <- x_data$sum(dim = 2, keepdim = TRUE) + torch_randn(100, 1) * 0.1

# Training loop
epochs <- 100
for (epoch in 1:epochs) {
  # Forward pass: compute predictions
  predictions <- model(x_data)
  
  # Compute loss
  loss <- criterion(predictions, y_data)
  
  # Backward pass: compute gradients
  optimizer$zero_grad()       # clear previous gradients
  loss$backward()            # compute new gradients
  
  # Update weights
  optimizer$step()
  
  if (epoch %% 10 == 0) {
    cat("Epoch:", epoch, "Loss:", loss$item(), "\n")
  }
}
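
To see what loss$backward() and optimizer$step() are doing, it can help to write one update by hand. The sketch below is a simplified illustration, not how optim_adam() works internally: it fits a single weight w so that w * x matches y = 2 * x using plain gradient descent with an arbitrary learning rate.

```r
library(torch)

x <- torch_tensor(c(1, 2, 3))
y <- 2 * x                                   # the weight we hope to recover is 2

w <- torch_zeros(1, requires_grad = TRUE)    # parameter tracked by autograd
lr <- 0.05

for (step in 1:200) {
  loss <- ((w * x - y)^2)$mean()             # forward pass and loss
  loss$backward()                            # fills w$grad with d(loss)/d(w)

  with_no_grad({
    w$sub_(lr * w$grad)                      # the "optimizer step"
    w$grad$zero_()                           # the "zero_grad" for the next step
  })
}

w_hat <- w$item()
```

Gradients accumulate across backward() calls, which is why they must be cleared on every iteration.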

Loss functions

Different tasks require different loss functions. Choosing the right one is essential because the loss function defines what “good” means for your model: minimizing the wrong loss can yield a model that converges numerically but produces useless predictions.

# Regression losses
nn_mse_loss()           # Mean Squared Error, good for smooth targets
nn_l1_loss()            # Mean Absolute Error, robust to outliers
nn_smooth_l1_loss()     # Quadratic for small errors, linear for large ones

# Classification losses
nn_cross_entropy_loss() # Cross-entropy for multi-class
nn_bce_loss()           # Binary Cross-Entropy for binary classification
nn_bce_with_logits_loss() # BCE with sigmoid built-in (more numerically stable)
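
Two details of the classification losses are worth checking in code. The sketch below, using made-up values, shows that nn_bce_with_logits_loss() on raw scores matches nn_bce_loss() on sigmoid outputs, and that nn_cross_entropy_loss() takes raw scores together with 1-based integer class labels.

```r
library(torch)

# Binary case: raw scores (logits) and 0/1 targets as floats
logits <- torch_tensor(c(-1.5, 0.3, 2.0))
target <- torch_tensor(c(0, 1, 1))

loss_logits <- nn_bce_with_logits_loss()(logits, target)
loss_probs  <- nn_bce_loss()(torch_sigmoid(logits), target)
same <- torch_allclose(loss_logits, loss_probs)

# Multi-class case: 4 samples, 3 classes, labels in 1..3 (R integers -> long)
scores <- torch_randn(4, 3)
labels <- torch_tensor(c(1L, 3L, 2L, 1L))
ce <- nn_cross_entropy_loss()(scores, labels)$item()
```

Because cross-entropy applies log-softmax internally, the model's final layer should output raw scores rather than softmax probabilities.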

Optimizers

The optimizer takes the gradients from the backward pass and decides how to update each parameter. Different optimizers use different strategies for this update step, so choosing one, along with its learning rate, often requires experimentation. Adam is a reliable default, while SGD with momentum can achieve better final results given sufficient tuning.

# Adam (adaptive learning rate, usually good default)
optim_adam(model$parameters, lr = 0.001)

# AdamW (Adam with proper weight decay regularization)
optim_adamw(model$parameters, lr = 0.001, weight_decay = 0.01)

# SGD with momentum (traditional, often needs tuning)
optim_sgd(model$parameters, lr = 0.01, momentum = 0.9)

# RMSprop (adaptive learning rate)
optim_rmsprop(model$parameters, lr = 0.01)

# Learning rate schedulers (pick one)
scheduler <- lr_step(optimizer, step_size = 10, gamma = 0.5)
# scheduler <- lr_reduce_on_plateau(optimizer, mode = "min", factor = 0.5, patience = 5)

# In training loop:
for (epoch in 1:epochs) {
  # ... training code ...
  scheduler$step()           # for lr_step
  # scheduler$step(loss)     # for lr_reduce_on_plateau (pass the monitored loss)
}
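
To make the effect of a step scheduler concrete, the sketch below records the learning rate over 20 epochs of a toy problem. It assumes the current rate can be read from optimizer$param_groups[[1]]$lr; the single parameter w and the loss are placeholders.

```r
library(torch)

w <- torch_zeros(1, requires_grad = TRUE)          # placeholder parameter
optimizer <- optim_sgd(list(w), lr = 0.1)
scheduler <- lr_step(optimizer, step_size = 10, gamma = 0.5)

lrs <- numeric(20)
for (epoch in 1:20) {
  optimizer$zero_grad()
  loss <- ((w - 1)^2)$sum()
  loss$backward()
  optimizer$step()

  lrs[epoch] <- optimizer$param_groups[[1]]$lr     # rate used this epoch
  scheduler$step()                                 # halves the rate every 10 epochs
}

final_lr <- optimizer$param_groups[[1]]$lr
```

With these settings the rate should stay at 0.1 for the first ten epochs and drop to 0.05 for the next ten.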

Training with validation

For production models, you should monitor validation loss to detect overfitting. A validation set provides an unbiased estimate of how well your model generalizes to unseen data. If training loss keeps dropping while validation loss starts rising, your model is memorizing the training data rather than learning generalisable patterns. Monitoring both metrics during training lets you decide when to stop or adjust regularisation.

# Split data
n_train <- 800
train_loader <- dataloader(tensor_dataset(
  x_train[1:n_train, ],
  y_train[1:n_train, ]
), batch_size = 32, shuffle = TRUE)

val_loader <- dataloader(tensor_dataset(
  x_train[(n_train + 1):nrow(x_train), ],
  y_train[(n_train + 1):nrow(y_train), ]
), batch_size = 32)

train <- function(model, train_loader, val_loader, epochs) {
  for (epoch in 1:epochs) {
    # Training phase
    model$train()
    train_loss <- 0
    for (batch in train_loader) {
      optimizer$zero_grad()
      output <- model(batch[[1]])
      loss <- criterion(output, batch[[2]])
      loss$backward()
      optimizer$step()
      train_loss <- train_loss + loss$item()
    }
    
    # Validation phase
    model$eval()
    val_loss <- 0
    with_no_grad({
      for (batch in val_loader) {
        output <- model(batch[[1]])
        val_loss <- val_loss + criterion(output, batch[[2]])$item()
      }
    })
    
    cat("Epoch:", epoch, 
        "Train Loss:", train_loss / length(train_loader), 
        "Val Loss:", val_loss / length(val_loader), "\n")
  }
}
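
One simple way to act on the validation loss is early stopping. The helper below is a hypothetical sketch, not part of torch: it reports whether the last `patience` epochs failed to improve on the best validation loss seen before them.

```r
# Hypothetical helper: TRUE when validation loss has stalled for `patience` epochs
should_stop <- function(val_losses, patience = 5, min_delta = 0) {
  n <- length(val_losses)
  if (n <= patience) return(FALSE)            # not enough history yet

  best_before <- min(val_losses[1:(n - patience)])
  recent_best <- min(val_losses[(n - patience + 1):n])

  # Stop if nothing recent beats the earlier best by at least min_delta
  recent_best > best_before - min_delta
}

stalled   <- should_stop(c(1.0, 0.8, 0.7, 0.71, 0.72, 0.73), patience = 3)
improving <- should_stop(c(1.0, 0.8, 0.7, 0.6), patience = 3)
```

Inside a training loop you would append each epoch's validation loss to a vector and break out of the loop when the helper returns TRUE.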

Using GPU acceleration

GPU acceleration dramatically speeds up training for large models and datasets. Modern GPUs have thousands of cores optimized for parallel tensor operations, making them ideal for deep learning. For models with millions of parameters, moving computation to a GPU can reduce training time from hours to minutes. Before writing GPU-dependent code, confirm that your system supports it.

Checking GPU availability

cuda_is_available()        # Returns TRUE if CUDA is available
cuda_device_count()        # Number of available GPUs
cuda_current_device()      # Index of the currently selected GPU

# For Apple Silicon (M1/M2/M3)
# backends_mps_is_available()  # Metal Performance Shaders

Moving data and models to GPU

Once you have confirmed GPU availability, the next step is transferring your tensors and model to the GPU device. By default tensors and models reside on the CPU, and any operation between tensors on different devices will raise an error. The $to() method handles this transfer for both data and model objects, moving all parameters and buffers to the target device in a single call.

device <- "cuda"

# Move model to GPU
model <- model$to(device = device)

# Move data to GPU
x_train <- x_train$to(device = device)
y_train <- y_train$to(device = device)

# Or use the $cuda() shorthand (returns a copy on the GPU, so reassign)
x_train <- x_train$cuda()

Complete GPU training example

The example below ties together device selection, model creation on the GPU, and a full training loop. Notice that the data tensors must be on the same device as the model before the forward pass. Forgetting to move data to the GPU is a common mistake that produces confusing device-mismatch errors.

# Set device based on availability
device <- if (cuda_is_available()) "cuda" else "cpu"
cat("Using device:", device, "\n")

# Create model and move to GPU
model <- nn_sequential(
  nn_linear(784, 256),
  nn_relu(),
  nn_dropout(0.2),
  nn_linear(256, 10)
)$to(device = device)

optimizer <- optim_adam(model$parameters, lr = 0.001)
criterion <- nn_cross_entropy_loss()

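# Synthetic stand-in data: 256 flattened 28x28 images with labels 1..10
# (cross-entropy in torch for R expects 1-based class indices of type long)
x_data <- torch_randn(256, 784)
y_data <- torch_randint(1, 11, size = 256, dtype = torch_long())

# Training loop with GPU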
for (epoch in 1:50) {
  model$train()
  optimizer$zero_grad()
  
  # Data must be on same device as model
  output <- model(x_data$to(device = device))
  loss <- criterion(output, y_data$to(device = device))
  
  loss$backward()
  optimizer$step()
  
  if (epoch %% 10 == 0) {
    cat("Epoch:", epoch, "Loss:", loss$item(), "\n")
  }
}

Mixed precision training

For even faster training on modern GPUs with Tensor Cores, use mixed precision. This uses float16 for most operations while keeping critical operations in float32. Mixed precision reduces memory usage and speeds up matrix multiplications on hardware with Tensor Cores such as NVIDIA Volta, Turing, and Ampere GPUs. In torch for R, with_autocast() selects the appropriate precision for each operation inside its code block, and a gradient scaler guards against float16 underflow.

model <- model$to(device = "cuda")
scaler <- cuda_amp_grad_scaler()

train_step <- function(x, y) {
  optimizer$zero_grad()
  
  # Run the forward pass under autocast so eligible ops use float16
  loss <- with_autocast(device_type = "cuda", {
    output <- model(x)
    criterion(output, y)
  })
  
  # Scale the loss and backpropagate; the scaler unscales the gradients
  # before stepping and skips the step if they contain Inf or NaN
  scaler$scale(loss)$backward()
  scaler$step(optimizer)
  scaler$update()
}

Mixed precision typically provides 1.5-3x speedup on modern GPUs with minimal accuracy impact.

Working with data

Efficient data handling is critical for training performance. The dataloader function provides batching, shuffling, and parallel data loading. Batching groups multiple samples together so the GPU can process them in parallel, while shuffling prevents the model from learning spurious patterns tied to data ordering. Without proper batching, training on large datasets becomes impractically slow.

DataLoaders for batching

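# Create tensor dataset: 1000 samples with integer class labels in 1..10
dataset <- tensor_dataset(
  x = torch_randn(1000, 784),
  y = torch_randint(1, 11, size = 1000, dtype = torch_long())
)

# Create dataloader with batching and shuffling
# (named train_dl so it does not shadow the dataloader() function)
train_dl <- dataloader(dataset, batch_size = 32, shuffle = TRUE)

# Iterate over batches
for (batch in train_dl) {
  x_batch <- batch[[1]]   # shape: (batch_size, 784)
  y_batch <- batch[[2]]   # shape: (batch_size)
  # Process batch...
}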
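
Batch arithmetic is a frequent source of off-by-one surprises, so it is worth checking how many batches a dataloader will yield. The sketch below uses placeholder data and shows the effect of the drop_last argument when the dataset size is not a multiple of the batch size.

```r
library(torch)

# Placeholder data: 1000 samples, 784 features, dummy targets
ds <- tensor_dataset(torch_randn(1000, 784), torch_zeros(1000))

dl <- dataloader(ds, batch_size = 32)
n_batches <- length(dl)                  # ceiling(1000 / 32), last batch has 8 rows

# Pull a single batch without writing a loop
first <- dataloader_next(dataloader_make_iter(dl))
first_shape <- dim(first[[1]])

# drop_last = TRUE discards the incomplete final batch
dl_full <- dataloader(ds, batch_size = 32, drop_last = TRUE)
n_full <- length(dl_full)                # floor(1000 / 32)
```

Dropping the last batch can be useful with layers such as batch normalization, which behave poorly on very small batches.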

Custom datasets

For loading from files or applying custom transformations, create a custom dataset. A custom dataset subclass gives you full control over how samples are loaded, preprocessed, and labelled. This is necessary when working with image files on disk, text corpora, or any data that does not fit neatly into an in-memory tensor. You define the total number of items, how each item is retrieved, and any transformations to apply during retrieval.

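image_dataset <- dataset(
  name = "image_dataset",
  
  initialize = function(data_dir, transform = NULL) {
    self$files <- list.files(data_dir, pattern = "\\.jpg$", full.names = TRUE)
    self$transform <- transform
  },
  
  # Placeholder labelling rule (an assumption, adapt to your data):
  # the class is the file name prefix, e.g. "cat_001.jpg" -> "cat".
  # Map these strings to integer class indices before computing a loss.
  get_label = function(path) {
    sub("_.*$", "", basename(path))
  },
  
  .getitem = function(i) {
    img_path <- self$files[i]
    img <- jpeg::readJPEG(img_path)   # assumes RGB images: array of (H, W, 3)
    img_tensor <- torch_tensor(img)$permute(c(3, 1, 2))  # HWC to CHW
    
    if (!is.null(self$transform)) {
      img_tensor <- self$transform(img_tensor)
    }
    
    list(x = img_tensor, y = self$get_label(img_path))
  },
  
  .length = function() {
    length(self$files)
  }
)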
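
The same three methods work for data that is already in memory. As a minimal, self-contained sketch, the hypothetical iris_dataset below wraps R's built-in iris data frame, so it can be run without any image files.

```r
library(torch)

# Hypothetical dataset for illustration, built on the iris data frame
iris_dataset <- dataset(
  name = "iris_dataset",
  initialize = function(df) {
    # Four numeric features as float32, species as 1-based integer labels
    self$x <- torch_tensor(unname(as.matrix(df[, 1:4])), dtype = torch_float32())
    self$y <- torch_tensor(as.integer(df$Species))
  },
  .getitem = function(i) {
    list(x = self$x[i, ], y = self$y[i])
  },
  .length = function() {
    self$x$size(1)
  }
)

ds <- iris_dataset(iris)
n <- length(ds)                    # number of samples
item <- ds$.getitem(1)             # one sample: 4 features and a label
dl <- dataloader(ds, batch_size = 50, shuffle = TRUE)
n_batches <- length(dl)
```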

Data augmentation

For image classification, apply augmentations during training. Data augmentation artificially expands your training set by applying random transformations like flips, rotations, and colour jitter. This improves model generalisation by exposing it to varied versions of each image, which reduces overfitting without requiring more labelled data. Augmentations should only be applied during training, never during validation or inference.

augment <- function(x) {
  # Random horizontal flip: reverse the last (width) dimension half the time
  if (runif(1) > 0.5) {
    x <- torch_flip(x, dims = length(dim(x)))
  }
  # Random crop
  # ... (more augmentations)
  x
}
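
Because augmentations are random, it helps to test the underlying operation deterministically first. The sketch below applies torch_flip() to a tiny single-channel "image" in CHW layout, where dimension 3 is the width.

```r
library(torch)

# 1 channel, 2x2 pixels: rows are (1, 2) and (3, 4)
img <- torch_arange(1, 4)$view(c(1, 2, 2))

# Reversing the width dimension mirrors each row left to right
flipped <- torch_flip(img, dims = 3)
first_row  <- as_array(flipped[1, 1, ])
second_row <- as_array(flipped[1, 2, ])
```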

Saving and loading models

Model persistence enables checkpointing during training and deployment. Saving model weights lets you resume interrupted training runs, share trained models with collaborators, and deploy to production without retraining. The recommended approach is to save the state dictionary rather than the entire model object, because the state dict is smaller and does not depend on the exact class definition being available at load time.

# Save entire model (includes architecture)
torch_save(model, "model.pt")

# Save only state dict (recommended, keeping architecture separate)
torch_save(model$state_dict(), "model_state.pt")

# Save checkpoint (includes optimizer state for resuming)
checkpoint <- list(
  epoch = epoch,
  model_state_dict = model$state_dict(),
  optimizer_state_dict = optimizer$state_dict(),
  loss = loss$item()   # store a plain number rather than a tensor with its graph
)
torch_save(checkpoint, "checkpoint.pt")

# Load model
model <- nn_sequential(nn_linear(10, 32), nn_relu(), nn_linear(32, 1))
model$load_state_dict(torch_load("model_state.pt"))
model$eval()

# Load checkpoint and resume
checkpoint <- torch_load("checkpoint.pt")
model$load_state_dict(checkpoint$model_state_dict)
optimizer$load_state_dict(checkpoint$optimizer_state_dict)
start_epoch <- checkpoint$epoch + 1
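
A round trip through a temporary file is a cheap way to confirm that saving and loading preserve the weights. The sketch below assumes the same small architecture as the loading example and compares the outputs of the original and restored models on random input.

```r
library(torch)

model <- nn_sequential(nn_linear(10, 32), nn_relu(), nn_linear(32, 1))
model$eval()

# Save only the weights to a temporary file
path <- tempfile(fileext = ".pt")
torch_save(model$state_dict(), path)

# Rebuild the same architecture, then load the saved weights into it
restored <- nn_sequential(nn_linear(10, 32), nn_relu(), nn_linear(32, 1))
restored$load_state_dict(torch_load(path))
restored$eval()

# Identical weights should give identical predictions
x <- torch_randn(5, 10)
same <- torch_allclose(model(x), restored(x))
```

Tensors must be saved with torch_save() rather than saveRDS(), because they are pointers to memory managed outside of R.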

See also