Deep Learning with torch for R

· 9 min read · Updated March 17, 2026 · advanced
deep-learning neural-networks pytorch torch r

This guide covers the fundamentals: creating tensors, building neural networks, training models, and leveraging GPU acceleration. By the end, you’ll have a solid foundation for implementing deep learning solutions in R.

Installation

Install torch from CRAN or the development version from GitHub:

# From CRAN
install.packages("torch")

# Latest development version
remotes::install_github("mlverse/torch")

You’ll also need the LibTorch backend libraries. The torch package downloads them automatically:

library(torch)
install_torch()

The installation process downloads the LibTorch binaries directly — no Python environment is required — so you’ll need about 2GB of free disk space. First-time installation can take several minutes depending on your internet connection.
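Once installation finishes, a quick check confirms the backend is in place — torch_is_installed() plus a small tensor operation is enough:

```r
library(torch)

# Confirms the LibTorch binaries were downloaded and can be loaded
torch_is_installed()

# Smoke test: create a tensor and run a computation on it
x <- torch_tensor(c(1, 2, 3))
y <- x * 2
as.numeric(y)   # 2 4 6
```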

Creating Tensors

Tensors are the fundamental data structure in torch — multi-dimensional arrays similar to R’s matrices but with GPU support and automatic differentiation capabilities. Understanding tensors thoroughly is essential, as all operations in torch revolve around them.

Basic Tensor Creation

library(torch)

# From R objects
x <- torch_tensor(c(1, 2, 3, 4))
x <- torch_tensor(matrix(1:6, nrow = 2))

# Directly in torch
x <- torch_ones(3, 4)              # all ones
x <- torch_zeros(3, 4)             # all zeros
x <- torch_randn(3, 4)              # standard normal distribution
x <- torch_rand(3, 4)              # uniform [0, 1)
x <- torch_arange(1, 10)           # 1D tensor from 1 to 10
x <- torch_eye(5)                   # identity matrix

# With specific dtype
x <- torch_tensor(1:10, dtype = torch_float32())
x <- torch_tensor(1:10, dtype = torch_int64())

Tensor Properties and Operations

x <- torch_tensor(matrix(1:6, nrow = 2))

# Shape and dtype
dim(x)                      # returns c(2, 3)
x$shape                     # torch_size object
x$dtype                     # torch_int64

# Reshaping
y <- x$view(c(3, 2))        # reshape to 3x2 (must have same total elements)
y <- x$t()                  # transpose
y <- x$unsqueeze(1)         # add dimension at position 1

# Basic operations
y <- x * 2                  # element-wise multiplication
z <- torch_matmul(x, x$t())  # matrix multiplication
sum_x <- x$sum()
mean_x <- x$mean()
max_x <- x$max()

Slicing and Indexing

x <- torch_tensor(matrix(1:12, nrow = 3))

# Single element (returns tensor)
x[1, 1]

# Single element (returns as R scalar)
x[1, 1]$item()

# Slicing
x[1, ]                      # first row
x[, 2]                      # second column
x[1:2, 1:3]                 # rows 1-2, columns 1-3

# Boolean indexing
mask <- x > 5
x[mask]                     # elements greater than 5

Moving Between Devices

# Check if CUDA (GPU) is available
cuda_is_available()

# Create tensor on GPU if available
device <- if (cuda_is_available()) "cuda" else "cpu"
x <- torch_tensor(1:10, device = device)

# Move existing tensor to device
x <- x$to(device = "cpu")

# Check current device
x$device

Building Neural Networks

The nn_ prefix designates neural network modules — the building blocks for constructing models. Compose them with nn_sequential() for simple layer stacks, or subclass nn_module() for custom architectures that need more control.

Using nn_sequential

For straightforward architectures, nn_sequential provides a clean, readable way to stack layers:

model <- nn_sequential(
  nn_linear(784, 256),
  nn_relu(),
  nn_dropout(0.2),
  nn_linear(256, 10)
)

This creates a feedforward network with 784 inputs, a 256-neuron hidden layer with ReLU activation, 20% dropout for regularization, and 10 outputs — suitable for MNIST digit classification.
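As a sanity check, you can count the model’s trainable parameters and run a dummy batch through it. The count follows directly from the layer sizes: 784·256 + 256 weights and biases in the first linear layer, 256·10 + 10 in the second.

```r
library(torch)

model <- nn_sequential(
  nn_linear(784, 256),
  nn_relu(),
  nn_dropout(0.2),
  nn_linear(256, 10)
)

# (784*256 + 256) + (256*10 + 10) = 203,530 trainable parameters
n_params <- sum(sapply(model$parameters, function(p) p$numel()))
n_params

# Forward pass on a dummy batch of 16 flattened 28x28 images
out <- model(torch_randn(16, 784))
dim(out)   # 16 10
```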

Custom Modules

For complex architectures requiring custom forward logic or multiple branches, subclass nn_module():

net <- nn_module(
  "Net",
  initialize = function() {
    # First conv block
    self$conv1 <- nn_conv2d(1, 32, kernel_size = 3, padding = 1)
    self$bn1 <- nn_batch_norm2d(32)
    
    # Second conv block
    self$conv2 <- nn_conv2d(32, 64, kernel_size = 3, padding = 1)
    self$bn2 <- nn_batch_norm2d(64)
    
    # Fully connected layers
    self$fc1 <- nn_linear(64 * 7 * 7, 256)
    self$fc2 <- nn_linear(256, 10)
    
    self$pool <- nn_max_pool2d(2)
    self$relu <- nn_relu()
    self$dropout <- nn_dropout(0.5)
  },
  forward = function(x) {
    x <- x$reshape(c(x$size(1), 1, 28, 28))  # reshape flat input to image
    
    x <- self$pool(self$relu(self$bn1(self$conv1(x))))
    x <- self$pool(self$relu(self$bn2(self$conv2(x))))
    
    x <- x$view(c(x$size(1), -1))  # flatten to (batch, 64 * 7 * 7)
    x <- self$dropout(self$relu(self$fc1(x)))
    x <- self$fc2(x)
    
    x
  }
)

model <- net()

Common Layer Types

Layer                                          Description
nn_linear(in, out)                             Fully connected (dense) layer
nn_conv2d(in, out, kernel_size)                2D convolution for images
nn_conv1d(in, out, kernel_size)                1D convolution for sequences
nn_lstm(input_size, hidden_size, num_layers)   Long Short-Term Memory layer
nn_gru(input_size, hidden_size, num_layers)    Gated Recurrent Unit layer
nn_embedding(num_embeddings, embedding_dim)    Word embedding lookup
nn_dropout(p)                                  Dropout regularization
nn_batch_norm2d(num_features)                  Batch normalization
nn_layer_norm(normalized_shape)                Layer normalization
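Most of these layers appear in the examples that follow, but the sequence layers deserve a quick illustration. A minimal sketch pairing nn_embedding with nn_lstm (the sizes are arbitrary; note that embedding indices are 1-based in R):

```r
library(torch)

# 4 sequences of 12 token ids from a 1000-word vocabulary
emb  <- nn_embedding(num_embeddings = 1000, embedding_dim = 32)
lstm <- nn_lstm(input_size = 32, hidden_size = 64, batch_first = TRUE)

ids <- torch_randint(1, 1000, c(4, 12), dtype = torch_long())
x   <- emb(ids)        # 4 x 12 x 32
res <- lstm(x)         # list: output tensor plus hidden/cell states
dim(res[[1]])          # 4 12 64 — one 64-dim output per time step
```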

Activation Functions

nn_relu()       # Rectified Linear Unit
nn_sigmoid()    # Sigmoid activation
nn_tanh()       # Hyperbolic tangent
nn_softmax(dim) # Softmax (specify dimension)
nn_log_softmax(dim) # Log-softmax for numerical stability

Training Models

Training involves a repeated cycle: forward pass (compute predictions), loss computation (evaluate error), backward pass (compute gradients), and parameter updates (adjust weights). This is the fundamental training loop in deep learning.

The Training Loop

# Create model
model <- nn_sequential(
  nn_linear(10, 32),
  nn_relu(),
  nn_linear(32, 1)
)

# Loss function and optimizer
criterion <- nn_mse_loss()
optimizer <- optim_adam(model$parameters, lr = 0.01)

# Training data
set.seed(42)
x_data <- torch_randn(100, 10)
y_data <- x_data$sum(dim = 2, keepdim = TRUE) + torch_randn(100, 1) * 0.1

# Training loop
epochs <- 100
for (epoch in 1:epochs) {
  # Forward pass: compute predictions
  predictions <- model(x_data)
  
  # Compute loss
  loss <- criterion(predictions, y_data)
  
  # Backward pass: compute gradients
  optimizer$zero_grad()       # clear previous gradients
  loss$backward()            # compute new gradients
  
  # Update weights
  optimizer$step()
  
  if (epoch %% 10 == 0) {
    cat("Epoch:", epoch, "Loss:", loss$item(), "\n")
  }
}

Loss Functions

Different tasks require different loss functions. Choosing the right one is crucial for training success:

# Regression losses
nn_mse_loss()           # Mean Squared Error — good for smooth targets
nn_l1_loss()            # Mean Absolute Error — robust to outliers
nn_smooth_l1_loss()    # Smooth combination of MSE and MAE

# Classification losses
nn_cross_entropy_loss() # Cross-entropy for multi-class
nn_bce_loss()           # Binary Cross-Entropy for binary classification
nn_bce_with_logits_loss() # BCE with sigmoid built-in (more numerically stable)
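One R-specific gotcha: nn_cross_entropy_loss() takes raw logits (no softmax applied) and integer class labels, and in torch for R those labels are 1-based (1..C), mirroring R’s indexing rather than Python’s:

```r
library(torch)

criterion <- nn_cross_entropy_loss()

logits  <- torch_randn(8, 10)                                # batch of 8, 10 classes
targets <- torch_randint(1, 11, c(8), dtype = torch_long())  # labels in 1..10
loss <- criterion(logits, targets)
loss$item()   # a single scalar loss value
```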

Optimizers

The optimizer adjusts model parameters based on computed gradients. Choosing the right optimizer and learning rate often requires experimentation:

# Adam (adaptive learning rate, usually good default)
optim_adam(model$parameters, lr = 0.001)

# AdamW (Adam with proper weight decay regularization)
optim_adamw(model$parameters, lr = 0.001, weight_decay = 0.01)

# SGD with momentum (traditional, often needs tuning)
optim_sgd(model$parameters, lr = 0.01, momentum = 0.9)

# RMSprop (adaptive learning rate)
optim_rmsprop(model$parameters, lr = 0.01)

# Learning rate schedulers
scheduler <- lr_step(optimizer, step_size = 10, gamma = 0.5)
scheduler <- lr_reduce_on_plateau(optimizer, mode = "min", factor = 0.5, patience = 5)

# In training loop:
for (epoch in 1:epochs) {
  # ... training code ...
  scheduler$step()           # for step scheduler
  # scheduler$step(loss)     # for ReduceLROnPlateau
}

Training with Validation

For production models, you should monitor validation loss to detect overfitting:

# Split data
n_train <- 800
train_loader <- dataloader(tensor_dataset(
  x_train[1:n_train, ],
  y_train[1:n_train, ]
), batch_size = 32, shuffle = TRUE)

val_loader <- dataloader(tensor_dataset(
  x_train[(n_train + 1):nrow(x_train), ],
  y_train[(n_train + 1):nrow(y_train), ]
), batch_size = 32)

train <- function(model, train_loader, val_loader, epochs) {
  for (epoch in 1:epochs) {
    # Training phase
    model$train()
    train_loss <- 0
    coro::loop(for (batch in train_loader) {
      optimizer$zero_grad()
      output <- model(batch[[1]])
      loss <- criterion(output, batch[[2]])
      loss$backward()
      optimizer$step()
      train_loss <- train_loss + loss$item()
    })
    
    # Validation phase
    model$eval()
    val_loss <- 0
    with_no_grad({
      coro::loop(for (batch in val_loader) {
        output <- model(batch[[1]])
        val_loss <- val_loss + criterion(output, batch[[2]])$item()
      })
    })
    
    cat("Epoch:", epoch, 
        "Train Loss:", train_loss / length(train_loader), 
        "Val Loss:", val_loss / length(val_loader), "\n")
  }
}

Using GPU Acceleration

GPU acceleration dramatically speeds up training for large models and datasets. Modern GPUs have thousands of cores optimized for parallel tensor operations, making them ideal for deep learning.

Checking GPU Availability

cuda_is_available()              # Returns TRUE if CUDA is available
cuda_device_count()              # Number of available GPUs
cuda_current_device()            # Index of the current GPU

# For Apple Silicon (M1/M2/M3)
# backends_mps_is_available()    # Metal Performance Shaders

Moving Data and Models to GPU

device <- "cuda"

# Move model to GPU
model <- model$to(device = device)

# Move data to GPU
x_train <- x_train$to(device = device)
y_train <- y_train$to(device = device)

# Or use the $cuda() shorthand (returns a copy on the GPU)
x_train <- x_train$cuda()

Complete GPU Training Example

# Set device based on availability
device <- if (cuda_is_available()) "cuda" else "cpu"
cat("Using device:", device, "\n")

# Create model and move to GPU
model <- nn_sequential(
  nn_linear(784, 256),
  nn_relu(),
  nn_dropout(0.2),
  nn_linear(256, 10)
)$to(device = device)

optimizer <- optim_adam(model$parameters, lr = 0.001)
criterion <- nn_cross_entropy_loss()

# Training loop with GPU
for (epoch in 1:50) {
  model$train()
  optimizer$zero_grad()
  
  # Data must be on same device as model
  output <- model(x_data$to(device = device))
  loss <- criterion(output, y_data$to(device = device))
  
  loss$backward()
  optimizer$step()
  
  if (epoch %% 10 == 0) {
    cat("Epoch:", epoch, "Loss:", loss$item(), "\n")
  }
}

Mixed Precision Training

For even faster training on modern GPUs with Tensor Cores, use mixed precision. This uses float16 for most operations while keeping critical operations in float32:

model <- model$to(device = "cuda")
scaler <- cuda_amp_grad_scaler()

train_step <- function(x, y) {
  optimizer$zero_grad()
  
  # Operations inside the autocast scope run in float16 where safe
  with_autocast(device_type = "cuda", {
    output <- model(x)
    loss <- criterion(output, y)
  })
  
  # Scale loss, backward, unscale, update
  scaler$scale(loss)$backward()
  scaler$step(optimizer)
  scaler$update()
}

Mixed precision typically provides 1.5-3x speedup on modern GPUs with minimal accuracy impact.

Working with Data

Efficient data handling is crucial for training performance. The dataloader function provides batching, shuffling, and parallel data loading.

DataLoaders for Batching

# Create tensor dataset
dataset <- tensor_dataset(
  x = torch_randn(1000, 784),
  y = torch_randint(1, 11, c(1000), dtype = torch_long())  # labels 1-10
)

# Create dataloader with batching and shuffling
train_dl <- dataloader(dataset, batch_size = 32, shuffle = TRUE)

# Iterate over batches (dataloaders are iterated with coro::loop())
coro::loop(for (batch in train_dl) {
  x_batch <- batch[[1]]
  y_batch <- batch[[2]]
  # Process batch...
})

Custom Datasets

For loading from files or applying custom transformations, create a custom dataset:

image_dataset <- dataset(
  initialize = function(data_dir, transform = NULL) {
    self$files <- list.files(data_dir, pattern = "\\.jpg$", full.names = TRUE)
    self$transform <- transform
  },
  
  .getitem = function(i) {
    img_path <- self$files[i]
    img <- jpeg::readJPEG(img_path)
    img_tensor <- torch_tensor(img)$permute(c(3, 1, 2))  # HWC to CHW
    
    if (!is.null(self$transform)) {
      img_tensor <- self$transform(img_tensor)
    }
    
    list(x = img_tensor, y = self$get_label(img_path))
  },
  
  # Derive a label from the file name (e.g. "cat_001.jpg" -> "cat")
  get_label = function(path) {
    sub("_.*$", "", basename(path))
  },
  
  .length = function() {
    length(self$files)
  }
)
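Putting it together — a hypothetical usage sketch, assuming images/ is a directory of .jpg files (the path is an assumption, not part of the article) and that get_label() can derive a class from each file name:

```r
# Hypothetical directory of labeled .jpg images
ds <- image_dataset("images/")
length(ds)                      # number of images found

dl <- dataloader(ds, batch_size = 16, shuffle = TRUE)
coro::loop(for (batch in dl) {
  # batch$x: 16 x C x H x W image tensor, batch$y: labels
})
```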

Data Augmentation

For image classification, apply augmentations during training:

augment <- function(x) {
  # Random horizontal flip (x is a C x H x W tensor; dim 3 is width)
  if (runif(1) > 0.5) {
    x <- torch_flip(x, dims = 3)
  }
  # Random crop
  # ... (more augmentations)
  x
}

Saving and Loading Models

Model persistence enables checkpointing during training and deployment:

# Save entire model (includes architecture)
torch_save(model, "model.pt")

# Save only state dict (recommended — architecture separate)
torch_save(model$state_dict(), "model_state.pt")

# Save checkpoint (includes optimizer state for resuming)
checkpoint <- list(
  epoch = epoch,
  model_state_dict = model$state_dict(),
  optimizer_state_dict = optimizer$state_dict(),
  loss = loss
)
torch_save(checkpoint, "checkpoint.pt")

# Load model
model <- nn_sequential(nn_linear(10, 32), nn_relu(), nn_linear(32, 1))
model$load_state_dict(torch_load("model_state.pt"))
model$eval()

# Load checkpoint and resume
checkpoint <- torch_load("checkpoint.pt")
model$load_state_dict(checkpoint$model_state_dict)
optimizer$load_state_dict(checkpoint$optimizer_state_dict)
start_epoch <- checkpoint$epoch + 1

Summary

The torch package brings PyTorch’s deep learning capabilities to R, enabling you to:

  • Create tensors — the fundamental building blocks for all computations, with full GPU support
  • Build neural networks — using predefined modules like nn_linear and nn_conv2d, or custom architectures via nn_module()
  • Train models — with flexible training loops, diverse loss functions, and various optimizers
  • Use GPU acceleration — for significant speedups on large-scale problems, with mixed precision for modern hardware

The API closely mirrors PyTorch’s Python interface, making it straightforward to adapt Python deep learning code or follow PyTorch tutorials while working in R. The main differences are R’s 1-based indexing and the pipe operator %>% (or R’s native |> in R 4.1+) for chaining operations.
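A small illustration of the indexing difference — everything is 1-based and slices are end-inclusive, including dim arguments to reductions:

```r
library(torch)

x <- torch_tensor(c(10, 20, 30, 40))
x[1]$item()          # 10 — first element (Python torch would use x[0])
x[2:3]               # elements 2 and 3; the end index is included

m <- torch_tensor(matrix(1:6, nrow = 2))
m$sum(dim = 1)       # dims are 1-based too: collapses rows, giving column sums
```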

For production deployment, consider tracing trained models to TorchScript with jit_trace() and saving them with jit_save() — the resulting file can be loaded from Python PyTorch or other LibTorch runtimes.