Feature Engineering in R

· 6 min read · Updated March 27, 2026 · intermediate
r machine-learning feature-engineering tidymodels dplyr

Feature engineering is the process of selecting, modifying, and creating variables in your dataset so that machine learning models can extract the best possible signal from your data. Switching algorithms gets attention, but well-engineered features often deliver bigger accuracy gains. R’s tidyverse ecosystem and tidymodels framework give you coherent, reproducible tools for the entire pipeline.

Numeric Transformations

Raw numeric variables frequently need reshaping before they are useful to a model. Skewed distributions, variables on wildly different scales, and outliers all cause problems.

Scaling and Standardization

scale() centers a vector at zero and divides by its standard deviation, producing z-scores:

x_scaled <- scale(mtcars$mpg)

To rescale values to an arbitrary range such as [0, 1], use scales::rescale():

library(scales)
mpg_rescaled <- rescale(mtcars$mpg, to = c(0, 1))

For a training/test-aware pipeline, the recipes package is preferable because it estimates scaling parameters only on the training data and applies them to new data without leakage:

library(recipes)

rec <- recipe(mpg ~ wt + hp + disp, data = mtcars) %>%
  step_normalize(all_numeric_predictors())

rec_prepped <- prep(rec, training = mtcars)
mtcars_scaled <- bake(rec_prepped, new_data = NULL)

Log and Power Transforms

When a variable spans multiple orders of magnitude or is heavily right-skewed, a log transform compresses the range and can make the distribution more symmetric:

# Natural log — safe for positive values
income_log <- log(mydata$income)

# log1p computes log(1 + x): safe at zero and numerically accurate for small x
income_log1p <- log1p(mydata$income)

# Back-transform with expm1
income_original <- expm1(income_log1p)

The Box-Cox transform estimates an optimal power parameter lambda by maximum likelihood; in R, use car::powerTransform():

library(car)
bc <- powerTransform(mtcars$mpg ~ 1)
lambda <- coef(bc)
mpg_bc <- bcPower(mtcars$mpg, lambda)  # (x^lambda - 1) / lambda

When lambda equals 0, the Box-Cox transform reduces to the natural log. Lambda of 0.5 corresponds to a square-root transform (up to scaling and shifting).
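As a quick check, car::bcPower() applies the Box-Cox formula directly, and at lambda = 0 it falls back to the log:

```r
library(car)

x <- c(1, 2, 10, 100)

# lambda = 0: bcPower() returns log(x)
all.equal(bcPower(x, 0), log(x))          # TRUE

# lambda = 0.5: (x^0.5 - 1) / 0.5, i.e. a scaled, shifted square root
all.equal(bcPower(x, 0.5), 2 * (sqrt(x) - 1))  # TRUE
```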

Encoding Categorical Variables

Most modeling algorithms require numeric input, so categorical columns need conversion.

Dummy Variables with model.matrix

Base R’s model.matrix() creates a design matrix with binary columns for each factor level:

df <- data.frame(color = c("red", "blue", "red", "green"), stringsAsFactors = TRUE)
model.matrix(~ color - 1, data = df)

The -1 suppresses the intercept column. Each color becomes its own 0/1 column.

One-Hot and Label Encoding

For more control, fastDummies::dummy_cols() handles multiple categorical columns at once:

library(fastDummies)
df <- dummy_cols(df, select_columns = "color", remove_first_dummy = FALSE)

For ordered categories where the model should learn a rank relationship, as.integer() on an ordered factor preserves the order:

education_ordered <- factor(c("high_school", "bachelors", "phd"),
                            levels = c("high_school", "bachelors", "phd"),
                            ordered = TRUE)
as.integer(education_ordered)  # 1 2 3

The recipes equivalent is step_dummy(), which also handles one-hot encoding within a pipeline:

rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_dummy(all_nominal_predictors())

Interaction and Polynomial Features

Some relationships are multiplicative. An interaction term captures the effect of two variables together. A polynomial lets a linear model fit curves.

Interaction Terms

In formulas, x1:x2 represents the interaction and x1 * x2 expands to x1 + x2 + x1:x2:

model <- lm(mpg ~ wt * hp, data = mtcars)

In dplyr, build the column directly:

mtcars <- mtcars %>%
  mutate(wt_hp = wt * hp)

Polynomial Features

poly() generates orthogonal polynomials, which avoids the multicollinearity problems of raw powers:

poly(mtcars$wt, degree = 3)

Orthogonal polynomials are generally preferred in regression because their columns are uncorrelated, which keeps coefficient estimates stable.
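You can verify the difference directly: raw powers of the same variable are nearly collinear, while the columns poly() returns are uncorrelated:

```r
raw  <- cbind(mtcars$wt, mtcars$wt^2, mtcars$wt^3)  # raw powers
orth <- poly(mtcars$wt, degree = 3)                 # orthogonal polynomials

# Largest pairwise correlation among the columns of each basis
max(abs(cor(raw)[upper.tri(cor(raw))]))    # close to 1
max(abs(cor(orth)[upper.tri(cor(orth))]))  # effectively 0
```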

In a recipes pipeline, use step_poly():

rec <- recipe(mpg ~ wt, data = mtcars) %>%
  step_poly(wt, degree = 3) %>%
  step_normalize(all_numeric_predictors())

Date and Time Features with lubridate

Calendar variables often carry signal. The day of week, month, quarter, and year all correlate with outcomes in time-series or event data.

library(lubridate)

events <- events %>%
  mutate(
    year      = year(event_date),
    month     = month(event_date),
    day       = day(event_date),
    wday      = wday(event_date, label = TRUE),
    is_weekend = wday(event_date) %in% c(1, 7),  # 1 = Sunday, 7 = Saturday by default
    quarter   = quarter(event_date),
    hour      = hour(event_time),
    minute    = minute(event_time)
  )

Parse character strings to dates with ymd(), mdy(), or dmy() depending on the string format:

parsed <- ymd("2024-03-15")

Handling Missing Values

Real datasets almost always contain missing values. Imputation replaces them with plausible substitutes.

Simple Imputation

The quickest approach replaces NAs with the median or mean. It is fast but flattens the distribution:

mtcars$hp[is.na(mtcars$hp)] <- median(mtcars$hp, na.rm = TRUE)

Within a recipes pipeline:

rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_impute_median(all_numeric_predictors())

KNN Imputation

K-Nearest Neighbors imputation finds the k rows most similar (by Euclidean distance on non-missing columns) and averages their values for the missing column. This preserves relationships between variables better than mean imputation:

library(VIM)
# imp_var = FALSE drops the extra logical columns kNN() adds to flag imputed cells
mtcars_imputed <- kNN(mtcars, k = 5, imp_var = FALSE)

A reasonable starting value for k is the square root of the number of observations (rows).

Multivariate Imputation with mice

mice (Multivariate Imputation by Chained Equations) builds a regression model for each incomplete variable using all other variables as predictors:

library(mice)

imp <- mice(mtcars, method = "pmm", m = 5)
mtcars_complete <- complete(imp)

The m = 5 argument creates five imputed datasets. Modeling results are then pooled across all five, which gives you proper uncertainty estimates.
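Pooling follows Rubin's rules: fit the same model on each completed dataset with with() and combine the results with pool(). The sketch below uses the built-in airquality data, which, unlike mtcars, contains real NAs:

```r
library(mice)

# airquality has genuine NAs in Ozone and Solar.R
imp  <- mice(airquality, method = "pmm", m = 5, seed = 123, printFlag = FALSE)
fits <- with(imp, lm(Ozone ~ Wind + Temp))  # one fit per imputed dataset
summary(pool(fits))  # one set of estimates pooled across the five imputations
```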

Feature Selection

After creating many features, you often need to prune them. Correlated predictors, irrelevant variables, and redundant features all hurt model performance.

Removing Highly Correlated Predictors

High correlation between predictors causes multicollinearity in linear models. caret::findCorrelation() returns the indices of columns to drop so that no remaining pair exceeds the cutoff:

library(caret)
library(dplyr)

numeric_df <- mtcars %>% select(where(is.numeric))
cor_matrix <- cor(numeric_df, use = "complete.obs")
high_corr <- findCorrelation(cor_matrix, cutoff = 0.8)

# Guard the empty case: numeric_df[, -integer(0)] would drop every column
df_filtered <- if (length(high_corr) > 0) numeric_df[, -high_corr] else numeric_df

Variable Importance from a Model

Train any model and extract importance scores with caret::varImp():

library(caret)

rf_model <- train(mpg ~ ., data = mtcars, method = "rf")  # needs the randomForest package
importance <- varImp(rf_model)
plot(importance)

Stepwise Selection by AIC

For traditional regression, stepwise selection adds or removes predictors to minimize AIC:

full_model <- lm(mpg ~ ., data = mtcars)
null_model <- lm(mpg ~ 1, data = mtcars)

stepwise_model <- step(null_model,
                       scope = list(lower = null_model, upper = full_model),
                       direction = "both")

The recipes Pipeline End-to-End

recipes lets you chain every preprocessing step into a single reproducible object. Each step is estimated on training data and applied consistently to new data.

library(recipes)
library(dplyr)

rec <- recipe(mpg ~ wt + hp + disp + cyl + am, data = mtcars) %>%
  step_mutate(am = factor(am)) %>%  # am is numeric 0/1 in mtcars; step_dummy needs a factor
  step_impute_median(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(am) %>%
  step_interact(~ wt:hp) %>%
  step_poly(hp, degree = 2)

rec_prepped <- prep(rec, training = mtcars)
bake(rec_prepped, new_data = mtcars)

Key idea: prep() estimates parameters like means, standard deviations, and dummy variable levels from the training set. bake() applies those estimated parameters to any data. This separation is what prevents data leakage in your modeling pipeline.
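To see the leakage guard in practice, split the data first and call prep() on the training portion only. This sketch uses rsample's initial_split() for the split, but any holdout split works:

```r
library(recipes)
library(rsample)
library(dplyr)

set.seed(42)
split     <- initial_split(mtcars, prop = 0.8)
train_set <- training(split)
test_set  <- testing(split)

rec_split <- recipe(mpg ~ wt + hp + disp, data = train_set) %>%
  step_normalize(all_numeric_predictors())

rec_split_prepped <- prep(rec_split, training = train_set)  # means/SDs from training rows only
test_baked <- bake(rec_split_prepped, new_data = test_set)  # training parameters reused on test rows
```

The test rows are scaled with the training set's means and standard deviations, so their scaled columns will not themselves be centered at zero; that asymmetry is exactly what an honest evaluation requires.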

Conclusion

Feature engineering converts raw observations into variables that models can actually use. The core techniques — numeric transforms, categorical encoding, interactions, date extraction, imputation, and selection — apply across nearly every modeling problem. R’s tidyverse gives you dplyr and tidyr for flexible transformations, lubridate for datetime features, and recipes for production-grade preprocessing pipelines that prevent data leakage. Getting these fundamentals right matters more than which algorithm you choose.

See Also