Feature Engineering in R: Transform Raw Data for ML Models
Feature engineering is the process of selecting, modifying, and creating variables in your dataset so that machine learning models can extract the best possible signal from your data. Switching algorithms gets attention, but well-engineered features often deliver bigger accuracy gains. R’s tidyverse ecosystem and tidymodels framework give you coherent, reproducible tools for the entire pipeline.
What you’ll learn
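This tutorial covers numeric transformations, categorical encoding, interaction and polynomial terms, date features, imputation, and feature selection in R, and ends with a complete recipes pipeline. By the end, you will know which function to reach for at each step and how to keep preprocessing free of data leakage.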
Numeric transformations
Raw numeric variables frequently need reshaping before they are useful to a model. Skewed distributions, variables on wildly different scales, and outliers all cause problems.
Scaling and standardization
scale() centers a vector at zero and divides by its standard deviation, producing z-scores. This is the standard preprocessing step for distance-based models like KNN and SVM, where variables on different scales can dominate the distance metric and skew results:
x_scaled <- scale(mtcars$mpg)
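As a quick check on what scale() returns, the sketch below uses only base R and the built-in mtcars data. It compares the result with the z-score formula and reuses the stored center and spread on new values. The new_mpg vector is made up for illustration.

```r
# scale() returns a one-column matrix; the centering and scaling values
# are stored as attributes on that matrix.
x_scaled <- scale(mtcars$mpg)

mpg_mean <- attr(x_scaled, "scaled:center")
mpg_sd   <- attr(x_scaled, "scaled:scale")

# The same z-scores computed by hand.
z_manual <- (mtcars$mpg - mean(mtcars$mpg)) / sd(mtcars$mpg)
same_result <- isTRUE(all.equal(as.numeric(x_scaled), z_manual))

# Apply the stored parameters to new observations (hypothetical values)
# instead of re-estimating them from the new data.
new_mpg <- c(18.5, 30.2)
new_scaled <- (new_mpg - mpg_mean) / mpg_sd
```

Wrapping the result in as.numeric() is often convenient, because the matrix return type can surprise functions that expect a plain vector.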
To rescale values to an arbitrary range such as [0, 1], use scales::rescale(). While scale() always produces z-scores centered at zero with unit variance, rescale() lets you define the exact lower and upper bounds you want. This is useful for algorithms sensitive to input magnitude and for visualizations where you need values in a bounded interval:
library(scales)
mpg_rescaled <- rescale(mtcars$mpg, to = c(0, 1))
For a training/test-aware pipeline, the recipes package is preferable because it estimates scaling parameters only on the training data and applies them to new data without leakage. If you compute the mean and standard deviation from the entire dataset before splitting, information from the test set bleeds into your training features and inflates your validation scores. step_normalize() inside a recipe avoids this by fitting parameters during prep() on the training split only:
library(recipes)
rec <- recipe(mpg ~ wt + hp + disp, data = mtcars) %>%
step_normalize(all_numeric_predictors())
rec_prepped <- prep(rec, training = mtcars)
mtcars_scaled <- bake(rec_prepped, new_data = NULL)
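The example above preps on the full mtcars data for brevity. A minimal sketch of the leakage-free version, assuming a simple random 24/8 split (the split size and seed are arbitrary choices for illustration), might look like this:

```r
library(recipes)

set.seed(123)
train_idx <- sample(nrow(mtcars), size = 24)  # 24 of 32 rows for training
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

rec <- recipe(mpg ~ wt + hp + disp, data = train) %>%
  step_normalize(all_numeric_predictors())

# Means and standard deviations are estimated from the training rows only.
rec_prepped <- prep(rec, training = train)

train_scaled <- bake(rec_prepped, new_data = NULL)  # processed training set
test_scaled  <- bake(rec_prepped, new_data = test)  # same parameters applied to test
```

The test columns will generally not have a mean of exactly zero after baking, which is expected: they were scaled with the training parameters.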
Log and power transforms
Variables with long-tailed distributions mislead many models. When a variable spans multiple orders of magnitude or is heavily right-skewed, such as income data, population counts, or response times, a log transform compresses the range and can make the distribution more symmetric. The base R log() function handles strictly positive values, while log1p() computes log(1 + x) and is the safer choice when your data contains zeros, because log(0) is -Inf:
# Hypothetical data frame standing in for your own data
mydata <- data.frame(income = c(0, 1200, 35000, 250000))
# Natural log: requires strictly positive values (log(0) is -Inf)
income_log <- log(mydata$income)
# log1p computes log(1 + x), so zeros map to zero
income_log1p <- log1p(mydata$income)
# Back-transform with expm1
income_original <- expm1(income_log1p)
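To see the symmetry effect in numbers, here is a small sketch on simulated data. The log-normal parameters and the seed are assumptions chosen only to produce a right-skewed variable; the gap between mean and median, in standard deviation units, serves as a rough skewness indicator.

```r
set.seed(1)

# Simulated right-skewed "income" values (log-normal), purely illustrative.
income <- rlnorm(1000, meanlog = 10, sdlog = 1)

# In a right-skewed variable the mean sits well above the median.
gap_raw <- (mean(income) - median(income)) / sd(income)

# After the log transform the mean and median nearly coincide.
income_log <- log(income)
gap_log <- (mean(income_log) - median(income_log)) / sd(income_log)

c(raw = gap_raw, logged = gap_log)
```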
The Box-Cox transform via car::powerTransform() estimates the power parameter lambda automatically by maximising the normal log-likelihood of the transformed data. It requires strictly positive values. A lambda of 0 corresponds to a log transform and a lambda of 0.5 to a square root (up to a linear rescaling). This automated approach removes much of the guesswork from choosing between log, square root, and reciprocal transforms. The following code estimates lambda and applies the transformation with car::bcPower(), which computes (x^lambda - 1) / lambda and falls back to log(x) when lambda is 0; raising the variable to the power lambda directly would fail in exactly that log case:
library(car)
bc <- powerTransform(mtcars$mpg ~ 1)
lambda <- bc$lambda
mpg_bc <- bcPower(mtcars$mpg, lambda)
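For intuition, the Box-Cox formula can be written out by hand in base R. The helper name box_cox below is hypothetical and is not part of car; it is a sketch of the one-parameter Box-Cox family.

```r
# Hypothetical helper: one-parameter Box-Cox transform for positive x.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) {
    log(x)                    # limiting case as lambda approaches 0
  } else {
    (x^lambda - 1) / lambda
  }
}

x <- mtcars$mpg

# lambda = 0.5 is a linear rescaling of the square root.
bc_sqrt <- box_cox(x, 0.5)
sqrt_match <- isTRUE(all.equal(bc_sqrt, 2 * (sqrt(x) - 1)))

# A tiny lambda gives values very close to log(x).
max_diff_from_log <- max(abs(box_cox(x, 1e-6) - log(x)))
```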
Encoding categorical variables
Most modeling algorithms require numeric input, so categorical columns need conversion. The encoding method you choose depends on the relationship between categories: nominal (no order) calls for one-hot encoding, while ordinal (ranked) benefits from integer encoding that preserves the order.
Dummy variables and one-hot encoding
Base R’s model.matrix() creates a design matrix with binary columns for each factor level. The -1 in the formula suppresses the intercept column, giving each category its own 0/1 indicator. For workflows with many categorical columns, fastDummies::dummy_cols() handles multiple columns at once, and recipes::step_dummy() integrates the encoding directly into a tidymodels pipeline where the factor levels are learned from training data and applied consistently to new data. Note that step_dummy() drops the first level as a reference by default; set one_hot = TRUE to keep an indicator for every level.
# Base R: model.matrix with intercept suppressed
df <- data.frame(color = c("red", "blue", "red", "green"), stringsAsFactors = TRUE)
model.matrix(~ color - 1, data = df)
# fastDummies: pass the column names as a character vector
library(fastDummies)
df_dummies <- dummy_cols(df, select_columns = "color", remove_first_dummy = FALSE)
# recipes: pipeline-safe encoding (iris is used because mtcars has no factor columns)
library(recipes)
rec <- recipe(Sepal.Length ~ ., data = iris) %>%
  step_dummy(all_nominal_predictors())
Label encoding
For ordered categories where the model should learn a rank relationship, as.integer() on an ordered factor maps the levels to integers in the specified order. This preserves the monotonic relationship that one-hot encoding would discard: “phd” > “bachelors” > “high_school” becomes 3 > 2 > 1, which tree-based models and regularised regression can use directly. The following code creates an ordered factor and extracts its integer encoding:
education_ordered <- factor(c("high_school", "bachelors", "phd"),
levels = c("high_school", "bachelors", "phd"),
ordered = TRUE)
as.integer(education_ordered)
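The explicit levels argument matters more than it may appear. As a small base R illustration (the education vector is made up), compare the integer codes with and without it:

```r
levels_in_order <- c("high_school", "bachelors", "phd")
education <- c("phd", "high_school", "bachelors", "phd")

# Without explicit levels, R sorts them alphabetically:
# bachelors = 1, high_school = 2, phd = 3.
alphabetical <- as.integer(factor(education))

# With explicit levels, the codes follow the intended ranking:
# high_school = 1, bachelors = 2, phd = 3.
ranked <- as.integer(factor(education, levels = levels_in_order, ordered = TRUE))
```

Here alphabetical is 3, 2, 1, 3 while ranked is 3, 1, 2, 3, so relying on the default would silently rank high_school above bachelors.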
Interaction and polynomial features
Some relationships are multiplicative. An interaction term captures the effect of two variables together. A polynomial lets a linear model fit curves.
Interaction terms
In formulas, x1:x2 represents the interaction and x1 * x2 expands to x1 + x2 + x1:x2:
model <- lm(mpg ~ wt * hp, data = mtcars)
In dplyr, build the column directly with mutate(), multiplying the two columns to create a new predictor. This approach gives you full control over which interactions to include and lets you name the resulting column explicitly. The formula shortcut is convenient during exploration, but in production pipelines you’ll want the transparency of seeing each interaction as a named column. Assigning the result to a new object keeps the original mtcars untouched for the later examples:
library(dplyr)
mtcars_int <- mtcars %>%
  mutate(wt_hp = wt * hp)
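The two routes describe the same model. A short base R check, using transform() in place of mutate() so the sketch has no package dependencies, shows that the coefficients agree:

```r
# Interaction via the formula shortcut.
fit_formula <- lm(mpg ~ wt * hp, data = mtcars)

# Interaction as an explicit, named column.
cars_int <- transform(mtcars, wt_hp = wt * hp)
fit_manual <- lm(mpg ~ wt + hp + wt_hp, data = cars_int)

# Same estimates; only the name of the interaction coefficient differs
# ("wt:hp" versus "wt_hp").
coefs_match <- isTRUE(all.equal(unname(coef(fit_formula)),
                                unname(coef(fit_manual))))
```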
Polynomial features
poly() generates orthogonal polynomials, which avoids the multicollinearity problems of raw powers. When you square a variable directly (I(wt^2)), the new column is highly correlated with the original, inflating variance in linear models. Orthogonal polynomials produce predictors that are uncorrelated with each other, so each term captures independent information about curvature:
poly(mtcars$wt, degree = 3)
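The difference is easy to verify numerically. The following base R sketch compares the correlation between raw powers with the correlation between orthogonal polynomial columns:

```r
wt <- mtcars$wt

# Raw powers: wt and wt^2 move almost in lockstep.
cor_raw <- cor(wt, wt^2)

# Orthogonal polynomial columns are uncorrelated by construction.
P <- poly(wt, degree = 2)
cor_orth <- cor(P[, 1], P[, 2])

# poly(..., raw = TRUE) returns the plain powers if you need them.
R <- poly(wt, degree = 2, raw = TRUE)
raw_is_square <- isTRUE(all.equal(as.numeric(R[, 2]), wt^2))
```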
Orthogonal polynomials are generally preferred in regression because each term is independent of the others, which keeps coefficient estimates stable. Raw powers are easier to interpret directly, but the collinearity they introduce can cause coefficient signs to flip as the degree increases. In a recipes pipeline, use step_poly() to generate orthogonal polynomial features that fit cleanly into the rest of your preprocessing steps:
rec <- recipe(mpg ~ wt, data = mtcars) %>%
step_poly(wt, degree = 3) %>%
step_normalize(all_numeric_predictors())
Date and time features with lubridate
Calendar variables often carry signal that raw timestamps hide. The day of week, month, quarter, and year all correlate with outcomes in time-series or event data; retail sales spike on weekends, energy usage peaks in winter months, and customer churn follows quarterly contract cycles. Extracting these components with lubridate turns a single datetime column into multiple numeric or factor features that tree-based and linear models can use directly:
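library(dplyr)
library(lubridate)
# events is assumed to have a Date column event_date and a date-time column event_time
events <- events %>%
  mutate(
    year = year(event_date),
    month = month(event_date),
    day = day(event_date),
    wday = wday(event_date, label = TRUE),
    # Compare numeric day codes (1 = Sunday, 7 = Saturday by default), not the labelled factor
    is_weekend = lubridate::wday(event_date) %in% c(1, 7),
    quarter = quarter(event_date),
    hour = hour(event_time),
    minute = minute(event_time)
  )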
Parse character strings to dates with ymd(), mdy(), or dmy() depending on the string format. These lubridate functions are vectorized, fast, and handle common irregularities like missing leading zeros or trailing whitespace. Once parsed into R’s Date class, the date supports arithmetic (subtraction gives difftime), logical comparisons, and extraction via the same year(), month(), and wday() helpers shown above:
parsed <- ymd("2024-03-15")
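To make the parsing, extraction, and date arithmetic concrete, here is a self-contained sketch on a made-up three-row event log. The column names match the earlier example, but the data is hypothetical, and the day codes assume lubridate's default week start (Sunday = 1).

```r
library(lubridate)

# Hypothetical event log: a Friday, a Saturday, and a Sunday.
events <- data.frame(
  event_date = ymd(c("2024-03-15", "2024-03-16", "2024-03-17")),
  event_time = ymd_hms(c("2024-03-15 08:30:00",
                         "2024-03-16 14:05:00",
                         "2024-03-17 23:59:00"))
)

# Numeric day of week: 1 = Sunday ... 7 = Saturday with the default settings.
events$wday_num   <- wday(events$event_date)
events$is_weekend <- events$wday_num %in% c(1, 7)
events$hour       <- hour(events$event_time)

# Subtracting dates yields a difftime; convert it to a plain number of days.
events$days_since_first <- as.numeric(
  events$event_date - min(events$event_date), units = "days"
)
```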
Handling missing values
Real datasets almost always have missing values. Imputation replaces them with plausible substitutes.
Simple imputation
The quickest approach replaces NAs with the median or mean. It is fast but flattens the distribution; imputing every missing value with the same constant removes variance that may carry predictive signal. Use this when speed is critical and the proportion of missing values is small enough that distribution flattening won’t skew the model:
mtcars$hp[is.na(mtcars$hp)] <- median(mtcars$hp, na.rm = TRUE)
Within a recipes pipeline, step_impute_median() does the same replacement but records the median value as a preprocessing parameter. During prediction on new data, the recipe applies the stored median rather than recomputing it, which is critical because the test set’s median might differ from the training set’s, and computing imputation values from the test set constitutes data leakage:
rec <- recipe(mpg ~ ., data = mtcars) %>%
step_impute_median(all_numeric_predictors())
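A brief sketch of that behavior, assuming a simple first-24/last-8 row split of mtcars with a few NAs injected artificially for illustration:

```r
library(recipes)

train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

# Inject missing values (mtcars itself is complete).
train$hp[c(3, 10)] <- NA
test$hp[2] <- NA

rec <- recipe(mpg ~ wt + hp + disp, data = train) %>%
  step_impute_median(all_numeric_predictors())

# The median is estimated from the training rows only.
rec_prepped <- prep(rec, training = train)
train_median <- median(train$hp, na.rm = TRUE)

# The gap in the test set is filled with the stored training median.
test_imputed <- bake(rec_prepped, new_data = test)
uses_train_median <- isTRUE(all.equal(test_imputed$hp[2], train_median))
```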
KNN imputation
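K-Nearest Neighbors imputation finds the k rows most similar to the incomplete row on the observed columns and fills the gap with an aggregate of their values. VIM::kNN() measures similarity with a Gower-type distance, so it handles mixed numeric and categorical data, and by default it uses the median of the neighbors for numeric columns. Unlike mean imputation, which fills every gap with the same value, KNN preserves relationships between variables by using similar observed rows as templates. This matters when missingness depends on other observed variables; for example, if patients who skip reporting their income differ in age or health status from those who report it, rows with similar age and health status give a more plausible fill than the overall mean:
library(VIM)
mtcars_imputed <- kNN(mtcars, k = 5)
By default kNN() also appends logical _imp indicator columns; pass imp_var = FALSE to suppress them. The default k = 5 is a sensible starting point. A common rule of thumb puts k near the square root of the number of complete rows, but cross-validation is the more reliable way to choose.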
Multivariate imputation with mice
mice (Multivariate Imputation by Chained Equations) builds a regression model for each incomplete variable using all other variables as predictors. Unlike single-imputation methods that fill each gap once and treat the result as known, mice generates multiple plausible imputations. You then fit your model on each imputed dataset and pool the results; the variation across imputations becomes part of your standard errors, giving you honest uncertainty estimates that single imputation cannot provide:
library(mice)
imp <- mice(mtcars, method = "pmm", m = 5)
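mtcars_complete <- complete(imp)  # returns only the first of the five imputed datasets
The m = 5 argument creates five imputed datasets. complete(imp) extracts just one of them (the first by default), which is fine for inspection but discards the uncertainty information. For inference, fit the model on each dataset with with() and combine the estimates with pool(), so that the between-imputation variation enters the standard errors.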
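A minimal sketch of the fit-and-pool workflow follows. It uses nhanes, a small example dataset shipped with mice that actually contains missing values (mtcars does not); the regression formula and the seed are arbitrary choices for illustration.

```r
library(mice)

# nhanes has columns age, bmi, hyp, chl, with NAs in several of them.
imp <- mice(nhanes, method = "pmm", m = 5, seed = 123, printFlag = FALSE)

# Fit the same regression on each of the five completed datasets.
fits <- with(imp, lm(bmi ~ age + chl))

# Combine estimates and standard errors across imputations (Rubin's rules).
pooled <- pool(fits)
pooled_summary <- summary(pooled)
pooled_summary
```

The pooled standard errors are typically wider than those from any single completed dataset, reflecting the extra uncertainty from the imputed values.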
Feature selection
After creating many features, you often need to prune them. Correlated predictors, irrelevant variables, and redundant features all hurt model performance.
Removing highly correlated predictors
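High correlation between predictors causes multicollinearity in linear models. caret::findCorrelation() examines the pairwise correlations and returns the indices of the columns it recommends removing to bring them below the cutoff:
library(caret)
library(dplyr)
# Keep the outcome (mpg) out of the correlation filter
numeric_df <- mtcars %>% select(where(is.numeric), -mpg)
cor_matrix <- cor(numeric_df, use = "complete.obs")
high_corr <- findCorrelation(cor_matrix, cutoff = 0.8)
# Guard against an empty result: numeric_df[, -integer(0)] would drop every column
df_filtered <- if (length(high_corr) > 0) numeric_df[, -high_corr, drop = FALSE] else numeric_df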
Variable importance from a model
Train a model and extract importance scores with caret::varImp(). Importance measures quantify how much each predictor contributes to the model’s accuracy; variables with scores near zero contribute little to this particular model and are candidates for removal, though you should confirm on held-out data that dropping them does not degrade performance. This approach is model-aware, meaning the importance ranking reflects which features the specific algorithm actually used, rather than a generic statistical test. The example below uses method = "rf", which requires the randomForest package to be installed:
library(caret)
rf_model <- train(mpg ~ ., data = mtcars, method = "rf")
importance <- varImp(rf_model)
plot(importance)
Stepwise selection by AIC
For traditional regression, stepwise selection adds or removes predictors to minimize AIC. Starting from the null model (no predictors) and working forward, or from the full model and working backward, the procedure iteratively adds the variable that most improves the AIC and removes any that no longer contribute. Be aware that stepwise selection biases inference: the p-values of the selected terms are too small and R-squared is overly optimistic, because the same data are used both to choose and to evaluate the model. Always validate the selected model on held-out data:
full_model <- lm(mpg ~ ., data = mtcars)
null_model <- lm(mpg ~ 1, data = mtcars)
stepwise_model <- step(null_model,
scope = list(lower = null_model, upper = full_model),
direction = "both")
The recipes pipeline end-to-end
recipes lets you chain every preprocessing step into a single reproducible object. Each step (imputation, normalization, dummy encoding, interactions, polynomial expansion) is defined declaratively as a formula-like expression. During prep(), the recipe estimates parameters from the training data; during bake(), it applies those stored parameters to any dataset. This design means your preprocessing logic lives in one place and you never accidentally leak test data into your feature calculations:
library(recipes)
library(dplyr)
# step_dummy() needs a factor, so convert the 0/1 transmission flag first
mtcars_fct <- mtcars %>%
  mutate(am = factor(am, labels = c("automatic", "manual")))
rec <- recipe(mpg ~ wt + hp + disp + cyl + am, data = mtcars_fct) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(am) %>%
  step_interact(~ wt:hp) %>%
  step_poly(hp, degree = 2)
rec_prepped <- prep(rec, training = mtcars_fct)
bake(rec_prepped, new_data = mtcars_fct)
Key idea: prep() estimates parameters like means, standard deviations, and dummy variable levels from the training set. bake() applies those estimated parameters to any data. This separation is what prevents data leakage in your modeling pipeline.
Numeric feature transformations
step_log() applies log transformation to right-skewed features. step_sqrt() is gentler for moderately skewed data. step_normalize() centers and scales to mean 0, standard deviation 1, required for distance-based models (KNN, SVM) and regularized regression. step_range() scales to [0, 1]. Apply transformations in a recipe so the parameters (mean, SD, range) are estimated only on training data and applied identically to test data.
Categorical encoding
step_dummy() converts factors to binary indicator columns, one per level minus a reference level (set one_hot = TRUE to keep every level). step_other(col, threshold = 0.05) collapses rare levels (below 5% frequency) into “other” before encoding to reduce dimensionality. step_novel() assigns a dedicated level to values that appear in new data but not in the training data, which prevents errors during prediction. step_unknown() assigns missing factor values to an explicit “unknown” level.
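As a sketch of how step_novel() and step_dummy() combine, consider a toy training set and new data containing a color that was never seen during training. All data here is made up, and the column name color_new assumes the default new_level = "new" and default dummy naming.

```r
library(recipes)

# Toy training data with three known colors.
train <- data.frame(
  y = c(1, 2, 3, 4, 5, 6),
  color = factor(c("red", "blue", "red", "green", "blue", "red"))
)

# New data containing a level ("purple") never seen in training.
new_obs <- data.frame(y = c(7, 8), color = factor(c("red", "purple")))

rec <- recipe(y ~ color, data = train) %>%
  step_novel(color) %>%  # reserves a "new" level for unseen values
  step_dummy(color)      # then encodes, dropping the first level (blue)

rec_prepped <- prep(rec, training = train)
baked <- bake(rec_prepped, new_data = new_obs)
baked
```

Placing step_novel() before step_dummy() matters: the dummy step needs the extra level to exist when it learns which columns to create.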
Date and time features
step_date(date_col, features = c("year", "month", "dow")) extracts year, month, and day of week from a date column; year is numeric, while month and day of week are returned as factors unless you set label = FALSE. step_holiday(date_col) adds binary indicators for holidays. step_mutate(week = lubridate::week(date_col)) adds custom date features. Time-based features help models pick up seasonal patterns that are hard to learn from a raw timestamp.
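A small sketch of step_date() on a made-up sales table (the data and column names are hypothetical) shows the columns it adds:

```r
library(recipes)

# Hypothetical daily sales records.
sales <- data.frame(
  date_col = as.Date(c("2024-01-01", "2024-03-15", "2024-07-04")),
  units = c(10, 14, 9)
)

rec <- recipe(units ~ date_col, data = sales) %>%
  step_date(date_col, features = c("year", "month", "dow"))

baked <- bake(prep(rec, training = sales), new_data = NULL)

# New columns are named after the source column: date_col_year,
# date_col_month, date_col_dow.
names(baked)
```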
Feature engineering as model input design
Models cannot learn from raw data alone; they learn from the numerical representations you give them. Feature engineering is the process of creating those representations: transforming raw variables into forms that expose patterns the model can detect. A date of birth is not useful to a model; age in years might be; whether the person is under 25 might be even more useful for a specific task. The transformation that turns the raw value into a useful model input is a feature engineering decision.
Good feature engineering requires domain knowledge about what aspects of the data are likely to predict the outcome. Statistical transformations (normalization, log transforms, polynomial features) are generally applicable. Domain-specific transformations (extracting day of week from a timestamp, computing distance between geographic coordinates, creating a ratio of two quantities) encode knowledge about the problem that the model cannot easily discover on its own.
Handling categorical variables
Most models require numeric inputs. Converting categorical variables to numbers is a feature engineering step. One-hot encoding (dummy coding) creates one binary column per category level. The reference level is dropped to avoid perfect multicollinearity in linear models. For tree-based models, multicollinearity is not a concern, and including all levels sometimes improves performance.
High-cardinality categorical variables (ZIP codes, product IDs, user IDs) create hundreds or thousands of dummy columns with one-hot encoding. Target encoding (replacing each category with the mean of the target for that category) is a compact alternative. Target encoding requires regularization or cross-validation to avoid overfitting, because the encodings of rare categories are driven by just a few training observations.
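One simple form of that regularization shrinks each category mean toward the global mean, weighted by the category count. The base R sketch below illustrates the idea on made-up data; the smoothing strength m = 5 is an arbitrary assumption, and in practice the encoding should be computed inside cross-validation folds or on training data only.

```r
# Hypothetical training data: category "C" appears only once.
train <- data.frame(
  zip = c("A", "A", "A", "A", "B", "B", "C"),
  y   = c(10, 12, 11, 13, 20, 22, 50)
)

global_mean <- mean(train$y)
m <- 5  # smoothing strength (assumed value; tune in practice)

cat_n    <- tapply(train$y, train$zip, length)
cat_mean <- tapply(train$y, train$zip, mean)

# Smoothed encoding: rare categories are pulled toward the global mean.
encoding <- (cat_n * cat_mean + m * global_mean) / (cat_n + m)
train$zip_te <- as.numeric(encoding[train$zip])

# Categories unseen in training fall back to the global mean.
new_zip <- c("A", "D")
new_te <- unname(ifelse(new_zip %in% names(encoding),
                        encoding[new_zip], global_mean))
```

With a single observation at 50, category C is encoded well below 50, which limits how much one row can influence the feature.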
Conclusion
Feature engineering converts raw observations into variables that models can actually use. The core techniques (numeric transforms, categorical encoding, interactions, date extraction, imputation, and selection) apply across nearly every modeling problem. R’s tidyverse and tidymodels give you dplyr and tidyr for flexible transformations, lubridate for datetime features, and recipes for production-grade preprocessing pipelines that prevent data leakage. Getting these fundamentals right often matters more than which algorithm you choose.
Next steps
Now that you understand feature engineering in R, explore the related topics below to apply these techniques in more complex scenarios.
See also
- dplyr Basics: Core data manipulation verbs for creating and transforming features
- Date and Time with lubridate: Extracting temporal signal from datetime columns
- tidymodels Regression: Connecting feature-engineered data to a full modeling workflow