*41*

In most cases, when people talk about “normalizing” variables in a dataset, it means they’d like to scale the values such that the variable has a mean of 0 and a standard deviation of 1.

The most common reason to normalize variables is when you’re conducting some type of multivariate analysis (i.e. you want to understand the relationship between several predictor variables and a response variable) and you want each variable to contribute equally to the analysis.

When variables are measured at different scales, they often do not contribute equally to the analysis. For example, if the values of one variable range from 0 to 100,000 and the values of another variable range from 0 to 100, the variable with the larger range will be given a larger weight in the analysis.

This is common when one variable measures something like salary ($0 to $100,000) and another variable measures something like age (0 to 100 years).

By normalizing the variables, we can be sure that each variable contributes equally to the analysis. Two common ways to normalize (or “scale”) variables include:

**Min-Max Normalization:**(X – min(X)) / (max(X) – min(X))**Z-Score Standardization:**(X – μ) / σ

Next, we’ll show how to implement both of these techniques in R.

**How to Normalize (or “Scale”) Variables in R**

For each of the following examples, we’ll use the built-in R dataset **iris **to illustrate how to normalize or scale variables in R:

**#view first six rows of ***iris *dataset
head(iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.1 3.5 1.4 0.2 setosa
#2 4.9 3.0 1.4 0.2 setosa
#3 4.7 3.2 1.3 0.2 setosa
#4 4.6 3.1 1.5 0.2 setosa
#5 5.0 3.6 1.4 0.2 setosa
#6 5.4 3.9 1.7 0.4 setosa

**Min-Max Normalization**

The formula for a min-max normalization is:

(X – min(X))/(max(X) – min(X))

For each value of a variable, we simply find how far that value is from the minimum value, then divide by the range.

To implement this in R, we can define a simple function and then use lapply to apply that function to whichever columns in the **iris **dataset we would like:

#define Min-Max normalization function min_max_norm function(x) { (x - min(x)) / (max(x) - min(x)) } #apply Min-Max normalization to first four columns inirisdataset iris_norm #view first six rows of normalizedirisdataset head(iris_norm) # Sepal.Length Sepal.Width Petal.Length Petal.Width #1 0.22222222 0.6250000 0.06779661 0.04166667 #2 0.16666667 0.4166667 0.06779661 0.04166667 #3 0.11111111 0.5000000 0.05084746 0.04166667 #4 0.08333333 0.4583333 0.08474576 0.04166667 #5 0.19444444 0.6666667 0.06779661 0.04166667 #6 0.30555556 0.7916667 0.11864407 0.12500000

Notice that each of the columns now have values that range from 0 to 1. Also notice that the fifth column “Species” was dropped from this data frame. We can easily add it back by using the following code:

#add backSpeciescolumn iris_norm$Species #view first six rows ofiris_normhead(iris_norm) # Sepal.Length Sepal.Width Petal.Length Petal.Width Species #1 0.22222222 0.6250000 0.06779661 0.04166667 setosa #2 0.16666667 0.4166667 0.06779661 0.04166667 setosa #3 0.11111111 0.5000000 0.05084746 0.04166667 setosa #4 0.08333333 0.4583333 0.08474576 0.04166667 setosa #5 0.19444444 0.6666667 0.06779661 0.04166667 setosa #6 0.30555556 0.7916667 0.11864407 0.12500000 setosa

**Z-Score Standardization**

The drawback of the min-max normalization technique is that it brings the data values towards the mean. If we want to make sure that outliers get weighted more than other values, a z-score standardization is a better technique to implement.

The formula for a z-score standardization is:

(X – μ) / σ

For each value of a variable, we simply subtract the mean value of the variable, then divide by the standard deviation of the variable.

To implement this in R, we have a few different options:

**1. Standardize one variable**

If we simply want to standardize one variable in a dataset, such as Sepal.Width in the **iris **dataset, we can use the following code:

#standardizeSepal.Widthiris$Sepal.Width

The values of *Sepal.Width *are now scaled such that the mean is 0 and the standard deviation is 1. We can even verify this if we’d like:

#find mean ofSepal.Widthmean(iris$Sepal.Width) #[1] 2.034094e-16 #basically zero #find standard deviation ofSepal.Widthsd(iris$Sepal.Width) #[1] 1

**2. Standardize several variables using the scale function**

To standardize several variables, we can simply use the *scale *function. For example, the following code shows how to scale the first four columns of the **iris **dataset:

#standardize first four columns ofirisdataset iris_standardize #view first six rows of standardized datasethead(iris_standardize) # Sepal.Length Sepal.Width Petal.Length Petal.Width #1 -0.8976739 1.01560199 -1.335752 -1.311052 #2 -1.1392005 -0.13153881 -1.335752 -1.311052 #3 -1.3807271 0.32731751 -1.392399 -1.311052 #4 -1.5014904 0.09788935 -1.279104 -1.311052 #5 -1.0184372 1.24503015 -1.335752 -1.311052 #6 -0.5353840 1.93331463 -1.165809 -1.048667

Note that the *scale *function, by default, attempts to standardize every column in a data frame. Thus, we would get an error if we attempted to use **scale(iris)** because the *Species* column is not numeric and cannot be standardized:

scale(iris) #Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric

However, it is possible to standardize only certain variables in a data frame while also keeping all other variables the same by using the **dplyr **package. For example, the following code standardizes the variables *Sepal.Width *and *Sepal.Length *while keeping all other variables the same:

#loaddplyrpackage library(dplyr) #standardizeSepal.WidthandSepal.Lengthiris_new % mutate_each_(list(~scale(.) %>% as.vector), vars = c("Sepal.Width","Sepal.Length")) #view first six rows of new data frame head(iris_new) # Sepal.Length Sepal.Width Petal.Length Petal.Width Species #1 -0.8976739 1.01560199 1.4 0.2 setosa #2 -1.1392005 -0.13153881 1.4 0.2 setosa #3 -1.3807271 0.32731751 1.3 0.2 setosa #4 -1.5014904 0.09788935 1.5 0.2 setosa #5 -1.0184372 1.24503015 1.4 0.2 setosa #6 -0.5353840 1.93331463 1.7 0.4 setosa

Notice that *Sepal.Length *and *Sepal.Width *are standardized such that both variables have a mean of 0 and a standard deviation of 1, while the other three variables in the data frame remain unchanged.