How to Perform a Box-Cox Transformation in Python

by Tutor Aspire

A Box-Cox transformation is a commonly used method for transforming a non-normally distributed dataset into a more normally distributed one.

The basic idea behind this method is to find some value for λ such that the transformed data is as close to normally distributed as possible, using the following formula:

  • y(λ) = (y^λ - 1) / λ  if λ ≠ 0
  • y(λ) = log(y)  if λ = 0
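
As a minimal sketch of the formula above, the transformation for a fixed λ can be written directly in NumPy. The helper name boxcox_manual is hypothetical (it is not part of SciPy), and the input values are made up for illustration. The result is compared against scipy.special.boxcox, which applies the same formula for a given λ:

```python
import numpy as np
from scipy.special import boxcox as boxcox_fixed  # Box-Cox for a given lambda

def boxcox_manual(y, lam):
    """Hypothetical helper: apply the Box-Cox formula for a fixed lambda."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.log(y)           # case lambda = 0
    return (y**lam - 1) / lam      # case lambda != 0

#illustrative positive values (not from the example dataset)
y = np.array([0.5, 1.0, 2.0, 4.0])

#compare both branches of the formula with SciPy's implementation
matches_general = np.allclose(boxcox_manual(y, 0.5), boxcox_fixed(y, 0.5))
matches_log = np.allclose(boxcox_manual(y, 0), boxcox_fixed(y, 0))
print(matches_general, matches_log)
```

Both comparisons should agree, since the two functions implement the same piecewise definition.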

We can perform a Box-Cox transformation in Python by using the scipy.stats.boxcox() function. Note that this function requires every value in the dataset to be positive.

The following example shows how to use this function in practice.

Example: Box-Cox Transformation in Python

Suppose we generate a random set of 1,000 values that come from an exponential distribution:

#load necessary packages
import numpy as np 
from scipy.stats import boxcox 
import seaborn as sns 

#make this example reproducible
np.random.seed(0)

#generate dataset
data = np.random.exponential(size=1000)

#plot the distribution of data values
sns.kdeplot(data)

We can see that the distribution is strongly right-skewed and does not appear to be normal.

We can use the boxcox() function to find an optimal value of lambda that produces a more normal distribution:

#perform Box-Cox transformation on original data
transformed_data, best_lambda = boxcox(data) 

#plot the distribution of the transformed data values
sns.kdeplot(transformed_data)

We can see that the transformed data is much closer to normally distributed.
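
The visual impression can also be checked numerically. The sketch below is an addition to the original example: it regenerates the same dataset and compares the sample skewness before and after the transformation using scipy.stats.skew. A skewness near zero is consistent with a symmetric distribution, although it does not prove normality on its own:

```python
import numpy as np
from scipy.stats import boxcox, skew

#recreate the dataset from the example
np.random.seed(0)
data = np.random.exponential(size=1000)

#perform Box-Cox transformation
transformed_data, best_lambda = boxcox(data)

#compare sample skewness before and after the transformation
skew_before = skew(data)
skew_after = skew(transformed_data)
print(skew_before, skew_after)
```

The skewness of the transformed data should be much closer to zero than that of the original data.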

We can also find the exact lambda value used to perform the Box-Cox transformation:

#display optimal lambda value
print(best_lambda)

0.2420131978174143

The optimal lambda was found to be roughly 0.242.
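
When no lambda is supplied, boxcox() chooses the value that maximizes the Box-Cox log-likelihood. As a rough sketch of that idea, we can evaluate scipy.stats.boxcox_llf() over a grid of candidate values and pick the best one. The grid range of -1 to 1 and the step size of 0.001 are assumptions made for this illustration, not something boxcox() uses internally:

```python
import numpy as np
from scipy.stats import boxcox, boxcox_llf

#recreate the dataset from the example
np.random.seed(0)
data = np.random.exponential(size=1000)

#lambda chosen by boxcox()
_, best_lambda = boxcox(data)

#evaluate the log-likelihood on an assumed grid of candidate lambdas
lambdas = np.linspace(-1, 1, 2001)
llf = np.array([boxcox_llf(lam, data) for lam in lambdas])

#grid value with the highest log-likelihood
grid_lambda = lambdas[np.argmax(llf)]
print(best_lambda, grid_lambda)
```

The grid search should land within one grid step of the value returned by boxcox().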

Thus, each data value was transformed using the following equation:

New = (old^0.242 - 1) / 0.242

We can confirm this by looking at the values from the original data compared to the transformed data:

#view first five values of original dataset
data[0:5]

array([0.79587451, 1.25593076, 0.92322315, 0.78720115, 0.55104849])

#view first five values of transformed dataset
transformed_data[0:5]

array([-0.22212062,  0.23427768, -0.07911706, -0.23247555, -0.55495228])

The first value in the original dataset was 0.79587. Thus, we applied the following formula to transform this value:

New = (0.79587^0.242 - 1) / 0.242 = -0.222

We can confirm that the first value in the transformed dataset is indeed -0.222.
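
The same check can be applied to every value at once. The sketch below is an addition to the original example: it regenerates the dataset, applies the formula by hand with the full-precision lambda, and then uses scipy.special.inv_boxcox() to map the transformed values back to the original scale:

```python
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

#recreate the dataset from the example
np.random.seed(0)
data = np.random.exponential(size=1000)

#perform Box-Cox transformation
transformed_data, best_lambda = boxcox(data)

#apply the formula by hand to all values
manual = (data**best_lambda - 1) / best_lambda
formula_matches = np.allclose(manual, transformed_data)

#invert the transformation to recover the original values
recovered = inv_boxcox(transformed_data, best_lambda)
inverse_matches = np.allclose(recovered, data)

print(formula_matches, inverse_matches)
```

Being able to invert the transformation is useful when results computed on the transformed scale need to be reported in the original units.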

Additional Resources

How to Create & Interpret a Q-Q Plot in Python
How to Perform a Shapiro-Wilk Test for Normality in Python