A Box-Cox transformation is a commonly used method for transforming a non-normally distributed dataset into a more normally distributed one.
The basic idea behind this method is to find some value for λ such that the transformed data is as close to normally distributed as possible, using the following formula:
- y(λ) = (y^λ - 1) / λ if λ ≠ 0
- y(λ) = log(y) if λ = 0
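As a minimal sketch of the two cases above, the formula can be written directly in NumPy for a fixed λ. The helper name boxcox_manual and the small input array are hypothetical and used only for illustration. The result is compared against scipy.stats.boxcox(), which returns only the transformed values when a value is passed for lmbda.

```python
import numpy as np
from scipy.stats import boxcox

def boxcox_manual(y, lam):
    """Apply the Box-Cox formula for a fixed lambda (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        # the formula is only defined for strictly positive values
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return np.log(y)          # the lambda = 0 case
    return (y**lam - 1) / lam     # the lambda != 0 case

# small illustrative input (not from the example dataset)
y = np.array([0.5, 1.0, 2.0, 4.0])

manual_half = boxcox_manual(y, 0.5)
manual_zero = boxcox_manual(y, 0.0)

# scipy applies the same formula when lmbda is supplied
scipy_half = boxcox(y, lmbda=0.5)
scipy_zero = boxcox(y, lmbda=0)

print(np.allclose(manual_half, scipy_half))
print(np.allclose(manual_zero, scipy_zero))
```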
We can perform a Box-Cox transformation in Python by using the scipy.stats.boxcox() function, which requires strictly positive input data.
The following example shows how to use this function in practice.
Example: Box-Cox Transformation in Python
Suppose we generate a random set of 1,000 values that come from an exponential distribution:
#load necessary packages
import numpy as np
from scipy.stats import boxcox
import seaborn as sns

#make this example reproducible
np.random.seed(0)

#generate dataset
data = np.random.exponential(size=1000)

#plot the distribution of data values
#(kdeplot replaces the deprecated distplot(hist=False, kde=True))
sns.kdeplot(data)
We can see that the distribution does not appear to be normal.
We can use the boxcox() function to find an optimal value of lambda that produces a more normal distribution:
#perform Box-Cox transformation on original data
transformed_data, best_lambda = boxcox(data)

#plot the distribution of the transformed data values
#(kdeplot replaces the deprecated distplot(hist=False, kde=True))
sns.kdeplot(transformed_data)
We can see that the transformed data is much closer to a normal distribution.
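The visual impression can also be checked numerically. The sketch below compares the sample skewness before and after the transformation, on the assumption that skewness is a reasonable rough indicator here: a normal distribution has a skewness of 0, while an exponential distribution is strongly right-skewed. It is a quick check, not a formal normality test.

```python
import numpy as np
from scipy.stats import boxcox, skew

# recreate the example dataset
np.random.seed(0)
data = np.random.exponential(size=1000)

# apply the Box-Cox transformation with the lambda chosen by scipy
transformed_data, best_lambda = boxcox(data)

# sample skewness: values near 0 indicate a more symmetric distribution
skew_before = skew(data)
skew_after = skew(transformed_data)

print(skew_before)
print(skew_after)
```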
We can also find the exact lambda value used to perform the Box-Cox transformation:
#display optimal lambda value
print(best_lambda)

0.2420131978174143
The optimal lambda was found to be roughly 0.242.
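When no lambda is supplied, boxcox() returns the value that maximizes the Box-Cox log-likelihood. As a rough illustration, the sketch below evaluates scipy.stats.boxcox_llf() on a coarse grid of candidate values and picks the best one. The grid range of -1 to 1 and the step of 0.01 are assumptions made for this example. The grid result should land close to the lambda returned by boxcox().

```python
import numpy as np
from scipy.stats import boxcox, boxcox_llf

# recreate the example dataset
np.random.seed(0)
data = np.random.exponential(size=1000)

# lambda chosen by scipy (maximum likelihood by default)
_, best_lambda = boxcox(data)

# evaluate the log-likelihood on a coarse grid (range and step are assumptions)
candidate_lambdas = np.linspace(-1, 1, 201)
log_likelihoods = np.array([boxcox_llf(lam, data) for lam in candidate_lambdas])

# the grid point with the highest log-likelihood
grid_lambda = candidate_lambdas[np.argmax(log_likelihoods)]

print(best_lambda)
print(grid_lambda)
```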
Thus, each data value was transformed using the following equation:
New = (old^0.242 - 1) / 0.242
We can confirm this by looking at the values from the original data compared to the transformed data:
#view first five values of original dataset
data[0:5]

array([0.79587451, 1.25593076, 0.92322315, 0.78720115, 0.55104849])

#view first five values of transformed dataset
transformed_data[0:5]

array([-0.22212062, 0.23427768, -0.07911706, -0.23247555, -0.55495228])
The first value in the original dataset was 0.79587. Thus, we applied the following formula to transform this value:
New = (0.79587^0.242 - 1) / 0.242 = -0.222
We can confirm that the first value in the transformed dataset is indeed -0.222.
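The same check can be run for every value at once. This short sketch applies the formula with the fitted lambda to the whole dataset and compares the result to the output of boxcox(). The variable name manual is introduced here only for illustration.

```python
import numpy as np
from scipy.stats import boxcox

# recreate the example dataset
np.random.seed(0)
data = np.random.exponential(size=1000)

# transformation and lambda as returned by scipy
transformed_data, best_lambda = boxcox(data)

# apply the formula by hand using the fitted lambda
manual = (data**best_lambda - 1) / best_lambda

# True if the hand-computed values match scipy's output
print(np.allclose(manual, transformed_data))
```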