One of the key assumptions in linear regression is that there is no correlation between the residuals, e.g. the residuals are independent.
To test for first-order autocorrelation, we can perform a Durbin-Watson test. However, if we’d like to test for autocorrelation at higher orders then we need to perform a Breusch-Godfrey test.
This test uses the following hypotheses:
H0 (null hypothesis): There is no autocorrelation at any order less than or equal to p.
HA (alternative hypothesis): There exists autocorrelation at some order less than or equal to p.
The test statistic follows a Chi-Square distribution with p degrees of freedom.
If the p-value that corresponds to this test statistic is less than a certain significance level (e.g. 0.05) then we can reject the null hypothesis and conclude that autocorrelation exists among the residuals at some order less than or equal to p.
To perform a Breusch-Godfrey test in Python, we can use the acorr_breusch_godfrey() function from the statsmodels library.
The following step-by-step example explains how to perform the Breusch-Godfrey test in Python.
Step 1: Create the Data
First, let’s create a dataset that contains two predictor variables (x1 and x2) and one response variable (y).
import pandas as pd #create dataset df = pd.DataFrame({'x1': [3, 4, 4, 5, 8, 9, 11, 13, 14, 16, 17, 20], 'x2': [7, 7, 8, 8, 12, 4, 5, 15, 9, 17, 19, 19], 'y': [24, 25, 25, 27, 29, 31, 34, 34, 39, 30, 40, 49]}) #view first five rows of dataset df.head() x1 x2 y 0 3 7 24 1 4 7 25 2 4 8 25 3 5 8 27 4 8 12 29
Step 2: Fit a Regression Model
Next, we can fit a multiple linear regression model using x1 and x2 as predictor variables and y as the response variable.
import statsmodels.api as sm
#define response variable
y = df['y']
#define predictor variables
x = df[['x1', 'x2']]
#add constant to predictor variables
x = sm.add_constant(x)
#fit linear regression model
model = sm.OLS(y, x).fit()
Step 3: Perform the Breusch-Godfrey Test
Next, we’ll perform the Breusch-Godfrey test to test for autocorrelation among the residuals at order p. For this example, we’ll choose p = 3.
import statsmodels.stats.diagnostic as dg
#perform Breusch-Godfrey test at order p = 3
print(dg.acorr_breusch_godfrey(model, nlags=3))
(8.70314827, 0.0335094873, 5.27967224, 0.0403980576)
The first value in the output represents the test statistic and the second value represents the corresponding p-value.
From the output we can see the following:
- Test statistic X2 = 8.7031
- P-value = 0.0335
Since this p-value is less than 0.05, we can reject the null hypothesis and conclude that autocorrelation exists among the residuals at some order less than or equal to 3.
How to Handle Autocorrelation
If you reject the null hypothesis and conclude that autocorrelation is present in the residuals, then you have a few different options to correct this problem if you deem it to be serious enough:
- For positive serial correlation, consider adding lags of the dependent and/or independent variable to the model.
- For negative serial correlation, check to make sure that none of your variables are overdifferenced.
- For seasonal correlation, consider adding seasonal dummy variables to the model.
Additional Resources
A Complete Guide to Linear Regression in Python
How to Perform a Durbin-Watson Test in Python
How to Perform a Ljung-Box Test in Python