How to Perform Weighted Least Squares Regression in Python

by Tutor Aspire

One of the key assumptions of linear regression is that the residuals are distributed with equal variance at each level of the predictor variable. This assumption is known as homoscedasticity.

When this assumption is violated, we say that heteroscedasticity is present in the residuals. When this occurs, the coefficient estimates are still unbiased, but their standard errors (and therefore the p-values and confidence intervals) become unreliable.

One way to handle this issue is to use weighted least squares regression instead, which assigns a weight to each observation. Observations with small error variance receive more weight, since they contain more information than observations with larger error variance.
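To make the weighting idea concrete, here is a minimal pure-Python sketch of weighted least squares for a single predictor. It minimizes the weighted sum of squared residuals, sum(w * (y - b0 - b1*x)**2). The function name weighted_least_squares and the toy data are hypothetical and are not part of the example used in the rest of this tutorial.

```python
def weighted_least_squares(x, y, w):
    """Return (intercept, slope) minimizing sum(w_i * (y_i - b0 - b1*x_i)**2)."""
    total_w = sum(w)
    # weighted means of the predictor and the response
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    # weighted cross-product and weighted sum of squares around those means
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# hypothetical toy data: the first three points lie on y = 2 + 3x, the last is noisy
x = [1.0, 2.0, 3.0, 4.0]
y = [5.0, 8.0, 11.0, 20.0]

# equal weights reduce to ordinary least squares
equal_fit = weighted_least_squares(x, y, [1.0, 1.0, 1.0, 1.0])

# giving the noisy point a small weight pulls the slope back toward 3
downweighted_fit = weighted_least_squares(x, y, [1.0, 1.0, 1.0, 0.01])

print(equal_fit)
print(downweighted_fit)
```

With equal weights the function gives the ordinary least squares line. Shrinking the weight of the noisy point reduces its influence on the fitted slope, which is the effect weighted least squares relies on.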

This tutorial provides a step-by-step example of how to perform weighted least squares regression in Python.

Step 1: Create the Data

First, let’s create the following pandas DataFrame that contains information about the number of hours studied and the final exam score for 16 students in some class:

import pandas as pd

#create DataFrame
df = pd.DataFrame({'hours': [1, 1, 2, 2, 2, 3, 4, 4, 4, 5, 5, 5, 6, 6, 7, 8],
                   'score': [48, 78, 72, 70, 66, 92, 93, 75, 75, 80, 95, 97,
                             90, 96, 99, 99]})

#view first five rows of DataFrame
print(df.head())

   hours  score
0      1     48
1      1     78
2      2     72
3      2     70
4      2     66

Step 2: Fit Simple Linear Regression Model

Next, we’ll use functions from the statsmodels module to fit a simple linear regression model using hours as the predictor variable and score as the response variable:

import statsmodels.api as sm

#define predictor and response variables
y = df['score']
X = df['hours']

#add constant (intercept column) to predictor variables
X = sm.add_constant(X)

#fit linear regression model
fit = sm.OLS(y, X).fit()

#view model summary
print(fit.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  score   R-squared:                       0.630
Model:                            OLS   Adj. R-squared:                  0.603
Method:                 Least Squares   F-statistic:                     23.80
Date:                Mon, 31 Oct 2022   Prob (F-statistic):           0.000244
Time:                        11:19:54   Log-Likelihood:                -57.184
No. Observations:                  16   AIC:                             118.4
Df Residuals:                      14   BIC:                             119.9
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         60.4669      5.128     11.791      0.000      49.468      71.465
hours          5.5005      1.127      4.879      0.000       3.082       7.919
==============================================================================
Omnibus:                        0.041   Durbin-Watson:                   1.910
Prob(Omnibus):                  0.980   Jarque-Bera (JB):                0.268
Skew:                          -0.010   Prob(JB):                        0.875
Kurtosis:                       2.366   Cond. No.                         10.5
==============================================================================

From the model summary we can see that the R-squared value of the model is 0.630.

Related: What is a Good R-squared Value?

Step 3: Fit Weighted Least Squares Model

Next, we can use the WLS() function from statsmodels to perform weighted least squares by defining the weights in such a way that the observations with lower variance are given more weight:

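#import the formula API used to define the weights
import statsmodels.formula.api as smf

#define weights to use: regress the absolute OLS residuals on the fitted values,
#then take the inverse of the squared fitted values from that regression
wt = 1 / smf.ols('fit.resid.abs() ~ fit.fittedvalues', data=df).fit().fittedvalues**2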

#fit weighted least squares regression model
fit_wls = sm.WLS(y, X, weights=wt).fit()

#view summary of weighted least squares regression model
print(fit_wls.summary())

                            WLS Regression Results                            
==============================================================================
Dep. Variable:                  score   R-squared:                       0.676
Model:                            WLS   Adj. R-squared:                  0.653
Method:                 Least Squares   F-statistic:                     29.24
Date:                Mon, 31 Oct 2022   Prob (F-statistic):           9.24e-05
Time:                        11:20:10   Log-Likelihood:                -55.074
No. Observations:                  16   AIC:                             114.1
Df Residuals:                      14   BIC:                             115.7
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         63.9689      5.159     12.400      0.000      52.905      75.033
hours          4.7091      0.871      5.407      0.000       2.841       6.577
==============================================================================
Omnibus:                        2.482   Durbin-Watson:                   1.786
Prob(Omnibus):                  0.289   Jarque-Bera (JB):                1.058
Skew:                           0.029   Prob(JB):                        0.589
Kurtosis:                       1.742   Cond. No.                         17.6
==============================================================================

From the output we can see that the R-squared value for this weighted least squares model increased to 0.676.

This suggests that the weighted least squares model explains more of the variance in exam scores and offers a better fit to the data than the simple linear regression model. Keep in mind that the R-squared of a weighted model is computed on the weighted data, so the comparison is best read alongside the standard errors: the standard error of the hours coefficient dropped from 1.127 to 0.871.

Additional Resources

The following tutorials explain how to perform other common tasks in Python:

How to Create a Residual Plot in Python
How to Create a Q-Q Plot in Python
How to Test for Multicollinearity in Python
