In statistics, we often use the Pearson correlation coefficient to measure the linear relationship between two variables. However, sometimes we’re interested in understanding the relationship between two variables while controlling for a third variable.
For example, suppose we want to measure the association between the number of hours a student studies and the final exam score they receive, while controlling for the student’s current grade in the class. In this case, we could use a partial correlation to measure the relationship between hours studied and final exam score.
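For a single control variable, the partial correlation can be written in terms of the three ordinary Pearson correlations. The symbols here are notation introduced for illustration, not taken from the example's code: x and y are the two variables of interest (hours studied and exam score), z is the variable being controlled for (current grade), and r with two subscripts is the Pearson correlation between that pair.

```latex
r_{xy \cdot z} = \frac{r_{xy} - r_{xz}\, r_{yz}}{\sqrt{\left(1 - r_{xz}^{2}\right)\left(1 - r_{yz}^{2}\right)}}
```

When z is uncorrelated with both x and y, the formula reduces to the ordinary correlation between x and y.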
This tutorial explains how to calculate partial correlation in Python.
Example: Partial Correlation in Python
Suppose we have the following Pandas DataFrame that displays the current grade, total hours studied, and final exam score for 10 students:
import numpy as np
import pandas as pd

data = {'currentGrade': [82, 88, 75, 74, 93, 97, 83, 90, 90, 80],
        'hours': [4, 3, 6, 5, 4, 5, 8, 7, 4, 6],
        'examScore': [88, 85, 76, 70, 92, 94, 89, 85, 90, 93]}

df = pd.DataFrame(data, columns=['currentGrade', 'hours', 'examScore'])

#view DataFrame
df

   currentGrade  hours  examScore
0            82      4         88
1            88      3         85
2            75      6         76
3            74      5         70
4            93      4         92
5            97      5         94
6            83      8         89
7            90      7         85
8            90      4         90
9            80      6         93
To calculate the partial correlation between hours and examScore while controlling for currentGrade, we can use the partial_corr() function from the pingouin package, which uses the following syntax:
partial_corr(data, x, y, covar)
where:
- data: the pandas DataFrame that contains the variables
- x, y: names of the two columns to correlate
- covar: name of the covariate column in the DataFrame (i.e. the variable you’re controlling for)
Here is how to use this function in this particular example:
#install and import pingouin package
pip install pingouin
import pingouin as pg

#find partial correlation between hours and exam score while controlling for grade
pg.partial_corr(data=df, x='hours', y='examScore', covar='currentGrade')

          n      r         CI95%     r2  adj_r2  p-val   BF10  power
pearson  10  0.191  [-0.5, 0.73]  0.036  -0.238  0.598  0.438  0.082
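We can see that the partial correlation between hours studied and final exam score is 0.191, which is a small positive correlation. As hours studied increases, exam score tends to increase as well, assuming current grade is held constant. Note, however, that the p-value of 0.598 and the 95% confidence interval of [-0.5, 0.73], which includes zero, indicate that this correlation is not statistically significant in a sample of only 10 students.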
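To make "held constant" concrete, the same value can be reproduced without pingouin by removing the linear effect of current grade from both variables and correlating what is left. The following is a minimal sketch using only NumPy; the helper name residuals and the variable names are illustrative choices, and this is not a description of how pingouin computes the result internally.

```python
import numpy as np

# Same data as the DataFrame in the example above
current_grade = np.array([82, 88, 75, 74, 93, 97, 83, 90, 90, 80], dtype=float)
hours = np.array([4, 3, 6, 5, 4, 5, 8, 7, 4, 6], dtype=float)
exam_score = np.array([88, 85, 76, 70, 92, 94, 89, 85, 90, 93], dtype=float)


def residuals(target, covariate):
    """Return what is left of `target` after fitting a least-squares line on `covariate`."""
    slope, intercept = np.polyfit(covariate, target, deg=1)
    return target - (slope * covariate + intercept)


# Remove the linear effect of current grade from both variables
hours_resid = residuals(hours, current_grade)
exam_resid = residuals(exam_score, current_grade)

# The Pearson correlation of the residuals is the partial correlation
partial_r = np.corrcoef(hours_resid, exam_resid)[0, 1]
print(round(partial_r, 3))  # 0.191
```

The result agrees with the r value reported by partial_corr() for this data.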
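To calculate the partial correlation between every pair of variables at once, we can use the .pcorr() method, which becomes available on pandas DataFrames once pingouin is imported. Each pairwise value is computed while controlling for all of the other columns in the DataFrame: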
#calculate all pairwise partial correlations, rounded to three decimal places
df.pcorr().round(3)
              currentGrade   hours  examScore
currentGrade         1.000  -0.311      0.736
hours               -0.311   1.000      0.191
examScore            0.736   0.191      1.000
The way to interpret the output is as follows:
- The partial correlation between current grade and hours studied, controlling for exam score, is -0.311.
- The partial correlation between current grade and exam score, controlling for hours studied, is 0.736.
- The partial correlation between hours studied and exam score, controlling for current grade, is 0.191.
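The full matrix of pairwise partial correlations can also be reproduced with NumPy alone. One common approach, sketched below, inverts the correlation matrix and rescales the off-diagonal entries; the variable names are illustrative, and this is offered as a cross-check rather than as a description of pingouin's implementation.

```python
import numpy as np

# Columns in the same order as the DataFrame: currentGrade, hours, examScore
values = np.column_stack([
    [82, 88, 75, 74, 93, 97, 83, 90, 90, 80],  # currentGrade
    [4, 3, 6, 5, 4, 5, 8, 7, 4, 6],            # hours
    [88, 85, 76, 70, 92, 94, 89, 85, 90, 93],  # examScore
]).astype(float)

# Ordinary Pearson correlation matrix (each column is a variable)
corr = np.corrcoef(values, rowvar=False)

# Invert it to get the precision matrix
precision = np.linalg.inv(corr)

# Partial correlation of i and j given all other variables:
# -P[i, j] / sqrt(P[i, i] * P[j, j])
scale = np.sqrt(np.diag(precision))
partial = -precision / np.outer(scale, scale)
np.fill_diagonal(partial, 1.0)

print(np.round(partial, 3))
```

For this data the off-diagonal entries match the .pcorr() output above: -0.311, 0.736, and 0.191.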