Often in statistics, the datasets we’re working with include categorical variables.
These are variables that take on names or labels. Examples include:
- Marital status (“married”, “single”, “divorced”)
- Smoking status (“smoker”, “non-smoker”)
- Eye color (“blue”, “green”, “hazel”)
- Level of education (e.g. “high school”, “Bachelor’s degree”, “Master’s degree”)
When fitting machine learning algorithms (like linear regression, logistic regression, random forests, etc.), we often convert categorical variables to dummy variables, which are numeric variables that are used to represent categorical data.
For example, suppose we have a dataset that contains the categorical variable Gender. To use this variable as a predictor in a regression model, we would first need to convert it to a dummy variable.
To create this dummy variable, we can choose one of the values (“Male”) to represent 0 and the other value (“Female”) to represent 1.
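As a minimal sketch of this coding, the mapping can be applied by hand with the pandas map() method. The small DataFrame and the column name Gender_dummy are hypothetical and exist only for illustration:

```python
import pandas as pd

# Hypothetical data: the values below are made up for illustration
df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# Map each label to the 0/1 coding described above
df["Gender_dummy"] = df["Gender"].map({"Male": 0, "Female": 1})

print(df)
```

Which value is coded as 0 is arbitrary. It only changes which category serves as the baseline when the model is interpreted.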
How to Create Dummy Variables in Pandas
To create dummy variables for a variable in a pandas DataFrame, we can use the pandas.get_dummies() function, which uses the following basic syntax:
pandas.get_dummies(data, prefix=None, columns=None, drop_first=False)
where:
- data: The name of the pandas DataFrame
- prefix: A string to prepend to the names of the new dummy variable columns
- columns: The name of the column(s) to convert to a dummy variable
- drop_first: Whether to drop the first dummy variable column, which leaves k-1 dummy variables for a variable with k categories
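The short sketch below illustrates how prefix and drop_first change the result. The eye_color data is hypothetical. The dtype=int argument is not part of the basic syntax above; it is added here so the output is 0/1, because recent pandas versions (2.0 and later) return True/False by default:

```python
import pandas as pd

# Hypothetical data used only to illustrate the parameters
df = pd.DataFrame({"eye_color": ["blue", "green", "hazel", "green"]})

# Default naming: one dummy column per category, named <column>_<category>
all_levels = pd.get_dummies(df, columns=["eye_color"], dtype=int)

# prefix replaces the column name at the front of each new column name;
# drop_first=True drops the first category ("blue"), leaving k-1 columns
reduced = pd.get_dummies(df, columns=["eye_color"], prefix="eye",
                         drop_first=True, dtype=int)

print(all_levels.columns.tolist())
print(reduced.columns.tolist())
```

In the reduced version, a row with 0 in every dummy column belongs to the dropped category.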
The following examples show how to use this function in practice.
Example 1: Create a Single Dummy Variable
Suppose we have the following pandas DataFrame:
import pandas as pd

#create DataFrame
df = pd.DataFrame({'income': [45, 48, 54, 57, 65, 69, 78],
                   'age': [23, 25, 24, 29, 38, 36, 40],
                   'gender': ['M', 'F', 'M', 'F', 'F', 'F', 'M']})

#view DataFrame
df

   income  age gender
0      45   23      M
1      48   25      F
2      54   24      M
3      57   29      F
4      65   38      F
5      69   36      F
6      78   40      M
We can use the pd.get_dummies() function to turn gender into a dummy variable:
#convert gender to dummy variable
#(dtype=int gives 0/1 values; pandas 2.0+ returns True/False by default)
pd.get_dummies(df, columns=['gender'], drop_first=True, dtype=int)

   income  age  gender_M
0      45   23         1
1      48   25         0
2      54   24         1
3      57   29         0
4      65   38         0
5      69   36         0
6      78   40         1
The gender column has been replaced by the dummy variable gender_M, where:
- A value of 0 represents “Female”
- A value of 1 represents “Male”
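Note that pd.get_dummies() returns a new DataFrame and leaves the original unchanged, so the result has to be assigned to a variable to be kept. A minimal sketch, where the name df_encoded is an arbitrary choice and dtype=int is added to keep 0/1 values in recent pandas versions:

```python
import pandas as pd

df = pd.DataFrame({'income': [45, 48, 54, 57, 65, 69, 78],
                   'age': [23, 25, 24, 29, 38, 36, 40],
                   'gender': ['M', 'F', 'M', 'F', 'F', 'F', 'M']})

# get_dummies returns a new DataFrame; df itself is not modified
df_encoded = pd.get_dummies(df, columns=['gender'], drop_first=True, dtype=int)

print(df.columns.tolist())          # original columns, gender still present
print(df_encoded.columns.tolist())  # gender replaced by gender_M
```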
Example 2: Create Multiple Dummy Variables
Suppose we have the following pandas DataFrame:
import pandas as pd

#create DataFrame
df = pd.DataFrame({'income': [45, 48, 54, 57, 65, 69, 78],
                   'age': [23, 25, 24, 29, 38, 36, 40],
                   'gender': ['M', 'F', 'M', 'F', 'F', 'F', 'M'],
                   'college': ['Y', 'N', 'N', 'N', 'Y', 'Y', 'Y']})

#view DataFrame
df

   income  age gender college
0      45   23      M       Y
1      48   25      F       N
2      54   24      M       N
3      57   29      F       N
4      65   38      F       Y
5      69   36      F       Y
6      78   40      M       Y
We can use the pd.get_dummies() function to convert gender and college both into dummy variables:
#convert gender and college to dummy variables
#(dtype=int gives 0/1 values; pandas 2.0+ returns True/False by default)
pd.get_dummies(df, columns=['gender', 'college'], drop_first=True, dtype=int)

   income  age  gender_M  college_Y
0      45   23         1          1
1      48   25         0          0
2      54   24         1          0
3      57   29         0          0
4      65   38         0          1
5      69   36         0          1
6      78   40         1          1
The gender column has been replaced by the dummy variable gender_M, where:
- A value of 0 represents “Female”
- A value of 1 represents “Male”
And the college column has been replaced by the dummy variable college_Y, where:
- A value of 0 represents “N” (no college)
- A value of 1 represents “Y” (college)
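Once the categorical columns are encoded, they can be used as numeric predictors. The sketch below is one possible illustration, not a recommended modeling workflow: it fits an ordinary least squares regression of income on age, gender_M and college_Y using NumPy's np.linalg.lstsq. The choice of income as the response and the variable names encoded, predictors, X, y and coef are assumptions made for this example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'income': [45, 48, 54, 57, 65, 69, 78],
                   'age': [23, 25, 24, 29, 38, 36, 40],
                   'gender': ['M', 'F', 'M', 'F', 'F', 'F', 'M'],
                   'college': ['Y', 'N', 'N', 'N', 'Y', 'Y', 'Y']})

# Encode both categorical columns as 0/1 dummy variables
encoded = pd.get_dummies(df, columns=['gender', 'college'],
                         drop_first=True, dtype=int)

# Design matrix: an intercept column plus the numeric and dummy predictors
predictors = ['age', 'gender_M', 'college_Y']
X = np.column_stack([np.ones(len(encoded)),
                     encoded[predictors].to_numpy(dtype=float)])
y = encoded['income'].to_numpy(dtype=float)

# Least squares fit; coef holds the intercept followed by one slope per predictor
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

for name, value in zip(['intercept'] + predictors, coef):
    print(f"{name}: {value:.3f}")
```

Because drop_first=True was used, the coefficient on each dummy variable is interpreted relative to the dropped category (“F” for gender and “N” for college).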