One-hot encoding is used to convert categorical variables into a format that can be readily used by machine learning algorithms.
The basic idea of one-hot encoding is to create new variables that take on values 0 and 1 to represent the original categorical values.
For example, the following image shows how we would perform one-hot encoding to convert a categorical variable that contains team names into new variables that contain only 0 and 1 values:
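To make the idea concrete in text form, here is a minimal sketch in plain Python, without any libraries. The `teams` list is an illustrative assumption that mirrors the team names used later in this tutorial; the variable names are hypothetical and not part of any library.

```python
# Illustrative sample values (assumed), mirroring the team names in this tutorial
teams = ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C']

# Sorted list of distinct categories: one new 0/1 variable per category
categories = sorted(set(teams))

# For each row, place a 1 in the position of its category and 0 elsewhere
one_hot = [[1 if team == cat else 0 for cat in categories] for team in teams]

# Show each original value next to its one-hot representation
print(categories)
for team, row in zip(teams, one_hot):
    print(team, row)
```

Each row contains exactly one 1, which is where the name "one-hot" comes from.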
The following step-by-step example shows how to perform one-hot encoding for this exact dataset in Python.
Step 1: Create the Data
First, let’s create the following pandas DataFrame:
import pandas as pd

#create DataFrame
df = pd.DataFrame({'team': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
                   'points': [25, 12, 15, 14, 19, 23, 25, 29]})

#view DataFrame
print(df)

  team  points
0    A      25
1    A      12
2    B      15
3    B      14
4    B      19
5    B      23
6    C      25
7    C      29
Step 2: Perform One-Hot Encoding
Next, let’s import the OneHotEncoder class from the sklearn library and use it to perform one-hot encoding on the ‘team’ variable in the pandas DataFrame:
from sklearn.preprocessing import OneHotEncoder

#creating instance of one-hot-encoder
encoder = OneHotEncoder(handle_unknown='ignore')

#perform one-hot encoding on 'team' column
encoder_df = pd.DataFrame(encoder.fit_transform(df[['team']]).toarray())

#merge one-hot encoded columns back with original DataFrame
final_df = df.join(encoder_df)

#view final df
print(final_df)

  team  points    0    1    2
0    A      25  1.0  0.0  0.0
1    A      12  1.0  0.0  0.0
2    B      15  0.0  1.0  0.0
3    B      14  0.0  1.0  0.0
4    B      19  0.0  1.0  0.0
5    B      23  0.0  1.0  0.0
6    C      25  0.0  0.0  1.0
7    C      29  0.0  0.0  1.0
Notice that three new columns (0, 1, and 2) were added to the DataFrame since the original ‘team’ column contained three unique values. These columns correspond to teams A, B, and C, in that order.
Note: You can find the complete documentation for the OneHotEncoder class in the official scikit-learn documentation.
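The encoder above was created with handle_unknown='ignore'. As a rough sketch of what that setting does, the example below fits the encoder on one set of teams and then transforms data containing a team it has never seen. The `train` and `new_data` DataFrames, and the team 'D', are hypothetical values made up for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: the encoder learns teams A, B and C only
train = pd.DataFrame({'team': ['A', 'B', 'C']})

# Hypothetical new data containing an unseen team 'D'
new_data = pd.DataFrame({'team': ['B', 'D']})

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train[['team']])

# With handle_unknown='ignore', an unseen category is encoded as all zeros
encoded_new = encoder.transform(new_data[['team']]).toarray()
print(encoded_new)
```

The row for team 'B' has a 1 in the second column, while the row for the unseen team 'D' contains only zeros. With the default setting, handle_unknown='error', the same transform call would raise an error instead.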
Step 3: Drop the Original Categorical Variable
Lastly, we can drop the original ‘team’ variable from the DataFrame since we no longer need it:
#drop 'team' column
final_df.drop('team', axis=1, inplace=True)

#view final df
print(final_df)

   points    0    1    2
0      25  1.0  0.0  0.0
1      12  1.0  0.0  0.0
2      15  0.0  1.0  0.0
3      14  0.0  1.0  0.0
4      19  0.0  1.0  0.0
5      23  0.0  1.0  0.0
6      25  0.0  0.0  1.0
7      29  0.0  0.0  1.0
Related: How to Drop Columns in Pandas (4 Methods)
We could also rename the columns of the final DataFrame to make them easier to read:
#rename columns
final_df.columns = ['points', 'teamA', 'teamB', 'teamC']

#view final df
print(final_df)

   points  teamA  teamB  teamC
0      25    1.0    0.0    0.0
1      12    1.0    0.0    0.0
2      15    0.0    1.0    0.0
3      14    0.0    1.0    0.0
4      19    0.0    1.0    0.0
5      23    0.0    1.0    0.0
6      25    0.0    0.0    1.0
7      29    0.0    0.0    1.0
The one-hot encoding is complete and we can now feed this pandas DataFrame into any machine learning algorithm that we’d like.
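As an alternative to the steps above, pandas offers its own `get_dummies()` function, which encodes, drops the original column, and names the new columns in a single call. This is a different technique from the OneHotEncoder approach used in this tutorial, shown here only as a brief sketch; the name `dummies_df` is hypothetical.

```python
import pandas as pd

# Same data as in Step 1
df = pd.DataFrame({'team': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
                   'points': [25, 12, 15, 14, 19, 23, 25, 29]})

# Encode 'team' and replace it with one 0/1 column per unique value;
# dtype=int requests integer 0/1 values instead of booleans
dummies_df = pd.get_dummies(df, columns=['team'], dtype=int)

print(dummies_df.columns.tolist())
print(dummies_df)
```

One trade-off to keep in mind: `get_dummies()` does not remember the categories it saw, so OneHotEncoder is usually the safer choice when the same encoding must later be applied to new data.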