
How to Convert Categorical Variable to Numeric in Pandas

by Tutor Aspire

You can use the following basic syntax to convert a categorical variable to a numeric variable in a pandas DataFrame:

df['column_name'] = pd.factorize(df['column_name'])[0]

You can also use the following syntax to convert every categorical variable in a DataFrame (here, every column with the object dtype) to a numeric variable:

#identify all categorical variables
cat_columns = df.select_dtypes(['object']).columns

#convert all categorical variables to numeric
df[cat_columns] = df[cat_columns].apply(lambda x: pd.factorize(x)[0])

The following examples show how to use this syntax in practice.

Example 1: Convert One Categorical Variable to Numeric

Suppose we have the following pandas DataFrame:

import pandas as pd

#create DataFrame
df = pd.DataFrame({'team': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'position': ['G', 'G', 'F', 'G', 'F', 'C', 'G', 'F', 'C'],
                   'points': [5, 7, 7, 9, 12, 9, 9, 4, 13],
                   'rebounds': [11, 8, 10, 6, 6, 5, 9, 12, 10]})

#view DataFrame
df

  team position  points  rebounds
0    A        G       5        11
1    A        G       7         8
2    A        F       7        10
3    B        G       9         6
4    B        F      12         6
5    B        C       9         5
6    C        G       9         9
7    C        F       4        12
8    C        C      13        10

We can use the following syntax to convert the ‘team’ column to numeric:

#convert 'team' column to numeric
df['team'] = pd.factorize(df['team'])[0]

#view updated DataFrame
df

   team position  points  rebounds
0     0        G       5        11
1     0        G       7         8
2     0        F       7        10
3     1        G       9         6
4     1        F      12         6
5     1        C       9         5
6     2        G       9         9
7     2        F       4        12
8     2        C      13        10

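Here is how the conversion worked. The factorize() function assigns integer codes in the order in which each distinct value first appears:

  • Each team that had a value of 'A' was converted to 0.
  • Each team that had a value of 'B' was converted to 1.
  • Each team that had a value of 'C' was converted to 2.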
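
The mapping above can be inspected directly, because factorize() returns two objects: the integer codes and the unique labels they refer to. The following is a minimal sketch of that idea; the names teams, label_to_code, and restored are illustrative and do not come from the example above.

```python
import pandas as pd

# same values as the 'team' column in Example 1
teams = pd.Series(['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'])

# factorize() returns a tuple: (integer codes, unique labels)
codes, uniques = pd.factorize(teams)

# build a label -> code lookup from the unique labels
label_to_code = {label: code for code, label in enumerate(uniques)}
print(label_to_code)  # {'A': 0, 'B': 1, 'C': 2}

# recover the original labels from the integer codes
restored = uniques.take(codes)
print(list(restored))
```

Keeping the unique labels around is what makes the conversion reversible, since the [0] in the basic syntax discards them.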

Example 2: Convert Multiple Categorical Variables to Numeric

Suppose we once again start from the original pandas DataFrame:

import pandas as pd

#create DataFrame
df = pd.DataFrame({'team': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'position': ['G', 'G', 'F', 'G', 'F', 'C', 'G', 'F', 'C'],
                   'points': [5, 7, 7, 9, 12, 9, 9, 4, 13],
                   'rebounds': [11, 8, 10, 6, 6, 5, 9, 12, 10]})

#view DataFrame
df

  team position  points  rebounds
0    A        G       5        11
1    A        G       7         8
2    A        F       7        10
3    B        G       9         6
4    B        F      12         6
5    B        C       9         5
6    C        G       9         9
7    C        F       4        12
8    C        C      13        10

We can use the following syntax to convert every categorical variable in the DataFrame to a numeric variable:

#get all categorical columns
cat_columns = df.select_dtypes(['object']).columns

#convert all categorical columns to numeric
df[cat_columns] = df[cat_columns].apply(lambda x: pd.factorize(x)[0])

#view updated DataFrame
df

   team  position  points  rebounds
0     0         0       5        11
1     0         0       7         8
2     0         1       7        10
3     1         0       9         6
4     1         1      12         6
5     1         2       9         5
6     2         0       9         9
7     2         1       4        12
8     2         2      13        10

Notice that the two categorical columns (team and position) were both converted to numeric, while the points and rebounds columns remained unchanged.
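
When several columns are converted at once, it can help to keep the unique labels of each column so the codes can be decoded later. The following is a rough sketch under two assumptions that are not part of the example above: the lookups dictionary is a hypothetical helper, and the columns are selected with exclude='number' (every non-numeric column) rather than ['object'], on the guess that this depends less on how a given pandas version stores strings.

```python
import pandas as pd

df = pd.DataFrame({'team': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'position': ['G', 'G', 'F', 'G', 'F', 'C', 'G', 'F', 'C'],
                   'points': [5, 7, 7, 9, 12, 9, 9, 4, 13],
                   'rebounds': [11, 8, 10, 6, 6, 5, 9, 12, 10]})

# assumption: treat every non-numeric column as categorical
cat_columns = df.select_dtypes(exclude='number').columns

# hypothetical helper: unique labels per column, kept for decoding
lookups = {}
for col in cat_columns:
    codes, uniques = pd.factorize(df[col])
    df[col] = codes          # overwrite the column with integer codes
    lookups[col] = uniques   # remember which label each code stands for

# decode one column back to its original labels
decoded_position = lookups['position'].take(df['position'].to_numpy())
print(list(decoded_position))
```

Note that exclude='number' would also pick up boolean or datetime columns if the DataFrame had any, so the selection should be adjusted to the data at hand.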

Note: You can find the complete documentation for the factorize() function in the official pandas documentation.
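
Two details of factorize() are worth checking before relying on the codes: missing values receive the code -1 by default, and the sort argument controls whether codes follow order of first appearance or the sorted order of the labels. The following is a small sketch with made-up values; the Series s is purely illustrative.

```python
import numpy as np
import pandas as pd

# illustrative values, including one missing entry
s = pd.Series(['b', 'a', np.nan, 'b', 'c'])

# default: codes follow order of first appearance, NaN becomes -1
codes_default, uniques_default = pd.factorize(s)
print(list(codes_default))    # [0, 1, -1, 0, 2]
print(list(uniques_default))  # ['b', 'a', 'c']

# sort=True: codes follow the sorted order of the labels
codes_sorted, uniques_sorted = pd.factorize(s, sort=True)
print(list(codes_sorted))     # [1, 0, -1, 1, 2]
print(list(uniques_sorted))   # ['a', 'b', 'c']
```

If a column may contain missing values, the -1 codes should be handled explicitly before the column is used in a model.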

Additional Resources

The following tutorials explain how to perform other common operations in pandas:

How to Convert Pandas DataFrame Columns to Strings
How to Convert Pandas DataFrame Columns to Integer
How to Convert Strings to Float in Pandas DataFrame
