The easiest way to drop duplicate rows in a pandas DataFrame is by using the drop_duplicates() function, which uses the following syntax:
df.drop_duplicates(subset=None, keep='first', inplace=False)
where:
- subset: Which columns to consider for identifying duplicates. Default is all columns.
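- keep: Indicates which duplicates (if any) to keep. Default is 'first'.
  - 'first': Delete all duplicate rows except the first occurrence.
  - 'last': Delete all duplicate rows except the last occurrence.
  - False: Delete every row that has a duplicate, including the first occurrence.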
- inplace: Indicates whether to drop duplicates in place or return a copy of the DataFrame.
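As a minimal sketch of how the inplace argument behaves, consider the snippet below. The DataFrame name demo and its values are hypothetical and used only for illustration; they are not part of the examples that follow.

```python
import pandas as pd

# Hypothetical DataFrame used only for this illustration
demo = pd.DataFrame({'team': ['a', 'a', 'b'],
                     'points': [1, 1, 2]})

# Default (inplace=False): a new DataFrame is returned and demo is unchanged
deduped = demo.drop_duplicates()
print(len(demo), len(deduped))  # 3 2

# inplace=True: demo itself is modified and the call returns None
result = demo.drop_duplicates(inplace=True)
print(result, len(demo))  # None 2
```

Because the in-place call returns None, writing demo = demo.drop_duplicates(inplace=True) would overwrite the DataFrame with None, so it is usually safer to pick one style and stick with it.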
This tutorial provides several examples of how to use this function in practice on the following DataFrame:
import pandas as pd

#create DataFrame
df = pd.DataFrame({'team': ['a', 'b', 'b', 'c', 'c', 'd'],
                   'points': [3, 7, 7, 8, 8, 9],
                   'assists': [8, 6, 7, 9, 9, 3]})

#display DataFrame
print(df)

team points assists
0 a 3 8
1 b 7 6
2 b 7 7
3 c 8 9
4 c 8 9
5 d 9 3
Example 1: Remove Duplicates Across All Columns
The following code shows how to remove rows that have duplicate values across all columns:
df.drop_duplicates()
team points assists
0 a 3 8
1 b 7 6
2 b 7 7
3 c 8 9
5 d 9 3
By default, the drop_duplicates() function deletes all duplicates except the first.
However, we could use the keep=False argument to delete all duplicates entirely:
df.drop_duplicates(keep=False)
team points assists
0 a 3 8
1 b 7 6
2 b 7 7
5 d 9 3
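The keep='last' option is not demonstrated above, so here is a short sketch of it on the same DataFrame. It also uses the related duplicated() method to show which rows the default settings would remove. The variable names last_kept and mask are illustrative only.

```python
import pandas as pd

# Same DataFrame as in the tutorial
df = pd.DataFrame({'team': ['a', 'b', 'b', 'c', 'c', 'd'],
                   'points': [3, 7, 7, 8, 8, 9],
                   'assists': [8, 6, 7, 9, 9, 3]})

# keep='last' drops the earlier copy of each fully duplicated row,
# so row 3 is removed and row 4 is kept
last_kept = df.drop_duplicates(keep='last')
print(last_kept.index.tolist())  # [0, 1, 2, 4, 5]

# duplicated() returns a boolean mask marking the rows that
# drop_duplicates() would remove under the same keep setting
mask = df.duplicated(keep='first')
print(mask.tolist())  # [False, False, False, False, True, False]
```

Checking the mask first can be a useful way to inspect the duplicates before actually deleting anything.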
Example 2: Remove Duplicates Across Specific Columns
The following code shows how to remove rows that have duplicate values across just the columns titled team and points:
df.drop_duplicates(subset=['team', 'points'])
team points assists
0 a 3 8
1 b 7 6
3 c 8 9
5 d 9 3
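The subset and keep arguments can also be combined. The sketch below, which is an illustration rather than part of the original examples, keeps the last row for each team and points pair and then renumbers the index with reset_index(drop=True). The variable name latest is hypothetical.

```python
import pandas as pd

# Same DataFrame as in the tutorial
df = pd.DataFrame({'team': ['a', 'b', 'b', 'c', 'c', 'd'],
                   'points': [3, 7, 7, 8, 8, 9],
                   'assists': [8, 6, 7, 9, 9, 3]})

# Compare rows on team and points only, keeping the last match in each group
latest = df.drop_duplicates(subset=['team', 'points'], keep='last')
print(latest.index.tolist())       # [0, 2, 4, 5]
print(latest['assists'].tolist())  # [8, 7, 9, 3]

# The original row labels are preserved; reset them for a clean 0..n-1 index
latest = latest.reset_index(drop=True)
print(latest.index.tolist())       # [0, 1, 2, 3]
```

Note that for team b the kept row now has 7 assists instead of 6, since the columns outside subset are ignored when deciding which rows count as duplicates.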