You can use the following basic syntax to use the groupby() function with the diff() function in pandas:
df = df.sort_values(by=['group_var1', 'group_var2']) df['diff'] = df.groupby(['group_var1'])['values_var'].diff().fillna(0)
This particular example sorts the rows of the DataFrame by two specific variables, then groups by group_var1 and calculates the difference between rows in the values_var column.
Note that fillna(0) tells pandas to insert a zero whenever the value of the group variable changes between consecutive rows in the DataFrame.
The following example shows how to use this syntax in practice.
Example: How to Use groupby with diff in Pandas
Suppose we have the following pandas DataFrame that contains the total sales made by two different stores on various dates:
import pandas as pd #create DataFrame df = pd.DataFrame({'store': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'], 'date': pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04']), 'sales': [12, 15, 24, 24, 14, 19, 12, 38]}) #view DataFrame print(df) store date sales 0 A 2022-01-01 12 1 A 2022-01-02 15 2 A 2022-01-03 24 3 A 2022-01-04 24 4 B 2022-01-01 14 5 B 2022-01-02 19 6 B 2022-01-03 12 7 B 2022-01-04 38
Now suppose that we would like to create a new column called sales_diff that contains the difference in sales values between consecutive dates, grouped by store.
We can use the following syntax to do so:
#sort DataFrame by store and date
df = df.sort_values(by=['store', 'date'])
#create new column that contains difference between sales grouped by store
df['sales_diff'] = df.groupby(['store'])['sales'].diff().fillna(0)
#view update DataFrame
print(df)
store date sales sales_diff
0 A 2022-01-01 12 0.0
1 A 2022-01-02 15 3.0
2 A 2022-01-03 24 9.0
3 A 2022-01-04 24 0.0
4 B 2022-01-01 14 0.0
5 B 2022-01-02 19 5.0
6 B 2022-01-03 12 -7.0
7 B 2022-01-04 38 26.0
The new sales_diff column contains the difference in sales values between consecutive dates, grouped by store.
For example, we can see:
- The difference in sales at store A between 1/1/2022 and 1/2/2022 is 3.
- The difference in sales at store A between 1/2/2022 and 1/3/2022 is 9.
- The difference in sales at store A between 1/3/2022 and 1/4/2022 is 0.
And so on.
Additional Resources
The following tutorials explain how to perform other common operations in pandas:
How to Perform a GroupBy Sum in Pandas
How to Use Groupby and Plot in Pandas
How to Count Unique Values Using GroupBy in Pandas