Chapter 10. Data aggregation and group operations
The grouping key can be provided in different forms:
a list of values the same length oas the axis being grouped
a column name
a dictionary or Series that corresponds the values on the axis being grouped and the group names
a function to be invoked on the axis index or individual index labels
The following are some examples using these methods.
import numpy as np
import pandas as pd
import matplotlib .pyplot as plt
np .random .seed (0 )
df = pd .DataFrame ({
'key1' : ['a' , 'a' , 'b' , 'b' , 'a' ],
'key2' : ['one' , 'two' , 'one' , 'two' , 'one' ],
'data1' : np .random .randn (5 ),
'data2' : np .random .randn (5 )
})
df
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
key1
key2
data1
data2
0
a
one
1.764052
-0.977278
1
a
two
0.400157
0.950088
2
b
one
0.978738
-0.151357
3
b
two
2.240893
-0.103219
4
a
one
1.867558
0.410599
df = pd .DataFrame ({
'key1' : ['a' , 'a' , 'b' , 'b' , 'a' ],
'key2' : ['one' , 'two' , 'one' , 'two' , 'one' ],
'data1' : np .random .randn (5 ),
'data2' : np .random .randn (5 )
})
df
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
key1
key2
data1
data2
0
a
one
0.144044
0.333674
1
a
two
1.454274
1.494079
2
b
one
0.761038
-0.205158
3
b
two
0.121675
0.313068
4
a
one
0.443863
-0.854096
# `groupby()` creates a new `GroupBy` object.
grouped = df ['data1' ].groupby (df ['key1' ])
grouped
<pandas.core.groupby.generic.SeriesGroupBy object at 0x12d2e21d0>
# Mean of the data in the 'data1' column, grouped by 'key1'.
grouped .mean ()
key1
a 0.680727
b 0.441356
Name: data1, dtype: float64
# Group by two columns.
means = df ['data1' ].groupby ([df ['key1' ], df ['key2' ]]).mean ()
means
key1 key2
a one 0.293953
two 1.454274
b one 0.761038
two 0.121675
Name: data1, dtype: float64
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
key2
one
two
key1
a
0.293953
1.454274
b
0.761038
0.121675
If the grouping information is a column of the same DataFrame, then only the grouping column name is required.
df .groupby ('key1' ).mean ()
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
data1
data2
key1
a
0.680727
0.324553
b
0.441356
0.053955
df .groupby (['key1' , 'key2' ]).mean ()
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
data1
data2
key1
key2
a
one
0.293953
-0.260211
two
1.454274
1.494079
b
one
0.761038
-0.205158
two
0.121675
0.313068
A frequently useful method on a grouped DataFrame is size()
.
df .groupby (['key1' , 'key2' ]).size ()
key1 key2
a one 2
two 1
b one 1
two 1
dtype: int64
The GroupBy object created by groupby()
supports iteration over a sequence of 2-tuples containing the group name and the data.
for name , group in df .groupby ('key1' ):
print (f'group name: { name } ' )
print (group )
print ('' )
group name: a
key1 key2 data1 data2
0 a one 0.144044 0.333674
1 a two 1.454274 1.494079
4 a one 0.443863 -0.854096
group name: b
key1 key2 data1 data2
2 b one 0.761038 -0.205158
3 b two 0.121675 0.313068
for name , group in df .groupby (['key1' , 'key2' ]):
print (f'group name: { name [0 ]} -{ name [1 ]} ' )
print (group )
print ('' )
group name: a-one
key1 key2 data1 data2
0 a one 0.144044 0.333674
4 a one 0.443863 -0.854096
group name: a-two
key1 key2 data1 data2
1 a two 1.454274 1.494079
group name: b-one
key1 key2 data1 data2
2 b one 0.761038 -0.205158
group name: b-two
key1 key2 data1 data2
3 b two 0.121675 0.313068
By default, groupby()
groups on axis=0
, though the columns could also be grouped.
key1 object
key2 object
data1 float64
data2 float64
dtype: object
grouped = df .groupby (df .dtypes , axis = 1 )
for dtype , group in grouped :
print (f'data type: { dtype } ' )
print (group )
print ('' )
data type: float64
data1 data2
0 0.144044 0.333674
1 1.454274 1.494079
2 0.761038 -0.205158
3 0.121675 0.313068
4 0.443863 -0.854096
data type: object
key1 key2
0 a one
1 a two
2 b one
3 b two
4 a one
Selecting a column or subset of columns
A GroupBy object can still be indexed by column name.
The next two statements are equivalent.
df .groupby ('key1' )['data1' ]
<pandas.core.groupby.generic.SeriesGroupBy object at 0x12d2db190>
df ['data1' ].groupby (df ['key1' ])
<pandas.core.groupby.generic.SeriesGroupBy object at 0x12d2f02d0>
Grouping with dictionaries and Series
people = pd .DataFrame (np .random .randn (5 , 5 ),
columns = list ('abcde' ),
index = ['Joe' , 'Steve' , 'Wex' , 'Jim' , 'Travis' ])
people
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
a
b
c
d
e
Joe
-2.552990
0.653619
0.864436
-0.742165
2.269755
Steve
-1.454366
0.045759
-0.187184
1.532779
1.469359
Wex
0.154947
0.378163
-0.887786
-1.980796
-0.347912
Jim
0.156349
1.230291
1.202380
-0.387327
-0.302303
Travis
-1.048553
-1.420018
-1.706270
1.950775
-0.509652
people .iloc [2 , [1 , 2 ]] = np .nan
people
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
a
b
c
d
e
Joe
-2.552990
0.653619
0.864436
-0.742165
2.269755
Steve
-1.454366
0.045759
-0.187184
1.532779
1.469359
Wex
0.154947
NaN
NaN
-1.980796
-0.347912
Jim
0.156349
1.230291
1.202380
-0.387327
-0.302303
Travis
-1.048553
-1.420018
-1.706270
1.950775
-0.509652
If I have a group correspondence for the columns and want to sum together the columns by these groups, I can just pass the dictionary for grouping.
mapping = {
'a' : 'red' , 'b' : 'red' , 'c' : 'blue' ,
'd' : 'blue' , 'e' : 'red' , 'f' : 'orange'
}
by_column = people .groupby (mapping , axis = 1 )
by_column .sum ()
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
blue
red
Joe
0.122271
0.370383
Steve
1.345595
0.060752
Wex
-1.980796
-0.192965
Jim
0.815053
1.084337
Travis
0.244505
-2.978223
A function can be used to create the mappings.
Each group key will be passed once, and the return value defines the groups.
Here is an example of grouping by the length of the first names.
people .groupby (len ).sum ()
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
a
b
c
d
e
3
-2.241693
1.883909
2.066816
-3.110288
1.619540
5
-1.454366
0.045759
-0.187184
1.532779
1.469359
6
-1.048553
-1.420018
-1.706270
1.950775
-0.509652
It is possible to use both a function and an array or dictionary for grouping at the same time.
key_list = ['one' , 'one' , 'one' , 'two' , 'two' ]
people .groupby ([len , key_list ]).min ()
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
a
b
c
d
e
3
one
-2.552990
0.653619
0.864436
-1.980796
-0.347912
two
0.156349
1.230291
1.202380
-0.387327
-0.302303
5
one
-1.454366
0.045759
-0.187184
1.532779
1.469359
6
two
-1.048553
-1.420018
-1.706270
1.950775
-0.509652
For hierarchically indexed data structures, the levels of the axis can be used for grouping.
columns = pd .MultiIndex .from_arrays ([['US' , 'US' , 'US' , 'JP' , 'JP' ],
[1 , 3 , 5 , 1 , 3 ]],
names = ['cty' , 'tenor' ])
hier_df = pd .DataFrame (np .random .randn (4 , 5 ), columns = columns )
hier_df
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
</style>
cty
US
JP
tenor
1
3
5
1
3
0
-0.438074
-1.252795
0.777490
-1.613898
-0.212740
1
-0.895467
0.386902
-0.510805
-1.180632
-0.028182
2
0.428332
0.066517
0.302472
-0.634322
-0.362741
3
-0.672460
-0.359553
-0.813146
-1.726283
0.177426
hier_df .groupby (level = 'cty' , axis = 1 ).count ()
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
cty
JP
US
0
2
3
1
2
3
2
2
3
3
2
3
An aggregation method is a data transformation that turns an array into a scalar value.
When used on a GroupBy object, the transformation is applied to each group, separately.
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
key1
key2
data1
data2
0
a
one
0.144044
0.333674
1
a
two
1.454274
1.494079
2
b
one
0.761038
-0.205158
3
b
two
0.121675
0.313068
4
a
one
0.443863
-0.854096
grouped = df .groupby ('key1' )
grouped ['data1' ].quantile ()
key1
a 0.443863
b 0.441356
Name: data1, dtype: float64
It is common to define custom aggregation methods.
The can be applied to grouped data by passing them to the agg()
method of a GroupBy object.
def peak_to_peak (arr ):
return arr .max () - arr .min ()
grouped .agg (peak_to_peak )
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
data1
data2
key1
a
1.310230
2.348175
b
0.639363
0.518226
Other methods will also perform as expected, even though they are not sticktly aggregations.
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
.dataframe thead tr:last-of-type th {
text-align: right;
}
</style>
data1
data2
count
mean
std
min
25%
50%
75%
max
count
mean
std
min
25%
50%
75%
max
key1
a
3.0
0.680727
0.686479
0.144044
0.293953
0.443863
0.949068
1.454274
3.0
0.324553
1.174114
-0.854096
-0.260211
0.333674
0.913877
1.494079
b
2.0
0.441356
0.452098
0.121675
0.281516
0.441356
0.601197
0.761038
2.0
0.053955
0.366441
-0.205158
-0.075602
0.053955
0.183511
0.313068
Column-wise and multiple function application
It is common to need to use a different aggregation method for each column.
This is demonstrated using an example.
# Read in data on tipping at restaurants.
tips = pd .read_csv ('assets/examples/tips.csv' )
# Calculate the percent of the bill the tip covered.
tips ['tip_pct' ] = tips ['tip' ] / tips ['total_bill' ]
tips .head ()
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
total_bill
tip
smoker
day
time
size
tip_pct
0
16.99
1.01
No
Sun
Dinner
2
0.059447
1
10.34
1.66
No
Sun
Dinner
3
0.160542
2
21.01
3.50
No
Sun
Dinner
3
0.166587
3
23.68
3.31
No
Sun
Dinner
2
0.139780
4
24.59
3.61
No
Sun
Dinner
4
0.146808
grouped = tips .groupby (['day' , 'smoker' ])
grouped_pct = grouped ['tip_pct' ]
# The average tipping percentage per day, split into smokers and non-smokers.
grouped_pct .agg ('mean' )
day smoker
Fri No 0.151650
Yes 0.174783
Sat No 0.158048
Yes 0.147906
Sun No 0.160113
Yes 0.187250
Thur No 0.160298
Yes 0.163863
Name: tip_pct, dtype: float64
The agg
method can be passed a list of functions or function names to apply to the grouped data.
grouped_pct .agg (['mean' , 'std' , peak_to_peak ])
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
mean
std
peak_to_peak
day
smoker
Fri
No
0.151650
0.028123
0.067349
Yes
0.174783
0.051293
0.159925
Sat
No
0.158048
0.039767
0.235193
Yes
0.147906
0.061375
0.290095
Sun
No
0.160113
0.042347
0.193226
Yes
0.187250
0.154134
0.644685
Thur
No
0.160298
0.038774
0.193350
Yes
0.163863
0.039389
0.151240
# Passing 2-tuples provides the name of the new column and the aggregation method.
grouped_pct .agg ([('foo' , 'mean' ), ('bar' , np .std )])
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
foo
bar
day
smoker
Fri
No
0.151650
0.028123
Yes
0.174783
0.051293
Sat
No
0.158048
0.039767
Yes
0.147906
0.061375
Sun
No
0.160113
0.042347
Yes
0.187250
0.154134
Thur
No
0.160298
0.038774
Yes
0.163863
0.039389
The above example was using a grouped Series.
With a DataFrame, a list of functions can be applied to all columns, or specific functions can be applied to specified columns.
functions = ['count' , 'mean' , 'max' ]
result = grouped ['tip_pct' , 'total_bill' ].agg (functions )
result
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
.dataframe thead tr:last-of-type th {
text-align: right;
}
</style>
tip_pct
total_bill
count
mean
max
count
mean
max
day
smoker
Fri
No
4
0.151650
0.187735
4
18.420000
22.75
Yes
15
0.174783
0.263480
15
16.813333
40.17
Sat
No
45
0.158048
0.291990
45
19.661778
48.33
Yes
42
0.147906
0.325733
42
21.276667
50.81
Sun
No
57
0.160113
0.252672
57
20.506667
48.17
Yes
19
0.187250
0.710345
19
24.120000
45.35
Thur
No
45
0.160298
0.266312
45
17.113111
41.19
Yes
17
0.163863
0.241255
17
19.190588
43.11
# A mapping of column names to functions.
fxn_mapping = {
'tip_pct' : ['min' , 'max' , 'mean' , 'std' ],
'size' : 'sum'
}
grouped .agg (fxn_mapping )
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
.dataframe thead tr:last-of-type th {
text-align: right;
}
</style>
tip_pct
size
min
max
mean
std
sum
day
smoker
Fri
No
0.120385
0.187735
0.151650
0.028123
9
Yes
0.103555
0.263480
0.174783
0.051293
31
Sat
No
0.056797
0.291990
0.158048
0.039767
115
Yes
0.035638
0.325733
0.147906
0.061375
104
Sun
No
0.059447
0.252672
0.160113
0.042347
167
Yes
0.065660
0.710345
0.187250
0.154134
49
Thur
No
0.072961
0.266312
0.160298
0.038774
112
Yes
0.090014
0.241255
0.163863
0.039389
40
Return aggregated data without row indexes
So far, the returned aggregations have the group key combinations for an index, often hierarchical.
This can be disabled by passing as_index=False
.
# with index
tips .groupby (['day' , 'smoker' ]).mean ()
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
total_bill
tip
size
tip_pct
day
smoker
Fri
No
18.420000
2.812500
2.250000
0.151650
Yes
16.813333
2.714000
2.066667
0.174783
Sat
No
19.661778
3.102889
2.555556
0.158048
Yes
21.276667
2.875476
2.476190
0.147906
Sun
No
20.506667
3.167895
2.929825
0.160113
Yes
24.120000
3.516842
2.578947
0.187250
Thur
No
17.113111
2.673778
2.488889
0.160298
Yes
19.190588
3.030000
2.352941
0.163863
# without index
tips .groupby (['day' , 'smoker' ], as_index = False ).mean ()
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
day
smoker
total_bill
tip
size
tip_pct
0
Fri
No
18.420000
2.812500
2.250000
0.151650
1
Fri
Yes
16.813333
2.714000
2.066667
0.174783
2
Sat
No
19.661778
3.102889
2.555556
0.158048
3
Sat
Yes
21.276667
2.875476
2.476190
0.147906
4
Sun
No
20.506667
3.167895
2.929825
0.160113
5
Sun
Yes
24.120000
3.516842
2.578947
0.187250
6
Thur
No
17.113111
2.673778
2.488889
0.160298
7
Thur
Yes
19.190588
3.030000
2.352941
0.163863
# also without index
tips .groupby (['day' , 'smoker' ]).mean ().reset_index ()
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
day
smoker
total_bill
tip
size
tip_pct
0
Fri
No
18.420000
2.812500
2.250000
0.151650
1
Fri
Yes
16.813333
2.714000
2.066667
0.174783
2
Sat
No
19.661778
3.102889
2.555556
0.158048
3
Sat
Yes
21.276667
2.875476
2.476190
0.147906
4
Sun
No
20.506667
3.167895
2.929825
0.160113
5
Sun
Yes
24.120000
3.516842
2.578947
0.187250
6
Thur
No
17.113111
2.673778
2.488889
0.160298
7
Thur
Yes
19.190588
3.030000
2.352941
0.163863
10.3 Apply: general split-apply-combine
Apply splits the object being manipulated into pieces, invokes a provided function on each piece, and then concatenates the pieces together.
Here is an example using the tipping data.
def top (df , n = 5 , column = 'tip_pct' ):
'''
Return rows with the top `n` values in `column`.
'''
return df .sort_values (by = column )[- n :]
top (tips , n = 6 )
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
total_bill
tip
smoker
day
time
size
tip_pct
109
14.31
4.00
Yes
Sat
Dinner
2
0.279525
183
23.17
6.50
Yes
Sun
Dinner
4
0.280535
232
11.61
3.39
No
Sat
Dinner
2
0.291990
67
3.07
1.00
Yes
Sat
Dinner
1
0.325733
178
9.60
4.00
Yes
Sun
Dinner
2
0.416667
172
7.25
5.15
Yes
Sun
Dinner
2
0.710345
tips .groupby ('smoker' ).apply (top , n = 3 )
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
total_bill
tip
smoker
day
time
size
tip_pct
smoker
No
51
10.29
2.60
No
Sun
Dinner
2
0.252672
149
7.51
2.00
No
Thur
Lunch
2
0.266312
232
11.61
3.39
No
Sat
Dinner
2
0.291990
Yes
67
3.07
1.00
Yes
Sat
Dinner
1
0.325733
178
9.60
4.00
Yes
Sun
Dinner
2
0.416667
172
7.25
5.15
Yes
Sun
Dinner
2
0.710345
Suppressing the group keys
To prevent the resulting DataFrame from having the grouping keys for row indexes, use group_keys=False
in groupby()
.
tips .groupby ('smoker' , group_keys = False ).apply (top , n = 3 )
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
total_bill
tip
smoker
day
time
size
tip_pct
51
10.29
2.60
No
Sun
Dinner
2
0.252672
149
7.51
2.00
No
Thur
Lunch
2
0.266312
232
11.61
3.39
No
Sat
Dinner
2
0.291990
67
3.07
1.00
Yes
Sat
Dinner
1
0.325733
178
9.60
4.00
Yes
Sun
Dinner
2
0.416667
172
7.25
5.15
Yes
Sun
Dinner
2
0.710345
Quantile and bucket analysis
pandas includes cut()
and qcut()
for dividing data into quantiles.
This often is useful in conjunction with groupby()
.
The categorical object returned by cut()
can be passed directly to groupby()
to use to separate the data.
frame = pd .DataFrame ({'data1' : np .random .randn (1000 ),
'data2' : np .random .randn (1000 )})
quartiles = pd .cut (frame .data1 , 4 )
quartiles .head ()
0 (-1.492, 0.0624]
1 (-3.052, -1.492]
2 (0.0624, 1.617]
3 (-1.492, 0.0624]
4 (-1.492, 0.0624]
Name: data1, dtype: category
Categories (4, interval[float64]): [(-3.052, -1.492] < (-1.492, 0.0624] < (0.0624, 1.617] < (1.617, 3.171]]
def get_stats (group ):
return {
'min' : group .min (),
'max' : group .max (),
'count' : group .count (),
'mean' : group .mean ()
}
frame .data2 .groupby (quartiles ).apply (get_stats ).unstack ()
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
min
max
count
mean
data1
(-3.052, -1.492]
-1.730276
2.153120
68.0
0.097696
(-1.492, 0.0624]
-2.777359
2.680571
482.0
0.056230
(0.0624, 1.617]
-3.116857
2.929096
398.0
-0.046616
(1.617, 3.171]
-2.158069
2.042072
52.0
-0.156563
Use qcut()
to get equally-sized buckets.
grouping = pd .qcut (frame .data1 , 10 , labels = False )
frame .data2 .groupby (grouping ).apply (get_stats ).unstack ()
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
min
max
count
mean
data1
0
-2.534554
2.642936
100.0
-0.013313
1
-1.641703
2.232016
100.0
0.059095
2
-2.777359
2.620574
100.0
0.193168
3
-1.830029
2.488442
100.0
-0.103425
4
-2.123890
2.526368
100.0
0.169418
5
-2.121176
2.680571
100.0
0.044411
6
-3.116857
2.929096
100.0
0.083797
7
-2.994613
2.540232
100.0
-0.305252
8
-1.980566
2.190898
100.0
-0.058111
9
-2.802203
2.198296
100.0
0.000731
10.4 Pivot tables and cross-tabulation
Pivot tables are made with pandas by using groupby()
and some reshaping operations.
There is a pivot_table()
method for DataFrames and a top-level pd.pivot_table()
function.
tips .pivot_table (index = ['day' , 'smoker' ])
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
size
tip
tip_pct
total_bill
day
smoker
Fri
No
2.250000
2.812500
0.151650
18.420000
Yes
2.066667
2.714000
0.174783
16.813333
Sat
No
2.555556
3.102889
0.158048
19.661778
Yes
2.476190
2.875476
0.147906
21.276667
Sun
No
2.929825
3.167895
0.160113
20.506667
Yes
2.578947
3.516842
0.187250
24.120000
Thur
No
2.488889
2.673778
0.160298
17.113111
Yes
2.352941
3.030000
0.163863
19.190588
The following example only aggragates 'tip_pct'
and 'size'
and additionally groups by 'time'
.
tips .pivot_table (['tip_pct' , 'size' ], index = ['time' , 'day' ], columns = 'smoker' )
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
.dataframe thead tr:last-of-type th {
text-align: right;
}
</style>
size
tip_pct
smoker
No
Yes
No
Yes
time
day
Dinner
Fri
2.000000
2.222222
0.139622
0.165347
Sat
2.555556
2.476190
0.158048
0.147906
Sun
2.929825
2.578947
0.160113
0.187250
Thur
2.000000
NaN
0.159744
NaN
Lunch
Fri
3.000000
1.833333
0.187735
0.188937
Thur
2.500000
2.352941
0.160311
0.163863
Setting margins=True
computes partial totals, adding an 'All'
row and column.
tips .pivot_table (['tip_pct' , 'size' ],
index = ['time' , 'day' ],
columns = 'smoker' ,
margins = True )
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
.dataframe thead tr:last-of-type th {
text-align: right;
}
</style>
size
tip_pct
smoker
No
Yes
All
No
Yes
All
time
day
Dinner
Fri
2.000000
2.222222
2.166667
0.139622
0.165347
0.158916
Sat
2.555556
2.476190
2.517241
0.158048
0.147906
0.153152
Sun
2.929825
2.578947
2.842105
0.160113
0.187250
0.166897
Thur
2.000000
NaN
2.000000
0.159744
NaN
0.159744
Lunch
Fri
3.000000
1.833333
2.000000
0.187735
0.188937
0.188765
Thur
2.500000
2.352941
2.459016
0.160311
0.163863
0.161301
All
2.668874
2.408602
2.569672
0.159328
0.163196
0.160803
The default aggregation behaviour is to take the mean.
This can be changed by passing a different function to aggfunc
.
tips .pivot_table ('tip_pct' ,
index = ['time' , 'smoker' ],
columns = 'day' ,
aggfunc = len ,
margins = True )
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
day
Fri
Sat
Sun
Thur
All
time
smoker
Dinner
No
3.0
45.0
57.0
1.0
106.0
Yes
9.0
42.0
19.0
NaN
70.0
Lunch
No
1.0
NaN
NaN
44.0
45.0
Yes
6.0
NaN
NaN
17.0
23.0
All
19.0
87.0
76.0
62.0
244.0
tips .pivot_table ('tip_pct' ,
index = ['time' , 'smoker' ],
columns = 'day' ,
aggfunc = len ,
margins = True ,
fill_value = 0 )
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
day
Fri
Sat
Sun
Thur
All
time
smoker
Dinner
No
3
45
57
1
106.0
Yes
9
42
19
0
70.0
Lunch
No
1
0
0
44
45.0
Yes
6
0
0
17
23.0
All
19
87
76
62
244.0
Cross-tabulations (crosstab)
Cross-tabulation is a special case of a pivot table the computes group frequencies.
data = pd .DataFrame ({
'Sample' : list (range (1 , 11 )),
'Nationality' : ['USA' , 'Japan' , 'USA' , 'Japan' , 'Japan' , 'Japan' , 'USA' , 'USA' , 'Japan' , 'USA' ],
'Handedness' : ['Right-handed' , 'Left-handed' , 'Right-handed' , 'Right-handed' , 'Left-handed' , 'Right-handed' , 'Right-handed' , 'Left-handed' , 'Right-handed' , 'Right-handed' ]
})
data
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
Sample
Nationality
Handedness
0
1
USA
Right-handed
1
2
Japan
Left-handed
2
3
USA
Right-handed
3
4
Japan
Right-handed
4
5
Japan
Left-handed
5
6
Japan
Right-handed
6
7
USA
Right-handed
7
8
USA
Left-handed
8
9
Japan
Right-handed
9
10
USA
Right-handed
pd .crosstab (data .Nationality , data .Handedness , margins = True )
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
Handedness
Left-handed
Right-handed
All
Nationality
Japan
2
3
5
USA
1
4
5
All
3
7
10
pd .crosstab ([tips .time , tips .day ], tips .smoker , margins = True )
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
smoker
No
Yes
All
time
day
Dinner
Fri
3
9
12
Sat
45
42
87
Sun
57
19
76
Thur
1
0
1
Lunch
Fri
1
6
7
Thur
44
17
61
All
151
93
244