Skip to content

Latest commit

 

History

History
3660 lines (3137 loc) · 62 KB

pfda_ch10.md

File metadata and controls

3660 lines (3137 loc) · 62 KB

Chapter 10. Data aggregation and group operations

10.1 Groupby mechanics

The grouping key can be provided in different forms:

  1. a list of values the same length oas the axis being grouped
  2. a column name
  3. a dictionary or Series that corresponds the values on the axis being grouped and the group names
  4. a function to be invoked on the axis index or individual index labels

The following are some examples using these methods.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(0)
df = pd.DataFrame({
    'key1' : ['a', 'a', 'b', 'b', 'a'],
    'key2' : ['one', 'two', 'one', 'two', 'one'],
    'data1' : np.random.randn(5), 
    'data2' : np.random.randn(5)
})
df
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
key1 key2 data1 data2
0 a one 1.764052 -0.977278
1 a two 0.400157 0.950088
2 b one 0.978738 -0.151357
3 b two 2.240893 -0.103219
4 a one 1.867558 0.410599
df = pd.DataFrame({
    'key1' : ['a', 'a', 'b', 'b', 'a'],
    'key2' : ['one', 'two', 'one', 'two', 'one'],
    'data1' : np.random.randn(5), 
    'data2' : np.random.randn(5)
})
df
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
key1 key2 data1 data2
0 a one 0.144044 0.333674
1 a two 1.454274 1.494079
2 b one 0.761038 -0.205158
3 b two 0.121675 0.313068
4 a one 0.443863 -0.854096
# `groupby()` creates a new `GroupBy` object.
grouped = df['data1'].groupby(df['key1'])
grouped
<pandas.core.groupby.generic.SeriesGroupBy object at 0x12d2e21d0>
# Mean of the data in the 'data1' column, grouped by 'key1'.
grouped.mean()
key1
a    0.680727
b    0.441356
Name: data1, dtype: float64
# Group by two columns.
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means
key1  key2
a     one     0.293953
      two     1.454274
b     one     0.761038
      two     0.121675
Name: data1, dtype: float64
means.unstack()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
key2 one two
key1
a 0.293953 1.454274
b 0.761038 0.121675

If the grouping information is a column of the same DataFrame, then only the grouping column name is required.

df.groupby('key1').mean()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
data1 data2
key1
a 0.680727 0.324553
b 0.441356 0.053955
df.groupby(['key1', 'key2']).mean()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
data1 data2
key1 key2
a one 0.293953 -0.260211
two 1.454274 1.494079
b one 0.761038 -0.205158
two 0.121675 0.313068

A frequently useful method on a grouped DataFrame is size().

df.groupby(['key1', 'key2']).size()
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

Iterating over groups

The GroupBy object created by groupby() supports iteration over a sequence of 2-tuples containing the group name and the data.

for name, group in df.groupby('key1'):
    print(f'group name: {name}')
    print(group)
    print('')
group name: a
  key1 key2     data1     data2
0    a  one  0.144044  0.333674
1    a  two  1.454274  1.494079
4    a  one  0.443863 -0.854096

group name: b
  key1 key2     data1     data2
2    b  one  0.761038 -0.205158
3    b  two  0.121675  0.313068
for name, group in df.groupby(['key1', 'key2']):
    print(f'group name: {name[0]}-{name[1]}')
    print(group)
    print('')
group name: a-one
  key1 key2     data1     data2
0    a  one  0.144044  0.333674
4    a  one  0.443863 -0.854096

group name: a-two
  key1 key2     data1     data2
1    a  two  1.454274  1.494079

group name: b-one
  key1 key2     data1     data2
2    b  one  0.761038 -0.205158

group name: b-two
  key1 key2     data1     data2
3    b  two  0.121675  0.313068

By default, groupby() groups on axis=0, though the columns could also be grouped.

df.dtypes
key1      object
key2      object
data1    float64
data2    float64
dtype: object
grouped = df.groupby(df.dtypes, axis=1)
for dtype, group in grouped:
    print(f'data type: {dtype}')
    print(group)
    print('')
data type: float64
      data1     data2
0  0.144044  0.333674
1  1.454274  1.494079
2  0.761038 -0.205158
3  0.121675  0.313068
4  0.443863 -0.854096

data type: object
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one

Selecting a column or subset of columns

A GroupBy object can still be indexed by column name. The next two statements are equivalent.

df.groupby('key1')['data1']
<pandas.core.groupby.generic.SeriesGroupBy object at 0x12d2db190>
df['data1'].groupby(df['key1'])
<pandas.core.groupby.generic.SeriesGroupBy object at 0x12d2f02d0>

Grouping with dictionaries and Series

people = pd.DataFrame(np.random.randn(5, 5),
                      columns=list('abcde'),
                      index=['Joe', 'Steve', 'Wex', 'Jim', 'Travis'])
people
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
a b c d e
Joe -2.552990 0.653619 0.864436 -0.742165 2.269755
Steve -1.454366 0.045759 -0.187184 1.532779 1.469359
Wex 0.154947 0.378163 -0.887786 -1.980796 -0.347912
Jim 0.156349 1.230291 1.202380 -0.387327 -0.302303
Travis -1.048553 -1.420018 -1.706270 1.950775 -0.509652
people.iloc[2, [1, 2]] = np.nan
people
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
a b c d e
Joe -2.552990 0.653619 0.864436 -0.742165 2.269755
Steve -1.454366 0.045759 -0.187184 1.532779 1.469359
Wex 0.154947 NaN NaN -1.980796 -0.347912
Jim 0.156349 1.230291 1.202380 -0.387327 -0.302303
Travis -1.048553 -1.420018 -1.706270 1.950775 -0.509652

If I have a group correspondence for the columns and want to sum together the columns by these groups, I can just pass the dictionary for grouping.

mapping = {
    'a': 'red', 'b': 'red', 'c': 'blue',
    'd': 'blue', 'e': 'red', 'f' : 'orange'
}
by_column = people.groupby(mapping, axis=1)
by_column.sum()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
blue red
Joe 0.122271 0.370383
Steve 1.345595 0.060752
Wex -1.980796 -0.192965
Jim 0.815053 1.084337
Travis 0.244505 -2.978223

Grouping with functions

A function can be used to create the mappings. Each group key will be passed once, and the return value defines the groups.

Here is an example of grouping by the length of the first names.

people.groupby(len).sum()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
a b c d e
3 -2.241693 1.883909 2.066816 -3.110288 1.619540
5 -1.454366 0.045759 -0.187184 1.532779 1.469359
6 -1.048553 -1.420018 -1.706270 1.950775 -0.509652

It is possible to use both a function and an array or dictionary for grouping at the same time.

key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).min()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
a b c d e
3 one -2.552990 0.653619 0.864436 -1.980796 -0.347912
two 0.156349 1.230291 1.202380 -0.387327 -0.302303
5 one -1.454366 0.045759 -0.187184 1.532779 1.469359
6 two -1.048553 -1.420018 -1.706270 1.950775 -0.509652

Grouping by index levels

For hierarchically indexed data structures, the levels of the axis can be used for grouping.

columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
hier_df
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead tr th {
    text-align: left;
}
</style>
cty US JP
tenor 1 3 5 1 3
0 -0.438074 -1.252795 0.777490 -1.613898 -0.212740
1 -0.895467 0.386902 -0.510805 -1.180632 -0.028182
2 0.428332 0.066517 0.302472 -0.634322 -0.362741
3 -0.672460 -0.359553 -0.813146 -1.726283 0.177426
hier_df.groupby(level='cty', axis=1).count()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
cty JP US
0 2 3
1 2 3
2 2 3
3 2 3

10.2 Data Aggregation

An aggregation method is a data transformation that turns an array into a scalar value. When used on a GroupBy object, the transformation is applied to each group, separately.

df
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
key1 key2 data1 data2
0 a one 0.144044 0.333674
1 a two 1.454274 1.494079
2 b one 0.761038 -0.205158
3 b two 0.121675 0.313068
4 a one 0.443863 -0.854096
grouped = df.groupby('key1')
grouped['data1'].quantile()
key1
a    0.443863
b    0.441356
Name: data1, dtype: float64

It is common to define custom aggregation methods. The can be applied to grouped data by passing them to the agg() method of a GroupBy object.

def peak_to_peak(arr):
    return arr.max() - arr.min()

grouped.agg(peak_to_peak)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
data1 data2
key1
a 1.310230 2.348175
b 0.639363 0.518226

Other methods will also perform as expected, even though they are not sticktly aggregations.

grouped.describe()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead tr th {
    text-align: left;
}

.dataframe thead tr:last-of-type th {
    text-align: right;
}
</style>
data1 data2
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
key1
a 3.0 0.680727 0.686479 0.144044 0.293953 0.443863 0.949068 1.454274 3.0 0.324553 1.174114 -0.854096 -0.260211 0.333674 0.913877 1.494079
b 2.0 0.441356 0.452098 0.121675 0.281516 0.441356 0.601197 0.761038 2.0 0.053955 0.366441 -0.205158 -0.075602 0.053955 0.183511 0.313068

Column-wise and multiple function application

It is common to need to use a different aggregation method for each column. This is demonstrated using an example.

# Read in data on tipping at restaurants.
tips = pd.read_csv('assets/examples/tips.csv')

# Calculate the percent of the bill the tip covered.
tips['tip_pct'] = tips['tip'] / tips['total_bill']
tips.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
total_bill tip smoker day time size tip_pct
0 16.99 1.01 No Sun Dinner 2 0.059447
1 10.34 1.66 No Sun Dinner 3 0.160542
2 21.01 3.50 No Sun Dinner 3 0.166587
3 23.68 3.31 No Sun Dinner 2 0.139780
4 24.59 3.61 No Sun Dinner 4 0.146808
grouped = tips.groupby(['day', 'smoker'])
grouped_pct = grouped['tip_pct']

# The average tipping percentage per day, split into smokers and non-smokers.
grouped_pct.agg('mean')
day   smoker
Fri   No        0.151650
      Yes       0.174783
Sat   No        0.158048
      Yes       0.147906
Sun   No        0.160113
      Yes       0.187250
Thur  No        0.160298
      Yes       0.163863
Name: tip_pct, dtype: float64

The agg method can be passed a list of functions or function names to apply to the grouped data.

grouped_pct.agg(['mean', 'std', peak_to_peak])
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
mean std peak_to_peak
day smoker
Fri No 0.151650 0.028123 0.067349
Yes 0.174783 0.051293 0.159925
Sat No 0.158048 0.039767 0.235193
Yes 0.147906 0.061375 0.290095
Sun No 0.160113 0.042347 0.193226
Yes 0.187250 0.154134 0.644685
Thur No 0.160298 0.038774 0.193350
Yes 0.163863 0.039389 0.151240
# Passing 2-tuples provides the name of the new column and the aggregation method.
grouped_pct.agg([('foo', 'mean'), ('bar', np.std)])
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
foo bar
day smoker
Fri No 0.151650 0.028123
Yes 0.174783 0.051293
Sat No 0.158048 0.039767
Yes 0.147906 0.061375
Sun No 0.160113 0.042347
Yes 0.187250 0.154134
Thur No 0.160298 0.038774
Yes 0.163863 0.039389

The above example was using a grouped Series. With a DataFrame, a list of functions can be applied to all columns, or specific functions can be applied to specified columns.

functions = ['count', 'mean', 'max']
result = grouped['tip_pct', 'total_bill'].agg(functions)
result
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead tr th {
    text-align: left;
}

.dataframe thead tr:last-of-type th {
    text-align: right;
}
</style>
tip_pct total_bill
count mean max count mean max
day smoker
Fri No 4 0.151650 0.187735 4 18.420000 22.75
Yes 15 0.174783 0.263480 15 16.813333 40.17
Sat No 45 0.158048 0.291990 45 19.661778 48.33
Yes 42 0.147906 0.325733 42 21.276667 50.81
Sun No 57 0.160113 0.252672 57 20.506667 48.17
Yes 19 0.187250 0.710345 19 24.120000 45.35
Thur No 45 0.160298 0.266312 45 17.113111 41.19
Yes 17 0.163863 0.241255 17 19.190588 43.11
# A mapping of column names to functions.
fxn_mapping = {
    'tip_pct': ['min', 'max', 'mean', 'std'],
    'size': 'sum'
}
grouped.agg(fxn_mapping)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead tr th {
    text-align: left;
}

.dataframe thead tr:last-of-type th {
    text-align: right;
}
</style>
tip_pct size
min max mean std sum
day smoker
Fri No 0.120385 0.187735 0.151650 0.028123 9
Yes 0.103555 0.263480 0.174783 0.051293 31
Sat No 0.056797 0.291990 0.158048 0.039767 115
Yes 0.035638 0.325733 0.147906 0.061375 104
Sun No 0.059447 0.252672 0.160113 0.042347 167
Yes 0.065660 0.710345 0.187250 0.154134 49
Thur No 0.072961 0.266312 0.160298 0.038774 112
Yes 0.090014 0.241255 0.163863 0.039389 40

Return aggregated data without row indexes

So far, the returned aggregations have the group key combinations for an index, often hierarchical. This can be disabled by passing as_index=False.

# with index
tips.groupby(['day', 'smoker']).mean()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
total_bill tip size tip_pct
day smoker
Fri No 18.420000 2.812500 2.250000 0.151650
Yes 16.813333 2.714000 2.066667 0.174783
Sat No 19.661778 3.102889 2.555556 0.158048
Yes 21.276667 2.875476 2.476190 0.147906
Sun No 20.506667 3.167895 2.929825 0.160113
Yes 24.120000 3.516842 2.578947 0.187250
Thur No 17.113111 2.673778 2.488889 0.160298
Yes 19.190588 3.030000 2.352941 0.163863
# without index
tips.groupby(['day', 'smoker'], as_index=False).mean()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
day smoker total_bill tip size tip_pct
0 Fri No 18.420000 2.812500 2.250000 0.151650
1 Fri Yes 16.813333 2.714000 2.066667 0.174783
2 Sat No 19.661778 3.102889 2.555556 0.158048
3 Sat Yes 21.276667 2.875476 2.476190 0.147906
4 Sun No 20.506667 3.167895 2.929825 0.160113
5 Sun Yes 24.120000 3.516842 2.578947 0.187250
6 Thur No 17.113111 2.673778 2.488889 0.160298
7 Thur Yes 19.190588 3.030000 2.352941 0.163863
# also without index
tips.groupby(['day', 'smoker']).mean().reset_index()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
day smoker total_bill tip size tip_pct
0 Fri No 18.420000 2.812500 2.250000 0.151650
1 Fri Yes 16.813333 2.714000 2.066667 0.174783
2 Sat No 19.661778 3.102889 2.555556 0.158048
3 Sat Yes 21.276667 2.875476 2.476190 0.147906
4 Sun No 20.506667 3.167895 2.929825 0.160113
5 Sun Yes 24.120000 3.516842 2.578947 0.187250
6 Thur No 17.113111 2.673778 2.488889 0.160298
7 Thur Yes 19.190588 3.030000 2.352941 0.163863

10.3 Apply: general split-apply-combine

Apply splits the object being manipulated into pieces, invokes a provided function on each piece, and then concatenates the pieces together. Here is an example using the tipping data.

def top(df, n=5, column='tip_pct'):
    '''
    Return rows with the top `n` values in `column`.
    '''
    return df.sort_values(by=column)[-n:]

top(tips, n=6)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
total_bill tip smoker day time size tip_pct
109 14.31 4.00 Yes Sat Dinner 2 0.279525
183 23.17 6.50 Yes Sun Dinner 4 0.280535
232 11.61 3.39 No Sat Dinner 2 0.291990
67 3.07 1.00 Yes Sat Dinner 1 0.325733
178 9.60 4.00 Yes Sun Dinner 2 0.416667
172 7.25 5.15 Yes Sun Dinner 2 0.710345
tips.groupby('smoker').apply(top, n=3)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
total_bill tip smoker day time size tip_pct
smoker
No 51 10.29 2.60 No Sun Dinner 2 0.252672
149 7.51 2.00 No Thur Lunch 2 0.266312
232 11.61 3.39 No Sat Dinner 2 0.291990
Yes 67 3.07 1.00 Yes Sat Dinner 1 0.325733
178 9.60 4.00 Yes Sun Dinner 2 0.416667
172 7.25 5.15 Yes Sun Dinner 2 0.710345

Suppressing the group keys

To prevent the resulting DataFrame from having the grouping keys for row indexes, use group_keys=False in groupby().

tips.groupby('smoker', group_keys=False).apply(top, n=3)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
total_bill tip smoker day time size tip_pct
51 10.29 2.60 No Sun Dinner 2 0.252672
149 7.51 2.00 No Thur Lunch 2 0.266312
232 11.61 3.39 No Sat Dinner 2 0.291990
67 3.07 1.00 Yes Sat Dinner 1 0.325733
178 9.60 4.00 Yes Sun Dinner 2 0.416667
172 7.25 5.15 Yes Sun Dinner 2 0.710345

Quantile and bucket analysis

pandas includes cut() and qcut() for dividing data into quantiles. This often is useful in conjunction with groupby(). The categorical object returned by cut() can be passed directly to groupby() to use to separate the data.

frame = pd.DataFrame({'data1': np.random.randn(1000),
                      'data2': np.random.randn(1000)})
quartiles = pd.cut(frame.data1, 4)
quartiles.head()
0    (-1.492, 0.0624]
1    (-3.052, -1.492]
2     (0.0624, 1.617]
3    (-1.492, 0.0624]
4    (-1.492, 0.0624]
Name: data1, dtype: category
Categories (4, interval[float64]): [(-3.052, -1.492] < (-1.492, 0.0624] < (0.0624, 1.617] < (1.617, 3.171]]
def get_stats(group):
    return {
        'min': group.min(),
        'max': group.max(),
        'count': group.count(),
        'mean': group.mean()
    }

frame.data2.groupby(quartiles).apply(get_stats).unstack()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
min max count mean
data1
(-3.052, -1.492] -1.730276 2.153120 68.0 0.097696
(-1.492, 0.0624] -2.777359 2.680571 482.0 0.056230
(0.0624, 1.617] -3.116857 2.929096 398.0 -0.046616
(1.617, 3.171] -2.158069 2.042072 52.0 -0.156563

Use qcut() to get equally-sized buckets.

grouping = pd.qcut(frame.data1, 10, labels=False)
frame.data2.groupby(grouping).apply(get_stats).unstack()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
min max count mean
data1
0 -2.534554 2.642936 100.0 -0.013313
1 -1.641703 2.232016 100.0 0.059095
2 -2.777359 2.620574 100.0 0.193168
3 -1.830029 2.488442 100.0 -0.103425
4 -2.123890 2.526368 100.0 0.169418
5 -2.121176 2.680571 100.0 0.044411
6 -3.116857 2.929096 100.0 0.083797
7 -2.994613 2.540232 100.0 -0.305252
8 -1.980566 2.190898 100.0 -0.058111
9 -2.802203 2.198296 100.0 0.000731

10.4 Pivot tables and cross-tabulation

Pivot tables are made with pandas by using groupby() and some reshaping operations. There is a pivot_table() method for DataFrames and a top-level pd.pivot_table() function.

tips.pivot_table(index=['day', 'smoker'])
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
size tip tip_pct total_bill
day smoker
Fri No 2.250000 2.812500 0.151650 18.420000
Yes 2.066667 2.714000 0.174783 16.813333
Sat No 2.555556 3.102889 0.158048 19.661778
Yes 2.476190 2.875476 0.147906 21.276667
Sun No 2.929825 3.167895 0.160113 20.506667
Yes 2.578947 3.516842 0.187250 24.120000
Thur No 2.488889 2.673778 0.160298 17.113111
Yes 2.352941 3.030000 0.163863 19.190588

The following example only aggragates 'tip_pct' and 'size' and additionally groups by 'time'.

tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'], columns='smoker')
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead tr th {
    text-align: left;
}

.dataframe thead tr:last-of-type th {
    text-align: right;
}
</style>
size tip_pct
smoker No Yes No Yes
time day
Dinner Fri 2.000000 2.222222 0.139622 0.165347
Sat 2.555556 2.476190 0.158048 0.147906
Sun 2.929825 2.578947 0.160113 0.187250
Thur 2.000000 NaN 0.159744 NaN
Lunch Fri 3.000000 1.833333 0.187735 0.188937
Thur 2.500000 2.352941 0.160311 0.163863

Setting margins=True computes partial totals, adding an 'All' row and column.

tips.pivot_table(['tip_pct', 'size'],
                 index=['time', 'day'], 
                 columns='smoker', 
                 margins=True)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead tr th {
    text-align: left;
}

.dataframe thead tr:last-of-type th {
    text-align: right;
}
</style>
size tip_pct
smoker No Yes All No Yes All
time day
Dinner Fri 2.000000 2.222222 2.166667 0.139622 0.165347 0.158916
Sat 2.555556 2.476190 2.517241 0.158048 0.147906 0.153152
Sun 2.929825 2.578947 2.842105 0.160113 0.187250 0.166897
Thur 2.000000 NaN 2.000000 0.159744 NaN 0.159744
Lunch Fri 3.000000 1.833333 2.000000 0.187735 0.188937 0.188765
Thur 2.500000 2.352941 2.459016 0.160311 0.163863 0.161301
All 2.668874 2.408602 2.569672 0.159328 0.163196 0.160803

The default aggregation behaviour is to take the mean. This can be changed by passing a different function to aggfunc.

tips.pivot_table('tip_pct', 
                 index=['time', 'smoker'], 
                 columns='day', 
                 aggfunc=len, 
                 margins=True)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
day Fri Sat Sun Thur All
time smoker
Dinner No 3.0 45.0 57.0 1.0 106.0
Yes 9.0 42.0 19.0 NaN 70.0
Lunch No 1.0 NaN NaN 44.0 45.0
Yes 6.0 NaN NaN 17.0 23.0
All 19.0 87.0 76.0 62.0 244.0
tips.pivot_table('tip_pct',
                 index=['time', 'smoker'], 
                 columns='day', 
                 aggfunc=len, 
                 margins=True, 
                 fill_value=0)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
day Fri Sat Sun Thur All
time smoker
Dinner No 3 45 57 1 106.0
Yes 9 42 19 0 70.0
Lunch No 1 0 0 44 45.0
Yes 6 0 0 17 23.0
All 19 87 76 62 244.0

Cross-tabulations (crosstab)

Cross-tabulation is a special case of a pivot table the computes group frequencies.

data = pd.DataFrame({
    'Sample': list(range(1, 11)),
    'Nationality': ['USA', 'Japan', 'USA', 'Japan', 'Japan', 'Japan', 'USA', 'USA', 'Japan', 'USA'],
    'Handedness': ['Right-handed', 'Left-handed', 'Right-handed', 'Right-handed', 'Left-handed', 'Right-handed', 'Right-handed', 'Left-handed', 'Right-handed', 'Right-handed']
})
data
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
Sample Nationality Handedness
0 1 USA Right-handed
1 2 Japan Left-handed
2 3 USA Right-handed
3 4 Japan Right-handed
4 5 Japan Left-handed
5 6 Japan Right-handed
6 7 USA Right-handed
7 8 USA Left-handed
8 9 Japan Right-handed
9 10 USA Right-handed
pd.crosstab(data.Nationality, data.Handedness, margins=True)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
Handedness Left-handed Right-handed All
Nationality
Japan 2 3 5
USA 1 4 5
All 3 7 10
pd.crosstab([tips.time, tips.day], tips.smoker, margins=True)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
smoker No Yes All
time day
Dinner Fri 3 9 12
Sat 45 42 87
Sun 57 19 76
Thur 1 0 1
Lunch Fri 1 6 7
Thur 44 17 61
All 151 93 244