Chapter 10. Data aggregation and group operations

10.1 Groupby mechanics

The grouping key can be provided in different forms:

a list of values the same length oas the axis being grouped
a column name
a dictionary or Series that corresponds the values on the axis being grouped and the group names
a function to be invoked on the axis index or individual index labels

The following are some examples using these methods.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(0)

df = pd.DataFrame({
    'key1' : ['a', 'a', 'b', 'b', 'a'],
    'key2' : ['one', 'two', 'one', 'two', 'one'],
    'data1' : np.random.randn(5), 
    'data2' : np.random.randn(5)
})
df

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	key1	key2	data1	data2
0	a	one	1.764052	-0.977278
1	a	two	0.400157	0.950088
2	b	one	0.978738	-0.151357
3	b	two	2.240893	-0.103219
4	a	one	1.867558	0.410599

df = pd.DataFrame({
    'key1' : ['a', 'a', 'b', 'b', 'a'],
    'key2' : ['one', 'two', 'one', 'two', 'one'],
    'data1' : np.random.randn(5), 
    'data2' : np.random.randn(5)
})
df

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	key1	key2	data1	data2
0	a	one	0.144044	0.333674
1	a	two	1.454274	1.494079
2	b	one	0.761038	-0.205158
3	b	two	0.121675	0.313068
4	a	one	0.443863	-0.854096

# `groupby()` creates a new `GroupBy` object.
grouped = df['data1'].groupby(df['key1'])
grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x12d2e21d0>

# Mean of the data in the 'data1' column, grouped by 'key1'.
grouped.mean()

key1
a    0.680727
b    0.441356
Name: data1, dtype: float64

# Group by two columns.
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means

key1  key2
a     one     0.293953
      two     1.454274
b     one     0.761038
      two     0.121675
Name: data1, dtype: float64

means.unstack()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

key2	one	two
key1
a	0.293953	1.454274
b	0.761038	0.121675

If the grouping information is a column of the same DataFrame, then only the grouping column name is required.

df.groupby('key1').mean()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	data1	data2
key1
a	0.680727	0.324553
b	0.441356	0.053955

df.groupby(['key1', 'key2']).mean()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

		data1	data2
key1	key2
a	one	0.293953	-0.260211
a	two	1.454274	1.494079
b	one	0.761038	-0.205158
b	two	0.121675	0.313068

A frequently useful method on a grouped DataFrame is size().

df.groupby(['key1', 'key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

Iterating over groups

The GroupBy object created by groupby() supports iteration over a sequence of 2-tuples containing the group name and the data.

for name, group in df.groupby('key1'):
    print(f'group name: {name}')
    print(group)
    print('')

group name: a
  key1 key2     data1     data2
0    a  one  0.144044  0.333674
1    a  two  1.454274  1.494079
4    a  one  0.443863 -0.854096

group name: b
  key1 key2     data1     data2
2    b  one  0.761038 -0.205158
3    b  two  0.121675  0.313068

for name, group in df.groupby(['key1', 'key2']):
    print(f'group name: {name[0]}-{name[1]}')
    print(group)
    print('')

group name: a-one
  key1 key2     data1     data2
0    a  one  0.144044  0.333674
4    a  one  0.443863 -0.854096

group name: a-two
  key1 key2     data1     data2
1    a  two  1.454274  1.494079

group name: b-one
  key1 key2     data1     data2
2    b  one  0.761038 -0.205158

group name: b-two
  key1 key2     data1     data2
3    b  two  0.121675  0.313068

By default, groupby() groups on axis=0, though the columns could also be grouped.

df.dtypes

key1      object
key2      object
data1    float64
data2    float64
dtype: object

grouped = df.groupby(df.dtypes, axis=1)
for dtype, group in grouped:
    print(f'data type: {dtype}')
    print(group)
    print('')

data type: float64
      data1     data2
0  0.144044  0.333674
1  1.454274  1.494079
2  0.761038 -0.205158
3  0.121675  0.313068
4  0.443863 -0.854096

data type: object
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one

Selecting a column or subset of columns

A GroupBy object can still be indexed by column name. The next two statements are equivalent.

df.groupby('key1')['data1']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x12d2db190>

df['data1'].groupby(df['key1'])

<pandas.core.groupby.generic.SeriesGroupBy object at 0x12d2f02d0>

Grouping with dictionaries and Series

people = pd.DataFrame(np.random.randn(5, 5),
                      columns=list('abcde'),
                      index=['Joe', 'Steve', 'Wex', 'Jim', 'Travis'])
people

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	a	b	c	d	e
Joe	-2.552990	0.653619	0.864436	-0.742165	2.269755
Steve	-1.454366	0.045759	-0.187184	1.532779	1.469359
Wex	0.154947	0.378163	-0.887786	-1.980796	-0.347912
Jim	0.156349	1.230291	1.202380	-0.387327	-0.302303
Travis	-1.048553	-1.420018	-1.706270	1.950775	-0.509652

people.iloc[2, [1, 2]] = np.nan
people

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	a	b	c	d	e
Joe	-2.552990	0.653619	0.864436	-0.742165	2.269755
Steve	-1.454366	0.045759	-0.187184	1.532779	1.469359
Wex	0.154947	NaN	NaN	-1.980796	-0.347912
Jim	0.156349	1.230291	1.202380	-0.387327	-0.302303
Travis	-1.048553	-1.420018	-1.706270	1.950775	-0.509652

If I have a group correspondence for the columns and want to sum together the columns by these groups, I can just pass the dictionary for grouping.

mapping = {
    'a': 'red', 'b': 'red', 'c': 'blue',
    'd': 'blue', 'e': 'red', 'f' : 'orange'
}

by_column = people.groupby(mapping, axis=1)
by_column.sum()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	blue	red
Joe	0.122271	0.370383
Steve	1.345595	0.060752
Wex	-1.980796	-0.192965
Jim	0.815053	1.084337
Travis	0.244505	-2.978223

Grouping with functions

A function can be used to create the mappings. Each group key will be passed once, and the return value defines the groups.

Here is an example of grouping by the length of the first names.

people.groupby(len).sum()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	a	b	c	d	e
3	-2.241693	1.883909	2.066816	-3.110288	1.619540
5	-1.454366	0.045759	-0.187184	1.532779	1.469359
6	-1.048553	-1.420018	-1.706270	1.950775	-0.509652

It is possible to use both a function and an array or dictionary for grouping at the same time.

key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).min()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

		a	b	c	d	e
3	one	-2.552990	0.653619	0.864436	-1.980796	-0.347912
3	two	0.156349	1.230291	1.202380	-0.387327	-0.302303
5	one	-1.454366	0.045759	-0.187184	1.532779	1.469359
6	two	-1.048553	-1.420018	-1.706270	1.950775	-0.509652

Grouping by index levels

For hierarchically indexed data structures, the levels of the axis can be used for grouping.

columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
hier_df

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead tr th {
    text-align: left;
}

</style>

cty	US			JP
tenor	1	3	5	1	3
0	-0.438074	-1.252795	0.777490	-1.613898	-0.212740
1	-0.895467	0.386902	-0.510805	-1.180632	-0.028182
2	0.428332	0.066517	0.302472	-0.634322	-0.362741
3	-0.672460	-0.359553	-0.813146	-1.726283	0.177426

hier_df.groupby(level='cty', axis=1).count()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

cty	JP	US
0	2	3
1	2	3
2	2	3
3	2	3

10.2 Data Aggregation

An aggregation method is a data transformation that turns an array into a scalar value. When used on a GroupBy object, the transformation is applied to each group, separately.

df

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	key1	key2	data1	data2
0	a	one	0.144044	0.333674
1	a	two	1.454274	1.494079
2	b	one	0.761038	-0.205158
3	b	two	0.121675	0.313068
4	a	one	0.443863	-0.854096

grouped = df.groupby('key1')
grouped['data1'].quantile()

key1
a    0.443863
b    0.441356
Name: data1, dtype: float64

It is common to define custom aggregation methods. The can be applied to grouped data by passing them to the agg() method of a GroupBy object.

def peak_to_peak(arr):
    return arr.max() - arr.min()

grouped.agg(peak_to_peak)

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	data1	data2
key1
a	1.310230	2.348175
b	0.639363	0.518226

Other methods will also perform as expected, even though they are not sticktly aggregations.

grouped.describe()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead tr th {
    text-align: left;
}

.dataframe thead tr:last-of-type th {
    text-align: right;
}

</style>

	data1								data2
	count	mean	std	min	25%	50%	75%	max	count	mean	std	min	25%	50%	75%	max
key1
a	3.0	0.680727	0.686479	0.144044	0.293953	0.443863	0.949068	1.454274	3.0	0.324553	1.174114	-0.854096	-0.260211	0.333674	0.913877	1.494079
b	2.0	0.441356	0.452098	0.121675	0.281516	0.441356	0.601197	0.761038	2.0	0.053955	0.366441	-0.205158	-0.075602	0.053955	0.183511	0.313068

Column-wise and multiple function application

It is common to need to use a different aggregation method for each column. This is demonstrated using an example.

# Read in data on tipping at restaurants.
tips = pd.read_csv('assets/examples/tips.csv')

# Calculate the percent of the bill the tip covered.
tips['tip_pct'] = tips['tip'] / tips['total_bill']
tips.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	total_bill	tip	smoker	day	time	size	tip_pct
0	16.99	1.01	No	Sun	Dinner	2	0.059447
1	10.34	1.66	No	Sun	Dinner	3	0.160542
2	21.01	3.50	No	Sun	Dinner	3	0.166587
3	23.68	3.31	No	Sun	Dinner	2	0.139780
4	24.59	3.61	No	Sun	Dinner	4	0.146808

grouped = tips.groupby(['day', 'smoker'])
grouped_pct = grouped['tip_pct']

# The average tipping percentage per day, split into smokers and non-smokers.
grouped_pct.agg('mean')

day   smoker
Fri   No        0.151650
      Yes       0.174783
Sat   No        0.158048
      Yes       0.147906
Sun   No        0.160113
      Yes       0.187250
Thur  No        0.160298
      Yes       0.163863
Name: tip_pct, dtype: float64

The agg method can be passed a list of functions or function names to apply to the grouped data.

grouped_pct.agg(['mean', 'std', peak_to_peak])

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

		mean	std	peak_to_peak
day	smoker
Fri	No	0.151650	0.028123	0.067349
Fri	Yes	0.174783	0.051293	0.159925
Sat	No	0.158048	0.039767	0.235193
Sat	Yes	0.147906	0.061375	0.290095
Sun	No	0.160113	0.042347	0.193226
Sun	Yes	0.187250	0.154134	0.644685
Thur	No	0.160298	0.038774	0.193350
Thur	Yes	0.163863	0.039389	0.151240

# Passing 2-tuples provides the name of the new column and the aggregation method.
grouped_pct.agg([('foo', 'mean'), ('bar', np.std)])

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

		foo	bar
day	smoker
Fri	No	0.151650	0.028123
Fri	Yes	0.174783	0.051293
Sat	No	0.158048	0.039767
Sat	Yes	0.147906	0.061375
Sun	No	0.160113	0.042347
Sun	Yes	0.187250	0.154134
Thur	No	0.160298	0.038774
Thur	Yes	0.163863	0.039389

The above example was using a grouped Series. With a DataFrame, a list of functions can be applied to all columns, or specific functions can be applied to specified columns.

functions = ['count', 'mean', 'max']
result = grouped['tip_pct', 'total_bill'].agg(functions)
result

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead tr th {
    text-align: left;
}

.dataframe thead tr:last-of-type th {
    text-align: right;
}

</style>

		tip_pct			total_bill
		count	mean	max	count	mean	max
day	smoker
Fri	No	4	0.151650	0.187735	4	18.420000	22.75
Fri	Yes	15	0.174783	0.263480	15	16.813333	40.17
Sat	No	45	0.158048	0.291990	45	19.661778	48.33
Sat	Yes	42	0.147906	0.325733	42	21.276667	50.81
Sun	No	57	0.160113	0.252672	57	20.506667	48.17
Sun	Yes	19	0.187250	0.710345	19	24.120000	45.35
Thur	No	45	0.160298	0.266312	45	17.113111	41.19
Thur	Yes	17	0.163863	0.241255	17	19.190588	43.11

# A mapping of column names to functions.
fxn_mapping = {
    'tip_pct': ['min', 'max', 'mean', 'std'],
    'size': 'sum'
}
grouped.agg(fxn_mapping)

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead tr th {
    text-align: left;
}

.dataframe thead tr:last-of-type th {
    text-align: right;
}

</style>

		tip_pct				size
		min	max	mean	std	sum
day	smoker
Fri	No	0.120385	0.187735	0.151650	0.028123	9
Fri	Yes	0.103555	0.263480	0.174783	0.051293	31
Sat	No	0.056797	0.291990	0.158048	0.039767	115
Sat	Yes	0.035638	0.325733	0.147906	0.061375	104
Sun	No	0.059447	0.252672	0.160113	0.042347	167
Sun	Yes	0.065660	0.710345	0.187250	0.154134	49
Thur	No	0.072961	0.266312	0.160298	0.038774	112
Thur	Yes	0.090014	0.241255	0.163863	0.039389	40

Return aggregated data without row indexes

So far, the returned aggregations have the group key combinations for an index, often hierarchical. This can be disabled by passing as_index=False.

# with index
tips.groupby(['day', 'smoker']).mean()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

		total_bill	tip	size	tip_pct
day	smoker
Fri	No	18.420000	2.812500	2.250000	0.151650
Fri	Yes	16.813333	2.714000	2.066667	0.174783
Sat	No	19.661778	3.102889	2.555556	0.158048
Sat	Yes	21.276667	2.875476	2.476190	0.147906
Sun	No	20.506667	3.167895	2.929825	0.160113
Sun	Yes	24.120000	3.516842	2.578947	0.187250
Thur	No	17.113111	2.673778	2.488889	0.160298
Thur	Yes	19.190588	3.030000	2.352941	0.163863

# without index
tips.groupby(['day', 'smoker'], as_index=False).mean()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	day	smoker	total_bill	tip	size	tip_pct
0	Fri	No	18.420000	2.812500	2.250000	0.151650
1	Fri	Yes	16.813333	2.714000	2.066667	0.174783
2	Sat	No	19.661778	3.102889	2.555556	0.158048
3	Sat	Yes	21.276667	2.875476	2.476190	0.147906
4	Sun	No	20.506667	3.167895	2.929825	0.160113
5	Sun	Yes	24.120000	3.516842	2.578947	0.187250
6	Thur	No	17.113111	2.673778	2.488889	0.160298
7	Thur	Yes	19.190588	3.030000	2.352941	0.163863

# also without index
tips.groupby(['day', 'smoker']).mean().reset_index()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	day	smoker	total_bill	tip	size	tip_pct
0	Fri	No	18.420000	2.812500	2.250000	0.151650
1	Fri	Yes	16.813333	2.714000	2.066667	0.174783
2	Sat	No	19.661778	3.102889	2.555556	0.158048
3	Sat	Yes	21.276667	2.875476	2.476190	0.147906
4	Sun	No	20.506667	3.167895	2.929825	0.160113
5	Sun	Yes	24.120000	3.516842	2.578947	0.187250
6	Thur	No	17.113111	2.673778	2.488889	0.160298
7	Thur	Yes	19.190588	3.030000	2.352941	0.163863

10.3 Apply: general split-apply-combine

Apply splits the object being manipulated into pieces, invokes a provided function on each piece, and then concatenates the pieces together. Here is an example using the tipping data.

def top(df, n=5, column='tip_pct'):
    '''
    Return rows with the top `n` values in `column`.
    '''
    return df.sort_values(by=column)[-n:]

top(tips, n=6)

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	total_bill	tip	smoker	day	time	size	tip_pct
109	14.31	4.00	Yes	Sat	Dinner	2	0.279525
183	23.17	6.50	Yes	Sun	Dinner	4	0.280535
232	11.61	3.39	No	Sat	Dinner	2	0.291990
67	3.07	1.00	Yes	Sat	Dinner	1	0.325733
178	9.60	4.00	Yes	Sun	Dinner	2	0.416667
172	7.25	5.15	Yes	Sun	Dinner	2	0.710345

tips.groupby('smoker').apply(top, n=3)

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

		total_bill	tip	smoker	day	time	size	tip_pct
smoker
No	51	10.29	2.60	No	Sun	Dinner	2	0.252672
	149	7.51	2.00	No	Thur	Lunch	2	0.266312
	232	11.61	3.39	No	Sat	Dinner	2	0.291990
Yes	67	3.07	1.00	Yes	Sat	Dinner	1	0.325733
	178	9.60	4.00	Yes	Sun	Dinner	2	0.416667
	172	7.25	5.15	Yes	Sun	Dinner	2	0.710345

Suppressing the group keys

To prevent the resulting DataFrame from having the grouping keys for row indexes, use group_keys=False in groupby().

tips.groupby('smoker', group_keys=False).apply(top, n=3)

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	total_bill	tip	smoker	day	time	size	tip_pct
51	10.29	2.60	No	Sun	Dinner	2	0.252672
149	7.51	2.00	No	Thur	Lunch	2	0.266312
232	11.61	3.39	No	Sat	Dinner	2	0.291990
67	3.07	1.00	Yes	Sat	Dinner	1	0.325733
178	9.60	4.00	Yes	Sun	Dinner	2	0.416667
172	7.25	5.15	Yes	Sun	Dinner	2	0.710345

Quantile and bucket analysis

pandas includes cut() and qcut() for dividing data into quantiles. This often is useful in conjunction with groupby(). The categorical object returned by cut() can be passed directly to groupby() to use to separate the data.

frame = pd.DataFrame({'data1': np.random.randn(1000),
                      'data2': np.random.randn(1000)})
quartiles = pd.cut(frame.data1, 4)
quartiles.head()

0    (-1.492, 0.0624]
1    (-3.052, -1.492]
2     (0.0624, 1.617]
3    (-1.492, 0.0624]
4    (-1.492, 0.0624]
Name: data1, dtype: category
Categories (4, interval[float64]): [(-3.052, -1.492] < (-1.492, 0.0624] < (0.0624, 1.617] < (1.617, 3.171]]

def get_stats(group):
    return {
        'min': group.min(),
        'max': group.max(),
        'count': group.count(),
        'mean': group.mean()
    }

frame.data2.groupby(quartiles).apply(get_stats).unstack()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	min	max	count	mean
data1
(-3.052, -1.492]	-1.730276	2.153120	68.0	0.097696
(-1.492, 0.0624]	-2.777359	2.680571	482.0	0.056230
(0.0624, 1.617]	-3.116857	2.929096	398.0	-0.046616
(1.617, 3.171]	-2.158069	2.042072	52.0	-0.156563

Use qcut() to get equally-sized buckets.

grouping = pd.qcut(frame.data1, 10, labels=False)
frame.data2.groupby(grouping).apply(get_stats).unstack()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	min	max	count	mean
data1
0	-2.534554	2.642936	100.0	-0.013313
1	-1.641703	2.232016	100.0	0.059095
2	-2.777359	2.620574	100.0	0.193168
3	-1.830029	2.488442	100.0	-0.103425
4	-2.123890	2.526368	100.0	0.169418
5	-2.121176	2.680571	100.0	0.044411
6	-3.116857	2.929096	100.0	0.083797
7	-2.994613	2.540232	100.0	-0.305252
8	-1.980566	2.190898	100.0	-0.058111
9	-2.802203	2.198296	100.0	0.000731

10.4 Pivot tables and cross-tabulation

Pivot tables are made with pandas by using groupby() and some reshaping operations. There is a pivot_table() method for DataFrames and a top-level pd.pivot_table() function.

tips.pivot_table(index=['day', 'smoker'])

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

		size	tip	tip_pct	total_bill
day	smoker
Fri	No	2.250000	2.812500	0.151650	18.420000
Fri	Yes	2.066667	2.714000	0.174783	16.813333
Sat	No	2.555556	3.102889	0.158048	19.661778
Sat	Yes	2.476190	2.875476	0.147906	21.276667
Sun	No	2.929825	3.167895	0.160113	20.506667
Sun	Yes	2.578947	3.516842	0.187250	24.120000
Thur	No	2.488889	2.673778	0.160298	17.113111
Thur	Yes	2.352941	3.030000	0.163863	19.190588

The following example only aggragates 'tip_pct' and 'size' and additionally groups by 'time'.

tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'], columns='smoker')

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead tr th {
    text-align: left;
}

.dataframe thead tr:last-of-type th {
    text-align: right;
}

</style>

		size		tip_pct
	smoker	No	Yes	No	Yes
time	day
Dinner	Fri	2.000000	2.222222	0.139622	0.165347
	Sat	2.555556	2.476190	0.158048	0.147906
	Sun	2.929825	2.578947	0.160113	0.187250
	Thur	2.000000	NaN	0.159744	NaN
Lunch	Fri	3.000000	1.833333	0.187735	0.188937
Lunch	Thur	2.500000	2.352941	0.160311	0.163863

Setting margins=True computes partial totals, adding an 'All' row and column.

tips.pivot_table(['tip_pct', 'size'],
                 index=['time', 'day'], 
                 columns='smoker', 
                 margins=True)

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead tr th {
    text-align: left;
}

.dataframe thead tr:last-of-type th {
    text-align: right;
}

</style>

		size			tip_pct
	smoker	No	Yes	All	No	Yes	All
time	day
Dinner	Fri	2.000000	2.222222	2.166667	0.139622	0.165347	0.158916
	Sat	2.555556	2.476190	2.517241	0.158048	0.147906	0.153152
	Sun	2.929825	2.578947	2.842105	0.160113	0.187250	0.166897
	Thur	2.000000	NaN	2.000000	0.159744	NaN	0.159744
Lunch	Fri	3.000000	1.833333	2.000000	0.187735	0.188937	0.188765
Lunch	Thur	2.500000	2.352941	2.459016	0.160311	0.163863	0.161301
All		2.668874	2.408602	2.569672	0.159328	0.163196	0.160803

The default aggregation behaviour is to take the mean. This can be changed by passing a different function to aggfunc.

tips.pivot_table('tip_pct', 
                 index=['time', 'smoker'], 
                 columns='day', 
                 aggfunc=len, 
                 margins=True)

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	day	Fri	Sat	Sun	Thur	All
time	smoker
Dinner	No	3.0	45.0	57.0	1.0	106.0
Dinner	Yes	9.0	42.0	19.0	NaN	70.0
Lunch	No	1.0	NaN	NaN	44.0	45.0
Lunch	Yes	6.0	NaN	NaN	17.0	23.0
All		19.0	87.0	76.0	62.0	244.0

tips.pivot_table('tip_pct',
                 index=['time', 'smoker'], 
                 columns='day', 
                 aggfunc=len, 
                 margins=True, 
                 fill_value=0)

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	day	Fri	Sat	Sun	Thur	All
time	smoker
Dinner	No	3	45	57	1	106.0
Dinner	Yes	9	42	19	0	70.0
Lunch	No	1	0	0	44	45.0
Lunch	Yes	6	0	0	17	23.0
All		19	87	76	62	244.0

Cross-tabulations (crosstab)

Cross-tabulation is a special case of a pivot table the computes group frequencies.

data = pd.DataFrame({
    'Sample': list(range(1, 11)),
    'Nationality': ['USA', 'Japan', 'USA', 'Japan', 'Japan', 'Japan', 'USA', 'USA', 'Japan', 'USA'],
    'Handedness': ['Right-handed', 'Left-handed', 'Right-handed', 'Right-handed', 'Left-handed', 'Right-handed', 'Right-handed', 'Left-handed', 'Right-handed', 'Right-handed']
})
data

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	Sample	Nationality	Handedness
0	1	USA	Right-handed
1	2	Japan	Left-handed
2	3	USA	Right-handed
3	4	Japan	Right-handed
4	5	Japan	Left-handed
5	6	Japan	Right-handed
6	7	USA	Right-handed
7	8	USA	Left-handed
8	9	Japan	Right-handed
9	10	USA	Right-handed

pd.crosstab(data.Nationality, data.Handedness, margins=True)

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

Handedness	Left-handed	Right-handed	All
Nationality
Japan	2	3	5
USA	1	4	5
All	3	7	10

pd.crosstab([tips.time, tips.day], tips.smoker, margins=True)

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	smoker	No	Yes	All
time	day
Dinner	Fri	3	9	12
	Sat	45	42	87
	Sun	57	19	76
	Thur	1	0	1
Lunch	Fri	1	6	7
Lunch	Thur	44	17	61
All		151	93	244

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!