Skip to content

Latest commit

 

History

History
6636 lines (5086 loc) · 96.8 KB

pfda_ch05.md

File metadata and controls

6636 lines (5086 loc) · 96.8 KB

Chapter 5. Getting Starting with pandas

Designed to for conducting vectorized functions with tabular data.

Introduction to pandas data structures

The two most common data structures from pandas are Series and DataFrame.

Series

A 1-D array like object containing a sequence of values and an associated array of data labels (called the index).

The simplest Series is from from an array.

import pandas as pd
import numpy as np

obj = pd.Series([4, 7, -5, 3])
obj
0    4
1    7
2   -5
3    3
dtype: int64

The index is shown to the left of each data point. The values and indices can be extracted, specifically.

obj.values
array([ 4,  7, -5,  3])
obj.index
RangeIndex(start=0, stop=4, step=1)

The index can be specified.

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2
d    4
b    7
a   -5
c    3
dtype: int64
obj2.index
Index(['d', 'b', 'a', 'c'], dtype='object')

The index can be used to select single or multiple values.

obj2['a']
-5
obj2['d'] = 6
obj2[['c', 'a', 'd']]
c    3
a   -5
d    6
dtype: int64

Boolean indices, multiplication, and mathematical operations can also be used just like with NumPy.

obj2[obj2 > 0]
d    6
b    7
c    3
dtype: int64
obj2 * 2
d    12
b    14
a   -10
c     6
dtype: int64
np.exp(obj2)
d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

A Series can be thought of as a fixed-length, ordered dictionary. It can often be used in a simillar fashion as a dictionary. A Series can be created from a dictionary.

'b' in obj2
True
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3  = pd.Series(sdata)
obj3
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

By default, the resulting Series is ordered by the index. This can be overriden when the Series is created. Note that indices without values get assigned NaN and only values from the dictionary with included indices are retained.

states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
pd.isnull(obj4)
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
obj4.isnull()
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

Series can be joined using the + operator that automatically joins by index.

obj3 + obj4
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

Bot the Series object itself and its index have a name attribute.

obj4.name = 'population'
obj4.index.name = 'state'
obj4
state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

DataFrame

A rectangular table of data with an ordered collection of columns. It has both a row and column index.

A DataFrame can be constructed from a dictionary of NumPy arrays.

data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002, 2003],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
}
frame = pd.DataFrame(data)
frame
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
5 Nevada 2003 3.2
frame.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9

The column order can be declared during creation.

pd.DataFrame(data, columns=['year', 'state', 'pop'])
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
5 2003 Nevada 3.2

This is also a way to declare an empty column.

frame2 = pd.DataFrame(
    data, 
    columns=['year', 'state', 'pop', 'debt'],
    index=['one', 'two', 'three', 'four', 'five', 'six']
)
frame2
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
six 2003 Nevada 3.2 NaN
frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')
frame2['state']
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object
frame2.year
one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64
type(frame2.year)
pandas.core.series.Series
type(frame2)
pandas.core.frame.DataFrame

A DataFrame be indexed by column using either a list or dictionary-like syntax. The rows can be subset using the loc method and passing the row index.

frame2['state']
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object
frame2.year
one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64
frame2.loc['three']
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

Columns can be modified by assignment.

frame2['debt'] = 16.5
frame2
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
year state pop debt
one 2000 Ohio 1.5 16.5
two 2001 Ohio 1.7 16.5
three 2002 Ohio 3.6 16.5
four 2001 Nevada 2.4 16.5
five 2002 Nevada 2.9 16.5
six 2003 Nevada 3.2 16.5
frame2['debt'] = np.arange(6.0)
frame2
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
year state pop debt
one 2000 Ohio 1.5 0.0
two 2001 Ohio 1.7 1.0
three 2002 Ohio 3.6 2.0
four 2001 Nevada 2.4 3.0
five 2002 Nevada 2.9 4.0
six 2003 Nevada 3.2 5.0
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2002 Nevada 2.9 -1.7
six 2003 Nevada 3.2 NaN

Columns can be easily created and deleted.

frame2['eastern'] = frame2.state == 'Ohio'
frame2
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
year state pop debt eastern
one 2000 Ohio 1.5 NaN True
two 2001 Ohio 1.7 -1.2 True
three 2002 Ohio 3.6 NaN True
four 2001 Nevada 2.4 -1.5 False
five 2002 Nevada 2.9 -1.7 False
six 2003 Nevada 3.2 NaN False
del frame2['eastern']
frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')

DataFrames can also be created from nested dictionaries. The first level becomes the column index and the second the row index.

 pop = {
    'Nevada': {2001: 2.4, 2002: 2.9},
    'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}
}
frame3 = pd.DataFrame(pop)
frame3
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2000 NaN 1.5

A DataFrame can be transposed.

frame3.T
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
2001 2002 2000
Nevada 2.4 2.9 NaN
Ohio 1.7 3.6 1.5

A DataFrame can be created from a dictionary of Series.

pdata = {
    'Ohio': frame3['Ohio'][:-1],
    'Nevada': frame3['Nevada'][:2]
}
pd.DataFrame(pdata)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
Ohio Nevada
2001 1.7 2.4
2002 3.6 2.9

A DataFrame's index and columns have their own name attribute.

frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
state Nevada Ohio
year
2001 2.4 1.7
2002 2.9 3.6
2000 NaN 1.5

The values attribute returns the data as a 2-D ndarray.

frame3.values
array([[2.4, 1.7],
       [2.9, 3.6],
       [nan, 1.5]])
frame2.values
array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7],
       [2003, 'Nevada', 3.2, nan]], dtype=object)

Index Objects

These hold the axis labels and other metadata. They are immutable.

obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
Index(['a', 'b', 'c'], dtype='object')
index[1:]
Index(['b', 'c'], dtype='object')

Index objects can be shared amongst data stuctures.

labels = pd.Index(np.arange(3))
labels
Int64Index([0, 1, 2], dtype='int64')
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2
0    1.5
1   -2.5
2    0.0
dtype: float64
obj2.index is labels
True
frame3.columns
Index(['Nevada', 'Ohio'], dtype='object', name='state')
'Ohio' in frame3.columns
True
2003 in frame3.index
False

5.2 Eddential functionality

This section discusses the fundamental interations with Series and DataFrames.

Reindexing

This creates a new object with the data conformed to a new index.

obj = pd.Series([4.3, 7.2, -5.3, 3.6], index = ['d', 'b', 'a', 'c'])
obj
d    4.3
b    7.2
a   -5.3
c    3.6
dtype: float64
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
a   -5.3
b    7.2
c    3.6
d    4.3
e    NaN
dtype: float64

There is a method option to describe how to handle missing data. Here is an example with 'ffill' which is a "forward-fill."

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3
0      blue
2    purple
4    yellow
dtype: object
obj3.reindex(range(6), method='ffill')
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

With DataFrame, 'reindex' can alter either the row index, columns, or both.

frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=['a', 'c', 'd'],
    columns=['Ohio', 'Texas', 'California']
)
frame
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
Ohio Texas California
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0

Columns can be reindexed with the columns keyward.

states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
Texas Utah California
a 1 NaN 2
c 4 NaN 5
d 7 NaN 8

Reindexing can be done more succienctly by label-indexing with 'loc'.

frame.loc[['a', 'b', 'c', 'd'], states]
/opt/anaconda3/envs/daysOfCode-env/lib/python3.7/site-packages/pandas/core/indexing.py:1418: FutureWarning: 
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_tuple(key)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
Texas Utah California
a 1.0 NaN 2.0
b NaN NaN NaN
c 4.0 NaN 5.0
d 7.0 NaN 8.0

Dropping entries from an axis

The drop method returns a new object with the indicated values deleted from an axis.

obj = pd.Series(np.arange(5.0), index=['a', 'b', 'c', 'd', 'e'])
obj
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
new_obj = obj.drop('c')
new_obj
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
obj.drop(['d', 'c'])
a    0.0
b    1.0
e    4.0
dtype: float64

With DataFrame, index values can be deleted from either axis.

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=['Ohio', 'Colorado', 'Utah', 'New York'],
    columns=['one', 'two', 'three', 'four']
)
data
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
data.drop(['Colorado', 'Ohio'])
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
one two three four
Utah 8 9 10 11
New York 12 13 14 15
data.drop('two', axis=1)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
data.drop(['two', 'four'], axis='columns')
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
one three
Ohio 0 2
Colorado 4 6
Utah 8 10
New York 12 14

There is an option to make the change in-place.

obj.drop('c', inplace=True)
obj
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

Indexing, selection, and filtering

For Series, can use either position, boolean, or index values for indexing.

obj = pd.Series(np.arange(4.0), index=['a', 'b', 'c', 'd'])
obj
a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64
obj['b']
1.0
obj[1]
1.0
obj[2:4]
c    2.0
d    3.0
dtype: float64
obj[['b', 'a', 'd']]
b    1.0
a    0.0
d    3.0
dtype: float64
obj[[1, 3]]
b    1.0
d    3.0
dtype: float64
obj[obj < 2]
a    0.0
b    1.0
dtype: float64
obj['b':'c']
b    1.0
c    2.0
dtype: float64
obj['b':'c'] = 5
obj
a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

Indexing a DataFrame can retrieve multiple columns.

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=['Ohio', 'Colorado', 'Utah', 'New York'],
    columns=['one', 'two', 'three', 'four']
)
data
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
data['two']
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64
data[['three', 'one']]
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
three one
Ohio 2 0
Colorado 6 4
Utah 10 8
New York 14 12
data[:2]
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
data < 5
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
one two three four
Ohio True True True True
Colorado True False False False
Utah False False False False
New York False False False False
data[data['three'] > 5]
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
data[data < 5] = 0
data
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15

loc and iloc are methods specifically for label-indexing on the rows of a DataFrame.

data.loc['Colorado', ['two', 'three']]
two      5
three    6
Name: Colorado, dtype: int64
data.iloc[2, [3, 0, 1]]
four    11
one      8
two      9
Name: Utah, dtype: int64
data.iloc[2]
one       8
two       9
three    10
four     11
Name: Utah, dtype: int64
data.iloc[[1, 2], [3, 0, 1]]
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
four one two
Colorado 7 0 5
Utah 11 8 9
data.loc[:'Utah', 'two']
Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64
data.iloc[:, :3][data.three > 5]
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
one two three
Colorado 0 5 6
Utah 8 9 10
New York 12 13 14

Arithmetic and data alignment

When adding objects together, if any index pairs are not the same, the respective index in the result will be the union of the index pairs.

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s1
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64
s2
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

NaN are returned for when there is only one value being added together.

s1 + s2
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

For DataFrame, alignment is performed on both rows and columns.

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), 
                   columns=list('bcd'), 
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), 
                   columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
b c d
Ohio 0.0 1.0 2.0
Texas 3.0 4.0 5.0
Colorado 6.0 7.0 8.0
df2
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0

When added together, values are only returned for positions in both DataFrames.

df1 + df2
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
b c d e
Colorado NaN NaN NaN NaN
Ohio 3.0 NaN 6.0 NaN
Oregon NaN NaN NaN NaN
Texas 9.0 NaN 12.0 NaN
Utah NaN NaN NaN NaN

There is an option to fill missing data with a specific value to be used for the operation.

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
df1
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
a b c d
0 0.0 1.0 2.0 3.0
1 4.0 5.0 6.0 7.0
2 8.0 9.0 10.0 11.0
df2.loc[1, 'b'] = np.nan
df2
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
a b c d e
0 0.0 1.0 2.0 3.0 4.0
1 5.0 NaN 7.0 8.0 9.0
2 10.0 11.0 12.0 13.0 14.0
3 15.0 16.0 17.0 18.0 19.0
df1 + df2
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
a b c d e
0 0.0 2.0 4.0 6.0 NaN
1 9.0 NaN 13.0 15.0 NaN
2 18.0 20.0 22.0 24.0 NaN
3 NaN NaN NaN NaN NaN
df1.add(df2, fill_value=0)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
a b c d e
0 0.0 2.0 4.0 6.0 4.0
1 9.0 5.0 13.0 15.0 9.0
2 18.0 20.0 22.0 24.0 14.0
3 15.0 16.0 17.0 18.0 19.0

There are defined behaviours for arithmetic between DataFrame and Series. It is slightly different than for 1D and multidimensional ndarrys.

arr = np.arange(12.0).reshape((3, 4))
arr
array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])
arr[0]
array([0., 1., 2., 3.])

The subtration below results in one operation per row (an example of broadcasting).

arr - arr[0]
array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

A simillar mechanism is used for operations between a Series and DataFrame.

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
series = frame.iloc[0]
series
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64
frame - series
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
b d e
Utah 0.0 0.0 0.0
Ohio 3.0 3.0 3.0
Texas 6.0 6.0 6.0
Oregon 9.0 9.0 9.0

IF an index is not found in either the DataFrame columns or Series index, the objects are reindexed to form the union.

series2 = pd.Series(range(3), index=['b', 'e', 'f'])
series2
b    0
e    1
f    2
dtype: int64
frame + series2
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
b d e f
Utah 0.0 NaN 3.0 NaN
Ohio 3.0 NaN 6.0 NaN
Texas 6.0 NaN 9.0 NaN
Oregon 9.0 NaN 12.0 NaN
series3 = frame['d']
series3
Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64
frame.sub(series3)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
Ohio Oregon Texas Utah b d e
Utah NaN NaN NaN NaN NaN NaN NaN
Ohio NaN NaN NaN NaN NaN NaN NaN
Texas NaN NaN NaN NaN NaN NaN NaN
Oregon NaN NaN NaN NaN NaN NaN NaN
frame.sub(series3, axis='index')
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
b d e
Utah -1.0 0.0 1.0
Ohio -1.0 0.0 1.0
Texas -1.0 0.0 1.0
Oregon -1.0 0.0 1.0

Function application and mapping

DataFrames work well with ufuncs, too.

frame = pd.DataFrame(np.random.randn(4, 3), 
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
b d e
Utah -1.167574 -1.923661 -0.856915
Ohio -0.712398 0.835224 1.783416
Texas 0.283403 1.123811 -1.542719
Oregon -1.049275 -0.444459 -0.123116
np.abs(frame)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
b d e
Utah 1.167574 1.923661 0.856915
Ohio 0.712398 0.835224 1.783416
Texas 0.283403 1.123811 1.542719
Oregon 1.049275 0.444459 0.123116

Use the apply method to apply a function to the 1D arrays from columns or rows.

f = lambda x: x.max() - x.min()

frame.apply(f)
b    1.450977
d    3.047472
e    3.326136
dtype: float64

To operate of the rows, pass the value axis=1.

frame.apply(f, axis=1)
Utah      1.066746
Ohio      2.495815
Texas     2.666531
Oregon    0.926159
dtype: float64
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

frame.apply(f)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
b d e
min -1.167574 -1.923661 -1.542719
max 0.283403 1.123811 1.783416

Sorting and ranking

Use the sort_index() method to sort a Series or DataFrame lexicographically.

obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()
a    1
b    2
c    3
d    0
dtype: int64
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])
frame.sort_index()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
d a b c
one 4 5 6 7
three 0 1 2 3
frame.sort_index(axis=1)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
a b c d
three 1 2 3 0
one 5 6 7 4
frame.sort_index(axis=1, ascending=False)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
d c b a
three 0 3 2 1
one 4 7 6 5

Use the sort_values method to sort a Series by its values.

obj = pd.Series([4, 7, -3, -2])
obj.sort_values()
2   -3
3   -2
0    4
1    7
dtype: int64
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()
4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

For a DataFrame, a column can be specified to use for sorting.

frame = pd.DataFrame({
    'b': [4, 7, -3, 2],
    'a': [0, 1, 0, 1]
})
frame.sort_values(by='b')
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
b a
2 -3 0
3 2 1
0 4 0
1 7 1
frame.sort_values(by=['a', 'b'])
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
b a
2 -3 0
0 4 0
3 2 1
1 7 1

Ranking assigns ranks from 1 through the number of valid data points (rows). There are a few different ways to handle ties and they can be declared using the method argument.

obj = pd.Series([7, -5, 7, 4, 3, 0, 4])
obj.rank()
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64
obj.rank(method='first')
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64
obj.rank(pct=True)
0    0.928571
1    0.142857
2    0.928571
3    0.642857
4    0.428571
5    0.285714
6    0.642857
dtype: float64
obj.rank(ascending=False, method='max')
0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64

DataFrames can be ranked over the columns are rows.

frame = pd.DataFrame({
    'b': [4.3, 7, -3, 2],
    'a': [0, 1, 0, 1],
    'c': [-2, 5, 8, -2.5]
}).sort_index(axis=1)
frame
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
a b c
0 0 4.3 -2.0
1 1 7.0 5.0
2 0 -3.0 8.0
3 1 2.0 -2.5
frame.rank()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
a b c
0 1.5 3.0 2.0
1 3.5 4.0 3.0
2 1.5 1.0 4.0
3 3.5 2.0 1.0
frame.rank(axis=1)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
a b c
0 2.0 3.0 1.0
1 1.0 3.0 2.0
2 2.0 1.0 3.0
3 2.0 3.0 1.0

Axis indexes with duplicate labels

Many pandas functions require the labels to be unique, but it is not mandatory for a Series or DataFrame. HEre is a small example Series with non-unique labels.

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj
a    0
a    1
b    2
b    3
c    4
dtype: int64
obj.index.is_unique
False
obj.a
a    0
a    1
dtype: int64
obj.c
4

Here is an example with DataFrame.

df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'b', 'b', 'b'])
df
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
0 1 2
a 0.315786 -0.868693 -1.045036
b 1.144670 0.902213 -2.583696
b -0.499635 0.883807 -1.347935
b -1.053965 -0.075102 0.066315
df.loc['b']
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
0 1 2
b 1.144670 0.902213 -2.583696
b -0.499635 0.883807 -1.347935
b -1.053965 -0.075102 0.066315

5.3 Summarizing and computing descriptive statistics

Pandas has many methods for computing reductions and summarising the data in a DataFrame. Importantly, they naturally handle missing data.

df = pd.DataFrame([[1.4, np.nan],
                   [7.1, -4.5],
                   [np.nan, np.nan],
                   [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])
df
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
df.sum()
one    9.25
two   -5.80
dtype: float64
df.sum(axis=1)
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
df.sum(axis=1, skipna=False)
a     NaN
b    2.60
c     NaN
d   -0.55
dtype: float64

There are some methods for getting the index value of key data.

df.idxmax()
one    b
two    d
dtype: object
df.cumsum()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
one two
a 1.40 NaN
b 8.50 -4.5
c NaN NaN
d 9.25 -5.8

The describe method is useful to getting a high-level overview of the data in a DataFrame.

df.describe()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
one two
count 3.000000 2.000000
mean 3.083333 -2.900000
std 3.493685 2.262742
min 0.750000 -4.500000
25% 1.075000 -3.700000
50% 1.400000 -2.900000
75% 4.250000 -2.100000
max 7.100000 -1.300000

For non-numeric data.

obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj
0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object
obj.describe
<bound method NDFrame.describe of 0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object>

Below are examples a some of the more common summary statistics.

df = pd.DataFrame(np.random.randn(100).reshape((20, 5)))
df
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
0 1 2 3 4
0 -1.045266 1.282355 2.090783 -0.252437 1.040161
1 0.150544 -0.147333 0.320805 1.567862 -0.174218
2 1.166097 -1.042703 -0.079506 -0.118705 0.123881
3 -0.043958 2.006066 0.471954 0.686379 0.540834
4 1.244186 -0.874540 -1.410670 -1.593762 0.318683
5 -0.071923 0.148731 -0.456628 -1.155017 0.624898
6 -0.083861 0.218961 2.650821 -0.113464 -0.422801
7 -0.289215 3.006028 -2.028504 0.290864 0.984765
8 0.866779 1.039549 0.772484 -1.578464 -0.247977
9 -0.079468 0.327789 0.740373 0.268327 -1.121380
10 -0.728884 -1.385715 1.322314 -0.502122 0.456615
11 0.460693 2.520307 -0.929717 1.136553 1.818280
12 -0.226714 -0.790623 0.940863 -0.279578 0.408899
13 -2.252070 -0.513878 -1.115919 -1.057158 -0.112186
14 2.193616 1.330908 0.217283 0.489868 -0.011269
15 1.823363 1.266734 -1.286281 0.673735 -0.398431
16 0.780628 1.615514 0.525163 -0.062664 0.088914
17 2.037299 1.412365 -0.284433 0.963503 0.373894
18 -0.214364 2.878076 -0.450742 -0.346410 0.520312
19 -0.610795 -1.430337 0.121110 0.483379 2.001245
df.count()  # number of non-NA values
0    20
1    20
2    20
3    20
4    20
dtype: int64
df.mean()
0    0.253834
1    0.643413
2    0.106578
3   -0.024966
4    0.340656
dtype: float64
df.median()
0   -0.057940
1    0.683669
2    0.169197
3   -0.088064
4    0.346288
dtype: float64
df.mad()
0    0.854199
1    1.192377
2    0.900160
3    0.678516
4    0.536334
dtype: float64
df.prod()
0   -1.721891e-07
1   -2.590494e-01
2   -1.466648e-04
3   -4.670023e-07
4   -1.657156e-09
dtype: float64
df.std()
0    1.094775
1    1.396016
2    1.169365
3    0.866003
4    0.737545
dtype: float64
df.var()
0    1.198532
1    1.948861
2    1.367415
3    0.749961
4    0.543972
dtype: float64
df.cumsum()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
0 1 2 3 4
0 -1.045266 1.282355 2.090783 -0.252437 1.040161
1 -0.894723 1.135022 2.411587 1.315425 0.865944
2 0.271374 0.092320 2.332081 1.196720 0.989825
3 0.227416 2.098386 2.804035 1.883099 1.530659
4 1.471602 1.223846 1.393365 0.289337 1.849342
5 1.399679 1.372576 0.936738 -0.865680 2.474240
6 1.315818 1.591538 3.587559 -0.979144 2.051439
7 1.026603 4.597565 1.559055 -0.688280 3.036203
8 1.893382 5.637114 2.331539 -2.266744 2.788226
9 1.813915 5.964904 3.071912 -1.998416 1.666846
10 1.085031 4.579189 4.394226 -2.500538 2.123461
11 1.545723 7.099496 3.464508 -1.363985 3.941741
12 1.319009 6.308873 4.405372 -1.643564 4.350640
13 -0.933061 5.794995 3.289452 -2.700722 4.238454
14 1.260555 7.125903 3.506736 -2.210854 4.227184
15 3.083918 8.392638 2.220455 -1.537119 3.828753
16 3.864546 10.008152 2.745618 -1.599783 3.917667
17 5.901845 11.420517 2.461186 -0.636280 4.291561
18 5.687480 14.298593 2.010443 -0.982690 4.811873
19 5.076686 12.868256 2.131554 -0.499311 6.813118
df.cumprod()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
0 1 2 3 4
0 -1.045266e+00 1.282355 2.090783 -2.524371e-01 1.040161e+00
1 -1.573583e-01 -0.188933 0.670733 -3.957865e-01 -1.812143e-01
2 -1.834949e-01 0.197001 -0.053328 4.698193e-02 -2.244901e-02
3 8.066110e-03 0.395197 -0.025168 3.224742e-02 -1.214120e-02
4 1.003574e-02 -0.345616 0.035504 -5.139471e-02 -3.869194e-03
5 -7.217967e-04 -0.051404 -0.016212 5.936175e-02 -2.417850e-03
6 6.053054e-05 -0.011255 -0.042975 -6.735445e-03 1.022269e-03
7 -1.750635e-05 -0.033834 0.087176 -1.959101e-03 1.006695e-03
8 -1.517415e-05 -0.035172 0.067342 3.092369e-03 -2.496376e-04
9 1.205853e-06 -0.011529 0.049858 8.297670e-04 2.799386e-04
10 -8.789272e-07 0.015976 0.065928 -4.166443e-04 1.278243e-04
11 -4.049152e-07 0.040264 -0.061294 -4.735383e-04 2.324203e-04
12 9.180015e-08 -0.031834 -0.057670 1.323911e-04 9.503635e-05
13 -2.067404e-07 0.016359 0.064355 -1.399583e-04 -1.066178e-05
14 -4.535091e-07 0.021772 0.013983 -6.856113e-05 1.201514e-07
15 -8.269117e-07 0.027579 -0.017986 -4.619204e-05 -4.787200e-08
16 -6.455103e-07 0.044555 -0.009446 2.894595e-06 -4.256488e-09
17 -1.315097e-06 0.062928 0.002687 2.788951e-06 -1.591474e-09
18 2.819100e-07 0.181111 -0.001211 -9.661200e-07 -8.280626e-10
19 -1.721891e-07 -0.259049 -0.000147 -4.670023e-07 -1.657156e-09

Correlation and covariance

Here, the author showed an example of using pandas to compute correlation and covariance measures between stocks. The first step was to download the data from Yahoo! Finance

import pandas_datareader.data as web

stock_names = ['AAPL', 'IBM', 'MSFT', 'GOOG']
all_data = {ticker: web.get_data_yahoo(ticker) for ticker in stock_names}
all_data
{'AAPL':                   High         Low        Open       Close        Volume  \
 Date                                                                       
 2014-11-03  110.300003  108.010002  108.220001  109.400002  5.228260e+07   
 2014-11-04  109.489998  107.720001  109.360001  108.599998  4.157440e+07   
 2014-11-05  109.300003  108.129997  109.099998  108.860001  3.743590e+07   
 2014-11-06  108.790001  107.800003  108.599998  108.699997  3.496850e+07   
 2014-11-07  109.320000  108.550003  108.750000  109.010002  3.369150e+07   
 ...                ...         ...         ...         ...           ...   
 2019-10-28  249.250000  246.720001  247.419998  249.050003  2.414320e+07   
 2019-10-29  249.750000  242.570007  248.970001  243.289993  3.566010e+07   
 2019-10-30  245.300003  241.210007  244.759995  243.259995  3.095060e+07   
 2019-10-31  249.169998  237.259995  247.240005  248.759995  3.476660e+09   
 2019-11-01  255.929993  249.160004  249.539993  255.820007  3.773870e+07   
 
              Adj Close  
 Date                    
 2014-11-03  100.385109  
 2014-11-04   99.651047  
 2014-11-05   99.889618  
 2014-11-06  100.175293  
 2014-11-07  100.460983  
 ...                ...  
 2019-10-28  249.050003  
 2019-10-29  243.289993  
 2019-10-30  243.259995  
 2019-10-31  248.759995  
 2019-11-01  255.820007  
 
 [1259 rows x 6 columns],
 'IBM':                   High         Low        Open       Close       Volume  \
 Date                                                                      
 2014-11-03  164.539993  163.380005  164.250000  164.360001    4688200.0   
 2014-11-04  164.360001  162.240005  164.339996  162.649994    4246900.0   
 2014-11-05  163.539993  161.559998  163.130005  161.820007    4104700.0   
 2014-11-06  161.529999  160.050003  161.279999  161.460007    4067600.0   
 2014-11-07  162.210007  160.850006  161.419998  162.070007    3494800.0   
 ...                ...         ...         ...         ...          ...   
 2019-10-28  136.630005  135.449997  136.000000  135.970001    3225700.0   
 2019-10-29  135.570007  133.440002  135.419998  133.820007    4159600.0   
 2019-10-30  135.279999  133.199997  133.830002  135.250000    2252700.0   
 2019-10-31  135.250000  133.229996  135.110001  133.729996  341090000.0   
 2019-11-01  135.559998  134.089996  134.500000  135.529999    3088800.0   
 
              Adj Close  
 Date                    
 2014-11-03  133.097809  
 2014-11-04  131.713089  
 2014-11-05  131.040939  
 2014-11-06  131.644287  
 2014-11-07  132.141617  
 ...                ...  
 2019-10-28  135.970001  
 2019-10-29  133.820007  
 2019-10-30  135.250000  
 2019-10-31  133.729996  
 2019-11-01  135.529999  
 
 [1259 rows x 6 columns],
 'MSFT':                   High         Low        Open       Close        Volume  \
 Date                                                                       
 2014-11-03   47.459999   46.730000   46.889999   47.439999  2.313040e+07   
 2014-11-04   47.730000   47.250000   47.299999   47.570000  2.153080e+07   
 2014-11-05   47.900002   47.259998   47.799999   47.860001  2.244960e+07   
 2014-11-06   48.860001   47.790001   47.860001   48.700001  3.303780e+07   
 2014-11-07   48.919998   48.290001   48.919998   48.680000  2.800060e+07   
 ...                ...         ...         ...         ...           ...   
 2019-10-28  145.669998  143.509995  144.399994  144.190002  3.528010e+07   
 2019-10-29  144.500000  142.649994  144.080002  142.830002  2.051970e+07   
 2019-10-30  145.000000  142.789993  143.520004  144.610001  1.847170e+07   
 2019-10-31  144.929993  142.990005  144.899994  143.369995  2.459620e+09   
 2019-11-01  144.419998  142.970001  144.259995  143.720001  3.311920e+07   
 
              Adj Close  
 Date                    
 2014-11-03   42.466732  
 2014-11-04   42.583115  
 2014-11-05   42.842709  
 2014-11-06   43.594646  
 2014-11-07   43.576752  
 ...                ...  
 2019-10-28  144.190002  
 2019-10-29  142.830002  
 2019-10-30  144.610001  
 2019-10-31  143.369995  
 2019-11-01  143.720001  
 
 [1259 rows x 6 columns],
 'GOOG':                    High          Low         Open        Close       Volume  \
 Date                                                                          
 2014-11-03   556.372498   551.715271   553.979065   553.699829    1382200.0   
 2014-11-04   553.979065   547.796021   551.485901   552.592834    1244200.0   
 2014-11-05   555.275513   542.560425   555.275513   544.425293    2032200.0   
 2014-11-06   545.387634   539.488831   544.006409   540.555908    1333200.0   
 2014-11-07   544.714478   537.195129   544.714478   539.528748    1633700.0   
 ...                 ...          ...          ...          ...          ...   
 2019-10-28  1299.310059  1272.540039  1275.449951  1290.000000    2613200.0   
 2019-10-29  1281.589966  1257.212036  1276.229980  1262.619995    1869200.0   
 2019-10-30  1269.359985  1252.000000  1252.969971  1261.290039    1407700.0   
 2019-10-31  1267.670044  1250.843018  1261.280029  1260.109985  145470000.0   
 2019-11-01  1274.619995  1260.500000  1265.000000  1273.739990    1669400.0   
 
               Adj Close  
 Date                     
 2014-11-03   553.699829  
 2014-11-04   552.592834  
 2014-11-05   544.425293  
 2014-11-06   540.555908  
 2014-11-07   539.528748  
 ...                 ...  
 2019-10-28  1290.000000  
 2019-10-29  1262.619995  
 2019-10-30  1261.290039  
 2019-10-31  1260.109985  
 2019-11-01  1273.739990  
 
 [1259 rows x 6 columns]}
price = pd.DataFrame({ticker: data['Adj Close'] for ticker, data in all_data.items()})
price
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
AAPL IBM MSFT GOOG
Date
2014-11-03 100.385109 133.097809 42.466732 553.699829
2014-11-04 99.651047 131.713089 42.583115 552.592834
2014-11-05 99.889618 131.040939 42.842709 544.425293
2014-11-06 100.175293 131.644287 43.594646 540.555908
2014-11-07 100.460983 132.141617 43.576752 539.528748
... ... ... ... ...
2019-10-28 249.050003 135.970001 144.190002 1290.000000
2019-10-29 243.289993 133.820007 142.830002 1262.619995
2019-10-30 243.259995 135.250000 144.610001 1261.290039
2019-10-31 248.759995 133.729996 143.369995 1260.109985
2019-11-01 255.820007 135.529999 143.720001 1273.739990

1259 rows × 4 columns

volume = pd.DataFrame({ticker: data['Volume'] for ticker, data in all_data.items()})
volume
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
AAPL IBM MSFT GOOG
Date
2014-11-03 5.228260e+07 4688200.0 2.313040e+07 1382200.0
2014-11-04 4.157440e+07 4246900.0 2.153080e+07 1244200.0
2014-11-05 3.743590e+07 4104700.0 2.244960e+07 2032200.0
2014-11-06 3.496850e+07 4067600.0 3.303780e+07 1333200.0
2014-11-07 3.369150e+07 3494800.0 2.800060e+07 1633700.0
... ... ... ... ...
2019-10-28 2.414320e+07 3225700.0 3.528010e+07 2613200.0
2019-10-29 3.566010e+07 4159600.0 2.051970e+07 1869200.0
2019-10-30 3.095060e+07 2252700.0 1.847170e+07 1407700.0
2019-10-31 3.476660e+09 341090000.0 2.459620e+09 145470000.0
2019-11-01 3.773870e+07 3088800.0 3.311920e+07 1669400.0

1259 rows × 4 columns

We can now compute the percent changes of the prices.

returns = price.pct_change()
returns
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
AAPL IBM MSFT GOOG
Date
2014-11-03 NaN NaN NaN NaN
2014-11-04 -0.007312 -0.010404 0.002741 -0.001999
2014-11-05 0.002394 -0.005103 0.006096 -0.014780
2014-11-06 0.002860 0.004604 0.017551 -0.007107
2014-11-07 0.002852 0.003778 -0.000410 -0.001900
... ... ... ... ...
2019-10-28 0.010017 0.003913 0.024586 0.019658
2019-10-29 -0.023128 -0.015812 -0.009432 -0.021225
2019-10-30 -0.000123 0.010686 0.012462 -0.001053
2019-10-31 0.022610 -0.011238 -0.008575 -0.000936
2019-11-01 0.028381 0.013460 0.002441 0.010817

1259 rows × 4 columns

The corr and cov methods for Series computes the correlation and covariance of two Series, aligned by index. The corr and cov methods for DataFrame computes the correlation and covariance matrices.

returns.MSFT.corr(returns.IBM)
0.4870311189559724
returns.corr()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
AAPL IBM MSFT GOOG
AAPL 1.000000 0.405805 0.570409 0.521556
IBM 0.405805 1.000000 0.487031 0.413659
MSFT 0.570409 0.487031 1.000000 0.655724
GOOG 0.521556 0.413659 0.655724 1.000000
returns.cov()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
AAPL IBM MSFT GOOG
AAPL 0.000247 0.000083 0.000133 0.000125
IBM 0.000083 0.000170 0.000094 0.000082
MSFT 0.000133 0.000094 0.000219 0.000147
GOOG 0.000125 0.000082 0.000147 0.000231

DataFrame can still compute specific correlations using the corrwith method.

returns.corrwith(returns.IBM)
AAPL    0.405805
IBM     1.000000
MSFT    0.487031
GOOG    0.413659
dtype: float64
returns.corrwith(volume)
AAPL    0.016438
IBM    -0.057802
MSFT   -0.035033
GOOG   -0.004697
dtype: float64

Unique values, value counts, and membership

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
uniques = obj.unique()
uniques
array(['c', 'a', 'd', 'b'], dtype=object)
obj.value_counts()
c    3
a    3
b    2
d    1
dtype: int64
mask = obj.isin(['b', 'c'])
mask
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
unique_vals = pd.Series(['c', 'b', 'a'])
pd.Index(unique_vals).get_indexer(to_match)
array([0, 2, 1, 1, 0, 2])