pandas is designed for working with tabular data using vectorized operations.
The two most common data structures in pandas are Series and DataFrame.
A Series is a one-dimensional, array-like object containing a sequence of values and an associated array of data labels, called its index.
The simplest Series can be created from an array.
import pandas as pd
import numpy as np
obj = pd.Series([4, 7, -5, 3])
obj
0 4
1 7
2 -5
3 3
dtype: int64
The index is shown to the left of each value. The values and the index can be accessed separately.
obj.values
array([ 4, 7, -5, 3])
obj.index
RangeIndex(start=0, stop=4, step=1)
The index can be specified.
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2
d 4
b 7
a -5
c 3
dtype: int64
obj2.index
Index(['d', 'b', 'a', 'c'], dtype='object')
The index can be used to select single or multiple values.
obj2['a']
-5
obj2['d'] = 6
obj2[['c', 'a', 'd']]
c 3
a -5
d 6
dtype: int64
Boolean indexing, scalar multiplication, and other mathematical operations can be used just as with NumPy arrays.
obj2[obj2 > 0]
d 6
b 7
c 3
dtype: int64
obj2 * 2
d 12
b 14
a -10
c 6
dtype: int64
np.exp(obj2)
d 403.428793
b 1096.633158
a 0.006738
c 20.085537
dtype: float64
A Series can be thought of as a fixed-length, ordered dictionary, and it can often be used in a similar fashion to a dictionary. A Series can also be created from a dictionary.
'b' in obj2
True
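As a further sketch of the dictionary analogy (using the obj2 defined above), a Series can be converted back to an ordinary dict, with the keys taken from its index.
obj2.to_dict()
# {'d': 6, 'b': 7, 'a': -5, 'c': 3}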
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3
Ohio 35000
Texas 71000
Oregon 16000
Utah 5000
dtype: int64
By default, the resulting Series is ordered by the index.
This can be overridden by passing an index when the Series is created.
Note that index labels without a corresponding value get assigned NaN,
and only values from the dictionary whose keys appear in the index are retained.
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
pd.isnull(obj4)
California True
Ohio False
Oregon False
Texas False
dtype: bool
obj4.isnull()
California True
Ohio False
Oregon False
Texas False
dtype: bool
Series can be added with the + operator, which automatically aligns the values by index.
obj3 + obj4
California NaN
Ohio 70000.0
Oregon 32000.0
Texas 142000.0
Utah NaN
dtype: float64
Both the Series object itself and its index have a name attribute.
obj4.name = 'population'
obj4.index.name = 'state'
obj4
state
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
Name: population, dtype: float64
A DataFrame is a rectangular table of data with an ordered collection of columns. It has both a row index and a column index.
A DataFrame can be constructed from a dictionary of NumPy arrays.
data = {
'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
}
frame = pd.DataFrame(data)
frame
state | year | pop | |
---|---|---|---|
0 | Ohio | 2000 | 1.5 |
1 | Ohio | 2001 | 1.7 |
2 | Ohio | 2002 | 3.6 |
3 | Nevada | 2001 | 2.4 |
4 | Nevada | 2002 | 2.9 |
5 | Nevada | 2003 | 3.2 |
frame.head()
state | year | pop | |
---|---|---|---|
0 | Ohio | 2000 | 1.5 |
1 | Ohio | 2001 | 1.7 |
2 | Ohio | 2002 | 3.6 |
3 | Nevada | 2001 | 2.4 |
4 | Nevada | 2002 | 2.9 |
The column order can be declared during creation.
pd.DataFrame(data, columns=['year', 'state', 'pop'])
year | state | pop | |
---|---|---|---|
0 | 2000 | Ohio | 1.5 |
1 | 2001 | Ohio | 1.7 |
2 | 2002 | Ohio | 3.6 |
3 | 2001 | Nevada | 2.4 |
4 | 2002 | Nevada | 2.9 |
5 | 2003 | Nevada | 3.2 |
Passing a column name that does not appear in the data is also a way to declare an empty column.
frame2 = pd.DataFrame(
data,
columns=['year', 'state', 'pop', 'debt'],
index=['one', 'two', 'three', 'four', 'five', 'six']
)
frame2
year | state | pop | debt | |
---|---|---|---|---|
one | 2000 | Ohio | 1.5 | NaN |
two | 2001 | Ohio | 1.7 | NaN |
three | 2002 | Ohio | 3.6 | NaN |
four | 2001 | Nevada | 2.4 | NaN |
five | 2002 | Nevada | 2.9 | NaN |
six | 2003 | Nevada | 3.2 | NaN |
frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')
frame2['state']
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
six Nevada
Name: state, dtype: object
frame2.year
one 2000
two 2001
three 2002
four 2001
five 2002
six 2003
Name: year, dtype: int64
type(frame2.year)
pandas.core.series.Series
type(frame2)
pandas.core.frame.DataFrame
A DataFrame can be indexed by column using either dictionary-like or attribute syntax.
Rows can be selected with the loc attribute by passing a row label.
frame2['state']
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
six Nevada
Name: state, dtype: object
frame2.year
one 2000
two 2001
three 2002
four 2001
five 2002
six 2003
Name: year, dtype: int64
frame2.loc['three']
year 2002
state Ohio
pop 3.6
debt NaN
Name: three, dtype: object
Columns can be modified by assignment.
frame2['debt'] = 16.5
frame2
year | state | pop | debt | |
---|---|---|---|---|
one | 2000 | Ohio | 1.5 | 16.5 |
two | 2001 | Ohio | 1.7 | 16.5 |
three | 2002 | Ohio | 3.6 | 16.5 |
four | 2001 | Nevada | 2.4 | 16.5 |
five | 2002 | Nevada | 2.9 | 16.5 |
six | 2003 | Nevada | 3.2 | 16.5 |
frame2['debt'] = np.arange(6.0)
frame2
year | state | pop | debt | |
---|---|---|---|---|
one | 2000 | Ohio | 1.5 | 0.0 |
two | 2001 | Ohio | 1.7 | 1.0 |
three | 2002 | Ohio | 3.6 | 2.0 |
four | 2001 | Nevada | 2.4 | 3.0 |
five | 2002 | Nevada | 2.9 | 4.0 |
six | 2003 | Nevada | 3.2 | 5.0 |
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2
year | state | pop | debt | |
---|---|---|---|---|
one | 2000 | Ohio | 1.5 | NaN |
two | 2001 | Ohio | 1.7 | -1.2 |
three | 2002 | Ohio | 3.6 | NaN |
four | 2001 | Nevada | 2.4 | -1.5 |
five | 2002 | Nevada | 2.9 | -1.7 |
six | 2003 | Nevada | 3.2 | NaN |
Columns can be easily created and deleted.
frame2['eastern'] = frame2.state == 'Ohio'
frame2
year | state | pop | debt | eastern | |
---|---|---|---|---|---|
one | 2000 | Ohio | 1.5 | NaN | True |
two | 2001 | Ohio | 1.7 | -1.2 | True |
three | 2002 | Ohio | 3.6 | NaN | True |
four | 2001 | Nevada | 2.4 | -1.5 | False |
five | 2002 | Nevada | 2.9 | -1.7 | False |
six | 2003 | Nevada | 3.2 | NaN | False |
del frame2['eastern']
frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')
DataFrames can also be created from nested dictionaries. The first level becomes the column index and the second the row index.
pop = {
'Nevada': {2001: 2.4, 2002: 2.9},
'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}
}
frame3 = pd.DataFrame(pop)
frame3
Nevada | Ohio | |
---|---|---|
2001 | 2.4 | 1.7 |
2002 | 2.9 | 3.6 |
2000 | NaN | 1.5 |
A DataFrame can be transposed.
frame3.T
2001 | 2002 | 2000 | |
---|---|---|---|
Nevada | 2.4 | 2.9 | NaN |
Ohio | 1.7 | 3.6 | 1.5 |
A DataFrame can be created from a dictionary of Series.
pdata = {
'Ohio': frame3['Ohio'][:-1],
'Nevada': frame3['Nevada'][:2]
}
pd.DataFrame(pdata)
Ohio | Nevada | |
---|---|---|
2001 | 1.7 | 2.4 |
2002 | 3.6 | 2.9 |
A DataFrame's index and columns have their own name attributes.
frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3
state | Nevada | Ohio |
---|---|---|
year | ||
2001 | 2.4 | 1.7 |
2002 | 2.9 | 3.6 |
2000 | NaN | 1.5 |
The values attribute returns the data as a 2-D ndarray.
frame3.values
array([[2.4, 1.7],
[2.9, 3.6],
[nan, 1.5]])
frame2.values
array([[2000, 'Ohio', 1.5, nan],
[2001, 'Ohio', 1.7, -1.2],
[2002, 'Ohio', 3.6, nan],
[2001, 'Nevada', 2.4, -1.5],
[2002, 'Nevada', 2.9, -1.7],
[2003, 'Nevada', 3.2, nan]], dtype=object)
Index objects hold the axis labels and other metadata. They are immutable.
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
Index(['a', 'b', 'c'], dtype='object')
index[1:]
Index(['b', 'c'], dtype='object')
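A minimal sketch of the immutability mentioned above: item assignment on an Index raises an error (the exact message may vary by pandas version).
try:
    index[1] = 'd'
except TypeError as err:
    print(err)  # e.g. 'Index does not support mutable operations'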
Index objects can be shared amongst data structures.
labels = pd.Index(np.arange(3))
labels
Int64Index([0, 1, 2], dtype='int64')
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2
0 1.5
1 -2.5
2 0.0
dtype: float64
obj2.index is labels
True
frame3.columns
Index(['Nevada', 'Ohio'], dtype='object', name='state')
'Ohio' in frame3.columns
True
2003 in frame3.index
False
This section discusses the fundamental interactions with Series and DataFrames.
The reindex method creates a new object with the data conformed to a new index.
obj = pd.Series([4.3, 7.2, -5.3, 3.6], index = ['d', 'b', 'a', 'c'])
obj
d 4.3
b 7.2
a -5.3
c 3.6
dtype: float64
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
a -5.3
b 7.2
c 3.6
d 4.3
e NaN
dtype: float64
The method argument describes how to handle missing data when reindexing. Here is an example with 'ffill', which performs a forward-fill.
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3
0 blue
2 purple
4 yellow
dtype: object
obj3.reindex(range(6), method='ffill')
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
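For comparison, a back-fill propagates values backward instead; a quick sketch with the same obj3:
# 'bfill' fills each missing label from the next existing value;
# label 5 has no later value, so it remains NaN.
obj3.reindex(range(6), method='bfill')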
With DataFrame, 'reindex' can alter either the row index, columns, or both.
frame = pd.DataFrame(
np.arange(9).reshape((3, 3)),
index=['a', 'c', 'd'],
columns=['Ohio', 'Texas', 'California']
)
frame
Ohio | Texas | California | |
---|---|---|---|
a | 0 | 1 | 2 |
c | 3 | 4 | 5 |
d | 6 | 7 | 8 |
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2
Ohio | Texas | California | |
---|---|---|---|
a | 0.0 | 1.0 | 2.0 |
b | NaN | NaN | NaN |
c | 3.0 | 4.0 | 5.0 |
d | 6.0 | 7.0 | 8.0 |
Columns can be reindexed with the columns keyword.
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)
Texas | Utah | California | |
---|---|---|---|
a | 1 | NaN | 2 |
c | 4 | NaN | 5 |
d | 7 | NaN | 8 |
Reindexing can be done more succinctly by label-indexing with loc.
frame.loc[['a', 'b', 'c', 'd'], states]
/opt/anaconda3/envs/daysOfCode-env/lib/python3.7/site-packages/pandas/core/indexing.py:1418: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
return self._getitem_tuple(key)
Texas | Utah | California | |
---|---|---|---|
a | 1.0 | NaN | 2.0 |
b | NaN | NaN | NaN |
c | 4.0 | NaN | 5.0 |
d | 7.0 | NaN | 8.0 |
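As the warning suggests, the same result can be obtained without the deprecated behaviour by calling reindex directly; a sketch:
# Reindex rows and columns in a single call instead of label-indexing with missing labels.
frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)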
The drop method returns a new object with the indicated values deleted from an axis.
obj = pd.Series(np.arange(5.0), index=['a', 'b', 'c', 'd', 'e'])
obj
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
new_obj = obj.drop('c')
new_obj
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
obj.drop(['d', 'c'])
a 0.0
b 1.0
e 4.0
dtype: float64
With DataFrame, index values can be deleted from either axis.
data = pd.DataFrame(
np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four']
)
data
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
data.drop(['Colorado', 'Ohio'])
one | two | three | four | |
---|---|---|---|---|
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
data.drop('two', axis=1)
one | three | four | |
---|---|---|---|
Ohio | 0 | 2 | 3 |
Colorado | 4 | 6 | 7 |
Utah | 8 | 10 | 11 |
New York | 12 | 14 | 15 |
data.drop(['two', 'four'], axis='columns')
one | three | |
---|---|---|
Ohio | 0 | 2 |
Colorado | 4 | 6 |
Utah | 8 | 10 |
New York | 12 | 14 |
There is an option to make the change in-place.
obj.drop('c', inplace=True)
obj
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
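The same keyword works for DataFrame.drop. Note that with inplace=True the method returns None and mutates the object, so the sketch below works on a copy (tmp is a hypothetical name).
tmp = data.copy()
tmp.drop('two', axis=1, inplace=True)  # returns None; tmp is modified in place
tmp.columns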
A Series can be indexed by integer position, boolean array, or index label. Note that slicing with labels is inclusive of the endpoint.
obj = pd.Series(np.arange(4.0), index=['a', 'b', 'c', 'd'])
obj
a 0.0
b 1.0
c 2.0
d 3.0
dtype: float64
obj['b']
1.0
obj[1]
1.0
obj[2:4]
c 2.0
d 3.0
dtype: float64
obj[['b', 'a', 'd']]
b 1.0
a 0.0
d 3.0
dtype: float64
obj[[1, 3]]
b 1.0
d 3.0
dtype: float64
obj[obj < 2]
a 0.0
b 1.0
dtype: float64
obj['b':'c']
b 1.0
c 2.0
dtype: float64
obj['b':'c'] = 5
obj
a 0.0
b 5.0
c 5.0
d 3.0
dtype: float64
Indexing a DataFrame can retrieve multiple columns.
data = pd.DataFrame(
np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four']
)
data
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
data['two']
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int64
data[['three', 'one']]
three | one | |
---|---|---|
Ohio | 2 | 0 |
Colorado | 6 | 4 |
Utah | 10 | 8 |
New York | 14 | 12 |
data[:2]
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
data < 5
one | two | three | four | |
---|---|---|---|---|
Ohio | True | True | True | True |
Colorado | True | False | False | False |
Utah | False | False | False | False |
New York | False | False | False | False |
data[data['three'] > 5]
one | two | three | four | |
---|---|---|---|---|
Colorado | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
data[data < 5] = 0
data
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 0 | 0 | 0 |
Colorado | 0 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
loc and iloc select subsets of a DataFrame's rows and columns: loc uses axis labels, while iloc uses integer positions.
data.loc['Colorado', ['two', 'three']]
two 5
three 6
Name: Colorado, dtype: int64
data.iloc[2, [3, 0, 1]]
four 11
one 8
two 9
Name: Utah, dtype: int64
data.iloc[2]
one 8
two 9
three 10
four 11
Name: Utah, dtype: int64
data.iloc[[1, 2], [3, 0, 1]]
four | one | two | |
---|---|---|---|
Colorado | 7 | 0 | 5 |
Utah | 11 | 8 | 9 |
data.loc[:'Utah', 'two']
Ohio 0
Colorado 5
Utah 9
Name: two, dtype: int64
data.iloc[:, :3][data.three > 5]
one | two | three | |
---|---|---|---|
Colorado | 0 | 5 | 6 |
Utah | 8 | 9 | 10 |
New York | 12 | 13 | 14 |
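One difference worth noting (a small sketch using the data frame above): slicing with loc includes the end label, while slicing with iloc follows the usual Python convention and excludes the end position.
data.loc['Ohio':'Utah', 'two']   # includes the 'Utah' row
data.iloc[0:2, 1]                # rows 0 and 1 only; position 2 is excluded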
When adding objects together, if the index pairs are not identical, the index of the result is the union of the two indexes.
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s1
a 7.3
c -2.5
d 3.4
e 1.5
dtype: float64
s2
a -2.1
c 3.6
e -1.5
f 4.0
g 3.1
dtype: float64
NaN is returned for labels that appear in only one of the Series being added.
s1 + s2
a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
dtype: float64
For DataFrame, alignment is performed on both rows and columns.
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)),
columns=list('bcd'),
index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)),
columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1
b | c | d | |
---|---|---|---|
Ohio | 0.0 | 1.0 | 2.0 |
Texas | 3.0 | 4.0 | 5.0 |
Colorado | 6.0 | 7.0 | 8.0 |
df2
b | d | e | |
---|---|---|---|
Utah | 0.0 | 1.0 | 2.0 |
Ohio | 3.0 | 4.0 | 5.0 |
Texas | 6.0 | 7.0 | 8.0 |
Oregon | 9.0 | 10.0 | 11.0 |
When they are added together, values are returned only for labels present in both DataFrames; everything else is NaN.
df1 + df2
b | c | d | e | |
---|---|---|---|---|
Colorado | NaN | NaN | NaN | NaN |
Ohio | 3.0 | NaN | 6.0 | NaN |
Oregon | NaN | NaN | NaN | NaN |
Texas | 9.0 | NaN | 12.0 | NaN |
Utah | NaN | NaN | NaN | NaN |
There is an option to fill missing data with a specific value to be used for the operation.
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
df1
a | b | c | d | |
---|---|---|---|---|
0 | 0.0 | 1.0 | 2.0 | 3.0 |
1 | 4.0 | 5.0 | 6.0 | 7.0 |
2 | 8.0 | 9.0 | 10.0 | 11.0 |
df2.loc[1, 'b'] = np.nan
df2
a | b | c | d | e | |
---|---|---|---|---|---|
0 | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 |
1 | 5.0 | NaN | 7.0 | 8.0 | 9.0 |
2 | 10.0 | 11.0 | 12.0 | 13.0 | 14.0 |
3 | 15.0 | 16.0 | 17.0 | 18.0 | 19.0 |
df1 + df2
a | b | c | d | e | |
---|---|---|---|---|---|
0 | 0.0 | 2.0 | 4.0 | 6.0 | NaN |
1 | 9.0 | NaN | 13.0 | 15.0 | NaN |
2 | 18.0 | 20.0 | 22.0 | 24.0 | NaN |
3 | NaN | NaN | NaN | NaN | NaN |
df1.add(df2, fill_value=0)
a | b | c | d | e | |
---|---|---|---|---|---|
0 | 0.0 | 2.0 | 4.0 | 6.0 | 4.0 |
1 | 9.0 | 5.0 | 13.0 | 15.0 | 9.0 |
2 | 18.0 | 20.0 | 22.0 | 24.0 | 14.0 |
3 | 15.0 | 16.0 | 17.0 | 18.0 | 19.0 |
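DataFrames also have flexible arithmetic methods (add, sub, mul, div) with reversed counterparts prefixed with r; a sketch of the equivalence, reusing df1 from above:
# These two expressions are equivalent; rdiv divides with the operands swapped.
1 / df1
df1.rdiv(1)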
There are defined behaviours for arithmetic between a DataFrame and a Series; they are closely related to, though slightly different from, broadcasting between 1-D and multidimensional ndarrays.
arr = np.arange(12.0).reshape((3, 4))
arr
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.]])
arr[0]
array([0., 1., 2., 3.])
The subtraction below is performed once per row (an example of broadcasting).
arr - arr[0]
array([[0., 0., 0., 0.],
[4., 4., 4., 4.],
[8., 8., 8., 8.]])
A similar mechanism is used for operations between a Series and a DataFrame.
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
b | d | e | |
---|---|---|---|
Utah | 0.0 | 1.0 | 2.0 |
Ohio | 3.0 | 4.0 | 5.0 |
Texas | 6.0 | 7.0 | 8.0 |
Oregon | 9.0 | 10.0 | 11.0 |
series = frame.iloc[0]
series
b 0.0
d 1.0
e 2.0
Name: Utah, dtype: float64
frame - series
b | d | e | |
---|---|---|---|
Utah | 0.0 | 0.0 | 0.0 |
Ohio | 3.0 | 3.0 | 3.0 |
Texas | 6.0 | 6.0 | 6.0 |
Oregon | 9.0 | 9.0 | 9.0 |
If an index value is not found in either the DataFrame's columns or the Series' index, the objects are reindexed to form the union.
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
series2
b 0
e 1
f 2
dtype: int64
frame + series2
b | d | e | f | |
---|---|---|---|---|
Utah | 0.0 | NaN | 3.0 | NaN |
Ohio | 3.0 | NaN | 6.0 | NaN |
Texas | 6.0 | NaN | 9.0 | NaN |
Oregon | 9.0 | NaN | 12.0 | NaN |
series3 = frame['d']
series3
Utah 1.0
Ohio 4.0
Texas 7.0
Oregon 10.0
Name: d, dtype: float64
frame.sub(series3)
Ohio | Oregon | Texas | Utah | b | d | e | |
---|---|---|---|---|---|---|---|
Utah | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Ohio | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Texas | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Oregon | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
frame.sub(series3, axis='index')
b | d | e | |
---|---|---|---|
Utah | -1.0 | 0.0 | 1.0 |
Ohio | -1.0 | 0.0 | 1.0 |
Texas | -1.0 | 0.0 | 1.0 |
Oregon | -1.0 | 0.0 | 1.0 |
DataFrames work well with ufuncs, too.
frame = pd.DataFrame(np.random.randn(4, 3),
columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
b | d | e | |
---|---|---|---|
Utah | -1.167574 | -1.923661 | -0.856915 |
Ohio | -0.712398 | 0.835224 | 1.783416 |
Texas | 0.283403 | 1.123811 | -1.542719 |
Oregon | -1.049275 | -0.444459 | -0.123116 |
np.abs(frame)
b | d | e | |
---|---|---|---|
Utah | 1.167574 | 1.923661 | 0.856915 |
Ohio | 0.712398 | 0.835224 | 1.783416 |
Texas | 0.283403 | 1.123811 | 1.542719 |
Oregon | 1.049275 | 0.444459 | 0.123116 |
Use the apply method to apply a function to the 1-D arrays formed from the columns or rows.
f = lambda x: x.max() - x.min()
frame.apply(f)
b 1.450977
d 3.047472
e 3.326136
dtype: float64
To operate on the rows, pass axis=1.
frame.apply(f, axis=1)
Utah 1.066746
Ohio 2.495815
Texas 2.666531
Oregon 0.926159
dtype: float64
def f(x):
return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)
b | d | e | |
---|---|---|---|
min | -1.167574 | -1.923661 | -1.542719 |
max | 0.283403 | 1.123811 | 1.783416 |
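Element-wise functions can be applied as well; a brief sketch using applymap on the DataFrame and map on a single column to format the floats as strings:
format_fn = lambda x: '%.2f' % x
frame.applymap(format_fn)   # element-wise over the whole DataFrame
frame['e'].map(format_fn)   # element-wise over one column (a Series)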
Use the sort_index method to sort a Series or DataFrame lexicographically by its index labels.
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()
a 1
b 2
c 3
d 0
dtype: int64
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
index=['three', 'one'],
columns=['d', 'a', 'b', 'c'])
frame.sort_index()
d | a | b | c | |
---|---|---|---|---|
one | 4 | 5 | 6 | 7 |
three | 0 | 1 | 2 | 3 |
frame.sort_index(axis=1)
a | b | c | d | |
---|---|---|---|---|
three | 1 | 2 | 3 | 0 |
one | 5 | 6 | 7 | 4 |
frame.sort_index(axis=1, ascending=False)
d | c | b | a | |
---|---|---|---|---|
three | 0 | 3 | 2 | 1 |
one | 4 | 7 | 6 | 5 |
Use the sort_values method to sort a Series by its values.
obj = pd.Series([4, 7, -3, -2])
obj.sort_values()
2 -3
3 -2
0 4
1 7
dtype: int64
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()
4 -3.0
5 2.0
0 4.0
2 7.0
1 NaN
3 NaN
dtype: float64
For a DataFrame, a column can be specified to use for sorting.
frame = pd.DataFrame({
'b': [4, 7, -3, 2],
'a': [0, 1, 0, 1]
})
frame.sort_values(by='b')
b | a | |
---|---|---|
2 | -3 | 0 |
3 | 2 | 1 |
0 | 4 | 0 |
1 | 7 | 1 |
frame.sort_values(by=['a', 'b'])
b | a | |
---|---|---|
2 | -3 | 0 |
0 | 4 | 0 |
3 | 2 | 1 |
1 | 7 | 1 |
Ranking assigns ranks from 1 through the number of valid data points.
There are a few different ways to handle ties, and they can be declared with the method argument.
obj = pd.Series([7, -5, 7, 4, 3, 0, 4])
obj.rank()
0 6.5
1 1.0
2 6.5
3 4.5
4 3.0
5 2.0
6 4.5
dtype: float64
obj.rank(method='first')
0 6.0
1 1.0
2 7.0
3 4.0
4 3.0
5 2.0
6 5.0
dtype: float64
obj.rank(pct=True)
0 0.928571
1 0.142857
2 0.928571
3 0.642857
4 0.428571
5 0.285714
6 0.642857
dtype: float64
obj.rank(ascending=False, method='max')
0 2.0
1 7.0
2 2.0
3 4.0
4 5.0
5 6.0
6 4.0
dtype: float64
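Another tie-handling option is 'dense', which is like 'min' but increases the rank by one between groups instead of skipping ranks; a sketch with the same obj:
obj.rank(method='dense')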
DataFrames can be ranked over the columns or rows.
frame = pd.DataFrame({
'b': [4.3, 7, -3, 2],
'a': [0, 1, 0, 1],
'c': [-2, 5, 8, -2.5]
}).sort_index(axis=1)
frame
a | b | c | |
---|---|---|---|
0 | 0 | 4.3 | -2.0 |
1 | 1 | 7.0 | 5.0 |
2 | 0 | -3.0 | 8.0 |
3 | 1 | 2.0 | -2.5 |
frame.rank()
a | b | c | |
---|---|---|---|
0 | 1.5 | 3.0 | 2.0 |
1 | 3.5 | 4.0 | 3.0 |
2 | 1.5 | 1.0 | 4.0 |
3 | 3.5 | 2.0 | 1.0 |
frame.rank(axis=1)
a | b | c | |
---|---|---|---|
0 | 2.0 | 3.0 | 1.0 |
1 | 1.0 | 3.0 | 2.0 |
2 | 2.0 | 1.0 | 3.0 |
3 | 2.0 | 3.0 | 1.0 |
Many pandas functions require the labels to be unique, but uniqueness is not mandatory for a Series or DataFrame. Here is a small example Series with non-unique labels.
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj
a 0
a 1
b 2
b 3
c 4
dtype: int64
obj.index.is_unique
False
obj.a
a 0
a 1
dtype: int64
obj.c
4
Here is an example with DataFrame.
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'b', 'b', 'b'])
df
0 | 1 | 2 | |
---|---|---|---|
a | 0.315786 | -0.868693 | -1.045036 |
b | 1.144670 | 0.902213 | -2.583696 |
b | -0.499635 | 0.883807 | -1.347935 |
b | -1.053965 | -0.075102 | 0.066315 |
df.loc['b']
0 | 1 | 2 | |
---|---|---|---|
b | 1.144670 | 0.902213 | -2.583696 |
b | -0.499635 | 0.883807 | -1.347935 |
b | -1.053965 | -0.075102 | 0.066315 |
Pandas has many methods for computing reductions and summarising the data in a DataFrame. Importantly, they naturally handle missing data.
df = pd.DataFrame([[1.4, np.nan],
[7.1, -4.5],
[np.nan, np.nan],
[0.75, -1.3]],
index=['a', 'b', 'c', 'd'],
columns=['one', 'two'])
df
one | two | |
---|---|---|
a | 1.40 | NaN |
b | 7.10 | -4.5 |
c | NaN | NaN |
d | 0.75 | -1.3 |
df.sum()
one 9.25
two -5.80
dtype: float64
df.sum(axis=1)
a 1.40
b 2.60
c 0.00
d -0.55
dtype: float64
df.sum(axis=1, skipna=False)
a NaN
b 2.60
c NaN
d -0.55
dtype: float64
There are also methods for getting the index labels of particular values, such as idxmax for the maximum.
df.idxmax()
one b
two d
dtype: object
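The companion method idxmin returns the label of the minimum instead; a quick sketch with the same df:
df.idxmin()
# one    d
# two    b
# dtype: object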
df.cumsum()
one | two | |
---|---|---|
a | 1.40 | NaN |
b | 8.50 | -4.5 |
c | NaN | NaN |
d | 9.25 | -5.8 |
The describe method is useful for getting a high-level overview of the data in a DataFrame.
df.describe()
one | two | |
---|---|---|
count | 3.000000 | 2.000000 |
mean | 3.083333 | -2.900000 |
std | 3.493685 | 2.262742 |
min | 0.750000 | -4.500000 |
25% | 1.075000 | -3.700000 |
50% | 1.400000 | -2.900000 |
75% | 4.250000 | -2.100000 |
max | 7.100000 | -1.300000 |
For non-numeric data, describe produces alternative summary statistics.
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj
0 a
1 a
2 b
3 c
4 a
5 a
6 b
7 c
8 a
9 a
10 b
11 c
12 a
13 a
14 b
15 c
dtype: object
obj.describe()
count     16
unique     3
top        a
freq       8
dtype: object
Below are examples of some of the more common summary statistics.
df = pd.DataFrame(np.random.randn(100).reshape((20, 5)))
df
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | -1.045266 | 1.282355 | 2.090783 | -0.252437 | 1.040161 |
1 | 0.150544 | -0.147333 | 0.320805 | 1.567862 | -0.174218 |
2 | 1.166097 | -1.042703 | -0.079506 | -0.118705 | 0.123881 |
3 | -0.043958 | 2.006066 | 0.471954 | 0.686379 | 0.540834 |
4 | 1.244186 | -0.874540 | -1.410670 | -1.593762 | 0.318683 |
5 | -0.071923 | 0.148731 | -0.456628 | -1.155017 | 0.624898 |
6 | -0.083861 | 0.218961 | 2.650821 | -0.113464 | -0.422801 |
7 | -0.289215 | 3.006028 | -2.028504 | 0.290864 | 0.984765 |
8 | 0.866779 | 1.039549 | 0.772484 | -1.578464 | -0.247977 |
9 | -0.079468 | 0.327789 | 0.740373 | 0.268327 | -1.121380 |
10 | -0.728884 | -1.385715 | 1.322314 | -0.502122 | 0.456615 |
11 | 0.460693 | 2.520307 | -0.929717 | 1.136553 | 1.818280 |
12 | -0.226714 | -0.790623 | 0.940863 | -0.279578 | 0.408899 |
13 | -2.252070 | -0.513878 | -1.115919 | -1.057158 | -0.112186 |
14 | 2.193616 | 1.330908 | 0.217283 | 0.489868 | -0.011269 |
15 | 1.823363 | 1.266734 | -1.286281 | 0.673735 | -0.398431 |
16 | 0.780628 | 1.615514 | 0.525163 | -0.062664 | 0.088914 |
17 | 2.037299 | 1.412365 | -0.284433 | 0.963503 | 0.373894 |
18 | -0.214364 | 2.878076 | -0.450742 | -0.346410 | 0.520312 |
19 | -0.610795 | -1.430337 | 0.121110 | 0.483379 | 2.001245 |
df.count() # number of non-NA values
0 20
1 20
2 20
3 20
4 20
dtype: int64
df.mean()
0 0.253834
1 0.643413
2 0.106578
3 -0.024966
4 0.340656
dtype: float64
df.median()
0 -0.057940
1 0.683669
2 0.169197
3 -0.088064
4 0.346288
dtype: float64
df.mad()
0 0.854199
1 1.192377
2 0.900160
3 0.678516
4 0.536334
dtype: float64
df.prod()
0 -1.721891e-07
1 -2.590494e-01
2 -1.466648e-04
3 -4.670023e-07
4 -1.657156e-09
dtype: float64
df.std()
0 1.094775
1 1.396016
2 1.169365
3 0.866003
4 0.737545
dtype: float64
df.var()
0 1.198532
1 1.948861
2 1.367415
3 0.749961
4 0.543972
dtype: float64
df.cumsum()
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | -1.045266 | 1.282355 | 2.090783 | -0.252437 | 1.040161 |
1 | -0.894723 | 1.135022 | 2.411587 | 1.315425 | 0.865944 |
2 | 0.271374 | 0.092320 | 2.332081 | 1.196720 | 0.989825 |
3 | 0.227416 | 2.098386 | 2.804035 | 1.883099 | 1.530659 |
4 | 1.471602 | 1.223846 | 1.393365 | 0.289337 | 1.849342 |
5 | 1.399679 | 1.372576 | 0.936738 | -0.865680 | 2.474240 |
6 | 1.315818 | 1.591538 | 3.587559 | -0.979144 | 2.051439 |
7 | 1.026603 | 4.597565 | 1.559055 | -0.688280 | 3.036203 |
8 | 1.893382 | 5.637114 | 2.331539 | -2.266744 | 2.788226 |
9 | 1.813915 | 5.964904 | 3.071912 | -1.998416 | 1.666846 |
10 | 1.085031 | 4.579189 | 4.394226 | -2.500538 | 2.123461 |
11 | 1.545723 | 7.099496 | 3.464508 | -1.363985 | 3.941741 |
12 | 1.319009 | 6.308873 | 4.405372 | -1.643564 | 4.350640 |
13 | -0.933061 | 5.794995 | 3.289452 | -2.700722 | 4.238454 |
14 | 1.260555 | 7.125903 | 3.506736 | -2.210854 | 4.227184 |
15 | 3.083918 | 8.392638 | 2.220455 | -1.537119 | 3.828753 |
16 | 3.864546 | 10.008152 | 2.745618 | -1.599783 | 3.917667 |
17 | 5.901845 | 11.420517 | 2.461186 | -0.636280 | 4.291561 |
18 | 5.687480 | 14.298593 | 2.010443 | -0.982690 | 4.811873 |
19 | 5.076686 | 12.868256 | 2.131554 | -0.499311 | 6.813118 |
df.cumprod()
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | -1.045266e+00 | 1.282355 | 2.090783 | -2.524371e-01 | 1.040161e+00 |
1 | -1.573583e-01 | -0.188933 | 0.670733 | -3.957865e-01 | -1.812143e-01 |
2 | -1.834949e-01 | 0.197001 | -0.053328 | 4.698193e-02 | -2.244901e-02 |
3 | 8.066110e-03 | 0.395197 | -0.025168 | 3.224742e-02 | -1.214120e-02 |
4 | 1.003574e-02 | -0.345616 | 0.035504 | -5.139471e-02 | -3.869194e-03 |
5 | -7.217967e-04 | -0.051404 | -0.016212 | 5.936175e-02 | -2.417850e-03 |
6 | 6.053054e-05 | -0.011255 | -0.042975 | -6.735445e-03 | 1.022269e-03 |
7 | -1.750635e-05 | -0.033834 | 0.087176 | -1.959101e-03 | 1.006695e-03 |
8 | -1.517415e-05 | -0.035172 | 0.067342 | 3.092369e-03 | -2.496376e-04 |
9 | 1.205853e-06 | -0.011529 | 0.049858 | 8.297670e-04 | 2.799386e-04 |
10 | -8.789272e-07 | 0.015976 | 0.065928 | -4.166443e-04 | 1.278243e-04 |
11 | -4.049152e-07 | 0.040264 | -0.061294 | -4.735383e-04 | 2.324203e-04 |
12 | 9.180015e-08 | -0.031834 | -0.057670 | 1.323911e-04 | 9.503635e-05 |
13 | -2.067404e-07 | 0.016359 | 0.064355 | -1.399583e-04 | -1.066178e-05 |
14 | -4.535091e-07 | 0.021772 | 0.013983 | -6.856113e-05 | 1.201514e-07 |
15 | -8.269117e-07 | 0.027579 | -0.017986 | -4.619204e-05 | -4.787200e-08 |
16 | -6.455103e-07 | 0.044555 | -0.009446 | 2.894595e-06 | -4.256488e-09 |
17 | -1.315097e-06 | 0.062928 | 0.002687 | 2.788951e-06 | -1.591474e-09 |
18 | 2.819100e-07 | 0.181111 | -0.001211 | -9.661200e-07 | -8.280626e-10 |
19 | -1.721891e-07 | -0.259049 | -0.000147 | -4.670023e-07 | -1.657156e-09 |
Here, the author showed an example of using pandas to compute correlation and covariance measures between stocks. The first step was to download the data from Yahoo! Finance using pandas-datareader.
import pandas_datareader.data as web
stock_names = ['AAPL', 'IBM', 'MSFT', 'GOOG']
all_data = {ticker: web.get_data_yahoo(ticker) for ticker in stock_names}
all_data
{'AAPL': High Low Open Close Volume \
Date
2014-11-03 110.300003 108.010002 108.220001 109.400002 5.228260e+07
2014-11-04 109.489998 107.720001 109.360001 108.599998 4.157440e+07
2014-11-05 109.300003 108.129997 109.099998 108.860001 3.743590e+07
2014-11-06 108.790001 107.800003 108.599998 108.699997 3.496850e+07
2014-11-07 109.320000 108.550003 108.750000 109.010002 3.369150e+07
... ... ... ... ... ...
2019-10-28 249.250000 246.720001 247.419998 249.050003 2.414320e+07
2019-10-29 249.750000 242.570007 248.970001 243.289993 3.566010e+07
2019-10-30 245.300003 241.210007 244.759995 243.259995 3.095060e+07
2019-10-31 249.169998 237.259995 247.240005 248.759995 3.476660e+09
2019-11-01 255.929993 249.160004 249.539993 255.820007 3.773870e+07
Adj Close
Date
2014-11-03 100.385109
2014-11-04 99.651047
2014-11-05 99.889618
2014-11-06 100.175293
2014-11-07 100.460983
... ...
2019-10-28 249.050003
2019-10-29 243.289993
2019-10-30 243.259995
2019-10-31 248.759995
2019-11-01 255.820007
[1259 rows x 6 columns],
'IBM': High Low Open Close Volume \
Date
2014-11-03 164.539993 163.380005 164.250000 164.360001 4688200.0
2014-11-04 164.360001 162.240005 164.339996 162.649994 4246900.0
2014-11-05 163.539993 161.559998 163.130005 161.820007 4104700.0
2014-11-06 161.529999 160.050003 161.279999 161.460007 4067600.0
2014-11-07 162.210007 160.850006 161.419998 162.070007 3494800.0
... ... ... ... ... ...
2019-10-28 136.630005 135.449997 136.000000 135.970001 3225700.0
2019-10-29 135.570007 133.440002 135.419998 133.820007 4159600.0
2019-10-30 135.279999 133.199997 133.830002 135.250000 2252700.0
2019-10-31 135.250000 133.229996 135.110001 133.729996 341090000.0
2019-11-01 135.559998 134.089996 134.500000 135.529999 3088800.0
Adj Close
Date
2014-11-03 133.097809
2014-11-04 131.713089
2014-11-05 131.040939
2014-11-06 131.644287
2014-11-07 132.141617
... ...
2019-10-28 135.970001
2019-10-29 133.820007
2019-10-30 135.250000
2019-10-31 133.729996
2019-11-01 135.529999
[1259 rows x 6 columns],
'MSFT': High Low Open Close Volume \
Date
2014-11-03 47.459999 46.730000 46.889999 47.439999 2.313040e+07
2014-11-04 47.730000 47.250000 47.299999 47.570000 2.153080e+07
2014-11-05 47.900002 47.259998 47.799999 47.860001 2.244960e+07
2014-11-06 48.860001 47.790001 47.860001 48.700001 3.303780e+07
2014-11-07 48.919998 48.290001 48.919998 48.680000 2.800060e+07
... ... ... ... ... ...
2019-10-28 145.669998 143.509995 144.399994 144.190002 3.528010e+07
2019-10-29 144.500000 142.649994 144.080002 142.830002 2.051970e+07
2019-10-30 145.000000 142.789993 143.520004 144.610001 1.847170e+07
2019-10-31 144.929993 142.990005 144.899994 143.369995 2.459620e+09
2019-11-01 144.419998 142.970001 144.259995 143.720001 3.311920e+07
Adj Close
Date
2014-11-03 42.466732
2014-11-04 42.583115
2014-11-05 42.842709
2014-11-06 43.594646
2014-11-07 43.576752
... ...
2019-10-28 144.190002
2019-10-29 142.830002
2019-10-30 144.610001
2019-10-31 143.369995
2019-11-01 143.720001
[1259 rows x 6 columns],
'GOOG': High Low Open Close Volume \
Date
2014-11-03 556.372498 551.715271 553.979065 553.699829 1382200.0
2014-11-04 553.979065 547.796021 551.485901 552.592834 1244200.0
2014-11-05 555.275513 542.560425 555.275513 544.425293 2032200.0
2014-11-06 545.387634 539.488831 544.006409 540.555908 1333200.0
2014-11-07 544.714478 537.195129 544.714478 539.528748 1633700.0
... ... ... ... ... ...
2019-10-28 1299.310059 1272.540039 1275.449951 1290.000000 2613200.0
2019-10-29 1281.589966 1257.212036 1276.229980 1262.619995 1869200.0
2019-10-30 1269.359985 1252.000000 1252.969971 1261.290039 1407700.0
2019-10-31 1267.670044 1250.843018 1261.280029 1260.109985 145470000.0
2019-11-01 1274.619995 1260.500000 1265.000000 1273.739990 1669400.0
Adj Close
Date
2014-11-03 553.699829
2014-11-04 552.592834
2014-11-05 544.425293
2014-11-06 540.555908
2014-11-07 539.528748
... ...
2019-10-28 1290.000000
2019-10-29 1262.619995
2019-10-30 1261.290039
2019-10-31 1260.109985
2019-11-01 1273.739990
[1259 rows x 6 columns]}
price = pd.DataFrame({ticker: data['Adj Close'] for ticker, data in all_data.items()})
price
AAPL | IBM | MSFT | GOOG | |
---|---|---|---|---|
Date | ||||
2014-11-03 | 100.385109 | 133.097809 | 42.466732 | 553.699829 |
2014-11-04 | 99.651047 | 131.713089 | 42.583115 | 552.592834 |
2014-11-05 | 99.889618 | 131.040939 | 42.842709 | 544.425293 |
2014-11-06 | 100.175293 | 131.644287 | 43.594646 | 540.555908 |
2014-11-07 | 100.460983 | 132.141617 | 43.576752 | 539.528748 |
... | ... | ... | ... | ... |
2019-10-28 | 249.050003 | 135.970001 | 144.190002 | 1290.000000 |
2019-10-29 | 243.289993 | 133.820007 | 142.830002 | 1262.619995 |
2019-10-30 | 243.259995 | 135.250000 | 144.610001 | 1261.290039 |
2019-10-31 | 248.759995 | 133.729996 | 143.369995 | 1260.109985 |
2019-11-01 | 255.820007 | 135.529999 | 143.720001 | 1273.739990 |
1259 rows × 4 columns
volume = pd.DataFrame({ticker: data['Volume'] for ticker, data in all_data.items()})
volume
AAPL | IBM | MSFT | GOOG | |
---|---|---|---|---|
Date | ||||
2014-11-03 | 5.228260e+07 | 4688200.0 | 2.313040e+07 | 1382200.0 |
2014-11-04 | 4.157440e+07 | 4246900.0 | 2.153080e+07 | 1244200.0 |
2014-11-05 | 3.743590e+07 | 4104700.0 | 2.244960e+07 | 2032200.0 |
2014-11-06 | 3.496850e+07 | 4067600.0 | 3.303780e+07 | 1333200.0 |
2014-11-07 | 3.369150e+07 | 3494800.0 | 2.800060e+07 | 1633700.0 |
... | ... | ... | ... | ... |
2019-10-28 | 2.414320e+07 | 3225700.0 | 3.528010e+07 | 2613200.0 |
2019-10-29 | 3.566010e+07 | 4159600.0 | 2.051970e+07 | 1869200.0 |
2019-10-30 | 3.095060e+07 | 2252700.0 | 1.847170e+07 | 1407700.0 |
2019-10-31 | 3.476660e+09 | 341090000.0 | 2.459620e+09 | 145470000.0 |
2019-11-01 | 3.773870e+07 | 3088800.0 | 3.311920e+07 | 1669400.0 |
1259 rows × 4 columns
We can now compute the percent changes of the prices.
returns = price.pct_change()
returns
AAPL | IBM | MSFT | GOOG | |
---|---|---|---|---|
Date | ||||
2014-11-03 | NaN | NaN | NaN | NaN |
2014-11-04 | -0.007312 | -0.010404 | 0.002741 | -0.001999 |
2014-11-05 | 0.002394 | -0.005103 | 0.006096 | -0.014780 |
2014-11-06 | 0.002860 | 0.004604 | 0.017551 | -0.007107 |
2014-11-07 | 0.002852 | 0.003778 | -0.000410 | -0.001900 |
... | ... | ... | ... | ... |
2019-10-28 | 0.010017 | 0.003913 | 0.024586 | 0.019658 |
2019-10-29 | -0.023128 | -0.015812 | -0.009432 | -0.021225 |
2019-10-30 | -0.000123 | 0.010686 | 0.012462 | -0.001053 |
2019-10-31 | 0.022610 | -0.011238 | -0.008575 | -0.000936 |
2019-11-01 | 0.028381 | 0.013460 | 0.002441 | 0.010817 |
1259 rows × 4 columns
The corr and cov methods of a Series compute the correlation and covariance, respectively, of two Series, aligned by index.
The corr and cov methods of a DataFrame compute the full correlation and covariance matrices.
returns.MSFT.corr(returns.IBM)
0.4870311189559724
returns.corr()
AAPL | IBM | MSFT | GOOG | |
---|---|---|---|---|
AAPL | 1.000000 | 0.405805 | 0.570409 | 0.521556 |
IBM | 0.405805 | 1.000000 | 0.487031 | 0.413659 |
MSFT | 0.570409 | 0.487031 | 1.000000 | 0.655724 |
GOOG | 0.521556 | 0.413659 | 0.655724 | 1.000000 |
returns.cov()
AAPL | IBM | MSFT | GOOG | |
---|---|---|---|---|
AAPL | 0.000247 | 0.000083 | 0.000133 | 0.000125 |
IBM | 0.000083 | 0.000170 | 0.000094 | 0.000082 |
MSFT | 0.000133 | 0.000094 | 0.000219 | 0.000147 |
GOOG | 0.000125 | 0.000082 | 0.000147 | 0.000231 |
A DataFrame can also compute pairwise correlations against another Series or DataFrame using the corrwith method.
returns.corrwith(returns.IBM)
AAPL 0.405805
IBM 1.000000
MSFT 0.487031
GOOG 0.413659
dtype: float64
returns.corrwith(volume)
AAPL 0.016438
IBM -0.057802
MSFT -0.035033
GOOG -0.004697
dtype: float64
Finally, Series has methods for extracting unique values, counting occurrences, and testing membership.
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
uniques = obj.unique()
uniques
array(['c', 'a', 'd', 'b'], dtype=object)
obj.value_counts()
c 3
a 3
b 2
d 1
dtype: int64
mask = obj.isin(['b', 'c'])
mask
0 True
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
dtype: bool
to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
unique_vals = pd.Series(['c', 'b', 'a'])
pd.Index(unique_vals).get_indexer(to_match)
array([0, 2, 1, 1, 0, 2])
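The get_indexer call above maps each possibly repeated value in to_match to its integer position in the unique index. A related, commonly useful pattern (a sketch with a hypothetical DataFrame named survey) is computing value counts for every column at once with apply:
survey = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                       'Qu2': [2, 3, 1, 2, 3],
                       'Qu3': [1, 5, 2, 4, 4]})
# Each column's value counts become a column of the result; missing counts are filled with 0.
survey.apply(pd.value_counts).fillna(0)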