pandas is designed for working with tabular data using vectorized operations.
The two most common data structures in pandas are Series and DataFrame.
A Series is a one-dimensional, array-like object containing a sequence of values and an associated array of data labels, called its index.
The simplest Series can be created from an array.
import pandas as pd
import numpy as np
obj = pd.Series([4, 7, -5, 3])
obj
0 4
1 7
2 -5
3 3
dtype: int64
The index is shown to the left of each value. The values and the index can be accessed separately.
obj.values
array([ 4, 7, -5, 3])
obj.index
RangeIndex(start=0, stop=4, step=1)
The index can be specified.
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2
d 4
b 7
a -5
c 3
dtype: int64
obj2.index
Index(['d', 'b', 'a', 'c'], dtype='object')
The index can be used to select single or multiple values.
obj2['a']
-5
obj2['d'] = 6
obj2[['c', 'a', 'd']]
c 3
a -5
d 6
dtype: int64
Boolean indexing, scalar multiplication, and other mathematical operations can be used just as with NumPy arrays.
obj2[obj2 > 0]
d 6
b 7
c 3
dtype: int64
obj2 * 2
d 12
b 14
a -10
c 6
dtype: int64
np.exp(obj2)
d 403.428793
b 1096.633158
a 0.006738
c 20.085537
dtype: float64
A Series can be thought of as a fixed-length, ordered dictionary, and it can often be used in a similar fashion to a dictionary. A Series can also be created from a dictionary.
'b' in obj2
True
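As a further sketch of the dictionary analogy (using the obj2 defined above), a Series can be converted back to an ordinary dict, with the keys taken from its index.
obj2.to_dict()
# {'d': 6, 'b': 7, 'a': -5, 'c': 3}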
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3
Ohio 35000
Texas 71000
Oregon 16000
Utah 5000
dtype: int64
By default, the resulting Series is ordered by the index.
This can be overridden by passing an index when the Series is created.
Note that index labels without a corresponding value get assigned NaN,
and only values from the dictionary whose keys appear in the index are retained.
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
pd.isnull(obj4)
California True
Ohio False
Oregon False
Texas False
dtype: bool
obj4.isnull()
California True
Ohio False
Oregon False
Texas False
dtype: bool
Series can be added with the + operator, which automatically aligns the values by index.
obj3 + obj4
California NaN
Ohio 70000.0
Oregon 32000.0
Texas 142000.0
Utah NaN
dtype: float64
Both the Series object itself and its index have a name attribute.
obj4.name = 'population'
obj4.index.name = 'state'
obj4
state
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
Name: population, dtype: float64
A DataFrame is a rectangular table of data with an ordered collection of columns. It has both a row index and a column index.
A DataFrame can be constructed from a dictionary of NumPy arrays.
data = {
'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
}
frame = pd.DataFrame(data)
frame
state | year | pop | |
---|---|---|---|
0 | Ohio | 2000 | 1.5 |
1 | Ohio | 2001 | 1.7 |
2 | Ohio | 2002 | 3.6 |
3 | Nevada | 2001 | 2.4 |
4 | Nevada | 2002 | 2.9 |
5 | Nevada | 2003 | 3.2 |
frame.head()
state | year | pop | |
---|---|---|---|
0 | Ohio | 2000 | 1.5 |
1 | Ohio | 2001 | 1.7 |
2 | Ohio | 2002 | 3.6 |
3 | Nevada | 2001 | 2.4 |
4 | Nevada | 2002 | 2.9 |
The column order can be declared during creation.
pd.DataFrame(data, columns=['year', 'state', 'pop'])
year | state | pop | |
---|---|---|---|
0 | 2000 | Ohio | 1.5 |
1 | 2001 | Ohio | 1.7 |
2 | 2002 | Ohio | 3.6 |
3 | 2001 | Nevada | 2.4 |
4 | 2002 | Nevada | 2.9 |
5 | 2003 | Nevada | 3.2 |
Passing a column name that does not appear in the data is also a way to declare an empty column.
frame2 = pd.DataFrame(
data,
columns=['year', 'state', 'pop', 'debt'],
index=['one', 'two', 'three', 'four', 'five', 'six']
)
frame2
year | state | pop | debt | |
---|---|---|---|---|
one | 2000 | Ohio | 1.5 | NaN |
two | 2001 | Ohio | 1.7 | NaN |
three | 2002 | Ohio | 3.6 | NaN |
four | 2001 | Nevada | 2.4 | NaN |
five | 2002 | Nevada | 2.9 | NaN |
six | 2003 | Nevada | 3.2 | NaN |
frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')
frame2['state']
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
six Nevada
Name: state, dtype: object
frame2.year
one 2000
two 2001
three 2002
four 2001
five 2002
six 2003
Name: year, dtype: int64
type(frame2.year)
pandas.core.series.Series
type(frame2)
pandas.core.frame.DataFrame
A DataFrame can be indexed by column using either dictionary-like or attribute syntax.
Rows can be selected with the loc attribute by passing a row label.
frame2['state']
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
six Nevada
Name: state, dtype: object
frame2.year
one 2000
two 2001
three 2002
four 2001
five 2002
six 2003
Name: year, dtype: int64
frame2.loc['three']
year 2002
state Ohio
pop 3.6
debt NaN
Name: three, dtype: object
Columns can be modified by assignment.
frame2['debt'] = 16.5
frame2
year | state | pop | debt | |
---|---|---|---|---|
one | 2000 | Ohio | 1.5 | 16.5 |
two | 2001 | Ohio | 1.7 | 16.5 |
three | 2002 | Ohio | 3.6 | 16.5 |
four | 2001 | Nevada | 2.4 | 16.5 |
five | 2002 | Nevada | 2.9 | 16.5 |
six | 2003 | Nevada | 3.2 | 16.5 |
frame2['debt'] = np.arange(6.0)
frame2
year | state | pop | debt | |
---|---|---|---|---|
one | 2000 | Ohio | 1.5 | 0.0 |
two | 2001 | Ohio | 1.7 | 1.0 |
three | 2002 | Ohio | 3.6 | 2.0 |
four | 2001 | Nevada | 2.4 | 3.0 |
five | 2002 | Nevada | 2.9 | 4.0 |
six | 2003 | Nevada | 3.2 | 5.0 |
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2
year | state | pop | debt | |
---|---|---|---|---|
one | 2000 | Ohio | 1.5 | NaN |
two | 2001 | Ohio | 1.7 | -1.2 |
three | 2002 | Ohio | 3.6 | NaN |
four | 2001 | Nevada | 2.4 | -1.5 |
five | 2002 | Nevada | 2.9 | -1.7 |
six | 2003 | Nevada | 3.2 | NaN |
Columns can be easily created and deleted.
frame2['eastern'] = frame2.state == 'Ohio'
frame2
year | state | pop | debt | eastern | |
---|---|---|---|---|---|
one | 2000 | Ohio | 1.5 | NaN | True |
two | 2001 | Ohio | 1.7 | -1.2 | True |
three | 2002 | Ohio | 3.6 | NaN | True |
four | 2001 | Nevada | 2.4 | -1.5 | False |
five | 2002 | Nevada | 2.9 | -1.7 | False |
six | 2003 | Nevada | 3.2 | NaN | False |
del frame2['eastern']
frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')
DataFrames can also be created from nested dictionaries. The first level becomes the column index and the second the row index.
pop = {
'Nevada': {2001: 2.4, 2002: 2.9},
'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}
}
frame3 = pd.DataFrame(pop)
frame3
Nevada | Ohio | |
---|---|---|
2001 | 2.4 | 1.7 |
2002 | 2.9 | 3.6 |
2000 | NaN | 1.5 |
A DataFrame can be transposed.
frame3.T
2001 | 2002 | 2000 | |
---|---|---|---|
Nevada | 2.4 | 2.9 | NaN |
Ohio | 1.7 | 3.6 | 1.5 |
A DataFrame can be created from a dictionary of Series.
pdata = {
'Ohio': frame3['Ohio'][:-1],
'Nevada': frame3['Nevada'][:2]
}
pd.DataFrame(pdata)
Ohio | Nevada | |
---|---|---|
2001 | 1.7 | 2.4 |
2002 | 3.6 | 2.9 |
A DataFrame's index and columns have their own name attributes.
frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3
state | Nevada | Ohio |
---|---|---|
year | ||
2001 | 2.4 | 1.7 |
2002 | 2.9 | 3.6 |
2000 | NaN | 1.5 |
The values attribute returns the data as a 2-D ndarray.
frame3.values
array([[2.4, 1.7],
[2.9, 3.6],
[nan, 1.5]])
frame2.values
array([[2000, 'Ohio', 1.5, nan],
[2001, 'Ohio', 1.7, -1.2],
[2002, 'Ohio', 3.6, nan],
[2001, 'Nevada', 2.4, -1.5],
[2002, 'Nevada', 2.9, -1.7],
[2003, 'Nevada', 3.2, nan]], dtype=object)
Index objects hold the axis labels and other metadata. They are immutable.
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
Index(['a', 'b', 'c'], dtype='object')
index[1:]
Index(['b', 'c'], dtype='object')
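A minimal sketch of the immutability mentioned above: item assignment on an Index raises an error (the exact message may vary by pandas version).
try:
    index[1] = 'd'
except TypeError as err:
    print(err)  # e.g. 'Index does not support mutable operations'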
Index objects can be shared amongst data structures.
labels = pd.Index(np.arange(3))
labels
Int64Index([0, 1, 2], dtype='int64')
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2
0 1.5
1 -2.5
2 0.0
dtype: float64
obj2.index is labels
True
frame3.columns
Index(['Nevada', 'Ohio'], dtype='object', name='state')
'Ohio' in frame3.columns
True
2003 in frame3.index
False
This section discusses the fundamental interactions with Series and DataFrames.
The reindex method creates a new object with the data conformed to a new index.
obj = pd.Series([4.3, 7.2, -5.3, 3.6], index = ['d', 'b', 'a', 'c'])
obj
d 4.3
b 7.2
a -5.3
c 3.6
dtype: float64
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
a -5.3
b 7.2
c 3.6
d 4.3
e NaN
dtype: float64
The method argument describes how to handle missing data when reindexing. Here is an example with 'ffill', which performs a forward-fill.
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3
0 blue
2 purple
4 yellow
dtype: object
obj3.reindex(range(6), method='ffill')
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
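For comparison, a back-fill propagates values backward instead; a quick sketch with the same obj3:
# 'bfill' fills each missing label from the next existing value;
# label 5 has no later value, so it remains NaN.
obj3.reindex(range(6), method='bfill')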
With DataFrame, 'reindex' can alter either the row index, columns, or both.
frame = pd.DataFrame(
np.arange(9).reshape((3, 3)),
index=['a', 'c', 'd'],
columns=['Ohio', 'Texas', 'California']
)
frame
Ohio | Texas | California | |
---|---|---|---|
a | 0 | 1 | 2 |
c | 3 | 4 | 5 |
d | 6 | 7 | 8 |
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2
Ohio | Texas | California | |
---|---|---|---|
a | 0.0 | 1.0 | 2.0 |
b | NaN | NaN | NaN |
c | 3.0 | 4.0 | 5.0 |
d | 6.0 | 7.0 | 8.0 |
Columns can be reindexed with the columns keyword.
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)
Texas | Utah | California | |
---|---|---|---|
a | 1 | NaN | 2 |
c | 4 | NaN | 5 |
d | 7 | NaN | 8 |
Reindexing can be done more succinctly by label-indexing with loc.
frame.loc[['a', 'b', 'c', 'd'], states]
/opt/anaconda3/envs/daysOfCode-env/lib/python3.7/site-packages/pandas/core/indexing.py:1418: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
return self._getitem_tuple(key)
Texas | Utah | California | |
---|---|---|---|
a | 1.0 | NaN | 2.0 |
b | NaN | NaN | NaN |
c | 4.0 | NaN | 5.0 |
d | 7.0 | NaN | 8.0 |
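As the warning suggests, the same result can be obtained without the deprecated behaviour by calling reindex directly; a sketch:
# Reindex rows and columns in a single call instead of label-indexing with missing labels.
frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)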
The drop method returns a new object with the indicated values deleted from an axis.
obj = pd.Series(np.arange(5.0), index=['a', 'b', 'c', 'd', 'e'])
obj
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
new_obj = obj.drop('c')
new_obj
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
obj.drop(['d', 'c'])
a 0.0
b 1.0
e 4.0
dtype: float64
With DataFrame, index values can be deleted from either axis.
data = pd.DataFrame(
np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four']
)
data
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
data.drop(['Colorado', 'Ohio'])
one | two | three | four | |
---|---|---|---|---|
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
data.drop('two', axis=1)
one | three | four | |
---|---|---|---|
Ohio | 0 | 2 | 3 |
Colorado | 4 | 6 | 7 |
Utah | 8 | 10 | 11 |
New York | 12 | 14 | 15 |
data.drop(['two', 'four'], axis='columns')
one | three | |
---|---|---|
Ohio | 0 | 2 |
Colorado | 4 | 6 |
Utah | 8 | 10 |
New York | 12 | 14 |
There is an option to make the change in-place.
obj.drop('c', inplace=True)
obj
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
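The same keyword works for DataFrame.drop. Note that with inplace=True the method returns None and mutates the object, so the sketch below works on a copy (tmp is a hypothetical name).
tmp = data.copy()
tmp.drop('two', axis=1, inplace=True)  # returns None; tmp is modified in place
tmp.columns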
A Series can be indexed by integer position, boolean array, or index label. Note that slicing with labels is inclusive of the endpoint.
obj = pd.Series(np.arange(4.0), index=['a', 'b', 'c', 'd'])
obj
a 0.0
b 1.0
c 2.0
d 3.0
dtype: float64
obj['b']
1.0
obj[1]
1.0
obj[2:4]
c 2.0
d 3.0
dtype: float64
obj[['b', 'a', 'd']]
b 1.0
a 0.0
d 3.0
dtype: float64
obj[[1, 3]]
b 1.0
d 3.0
dtype: float64
obj[obj < 2]
a 0.0
b 1.0
dtype: float64
obj['b':'c']
b 1.0
c 2.0
dtype: float64
obj['b':'c'] = 5
obj
a 0.0
b 5.0
c 5.0
d 3.0
dtype: float64
Indexing a DataFrame can retrieve multiple columns.
data = pd.DataFrame(
np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four']
)
data
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
data['two']
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int64
data[['three', 'one']]
three | one | |
---|---|---|
Ohio | 2 | 0 |
Colorado | 6 | 4 |
Utah | 10 | 8 |
New York | 14 | 12 |
data[:2]
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
data < 5
one | two | three | four | |
---|---|---|---|---|
Ohio | True | True | True | True |
Colorado | True | False | False | False |
Utah | False | False | False | False |
New York | False | False | False | False |
data[data['three'] > 5]
one | two | three | four | |
---|---|---|---|---|
Colorado | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
data[data < 5] = 0
data
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 0 | 0 | 0 |
Colorado | 0 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
loc and iloc select subsets of a DataFrame's rows and columns: loc uses axis labels, while iloc uses integer positions.
data.loc['Colorado', ['two', 'three']]
two 5
three 6
Name: Colorado, dtype: int64
data.iloc[2, [3, 0, 1]]
four 11
one 8
two 9
Name: Utah, dtype: int64
data.iloc[2]
one 8
two 9
three 10
four 11
Name: Utah, dtype: int64
data.iloc[[1, 2], [3, 0, 1]]
four | one | two | |
---|---|---|---|
Colorado | 7 | 0 | 5 |
Utah | 11 | 8 | 9 |
data.loc[:'Utah', 'two']
Ohio 0
Colorado 5
Utah 9
Name: two, dtype: int64
data.iloc[:, :3][data.three > 5]
one | two | three | |
---|---|---|---|
Colorado | 0 | 5 | 6 |
Utah | 8 | 9 | 10 |
New York | 12 | 13 | 14 |
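One difference worth noting (a small sketch using the data frame above): slicing with loc includes the end label, while slicing with iloc follows the usual Python convention and excludes the end position.
data.loc['Ohio':'Utah', 'two']   # includes the 'Utah' row
data.iloc[0:2, 1]                # rows 0 and 1 only; position 2 is excluded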
When adding objects together, if the index pairs are not identical, the index of the result is the union of the two indexes.
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s1
a 7.3
c -2.5
d 3.4
e 1.5
dtype: float64
s2
a -2.1
c 3.6
e -1.5
f 4.0
g 3.1
dtype: float64
NaN is returned for labels that appear in only one of the Series being added.
s1 + s2
a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
dtype: float64
For DataFrame, alignment is performed on both rows and columns.
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)),
columns=list('bcd'),
index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)),
columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1
b | c | d | |
---|---|---|---|
Ohio | 0.0 | 1.0 | 2.0 |
Texas | 3.0 | 4.0 | 5.0 |
Colorado | 6.0 | 7.0 | 8.0 |
df2
b | d | e | |
---|---|---|---|
Utah | 0.0 | 1.0 | 2.0 |
Ohio | 3.0 | 4.0 | 5.0 |
Texas | 6.0 | 7.0 | 8.0 |
Oregon | 9.0 | 10.0 | 11.0 |
When they are added together, values are returned only for labels present in both DataFrames; everything else is NaN.
df1 + df2
b | c | d | e | |
---|---|---|---|---|
Colorado | NaN | NaN | NaN | NaN |
Ohio | 3.0 | NaN | 6.0 | NaN |
Oregon | NaN | NaN | NaN | NaN |
Texas | 9.0 | NaN | 12.0 | NaN |
Utah | NaN | NaN | NaN | NaN |
There is an option to fill missing data with a specific value to be used for the operation.
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
df1
a | b | c | d | |
---|---|---|---|---|
0 | 0.0 | 1.0 | 2.0 | 3.0 |
1 | 4.0 | 5.0 | 6.0 | 7.0 |
2 | 8.0 | 9.0 | 10.0 | 11.0 |
df2.loc[1, 'b'] = np.nan
df2
a | b | c | d | e | |
---|---|---|---|---|---|
0 | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 |
1 | 5.0 | NaN | 7.0 | 8.0 | 9.0 |
2 | 10.0 | 11.0 | 12.0 | 13.0 | 14.0 |
3 | 15.0 | 16.0 | 17.0 | 18.0 | 19.0 |
df1 + df2
a | b | c | d | e | |
---|---|---|---|---|---|
0 | 0.0 | 2.0 | 4.0 | 6.0 | NaN |
1 | 9.0 | NaN | 13.0 | 15.0 | NaN |
2 | 18.0 | 20.0 | 22.0 | 24.0 | NaN |
3 | NaN | NaN | NaN | NaN | NaN |
df1.add(df2, fill_value=0)
a | b | c | d | e | |
---|---|---|---|---|---|
0 | 0.0 | 2.0 | 4.0 | 6.0 | 4.0 |
1 | 9.0 | 5.0 | 13.0 | 15.0 | 9.0 |
2 | 18.0 | 20.0 | 22.0 | 24.0 | 14.0 |
3 | 15.0 | 16.0 | 17.0 | 18.0 | 19.0 |
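DataFrames also have flexible arithmetic methods (add, sub, mul, div) with reversed counterparts prefixed with r; a sketch of the equivalence, reusing df1 from above:
# These two expressions are equivalent; rdiv divides with the operands swapped.
1 / df1
df1.rdiv(1)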
There are defined behaviours for arithmetic between a DataFrame and a Series; they are closely related to, though slightly different from, broadcasting between 1-D and multidimensional ndarrays.
arr = np.arange(12.0).reshape((3, 4))
arr
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.]])
arr[0]
array([0., 1., 2., 3.])
The subtraction below is performed once per row (an example of broadcasting).
arr - arr[0]
array([[0., 0., 0., 0.],
[4., 4., 4., 4.],
[8., 8., 8., 8.]])
A similar mechanism is used for operations between a Series and a DataFrame.
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
b | d | e | |
---|---|---|---|
Utah | 0.0 | 1.0 | 2.0 |
Ohio | 3.0 | 4.0 | 5.0 |
Texas | 6.0 | 7.0 | 8.0 |
Oregon | 9.0 | 10.0 | 11.0 |
series = frame.iloc[0]
series
b 0.0
d 1.0
e 2.0
Name: Utah, dtype: float64
frame - series
b | d | e | |
---|---|---|---|
Utah | 0.0 | 0.0 | 0.0 |
Ohio | 3.0 | 3.0 | 3.0 |
Texas | 6.0 | 6.0 | 6.0 |
Oregon | 9.0 | 9.0 | 9.0 |
If an index value is not found in either the DataFrame's columns or the Series' index, the objects are reindexed to form the union.
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
series2
b 0
e 1
f 2
dtype: int64
frame + series2
b | d | e | f | |
---|---|---|---|---|
Utah | 0.0 | NaN | 3.0 | NaN |
Ohio | 3.0 | NaN | 6.0 | NaN |
Texas | 6.0 | NaN | 9.0 | NaN |
Oregon | 9.0 | NaN | 12.0 | NaN |
series3 = frame['d']
series3
Utah 1.0
Ohio 4.0
Texas 7.0
Oregon 10.0
Name: d, dtype: float64
frame.sub(series3)
Ohio | Oregon | Texas | Utah | b | d | e | |
---|---|---|---|---|---|---|---|
Utah | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Ohio | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Texas | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Oregon | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
frame.sub(series3, axis='index')
b | d | e | |
---|---|---|---|
Utah | -1.0 | 0.0 | 1.0 |
Ohio | -1.0 | 0.0 | 1.0 |
Texas | -1.0 | 0.0 | 1.0 |
Oregon | -1.0 | 0.0 | 1.0 |
DataFrames work well with ufuncs, too.
frame = pd.DataFrame(np.random.randn(4, 3),
columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
b | d | e | |
---|---|---|---|
Utah | -1.167574 | -1.923661 | -0.856915 |
Ohio | -0.712398 | 0.835224 | 1.783416 |
Texas | 0.283403 | 1.123811 | -1.542719 |
Oregon | -1.049275 | -0.444459 | -0.123116 |
np.abs(frame)
b | d | e | |
---|---|---|---|
Utah | 1.167574 | 1.923661 | 0.856915 |
Ohio | 0.712398 | 0.835224 | 1.783416 |
Texas | 0.283403 | 1.123811 | 1.542719 |
Oregon | 1.049275 | 0.444459 | 0.123116 |
Use the apply method to apply a function to the 1-D arrays formed from the columns or rows.
f = lambda x: x.max() - x.min()
frame.apply(f)
b 1.450977
d 3.047472
e 3.326136
dtype: float64
To operate on the rows, pass axis=1.
frame.apply(f, axis=1)
Utah 1.066746
Ohio 2.495815
Texas 2.666531
Oregon 0.926159
dtype: float64
def f(x):
return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)
b | d | e | |
---|---|---|---|
min | -1.167574 | -1.923661 | -1.542719 |
max | 0.283403 | 1.123811 | 1.783416 |
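Element-wise functions can be applied as well; a brief sketch using applymap on the DataFrame and map on a single column to format the floats as strings:
format_fn = lambda x: '%.2f' % x
frame.applymap(format_fn)   # element-wise over the whole DataFrame
frame['e'].map(format_fn)   # element-wise over one column (a Series)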
Use the sort_index method to sort a Series or DataFrame lexicographically by its index labels.
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()
a 1
b 2
c 3
d 0
dtype: int64
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
index=['three', 'one'],
columns=['d', 'a', 'b', 'c'])
frame.sort_index()
d | a | b | c | |
---|---|---|---|---|
one | 4 | 5 | 6 | 7 |
three | 0 | 1 | 2 | 3 |
frame.sort_index(axis=1)
a | b | c | d | |
---|---|---|---|---|
three | 1 | 2 | 3 | 0 |
one | 5 | 6 | 7 | 4 |
frame.sort_index(axis=1, ascending=False)
d | c | b | a | |
---|---|---|---|---|
three | 0 | 3 | 2 | 1 |
one | 4 | 7 | 6 | 5 |
Use the sort_values method to sort a Series by its values.
obj = pd.Series([4, 7, -3, -2])
obj.sort_values()
2 -3
3 -2
0 4
1 7
dtype: int64
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()
4 -3.0
5 2.0
0 4.0
2 7.0
1 NaN
3 NaN
dtype: float64
For a DataFrame, a column can be specified to use for sorting.
frame = pd.DataFrame({
'b': [4, 7, -3, 2],
'a': [0, 1, 0, 1]
})
frame.sort_values(by='b')
b | a | |
---|---|---|
2 | -3 | 0 |
3 | 2 | 1 |
0 | 4 | 0 |
1 | 7 | 1 |
frame.sort_values(by=['a', 'b'])
b | a | |
---|---|---|
2 | -3 | 0 |
0 | 4 | 0 |
3 | 2 | 1 |
1 | 7 | 1 |
Ranking assigns ranks from 1 through the number of valid data points.
There are a few different ways to handle ties, and they can be declared with the method argument.
obj = pd.Series([7, -5, 7, 4, 3, 0, 4])
obj.rank()
0 6.5
1 1.0
2 6.5
3 4.5
4 3.0
5 2.0
6 4.5
dtype: float64
obj.rank(method='first')
0 6.0
1 1.0
2 7.0
3 4.0
4 3.0
5 2.0
6 5.0
dtype: float64
obj.rank(pct=True)
0 0.928571
1 0.142857
2 0.928571
3 0.642857
4 0.428571
5 0.285714
6 0.642857
dtype: float64
obj.rank(ascending=False, method='max')
0 2.0
1 7.0
2 2.0
3 4.0
4 5.0
5 6.0
6 4.0
dtype: float64
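Another tie-handling option is 'dense', which is like 'min' but increases the rank by one between groups instead of skipping ranks; a sketch with the same obj:
obj.rank(method='dense')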
DataFrames can be ranked over the columns or rows.
frame = pd.DataFrame({
'b': [4.3, 7, -3, 2],
'a': [0, 1, 0, 1],
'c': [-2, 5, 8, -2.5]
}).sort_index(axis=1)
frame
a | b | c | |
---|---|---|---|
0 | 0 | 4.3 | -2.0 |
1 | 1 | 7.0 | 5.0 |
2 | 0 | -3.0 | 8.0 |
3 | 1 | 2.0 | -2.5 |
frame.rank()
a | b | c | |
---|---|---|---|
0 | 1.5 | 3.0 | 2.0 |
1 | 3.5 | 4.0 | 3.0 |
2 | 1.5 | 1.0 | 4.0 |
3 | 3.5 | 2.0 | 1.0 |
frame.rank(axis=1)
a | b | c | |
---|---|---|---|
0 | 2.0 | 3.0 | 1.0 |
1 | 1.0 | 3.0 | 2.0 |
2 | 2.0 | 1.0 | 3.0 |
3 | 2.0 | 3.0 | 1.0 |
Many pandas functions require the labels to be unique, but uniqueness is not mandatory for a Series or DataFrame. Here is a small example Series with non-unique labels.
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj
a 0
a 1
b 2
b 3
c 4
dtype: int64
obj.index.is_unique
False
obj.a
a 0
a 1
dtype: int64
obj.c
4
Here is an example with DataFrame.
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'b', 'b', 'b'])
df
0 | 1 | 2 | |
---|---|---|---|
a | 0.315786 | -0.868693 | -1.045036 |
b | 1.144670 | 0.902213 | -2.583696 |
b | -0.499635 | 0.883807 | -1.347935 |
b | -1.053965 | -0.075102 | 0.066315 |
df.loc['b']
0 | 1 | 2 | |
---|---|---|---|
b | 1.144670 | 0.902213 | -2.583696 |
b | -0.499635 | 0.883807 | -1.347935 |
b | -1.053965 | -0.075102 | 0.066315 |
Pandas has many methods for computing reductions and summarising the data in a DataFrame. Importantly, they naturally handle missing data.
df = pd.DataFrame([[1.4, np.nan],
[7.1, -4.5],
[np.nan, np.nan],
[0.75, -1.3]],
index=['a', 'b', 'c', 'd'],
columns=['one', 'two'])
df
one | two | |
---|---|---|
a | 1.40 | NaN |
b | 7.10 | -4.5 |
c | NaN | NaN |
d | 0.75 | -1.3 |
df.sum()
one 9.25
two -5.80
dtype: float64
df.sum(axis=1)
a 1.40
b 2.60
c 0.00
d -0.55
dtype: float64
df.sum(axis=1, skipna=False)
a NaN
b 2.60
c NaN
d -0.55
dtype: float64
There are also methods for getting the index labels of particular values, such as idxmax for the maximum.
df.idxmax()
one b
two d
dtype: object
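The companion method idxmin returns the label of the minimum instead; a quick sketch with the same df:
df.idxmin()
# one    d
# two    b
# dtype: object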
df.cumsum()
one | two | |
---|---|---|
a | 1.40 | NaN |
b | 8.50 | -4.5 |
c | NaN | NaN |
d | 9.25 | -5.8 |
The describe method is useful for getting a high-level overview of the data in a DataFrame.
df.describe()
one | two | |
---|---|---|
count | 3.000000 | 2.000000 |
mean | 3.083333 | -2.900000 |
std | 3.493685 | 2.262742 |
min | 0.750000 | -4.500000 |
25% | 1.075000 | -3.700000 |
50% | 1.400000 | -2.900000 |
75% | 4.250000 | -2.100000 |
max | 7.100000 | -1.300000 |
For non-numeric data, describe produces alternative summary statistics.
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj
0 a
1 a
2 b
3 c
4 a
5 a
6 b
7 c
8 a
9 a
10 b
11 c
12 a
13 a
14 b
15 c
dtype: object
obj.describe()
count     16
unique     3
top        a
freq       8
dtype: object
Below are examples of some of the more common summary statistics.
df = pd.DataFrame(np.random.randn(100).reshape((20, 5)))
df
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | -1.045266 | 1.282355 | 2.090783 | -0.252437 | 1.040161 |
1 | 0.150544 | -0.147333 | 0.320805 | 1.567862 | -0.174218 |
2 | 1.166097 | -1.042703 | -0.079506 | -0.118705 | 0.123881 |
3 | -0.043958 | 2.006066 | 0.471954 | 0.686379 | 0.540834 |
4 | 1.244186 | -0.874540 | -1.410670 | -1.593762 | 0.318683 |
5 | -0.071923 | 0.148731 | -0.456628 | -1.155017 | 0.624898 |
6 | -0.083861 | 0.218961 | 2.650821 | -0.113464 | -0.422801 |
7 | -0.289215 | 3.006028 | -2.028504 | 0.290864 | 0.984765 |
8 | 0.866779 | 1.039549 | 0.772484 | -1.578464 | -0.247977 |
9 | -0.079468 | 0.327789 | 0.740373 | 0.268327 | -1.121380 |
10 | -0.728884 | -1.385715 | 1.322314 | -0.502122 | 0.456615 |
11 | 0.460693 | 2.520307 | -0.929717 | 1.136553 | 1.818280 |
12 | -0.226714 | -0.790623 | 0.940863 | -0.279578 | 0.408899 |
13 | -2.252070 | -0.513878 | -1.115919 | -1.057158 | -0.112186 |
14 | 2.193616 | 1.330908 | 0.217283 | 0.489868 | -0.011269 |
15 | 1.823363 | 1.266734 | -1.286281 | 0.673735 | -0.398431 |
16 | 0.780628 | 1.615514 | 0.525163 | -0.062664 | 0.088914 |
17 | 2.037299 | 1.412365 | -0.284433 | 0.963503 | 0.373894 |
18 | -0.214364 | 2.878076 | -0.450742 | -0.346410 | 0.520312 |
19 | -0.610795 | -1.430337 | 0.121110 | 0.483379 | 2.001245 |
df.count() # number of non-NA values
0 20
1 20
2 20
3 20
4 20
dtype: int64
df.mean()
0 0.253834
1 0.643413
2 0.106578
3 -0.024966
4 0.340656
dtype: float64
df.median()
0 -0.057940
1 0.683669
2 0.169197
3 -0.088064
4 0.346288
dtype: float64
df.mad()
0 0.854199
1 1.192377
2 0.900160
3 0.678516
4 0.536334
dtype: float64
df.prod()
0 -1.721891e-07
1 -2.590494e-01
2 -1.466648e-04
3 -4.670023e-07
4 -1.657156e-09
dtype: float64
df.std()
0 1.094775
1 1.396016
2 1.169365
3 0.866003
4 0.737545
dtype: float64
df.var()
0 1.198532
1 1.948861
2 1.367415
3 0.749961
4 0.543972
dtype: float64
df.cumsum()
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | -1.045266 | 1.282355 | 2.090783 | -0.252437 | 1.040161 |
1 | -0.894723 | 1.135022 | 2.411587 | 1.315425 | 0.865944 |
2 | 0.271374 | 0.092320 | 2.332081 | 1.196720 | 0.989825 |
3 | 0.227416 | 2.098386 | 2.804035 | 1.883099 | 1.530659 |
4 | 1.471602 | 1.223846 | 1.393365 | 0.289337 | 1.849342 |
5 | 1.399679 | 1.372576 | 0.936738 | -0.865680 | 2.474240 |
6 | 1.315818 | 1.591538 | 3.587559 | -0.979144 | 2.051439 |
7 | 1.026603 | 4.597565 | 1.559055 | -0.688280 | 3.036203 |
8 | 1.893382 | 5.637114 | 2.331539 | -2.266744 | 2.788226 |
9 | 1.813915 | 5.964904 | 3.071912 | -1.998416 | 1.666846 |
10 | 1.085031 | 4.579189 | 4.394226 | -2.500538 | 2.123461 |
11 | 1.545723 | 7.099496 | 3.464508 | -1.363985 | 3.941741 |
12 | 1.319009 | 6.308873 | 4.405372 | -1.643564 | 4.350640 |
13 | -0.933061 | 5.794995 | 3.289452 | -2.700722 | 4.238454 |
14 | 1.260555 | 7.125903 | 3.506736 | -2.210854 | 4.227184 |
15 | 3.083918 | 8.392638 | 2.220455 | -1.537119 | 3.828753 |
16 | 3.864546 | 10.008152 | 2.745618 | -1.599783 | 3.917667 |
17 | 5.901845 | 11.420517 | 2.461186 | -0.636280 | 4.291561 |
18 | 5.687480 | 14.298593 | 2.010443 | -0.982690 | 4.811873 |
19 | 5.076686 | 12.868256 | 2.131554 | -0.499311 | 6.813118 |
df.cumprod()
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | -1.045266e+00 | 1.282355 | 2.090783 | -2.524371e-01 | 1.040161e+00 |
1 | -1.573583e-01 | -0.188933 | 0.670733 | -3.957865e-01 | -1.812143e-01 |
2 | -1.834949e-01 | 0.197001 | -0.053328 | 4.698193e-02 | -2.244901e-02 |
3 | 8.066110e-03 | 0.395197 | -0.025168 | 3.224742e-02 | -1.214120e-02 |
4 | 1.003574e-02 | -0.345616 | 0.035504 | -5.139471e-02 | -3.869194e-03 |
5 | -7.217967e-04 | -0.051404 | -0.016212 | 5.936175e-02 | -2.417850e-03 |
6 | 6.053054e-05 | -0.011255 | -0.042975 | -6.735445e-03 | 1.022269e-03 |
7 | -1.750635e-05 | -0.033834 | 0.087176 | -1.959101e-03 | 1.006695e-03 |
8 | -1.517415e-05 | -0.035172 | 0.067342 | 3.092369e-03 | -2.496376e-04 |
9 | 1.205853e-06 | -0.011529 | 0.049858 | 8.297670e-04 | 2.799386e-04 |
10 | -8.789272e-07 | 0.015976 | 0.065928 | -4.166443e-04 | 1.278243e-04 |
11 | -4.049152e-07 | 0.040264 | -0.061294 | -4.735383e-04 | 2.324203e-04 |
12 | 9.180015e-08 | -0.031834 | -0.057670 | 1.323911e-04 | 9.503635e-05 |
13 | -2.067404e-07 | 0.016359 | 0.064355 | -1.399583e-04 | -1.066178e-05 |
14 | -4.535091e-07 | 0.021772 | 0.013983 | -6.856113e-05 | 1.201514e-07 |
15 | -8.269117e-07 | 0.027579 | -0.017986 | -4.619204e-05 | -4.787200e-08 |
16 | -6.455103e-07 | 0.044555 | -0.009446 | 2.894595e-06 | -4.256488e-09 |
17 | -1.315097e-06 | 0.062928 | 0.002687 | 2.788951e-06 | -1.591474e-09 |
18 | 2.819100e-07 | 0.181111 | -0.001211 | -9.661200e-07 | -8.280626e-10 |
19 | -1.721891e-07 | -0.259049 | -0.000147 | -4.670023e-07 | -1.657156e-09 |
Here, the author showed an example of using pandas to compute correlation and covariance measures between stocks. The first step was to download the data from Yahoo! Finance using pandas-datareader.
import pandas_datareader.data as web
stock_names = ['AAPL', 'IBM', 'MSFT', 'GOOG']
all_data = {ticker: web.get_data_yahoo(ticker) for ticker in stock_names}
all_data
{'AAPL': High Low Open Close Volume \
Date
2014-11-03 110.300003 108.010002 108.220001 109.400002 5.228260e+07
2014-11-04 109.489998 107.720001 109.360001 108.599998 4.157440e+07
2014-11-05 109.300003 108.129997 109.099998 108.860001 3.743590e+07
2014-11-06 108.790001 107.800003 108.599998 108.699997 3.496850e+07
2014-11-07 109.320000 108.550003 108.750000 109.010002 3.369150e+07
... ... ... ... ... ...
2019-10-28 249.250000 246.720001 247.419998 249.050003 2.414320e+07
2019-10-29 249.750000 242.570007 248.970001 243.289993 3.566010e+07
2019-10-30 245.300003 241.210007 244.759995 243.259995 3.095060e+07
2019-10-31 249.169998 237.259995 247.240005 248.759995 3.476660e+09
2019-11-01 255.929993 249.160004 249.539993 255.820007 3.773870e+07
Adj Close
Date
2014-11-03 100.385109
2014-11-04 99.651047
2014-11-05 99.889618
2014-11-06 100.175293
2014-11-07 100.460983
... ...
2019-10-28 249.050003
2019-10-29 243.289993
2019-10-30 243.259995
2019-10-31 248.759995
2019-11-01 255.820007
[1259 rows x 6 columns],
'IBM': High Low Open Close Volume \
Date
2014-11-03 164.539993 163.380005 164.250000 164.360001 4688200.0
2014-11-04 164.360001 162.240005 164.339996 162.649994 4246900.0
2014-11-05 163.539993 161.559998 163.130005 161.820007 4104700.0
2014-11-06 161.529999 160.050003 161.279999 161.460007 4067600.0
2014-11-07 162.210007 160.850006 161.419998 162.070007 3494800.0
... ... ... ... ... ...
2019-10-28 136.630005 135.449997 136.000000 135.970001 3225700.0
2019-10-29 135.570007 133.440002 135.419998 133.820007 4159600.0
2019-10-30 135.279999 133.199997 133.830002 135.250000 2252700.0
2019-10-31 135.250000 133.229996 135.110001 133.729996 341090000.0
2019-11-01 135.559998 134.089996 134.500000 135.529999 3088800.0
Adj Close
Date
2014-11-03 133.097809
2014-11-04 131.713089
2014-11-05 131.040939
2014-11-06 131.644287
2014-11-07 132.141617
... ...
2019-10-28 135.970001
2019-10-29 133.820007
2019-10-30 135.250000
2019-10-31 133.729996
2019-11-01 135.529999
[1259 rows x 6 columns],
'MSFT': High Low Open Close Volume \
Date
2014-11-03 47.459999 46.730000 46.889999 47.439999 2.313040e+07
2014-11-04 47.730000 47.250000 47.299999 47.570000 2.153080e+07
2014-11-05 47.900002 47.259998 47.799999 47.860001 2.244960e+07
2014-11-06 48.860001 47.790001 47.860001 48.700001 3.303780e+07
2014-11-07 48.919998 48.290001 48.919998 48.680000 2.800060e+07
... ... ... ... ... ...
2019-10-28 145.669998 143.509995 144.399994 144.190002 3.528010e+07
2019-10-29 144.500000 142.649994 144.080002 142.830002 2.051970e+07
2019-10-30 145.000000 142.789993 143.520004 144.610001 1.847170e+07
2019-10-31 144.929993 142.990005 144.899994 143.369995 2.459620e+09
2019-11-01 144.419998 142.970001 144.259995 143.720001 3.311920e+07
Adj Close
Date
2014-11-03 42.466732
2014-11-04 42.583115
2014-11-05 42.842709
2014-11-06 43.594646
2014-11-07 43.576752
... ...
2019-10-28 144.190002
2019-10-29 142.830002
2019-10-30 144.610001
2019-10-31 143.369995
2019-11-01 143.720001
[1259 rows x 6 columns],
'GOOG': High Low Open Close Volume \
Date
2014-11-03 556.372498 551.715271 553.979065 553.699829 1382200.0
2014-11-04 553.979065 547.796021 551.485901 552.592834 1244200.0
2014-11-05 555.275513 542.560425 555.275513 544.425293 2032200.0
2014-11-06 545.387634 539.488831 544.006409 540.555908 1333200.0
2014-11-07 544.714478 537.195129 544.714478 539.528748 1633700.0
... ... ... ... ... ...
2019-10-28 1299.310059 1272.540039 1275.449951 1290.000000 2613200.0
2019-10-29 1281.589966 1257.212036 1276.229980 1262.619995 1869200.0
2019-10-30 1269.359985 1252.000000 1252.969971 1261.290039 1407700.0
2019-10-31 1267.670044 1250.843018 1261.280029 1260.109985 145470000.0
2019-11-01 1274.619995 1260.500000 1265.000000 1273.739990 1669400.0
Adj Close
Date
2014-11-03 553.699829
2014-11-04 552.592834
2014-11-05 544.425293
2014-11-06 540.555908
2014-11-07 539.528748
... ...
2019-10-28 1290.000000
2019-10-29 1262.619995
2019-10-30 1261.290039
2019-10-31 1260.109985
2019-11-01 1273.739990
[1259 rows x 6 columns]}
price = pd.DataFrame({ticker: data['Adj Close'] for ticker, data in all_data.items()})
price
AAPL | IBM | MSFT | GOOG | |
---|---|---|---|---|
Date | ||||
2014-11-03 | 100.385109 | 133.097809 | 42.466732 | 553.699829 |
2014-11-04 | 99.651047 | 131.713089 | 42.583115 | 552.592834 |
2014-11-05 | 99.889618 | 131.040939 | 42.842709 | 544.425293 |
2014-11-06 | 100.175293 | 131.644287 | 43.594646 | 540.555908 |
2014-11-07 | 100.460983 | 132.141617 | 43.576752 | 539.528748 |
... | ... | ... | ... | ... |
2019-10-28 | 249.050003 | 135.970001 | 144.190002 | 1290.000000 |
2019-10-29 | 243.289993 | 133.820007 | 142.830002 | 1262.619995 |
2019-10-30 | 243.259995 | 135.250000 | 144.610001 | 1261.290039 |
2019-10-31 | 248.759995 | 133.729996 | 143.369995 | 1260.109985 |
2019-11-01 | 255.820007 | 135.529999 | 143.720001 | 1273.739990 |
1259 rows × 4 columns
volume = pd.DataFrame({ticker: data['Volume'] for ticker, data in all_data.items()})
volume
AAPL | IBM | MSFT | GOOG | |
---|---|---|---|---|
Date | ||||
2014-11-03 | 5.228260e+07 | 4688200.0 | 2.313040e+07 | 1382200.0 |
2014-11-04 | 4.157440e+07 | 4246900.0 | 2.153080e+07 | 1244200.0 |
2014-11-05 | 3.743590e+07 | 4104700.0 | 2.244960e+07 | 2032200.0 |
2014-11-06 | 3.496850e+07 | 4067600.0 | 3.303780e+07 | 1333200.0 |
2014-11-07 | 3.369150e+07 | 3494800.0 | 2.800060e+07 | 1633700.0 |
... | ... | ... | ... | ... |
2019-10-28 | 2.414320e+07 | 3225700.0 | 3.528010e+07 | 2613200.0 |
2019-10-29 | 3.566010e+07 | 4159600.0 | 2.051970e+07 | 1869200.0 |
2019-10-30 | 3.095060e+07 | 2252700.0 | 1.847170e+07 | 1407700.0 |
2019-10-31 | 3.476660e+09 | 341090000.0 | 2.459620e+09 | 145470000.0 |
2019-11-01 | 3.773870e+07 | 3088800.0 | 3.311920e+07 | 1669400.0 |
1259 rows × 4 columns
We can now compute the percent changes of the prices.
returns = price.pct_change()
returns
AAPL | IBM | MSFT | GOOG | |
---|---|---|---|---|
Date | ||||
2014-11-03 | NaN | NaN | NaN | NaN |
2014-11-04 | -0.007312 | -0.010404 | 0.002741 | -0.001999 |
2014-11-05 | 0.002394 | -0.005103 | 0.006096 | -0.014780 |
2014-11-06 | 0.002860 | 0.004604 | 0.017551 | -0.007107 |
2014-11-07 | 0.002852 | 0.003778 | -0.000410 | -0.001900 |
... | ... | ... | ... | ... |
2019-10-28 | 0.010017 | 0.003913 | 0.024586 | 0.019658 |
2019-10-29 | -0.023128 | -0.015812 | -0.009432 | -0.021225 |
2019-10-30 | -0.000123 | 0.010686 | 0.012462 | -0.001053 |
2019-10-31 | 0.022610 | -0.011238 | -0.008575 | -0.000936 |
2019-11-01 | 0.028381 | 0.013460 | 0.002441 | 0.010817 |
1259 rows × 4 columns
The corr and cov methods of a Series compute the correlation and covariance, respectively, of two Series, aligned by index.
The corr and cov methods of a DataFrame compute the full correlation and covariance matrices.
returns.MSFT.corr(returns.IBM)
0.4870311189559724
returns.corr()
AAPL | IBM | MSFT | GOOG | |
---|---|---|---|---|
AAPL | 1.000000 | 0.405805 | 0.570409 | 0.521556 |
IBM | 0.405805 | 1.000000 | 0.487031 | 0.413659 |
MSFT | 0.570409 | 0.487031 | 1.000000 | 0.655724 |
GOOG | 0.521556 | 0.413659 | 0.655724 | 1.000000 |
returns.cov()
AAPL | IBM | MSFT | GOOG | |
---|---|---|---|---|
AAPL | 0.000247 | 0.000083 | 0.000133 | 0.000125 |
IBM | 0.000083 | 0.000170 | 0.000094 | 0.000082 |
MSFT | 0.000133 | 0.000094 | 0.000219 | 0.000147 |
GOOG | 0.000125 | 0.000082 | 0.000147 | 0.000231 |
A DataFrame can also compute pairwise correlations against another Series or DataFrame using the corrwith method.
returns.corrwith(returns.IBM)
AAPL 0.405805
IBM 1.000000
MSFT 0.487031
GOOG 0.413659
dtype: float64
returns.corrwith(volume)
AAPL 0.016438
IBM -0.057802
MSFT -0.035033
GOOG -0.004697
dtype: float64
Finally, Series has methods for extracting unique values, counting occurrences, and testing membership.
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
uniques = obj.unique()
uniques
array(['c', 'a', 'd', 'b'], dtype=object)
obj.value_counts()
c 3
a 3
b 2
d 1
dtype: int64
mask = obj.isin(['b', 'c'])
mask
0 True
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
dtype: bool
to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
unique_vals = pd.Series(['c', 'b', 'a'])
pd.Index(unique_vals).get_indexer(to_match)
array([0, 2, 1, 1, 0, 2])
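The get_indexer call above maps each possibly repeated value in to_match to its integer position in the unique index. A related, commonly useful pattern (a sketch with a hypothetical DataFrame named survey) is computing value counts for every column at once with apply:
survey = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                       'Qu2': [2, 3, 1, 2, 3],
                       'Qu3': [1, 5, 2, 4, 4]})
# Each column's value counts become a column of the result; missing counts are filled with 0.
survey.apply(pd.value_counts).fillna(0)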