In this chapter, the author reviewed tools for handling missing data, duplicate data, string manipulation, and a few other common practices used in data preparation.
Missing data is ignored by default when computing summary and descriptive statistics on a DataFrame or Series.
For numeric data, the floating-point value NaN ("not a number") is used.
import pandas as pd
import numpy as np
np.random.seed(0)
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object
string_data.isnull()
0 False
1 False
2 True
3 False
dtype: bool
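A small addition (not in the original notes): notnull() is the boolean complement of isnull().

string_data.notnull()
0 True
1 True
2 False
3 True
dtype: bool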
The dropna() method removes NaN values from a Series.
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data
0 1.0
1 NaN
2 3.5
3 NaN
4 7.0
dtype: float64
data.dropna()
0 1.0
2 3.5
4 7.0
dtype: float64
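Two quick added checks: the NaN-skipping behavior mentioned at the start of the chapter is easy to verify on this Series, and dropna() is equivalent to boolean indexing with notnull().

data.sum()  # 11.5 -- the NaN values are skipped
data.mean(skipna=False)  # nan -- unless skipping is explicitly disabled
data[data.notnull()]  # same result as data.dropna()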
On a DataFrame, dropna() by default removes any row containing a NaN value.
data = pd.DataFrame([[1., 6.5, 3.],
                     [1., NA, NA],
                     [NA, NA, NA],
                     [NA, 6.5, 3.]])
data
| | 0 | 1 | 2 |
|---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
1 | 1.0 | NaN | NaN |
2 | NaN | NaN | NaN |
3 | NaN | 6.5 | 3.0 |
cleaned = data.dropna()
cleaned
| | 0 | 1 | 2 |
|---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
However, passing how='all' will remove only the rows in which every value is NaN.
data.dropna(how='all')
| | 0 | 1 | 2 |
|---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
1 | 1.0 | NaN | NaN |
3 | NaN | 6.5 | 3.0 |
Alternatively, columns made up entirely of NaN values can be dropped by specifying the axis.
data[4] = NA
data
| | 0 | 1 | 2 | 4 |
|---|---|---|---|---|
0 | 1.0 | 6.5 | 3.0 | NaN |
1 | 1.0 | NaN | NaN | NaN |
2 | NaN | NaN | NaN | NaN |
3 | NaN | 6.5 | 3.0 | NaN |
data.dropna(axis=1, how='all')
| | 0 | 1 | 2 |
|---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
1 | 1.0 | NaN | NaN |
2 | NaN | NaN | NaN |
3 | NaN | 6.5 | 3.0 |
You can also keep only the rows that have at least a certain number of non-missing values, using the thresh argument.
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df
| | 0 | 1 | 2 |
|---|---|---|---|
0 | 1.764052 | NaN | NaN |
1 | 2.240893 | NaN | NaN |
2 | 0.950088 | NaN | -0.103219 |
3 | 0.410599 | NaN | 1.454274 |
4 | 0.761038 | 0.121675 | 0.443863 |
5 | 0.333674 | 1.494079 | -0.205158 |
6 | 0.313068 | -0.854096 | -2.552990 |
df.dropna()
| | 0 | 1 | 2 |
|---|---|---|---|
4 | 0.761038 | 0.121675 | 0.443863 |
5 | 0.333674 | 1.494079 | -0.205158 |
6 | 0.313068 | -0.854096 | -2.552990 |
df.dropna(thresh=2)  # keep only rows with at least two non-NaN values
| | 0 | 1 | 2 |
|---|---|---|---|
2 | 0.950088 | NaN | -0.103219 |
3 | 0.410599 | NaN | 1.454274 |
4 | 0.761038 | 0.121675 | 0.443863 |
5 | 0.333674 | 1.494079 | -0.205158 |
6 | 0.313068 | -0.854096 | -2.552990 |
The fillna() method fills in missing data, replacing each NaN with a specified value.
df.fillna(0)
| | 0 | 1 | 2 |
|---|---|---|---|
0 | 1.764052 | 0.000000 | 0.000000 |
1 | 2.240893 | 0.000000 | 0.000000 |
2 | 0.950088 | 0.000000 | -0.103219 |
3 | 0.410599 | 0.000000 | 1.454274 |
4 | 0.761038 | 0.121675 | 0.443863 |
5 | 0.333674 | 1.494079 | -0.205158 |
6 | 0.313068 | -0.854096 | -2.552990 |
A dictionary can be used to fill specific columns with specific values.
df.fillna({1: 0.5, 2: 0})
| | 0 | 1 | 2 |
|---|---|---|---|
0 | 1.764052 | 0.500000 | 0.000000 |
1 | 2.240893 | 0.500000 | 0.000000 |
2 | 0.950088 | 0.500000 | -0.103219 |
3 | 0.410599 | 0.500000 | 1.454274 |
4 | 0.761038 | 0.121675 | 0.443863 |
5 | 0.333674 | 1.494079 | -0.205158 |
6 | 0.313068 | -0.854096 | -2.552990 |
df.fillna(0, inplace=True)  # modifies df in place and returns None
df
| | 0 | 1 | 2 |
|---|---|---|---|
0 | 1.764052 | 0.000000 | 0.000000 |
1 | 2.240893 | 0.000000 | 0.000000 |
2 | 0.950088 | 0.000000 | -0.103219 |
3 | 0.410599 | 0.000000 | 1.454274 |
4 | 0.761038 | 0.121675 | 0.443863 |
5 | 0.333674 | 1.494079 | -0.205158 |
6 | 0.313068 | -0.854096 | -2.552990 |
The same interpolation methods available for reindexing, such as forward fill, can be used with fillna().
df = pd.DataFrame(np.random.rand(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df
| | 0 | 1 | 2 |
|---|---|---|---|
0 | 0.264556 | 0.774234 | 0.456150 |
1 | 0.568434 | 0.018790 | 0.617635 |
2 | 0.612096 | NaN | 0.943748 |
3 | 0.681820 | NaN | 0.437032 |
4 | 0.697631 | NaN | NaN |
5 | 0.670638 | NaN | NaN |
df.fillna(method='ffill') # 'ffill' = 'forward fill'
| | 0 | 1 | 2 |
|---|---|---|---|
0 | 0.264556 | 0.774234 | 0.456150 |
1 | 0.568434 | 0.018790 | 0.617635 |
2 | 0.612096 | 0.018790 | 0.943748 |
3 | 0.681820 | 0.018790 | 0.437032 |
4 | 0.697631 | 0.018790 | 0.437032 |
5 | 0.670638 | 0.018790 | 0.437032 |
df.fillna(method='ffill', limit=2)
| | 0 | 1 | 2 |
|---|---|---|---|
0 | 0.264556 | 0.774234 | 0.456150 |
1 | 0.568434 | 0.018790 | 0.617635 |
2 | 0.612096 | 0.018790 | 0.943748 |
3 | 0.681820 | 0.018790 | 0.437032 |
4 | 0.697631 | NaN | 0.437032 |
5 | 0.670638 | NaN | 0.437032 |
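Another handy pattern (an addition, not from the original notes): fillna() also accepts a Series, so each column's missing values can be filled with a statistic of that column, such as its mean.

df.fillna(df.mean())  # fill each column's NaN with that column's mean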
Duplicate rows can be found using the duplicated() method on a DataFrame.
data = pd.DataFrame({
    'k1': ['one', 'two'] * 3 + ['two'],
    'k2': [1, 1, 2, 3, 3, 4, 4]
})
data
| | k1 | k2 |
|---|---|---|
0 | one | 1 |
1 | two | 1 |
2 | one | 2 |
3 | two | 3 |
4 | one | 3 |
5 | two | 4 |
6 | two | 4 |
data.duplicated()
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
data.drop_duplicates()
| | k1 | k2 |
|---|---|---|
0 | one | 1 |
1 | two | 1 |
2 | one | 2 |
3 | two | 3 |
4 | one | 3 |
5 | two | 4 |
It is also possible to specify a subset of columns on which to detect duplicates.
data['v1'] = range(7)
data
| | k1 | k2 | v1 |
|---|---|---|---|
0 | one | 1 | 0 |
1 | two | 1 | 1 |
2 | one | 2 | 2 |
3 | two | 3 | 3 |
4 | one | 3 | 4 |
5 | two | 4 | 5 |
6 | two | 4 | 6 |
data.drop_duplicates(['k1'])
| | k1 | k2 | v1 |
|---|---|---|---|
0 | one | 1 | 0 |
1 | two | 1 | 1 |
By default, the first occurrence of each duplicated combination is kept, though passing keep='last' keeps the last one instead.
data.drop_duplicates(['k1', 'k2'], keep='last')
| | k1 | k2 | v1 |
|---|---|---|---|
0 | one | 1 | 0 |
1 | two | 1 | 1 |
2 | one | 2 | 2 |
3 | two | 3 | 3 |
4 | one | 3 | 4 |
6 | two | 4 | 6 |
Often you will want to perform a transformation on an array, Series, or column of a DataFrame.
data = pd.DataFrame({
    'food': ['bacon', 'pulled pork', 'bacon',
             'Pastrami', 'corned beef', 'Bacon',
             'pastrami', 'honey ham', 'nova lox'],
    'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]
})
data
| | food | ounces |
|---|---|---|
0 | bacon | 4.0 |
1 | pulled pork | 3.0 |
2 | bacon | 12.0 |
3 | Pastrami | 6.0 |
4 | corned beef | 7.5 |
5 | Bacon | 8.0 |
6 | pastrami | 3.0 |
7 | honey ham | 5.0 |
8 | nova lox | 6.0 |
We can add a column to indicate the source animal for each food. The map method on a Series accepts a function or a dictionary to apply to each element.
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}
data['animal'] = data['food'].str.lower().map(meat_to_animal)
data
| | food | ounces | animal |
|---|---|---|---|
0 | bacon | 4.0 | pig |
1 | pulled pork | 3.0 | pig |
2 | bacon | 12.0 | pig |
3 | Pastrami | 6.0 | cow |
4 | corned beef | 7.5 | cow |
5 | Bacon | 8.0 | pig |
6 | pastrami | 3.0 | cow |
7 | honey ham | 5.0 | pig |
8 | nova lox | 6.0 | salmon |
Alternatively, we could have passed a function.
data['food'].map(lambda x: meat_to_animal[x.lower()])
0 pig
1 pig
2 pig
3 cow
4 cow
5 pig
6 cow
7 pig
8 salmon
Name: food, dtype: object
The replace(to_replace=x, value=y) method replaces x with y in a Series.
data = pd.Series([1.0, -999.0, 2.0, -999.0, -1000.0, 3.0])
data.replace(-999, np.nan)
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
data.replace([-999, -1000], np.nan)
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
dtype: float64
data.replace([-999, -1000], [np.nan, 0])
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
data.replace({-999: np.nan, -1000: 0})
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
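Worth noting (an added caution): Series.replace is distinct from Series.str.replace, which performs element-wise substring substitution. A minimal sketch:

s = pd.Series(['cat', 'hat'])
s.str.replace('at', 'ow', regex=False)  # replaces the substring inside each value
0 cow
1 how
dtype: object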
Axis labels can be transformed by a function or mapping, too. This can be accomplished in-place.
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
| | one | two | three | four |
|---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
New York | 8 | 9 | 10 | 11 |
transform = lambda x: x[:4].upper()
data.index.map(transform)
Index(['OHIO', 'COLO', 'NEW '], dtype='object')
data.index = data.index.map(transform)
data
| | one | two | three | four |
|---|---|---|---|---|
OHIO | 0 | 1 | 2 | 3 |
COLO | 4 | 5 | 6 | 7 |
NEW | 8 | 9 | 10 | 11 |
The index and column names can be transformed by passing functions to the rename() method.
data.rename(index=str.title, columns=str.upper)
| | ONE | TWO | THREE | FOUR |
|---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colo | 4 | 5 | 6 | 7 |
New | 8 | 9 | 10 | 11 |
One can rename a subset of the indices and column names using a dictionary mapping.
data.rename(index={'OHIO': 'INDIANA'},
            columns={'three': 'peekaboo'})
| | one | two | peekaboo | four |
|---|---|---|---|---|
INDIANA | 0 | 1 | 2 | 3 |
COLO | 4 | 5 | 6 | 7 |
NEW | 8 | 9 | 10 | 11 |
As an example, suppose we want to bin the following age data into the groups 18-25, 26-35, 36-60, and 61 and older.
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
The cut() function returns a Categorical object.
cats.codes
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
cats.categories
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
closed='right',
dtype='interval[int64]')
pd.value_counts(cats)
(18, 25] 5
(35, 60] 3
(25, 35] 3
(60, 100] 1
dtype: int64
A few parameters of cut() can be adjusted; for example, passing right=False makes the left edge of each bin inclusive and the right edge exclusive (denoted with standard mathematical interval notation).
pd.cut(ages, [18, 26, 36, 61, 100], right=False)
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
The names of the bins can be specified with the labels argument.
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
Instead of specifying the bin edges, an integer can be passed for the number of equal-width bins to use.
data = np.random.rand(20)
pd.cut(data, 4, precision=2)  # four equal-width bins; precision=2 limits bin edges to two decimals
[(0.096, 0.32], (0.32, 0.54], (0.54, 0.77], (0.32, 0.54], (0.77, 0.99], ..., (0.096, 0.32], (0.096, 0.32], (0.32, 0.54], (0.77, 0.99], (0.096, 0.32]]
Length: 20
Categories (4, interval[float64]): [(0.096, 0.32] < (0.32, 0.54] < (0.54, 0.77] < (0.77, 0.99]]
The qcut() function bins the data based on sample quantiles, which results in roughly equal-sized bins.
data = np.random.randn(1000)
cats = pd.qcut(data, 4) # cut into quartiles
cats
[(0.605, 2.759], (-0.0516, 0.605], (-0.0516, 0.605], (-0.0516, 0.605], (-0.0516, 0.605], ..., (-0.0516, 0.605], (-3.0469999999999997, -0.698], (-3.0469999999999997, -0.698], (-3.0469999999999997, -0.698], (-0.698, -0.0516]]
Length: 1000
Categories (4, interval[float64]): [(-3.0469999999999997, -0.698] < (-0.698, -0.0516] < (-0.0516, 0.605] < (0.605, 2.759]]
pd.value_counts(cats)
(0.605, 2.759] 250
(-0.0516, 0.605] 250
(-0.698, -0.0516] 250
(-3.0469999999999997, -0.698] 250
dtype: int64
The specific quantiles to use can also be passed to qcut().
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.0])
[(-0.0516, 1.182], (-0.0516, 1.182], (-0.0516, 1.182], (-0.0516, 1.182], (-0.0516, 1.182], ..., (-0.0516, 1.182], (-1.28, -0.0516], (-1.28, -0.0516], (-1.28, -0.0516], (-1.28, -0.0516]]
Length: 1000
Categories (4, interval[float64]): [(-3.0469999999999997, -1.28] < (-1.28, -0.0516] < (-0.0516, 1.182] < (1.182, 2.759]]
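As an added sanity check, counting the quantile-based bins confirms that they track the requested proportions; with 1,000 points and quantiles [0, 0.1, 0.5, 0.9, 1.0], we expect bin sizes of 100, 400, 400, and 100.

pd.value_counts(pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.0]))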
Here is an example for detecting and filtering outliers in a data set.
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
mean | 0.014493 | -0.005655 | -0.002238 | -0.041619 |
std | 0.971579 | 0.999478 | 0.991132 | 0.990014 |
min | -3.007437 | -3.116857 | -3.392300 | -3.740101 |
25% | -0.654814 | -0.671151 | -0.637831 | -0.763313 |
50% | -0.010606 | 0.003705 | -0.063378 | 0.011749 |
75% | 0.659244 | 0.693056 | 0.646954 | 0.637535 |
max | 2.979976 | 3.801660 | 3.427539 | 2.929096 |
Let's identify values greater than 3 in magnitude.
col = data[2]
col[np.abs(col) > 3]
598 3.427539
656 -3.392300
Name: 2, dtype: float64
The rows containing any value greater than 3 in magnitude can be selected using the any() method.
data[(np.abs(data) > 3).any(axis=1)]  # axis=1 tests each row
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
0 | 1.049093 | 3.170975 | 0.189500 | -1.348413 |
241 | 0.422819 | -3.116857 | 0.644452 | -1.913743 |
507 | -0.347585 | 3.306574 | -1.510200 | 0.203540 |
516 | -0.126856 | 3.801660 | 2.315171 | 0.139827 |
598 | -0.752582 | 0.045113 | 3.427539 | 0.604682 |
602 | -0.553965 | -3.006499 | -0.047166 | 0.831878 |
656 | -0.479297 | -1.345508 | -3.392300 | 0.155794 |
674 | 1.061435 | -0.435034 | 0.657682 | -3.740101 |
854 | -3.007437 | -2.330467 | -0.567803 | 2.667322 |
Values can be set using this boolean indexing.
data[np.abs(data) > 3] = np.sign(data) * 3  # cap values to the interval [-3, 3]
data.describe()
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
mean | 0.014500 | -0.006811 | -0.002273 | -0.040879 |
std | 0.971556 | 0.994856 | 0.988474 | 0.987520 |
min | -3.000000 | -3.000000 | -3.000000 | -3.000000 |
25% | -0.654814 | -0.671151 | -0.637831 | -0.763313 |
50% | -0.010606 | 0.003705 | -0.063378 | 0.011749 |
75% | 0.659244 | 0.693056 | 0.646954 | 0.637535 |
max | 2.979976 | 3.000000 | 3.000000 | 2.929096 |
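To see why the assignment above caps both tails (an added illustration): np.sign produces -1.0 or 1.0 according to each value's sign, so np.sign(data) * 3 yields -3 or +3 with the original sign preserved.

np.sign(data).head()  # each element is -1.0 or 1.0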
Permuting a Series or the rows of a DataFrame is easily accomplished using numpy.random.permutation(). Calling this function with the length of the axis to permute returns an array of shuffled integer indices that can be used for reordering.
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
sampler = np.random.permutation(5)
sampler
array([0, 3, 2, 4, 1])
df.take(sampler)
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
3 | 12 | 13 | 14 | 15 |
2 | 8 | 9 | 10 | 11 |
4 | 16 | 17 | 18 | 19 |
1 | 4 | 5 | 6 | 7 |
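The same reordering can also be written with iloc (an equivalent spelling, added for reference):

df.iloc[sampler]  # identical result to df.take(sampler)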
A sample of a DataFrame can be taken with or without replacement.
df.sample(n=3)
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
2 | 8 | 9 | 10 | 11 |
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
df.sample(n=6, replace=True)
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
1 | 4 | 5 | 6 | 7 |
1 | 4 | 5 | 6 | 7 |
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
0 | 0 | 1 | 2 | 3 |
0 | 0 | 1 | 2 | 3 |
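For reproducible samples (a small added tip), pass a seed through the random_state argument:

df.sample(n=3, random_state=42)  # the same three rows on every run with this seed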
Sampling like this is often a useful tool for statistical analysis or machine learning. Another useful transformation for modeling is converting a categorical variable into a "dummy" or indicator matrix, which pandas supports through the get_dummies() function.
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})
df
| | key | data1 |
|---|---|---|
0 | b | 0 |
1 | b | 1 |
2 | a | 2 |
3 | c | 3 |
4 | a | 4 |
5 | b | 5 |
pd.get_dummies(df['key'])
| | a | b | c |
|---|---|---|---|
0 | 0 | 1 | 0 |
1 | 0 | 1 | 0 |
2 | 1 | 0 | 0 |
3 | 0 | 0 | 1 |
4 | 1 | 0 | 0 |
5 | 0 | 1 | 0 |
Sometimes it is useful to add a prefix to the dummy column names.
dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
| | data1 | key_a | key_b | key_c |
|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 |
1 | 1 | 0 | 1 | 0 |
2 | 2 | 1 | 0 | 0 |
3 | 3 | 0 | 0 | 1 |
4 | 4 | 1 | 0 | 0 |
5 | 5 | 0 | 1 | 0 |
Creating a dummy matrix is more complicated if a row of a DataFrame belongs to multiple categories. As an example, we use the MovieLens 1M data set.
mnames = ['movie_id', 'title', 'genres']
# engine='python' is needed for the multi-character '::' separator and avoids a ParserWarning
movies = pd.read_table('assets/datasets/movielens/movies.dat', sep='::',
                       header=None, names=mnames, engine='python')
movies
| | movie_id | title | genres |
|---|---|---|---|
0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
... | ... | ... | ... |
3878 | 3948 | Meet the Parents (2000) | Comedy |
3879 | 3949 | Requiem for a Dream (2000) | Drama |
3880 | 3950 | Tigerland (2000) | Drama |
3881 | 3951 | Two Family House (2000) | Drama |
3882 | 3952 | Contender, The (2000) | Drama\|Thriller |
3883 rows × 3 columns
The goal of the next set of operations is to add an indicator variable for genre.
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
genres = pd.unique(all_genres)
genres
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
'Western'], dtype=object)
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)
gen = movies.genres[0]
gen.split('|')
['Animation', "Children's", 'Comedy']
dummies.columns.get_indexer(gen.split('|'))
array([0, 1, 2])
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1
dummies
| | Animation | Children's | Comedy | Adventure | Fantasy | Romance | Drama | Action | Crime | Thriller | Horror | Sci-Fi | Documentary | War | Musical | Mystery | Film-Noir | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3878 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3879 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3880 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3881 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3882 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3883 rows × 18 columns
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]
movie_id 1
title Toy Story (1995)
genres Animation|Children's|Comedy
Genre_Animation 1
Genre_Children's 1
Genre_Comedy 1
Genre_Adventure 0
Genre_Fantasy 0
Genre_Romance 0
Genre_Drama 0
Genre_Action 0
Genre_Crime 0
Genre_Thriller 0
Genre_Horror 0
Genre_Sci-Fi 0
Genre_Documentary 0
Genre_War 0
Genre_Musical 0
Genre_Mystery 0
Genre_Film-Noir 0
Genre_Western 0
Name: 0, dtype: object
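For this kind of delimiter-separated membership data, pandas also offers a built-in shortcut (an added note, not in the original walkthrough): Series.str.get_dummies builds an equivalent indicator matrix in one step (with columns sorted alphabetically) and avoids the explicit Python loop.

dummies_fast = movies.genres.str.get_dummies('|')  # one-hot encode on the '|' separator
movies.join(dummies_fast.add_prefix('Genre_'))

A related recipe is combining get_dummies() with a discretization function such as cut() to one-hot encode continuous data:

pd.get_dummies(pd.cut(np.random.rand(10), [0, 0.2, 0.4, 0.6, 0.8, 1.0]))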
pandas builds on top of Python's rich assortment of string manipulations and object methods by applying them to arrays of strings.
Here are a few commonly used string methods built into Python.
val = 'a,b, guido'
val.split(',')
['a', 'b', ' guido']
pieces = [x.strip() for x in val.split(',')]
pieces
['a', 'b', 'guido']
first, second, third = pieces
first + '::' + second + '::' + third
'a::b::guido'
'::'.join(pieces)
'a::b::guido'
'guido' in val
True
val.index(',')
1
val.find(':')
-1
val.rfind(',')
3
val.count(',')
2
val.replace(',', '::')
'a::b:: guido'
val.replace(',', '')
'ab guido'
upper_val = val.upper()
upper_val
'A,B, GUIDO'
upper_val.lower()
'a,b, guido'
A regex provides a powerful way to identify patterns in strings. Python's built-in re module is responsible for applying regular expressions to strings. The functions in re fall into one of three categories: pattern matching, substitution, and splitting.
Below are several examples of using regular expressions for various tasks. The first is splitting a string with a variable amount of whitespace.
import re
text = "foo bat\t baz \tqux"
re.split(r'\s+', text)
['foo', 'bat', 'baz', 'qux']
When re.split(r'\s+', text) was called, the regular expression was first compiled. This can also be done manually with re.compile(), creating a reusable regex object for further matching.
regex = re.compile(r'\s+')
regex.split(text)
['foo', 'bat', 'baz', 'qux']
regex.findall(text)
[' ', '\t ', ' \t']
Here is an example with a larger string and pattern.
The re.IGNORECASE flag makes the regex case-insensitive.
text = """Dave [email protected]
Steve [email protected]
Rob [email protected]
Ryan [email protected]
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)
regex.findall(text)
['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
m = regex.search(text)
m
<re.Match object; span=(5, 20), match='dave@google.com'>
text[m.start():m.end()]
'dave@google.com'
match() only checks for the pattern at the beginning of the string.
print(regex.match(text))
None
The sub() method substitutes another string at the points of pattern match.
print(regex.sub('REDACTED', text))
Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED
It is possible to find and segment a pattern into groups using parentheses.
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
m = regex.match('wesm@bright.net')
m.groups()
('wesm', 'bright', 'net')
regex.findall(text)
[('dave', 'google', 'com'),
('steve', 'gmail', 'com'),
('rob', 'gmail', 'com'),
('ryan', 'yahoo', 'com')]
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))
Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com
data = {
    'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
    'Rob': 'rob@gmail.com', 'Wes': np.nan
}
data = pd.Series(data)
data
Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Wes NaN
dtype: object
data.isnull()
Dave False
Steve False
Rob False
Wes True
dtype: bool
Series has a str attribute for applying string and regular expression methods over all values, skipping missing values (NaN).
data.str.contains('gmail')
Dave False
Steve True
Rob True
Wes NaN
dtype: object
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
data.str.findall(pattern, flags=re.IGNORECASE)
Dave [(dave, google, com)]
Steve [(steve, gmail, com)]
Rob [(rob, gmail, com)]
Wes NaN
dtype: object
There are multiple ways to do vectorized element retrieval.
matches = data.str.match(pattern, flags=re.IGNORECASE)
matches
Dave True
Steve True
Rob True
Wes NaN
dtype: object
data.str[:5]
Dave dave@
Steve steve
Rob rob@g
Wes NaN
dtype: object
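As a final added example, str.extract applies a regex with groups and returns the capture groups as columns of a DataFrame, again skipping missing values.

data.str.extract(pattern, flags=re.IGNORECASE)  # columns 0, 1, 2 hold username, domain, and suffix; the Wes row is all NaN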