
Chapter 7. Data Cleaning and Preparation

In this chapter, the author reviews tools for handling missing data, duplicate data, string manipulation, and a few other common operations used in data preparation.

7.1 Handling missing data

Missing data is ignored by default when computing summary and descriptive statistics on a DataFrame or Series. For numeric data, the floating-point value NaN ("not a number") is used as the missing-data sentinel.

import pandas as pd
import numpy as np

np.random.seed(0)
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
string_data.isnull()
0    False
1    False
2     True
3    False
dtype: bool
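
As a quick check of this default behavior (a small sketch with made-up values, not from the original notebook):

s = pd.Series([1.0, np.nan, 3.0])
s.sum()   # 4.0 -- the NaN is skipped
s.mean()  # 2.0 -- mean of the two non-missing values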

Filtering out missing data

The dropna() method removes NaN values from a Series.

from numpy import nan as NA

data = pd.Series([1, NA, 3.5, NA, 7])
data
0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64
data.dropna()
0    1.0
2    3.5
4    7.0
dtype: float64
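
Calling dropna() on a Series is equivalent to boolean indexing with notnull():

data[data.notnull()]  # same result as data.dropna()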

On a DataFrame, dropna() removes any row containing at least one NaN value.

data = pd.DataFrame([[1., 6.5, 3.],
                     [1., NA, NA],
                     [NA, NA, NA],
                     [NA, 6.5, 3.]])
data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
cleaned = data.dropna()
cleaned
     0    1    2
0  1.0  6.5  3.0

However, passing how='all' removes only the rows in which every value is NaN.

data.dropna(how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0

Columns consisting entirely of NaN can be dropped by passing axis=1.

data[4] = NA
data
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN
data.dropna(axis=1, how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0

You can also use the thresh argument to keep only the rows containing at least a given number of non-missing values.

df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df
          0         1         2
0  1.764052       NaN       NaN
1  2.240893       NaN       NaN
2  0.950088       NaN -0.103219
3  0.410599       NaN  1.454274
4  0.761038  0.121675  0.443863
5  0.333674  1.494079 -0.205158
6  0.313068 -0.854096 -2.552990
df.dropna()
          0         1         2
4  0.761038  0.121675  0.443863
5  0.333674  1.494079 -0.205158
6  0.313068 -0.854096 -2.552990
df.dropna(thresh=2)
          0         1         2
2  0.950088       NaN -0.103219
3  0.410599       NaN  1.454274
4  0.761038  0.121675  0.443863
5  0.333674  1.494079 -0.205158
6  0.313068 -0.854096 -2.552990

Filling in missing data

fillna() is the workhorse method for filling in missing data; passing a constant replaces every NaN with that value.

df.fillna(0)
          0         1         2
0  1.764052  0.000000  0.000000
1  2.240893  0.000000  0.000000
2  0.950088  0.000000 -0.103219
3  0.410599  0.000000  1.454274
4  0.761038  0.121675  0.443863
5  0.333674  1.494079 -0.205158
6  0.313068 -0.854096 -2.552990

A dictionary can be used to fill specific columns with specific values.

df.fillna({1: 0.5, 2: 0})
          0         1         2
0  1.764052  0.500000  0.000000
1  2.240893  0.500000  0.000000
2  0.950088  0.500000 -0.103219
3  0.410599  0.500000  1.454274
4  0.761038  0.121675  0.443863
5  0.333674  1.494079 -0.205158
6  0.313068 -0.854096 -2.552990
df.fillna(0, inplace=True)
df
          0         1         2
0  1.764052  0.000000  0.000000
1  2.240893  0.000000  0.000000
2  0.950088  0.000000 -0.103219
3  0.410599  0.000000  1.454274
4  0.761038  0.121675  0.443863
5  0.333674  1.494079 -0.205158
6  0.313068 -0.854096 -2.552990
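
Note that inplace=True mutates the object and returns None, so the result should not be assigned (a small sketch reusing the df above):

result = df.fillna(0, inplace=True)
result is None  # True -- the DataFrame was modified in place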

The same fill methods available for reindexing (such as forward fill) can be used with fillna().

df = pd.DataFrame(np.random.rand(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df
          0         1         2
0  0.264556  0.774234  0.456150
1  0.568434  0.018790  0.617635
2  0.612096       NaN  0.943748
3  0.681820       NaN  0.437032
4  0.697631       NaN       NaN
5  0.670638       NaN       NaN
df.fillna(method='ffill')  # 'ffill' = 'forward fill'
          0         1         2
0  0.264556  0.774234  0.456150
1  0.568434  0.018790  0.617635
2  0.612096  0.018790  0.943748
3  0.681820  0.018790  0.437032
4  0.697631  0.018790  0.437032
5  0.670638  0.018790  0.437032
df.fillna(method='ffill', limit=2)
          0         1         2
0  0.264556  0.774234  0.456150
1  0.568434  0.018790  0.617635
2  0.612096  0.018790  0.943748
3  0.681820  0.018790  0.437032
4  0.697631       NaN  0.437032
5  0.670638       NaN  0.437032
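
fillna() can also be passed a computed statistic, such as the mean of the non-missing values (a small sketch with a fresh Series):

data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64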

7.2 Data transformation

Removing duplicates

Duplicate rows can be found using the duplicated() method on a DataFrame.

data = pd.DataFrame({
    'k1': ['one', 'two'] * 3 + ['two'],
    'k2': [1, 1, 2, 3, 3, 4, 4]
})
data
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4
data.duplicated()
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool
data.drop_duplicates()
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
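
drop_duplicates() is equivalent to filtering with the negation of duplicated():

data[~data.duplicated()]  # same rows as data.drop_duplicates()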

It is also possible to specify a subset of columns on which to detect duplicates.

data['v1'] = range(7)
data
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
5  two   4   5
6  two   4   6
data.drop_duplicates(['k1'])
    k1  k2  v1
0  one   1   0
1  two   1   1

By default, the first observed value combination is kept; passing keep='last' keeps the last one instead.

data.drop_duplicates(['k1', 'k2'], keep='last')
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4   6

Transforming data using a function or mapping

Often you want to perform a value transformation on an array, Series, or column of a DataFrame.

data = pd.DataFrame({
    'food': ['bacon', 'pulled pork', 'bacon',
             'Pastrami', 'corned beef', 'Bacon',
             'pastrami', 'honey ham', 'nova lox'],
    'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]
})
data
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0

We can add a column to indicate the source animal for each food. The map method on a Series accepts a function or dictionary-like object to apply to each element.

meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}

data['animal'] = data['food'].str.lower().map(meat_to_animal)
data
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon

Alternatively, we could have passed a function.

data['food'].map(lambda x: meat_to_animal[x.lower()])
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
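
If a food were missing from the mapping, indexing the dictionary directly would raise a KeyError; using dict.get with a default is a safer variant (the 'unknown' fallback here is illustrative, not from the original notes):

data['food'].map(lambda x: meat_to_animal.get(x.lower(), 'unknown'))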

Replacing values

The replace(to_replace=x, value=y) method replaces x with y in a Series.

data = pd.Series([1.0, -999.0, 2.0, -999.0, -1000.0, 3.0])
data.replace(-999, np.nan)
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64
data.replace([-999, -1000], np.nan)
0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64
data.replace([-999, -1000], [np.nan, 0])
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
data.replace({-999: np.nan, -1000: 0})
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
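
Note that Series.replace substitutes whole values; it is distinct from Series.str.replace, which performs element-wise string substitution. A small sketch with made-up values:

s = pd.Series(['cat', 'dog'])
s.replace('cat', 'lion')  # value-level: ['lion', 'dog']
s.str.replace('a', 'o')   # string-level: ['cot', 'dog']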

Renaming axis indexes

Axis labels can be transformed by a function or mapping, too. This can be accomplished in-place.

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11
transform = lambda x: x[:4].upper()
data.index.map(transform)
Index(['OHIO', 'COLO', 'NEW '], dtype='object')
data.index = data.index.map(transform)
data
      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11

The index and column names can be transformed by passing functions to the rename() method.

data.rename(index=str.title, columns=str.upper)
      ONE  TWO  THREE  FOUR
Ohio    0    1      2     3
Colo    4    5      6     7
New     8    9     10    11

One can rename a subset of the indices and column names using a dictionary mapping.

data.rename(index={'OHIO': 'INDIANA'},
            columns={'three': 'peekaboo'})
         one  two  peekaboo  four
INDIANA    0    1         2     3
COLO       4    5         6     7
NEW        8    9        10    11
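
To modify the dataset in place instead of returning a renamed copy, pass inplace=True:

data.rename(index={'OHIO': 'INDIANA'}, inplace=True)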

Discretization and binning

As an example, we want to bin the following age data into 18-25, 26-35, 36-60, 61 and older.

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

The cut() function returns a Categorical object.

cats.codes
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
cats.categories
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')
pd.value_counts(cats)
(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

cut() accepts a few parameters; for example, passing right=False makes the left edge of each bin inclusive and the right edge exclusive (denoted with standard mathematical interval notation).

pd.cut(ages, [18, 26, 36, 61, 100], right=False)
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]

The names of the bins can be specified via the labels argument.

group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]

Instead of specifying the bins explicitly, an integer can be passed for the number of equal-width bins to use.

data = np.random.rand(20)
pd.cut(data, 4, precision=2)
[(0.096, 0.32], (0.32, 0.54], (0.54, 0.77], (0.32, 0.54], (0.77, 0.99], ..., (0.096, 0.32], (0.096, 0.32], (0.32, 0.54], (0.77, 0.99], (0.096, 0.32]]
Length: 20
Categories (4, interval[float64]): [(0.096, 0.32] < (0.32, 0.54] < (0.54, 0.77] < (0.77, 0.99]]

The qcut() function bins the data based on sample quantiles, resulting in roughly equal-sized bins.

data = np.random.randn(1000)
cats = pd.qcut(data, 4)  # cut into quartiles
cats
[(0.605, 2.759], (-0.0516, 0.605], (-0.0516, 0.605], (-0.0516, 0.605], (-0.0516, 0.605], ..., (-0.0516, 0.605], (-3.0469999999999997, -0.698], (-3.0469999999999997, -0.698], (-3.0469999999999997, -0.698], (-0.698, -0.0516]]
Length: 1000
Categories (4, interval[float64]): [(-3.0469999999999997, -0.698] < (-0.698, -0.0516] < (-0.0516, 0.605] < (0.605, 2.759]]
pd.value_counts(cats)
(0.605, 2.759]                   250
(-0.0516, 0.605]                 250
(-0.698, -0.0516]                250
(-3.0469999999999997, -0.698]    250
dtype: int64

The specific quantiles to use can also be passed to qcut().

pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.0])
[(-0.0516, 1.182], (-0.0516, 1.182], (-0.0516, 1.182], (-0.0516, 1.182], (-0.0516, 1.182], ..., (-0.0516, 1.182], (-1.28, -0.0516], (-1.28, -0.0516], (-1.28, -0.0516], (-1.28, -0.0516]]
Length: 1000
Categories (4, interval[float64]): [(-3.0469999999999997, -1.28] < (-1.28, -0.0516] < (-0.0516, 1.182] < (1.182, 2.759]]

Detecting and filtering outliers

Here is an example for detecting and filtering outliers in a data set.

data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.014493    -0.005655    -0.002238    -0.041619
std       0.971579     0.999478     0.991132     0.990014
min      -3.007437    -3.116857    -3.392300    -3.740101
25%      -0.654814    -0.671151    -0.637831    -0.763313
50%      -0.010606     0.003705    -0.063378     0.011749
75%       0.659244     0.693056     0.646954     0.637535
max       2.979976     3.801660     3.427539     2.929096

Let's identify the values greater than 3 in magnitude in one of the columns.

col = data[2]
col[np.abs(col) > 3]
598    3.427539
656   -3.392300
Name: 2, dtype: float64

The rows containing any value greater than 3 in magnitude can be selected using the any() method.

data[(np.abs(data) > 3).any(axis=1)]
            0         1         2         3
0    1.049093  3.170975  0.189500 -1.348413
241  0.422819 -3.116857  0.644452 -1.913743
507 -0.347585  3.306574 -1.510200  0.203540
516 -0.126856  3.801660  2.315171  0.139827
598 -0.752582  0.045113  3.427539  0.604682
602 -0.553965 -3.006499 -0.047166  0.831878
656 -0.479297 -1.345508 -3.392300  0.155794
674  1.061435 -0.435034  0.657682 -3.740101
854 -3.007437 -2.330467 -0.567803  2.667322

Outlying values can be capped using this boolean indexing.

data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.014500    -0.006811    -0.002273    -0.040879
std       0.971556     0.994856     0.988474     0.987520
min      -3.000000    -3.000000    -3.000000    -3.000000
25%      -0.654814    -0.671151    -0.637831    -0.763313
50%      -0.010606     0.003705    -0.063378     0.011749
75%       0.659244     0.693056     0.646954     0.637535
max       2.979976     3.000000     3.000000     2.929096
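
Here np.sign() returns -1, 0, or 1 according to the sign of each value, so the assignment caps values outside [-3, 3] while preserving their sign. A tiny illustration with made-up values:

np.sign(pd.Series([-5.0, 0.0, 2.5]))
0   -1.0
1    0.0
2    1.0
dtype: float64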

Permutation and random sampling

Permuting a Series or rows of a DataFrame is easily accomplished using numpy.random.permutation(). Providing this function with the length of the axis to permute returns a new indexing to use for reordering.

df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
sampler = np.random.permutation(5)
sampler
array([0, 3, 2, 4, 1])
df.take(sampler)
    0   1   2   3
0   0   1   2   3
3  12  13  14  15
2   8   9  10  11
4  16  17  18  19
1   4   5   6   7
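
take() with an integer array is equivalent to positional indexing with iloc:

df.iloc[sampler]  # same reordering via integer-location indexing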

A sample of a DataFrame can be taken with or without replacement.

df.sample(n=3)
   0  1   2   3
2  8  9  10  11
0  0  1   2   3
1  4  5   6   7
df.sample(n=6, replace=True)
   0  1  2  3
1  4  5  6  7
1  4  5  6  7
0  0  1  2  3
1  4  5  6  7
0  0  1  2  3
0  0  1  2  3
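
For reproducible sampling, sample() accepts a random_state seed (the seed value here is arbitrary):

df.sample(n=3, random_state=42)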

Computing indicator or dummy variables

Converting a categorical variable into a dummy or indicator matrix is often a useful transformation for statistical analysis or machine learning; pandas provides the get_dummies() function for this.

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})
df
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5
pd.get_dummies(df['key'])
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0

Sometimes it is useful to add a prefix to the indicator column names.

dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
   data1  key_a  key_b  key_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0

Creating a dummy matrix is more complicated if a row of a DataFrame belongs to multiple categories. As an example, we use the MovieLens 1M data set.

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('assets/datasets/movielens/movies.dat', sep='::',
                       header=None, names = mnames)
movies
/opt/anaconda3/envs/daysOfCode-env/lib/python3.7/site-packages/ipykernel_launcher.py:3: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  This is separate from the ipykernel package so we can avoid doing imports until
      movie_id                               title                        genres
0            1                    Toy Story (1995)   Animation|Children's|Comedy
1            2                      Jumanji (1995)  Adventure|Children's|Fantasy
2            3             Grumpier Old Men (1995)                Comedy|Romance
3            4            Waiting to Exhale (1995)                  Comedy|Drama
4            5  Father of the Bride Part II (1995)                        Comedy
...        ...                                 ...                           ...
3878      3948             Meet the Parents (2000)                        Comedy
3879      3949          Requiem for a Dream (2000)                         Drama
3880      3950                    Tigerland (2000)                         Drama
3881      3951             Two Family House (2000)                         Drama
3882      3952               Contender, The (2000)                Drama|Thriller

[3883 rows x 3 columns]

The goal of the next set of operations is to add an indicator variable for genre.

all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))

genres = pd.unique(all_genres)
genres
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)
gen = movies.genres[0]
gen.split('|')
['Animation', "Children's", 'Comedy']
dummies.columns.get_indexer(gen.split('|'))
array([0, 1, 2])
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1

dummies
Animation Children's Comedy Adventure Fantasy Romance Drama Action Crime Thriller Horror Sci-Fi Documentary War Musical Mystery Film-Noir Western
0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3878 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3879 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3880 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3881 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3882 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

3883 rows × 18 columns

movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]
movie_id                                       1
title                           Toy Story (1995)
genres               Animation|Children's|Comedy
Genre_Animation                                1
Genre_Children's                               1
Genre_Comedy                                   1
Genre_Adventure                                0
Genre_Fantasy                                  0
Genre_Romance                                  0
Genre_Drama                                    0
Genre_Action                                   0
Genre_Crime                                    0
Genre_Thriller                                 0
Genre_Horror                                   0
Genre_Sci-Fi                                   0
Genre_Documentary                              0
Genre_War                                      0
Genre_Musical                                  0
Genre_Mystery                                  0
Genre_Film-Noir                                0
Genre_Western                                  0
Name: 0, dtype: object
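
A useful recipe for statistical applications is to combine get_dummies() with a discretization function like cut() (a sketch; output omitted):

np.random.seed(12345)
values = np.random.rand(10)
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values, bins))  # one indicator column per bin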

7.3 String manipulation

pandas builds on top of Python's rich assortment of string manipulations and object methods by applying them to arrays of strings.

String object methods

Here are a few commonly used string methods built into Python.

val = 'a,b,  guido'
val.split(',')
['a', 'b', '  guido']
pieces = [x.strip() for x in val.split(',')]
pieces
['a', 'b', 'guido']
first, second, third = pieces
first + '::' + second + '::' + third
'a::b::guido'
'::'.join(pieces)
'a::b::guido'
'guido' in val
True
val.index(',')
1
val.find(':')
-1
val.rfind(',')
3
val.count(',')
2
val.replace(',', '::')
'a::b::  guido'
val.replace(',', '')
'ab  guido'
upper_val = val.upper()
upper_val
'A,B,  GUIDO'
upper_val.lower()
'a,b,  guido'

Regular expressions

A regex provides a powerful way to identify patterns in strings. Python's built-in re module is responsible for applying regular expressions to strings. The functions in re fall into three categories: pattern matching, substitution, and splitting.

Below are several examples of using regular expressions for various tasks. The first splits a string on a variable number of whitespace characters.

import re

text = "foo   bat\t baz \tqux"
re.split(r'\s+', text)
['foo', 'bat', 'baz', 'qux']

When re.split('\s+', text) was called, the regular expression was first compiled. This can be done manually, creating a new object that can be used for further matching.

regex = re.compile(r'\s+')
regex.split(text)
['foo', 'bat', 'baz', 'qux']
regex.findall(text)
['   ', '\t ', ' \t']

Here is an example with a larger string and pattern. Passing re.IGNORECASE makes the regex case-insensitive.

text = """Dave [email protected]
Steve [email protected]
Rob [email protected]
Ryan [email protected]
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

regex = re.compile(pattern, flags=re.IGNORECASE)

regex.findall(text)
['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
m = regex.search(text)
m
<re.Match object; span=(5, 20), match='dave@google.com'>
text[m.start():m.end()]
'dave@google.com'

match() only checks for the pattern at the beginning of the string.

print(regex.match(text))
None

The sub() method substitutes a replacement string at each point the pattern matches.

print(regex.sub('REDACTED', text))
Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED

It is possible to capture and segment a match into groups using parentheses.

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
m = regex.match('wesm@bright.net')
m.groups()
('wesm', 'bright', 'net')
regex.findall(text)
[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))
Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com

Vectorized string functions in pandas

data = {
    'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
    'Rob': 'rob@gmail.com', 'Wes': np.nan
}
data = pd.Series(data)
data
Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object
data.isnull()
Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool

Series has an str attribute for applying regular expressions over all values, skipping missing values (NaN).

data.str.contains('gmail')
Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
data.str.findall(pattern, flags=re.IGNORECASE)
Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object

There are multiple ways to do vectorized element retrieval.

matches = data.str.match(pattern, flags=re.IGNORECASE)
matches
Dave     True
Steve    True
Rob      True
Wes       NaN
dtype: object
data.str[:5]
Dave     dave@
Steve    steve
Rob      rob@g
Wes        NaN
dtype: object