
Chapter 7. Data Cleaning and Preparation

In this chapter, the author reviews tools for handling missing data, duplicate data, string manipulation, and a few other common operations used in data preparation.

7.1 Handling missing data

Missing data is ignored by default when computing summary and descriptive statistics on a DataFrame or Series. For numeric data, the floating-point value NaN ("not a number") is used as the missing-data sentinel.

import pandas as pd
import numpy as np

np.random.seed(0)
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
string_data.isnull()
0    False
1    False
2     True
3    False
dtype: bool
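
As a quick check of this default behavior (a small sketch with made-up values, not from the original notebook):

s = pd.Series([1.0, np.nan, 3.0])
s.sum()   # 4.0 -- the NaN is skipped
s.mean()  # 2.0 -- mean of the two non-missing values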

Filtering out missing data

The dropna() method removes NaN values from a Series.

from numpy import nan as NA

data = pd.Series([1, NA, 3.5, NA, 7])
data
0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64
data.dropna()
0    1.0
2    3.5
4    7.0
dtype: float64
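
Calling dropna() on a Series is equivalent to boolean indexing with notnull():

data[data.notnull()]  # same result as data.dropna()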

On a DataFrame, dropna() removes any row containing at least one NaN value.

data = pd.DataFrame([[1., 6.5, 3.],
                     [1., NA, NA],
                     [NA, NA, NA],
                     [NA, 6.5, 3.]])
data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
cleaned = data.dropna()
cleaned
     0    1    2
0  1.0  6.5  3.0

However, passing how='all' removes only the rows in which every value is NaN.

data.dropna(how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0

Columns consisting entirely of NaN can be dropped by passing axis=1.

data[4] = NA
data
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN
data.dropna(axis=1, how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0

You can also use the thresh argument to keep only the rows containing at least a given number of non-missing values.

df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df
          0         1         2
0  1.764052       NaN       NaN
1  2.240893       NaN       NaN
2  0.950088       NaN -0.103219
3  0.410599       NaN  1.454274
4  0.761038  0.121675  0.443863
5  0.333674  1.494079 -0.205158
6  0.313068 -0.854096 -2.552990
df.dropna()
          0         1         2
4  0.761038  0.121675  0.443863
5  0.333674  1.494079 -0.205158
6  0.313068 -0.854096 -2.552990
df.dropna(thresh=2)
          0         1         2
2  0.950088       NaN -0.103219
3  0.410599       NaN  1.454274
4  0.761038  0.121675  0.443863
5  0.333674  1.494079 -0.205158
6  0.313068 -0.854096 -2.552990

Filling in missing data

fillna() is the workhorse method for filling in missing data; passing a constant replaces every NaN with that value.

df.fillna(0)
          0         1         2
0  1.764052  0.000000  0.000000
1  2.240893  0.000000  0.000000
2  0.950088  0.000000 -0.103219
3  0.410599  0.000000  1.454274
4  0.761038  0.121675  0.443863
5  0.333674  1.494079 -0.205158
6  0.313068 -0.854096 -2.552990

A dictionary can be used to fill specific columns with specific values.

df.fillna({1: 0.5, 2: 0})
          0         1         2
0  1.764052  0.500000  0.000000
1  2.240893  0.500000  0.000000
2  0.950088  0.500000 -0.103219
3  0.410599  0.500000  1.454274
4  0.761038  0.121675  0.443863
5  0.333674  1.494079 -0.205158
6  0.313068 -0.854096 -2.552990
df.fillna(0, inplace=True)
df
          0         1         2
0  1.764052  0.000000  0.000000
1  2.240893  0.000000  0.000000
2  0.950088  0.000000 -0.103219
3  0.410599  0.000000  1.454274
4  0.761038  0.121675  0.443863
5  0.333674  1.494079 -0.205158
6  0.313068 -0.854096 -2.552990
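
Note that inplace=True mutates the object and returns None, so the result should not be assigned (a small sketch reusing the df above):

result = df.fillna(0, inplace=True)
result is None  # True -- the DataFrame was modified in place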

The same fill methods available for reindexing (such as forward fill) can be used with fillna().

df = pd.DataFrame(np.random.rand(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df
          0         1         2
0  0.264556  0.774234  0.456150
1  0.568434  0.018790  0.617635
2  0.612096       NaN  0.943748
3  0.681820       NaN  0.437032
4  0.697631       NaN       NaN
5  0.670638       NaN       NaN
df.fillna(method='ffill')  # 'ffill' = 'forward fill'
          0         1         2
0  0.264556  0.774234  0.456150
1  0.568434  0.018790  0.617635
2  0.612096  0.018790  0.943748
3  0.681820  0.018790  0.437032
4  0.697631  0.018790  0.437032
5  0.670638  0.018790  0.437032
df.fillna(method='ffill', limit=2)
          0         1         2
0  0.264556  0.774234  0.456150
1  0.568434  0.018790  0.617635
2  0.612096  0.018790  0.943748
3  0.681820  0.018790  0.437032
4  0.697631       NaN  0.437032
5  0.670638       NaN  0.437032
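
fillna() can also be passed a computed statistic, such as the mean of the non-missing values (a small sketch with a fresh Series):

data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64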

7.2 Data transformation

Removing duplicates

Duplicate rows can be found using the duplicated() method on a DataFrame.

data = pd.DataFrame({
    'k1': ['one', 'two'] * 3 + ['two'],
    'k2': [1, 1, 2, 3, 3, 4, 4]
})
data
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4
data.duplicated()
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool
data.drop_duplicates()
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
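
drop_duplicates() is equivalent to filtering with the negation of duplicated():

data[~data.duplicated()]  # same rows as data.drop_duplicates()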

It is also possible to specify a subset of columns on which to detect duplicates.

data['v1'] = range(7)
data
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
5  two   4   5
6  two   4   6
data.drop_duplicates(['k1'])
    k1  k2  v1
0  one   1   0
1  two   1   1

By default, the first observed value combination is kept; passing keep='last' keeps the last one instead.

data.drop_duplicates(['k1', 'k2'], keep='last')
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4   6

Transforming data using a function or mapping

Often you want to perform a value transformation on an array, Series, or column of a DataFrame.

data = pd.DataFrame({
    'food': ['bacon', 'pulled pork', 'bacon',
             'Pastrami', 'corned beef', 'Bacon',
             'pastrami', 'honey ham', 'nova lox'],
    'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]
})
data
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0

We can add a column to indicate the source animal for each food. The map method on a Series accepts a function or dictionary-like object to apply to each element.

meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}

data['animal'] = data['food'].str.lower().map(meat_to_animal)
data
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon

Alternatively, we could have passed a function.

data['food'].map(lambda x: meat_to_animal[x.lower()])
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
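
If a food were missing from the mapping, indexing the dictionary directly would raise a KeyError; using dict.get with a default is a safer variant (the 'unknown' fallback here is illustrative, not from the original notes):

data['food'].map(lambda x: meat_to_animal.get(x.lower(), 'unknown'))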

Replacing values

The replace(to_replace=x, value=y) method replaces x with y in a Series.

data = pd.Series([1.0, -999.0, 2.0, -999.0, -1000.0, 3.0])
data.replace(-999, np.nan)
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64
data.replace([-999, -1000], np.nan)
0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64
data.replace([-999, -1000], [np.nan, 0])
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
data.replace({-999: np.nan, -1000: 0})
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
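
Note that Series.replace substitutes whole values; it is distinct from Series.str.replace, which performs element-wise string substitution. A small sketch with made-up values:

s = pd.Series(['cat', 'dog'])
s.replace('cat', 'lion')  # value-level: ['lion', 'dog']
s.str.replace('a', 'o')   # string-level: ['cot', 'dog']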

Renaming axis indexes

Axis labels can be transformed by a function or mapping, too. This can be accomplished in-place.

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11
transform = lambda x: x[:4].upper()
data.index.map(transform)
Index(['OHIO', 'COLO', 'NEW '], dtype='object')
data.index = data.index.map(transform)
data
      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11

The index and column names can be transformed by passing functions to the rename() method.

data.rename(index=str.title, columns=str.upper)
      ONE  TWO  THREE  FOUR
Ohio    0    1      2     3
Colo    4    5      6     7
New     8    9     10    11

One can rename a subset of the indices and column names using a dictionary mapping.

data.rename(index={'OHIO': 'INDIANA'},
            columns={'three': 'peekaboo'})
         one  two  peekaboo  four
INDIANA    0    1         2     3
COLO       4    5         6     7
NEW        8    9        10    11
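
To modify the dataset in place instead of returning a renamed copy, pass inplace=True:

data.rename(index={'OHIO': 'INDIANA'}, inplace=True)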

Discretization and binning

As an example, we want to bin the following age data into 18-25, 26-35, 36-60, 61 and older.

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

The cut() function returns a Categorical object.

cats.codes
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
cats.categories
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')
pd.value_counts(cats)
(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

cut() accepts a few parameters; for example, passing right=False makes the left edge of each bin inclusive and the right edge exclusive (denoted with standard mathematical interval notation).

pd.cut(ages, [18, 26, 36, 61, 100], right=False)
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]

The names of the bins can be specified via the labels argument.

group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]

Instead of specifying the bins explicitly, an integer can be passed for the number of equal-width bins to use.

data = np.random.rand(20)
pd.cut(data, 4, precision=2)
[(0.096, 0.32], (0.32, 0.54], (0.54, 0.77], (0.32, 0.54], (0.77, 0.99], ..., (0.096, 0.32], (0.096, 0.32], (0.32, 0.54], (0.77, 0.99], (0.096, 0.32]]
Length: 20
Categories (4, interval[float64]): [(0.096, 0.32] < (0.32, 0.54] < (0.54, 0.77] < (0.77, 0.99]]

The qcut() function bins the data based on sample quantiles, resulting in roughly equal-sized bins.

data = np.random.randn(1000)
cats = pd.qcut(data, 4)  # cut into quartiles
cats
[(0.605, 2.759], (-0.0516, 0.605], (-0.0516, 0.605], (-0.0516, 0.605], (-0.0516, 0.605], ..., (-0.0516, 0.605], (-3.0469999999999997, -0.698], (-3.0469999999999997, -0.698], (-3.0469999999999997, -0.698], (-0.698, -0.0516]]
Length: 1000
Categories (4, interval[float64]): [(-3.0469999999999997, -0.698] < (-0.698, -0.0516] < (-0.0516, 0.605] < (0.605, 2.759]]
pd.value_counts(cats)
(0.605, 2.759]                   250
(-0.0516, 0.605]                 250
(-0.698, -0.0516]                250
(-3.0469999999999997, -0.698]    250
dtype: int64

The specific quantiles to use can also be passed to qcut().

pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.0])
[(-0.0516, 1.182], (-0.0516, 1.182], (-0.0516, 1.182], (-0.0516, 1.182], (-0.0516, 1.182], ..., (-0.0516, 1.182], (-1.28, -0.0516], (-1.28, -0.0516], (-1.28, -0.0516], (-1.28, -0.0516]]
Length: 1000
Categories (4, interval[float64]): [(-3.0469999999999997, -1.28] < (-1.28, -0.0516] < (-0.0516, 1.182] < (1.182, 2.759]]

Detecting and filtering outliers

Here is an example for detecting and filtering outliers in a data set.

data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.014493    -0.005655    -0.002238    -0.041619
std       0.971579     0.999478     0.991132     0.990014
min      -3.007437    -3.116857    -3.392300    -3.740101
25%      -0.654814    -0.671151    -0.637831    -0.763313
50%      -0.010606     0.003705    -0.063378     0.011749
75%       0.659244     0.693056     0.646954     0.637535
max       2.979976     3.801660     3.427539     2.929096

Let's identify the values greater than 3 in magnitude in one of the columns.

col = data[2]
col[np.abs(col) > 3]
598    3.427539
656   -3.392300
Name: 2, dtype: float64

The rows containing any value greater than 3 in magnitude can be selected using the any() method.

data[(np.abs(data) > 3).any(axis=1)]
            0         1         2         3
0    1.049093  3.170975  0.189500 -1.348413
241  0.422819 -3.116857  0.644452 -1.913743
507 -0.347585  3.306574 -1.510200  0.203540
516 -0.126856  3.801660  2.315171  0.139827
598 -0.752582  0.045113  3.427539  0.604682
602 -0.553965 -3.006499 -0.047166  0.831878
656 -0.479297 -1.345508 -3.392300  0.155794
674  1.061435 -0.435034  0.657682 -3.740101
854 -3.007437 -2.330467 -0.567803  2.667322

Outlying values can be capped using this boolean indexing.

data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.014500    -0.006811    -0.002273    -0.040879
std       0.971556     0.994856     0.988474     0.987520
min      -3.000000    -3.000000    -3.000000    -3.000000
25%      -0.654814    -0.671151    -0.637831    -0.763313
50%      -0.010606     0.003705    -0.063378     0.011749
75%       0.659244     0.693056     0.646954     0.637535
max       2.979976     3.000000     3.000000     2.929096
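
Here np.sign() returns -1, 0, or 1 according to the sign of each value, so the assignment caps values outside [-3, 3] while preserving their sign. A tiny illustration with made-up values:

np.sign(pd.Series([-5.0, 0.0, 2.5]))
0   -1.0
1    0.0
2    1.0
dtype: float64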

Permutation and random sampling

Permuting a Series or rows of a DataFrame is easily accomplished using numpy.random.permutation(). Providing this function with the length of the axis to permute returns a new indexing to use for reordering.

df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
sampler = np.random.permutation(5)
sampler
array([0, 3, 2, 4, 1])
df.take(sampler)
    0   1   2   3
0   0   1   2   3
3  12  13  14  15
2   8   9  10  11
4  16  17  18  19
1   4   5   6   7
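
take() with an integer array is equivalent to positional indexing with iloc:

df.iloc[sampler]  # same reordering via integer-location indexing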

A sample of a DataFrame can be taken with or without replacement.

df.sample(n=3)
   0  1   2   3
2  8  9  10  11
0  0  1   2   3
1  4  5   6   7
df.sample(n=6, replace=True)
   0  1  2  3
1  4  5  6  7
1  4  5  6  7
0  0  1  2  3
1  4  5  6  7
0  0  1  2  3
0  0  1  2  3
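
For reproducible sampling, sample() accepts a random_state seed (the seed value here is arbitrary):

df.sample(n=3, random_state=42)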

Computing indicator or dummy variables

Converting a categorical variable into a dummy or indicator matrix is often a useful transformation for statistical analysis or machine learning; pandas provides the get_dummies() function for this.

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})
df
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5
pd.get_dummies(df['key'])
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0

Sometimes it is useful to add a prefix to the indicator column names.

dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
   data1  key_a  key_b  key_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0

Creating a dummy matrix is more complicated if a row of a DataFrame belongs to multiple categories. As an example, we use the MovieLens 1M data set.

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('assets/datasets/movielens/movies.dat', sep='::',
                       header=None, names = mnames)
movies
/opt/anaconda3/envs/daysOfCode-env/lib/python3.7/site-packages/ipykernel_launcher.py:3: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  This is separate from the ipykernel package so we can avoid doing imports until
      movie_id                               title                        genres
0            1                    Toy Story (1995)   Animation|Children's|Comedy
1            2                      Jumanji (1995)  Adventure|Children's|Fantasy
2            3             Grumpier Old Men (1995)                Comedy|Romance
3            4            Waiting to Exhale (1995)                  Comedy|Drama
4            5  Father of the Bride Part II (1995)                        Comedy
...        ...                                 ...                           ...
3878      3948             Meet the Parents (2000)                        Comedy
3879      3949          Requiem for a Dream (2000)                         Drama
3880      3950                    Tigerland (2000)                         Drama
3881      3951             Two Family House (2000)                         Drama
3882      3952               Contender, The (2000)                Drama|Thriller

[3883 rows x 3 columns]

The goal of the next set of operations is to add an indicator variable for genre.

all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))

genres = pd.unique(all_genres)
genres
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)
gen = movies.genres[0]
gen.split('|')
['Animation', "Children's", 'Comedy']
dummies.columns.get_indexer(gen.split('|'))
array([0, 1, 2])
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1

dummies
Animation Children's Comedy Adventure Fantasy Romance Drama Action Crime Thriller Horror Sci-Fi Documentary War Musical Mystery Film-Noir Western
0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3878 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3879 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3880 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3881 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3882 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

3883 rows × 18 columns

movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]
movie_id                                       1
title                           Toy Story (1995)
genres               Animation|Children's|Comedy
Genre_Animation                                1
Genre_Children's                               1
Genre_Comedy                                   1
Genre_Adventure                                0
Genre_Fantasy                                  0
Genre_Romance                                  0
Genre_Drama                                    0
Genre_Action                                   0
Genre_Crime                                    0
Genre_Thriller                                 0
Genre_Horror                                   0
Genre_Sci-Fi                                   0
Genre_Documentary                              0
Genre_War                                      0
Genre_Musical                                  0
Genre_Mystery                                  0
Genre_Film-Noir                                0
Genre_Western                                  0
Name: 0, dtype: object
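
A useful recipe for statistical applications is to combine get_dummies() with a discretization function like cut() (a sketch; output omitted):

np.random.seed(12345)
values = np.random.rand(10)
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values, bins))  # one indicator column per bin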

7.3 String manipulation

pandas builds on top of Python's rich assortment of string manipulations and object methods by applying them to arrays of strings.

String object methods

Here are a few commonly used string methods built into Python.

val = 'a,b,  guido'
val.split(',')
['a', 'b', '  guido']
pieces = [x.strip() for x in val.split(',')]
pieces
['a', 'b', 'guido']
first, second, third = pieces
first + '::' + second + '::' + third
'a::b::guido'
'::'.join(pieces)
'a::b::guido'
'guido' in val
True
val.index(',')
1
val.find(':')
-1
val.rfind(',')
3
val.count(',')
2
val.replace(',', '::')
'a::b::  guido'
val.replace(',', '')
'ab  guido'
upper_val = val.upper()
upper_val
'A,B,  GUIDO'
upper_val.lower()
'a,b,  guido'

Regular expressions

A regex provides a powerful way to identify patterns in strings. Python's built-in re module is responsible for applying regular expressions to strings. The functions in re fall into three categories: pattern matching, substitution, and splitting.

Below are several examples of using regular expressions for various tasks. The first splits a string on a variable number of whitespace characters.

import re

text = "foo   bat\t baz \tqux"
re.split(r'\s+', text)
['foo', 'bat', 'baz', 'qux']

When re.split('\s+', text) was called, the regular expression was first compiled. This can be done manually, creating a new object that can be used for further matching.

regex = re.compile(r'\s+')
regex.split(text)
['foo', 'bat', 'baz', 'qux']
regex.findall(text)
['   ', '\t ', ' \t']

Here is an example with a larger string and pattern. Passing re.IGNORECASE makes the regex case-insensitive.

text = """Dave [email protected]
Steve [email protected]
Rob [email protected]
Ryan [email protected]
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

regex = re.compile(pattern, flags=re.IGNORECASE)

regex.findall(text)
['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
m = regex.search(text)
m
<re.Match object; span=(5, 20), match='dave@google.com'>
text[m.start():m.end()]
'dave@google.com'

match() only checks for the pattern at the beginning of the string.

print(regex.match(text))
None

The sub() method substitutes a replacement string at each point the pattern matches.

print(regex.sub('REDACTED', text))
Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED

It is possible to capture and segment a match into groups using parentheses.

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
m = regex.match('wesm@bright.net')
m.groups()
('wesm', 'bright', 'net')
regex.findall(text)
[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))
Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com

Vectorized string functions in pandas

data = {
    'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
    'Rob': 'rob@gmail.com', 'Wes': np.nan
}
data = pd.Series(data)
data
Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object
data.isnull()
Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool

Series has an str attribute for applying regular expressions over all values, skipping missing values (NaN).

data.str.contains('gmail')
Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
data.str.findall(pattern, flags=re.IGNORECASE)
Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object

There are multiple ways to do vectorized element retrieval.

matches = data.str.match(pattern, flags=re.IGNORECASE)
matches
Dave     True
Steve    True
Rob      True
Wes       NaN
dtype: object
data.str[:5]
Dave     dave@
Steve    steve
Rob      rob@g
Wes        NaN
dtype: object