Chapter 6. Data Loading, Storage, and File Formats

The three main ways to access data are by reading on-disk files, loading from databases, and requesting data via a web API.

6.1 Reading and writing data in text format

There are various ways to read tabular data into a pandas DataFrame. The functions I am most likely to use frequently are read_csv(), read_table(), read_excel(), read_pickle(), read_json(), and read_sql().

These functions generally perform a few similar processes on the input file:

  1. Index the rows and columns.
  2. Type inference and data conversion.
  3. Datetime parsing.
  4. Fixing unclean data, such as footers and numeric values containing commas.

These functions have many parameters for adjusting to the quirks of the data; therefore, if you run into a problem, look through the documentation to see if any of the parameters are made for fixing it.

Here are some simple examples of reading data in with pandas.

!cat assets/examples/ex1.csv
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
import pandas as pd
import numpy as np

np.random.seed(0)

ex_dir = 'assets/examples/'
df = pd.read_csv(ex_dir + 'ex1.csv')
df
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
!cat assets/examples/ex2.csv
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
pd.read_csv(ex_dir + 'ex2.csv', header=None)
0 1 2 3 4
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
pd.read_csv(ex_dir + 'ex2.csv', names=['a', 'b', 'c', 'd', 'message'])
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
names = ['a', 'b', 'c', 'd', 'message']
pd.read_csv(ex_dir + 'ex2.csv', names=names, index_col='message')
a b c d
message
hello 1 2 3 4
world 5 6 7 8
foo 9 10 11 12

A DataFrame can be read in with a hierarchical index by passing multiple columns to index_col.

!cat assets/examples/csv_mindex.csv
key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
parsed = pd.read_csv(ex_dir + 'csv_mindex.csv', index_col=['key1', 'key2'])
parsed
value1 value2
key1 key2
one a 1 2
b 3 4
c 5 6
d 7 8
two a 9 10
b 11 12
c 13 14
d 15 16

Troublesome rows can be skipped explicitly.

!cat assets/examples/ex4.csv
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
pd.read_csv(ex_dir + 'ex4.csv')
# hey!
a b c d message
# just wanted to make things more difficult for you NaN NaN NaN NaN
# who reads CSV files with computers anyway? NaN NaN NaN
1 2 3 4 hello
5 6 7 8 world
9 10 11 12 foo
pd.read_csv(ex_dir + 'ex4.csv', skiprows=[0, 2, 3])
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

The common missing-data sentinels, such as NA and NULL, are automatically identified by pandas.

!cat assets/examples/ex5.csv
something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
result = pd.read_csv(ex_dir + 'ex5.csv')
result
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo
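
Different NA sentinels can be specified per column by passing a dict to na_values; a minimal sketch reusing ex5.csv:

# Per-column sentinels: treat 'foo' and 'NA' as missing in 'message',
# and 'two' as missing in 'something'.
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
pd.read_csv(ex_dir + 'ex5.csv', na_values=sentinels)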

Some of the other more common arguments used when reading in data with pandas are sep/delimiter, header, index_col, names, skiprows, parse_dates, nrows, chunksize (for reading in files piecemeal), and skipfooter.
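
As a rough sketch of how several of these arguments combine (the file name and column names here are hypothetical):

# A hypothetical tab-delimited file with 'id' and 'date' columns.
df = pd.read_csv(
    'some_file.tsv',
    sep='\t',              # tab-delimited rather than comma-delimited
    header=0,              # use the first row as the header
    index_col='id',        # use the 'id' column as the row index
    skiprows=[1],          # skip the second line of the file
    parse_dates=['date'],  # parse the 'date' column as datetimes
    nrows=100,             # read only the first 100 data rows
)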

Reading text files in pieces

Before continuing with large DataFrames, set the print-out to be smaller.

pd.options.display.max_rows = 10

We can then read in a large file and have it nicely printed out.

result = pd.read_csv(ex_dir + 'ex6.csv')
result
one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
4 0.354628 -0.133116 0.283763 -0.837063 Q
... ... ... ... ... ...
9995 2.311896 -0.417070 -1.409599 -0.515821 L
9996 -0.479893 -0.650419 0.745152 -0.646038 E
9997 0.523331 0.787112 0.486066 1.093156 K
9998 -0.362559 0.598894 -1.843201 0.887292 G
9999 -0.096376 -1.012999 -0.657431 -0.573315 0

10000 rows × 5 columns

We could have also just read in the first few rows.

pd.read_csv(ex_dir + 'ex6.csv', nrows=5)
one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
4 0.354628 -0.133116 0.283763 -0.837063 Q

Or the file could be read in pieces.

chunker = pd.read_csv(ex_dir + 'ex6.csv', chunksize=1000)
chunker
<pandas.io.parsers.TextFileReader at 0x122787050>
tot = pd.Series([])
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

tot = tot.sort_values(ascending=False)
tot.head()
E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
dtype: float64
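
A TextFileReader also has a get_chunk method for pulling an arbitrary number of rows on demand; a minimal sketch:

# Re-open the reader (the loop above exhausted it) and grab 10 rows.
chunker = pd.read_csv(ex_dir + 'ex6.csv', chunksize=1000)
piece = chunker.get_chunk(10)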

Writing data to text format

data = pd.read_csv(ex_dir + 'ex5.csv')
data
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo
data.to_csv(ex_dir + 'out.csv')
!cat assets/examples/out.csv
,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo
data.to_csv(ex_dir + 'out.csv', sep='|')
!cat assets/examples/out.csv
|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo
data.to_csv(ex_dir + 'out.csv', index=False, columns=['a', 'b', 'c'])
!cat assets/examples/out.csv
a,b,c
1,2,3.0
5,6,
9,10,11.0
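
By default, missing values appear as empty strings in the output; they can be written with an explicit sentinel via na_rep. A minimal sketch writing to sys.stdout so the result prints to the console instead of a file:

import sys

# Represent missing values as 'NULL' rather than empty fields.
data.to_csv(sys.stdout, na_rep='NULL')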

JSON data

JSON stands for "JavaScript Object Notation" and is very nearly valid Python code.

obj = """
    {"name": "Wes",
     "places_lived": ["United States", "Spain", "Germany"],
     "pet": null,
     "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
                  {"name": "Katie", "age": 38,
                   "pets": ["Sixes", "Stache", "Cisco"]}]
} """
obj
'\n    {"name": "Wes",\n     "places_lived": ["United States", "Spain", "Germany"],\n     "pet": null,\n     "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},\n                  {"name": "Katie", "age": 38,\n                   "pets": ["Sixes", "Stache", "Cisco"]}]\n} '

There are many JSON parsing libraries, but we will use the built-in json module.

import json

result = json.loads(obj)
result
{'name': 'Wes',
 'places_lived': ['United States', 'Spain', 'Germany'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
  {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}
asjson = json.dumps(result)
asjson
'{"name": "Wes", "places_lived": ["United States", "Spain", "Germany"], "pet": null, "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}]}'

There are many ways to go from JSON to DataFrame. One way is to pass a list of dictionaries and select a subset of the data fields.

siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])
siblings
name age
0 Scott 30
1 Katie 38

There is also the read_json() function from pandas.

!cat assets/examples/example.json
[{"a": 1, "b": 2, "c": 3},
 {"a": 4, "b": 5, "c": 6},
 {"a": 7, "b": 8, "c": 9}]
data = pd.read_json(ex_dir + 'example.json')
data
a b c
0 1 2 3
1 4 5 6
2 7 8 9
data.to_json()
'{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}'
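
to_json supports several layouts through its orient argument; for example, orient='records' emits one JSON object per row:

# 'records' orient: a list of row dicts instead of column-keyed mappings.
data.to_json(orient='records')
# '[{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]'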

XML and HTML: web scraping

For this section, we installed a few popular libraries for reading and writing HTML and XML:

conda install lxml beautifulsoup4 html5lib

The pandas read_html() function searches for and parses tabular data, often within <table></table> tags.

tables = pd.read_html(ex_dir + 'fdic_failed_bank_list.html')
len(tables)
1
failures = tables[0]
failures.head()
Bank Name City ST CERT Acquiring Institution Closing Date Updated Date
0 Allied Bank Mulberry AR 91 Today's Bank September 23, 2016 November 17, 2016
1 The Woodbury Banking Company Woodbury GA 11297 United Bank August 19, 2016 November 17, 2016
2 First CornerStone Bank King of Prussia PA 35312 First-Citizens Bank & Trust Company May 6, 2016 September 6, 2016
3 Trust Company Bank Memphis TN 9956 The Bank of Fayette County April 29, 2016 September 6, 2016
4 North Milwaukee State Bank Milwaukee WI 20364 First-Citizens Bank & Trust Company March 11, 2016 June 16, 2016
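
From here one could, for example, tally bank failures by year by parsing the 'Closing Date' column; a sketch of that follow-up:

# Parse the closing dates, then count failures per calendar year.
close_timestamps = pd.to_datetime(failures['Closing Date'])
close_timestamps.dt.year.value_counts()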

XML can be parsed using lxml or BeautifulSoup. (The author includes an example of how one could do this; I just read it without taking notes, so only a rough sketch follows.)
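
A minimal sketch of the lxml.objectify approach (the file name and the 'record' tag are hypothetical; real documents will have their own structure):

from lxml import objectify

# Parse the XML document and get its root element.
parsed = objectify.parse(open('some_data.xml'))
root = parsed.getroot()

# Turn each <record> element's children into a dict, then build a DataFrame.
records = []
for element in root.record:
    records.append({child.tag: child.pyval for child in element.getchildren()})

df = pd.DataFrame(records)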

6.2 Binary Data Formats

One of the most common binary serialization protocols in Python is pickle. All pandas data structures have a to_pickle() method.

frame = pd.read_csv(ex_dir + 'ex1.csv')
frame
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
frame.to_pickle(ex_dir + 'frame_pickle')
pd.read_pickle(ex_dir + 'frame_pickle')
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

There are also the HDF5, MessagePack, bcolz, and Feather serialization formats.

HDF5

HDF5 is a file format intended for storing large scientific data arrays. It has interfaces for many other languages (including MATLAB).

frame = pd.DataFrame({'a': np.random.randn(100)})
store = pd.HDFStore(ex_dir + 'mydata.h5')
store['obj1'] = frame
store['obj1_col'] = frame['a']
store
<class 'pandas.io.pytables.HDFStore'>
File path: assets/examples/mydata.h5
store['obj1']
a
0 1.764052
1 0.400157
2 0.978738
3 2.240893
4 1.867558
... ...
95 0.706573
96 0.010500
97 1.785870
98 0.126912
99 0.401989

100 rows × 1 columns

HDFStore() supports two storage schemas, 'fixed' and 'table'. The latter is slower but supports query operations.

store.put('obj2', frame, format='table')
store.select('obj2', where=['index >= 10 and index <= 15'])
a
10 0.144044
11 1.454274
12 0.761038
13 0.121675
14 0.443863
15 0.333674
store.close()

The read_hdf() function provides easy access to an HDF5 file.

frame.to_hdf(ex_dir + 'mydata.h5', 'obj3', format='table')
pd.read_hdf(ex_dir + 'mydata.h5', 'obj3', where=['index < 5'])
a
0 1.764052
1 0.400157
2 0.978738
3 2.240893
4 1.867558

Reading Microsoft Excel files

pandas supports reading tabular data stored in Excel files (2003 and later). Internally, pandas uses the xlrd and openpyxl libraries to read XLS and XLSX files, respectively.

If multiple sheets of an Excel file will be used, it is likely faster to create an ExcelFile() once and parse sheets from it.

xlsx = pd.ExcelFile(ex_dir + 'ex1.xlsx')
xlsx.sheet_names
['Sheet1']
xlsx.parse(sheet_name='Sheet1')
Unnamed: 0 a b c d message
0 0 1 2 3 4 hello
1 1 5 6 7 8 world
2 2 9 10 11 12 foo

Otherwise, read_excel() returns a DataFrame directly.

frame = pd.read_excel(ex_dir + 'ex1.xlsx')
frame
Unnamed: 0 a b c d message
0 0 1 2 3 4 hello
1 1 5 6 7 8 world
2 2 9 10 11 12 foo

Excel files can be written using an ExcelWriter object and to_excel() or by passing the file name to to_excel().

writer = pd.ExcelWriter(ex_dir + 'ex2.xlsx')
frame.to_excel(writer, 'Sheet1')
writer.save()
frame.to_excel(ex_dir + 'ex2.xlsx')
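
ExcelWriter also works as a context manager, which saves and closes the workbook automatically:

# The with-block handles saving the file on exit.
with pd.ExcelWriter(ex_dir + 'ex2.xlsx') as writer:
    frame.to_excel(writer, sheet_name='Sheet1')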

Interacting with Web APIs

One of the most popular libraries for interacting with web APIs is requests (Real Python put together a tutorial, too).

This example requests the last 30 issues from the pandas GitHub page.

import requests

url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)
resp
<Response [200]>
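
Before parsing, it is good practice to confirm the request succeeded; raise_for_status raises an exception for 4xx/5xx responses:

resp.raise_for_status()  # no-op on success, raises HTTPError otherwise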

The response can then be parsed to JSON.

data = resp.json()  # convert to JSON
data[0]  # show first issue
{'url': 'https://api.github.com/repos/pandas-dev/pandas/issues/29411',
 'repository_url': 'https://api.github.com/repos/pandas-dev/pandas',
 'labels_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/29411/labels{/name}',
 'comments_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/29411/comments',
 'events_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/29411/events',
 'html_url': 'https://github.com/pandas-dev/pandas/issues/29411',
 'id': 517663592,
 'node_id': 'MDU6SXNzdWU1MTc2NjM1OTI=',
 'number': 29411,
 'title': 'Memory leak in Dataframe.memory_usage',
 'user': {'login': 'hyfjjjj',
  'id': 7194638,
  'node_id': 'MDQ6VXNlcjcxOTQ2Mzg=',
  'avatar_url': 'https://avatars2.githubusercontent.com/u/7194638?v=4',
  'gravatar_id': '',
  'url': 'https://api.github.com/users/hyfjjjj',
  'html_url': 'https://github.com/hyfjjjj',
  'followers_url': 'https://api.github.com/users/hyfjjjj/followers',
  'following_url': 'https://api.github.com/users/hyfjjjj/following{/other_user}',
  'gists_url': 'https://api.github.com/users/hyfjjjj/gists{/gist_id}',
  'starred_url': 'https://api.github.com/users/hyfjjjj/starred{/owner}{/repo}',
  'subscriptions_url': 'https://api.github.com/users/hyfjjjj/subscriptions',
  'organizations_url': 'https://api.github.com/users/hyfjjjj/orgs',
  'repos_url': 'https://api.github.com/users/hyfjjjj/repos',
  'events_url': 'https://api.github.com/users/hyfjjjj/events{/privacy}',
  'received_events_url': 'https://api.github.com/users/hyfjjjj/received_events',
  'type': 'User',
  'site_admin': False},
 'labels': [],
 'state': 'open',
 'locked': False,
 'assignee': None,
 'assignees': [],
 'milestone': None,
 'comments': 0,
 'created_at': '2019-11-05T09:52:21Z',
 'updated_at': '2019-11-05T09:52:21Z',
 'closed_at': None,
 'author_association': 'NONE',
 'body': '#### Code Sample, a copy-pastable example if possible\r\n\r\n```python\r\nimport numpy as np\r\nimport pandas as pd\r\nimport gc\r\nimport os\r\nimport psutil\r\n\r\ndef get_process_memory():\r\n  return round(psutil.Process(os.getpid()).memory_info().rss / float(2 ** 20), 2)\r\n\r\ntest_dict = {}\r\nfor i in range(0, 50):\r\n  test_dict[i] = np.empty(10)\r\n\r\ndfs = []\r\nfor i in range(0, 1000):\r\n  df = pd.DataFrame(test_dict)\r\n  dfs.append(df)\r\n\r\ngc.collect()\r\n# before\r\nprint(\'memory usage (before "memory_usage"):\\t{} MB\'.format(get_process_memory()))\r\n\r\nfor df in dfs:\r\n  df.memory_usage(index=True, deep=True)\r\n\r\ngc.collect()\r\n# after\r\nprint(\'memory usage (after "memory_usage"):\\t{} MB\'.format(get_process_memory()))\r\n\r\n```\r\n#### Problem description\r\n\r\nDataframe\'s memory_usage function has memory leak. Memory usage after executing \'memory_usage\' function  should be the same as before.\r\n\r\n<img width="399" alt="截屏2019-11-05下午5 44 25" src="https://user-images.githubusercontent.com/7194638/68196715-f390ce00-fff3-11e9-939a-e84d850673e8.png">\r\n\r\n#### Expected Output\r\n\r\nNone\r\n\r\n#### Output of ``pd.show_versions()``\r\n\r\n<details>\r\n\r\nINSTALLED VERSIONS\r\n------------------\r\ncommit: None\r\npython: 2.7.16.final.0\r\npython-bits: 64\r\nOS: Darwin\r\nOS-release: 19.0.0\r\nmachine: x86_64\r\nprocessor: i386\r\nbyteorder: little\r\nLC_ALL: None\r\nLANG: zh_CN.UTF-8\r\nLOCALE: None.None\r\n\r\npandas: 0.24.2\r\npytest: None\r\npip: 19.3.1\r\nsetuptools: 19.6.1\r\nCython: 0.29.13\r\nnumpy: 1.16.5\r\nscipy: None\r\npyarrow: None\r\nxarray: None\r\nIPython: None\r\nsphinx: None\r\npatsy: None\r\ndateutil: 2.8.1\r\npytz: 2019.3\r\nblosc: None\r\nbottleneck: None\r\ntables: None\r\nnumexpr: None\r\nfeather: None\r\nmatplotlib: None\r\nopenpyxl: None\r\nxlrd: None\r\nxlwt: None\r\nxlsxwriter: None\r\nlxml.etree: None\r\nbs4: None\r\nhtml5lib: None\r\nsqlalchemy: None\r\npymysql: None\r\npsycopg2: None\r\njinja2: None\r\ns3fs: None\r\nfastparquet: None\r\npandas_gbq: None\r\npandas_datareader: None\r\ngcsfs: None\r\n\r\n</details>\r\n'}

Each element in data is a dictionary containing a single GitHub issue. This can be turned into a DataFrame.

issues = pd.DataFrame(data, columns=['number', 'title', 'labels', 'state'])
issues
number title labels state
0 29411 Memory leak in Dataframe.memory_usage [] open
1 29410 Fixed SS03 errors [] open
2 29409 Formatting issues with column width truncation... [{'id': 13101118, 'node_id': 'MDU6TGFiZWwxMzEw... open
3 29408 DataFrame.equals incorrect `See Also` section ... [{'id': 134699, 'node_id': 'MDU6TGFiZWwxMzQ2OT... open
4 29406 CLN: assorted cleanups [] open
... ... ... ... ...
25 29364 26302 add typing to assert star equal funcs [{'id': 1280988427, 'node_id': 'MDU6TGFiZWwxMj... open
26 29361 TYPING: scalar type that matches lib.is_scalar [{'id': 1280988427, 'node_id': 'MDU6TGFiZWwxMj... open
27 29357 ensure consistent structure for groupby on ind... [{'id': 233160, 'node_id': 'MDU6TGFiZWwyMzMxNj... open
28 29356 TST: new test for subset of a MultiIndex dtype [{'id': 127685, 'node_id': 'MDU6TGFiZWwxMjc2OD... open
29 29355 TST: Test type issue fix in empty groupby from... [{'id': 78527356, 'node_id': 'MDU6TGFiZWw3ODUy... open

30 rows × 4 columns