Chapter 6. Data Loading, Storage, and File Formats

The three main ways to access data are by reading on-disk files, loading from databases, and requesting data via a web API.

6.1 Reading and writing data in text format

There are various ways to read tabular data into a pandas DataFrame. The functions I am most likely to use frequently are read_csv(), read_table(), read_excel(), read_pickle(), read_json(), and read_sql().

These functions generally perform a few similar processes on the input file:

  1. Index the rows and columns.
  2. Type inference and data conversion.
  3. Datetime parsing.
  4. Fixing unclean data, such as footers and numeric values containing commas.

These functions have many parameters for adjusting to the quirks of the data; therefore, if you run into a problem, look through the documentation to see if any of the parameters are made for fixing it.

Here are some simple examples of reading data in with pandas.

!cat assets/examples/ex1.csv
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
import pandas as pd
import numpy as np

np.random.seed(0)

ex_dir = 'assets/examples/'
df = pd.read_csv(ex_dir + 'ex1.csv')
df
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
!cat assets/examples/ex2.csv
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
pd.read_csv(ex_dir + 'ex2.csv', header=None)
0 1 2 3 4
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
pd.read_csv(ex_dir + 'ex2.csv', names=['a', 'b', 'c', 'd', 'message'])
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
names = ['a', 'b', 'c', 'd', 'message']
pd.read_csv(ex_dir + 'ex2.csv', names=names, index_col='message')
a b c d
message
hello 1 2 3 4
world 5 6 7 8
foo 9 10 11 12

A DataFrame can be read in with a hierarchical index by passing multiple columns to index_col.

!cat assets/examples/csv_mindex.csv
key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
parsed = pd.read_csv(ex_dir + 'csv_mindex.csv', index_col=['key1', 'key2'])
parsed
value1 value2
key1 key2
one a 1 2
b 3 4
c 5 6
d 7 8
two a 9 10
b 11 12
c 13 14
d 15 16

Troublesome rows can be skipped explicitly.

!cat assets/examples/ex4.csv
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
pd.read_csv(ex_dir + 'ex4.csv')
# hey!
a b c d message
# just wanted to make things more difficult for you NaN NaN NaN NaN
# who reads CSV files with computers anyway? NaN NaN NaN
1 2 3 4 hello
5 6 7 8 world
9 10 11 12 foo
pd.read_csv(ex_dir + 'ex4.csv', skiprows=[0, 2, 3])
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

The common missing-data sentinels, such as NA and NULL, are automatically identified by pandas.

!cat assets/examples/ex5.csv
something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
result = pd.read_csv(ex_dir + 'ex5.csv')
result
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo
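
Different NA sentinels can be specified per column by passing a dict to na_values; a minimal sketch reusing ex5.csv:

# Per-column sentinels: treat 'foo' and 'NA' as missing in 'message',
# and 'two' as missing in 'something'.
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
pd.read_csv(ex_dir + 'ex5.csv', na_values=sentinels)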

Some of the other more common arguments used when reading in data with pandas are sep/delimiter, header, index_col, names, skiprows, parse_dates, nrows, chunksize (for reading in files piecemeal), and skipfooter.
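
As a rough sketch of how several of these arguments combine (the file name and column names here are hypothetical):

# A hypothetical tab-delimited file with 'id' and 'date' columns.
df = pd.read_csv(
    'some_file.tsv',
    sep='\t',              # tab-delimited rather than comma-delimited
    header=0,              # use the first row as the header
    index_col='id',        # use the 'id' column as the row index
    skiprows=[1],          # skip the second line of the file
    parse_dates=['date'],  # parse the 'date' column as datetimes
    nrows=100,             # read only the first 100 data rows
)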

Reading text files in pieces

Before continuing with large DataFrames, set the print-out to be smaller.

pd.options.display.max_rows = 10

We can then read in a large file and have it nicely printed out.

result = pd.read_csv(ex_dir + 'ex6.csv')
result
one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
4 0.354628 -0.133116 0.283763 -0.837063 Q
... ... ... ... ... ...
9995 2.311896 -0.417070 -1.409599 -0.515821 L
9996 -0.479893 -0.650419 0.745152 -0.646038 E
9997 0.523331 0.787112 0.486066 1.093156 K
9998 -0.362559 0.598894 -1.843201 0.887292 G
9999 -0.096376 -1.012999 -0.657431 -0.573315 0

10000 rows × 5 columns

We could have also just read in the first few rows.

pd.read_csv(ex_dir + 'ex6.csv', nrows=5)
one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
4 0.354628 -0.133116 0.283763 -0.837063 Q

Or the file could be read in pieces.

chunker = pd.read_csv(ex_dir + 'ex6.csv', chunksize=1000)
chunker
<pandas.io.parsers.TextFileReader at 0x122787050>
tot = pd.Series([])
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

tot = tot.sort_values(ascending=False)
tot.head()
E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
dtype: float64
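
A TextFileReader also has a get_chunk method for pulling an arbitrary number of rows on demand; a minimal sketch:

# Re-open the reader (the loop above exhausted it) and grab 10 rows.
chunker = pd.read_csv(ex_dir + 'ex6.csv', chunksize=1000)
piece = chunker.get_chunk(10)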

Writing data to text format

data = pd.read_csv(ex_dir + 'ex5.csv')
data
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo
data.to_csv(ex_dir + 'out.csv')
!cat assets/examples/out.csv
,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo
data.to_csv(ex_dir + 'out.csv', sep='|')
!cat assets/examples/out.csv
|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo
data.to_csv(ex_dir + 'out.csv', index=False, columns=['a', 'b', 'c'])
!cat assets/examples/out.csv
a,b,c
1,2,3.0
5,6,
9,10,11.0
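
By default, missing values appear as empty strings in the output; they can be written with an explicit sentinel via na_rep. A minimal sketch writing to sys.stdout so the result prints to the console instead of a file:

import sys

# Represent missing values as 'NULL' rather than empty fields.
data.to_csv(sys.stdout, na_rep='NULL')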

JSON data

JSON stands for "JavaScript Object Notation" and is very nearly valid Python code.

obj = """
    {"name": "Wes",
     "places_lived": ["United States", "Spain", "Germany"],
     "pet": null,
     "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
                  {"name": "Katie", "age": 38,
                   "pets": ["Sixes", "Stache", "Cisco"]}]
} """
obj
'\n    {"name": "Wes",\n     "places_lived": ["United States", "Spain", "Germany"],\n     "pet": null,\n     "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},\n                  {"name": "Katie", "age": 38,\n                   "pets": ["Sixes", "Stache", "Cisco"]}]\n} '

There are many JSON parsing libraries, but we will use the built-in json module.

import json

result = json.loads(obj)
result
{'name': 'Wes',
 'places_lived': ['United States', 'Spain', 'Germany'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
  {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}
asjson = json.dumps(result)
asjson
'{"name": "Wes", "places_lived": ["United States", "Spain", "Germany"], "pet": null, "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}]}'

There are many ways to go from JSON to DataFrame. One way is to pass a list of dictionaries and select a subset of the data fields.

siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])
siblings
name age
0 Scott 30
1 Katie 38

There is also the read_json() function from pandas.

!cat assets/examples/example.json
[{"a": 1, "b": 2, "c": 3},
 {"a": 4, "b": 5, "c": 6},
 {"a": 7, "b": 8, "c": 9}]
data = pd.read_json(ex_dir + 'example.json')
data
a b c
0 1 2 3
1 4 5 6
2 7 8 9
data.to_json()
'{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}'
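
to_json supports several layouts through its orient argument; for example, orient='records' emits one JSON object per row:

# 'records' orient: a list of row dicts instead of column-keyed mappings.
data.to_json(orient='records')
# '[{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]'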

XML and HTML: web scraping

For this section, we installed a few popular libraries for reading and writing HTML and XML:

conda install lxml beautifulsoup4 html5lib

The pandas read_html() function searches for and parses tabular data, often within <table></table> tags.

tables = pd.read_html(ex_dir + 'fdic_failed_bank_list.html')
len(tables)
1
failures = tables[0]
failures.head()
Bank Name City ST CERT Acquiring Institution Closing Date Updated Date
0 Allied Bank Mulberry AR 91 Today's Bank September 23, 2016 November 17, 2016
1 The Woodbury Banking Company Woodbury GA 11297 United Bank August 19, 2016 November 17, 2016
2 First CornerStone Bank King of Prussia PA 35312 First-Citizens Bank & Trust Company May 6, 2016 September 6, 2016
3 Trust Company Bank Memphis TN 9956 The Bank of Fayette County April 29, 2016 September 6, 2016
4 North Milwaukee State Bank Milwaukee WI 20364 First-Citizens Bank & Trust Company March 11, 2016 June 16, 2016
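
From here one could, for example, tally bank failures by year by parsing the 'Closing Date' column; a sketch of that follow-up:

# Parse the closing dates, then count failures per calendar year.
close_timestamps = pd.to_datetime(failures['Closing Date'])
close_timestamps.dt.year.value_counts()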

XML can be parsed using lxml or BeautifulSoup. (The author includes an example of how one could do this; I just read it without taking notes, so only a rough sketch follows.)
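
A minimal sketch of the lxml.objectify approach (the file name and the 'record' tag are hypothetical; real documents will have their own structure):

from lxml import objectify

# Parse the XML document and get its root element.
parsed = objectify.parse(open('some_data.xml'))
root = parsed.getroot()

# Turn each <record> element's children into a dict, then build a DataFrame.
records = []
for element in root.record:
    records.append({child.tag: child.pyval for child in element.getchildren()})

df = pd.DataFrame(records)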

6.2 Binary Data Formats

One of the most common binary serialization protocols in Python is pickle. All pandas data structures have a to_pickle() method.

frame = pd.read_csv(ex_dir + 'ex1.csv')
frame
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
frame.to_pickle(ex_dir + 'frame_pickle')
pd.read_pickle(ex_dir + 'frame_pickle')
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

There are also the HDF5, MessagePack, bcolz, and Feather serialization formats.

HDF5

HDF5 is a file format intended for storing large scientific data arrays. It has interfaces for many other languages (including MATLAB).

frame = pd.DataFrame({'a': np.random.randn(100)})
store = pd.HDFStore(ex_dir + 'mydata.h5')
store['obj1'] = frame
store['obj1_col'] = frame['a']
store
<class 'pandas.io.pytables.HDFStore'>
File path: assets/examples/mydata.h5
store['obj1']
a
0 1.764052
1 0.400157
2 0.978738
3 2.240893
4 1.867558
... ...
95 0.706573
96 0.010500
97 1.785870
98 0.126912
99 0.401989

100 rows × 1 columns

HDFStore() supports two storage schemas, 'fixed' and 'table'. The latter is slower but supports query operations.

store.put('obj2', frame, format='table')
store.select('obj2', where=['index >= 10 and index <= 15'])
a
10 0.144044
11 1.454274
12 0.761038
13 0.121675
14 0.443863
15 0.333674
store.close()

The read_hdf() function provides easy access to an HDF5 file.

frame.to_hdf(ex_dir + 'mydata.h5', 'obj3', format='table')
pd.read_hdf(ex_dir + 'mydata.h5', 'obj3', where=['index < 5'])
a
0 1.764052
1 0.400157
2 0.978738
3 2.240893
4 1.867558

Reading Microsoft Excel files

pandas supports reading tabular data stored in Excel files (2003 and later). Internally, pandas uses the xlrd and openpyxl libraries to read XLS and XLSX files, respectively.

If multiple sheets of an Excel file will be used, it is likely faster to create an ExcelFile() once and parse sheets from it.

xlsx = pd.ExcelFile(ex_dir + 'ex1.xlsx')
xlsx.sheet_names
['Sheet1']
xlsx.parse(sheet_name='Sheet1')
Unnamed: 0 a b c d message
0 0 1 2 3 4 hello
1 1 5 6 7 8 world
2 2 9 10 11 12 foo

Otherwise, read_excel() returns a DataFrame directly.

frame = pd.read_excel(ex_dir + 'ex1.xlsx')
frame
Unnamed: 0 a b c d message
0 0 1 2 3 4 hello
1 1 5 6 7 8 world
2 2 9 10 11 12 foo

Excel files can be written using an ExcelWriter object and to_excel() or by passing the file name to to_excel().

writer = pd.ExcelWriter(ex_dir + 'ex2.xlsx')
frame.to_excel(writer, 'Sheet1')
writer.save()
frame.to_excel(ex_dir + 'ex2.xlsx')
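
ExcelWriter also works as a context manager, which saves and closes the workbook automatically:

# The with-block handles saving the file on exit.
with pd.ExcelWriter(ex_dir + 'ex2.xlsx') as writer:
    frame.to_excel(writer, sheet_name='Sheet1')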

Interacting with Web APIs

One of the most popular libraries for interacting with web APIs is requests (Real Python put together a tutorial, too).

This example requests the last 30 issues from the pandas GitHub page.

import requests

url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)
resp
<Response [200]>
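
Before parsing, it is good practice to confirm the request succeeded; raise_for_status raises an exception for 4xx/5xx responses:

resp.raise_for_status()  # no-op on success, raises HTTPError otherwise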

The response can then be parsed to JSON.

data = resp.json()  # convert to JSON
data[0]  # show first issue
{'url': 'https://api.github.com/repos/pandas-dev/pandas/issues/29411',
 'repository_url': 'https://api.github.com/repos/pandas-dev/pandas',
 'labels_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/29411/labels{/name}',
 'comments_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/29411/comments',
 'events_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/29411/events',
 'html_url': 'https://github.com/pandas-dev/pandas/issues/29411',
 'id': 517663592,
 'node_id': 'MDU6SXNzdWU1MTc2NjM1OTI=',
 'number': 29411,
 'title': 'Memory leak in Dataframe.memory_usage',
 'user': {'login': 'hyfjjjj',
  'id': 7194638,
  'node_id': 'MDQ6VXNlcjcxOTQ2Mzg=',
  'avatar_url': 'https://avatars2.githubusercontent.com/u/7194638?v=4',
  'gravatar_id': '',
  'url': 'https://api.github.com/users/hyfjjjj',
  'html_url': 'https://github.com/hyfjjjj',
  'followers_url': 'https://api.github.com/users/hyfjjjj/followers',
  'following_url': 'https://api.github.com/users/hyfjjjj/following{/other_user}',
  'gists_url': 'https://api.github.com/users/hyfjjjj/gists{/gist_id}',
  'starred_url': 'https://api.github.com/users/hyfjjjj/starred{/owner}{/repo}',
  'subscriptions_url': 'https://api.github.com/users/hyfjjjj/subscriptions',
  'organizations_url': 'https://api.github.com/users/hyfjjjj/orgs',
  'repos_url': 'https://api.github.com/users/hyfjjjj/repos',
  'events_url': 'https://api.github.com/users/hyfjjjj/events{/privacy}',
  'received_events_url': 'https://api.github.com/users/hyfjjjj/received_events',
  'type': 'User',
  'site_admin': False},
 'labels': [],
 'state': 'open',
 'locked': False,
 'assignee': None,
 'assignees': [],
 'milestone': None,
 'comments': 0,
 'created_at': '2019-11-05T09:52:21Z',
 'updated_at': '2019-11-05T09:52:21Z',
 'closed_at': None,
 'author_association': 'NONE',
 'body': '#### Code Sample, a copy-pastable example if possible\r\n\r\n```python\r\nimport numpy as np\r\nimport pandas as pd\r\nimport gc\r\nimport os\r\nimport psutil\r\n\r\ndef get_process_memory():\r\n  return round(psutil.Process(os.getpid()).memory_info().rss / float(2 ** 20), 2)\r\n\r\ntest_dict = {}\r\nfor i in range(0, 50):\r\n  test_dict[i] = np.empty(10)\r\n\r\ndfs = []\r\nfor i in range(0, 1000):\r\n  df = pd.DataFrame(test_dict)\r\n  dfs.append(df)\r\n\r\ngc.collect()\r\n# before\r\nprint(\'memory usage (before "memory_usage"):\\t{} MB\'.format(get_process_memory()))\r\n\r\nfor df in dfs:\r\n  df.memory_usage(index=True, deep=True)\r\n\r\ngc.collect()\r\n# after\r\nprint(\'memory usage (after "memory_usage"):\\t{} MB\'.format(get_process_memory()))\r\n\r\n```\r\n#### Problem description\r\n\r\nDataframe\'s memory_usage function has memory leak. Memory usage after executing \'memory_usage\' function  should be the same as before.\r\n\r\n<img width="399" alt="截屏2019-11-05下午5 44 25" src="https://user-images.githubusercontent.com/7194638/68196715-f390ce00-fff3-11e9-939a-e84d850673e8.png">\r\n\r\n#### Expected Output\r\n\r\nNone\r\n\r\n#### Output of ``pd.show_versions()``\r\n\r\n<details>\r\n\r\nINSTALLED VERSIONS\r\n------------------\r\ncommit: None\r\npython: 2.7.16.final.0\r\npython-bits: 64\r\nOS: Darwin\r\nOS-release: 19.0.0\r\nmachine: x86_64\r\nprocessor: i386\r\nbyteorder: little\r\nLC_ALL: None\r\nLANG: zh_CN.UTF-8\r\nLOCALE: None.None\r\n\r\npandas: 0.24.2\r\npytest: None\r\npip: 19.3.1\r\nsetuptools: 19.6.1\r\nCython: 0.29.13\r\nnumpy: 1.16.5\r\nscipy: None\r\npyarrow: None\r\nxarray: None\r\nIPython: None\r\nsphinx: None\r\npatsy: None\r\ndateutil: 2.8.1\r\npytz: 2019.3\r\nblosc: None\r\nbottleneck: None\r\ntables: None\r\nnumexpr: None\r\nfeather: None\r\nmatplotlib: None\r\nopenpyxl: None\r\nxlrd: None\r\nxlwt: None\r\nxlsxwriter: None\r\nlxml.etree: None\r\nbs4: None\r\nhtml5lib: None\r\nsqlalchemy: None\r\npymysql: None\r\npsycopg2: None\r\njinja2: None\r\ns3fs: None\r\nfastparquet: None\r\npandas_gbq: None\r\npandas_datareader: None\r\ngcsfs: None\r\n\r\n</details>\r\n'}

Each element in data is a dictionary containing a single GitHub issue. This can be turned into a DataFrame.

issues = pd.DataFrame(data, columns=['number', 'title', 'labels', 'state'])
issues
number title labels state
0 29411 Memory leak in Dataframe.memory_usage [] open
1 29410 Fixed SS03 errors [] open
2 29409 Formatting issues with column width truncation... [{'id': 13101118, 'node_id': 'MDU6TGFiZWwxMzEw... open
3 29408 DataFrame.equals incorrect `See Also` section ... [{'id': 134699, 'node_id': 'MDU6TGFiZWwxMzQ2OT... open
4 29406 CLN: assorted cleanups [] open
... ... ... ... ...
25 29364 26302 add typing to assert star equal funcs [{'id': 1280988427, 'node_id': 'MDU6TGFiZWwxMj... open
26 29361 TYPING: scalar type that matches lib.is_scalar [{'id': 1280988427, 'node_id': 'MDU6TGFiZWwxMj... open
27 29357 ensure consistent structure for groupby on ind... [{'id': 233160, 'node_id': 'MDU6TGFiZWwyMzMxNj... open
28 29356 TST: new test for subset of a MultiIndex dtype [{'id': 127685, 'node_id': 'MDU6TGFiZWwxMjc2OD... open
29 29355 TST: Test type issue fix in empty groupby from... [{'id': 78527356, 'node_id': 'MDU6TGFiZWw3ODUy... open

30 rows × 4 columns