The three main ways to access data are reading in on-disk files, loading from databases, and requesting data via a web API.
There are various ways to read tabular data into a DataFrame with pandas. The functions I am most likely to use frequently are read_csv(), read_table(), read_excel(), read_pickle(), read_json(), and read_sql().
These functions generally perform a few similar processes on the input file:
- Index the rows and columns.
- Type inference and data conversion.
- Datetime parsing.
- Fixing unclean data such as footers, numbers containing commas, etc.
These functions have many parameters for adjusting to the quirks of the data; therefore, if you run into a problem, look through the documentation to see if any of the parameters are designed to fix it.
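For example, a single read_csv() call can handle datetime parsing, thousands separators, and custom missing-value markers at once. A minimal sketch (the inline CSV and column names below are made up for illustration):

```python
import pandas as pd
from io import StringIO

raw = StringIO('date,price,volume\n'
               '2019-01-01,"1,200.50",10\n'
               '2019-01-02,"1,305.75",NULL\n')
pd.read_csv(raw,
            parse_dates=['date'],  # parse the date column into datetime64
            thousands=',',         # strip thousands separators from numbers
            na_values=['NULL'])    # treat NULL as a missing-value sentinel
```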
Here are some simple examples of reading data in with pandas.
!cat assets/examples/ex1.csv
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
import pandas as pd
import numpy as np
np.random.seed(0)
ex_dir = 'assets/examples/'
df = pd.read_csv(ex_dir + 'ex1.csv')
df
| | a | b | c | d | message |
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |
!cat assets/examples/ex2.csv
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
pd.read_csv(ex_dir + 'ex2.csv', header=None)
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |
pd.read_csv(ex_dir + 'ex2.csv', names=['a', 'b', 'c', 'd', 'message'])
| | a | b | c | d | message |
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |
names = ['a', 'b', 'c', 'd', 'message']
pd.read_csv(ex_dir + 'ex2.csv', names=names, index_col='message')
| message | a | b | c | d |
|---|---|---|---|---|
| hello | 1 | 2 | 3 | 4 |
| world | 5 | 6 | 7 | 8 |
| foo | 9 | 10 | 11 | 12 |
A DataFrame can also be read in with a hierarchical (multi-level) index.
!cat assets/examples/csv_mindex.csv
key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
parsed = pd.read_csv(ex_dir + 'csv_mindex.csv', index_col=['key1', 'key2'])
parsed
| key1 | key2 | value1 | value2 |
|---|---|---|---|
| one | a | 1 | 2 |
| | b | 3 | 4 |
| | c | 5 | 6 |
| | d | 7 | 8 |
| two | a | 9 | 10 |
| | b | 11 | 12 |
| | c | 13 | 14 |
| | d | 15 | 16 |
Troublesome rows can be skipped explicitly.
!cat assets/examples/ex4.csv
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
pd.read_csv(ex_dir + 'ex4.csv')
| | | | | # hey! |
|---|---|---|---|---|
| a | b | c | d | message |
| # just wanted to make things more difficult for you | NaN | NaN | NaN | NaN |
| # who reads CSV files with computers | anyway? | NaN | NaN | NaN |
| 1 | 2 | 3 | 4 | hello |
| 5 | 6 | 7 | 8 | world |
| 9 | 10 | 11 | 12 | foo |
pd.read_csv(ex_dir + 'ex4.csv', skiprows=[0, 2, 3])
| | a | b | c | d | message |
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |
Common missing-data sentinels, such as NA and NULL, are automatically identified by pandas.
!cat assets/examples/ex5.csv
something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
result = pd.read_csv(ex_dir + 'ex5.csv')
result
| | something | a | b | c | d | message |
|---|---|---|---|---|---|---|
| 0 | one | 1 | 2 | 3.0 | 4 | NaN |
| 1 | two | 5 | 6 | NaN | 8 | world |
| 2 | three | 9 | 10 | 11.0 | 12 | foo |
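Custom sentinels can be added with the na_values argument, and a dict applies different sentinels to different columns. A small sketch (the sentinel choices here are only illustrative):

```python
sentinels = {'message': ['foo'], 'something': ['two']}
pd.read_csv(ex_dir + 'ex5.csv', na_values=sentinels)
```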
Some of the other more common arguments used when reading in data with pandas are sep/delimiter, header, index_col, names, skiprows, parse_dates, nrows, chunksize (for reading in files piecemeal), and skipfooter.
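As a quick sketch of two of these (the inline data is made up): sep accepts a regular expression such as whitespace, and skipfooter drops trailing rows, though it requires the slower Python parsing engine:

```python
from io import StringIO

raw = StringIO('a b c\n1  2  3\n4  5  6\ntotal 5 9\n')
pd.read_csv(raw, sep=r'\s+', skipfooter=1, engine='python')  # drop the footer row
```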
Before continuing with large DataFrames, set the print-out to be smaller.
pd.options.display.max_rows = 10
We can then read in a large file and have it nicely printed out.
result = pd.read_csv(ex_dir + 'ex6.csv')
result
| | one | two | three | four | key |
|---|---|---|---|---|---|
| 0 | 0.467976 | -0.038649 | -0.295344 | -1.824726 | L |
| 1 | -0.358893 | 1.404453 | 0.704965 | -0.200638 | B |
| 2 | -0.501840 | 0.659254 | -0.421691 | -0.057688 | G |
| 3 | 0.204886 | 1.074134 | 1.388361 | -0.982404 | R |
| 4 | 0.354628 | -0.133116 | 0.283763 | -0.837063 | Q |
| ... | ... | ... | ... | ... | ... |
| 9995 | 2.311896 | -0.417070 | -1.409599 | -0.515821 | L |
| 9996 | -0.479893 | -0.650419 | 0.745152 | -0.646038 | E |
| 9997 | 0.523331 | 0.787112 | 0.486066 | 1.093156 | K |
| 9998 | -0.362559 | 0.598894 | -1.843201 | 0.887292 | G |
| 9999 | -0.096376 | -1.012999 | -0.657431 | -0.573315 | 0 |

10000 rows × 5 columns
We could have also just read in the first few rows.
pd.read_csv(ex_dir + 'ex6.csv', nrows=5)
| | one | two | three | four | key |
|---|---|---|---|---|---|
| 0 | 0.467976 | -0.038649 | -0.295344 | -1.824726 | L |
| 1 | -0.358893 | 1.404453 | 0.704965 | -0.200638 | B |
| 2 | -0.501840 | 0.659254 | -0.421691 | -0.057688 | G |
| 3 | 0.204886 | 1.074134 | 1.388361 | -0.982404 | R |
| 4 | 0.354628 | -0.133116 | 0.283763 | -0.837063 | Q |
Or the file could be read in pieces.
chunker = pd.read_csv(ex_dir + 'ex6.csv', chunksize=1000)
chunker
<pandas.io.parsers.TextFileReader at 0x122787050>
tot = pd.Series([])
for piece in chunker:
    # accumulate value counts of the 'key' column across chunks
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)
tot.head()
E 368.0
X 364.0
L 346.0
O 343.0
Q 340.0
dtype: float64
data = pd.read_csv(ex_dir + 'ex5.csv')
data
| | something | a | b | c | d | message |
|---|---|---|---|---|---|---|
| 0 | one | 1 | 2 | 3.0 | 4 | NaN |
| 1 | two | 5 | 6 | NaN | 8 | world |
| 2 | three | 9 | 10 | 11.0 | 12 | foo |
data.to_csv(ex_dir + 'out.csv')
!cat assets/examples/out.csv
,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo
data.to_csv(ex_dir + 'out.csv', sep='|')
!cat assets/examples/out.csv
|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo
data.to_csv(ex_dir + 'out.csv', index=False, columns=['a', 'b', 'c'])
!cat assets/examples/out.csv
a,b,c
1,2,3.0
5,6,
9,10,11.0
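By default, missing values are written as empty strings; the na_rep argument writes a sentinel instead. A small example (writing to sys.stdout so the result just prints rather than creating a file):

```python
import sys
data.to_csv(sys.stdout, na_rep='NULL')  # empty cells are written as NULL
```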
JSON stands for "JavaScript Object Notation" and is very nearly valid Python code.
obj = """
{"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"],
"pet": null,
"siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
{"name": "Katie", "age": 38,
"pets": ["Sixes", "Stache", "Cisco"]}]
} """
obj
'\n {"name": "Wes",\n "places_lived": ["United States", "Spain", "Germany"],\n "pet": null,\n "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},\n {"name": "Katie", "age": 38,\n "pets": ["Sixes", "Stache", "Cisco"]}]\n} '
There are many JSON parsing libraries, but we will use 'json'.
import json
result = json.loads(obj)
result
{'name': 'Wes',
'places_lived': ['United States', 'Spain', 'Germany'],
'pet': None,
'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
{'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}
asjson = json.dumps(result)
asjson
'{"name": "Wes", "places_lived": ["United States", "Spain", "Germany"], "pet": null, "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}]}'
There are many ways to go from JSON to DataFrame. One way is to pass a list of dictionaries and select a subset of the data fields.
siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])
siblings
| | name | age |
|---|---|---|
| 0 | Scott | 30 |
| 1 | Katie | 38 |
There is also the read_json() function from pandas.
!cat assets/examples/example.json
[{"a": 1, "b": 2, "c": 3},
{"a": 4, "b": 5, "c": 6},
{"a": 7, "b": 8, "c": 9}]
data = pd.read_json(ex_dir + 'example.json')
data
| | a | b | c |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |
| 2 | 7 | 8 | 9 |
data.to_json()
'{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}'
For this section, we installed a few popular libraries for reading and writing HTML and XML.
conda install lxml beautifulsoup4 html5lib
The pandas read_html() function searches for and parses tabular data, often within <table></table> tags.
tables = pd.read_html(ex_dir + 'fdic_failed_bank_list.html')
len(tables)
failures = tables[0]
failures.head()
| | Bank Name | City | ST | CERT | Acquiring Institution | Closing Date | Updated Date |
|---|---|---|---|---|---|---|---|
| 0 | Allied Bank | Mulberry | AR | 91 | Today's Bank | September 23, 2016 | November 17, 2016 |
| 1 | The Woodbury Banking Company | Woodbury | GA | 11297 | United Bank | August 19, 2016 | November 17, 2016 |
| 2 | First CornerStone Bank | King of Prussia | PA | 35312 | First-Citizens Bank & Trust Company | May 6, 2016 | September 6, 2016 |
| 3 | Trust Company Bank | Memphis | TN | 9956 | The Bank of Fayette County | April 29, 2016 | September 6, 2016 |
| 4 | North Milwaukee State Bank | Milwaukee | WI | 20364 | First-Citizens Bank & Trust Company | March 11, 2016 | June 16, 2016 |
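From here, the result is an ordinary DataFrame. For instance, a rough sketch of counting bank failures by year (assuming the 'Closing Date' strings parse cleanly):

```python
close_timestamps = pd.to_datetime(failures['Closing Date'])
close_timestamps.dt.year.value_counts()
```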
XML can be parsed using lxml or BeautifulSoup. (The author includes an example of how one could do this; I just read it, but did not take notes.)
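As a stand-in, here is a minimal sketch of the lxml.objectify approach; the XML snippet and tag names are made up and are not the author's example:

```python
from lxml import objectify

xml = '''<PERFORMANCE>
  <INDICATOR><NAME>On-Time Trains</NAME><VALUE>95.1</VALUE></INDICATOR>
  <INDICATOR><NAME>Escalator Availability</NAME><VALUE>97.3</VALUE></INDICATOR>
</PERFORMANCE>'''

root = objectify.fromstring(xml)
records = [{'name': ind.NAME.text, 'value': float(ind.VALUE.text)}
           for ind in root.findall('INDICATOR')]  # one dict per <INDICATOR> element
pd.DataFrame(records)
```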
One of the most common binary serialization protocols in Python is pickle. All pandas data structures have a to_pickle() method.
frame = pd.read_csv(ex_dir + 'ex1.csv')
frame
| | a | b | c | d | message |
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |
frame.to_pickle(ex_dir + 'frame_pickle')
pd.read_pickle(ex_dir + 'frame_pickle')
| | a | b | c | d | message |
|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 5 | 6 | 7 | 8 | world |
| 2 | 9 | 10 | 11 | 12 | foo |
There are also the HDF5, MessagePack, bcolz, and Feather serialization formats.
HDF5 is a file format intended for storing large arrays of scientific data. It has interfaces for many other languages (including MATLAB).
frame = pd.DataFrame({'a': np.random.randn(100)})
store = pd.HDFStore(ex_dir + 'mydata.h5')
store['obj1'] = frame
store['obj1_col'] = frame['a']
store
<class 'pandas.io.pytables.HDFStore'>
File path: assets/examples/mydata.h5
store['obj1']
| | a |
|---|---|
| 0 | 1.764052 |
| 1 | 0.400157 |
| 2 | 0.978738 |
| 3 | 2.240893 |
| 4 | 1.867558 |
| ... | ... |
| 95 | 0.706573 |
| 96 | 0.010500 |
| 97 | 1.785870 |
| 98 | 0.126912 |
| 99 | 0.401989 |

100 rows × 1 columns
HDFStore() supports two storage schemas, 'fixed' and 'table'. The latter is slower, but it supports queries.
store.put('obj2', frame, format='table')
store.select('obj2', where=['index >= 10 and index <= 15'])
| | a |
|---|---|
| 10 | 0.144044 |
| 11 | 1.454274 |
| 12 | 0.761038 |
| 13 | 0.121675 |
| 14 | 0.443863 |
| 15 | 0.333674 |
store.close()
The read_hdf() function provides easy access to an HDF5 file.
frame.to_hdf(ex_dir + 'mydata.h5', 'obj3', format='table')
pd.read_hdf(ex_dir + 'mydata.h5', 'obj3', where=['index < 5'])
| | a |
|---|---|
| 0 | 1.764052 |
| 1 | 0.400157 |
| 2 | 0.978738 |
| 3 | 2.240893 |
| 4 | 1.867558 |
pandas supports reading tabular data stored in Excel (≥2003). Internally, pandas uses the xlrd and openpyxl libraries to read XLS and XLSX files.
If multiple sheets of an Excel file will be used, it is likely faster to use ExcelFile().
xlsx = pd.ExcelFile(ex_dir + 'ex1.xlsx')
xlsx.sheet_names
['Sheet1']
xlsx.parse(sheet_name='Sheet1')
| | Unnamed: 0 | a | b | c | d | message |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 1 | 5 | 6 | 7 | 8 | world |
| 2 | 2 | 9 | 10 | 11 | 12 | foo |
Otherwise, read_excel() returns a DataFrame directly.
frame = pd.read_excel(ex_dir + 'ex1.xlsx')
frame
| | Unnamed: 0 | a | b | c | d | message |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 | 4 | hello |
| 1 | 1 | 5 | 6 | 7 | 8 | world |
| 2 | 2 | 9 | 10 | 11 | 12 | foo |
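The 'Unnamed: 0' column appears to be the index that was saved into the Excel file; passing index_col=0 folds it back into the DataFrame index:

```python
pd.read_excel(ex_dir + 'ex1.xlsx', index_col=0)
```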
Excel files can be written using an ExcelWriter object with to_excel(), or by passing the file name directly to to_excel().
writer = pd.ExcelWriter(ex_dir + 'ex2.xlsx')
frame.to_excel(writer, 'Sheet1')
writer.save()
frame.to_excel(ex_dir + 'ex2.xlsx')
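ExcelWriter is mostly useful for writing several sheets into one workbook. A small sketch (the output file name and sheet names here are made up):

```python
with pd.ExcelWriter(ex_dir + 'ex3.xlsx') as writer:
    frame.to_excel(writer, sheet_name='data')                 # raw table
    frame.describe().to_excel(writer, sheet_name='summary')   # numeric summary
```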
One of the most popular libraries for interacting with web APIs is requests (Real Python put together a tutorial, too).
This example requests the last 30 issues from the pandas GitHub page.
import requests
url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)
resp
<Response [200]>
The response can then be parsed to JSON.
data = resp.json() # convert to JSON
data[0] # show first issue
{'url': 'https://api.github.com/repos/pandas-dev/pandas/issues/29411',
'repository_url': 'https://api.github.com/repos/pandas-dev/pandas',
'labels_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/29411/labels{/name}',
'comments_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/29411/comments',
'events_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/29411/events',
'html_url': 'https://github.com/pandas-dev/pandas/issues/29411',
'id': 517663592,
'node_id': 'MDU6SXNzdWU1MTc2NjM1OTI=',
'number': 29411,
'title': 'Memory leak in Dataframe.memory_usage',
'user': {'login': 'hyfjjjj',
'id': 7194638,
'node_id': 'MDQ6VXNlcjcxOTQ2Mzg=',
'avatar_url': 'https://avatars2.githubusercontent.com/u/7194638?v=4',
'gravatar_id': '',
'url': 'https://api.github.com/users/hyfjjjj',
'html_url': 'https://github.com/hyfjjjj',
'followers_url': 'https://api.github.com/users/hyfjjjj/followers',
'following_url': 'https://api.github.com/users/hyfjjjj/following{/other_user}',
'gists_url': 'https://api.github.com/users/hyfjjjj/gists{/gist_id}',
'starred_url': 'https://api.github.com/users/hyfjjjj/starred{/owner}{/repo}',
'subscriptions_url': 'https://api.github.com/users/hyfjjjj/subscriptions',
'organizations_url': 'https://api.github.com/users/hyfjjjj/orgs',
'repos_url': 'https://api.github.com/users/hyfjjjj/repos',
'events_url': 'https://api.github.com/users/hyfjjjj/events{/privacy}',
'received_events_url': 'https://api.github.com/users/hyfjjjj/received_events',
'type': 'User',
'site_admin': False},
'labels': [],
'state': 'open',
'locked': False,
'assignee': None,
'assignees': [],
'milestone': None,
'comments': 0,
'created_at': '2019-11-05T09:52:21Z',
'updated_at': '2019-11-05T09:52:21Z',
'closed_at': None,
'author_association': 'NONE',
'body': '#### Code Sample, a copy-pastable example if possible\r\n\r\n```python\r\nimport numpy as np\r\nimport pandas as pd\r\nimport gc\r\nimport os\r\nimport psutil\r\n\r\ndef get_process_memory():\r\n return round(psutil.Process(os.getpid()).memory_info().rss / float(2 ** 20), 2)\r\n\r\ntest_dict = {}\r\nfor i in range(0, 50):\r\n test_dict[i] = np.empty(10)\r\n\r\ndfs = []\r\nfor i in range(0, 1000):\r\n df = pd.DataFrame(test_dict)\r\n dfs.append(df)\r\n\r\ngc.collect()\r\n# before\r\nprint(\'memory usage (before "memory_usage"):\\t{} MB\'.format(get_process_memory()))\r\n\r\nfor df in dfs:\r\n df.memory_usage(index=True, deep=True)\r\n\r\ngc.collect()\r\n# after\r\nprint(\'memory usage (after "memory_usage"):\\t{} MB\'.format(get_process_memory()))\r\n\r\n```\r\n#### Problem description\r\n\r\nDataframe\'s memory_usage function has memory leak. Memory usage after executing \'memory_usage\' function should be the same as before.\r\n\r\n<img width="399" alt="截屏2019-11-05下午5 44 25" src="https://user-images.githubusercontent.com/7194638/68196715-f390ce00-fff3-11e9-939a-e84d850673e8.png">\r\n\r\n#### Expected Output\r\n\r\nNone\r\n\r\n#### Output of ``pd.show_versions()``\r\n\r\n<details>\r\n\r\nINSTALLED VERSIONS\r\n------------------\r\ncommit: None\r\npython: 2.7.16.final.0\r\npython-bits: 64\r\nOS: Darwin\r\nOS-release: 19.0.0\r\nmachine: x86_64\r\nprocessor: i386\r\nbyteorder: little\r\nLC_ALL: None\r\nLANG: zh_CN.UTF-8\r\nLOCALE: None.None\r\n\r\npandas: 0.24.2\r\npytest: None\r\npip: 19.3.1\r\nsetuptools: 19.6.1\r\nCython: 0.29.13\r\nnumpy: 1.16.5\r\nscipy: None\r\npyarrow: None\r\nxarray: None\r\nIPython: None\r\nsphinx: None\r\npatsy: None\r\ndateutil: 2.8.1\r\npytz: 2019.3\r\nblosc: None\r\nbottleneck: None\r\ntables: None\r\nnumexpr: None\r\nfeather: None\r\nmatplotlib: None\r\nopenpyxl: None\r\nxlrd: None\r\nxlwt: None\r\nxlsxwriter: None\r\nlxml.etree: None\r\nbs4: None\r\nhtml5lib: None\r\nsqlalchemy: None\r\npymysql: None\r\npsycopg2: None\r\njinja2: None\r\ns3fs: None\r\nfastparquet: None\r\npandas_gbq: None\r\npandas_datareader: None\r\ngcsfs: None\r\n\r\n</details>\r\n'}
Each element in data is a dictionary containing a single GitHub issue. This can be turned into a DataFrame.
issues = pd.DataFrame(data, columns=['number', 'title', 'labels', 'state'])
issues
| | number | title | labels | state |
|---|---|---|---|---|
| 0 | 29411 | Memory leak in Dataframe.memory_usage | [] | open |
| 1 | 29410 | Fixed SS03 errors | [] | open |
| 2 | 29409 | Formatting issues with column width truncation... | [{'id': 13101118, 'node_id': 'MDU6TGFiZWwxMzEw... | open |
| 3 | 29408 | DataFrame.equals incorrect `See Also` section ... | [{'id': 134699, 'node_id': 'MDU6TGFiZWwxMzQ2OT... | open |
| 4 | 29406 | CLN: assorted cleanups | [] | open |
| ... | ... | ... | ... | ... |
| 25 | 29364 | 26302 add typing to assert star equal funcs | [{'id': 1280988427, 'node_id': 'MDU6TGFiZWwxMj... | open |
| 26 | 29361 | TYPING: scalar type that matches lib.is_scalar | [{'id': 1280988427, 'node_id': 'MDU6TGFiZWwxMj... | open |
| 27 | 29357 | ensure consistent structure for groupby on ind... | [{'id': 233160, 'node_id': 'MDU6TGFiZWwyMzMxNj... | open |
| 28 | 29356 | TST: new test for subset of a MultiIndex dtype | [{'id': 127685, 'node_id': 'MDU6TGFiZWwxMjc2OD... | open |
| 29 | 29355 | TST: Test type issue fix in empty groupby from... | [{'id': 78527356, 'node_id': 'MDU6TGFiZWw3ODUy... | open |

30 rows × 4 columns