Skip to content

Commit

Permalink
BUG: Categorical data fails to load from hdf when all columns are NaN (
Browse files Browse the repository at this point in the history
  • Loading branch information
ssche authored and jreback committed Dec 10, 2017
1 parent 1c1f507 commit 2db1cc0
Show file tree
Hide file tree
Showing 4 changed files with 32 additions and 4 deletions.
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -106,3 +106,4 @@ doc/build/html/index.html
doc/tmp.sv
doc/source/styled.xlsx
doc/source/templates/
env/
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.21.1.txt
Original file line number Diff line number Diff line change
Expand Up @@ -126,6 +126,7 @@ I/O
- Bug in :func:`read_json` not decoding when reading line deliminted JSON from S3 (:issue:`17200`)
- Bug in :func:`pandas.io.json.json_normalize` to avoid modification of ``meta`` (:issue:`18610`)
- Bug in :func:`to_latex` where repeated multi-index values were not printed even though a higher level index differed from the previous row (:issue:`14484`)
- Bug when reading NaN-only categorical columns in :class:`HDFStore` (:issue:`18413`)

Plotting
^^^^^^^^
Expand Down
15 changes: 11 additions & 4 deletions pandas/io/pytables.py
Original file line number Diff line number Diff line change
Expand Up @@ -2137,10 +2137,17 @@ def convert(self, values, nan_rep, encoding):
# if we have stored a NaN in the categories
# then strip it; in theory we could have BOTH
# -1s in the codes and nulls :<
mask = isna(categories)
if mask.any():
categories = categories[~mask]
codes[codes != -1] -= mask.astype(int).cumsum().values
if categories is None:
# Handle case of NaN-only categorical columns in which case
# the categories are an empty array; when this is stored,
# pytables cannot write a zero-len array, so on readback
# the categories would be None and `read_hdf()` would fail.
categories = Index([], dtype=np.float64)
else:
mask = isna(categories)
if mask.any():
categories = categories[~mask]
codes[codes != -1] -= mask.astype(int).cumsum().values

self.data = Categorical.from_codes(codes,
categories=categories,
Expand Down
19 changes: 19 additions & 0 deletions pandas/tests/io/test_pytables.py
Original file line number Diff line number Diff line change
Expand Up @@ -4928,6 +4928,25 @@ def test_categorical_conversion(self):
result = read_hdf(path, 'df', where='obsids=B')
tm.assert_frame_equal(result, expected)

def test_categorical_nan_only_columns(self):
# GH18413
# Check that read_hdf with categorical columns with NaN-only values can
# be read back.
df = pd.DataFrame({
'a': ['a', 'b', 'c', np.nan],
'b': [np.nan, np.nan, np.nan, np.nan],
'c': [1, 2, 3, 4],
'd': pd.Series([None] * 4, dtype=object)
})
df['a'] = df.a.astype('category')
df['b'] = df.b.astype('category')
df['d'] = df.b.astype('category')
expected = df
with ensure_clean_path(self.path) as path:
df.to_hdf(path, 'df', format='table', data_columns=True)
result = read_hdf(path, 'df')
tm.assert_frame_equal(result, expected)

def test_duplicate_column_name(self):
df = DataFrame(columns=["a", "a"], data=[[0, 0]])

Expand Down

0 comments on commit 2db1cc0

Please sign in to comment.