Commit 36c1104

zhezherun authored and gfyoung committed
BUG: Fixing memory leaks in read_csv
* Move allocation of na_hashset down to avoid a leak on continue
* Delete na_hashset if there is an exception
* Clean up table before raising an exception

Closes gh-21353.
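The first two points of the commit message can be illustrated with a minimal Python sketch. It is not pandas code: `LIVE`, `make_na_set`, `free_na_set`, `convert`, and `parse_columns` are hypothetical stand-ins for the C-level hash set and the column loop. The set is allocated only after the early `continue`, and released in a `finally` block so that an exception during conversion cannot leak it.

```python
# Hypothetical stand-ins for the C-level NA hash set; none of these are pandas APIs.
LIVE = {"count": 0}  # number of NA sets currently allocated and not yet freed


def make_na_set(values):
    """Pretend to allocate a hash set of NA markers."""
    LIVE["count"] += 1
    return set(values)


def free_na_set(na_set):
    """Pretend to free a hash set of NA markers."""
    LIVE["count"] -= 1


def convert(tokens, na_set):
    """Map NA markers to None and everything else to int (may raise ValueError)."""
    return [None if tok in na_set else int(tok) for tok in tokens]


def parse_columns(columns, skip=()):
    results = {}
    for name, tokens in columns.items():
        if name in skip:
            # Nothing has been allocated yet, so skipping leaks nothing.
            continue

        # Allocate as late as possible, right before the set is needed.
        na_set = make_na_set(["NaN", ""])
        try:
            results[name] = convert(tokens, na_set)
        finally:
            # Runs on success and on error alike.
            free_na_set(na_set)
    return results
```

After any call to `parse_columns`, whether it returns or raises, `LIVE["count"]` is back at zero, which is the property the commit aims for in the Cython code.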
1 parent 0ab8eb2 commit 36c1104

2 files changed

+29
-18
lines changed

doc/source/whatsnew/v0.24.0.rst

+1
@@ -1382,6 +1382,7 @@ Notice how we now instead output ``np.nan`` itself instead of a stringified form
 - Bug in :func:`DataFrame.to_string()` that caused representations of :class:`DataFrame` to not take up the whole window (:issue:`22984`)
 - Bug in :func:`DataFrame.to_csv` where a single level MultiIndex incorrectly wrote a tuple. Now just the value of the index is written (:issue:`19589`).
 - Bug in :meth:`HDFStore.append` when appending a :class:`DataFrame` with an empty string column and ``min_itemsize`` < 8 (:issue:`12242`)
+- Bug in :func:`read_csv()` in which memory leaks occurred in the C engine when parsing ``NaN`` values due to insufficient cleanup on completion or error (:issue:`21353`)
 - Bug in :func:`read_csv()` in which incorrect error messages were being raised when ``skipfooter`` was passed in along with ``nrows``, ``iterator``, or ``chunksize`` (:issue:`23711`)
 - Bug in :meth:`read_csv()` in which :class:`MultiIndex` index names were being improperly handled in the cases when they were not provided (:issue:`23484`)
 - Bug in :meth:`read_html()` in which the error message was not displaying the valid flavors when an invalid one was provided (:issue:`23549`)

pandas/_libs/parsers.pyx

+28 -18
@@ -1070,18 +1070,6 @@ cdef class TextReader:
 
             conv = self._get_converter(i, name)
 
-            # XXX
-            na_flist = set()
-            if self.na_filter:
-                na_list, na_flist = self._get_na_list(i, name)
-                if na_list is None:
-                    na_filter = 0
-                else:
-                    na_filter = 1
-                    na_hashset = kset_from_list(na_list)
-            else:
-                na_filter = 0
-
             col_dtype = None
             if self.dtype is not None:
                 if isinstance(self.dtype, dict):
@@ -1106,13 +1094,34 @@ cdef class TextReader:
                                               self.c_encoding)
                 continue
 
-            # Should return as the desired dtype (inferred or specified)
-            col_res, na_count = self._convert_tokens(
-                i, start, end, name, na_filter, na_hashset,
-                na_flist, col_dtype)
+            # Collect the list of NaN values associated with the column.
+            # If we aren't supposed to do that, or none are collected,
+            # we set `na_filter` to `0` (`1` otherwise).
+            na_flist = set()
+
+            if self.na_filter:
+                na_list, na_flist = self._get_na_list(i, name)
+                if na_list is None:
+                    na_filter = 0
+                else:
+                    na_filter = 1
+                    na_hashset = kset_from_list(na_list)
+            else:
+                na_filter = 0
 
-            if na_filter:
-                self._free_na_set(na_hashset)
+            # Attempt to parse tokens and infer dtype of the column.
+            # Should return as the desired dtype (inferred or specified).
+            try:
+                col_res, na_count = self._convert_tokens(
+                    i, start, end, name, na_filter, na_hashset,
+                    na_flist, col_dtype)
+            finally:
+                # gh-21353
+                #
+                # Cleanup the NaN hash that we generated
+                # to avoid memory leaks.
+                if na_filter:
+                    self._free_na_set(na_hashset)
 
             if upcast_na and na_count > 0:
                 col_res = _maybe_upcast(col_res)
@@ -2059,6 +2068,7 @@ cdef kh_str_t* kset_from_list(list values) except NULL:
 
         # None creeps in sometimes, which isn't possible here
         if not isinstance(val, bytes):
+            kh_destroy_str(table)
             raise ValueError('Must be all encoded bytes')
 
         k = kh_put_str(table, PyBytes_AsString(val), &ret)
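The last hunk frees the partially built table before raising. The pattern can be sketched in plain Python as follows. `Table` and `kset_from_list_sketch` are hypothetical names used only for illustration; the real code operates on a khash `kh_str_t*` and calls `kh_destroy_str`.

```python
class Table:
    """Hypothetical stand-in for a manually managed C hash table."""

    live = 0  # tables created and not yet destroyed

    def __init__(self):
        Table.live += 1
        self.keys = set()

    def destroy(self):
        Table.live -= 1


def kset_from_list_sketch(values):
    """Build a table of byte-string keys; on bad input, free it before raising."""
    table = Table()
    for val in values:
        if not isinstance(val, bytes):
            # The caller never receives `table` on this path, so nobody else
            # can release it: destroy it here, then raise.
            table.destroy()
            raise ValueError("Must be all encoded bytes")
        table.keys.add(val)
    return table
```

On success the caller owns the returned table and is responsible for destroying it. On failure the function itself has already released it, so no table outlives the error.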
