Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
# Credits: https://github.com/pandas-dev/pandas/issues/21353
import gc
import os.path
import psutil
import sys
import pandas as pd
from memory_profiler import profile
@profile
def read_from_file():
    memory_before_read = psutil.Process().memory_info().rss / 1024 ** 2
    print("memory_before_read: ", memory_before_read)

    df = pd.read_csv(file_name)

    print("df.shape: ", df.shape)
    memory_after_read = psutil.Process().memory_info().rss / 1024 ** 2
    print("memory_after_read: ", memory_after_read)

    del df

    # Attempt to trace memory leak
    gc.collect()

    memory_after_gc = psutil.Process().memory_info().rss / 1024 ** 2
    print("memory_after_gc: ", memory_after_gc)
    print("memory leak: ", memory_after_gc - memory_before_read)

    if len(gc.garbage) > 0:
        # Inspect the output of the garbage collector
        print("-" * 120)
        print("ERROR: gc.garbage:")
        print("-" * 120)
        print(gc.garbage)
        print()
        '''
        The output of the garbage collector will show you the objects that were not successfully freed up by the
        garbage collector. These objects are likely the source of the memory leak.

        Once you have identified the objects that are causing the memory leak, you can inspect your code to
        determine why these objects are not being garbage collected properly. Common causes of memory leaks
        include circular references, which occur when objects reference each other in a way that prevents them
        from being garbage collected, and forgetting to close file handles or database connections.

        You can also use third-party tools like memory_profiler or objgraph to help you track down memory leaks.
        '''


if __name__ == '__main__':
    '''
    Usage: python ./memory_leak_when_read_csv.py 10000000 20
    '''
    m = int(sys.argv[1])
    n = int(sys.argv[2])
    print("pd.__version__: ", pd.__version__)

    file_name = 'df_{}_{}.csv'.format(m, n)

    # Generate the test CSV once: a header row c0..c{n-1}, then m rows of 1s
    if not os.path.exists(file_name):
        mode = "wt"
        with open(file_name, mode) as f:
            for i in range(n - 1):
                f.write('c' + str(i) + ',')
            f.write('c' + str(n - 1) + '\n')
            for j in range(m):
                for i in range(n - 1):
                    f.write('1,')
                f.write('1\n')

    read_from_file()
Issue Description
pd.read_csv() still leaks memory, even though the fix from https://github.com/pandas-dev/pandas/pull/24837/commits is still in place on the main branch.
Sample output in my run:
python memory_leak_when_read_csv.py 10000000 20
pd.__version__: 1.5.3
memory_before_read: 13.765625
df.shape: (10000000, 20)
memory_after_read: 1549.0234375
memory_after_gc: 41.01953125
memory leak: 27.25390625
Filename: memory_leak_when_read_csv.py
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    13     13.7 MiB     13.7 MiB           1   @profile
    14                                         def read_from_file():
    15     13.8 MiB      0.1 MiB           1       memory_before_read = psutil.Process().memory_info().rss / 1024 ** 2
    16     13.8 MiB      0.0 MiB           1       print("memory_before_read: ", memory_before_read)
    17
    18   1549.0 MiB   1535.2 MiB           1       df = pd.read_csv(file_name)
    19
    20   1549.0 MiB      0.0 MiB           1       print("df.shape: ", df.shape)
    21   1549.0 MiB      0.0 MiB           1       memory_after_read = psutil.Process().memory_info().rss / 1024 ** 2
    22   1549.0 MiB      0.0 MiB           1       print("memory_after_read: ", memory_after_read)
    23
    24     23.1 MiB  -1525.9 MiB           1       del df
    25
    26                                             # Attempt to trace memory leak
    27     41.0 MiB     17.9 MiB           1       gc.collect()
    28
    29     41.0 MiB      0.0 MiB           1       memory_after_gc = psutil.Process().memory_info().rss / 1024 ** 2
    30     41.0 MiB      0.0 MiB           1       print("memory_after_gc: ", memory_after_gc)
    31     41.0 MiB      0.0 MiB           1       print("memory leak: ", memory_after_gc - memory_before_read)
    32
    33     41.0 MiB      0.0 MiB           1       if len(gc.garbage) > 0:
    34                                                 # Inspect the output of the garbage collector
    35                                                 print("-" * 120)
    36                                                 print("ERROR: gc.garbage:")
    37                                                 print("-" * 120)
    38                                                 print(gc.garbage)
    39                                                 print()
    40                                                 '''
    41                                                 The output of the garbage collector will show you the objects that were not successfully freed up by the
    42                                                 garbage collector. These objects are likely the source of the memory leak.
    43
    44                                                 Once you have identified the objects that are causing the memory leak, you can inspect your code to
    45                                                 determine why these objects are not being garbage collected properly. Common causes of memory leaks
    46                                                 include circular references, which occur when objects reference each other in a way that prevents them
    47                                                 from being garbage collected, and forgetting to close file handles or database connections.
    48
    49                                                 You can also use third-party tools like memory_profiler or objgraph to help you track down memory leaks.
    50                                                 '''
Process finished with exit code 0
In the run above, 27.25390625 MiB of memory is still held by the process after del df and gc.collect(). IMHO, that is a huge leak for a CSV file of 391 MB.
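To help triage, the measurement above could be repeated over several read/delete cycles to see whether the retained memory grows with every call or stays at a one-time offset. The following is only a minimal sketch: `rss_after_each_cycle` and `keeps_growing` are hypothetical helper names, not part of pandas or psutil, and the 1 MiB tolerance is an arbitrary assumption.

```python
import gc


def rss_after_each_cycle(work, read_rss_mib, cycles=5):
    """Call work() repeatedly, drop its result, collect, and record RSS each time.

    work         -- hypothetical callable, e.g. lambda: pd.read_csv(file_name)
    read_rss_mib -- hypothetical callable returning the current RSS in MiB, e.g.
                    lambda: psutil.Process().memory_info().rss / 1024 ** 2
    """
    readings = []
    for _ in range(cycles):
        result = work()
        del result          # drop the only reference, as `del df` does above
        gc.collect()        # same collection step as in the script above
        readings.append(read_rss_mib())
    return readings


def keeps_growing(readings, tolerance_mib=1.0):
    """True if every reading exceeds the previous one by more than tolerance_mib."""
    if len(readings) < 2:
        return False
    return all(b - a > tolerance_mib for a, b in zip(readings, readings[1:]))
```

With the script above, `work` and `read_rss_mib` would be the two lambdas shown in the docstring. Readings that climb on every cycle would point to a per-call leak, and the per-cycle numbers could be attached to this report.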
Expected Behavior
There should be no memory leak after del df and after garbage is collected.
This memory leak was originally detected through a VPS crash in the cloud. After some digging, I found that it occurs on both Linux and Windows, so it should be reproducible on any platform.
The leak most likely originates from a C-level buffer that is either over-allocated or not fully freed. IMHO you could consider running the memory leak detection and management tools of your choice to eliminate memory leaks from the C code.
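As a first step before reaching for native tools, Python's standard-library tracemalloc can show which allocation sites still hold memory after the object is deleted. This is a rough sketch, not a definitive diagnosis: `top_residual_allocations` is a hypothetical helper, and tracemalloc only sees allocations routed through Python's memory allocators, so memory obtained by C code that bypasses them would not appear here.

```python
import gc
import tracemalloc


def top_residual_allocations(work, limit=5):
    """Run work(), discard its result, and list the allocation sites that grew.

    work -- hypothetical callable, e.g. lambda: pd.read_csv(file_name)
    Returns a list of (traceback_text, size_diff_in_bytes), largest change first.
    """
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        result = work()
        del result
        gc.collect()
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()

    # compare_to() returns StatisticDiff entries sorted by the size of the change
    stats = after.compare_to(before, "lineno")
    return [(str(stat.traceback), stat.size_diff) for stat in stats[:limit]]
```

If nothing significant shows up in this list while RSS stays elevated, that would be consistent with the memory being held below the Python allocator level, which is where the native leak detection tools mentioned above come in.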
Related issues FYR:
#21353
#37031
#49582
I hope you will help prioritize this issue, as I expect more users will run into it as time passes.