BUG: Memory leak after pd.read_csv() with default parameters #51667

@viper7882

Description

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

# Credits: https://github.com/pandas-dev/pandas/issues/21353
import gc
import os.path

import psutil
import sys

import pandas as pd

from memory_profiler import profile


@profile
def read_from_file():
    memory_before_read = psutil.Process().memory_info().rss / 1024 ** 2
    print("memory_before_read: ", memory_before_read)

    df = pd.read_csv(file_name)

    print("df.shape: ", df.shape)
    memory_after_read = psutil.Process().memory_info().rss / 1024 ** 2
    print("memory_after_read: ", memory_after_read)

    del df

    # Attempt to trace memory leak
    gc.collect()

    memory_after_gc = psutil.Process().memory_info().rss / 1024 ** 2
    print("memory_after_gc: ", memory_after_gc)
    print("memory leak: ", memory_after_gc - memory_before_read)

    if len(gc.garbage) > 0:
        # Inspect the output of the garbage collector
        print("-" * 120)
        print("ERROR: gc.garbage:")
        print("-" * 120)
        print(gc.garbage)
        print()
        '''
        The output of the garbage collector will show you the objects that were not successfully freed up by the
        garbage collector. These objects are likely the source of the memory leak.

        Once you have identified the objects that are causing the memory leak, you can inspect your code to
        determine why these objects are not being garbage collected properly. Common causes of memory leaks
        include circular references, which occur when objects reference each other in a way that prevents them
        from being garbage collected, and forgetting to close file handles or database connections.

        You can also use third-party tools like memory_profiler or objgraph to help you track down memory leaks.
        '''


if __name__ == '__main__':
    '''
    Usage: python ./memory_leak_when_read_csv.py 10000000 20
    '''
    m = int(sys.argv[1])
    n = int(sys.argv[2])

    print("pd.__version__: ", pd.__version__)

    file_name = 'df_{}_{}.csv'.format(m, n)
    if not os.path.exists(file_name):
        mode = "wt"
        with open(file_name, mode) as f:
            for i in range(n - 1):
                f.write('c' + str(i) + ',')
            f.write('c' + str(n - 1) + '\n')
            for j in range(m):
                for i in range(n - 1):
                    f.write('1,')
                f.write('1\n')

    read_from_file()

Issue Description

Memory still leaks after pd.read_csv(), even though the fix from https://github.com/pandas-dev/pandas/pull/24837/commits is still intact on the main branch.

Sample output in my run:

python memory_leak_when_read_csv.py 10000000 20 
pd.__version__: 1.5.3
memory_before_read:  13.765625
df.shape:  (10000000, 20)
memory_after_read:  1549.0234375
memory_after_gc:  41.01953125
memory leak:  27.25390625
Filename: memory_leak_when_read_csv.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    13     13.7 MiB     13.7 MiB           1   @profile
    14                                         def read_from_file():
    15     13.8 MiB      0.1 MiB           1       memory_before_read = psutil.Process().memory_info().rss / 1024 ** 2
    16     13.8 MiB      0.0 MiB           1       print("memory_before_read: ", memory_before_read)
    17                                         
    18   1549.0 MiB   1535.2 MiB           1       df = pd.read_csv(file_name)
    19                                         
    20   1549.0 MiB      0.0 MiB           1       print("df.shape: ", df.shape)
    21   1549.0 MiB      0.0 MiB           1       memory_after_read = psutil.Process().memory_info().rss / 1024 ** 2
    22   1549.0 MiB      0.0 MiB           1       print("memory_after_read: ", memory_after_read)
    23                                         
    24     23.1 MiB  -1525.9 MiB           1       del df
    25                                         
    26                                             # Attempt to trace memory leak
    27     41.0 MiB     17.9 MiB           1       gc.collect()
    28                                         
    29     41.0 MiB      0.0 MiB           1       memory_after_gc = psutil.Process().memory_info().rss / 1024 ** 2
    30     41.0 MiB      0.0 MiB           1       print("memory_after_gc: ", memory_after_gc)
    31     41.0 MiB      0.0 MiB           1       print("memory leak: ", memory_after_gc - memory_before_read)
    32                                         
    33     41.0 MiB      0.0 MiB           1       if len(gc.garbage) > 0:
    34                                                 # Inspect the output of the garbage collector
    35                                                 print("-" * 120)
    36                                                 print("ERROR: gc.garbage:")
    37                                                 print("-" * 120)
    38                                                 print(gc.garbage)
    39                                                 print()
    40                                                 '''
    41                                                 The output of the garbage collector will show you the objects that were not successfully freed up by the
    42                                                 garbage collector. These objects are likely the source of the memory leak.
    43                                         
    44                                                 Once you have identified the objects that are causing the memory leak, you can inspect your code to
    45                                                 determine why these objects are not being garbage collected properly. Common causes of memory leaks
    46                                                 include circular references, which occur when objects reference each other in a way that prevents them
    47                                                 from being garbage collected, and forgetting to close file handles or database connections.
    48                                         
    49                                                 You can also use third-party tools like memory_profiler or objgraph to help you track down memory leaks.
    50                                                 '''



Process finished with exit code 0

In the run above, 27.25390625 MB of memory is not released. IMHO, that is a large leak for a CSV file of roughly 391 MB.
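To help separate a genuine leak from one-off allocator retention, here is a minimal sketch (not part of the original report; the file name and iteration count are assumptions) that reads the same file in a loop and prints the RSS after each pass. Residual memory that grows on every iteration points to a real leak, while a flat curve suggests memory held back by the allocator or by one-time caches.

# Hypothetical follow-up check: repeat the read to see whether the residual
# memory grows per iteration (leak) or stabilizes (allocator/cache retention).
import gc

import psutil

import pandas as pd


def repeated_read(file_name, iterations=5):
    process = psutil.Process()
    baseline = process.memory_info().rss / 1024 ** 2
    print("baseline RSS: {:.2f} MiB".format(baseline))

    for i in range(iterations):
        df = pd.read_csv(file_name)  # default parameters, as in the report
        del df
        gc.collect()
        rss = process.memory_info().rss / 1024 ** 2
        print("iteration {}: RSS {:.2f} MiB (+{:.2f} MiB over baseline)".format(
            i + 1, rss, rss - baseline))


if __name__ == '__main__':
    # File name taken from the reproducible example above; adjust as needed.
    repeated_read('df_10000000_20.csv')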

Expected Behavior

There should be no memory leak after del df and a subsequent gc.collect().

This memory leak was originally detected when a VPS in the cloud crashed. After some digging, I found that the leak occurs on both Linux and Windows, so you should be able to reproduce it on any platform.

The leak most likely originates from a C-level buffer that is either over-allocated or not fully freed. IMHO, you could consider running a memory leak detection and management tool of your choice against the C code to eliminate the leak completely.
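Before reaching for C-level tooling such as Valgrind, a hedged first pass like the sketch below can indicate whether the retained memory is even visible to Python's allocator: tracemalloc only records allocations made through Python's memory APIs, so growth that appears in RSS but not in tracemalloc is more likely held in native buffers or in the C allocator's free lists. The file name is an assumption carried over from the example above.

# Hypothetical diagnostic: compare tracemalloc's view of retained Python-level
# allocations with the RSS growth measured by psutil. Memory visible in RSS
# but not to tracemalloc is more likely native (C buffers, allocator caches).
import gc
import tracemalloc

import psutil

import pandas as pd


def trace_read(file_name):
    process = psutil.Process()
    rss_before = process.memory_info().rss / 1024 ** 2

    tracemalloc.start()
    df = pd.read_csv(file_name)
    del df
    gc.collect()

    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    rss_after = process.memory_info().rss / 1024 ** 2
    print("RSS growth: {:.2f} MiB".format(rss_after - rss_before))
    print("tracemalloc still held: {:.2f} MiB".format(current / 1024 ** 2))
    print("tracemalloc peak: {:.2f} MiB".format(peak / 1024 ** 2))


if __name__ == '__main__':
    # File name taken from the reproducible example above; adjust as needed.
    trace_read('df_10000000_20.csv')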

Related issues FYR:
#21353
#37031
#49582

I hope you will help to prioritize this issue, as I expect more users will run into it over time.

Installed Versions

INSTALLED VERSIONS
------------------
commit : 2e218d1
python : 3.10.2.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22621
machine : AMD64
processor : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_united states.1252
pandas : 1.5.3
numpy : 1.24.2
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 67.0.0
pip : 23.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

Labels

Bug, IO CSV, Performance
