Cyclic GC issues #2659
Ok, this is, in a word, f*cked up. If I add gc.collect to that for loop, it stops leaking memory:

There are objects here that only get garbage collected when the cyclic GC runs. What's the solution here, break the cycles explicitly in
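For readers skimming the thread: a minimal sketch of the workaround described above. The loop body here is illustrative (modeled on the reproduction script in the next comment), not the original reproduction:

import gc

import numpy as np
import pandas as pd

arr = np.random.randn(100000, 5)

def loop_with_collect():
    for i in range(10000):
        df = pd.DataFrame(arr.copy())
        result = df.xs(1000)
        gc.collect()  # force the cyclic collector each iteration, so
                      # cycle-bound DataFrame internals are freed immediately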
Can you try this:

from ctypes import cdll, CDLL
import pandas as pd
import numpy as np

arr = np.random.randn(100000, 5)

cdll.LoadLibrary("libc.so.6")
libc = CDLL("libc.so.6")

def leak():
    for i in xrange(10000):
        libc.malloc_trim(0)
        df = pd.DataFrame(arr.copy())
        result = df.xs(1000)
        # result = df.ix[5000]

if __name__ == '__main__':
    leak()

I suspect this has nothing to do with Python, but that would confirm it.
Yeah, that seemed to do the trick. Memory usage was 450MB after running that in IPython, then malloc_trim freed 400MB. Very pernicious.
Following up on the "fastbins" comment:

In [1]: from ctypes import Structure, c_int, cdll, CDLL
   ...: class MallInfo(Structure):
   ...:     _fields_ = [
   ...:         ('arena',    c_int),  # Non-mmapped space allocated (bytes)
   ...:         ('ordblks',  c_int),  # Number of free chunks
   ...:         ('smblks',   c_int),  # Number of free fastbin blocks
   ...:         ('hblks',    c_int),  # Number of mmapped regions
   ...:         ('hblkhd',   c_int),  # Space allocated in mmapped regions (bytes)
   ...:         ('usmblks',  c_int),  # Maximum total allocated space (bytes)
   ...:         ('fsmblks',  c_int),  # Space in freed fastbin blocks (bytes)
   ...:         ('uordblks', c_int),  # Total allocated space (bytes)
   ...:         ('fordblks', c_int),  # Total free space (bytes)
   ...:         ('keepcost', c_int),  # Top-most, releasable space (bytes)
   ...:     ]
   ...:     def __repr__(self):
   ...:         return "\n".join(["%s:%d" % (k, getattr(self, k)) for k, v in self._fields_])
   ...:
   ...: cdll.LoadLibrary("libc.so.6")
   ...: libc = CDLL("libc.so.6")
   ...: mallinfo = libc.mallinfo
   ...: mallinfo.restype = MallInfo
   ...: libc.malloc_trim(0)
   ...: mallinfo().fsmblks
Out[1]: 0

In [2]: import numpy as np
   ...: import pandas as pd
   ...: arr = np.random.randn(100000, 5)
   ...: def leak():
   ...:     for i in xrange(10000):
   ...:         df = pd.DataFrame(arr.copy())
   ...:         result = df.xs(1000)
   ...: leak()
   ...: mallinfo().fsmblks
Out[2]: 128

In [3]: libc.malloc_trim(0)
   ...: mallinfo().fsmblks
Out[3]: 0
Won't fix then. Maybe we should add some helper functions to pandas someday to do the malloc trimming.
Entry in FAQ, maybe?
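Such a helper never made it into pandas, but a minimal sketch of what was being discussed might look like this (hypothetical name, assuming glibc):

import ctypes
import gc

def trim_memory() -> int:
    """Run a full GC pass, then ask glibc to return freed heap pages to the OS."""
    gc.collect()
    libc = ctypes.CDLL("libc.so.6")  # glibc only; raises OSError on other platforms
    return libc.malloc_trim(0)       # returns 1 if memory was actually released, else 0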
For the record, we (+@sbneto) have been using this in production for a while, and it is doing very well:

# monkeypatches.py

# Solving memory leak problem in pandas
# https://github.com/pandas-dev/pandas/issues/2659#issuecomment-12021083

import sys
from ctypes import cdll, CDLL

import pandas as pd

try:
    cdll.LoadLibrary("libc.so.6")
    libc = CDLL("libc.so.6")
    libc.malloc_trim(0)
except (OSError, AttributeError):
    libc = None

__old_del = getattr(pd.DataFrame, '__del__', None)

def __new_del(self):
    if __old_del:
        __old_del(self)
    libc.malloc_trim(0)

if libc:
    print('Applying monkeypatch for pd.DataFrame.__del__', file=sys.stderr)
    pd.DataFrame.__del__ = __new_del
else:
    print('Skipping monkeypatch for pd.DataFrame.__del__: libc or malloc_trim() not found', file=sys.stderr)
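For what it's worth, the intended usage appears to be importing the module once, for its side effect, before any heavy DataFrame work (the module name is whatever you saved the patch as):

# app entry point -- the import applies the patch as a side effect
import monkeypatches  # noqa: F401
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})  # later DataFrames trim malloc arenas on deletion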
@alanjds thanks very much! But there are other affected operations :-( It's VERY strange that the issue above (a glibc issue) hasn't gotten any reaction. It affects ALL Linux PCs and servers. And... nothing!!! I know, you'll tell me: OK, write a patch! I'll do it (UPD: though that will be strange, since I know nothing about the glibc code). But nobody else knows it either. Everybody says: KDE leaks. Who knows why?! Nobody! Open source? For shame! Sorry, but it's true in this situation.
I do believe in you. Two years and no movement on that side :/ I say we fix it on this side and put in a huge comment of blame, because forking glibc looks unfeasible.
@alanjds Your code fixed a problem for me that was causing a major headache. Would you be willing to explain what the default pandas behavior is and how your code fixes it?
You can also work around this issue by switching to
@tchristensenowlet The problem seems to be in the
I think this might be the culprit in one of our projects, but our users are running Windows with the default Python 3.8 (from the official website) and with all dependencies installed via pip. Would this problem also occur on Windows? If so, what would be the

Edit: I ran the tests described here, and the garbage collector did its job properly every time:
Hi, I am also facing the same issue, |
@bhargav-kansagara I'm having the same issue as you with a Buster VM. I can run libc.malloc_trim(0) with a return code of 1 (successful), but no luck releasing memory. Did you find any solutions?
I am also interested in whether something extra needs to be installed or done in order to make the |
No, this should work out-of-the-box with Python itself.
I've been trying to understand why my memory usage wasn't flat after using the chunksize argument. I'm using CPython, so I never thought about running the garbage collector manually, since CPython "guarantees" (AFAIK) garbage collection right after a reference count hits zero. I spent 2 days on this, and after seeing this issue, a couple of simple gc.collect() calls seem to have fixed my problem (most of it, at least) right away. I haven't seen anything related to this in the documentation (ignore this if there is a warning already), but I think the documentation should have a BIG warning about this.
For anyone else that sees this, this issue still exists as of 9/2022. Reading in pieces of a CSV file (the file is 21 GB and contains a little over 600 million rows), even using chunks helps but only delays the problem. Following each chunk read in with
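A sketch of the pattern being described, combining chunked reads with explicit collection and trimming; the file name, chunk size, and process() are placeholders:

import ctypes
import gc

import pandas as pd

libc = ctypes.CDLL("libc.so.6")  # glibc-only workaround, as discussed above

def process(chunk: pd.DataFrame) -> None:
    ...  # placeholder for per-chunk work

for chunk in pd.read_csv("big_file.csv", chunksize=1_000_000):
    process(chunk)
    del chunk
    gc.collect()         # break cycles so the chunk's blocks are freed now
    libc.malloc_trim(0)  # hand freed arena memory back to the OS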
I found an equivalent function that works on Windows and resolves memory leakage as effectively as malloc_trim does on Linux: SetProcessWorkingSetSizeEx. Here's the cross-platform trim_ram() solution I use to trim RAM in a Python process that works intensively with large pandas DataFrames:

import gc
import ctypes
import ctypes.wintypes
import platform
import logging

logger = logging.getLogger(__name__)

def trim_windows_process_memory(pid: int = None) -> bool:
    """Causes an effect similar to malloc_trim on *nix."""
    # Define SIZE_T based on the platform (32-bit or 64-bit)
    if ctypes.sizeof(ctypes.c_void_p) == 4:
        SIZE_T = ctypes.c_uint32
    else:
        SIZE_T = ctypes.c_uint64

    # Get a handle to the current process
    if not pid:
        pid = ctypes.windll.kernel32.GetCurrentProcess()

    # Define argument and return types for SetProcessWorkingSetSizeEx
    ctypes.windll.kernel32.SetProcessWorkingSetSizeEx.argtypes = [
        ctypes.wintypes.HANDLE,  # Process handle
        SIZE_T,                  # Minimum working set size
        SIZE_T,                  # Maximum working set size
        ctypes.wintypes.DWORD,   # Flags
    ]
    ctypes.windll.kernel32.SetProcessWorkingSetSizeEx.restype = ctypes.wintypes.BOOL

    # Define constants for SetProcessWorkingSetSizeEx
    QUOTA_LIMITS_HARDWS_MIN_DISABLE = 0x00000002

    # Attempt to set the working set size
    result = ctypes.windll.kernel32.SetProcessWorkingSetSizeEx(
        pid, SIZE_T(-1), SIZE_T(-1), QUOTA_LIMITS_HARDWS_MIN_DISABLE
    )
    if result == 0:
        # Retrieve the error code
        error_code = ctypes.windll.kernel32.GetLastError()
        logger.error(f"SetProcessWorkingSetSizeEx failed with error code: {error_code}")
        return False
    else:
        return True

def trim_ram() -> None:
    """Forces Python garbage collection.
    Most importantly, calls malloc_trim/SetProcessWorkingSetSizeEx, which fixes the pandas/libc (?) memory leak."""
    gc.collect()
    if platform.system() == "Windows":
        trim_windows_process_memory()
    else:
        try:
            ctypes.CDLL("libc.so.6").malloc_trim(0)
        except Exception:
            logger.error("malloc_trim attempt failed", exc_info=True)
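One note on the design above: passing (SIZE_T)-1 for both the minimum and maximum working set size is the documented way to ask Windows to remove as many pages as possible from the process working set. A typical call site, assuming the helper above, drops the large objects first so their pages are actually reclaimable:

# big_df is a placeholder for a large DataFrame you are done with
del big_df   # drop the reference first, so its memory is actually free
trim_ram()   # then collect garbage and return memory to the OS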
A mystery to be debugged soon: