[bug] potential CPU memory leak OOM #445
We are also noticing this problem. It may be caused by decord (a package for reading video).
Thank you so much for your prompt reply. May I ask what the workaround is? I also read this:

> After several iterations, the loader worker processes will consume the same amount of CPU memory as the parent process for all Python objects in the parent process which are accessed from the worker processes. This can be problematic if the Dataset contains a lot of data (e.g., you are loading a very large list of filenames at Dataset construction time) and/or you are using a lot of workers (overall memory usage is `number of workers * size of parent process`). The simplest workaround is to replace Python objects with non-refcounted representations such as Pandas, Numpy or PyArrow objects. Check out issue #13246 for more details on why this occurs and example code for how to workaround these problems.
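A minimal sketch of the workaround that quote describes, applied to a dataset like this one: keep the metadata from the big JSON in numpy string arrays instead of a Python list of dicts, so worker reads don't touch per-object refcounts and fault copy-on-write pages. The class name, `json_path`, and the `path`/`cap` keys are assumptions for illustration, not the repo's actual schema:

```python
import json

import numpy as np
from torch.utils.data import Dataset


class FilelistDataset(Dataset):
    # Hypothetical sketch: store metadata as numpy string arrays rather
    # than a Python list of dicts, so DataLoader workers reading it do
    # not bump per-object refcounts and trigger copy-on-write.
    def __init__(self, json_path):
        with open(json_path) as f:
            items = json.load(f)  # the big Python structure dies here
        # one contiguous buffer per column; "path"/"cap" are assumed keys
        self.paths = np.array([it["path"] for it in items])
        self.caps = np.array([it["cap"] for it in items])

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # indexing a numpy string array materializes a fresh str copy
        return str(self.paths[idx]), str(self.caps[idx])
```

One caveat: a numpy `<U` string array pads every entry to the longest string, and falling back to `dtype=object` would reintroduce refcounted Python objects, so very ragged text columns are better stored as PyArrow arrays.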
Hi Lin @LinB203, I tried running the dataloader alone and the decord lib does have a memory leak. I fixed it by wrapping it with a … As you can see, after the fix the memory grows more slowly, but there still seem to be other memory leaks. Any idea where I should debug first?
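The wrapper itself isn't shown above; as a hedged sketch of one common way to contain a native-library leak, decord decoding can be isolated in a short-lived child process so the OS reclaims whatever it leaks when the process exits. `read_frames`, `_decode`, and the queue plumbing here are illustrative, not the commenter's actual fix:

```python
import multiprocessing as mp


def _decode(path, indices, queue):
    # import decord inside the child so all of its state, including
    # anything it leaks, dies with the process
    from decord import VideoReader, cpu
    vr = VideoReader(path, ctx=cpu(0))
    queue.put(vr.get_batch(indices).asnumpy())


def read_frames(path, indices):
    # decode frames in a throwaway subprocess
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    proc = ctx.Process(target=_decode, args=(path, indices, queue))
    proc.start()
    frames = queue.get()  # fetch before join to avoid a queue deadlock
    proc.join()
    return frames
```

Spawning a process per video adds latency; a `multiprocessing.Pool` with `maxtasksperchild` set recycles workers periodically and trades some isolation for speed.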
We are also working on this now, but we have not solved it yet.
@LinB203 Hi Lin, thanks for your reply. I suspect decord is not the only issue: I switched to a Rust implementation, https://github.com/spacegoing/ospv12/blob/bba089a6e2cb5584125ac2b7d77e1fd451bcf5d1/opensora/dataset/t2v_datasets.py#L359 (the lib I use: https://github.com/gcanat/video_reader-rs), and the leak persists.
@LinB203 Also, my current roadblock is that I can't track down the issue. I tried using tracemalloc inside `def __getitem__(self, idx):` (line 170) to track memory usage in each dataloader worker, but tracemalloc only reports ~30 MB of allocations. Would love to know which memory profiler you're using?
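The ~30 MB figure is expected: tracemalloc only instruments Python's own allocator, so native allocations made inside C extensions such as decord never show up in it. A hedged sketch of a whole-process check that logs psutil's RSS alongside tracemalloc (the `log_memory` helper and its placement are assumptions):

```python
import os
import tracemalloc

import psutil

tracemalloc.start()
_proc = psutil.Process(os.getpid())


def log_memory(tag):
    # tracemalloc sees only allocations routed through Python's
    # allocator; native allocations in C extensions bypass it
    python_mb = tracemalloc.get_traced_memory()[0] / 2**20
    rss_mb = _proc.memory_info().rss / 2**20  # whole-process view
    print(f"[pid {os.getpid()}] {tag}: "
          f"python={python_mb:.1f} MB, rss={rss_mb:.1f} MB")


# hypothetical placement, e.g. at the end of __getitem__:
#   log_memory(f"idx={idx}")
```

If python stays flat while rss climbs, the leak is in native code; a profiler that hooks the native allocator, such as memray run with its `--native` option, can attribute the allocations tracemalloc misses.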
As can be seen from the figure, there seems to be a memory leak in CPU memory (NOT GPU). My JSON file for the dataset is ~19 GB.
Any idea what the bug is?