
[bug] potential CPU memory leak OOM #445

Open
spacegoing opened this issue Sep 17, 2024 · 6 comments

@spacegoing

[Figure: CPU memory usage during the training run]

I'm running an 8-node, 64-GPU training task. The CUDA memory is fine; the problem is with CPU memory:

As can be seen from the figure, there seems to be a memory leak in CPU memory (NOT GPU). My JSON file for the dataset is ~19 GB.

Any idea what the bug is?

@spacegoing spacegoing changed the title from "possible CPU memory leak OOM" to "[bug] potential CPU memory leak OOM" on Sep 17, 2024
@LinB203
Member

LinB203 commented Sep 17, 2024

We are also noticing this problem. It may be caused by decord (a package for reading video).

@spacegoing
Author

We are also noticing this problem. It may be caused by decord (a package for reading video).

Thank you so much for your prompt reply.

May I ask what the workaround is?

I also read this: After several iterations, the loader worker processes will consume the same amount of CPU memory as the parent process for all Python objects in the parent process which are accessed from the worker processes. This can be problematic if the Dataset contains a lot of data (e.g., you are loading a very large list of filenames at Dataset construction time) and/or you are using a lot of workers (overall memory usage is number of workers * size of parent process). The simplest workaround is to replace Python objects with non-refcounted representations such as Pandas, Numpy or PyArrow objects. Check out issue #13246 for more details on why this occurs and example code for how to workaround these problems.

pytorch/pytorch#13246 (comment)
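
In case it helps, here is a minimal sketch of that workaround as I understand it: keep the metadata parsed from the large JSON file in contiguous numpy string arrays instead of a Python list of dicts, so worker reads don't bump per-object refcounts and trigger copy-on-write. The class name and the JSON keys ('path', 'cap') below are made up for illustration, not the actual dataset fields.

```python
import json

import numpy as np
from torch.utils.data import Dataset


class MetaOnlyDataset(Dataset):
    """Hypothetical sketch: keep large metadata in numpy buffers, not Python objects."""

    def __init__(self, json_path):
        with open(json_path) as f:
            samples = json.load(f)  # list of Python dicts (refcounted objects)
        # One contiguous numpy string array per field: workers reading these
        # do not touch per-object refcounts, so the parent's pages stay shared.
        self.paths = np.array([s['path'] for s in samples])
        self.captions = np.array([s['cap'] for s in samples])
        del samples

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # str(...) copies the element out of the shared buffer for this worker
        return str(self.paths[idx]), str(self.captions[idx])
```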

@spacegoing
Author

Hi Lin @LinB203,

I tried running the dataloader alone, and the decord lib does have a memory leak issue. I fixed it by wrapping the reader in a `with open(filename, 'rb') as f:` block, which solves that part (I'd be happy to open a pull request later). A minimal sketch of the change is below:
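
(The helper name and the get_batch call below are only for illustration of the pattern, not the repo's actual code: open the file yourself, hand the file object to decord.VideoReader, and let the with block close it so the native buffers get released.)

```python
from decord import VideoReader, cpu


def read_frames(video_path, frame_indices):
    # Hand decord an open file handle instead of a path string, so the
    # underlying file is closed when the `with` block exits.
    with open(video_path, 'rb') as f:
        vr = VideoReader(f, ctx=cpu(0))
        frames = vr.get_batch(frame_indices).asnumpy()
        del vr  # drop the reader before leaving the block
    return frames
```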

[Figure: CPU memory usage when running the dataloader alone]

As you can see, after the fix the memory grows more slowly, but there still seem to be other memory leak issues. Any idea where I should debug first?

[Figure: CPU memory usage after the fix]

@LinB203
Member

LinB203 commented Sep 22, 2024

We are also working on this, but we have not solved it yet. We are still investigating.

@spacegoing
Author

spacegoing commented Sep 22, 2024

@LinB203 Hi Lin,

Thanks for your reply. I guess decord is not the only issue:

I switched to a Rust implementation of FFmpeg-based video reading (https://github.com/spacegoing/ospv12/blob/bba089a6e2cb5584125ac2b7d77e1fd451bcf5d1/opensora/dataset/t2v_datasets.py#L359), but the OOM issue still persists. I'm still working on this and will keep you updated.

The lib I use: https://github.com/gcanat/video_reader-rs

@spacegoing
Author

@LinB203 Also, my current roadblock is that I can't track down the leak. I tried to use

```python
import tracemalloc
import torch.distributed as dist
from torch.utils.data import get_worker_info

def __getitem__(self, idx):
    info = get_worker_info()
    if info is not None:
        if not self.is_start:
            # start tracing inside the worker process
            tracemalloc.start()
            self.is_start = True
        if self.local_step % 5 == 0:
            # dump a snapshot every 5 items so they can be diffed offline
            snapshot = tracemalloc.take_snapshot()
            snapshot.dump(f'./worker_snaps/rank_{dist.get_rank()}_wid_{info.id}_step_{self.local_step}.snp')
    # ... actual sample loading follows
```

to track memory usage in each dataloader worker, but tracemalloc only accounts for ~30 MB of allocations (presumably because it only sees Python-level allocations, not memory allocated by native extensions such as the video readers).
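
For reference, this is roughly how I diff the dumped snapshots offline; the two file names are just examples of the paths written by the code above.

```python
import tracemalloc

# Load two snapshots dumped by the same worker and print the top growth by line.
old = tracemalloc.Snapshot.load('./worker_snaps/rank_0_wid_0_step_5.snp')
new = tracemalloc.Snapshot.load('./worker_snaps/rank_0_wid_0_step_50.snp')
for stat in new.compare_to(old, 'lineno')[:10]:
    print(stat)
```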

I'd love to know which memory profiler you are using.
