
[bug] potential CPU memory leak OOM #445

Open
spacegoing opened this issue Sep 17, 2024 · 6 comments

@spacegoing

[Figure: CPU memory usage during the training run]

I'm running an 8-node, 64-GPU training task. The CUDA memory is fine; the problem is with CPU memory:

As can be seen from the figure, there seems to be a memory leak in CPU memory (NOT GPU). My JSON file for the dataset is ~19 GB.

Any idea what the bug is?

@spacegoing spacegoing changed the title from "possible CPU memory leak OOM" to "[bug] potential CPU memory leak OOM" on Sep 17, 2024
@LinB203
Member

LinB203 commented Sep 17, 2024

We are also noticing this problem. It may be caused by decord (a package for reading video).

@spacegoing
Author

We are also noticing this problem. It may be caused by decord (a package for reading video).

Thank you so much for your prompt reply.

May I ask what the workaround is?

I also read this: After several iterations, the loader worker processes will consume the same amount of CPU memory as the parent process for all Python objects in the parent process which are accessed from the worker processes. This can be problematic if the Dataset contains a lot of data (e.g., you are loading a very large list of filenames at Dataset construction time) and/or you are using a lot of workers (overall memory usage is number of workers * size of parent process). The simplest workaround is to replace Python objects with non-refcounted representations such as Pandas, Numpy or PyArrow objects. Check out issue #13246 for more details on why this occurs and example code for how to workaround these problems.

pytorch/pytorch#13246 (comment)
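
In case it helps, here is a minimal sketch of that workaround as I understand it: keep the metadata parsed from the large JSON file in contiguous numpy string arrays instead of a Python list of dicts, so worker reads don't bump per-object refcounts and trigger copy-on-write. The class name and the JSON keys ('path', 'cap') below are made up for illustration, not the actual dataset fields.

```python
import json

import numpy as np
from torch.utils.data import Dataset


class MetaOnlyDataset(Dataset):
    """Hypothetical sketch: keep large metadata in numpy buffers, not Python objects."""

    def __init__(self, json_path):
        with open(json_path) as f:
            samples = json.load(f)  # list of Python dicts (refcounted objects)
        # One contiguous numpy string array per field: workers reading these
        # do not touch per-object refcounts, so the parent's pages stay shared.
        self.paths = np.array([s['path'] for s in samples])
        self.captions = np.array([s['cap'] for s in samples])
        del samples

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # str(...) copies the element out of the shared buffer for this worker
        return str(self.paths[idx]), str(self.captions[idx])
```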

@spacegoing
Author

Hi Lin @LinB203,

I tried running the dataloader alone, and the decord lib does have a memory leak issue. I fixed it by wrapping the reader in a `with open(filename, 'rb') as f:` block, which solves that part (I'd be happy to open a pull request later). A minimal sketch of the change is below:
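
(The helper name and the get_batch call below are only for illustration of the pattern, not the repo's actual code: open the file yourself, hand the file object to decord.VideoReader, and let the with block close it so the native buffers get released.)

```python
from decord import VideoReader, cpu


def read_frames(video_path, frame_indices):
    # Hand decord an open file handle instead of a path string, so the
    # underlying file is closed when the `with` block exits.
    with open(video_path, 'rb') as f:
        vr = VideoReader(f, ctx=cpu(0))
        frames = vr.get_batch(frame_indices).asnumpy()
        del vr  # drop the reader before leaving the block
    return frames
```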

[Figure: CPU memory usage when running the dataloader alone]

As you can see, after the fix the memory grows more slowly, but there still seem to be other memory leak issues. Any idea where I should debug first?

[Figure: CPU memory usage after the fix]

@LinB203
Member

LinB203 commented Sep 22, 2024

We are also working on this, but we have not solved it yet. We are still investigating.

@spacegoing
Author

spacegoing commented Sep 22, 2024

@LinB203 Hi Lin,

Thanks for your reply. I guess decord is not the only issue:

I switched to a Rust implementation of FFmpeg-based video reading (https://github.com/spacegoing/ospv12/blob/bba089a6e2cb5584125ac2b7d77e1fd451bcf5d1/opensora/dataset/t2v_datasets.py#L359), but the OOM issue still persists. I'm still working on this and will keep you updated.

The lib I use: https://github.com/gcanat/video_reader-rs

@spacegoing
Author

@LinB203 Also, my current roadblock is that I can't track down the leak. I tried to use

```python
import tracemalloc
import torch.distributed as dist
from torch.utils.data import get_worker_info

def __getitem__(self, idx):
    info = get_worker_info()
    if info is not None:
        if not self.is_start:
            # start tracing inside the worker process
            tracemalloc.start()
            self.is_start = True
        if self.local_step % 5 == 0:
            # dump a snapshot every 5 items so they can be diffed offline
            snapshot = tracemalloc.take_snapshot()
            snapshot.dump(f'./worker_snaps/rank_{dist.get_rank()}_wid_{info.id}_step_{self.local_step}.snp')
    # ... actual sample loading follows
```

to track memory usage in each dataloader worker, but tracemalloc only accounts for ~30 MB of allocations (presumably because it only sees Python-level allocations, not memory allocated by native extensions such as the video readers).
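
For reference, this is roughly how I diff the dumped snapshots offline; the two file names are just examples of the paths written by the code above.

```python
import tracemalloc

# Load two snapshots dumped by the same worker and print the top growth by line.
old = tracemalloc.Snapshot.load('./worker_snaps/rank_0_wid_0_step_5.snp')
new = tracemalloc.Snapshot.load('./worker_snaps/rank_0_wid_0_step_50.snp')
for stat in new.compare_to(old, 'lineno')[:10]:
    print(stat)
```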

I'd love to know which memory profiler you are using.
