#1621 describes an issue where there would be a Gloo timeout in the worker processes when the master process takes longer than 30min for the eval step. This was fixed w/ a regular sync while the master proc is doing the eval.
I expect a similar issue can arise during distributed trainings when `FileCache` is used. If a subset of the nodes already has the data stored locally on their disks but others do not, the nodes that have the data will enter the training loop and wait for the other nodes to load their data via an `all_reduce`. If those other nodes take longer than 30min, there will be a timeout/crash in the nodes that already have the data.
Similar to the fix for #1621, we need to introduce a regular sync while the data is being cached to prevent this from happening, especially if e.g. caching takes longer than expected.
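A minimal sketch of what such a regular sync could look like, assuming `torch.distributed` with the Gloo backend is already initialized; `cache_func` and the sync interval are placeholders for illustration, not existing RETURNN API:

```python
import threading
import time

import torch
import torch.distributed as dist


def cache_with_regular_sync(cache_func, sync_interval=60.0):
    """Run cache_func locally while all ranks keep syncing.

    Every rank joins an all_reduce every sync_interval seconds, so no single
    collective has to wait longer than the Gloo timeout, even if some ranks
    need much longer than others to fill their cache.
    """
    done = threading.Event()

    def _worker():
        cache_func()  # e.g. FileCache copying the data to the local disk
        done.set()

    thread = threading.Thread(target=_worker, daemon=True)
    thread.start()

    while True:
        # 1.0 once this rank finished caching, else 0.0.
        flag = torch.tensor([1.0 if done.is_set() else 0.0])
        # MIN over all ranks: becomes 1.0 only when every rank is done.
        dist.all_reduce(flag, op=dist.ReduceOp.MIN)
        if flag.item() >= 1.0:
            break
        time.sleep(sync_interval)
    thread.join()
```

Since every rank participates in every `all_reduce`, the ranks that already have the data cached just keep looping on cheap collectives instead of blocking for the full caching duration of the slowest rank.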
In RETURNN startup we first initialize the dataset, and only afterwards we initialize the engine. This makes it quite difficult to use distrib training primitives for regular syncs. Perhaps we should use files named by the local ranks (like the returnn log file) w/ regularly updated mtimes in the work directory instead.
During dataset initialization, RETURNN would create a file and update its mtime every second. The other processes/engines can then check whether there are files in the work directory w/ a recent mtime and wait until either the files have been deleted or their mtime goes stale (indicating a crash) before the engine and distributed training primitives are initialized.
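A minimal sketch of this file-based heartbeat, purely to make the idea concrete; the marker file names, the work-dir layout and the staleness threshold are assumptions, not existing RETURNN behavior:

```python
import glob
import os
import threading
import time

STALE_AFTER = 30.0  # seconds without an mtime update -> assume that rank crashed


def start_heartbeat(work_dir, local_rank):
    """Create a per-rank marker file and touch it every second while the
    dataset is being initialized. Returns a stop() callback that removes
    the file once initialization is finished."""
    path = os.path.join(work_dir, f"dataset-init-rank{local_rank}")
    open(path, "w").close()
    stop_event = threading.Event()

    def _touch_loop():
        while not stop_event.is_set():
            os.utime(path, None)  # bump mtime so other ranks see we are alive
            time.sleep(1.0)

    thread = threading.Thread(target=_touch_loop, daemon=True)
    thread.start()

    def stop():
        stop_event.set()
        thread.join()
        if os.path.exists(path):
            os.remove(path)

    return stop


def wait_for_other_ranks(work_dir):
    """Block until all marker files are gone (init finished everywhere) or
    their mtimes went stale (some rank presumably crashed), before the
    engine / distributed primitives are initialized."""
    while True:
        files = glob.glob(os.path.join(work_dir, "dataset-init-rank*"))
        if not files:
            return  # everyone is done, safe to init distributed training
        if all(time.time() - os.path.getmtime(f) > STALE_AFTER for f in files):
            raise RuntimeError("Stale dataset-init marker files; did another rank crash?")
        time.sleep(1.0)
```

The point of the stale-mtime check is to avoid waiting forever when a rank dies mid-initialization, which a plain "wait until the file is deleted" scheme would do.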
DistributeFilesDataset does the initialization lazily, so with that dataset there should not be a problem. Before we make this unnecessarily complicated/hacky, it would be good to see an actual example where we have such a problem.
Also, we could change the logic of the distrib training and initialize that earlier. I don't really see the point of adding some weird/hacky file-based communication workaround just because we don't want to use the proper distrib communication primitives due to historical reasons in how the code is currently structured.