#1621 describes an issue where there would be a Gloo timeout in the worker processes when the master process takes longer than 30min for the eval step. This was fixed w/ a regular sync while the master proc is doing the eval.
I expect a similar issue can arise during distributed trainings when `FileCache` is used. If a subset of the nodes already has the data stored locally on their disks but others do not, the nodes that have the data will enter the training loop and wait for the other nodes to load their data via an `all_reduce`. If those other nodes take longer than 30min, there will be a timeout/crash in the nodes that already have the data.
Similar to the fix for #1621, we need to introduce a regular sync while the data is being cached to prevent this from happening, especially if e.g. caching takes longer than expected.
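A minimal sketch of what such a regular sync could look like, assuming `torch.distributed` with the Gloo backend is already initialized; `cache_func` and the sync interval are placeholders for illustration, not existing RETURNN API:

```python
import threading
import time

import torch
import torch.distributed as dist


def cache_with_regular_sync(cache_func, sync_interval=60.0):
    """Run cache_func locally while all ranks keep syncing.

    Every rank joins an all_reduce every sync_interval seconds, so no single
    collective has to wait longer than the Gloo timeout, even if some ranks
    need much longer than others to fill their cache.
    """
    done = threading.Event()

    def _worker():
        cache_func()  # e.g. FileCache copying the data to the local disk
        done.set()

    thread = threading.Thread(target=_worker, daemon=True)
    thread.start()

    while True:
        # 1.0 once this rank finished caching, else 0.0.
        flag = torch.tensor([1.0 if done.is_set() else 0.0])
        # MIN over all ranks: becomes 1.0 only when every rank is done.
        dist.all_reduce(flag, op=dist.ReduceOp.MIN)
        if flag.item() >= 1.0:
            break
        time.sleep(sync_interval)
    thread.join()
```

Since every rank participates in every `all_reduce`, the ranks that already have the data cached just keep looping on cheap collectives instead of blocking for the full caching duration of the slowest rank.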
In RETURNN startup we first initialize the dataset, and only afterwards we initialize the engine. This makes it quite difficult to use distrib training primitives for regular syncs. Perhaps we should use files named by the local ranks (like the returnn log file) w/ regularly updated mtimes in the work directory instead.
During dataset initialization, RETURNN would create a file and update its mtime every second. The other processes/engines can then check whether there are files in the work directory w/ a recent mtime and wait until either the files have been deleted or their mtime goes stale (indicating a crash) before the engine and distributed training primitives are initialized.
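A minimal sketch of this file-based heartbeat, purely to make the idea concrete; the marker file names, the work-dir layout and the staleness threshold are assumptions, not existing RETURNN behavior:

```python
import glob
import os
import threading
import time

STALE_AFTER = 30.0  # seconds without an mtime update -> assume that rank crashed


def start_heartbeat(work_dir, local_rank):
    """Create a per-rank marker file and touch it every second while the
    dataset is being initialized. Returns a stop() callback that removes
    the file once initialization is finished."""
    path = os.path.join(work_dir, f"dataset-init-rank{local_rank}")
    open(path, "w").close()
    stop_event = threading.Event()

    def _touch_loop():
        while not stop_event.is_set():
            os.utime(path, None)  # bump mtime so other ranks see we are alive
            time.sleep(1.0)

    thread = threading.Thread(target=_touch_loop, daemon=True)
    thread.start()

    def stop():
        stop_event.set()
        thread.join()
        if os.path.exists(path):
            os.remove(path)

    return stop


def wait_for_other_ranks(work_dir):
    """Block until all marker files are gone (init finished everywhere) or
    their mtimes went stale (some rank presumably crashed), before the
    engine / distributed primitives are initialized."""
    while True:
        files = glob.glob(os.path.join(work_dir, "dataset-init-rank*"))
        if not files:
            return  # everyone is done, safe to init distributed training
        if all(time.time() - os.path.getmtime(f) > STALE_AFTER for f in files):
            raise RuntimeError("Stale dataset-init marker files; did another rank crash?")
        time.sleep(1.0)
```

The point of the stale-mtime check is to avoid waiting forever when a rank dies mid-initialization, which a plain "wait until the file is deleted" scheme would do.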
DistributeFilesDataset does the initialization lazily, so with that dataset there should not be a problem. Before we make this unnecessarily complicated/hacky, it would be good to see an actual example where we have such a problem.
Also, we could change the logic of the distrib training and initialize that earlier. I don't really see the point of adding some weird/hacky file-based communication workaround just because we don't want to use the proper distrib communication primitives due to historical reasons in how the code is currently structured.