Potential timeout during data caching in multi-node trainings #1638

Open
NeoLegends opened this issue Oct 23, 2024 · 2 comments

NeoLegends (Collaborator)

#1621 describes an issue where there would be a Gloo timeout in the worker processes when the master process takes longer than 30 min for the eval step. This was fixed with a regular sync while the master process is doing the eval.

I expect a similar issue can arise during distributed trainings when FileCache is used. If a subset of the nodes already has the data stored locally on disk but others do not, the nodes that have the data will enter the training loop and wait for the remaining nodes to load their data via an all_reduce. If those other nodes take longer than 30 min to cache the data, there will be a timeout/crash on the nodes that already have the data.

Similar to the fix for #1621, we need to introduce a regular sync while the data is being cached to prevent this from happening, especially if e.g. caching takes longer than expected.
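
As a rough sketch of the kind of periodic sync this would need (assuming torch.distributed with the Gloo backend is already initialized; the helper name and polling interval are made up for illustration, this is not RETURNN code):

```python
import threading
import time

import torch
import torch.distributed as dist


def cache_with_periodic_sync(cache_fn, poll_interval_s: float = 60.0):
    """Run the local caching step (a no-op on nodes that already have the
    data) in a background thread, while all ranks keep exchanging a small
    "still busy" flag so that no single collective blocks for 30 min."""
    done = threading.Event()

    def _run():
        try:
            cache_fn()
        finally:
            done.set()

    threading.Thread(target=_run, daemon=True).start()

    while True:
        # 1.0 while this rank is still caching, 0.0 once it is done.
        busy = torch.tensor([0.0 if done.is_set() else 1.0])
        dist.all_reduce(busy)  # sum of busy flags over all ranks
        if busy.item() == 0.0:
            return  # every rank finished caching, safe to start training
        time.sleep(poll_interval_s)
```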

NeoLegends added the bug label Oct 23, 2024
NeoLegends self-assigned this Oct 23, 2024
NeoLegends (Collaborator, Author) commented Nov 4, 2024

In RETURNN startup we first initialize the dataset, and only afterwards do we initialize the engine. This makes it quite difficult to use distributed-training primitives for regular syncs. Perhaps we should instead use files in the work directory named by local rank (like the RETURNN log file), with regularly updated mtimes.

During dataset initialization, RETURNN would create such a file and update its mtime every second. The other processes/engines can then check whether there are files in the work directory with a recent mtime, and wait until either the files have been deleted or their mtime goes stale (indicating a crash) before the engine and the distributed-training primitives are initialized.
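
A minimal sketch of that mtime-based handshake, assuming the work directory is visible to the waiting processes; the file-name prefix, heartbeat interval and staleness threshold are made up for illustration:

```python
import os
import threading
import time


def start_heartbeat(path: str, interval_s: float = 1.0) -> threading.Event:
    """Keep the heartbeat file's mtime fresh while this rank is still caching.
    Call .set() on the returned event once caching is done; the file is then
    removed."""
    stop = threading.Event()

    def _beat():
        while not stop.is_set():
            with open(path, "a"):
                pass  # make sure the file exists
            os.utime(path, None)  # bump mtime to "now"
            stop.wait(interval_s)
        os.remove(path)

    threading.Thread(target=_beat, daemon=True).start()
    return stop


def wait_for_peers(work_dir: str, stale_after_s: float = 30.0):
    """Block until all heartbeat files are gone (peers finished caching) or
    one of them goes stale (indicating a crash), before initializing the
    engine and the distributed-training primitives."""
    while True:
        now = time.time()
        pending = False
        for name in os.listdir(work_dir):
            if not name.startswith("cache-heartbeat."):
                continue
            try:
                mtime = os.path.getmtime(os.path.join(work_dir, name))
            except FileNotFoundError:
                continue  # that peer just finished and removed its file
            if now - mtime > stale_after_s:
                raise RuntimeError(f"{name} went stale, assuming that peer crashed")
            pending = True
        if not pending:
            return
        time.sleep(1.0)
```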

albertz (Member) commented Nov 4, 2024

DistributeFilesDataset does the initialization lazily, so with that dataset there should not be a problem. Before we make it unnecessarily complicated/hacky, it would be good to see an actual example where we have such a problem.

Also, we could change the logic of the distributed training and initialize it earlier. I don't really see the point of adding some weird/hacky file-based communication workaround because we don't want to use the proper distributed communication primitives, just because of historical reasons in how the code is currently structured.
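
As a rough sketch of that route (not actual RETURNN code; the timeout value is arbitrary): if the process group is set up before dataset initialization, the caching step can simply be bracketed by collectives with a generous timeout, e.g. with torch.distributed:

```python
from datetime import timedelta

import torch.distributed as dist

# Assumes RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT are set by the launcher.
dist.init_process_group(backend="gloo", timeout=timedelta(hours=2))

# ... dataset initialization / file caching would happen here ...

# monitored_barrier (Gloo only) reports which rank is missing if caching
# on some node takes longer than the given timeout.
dist.monitored_barrier(timeout=timedelta(hours=2))
```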
