
Fix Hang in Python Dataset Reader with DistConv #2457

Merged
2 commits merged on Jun 13, 2024


fiedorowicz1
Contributor

Fixes a bug in which shuffling responses for DistConv can hang when the last mini-batch is incomplete. Because IO is asynchronous, the reader thread may request the current mini-batch index from the LBANN dataset object before the training algorithm thread has updated it. This PR fixes the problem by having the reader track the mini-batch index internally instead of querying the dataset object.
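As an illustration of the general pattern only, here is a minimal sketch assuming a simple prefetching reader. It is not LBANN's actual reader code or API: `PrefetchingReader` and `consume_epoch` are hypothetical names. The reader thread owns its own mini-batch counter, so even when it runs ahead of the consumer it computes the correct bounds for the final, incomplete mini-batch without reading state that another thread updates.

```python
import queue
import threading


class PrefetchingReader:
    """Hypothetical asynchronous reader (not LBANN's actual API).

    The reader thread keeps its own mini-batch counter instead of asking a
    shared dataset object that another thread updates, so it cannot observe
    a stale index when it prefetches ahead of training.
    """

    def __init__(self, num_samples, mini_batch_size):
        self.num_samples = num_samples
        self.mini_batch_size = mini_batch_size
        # Ceiling division: the last mini-batch may be incomplete.
        self.num_mini_batches = -(-num_samples // mini_batch_size)
        # Owned exclusively by the reader thread; never read from elsewhere.
        self._mini_batch_index = 0

    def _next_mini_batch(self):
        start = self._mini_batch_index * self.mini_batch_size
        end = min(start + self.mini_batch_size, self.num_samples)
        batch = (self._mini_batch_index, list(range(start, end)))
        self._mini_batch_index += 1
        return batch

    def run(self, out_queue):
        """Prefetch every mini-batch of one epoch, then signal completion."""
        for _ in range(self.num_mini_batches):
            out_queue.put(self._next_mini_batch())
        out_queue.put(None)  # sentinel: epoch finished


def consume_epoch(num_samples, mini_batch_size):
    """Run the reader on a background thread and collect its mini-batches."""
    q = queue.Queue(maxsize=2)  # bounded, so the reader runs at most 2 ahead
    reader = PrefetchingReader(num_samples, mini_batch_size)
    worker = threading.Thread(target=reader.run, args=(q,))
    worker.start()
    batches = []
    while (item := q.get()) is not None:
        batches.append(item)
    worker.join()
    return batches


if __name__ == "__main__":
    for index, samples in consume_epoch(10, 4):
        print(index, samples)
```

With 10 samples and a mini-batch size of 4, the reader yields mini-batches 0 and 1 with four samples each and mini-batch 2 with the remaining two, regardless of how far ahead of the consumer it has run.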

fiedorowicz1 requested a review from tbennun on Jun 12, 2024 17:57
fiedorowicz1 merged commit 0305a0e into LBANN:develop on Jun 13, 2024
1 check passed
fiedorowicz1 deleted the fix-python-dataset-reader branch on Jun 13, 2024 00:51