-
Notifications
You must be signed in to change notification settings - Fork 619
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add multi file error aggregation strategy #5795
base: master
Are you sure you want to change the base?
Add multi file error aggregation strategy #5795
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #5795 +/- ##
==========================================
- Coverage 36.34% 36.08% -0.27%
==========================================
Files 1304 1244 -60
Lines 110148 107978 -2170
==========================================
- Hits 40037 38964 -1073
+ Misses 65944 64985 -959
+ Partials 4167 4029 -138
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
….gedik/improved_remote_file_reader
….gedik/improved_remote_file_reader
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working on this!
Few points/questions...
- This PR is only one part of the story right? Something still needs to direct all the plugins to write different error messages.
- Can we maybe mark the new proto fields as still a wip? If they don't work out for some reason we can mark those slots as deprecated and at least no one else will try to use them in the meantime. Typically we try to not merge idl changes until things are working end to end, esp for messages that live in the blob store forever, but i think it's low risk in this case.
- How will this pattern be extended into Agent plugins? The new option in the proto will need to be routed through the webapi plugin right?
What frontend UI changes are needed to make the end to end story work if any?
looks good from my end, but would like one of @eapolinario or @pingsutw to also take a look.
flyteplugins/go/tasks/pluginmachinery/ioutils/remote_file_output_reader.go
Outdated
Show resolved
Hide resolved
1/ This PR is self-complete in the sense that it switches the PyTorch plugin to use the earliest timestamp error file strategy. However, the plugin won't be writing multiple files yet, as the environment variables instructing it to do so are not set and the flytekit side changes to populate multiple error files is not implemented. Since the error reading code for earliest timestamp strategy is backward compatible, it works with a single error file as well. My plan was the following in terms of order: Get this PR in, start using the new error retriever for PyTorch
2/ The IDL changes are low risk IMO. I don't mind merging once it works end to end, but need 2 more PRs for that. How do we mark them as WIP? 3/ Re 'How will this pattern be extended into Agent plugins?' I am not familiar with Agent plugins. Any pointers? |
Tracking issue
Part of RFC https://github.com/flyteorg/flyte/blob/b3add894fd879d6b6c11e37ba292ae04bf60afeb/rfc/system/5598-deterministic-errors-distributed-training.md#flytepropeller-backend
Why are the changes needed?
Deterministic error propagation, see: https://github.com/flyteorg/flyte/blob/b3add894fd879d6b6c11e37ba292ae04bf60afeb/rfc/system/5598-deterministic-errors-distributed-training.md#flytepropeller-backend
What changes were proposed in this pull request?
An error aggregation strategy was introduced for the remove file output reader.
This PR is self-complete in the sense that it switches the PyTorch plugin to use the earliest timestamp error file strategy. However, the plugin won't be writing multiple files yet, as the environment variables instructing it to do so are not set and the flytekit side changes to populate multiple error files is not implemented. Since the error reading code for earliest timestamp strategy is backward compatible, it works with a single error file as well.
My plan was the following in terms of order:
*. Get the entry point changes in, for creating multiple error files when environment variables are present (with unit tests)
How was this patch tested?
Unit testing.
Check all the applicable boxes
Related PRs