Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add multi file error aggregation strategy #5795

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

bgedik
Copy link
Contributor

@bgedik bgedik commented Oct 1, 2024

Tracking issue

Part of RFC https://github.com/flyteorg/flyte/blob/b3add894fd879d6b6c11e37ba292ae04bf60afeb/rfc/system/5598-deterministic-errors-distributed-training.md#flytepropeller-backend

Why are the changes needed?

Deterministic error propagation, see: https://github.com/flyteorg/flyte/blob/b3add894fd879d6b6c11e37ba292ae04bf60afeb/rfc/system/5598-deterministic-errors-distributed-training.md#flytepropeller-backend

What changes were proposed in this pull request?

An error aggregation strategy was introduced for the remove file output reader.

This PR is self-complete in the sense that it switches the PyTorch plugin to use the earliest timestamp error file strategy. However, the plugin won't be writing multiple files yet, as the environment variables instructing it to do so are not set and the flytekit side changes to populate multiple error files is not implemented. Since the error reading code for earliest timestamp strategy is backward compatible, it works with a single error file as well.

My plan was the following in terms of order:

  • Get this PR in, start using the new error retriever for PyTorch
    *. Get the entry point changes in, for creating multiple error files when environment variables are present (with unit tests)
  • Populate the environment variables for the end to end function (at this point, I'll validate with manual tests)

How was this patch tested?

Unit testing.

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Copy link

codecov bot commented Oct 1, 2024

Codecov Report

Attention: Patch coverage is 66.45161% with 52 lines in your changes missing coverage. Please review.

Project coverage is 36.08%. Comparing base (da2ce74) to head (22f39cd).
Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
...uginmachinery/ioutils/remote_file_output_reader.go 70.22% 29 Missing and 10 partials ⚠️
flyteidl/gen/pb-go/flyteidl/core/errors.pb.go 0.00% 8 Missing ⚠️
flyteidl/gen/pb-go/flyteidl/core/execution.pb.go 0.00% 4 Missing ⚠️
flytestdlib/storage/stow_store.go 80.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5795      +/-   ##
==========================================
- Coverage   36.34%   36.08%   -0.27%     
==========================================
  Files        1304     1244      -60     
  Lines      110148   107978    -2170     
==========================================
- Hits        40037    38964    -1073     
+ Misses      65944    64985     -959     
+ Partials     4167     4029     -138     
Flag Coverage Δ
unittests-datacatalog ?
unittests-flyteadmin 55.60% <ø> (+0.03%) ⬆️
unittests-flytecopilot 12.17% <ø> (ø)
unittests-flytectl 62.21% <ø> (ø)
unittests-flyteidl 7.16% <0.00%> (-0.01%) ⬇️
unittests-flyteplugins 53.61% <70.67%> (+0.25%) ⬆️
unittests-flytepropeller 42.02% <100.00%> (+<0.01%) ⬆️
unittests-flytestdlib 55.39% <85.71%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@bgedik
Copy link
Contributor Author

bgedik commented Oct 9, 2024

cc @eapolinario @wild-endeavor

Copy link
Contributor

@wild-endeavor wild-endeavor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this!

Few points/questions...

  • This PR is only one part of the story right? Something still needs to direct all the plugins to write different error messages.
  • Can we maybe mark the new proto fields as still a wip? If they don't work out for some reason we can mark those slots as deprecated and at least no one else will try to use them in the meantime. Typically we try to not merge idl changes until things are working end to end, esp for messages that live in the blob store forever, but i think it's low risk in this case.
  • How will this pattern be extended into Agent plugins? The new option in the proto will need to be routed through the webapi plugin right?

What frontend UI changes are needed to make the end to end story work if any?

looks good from my end, but would like one of @eapolinario or @pingsutw to also take a look.

@bgedik
Copy link
Contributor Author

bgedik commented Oct 10, 2024

Thank you for working on this!

Few points/questions...

  • This PR is only one part of the story right? Something still needs to direct all the plugins to write different error messages.
  • Can we maybe mark the new proto fields as still a wip? If they don't work out for some reason we can mark those slots as deprecated and at least no one else will try to use them in the meantime. Typically we try to not merge idl changes until things are working end to end, esp for messages that live in the blob store forever, but i think it's low risk in this case.
  • How will this pattern be extended into Agent plugins? The new option in the proto will need to be routed through the webapi plugin right?

What frontend UI changes are needed to make the end to end story work if any?

looks good from my end, but would like one of @eapolinario or @pingsutw to also take a look.

1/ This PR is self-complete in the sense that it switches the PyTorch plugin to use the earliest timestamp error file strategy. However, the plugin won't be writing multiple files yet, as the environment variables instructing it to do so are not set and the flytekit side changes to populate multiple error files is not implemented. Since the error reading code for earliest timestamp strategy is backward compatible, it works with a single error file as well.

My plan was the following in terms of order:

Get this PR in, start using the new error retriever for PyTorch

  • Get the entry point changes in, for creating multiple error files when environment variables are present (with unit tests)
  • Populate the environment variables for the end to end function (at this point, I'll validate with manual tests)

2/ The IDL changes are low risk IMO. I don't mind merging once it works end to end, but need 2 more PRs for that. How do we mark them as WIP?

3/ Re 'How will this pattern be extended into Agent plugins?' I am not familiar with Agent plugins. Any pointers?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants