Add multi file error aggregation strategy #5795

bgedik · 2024-10-01T21:53:06Z

Tracking issue

Part of RFC https://github.com/flyteorg/flyte/blob/b3add894fd879d6b6c11e37ba292ae04bf60afeb/rfc/system/5598-deterministic-errors-distributed-training.md#flytepropeller-backend

Why are the changes needed?

Deterministic error propagation, see: https://github.com/flyteorg/flyte/blob/b3add894fd879d6b6c11e37ba292ae04bf60afeb/rfc/system/5598-deterministic-errors-distributed-training.md#flytepropeller-backend

What changes were proposed in this pull request?

An error aggregation strategy was introduced for the remove file output reader.

This PR is self-complete in the sense that it switches the PyTorch plugin to use the earliest timestamp error file strategy. However, the plugin won't be writing multiple files yet, as the environment variables instructing it to do so are not set and the flytekit side changes to populate multiple error files is not implemented. Since the error reading code for earliest timestamp strategy is backward compatible, it works with a single error file as well.

My plan was the following in terms of order:

Get this PR in, start using the new error retriever for PyTorch
*. Get the entry point changes in, for creating multiple error files when environment variables are present (with unit tests)
Populate the environment variables for the end to end function (at this point, I'll validate with manual tests)

How was this patch tested?

Unit testing.

Check all the applicable boxes

I updated the documentation accordingly.
All new and existing tests passed.
All commits are signed-off.

Related PRs

codecov · 2024-10-01T21:57:15Z

Codecov Report

Attention: Patch coverage is 66.45161% with 52 lines in your changes missing coverage. Please review.

Project coverage is 36.08%. Comparing base (da2ce74) to head (22f39cd).
Report is 2 commits behind head on master.

Files with missing lines	Patch %	Lines
...uginmachinery/ioutils/remote_file_output_reader.go	70.22%	29 Missing and 10 partials ⚠️
flyteidl/gen/pb-go/flyteidl/core/errors.pb.go	0.00%	8 Missing ⚠️
flyteidl/gen/pb-go/flyteidl/core/execution.pb.go	0.00%	4 Missing ⚠️
flytestdlib/storage/stow_store.go	80.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5795      +/-   ##
==========================================
- Coverage   36.34%   36.08%   -0.27%     
==========================================
  Files        1304     1244      -60     
  Lines      110148   107978    -2170     
==========================================
- Hits        40037    38964    -1073     
+ Misses      65944    64985     -959     
+ Partials     4167     4029     -138

Flag	Coverage Δ
unittests-datacatalog	`?`
unittests-flyteadmin	`55.60% <ø> (+0.03%)`	⬆️
unittests-flytecopilot	`12.17% <ø> (ø)`
unittests-flytectl	`62.21% <ø> (ø)`
unittests-flyteidl	`7.16% <0.00%> (-0.01%)`	⬇️
unittests-flyteplugins	`53.61% <70.67%> (+0.25%)`	⬆️
unittests-flytepropeller	`42.02% <100.00%> (+<0.01%)`	⬆️
unittests-flytestdlib	`55.39% <85.71%> (+0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

….gedik/improved_remote_file_reader

bgedik · 2024-10-09T19:33:16Z

cc @eapolinario @wild-endeavor

wild-endeavor

Thank you for working on this!

Few points/questions...

This PR is only one part of the story right? Something still needs to direct all the plugins to write different error messages.
Can we maybe mark the new proto fields as still a wip? If they don't work out for some reason we can mark those slots as deprecated and at least no one else will try to use them in the meantime. Typically we try to not merge idl changes until things are working end to end, esp for messages that live in the blob store forever, but i think it's low risk in this case.
How will this pattern be extended into Agent plugins? The new option in the proto will need to be routed through the webapi plugin right?

What frontend UI changes are needed to make the end to end story work if any?

looks good from my end, but would like one of @eapolinario or @pingsutw to also take a look.

flyteplugins/go/tasks/pluginmachinery/ioutils/remote_file_output_reader.go

bgedik · 2024-10-10T05:19:57Z

Thank you for working on this!

Few points/questions...

This PR is only one part of the story right? Something still needs to direct all the plugins to write different error messages.

Can we maybe mark the new proto fields as still a wip? If they don't work out for some reason we can mark those slots as deprecated and at least no one else will try to use them in the meantime. Typically we try to not merge idl changes until things are working end to end, esp for messages that live in the blob store forever, but i think it's low risk in this case.

How will this pattern be extended into Agent plugins? The new option in the proto will need to be routed through the webapi plugin right?

What frontend UI changes are needed to make the end to end story work if any?

looks good from my end, but would like one of @eapolinario or @pingsutw to also take a look.

1/ This PR is self-complete in the sense that it switches the PyTorch plugin to use the earliest timestamp error file strategy. However, the plugin won't be writing multiple files yet, as the environment variables instructing it to do so are not set and the flytekit side changes to populate multiple error files is not implemented. Since the error reading code for earliest timestamp strategy is backward compatible, it works with a single error file as well.

My plan was the following in terms of order:

Get this PR in, start using the new error retriever for PyTorch

Get the entry point changes in, for creating multiple error files when environment variables are present (with unit tests)
Populate the environment variables for the end to end function (at this point, I'll validate with manual tests)

2/ The IDL changes are low risk IMO. I don't mind merging once it works end to end, but need 2 more PRs for that. How do we mark them as WIP?

3/ Re 'How will this pattern be extended into Agent plugins?' I am not familiar with Agent plugins. Any pointers?

Add multi file error aggregation strategy

d74d7a2

bgedik added 13 commits October 1, 2024 22:06

Add multi file error aggregation strategy

b95197f

Add multi file error aggregation strategy

4f019ba

Add multi file error aggregation strategy

35a3332

Add multi file error aggregation strategy

275ab74

Add multi file error aggregation strategy

eb479ca

Add multi file error aggregation strategy

17d3db4

Merge branch 'master' of https://github.com/flyteorg/flyte into bugra…

5bb0351

….gedik/improved_remote_file_reader

Add multi file error aggregation strategy

fa982c3

Add multi file error aggregation strategy

7128bb7

Add multi file error aggregation strategy

8f4629f

Add multi file error aggregation strategy

2df9e4d

Merge branch 'master' of https://github.com/flyteorg/flyte into bugra…

ce23e9f

….gedik/improved_remote_file_reader

Add multi file error aggregation strategy

43afb54

wild-endeavor reviewed Oct 10, 2024

View reviewed changes

flyteplugins/go/tasks/pluginmachinery/ioutils/remote_file_output_reader.go Outdated Show resolved Hide resolved

Add multi error file support

22f39cd

bgedik mentioned this pull request Oct 10, 2024

Multiple error files support for entry point flyteorg/flytekit#2797

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add multi file error aggregation strategy #5795

Add multi file error aggregation strategy #5795

bgedik commented Oct 1, 2024 •

edited

Loading

codecov bot commented Oct 1, 2024 •

edited

Loading

bgedik commented Oct 9, 2024

wild-endeavor left a comment

bgedik commented Oct 10, 2024 •

edited

Loading

Add multi file error aggregation strategy #5795

Are you sure you want to change the base?

Add multi file error aggregation strategy #5795

Conversation

bgedik commented Oct 1, 2024 • edited Loading

Tracking issue

Why are the changes needed?

What changes were proposed in this pull request?

How was this patch tested?

Check all the applicable boxes

Related PRs

codecov bot commented Oct 1, 2024 • edited Loading

Codecov Report

bgedik commented Oct 9, 2024

wild-endeavor left a comment

Choose a reason for hiding this comment

bgedik commented Oct 10, 2024 • edited Loading

bgedik commented Oct 1, 2024 •

edited

Loading

codecov bot commented Oct 1, 2024 •

edited

Loading

bgedik commented Oct 10, 2024 •

edited

Loading