Unify the error handling for the RecordBatchStream #12641

YjyJeff · 2024-09-27T03:25:40Z

Is your feature request related to a problem or challenge?

Currently, most of the RecordBatchStream generated by the PhysicalPlan could propagate the errors polled from the input and continue polling. However, some streams will stop polling the input stream when an error is encountered. The behavior is inconsistent

Describe the solution you'd like

Unify the error handling for all of the RecordBatchStream: Continue polling the input stream when the input stream produces the error and only the stream that encounters the error can stop itself.

The advantage of this solution is that: We could decide whether to continue execution based on the type of error fetched from the root stream. For example, we may want to get the partial query result when data corruption happens in the TableScan. In this case, the error generated by the TableScan will be passed through all of the streams. As long as this TableScan could recover it self to produce the next RecordBatch, we could get the error and the partial query result

Describe alternatives you've considered

No response

Additional context

No response

The text was updated successfully, but these errors were encountered:

alamb · 2024-09-27T11:52:33Z

I agree that inconsistent behavior is not good -- my understanding is that when a stream is cancelled / aborted it should stop execution immediately: https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html#cancellation--aborting-execution so if you know of instances where that is not the case please let us know and I can file tickets to fix them

However, that documentation does not describe the policy of "fast shutdown on error" -- I will try and clarify it with a new PR.

I think many systems do want a quick shutdown if an error has occured and NOT continue to poll other streams. The reason is that if the overall query will error anyways, any additional polling is wasted work that would be thrown away.

For example, we may want to get the partial query result when data corruption happens in the TableScan. In this case, the error generated by the TableScan will be passed through all of the streams. As long as this TableScan could recover it self to produce the next RecordBatch, we could get the error and the partial query result

The usecase of ignoring certain bad files and continuing reading other files makes sense to me for certain systems (though other systems would likely wish to abort immediately)

To achieve this usecase, Instead of changing the default "shutdown as quickly as possible on error" behavior, here are some other ideas

Change the table scan operator itself (if you are using a custom one) so that it does not emit an error to the rest of the plan
Add some new ExecutionPlan operator that will "ignore" errors (as in not pass whatever error it gets to the output)

alamb · 2024-09-27T12:18:35Z

I tried to clarify error behavior in #12651

YjyJeff added the enhancement New feature or request label Sep 27, 2024

YjyJeff linked a pull request Sep 27, 2024 that will close this issue

Propagating the error generated by the input stream and continue polling #12642

Open

alamb mentioned this issue Sep 27, 2024

Minor: Improve documentation on execution error handling #12651

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unify the error handling for the RecordBatchStream #12641

Unify the error handling for the RecordBatchStream #12641

YjyJeff commented Sep 27, 2024

alamb commented Sep 27, 2024

alamb commented Sep 27, 2024

Unify the error handling for the RecordBatchStream #12641

Unify the error handling for the RecordBatchStream #12641

Comments

YjyJeff commented Sep 27, 2024

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

alamb commented Sep 27, 2024

alamb commented Sep 27, 2024