
[SPARK-51265][SQL][SS] Throw proper error for eagerlyExecuteCommands containing streaming source marker #50015

Open · wants to merge 2 commits into master

Conversation

HeartSaVioR (Contributor):

What changes were proposed in this pull request?

This PR checks whether the logical plan contains a streaming source marker when eagerly executed commands are about to be executed. Here, a streaming source marker means a placeholder that will be materialized during microbatch planning. That is, if the plan has such a marker, the source is not materialized, hence the plan cannot read from that source.

This is easily triggered when the user constructs the plan for a command (e.g. df.write.saveAsTable) that includes df.readStream, either directly or via an indirect reference (a temp view over df.readStream).
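
For illustration, here is a minimal sketch of the problematic shape (the rate source and table names are illustrative; it mirrors the reproducer discussed in the review below):

val stream = spark.readStream.format("rate").load()
// Expose the streaming DataFrame through a temp view.
stream.createOrReplaceTempView("s")
// A batch command whose plan still contains an unmaterialized streaming
// source marker: before this PR it failed with a cryptic error.
spark.sql("CREATE TABLE output AS SELECT * FROM s")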

This should be caught by UnsupportedOperationChecker.checkForBatch (which is called from QueryExecution.assertSupported), but if the query is a command that is meant to be eagerly executed, an error is thrown before reaching that code path, and the error is cryptic (either a StackOverflowError, or an AnalysisException with INTERNAL_ERROR).

We should provide a proper error message to tell users that they have to fix their query.

Why are the changes needed?

Without the fix, a StackOverflowError or an AnalysisException with INTERNAL_ERROR is thrown for the user's faulty query.

Does this PR introduce any user-facing change?

Yes, we will provide a clearer error for this case (though the error class is still a TODO to be clarified).

How was this patch tested?

New UT.

Was this patch authored or co-authored using generative AI tooling?

No.

HeartSaVioR (Author):

cc. @cloud-fan @viirya Would you mind taking a look? Thanks!

p.foreach {
  case _: StreamingRelation | _: StreamingRelationV2 |
      _: StreamingExecutionRelation | _: StreamingDataSourceV2ScanRelation =>
    val msg = "Queries with streaming sources must be executed with writeStream.start()"
viirya (Member):
Does it have to be writeStream? Can it be readStream?

viirya (Member):
Oh, readStream cannot start a streaming query.

HeartSaVioR (Author):
Yeah, for now, every streaming query should be triggered via DataStreamWriter.start().
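
The diff hunk above is truncated by the review anchor; for context, here is a self-contained sketch of the shape of the check. The helper name is illustrative, the real PR throws a dedicated error class rather than this plain message, and the import paths may vary across Spark versions:

import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.streaming.StreamingRelationV2
import org.apache.spark.sql.execution.datasources.v2.StreamingDataSourceV2ScanRelation
import org.apache.spark.sql.execution.streaming.{StreamingExecutionRelation, StreamingRelation}

// Fail fast if the analyzed plan still contains a streaming source marker,
// i.e. a placeholder that would only be materialized during microbatch planning.
def assertNoStreamingSourceMarker(p: LogicalPlan): Unit = p.foreach {
  case _: StreamingRelation | _: StreamingRelationV2 |
      _: StreamingExecutionRelation | _: StreamingDataSourceV2ScanRelation =>
    throw new AnalysisException(
      "Queries with streaming sources must be executed with writeStream.start()")
  case _ => // batch nodes are fine
}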

viirya (Member) left a comment:

It looks like a more correct error for users.

// Since we are about to execute the plan, the plan shouldn't have a marker node to be
// materialized during microbatch planning. If the plan has a marker node, it is highly
// likely that users put streaming sources in a batch query.
// This case causes a problem before reaching the check in UnsupportedOperationChecker,
cloud-fan (Contributor):

I think the eager command execution is similar to a normal query execution, so why doesn't it hit UnsupportedOperationChecker?

viirya (Member):

Because assertSupported is not called for eagerly executed commands?

HeartSaVioR (Author):

This is actually coupled with "explain". The test in QueryExecutionSuite ends up with a StackOverflowError. I'll leave a PR comment explaining what happens in the test suite.


withTable("output") {
  val ex = intercept[AnalysisException] {
    // Creates a table from streaming source with batch query. This should fail.
    spark.sql("CREATE TABLE output AS SELECT * FROM s")
HeartSaVioR (Author) commented on Feb 20, 2025:

So when this query is executed, the following happens:

1. CreateDataSourceTableAsSelectCommand is executed. This reaches assertSupported, but the command is a leaf node and it hides the query, hence the assertion is a no-op.

2. It triggers InsertIntoHadoopFsRelationCommand. This exposes the query as a child, so we expect assertSupported to be triggered, but the problem happens while creating the "explainString" (planDesc).

3. When the query is determined to be streaming (any leaf node has isStreaming = true), Spark creates an IncrementalExecution (since the streaming-specific rules are defined there) to create the executed plan, which "disables" assertSupported(). This is not a bug, because we shouldn't check a streaming query against a batch query's criteria; it should have been checked against streaming criteria beforehand.

I'd say the two are simply in conflict: QueryExecution only works properly with batch queries, and IncrementalExecution only works properly with streaming queries. We just found a case where QueryExecution somehow receives a "streaming query" (at least from the isStreaming flag's perspective).

What happens then? withCachedData is called infinitely (I haven't dug into why it loops) and ends up with a StackOverflowError.

This is only the CTAS case, and there are lots of commands, so we can't check everything; I'd like to simply block the case where QueryExecution has to handle a "streaming query" (so far I have only got reports from commands, but I could be wrong).
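
For context on step 3 above, the "determined as streaming" classification comes from the leaves; a simplified sketch of the relevant behavior (the helper name is illustrative):

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// LogicalPlan.isStreaming defaults to children.exists(_.isStreaming), so a
// single readStream leaf marks the whole inner query of the CTAS as
// streaming, which is what routes planning toward IncrementalExecution.
def looksStreaming(plan: LogicalPlan): Boolean = plan.isStreaming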


cloud-fan (Contributor):

So before this PR, this test fails with stack overflow?

If the stack overflow issue is fixed, InsertIntoHadoopFsRelationCommand will hit UnsupportedOperationChecker and we are fine?

HeartSaVioR (Author) commented on Feb 20, 2025:

Yes, it went with a stack overflow, and for this case we might be OK.

Though I wouldn't assume this is the only case: this is a minimized reproducer of the report, and the original report ended up with an AnalysisException with INTERNAL_ERROR (it even came from a different command).

withTable("output") {
  val ex = intercept[AnalysisException] {
    // Creates a table from streaming source with batch query. This should fail.
    df.sparkSession.sql("CREATE TABLE output AS SELECT * FROM s")
HeartSaVioR (Author) commented on Feb 20, 2025:

Same reasoning as above, because queries in the foreachBatch (FEB) user function are technically batch queries.
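
To make the foreachBatch case concrete, a minimal sketch (the rate source, view name, and table name are illustrative, matching the test above):

import org.apache.spark.sql.DataFrame

// The user function receives a plain batch DataFrame, so any command it
// runs goes through eager command execution and is subject to the same check.
val query = spark.readStream.format("rate").load()
  .writeStream
  .foreachBatch { (df: DataFrame, batchId: Long) =>
    // Referencing the streaming temp view "s" from a batch command fails here.
    df.sparkSession.sql("CREATE TABLE output AS SELECT * FROM s")
    ()
  }
  .start()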

HeartSaVioR (Author) commented on Feb 20, 2025:

I realized the downside of the fix: if we ever plan to unify the logical plan for writes between batch and streaming (at least the DSv2 path), it wouldn't work. For now, the streaming write node is always either WriteToMicroBatchDataSource or WriteToMicroBatchDataSourceV1, so this change shouldn't be problematic.

Though I don't have a clear way to resolve what I commented above, so unless someone has a brilliant idea to resolve the coupling, maybe we need to live with it.

// That is more aggressive than just checking the marker node for streaming source which is
// yet to be materialized. We'd like to be a bit conservative here since this is the exact
// problematic case we figured out.
p.foreach {
cloud-fan (Contributor):

For a leaf node command like CreateDataSourceTableAsSelectCommand, this check won't work?

HeartSaVioR (Author) commented on Feb 20, 2025:

Yeah... LeafRunnableCommand doesn't seem to be something we can deal with in general (as there is no general interface to look at the underlying query). We can only deal with it when a LeafRunnableCommand runs another query (which they should).
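
A toy sketch of why a leaf command is opaque to the traversal (the command name is hypothetical; compare CreateDataSourceTableAsSelectCommand, which keeps its query as a field in the same way):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.command.LeafRunnableCommand

// The inner query is a constructor field, not a child: plan.foreach walks
// children only, so it never visits a streaming leaf buried inside `query`.
case class DemoCtasLikeCommand(query: LogicalPlan) extends LeafRunnableCommand {
  // innerChildren makes the query show up in explain output, but it is
  // still invisible to child-based traversals like foreach.
  override def innerChildren: Seq[LogicalPlan] = Seq(query)
  override def run(sparkSession: SparkSession): Seq[Row] = Seq.empty
}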

viirya (Member) commented on Feb 20, 2025:

> I realized the downside of the fix: if we ever plan to unify the logical plan for writes between batch and streaming (at least the DSv2 path), it wouldn't work. For now, the streaming write node is always either WriteToMicroBatchDataSource or WriteToMicroBatchDataSourceV1, so this change shouldn't be problematic.

What does it mean? I see you don't check WriteToMicroBatchDataSource or WriteToMicroBatchDataSourceV1, but rather streaming relations like StreamingRelation, etc.

HeartSaVioR (Author):

@viirya
We have a TODO comment to apply the DSv2 write path to streaming queries as well, e.g. df.writeTo(tblName).append() uses the AppendData node. Not sure whether we will address this in the near future though (the TODO comment seems to have been there for years).

viirya (Member) commented on Feb 20, 2025:

> @viirya We have a TODO comment to apply the DSv2 write path to streaming queries as well, e.g. df.writeTo(tblName).append() uses the AppendData node. Not sure whether we will address this in the near future though (the TODO comment seems to have been there for years).

Do you mean after that (apply DSv2 write to streaming query), the fix won't work anymore?

HeartSaVioR (Author):

Depends on how we address it. Here, we assert when executing a command, and AppendData is AFAIK a command. So if we just make V2 commands work for streaming, valid streaming queries will be routed to the assertion path and we will incorrectly fail the query.

Though I think we may need a bigger change to address the TODO comment, so it's probably not something to worry about yet.

viirya (Member) commented on Feb 20, 2025:

I see. Yea, maybe it is too early to worry about it.

cloud-fan (Contributor):

I think the root cause is the command execution mode being set incorrectly in IncrementalExecution; see #50037.
