
allow user-specified schema in read if it's consistent #3929

Open
wants to merge 6 commits into master

Conversation

@cloud-fan (Contributor) commented Dec 6, 2024

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

A user-specified schema may come from the catalog if the Delta table is stored in an external catalog that syncs the table schema with the Delta log. We should allow such a schema as long as it matches the actual Delta table schema.

This is already the case for batch reads; see apache/spark#15046.

This PR changes the Delta streaming read to allow it as well.

Note: since Delta uses DS v2 (TableProvider) and explicitly declares that user-specified schemas are not supported (TableProvider#supportsExternalMetadata returns false by default), end users still can't specify a schema via spark.read/readStream.schema. This change only affects advanced Spark plugins that construct logical plans to trigger a Delta v1 source stream scan.
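
To make the intended behavior concrete, here is a minimal sketch of the check (the method name and the exact comparison are illustrative, not the actual Delta code; the real implementation may normalize case and nullability before comparing):

```scala
import org.apache.spark.sql.types.StructType

// Illustrative sketch only, not the actual Delta v1 streaming source code.
def resolveReadSchema(
    userSpecified: Option[StructType],
    deltaLogSchema: StructType): StructType = userSpecified match {
  // A catalog-provided schema that matches the Delta log is now accepted...
  case Some(s) if s == deltaLogSchema => deltaLogSchema
  // ...while an inconsistent user-specified schema is still rejected.
  case Some(s) =>
    throw new IllegalArgumentException(
      s"User-specified schema $s does not match the Delta table schema $deltaLogSchema")
  // No user-specified schema: read with the schema from the Delta log.
  case None => deltaLogSchema
}
```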

How was this patch tested?

A new test.

Does this PR introduce any user-facing changes?

No

@ni-mi commented Dec 10, 2024

@allisonport-db @cloud-fan will this very important fix be in the 3.3.0 release?

@ni-mi commented Dec 24, 2024

@allisonport-db @cloud-fan will this very important fix be in the 3.3.0 release?

@cloud-fan (Contributor, Author)

cc @tdas

@ni-mi commented Jan 2, 2025

Hi,

@tdas any news about this fix?
It's a blocker for using Delta tables with Structured Streaming through Databricks open access and Unity Catalog (UC)...

Thanks!

@CarlEkerot

Hi,
I think I've experienced issues related to this when streaming from tables in Glue catalogs as well. If the schema is properly set in the table property spark.sql.sources.schema, any streaming read fails. From the look of things, this is worked around by setting that property to {"type":"struct","fields":[]}, after which the Glue schema ends up as col: array<byte>, which breaks tools like Athena on that table.

@nimrod-doubleverify

@cloud-fan @tdas any update? Anything?

@raveeram-db (Collaborator) commented Feb 3, 2025

> @cloud-fan @tdas any update? Anything?

Hi @nimrod-doubleverify, apologies for the delay here. Unfortunately this didn't make it into the 3.3.0 release, but we'll work on getting a 3.3.1 release out this week.
