
allow user-specified schema in read if it's consistent #3929

Open
wants to merge 6 commits into master

Conversation

@cloud-fan (Contributor) commented Dec 6, 2024

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

A user-specified schema may come from the catalog if the Delta table is stored in an external catalog that syncs the table schema with the Delta log. We should allow such a schema as long as it matches the actual Delta table schema.

This is already the case for batch reads; see apache/spark#15046.

This PR changes the Delta streaming read to allow it as well.

Note: since Delta uses DS v2 (TableProvider) and explicitly declares that user-specified schemas are not supported (TableProvider#supportsExternalMetadata returns false by default), end users still can't specify a schema via spark.read/readStream.schema. This change only affects advanced Spark plugins that construct logical plans to trigger a Delta v1 source stream scan.
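
To make the intended behavior concrete, here is a minimal sketch of the check (the method name and the exact comparison are illustrative, not the actual Delta code; the real implementation may normalize case and nullability before comparing):

```scala
import org.apache.spark.sql.types.StructType

// Illustrative sketch only, not the actual Delta v1 streaming source code.
def resolveReadSchema(
    userSpecified: Option[StructType],
    deltaLogSchema: StructType): StructType = userSpecified match {
  // A catalog-provided schema that matches the Delta log is now accepted...
  case Some(s) if s == deltaLogSchema => deltaLogSchema
  // ...while an inconsistent user-specified schema is still rejected.
  case Some(s) =>
    throw new IllegalArgumentException(
      s"User-specified schema $s does not match the Delta table schema $deltaLogSchema")
  // No user-specified schema: read with the schema from the Delta log.
  case None => deltaLogSchema
}
```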

How was this patch tested?

A new test.

Does this PR introduce any user-facing changes?

No

@ni-mi commented Dec 10, 2024

@allisonport-db @cloud-fan will this very important fix be in the 3.3.0 release?

@ni-mi commented Dec 24, 2024

@allisonport-db @cloud-fan will this very important fix be in the 3.3.0 release?

@cloud-fan (Contributor, Author)

cc @tdas

@ni-mi commented Jan 2, 2025

Hi,

@tdas any news about this fix?
It's a blocker for using Delta tables with Structured Streaming through Databricks open access and Unity Catalog (UC)...

Thanks!

@CarlEkerot

Hi,
I think I've experienced issues related to this when streaming from tables in Glue catalogs as well. If the schema is properly set in the table property spark.sql.sources.schema, any streaming read fails. From the look of things, this is worked around by setting that property to {"type":"struct","fields":[]}, after which the Glue schema ends up as col: array<byte>, which breaks tools like Athena on that table.

@nimrod-doubleverify

@cloud-fan @tdas any update? Anything?

@raveeram-db (Collaborator) commented Feb 3, 2025

> @cloud-fan @tdas any update? Anything?

Hi @nimrod-doubleverify, apologies for the delay here. Unfortunately this didn't make it into the 3.3.0 release, but we'll work on getting a 3.3.1 release out this week.
