
[Spark] Relax check for generated columns and CHECK constraints on nested struct fields #3601

Conversation

xzhseh (Contributor) commented Aug 25, 2024

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Closes #3250.

This PR relaxes the check on nested struct fields when only some of them are referenced by CHECK constraints or generated columns, which allows more valid use cases in scenarios involving type widening or schema evolution.

The core function, checkReferencedByCheckConstraintsOrGeneratedColumns, inspects the nested/inner fields of the provided StructType to determine whether any are referenced by dependent CHECK constraints or generated columns; for column types like ArrayType or MapType, the function checks these properties directly without inspecting the inner fields.

How was this patch tested?

Through unit tests in TypeWideningConstraintsSuite and TypeWideningGeneratedColumnsSuite.

Does this PR introduce any user-facing changes?

Yes. The following (valid) use case will no longer be rejected by the check in ImplicitMetadataOperation.checkDependentExpression:

-- with `DELTA_SCHEMA_AUTO_MIGRATE` enabled
create table t (a struct<x: byte, y: byte>) using delta;
alter table t add constraint ck check (hash(a.x) > 0);

-- changing the type of struct field `a.y` when it's not
-- the field referenced by the CHECK constraint is allowed now.
insert into t (a) values (named_struct('x', CAST(2 AS byte), 'y', 1030));

xzhseh (Contributor Author) commented Aug 25, 2024

Hi @johanl-db, could you help review this PR? Thanks!

metadata = metadata
)
}
case (fromDataType, toDataType) =>
johanl-db (Collaborator) commented:

I think this won't work correctly if you have a map or array nested inside a struct:

col struct<array<struct<x: int, y: int>>>
If col.element.x is referenced by a generated column and the type of col.element.y changes, then we will call checkDependentConstraints/checkDependentGeneratedColumns on the array, which will fail even though we don't change a dependent field.

You should explicitly handle maps and arrays + add a test for this case
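For illustration, a hedged SQL sketch of the scenario the reviewer describes (table and field names are hypothetical, not from this PR):

```sql
-- A struct wrapping an array of structs.
create table t (col struct<arr: array<struct<x: byte, y: byte>>>) using delta;

-- Suppose a CHECK constraint or generated column references col.arr.element.x.
-- If an incoming write only widens col.arr.element.y, the unpatched logic would
-- still invoke checkDependentConstraints / checkDependentGeneratedColumns on
-- the ArrayType as a whole, and therefore fail even though the dependent
-- field (x) is unchanged.
```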

xzhseh (Contributor Author) commented Aug 27, 2024:

@johanl-db thanks for the review!

After investigating a bit, it seems that we cannot directly access a field of a nested struct inside an array, e.g.,

create table t (a struct<b: array<struct<x: byte, y: byte>>>) using delta;

-- this will fail due to - "[FIELD_NOT_FOUND] No such struct field `element` in `x`, `y`."
-- same for generated columns.
alter table t add constraint ck check (hash(a.b.element.x) > 0);

There could be other ways to achieve this, e.g., alter table t add constraint ck check (hash(a.b[0].x) > 0), but this causes other issues: the field path will not be consistent with a.b[0].x,
since we automatically append "element" in SchemaMergingUtils.transformColumns. This leads to the checks for dependent constraints/generated columns being bypassed even if we've checked the field for the two properties, i.e., a.b.element.x (actual) is not consistent with a.b[0].x (expected).

I think we could probably keep blocking ArrayType and MapType (due to the appended "key" and "value") for now (i.e., keep the original semantics), and only relax the check for StructType.

johanl-db (Collaborator) commented:

Agree, it's better to keep what you have currently and remain overly strict for maps and arrays; it's not worth the effort and risk of bugs.

currentField.dataType,
updateField.dataType
) =>
checkReferencedByCheckConstraintsOrGeneratedColumns(
johanl-db (Collaborator) commented:

SchemaMergingUtils.transformColumns will apply the transform to all StructField in the schema, so you will end up calling checkReferencedByCheckConstraintsOrGeneratedColumns multiple times for nested struct fields since that method also recurses into the schema.

For example:
col struct<a: struct<x: int, y: int>>
transformColumns will call checkReferencedByCheckConstraintsOrGeneratedColumns on col first, which will recurse into the struct and call itself on a.
Then transformColumns will call checkReferencedByCheckConstraintsOrGeneratedColumns again on a directly.

Not really an issue, you can either skip recursing into structs in checkReferencedByCheckConstraintsOrGeneratedColumns or call out in a comment there that it's redundant

xzhseh (Contributor Author) commented:

thanks for pointing this out!

I've skipped recursing into structs for checkReferencedByCheckConstraintsOrGeneratedColumns, and will finalize this after the discussion regarding MapType and ArrayType above.

@xzhseh xzhseh changed the title [Spark] relax check for generated columns and CHECK constraints on nested struct fields [Spark] Relax check for generated columns and CHECK constraints on nested struct fields Aug 27, 2024
@xzhseh xzhseh requested a review from johanl-db August 27, 2024 04:05
…; update GeneratedColumnSuite; add helper functions
xzhseh (Contributor Author) commented Aug 29, 2024

@johanl-db several updates:

  • checkReferencedByCheckConstraintsOrGeneratedColumns will now skip recursing into nested StructType to better integrate with the existing SchemaMergingUtils.transformColumns logic.
  • we will remain strict for MapType and ArrayType, particularly concerning potential nested structs, as discussed above.
  • relevant unit tests in GeneratedColumnSuite, TypeWideningGeneratedColumnsSuite, and TypeWideningConstraintsSuite have been updated accordingly.

@xzhseh xzhseh force-pushed the xzhseh/relax-check-on-nested-struct-fields branch from 3b33bed to af3369d on August 29, 2024
johanl-db (Collaborator) left a comment:

The change looks good and correct. I left some follow-up comments for a last iteration before merging.

from: DataType,
to: DataType,
metadata: Metadata): Unit = (from, to) match {
case (StructType(fromFields), StructType(toFields)) =>
johanl-db (Collaborator) commented:

I think you may be able to simplify a bit by removing this layer (checkReferencedByCheckConstraintsOrGeneratedColumns) and calling checkConstraintsOrGeneratedColumnsExceptStruct directly from checkDependentExpressions, since transformColumns will already call that method on all struct fields in the schema:

  private def checkConstraintsOrGeneratedColumnsExceptStruct(
      spark: SparkSession,
      path: Seq[String],
      protocol: Protocol,
      metadata: Metadata,
      currentDt: DataType,
      updateDt: DataType): Unit = (currentDt, updateDt) match {
    case (StructType(_), StructType(_)) =>
      // we explicitly ignore the check for `StructType` here.
    case (from, to) =>
      if (currentDt != updateDt) {
        checkDependentConstraints(spark, path, metadata, from, to)
        checkDependentGeneratedColumns(spark, path, protocol, metadata, from, to)
      }
  }

    SchemaMergingUtils.transformColumns(metadata.schema, dataSchema) {
      case (fieldPath, currentField, Some(updateField), _)
        if !SchemaMergingUtils.equalsIgnoreCaseAndCompatibleNullability(
          currentField.dataType,
          updateField.dataType
        ) =>
        checkConstraintsOrGeneratedColumnsExceptStruct(
          spark = sparkSession,
          protocol = protocol,
          path = fieldPath :+ currentField.name,
          from = currentField.dataType,
          to = updateField.dataType,
          metadata = metadata
        )

nit: I would rename checkConstraintsOrGeneratedColumnsExceptStruct to e.g. checkDependentExpressionsOnStructField

xzhseh (Contributor Author) commented Sep 3, 2024:

thanks for the suggestion!

  • checkReferencedByCheckConstraintsOrGeneratedColumns has been removed.
  • checkDependentExpressionsOnStructField is now being directly called in checkDependentExpressions.

| )
|""".stripMargin
)
checkAnswer(sql("SELECT hash(a.x.z) FROM t"), Seq(Row(1765031574), Row(1765031574)))
johanl-db (Collaborator) commented:

Can you add one test with e.g. a struct<arr: array<struct<x, y>>> with a constraint defined on a.arr[0].x, checking that changing the type of a.arr.element.x and of a.arr.element.y both fail?

xzhseh (Contributor Author) commented:

Yes, I did attempt to add the corresponding tests; the issue is that, for both versions of checkDependentConstraints, we are unable to block operations involving constraints or generated columns related to a.arr[0].x, i.e., the operations pass without exceptions being thrown.

This seems to be more of a bug than something that should be addressed in this PR; I'm considering either adding the tests with a // FIXME: ... comment to highlight the issue, or simply skipping this scenario for now.

johanl-db (Collaborator) commented:

I created a GitHub issue to track this (#3634); this is indeed a bug. I expect the use case of applying a check constraint or generated column to a specific array or map element to be fairly rare.
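For reference, a hedged SQL sketch of the gap tracked in #3634 (names illustrative, not from this PR):

```sql
create table t (a struct<arr: array<struct<x: byte, y: byte>>>) using delta;
alter table t add constraint ck check (hash(a.arr[0].x) > 0);

-- Expected: a write that widens a.arr.element.x should be rejected, since the
-- constraint depends on it. Actual (per #3634): the constraint stores the path
-- a.arr[0].x, which never matches the a.arr.element.x path produced during
-- schema evolution, so the dependency check is silently bypassed.
```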

// we explicitly ignore the check for `StructType` here.
case (StructType(_), StructType(_)) =>

// FIXME: we intentionally incorporate the pattern match for `ArrayType` and `MapType`
xzhseh (Contributor Author) commented:

Added a FIXME comment here to indicate this potential bug; not sure whether this is necessary or not, @johanl-db.

johanl-db (Collaborator) left a comment:

The change looks good, we can merge it.
The CI failed; the failure seems random and not related to anything in this PR.
@tdas can you restart the CI and merge when it's green?

tdas (Contributor) commented Sep 3, 2024

Restarted all the tests. Let's see.

@allisonport-db allisonport-db merged commit 3a9762f into delta-io:master Sep 5, 2024
14 of 17 checks passed
allisonport-db pushed a commit that referenced this pull request Sep 9, 2024
## Description
Small test fix, follow-up from
#3601
The test contains an overflow and fails when running with ANSI_MODE
enabled.

## How was this patch tested?
Test-only