
[Spark] Relax check for generated columns and CHECK constraints on nested struct fields #3601

Conversation

xzhseh (Contributor) commented Aug 25, 2024

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Closes #3250.

This PR relaxes the check on nested struct fields when only some of them are referenced by CHECK constraints or generated columns, which allows more valid use cases in scenarios involving type widening or schema evolution.

The core function, checkReferencedByCheckConstraintsOrGeneratedColumns, inspects the nested/inner fields of the provided StructType to determine whether any are referenced by dependent CHECK constraints or generated columns; for column types like ArrayType or MapType, the function checks these properties directly without inspecting the inner fields.

How was this patch tested?

Through unit tests in TypeWideningConstraintsSuite and TypeWideningGeneratedColumnsSuite.

Does this PR introduce any user-facing changes?

Yes. The following (valid) use case will no longer be rejected by the check in ImplicitMetadataOperation.checkDependentExpression:

-- with `DELTA_SCHEMA_AUTO_MIGRATE` enabled
create table t (a struct<x: byte, y: byte>) using delta;
alter table t add constraint ck check (hash(a.x) > 0);

-- changing the type of struct field `a.y` when it's not
-- the field referenced by the CHECK constraint is allowed now.
insert into t (a) values (named_struct('x', CAST(2 AS byte), 'y', 1030));

xzhseh (Contributor Author) commented Aug 25, 2024

Hi @johanl-db, could you help review this PR? Thanks!

metadata = metadata
)
}
case (fromDataType, toDataType) =>
johanl-db (Collaborator) commented:

I think this won't work correctly if you have a map or array nested inside a struct:

col struct<array<struct<x: int, y: int>>>
If col.element.x is referenced by a generated column and the type of col.element.y changes, then we will call checkDependentConstraints/checkDependentGeneratedColumns on the array, which will fail even though we don't change a dependent field.

You should explicitly handle maps and arrays + add a test for this case
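For illustration, a hedged SQL sketch of the scenario the reviewer describes (table and field names are hypothetical, not from this PR):

```sql
-- A struct wrapping an array of structs.
create table t (col struct<arr: array<struct<x: byte, y: byte>>>) using delta;

-- Suppose a CHECK constraint or generated column references col.arr.element.x.
-- If an incoming write only widens col.arr.element.y, the unpatched logic would
-- still invoke checkDependentConstraints / checkDependentGeneratedColumns on
-- the ArrayType as a whole, and therefore fail even though the dependent
-- field (x) is unchanged.
```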

xzhseh (Contributor Author) commented Aug 27, 2024:

@johanl-db thanks for the review!

After investigating a bit, it seems that we cannot directly access a field of a nested struct inside an array, e.g.,

create table t (a struct<b: array<struct<x: byte, y: byte>>>) using delta;

-- this will fail due to - "[FIELD_NOT_FOUND] No such struct field `element` in `x`, `y`."
-- same for generated columns.
alter table t add constraint ck check (hash(a.b.element.x) > 0);

There could be other ways to achieve this, e.g., alter table t add constraint ck check (hash(a.b[0].x) > 0), but this causes other issues: the field path will not be consistent with a.b[0].x,
since we automatically append "element" in SchemaMergingUtils.transformColumns. This leads to the checks for dependent constraints/generated columns being bypassed even if we've checked the field for the two properties, i.e., a.b.element.x (actual) is not consistent with a.b[0].x (expected).

I think we could probably keep blocking ArrayType and MapType (due to the appended "key" and "value") for now (i.e., keep the original semantics), and only relax the check for StructType.

johanl-db (Collaborator) commented:

Agree, it's better to keep what you have currently and remain overly strict for maps and arrays; it's not worth the effort and risk of bugs.

currentField.dataType,
updateField.dataType
) =>
checkReferencedByCheckConstraintsOrGeneratedColumns(
johanl-db (Collaborator) commented:

SchemaMergingUtils.transformColumns will apply the transform to all StructField in the schema, so you will end up calling checkReferencedByCheckConstraintsOrGeneratedColumns multiple times for nested struct fields since that method also recurses into the schema.

For example:
col struct<a: struct<x: int, y: int>>
transformColumns will call checkReferencedByCheckConstraintsOrGeneratedColumns on col first, which will recurse into the struct and call itself on a.
Then transformColumns will call checkReferencedByCheckConstraintsOrGeneratedColumns again on a directly.

Not really an issue, you can either skip recursing into structs in checkReferencedByCheckConstraintsOrGeneratedColumns or call out in a comment there that it's redundant

xzhseh (Contributor Author) commented:

thanks for pointing this out!

I've skipped recursing into structs for checkReferencedByCheckConstraintsOrGeneratedColumns, and will finalize this after the discussion regarding MapType and ArrayType above.

@xzhseh xzhseh changed the title [Spark] relax check for generated columns and CHECK constraints on nested struct fields [Spark] Relax check for generated columns and CHECK constraints on nested struct fields Aug 27, 2024
@xzhseh xzhseh requested a review from johanl-db August 27, 2024 04:05
…; update GeneratedColumnSuite; add helper functions
xzhseh (Contributor Author) commented Aug 29, 2024

@johanl-db several updates:

  • checkReferencedByCheckConstraintsOrGeneratedColumns will now skip recursing into nested StructType to better integrate with the existing SchemaMergingUtils.transformColumns logic.
  • we will remain strict for MapType and ArrayType, particularly concerning potential nested structs, as discussed above.
  • relevant unit tests in GeneratedColumnSuite, TypeWideningGeneratedColumnsSuite, and TypeWideningConstraintsSuite have been updated accordingly.

@xzhseh xzhseh force-pushed the xzhseh/relax-check-on-nested-struct-fields branch from 3b33bed to af3369d on August 29, 2024
johanl-db (Collaborator) left a comment:

The change looks good and correct. I left some follow-up comments for a last iteration before merging.

from: DataType,
to: DataType,
metadata: Metadata): Unit = (from, to) match {
case (StructType(fromFields), StructType(toFields)) =>
johanl-db (Collaborator) commented:

I think you may be able to simplify a bit by removing this layer (checkReferencedByCheckConstraintsOrGeneratedColumns) and calling checkConstraintsOrGeneratedColumnsExceptStruct directly from checkDependentExpressions, since transformColumns will already call that method on all struct fields in the schema:

  private def checkConstraintsOrGeneratedColumnsExceptStruct(
      spark: SparkSession,
      path: Seq[String],
      protocol: Protocol,
      metadata: Metadata,
      currentDt: DataType,
      updateDt: DataType): Unit = (currentDt, updateDt) match {
    case (StructType(_), StructType(_)) =>
      // we explicitly ignore the check for `StructType` here.
    case (from, to) =>
      if (currentDt != updateDt) {
        checkDependentConstraints(spark, path, metadata, from, to)
        checkDependentGeneratedColumns(spark, path, protocol, metadata, from, to)
      }
  }

    SchemaMergingUtils.transformColumns(metadata.schema, dataSchema) {
      case (fieldPath, currentField, Some(updateField), _)
        if !SchemaMergingUtils.equalsIgnoreCaseAndCompatibleNullability(
          currentField.dataType,
          updateField.dataType
        ) =>
        checkConstraintsOrGeneratedColumnsExceptStruct(
          spark = sparkSession,
          protocol = protocol,
          path = fieldPath :+ currentField.name,
          from = currentField.dataType,
          to = updateField.dataType,
          metadata = metadata
        )

nit: I would rename checkConstraintsOrGeneratedColumnsExceptStruct to e.g. checkDependentExpressionsOnStructField

xzhseh (Contributor Author) commented Sep 3, 2024:

thanks for the suggestion!

  • checkReferencedByCheckConstraintsOrGeneratedColumns has been removed.
  • checkDependentExpressionsOnStructField is now being directly called in checkDependentExpressions.

| )
|""".stripMargin
)
checkAnswer(sql("SELECT hash(a.x.z) FROM t"), Seq(Row(1765031574), Row(1765031574)))
johanl-db (Collaborator) commented:

Can you add one test with e.g. a struct<arr: array<struct<x, y>>> with a constraint defined on a.arr[0].x, checking that changing the type of a.arr.element.x and of a.arr.element.y both fail?

xzhseh (Contributor Author) commented:

Yes, I did attempt to add the corresponding tests; the issue is that, for both versions of checkDependentConstraints, we are unable to block operations involving constraints or generated columns related to a.arr[0].x, i.e., the operations pass without exceptions being thrown.

This seems to be more of a bug than something that should be addressed in this PR; I'm considering either adding the tests with a // FIXME: ... comment to highlight the issue, or simply skipping this scenario for now.

johanl-db (Collaborator) commented:

I created a GitHub issue to track this (#3634); this is indeed a bug. I expect the use case of applying a check constraint or generated column to a specific array or map element to be fairly rare.
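For reference, a hedged SQL sketch of the gap tracked in #3634 (names illustrative, not from this PR):

```sql
create table t (a struct<arr: array<struct<x: byte, y: byte>>>) using delta;
alter table t add constraint ck check (hash(a.arr[0].x) > 0);

-- Expected: a write that widens a.arr.element.x should be rejected, since the
-- constraint depends on it. Actual (per #3634): the constraint stores the path
-- a.arr[0].x, which never matches the a.arr.element.x path produced during
-- schema evolution, so the dependency check is silently bypassed.
```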

// we explicitly ignore the check for `StructType` here.
case (StructType(_), StructType(_)) =>

// FIXME: we intentionally incorporate the pattern match for `ArrayType` and `MapType`
xzhseh (Contributor Author) commented:

Added a FIXME comment here to indicate this potential bug; not sure whether this is necessary or not, @johanl-db.

johanl-db (Collaborator) left a comment:

The change looks good, we can merge it.
The CI failed; the failure seems random and not related to anything in this PR.
@tdas can you restart the CI and merge when it's green?

tdas (Contributor) commented Sep 3, 2024

Restarted all the tests. Let's see.

@allisonport-db allisonport-db merged commit 3a9762f into delta-io:master Sep 5, 2024
14 of 17 checks passed
allisonport-db pushed a commit that referenced this pull request Sep 9, 2024
## Description
Small test fix, follow-up from
#3601
The test contains an overflow and fails when running with ANSI_MODE
enabled.

## How was this patch tested?
Test-only