[Spark] Relax check for generated columns and CHECK constraints on nested struct fields #3601
Conversation
Hi @johanl-db, could you help review this PR? Thanks!
metadata = metadata
)
}
case (fromDataType, toDataType) =>
I think this won't work correctly if you have a map or array nested inside a struct:
`col struct<array<struct<x: int, y: int>>>`
If `col.element.x` is referenced by a generated column and the type of `col.element.y` changes, then we will call `checkDependentConstraints`/`checkDependentGeneratedColumns` on the array, which will fail even though we don't change a dependent field.
You should explicitly handle maps and arrays + add a test for this case.
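For concreteness, a minimal sketch of the scenario (hypothetical table and constraint names; a CHECK constraint stands in for the generated column; assumes a Delta-enabled `spark` session):

```scala
// Hypothetical illustration of the scenario above; table/constraint names
// are made up, and a CHECK constraint stands in for the generated column.
spark.sql("CREATE TABLE t (col STRUCT<arr: ARRAY<STRUCT<x: INT, y: INT>>>) USING delta")
spark.sql("ALTER TABLE t ADD CONSTRAINT ck CHECK (col.arr[0].x > 0)")
// Only col.arr.element.x is referenced, but changing the type of
// col.arr.element.y would invoke checkDependentConstraints /
// checkDependentGeneratedColumns on the whole array and fail.
```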
@johanl-db thanks for the review!
After investigating a bit, it seems that we cannot directly access the field of a nested struct inside an array, e.g.:
create table t (a struct<b: array<struct<x: byte, y: byte>>>) using delta;
-- this will fail due to - "[FIELD_NOT_FOUND] No such struct field `element` in `x`, `y`."
-- same for generated columns.
alter table t add constraint ck check (hash(a.b.element.x) > 0);
There could be other ways to achieve this, e.g., `alter table t add constraint ck check (hash(a.b[0].x) > 0)`, but this causes other issues: even if we check the generated columns/constraints, the field path will not be consistent with `a.b[0].x`, since we automatically append "element" in `SchemaMergingUtils.transformColumns`. This leads to the checks for dependent constraints/generated columns being bypassed even if we've checked the field for the two properties, i.e., `a.b.element.x` (actual) is not consistent with `a.b[0].x` (expected).
I think we could probably block `ArrayType` and `MapType` (due to the appended "key" and "value") for now (i.e., keep the original semantics), and only relax the check for `StructType`.
Agree, it's better to keep what you have currently and keep being too strict for maps and arrays - it's not worth the effort and risk of bugs.
currentField.dataType,
updateField.dataType
) =>
checkReferencedByCheckConstraintsOrGeneratedColumns(
`SchemaMergingUtils.transformColumns` will apply the transform to all StructFields in the schema, so you will end up calling `checkReferencedByCheckConstraintsOrGeneratedColumns` multiple times for nested struct fields, since that method also recurses into the schema.
For example, with `col struct<a: struct<x: int, y: int>>`:
`transformColumns` will call `checkReferencedByCheckConstraintsOrGeneratedColumns` on `col` first, which will recurse into the struct and call itself on `a`. Then `transformColumns` will call `checkReferencedByCheckConstraintsOrGeneratedColumns` again on `a` directly.
Not really an issue, you can either skip recursing into structs in `checkReferencedByCheckConstraintsOrGeneratedColumns` or call out in a comment there that it's redundant.
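To see the double visit concretely, here is a standalone sketch that mimics the traversal (it does not use the real `SchemaMergingUtils` API; all names are illustrative):

```scala
import org.apache.spark.sql.types._

// A transformColumns-style traversal that invokes a callback on every
// StructField, paired with a callback that itself recurses into structs.
def visitAllFields(schema: StructType, path: Seq[String])(
    f: (Seq[String], StructField) => Unit): Unit =
  schema.fields.foreach { field =>
    f(path :+ field.name, field) // visit this field
    field.dataType match {
      case s: StructType => visitAllFields(s, path :+ field.name)(f) // and recurse
      case _ =>
    }
  }

// Stand-in for checkReferencedByCheckConstraintsOrGeneratedColumns: it also
// recurses into structs on its own.
def check(path: Seq[String], dt: DataType): Unit = dt match {
  case s: StructType =>
    println(s"checking ${path.mkString(".")}")
    s.fields.foreach(f => check(path :+ f.name, f.dataType))
  case _ =>
    println(s"checking ${path.mkString(".")}")
}

val inner = StructType(Seq(StructField("x", IntegerType), StructField("y", IntegerType)))
val schema = StructType(Seq(StructField("col", StructType(Seq(StructField("a", inner))))))

// "checking col.a", "checking col.a.x", etc. are printed more than once:
// once via the traversal and again via the callback's own recursion.
visitAllFields(schema, Nil)((p, f) => check(p, f.dataType))
```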
thanks for pointing this out!
I've skipped recursing into structs in `checkReferencedByCheckConstraintsOrGeneratedColumns`, and will finalize this after the discussion regarding `MapType` and `ArrayType` here.
…; update GeneratedColumnSuite; add helper functions
@johanl-db several updates:
The change looks good and correct, I left some follow-up comments for a last iteration before merging.
from: DataType,
to: DataType,
metadata: Metadata): Unit = (from, to) match {
case (StructType(fromFields), StructType(toFields)) =>
I think you may be able to simplify a bit by removing this layer (`checkReferencedByCheckConstraintsOrGeneratedColumns`) and calling `checkConstraintsOrGeneratedColumnsExceptStruct` directly from `checkDependentExpressions`, since `transformColumns` will call that method on all struct fields in the schema already:
private def checkConstraintsOrGeneratedColumnsExceptStruct(
    spark: SparkSession,
    path: Seq[String],
    protocol: Protocol,
    metadata: Metadata,
    from: DataType,
    to: DataType): Unit = (from, to) match {
  case (StructType(_), StructType(_)) =>
    // we explicitly ignore the check for `StructType` here: transformColumns
    // already visits every nested struct field individually.
  case _ =>
    if (from != to) {
      checkDependentConstraints(spark, path, metadata, from, to)
      checkDependentGeneratedColumns(spark, path, protocol, metadata, from, to)
    }
}
SchemaMergingUtils.transformColumns(metadata.schema, dataSchema) {
case (fieldPath, currentField, Some(updateField), _)
if !SchemaMergingUtils.equalsIgnoreCaseAndCompatibleNullability(
currentField.dataType,
updateField.dataType
) =>
checkConstraintsOrGeneratedColumnsExceptStruct(
spark = sparkSession,
protocol = protocol,
path = fieldPath :+ currentField.name,
from = currentField.dataType,
to = updateField.dataType,
metadata = metadata
)
nit: I would rename `checkConstraintsOrGeneratedColumnsExceptStruct` to e.g. `checkDependentExpressionsOnStructField`
thanks for the suggestion!
`checkReferencedByCheckConstraintsOrGeneratedColumns` has been removed. `checkDependentExpressionsOnStructField` is now called directly in `checkDependentExpressions`.
| )
|""".stripMargin
)
checkAnswer(sql("SELECT hash(a.x.z) FROM t"), Seq(Row(1765031574), Row(1765031574)))
Can you add one test with e.g. a `struct<arr: array<struct<x, y>>>` with a constraint defined on `a.arr[0].x`, checking that changing the type of `a.arr.element.x` and of `a.arr.element.y` both fail?
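For reference, a sketch of what such a test could look like (hypothetical names; assumes a ScalaTest suite with a Delta-enabled `spark` session; note the reply below explains why the expected exceptions are not currently thrown for the `a.arr[0].x` case):

```scala
import org.apache.spark.sql.AnalysisException

// Hypothetical test sketch; per the follow-up discussion the expected
// exceptions are NOT currently thrown for array element references, so this
// is illustrative rather than passing code.
spark.sql(
  """CREATE TABLE t (a STRUCT<arr: ARRAY<STRUCT<x: BYTE, y: BYTE>>>) USING delta
    |TBLPROPERTIES ('delta.enableTypeWidening' = 'true')""".stripMargin)
spark.sql("ALTER TABLE t ADD CONSTRAINT ck CHECK (hash(a.arr[0].x) > 0)")

// The check stays strict for arrays, so widening either nested field should
// be rejected while the constraint exists (intercept is from ScalaTest).
intercept[AnalysisException] {
  spark.sql("ALTER TABLE t ALTER COLUMN a.arr.element.x TYPE INT")
}
intercept[AnalysisException] {
  spark.sql("ALTER TABLE t ALTER COLUMN a.arr.element.y TYPE INT")
}
```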
yes, I did attempt to add the corresponding tests; the issue is that, for both versions of `checkDependentConstraints`, we are unable to block operations involving constraints or generated columns related to `a.arr[0].x`, i.e., the operations pass without exceptions being thrown.
this seems to be more of a bug rather than something that should be addressed in this PR - I'm considering either adding the tests with a `// FIXME: ...` comment to highlight the issue, or simply skipping this scenario for now.
I created a GitHub issue to track this: #3634, this is indeed a bug. I expect that the use case of applying a check constraint or generated column to a specific array or map element should be fairly rare.
// we explicitly ignore the check for `StructType` here.
case (StructType(_), StructType(_)) =>
// FIXME: we intentionally incorporate the pattern match for `ArrayType` and `MapType`
added a `FIXME` comment here to indicate this potential bug, not sure whether this is necessary or not, @johanl-db.
The change looks good, we can merge it
The CI failed, failure seems random and not related to anything in this PR
@tdas can you restart the CI + merge when it's green?
restarted all the tests. let's see
## Description
Small test fix, follow-up from #3601. The test contains an overflow and fails when running with ANSI_MODE enabled.
## How was this patch tested?
Test-only
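For context, a minimal sketch of the kind of ANSI-mode overflow described (hypothetical table name; not the actual test code):

```scala
// Under ANSI mode, an implicit cast that overflows throws instead of
// silently wrapping. Hypothetical illustration, assuming a Delta-enabled
// `spark` session.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("CREATE TABLE bytes (b BYTE) USING delta")
// 200 exceeds Byte.MaxValue (127): with ANSI enabled this INSERT fails with
// a cast-overflow error; with ANSI disabled it would wrap to -56.
spark.sql("INSERT INTO bytes VALUES (200)")
```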
Which Delta project/connector is this regarding?
Spark
Description
Closes #3250.
This PR relaxes the check on nested struct fields when only some of them are referenced by CHECK constraints or generated columns, which allows more valid use cases in scenarios involving type widening or schema evolution.
The core function, `checkReferencedByCheckConstraintsOrGeneratedColumns`, inspects the nested/inner fields of the provided `StructType` to determine whether any are referenced by dependent (CHECK) constraints or generated columns; for column types like `ArrayType` or `MapType`, the function checks these properties directly without inspecting the inner fields.
How was this patch tested?
Through unit tests in `TypeWideningConstraintsSuite` and `TypeWideningGeneratedColumnsSuite`.
Does this PR introduce any user-facing changes?
Yes, the following (valid) use case will no longer be rejected by the check in `ImplicitMetadataOperation.checkDependentExpressions`.
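A sketch of that use case, reconstructed from the discussion above (hypothetical table/constraint names; assumes a Delta-enabled `spark` session):

```scala
// A CHECK constraint references only one nested struct field, so widening a
// sibling field is now allowed. Names are made up; type widening is assumed
// to be enabled on the table.
spark.sql(
  """CREATE TABLE t (a STRUCT<x: BYTE, y: BYTE>) USING delta
    |TBLPROPERTIES ('delta.enableTypeWidening' = 'true')""".stripMargin)
spark.sql("ALTER TABLE t ADD CONSTRAINT ck CHECK (hash(a.x) > 0)")
// Previously rejected because the whole struct `a` was treated as a
// dependency of `ck`; with this PR only `a.x` is, so widening `a.y` succeeds.
spark.sql("ALTER TABLE t ALTER COLUMN a.y TYPE SMALLINT")
```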