Skip to content

Add support for default column values in engine #25679

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

ebyhr
Copy link
Member

@ebyhr ebyhr commented Apr 27, 2025

Description

This PR adds support for the DEFAULT clause with literal values when creating a new table or adding a column.
The SHOW CREATE TABLE statement now includes the specified default values.
These defaults are respected in INSERT and MERGE operations.
Non-literal expressions such as <datetime value function>, USER, and similar are not supported in this PR.

<default clause> ::=
  DEFAULT <default option> 

<default option> ::=
    <literal> 
  | <datetime value function> 
  | USER
  | CURRENT_USER
  | CURRENT_ROLE
  | SESSION_USER
  | SYSTEM_USER
  | CURRENT_CATALOG
  | CURRENT_SCHEMA
  | CURRENT_PATH
  | <implicitly typed value specification> 

Release notes

## General, Memory
* Add support for default column values. ({issue}`25679`)

@cla-bot cla-bot bot added the cla-signed label Apr 27, 2025
@github-actions github-actions bot added iceberg Iceberg connector delta-lake Delta Lake connector hive Hive connector bigquery BigQuery connector mongodb MongoDB connector cassandra Cassandra connector ignite Ignite connector memory Memory connector labels Apr 27, 2025
@ebyhr ebyhr force-pushed the ebi/engine-default-column-values branch from 796b75e to 9b59d7c Compare May 5, 2025 12:16
@ebyhr ebyhr marked this pull request as ready for review May 5, 2025 12:31
@ebyhr ebyhr force-pushed the ebi/engine-default-column-values branch from 9b59d7c to 30ab353 Compare May 5, 2025 12:51
@github-actions github-actions bot added the docs label May 5, 2025
@ebyhr ebyhr force-pushed the ebi/engine-default-column-values branch 3 times, most recently from 3eae180 to 2e3b139 Compare May 5, 2025 23:31
@ebyhr ebyhr requested a review from martint May 5, 2025 23:34
@ebyhr ebyhr force-pushed the ebi/engine-default-column-values branch 2 times, most recently from b905a01 to d75264c Compare May 6, 2025 01:10
@ebyhr ebyhr requested review from Praveen2112 and kasiafi May 12, 2025 03:10
@@ -535,7 +535,7 @@ private RelationPlan getInsertPlan(
if (supportsMissingColumnsOnInsert) {
continue;
}
expression = new Constant(column.getType(), null);
expression = new Constant(column.getType(), column.getDefaultValue().orElse(null));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the connector supportsMissingColumnsOnInsert, then it is supposed to apply the default value on its own, right?
Do we have tests to cover both cases (connector applies the default value versus the engine applies it here)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. We don't have such a test. I'm wondering if we can write a helpful test covering both cases in the engine.

@ebyhr ebyhr force-pushed the ebi/engine-default-column-values branch from d75264c to 995de6e Compare May 12, 2025 23:36
@ebyhr
Copy link
Member Author

ebyhr commented May 12, 2025

Addressed comments.

ColumnMetadata column = ColumnMetadata.builder()
.setName(columnName.getValue())
.setType(getSupportedType(session, catalogHandle, tableMetadata.metadata().getProperties(), type))
.setType(supportedType)
.setDefaultValue(element.getDefaultValue().map(value -> evaluateLiteral(literalInterpreter, value, type)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn’t the literal be evaluated as supported type ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the connector supports VARCHAR(4) and default value abcde ?

Copy link
Member Author

@ebyhr ebyhr May 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, I was wondering which type should win. My question is why users specify 5 characters on VARCHAR(4) column. That looks user mistake to me.

I will change to supportedType anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are special rules regarding type coercions at insert. They are described in the section "Store assignment" of the spec. TL;DR it is legal to trim a string to fit into the declared type, but you can only trim spaces.
I think we should respect the declared type of the column, and follow the coercion rules for inserts.
You can refer to the method noTruncationCast() in LogicalPlanner.

Copy link
Member Author

@ebyhr ebyhr May 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a logic to trim following spaces. Could you take another look?

Also, extracted helper method so we can reuse from CreateTableTask and AddColumnTask.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ebyhr I discussed this with Martin, and we found relevant information in the spec. I'm sorry for the confusion. Here are the rules:

If the subject data type is character string, then the <literal> shall be a <character string
literal>. If the length of the subject data type is fixed, then the length in characters of
the <character string literal> shall not be greater than the length of the subject data
type. If the length of the subject data type is variable, then the length in characters of
the <character string literal> shall not be greater than the maximum length of the
subject data type.

It implies that

  1. the default value declaration should fail if the value is too long for the type
  2. the declaration should succeed when the value is shorter. In such a case, we should coerce the value to the declared type. Since we only support literals, we can statically adjust the value at column declaration time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we confirmed that the DEFAULT NULL NOT NULL behavior is implemented correctly: it can be legally declared, and should fail whenever the default value is used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kasiafi In case of CREATE TABLE/CTAS we have introduced supportedType spi method which allows the engine to automatically coerce the data to supported type - In case of hive if it supports only timestamp(3) and if we provide a default value which is of timestamp(6) - then shouldn't we cast the default value to the supported type ?

@ebyhr ebyhr force-pushed the ebi/engine-default-column-values branch 3 times, most recently from cecbfec to 62525ca Compare May 14, 2025 04:50
@ebyhr
Copy link
Member Author

ebyhr commented May 14, 2025

Addressed comments and added anothor commit supporting the default option in Memory connector.

Comment on lines 3777 to 3788
column.getDefaultValue().ifPresent(defaultValue -> defaultColumnValues.put(columnHandle, defaultValue));
if (!column.isNullable()) {
nonNullableColumnHandles.add(columnHandle);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the column is non-nullable and the default value is null?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It fails as NOT NULL violation. Added another test to BaseConnectorTest and verified locally.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. So it is legal to declare: y int DEFAULT null NOT NULL but then it fails if the default value is actually inserted. I wondered if the spec even allows to define DEFAULT null NOT NULL but seems like it does.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of DEFAULT null NOT NULL shouldn't we fail as the default value here doesn't makes much sense ?

@ebyhr ebyhr force-pushed the ebi/engine-default-column-values branch 2 times, most recently from 29b8e3b to ccfec5d Compare May 20, 2025 02:51
@ebyhr
Copy link
Member Author

ebyhr commented May 20, 2025

@martint @kasiafi @Praveen2112 Could you take another look?

@ebyhr ebyhr force-pushed the ebi/engine-default-column-values branch from ccfec5d to e7d9bdd Compare May 20, 2025 03:57
@ebyhr ebyhr requested review from kasiafi and Praveen2112 May 23, 2025 01:56
@ebyhr ebyhr force-pushed the ebi/engine-default-column-values branch from 0b2730f to 9a473c9 Compare May 26, 2025 05:59
@kasiafi
Copy link
Member

kasiafi commented May 26, 2025

Regarding the null default value, could we use the existing implementation? It would spare us storing the default value, and the behavior would be consistent across connectors. In fact, when the user declares DEFAULT NULL, we could just ignore it.

@kasiafi
Copy link
Member

kasiafi commented May 26, 2025

Do we plan to support other variants of the default value in the future? For values like CURRENT_USER or <datetime value function>, we cannot store them statically. I think it's fine to special-case for constants for now, but if we want to support the other options, we will need another api to store expressions.

@ebyhr ebyhr force-pushed the ebi/engine-default-column-values branch 2 times, most recently from 4a4d211 to 7550565 Compare May 27, 2025 23:03
@ebyhr
Copy link
Member Author

ebyhr commented May 28, 2025

I'm a little wondering whether fallback to existing implementation is really good idea. The behavior between default-null and missing-columns might be different in remote connectors. For instance, ClickHouse has default values in each type (e.g. 1 in integer). DEFAULT null might be used to enforce null in such connectors.

I changed NullableValue to ConnectorExpression so we can support non-constants in the future.

Copy link
Member

@kasiafi kasiafi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a comment about type coercion.

@martint could you please check if it makes sense to you? Other than that, I think it's ready for you to review.

{
try {
Object value = literalInterpreter.evaluate(literal, type);
checkDefaultValue(value, type);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if we declare column type as double, and pass an integer default value?
According to spec, it is legal:

If the subject data type is approximate numeric, then the <literal> shall be a <signed
numeric literal>.

In such a case, we should apply a coercion.
For the general case, we should:

  • analyze the passed literal to get the type
  • if the type is not equal to the declared column type, check if this combination is legal according to the spec. Most cases can probably be handled with a TypeCoercion#canCoerce() call.
  • apply a coercion
  • evaluate the expression and store as a Constant

Copy link
Member Author

@ebyhr ebyhr May 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to use ConstantEvaluator.evaluateConstant method. Let me know if I should avoid the method.

edit : The method doesn't work well in some cases, e.g. TINYINT DEFAULT 123. I will avoid this method.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of CREATE TABLE/CTAS we have introduced supportedType spi method which allows the engine to automatically coerce the data to supported type - In case of hive if it supports only timestamp(3) and if we provide a default value which is of timestamp(6) - then shouldn't we cast the default value to the supported type ?

@Praveen2112 good point about the supported types.
The spec does not consider the possibility that the effective column type is different than the declared type.
In such case, we should probably validate the default value against the effective column type.
However, this could hypothetically lead to the unintuitive situation when the default value type is the same as the declared column type, but the default value is rejected.

@martint could you please share you opinion on how we should validate and coerce the default value in case when the declared column type differs from the effective type of the created column?

ColumnMetadata column = ColumnMetadata.builder()
.setName(columnName.getValue())
.setType(getSupportedType(session, catalogHandle, tableMetadata.metadata().getProperties(), type))
.setType(supportedType)
.setDefaultValue(element.getDefaultValue().map(value -> evaluateLiteral(literalInterpreter, value, type)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kasiafi In case of CREATE TABLE/CTAS we have introduced supportedType spi method which allows the engine to automatically coerce the data to supported type - In case of hive if it supports only timestamp(3) and if we provide a default value which is of timestamp(6) - then shouldn't we cast the default value to the supported type ?

Comment on lines 3777 to 3788
column.getDefaultValue().ifPresent(defaultValue -> defaultColumnValues.put(columnHandle, defaultValue));
if (!column.isNullable()) {
nonNullableColumnHandles.add(columnHandle);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of DEFAULT null NOT NULL shouldn't we fail as the default value here doesn't makes much sense ?

@ebyhr ebyhr force-pushed the ebi/engine-default-column-values branch from 7550565 to 9507eb0 Compare May 29, 2025 05:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bigquery BigQuery connector cassandra Cassandra connector cla-signed delta-lake Delta Lake connector docs hive Hive connector iceberg Iceberg connector ignite Ignite connector memory Memory connector mongodb MongoDB connector syntax-needs-review
Development

Successfully merging this pull request may close these issues.

4 participants