Skip to content

support SET Timezone #4107

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Nov 8, 2022
Merged

support SET Timezone #4107

merged 5 commits into from
Nov 8, 2022

Conversation

waitingkuo
Copy link
Contributor

Which issue does this PR close?

Closes #4106

Rationale for this change

a POC for #4106

to support set time zone

❯ show time zone;
+--------------------------------+---------+
| name                           | setting |
+--------------------------------+---------+
| datafusion.execution.time_zone | UTC     |
+--------------------------------+---------+
1 row in set. Query took 0.069 seconds.

❯ set time zone = '+08:00';
0 rows in set. Query took 0.001 seconds.

❯ show time zone;
+--------------------------------+---------+
| name                           | setting |
+--------------------------------+---------+
| datafusion.execution.time_zone | +08:00  |
+--------------------------------+---------+
1 row in set. Query took 0.007 seconds.

casting to timestamptz should use the timezone instead of the fixed UTC

select arrow_typeof('2000-01-01T00:00:00'::timestamp::timestamptz);
+------------------------------------------+
| arrowtypeof(Utf8("2000-01-01T00:00:00")) |
+------------------------------------------+
| Timestamp(Nanosecond, Some("+08:00"))    |
+------------------------------------------+
1 row in set. Query took 0.007 seconds.

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate sql SQL Planner labels Nov 4, 2022
Comment on lines +2689 to +2691
fn convert_simple_data_type(&self, sql_type: &SQLDataType) -> Result<DataType> {
match sql_type {
SQLDataType::Boolean => Ok(DataType::Boolean),
SQLDataType::TinyInt(_) => Ok(DataType::Int8),
SQLDataType::SmallInt(_) => Ok(DataType::Int16),
SQLDataType::Int(_) | SQLDataType::Integer(_) => Ok(DataType::Int32),
SQLDataType::BigInt(_) => Ok(DataType::Int64),
SQLDataType::UnsignedTinyInt(_) => Ok(DataType::UInt8),
SQLDataType::UnsignedSmallInt(_) => Ok(DataType::UInt16),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

becomes method of SqlToRel, so that we could get SessionState -> config_options -> timezone

Comment on lines 2709 to 2729
SQLDataType::Timestamp(tz_info) => {
let tz = if matches!(tz_info, TimezoneInfo::Tz)
|| matches!(tz_info, TimezoneInfo::WithTimeZone)
{
match self.schema_provider.get_config_option("datafusion.execution.time_zone") {
Some(ScalarValue::Utf8(s)) => {
s
}
Some(_) => {
None
}
None => None
}
//Some("+00:00".to_string())

} else {
None
};
Ok(DataType::Timestamp(TimeUnit::Nanosecond, tz))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as this becomes the method of SqltoRel, we can now get the timezone and use it to replace the fixed UTC
note that we need to add some timezone validation here (e.g. 00:00.1 isn't valid)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is somewhat strange to me that the type of a table that is created depends on the current session 's timezone setting -- but I can't think of anything better short of the user explicitly providing the timezone themselves

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sql's TimestampTz determine the timezone during the execution (perhaps right before it's been execute). arrow doesn't support TimestampTz in nature. what arrow has here is Timestamp<TimeUnit, Some(tz)>.

I've been thinking about how to deal with this for a long time 🤮 . I have 2 approaches to do so

  1. add some thing like Timestamp<TimeUnit, Some("session")> All the time related functions/kernels need to convert this "session" to the timezone by themselves. i.e. fn week() accept Timestamp<TimeUnit, Some("session")> as input and use the timezone in config_options to convert the underline timestamp and then do the week extraction. This will be more that the TImestampTz in SQL

  2. arrow's Timestamp<TImeUnit, Some("some-timezone")> is determined while SQL is translated to arrow's data type. and this is what this pr is using.

SQL's TimestampTz isn't deterministic while making arrow's data deterministic makes more sense to me; so i picked the 2nd approach for now. But i'm happy to try the first approach if anyone is interested in

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree the second approach makes sense to me.

Maybe usability wise we can just add enough comments about what is happening, or maybe add that best practice is to explicitly specify the timezone in the create table statement / timestamp types.

Comment on lines 2921 to 2923
/*
pub fn convert_simple_data_type(sql_type: &SQLDataType) -> Result<DataType> {
match sql_type {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the original function which is now the method of SqlToRel

perhaps we should a deprecated notifunction here as this is a pub function

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should remove it personally or of we want to deprecate it for a release cycle that would be fine

Bonus points for somehow calling the new implementation (like making a dummy SqlToRel somehow)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is looking great !

@@ -1741,6 +1741,12 @@ impl ContextProvider for SessionState {
.as_ref()
.and_then(|provider| provider.get(&provider_type)?.get_type(variable_names))
}

fn get_config_option(&self, variable: &str) -> Option<ScalarValue> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines 2709 to 2729
SQLDataType::Timestamp(tz_info) => {
let tz = if matches!(tz_info, TimezoneInfo::Tz)
|| matches!(tz_info, TimezoneInfo::WithTimeZone)
{
match self.schema_provider.get_config_option("datafusion.execution.time_zone") {
Some(ScalarValue::Utf8(s)) => {
s
}
Some(_) => {
None
}
None => None
}
//Some("+00:00".to_string())

} else {
None
};
Ok(DataType::Timestamp(TimeUnit::Nanosecond, tz))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is somewhat strange to me that the type of a table that is created depends on the current session 's timezone setting -- but I can't think of anything better short of the user explicitly providing the timezone themselves

let tz = if matches!(tz_info, TimezoneInfo::Tz)
|| matches!(tz_info, TimezoneInfo::WithTimeZone)
{
match self.schema_provider.get_config_option("datafusion.execution.time_zone") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to use a named constant instead of the string here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb named constants are in datafusion/core. i'm not sure how to import them from here (due to circular import)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 maybe we should eventually move the config option stuff to common? I have a todo on my list to write up a ticket about rationalizing the config system - if no one else beats me to it I'll probably start in a week or two

Comment on lines 2921 to 2923
/*
pub fn convert_simple_data_type(sql_type: &SQLDataType) -> Result<DataType> {
match sql_type {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should remove it personally or of we want to deprecate it for a release cycle that would be fine

Bonus points for somehow calling the new implementation (like making a dummy SqlToRel somehow)

@github-actions github-actions bot added the optimizer Optimizer rules label Nov 7, 2022
Copy link
Contributor Author

@waitingkuo waitingkuo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb @avantgardnerio
i made some updates

  1. fix examples
  2. limit the usage to fixedoffset (like +08:00)
  3. add test cases for different timezone format
  4. delete pub fn convert_simple_data_type, it becomes the method of SqlToRel
  5. still use dataufion.execution.time_zone instead of TIME_ZONE_OPT since importing constant variable in sql/planners introduces circular import.

Comment on lines -2811 to -2823
/// Convert SQL simple data type to relational representation of data type
pub fn convert_simple_data_type(sql_type: &SQLDataType) -> Result<DataType> {
match sql_type {
SQLDataType::Boolean => Ok(DataType::Boolean),
SQLDataType::TinyInt(_) => Ok(DataType::Int8),
SQLDataType::SmallInt(_) => Ok(DataType::Int16),
SQLDataType::Int(_) | SQLDataType::Integer(_) => Ok(DataType::Int32),
SQLDataType::BigInt(_) => Ok(DataType::Int64),
SQLDataType::UnsignedTinyInt(_) => Ok(DataType::UInt8),
SQLDataType::UnsignedSmallInt(_) => Ok(DataType::UInt16),
SQLDataType::UnsignedInt(_) | SQLDataType::UnsignedInteger(_) => {
Ok(DataType::UInt32)
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb i deleted it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I marked the PR as "API change"

Comment on lines 2710 to 2723
match self
.schema_provider
.get_config_option("datafusion.execution.time_zone")
{
Some(ScalarValue::Utf8(s)) => s,
Some(v) => {
return Err(DataFusionError::Internal(format!(
"Incorrect data type for time_zone: {}",
v.get_datatype(),
)))
}
None => return Err(DataFusionError::Internal(format!(
"Config Option datafusion.execution.time_zone doesn't exist"
))),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ScalarValue::UTF8 -> Option<String>
other ScalarValue -> Err
None (variable name doen't exsit) -> Err

Comment on lines +320 to +326
async fn set_time_zone_good_time_zone_format() {
let ctx =
SessionContext::with_config(SessionConfig::new().with_information_schema(true));

plan_and_collect(&ctx, "SET TIME ZONE = '+08:00'")
.await
.unwrap();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test cases for good time zone format

Comment on lines +328 to +339
// casting UTF-8 to TimestampTZ isn't supported yet, add Timestamp as the middle layer for now
let result =
plan_and_collect(&ctx, "SELECT '2000-01-01T00:00:00'::TIMESTAMP::TIMESTAMPTZ")
.await
.unwrap();
let expected = vec![
"+-----------------------------+",
"| Utf8(\"2000-01-01T00:00:00\") |",
"+-----------------------------+",
"| 2000-01-01T08:00:00+08:00 |",
"+-----------------------------+",
];
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally we could change this to '2000-01-01T00:00:00'::TIMESTAMPTZ once it's added in the casting kernel (in arrow-rs)

Comment on lines +320 to +340
async fn set_time_zone_good_time_zone_format() {
let ctx =
SessionContext::with_config(SessionConfig::new().with_information_schema(true));

plan_and_collect(&ctx, "SET TIME ZONE = '+08:00'")
.await
.unwrap();

// casting UTF-8 to TimestampTZ isn't supported yet, add Timestamp as the middle layer for now
let result =
plan_and_collect(&ctx, "SELECT '2000-01-01T00:00:00'::TIMESTAMP::TIMESTAMPTZ")
.await
.unwrap();
let expected = vec![
"+-----------------------------+",
"| Utf8(\"2000-01-01T00:00:00\") |",
"+-----------------------------+",
"| 2000-01-01T08:00:00+08:00 |",
"+-----------------------------+",
];
// this might break once https://github.com/apache/arrow-rs/issues/1936 fixed
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that these test cases might need to be modified after apache/arrow-rs#1936 fixed. I added a message here

Comment on lines +401 to +408
#[tokio::test]
async fn set_time_zone_bad_time_zone_format() {
let ctx =
SessionContext::with_config(SessionConfig::new().with_information_schema(true));

plan_and_collect(&ctx, "SET TIME ZONE = '+08:00:00'")
.await
.unwrap();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test cases for bad time zone format

Comment on lines +470 to +477
let expected = vec![
"+-------------------------------------------------------+",
"| Utf8(\"2000-01-01T00:00:00\") |",
"+-------------------------------------------------------+",
"| 2000-01-01T00:00:00 (Unknown Time Zone 'Asia/Taipei') |",
"+-------------------------------------------------------+",
];
assert_batches_eq!(expected, &result);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@avantgardnerio
based on the discussion here #4106 (comment)

named timezone like America/Denver or Asia/Taipei won't be supported in this pr

@waitingkuo waitingkuo requested a review from alamb November 7, 2022 10:35
Comment on lines +2710 to +2724
match self
.schema_provider
.get_config_option("datafusion.execution.time_zone")
{
Some(ScalarValue::Utf8(s)) => s,
Some(v) => {
return Err(DataFusionError::Internal(format!(
"Incorrect data type for time_zone: {}",
v.get_datatype(),
)))
}
None => return Err(DataFusionError::Internal(
"Config Option datafusion.execution.time_zone doesn't exist"
.to_string(),
)),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ScalarValue::Utf8 -> Option<String>
other ScalarValue -> Err
None (variable datafusion.execution.time_zone doesn't exist in config_options) -> Err

@waitingkuo waitingkuo marked this pull request as ready for review November 7, 2022 15:36
@alamb alamb added the api change Changes the API exposed to users of the crate label Nov 7, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM -- thank you @waitingkuo

I will wait for 24 hours to merge this as it has an API change.

Comment on lines -2811 to -2823
/// Convert SQL simple data type to relational representation of data type
pub fn convert_simple_data_type(sql_type: &SQLDataType) -> Result<DataType> {
match sql_type {
SQLDataType::Boolean => Ok(DataType::Boolean),
SQLDataType::TinyInt(_) => Ok(DataType::Int8),
SQLDataType::SmallInt(_) => Ok(DataType::Int16),
SQLDataType::Int(_) | SQLDataType::Integer(_) => Ok(DataType::Int32),
SQLDataType::BigInt(_) => Ok(DataType::Int64),
SQLDataType::UnsignedTinyInt(_) => Ok(DataType::UInt8),
SQLDataType::UnsignedSmallInt(_) => Ok(DataType::UInt16),
SQLDataType::UnsignedInt(_) | SQLDataType::UnsignedInteger(_) => {
Ok(DataType::UInt32)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I marked the PR as "API change"

@alamb
Copy link
Contributor

alamb commented Nov 7, 2022

Thank you again @waitingkuo for driving this feature set forward

@alamb alamb merged commit e8fe042 into apache:master Nov 8, 2022
@alamb
Copy link
Contributor

alamb commented Nov 8, 2022

Thanks again @waitingkuo

@ursabot
Copy link

ursabot commented Nov 8, 2022

Benchmark runs are scheduled for baseline = 63386d1 and contender = e8fe042. e8fe042 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate optimizer Optimizer rules sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support SET timezone to non-UTC time zone
3 participants