Skip to content

EPIC: Icehut timestamp support #100

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
rampage644 opened this issue Dec 27, 2024 · 8 comments
Closed

EPIC: Icehut timestamp support #100

rampage644 opened this issue Dec 27, 2024 · 8 comments

Comments

@rampage644
Copy link
Contributor

rampage644 commented Dec 27, 2024

Everything we know about timestamp: snowflake support, datafusion, arrow, Iceberg, etc.

All problems we identified, solved. All decisions we made.

@DanCodedThis
Copy link
Contributor

DanCodedThis commented Jan 9, 2025

Timestamps in datafusion works peculiarly. For example in query like, select convert_timezone('America/New_York, 'UTC', v3), where v3 is a timestamp with value '2024-01-06 08:00:00 America/New_York'.
Expected result: informs the user that the source_timestamp_ntz is not the right format, meaning it has a timezone. We do pattern matching on Timestamp"Unit"(Some(ts), None), where "Unit" is TimeUnit literal, "ts" is &i64.
What really happens: not only the query passes through, but the internal math inside datafusion breaks. If v3 were to be '2024-01-06 08:00:00' the result should've been '2024-01-06T13:00:00Z', but with v3 being '2024-01-06 08:00:00 America/New_York' result is '2024-01-06T18:00:00Z'.
Why this happens? -> Somewhere inside datafusion there happens a conversion where, the timezone gets added to the i64, and after that our operation happens which is correct.
How do I know this? -> If you try to pattern match on the Timestamp"Unit"(Some(ts), Some(tz)), where "Unit" is TimeUnit literal, "ts" is &i64, "tz" is &Arc<str>, which we do for the 2 argument version where the timestamp should have a timezone, "tz" always has the value "+00", no matter what timezone we put it too, but "ts" has the timezone offset inside of it calculated as i64. We also do have "exact" arms in the signature that only should take for the 3 argument version the last argument as a Timestamp(unit, None) where "unit" is TimeUnit (not literal).

@DanCodedThis
Copy link
Contributor

DanCodedThis commented Jan 9, 2025

A udf cannot return inside a timestamp timezone_wildcard or any predtermined timezones, because it understands you literally. So if you get a timestamp with a timezone of "+00" it expects that the return type will have the same timezone. And in fn return types, since this is dynamic and we only have the argument types, not the arguments themselves, we cannot know the timezone in advance, or pattern match on it, etc. We can however write by hand or with the use of chrono_tz "TZ_Variants" or/and " Tz"(chrono_tz it is used internally by chrono for the Timezone trait and is made by the same people) in the creation of the signature all possible timezone combinations.

@DanCodedThis
Copy link
Contributor

DanCodedThis commented Jan 9, 2025

Timestamps in datafusion works peculiarly. For example in query like, select convert_timezone('America/New_York, 'UTC', v3), where v3 is a timestamp with value '2024-01-06 08:00:00 America/New_York'. Expected result: informs the user that the source_timestamp_ntz is not the right format, meaning it has a timezone. We do pattern matching on Timestamp"Unit"(Some(ts), None), where "Unit" is TimeUnit literal, "ts" is &i64. What really happens: not only the query passes through, but the internal math inside datafusion breaks. If v3 were to be '2024-01-06 08:00:00' the result should've been '2024-01-06T13:00:00Z', but with v3 being '2024-01-06 08:00:00 America/New_York' result is '2024-01-06T18:00:00Z'. Why this happens? -> Somewhere inside datafusion there happens a conversion where, the timezone gets added to the i64, and after that our operation happens which is correct. How do I know this? -> If you try to pattern match on the Timestamp"Unit"(Some(ts), Some(tz)), where "Unit" is TimeUnit literal, "ts" is &i64, "tz" is &Arc, which we do for the 2 argument version where the timestamp should have a timezone, "tz" always has the value "+00", no matter what timezone we put it too, but "ts" has the timezone offset inside of it calculated as i64. We also do have "exact" arms in the signature that only should take for the 3 argument version the last argument as Timestamp(unit, None) where "unit" is TimeUnit (not literal).

My suggestion is that we always accept only utf8 and get the timezone from there before datafusion converts it to a timestamp with right value (ts), but a broken timezone (tz).

@DanCodedThis
Copy link
Contributor

DanCodedThis commented Jan 9, 2025

A udf cannot return inside a timestamp timezone_wildcard, because it understands you literally. So if you get a timestamp with a timezone of "+00" it expects that the return type will have the same timezone. And in fn return types, since this is dynamic and we only have the argument types, not the arguments themselves, we cannot know the timezone in advance, or pattern match on it, etc. We can however write by hand or with the use of chrono_tz "TZ_Variants" or/and " Tz"(chrono_tz it is used internally by chrono for the Timezone trait and is made by the same people) in the creation of the signature all possible timezone combinations.

Even if we do this we still cannot know the output timezone in advance, because even if the timezone is not going to be "+00" we will change it, and for example in a 2 argument version select convert_timezone('UTC', v3), where v3 is a timestamp with value '2024-01-06 08:00:00 America/New_York', it will still expect 'America/New_York' not "+00", "UTC", etc.

@DanCodedThis
Copy link
Contributor

Timestamps in datafusion works peculiarly. For example in query like, select convert_timezone('America/New_York, 'UTC', v3), where v3 is a timestamp with value '2024-01-06 08:00:00 America/New_York'. Expected result: informs the user that the source_timestamp_ntz is not the right format, meaning it has a timezone. We do pattern matching on Timestamp"Unit"(Some(ts), None), where "Unit" is TimeUnit literal, "ts" is &i64. What really happens: not only the query passes through, but the internal math inside datafusion breaks. If v3 were to be '2024-01-06 08:00:00' the result should've been '2024-01-06T13:00:00Z', but with v3 being '2024-01-06 08:00:00 America/New_York' result is '2024-01-06T18:00:00Z'. Why this happens? -> Somewhere inside datafusion there happens a conversion where, the timezone gets added to the i64, and after that our operation happens which is correct. How do I know this? -> If you try to pattern match on the Timestamp"Unit"(Some(ts), Some(tz)), where "Unit" is TimeUnit literal, "ts" is &i64, "tz" is &Arc, which we do for the 2 argument version where the timestamp should have a timezone, "tz" always has the value "+00", no matter what timezone we put it too, but "ts" has the timezone offset inside of it calculated as i64. We also do have "exact" arms in the signature that only should take for the 3 argument version the last argument as Timestamp(unit, None) where "unit" is TimeUnit (not literal).

This is confirmed by apache/datafusion#12218

@DanCodedThis
Copy link
Contributor

All those finds are connected to #76

@eadgbear
Copy link
Contributor

eadgbear commented Jan 9, 2025

Problem Description

Timestamps in Datafusion are not handled as expected in certain scenarios. Specifically, when using a query like:

SELECT convert_timezone('America/New_York', 'UTC', v3)

Where v3 is a timestamp with the value 2024-01-06 08:00:00 America/New_York.

Expected Behavior

The system should recognize that the source_timestamp_ntz format is invalid because it contains a timezone. It should match the pattern:

Timestamp<Unit>(Some(ts), None)
  • Unit: A TimeUnit literal.
  • ts: A &i64
    If v3 were 2024-01-06 08:00:00 (without a timezone), the result should be:
2024-01-06T13:00:00Z

Actual Behavior

Instead of throwing an error or handling it properly, the query passes through. However, the internal math in Datafusion breaks.

For example, with v3 as 2024-01-06 08:00:00 America/New_York, the result becomes:

2024-01-06T18:00:00Z

Root Cause

Somewhere inside Datafusion, the timezone is added to the i64 representation of the timestamp. Subsequent operations treat this modified value as if it were timezone-naive, leading to incorrect results.

Observations

When attempting to match the pattern:

Timestamp<Unit>(Some(ts), Some(tz))
  • Unit: A TimeUnit literal.
  • ts: A &i64.
  • tz: A &Arc<str>.

This pattern is used in the two-argument version of the function (where the timestamp should include a timezone). However:

The tz always has the value "+00", regardless of the input timezone.
The ts contains the timezone offset, already added to the i64 value.
In the three-argument version, where the last argument is expected to be Timestamp<Unit, None> the signature matches exactly, but operations fail because the timestamp offset has already been applied internally.

Suggested Fix

Investigate and prevent the internal conversion that adds the timezone offset to the i64 representation of the timestamp.
Ensure proper pattern matching on Timestamp<Unit>(Some(ts), Some(tz)) and enforce correct timezone handling.

Steps to Reproduce

Execute the query:

SELECT convert_timezone('America/New_York', 'UTC', '2024-01-06 08:00:00 America/New_York')

Observe the incorrect result:

2024-01-06T18:00:00Z

Compare with the expected result for v3 as '2024-01-06 08:00:00' (without timezone):

2024-01-06T13:00:00Z

@DanCodedThis Let me know if ChatGPT accurately summarized the problem :)

@DanCodedThis
Copy link
Contributor

DanCodedThis commented Jan 9, 2025

@slyons It did, actually.

  • The only thing it didn't get is that the Timestamp is not generic over TimeUnit.
  • By "TimeUnit literal", what I mean is something like a string literal or a macro. (Like it's part of the type)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants