DateTime reader support #133

Open
nickrobinson251 opened this issue Feb 23, 2021 · 2 comments

Comments

nickrobinson251 commented Feb 23, 2021

Is it currently possible to read data in as a DateTime? If not, what would need to be done to add this?
(Sort of a partner to #108, although I don't know how related reader and writer support are.)

Current behaviour seems to be to read datetimes in as Int64 values.

For example, generating some data in Python:

>>> import pandas as pd
>>> 
>>> t1 = pd.Timestamp('2018-01-01 06:00:00+0000', tz='UTC')
>>> t2 = pd.Timestamp('2018-01-01 07:00:00+0000', tz='UTC')
>>> df = pd.DataFrame([t1, t2], columns=["datetime_utc"])
>>> df["datetime_utc"].dtype
datetime64[ns, UTC]
>>>
>>> df.to_parquet("datetimes.parquet")

and then reading it in Julia:

julia> pq_file = Parquet.File("datetimes.parquet")
Parquet file: datetimes.parquet
    version: 1
    nrows: 2
    created by: parquet-cpp version 1.5.1-SNAPSHOT
    cached: 0 column chunks


julia> schema(pq_file)
Schema:
    required schema {
      optional INT64 datetime_utc # (from TIMESTAMP_MICROS)
    }

I’ve tried to use the map_logical_types keyword, for example Dict(["datetime_utc"] => (DateTime, Parquet.logical_timestamp)), but this fails with ERROR: unsupported storage type 2 for DateTime.

oxinabox commented Feb 24, 2021

I think this might just be a bug on this line, which lists a wrong or incomplete set of storage types:

if storage_type === _Type.INT96

The INT96 is defined here

INT96::Int32

From the same file: type 2 is INT64 (INT32 is type 1), which matches the INT64 column in the schema above.
I suspect a branch for INT64 needs to be added.

@tanmaykm
Member

We need to have an implementation that can decode Int64 logical timestamps and then plug it in there.

This is the format specification: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp.

The Parquet.logical_timestamp method currently handles only Int96 format and can't be used to decode Int64 encoded format.
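
As a rough illustration of what such a decoder has to do, here is a minimal sketch in Python (not Parquet.jl code). It assumes the INT64 value counts milliseconds or microseconds since the Unix epoch in UTC, as the TIMESTAMP_MILLIS and TIMESTAMP_MICROS annotations in the linked specification describe. The function name `decode_int64_timestamp` and the `unit` parameter are hypothetical, chosen only for this example. Nanosecond units are left out because Python's `datetime` cannot represent them.

```python
from datetime import datetime, timedelta, timezone

# Unix epoch, the reference point for INT64 timestamps (assumed UTC-adjusted).
EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_int64_timestamp(value: int, unit: str = "MICROS") -> datetime:
    """Convert a raw INT64 timestamp into a timezone-aware UTC datetime.

    `unit` is a hypothetical parameter naming the time unit of `value`:
    "MILLIS" or "MICROS".
    """
    if unit == "MILLIS":
        return EPOCH_UTC + timedelta(milliseconds=value)
    if unit == "MICROS":
        return EPOCH_UTC + timedelta(microseconds=value)
    raise ValueError(f"unsupported timestamp unit: {unit!r}")


# The first row of the example file, 2018-01-01 06:00:00 UTC,
# expressed as microseconds since the epoch.
raw = 1_514_786_400_000_000
decoded = decode_int64_timestamp(raw, unit="MICROS")
print(decoded.isoformat())
```

A Julia version would presumably do the same arithmetic against `DateTime(1970, 1, 1)` and be selected in the branch that currently only accepts INT96, but the exact hook inside Parquet.jl is not shown in this thread.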
