
DateTime writer support #108

Open
liamlundy opened this issue Nov 13, 2020 · 4 comments
Comments

@liamlundy

I know you mention in the README that a few types, including date-like types, are not yet supported by the writer. I didn't see any issues referencing this, so I wanted to add one to track its status.

Is this planned to be supported soon? If not, what needs to happen in order to support this? I might be able to create a PR at some point.

@xiaodaigh
Contributor

Hmm, I think you can start by looking at the writer.jl file. It would also be good to link the relevant DateTime support page from the parquet-format repo.

Can you write some parquet files with datetime columns in Python or R and provide some simple files for testing?

@liamlundy
Author

Okay I was able to create some example files using pyarrow. I'll include the code I used to generate those examples.

Link to Apache Parquet docs about the Date / Time Logical Types: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
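Per that spec, the date/time logical types all annotate plain integer physical types: DATE is an int32 count of days since the Unix epoch, TIMESTAMP is an int64 count of millis/micros/nanos since the epoch, and TIME is a count of units since midnight. A minimal sketch of the writer-side encodings in Python (the helper names here are just for illustration, not part of any library):

```python
from datetime import date, datetime, time, timezone

_EPOCH_DATE = date(1970, 1, 1)
_EPOCH_DT = datetime(1970, 1, 1, tzinfo=timezone.utc)

def encode_date(d: date) -> int:
    # DATE logical type: int32 days since the Unix epoch
    return (d - _EPOCH_DATE).days

def encode_timestamp_us(dt: datetime) -> int:
    # TIMESTAMP(MICROS): int64 microseconds since the Unix epoch
    delta = dt - _EPOCH_DT
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds

def encode_time_us(t: time) -> int:
    # TIME(MICROS): int64 microseconds since midnight
    return ((t.hour * 60 + t.minute) * 60 + t.second) * 1_000_000 + t.microsecond

print(encode_date(date(2000, 1, 1)))  # 10957
```

MILLIS and NANOS are the same arithmetic with a different scale factor, so a writer mostly needs the type annotation plumbing rather than new encoding logic.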

I'm not sure how soon I'll be able to dig into this, but I'll leave this info here for myself or anyone else that wants to take a crack at it in the meantime. It looks like there is also some work to be done to support reading a few of the date / time types as well.
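For the reading side, decoding is just the inverse: add the stored integer back onto the epoch. A sketch, again with hypothetical helper names:

```python
from datetime import date, datetime, timedelta, timezone

_EPOCH_DT = datetime(1970, 1, 1, tzinfo=timezone.utc)

def decode_date(days: int) -> date:
    # DATE: int32 days since the Unix epoch -> calendar date
    return date(1970, 1, 1) + timedelta(days=days)

def decode_timestamp_ms(millis: int) -> datetime:
    # TIMESTAMP(MILLIS): int64 milliseconds since the epoch -> datetime
    return _EPOCH_DT + timedelta(milliseconds=millis)

print(decode_date(10957))  # 2000-01-01
```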

Python script for generating parquet files with datetime columns:

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd


if __name__ == "__main__":
    weeks = pd.date_range(start="2000-01-01", periods=26, freq="W")
    hours = pd.date_range(start="2000-01-01", periods=26, freq="H")
    data = pd.DataFrame(
        {
            "ns": list(weeks),
            "ms": list(weeks),
            "us": list(weeks),
            "date": weeks.date,
            "time": hours.time,
        }
    )

    # The same timestamps at three resolutions, plus date and time-of-day columns.
    schema = pa.schema(
        [
            pa.field("ns", pa.timestamp("ns")),
            pa.field("ms", pa.timestamp("ms")),
            pa.field("us", pa.timestamp("us")),
            pa.field("date", pa.date64()),
            pa.field("time", pa.time64("ns")),
        ]
    )

    table = pa.Table.from_pandas(data, schema=schema)

    # Write with both parquet format versions for comparison.
    pq.write_table(table, "example-v1.parquet")
    pq.write_table(table, "example-v2.parquet", version="2.0")

    v1_file = pq.ParquetFile("example-v1.parquet")
    v2_file = pq.ParquetFile("example-v2.parquet")
    print(v1_file.schema)
    print(v2_file.schema)

Output (note that in the v1 file the "ns" column is stored with microsecond precision, since nanosecond timestamps require format version 2):

<pyarrow._parquet.ParquetSchema object at 0x10f054c80>
required group field_id=0 schema {
  optional int64 field_id=1 ns (Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional int64 field_id=2 ms (Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional int64 field_id=3 us (Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional int32 field_id=4 date (Date);
  optional int64 field_id=5 time (Time(isAdjustedToUTC=true, timeUnit=nanoseconds));
}

<pyarrow._parquet.ParquetSchema object at 0x10f05d690>
required group field_id=0 schema {
  optional int64 field_id=1 ns (Timestamp(isAdjustedToUTC=false, timeUnit=nanoseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional int64 field_id=2 ms (Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional int64 field_id=3 us (Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional int32 field_id=4 date (Date);
  optional int64 field_id=5 time (Time(isAdjustedToUTC=true, timeUnit=nanoseconds));
}

@the-noble-argon

Being able to write datetimes is crucial for a lot of data science applications. I'm still forced to call Python for this, which makes it very hard to scale any time-series Julia solution that has to interact with other components that speak parquet.

@xiaodaigh
Contributor

I wonder if the parquet2.jl implementation solves this?
