Skip to content

ktrueda/go-parquet-tools

Repository files navigation

go-parquet-tools

Alternative to pypi parquet-tools in Golang.

You can show content/schema of parquet file(s) on local disk or on Amazon S3. It is incompatible with original parquet-tools. go-parquet-tools is faster because this is implemented in golang.

Install

go install github.com/ktrueda/go-parquet-tools@latest

Usage

go-parquet-tools csv test_resources/test1.parquet
one,two,three
-1,foo,true
<nil>,bar,false
2.5,baz,true
go-parquet-tools show --nil None "test_resources/*"
+------+-----+-------+
| one  | two | three |
+------+-----+-------+
| -1   | foo | true  |
| None | bar | false |
| 2.5  | baz | true  |
| -1   | foo | true  |
| None | bar | false |
| 2.5  | baz | true  |
+------+-----+-------+
go-parquet-tools show s3://foo/test1.parquet
Downloaded s3://foo/test.parquet to /var/folders/f3/9l_qwscs3z94m3yw255bw4l40000gn/T/9ed16365-58e2-40f2-a492-e8477b418a0f.parquet .
+-------+-----+-------+
| one   | two | three |
+-------+-----+-------+
| -1    | foo | true  |
| <nil> | bar | false |
| 2.5   | baz | true  |
+-------+-----+-------+
go-parquet-tools inspect test_resources/test1.parquet
insepct output
Version:  1
Schema:
        ######### schema #########
        Type:  <nil>
        TypeLength:  <nil>
        RepetitionType:  REQUIRED
        Name:  schema
        NumChildren:  0xc0000288d8
        ConvertedType:  <nil>
        Scale:  <nil>
        Precision:  <nil>
        FieldID:  <nil>
        LogicalType:  <nil>
        ######### one #########
        Type:  DOUBLE
        TypeLength:  <nil>
        RepetitionType:  OPTIONAL
        Name:  one
        NumChildren:  <nil>
        ConvertedType:  <nil>
        Scale:  <nil>
        Precision:  <nil>
        FieldID:  <nil>
        LogicalType:  <nil>
        ######### two #########
        Type:  BYTE_ARRAY
        TypeLength:  <nil>
        RepetitionType:  OPTIONAL
        Name:  two
        NumChildren:  <nil>
        ConvertedType:  UTF8
        Scale:  <nil>
        Precision:  <nil>
        FieldID:  <nil>
        LogicalType:  LogicalType({STRING:StringType({}) MAP:<nil> LIST:<nil> ENUM:<nil> DECIMAL:<nil> DATE:<nil> TIME:<nil> TIMESTAMP:<nil> INTEGER:<nil> UNKNOWN:<nil> JSON:<nil> BSON:<nil> UUID:<nil>})
        ######### three #########
        Type:  BOOLEAN
        TypeLength:  <nil>
        RepetitionType:  OPTIONAL
        Name:  three
        NumChildren:  <nil>
        ConvertedType:  <nil>
        Scale:  <nil>
        Precision:  <nil>
        FieldID:  <nil>
        LogicalType:  <nil>
NumRows:  3
RowGroups:
        Columns:
                #########
                FilePath  <nil>
                FileOffset  108
                MetaData.Type  DOUBLE
                MetaData.Encodings  [PLAIN_DICTIONARY PLAIN RLE]
                MetaData.PathInSchema  [one]
                MetaData.Codec  SNAPPY
                MetaData.NumValues  3
                MetaData.TotalUncompressedSize  100
                MetaData.TotalCompressedSize  104
                MetaData.KeyValueMetadata  []
                MetaData.DataPageOffset  36
                MetaData.IndexPageOffset  <nil>
                MetaData.DictionaryPageOffset  0xc000028930
                MetaData.Statistics  Statistics({Max:[0 0 0 0 0 0 4 64] Min:[0 0 0 0 0 0 240 191] NullCount:0xc000028938 DistinctCount:<nil> MaxValue:[0 0 0 0 0 0 4 64] MinValue:[0 0 0 0 0 0 240 191]})
                MetaData.EncodingStats  [PageEncodingStats({PageType:DICTIONARY_PAGE Encoding:PLAIN_DICTIONARY Count:1}) PageEncodingStats({PageType:DATA_PAGE Encoding:PLAIN_DICTIONARY Count:1})]
                MetaData.BloomFilterOffset  <nil>
                OffsetIndexOffset  <nil>
                OffsetIndexLength  <nil>
                ColumnIndexOffset  <nil>
                ColumnIndexLength  <nil>
                CryptoMeatadata  <nil>
                EncryptedColumnMetadata  []
                #########
                FilePath  <nil>
                FileOffset  281
                MetaData.Type  BYTE_ARRAY
                MetaData.Encodings  [PLAIN_DICTIONARY PLAIN RLE]
                MetaData.PathInSchema  [two]
                MetaData.Codec  SNAPPY
                MetaData.NumValues  3
                MetaData.TotalUncompressedSize  76
                MetaData.TotalCompressedSize  80
                MetaData.KeyValueMetadata  []
                MetaData.DataPageOffset  238
                MetaData.IndexPageOffset  <nil>
                MetaData.DictionaryPageOffset  0xc000028948
                MetaData.Statistics  Statistics({Max:[] Min:[] NullCount:0xc000028950 DistinctCount:<nil> MaxValue:[102 111 111] MinValue:[98 97 114]})
                MetaData.EncodingStats  [PageEncodingStats({PageType:DICTIONARY_PAGE Encoding:PLAIN_DICTIONARY Count:1}) PageEncodingStats({PageType:DATA_PAGE Encoding:PLAIN_DICTIONARY Count:1})]
                MetaData.BloomFilterOffset  <nil>
                OffsetIndexOffset  <nil>
                OffsetIndexLength  <nil>
                ColumnIndexOffset  <nil>
                ColumnIndexLength  <nil>
                CryptoMeatadata  <nil>
                EncryptedColumnMetadata  []
                #########
                FilePath  <nil>
                FileOffset  388
                MetaData.Type  BOOLEAN
                MetaData.Encodings  [PLAIN RLE]
                MetaData.PathInSchema  [three]
                MetaData.Codec  SNAPPY
                MetaData.NumValues  3
                MetaData.TotalUncompressedSize  40
                MetaData.TotalCompressedSize  42
                MetaData.KeyValueMetadata  []
                MetaData.DataPageOffset  346
                MetaData.IndexPageOffset  <nil>
                MetaData.DictionaryPageOffset  <nil>
                MetaData.Statistics  Statistics({Max:[1] Min:[0] NullCount:0xc000028970 DistinctCount:<nil> MaxValue:[1] MinValue:[0]})
                MetaData.EncodingStats  [PageEncodingStats({PageType:DATA_PAGE Encoding:PLAIN Count:1})]
                MetaData.BloomFilterOffset  <nil>
                OffsetIndexOffset  <nil>
                OffsetIndexLength  <nil>
                ColumnIndexOffset  <nil>
                ColumnIndexLength  <nil>
                CryptoMeatadata  <nil>
                EncryptedColumnMetadata  []
        TotalByteSize:  226
        NumRows:  3
        SotringColumns:  []
        FileOffset:  0xc000028978
        TotalCompressedSize:  0xc000028980
        Ordinal:  0xc000028988
KeyValueMetaData:  [KeyValue({Key:pandas Value:0xc000063330}) KeyValue({Key:ARROW:schema Value:0xc000063340})]
CreatedBy:  0xc000063350
ColumnOrders:  [ColumnOrder({TYPE_ORDER:TypeDefinedOrder({})}) ColumnOrder({TYPE_ORDER:TypeDefinedOrder({})}) ColumnOrder({TYPE_ORDER:TypeDefinedOrder({})})]
EncryptionAlgorithm:  <nil>
FooterSigningKeyMetadata:  []

Benchmark result

go-parquet-tools is 100x faster than pypi parquet-tools.

Command Mean [ms] Min [ms] Max [ms] Relative
parquet-tools csv test_resources/test1.parquet 702.8 ± 19.9 676.2 739.4 1.00
go-parquet-tools csv test_resources/test1.parquet 6.6 ± 0.4 6.2 7.3 1.00

https://github.com/sharkdp/hyperfine

About

Faster parquet-tools

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published