Parquetify is a lightweight tool leveraging the parquet-java library to generate Apache Parquet files based on the file definition provided in a JSON file.
Feature | Description |
---|---|
Physical Data Types: | All physical data types: `INT32`, `INT64`, `BOOLEAN`, `FLOAT`, `DOUBLE`, `BINARY`, `FIXED_LEN_BYTE_ARRAY`. |
Logical Data Types: | Most logical types: `UTF8`, `DECIMAL`, `DATE`, `TIME_MILLIS`, `TIME_MICROS`, `TIMESTAMP_MILLIS`, `TIMESTAMP_MICROS`, `ENUM`, `NONE`, `MAP`, `LIST`, `STRING`, `MAP_KEY_VALUE`, `TIME`, `INTEGER`, `JSON`, `BSON`, `UUID`, `INTERVAL`, `UINT_8`, `UINT_16`, `UINT_32`, `UINT_64`, `INT_8`, `INT_16`, `INT_32`, `INT_64`, `FLOAT16`. |
Precision & Scale: | Precision and scale for `DECIMAL` types. |
Compression: | `NONE`, `SNAPPY`, `GZIP`, `LZO`, `BROTLI`, `LZ4`, `ZSTD`. |
Encodings: | Automatically set by the writer for a given column. |
Bloom Filter: | Apply a bloom filter to specific columns or all columns (including those within groups). |
Writer Version: | Specify the writer version (`1.0`, `2.0`). |
Customizable Sizes: | Row group and page sizes. |
- Full documentation
- Installation
- Creating Parquet File
- Parquet File Schema Documentation
- Overview
- Fields
- File Definition
- Writer Options
- Schema Definition
- Missing Functionality

Full documentation
See the wiki for the full documentation.

Installation
- Download the latest release from the Releases page:

      sudo apt update
      wget https://github.com/Altinity/parquet-regression/releases/download/1.0.3/parquetify_1.0.3_amd64.deb

  Note: Ensure that you download the package corresponding to your system architecture. Both ARM and x86_64 are supported.

- Install the `.deb` package:

      sudo apt install ./parquetify_1.0.3_amd64.deb

- To confirm the installation, run the following command:

      parquetify

  If successful, you will see usage instructions like:

      Error parsing command line arguments: Missing required options: j, o
      usage: GenerateParquet
       -j,--json <arg>     Path to the JSON file
       -o,--output <arg>   Output path for the Parquet file

Creating Parquet File
To generate your first Parquet file, use the provided example JSON available in our schema-example folder:

    parquetify -j example.json -o /path/to/output/file.parquet

Warning
Parquetify allows you to specify any structure, including incorrect ones. If the structure is invalid, the Parquet file may be generated, but it may not be readable by tools or databases.

Parquet File Schema Documentation
This document provides guidelines for defining the structure and properties of a Parquet file using a JSON schema. This schema aligns with Parquet-Java API terms, supporting complex types, including nested values in MAP structures.

Overview
- Schema Version: Draft-07 JSON Schema
- Title: Parquet File Schema
- Description: Defines Parquet file configuration with options for file metadata, writer settings, and column definitions.

Fields

File Definition
- `fileName` (string, required): Specifies the name of the output Parquet file.

Writer Options
The `options` object contains additional options for configuring the Parquet writer.
- `writerVersion` (string): Version of the Parquet writer. Defaults to `"1.0"`.
  - Options: `"1.0"`, `"2.0"`
- `compression` (string): Compression codec to use. Defaults to `"SNAPPY"`.
  - Options: `"NONE"`, `"SNAPPY"`, `"GZIP"`, `"LZO"`, `"BROTLI"`, `"LZ4"`, `"ZSTD"`
- `rowGroupSize` (integer): Size of row groups in bytes. Defaults to `134217728` (128 MiB).
- `pageSize` (integer): Page size in bytes. Defaults to `1048576` (1 MiB).
- `bloomFilter` (string): Columns to which a bloom filter is applied. Defaults to `"none"`.
  - Options: `"none"`, `"all"`, or a list of specific columns such as `["column1", "column2"]`
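
As a rough illustration of the writer options above, the fragment below sets each option explicitly and requests bloom filters for two columns only. The column names `id` and `name` are hypothetical, and the array form of `bloomFilter` is inferred from the options list rather than verified against the tool.

```json
{
  "options": {
    "writerVersion": "1.0",
    "compression": "ZSTD",
    "rowGroupSize": 134217728,
    "pageSize": 1048576,
    "bloomFilter": ["id", "name"]
  }
}
```

Any option left out should fall back to the default listed above, so in practice only the values that differ from the defaults need to appear.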

Schema Definition
The `schema` array defines the structure and properties of each column in the Parquet file. It includes column data types, nesting, and complex structures such as MAP.
- `name` (string, required): Name of the column.
- `schemaType` (string, required): Schema type for the column, aligning with the Parquet-Java API.
  - Options: `"optional"`, `"required"`, `"repeated"`, `"optionalGroup"`, `"requiredGroup"`, `"repeatedGroup"`
- `physicalType` (string): Physical data type of the column.
  - Options: `"INT32"`, `"INT64"`, `"BOOLEAN"`, `"FLOAT"`, `"DOUBLE"`, `"BINARY"`, `"FIXED_LEN_BYTE_ARRAY"`
- `logicalType` (string): Logical data type, aligning with Parquet-Java `OriginalType`.
  - Options: `"UTF8"`, `"DECIMAL"`, `"DATE"`, `"TIME_MILLIS"`, `"TIME_MICROS"`, `"TIMESTAMP_MILLIS"`, `"TIMESTAMP_MICROS"`, `"ENUM"`, `"NONE"`, `"MAP"`, `"LIST"`, `"STRING"`, `"MAP_KEY_VALUE"`, `"TIME"`, `"INTEGER"`, `"JSON"`, `"BSON"`, `"UUID"`, `"INTERVAL"`, `"FLOAT16"`, `"UINT8"`, `"UINT16"`, `"UINT32"`, `"UINT64"`, `"INT8"`, `"INT16"`, `"INT32"`, `"INT64"`

These properties are relevant for the `DECIMAL` logical type or the `FIXED_LEN_BYTE_ARRAY` physical type:
- `precision` (integer): Precision for `DECIMAL`. Minimum value is 1.
- `scale` (integer): Scale for `DECIMAL`. Minimum value is 0.
- `length` (integer): Length for `FIXED_LEN_BYTE_ARRAY`. Minimum value is 1.
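
To make the three properties concrete, here is one plausible column definition that stores a decimal in a fixed-length byte array. The column name `price` is hypothetical. The length of 8 bytes is chosen from the Parquet format rule that an n-byte fixed-length array can hold at most floor(log10(2^(8n-1) - 1)) decimal digits, which is 18 digits for n = 8; whether Parquetify checks this itself is not stated here.

```json
{
  "name": "price",
  "schemaType": "required",
  "physicalType": "FIXED_LEN_BYTE_ARRAY",
  "logicalType": "DECIMAL",
  "precision": 18,
  "scale": 2,
  "length": 8
}
```

An object like this would sit inside the top-level `schema` array alongside the other column definitions.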

Nested structures for group types are defined with the following property:
- `fields` (array of objects): Additional fields for grouped or nested columns (used with the `optionalGroup`, `requiredGroup`, or `repeatedGroup` schema types).
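
Since each entry in `fields` is itself a column definition, groups can presumably be nested inside one another. The sketch below assumes that this recursion is supported and uses hypothetical column names (`orders`, `order_id`, `shipping`, `city`).

```json
{
  "name": "orders",
  "schemaType": "repeatedGroup",
  "fields": [
    {
      "name": "order_id",
      "schemaType": "required",
      "physicalType": "INT64",
      "logicalType": "NONE"
    },
    {
      "name": "shipping",
      "schemaType": "optionalGroup",
      "fields": [
        {
          "name": "city",
          "schemaType": "optional",
          "physicalType": "BINARY",
          "logicalType": "UTF8"
        }
      ]
    }
  ]
}
```

Group columns carry no `physicalType` of their own here; only the leaf fields declare physical and logical types.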

If a column has a MAP type, key and value schemas are specified separately.
- `keyType` (object): Schema for the MAP key:
  - `physicalType` (string): Physical type, options include `"INT32"`, `"INT64"`, `"BINARY"`.
  - `logicalType` (string): Logical type, options are `"UTF8"` or `"NONE"`.
- `valueType` (object): Schema for the MAP value, supporting complex nested types:
  - `physicalType` (string): Options include `"INT32"`, `"INT64"`, `"BINARY"`, `"BOOLEAN"`, `"FLOAT"`, `"DOUBLE"`, `"MAP"`, `"GROUP"`.
  - `logicalType` (string): Options include `"UTF8"`, `"DECIMAL"`, `"NONE"`.
  - `fields` (array of objects): Additional fields if `valueType` is a complex type, such as `GROUP` or nested `MAP`.
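
A MAP column with string keys and integer values might look like the sketch below. This is an assumption-heavy illustration: the column name `attributes` is hypothetical, and combining `"logicalType": "MAP"` with a group `schemaType` is a guess based on the property descriptions above, not a confirmed layout. Check the examples in the schema-example folder before relying on it.

```json
{
  "name": "attributes",
  "schemaType": "optionalGroup",
  "logicalType": "MAP",
  "keyType": {
    "physicalType": "BINARY",
    "logicalType": "UTF8"
  },
  "valueType": {
    "physicalType": "INT64",
    "logicalType": "NONE"
  }
}
```

For a complex value, `valueType` would additionally carry a `fields` array describing the nested columns.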

Example file definition:

    {
      "fileName": "example.parquet",
      "options": {
        "writerVersion": "2.0",
        "compression": "GZIP",
        "rowGroupSize": 128000000,
        "pageSize": 1024000,
        "bloomFilter": "all"
      },
      "schema": [
        {
          "name": "id",
          "schemaType": "required",
          "physicalType": "INT32",
          "logicalType": "INTEGER"
        },
        {
          "name": "data",
          "schemaType": "optionalGroup",
          "fields": [
            {
              "name": "value",
              "schemaType": "optional",
              "physicalType": "BINARY",
              "logicalType": "UTF8"
            }
          ]
        }
      ]
    }