The file name is determined via the fileName field inside the JSON definition.
"fileName": "map_example.parquet",
Options determine the overall settings of the file, such as:
- the size of a row group
- the size of a page
- the version of the parquet writer
- compression
- applying a bloom filter on columns
Options should be specified as a set of key-value pairs inside the JSON file under the "options" key.
"options": {
"writerVersion": "1.0",
"compression": "snappy",
"rowGroupSize": "default",
"pageSize": "default",
"bloomFilter": "all"
}
In parquet-java, the writerVersion specifies the version of the Parquet format used when writing data to Parquet files. This setting determines the encodings, compression algorithms, and metadata structure applied during the write process.
Writer version "1.0":
- Ensures compatibility with older readers that only support the original Parquet format.
- Supports basic encoding methods like Plain and Dictionary encoding.
- Uses the original data page format without additional metadata.
- Limited to standard compression methods like Snappy and Gzip.
"options": {
"writerVersion": "1.0"
}
Writer version "2.0":
- May not be compatible with older readers but introduces enhancements for newer systems.
- Introduces advanced encodings such as Delta encoding (DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY), which improve compression efficiency for certain data types.
- Utilizes the Data Page V2 format, which includes checksums and more detailed metadata for better data integrity and performance.
- Supports additional compression codecs (Zstandard (ZSTD), Brotli, LZ4), potentially offering better compression ratios.
"options": {
"writerVersion": "2.0"
}
"options": {
"compression": "snappy",
}
The value for compression
can be anything from: UNCOMPRESSED, SNAPPY, GZIP, LZO, BROTLI, LZ4, ZSTD.
- rowGroupSize: Defines the maximum size (in bytes) of each row group when writing data to a Parquet file.
- pageSize: Defines the size (in bytes) of each page within a column chunk.
Row group: A logical horizontal partitioning of the data into rows. There is no physical structure that is guaranteed for a row group. A row group consists of a column chunk for each column in the dataset.
Page: Column chunks are divided up into pages. A page is conceptually an indivisible unit (in terms of compression and encoding). There can be multiple page types which are interleaved in a column chunk.
"options": {
"rowGroupSize": 256,
"pageSize": 1024
}
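For reference, a sketch of the byte arithmetic involved; the 128 MiB row group and 1 MiB page figures below are assumptions based on parquet-java's commonly documented defaults (used when "default" is specified):

```python
# Sizes are specified in bytes; a helper makes the intent explicit.
def mib(n: int) -> int:
    """Convert mebibytes to bytes."""
    return n * 1024 * 1024

# Assumed parquet-java defaults for the "default" setting.
DEFAULT_ROW_GROUP_SIZE = mib(128)  # 134217728 bytes
DEFAULT_PAGE_SIZE = mib(1)         # 1048576 bytes

print(DEFAULT_ROW_GROUP_SIZE, DEFAULT_PAGE_SIZE)  # 134217728 1048576
```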
A bloom filter can be applied to all columns, no columns at all, or specific columns.
"options": {
"bloomFilter": "all"
}
"options": {
"bloomFilter": ["id", "id2", "person.name"]
}
Here the value for bloomFilter is a list of column paths that the bloom filter should be applied to. For regular data types the path is simply the name of the column, like id; for complex types it depends on how deeply the column is nested. In this example we apply the bloom filter to the column name, which is located under an array of tuples with columns name and age, so the path to name is person.name.
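The dotted-path convention can be sketched as follows; resolve_path and the dict-based schema are hypothetical illustrations, not part of the generator:

```python
# Hypothetical sketch: resolve a dotted bloom-filter path like "person.name"
# against a nested schema represented as dicts of field names.
schema = {
    "id": "INT32",
    "person": {"name": "STRING", "age": "INT32"},
}

def resolve_path(schema: dict, path: str):
    """Walk a dotted column path through a nested schema; return the leaf type."""
    node = schema
    for part in path.split("."):
        node = node[part]  # raises KeyError if the path does not exist
    return node

print(resolve_path(schema, "person.name"))  # STRING
print(resolve_path(schema, "id"))           # INT32
```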
An alternative way to specify configurations is to use the Hadoop library and set the configurations from there.
"hadoopConfigs": {
"parquet.compression": "UNCOMPRESSED",
"parquet.enable.dictionary": "true",
"parquet.page.size": "1048576"
}
Note: all possible Hadoop configurations are listed here.
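Note that Hadoop configuration values are plain strings even when they carry numeric or boolean meaning; a sketch of how a consumer might coerce them (the chosen keys mirror the example above):

```python
hadoop_configs = {
    "parquet.compression": "UNCOMPRESSED",
    "parquet.enable.dictionary": "true",
    "parquet.page.size": "1048576",
}

# Hadoop stores every value as a string; coerce the ones we know are typed.
page_size = int(hadoop_configs["parquet.page.size"])
dictionary_enabled = hadoop_configs["parquet.enable.dictionary"] == "true"

print(page_size, dictionary_enabled)  # 1048576 True
```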
{
"name": "int8",
"schemaType": "required",
"physicalType": "INT32",
"logicalType": "INT8",
"data": [1, 2, 3, 4, 5]
}
{
"name": "int16",
"schemaType": "required",
"physicalType": "INT32",
"logicalType": "INT16",
"data": [1, 2, 3, 4, 5]
}
{
"name": "int32",
"schemaType": "required",
"physicalType": "INT32",
"logicalType": "INT32",
"data": [1, 2, 3, 4, 5]
}
{
"name": "int64",
"schemaType": "required",
"physicalType": "INT64",
"logicalType": "INT64",
"data": [1, 2, 3, 4, 5]
}
{
"name": "uint8",
"schemaType": "required",
"physicalType": "INT32",
"logicalType": "UINT8",
"data": [1, 2, 3, 4, 5]
}
{
"name": "uint16",
"schemaType": "required",
"physicalType": "INT32",
"logicalType": "UINT16",
"data": [1, 2, 3, 4, 5]
}
{
"name": "uint32",
"schemaType": "required",
"physicalType": "INT32",
"logicalType": "UINT32",
"data": [1, 2, 3, 4, 5]
}
{
"name": "uint64",
"schemaType": "required",
"physicalType": "INT64",
"logicalType": "UINT64",
"data": [1, 2, 3, 4, 5]
}
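In the examples above, the logical type narrows the range of values the wider physical type may carry: INT8/INT16/INT32 and their unsigned variants annotate a physical INT32, while INT64/UINT64 annotate a physical INT64. The bounds follow directly from the bit widths:

```python
# Range of a signed N-bit logical integer type (e.g. INT8, INT16).
def signed_range(bits: int):
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

# Range of an unsigned N-bit logical integer type (e.g. UINT8, UINT16).
def unsigned_range(bits: int):
    return 0, 2 ** bits - 1

print(signed_range(8))     # (-128, 127)
print(unsigned_range(16))  # (0, 65535)

# The sample data [1, 2, 3, 4, 5] fits even the narrowest type, INT8.
lo, hi = signed_range(8)
assert all(lo <= v <= hi for v in [1, 2, 3, 4, 5])
```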
{
"name": "utf8",
"schemaType": "required",
"physicalType": "BINARY",
"logicalType": "UTF8",
"data": ["one", "two", "three", "four", "five", "\uD83D\uDCE6"]
}
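The last entry, "\uD83D\uDCE6", is a JSON surrogate pair; decoded, it is the single code point U+1F4E6 (the package emoji), which occupies four bytes when stored in a UTF8 column:

```python
import json

# JSON escapes characters outside the Basic Multilingual Plane as a surrogate pair.
value = json.loads('"\\uD83D\\uDCE6"')

print(value)                            # the package emoji, one code point
print(len(value.encode("utf-8")))       # 4 bytes in UTF-8
```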
{
"name": "decimal_int32",
"schemaType": "required",
"physicalType": "INT32",
"logicalType": "DECIMAL",
"precision": 3,
"scale": 2,
"data": [123, 321, 424]
}
{
"name": "decimal_int64",
"schemaType": "required",
"physicalType": "INT64",
"logicalType": "DECIMAL",
"precision": 10,
"scale": 3,
"data": [2147483648, 2147483649, 2147483650]
}
{
"name": "decimal_binary",
"schemaType": "required",
"physicalType": "BINARY",
"logicalType": "DECIMAL",
"precision": 10,
"scale": 3,
"data": ["213", "421", "1234"]
}
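DECIMAL values are stored as unscaled integers; the logical value is the stored integer divided by 10^scale. For example, with precision 3 and scale 2, the stored value 123 represents 1.23:

```python
from decimal import Decimal

def decode_decimal(unscaled: int, scale: int) -> Decimal:
    """Interpret a stored unscaled integer as a DECIMAL logical value."""
    return Decimal(unscaled).scaleb(-scale)

# decimal_int32 column above: precision 3, scale 2.
print(decode_decimal(123, 2))         # 1.23
# decimal_int64 column above: precision 10, scale 3.
print(decode_decimal(2147483648, 3))  # 2147483.648
```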
{
"name": "date_field_int32",
"schemaType": "required",
"physicalType": "INT32",
"logicalType": "DATE",
"data": [18628]
}
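DATE values are stored as the number of days since the Unix epoch (1970-01-01); the sample value 18628 corresponds to 2021-01-01:

```python
from datetime import date, timedelta

def decode_date(days_since_epoch: int) -> date:
    """DATE logical type: days since the Unix epoch, stored in an INT32."""
    return date(1970, 1, 1) + timedelta(days=days_since_epoch)

print(decode_date(18628))  # 2021-01-01
```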
{
"name": "time_millis_field",
"schemaType": "required",
"physicalType": "INT32",
"logicalType": "TIME_MILLIS",
"data": [12345678, 23456789, 34567890, 45678901, 56789012]
}
{
"name": "time_micros_field",
"schemaType": "required",
"physicalType": "INT64",
"logicalType": "TIME_MICROS",
"data": [123456789012, 234567890123, 345678901234, 456789012345, 567890123456]
}
{
"name": "timestamp_micros_field",
"schemaType": "required",
"physicalType": "INT64",
"logicalType": "TIMESTAMP_MICROS",
"data": [1609459200000000, 1609545600000000, 1609632000000000, 1609718400000000, 1609804800000000]
}
{
"name": "timestamp_millis_field",
"schemaType": "required",
"physicalType": "INT64",
"logicalType": "TIMESTAMP_MILLIS",
"data": [1609459200000, 1609545600000, 1609632000000, 1609718400000, 1609804800000]
}
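TIME values count milliseconds (TIME_MILLIS) or microseconds (TIME_MICROS) since midnight, while TIMESTAMP values count from the Unix epoch. A decoding sketch for the timestamp samples above:

```python
from datetime import datetime, timezone

def decode_timestamp_millis(ms: int) -> datetime:
    """TIMESTAMP_MILLIS: milliseconds since the Unix epoch, stored in an INT64."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

def decode_timestamp_micros(us: int) -> datetime:
    """TIMESTAMP_MICROS: microseconds since the Unix epoch, stored in an INT64."""
    return datetime.fromtimestamp(us / 1_000_000, tz=timezone.utc)

print(decode_timestamp_millis(1609459200000))     # 2021-01-01 00:00:00+00:00
print(decode_timestamp_micros(1609459200000000))  # 2021-01-01 00:00:00+00:00
```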
{
"name": "json_field",
"schemaType": "optional",
"physicalType": "BINARY",
"logicalType": "JSON",
"data": [
"{\"key1\": \"value1\"}",
"{\"key2\": \"value2\"}",
"{\"key3\": \"value3\"}",
"{\"key4\": \"value4\"}",
"{\"key5\": \"value5\"}"
]
}
{
"name": "bson_field",
"schemaType": "optional",
"physicalType": "BINARY",
"logicalType": "BSON",
"data": [
"{\"key1\": \"value1\"}",
"{\"key2\": \"value2\"}",
"{\"key3\": \"value3\"}",
"{\"key4\": \"value4\"}",
"{\"key5\": \"value5\"}"
]
}
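In both columns the data entries are written as JSON text in the definition; for the BSON column, the generator presumably converts the text to BSON when writing (an assumption, not confirmed by this page). Either way, each entry must itself parse as valid JSON:

```python
import json

data = ['{"key1": "value1"}', '{"key2": "value2"}']

# Every entry is a JSON document embedded as a string.
parsed = [json.loads(entry) for entry in data]
print(parsed[0])  # {'key1': 'value1'}
```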
{
"name": "binary_field",
"schemaType": "required",
"physicalType": "BINARY",
"logicalType": "STRING",
"data": ["one", "two", "three", "four", "five"]
}
{
"name": "enum_field",
"schemaType": "required",
"physicalType": "BINARY",
"logicalType": "ENUM",
"data": ["a", "b", "c", "d", "e"]
}
{
"name": "uuid_field",
"schemaType": "required",
"physicalType": "FIXED_LEN_BYTE_ARRAY",
"logicalType": "UUID",
"length": 16,
"data": [
"550e8400e29b41d4a716446655440000",
"550e8400e29b41d4a716446655440001",
"550e8400e29b41d4a716446655440002",
"550e8400e29b41d4a716446655440003",
"550e8400e29b41d4a716446655440004"
]
}
Note: here length is the specified length of the FIXED_LEN_BYTE_ARRAY, which is 16 for the given uuid values.
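Each entry is 32 hex digits, i.e. exactly 16 raw bytes, matching the declared length; Python's uuid module accepts this form directly:

```python
import uuid

hex_value = "550e8400e29b41d4a716446655440000"

# 32 hex characters encode exactly 16 bytes, matching "length": 16.
print(len(bytes.fromhex(hex_value)))  # 16
print(uuid.UUID(hex_value))           # 550e8400-e29b-41d4-a716-446655440000
```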
- name: Column name.
- schemaType: Specifies whether the column allows null values (required means no null values).
- physicalType: Defines the physical data type.
- logicalType: Defines the logical type for better data interpretation.
- data: An array of values to populate the column.
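A hypothetical validation sketch for these fields; the rules are inferred from the examples on this page, not taken from the generator's source:

```python
# Hypothetical validator for a plain (non-group) column definition.
def validate_column(column: dict) -> None:
    """Raise ValueError if a plain column definition is missing common fields."""
    for key in ("name", "schemaType", "physicalType", "data"):
        if key not in column:
            raise ValueError(f"missing required field: {key}")
    # Group variants (repeatedGroup, requiredGroup, optionalGroup) use "fields"
    # instead and are not handled by this sketch.
    if column["schemaType"] not in ("required", "optional"):
        raise ValueError(f"unexpected schemaType: {column['schemaType']}")

validate_column({
    "name": "int8",
    "schemaType": "required",
    "physicalType": "INT32",
    "logicalType": "INT8",
    "data": [1, 2, 3, 4, 5],
})
```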
{
"name": "person",
"schemaType": "repeatedGroup",
"fields": [
{
"name": "name",
"schemaType": "optional",
"physicalType": "BINARY",
"logicalType": "STRING"
},
{
"name": "age",
"schemaType": "required",
"physicalType": "INT32"
}
],
"data": [
{
"name": "Alice",
"age": 30
},
{
"name": "Bob",
"age": 25
}
]
}
{
"name": "attributes",
"schemaType": "repeatedGroup",
"logicalType": "INT32",
"fields": [
{
"name": "key_value",
"schemaType": "repeatedGroup",
"fields": [
{
"name": "key",
"schemaType": "required",
"physicalType": "BINARY",
"logicalType": "STRING"
},
{
"name": "value",
"schemaType": "optional",
"physicalType": "BINARY",
"logicalType": "STRING"
}
]
}
],
"data": [{"key_value": [{"key": "tqwqmcqvqo", "value": "gkqcl"}]}]
}
{
"name": "person",
"schemaType": "requiredGroup",
"fields": [
{
"name": "name",
"schemaType": "optional",
"physicalType": "BINARY",
"logicalType": "STRING"
},
{
"name": "age",
"schemaType": "required",
"physicalType": "INT32"
}
],
"data": [
{
"name": "Alice",
"age": 30
},
{
"name": "Bob",
"age": 25
}
]
}
{
"name": "attributes",
"schemaType": "optionalGroup",
"logicalType": "INT32",
"fields": [
{
"name": "key_value",
"schemaType": "requiredGroup",
"fields": [
{
"name": "key",
"schemaType": "required",
"physicalType": "BINARY",
"logicalType": "STRING"
},
{
"name": "value",
"schemaType": "optional",
"physicalType": "BINARY",
"logicalType": "STRING"
}
]
}
],
"data": [{"key_value": [{"key": "tqwqmcqvqo", "value": "gkqcl"}]}]
}
- repeatedGroup: Defines an array of objects.
- requiredGroup and optionalGroup: Define tuple-like structures.
Developed and maintained by the Altinity team.