Selfeer edited this page Nov 12, 2024 · 43 revisions

Parquet File Name

The file name is determined by the fileName field inside the JSON definition.

  "fileName": "map_example.parquet",

Options

Options determine the overall settings of the file, such as:

  • the size of a row group
  • the size of a page
  • the version of the parquet writer
  • compression
  • applying a bloom filter to columns

Usage and Examples

Options are specified as a set of key-value pairs in the JSON file under the "options" key.

  "options": {
    "writerVersion": "1.0",
    "compression": "snappy",
    "rowGroupSize": "default",
    "pageSize": "default",
    "bloomFilter": "all"
  }

Writer Version

In parquet-java, the writerVersion specifies the version of the Parquet format used when writing data to Parquet files. This setting determines the encoding, compression algorithms, and metadata structure that will be applied during the write process.

Writer Version 1.0

  • Ensures compatibility with older readers that only support the original Parquet format.
  • Supports basic encoding methods like Plain and Dictionary encoding.
  • Uses the original data page format without additional metadata.
  • Limited to standard compression methods like Snappy and Gzip.

Full example here

  "options": {
    "writerVersion": "1.0",
}

Writer Version 2.0

  • May not be compatible with older readers but introduces enhancements for newer systems.
  • Introduces advanced encodings such as Delta encoding (DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY), which improve compression efficiency for certain data types.
  • Utilizes Data Page V2 format, which includes checksums and more detailed metadata for better data integrity and performance.
  • Supports additional compression codecs (Zstandard (ZSTD), Brotli, LZ4), potentially offering better compression ratios.

Full example here

  "options": {
    "writerVersion": "2.0",
}

Compression

Full example here

  "options": {
    "compression": "snappy",
}

The value of compression must be one of: UNCOMPRESSED, SNAPPY, GZIP, LZO, BROTLI, LZ4, ZSTD.

Row Group Size and Page Size

  • rowGroupSize: Defines the maximum size (in bytes) of each row group when writing data to a Parquet file.
  • pageSize: Defines the size (in bytes) of each page within a column chunk.

Row group: A logical horizontal partitioning of the data into rows. There is no physical structure that is guaranteed for a row group. A row group consists of a column chunk for each column in the dataset.

Page: Column chunks are divided up into pages. A page is conceptually an indivisible unit (in terms of compression and encoding). There can be multiple page types which are interleaved in a column chunk.

source: https://parquet.apache.org/docs/concepts/

Full example here

  "options": {
    "rowGroupSize": 256,
    "pageSize": 1024
}

Bloom Filter

A bloom filter can be applied to all columns, to no columns, or only to specific columns.

All Columns

Full example here

  "options": {
    "bloomFilter": "all"
}

Specific Columns

Full example here

  "options": {
    "bloomFilter": ["id", "id2", "person.name"]
}

Here the value of bloomFilter is a list of column paths to which the bloom filter should be applied. For regular data types the path is simply the column name, such as id; for complex types it depends on how deeply the column is nested.

In this example the bloom filter is applied to the name column, which is nested inside the person array of tuples with fields name and age, so the path to name is person.name.

Hadoop Configurations

An alternative way to specify configurations is to pass them directly to the underlying Hadoop library via the hadoopConfigs section.

Full example here

  "hadoopConfigs": {
    "parquet.compression": "UNCOMPRESSED",
    "parquet.enable.dictionary": "true",
    "parquet.page.size": "1048576"
}

Note

All the possible Hadoop configurations are listed here

Data types

Regular Types

Integers

Int8

Full example here

  {
    "name": "int8",
    "schemaType": "required",
    "physicalType": "INT32",
    "logicalType": "INT8",
    "data": [1, 2, 3, 4, 5]
  }

Int16

Full example here

  {
    "name": "int16",
    "schemaType": "required",
    "physicalType": "INT32",
    "logicalType": "INT16",
    "data": [1, 2, 3, 4, 5]
  }

Int32

Full example here

  {
    "name": "int32",
    "schemaType": "required",
    "physicalType": "INT32",
    "logicalType": "INT32",
    "data": [1, 2, 3, 4, 5]
  }

Int64

Full example here

  {
    "name": "int64",
    "schemaType": "required",
    "physicalType": "INT64",
    "logicalType": "INT64",
    "data": [1, 2, 3, 4, 5]
  }

UInt8

Full example here

  {
    "name": "uint8",
    "schemaType": "required",
    "physicalType": "INT32",
    "logicalType": "UINT8",
    "data": [1, 2, 3, 4, 5]
  }

UInt16

Full example here

  {
    "name": "uint16",
    "schemaType": "required",
    "physicalType": "INT32",
    "logicalType": "UINT16",
    "data": [1, 2, 3, 4, 5]
  }

UInt32

Full example here

  {
    "name": "uint32",
    "schemaType": "required",
    "physicalType": "INT32",
    "logicalType": "UINT32",
    "data": [1, 2, 3, 4, 5]
  }

UInt64

Full example here

  {
    "name": "uint64",
    "schemaType": "required",
    "physicalType": "INT64",
    "logicalType": "UINT64",
    "data": [1, 2, 3, 4, 5]
  }

UTF8

Full example here

  {
    "name": "utf8",
    "schemaType": "required",
    "physicalType": "BINARY",
    "logicalType": "UTF8",
    "data": ["one", "two", "three", "four", "five", "\uD83D\uDCE6"]
  }

Decimal That Fits Into INT32 Physical Type

Full example here

  {
    "name": "decimal_int32",
    "schemaType": "required",
    "physicalType": "INT32",
    "logicalType": "DECIMAL",
    "precision": 3,
    "scale": 2,
    "data": [123, 321, 424]
  }

Large Decimal That Doesn't Fit Into INT32 Physical Type

Full example here

  {
    "name": "decimal_int64",
    "schemaType": "required",
    "physicalType": "INT64",
    "logicalType": "DECIMAL",
    "precision": 10,
    "scale": 3,
    "data": [2147483648, 2147483649, 2147483650]
  }

Decimal Annotated To BINARY

Full example here

  {
    "name": "decimal_binary",
    "schemaType": "required",
    "physicalType": "BINARY",
    "logicalType": "DECIMAL",
    "precision": 10,
    "scale": 3,
    "data": ["213", "421", "1234"]
  }
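In all three decimal variants the data values are unscaled integers (for the BINARY-backed variant, the unscaled value written as a string): the logical value is the unscaled value times 10 to the power of minus scale. A stdlib sketch of the decoding:

```python
from decimal import Decimal

def decode_decimal(unscaled, scale):
    """Interpret a Parquet DECIMAL's unscaled integer at the given scale."""
    return Decimal(unscaled).scaleb(-scale)

print(decode_decimal(123, 2))         # 1.23 (precision 3, scale 2)
print(decode_decimal(2147483648, 3))  # 2147483.648
print(decode_decimal(int("213"), 3))  # 0.213 (BINARY-backed, string input)
```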

DATE

Full example here

  {
    "name": "date_field_int32",
    "schemaType": "required",
    "physicalType": "INT32",
    "logicalType": "DATE",
    "data": [18628]
  }
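The DATE value is the number of days since the Unix epoch (1970-01-01), so the sample value 18628 decodes as follows:

```python
from datetime import date, timedelta

# DATE stores days since the Unix epoch.
epoch = date(1970, 1, 1)
print(epoch + timedelta(days=18628))  # 2021-01-01
```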

TIME_MILLIS

Full example here

  {
    "name": "time_millis_field",
    "schemaType": "required",
    "physicalType": "INT32",
    "logicalType": "TIME_MILLIS",
    "data": [12345678, 23456789, 34567890, 45678901, 56789012]
  }

TIME_MICROS

Full example here

  {
    "name": "time_micros_field",
    "schemaType": "required",
    "physicalType": "INT64",
    "logicalType": "TIME_MICROS",
    "data": [123456789012, 234567890123, 345678901234, 456789012345, 567890123456]
  }
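TIME_MILLIS and TIME_MICROS count milliseconds and microseconds since midnight, respectively. A stdlib sketch of the conversion, using the first TIME_MILLIS sample value:

```python
from datetime import timedelta

# TIME_MILLIS: milliseconds since midnight.
print(timedelta(milliseconds=12345678))     # 3:25:45.678000
# TIME_MICROS: microseconds since midnight (same instant, finer unit).
print(timedelta(microseconds=12345678000))  # 3:25:45.678000
```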

TIMESTAMP_MICROS

Full example here

  {
    "name": "timestamp_micros_field",
    "schemaType": "required",
    "physicalType": "INT64",
    "logicalType": "TIMESTAMP_MICROS",
    "data": [1609459200000000, 1609545600000000, 1609632000000000, 1609718400000000, 1609804800000000]
  }

TIMESTAMP_MILLIS

Full example here

  {
    "name": "timestamp_millis_field",
    "schemaType": "required",
    "physicalType": "INT64",
    "logicalType": "TIMESTAMP_MILLIS",
    "data": [1609459200000, 1609545600000, 1609632000000, 1609718400000, 1609804800000]
  }
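Both timestamp types count from the Unix epoch in UTC, in microseconds and milliseconds respectively; the sample values all fall in early January 2021. A stdlib check of the first value of each:

```python
from datetime import datetime, timezone

# TIMESTAMP_MILLIS: milliseconds since the Unix epoch.
print(datetime.fromtimestamp(1609459200000 / 1000, tz=timezone.utc))
# 2021-01-01 00:00:00+00:00

# TIMESTAMP_MICROS: microseconds since the Unix epoch.
print(datetime.fromtimestamp(1609459200000000 / 1_000_000, tz=timezone.utc))
# 2021-01-01 00:00:00+00:00
```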

JSON and BSON

Full example here

    {
      "name": "json_field",
      "schemaType": "optional",
      "physicalType": "BINARY",
      "logicalType": "JSON",
      "data": [
        "{\"key1\": \"value1\"}",
        "{\"key2\": \"value2\"}",
        "{\"key3\": \"value3\"}",
        "{\"key4\": \"value4\"}",
        "{\"key5\": \"value5\"}"
      ]
    },
    {
      "name": "bson_field",
      "schemaType": "optional",
      "physicalType": "BINARY",
      "logicalType": "BSON",
      "data": [
        "{\"key1\": \"value1\"}",
        "{\"key2\": \"value2\"}",
        "{\"key3\": \"value3\"}",
        "{\"key4\": \"value4\"}",
        "{\"key5\": \"value5\"}"
      ]
    }

STRING

Full example here

  {
    "name": "binary_field",
    "schemaType": "required",
    "physicalType": "BINARY",
    "logicalType": "STRING",
    "data": ["one", "two", "three", "four", "five"]
  }

ENUM

Full example here

    {
      "name": "enum_field",
      "schemaType": "required",
      "physicalType": "BINARY",
      "logicalType": "ENUM",
      "data": ["a", "b", "c", "d", "e"]
    }

UUID

Full example here

  {
    "name": "uuid_field",
    "schemaType": "required",
    "physicalType": "FIXED_LEN_BYTE_ARRAY",
    "logicalType": "UUID",
    "length": 16,
    "data": [
      "550e8400e29b41d4a716446655440000",
      "550e8400e29b41d4a716446655440001",
      "550e8400e29b41d4a716446655440002",
      "550e8400e29b41d4a716446655440003",
      "550e8400e29b41d4a716446655440004"
    ]
  }

Note

Here length is the length in bytes of the FIXED_LEN_BYTE_ARRAY, which is 16 for the given UUID values.
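Each data entry is the UUID's 16 bytes written as 32 hexadecimal characters; the standard library can confirm the round-trip:

```python
import uuid

u = uuid.UUID(hex="550e8400e29b41d4a716446655440000")
print(u)             # 550e8400-e29b-41d4-a716-446655440000
print(len(u.bytes))  # 16 -> matches the FIXED_LEN_BYTE_ARRAY length
```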

  • name: Column name.
  • schemaType: Specifies whether the column allows null values (required means no null values).
  • physicalType: Defines the physical data type.
  • logicalType: Defines the logical type for better data interpretation.
  • data: An array of values to populate the column.

Complex Types

Array

Full example here

{
  "name": "person",
  "schemaType": "repeatedGroup",
  "fields": [
    {
      "name": "name",
      "schemaType": "optional",
      "physicalType": "BINARY",
      "logicalType": "STRING"
    },
    {
      "name": "age",
      "schemaType": "required",
      "physicalType": "INT32"
    }
  ],
  "data": [
    {
      "name": "Alice",
      "age": 30
    },
    {
      "name": "Bob",
      "age": 25
    }
  ]
}

Nested Array

Full example here

    {
      "name": "attributes",
      "schemaType": "repeatedGroup",
      "logicalType": "INT32",
      "fields": [
        {
          "name": "key_value",
          "schemaType": "repeatedGroup",
          "fields": [
            {
              "name": "key",
              "schemaType": "required",
              "physicalType": "BINARY",
              "logicalType": "STRING"
            },
            {
              "name": "value",
              "schemaType": "optional",
              "physicalType": "BINARY",
              "logicalType": "STRING"
            }
          ]
        }
      ],
      "data": [{"key_value": [{"key": "tqwqmcqvqo", "value": "gkqcl"}]}]
    }

Tuple

Full example here

    {
      "name": "person",
      "schemaType": "requiredGroup",
      "fields": [
        {
          "name": "name",
          "schemaType": "optional",
          "physicalType": "BINARY",
          "logicalType": "STRING"
        },
        {
          "name": "age",
          "schemaType": "required",
          "physicalType": "INT32"
        }
      ],
      "data": [
        {
          "name": "Alice",
          "age": 30
        },
        {
          "name": "Bob",
          "age": 25
        }
      ]
    }

Nested Tuple

Full example here

    {
      "name": "attributes",
      "schemaType": "optionalGroup",
      "logicalType": "INT32",
      "fields": [
        {
          "name": "key_value",
          "schemaType": "requiredGroup",
          "fields": [
            {
              "name": "key",
              "schemaType": "required",
              "physicalType": "BINARY",
              "logicalType": "STRING"
            },
            {
              "name": "value",
              "schemaType": "optional",
              "physicalType": "BINARY",
              "logicalType": "STRING"
            }
          ]
        }
      ],
      "data": [{"key_value": [{"key": "tqwqmcqvqo", "value": "gkqcl"}]}]
    }

  • repeatedGroup: Defines an array of objects.
  • requiredGroup and optionalGroup: Define tuple-like structures.
