
Scan Iceberg table sorted on partition key without sort order #966

Closed
Description

@BTheunissen

Feature Request / Improvement

I am very excited by the recent addition of PyArrow RecordBatchReader support (#786), which makes it possible to read very large Iceberg tables without holding them in memory.
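
For context, a minimal sketch of the streaming read that #786 enables; the catalog name ("default") and table identifier ("db.user") are placeholders, not anything from this issue:

# Sketch of the batch-reader API added in #786; catalog and table
# names below are hypothetical.
from pyiceberg.catalog import load_catalog

table = load_catalog("default").load_table("db.user")
reader = table.scan().to_arrow_batch_reader()  # a pyarrow.RecordBatchReader
for batch in reader:
    ...  # each item is a pyarrow.RecordBatch; process incrementally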

I currently have a use-case where I want to read a large Iceberg table (1B+ rows) on which a sort order cannot be defined, because the upstream sink processor does not support it. The table is partitioned by modified_date_hour, which cannot serve as a strict ordering key.

Example table schema:

user(
  1: op: optional string,
  4: lsn: optional long,
  5: __deleted: optional string,
  6: usr_created_date: optional long,
  7: usr_modified_date: optional long,
  8: modified_date: optional timestamptz,
  9: usr_id: optional string,
  12: table: optional string,
  13: ts_ms: optional long
),
partition by: [modified_date_hour],
sort order: [],
snapshot: Operation.APPEND: id=4107982468622810123, parent_id=4969834436785268669, schema_id=0

Is it currently possible in 0.7.0rc1, or would it be possible with some changes, to read a PyArrow batch reader that fetches batches according to the partition key and then sorts those batches in memory by another table column supplied by the client (in this case usr_modified_date), returning a stream of PyArrow batches? The idea is to be able to efficiently fetch and resume the stream via a connector so that records can be extracted incrementally.
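
For what it's worth, something close to this can be approximated with the current public APIs by driving one scan per partition value and sorting each partition's Arrow table in memory before streaming it out. A minimal sketch, assuming the partition is an hour() transform on modified_date and using the same hypothetical catalog/table names as above; filtering on the source column lets Iceberg prune to one partition per scan, so peak memory is bounded by the largest partition rather than the whole table:

from datetime import datetime, timedelta, timezone

from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import And, GreaterThanOrEqual, LessThan


def sorted_batches(start: datetime, end: datetime):
    """Yield PyArrow record batches sorted by usr_modified_date within
    each modified_date_hour partition. start/end must be timezone-aware
    because modified_date is a timestamptz column."""
    table = load_catalog("default").load_table("db.user")  # assumed names
    hour = start
    while hour < end:
        nxt = hour + timedelta(hours=1)
        scan = table.scan(
            row_filter=And(
                GreaterThanOrEqual("modified_date", hour.isoformat()),
                LessThan("modified_date", nxt.isoformat()),
            )
        )
        # Materialize one partition, sort it in memory, then stream it out.
        part = scan.to_arrow().sort_by([("usr_modified_date", "ascending")])
        yield from part.to_batches()
        hour = nxt


# Resuming is then a matter of restarting the generator at the last
# fully consumed hour:
for batch in sorted_batches(
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2024, 1, 2, tzinfo=timezone.utc),
):
    ...  # hand batches to the connector

This is a workaround rather than the requested feature: it issues one scan per hour and sorts eagerly, whereas native support could plan this from the partition spec in a single scan.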
