Description
Feature Request / Improvement
I am very excited by the recent addition of PyArrow RecordBatchReader support (#786), which better enables out-of-memory reads of very large Iceberg tables.
I currently have a use case where I want to read a large Iceberg table (1B+ rows) on which a sort order key cannot be defined, because the upstream sink processor does not support it. The table is partitioned by modified_date_hour, which cannot be used as a strict ordering key.
Example table schema:
user(
1: op: optional string,
4: lsn: optional long,
5: __deleted: optional string,
6: usr_created_date: optional long,
7: usr_modified_date: optional long,
8: modified_date: optional timestamptz,
9: usr_id: optional string,
12: table: optional string,
13: ts_ms: optional long
),
partition by: [modified_date_hour],
sort order: [],
snapshot: Operation.APPEND: id=4107982468622810123, parent_id=4969834436785268669, schema_id=0
Is it currently possible in 0.7.0rc1, or would it be possible with some changes, to read via a PyArrow batch reader that fetches batches according to the partition key, sorts those batches in memory by another table column supplied by the client (in this case usr_modified_date), and returns a stream of PyArrow batches? The idea is to be able to efficiently fetch and resume the stream via a connector so that records can be extracted incrementally.
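
To make the ask concrete, here is a rough sketch of the behaviour I am describing, written against 0.7.0rc1 as I understand it (Table.scan(row_filter=...) plus to_arrow_batch_reader()). The catalog name, table identifier, and time window are placeholders, and the hourly windows stand in for the modified_date_hour partitions; this is only an illustration of the intent, not a claim about how the library should implement it.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator

import pyarrow as pa
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import And, GreaterThanOrEqual, LessThan

# Placeholder catalog and table names; substitute your own.
catalog = load_catalog("default")
tbl = catalog.load_table("db.user")


def sorted_batches(
    start: datetime, end: datetime, sort_key: str = "usr_modified_date"
) -> Iterator[pa.RecordBatch]:
    """Yield record batches sorted by `sort_key` within each hourly window.

    Each one-hour window lines up with a modified_date_hour partition, so the
    scan should only touch that partition's files; the sort itself happens
    client-side, one window at a time.
    """
    window = start
    while window < end:
        next_window = window + timedelta(hours=1)
        reader = tbl.scan(
            row_filter=And(
                GreaterThanOrEqual("modified_date", window.isoformat()),
                LessThan("modified_date", next_window.isoformat()),
            )
        ).to_arrow_batch_reader()

        # One hour of data is materialised, sorted in memory, and re-emitted
        # as a stream of batches: the behaviour I would like supported.
        hour_table = reader.read_all().sort_by([(sort_key, "ascending")])
        yield from hour_table.to_batches()

        window = next_window


start = datetime(2024, 7, 1, tzinfo=timezone.utc)
for batch in sorted_batches(start, start + timedelta(hours=6)):
    ...  # hand the batch to the connector; checkpoint the window to resume later
```

Resumption would then just be a matter of the connector checkpointing the last fully emitted window (plus the last usr_modified_date value seen within it), which is why a partition-by-partition, sorted stream of batches is attractive here.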