
Scan Iceberg table sorted on partition key without sort order #966


Closed
BTheunissen opened this issue Jul 25, 2024 · 3 comments

Comments

@BTheunissen

BTheunissen commented Jul 25, 2024

Feature Request / Improvement

I am very excited by the recent addition of PyArrow RecordBatchReader support (#786), which makes it much easier to read very large Iceberg tables that do not fit in memory.

I currently have a use case where I want to read a large Iceberg table (1B+ rows) on which a sort order cannot be defined, because the upstream sink processor does not support it. The table is partitioned by modified_date_hour, which cannot be used as a strict ordering key.

Example table schema:

user(
  1: op: optional string,
  4: lsn: optional long,
  5: __deleted: optional string,
  6: usr_created_date: optional long,
  7: usr_modified_date: optional long,
  8: modified_date: optional timestamptz,
  9: usr_id: optional string,
  12: table: optional string,
  13: ts_ms: optional long
),
partition by: [modified_date_hour],
sort order: [],
snapshot: Operation.APPEND: id=4107982468622810123, parent_id=4969834436785268669, schema_id=0

Is it currently possible in 0.7.0rc1, or would it be possible with some changes, to read a PyArrow batch reader that fetches batches according to the partition key and then sorts those batches in memory by another table column provided by the client (in this case usr_modified_date), returning a stream of PyArrow batches? The idea is to be able to efficiently fetch and resume the stream via a connector so that records can be extracted incrementally.

@kevinjqliu
Contributor

Taking a stab at this:

support reading a PyArrow Batch Reader

Looking at the code for to_arrow_batch_reader

fetches batches according to the partition key

I believe this can be done via the plan_files function, by specifying a filter on the partition field as the row_filter in scan.
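
A minimal sketch of that idea, assuming modified_date_hour is an hourly partition over the modified_date column from the schema above; the catalog name, table identifier, filter bounds, and selected fields are all illustrative:

from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import And, GreaterThanOrEqual, LessThan

# Hypothetical catalog / table identifiers.
catalog = load_catalog("default")
table = catalog.load_table("db.user")

# Filter on the partition's source column so PyIceberg can prune data files
# from partition metadata alone (a single hour bucket in this example).
scan = table.scan(
    row_filter=And(
        GreaterThanOrEqual("modified_date", "2024-07-01T00:00:00+00:00"),
        LessThan("modified_date", "2024-07-01T01:00:00+00:00"),
    ),
    selected_fields=("usr_id", "usr_modified_date", "modified_date"),
)

# plan_files() lists only the data files that survive partition pruning ...
for task in scan.plan_files():
    print(task.file.file_path)

# ... and to_arrow_batch_reader() streams those files as Arrow RecordBatches.
reader = scan.to_arrow_batch_reader()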

sorts those batches in-memory by another table column provided to the client

It's possible to sort the batches in memory. However, I think all the data needs to be read into memory in order to sort by another table column.
Sorting by a partition field can be done without reading all the data into memory, since we can work off the table metadata alone.
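
A rough sketch of how the two points combine, assuming the scan has already been restricted to a single hour bucket (as in the sketch above) and that one bucket fits comfortably in memory; the helper name and the idea of re-wrapping the sorted table as a RecordBatchReader are illustrative, not an existing PyIceberg API:

import pyarrow as pa

def sorted_batches(scan) -> "pa.RecordBatchReader":
    """Materialize one partition's worth of data and re-emit it sorted.

    `scan` is a PyIceberg table scan restricted to one hour bucket.
    """
    bucket = scan.to_arrow_batch_reader().read_all()               # pa.Table
    bucket = bucket.sort_by([("usr_modified_date", "ascending")])  # in-memory sort
    return pa.RecordBatchReader.from_batches(bucket.schema, bucket.to_batches())

The ordering guarantee only holds within a bucket; resuming the stream would mean remembering the last hour processed and issuing the next scan from there.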

@BTheunissen
Author

Sorting by a partition field can be done without reading all the data into memory, since we can work off the table metadata alone.

Awesome, this is just what I was looking for. I understand that the Arrow RecordBatchReader is a very new integration, but I should be able to accomplish what I'm looking for with your tips!

@kevinjqliu
Contributor

@BTheunissen happy to help!

When you get a working solution, would you mind posting back here for future reference? Even pseudo-code would be helpful!
