I am very excited by the recent additions to support PyArrow Record Batch Readers (#786), which better support reads of very large Iceberg tables that do not fit in memory.

I currently have a use case where I want to read a large Iceberg table (1B+ rows) on which a sort order key cannot be defined, because the upstream sink processor does not support it. The table is partitioned by modified_date_hour, which cannot be used as a strict ordering key.

Is it possible currently in 0.7.0rc1, or would it be possible with some changes, to read from a PyArrow Record Batch Reader that fetches batches according to the partition key and then sorts those batches in memory by another table column provided by the client, in this case usr_modified_date, returning a stream of PyArrow batches? The idea is to be able to efficiently fetch and resume the stream via a connector so that records can be extracted incrementally.
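For reference, here is a rough sketch of how I'm reading the table today with the new reader (the catalog name and table identifier below are placeholders, and I'm assuming the to_arrow_batch_reader() API added in #786):

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Placeholder catalog and table names for illustration only.
catalog = load_catalog("default")
table = catalog.load_table("analytics.events")

# to_arrow_batch_reader() (added in #786) streams record batches instead of
# materializing the whole table as a single Arrow table.
reader: pa.RecordBatchReader = table.scan().to_arrow_batch_reader()

for batch in reader:
    # Each item is a pyarrow.RecordBatch; downstream handling would go here.
    print(batch.num_rows)
```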
I believe this can be done with the plan_files function, by specifying the partition field as a row_filter in scan.
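Roughly something like this, as a sketch (the catalog and table names are placeholders, and it assumes modified_date_hour is a plain column you can compare against in the filter):

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import EqualTo

catalog = load_catalog("default")               # placeholder catalog name
table = catalog.load_table("analytics.events")  # placeholder identifier

# Restrict the scan to a single partition value so only matching data files
# are planned and read.
scan = table.scan(row_filter=EqualTo("modified_date_hour", "2024-01-01-00"))

# plan_files() shows which data files the filter actually selects.
for task in scan.plan_files():
    print(task.file.file_path)
```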
sorts those batches in-memory by another table column provided to the client
It's possible to sort the batches in memory. However, I think all the data would need to be read into memory in order to perform a sort based on another table column.

Sorting based on the partition field can be done without reading all the data into memory, since we can just work off the table metadata.
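As a sketch of the full loop under those assumptions (usr_modified_date and modified_date_hour come from your description; the catalog, table, and partition values are placeholders): scope each scan to one partition value with a row filter, sort just that slice in memory with PyArrow, and re-emit it as batches.

```python
from typing import Iterator

import pyarrow as pa
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import EqualTo
from pyiceberg.table import Table


def sorted_batches_for_partition(
    table: Table, partition_value: str, sort_column: str = "usr_modified_date"
) -> Iterator[pa.RecordBatch]:
    """Read one partition's worth of data, sort it in memory, and re-emit batches."""
    scan = table.scan(row_filter=EqualTo("modified_date_hour", partition_value))
    batches = list(scan.to_arrow_batch_reader())
    if not batches:
        return
    # Only this partition's rows are held in memory, not the whole table.
    sorted_tbl = pa.Table.from_batches(batches).sort_by(sort_column)
    yield from sorted_tbl.to_batches()


catalog = load_catalog("default")               # placeholder catalog name
table = catalog.load_table("analytics.events")  # placeholder identifier

# A connector can walk partition values in order and checkpoint after each one,
# which gives an incremental, resumable stream.
for hour in ["2024-01-01-00", "2024-01-01-01"]:  # placeholder partition values
    for batch in sorted_batches_for_partition(table, hour):
        print(hour, batch.num_rows)
```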
Awesome, this is just what I was looking for. I understand that the Arrow Record Batch Reader is a very new integration, but I should be able to accomplish what I'm looking for with your tips!