Support file row index / row id for each file in a ListingTableProvider #15892

Open
alamb opened this issue Apr 29, 2025 · 4 comments
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Apr 29, 2025

Is your feature request related to a problem or challenge?

My goal is to have a file fully sorted by primary key, where each fileRowNumber is the index of that row in the file.

I am not sure what @daphnenhuch-at 's use case is, but getting row numbers from a file is used for several use cases I know of:

  1. Implementing delete vectors (i.e. filtering out rows whose row_id has been marked deleted from a file)
  2. Implementing external indexes (e.g. a full text index that tells you documents 10001 and 10003 match, and then fetching only those rows from the file)
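
To make use case 1 concrete, here is a minimal sketch in plain Rust (not the DataFusion API; `apply_delete_vector` is a hypothetical name for illustration) of what a delete vector does once per-file row numbers exist: it is a set of file row numbers whose rows get filtered out of the scan.

```rust
use std::collections::HashSet;

// Hedged sketch: a delete vector is a set of file row numbers whose rows
// should be dropped from a scan. Given correct per-file row numbers, the
// filter is a simple set-membership check.
fn apply_delete_vector<T: Clone>(rows: &[T], deleted: &HashSet<u64>) -> Vec<T> {
    rows.iter()
        .enumerate()
        // keep only rows whose file row number is NOT in the delete vector
        .filter(|(row_number, _)| !deleted.contains(&(*row_number as u64)))
        .map(|(_, row)| row.clone())
        .collect()
}

fn main() {
    let rows = vec!["r0", "r1", "r2", "r3"];
    // rows 1 and 3 have been deleted from the file
    let deleted: HashSet<u64> = [1, 3].into_iter().collect();
    let kept = apply_delete_vector(&rows, &deleted);
    assert_eq!(kept, vec!["r0", "r2"]);
}
```

The key point is that the check is keyed on the row's position in the file, which is exactly what this issue asks the ListingTableProvider to expose.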

Today there are ways to compute this, but they are inefficient (for example, the workaround below reads all rows from the file, so if you are trying to select only one row by row number a huge amount of work is wasted).

Today you can sort of get this information by:

  1. Disabling repartitioning by setting the datafusion.execution.target_partitions config setting to 1. This is important: without it, large tables are scanned in parallel and data from multiple parallel chunks is interleaved, so row numbers no longer match file order.
  2. Running a query for each file using the row_number window function. Something like:

ctx
    .read_parquet("file1.parquet", ParquetReadOptions::default())
    .await?
    .window(vec![row_number().alias(DATA_FUSION_ROW_NUMBER)])?

In SQL

> set datafusion.execution.target_partitions = 1;
0 row(s) fetched.
Elapsed 0.001 seconds.

> select "VendorID", row_number() OVER () from 'yellow_tripdata_2025-01.parquet' limit 10;
+----------+-----------------------------------------------------------------------+
| VendorID | row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING |
+----------+-----------------------------------------------------------------------+
| 1        | 1                                                                     |
| 1        | 2                                                                     |
| 1        | 3                                                                     |
| 2        | 4                                                                     |
| 2        | 5                                                                     |
| 2        | 6                                                                     |
| 1        | 7                                                                     |
| 1        | 8                                                                     |
| 1        | 9                                                                     |
| 2        | 10                                                                    |
+----------+-----------------------------------------------------------------------+
10 row(s) fetched.
Elapsed 0.005 seconds.

Describe the solution you'd like

I would like to consider a nicer way to get the row number from the file and then write queries against it.

Something like

select * from my_table where row_number IN (10002, 10003)

which would return the 10,002nd and 10,003rd rows of the file, respectively. The idea is that we could then:

  1. Do predicate pushdown on those row numbers
  2. Figure out how to still scan the file in parallel
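
A hedged sketch of how both could work, in plain Rust (the function names are hypothetical, not DataFusion APIs): Parquet metadata records each row group's row count, so a prefix sum gives every row group's starting file row number. That lets a `row_number IN (...)` predicate prune down to just the row groups containing those rows, and lets partitions scanned in parallel still emit correct global row numbers by starting from their row group's offset.

```rust
// Prefix sum over per-row-group row counts (available from Parquet
// metadata) gives each row group's starting file row number.
fn row_group_offsets(row_counts: &[u64]) -> Vec<u64> {
    let mut offsets = Vec::with_capacity(row_counts.len());
    let mut start = 0;
    for &count in row_counts {
        offsets.push(start);
        start += count;
    }
    offsets
}

/// Indices of the row groups that contain any of the requested row numbers.
fn prune_row_groups(wanted: &[u64], row_counts: &[u64]) -> Vec<usize> {
    let offsets = row_group_offsets(row_counts);
    (0..row_counts.len())
        .filter(|&i| {
            let (lo, hi) = (offsets[i], offsets[i] + row_counts[i]);
            wanted.iter().any(|&r| r >= lo && r < hi)
        })
        .collect()
}

fn main() {
    // a file with three row groups of 8192 rows each
    let counts = [8192, 8192, 8192];
    assert_eq!(row_group_offsets(&counts), vec![0, 8192, 16384]);
    // rows 10002 and 10003 both fall inside row group 1,
    // so only that row group needs to be read
    assert_eq!(prune_row_groups(&[10002, 10003], &counts), vec![1]);
}
```

This is only a sketch of the bookkeeping; the real work would be wiring these offsets into the scan so each partition's emitted row numbers are file-absolute rather than partition-relative.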

Describe alternatives you've considered

I think we would need to add some sort of special column (similar to partitioning columns) to the ListingTableProvider.

Another alternative would be to keep this kind of functionality out of the core and implement it in external table providers.

Additional context

@alamb alamb added the enhancement New feature or request label Apr 29, 2025
@daphnenhuch-at

Thank you! Yes, my goal here is implementing deletion vectors.

@daphnenhuch-at

By the way, this is the exact bug I was referencing here: #15833

I don't actually need to maintain the row number for each file; rather, I just want the global row id after sorting the table across many thousands of records.

@alamb
Contributor Author

alamb commented May 1, 2025

@acking-you
Contributor

Nice feature 👍
