Support file row index / row id for each file in a ListingTableProvider
#15892
Labels: enhancement (New feature or request)
Is your feature request related to a problem or challenge?
I am not sure what @daphnenhuch-at's use case is, but getting row numbers from a file is used for several use cases I know of:
Today there are ways to compute this, but they are inefficient. For example, the workaround below reads all rows from the file, so if you are trying to select only one row based on its row number, a huge amount of work is wasted.
Today you can kind of get this information by:
1. Setting the number of partitions to `1`. It is important to disable repartitioning; otherwise large tables will be scanned in parallel and data from multiple parallel chunks will be interleaved.
2. Running a query for each file, in SQL, using the `row_number` window function.
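To make the `row_number` workaround concrete, here is a minimal sketch. It uses Python's built-in `sqlite3` module as a stand-in query engine, not DataFusion, and the table name `one_file`, the column `value`, and the data are all made up for illustration. `ORDER BY rowid` plays the role of "the order rows appear in the file"; in an engine scanning a single file with a single partition, that order would come from the scan itself.

```python
import sqlite3

# SQLite stands in for the query engine; table, column, and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE one_file (value TEXT)")
conn.executemany(
    "INSERT INTO one_file (value) VALUES (?)",
    [(f"row-{i}",) for i in range(1, 20001)],
)

# Number every row, then keep only the ones we want. The inner query still
# has to visit every row of the "file" before the outer filter applies.
query = """
    SELECT row_num, value
    FROM (
        SELECT row_number() OVER (ORDER BY rowid) AS row_num, value
        FROM one_file
    )
    WHERE row_num IN (10002, 10003)
    ORDER BY row_num
"""
selected = conn.execute(query).fetchall()
print(selected)
```

Note that the filter on `row_num` sits outside the subquery that computes it, so all 20,000 rows are numbered to return two of them. That is the wasted work described above.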
Describe the solution you'd like
I would like to consider a nicer way to get the row number from the file and then write queries against it.
Something like
Which would return the 10,002nd and 10,003rd rows in the file, respectively. The idea is that then we could:
Describe alternatives you've considered
I think we would need to add some sort of special column (similar to partitioning columns) to the listing table provider.
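One way to picture such a special column: while scanning a file, the provider keeps a running row offset and attaches it to every row, much as partition values are appended to each row. The sketch below is plain Python, not DataFusion's API. The names `scan_with_row_index` and `row_index` are hypothetical, and zero-based numbering is an assumption.

```python
from typing import Dict, Iterable, Iterator, List

Batch = List[Dict[str, object]]


def scan_with_row_index(
    batches: Iterable[Batch], column: str = "row_index"
) -> Iterator[Batch]:
    """Yield each batch with an extra column holding the row's position in the file."""
    offset = 0  # number of rows already emitted from this file
    for batch in batches:
        yield [{**row, column: offset + i} for i, row in enumerate(batch)]
        offset += len(batch)


# A "file" delivered as three batches of uneven size (made-up data).
file_batches = [
    [{"v": "a"}, {"v": "b"}],
    [{"v": "c"}],
    [{"v": "d"}, {"v": "e"}],
]

rows = [row for batch in scan_with_row_index(file_batches) for row in batch]
wanted = [row["v"] for row in rows if row["row_index"] in (2, 3)]
print(wanted)
```

Because the index is produced by the scan rather than by a window function, a real implementation could in principle skip whole chunks whose row range does not overlap the requested row numbers. This sketch does not attempt that.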
Another alternative would be to keep this kind of functionality out of the core and implement it in external table providers.
Additional context
Related: #15173, which covers `location`, `size`, and `last_modified` in ListingTableProvider.