partitioning the files #556
-
Just to make sure I understand the feature request: you want odbc2parquet to start a new file whenever one of the columns has a different value than in the preceding row? And in addition, you would make sure that the query already returns the data in the correct order?
-
This would require splitting files based on row-wise logic, whereas odbc2parquet currently writes in units of batches. It would also require looking ahead to see where the column values change. All of this could be done, but at least I personally cannot pull this off in my spare time. I would suggest one of two alternatives: use a compression that allows you to split and concatenate the written files, and partition them in a second pass. Alternatively, use either the Rust or Python version of arrow-odbc and write the Parquet files in your own custom code. E.g. using the arrow-odbc Python bindings, you can combine them with the existing Python libraries mentioned in the Stack Overflow answer to achieve what you want.
-
Would it be possible to pass a series of columns that the Parquet files would be partitioned on?