Replies: 4 comments 4 replies
-
This depends on so many different things that I would recommend benchmarking the solutions for your use case yourself. I recently stumbled upon this post https://timvink.nl/blog/databricks-query-speed/ from @timvink. All these solutions should roughly be in the same ballpark speed-wise, as the biggest speedup comes from avoiding repeated roundtrips to the database. In the above blog post Tim missed activating concurrent fetching when using arrow-odbc-py.
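For illustration, here is a minimal sketch of batch fetching with arrow-odbc-py into a single Parquet file. The connection string and query are placeholders, and the `fetch_concurrently()` call assumes a reasonably recent arrow-odbc-py version that exposes concurrent fetching; check the docs of the version you have installed.

```python
import pyarrow.parquet as pq
from arrow_odbc import read_arrow_batches_from_odbc

# Stream the result set as Arrow record batches, so the table is never
# fully materialized in memory and roundtrips are amortized per batch.
reader = read_arrow_batches_from_odbc(
    query="SELECT * FROM my_table",                      # placeholder query
    connection_string="DSN=my_dsn;UID=user;PWD=secret",  # placeholder connection
    batch_size=100_000,                                   # rows fetched per batch
)

# Assumption: recent arrow-odbc-py versions can overlap fetching the next
# batch with writing the current one; skip this call on older versions.
reader.fetch_concurrently()

with pq.ParquetWriter("out.parquet", reader.schema) as writer:
    for batch in reader:
        writer.write_batch(batch)
```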
If you need partitioning, you currently would need to implement this yourself, probably on top of the fetched record batches (a rough sketch follows at the end of this comment).

My advice: pick the artefact which allows you the best maintainability for your pipeline. You are likely to get much bigger speedups by experimenting with the query and sanitizing the database schema than by fiddling with the precise implementation of batch fetching. Concretely: build up domain knowledge of how big the values in the database actually are vs. how big the schema allows them to be. If possible, adapt the schema; if not, cast the values into the appropriate type in the query. If there are some awkward binary or VARCHAR(MAX) columns, fixing those can unlock new orders of magnitude in speed, compared to the few percent you gain by switching implementations.

Best, Markus
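PS: a rough sketch of what do-it-yourself partitioning on top of the fetched batches could look like, handing the stream of record batches to pyarrow's dataset writer. The column names, the partition key, and the connection string are made up for illustration, and the CAST in the query stands in for the schema sanitizing mentioned above.

```python
import pyarrow.dataset as ds
from arrow_odbc import read_arrow_batches_from_odbc

# Hypothetical query: cast an oversized VARCHAR(MAX)-style column down to a
# realistic size and select a column to partition the output by.
query = """
    SELECT id,
           CAST(payload AS VARCHAR(255)) AS payload,  -- hypothetical wide column
           region                                     -- hypothetical partition key
    FROM my_table
"""

reader = read_arrow_batches_from_odbc(
    query=query,
    connection_string="DSN=my_dsn;UID=user;PWD=secret",  # placeholder connection
    batch_size=100_000,
)

# Write a Hive-style partitioned dataset, e.g. out/region=EU/part-0.parquet,
# directly from the stream of record batches.
ds.write_dataset(
    reader,
    "out",
    format="parquet",
    partitioning=["region"],
    partitioning_flavor="hive",
    schema=reader.schema,
)
```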
-
Thanks for the shout-out. And kudos for arrow-odbc-py; it worked great, and it's lightweight and easy to install and set up.
Darn, that's too bad! I had even skimmed your source code but quickly gave up, as most of it was in Rust. Probably concurrent fetching should be the default.
-
Thank you very much, @pacman82, for sharing your valuable insights and expertise. 🚀
-
@fdcastel Since odbc2parquet 6.3, the --concurrent-fetching flag is available in odbc2parquet. This makes odbc2parquet a more viable choice from a performance standpoint. Also closing the discussion for now.
-
TL;DR: What is the best-performing way to extract data from an ODBC connection and write it to a Parquet file, possibly partitioned?
We're talking about millions of records here, so every little optimization counts.
Long version:
I was about to reopen the #556 discussion (support for partitioned files) and was ready to cite the arrow-rs project when I discovered other great projects by @pacman82:
https://github.com/pacman82/arrow-odbc-py
https://github.com/pacman82/arrow-odbc
https://github.com/pacman82/odbc-api
I was going to start a few benchmarks (odbc2parquet vs arrow-odbc vs arrow-odbc-py), but opted to ask here first. Something tells me that @pacman82 has already been through all of this 😅.