Replies: 4 comments 4 replies
-
This depends on so many different things that I would recommend benchmarking the solutions for your use case yourself. I recently stumbled upon this post https://timvink.nl/blog/databricks-query-speed/ from @timvink. All these solutions should roughly be in the same ballpark speed-wise, as the biggest speedup comes from avoiding repeated roundtrips to the database. In the above blog post Tim missed activating concurrent fetching when using arrow-odbc-py.
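For illustration, here is a minimal sketch of batch fetching with arrow-odbc-py into a single Parquet file. The connection string and query are placeholders, and the `fetch_concurrently()` call assumes a reasonably recent arrow-odbc-py version that exposes concurrent fetching; check the docs of the version you have installed.

```python
import pyarrow.parquet as pq
from arrow_odbc import read_arrow_batches_from_odbc

# Stream the result set as Arrow record batches, so the table is never
# fully materialized in memory and roundtrips are amortized per batch.
reader = read_arrow_batches_from_odbc(
    query="SELECT * FROM my_table",                      # placeholder query
    connection_string="DSN=my_dsn;UID=user;PWD=secret",  # placeholder connection
    batch_size=100_000,                                   # rows fetched per batch
)

# Assumption: recent arrow-odbc-py versions can overlap fetching the next
# batch with writing the current one; skip this call on older versions.
reader.fetch_concurrently()

with pq.ParquetWriter("out.parquet", reader.schema) as writer:
    for batch in reader:
        writer.write_batch(batch)
```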
If you need partitioning, you currently would need to implement this yourself, probably on top of the fetched record batches (a rough sketch follows at the end of this comment).

My advice: pick the artefact which allows you the best maintainability for your pipeline. You are likely to get much bigger speedups by experimenting with the query and sanitizing the database schema than by fiddling with the precise implementation of batch fetching. Concretely: build up domain knowledge of how big the values in the database actually are vs. how big the schema allows them to be. If possible, adapt the schema; if not, cast the values into the appropriate type in the query. If there are some awkward binary or VARCHAR(MAX) columns, fixing those can unlock new orders of magnitude in speed, compared to the few percent you gain by switching implementations.

Best, Markus
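PS: a rough sketch of what do-it-yourself partitioning on top of the fetched batches could look like, handing the stream of record batches to pyarrow's dataset writer. The column names, the partition key, and the connection string are made up for illustration, and the CAST in the query stands in for the schema sanitizing mentioned above.

```python
import pyarrow.dataset as ds
from arrow_odbc import read_arrow_batches_from_odbc

# Hypothetical query: cast an oversized VARCHAR(MAX)-style column down to a
# realistic size and select a column to partition the output by.
query = """
    SELECT id,
           CAST(payload AS VARCHAR(255)) AS payload,  -- hypothetical wide column
           region                                     -- hypothetical partition key
    FROM my_table
"""

reader = read_arrow_batches_from_odbc(
    query=query,
    connection_string="DSN=my_dsn;UID=user;PWD=secret",  # placeholder connection
    batch_size=100_000,
)

# Write a Hive-style partitioned dataset, e.g. out/region=EU/part-0.parquet,
# directly from the stream of record batches.
ds.write_dataset(
    reader,
    "out",
    format="parquet",
    partitioning=["region"],
    partitioning_flavor="hive",
    schema=reader.schema,
)
```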
-
Thanks for the shout-out. And kudos for arrow-odbc-py; it worked great, and it's lightweight and easy to install and set up.
Darn, that's too bad! I had even skimmed your source code but quickly gave up, as most of it was in Rust. Probably concurrent fetching should be the default.
-
Thank you very much, @pacman82, for sharing your valuable insights and expertise. 🚀
-
@fdcastel Since odbc2parquet 6.3, the --concurrent-fetching flag is available in odbc2parquet. This makes odbc2parquet a more viable choice from a performance standpoint. Also closing the discussion for now.
-
TL;DR: What is the best-performing way to extract data from an ODBC connection and write it to a Parquet file, possibly partitioned?
We're talking about millions of records here, so every little optimization counts.
Long version:
I was about to reopen the #556 discussion (support for partitioned files) and was ready to cite the arrow-rs project when I discovered other great projects by @pacman82:
https://github.com/pacman82/arrow-odbc-py
https://github.com/pacman82/arrow-odbc
https://github.com/pacman82/odbc-api
I was going to start a few benchmarks (odbc2parquet vs arrow-odbc vs arrow-odbc-py), but opted to ask here first. Something tells me that @pacman82 has already been through all of this 😅.