Document best practices for high performance constructing arrays #7455

Open
alamb opened this issue Apr 29, 2025 · 2 comments

@alamb
Contributor

alamb commented Apr 29, 2025

> I think at this point there is little point in using `MutableBuffer` over `Vec`, as the latter provides a more performant (specialization over `T`, better inlining), slightly safer, and more complete API. The same probably applies to a lot of `Builder`-type APIs.

It might be worth spending some time documenting the best and most performant way to construct / transform Arrow arrays and applying it across the arrow-rs crate (marking stuff as deprecated if needed).

This LGTM and is a great improvement. I also would love to see this documentation if you don't mind dumping your thoughts on the topic while they're fresh!

Originally posted by @mbutrovich in #7422 (review)

@Dandandan
Contributor

Dandandan commented Apr 29, 2025

Let me start by collecting a list of possible replacements of MutableBuffer-based construction with Vec-based construction, along with generally faster ways of doing things, so we can convert the existing code.

Vec-based API

| Old | Vec-based |
| --- | --- |
| `MutableBuffer::new(size * std::mem::size_of::<T>())` | `Vec::<T>::with_capacity(size)` |
| `BufferBuilder` | `Vec` |
| `Buffer::from_trusted_len_iter` | `Iterator::collect` (into `Vec`); this doesn't use `unsafe` and is as fast |
| Primitive builders | Use `Vec<T>` (`Iterator::collect`) and build nulls separately (via `BooleanBuffer::collect_bool` if possible) |
| Byte builder | Use `Vec<T>`, `Vec<Offset>` (`Iterator::collect`) and build nulls separately (via `BooleanBuffer::collect_bool` if possible) |
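
As a rough illustration of the last two rows, here is a minimal sketch that builds the raw parts of a primitive column and a byte (string) column with plain `Vec`s. It uses only the standard library so it runs without the arrow crate. The helper names `primitive_parts` and `byte_parts` are hypothetical, the `i32` offset type is an assumption, and the validity flags are left as a `Vec<bool>` here rather than a packed bitmap.

```rust
/// Hypothetical helper (not an arrow-rs API): split optional values into a
/// dense values vector and a per-slot validity vector.
/// Null slots get a placeholder value of 0.
fn primitive_parts(input: &[Option<i32>]) -> (Vec<i32>, Vec<bool>) {
    // `collect` can size the allocation from the iterator's size hint,
    // so there is no per-element capacity check as with repeated `push`.
    let values: Vec<i32> = input.iter().map(|&v| v.unwrap_or(0)).collect();
    // Validity is built separately from the values, as the table suggests.
    let validity: Vec<bool> = input.iter().map(|v| v.is_some()).collect();
    (values, validity)
}

/// Hypothetical helper (not an arrow-rs API): flatten optional strings into
/// one contiguous byte vector plus `len + 1` offsets.
/// Assumes the total byte length fits in an `i32`.
fn byte_parts(input: &[Option<&str>]) -> (Vec<u8>, Vec<i32>) {
    // Compute the total size first so each vector is allocated exactly once.
    let total: usize = input.iter().flatten().map(|s| s.len()).sum();
    let mut values: Vec<u8> = Vec::with_capacity(total);
    let mut offsets: Vec<i32> = Vec::with_capacity(input.len() + 1);
    offsets.push(0);
    for s in input {
        if let Some(s) = s {
            values.extend_from_slice(s.as_bytes());
        }
        // A null slot repeats the previous offset (zero-length entry).
        offsets.push(values.len() as i32);
    }
    (values, offsets)
}
```

For example, `byte_parts(&[Some("ab"), None, Some("cde")])` yields the bytes of `"abcde"` and the offsets `[0, 2, 2, 5]`. The resulting vectors would then be handed to the array constructors; that step is omitted here.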

Faster versions

| Slower | Faster |
| --- | --- |
| `MutableBuffer::push` or `Vec::push` | `extend` or `collect` |
| `vec![]` or `MutableBuffer::new(0)` | `Vec::with_capacity` (preallocate once for known capacity) |
| `NullBufferBuilder` or boolean buffer builder | `BooleanBuffer::collect_bool` |
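
The patterns in this table can be sketched with the standard library alone. In the block below, `squares_push` and `squares_extend` contrast the slower and faster `Vec` patterns, and `pack_bools` is a hypothetical stand-in that shows the general idea behind a `collect_bool`-style API: call a closure per index and pack the results into whole words instead of appending one bit at a time. The word size and bit order are assumptions for illustration, not a copy of the arrow-rs internals.

```rust
/// Slower pattern: start empty and push one element at a time,
/// which may reallocate several times as the vector grows.
fn squares_push(n: usize) -> Vec<u64> {
    let mut out = Vec::new();
    for i in 0..n as u64 {
        out.push(i * i);
    }
    out
}

/// Faster pattern: reserve the known capacity once, then `extend`
/// from an iterator.
fn squares_extend(n: usize) -> Vec<u64> {
    let mut out = Vec::with_capacity(n);
    out.extend((0..n as u64).map(|i| i * i));
    out
}

/// Hypothetical stand-in for a `collect_bool`-style function (not the
/// arrow-rs implementation): evaluate `f` for every index in `0..len` and
/// pack the results 64 at a time into `u64` words, least significant bit
/// first.
fn pack_bools(len: usize, mut f: impl FnMut(usize) -> bool) -> Vec<u64> {
    // One word per 64 values, rounded up.
    let mut words = Vec::with_capacity((len + 63) / 64);
    for start in (0..len).step_by(64) {
        let end = (start + 64).min(len);
        let mut word = 0u64;
        for i in start..end {
            // Set bit `i - start` when the predicate holds.
            word |= (f(i) as u64) << (i - start);
        }
        words.push(word);
    }
    words
}
```

Calling `pack_bools(3, |i| i != 1)` marks slots 0 and 2 as set and slot 1 as unset, giving the single word `0b101`. Any speed difference between the two `squares_*` functions should be confirmed with a benchmark on the real workload.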

@alamb
Contributor Author

alamb commented May 9, 2025
