
Pre-allocate buffer #422

Merged: 2 commits into apache:main on Apr 11, 2023
Conversation

JoaoAparicio (Contributor)

If we let transcode do its own allocation, it will allocate a small vector, start filling it, resize the vector, fill it some more, resize the vector, and so on.

Instead, this commit pre-allocates a vector of the correct size and passes it to transcode().

Inspired by #399
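A plain-Julia sketch of the difference (the function names are illustrative, not the PR's code; the actual change passes the pre-sized vector to TranscodingStreams' transcode):

```julia
# What happens when transcode owns the allocation: start with a small
# vector and repeatedly grow it, reallocating and copying along the way.
function fill_growing(n)
    out = UInt8[]
    for _ in 1:n
        push!(out, 0x01)   # may trigger reallocation as the vector grows
    end
    return out
end

# The pattern this PR adopts: the output length is known up front,
# so allocate once at the correct size and fill in place.
function fill_preallocated(n)
    out = Vector{UInt8}(undef, n)   # one allocation, correct size
    fill!(out, 0x01)
    return out
end

@assert fill_growing(1024) == fill_preallocated(1024)
```

Both produce the same bytes; the pre-allocated version avoids the intermediate reallocations and copies.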

@JoaoAparicio JoaoAparicio marked this pull request as ready for review April 11, 2023 01:32
JoaoAparicio (Contributor, Author)

Waiting on release of JuliaIO/TranscodingStreams.jl#136

Moelf (Contributor) commented Apr 11, 2023

Any thoughts on this?

I know you were aware of this issue, and thanks a lot for looking into quite a few issues here and in TranscodingStreams.

JoaoAparicio (Contributor, Author)

I have some thoughts. One solution to "my dataset is larger than memory" is partitioning. If your dataset is partitioned in such a way that each partition fits in memory, you can iterate it with

stream = Arrow.Stream(path)
for tbl in stream
    ...
end

You can do this right now without requiring any additional features from this package.
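A minimal self-contained sketch of that loop, using an in-memory buffer instead of a file (the two-partition table is illustrative; `Tables.partitioner` makes each element its own record batch, so `Arrow.Stream` yields them one at a time):

```julia
using Arrow, Tables

# Write two partitions as separate record batches to an in-memory buffer.
io = IOBuffer()
Arrow.write(io, Tables.partitioner([(x = 1:3,), (x = 4:6,)]))
seekstart(io)

# Iterate the stream: only one partition is materialized per iteration.
total = sum(sum(Tables.getcolumn(tbl, :x)) for tbl in Arrow.Stream(io))
@assert total == 21
```

The same loop works on a file path, which is the larger-than-memory case: each partition is read, processed, and released before the next one is loaded.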

In contrast, what is discussed in #340 (namely: don't decompress unless you have to) is a different approach, but it doesn't exist yet.

Currently I have some commits that add multi-threaded decompression at the buffer level. I will be trying to upstream what I have so far. The difficulty is that these commits touch a lot of code, so this won't happen overnight; I imagine a couple of weeks. On top of that, it should be straightforward to implement what is discussed in #340.

Moelf (Contributor) commented Apr 11, 2023

stream = Arrow.Stream(path)
for tbl in stream

The problem with this approach is that it decompresses every column. Decompressing every column is much slower when I'm only using a small percentage of the columns.

baumgold (Member) left a comment
LGTM

baumgold (Member)

@quinnj / @ericphanson - any comments before we merge?

ericphanson (Member)

nope, LGTM!

@baumgold baumgold merged commit f8f8d8e into apache:main Apr 11, 2023
Moelf pushed a commit to Moelf/arrow-julia that referenced this pull request Apr 13, 2023
@ericphanson ericphanson mentioned this pull request Oct 15, 2023