Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POC: small optimizations to reduce parquet metadata decoding time by 40% #5856

Closed
wants to merge 3 commits into from

Conversation

XiangpengHao
Copy link
Contributor

Rationale for this change

This draft PR is not meant to be merged. Instead, it serves as a record for people who want to improve parquet metadata decoding performance.

It demonstrates three hacks that can easily improve decoding time by 40%:
(1) Use a better allocator (e.g., mimalloc)
(2) Move statistics to heap, i.e., reduce stack memory movement.
(3) Use SIMD to accelerate decoding varint, as provided by varint-simd crate.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 7, 2024
@XiangpengHao XiangpengHao changed the title Metadata tricks that reduce thrift decoding time by more than 40% Reducing parquet metadata decoding time by more than 40% Jun 7, 2024
@XiangpengHao XiangpengHao changed the title Reducing parquet metadata decoding time by more than 40% Reducing parquet metadata decoding time by 40% Jun 7, 2024
@@ -37,7 +37,7 @@ pub enum Page {
encoding: Encoding,
def_level_encoding: Encoding,
rep_level_encoding: Encoding,
statistics: Option<Statistics>,
statistics: Option<Box<Statistics>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the idea is that doing this reduces allocation size when no statistics are present? I would actually think in the case where there are page statistics this might actually slow things down due to the indirection

Copy link
Contributor Author

@XiangpengHao XiangpengHao Jun 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, for none stats, this is pure improvement; but for page stats, this is a trade-off: extra allocation vs smaller stack movement, in the blog post's Figure 7 (+layout opt), we can see that smaller stack movement wins and the improvement is indeed smaller for page/chunk stats.

return Ok(in_progress);
}
}
let (val, shift) = unsafe { varint_simd::decode_unsafe(self.buf.as_ptr()) };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Safety docs for this function say

There must be at least 16 bytes of allocated memory after the beginning of the pointer.

I think we need to check the remaining len and use the scalar code as a fallback for smaller lengths. Or somehow guarantee that the slice is always padded with 16 additional bytes at the end.

@alamb alamb changed the title Reducing parquet metadata decoding time by 40% POC: small optimizations to reduce parquet metadata decoding time by 40% Jun 10, 2024
@tustvold
Copy link
Contributor

tustvold commented Oct 8, 2024

Closing this PR as I think it has served its purpose, to demonstrate such wins are possible, and is not intended to be merged

@tustvold tustvold closed this Oct 8, 2024
@alamb
Copy link
Contributor

alamb commented Oct 8, 2024

Closing this PR as I think it has served its purpose, to demonstrate such wins are possible, and is not intended to be merged

Indeed -- if anyone else finds this PR you can read about the larger context here https://www.influxdata.com/blog/how-good-parquet-wide-tables/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants