Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update HyperLogLog library to original Airlift #68

Merged
merged 2 commits into from
Feb 15, 2024

Conversation

pranjalssh
Copy link

@pranjalssh pranjalssh commented Feb 14, 2024

There has been considerable performance improvements to original airlift library. Namely:

airlift@a932881
airlift@23b97cb

I copied all HyperLogLog files from airlift/airlift to prestodb/airlift - and made some API changes necessary for our internal PrivateLpcaSketch. API changes were "read-only" - so we don't have to worry about correctness issues in PrivateLpcaSketch:

DenseHll.eachBucket(BucketListener listener)
HyperLogLog.getMaxBucketValue()
HyperLogLog.getNumberOfBuckets()

Benchmark before and after showing 70X improvement for benchmarkMergeWithSparse which matches our expectations:

Benchmark                                   Mode  Cnt   Score   Error  Units
BenchmarkDenseHll.benchmarkInsert           avgt   50   1.441 ± 0.114  us/op
BenchmarkDenseHll.benchmarkMergeWithDense   avgt   50  21.630 ± 0.541  us/op
BenchmarkDenseHll.benchmarkMergeWithSparse  avgt   50  22.805 ± 0.227  us/op

Benchmark                                   Mode  Cnt   Score   Error  Units
BenchmarkDenseHll.benchmarkInsert           avgt   50   1.409 ± 0.035  us/op
BenchmarkDenseHll.benchmarkMergeWithDense   avgt   50  11.615 ± 1.016  us/op
BenchmarkDenseHll.benchmarkMergeWithSparse  avgt   50   0.335 ± 0.009  us/op

Commit 1: Update benchmarks and some tests(so we can compare)
Commit 2: Ports over HyperLogLog changes


@NotThreadSafe
/**
Copy link
Author

@pranjalssh pranjalssh Feb 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDK why they removed @ NotThreadSafe. I'll let someone more familiar with Java decide what's better

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was changed as part of airlift@1eb30dc.

It looks like the background is that javax.annotations was moved and renamed to jakarta.annotations (rename was due to some licensing disagreement between oracle and eclipse see jakartaee/rest#760). But Jakarta.annotations doesn't have the NotThreadSafe annotation (see discussion here: jakartaee/common-annotations-api#91). So seems like this change was basically part of upgrading that library.

@pranjalssh pranjalssh force-pushed the opt_dense_hll branch 2 times, most recently from ce2d214 to e787554 Compare February 14, 2024 19:00
Copy link

@feilong-liu feilong-liu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG. Have you run any tests to make sure the result is the same before and after change?

@pranjalssh
Copy link
Author

I've run unit tests so far. Note that Presto will not use this airlift code until we release this and update the version in Presto's pom. Once we merge and release this we can run correctness suites before landing in presto

@pranjalssh
Copy link
Author

Tried several queries to verify correctness and saw performance gains as well. Going to push now and figure out how to release later

@agrawaldevesh
Copy link

Just wow !! Would love to learn more about the performance improvement we saw on a real presto query with this change.

Copy link

@rschlussel rschlussel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't really try to understand all of it, but it's been reviewed already in airlift, and its well tested and looks reasonable.

Can you add a co-author line to the commit messages to credit whoever wrote the commits you took. Also would be nice to link those commits so its easy to follow to get more info.

@@ -171,7 +171,7 @@ public void eachBucket(BucketListener listener)
// if zeros > EXTENDED_BITS_LENGTH - indexBits, it means all those bits were zeros,
// so look at the entry value, which contains the number of leading 0 *after* EXTENDED_BITS_LENGTH
int bits = EXTENDED_PREFIX_BITS - indexBitLength;
if (zeros > bits) {
if (zeros >= bits) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a correctness fix. We should call this out explicitly in the commit message

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I'll put :
"Improved accuracy and performance of hyperloglog functions"

// grow overflows arrays if necessary
overflowBuckets = Ints.ensureCapacity(overflowBuckets, overflows + 1, OVERFLOW_GROW_INCREMENT);
overflowValues = Bytes.ensureCapacity(overflowValues, overflows + 1, OVERFLOW_GROW_INCREMENT);
for (int shift = 4; shift >= 0; shift -= 4) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know why this number is 4? Is it because BITS_PER_BUCKET is 4 (I see that's used in calculating the size of the deltas array)? If so, we should use the static variable explicitly here so they can't get out of sync.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense - but ideally we should keep most of the code in sync with airlift/airlift so its easier to copy in future as well. So we can change there first and then copy it. Wdyt?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's fine with me.

@pranjalssh pranjalssh merged commit 0ef7c09 into prestodb:master Feb 15, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants