-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update HyperLogLog library to original Airlift #68
Conversation
|
||
@NotThreadSafe | ||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IDK why they removed @ NotThreadSafe. I'll let someone more familiar with Java decide what's better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was changed as part of airlift@1eb30dc.
It looks like the background is that javax.annotations was moved and renamed to jakarta.annotations (rename was due to some licensing disagreement between oracle and eclipse see jakartaee/rest#760). But Jakarta.annotations doesn't have the NotThreadSafe annotation (see discussion here: jakartaee/common-annotations-api#91). So seems like this change was basically part of upgrading that library.
ce2d214
to
e787554
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LG. Have you run any tests to make sure the result is the same before and after change?
I've run unit tests so far. Note that Presto will not use this airlift code until we release this and update the version in Presto's pom. Once we merge and release this we can run correctness suites before landing in presto |
Tried several queries to verify correctness and saw performance gains as well. Going to push now and figure out how to release later |
Just wow !! Would love to learn more about the performance improvement we saw on a real presto query with this change. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't really try to understand all of it, but it's been reviewed already in airlift, and its well tested and looks reasonable.
Can you add a co-author line to the commit messages to credit whoever wrote the commits you took. Also would be nice to link those commits so its easy to follow to get more info.
@@ -171,7 +171,7 @@ public void eachBucket(BucketListener listener) | |||
// if zeros > EXTENDED_BITS_LENGTH - indexBits, it means all those bits were zeros, | |||
// so look at the entry value, which contains the number of leading 0 *after* EXTENDED_BITS_LENGTH | |||
int bits = EXTENDED_PREFIX_BITS - indexBitLength; | |||
if (zeros > bits) { | |||
if (zeros >= bits) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a correctness fix. We should call this out explicitly in the commit message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I'll put :
"Improved accuracy and performance of hyperloglog functions"
// grow overflows arrays if necessary | ||
overflowBuckets = Ints.ensureCapacity(overflowBuckets, overflows + 1, OVERFLOW_GROW_INCREMENT); | ||
overflowValues = Bytes.ensureCapacity(overflowValues, overflows + 1, OVERFLOW_GROW_INCREMENT); | ||
for (int shift = 4; shift >= 0; shift -= 4) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know why this number is 4? Is it because BITS_PER_BUCKET is 4 (I see that's used in calculating the size of the deltas array)? If so, we should use the static variable explicitly here so they can't get out of sync.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense - but ideally we should keep most of the code in sync with airlift/airlift
so its easier to copy in future as well. So we can change there first and then copy it. Wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's fine with me.
e787554
to
a590b30
Compare
There has been considerable performance improvements to original airlift library. Namely:
airlift@a932881
airlift@23b97cb
I copied all HyperLogLog files from airlift/airlift to prestodb/airlift - and made some API changes necessary for our internal PrivateLpcaSketch. API changes were "read-only" - so we don't have to worry about correctness issues in PrivateLpcaSketch:
Benchmark before and after showing 70X improvement for
benchmarkMergeWithSparse
which matches our expectations:Commit 1: Update benchmarks and some tests(so we can compare)
Commit 2: Ports over HyperLogLog changes