-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX: Use lazy encoding in UTF-8 encoded string comparison #2021
base: main
Are you sure you want to change the base?
Conversation
This reverts commit 24bd892.
} else { | ||
// UTF-8 encoded byte comparison, substring 2 indexes to cover surrogate pairs | ||
ByteString leftBytes = | ||
ByteString.copyFromUtf8(left.substring(i, Math.min(i + 2, left.length()))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be left.length() - i
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why? do u mean encode all the remaining strings? 🤔
But we have confirmed that we encountered the characters that differ, which could be up to 2 index, considering surrogate pairs. If we are already promised that we can find the difference within 2 indexes, is there still a need to encode all remaining? for example, "ab👍cdefghjklm..." vs "ab👎ghjkiuytrd....".
I think substrings 2 indexes is already being very generous
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh you're right. I incorrectly thought left.length()
was the number of chars in the substring. My bad. Nevermind this comment 🤦
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: @milaGGL I would improve the following comment to be more descriptive (along the lines of what you said above):
changing
// UTF-8 encoded byte comparison, substring 2 indexes to cover surrogate pairs
to
// We have identified a code point difference, so we don't need to
// encode/compare _all_ the remainder of the strings. We only need to
// compare up to 2 more indexes to cover potential surrogate pairs.
@@ -1169,6 +1169,10 @@ public void snapshotListenerSortsUnicodeStringsSameWayAsServer() throws Exceptio | |||
.set(col.document("e"), map("value", "P")) | |||
.set(col.document("f"), map("value", "︒")) | |||
.set(col.document("g"), map("value", "🐵")) | |||
.set(col.document("h"), map("value", "你好")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This integration test coverage is good proof-of-concept. Please add solid unit test coverage too, as the logic is very delicate and exploits low-level properties of utf8 and utf16 encoding that are not well-known, obvious, or straight-forward.
The previous fix created a performance issue due to expensive UTF-8 encoding. Update
compareUtf8Strings
to use lazy encoding instead.