Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX: Use lazy encoding in UTF-8 encoded string comparison #2021

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

milaGGL
Copy link
Contributor

@milaGGL milaGGL commented Feb 18, 2025

The previous fix created a performance issue due to expensive UTF-8 encoding. Update compareUtf8Strings to use lazy encoding instead.

@product-auto-label product-auto-label bot added size: l Pull request size is large. api: firestore Issues related to the googleapis/java-firestore API. labels Feb 18, 2025
@product-auto-label product-auto-label bot added size: m Pull request size is medium. and removed size: l Pull request size is large. labels Feb 18, 2025
@milaGGL milaGGL marked this pull request as ready for review February 19, 2025 15:14
@milaGGL milaGGL requested review from a team as code owners February 19, 2025 15:14
@milaGGL milaGGL requested a review from ehsannas February 19, 2025 15:14
@milaGGL milaGGL assigned milaGGL and ehsannas and unassigned milaGGL Feb 19, 2025
@milaGGL milaGGL changed the title Use lazy encoding in utf-8 encoded string comparison FIX: Use lazy encoding in UTF-8 encoded string comparison Feb 19, 2025
} else {
// UTF-8 encoded byte comparison, substring 2 indexes to cover surrogate pairs
ByteString leftBytes =
ByteString.copyFromUtf8(left.substring(i, Math.min(i + 2, left.length())));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be left.length() - i?

Copy link
Contributor Author

@milaGGL milaGGL Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why? do u mean encode all the remaining strings? 🤔

But we have confirmed that we encountered the characters that differ, which could be up to 2 index, considering surrogate pairs. If we are already promised that we can find the difference within 2 indexes, is there still a need to encode all remaining? for example, "ab👍cdefghjklm..." vs "ab👎ghjkiuytrd....".

I think substrings 2 indexes is already being very generous

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh you're right. I incorrectly thought left.length() was the number of chars in the substring. My bad. Nevermind this comment 🤦

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: @milaGGL I would improve the following comment to be more descriptive (along the lines of what you said above):

changing

// UTF-8 encoded byte comparison, substring 2 indexes to cover surrogate pairs

to

// We have identified a code point difference, so we don't need to
// encode/compare _all_ the remainder of the strings. We only need to 
// compare up to 2 more indexes to cover potential surrogate pairs.

@@ -1169,6 +1169,10 @@ public void snapshotListenerSortsUnicodeStringsSameWayAsServer() throws Exceptio
.set(col.document("e"), map("value", "P"))
.set(col.document("f"), map("value", "︒"))
.set(col.document("g"), map("value", "🐵"))
.set(col.document("h"), map("value", "你好"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This integration test coverage is good proof-of-concept. Please add solid unit test coverage too, as the logic is very delicate and exploits low-level properties of utf8 and utf16 encoding that are not well-known, obvious, or straight-forward.

@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: m Pull request size is medium. labels Feb 20, 2025
@milaGGL milaGGL requested a review from dconeybe February 20, 2025 21:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: firestore Issues related to the googleapis/java-firestore API. size: l Pull request size is large.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants