FIX: Use lazy encoding in UTF-8 encoded string comparison #2021

milaGGL · 2025-02-18T21:13:38Z

The previous fix created a performance issue due to expensive UTF-8 encoding. Update compareUtf8Strings to use lazy encoding instead.



dconeybe · 2025-02-19T18:04:25Z

google-cloud-firestore/src/main/java/com/google/cloud/firestore/Order.java

+        } else {
+          // UTF-8 encoded byte comparison, substring 2 indexes to cover surrogate pairs
+          ByteString leftBytes =
+              ByteString.copyFromUtf8(left.substring(i, Math.min(i + 2, left.length())));


Should this be left.length() - i?

Why? Do you mean encoding the whole remainder of the strings? 🤔

But we have already confirmed that we reached the code points that differ, and those span at most 2 indexes once surrogate pairs are taken into account. If the difference is guaranteed to lie within those 2 indexes, is there still any need to encode everything that follows? For example, "ab👍cdefghjklm..." vs "ab👎ghjkiuytrd....".

I think a 2-index substring is already very generous.

Ahh, you're right. I incorrectly thought left.length() was the number of chars in the substring. My bad. Never mind this comment 🤦
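To make the point of confusion concrete, here is a minimal sketch of the substring call from the diff. `SubstringWindowDemo` and `window` are hypothetical names used only for illustration; the only behavior relied on is that `String.substring(begin, end)` takes an absolute end index rather than a length, so `left.length()` serves purely as a clamp.

```java
final class SubstringWindowDemo {
  // Returns at most two UTF-16 code units starting at index i.
  // The second argument of substring is an absolute end index, not a length,
  // so s.length() only keeps the window from running past the end of the string.
  static String window(String s, int i) {
    return s.substring(i, Math.min(i + 2, s.length()));
  }

  public static void main(String[] args) {
    // "ab" + thumbs-up emoji (U+1F44D, a surrogate pair) + "cdefghjklm"
    String left = "ab\uD83D\uDC4Dcdefghjklm";
    int i = 2; // index of the first differing code point in the thread's example

    // The window covers exactly the surrogate pair, not the rest of the string.
    System.out.println(window(left, i).length());

    // Near the end of the string the window is clamped to what is left.
    System.out.println(window(left, left.length() - 1));
  }
}
```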

Nit: @milaGGL I would improve the following comment to be more descriptive (along the lines of what you said above):

changing

// UTF-8 encoded byte comparison, substring 2 indexes to cover surrogate pairs

to

// We have identified a code point difference, so we don't need to
// encode/compare _all_ the remainder of the strings. We only need to
// compare up to 2 more indexes to cover potential surrogate pairs.
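The idea in that comment can be sketched roughly as follows. This is not the code from Order.java: the class and helper names are hypothetical, it uses `String.getBytes` instead of `ByteString`, and the tie-break fallback for inputs whose windows encode identically (for example unpaired surrogates) is an assumption.

```java
import java.nio.charset.StandardCharsets;

final class LazyUtf8CompareSketch {
  // Orders two strings the way their UTF-8 encodings would sort,
  // but encodes nothing until the first differing code point is found.
  static int compare(String left, String right) {
    int i = 0;
    while (i < left.length() && i < right.length()) {
      int leftCodePoint = left.codePointAt(i);
      int rightCodePoint = right.codePointAt(i);
      if (leftCodePoint != rightCodePoint) {
        if (leftCodePoint < 128 && rightCodePoint < 128) {
          // ASCII encodes to a single byte, so code point order is byte order.
          return Integer.compare(leftCodePoint, rightCodePoint);
        }
        // Encode only a 2-index window, enough to cover a surrogate pair.
        int cmp = compareUnsigned(encodeWindow(left, i), encodeWindow(right, i));
        // Assumption: if the encoded windows tie, fall back to code point order.
        return cmp != 0 ? cmp : Integer.compare(leftCodePoint, rightCodePoint);
      }
      // Advance by 2 for a surrogate pair, 1 otherwise.
      i += Character.charCount(leftCodePoint);
    }
    // One string is a prefix of the other (or they are equal).
    return Integer.compare(left.length(), right.length());
  }

  static byte[] encodeWindow(String s, int i) {
    return s.substring(i, Math.min(i + 2, s.length())).getBytes(StandardCharsets.UTF_8);
  }

  // Lexicographic comparison treating each byte as an unsigned value.
  static int compareUnsigned(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int k = 0; k < n; k++) {
      int cmp = Integer.compare(a[k] & 0xFF, b[k] & 0xFF);
      if (cmp != 0) {
        return cmp;
      }
    }
    return Integer.compare(a.length, b.length);
  }
}
```

Because two different valid code points never have UTF-8 encodings where one is a prefix of the other, the comparison is decided inside the first code point of each window; the second index only matters when that code point is a surrogate pair.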

dconeybe · 2025-02-19T18:17:35Z

google-cloud-firestore/src/test/java/com/google/cloud/firestore/it/ITQueryTest.java

@@ -1169,6 +1169,10 @@ public void snapshotListenerSortsUnicodeStringsSameWayAsServer() throws Exceptio
        .set(col.document("e"), map("value", "Ｐ"))
        .set(col.document("f"), map("value", "︒"))
        .set(col.document("g"), map("value", "🐵"))
+        .set(col.document("h"), map("value", "你好"))


This integration test coverage is a good proof of concept. Please add solid unit test coverage too, as the logic is very delicate and exploits low-level properties of UTF-8 and UTF-16 encoding that are not well known, obvious, or straightforward.
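One possible shape for such unit coverage, sketched under assumptions: check any candidate comparator against a reference that fully encodes both strings, over every pair of sample values. `Utf8OrderCheck`, `referenceCompare`, `mismatches`, and `samples` are hypothetical names; the non-ASCII samples mirror the values in the integration test above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class Utf8OrderCheck {
  // Reference order: encode both strings fully and compare bytes as unsigned values.
  static int referenceCompare(String a, String b) {
    byte[] x = a.getBytes(StandardCharsets.UTF_8);
    byte[] y = b.getBytes(StandardCharsets.UTF_8);
    int n = Math.min(x.length, y.length);
    for (int k = 0; k < n; k++) {
      int cmp = Integer.compare(x[k] & 0xFF, y[k] & 0xFF);
      if (cmp != 0) {
        return cmp;
      }
    }
    return Integer.compare(x.length, y.length);
  }

  // Lists every sample pair where the candidate's sign disagrees with the reference.
  static List<String> mismatches(Comparator<String> candidate, List<String> samples) {
    List<String> out = new ArrayList<>();
    for (String a : samples) {
      for (String b : samples) {
        int expected = Integer.signum(referenceCompare(a, b));
        int actual = Integer.signum(candidate.compare(a, b));
        if (expected != actual) {
          out.add(a + " vs " + b);
        }
      }
    }
    return out;
  }

  // Empty, ASCII, a prefix pair, a 2-byte character, the two-character CJK string,
  // fullwidth P (U+FF30), U+FE12, and the monkey emoji (a surrogate pair).
  static List<String> samples() {
    return Arrays.asList(
        "", "a", "ab", "\u00E9", "\u4F60\u597D", "\uFF30", "\uFE12", "\uD83D\uDC35");
  }
}
```

A unit test would pass the method under test as the candidate and assert that `mismatches` returns an empty list. Passing `String::compareTo` instead yields mismatches, since UTF-16 code unit order places surrogate pairs before characters such as U+FF30 while UTF-8 byte order places them after.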

milaGGL and others added 12 commits January 6, 2025 11:30

add test

96de3e3

chore: generate libraries at Mon Jan 6 16:30:40 UTC 2025

24bd892

Revert "chore: generate libraries at Mon Jan 6 16:30:40 UTC 2025"

9e812fa

This reverts commit 24bd892.

chore: generate libraries at Mon Jan 6 16:54:57 UTC 2025

149d3e1

add more tests

dec5d02

Merge branch 'main' into mila/string-uses-byte-comparison

89ed44c

format

c876e5a

remove lines commented out

b55992d

Update ITQueryTest.java

ef2ae13

Merge branch 'main' into mila/string-uses-byte-comparison

31cf3ff

resolve comment

f17c28a

use lazy encoding in utf-8 encoded string comparison

d4f299a

product-auto-label bot added the "size: l" (pull request size is large) and "api: firestore" (issues related to the googleapis/java-firestore API) labels Feb 18, 2025

Merge branch 'main' into mila/string-uses-byte-comparison

511469b

product-auto-label bot added the "size: m" (pull request size is medium) label and removed the "size: l" label Feb 18, 2025

cloud-java-bot and others added 3 commits February 18, 2025 21:17

chore: generate libraries at Tue Feb 18 21:15:36 UTC 2025

f86dcfb

Update Order.java

602a356

Merge branch 'mila/string-uses-byte-comparison' of https://github.com…

e8b3f57

…/googleapis/java-firestore into mila/string-uses-byte-comparison

milaGGL marked this pull request as ready for review February 19, 2025 15:14

milaGGL requested review from a team as code owners February 19, 2025 15:14

milaGGL requested a review from ehsannas February 19, 2025 15:14

milaGGL assigned milaGGL and ehsannas and unassigned milaGGL Feb 19, 2025

milaGGL changed the title ~~Use lazy encoding in utf-8 encoded string comparison~~ FIX: Use lazy encoding in UTF-8 encoded string comparison Feb 19, 2025

dconeybe requested changes Feb 19, 2025


dconeybe reviewed Feb 19, 2025


add unit test

4ce70e6

product-auto-label bot added the "size: l" (pull request size is large) label and removed the "size: m" label Feb 20, 2025

milaGGL requested a review from dconeybe February 20, 2025 21:27
