Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode Case Folding #228

Open
isomarcte opened this issue Feb 5, 2022 · 12 comments · May be fixed by #229
Open

Unicode Case Folding #228

isomarcte opened this issue Feb 5, 2022 · 12 comments · May be fixed by #229

Comments

@isomarcte
Copy link
Member

While working on cats-uri, I ran into an issue with how CIString was handling certain unicode values which led me to notice it wasn't respecting Caseless matching from the Unicode standard. As it turns out, neither does String.equalsIgnoreCase.

I'd just about completed a branch to implement full case folding as defined by the Unicode standard when I ran across this test.

  test("character based equality") {
    assert(CIString("ß") != CIString("SS"))
  }

Since under the Unicode standard's caseless matching these two strings would compare equal, I'm beginning to think we are intentionally not following the standard here. Is that the case? If so, why? Is it to maintain parity with what the Java standard library is doing with methods like equalsIgnoresCase?

@armanbilge
Copy link
Member

I was curious and it seems that JS may support this out-of-the-box. So at least we wouldn't have to copy those tables in 😅

"ß".localeCompare("SS", 'en', {sensitivity: 'base'}) // 0

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/localeCompare

@armanbilge
Copy link
Member

Ah, here's the equivalent with JDK.

import java.text.Collator
val coll = Collator.getInstance
coll.setStrength(Collator.PRIMARY)
coll.compare("ß", "SS") // 0

@isomarcte
Copy link
Member Author

@armanbilge I actually already have the lookup tables: https://github.com/isomarcte/case-insensitive/blob/full-unicode-case-folding/core/src/main/scala/org/typelevel/ci/CaseFolds.scala#L23

It only took a few minutes to get emacs to transform the text file into that.

That said, I'm happy to use existing JS/JRE based implementations, but I think we need to actually have a case folded string: https://github.com/isomarcte/case-insensitive/blob/full-unicode-case-folding/core/src/main/scala/org/typelevel/ci/CIString.scala#L43

I didn't see anyway to get that from the Collator. We need the folded string to get a safe/correct implementation of .hashCode for CIString.

@armanbilge
Copy link
Member

Thanks, yes, I actually saw your lookup tables which is what prompted me to look for any stdlib solutions for this ... at least for JS, this is just one more thing to bloat the generated code 🙃

You're right though, if there's no way to get a folded string then your lookup tables are the way to go 👍

@isomarcte
Copy link
Member Author

I've only poked around in the JRE stdlib. I can check around in JS land to see if we can get a case folded string there and then maybe lighten the generated code and use the lookup tables on the JRE.

@armanbilge
Copy link
Member

No worries, you can stay focused on doing awesome work in cats-uri and I'll worry about the JS stuff. Correctness/functionality first, optimizations second :)

@isomarcte isomarcte linked a pull request Feb 6, 2022 that will close this issue
@armanbilge
Copy link
Member

armanbilge commented Feb 6, 2022

@isomarcte what about:

scala> "ß".toLowerCase.toUpperCase == "SS".toLowerCase.toUpperCase
val res9: Boolean = true

Does this do what we want?

Edit: ignore me :)

@rossabaker
Copy link
Member

It looks like you've implemented "full" case folding in #229. Is what we have currently consistent with "simple" case folding? (I could probably answer this myself...)

This would be a semantic breaking change, even if it's binary compatible. If we proceed, should this be 2.0?

I wonder whether our current implementation varies by Java version. JDK8 supported Unicode 6.2, and Java 17 supports Unicode 13.0. It looks like your proposed algorithm guarantees stability, but only if we also normalize to NKFC?

For each string S containing characters only from a given Unicode version, toCasefold(toNFKC(S)) under that version is identical to toCasefold(toNFKC(S)) under any later version of Unicode.

@rossabaker
Copy link
Member

Ugh, there are four definitions of this.

  • Default caseless matching
  • Canonical caseless matching
  • Compatibility caseless matching
  • Identifier caseless matching

@rossabaker
Copy link
Member

It seems you're targeting the Default definition, which seems as good a starting point as any.

@isomarcte
Copy link
Member Author

This would be a semantic breaking change, even if it's binary compatible. If we proceed, should this be 2.0?

I'm not sure. It is a semantic change, but most users will be unaffected. What we currently implement appears to be semantically equivalent to the simple case folding rules + rules for some Turkic languages with no normalization.

The Turkic rules are supposed to off by default. That is, this should yield false.

scala> val a: Char = 0x0049.toChar
val a: Char = I

scala> val b: Char = 0x0131.toChar                                                                                      
val b: Char = ı

scala> CIString(a.toString) == CIString(b.toString)
val res0: Boolean = true

I can probably adapt my PR to provide both explicit simple caseless matching and full caseless matching, and then maybe we can give users the option. Let me take a look. I also want to look into the normalization as well.

@isomarcte
Copy link
Member Author

@rossabaker @armanbilge take a look here for an insane alternative approach which handles the different normalization and folding rules more explicitly and doesn't break the semantics of CIString.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants