Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement (Almost) All The Unicode Caseless Matching Systems #232

Draft
wants to merge 11 commits into
base: series/2.x
Choose a base branch
from

Conversation

isomarcte
Copy link
Member

@isomarcte isomarcte commented Feb 6, 2022

The exception is identifier caseless matching because the normalization operations are not natively supported by java.text.Normalizer and would require even more code.

This change introduces a staggering 12 types of case folded strings. These break down to simple/full case folded strings, each with a default and Turkic variant, and then for each of those we have default, canonical, and compatibility normalization/folding.

CIString is now deprecated and defers to CIStringS, which is an input remembering caseless string using default Unicode caseless matching on simple case folded strings. This is the most similar form to what CIString was doing before, though I think most users will actually want to use CIStringCF which is canonical Unicode caseless matching on full case folded strings.

Note: There is still quite a bit of work to do on this branch. Tests/Scaladoc/typeclass instances come to mind, but I wanted to get it up for initial thoughts before I did all that.

This is an alternative to #229. There are some obvious cons here.

A major Con is that there are so many types. I don't know that we'd need/want to add CIString variants for each of the types of case folded strings, but even if we just have the types introduced thus far we've gone from 1 CIString to 15 Folded/Caseless strings. That said, this is just what the reality of the Unicode standard dictates. There are a few more succinct encodings I toyed with, but they have some negative code complexity tradeoffs.

An alternative is to just have a single CaseFoldedString which is the result of applying one of the folding algorithms and one of the normalization algorithms. This is what I did in #229, but I'm very wary of it as we lose information about what happened. Two case folded strings could be equal but not really be equivalent because they we derived from different folding/normalization. That's why, despite the verbosity, the independent types seemed like the right choice.

But then again 🤷 . I may have just lost my mind after staring at the Unicode standard this long. 🤣

This commit adds CaseFoldedString as a partner to CIString. A CaseFoldedString is case folded according to the Unicode rules for Caseless Matching. In contrast to CIString, it does _not_ keep a reference to the input `String`.

This commit changes CIString to be based on CaseFoldedString.
The exception is identifier caseless matching because the normalization operations are not natively supported by `java.text.Normalizer` and would require even more code.

This change introduces a staggering _12_ types of case folded strings. These break down to simple/full case folded strings, each with a default and Turkic variant, and then for each of those we have default, canonical, and compatibility normalization/folding.

CIString is now deprecated and defers to CIStringS, which is an input remembering caseless string using default Unicode caseless matching on simple case folded strings. This is the most similar form to what CIString was doing before, though I think most users will actually want to use CIStringCF which is canonical Unicode caseless matching on full case folded strings.
@isomarcte
Copy link
Member Author

Tests are expected to fail right now. This branch currently still has the tests I added for CIString to debug/validate the full case folding behavior on CIString. Since CIString is now explicitly using default case less matching with simple folding, it shouldn't pass default case less matching with full folding.

If we move forward here, I'll fix all that up.

@armanbilge
Copy link
Member

Phew! 😅 Some assorted thoughts:

CIString is now deprecated and defers to CIStringS, which is an input remembering caseless string using default Unicode caseless matching on simple case folded strings.

If CIString is deprecated anyway, why not just let it retain its old implementation, for sake of compatibility? Unless I misunderstood and that is what it was doing.

That's why, despite the verbosity, the independent types seemed like the right choice.

Static types are also better for Scala.js, since the optimizer can detect unused types and elide the relevant code. If this is implemented dynamically, then it no longer can do that. Although not sure if it this will make much difference in practice 😅

I think the hardest thing to stomach is that as a user or library author it's no longer obvious what the "right" choice is. I'm worried that if libraries don't sync up on this, then there will be fragmentation between APIs.

@isomarcte
Copy link
Member Author

If CIString is deprecated anyway, why not just let it retain its old implementation, for sake of compatibility? Unless I misunderstood and that is what it was doing.

My reasoning is that while I think the current implementation of CIString is actually Unicode default caseless matching with simple folding + the turkic rules, I'm not 100% sure on that. It might not actually be exactly any of the 16 different rules. Also, the standard states that the turkic rules should be omitted by default, but the current implementation of CIString includes them.

I'm okay just deprecating though if that is preferred.

I think the hardest thing to stomach is that as a user or library author it's no longer obvious what the "right" choice is. I'm worried that if libraries don't sync up on this, then there will be fragmentation between APIs.

Yeah, I agree. If we keep the other types around I think we should update the github site and basically all documentation to point people towards CIStringCF/CanonicalFullCaseFoldedString unless they know they specifically want something else since the Unicode standard states this will yield the "most correct results". (I'm trying not to freak out too much that we are dealing in ranges of correctness here...)

I on the fence about having the full set of types proposed here, but I'm trying not to be overly biased towards my specific use case. Since they are all in the standard, it's reasonable to think someone might want them eventually. They do have some nice properties, like SimpleCaseFoldedString won't change the length of the String and TurkicSimpleCaseFoldedString is actually the most similar to String.equalsIgnoreCase.

I'm willing to be talked out of this though.

I also hate the names CIStringCF, CIStringCS, and CIStringS, but I was at a lose for a better scheme.

Copy link
Member

@rossabaker rossabaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like mostly a superset of the other one, so it's probably better to discuss the design there.

message =
"Please use either CIStringCF, CIStringCS, or CIStringS instead. CIString/CIStringS implement Unicode default caseless matching on simple case folded strings. For most applications you probably want to use CIStringCF which implements Unicode canonical caseless matching on full case folded strings.",
since = "1.3.0")
final class CIString private (override val toString: String, val asCIStringS: CIStringS)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

http4s-0.22 and 0.23 would have to pin, because this is a big part of its public API, and even if we nowarned all the usages, the deprecation would be viral into everyone's apps. I would really rather avoid a source-breaking change on this type if we can avoid it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm struggling with what to do here. I'd like CIString to be the one most users will want out of the box, but we can't do that unless we break the semantics by changing it to be Unicode canonical caseless matching with full folded strings.

For most users, they probably won't notice, so maybe it's okay? That also would help with the concern about advertising which is the best type to use by default.

Copy link
Member

@armanbilge armanbilge Feb 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, are we breaking semantics? Or was our current implementation buggy, and we are fixing the semantics?

The CI microsite states that "locale independence" is a design goal. Is that what @isomarcte is achieving here?

Edit: hmm, I see that "locale-independence" is a technical term meaning that e.g. Locale.setDefault(Locale.forLanguageTag("tr")) should not change the result

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@armanbilge that's a tricky question. Yes and no. The Turkic special cases are obviously not locale independent. However, and forgive me for splitting hairs here, I'm not sure you can say that any Unicode caseless matching is locale independent. Whether or not "ı" == "I" in a caseless match is really only answerable in the context of some locale.

In as much as one can have locale independence in this context, the non-Turkic variants achieve that.

I can drop the Turkic ones, but...I don't know...maybe some Turkic language users will want to use them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit: hmm, I see that "locale-independence" is a technical term meaning that e.g. Locale.setDefault(Locale.forLanguageTag("tr")) should not change the result

I did not realize that. 👀

Copy link
Member Author

@isomarcte isomarcte Feb 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Idk if we can declare this a bug in case-insensitive specifically, but IIUC it is a bug in http4s.

The [Unicode simple](https://www.w3.org/TR/charmod-norm/#dfn-unicode-simple) casefolding form is not appropriate for string identity matching on the Web.

https://www.w3.org/TR/charmod-norm/#sec_unicode_cs

@armanbilge is it clear to you if they are implying default full caseless matching or canonical full caseless matching? It is not to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@isomarcte canonical?

Case sensitivity is not recommended for most specifications but, in the case of an exception where the vocabulary allows non-ASCII characters and which does not want to be sensitive to case distinctions, the 'Unicode canonical case fold' approach SHOULD be used.

https://www.w3.org/TR/charmod-norm/#matching_CanonicalFoldNormalizationStep

How does "compatibility" fit into this?

A 'Unicode compatibility case fold' approach should not be used.

https://www.w3.org/TR/charmod-norm/#matching_CompatibilityFoldNormalizationStep

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@armanbilge I don't think compatibility comes into play in our http4s context. (thanks for finding that btw!)

I'm becoming increasingly convinced we should change CIString to be full canonical caseless matching and update the docs. I suspect most of our current user base is working around web based technologies and probably should be using full canonical caseless matching anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing here contradicts that: https://mvnrepository.com/artifact/org.typelevel/case-insensitive/usages

It's probably fine 🙃

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would challenge that this is a bug in http4s. From the next paragraph in the spec:

Specifications that define case-insensitive matching in vocabularies limited to the Basic Latin (ASCII) subset of Unicode MAY specify ASCII case-insensitive matching.

I can't think of anywhere that http4s is using it that isn't an ASCII context. The one place I'm less sure of is the URI, but even there, encodings are by convention, so I think the answer is ambiguous. So, while CIString is less general purpose than we would like, is there a smoking gun that its use in the http4s ecosystem is wrong?

Comment on lines +49 to +62
// Yes, you read that right. Round and round we go.
//
// > "Yes, that’s it," said the Hatter with a sigh:
// > "it’s always tea-time, and we’ve no time to wash the things between whiles."
// >
// > "Then you keep moving round, I suppose?" said Alice.
// >
// > "Exactly so," said the Hatter: "as the things get used up."
// >
// > "But what happens when you come to the beginning again?" Alice ventured to ask.
// >
// > "Suppose we change the subject," the March Hare interrupted, yawning
//
// - Alice's Adventures In Wonderland, Chapter VII, by Lewis Carroll
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😆

@isomarcte
Copy link
Member Author

@rossabaker

This seems like mostly a superset of the other one, so it's probably better to discuss the design there.

This one drops the generic CaseFoldedString for separate types. I think CaseFoldedString may be dangerous.

This allows us to not deprecate the type.

There is still an unfortunate issue here. Full case folding can change the length (in terms of characters) of the `String`. This makes `CIString.length` not safe to use, as it is not clear which length it now refers to. This commit deprecates that method because of this. However, the `glob` based `unapplySeq` relies both on `length` and `Char` based case insensitivity. This means that it is no longer correct.

That said, by relying on `Char` based case insensitivity, it was probably not quite correct to begin with.
@isomarcte
Copy link
Member Author

isomarcte commented Feb 7, 2022

@rossabaker @armanbilge I've pushed a new commit which undeprecates and changes CIString to be Unicode canonical caseless matching with full case folding as discussed here.

See: b376e37

I ran into a different issue however. The unapplySeq for the ci StringContext is unsound on two counts. The first is that it relies on the length of the CIString, but that is ambiguous under full case folding. The second is that it is using case insensitivity on Char values directly. I don't think you can actually apply any of the Unicode caseless matching systems directly to a Char.

I think we should deprecate it. 🙁

Edit: I linked the wrong commit, 619b36f

@armanbilge
Copy link
Member

armanbilge commented Feb 7, 2022

Hmm. Putting aside the semantic changes, all these deprecations to length, unapply, etc. make this seem much more like a breaking change. Like full case folding is fundamentally incompatible/unsound with the API currently exposed.

@isomarcte
Copy link
Member Author

Yeah, it's not great. I'm still open to suggestions on what to do here. We could provide an alternative to to the ci syntax, but still not great.

@armanbilge
Copy link
Member

armanbilge commented Feb 7, 2022

Right, and that wouldn't help http4s anyway. Seems to me like we can either swallow this pretty breaking change on both the semantic and API level or put this into 2.0.0. IIUC this came up while working on cats-uri which itself is targeting http4s 1.0 I assume.

At the very least, starting a 2.x branch could unblock you for now, and we could think about strategies to get this into 1.x? /shrug

@isomarcte
Copy link
Member Author

isomarcte commented Feb 7, 2022

My concern with not changing this until http4s 1.x.x is that what CIString does right now is (probably) equivalent to simple case folding with Turkic rules and default caseless matching. When that disagrees with canonical caseless matching with full folding, the result is a bug in the http4s context. Are we okay with that on 0.22.x or 0.23.x?

I'm not sure what the full impact would be when they disagree. In a Uri for example, those would be cause two equivalent regNames to be treated as non-equivalent. Or, in the case of dotless I characters, our current CIString would cause two non-equivalent regNames to compare equal. This might get blocked by DNS rules, DNS adds a bunch of rules on top of RFC-3986, but I'm not sure. That's just one example, I'm sure there are others.

I mean...I get it...we've no great options and big source breaking changes suck...I'm just thinking out loud here.

At the very least, starting a 2.x branch could unblock you for now, and we could think about strategies to get this into 1.x?

Yeah, I'll work towards that in the meantime.

@armanbilge
Copy link
Member

I mean...I get it...we've no great options and big source breaking changes suck...I'm just thinking out loud here.

Yep me too 😅

Definitely agree that it's a bug in http4s, but I can't begin to understand how serious it is. Defer to Ross on next steps :)

@isomarcte
Copy link
Member Author

@rossabaker I'm (finally) circling back here. Do you have any thoughts?

I suspect there are some very deep serious side effects of the incorrect case-folding, but that they are quite rare. Given our user base doesn't seem to be complaining about them right now in http4s, I think we could pivot for version 2 of this library and hope any such invalid behavior is rare enough that we don't see it on 0.23.x of http4s.

Thoughts?

@isomarcte isomarcte mentioned this pull request May 7, 2022
@armanbilge armanbilge changed the base branch from main to series/2.x May 13, 2022 18:24
@rossabaker
Copy link
Member

I have severe apprehensions about introducing a 2.x version. The scala-xml upgrade has been a shit show, not so much due to actual incompatibility, and more because it's setting off alarms in SBT and getting casually upgraded in patch releases downstream because it was advertised as "compatible enough". We risk a similar shitshow as soon as we introduce a 2.x here.

Eyeballing this after several months away, it seems like it's binary compatible, and I didn't set off any MiMa alarm bells. Am I correct that this a purely semantic change as written?

I would also like to consider benchmarks before changing the core type used in http4s. It seems like this is heavier, but it probably doesn't matter in the context of an IO application? It's worth considering, given that this library is well established with one consumer that actually matters.

@isomarcte isomarcte mentioned this pull request Jan 4, 2023
@isomarcte
Copy link
Member Author

Eyeballing this after several months away, it seems like it's binary compatible, and I didn't set off any MiMa alarm bells. Am I correct that this a purely semantic change as written?

I've been distracted by work in https://github.com/typelevel/idna4s and at $WORK, but IIRC this can be made bincompat with the possible exception of the globbing matcher. See #297 which I just hastily wrote up.

I would also like to consider benchmarks before changing the core type used in http4s. It seems like this is heavier, but it probably doesn't matter in the context of an IO application? It's worth considering, given that this library is well established with one consumer that actually matters.

Definitely.

@armanbilge
Copy link
Member

armanbilge commented Jan 4, 2023

See also my comment in #232 (comment). While we can do this bin-compatibly, it feels like a technicality since so much would be changing. It may be worthwhile however.

It's worth considering, given that this library is well established with one consumer that actually matters.

I recently adopted it in Natchez as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants