Some non-ascii unicode chars are not case-folded correctly. #31

Open
fisx opened this issue Sep 24, 2021 · 3 comments

fisx commented Sep 24, 2021

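{-# LANGUAGE OverloadedStrings #-}
-- OverloadedStrings lets the "\43906" literal below be read as a CI String.
import qualified Data.CaseInsensitive as CI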
import qualified Data.Char as Char

main :: IO ()
main = do
  print ((Char.toLower <$> ("\5042" :: String)) == "\43906")
  print ((CI.foldCase (CI.mk ("\5042" :: String))) == "\43906")

{-
*Main> :main
True
False
-}

Thanks to QuickCheck! :)

fisx commented Sep 24, 2021

Oh, interesting: there is Data.Text.toLower, and Data.Text.toCaseFold. But neither is compatible with CI:

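{-# LANGUAGE OverloadedStrings #-}
-- OverloadedStrings is needed for the Text and CI String literals below.
import qualified Data.CaseInsensitive as CI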
import qualified Data.Text as Text
import Prelude

main :: IO ()
main = do
  print (Text.toCaseFold "\5042" == "\43906")
  print ((CI.foldCase (CI.mk ("\5042" :: String))) == "\43906")

{-
*Main> :main
True
False
-}

@pcapriotti

I think this is actually a bug in Text.toCaseFold. Under Unicode case folding, Cherokee lowercase letters (e.g. U+AB82) fold to their uppercase counterparts (e.g. U+13B2). text implements this incorrectly: the fallback case of foldMapping in https://github.com/haskell/text/blob/master/src/Data/Text/Internal/Fusion/CaseMapping.hs converts every character to lowercase. The result is the strange (and incorrect!) behaviour that U+13B2 and U+AB82 map to each other when folding. See haskell/text#277.
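
To make the expected direction concrete, here is a rough sketch of a per-character fold that special-cases Cherokee before falling back to lowercasing. The name foldCherokeeAware is hypothetical and is not part of text or case-insensitive. The code point ranges are taken from the Cherokee entries in Unicode's CaseFolding.txt as I understand them, and the toLower fallback is only a stand-in, not full case folding.

```haskell
import Data.Char (chr, ord, toLower)

-- Hypothetical helper for illustration only; not the API of text or
-- case-insensitive. Cherokee folds towards the uppercase block, so those
-- ranges are handled before the lowercase fallback.
foldCherokeeAware :: Char -> Char
foldCherokeeAware c
  | n >= 0x13A0 && n <= 0x13F5 = c                          -- uppercase Cherokee is already folded
  | n >= 0xAB70 && n <= 0xABBF = chr (n - 0xAB70 + 0x13A0)  -- small letters U+AB70..U+ABBF
  | n >= 0x13F8 && n <= 0x13FD = chr (n - 8)                -- small letters U+13F8..U+13FD
  | otherwise                  = toLower c                  -- assumption: stand-in for real folding
  where
    n = ord c

-- Every Cherokee letter covered by the guards above.
cherokeeLetters :: String
cherokeeLetters =
  ['\x13A0' .. '\x13F5'] ++ ['\x13F8' .. '\x13FD'] ++ ['\xAB70' .. '\xABBF']

main :: IO ()
main = do
  -- U+AB82 should fold to U+13B2, and U+13B2 should stay put.
  print (foldCherokeeAware '\xAB82' == '\x13B2')
  print (foldCherokeeAware '\x13B2' == '\x13B2')
  -- Folding twice must give the same result as folding once.
  print (all (\c -> foldCherokeeAware (foldCherokeeAware c) == foldCherokeeAware c) cherokeeLetters)
```

With the Cherokee guards placed ahead of the fallback, the two code points no longer map to each other: both end up at U+13B2.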

fisx commented Sep 27, 2021

It gets weirder:

*Wire.API.User.RichInfo Scim Data.CaseInsensitive> foldCase ("Ꮊ" :: String)
"\43914"
*Wire.API.User.RichInfo Scim Data.CaseInsensitive> foldCase it
"\5050"
*Wire.API.User.RichInfo Scim Data.CaseInsensitive> foldCase it
"\43914"
[...]
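
The session above shows that folding is not idempotent for this character. A minimal sketch of how one might check that property for a single code point follows. observedFold is a hypothetical toy that only models the two results seen in the session (\5050 and \43914 swapping); it is not the library's implementation, and isIdempotentAt is likewise an illustrative name.

```haskell
-- Toy model of the behaviour seen in the GHCi session: the two Cherokee
-- code points swap on every application. An assumption for illustration,
-- not the code in case-insensitive or text.
observedFold :: Char -> Char
observedFold '\5050'  = '\43914'
observedFold '\43914' = '\5050'
observedFold c        = c

-- A fold is idempotent at c when folding twice equals folding once.
isIdempotentAt :: (Char -> Char) -> Char -> Bool
isIdempotentAt f c = f (f c) == f c

main :: IO ()
main = do
  -- The first few iterates alternate between the two code points.
  print (map fromEnum (take 4 (iterate observedFold '\5050')))
  -- Idempotence fails for the swapping pair but holds for an unaffected character.
  print (isIdempotentAt observedFold '\5050')
  print (isIdempotentAt observedFold 'a')
```

The same predicate could be turned into a QuickCheck property over arbitrary characters, which is roughly how a report like this one tends to surface.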
