ICU collation parsing and generation problems #1089

Open
mhosken opened this issue Aug 20, 2021 · 2 comments

@mhosken
Contributor

mhosken commented Aug 20, 2021

There are a number of issues with the collation rules in ICU syntax that it would be good to resolve. I think a short example might help. Here is the first line of a simple sort order specification: a/A aa á/Á, and here is the start of the ICU-style collation tailoring generated from it: [before 1] [first regular] < a\/A << aa << á\/Á.

Looking at how ICU parses rule strings, it distinguishes between strings and syntactic elements. Thus < is a syntactic element, as is /. So a/A is parsed as three elements, a, / and A, which is an expansion that effectively says: sort a after the previous element with an A appended. On the other hand, if the / is escaped, as in a\/A (as in the generated LDML), the / is treated as part of the string and the whole thing is parsed as the single string a/A, which is not what is wanted either. The correct way to interpret / in the simple ordering is as a tertiary (3rd level) difference, so a/A should convert to a <<< A.
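To make the intended conversion concrete, here is a minimal sketch in Python (not the library's actual C# code). The function name simple_line_to_icu is hypothetical, and the mapping of spaces to << and of the start of a line to < is an assumption read off the generated example above. Only the / to <<< mapping is the point being argued here.

```python
def simple_line_to_icu(line: str) -> str:
    """Convert one line of a simple sort order into an ICU tailoring fragment.

    Assumptions (hypothetical, inferred from the example in this issue):
      - the first group on a line is a primary difference (<)
      - later space-separated groups are secondary differences (<<)
      - items joined by / within a group are tertiary differences (<<<)
    No escaping of collation element strings is done in this sketch.
    """
    parts = []
    for index, group in enumerate(line.split()):
        parts.append("<" if index == 0 else "<<")
        # The / is syntax in the simple format, so it becomes an ICU
        # operator rather than being copied (escaped or not) into the string.
        parts.append(" <<< ".join(group.split("/")))
    return " ".join(parts)


print(simple_line_to_icu("a/A aa á/Á"))
```

With the example line from this issue, the fragment comes out as < a <<< A << aa << á <<< Á rather than < a\/A << aa << á\/Á.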

In general, this means that:

  • syntactic parts of the collation rule should not be escaped
  • syntactic characters that occur inside collation element strings should be escaped

I think this means you can't just run the whole collation rule through a general escaper/unescaper. Instead the escaping needs to be inserted when the collation rule is generated from the simple rules. I.e. the ICU generator produces syntactically correct ICU tailoring from the get go and that just gets copied into the LDML inside a CDATA section. No extra escaping is needed outside of what ICU wants to see.
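As a rough illustration of escaping at generation time, the following Python sketch escapes each collation element string on its own and only then joins the elements with unescaped operators. The helper names escape_element and join_tertiary are hypothetical. The choice of which characters to escape (every ASCII character that is not a letter or digit) is an assumption based on the ICU rule syntax reserving ASCII punctuation and symbols, and may be broader than strictly necessary.

```python
def escape_element(text: str) -> str:
    """Backslash-escape reserved characters inside one collation element.

    Assumption: every ASCII character that is not a letter or digit is
    treated as reserved, so it is escaped when it is meant literally.
    """
    out = []
    for ch in text:
        if ch.isascii() and not ch.isalnum():
            out.append("\\" + ch)
        else:
            out.append(ch)
    return "".join(out)


def join_tertiary(elements):
    """Join elements with the tertiary operator; only the elements are escaped."""
    return " <<< ".join(escape_element(e) for e in elements)


print(join_tertiary(["a", "A"]))   # operators stay as syntax
print(join_tertiary(["<", "/"]))   # literal < and / get escaped
```

Because the operators are added after escaping, a general escaper never sees them, and the result can be copied into the LDML CDATA section unchanged.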

And just to rub it in: the current LDML collation rules are therefore junk and cannot be used by any other tools.
For example, when I read in LDML from DBL bundles, I dump the ICU collation and regenerate it (complete with minimisation) from the simple order. I notice that SIL.WritingSystems does the same in ignoring the ICU tailoring, which could explain why the generated ICU rules aren't getting any testing?

@ermshiperete
Member

@mhosken Can you add a pointer to the file that has the problem? That would help someone not so familiar with how everything works...

@mhosken
Contributor Author

mhosken commented Aug 20, 2021

SIL.WritingSystem/LdmlCollationParser.cs. The output side simply copies the data as it is, but the parser does some transformation of the tailoring string. I wonder if it should be the other way around, with the generation from Simple to ICU doing all the escaping. The mapping from LDML to ICU is 1:1 with no transformation needed.
