ICU collation parsing and generation problems #1089

mhosken · 2021-08-20T04:33:37Z

There are a number of issues with the collation rules in ICU syntax that it would be good to resolve. A short example might help. Here is the first line of a simple sort order specification: a/A aa á/Á, and the start of the ICU-style collation tailoring generated from it: [before 1] [first regular] < a\/A << aa << á\/Á.

Looking at how ICU parses rule strings, it distinguishes between strings and syntactic elements. Thus < is a syntactic element, as is /. An unescaped a/A is therefore parsed as 3 elements, a, / and A, which is an expansion that effectively says: sort a after the previous element with an A appended. On the other hand, if the / is escaped, as in a\/A (as per the generated LDML), the / is treated as part of the string and the whole thing is parsed as the single string a/A, which is not what is wanted either. The correct way to interpret / in the simple ordering is as a 3rd-level (tertiary) difference, so a/A should convert to a <<< A.
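As a rough illustration of the conversion described above, here is a minimal Python sketch. It is not the library's code; the function name `simple_line_to_icu` is hypothetical. It assumes that each line of the simple order is one primary group, that spaces separate secondary differences (as the generated `<<` suggests) and that / separates tertiary differences. It also assumes the elements contain no ICU syntax characters, so no escaping is done here.

```python
def simple_line_to_icu(line: str) -> str:
    """Convert one line of a simple sort order into an ICU rule fragment.

    Assumptions (not confirmed by the library): whitespace separates
    secondary differences and '/' separates tertiary differences.
    """
    secondary_groups = []
    for group in line.split():
        # '/' is syntax in the simple format, so it becomes the ICU
        # tertiary operator rather than an escaped literal.
        secondary_groups.append(" <<< ".join(group.split("/")))
    # The line as a whole is a primary difference from what precedes it.
    return "< " + " << ".join(secondary_groups)


print(simple_line_to_icu("a/A aa á/Á"))
```

With the example line from this issue, the sketch produces `< a <<< A << aa << á <<< Á` rather than the escaped `< a\/A << aa << á\/Á`.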

In general, this means that:

syntactic parts of the collation rule should not be escaped
syntactic elements that are part of collation element strings should be escaped

I think this means you can't just run the whole collation rule through a general escaper/unescaper. Instead, the escaping needs to be inserted when the collation rule is generated from the simple rules, i.e. the ICU generator produces syntactically correct ICU tailoring from the get-go, and that just gets copied into the LDML inside a CDATA section. No extra escaping is needed beyond what ICU wants to see.
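To make the "escape at generation time" idea concrete, here is a small Python sketch. It is only an illustration under stated assumptions, not the SIL.WritingSystems implementation. The names `escape_icu_element` and `generate_icu_line` are hypothetical, the input is assumed to be already parsed into secondary groups of tertiary elements, and escaping every ASCII non-alphanumeric character with a backslash is assumed to be a conservative choice that the ICU rule parser accepts.

```python
def escape_icu_element(text: str) -> str:
    """Backslash-escape ASCII characters that ICU may treat as rule syntax.

    Assumption: escaping every ASCII non-alphanumeric character is safe,
    if sometimes more than strictly necessary.
    """
    out = []
    for ch in text:
        if ch.isascii() and not ch.isalnum():
            out.append("\\" + ch)
        else:
            out.append(ch)
    return "".join(out)


def generate_icu_line(secondary_groups: list[list[str]]) -> str:
    """Build one primary-level ICU rule fragment.

    Each inner list holds elements that differ only at the tertiary level.
    Only the element strings are escaped; the operators are emitted as-is.
    """
    parts = [
        " <<< ".join(escape_icu_element(e) for e in group)
        for group in secondary_groups
    ]
    return "< " + " << ".join(parts)


# 'x/y' here is a collation element that really contains a slash.
print(generate_icu_line([["a", "A"], ["x/y"]]))
```

The point of the design is that the generator knows which characters are operators and which are data, so the operators stay bare and only a literal / inside an element comes out as `\/`.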

And just to rub it in: the current LDML collation rules are, therefore, junky and cannot be used by any other tools.
For example, when I read in LDML from DBL bundles, I dump the ICU collation and regenerate it (complete with minimisation) from the simple order. I notice that SIL.WritingSystems does the same in ignoring the ICU tailoring, which could explain why the generated ICU rules aren't getting any testing?


ermshiperete · 2021-08-20T09:20:31Z

@mhosken Can you add a pointer to the file that has the problem? That would help someone not so familiar with how everything works...

mhosken · 2021-08-20T15:27:36Z

SIL.WritingSystem/LdmlCollationParser.cs. The output is a simple copy of the data directly, but the parser does some transformation of the tailoring string. I wonder if it should be the other way around and the generation from Simple to ICU would do all the escaping. The mapping from LDML to ICU is 1:1 with no transformation needed.

ermshiperete added the bug label Aug 20, 2021
