Consider relaxing locale resolution for Intl.Segmenter
#895
Comments
Implementations return all locales supported by ICU4C, which seems like a reasonable thing to do, because there's at least some guarantee that segmentation works for these locales. Returning everything could give the false impression that any locale works here, including locales like Klingon ( |
One option would be to return an explicit |
Text processing utilities, including Segmenter and Collator, work based on scripts and properties more than locales. It doesn't make a whole lot of sense to ask a Segmenter or a Collator "what locales do you support", because they support all locales written in scripts that are encoded in Unicode. It's a known issue that Segmenter favors majority languages in scripts over minority languages written in the same script (such as Cantonese (
> Intl.DateTimeFormat.supportedLocalesOf(["yue", "zh"])
Array [ "yue", "zh" ]
> Intl.Segmenter.supportedLocalesOf(["yue", "zh"])
Array [ "yue", "zh" ]
It's not entirely clear to me why each component has its own list, especially since, as @anba notes, in practice they all just return the list of locales in ICU, even if they don't make sense for a particular component. If we were designing this from scratch, I feel like better behavior would be a single |
Additional context: unicode-org/icu4x#3284 The CLDR design group agreed earlier this year that |
2024-10-24 discussion: https://github.com/tc39/ecma402/blob/main/meetings/notes-2024-10-24.md#consider-relaxing-locale-resolution-for-intlsegmenter-895 We established a use case, which should help guide implementations.
CLDR issue with some more notes: https://unicode-org.atlassian.net/browse/CLDR-18187
Alternative name: Make %Intl.Segmenter%.[[AvailableLocales]] be the full set of all syntactically valid locales (with some caveats like canonicalization/extensions)
Rationale
While most Intl services do require a locale for correct behaviour at runtime, the Segmenter service is in a weird position: it supports almost any locale you throw at it, and the provided locale is only used as a hint for segmenting certain special cases.
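To illustrate the point about the locale being mostly a hint, here is a minimal sketch that segments the same string with a well-known locale and with Klingon ("tlh"). The helper name `wordsOf` is hypothetical and exists only for this example. The resolved locale printed on the last line depends on the implementation and the host's default locale, so no particular value is claimed for it.

```javascript
// Sketch: collect the word-like segments of `text` for a given locale.
// `wordsOf` is a hypothetical helper, not part of Intl.
function wordsOf(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const words = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    // Skip punctuation and whitespace segments.
    if (isWordLike) words.push(segment);
  }
  return words;
}

const sample = "Hello, world!";
console.log(wordsOf(sample, "en"));
console.log(wordsOf(sample, "tlh"));

// Implementation-dependent: which locale the engine actually resolved to.
console.log(new Intl.Segmenter("tlh").resolvedOptions().locale);
```

For plain Latin-script text like this, both calls should yield the same words, since the default segmentation rules do the work regardless of the requested locale.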
This apparently makes it difficult for libraries such as ICU4X to determine whether a locale is on their list of [[AvailableLocales]] or not; in ICU4X's case, only a handful of locales ("km", "lo", "my", "th") are "supported" in the sense that the data provider loads some amount of data for them. However, the remaining locales are very much "supported"; they just don't load locale-specific data at runtime. (asking for @sffc's help to add more context about this)
What then? Well, if virtually all locales are "supported" by Segmenter, why not just consider all (see alternative name) syntactically valid locales as supported locales for that service? This would mean making APIs such as Intl.Segmenter.supportedLocalesOf always return everything, which doesn't sound too bad for a service that is basically a low-level text processing utility.
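As a rough sketch of what "always return everything" could look like from user land, assuming the relaxed behaviour amounts to accepting every structurally valid tag and canonicalizing it: the function name `relaxedSupportedLocalesOf` is hypothetical and is not a proposed API. It only leans on the existing `Intl.getCanonicalLocales`, which throws a `RangeError` for tags that are not structurally valid.

```javascript
// Hypothetical user-land approximation of the relaxed behaviour:
// every structurally valid tag is reported as supported, in canonical form.
function relaxedSupportedLocalesOf(locales) {
  // Intl.getCanonicalLocales validates, canonicalizes, and de-duplicates.
  return Intl.getCanonicalLocales(locales);
}

// Logs the canonicalized tags: tlh, en-US, yue
console.log(relaxedSupportedLocalesOf(["tlh", "EN-us", "yue"]));

// Structurally invalid tags are still rejected.
try {
  relaxedSupportedLocalesOf(["en_US"]);
} catch (err) {
  console.log(err instanceof RangeError); // true
}
```

This leaves the extension and canonicalization caveats mentioned above open; it only shows that "all syntactically valid locales" is a well-defined set that existing machinery can already check.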