Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/text/language: Match does not work as expected for artificial languages #45749

Open
bep opened this issue Apr 24, 2021 · 10 comments
Open

x/text/language: Match does not work as expected for artificial languages #45749

bep opened this issue Apr 24, 2021 · 10 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@bep
Copy link
Contributor

bep commented Apr 24, 2021

golang.org/x/[email protected] golang.org/x/[email protected]

The following test failes with got art-x-klingon; want art-x-elvish.

func TestMatchArtificialLanguages(t *testing.T) {
	// See https://en.wikipedia.org/wiki/Codes_for_constructed_languages
	klingon, err := language.Parse("art-x-klingon")
	if err != nil {
		t.Fatal(err)
	}
	elvish, err := language.Parse("art-x-elvish")
	if err != nil {
		t.Fatal(err)
	}

	matcher := language.NewMatcher([]language.Tag{klingon, elvish})

	m, _, _ := matcher.Match(elvish)

	if m.String() != "art-x-elvish" {
		t.Errorf("got %s; want art-x-elvish", m.String())
	}
}

nicksnyder/go-i18n#252

@gopherbot gopherbot added this to the Unreleased milestone Apr 24, 2021
@bep bep closed this as completed Apr 24, 2021
@bep bep reopened this Apr 24, 2021
@nicksnyder
Copy link

@bep Does this diff change the results?

func TestMatchArtificialLanguages(t *testing.T) {
	// See https://en.wikipedia.org/wiki/Codes_for_constructed_languages
	klingon, err := language.Parse("art-x-klingon")
	if err != nil {
		t.Fatal(err)
	}
	elvish, err := language.Parse("art-x-elvish")
	if err != nil {
		t.Fatal(err)
	}

+	tags := []language.Tag{klingon, elvish}
-	matcher := language.NewMatcher([]language.Tag{klingon, elvish})
+	matcher := language.NewMatcher(tags)

-	m, _, _ := matcher.Match(elvish)
+	_, i, _ := matcher.Match(elvish)
+	m := tags[i]

	if m.String() != "art-x-elvish" {
		t.Errorf("got %s; want art-x-elvish", m.String())
	}
}

(I am curious if this is related to #24211 (comment))

@bep
Copy link
Contributor Author

bep commented Apr 26, 2021

@bep Does this diff change the results?

No. It sill returns the first index.

@cherrymui
Copy link
Member

cc @mpvl

@cherrymui cherrymui added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Apr 26, 2021
@ameowlia
Copy link
Contributor

ameowlia commented Nov 16, 2021

👋 hello, I'm no expert, but I've been poking around in /x/text recently, so I thought I would look into this. I think the root cause of this is that related to the private tags.

🏷️ Go is not sorting properly based on private tags.

Here is an example where non-artificial languages have the same issue.
https://play.golang.org/p/ZtyNz_CbSgO

	enA, _ := language.Parse("en-x-a")
	enB, _ := language.Parse("en-x-b")
	matcher := language.NewMatcher([]language.Tag{enA, enB})

	m, _, _ := matcher.Match(enB)
	if m.String() != "en-x-b" {
		fmt.Printf("got %s; want en-x-b\n", m.String())
	}

This returns: got en-x-a; want en-x-b

✏️ Go should sort based on private tags
According to rfc4647 "Matching of Language Tags"

In the lookup scheme, the language range is progressively truncated
from the end until a matching language tag is located. Single letter
or digit subtags (including both the letter 'x', which introduces
private-use sequences, and the subtags that introduce extensions) are
removed at the same time as their closest trailing subtag. For
example, starting with the range "zh-Hant-CN-x-private1-private2"
(Chinese, Traditional script, China, two private-use tags) the lookup
progressively searches for content as shown below:

Example of a Lookup Fallback Pattern

Range to match: zh-Hant-CN-x-private1-private2

  1. zh-Hant-CN-x-private1-private2
  2. zh-Hant-CN-x-private1
  3. zh-Hant-CN
  4. zh-Hant
  5. zh
  6. (default)

💻 The code looks like it inspects private tags... but it doesn't

  • Match eventually calls a function called VariantOrPrivateUseTags and compares the result of the two languages. I added some print lines and it only ever returned an empty string, even when my languages did have private tags.
  • I made a hacky fix locally so that it did return private use tags. But this didn't change the outcome. This is because for en-x-a and en-x-b they are initially given a confidence of "Exact" because they are the same language. Even when equalsRest returns false (with my hacky fix), it never lowers the confidence level based on private tags not matching.

✏️ I'm working on a fix to both return the private use tags properly and lower the confidence level when they do not match, which should fix this issue.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/364855 mentions this issue: language/match: match based on all subtags

@mpvl
Copy link
Contributor

mpvl commented Nov 18, 2021

The use of private use tags is more or less deprecated. The official tag for Klingon is tlh and for Elvish, if I’m not mistaken, qya. I can imagine preprocessors to normalize user input tags, but I don’t think this normalization should be applied to tags passed to the normalizer by the developer, for instance.

The marcher does not, nor does it aim to, implement RFC 4647. Instead, the algorithm is an enhancement of a matcher used internally at google, which is generally believed to give more reliable results.

I no longer work at Google, but I doubt the move away from -x support will have reverted.

@mpvl
Copy link
Contributor

mpvl commented Nov 18, 2021

Note, the compliance test set is in language/testdata. The CLDR test set is the golden standard.

The two are mostly the same, but there are some notable differences. The Go algorithm uses rule based matching, instead of the scoring algorithm used by Google's algorithm. This has the advantage that it

  1. preserves local information across matches (this is the whole point of these BCP 47 tags, so I wouldn't be surprised if other implementations are catching up to this),
  2. observes user preferences where a language is preferred interstitially to two dialects of another language (for instance, for supported languages en-GB, nl and a user preference of en, nl, en-GB it matches nl, not en-GB,
  3. is much easier to maintain and tweak, and
  4. is much faster (due to other performance improvements not related to the rule-based system per se).

Someone of the internationalization team in Google could verify if the Golden set is up to date. But I doubt support for -x has made its inroads, as the world is moving away from this as it was too problematic.

I would not, however, modify the algorithm based on isolated use cases. There are very specific reasons why it is the way it is with years of tweaking and testing based data-driven design.

@mpvl
Copy link
Contributor

mpvl commented Nov 18, 2021

As another note. The OP mentions the desire to support art-x-klingon, but this was never a valid language tag for Klingon. The (now deprecated) i-klingon predates the introduction of art. In turn, art was introduced on the same date as tlh (the modern canonical form of Klingon). The tag art only serves as a fallback for artificial languages, so art-x-klingon was never a valid tag, and a matcher relying on matching private use tags would miss this.

This already shows the problem of the use of private use tags: their usage is not stable and changes over time. This is one of the reasons why there was a move away from supporting private use tags.

Note that many artificial languages have their own language code. Here is a list:
https://en.wikipedia.org/wiki/Codes_for_constructed_languages

I can imagine supporting private use tags, though, if it does not interact with the rest of tag matching and if thorough consideration of the implications of canonicalization are given. But, as the use of -x is largely deprecated, the impact of considering them for matching regular languages will have to be tested against a large body of user data. My suspicion is that including this information in matches generally will make things worse, not better. This means that supporting matching on private use tags, should probably be limited to specifically supporting art-x-X and maybe tags starting with x-, if at all.

@bep
Copy link
Contributor Author

bep commented Nov 18, 2021

The official tag for Klingon is tlh and for Elvish, if I’m not mistaken,

Those were just two random examples. My original problem is that of handling of "unknown language tags", but you can also easily make the argument for unofficial artificial languages (like Foozzy, a new language i just made up).

@ameowlia
Copy link
Contributor

😮 Wow, thank you so much for all of this information @mpvl . It has been very hard to discover all of this. (Or maybe there are some docs I am missing?)

What I am taking from this is:

  • This is deprecated and not supported on purpose.
  • Therefore I will close my PR.
  • There might be a case for allowing this for art-x-X.
  • If I have the time I might reopen one just for the art-x-X case.
  • I will look for places to document this useful information you provided.

🙏 Thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

6 participants