unclear/ambiguous language code notation #68

Open
@michaelkubina

While working on a mapping from bibliographic language codes (ISO 639-2/B, as required by the RDA application guidelines of the German National Library, https://wiki.dnb.de/download/attachments/127172808/Kapitel_6.pdf?version=2&modificationDate=1505213938000&api=v2) to the corresponding language models (presumably coded as ISO 639-2/T?), I came across three language codes for which I could not find a match:

  • tesseract_best/frk.traineddata
  • tesseract_best/kmr.traineddata
  • tesseract_best/osd.traineddata
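To illustrate the mapping step described above: ISO 639-2 has 20 languages whose bibliographic (B) and terminology (T) codes differ, so a B code can be normalized to its T form with a small lookup table before it is compared against the model names. This is only a minimal sketch; the function name `to_terminology` is hypothetical and not part of any existing tool.

```python
# ISO 639-2 codes whose bibliographic (B) form differs from the terminology (T) form.
B_TO_T = {
    "alb": "sqi", "arm": "hye", "baq": "eus", "bur": "mya", "chi": "zho",
    "cze": "ces", "dut": "nld", "fre": "fra", "geo": "kat", "ger": "deu",
    "gre": "ell", "ice": "isl", "mac": "mkd", "mao": "mri", "may": "msa",
    "per": "fas", "rum": "ron", "slo": "slk", "tib": "bod", "wel": "cym",
}


def to_terminology(code: str) -> str:
    """Return the ISO 639-2/T code for a B code; other codes pass through unchanged."""
    code = code.lower()
    return B_TO_T.get(code, code)


print(to_terminology("ger"))  # deu
print(to_terminology("eng"))  # eng
```

Codes such as frk, kmr and osd pass through unchanged, which is exactly why they showed up as unmatched in my mapping.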

I therefore suspected that the language models are coded according to ISO 639-3 (https://iso639-3.sil.org/code_tables/639/data) and found matches for frk (= Frankish) and kmr (= Northern Kurdish), but still none for osd. Finally, I consulted the documentation (https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html) again and saw that kmr (= Northern Kurdish) is indeed the intended language code, while frk actually means "Frakturschrift", i.e. Fraktur script (is that correct?), and osd stands for the "Orientation & Script Detection Module".

Since these two are not languages in the strict sense, I think these two models should be named so that they can be clearly distinguished from the three-letter language codes. It would also be good to move them into a separate directory instead of keeping them alongside the languages, so that no further misunderstandings arise. This is already done for the font types and should be handled the same way here.

I am aware that some languages have multiple models that need to exist side by side, distinguished by a suffix:

  • _vert (vertical text flow)
  • _latn / _cyrl (different script)
  • _old (older form of the language)
  • _sim (simplified writing system)
  • _tra (traditional writing system)
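The suffix convention listed above can be sketched as a small parser that separates the language stem from its suffixes. This is only an illustration under assumptions: the helper name `split_model_name` is hypothetical, the suffix set is just the one listed above, and the file names in the usage lines are examples.

```python
from pathlib import PurePosixPath

# Suffixes taken from the list above; the real set may be larger.
KNOWN_SUFFIXES = {"vert", "latn", "cyrl", "old", "sim", "tra"}
EXTENSION = ".traineddata"


def split_model_name(path: str) -> tuple[str, list[str]]:
    """Split e.g. 'dir/chi_sim_vert.traineddata' into ('chi', ['sim', 'vert'])."""
    name = PurePosixPath(path).name
    if name.endswith(EXTENSION):
        name = name[: -len(EXTENSION)]
    stem, *suffixes = name.split("_")
    unknown = [s for s in suffixes if s not in KNOWN_SUFFIXES]
    if unknown:
        raise ValueError(f"unknown suffix(es) in {path!r}: {unknown}")
    return stem, suffixes


print(split_model_name("tesseract_best/kmr.traineddata"))  # ('kmr', [])
print(split_model_name("chi_sim_vert.traineddata"))        # ('chi', ['sim', 'vert'])
```

With the stem isolated like this, only the stem has to be checked against a code table, which is why a clear statement about the standard used for the stem matters.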

In such cases it is clear that a notation reduced to three letters is not possible and that all of these models are needed as they are. Nevertheless, at least the stem of the language code should be unambiguously assignable. A clear statement of whether the stem follows ISO 639-2/T or ISO 639-3 would be helpful.

Since kmr (= Northern Kurdish) does not appear in ISO 639-2, which only has kur (= Kurdish), one would conclude, as I do, that ISO 639-3 is authoritative. In that case, however, one might also mistakenly expect the macrolanguage concept of ISO 639-3 to apply: ara (= Arabic) could then be understood as "includes Standard Arabic and Egyptian Arabic", or nor (= Norwegian) as "includes Nynorsk and Bokmål". However, as far as I can see, no language models are available for this.

For macrolanguages, however, the current language codes would have to be reviewed completely, and "language sets" would have to be created, which do not currently exist in this form. I think this could be a stimulating discussion point and possibly an interesting feature in the future, but it goes too far at this point. Also, ISO 639-3, with currently 7910 language codes, is far larger than ISO 639-2, with about 460 language codes.
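To make the "language set" idea a bit more concrete, here is a rough sketch of how a macrolanguage code could be expanded to whichever member models happen to be installed. The member table is a deliberately partial excerpt of the ISO 639-3 macrolanguage mappings, and both the function name `language_set` and the set of available models are hypothetical.

```python
# Partial excerpt of ISO 639-3 macrolanguage mappings (ara has many more members).
MACROLANGUAGE_MEMBERS = {
    "ara": ["arb", "arz"],         # Standard Arabic, Egyptian Arabic
    "kur": ["ckb", "kmr", "sdh"],  # Central, Northern, Southern Kurdish
    "nor": ["nno", "nob"],         # Nynorsk, Bokmål
}


def language_set(code: str, available_models: set[str]) -> list[str]:
    """Return the models that could serve `code`: the code itself or its members."""
    candidates = [code] + MACROLANGUAGE_MEMBERS.get(code, [])
    return [c for c in candidates if c in available_models]


# Hypothetical set of installed model stems, for illustration only.
available = {"ara", "kmr", "nor"}
print(language_set("kur", available))  # ['kmr']
print(language_set("nor", available))  # ['nor']
```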

Thank you,
Michael Kubina