Remove `languages_by_code` `Wtp` class argument #393

xxyzz · 2023-11-01T09:40:42Z

Require tatuylonen/wikitextprocessor#134

use the mediawiki-langcodes package to convert language names and codes

xxyzz · 2023-11-07T02:50:26Z

Changes:

Use the mediawiki-langcodes package to convert language names and codes.
Remove the impractical --list-languages command argument: its output is too long. We also don't know all the languages in the dump file before extracting it, and although some languages are extracted from Lua modules, they may not have word entries.
Don't check language codes passed to --language, and don't convert language names to codes, to avoid ambiguity.
Don't skip sections that have an unknown language title.

kristian-clausal · 2023-11-07T06:00:07Z

Moving all the language name functionality to another package seems sensible.

There should be a check for the language code (and language name) passed to --language, at least so that the script warns if it doesn't recognize the code or name.
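A minimal sketch of such a warning, assuming a hypothetical in-memory set of known codes (`KNOWN_CODES`) standing in for the real langcodes db lookup; the function name is also made up for illustration:

```python
import warnings

# Hypothetical stand-in for the langcodes db; not the real data source.
KNOWN_CODES = {"en", "fi", "mul", "zh"}


def check_language_codes(codes):
    """Return the codes unchanged, warning about any that are not recognized."""
    for code in codes:
        if code not in KNOWN_CODES:
            warnings.warn(f"Unrecognized language code: {code!r}")
    return list(codes)
```

Unknown codes are kept rather than dropped, so the check only informs the user and never changes what gets captured.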

xxyzz · 2023-11-07T06:19:57Z

The --language option should only allow language codes, as its help message says; some language names map to multiple codes.

kristian-clausal · 2023-11-07T07:06:43Z

Printing a warning if a language name has language code conflicts should be trivial. Forcing people to use language codes when names are perfectly fine 99% of the time would be suboptimal. Even when there is a language code conflict, the languages are probably going to be very closely related, so lumping them together in the output should be fine, and a warning will cover our asses in case that's not the desired effect.

EDIT: or we could exit with error if there is a language code conflict, which sounds the most correct.
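The "exit with error on conflict" variant could look roughly like this. It is only a sketch: the name-to-codes table and the function name are invented for illustration, and a CLI would turn the exception into an error exit.

```python
# Toy name -> codes mapping; the entries are invented for illustration only.
NAME_TO_CODES = {
    "Examplish": {"exa"},
    "Ambiguan": {"amb", "amg"},
}


def resolve_language_name(name):
    """Return the single code for `name`, or raise if unknown or ambiguous."""
    codes = NAME_TO_CODES.get(name, set())
    if not codes:
        raise ValueError(f"Unknown language name: {name!r}")
    if len(codes) > 1:
        # More than one code shares this name: refuse to guess.
        raise ValueError(f"Language name {name!r} is ambiguous: {sorted(codes)}")
    return next(iter(codes))
```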

xxyzz · 2023-11-07T07:20:26Z

I'm against making this rarely used argument more complicated. We can't tell whether a passed argument is a language code or a language name; some language names are also valid language codes.

kristian-clausal · 2023-11-07T07:23:26Z

We will not start forbidding the use of language names in --language.

xxyzz · 2023-11-07T07:31:36Z

But how could this feature be implemented? As I said, many language names are also language codes:

SELECT a.lang_name, a.lang_code, b.lang_name
FROM langcodes AS a
JOIN langcodes AS b
WHERE a.lang_name = b.lang_code
  AND a.lang_name < b.lang_name
  AND a.lang_code < b.lang_code
  AND a.in_lang = b.in_lang
  AND a.in_lang = 'en'

English names:


+-----------+-----------+------------------------------+
| lang_name | lang_code |          lang_name           |
+-----------+-----------+------------------------------+
| Ari       | aac       | Arikara                      |
| Asa       | aas       | Asu                          |
| Asa       | aas       | Chasu                        |
| Asa       | aas       | South Pare                   |
| Asa       | aas       | Southern Pare                |
| Abo       | abb       | Abon                         |
| Abo       | abb       | Abɔ̃                         |
| Bo        | abb       | Tibetan                      |
| Bea       | abj       | Beaver                       |
| Bea       | abj       | Dane-zaa                     |
| Bea       | abj       | Danezaa                      |
| Bea       | abj       | Danezaa ZaageɁ               |
| Aer       | aeq       | Eastern Arrernte             |
| Ako       | ahk       | Akurio                       |
| Igo       | ahl       | Isebe                        |
| Ali       | aiy       | Amaimon                      |
| Ba        | akm       | Bashkir                      |
| Bo        | akm       | Tibetan                      |
| Kol       | aky       | Kol (New Guinea)             |
| Kol       | aky       | Kol (Papua New Guina)        |
| Gae       | anb       | Guarequena                   |
| Gae       | anb       | Warekena                     |
| Gay       | anb       | Gayo                         |
| Anu       | anl       | Anuak                        |
| Anu       | anl       | Anyua                        |
| Anu       | anl       | Anyuak                       |
| Anu       | anl       | Anywa                        |
| Ayo       | aou       | Ayoreo                       |
| Ayo       | aou       | Ayoré                        |
| Ayo       | aou       | Ayoweo                       |
| Ayo       | aou       | Moro                         |
| Ayo       | aou       | Morotoco                     |
| Ayo       | aou       | Pyeta Yovai                  |
| Asu       | asa       | Asurini                      |
| Asu       | asa       | Asuriní                      |
| Asu       | asa       | Asuriní do Tocantins         |
| Asu       | asa       | Asuriní of Tocantins         |
| Asu       | asa       | Tocantins Asurini            |
| Wao       | auc       | Wappo                        |
| Nye       | bcv       | Nyengo                       |
| Bit       | bgk       | Bitara                       |
| Bo        | bgl       | Tibetan                      |
| Kol       | biw       | Kol (New Guinea)             |
| Kol       | biw       | Kol (Papua New Guina)        |
| Lia       | bli       | West-Central Limba           |
| Wo        | bsc       | Wolof                        |
| Car       | caq       | Carib                        |
| Car       | caq       | Cariña                       |
| Car       | caq       | Galibi                       |
| Car       | caq       | Galibi Carib                 |
| Car       | caq       | Galibí                       |
| Car       | caq       | Kali'na                      |
| Car       | caq       | Kalihna                      |
| Car       | caq       | Kalinya                      |
| Car       | caq       | Maraworno                    |
| Car       | caq       | Marworno                     |
| Sak       | ckh       | Sake                         |
| Sak       | ckh       | Shake                        |
| Maa       | cma       | San Jerónimo Tecóatl Mazatec |
| Mro       | cmr       | Mru                          |
| Lai       | cnh       | Lambya                       |
| Con       | cno       | Kofan                        |
| Con       | cno       | Kofane                       |
| Con       | cno       | Macu                         |
| Con       | cno       | Maku                         |
| Rai       | dhw       | Ramoaaina                    |
| Rai       | dhw       | Ramoaina                     |
| Rai       | dhw       | Ramuaaina                    |
| Rai       | dhw       | Ramuaina                     |
| Kol       | ekl       | Kol (New Guinea)             |
| Kol       | ekl       | Kol (Papua New Guina)        |
| Ora       | ema       | Oroha                        |
| Sie       | erg       | Simaa                        |
| Hor       | ero       | Horo                         |
| Rgu       | ero       | Ringgou                      |
| Ko        | fuj       | Korean                       |
| Ko        | fuj       | Modern Korean                |
| Kag       | gel       | Kajaman                      |
| Hop       | hob       | Hopi                         |
| Hop       | hob       | Moqui                        |
| Iko       | iki       | Olulumo-Ikom                 |
| Yaa       | iyx       | Yaminahua                    |
| Yaa       | iyx       | Yaminawa                     |
| Yei       | jei       | Yeni                         |
| Tol       | jic       | Tolowa                       |
| Tem       | kdh       | Themne                       |
| Tem       | kdh       | Timne                        |
| Lue       | khb       | Luvale                       |
| Kim       | kia       | Tofa                         |
| Kim       | kia       | Tofalar                      |
| Koi       | kkt       | Komi-Permyak                 |
| Rai       | lew       | Ramoaaina                    |
| Rai       | lew       | Ramoaina                     |
| Rai       | lew       | Ramuaaina                    |
| Rai       | lew       | Ramuaina                     |
| Taa       | lew       | Tanana                       |
| Lou       | loj       | Louisiana Creole             |
| Lou       | loj       | Louisiana Creole French      |
| Mor       | mhz       | Moro                         |
| Mor       | moq       | Moro                         |
| Sar       | mwm       | Sarabeca                     |
| Sar       | mwm       | Sarave                       |
| Sar       | mwm       | Saraveca                     |
| Sar       | mwm       | Saraveka                     |
| Taa       | nmn       | Tanana                       |
| Xoo       | nmn       | Xucuru                       |
| Xoo       | nmn       | Xucurú                       |
| Xoo       | nmn       | Xukuru                       |
| Xoo       | nmn       | Xukurú                       |
| Nuk       | noc       | Nuu-chah-nulth               |
| Nuk       | noc       | Nuuchahnulth                 |
| Nuk       | noc       | T'aat'aaqsapa                |
| Yom       | pil       | Yombe                        |
| Pom       | pmo       | Southeastern Pomo            |
| Uma       | ppk       | Umatilla                     |
| Sec       | sai-sec   | Sechelt                      |
| Sek       | sai-sec   | Sekani                       |
| Sek       | sai-sec   | Tsek'ene                     |
| Sek       | sai-tal   | Sekani                       |
| Sek       | sai-tal   | Tsek'ene                     |
| Yao       | sai-yao   | Yao (Africa)                 |
| Sha       | scw       | Shall-Zwall                  |
| Sok       | skk       | Sokoro                       |
| Ura       | ula       | Urarina                      |
| Uru       | ure       | Urumi                        |
| Woi       | wbw       | Woisika                      |
| Wom       | wmo       | Wom (Nigeria)                |
+-----------+-----------+------------------------------+
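For anyone who wants to reproduce the idea without the real database, here is a small self-contained sketch using an in-memory SQLite table. The column names come from the query above; the column types, the three sample rows, and the explicit `lower()` comparison are assumptions made for this example, and the query is a simplified variant rather than the one above.

```python
import sqlite3

# In-memory toy table; schema types and rows are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE langcodes (lang_code TEXT, lang_name TEXT, in_lang TEXT)"
)
conn.executemany(
    "INSERT INTO langcodes VALUES (?, ?, ?)",
    [
        ("aac", "Ari", "en"),
        ("ari", "Arikara", "en"),
        ("en", "English", "en"),
    ],
)


def names_that_are_codes(conn, in_lang="en"):
    """Find language names that, lowercased, equal some other language's code."""
    rows = conn.execute(
        """
        SELECT a.lang_name, a.lang_code, b.lang_name
        FROM langcodes AS a
        JOIN langcodes AS b ON lower(a.lang_name) = b.lang_code
        WHERE a.in_lang = ? AND b.in_lang = ?
        ORDER BY a.lang_code
        """,
        (in_lang, in_lang),
    )
    return rows.fetchall()
```

With the sample rows, the name "Ari" (code aac) collides with the code ari (Arikara), so a bare `ari` argument cannot be classified as a name or a code without extra information.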

kristian-clausal · 2023-11-07T07:47:49Z

When there is ambiguity, we can just exit with error.

kristian-clausal · 2023-11-07T07:52:18Z

Actually, I see what you mean. Then we might just have to separate --language into --language-name and --language-code.
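A rough argparse sketch of that split. The two option names match the later commit in this PR; the repeatable `append` behaviour, help texts, and function name are assumptions and not necessarily what the merged code does.

```python
import argparse


def build_parser():
    """Build a parser with separate options for language codes and names."""
    parser = argparse.ArgumentParser()
    # Each option may be given several times; values accumulate in a list.
    parser.add_argument(
        "--language-code",
        action="append",
        default=[],
        help="Language code to capture (can be repeated)",
    )
    parser.add_argument(
        "--language-name",
        action="append",
        default=[],
        help="Language name to capture (can be repeated)",
    )
    return parser
```

Because the user states explicitly which kind of value is being passed, a string such as `Ari` no longer needs to be guessed as a name or a code.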

xxyzz · 2023-11-07T08:40:19Z

I changed the default capture code from en to the dump file language code.

xxyzz · 2023-11-07T09:02:41Z

src/wiktextract/wiktionary.py

-    if wxr.config.extract_thesaurus_pages:
+    if wxr.config.dump_file_lang_code == "en":
        emit_words_in_thesaurus(wxr, emitted, out_f, human_readable)


emit_words_in_thesaurus() is now only called for the English Wiktionary, because the other extractors use the lang_name property while the English extractor uses lang.

xxyzz · 2023-11-07T09:05:38Z

src/wiktextract/wiktwords.py

-    if not args.language:
-        args.language = ["en", "mul"]
+    # Default to dump file language and Translingual if not specified.
+    capture_lang_codes = set()


WiktionaryConfig.capture_lang_codes is changed to a set, because we only check whether a code is in it, and a set also removes duplicate codes.
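The default described in the diff comment can be sketched like this. The helper name and its parameters are hypothetical; only the "dump file language plus Translingual (`mul`)" default and the use of a set come from the change itself.

```python
def build_capture_lang_codes(requested_codes, dump_file_lang_code):
    """Return the set of language codes to capture."""
    if requested_codes:
        # A set drops duplicate codes and gives fast membership checks.
        return set(requested_codes)
    # Default to the dump file language and Translingual ("mul").
    return {dump_file_lang_code, "mul"}
```

For a French dump with no codes requested this yields `{"fr", "mul"}`, and a later `code in capture_lang_codes` test does not depend on how many times a code was passed.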

kristian-clausal · 2023-11-07T09:44:30Z

Thank you, this seems good.


xxyzz added 3 commits November 6, 2023 17:28

Remove languages_by_code Wtp class argument

bfb4fda

Delete language JSON files

46f5213

Remove LANGUAGES_BY_NAME and LANGUAGES_BY_CODE

52dccc1

use the mediawiki-langcodes package to convert language names and codes

xxyzz force-pushed the icu branch from 3fb4d74 to 52dccc1 Compare November 7, 2023 02:37

Show warning if --language code can't be found in the langcodes db

7a765d8

xxyzz force-pushed the icu branch from ba47ff4 to 7a765d8 Compare November 7, 2023 06:30

Break --language option to --language-code and --language-name

6edbb3f

xxyzz merged commit a3665b8 into tatuylonen:master Nov 7, 2023
5 checks passed

xxyzz deleted the icu branch November 7, 2023 14:36

xxyzz added a commit to xxyzz/wiktextract that referenced this pull request Dec 1, 2023

Use language codes in HEAD_TAG_RE pattern

87b673e

This bug was introduced from pr tatuylonen#393, fixes tatuylonen#405.

xxyzz mentioned this pull request Dec 1, 2023

Use language codes in HEAD_TAG_RE pattern #409

Merged
