Remove languages_by_code Wtp class argument #393

Merged
merged 5 commits into from
Nov 7, 2023

Conversation

xxyzz
Collaborator

@xxyzz xxyzz commented Nov 1, 2023

@xxyzz
Collaborator Author

xxyzz commented Nov 7, 2023

Changes:

  • use the mediawiki-langcodes package to convert language names and codes.
  • remove the impractical --list-languages command argument: its output is too long. We also don't know all the languages in the dump file before extracting it, and even though some languages are extracted from Lua modules, they may not have word entries.
  • don't check language codes passed via --language, and don't convert language names to codes, to avoid ambiguity.
  • don't skip sections that have an unknown language title.

@kristian-clausal
Collaborator

Moving all the language name functionality to another package seems sensible.

There should be a check for the language code (and language name) passed to --language, at least so that the script warns if it doesn't recognize the language code or parameter.

@xxyzz
Collaborator Author

xxyzz commented Nov 7, 2023

The --language option should only accept language codes, as its help message says; some language names map to multiple codes.

@kristian-clausal
Collaborator

kristian-clausal commented Nov 7, 2023

Printing a warning if a language name has language code conflicts should be trivial. Forcing people to use language codes when names are perfectly fine 99% of the time would be suboptimal. Even when there is a language code conflict, the languages are probably going to be very closely related, so lumping them together in the output should be fine, and a warning will cover our asses in case that's not the desired effect.

EDIT: or we could exit with error if there is a language code conflict, which sounds the most correct.
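
A minimal sketch of the warn-or-exit behaviour proposed here. It is not wiktextract code: `NAME_TO_CODES`, `AmbiguousLanguageName`, and `resolve_language_name` are hypothetical names, and the lookup table is a hand-written stand-in for whatever the real name-to-code source would be. The `Kol` entry mirrors rows from the query output posted further down in this thread.

```python
import sys

# Hypothetical lookup table: language name -> every code that carries that name.
NAME_TO_CODES = {
    "English": {"en"},
    "Kol": {"aky", "biw", "ekl"},
}


class AmbiguousLanguageName(ValueError):
    """Raised when one language name maps to more than one language code."""


def resolve_language_name(name: str, strict: bool = True) -> set[str]:
    """Return the codes for `name`; raise or warn when the name is ambiguous."""
    codes = NAME_TO_CODES.get(name, set())
    if not codes:
        raise KeyError(f"unknown language name: {name!r}")
    if len(codes) > 1:
        message = f"language name {name!r} maps to several codes: {sorted(codes)}"
        if strict:
            # "Exit with error" variant: the caller can turn this into sys.exit().
            raise AmbiguousLanguageName(message)
        # "Warn and lump together" variant.
        print(f"warning: {message}", file=sys.stderr)
    return set(codes)
```

With `strict=False` the ambiguous name resolves to all of its codes and only a warning is printed; with the default `strict=True` the caller gets an exception it can report before exiting.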

@xxyzz
Collaborator Author

xxyzz commented Nov 7, 2023

I'm against making this rarely used argument more complicated. We can't tell whether a passed argument is a language code or a language name, because some language names are also valid language codes.

@kristian-clausal
Collaborator

We will not start forbidding the use of language names in --language.

@xxyzz
Collaborator Author

xxyzz commented Nov 7, 2023

But how could this feature be implemented? As I said, many language names are also language codes:

SELECT a.lang_name, a.lang_code, b.lang_name
FROM langcodes AS a JOIN langcodes AS b
WHERE a.lang_name = b.lang_code
  AND a.lang_name < b.lang_name
  AND a.lang_code < b.lang_code
  AND a.in_lang = b.in_lang
  AND a.in_lang = 'en'

English names:

+-----------+-----------+------------------------------+
| lang_name | lang_code | lang_name                    |
+-----------+-----------+------------------------------+
| Ari       | aac       | Arikara                      |
| Asa       | aas       | Asu                          |
| Asa       | aas       | Chasu                        |
| Asa       | aas       | South Pare                   |
| Asa       | aas       | Southern Pare                |
| Abo       | abb       | Abon                         |
| Abo       | abb       | Abɔ̃                          |
| Bo        | abb       | Tibetan                      |
| Bea       | abj       | Beaver                       |
| Bea       | abj       | Dane-zaa                     |
| Bea       | abj       | Danezaa                      |
| Bea       | abj       | Danezaa ZaageɁ               |
| Aer       | aeq       | Eastern Arrernte             |
| Ako       | ahk       | Akurio                       |
| Igo       | ahl       | Isebe                        |
| Ali       | aiy       | Amaimon                      |
| Ba        | akm       | Bashkir                      |
| Bo        | akm       | Tibetan                      |
| Kol       | aky       | Kol (New Guinea)             |
| Kol       | aky       | Kol (Papua New Guina)        |
| Gae       | anb       | Guarequena                   |
| Gae       | anb       | Warekena                     |
| Gay       | anb       | Gayo                         |
| Anu       | anl       | Anuak                        |
| Anu       | anl       | Anyua                        |
| Anu       | anl       | Anyuak                       |
| Anu       | anl       | Anywa                        |
| Ayo       | aou       | Ayoreo                       |
| Ayo       | aou       | Ayoré                        |
| Ayo       | aou       | Ayoweo                       |
| Ayo       | aou       | Moro                         |
| Ayo       | aou       | Morotoco                     |
| Ayo       | aou       | Pyeta Yovai                  |
| Asu       | asa       | Asurini                      |
| Asu       | asa       | Asuriní                      |
| Asu       | asa       | Asuriní do Tocantins         |
| Asu       | asa       | Asuriní of Tocantins         |
| Asu       | asa       | Tocantins Asurini            |
| Wao       | auc       | Wappo                        |
| Nye       | bcv       | Nyengo                       |
| Bit       | bgk       | Bitara                       |
| Bo        | bgl       | Tibetan                      |
| Kol       | biw       | Kol (New Guinea)             |
| Kol       | biw       | Kol (Papua New Guina)        |
| Lia       | bli       | West-Central Limba           |
| Wo        | bsc       | Wolof                        |
| Car       | caq       | Carib                        |
| Car       | caq       | Cariña                       |
| Car       | caq       | Galibi                       |
| Car       | caq       | Galibi Carib                 |
| Car       | caq       | Galibí                       |
| Car       | caq       | Kali'na                      |
| Car       | caq       | Kalihna                      |
| Car       | caq       | Kalinya                      |
| Car       | caq       | Maraworno                    |
| Car       | caq       | Marworno                     |
| Sak       | ckh       | Sake                         |
| Sak       | ckh       | Shake                        |
| Maa       | cma       | San Jerónimo Tecóatl Mazatec |
| Mro       | cmr       | Mru                          |
| Lai       | cnh       | Lambya                       |
| Con       | cno       | Kofan                        |
| Con       | cno       | Kofane                       |
| Con       | cno       | Macu                         |
| Con       | cno       | Maku                         |
| Rai       | dhw       | Ramoaaina                    |
| Rai       | dhw       | Ramoaina                     |
| Rai       | dhw       | Ramuaaina                    |
| Rai       | dhw       | Ramuaina                     |
| Kol       | ekl       | Kol (New Guinea)             |
| Kol       | ekl       | Kol (Papua New Guina)        |
| Ora       | ema       | Oroha                        |
| Sie       | erg       | Simaa                        |
| Hor       | ero       | Horo                         |
| Rgu       | ero       | Ringgou                      |
| Ko        | fuj       | Korean                       |
| Ko        | fuj       | Modern Korean                |
| Kag       | gel       | Kajaman                      |
| Hop       | hob       | Hopi                         |
| Hop       | hob       | Moqui                        |
| Iko       | iki       | Olulumo-Ikom                 |
| Yaa       | iyx       | Yaminahua                    |
| Yaa       | iyx       | Yaminawa                     |
| Yei       | jei       | Yeni                         |
| Tol       | jic       | Tolowa                       |
| Tem       | kdh       | Themne                       |
| Tem       | kdh       | Timne                        |
| Lue       | khb       | Luvale                       |
| Kim       | kia       | Tofa                         |
| Kim       | kia       | Tofalar                      |
| Koi       | kkt       | Komi-Permyak                 |
| Rai       | lew       | Ramoaaina                    |
| Rai       | lew       | Ramoaina                     |
| Rai       | lew       | Ramuaaina                    |
| Rai       | lew       | Ramuaina                     |
| Taa       | lew       | Tanana                       |
| Lou       | loj       | Louisiana Creole             |
| Lou       | loj       | Louisiana Creole French      |
| Mor       | mhz       | Moro                         |
| Mor       | moq       | Moro                         |
| Sar       | mwm       | Sarabeca                     |
| Sar       | mwm       | Sarave                       |
| Sar       | mwm       | Saraveca                     |
| Sar       | mwm       | Saraveka                     |
| Taa       | nmn       | Tanana                       |
| Xoo       | nmn       | Xucuru                       |
| Xoo       | nmn       | Xucurú                       |
| Xoo       | nmn       | Xukuru                       |
| Xoo       | nmn       | Xukurú                       |
| Nuk       | noc       | Nuu-chah-nulth               |
| Nuk       | noc       | Nuuchahnulth                 |
| Nuk       | noc       | T'aat'aaqsapa                |
| Yom       | pil       | Yombe                        |
| Pom       | pmo       | Southeastern Pomo            |
| Uma       | ppk       | Umatilla                     |
| Sec       | sai-sec   | Sechelt                      |
| Sek       | sai-sec   | Sekani                       |
| Sek       | sai-sec   | Tsek'ene                     |
| Sek       | sai-tal   | Sekani                       |
| Sek       | sai-tal   | Tsek'ene                     |
| Yao       | sai-yao   | Yao (Africa)                 |
| Sha       | scw       | Shall-Zwall                  |
| Sok       | skk       | Sokoro                       |
| Ura       | ula       | Urarina                      |
| Uru       | ure       | Urumi                        |
| Woi       | wbw       | Woisika                      |
| Wom       | wmo       | Wom (Nigeria)                |
+-----------+-----------+------------------------------+
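
To make the collision concrete, here is a small self-contained sketch that runs the same query against an in-memory SQLite table. The schema is an assumption (in particular the `COLLATE NOCASE` columns, which let the name "Ari" compare equal to the code "ari"); the real database layout may differ. The three sample rows are invented for illustration, with the Ari/Arikara pair inferred from the first row of the output above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: case-insensitive columns so a name can match a code.
conn.execute(
    "CREATE TABLE langcodes ("
    "lang_code TEXT COLLATE NOCASE, "
    "lang_name TEXT COLLATE NOCASE, "
    "in_lang TEXT)"
)
conn.executemany(
    "INSERT INTO langcodes VALUES (?, ?, ?)",
    [
        ("aac", "Ari", "en"),      # language whose English name is "Ari"
        ("ari", "Arikara", "en"),  # different language whose code is "ari"
        ("en", "English", "en"),   # no collision
    ],
)

QUERY = """
SELECT a.lang_name, a.lang_code, b.lang_name
FROM langcodes AS a JOIN langcodes AS b
WHERE a.lang_name = b.lang_code
  AND a.lang_name < b.lang_name
  AND a.lang_code < b.lang_code
  AND a.in_lang = b.in_lang
  AND a.in_lang = 'en'
"""

rows = conn.execute(QUERY).fetchall()
for name, code, other_name in rows:
    print(f"{name!r} names code {code!r} but is also the code of {other_name!r}")
```

An argument such as `Ari` therefore cannot be classified as a name or a code without extra information from the user.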

@kristian-clausal
Collaborator

When there is ambiguity, we can just exit with error.

@kristian-clausal
Collaborator

Actually, I see what you mean. Then we might just have to separate --language into --language-name and --language-code.
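
A rough sketch of what that split could look like with `argparse`. This only illustrates the suggestion: the two flags are not part of the merged change, and making them repeatable via `action="append"` is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with separate, unambiguous options for codes and names."""
    parser = argparse.ArgumentParser(description="hypothetical option split")
    parser.add_argument(
        "--language-code",
        action="append",
        default=[],
        help="language code to capture (may be repeated)",
    )
    parser.add_argument(
        "--language-name",
        action="append",
        default=[],
        help="language name to capture (may be repeated)",
    )
    return parser


# "Ari" is now unambiguously a name and "en" unambiguously a code.
args = build_parser().parse_args(
    ["--language-code", "en", "--language-name", "Ari"]
)
print(args.language_code, args.language_name)
```

Because each value arrives through its own option, the parser never has to guess whether a token is a name or a code.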

@xxyzz
Collaborator Author

xxyzz commented Nov 7, 2023

I changed the default capture code from en to the dump file language code.

Comment on lines -228 to 229
-    if wxr.config.extract_thesaurus_pages:
+    if wxr.config.dump_file_lang_code == "en":
         emit_words_in_thesaurus(wxr, emitted, out_f, human_readable)
Collaborator Author


emit_words_in_thesaurus() is only used for the English Wiktionary, because the other extractors now use the lang_name property while the English extractor still uses lang.

if not args.language:
args.language = ["en", "mul"]
# Default to dump file language and Translingual if not specified.
capture_lang_codes = set()
Collaborator Author


WiktionaryConfig.capture_lang_codes is changed to a set, because we only check whether a code is in it, and a set also removes duplicated codes.
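
A small sketch of that default-and-deduplicate behaviour. `build_capture_lang_codes` is a hypothetical helper written for illustration, not the function in the diff; it assumes the requested codes arrive as a list or `None`.

```python
def build_capture_lang_codes(requested, dump_file_lang_code):
    """Return the set of language codes to capture.

    Falls back to the dump file language plus Translingual ("mul")
    when no codes were requested.
    """
    if not requested:
        requested = [dump_file_lang_code, "mul"]
    # A set drops duplicates and gives constant-time membership checks.
    return set(requested)


codes = build_capture_lang_codes(["en", "en", "mul"], "en")
print("en" in codes, "de" in codes, len(codes))
```

The only operation the extractor needs afterwards is `code in capture_lang_codes`, which is why a set fits better than a list here.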

@kristian-clausal
Collaborator

Thank you, this seems good.

@xxyzz xxyzz merged commit a3665b8 into tatuylonen:master Nov 7, 2023
5 checks passed
@xxyzz xxyzz deleted the icu branch November 7, 2023 14:36
xxyzz added a commit to xxyzz/wiktextract that referenced this pull request Dec 1, 2023