Remove languages_by_code Wtp class argument #393
Use the mediawiki-langcodes package to convert language names and codes.
Changes:
|
Moving all the language name functionality to another package seems sensible. There should be a check for language code (and language name) for |
The |
Printing a warning if a language name has language code conflicts should be trivial. Forcing people to use language codes when names are perfectly fine 99% of the time would be suboptimal. Even when there is a language code conflict, the languages are probably going to be very closely related, so lumping them together in the output should be fine, and a warning will cover our asses in case that's not the desired effect. EDIT: or we could exit with an error if there is a language code conflict, which sounds the most correct. |
I'm against making this rarely used argument more complicated. We can't tell whether a passed argument is a language code or a language name, because some language names are also valid language codes. |
We will not start forbidding the use of language names in --language. |
But how could this feature be implemented? As I said, many language names are also language codes:
    SELECT a.lang_name, a.lang_code, b.lang_name
    FROM langcodes AS a JOIN langcodes AS b
    WHERE a.lang_name = b.lang_code
      AND a.lang_name < b.lang_name
      AND a.lang_code < b.lang_code
      AND a.in_lang = b.in_lang
      AND a.in_lang = 'en'
English names:
+-----------+-----------+------------------------------+
| lang_name | lang_code | lang_name |
+-----------+-----------+------------------------------+
| Ari | aac | Arikara |
| Asa | aas | Asu |
| Asa | aas | Chasu |
| Asa | aas | South Pare |
| Asa | aas | Southern Pare |
| Abo | abb | Abon |
| Abo | abb | Abɔ̃ |
| Bo | abb | Tibetan |
| Bea | abj | Beaver |
| Bea | abj | Dane-zaa |
| Bea | abj | Danezaa |
| Bea | abj | Danezaa ZaageɁ |
| Aer | aeq | Eastern Arrernte |
| Ako | ahk | Akurio |
| Igo | ahl | Isebe |
| Ali | aiy | Amaimon |
| Ba | akm | Bashkir |
| Bo | akm | Tibetan |
| Kol | aky | Kol (New Guinea) |
| Kol | aky | Kol (Papua New Guina) |
| Gae | anb | Guarequena |
| Gae | anb | Warekena |
| Gay | anb | Gayo |
| Anu | anl | Anuak |
| Anu | anl | Anyua |
| Anu | anl | Anyuak |
| Anu | anl | Anywa |
| Ayo | aou | Ayoreo |
| Ayo | aou | Ayoré |
| Ayo | aou | Ayoweo |
| Ayo | aou | Moro |
| Ayo | aou | Morotoco |
| Ayo | aou | Pyeta Yovai |
| Asu | asa | Asurini |
| Asu | asa | Asuriní |
| Asu | asa | Asuriní do Tocantins |
| Asu | asa | Asuriní of Tocantins |
| Asu | asa | Tocantins Asurini |
| Wao | auc | Wappo |
| Nye | bcv | Nyengo |
| Bit | bgk | Bitara |
| Bo | bgl | Tibetan |
| Kol | biw | Kol (New Guinea) |
| Kol | biw | Kol (Papua New Guina) |
| Lia | bli | West-Central Limba |
| Wo | bsc | Wolof |
| Car | caq | Carib |
| Car | caq | Cariña |
| Car | caq | Galibi |
| Car | caq | Galibi Carib |
| Car | caq | Galibí |
| Car | caq | Kali'na |
| Car | caq | Kalihna |
| Car | caq | Kalinya |
| Car | caq | Maraworno |
| Car | caq | Marworno |
| Sak | ckh | Sake |
| Sak | ckh | Shake |
| Maa | cma | San Jerónimo Tecóatl Mazatec |
| Mro | cmr | Mru |
| Lai | cnh | Lambya |
| Con | cno | Kofan |
| Con | cno | Kofane |
| Con | cno | Macu |
| Con | cno | Maku |
| Rai | dhw | Ramoaaina |
| Rai | dhw | Ramoaina |
| Rai | dhw | Ramuaaina |
| Rai | dhw | Ramuaina |
| Kol | ekl | Kol (New Guinea) |
| Kol | ekl | Kol (Papua New Guina) |
| Ora | ema | Oroha |
| Sie | erg | Simaa |
| Hor | ero | Horo |
| Rgu | ero | Ringgou |
| Ko | fuj | Korean |
| Ko | fuj | Modern Korean |
| Kag | gel | Kajaman |
| Hop | hob | Hopi |
| Hop | hob | Moqui |
| Iko | iki | Olulumo-Ikom |
| Yaa | iyx | Yaminahua |
| Yaa | iyx | Yaminawa |
| Yei | jei | Yeni |
| Tol | jic | Tolowa |
| Tem | kdh | Themne |
| Tem | kdh | Timne |
| Lue | khb | Luvale |
| Kim | kia | Tofa |
| Kim | kia | Tofalar |
| Koi | kkt | Komi-Permyak |
| Rai | lew | Ramoaaina |
| Rai | lew | Ramoaina |
| Rai | lew | Ramuaaina |
| Rai | lew | Ramuaina |
| Taa | lew | Tanana |
| Lou | loj | Louisiana Creole |
| Lou | loj | Louisiana Creole French |
| Mor | mhz | Moro |
| Mor | moq | Moro |
| Sar | mwm | Sarabeca |
| Sar | mwm | Sarave |
| Sar | mwm | Saraveca |
| Sar | mwm | Saraveka |
| Taa | nmn | Tanana |
| Xoo | nmn | Xucuru |
| Xoo | nmn | Xucurú |
| Xoo | nmn | Xukuru |
| Xoo | nmn | Xukurú |
| Nuk | noc | Nuu-chah-nulth |
| Nuk | noc | Nuuchahnulth |
| Nuk | noc | T'aat'aaqsapa |
| Yom | pil | Yombe |
| Pom | pmo | Southeastern Pomo |
| Uma | ppk | Umatilla |
| Sec | sai-sec | Sechelt |
| Sek | sai-sec | Sekani |
| Sek | sai-sec | Tsek'ene |
| Sek | sai-tal | Sekani |
| Sek | sai-tal | Tsek'ene |
| Yao | sai-yao | Yao (Africa) |
| Sha | scw | Shall-Zwall |
| Sok | skk | Sokoro |
| Ura | ula | Urarina |
| Uru | ure | Urumi |
| Woi | wbw | Woisika |
| Wom | wmo | Wom (Nigeria) |
+-----------+-----------+------------------------------+
|
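The overlap shown in the table can also be checked in plain Python. The following is a minimal sketch, not the package's actual lookup: `SAMPLE_ROWS` and `find_ambiguous_names` are hypothetical names, the rows are a handful of pairs read off the table above, and the sketch assumes one name per code and case-insensitive matching between names and codes.

```python
# Hypothetical sample of (lang_code, lang_name) pairs taken from the table
# above; real data would come from the langcodes table.
SAMPLE_ROWS = [
    ("aac", "Ari"),
    ("ari", "Arikara"),
    ("fuj", "Ko"),
    ("ko", "Korean"),
    ("en", "English"),
]


def find_ambiguous_names(rows):
    """Return (name, code, other_name) for every language name that is
    also the code of a different language (compared case-insensitively)."""
    # Simplification: assumes each code has exactly one name.
    name_by_code = {code.lower(): name for code, name in rows}
    conflicts = []
    for code, name in rows:
        other_name = name_by_code.get(name.lower())
        if other_name is not None and code.lower() != name.lower():
            conflicts.append((name, code, other_name))
    return sorted(conflicts)


for name, code, other_name in find_ambiguous_names(SAMPLE_ROWS):
    print(f"{name!r} is the name of {code!r} but also the code of {other_name!r}")
```

With these sample rows, "Ari" and "Ko" are reported, which matches the corresponding rows of the query result.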
When there is ambiguity, we can just exit with an error. |
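A rough sketch of the "exit with an error on ambiguity" idea follows. Everything here is an assumption made for illustration: `CODE_BY_NAME`, `AmbiguousLanguageError`, and `resolve_language` are hypothetical names, and the small lookup table stands in for whatever the real name/code data source would be.

```python
# Hypothetical lookup: lowercased language name -> language code.
CODE_BY_NAME = {
    "ari": "aac",
    "arikara": "ari",
    "ko": "fuj",
    "korean": "ko",
    "english": "en",
}
KNOWN_CODES = set(CODE_BY_NAME.values())


class AmbiguousLanguageError(ValueError):
    """Raised when a value is both a language name and a different code."""


def resolve_language(value):
    """Map a user-supplied name or code to a language code.

    Raises AmbiguousLanguageError if the value is a valid code and also
    the name of a different language.
    """
    key = value.lower()
    code_from_name = CODE_BY_NAME.get(key)
    code_as_given = key if key in KNOWN_CODES else None
    if code_from_name and code_as_given and code_from_name != code_as_given:
        raise AmbiguousLanguageError(
            f"{value!r} is both the code {code_as_given!r} and a name for "
            f"{code_from_name!r}; please be more specific"
        )
    if code_as_given:
        return code_as_given
    if code_from_name:
        return code_from_name
    raise ValueError(f"unknown language: {value!r}")
```

A command-line entry point could catch `AmbiguousLanguageError` and exit with a non-zero status, so unambiguous names such as "English" keep working while "ko" or "Ari" are rejected.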
Actually, I see what you mean. Then we might just have to separate --language into --language-name and --language-code. |
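Splitting the option in two could look roughly like the sketch below. It is not the project's actual command-line code: `build_parser`, `collect_capture_codes`, and the tiny `CODE_BY_NAME` table are hypothetical, and the fallback to the dump file language plus Translingual ("mul") is taken from the default discussed later in this thread.

```python
import argparse

# Hypothetical lookup: lowercased language name -> language code.
CODE_BY_NAME = {"english": "en", "korean": "ko", "translingual": "mul"}


def build_parser():
    parser = argparse.ArgumentParser(description="sketch of split language options")
    # Both options may be repeated; None means "not given".
    parser.add_argument("--language-code", action="append", default=None,
                        metavar="CODE", help="language code to capture")
    parser.add_argument("--language-name", action="append", default=None,
                        metavar="NAME", help="language name to capture")
    return parser


def collect_capture_codes(args, dump_file_lang_code="en"):
    """Merge codes and names into one set of language codes."""
    codes = set(args.language_code or [])
    for name in args.language_name or []:
        code = CODE_BY_NAME.get(name.lower())
        if code is None:
            raise SystemExit(f"unknown language name: {name!r}")
        codes.add(code)
    if not codes:
        # Default to the dump file language and Translingual.
        codes = {dump_file_lang_code, "mul"}
    return codes
```

Because each option states whether its value is a name or a code, a value such as "ko" no longer needs to be guessed at.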
I changed the default capture code from |
    if wxr.config.extract_thesaurus_pages:
        if wxr.config.dump_file_lang_code == "en":
            emit_words_in_thesaurus(wxr, emitted, out_f, human_readable)
emit_words_in_thesaurus() is only used for the English Wiktionary, because the other extractors now use the lang_name property but the English extractor uses lang.
    if not args.language:
        args.language = ["en", "mul"]
    # Default to dump file language and Translingual if not specified.
    capture_lang_codes = set()
WiktionaryConfig.capture_lang_codes is changed to a set, because we only check whether a code is in it, and a set also removes duplicate codes.
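The effect of storing the codes in a set can be illustrated with a small stand-in. `SketchConfig` and `should_capture` are hypothetical names used only for this example and are not the real WiktionaryConfig class.

```python
from dataclasses import dataclass, field


@dataclass
class SketchConfig:
    # Hypothetical stand-in for the config attribute discussed above.
    capture_lang_codes: set = field(default_factory=set)


def should_capture(config, lang_code):
    """Membership test; average constant time for a set."""
    return lang_code in config.capture_lang_codes


# Duplicate codes collapse automatically when the set is built.
config = SketchConfig(capture_lang_codes=set(["en", "mul", "en"]))
```

Here `config.capture_lang_codes` holds only "en" and "mul", and `should_capture(config, "mul")` is true without scanning a list.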
Thank you, this seems good. |
This bug was introduced in PR tatuylonen#393; fixes tatuylonen#405.
Require tatuylonen/wikitextprocessor#134