Add languages.json for Spanish Wiktionary #391
This work is a contribution to the EWOK project, which receives funding from LABEX ASLAN (ANR–10–LABX–0081) at the Université de Lyon, as part of the "Investissements d'Avenir" program initiated and overseen by the Agence Nationale de la Recherche (ANR) in France.
wxr = WiktextractContext(Wtp(lang_code=args.lang_code), WiktionaryConfig())

wxr = WiktextractContext(
    Wtp(
        lang_code=args.lang_code, db_path="wikt-db_es_language_data_temp.db"
    ),
    WiktionaryConfig(),
)
Shouldn't db_path be passed in as a command-line argument?
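One way to read this suggestion is sketched below, assuming the script already uses argparse (the diff only shows `args.lang_code`). The `--db-path` option name, its help text, and the `build_parser` helper are hypothetical and not taken from the PR.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a positional language code and an optional DB path."""
    parser = argparse.ArgumentParser(
        description="Export language data for one Wiktionary edition."
    )
    parser.add_argument("lang_code", help="Edition language code, e.g. 'es'")
    # Hypothetical option: when omitted, db_path stays None and the caller
    # decides which default (if any) to fall back to.
    parser.add_argument(
        "--db-path",
        default=None,
        help="Path of the temporary database file (hypothetical option)",
    )
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["es", "--db-path", "/tmp/es_languages.db"])
print(args.lang_code, args.db_path)
```

The parsed value could then replace the hard-coded file name, for example `Wtp(lang_code=args.lang_code, db_path=args.db_path)`, using the same keyword arguments that appear in the diff above.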
I checked the Spanish lengua template and found that it calls the getlang template and expands templates that use the language code as the template title. For example,
I also think that if a Wiktionary edition doesn't use language codes and names in its Lua code, then we don't need this data either. I searched the wikitextprocessor code and found that the only places that use these language codes are
Thanks for looking into this, @xxyzz. Let me check that I'm following you: the initial reason for pre-loading the languages is so that wikitextprocessor can correctly execute modules that require them. So you're saying that for the Spanish Wiktionary, which doesn't have modules using language data, we wouldn't need to pre-load them. Do I understand you correctly? So let's say we do not preload language data for these editions; then we would not have access to
Regarding the lengua template: I saw it and considered getting the language names that way. But I would still need a definitive list of all possible language codes, and the only such list I could find was Apéndice:Códigos de idioma.
For the wikitextprocessor package, a single JSON file downloaded from the API linked above should be enough. Because the English and Chinese Wiktionary editions use plain-text language names in subtitles, they would still need the huge JSON file. But I think we could run the language-extraction Lua code when first extracting the dump file and save the language data to the database.
That Spanish appendix page says it has 1215 languages, while the template categories page has 1224 languages. I didn't check the numbers, but you may need to find the difference.
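Finding the difference between the two lists comes down to a set comparison once both sets of codes have been collected. A minimal sketch follows; the function name and the sample codes are placeholders for illustration, not data taken from either page.

```python
def diff_language_codes(appendix_codes, category_codes):
    """Report which codes appear in only one of the two sources."""
    appendix = set(appendix_codes)
    category = set(category_codes)
    return {
        "only_in_appendix": sorted(appendix - category),
        "only_in_categories": sorted(category - appendix),
    }


# Placeholder input: the real lists would come from the appendix page
# and the template categories page.
result = diff_language_codes(["es", "en", "fr"], ["es", "en", "de", "it"])
print(result)
```

Sorting both result lists keeps the report stable between runs, which makes it easier to review by hand.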
The Spanish Wiktionary doesn't use module data to define language codes. So I had to look for other sources of language data.
Fortunately, there seems to be a comprehensive, automatically generated list of language codes with their corresponding Spanish names in the appendix Apéndice:Códigos de idioma.
There is probably a better way of organizing the get_data logic for different languages than by just adding another script file. But for more or less a one-time thing, this should suffice.
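As a rough illustration of the last step of such a script, the sketch below turns (code, name) pairs into JSON text. It assumes the pairs have already been scraped from the appendix and that languages.json maps each code to a list of names; both the layout and the `build_language_data` helper are assumptions rather than details confirmed by this PR, and the sample pairs are hand-written.

```python
import json


def build_language_data(pairs):
    """Group language names by code, skipping blanks and duplicate names."""
    data = {}
    for code, name in pairs:
        code, name = code.strip(), name.strip()
        if not code or not name:
            continue  # ignore incomplete rows
        names = data.setdefault(code, [])
        if name not in names:
            names.append(name)
    return data


# Hand-written sample rows, not copied from the appendix page.
sample_pairs = [
    ("es", "español"),
    ("en", "inglés"),
    ("es", "castellano"),
    ("en", "inglés"),
]
language_data = build_language_data(sample_pairs)

# ensure_ascii=False keeps accented names readable in the output file.
text = json.dumps(language_data, ensure_ascii=False, indent=2, sort_keys=True)
print(text)
```

Writing `text` to a file opened with `encoding="utf-8"` would produce the final languages.json under these assumptions.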