Add languages.json for Spanish Wiktionary #391

Merged on Oct 31, 2023 (3 commits)


empiriker

The Spanish Wiktionary doesn't use module data to define language codes. So I had to look for other sources of language data.

Fortunately, there seems to be a comprehensive, automatically-generated list of language codes with their corresponding Spanish names in the Appendix: Apéndice:Códigos de idioma

There is probably a better way of organizing the get_data logic for different languages than by just adding another script file. But for more or less a one-time thing, this should suffice.

This work is a contribution to the EWOK project, which receives funding from LABEX ASLAN (ANR–10–LABX–0081) at the Université de Lyon, as part of the "Investissements d'Avenir" program initiated and overseen by the Agence Nationale de la Recherche (ANR) in France.
@kristian-clausal kristian-clausal merged commit 595ff1d into tatuylonen:master Oct 31, 2023
5 checks passed
Comment on lines +24 to +31
wxr = WiktextractContext(Wtp(lang_code=args.lang_code), WiktionaryConfig())

wxr = WiktextractContext(
    Wtp(
        lang_code=args.lang_code, db_path="wikt-db_es_language_data_temp.db"
    ),
    WiktionaryConfig(),
)
Collaborator

Shouldn't db_path be passed in as a command-line argument?
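A minimal sketch of what that could look like, assuming the script already parses `lang_code` with `argparse`. The `--db-path` flag name, its default, and the `build_parser` helper are hypothetical and are not taken from the PR.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI parser for a language-data export script (illustrative)."""
    parser = argparse.ArgumentParser(
        description="Export language data for a Wiktionary edition."
    )
    parser.add_argument("lang_code", help="Wiktionary edition code, e.g. 'es'")
    # Hypothetical flag: lets the caller choose the temporary database file
    # instead of hard-coding it in the script.
    parser.add_argument(
        "--db-path",
        default=None,
        help="Path of the temporary database file (optional)",
    )
    return parser


args = build_parser().parse_args(
    ["es", "--db-path", "wikt-db_es_language_data_temp.db"]
)
# The parsed value would then replace the hard-coded string from the diff:
#   Wtp(lang_code=args.lang_code, db_path=args.db_path)
print(args.lang_code, args.db_path)
```

With this shape, omitting `--db-path` leaves `args.db_path` as `None`, so the script can decide on its own fallback.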


xxyzz commented Oct 31, 2023

I checked the Spanish lengua template and found that it calls the getlang template, which in turn expands a template whose title is the language code.

For example, {{lengua|es}} expands {{getlang|es}}, which then expands to {{es|texto=x}}. The language names are stored in these templates: https://es.wiktionary.org/wiki/Categoría:Plantillas_de_idiomas


xxyzz commented Oct 31, 2023

And I think that if a Wiktionary edition doesn't use language codes and names in its Lua code, then we don't need this data either.

I searched the code in wikitextprocessor and found that the only places that use these language codes are mw.language.fetchLanguageName and mw.language.fetchLanguageNames, and these two functions use the same data on all MediaWiki sites. That data can be downloaded from this API: https://fr.wiktionary.org/w/api.php?action=query&meta=siteinfo&siprop=languages&format=json&formatversion=2
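A rough sketch of downloading that data and turning it into a code-to-name mapping. The query parameters are the ones in the URL above. The response shape (a `query.languages` list of objects with `code` and `name` keys) is an assumption about the `formatversion=2` output, and the function names and User-Agent string are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def siteinfo_languages_url(edition: str = "fr") -> str:
    """Build the siteinfo URL that lists the languages MediaWiki knows."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "languages",
        "format": "json",
        "formatversion": "2",
    }
    return f"https://{edition}.wiktionary.org/w/api.php?{urlencode(params)}"


def parse_language_names(payload: dict) -> dict[str, str]:
    """Map language code -> name (assumed response shape, see lead-in)."""
    return {
        entry["code"]: entry["name"]
        for entry in payload["query"]["languages"]
    }


def fetch_language_names(edition: str = "fr") -> dict[str, str]:
    """Download and parse the list; needs network access."""
    # Placeholder User-Agent; a real script should identify itself properly.
    request = Request(
        siteinfo_languages_url(edition),
        headers={"User-Agent": "language-data-sketch/0.1"},
    )
    with urlopen(request) as response:
        return parse_language_names(json.load(response))


# Offline example using a hand-written payload in the assumed shape.
sample = {
    "query": {
        "languages": [
            {"code": "es", "bcp47": "es", "name": "español"},
            {"code": "fr", "bcp47": "fr", "name": "français"},
        ]
    }
}
print(parse_language_names(sample))
```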


empiriker commented Oct 31, 2023

Thanks for looking into this, @xxyzz.

Let's see if I'm following you: the initial reason to preload the languages is so that wikitextprocessor can correctly execute modules that require this language data. So you're saying that for the Spanish Wiktionary, which doesn't have modules using language data, we wouldn't need to preload it. Do I understand you correctly?

So let's say, we do not preload language data for these editions, then we would not have access to wxr.config.LANGUAGES_BY_NAME and wxr.config.LANGUAGES_BY_CODE. So far I have found them useful to flexibly add the language code or the language name when one of them was missing. But I guess, we could do without.
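To illustrate the kind of lookup I mean, here is a small sketch. The dictionary shapes (code to a list of names, name to code) and the `complete_language` helper are assumptions made for this example and may not match the actual `wxr.config` attributes.

```python
from typing import Optional

# Toy stand-ins for wxr.config.LANGUAGES_BY_CODE / LANGUAGES_BY_NAME
# (assumed shapes, not the real data).
LANGUAGES_BY_CODE = {"es": ["español"], "fr": ["francés"]}
LANGUAGES_BY_NAME = {
    name: code for code, names in LANGUAGES_BY_CODE.items() for name in names
}


def complete_language(
    code: Optional[str] = None, name: Optional[str] = None
) -> tuple[Optional[str], Optional[str]]:
    """Fill in whichever of (code, name) is missing, if it can be looked up."""
    if code is None and name is not None:
        code = LANGUAGES_BY_NAME.get(name)
    if name is None and code is not None:
        names = LANGUAGES_BY_CODE.get(code)
        name = names[0] if names else None
    return code, name


print(complete_language(code="es"))
print(complete_language(name="francés"))
```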


Regarding the lengua template: I saw it and considered using it to get the language names. But then I would still need a definitive list of all the possible language codes, and the only such list I could find was Apéndice:Códigos de idioma.


xxyzz commented Oct 31, 2023

For the wikitextprocessor package, a single JSON file downloaded from the API linked above should be enough for the mw.language.fetchLanguageName function. And in the wiktextract package, at least the French extractor doesn't use language data very often, because the language code is in the template argument and the language name could be obtained by expanding the subtitle template.

English and Chinese Wiktionary use plain-text language names in subtitles, so they would still need the huge JSON file. But I think we could run the language-extraction Lua code when first extracting the dump file and save the language data to the database.

The Spanish appendix page says it has 1215 languages, while the template category page has 1224. I didn't verify the numbers, but you may need to find the difference.
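One way to find the difference, sketched with toy data: once both lists of codes are collected, a set comparison shows which codes appear on only one side. The function name and the sample codes are hypothetical; the real inputs would be the codes scraped from the appendix page and from the template category.

```python
def compare_code_sets(appendix_codes, template_codes):
    """Return (codes only in templates, codes only in appendix), both sorted."""
    appendix = set(appendix_codes)
    templates = set(template_codes)
    return sorted(templates - appendix), sorted(appendix - templates)


# Toy data standing in for the two real lists.
appendix_codes = ["es", "fr", "de"]
template_codes = ["es", "fr", "de", "it", "pt"]

only_in_templates, only_in_appendix = compare_code_sets(
    appendix_codes, template_codes
)
print("missing from appendix:", only_in_templates)
print("missing from templates:", only_in_appendix)
```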
