Add languages.json for Spanish Wiktionary #391

empiriker · 2023-10-31T09:37:08Z

The Spanish Wiktionary doesn't use module data to define language codes. So I had to look for other sources of language data.

Fortunately, there seems to be a comprehensive, automatically-generated list of language codes with their corresponding Spanish names in the Appendix: Apéndice:Códigos de idioma

There is probably a better way of organizing the get_data logic for different languages than by just adding another script file. But for more or less a one-time thing, this should suffice.

This work is a contribution to the EWOK project, which receives funding from LABEX ASLAN (ANR–10–LABX–0081) at the Université de Lyon, as part of the "Investissements d'Avenir" program initiated and overseen by the Agence Nationale de la Recherche (ANR) in France.

xxyzz · 2023-10-31T10:26:55Z

languages/get_data_es.py

+    wxr = WiktextractContext(Wtp(lang_code=args.lang_code), WiktionaryConfig())
+
+    wxr = WiktextractContext(
+        Wtp(
+            lang_code=args.lang_code, db_path="wikt-db_es_language_data_temp.db"
+        ),
+        WiktionaryConfig(),
+    )


Shouldn't db_path be passed as a command-line argument?
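
A minimal sketch of what that could look like, assuming the script already builds its arguments with argparse (the diff only shows `args.lang_code`). The `--db-path` flag name, the `build_parser` helper, and the positional form of `lang_code` are hypothetical and not taken from the PR:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a language-data script (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Export language data for one Wiktionary edition"
    )
    # The diff reads args.lang_code; making it positional is an assumption.
    parser.add_argument("lang_code", help="edition language code, e.g. 'es'")
    # Hypothetical flag: lets the caller pick the database file instead of
    # relying on a hard-coded temporary file name.
    parser.add_argument(
        "--db-path",
        default=None,
        help="path of the database file to use (optional)",
    )
    return parser


# Example invocation with an explicit argument list instead of sys.argv.
args = build_parser().parse_args(["es", "--db-path", "/tmp/es_lang.db"])
print(args.lang_code, args.db_path)
```

The parsed `args.db_path` could then be handed to `Wtp(..., db_path=args.db_path)` in place of the hard-coded file name shown in the diff.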

xxyzz · 2023-10-31T12:16:19Z

I checked the Spanish lengua template and found that it calls the getlang template, which expands templates that use the language code as the template title.

For example, {{lengua|es}} expands to {{getlang|es}}, which then expands to {{es|texto=x}}. The language names are stored in these templates: https://es.wiktionary.org/wiki/Categoría:Plantillas_de_idiomas

xxyzz · 2023-10-31T12:23:19Z

And I think that if a Wiktionary edition doesn't use language codes and names in its Lua code, then we don't need this data either.

I searched the code in wikitextprocessor and found that the only places that use these language codes are mw.language.fetchLanguageName and mw.language.fetchLanguageNames, and these two functions use the same data on all MediaWiki sites. That data can be downloaded from this API: https://fr.wiktionary.org/w/api.php?action=query&meta=siteinfo&siprop=languages&format=json&formatversion=2
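
As a rough sketch of turning that API response into a code-to-name mapping: the code below assumes the `formatversion=2` response holds a list under `query.languages` whose entries carry `code` and `name` keys. The function names, the User-Agent string, and the sample payload are made up for illustration, and no network request is made unless `fetch_languages` is called.

```python
import json
from urllib.request import Request, urlopen

API_URL = (
    "https://fr.wiktionary.org/w/api.php"
    "?action=query&meta=siteinfo&siprop=languages&format=json&formatversion=2"
)


def languages_from_siteinfo(payload: dict) -> dict[str, str]:
    """Map language code -> language name from a siteinfo response.

    Assumes each entry under query.languages has "code" and "name" keys.
    """
    return {
        entry["code"]: entry["name"]
        for entry in payload["query"]["languages"]
    }


def fetch_languages(url: str = API_URL) -> dict[str, str]:
    """Download the language list (performs a network request)."""
    request = Request(url, headers={"User-Agent": "language-data-sketch/0.1"})
    with urlopen(request) as response:
        return languages_from_siteinfo(json.load(response))


# Made-up sample in the assumed response shape, so no network is needed here.
sample = {
    "batchcomplete": True,
    "query": {
        "languages": [
            {"code": "es", "bcp47": "es", "name": "español"},
            {"code": "fr", "bcp47": "fr", "name": "français"},
        ]
    },
}
print(languages_from_siteinfo(sample))
```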

empiriker · 2023-10-31T12:47:28Z

Thanks for looking into this, @xxyzz.

Let's see if I'm following you: the original reason to pre-load the languages is so that wikitextprocessor can correctly execute modules that require them. So you're saying that for the Spanish Wiktionary, which doesn't have modules using language data, we wouldn't need to pre-load them. Do I understand you correctly?

So let's say we do not pre-load language data for these editions. Then we would not have access to wxr.config.LANGUAGES_BY_NAME and wxr.config.LANGUAGES_BY_CODE. So far I have found them useful for flexibly adding the language code or the language name when one of them was missing. But I guess we could do without them.
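
To illustrate the kind of lookup meant here, a small sketch that fills in whichever of code or name is missing. It assumes one table maps a code to a list of names and the other maps a name to a code; those shapes, the `complete_language` helper, and the sample data are assumptions for illustration, not the actual wiktextract structures.

```python
from typing import Optional


def complete_language(
    code: Optional[str],
    name: Optional[str],
    by_code: dict[str, list[str]],
    by_name: dict[str, str],
) -> tuple[Optional[str], Optional[str]]:
    """Fill in a missing language code or name from the lookup tables.

    Hypothetical helper: by_code maps code -> list of names (first one is
    treated as canonical), by_name maps name -> code.
    """
    if code is None and name is not None:
        code = by_name.get(name)
    if name is None and code is not None:
        names = by_code.get(code)
        name = names[0] if names else None
    return code, name


# Made-up sample tables in the assumed shape.
by_code = {"es": ["español"], "la": ["latín"]}
by_name = {"español": "es", "latín": "la"}

print(complete_language("es", None, by_code, by_name))
print(complete_language(None, "latín", by_code, by_name))
```

Unknown codes or names simply stay `None`, so callers can decide how to handle entries the tables do not cover.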

Regarding the lengua template: I saw it and considered getting the language names that way. But then I would still need a definitive list of all the possible language codes, and the only such list I could find was Apéndice:Códigos de idioma.

xxyzz · 2023-10-31T14:29:12Z

For the wikitextprocessor package, a single JSON file downloaded from the API linked above should be enough for the mw.language.fetchLanguageName function. And in the wiktextract package, at least the French extractor doesn't use language data very often, because the language code is in the template argument and the language name could be obtained by expanding the subtitle template.

The English and Chinese Wiktionaries use plain-text language names in subtitles, so they would still need the huge JSON file. But I think we could run the Lua code that extracts the language data when first extracting the dump file and save the language data to the database.

That Spanish appendix page says it has 1215 languages, while the template category page has 1224. I didn't check the numbers, but you may need to find the difference.
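
One way to find that difference, sketched under the assumption that both code lists have already been collected into Python iterables. The `compare_code_sets` helper and the sample codes are hypothetical; the real lists would come from the appendix page and the template category.

```python
from typing import Iterable


def compare_code_sets(
    appendix_codes: Iterable[str], template_codes: Iterable[str]
) -> dict[str, list[str]]:
    """Report codes that appear in only one of the two sources."""
    appendix = set(appendix_codes)
    templates = set(template_codes)
    return {
        "only_in_appendix": sorted(appendix - templates),
        "only_in_templates": sorted(templates - appendix),
    }


# Made-up sample data, not the real lists.
appendix_sample = ["es", "fr", "la"]
template_sample = ["es", "fr", "grc", "qu"]
print(compare_code_sets(appendix_sample, template_sample))
```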

empiriker added 3 commits October 31, 2023 09:25

kristian-clausal merged commit 595ff1d into tatuylonen:master Oct 31, 2023
5 checks passed

xxyzz reviewed Oct 31, 2023
