
Parsing the French Wiktionary #223

Closed (wanted to merge 15 commits)

Conversation

@empiriker (Contributor) commented Mar 24, 2023

This pull request adds support for parsing the French Wiktionary from the dump file.

This is a work in progress and relies on an update of wikitextprocessor (see the pull request there). However, I saw in #205 that others plan to work on this as well, and I thought I'd share the progress I have already made.

My current interest lies mainly in extracting glosses and example sentences from the French Wiktionary for French. So this is what I plan to support.

Approach

  • Added data files for French (pos_subtitles.json, other_subtitles.json, linkage_subtitles.json, etc.)
  • Placed all code extensions behind a flag (if ctx.lang_code == "fr":) so they do not affect the rest of the code base.

What has been done

  • Updated prefixes RECOGNIZED_PREFIXES and IGNORED_PREFIXES
  • Modified French heading templates to standard heading format, i.e. English Wiktionary (see wikitextprocessor)
  • Verified subtitle hierarchy (Correctly finds language, pos, and other sections [might not work for rarer sections])
  • Added extract_examples_fr: Custom function following the French Wiktionary conventions for examples
  • Added parse_translation_item_fr: Custom function following the French Wiktionary conventions for translation items

What I still plan to do

  • Clean glosses (there are many unrecognized category templates that are expanded and placed into the text of glosses)
  • Run the code on the whole dump file (I have been working only on pages that include French words)

Ideas for the future

It is my impression that the French Wiktionary makes much more use of templates and enforces more standards than the English Wiktionary. That has the potential to lead to a much cleaner extraction result but means that many sections and page types need special treatment for wiktextract to run correctly.

  • Parse French Thesaurus pages
  • Parse Conjugation pages
  • Fix etymology
  • Fix linkages
  • Correctly extract sounds

There is potentially much more to do to extract all useful information. But these are the big ones I see.

Questions for discussion

  • Should this project even support other language Wiktionaries?
  • If yes, is the approach of adding support via flags desirable? Perhaps the code can be organized in a cleaner way?

PS: This is my first-ever contribution to an Open Source project. So please be kind if I am violating any conventions. I tried to be as organized as possible.

Comment on lines 2499 to 2529
lst = list(x not in not_french_words and
           # not x.isdigit() and
           (x in french_words or x.lower() in french_words or
            x in known_firsts or
            x[0].isdigit() or
            # (x[0].isupper() and x.find("-") < 0 and x.isascii()) or
            (x.endswith("s") and len(x) >= 4 and
             x[:-1] in french_words) or  # Plural
            (x.endswith("ies") and len(x) >= 5 and
             x[:-3] + "y" in french_words) or  # E.g. lily - lilies
            (x.endswith("ing") and len(x) >= 5 and
             x[:-3] in french_words) or  # E.g. bring - bringing
            (x.endswith("ing") and len(x) >= 5 and
             x[:-3] + "e" in french_words) or  # E.g. tone - toning
            (x.endswith("ed") and len(x) >= 5 and
             x[:-2] in french_words) or  # E.g. hang - hanged
            (x.endswith("ed") and len(x) >= 5 and
             x[:-2] + "e" in french_words) or  # E.g. atone - atoned
            (x.endswith("'s") and x[:-2] in french_words) or
            (x.endswith("s'") and x[:-2] in french_words) or
            (x.endswith("ise") and len(x) >= 5 and
             x[:-3] + "ize" in french_words) or
            (x.endswith("ised") and len(x) >= 6 and
             x[:-4] + "ized" in french_words) or
            (x.endswith("ising") and len(x) >= 7 and
             x[:-5] + "izing" in french_words) or
            (re.search(r"[-/]", x) and
             all(((y in french_words and len(y) > 2)
                  or not y)
                 for y in re.split(r"[-/]", x))))
           for x in tokens)
Collaborator

Did you just copy and paste this long block of code from the lines above? This code, the 8 MB french_words.txt file, and the french_words.py file seem unnecessary.

Contributor Author

Indeed, I added this whole logic very early in the process when I was still trying to understand which parts of the code base are highly specific to one language or one Wiktionary project. I mean, the whole idea of classifying descriptors by comparing to a list of English words doesn't work when the descriptors are in French. So in theory, the current logic is not helpful for parsing the French Wiktionary or might even be harmful.

However, you're completely right that at the current stage, the classify_desc method that makes use of the English word list does not affect the fields that this pull request aims to parse correctly, i.e. glosses, examples and translations. So it's best to deal with this at a stage when it actually might matter.

I removed this logic and the obsolete files. Thanks again!

"Toki pona": "tokipona",
"Toki pona": "tok",
Collaborator

This file is not used and can be deleted.

Contributor Author

Thanks for pointing this out. I deleted the file.

@@ -0,0 +1,3635 @@
{
Collaborator

You don't have to copy the file from the en folder. You could use an empty dictionary {} in this file.

Contributor Author

Indeed! Done.

@empiriker (Contributor Author)

I am currently working to correctly extract categorical tags from glosses and example sentences.

Fortunately, the French Wiktionary project mostly relies on template calls to add this information. For example, in the entry for "desservitude", the categorical tag "Normandie" is created from the template {{Normandie|fr}}. This means all that is left to do is to identify all templates that serve this function.

[image: screenshot of the "desservitude" entry showing the "(Normandie)" label in the gloss]

The same happens (with less frequency) when categorical information is attached to a single example.

My question: Into which field in the final word entry object should I write these tags?

For examples, I am using the "note" field right now, for glosses the field "tag". But there are also the fields "categories" and "topics" available.

@kristian-clausal (Collaborator)

> (quoting @empiriker's question above about which field should receive these categorical tags)

Use categories for wiki categories, the stuff that appears at the bottom of the article. tags should contain linguistic information, including sociolinguistic information (dialect, register, location), while topics is for more semantic stuff like "nautical" or "politics". I never touch the topics stuff myself and it's kind of neglected; Tatu implemented it and it hasn't gotten in the way of anything yet... But basically, almost everything goes into tags.
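A minimal sketch of how that guidance could be applied when routing label templates into the word-entry object. The template sets below are made-up examples for illustration, not wiktextract's real data tables:

```python
# Illustrative routing of gloss/example label templates into fields,
# following the rule above: linguistic labels -> "tags", semantic
# domains -> "topics", everything else kept as wiki "categories".
# These sets are assumptions, not wiktextract's actual template lists.

TAG_TEMPLATES = {"Normandie", "familier", "vieilli"}   # dialect/register/region
TOPIC_TEMPLATES = {"marine", "politique"}              # semantic domains

def route_template(name: str, entry: dict) -> None:
    """Append a template-derived label to the appropriate entry field."""
    if name in TAG_TEMPLATES:
        entry.setdefault("tags", []).append(name)
    elif name in TOPIC_TEMPLATES:
        entry.setdefault("topics", []).append(name)
    else:
        # Unrecognized label templates could be preserved as categories.
        entry.setdefault("categories", []).append(name)
```

With a classification like this, {{Normandie|fr}} would land in "tags" rather than in "note".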

@jmviz (Contributor) commented May 3, 2023

In case this is worked on further: I noticed that the French Module:langues/data uses a different idiom than the one the current wiktextract language data format and code are equipped to handle. It has sections of code "redirects" (search for "Redirections" on the linked page) where defunct codes are mapped to canonical codes, so that the defunct codes end up with the same data as the canonical codes. This will cause wiktextract to end up with a number of incorrect mappings from name to code in LANGUAGES_BY_NAME. A more expressive data format than the one currently used for wiktextract's languages.json files will probably be needed to solve this problem in general.

@empiriker (Contributor Author)

This is correct. By default, wiktextract produces objects like {"redirect": "ˈíːčèř", "title": "'í:čèř"} for these redirect pages (sorry, for some reason I only have a Paiter example at hand right now).

Right now I have just ignored these results in the final dictionary. But you're right that they deserve special treatment.
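Ignoring the redirect results could be as simple as the following sketch; the object shape follows the example above, while the sample entries are invented for illustration:

```python
def keep_entry(entry: dict) -> bool:
    """Drop redirect pages, which carry a "redirect" key instead of word data."""
    return "redirect" not in entry

# Hypothetical mix of extraction results: one redirect page, one word entry.
pages = [
    {"redirect": "ˈíːčèř", "title": "'í:čèř"},
    {"word": "chien", "lang_code": "fr"},
]
kept = [p for p in pages if keep_entry(p)]
```

Special treatment, rather than dropping them, would mean emitting the redirect objects alongside the word entries so consumers can resolve alternative spellings.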

From my side: having achieved my own minimal goals, I committed the cardinal mistake of moving on, thinking I would soon get around to the finishing touches and getting this pull request merged... I still haven't found the time, but I plan to get back to this.

@jmviz (Contributor) commented May 3, 2023

I mean in Module:langues/data, there are "redirection" lines like this:

-- Redirections de proto-langues
l['alg-pro'] = l['proto-algonquien']

so both alg-pro and proto-algonquien codes have the same data:

l['proto-algonquien'] = { nom = 'proto-algonquien', tri = 'algonquien proto' }

so in wiktextract/data/fr/languages.json there will be both:

"proto-algonquien": [
    "Proto-algonquien"
  ],

and

"alg-pro": [
    "Proto-algonquien"
  ],

so when LANGUAGES_BY_NAME is created, two codes will compete for the same name. Currently a simple rule picks the shorter code, so it will end up mapping "Proto-algonquien" → "alg-pro". But this is incorrect, as the canonical code is "proto-algonquien".
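One way the "shorter code" rule could be replaced is by recording which codes came from "Redirections" lines and preferring the non-redirect code. This is a sketch of the idea only; the redirect_codes set is an assumed structure that the current languages.json format does not carry:

```python
# Data shaped like wiktextract/data/fr/languages.json, per the example above.
languages = {
    "proto-algonquien": ["Proto-algonquien"],
    "alg-pro": ["Proto-algonquien"],
}
# Assumed extra information: codes introduced by "Redirections" lines
# in Module:langues/data. Collecting this would require extending the
# data format, as the comment above suggests.
redirect_codes = {"alg-pro"}

languages_by_name = {}
for code, names in languages.items():
    for name in names:
        current = languages_by_name.get(name)
        # Prefer a canonical (non-redirect) code over an alias.
        if current is None or (current in redirect_codes
                               and code not in redirect_codes):
            languages_by_name[name] = code
```

Under this rule, "Proto-algonquien" resolves to the canonical "proto-algonquien" regardless of code length.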

@xxyzz (Collaborator) commented Jul 19, 2023

The master branch now loads page-extractor code dynamically from the extractor folder according to the dump file's language code; please rebase onto the latest code if you want to continue implementing the French extractor.
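Dynamic loading by language code might look roughly like this importlib-based sketch; the dotted module path is an assumption based on the folder name mentioned above, and the real wiktextract layout may differ:

```python
import importlib

def load_extractor(lang_code: str, root: str = "wiktextract.extractor"):
    """Import one edition's extractor package, e.g. wiktextract.extractor.fr.

    The root path is an assumption based on the "extractor" folder name
    mentioned above; only the importlib mechanism itself is standard.
    """
    return importlib.import_module(f"{root}.{lang_code}")
```

The benefit over the earlier flag approach is that each edition's code lives in its own module and nothing is imported for editions that are not being extracted.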

@empiriker (Contributor Author)

Hi everyone,

I just want to let you know that I am back to work on this. My plan is also to add support for other editions of Wiktionary a little later.

I have seen that you did some major changes and will take a look now whether it makes sense to rebase or start from scratch (hopefully reusing code from the current pull request).

@xxyzz Are you also currently actively working on extracting the French Wiktionary? I see that you made some commits related to this recently. Maybe we should coordinate, or just aim for incremental pull requests? Looking forward to hearing your thoughts.

@xxyzz (Collaborator) commented Sep 12, 2023

I have been writing the French Wiktionary extractor recently; please check out the "extractor/fr" folder and the "tests/test_fr_*" files. The code can currently extract some gloss, pronunciation, example, translation, and inflection data. I have only tested a few pages, and there are many more cases and templates that need to be implemented. If you try the code on some pages, you will certainly find cases it doesn't handle; please send a pull request to improve the extractor code or create an issue for discussion.

I haven't added code for extracting thesaurus pages yet because those pages don't use a similar layout. You will see many errors because the English Wiktionary thesaurus extractor is currently used for them; those errors can be ignored.

@empiriker (Contributor Author)

@xxyzz Great work already!

Then this pull request can be closed; it no longer adds anything new.

@empiriker closed this on Sep 16, 2023
4 participants