
Parsing the French Wiktionary #223

Closed (wanted to merge 15 commits)

Conversation

@empiriker (Contributor) commented Mar 24, 2023

This pull request adds support for parsing the French Wiktionary from the dump file.

This is a work in progress and relies on an update of wikitextprocessor (see the pull request there). However, I saw in #205 that others plan to work on this as well, and I thought I'd share the progress I have already made.

My current interest lies mainly in extracting glosses and example sentences from the French Wiktionary for French. So this is what I plan to support.

Approach

  • Added data files for French (pos_subtitles.json, other_subtitles.json, linkage_subtitles.json, etc.)
  • Placed all code extensions behind a flag (if ctx.lang_code == "fr":) so they do not affect the rest of the code base.

What has been done

  • Updated prefixes RECOGNIZED_PREFIXES and IGNORED_PREFIXES
  • Modified French heading templates to standard heading format, i.e. English Wiktionary (see wikitextprocessor)
  • Verified subtitle hierarchy (Correctly finds language, pos, and other sections [might not work for rarer sections])
  • Added extract_examples_fr: Custom function following the French Wiktionary conventions for examples
  • Added parse_translation_item_fr: Custom function following the French Wiktionary conventions for translation items

What I still plan to do

  • Clean glosses (there are many unrecognized category templates that are expanded and placed into the text of glosses)
  • Run the code on the whole dump file (I have been working only on pages that include French words)

Ideas for the future

It is my impression that the French Wiktionary makes much more use of templates and enforces more standards than the English Wiktionary. That has the potential to lead to a much cleaner extraction result but means that many sections and page types need special treatment for wiktextract to run correctly.

  • Parse French Thesaurus pages
  • Parse Conjugation pages
  • Fix etymology
  • Fix linkages
  • Correctly extract sounds

There is potentially much more to do to extract all useful information. But these are the big ones I see.

Questions for discussion

  • Should this project even support other language Wiktionaries?
  • If yes, is the approach of adding support via flags desirable? Perhaps the code can be organized in a cleaner way?

PS: This is my first-ever contribution to an Open Source project. So please be kind if I am violating any conventions. I tried to be as organized as possible.

Comment on lines 2499 to 2529
lst = list(x not in not_french_words and
           # not x.isdigit() and
           (x in french_words or x.lower() in french_words or
            x in known_firsts or
            x[0].isdigit() or
            # (x[0].isupper() and x.find("-") < 0 and x.isascii()) or
            (x.endswith("s") and len(x) >= 4 and
             x[:-1] in french_words) or  # Plural
            (x.endswith("ies") and len(x) >= 5 and
             x[:-3] + "y" in french_words) or  # E.g. lily - lilies
            (x.endswith("ing") and len(x) >= 5 and
             x[:-3] in french_words) or  # E.g. bring - bringing
            (x.endswith("ing") and len(x) >= 5 and
             x[:-3] + "e" in french_words) or  # E.g. tone - toning
            (x.endswith("ed") and len(x) >= 5 and
             x[:-2] in french_words) or  # E.g. hang - hanged
            (x.endswith("ed") and len(x) >= 5 and
             x[:-2] + "e" in french_words) or  # E.g. atone - atoned
            (x.endswith("'s") and x[:-2] in french_words) or
            (x.endswith("s'") and x[:-2] in french_words) or
            (x.endswith("ise") and len(x) >= 5 and
             x[:-3] + "ize" in french_words) or
            (x.endswith("ised") and len(x) >= 6 and
             x[:-4] + "ized" in french_words) or
            (x.endswith("ising") and len(x) >= 7 and
             x[:-5] + "izing" in french_words) or
            (re.search(r"[-/]", x) and
             all(((y in french_words and len(y) > 2)
                  or not y)
                 for y in re.split(r"[-/]", x))))
           for x in tokens)
Collaborator

Did you just copy and paste this long block of code from the lines above? This code, the 8 MB french_words.txt file, and the french_words.py file seem unnecessary.

Contributor Author

Indeed, I added this whole logic very early in the process when I was still trying to understand which parts of the code base are highly specific to one language or one Wiktionary project. I mean, the whole idea of classifying descriptors by comparing to a list of English words doesn't work when the descriptors are in French. So in theory, the current logic is not helpful for parsing the French Wiktionary or might even be harmful.

However, you're completely right that at the current stage, the classify_desc method that makes use of the English word list does not affect the fields that this pull request aims to parse correctly, i.e. glosses, examples and translations. So it's best to deal with this at a stage when it actually might matter.

I removed this logic and the obsolete files. Thanks again!

"Toki pona": "tokipona",
"Toki pona": "tok",
Collaborator

This file is not used and can be deleted.

Contributor Author

Thanks for pointing this out. I deleted the file.

@@ -0,0 +1,3635 @@
{
Collaborator

You don't have to copy the file from the en folder. You could use an empty dictionary {} in this file.

Contributor Author

Indeed! Done.

@empiriker (Contributor Author)

I am currently working to correctly extract categorical tags from glosses and example sentences.

Fortunately, the French Wiktionary project mostly relies on template calls to add this information. For example, in the entry for "desservitude", the categorical tag "Normandie" is created from the template {{Normandie|fr}}. This means all that is left to do is to identify all templates that serve this function.

[image: screenshot of the "desservitude" entry showing the "(Normandie)" label in the gloss]

The same happens (with less frequency) when categorical information is attached to a single example.

My question: Into which field in the final word entry object should I write these tags?

For examples, I am using the "note" field right now, for glosses the field "tag". But there are also the fields "categories" and "topics" available.

@kristian-clausal (Collaborator)

> (quoting @empiriker's question above about which field should receive these categorical tags)

Use categories for wiki categories, the stuff that appears at the bottom of the article. tags should contain linguistic information, including sociolinguistic information (dialect, register, location), while topics is for more semantic stuff like "nautical" or "politics". I never touch the topics stuff myself and it's kind of neglected; Tatu implemented it and it hasn't gotten in the way of anything yet... But basically, almost everything goes into tags.
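A minimal sketch of how that guidance could be applied when routing label templates into the word-entry object. The template sets below are made-up examples for illustration, not wiktextract's real data tables:

```python
# Illustrative routing of gloss/example label templates into fields,
# following the rule above: linguistic labels -> "tags", semantic
# domains -> "topics", everything else kept as wiki "categories".
# These sets are assumptions, not wiktextract's actual template lists.

TAG_TEMPLATES = {"Normandie", "familier", "vieilli"}   # dialect/register/region
TOPIC_TEMPLATES = {"marine", "politique"}              # semantic domains

def route_template(name: str, entry: dict) -> None:
    """Append a template-derived label to the appropriate entry field."""
    if name in TAG_TEMPLATES:
        entry.setdefault("tags", []).append(name)
    elif name in TOPIC_TEMPLATES:
        entry.setdefault("topics", []).append(name)
    else:
        # Unrecognized label templates could be preserved as categories.
        entry.setdefault("categories", []).append(name)
```

With a classification like this, {{Normandie|fr}} would land in "tags" rather than in "note".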

@jmviz (Contributor) commented May 3, 2023

In case this is worked on further: I noticed that the French Module:langues/data uses a different idiom than the one the current wiktextract language data format and code are equipped to handle. It has sections of code "redirects" (search for "Redirections" on the linked page) where defunct codes are mapped to canonical codes, so that the defunct codes end up with the same data as the canonical codes. This will cause wiktextract to end up with a number of incorrect mappings from name to code in LANGUAGES_BY_NAME. A more expressive data format than the one currently used for wiktextract's languages.json files will probably be needed to solve this problem in general.

@empiriker (Contributor Author)

This is correct. By default, wiktextract produces objects like {"redirect": "ˈíːčèř", "title": "'í:čèř"} for these redirect pages (sorry, for some reason I only have a Paiter example at hand right now).

Right now I have just ignored these results in the final dictionary. But you're right that they deserve special treatment.
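Ignoring the redirect results could be as simple as the following sketch; the object shape follows the example above, while the sample entries are invented for illustration:

```python
def keep_entry(entry: dict) -> bool:
    """Drop redirect pages, which carry a "redirect" key instead of word data."""
    return "redirect" not in entry

# Hypothetical mix of extraction results: one redirect page, one word entry.
pages = [
    {"redirect": "ˈíːčèř", "title": "'í:čèř"},
    {"word": "chien", "lang_code": "fr"},
]
kept = [p for p in pages if keep_entry(p)]
```

Special treatment, rather than dropping them, would mean emitting the redirect objects alongside the word entries so consumers can resolve alternative spellings.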

From my side: having achieved my own minimal goals, I committed the cardinal mistake of moving on, thinking I would soon get around to the finishing touches and getting this pull request merged... I still haven't found the time, but I plan to get back to this.

@jmviz (Contributor) commented May 3, 2023

I mean in Module:langues/data, there are "redirection" lines like this:

-- Redirections de proto-langues
l['alg-pro'] = l['proto-algonquien']

so both alg-pro and proto-algonquien codes have the same data:

l['proto-algonquien'] = { nom = 'proto-algonquien', tri = 'algonquien proto' }

so in wiktextract/data/fr/languages.json there will be both:

"proto-algonquien": [
    "Proto-algonquien"
  ],

and

"alg-pro": [
    "Proto-algonquien"
  ],

so when LANGUAGES_BY_NAME is created, two codes will compete for the same name. Currently a simple rule picks the shorter code, so it will end up mapping "Proto-algonquien" → "alg-pro". But this is incorrect, as the canonical code is "proto-algonquien".
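One way the "shorter code" rule could be replaced is by recording which codes came from "Redirections" lines and preferring the non-redirect code. This is a sketch of the idea only; the redirect_codes set is an assumed structure that the current languages.json format does not carry:

```python
# Data shaped like wiktextract/data/fr/languages.json, per the example above.
languages = {
    "proto-algonquien": ["Proto-algonquien"],
    "alg-pro": ["Proto-algonquien"],
}
# Assumed extra information: codes introduced by "Redirections" lines
# in Module:langues/data. Collecting this would require extending the
# data format, as the comment above suggests.
redirect_codes = {"alg-pro"}

languages_by_name = {}
for code, names in languages.items():
    for name in names:
        current = languages_by_name.get(name)
        # Prefer a canonical (non-redirect) code over an alias.
        if current is None or (current in redirect_codes
                               and code not in redirect_codes):
            languages_by_name[name] = code
```

Under this rule, "Proto-algonquien" resolves to the canonical "proto-algonquien" regardless of code length.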

@xxyzz (Collaborator) commented Jul 19, 2023

The master branch now loads page-extractor code dynamically from the extractor folder according to the dump file's language code; please rebase onto the latest code if you want to continue implementing the French extractor.
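Dynamic loading by language code might look roughly like this importlib-based sketch; the dotted module path is an assumption based on the folder name mentioned above, and the real wiktextract layout may differ:

```python
import importlib

def load_extractor(lang_code: str, root: str = "wiktextract.extractor"):
    """Import one edition's extractor package, e.g. wiktextract.extractor.fr.

    The root path is an assumption based on the "extractor" folder name
    mentioned above; only the importlib mechanism itself is standard.
    """
    return importlib.import_module(f"{root}.{lang_code}")
```

The benefit over the earlier flag approach is that each edition's code lives in its own module and nothing is imported for editions that are not being extracted.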

@empiriker (Contributor Author)

Hi everyone,

I just want to let you know that I am back to work on this. My plan is also to add support for other editions of Wiktionary a little later.

I have seen that you did some major changes and will take a look now whether it makes sense to rebase or start from scratch (hopefully reusing code from the current pull request).

@xxyzz Are you also currently actively working on extracting the French Wiktionary? I see that you made some commits related to this recently. Maybe we should coordinate, or just aim for incremental pull requests? Looking forward to hearing your thoughts.

@xxyzz (Collaborator) commented Sep 12, 2023

I have been writing the French Wiktionary extractor recently; please check out the "extractor/fr" folder and the "tests/test_fr_*" files. The code can currently extract some gloss, pronunciation, example, translation, and inflection data. I have only tested a few pages, and there are many more cases and templates that need to be implemented. If you try the code on some pages, you will certainly find cases it doesn't handle; please send a pull request to improve the extractor code or create an issue for discussion.

I haven't added code for extracting thesaurus pages yet because those pages don't use a similar layout. You will see many errors because the English Wiktionary thesaurus extractor is currently used for them; those errors can be ignored.

@empiriker (Contributor Author)

@xxyzz Great work already!

Then this pull request can be closed; it no longer adds anything new.

@empiriker closed this on Sep 16, 2023
4 participants