Parsing the French Wiktionary #223
…directly from exemple, source and citation templates
wiktextract/form_descriptions.py
lst = list(x not in not_french_words and
           # not x.isdigit() and
           (x in french_words or x.lower() in french_words or
            x in known_firsts or
            x[0].isdigit() or
            # (x[0].isupper() and x.find("-") < 0 and x.isascii()) or
            (x.endswith("s") and len(x) >= 4 and
             x[:-1] in french_words) or  # Plural
            (x.endswith("ies") and len(x) >= 5 and
             x[:-3] + "y" in french_words) or  # E.g. lily - lilies
            (x.endswith("ing") and len(x) >= 5 and
             x[:-3] in french_words) or  # E.g. bring - bringing
            (x.endswith("ing") and len(x) >= 5 and
             x[:-3] + "e" in french_words) or  # E.g., tone - toning
            (x.endswith("ed") and len(x) >= 5 and
             x[:-2] in french_words) or  # E.g. hang - hanged
            (x.endswith("ed") and len(x) >= 5 and
             x[:-2] + "e" in french_words) or  # E.g. atone - atoned
            (x.endswith("'s") and x[:-2] in french_words) or
            (x.endswith("s'") and x[:-2] in french_words) or
            (x.endswith("ise") and len(x) >= 5 and
             x[:-3] + "ize" in french_words) or
            (x.endswith("ised") and len(x) >= 6 and
             x[:-4] + "ized" in french_words) or
            (x.endswith("ising") and len(x) >= 7 and
             x[:-5] + "izing" in french_words) or
            (re.search(r"[-/]", x) and
             all(((y in french_words and len(y) > 2)
                  or not y)
                 for y in re.split(r"[-/]", x))))
           for x in tokens)
Did you just copy and paste this long block of code from the lines above? This code, the 8M french_words.txt file and the french_words.py file seem unnecessary.
Indeed, I added this whole logic very early in the process when I was still trying to understand which parts of the code base are highly specific to one language or one Wiktionary project. I mean, the whole idea of classifying descriptors by comparing to a list of English words doesn't work when the descriptors are in French. So in theory, the current logic is not helpful for parsing the French Wiktionary or might even be harmful.
However, you're completely right that at the current stage, the classify_desc method that makes use of the English word list does not affect the fields that this pull request aims to parse correctly, i.e. glosses, examples and translations. So it's best to deal with this at a stage when it actually might matter.
I removed this logic and the obsolete files. Thanks again!
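To illustrate the point made above, here is a minimal sketch of why English suffix heuristics do not carry over to French. The function `matches_english_heuristics` and both tiny vocabularies are hypothetical stand-ins for illustration only; they reproduce a few of the checks from the quoted block and are not the wiktextract implementation.

```python
def matches_english_heuristics(token, vocabulary):
    """Apply a subset of the English suffix checks from the quoted block."""
    return (
        token in vocabulary
        or token.lower() in vocabulary
        # Plural: "hangs" -> "hang"
        or (token.endswith("s") and len(token) >= 4 and token[:-1] in vocabulary)
        # "lilies" -> "lily"
        or (token.endswith("ies") and len(token) >= 5
            and token[:-3] + "y" in vocabulary)
        # "bringing" -> "bring"
        or (token.endswith("ing") and len(token) >= 5
            and token[:-3] in vocabulary)
        # "hanged" -> "hang"
        or (token.endswith("ed") and len(token) >= 5
            and token[:-2] in vocabulary)
    )


# Illustrative vocabularies (assumptions, not real word lists)
english_words = {"lily", "bring", "hang"}
french_words = {"cheval", "journal", "manger"}

# English inflections are recognized through the suffix rules.
english_hits = [matches_english_heuristics(t, english_words)
                for t in ("lilies", "bringing", "hanged")]

# French inflections such as "chevaux" (plural of "cheval") are missed,
# because the rules only encode English morphology.
french_hits = [matches_english_heuristics(t, french_words)
               for t in ("chevaux", "journaux", "mangé")]

print(english_hits)
print(french_hits)
```

Swapping in a French word list alone therefore does not help much; the suffix rules themselves would have to change.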
"Toki pona": "tokipona", | ||
"Toki pona": "tok", |
This file is not used and can be deleted.
Thanks for pointing this out. I deleted the file.
@@ -0,0 +1,3635 @@
{
You don't have to copy the file from the en folder. You could use an empty dictionary {} in this file.
Indeed! Done.
I am currently working on correctly extracting categorical tags from glosses and example sentences. Fortunately, the French Wiktionary project mostly relies on template calls to add this information. For example, in the entry for "desservitude", the categorical tag "Normandie" is created from a template. The same happens (less frequently) when categorical information is attached to a single example.
My question: Into which field of the final word entry object should I write these tags? For examples, I am using the "note" field right now; for glosses, the field "tag". But the fields "categories" and "topics" are also available.
In case this is worked on further, I noticed that French Module:langues/data uses a different idiom than the current wiktextract language data format / code is equipped to handle. It has sections for code "redirects" (search "Redirections" on the linked page) where defunct codes are mapped to canonical codes, so that the defunct codes end up with the same data as the canonical codes. This will cause wiktextract to end up with a number of incorrect mappings from name to code.
This is correct. By default wiktextract does produce such objects. Right now I have just ignored these results in the final dictionary, but you're right that they deserve special treatment.
From my side: Having achieved my own minimal goals, I made the cardinal mistake of moving on, thinking that I would get around to the finishing touches soon enough and get this pull request merged... I still haven't found the time, but I plan to get back to this.
I mean in Module:langues/data, there are "redirection" lines like this:
    -- Redirections de proto-langues
    l['alg-pro'] = l['proto-algonquien']
so both
    l['proto-algonquien'] = { nom = 'proto-algonquien', tri = 'algonquien proto' }
so in
    "proto-algonquien": [
        "Proto-algonquien"
    ],
and
    "alg-pro": [
        "Proto-algonquien"
    ],
so when
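The consequence of such redirect lines can be sketched as follows. This is only an illustration under assumptions: the `code_to_names` dictionary mimics the shape of the JSON shown above, the "fr" entry is added purely as a contrast case, and the helper names are hypothetical rather than part of wiktextract.

```python
def naive_name_to_code(code_to_names):
    """Invert code -> names; a later code silently overwrites an earlier one."""
    return {name: code
            for code, names in code_to_names.items()
            for name in names}


def ambiguous_names(code_to_names):
    """Return every name that is claimed by more than one code."""
    name_to_codes = {}
    for code, names in code_to_names.items():
        for name in names:
            name_to_codes.setdefault(name, []).append(code)
    return {name: sorted(codes)
            for name, codes in name_to_codes.items()
            if len(codes) > 1}


# Illustrative data in the shape quoted above (assumption: "fr" added as contrast)
code_to_names = {
    "proto-algonquien": ["Proto-algonquien"],
    "alg-pro": ["Proto-algonquien"],
    "fr": ["Français"],
}

inverted = naive_name_to_code(code_to_names)
conflicts = ambiguous_names(code_to_names)

print(inverted["Proto-algonquien"])  # whichever code was seen last wins
print(conflicts)
```

Detecting the conflicts is only the first step; deciding which code is canonical would need the redirect information from the module itself.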
The master branch now loads page extractor code dynamically according to the dump file's language code.
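As a rough sketch of what loading code by language code can look like in Python, the snippet below uses `importlib.import_module` with a fallback. The function `load_extractor`, its parameters and the fallback behaviour are assumptions made for illustration; the actual package layout and loading logic in wiktextract may differ. A standard-library package is used in the demonstration so the snippet runs anywhere.

```python
import importlib


def load_extractor(package, lang_code, fallback="en"):
    """Import `<package>.<lang_code>`, or `<package>.<fallback>` if it is missing."""
    try:
        return importlib.import_module(f"{package}.{lang_code}")
    except ModuleNotFoundError:
        # No module for this language code: fall back to the default one.
        return importlib.import_module(f"{package}.{fallback}")


# Demonstration with the stdlib "encodings" package standing in for an
# extractor package that has one submodule per language code.
found = load_extractor("encodings", "utf_8", fallback="ascii")
missing = load_extractor("encodings", "no_such_codec", fallback="ascii")

print(found.__name__)
print(missing.__name__)
```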
Hi everyone, I just want to let you know that I am back to working on this. My plan is also to add support for other editions of Wiktionary a little later. I have seen that you made some major changes, and I will now look into whether it makes sense to rebase or to start from scratch (hopefully reusing code from the current pull request).
@xxyzz Are you also actively working on extracting the French Wiktionary? I see that you recently made some commits related to this. Maybe we should coordinate, or just aim for incremental pull requests? Looking forward to hearing your thoughts.
I have recently been writing the French Wiktionary extractor; please check out the "extractor/fr" folder and the "tests/test_fr_*" files. The code can currently extract some gloss, pronunciation, example, translation and inflection data. I have only tested a few pages, and many more cases and templates still need to be implemented. If you try the code on some pages, you will certainly find many cases that it doesn't handle; please send pull requests to improve the extractor code or create an issue for discussion.
I haven't added code for extracting thesaurus pages yet, because these pages don't use a similar layout. You will see many errors because the English Wiktionary thesaurus extractor is used for them; you can ignore those errors.
@xxyzz Great work already! Then this pull request can be closed. It no longer contributes anything new.
This pull request adds support for parsing the French Wiktionary from the dump file.
This is a work in progress and relies on an update of wikitextprocessor (see the pull request there). However, I saw in #205 that others plan to work on this as well, so I thought I would share the progress I have made so far.
My current interest lies mainly in extracting glosses and example sentences from the French Wiktionary for French. So this is what I plan to support.
Approach
if ctx.lang_code == "fr": to not affect the rest of the code base.
What has been done
RECOGNIZED_PREFIXES and IGNORED_PREFIXES
extract_examples_fr: Custom function following the French Wiktionary conventions for examples
parse_translation_item_fr: Custom function following the French Wiktionary conventions for translation items
What I still plan to do
Ideas for the future
It is my impression that the French Wiktionary makes much more use of templates and enforces more standards than the English Wiktionary. That has the potential to lead to a much cleaner extraction result but means that many sections and page types need special treatment for wiktextract to run correctly.
There is potentially much more to do to extract all useful information. But these are the big ones I see.
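To make the point about templates concrete, here is a minimal sketch of reading an example sentence straight from a template call. It rests on several assumptions: the sample wikitext is invented, the `source` and `lang` parameter names are assumed for the exemple template, and the helper `parse_exemple` is hypothetical. It splits naively on "|" and so ignores nested templates and piped links, which a real extractor would handle through the parsed wikitext tree rather than a regular expression.

```python
import re

# Matches a flat {{exemple|...}} call (no nested templates inside).
EXEMPLE_RE = re.compile(r"\{\{\s*exemple\s*\|(.*?)\}\}", re.DOTALL)


def parse_exemple(wikitext):
    """Return one dict per {{exemple|...}} call found in the wikitext."""
    results = []
    for match in EXEMPLE_RE.finditer(wikitext):
        positional, named = [], {}
        for part in match.group(1).split("|"):
            key, sep, value = part.partition("=")
            if sep and key.strip().isidentifier():
                named[key.strip()] = value.strip()   # named argument
            else:
                positional.append(part.strip())      # positional argument
        results.append({
            "text": positional[0] if positional else "",
            "source": named.get("source", ""),
            "lang": named.get("lang", ""),
        })
    return results


# Invented sample line, for illustration only
sample = "#* {{exemple|Il pleut à verse.|source=Auteur inconnu|lang=fr}}"
examples = parse_exemple(sample)
print(examples)
```

Because the sentence and its source arrive as separate template arguments, no heuristic is needed to tell them apart, which is where the cleaner extraction result would come from.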
Questions for discussion
PS: This is my first-ever contribution to an Open Source project. So please be kind if I am violating any conventions. I tried to be as organized as possible.