form_of including trailing gloss text #126
Comments
This issue can't easily be fixed in code. I'll take some time tomorrow to go through a list of obviously problematic cases we generated today and fix them in Wiktionary. These are entries in languages other than English, and the filter we used was basically "native-language-word english-language-word", applied when the result contains exactly two words. That catches many cases where the translation is a single word, for example "Form of fragā strawberry". For wetlands-style entries, the correct fix is to use the |t=| parameter in {{plural of}}: https://en.wiktionary.org/wiki/Template:plural_of. I've corrected the wetlands article if you want to take a look.
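For reference, the two-word filter described above could look roughly like the sketch below. It is not the script that was actually used: the function name, the `ENGLISH_WORDS` set, and the extra check that the first word is not itself English are all assumptions made for illustration.

```python
from __future__ import annotations


def is_suspicious_form_of(form_of: str, english_words: set[str]) -> bool:
    """Flag a form_of value that looks like 'native-word english-word'."""
    parts = form_of.split()
    if len(parts) != 2:
        # The filter only looks at results with exactly two words.
        return False
    base, gloss = parts
    # Assumption: a word counts as English if it is in the supplied word list.
    return gloss.lower() in english_words and base.lower() not in english_words


# Hypothetical, tiny word list; a real run would need a proper English lexicon.
ENGLISH_WORDS = {"strawberry", "area", "protects"}

print(is_suspicious_form_of("fragā strawberry", ENGLISH_WORDS))  # True
print(is_suspicious_form_of("fragā", ENGLISH_WORDS))             # False
```

A filter like this only produces candidates for manual review; it will miss cases where the trailing gloss is longer than one word.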
I spent the whole day going through a short list of error candidates for this: "form of" entries that look suspicious because their contents are a native-language word followed by English-language words (fraga strawberry). Editing these by hand, as I did, isn't really feasible. If we want this fixed on Wiktionary, we might need to learn about Wikimedia bots that can execute lists of generated edits (inserting |t=|) after getting user approval.
Thanks for digging into this. As you say, there are a ton (several tons!) of poorly constructed entries. And the correct form_of is not always a single word (see the second example below). I've also noticed similar problems with Does wiktextract have access to either the wiki source or the rendered HTML where it's trying to extract the base form? I think it's often unambiguous from that, because the base form is a single linked item—either a template parameter or a raw Three examples (2022-04-04 extraction):
Extracting form_of from the wiki source or html would also avoid some false positives where the gloss happens to resemble a form-of definition. If there's not a link in the definition, it's not really a form_of. E.g.:
I decided to implement simpler but less general fixes for now. (I am also not sure the link approach would always work; I don't think there are links in all cases, though I don't have a counterexample at hand.) I fixed "jakes" and other similar cases by recognizing "in its various senses" as not being part of the base. Let me know if you find other similar issues. The changes should be reflected on https://kaikki.org in a couple of days.
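To illustrate the kind of fix described here, a minimal sketch follows. It is not wiktextract's actual code: the function name and the `TRAILING_QUALIFIERS` tuple are hypothetical, and the input string is an assumed example of what such a gloss might contain. Only the phrase "in its various senses" comes from this thread.

```python
import re

# Assumption: phrases that may trail the base word without being part of it.
# Only "in its various senses" is taken from the discussion.
TRAILING_QUALIFIERS = ("in its various senses",)


def strip_trailing_qualifier(base: str) -> str:
    """Remove a known trailing qualifier phrase from a candidate base form."""
    for phrase in TRAILING_QUALIFIERS:
        # Allow separators before the phrase and a final period after it.
        pattern = r"[\s,;]*\b" + re.escape(phrase) + r"[\s.]*$"
        base = re.sub(pattern, "", base, flags=re.IGNORECASE)
    return base.strip()


print(strip_trailing_qualifier("jake in its various senses"))  # jake
print(strip_trailing_qualifier("shield"))                      # shield
```

Keeping the phrases in a single list makes it easy to add further qualifiers as they are reported.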
I also made a few other changes, including no longer interpreting "root of" as a form_of. I checked, and none of the "root of" glosses was really a form_of.
This sounds like it should help, thanks. I'm grabbing the latest data now, and will update this after taking a look. Incidentally, I found these by looking through cases where
That led to a substantial improvement in the 2022-04-29 extraction—thanks!
There are 126 form_of that seem to be parsing problems in the English 2022-04-29 extraction—full list attached below. Here are some patterns that might be worth special casing:
Also, "so" gloss "Reduced form of 'so that'" is being parsed as a form_of
Again, I suspect examining rendered links or template parameters could solve all of these. But if that isn't feasible, then the remaining cases are probably best fixed by editing Wiktionary. (And the list is short enough now that I'll probably start doing that.)
Attachment: form-of-misparsed.txt
When the Wiktionary definition includes both a form template and additional text, that additional text is being incorrectly included in the extracted form_of root word. Some examples:
"wetlands" - kaikki.org (2022-03-15 extraction)
"wetlands" - Wiktionary
# {{plural of|en|wetland}} An area or region that is characteristically saturated; a marsh.
"shields" (verb entry) - kaikki.org (2022-03-15 extraction)
"shields" - Wiktionary
# {{en-third-person singular of|shield}}. Protects
(Arguably these Wiktionary entries could use some cleanup, but apparently this pattern is in use.)
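To make the desired behaviour concrete, here is a rough sketch that separates the template's base word from the trailing gloss text on the two wikitext lines above. It is not how wiktextract works internally; `split_form_of` is a hypothetical helper, and the rule for which positional argument holds the base (first for "en-" templates, second for others, after a language code) is an assumption based only on these two examples.

```python
import re

# Matches the first {{...}} template on a line and whatever text follows it.
TEMPLATE_RE = re.compile(r"\{\{([^{}]+)\}\}(.*)")


def split_form_of(line: str):
    """Return (base, trailing_text) for a definition line, or None."""
    match = TEMPLATE_RE.search(line)
    if match is None:
        return None
    name, *args = match.group(1).split("|")
    # Keep positional arguments only; named ones such as t=... contain "=".
    positional = [a.strip() for a in args if "=" not in a]
    # Assumption: "en-..." templates take the base first; other templates
    # take a language code first and the base second.
    index = 0 if name.strip().startswith("en-") else 1
    if len(positional) <= index:
        return None
    trailing = match.group(2).strip(" .")
    return positional[index], trailing


wetlands = "# {{plural of|en|wetland}} An area or region that is characteristically saturated; a marsh."
shields = "# {{en-third-person singular of|shield}}. Protects"

print(split_form_of(wetlands))
print(split_form_of(shields))  # ('shield', 'Protects')
```

With this split, the base would be "wetland" and "shield" respectively, and the trailing text could be kept as a separate gloss instead of being appended to form_of.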