form_of incorrectly adding dot #125
Comments
(Also, I suspect Template:dated form of and Category:{{lang}} dated forms identify an alt-of relationship, that is, a historical spelling variant, rather than a grammatical form-of relationship. This is similar to how wiktextract treats Template:alternative case form of as alt-of.)
The problem here is that both words "OK" and "OK." exist in Wiktionary. The code currently removes the dot if the version without it exists but the one with it doesn't. Hundreds if not thousands of templates are used to generate form-of/alt-of glosses, and they are inconsistent in whether they add the dot. One answer might be to look at the template arguments and see whether the linked word occurs there with or without a dot. I think I'll try some experiments on this.
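To make the existence-based rule above concrete, here is a minimal sketch in Python. It is not wiktextract's actual code: the function name `strip_trailing_dot` and the `known_titles` set are made up for illustration, and a real run would check against the full list of Wiktionary page titles.

```python
def strip_trailing_dot(target: str, known_titles: set[str]) -> str:
    """Existence-based heuristic (hypothetical helper, not wiktextract's API).

    Drops a final "." only when the dotless title exists and the dotted
    title does not; otherwise the target is returned unchanged.
    """
    if not target.endswith("."):
        return target
    bare = target[:-1]
    if bare in known_titles and target not in known_titles:
        return bare
    # Either both titles exist (ambiguous) or the bare one is unknown.
    return target


# Toy title set; both "OK" and "OK." exist, as described in this thread.
titles = {"OK", "OK.", "okeh"}
print(strip_trailing_dot("OK.", titles))    # dot kept, because "OK." is also a title
print(strip_trailing_dot("okeh.", titles))  # dot removed, only "okeh" is a title
```

With both "OK" and "OK." present, the rule cannot tell sentence punctuation from a dot that belongs to the word, which is the failure reported here.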
Could wiktextract just always treat the positional template argument(s) as the root form(s), rather than trying to parse the root from the rendered plaintext? (Ignoring the first "langcode" arg, of course.) It seems like the template arguments would be unambiguous when available. This would also solve a lot of tricky parsing problems like in #126.
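A rough sketch of that idea in Python, under loud assumptions: `positional_args` and `root_form` are hypothetical names, the split on "|" is naive (it breaks on nested templates and piped links), and which positions hold the root may vary between templates, so this only takes the first argument after the language code.

```python
def positional_args(template_call: str) -> tuple[str, list[str]]:
    """Naively split "{{name|a|b|k=v}}" into (name, positional args).

    Hypothetical helper for illustration; it does not handle nested
    templates or wikilinks that contain "|".
    """
    text = template_call.strip()
    if not (text.startswith("{{") and text.endswith("}}")):
        raise ValueError("not a template call")
    parts = text[2:-2].split("|")
    name = parts[0].strip()
    # Treat anything containing "=" as a named argument and skip it.
    args = [p.strip() for p in parts[1:] if "=" not in p]
    return name, args


def root_form(template_call: str) -> str:
    """Return the first positional argument after the language code."""
    _name, args = positional_args(template_call)
    if len(args) < 2:
        raise ValueError("expected a language code and a root form")
    return args[1]


print(root_form("{{dated form of|en|OK}}"))  # the argument is "OK", with no dot
```

The appeal is that the dot question never arises: the argument is taken as written in the page source rather than recovered from a rendered sentence.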
I just committed changes that look at template arguments to determine whether the trailing "." should be removed. That change also fixes "okeh". The changes should be reflected on https://kaikki.org in a couple of days.
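For readers following along, the general shape of such a check might look like the Python sketch below. This is an assumption about the approach, not the committed code: `resolve_dot` is a hypothetical name, and `candidate` stands for the word parsed from the rendered gloss.

```python
def resolve_dot(candidate: str, template_args: list[str]) -> str:
    """Decide whether a trailing "." belongs to the linked word.

    Hypothetical sketch: trust the template arguments over the rendered
    text when they mention the word with or without the dot.
    """
    if candidate in template_args:
        # The dotted (or undotted) form appears verbatim in the arguments.
        return candidate
    if candidate.endswith(".") and candidate[:-1] in template_args:
        # Only the dotless form is an argument, so the dot is punctuation.
        return candidate[:-1]
    # No evidence either way; leave the candidate unchanged.
    return candidate


print(resolve_dot("OK.", ["en", "OK"]))   # dot dropped
print(resolve_dot("OK.", ["en", "OK."]))  # dot kept
```

When the arguments give no evidence either way, the sketch leaves the candidate alone; an existence-based fallback could be applied at that point instead.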
Great, thanks! I'll pull the updated data and take a look.
The English 2022-03-15 "okeh" extracted data indicates that "okeh" is a form of "OK." (with a dot, for both the intj and verb entries).
But the word is actually a form of "OK" (without a dot). Here's the relevant part of the "okeh" Wiktionary source:
# {{dated form of|en|OK}}
which expands to:
This seems related to #101. I'm guessing there are several other templates that expand to sentences ending in a full stop, which could cause similar confusion.
The relationship is clear from the unexpanded template (and from the HTML expansion, … <a>OK</a>.). I can see the plain text expansion would be ambiguous. I wonder if it might be helpful for some logic to keep the unexpanded gloss templates available, similar to having both etymology_templates and etymology_text available.
[Thanks for wiktextract and kaikki.org, by the way. They're amazingly useful and incredibly ambitious!]