form_of incorrectly adding dot #125

Closed
medmunds opened this issue Mar 16, 2022 · 5 comments

@medmunds

The English 2022-03-15 "okeh" extracted data indicates okeh is a form of OK. (with a dot, for both the intj and verb entries):

      "form_of": [
        {
          "word": "OK."
        }
      ],

But the word is actually a form of OK (without a dot). Here's the relevant part of the "okeh" Wiktionary source:

# {{dated form of|en|OK}}

which expands to:

  1. Dated form of OK.

This seems related to #101. I'm guessing several other templates also expand to sentences ending in a full stop, which could cause similar confusion.

The relationship is clear from the unexpanded template (and from the html expansion … <a>OK</a>.). I can see the plain text expansion would be ambiguous. I wonder if it might be helpful for some logic to keep the unexpanded gloss templates available, similar to having both etymology_templates and etymology_text available.

[Thanks for wiktextract and kaikki.org, by the way. They're amazingly useful—and incredibly ambitious!]

@medmunds
Author

(Also, I suspect Template:dated form of and Category:{{lang}} dated forms identify an alt-of relationship—a historical spelling variant—rather than a grammatical form-of relationship. Similar to how wiktextract treats Template:alternative case form of as alt-of.)

@tatuylonen
Owner

The problem here is that both words "OK" and "OK." exist in Wiktionary. Wiktextract currently removes the dot only if the version without it exists and the one with it doesn't. The difficulty is that hundreds if not thousands of templates are used to generate form-of/alt-of glosses, and they are inconsistent in whether they add the dot.
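To illustrate why "okeh" trips this heuristic, here is a minimal sketch of the existence check as described in this comment. It is not wiktextract's actual code: the function name `strip_trailing_dot` and the `known_titles` set are hypothetical stand-ins for whatever page-title lookup the extractor really uses.

```python
def strip_trailing_dot(word: str, known_titles: set) -> str:
    """Drop a trailing '.' only when the dotless title exists
    and the dotted title does not (hypothetical helper)."""
    if not word.endswith("."):
        return word
    bare = word[:-1]
    if bare in known_titles and word not in known_titles:
        return bare
    # Either both titles exist or neither does, so the dot is kept.
    return word


# Hypothetical title set: both "OK" and "OK." are pages, "colour." is not.
titles = {"OK", "OK.", "colour"}
print(strip_trailing_dot("colour.", titles))
print(strip_trailing_dot("OK.", titles))
```

Because both "OK" and "OK." are in the title set, the second call leaves the dot in place, which matches the behaviour reported for "okeh".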

Perhaps an answer might be to look at template arguments and see if the linked word occurs there with a dot or without a dot... I think I'll try some experiments on this.

@medmunds
Author

medmunds commented Apr 5, 2022

> Perhaps an answer might be to look at template arguments and see if the linked word occurs there with a dot or without a dot

Could wiktextract just treat the positional template argument(s) as the root form(s), always? Rather than trying to parse the root from the rendered plaintext. (Ignoring the first "langcode" arg, of course.)

It seems like the template arguments would be unambiguous when available. This would also solve a lot of tricky parsing problems like in #126.
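As a rough sketch of that suggestion, the snippet below pulls the positional arguments out of a simple, un-nested template call and takes the first one after the language code as the root form. The helpers `positional_args` and `root_form` are hypothetical, not part of wiktextract, and the naive split on `|` does not handle nested templates or wiki links; it also assumes any later positional arguments are not additional roots.

```python
from typing import List, Optional


def positional_args(template: str) -> List[str]:
    """Return the positional arguments of a flat {{name|a|b|k=v}} call."""
    text = template.strip()
    if not (text.startswith("{{") and text.endswith("}}")):
        raise ValueError("not a template call: %r" % template)
    parts = text[2:-2].split("|")
    # parts[0] is the template name; named arguments contain "=".
    return [p.strip() for p in parts[1:] if "=" not in p]


def root_form(template: str) -> Optional[str]:
    """Assume the first positional argument is the language code
    and the next one is the root form."""
    args = positional_args(template)
    return args[1] if len(args) > 1 else None


print(root_form("{{dated form of|en|OK}}"))
```

For the "okeh" template quoted earlier in this thread, this yields the root without any trailing dot, since the dot only appears in the rendered sentence.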

@tatuylonen
Owner

I just committed changes that look at template arguments to determine whether the trailing "." should be removed. That change also fixes "okeh".

The changes should be reflected on https://kaikki.org in a couple of days.
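The committed change itself is not shown in this thread. As a hedged illustration of the general idea only, the sketch below uses the template arguments to decide whether a trailing "." on the rendered word belongs to the word or to the sentence. The function name `resolve_form_of` and its signature are assumptions made for this example.

```python
from typing import Sequence


def resolve_form_of(rendered_word: str, template_args: Sequence[str]) -> str:
    """Pick the dotted or dotless spelling based on which one
    literally appears among the template arguments (hypothetical)."""
    if rendered_word in template_args:
        # The template really linked to the dotted form, e.g. "OK.".
        return rendered_word
    if rendered_word.endswith(".") and rendered_word[:-1] in template_args:
        # The dot came from the sentence-final full stop, so drop it.
        return rendered_word[:-1]
    # No evidence either way; leave the rendered text untouched.
    return rendered_word


print(resolve_form_of("OK.", ["en", "OK"]))
print(resolve_form_of("OK.", ["en", "OK."]))
```

With arguments taken from `{{dated form of|en|OK}}`, the first call returns the dotless word, while a template that names "OK." explicitly keeps its dot.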

@medmunds
Author

Great, thanks! I'll pull the updated data and take a look.
