Extract semantic relations from German Wiktionary. #375

Merged 1 commit on Oct 20, 2023

empiriker
Contributor

This adds support for extracting several kinds of semantic relations from the German Wiktionary. It covers the most prevalent way in which these sections are structured.

It does not:

  • extract modifiers (tags) of a semantic relation, e.g. in {{Synonyme}}\n:[1] [[Kokosnusspalme]], ''wissenschaftlich:'' [[Cocos nucifera]] from Kokospalme, the tag wissenschaftlich is ignored
  • extract semantic relations structured using sublists, e.g. Wortbildungen in Byte
  • extract semantic relations formatted other than as a wiki link, e.g. in {{Redewendungen}}:[1] ''[[aller Anfang ist schwer]]. from aller, the relation is not captured, since italics (even with links inside) generally seem to be used to modify a relation, not to create one

I looked into all of these cases but there just doesn't seem to be a general rule that allows clearly separating the semantic relations from the rest.
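
To illustrate the case that is covered, here is a minimal sketch of the idea, not the code in this PR. It assumes a section body made of lines shaped like `:[1] [[word]], ...`, and the helper names (`extract_relations`, `SENSE_LINE`, and so on) are hypothetical. Italic spans are dropped before links are collected, so a tag like wissenschaftlich is ignored and a relation wrapped in italics is not captured; sublist lines are skipped.

```python
import re

# Hypothetical names for illustration; not the identifiers used in the PR.
# An italic span: opening '' up to the closing '' (or end of line if unclosed).
ITALIC = re.compile(r"''.*?(?:''|$)")
# A wiki link; group 1 is the link target without "|label" or "#anchor".
WIKI_LINK = re.compile(r"\[\[([^\[\]|#]+)(?:[#|][^\[\]]*)?\]\]")
# A top-level list line such as ":[1] ..."; sublists ("::...") do not match.
SENSE_LINE = re.compile(r"^:\[([^\]]*)\]\s*(.*)$")


def extract_relations(section_text):
    """Collect wiki-link targets per sense from a relation section body."""
    relations = []
    for line in section_text.splitlines():
        match = SENSE_LINE.match(line)
        if match is None:
            continue  # template header, sublist, or another layout
        sense_id, rest = match.groups()
        rest = ITALIC.sub("", rest)  # drop modifiers and italicised links
        for link in WIKI_LINK.finditer(rest):
            relations.append({"sense_id": sense_id, "word": link.group(1).strip()})
    return relations


synonyms = extract_relations(
    "{{Synonyme}}\n:[1] [[Kokosnusspalme]], ''wissenschaftlich:'' [[Cocos nucifera]]"
)
```

With the Kokospalme example from above, `synonyms` holds two entries for sense 1, Kokosnusspalme and Cocos nucifera, and the tag is lost.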

For now, I think it's a great start.


FYI

  1. This is about the extent to which I intend to flesh out each Wiktionary edition that I plan to cover. I hope it gives a good starting base for anyone who wants to go deeper.
  2. @xxyzz I would like to also capture semantic relations from the French Wiktionary. Do you mind if I take a look at that? Or do you plan to implement it soon?
  3. When I tackle the next Wiktionary edition (probably Spanish), I will attempt to use pydantic from the get-go. If it's easy to use, I will probably add it to the German extractor as well. Otherwise, I will just add the JSON schema manually.

@xxyzz
Collaborator

xxyzz commented Oct 20, 2023

> 2. I would like to also capture semantic relations from the French Wiktionary. Do you mind if I take a look at that? Or do you plan to implement it soon?

French Wiktionary's synonym lists look very similar to its translation lists, so they could maybe share some code. Please feel free to implement the feature.

> 3. When I tackle the next Wiktionary edition (probably Spanish), I will attempt to use pydantic from the get-go. If it's easy to use, I will probably add it to the German extractor as well. Otherwise, I will just add the JSON schema manually.

That would be great! Does pydantic also add comments to the schema?

This work is a contribution to the EWOK project, which receives funding from LABEX ASLAN (ANR–10–LABX–0081) at the Université de Lyon, as part of the "Investissements d'Avenir" program initiated and overseen by the Agence Nationale de la Recherche (ANR) in France.

Fix types for python3.9

Import SEMANTIC_RELATIONS into pages.py
@empiriker
Contributor Author

> French Wiktionary's synonym lists look very similar to its translation lists, so they could maybe share some code. Please feel free to implement the feature.

Alright. I'll take a look now.

> That would be great! Does pydantic also add comments to the schema?

It has a `description` field, which should be mapped to a JSON schema comment. I will keep this in mind when testing it out.
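
As a rough sketch of what that could look like (assuming pydantic is installed; the `WordEntry` model and its fields are hypothetical and not part of this PR), `Field(description=...)` ends up as the `description` keyword of the property in the generated JSON schema:

```python
from typing import List

from pydantic import BaseModel, Field


class WordEntry(BaseModel):
    """Hypothetical model for illustration; not the extractor's real schema."""

    word: str = Field(description="Headword of the entry")
    synonyms: List[str] = Field(
        default_factory=list,
        description="Words linked from the Synonyme section",
    )


# pydantic v2 exposes model_json_schema(); v1 uses schema() instead.
if hasattr(WordEntry, "model_json_schema"):
    schema = WordEntry.model_json_schema()
else:
    schema = WordEntry.schema()

word_description = schema["properties"]["word"]["description"]
```

Whether this is enough to replace hand-written schema comments would still need to be checked against the existing extractors.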

@xxyzz xxyzz merged commit b7d8d2d into tatuylonen:master Oct 20, 2023
5 checks passed