Extract translations from German Wiktionary #369

Merged on Oct 19, 2023 (2 commits)

empiriker
Contributor

@empiriker empiriker commented Oct 19, 2023

I have covered the default cases of the translation tables in the German Wiktionary. It should extract >95% (probably more) of the usable information.

Notably, there are three rare types of data/formats that I don't want to cover at the moment:

  • Translation links to another page (ca. 1.1% of pages)
  • Dialect information (ca. 0.12% of pages)
  • Non-standard senseid formats (currently extracted but not sense-disambiguated) (ca. 0.03% of pages)

The percentages come from the sample of ~18000 pages that I test on during development. I left comments and dummy code in the files to keep track of this and suggest an entry point for future implementation.
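
For readers unfamiliar with the format, here is a rough, standalone sketch of what the default case looks like. It is not the code from this PR: the function name, the regular expressions, and the output keys are all hypothetical, and it assumes a translation line of the shape `*{{en}}: [1] {{Ü|en|dog}}` with only the `Ü` and `Üt` templates. Sense markers that do not look like plain numbers, ranges, or lists are kept raw rather than disambiguated, mirroring the third bullet above.

```python
import re

# Hypothetical, simplified patterns; the real extractor works on parsed wikitext.
# Assumed line shape: "*{{en}}: [1] {{Ü|en|dog}}, {{Ü|en|hound}}"
LINE_RE = re.compile(r"^\*+\s*\{\{(?P<lang>[^}|]+)\}\}:\s*(?P<rest>.*)$")
# A single-bracket sense marker such as [1] or [2a]; [[wiki links]] are skipped.
SENSE_RE = re.compile(r"(?<!\[)\[([^\[\]]*)\](?!\])")
# {{Ü|lang|word}} or {{Üt|lang|word|transliteration}}; extra arguments are ignored.
TRANSLATION_RE = re.compile(r"\{\{Üt?\|([^|}]+)\|([^|}]+)[^}]*\}\}")
# What this sketch treats as a "standard" sense id: 1, 2a, "1, 2", "1–3".
STANDARD_SENSE_RE = re.compile(r"^\d+[a-z]?(?:\s*[,–-]\s*\d+[a-z]?)*$")


def parse_translation_line(line):
    """Return a list of translation dicts found on one table line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return []

    results = []
    # split() with one capture group yields:
    # [text_before_first_marker, marker_1, text_1, marker_2, text_2, ...]
    parts = SENSE_RE.split(match.group("rest"))
    for i in range(1, len(parts), 2):
        sense = parts[i].strip()
        for found in TRANSLATION_RE.finditer(parts[i + 1]):
            entry = {"lang_code": found.group(1), "word": found.group(2)}
            if STANDARD_SENSE_RE.match(sense):
                entry["sense"] = sense
            else:
                # Non-standard marker: keep it, but do not disambiguate.
                entry["raw_sense"] = sense
            results.append(entry)
    return results


if __name__ == "__main__":
    example = "*{{en}}: [1] {{Ü|en|dog}}, {{Ü|en|hound}}; [2a] {{Ü|en|fellow}}"
    for item in parse_translation_line(example):
        print(item)
```

Translation links to other pages and dialect annotations are deliberately not handled in this sketch either, so lines containing them would simply yield fewer or no entries.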


As a bonus, I have now configured the isort extension in my IDE and reordered the imports in the German extractor files.

This work is a contribution to the EWOK project, which receives funding from LABEX ASLAN (ANR–10–LABX–0081) at the Université de Lyon, as part of the "Investissements d'Avenir" program initiated and overseen by the Agence Nationale de la Recherche (ANR) in France.
@xxyzz xxyzz merged commit a8787ef into tatuylonen:master Oct 19, 2023
5 checks passed
@xxyzz
Collaborator

xxyzz commented Oct 19, 2023

Thanks for your contributions!
