Add word frequency db and related fix #142
Merged
Owner
kermitt2
commented
Apr 29, 2022
- add word frequency resources and LMDB
- add optional filtering based on word frequency: a term whose frequency in the general language is above a certain threshold is ignored. This speeds up processing because these terms are almost never disambiguated into an entity and, being frequent, they introduce a lot of uncertainty in the context. Speeds up English by 40% (without visible loss) and other languages by 10%
- extend support to zh, ja, ru
- fix Bad formatting of json response #141
- fix Case and term selection for French #139
- add an LMDB for Wikidata labels. This also integrates Fix #108 For KB concepts the given "valueName" of statements should be the english label instead of the wikipedia page title #110, but without the hack of a fake property P0: we create a distinct Wikidata label LMDB, which will also be used for Experiment to support Wikidata label without Wikipedia usages #72 (and Disambiguation in French: Charles Ier (Charlesmagne) #29).
- option to load statements only for entities that are present in a Wikipedia page of at least one of the supported languages: this speeds up loading of the Wikidata entities KB and significantly reduces the LMDB size
- load all upper knowledge base LMDBs in one pass over the Wikidata JSON dump
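
For illustration, the word frequency filtering described above could look roughly like the sketch below. This is not the code from this PR: the function names, the threshold value and the in-memory dictionary (standing in for the word frequency LMDB) are all assumptions made for the example.

```python
from collections import Counter


def build_relative_frequencies(tokens):
    """Compute relative frequencies from a token list.

    Hypothetical stand-in for the word frequency resources: in the real
    system these values would come from a precomputed database.
    """
    counts = Counter(token.lower() for token in tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def filter_candidate_terms(terms, frequencies, max_frequency):
    """Drop terms whose general-language frequency exceeds max_frequency.

    Terms missing from the frequency table are kept, on the assumption
    that rare or unseen terms are the most likely entity mentions.
    """
    return [
        term
        for term in terms
        if frequencies.get(term.lower(), 0.0) <= max_frequency
    ]


# Toy "general language" sample, only for the example.
corpus = "the cat sat on the mat and the dog saw the cat".split()
frequencies = build_relative_frequencies(corpus)

# "the" has relative frequency 4/12 and is ignored; "cat" (2/12) and the
# unseen term "Paris" are kept as candidates for disambiguation.
candidates = filter_candidate_terms(["the", "cat", "Paris"], frequencies, 0.2)
print(candidates)
```

Keeping unknown terms is a design choice of this sketch; an actual threshold would presumably be tuned per language, since the reported speed-up differs between English and the other languages.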