General Statistics of Retrievable Wiki Entities #157
Hello! When the service starts, the console prints some basic statistics on the loaded data. For instance, with the currently provided KB files:
The upper knowledge base is Wikidata, so this indicates the number of Wikidata entities loaded (= concepts) and the number of Wikidata statements. By default, for a given entity, we only load its statements if the entity exists in at least one language-specific Wikipedia; there is a config parameter to change this and load all the statements:
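To illustrate that default behaviour, here is a minimal sketch of the filtering step in Python. The function and field names are hypothetical and only meant to convey the logic, not the actual grisp/entity-fishing implementation.

```python
# Hypothetical sketch of the default statement-loading policy:
# keep the statements of a Wikidata entity only if the entity has a page
# in at least one of the installed language-specific Wikipedias.

def load_statements(entities, installed_languages, load_all=False):
    """entities: iterable of dicts with 'qid', 'sitelinks' (lang -> page title)
    and 'statements' (list of claims). Returns a mapping qid -> statements."""
    loaded = {}
    for entity in entities:
        has_wikipedia_page = any(
            lang in entity["sitelinks"] for lang in installed_languages
        )
        if load_all or has_wikipedia_page:
            loaded[entity["qid"]] = entity["statements"]
    return loaded

# Example: only Q42 has a sitelink in an installed language, so only its
# statements are kept unless load_all=True is passed (the equivalent of the
# config switch mentioned above).
entities = [
    {"qid": "Q42", "sitelinks": {"en": "Douglas Adams"}, "statements": ["P31:Q5"]},
    {"qid": "Q999999999", "sitelinks": {}, "statements": ["P31:Q5"]},
]
print(load_statements(entities, installed_languages={"en", "fr", "de"}))
```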
This is based on the Wikidata dump from December 2022. Then, we have the basic Wikipedia loading information for every installed/supported language:
Parsing and compilation of all the Wikidata and Wikipedia resources is done by https://github.com/kermitt2/grisp. I should certainly output more statistics on everything loaded and create a web service to get all the statistics of the current KB used by a running service. About the distribution of entities according to categories, I am not sure what you mean by categories, but you will find such statistics on the Wikidata web site.
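If the distribution by entity class is what is meant, one way to get such numbers independently of entity-fishing is the public Wikidata SPARQL endpoint (https://query.wikidata.org/sparql). The sketch below counts the entities that are an instance of (P31) a given class, e.g. human (Q5); note that large aggregate queries can hit the public endpoint's timeout, so this is only a starting point.

```python
import requests

# Count Wikidata entities that are an instance of (P31) a given class.
# Aggregations over all of Wikidata may time out on the public endpoint.
SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def count_instances_of(class_qid: str) -> int:
    query = f"""
    SELECT (COUNT(?item) AS ?count)
    WHERE {{ ?item wdt:P31 wd:{class_qid} . }}
    """
    response = requests.get(
        SPARQL_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "kb-statistics-example/0.1"},
        timeout=120,
    )
    response.raise_for_status()
    return int(response.json()["results"]["bindings"][0]["count"]["value"])

# Example: number of entities typed as human (Q5)
print(count_instances_of("Q5"))
```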
I much appreciate the reply. I have a couple of follow-up questions. Does that mean all the fetched Wikipedia pages (a total of around 62M when summing over all languages) and the 18271696 Wikidata entities can be linked to mentions in the text, while the rest of the Wikidata ids cannot? It is also quite a surprise to see how small the number of loaded statements is compared to the number of concepts, so I wonder why that is.
@ndenStanford The current state of the tool is as follows:
This is why it is currently hard to support languages with less than one million Wikipedia pages: without enough training material in the language (pages and page interlinking), we don't have enough data to disambiguate terms. But I think it is still useful to have access to the whole of Wikidata (the 100M entities) via the KB API, because, for example, I have other tools doing more specialized entity mention extraction and I exploit and link these other entities directly. So the 100M entities are loaded.
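To make the last point concrete, here is a sketch of looking up an arbitrary Wikidata entity through the KB API of a running service. It assumes a default local installation (port 8090) and the concept lookup route as described in the entity-fishing documentation; the host, port, path and response field names may need adjusting to your deployment and version.

```python
import requests

# Fetch a concept by its Wikidata identifier from a locally running
# entity-fishing instance. Host/port/path assume a default installation;
# adjust them to match your own deployment.
BASE_URL = "http://localhost:8090/service"

def get_concept(wikidata_id: str, lang: str = "en") -> dict:
    response = requests.get(
        f"{BASE_URL}/kb/concept/{wikidata_id}",
        params={"lang": lang},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

# Example: metadata and statements for Douglas Adams (Q42).
# Field names (preferredTerm, statements) may vary with the version.
concept = get_concept("Q42")
print(concept.get("preferredTerm"), len(concept.get("statements", [])))
```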
As I wrote, by default we only load the statements of an entity if this entity "exists" in at least one language-specific Wikipedia. It appears that usually the same subset of entities is present in the different Wikipedias, and not every page in a Wikipedia corresponds to a Wikidata entity (many Wikipedia pages are redirections, disambiguation pages, categories, etc.; only articles are mapped to a Wikidata entity, for example around 6M for English out of 18M pages). It is possible to load all the statements by changing the config; for example, I have an older version of entity-fishing with all statements loaded and it amounts to more than 1B statements. Due to the size of the resulting DB and the extra time spent indexing it, this is turned off by default.
Thank you very much for the clarification :)
Would you mind letting me know where I can see the general statistics of the retrievable wiki entities? I would love to understand the size of the retrievable knowledge base broken down by entity category. If such statistics are not available, could you please elaborate on how I can read from the database directly? I see that the keys and values of the database are encoded, and I am not able to decode them since I do not know how they were encoded.
I appreciate your attention to this matter :)
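For reference, entity-fishing keeps its compiled knowledge base in LMDB environments on disk (under the data/db directory of an installation), so basic size statistics can be read without decoding anything. The sketch below, using the Python lmdb package, only prints entry counts and raw key/value sizes; the serialization of keys and values is internal to entity-fishing/grisp, so it does not attempt to decode them, and the directory layout shown here is an assumption that may differ across versions.

```python
import lmdb  # pip install lmdb
from pathlib import Path

# Entity-fishing stores its compiled KB as LMDB environments. This sketch only
# reports entry counts and raw byte sizes; keys/values use entity-fishing's
# internal serialization and are left undecoded. The path is a placeholder.
DB_ROOT = Path("data/db/db-kb")  # hypothetical path, adjust to your setup

def print_env_stats(env_path: Path, sample: int = 3) -> None:
    """Open one LMDB environment read-only and print basic statistics."""
    env = lmdb.open(str(env_path), readonly=True, lock=False)
    print(env_path.name, "entries:", env.stat()["entries"])
    with env.begin() as txn:
        for i, (key, value) in enumerate(txn.cursor()):
            if i >= sample:
                break
            print("  key bytes:", len(key), "value bytes:", len(value))
    env.close()

# The exact layout (one environment per sub-database or not) may vary, so
# iterate over whatever environment directories are present.
for sub in sorted(DB_ROOT.iterdir()):
    if sub.is_dir():
        print_env_stats(sub)
```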