Initial impressions and questions of flashgeotext for extracting countries from affiliations #16
Comments
Hi, nice of you to give it a try.
Country mentions are not tracked the same way that GeoText tracks them. I wanted users to add their own data to a LookupDataPool, which is then looked up in a text. I don't cross-count across LookupData; instead, every LookupData is looked up separately.
from flashgeotext.lookup import load_data_from_file
from flashgeotext.settings import DEMODATA_CITIES, DEMODATA_COUNTRIES
cities = load_data_from_file(DEMODATA_CITIES)
countries = load_data_from_file(DEMODATA_COUNTRIES)
print(cities["Paris"])
print(cities["Parys"])
A single keyword in LookupData looks like this:
{
"Paris" : ['Pa-ri', 'Paarys', 'Paraeis', 'Paras', 'Pari',
'Paries', 'Parigi', 'Pariis', 'Pariisi', 'Parij',
'Parijs', 'Paris', 'Parisi', 'Parixe', 'Pariz',
'Parize', 'Parizh', 'Parizo', 'Parizs', 'Pariž',
'Parys', 'Paryzius', 'Paryžius', 'Paräis', 'París',
'Parîs', 'Párizs', 'paris', 'pyaris', 'Paris'],
[...]
}
"Parys" is a synonym of "Paris" and vice versa. The demo data is just that: demo data to show the functionality. It is taken from geonames, just like GeoText's data, but flashgeotext also uses the synonyms from geonames. I keep those synonyms that have a 70% fuzzy ratio with the original keyword. The demo data could probably use some more cleaning. But what I actually want is for people to create their own datasets of what they're looking for and use flashgeotext as the means to extract them.
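To illustrate the synonym filtering described above, here is a minimal sketch. It assumes a 0.7 similarity threshold and uses the standard library's difflib.SequenceMatcher as a stand-in; the fuzzy-matching library actually used to build the demo data is not stated in this thread, and the function name similar_synonyms and the candidate list are made up for illustration.

```python
from difflib import SequenceMatcher


def similar_synonyms(keyword, candidates, threshold=0.7):
    """Keep candidates whose similarity ratio to keyword meets the threshold.

    Hypothetical helper: SequenceMatcher is an assumption, not necessarily
    the ratio that flashgeotext's demo data was built with.
    """
    kept = []
    for candidate in candidates:
        ratio = SequenceMatcher(None, keyword.lower(), candidate.lower()).ratio()
        if ratio >= threshold:
            kept.append(candidate)
    return kept


# Illustrative candidates only; not taken from the real geonames data.
candidates = ["Parys", "Parigi", "Lutetia", "Pariisi"]
filtered = similar_synonyms("Paris", candidates)
print(filtered)
```

With a filter like this, spellings close to the keyword survive while unrelated alternate names are dropped, which would explain why "Parys" ends up as a synonym of "Paris".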
"University" is part of the geonames city names for some reason, so yeah, another data problem. Thanks for your feedback, I am really happy about it. If you have suggestions on how to improve it, I am all ears.
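The "every LookupData is looked up separately" behaviour can be pictured with a small pure-Python sketch. This is not flashgeotext's implementation or API: the lookups dictionary, the extract function, and the sample text are hypothetical, and plain regular expressions stand in for the library's keyword matching.

```python
import re

# Hypothetical miniature datasets: canonical name -> list of synonyms.
lookups = {
    "cities": {"Paris": ["Paris", "Parys"], "Pittsburgh": ["Pittsburgh"]},
    "countries": {"United States": ["United States", "USA"], "France": ["France"]},
}


def extract(text, lookups):
    """Count matches per dataset; datasets never influence each other."""
    result = {}
    for name, data in lookups.items():
        counts = {}
        for canonical, synonyms in data.items():
            # Sum whole-word matches of every synonym under its canonical name.
            n = sum(
                len(re.findall(r"\b" + re.escape(s) + r"\b", text))
                for s in synonyms
            )
            if n:
                counts[canonical] = n
        result[name] = counts
    return result


# Made-up affiliation string for illustration.
text = "University of Pittsburgh, Pittsburgh, PA, USA"
result = extract(text, lookups)
print(result)
```

In this sketch the city matches stay in the "cities" bucket and never raise the count for "United States", which mirrors the separate-lookup design described above.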
Thanks @iwpnd for the explanation. Looks like you're building a strong foundation for really fast extraction of places from text. As a user, I'm looking for a package that can extract the number of country mentions (including those inferred by city cross-counting) in English texts. Ideally, there'd be a way to use flashgeotext with data that closely matches my intended application, with little to no extra work needed. I see how you're looking to build a lower-level application to support data that I could plug in. However, I think you'll get a lot more users if you can also provide the data that supports their intended application. Hopefully, more of a community can develop to offload some of this work from your shoulders! Thanks again, looking to see how this project evolves, since it seems like a sturdy foundation... but probably needs a few second-layer features to appeal to my use case including:
Cheers!
@dhimmel thanks for your feedback. A lot of good points. :)
Thanks for posting at elyase/geotext#23 (comment) letting me know about this package. I'm interested in it as a way to extract countries referred to in author affiliations.
For example, here is an affiliation:
For my project, I'd like to know what countries are mentioned (either directly or inferred from a place mention inside that country).
If I run the following (with v0.2.0):
I get the following output:
Some impressions / questions:
Are the city mentions counting towards country mentions? If yes, why does "United States" not have a count of 2 for "Pittsburgh" and "USA"?
Why is "Parys" returned for "Paris"? I'm not sure why this conversion is made.
Counting "University" as a city will almost always be a false positive for us, although I'm guessing this is a source data issue.
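For reference, the cross-counting asked about above could look roughly like the following sketch. It is a post-processing step on top of separate city and country counts, not a feature of flashgeotext; the CITY_TO_COUNTRY mapping and the cross_count function are hypothetical, and a real mapping would have to come from a gazetteer such as geonames.

```python
from collections import Counter

# Hypothetical mapping; a real one would come from a gazetteer such as geonames.
CITY_TO_COUNTRY = {"Pittsburgh": "United States", "Paris": "France"}


def cross_count(city_counts, country_counts, city_to_country=CITY_TO_COUNTRY):
    """Add each city's mention count to the country that contains it."""
    totals = Counter(country_counts)
    for city, count in city_counts.items():
        country = city_to_country.get(city)
        # Cities without a known country are ignored rather than guessed.
        if country is not None:
            totals[country] += count
    return dict(totals)


# One "Pittsburgh" mention plus one direct "USA" mention.
totals = cross_count({"Pittsburgh": 1}, {"United States": 1})
print(totals)
```

Under these assumptions, one "Pittsburgh" mention and one "USA" mention roll up to a count of 2 for "United States".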
Thanks for considering this feedback / helping answer any of these questions.