Initial impressions and questions of flashgeotext for extracting countries from affiliations #16

dhimmel · 2020-03-02T19:00:37Z

Thanks for posting at elyase/geotext#23 (comment) letting me know about this package. I'm interested in it as a way to extract countries referred to in author affiliations.

For example, here is an affiliation:

'Multimodal Computing and Interaction', Saarland University & Department for Computational Biology and Applied Computing, Max Planck Institute for Informatics, Saarbrücken, 66123 Saarland, Germany, Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, 15206 PA, USA, Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany, Université Pierre et Marie Curie, UMR7238, CNRS-UPMC, Paris, France and CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, Paris, France.

For my project, I'd like to know what countries are mentioned (either directly or inferred from a place mention inside that country).

If I run the following (with v0.2.0):

import flashgeotext.geotext
geotexter = flashgeotext.geotext.GeoText(use_demo_data=True)
affil = """\
'Multimodal Computing and Interaction', Saarland University & Department for Computational Biology and Applied Computing, Max Planck Institute for Informatics, Saarbrücken, 66123 Saarland, Germany, \
Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, 15206 PA, USA, \
Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany, Université Pierre et Marie Curie, UMR7238, CNRS-UPMC, Paris, France and CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, Paris, France.
"""
geo_text = geotexter.extract(affil, span_info=False)
geo_text

I get the following output:

2020-03-02 18:50:46.475 | DEBUG    | flashgeotext.lookup:add:194 - cities added to pool
2020-03-02 18:50:46.479 | DEBUG    | flashgeotext.lookup:add:194 - countries added to pool
2020-03-02 18:50:46.480 | DEBUG    | flashgeotext.lookup:_add_demo_data:225 - demo data loaded for: ['cities', 'countries']
{'cities': {'University': {'count': 2},
  'Saarbrücken': {'count': 1},
  'Carnegie': {'count': 1},
  'Pittsburgh': {'count': 1},
  'Berlin': {'count': 2},
  'Parys': {'count': 2}},
 'countries': {'Germany': {'count': 2},
  'United States': {'count': 1},
  'France': {'count': 2}}}

Some impressions / questions:

Are the city mentions counting towards country mentions? If yes, why does "United States" not have a count of 2 for "Pittsburgh" and "USA"?
Is "Parys" for "Paris"... not sure why this conversion is made.
Counting "University" as a city will almost always be a false positive for us, although I'm guessing this is a source data issue.

Thanks for considering this feedback / helping answer any of these questions.


iwpnd · 2020-03-02T20:16:55Z

Hi, nice of you to give it a try.

Are the city mentions counting towards country mentions? If yes, why does "United States" not have a count of 2 for "Pittsburgh" and "USA".

Country mentions are not tracked the same way that GeoText tracks them.

I wanted users to add their own data to a LookupDataPool that is then looked up in a text. I don't cross-count across LookupData; rather, every LookupData is looked up separately.

Is "Parys" for "Paris"... not sure why this conversion is made.

from flashgeotext.lookup import load_data_from_file
from flashgeotext.settings import DEMODATA_CITIES, DEMODATA_COUNTRIES

cities = load_data_from_file(DEMODATA_CITIES)
countries = load_data_from_file(DEMODATA_COUNTRIES)

print(cities["Paris"])
print(cities["Parys"])

Where a single keyword in LookupData looks like this:

{
"Paris" : ['Pa-ri', 'Paarys', 'Paraeis', 'Paras', 'Pari',
 'Paries', 'Parigi', 'Pariis', 'Pariisi', 'Parij',
 'Parijs', 'Paris', 'Parisi', 'Parixe', 'Pariz',
 'Parize', 'Parizh', 'Parizo', 'Parizs', 'Pariž',
 'Parys', 'Paryzius', 'Paryžius', 'Paräis', 'París',
 'Parîs', 'Párizs', 'paris', 'pyaris', 'Paris'],

[...]

"Parys" is a synonym of "Paris" and vice versa. The demo data is just that: demo data to show the functionality. It is taken from geonames just like GeoText's data, but flashgeotext also uses the synonyms from geonames. I keep those synonyms that have a 70% fuzzy ratio with the original keyword. It might be a worthwhile task to clean the demo data a little more. But what I actually want is for people to create their own datasets of what they're looking for and use flashgeotext as the means to extract them.

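To make the synonym filter concrete, here is a minimal sketch of keeping only those synonyms that are similar enough to the keyword. It assumes the standard library's difflib.SequenceMatcher as a stand-in for whatever fuzzy-ratio function was actually used to build the demo data, and filter_synonyms is a hypothetical helper, not part of flashgeotext.

```python
from difflib import SequenceMatcher


def filter_synonyms(keyword, candidates, threshold=0.7):
    """Keep candidates whose similarity ratio to `keyword` is >= threshold.

    SequenceMatcher is only a stand-in here; the fuzzy ratio used for the
    real demo data may be computed differently.
    """
    kept = []
    for candidate in candidates:
        ratio = SequenceMatcher(None, keyword, candidate).ratio()
        if ratio >= threshold:
            kept.append(candidate)
    return kept


# "Parys" (ratio 0.8) and "Parigi" (ratio ~0.73) pass; "Lutetia" does not.
synonyms = filter_synonyms("Paris", ["Parys", "Parigi", "Lutetia"])
print(synonyms)
```

A purely string-based threshold like this explains why close spellings such as "Parys" survive: they are similar to "Paris" as strings, whether or not they also name a different place.
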
Counting "University" as a city will almost always be a false positive for us, although I'm guessing this is a source data issue.

University is part of geonames city names for some reason, so yeah, another data problem.
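
One way to work around entries like this is to drop them from the data before it is used. The following is a rough sketch on a toy dict in the keyword-to-synonyms shape shown above; drop_keywords and the stoplist are hypothetical and not part of flashgeotext.

```python
def drop_keywords(lookup, stoplist):
    """Return a copy of `lookup` without the keywords named in `stoplist`.

    The comparison is case-insensitive; the input dict is left unchanged.
    """
    blocked = {name.casefold() for name in stoplist}
    return {
        keyword: synonyms
        for keyword, synonyms in lookup.items()
        if keyword.casefold() not in blocked
    }


# Toy data only; the real demo data is much larger.
cities = {
    "Berlin": ["Berlin", "Berlino"],
    "University": ["University"],
    "Carnegie": ["Carnegie"],
}

cleaned = drop_keywords(cities, ["University", "Carnegie"])
print(sorted(cleaned))
```

Note that this only removes whole keywords. If an unwanted word also appears as a synonym under another keyword, that synonym list would need to be filtered as well.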

Thanks for your feedback, I am really happy about it. If you have suggestions on how to improve it, I am all ears.

dhimmel · 2020-03-09T18:59:10Z

Thanks @iwpnd for the explanation. Looks like you're building a strong foundation for really fast extraction of places from text.

As a user, I'm looking for a package that can extract the number of country mentions (including those inferred by city cross-counting) in English texts. Ideally, there'd be a way to use flashgeotext with data that closely matches my intended application, with little-to-no extra work needed.

I see how you're looking to build a lower-level application to support data that I could plug in. However, I think you'll get a lot more users if you can also provide the data that supports their intended application. Hopefully, more of a community can develop to offload some of this work from your shoulders!

Thanks again, looking to see how this project evolves, since it seems like a sturdy foundation... but probably needs a few second-layer features to appeal to my use case including:

option to cross-count cities
good default data for places in English texts
working out the kinks with unexpected names like Parys. Not that Parys shouldn't be matched as Paris, but shouldn't Paris go by its primary keyword?
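
As an illustration of the first item, cross-counting could be done as a post-processing step on the extraction result. This is only a sketch: the CITY_TO_COUNTRY mapping and the cross_count function are hypothetical, and the result dict is copied from the output earlier in this thread.

```python
from collections import Counter

# Hypothetical city -> country mapping; flashgeotext does not ship this.
CITY_TO_COUNTRY = {
    "Saarbrücken": "Germany",
    "Berlin": "Germany",
    "Pittsburgh": "United States",
    "Paris": "France",
}


def cross_count(result, city_to_country):
    """Add each mapped city's count to the count of its country."""
    counts = Counter(
        {name: info["count"] for name, info in result.get("countries", {}).items()}
    )
    for city, info in result.get("cities", {}).items():
        country = city_to_country.get(city)
        if country is not None:  # unmapped cities are skipped
            counts[country] += info["count"]
    return dict(counts)


# Extraction result as reported above (with span_info=False).
result = {
    "cities": {
        "University": {"count": 2},
        "Saarbrücken": {"count": 1},
        "Carnegie": {"count": 1},
        "Pittsburgh": {"count": 1},
        "Berlin": {"count": 2},
        "Parys": {"count": 2},
    },
    "countries": {
        "Germany": {"count": 2},
        "United States": {"count": 1},
        "France": {"count": 2},
    },
}

totals = cross_count(result, CITY_TO_COUNTRY)
print(totals)
```

With this mapping, Germany rises to 5 and United States to 2, while France stays at 2 because the two Paris mentions were reported as "Parys", which the mapping does not contain. That is the practical cost of the third item.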

Cheers!

iwpnd · 2020-03-09T19:17:28Z

@dhimmel thanks for your feedback. A lot of good points. :)

iwpnd added the bug and question labels Mar 2, 2020

dhimmel mentioned this issue Mar 9, 2020

Geolocate authors to countries based on affiliations greenelab/iscb-diversity#8

Closed

iwpnd closed this as completed Apr 8, 2021
