Country names with three words or more are not detected #25
Comments
I just realized that prior to e9204a2, country names with three capitalized words (e.g. United Arab Emirates, Central African Republic, ...) were at least detected.
What about: Still have to test it, though.
In a first attempt to test all 252 countries in countryInfo.txt with the proposed regex versus the current regex in geotext, I found the following. Current regex: 21 countries are not found, and in 5 cases the wrong country is extracted. Still not found:
Still not correct:
And here is the code I used to test it (I just copied read_table and commented out the .lower() call on the key):
from geotext import GeoText
import io
from pprint import pprint


def read_table(filename, usecols=(0, 1), sep='\t', comment='#', encoding='utf-8', skip=0):
    with io.open(filename, 'r', encoding=encoding) as f:
        # skip initial lines
        for _ in range(skip):
            next(f)
        # filter comment lines
        lines = (line for line in f if not line.startswith(comment))
        d = dict()
        for line in lines:
            columns = line.split(sep)
            key = columns[usecols[0]]  # .lower()
            value = columns[usecols[1]].rstrip('\n')
            d[key] = value
    return d


# map each country name (column 4) to its ISO code (column 0)
countries = read_table('./countryInfo.txt', usecols=[4, 0], skip=1)
missing = []  # names that yield zero or more than one country mention
error = []    # names that yield exactly one mention, but the wrong code
for i in countries.items():
    country_mentions = GeoText(i[0]).country_mentions
    len_country_mentions = len(country_mentions)
    if len_country_mentions != 1:
        missing.append(str(i) + ' - ' + str(len_country_mentions))
    else:
        if list(i)[1] != list(country_mentions)[0]:
            error.append(str(i) + ' != ' + list(country_mentions)[0])
print(len(missing))
pprint(missing)
print(len(error))
pprint(error)
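To compare two regexes side by side, the same check could be wrapped in a function that accepts any detector. This is only a sketch: `evaluate`, `toy_detect` and `SAMPLE` are hypothetical names, and the toy detector is a deliberately flawed stand-in so the example runs without geotext installed. With geotext, something like `lambda text: GeoText(text).country_mentions` should fit as the detector, since iterating the mentions yields the country codes.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def evaluate(
    countries: Dict[str, str],
    detect: Callable[[str], Iterable[str]],
) -> Tuple[List[str], List[str]]:
    """Split country names into (missing, wrong) for a given detector.

    missing: names for which the detector returns zero or several codes.
    wrong:   names for which it returns exactly one code, but not the expected one.
    """
    missing, wrong = [], []
    for name, code in countries.items():
        found = list(detect(name))
        if len(found) != 1:
            missing.append(name)
        elif found[0] != code:
            wrong.append(name)
    return missing, wrong


# Hypothetical sample data: country name -> ISO code
SAMPLE = {"France": "FR", "United Arab Emirates": "AE", "Niger": "NE"}


def toy_detect(text: str) -> List[str]:
    # Deliberately flawed stand-in: knows no three-word names
    # and returns the wrong code for Niger.
    known = {"France": "FR", "Niger": "NG"}
    return [known[text]] if text in known else []


missing, wrong = evaluate(SAMPLE, toy_detect)
print(missing)  # ['United Arab Emirates']
print(wrong)    # ['Niger']
```

Running `evaluate` once per candidate regex gives directly comparable "not found" and "not correct" lists.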
In countryInfo.txt you can see several country names with three words, e.g. United Arab Emirates, Antigua and Barbuda, Bosnia and Herzegovina, Central African Republic, ...
But because of the [ \-]? in the following regex, only country names with at most one space are detected.
geotext/geotext/geotext.py, line 107 in add0334
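The effect of a single optional separator can be reproduced with a simplified pattern. The two patterns below are hypothetical illustrations, not the regex from geotext.py: `ONE_SEPARATOR` allows the `[ \-]?` separator only once, while `REPEATED_SEPARATOR` moves the separator inside the repeated group.

```python
import re

# Hypothetical, simplified patterns; NOT the actual geotext regex.
# The separator may occur at most once, so at most two words are joined.
ONE_SEPARATOR = re.compile(r"[A-Z][a-z]+[ \-]?(?:[A-Z][a-z]+)*")
# The separator is part of the repeated group, so any number of
# capitalized words can be joined.
REPEATED_SEPARATOR = re.compile(r"[A-Z][a-z]+(?:[ \-][A-Z][a-z]+)*")

name = "United Arab Emirates"
print(ONE_SEPARATOR.findall(name))       # ['United Arab', 'Emirates']
print(REPEATED_SEPARATOR.findall(name))  # ['United Arab Emirates']
```

Note that this sketch only joins capitalized words, so a name such as Antigua and Barbuda would still be split at the lowercase "and".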