Country names with three words or more are not detected #25

bbo2adwuff · 2020-09-23T23:31:34Z

Here in countryInfo.txt you can see several country names with three or more words, e.g. United Arab Emirates, Antigua and Barbuda, Bosnia and Herzegovina, Central African Republic, ...

But because the following regex allows only a single optional separator (the [ \-]? part), only country names with at most one space are detected.

geotext/geotext/geotext.py

Line 107 in add0334

    
           city_regex = r"[A-ZÀ-Ú]+[a-zà-ú]+[ \-]?(?:d[a-u].)?(?:[A-ZÀ-Ú]+[a-zà-ú]+)*"
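
To make the effect concrete, here is a minimal sketch that runs the regex above on its own, outside of geotext. The helper name `candidates` is hypothetical and only used for illustration; how geotext looks up the extracted strings afterwards is not shown here.

```python
import re

# Regex copied verbatim from geotext.py (line 107 at add0334).
CITY_REGEX = r"[A-ZÀ-Ú]+[a-zà-ú]+[ \-]?(?:d[a-u].)?(?:[A-ZÀ-Ú]+[a-zà-ú]+)*"


def candidates(text):
    """Hypothetical helper: return the raw substrings the regex extracts."""
    # The pattern has no capturing groups, so findall returns full matches.
    return re.findall(CITY_REGEX, text)


for name in ("United Arab Emirates", "Antigua and Barbuda"):
    print(name, "->", candidates(name))
# United Arab Emirates -> ['United Arab', 'Emirates']
# Antigua and Barbuda -> ['Antigua ', 'Barbuda']
```

In both cases the name is split into several candidates, so the full country name is never available for lookup.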


bbo2adwuff · 2020-09-24T07:22:43Z

I just realized that prior to e9204a2, country names with three capitalized words (e.g. United Arab Emirates, Central African Republic, ...) were at least detected.

bbo2adwuff · 2020-09-24T07:56:38Z

What about:
city_regex = r"[A-ZÀ-Ú]+[a-zà-ú]+[ \-]?(?:and)?(?:d[a-u].)?(?:[ \-]?[A-ZÀ-Ú]+[a-zà-ú]+)*"

Still have to test it, though.
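
As a quick sanity check of the proposed pattern in isolation (a sketch only, not a test through geotext itself), the regex can be run directly with `re.findall`. The helper name `candidates` is hypothetical.

```python
import re

# Proposed regex from this comment: optional "and" after the first word,
# and an optional separator before every following capitalized word.
PROPOSED_REGEX = (
    r"[A-ZÀ-Ú]+[a-zà-ú]+[ \-]?(?:and)?(?:d[a-u].)?"
    r"(?:[ \-]?[A-ZÀ-Ú]+[a-zà-ú]+)*"
)


def candidates(text):
    """Hypothetical helper: return the raw substrings the regex extracts."""
    return re.findall(PROPOSED_REGEX, text)


for name in ("United Arab Emirates", "Antigua and Barbuda"):
    print(name, "->", candidates(name))
# United Arab Emirates -> ['United Arab Emirates']
# Antigua and Barbuda -> ['Antigua and Barbuda']
```

Both names now come out as a single candidate. Note that the optional "and" is only allowed directly after the first word.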

bbo2adwuff · 2020-09-28T09:31:38Z

In a first attempt, I tested all 252 countries in countryInfo.txt with the proposed regex and with the current regex in geotext. The results:

Current regex: 21 countries are not found, and in 5 cases the wrong country is extracted.
Proposed regex: 9 countries are not found, and in 2 cases the wrong country is extracted.

Still not found:

 "('Bonaire, Saint Eustatius and Saba ', 'BQ') - 0",
 "('Democratic Republic of the Congo', 'CD') - 0",
 "('Republic of the Congo', 'CG') - 0",
 "('South Georgia and the South Sandwich Islands', 'GS') - 0",
 "('Heard Island and McDonald Islands', 'HM') - 0",
 "('Saint Kitts and Nevis', 'KN') - 0",
 "('Sao Tome and Principe', 'ST') - 0",
 "('Saint Vincent and the Grenadines', 'VC') - 0",
 "('U.S. Virgin Islands', 'VI') - 0"

Still not correct:

 "('Isle of Man', 'IM') != CI"
 "('Saint Pierre and Miquelon', 'PM') != MU"

And here is the code I used to test it (I just copied read_table and removed the .lower()):

from geotext import GeoText
import io
from pprint import pprint


def read_table(filename, usecols=(0, 1), sep='\t', comment='#', encoding='utf-8', skip=0):
    with io.open(filename, 'r', encoding=encoding) as f:
        # skip initial lines
        for _ in range(skip):
            next(f)

        # filter comment lines
        lines = (line for line in f if not line.startswith(comment))

        d = dict()
        for line in lines:
            columns = line.split(sep)
            key = columns[usecols[0]]  # .lower()
            value = columns[usecols[1]].rstrip('\n')
            d[key] = value
    return d


countries = read_table('./countryInfo.txt', usecols=[4, 0], skip=1)

missing = []
error = []
for i in countries.items():
    country_mentions = GeoText(i[0]).country_mentions
    len_country_mentions = len(country_mentions)
    if len_country_mentions != 1:
        missing.append(str(i) + ' - ' + str(len_country_mentions))
    else:
        if list(i)[1] != list(country_mentions)[0]:
            error.append(str(i) + ' != ' + list(country_mentions)[0])

print(len(missing))
pprint(missing)

print(len(error))
pprint(error)
