
Geoparser for extracting and locating place names from Finnish texts


DigitalGeographyLab/Finger-geoparser


Finger: the Finnish geoparser

Overview

Geoparsing is the process of finding location mentions (toponyms, a.k.a. place names) in texts (toponym recognition, or geotagging) and assigning them geographical representations, such as coordinate points (toponym resolution, or geocoding). Finger is a geoparser for Finnish texts. For toponym recognition, Finger uses a fine-tuned model trained for the spaCy NLP library. The same model lemmatizes the recognized toponyms, that is, transforms them to their base form (Helsingissä –> Helsinki). Finally, the lemmatized toponyms are resolved to locations by querying a geocoder. Currently, we maintain a geocoder based on Pelias. The program consists of three classes: the toponym recognizer, the toponym resolver, and the geoparser, which wraps the other two.
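To make the three stages concrete, here is a minimal, self-contained sketch of the recognize, lemmatize, and resolve flow. It does not use Finger's classes or a real model: the lookup tables, the `Toponym` dataclass, and the function names are hypothetical stand-ins for the spaCy pipeline and the geocoder, and the Helsinki coordinates are approximate.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-ins: a real setup would use a trained spaCy pipeline
# for recognition/lemmatization and a geocoder service for resolution.
TOY_LEMMAS = {"Helsingissä": "Helsinki", "Lahdessa": "Lahti"}
TOY_GAZETTEER = {
    "Helsinki": (24.94, 60.17),      # approximate (lon, lat)
    "Lahti": (25.66151, 60.98267),   # (lon, lat)
}


@dataclass
class Toponym:
    surface: str                         # word form as it appears in the text
    lemma: str                           # base form
    span: tuple                          # (start, end) character offsets
    coordinates: Optional[tuple] = None  # (lon, lat), filled in by resolve()


def recognize(text):
    """Toponym recognition and lemmatization, here by plain dictionary lookup."""
    found = []
    for surface, lemma in TOY_LEMMAS.items():
        start = text.find(surface)
        if start != -1:
            found.append(Toponym(surface, lemma, (start, start + len(surface))))
    return found


def resolve(toponyms):
    """Toponym resolution: look each lemma up in the toy gazetteer."""
    for toponym in toponyms:
        toponym.coordinates = TOY_GAZETTEER.get(toponym.lemma)
    return toponyms


def geoparse(text):
    """Wrap both steps: text in, located toponyms out."""
    return resolve(recognize(text))


results = geoparse("Olympialaiset järjestettiin sinä vuonna Helsingissä.")
for toponym in results:
    print(toponym.surface, "->", toponym.lemma, toponym.coordinates)
```

The point of the sketch is the division of labour: the geocoder is queried with the lemma, not the inflected form, which is why lemmatization sits between recognition and resolution.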

Summary

Geoparsing refers to finding place names in unstructured texts and locating them. Finger is a Python program for geoparsing Finnish-language text material. Place names are recognized and lemmatized (e.g., Helsingistä –> Helsinki) with the spaCy library. Their locations are resolved with the Pelias geocoder. For the time being, we maintain a geocoding service based on the Pelias geocoder on CSC's Puhti service, and Finger uses it by default.

Getting started

Finger is available through PyPI.

Installation

I highly recommend creating a virtual environment for Finger (e.g., with venv or conda) to prevent clashes with other packages. The versions used by Finger are not necessarily the latest ones.

pip install fingerGeoparser

Next, a spaCy model (pipeline) that has been trained for named-entity recognition and lemmatization is needed. Any pipelines that meet these requirements are fine. For example, spaCy offers pre-trained pipelines for Finnish.

Alternatively, we have trained a transformers-based model built on Finnish BERT. Download it from Releases or pip install it directly like this:

pip install https://github.com/DigitalGeographyLab/Finger-geoparser/releases/download/0.2.0/fi_fingerFinbertPipeline-0.1.0-py3-none-any.whl

Usage example

from fingerGeoparser import geoparser

# initializing the geoparser
# model name (if it has been pip installed) or path is provided; in this example, we use spaCy's small pretrained model
gp = geoparser.geoparser(pipeline_path="fi_core_news_sm")

# defining inputs
input_texts = ["Olympialaiset järjestettiin sinä vuonna Helsingissä.", "Paris Hilton on maailmanmatkalla"]

res = gp.geoparse(input_texts)

res is a pandas DataFrame with various columns of information (see Data model below).

If you want to find out more about the geoparser and the input parameters, call

help(geoparser)

Data model

Currently, the program accepts strings or lists of strings as input. The input is assumed to be in Finnish and segmented into fairly short pieces (so that a single input string isn't, for example, a whole book chapter).
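If your material comes as long strings, one way to get short pieces is to split it into sentences first. The sketch below uses a deliberately naive regular expression; `split_into_sentences` is a hypothetical helper, not part of Finger, and it will mis-split at abbreviations, so a proper sentence segmenter (such as the one in a spaCy pipeline) is likely a sturdier choice for real data.

```python
import re


def split_into_sentences(text):
    """Naively split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty strings, which appear when the input is empty.
    return [part for part in parts if part]


long_text = (
    "Olympialaiset järjestettiin sinä vuonna Helsingissä. "
    "Matti Järvi vietti tänään hienon päivän Lahdessa."
)

# A list of short strings, ready to be passed on as input texts.
input_texts = split_into_sentences(long_text)
print(input_texts)
```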

Most users will want to use the geoparser module, as it wraps the geoparsing pipeline and its functions under a simple principle: text in, results out. See the usage example above. The output of the process is a pandas DataFrame with the following columns:

| Column header | Description | Data type | Example |
| --- | --- | --- | --- |
| input_text | The input sentence | string | "Matti Järvi vietti tänään hienon päivän Lahdessa" |
| input_order | The index of the input text, i.e. the first text is 0, the second 1, etc. | int | 0 |
| toponyms_found | Whether locations were found in the input sentence | boolean | True |
| toponyms | Location tokens in the original word form, if found | (list of) string(s) or none | "Lahdessa" |
| topo_lemmas | Lemmatized versions of the toponyms | (list of) string(s) or none | "Lahti" |
| topo_spans | Index of the start and end characters of the identified toponyms in the input text string | tuple | (40, 48) |
| names | Versions of the locations returned by querying GeoNames | (list of) string(s) or none | "Lahti" |
| coordinates | Long/lat coordinate points in WGS84 | (list of) tuple(s) or none | (25.66151, 60.98267) |
| bbox | [minx, miny, maxx, maxy] bounding box of the location, if available | (list of) list(s) | [25.5428275, 60.9207905391, 25.8289352655, 61.04401] |
| gid | Unique identifier given by the geocoder | (list of) string(s) | "whosonfirst:locality:101748425" |
| label | Label given by the geocoder | (list of) string(s) | "Lahti, Finland" |
| layer | Type of location based on Who's on First placetypes | (list of) string(s) | "country" |
| id | The identifying element, like a tweet id, tied to each input text. Optional | string, int, float | "first_sentence" |

NOTE. The data model is still subject to change as the work progresses.
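As a rough illustration of how these columns relate to each other, the sketch below rebuilds one output row by hand from the example values in the table and checks two things: that topo_spans slices the toponym out of input_text, and that the coordinate point falls inside bbox. The row is a plain dict, not real Finger output; the assumption that list-valued columns are aligned by position, and the `point_in_bbox` helper, are mine.

```python
import unicodedata

# One hand-built row using the example values from the table above.
# Normalizing to NFC keeps "ä" as a single code point so the offsets line up.
row = {
    "input_text": unicodedata.normalize(
        "NFC", "Matti Järvi vietti tänään hienon päivän Lahdessa"
    ),
    "toponyms_found": True,
    "toponyms": ["Lahdessa"],
    "topo_lemmas": ["Lahti"],
    "topo_spans": [(40, 48)],
    "coordinates": [(25.66151, 60.98267)],  # (lon, lat) in WGS84
    "bbox": [[25.5428275, 60.9207905391, 25.8289352655, 61.04401]],
}


def point_in_bbox(point, bbox):
    """Return True if a (lon, lat) point lies within [minx, miny, maxx, maxy]."""
    lon, lat = point
    minx, miny, maxx, maxy = bbox
    return minx <= lon <= maxx and miny <= lat <= maxy


# Assumption: the i-th span, lemma, coordinate and bbox describe the same toponym.
surfaces = []
inside = []
for i, (start, end) in enumerate(row["topo_spans"]):
    surfaces.append(row["input_text"][start:end])
    inside.append(point_in_bbox(row["coordinates"][i], row["bbox"][i]))
    print(surfaces[i], "->", row["topo_lemmas"][i], "| point in bbox:", inside[i])
```

Note the longitude-first order of the coordinates; many mapping tools expect latitude first, so the tuple may need swapping before plotting.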

License and credits

The source code is licensed under the MIT license.

This geoparser was developed by Tatu Leppämäki of the Digital Geography Lab, University of Helsinki. Find me on Mastodon and Twitter.

Other resources used in either the pipeline or this code:

Citation

If you use the geoparser or related resources in a scientific publication, please cite the following article:

@article{doi:10.1080/13658816.2024.2369539,
author = {Tatu Leppämäki and Tuuli Toivonen and Tuomo Hiippala},
title = {Geographical and linguistic perspectives on developing geoparsers with generic resources},
journal = {International Journal of Geographical Information Science},
volume = {0},
number = {0},
pages = {1--22},
year = {2024},
publisher = {Taylor \& Francis},
doi = {10.1080/13658816.2024.2369539},
URL = {https://doi.org/10.1080/13658816.2024.2369539}
}
