Geoparsing is the process of finding location mentions (toponyms, a.k.a. place names) in texts (toponym recognition or geotagging) and defining geographical representations, such as coordinate points, for them (toponym resolution or geocoding). Finger is a geoparser for Finnish texts. For toponym recognition, Finger uses a fine-tuned model trained for the spaCy NLP library. The same model lemmatizes recognized toponyms, that is, transforms them to a base form (Helsingissä –> Helsinki). Finally, the lemmatized toponyms are resolved to locations by querying a geocoder. Currently, we maintain a geocoder based on Pelias. The program consists of three classes: the toponym recognizer, the toponym resolver, and the geoparser, which wraps the other two.
Geoparsing refers to finding place names in unstructured texts and locating them. Finger is a Python program for geoparsing Finnish-language text material. Place names are recognized and lemmatized (e.g. Helsingistä –> Helsinki) with the spaCy library. The locations of the place names are resolved with the Pelias geocoder. For the time being, we maintain a geocoding service based on the Pelias geocoder on CSC's Puhti service; Finger uses it by default.
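To make the resolution step more concrete, here is a minimal sketch of how a lemmatized toponym could be sent to a Pelias instance and how the first hit could be mapped to the fields Finger reports. It is not Finger's own implementation: the base URL and both function names are hypothetical, and the sample response is hand-written in the GeoJSON shape that the Pelias `/v1/search` endpoint returns, so no network access is needed.

```python
from urllib.parse import urlencode

# Hypothetical base URL: replace it with the address of a real Pelias instance.
PELIAS_URL = "https://pelias.example.org"


def build_search_url(lemma, base_url=PELIAS_URL, size=1):
    """Build a Pelias /v1/search request URL for one lemmatized toponym."""
    query = urlencode({"text": lemma, "size": size})
    return f"{base_url}/v1/search?{query}"


def parse_first_feature(response):
    """Pick name, coordinates, bbox, gid, label and layer from the first hit.

    Returns None when the geocoder found no match.
    """
    features = response.get("features", [])
    if not features:
        return None
    feature = features[0]
    props = feature.get("properties", {})
    lon, lat = feature["geometry"]["coordinates"][:2]
    return {
        "name": props.get("name"),
        "coordinates": (lon, lat),  # GeoJSON order: longitude first
        "bbox": feature.get("bbox"),  # not every feature has a bounding box
        "gid": props.get("gid"),
        "label": props.get("label"),
        "layer": props.get("layer"),
    }


# Hand-written stand-in for a geocoder response (illustrative values only).
sample_response = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [25.66151, 60.98267]},
            "properties": {
                "gid": "whosonfirst:locality:101748425",
                "layer": "locality",
                "name": "Lahti",
                "label": "Lahti, Finland",
            },
            "bbox": [25.5428275, 60.9207905391, 25.8289352655, 61.04401],
        }
    ],
}

print(build_search_url("Lahti"))
print(parse_first_feature(sample_response))
```

Taking only the first feature is a simplification; a geocoder may return several candidates, and choosing among them is a design decision of its own.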
Finger is available through PyPI.
I highly recommend creating a virtual environment for Finger (e.g., venv or conda) to prevent clashes with other packages; the versions used by Finger are not necessarily the latest ones.
pip install fingerGeoparser
Next, a spaCy model (pipeline) that has been trained for named-entity recognition and lemmatization is needed. Any pipelines that meet these requirements are fine. For example, spaCy offers pre-trained pipelines for Finnish.
Alternatively, we have trained a model based on Finnish BERT. This is a transformer-based model. Download it from Releases or pip install it directly like this:
pip install https://github.com/DigitalGeographyLab/Finger-geoparser/releases/download/0.2.0/fi_fingerFinbertPipeline-0.1.0-py3-none-any.whl
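What "trained for named-entity recognition and lemmatization" means in practice can be sketched as follows: the pipeline must produce entity spans that carry a label, a lemma and character offsets. The function below is an illustration, not Finger's code. It assumes that place-like entities are labelled `LOC` or `GPE` (label sets differ between pipelines), and it runs on a stand-in object instead of a real model so that it works without any pipeline installed.

```python
from types import SimpleNamespace

# Assumption: place-like entities carry the labels "LOC" or "GPE".
# Label sets differ between pipelines, so check the one you use.
TOPONYM_LABELS = {"LOC", "GPE"}


def extract_toponyms(doc, labels=TOPONYM_LABELS):
    """Collect surface form, lemma and character span of place-like entities.

    `doc` is expected to behave like a spaCy Doc: `doc.ents` yields spans
    with `label_`, `text`, `lemma_`, `start_char` and `end_char`.
    """
    return [
        {
            "toponym": ent.text,
            "lemma": ent.lemma_,
            "span": (ent.start_char, ent.end_char),
        }
        for ent in doc.ents
        if ent.label_ in labels
    ]


# Stand-in for a processed document, so the sketch runs without a model.
fake_doc = SimpleNamespace(
    ents=[
        SimpleNamespace(text="Matti Järvi", lemma_="Matti Järvi",
                        label_="PERSON", start_char=0, end_char=11),
        SimpleNamespace(text="Lahdessa", lemma_="Lahti",
                        label_="GPE", start_char=40, end_char=48),
    ]
)

print(extract_toponyms(fake_doc))
```

With a real pipeline installed, the same function could be called on `spacy.load("fi_core_news_sm")(text)`; whether the lemma comes out as expected depends on the pipeline.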
from fingerGeoparser import geoparser
# initializing the geoparser
# model name (if it has been pip installed) or path is provided; in this example, we use spaCy's small pretrained model
gp = geoparser.geoparser(pipeline_path="fi_core_news_sm")
# defining inputs
input_texts = ["Olympialaiset järjestettiin sinä vuonna Helsingissä.", "Paris Hilton on maailmanmatkalla"]
res = gp.geoparse(input_texts)
res now contains a pandas DataFrame with various columns of information (see Data model below).
If you want to find out more about the geoparser and the input parameters, call
help(geoparser)
Currently, the program accepts strings or lists of strings as input. The input is assumed to be in Finnish and segmented into fairly short pieces, so that the input is not, for example, a whole book chapter as a single string.
Most users will want to use the geoparser module, as it wraps the geoparsing pipeline and its functions under a simple principle: text in, results out. The usage example above shows this in practice. The output of the process is a pandas DataFrame with the following columns:
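If your material comes as long strings, one way to prepare it is to normalize the input to a list and split it into sentence-like pieces before geoparsing. The sketch below uses a naive regular-expression splitter; both helper names are hypothetical and not part of Finger, and the heuristic will mis-split abbreviations and dates, so a proper sentence segmenter is preferable for real data.

```python
import re


def as_text_list(texts):
    """Accept a single string or a list of strings and return a list of strings."""
    if isinstance(texts, str):
        return [texts]
    return list(texts)


def split_into_sentences(text):
    """Naively split text after '.', '!' or '?' when whitespace follows.

    A rough heuristic only: abbreviations and dates will be mis-split.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


chapter = (
    "Olympialaiset järjestettiin sinä vuonna Helsingissä. "
    "Paris Hilton on maailmanmatkalla."
)

# Flatten every input text into short pieces suitable for geoparsing.
pieces = [
    sentence
    for text in as_text_list(chapter)
    for sentence in split_into_sentences(text)
]
print(pieces)
```

The resulting list can then be passed on in place of the original long string.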
Column header | Description | Data type | Example |
---|---|---|---|
input_text | The input sentence | string | "Matti Järvi vietti tänään hienon päivän Lahdessa" |
input_order | The index of the input text, i.e. the first text is 0, the second 1, etc. | int | 0 |
toponyms_found | Whether locations were found in the input sentence | boolean | True |
toponyms | Location tokens in the original wordform, if found | (list of) string(s) or none | "Lahdessa" |
topo_lemmas | Lemmatized versions of the toponyms | (list of) string(s) or none | "Lahti" |
topo_spans | Indices of the start and end characters of the identified toponyms in the input text string | tuple | (40, 48) |
names | Names of the locations as returned by the geocoder | (list of) string(s) or none | "Lahti" |
coordinates | Long/lat coordinate points in WGS84 | (list of) tuple(s) or none | (25.66151, 60.98267) |
bbox | [Minx, miny, maxx, maxy] bounding box of the location, if available | (list of) list(s) | [25.5428275, 60.9207905391, 25.8289352655, 61.04401] |
gid | Unique identifier given by the geocoder. | (list of) string(s) | "whosonfirst:locality:101748425" |
label | Label given by the geocoder. | (list of) string(s) | "Lahti, Finland" |
layer | Type of location based on Who's on First placetypes | (list of) string(s) | "country" |
* id | The identifying element, like tweet id, tied to each input text. Optional | string, int, float | "first_sentence" |
NOTE. The data model is still subject to change as the work progresses.
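As an illustration of working with this data model, the sketch below converts result rows into a GeoJSON FeatureCollection, which most GIS tools can open. It is a sketch under stated assumptions rather than a feature of Finger: it assumes that `toponyms`, `topo_lemmas` and `coordinates` are parallel lists (or None when nothing was found), the function name is hypothetical, and the rows are hand-written dictionaries that mimic the table above.

```python
import json


def row_to_features(row):
    """Turn one result row (a dict) into a list of GeoJSON Point features.

    Assumption: `toponyms`, `topo_lemmas` and `coordinates` are parallel
    lists, or None when nothing was found. Adjust this if your output
    stores single values instead of lists.
    """
    if not row.get("toponyms_found"):
        return []
    toponyms = row.get("toponyms") or []
    lemmas = row.get("topo_lemmas") or []
    coordinates = row.get("coordinates") or []
    features = []
    for toponym, lemma, coords in zip(toponyms, lemmas, coordinates):
        if coords is None:  # recognized, but the geocoder gave no location
            continue
        lon, lat = coords  # the data model stores long/lat, as GeoJSON does
        features.append(
            {
                "type": "Feature",
                "geometry": {"type": "Point", "coordinates": [lon, lat]},
                "properties": {
                    "toponym": toponym,
                    "lemma": lemma,
                    "input_order": row["input_order"],
                },
            }
        )
    return features


# Hand-written rows that mimic the data model (illustrative values only).
rows = [
    {
        "input_order": 0,
        "toponyms_found": True,
        "toponyms": ["Lahdessa"],
        "topo_lemmas": ["Lahti"],
        "coordinates": [(25.66151, 60.98267)],
    },
    {
        "input_order": 1,
        "toponyms_found": False,
        "toponyms": None,
        "topo_lemmas": None,
        "coordinates": None,
    },
]

collection = {
    "type": "FeatureCollection",
    "features": [feature for row in rows for feature in row_to_features(row)],
}
print(json.dumps(collection, ensure_ascii=False, indent=2))
```

With real output, rows in this form could be obtained with the pandas call `res.to_dict("records")`, provided the columns match the assumption stated above.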
The source code is licensed under the MIT license.
This geoparser was developed by Tatu Leppämäki of the Digital Geography Lab, University of Helsinki. Find me on Mastodon and Twitter.
Other resources used in either the pipeline or this code:
- Finnish BERT language model by TurkuNLP, CC BY 4.0. See Virtanen, Kanerva, Ilo, Luoma, Luotolahti, Salakoski, Ginter and Pyysalo; 2019
- Turku NER corpus by TurkuNLP, CC BY 4.0. See Luoma, Oinonen, Pyykönen, Laippala and Pyysalo; 2020
- UD_Finnish TDT corpus, CC BY-SA 4.0. See Haverinen et al. 2014; Pyysalo et al. 2015 (older versions)
- Spacy-fi pipeline by Antti Ajanki, MIT License.
If you use the geoparser or related resources in a scientific publication, please cite the following article:
@article{doi:10.1080/13658816.2024.2369539,
author = {Tatu Leppämäki and Tuuli Toivonen and Tuomo Hiippala},
title = {Geographical and linguistic perspectives on developing geoparsers with generic resources},
journal = {International Journal of Geographical Information Science},
volume = {0},
number = {0},
pages = {1--22},
year = {2024},
publisher = {Taylor \& Francis},
doi = {10.1080/13658816.2024.2369539},
URL = {https://doi.org/10.1080/13658816.2024.2369539}
}