Finger: the Finnish geoparser

Overview

Geoparsing is the process of finding location mentions (toponyms, also known as place names) in texts (toponym recognition or geotagging) and assigning geographical representations, such as coordinate points, to them (toponym resolution or geocoding). Finger is a geoparser for Finnish texts. For toponym recognition, Finger uses a fine-tuned model trained for the spaCy NLP library. The same model lemmatizes recognized toponyms, that is, transforms them to their base form (Helsingissä –> Helsinki). Finally, the lemmatized toponyms are resolved to locations by querying a geocoder. Currently, we maintain a geocoder based on Pelias.

The program consists of three classes: the toponym recognizer, the toponym resolver, and the geoparser, which wraps the other two.
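To make the three-part structure concrete, here is a minimal sketch of how a recognizer, a resolver and a wrapping geoparser fit together. It is not Finger's implementation: the class names, the lookup tables and the approximate coordinates are all hypothetical stand-ins, and in the real program the recognizer is a spaCy pipeline and the resolver queries a geocoder service.

```python
from dataclasses import dataclass

# Hypothetical toy data for illustration only. A real recognizer is a trained
# spaCy pipeline and a real resolver queries a geocoder service.
TOY_LEMMAS = {"Helsingissä": "Helsinki", "Lahdessa": "Lahti"}
TOY_GAZETTEER = {"Helsinki": (24.94, 60.17), "Lahti": (25.66, 60.98)}  # approx. (lon, lat)


@dataclass
class Toponym:
    surface: str  # wordform as it appears in the text
    lemma: str    # base form used to query the gazetteer
    span: tuple   # (start, end) character offsets in the input text


class ToyRecognizer:
    """Step 1: find toponyms in the text and lemmatize them."""

    def recognize(self, text):
        found = []
        for surface, lemma in TOY_LEMMAS.items():
            start = text.find(surface)
            if start != -1:
                found.append(Toponym(surface, lemma, (start, start + len(surface))))
        return found


class ToyResolver:
    """Step 2: map a lemma to coordinates, or None if it is unknown."""

    def resolve(self, lemma):
        return TOY_GAZETTEER.get(lemma)


class ToyGeoparser:
    """Wraps both steps: text in, located toponyms out."""

    def __init__(self):
        self.recognizer = ToyRecognizer()
        self.resolver = ToyResolver()

    def geoparse(self, text):
        toponyms = self.recognizer.recognize(text)
        return [(t, self.resolver.resolve(t.lemma)) for t in toponyms]


results = ToyGeoparser().geoparse("Olympialaiset järjestettiin sinä vuonna Helsingissä.")
for toponym, coordinates in results:
    print(toponym.surface, "->", toponym.lemma, coordinates)
```

Keeping recognition and resolution in separate classes means either step can be swapped out, for example a different language model or a different geocoder, without touching the other.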

Summary (translated from Finnish)

Geoparsing refers to finding place names in unstructured texts and locating them. Finger is a Python program for geoparsing Finnish-language text material. Place names are recognized and lemmatized (e.g., Helsingistä –> Helsinki) with the spaCy library. Their locations are resolved with the Pelias geocoder. For the time being, we maintain a geocoding service based on Pelias, which Finger uses by default, on CSC's Puhti service.

Getting started

Finger is available through PyPI.

Installation

I highly recommend creating a virtual environment for Finger (e.g., with venv or conda) to prevent clashes with other packages: the versions used by Finger are not necessarily the latest ones.

pip install fingerGeoparser

Next, a spaCy model (pipeline) that has been trained for named-entity recognition and lemmatization is needed. Any pipelines that meet these requirements are fine. For example, spaCy offers pre-trained pipelines for Finnish.

Alternatively, we have trained a transformer-based model built on Finnish BERT. Download it from Releases or pip install it directly like this:

pip install https://github.com/DigitalGeographyLab/Finger-geoparser/releases/download/0.2.0/fi_fingerFinbertPipeline-0.1.0-py3-none-any.whl

Usage example

from fingerGeoparser import geoparser

# initializing the geoparser
# model name (if it has been pip installed) or path is provided; in this example, we use spaCy's small pretrained model
gp = geoparser.geoparser(pipeline_path="fi_core_news_sm")

# defining inputs
input_texts = ["Olympialaiset järjestettiin sinä vuonna Helsingissä.", "Paris Hilton on maailmanmatkalla"]

res = gp.geoparse(input_texts)

res is a pandas DataFrame with various columns of information (see Data model below).

If you want to find out more about the geoparser and the input parameters, call

help(geoparser)

Data model

Currently, the program accepts strings or lists of strings as input. The input is assumed to be in Finnish and segmented into fairly short pieces (so that the input is not, for example, a whole book chapter as a single string).
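If your material comes as long strings, one way to get suitably short pieces is to split it into sentences first. The helper below is a minimal sketch and not part of Finger: split_into_sentences is a hypothetical name, and the rule it uses (split after ".", "!" or "?" followed by whitespace) is naive and will wrongly split after abbreviations such as "esim.".

```python
import re


def split_into_sentences(text):
    """Naively split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


chapter = (
    "Olympialaiset järjestettiin sinä vuonna Helsingissä. "
    "Paris Hilton on maailmanmatkalla."
)

# A list of short strings, which is the input shape the geoparser expects
input_texts = split_into_sentences(chapter)
print(input_texts)
```

For real corpora, a dedicated sentence segmenter is likely to be more robust than this regular expression.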

Most users will want to use the geoparser module, as it wraps the geoparsing pipeline and functions under a simple principle: text in, results out. See the usage example above. The output of the process is a pandas DataFrame with the following columns:

Column header	Description	Data type	Example
input_text	The input sentence	string	"Matti Järvi vietti tänään hienon päivän Lahdessa"
input_order	The index of the input text, i.e. the first text is 0, the second 1, etc.	int	0
toponyms_found	Whether locations were found in the input sentence	boolean	True
toponyms	Location tokens in the original wordform, if found	(list of) string(s) or none	"Lahdessa"
topo_lemmas	Lemmatized versions of the toponyms	(list of) string(s) or none	"Lahti"
topo_spans	Start and end character indices of the identified toponyms in the input text string	tuple	(40, 48)
names	Versions of the locations returned by querying the geocoder	(list of) string(s) or none	"Lahti"
coordinates	Long/lat coordinate points in WGS84	(list of) tuple(s) or none	(25.66151, 60.98267)
bbox	[Minx, miny, maxx, maxy] bounding box of the location, if available	(list of) list(s)	[25.5428275, 60.9207905391, 25.8289352655, 61.04401]
gid	Unique identifier given by the geocoder.	(list of) string(s)	"whosonfirst:locality:101748425"
label	Label given by the geocoder.	(list of) strings	"Lahti, Finland"
layer	Type of location based on Who's on First placetypes	(list of) string(s)	"country"
* id	The identifying element, like tweet id, tied to each input text. Optional	string, int, float	"first_sentence"
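As an illustration of how these columns relate to each other, the sketch below takes one output row, written as a plain dict with the example values from the table, and turns it into GeoJSON point features. The helper names as_list and row_to_features are hypothetical. The code also assumes that topo_spans, coordinates and label are aligned by position when a text contains several toponyms, which the table does not state explicitly.

```python
import json

# One output row as a plain dict, using the example values from the table.
row = {
    "input_text": "Matti Järvi vietti tänään hienon päivän Lahdessa",
    "toponyms": "Lahdessa",
    "topo_lemmas": "Lahti",
    "topo_spans": (40, 48),
    "coordinates": (25.66151, 60.98267),
    "label": "Lahti, Finland",
}


def as_list(value):
    """Normalise a cell that may hold None, a single item, or a list of items."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def row_to_features(row):
    """Build one GeoJSON Feature per toponym (assumes position-aligned columns)."""
    spans = as_list(row["topo_spans"])
    coords = as_list(row["coordinates"])
    labels = as_list(row["label"])
    features = []
    for (start, end), (lon, lat), label in zip(spans, coords, labels):
        features.append({
            "type": "Feature",
            # GeoJSON uses longitude, latitude order, like the coordinates column
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {
                # topo_spans are character offsets into input_text
                "toponym": row["input_text"][start:end],
                "label": label,
            },
        })
    return features


features = row_to_features(row)
print(json.dumps(features, ensure_ascii=False, indent=2))
```

Slicing input_text with the span recovers the original wordform ("Lahdessa"), which is handy for highlighting matches in the source text.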

NOTE: The data model is still subject to change as the work progresses.

License and credits

The source code is licensed under the MIT license.

This geoparser was developed by Tatu Leppämäki of the Digital Geography Lab, University of Helsinki. Find me on Mastodon and Twitter.

Other resources used in either the pipeline or this code:

Finnish BERT language model by TurkuNLP, CC BY 4.0. See Virtanen, Kanerva, Ilo, Luoma, Luotolahti, Salakoski, Ginter and Pyysalo; 2019
Turku NER corpus by TurkuNLP, CC BY 4.0. See Luoma, Oinonen, Pyykönen, Laippala and Pyysalo; 2020
UD_Finnish TDT corpus, CC BY-SA 4.0. See Haverinen et al. 2014; Pyysalo et al. 2015 (Older versions)
Spacy-fi pipeline by Antti Ajanki, MIT License.

Citation

If you use the geoparser or related resources in a scientific publication, please cite the following article:

@article{doi:10.1080/13658816.2024.2369539,
author = {Tatu Leppämäki and Tuuli Toivonen and Tuomo Hiippala},
title = {Geographical and linguistic perspectives on developing geoparsers with generic resources},
journal = {International Journal of Geographical Information Science},
volume = {0},
number = {0},
pages = {1--22},
year = {2024},
publisher = {Taylor \& Francis},
doi = {10.1080/13658816.2024.2369539},
URL = {https://doi.org/10.1080/13658816.2024.2369539}
}
