Frequent n-grams in OSM addresses by language. Helpful when contributing abbreviations to libpostal

openvenues/address_languages

address_languages

This repo contains lists of frequent n-grams found in street and place names in many languages around the world.

Files are organized by source (currently just OSM, but more may be added) and language code. The n-grams come primarily from street and venue names, which is where most abbreviations and special phrases are found.
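To make the idea of "frequent n-grams" concrete, here is a minimal sketch of how word n-grams could be counted over a list of street names. It is an illustration only, not the code used to build this repo: the function name `count_ngrams`, the whitespace tokenization, the lowercasing, and the sample names are all assumptions.

```python
from collections import Counter


def count_ngrams(names, max_n=2):
    """Count word n-grams (n = 1..max_n) across a list of names.

    Hypothetical helper: tokenization is a plain lowercase
    whitespace split, which is an assumption for this sketch.
    """
    counts = Counter()
    for name in names:
        tokens = name.lower().split()
        for n in range(1, max_n + 1):
            # Slide a window of n tokens across the name.
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Made-up sample input, not taken from the data files.
sample_names = ["Rue de la Paix", "Rue de Rivoli", "Avenue de la République"]
ngram_counts = count_ngrams(sample_names)

for ngram, freq in ngram_counts.most_common(5):
    print(freq, ngram)
```

Phrases that recur across many names, such as a street type followed by a particle, rise to the top of a list like this, which is what makes them good candidates for abbreviation dictionaries.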

Below are the languages for which we have n-grams, sorted by frequency in OSM. This data set has been used as a rough priority list for adding languages to libpostal. If you are a native speaker of any of these languages, we'd love your input (literally). It's easy to contribute abbreviations and phrases to the [dictionaries](https://github.com/openvenues/libpostal/tree/master/resources/dictionaries) in libpostal. In addition to helping expand abbreviations found in geocoder/address input, these dictionaries also inform the address parser and the language classifier.

| Code | Language | Frequency in OSM |
| --- | --- | --- |
| en | English | 16128724 |
| fr | French | 5214044 |
| de | German | 3886269 |
| es | Spanish | 3568860 |
| ja | Japanese | 3067910 |
| pt | Portuguese | 2940133 |
| it | Italian | 2510501 |
| ru | Russian | 1770775 |
| nl | Dutch | 1546215 |
| zh | Chinese | 859707 |
| pl | Polish | 603065 |
| ca | Catalan | 575205 |
| th | Thai | 382725 |
| uk | Ukrainian | 366363 |
| da | Danish | 359688 |
| sv | Swedish | 292189 |
| id | Indonesian | 285243 |
| ko | Korean | 276525 |
| hu | Hungarian | 254601 |
| cs | Czech | 236100 |
| ro | Romanian | 230943 |
| nb | Norwegian (Bokmål) | 205299 |
| ar | Arabic | 205010 |
| fi | Finnish | 196344 |
| tr | Turkish | 162672 |
| ms | Malay | 158328 |
| el | Greek | 139720 |
| be | Belarusian | 127734 |
| fa | Farsi | 116424 |
| bg | Bulgarian | 116318 |
| sr | Serbian | 100218 |
| hr | Croatian | 98465 |
| gl | Galician | 92234 |
| he | Hebrew | 85501 |
| vi | Vietnamese | 82158 |
| sk | Slovak | 82091 |
| ka | Georgian | 77399 |
| lt | Lithuanian | 74998 |
| sl | Slovene | 74636 |
| ga | Irish | 70564 |
| eu | Basque | 63814 |
| lv | Latvian | 50028 |
| et | Estonian | 35962 |
| bs | Bosnian | 32323 |
| kk | Kazakh | 31904 |
| kn | Kannada | 28688 |
| cy | Welsh | 25521 |
| br | Breton | 24892 |
| ne | Nepali | 23339 |
| si | Sinhalese | 19415 |
| az | Azerbaijani | 18588 |
| mk | Macedonian | 17630 |
| mt | Maltese | 16452 |
| sq | Albanian | 15961 |
| is | Icelandic | 15916 |
| mr | Marathi | 14675 |
| mn | Mongolian | 13788 |
| am | Amharic | 13717 |
| bn | Bengali | 12772 |
| hy | Armenian | 12173 |
| hi | Hindi | 11939 |
| ky | Kyrgyz | 11833 |
| lo | Lao | 10183 |
| my | Burmese | 9802 |
| uz | Uzbek | 8836 |
| km | Khmer | 7247 |
| oc | Occitan | 6871 |
| tk | Turkmen | 6737 |
| tg | Tajik | 4891 |
| pap | Papiamento | 3968 |
| bo | Tibetan | 3441 |
| lb | Luxembourgish | 3081 |
| mg | Malagasy | 2910 |
| gd | Scottish Gaelic | 2861 |
| mdf | Moksha | 2750 |
| fo | Faroese | 2701 |
| gsw | Swiss German | 2499 |
| af | Afrikaans | 2476 |
| ta | Tamil | 2429 |
| myv | Erzya | 2412 |
| fy | Western Frisian | 2222 |
| ba | Bashkir | 2171 |
| gv | Manx | 2038 |
| ast | Asturian | 1862 |
| tn | Tswana | 1831 |
| tt | Tatar | 1688 |
| os | Ossetian | 1511 |
| dz | Dzongkha | 1190 |
| udm | Udmurt | 1069 |
| ml | Malayalam | 760 |
| kl | Kalaallisut | 648 |
| so | Somali | 642 |
| to | Tonga | 503 |
| ur | Urdu | 389 |
| ug | Uyghur | 223 |
| fil | Filipino | 204 |
| se | Northern Sami | 175 |
| dv | Divehi | 174 |
| mi | Māori | 172 |
| te | Telugu | 155 |
| ab | Abkhaz | 149 |
| ch | Chamorro | 130 |
| ht | Haitian Creole | 121 |
| kz | Kazakh | 78 |
| qu | Quechua | 74 |
| rm | Romansh | 64 |
| sw | Swahili | 56 |
| haw | Hawaiian | 46 |
| ee | Ewe | 30 |
| gu | Gujarati | 22 |
| ak | Akan | 14 |
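
Since the table above doubles as a rough priority list, it can help to see how quickly the frequencies accumulate. The sketch below computes a running share over the first five rows of the table. The function name `cumulative_share` is hypothetical, and the shares are relative only to the rows passed in, not to the full table or to OSM as a whole.

```python
def cumulative_share(rows):
    """Return (code, cumulative fraction) pairs for (code, count) rows.

    Hypothetical helper: fractions are relative to the sum of the
    counts passed in, so a partial list gives partial-list shares.
    """
    total = sum(count for _, count in rows)
    running = 0
    shares = []
    for code, count in rows:
        running += count
        shares.append((code, running / total))
    return shares


# First five rows of the table above (code, frequency in OSM).
top_rows = [
    ("en", 16128724),
    ("fr", 5214044),
    ("de", 3886269),
    ("es", 3568860),
    ("ja", 3067910),
]

shares = cumulative_share(top_rows)
for code, share in shares:
    print(f"{code}: {share:.1%}")
```

Passing every row of the table instead of only the first five gives shares over all listed languages.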
