This repo contains lists of frequent n-grams found in street and place names in many languages around the world.
Files are organized by source (currently only OpenStreetMap (OSM), though more may be added) and by language code. The n-grams come primarily from street and venue names, where most abbreviations and special phrases are expected to occur.
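As a rough illustration of what a list of frequent n-grams from names means, here is a minimal Python sketch that counts word n-grams in a handful of names. It is not the code used to build the files in this repo: the function names `word_ngrams` and `frequent_ngrams`, the `max_n` and `min_count` parameters, and the sample names are all hypothetical, and whitespace tokenization is an assumption that would not work for languages written without spaces (e.g. Chinese, Japanese, Thai).

```python
from collections import Counter


def word_ngrams(name, max_n=3):
    """Yield every run of 1..max_n consecutive tokens in a lowercased name."""
    # Assumption: tokens are separated by whitespace.
    tokens = name.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def frequent_ngrams(names, max_n=3, min_count=2):
    """Return (ngram, count) pairs seen at least min_count times, most frequent first."""
    counts = Counter()
    for name in names:
        counts.update(word_ngrams(name, max_n))
    return [(gram, c) for gram, c in counts.most_common() if c >= min_count]


# Made-up sample names, for illustration only.
sample = ["Rue de la Paix", "Rue de la République", "Avenue de la Gare"]
for gram, count in frequent_ngrams(sample):
    print(gram, count)
```

On this sample, phrases such as "de la" and "rue de la" pass the threshold while one-off tokens such as "paix" do not, which is the kind of signal that helps surface candidate abbreviations and special phrases.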
Below are the languages for which we have n-grams, sorted by frequency in OSM. This data set has been used as a rough priority list for adding languages to libpostal. If you are a native speaker of any of these languages, we'd love your input (literally). It's easy to contribute abbreviations and phrases to the [dictionaries](https://github.com/openvenues/libpostal/tree/master/resources/dictionaries) in libpostal. In addition to helping expand abbreviations found in geocoder/address input, these dictionaries also inform the address parser and the language classifier.
Code | Language | Frequency in OSM |
---|---|---|
en | English | 16128724 |
fr | French | 5214044 |
de | German | 3886269 |
es | Spanish | 3568860 |
ja | Japanese | 3067910 |
pt | Portuguese | 2940133 |
it | Italian | 2510501 |
ru | Russian | 1770775 |
nl | Dutch | 1546215 |
zh | Chinese | 859707 |
pl | Polish | 603065 |
ca | Catalan | 575205 |
th | Thai | 382725 |
uk | Ukrainian | 366363 |
da | Danish | 359688 |
sv | Swedish | 292189 |
id | Indonesian | 285243 |
ko | Korean | 276525 |
hu | Hungarian | 254601 |
cs | Czech | 236100 |
ro | Romanian | 230943 |
nb | Norwegian (Bokmål) | 205299 |
ar | Arabic | 205010 |
fi | Finnish | 196344 |
tr | Turkish | 162672 |
ms | Malay | 158328 |
el | Greek | 139720 |
be | Belarusian | 127734 |
fa | Farsi | 116424 |
bg | Bulgarian | 116318 |
sr | Serbian | 100218 |
hr | Croatian | 98465 |
gl | Galician | 92234 |
he | Hebrew | 85501 |
vi | Vietnamese | 82158 |
sk | Slovak | 82091 |
ka | Georgian | 77399 |
lt | Lithuanian | 74998 |
sl | Slovene | 74636 |
ga | Irish | 70564 |
eu | Basque | 63814 |
lv | Latvian | 50028 |
et | Estonian | 35962 |
bs | Bosnian | 32323 |
kk | Kazakh | 31904 |
kn | Kannada | 28688 |
cy | Welsh | 25521 |
br | Breton | 24892 |
ne | Nepali | 23339 |
si | Sinhalese | 19415 |
az | Azerbaijani | 18588 |
mk | Macedonian | 17630 |
mt | Maltese | 16452 |
sq | Albanian | 15961 |
is | Icelandic | 15916 |
mr | Marathi | 14675 |
mn | Mongolian | 13788 |
am | Amharic | 13717 |
bn | Bengali | 12772 |
hy | Armenian | 12173 |
hi | Hindi | 11939 |
ky | Kyrgyz | 11833 |
lo | Lao | 10183 |
my | Burmese | 9802 |
uz | Uzbek | 8836 |
km | Khmer | 7247 |
oc | Occitan | 6871 |
tk | Turkmen | 6737 |
tg | Tajik | 4891 |
pap | Papiamento | 3968 |
bo | Tibetan | 3441 |
lb | Luxembourgish | 3081 |
mg | Malagasy | 2910 |
gd | Scottish Gaelic | 2861 |
mdf | Moksha | 2750 |
fo | Faroese | 2701 |
gsw | Swiss German | 2499 |
af | Afrikaans | 2476 |
ta | Tamil | 2429 |
myv | Erzya | 2412 |
fy | Western Frisian | 2222 |
ba | Bashkir | 2171 |
gv | Manx | 2038 |
ast | Asturian | 1862 |
tn | Tswana | 1831 |
tt | Tatar | 1688 |
os | Ossetian | 1511 |
dz | Dzongkha | 1190 |
udm | Udmurt | 1069 |
ml | Malayalam | 760 |
kl | Kalaallisut | 648 |
so | Somali | 642 |
to | Tongan | 503 |
ur | Urdu | 389 |
ug | Uyghur | 223 |
fil | Filipino | 204 |
se | Northern Sami | 175 |
dv | Divehi | 174 |
mi | Māori | 172 |
te | Telugu | 155 |
ab | Abkhaz | 149 |
ch | Chamorro | 130 |
ht | Haitian Creole | 121 |
kz | Kazakh | 78 |
qu | Quechua | 74 |
rm | Romansh | 64 |
sw | Swahili | 56 |
haw | Hawaiian | 46 |
ee | Ewe | 30 |
gu | Gujarati | 22 |
ak | Akan | 14 |