initial project version
dohliam committed Jul 15, 2016
0 parents commit 86a7d75
Showing 117 changed files with 16,825,872 additions and 0 deletions.
20 changes: 20 additions & 0 deletions LICENSE
@@ -0,0 +1,20 @@
CC0 1.0 Universal (CC0 1.0) Public Domain Dedication

No Copyright

This license is acceptable for Free Cultural Works.

The person who associated a work with this deed has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.

You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information below.


Other Information

In no way are the patent or trademark rights of any person affected by CC0, nor are the rights that other persons may have in the work or in how the work is used, such as publicity or privacy rights.

Unless expressly stated otherwise, the person who associated a work with this deed makes no warranties about the work, and disclaims liability for all uses of the work, to the fullest extent permitted by applicable law.
When using or citing the work, you should not imply endorsement by the author or the affirmer.


https://creativecommons.org/publicdomain/zero/1.0/
282 changes: 282 additions & 0 deletions README.md
@@ -0,0 +1,282 @@
# wikidict-dsl-eo - Wikidata Bilingual DSL Dictionaries (Esperanto)

This repository provides a collection of bilingual Esperanto dictionaries in DSL format. The entries are derived from Wikipedia interwiki links (links between the titles of articles on the same topic in different languages), and the data has been extracted from [Wikidata](https://www.wikidata.org/).
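
To illustrate how a title pair can be obtained from interwiki links, here is a minimal Python sketch. It assumes the Wikidata JSON dump layout (one entity per line inside a JSON array) and the `sitelinks` field of each entity. The function names `iter_entities` and `title_pair` are hypothetical, and this is not necessarily how the dictionaries in this repository were generated.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def iter_entities(lines: Iterable[str]) -> Iterator[dict]:
    """Yield entity dicts from the lines of a Wikidata JSON dump.

    Assumes one entity per line, wrapped in '[' and ']' lines,
    with a trailing comma after every entity except the last.
    """
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        yield json.loads(line)


def title_pair(
    entity: dict, source_wiki: str, target_wiki: str = "eowiki"
) -> Optional[Tuple[str, str]]:
    """Return (source title, Esperanto title), or None if either is missing."""
    sitelinks = entity.get("sitelinks", {})
    source = sitelinks.get(source_wiki)
    target = sitelinks.get(target_wiki)
    if source is None or target is None:
        return None
    return source["title"], target["title"]


# Tiny hand-written sample standing in for a real dump file.
sample_dump = [
    "[",
    '{"id": "Q1", "sitelinks": {"enwiki": {"title": "Universe"}, "eowiki": {"title": "Universo"}}},',
    '{"id": "Q2", "sitelinks": {"enwiki": {"title": "Earth"}}}',
    "]",
]

pairs = []
for entity in iter_entities(sample_dump):
    pair = title_pair(entity, "enwiki")
    if pair is not None:
        pairs.append(pair)
```

Entities that lack a link in either language are skipped, so only the first sample entity yields a pair here.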

## Format

ABBYY Lingvo DSL is a flexible dictionary format that can be read by dictionary applications such as [GoldenDict](https://github.com/goldendict/goldendict) and converted to other formats using tools such as [PyGlossary](https://github.com/ilius/pyglossary). The [dsl-tools](https://github.com/dohliam/dsl-tools) project also provides a number of tools for creating DSL dictionaries.

DSL files *must* be saved as UTF-16 to be usable by dictionary programs. The raw source files in this repository are saved as UTF-8, which makes them significantly smaller and also readable (and diffable) by git. Fully encoded and compressed `.dsl.dz` dictionaries, ready for use, are available in the [Releases](https://github.com/open-dsl-dict/wikidict-dsl-eo/releases) section.
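
As a rough illustration of the encoding step, the following Python sketch re-encodes a UTF-8 source file as UTF-16. It assumes that little-endian UTF-16 with a byte order mark is acceptable to the target dictionary program. The function name `utf8_dsl_to_utf16` is hypothetical, and the sketch does not perform the `.dz` compression.

```python
import codecs
from pathlib import Path


def utf8_dsl_to_utf16(src: Path, dst: Path) -> None:
    """Re-encode a UTF-8 DSL file as UTF-16 (little-endian, with BOM)."""
    # "utf-8-sig" tolerates an optional UTF-8 byte order mark in the source.
    text = src.read_text(encoding="utf-8-sig")
    # Write the BOM explicitly so the byte order is the same on every platform.
    dst.write_bytes(codecs.BOM_UTF16_LE + text.encode("utf-16-le"))
```

Writing the byte order mark by hand, rather than relying on the generic `utf-16` codec, keeps the output identical regardless of the machine's native byte order.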

You can also use the `rezip_dsl.rb` and `unzip_dsl.rb` [scripts](https://github.com/dohliam/dsl-tools/tree/master/zip_unzip) provided by the [dsl-tools](https://github.com/dohliam/dsl-tools) repo to encode and compress, or decompress and decode, the dictionaries either individually or as a group.
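
If Ruby is not available, a released dictionary can also be inspected with a few lines of Python. This sketch relies on the dictzip (`.dz`) format being gzip-compatible and assumes the decompressed content is UTF-16 with a byte order mark. The function name `read_dsl_dz` is hypothetical, and this is not a substitute for the dsl-tools scripts.

```python
import gzip
from pathlib import Path


def read_dsl_dz(path: Path) -> str:
    """Decompress a .dsl.dz file and return its text.

    Reads the whole file into memory, which is fine for a quick look
    but may be heavy for the largest dictionaries.
    """
    with gzip.open(path, "rb") as handle:
        # The "utf-16" codec uses the byte order mark to pick the byte order.
        return handle.read().decode("utf-16")
```

The returned string can then be written back out as UTF-8 with `Path.write_text(text, encoding="utf-8")` if a diffable copy is wanted.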

## Data

The data directory contains one bilingual dictionary per language pair, identified by [ISO language code](http://en.wikipedia.org/wiki/ISO_639-1).

The basic filename pattern is `[ISO]-eo_wikidict.dsl`, where `[ISO]` is the ISO code of the source language. All language pairs are listed [below](#available-language-pairs).
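
The filename pattern can be handled programmatically, as in this small Python sketch. The helper names are hypothetical, and the regular expression assumes that source codes consist only of lowercase letters and underscores, as in the table of language pairs.

```python
import re
from typing import Optional

# Matches names such as "af-eo_wikidict.dsl" or "zh_min_nan-eo_wikidict.dsl".
FILENAME_PATTERN = re.compile(r"^(?P<code>[a-z_]+)-eo_wikidict\.dsl$")


def dictionary_filename(source_code: str) -> str:
    """Build the dictionary filename for a source language code."""
    return f"{source_code}-eo_wikidict.dsl"


def source_code_from_filename(name: str) -> Optional[str]:
    """Extract the source language code, or None if the name does not match."""
    match = FILENAME_PATTERN.match(name)
    return match.group("code") if match else None
```

For example, `dictionary_filename("af")` gives `af-eo_wikidict.dsl`, and `source_code_from_filename` reverses that mapping.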

## Available language pairs

Language codes | Language names
-------------- | --------------
`af-eo` | Afrikaans => Esperanto
`am-eo` | Amharic => Esperanto
`ang-eo` | Anglo-Saxon => Esperanto
`ar-eo` | Arabic => Esperanto
`arc-eo` | Aramaic => Esperanto
`bg-eo` | Bulgarian => Esperanto
`bi-eo` | Bislama => Esperanto
`bn-eo` | Bengali => Esperanto
`bo-eo` | Tibetan => Esperanto
`br-eo` | Breton => Esperanto
`bs-eo` | Bosnian => Esperanto
`ca-eo` | Catalan => Esperanto
`cdo-eo` | Min Dong => Esperanto
`chr-eo` | Cherokee => Esperanto
`chy-eo` | Cheyenne => Esperanto
`cr-eo` | Cree => Esperanto
`cs-eo` | Czech => Esperanto
`cy-eo` | Welsh => Esperanto
`da-eo` | Danish => Esperanto
`de-eo` | German => Esperanto
`el-eo` | Greek => Esperanto
`en-eo` | English => Esperanto
`es-eo` | Spanish => Esperanto
`et-eo` | Estonian => Esperanto
`eu-eo` | Basque => Esperanto
`fa-eo` | Persian => Esperanto
`ff-eo` | Fula => Esperanto
`fi-eo` | Finnish => Esperanto
`fr-eo` | French => Esperanto
`ga-eo` | Irish => Esperanto
`gan-eo` | Gan => Esperanto
`gd-eo` | Scottish Gaelic => Esperanto
`gu-eo` | Gujarati => Esperanto
`gv-eo` | Manx => Esperanto
`ha-eo` | Hausa => Esperanto
`hak-eo` | Hakka => Esperanto
`haw-eo` | Hawaiian => Esperanto
`he-eo` | Hebrew => Esperanto
`hi-eo` | Hindi => Esperanto
`hr-eo` | Croatian => Esperanto
`ht-eo` | Haitian => Esperanto
`hu-eo` | Hungarian => Esperanto
`hy-eo` | Armenian => Esperanto
`id-eo` | Indonesian => Esperanto
`ig-eo` | Igbo => Esperanto
`is-eo` | Icelandic => Esperanto
`it-eo` | Italian => Esperanto
`iu-eo` | Inuktitut => Esperanto
`ja-eo` | Japanese => Esperanto
`jbo-eo` | Lojban => Esperanto
`jv-eo` | Javanese => Esperanto
`ka-eo` | Georgian => Esperanto
`kg-eo` | Kongo => Esperanto
`ki-eo` | Kikuyu => Esperanto
`kl-eo` | Greenlandic => Esperanto
`km-eo` | Khmer => Esperanto
`ko-eo` | Korean => Esperanto
`la-eo` | Latin => Esperanto
`lg-eo` | Luganda => Esperanto
`lo-eo` | Lao => Esperanto
`lt-eo` | Lithuanian => Esperanto
`lv-eo` | Latvian => Esperanto
`mg-eo` | Malagasy => Esperanto
`mi-eo` | Maori => Esperanto
`mn-eo` | Mongolian => Esperanto
`ms-eo` | Malay => Esperanto
`mt-eo` | Maltese => Esperanto
`nah-eo` | Nahuatl => Esperanto
`ne-eo` | Nepali => Esperanto
`nl-eo` | Dutch => Esperanto
`nn-eo` | Norwegian (Nynorsk) => Esperanto
`no-eo` | Norwegian => Esperanto
`nv-eo` | Navajo => Esperanto
`ny-eo` | Chichewa => Esperanto
`oc-eo` | Occitan => Esperanto
`pa-eo` | Punjabi => Esperanto
`pi-eo` | Pali => Esperanto
`pl-eo` | Polish => Esperanto
`ps-eo` | Pashto => Esperanto
`pt-eo` | Portuguese => Esperanto
`qu-eo` | Quechua => Esperanto
`ro-eo` | Romanian => Esperanto
`ru-eo` | Russian => Esperanto
`sa-eo` | Sanskrit => Esperanto
`se-eo` | Northern Sami => Esperanto
`sh-eo` | Serbo-Croatian => Esperanto
`sk-eo` | Slovak => Esperanto
`sl-eo` | Slovenian => Esperanto
`sn-eo` | Shona => Esperanto
`so-eo` | Somali => Esperanto
`sq-eo` | Albanian => Esperanto
`sr-eo` | Serbian => Esperanto
`sv-eo` | Swedish => Esperanto
`sw-eo` | Kiswahili => Esperanto
`ta-eo` | Tamil => Esperanto
`te-eo` | Telugu => Esperanto
`th-eo` | Thai => Esperanto
`tl-eo` | Tagalog => Esperanto
`tpi-eo` | Tok Pisin => Esperanto
`tr-eo` | Turkish => Esperanto
`ug-eo` | Uyghur => Esperanto
`uk-eo` | Ukrainian => Esperanto
`ur-eo` | Urdu => Esperanto
`vi-eo` | Vietnamese => Esperanto
`wo-eo` | Wolof => Esperanto
`wuu-eo` | Wu => Esperanto
`xh-eo` | Xhosa => Esperanto
`yi-eo` | Yiddish => Esperanto
`yo-eo` | Yoruba => Esperanto
`za-eo` | Zhuang => Esperanto
`zh-eo` | Chinese (Mandarin) => Esperanto
`zh_classical-eo` | Classical Chinese => Esperanto
`zh_min_nan-eo` | Min Nan => Esperanto
`zh_yue-eo` | Cantonese => Esperanto
`zu-eo` | Zulu => Esperanto

## Statistics

### Dictionary size

Language pair | # of entries
------------- | ------------
`af-eo` | 15774
`am-eo` | 4800
`ang-eo` | 1993
`ar-eo` | 45989
`arc-eo` | 1195
`bg-eo` | 46336
`bi-eo` | 413
`bn-eo` | 11470
`bo-eo` | 1996
`br-eo` | 20125
`bs-eo` | 23241
`ca-eo` | 68969
`cdo-eo` | 1585
`chr-eo` | 417
`chy-eo` | 453
`cr-eo` | 81
`cs-eo` | 57254
`cy-eo` | 16196
`da-eo` | 41639
`de-eo` | 128145
`el-eo` | 29123
`en-eo` | 159055
`es-eo` | 100041
`et-eo` | 32313
`eu-eo` | 42865
`fa-eo` | 63453
`ff-eo` | 177
`fi-eo` | 54937
`fr-eo` | 127889
`ga-eo` | 13473
`gan-eo` | 3985
`gd-eo` | 7104
`gu-eo` | 2208
`gv-eo` | 3174
`ha-eo` | 335
`hak-eo` | 1940
`haw-eo` | 832
`he-eo` | 37716
`hi-eo` | 14068
`hr-eo` | 36007
`ht-eo` | 6384
`hu-eo` | 76764
`hy-eo` | 34892
`id-eo` | 42766
`ig-eo` | 641
`is-eo` | 14485
`it-eo` | 130538
`iu-eo` | 308
`ja-eo` | 74084
`jbo-eo` | 1094
`jv-eo` | 11476
`ka-eo` | 26293
`kg-eo` | 682
`ki-eo` | 277
`kl-eo` | 1312
`km-eo` | 986
`ko-eo` | 47583
`la-eo` | 45439
`lg-eo` | 140
`lo-eo` | 823
`lt-eo` | 34425
`lv-eo` | 22553
`mg-eo` | 11706
`mi-eo` | 2016
`mn-eo` | 5808
`ms-eo` | 57543
`mt-eo` | 1259
`nah-eo` | 6131
`ne-eo` | 4804
`nl-eo` | 122922
`nn-eo` | 31010
`no-eo` | 58995
`nv-eo` | 1123
`ny-eo` | 76
`oc-eo` | 25923
`pa-eo` | 5898
`pi-eo` | 2232
`pl-eo` | 123552
`ps-eo` | 1802
`pt-eo` | 114791
`qu-eo` | 8934
`ro-eo` | 77571
`ru-eo` | 117026
`sa-eo` | 3502
`se-eo` | 3942
`sh-eo` | 62550
`sk-eo` | 57601
`sl-eo` | 28325
`sn-eo` | 1309
`so-eo` | 2010
`sq-eo` | 14979
`sr-eo` | 73390
`sv-eo` | 71976
`sw-eo` | 12228
`ta-eo` | 12872
`te-eo` | 5063
`th-eo` | 23113
`tl-eo` | 23735
`tpi-eo` | 917
`tr-eo` | 48303
`ug-eo` | 1848
`uk-eo` | 93278
`ur-eo` | 14814
`vi-eo` | 84049
`wo-eo` | 779
`wuu-eo` | 1323
`xh-eo` | 224
`yi-eo` | 5645
`yo-eo` | 19411
`za-eo` | 462
`zh-eo` | 102468
`zh_classical-eo` | 1501
`zh_min_nan-eo` | 7608
`zh_yue-eo` | 11482
`zu-eo` | 482
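
Entry counts like these can be approximated by counting headword lines. The Python sketch below assumes the usual DSL layout, in which header lines start with `#`, headwords start in the first column, and the body of each entry is indented with a space or tab. The function name `count_headwords` is hypothetical, and the figures in the table were not necessarily computed this way.

```python
def count_headwords(dsl_text: str) -> int:
    """Count lines that look like DSL headwords."""
    count = 0
    # Drop a leading byte order mark so the first header line is recognised.
    for line in dsl_text.lstrip("\ufeff").splitlines():
        if not line.strip():
            continue  # blank line between entries
        if line[0] in " \t#":
            continue  # indented entry body or '#' header line
        count += 1
    return count


# Hand-written sample with two entries.
sample = (
    '#NAME "Sample"\n'
    '#INDEX_LANGUAGE "English"\n'
    '#CONTENTS_LANGUAGE "Esperanto"\n'
    "\n"
    "Earth\n"
    "\tTero\n"
    "Moon\n"
    "\tLuno\n"
)
sample_count = count_headwords(sample)
```

Note that an entry with several consecutive headword lines would be counted once per headword under this approach.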

### Top ten dictionaries by number of entries

Language pair | # of entries
------------- | ------------
`en-eo` | 159055
`it-eo` | 130538
`de-eo` | 128145
`fr-eo` | 127889
`pl-eo` | 123552
`nl-eo` | 122922
`ru-eo` | 117026
`pt-eo` | 114791
`zh-eo` | 102468
`es-eo` | 100041
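
A ranking like this one can be reproduced from the per-dictionary counts with a simple sort. In the Python sketch below, the function name `top_dictionaries` is hypothetical and the `counts` mapping holds only a handful of values copied from the dictionary size table.

```python
from typing import Dict, List, Tuple


def top_dictionaries(counts: Dict[str, int], n: int = 10) -> List[Tuple[str, int]]:
    """Return the n language pairs with the most entries, largest first."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]


# A few rows from the dictionary size table.
counts = {
    "ny-eo": 76,
    "de-eo": 128145,
    "uk-eo": 93278,
    "en-eo": 159055,
    "it-eo": 130538,
}
top_three = top_dictionaries(counts, n=3)
```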

## License

According to the Wikidata website:

> All structured data from the main and property namespace is available under the Creative Commons CC0 License

The data in this repository is therefore made available under the same [Creative Commons CC0 License](https://creativecommons.org/publicdomain/zero/1.0/) as that used by the Wikidata project. All of the data has been derived from the Wikidata JSON format [database dumps](https://dumps.wikimedia.org/wikidatawiki/entities/).