Initial Release of the Cantodict Archive.
This is an attempt to preserve Cantodict (https://www.cantonese.sheik.co.uk/dictionary/). Cantodict is an important resource for Cantonese, as it is effectively a descriptive dictionary characterizing the language of the early 2000s. The GitHub archive contains the scraped webpages, the processing scripts, and the generated output files.
Entries are exported as JSON, CSV, and SQLite, and in Kindle and Kobo dictionary formats.
See the README.md for descriptions of all the files.
For language processing nerds, you will likely want either cantodict.sqlite or cantodict.csv, as those have all the data. The CSV also has Yale romanizations.
For normal users, the Kobo and Kindle dictionaries are probably more interesting. Grab the Yale or Jyutping version based on preference. There is also a no-vulgar version, which removes about 205 "vulgar" entries (though there are likely many others still in there).
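Since the table and column layout of cantodict.sqlite is documented in the README rather than here, a reasonable first step is to ask the database what it contains. The following is a minimal sketch using only Python's standard library; the function name `describe_sqlite` is hypothetical, and nothing about the actual schema is assumed.

```python
import sqlite3


def describe_sqlite(path):
    """Return {table_name: [column names]} for every table in a SQLite file.

    Hypothetical helper for illustration; it makes no assumption about
    which tables or columns the Cantodict export actually contains.
    Note that sqlite3.connect creates an empty file if `path` does not exist.
    """
    conn = sqlite3.connect(path)
    try:
        # sqlite_master lists every object in the database; keep only tables.
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        schema = {}
        for table in tables:
            # PRAGMA table_info yields one row per column; index 1 is its name.
            columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
            schema[table] = [column[1] for column in columns]
        return schema
    finally:
        conn.close()
```

Calling `describe_sqlite("cantodict.sqlite")` from the directory holding the downloaded file should give enough information to start writing queries against the real column names.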
If any members of the original Cantodict team have concerns about this release, please contact me. I have attempted to reach out to the editors, and a couple of the most prolific ones got back to me supporting this release, so I am putting it out there.
A small set of the output formats are linked directly in the release below. Note, for the Kobo dicthtml files, while they will work as named, renaming the wanted variant to dicthtml-zh-en.zip before putting it in the .kobo/dictionary directory will yield a prettier "zh - English" name instead of the full filename.
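The rename-and-copy step for Kobo can be sketched as a small Python helper. This is only an illustration: the function name `install_kobo_dictionary`, the source filename in the usage note, and the mount point are assumptions, while the target name dicthtml-zh-en.zip and the .kobo/dictionary directory come from the note above.

```python
import shutil
from pathlib import Path


def install_kobo_dictionary(src_zip, kobo_root, target_name="dicthtml-zh-en.zip"):
    """Copy a downloaded dicthtml variant onto the Kobo under a new name.

    Hypothetical helper: `src_zip` is whichever variant was downloaded and
    `kobo_root` is wherever the device is mounted (both are assumptions).
    """
    dest_dir = Path(kobo_root) / ".kobo" / "dictionary"
    # Create the directory if it is missing so the copy does not fail.
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / target_name
    # copyfile overwrites any existing file already using the target name.
    shutil.copyfile(src_zip, dest)
    return dest
```

For example, `install_kobo_dictionary("downloaded-variant.zip", "/media/KOBOeReader")` would place the chosen variant on the device as dicthtml-zh-en.zip; both arguments here are placeholders for the real download name and mount point.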
For all build artifacts, including all the dictionary variants, see the output directory in the GitHub repo.