Initial commit
ueda-keisuke committed Apr 9, 2020
0 parents commit 61e2794
Showing 9 changed files with 306,246 additions and 0 deletions.
8 changes: 8 additions & 0 deletions LICENSE
@@ -0,0 +1,8 @@
This MeCab dictionary is based on cedict.

https://www.mdbg.net/chinese/dictionary?page=cedict

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License
https://creativecommons.org/licenses/by-sa/4.0/

It more or less means that you are allowed to use this data for both non-commercial and commercial purposes provided that you: mention where you got the data from (attribution) and that in case you improve / add to the data you will share these changes under the same license (share alike).
92 changes: 92 additions & 0 deletions README.md
@@ -0,0 +1,92 @@
# CC-CEDICT-MeCab

CC-CEDICT-MeCab is a MeCab dictionary for Chinese (Mandarin) text segmentation. It supports both traditional and simplified characters.

This dictionary was converted from CC-CEDICT. MeCab provides a training function, which learns from annotated data (usually created by language specialists), and a tokenization function, which segments text using the trained data.


## Cost estimation

Costs are usually estimated by machine learning methods. MeCab provides a CRF-based training function.

In this project, we did not train on annotated data. Instead, we use the CC-CEDICT vocabulary with a rough cost estimate based on word length:

```
cost = (int)max(-36000, -400 * (length^1.5))
```

For example, take "日本" (Japan), "日本人" (Japanese people), and "人" (person / people). Their costs are -1131, -2078, and -400, respectively. MeCab prefers the segmentation with the lowest total cost. Therefore:

```
cost("日本 人") = -1131 + (-400) = -1531 > cost("日本人") = -2078
```

Hence "日本人" will be chosen.
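
The arithmetic above can be reproduced with a short Python sketch. The helper names `estimate_cost` and `path_cost` are hypothetical and not part of this repository. The sketch also assumes that connection costs are zero, so a path's total cost is just the sum of its word costs.

```python
def estimate_cost(surface):
    """Word cost from the formula above: longer entries get lower (better) costs."""
    return int(max(-36000, -400 * (len(surface) ** 1.5)))


def path_cost(tokens):
    """Total cost of one segmentation (assumption: connection costs are zero)."""
    return sum(estimate_cost(token) for token in tokens)


candidates = [["日本", "人"], ["日本人"]]
for tokens in candidates:
    print(" ".join(tokens), path_cost(tokens))

# The lowest total cost wins, so the longer dictionary entry is preferred.
best = min(candidates, key=path_cost)
print("chosen:", " ".join(best))
```

Because the exponent is greater than 1, one long entry always costs less than the same characters split into shorter entries, until the -36000 floor is reached.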


## Traditional <--> simplified character converter




## Build
```bash
[mecab-cedict]# /usr/local/Cellar/mecab/0.996/libexec/mecab/mecab-dict-index -f utf-8 -t utf-8

./pos-id.def is not found. minimum setting is used
reading ./unk.def ... 11
emitting double-array: 100% |###########################################|
./model.def is not found. skipped.
./pos-id.def is not found. minimum setting is used
reading ./cedict.csv ... 187741
emitting double-array: 100% |###########################################|
reading ./matrix.def ... 1x1

done!
```

## Examples
### Basic usage
```bash
[mecab-cedict]# mecab -d .
武汉市解除离汉离鄂通道管控措施
武汉市 ,,,,Wu3 han4 shi4,武漢市,武汉市,Wuhan city on Changjiang
解除 ,,,,jie3 chu2,解除,解除,to remove/to sack/to get rid of/to relieve (sb of their duties)/to free/to lift (an embargo)/to rescind (an agreement)/
离 ,,,,li2,離,离,to leave/to part from/to be away from/(in giving distances) from/without (sth)/independent of/one of the Eight Trigrams 八卦[ba1 gua4]
汉 ,,,,han4,漢,汉,man/
离 ,,,,li2,離,离,to leave/to part from/to be away from/(in giving distances) from/without (sth)/independent of/one of the Eight Trigrams 八卦[ba1 gua4]
鄂 ,,,,E4,鄂,鄂,abbr. for Hubei Province 湖北省[Hu2 bei3 Sheng3] in central China/surname E/
通道 ,,,,tong1 dao4,通道,通道,(communications) channel/thoroughfare/passage/
管控 ,,,,guan3 kong4,管控,管控,to control/
措施 ,,,,cuo4 shi1,措施,措施,measure/step/CL:個|个[ge4]/
EOS
```
### Converting to traditional characters
```bash
[mecab-cedict]# mecab -d . -Otraditional
武汉市解除离汉离鄂通道管控措施
武漢市 解除 離 漢 離 鄂 通道 管控 措施
```
### Converting to simplified characters
```bash
[mecab-cedict]# mecab -d .
近期自煮防疫已成了最新飲食觀
近期 ,,,,jin4 qi1,近期,近期,near in time/in the near future/very soon/recent/
自 ,,,,zi4,自,自,self/oneself/from/since/naturally/surely/
煮 ,,,,zhu3,煮,煮,to cook/to boil/
防疫 ,,,,fang2 yi4,防疫,防疫,disease prevention/protection against epidemic/
已 ,,,,yi3,已,已,already/to stop/then/afterwards/
成了 ,,,,cheng2 le5,成了,成了,to be done/to be ready/that's enough!/that will do!/
最新 ,,,,zui4 xin1,最新,最新,latest/newest/
飲食 ,,,,yin3 shi2,飲食,饮食,food and drink/diet/
觀 ,,,,guan4,觀,观,Taoist monastery/palace gate watchtower/platform/
EOS
[mecab-cedict]# mecab -d . -Osimplified
近期自煮防疫已成了最新飲食觀
近期 自 煮 防疫 已 成了 最新 饮食 观
```
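
The feature fields in the default output can also be post-processed outside MeCab. The sketch below assumes each output line has the shape shown above: the surface form, whitespace, then comma-separated features whose last four fields are pinyin, traditional, simplified, and definition. `parse_line` and `to_simplified` are hypothetical helper names, not part of this repository.

```python
def parse_line(line):
    """Split one output line into its surface form and CC-CEDICT fields.

    Returns None when the line does not have the assumed eight feature fields
    (for example, an unknown word).
    """
    parts = line.rstrip("\n").split(None, 1)
    if len(parts) < 2:
        return None
    surface, features = parts
    # Limit the split to 7 so commas inside the definition stay in the last field.
    fields = features.split(",", 7)
    if len(fields) < 8:
        return None
    return {
        "surface": surface,
        "pinyin": fields[4],
        "traditional": fields[5],
        "simplified": fields[6],
        "definition": fields[7],
    }


def to_simplified(mecab_output):
    """Join the simplified form of every token, falling back to the surface form."""
    tokens = []
    for line in mecab_output.splitlines():
        if not line.strip() or line.strip() == "EOS":
            continue
        entry = parse_line(line)
        tokens.append(entry["simplified"] if entry else line.split()[0])
    return " ".join(tokens)


sample = (
    "飲食\t,,,,yin3 shi2,飲食,饮食,food and drink/diet/\n"
    "觀\t,,,,guan4,觀,观,Taoist monastery/palace gate watchtower/platform/\n"
    "EOS\n"
)
print(to_simplified(sample))
```

Swapping `"simplified"` for `"traditional"` gives the reverse conversion under the same assumptions.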
187,741 changes: 187,741 additions & 0 deletions cedict.csv

Large diffs are not rendered by default.

28 changes: 28 additions & 0 deletions cedict_to_csv.py
@@ -0,0 +1,28 @@
import re

pattern = re.compile(r"^(.*?) (.*?) \[(.*?)\] /(.*?)$")

# surface -> csv row (surface, left id, right id, cost, *, *, *, *, pinyin, traditional, simplified, definition)
entries = {}

with open("cedict_ts.u8", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line.startswith("#"):
            continue

        match = pattern.match(line)
        if match:
            traditional = match.group(1)
            simplified = match.group(2)
            pinyin = match.group(3)
            definition = match.group(4)

            cost = int(max(-36000, -400 * (len(traditional) ** 1.5)))
            entries[traditional] = f"{traditional},0,0,{cost},*,*,*,*,{pinyin},{traditional},{simplified},{definition}"
            entries[simplified] = f"{simplified},0,0,{cost},*,*,*,*,{pinyin},{traditional},{simplified},{definition}"

with open("cedict.csv", mode="w", encoding="utf-8") as f:
    for value in entries.values():
        f.write(value + "\n")
