zh-TW sentences released into the public domain under CC0, gathered from various sources:
- Archive of the G0v Rand0m channel (chat in this channel to donate sentences) - https://g0v-slack-archive.g0v.ronny.tw/index/channel/CGU1SLHNH
- The zh-TW corpus of the Mozilla Common Voice project - https://github.com/mozilla/voice-web/tree/master/server/data/zh-TW
Phonetic coverage of the current corpus, compared with the CnsPhonetic2016-08v2.cin input table
(calculated with text-tools.js on the 2022-03-09 DB, 24244 sentences):
$ node text-tools.js -c all.txt CnsPhonetic2016-08v2.cin
Total numbers of phonetic in voice-text-tools/CnsPhonetic2016-08v2.cin are 1567
Numbers of phonetic from 3495 characters in all.txt are 1040
We have cover 66.37% of the pronunciations.
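
The phonetic coverage figure above can be approximated with the sketch below. It is not the actual text-tools.js code: the function name `phoneticCoverage`, the `Map` layout, and the toy table are hypothetical. It also assumes that a pronunciation counts as covered as soon as one character carrying it appears in the corpus, which may differ from how the real tool counts.

```javascript
// Hypothetical sketch: not the actual text-tools.js implementation.
// `table` maps each character to the pronunciations the input table lists for it.
function phoneticCoverage(text, table) {
  // Every distinct pronunciation in the table forms the denominator.
  const all = new Set();
  for (const readings of table.values()) {
    for (const r of readings) all.add(r);
  }
  // A pronunciation is covered once any character carrying it occurs in the text.
  const covered = new Set();
  for (const ch of new Set(text)) {
    for (const r of table.get(ch) ?? []) covered.add(r);
  }
  return { total: all.size, covered: covered.size, rate: covered.size / all.size };
}

// Toy table for illustration only; a real run would load the .cin file instead.
const toyTable = new Map([
  ["你", ["ㄋㄧˇ"]],
  ["好", ["ㄏㄠˇ", "ㄏㄠˋ"]],
  ["馬", ["ㄇㄚˇ"]],
]);
const phoneticResult = phoneticCoverage("你好", toyTable);
console.log(`${(phoneticResult.rate * 100).toFixed(2)}%`); // prints "75.00%"
```

With the counts reported above, the same ratio is 1040 / 1567, which rounds to the 66.37% shown.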
Character coverage of the current corpus against the Ministry of Education (MOE) common-character table, with the missing characters:
$ node text-tools.js -o all.txt 教育部2015常用字99.75%\(3593字\).txt
Numbers of chars in all.txt are 3495
Numbers of chars in voice-text-tools/教育部2015常用字99.75%(3593字).txt are 3593
all.txt includes 3011 chars from voice-text-tools/教育部2015常用字99.75%(3593字).txt (83.8%)
all.txt missing 582 chars from voice-text-tools/教育部2015常用字99.75%(3593字).txt (16.2%):

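The character comparison above amounts to a set difference between the corpus and the reference list. The sketch below illustrates that idea under stated assumptions: `charCoverage` is a hypothetical name, only whitespace is ignored, and the real text-tools.js may filter characters differently.

```javascript
// Hypothetical sketch: not the actual text-tools.js implementation.
// Reports which characters of a reference list appear in, or are absent from, a corpus.
function charCoverage(corpusText, referenceText) {
  const isChar = (ch) => !/\s/.test(ch); // assumption: only whitespace is ignored
  // Spreading a string iterates by code point, so each character is kept whole.
  const corpusChars = new Set([...corpusText].filter(isChar));
  const referenceChars = new Set([...referenceText].filter(isChar));
  const included = [...referenceChars].filter((ch) => corpusChars.has(ch));
  const missing = [...referenceChars].filter((ch) => !corpusChars.has(ch));
  // Percentages are relative to the size of the reference list.
  const pct = (n) => ((n / referenceChars.size) * 100).toFixed(1);
  return {
    included,
    missing,
    includedPct: pct(included.length),
    missingPct: pct(missing.length),
  };
}

// Toy strings for illustration; a real run would read all.txt and the MOE list.
const charResult = charCoverage("今天天氣很好", "天氣好壞");
console.log(charResult.missing.join(""), `${charResult.missingPct}%`); // prints "壞 25.0%"
```

Applied to the counts reported above, 3011 / 3593 and 582 / 3593 round to the 83.8% and 16.2% shown.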