Commit

Implemented proper text segmentation via mecab
arianneorpilla committed Apr 6, 2021
1 parent 4c2106f commit 0cc32fc
Showing 20 changed files with 3,680 additions and 95 deletions.
Binary file added assets/ipadic/char.bin
Binary file not shown.
29 changes: 29 additions & 0 deletions assets/ipadic/dicrc
@@ -0,0 +1,29 @@
;
; Configuration file of IPADIC
;
; $Id: dicrc,v 1.4 2006/04/08 06:41:36 taku-ku Exp $;
;
cost-factor = 800
bos-feature = BOS/EOS,*,*,*,*,*,*,*,*
eval-size = 8
unk-eval-size = 4
config-charset = EUC-JP

; yomi
node-format-yomi = %pS%f[7]
unk-format-yomi = %M
eos-format-yomi = \n

; simple
node-format-simple = %m\t%F-[0,1,2,3]\n
eos-format-simple = EOS\n

; ChaSen
node-format-chasen = %m\t%f[7]\t%f[6]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
unk-format-chasen = %m\t%m\t%m\t%F-[0,1,2,3]\t\t\n
eos-format-chasen = EOS\n

; ChaSen (include spaces)
node-format-chasen2 = %M\t%f[7]\t%f[6]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
unk-format-chasen2 = %M\t%m\t%m\t%F-[0,1,2,3]\t\t\n
eos-format-chasen2 = EOS\n
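For readers unfamiliar with the `dicrc` layout above: it is a flat list of `key = value` settings with `;` comment lines, and each `node-format-NAME` / `unk-format-NAME` / `eos-format-NAME` group defines one named output format (for example, MeCab's `-O chasen` option selects the `chasen` group). The following is a minimal Python sketch of reading such a file. It is not part of this commit and is not MeCab's own parser; the helper names `parse_dicrc` and `output_format_names` are hypothetical, and the sample text is an excerpt of the file shown above.

```python
def parse_dicrc(text):
    """Parse dicrc-style text into a dict of key -> raw value string.

    Assumption: one `key = value` pair per line, `;` starts a comment line.
    Escape sequences such as \\t and \\n are kept as written, not expanded.
    """
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and ';' comment lines.
        if not line or line.startswith(";"):
            continue
        # Split on the first '=' only, so values may contain '=' themselves.
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key = value line
        settings[key.strip()] = value.strip()
    return settings


def output_format_names(settings):
    """Return the sorted names of the node formats defined, e.g. 'chasen'."""
    prefix = "node-format-"
    return sorted(k[len(prefix):] for k in settings if k.startswith(prefix))


# Excerpt of the dicrc above; a raw string keeps \t and \n as literal text.
SAMPLE = r"""
; Configuration file of IPADIC
cost-factor = 800
config-charset = EUC-JP
node-format-yomi = %pS%f[7]
eos-format-yomi = \n
node-format-simple = %m\t%F-[0,1,2,3]\n
"""

settings = parse_dicrc(SAMPLE)
print(settings["config-charset"])
print(output_format_names(settings))
```

Values are kept as raw strings on purpose: interpreting the `%m`, `%f[...]` and `%F-[...]` placeholders is left to MeCab itself.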