Corpus preprocessing steps #13
Comments
Hi LydiaXiaohongLi, I recommend looking at Google's vocab files first. They show that lower-casing is optional (answer to question 1). If you check those files, you will also see that punctuation is included. If you build a vocab with my project or with others, it is ordered by frequency, except for some special tokens at the top of the vocab (answer to question 4).
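To make the ordering concrete, here is a minimal sketch of a frequency-ordered vocab with special tokens placed first and optional lower-casing. It counts whitespace-separated words rather than subwords, so it only illustrates the ordering, not this project's actual algorithm. The names `build_vocab` and `SPECIAL_TOKENS`, and the exact list of special tokens, are assumptions made for illustration.

```python
from collections import Counter

# Assumed special tokens; a real vocab file may list others (e.g. unused slots).
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def build_vocab(lines, lowercase=True):
    """Return special tokens followed by words sorted by descending frequency."""
    counts = Counter()
    for line in lines:
        if lowercase:
            # Lower-casing is optional; it corresponds to an "uncased" vocab.
            line = line.lower()
        counts.update(line.split())
    # most_common() sorts by count, highest first.
    ordered = [token for token, _ in counts.most_common()]
    return SPECIAL_TOKENS + ordered


corpus = ["The cat sat", "the cat ran", "the dog"]
vocab = build_vocab(corpus)
print(vocab[:7])
```

With `lowercase=False`, "The" and "the" would be counted as separate entries, which is the difference between a cased and an uncased vocab.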
Thanks, kwonmha.
The subword vocab-building algorithm will automatically separate 'hello,' into "hello" and ",".
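For readers who want to check this behaviour by hand, the sketch below splits punctuation off words before any subword step. It is an illustration in the spirit of BERT-style basic tokenization, not this repository's code; `is_punctuation` and `split_on_punctuation` are hypothetical helper names.

```python
import unicodedata


def is_punctuation(ch):
    """Treat ASCII symbols and Unicode 'P*' categories as punctuation."""
    cp = ord(ch)
    if 33 <= cp <= 47 or 58 <= cp <= 64 or 91 <= cp <= 96 or 123 <= cp <= 126:
        return True
    return unicodedata.category(ch).startswith("P")


def split_on_punctuation(text):
    """Split on whitespace, then emit each punctuation mark as its own token."""
    tokens = []
    for word in text.split():
        current = ""
        for ch in word:
            if is_punctuation(ch):
                if current:
                    tokens.append(current)
                    current = ""
                tokens.append(ch)
            else:
                current += ch
        if current:
            tokens.append(current)
    return tokens


print(split_on_punctuation("hello, world!"))
```

If a generated vocab still contains entries with punctuation attached, running a split like this over the corpus first is one possible preprocessing step.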
Hi @kwonmha, the vocab file that I generated has an issue with punctuation. -(Q). Could you please suggest how I can separate the punctuation? Does that involve further preprocessing?
I fixed this problem.
Hi Kwonmha,
Thanks for open-sourcing the repo. Can I ask whether the general preprocessing steps for the vocab builder, for an uncased BERT model, are as follows:
Please let me know if my understanding is not correct.
Thanks!
Regards