Corpus preprocessing steps #13

LydiaXiaohongLi · 2020-03-23T07:10:18Z

Hi Kwonmha,
Thanks for open-sourcing the repo. Can I check my understanding of the general preprocessing steps for the vocab builder? For an uncased BERT model, I think they are as follows:

1. Convert the corpus text file to lower case.
2. Remove punctuation from the corpus text file?
3. Build the vocab.
4. Match the vocab file to the BERT model configuration, e.g. take the top 30k lines (since the vocab should be ordered by descending frequency?), and manually adjust the vocab file so that it contains punctuation (i.e. vocab entries for . , ? ! ##. ##, ##? ##! etc.)?
5. Use the vocab file for later pretraining of the BERT model. The pretraining corpus needs to be lower-cased, but without removing punctuation?

Please let me know if my understanding is not correct.

Thanks!
Regards

kwonmha · 2020-03-25T02:23:25Z

Hi LydiaXiaohongLi,

I recommend looking at Google's vocab files first.
There are various versions: English cased, English uncased, Multilingual cased, Multilingual uncased, etc.

The existence of both cased and uncased versions shows that lower-casing is optional. (answer to question 1)

If you check those vocab files, you will see that punctuation marks are included.
You don't need to remove punctuation. (answer to question 2)

If you build a vocab with my project or others, it will be ordered by frequency, except for some special tokens at the top of the vocab. (answer to question 4)
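
For anyone who wants to sanity-check a generated vocab against these points, here is a minimal sketch. It is not part of this repo; the helper names (`missing_punctuation`, `uppercase_entries`) and the sample vocab are made up for illustration, and the special-token list assumes the usual BERT tokens.

```python
# Hypothetical helpers for inspecting a vocab; not part of the vocab builder itself.
# Assumption: these are the special tokens placed at the top of the vocab.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def missing_punctuation(vocab_tokens, marks=".,?!"):
    """Return the punctuation marks that have no standalone vocab entry."""
    vocab = set(vocab_tokens)
    return [m for m in marks if m not in vocab]


def uppercase_entries(vocab_tokens):
    """Return non-special entries containing upper-case characters.

    An uncased vocab would be expected to have none.
    """
    return [t for t in vocab_tokens if t not in SPECIAL_TOKENS and t != t.lower()]


# Made-up sample standing in for the lines of a vocab.txt file.
sample_vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
                "the", ",", ".", "##s", "Hello"]

print(missing_punctuation(sample_vocab))  # ['?', '!']
print(uppercase_entries(sample_vocab))    # ['Hello']
```

To run this on a real file, the token list could be read with something like `[line.rstrip("\n") for line in open(path, encoding="utf-8")]`, where `path` points at your own vocab file.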

LydiaXiaohongLi · 2020-03-25T05:05:53Z

Thanks kwonmha,
Following up on the punctuation removal question:
If I don't remove punctuation from the corpus file, the vocab ends up with entries where a word and the punctuation mark that follows it form a single token, e.g. "hello,". So I want to ask: should I build the vocab from a corpus without punctuation, and then add punctuation back manually as separate standalone tokens?

Thanks
Regards

kwonmha · 2020-03-25T05:43:11Z

The subword vocab building algorithm will automatically separate "hello," into "hello" and ",".
This is because "," follows many other words as well, as in "wow," and "well,".
So it won't be merged with a particular word unless there are plenty of "hello,"s in the corpus.
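
The frequency argument above can be illustrated with a toy pair count in the style of a BPE merge step. This is only a sketch under assumptions: the corpus counts are invented, `pair_counts` is a hypothetical helper, and the actual builder in this repo may count and merge differently.

```python
from collections import Counter


def pair_counts(word_freqs):
    """Count adjacent symbol pairs over a {word: frequency} table (BPE-style)."""
    counts = Counter()
    for word, freq in word_freqs.items():
        symbols = list(word)  # start from single characters
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += freq
    return counts


# Invented toy corpus: whitespace-separated words with their frequencies.
corpus = {"hello,": 2, "hello": 5, "wow,": 3, "well,": 3, "well": 4}
counts = pair_counts(corpus)

# Pairs inside the words are shared across many occurrences ...
print(counts[("e", "l")])  # 14
# ... while pairs ending in "," are spread over different preceding letters.
print(counts[("o", ",")], counts[("w", ",")], counts[("l", ",")])  # 2 3 3
```

Because the comma's counts are split across many different left neighbours, merges that glue it to a word tend to rank low. With enough occurrences of one specific word plus punctuation, such a merge can still happen.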

sahelimukherjee92 · 2020-05-19T08:40:59Z

Hi @kwonmha, the vocab file that I generated has an issue with punctuation.

-(Q).
(Proc.
(Price,
(Poon
(Polyak,
(Polyak
(PoPPCA)
(Pinto
(Photo
(Pham
(Petersen
(Perron,
(Pearl,
(Pati
(Palatucci
(Paccanaro
(PSD)
(PMF).

Could you please suggest how I can separate the punctuation? Does that involve further preprocessing?
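
One possible preprocessing step, sketched here as an assumption rather than as what this repo does, is to put whitespace around every punctuation character before building the vocab. The punctuation test below follows, to my understanding, the rule used by BERT's basic tokenizer (non-alphanumeric ASCII symbols plus Unicode "P" categories); `is_punctuation` and `split_punctuation` are illustrative names.

```python
import unicodedata


def is_punctuation(ch):
    """Return True for ASCII symbols and Unicode punctuation characters."""
    cp = ord(ch)
    # Non-alphanumeric, non-space printable ASCII ranges (e.g. "(", ",", "^", "`").
    if 33 <= cp <= 47 or 58 <= cp <= 64 or 91 <= cp <= 96 or 123 <= cp <= 126:
        return True
    return unicodedata.category(ch).startswith("P")


def split_punctuation(line):
    """Surround each punctuation character with spaces and normalize whitespace."""
    pieces = [f" {ch} " if is_punctuation(ch) else ch for ch in line]
    return " ".join("".join(pieces).split())


print(split_punctuation("(Polyak,"))  # ( Polyak ,
print(split_punctuation("(PMF)."))    # ( PMF ) .
print(split_punctuation("-(Q)."))     # - ( Q ) .
```

Note that this also splits apostrophes and decimal points, so "don't" becomes "don ' t" and "3.14" becomes "3 . 14". Whether that is acceptable depends on how the pretraining tokenizer treats the same text.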

kwonmha · 2020-11-05T13:16:59Z

I fixed this problem.
Check if it works.
Thank you
