
Corpus preprocessing steps #13

Open
LydiaXiaohongLi opened this issue Mar 23, 2020 · 5 comments

Comments

@LydiaXiaohongLi

LydiaXiaohongLi commented Mar 23, 2020

Hi Kwonmha,
Thanks for open-sourcing the repo. Can I ask whether the general preprocessing steps for the vocab builder, for an uncased BERT model, are as follows:

  1. Convert the corpus text file to lower case.
  2. Remove punctuation from the corpus text file?
  3. Build the vocab.
  4. Match the vocab file to the BERT model configuration, e.g. take the top 30k lines (as the vocab should be ordered by descending frequency?) and manually adjust the vocab file so that it contains punctuation (i.e. entries for . , ? ! ##. ##, ##? ##! etc.)?
  5. Use the vocab file for later pretraining of the BERT model. The pretraining corpus needs to be lower-cased, but without removing punctuation?

Please let me know if my understanding is not correct.

Thanks!
Regards

@kwonmha
Owner

kwonmha commented Mar 25, 2020

Hi LydiaXiaohongLi,

I recommend looking at Google's vocab files first.
There are various versions: English-Cased, English-uncased, Multilingual-Cased, Multilingual-uncased, etc.

Those vocabs imply that lower-casing is optional. (answer to question 1)

If you check those vocabs, you will see that punctuation is included, so you don't need to remove punctuation. (answer to question 2)

If you build a vocab with my project or others, the vocab will be ordered by frequency, except for some special tokens at the top. (answer to question 4)
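
One way to check these points on any vocab file is sketched below. This is a minimal sketch, not part of this project: `summarize_vocab` is a hypothetical helper, and the toy vocab is made up for illustration. It assumes one token per line and that special tokens look like `[PAD]` or `[UNK]`.

```python
import string


def summarize_vocab(lines):
    """Summarize special tokens and punctuation entries in a vocab.

    `lines` is any iterable of vocab lines (one token per line),
    for example an open file object.
    """
    tokens = [line.rstrip("\n") for line in lines if line.strip()]
    vocab = set(tokens)

    # Collect bracketed special tokens ([PAD], [UNK], ...) found at the top.
    leading_special = []
    for tok in tokens:
        if len(tok) > 2 and tok.startswith("[") and tok.endswith("]"):
            leading_special.append(tok)
        else:
            break

    # ASCII punctuation present as standalone tokens and as "##" pieces.
    standalone = [p for p in string.punctuation if p in vocab]
    continuation = [p for p in string.punctuation if "##" + p in vocab]

    return {
        "leading_special": leading_special,
        "standalone_punct": standalone,
        "continuation_punct": continuation,
    }


# Toy vocab (hypothetical); a real one would be read with open(path, encoding="utf-8").
toy_vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "the", ",", ".", "##s", "##,"]
print(summarize_vocab(toy_vocab))
```

If `standalone_punct` comes back empty for a vocab you built yourself, punctuation was probably removed from the corpus or glued to neighbouring words.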

@LydiaXiaohongLi
Author

Thanks kwonmha,
A follow-up on the punctuation removal question:
If I don't remove punctuation from the corpus file, I see vocab entries where a word followed by punctuation becomes a single token, e.g. "hello,". So I want to ask whether I should build the vocab from a corpus without punctuation and then add punctuation back manually as separate standalone tokens?

Thanks
Regards

@kwonmha
Owner

kwonmha commented Mar 25, 2020

The subword vocab-building algorithm will automatically separate "hello," into "hello" and ",",
because "," follows many other words, as in "wow," and "well,".
So it won't be merged into a single token with a particular word unless "hello," itself appears very often in the corpus.
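
To make the effect on tokenization concrete, here is a minimal sketch of greedy longest-match-first segmentation, the approach used by WordPiece-style tokenizers. It is not this project's code: `wordpiece_tokenize` and the toy vocabs are hypothetical. Because the sketch does no punctuation pre-splitting, the comma shows up as the continuation piece `##,`.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Segment one word greedily, always taking the longest matching piece."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocab.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # non-initial pieces carry "##"
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# "," follows many words, so it is learned as its own piece.
common_vocab = {"hello", "wow", "well", "##,"}
print(wordpiece_tokenize("hello,", common_vocab))  # ['hello', '##,']

# If "hello," is frequent enough to enter the vocab, it stays whole.
glued_vocab = common_vocab | {"hello,"}
print(wordpiece_tokenize("hello,", glued_vocab))  # ['hello,']
```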

@sahelimukherjee92

Hi @kwonmha, the vocab file that I generated has an issue with punctuation:

-(Q).
(Proc.
(Price,
(Poon
(Polyak,
(Polyak
(PoPPCA)
(Pinto
(Photo
(Pham
(Petersen
(Perron,
(Pearl,
(Pati
(Palatucci
(Paccanaro
(PSD)
(PMF).

Could you please suggest how I can separate the punctuation? Does that involve further preprocessing?

@kwonmha
Owner

kwonmha commented Nov 5, 2020

I have fixed this problem.
Please check whether it works.
Thank you.
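
For anyone seeing entries like "(Polyak," with another vocab builder, one common preprocessing step is to split punctuation into separate tokens before counting words. The sketch below illustrates that idea only; it is not necessarily the change made in this repo, and `is_punctuation` and `split_on_punctuation` are hypothetical helpers.

```python
import unicodedata


def is_punctuation(ch):
    """Return True for ASCII symbols and Unicode punctuation characters."""
    cp = ord(ch)
    # Treat every non-alphanumeric printable ASCII symbol as punctuation,
    # including characters such as "$" or "^" that Unicode classes as symbols.
    if 33 <= cp <= 47 or 58 <= cp <= 64 or 91 <= cp <= 96 or 123 <= cp <= 126:
        return True
    return unicodedata.category(ch).startswith("P")


def split_on_punctuation(text):
    """Split on whitespace, then emit each punctuation mark as its own token."""
    tokens = []
    for word in text.split():
        current = []
        for ch in word:
            if is_punctuation(ch):
                if current:
                    tokens.append("".join(current))
                    current = []
                tokens.append(ch)
            else:
                current.append(ch)
        if current:
            tokens.append("".join(current))
    return tokens


print(split_on_punctuation("(Polyak, (PMF)."))
# ['(', 'Polyak', ',', '(', 'PMF', ')', '.']
```

Running the corpus through a step like this before building the vocab keeps "(", ",", and "." from being attached to the words next to them.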
