
Question on "mlm" in continued pre-training #172

Open
tanliboy opened this issue Jun 3, 2024 · 2 comments

Comments


tanliboy commented Jun 3, 2024

Hi Team,

This is an amazing handbook. In the continued pre-training script (run_cpt.py), I saw that the "mlm" (masked language modeling) parameter is not used in the training process. I thought that the training objective, MLM vs. next-token prediction, was the major differentiator between pre-training and supervised fine-tuning.

  • Has there been an assessment of the efficacy of continued pre-training with "mlm" enabled compared to without it?
  • What are your recommendations or guidelines for incorporating "mlm" into the continued pre-training process?

Thanks!
Li

@tanliboy tanliboy changed the title Questions on "mlm" in continued pre-training Question on "mlm" in continued pre-training Jun 3, 2024

xiyang-aads-lilly commented Jun 18, 2024

All models here are trained with causal language modeling. MLM is out of scope for this project, I think.
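For concreteness, the difference between the two objectives comes down to how the labels are built from the input tokens. Below is a minimal pure-Python sketch (not code from the handbook; `MASK_ID`, the 15% masking rate, and the `-100` ignore index follow common conventions and are illustrative). In causal LM, the labels are simply a copy of the inputs and the model shifts them internally so each token predicts the next one; in MLM, a random subset of tokens is replaced by a mask token and the loss is computed only at those positions.

```python
import random

MASK_ID = 103   # hypothetical [MASK] token id (illustrative)
IGNORE = -100   # label value conventionally ignored by the loss

def causal_lm_labels(input_ids):
    # Causal LM: labels are a copy of the inputs; the model shifts
    # them internally so position t is trained to predict token t+1.
    return list(input_ids)

def mlm_labels(input_ids, mask_prob=0.15, seed=0):
    # MLM (BERT-style): replace a random ~15% of tokens with [MASK]
    # and set labels only at the masked positions.
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in input_ids:
        if rng.random() < mask_prob:
            masked.append(MASK_ID)
            labels.append(tok)      # loss is computed here
        else:
            masked.append(tok)
            labels.append(IGNORE)   # loss is skipped here
    return masked, labels

ids = [5, 17, 42, 8, 99, 3]
print(causal_lm_labels(ids))  # identical to the inputs
masked, labels = mlm_labels(ids)
print(masked, labels)
```

In the Hugging Face stack, the causal variant corresponds to `DataCollatorForLanguageModeling(tokenizer, mlm=False)`, which is the setting consistent with decoder-only models like the ones this handbook targets.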

tanliboy (Author) commented

Thanks for your reply, @xiyang-aads-lilly !

In the case where we need to fine-tune on a small set of documents (<50M tokens), what would be the best strategy to integrate that knowledge into an LLM without causing significant regressions in its general capabilities?

I have heard discussions comparing re-warming + re-sampling for continued pre-training with generating conversational data for instruction fine-tuning. Given that we use SFT for both continued pre-training and instruction fine-tuning (assuming no completion-only data loader), it seems unnecessary to generate conversational data for instruction fine-tuning. Thoughts?
