
Question on "mlm" in continued pre-training #172

Open
tanliboy opened this issue Jun 3, 2024 · 2 comments

Comments


tanliboy commented Jun 3, 2024

Hi Team,

This is an amazing handbook. In the continued pre-training script (run_cpt.py), I saw that the "mlm" (masked language modeling) parameter is not used in the training process. I thought that the training objective, MLM vs. next-token prediction, was the major differentiator between pre-training and supervised fine-tuning.

  • Has there been an assessment of the efficacy of continued pre-training with "mlm" enabled compared to without it?
  • What are your recommendations or guidelines for incorporating "mlm" into the continued pre-training process?

Thanks!
Li

@tanliboy tanliboy changed the title Questions on "mlm" in continued pre-training Question on "mlm" in continued pre-training Jun 3, 2024

xiyang-aads-lilly commented Jun 18, 2024

All models here are trained with causal language modeling. MLM is out of scope for this project, I think.
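For concreteness, the difference between the two objectives comes down to how the labels are built from the input tokens. Below is a minimal pure-Python sketch (not code from the handbook; `MASK_ID`, the 15% masking rate, and the `-100` ignore index follow common conventions and are illustrative). In causal LM, the labels are simply a copy of the inputs and the model shifts them internally so each token predicts the next one; in MLM, a random subset of tokens is replaced by a mask token and the loss is computed only at those positions.

```python
import random

MASK_ID = 103   # hypothetical [MASK] token id (illustrative)
IGNORE = -100   # label value conventionally ignored by the loss

def causal_lm_labels(input_ids):
    # Causal LM: labels are a copy of the inputs; the model shifts
    # them internally so position t is trained to predict token t+1.
    return list(input_ids)

def mlm_labels(input_ids, mask_prob=0.15, seed=0):
    # MLM (BERT-style): replace a random ~15% of tokens with [MASK]
    # and set labels only at the masked positions.
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in input_ids:
        if rng.random() < mask_prob:
            masked.append(MASK_ID)
            labels.append(tok)      # loss is computed here
        else:
            masked.append(tok)
            labels.append(IGNORE)   # loss is skipped here
    return masked, labels

ids = [5, 17, 42, 8, 99, 3]
print(causal_lm_labels(ids))  # identical to the inputs
masked, labels = mlm_labels(ids)
print(masked, labels)
```

In the Hugging Face stack, the causal variant corresponds to `DataCollatorForLanguageModeling(tokenizer, mlm=False)`, which is the setting consistent with decoder-only models like the ones this handbook targets.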

tanliboy (Author) commented

Thanks for your reply, @xiyang-aads-lilly !

In the case where we need to fine-tune on a small set of documents (<50M tokens), what would be the best strategy to integrate that knowledge into an LLM without causing significant regressions in its general capabilities?

I have heard discussions comparing re-warming + re-sampling for continued pre-training with generating conversational data for instruction fine-tuning. Given that we use SFT for both continued pre-training and instruction fine-tuning (assuming no completion-only data loader), it seems unnecessary to generate conversational data for instruction fine-tuning. Thoughts?
