-
I have a text_completion dataset that I've been fine-tuning llama 3.1 on. I now have another dataset I want to use, but it is for the "instruct" dataset class. Is it possible to do a fine-tuning run on both of these concurrently? If not, is the best practice to train on the 1st dataset and then train the resulting checkpoint files on the 2nd dataset?
-
Hey @troy256, take a look at torchtune.datasets._concat.ConcatDataset, it does just what you described. You can use it directly from the config for any recipe by specifying a list for your dataset:
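For example, something roughly like this (the dataset builders and source values below are placeholders — swap in whatever matches your two datasets):

```yaml
dataset:
  - _component_: torchtune.datasets.text_completion_dataset
    source: my_org/my_text_completion_data   # placeholder: your existing text_completion dataset
  - _component_: torchtune.datasets.instruct_dataset
    source: my_org/my_instruct_data          # placeholder: your new instruct-style dataset
```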
Let us know if you run into any issues.
-
Thanks, I'm getting close. My instruct dataset just has "prompt" and "response" columns (no "instruction"). I can't figure out which instruct template to use, or maybe I'm not specifying it correctly. I'd like to use an existing template instead of creating a custom one.
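For context, the kind of dataset block I've been attempting looks roughly like this (the column_map keys are just my guess at how to point the generic instruct_dataset builder at my columns):

```yaml
dataset:
  _component_: torchtune.datasets.instruct_dataset
  source: json
  data_files: my_instruct_data.json   # placeholder path to my prompt/response data
  # template: ???                     # this is the part I can't figure out
  column_map:
    input: prompt                     # guessing at the right keys here
    output: response
```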
-
Thanks. What do I put for template in the dataset block? It seems to require that:
Relevant yaml:
-
OK, great. I updated to the nightly build with this command:
Specified "cu125" because nvidia-smi says I'm on CUDA 12.5. Should it say something later than just 0.2.1?
Still getting the TypeError:
I had to put max_seq_len back under the dataset block because it complained about me moving it to the tokenizer section. I'm not so worried about that though, I'll move it to wherever it tells me to.
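In case it's useful, this is roughly the layout I ended up with (paths and values are placeholders):

```yaml
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /path/to/tokenizer.model      # placeholder path

dataset:
  _component_: torchtune.datasets.instruct_dataset
  source: json
  data_files: my_instruct_data.json   # placeholder
  max_seq_len: 2048                   # the version I'm on only accepts this here, not under tokenizer
```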