Possible to fine tune on 2 different dataset classes? #1509

troy256 · 2024-09-06T13:26:29Z

troy256
Sep 6, 2024

I have a text_completion dataset that I've been fine tuning llama 3.1 on. I now have another dataset I want to use, but it is for the "instruct" dataset class. Is it possible to do a fine tuning run on both of these concurrently? If not, is the best practice to train on 1st dataset and then train the resulting checkpoint files on the 2nd dataset?

Answered by RdoubleA

Sep 6, 2024

Hey @troy256, take a look at torchtune.datasets._concat.ConcatDataset, it does just what you described.

You can use it directly from the config for any recipe by specifying a list for your dataset:

dataset:
  - _component_: torchtune.datasets.instruct_dataset
    ...
  - _component_: torchtune.datasets.text_completion_dataset

Let us know if you run into any issues.

View full answer

RdoubleA · 2024-09-06T14:23:16Z

RdoubleA
Sep 6, 2024
Collaborator

Hey @troy256, take a look at torchtune.datasets._concat.ConcatDataset, it does just what you described.

You can use it directly from the config for any recipe by specifying a list for your dataset:

dataset:
  - _component_: torchtune.datasets.instruct_dataset
    ...
  - _component_: torchtune.datasets.text_completion_dataset

Let us know if you run into any issues.

0 replies

troy256 · 2024-09-09T15:14:58Z

troy256
Sep 9, 2024
Author

Thanks, I'm getting close. My instruct dataset just has "prompt" and "response" columns (no "instruction"). I can't figure out which instruct template to use, or maybe I'm not specifying it correct. I'd like to use an existing template instead of creating custom.

1 reply

RdoubleA Sep 9, 2024
Collaborator

We've just updated our APIs and will be updating our documentation very soon. If you use torchtune.datasets.instruct_dataset you don't need to specify any custom formatting as it will handle it for you, see here:

torchtune/torchtune/datasets/_instruct.py

Line 157 in 68d4f3e

    
               If your column names are different, you can use the ``column_map`` parameter to change

you just need to pass in column_map = {"input": "prompt", "output": "response"}. By default, it won't add any prompt template but it will format it correctly according to your dataset and it should be sufficient to start fine-tuning. Is there a specific prompt template you wanted to use?

troy256 · 2024-09-09T16:51:40Z

troy256
Sep 9, 2024
Author

Thanks. What do I put for template in the dataset block? It seems to require that:

TypeError: instruct_dataset() missing 1 required keyword-only argument: 'template'

Relevant yaml:

dataset:
  - _component_: torchtune.datasets.instruct_dataset
    source: /data/torchtune/dataset/instruct/parquet
    split: train
    train_on_input: True
    max_seq_len: 8192
    column_map: {"input": "prompt", "output": "response"}
  - _component_: torchtune.datasets.text_completion_dataset
    source: /data/torchtune/dataset/text_completion/parquet
    split: train
    max_seq_len: 8192

1 reply

RdoubleA Sep 9, 2024
Collaborator

ah, apologies this is confusing. our datasets have undergone a lot of changes recently. I would recommend installing our nightlies as this won't require you to pass in a prompt template. with the nightly build, your config should look like this, noticed that max_seq_len moved to the tokenizer:

tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /tmp/Meta-Llama-3-8B-Instruct/original/tokenizer.model
  max_seq_len: 8192

dataset:
  - _component_: torchtune.datasets.instruct_dataset
    source: /data/torchtune/dataset/instruct/parquet
    split: train
    train_on_input: True
    column_map: {"input": "prompt", "output": "response"}
  - _component_: torchtune.datasets.text_completion_dataset
    source: /data/torchtune/dataset/text_completion/parquet
    split: train

troy256 · 2024-09-09T17:36:00Z

troy256
Sep 9, 2024
Author

OK, great. I updated to nightly with command:

pip install --pre torchtune --extra-index-url https://download.pytorch.org/whl/nightly/cu125 --no-cache-dir --force-reinstall

Specified "cu125" because nvidia-smi says I'm on CUDA 12.5. Should it say something later than just 0.2.1?

# pip show torchtune
Name: torchtune
Version: 0.2.1
Summary: A native-PyTorch library for LLM fine-tuning
Home-page: 
Author: 
Author-email: PyTorch Team <[email protected]>
License: BSD 3-Clause License

Still getting the TypeError:

INFO:torchtune.utils.logging:Running FullFinetuneRecipeSingleDevice with resolved config:

batch_size: 1
checkpointer:
  _component_: torchtune.utils.FullModelHFCheckpointer
  checkpoint_dir: /data/HF-llama3.1-8b-instruct/
  checkpoint_files:
  - model-00001-of-00004.safetensors
  - model-00002-of-00004.safetensors
  - model-00003-of-00004.safetensors
  - model-00004-of-00004.safetensors
  model_type: LLAMA3
  output_dir: /data/HF-llama3.1-8b-instruct
  recipe_checkpoint: recipe_state.pt
compile: false
dataset:
- _component_: torchtune.datasets.instruct_dataset
  column_map:
    input: prompt
    output: response
  max_seq_len: 8192
  source: /data/torchtune/dataset/instruct/parquet
  split: train
  train_on_input: true
- _component_: torchtune.datasets.text_completion_dataset
  max_seq_len: 8192
  source: /data/torchtune/dataset/text_completion/parquet
  split: train
device: cuda
dtype: bf16
enable_activation_checkpointing: true
epochs: 1
gradient_accumulation_steps: 4
log_every_n_steps: 1
log_peak_memory_stats: false
loss:
  _component_: torch.nn.CrossEntropyLoss
max_steps_per_epoch: null
memory_efficient_fsdp_wrap: true
metric_logger:
  _component_: torchtune.utils.metric_logging.DiskLogger
  log_dir: /tmp/alpaca-llama3.1-finetune
model:
  _component_: torchtune.models.llama3_1.llama3_1_8b
optimizer:
  _component_: torch.optim.AdamW
  foreach: false
  lr: 2.0e-05
optimizer_in_bwd: false
output_dir: /tmp/alpaca-llama3.1-finetune
resume_from_checkpoint: false
seed: null
shuffle: true
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /data/HF-llama3.1-8b-instruct/original/tokenizer.model

DEBUG:torchtune.utils.logging:Setting manual seed to local seed 992868431. Local seed is seed + rank = 992868431 + 0
Writing logs to /tmp/alpaca-llama3.1-finetune/log_1725903030.txt
INFO:torchtune.utils.logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils.logging:Memory stats after model init:
        GPU peak memory allocation: 16.50 GB
        GPU peak memory reserved: 16.64 GB
        GPU peak memory active: 16.50 GB
INFO:torchtune.utils.logging:Tokenizer is initialized from file.
INFO:torchtune.utils.logging:Optimizer is initialized.
INFO:torchtune.utils.logging:Loss is initialized.
Traceback (most recent call last):
  File "/data/pe/bin/tune", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/data/pe/lib/python3.12/site-packages/torchtune/_cli/tune.py", line 49, in main
    parser.run(args)
  File "/data/pe/lib/python3.12/site-packages/torchtune/_cli/tune.py", line 43, in run
    args.func(args)
  File "/data/pe/lib/python3.12/site-packages/torchtune/_cli/run.py", line 179, in _run_cmd
    self._run_single_device(args)
  File "/data/pe/lib/python3.12/site-packages/torchtune/_cli/run.py", line 93, in _run_single_device
    runpy.run_path(str(args.recipe), run_name="__main__")
  File "<frozen runpy>", line 286, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "/data/pe/lib/python3.12/site-packages/recipes/full_finetune_single_device.py", line 524, in <module>
    sys.exit(recipe_main())
             ^^^^^^^^^^^^^
  File "/data/pe/lib/python3.12/site-packages/torchtune/config/_parse.py", line 50, in wrapper
    sys.exit(recipe_main(conf))
             ^^^^^^^^^^^^^^^^^
  File "/data/pe/lib/python3.12/site-packages/recipes/full_finetune_single_device.py", line 518, in recipe_main
    recipe.setup(cfg=cfg)
  File "/data/pe/lib/python3.12/site-packages/recipes/full_finetune_single_device.py", line 228, in setup
    self._sampler, self._dataloader = self._setup_data(
                                      ^^^^^^^^^^^^^^^^^
  File "/data/pe/lib/python3.12/site-packages/recipes/full_finetune_single_device.py", line 341, in _setup_data
    config.instantiate(single_cfg_dataset, tokenizer=self._tokenizer)
  File "/data/pe/lib/python3.12/site-packages/torchtune/config/_instantiate.py", line 106, in instantiate
    return _instantiate_node(config, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/pe/lib/python3.12/site-packages/torchtune/config/_instantiate.py", line 31, in _instantiate_node
    return _create_component(_component_, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/pe/lib/python3.12/site-packages/torchtune/config/_instantiate.py", line 20, in _create_component
    return _component_(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: instruct_dataset() missing 1 required keyword-only argument: 'template'`

I had to put max_seq_len back under the dataset block because it complained about me moving it to the tokenzier section. I'm not so worried about that though, I'll move it to wherever it tells me to.

2 replies

RdoubleA Sep 10, 2024
Collaborator

there are no binaries for https://download.pytorch.org/whl/nightly/cu125 (if you open that link, it doesn't lead anywhere). you can do https://download.pytorch.org/whl/nightly/cu124 since you have the latest CUDA version, or stick to cu121. You are still getting the error because the nightly never installed correctly.

When I run pip install --pre torchtune --extra-index-url https://download.pytorch.org/whl/nightly/cu124 --no-cache-dir you should get this version with pip show:

Name: torchtune
Version: 0.3.0.dev20240909+cu124
Summary: A native-PyTorch library for LLM fine-tuning
Home-page: 
Author: 
Author-email: PyTorch Team <[email protected]>
License: BSD 3-Clause License

troy256 Sep 10, 2024
Author

Thanks! After updating to the nightly cu124 build (sorry, should have checked for existence of cu125 first) and making some tweaks to the tuning recipe yaml I was able to get a full fine tuning run to complete. I'll can now test inference and experiment with different epochs, etc.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Possible to fine tune on 2 different dataset classes? #1509

Uh oh!

{{title}}

Uh oh!

Replies: 4 comments 4 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Possible to fine tune on 2 different dataset classes? #1509

Uh oh!

troy256 Sep 6, 2024

Replies: 4 comments · 4 replies

Uh oh!

RdoubleA Sep 6, 2024 Collaborator

Uh oh!

troy256 Sep 9, 2024 Author

Uh oh!

RdoubleA Sep 9, 2024 Collaborator

Uh oh!

troy256 Sep 9, 2024 Author

Uh oh!

RdoubleA Sep 9, 2024 Collaborator

Uh oh!

troy256 Sep 9, 2024 Author

Uh oh!

RdoubleA Sep 10, 2024 Collaborator

Uh oh!

troy256 Sep 10, 2024 Author

troy256
Sep 6, 2024

Replies: 4 comments 4 replies

RdoubleA
Sep 6, 2024
Collaborator

troy256
Sep 9, 2024
Author

RdoubleA Sep 9, 2024
Collaborator

troy256
Sep 9, 2024
Author

RdoubleA Sep 9, 2024
Collaborator

troy256
Sep 9, 2024
Author

RdoubleA Sep 10, 2024
Collaborator

troy256 Sep 10, 2024
Author