
Support for Packed Sequence Dataset #259

Closed

Conversation

RadhaGulhane13
Contributor

What does this PR do?

This PR adds support for packed sequence datasets for SFT. NeMo already supports SFT/PEFT with packed sequences, but NeMo-Aligner does not; this PR brings that support to SFT in NeMo-Aligner.

Changelog

  • Please update CHANGELOG.md under the next version with the high-level changes in this PR.

Usage

Refer to the following documentation for dataset preparation and for adjusting the training configuration: https://github.com/NVIDIA/NeMo/blob/main/docs/source/features/throughput_optimizations.rst#how-to-run-sftpeft-with-packed-sequence
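
For reference, the packed .npy files used below are generated ahead of time with NeMo's dataset preparation script. The command here is only a sketch based on that documentation; the script path, flag names, and all paths are illustrative and may differ across NeMo versions, so follow the linked guide for the exact invocation.

# Sketch: convert a JSONL SFT dataset into packed sequence .npy files (all paths are placeholders)
python <path_to_NeMo>/scripts/nlp_language_modeling/prepare_packed_ft_dataset.py \
  model.data.train_ds.file_names=[<train_dataset_path>/training.jsonl] \
  model.data.train_ds.max_seq_length=4096 \
  +tokenizer_path=<path_to_tokenizer_model> \
  +output_dir=<train_dataset_path> \
  +pack_sizes=[32768]

The output file name encodes the pack size and seed (for example, packed_32768_seed0.npy), which is what TRAIN_DATA_PATH and VALID_DATA_PATH point to below.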

# Provide packed sequences dataset path
TRAIN_DATA_PATH="<train_dataset_path>/packed_32768_seed0.npy"
VALID_DATA_PATH="<val_dataset_path>/packed_32768_seed0.npy"

python <path_to_Nemo-Aligner>/examples/nlp/gpt/train_gpt_sft.py \
  trainer.precision=bf16 \
  trainer.num_nodes=1 \
  trainer.devices=8 \
  trainer.sft.max_steps=-1 \
  trainer.sft.limit_val_batches=40 \
  trainer.sft.val_check_interval=1000 \
  model.megatron_amp_O2=True \
  model.restore_from_path=/path/to/your/mcore_gpt.nemo \
  model.optim.lr=5e-6 \
  model.data.chat=True \
  model.data.num_workers=0 \
  model.data.train_ds.micro_batch_size=1 \
  model.data.train_ds.global_batch_size=128 \
  model.data.train_ds.max_seq_length=4096 \
  +model.data.train_ds.packed_sequence=True \
  model.data.train_ds.file_path=${TRAIN_DATA_PATH} \
  +model.data.validation_ds.packed_sequence=True \
  model.data.validation_ds.micro_batch_size=1 \
  model.data.validation_ds.global_batch_size=128 \
  model.data.validation_ds.file_path=${VALID_DATA_PATH}  \
  exp_manager.create_wandb_logger=True \
  exp_manager.explicit_log_dir=/results \
  exp_manager.wandb_logger_kwargs.project=sft_run \
  exp_manager.wandb_logger_kwargs.name=chat_sft_run \
  exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
  exp_manager.resume_if_exists=True \
  exp_manager.resume_ignore_no_checkpoint=True \
  exp_manager.create_checkpoint_callback=True \
  exp_manager.checkpoint_callback_params.monitor=validation_loss
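
Note on batch sizes: each packed sample already concatenates multiple sequences, which is why micro_batch_size is set to 1 above; per the linked NeMo documentation, global_batch_size should also be scaled down relative to unpacked training so that the effective number of sequences per step stays roughly the same.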

Before your PR is "Ready for review"

Pre checks:

Checklist when contributing a new algorithm

  • Does the trainer resume and restore model state and all other states?
  • Does the trainer support all parallelism techniques (PP, TP, DP)?
  • Does the trainer support max_steps=-1 and validation?
  • Does the trainer only call APIs defined in alignable_interface.py?
  • Does the trainer have proper logging?

Additional Information

@RadhaGulhane13 RadhaGulhane13 marked this pull request as draft August 6, 2024 17:27
@RadhaGulhane13 RadhaGulhane13 marked this pull request as ready for review August 6, 2024 17:27
@RadhaGulhane13 RadhaGulhane13 reopened this Aug 6, 2024
@ashors1
Collaborator

ashors1 commented Aug 30, 2024

Hi @RadhaGulhane13, thanks a lot for the contribution! I've opened a new PR #275 which adds some documentation and makes a minor bug fix on top of your changes. I'll close this PR.

@ashors1 ashors1 closed this Aug 30, 2024