
Support for Packed Sequence Dataset #259

Closed

Conversation

RadhaGulhane13
Contributor

What does this PR do?

This PR adds support for packed sequence datasets for SFT. NeMo already supports SFT/PEFT with packed sequences, but NeMo-Aligner does not; this PR brings that support to SFT in NeMo-Aligner.

Changelog

  • Please update CHANGELOG.md under the next version with the high-level changes in this PR.

Usage

Refer to the following documentation for dataset preparation and for adjusting the training configuration: https://github.com/NVIDIA/NeMo/blob/main/docs/source/features/throughput_optimizations.rst#how-to-run-sftpeft-with-packed-sequence
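
For reference, the packed .npy files used below are generated ahead of time with NeMo's dataset preparation script. The command here is only a sketch based on that documentation; the script path, flag names, and all paths are illustrative and may differ across NeMo versions, so follow the linked guide for the exact invocation.

# Sketch: convert a JSONL SFT dataset into packed sequence .npy files (all paths are placeholders)
python <path_to_NeMo>/scripts/nlp_language_modeling/prepare_packed_ft_dataset.py \
  model.data.train_ds.file_names=[<train_dataset_path>/training.jsonl] \
  model.data.train_ds.max_seq_length=4096 \
  +tokenizer_path=<path_to_tokenizer_model> \
  +output_dir=<train_dataset_path> \
  +pack_sizes=[32768]

The output file name encodes the pack size and seed (for example, packed_32768_seed0.npy), which is what TRAIN_DATA_PATH and VALID_DATA_PATH point to below.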

# Provide packed sequences dataset path
TRAIN_DATA_PATH="<train_dataset_path>/packed_32768_seed0.npy"
VALID_DATA_PATH="<val_dataset_path>/packed_32768_seed0.npy"

python <path_to_Nemo-Aligner>/examples/nlp/gpt/train_gpt_sft.py \
  trainer.precision=bf16 \
  trainer.num_nodes=1 \
  trainer.devices=8 \
  trainer.sft.max_steps=-1 \
  trainer.sft.limit_val_batches=40 \
  trainer.sft.val_check_interval=1000 \
  model.megatron_amp_O2=True \
  model.restore_from_path=/path/to/your/mcore_gpt.nemo \
  model.optim.lr=5e-6 \
  model.data.chat=True \
  model.data.num_workers=0 \
  model.data.train_ds.micro_batch_size=1 \
  model.data.train_ds.global_batch_size=128 \
  model.data.train_ds.max_seq_length=4096 \
  +model.data.train_ds.packed_sequence=True \
  model.data.train_ds.file_path=${TRAIN_DATA_PATH} \
  +model.data.validation_ds.packed_sequence=True \
  model.data.validation_ds.micro_batch_size=1 \
  model.data.validation_ds.global_batch_size=128 \
  model.data.validation_ds.file_path=${VALID_DATA_PATH}  \
  exp_manager.create_wandb_logger=True \
  exp_manager.explicit_log_dir=/results \
  exp_manager.wandb_logger_kwargs.project=sft_run \
  exp_manager.wandb_logger_kwargs.name=chat_sft_run \
  exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
  exp_manager.resume_if_exists=True \
  exp_manager.resume_ignore_no_checkpoint=True \
  exp_manager.create_checkpoint_callback=True \
  exp_manager.checkpoint_callback_params.monitor=validation_loss
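
Note on batch sizes: each packed sample already concatenates multiple sequences, which is why micro_batch_size is set to 1 above; per the linked NeMo documentation, global_batch_size should also be scaled down relative to unpacked training so that the effective number of sequences per step stays roughly the same.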

Before your PR is "Ready for review"

Pre checks:

Checklist when contributing a new algorithm

  • Does the trainer resume and restore model state and all other states?
  • Does the trainer support all parallelism techniques (PP, TP, DP)?
  • Does the trainer support max_steps=-1 and validation?
  • Does the trainer only call APIs defined in alignable_interface.py?
  • Does the trainer have proper logging?

Additional Information

@RadhaGulhane13 RadhaGulhane13 marked this pull request as draft August 6, 2024 17:27
@RadhaGulhane13 RadhaGulhane13 marked this pull request as ready for review August 6, 2024 17:27
@RadhaGulhane13 RadhaGulhane13 reopened this Aug 6, 2024
@ashors1
Collaborator

ashors1 commented Aug 30, 2024

Hi @RadhaGulhane13, thanks a lot for the contribution! I've opened a new PR #275 which adds some documentation and makes a minor bug fix on top of your changes. I'll close this PR.

@ashors1 ashors1 closed this Aug 30, 2024