Add sequence packing support for SFTPackedDataset #275

ashors1 · 2024-08-30T23:02:37Z

What does this PR do?

Adds sequence packing for SFTPackedDataset. Thanks @RadhaGulhane13 for the contribution!
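For readers unfamiliar with the term, here is a minimal sketch of what sequence packing means in general. It is not the code added in this PR and not NeMo's packing algorithm: the helper names (`pack_sequences`, `cu_seqlens`, `flatten`) and the greedy first-fit strategy are assumptions chosen for illustration. Several short token sequences are concatenated into one pack of bounded length, and the cumulative sequence lengths record where each original sequence starts and ends inside the pack.

```python
from itertools import accumulate


def pack_sequences(sequences, pack_size):
    """Greedy first-fit packing (illustrative only, not NeMo's algorithm).

    Each sequence goes into the first pack that still has room for it;
    a new pack is opened when none does.
    """
    packs = []
    for seq in sequences:
        if len(seq) > pack_size:
            raise ValueError("sequence is longer than pack_size")
        for pack in packs:
            if sum(len(s) for s in pack) + len(seq) <= pack_size:
                pack.append(seq)
                break
        else:
            packs.append([seq])
    return packs


def cu_seqlens(pack):
    """Cumulative sequence lengths: boundaries of each sequence in the pack."""
    return [0] + list(accumulate(len(s) for s in pack))


def flatten(pack):
    """Concatenate the sequences of one pack into a single token list."""
    return [tok for seq in pack for tok in seq]


# Hypothetical token ids, used only to show the shape of the result.
sequences = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
packed = pack_sequences(sequences, pack_size=6)

for pack in packed:
    print(flatten(pack), cu_seqlens(pack))
```

The boundaries matter because attention must not cross from one original sequence into the next; an attention kernel that accepts cumulative sequence lengths can use them to keep each sequence separate.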

Changelog

Please update the CHANGELOG.md under next version with high level changes in this PR.

Usage

Original usage example from #259:

For dataset preparation and the required training configuration changes, see the NeMo documentation: https://github.com/NVIDIA/NeMo/blob/main/docs/source/features/throughput_optimizations.rst#how-to-run-sftpeft-with-packed-sequence

# Provide packed sequences dataset path
TRAIN_DATA_PATH="<train_dataset_path>/packed_32768_seed0.npy"
VALID_DATA_PATH="<val_dataset_path>/packed_32768_seed0.npy"

python <path_to_Nemo-Aligner>/examples/nlp/gpt/train_gpt_sft.py \
  trainer.precision=bf16 \
  trainer.num_nodes=1 \
  trainer.devices=8 \
  trainer.sft.max_steps=-1 \
  trainer.sft.limit_val_batches=40 \
  trainer.sft.val_check_interval=1000 \
  model.megatron_amp_O2=True \
  model.restore_from_path=/path/to/your/mcore_gpt.nemo \
  model.optim.lr=5e-6 \
  model.data.chat=True \
  model.data.num_workers=0 \
  model.data.train_ds.micro_batch_size=1 \
  model.data.train_ds.global_batch_size=128 \
  model.data.train_ds.max_seq_length=4096 \
  +model.data.train_ds.packed_sequence=True \
  model.data.train_ds.file_path=${TRAIN_DATA_PATH} \
  +model.data.validation_ds.packed_sequence=True \
  model.data.validation_ds.micro_batch_size=1 \
  model.data.validation_ds.global_batch_size=128 \
  model.data.validation_ds.file_path=${VALID_DATA_PATH}  \
  exp_manager.create_wandb_logger=True \
  exp_manager.explicit_log_dir=/results \
  exp_manager.wandb_logger_kwargs.project=sft_run \
  exp_manager.wandb_logger_kwargs.name=chat_sft_run \
  exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
  exp_manager.resume_if_exists=True \
  exp_manager.resume_ignore_no_checkpoint=True \
  exp_manager.create_checkpoint_callback=True \
  exp_manager.checkpoint_callback_params.monitor=validation_loss
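As a sanity check on the batch settings in the command above, the sketch below computes how many micro-batches each data-parallel rank accumulates per optimizer step. It relies on the usual Megatron-style relation global_batch_size = micro_batch_size × data_parallel_size × accumulation_steps. The function name is hypothetical, and the tensor- and pipeline-parallel sizes are assumed to be 1 because the command does not set them.

```python
def grad_accumulation_steps(
    global_batch_size,
    micro_batch_size,
    num_nodes,
    devices_per_node,
    tensor_parallel=1,    # assumption: not set in the command above
    pipeline_parallel=1,  # assumption: not set in the command above
):
    """Micro-batches accumulated per optimizer step on each data-parallel rank."""
    world_size = num_nodes * devices_per_node
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by TP * PP")
    data_parallel = world_size // model_parallel

    samples_per_pass = micro_batch_size * data_parallel
    if global_batch_size % samples_per_pass != 0:
        raise ValueError(
            "global_batch_size must be divisible by micro_batch_size * data_parallel"
        )
    return global_batch_size // samples_per_pass


# Values taken from the command: 1 node, 8 devices, micro batch 1, global batch 128.
steps = grad_accumulation_steps(
    global_batch_size=128, micro_batch_size=1, num_nodes=1, devices_per_node=8
)
print(steps)
```

With packed data, each "sample" counted here is one pack that may contain several original sequences, so the number of original sequences per optimizer step is generally larger than the global batch size.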

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation? Make sure to also update the NeMo Framework User Guide which contains the tutorials

Checklist when contributing a new algorithm

Does the trainer resume and restore all model states?
Does the trainer support all parallelism techniques (PP, TP, DP)?
Does the trainer support max_steps=-1 and validation?
Does the trainer only call APIs defined in alignable_interface.py?
Does the trainer have proper logging?

Additional Information

Related to # (issue)

RadhaGulhane13 and others added 4 commits August 30, 2024 14:01

Support for Packed Sequence Dataset

05cbccd

Signed-off-by: Radha Gulhane <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

cb8258d

for more information, see https://pre-commit.ci

minor bug fix and comment

d1218cb

Signed-off-by: ashors1 <[email protected]>

add documentation

af08b99

Signed-off-by: ashors1 <[email protected]>

ashors1 requested a review from terrykong August 30, 2024 23:02

github-actions bot added the documentation label Aug 30, 2024

ashors1 mentioned this pull request Aug 30, 2024

Support for Packed Sequence Dataset #259

Closed

8 tasks

terrykong requested changes Aug 30, 2024

nemo_aligner/data/nlp/builders.py (outdated, resolved)

nemo_aligner/data/nlp/builders.py (resolved)

ashors1 added 2 commits September 3, 2024 10:15

address comments

672c6fa

Signed-off-by: ashors1 <[email protected]>

copy cu_seqlen documentation from nemo

22c1fdb

Signed-off-by: ashors1 <[email protected]>

ashors1 requested a review from terrykong September 5, 2024 05:27

ashors1 added 2 commits September 4, 2024 22:30

Merge branch 'main' of github.com:NVIDIA/NeMo-Aligner into ashors/seq-packing

bfb4644

remove new line

45c65a6

Signed-off-by: ashors1 <[email protected]>

terrykong approved these changes Sep 5, 2024

terrykong merged commit f1fa2dc into main Sep 5, 2024
5 checks passed

terrykong deleted the ashors/seq-packing branch September 5, 2024 17:11

ashors1 mentioned this pull request Sep 5, 2024

add packed dataset #181

Open
