debugging checkpoint resumption #469

pstjohn · 2024-11-22T16:57:49Z

Currently running the following pre-training command:

DATA_DIR=$(download_bionemo_data esm2/testdata_esm2_pretrain:2.0 --source ngc)

train_esm2 \
    --train-cluster-path ${DATA_DIR}/2024_03_sanity/train_clusters_sanity.parquet \
    --train-database-path ${DATA_DIR}/2024_03_sanity/train_sanity.db \
    --valid-cluster-path ${DATA_DIR}/2024_03_sanity/valid_clusters.parquet \
    --valid-database-path ${DATA_DIR}/2024_03_sanity/validation.db \
    --resume-if-exists \
    --precision="bf16-mixed" \
    --num-gpus 1 \
    --num-nodes 1 \
    --num-steps 10_000 \
    --val-check-interval 1_000 \
    --stop-after-steps 1_500 \
    --max-seq-length 1024 \
    --limit-val-batches 2 \
    --micro-batch-size 16 \
    --num-layers=6 \
    --hidden-size=320 \
    --num-attention-heads=20 \
    --ffn-hidden-size=1280 \
    --tensor-model-parallel-size 1 \
    --create-tensorboard-logger \
    --wandb-project=esm2_checkpoint_resumption \
    --experiment-name=8m_pretraining_local

Runs resumed from a checkpoint are reproducible with respect to each other, but they differ from a run that continues training without stopping.
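One way to narrow this down is to find the first step at which the logged loss of a resumed run departs from the uninterrupted run. The helper below is a minimal sketch of that comparison, not part of `train_esm2`. The function name `first_divergence` and the loss values are hypothetical, and it assumes the per-step losses have already been exported (for example from TensorBoard or W&B) into plain Python lists aligned by step.

```python
import math
from typing import Optional, Sequence


def first_divergence(
    baseline: Sequence[float],
    candidate: Sequence[float],
    rel_tol: float = 0.0,
    abs_tol: float = 0.0,
) -> Optional[int]:
    """Return the first index where the two series differ, else None.

    Only the shared prefix of the two series is compared. With the
    default tolerances of 0.0 the comparison is exact equality.
    """
    for step, (a, b) in enumerate(zip(baseline, candidate)):
        if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
            return step
    return None


# Hypothetical, made-up loss values for illustration only.
continuous = [2.90, 2.71, 2.55, 2.43, 2.36]  # run that never stopped
resumed_a = [2.90, 2.71, 2.55, 2.47, 2.40]   # first run resumed from checkpoint
resumed_b = [2.90, 2.71, 2.55, 2.47, 2.40]   # second run resumed from same checkpoint

# Two resumed runs agree with each other over the whole series.
resumed_match = first_divergence(resumed_a, resumed_b)

# The resumed run departs from the continuous run at some index.
split_step = first_divergence(continuous, resumed_a)
```

If the reported index lines up with the first step after the checkpoint was loaded, that points at state which is restored differently on resume rather than at general nondeterminism. Passing a small `rel_tol` helps when the logs were rounded on export.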

adding option to error out after n steps

1cf1ad7

pstjohn changed the title ~~adding option to error out after n steps~~ debugging checkpoint resumption Nov 22, 2024
