[Example] NanoGPT support in DLrover #516

Antlera · 2023-07-22T09:52:46Z

Add nanogpt job support.

UT result:

No modification to main code. This PR adds nanoGPT (GPT-2) code support.

Job Test result:

kubectl get pod

NAME                                          READY   STATUS    RESTARTS      AGE
dlrover-controller-manager-68c984bdc9-zt6nx   2/2     Running   2 (32m ago)   30h
elasticjob-torch-nanogpt-dlrover-master       1/1     Running   0             3m45s
torch-nanogpt-edljob-worker-0                 1/1     Running   0             3m38s
torch-nanogpt-edljob-worker-1                 1/1     Running   0             3m38s

worker0 log

iter 0: loss 4.2998, time 37076.40ms, mfu -100.00%, lr 6.00e-04, total time 37.08s
iter 1: loss 3.5839, time 35491.35ms, mfu -100.00%, lr 5.87e-04, total time 72.57s
iter 2: loss 3.9422, time 32296.26ms, mfu -100.00%, lr 5.48e-04, total time 104.86s
iter 3: loss 3.4417, time 29406.69ms, mfu -100.00%, lr 4.89e-04, total time 134.27s
iter 4: loss 3.3288, time 29490.09ms, mfu -100.00%, lr 4.13e-04, total time 163.76s
iter 5: loss 3.3090, time 28100.16ms, mfu 0.01%, lr 3.30e-04, total time 191.86s
iter 6: loss 5.8463, time 28509.24ms, mfu 0.01%, lr 2.47e-04, total time 220.37s
iter 7: loss 3.2476, time 29504.05ms, mfu 0.01%, lr 1.71e-04, total time 249.87s
iter 8: loss 3.2016, time 26998.40ms, mfu 0.01%, lr 1.12e-04, total time 276.87s
iter 9: loss 3.1901, time 29095.91ms, mfu 0.01%, lr 7.32e-05, total time 305.97s
iter 10: loss 3.1321, time 29203.27ms, mfu 0.01%, lr 6.00e-05, total time 335.17s
[2023-07-22 09:49:03,866] [INFO] [training.py:355:_invoke_run] [default] worker group successfully finished. Waiting 300 seconds for other agents to finish.

worker1 log

iter 0: loss 4.2901, time 37974.68ms, mfu -100.00%, lr 6.00e-04, total time 37.97s
iter 1: loss 3.5786, time 34394.90ms, mfu -100.00%, lr 5.87e-04, total time 72.37s
iter 2: loss 3.9526, time 32796.66ms, mfu -100.00%, lr 5.48e-04, total time 105.17s
iter 3: loss 3.4573, time 29504.12ms, mfu -100.00%, lr 4.89e-04, total time 134.67s
iter 4: loss 3.3110, time 29594.19ms, mfu -100.00%, lr 4.13e-04, total time 164.26s
iter 5: loss 3.2937, time 27404.71ms, mfu 0.01%, lr 3.30e-04, total time 191.67s
iter 6: loss 5.8443, time 28991.74ms, mfu 0.01%, lr 2.47e-04, total time 220.66s
iter 7: loss 3.2383, time 29812.16ms, mfu 0.01%, lr 1.71e-04, total time 250.47s
iter 8: loss 3.2393, time 26200.98ms, mfu 0.01%, lr 1.12e-04, total time 276.67s
iter 9: loss 3.2074, time 29606.31ms, mfu 0.01%, lr 7.32e-05, total time 306.28s
iter 10: loss 3.1579, time 28599.84ms, mfu 0.01%, lr 6.00e-05, total time 334.88s
[2023-07-22 09:49:07,057] [INFO] [training.py:355:_invoke_run] [default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
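
The lr column in both worker logs is consistent with a cosine decay from 6.00e-04 to 6.00e-05 over 10 iterations with no warmup. The PR does not state the schedule, so the sketch below is an inference from the logged values, and the function and constant names are illustrative.

```python
import math

MAX_LR = 6e-4       # lr logged at iter 0
MIN_LR = 6e-5       # lr logged at iter 10
DECAY_ITERS = 10    # assumed decay length, inferred from the log

def cosine_lr(it):
    """Cosine decay from MAX_LR to MIN_LR without warmup (assumed schedule)."""
    if it >= DECAY_ITERS:
        return MIN_LR
    # coeff moves smoothly from 1 at iter 0 to 0 at DECAY_ITERS
    coeff = 0.5 * (1.0 + math.cos(math.pi * it / DECAY_ITERS))
    return MIN_LR + coeff * (MAX_LR - MIN_LR)

if __name__ == "__main__":
    for it in range(DECAY_ITERS + 1):
        print(f"iter {it}: lr {cosine_lr(it):.2e}")
```

Formatting each value with `.2e` gives the same figures as the lr column logged for iterations 0 through 10.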

Antlera · 2023-07-22T09:54:32Z

A big step towards #350 completion in Allreduce mode.

workingloong · 2023-07-23T13:10:11Z

The PR needs to merge the master branch and format the code to pass pre-commit.

Antlera · 2023-07-23T13:44:40Z

The PR needs to merge the master branch and format the code to pass pre-commit.

I tried to run pre-commit locally with the Docker image easydl/dlrover:ci to format my code. However, it failed with the errors mentioned in #517.

merlintang · 2023-07-24T10:23:48Z

Nanogpt meaning?

Antlera · 2023-07-24T10:30:19Z

Nanogpt meaning?

NanoGPT is a GPT that can be built from scratch by setting the n_layer, n_head, and n_embedding of the transformer model. Users can test how dlrover handles GPT scaling from 6M parameters to 1.5B parameters (GPT2-xl size) or even larger.
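
As a rough illustration of that scaling range, the sketch below estimates the parameter count of a GPT-2-style model from its depth and width. It assumes the standard GPT-2 layout (biased linear layers and LayerNorms, a 4x MLP, learned position embeddings, and an output head tied to the token embedding), a vocabulary of 50257 tokens and a context length of 1024. The function and argument names are illustrative and are not taken from this PR's code.

```python
def gpt2_param_count(n_layer, n_embd, vocab_size=50257, block_size=1024):
    """Estimate trainable parameters of a GPT-2-style decoder (hypothetical helper)."""
    d = n_embd
    attn = (d * 3 * d + 3 * d) + (d * d + d)      # fused QKV projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)   # 4x expansion + projection back
    norms = 2 * (2 * d)                           # two LayerNorms (weight + bias) per block
    per_block = attn + mlp + norms
    embeddings = vocab_size * d + block_size * d  # token table + position table
    final_norm = 2 * d                            # LayerNorm before the tied output head
    return n_layer * per_block + embeddings + final_norm

if __name__ == "__main__":
    configs = [("gpt2", 12, 12, 768), ("gpt2-xl", 48, 25, 1600)]
    for name, n_layer, n_head, n_embd in configs:
        # n_head only splits n_embd across heads; it does not change the total.
        print(f"{name}: {gpt2_param_count(n_layer, n_embd) / 1e6:.1f}M parameters")
```

Under these assumptions the count grows roughly as 12 * n_layer * n_embd^2, so widening the model increases its size much faster than adding layers.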

Antlera · 2023-07-24T10:31:29Z

Nanogpt meaning?

We can submit a doc explaining how to scale nanoGPT to test the elasticity of dlrover.

workingloong · 2023-07-24T10:34:50Z

Nanogpt meaning?

https://github.com/karpathy/nanoGPT

model_zoo/pytorch/nanoGPT/nanoGPT.py

model_zoo/pytorch/nanoGPT/model.py

merlintang

does this source refer to other open source implementation of GPT2?

Antlera · 2023-07-25T07:19:23Z

does this source refer to other open source implementation of GPT2?

Yes, and the reference will be added to the doc.

workingloong

LGTM

dlrover/examples/torch_nanogpt_debug_job.yaml

dlrover/examples/torch_nanogpt_job.yaml

Antlera · 2023-07-25T08:37:01Z

does this source refer to other open source implementation of GPT2?

This will be resolved in #526.

dlrover/examples/torch_nanogpt_job.yaml

docker/pytorch/nanogpt.dockerfile

merlintang · 2023-07-26T01:20:39Z

LGTM

Antlera mentioned this pull request Jul 23, 2023

An example to support nanoGPT. #439

Closed

Antlera force-pushed the nanogpt_support branch from c59833f to cf2cf49 July 24, 2023 12:40

Antlera added 14 commits July 24, 2023 14:35

GPT 2 model definition.

c9786d3

GPT 2 build and training code.

6c98463

nanogpt docker file.

780ed3c

Nanogpt job yamls for debugging and test.

443c353

Rename nanogpt job yaml.

d3070f5

Modify the code structure to add a parameter interpreter.

52df694

Formatting code with black.

7570e41

Fix metadata error in yaml.

17659ac

Formatting code format & pass precommit.

a05acbb

Formatting code format for flake8.

d9c4a8f

Remove whitespace before :.

bb2e356

Add # noqa: E203 E501 to skip lines.

bcf0f9a

Fix mypy check error.

e70f995

Fix isort error.

05a87dd

Antlera force-pushed the nanogpt_support branch from cf2cf49 to 05a87dd July 24, 2023 14:36

workingloong reviewed Jul 25, 2023

merlintang reviewed Jul 25, 2023

model_zoo/pytorch/nanoGPT/model.py Outdated

merlintang requested changes Jul 25, 2023

Antlera added 2 commits July 25, 2023 07:14

Fix annotation formatting.

19bfa09

Rename nanogpt.py

d73703f

workingloong approved these changes Jul 25, 2023

Major-333 reviewed Jul 25, 2023

dlrover/examples/torch_nanogpt_debug_job.yaml

Major-333 reviewed Jul 25, 2023

dlrover/examples/torch_nanogpt_job.yaml Outdated

Fix imagePullPolicy.

9b7ed63

Antlera mentioned this pull request Jul 25, 2023

A document for nanogpt training example. #526

Closed

Antlera closed this Jul 25, 2023

Antlera reopened this Jul 25, 2023

workingloong reviewed Jul 25, 2023

dlrover/examples/torch_nanogpt_job.yaml Outdated

merlintang changed the title ~~Nanogpt support example.~~ [Example] NanoGPT support in DLrover Jul 25, 2023

workingloong reviewed Jul 25, 2023

docker/pytorch/nanogpt.dockerfile

Antlera added 4 commits July 25, 2023 12:22

Modify the mirror source.

749d970

Fix training configuration of yaml.

55f1006

Rename folders to prevent docker builds from failing.

faa755c

Update the mirror source and file address.

c56189d

workingloong requested a review from merlintang July 26, 2023 00:53

merlintang approved these changes Jul 26, 2023

merlintang merged commit b014272 into intelligent-machine-learning:master Jul 26, 2023
13 checks passed
