[Example] NanoGPT support in DLrover #516

Merged

Collaborator

@Antlera Antlera commented Jul 22, 2023

Add nanoGPT job support.

UT result:

No modifications to the main code. This PR adds nanoGPT (GPT-2) example code.

Job Test result:

kubectl get pod

NAME                                          READY   STATUS    RESTARTS      AGE
dlrover-controller-manager-68c984bdc9-zt6nx   2/2     Running   2 (32m ago)   30h
elasticjob-torch-nanogpt-dlrover-master       1/1     Running   0             3m45s
torch-nanogpt-edljob-worker-0                 1/1     Running   0             3m38s
torch-nanogpt-edljob-worker-1                 1/1     Running   0             3m38s

worker0 log

iter 0: loss 4.2998, time 37076.40ms, mfu -100.00%, lr 6.00e-04, total time 37.08s
iter 1: loss 3.5839, time 35491.35ms, mfu -100.00%, lr 5.87e-04, total time 72.57s
iter 2: loss 3.9422, time 32296.26ms, mfu -100.00%, lr 5.48e-04, total time 104.86s
iter 3: loss 3.4417, time 29406.69ms, mfu -100.00%, lr 4.89e-04, total time 134.27s
iter 4: loss 3.3288, time 29490.09ms, mfu -100.00%, lr 4.13e-04, total time 163.76s
iter 5: loss 3.3090, time 28100.16ms, mfu 0.01%, lr 3.30e-04, total time 191.86s
iter 6: loss 5.8463, time 28509.24ms, mfu 0.01%, lr 2.47e-04, total time 220.37s
iter 7: loss 3.2476, time 29504.05ms, mfu 0.01%, lr 1.71e-04, total time 249.87s
iter 8: loss 3.2016, time 26998.40ms, mfu 0.01%, lr 1.12e-04, total time 276.87s
iter 9: loss 3.1901, time 29095.91ms, mfu 0.01%, lr 7.32e-05, total time 305.97s
iter 10: loss 3.1321, time 29203.27ms, mfu 0.01%, lr 6.00e-05, total time 335.17s
[2023-07-22 09:49:03,866] [INFO] [training.py:355:_invoke_run] [default] worker group successfully finished. Waiting 300 seconds for other agents to finish.

worker1 log

iter 0: loss 4.2901, time 37974.68ms, mfu -100.00%, lr 6.00e-04, total time 37.97s
iter 1: loss 3.5786, time 34394.90ms, mfu -100.00%, lr 5.87e-04, total time 72.37s
iter 2: loss 3.9526, time 32796.66ms, mfu -100.00%, lr 5.48e-04, total time 105.17s
iter 3: loss 3.4573, time 29504.12ms, mfu -100.00%, lr 4.89e-04, total time 134.67s
iter 4: loss 3.3110, time 29594.19ms, mfu -100.00%, lr 4.13e-04, total time 164.26s
iter 5: loss 3.2937, time 27404.71ms, mfu 0.01%, lr 3.30e-04, total time 191.67s
iter 6: loss 5.8443, time 28991.74ms, mfu 0.01%, lr 2.47e-04, total time 220.66s
iter 7: loss 3.2383, time 29812.16ms, mfu 0.01%, lr 1.71e-04, total time 250.47s
iter 8: loss 3.2393, time 26200.98ms, mfu 0.01%, lr 1.12e-04, total time 276.67s
iter 9: loss 3.2074, time 29606.31ms, mfu 0.01%, lr 7.32e-05, total time 306.28s
iter 10: loss 3.1579, time 28599.84ms, mfu 0.01%, lr 6.00e-05, total time 334.88s
[2023-07-22 09:49:07,057] [INFO] [training.py:355:_invoke_run] [default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
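
For readers checking the logs: the lr column on both workers is consistent with a cosine decay from 6.00e-04 down to 6.00e-05 over 10 iterations with no warmup. The sketch below reproduces those values. The constant names and the schedule itself are assumptions inferred from the log output, not taken from the PR's code.

```python
import math

# Assumed hyperparameters, inferred from the log rather than read from the PR.
LEARNING_RATE = 6e-4   # lr printed at iter 0
MIN_LR = 6e-5          # lr printed at iter 10
LR_DECAY_ITERS = 10    # iteration at which the log reaches MIN_LR


def cosine_lr(it: int) -> float:
    """Cosine decay from LEARNING_RATE to MIN_LR, assuming no warmup."""
    if it >= LR_DECAY_ITERS:
        return MIN_LR
    coeff = 0.5 * (1.0 + math.cos(math.pi * it / LR_DECAY_ITERS))
    return MIN_LR + coeff * (LEARNING_RATE - MIN_LR)


# Print the schedule in the same format as the lr field of the worker logs.
for it in range(LR_DECAY_ITERS + 1):
    print(f"iter {it}: lr {cosine_lr(it):.2e}")
```

If the job used a different schedule that happens to match these 11 points, this sketch would not show it, so treat it as a sanity check only.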

@Antlera
Collaborator Author

Antlera commented Jul 22, 2023

A big step towards completing #350 in allreduce mode.

@workingloong
Collaborator

workingloong commented Jul 23, 2023

The PR needs to merge the master branch and format the code to pass pre-commit.

@Antlera
Collaborator Author

Antlera commented Jul 23, 2023

The PR needs to merge the master branch and format the code to pass pre-commit.

I tried to run pre-commit locally using the Docker image easydl/dlrover:ci to format my code. However, it failed with the errors described in #517.

@merlintang
Collaborator

Nanogpt meaning?

@Antlera
Collaborator Author

Antlera commented Jul 24, 2023

Nanogpt meaning?

NanoGPT is a GPT implementation that can be built from scratch by setting the transformer's n_layer, n_head, and n_embedding. Users can test DLRover's capabilities on GPT models scaled from 6M parameters to 1.5B parameters (GPT2-xl size) or even larger.
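
To make the scaling concrete, here is a rough sketch of how the parameter count of a GPT-2 style model grows with n_layer and n_embedding. The function name, the default vocab_size and block_size (the standard GPT-2 values), and the assumptions of bias terms and a tied output head are mine, not taken from this PR. n_head does not appear because it only splits n_embedding across heads and does not change the count.

```python
def gpt_param_count(n_layer: int, n_embedding: int,
                    vocab_size: int = 50257, block_size: int = 1024) -> int:
    """Estimate parameters of a GPT-2 style model (hypothetical helper).

    Assumes bias terms everywhere and an output head that shares weights
    with the token embedding.
    """
    token_emb = vocab_size * n_embedding   # token embedding (tied with the head)
    pos_emb = block_size * n_embedding     # learned position embedding
    # Per block: attention QKV + projection (4n^2 + 4n), MLP (8n^2 + 5n),
    # and two LayerNorms (4n), which is 12n^2 + 13n in total.
    per_block = 12 * n_embedding ** 2 + 13 * n_embedding
    final_ln = 2 * n_embedding             # final LayerNorm weight and bias
    return token_emb + pos_emb + n_layer * per_block + final_ln


# GPT-2 small and GPT2-xl shapes: about 124M and about 1.56B parameters.
print(f"{gpt_param_count(12, 768):,}")
print(f"{gpt_param_count(48, 1600):,}")
```

Because the per-block term grows with the square of n_embedding, widening the model raises the parameter count much faster than adding layers does.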

@Antlera
Collaborator Author

Antlera commented Jul 24, 2023

Nanogpt meaning?

We can submit a doc explaining how to scale nanoGPT to test DLRover's elasticity.

@workingloong
Collaborator

Nanogpt meaning?

https://github.com/karpathy/nanoGPT

Collaborator

@merlintang merlintang left a comment

Does this source refer to another open-source implementation of GPT-2?

@Antlera
Collaborator Author

Antlera commented Jul 25, 2023

Does this source refer to another open-source implementation of GPT-2?

Yes, and the reference will be added to the doc.

Collaborator

@workingloong workingloong left a comment

LGTM

@Antlera
Collaborator Author

Antlera commented Jul 25, 2023

Does this source refer to another open-source implementation of GPT-2?

This will be resolved in #526 .

@Antlera Antlera closed this Jul 25, 2023
@Antlera Antlera reopened this Jul 25, 2023
@merlintang merlintang changed the title from "Nanogpt support example." to "[Example] NanoGPT support in DLrover" Jul 25, 2023
@merlintang
Collaborator

LGTM

@merlintang merlintang merged commit b014272 into intelligent-machine-learning:master Jul 26, 2023
13 checks passed