
RWKV-v4 training doesn't stop after max_epochs defined #266

Open
shamilajeewantha opened this issue Oct 19, 2024 · 2 comments

@shamilajeewantha

I tried training from scratch as explained in readme.

Training / Fine-tuning
pip install deepspeed==0.7.0 // pip install pytorch-lightning==1.9.5 // torch 1.13.1+cu117

NOTE: add weight decay (0.1 or 0.01) and dropout (0.1 or 0.01) when training on small amt of data. try x=x+dropout(att(x)) x=x+dropout(ffn(x)) x=dropout(x+att(x)) x=dropout(x+ffn(x)) etc.

Training RWKV-4 from scratch: run train.py, which by default is using the enwik8 dataset (unzip https://data.deepai.org/enwik8.zip).

I changed n_epoch = 500 to n_epoch = 5 to test the training functionality, but the log keeps going beyond that, as shown below. Is there a way to train for a shorter number of epochs, or some other configuration change that makes training stop?

miniE 1 s 833 prog 20.00% : ppl 5.406813 loss 1.687660 lr 3.330213e-04: 100%|██████████| 833/833 [03:21<00:00, 4.13it/s]
miniE 2 s 1666 prog 40.00% : ppl 3.495054 loss 1.251349 lr 1.386290e-04: 100%|██████████| 833/833 [03:17<00:00, 4.21it/s]
miniE 3 s 2499 prog 60.00% : ppl 3.216982 loss 1.168444 lr 5.770800e-05: 100%|██████████| 833/833 [03:17<00:00, 4.22it/s]
miniE 4 s 3332 prog 80.00% : ppl 3.105918 loss 1.133309 lr 2.402249e-05: 100%|██████████| 833/833 [03:17<00:00, 4.22it/s]
miniE 5 s 4165 prog 100.00% : ppl 3.047873 loss 1.114444 lr 1.000000e-05: 100%|██████████| 833/833 [03:17<00:00, 4.22it/s]
miniE 6 s 4998 prog 120.00% : ppl 3.037687 loss 1.111096 lr 1.000000e-05: 100%|██████████| 833/833 [03:17<00:00, 4.22it/s]
miniE 7 s 5831 prog 140.00% : ppl 3.021025 loss 1.105596 lr 1.000000e-05: 100%|██████████| 833/833 [03:17<00:00, 4.21it/s]
miniE 8 s 6664 prog 160.00% : ppl 3.018359 loss 1.104713 lr 1.000000e-05: 100%|██████████| 833/833 [03:17<00:00, 4.21it/s]
miniE 9 s 7497 prog 180.00% : ppl 3.006846 loss 1.100892 lr 1.000000e-05: 100%|██████████| 833/833 [03:17<00:00, 4.21it/s]
miniE 10 s 8330 prog 200.00% : ppl 2.985658 loss 1.093820 lr 1.000000e-05: 100%|██████████| 833/833 [03:17<00:00, 4.21it/s]
miniE 11 s 8344 prog 200.34% : ppl 2.969897 loss 1.088527 lr 1.000000e-05: 2%|▏ | 14/833 [00:03<03:15, 4.20it/s]
miniE 11 s 8344 prog 200.34% : ppl 2.969897 loss 1.088527 lr 1.000000e-05: 2%|▏ | 14/833 [00:03<03:37, 3.76it/s]
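
For context, here is a minimal sketch of the kind of epoch cap being asked about. This is hypothetical plain PyTorch, not the RWKV-v4 trainer (which drives its own mini-epoch loop), and the model, data, and hyperparameters are stand-ins:

```python
# Illustrative sketch only -- not the RWKV-v4 trainer. It shows the behaviour the
# reporter expected: once n_epoch full passes are done, the loop exits instead of
# continuing past 100% progress.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

n_epoch = 5                                   # desired hard cap on epochs (stand-in value)
model = nn.Linear(8, 1)                       # stand-in for the real model
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
data = DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=16)

for epoch in range(n_epoch):                  # terminates after exactly n_epoch passes
    for x, y in data:
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"epoch {epoch + 1}/{n_epoch} loss {loss.item():.4f}")
# training stops here; no further epochs are run
```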

@BlinkDL (Owner) commented Oct 25, 2024

Use the final rwkv-xx.pth with the largest xx.

It's highly recommended to train RWKV-6, which will give you rwkv-final.pth.
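
A small helper sketch for following that suggestion, assuming the checkpoints are saved as rwkv-<N>.pth in the training output directory; the directory path and file pattern are assumptions to adjust to your setup:

```python
# Pick the rwkv-<N>.pth checkpoint with the largest N from a directory.
import re
from pathlib import Path

import torch

def latest_checkpoint(ckpt_dir="."):
    """Return the rwkv-<N>.pth path with the largest <N>, or None if none exist."""
    candidates = []
    for p in Path(ckpt_dir).glob("rwkv-*.pth"):
        m = re.fullmatch(r"rwkv-(\d+)\.pth", p.name)
        if m:
            candidates.append((int(m.group(1)), p))
    return max(candidates)[1] if candidates else None

ckpt = latest_checkpoint(".")                        # assumed checkpoint directory
if ckpt is not None:
    state_dict = torch.load(ckpt, map_location="cpu")  # weights for inference / run.py
    print(f"loaded {ckpt}")
```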

@shamilajeewantha (Author)

Thank you a lot for the response! I will also try RWKV-v6.

I would also be grateful if you could provide some guidance on the following matter:
#268

"With rwkv-V4, If I wish to make an encoder decoder model for example to be used in translation, what are the hidden states that needs passing between the encoder and the decoder? Can you provide some guideline on this matter or any existing work?"

I want to make an encoder-decoder model equivalent to LSTM-based encoder-decoders, where the hidden state gets passed to the decoder to inform the decoding process. I would really appreciate some information on what could serve as the equivalent of an LSTM's hidden state. Furthermore, the run and train files seem to follow different architectures; I would also like to know the high-level difference between the two.
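
A conceptual sketch of the LSTM-style hand-off being asked about, using a hypothetical RWKVRNN interface where step(token, state) returns (logits, new_state). The real RWKV-v4 RNN state layout (several tensors per layer) and its API differ, so this shows only the shape of the idea, not working RWKV code:

```python
# Conceptual encoder-decoder sketch with a hypothetical RWKVRNN stand-in.
# The per-layer `state` plays the role an LSTM's (h, c) plays in an LSTM
# encoder-decoder: the encoder's final state seeds the decoder.
from typing import List, Tuple
import torch

class RWKVRNN:                                  # hypothetical stand-in, not the real model
    def __init__(self, n_layer: int, n_embd: int, vocab: int):
        self.n_layer, self.n_embd, self.vocab = n_layer, n_embd, vocab

    def init_state(self) -> List[torch.Tensor]:
        return [torch.zeros(self.n_embd) for _ in range(self.n_layer)]

    def step(self, token: int, state: List[torch.Tensor]) -> Tuple[torch.Tensor, List[torch.Tensor]]:
        new_state = [s.clone() for s in state]  # placeholder for the real recurrence
        return torch.zeros(self.vocab), new_state

def encode(encoder: RWKVRNN, src_tokens: List[int]) -> List[torch.Tensor]:
    state = encoder.init_state()
    for t in src_tokens:                        # run the source sequence through the encoder
        _, state = encoder.step(t, state)
    return state                                # final recurrent state, analogous to LSTM (h, c)

def decode(decoder: RWKVRNN, enc_state: List[torch.Tensor], bos: int, max_len: int) -> List[int]:
    state, token, out = enc_state, bos, []      # seed the decoder with the encoder's state
    for _ in range(max_len):
        logits, state = decoder.step(token, state)
        token = int(torch.argmax(logits))       # greedy decoding for illustration
        out.append(token)
    return out
```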
