-
Wow, this is great progress! I'm so excited to see how fast people can make this throughout the year. I'm not sure if I mentioned this yet, but Synthyra is planning on getting some sponsors together for a hackathon-style contest with some prizes for fastest times. (Hopefully Q1 or Q2).
This makes sense, but curious if you got the idea from anywhere?
I like this!
Could you explain this a bit more or show me some of your previous code for this?
I've also observed that a long cooldown is very important for the ESM runs. A lot of the loss convergence seems to happen in this window, so maybe the initial LR is too high?
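For reference, a minimal sketch of the kind of schedule being discussed, with a short warmup, a constant phase, and an extended linear cooldown. The fractions and peak LR here are illustrative placeholders, not the actual run settings:

```python
# Warmup-stable-decay LR schedule with a long cooldown (illustrative values).
def lr_at_step(step, total_steps, peak_lr=1e-3,
               warmup_frac=0.02, cooldown_frac=0.5, final_lr_frac=0.0):
    warmup_steps = max(1, int(total_steps * warmup_frac))
    cooldown_steps = max(1, int(total_steps * cooldown_frac))
    cooldown_start = total_steps - cooldown_steps

    if step < warmup_steps:          # linear warmup
        return peak_lr * (step + 1) / warmup_steps
    if step < cooldown_start:        # constant phase
        return peak_lr
    # linear cooldown toward final_lr_frac * peak_lr
    progress = (step - cooldown_start) / cooldown_steps
    return peak_lr * (1.0 - progress * (1.0 - final_lr_frac))
```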
-
The cooldown from 30% -> 15% was from https://aclanthology.org/2024.eacl-short.42.pdf. A separate cooldown for the replacement rate was just something I tried on a whim, but it seems to be quite beneficial. It's a bit messy: what you're really doing is cooling the masking rate from 15% -> 12% and the replacement rate from 7.5% -> 1.5%. (Side note: when you break it down into its components, this is an ugly quadratic rather than a linear shift.)
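A sketch of one decomposition that reproduces the quoted endpoints: the selection rate cools linearly 30% -> 15% while the mask/replace split of the selected tokens cools linearly from 50%/25% to the standard 80%/10%. Those split values are an inference that happens to match the numbers above, not necessarily the run's actual settings. Each effective per-token rate is the product of two linear terms, which is where the quadratic shape comes from:

```python
# Corruption-rate cooldown sketch; endpoints from the comment, linear
# interpolation of both factors is an assumption.
def corruption_rates(progress):
    """progress: 0.0 at the start of the cooldown, 1.0 at the end of training."""
    select = 0.30 + (0.15 - 0.30) * progress        # tokens selected: 30% -> 15%
    mask_frac = 0.50 + (0.80 - 0.50) * progress     # of selected, masked: 50% -> 80%
    replace_frac = 0.25 + (0.10 - 0.25) * progress  # of selected, replaced: 25% -> 10%
    # Products of two linear terms => quadratic in `progress`.
    return select * mask_frac, select * replace_frac

print(corruption_rates(0.0))  # roughly (0.15, 0.075): 15% masked, 7.5% replaced
print(corruption_rates(1.0))  # roughly (0.12, 0.015): 12% masked, 1.5% replaced
```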
Again, this might just be noise since I only ran one trial, but it's worth trying. Based roughly on https://arxiv.org/pdf/2312.06522
Quite likely. There are only 33 tokens, so the lm_heads / embeddings are good targets for a lower LR. The inner layers might also need a lower LR. BERTs also seem to be generally trained with a lower LR than GPTs.
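A minimal PyTorch sketch of splitting the LR by parameter group along these lines. The module layout (`embed`, `blocks`, `lm_head`) and the LR values are illustrative stand-ins, not the actual speedrun code:

```python
import torch
import torch.nn as nn

# Tiny stand-in model; attribute names are illustrative only.
class TinyMLM(nn.Module):
    def __init__(self, vocab_size=33, dim=256, depth=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.blocks = nn.Sequential(*[
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(depth)
        ])
        self.lm_head = nn.Linear(dim, vocab_size)

model = TinyMLM()

# Lower LR for embeddings / lm_head (small 33-token vocab), base LR elsewhere.
param_groups = [
    {"params": list(model.embed.parameters()) + list(model.lm_head.parameters()),
     "lr": 3e-4},
    {"params": model.blocks.parameters(), "lr": 1e-3},
]
optimizer = torch.optim.AdamW(param_groups, weight_decay=0.01)
```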
-
A 44M-parameter model trained for 20,000 steps beats ESMC-300M, with a val loss of 2.1906.
Log
Details:
4*64*1024
Changes
Recommended further improvements: