-
Wow, this is great progress! I'm so excited to see how fast people can make this throughout the year. I'm not sure if I mentioned this yet, but Synthyra is planning on getting some sponsors together for a hackathon-style contest with some prizes for fastest times. (Hopefully Q1 or Q2).
This makes sense, but curious if you got the idea from anywhere?
I like this!
Could you explain this a bit more or show me some of your previous code for this?
I've also observed that a long cooldown is very important for the ESM runs. A lot of the loss convergence seems to happen in this window, so maybe the initial LR is too high?
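For reference, a minimal sketch of the kind of schedule being discussed, with a short warmup, a constant phase, and an extended linear cooldown. The fractions and peak LR here are illustrative placeholders, not the actual run settings:

```python
# Warmup-stable-decay LR schedule with a long cooldown (illustrative values).
def lr_at_step(step, total_steps, peak_lr=1e-3,
               warmup_frac=0.02, cooldown_frac=0.5, final_lr_frac=0.0):
    warmup_steps = max(1, int(total_steps * warmup_frac))
    cooldown_steps = max(1, int(total_steps * cooldown_frac))
    cooldown_start = total_steps - cooldown_steps

    if step < warmup_steps:          # linear warmup
        return peak_lr * (step + 1) / warmup_steps
    if step < cooldown_start:        # constant phase
        return peak_lr
    # linear cooldown toward final_lr_frac * peak_lr
    progress = (step - cooldown_start) / cooldown_steps
    return peak_lr * (1.0 - progress * (1.0 - final_lr_frac))
```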
-
The cooldown from 30% -> 15% was from https://aclanthology.org/2024.eacl-short.42.pdf. A separate cooldown for the replacement rate was just something I tried on a whim, but it seems to be quite beneficial. It's a bit messy: what you're really doing is cooling the masking rate from 15% -> 12% and the replacement rate from 7.5% -> 1.5%. (Side note: when you break it down into its components, this is an ugly quadratic rather than a linear shift.)
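A sketch of one decomposition that reproduces the quoted endpoints: the selection rate cools linearly 30% -> 15% while the mask/replace split of the selected tokens cools linearly from 50%/25% to the standard 80%/10%. Those split values are an inference that happens to match the numbers above, not necessarily the run's actual settings. Each effective per-token rate is the product of two linear terms, which is where the quadratic shape comes from:

```python
# Corruption-rate cooldown sketch; endpoints from the comment, linear
# interpolation of both factors is an assumption.
def corruption_rates(progress):
    """progress: 0.0 at the start of the cooldown, 1.0 at the end of training."""
    select = 0.30 + (0.15 - 0.30) * progress        # tokens selected: 30% -> 15%
    mask_frac = 0.50 + (0.80 - 0.50) * progress     # of selected, masked: 50% -> 80%
    replace_frac = 0.25 + (0.10 - 0.25) * progress  # of selected, replaced: 25% -> 10%
    # Products of two linear terms => quadratic in `progress`.
    return select * mask_frac, select * replace_frac

print(corruption_rates(0.0))  # roughly (0.15, 0.075): 15% masked, 7.5% replaced
print(corruption_rates(1.0))  # roughly (0.12, 0.015): 12% masked, 1.5% replaced
```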
Again, this might just be noise since I only ran one trial, but it's worth trying. Based roughly on https://arxiv.org/pdf/2312.06522
Quite likely. There are only 33 tokens, so the lm_heads / embeddings are good targets for a lower LR. The inner layers might also need a lower LR. BERTs also seem to be generally trained with a lower LR than GPTs.
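A minimal PyTorch sketch of splitting the LR by parameter group along these lines. The module layout (`embed`, `blocks`, `lm_head`) and the LR values are illustrative stand-ins, not the actual speedrun code:

```python
import torch
import torch.nn as nn

# Tiny stand-in model; attribute names are illustrative only.
class TinyMLM(nn.Module):
    def __init__(self, vocab_size=33, dim=256, depth=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.blocks = nn.Sequential(*[
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(depth)
        ])
        self.lm_head = nn.Linear(dim, vocab_size)

model = TinyMLM()

# Lower LR for embeddings / lm_head (small 33-token vocab), base LR elsewhere.
param_groups = [
    {"params": list(model.embed.parameters()) + list(model.lm_head.parameters()),
     "lr": 3e-4},
    {"params": model.blocks.parameters(), "lr": 1e-3},
]
optimizer = torch.optim.AdamW(param_groups, weight_decay=0.01)
```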
-
A 44M-parameter model trained for 20,000 steps beats ESMC-300M, with a val loss of 2.1906.
Log
Details:
4*64*1024
Changes
Recommended further improvements: