cache clearing interval for previous hidden states #2

Open · ekg opened this issue Feb 28, 2024 · 1 comment

ekg commented Feb 28, 2024

I love this exploration! Thanks for writing and coding this up. Right now, we're working on modifications to the causal conv1d and selective scan CUDA kernels to support defining the input state, so we are reviewing your code carefully.

What is the objective of the exponential fall-off in the cache clearing in train-infinite.py?

        if completed_steps % clear_cache_interval == 0:
            for layer_idx in range(model.config.n_layer):
                conv_state = torch.zeros((1, model.config.d_model*2, 3), dtype=torch.bfloat16, device=accelerator.device).detach()
                ssm_state = torch.zeros((1, model.config.d_model*2, 16), dtype=torch.bfloat16, device=accelerator.device).detach()
                previous_hidden_states.append((conv_state, ssm_state))
            clear_cache_interval *= 2

Also, a general question: do you have a feeling for why your current implementation isn't working? Might vanishing gradients be an issue when running over longer sequences? I noticed that you're using bf16. I found this caused instability, and using amp for higher precision seemed to help.
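
For concreteness, here's a rough sketch of the distinction I mean (a toy nn.Sequential stands in for the Mamba model; none of this is code from this repo): casting the whole model to bf16 runs normalizations and reductions in bf16 as well, whereas AMP keeps fp32 parameters and only lowers the matmuls inside the autocast region.

    import torch
    import torch.nn as nn

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Toy stand-in for the Mamba model; only the precision handling matters here.
    model = nn.Sequential(nn.Linear(256, 256), nn.LayerNorm(256), nn.Linear(256, 1)).to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    x = torch.randn(8, 256, device=device)
    target = torch.randn(8, 1, device=device)

    # Pure bf16 would be: model.to(torch.bfloat16), x.to(torch.bfloat16) -- compact,
    # but every op (including LayerNorm and the loss) then runs in bf16.

    # AMP-style mixed precision: parameters and optimizer state stay fp32, and
    # autocast lowers the matmuls in the forward pass to bf16 while keeping
    # numerically sensitive ops (normalization, loss) in fp32.
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        loss = nn.functional.mse_loss(model(x), target)
    loss.backward()  # gradients are fp32 because the parameters are fp32
    opt.step()
    opt.zero_grad()

I believe passing mixed_precision="bf16" to accelerate's Accelerator gives you this second behaviour without writing the autocast block by hand.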

jzhang38 (Owner) commented Feb 29, 2024

> What is the objective of the exponential fall-off in the cache clearing in train-infinite.py?

Ah, this is just based on the intuition that we should clear the hidden states from time to time so that they get regenerated by the newest model weights during training. Nothing rigorous here.
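
As a standalone illustration of what the doubling does (not the training loop itself, and the initial interval of 100 is just an example), the resets end up exponentially spaced:

    # Toy illustration of the reset schedule implied by the doubling interval:
    # with an initial interval of 100 steps, resets fire at 100, 200, 400, 800, ...
    def reset_steps(total_steps: int, interval: int = 100) -> list[int]:
        steps = []
        for step in range(1, total_steps + 1):
            if step % interval == 0:
                steps.append(step)
                interval *= 2  # same doubling as in train-infinite.py
        return steps

    print(reset_steps(5000))  # [100, 200, 400, 800, 1600, 3200]

So the states get refreshed often early in training and then persist for longer and longer stretches as the weights stabilize.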

> Also, a general question: do you have a feeling for why your current implementation isn't working?

I remember the problem I encountered was NaN loss, so it was probably not vanishing gradients but rather exploding activations or gradients when the sequence length gets too long.
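
If someone wants to experiment with guarding against that, a rough sketch of the usual safeguards (gradient-norm clipping and skipping non-finite losses; the model and names here are purely illustrative, not from train-infinite.py) would look like:

    import torch
    import torch.nn as nn

    model = nn.Linear(256, 256)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    def training_step(batch: torch.Tensor, target: torch.Tensor):
        loss = nn.functional.mse_loss(model(batch), target)
        if not torch.isfinite(loss):
            # Skip the update instead of letting a NaN/Inf loss poison the weights.
            opt.zero_grad(set_to_none=True)
            return None
        loss.backward()
        # Cap the global gradient norm; with Accelerate this would be
        # accelerator.clip_grad_norm_(model.parameters(), 1.0).
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        opt.step()
        opt.zero_grad(set_to_none=True)
        return loss.item()

Logging the gradient norm right before clipping also makes it easy to see whether the blow-up correlates with sequence length.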
