First and second predictions yield slightly different results when both jit and dropout are enabled #9

Open
justheuristic opened this issue Mar 17, 2022 · 0 comments

Curiously, the outputs of the first two forward passes of LeanTransformer on CPU may differ by a small amount (~1e-5), even with torch.use_deterministic_algorithms(True).

To reproduce, go to this test and remove the "for i in range(2)" loop:

for i in range(2):
    model.set_optimizations(
        gradient_checkpointing=checkpoints, checkpoint_last=checkpoint_last,
        checkpoint_attention_core=custom_attn, ffn_custom_grad=custom_ffn, preserve_rng_state=preserve_rng)
    torch.manual_seed(1337)
    out = model(**batch)
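
For anyone who wants to check the "first vs. second pass" property in isolation, here is a minimal sketch. It is not the actual test: toy_model, toy_batch, forward_with_seed and assert_repeatable are hypothetical names, and the tiny eager model only stands in for LeanTransformer. The idea is to reseed before each pass and require bitwise-equal outputs.

```python
import torch
import torch.nn as nn


def forward_with_seed(model: nn.Module, batch: torch.Tensor, seed: int = 1337) -> torch.Tensor:
    """Reset the global RNG, then run a single forward pass."""
    torch.manual_seed(seed)
    return model(batch)


def assert_repeatable(model: nn.Module, batch: torch.Tensor, seed: int = 1337) -> None:
    """Raise AssertionError if two seeded forward passes are not bitwise equal."""
    first = forward_with_seed(model, batch, seed).detach().clone()
    second = forward_with_seed(model, batch, seed).detach().clone()
    # rtol=0 and atol=0 request exact equality rather than the default tolerances.
    torch.testing.assert_close(second, first, rtol=0.0, atol=0.0)


# Stand-in for LeanTransformer (an assumption): a small eager model with
# dropout, kept in training mode so that dropout is active.
toy_model = nn.Sequential(nn.Linear(16, 32), nn.Dropout(p=0.1), nn.Linear(32, 16)).train()
toy_batch = torch.randn(4, 16)
assert_repeatable(toy_model, toy_batch)
```

Swapping the stand-in for the real model and batch from the test should surface the discrepancy as an AssertionError instead of hiding it behind the extra loop iteration.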

Known facts:

  • it works consistently on torch==1.10.2, but inconsistently on torch==1.11.0
  • disabling JIT everywhere also fixes the issue, but it is unclear which exact JIT-compiled function causes the inconsistency
    • to reproduce: LEAN_USE_JIT=0 pytest ./tests/test_modifications.py
  • some configurations work fine: dropout=0, or disabling custom autograd in BOTH ffn and attn
  • it works consistently with position_embedding_type='absolute', but inconsistently with 'rotary'
  • populating the rotary cache beforehand does not seem to solve the issue
  • example of a failed job without range(2): https://github.com/learning-at-home/lean_transformer/runs/5584141044?check_suite_focus=true
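
The observations above could be turned into a systematic sweep over flag combinations. The sketch below is only an illustration: GRID, sweep and reported_behaviour are hypothetical names, and reported_behaviour is a placeholder that encodes one reading of the list above (JIT enabled, torch 1.11.0), not a measurement. In practice it would be replaced by a function that builds the model for each config and compares two seeded forward passes.

```python
from itertools import product

# Flags mentioned in this issue; the value lists are illustrative.
GRID = {
    "dropout": [0.0, 0.1],
    "custom_attn": [False, True],
    "custom_ffn": [False, True],
    "position_embedding_type": ["absolute", "rotary"],
}


def sweep(check, grid):
    """Evaluate check(**config) for every combination of values in grid."""
    keys = list(grid)
    results = []
    for values in product(*(grid[key] for key in keys)):
        config = dict(zip(keys, values))
        results.append((config, bool(check(**config))))
    return results


def reported_behaviour(dropout, custom_attn, custom_ffn, position_embedding_type):
    """Placeholder check: an interpretation of the reported facts, NOT a measurement.

    Returns True when the configuration is expected to be consistent.
    """
    uses_custom_autograd = custom_attn or custom_ffn
    inconsistent = (
        dropout > 0
        and uses_custom_autograd
        and position_embedding_type == "rotary"
    )
    return not inconsistent


results = sweep(reported_behaviour, GRID)
for config, consistent in results:
    if not consistent:
        print("inconsistent:", config)
```

Keeping the sweep separate from the check makes it easy to add further axes later, such as the torch version or LEAN_USE_JIT.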

Hypothesis: the issue may be caused by JIT running non-optimized code on the first pass, which may have different RNG behavior and/or use different dtypes than the optimized code that runs on later passes.
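
One way to probe this hypothesis outside the library is sketched below. It is an assumption-laden toy, not code from lean_transformer: scripted_block and diffs_vs_first_call are hypothetical names, and it is untested whether such a small function reproduces the discrepancy at all. It calls a scripted function several times with the same seed and reports how far each later call is from the first one, once with the default executor and once inside torch.jit.optimized_execution(False).

```python
import torch
import torch.nn.functional as F


@torch.jit.script
def scripted_block(x: torch.Tensor, weight: torch.Tensor, p: float) -> torch.Tensor:
    # Small chain of ops: matmul -> gelu -> dropout (always in training mode).
    hidden = F.gelu(torch.matmul(x, weight))
    return F.dropout(hidden, p=p, training=True)


def diffs_vs_first_call(x, weight, p=0.1, seed=1337, n_calls=4):
    """Call scripted_block n_calls times with the same seed.

    Returns max |out_i - out_0| for every call after the first.
    """
    outputs = []
    for _ in range(n_calls):
        torch.manual_seed(seed)  # same seed before every call
        outputs.append(scripted_block(x, weight, p).detach().clone())
    return [(out - outputs[0]).abs().max().item() for out in outputs[1:]]


x = torch.randn(4, 16)
weight = torch.randn(16, 16)

# Default executor: the earliest calls may run a different code path
# than later ones, which is what the hypothesis is about.
print("default executor: ", diffs_vs_first_call(x, weight))

# Same measurement with graph-executor optimizations switched off.
with torch.jit.optimized_execution(False):
    print("optimizations off:", diffs_vs_first_call(x, weight))
```

If the differences are nonzero only with the default executor, that would point toward the first-pass explanation; all zeros in both cases would only mean that this toy is too simple to trigger the problem.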
