Curiously, the first two iterations of LeanTransformer on CPU may differ by a small amount (~1e-5), even with `torch.use_deterministic_algorithms(True)`.
To reproduce, remove the `for i in range(2)` loop from this test:
`lean_transformer/tests/test_modifications.py`, lines 63 to 68 at commit `e737a8f`
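For orientation, here is a minimal sketch of the kind of check that test performs (plain PyTorch with a placeholder module, not the actual test code; the model, shapes, and tolerance are made up):

```python
import torch

torch.use_deterministic_algorithms(True)

# placeholder stand-in for the transformer under test
model = torch.nn.Sequential(torch.nn.Linear(32, 32), torch.nn.Dropout(0.1))
inputs = torch.randn(4, 32)

outputs = []
for i in range(2):
    torch.manual_seed(1337)  # identical RNG state before each iteration
    out = model(inputs)
    out.sum().backward()
    model.zero_grad()
    outputs.append(out.detach().clone())

# with torch==1.10.2 both iterations match exactly; with torch==1.11.0
# and JIT enabled, the first pass reportedly drifts by ~1e-5
max_diff = (outputs[0] - outputs[1]).abs().max().item()
assert max_diff < 1e-7, max_diff
```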
Known facts:
- it works consistently on torch==1.10.2, but inconsistently on torch==1.11.0
- disabling JIT everywhere also fixes the issue, but it is unclear which exact JIT-compiled function causes the inconsistency
- to run the tests with JIT disabled: `LEAN_USE_JIT=0 pytest ./tests/test_modifications.py`
- there are configurations where everything works fine: dropout=0, or disabling custom autograd in BOTH ffn and attn (see the RNG sketch after this list)
- it works consistently with position_embedding_type='absolute', but inconsistently with 'rotary'
- pre-populating the rotary cache beforehand does not seem to fix the issue
- example of a failed job without the `range(2)` loop: https://github.com/learning-at-home/lean_transformer/runs/5584141044?check_suite_focus=true
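One way to test whether the two iterations consume the RNG differently (which would fit the fact that dropout=0 avoids the problem) is to fingerprint the global RNG state around each pass. This is a generic debugging sketch, not code from the repository:

```python
import torch

def rng_fingerprint() -> int:
    # hash of the global CPU RNG state; changes whenever an op draws randomness
    return hash(torch.get_rng_state().numpy().tobytes())

model = torch.nn.Sequential(torch.nn.Linear(32, 32), torch.nn.Dropout(0.1))
x = torch.randn(4, 32)

for i in range(3):
    torch.manual_seed(0)
    model(x).sum().backward()
    model.zero_grad()
    # if iteration 0 leaves the RNG in a different state than later ones,
    # some op (e.g. dropout inside a JIT-compiled kernel) drew differently
    print(f"iter {i}: rng state after pass = {rng_fingerprint()}")
```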
Hypothesis: the issue may be caused by the JIT running non-optimized code on the first pass, which may have different RNG behavior and/or produce different intermediate dtypes.
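If that hypothesis is right, forcing the JIT to skip profiling (so the first call takes the same code path as all later calls) should make the drift disappear. A sketch using torch's private JIT knobs (torch 1.11-era internals, not a stable API):

```python
import torch

# private JIT executor flags: disable profiling and CPU fusion so that
# the first invocation runs the same code path as subsequent ones
torch._C._jit_set_profiling_executor(False)
torch._C._jit_set_profiling_mode(False)
torch._C._jit_override_can_fuse_on_cpu(False)

@torch.jit.script
def tanh_gelu(x: torch.Tensor) -> torch.Tensor:
    # hypothetical stand-in for one of the scripted helpers in the ffn/attn code
    return 0.5 * x * (1.0 + torch.tanh(0.79788456 * (x + 0.044715 * x * x * x)))

x = torch.randn(8, 8, dtype=torch.float32)
print(torch.equal(tanh_gelu(x), tanh_gelu(x)))  # first call vs. second call
```

If the test still fails with these flags set, the profiling executor is likely not the culprit.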