First and second predictions yield slightly different results when both jit and dropout are enabled #9

Open

@justheuristic

Curiously, the first two forward passes of LeanTransformer on CPU may differ by a small amount (~1e-5), even with torch.use_deterministic_algorithms(True).

To reproduce, go to this test and remove the "for i in range(2)" loop:

```python
for i in range(2):
    model.set_optimizations(
        gradient_checkpointing=checkpoints, checkpoint_last=checkpoint_last,
        checkpoint_attention_core=custom_attn, ffn_custom_grad=custom_ffn,
        preserve_rng_state=preserve_rng)
    torch.manual_seed(1337)
    out = model(**batch)
```
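
For reference, a hedged sketch of what the comparison boils down to (assuming the forward pass yields a tensor; the real test may index into a richer output object and also reapplies set_optimizations inside the loop):

```python
# Hedged sketch, not the test's actual assertion: reseed before each pass and
# collect the outputs, which should then be bitwise identical.
outputs = []
for i in range(2):
    torch.manual_seed(1337)
    outputs.append(model(**batch))

# On torch 1.11 with JIT + dropout enabled, the first pass differs from the second by ~1e-5.
torch.testing.assert_close(outputs[0], outputs[1], rtol=0, atol=0)
```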

Known facts:

  • the test passes consistently on torch==1.10.2, but inconsistently on torch==1.11.0
  • disabling JIT everywhere also fixes the issue, but it is unclear which exact jitted function causes the inconsistency (see the sketch after this list)
    • to reproduce: LEAN_USE_JIT=0 pytest ./tests/test_modifications.py
  • there are configurations where everything works fine: dropout=0, or disabling custom autograd in both FFN and attention
  • it works consistently with position_embedding_type='absolute', but inconsistently with 'rotary'
  • pre-computing the rotary cache beforehand does not seem to solve the issue
  • example of a failed job without range(2): https://github.com/learning-at-home/lean_transformer/runs/5584141044?check_suite_focus=true
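
To narrow down which jitted function is responsible, one option is to gate scripting per call site instead of globally. A minimal sketch of such a switch (hypothetical helper, not lean_transformer's actual implementation of LEAN_USE_JIT):

```python
import os

import torch
import torch.nn.functional as F


def maybe_jit(fn):
    # script `fn` only when LEAN_USE_JIT is enabled; otherwise keep it eager
    if os.environ.get("LEAN_USE_JIT", "1") == "1":
        return torch.jit.script(fn)
    return fn


@maybe_jit
def gelu_dropout(x: torch.Tensor, p: float) -> torch.Tensor:
    # toy stand-in for a jitted block that uses dropout
    return F.dropout(F.gelu(x), p=p, training=True)
```

Switching individual call sites to the eager path one at a time would show which scripted function changes the result between the first and second pass.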

Hypothesis: the issue may be caused by the JIT running non-optimized code on the first pass, which may have different RNG behavior and/or use different dtypes.
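
A minimal way to check this hypothesis in isolation (toy scripted function, not lean_transformer code) is to call the same scripted dropout block several times under identical seeds and see whether only the first call disagrees:

```python
import torch
import torch.nn.functional as F


@torch.jit.script
def dropout_block(x: torch.Tensor) -> torch.Tensor:
    # toy jitted block: the first call runs the non-optimized graph
    return F.dropout(F.relu(x), p=0.1, training=True)


torch.use_deterministic_algorithms(True)
x = torch.randn(4, 8)

outputs = []
for i in range(3):
    torch.manual_seed(1337)  # identical seed before every call
    outputs.append(dropout_block(x))

# If the first (non-optimized) pass consumes the RNG differently,
# outputs[0] will not match the later, optimized outputs.
print(torch.equal(outputs[0], outputs[1]), torch.equal(outputs[1], outputs[2]))
```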
