First and second predictions yield slightly different results when both jit and dropout are enabled #9

Open

@justheuristic

Curiously, the first two forward passes of LeanTransformer on CPU may differ by a small amount (~1e-5), even with torch.use_deterministic_algorithms(True).

To reproduce, go to this test and remove the "for i in range(2)" loop:

```python
for i in range(2):
    model.set_optimizations(
        gradient_checkpointing=checkpoints, checkpoint_last=checkpoint_last,
        checkpoint_attention_core=custom_attn, ffn_custom_grad=custom_ffn,
        preserve_rng_state=preserve_rng)
    torch.manual_seed(1337)
    out = model(**batch)
```
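
For reference, a hedged sketch of what the comparison boils down to (assuming the forward pass yields a tensor; the real test may index into a richer output object and also reapplies set_optimizations inside the loop):

```python
# Hedged sketch, not the test's actual assertion: reseed before each pass and
# collect the outputs, which should then be bitwise identical.
outputs = []
for i in range(2):
    torch.manual_seed(1337)
    outputs.append(model(**batch))

# On torch 1.11 with JIT + dropout enabled, the first pass differs from the second by ~1e-5.
torch.testing.assert_close(outputs[0], outputs[1], rtol=0, atol=0)
```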

Known facts:

  • the test passes consistently on torch==1.10.2, but inconsistently on torch==1.11.0
  • disabling JIT everywhere also fixes the issue, but it is unclear which exact jitted function causes the inconsistency (see the sketch after this list)
    • to reproduce: LEAN_USE_JIT=0 pytest ./tests/test_modifications.py
  • there are configurations where everything works fine: dropout=0, or disabling custom autograd in both FFN and attention
  • it works consistently with position_embedding_type='absolute', but inconsistently with 'rotary'
  • pre-computing the rotary cache beforehand does not seem to solve the issue
  • example of a failed job without range(2): https://github.com/learning-at-home/lean_transformer/runs/5584141044?check_suite_focus=true
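
To narrow down which jitted function is responsible, one option is to gate scripting per call site instead of globally. A minimal sketch of such a switch (hypothetical helper, not lean_transformer's actual implementation of LEAN_USE_JIT):

```python
import os

import torch
import torch.nn.functional as F


def maybe_jit(fn):
    # script `fn` only when LEAN_USE_JIT is enabled; otherwise keep it eager
    if os.environ.get("LEAN_USE_JIT", "1") == "1":
        return torch.jit.script(fn)
    return fn


@maybe_jit
def gelu_dropout(x: torch.Tensor, p: float) -> torch.Tensor:
    # toy stand-in for a jitted block that uses dropout
    return F.dropout(F.gelu(x), p=p, training=True)
```

Switching individual call sites to the eager path one at a time would show which scripted function changes the result between the first and second pass.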

Hypothesis: the issue may be caused by the JIT running non-optimized code on the first pass, which may have different RNG behavior and/or use different dtypes.
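
A minimal way to check this hypothesis in isolation (toy scripted function, not lean_transformer code) is to call the same scripted dropout block several times under identical seeds and see whether only the first call disagrees:

```python
import torch
import torch.nn.functional as F


@torch.jit.script
def dropout_block(x: torch.Tensor) -> torch.Tensor:
    # toy jitted block: the first call runs the non-optimized graph
    return F.dropout(F.relu(x), p=0.1, training=True)


torch.use_deterministic_algorithms(True)
x = torch.randn(4, 8)

outputs = []
for i in range(3):
    torch.manual_seed(1337)  # identical seed before every call
    outputs.append(dropout_block(x))

# If the first (non-optimized) pass consumes the RNG differently,
# outputs[0] will not match the later, optimized outputs.
print(torch.equal(outputs[0], outputs[1]), torch.equal(outputs[1], outputs[2]))
```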
