forked from NVIDIA/Megatron-LM
WIP: UL2 merge #23
Open
RaymondLi0 wants to merge 144 commits into multi-query-attention from ul2-merge
Conversation
…ke all arguments keyword to avoid warnings.
Remove old merge tool. See merge request ADLR/megatron-lm!433
Added a flag to switch between PyTorch and ring-exchange p2p. See merge request ADLR/megatron-lm!434
Support for all masks in the fused kernel + avoiding an in-place operation in the backward pass. See merge request ADLR/megatron-lm!435
Fix a size-mismatch bug. See merge request ADLR/megatron-lm!438
Timing levels. See merge request ADLR/megatron-lm!436
Fixed the grad scaler warning so it only prints for fp16. See merge request ADLR/megatron-lm!441
Fixed the grad scaler warning for bf16. See merge request ADLR/megatron-lm!442
Memory safety checks were incorrect for the tokens_to_generate=0 case. See merge request ADLR/megatron-lm!447
Update state_dict arguments for recent PyTorch versions. See merge request ADLR/megatron-lm!432
The LICENSE file says everything is 3-clause BSD, which is what we want, but at some point the Apache license was added to the top of some files and that proliferated. This commit removes the Apache license from any files that we own the copyright to. It also updates the copyright year and removes the unnecessary coding=utf-8 line.
Clean up licensing. See merge request ADLR/megatron-lm!451
Also merged in some changes from Apex.
Since the normal distribution is unbounded, we cannot have `max_ngrams` set to a bounded value.
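For context, a minimal sketch of what this implies for UL2-style span-length sampling (the function and parameter names below are illustrative, not the ones in this PR): when span lengths are drawn from a normal distribution around a mean span length, the distribution has no upper bound, so a fixed `max_ngrams`-style cap would truncate it; only the lower end needs clamping.

```python
import numpy as np

def sample_span_length(mean_ngrams: float, std: float, rng: np.random.Generator) -> int:
    """Illustrative sketch: draw a span length from an (unbounded) normal
    distribution. Only the lower end is clamped; no `max_ngrams` cap applies."""
    length = rng.normal(loc=mean_ngrams, scale=std)
    return max(1, int(round(length)))

rng = np.random.default_rng(0)
lengths = [sample_span_length(mean_ngrams=3.0, std=1.0, rng=rng) for _ in range(5)]
print(lengths)  # e.g. [3, 3, 4, 3, 2] -- values above any fixed cap remain possible
```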
"Filtered" means tokens that are not `cls_id` or `sep_id` tokens. This slightly improves the calculated statistics for long sequences and greatly improves them for very short sequences.
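As an illustration of the "filtered" notion (the helper below is hypothetical, not the PR's actual code): special tokens such as `cls_id` and `sep_id` are excluded before counting, which matters most when the sequence is very short and the special tokens make up a large fraction of it.

```python
def filtered_length(token_ids, cls_id, sep_id):
    """Count tokens excluding [CLS]/[SEP]-style special tokens (hypothetical helper)."""
    return sum(1 for t in token_ids if t not in (cls_id, sep_id))

# For a very short sequence the correction is large: 5 raw tokens vs. 3 filtered.
print(filtered_length([101, 7, 8, 9, 102], cls_id=101, sep_id=102))  # 3
```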
Via an extra "private" argument.
The GPT tokenizer does not handle the difference between UL2 tokens and other special tokens well. This should be fine, since nothing currently assumes that UL2 tokens are distinct from other special tokens (although other tokenizers implement it that way). In general, `additional_special_token_ids` is new for the GPT tokenizer, so there is no backward-compatibility trouble.
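A hedged sketch of the idea behind `additional_special_token_ids` (illustrative only; this does not mirror Megatron's actual tokenizer classes): UL2 mode tokens are appended to the vocabulary and exposed through a single property, with no separate bookkeeping that would distinguish them from other special tokens.

```python
class ToyGPTTokenizer:
    """Illustrative wrapper: UL2 tokens are registered as plain additional
    special tokens (hypothetical class, not Megatron's implementation)."""

    def __init__(self, vocab: dict[str, int]):
        self.vocab = dict(vocab)
        self._additional_special_tokens: list[str] = []

    def add_additional_special_tokens(self, tokens: list[str]) -> None:
        for tok in tokens:
            if tok not in self.vocab:
                self.vocab[tok] = len(self.vocab)
            self._additional_special_tokens.append(tok)

    @property
    def additional_special_token_ids(self) -> list[int]:
        return [self.vocab[t] for t in self._additional_special_tokens]

tok = ToyGPTTokenizer({"hello": 0, "world": 1})
tok.add_additional_special_tokens(["[R]", "[S]", "[X]"])  # UL2 denoiser mode tokens
print(tok.additional_special_token_ids)  # [2, 3, 4]
```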
Not always strictly necessary; this is only important for the decoder-only case. However, we don't bother checking for this since it's also queried in the `UL2Dataset`.
Usually we do not iterate through all indices, so we can save quite some time if `max_ngrams` is large.
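A rough sketch of the kind of early exit this refers to (variable names are hypothetical): once the masking budget is reached, the loop over candidate indices stops instead of scanning every remaining index, which saves time when `max_ngrams` is large.

```python
def select_spans(candidate_starts, span_lengths, max_masked_tokens):
    """Pick spans until the masking budget is hit, then stop early
    (hypothetical helper, not the PR's actual implementation)."""
    selected = []
    masked = 0
    for start, length in zip(candidate_starts, span_lengths):
        if masked >= max_masked_tokens:
            break  # early exit: no need to walk the remaining indices
        selected.append((start, length))
        masked += length
    return selected

print(select_spans(range(0, 100, 10), [3] * 10, max_masked_tokens=7))
# [(0, 3), (10, 3), (20, 3)] -- stops once the budget is reached
```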
This PR is based on NVIDIA#268
In addition:
TODO: getting around 30% reduced throughput with UL2.