
Add Comet integration #1

Draft · wants to merge 42 commits into main

Conversation

Lothiraldan
Member

This PR adds the ability to log training metrics to Comet.
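For context, a minimal sketch of what logging training metrics to Comet looks like with the standard `comet_ml` API; the project name and the loop below are illustrative, not the actual wiring in this PR:

```python
# Minimal sketch using the standard comet_ml API; the project name and the
# training loop are illustrative, not the actual NeoX integration code.
from comet_ml import Experiment

experiment = Experiment(project_name="gpt-neox")  # API key read from COMET_API_KEY

for step in range(1, 101):
    loss = 1.0 / step  # placeholder for the real training loss
    experiment.log_metric("train/loss", loss, step=step)

experiment.end()
```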

Lothiraldan and others added 30 commits June 11, 2024 18:58
* fix python version and pytest install

* Update NeoXArgs docs automatically

* python3

* Update NeoXArgs docs automatically

* pip not pip3

* Update NeoXArgs docs automatically

* python3 pip

* Update NeoXArgs docs automatically

* python3 -m pip

* Update NeoXArgs docs automatically

* Update NeoXArgs docs automatically

* Update NeoXArgs docs automatically

* add docker setup to workflow

* Update NeoXArgs docs automatically

* python setup

* Update NeoXArgs docs automatically

* python setup v2

* Update NeoXArgs docs automatically

* python setup v3

* python setup v3

* Update NeoXArgs docs automatically

* python setup v3

* Update NeoXArgs docs automatically

* python setup v3

* Update NeoXArgs docs automatically

* python setup v3

* Update NeoXArgs docs automatically

* python setup v3

* python setup v3

* Update NeoXArgs docs automatically

* python setup v3

* Update NeoXArgs docs automatically

* python setup v3

* Update NeoXArgs docs automatically

* python setup v3

* Update NeoXArgs docs automatically

* python setup v3

* Update NeoXArgs docs automatically

* python setup v3

* Update NeoXArgs docs automatically

* python setup v3

* Update NeoXArgs docs automatically

* python setup v3

* Update NeoXArgs docs automatically

* python setup v3

* Update NeoXArgs docs automatically

* python setup v3

* Update NeoXArgs docs automatically

* Update NeoXArgs docs automatically

* Add hash back to DeepSpeed version

* Update NeoXArgs docs automatically

---------

Co-authored-by: github-actions <[email protected]>
Co-authored-by: Quentin Anthony <[email protected]>
* Add a chat data preprocessing script

* add EOT at end of a chat

* update README.md

* apply pre-commit

---------

Co-authored-by: Quentin Anthony <[email protected]>
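On the "add EOT at end of a chat" commit above: a hedged sketch of the idea, appending an end-of-text token after each tokenized conversation so documents are cleanly separated during training (`tokenizer` and its `eod` attribute are hypothetical stand-ins for the repo's actual tokenizer object):

```python
# Hedged sketch: append an end-of-text (EOT) token after the final turn of
# each chat. `tokenizer` and its `eod` token id are hypothetical stand-ins.
def tokenize_chat(turns, tokenizer):
    tokens = []
    for turn in turns:
        tokens.extend(tokenizer.encode(turn))
    tokens.append(tokenizer.eod)  # terminate the whole conversation
    return tokens
```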
* - Add conversion of HF llama models to NeoX

* - Add conversion of HF llama models to NeoX

* - minor fix

* pre-commit

---------

Co-authored-by: Quentin Anthony <[email protected]>
…s_data_with_chat_template.py (EleutherAI#1258)

* bugfix: chat turns instead of repeating the conversation

* pre-commit
* changing from self-hosted runners to GitHub's ubuntu-22.04 runner environment

* adding warning about not using 'self-hosted' runner labels and using GitHub runners instead

* updated some guidance in comments for coverity scan CI

* moving CPU tests to workflow_dispatch only
* first draft (shape errors occurring)

* training works (but poor convergence)

* debugging progress: current commit works if we do regular TP via impl-ing AR in rowparallel as RS then AG

* Update NeoXArgs docs automatically

* push most recent code (updated mark_norms fn, back to 'real' sequence parallel)

* Update NeoXArgs docs automatically

* Fix LayerNorm all reduce gradient hook

* Sum instead of average for LayerNorm gradient all reduce

* Update NeoXArgs docs automatically

* Update NeoXArgs docs automatically

* Fix gather and reduce scatter ops on sequence dimension

* Fix sequence parallel with tied weight embeddings

* Update NeoXArgs docs automatically

* cleanup pass + add MoE arguments.py guard

* pre-commit and clean up comments

* remove vestigial debug code

* remove unused debugging code

* remove dummy test config

* update fp32_allreduce to handle fp16 ; don't cast to fp32 for gathers

* run linter on the rest of the files

* Improve performance of sequence parallel gather, scatter, and reduce

* Add comment

* Update NeoXArgs docs automatically

---------

Co-authored-by: github-actions <[email protected]>
Co-authored-by: Brandon Yang <[email protected]>
Co-authored-by: Quentin Anthony <[email protected]>
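On the LayerNorm gradient commits above ("Fix LayerNorm all reduce gradient hook", "Sum instead of average for LayerNorm gradient all reduce"): with sequence parallelism the norm parameters are replicated across tensor-parallel ranks, so their gradients need a SUM all-reduce after backward. A hedged sketch of the hook pattern; `tp_group` is a hypothetical process-group handle:

```python
import torch.distributed as dist

def mark_norm_params_for_allreduce(model, tp_group):
    # Hedged sketch: register a backward hook on every norm parameter that
    # all-reduces its gradient with SUM (not average) across the TP group.
    def make_hook(group):
        def hook(grad):
            dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=group)  # in place
            return grad
        return hook

    for name, param in model.named_parameters():
        if "norm" in name.lower() and param.requires_grad:
            param.register_hook(make_hook(tp_group))
```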
* Update README.md

I added new models that have come out trained with the GPT-NeoX library. The library itself is now sufficiently well-used that simply listing every citing paper is rapidly becoming non-viable, so I'm currently leaning towards providing a curated list of "exciting" papers; I haven't yet looked at how other libraries handle this.

* Update NeoXArgs docs automatically

---------

Co-authored-by: github-actions <[email protected]>
Co-authored-by: Quentin Anthony <[email protected]>
* mamba fixes and cleaning

* space

* revert assertion change for now

---------

Co-authored-by: Jacob Hatef <[email protected]>
…leutherAI#1240)

* - add different packing impl (Unpacked, packing until overflow)
- fix labels to also have valid/test implementations
- fix label masking in _get_batch to also include anything from get_ltor_masks_and_position_ids

* Update arguments.py to use train_label_data_paths instead of label_data_paths

* - fix precommit
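On the packing commit above: "packing until overflow" greedily concatenates tokenized samples into fixed-length buffers, starting a new one whenever the next sample would not fit. A hedged sketch, with the sample format and over-length handling as assumptions:

```python
# Hedged sketch of "packing until overflow". Each element of `samples` is a
# list of token ids; samples longer than seq_len would need truncation or
# splitting, which is omitted here.
def pack_until_overflow(samples, seq_len):
    packs, current = [], []
    for tokens in samples:
        if current and len(current) + len(tokens) > seq_len:
            packs.append(current)
            current = []
        current.extend(tokens)
    if current:
        packs.append(current)
    return packs
```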
* Update transformer.py -> Add `intermediate_size`

* add support for rwkv and mamba and add todos about swiglu

* refactor activations and mlps

* change llama config to swiglu

* fixes gelu fusion

* pre-commit run

* add assert message to mamba linear

* Update 1-3B.yml

revert accidental change

* Update 1-3B.yml

* fixes various issues

* add back swiglu check

---------

Co-authored-by: jahatef <[email protected]>
Co-authored-by: Quentin Anthony <[email protected]>
Co-authored-by: Jacob Hatef <[email protected]>
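On the activation/MLP refactor above: SwiGLU, which the llama config switches to, gates one up-projection with the SiLU of another before the down-projection. A hedged PyTorch sketch; the layer names and `intermediate_size` usage are illustrative, not taken from the NeoX code:

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUMLP(nn.Module):
    # Hedged sketch: out = W_down( silu(W_gate x) * W_up x ).
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.w_gate = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.w_up = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.w_down = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```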
* Add a chat data preprocessing script

* add EOT at end of a chat

* - add different packing impl (Unpacked, packing until overflow)
- fix labels to also have valid/test implementations
- fix label masking in _get_batch to also include anything from get_ltor_masks_and_position_ids

* update README.md

* - Add metrics to forward step to add DPO specific metrics that are useful (accuracy, etc)
- Add reference model setup for DPO
- Add pairwise dataset for positive/negative pairs
- Add DPO loss

* Update arguments.py to use train_label_data_paths instead of label_data_paths

* - Bugfixes from upstreaming....

* - add precompute logprobs...

* - Finishing up precompute logprobs...

* - update readme for DPO...

* fix varname

* Fix pipeline parallelism and incorrect neox_args name

* apply precommit

---------

Co-authored-by: Quentin Anthony <[email protected]>
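On the DPO commits above: the loss rewards a larger policy-vs-reference log-probability margin on the chosen response than on the rejected one, and the "accuracy" metric mentioned is presumably the fraction of pairs ordered correctly. Precomputing reference log-probs (as the commits above suggest) lets training run without holding the reference model in memory. A hedged sketch of the standard DPO objective, with inputs as per-example summed log-probs and scalar `beta`:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Hedged sketch of the standard DPO objective; inputs are per-example
    # summed log-probabilities over response tokens.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    accuracy = (chosen_rewards > rejected_rewards).float().mean()
    return loss, accuracy
```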
* Add TE skeleton

* Update NeoXArgs docs automatically

* added option for te version of norms

* import TERMSNorm

* add te norm options to norm arg

* add TE objects in weight decay function

* reformat

* add TERMSNorm and TELayerNorm

* Update NeoXArgs docs automatically

* - add Fused RMS Norm from apex

* - make it consistent with how layernorm looks

* Merged transformer engine and apex fused layernorm branches

* Added assertion if TE is used

* Removed unnecessary transformer-engine import

* Changed importerror text for TE

* Added requirements/requirements-transformerengine.txt

* update comments

* precommit

---------

Co-authored-by: Quentin Anthony <[email protected]>
Co-authored-by: github-actions <[email protected]>
Co-authored-by: lintangsutawika <lintang@stella-ord-0.stella-ord.tenant-eleutherai.svc.tenant.chi.local>
Co-authored-by: lintangsutawika <[email protected]>
Co-authored-by: dmahan93 <[email protected]>
Co-authored-by: aurelion-source <[email protected]>
Co-authored-by: aurelion-source <[email protected]>
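On the TransformerEngine commits above ("Added assertion if TE is used", "Changed importerror text for TE"): a hedged sketch of the guarded-import pattern, with hypothetical option names standing in for the actual neox_args values:

```python
import torch.nn as nn

def get_norm_class(norm_type):
    # Hedged sketch; "te_layernorm"/"te_rmsnorm" are hypothetical option names.
    if norm_type in ("te_layernorm", "te_rmsnorm"):
        try:
            import transformer_engine.pytorch as te
        except ImportError as e:
            raise ImportError(
                "A TransformerEngine norm was requested but transformer_engine "
                "is not installed; see "
                "requirements/requirements-transformerengine.txt"
            ) from e
        return te.LayerNorm if norm_type == "te_layernorm" else te.RMSNorm
    return nn.LayerNorm
```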
* fix the te import

* refactor get_params_for_weight_decay_optimization

* remove incorrect type hint and dead imports
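A hedged sketch of the usual shape of a `get_params_for_weight_decay_optimization` refactor: biases and norm parameters are excluded from weight decay, matching norm modules by class (which is where TE norm classes would be added) rather than by name. This is not the repo's exact implementation:

```python
import torch.nn as nn

NO_DECAY_MODULE_TYPES = (nn.LayerNorm,)  # extended with TE/RMSNorm classes when available

def get_params_for_weight_decay_optimization(model, weight_decay=0.1):
    # Hedged sketch: split parameters into decay / no-decay optimizer groups.
    decay, no_decay = [], []
    for module in model.modules():
        for name, param in module.named_parameters(recurse=False):
            if not param.requires_grad:
                continue
            if isinstance(module, NO_DECAY_MODULE_TYPES) or name.endswith("bias"):
                no_decay.append(param)
            else:
                decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
```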
* Add a chat data preprocessing script

* add EOT at end of a chat

* - add different packing impl (Unpacked, packing until overflow)
- fix labels to also have valid/test implementations
- fix label masking in _get_batch to also include anything from get_ltor_masks_and_position_ids

* update README.md

* - Add metrics to forward step to add DPO specific metrics that are useful (accuracy, etc)
- Add reference model setup for DPO
- Add pairwise dataset for positive/negative pairs
- Add DPO loss

* Update arguments.py to use train_label_data_paths instead of label_data_paths

* - Bugfixes from upstreaming....

* - add precompute logprobs...

* - Finishing up precompute logprobs...

* - update readme for DPO...

* - Add RM training

* add comment on why row-parallel for RMs

* fix var name

---------

Co-authored-by: Quentin Anthony <[email protected]>
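On the RM-training commits above: a reward model typically adds a scalar head over the last token's hidden state; per the commit above, the NeoX version makes this projection row-parallel so the scalar output stays replicated across tensor-parallel ranks. A hedged, non-parallel sketch:

```python
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    # Hedged sketch of a scalar reward head; the actual implementation uses a
    # row-parallel linear so the output is replicated across TP ranks.
    def __init__(self, hidden_size):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, hidden_states, last_token_idx):
        # hidden_states: [batch, seq, hidden]; select each sequence's last token.
        batch = torch.arange(hidden_states.size(0), device=hidden_states.device)
        last = hidden_states[batch, last_token_idx]
        return self.score(last).squeeze(-1)  # [batch] scalar rewards
```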