
Add Comet integration #1

Draft · wants to merge 42 commits into main

Conversation

Lothiraldan
Member

This PR adds the ability to log training metrics to Comet.
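For context, a minimal sketch of what logging training metrics to Comet looks like with the standard `comet_ml` API; the project name and the loop below are illustrative, not the actual wiring in this PR:

```python
# Minimal sketch using the standard comet_ml API; the project name and the
# training loop are illustrative, not the actual NeoX integration code.
from comet_ml import Experiment

experiment = Experiment(project_name="gpt-neox")  # API key read from COMET_API_KEY

for step in range(1, 101):
    loss = 1.0 / step  # placeholder for the real training loss
    experiment.log_metric("train/loss", loss, step=step)

experiment.end()
```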

Lothiraldan and others added 30 commits June 11, 2024 18:58
* fix python version and pytest install

* Update NeoXArgs docs automatically

* python3

* Update NeoXArgs docs automatically

* pip not pip3

* Update NeoXArgs docs automatically

* python3 pip

* Update NeoXArgs docs automatically

* python3 -m pip

* Update NeoXArgs docs automatically

* Update NeoXArgs docs automatically

* Update NeoXArgs docs automatically

* add docker setup to workflow

* Update NeoXArgs docs automatically

* python setup

* Update NeoXArgs docs automatically

* python setup v2

* Update NeoXArgs docs automatically

* python setup v3

* python setup v3

* Update NeoXArgs docs automatically

* python setup v3

* Update NeoXArgs docs automatically

* python setup v3

* Update NeoXArgs docs automatically

* python setup v3

* Update NeoXArgs docs automatically

* python setup v3

* python setup v3

* Update NeoXArgs docs automatically

* python setup v3

* Update NeoXArgs docs automatically

* python setup v3

* Update NeoXArgs docs automatically

* python setup v3

* Update NeoXArgs docs automatically

* python setup v3

* Update NeoXArgs docs automatically

* python setup v3

* Update NeoXArgs docs automatically

* python setup v3

* Update NeoXArgs docs automatically

* python setup v3

* Update NeoXArgs docs automatically

* python setup v3

* Update NeoXArgs docs automatically

* python setup v3

* Update NeoXArgs docs automatically

* Update NeoXArgs docs automatically

* Add hash back to DeepSpeed version

* Update NeoXArgs docs automatically

---------

Co-authored-by: github-actions <[email protected]>
Co-authored-by: Quentin Anthony <[email protected]>
* Add a chat data preprocessing script

* add EOT at end of a chat

* update README.md

* apply pre-commit

---------

Co-authored-by: Quentin Anthony <[email protected]>
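On the "add EOT at end of a chat" commit above: a hedged sketch of the idea, appending an end-of-text token after each tokenized conversation so documents are cleanly separated during training (`tokenizer` and its `eod` attribute are hypothetical stand-ins for the repo's actual tokenizer object):

```python
# Hedged sketch: append an end-of-text (EOT) token after the final turn of
# each chat. `tokenizer` and its `eod` token id are hypothetical stand-ins.
def tokenize_chat(turns, tokenizer):
    tokens = []
    for turn in turns:
        tokens.extend(tokenizer.encode(turn))
    tokens.append(tokenizer.eod)  # terminate the whole conversation
    return tokens
```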
* - Add conversion of HF llama models to NeoX

* - Add conversion of HF llama models to NeoX

* - minor fix

* pre-commit

---------

Co-authored-by: Quentin Anthony <[email protected]>
…s_data_with_chat_template.py (EleutherAI#1258)

* bugfix: chat turns instead of repeating the conversation

* pre-commit
* changing from self-hosted runners to GitHub's ubuntu-22.04 runner environment

* adding warning about not using 'self-hosted' runner labels and using GitHub runners instead

* updated some guidance in comments for coverity scan CI

* moving CPU tests to workflow_dispatch only
* first draft (shape errors occurring)

* training works (but poor convergence)

* debugging progress: current commit works if we do regular TP via impl-ing AR in rowparallel as RS then AG

* Update NeoXArgs docs automatically

* push most recent code (updated mark_norms fn, back to 'real' sequence parallel)

* Update NeoXArgs docs automatically

* Fix LayerNorm all reduce gradient hook

* Sum instead of average for LayerNorm gradient all reduce

* Update NeoXArgs docs automatically

* Update NeoXArgs docs automatically

* Fix gather and reduce scatter ops on sequence dimension

* Fix sequence parallel with tied weight embeddings

* Update NeoXArgs docs automatically

* cleanup pass + add MoE arguments.py guard

* pre-commit and clean up comments

* remove vestigial debug code

* remove unused debugging code

* remove dummy test config

* update fp32_allreduce to handle fp16 ; don't cast to fp32 for gathers

* run linter on the rest of the files

* Improve performance of sequence parallel gather, scatter, and reduce

* Add comment

* Update NeoXArgs docs automatically

---------

Co-authored-by: github-actions <[email protected]>
Co-authored-by: Brandon Yang <[email protected]>
Co-authored-by: Quentin Anthony <[email protected]>
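On the LayerNorm gradient commits above ("Fix LayerNorm all reduce gradient hook", "Sum instead of average for LayerNorm gradient all reduce"): with sequence parallelism the norm parameters are replicated across tensor-parallel ranks, so their gradients need a SUM all-reduce after backward. A hedged sketch of the hook pattern; `tp_group` is a hypothetical process-group handle:

```python
import torch.distributed as dist

def mark_norm_params_for_allreduce(model, tp_group):
    # Hedged sketch: register a backward hook on every norm parameter that
    # all-reduces its gradient with SUM (not average) across the TP group.
    def make_hook(group):
        def hook(grad):
            dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=group)  # in place
            return grad
        return hook

    for name, param in model.named_parameters():
        if "norm" in name.lower() and param.requires_grad:
            param.register_hook(make_hook(tp_group))
```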
* Update README.md

I added new models that have come out trained with the GPT-NeoX library. The library itself is now sufficiently well-used that simply listing every citing paper is rapidly becoming non-viable, so I'm currently leaning towards providing a curated list of "exciting" papers; I haven't yet looked at how other libraries handle this.

* Update NeoXArgs docs automatically

---------

Co-authored-by: github-actions <[email protected]>
Co-authored-by: Quentin Anthony <[email protected]>
* mamba fixes and cleaning

* space

* revert assertion change for now

---------

Co-authored-by: Jacob Hatef <[email protected]>
…leutherAI#1240)

* - add different packing impl (Unpacked, packing until overflow)
- fix labels to also have valid/test implementations
- fix label masking in _get_batch to also include anything from get_ltor_masks_and_position_ids

* Update arguments.py to use train_label_data_paths instead of label_data_paths

* - fix precommit
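On the packing commit above: "packing until overflow" greedily concatenates tokenized samples into fixed-length buffers, starting a new one whenever the next sample would not fit. A hedged sketch, with the sample format and over-length handling as assumptions:

```python
# Hedged sketch of "packing until overflow". Each element of `samples` is a
# list of token ids; samples longer than seq_len would need truncation or
# splitting, which is omitted here.
def pack_until_overflow(samples, seq_len):
    packs, current = [], []
    for tokens in samples:
        if current and len(current) + len(tokens) > seq_len:
            packs.append(current)
            current = []
        current.extend(tokens)
    if current:
        packs.append(current)
    return packs
```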
* Update transformer.py -> Add `intermediate_size`

* add support for rwkv and mamba and add todos about swiglu

* refactor activations and mlps

* change llama config to swiglu

* fixes gelu fusion

* pre-commit run

* add assert message to mamba linear

* Update 1-3B.yml

revert accidental change

* Update 1-3B.yml

* fixes various issues

* add back swiglu check

---------

Co-authored-by: jahatef <[email protected]>
Co-authored-by: Quentin Anthony <[email protected]>
Co-authored-by: Jacob Hatef <[email protected]>
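On the activation/MLP refactor above: SwiGLU, which the llama config switches to, gates one up-projection with the SiLU of another before the down-projection. A hedged PyTorch sketch; the layer names and `intermediate_size` usage are illustrative, not taken from the NeoX code:

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUMLP(nn.Module):
    # Hedged sketch: out = W_down( silu(W_gate x) * W_up x ).
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.w_gate = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.w_up = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.w_down = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```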
* Add a chat data preprocessing script

* add EOT at end of a chat

* - add different packing impl (Unpacked, packing until overflow)
- fix labels to also have valid/test implementations
- fix label masking in _get_batch to also include anything from get_ltor_masks_and_position_ids

* update README.md

* - Add metrics to forward step to add DPO specific metrics that are useful (accuracy, etc)
- Add reference model setup for DPO
- Add pairwise dataset for positive/negative pairs
- Add DPO loss

* Update arguments.py to use train_label_data_paths instead of label_data_paths

* - Bugfixes from upstreaming....

* - add precompute logprobs...

* - Finishing up precompute logprobs...

* - update readme for DPO...

* fix varname

* Fix pipeline parallelism and incorrect neox_args name

* apply precommit

---------

Co-authored-by: Quentin Anthony <[email protected]>
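On the DPO commits above: the loss rewards a larger policy-vs-reference log-probability margin on the chosen response than on the rejected one, and the "accuracy" metric mentioned is presumably the fraction of pairs ordered correctly. Precomputing reference log-probs (as the commits above suggest) lets training run without holding the reference model in memory. A hedged sketch of the standard DPO objective, with inputs as per-example summed log-probs and scalar `beta`:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Hedged sketch of the standard DPO objective; inputs are per-example
    # summed log-probabilities over response tokens.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    accuracy = (chosen_rewards > rejected_rewards).float().mean()
    return loss, accuracy
```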
* Add TE skeleton

* Update NeoXArgs docs automatically

* added option for te version of norms

* import TERMSNorm

* add te norm options to norm arg

* add TE objects in weight decay function

* reformat

* add TERMSNorm and TELayerNorm

* Update NeoXArgs docs automatically

* - add Fused RMS Norm from apex

* - make it consistent with how layernorm looks

* Merged transformer engine and apex fused layernorm branches

* Added assertion if TE is used

* Removed unnecessary transformer-engine import

* Changed importerror text for TE

* Added requirements/requirements-transformerengine.txt

* update comments

* precommit

---------

Co-authored-by: Quentin Anthony <[email protected]>
Co-authored-by: github-actions <[email protected]>
Co-authored-by: lintangsutawika <lintang@stella-ord-0.stella-ord.tenant-eleutherai.svc.tenant.chi.local>
Co-authored-by: lintangsutawika <[email protected]>
Co-authored-by: dmahan93 <[email protected]>
Co-authored-by: aurelion-source <[email protected]>
Co-authored-by: aurelion-source <[email protected]>
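On the TransformerEngine commits above ("Added assertion if TE is used", "Changed importerror text for TE"): a hedged sketch of the guarded-import pattern, with hypothetical option names standing in for the actual neox_args values:

```python
import torch.nn as nn

def get_norm_class(norm_type):
    # Hedged sketch; "te_layernorm"/"te_rmsnorm" are hypothetical option names.
    if norm_type in ("te_layernorm", "te_rmsnorm"):
        try:
            import transformer_engine.pytorch as te
        except ImportError as e:
            raise ImportError(
                "A TransformerEngine norm was requested but transformer_engine "
                "is not installed; see "
                "requirements/requirements-transformerengine.txt"
            ) from e
        return te.LayerNorm if norm_type == "te_layernorm" else te.RMSNorm
    return nn.LayerNorm
```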
* fix the te import

* refactor get_params_for_weight_decay_optimization

* remove incorrect type hint and dead imports
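A hedged sketch of the usual shape of a `get_params_for_weight_decay_optimization` refactor: biases and norm parameters are excluded from weight decay, matching norm modules by class (which is where TE norm classes would be added) rather than by name. This is not the repo's exact implementation:

```python
import torch.nn as nn

NO_DECAY_MODULE_TYPES = (nn.LayerNorm,)  # extended with TE/RMSNorm classes when available

def get_params_for_weight_decay_optimization(model, weight_decay=0.1):
    # Hedged sketch: split parameters into decay / no-decay optimizer groups.
    decay, no_decay = [], []
    for module in model.modules():
        for name, param in module.named_parameters(recurse=False):
            if not param.requires_grad:
                continue
            if isinstance(module, NO_DECAY_MODULE_TYPES) or name.endswith("bias"):
                no_decay.append(param)
            else:
                decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
```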
* Add a chat data preprocessing script

* add EOT at end of a chat

* - add different packing impl (Unpacked, packing until overflow)
- fix labels to also have valid/test implementations
- fix label masking in _get_batch to also include anything from get_ltor_masks_and_position_ids

* update README.md

* - Add metrics to forward step to add DPO specific metrics that are useful (accuracy, etc)
- Add reference model setup for DPO
- Add pairwise dataset for positive/negative pairs
- Add DPO loss

* Update arguments.py to use train_label_data_paths instead of label_data_paths

* - Bugfixes from upstreaming....

* - add precompute logprobs...

* - Finishing up precompute logprobs...

* - update readme for DPO...

* - Add RM training

* add comment on why row-parallel for RMs

* fix var name

---------

Co-authored-by: Quentin Anthony <[email protected]>
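On the RM-training commits above: a reward model typically adds a scalar head over the last token's hidden state; per the commit above, the NeoX version makes this projection row-parallel so the scalar output stays replicated across tensor-parallel ranks. A hedged, non-parallel sketch:

```python
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    # Hedged sketch of a scalar reward head; the actual implementation uses a
    # row-parallel linear so the output is replicated across TP ranks.
    def __init__(self, hidden_size):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, hidden_states, last_token_idx):
        # hidden_states: [batch, seq, hidden]; select each sequence's last token.
        batch = torch.arange(hidden_states.size(0), device=hidden_states.device)
        last = hidden_states[batch, last_token_idx]
        return self.score(last).squeeze(-1)  # [batch] scalar rewards
```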