Releases: huggingface/nanotron
Support Mamba Architecture 🐍
How to use
See the Mamba example added by @3outeille in #83 (under examples/ in the repo).
What's Changed
- [Fix] Assert the wrong tolerance of FA2's Layer Norm kernel by @xrsrke in #81
- [DoReMi] Small refactors by @xrsrke in #95
- Add Mamba PR by @3outeille in #83
- Bump v0.4 + Quick refactors by @NouamaneTazi in #96
Full Changelog: v0.3...v0.4
Support DoReMi training 🚀
You might think that the only ways to improve pretraining performance are finding more high-quality data, increasing FLOPs, or changing the model architecture, but they are not. DoReMi shows that, given the same source of training data, a model trained with an optimal data-mixing strategy can outperform its randomly-sampled counterpart on at least 70% of domains (sometimes all of them) and on downstream evaluations, without any knowledge of the downstream evaluation tasks.
DoReMi Blog: https://crfm.stanford.edu/2023/09/14/doremi
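Under the hood, DoReMi trains a small proxy model with group distributionally robust optimization (Group DRO): at each step it upweights the domains where the proxy's loss most exceeds the reference model's loss. Here is a schematic of that update, paraphrasing the paper rather than nanotron's actual code (names and constants are illustrative):

```python
import torch

def doremi_update(domain_weights, proxy_losses, ref_losses,
                  step_size=1.0, smoothing=1e-3):
    """Schematic DoReMi step: upweight domains with positive excess loss."""
    # Excess loss: how much worse the proxy is than the reference, per domain.
    excess = torch.clamp(proxy_losses - ref_losses, min=0.0)
    # Exponentiated-gradient update on the domain weights.
    new_weights = domain_weights * torch.exp(step_size * excess)
    new_weights = new_weights / new_weights.sum()
    # Mix in a little uniform mass for smoothing, as in the paper.
    uniform = torch.full_like(new_weights, 1.0 / new_weights.numel())
    return (1 - smoothing) * new_weights + smoothing * uniform
```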
Using DoReMi in nanotron (thanks to @xrsrke):
- Step 0: Preprocess the data for each domain (an illustrative sketch follows below).
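The notes don't spell out the preprocessing step; as a purely illustrative sketch (the tokenizer, paths, and column names are invented, not nanotron's actual pipeline), the idea is to tokenize each domain's corpus separately so the sampler can later draw from per-domain datasets:

```python
# Hypothetical preprocessing sketch: tokenize each domain separately so the
# DoReMi sampler can draw from per-domain datasets. All paths/names invented.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer works here
domains = ["wikipedia", "github", "arxiv"]  # illustrative domain names

for domain in domains:
    ds = load_dataset("json", data_files=f"data/{domain}.jsonl", split="train")
    ds = ds.map(
        lambda batch: tokenizer(batch["text"]),
        batched=True,
        remove_columns=ds.column_names,
    )
    ds.save_to_disk(f"data/tokenized/{domain}")
```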
- Step 1: Train a small reference model using uniform sampling from each domain (for a given global batch size, you sample `x` examples equally from every domain; in some cases a domain has fewer samples than the others and runs out early, so you can enable automatic domain weights based on the token count, sketched below).
```
CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=4 examples/doremi/train_reference.py --config-file examples/doremi/configs/config_280m_llama.yaml
```
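For illustration, weighting domains by their share of the total token count (my reading of the "automatic domain weights based on the token count" option; the numbers and names are invented) amounts to:

```python
# Illustration: initial domain weights proportional to per-domain token counts,
# so a domain's sampling probability matches its share of the data.
token_counts = {"wikipedia": 4_000_000, "github": 10_000_000, "arxiv": 6_000_000}
total = sum(token_counts.values())
domain_weights = {domain: count / total for domain, count in token_counts.items()}
print(domain_weights)  # {'wikipedia': 0.2, 'github': 0.5, 'arxiv': 0.3}
```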
- Step 2: Use the trained reference model from step 1 to train an identical model, and use its performance to dynamically tune the domain weights during training.
```
CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=4 examples/doremi/train_doremi.py --config-file examples/doremi/configs/config_280m_llama_proxy.yaml
```
- Step 3: Nanotron saves the domain weights in the model checkpoint. Now, compute the optimal domain weights by averaging the domain weights across all training steps of the proxy run from step 2: $\bar{\alpha}=\frac{1}{T} \sum_{t=1}^T \alpha_t$.
```python
import torch

# Each entry in the checkpointed list holds the domain weights at one training
# step (this assumes the list-of-dicts layout implied by the snippet below).
domain_weights = torch.load("checkpoints/doremi/proxy-280m-llama/doremi_domain_weights_100000.pt")
total_weights = sum(d["domain_weights"] for d in domain_weights)
avg_weights = total_weights / len(domain_weights)
```
Then, set these `avg_weights` in the `doremi` section of the larger run's config.
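You could also script that config update; a hedged sketch (the `doremi` key layout below is a guess, so check the example configs for the real schema):

```python
# Hypothetical: write the averaged weights into the larger run's YAML config.
# The exact "doremi" key layout is a guess; consult the example configs.
import yaml

path = "examples/doremi/configs/config_2.8b_llama_with_tuned_weights.yaml"
with open(path) as f:
    config = yaml.safe_load(f)

config["doremi"]["domain_weights"] = avg_weights.tolist()  # avg_weights from above

with open(path, "w") as f:
    yaml.safe_dump(config, f)
```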
- Step 4: Use the optimized domain weights from step 3 to train a larger model (could be 10x to 30x larger).
```
CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=8 examples/doremi/train_reference.py --config-file examples/doremi/configs/config_2.8b_llama_with_tuned_weights.yaml
```
- Step 5: Profit 🤑
MoEs are here! 🎉
How to use nanotron's MoEs
To use nanotron's 3D-parallel implementation of MoEs, simply add a `dMoE` module to your model like so:
```python
# dMoE comes from nanotron's MoE example (see examples/moe); config,
# parallel_context, and parallel_config are the usual nanotron modeling inputs.
self.block_sparse_moe = dMoE(
    config,
    expert_parallel_group=parallel_context.expert_pg,
    tp_pg=parallel_context.tp_pg,
    parallel_config=parallel_config,
)
```
See the full example in `examples/moe/llamoe.py`.
You can control the expert parallelism degree by setting `parallelism.expert_parallel_size`; the weight parallelism degree is the same as the tensor parallelism degree.
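For intuition about what such a layer computes, here is a toy, dense top-k MoE in plain PyTorch. This is not nanotron's `dMoE` (which uses block-sparse kernels and shards experts across the expert-parallel group); it only sketches the routing math:

```python
# Toy dense top-k MoE: a router scores experts per token, and each token's
# output is a weighted sum of its top-k experts. Purely illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    def __init__(self, hidden_size: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size),
                nn.SiLU(),
                nn.Linear(4 * hidden_size, hidden_size),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, hidden). Pick each token's top-k experts by router score.
        probs = F.softmax(self.router(x), dim=-1)
        weights, indices = probs.topk(self.top_k, dim=-1)  # (tokens, k)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out
```

Roughly speaking, `dMoE` executes the same routing decisions with block-sparse matmuls, with the experts sharded across the ranks of `expert_pg`.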
What's Changed
- Make tests pass by @NouamaneTazi in #52
- Refactoring tying mechanism + small fixes by @NouamaneTazi in #62
- [Docs] Fix typos by @StandardAI in #63
- Quick fix train steps assertion by @NouamaneTazi in #66
- fix configs by @NouamaneTazi in #67
- [FP8 Training] A single forward and backward pass for a linear in FP8 by @xrsrke in #56
- Update bench script by @NouamaneTazi in #64
- Add CI/CD for unit tests by @xrsrke in #41
- Refactor `ParallelContext` and some process groups creation by @NouamaneTazi in #69
- Support Expert Parallelism by @NouamaneTazi in #72
- Add MoEs support by @NouamaneTazi in #73
- Implement pipeline parallel size-agnostic optimizer state loading by @nopperl in #71
New Contributors
- @StandardAI made their first contribution in #63
- @nopperl made their first contribution in #71
Full Changelog: v0.1...v0.2
⚙️ Initial release
Initial release of the nanotron library