Codebase for the paper "ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing"
ReLU-routed ReMoE consistently outperforms TopK-routed MoE
ReMoE is a fully differentiable mixture-of-experts (MoE) architecture with ReLU routing and adaptive L1 regularization to control sparsity. This repository builds upon Megatron-LM and provides the implementation with minimal changes to the original codebase.
Compared to conventional TopK-routed MoE, ReMoE has the following advantages:
- Fully differentiable: ReMoE is continuous and fully differentiable, which allows optimization with correct gradients.
- Dynamic expert allocation: ReMoE allocates a different number of experts to each token based on the token's importance, as determined by the ReLU activations.
- Consistently better performance: ReMoE consistently outperforms TopK-routed MoE across various model sizes, expert counts, and granularities.
- Plug-and-play: ReMoE can be easily integrated into existing MoE-based models with minimal changes to the routing logic (TopK+Softmax -> ReLU), without altering the compute flow, as sketched below.
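To make the last point concrete, the snippet below contrasts the two routing functions on a batch of router logits. This is a minimal, standalone PyTorch sketch rather than the Megatron-LM code path; the function names and the `k` parameter are illustrative.

```python
import torch
import torch.nn.functional as F

def topk_softmax_routing(logits: torch.Tensor, k: int = 2) -> torch.Tensor:
    """Conventional TopK routing: keep the k largest logits per token,
    normalize them with softmax, and zero out the rest (discontinuous in the logits)."""
    topk_vals, topk_idx = torch.topk(logits, k, dim=-1)
    gates = torch.zeros_like(logits)
    gates.scatter_(-1, topk_idx, F.softmax(topk_vals, dim=-1))
    return gates

def relu_routing(logits: torch.Tensor) -> torch.Tensor:
    """ReLU routing (ReMoE): the gate is ReLU(logits), so the number of active
    experts per token varies with the data, and the logits-to-gates mapping is continuous."""
    return F.relu(logits)

logits = torch.randn(4, 8)           # 4 tokens, 8 experts
print(topk_softmax_routing(logits))  # exactly k nonzero gates per token
print(relu_routing(logits))          # a variable number of nonzero gates per token
```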
ReMoE shares the same dependencies as Megatron-LM. You can use the NGC PyTorch container recommended by Megatron-LM. Alternatively, follow the manual installation steps below:
conda create -n remoe python=3.11
conda activate remoe
# install torch and numpy
pip install torch torchvision torchaudio numpy
# install flash-attention
pip install packaging ninja
pip install flash-attn --no-build-isolation
# install apex
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# install TransformerEngine
pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
To enable ReLU routing in the MoE layer, set the --moe-relu-routing flag when configuring the model. Experiment launch scripts for reproducing the results from the paper are available in the scripts directory; you can modify them for your own configurations and datasets in Megatron-LM's format.
For research purposes, refer to megatron/core/transformer/moe/router.py for the implementation of ReLU routing and megatron/training/training.py for the adaptive L1 regularization.
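The adaptive L1 regularization controls how many gates stay at zero by adjusting the regularization coefficient during training. Below is a simplified PyTorch sketch of this idea; it is not the code in megatron/training/training.py, and the multiplicative update rule, the names `alpha` and `target_sparsity`, and the initial coefficient are illustrative assumptions.

```python
import torch

def l1_reg_loss(gates: torch.Tensor) -> torch.Tensor:
    """L1 penalty on the ReLU router outputs, averaged over tokens."""
    return gates.abs().sum(dim=-1).mean()

def update_lambda(lmbda: float, measured_sparsity: float,
                  target_sparsity: float, alpha: float = 1.2) -> float:
    """Multiplicative adaptation of the L1 coefficient: if too many experts are
    active (sparsity below target), strengthen the penalty; otherwise relax it."""
    return lmbda * alpha if measured_sparsity < target_sparsity else lmbda / alpha

# Toy per-step usage (shapes, numbers, and the initial coefficient are illustrative):
gates = torch.relu(torch.randn(16, 8))         # router outputs for 16 tokens, 8 experts
sparsity = (gates == 0).float().mean().item()  # fraction of inactive gates
lmbda = update_lambda(1e-2, sparsity, target_sparsity=0.875)
reg_loss = lmbda * l1_reg_loss(gates)          # added to the language-modeling loss
```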
Below is a guide to reproduce the results in the paper from scratch.
- Data preprocessing: Download the Pile dataset from Hugging Face and preprocess it with:
bash data_preprocessing.sh
Some minor modifications are required, such as setting the paths to the vocabulary files.
- Training: Run the following script to pretrain ReMoE models:
# Full command:
# bash scripts/train_llama_182m_remoe.sh [gpus_per_node] [train_iters] [micro_batch_size] [num_experts] [granularity] [project_name]
bash scripts/train_llama_182m_remoe.sh
MoE and dense models can be trained in similar ways. Please refer to the scripts directory for more details.
If you find this work useful, please consider citing:
@article{wang2024remoe,
  title={ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing},
  author={Ziteng Wang and Jianfei Chen and Jun Zhu},
  journal={arXiv preprint arXiv:2412.14711},
  year={2024}
}