MoE #639

Muennighoff · 2024-06-30T21:38:10Z

Replaces #541

Notes:

I didn't find norm_after to work well, but I added it to conform with other parts of the code; I can also remove it.
I only left in the config file used for the final 5T run.
I didn't include all configurations that we ran for OLMoE (e.g. expert choice). I will probably put instructions for those in a separate olmoe repository for people who want to reproduce our runs exactly.


epwalsh · 2024-08-01T20:29:33Z

olmo/model.py

+            from megablocks.layers.moe import MoE
+        except ImportError:
+            raise ImportError(
+                "To train MoEs, run `pip install git+https://github.com/Muennighoff/megablocks.git@olmoe`"


What's different about your branch from the original source?

It includes zloss, which we use during training for better stability.

You can view the exact difference here: databricks/megablocks@main...Muennighoff:megablocks:olmoe. Besides zloss, it also has expert choice, which is currently not used, but I think we may want to try it in the future when we go multimodal.

Can you upstream this, so we don't have to depend on a private fork?

Sure, I opened a PR here: databricks/megablocks#133. If / when it gets merged, I will update the install instructions. If people don't want to use zloss, it also works with the regular megablocks; it's not a big difference.

@Muennighoff , so they decided to merge their version instead. Is our version compatible? Will the model you trained work with their implementation of zloss?
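For readers unfamiliar with the term, the following is a minimal sketch of what a router z-loss usually computes: the mean over tokens of the squared log-sum-exp of the router logits. It assumes that "zloss" in this thread refers to that common formulation; the function name and the plain-Python representation of logits are illustrative and are not taken from megablocks or from the fork.

```python
import math
from typing import Sequence


def router_z_loss(router_logits: Sequence[Sequence[float]]) -> float:
    """Mean over tokens of logsumexp(logits) ** 2.

    Assumption: this mirrors the commonly used router z-loss; it is not
    copied from the megablocks fork discussed in this thread.
    """
    total = 0.0
    for logits in router_logits:
        m = max(logits)  # subtract the max for numerical stability
        lse = m + math.log(sum(math.exp(v - m) for v in logits))
        total += lse * lse
    return total / len(router_logits)


# One token, two experts, both logits zero: logsumexp is log(2).
loss = router_z_loss([[0.0, 0.0]])
```

In training, a term like this would typically be scaled by a small weight (moe_zloss_weight in this PR) and added to the main loss, which penalizes large router logits.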

dirkgr · 2024-08-02T16:06:20Z

olmo/config.py

+    The number of experts to use in the MoE block.
+    """
+
+    moe_top_k: Optional[int] = 2


If these are Optional, what does it mean when it's None?

They're optional when no MoE is used, otherwise required. Is this not an acceptable usage of Optional[int]? I can change it.

In my opinion, when we have a config setting that is not always required we should either 1) always make it optional type, set it to None by default, and set it in every config when it is needed; or 2) don't make it optional type unless None is needed. I prefer 1 since it makes our config more readable (less irrelevant settings) and slightly more backwards compatible.

I can change it to option 1) if others agree? Note that there's other params not following this:

    embedding_size: Optional[int] = 50304
    gen1_gc_interval: Optional[int] = 1
    distributed_strategy: Optional[DistributedStrategy] = DistributedStrategy.fsdp
    fsdp: Optional[FSDPConfig] = field(default_factory=FSDPConfig)
    auxiliary_loss_multiplier: Optional[float] = 1e-4

Do you actually rely on the defaults you put in here anywhere? If not, let's go with Shane's version, and default these to None. I assume something somewhere will fail if they are not set and you need them.

> Do you actually rely on the defaults you put in here anywhere?

Yes, quite a lot, e.g. the loss weights, the use of dropless MoEs (moe_dropless), and leaving moe_interleave, moe_lbl_in_fp32, and moe_shared_expert as False.

Actually, I don't think setting them all to None is a good idea: it means that every time we add a new MoE-specific configuration parameter, all MoE configs become outdated, since every MoE-specific configuration parameter is Optional in that sense.

I can also remove the Optional from them, since they have defaults anyway. But as seen in the examples I pasted above, we already have Optional config params with default values in the codebase.

If it doesn't break everything, I'd prefer to have a special config object for MoE, which is Optional, but none of the items inside of that object are Optional. This may break backwards compatibility with the model we already released though?

Yes, it would break compatibility with the configs we released, but we can pin a commit in our released repo if people want to reuse our configs to reproduce things exactly.

Hm, that's unfortunate, but I think I prefer the MoEConfigObject. It reduces the impact on old-school dense model training.

I guess it would make the name ModelConfig a bit inaccurate, though; maybe it should inherit from ModelConfig or something.
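The "special config object" proposal above could look roughly like the sketch below: the MoE settings live in their own dataclass with no Optional fields, and the model config holds an Optional reference to it. The class name MoEConfig, the field names, and every default except top_k = 2 (which appears in the diff) are placeholders for illustration, not the names or values used in the PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MoEConfig:
    # Hypothetical grouping of MoE settings; no field is Optional.
    num_experts: int = 8        # placeholder default
    top_k: int = 2              # matches moe_top_k in the diff
    dropless: bool = True       # placeholder default
    shared_expert: bool = False
    lbl_in_fp32: bool = False
    zloss_weight: float = 0.0   # placeholder default


@dataclass
class ModelConfig:
    d_model: int = 768                 # stand-in for the existing dense settings
    moe: Optional[MoEConfig] = None    # None means a dense model


def is_moe(config: ModelConfig) -> bool:
    """A model is MoE exactly when the nested config is present."""
    return config.moe is not None


dense = ModelConfig()
sparse = ModelConfig(moe=MoEConfig(num_experts=64, top_k=8))
```

With this layout, dense configs never mention MoE settings, and adding a new MoE field with a default does not invalidate existing MoE configs. The cost noted in the thread still applies: configs written against the flat moe_* fields would no longer load as-is.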

olmo/initialization.py

dirkgr · 2024-08-02T16:08:50Z

olmo/model.py

+            from megablocks.layers.moe import MoE
+        except ImportError:
+            raise ImportError(
+                "To train MoEs, run `pip install git+https://github.com/Muennighoff/megablocks.git@olmoe`"


Can you upstream this, so we don't have to depend on a private fork?

dirkgr · 2024-08-02T16:15:53Z

olmo/model.py

+                x = self._activation_checkpoint_fn(self.ff_norm, x)  # type: ignore
+            else:
+                x = self.ff_norm(x)
+            # Activation checkpointing for the MoE FFN is not supported


Why not? If there is a technical problem with it, will it affect whole_layer activation checkpointing as well?

It fails with

torch.utils.checkpoint.CheckpointError: torch.utils.checkpoint: Unpack is being triggered for a tensor that was already unpacked once. If you are calling ctx.saved_tensors in backward, make sure to do so only once. Otherwise please open an issue with details on your use case.
2024-05-15T20:15:01.172963498Z 2024-05-15 13:15:01.171 jupiter-cs-aus-133.reviz.ai2.in:3 olmo.util:158 CRITICAL Uncaught CheckpointError: torch.utils.checkpoint: Unpack is being triggered for a tensor that was already unpacked once. If you are calling ctx.saved_tensors in backward, make sure to do so only once. Otherwise please open an issue with details on your use case.

This thesis explains why activation checkpointing is difficult for MoEs: https://dspace.mit.edu/bitstream/handle/1721.1/153897/wisdom-dwisdom-meng-eecs-2024-thesis.pdf

whole_layer is not supported with MoE, only fine_grained. I added code to raise an error if MoE is configured and the strategy is not fine_grained.

Ok, I see. Interesting. It would be fixable I think (by saving the active experts per token in the forward pass), but out of scope for this PR.

This is probably a fairly big blocker to going bigger though. For dense models, our fastest settings still use a lot of checkpointing.
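The guard described in this thread (reject MoE combined with anything other than fine_grained checkpointing) might be sketched as follows. The enum members mirror names used in the discussion, but the function name, the exception type, and the choice to allow "no checkpointing at all" are assumptions for illustration rather than the PR's actual code.

```python
from enum import Enum
from typing import Optional


class BlockType(str, Enum):
    sequential = "sequential"
    moe = "moe"


class ActivationCheckpointingStrategy(str, Enum):
    whole_layer = "whole_layer"
    fine_grained = "fine_grained"


def check_activation_checkpointing(
    block_type: BlockType,
    strategy: Optional[ActivationCheckpointingStrategy],
) -> None:
    """Raise if an MoE model is combined with an unsupported strategy.

    Assumption: strategy=None (checkpointing disabled) is treated as valid.
    """
    allowed = (None, ActivationCheckpointingStrategy.fine_grained)
    if block_type == BlockType.moe and strategy not in allowed:
        raise ValueError(
            f"Activation checkpointing strategy '{strategy.value}' is not "
            "supported for MoE blocks; use 'fine_grained' instead."
        )


# Passes silently: fine_grained is the supported combination.
check_activation_checkpointing(BlockType.moe, ActivationCheckpointingStrategy.fine_grained)
```

Failing at configuration time keeps the CheckpointError above from surfacing only once the first backward pass runs.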

olmo/model.py

olmo/train.py

scripts/train.py

Muennighoff · 2024-08-20T17:51:37Z

Linking this related PR that we should merge after: #707

If this PR here looks good to you, could you approve it @epwalsh / @dirkgr ? :)

2015aroras · 2024-08-20T19:50:40Z

olmo/config.py

+    The number of experts to use in the MoE block.
+    """
+
+    moe_top_k: Optional[int] = 2


In my opinion, when we have a config setting that is not always required we should either 1) always make it optional type, set it to None by default, and set it in every config when it is needed; or 2) don't make it optional type unless None is needed. I prefer 1 since it makes our config more readable (less irrelevant settings) and slightly more backwards compatible.

2015aroras · 2024-08-20T23:45:16Z

olmo/config.py

@@ -1273,3 +1334,41 @@ def update_legacy_settings(cls, config: D) -> D:
                new_config.optimizer = OptimizerConfig.update_legacy_settings(new_config.optimizer)

        return new_config
+
+
+def config_to_moe_args(config: ModelConfig) -> Dict[str, Any]:


I think it would be better to have this as an instance method of ModelConfig that can be invoked with something like config.build_moe_args()

I think the MoE args may include things outside of ModelConfig in the future. Currently, I put some things that could be considered TrainingConfig params, like moe_zloss_weight, in ModelConfig; if we move them to TrainingConfig later, the function would no longer depend only on ModelConfig.
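The argument for a free function can be illustrated with a small sketch: a module-level helper can accept a second config object later without changing where it lives, whereas an instance method on ModelConfig would only see the model settings. The two dataclasses, the returned dictionary keys, and the override rule below are hypothetical stand-ins, not the signature from the PR.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ModelConfig:
    moe_top_k: int = 2
    moe_zloss_weight: float = 0.0  # placeholder default


@dataclass
class TrainingConfig:
    # Hypothetical future home for loss weights; None means "not set here".
    moe_zloss_weight: Optional[float] = None


def config_to_moe_args(
    model: ModelConfig, training: Optional[TrainingConfig] = None
) -> Dict[str, Any]:
    """Collect MoE arguments, letting training-level values take precedence."""
    zloss_weight = model.moe_zloss_weight
    if training is not None and training.moe_zloss_weight is not None:
        zloss_weight = training.moe_zloss_weight
    return {"top_k": model.moe_top_k, "zloss_weight": zloss_weight}


args = config_to_moe_args(ModelConfig(), TrainingConfig(moe_zloss_weight=0.001))
```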

olmo/model.py

olmo/optim.py

configs/official/OLMoE-7B-A1B.yaml

Co-authored-by: Shane A <[email protected]>

Muennighoff · 2024-09-04T23:36:33Z

All tests are passing except the GPU test which I assume is expected to fail. Feel free to merge 😊

olmo/config.py

dirkgr · 2024-09-08T21:20:06Z

olmo/model.py

-            device=config.init_device,
-        )
-        self.ff_out._is_residual = True  # type: ignore
+        if self.config.block_type != BlockType.moe:


Can you make this dependent on whether the block has a ff_out, instead of the block type?

with hasattr(), I mean

Do you mean `if hasattr(self, "ff_out"):`? I'm not sure that will work, because the next lines create self.ff_out, so no block has it yet as far as I can tell.
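The ordering problem raised here can be shown with a toy class: a hasattr check placed before the attribute is assigned is False for every block, so it cannot decide whether to create the attribute. The class and attribute names below are illustrative only and do not reproduce the block classes in olmo/model.py.

```python
class ToyBlock:
    def __init__(self, is_moe: bool) -> None:
        # Checked before any assignment, so this is False for all blocks.
        self.had_ff_out_before_init = hasattr(self, "ff_out")
        if not is_moe:
            # Only dense blocks get an output projection in this toy example.
            self.ff_out = object()


dense_block = ToyBlock(is_moe=False)
moe_block = ToyBlock(is_moe=True)
```

A hasattr check is only informative after construction, for example in code that later marks ff_out as a residual projection on blocks that have one.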

dirkgr · 2024-09-08T21:21:42Z

olmo/model.py

+            from megablocks.layers.moe import MoE
+        except ImportError:
+            raise ImportError(
+                "To train MoEs, run `pip install git+https://github.com/Muennighoff/megablocks.git@olmoe`"


@Muennighoff , so they decided to merge their version instead. Is our version compatible? Will the model you trained work with their implementation of zloss?

olmo/model.py

dirkgr · 2024-09-08T21:31:40Z

olmo/model.py

+                x = self._activation_checkpoint_fn(self.ff_norm, x)  # type: ignore
+            else:
+                x = self.ff_norm(x)
+            # Activation checkpointing for the MoE FFN is not supported


Ok, I see. Interesting. It would be fixable I think (by saving the active experts per token in the forward pass), but out of scope for this PR.

olmo/model.py

olmo/train.py

dirkgr · 2024-09-08T21:43:14Z

olmo/train.py

                # Run backward pass.
                loss.backward()

            # Remove output hooks
            for hook in output_hooks:
                hook.remove()

-        return ce_batch_loss, z_batch_loss
+        return ce_batch_loss, z_batch_loss, lb_batch_loss, moe_z_batch_loss, expert_assignments


@epwalsh, does the new trainer support all of this stuff? This seems like a lot of extra things.

Not directly but I think it could be supported through the callback system.

olmo/train.py

dirkgr · 2024-10-03T18:09:45Z

What's going on with this PR? Can we merge?

Muennighoff · 2024-10-26T04:48:02Z

> What's going on with this PR? Can we merge?

Fixed some basics as discussed; I think we can merge!

Muennighoff added 30 commits June 19, 2024 22:13

Clean MoE implementation

e725eb9

Add conf

db24750

Fix return args

18450de

Rmv outdated kwarg

4ab7f77

Rmv legacy kwarg

dba42fd

Merge branch 'Muennighoff/MoE' of github.com:allenai/LLM into Muennighoff/MoE

6c5f8a3

Add distributed_strategy

6a8e089

Allow w/o weight attr

1a9a317

Merge branch 'Muennighoff/MoE' of github.com:allenai/LLM into Muennighoff/MoE

ddf6fd4

Allow w/o weight attr

ab55e07

Add MoE params

7aeefd4

Rmv kwarg

3eab45c

Reduce lb & moe losses

6d736da

LN & Emb Dec

d07c638

Merge branch 'Muennighoff/MoE' of github.com:allenai/LLM into Muennighoff/MoE

cdb592f

Do not decay emb

1399841

Tmp - debug throughput

a13b5b8

Fix

935167e

Fix

b96972d

maintain init order

0079490

Merge branch 'Muennighoff/MoE' of github.com:allenai/LLM into Muennighoff/MoE

8b1c441

Decay emb

e2c7286

Keep EA on CPU

d39a37c

Do not decay emb

3acfc04

Change norm

4432261

Confs

cef7707

Adapt wrap

2a6df33

Add conf

7421890

decemb conf

021974e

Updates

d5a0626

Muennighoff requested a review from dirkgr August 1, 2024 20:20

epwalsh reviewed Aug 1, 2024


dirkgr requested changes Aug 2, 2024


Muennighoff added 3 commits August 2, 2024 18:17

Fix typo; MoEArgs func

d8452a0

Format

8a28ced

Check for act ckpt strategy & moe; fix typo

91f5553

Muennighoff requested a review from dirkgr August 3, 2024 01:33

fix import

61ac104

Muennighoff added 2 commits August 20, 2024 10:52

Sort impot

f4faf8a

Merge branch 'main' into Muennighoff/MoE

fdc1021

2015aroras reviewed Aug 21, 2024


Muennighoff and others added 3 commits August 20, 2024 20:27

Fix typo

ed82181

Co-authored-by: Shane A <[email protected]>

Simplify isinstance

b0cc754

Co-authored-by: Shane A <[email protected]>

Clean conf & move constructor

ca9b41f

Muennighoff mentioned this pull request Aug 29, 2024

Add OLMoE huggingface/transformers#32406

Merged

3 tasks

Muennighoff added 4 commits September 4, 2024 16:24

Add ref

215c0f5

Merge main

43baf74

Sort imports

775e514

Format

cd0004b

dirkgr requested changes Sep 8, 2024


Muennighoff added 3 commits September 11, 2024 20:46

No exp ass

acb23dd

Revert

a143469

Simplify

1a4bdae

Muennighoff added 3 commits October 3, 2024 16:58

Rmv interleave

410064a

Merge branch 'main' into Muennighoff/MoE

671bc8e

expert_assignments on gpu

04a2da5
