I am training a ~520M model, but I have found that the MegaBlocks MoE version uses substantially more memory and takes longer to train than a dense model of corresponding size. I am using a model embedding dimension of 1536. The MoE model has 48 experts with 8 active and an expert size of 128. I set the load-balancing loss weight to 0.001.
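For reference, here is a rough per-layer FFN parameter comparison under the numbers above (d_model = 1536, 48 experts of hidden size 128, top-k = 8). The two-matrix expert MLP shape and the 4x dense expansion are assumptions for illustration, not values stated in the issue:

```python
# Back-of-the-envelope FFN parameter comparison for the configuration described above.
# Assumptions (not from the issue): each expert is a two-matrix MLP
# (d_model -> expert_hidden -> d_model), and the dense baseline uses a 4x expansion.

d_model = 1536
num_experts = 48
top_k = 8
expert_hidden = 128

# MoE layer: every expert's weights are resident in memory, regardless of top_k.
moe_params_per_expert = 2 * d_model * expert_hidden      # up- and down-projection
moe_params_total = num_experts * moe_params_per_expert   # ~18.9M resident per layer
moe_params_active = top_k * moe_params_per_expert        # ~3.1M used per token

# Dense baseline with an assumed 4x FFN expansion.
dense_hidden = 4 * d_model
dense_params = 2 * d_model * dense_hidden                # ~18.9M per layer

print(f"MoE params per layer:    {moe_params_total:,}")
print(f"MoE params active/token: {moe_params_active:,}")
print(f"Dense params per layer:  {dense_params:,}")
```

Under these assumptions the MoE layer keeps roughly the same resident parameter count as the dense FFN while only ~1/6 of it is exercised per token, and routing buffers, token permutation, and the load-balancing loss add activation overhead on top, so higher memory than the dense baseline is plausible even before any kernel-level differences.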
samuelwheeler changed the title from "MOE uses much more memory than dense model and is substantially slower" to "MOE uses more memory than dense model and is substantially slower" on Mar 3, 2025.
samuelwheeler changed the title from "MOE uses more memory than dense model and is substantially slower" to "MOE uses more memory than dense model and is slower" on Mar 3, 2025.