Mixture of Experts (MOE)

Related to Efficient FoMos. Source: huggingface and OLMoE paper.

TL;DR

MoEs replace the dense FFN in a transformer layer with multiple smaller FFNs (“experts”) + a router that picks which experts process each token.:

Are pretrained much faster vs. dense models

Have faster inference compared to a model with the same number of parameters

Require high VRAM as all experts are loaded in memory

Face many challenges in fine-tuning, but recent work with MoE instruction-tuning is promising

Mixture of Experts enable models to be pretrained with far less compute, which means you can dramatically scale up the model or dataset size with the same compute budget as a dense model. In particular, a MoE model should achieve the same quality as its dense counterpart much faster during pretraining.

Sparse vs Dense

Sparsity uses the idea of conditional computation. While in dense models all the parameters are used for all the inputs, sparsity allows us to only run some parts of the whole system.

In the context of transformers models, a MoE consists of two main elements:

Spares MoE layers are used instead of dense feed-forward network (FFN) layers. MoE layers have a certain number of “experts” (e.g. 8). $M o E (x) = \sum_{i \in Top-k (r (x))} so f t ma x (r (x))_{i} E_{i} (x)$
A gate network $(G)$ or router, that determines which tokens are sent to which expert $(E)$ $y = \sum_{i = 1}^{n} G (x)_{i} E_{i} (x)$ — just restricted to the top-k experts instead of summing over all of them. For example, in the image below, the token “More” is sent to the second expert, and the token “Parameters” is sent to the first network.

🚀 Costin Chitic

Recent Notes

Flash Attention

Mixture of Experts (MOE)

Neural ODEs

Beyond attention-based methods

FoMo Post-training and Adaption

Mixture of Experts (MOE)

Graph View

Backlinks