# Review of Sparse Expert Models in Deep Learning

by dokuDoku

*Figure: mixture of experts network architecture*

Sparse expert models are resurging as a popular architecture in deep learning, with variants including:

1. **Mixture of Experts (MoE)**
2. Switch Transformers
3. Routing Networks
4. BASE Layers
5. etc.

These share the unifying idea that each data point, or token, only uses a subset of the total parameters. During training and inference, the model routes tokens to the weights of specific expert(s), so each token interacts with only a subset of the network's parameters rather than the entire network. Because of this property, sparsity reduces training and inference costs, enabling massive models with better accuracy than their dense counterparts.

### Overview

To give a brief overview, let's go over the stated benefits and drawbacks of sparse models relative to dense ones. To start, sparse models are a generalization of dense models, since a dense model is just a sparse model with 1 expert. Sparse models let us increase the number of parameters in a model with the number of experts while keeping the FLOPs per token constant. Sparsity is a good fit when you have many GPUs to host all the additional parameters. One can also ask about the RAM and data-center-specific setup here, but the point is that for a small, lightweight system this may **not** be the best choice. Sparse models are good when *throughput* needs to be increased and *data parallelism* can be leveraged at training and inference time.

Among the weaknesses: on a per-parameter basis, sparse models look worse than dense models, since (by design) not all parameters are used at once. *However*, they do win when the number of inference parameters or FLOPs is held equal. Additionally, sparse models are more unstable to train. Work to mitigate this includes training in float32 instead of float16 and skipping batches with NaN or Inf gradients; other approaches add auxiliary objectives such as the router z-loss and entropy losses.
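To make the stabilization term concrete, here is a minimal sketch of one common form of the router z-loss (the mean squared log-sum-exp of the router logits, as popularized in ST-MoE-style training). The tensor shapes and the coefficient value are illustrative assumptions, not prescriptions from the paper under review.

```python
import torch

def router_z_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """Penalize large router logits to keep the routing softmax numerically stable.

    router_logits: [num_tokens, num_experts] pre-softmax gate values h(x).
    Returns the mean squared log-sum-exp of the logits (the z-loss).
    """
    # log-sum-exp over the expert dimension gives one value per token
    z = torch.logsumexp(router_logits, dim=-1)
    return (z ** 2).mean()

# Illustrative usage: add a small multiple of the z-loss to the task loss.
# (the coefficient here is a placeholder, not a recommended value)
logits = torch.randn(8, 4)            # 8 tokens routed over 4 experts
aux_loss = 1e-3 * router_z_loss(logits)
```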
Importantly, especially at larger scales, the domain transfer of MoE models under fine-tuning lags behind their dense counterparts. For a fixed pre-training perplexity, sparse models fine-tuned worse on reasoning-heavy tasks but better on knowledge-heavy tasks. (Note: perplexity measures how well a probability model predicts a sample; lower perplexity indicates better predictive performance.) This makes sense given that the experts sit in the MLP layers of the Transformer, which are known to encode facts.

It is also worth noting that sparse expert models are naturally more interpretable, and experts have been observed to specialize at different levels of abstraction. For instance, on an NLP task one expert specialized in words related to innovation, another in the article "a", and another in synonyms of speed. In multi-modal models, expert specialization ranges from simple textures up to concepts like wheels or door handles. Finally, these interpretability analyses are limited: they consider the tokens arriving at each expert in a narrow way, without accounting for the fact that each token has already absorbed contextual information from surrounding tokens through the attention mechanism.

### Routing

A big part of MoE and sparse expert models is the router: how tokens are assigned to experts and how many experts each token uses. Earlier works used all of the experts' outputs in a weighted sum. Subsequently, Shazeer et al. 2017 (referenced throughout this work) inserted an MoE layer between two LSTM layers, such that the output of the lower LSTM layer was sent to the MoE for computation. After this, works like GShard and Switch Transformers replaced the feed-forward layers in Transformers with MoE layers, making this *experts-as-a-layer* approach the dominant paradigm. Others are revisiting having the experts be completely independent models.

Choosing experts based on the input token entails a discrete selection, which complicates the differentiability requirements of backpropagation. As a solution, Shazeer et al. 2017 presented a top-k routing function that takes an input token $x$ and routes it to the top-$k$ of $N$ experts. The model has a trainable variable $W_r$ which computes the logits $h(x) = W_r x$. The gate value for expert $i$ is:

$$p_i(x) = \frac{e^{h(x)_i}}{\sum_{j=1}^{N} e^{h(x)_j}}$$

The output of the layer is then the linear combination of the selected experts' outputs weighted by $p_i(x)$:

$$y = \sum_{i \in T} p_i(x)\, E_i(x),$$

where $T$ is the set of selected top-$k$ experts.
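To make the top-$k$ formulation above concrete, here is a minimal PyTorch sketch of such a layer: a learned $W_r$ produces logits, a softmax gives the gate values $p_i(x)$, and the output is the gate-weighted sum over each token's selected experts. The expert architecture (a small two-layer MLP), the sizes, and the simple loop-based dispatch are illustrative simplifications; real implementations add capacity limits, load-balancing losses, and parallelism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts, bias=False)   # W_r
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [num_tokens, d_model]
        logits = self.router(x)                       # h(x) = W_r x
        probs = F.softmax(logits, dim=-1)             # p_i(x)
        topk_p, topk_idx = probs.topk(self.k, dim=-1) # gates and expert ids per token

        y = torch.zeros_like(x)
        # Dispatch each token to its selected experts and sum the gated outputs.
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    y[mask] += topk_p[mask, slot].unsqueeze(-1) * expert(x[mask])
        return y

tokens = torch.randn(16, 64)                  # 16 tokens, d_model = 64
layer = TopKMoE(d_model=64, d_hidden=256, num_experts=4, k=2)
out = layer(tokens)                           # [16, 64]
```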
This is one such formulation; there are alternatives (see Figures 6 & 7) where instead we use:

1. Top-1 Routing
2. Top-2 Routing
3. Hash Routing
4. Experts choose tokens
5. BASE Routing
6. A router learned with Reinforcement Learning

Top-1 routing with a bandit approach has been proposed; however, Shazeer et al. 2017 argued that it was necessary to route to the top-$k$ experts with $k > 1$, since having 2+ experts lets the network compare them and optimize their relative performance. Note that hash routing (3) is not learned; it simply balances tokens deterministically and evenly across experts. Most routing algorithms learn the routing decisions dynamically during training, though routing can also be frozen before training begins. Additionally, there are alternatives where each expert chooses the top-$k$ tokens it wants to process, guaranteeing that every expert receives the same number of tokens; some tokens may be dropped, but empirically models adapt to this (sketched after this paragraph). Another alternative determines globally which tokens go to each expert: this is BASE routing (5), which solves a linear assignment problem to find an optimal bipartite matching between tokens and experts.
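To illustrate the "experts choose tokens" variant, here is a minimal PyTorch sketch in which each expert selects its own highest-scoring tokens, so every expert processes the same number of tokens. The fixed `tokens_per_expert` budget and the shapes are illustrative assumptions; tokens selected by no expert are the ones that end up dropped.

```python
import torch
import torch.nn.functional as F

def expert_choice_route(router_logits: torch.Tensor, tokens_per_expert: int):
    """Expert-choice routing: each expert picks its own top tokens.

    router_logits: [num_tokens, num_experts] affinity scores h(x).
    Returns gate weights and chosen token indices, both [tokens_per_expert, num_experts].
    """
    # Token-to-expert affinities (softmax over the expert dimension).
    probs = F.softmax(router_logits, dim=-1)
    # Each expert (column) takes its top-scoring tokens (rows).
    gates, token_idx = probs.topk(tokens_per_expert, dim=0)
    return gates, token_idx

# Illustrative usage: 16 tokens, 4 experts, each expert takes 4 tokens.
logits = torch.randn(16, 4)
gates, token_idx = expert_choice_route(logits, tokens_per_expert=4)
# token_idx[:, e] are the tokens expert e will process; gates[:, e] weight its outputs.
```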
### Future Directions

Two interesting future directions are Adaptive Computation and Retrieval Methods. Adaptive Computation is the idea that different inputs may use differing amounts of computation; **heterogeneous expert layers** are a natural fit for this, and other proposals include letting the model choose how many layers to use. Currently, sparse models typically use experts of the same type and size for simplicity and efficiency. Retrieval Methods enable the model to dynamically access information beyond its current context or parameters. This overlaps with the goal of sparse models, since the multiple experts in the MLP layers of a Transformer play a similar role.