Unified Scaling Laws for Routed Language Models

These notes are from a presentation at ICML 2022 on *Unified Scaling Laws for Routed Language Models*, which shows that routed language models, such as Mixture-of-Experts models, follow scaling laws that predict performance as a function of the number of experts **E** and the size of the model **N**. Additionally, the cost of executing a routed network is independent of E, so if scaling E improves performance, it provides cost-free (at inference) scaling.

Routing enables scaling by adding layers in parallel to existing ones and using a router to decide which layer each input goes through; the parameter count is therefore proportional to the number of experts. One of the big **challenges** of routing is that the router's output, the choice of expert, is discrete, so it is **difficult** to train with **gradients**. Gradient-estimation methods and other workarounds exist, but this remains a large area of research.

The presentation addresses two questions:

1. How much better will the network be if **E** is increased?
2. Given a routed network, what is the equivalent dense model?

Recall that *Scaling Laws for Neural Language Models* showed that the loss of dense transformers obeys a power law (slide 18), with $N_c$ and $\alpha_N$ constants:

$$ L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N} $$

Taking logs gives $\log L(N) = \alpha_N \log N_c - \alpha_N \log N$, so log-loss is linear in $\log N$ and loss falls predictably as a power of N. This is what makes dense scaling so reliable: bigger models are better, in a quantifiable way.

The presentation shows that routed language models obey an analogous scaling law (slide 22):

$$ \log L(N, E) := a\, \log N + b\, \log \hat{E} + c\, \log N \log \hat{E} + d $$

with

$$ \frac{1}{\hat{E}} := \frac{1}{E - 1 + \left(\frac{1}{E_{start}} - \frac{1}{E_{max}}\right)^{-1}} + \frac{1}{E_{max}} $$

where $a, b, c, d, E_{start}, E_{max}$ are fitted coefficients. The $a \log N$ term is the usual dense scaling law, the $b \log \hat{E}$ term is its analogue in the number of experts, and the $c \log N \log \hat{E}$ term captures how the number of experts and the model size interact. **Note** that $\mathbf{\hat{E}}$ is a saturating transform of E that limits the improvement available from routing, especially as **N** increases.

The two limits of $\hat{E}$ make the saturation explicit:

- At $E = 1$, the $E - 1$ term vanishes, so $\frac{1}{\hat{E}} = \left(\frac{1}{E_{start}} - \frac{1}{E_{max}}\right) + \frac{1}{E_{max}} = \frac{1}{E_{start}}$, giving $\hat{E} = E_{start}$.
- As $E \rightarrow \infty$, the first term goes to 0, so $\frac{1}{\hat{E}} \rightarrow \frac{1}{E_{max}}$, giving $\hat{E} \rightarrow E_{max}$.

So $\hat{E}$ interpolates between $E_{start}$ and $E_{max}$, and there is an upper limit to how much adding experts can improve performance. Looking again at

$$ \log L(N, E) := a\, \log N + b\, \log \hat{E} + c\, \log N \log \hat{E} + d $$

as E increases, $\hat{E}$ saturates while N does not, so the gains available from adding experts are bounded. Importantly, the authors show that this law holds across several popular routing techniques, including S-BASE, RL-R (with REINFORCE), and Hash.
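To make the saturation behaviour concrete, here is a minimal Python sketch of the $\hat{E}$ transform and the bilinear law. The coefficient values ($a$, $b$, $c$, $d$, $E_{start}$, $E_{max}$) and the model size used below are made up for illustration; they are not the fitted values from the paper.

```python
import math

# Illustrative (made-up) coefficients -- not the paper's fitted values.
A, B, C, D = -0.08, -0.10, 0.003, 1.0
E_START, E_MAX = 1.0, 256.0

def e_hat(E):
    """Saturating transform of the expert count E."""
    inv = 1.0 / (E - 1.0 + 1.0 / (1.0 / E_START - 1.0 / E_MAX)) + 1.0 / E_MAX
    return 1.0 / inv

def log_loss(N, E):
    """Predicted log-loss under the bilinear routing scaling law."""
    le = math.log(e_hat(E))
    return A * math.log(N) + B * le + C * math.log(N) * le + D

for E in [1, 4, 16, 64, 256, 1024]:
    print(f"E={E:5d}  E_hat={e_hat(E):7.2f}  log L={log_loss(25e6, E):7.3f}")
```

Printing these values shows $\hat{E}$ rising with E but flattening towards $E_{max}$, with the predicted log-loss improving by correspondingly smaller amounts.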
The scaling law has some other useful properties (slides 43 & 44): log-linear partial derivatives, and the saturation of E already discussed.

For the log-linear partial derivatives:

$$ a(E) := \frac{\partial \log L}{\partial \log N} = a + c\, \log \hat{E} $$

$$ b(N) := \frac{\partial \log L}{\partial \log \hat{E}} = b + c\, \log N $$

In other words, if you fix either N or E and vary the other, the loss follows a power law, which keeps the routed scaling law compatible with the dense transformer scaling law.

There are a few key routing parameters that affect the performance of the network but change neither N nor E:

- K: the number of experts each data point is sent to
- R: the fraction of layers that are routed

A change of variables expresses the same scaling law in terms of F, the FLOPs required to execute a single data point, and B, the fraction of the total parameters any one data point interacts with:

$$ \log L(F, B) := a\, \log F + b\, \log \hat{B} + c\, \log F \log \hat{B} + d $$

This gives a way to relate routed models to dense models (for a dense model, B = 1).

Later in the presentation, experimental results show that as **N** increases the benefit of increasing **E** shrinks, consistent with the saturation of $\mathbf{\hat{E}}$; at the parameter counts of current models, however, scaling **E** is still useful. The authors also suggest reading the paper's appendix for more detailed results.
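One simple, approximate way to approach the second question, what dense model a routed network is equivalent to, is to solve the fitted law for the dense size whose predicted loss matches a given routed model, treating the E = 1 slice of the same law as the dense curve. The sketch below does this with the same made-up coefficients as before; it illustrates the idea and is not the paper's procedure.

```python
import math

# Same illustrative (made-up) coefficients as in the previous sketch.
A, B, C, D = -0.08, -0.10, 0.003, 1.0
E_START, E_MAX = 1.0, 256.0

def e_hat(E):
    inv = 1.0 / (E - 1.0 + 1.0 / (1.0 / E_START - 1.0 / E_MAX)) + 1.0 / E_MAX
    return 1.0 / inv

def log_loss(N, E):
    le = math.log(e_hat(E))
    return A * math.log(N) + B * le + C * math.log(N) * le + D

def dense_equivalent(N, E):
    """Dense model size whose predicted loss matches the routed model (N, E),
    using the E = 1 slice of the same law as the dense baseline."""
    le1 = math.log(e_hat(1.0))
    target = log_loss(N, E)
    # Solve (A + C * le1) * log(N_eq) + B * le1 + D = target for N_eq.
    log_n_eq = (target - B * le1 - D) / (A + C * le1)
    return math.exp(log_n_eq)

print(f"25M params, 64 experts ~ {dense_equivalent(25e6, 64):.2e} dense params")
```

The printed number is only meaningful relative to these toy coefficients; with the paper's actual fits, the mapping from (N, E) to an equivalent dense size would differ.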
**Summary:**

- Dense transformers obey simple scaling laws: bigger models are better.
- Routing offers an alternative kind of scaling that improves performance at no additional inference cost.
- A single scaling law predicts the performance of both dense and routed transformers as a function of N and E.
- The law fits the data well and has a number of useful properties.