Unified Scaling Laws for Routed Language Models

These notes are from a presentation at ICML 2022 on *Unified Scaling Laws for Routed Language Models*, which shows that routed language models, such as Mixture-of-Experts models, follow scaling laws that predict performance as a function of the number of experts **E** and the size of the model **N**. Additionally, the cost of executing a routed network is independent of E, so if scaling E improves performance, it provides cost-free (at inference) scaling.

Routing enables scaling by adding layers in parallel to existing ones and using a router to decide which layer each input goes through; the parameter count is therefore proportional to the number of experts. One of the big **challenges** of routing is that the router's output, the choice of expert, is discrete, so it is **difficult** to train with **gradients**. Gradient-estimation methods and other workarounds exist, but this remains a large area of research.

The presentation addresses two questions:

1. How much better will the network be if **E** is increased?
2. Given a routed network, what is the equivalent dense model?

Recall that *Scaling Laws for Neural Language Models* showed that the loss of dense transformers obeys a power law (slide 18), with $N_c$ and $\alpha_N$ constants:

$$ L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N} $$

Taking logs gives $\log L(N) = \alpha_N \log N_c - \alpha_N \log N$, so log-loss is linear in $\log N$ and loss falls predictably as a power of N. This is what makes dense scaling so reliable: bigger models are better, in a quantifiable way.

The presentation shows that routed language models obey an analogous scaling law (slide 22):

$$ \log L(N, E) := a\, \log N + b\, \log \hat{E} + c\, \log N \log \hat{E} + d $$

with

$$ \frac{1}{\hat{E}} := \frac{1}{E - 1 + \left(\frac{1}{E_{start}} - \frac{1}{E_{max}}\right)^{-1}} + \frac{1}{E_{max}} $$

where $a, b, c, d, E_{start}, E_{max}$ are fitted coefficients. The $a \log N$ term is the usual dense scaling law, the $b \log \hat{E}$ term is its analogue in the number of experts, and the $c \log N \log \hat{E}$ term captures how the number of experts and the model size interact. **Note** that $\mathbf{\hat{E}}$ is a saturating transform of E that limits the improvement available from routing, especially as **N** increases.

The two limits of $\hat{E}$ make the saturation explicit:

- At $E = 1$, the $E - 1$ term vanishes, so $\frac{1}{\hat{E}} = \left(\frac{1}{E_{start}} - \frac{1}{E_{max}}\right) + \frac{1}{E_{max}} = \frac{1}{E_{start}}$, giving $\hat{E} = E_{start}$.
- As $E \rightarrow \infty$, the first term goes to 0, so $\frac{1}{\hat{E}} \rightarrow \frac{1}{E_{max}}$, giving $\hat{E} \rightarrow E_{max}$.

So $\hat{E}$ interpolates between $E_{start}$ and $E_{max}$, and there is an upper limit to how much adding experts can improve performance. Looking again at

$$ \log L(N, E) := a\, \log N + b\, \log \hat{E} + c\, \log N \log \hat{E} + d $$

as E increases, $\hat{E}$ saturates while N does not, so the gains available from adding experts are bounded. Importantly, the authors show that this law holds across several popular routing techniques, including S-BASE, RL-R (with REINFORCE), and Hash.
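To make the saturation behaviour concrete, here is a minimal Python sketch of the $\hat{E}$ transform and the bilinear law. The coefficient values ($a$, $b$, $c$, $d$, $E_{start}$, $E_{max}$) and the model size used below are made up for illustration; they are not the fitted values from the paper.

```python
import math

# Illustrative (made-up) coefficients -- not the paper's fitted values.
A, B, C, D = -0.08, -0.10, 0.003, 1.0
E_START, E_MAX = 1.0, 256.0

def e_hat(E):
    """Saturating transform of the expert count E."""
    inv = 1.0 / (E - 1.0 + 1.0 / (1.0 / E_START - 1.0 / E_MAX)) + 1.0 / E_MAX
    return 1.0 / inv

def log_loss(N, E):
    """Predicted log-loss under the bilinear routing scaling law."""
    le = math.log(e_hat(E))
    return A * math.log(N) + B * le + C * math.log(N) * le + D

for E in [1, 4, 16, 64, 256, 1024]:
    print(f"E={E:5d}  E_hat={e_hat(E):7.2f}  log L={log_loss(25e6, E):7.3f}")
```

Printing these values shows $\hat{E}$ rising with E but flattening towards $E_{max}$, with the predicted log-loss improving by correspondingly smaller amounts.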
The scaling law has some other useful properties (slides 43 & 44): log-linear partial derivatives, and the saturation of E already discussed.

For the log-linear partial derivatives:

$$ a(E) := \frac{\partial \log L}{\partial \log N} = a + c\, \log \hat{E} $$

$$ b(N) := \frac{\partial \log L}{\partial \log \hat{E}} = b + c\, \log N $$

In other words, if you fix either N or E and vary the other, the loss follows a power law, which keeps the routed scaling law compatible with the dense transformer scaling law.

There are a few key routing parameters that affect the performance of the network but change neither N nor E:

- K: the number of experts each data point is sent to
- R: the fraction of layers that are routed

A change of variables expresses the same scaling law in terms of F, the FLOPs required to execute a single data point, and B, the fraction of the total parameters any one data point interacts with:

$$ \log L(F, B) := a\, \log F + b\, \log \hat{B} + c\, \log F \log \hat{B} + d $$

This gives a way to relate routed models to dense models (for a dense model, B = 1).

Later in the presentation, experimental results show that as **N** increases the benefit of increasing **E** shrinks, consistent with the saturation of $\mathbf{\hat{E}}$; at the parameter counts of current models, however, scaling **E** is still useful. The authors also suggest reading the paper's appendix for more detailed results.
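One simple, approximate way to approach the second question, what dense model a routed network is equivalent to, is to solve the fitted law for the dense size whose predicted loss matches a given routed model, treating the E = 1 slice of the same law as the dense curve. The sketch below does this with the same made-up coefficients as before; it illustrates the idea and is not the paper's procedure.

```python
import math

# Same illustrative (made-up) coefficients as in the previous sketch.
A, B, C, D = -0.08, -0.10, 0.003, 1.0
E_START, E_MAX = 1.0, 256.0

def e_hat(E):
    inv = 1.0 / (E - 1.0 + 1.0 / (1.0 / E_START - 1.0 / E_MAX)) + 1.0 / E_MAX
    return 1.0 / inv

def log_loss(N, E):
    le = math.log(e_hat(E))
    return A * math.log(N) + B * le + C * math.log(N) * le + D

def dense_equivalent(N, E):
    """Dense model size whose predicted loss matches the routed model (N, E),
    using the E = 1 slice of the same law as the dense baseline."""
    le1 = math.log(e_hat(1.0))
    target = log_loss(N, E)
    # Solve (A + C * le1) * log(N_eq) + B * le1 + D = target for N_eq.
    log_n_eq = (target - B * le1 - D) / (A + C * le1)
    return math.exp(log_n_eq)

print(f"25M params, 64 experts ~ {dense_equivalent(25e6, 64):.2e} dense params")
```

The printed number is only meaningful relative to these toy coefficients; with the paper's actual fits, the mapping from (N, E) to an equivalent dense size would differ.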
**Summary:**

- Dense transformers obey simple scaling laws: bigger models are better.
- Routing offers an alternative kind of scaling that improves performance at no additional inference cost.
- A single scaling law predicts the performance of both dense and routed transformers as a function of N and E.
- The law fits the data well and has a number of useful properties.