Adaptive Mixtures of Local Experts

by dokuDoku

Adaptive Mixtures of Local Experts is the first paper on MoE (Mixture of Experts), written by Geoffrey Hinton (and others). In this paper, they ask: instead of having one large neural network that does everything, what if we split it into $n$ smaller networks, with a gating network that chooses which expert to use? The general idea is that each expert independently learns about its own domain, and its weight changes stay localized to that network. One advantage is that different experts can learn about different parts of the state space that would be hard to learn together. Another is that there are fewer weights to update per case, so learning is faster. Finally, much of today's interest in MoE comes from lower inference *costs*, since FLOPs do not have to scale with the total number of parameters.

Apparently, before this paper, the way this was done was with the loss function found in Equation 1.1 of the paper, i.e.

$$ E^c = \lVert d^c - \sum_{i} p^c_i o^c_i \rVert^2 $$

For context, the superscript $c$ is the case, $E$ is the error, $d$ is the desired output, $p_i$ is the proportion of the contribution of expert $i$, and $o_i$ is the output of expert $i$. What this loss function means is that we take a linear combination of the outputs from each expert and compare it to the desired output. This is interesting because each dimension of each expert can separately contribute to the final result. In practice, what they found was that each expert was compensating for the other experts, so all the experts were learning jointly. Looking at the equation, you can see why: since the output is a linear combination of all the experts, the weights for all experts are updated in tandem.

Let's consider instead comparing each expert's output $o_i$ to the desired output $d$, and then taking a weighted sum of those errors, as shown in Equation 1.2:

$$ E^c = \sum_i p^c_i \lVert d^c - o^c_i \rVert^2 $$

This change means that when we optimize the network, each expert is expected to produce the entire output vector. Although there's no hard constraint for choosing only one expert here, I guess the reasoning is that if every expert has to predict the whole output and is penalized accordingly, the gradients are set up in a way that encourages the best expert to be chosen.

In the paper, they propose Equation 1.3, which takes the negative log of a weighted sum of Gaussian likelihoods rather than a weighted sum of squared errors:

$$ E^c = - \log \sum_i p^c_i e^{-\frac{1}{2} \lVert d^c - o^c_i \rVert^2} $$

I.e. this is the negative log probability of generating the desired output vector under a mixture of Gaussians model, where the gate supplies the mixing proportions and each expert's output is the mean of one Gaussian. To see why this error function works better than Equation 1.2, they state the derivatives of each in Equations 1.4 and 1.5 respectively.

Equation 1.4 (derivative of $E^c = \sum_i p^c_i \lVert d^c - o^c_i \rVert^2$):

$$ \frac{\partial E^c}{\partial o^c_i} = -2 p^c_i (d^c - o^c_i) $$

Equation 1.5 (derivative of $E^c = - \log \sum_i p^c_i e^{-\frac{1}{2} \lVert d^c - o^c_i \rVert^2}$):

$$ \frac{\partial E^c}{\partial o_i^c} = - \left[ \frac{p^c_i e^{-\frac{1}{2}\lVert d^c - o^c_i \rVert^2}}{\sum_j p^c_j e^{-\frac{1}{2}\lVert d^c - o^c_j \rVert^2}} \right] (d^c - o^c_i) $$

Equation 1.4 basically says: take whatever probability we assigned to the expert and multiply it by the difference between the desired output and that expert's output.
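To make the difference between the two gradients concrete, here's a quick numeric sketch (my own toy example, not from the paper) with two experts, a scalar target, and a uniform gate. The thing to look at is the coefficient that multiplies each expert's residual $(d^c - o^c_i)$: Equation 1.4 uses only the prior $p_i$, while Equation 1.5 uses a posterior that depends on how well each expert did relative to the others.

```python
import numpy as np

# Made-up toy numbers: two experts, a scalar target, uniform gating proportions.
# Expert 0 fits the target well, expert 1 fits it badly.
d = np.array([1.0])                       # desired output d^c
o = [np.array([0.9]), np.array([-1.0])]   # expert outputs o^c_i
p = np.array([0.5, 0.5])                  # gating proportions p^c_i

# Equation 1.4: dE/do_i = -2 * p_i * (d - o_i)
# -> the residual is weighted only by the prior p_i, the same for both experts here.
coeff_14 = 2 * p

# Equation 1.5: the residual is weighted by the posterior responsibility
#   p_i * exp(-0.5 * ||d - o_i||^2) / sum_j p_j * exp(-0.5 * ||d - o_j||^2)
lik = np.array([p[i] * np.exp(-0.5 * np.sum((d - o[i]) ** 2)) for i in range(2)])
coeff_15 = lik / lik.sum()

print("Eq 1.4 residual weights:", coeff_14)           # [1.0, 1.0] -- both experts get pulled toward d
print("Eq 1.5 residual weights:", coeff_15.round(3))  # ~[0.88, 0.12] -- the well-fitting expert takes
                                                      # most of the responsibility for the case
```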
Equation 1.4 does not encode any relative measure of how the other experts performed; they only enter implicitly through the weighting (assuming $\sum_i p_i = 1$). Equation 1.5 is in the form of the classic softmax distribution, weighting expert $i$ against the performance of all the other experts. Thus we weight the experts against each other. We can see that Equation 1.4 will adapt the best-fitting expert the *slowest* and Equation 1.5 will adapt the best-fitting expert the *fastest*. The reason is that for 1.4, since $\lVert d^c - o^c_i \rVert$ is the smallest for the best expert, that expert gets the smallest derivative signal, while the poorly fitting experts keep getting pulled toward the same target. In Equation 1.5, since $\lVert d^c - o^c_i \rVert^2$ is the smallest, $e^{-\frac{1}{2}\lVert d^c - o^c_i \rVert^2}$ is the largest, so the best-fitting expert gets the largest share of the responsibility for the case and adapts the *fastest*, while the other experts' gradients are suppressed.

This model was tested on the vowel discrimination problem, where the task is to identify which vowel was spoken. The input is derived from audio (formant measurements), not text. The expert networks are simple feed-forward networks forming a single linear decision boundary in the 2D input space: each takes the 2D formant input, computes a linear combination (weighted sum plus bias), and applies a nonlinear sigmoid activation. They used mixtures of 4 or 8 such simple experts. All models were trained with data from the first 50 speakers and tested with data from the remaining 25. They compared against a baseline of backpropagation networks, which I'm guessing means a fully connected feed-forward network with roughly the same number of parameters.

The mixture of experts model reached the error criterion significantly faster than the backpropagation networks, requiring about half as many epochs on average. Learning time for the mixture model also scaled well as the number of experts increased. This makes sense, since you're backpropagating through fewer layers, and at the time sigmoids were used without residual connections, so vanishing gradients were probably a big issue. The simple linear experts could also classify well within a specific region of the input space.

The system begins in an unbiased state, with equal proportions assigned to all experts. Once the experts' errors start to differ, the symmetry is broken and the experts begin to specialize. Interesting to note: in all simulations with 4 or 8 experts, all but 2 or 3 of them ended up with mixing proportions that were effectively 0 for all cases. This may mean that the dataset wasn't sufficiently interesting, or that the error function caused a collapse onto relatively few experts.
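To make the architecture concrete, here's a minimal sketch (my reconstruction, not the authors' code) of single-layer sigmoid experts plus a linear softmax gate, trained online on the Equation 1.3 objective with hand-derived gradients. The class name, learning rate, and initialization are my assumptions; the expert gradient is Equation 1.5 passed through the sigmoid derivative, and the gate gradient works out to prior minus posterior.

```python
import numpy as np

class MixtureOfLocalExperts:
    """Sketch of the 1991-style architecture: single-layer sigmoid experts plus a
    linear softmax gate, trained on E^c = -log sum_i p_i exp(-0.5 * ||d - o_i||^2)."""

    def __init__(self, n_experts, n_in, n_out, rng, lr=0.1):
        self.W = rng.normal(0, 0.1, (n_experts, n_out, n_in))  # expert weights
        self.b = np.zeros((n_experts, n_out))                  # expert biases
        self.V = rng.normal(0, 0.1, (n_experts, n_in))         # gate weights
        self.c = np.zeros(n_experts)                           # gate biases
        self.lr = lr

    def forward(self, x):
        o = 1.0 / (1.0 + np.exp(-(self.W @ x + self.b)))  # expert outputs, (n_experts, n_out)
        z = self.V @ x + self.c
        p = np.exp(z - z.max()); p /= p.sum()              # gating proportions
        return o, p

    def step(self, x, d):
        """One online update for a single case (x, d); returns the Eq. 1.3 loss."""
        o, p = self.forward(x)
        g = np.exp(-0.5 * np.sum((d - o) ** 2, axis=1))  # per-expert Gaussian likelihoods
        S = np.dot(p, g)
        h = p * g / S                                    # posterior responsibilities

        # Expert gradients: Eq. 1.5, -h_i * (d - o_i), times the sigmoid derivative o(1-o).
        da = -h[:, None] * (d - o) * o * (1 - o)
        self.W -= self.lr * da[:, :, None] * x[None, None, :]
        self.b -= self.lr * da

        # Gate gradient w.r.t. the softmax logits: dE/dz_k = p_k - h_k (prior minus posterior).
        dz = p - h
        self.V -= self.lr * dz[:, None] * x[None, :]
        self.c -= self.lr * dz
        return -np.log(S)
```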
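Continuing the sketch, here's a toy usage example on synthetic 2D clusters (a stand-in for the real formant data, which I don't have; the class count of 4 and the cluster layout are made up). It shows the online training loop and one way to watch whether the mixing proportions concentrate onto a few experts, as the paper reports:

```python
# Toy usage on synthetic 2D clusters (a stand-in for the formant data).
rng = np.random.default_rng(0)
centers = np.array([[-1.0, -1.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]])
X = np.vstack([c + 0.2 * rng.normal(size=(100, 2)) for c in centers])
D = np.repeat(np.eye(4), 100, axis=0)   # one-hot targets, one class per cluster

model = MixtureOfLocalExperts(n_experts=8, n_in=2, n_out=4, rng=rng, lr=0.5)
for epoch in range(50):
    order = rng.permutation(len(X))
    loss = np.mean([model.step(X[i], D[i]) for i in order])

# Average gating proportions per expert -- in the paper, most of these ended up
# effectively 0, with only 2 or 3 experts actually used.
avg_p = np.mean([model.forward(x)[1] for x in X], axis=0)
print("final mean loss:", round(float(loss), 3))
print("mean gating proportions:", np.round(avg_p, 3))
```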