Soft Merging of Experts with Adaptive Routing
by dokuDoku | mixture of experts, network architecture

Even after reading the paper, it's unclear how they actually updated the model and what the model architecture was exactly. Perhaps something was missed during the first read, but the size of the block was not stated, nor exactly how it was updated (although it was briefly described).

This paper states that you can use a *single merged expert constructed as a weighted average of all experts' parameters* so that a MoE model effectively becomes just one model. That is a big advantage at inference, since only one expert is run rather than several. Additionally, no gradient estimation or special router configuration is needed. SMEAR works by using the router's distribution over experts for a given input to compute a weighted average of the experts' parameters. The experts learned by SMEAR specialize to different inputs and share parameters across tasks.
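To make the weighted-average idea concrete, here is a minimal PyTorch sketch (my own illustration, not code from the paper); the function name `merge_expert_weights` and the toy shapes are assumptions.

```python
# Minimal sketch (not the authors' code) of the core idea: given routing
# probabilities over E experts that share an architecture, build a single
# merged expert whose parameters are the probability-weighted average of
# the experts' parameters, then run one forward pass through it.
import torch

def merge_expert_weights(expert_weights, routing_probs):
    """expert_weights: (E, d_out, d_in) stacked weights of E experts.
    routing_probs:  (E,) router softmax output for one input.
    Returns a single (d_out, d_in) merged weight matrix."""
    # sum_e p_e * W_e  -- one weighted average per parameter tensor
    return torch.einsum("e,eoi->oi", routing_probs, expert_weights)

# Toy usage: 4 experts, each a 16 -> 16 linear map.
E, d = 4, 16
experts = torch.randn(E, d, d)
probs = torch.softmax(torch.randn(E), dim=0)   # stand-in for the router output
merged = merge_expert_weights(experts, probs)  # single expert used at inference
x = torch.randn(d)
y = x @ merged.T                               # one pass through one expert, not E passes
```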
The key finding is that *averaging the parameters of models that share a common architecture can produce an aggregate model that retains the capabilities of the individual models*. The model makes a single routing choice with respect to each input example. A learned router (typically a simple neural network) computes a soft probability distribution (routing weights) over all experts for each input. These weights are used to compute a single "merged" expert as a weighted average of all experts' parameters. The input is then passed through this one merged expert (no sparsity in activation: every input effectively uses a blend of all experts, weighted adaptively).

The router is trained jointly and end-to-end with the rest of the model (including the experts) using standard gradient-based backpropagation. This is possible because SMEAR avoids discrete routing decisions entirely: there's no non-differentiable top-k selection or need for gradient estimators (like straight-through, REINFORCE, or Gumbel-softmax) that plague traditional sparse MoE training.

In SMEAR, the router is a simple linear layer (with layer normalization) followed by a softmax, producing soft probabilities over all experts for each input. These probabilities merge all expert parameters into a single weighted-average expert, through which the input passes (no discrete selection or activation of multiple experts). During training, this fully differentiable merging allows gradients to flow to all experts (weighted by the probabilities) and to the router, using only the task loss plus expert dropout to encourage specialization without collapse.
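Putting those pieces together, here is a hedged sketch of what a SMEAR-style layer could look like in PyTorch. The class name `SMEARLayer`, the FFN expert shape, and the expert-dropout rate are my assumptions, not the paper's exact configuration.

```python
# A SMEAR-style layer as described above: a router (LayerNorm -> Linear ->
# softmax) produces per-example probabilities over experts, those probabilities
# merge the experts' weights and biases into one expert per example, and the
# input goes through that single merged expert. Fully differentiable end to end.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SMEARLayer(nn.Module):
    def __init__(self, d_model, d_hidden, num_experts, expert_dropout=0.1):
        super().__init__()
        self.router_norm = nn.LayerNorm(d_model)
        self.router = nn.Linear(d_model, num_experts)
        # Stacked parameters of E single-hidden-layer FFN experts.
        self.w_in = nn.Parameter(torch.randn(num_experts, d_hidden, d_model) * 0.02)
        self.b_in = nn.Parameter(torch.zeros(num_experts, d_hidden))
        self.w_out = nn.Parameter(torch.randn(num_experts, d_model, d_hidden) * 0.02)
        self.b_out = nn.Parameter(torch.zeros(num_experts, d_model))
        self.expert_dropout = expert_dropout

    def forward(self, x):                                             # x: (batch, d_model)
        probs = F.softmax(self.router(self.router_norm(x)), dim=-1)   # (batch, E)
        if self.training and self.expert_dropout > 0:
            # Rough take on expert dropout: randomly zero some experts'
            # probabilities and renormalize, so the router can't collapse
            # onto a single expert.
            keep = (torch.rand_like(probs) > self.expert_dropout).float()
            probs = probs * keep
            probs = probs / probs.sum(dim=-1, keepdim=True).clamp_min(1e-9)

        # Per-example weighted average of every expert parameter tensor.
        w_in = torch.einsum("be,ehd->bhd", probs, self.w_in)
        b_in = torch.einsum("be,eh->bh", probs, self.b_in)
        w_out = torch.einsum("be,edh->bdh", probs, self.w_out)
        b_out = torch.einsum("be,ed->bd", probs, self.b_out)

        # One forward pass through the single merged expert per example.
        h = F.relu(torch.einsum("bhd,bd->bh", w_in, x) + b_in)
        return torch.einsum("bdh,bh->bd", w_out, h) + b_out
```

Because the merged parameters are plain weighted sums, gradients from the task loss reach both the router and every expert's parameters, which is exactly why no straight-through, REINFORCE, or Gumbel-softmax estimator is needed.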