# Dynamic Expert Specialization: Multi-Domain MoE Adaptation

*Read through paper again and update this (1/5/2026).*

Note that Figure 1 (included above) is slightly off from their description, since the experts should also be trainable modules.

### Introduction

DES-MoE addresses catastrophic forgetting with three components:

1. An adaptive router that balances
   - pretrained knowledge
   - task-specific updates via distillation
2. Real-time expert-domain correlation mapping
   - isolates domain-specific gradients
   - "domain" here means training-set origin
3. A three-phase adaptive fine-tuning schedule that progressively freezes non-specialized parameters

A key issue for MoE models is that for each new domain, a new expert subset must be determined and fine-tuned from scratch, which is inefficient and scales poorly with the number of domains. Static expert allocation fails to exploit commonalities between domains: experts tuned for one domain are not easily reusable. DES-MoE combines a learnable adaptive router with online expert-domain correlation tracking to dynamically select and sparsely update the most relevant experts on a per-batch basis.

### Adaptive Lightweight Router

The adaptive lightweight router augments the pretrained linear router with a shallow, trainable MLP trained under a dual-signal paradigm. Formally, the router is

$$ R_{\text{adapt}}(h_t) = W_2\,\mathrm{GELU}(W_1 h_t + b_1) + b_2 $$

where $W_1 \in \mathbb{R}^{4d\times d}$, $b_1 \in \mathbb{R}^{4d}$ form the hidden layer, and $W_2 \in \mathbb{R}^{|\mathcal{E}|\times 4d}$, $b_2 \in \mathbb{R}^{|\mathcal{E}|}$ project to the expert logits, with $|\mathcal{E}|$ the total number of experts. $W_2$ is initialized by copying the pretrained router's weights, and Kaiming initialization is applied to $W_1$, preserving the router's original behavior at the start of fine-tuning. Note that the original router is $R_{\text{orig}}(h_t) = W_{\text{orig}} h_t + b_{\text{orig}}$, so a hidden layer is added, presumably for more adaptability when fine-tuning.
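The router architecture above can be sketched as follows. This is a minimal numpy sketch, not the paper's implementation: the pretrained router maps $d \to |\mathcal{E}|$ while the new output layer maps $4d \to |\mathcal{E}|$, so the tiling used here to copy $W_{\text{orig}}$ into $W_2$ is a hypothetical way to reconcile the shapes.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class AdaptiveRouter:
    """Two-layer router: R_adapt(h) = W2 @ GELU(W1 @ h + b1) + b2."""

    def __init__(self, w_orig, b_orig, d, rng):
        num_experts = w_orig.shape[0]  # w_orig: (|E|, d), pretrained router
        hidden = 4 * d
        # Kaiming (He) initialization for the new hidden layer
        self.W1 = rng.normal(0.0, np.sqrt(2.0 / d), size=(hidden, d))
        self.b1 = np.zeros(hidden)
        # Copy the pretrained router weights into the output projection.
        # Tiling + rescaling is an assumption made here to match shapes;
        # the paper's exact copy scheme is not reproduced.
        self.W2 = np.tile(w_orig, (1, 4)) / 4.0  # (|E|, 4d)
        self.b2 = b_orig.copy()

    def __call__(self, h):
        # h: (d,) token representation -> (|E|,) expert logits
        return self.W2 @ gelu(self.W1 @ h + self.b1) + self.b2
```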
If we need to fine-tune more than once, then what?

Training has two complementary loss components.

1. Knowledge-distillation loss:

$$ L_{KD} = \frac{1}{T} \sum_{t=1}^T \mathrm{KL}\big(\sigma(R_{\text{orig}}(h_t)/\tau)\,\|\,\sigma(R_{\text{adapt}}(h_t)/\tau)\big) $$

   - This is the average KL divergence between the softened routing distributions of $R_{\text{orig}}$ and $R_{\text{adapt}}$, i.e. how much the adapted router has drifted from the pretrained one.

2. Task-adaptation loss:

$$ L_{\text{task}} = -\sum_{t=1}^T \log P(y_t \mid h_t, \theta_{\text{router}}) $$

   - $y_t$: the true next token
   - $h_t$: the token representation
   - $\theta_{\text{router}} = \{W_1, b_1, W_2, b_2\}$
   - This is the cross-entropy loss.

The two losses are combined with a time-dependent weighting:

$$ L_{\text{router}} = \lambda(\alpha) L_{KD} + (1-\lambda(\alpha)) L_{\text{task}} $$

Here $\alpha$ denotes the fraction of fine-tuning completed, and $\lambda(\alpha)$ decays from 1 toward 0. Early in training $\lambda \approx 1$, so the router stays close to its pretrained state via the $L_{KD}$ term, which penalizes drift from the original routing distribution. As training continues, the task-adaptation loss $L_{\text{task}}$ takes priority. The general philosophy behind this is to ensure a smooth transition (with respect to the router's training) from general knowledge to task-specific knowledge.

### Domain-Guided Expert Specialization

The Domain-Guided Expert Specialization scheme:

1. Uncovers each domain's preferred experts
2. Restricts updates to these experts
3. Preserves a pool of shared experts for cross-domain transfer

The training data is partitioned by domain (training-set origin), and it is recorded how often each expert is selected. From this we get $A_d^{(e)}$, the affinity between domain $d$ and expert $e$. A binary specialization matrix is then defined by thresholding each row of $A$ (see equation 6).
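The time-dependent weighting can be sketched as below. The paper's exact form of $\lambda(\alpha)$ is not reproduced in these notes; a cosine decay from 1 to 0 is one plausible choice and is used here as an assumption.

```python
import math

def lambda_schedule(alpha: float) -> float:
    """Weight on the distillation loss as a function of training progress.

    alpha is the fraction of fine-tuning completed (0 at the start,
    1 at the end). Cosine decay from 1 to 0 is an assumed schedule;
    the paper may use a different decay.
    """
    return 0.5 * (1.0 + math.cos(math.pi * alpha))

def router_loss(l_kd: float, l_task: float, alpha: float) -> float:
    """L_router = lambda(alpha) * L_KD + (1 - lambda(alpha)) * L_task."""
    lam = lambda_schedule(alpha)
    return lam * l_kd + (1.0 - lam) * l_task
```

Early on, `router_loss` is dominated by the distillation term; by the end of fine-tuning it is dominated by the task term, matching the smooth-transition philosophy described above.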
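The affinity bookkeeping and thresholding can be sketched as follows. The exact threshold in the paper's equation 6 is not reproduced here; as a stand-in, an expert is marked specialized for a domain when its affinity exceeds the uniform level $1/|\mathcal{E}|$.

```python
import numpy as np

def expert_affinity(counts: np.ndarray) -> np.ndarray:
    """Row-normalize selection counts into affinities A[d, e].

    counts[d, e] = number of times expert e was selected on domain-d
    tokens. Each row is normalized to sum to 1.
    """
    totals = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(totals, 1)

def specialization_matrix(affinity: np.ndarray, rel_threshold: float = 1.0) -> np.ndarray:
    """Binary matrix S[d, e] = 1 if expert e is specialized for domain d.

    Thresholds each row of A, as in the paper's equation 6; the relative
    threshold against the uniform level 1/|E| is an assumption made here.
    """
    num_experts = affinity.shape[1]
    return (affinity > rel_threshold / num_experts).astype(int)
```

Experts that clear the threshold for a domain receive that domain's gradient updates; the rest are left for the shared pool or frozen, in line with the sparse-update scheme described above.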