Mixtures of SubExperts for Large Language Continual Learning
by dokuDoku

Notes here: need to detail exactly what the paper means by Mixture of SubExperts (MoSEs), since it appears they were put in the attention mechanism, but it was not super clear. Double-check this in the paper, along with where the actual sub-experts were placed in the network and what "parameter-wise modulation" (the bullseye in the figure) means.

As seen in the figure, the big contribution appears to be using the MoSE router to route the input to one of several sub-networks in a **parallel attention mechanism**, whose output is then added to the pretrained attention output, i.e. something similar to LoRA, where the output of the low-rank matrices is added to the frozen pretrained path (sketched below). Additionally, a prompt embedding (essentially an embedding of the prompt that identifies the task) is used to select which of the sub-experts to use at test time. $L_{pull}$ maximizes the cosine similarity between the task (and associated sub-expert) embedding and the input prompt's query feature, so the task embedding trains properly. At test time, the query feature is generated, the task/sub-expert embedding that maximizes this similarity is found, and the top-k are chosen (see the second sketch below). See Algorithms 1 & 2 in the text.

MoSEs achieves near-zero or even positive backward transfer (BWT), significantly outperforming baselines like LoRA and standard MoE by isolating task-specific sub-experts and using adaptive routing to protect prior knowledge.
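To pin down my reading of the parallel branch, here is a minimal PyTorch sketch of routed sub-experts whose output is added residually to a frozen pretrained attention output. All names (`SubExpert`, `MoSEParallelAttention`, `rank`, `top_k`, the dense top-k masking) are my own placeholders for note-taking, not the paper's actual implementation.

```python
# Minimal sketch of how I read the MoSE parallel-attention branch.
# Class/variable names here are my own, not from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SubExpert(nn.Module):
    """One sub-expert, assumed here to be a small LoRA-style low-rank adapter."""

    def __init__(self, d_model: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, d_model, bias=False)
        nn.init.zeros_(self.up.weight)  # start as a no-op, like LoRA's B = 0

    def forward(self, x):
        return self.up(self.down(x))


class MoSEParallelAttention(nn.Module):
    """Frozen pretrained attention plus a routed mixture of sub-experts,
    with the sub-expert output added to the pretrained attention output."""

    def __init__(self, pretrained_attn: nn.Module, d_model: int,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.attn = pretrained_attn
        for p in self.attn.parameters():       # freeze to protect prior knowledge
            p.requires_grad_(False)
        self.experts = nn.ModuleList([SubExpert(d_model) for _ in range(num_experts)])
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):                      # x: (batch, seq, d_model)
        base = self.attn(x)                    # frozen pretrained path
        gate = F.softmax(self.router(x.mean(dim=1)), dim=-1)   # (batch, num_experts)
        topw, topi = gate.topk(self.top_k, dim=-1)
        mask = torch.zeros_like(gate).scatter(-1, topi, topw)  # keep only top-k weights
        delta = sum(mask[:, e, None, None] * expert(x)
                    for e, expert in enumerate(self.experts))
        return base + delta                    # LoRA-like additive update
```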
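And a rough sketch of how I understand $L_{pull}$ and the test-time selection: pull the input's query feature toward its task embedding during training, then at test time pick the top-k task/sub-expert embeddings by cosine similarity. Again, `pull_loss`, `select_experts`, `query_feat`, and `task_embed` are names I invented for these notes; where the query feature actually comes from (presumably the frozen backbone, Algorithm 2) is something to confirm on a re-read.

```python
# Rough sketch of the pull loss and test-time sub-expert selection as I read
# them; function and tensor names are mine, not the paper's.
import torch
import torch.nn.functional as F


def pull_loss(query_feat, task_embed, task_id):
    """Training: maximize cosine similarity between the input's query feature
    (batch, dim) and the current task's embedding, task_embed[task_id]."""
    sim = F.cosine_similarity(query_feat, task_embed[task_id].unsqueeze(0), dim=-1)
    return -sim.mean()            # minimizing this pulls the two together


@torch.no_grad()
def select_experts(query_feat, task_embed, k=1):
    """Test time: pick the k task/sub-expert embeddings most similar to the
    query feature; returns (batch, k) task indices."""
    sim = F.cosine_similarity(query_feat.unsqueeze(1),   # (batch, 1, dim)
                              task_embed.unsqueeze(0),   # (1, num_tasks, dim)
                              dim=-1)                    # (batch, num_tasks)
    return sim.topk(k, dim=-1).indices
```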