Mixture of Experts

Per-token, per-loop expert routing.

Inside the recurrent block, the FFN is a fine-grained MoE: each token is routed to a handful of specialized experts, while a few shared experts always fire. Per-token compute stays sparse even as the total expert pool, and with it the total parameter count, grows large. Pick a token below and scrub the loop depth to see how routing changes.
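
As a rough illustration of the routing described above, here is a minimal pure-Python sketch for a single token passing through the block several times. Everything in it is an assumption made for illustration rather than the model's actual implementation: the names (top_k_route, make_expert, moe_ffn), the sizes, the softmax-then-top-k gating with renormalized gates, the toy one-layer experts, and the choice to reuse one router at every loop so that routing shifts only because the token's hidden state does.

```python
import math
import random

def top_k_route(logits, k):
    """Return (indices, gates) for the k highest-scoring routed experts."""
    # Softmax over all routed experts (shifted by the max for stability).
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the k most probable experts and renormalize their gates to sum to 1.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in chosen)
    return chosen, [probs[i] / mass for i in chosen]

def make_expert(dim, rng):
    """Hypothetical stand-in for an expert FFN: one linear map plus ReLU."""
    w = [[rng.gauss(0, 0.3) for _ in range(dim)] for _ in range(dim)]
    return lambda x: [max(0.0, sum(wi * xi for wi, xi in zip(row, x))) for row in w]

def moe_ffn(x, router, routed, shared, k):
    """One MoE FFN call for a single token vector x."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router]
    chosen, gates = top_k_route(logits, k)
    out = [0.0] * len(x)
    # Shared experts always fire, regardless of the router's decision.
    for expert in shared:
        out = [o + y for o, y in zip(out, expert(x))]
    # Only the k chosen routed experts are evaluated; the rest are skipped.
    for i, g in zip(chosen, gates):
        out = [o + g * y for o, y in zip(out, routed[i](x))]
    return out, chosen

rng = random.Random(0)
DIM, N_ROUTED, N_SHARED, TOP_K, N_LOOPS = 8, 16, 2, 4, 3  # illustrative sizes
router = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N_ROUTED)]
routed = [make_expert(DIM, rng) for _ in range(N_ROUTED)]
shared = [make_expert(DIM, rng) for _ in range(N_SHARED)]

h = [rng.gauss(0, 1) for _ in range(DIM)]  # hidden state of one token
routing_per_loop = []
for loop in range(N_LOOPS):
    delta, chosen = moe_ffn(h, router, routed, shared, TOP_K)
    h = [a + b for a, b in zip(h, delta)]  # residual update feeds the next loop
    routing_per_loop.append(sorted(chosen))
    print(f"loop {loop}: routed experts {sorted(chosen)}")
```

With these sizes, each pass evaluates 4 of the 16 routed experts plus the 2 shared ones. Under the sketch's assumptions, adding more routed experts would grow the parameter count without changing how many experts run per token.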