MoDA (Experimental)

An alternative depth-aware attention + DeepSeek-MoE architecture.

MoDA (Mixture-of-Depths Attention) is a self-contained, experimental alternative model defined in open_mythos/moda.py. It is not the looped Recurrent Depth Transformer that defines the core OpenMythos brand — it does not use the Prelude / Recurrent / Coda loop, ACT halting, or the stable LTI injection. Instead it explores two orthogonal ideas: depth-aware attention with a per-layer depth KV cache, and a DeepSeek-style MoE FFN. Treat everything below as an illustrative simulation of that file.

How a MoDA block is built

Architecture overview: MoDABlock × 24
A post-norm decoder block pairs depth-aware attention with a DeepSeek MoE FFN. After the FFN, each layer projects its output through W_K^W / W_V^W and writes one K/V entry into a shared depth cache that all later layers can read. Grounded in MoDABlock / MoDAModel.

Inside one block (post-norm)

1. Input x (B, T, 2048)
2. MoDA Attention: reads depth cache 0…L-1 + causal sequence
3. + residual → RMSNorm
4. DeepSeek MoE FFN: 2 shared + top-6/64 routed
5. + residual → RMSNorm → x_out
6. W_K^W · W_V^W (RoPE on K): write K/V → depth cache

How layers stack & share the depth cache

layer 0: reads no depth entries (the cache is still empty), writes K/V0
layer 1: reads depth 0…0, writes K/V1
layer 2: reads depth 0…1, writes K/V2
layer 3: reads depth 0…2, writes K/V3
… (24 layers total)

Layer L attends jointly over its own causal sequence keys and the depth K/V written by layers 0…L-1 at the same token position — combined memory length grows to O(T·L).
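The read-then-write order described above can be sketched in a few lines of plain Python. This is only a toy model of the bookkeeping: `simulate_depth_cache` is a hypothetical helper, not a function from open_mythos/moda.py, and each cache entry is a placeholder string standing in for a K/V tensor pair.

```python
def simulate_depth_cache(n_layers: int, seq_len: int):
    """Toy model: track how many depth entries each layer reads.

    Hypothetical helper for illustration; real entries would be K/V tensors.
    """
    depth_cache = []  # one placeholder entry per layer that has already run
    reads_per_layer = []
    for layer in range(n_layers):
        # Layer `layer` reads the entries written by layers 0..layer-1.
        reads_per_layer.append(len(depth_cache))
        # After its MoE FFN, the layer writes one K/V entry to the cache.
        depth_cache.append(f"K/V{layer}")
    # Every entry covers all token positions, so storage is layers x positions.
    total_cached = len(depth_cache) * seq_len
    return reads_per_layer, total_cached


reads, total = simulate_depth_cache(n_layers=4, seq_len=7)
print(reads)  # [0, 1, 2, 3]
print(total)  # 28
```

The total of layers × positions is where the O(T·L) memory figure comes from.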

Depth-aware attention

Depth KV cache (illustrative)
Rows are layers; columns are token positions. Each layer writes a K/V entry after its MoE FFN. A query in layer L at position p reads its own causal sequence keys (row L, cols 0…p) and the depth keys cached by layers 0…L-1 at the same column.
[Interactive grid, illustrative: layers 0–5 as rows, token positions p0–p6 as columns, with the query at layer 4, position p5. Stepping the current layer grows the depth cache. Legend: query (L, p) · sequence keys (6) · depth keys (4) · cached, not read · not yet computed.]

The query attends over 6 sequence + 4 depth = 10 keys, fused by a single softmax (see below).
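To make the key count concrete, the following sketch enumerates the (layer, position) cells a query can see under the rule stated above. `visible_keys` is a hypothetical helper written for this page, not part of open_mythos/moda.py.

```python
def visible_keys(layer: int, pos: int):
    """Return the (layer, position) cells a query at (layer, pos) attends to.

    Hypothetical helper: sequence keys are the same layer at positions 0..pos,
    depth keys are layers 0..layer-1 at the same position.
    """
    sequence_keys = [(layer, p) for p in range(pos + 1)]
    depth_keys = [(l, pos) for l in range(layer)]
    return sequence_keys, depth_keys


seq_keys, depth_keys = visible_keys(layer=4, pos=5)
print(len(seq_keys), len(depth_keys), len(seq_keys) + len(depth_keys))  # 6 4 10
```

Cells in earlier layers at other positions are cached but never read by this query, which is why the depth contribution stays at L keys per query rather than T·L.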

Unified attention softmax (illustrative)
MoDA concatenates the causal sequence logits and the depth logits and normalises them with one softmax, giving a single distribution over both key sets. The comparison baseline is a plain sequence-only softmax.
$$\alpha = \mathrm{softmax}\big([\,Q K_{\text{seq}}^{\top} \;\Vert\; Q K_{\text{depth}}^{\top}\,]\, / \sqrt{d}\big)$$
One softmax over sequence keys (causal) concatenated with depth keys.
Example weights for 5 sequence keys and 4 depth keys:

Sequence region (causal): s0 2% · s1 28% · s2 12% · s3 17% · s4 6%
Depth region: d0 1% · d1 27% · d2 2% · d3 5%

Σ sequence = 0.65 · Σ depth = 0.35 · Σ total = 1.00

The whole strip sums to 1. In sequence-only mode the depth keys contribute nothing, so all probability mass is forced back onto the sequence; MoDA instead lets depth keys compete directly with the sequence keys under the shared normaliser. Scores are seeded for illustration.
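A minimal sketch of the difference between the two modes, using only the standard library. The logits below are made-up values chosen for illustration (they are not the seeded scores from the widget), and `softmax` is a local helper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


# Hypothetical, already-scaled logits for 5 sequence keys and 4 depth keys.
seq_logits = [0.1, 2.7, 1.9, 2.2, 1.2]
depth_logits = [-0.5, 2.7, 0.1, 1.0]

# Unified mode: one softmax over the concatenated logits.
unified = softmax(seq_logits + depth_logits)
seq_mass = sum(unified[: len(seq_logits)])
depth_mass = sum(unified[len(seq_logits):])

# Sequence-only mode: depth keys are absent, so the sequence takes all mass.
seq_only = softmax(seq_logits)

print(f"unified: seq={seq_mass:.2f} depth={depth_mass:.2f}")
print(f"sequence-only: seq={sum(seq_only):.2f}")
```

Because both modes share the same numerators, every sequence weight in the unified distribution is strictly smaller than its sequence-only counterpart: the depth keys take their share from the same normaliser.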

DeepSeek mixture-of-experts FFN

DeepSeek MoE gate (illustrative)
MoDA's FFN is a DeepSeek-style MoE: 2 shared experts always fire, while a gate routes each token to top-6 of 64 routed experts via a softmax over affinity logits.

Shared experts (always active): shared 0, shared 1

Routed experts (sparse top-6): 64 experts, indexed 0…63

Gate weights (selected)

expert 41: 4.6%
expert 5: 4.5%
expert 46: 4.1%
expert 45: 4.0%
expert 20: 3.9%
expert 26: 3.7%
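The routing step can be sketched as a softmax over 64 affinity logits followed by a top-6 selection. This is a standard-library approximation, not the code from open_mythos/moda.py: `route` is a hypothetical helper, the logits are random, and keeping the raw softmax scores without renormalising over the selected experts is an assumption (the listed weights sum to well under 100%, which is consistent with it).

```python
import math
import random


def route(logits, top_k=6):
    """Softmax over affinity logits, then keep the top_k experts.

    Hypothetical helper; returns (expert_index, raw_softmax_score) pairs
    sorted from highest to lowest score.
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    scores = [e / z for e in exps]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:top_k]]


rng = random.Random(0)  # seeded so the example is repeatable
affinity_logits = [rng.gauss(0.0, 1.0) for _ in range(64)]
selected = route(affinity_logits)

for expert, weight in selected:
    print(f"expert {expert}: {weight:.1%}")
```

The two shared experts are not part of this selection; they run for every token regardless of the gate.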
Expert-level balance loss (α = 0.001)
$$\mathcal{L}_{\text{bal}} = \sum_i f_i\, P_i,\quad f_i = \tfrac{N_r}{K'T}\,\#\{t \to i\},\; P_i = \tfrac{1}{T}\sum_t s_{i,t}$$

Unlike the core OpenMythos MoE (which uses an aux-loss-free per-expert routing bias), this experimental MoE keeps the gate bias off (use_bias=False) and instead adds DeepSeekMoE's explicit balance loss to discourage routing collapse. Gate weights come from the un-biased softmax scores.
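The balance loss formula can be worked through on a tiny example. The sketch below reads N_r as the number of routed experts, K' as the number of experts selected per token, T as the number of tokens, #{t → i} as the count of tokens routed to expert i, and s_{i,t} as the gate score of expert i for token t. `balance_loss` is a hypothetical helper, and multiplying the sum by α follows the DeepSeekMoE convention rather than anything shown in the formula above.

```python
def balance_loss(scores, top_k, alpha=0.001):
    """alpha * sum_i f_i * P_i for a batch of gate scores.

    scores: T rows, each with N_r softmax scores (one per routed expert).
    Hypothetical helper; the alpha scaling is an assumption.
    """
    n_tokens = len(scores)
    n_routed = len(scores[0])

    # Count how many tokens select each expert under top-k routing.
    counts = [0] * n_routed
    for row in scores:
        ranked = sorted(range(n_routed), key=lambda i: row[i], reverse=True)
        for i in ranked[:top_k]:
            counts[i] += 1

    # f_i: normalised selection frequency; equals 1 under perfect balance.
    f = [n_routed / (top_k * n_tokens) * c for c in counts]
    # P_i: mean gate score assigned to expert i.
    p = [sum(row[i] for row in scores) / n_tokens for i in range(n_routed)]
    return alpha * sum(fi * pi for fi, pi in zip(f, p))


# Four tokens, four experts, top-1 routing.
balanced_scores = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
collapsed_scores = [[0.7, 0.1, 0.1, 0.1] for _ in range(4)]

balanced = balance_loss(balanced_scores, top_k=1, alpha=1.0)
collapsed = balance_loss(collapsed_scores, top_k=1, alpha=1.0)
print(f"balanced={balanced:.2f} collapsed={collapsed:.2f}")
```

With every token sent to a different expert each f_i is 1 and the unscaled sum is 1; when all tokens collapse onto expert 0 the sum rises to 4 × 0.7 = 2.8, which is the behaviour the loss penalises.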

Implementation (authoritative)

open_mythos/moda.py · lines 671-821 · MoDAAttention

Depth-aware unified attention (experimental)

class MoDAAttention(nn.Module):
    """Mixture-of-Depths Attention — read side.

    Each query jointly attends (single softmax) to:
      * Sequence KVs at the current layer (causal GQA).
      * Depth KVs from all preceding layers at the *same* token position.

    Depth cache entries are written externally by :class:`MoDABlock` from
    the full block output X_l^out (after the MoE FFN).

    Args:
        cfg: :class:`MoDAConfig` instance.
    """

    def __init__(self, cfg: MoDAConfig) -> None:
        """Build the MoDA attention module.

        Creates four projection matrices (Q, K, V, O) sized for GQA and
        stores the attention scale and dropout rate.

        Args:
            cfg: Model configuration.  Must satisfy
                 ``n_heads_q % n_heads_kv == 0`` (GQA requirement).

        Raises:
            ValueError: If ``n_heads_q`` is not divisible by ``n_heads_kv``.
        """
        super().__init__()
        if cfg.n_heads_q % cfg.n_heads_kv != 0:
            raise ValueError(
                f"n_heads_q ({cfg.n_heads_q}) must be divisible by "
                f"n_heads_kv ({cfg.n_heads_kv}) for GQA."
            )

        self.n_heads_q = cfg.n_heads_q
        self.n_heads_kv = cfg.n_heads_kv
        self.head_dim = cfg.head_dim
        self.gqa_group = cfg.n_heads_q // cfg.n_heads_kv
        self.scale = cfg.head_dim**-0.5
        self.dropout = cfg.attn_dropout

        inner_q = cfg.n_heads_q * cfg.head_dim
        inner_kv = cfg.n_heads_kv * cfg.head_dim

        self.q_proj = nn.Linear(cfg.d_model, inner_q, bias=False)
        self.k_proj = nn.Linear(cfg.d_model, inner_kv, bias=False)
        self.v_proj = nn.Linear(cfg.d_model, inner_kv, bias=False)
        self.o_proj = nn.Linear(inner_q, cfg.d_model, bias=False)

    def _expand_kv(self, kv: torch.Tensor) -> torch.Tensor:
        """Repeat KV heads along dim 1 to match the number of query heads.

        With GQA group size G, each KV head is shared by G query heads.
        ``repeat_interleave(G, dim=1)`` produces the correct interleaved
        expansion so that query head ``h`` is paired with KV head ``h // G``.

        Args:
            kv: Key or value tensor whose dim 1 is the KV-head axis.
                Supported shapes: ``[B, Hk, T, d]`` (sequence) and
                ``[B, Hk, T, L, d]`` (depth stack).

        Returns:
            Tensor with dim 1 expanded from ``Hk`` to ``Hq = Hk × G``.
            Returns *kv* unchanged when ``gqa_group == 1``.
        """
        if self.gqa_group == 1:
            return kv
        return kv.repeat_interleave(self.gqa_group, dim=1)

    def forward(
        self,
        x: torch.Tensor,
        depth_k_cache: List[torch.Tensor],
        depth_v_cache: List[torch.Tensor],
        cos: torch.Tensor,
        sin: torch.Tensor,
    ) -> torch.Tensor:
        """Compute MoDA attention output.

        Args:
            x:             ``[B, T, D]`` input hidden states.
            depth_k_cache: ``L`` tensors each ``[B, Hk, T, d]`` — depth keys.
            depth_v_cache: Matching depth values.
            cos/sin:       RoPE tables ``[1, 1, T, d]``.

        Returns:
            ``[B, T, D]`` output hidden states.
        """
        B, T, D = x.shape
        Hq, Hk, d = self.n_heads_q, self.n_heads_kv, self.head_dim

        Q = self.q_proj(x).view(B, T, Hq, d).transpose(1, 2)
        K = self.k_proj(x).view(B, T, Hk, d).transpose(1, 2)
        V = self.v_proj(x).view(B, T, Hk, d).transpose(1, 2)

        Q = apply_rotary_emb(Q, cos, sin)
        K = apply_rotary_emb(K, cos, sin)

        K_e = self._expand_kv(K)
        V_e = self._expand_kv(V)

        L = len(depth_k_cache)

        if L == 0:
            out = F.scaled_dot_product_attention(
                Q,
                K_e,
                V_e,
                is_causal=True,
                dropout_p=self.dropout if self.training else 0.0,
                scale=self.scale,
            )
        else:
            # Sequence logits [B, Hq, T, T] with causal mask
            seq_logits = torch.matmul(Q, K_e.transpose(-2, -1)) * self.scale
            causal_mask = torch.triu(
                torch.full((T, T), float("-inf"), device=x.device, dtype=Q.dtype),
                diagonal=1,
            )
            seq_logits = seq_logits + causal_mask

            # Depth KVs: [B, Hk, L, T, d] → [B, Hk, T, L, d]
            K_depth = torch.stack(depth_k_cache, dim=2).permute(0, 1, 3, 2, 4)
            V_depth = torch.stack(depth_v_cache, dim=2).permute(0, 1, 3, 2, 4)
            K_depth_e = self._expand_kv(K_depth)
            V_depth_e = self._expand_kv(V_depth)

            # Depth logits [B, Hq, T, L]
            depth_logits = torch.einsum("bhid,bhild->bhil", Q, K_depth_e) * self.scale

            # Unified softmax over T + L positions
            combined = torch.cat([seq_logits, depth_logits], dim=-1)
            weights = F.softmax(combined, dim=-1)
            if self.training and self.dropout > 0.0:
                weights = F.dropout(weights, p=self.dropout)

            seq_contrib = torch.matmul(weights[:, :, :, :T], V_e)
            depth_contrib = torch.einsum(
                "bhil,bhild->bhid", weights[:, :, :, T:], V_depth_e
            )
            out = seq_contrib + depth_contrib

        out = out.transpose(1, 2).reshape(B, T, Hq * d)
        return self.o_proj(out)


The unified softmax lives in the L > 0 branch: sequence and depth logits are concatenated, then a single F.softmax normalises both.

All diagrams above are seeded, illustrative simulations. The code panels are the source of truth — see open_mythos/moda.py.