MoDA (Experimental)

An alternative depth-aware attention + DeepSeek-MoE architecture.


MoDA (Mixture-of-Depths Attention) is a self-contained, experimental alternative model defined in open_mythos/moda.py. It is not the looped Recurrent Depth Transformer that defines the core OpenMythos brand — it does not use the Prelude / Recurrent / Coda loop, ACT halting, or the stable LTI injection. Instead it explores two orthogonal ideas: depth-aware attention with a per-layer depth KV cache, and a DeepSeek-style MoE FFN. Treat everything below as an illustrative simulation of that file.

How a MoDA block is built

Architecture overview: MoDABlock × 24

A post-norm decoder block pairs depth-aware attention with a DeepSeek MoE FFN. After the FFN, each layer projects its output through W_K^W / W_V^W and writes one K/V entry into a shared depth cache that all later layers can read. Grounded in MoDABlock / MoDAModel.

Inside one block (post-norm)

x (B, T, 2048)
→ MoDA Attention (reads depth cache 0…L-1 + causal seq)
→ + residual → RMSNorm
→ DeepSeek MoE FFN (2 shared + top-6/64 routed)
→ + residual → RMSNorm
→ x_out
→ W_K^W · W_V^W (RoPE on K)
→ write K/V → depth cache
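The post-norm ordering (normalise after each residual add, not before the sublayer) can be sketched in plain Python. This is a minimal illustration, not code from open_mythos/moda.py: `rms_norm`, `post_norm_block`, and the toy sublayers are hypothetical stand-ins, and the learned RMSNorm gain is omitted.

```python
import math


def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain (gain omitted for brevity)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def post_norm_block(x, attention, ffn):
    """Post-norm ordering: normalise after each residual add."""
    a = attention(x)
    x = rms_norm([xi + ai for xi, ai in zip(x, a)])      # + residual -> RMSNorm
    f = ffn(x)
    x_out = rms_norm([xi + fi for xi, fi in zip(x, f)])  # + residual -> RMSNorm
    # In the real block, x_out is then projected to K/V for the depth cache.
    return x_out


# Toy stand-ins for the real sublayers (hypothetical).
x_out = post_norm_block(
    [1.0, -2.0, 3.0, 0.5],
    attention=lambda v: [0.1 * t for t in v],
    ffn=lambda v: [t * t for t in v],
)
print(x_out)
```

Because the final operation is an RMSNorm, the block output has a mean square close to 1 regardless of how large the sublayer outputs are.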

How layers stack & share the depth cache

layer 0: reads no depth entries (cache is empty)
layer 1: reads depth 0
layer 2: reads depth 0…1
layer 3: reads depth 0…2
⋮ 24 layers total

depth cache: K/V 0, K/V 1, K/V 2, K/V 3

Layer L attends jointly over its own causal sequence keys and the depth K/V written by layers 0…L-1 at the same token position. Each query therefore sees at most T + L keys, and the depth cache as a whole grows to O(T·L) entries.
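The read-before-write order can be traced with a toy loop. This is a minimal sketch in plain Python, not code from open_mythos/moda.py: `trace_depth_cache` is a hypothetical helper, and each cache entry is stood in for by the index of the layer that wrote it.

```python
def trace_depth_cache(n_layers: int, seq_len: int):
    """Trace which depth entries each layer reads before writing its own.

    Hypothetical helper: the real cache holds K/V tensors; here each
    entry is just the index of the layer that wrote it.
    """
    depth_cache = []      # one entry per layer that has finished
    reads_per_layer = []  # depth entries visible to each layer
    for layer in range(n_layers):
        # Attention reads the entries written by layers 0..layer-1.
        reads_per_layer.append(list(depth_cache))
        # After attention and the FFN, the layer writes its own entry.
        depth_cache.append(layer)
    # Each entry covers every token position, so the cache stores
    # n_layers * seq_len per-position K/V slots in total.
    total_slots = len(depth_cache) * seq_len
    return reads_per_layer, total_slots


reads, slots = trace_depth_cache(n_layers=4, seq_len=7)
for layer, visible in enumerate(reads):
    print(f"layer {layer} reads depth entries {visible}")
print("total per-position slots:", slots)
```

The number of entries a layer reads grows linearly with its index, while total storage scales with layers times sequence length.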

Depth-aware attention

Depth KV cache (illustrative)

Rows are layers; columns are token positions. Each layer writes a K/V entry after its MoE FFN. A query in layer L at position p reads its own causal sequence keys (row L, cols 0…p) and the depth keys cached by layers 0…L-1 at the same column.

[Interactive figure: depth KV cache grid. Rows are layers 0…5, columns are token positions p0…p6, and each cell l·p is the K/V entry of layer l at position p. State shown: current layer L = 4 (max 5), query position p = 5 (max 6). Legend: query (L, p); sequence keys (6); depth keys (4); cached, not read; not yet computed.]

The query attends over 6 sequence + 4 depth = 10 keys, fused by a single softmax (see below).
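The key count for a given query follows directly from the read rule. The sketch below is illustrative only; `visible_keys` is a hypothetical helper that lists grid cells as (layer, position) pairs.

```python
def visible_keys(layer: int, pos: int):
    """Return the (layer, position) cells a query at (layer, pos) attends to."""
    # Causal sequence keys: same layer, positions 0..pos inclusive.
    seq_keys = [(layer, p) for p in range(pos + 1)]
    # Depth keys: earlier layers, same token position only.
    depth_keys = [(l, pos) for l in range(layer)]
    return seq_keys, depth_keys


seq_keys, depth_keys = visible_keys(layer=4, pos=5)
print(len(seq_keys), len(depth_keys), len(seq_keys) + len(depth_keys))
# prints: 6 4 10
```

For layer 0 the depth list is empty, which corresponds to the plain causal-attention case.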

Unified attention softmax (illustrative)

MoDA concatenates the causal sequence logits and the depth logits and normalises them with one softmax, giving a single distribution over both key sets. The comparison baseline is a plain sequence-only softmax.

\alpha = \mathrm{softmax}\big([\,Q K_{\text{seq}}^{\top} \;\Vert\; Q K_{\text{depth}}^{\top}\,]\, / \sqrt{d}\big)

One softmax over sequence keys (causal) concatenated with depth keys.

[Interactive figure: attention-weight strip in unified (sequence + depth) mode, with 5 causal sequence keys on the left and 4 depth keys (layers 0…L-1) on the right. Labelled weights: 28%, 12%, 17%, 27%. Σ sequence = 0.65, Σ depth = 0.35, Σ total = 1.00.]

The whole strip sums to $1$. In sequence-only mode the depth keys contribute nothing, so all probability mass is forced back onto the sequence; MoDA instead lets depth keys compete directly with the sequence under the shared normaliser. Scores are seeded for illustration.
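The effect of the shared normaliser can be reproduced with a few lines of plain Python. This is a sketch under stated assumptions: the logits are made-up numbers (already scaled by 1/√d), not values from the model, and `softmax` is a local helper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scaled logits for one query: 5 causal sequence keys, 4 depth keys.
seq_logits = [0.2, -0.5, 1.0, 0.3, 0.8]
depth_logits = [0.1, 0.9, -0.2, 0.4]

# Unified: concatenate, then normalise once.
unified = softmax(seq_logits + depth_logits)
seq_mass = sum(unified[: len(seq_logits)])
depth_mass = sum(unified[len(seq_logits):])

# Sequence-only baseline: depth keys are simply absent.
seq_only = softmax(seq_logits)

print(f"unified: seq={seq_mass:.2f} depth={depth_mass:.2f}")
print(f"sequence-only: seq={sum(seq_only):.2f}")
```

Every sequence weight is smaller in the unified case than in the sequence-only case, because the depth logits enlarge the shared denominator.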

DeepSeek mixture-of-experts FFN

DeepSeek MoE gate (illustrative)

MoDA's FFN is a DeepSeek-style MoE: 2 shared experts always fire, while a gate routes each token to top-6 of 64 routed experts via a softmax over affinity logits.
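The routing step can be sketched as a softmax followed by a top-k selection. This is an illustrative sketch, not the gate from open_mythos/moda.py: `route_token` is a hypothetical name, the logits are seeded random numbers, and using the raw softmax scores as gate weights (no renormalisation over the selected experts) is an assumption.

```python
import math
import random


def route_token(affinity_logits, top_k=6):
    """Softmax over routed-expert logits, then keep the top_k experts.

    Returns {expert_index: gate_weight}. Weights are the raw softmax
    scores of the selected experts (assumption: no renormalisation).
    """
    m = max(affinity_logits)
    exps = [math.exp(x - m) for x in affinity_logits]
    total = sum(exps)
    scores = [e / total for e in exps]
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    return {i: scores[i] for i in chosen}


rng = random.Random(0)  # seeded, illustrative logits for 64 routed experts
logits = [rng.gauss(0.0, 1.0) for _ in range(64)]
gates = route_token(logits, top_k=6)
for expert, weight in gates.items():
    print(f"expert {expert}: {weight:.1%}")
# The 2 shared experts bypass the gate and process every token.
```

Under this assumption the selected gate weights sum to less than 1, since the remaining 58 experts keep their share of the softmax mass.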


Shared experts (always active): shared 0, shared 1

Routed experts (sparse top-6), gate weights of the selected experts:

expert 41: 4.6%
expert 5: 4.5%
expert 46: 4.1%
expert 45: 4.0%
expert 20: 3.9%
expert 26: 3.7%

Expert-level balance loss (α = 0.001)

\mathcal{L}_{\text{bal}} = \sum_i f_i\, P_i,\quad f_i = \tfrac{N_r}{K'T}\,\#\{t \to i\},\; P_i = \tfrac{1}{T}\sum_t s_{i,t}

Unlike the core OpenMythos MoE (which uses an aux-loss-free per-expert routing bias), this experimental MoE keeps the gate bias off (use_bias=False) and instead adds DeepSeekMoE's explicit balance loss to discourage routing collapse. Gate weights come from the un-biased softmax scores.
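A small worked example shows how the balance term reacts to routing collapse. This is a sketch, not the loss code from open_mythos/moda.py: `balance_loss` is a hypothetical helper, and it reads N_r as the number of routed experts, K' as the routed top-k, and T as the number of tokens, following DeepSeekMoE's definitions. The α = 0.001 in the heading is presumably the coefficient applied to this term; the sketch returns the unweighted sum.

```python
def balance_loss(scores, top_k):
    """Expert-level balance term sum_i f_i * P_i for one batch of tokens.

    scores[t][i] is the softmax gate score of token t for routed expert i.
    Assumed reading of the symbols: N_r = number of routed experts,
    K' = top_k, T = number of tokens.
    """
    n_tokens = len(scores)
    n_routed = len(scores[0])
    counts = [0] * n_routed        # number of tokens routed to expert i
    mean_score = [0.0] * n_routed  # P_i
    for row in scores:
        chosen = sorted(range(n_routed), key=lambda i: row[i], reverse=True)[:top_k]
        for i in chosen:
            counts[i] += 1
        for i, s in enumerate(row):
            mean_score[i] += s / n_tokens
    f = [n_routed / (top_k * n_tokens) * c for c in counts]
    return sum(fi * pi for fi, pi in zip(f, mean_score))


# Balanced: 4 tokens, 4 experts, each token prefers a different expert.
balanced = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1],
            [0.1, 0.1, 0.7, 0.1], [0.1, 0.1, 0.1, 0.7]]
# Collapsed: every token prefers expert 0.
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4

print(balance_loss(balanced, top_k=1))   # about 1.0
print(balance_loss(collapsed, top_k=1))  # about 2.8
```

With perfectly even routing every f_i equals 1 and the term evaluates to about 1; concentrating tokens on one expert raises it, which is what the loss penalises.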

Implementation (authoritative)

open_mythos/moda.py · lines 671-821 · MoDAAttention

Depth-aware unified attention (experimental)

python

class MoDAAttention(nn.Module):
    """Mixture-of-Depths Attention — read side.

    Each query jointly attends (single softmax) to:
      * Sequence KVs at the current layer (causal GQA).
      * Depth KVs from all preceding layers at the *same* token position.

    Depth cache entries are written externally by :class:`MoDABlock` from
    the full block output X_l^out (after the MoE FFN).

    Args:
        cfg: :class:`MoDAConfig` instance.
    """

    def __init__(self, cfg: MoDAConfig) -> None:
        """Build the MoDA attention module.

        Creates four projection matrices (Q, K, V, O) sized for GQA and
        stores the attention scale and dropout rate.

        Args:
            cfg: Model configuration.  Must satisfy
                 ``n_heads_q % n_heads_kv == 0`` (GQA requirement).

        Raises:
            ValueError: If ``n_heads_q`` is not divisible by ``n_heads_kv``.
        """
        super().__init__()
        if cfg.n_heads_q % cfg.n_heads_kv != 0:
            raise ValueError(
                f"n_heads_q ({cfg.n_heads_q}) must be divisible by "
                f"n_heads_kv ({cfg.n_heads_kv}) for GQA."
            )

        self.n_heads_q = cfg.n_heads_q
        self.n_heads_kv = cfg.n_heads_kv
        self.head_dim = cfg.head_dim
        self.gqa_group = cfg.n_heads_q // cfg.n_heads_kv
        self.scale = cfg.head_dim**-0.5
        self.dropout = cfg.attn_dropout

        inner_q = cfg.n_heads_q * cfg.head_dim
        inner_kv = cfg.n_heads_kv * cfg.head_dim

        self.q_proj = nn.Linear(cfg.d_model, inner_q, bias=False)
        self.k_proj = nn.Linear(cfg.d_model, inner_kv, bias=False)
        self.v_proj = nn.Linear(cfg.d_model, inner_kv, bias=False)
        self.o_proj = nn.Linear(inner_q, cfg.d_model, bias=False)

    def _expand_kv(self, kv: torch.Tensor) -> torch.Tensor:
        """Repeat KV heads along dim 1 to match the number of query heads.

        With GQA group size G, each KV head is shared by G query heads.
        ``repeat_interleave(G, dim=1)`` produces the correct interleaved
        expansion so that query head ``h`` is paired with KV head ``h // G``.

        Args:
            kv: Key or value tensor whose dim 1 is the KV-head axis.
                Supported shapes: ``[B, Hk, T, d]`` (sequence) and
                ``[B, Hk, T, L, d]`` (depth stack).

        Returns:
            Tensor with dim 1 expanded from ``Hk`` to ``Hq = Hk × G``.
            Returns *kv* unchanged when ``gqa_group == 1``.
        """
        if self.gqa_group == 1:
            return kv
        return kv.repeat_interleave(self.gqa_group, dim=1)

    def forward(
        self,
        x: torch.Tensor,
        depth_k_cache: List[torch.Tensor],
        depth_v_cache: List[torch.Tensor],
        cos: torch.Tensor,
        sin: torch.Tensor,
    ) -> torch.Tensor:
        """Compute MoDA attention output.

        Args:
            x:             ``[B, T, D]`` input hidden states.
            depth_k_cache: ``L`` tensors each ``[B, Hk, T, d]`` — depth keys.
            depth_v_cache: Matching depth values.
            cos/sin:       RoPE tables ``[1, 1, T, d]``.

        Returns:
            ``[B, T, D]`` output hidden states.
        """
        B, T, D = x.shape
        Hq, Hk, d = self.n_heads_q, self.n_heads_kv, self.head_dim

        Q = self.q_proj(x).view(B, T, Hq, d).transpose(1, 2)
        K = self.k_proj(x).view(B, T, Hk, d).transpose(1, 2)
        V = self.v_proj(x).view(B, T, Hk, d).transpose(1, 2)

        Q = apply_rotary_emb(Q, cos, sin)
        K = apply_rotary_emb(K, cos, sin)

        K_e = self._expand_kv(K)
        V_e = self._expand_kv(V)

        L = len(depth_k_cache)

        if L == 0:
            out = F.scaled_dot_product_attention(
                Q,
                K_e,
                V_e,
                is_causal=True,
                dropout_p=self.dropout if self.training else 0.0,
                scale=self.scale,
            )
        else:
            # Sequence logits [B, Hq, T, T] with causal mask
            seq_logits = torch.matmul(Q, K_e.transpose(-2, -1)) * self.scale
            causal_mask = torch.triu(
                torch.full((T, T), float("-inf"), device=x.device, dtype=Q.dtype),
                diagonal=1,
            )
            seq_logits = seq_logits + causal_mask

            # Depth KVs: [B, Hk, L, T, d] → [B, Hk, T, L, d]
            K_depth = torch.stack(depth_k_cache, dim=2).permute(0, 1, 3, 2, 4)
            V_depth = torch.stack(depth_v_cache, dim=2).permute(0, 1, 3, 2, 4)
            K_depth_e = self._expand_kv(K_depth)
            V_depth_e = self._expand_kv(V_depth)

            # Depth logits [B, Hq, T, L]
            depth_logits = torch.einsum("bhid,bhild->bhil", Q, K_depth_e) * self.scale

            # Unified softmax over T + L positions
            combined = torch.cat([seq_logits, depth_logits], dim=-1)
            weights = F.softmax(combined, dim=-1)
            if self.training and self.dropout > 0.0:
                weights = F.dropout(weights, p=self.dropout)

            seq_contrib = torch.matmul(weights[:, :, :, :T], V_e)
            depth_contrib = torch.einsum(
                "bhil,bhild->bhid", weights[:, :, :, T:], V_depth_e
            )
            out = seq_contrib + depth_contrib

        out = out.transpose(1, 2).reshape(B, T, Hq * d)
        return self.o_proj(out)


# ---------------------------------------------------------------------------
# MoDA Transformer Block
# ---------------------------------------------------------------------------

The unified softmax lives in the L > 0 branch: sequence and depth logits are concatenated, then a single F.softmax normalises both.
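Separately from the softmax, the GQA pairing described in the `_expand_kv` docstring (query head h reads KV head h // G) can be checked with a pure-Python analogue of `repeat_interleave` on the head axis. `expand_kv_heads` and the string labels are hypothetical stand-ins for the real tensors.

```python
def expand_kv_heads(kv_heads, group_size):
    """Pure-Python analogue of repeat_interleave(group_size) on the head axis."""
    return [head for head in kv_heads for _ in range(group_size)]


kv_heads = ["kv0", "kv1"]                # Hk = 2
expanded = expand_kv_heads(kv_heads, 3)  # G = 3, so Hq = 6
print(expanded)
# prints: ['kv0', 'kv0', 'kv0', 'kv1', 'kv1', 'kv1']

# Query head h reads expanded[h], which is KV head h // G.
for h, head in enumerate(expanded):
    assert head == kv_heads[h // 3]
```

An interleaved repeat keeps each group of query heads contiguous; a tiled repeat (kv0, kv1, kv0, kv1, …) would pair query heads with the wrong KV heads.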

All diagrams above are seeded, illustrative simulations. The code panels are the source of truth — see open_mythos/moda.py.