Depth Extrapolation

More loops at inference — and the overthinking zone.

A looped transformer trained with N loops can run N+k loops at inference to solve harder problems, mirroring inference-time scaling of chain-of-thought. The gains are real but saturating, and beyond a point excess recurrence degrades the answer. A fixed-depth transformer simply can't add depth this way.
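To make the mechanism concrete, here is a minimal sketch of why loop count is a free parameter at inference. It assumes a toy weight-tied block (a residual tanh layer standing in for a full transformer block); the names `make_block`, `looped_forward`, and `n_loops` are hypothetical and not taken from any particular library or paper. It illustrates only the mechanics of running N+k loops with the same weights, not the accuracy gains or the degradation described above.

```python
import numpy as np

def make_block(d_model, seed=0):
    # Hypothetical shared parameters: one set of weights reused by every loop.
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(d_model, d_model))
    b = np.zeros(d_model)
    return W, b

def looped_forward(x, params, n_loops):
    # Apply the same block n_loops times. Because the weights are tied,
    # n_loops can be changed at inference without adding parameters.
    W, b = params
    h = x
    for _ in range(n_loops):
        h = h + np.tanh(h @ W + b)  # toy residual update, not real attention
    return h

params = make_block(d_model=8)
x = np.ones((4, 8))                                # 4 tokens, width 8

h_train = looped_forward(x, params, n_loops=4)     # N = 4, the "trained" depth
h_extra = looped_forward(x, params, n_loops=6)     # N + k with k = 2
```

Running N+k loops is the same as running N loops and then feeding the result through k more, which is what a fixed stack of distinct layers cannot do. In practice one would likely sweep k on held-out data and stop before the point where extra recurrence starts to hurt.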