OpenMythos: Rebuilding Claude Mythos with Recursive Transformers and MoE
OpenMythos is an open-source PyTorch project that attempts to reconstruct the architecture of Anthropic's Claude Mythos as a recurrent Transformer with Mixture-of-Experts (MoE) routing. It introduces the Recursive Depth Transformer, uses Multi-Latent Attention and several stability mechanisms, and argues for parameter-efficient scaling with reference to empirical studies.
Anthropic announced Claude Mythos, a powerful but unreleased large model. A 22-year-old developer then attempted to reverse-engineer its architecture and released OpenMythos, an open-source PyTorch implementation built from first principles.
The architecture is a recurrent Transformer with Mixture-of-Experts (MoE) routing: it combines weight sharing across iterations with conditional computation across experts to achieve iterative depth.
The author hypothesizes that recursively applying a fixed‑parameter block together with sparse expert activation can improve the efficiency‑performance trade‑off and give rise to multi‑step reasoning. This leads to the definition of a Recursive Depth Transformer (RDT), a class of recurrent Transformers where a fixed weight set is applied across T cycles in a single forward pass.
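The weight-sharing idea can be sketched in a few lines of PyTorch. This is a minimal illustration, not the OpenMythos code: the class name `RecurrentDepthSketch`, its dimensions, and the use of `nn.TransformerEncoderLayer` as a stand-in block (no causal mask, no MoE) are all assumptions made for the example.

```python
from typing import Optional

import torch
import torch.nn as nn


class RecurrentDepthSketch(nn.Module):
    """Applies ONE shared Transformer block several times (hypothetical sketch)."""

    def __init__(self, d_model: int = 64, n_heads: int = 4, max_loops: int = 4):
        super().__init__()
        # A single set of weights, reused at every loop iteration.
        self.block = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=n_heads,
            dim_feedforward=4 * d_model,
            dropout=0.0,
            batch_first=True,
        )
        self.max_loops = max_loops

    def forward(self, h: torch.Tensor, n_loops: Optional[int] = None) -> torch.Tensor:
        n_loops = self.max_loops if n_loops is None else n_loops
        for _ in range(n_loops):
            h = self.block(h)  # same parameters, one more cycle of computation
        return h


torch.manual_seed(0)
model = RecurrentDepthSketch().eval()
x = torch.randn(2, 10, 64)  # (batch, sequence, d_model)
with torch.no_grad():
    shallow = model(x, n_loops=1)
    deep = model(x, n_loops=4)

# The parameter count is fixed; only the amount of computation changes.
n_params = sum(p.numel() for p in model.parameters())
print(shallow.shape, deep.shape, n_params)
```

Changing `n_loops` alters how much computation is spent on the same hidden state without touching the parameter count, which is the property the RDT definition relies on.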
The iterative computation happens entirely in a continuous latent space, with no intermediate token outputs, which distinguishes it from Chain-of-Thought approaches. This style of latent reasoning has been formally analyzed by Saunshi et al. (2025) and explored in COCONUT (2024).
The recurrent block runs a shared TransformerBlock for up to T = 16 iterations. Each step re-injects the frozen input encoding e through a stable linear time-invariant (LTI) update rule. The block’s feed-forward network is an MoE layer that follows DeepSeekMoE’s design: many fine-grained routed experts, of which each token activates a sparse top-K subset, plus a few always-active shared experts.
Crucially, because the hidden state changes from one iteration to the next, the router can select a different expert subset at each depth, so each iteration can perform a distinct computation. MoE supplies breadth across domains, while recurrence supplies depth of reasoning.
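A DeepSeekMoE-style feed-forward layer of this kind might look roughly like the sketch below. The class name `MoEFeedForwardSketch`, the expert sizes, and the softmax-then-top-K gating without renormalization are assumptions for illustration. The loop over experts is written for clarity rather than speed, and load-balancing losses are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForwardSketch(nn.Module):
    """Top-K routed experts plus always-active shared experts (hypothetical sketch)."""

    def __init__(self, d_model=32, d_hidden=64, n_routed=8, n_shared=2, top_k=2):
        super().__init__()

        def make_expert() -> nn.Module:
            return nn.Sequential(
                nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
            )

        self.routed = nn.ModuleList([make_expert() for _ in range(n_routed)])
        self.shared = nn.ModuleList([make_expert() for _ in range(n_shared)])
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor):
        # x has shape (tokens, d_model).
        scores = F.softmax(self.router(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)  # both (tokens, top_k)

        # Shared experts process every token unconditionally.
        out = sum(expert(x) for expert in self.shared)

        # Each routed expert only processes the tokens that selected it.
        for k, expert in enumerate(self.routed):
            token_ids, slot = (idx == k).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            gate = weights[token_ids, slot].unsqueeze(-1)
            out = out.index_add(0, token_ids, gate * expert(x[token_ids]))
        return out, idx


torch.manual_seed(0)
moe = MoEFeedForwardSketch()
x = torch.randn(5, 32)
y, chosen = moe(x)
print(y.shape, chosen.shape)
```

In a recurrent setting the same layer is called once per iteration, and `chosen` is recomputed from the updated hidden state each time.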
The full architecture is Prelude → Recurrent Block → Coda. The Prelude and Coda are standard Transformer layers executed once; the recurrent block is the computational core. Attention defaults to Multi-Latent Attention (DeepSeek-V2), which compresses keys and values into low-rank latent variables, reducing KV memory by 10-20× at production scale.
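To see where a reduction of that order can come from, the per-token cache sizes can be compared directly. The configuration below (32 heads of dimension 128, a 512-dimensional latent, and a 64-dimensional decoupled RoPE key) is a hypothetical example chosen for illustration, not a setting taken from OpenMythos.

```python
def kv_values_per_token_mha(n_heads: int, head_dim: int) -> int:
    """Standard attention caches one key and one value vector per head."""
    return 2 * n_heads * head_dim


def kv_values_per_token_mla(latent_dim: int, rope_dim: int) -> int:
    """MLA caches one shared compressed latent plus a small positional key."""
    return latent_dim + rope_dim


# Hypothetical configuration, for illustration only.
n_heads, head_dim = 32, 128
latent_dim, rope_dim = 512, 64

mha = kv_values_per_token_mha(n_heads, head_dim)
mla = kv_values_per_token_mla(latent_dim, rope_dim)
ratio = mha / mla
print(f"MHA: {mha} values/token/layer, MLA: {mla}, reduction: {ratio:.1f}x")
# prints "MHA: 8192 values/token/layer, MLA: 576, reduction: 14.2x"
```

The ratio depends entirely on the chosen head count and latent size, so other configurations will land elsewhere.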
Three mechanisms stabilize the recurrence:
LTI-constrained injection, which keeps the spectral radius ρ(A) < 1;
Adaptive Computation Time (ACT) for dynamic per-position stopping;
Depth-wise LoRA adapters that give each iteration its own expressive power at the cost of only a small number of extra parameters.
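The first two mechanisms in the list above can be illustrated with plain Python. This sketch assumes a diagonal state matrix A whose entries are produced by `exp(-exp(raw))`, one common way to force values into (0, 1), and a simplified ACT rule that only decides when to stop. Both are assumptions for illustration rather than the repository's exact parameterization.

```python
import math


def stable_decay(raw: float) -> float:
    """Map any real parameter into (0, 1), so a diagonal A has spectral radius < 1."""
    return math.exp(-math.exp(raw))


def lti_step(h, e, a, b):
    """One injection step h <- A h + B e, with diagonal A and B."""
    return [ai * hi + bi * ei for hi, ei, ai, bi in zip(h, e, a, b)]


def act_halting_step(halt_probs, threshold: float = 0.99) -> int:
    """Simplified ACT: stop at the first step where cumulative halting mass crosses the threshold."""
    total = 0.0
    for step, p in enumerate(halt_probs, start=1):
        total += p
        if total >= threshold:
            return step
    return len(halt_probs)  # budget exhausted


raw_params = [-2.0, 0.0, 3.0]              # unconstrained values, e.g. learned
a = [stable_decay(r) for r in raw_params]  # every entry lies strictly in (0, 1)
b = [1.0, 1.0, 1.0]
e = [0.5, -1.0, 2.0]                       # frozen input encoding
fixed_point = [bi * ei / (1.0 - ai) for ai, bi, ei in zip(a, b, e)]

h0 = [100.0, -100.0, 100.0]                # deliberately large starting state
h = list(h0)
for _ in range(16):
    h = lti_step(h, e, a, b)

print("decay factors:", [round(ai, 4) for ai in a])
print("state after 16 steps:", [round(hi, 3) for hi in h])
print("halting step:", act_halting_step([0.1, 0.2, 0.3, 0.5, 0.9]))
```

Because every decay factor is below 1, the state contracts toward a fixed point set by the injected encoding instead of growing with the number of iterations. The halting rule lets an easy position stop early while a harder one uses more of the loop budget.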
Regarding parameter efficiency, the claim is that a k-layer block looped for T cycles can approach the quality of a k·T-layer standard Transformer while storing only k layers of parameters. Empirically, Parcae (Prairie et al., 2026) reports that a 770 M-parameter RDT matches a 1.3 B-parameter standard model on the same training data. The key insight is that inference depth is a function of compute, not of parameter count.
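A back-of-the-envelope count makes the parameter argument concrete. The sketch below uses dense layers for simplicity (no MoE), ignores biases and normalization weights, and assumes one low-rank adapter per iteration on a single square matrix. The sizes `d_model`, `k_layers`, `n_loops`, and `rank` are hypothetical.

```python
def block_params(d_model: int, ffn_mult: int = 4) -> int:
    """Rough weight count for one dense Transformer layer."""
    attention = 4 * d_model * d_model                # Q, K, V and output projections
    feed_forward = 2 * d_model * ffn_mult * d_model  # up and down projections
    return attention + feed_forward


def lora_params(d_model: int, rank: int) -> int:
    """One low-rank adapter on a d_model x d_model matrix: two thin factors."""
    return 2 * d_model * rank


# Hypothetical sizes, for illustration only.
d_model, k_layers, n_loops, rank = 1024, 4, 8, 8

shared = k_layers * block_params(d_model)
adapters = n_loops * lora_params(d_model, rank)  # one adapter per iteration
looped_total = shared + adapters
unrolled_total = k_layers * n_loops * block_params(d_model)

print(f"looped:   {looped_total:,} parameters ({adapters:,} in adapters)")
print(f"unrolled: {unrolled_total:,} parameters")
print(f"effective depth: {k_layers * n_loops} layers")
```

With these numbers the per-iteration adapters add well under one percent to the shared weights, while an unrolled model of the same effective depth stores roughly eight times as many parameters.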
This reframes scaling debates: the critical dimension is inference‑time recurrence depth rather than training‑time model size.
OpenMythos contributions:
A fully open-source, configurable PyTorch implementation of the RDT hypothesis, including the MoE FFN and Multi-Latent Attention.
LTI‑stable recurrence injection integrated as a first‑class training primitive.
Depth-wise LoRA adapters that differentiate behavior across iterations while adding only a small number of parameters.
Reproducible research baseline for studying dynamic, scalable recurrent Transformers and inference depth.
Repository links:
https://x.com/KyeGomezB/status/2045659150340723107
https://github.com/kyegomez/OpenMythos
