
Poolside just emerged from stealth with Laguna XS.2 and M.1—two MoE language models built specifically for agentic coding workflows. The headline innovation isn't the architecture itself, but their use of the Muon optimizer, a training method that reaches the same loss as AdamW in roughly 15% fewer steps.

Technical Specs

Laguna XS.2 (33B-A3B) runs 33B total parameters with only 3B activated per token—making it local-ready on consumer hardware with 36GB RAM. It ships with Apache 2.0 licensing, 128K context window, and FP8 quantized KV cache. The architecture uses 256 experts plus one shared expert, with 40 layers mixing sliding window attention (30 layers, 512-token window) and global attention (10 layers).
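The 30/10 split between sliding-window and global layers can be sketched as a layer schedule. The counts come from the spec above; the interleaving pattern (one global layer every fourth layer) is an assumption, since Poolside hasn't published the exact layout:

```python
def layer_schedule(num_layers=40, global_every=4):
    """Return an attention type per layer: 'global' every
    `global_every`-th layer, 'sliding' (512-token window) otherwise.
    The every-4th placement is hypothetical."""
    return [
        "global" if (i + 1) % global_every == 0 else "sliding"
        for i in range(num_layers)
    ]

sched = layer_schedule()
# 40 layers total: 10 global, 30 sliding-window, matching the spec
```

Interleaving global layers among mostly local ones is a common way to keep long-range information flow while paying full-attention cost on only a fraction of layers.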

Laguna M.1 (225B-A23B) is the flagship—225B total, 23B active per token. Same 128K context, but weights remain closed (available on request for researchers). Both models were trained on 30T+ tokens with ~4.4T synthetic data (13% of the mix).
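The "A3B"/"A23B" activation pattern comes from top-k expert routing: a router scores all 256 experts per token, only the top-k run, and the shared expert is always on. A minimal sketch, assuming top-k routing with softmax-renormalized weights; the actual k and router details are not published, so k=8 here is purely illustrative:

```python
import numpy as np

def route(hidden, router_w, k=8):
    """Pick the top-k experts for one token. k=8 is an assumption,
    not a published Laguna value."""
    logits = hidden @ router_w                  # one score per expert
    topk = np.argsort(logits)[-k:]              # indices of the k best experts
    weights = np.exp(logits[topk] - logits[topk].max())
    weights /= weights.sum()                    # softmax over selected experts only
    return topk, weights                        # shared expert runs unconditionally

rng = np.random.default_rng(0)
idx, w = route(rng.normal(size=512), rng.normal(size=(512, 256)))
```

Only the selected experts' parameters touch the token, which is how 33B total parameters can cost ~3B active per forward pass.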

The Muon Optimizer

Muon replaces AdamW's two-state approach (momentum + variance) with a single-state design that applies Newton-Schulz orthogonalization to gradients. The mechanism maintains gradient diversity during training, preventing collapse that commonly occurs in MoE architectures.
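The orthogonalization step can be sketched with the quintic Newton-Schulz iteration from the public Muon reference implementation (the coefficients below are from that reference; Poolside's batched, CUDA-graph distributed variant is not public):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately map a gradient matrix G to the nearest
    semi-orthogonal matrix (all singular values driven toward 1).
    Coefficients follow the public Muon reference implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)   # normalize so the iteration converges
    transpose = X.shape[0] > X.shape[1]
    if transpose:                       # iterate on the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transpose else X

G = np.random.default_rng(1).normal(size=(64, 32))
O = newton_schulz_orthogonalize(G)
# Singular values of O cluster near 1, regardless of G's conditioning
```

Because the update direction is (approximately) orthogonal rather than scaled per-coordinate, every singular direction of the gradient gets similar weight, which is the "gradient diversity" property described above.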

Key differences:

Aspect                 AdamW                       Muon
States per parameter   2 (momentum + variance)     1 (momentum)
Optimizer memory       Baseline                    ~50% lower
Update mechanism       Adaptive per-parameter LR   Gradient orthogonalization

The compute overhead for orthogonalization stays under 1% of training step time, and the checkpoint sizes drop significantly. Poolside's distributed implementation batches Newton-Schulz operations across ranks with communication-compute overlap and CUDA graphs for efficiency.
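The 50% state-memory figure is just the arithmetic of dropping one of AdamW's two per-parameter states. A sketch, assuming FP32 optimizer states (Poolside hasn't published its state precision):

```python
def optimizer_state_bytes(num_params, states, bytes_per_state=4):
    # FP32 states assumed; lower-precision states would shrink both sides equally.
    return num_params * states * bytes_per_state

n = 33_000_000_000                            # Laguna XS.2 total parameters
adamw = optimizer_state_bytes(n, states=2)    # momentum + variance
muon = optimizer_state_bytes(n, states=1)     # momentum only
# muon is exactly half of adamw: the "50% reduction" in the table
```

The same halving applies to optimizer state inside checkpoints, which is why checkpoint sizes drop as well.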

Benchmarks

Laguna M.1: 72.5% SWE-bench Verified, 46.9% SWE-bench Pro, 40.7% Terminal-Bench 2.0

Laguna XS.2: 68.2% SWE-bench Verified, 44.5% SWE-bench Pro, 30.1% Terminal-Bench 2.0

The honest reporting stands out. Poolside openly acknowledges that Qwen3.6-35B-A3B (73.4% SWE-bench Verified, 51.5% Terminal-Bench) outperforms Laguna XS.2, and DeepSeek-V4-Flash (79.0% SWE-bench) leads the category. Terminal-Bench 2.0 reveals a significant gap—Laguna XS.2 scores 30.1% vs Qwen's 51.5%.

The positioning is clear: this is a Western open-weights alternative to Chinese model dominance, built for agent-first workflows with native ACP spec support.

Agent RL Training

Poolside built a fully asynchronous online RL system for long-horizon coding agents. The architecture decouples actors (running sandboxed tasks) from trainers (consuming trajectories), with GPUDirect RDMA weight transfers moving hundreds of GB in ~5 seconds. The system uses a variant of CISPO for off-policy stability across multi-day training runs.
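The actor/trainer decoupling is a producer-consumer pattern: actors push finished trajectories into a buffer and trainers consume whatever is ready, without lockstep synchronization. A minimal single-process sketch using threads and a queue; the real system runs sandboxed tasks on separate machines and ships weights over GPUDirect RDMA:

```python
import queue
import threading

# Bounded buffer decoupling trajectory production from training consumption
trajectories = queue.Queue(maxsize=64)

def actor(task_id):
    # Stand-in for running a sandboxed coding task end to end
    trajectories.put({"task": task_id, "reward": 1.0})

def trainer(num_updates):
    # Consumes trajectories as they arrive; actors may be several
    # policy versions ahead, hence the off-policy correction (CISPO)
    return [trajectories.get() for _ in range(num_updates)]

threads = [threading.Thread(target=actor, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
batch = trainer(8)
for t in threads:
    t.join()
```

Because actors never wait for the trainer, slow multi-hour tasks don't stall gradient updates; the cost is off-policy data, which is what the CISPO variant is there to stabilize.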

Availability

OpenRouter: Both models run on free tier (poolside/laguna-xs.2:free, poolside/laguna-m.1:free)

Ollama: ollama run laguna-xs.2 for local inference

HuggingFace: poolside/Laguna-XS.2 with FP8, NVFP4, and INT4 variants

Poolside Platform: Free API access at platform.poolside.ai
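OpenRouter exposes models through an OpenAI-compatible chat endpoint, so calling the free tier is a matter of posting a standard chat payload with the slug listed above. A sketch that only builds the request body (actually sending it needs an API key, omitted here):

```python
import json

# Request body for OpenRouter's OpenAI-compatible chat completions endpoint,
# using the free model slug listed above. POST it to the API with an
# Authorization: Bearer <key> header to run it for real.
payload = {
    "model": "poolside/laguna-xs.2:free",
    "messages": [
        {"role": "user", "content": "Write a binary search in Python."}
    ],
}
body = json.dumps(payload)
```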

Community Sentiment

Hacker News reception was mixed. Users praised the fast inference and ACP spec adherence—one commenter noted it works better than Codex or OpenCode in Zed. Others criticized the benchmark position: "not winning any popular benchmark" and "quite a huge lead for Qwen" on Terminal-Bench. The consensus view: it's good to see a Western lab emerge from stealth with competitive models, even if they're not leading the leaderboard.

The real question is whether Muon's training efficiency translates to faster iteration cycles for future releases. Matching AdamW's loss in 15% fewer steps is a genuine contribution—AdamW has remained the default optimizer for large-scale training since its introduction (2017 paper, ICLR 2019), and few challengers have stuck.

Sources:
https://poolside.ai/blog/laguna-a-deeper-dive
https://poolside.ai/blog/introducing-laguna-xs2-m1
https://huggingface.co/poolside/Laguna-XS.2
https://news.ycombinator.com/item?id=47936511
https://openrouter.ai/models/poolside/laguna-xs.2
https://ollama.com/library/laguna-xs.2
https://github.com/poolsideai/pool