Kimi K2.5: Still Worth It After Two Weeks?

Agent swarm and early fusion for better vision capabilities

Large Language Models
Author

Maxime Labonne

Published

February 19, 2026

Beijing-based Moonshot AI released Kimi K2.5 on January 27, 2026. Beyond traditional claims on benchmarks like HLE (50.2% with tools), coding, and vision, this release introduced the idea of “Agent Swarm”. Two weeks in, I wanted to revisit Kimi K2.5 and compare it with other recent releases like GLM-5, MiniMax-M2.5, and Qwen3.5.

What K2.5 Actually Is

Kimi K2.5 is one of the largest open-weight models with 1.04 trillion parameters and 32B activated parameters per token. This is significantly bigger than MiniMax-M2.5 (230B-A10B), Qwen3.5 (397B-A17B), and GLM-5 (1T-32B). It uses 384 experts with 8 activated per token, MLA attention, SwiGLU activation, and a 256K context window. The architecture is identical to Kimi K2, which shipped back in mid-2025.
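To make the "384 experts, 8 activated per token" numbers concrete, here is a toy sketch of sparse top-k MoE routing. This is illustrative only, not Moonshot's implementation: the gating matrix, dimensions, and softmax-over-selected-experts renormalization are generic assumptions.

```python
import numpy as np

def route_tokens(hidden, w_gate, top_k=8):
    """Toy top-k MoE router: select top_k of n_experts per token.

    hidden: (n_tokens, d_model), w_gate: (d_model, n_experts).
    Returns expert indices and renormalized gate weights.
    """
    logits = hidden @ w_gate                              # (n_tokens, n_experts)
    top_idx = np.argsort(logits, axis=-1)[:, -top_k:]     # 8 experts per token
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    # softmax over the selected experts only
    probs = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return top_idx, probs

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 64))      # 4 tokens, toy d_model=64
w_gate = rng.normal(size=(64, 384))    # 384 experts, as in K2.5
idx, w = route_tokens(hidden, w_gate)
print(idx.shape, w.shape)  # (4, 8) (4, 8): only 8 experts fire per token
```

Only the 8 selected experts run their feed-forward pass, which is why a 1.04T-parameter model activates just 32B parameters per token.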

What’s new is the training. K2 was originally pre-trained on 15T text-only tokens. K2.5 then continues from a near-end K2 checkpoint over an additional ~15T mixed visual and text tokens, plus ~1T for ViT training and ~700B for long-context mid-training (see tech report). Assuming these counts don’t overlap, that’s roughly 32T tokens across the full pipeline (vs. 28.5T tokens for the text-only GLM-5; Qwen3.5 and MiniMax-M2.5 haven’t released any numbers).

The vision encoder is MoonViT-3D, a 400M parameter native-resolution ViT based on SigLIP-SO-400M, with a NaViT packing strategy that handles variable-resolution images. For video, consecutive frames are grouped in fours and temporally pooled, achieving 4x compression. Qwen3.5 also used early fusion with a different strategy. Late fusion seems dead for frontier models.
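The 4x video compression is easy to picture as an operation on patch embeddings. The sketch below assumes mean pooling over groups of four frames; the report says frames are grouped in fours and temporally pooled, but the exact pooling operator is my assumption.

```python
import numpy as np

def temporal_pool(frame_tokens, group=4):
    """Pool visual tokens over groups of `group` consecutive frames.

    frame_tokens: (n_frames, n_patches, d) per-frame patch embeddings.
    Returns (n_frames // group, n_patches, d): 4x fewer frame slots.
    """
    n_frames, n_patches, d = frame_tokens.shape
    assert n_frames % group == 0, "pad or drop frames to a multiple of `group`"
    grouped = frame_tokens.reshape(n_frames // group, group, n_patches, d)
    return grouped.mean(axis=1)  # average across the temporal axis

video = np.random.rand(16, 196, 32)        # 16 frames, 196 patches, toy dim
pooled = temporal_pool(video)
print(video.shape[0] // pooled.shape[0])   # 4x temporal compression
```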

The model ships in native INT4 precision (~595GB), not FP8/BF16. Moonshot used quantization-aware training during post-training to achieve this. Unsloth’s dynamic 1.8-bit quant brings it down to ~240GB, runnable on a single 24GB GPU with sufficient RAM offloading at ~10 tokens/sec.
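The memory figures check out with back-of-envelope arithmetic. The ~14% overhead factor below is my assumption to account for quantization scales, embeddings, and non-quantized layers; it is not from the report.

```python
def weight_footprint_gb(n_params, bits_per_param, overhead=0.0):
    """Rough weight-only memory estimate (no KV cache or activations)."""
    return n_params * bits_per_param / 8 / 1e9 * (1 + overhead)

params = 1.04e12
print(round(weight_footprint_gb(params, 4)))                 # 520 GB raw INT4
print(round(weight_footprint_gb(params, 4, overhead=0.14)))  # ~593 GB, near the reported ~595GB
print(round(weight_footprint_gb(params, 1.8)))               # 234 GB, near Unsloth's ~240GB quant
```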

Agent Swarm

Before we get to the benchmarks, I want to talk about the most technically interesting part of this release: Agent Swarm. And, specifically, the Parallel-Agent Reinforcement Learning (PARL) framework that powers it.

The core idea is that, instead of executing agent tasks sequentially (tool call, observe result, reason, next tool call, etc.), K2.5 learns to decompose problems into parallelizable subtasks and delegates them to sub-agents. The orchestrator is trainable, but the sub-agents are frozen copies of intermediate policy checkpoints. Only the orchestrator’s parameters get updated via RL.
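Structurally, this looks like a fan-out/fan-in pattern where only the decomposition policy is learnable. The sketch below is a toy analogy, not PARL itself: `run_subagent` stands in for a frozen intermediate checkpoint, and `decompose` stands in for the trainable orchestrator.

```python
from concurrent.futures import ThreadPoolExecutor

def run_subagent(subtask: str) -> str:
    """Stand-in for a frozen sub-agent policy (in PARL, a fixed
    intermediate checkpoint of the same model). Never updated by RL."""
    return f"result({subtask})"

def orchestrate(task: str, decompose) -> list:
    """The trainable orchestrator: decompose, fan out, gather.
    Only `decompose` (the orchestrator policy) would receive gradients."""
    subtasks = decompose(task)
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        return list(pool.map(run_subagent, subtasks))

# toy decomposition policy: split a research task into independent lookups
results = orchestrate("compare 3 models",
                      lambda t: [f"{t}#part{i}" for i in range(3)])
print(results)
```

Freezing the sub-agents makes the learning problem tractable: every reward signal unambiguously grades the decomposition, not the execution.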

It’s designed this way because end-to-end co-optimization of the orchestrator and sub-agents would create a credit-assignment nightmare. In other words, if the final answer is wrong, is it the orchestrator’s fault for bad delegation, or the sub-agent’s fault for bad execution?

The training had to solve two emergent failure modes: serial collapse (the orchestrator defaults to safe sequential execution) and spurious parallelism (it spams sub-agent creation to game metrics without meaningful decomposition). Auxiliary reward terms push past both early in training, then anneal to zero so the final policy optimizes purely for task success.
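The annealing idea can be sketched as a shaped reward whose auxiliary terms decay to zero. The specific functional form below (linear anneal, fixed bonus/penalty weights) is entirely my assumption; the report only describes the mechanism qualitatively.

```python
def shaped_reward(task_success: float, n_subagents: int, useful_subagents: int,
                  step: int, anneal_steps: int = 1000) -> float:
    """Toy reward shaping against serial collapse and spurious parallelism.

    Auxiliary terms are weighted by a coefficient that decays linearly
    to zero, so the final policy optimizes task success alone.
    """
    coef = max(0.0, 1.0 - step / anneal_steps)                   # anneals to 0
    parallelism_bonus = 0.1 * min(useful_subagents, 4)           # vs. serial collapse
    spam_penalty = 0.1 * max(0, n_subagents - useful_subagents)  # vs. spurious spam
    return task_success + coef * (parallelism_bonus - spam_penalty)

print(shaped_reward(1.0, 3, 3, step=0))      # early: parallelism bonus active
print(shaped_reward(1.0, 3, 3, step=1000))   # late: pure task reward
```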

According to Moonshot AI, BrowseComp jumps from 60.6% (single agent) to 78.4% (swarm), WideSearch F1 from 72.7% to 79.0%, and execution time drops 3-4.5x on suitable tasks. However, for comparison, Qwen3.5 reports a BrowseComp score of 78.6% using the same discard-all strategy as K2.5, but without a swarm mechanism. To my knowledge, this is the first time an open-weight model has been trained to parallelize agentic work rather than having it imposed by external scaffolding. Exciting stuff!

Benchmarks

Moonshot’s reported numbers are competitive. Here’s the context that matters.

Where K2.5 leads (among all models, including proprietary): HLE-Full with tools (50.2% vs. GPT-5.2’s 45.5%), BrowseComp with swarm (78.4%), OCRBench (92.3%), MathVista (90.1%), InfoVQA (92.6%). Note that MiniMax M2.5 reports 76.3% for BrowseComp without any swarm mechanism.

Where K2.5 is competitive but behind frontier: AIME 2025 (96.1% vs. GPT-5.2’s 100%), SWE-Bench Verified (76.8% vs. Claude Opus 4.5’s 80.9%, MiniMax M2.5’s 80.2%, Qwen3.5’s 76.4%), GPQA-Diamond (87.6% vs. GPT-5.2’s 92.4% and Qwen3.5’s 88.4%), Terminal-Bench 2.0 (50.8% vs. Claude’s 59.3%).

Where K2.5 shows clear gaps: WeirdML (46% vs. 72% for GPT-5.2) is telling. On Artificial Analysis’s AA-Omniscience knowledge index, K2.5 scores -11 (correct minus incorrect), meaning it hallucinates more than other frontier models. Claude Opus 4.5 scores +10, Gemini 3 Pro scores +13.

Kimi K2.5 was also the open-weight model with the highest Intelligence Index on Artificial Analysis before the release of GLM-5, and it still leads Qwen3.5 and MiniMax-M2.5, which are smaller models. Note that this doesn’t mean GLM-5 is necessarily superior (or vice versa): many details get lost in these streamlined benchmarks, and features like Agent Swarm are not properly taken into account.

Community Feedback

Two weeks of community usage provide a good picture of the model’s real impact.

Coding: K2.5 is legitimately strong, especially for front-end work and visual-to-code tasks. Kilo Code reports it rapidly climbed to top-performer status for architectural planning. Multiple developers on r/LocalLLaMA report building complete projects at ~1/8th the cost of Opus. But the pattern in skeptical reviews is consistent: K2.5 often generates verbose, over-engineered code on the first pass, then simplifies when asked. Opus and Codex tend to get it right the first time.

Agent Swarm: Impressive when it works. Users report effective parallel web research and multi-niche data collection. But follow-up editing of swarm outputs is painful, and sub-agents can drift into inconsistent definitions for shared concepts. The spreadsheet use case (compiling data across rows) exposed this: every agent used slightly different column definitions.

Vision: The first open-weight model where vision feels genuinely competitive. Nathan Labenz tested it on a scanned document transcription task where Chinese models historically lagged, and K2.5 matched Gemini 3-level quality. Qwen3.5 also makes a strong play here: 90.3 on MathVista, 85.0 on MMMU, and native UI screenshot understanding with element detection. Creative writing and personality, on the other hand, are clearly behind Opus. And multiple users discovered K2.5 sometimes identifies itself as Claude, a strong signal about training data provenance.

The Bigger Picture

Verbosity and cost. When Artificial Analysis evaluated K2.5, the model generated 89 million output tokens. The median for comparable models is 14 million. At $0.60/$3.00 per million input/output tokens, the per-token price looks cheap, but when the model produces 6x more tokens per task, effective costs still spike. Kilo Code’s week-long free trial confirmed this: usage surged past 50B tokens/day, and they concluded the model’s verbosity dilutes the savings from input caching. This is the main issue with this model.
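The effective-cost arithmetic is worth spelling out. The sketch below compares output-token spend only (input tokens and caching are ignored for simplicity), using the $0.60/$3.00 per-million prices and the token counts above.

```python
def job_cost_usd(tokens_in: float, tokens_out: float,
                 price_in=0.60, price_out=3.00) -> float:
    """Cost at per-million-token prices ($0.60 in / $3.00 out for K2.5)."""
    return tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out

# output tokens generated during the Artificial Analysis eval run
k25 = job_cost_usd(0, 89e6)   # K2.5: 89M output tokens
med = job_cost_usd(0, 14e6)   # median comparable model: 14M
print(round(k25), round(med), round(k25 / med, 1))  # 267 42 6.4
```

A cheap per-token price still turns into a ~6x effective-cost multiplier when the model emits 6x the tokens.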

Multimodal training insights. The technical report contains a finding worth highlighting for anyone building multimodal models. Given a fixed vision-text token budget, early fusion with a low vision ratio (10% vision from the start) outperformed late fusion with a high ratio (50% vision injected at the 80% mark) across every metric. Moonshot also introduces “zero-vision SFT,” where text-only fine-tuning activates visual reasoning capabilities, and visual RL actually improved text-only benchmarks (MMLU-Pro: 84.7% to 86.4%, GPQA-Diamond: 84.3% to 86.4%). This bidirectional transfer validates the native multimodal approach.
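What makes the fusion comparison clean is that, under my reading of "fixed vision-text token budget," both schedules spend the same number of vision tokens. A quick sanity check (the ~15T budget is from the report; the bookkeeping is mine):

```python
def vision_tokens(total_budget, vision_ratio, fraction_of_run=1.0):
    """Vision tokens consumed: `vision_ratio` applied over the last
    `fraction_of_run` portion of a fixed total token budget."""
    return total_budget * fraction_of_run * vision_ratio

budget = 15e12  # ~15T continued-pretraining tokens
early = vision_tokens(budget, 0.10)        # 10% vision from the start
late = vision_tokens(budget, 0.50, 0.20)   # 50% vision in the final 20%
print(early / 1e12, late / 1e12)           # same ~1.5T vision spend either way
```

Same vision spend, very different outcomes, which isolates *when* the vision tokens arrive as the variable that matters.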

Geopolitics and sustainability. K2.5 was trained on hardware constrained by US export controls, and it’s competitive with models from labs that have unconstrained chip access. Moonshot raised at a $4.8 billion valuation and is clearly spending that capital on user acquisition: the free tier is generous, and the Agent Swarm beta comes with free credits. The pricing is almost certainly not sustainable long-term. The license is Modified MIT, commercially free for companies under 100M MAU.

Deployment. Running K2.5 locally requires serious hardware: ~595GB at native INT4, with Unsloth’s 2-bit quant (375GB) as the practical sweet spot. On a single 24GB GPU with 256GB+ RAM, expect ~10 tokens/sec. The model works with vLLM, SGLang, and KTransformers, though the r/LocalLLaMA AMA surfaced rough edges with `<think>` tag parsing on some backends. Vision support in GGUF/llama.cpp is not yet available. For API users, 8 providers serve K2.5, with Fireworks leading on speed (283 t/s), DeepInfra on price ($0.90 blended), and Baseten on raw throughput (336 t/s).

What’s Next

K2.5 is a solid open-weight model with strong vision capabilities, but verbosity can be an issue in real-world usage. Competition is particularly fierce in this weight class, so it’s worth experimenting with several options on your own use cases.

The part I’d watch most closely is whether PARL generalizes: does it transfer to arbitrary real-world workflows, or mainly help on embarrassingly parallel research tasks? That question is still open. ¯\\_(ツ)_/¯

Quick links: