On February 12th, 2026, barely a month after its Hong Kong IPO, Shanghai-based MiniMax dropped M2.5. The headline numbers: 80.2% SWE-Bench Verified, 51.3% Multi-SWE-Bench (first place), 76.3% BrowseComp. These numbers sit within a percentage point of Claude Opus 4.6 and ahead of GPT-5.2 on several agentic benchmarks. Even more interesting to me, the model costs roughly $1 per hour of continuous operation at 100 tokens per second.
The striking part of this release is the combination: this is a 230B MoE model with only 10B active parameters, trained primarily through large-scale reinforcement learning across 200,000+ real-world environments. It handles not just code but full office productivity workflows (Word, Excel, PowerPoint).
What M2.5 Actually Is
MiniMax-M2.5 is an iterative improvement on the M2 family, which launched in late October 2025. The architecture is unchanged from M2: a Mixture-of-Experts model with 230 billion total parameters and 10 billion active per forward pass. For context, this active parameter count is tiny compared to what you’d expect from a frontier-competitive model:
- GLM-5 has 744B parameters and 40B active
- DeepSeek V3/R1 has 685B parameters and 37B active
- Not frontier-class, but as a reference point, Qwen3-235B has 235B total parameters and 22B active
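Why the active count matters so much: as a rough rule of thumb, decode compute per token scales with the active parameters (about 2 FLOPs per active parameter), while memory footprint still scales with the total. A back-of-the-envelope comparison using the numbers above, ignoring attention cost and any MTP heads:

```python
# Rough rule of thumb: dense decode compute ~ 2 FLOPs per active parameter per token.
# Memory footprint, by contrast, still scales with *total* parameters.
active_params = {
    "MiniMax-M2.5": 10e9,
    "GLM-5": 40e9,
    "DeepSeek V3/R1": 37e9,
    "Qwen3-235B": 22e9,
}
for name, active in active_params.items():
    gflops_per_token = 2 * active / 1e9
    print(f"{name:>15}: ~{gflops_per_token:.0f} GFLOPs/token "
          f"({active / 10e9:.1f}x M2.5)")
```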
The model comes in two API variants:
- M2.5-Lightning: 100 tokens per second, $0.30/M input, $2.40/M output
- M2.5 Standard: 50 tokens per second, $0.15/M input, $1.20/M output
The Lightning version is roughly 2x the throughput of other frontier models. The Standard version is extremely cheap. To put the pricing in perspective: Claude Opus 4.6 charges $5/M input and $25/M output tokens. Even GLM-5, which was just released, is priced at $1/M input and $3.20/M output tokens, several times more expensive than M2.5.
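A quick back-of-the-envelope check on those numbers, assuming output-token-dominated agentic usage (input tokens and cache discounts ignored) and assuming Opus runs at a comparable 50 tokens per second:

```python
def cost_per_hour(tokens_per_second: float, output_price_per_m: float) -> float:
    """Cost of one agent instance generating continuously for one hour."""
    tokens_per_hour = tokens_per_second * 3600
    return tokens_per_hour / 1e6 * output_price_per_m

lightning = cost_per_hour(100, 2.40)   # ~$0.86/hour
standard = cost_per_hour(50, 1.20)     # ~$0.22/hour
opus = cost_per_hour(50, 25.00)        # ~$4.50/hour (speed assumed, see above)

print(f"Lightning ${lightning:.2f}/h | Standard ${standard:.2f}/h | Opus 4.6 ${opus:.2f}/h")
# Four Standard instances running all year, on output tokens alone:
print(f"4x Standard for a year: ${4 * standard * 24 * 365:,.0f}")  # ~$7,600
```

The Lightning figure lines up with the "roughly $1 per hour" framing above, and the four-instances figure lands in the same ballpark as the $10,000-per-year claim MiniMax makes (quoted later in the post) once input tokens are added.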
MiniMax claims the weights have been “fully open-sourced” on Hugging Face (though as of this writing, they haven’t been posted yet). If you want to run it locally, the recommendation is vLLM or SGLang. With only 10B active parameters, the inference footprint is remarkably manageable for a model at this capability level.
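For local inference, a minimal vLLM sketch could look like the following (the repo id is a placeholder, since the weights aren't on Hugging Face yet, and the parallelism setting is an assumption that depends on your hardware):

```python
from vllm import LLM, SamplingParams

# Hypothetical repo id; the M2.5 weights were not yet published at the time of writing.
# tensor_parallel_size is hardware-dependent: a 230B MoE still needs all weights in
# memory even though only ~10B parameters are active per token.
llm = LLM(
    model="MiniMaxAI/MiniMax-M2.5",
    tensor_parallel_size=8,
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.7, max_tokens=1024)
outputs = llm.generate(["Write a Python function that parses a CSV file."], params)
print(outputs[0].outputs[0].text)
```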
Benchmarks
A few things stand out. The Multi-SWE-Bench score of 51.3% is actually first place, ahead of Opus 4.6's 50.3%. Multi-SWE-Bench tests coding tasks across multiple programming languages, and M2.5 was trained on 10+ languages (Python, Go, C, C++, TypeScript, Rust, Kotlin, Java, JavaScript, PHP, Lua, Dart, Ruby). This is not a Python-only model; my C++/Rust friends will be happy.
The BFCL multi-turn score is an interesting number. First, it's a bit cheeky to report only the multi-turn split and nothing else. At 76.8% on multi-turn function calling, M2.5 leads Opus 4.6 by over 13 percentage points. It also shows immense progress in multi-turn tool use compared to MiniMax M2.1 (+39.4 points).
And then there’s the independent evaluation from OpenHands (the open-source coding agent platform). Their OpenHands Index placed M2.5 4th overall, behind only Claude Opus 4.6, Claude Opus 4.5, and GPT-5.2 Codex. Graham Neubig noted the model performed particularly well on long-running tasks like developing apps from scratch, an area where smaller models have historically struggled.
Forge Reinforcement Learning
What’s technically interesting about M2.5 is how MiniMax got here. The answer is large-scale reinforcement learning, and specifically their in-house framework called Forge.
Forge is what MiniMax calls an “agent-native RL framework.” The key design decision is a decoupling layer between the training/inference engine and the agent scaffolding. This means MiniMax can plug any agent framework (Claude Code, Droid, OpenCode, custom harnesses) into the RL training loop, and the model learns to generalize across scaffolds rather than overfitting to one particular tool interface. MiniMax trained across 200,000+ real-world environments and in multiple domains, including, interestingly, their own company’s internal tasks.
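MiniMax hasn't published Forge's actual interfaces, but the decoupling idea is easy to picture: the trainer only sees a generic scaffold contract, and each harness (Claude Code-style, Droid, OpenCode, a custom one) is an adapter behind it. A purely illustrative sketch, not MiniMax's API:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Trajectory:
    """What the RL trainer consumes, regardless of which harness produced it."""
    token_ids: list[int]
    logprobs: list[float]
    reward: float

class AgentScaffold(Protocol):
    """Contract any agent harness must satisfy to plug into the training loop."""
    def rollout(self, task: str, policy_endpoint: str) -> Trajectory: ...

def collect_batch(scaffolds: list[AgentScaffold], tasks: list[str],
                  endpoint: str) -> list[Trajectory]:
    # The trainer never knows which scaffold produced which trajectory,
    # so the policy is optimized across tool interfaces rather than for one of them.
    return [s.rollout(task, endpoint) for s in scaffolds for task in tasks]
```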
To get the RL to scale, MiniMax used three key innovations:
- CISPO (Clipped Importance Sampling Policy Optimization): Their custom RL algorithm, first proposed in the M1 paper. Rather than clipping token updates like PPO/GRPO, CISPO clips the importance sampling weights. The result is that all tokens contribute to gradient computations, even low-probability ones that are often crucial for maintaining entropy and enabling scalable RL (a rough sketch of the difference follows this list). In controlled experiments on Qwen2.5-32B, CISPO achieved a 2x speedup compared to DAPO (ByteDance's recent RL algorithm).
- Asynchronous scheduling + tree-structured sample merging: To keep GPU utilization high despite the inherently sequential nature of agent rollouts, they optimized the trade-off between throughput and sample off-policyness. They claim this achieves approximately a 40x training speedup over naive approaches.
- Process rewards for credit assignment: Long agent trajectories make credit assignment extremely difficult. If a coding agent takes 50 steps to solve a bug, which steps were actually helpful? MiniMax introduced process-level rewards to monitor generation quality throughout the trajectory, and also directly estimated real-world task completion time as a reward signal, pushing the model toward faster solutions.
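To make the CISPO bullet concrete, here is a rough per-token sketch of the clipping difference, based on my reading of the M1 paper (heavily simplified: no grouping, no length normalization, made-up hyperparameters):

```python
import torch

def ppo_style_loss(logp_new, logp_old, advantage, eps=0.2):
    """PPO/GRPO clip the update: tokens whose ratio leaves the trust region
    hit a constant surrogate and contribute zero gradient."""
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.minimum(ratio * advantage,
                              torch.clamp(ratio, 1 - eps, 1 + eps) * advantage)
    return -surrogate.mean()

def cispo_style_loss(logp_new, logp_old, advantage, eps_high=0.2):
    """CISPO clips the importance-sampling weight itself and detaches it, so every
    token, including low-probability ones, still contributes a REINFORCE-style
    gradient through its log-probability."""
    weight = torch.clamp(torch.exp(logp_new - logp_old), max=1 + eps_high).detach()
    return -(weight * advantage * logp_new).mean()
```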
MiniMax engineer Olive Song noted on Alex Volkov's ThursdAI podcast that the entire M2.5 training period was about two months. For reference, the M1 reasoning model's full RL training on 512 H800s completed in just three weeks at a rental cost of $534,700. The M2 series uses the smaller 230B MoE architecture (M1 was 456B with 45.9B active), so the compute requirements are even more favorable.
The Bigger Picture
MiniMax made a few interesting design choices that are worth highlighting.
Emergent spec-writing behavior. MiniMax notes that M2.5 has learned to proactively plan before writing code, decomposing a project before starting implementation. This is consistent with what we’ve seen in other top coding models: when they’re trained in environments that reward end-to-end task completion, they develop strategic planning behaviors. The model learns that spending tokens on upfront planning saves tokens and reduces errors downstream. This directly translates into token efficiency: on SWE-Bench Verified, M2.5 consumed an average of 3.52M tokens per task versus M2.1’s 3.72M, roughly a 5% reduction.
Office productivity. Beyond code, manipulating Office documents is becoming a key capability for frontier models. MiniMax clearly targets this space and developed an internal GDPval-MM benchmark (pairwise LLM-as-judge evaluation of trajectory quality). They claim that M2.5 achieved a 59.0% average win rate against mainstream models. They also offer MiniMax Agent, their consumer-facing agentic platform, where users have built over 10,000 “Experts” (specialized agent configurations).
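MiniMax hasn't released GDPval-MM, but a pairwise LLM-as-judge win rate of this kind usually boils down to something like the following (my own illustrative sketch, counting ties as half a win):

```python
def pairwise_win_rate(judgments: list[str]) -> float:
    """judgments: one judge verdict per task pair ('win' / 'tie' / 'loss')
    for the candidate model against a single baseline."""
    wins = judgments.count("win")
    ties = judgments.count("tie")
    return (wins + 0.5 * ties) / len(judgments)

# e.g. 11 wins, 2 ties, 7 losses against one baseline -> 0.60
print(pairwise_win_rate(["win"] * 11 + ["tie"] * 2 + ["loss"] * 7))
```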
Cost. This is the most interesting bit to me. M2.5 is still not Opus 4.6-level, but MiniMax frames it as the most cost-efficient alternative. The blog post even states that “you can have four M2.5 instances running continuously for an entire year for $10,000.” Honestly, I don’t think the experience is consistent enough for production workloads yet. Early reports from OpenHands suggest the model is strong but occasionally sloppy (pushing to the wrong branch, missing formatting instructions). However, it points to a clear and achievable path a few generations down the line.
What’s Next
MiniMax promised a more detailed technical blog post on the Forge framework and their RL scaling laws. That’s what I’m most interested in: does performance scale linearly with the number of environments, or are there diminishing returns?
Another open question is whether the M2 series’ rapid improvement comes from catching up to others or from genuinely pushing the frontier on agentic RL. Competition on coding feels extremely tough, but GDPval-style office productivity might be a good angle for building differentiated capabilities.
Quick links:
- Official announcement: https://www.minimax.io/news/minimax-m25
- API access: https://platform.minimax.io/docs/guides/text-generation
- MiniMax Agent: https://agent.minimax.io
- OpenHands evaluation: https://openhands.dev/blog/minimax-m2-5-open-weights-models-catch-up-to-claude
- M2.1 weights (Hugging Face): https://huggingface.co/MiniMaxAI/MiniMax-M2.1
- OpenRouter: https://openrouter.ai/minimax
- CISPO paper (MiniMax-M1): https://arxiv.org/abs/2506.13585