So there is this new AI model called M2.7 and everyone is freaking out. Apparently it can improve itself — like, write about 30% of its own code through some self-improvement loop. Cool, right? Except… maybe do not freak out yet.
So What Did It Actually Achieve?
M2.7 introduces a process where the model iteratively improves its own performance. It plans changes, modifies code, runs evaluations, and decides whether to keep or discard the results. This approach is similar to how human developers refine software over time.
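The loop described above can be sketched in a few lines. This is a toy illustration of the keep-or-discard pattern, not M2.7's actual implementation: `score_fn` and `propose_change` are hypothetical stand-ins for the evaluation suite and the model's edit step.

```python
import random

def self_improvement_step(codebase, score_fn, propose_change):
    """One iteration: plan/apply a change, evaluate it, keep it only if it helps."""
    baseline = score_fn(codebase)
    candidate = propose_change(codebase)   # the model's proposed modification
    new_score = score_fn(candidate)        # run the evaluation suite
    # Discard regressions, keep improvements.
    return (candidate, new_score) if new_score >= baseline else (codebase, baseline)

# Toy stand-ins: the "codebase" is a list of numbers, the score is their sum,
# and a "change" randomly nudges one entry up or down.
def toy_score(code):
    return sum(code)

def toy_change(code):
    tweaked = list(code)
    i = random.randrange(len(tweaked))
    tweaked[i] += random.choice([-1, 1])
    return tweaked

code, score = [0, 0, 0], 0
for _ in range(100):
    code, score = self_improvement_step(code, toy_score, toy_change)
```

Because bad changes are discarded, the score never goes down across iterations, which is the whole appeal (and the whole limitation: you only ever climb the hill your evaluation function defines).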
If this were really that powerful, why stop at version 2.7? Why not keep running the loop until it produces something bigger? The fact that they called it 2.7 and not 3.0 tells you everything.
The Benchmarks Tell A Nuanced Story
Real testing from the Kilo Code team shows a more interesting picture. On PinchBench — a standardized OpenClaw agent benchmark — M2.7 scored 86.2%, landing in the top 5 out of 50 models and within just 1.2 points of Claude Opus 4.6. The jump from M2.5 (82.5%) to M2.7 (86.2%) is a real 3.7-point improvement.
On Kilo Bench, an 89-task evaluation covering everything from git operations to cryptanalysis to QEMU automation, M2.7 solved 47% of tasks — second only to Qwen3.5-plus at 49%.
Here is the interesting part: every model solved tasks that no other model could. M2.7's unique win came on a SPARQL task that required understanding an EU-country filter as an eligibility criterion rather than an output filter. That is not just coding, that is reasoning about requirements. An oracle that picks the best model per task would solve 67% of tasks, showing these models are not interchangeable, they are complementary.
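The "oracle" figure is just set arithmetic: a task counts as solved if at least one model solved it, and a unique win is a task only one model solved. A minimal sketch, with made-up per-model results rather than the real Kilo Bench data:

```python
# Hypothetical results: model -> set of task ids it solved (illustrative only,
# NOT the actual Kilo Bench outcomes).
results = {
    "M2.7":         {1, 2, 5, 7},
    "Qwen3.5-plus": {1, 3, 5, 8},
    "Opus 4.6":     {2, 3, 6},
}
all_tasks = set(range(1, 10))  # 9 tasks in this toy example

def oracle_coverage(results, all_tasks):
    """Fraction of tasks solved by at least one model (best-model-per-task oracle)."""
    solved = set().union(*results.values())
    return len(solved) / len(all_tasks)

def unique_wins(results):
    """Tasks each model solves that no other model does."""
    return {
        model: solved - set().union(*(s for m, s in results.items() if m != model))
        for model, solved in results.items()
    }
```

With these toy numbers the oracle solves 7 of 9 tasks even though no single model gets past 4, which is exactly the complementarity point the benchmark makes.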
What Makes It Different
M2.7 reads extensively before writing. It pulls in surrounding files, analyzes dependencies, traces call chains. On tasks where that extra context pays off, it catches things other models miss. The tradeoff? It can over-explore hard problems, sometimes leading to timeouts on time-sensitive work.
For context: M2.7's median task duration sits at 355 seconds, notably longer than its predecessors. So while it might produce better results on complex refactors or codebase-wide changes, it is not winning any speed races.
The Price Is Right
The standout advantage is cost. At $0.30 per million input tokens and $1.20 per million output tokens, it comes in roughly 8 to 20 times cheaper than Opus 4.6 or GPT 5.4. For 80-90% of the performance at a fraction of the price, it is a genuinely practical option for tasks that do not require heavy reasoning or fast iteration.
The Bottom Line
M2.7 is a solid pick when you are working on tasks that reward deep context gathering — complex refactors, codebase-wide changes, anything where understanding surrounding code matters more than speed. Its PinchBench score puts it in the same tier as GPT-5.4 and GLM-5 for general agent tasks.
Is it a breakthrough? Not really. Self-improving loops ARE going to be a bigger deal going forward, but M2.7 specifically? It is an incremental step with a compelling price point. Useful and affordable, especially for coding-heavy workflows where its thoroughness pays off.
Do not panic. But do keep an eye on it.