LONDON, March 31, 2026 - BlueOptima, the enterprise software development analytics company, today released its findings from the BlueOptima AI Refactoring Evaluation (BARE), the largest enterprise benchmark of large language model (LLM) coding performance on real-world production software.
Across 57 LLMs, more than 243,000 model-file evaluations and nine programming languages, the results reveal a major reality check: top models succeed less than 23% of the time on real production code refactoring tasks, far from the 85%+ scores achieved on widely cited artificial benchmarks.
The findings are highly consequential for engineering leaders as they scale AI-assisted development across their organizations. Insights include:
- Benchmark performance does not predict real-world capability. Models routinely exceed 85% on benchmarks like HumanEval but average just 16.6% success on production maintainability tasks. Even frontier models remain below 23%.
- Performance varies by up to 8.6x across programming languages. The success rate for JavaScript was 31.9%, while C reached only 3.7%. Industries dependent on low-level languages face materially different returns from AI coding investments.
- Frontier models are converging, not accelerating. Claude 4.6 Opus (~22%), Gemini 3 Pro (~21.5%) and GPT-5.2 (~21.3%) all cluster within a narrow 17–23% band, suggesting the field may be approaching a capability plateau rather than an exponential trajectory.
- Complex tasks remain largely out of reach. LLMs achieve ~30% success on localized function simplification but drop to ~3% on file reorganization and ~1.5% on reducing code coupling without agentic support. The results point to consistent deficiencies in architectural reasoning, even though that capability dominates enterprise maintenance work.
- Cheaper models are not necessarily more economical. Once human review time for failed outputs is included, premium models can be 35% more cost-effective than commodity alternatives despite costing four times more per API call (see the illustrative calculation after this list).
- AI that learns your code can teach it to your competitors. The study found that LLMs perform measurably better on code they have encountered during training. Enterprises that expose proprietary source code through API interactions or data partnerships therefore risk enabling competitors to more effectively reproduce the design patterns, architectural decisions and domain logic that underpin their competitive advantage.
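To illustrate the cost-of-success point above, consider purely hypothetical figures in the ballpark of the results cited here (these are not the study's own inputs): suppose a commodity model costs $0.01 per call with a 17% success rate, a premium model costs $0.04 per call with a 22% success rate, and each failed output consumes $3 of engineer time to review and discard.
- Commodity model: ($0.01 + 0.83 × $3) ÷ 0.17 ≈ $14.70 per successful outcome.
- Premium model: ($0.04 + 0.78 × $3) ÷ 0.22 ≈ $10.80 per successful outcome.
Under these assumed figures, the premium model is roughly 36% more cost-effective per successful change despite a four-times-higher API price, because reviewing failed outputs, not token spend, dominates total cost.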
What This Means for Engineering Leaders
Enterprise engineering leaders are scaling AI coding tools at pace for fear of being left behind, and without reliable evidence of the consequences for their software estate. They don’t have the visibility they need to maximize the benefits of the technology, and they can’t demonstrate defensible ROI for their investments. The key question isn’t ‘Should we scale AI?’, it’s ‘Where does AI reliably work in our software development ecosystem?’
The BARE benchmark yields evidence-based recommendations: implement human review for all AI-generated production code, deploy AI selectively by task type and language rather than uniformly across the organization, validate tool performance against the organization's own codebases before scaling, and measure the total cost of successful outcomes rather than token price alone.
The findings also reinforce that AI adoption increases, rather than diminishes, the value of senior engineering talent capable of evaluating and integrating machine-generated output.
“The industry narrative around AI coding has been built on benchmarks that were never designed to measure what matters most in enterprise software engineering: whether these tools can maintain and improve the quality of complex production systems,” said Jason Rolles, Founder and Managing Director of BlueOptima. “Engineering leaders are making multi-million dollar decisions based on benchmarks that don’t reflect production reality. This research shows where AI actually works, and where it creates risk and cost.”
With enterprises rapidly scaling AI-assisted development, these findings raise questions about how productivity gains and quality impact are being measured and whether current expectations are realistic.
To read the “Benchmarking the Real World Coding Performance of LLMs” whitepaper, visit https://www.blueoptima.com/resource/benchmarking-the-real-world-coding-performance-of-llms-introducing-bare.
About the study
The research evaluated 57 LLMs from leading providers, including OpenAI, Anthropic, Google, Meta and Mistral, across 4,276 real source code files, yielding 243,732 model-file evaluation pairs. Each output was assessed via a seven-step static validation pipeline.
About BlueOptima
BlueOptima is a global software analytics company that helps technology leaders understand and improve how software gets built in the AI era. Its platform provides clear, objective insights into coding performance, giving leaders the data they need to make smarter decisions about resources, productivity, quality, and team effectiveness.
Used by organizations around the world, BlueOptima supports engineering leaders to measure AI impact, drive more efficient development processes, improve outcomes, and get more value from their software investments.
Media contact:
PANBlast for BlueOptima