AI benchmarks are failing enterprise industries. Here’s what needs to change.

Whether through accuracy scores, reasoning tests, or increasingly complex evaluation suites, the AI industry has rallied around a shared idea: if a model performs well against a broad set of tasks, it is ready for real-world use. Having spent the last few years working directly with insurers and healthcare providers, Danny Major MBCS has seen first-hand where this assumption breaks down in enterprise environments.

Summary:

Current AI benchmarks measure whether a model can produce a coherent answer, not whether that answer is safe
In regulated sectors, benchmarks must take safety, compliance, consistency and explainability into account
Organisations will need to build their own specific evaluation frameworks and benchmarks to ensure safe, functional AI use
Moving beyond general purpose benchmarking and investing in tailored evaluation will make AI adoption smoother and safer

Across the AI landscape, benchmarking has become the default way to measure progress. The benchmarks used to evaluate most AI systems, including those published by OpenAI, Google, Anthropic and external bodies, are designed to measure general capability. They tell us how well a model can answer a question, summarise content, or generate a response that appears coherent and relevant. What they do not tell us is whether that response is safe.

In sectors such as insurance, financial services and healthcare, the standard for ‘good’ is fundamentally different. An answer is not just correct or incorrect. It must be compliant, consistent, explainable and aligned to both regulation and brand. The risk is not simply that an AI makes a mistake, but that it makes a plausible mistake at scale. This is where the gap begins.

Targeted benchmarks foster better outcomes

Broad benchmarks optimise for performance in isolation. Regulated industries require evaluation in context. For example, a customer service AI handling policy queries is not judged on its ability to produce fluent language. It is judged on whether it adheres to policy wording, avoids misinterpretation and maintains an appropriate tone in sensitive scenarios. These are not abstract qualities. They are measurable, but only within the boundaries of a specific domain.

In practice, this means organisations are having to build their own evaluation frameworks from the ground up.

In my work with insurers, we have seen a shift towards what I would describe as ‘operational benchmarks’. These focus less on model capability and more on outcome integrity. Instead of asking ‘did the AI answer the question?’, the question becomes ‘did the AI answer the question in a way that is correct, compliant and safe for this customer, in this context?’ That subtle shift changes everything.

It introduces the need for structured guardrails, controlled response generation and continuous monitoring of live interactions. It also requires organisations to define what ‘good’ looks like for their specific use cases. This is not something that can be outsourced to a generic benchmark or third-party score.
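
To make the idea of defining 'good' for a specific use case more concrete, here is a minimal sketch of what an operational benchmark check might look like. It is an illustration only, not a description of any insurer's actual framework: the `Case` structure, the phrase lists and the function names are all hypothetical, and a real evaluation would need far richer checks than simple wording matches.

```python
from dataclasses import dataclass


@dataclass
class Case:
    """A hypothetical domain-specific test case (all fields are illustrative)."""
    question: str
    required_phrases: list[str]   # wording a compliant answer must include
    forbidden_phrases: list[str]  # wording a compliant answer must never include


def evaluate(answer: str, case: Case) -> dict[str, bool]:
    """Check one answer against the domain rules attached to its test case."""
    text = answer.lower()
    return {
        "includes_required_wording": all(
            phrase.lower() in text for phrase in case.required_phrases
        ),
        "avoids_forbidden_wording": not any(
            phrase.lower() in text for phrase in case.forbidden_phrases
        ),
    }


def passes(answer: str, case: Case) -> bool:
    """An answer passes only if every individual check passes."""
    return all(evaluate(answer, case).values())


def pass_rate(answers: list[str], cases: list[Case]) -> float:
    """Share of answers that pass all checks for their paired test case."""
    results = [passes(answer, case) for answer, case in zip(answers, cases)]
    return sum(results) / len(results) if results else 0.0


# Hypothetical example: a flood cover query where the answer must defer to
# the policy terms and must not promise an outcome.
flood_case = Case(
    question="Does my policy cover flood damage?",
    required_phrases=["subject to your policy terms"],
    forbidden_phrases=["guaranteed"],
)
print(passes("Flood damage may be covered, subject to your policy terms.", flood_case))
print(passes("Yes, flood cover is guaranteed.", flood_case))
```

The point of the sketch is that the second answer could score well on a fluency or relevance benchmark and still fail here, because the pass criteria come from the organisation's own rules rather than from a generic score.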

Importantly, this is not about slowing down AI adoption. If anything, the opposite is true. The organisations making the most progress are those that recognise early that general-purpose benchmarks are only a starting point. They invest in domain-specific evaluation, embed governance into their deployment models, and treat AI performance as something that must be continuously measured and improved in production.

Moving towards a new class of AI industry benchmarks

There are early signs that the industry is moving in this direction. We are seeing increased focus on guardrails, auditability and explainability, particularly in collaboration with regulators. But there is still a tendency to lean on broad performance metrics as a proxy for readiness, often because they are easy to compare across vendors. That needs to change.

If AI is to be deployed safely at scale in enterprise environments, we need a new class of benchmarks: ones that reflect real-world risk, customer impact and operational accountability, and that are not just technically impressive but contextually meaningful.

Until then, organisations should be cautious about equating high benchmark performance with real-world readiness. Because in enterprise industries, the question is not whether AI works.
It is whether it works responsibly.