The Capability vs. Impact Gap in AI
Model capability far exceeds business impact. We look at why the gap persists and what might close it.
Benchmarks keep improving. Workplace workflows mostly haven't.
That mismatch isn't because the benchmark gains are fake; it's because benchmarks assume clean inputs, stable goals, and cheap verification, while business work has ambiguous requirements, messy data, and a high cost of being wrong.
Meanwhile, spending is real: Menlo Ventures estimated enterprise foundation-model API spend rose from ~$3.5B (Nov 2024) to ~$8.4B by mid-2025. Usage is widespread—ChatGPT processes over 2.5 billion prompts daily—but measurable improvements in cycle time, cost, or quality remain rare outside a few pockets.
Why benchmarks don't transfer
Lots of organizations have AI tools. Fewer have changed how they operate.
There's a difference between "we have licenses" and "we've redesigned workflows." Most teams are in the first camp: they've run pilots, maybe integrated a chatbot into customer service, perhaps adopted Copilot for code suggestions. What they haven't done is rethink processes around what AI makes possible.
Benchmark results aren't lies—models really can engage with complex material in ways that would have seemed implausible five years ago. But benchmarks measure capability in isolation: well-formed questions, clear success criteria, no organizational friction. Business tasks have shifting requirements, missing context, and expensive verification.
The remote employee thought experiment
Imagine hiring an AI model as a remote employee—not as a tool you query, but as an actual team member in Slack, taking on projects.
It falls apart quickly, and the reasons are instructive.
A human employee starts building context on day one. They learn unwritten rules, figure out that when the CEO says "soon" she means two weeks, notice that sales and engineering use the same words differently. This ambient learning happens automatically.
Without engineered memory and retrieval, the model doesn't accumulate organizational context. It doesn't know the Thompson account has been difficult, that Q3 projections are optimistic, that Dave in accounting is the real decision-maker. You have to build that context pipeline yourself.
This cascades into everything else. Without context, the model makes assumptions, and often they're subtly wrong in ways that take significant effort to identify and correct. An "80% draft" that takes almost as long to validate as it would have taken to write isn't a win.
Four categories of failure
When organizations don't see results from AI tools, the reasons cluster into four buckets. All are addressable, but they require deliberate engineering and change management.
1. Adoption and workflow fit
Tools sitting unused deliver zero value. Old habits are comfortable, and learning new tools takes activation energy that busy people don't have. The gap between free-tier and paid models is also underappreciated—someone who tried the free ChatGPT tier for ten minutes and dismissed it hasn't seen what's possible.
The 2025 Stack Overflow survey found 80% of developers now use AI tools, but trust in accuracy has fallen. Usage doesn't mean integration into core workflows.
2. Context and data access
Models perform dramatically better with context. "Write me a marketing email" produces generic output. "Write a marketing email for our B2B SaaS targeting CFOs at mid-market manufacturers evaluating us against [competitor], emphasizing integration capabilities and ROI data" produces something useful.
But providing that context requires data access—and most organizations haven't solved the permissions, retrieval, and data pipeline work needed to give models the information they need. The model can't see your CRM, your internal wiki, your Slack history, or your systems of record. Solving this is engineering work, not prompt craft.
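To make "engineering work, not prompt craft" concrete, here is a minimal sketch of that context-assembly step. The retrieval helpers (`search_crm`, `search_wiki`) are hypothetical placeholders for your own systems of record; the chat call uses the OpenAI Python SDK, but any provider would look similar.

```python
# Sketch of a context pipeline: retrieve organizational context first,
# then hand it to the model alongside the task. The search_* functions are
# hypothetical stubs; replace them with real queries against your CRM/wiki.
from openai import OpenAI

client = OpenAI()

def search_crm(account: str, limit: int = 5) -> list[str]:
    raise NotImplementedError("wire this to your CRM's search or export API")

def search_wiki(query: str, limit: int = 3) -> list[str]:
    raise NotImplementedError("wire this to your internal wiki / docs search")

def draft_marketing_email(account: str, task: str) -> str:
    # Gather the context a human colleague would already carry in their head.
    context = "\n\n".join(search_crm(account) + search_wiki("competitive positioning"))

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "You draft B2B marketing emails. Use only the provided context "
                "and flag anything you had to assume.")},
            {"role": "user", "content": f"Context:\n{context}\n\nTask: {task}"},
        ],
    )
    return response.choices[0].message.content
```

The prompt is the easy part; the two stub functions, and the permissions behind them, are where most teams stall.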
3. Reliability and verification
Models will fabricate plausible nonsense, including citations. They make subtle logical errors that sound reasonable. Using them effectively requires treating outputs as untrusted drafts that need verification.
The problem: there's often no ground truth, no acceptance tests, no way to measure "better." Without evaluation infrastructure, you can't tell whether the model is helping or generating work. Building lightweight evals for your use cases matters more than prompt engineering.
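A "lightweight eval" doesn't need a framework. Here is one sketch, assuming a hypothetical ticket-summarization task: a handful of cases with checkable expectations, run after every prompt or model change.

```python
# Sketch of a lightweight eval harness. EVAL_CASES and the generate function you
# pass in (e.g. a hypothetical summarize_ticket) stand in for your own task; the
# pattern is checkable expectations per case plus a pass rate you track over time.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    name: str
    input_text: str
    checks: list[Callable[[str], bool]]  # each check inspects the model output

EVAL_CASES = [
    EvalCase(
        name="refund_ticket",
        input_text="Customer asks for a refund on order #1182, cites broken part.",
        checks=[
            lambda out: "refund" in out.lower(),
            lambda out: "#1182" in out,          # must preserve the order number
            lambda out: len(out.split()) < 120,  # must stay concise
        ],
    ),
]

def run_evals(generate: Callable[[str], str]) -> float:
    passed = 0
    for case in EVAL_CASES:
        output = generate(case.input_text)
        if all(check(output) for check in case.checks):
            passed += 1
        else:
            print(f"FAIL: {case.name}")
    return passed / len(EVAL_CASES)

# Usage: run_evals(summarize_ticket) after every prompt or model change,
# and treat a drop in the pass rate the way you'd treat a failing test suite.
```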
4. Governance and integration
The last-mile work is substantial: APIs, retries, error handling, state machines, UI, audit trails, compliance review. Regulated industries face additional friction around data retention, access controls, and auditability.
Many AI demos work in isolation but break when integrated into production systems. The median request succeeds; the 99th-percentile case breaks the workflow. Reliability at the tails matters for anything mission-critical.
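One thin slice of that last-mile work, sketched under assumptions: retries with exponential backoff around a model call, plus an append-only audit record for every attempt. `call_model` is a stand-in for whatever client function you actually use, and the JSON-lines file stands in for whatever your compliance review actually requires.

```python
# Sketch: retries with exponential backoff plus an audit trail around a model call.
# call_model is a stand-in for your real client; the audit log is a JSON-lines
# file here, but production systems would log to whatever compliance requires.
import json
import time
from datetime import datetime, timezone

AUDIT_LOG = "model_calls.jsonl"

def call_with_audit(call_model, prompt: str, max_retries: int = 3) -> str:
    for attempt in range(1, max_retries + 1):
        record = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "attempt": attempt,
            "prompt": prompt,
        }
        try:
            output = call_model(prompt)
            record["status"] = "ok"
            record["output"] = output
            return output
        except Exception as exc:      # in real code, catch the client's error types
            record["status"] = "error"
            record["error"] = repr(exc)
            if attempt == max_retries:
                raise
            time.sleep(2 ** attempt)  # back off before retrying
        finally:
            with open(AUDIT_LOG, "a") as f:
                f.write(json.dumps(record) + "\n")
```

Multiply this by state handling, UI, and access controls, and the distance between a demo and a production integration becomes clearer.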
What closes the gap
The gap will close, though probably not as quickly as enthusiastic predictions suggest.
Humans will get better at using the tools. Best practices spread slowly. People learn what works, when to use AI and when not to, how to verify outputs efficiently. This is cultural change—it takes years, not months.
Models will improve on reliability. Capability has improved quickly; reliability and controllability are improving more slowly. Each increment makes tools more forgiving of imperfect usage.
Products will reduce cognitive load. The current paradigm—blank text box, user provides all context—puts enormous burden on users. Better products gather context automatically, run lightweight verification, and guide users toward effective usage without requiring expertise.
Specific use cases will click completely. There will be task-product-model combinations where everything works without human gap-filling and output quality is reliable enough to trust. Code completion and some customer service applications are early examples. More are coming, but predicting which ones is hard.
We're in an awkward middle period. Models are good enough to be useful, but not without effort. Most teams haven't built the habits, evaluation infrastructure, or context pipelines to bridge the gap consistently.
Teams that build this muscle—eval frameworks, context retrieval, workflow integration—will move faster than those waiting for AI to become effortless. The waiting will take a while.
Interested in exploring this further?
We're looking for early partners to push these ideas forward.