The Context Gap: Why AI Tools Underperform in Actual Work
LLMs often fail in real workflows less from lack of capability and more from missing state, norms, and permissions. Closing that gap is mostly product and integration work.
Model capability keeps rising, but deployment ROI has not kept pace. The missing piece is usually business context and workflow integration, not raw intelligence.
In this post, "context" means three things: (1) the current state of the work item, (2) the organization's norms and preferences, and (3) access-control boundaries—who can see what.
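To make that concrete, here's one way to picture a context payload for a single work item. This is an illustrative sketch; the field names and shapes are assumptions, not a reference to any particular product:

```python
from dataclasses import dataclass, field

@dataclass
class WorkItemContext:
    """Illustrative bundle of the three kinds of context described above."""
    # (1) Current state of the work item: where it lives and what has happened so far.
    state: dict = field(default_factory=dict)        # e.g. {"ticket_id": "...", "status": "in_review", "history": [...]}
    # (2) Organizational norms and preferences that shape what "good" output looks like.
    norms: dict = field(default_factory=dict)        # e.g. {"tone": "terse", "approval_chain": ["EM", "VP"]}
    # (3) Access-control boundaries: which roles may read which sources.
    permissions: dict = field(default_factory=dict)  # e.g. {"sales": ["crm"], "eng": ["crm", "roadmap_drafts"]}
```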
Why benchmarks don't predict business value
Benchmarks like MMLU and HumanEval measure isolated skill. Most business work is the opposite: embedded in systems, constrained by policy, and evaluated by downstream coordination cost.
A model's reasoning score tells you nothing about whether it can navigate your org chart, understand which customers are price-sensitive, or know that "per our last conversation" from your VP means "I'm annoyed you didn't do this already."
The data suggests this gap is widespread. MIT's Project NANDA report (July 2025) found that 95% of surveyed organizations reported zero measurable return from generative AI investments—not negative returns, zero. Roughly 60% evaluated enterprise AI systems, about 20% reached pilot stage, and only 5% made it to production.
These are survey-derived estimates with acknowledged limitations, but they match other industry signals. S&P Global reports that the share of companies abandoning most AI initiatives rose from 17% to 42% year-over-year, with organizations scrapping nearly half of proof-of-concepts before production.
Many of these aren't failures of raw capability. GPT-4, Claude, and Gemini can do impressive things in constrained settings. But impressive demos and operational value are different categories.
Why you can't just "hire an AI into Slack"
Imagine onboarding an AI like a person: give it Slack access, your wiki, your CRM. Let it absorb terminology and politics. After a month, it might become useful the way a smart new hire does.
This doesn't work, and the reasons are instructive.
Modern models have large context windows—you could theoretically dump months of Slack history into a prompt. But research on long-context performance shows this doesn't reliably solve the problem. Performance degrades on long inputs due to interference effects, and newer work suggests length itself hurts performance even with perfect retrieval. Explicit context management (fragmenting, searching, summarizing) outperforms throwing everything in.
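Here's a minimal sketch of what explicit context management looks like in practice: retrieve a few relevant fragments, compress them to a budget, then assemble the prompt. The `search_index` and `summarize` helpers are placeholders for whatever retrieval and summarization you already run, not a specific library:

```python
def build_prompt(task: str, sources: list[str], search_index, summarize, budget_tokens: int = 2000) -> str:
    """Assemble a prompt from retrieved, summarized fragments instead of a raw history dump."""
    # 1. Fragment + search: pull only the passages relevant to this task.
    fragments = search_index(query=task, corpora=sources, top_k=8)
    if not fragments:
        return f"Task: {task}\n\n(No relevant context found; ask the requester before generating.)"
    # 2. Summarize: compress each fragment so the total stays within a deliberate budget.
    notes = [summarize(text, max_tokens=budget_tokens // len(fragments)) for text in fragments]
    # 3. Assemble: the task first, then the curated context, clearly delimited.
    return f"Task: {task}\n\nRelevant context (curated, not exhaustive):\n" + "\n---\n".join(notes)
```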
Enterprise deployment also introduces constraints benchmarks ignore entirely. The OrgAccess benchmark tested access-control reasoning: even GPT-4.1 achieved an F1 score of just 0.27 on its hardest multi-permission setting. The model that can explain quantum mechanics struggles to consistently respect that sales shouldn't see engineering's draft roadmap.
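A practical consequence: don't rely on the model to reason about permissions at all. Filter what it can see before the prompt is built. A sketch, assuming a simple role-to-source allowlist (the policy table is invented for illustration):

```python
# Illustrative allowlist: which roles may read which sources. A real deployment
# would pull this from the identity provider or the source systems themselves.
ALLOWED_SOURCES = {
    "sales": {"crm", "pricing_faq"},
    "engineering": {"crm", "wiki", "roadmap_drafts"},
}

def readable_sources(role: str, requested: set[str]) -> set[str]:
    """Return only the sources this role may read; the model never sees the rest."""
    return requested & ALLOWED_SOURCES.get(role, set())

# A sales request that asks for roadmap drafts simply gets nothing from that source:
assert readable_sources("sales", {"crm", "roadmap_drafts"}) == {"crm"}
```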
Human employees do ambient learning without conscious effort—they pick up which acronyms matter, which meetings are optional, the gap between what people say and what they mean. Most LLM deployments are stateless or weakly stateful, so that learning doesn't accumulate.
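Making that learning accumulate takes deliberate persistence, not hope that the model will pick things up. A sketch of the idea, with a file-backed store standing in for whatever database you actually use:

```python
import json
from pathlib import Path

MEMORY_PATH = Path("team_memory.json")  # stand-in for a real datastore

def remember(kind: str, note: str) -> None:
    """Record a correction or preference so future prompts can include it."""
    memory = json.loads(MEMORY_PATH.read_text()) if MEMORY_PATH.exists() else []
    memory.append({"kind": kind, "note": note})
    MEMORY_PATH.write_text(json.dumps(memory, indent=2))

def recall(kind: str) -> list[str]:
    """Load accumulated notes of one kind for inclusion in future prompts."""
    if not MEMORY_PATH.exists():
        return []
    return [m["note"] for m in json.loads(MEMORY_PATH.read_text()) if m["kind"] == kind]

# Usage: a reviewer correction becomes durable context instead of evaporating.
remember("style", "Weekly updates go out Thursday, bullet points only, no adjectives.")
```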
The "80% done" problem
When context is missing, the failure mode is insidious. The model doesn't throw an error. It makes reasonable-sounding assumptions that are subtly wrong in ways that take effort to catch.
BetterUp Labs and Stanford surveyed 1,150 desk workers and found 40% had received AI-generated "workslop" in the previous month. Respondents reported spending roughly two hours resolving each incident—an estimated $186 per month in lost productivity per employee. (This is a vendor-conducted survey; treat as directional.)
A first draft that handles most of what you need sounds like a win. But reviewing and fixing that draft (catching wrong tone, fabricated details, mismatched assumptions) can eat 40% of the time the original work would have taken, and the review often lands on someone other than the person who generated it. Much of the saving evaporates, and what remains comes with a coordination tax.
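Back-of-the-envelope arithmetic makes the trap concrete. Every number below is an assumption for illustration, not a measurement:

```python
# Illustrative only: hours are assumed, not measured.
original_hours = 2.0        # writing the document from scratch
author_ai_hours = 0.4       # author's time prompting and lightly editing the draft
reviewer_fix_hours = 0.8    # reviewer's time catching tone, fabricated details, bad assumptions

author_savings = original_hours - author_ai_hours                        # looks great: 1.6 hours
team_savings = original_hours - (author_ai_hours + reviewer_fix_hours)   # much smaller: 0.8 hours
print(f"Author 'saves' {author_savings:.1f}h; the team nets {team_savings:.1f}h, "
      f"and {reviewer_fix_hours:.1f}h of the cost moved onto the reviewer.")
```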
Developers feel this acutely. Stack Overflow's 2025 survey found 66% of developers say their biggest AI frustration is outputs that are "almost right, but not quite." Even more telling: 45% say debugging AI-generated code takes more time than writing it themselves.
Gartner found a version of this at the organizational level. In supply chain organizations, 72% deployed generative AI and individual workers saved about 1.5 hours per week. But those individual savings showed no correlation with improved team-level output or quality. The time saved got absorbed somewhere—likely in coordination, review, and fixing errors.
Why "just write better prompts" doesn't scale
Most people interact with AI like a search engine: question in, answer out. But models aren't lookup tables; they're synthesis engines, good at transformation, iteration, and combining information. Using them like Google taps maybe 20% of what they can do.
The typical prompt is remarkably sparse. Analysis of the LMSYS-Chat-1M dataset shows an average of just 69.5 tokens per user prompt, compared to 214.5 tokens per response. People ask for "a marketing email" and get generic output.
But "educate the users" has never been a scalable solution. People have jobs to do; they're not completing prompt engineering courses before sending a Slack message.
GitHub Copilot has crossed 20 million all-time users and is reportedly used by 90% of the Fortune 100. But Microsoft and GitHub don't disclose retention metrics—monthly or daily active users, or how many are still using it six months later. The gap between "tried it" and "integrated it into daily work" is where value either materializes or doesn't.
What actually works
If you're waiting for models to get smart enough that context stops mattering, you'll wait a long time. Better models help at the margins, but they don't solve the fundamental problem: your business context isn't in the training data and can't be fully captured in a system prompt.
A generic chat interface is a decent research tool but a poor workflow interface. Real systems should (see the sketch after this list):
- Auto-load relevant context from systems of record (CRM, wiki, ticket history)
- Ask clarifying questions before generating, not after
- Make permissions explicit and auditable
- Run lightweight verification where ground truth exists
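Here's a rough sketch of how those four behaviors fit into a single request path. Every helper on the `services` object (`load_context`, `missing_fields`, `allowed`, `audit`, `generate`, `verify`) is a placeholder for your own integration code, not a real API:

```python
def handle_request(user_role, task, services):
    """Illustrative request path for a context-aware assistant."""
    # 1. Auto-load relevant context from systems of record.
    context = services.load_context(task, sources=("crm", "wiki", "tickets"))

    # 2. Ask clarifying questions before generating, not after.
    gaps = services.missing_fields(task, context)
    if gaps:
        return {"action": "clarify", "questions": gaps}

    # 3. Make permissions explicit and auditable.
    context = {src: data for src, data in context.items() if services.allowed(user_role, src)}
    services.audit(user_role, task, sources=list(context))

    # 4. Generate, then run lightweight verification where ground truth exists.
    draft = services.generate(task, context)
    issues = services.verify(draft, context)   # e.g. do totals, dates, and names match the CRM record?
    return {"action": "review" if issues else "send", "draft": draft, "issues": issues}
```

The point isn't the specific helpers but the shape: context and permissions are resolved before generation, and verification runs before anything ships.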
Gartner data on AI durability supports this: in high-maturity organizations, 45% of AI initiatives remain in production for 3+ years, compared to 20% in low-maturity organizations. Maturity isn't about better models—it's about data practices, integration, and workflow design.
BCG's "AI at Work 2025" report frames it as a workflow redesign problem: value gets unlocked when organizations reshape processes end-to-end rather than dropping AI into existing workflows. The tool alone isn't the product; the tool plus context plus workflow is.
Practical checklist
If you're deploying AI in workflows, ask:
- Where does context live today? CRM, wiki, Slack, email, tickets? Can the system access it?
- What are the permission boundaries? Who shouldn't see what? Can the system respect that?
- How will you measure "good enough"? Without evals, you're guessing (see the minimal eval sketch after this checklist).
- What's the review overhead? If fixing outputs consumes most of the time you nominally saved, you haven't saved much.
- What happens when it's wrong? Audit trail? Rollback? Who's accountable?
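On the evals point: "good enough" doesn't need a research-grade benchmark. A minimal version is a handful of real cases with known-good expectations, re-run on every prompt, retrieval, or model change. The cases and checks below are invented examples:

```python
# Tiny regression eval: real cases from your own workflow, with checks you can automate.
EVAL_CASES = [
    {"input": "Summarize ticket #4821 for the customer",
     "must_include": ["refund", "5-7 business days"]},
    {"input": "Draft the weekly status update for the platform team",
     "must_include": ["blockers", "next steps"]},
]

def run_evals(generate) -> float:
    """Return the pass rate of `generate` (your model-calling function) over the cases."""
    passed = 0
    for case in EVAL_CASES:
        output = generate(case["input"]).lower()
        if all(term.lower() in output for term in case["must_include"]):
            passed += 1
    return passed / len(EVAL_CASES)

# Run this on every change; a falling pass rate is your early warning.
```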
The teams making progress treat this as a product and integration problem, not a procurement problem.
Interested in exploring this further?
We're looking for early partners to push these ideas forward.