December 11, 2025

Technical

First Impressions of GPT-5.2: Early Tests in Our MCP-Based Eval Harness

Early observations from running GPT-5.2 through our MCP research setup. Too early for conclusions, but here's what we're seeing.

Phil Glazer, Founder
4 min read

OpenAI released GPT-5.2 today (December 11, 2025), and we're running it through our internal eval harness to see what changes show up in the workflows we care about. It's too early for definitive claims, but I wanted to capture some initial observations while they're fresh—and be upfront about what we still need to test before forming a real opinion.

What GPT-5.2 is and how we're testing it

GPT-5.2 is OpenAI's latest iteration on the GPT-5 family. GPT-5 launched August 7, 2025, so this is roughly a four-month gap—faster than the longer cycles we saw between GPT-3 and GPT-4.

We're testing GPT-5.2 via the OpenAI API using our MCP-based harness: same prompts, same tools, same documents, different model. MCP (Model Context Protocol) is an open standard for connecting LLMs to tools and data sources; our contribution is the specific set of eval tasks we run. This gives us a controlled environment to compare outputs on research synthesis, document analysis, and multi-step reasoning. It's not a comprehensive benchmark, but it's representative of how we actually use these models.
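To make that concrete, here's a minimal sketch of the comparison loop, assuming the OpenAI Python SDK; the model identifiers and the single task shown are placeholders, not our actual eval set.

```python
# Minimal sketch of the A/B setup: same prompts and documents, different model.
# Model identifiers and the task list are illustrative, not the production harness.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MODELS = ["gpt-5", "gpt-5.2"]  # hypothetical identifiers for the comparison
TASKS = [
    {
        "id": "synthesis-01",
        "system": "You are a research assistant.",
        "prompt": "Summarize the key claims in the attached documents.",
    },
]

def run_task(model: str, task: dict) -> str:
    """Run one eval task against one model and return the raw completion text."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": task["system"]},
            {"role": "user", "content": task["prompt"]},
        ],
        temperature=0,  # keep sampling fixed so only the model varies
    )
    return response.choices[0].message.content

# Same tasks, different model: the only variable is the model string.
results = {m: {t["id"]: run_task(m, t) for t in TASKS} for m in MODELS}
```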

The GPT-5 context: cost optimization vs. capability gains

When GPT-5 launched in August, our subjective impression was that the most noticeable change was efficiency—serving behavior, latency, and pricing—rather than a step-change in what tasks we could reliably automate. Cheaper tokens are genuinely valuable for developers building on the API.

But from a capability standpoint? The jump from GPT-4 to GPT-5 didn't feel like the leap from GPT-3.5 to GPT-4. We didn't suddenly find ourselves able to automate tasks that had previously failed. The model got more efficient, responses got cheaper, but the ceiling of what we could accomplish didn't obviously rise.

This isn't a criticism—cost optimization is valuable work, and the market now has credible alternatives (Anthropic, plus a fast-moving open-source ecosystem), so efficiency and pricing matter more than they did in 2023. But it does mean GPT-5.2 arrives with a specific question: is this another efficiency update, or does it meaningfully improve reliability on hard tasks?

Initial observations from our MCP setup

So far the deltas look real, but I can't tell yet whether they're quality improvements or just style drift.

Anecdotally, the model seems to handle long context more gracefully. We've been feeding it ~80-100k tokens of source documents and asking for multi-step synthesis. GPT-5.2 seems less likely to "lose the thread" late in the conversation than GPT-5 did—but we haven't scored this rigorously yet.
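When we do score it, the check will probably look something like this: plant facts at known depths in the source context and see which ones survive into the synthesis. The sketch below is illustrative only; the facts and depths are made up.

```python
# Rough sketch: plant reference facts at known relative depths in a long context
# and check whether each one is recalled in the model's synthesis.
def late_context_recall(synthesis: str, planted_facts: dict[float, str]) -> dict[float, bool]:
    """Map each fact's relative depth (0.0 = start, 1.0 = end) to whether it was recalled."""
    return {depth: fact.lower() in synthesis.lower() for depth, fact in planted_facts.items()}

facts = {
    0.1: "the pilot study enrolled 42 participants",
    0.5: "the follow-up window was 18 months",
    0.9: "the control arm was dropped in revision two",
}
# recall = late_context_recall(model_output, facts)
# A model that "loses the thread" tends to miss the 0.9-depth fact first.
```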

Latency looks similar in our setup (same region, streaming on). If latency stays flat, that matters: many capability bumps show up first as higher latency or cost.
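One simple way to measure that with streaming on is time to first token. A rough sketch, assuming the OpenAI Python SDK; the model identifier is a placeholder:

```python
# Sketch of a time-to-first-token measurement with streaming enabled.
import time
from openai import OpenAI

client = OpenAI()

def time_to_first_token(model: str, prompt: str) -> float:
    """Return seconds from request start until the first streamed content chunk."""
    start = time.monotonic()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.monotonic() - start
    return time.monotonic() - start  # fallback: no content chunks arrived
```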

On straightforward research questions, the outputs are solid but not revelatory. I haven't hit a moment yet where I thought "GPT-5 couldn't have done this." But again, we're hours into testing, not days.

What's left to evaluate: coding tasks, tool calling, and beyond

The research eval only tells part of the story. We need to run GPT-5.2 through several other workflows before forming a genuine opinion:

Coding tasks are next. A lot of our internal tooling relies on AI-assisted development, and code quality is where model differences often show up most clearly. Does GPT-5.2 produce cleaner implementations? Handle edge cases better? We'll find out.

Tool calling and function execution matter for any multi-step workflow. GPT-5 was decent at structured outputs and function calling, but there were still failure modes around complex tool chains. If 5.2 tightens that up, it could be significant for production applications.
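A sketch of the kind of check we have in mind: declare one tool and verify the model emits a well-formed call to it. The tool name and schema below are invented for illustration, assuming the OpenAI Python SDK's function-calling interface.

```python
# Sketch of a single-step tool-calling check: does the model emit a well-formed
# call to the declared function? Tool name and schema are illustrative.
import json
from openai import OpenAI

client = OpenAI()

TOOLS = [{
    "type": "function",
    "function": {
        "name": "lookup_citation",
        "description": "Fetch metadata for a cited document by its identifier.",
        "parameters": {
            "type": "object",
            "properties": {"doc_id": {"type": "string"}},
            "required": ["doc_id"],
        },
    },
}]

def tool_call_is_well_formed(model: str, prompt: str) -> bool:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        tools=TOOLS,
    )
    calls = response.choices[0].message.tool_calls or []
    if not calls:
        return False
    try:
        args = json.loads(calls[0].function.arguments)
    except json.JSONDecodeError:
        return False  # a classic failure mode: malformed JSON arguments
    return calls[0].function.name == "lookup_citation" and "doc_id" in args
```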

We're also curious about instruction following on nuanced tasks—the kind of work where you need the model to maintain specific constraints across a long output. That's historically been a weak point, and improvements there would be genuinely useful.
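Those constraint checks can be largely mechanical. Here's a small sketch of what an audit might look like; the specific constraints (word cap, required section, banned phrase) are invented for illustration.

```python
# Sketch of a mechanical constraint audit for a long output.
def audit_constraints(output: str) -> dict[str, bool]:
    words = output.split()
    return {
        "under_800_words": len(words) <= 800,
        "has_limitations_section": "## Limitations" in output,
        "avoids_banned_phrase": "as an AI" not in output.lower(),
    }

# failures = [name for name, ok in audit_constraints(model_output).items() if not ok]
```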

Early take and what we're watching for

My honest assessment right now: I don't have an opinion yet, and a few hours after release, that's probably the right state to be in.

What I'm watching for as we continue testing: evidence of improved reasoning depth rather than just polish. The models have gotten good at sounding confident and coherent, but the question is whether GPT-5.2 can handle problems that genuinely stumped GPT-5. We'll be designing some specific test cases around multi-step logical problems and ambiguous instructions to probe that.

We'll post a more comprehensive take once we've run it through our full test suite. For now, cautious curiosity seems appropriate.

Interested in exploring this further?

We're looking for early partners to push these ideas forward.