
I Tested 14 AI Coding Models on One Real-World TypeScript Challenge. Here's What Actually Works.
Everyone benchmarks models on LeetCode. I benchmarked 14 on a real async pattern — debounce with cancellation. Only 3 got it right.
Everyone benchmarks models on LeetCode and HumanEval. Nobody benchmarks them on the messy, real-world patterns that actually break production code.
So I did.
I gave 14 coding models — from small local models to large cloud APIs — a single prompt:
"Write a TypeScript function that debounces async API calls with cancellation support."
Simple enough to describe. Deceptively hard to get right. This pattern lives at the intersection of timing, async control flow, and HTTP lifecycle management — exactly the kind of thing you'd ask a coding agent to help you with at 2am when you're wiring up a search-as-you-type feature. Get it wrong and your users see stale search results flickering over fresh ones, your app fires dozens of redundant API calls on every keystroke, and promises pile up in memory waiting for responses nobody needs anymore.
Here's what I found.
Why This Benchmark Matters
Most model benchmarks test whether an AI can solve algorithmic puzzles. That's fine, but it's not how developers actually use coding agents. In practice, you're asking models to produce correct, production-grade utility code — the kind of function that silently breaks in edge cases if the model doesn't understand the full problem space.
A proper debounce-with-cancellation function needs to handle five things:
- Debounce logic — delay execution until input settles
- No hanging promises — cancelled calls shouldn't leave unresolved promises
- HTTP cancellation — use `AbortController` to actually cancel in-flight requests, not just ignore their results
- Compilation — the TypeScript needs to, you know, compile
- Race condition guards — ensure a slow early response doesn't overwrite a fast later one
Miss any of these and you ship a function that looks correct but silently leaks memory, wastes bandwidth, or shows stale data.
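For reference, here is a minimal baseline that hits all five criteria. This is my own sketch, not any model's output, and it assumes the wrapped function accepts an `AbortSignal` as its last argument:

```typescript
// Minimal debounce-with-cancellation sketch (author's baseline, not a
// model's output). The wrapped function must take an AbortSignal as
// its last argument so in-flight HTTP requests can actually be cancelled.
function debounceAsync<TArgs extends unknown[], TResult>(
  fn: (...args: [...TArgs, AbortSignal]) => Promise<TResult>,
  delayMs: number
): (...args: TArgs) => Promise<TResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  let currentController: AbortController | undefined;
  let pendingReject: ((reason: Error) => void) | undefined;

  return (...args: TArgs) =>
    new Promise<TResult>((resolve, reject) => {
      // Supersede the previous call: stop its timer, abort its
      // in-flight request, and reject its promise so nothing hangs.
      if (timer !== undefined) clearTimeout(timer);
      currentController?.abort();
      pendingReject?.(new Error("superseded"));

      const controller = new AbortController();
      currentController = controller;
      pendingReject = reject;

      timer = setTimeout(async () => {
        try {
          const result = await fn(...args, controller.signal);
          // Race guard: only the most recent call may resolve.
          if (currentController === controller) resolve(result);
        } catch (err) {
          if (currentController === controller) reject(err as Error);
        }
      }, delayMs);
    });
}
```

One design choice to note: superseded calls reject with an Error, so callers need a `.catch`; resolving them to a sentinel value instead is an equally defensible design.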
The Results: Only 3 Out of 14 Got Everything Right
| Model | Debounce | No Hanging Promises | HTTP Cancel | Compiles | Race Guard | Score |
|---|---|---|---|---|---|---|
| gpt-oss:120b cloud | ✅ | ✅ | ✅ | ✅ | ✅ | 5/5 |
| Claude Sonnet 4.6 | ✅ | ✅ | ✅ | ✅ | ✅ | 5/5 |
| GLM-5 cloud | ✅ | ✅ | ✅ | ✅ | ✅ | 5/5 |
| Gemini 3 Flash | ✅ | ✅ | ✅ | ✅ | ❌ | 4/5 |
| GLM-4.7 cloud | ✅ | ✅ | ✅ | ✅ | ❌ | 4/5 |
| gpt-oss:20b local | ✅ | ✅ | ✅ | ✅ | ❌ | 4/5 |
| Kimi K2.5 cloud | ✅ | ✅ | ✅ | ⚠️ | ❌ | 3.5/5 |
| Cogito 2.1 671b | ✅ | ✅ | ❌ | ✅ | ❌ | 3/5 |
| qwen3-coder:480b | ✅ | ✅ | ❌ | ✅ | ❌ | 3/5 |
| qwen3-coder-next | ✅ | ✅ | ✅ | ❌ | ❌ | 3/5 |
| MiniMax M2.5 cloud | ✅ | ⚠️ | ⚠️ | ⚠️ | ❌ | 2/5 |
| Gemma3 27b cloud | ❌ | ❌ | ✅ | ✅ | ❌ | 2/5 |
| DeepSeek V3.2 cloud | ⚠️ | ❌ | ❌ | ✅ | ❌ | 1.5/5 |
| qwen3.5:9b local | ❌ | ❌ | ❌ | ❌ | ❌ | 0/5 |
The most common failure? Race condition guards. 11 out of 14 models missed it entirely. This is the subtlest requirement — you need to check that the response you're about to return actually corresponds to the most recent call, not an earlier one that happened to resolve late. It's the kind of bug that only manifests under real network conditions and is notoriously hard to reproduce in testing.
The Top Tier: What Separates the Best
gpt-oss:120b — The Overachiever
gpt-oss is OpenAI's open-source model family, and its 120B variant was the standout of this benchmark. Not only did it nail all five criteria, it produced the most sophisticated type signature I've seen for this pattern: `[...TArgs, AbortSignal]`, threading the abort signal as the last argument. It also included a React hook example and a gotchas table in its response. The downside? It requires the target function to accept the signal as its last argument, which means refactoring existing function signatures.
Claude Sonnet 4.6 — The Pragmatist
The most elegant solution at roughly 30 lines. Used `signal.addEventListener('abort', ...)` to wire cancellation, which means it works with any existing async function without signature changes. Least verbose documentation of the top three, but the code speaks for itself. Full disclosure: I use Claude Code daily, and Claude still didn't win outright — gpt-oss:120b matched it on correctness and exceeded it on documentation. That's the point of running honest benchmarks.
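To show what the listener-based wiring looks like in contrast to the signal-as-last-argument style, here is my reconstruction of the idea, not Claude's actual output. The debouncer owns the `AbortController` and uses an `'abort'` listener to reject superseded promises, so the wrapped function's signature never changes:

```typescript
// Listener-based wiring sketch (a reconstruction, not Claude's code):
// the 'abort' listener rejects the superseded promise so it never
// hangs, and the wrapped function needs no signature changes.
function debounceListener<TArgs extends unknown[], TResult>(
  fn: (...args: TArgs) => Promise<TResult>,
  delayMs: number
): (...args: TArgs) => Promise<TResult> {
  let controller: AbortController | undefined;
  return (...args: TArgs) => {
    controller?.abort(); // cancel the previous pending call
    const mine = new AbortController();
    controller = mine;
    return new Promise<TResult>((resolve, reject) => {
      // Fires when a newer call supersedes this one.
      mine.signal.addEventListener("abort", () =>
        reject(new Error("superseded"))
      );
      setTimeout(() => {
        if (mine.signal.aborted) return; // already superseded
        fn(...args).then(
          // Race guard: drop results from superseded calls.
          (value) => { if (!mine.signal.aborted) resolve(value); },
          reject
        );
      }, delayMs);
    });
  };
}
```

Note the trade-off in this simplified sketch: because the signal never reaches `fn`, the HTTP request itself keeps running after abort; for true HTTP cancellation you would still forward `mine.signal` into the underlying `fetch`.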
GLM-5 — The Thoroughbred
GLM-5 is the latest model from Zhipu AI, a Beijing-based lab that's been quietly shipping competitive models. It delivered the most thorough edge case handling of any model in this benchmark. It introduced a `pendingReject` pattern to solve hanging promises and included a `currentController === controller` check for race conditions — the key guard that most models missed:
```typescript
// The race condition guard that 11 of 14 models missed:
// if a slower early request resolves after a faster later one,
// this check ensures the stale result gets discarded.
if (currentController === controller) {
  resolve(result);
} else {
  // A newer call has already taken over — discard this result
}
```
The trade-off is verbosity — this is the solution you'd want if you were building a library, but might be overkill for an internal utility.
The Middle Pack: Good Enough?
Gemini 3 Flash deserves special mention. At 11 seconds response time, it was the fastest cloud model and got 4 out of 5 criteria right. If you're optimizing for speed and cost and can add a race guard yourself, this is arguably the best value.
GLM-4.7 (also from Zhipu AI) landed right alongside Gemini 3 Flash at 4/5 — solid all-around but missing the race guard and the `pendingReject` hanging-promise fix that its bigger sibling GLM-5 nailed.
gpt-oss:20b local is the story of the benchmark for me. OpenAI's smallest open-source model, running locally, for free, at 90 tokens/second — this 20B model matched or beat cloud models 10-25x its size. It's using MXFP4 quantization, and the quality-to-size ratio is remarkable. If you're building a local coding assistant pipeline, this is your starting point.
Cogito 2.1 at 671B parameters — the largest model tested — produced the cleanest, most readable code and was the only model to preserve `this` context (a thoughtful touch). But it didn't wire up `AbortController` at all, relying on promise-only cancellation. Being the biggest model doesn't mean being the best.
The Failures Worth Studying
DeepSeek V3.2 had an `args` closure bug — it captured arguments at the wrong scope, meaning the debounced function could fire with stale arguments. This is exactly the kind of subtle bug that passes a quick visual review.
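The bug class is easy to reconstruct. This is my illustration of wrong-scope argument capture, not DeepSeek's literal code:

```typescript
// Reconstruction of the stale-argument bug class (not DeepSeek's
// literal output): the pending argument is only written on the first
// call, so the timer fires with a stale value.
function staleArgsDebounce<T>(fn: (arg: T) => void, delayMs: number) {
  let timer: ReturnType<typeof setTimeout> | undefined;
  let pendingArg: T;
  return (arg: T) => {
    if (timer === undefined) {
      pendingArg = arg; // BUG: later calls never update pendingArg
    } else {
      clearTimeout(timer);
    }
    timer = setTimeout(() => {
      timer = undefined;
      fn(pendingArg); // fires with the FIRST call's argument
    }, delayMs);
  };
}

// The fix is to close over the current call's argument directly:
//   timer = setTimeout(() => fn(arg), delayMs);
```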
Gemma3 27b inverted the debounce logic entirely. Its execute() function fires immediately instead of after the delay. The code compiles, looks reasonable, and does the exact opposite of what it should. This is the scariest failure mode — code that's confidently wrong.
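The inversion is worth seeing in miniature. This is my illustration of the failure class, not Gemma3's literal output: the function runs immediately and the timer merely suppresses later calls, which is leading-edge throttling, not debouncing.

```typescript
// Reconstruction of the inversion (not Gemma3's literal code): fn
// fires NOW, and the timer only blocks future calls. That's a
// leading-edge throttle — the opposite of a debounce.
function invertedDebounce(fn: () => void, delayMs: number) {
  let locked = false;
  return () => {
    if (locked) return;
    locked = true;
    fn(); // executes immediately instead of after the delay
    setTimeout(() => { locked = false; }, delayMs);
  };
}
```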
qwen3.5:9b at 9 billion parameters was simply too small for this task. Fast and free, but it produced scaffolding without working logic. Know your model's limits.
Speed and Cost: The Practical Matrix
| Model | Response Time | Where It Runs | Cost |
|---|---|---|---|
| Claude Sonnet 4.6 | Instant | Cloud | $200/mo (Claude Code Max) |
| gpt-oss:20b | ~1-2s | Local | Free |
| qwen3.5:9b | ~1-2s | Local | Free |
| Gemini 3 Flash | ~11s | Cloud | Pay-per-use |
| gpt-oss:120b | ~21s | Cloud | Pay-per-use |
| DeepSeek V3.2 | ~25s | Cloud | Pay-per-use |
| Gemma3 27b | ~26s | Cloud | Pay-per-use |
| qwen3-coder:480b | ~32s | Cloud | Pay-per-use |
| Cogito 2.1 671b | ~36s | Cloud | Pay-per-use |
| GLM-4.7 | ~42s | Cloud | Pay-per-use |
| GLM-5 | ~43s | Cloud | Pay-per-use |
| Kimi K2.5 | 2+ min | Cloud | Pay-per-use |
| MiniMax M2.5 | 3+ min | Cloud | Pay-per-use |
The speed differences are dramatic. Claude Sonnet 4.6 responds instantly (streaming through Claude Code). Kimi K2.5 and MiniMax M2.5 took over two minutes — an eternity when you're in flow state.
What This Tells Us About Choosing a Coding Model
Parameter count doesn't predict code quality. Cogito at 671B parameters lost to gpt-oss at 20B. The correlation between model size and correctness on this task was essentially zero.
The gap between "compiles" and "correct" is where bugs live. 12 of 14 models produced code that compiles. Only 3 produced code that's actually correct under all conditions. If your workflow is "generate code, see if it compiles, ship it" — this benchmark should make you nervous.
Local models are getting scary good. gpt-oss:20b running locally at 90 tokens/second, for free, producing 4/5 correct code is a different world than where we were even six months ago.
`AbortController` is a litmus test. Whether a model wires up proper HTTP cancellation (rather than just creating an `AbortController` and never passing it anywhere — looking at you, qwen3-coder) reveals whether it actually understands the purpose of the pattern or is just pattern-matching on syntax.
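The litmus test is easy to demonstrate without a network. The `fakeRequest` helper below is a hypothetical stand-in for `fetch` that honors an `AbortSignal`, so the difference between wiring the signal and dropping it on the floor is directly observable:

```typescript
// A stand-in for fetch that honors AbortSignal, so the difference
// between wiring and not wiring the signal is observable offline.
function fakeRequest(ms: number, signal?: AbortSignal): Promise<string> {
  return new Promise((resolve, reject) => {
    const t = setTimeout(() => resolve("data"), ms);
    signal?.addEventListener("abort", () => {
      clearTimeout(t);
      reject(new Error("aborted"));
    });
  });
}

// The failure pattern: a controller exists, but its signal never
// reaches the request, so abort() cancels nothing.
async function brokenCancel(): Promise<string> {
  const controller = new AbortController();
  const pending = fakeRequest(20); // BUG: signal not passed
  controller.abort(); // has no effect on the request above
  return pending;
}

// The correct wiring: abort() actually cancels the in-flight request.
async function realCancel(): Promise<string> {
  const controller = new AbortController();
  const pending = fakeRequest(20, controller.signal);
  controller.abort();
  return pending;
}
```

With real `fetch`, the wiring is the same: pass `controller.signal` in the request options, or the controller is decoration.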
My Recommendations
If you want the best code quality and use a coding agent daily: Claude Sonnet 4.6 or gpt-oss:120b. Both got perfect scores. Claude wins on elegance and speed; gpt-oss:120b wins on thoroughness and documentation.
If you want the best free/local option: gpt-oss:20b, no contest. 4/5 correctness at local speeds with zero cost.
If you want the best balance of speed, cost, and quality: Gemini 3 Flash. Fast, cheap, 4/5 correct. Add your own race guard and you're golden.
If you're evaluating models for your team: Run this test. Seriously. A single well-chosen prompt that touches async patterns, TypeScript generics, and API lifecycle management will tell you more about a model's real-world coding ability than any leaderboard score.
Methodology note: All models received the same prompt with no additional context or follow-up. Cloud models were accessed through their respective APIs via Ollama unless otherwise noted. Local models ran on an i9-14900K with an RTX 5060 Ti 16GB, though I regularly run the same and larger models on a MacBook Pro M4 Max with 48GB of unified memory. Code was evaluated for compilation and manually reviewed for correctness and edge case handling. This is a single-prompt benchmark — production coding agent performance involves multi-turn interaction, context awareness, and tool use, which are not captured here.
What real-world async pattern should I benchmark next? I'm leaning toward WebSocket reconnection with exponential backoff — another pattern where "compiles" and "correct" are miles apart. Find me on X, LinkedIn, or GitHub.