We Benchmarked Frontier Reasoning Models on The Atlantic's Bracket City
A deep dive into how frontier AI models perform on complex word puzzles, revealing surprising insights about reasoning efficiency vs. accuracy.
I’ve been obsessed with The Atlantic’s new word game, Bracket City. It’s a puzzle where clues hide inside nested brackets. You start with something like

[["___ of Arabia"] who starred in "[monosyllabic Michael [capital of Mississippi] album] Boys"]

and work your way inward, first solving “___ of Arabia” (Lawrence) and “capital of Mississippi” (Jackson), revealing

[Lawrence who starred in [monosyllabic Michael Jackson album] Boys]

Each solved clue unlocks new brackets until you reveal a final statement.
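To make the mechanic concrete, here is a minimal TypeScript sketch of that inside-out substitution. It is my own illustration of the rule, not the game’s actual code: given a map of already-solved clues, it repeatedly finds a bracket with nothing nested inside it and swaps in the answer.

```ts
// Minimal sketch of the core mechanic: resolve the innermost bracket first,
// then repeat on the rewritten puzzle until an unsolved clue (or no bracket) remains.
function solveInnermost(puzzle: string, answers: Record<string, string>): string {
  const innermost = /\[([^\[\]]+)\]/; // a [clue] with no brackets nested inside it
  let current = puzzle;
  let match: RegExpExecArray | null;
  while ((match = innermost.exec(current)) !== null) {
    const answer = answers[match[1]];
    if (answer === undefined) break; // clue not solved yet; stop here
    current = current.replace(match[0], answer); // rewrite the puzzle with the answer
  }
  return current;
}

const puzzle =
  '[["___ of Arabia"] who starred in "[monosyllabic Michael [capital of Mississippi] album] Boys"]';
console.log(
  solveInnermost(puzzle, {
    '"___ of Arabia"': "Lawrence",
    "capital of Mississippi": "Jackson",
  })
);
// -> [Lawrence who starred in "[monosyllabic Michael Jackson album] Boys"]
```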
I’m ashamed to admit that when I get stuck, I’ll paste a screenshot into ChatGPT o3 and ask for help. It’s freakishly good. At one point, I noticed in the chain of thought that o3 often tries to reason through the entire puzzle. So I got curious: I started pasting screenshots of completely unsolved puzzles and asking o3 to work through the whole thing. Often, it would nail the solution perfectly.
That’s when it hit me: if these models can solve complex word puzzles, how do they actually compare? Which ones excel at this kind of deep, recursive reasoning?
Claude 4 Opus solving a recent Bracket City puzzle
Building the Benchmark
My first approach seemed obvious: feed screenshots to each model and see who wins.
It worked okay, but the inference endpoints timed out constantly. Models would get halfway through reasoning and just… stop. Even when they didn’t time out, the visual parsing was inconsistent: some models would misread brackets or lose track of nested structures entirely.
The breakthrough came when I reimplemented Bracket City’s game logic as LLM tool calls:
```ts
// Each Bracket City action becomes a tool the model can call. Penalties mirror
// the human scoring rules: -2 for a wrong guess, -5 for a hint, -15 for a reveal.
const tools = {
  // Submit an answer for one clue (wrong guesses cost 2 points).
  makeGuess: {
    description: "Make a guess for a specific bracket clue",
    parameters: {
      clue: "The clue text inside brackets (without the brackets)",
      guess: "Your guess for the answer to this clue",
    },
  },
  // Ask for the answer's first letter (costs 5 points).
  getHint: {
    description: "Get a hint (first letter) for a difficult clue",
    parameters: {
      clue: "The clue text inside brackets (without the brackets)",
    },
  },
  // Give up and reveal the full answer (costs 15 points).
  revealClue: {
    description: "Reveal the full answer for a clue (last resort)",
    parameters: {
      clue: "The clue text inside brackets (without the brackets)",
    },
  },
};
```
This approach worked beautifully. Models could now interact with puzzles programmatically, with the harness maintaining puzzle state so they could focus purely on the reasoning challenge. I gave each model up to 50 steps to solve a puzzle, with the same scoring rules as human players: start at 100 points, lose 2 for wrong guesses, 5 for hints, and 15 for reveals.
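In code, that scoring boils down to something like this sketch (the floor at zero is my assumption rather than a rule I verified):

```ts
// Score a single puzzle run from the penalties the model incurred.
type Penalty = "wrongGuess" | "hint" | "reveal";

const PENALTY_POINTS: Record<Penalty, number> = {
  wrongGuess: 2, // incorrect makeGuess call
  hint: 5,       // getHint call
  reveal: 15,    // revealClue call
};

function scoreRun(penalties: Penalty[]): number {
  const lost = penalties.reduce((sum, p) => sum + PENALTY_POINTS[p], 0);
  return Math.max(0, 100 - lost); // assumption: scores don't go below zero
}

// Two wrong guesses and one hint: 100 - 2 - 2 - 5 = 91
console.log(scoreRun(["wrongGuess", "wrongGuess", "hint"]));
```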
The system prompt emphasized working from the innermost brackets outward—a key strategy for solving these puzzles efficiently. Models that understood this recursive pattern performed significantly better.
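For a sense of how the harness drives each model, here is a rough sketch of the evaluation loop. `callModel` and `applyTool` are hypothetical stand-ins for the model client and the game engine (not real APIs), and the prompt text is a paraphrase of the strategy described above rather than the exact system prompt.

```ts
// Hypothetical shapes for a tool call and the game's response to it.
type ToolCall = { tool: "makeGuess" | "getHint" | "revealClue"; clue: string; guess?: string };
type ToolResult = { feedback: string; solved: boolean };

// Stand-ins for the real model client and Bracket City game engine.
declare function callModel(system: string, transcript: string[]): Promise<ToolCall>;
declare function applyTool(call: ToolCall): ToolResult;

const SYSTEM_PROMPT =
  "Solve the puzzle by working from the innermost brackets outward: " +
  "answer clues that contain no nested brackets first, then re-read the puzzle. " +
  "Use getHint or revealClue only as a last resort.";

async function runPuzzle(puzzle: string, maxSteps = 50): Promise<string[]> {
  const transcript = [puzzle];
  for (let step = 0; step < maxSteps; step++) {
    const call = await callModel(SYSTEM_PROMPT, transcript); // model picks its next tool call
    const result = applyTool(call);                          // harness updates the real game state
    transcript.push(result.feedback);
    if (result.solved) break; // final statement revealed
  }
  return transcript;
}
```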
The Results
I tested 16 frontier models across 20 different Bracket City puzzles. The results revealed a fascinating tradeoff between accuracy and efficiency:
The Top Scorer: o3-high
- Average score: 92.11/100
- Success rate: 100%
- Average time per puzzle: 11 minutes
The Efficiency Champion: Claude 4 Opus
- Average score: 88.9/100
- Success rate: 100%
- Average time per puzzle: 3 minutes
While o3-high technically won with the highest score, it came at a steep cost, taking about 3.6x as long as Claude 4 Opus to achieve only marginally better results. This raises a crucial question: is a 3.2-point improvement in average score worth more than tripling your inference time?
Here’s the complete leaderboard:
Rank | Model | Average Score | Success Rate | Avg Time (seconds) |
---|---|---|---|---|
1 | o3-high | 92.11 | 100% | 660.41 |
2 | claude-4-opus-20250514-32k-thinking | 88.9 | 100% | 183.47 |
3 | grok-4-07-09 | 86.15 | 95% | 309.49 |
4 | claude-4-sonnet-20250514-32k-thinking | 85.4 | 100% | 172.5 |
5 | claude-3.7-sonnet-20250219-32k-thinking | 70.75 | 90% | 152.3 |
6 | gemini-2.5-pro-preview-06-05 | 70.3 | 100% | 1185.29 |
7 | gemini-2.5-pro-preview-05-06 | 62.8 | 95% | 61.27 |
8 | gpt-4.1 | 45.75 | 65% | 32.8 |
9 | grok-3-beta | 40.25 | 70% | 124.88 |
10 | claude-3-5-sonnet-20241022 | 39.35 | 45% | 31.63 |
11 | o3-mini-high | 39.13 | 67% | 1617.94 |
12 | gemini-2.5-flash-preview-04-17 | 35.85 | 60% | 140.4 |
13 | o4-mini-medium | 29.53 | 41% | 974.77 |
14 | o3-mini-medium | 25.8 | 55% | 558.54 |
15 | gpt-4o | 23.9 | 45% | 87.37 |
16 | qwen3-235b-a22b | 20.94 | 33% | 1510.99 |
The most shocking results come from the bottom of the table. OpenAI’s “reasoning-optimized” mini models are disasters: o3-mini-high spent an absurd 27 minutes per puzzle to achieve a pathetic 39.13 average score. That’s nearly 9x longer than Claude 4 Opus for less than half the score. Meanwhile, humble GPT-4.1 managed a respectable 45.75 in just 33 seconds.
The Time-Performance Paradox
The benchmark reveals a critical insight about modern AI systems: more thinking time doesn’t necessarily mean better thinking. Look at these striking comparisons:
- Gemini 2.5 Pro (06-05) scored 70.3 but took nearly 20 minutes per puzzle
- Gemini 2.5 Pro (05-06) scored 62.8 but finished in just 1 minute
- o3-mini (high) spent 27 minutes thinking to score worse than GPT-4.1’s 33-second performance
This suggests that many models are stuck in inefficient reasoning loops rather than making meaningful progress. Claude’s models consistently demonstrate the best balance—they think efficiently, exploring productive paths rather than spinning their wheels.
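One crude way to see the same point: normalize score by time. Here is a quick back-of-the-envelope calculation using the leaderboard numbers above; points per minute is a blunt metric, but it makes the gap vivid.

```ts
// Average points earned per minute of inference, from the leaderboard above.
const results = [
  { model: "gpt-4.1", score: 45.75, seconds: 32.8 },
  { model: "gemini-2.5-pro-preview-05-06", score: 62.8, seconds: 61.27 },
  { model: "claude-4-opus-20250514-32k-thinking", score: 88.9, seconds: 183.47 },
  { model: "o3-high", score: 92.11, seconds: 660.41 },
  { model: "gemini-2.5-pro-preview-06-05", score: 70.3, seconds: 1185.29 },
  { model: "o3-mini-high", score: 39.13, seconds: 1617.94 },
];

for (const { model, score, seconds } of results) {
  console.log(`${model}: ${(score / (seconds / 60)).toFixed(1)} points/min`);
}
// gpt-4.1 ≈ 83.7 pts/min, gemini-05-06 ≈ 61.5, claude-4-opus ≈ 29.1,
// o3-high ≈ 8.4, gemini-06-05 ≈ 3.6, o3-mini-high ≈ 1.5
```

Raw points per minute naturally favors the fast generalists; the interesting part is that Claude 4 Opus keeps a strong rate while also finishing every puzzle.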
Why This Matters
For real-world applications, the time-performance tradeoff is crucial. Consider these scenarios:
- Customer support: Would you rather wait 11 minutes for a 92% accurate response or 3 minutes for an 89% accurate one?
- Code debugging: Is a slightly better bug fix worth 4x the debugging time?
- Research assistance: Do you need the absolute best answer, or the best answer you can get in a reasonable timeframe?
In most cases, Claude 4 Opus represents the sweet spot: near-peak performance at practical speeds. The 3.2-point score gap between it and o3-high is negligible for most use cases, but the time difference is substantial.
The Surprising Winners and Losers
Winners:
- Claude models dominated the top spots with consistent speed and accuracy
- Grok-4 (86.15 score, 5.2 min) showed xAI’s model can compete with the best
- GPT-4.1 delivered respectable performance at lightning speed
Losers:
- o3-mini and o4-mini are embarrassments—slow AND inaccurate
- Gemini 2.5 Pro (06-05) took 20 minutes for middling results
- Qwen3-235b spent 25 minutes per puzzle for the worst performance
The failure of OpenAI’s mini models is particularly damning. These were supposedly optimized for reasoning tasks, yet they performed worse than general-purpose models while taking anywhere from 17x to nearly 50x longer than GPT-4.1. It’s a cautionary tale about the dangers of optimizing for the wrong metrics.
The Real Lesson
This benchmark teaches us that in AI, as in life, perfection is often the enemy of good. o3-high’s marginal victory comes at such a steep time cost that it’s rarely the right choice for practical applications. Claude 4 Opus emerges as the real winner—fast enough for interactive use, accurate enough for serious work.
The results also expose the hollowness of some “reasoning-optimized” claims. True reasoning capability isn’t about thinking longer—it’s about thinking better. Claude’s models demonstrate this perfectly, efficiently navigating solution spaces while others get lost in computational dead ends.
For developers building AI products, the message is clear: optimize for the full user experience, not just benchmark scores. A slightly less accurate model that responds in seconds will create more value than a marginally better one that makes users wait.
Want to test your own models or see the full results? The benchmark code is available at [github.com/redspringxyz/bracket-city-benchmark](https://github.com/redspringxyz/bracket-city-benchmark).