Benchmarks
OpenGameEval leaderboard update: June 2026
June 8, 2026 · Updated June 10, 2026 with Claude Fable 5 results
Roblox's OpenGameEval leaderboard has seen significant movement since the May update. The headline: Claude Fable 5, Anthropic's newest model, debuted at #1 on Pass@1 on both leaderboards, with 50.34% on the 87-eval code generation set and 64.67% on Debug, a full 8 percentage points clear of the previous Debug leader. Gemini 3.5 Flash also entered the code generation set and holds the top Pass@5 score. Gemini 3.1 Pro, previously cited as the overall leader, has not been evaluated on the expanded 87-eval set and now appears only on the Debug leaderboard.
Code generation leaderboard (87 evals)
The main leaderboard now has seven evaluated models on the full 87-eval set. Two metrics were added alongside this update: Cons@5 (success in at least 3 of 5 attempts) and All@5 (success in all 5 attempts), which together give a clearer picture of consistency vs. ceiling performance.
| Model | Pass@1 | Pass@5 | Cons@5 | All@5 | Tool err |
|---|---|---|---|---|---|
| Claude Fable 5 | 50.34% | 62.07% | 51.09% | 39.52% | 1.40% |
| Claude Opus 4.6 | 48.05% | 59.77% | 48.05% | 38.28% | 0.71% |
| Gemini 3.5 Flash | 48.05% | 63.22% | 49.03% | 33.86% | 3.30% |
| Gemini 3 Flash Preview | 47.82% | 60.92% | 48.84% | 35.12% | 5.51% |
| Claude Opus 4.7 | 43.45% | 58.62% | 43.45% | 32.18% | 1.33% |
| GPT-5.5 (Reasoning: M) | 40.69% | 56.32% | 40.13% | 30.62% | 0.91% |
| GPT-5.4 (Reasoning: M) | 40.23% | 55.17% | 40.00% | 29.02% | 1.81% |
Debug leaderboard (30 evals)
The Debug leaderboard has expanded to eleven models and now also tracks Cons@5 and All@5. Claude Fable 5 leads or ties every accuracy column: Pass@1 (64.67%), Pass@5 (73.33%, tied with GLM 5), Cons@5 (66.09%), and All@5 (54.66%). Only on tool error rate does Claude Opus 4.6 edge it out (0.96% vs. 1.01%). Behind it, Gemini 3.1 Pro's advantage over GLM 5 on Pass@1 remains narrow (56.67% vs. 56.00%).
| Model | Pass@1 | Pass@5 | Cons@5 | All@5 | Tool err |
|---|---|---|---|---|---|
| Claude Fable 5 | 64.67% | 73.33% | 66.09% | 54.66% | 1.01% |
| Gemini 3.1 Pro | 56.67% | 70.00% | 58.36% | 42.68% | 5.97% |
| GLM 5 | 56.00% | 73.33% | 59.87% | 33.98% | 2.39% |
| Claude Opus 4.7 | 52.67% | 63.33% | 53.14% | 43.57% | 4.26% |
| GPT-5.4 (Reasoning: M) | 51.33% | 63.33% | 52.08% | 39.70% | 2.98% |
| Gemini 3 Flash Preview | 51.33% | 63.33% | 51.06% | 43.31% | 4.58% |
| Claude Opus 4.6 | 50.67% | 66.67% | 49.52% | 40.85% | 0.96% |
| GPT-5.5 (Reasoning: M) | 50.00% | 66.67% | 51.02% | 35.18% | 1.54% |
| Gemini 3.5 Flash | 49.33% | 70.00% | 48.46% | 36.33% | 3.37% |
| GPT Codex 5.3 | 47.33% | 70.00% | 47.90% | 27.00% | 3.21% |
| Claude Sonnet 4.6 | 46.00% | 60.00% | 46.47% | 33.87% | 6.47% |
What changed since May
Claude Fable 5: new #1 on both leaderboards
The most significant new entry. Claude Fable 5 tops the code generation set on Pass@1 (50.34%), Cons@5 (51.09%), and All@5 (39.52%), and is the first model to cross 50% Pass@1 on the harder 87-eval set. On Debug the gap is even larger: 64.67% Pass@1, 8 percentage points ahead of Gemini 3.1 Pro, with the highest consistency scores on the board (66.09% Cons@5, 54.66% All@5). Its tool error rates are low but not the lowest: 1.40% on codegen (fourth-lowest of seven, behind Claude Opus 4.6, GPT-5.5, and Claude Opus 4.7) and 1.01% on debug (second only to Claude Opus 4.6 at 0.96%).
Gemini 3.5 Flash: best Pass@5 on code generation
Gemini 3.5 Flash matches Claude Opus 4.6 on Pass@1 (48.05%) and leads all evaluated models on Pass@5 (63.22%) — including Fable 5 (62.07%). In practical terms: if you run the same prompt a few times and take the best result, Gemini 3.5 Flash gives you the highest probability of at least one success. The tradeoff is a higher tool error rate (3.30% vs. Opus 4.6's 0.71%).
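The "run it a few times and take the best result" workflow can be sketched as a simple retry loop. This is a minimal illustration only: `generate` and `passes` are hypothetical callables standing in for your own model call and your own acceptance check, and neither is part of OpenGameEval.

```python
from typing import Callable, Optional


def first_passing(
    generate: Callable[[int], str],
    passes: Callable[[str], bool],
    attempts: int = 5,
) -> Optional[str]:
    """Return the first candidate that passes, or None if every attempt fails."""
    for attempt in range(attempts):
        candidate = generate(attempt)  # hypothetical: one model call per attempt
        if passes(candidate):          # hypothetical: your own acceptance check
            return candidate
    return None


# Stand-in callables so the sketch runs without a model or a test harness.
def fake_generate(attempt: int) -> str:
    return "good" if attempt == 3 else "bad"


def fake_passes(candidate: str) -> bool:
    return candidate == "good"


result = first_passing(fake_generate, fake_passes)
```

A high Pass@5 only pays off if you have a cheap, trustworthy `passes` check. Without one, picking the best of five outputs becomes manual review.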
Gemini 3.1 Pro: Debug-only
Previous reports described Gemini 3.1 Pro as the overall leader at 55.3% Pass@1, but that figure came from the deprecated 47-eval leaderboard. Gemini 3.1 Pro has not yet been evaluated on the expanded 87-eval set. It is now the second-strongest model on Debug by Pass@1 (56.67%, behind Fable 5), but its code generation performance on the harder task set remains an open question.
New metrics: Cons@5 and All@5
The leaderboard added two new columns:
- Cons@5 (consistent success): the probability of passing in at least 3 of 5 attempts. A Cons@5 above Pass@1 means the model succeeds on a majority of attempts more often than it succeeds on any single attempt: reliable overall, but not always on the first try. Claude Opus 4.6's Cons@5 equals its Pass@1 exactly (48.05%), suggesting it tends to either solve a task reliably or fail consistently, with little in between. Claude Fable 5 leads the metric on both sets (51.09% codegen, 66.09% debug).
- All@5: the probability of passing all 5 attempts. Claude Fable 5 leads this metric on both sets (39.52% codegen, 54.66% debug), meaning it has the highest share of tasks it solves every single time. Gemini 3.5 Flash's lower codegen All@5 (33.86%), despite the best Pass@5, suggests more run-to-run variance.
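To make the four metrics concrete, here is a minimal sketch that tallies them from five pass/fail attempts per task. It is an illustration under assumptions, not Roblox's scoring code: the leaderboard's exact estimator is not described here, `summarize` is a hypothetical helper, and the task outcomes below are invented.

```python
from typing import Dict, List


def summarize(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Tally Pass@1, Pass@5, Cons@5 and All@5 from five attempts per task."""
    n_tasks = len(results)
    passes_per_task = [sum(attempts) for attempts in results.values()]
    total_attempts = sum(len(attempts) for attempts in results.values())
    return {
        "pass@1": sum(passes_per_task) / total_attempts,              # mean single-attempt rate
        "pass@5": sum(k >= 1 for k in passes_per_task) / n_tasks,     # at least one success
        "cons@5": sum(k >= 3 for k in passes_per_task) / n_tasks,     # majority of attempts
        "all@5": sum(k == 5 for k in passes_per_task) / n_tasks,      # every attempt
    }


# Hypothetical outcomes, not leaderboard data.
all_or_nothing = {
    "task_a": [True] * 5,
    "task_b": [False] * 5,
}
flaky = {
    "task_a": [True, True, True, False, False],
    "task_b": [True, False, False, False, False],
}

steady = summarize(all_or_nothing)
variable = summarize(flaky)
```

In the all-or-nothing case every metric comes out identical, which matches the pattern where Cons@5 equals Pass@1. In the flaky case Pass@5 is high while All@5 drops to zero, the same shape as a strong Pass@5 paired with a weak All@5.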
Claude Opus 4.7: the detailed review
Roblox published a detailed comparison of Opus 4.6 vs. 4.7 on this eval suite. The apparent Pass@1 drop (48.1% → 43.5%) is not statistically significant (p=0.24). What is significant: Opus 4.7 makes 39% fewer tool calls per task, with the largest reductions in exploration tools (search_game_tree, script_grep, inspect_instance). It tends to fail by stopping too early rather than over-engineering, and it recovers well on well-specified tasks where the target is unambiguous.
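For readers who want to run a similar check on their own evals, one common approach is a paired sign-flip permutation test on per-task scores. This is a generic sketch, not the method behind the p=0.24 figure: the source does not say which test Roblox used, and `paired_sign_flip_p` and its inputs (one pass rate per task for each model) are assumptions for illustration.

```python
import random
from typing import Sequence


def paired_sign_flip_p(
    a: Sequence[float], b: Sequence[float], n_perm: int = 2000, seed: int = 0
) -> float:
    """Two-sided p-value for the mean per-task difference between two models."""
    if len(a) != len(b):
        raise ValueError("both models need a score for every task")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs)) / len(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        # Under the null hypothesis the sign of each per-task difference is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) / len(diffs) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p of exactly 0.
    return (hits + 1) / (n_perm + 1)
```

Pairing by task matters because both models see the same evals. Comparing the two overall pass rates as if they were independent samples discards that structure.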
Roblox's recommendation: for open-ended discovery tasks (e.g. “remove tutorial assets,” “make lights toggle with day/night”), add explicit instructions for Opus 4.7 to explore the full workspace before acting. For well-defined tasks it outperforms 4.6 on several evals and costs less to run.
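One way to apply that recommendation is to prepend an explicit explore-first instruction to open-ended tasks. The sketch below is illustrative only: the tool names come from the comparison above, but the preamble wording and the `with_exploration_preamble` helper are assumptions, not text published by Roblox.

```python
EXPLORATION_TOOLS = ("search_game_tree", "script_grep", "inspect_instance")


def with_exploration_preamble(task: str) -> str:
    """Prefix an open-ended task with an explicit explore-first instruction."""
    tools = ", ".join(EXPLORATION_TOOLS)
    # Hypothetical wording; adjust it to your own prompt style.
    preamble = (
        "Before changing anything, explore the full workspace "
        f"(for example with {tools}) and list every instance and script "
        "relevant to the task. Only then make your edits."
    )
    return f"{preamble}\n\nTask: {task}"


prompt = with_exploration_preamble("remove tutorial assets")
```

For well-specified tasks the preamble is probably unnecessary and would only add back the tool calls that make Opus 4.7 cheaper to run.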
New models: GPT-5.5, GPT-5.4, and GPT Codex 5.3
GPT-5.5 (Reasoning: M) entered both leaderboards: 40.69% Pass@1 on code generation and 50.00% Pass@1 on Debug. Its 0.91% tool error rate on code generation is the second-lowest of any model tested, behind only Claude Opus 4.6. GPT-5.4 (Reasoning: M) was also added to the code generation set (40.23% Pass@1, currently last) and its Debug results were re-run: 51.33% Pass@1, up from the previously reported 50.00%.
GPT Codex 5.3 appears on Debug only (47.33% Pass@1) with a strong Pass@5 (70.00%). It has not yet been evaluated on code generation.
What this means for your workflow
- Claude Fable 5 as the default. #1 on both leaderboards, best consistency scores (Cons@5 and All@5) on both sets, and the strongest single-attempt bug-fixer by a wide margin (64.67% Debug Pass@1). The all-around pick for both building and fixing.
- Gemini 3.5 Flash for iterative work. If you're running prompts multiple times and evaluating results, its Pass@5 lead (63.22%) makes it the best model for high-iteration workflows.
- Claude Opus 4.6 for precision. Lowest tool error rate on both sets (0.71% codegen, 0.96% debug). When you need the model to do exactly what you asked without malformed tool calls or retry overhead.
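- GLM 5 as a Debug alternative. Effectively tied with Gemini 3.1 Pro on Debug Pass@1 (56.00% vs. 56.67%) and tied with Fable 5 on Debug Pass@5 (73.33%), with a relatively low tool error rate (2.39%). Its All@5 (33.98%) is well below both, so expect more variation between runs.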
- Claude Opus 4.7 for cost-sensitive projects. Its Pass@1 gap to 4.6 on code generation is not statistically significant (p=0.24), and it makes 39% fewer tool calls, with caveats for open-ended discovery tasks.
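The picks above amount to choosing a model by whichever column matters most for your workflow. As a rough sketch, the code generation table can be loaded into a dictionary and queried per metric. The figures are copied from the table in this post; the `best_by` helper and the key names are hypothetical.

```python
from typing import Dict

# Code generation leaderboard (87 evals), values in percent.
CODEGEN: Dict[str, Dict[str, float]] = {
    "Claude Fable 5": {"pass@1": 50.34, "pass@5": 62.07, "cons@5": 51.09, "all@5": 39.52, "tool_err": 1.40},
    "Claude Opus 4.6": {"pass@1": 48.05, "pass@5": 59.77, "cons@5": 48.05, "all@5": 38.28, "tool_err": 0.71},
    "Gemini 3.5 Flash": {"pass@1": 48.05, "pass@5": 63.22, "cons@5": 49.03, "all@5": 33.86, "tool_err": 3.30},
    "Gemini 3 Flash Preview": {"pass@1": 47.82, "pass@5": 60.92, "cons@5": 48.84, "all@5": 35.12, "tool_err": 5.51},
    "Claude Opus 4.7": {"pass@1": 43.45, "pass@5": 58.62, "cons@5": 43.45, "all@5": 32.18, "tool_err": 1.33},
    "GPT-5.5 (Reasoning: M)": {"pass@1": 40.69, "pass@5": 56.32, "cons@5": 40.13, "all@5": 30.62, "tool_err": 0.91},
    "GPT-5.4 (Reasoning: M)": {"pass@1": 40.23, "pass@5": 55.17, "cons@5": 40.00, "all@5": 29.02, "tool_err": 1.81},
}


def best_by(table: Dict[str, Dict[str, float]], metric: str, lower_is_better: bool = False) -> str:
    """Return the model with the best value for one metric (first listed wins ties)."""
    pick = min if lower_is_better else max
    return pick(table, key=lambda model: table[model][metric])


default_pick = best_by(CODEGEN, "pass@1")
iterative_pick = best_by(CODEGEN, "pass@5")
precision_pick = best_by(CODEGEN, "tool_err", lower_is_better=True)
```

A single-metric lookup like this ignores cost and latency, which the leaderboard does not report, so treat it as a starting point, not a decision rule.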