
OpenGameEval leaderboard update: June 2026

June 8, 2026 · Updated June 10, 2026 with Claude Fable 5 results


Roblox's OpenGameEval leaderboard has seen significant movement since the May update. The headline: Claude Fable 5, Anthropic's newest model, debuted straight at #1 on both leaderboards — 50.34% Pass@1 on the 87-eval code generation set and 64.67% on Debug, a full 8 points clear of the previous Debug leader. Gemini 3.5 Flash also entered the code generation set and holds the top Pass@5 score. Gemini 3.1 Pro, previously cited as the overall leader, has not been evaluated on the expanded 87-eval set and now appears only on the Debug leaderboard.

Code generation leaderboard (87 evals)

The main leaderboard now has seven evaluated models on the full 87-eval set. Two metrics were added alongside this update: Cons@5 (success in at least 3 of 5 attempts) and All@5 (success in all 5 attempts), which together give a clearer picture of consistency vs. ceiling performance.
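To make the four columns concrete, here is a minimal sketch of how they could be computed from raw attempt outcomes. It assumes exactly five attempts per eval and simple counting, with Pass@1 taken as the mean per-attempt success rate and Pass@5 as success in at least one of the five attempts. Those two definitions, the `summarize` function, and the input layout are assumptions for illustration; the published figures may come from a different estimator or a larger number of samples.

```python
from typing import Dict, List

ATTEMPTS = 5  # assumed number of attempts per eval


def summarize(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Compute Pass@1, Pass@5, Cons@5 and All@5 as fractions in [0, 1].

    `results` maps a (hypothetical) eval name to its attempt outcomes,
    where True means the attempt passed.
    """
    if not results:
        raise ValueError("no evals to summarize")
    for name, attempts in results.items():
        if len(attempts) != ATTEMPTS:
            raise ValueError(f"{name}: expected {ATTEMPTS} attempts")

    n = len(results)
    wins = [sum(attempts) for attempts in results.values()]  # passes per eval

    return {
        # Assumed: average success rate of a single attempt.
        "pass@1": sum(w / ATTEMPTS for w in wins) / n,
        # Assumed: at least one of the five attempts passed.
        "pass@5": sum(w >= 1 for w in wins) / n,
        # From the article: at least 3 of 5 attempts passed.
        "cons@5": sum(w >= 3 for w in wins) / n,
        # From the article: all 5 attempts passed.
        "all@5": sum(w == ATTEMPTS for w in wins) / n,
    }


# Made-up outcomes for four evals, purely for illustration.
example = {
    "eval_a": [True, True, True, True, True],
    "eval_b": [True, True, True, False, False],
    "eval_c": [False, False, False, False, True],
    "eval_d": [False, False, False, False, False],
}
scores = summarize(example)
```

On this toy input, Pass@5 counts three of the four evals while All@5 counts only one, which is the consistency vs. ceiling gap the new columns are meant to expose.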

| Model | Pass@1 | Pass@5 | Cons@5 | All@5 | Tool error |
| Claude Fable 5 | 50.34% | 62.07% | 51.09% | 39.52% | 1.40% |
| Claude Opus 4.6 | 48.05% | 59.77% | 48.05% | 38.28% | 0.71% |
| Gemini 3.5 Flash | 48.05% | 63.22% | 49.03% | 33.86% | 3.30% |
| Gemini 3 Flash Preview | 47.82% | 60.92% | 48.84% | 35.12% | 5.51% |
| Claude Opus 4.7 | 43.45% | 58.62% | 43.45% | 32.18% | 1.33% |
| GPT-5.5 (Reasoning: M) | 40.69% | 56.32% | 40.13% | 30.62% | 0.91% |
| GPT-5.4 (Reasoning: M) | 40.23% | 55.17% | 40.00% | 29.02% | 1.81% |

Debug leaderboard (30 evals)

The Debug leaderboard has expanded to eleven models and now also tracks Cons@5 and All@5. Claude Fable 5 leads every success metric: Pass@1 (64.67%), Pass@5 (73.33%, tied with GLM 5), Cons@5 (66.09%), and All@5 (54.66%). Only its tool error rate (1.01%) is not the best on the board; Claude Opus 4.6 is slightly lower at 0.96%. Behind it, Gemini 3.1 Pro's advantage over GLM 5 on Pass@1 remains narrow (56.67% vs. 56.00%).

| Model | Pass@1 | Pass@5 | Cons@5 | All@5 | Tool error |
| Claude Fable 5 | 64.67% | 73.33% | 66.09% | 54.66% | 1.01% |
| Gemini 3.1 Pro | 56.67% | 70.00% | 58.36% | 42.68% | 5.97% |
| GLM 5 | 56.00% | 73.33% | 59.87% | 33.98% | 2.39% |
| Claude Opus 4.7 | 52.67% | 63.33% | 53.14% | 43.57% | 4.26% |
| GPT-5.4 (Reasoning: M) | 51.33% | 63.33% | 52.08% | 39.70% | 2.98% |
| Gemini 3 Flash Preview | 51.33% | 63.33% | 51.06% | 43.31% | 4.58% |
| Claude Opus 4.6 | 50.67% | 66.67% | 49.52% | 40.85% | 0.96% |
| GPT-5.5 (Reasoning: M) | 50.00% | 66.67% | 51.02% | 35.18% | 1.54% |
| Gemini 3.5 Flash | 49.33% | 70.00% | 48.46% | 36.33% | 3.37% |
| GPT Codex 5.3 | 47.33% | 70.00% | 47.90% | 27.00% | 3.21% |
| Claude Sonnet 4.6 | 46.00% | 60.00% | 46.47% | 33.87% | 6.47% |

What changed since May

Claude Fable 5: new #1 on both leaderboards

The most significant new entry. Claude Fable 5 tops the code generation set on Pass@1 (50.34%), Cons@5 (51.09%), and All@5 (39.52%), and is the first model to cross 50% Pass@1 on the harder 87-eval set. On Debug the gap is even larger: 64.67% Pass@1, 8 points ahead of Gemini 3.1 Pro, with the highest consistency scores on the board (66.09% Cons@5, 54.66% All@5). Its tool error rates are low but not the lowest: 1.40% on code generation (fourth-lowest of seven, behind Claude Opus 4.6, GPT-5.5, and Claude Opus 4.7) and 1.01% on Debug (second-lowest, behind Claude Opus 4.6 at 0.96%).

Gemini 3.5 Flash: best Pass@5 on code generation

Gemini 3.5 Flash matches Claude Opus 4.6 on Pass@1 (48.05%) and leads all evaluated models on Pass@5 (63.22%), including Fable 5 (62.07%). In practical terms: if you run the same prompt five times and take the best result, Gemini 3.5 Flash gives you the highest probability of at least one success. The tradeoff is a higher tool error rate (3.30% vs. Opus 4.6's 0.71%).
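The ceiling vs. consistency tradeoff can be read straight off the code generation table. The sketch below copies the Pass@5 and All@5 columns from that table and reports which model has the best ceiling, which is most reliable across all five attempts, and how far apart the two numbers are for each model. The names `CODEGEN` and `spread` are illustrative only.

```python
# (Pass@5, All@5) in percent, copied from the code generation leaderboard.
CODEGEN = {
    "Claude Fable 5": (62.07, 39.52),
    "Claude Opus 4.6": (59.77, 38.28),
    "Gemini 3.5 Flash": (63.22, 33.86),
    "Gemini 3 Flash Preview": (60.92, 35.12),
    "Claude Opus 4.7": (58.62, 32.18),
    "GPT-5.5 (Reasoning: M)": (56.32, 30.62),
    "GPT-5.4 (Reasoning: M)": (55.17, 29.02),
}


def spread(scores):
    """Return Pass@5 minus All@5 per model, in percentage points."""
    return {model: round(p5 - a5, 2) for model, (p5, a5) in scores.items()}


# Best ceiling: highest chance that at least one of five attempts succeeds.
best_ceiling = max(CODEGEN, key=lambda m: CODEGEN[m][0])
# Most reliable: highest chance that all five attempts succeed.
most_reliable = max(CODEGEN, key=lambda m: CODEGEN[m][1])

gaps = spread(CODEGEN)
widest_gap = max(gaps, key=gaps.get)
```

Here `best_ceiling` and `widest_gap` are both Gemini 3.5 Flash, while `most_reliable` is Claude Fable 5: the model most likely to succeed at least once in five tries is also the one whose results vary most between attempts.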

Gemini 3.1 Pro: Debug-only

Previous reports described Gemini 3.1 Pro as the overall leader at 55.3% Pass@1. That figure came from the deprecated 47-eval leaderboard; on the expanded 87-eval set, Gemini 3.1 Pro has not yet been evaluated. It is now the second-strongest model on Debug (56.67%, behind Fable 5), but its code generation performance on the harder task set is an open question.

New metrics: Cons@5 and All@5

The leaderboard added two new columns: Cons@5 (success in at least 3 of 5 attempts) and All@5 (success in all 5 attempts).

Claude Opus 4.7: the detailed review

Roblox published a detailed comparison of Opus 4.6 vs. 4.7 on this eval suite. The apparent Pass@1 drop (48.1% → 43.5%) is not statistically significant (p=0.24). What is significant: Opus 4.7 makes 39% fewer tool calls per task, with the largest reductions in exploration tools (search_game_tree, script_grep, inspect_instance). It tends to fail by stopping too early rather than over-engineering, and it recovers well on well-specified tasks where the target is unambiguous.

Roblox's recommendation: for open-ended discovery tasks (e.g. “remove tutorial assets,” “make lights toggle with day/night”), add explicit instructions for Opus 4.7 to explore the full workspace before acting. For well-defined tasks it outperforms 4.6 on several evals and costs less to run.
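One way to apply this recommendation is to prepend an exploration instruction only when a task is open-ended. The sketch below is a hypothetical helper, not Roblox's wording or API: the preamble text and the `build_prompt` function are assumptions, and only the three tool names come from the comparison described above.

```python
# Hypothetical preamble; the wording is an illustration, not Roblox's text.
EXPLORATION_PREAMBLE = (
    "Before making any change, explore the full workspace. "
    "Use search_game_tree, script_grep, and inspect_instance to find every "
    "instance and script that could be relevant, then state your plan."
)


def build_prompt(task: str, open_ended: bool) -> str:
    """Return the task text, prefixed with the preamble for open-ended tasks."""
    task = task.strip()
    if not open_ended:
        # Well-specified tasks are passed through unchanged.
        return task
    return f"{EXPLORATION_PREAMBLE}\n\nTask: {task}"


discovery_prompt = build_prompt("remove tutorial assets", open_ended=True)
direct_prompt = build_prompt("  rename the SpawnPoint part to Start  ", open_ended=False)
```

Keeping the preamble conditional leaves well-defined tasks untouched, so the reduction in tool calls that the comparison attributes to Opus 4.7 is not given up where exploration is unnecessary.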

New models: GPT-5.5, GPT-5.4, and GPT Codex 5.3

GPT-5.5 (Reasoning: M) entered both leaderboards: 40.69% Pass@1 on code generation and 50.00% Pass@1 on Debug. Its 0.91% tool error rate on code generation is the second-lowest of any model tested, behind only Claude Opus 4.6. GPT-5.4 (Reasoning: M) was also added to the code generation set (40.23% Pass@1, currently last) and its Debug results were re-run: 51.33% Pass@1, up from the previously reported 50.00%.

GPT Codex 5.3 appears on Debug only (47.33% Pass@1) with a strong Pass@5 (70.00%). It has not yet been evaluated on code generation.

What this means for your workflow

