
OpenGameEval leaderboard update: June 2026

June 8, 2026 · Updated June 10, 2026 with Claude Fable 5 results

Roblox's OpenGameEval leaderboard has seen significant movement since the May update. The headline: Claude Fable 5, Anthropic's newest model, debuted straight at #1 on both leaderboards — 50.34% Pass@1 on the 87-eval code generation set and 64.67% on Debug, a full 8 points clear of the previous Debug leader. Gemini 3.5 Flash also entered the code generation set and holds the top Pass@5 score. Gemini 3.1 Pro, previously cited as the overall leader, has not been evaluated on the expanded 87-eval set and now appears only on the Debug leaderboard.

Code generation leaderboard (87 evals)

The main leaderboard now has seven evaluated models on the full 87-eval set. Two metrics were added alongside this update: Cons@5 (success in at least 3 of 5 attempts) and All@5 (success in all 5 attempts), which together give a clearer picture of consistency vs. ceiling performance.

Model	Pass@1	Pass@5	Cons@5	All@5	Tool error rate
Claude Fable 5	50.34%	62.07%	51.09%	39.52%	1.40%
Claude Opus 4.6	48.05%	59.77%	48.05%	38.28%	0.71%
Gemini 3.5 Flash	48.05%	63.22%	49.03%	33.86%	3.30%
Gemini 3 Flash Preview	47.82%	60.92%	48.84%	35.12%	5.51%
Claude Opus 4.7	43.45%	58.62%	43.45%	32.18%	1.33%
GPT-5.5 (Reasoning: M)	40.69%	56.32%	40.13%	30.62%	0.91%
GPT-5.4 (Reasoning: M)	40.23%	55.17%	40.00%	29.02%	1.81%

Debug leaderboard (30 evals)

The Debug leaderboard has expanded to eleven models and now also tracks Cons@5 and All@5. Claude Fable 5 leads or ties for the lead on every accuracy metric: Pass@1 (64.67%), Pass@5 (73.33%, tied with GLM 5), Cons@5 (66.09%), and All@5 (54.66%). Its 1.01% tool error rate is second only to Claude Opus 4.6's 0.96%. Behind it, Gemini 3.1 Pro's advantage over GLM 5 on Pass@1 remains narrow (56.67% vs. 56.00%).

Model	Pass@1	Pass@5	Cons@5	All@5	Tool error rate
Claude Fable 5	64.67%	73.33%	66.09%	54.66%	1.01%
Gemini 3.1 Pro	56.67%	70.00%	58.36%	42.68%	5.97%
GLM 5	56.00%	73.33%	59.87%	33.98%	2.39%
Claude Opus 4.7	52.67%	63.33%	53.14%	43.57%	4.26%
GPT-5.4 (Reasoning: M)	51.33%	63.33%	52.08%	39.70%	2.98%
Gemini 3 Flash Preview	51.33%	63.33%	51.06%	43.31%	4.58%
Claude Opus 4.6	50.67%	66.67%	49.52%	40.85%	0.96%
GPT-5.5 (Reasoning: M)	50.00%	66.67%	51.02%	35.18%	1.54%
Gemini 3.5 Flash	49.33%	70.00%	48.46%	36.33%	3.37%
GPT Codex 5.3	47.33%	70.00%	47.90%	27.00%	3.21%
Claude Sonnet 4.6	46.00%	60.00%	46.47%	33.87%	6.47%

What changed since May

Claude Fable 5: new #1 on both leaderboards

The most significant new entry. Claude Fable 5 tops the code generation set on Pass@1 (50.34%), Cons@5 (51.09%), and All@5 (39.52%) — the first model to cross 50% Pass@1 on the harder 87-eval set. On Debug the gap is even larger: 64.67% Pass@1, 8 points ahead of Gemini 3.1 Pro, with the highest consistency scores on the board (66.09% Cons@5, 54.66% All@5). Its tool error rates (1.40% codegen, 1.01% debug) are among the lowest tested, though Claude Opus 4.6 still holds the absolute lowest on both sets.

Gemini 3.5 Flash: best Pass@5 on code generation

Gemini 3.5 Flash matches Claude Opus 4.6 on Pass@1 (48.05%) and leads all evaluated models on Pass@5 (63.22%), including Fable 5 (62.07%). In practical terms: if you run the same prompt up to five times and keep the best result, Gemini 3.5 Flash gives you the highest chance of at least one success. The tradeoff is a higher tool error rate (3.30% vs. Opus 4.6's 0.71%).
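To put the Pass@5 figure in context, the short sketch below compares it with a purely hypothetical baseline in which every attempt succeeds independently at the model's Pass@1 rate. The two input numbers come from the code generation table; the independence baseline and the function name are illustrative assumptions, not part of the OpenGameEval methodology.

```python
# Figures for Gemini 3.5 Flash, taken from the code generation table.
pass_at_1 = 0.4805
observed_pass_at_5 = 0.6322


def pass_at_k_if_independent(p: float, k: int) -> float:
    """Chance of at least one success in k tries, ASSUMING every try
    succeeds independently with the same probability p (hypothetical)."""
    return 1.0 - (1.0 - p) ** k


baseline = pass_at_k_if_independent(pass_at_1, 5)
never_solved = 1.0 - observed_pass_at_5  # share of evals failed in all 5 tries

print(f"independent-attempt baseline: {baseline:.1%}")
print(f"observed Pass@5:              {observed_pass_at_5:.1%}")
print(f"evals never solved in 5 runs: {never_solved:.1%}")
```

The observed Pass@5 sits far below the independent-attempt baseline, which suggests that successes cluster on a subset of evals: retrying helps on borderline tasks, while more than a third of the set stays unsolved across all five attempts.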

Gemini 3.1 Pro: Debug-only

Previous reports described Gemini 3.1 Pro as the overall leader at 55.3% Pass@1. That figure came from the deprecated 47-eval leaderboard. Gemini 3.1 Pro has not yet been evaluated on the current 87-eval set. It is now second on Debug Pass@1 (56.67%, behind Fable 5), but its code generation performance on the harder task set remains an open question.

New metrics: Cons@5 and All@5

The leaderboard added two new columns:

Cons@5 (consistent success): probability of passing in at least 3 out of 5 attempts. A Cons@5 above Pass@1 suggests that tasks a model solves only some of the time still pass in a majority of attempts, so the model is reliable even when it is not right on the first try. On code generation, Claude Opus 4.6's Cons@5 equals its Pass@1 exactly (48.05%), suggesting it tends to either solve a task reliably or fail consistently, with little in between. Claude Fable 5 leads the metric on both sets (51.09% codegen, 66.09% debug).
All@5: probability of passing all 5 attempts. Claude Fable 5 leads this metric on both sets (39.52% codegen, 54.66% debug), meaning it has the highest share of tasks it solves every single time. On code generation, Gemini 3.5 Flash's lower All@5 (33.86%) despite a strong Pass@5 (63.22%) suggests more run-to-run variance.
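The four accuracy columns can be sketched as simple counts over per-eval attempt outcomes. This is a minimal illustration that reads the definitions above literally; the function name, the input layout, and the toy data are assumptions. The published Cons@5 and All@5 values are not whole multiples of 1/87, so the leaderboard presumably uses a different estimator than this direct count.

```python
from typing import Dict, List


def summarize(results: Dict[str, List[bool]], k: int = 5,
              cons_threshold: int = 3) -> Dict[str, float]:
    """Summarize pass/fail outcomes, where `results` maps each eval name
    to exactly k attempt outcomes (True = pass)."""
    if any(len(attempts) != k for attempts in results.values()):
        raise ValueError(f"every eval needs exactly {k} attempts")

    n_evals = len(results)
    passes = [sum(attempts) for attempts in results.values()]

    return {
        # Average success rate of a single attempt.
        "pass@1": sum(passes) / (n_evals * k),
        # Share of evals solved at least once in k attempts.
        f"pass@{k}": sum(p >= 1 for p in passes) / n_evals,
        # Share of evals solved in at least `cons_threshold` attempts.
        f"cons@{k}": sum(p >= cons_threshold for p in passes) / n_evals,
        # Share of evals solved in every attempt.
        f"all@{k}": sum(p == k for p in passes) / n_evals,
    }


# Toy data (hypothetical evals, not leaderboard results).
toy = {
    "eval_a": [True, True, True, True, True],
    "eval_b": [True, True, True, False, False],
    "eval_c": [True, False, False, False, False],
    "eval_d": [False, False, False, False, False],
}
scores = summarize(toy)
print(scores)
```

In the toy data, eval_c counts toward Pass@5 but not Cons@5, and eval_b counts toward Cons@5 but not All@5, which is the kind of gap the two new columns are meant to expose.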

Claude Opus 4.7: the detailed review

Roblox published a detailed comparison of Opus 4.6 vs. 4.7 on this eval suite. The apparent Pass@1 drop (48.1% → 43.5%) is not statistically significant (p=0.24). What is significant: Opus 4.7 makes 39% fewer tool calls per task, with the largest reductions in exploration tools (search_game_tree, script_grep, inspect_instance). It tends to fail by stopping too early rather than over-engineering, and it recovers well on well-specified tasks where the target is unambiguous.

Roblox's recommendation: for open-ended discovery tasks (e.g. “remove tutorial assets,” “make lights toggle with day/night”), add explicit instructions for Opus 4.7 to explore the full workspace before acting. For well-defined tasks it outperforms 4.6 on several evals and costs less to run.
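As a rough sanity check on the significance claim above, the sketch below runs a pooled two-proportion z-test on the two code generation Pass@1 figures. It is not the test Roblox used, which this post does not describe. It assumes each model made 5 attempts on each of the 87 evals (435 attempts in total) and treats those attempts as independent, which likely overstates the evidence because attempts on the same eval are correlated.

```python
import math

ATTEMPTS = 87 * 5  # assumption: 5 attempts on each of the 87 evals


def two_proportion_p_value(rate_a: float, rate_b: float, n: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test,
    with n attempts for each of the two models."""
    wins_a = round(rate_a * n)  # success counts inferred from rounded rates
    wins_b = round(rate_b * n)
    pooled = (wins_a + wins_b) / (2 * n)
    std_err = math.sqrt(pooled * (1 - pooled) * (2 / n))
    z = (wins_a - wins_b) / n / std_err
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


# Pass@1 for Claude Opus 4.6 and Claude Opus 4.7 from the code generation table.
p_value = two_proportion_p_value(0.4805, 0.4345, ATTEMPTS)
print(f"p = {p_value:.2f}")
```

Under these assumptions the p-value lands around 0.17. That differs from the reported 0.24, as expected for a different test, but it points the same way: a gap of this size on 87 evals is not strong evidence of a real regression.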

New models: GPT-5.5, GPT-5.4, and GPT Codex 5.3

GPT-5.5 (Reasoning: M) entered both leaderboards: 40.69% Pass@1 on code generation and 50.00% Pass@1 on Debug. Its 0.91% tool error rate on code generation is the second-lowest of any model tested, behind only Claude Opus 4.6. GPT-5.4 (Reasoning: M) was also added to the code generation set (40.23% Pass@1, currently last) and its Debug results were re-run: 51.33% Pass@1, up from the previously reported 50.00%.

GPT Codex 5.3 appears on Debug only (47.33% Pass@1) with a strong Pass@5 (70.00%). It has not yet been evaluated on code generation.

What this means for your workflow

Claude Fable 5 as the default. #1 on both leaderboards, best consistency scores (Cons@5 and All@5) on both sets, and the strongest single-attempt bug-fixer by a wide margin (64.67% Debug Pass@1). The all-around pick for both building and fixing.
Gemini 3.5 Flash for iterative work. If you're running prompts multiple times and evaluating results, its Pass@5 lead (63.22%) makes it the best model for high-iteration workflows.
Claude Opus 4.6 for precision. Lowest tool error rate on both sets (0.71% codegen, 0.96% debug). Choose it when malformed tool calls and retry overhead are your main concern.
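GLM 5 as a Debug alternative. Effectively tied with Gemini 3.1 Pro on Debug Pass@1 (56.00% vs. 56.67%) and tied with Fable 5 on Debug Pass@5 (73.33%), with a low tool error rate (2.39%). Its All@5 (33.98%) is among the lower scores on the board, so expect more variation between runs.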
Claude Opus 4.7 for cost-sensitive projects. No statistically significant Pass@1 difference from 4.6 (p=0.24) and 39% fewer tool calls per task, with caveats for open-ended discovery tasks.
