
AI coding model leaderboard

Every LLM Flowpicker tracks, ranked by coding benchmarks. By default the table below is sorted by SWE-bench Verified, the benchmark that best predicts real-world coding, with the latest frontier models from Google, OpenAI, and Anthropic at the top. Sort by any column (SWE-bench, HumanEval, MMLU, context window, or price) and share the resulting ranking by link.

Ranked by SWE-bench, high to low.

How to read this leaderboard

SWE-bench Verified is the benchmark that best predicts real coding ability — it measures whether a model can resolve actual GitHub issues end to end, not solve isolated puzzles. HumanEval and MMLU are useful secondary signals (function-level correctness and general reasoning), but a high HumanEval score with a low SWE-bench score usually means a model is good at small snippets and weaker at navigating a real codebase.

One caveat the rankings can't show: the model with the best benchmark score is not always the best tool for day-to-day use. Cost, speed, context window, and how a model behaves inside an agent loop matter just as much. Sort by price or context to see those trade-offs, then compare any two models side by side or build a full stack around your pick.
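To make the trade-off concrete, here is a minimal sketch of re-ranking the same rows by different columns. The `ModelRow` fields, the `price_tier` scale, and every number are invented for illustration; none of it is Flowpicker's actual schema or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRow:
    """One leaderboard row. Field names are hypothetical, not Flowpicker's schema."""
    name: str
    swe_bench: float      # SWE-bench Verified, percent of issues resolved
    human_eval: float     # HumanEval pass rate, percent
    context_tokens: int   # context window size in tokens
    price_tier: int       # 1 = cheapest; a made-up ordinal scale


# Placeholder rows with invented numbers, purely to exercise the sort.
ROWS = [
    ModelRow("model-a", swe_bench=70.0, human_eval=90.0, context_tokens=200_000, price_tier=3),
    ModelRow("model-b", swe_bench=55.0, human_eval=93.0, context_tokens=128_000, price_tier=1),
    ModelRow("model-c", swe_bench=62.0, human_eval=88.0, context_tokens=1_000_000, price_tier=2),
]

# Columns where a smaller value is better sort ascending; the rest sort descending.
ASCENDING_COLUMNS = {"price_tier"}


def rank(rows, column):
    """Return rows ordered best-first by the given column name."""
    return sorted(
        rows,
        key=lambda row: getattr(row, column),
        reverse=column not in ASCENDING_COLUMNS,
    )


by_score = [row.name for row in rank(ROWS, "swe_bench")]
by_price = [row.name for row in rank(ROWS, "price_tier")]
print(by_score)  # ['model-a', 'model-c', 'model-b']
print(by_price)  # ['model-b', 'model-c', 'model-a']
```

In this toy data the top scorer is also the most expensive, and the cheapest model has the highest HumanEval score but the lowest SWE-bench score, which is the pattern described above.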

Frequently asked

What is the best LLM for coding in 2026?

By SWE-bench Verified, the latest frontier models from Google (Gemini 3), OpenAI (GPT-5.x), and Anthropic (Claude) lead the ranking for real-world coding — see the live table above for the current order, since it shifts as new models ship. For the cheapest capable option, sort by price: several open-weight models (DeepSeek, Qwen Coder) score competitively at a fraction of the cost.

What does SWE-bench measure?

SWE-bench Verified gives a model a real GitHub issue and the repository, then checks its patch against the project's tests: the tests tied to the original fix must now pass, and tests that already passed must keep passing. It rewards reading a codebase, locating the bug, and fixing it, which makes it the closest public proxy for day-to-day engineering work.
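The pass/fail decision for a single issue can be sketched as follows. SWE-bench task instances list two groups of tests, `FAIL_TO_PASS` (tests the fix should turn green) and `PASS_TO_PASS` (tests that should stay green). The function names, test IDs, and results below are hypothetical, and the sketch leaves out everything the real harness does to apply the patch and run the tests.

```python
def is_resolved(test_results, fail_to_pass, pass_to_pass):
    """Return True if the patch fixed the issue without breaking anything.

    test_results maps a test ID to True (passed) or False (failed) after the
    model's patch was applied. A test missing from the results counts as failed.
    """
    fixed = all(test_results.get(test, False) for test in fail_to_pass)
    unbroken = all(test_results.get(test, False) for test in pass_to_pass)
    return fixed and unbroken


def resolve_rate(outcomes):
    """Percentage of issues resolved, given one boolean per issue."""
    return 100.0 * sum(outcomes) / len(outcomes) if outcomes else 0.0


# Hypothetical test IDs for one issue.
fail_to_pass = ["test_parse_empty_header"]
pass_to_pass = ["test_parse_basic", "test_parse_unicode"]

# Patch A fixes the bug and keeps everything else green.
patch_a = {"test_parse_empty_header": True, "test_parse_basic": True, "test_parse_unicode": True}
# Patch B fixes the bug but breaks a test that used to pass.
patch_b = {"test_parse_empty_header": True, "test_parse_basic": False, "test_parse_unicode": True}

outcomes = [
    is_resolved(patch_a, fail_to_pass, pass_to_pass),
    is_resolved(patch_b, fail_to_pass, pass_to_pass),
]
print(outcomes)                # [True, False]
print(resolve_rate(outcomes))  # 50.0
```

A leaderboard score is then the resolve rate across all issues in the benchmark, so a patch that fixes the bug but causes a regression earns no credit.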

Where does this data come from?

Scores are compiled from each model's published benchmark results and provider documentation, and updated as new models ship. Use them as a directional ranking, not a guarantee for your specific codebase.

Embed this leaderboard

Link to the live ranking — it stays current as new models ship:

Full leaderboard: flowpicker.xyz/leaderboard.html
Ranked by HumanEval: flowpicker.xyz/leaderboard.html#sort=humanEval
Cheapest first: flowpicker.xyz/leaderboard.html#sort=priceTier
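The links above put the sort column in the URL fragment as `#sort=<key>`. If you generate these links in a script, a small sketch like the one below may help. It assumes the fragment is a simple `key=value` pair, which is inferred from the example links rather than from any documented format. Only `humanEval` and `priceTier` appear above; keys for other columns are not listed here, so none are guessed.

```python
from urllib.parse import parse_qs, urlsplit

# Base link as shown on this page (no scheme included).
BASE_URL = "flowpicker.xyz/leaderboard.html"


def sort_link(sort_key=None):
    """Build a leaderboard link; with no key, return the default ranking."""
    return BASE_URL if sort_key is None else f"{BASE_URL}#sort={sort_key}"


def sort_key_from_link(link):
    """Read the sort key back out of a link, or None if no sort is set."""
    fragment = urlsplit(link).fragment       # text after '#', e.g. 'sort=humanEval'
    values = parse_qs(fragment).get("sort")  # parse the fragment as key=value pairs
    return values[0] if values else None


print(sort_link("humanEval"))                      # flowpicker.xyz/leaderboard.html#sort=humanEval
print(sort_key_from_link(sort_link("priceTier")))  # priceTier
print(sort_key_from_link(BASE_URL))                # None
```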

Found your model? Build a full stack around it — Flowpicker shows compatibility warnings and monthly cost as you pick.
