168 TESTS COMPLETE
TPS.SH
01 / RESULTS
Benchmark Results
All 12 models compared across speed, quality, and cost: 9 local Ollama models versus 3 Claude API models, benchmarked on an Apple M2 Max with 32 GB of unified memory. TPS is decode throughput in tokens per second; TTFT is time to first token.
| | Model | Type | TPS (tok/s) | TTFT | Avg Time | Quality (/10) | Cost |
|---|-------|------|-------------|------|----------|---------------|------|
| ⚡ | Claude Haiku 4.5 | Cloud | 167.8 | 0.7s | 16.8s | 8.35 | $0.27 |
| | Claude Sonnet 4.6 | Cloud | 78.9 | 1.5s | 39.9s | 8.42 | $0.94 |
| | Claude Opus 4.6 | Cloud | 74.7 | 1.7s | 41.5s | 8.61 | $1.49 |
| 👑 | qwen3-coder | Local | 48.8 | 1.1s | 37.8s | 7.48 | Free |
| | gemma4:26b | Local | 39.2 | 24.5s | 65.9s | 8.36 | Free |
| | phi4:14b | Local | 17.9 | 1.5s | 56.5s | 7.21 | Free |
| | qwen2.5-coder:14b | Local | 15.6 | 1.5s | 68.2s | 6.64 | Free |
| | deepseek-r1:14b | Local | 14.6 | 70.2s | 137.0s | 5.89 | Free |
| | glm-4.7-flash | Local | 10.2 | 54.8s | 229.5s | 5.30 | Free |
| | qwen2.5-coder:32b | Local | 7.89 | 3.4s | 135.7s | 7.24 | Free |
| 🏆 | gemma4:31b | Local | 7.71 | 108.9s | 308.6s | 8.87 | Free |
| | qwen3:32b | Local | 6.92 | 171.9s | 497.1s | 5.76 | Free |
⚡ Fastest Model
Claude Haiku 4.5 at 167.8 tok/s — 3.4x faster than the best local model. Best value cloud option at $0.27.
🏆 Highest Quality
gemma4:31b scores 8.87/10 — the highest quality of any model, local or cloud. A 31B local model outscoring Claude Opus (8.61) is the Phase 3 headline.
👑 Best Local Speed
qwen3-coder at 48.8 tok/s with 7.48 quality. MoE architecture shines on M2 Max. 100% offline, zero cost, air-gapped.
02 / ARCHITECTURE
Benchmark Pipeline
Four-stage pipeline: load prompts, execute against models, judge quality, generate reports.
📄 Prompt Bank
21 YAML prompts across 7 coding categories. Each defines a task, expected behavior, and evaluation criteria.
21 prompts · 7 categories
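As a concrete illustration, a single prompt entry could look like the following, parsed with PyYAML. The field names are assumptions inferred from the description above, not the repo's actual schema.

```python
import yaml

# Hypothetical prompt entry; field names are illustrative, not the
# benchmark's actual YAML schema.
PROMPT_YAML = """
id: codegen_01
category: code_generation
task: Write a Python function that merges overlapping intervals.
expected_behavior: Returns a sorted list of non-overlapping intervals.
evaluation_criteria:
  - handles empty input
  - merges adjacent and nested intervals
  - includes type hints and a docstring
"""

prompt = yaml.safe_load(PROMPT_YAML)
print(prompt["category"], "·", prompt["task"])
```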
⚙ Runner
Executes prompts against Ollama (local) and Anthropic API (cloud). Captures TPS, TTFT, tokens, cost, and hardware metrics.
Ollama + Anthropic adapters
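On the local side, TTFT and TPS fall out of Ollama's streaming API. A minimal sketch using the official `ollama` Python client; the structure is illustrative, not the runner's actual code.

```python
import time
import ollama  # official Ollama Python client

def measure(model: str, prompt: str) -> dict:
    """Stream one completion, recording TTFT and server-side TPS."""
    start = time.perf_counter()
    ttft = None
    last = None
    for chunk in ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    ):
        if ttft is None and chunk["message"]["content"]:
            ttft = time.perf_counter() - start  # time to first token
        last = chunk
    # The final streamed chunk carries Ollama's own counters:
    # eval_count (output tokens) and eval_duration (nanoseconds).
    tps = last["eval_count"] / (last["eval_duration"] / 1e9)
    return {"ttft_s": round(ttft, 2), "tps": round(tps, 1)}

print(measure("qwen3-coder", "Write a binary search in Python."))
```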
⚖ Judge
Claude Sonnet 4.6 scores each output on correctness (40%), completeness (35%), and clarity (25%). Bias-flagged for self-evaluation.
weighted scoring · bias flags
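The rubric reduces to a weighted sum. A minimal sketch with the weights above; the per-dimension score fields are hypothetical names.

```python
# Weights from the rubric above; dimension scores are on a 0-10 scale.
WEIGHTS = {"correctness": 0.40, "completeness": 0.35, "clarity": 0.25}

def overall(scores: dict[str, float]) -> float:
    """Combine per-dimension judge scores into one weighted quality score."""
    return round(sum(WEIGHTS[k] * scores[k] for k in WEIGHTS), 2)

# e.g. a strong answer that is slightly terse:
print(overall({"correctness": 9.0, "completeness": 8.5, "clarity": 8.0}))  # ~8.57
```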
📊 Reports
Word docs, PowerPoint decks, React dashboard, and interactive comparison website. Full export pipeline.
docx + pptx + React + web
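On the Word side, python-docx needs only a handful of calls to emit a results table. A minimal sketch; the layout is an assumption, with values taken from the results table above.

```python
from docx import Document  # python-docx

doc = Document()
doc.add_heading("Benchmark Results", level=1)

rows = [
    ("Claude Haiku 4.5", "Cloud", "167.8", "8.35"),
    ("qwen3-coder", "Local", "48.8", "7.48"),
]
table = doc.add_table(rows=1, cols=4)
for cell, header in zip(table.rows[0].cells, ("Model", "Type", "TPS", "Quality")):
    cell.text = header
for row in rows:
    for cell, value in zip(table.add_row().cells, row):
        cell.text = value

doc.save("benchmark_report.docx")
```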
03 / PROMPT CATEGORIES
7 Coding Task Types
Each category contains 3 prompts of varying complexity, testing different aspects of code intelligence.
$ cat code_generation.yaml
Code Generation
3 prompts · Write new code from spec
$ cat debugging_reasoning.yaml
Debugging & Reasoning
3 prompts · Find and fix bugs
$ cat refactoring.yaml
Refactoring
3 prompts · Improve existing code
$ cat explanation_teaching.yaml
Explanation & Teaching
3 prompts · Explain concepts clearly
$ cat short_quick.yaml
Short Quick Tasks
3 prompts · Fast utility tasks
$ cat long_complex.yaml
Long Complex Research
3 prompts · Deep architecture tasks
$ cat tool_calling.yaml
Tool Calling / Agentic
3 prompts · Agentic tool use
04 / KEY FINDINGS
What 168 Tests Reveal
Phase 3 added 5 new local models and re-ran all 3 cloud models. The data tells a nuanced story about local vs cloud tradeoffs.
A local model takes the quality crown
gemma4:31b scores 8.87/10 — the highest quality of any model tested, surpassing Claude Opus (8.61) and Sonnet (8.42). A free, offline, 31B local model outperforming the best cloud models is the Phase 3 headline result.
8.87 > 8.61 quality
MoE architecture still dominates local speed
qwen3-coder (30B MoE) runs at 48.8 tok/s vs gemma4:26b at 39.2 tok/s and qwen2.5-coder:14b at 15.6 tok/s. MoE activates fewer parameters per token, leveraging M2 Max memory bandwidth more efficiently.
48.8 tok/s best local
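Back-of-envelope arithmetic shows the mechanism: at decode time, a memory-bandwidth-bound model must stream every active weight once per token. A sketch assuming roughly 400 GB/s of M2 Max bandwidth, ~4-bit (0.5 bytes/param) weights, and ~3B active parameters for the MoE; all three constants are approximations, not measured values.

```python
# Rough decode ceiling: tokens/s <= bandwidth / bytes of active weights per token.
# All constants below are approximations for illustration.
BANDWIDTH = 400e9      # M2 Max unified memory, bytes/s (approx.)
BYTES_PER_PARAM = 0.5  # ~4-bit quantized weights

def tps_ceiling(active_params: float) -> float:
    return BANDWIDTH / (active_params * BYTES_PER_PARAM)

print(f"MoE, ~3B active: {tps_ceiling(3e9):.1f} tok/s ceiling")   # ~266.7
print(f"Dense 32B:       {tps_ceiling(32e9):.1f} tok/s ceiling")  # ~25.0
```

Measured speeds land well under both ceilings once compute, KV-cache traffic, and runtime overhead are counted, but the order-of-magnitude gap between the two ceilings tracks the table: 48.8 tok/s for the 30B MoE versus 7 to 8 tok/s for the dense 32B models.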
gemma4:26b is the new quality-speed sweet spot
At 39.2 tok/s and 8.36 quality, gemma4:26b offers cloud-tier quality (above Haiku's 8.35) at local speeds with zero cost. It bridges the gap between qwen3-coder speed and Claude quality.
8.36 quality at 39.2 TPS
Claude-judging-Claude bias caveat
Cloud quality scores involve Claude models judging Claude outputs. These are flagged in all reports but may inflate cloud scores. A fair comparison would need an independent judge (GPT-4 or human reviewers).
bias flagged
05 / CLI
Command-Line Interface
8 Typer commands cover the full benchmark lifecycle: run, judge, analyze, report, export. A minimal Typer sketch follows the command list.
$ python -m llm_bench run
$ python -m llm_bench run --hardware-metrics
$ python -m llm_bench run -m qwen3-coder -c code_generation
$ python -m llm_bench judge <run_id>
$ python -m llm_bench analyze <run_id>
$ python -m llm_bench report <run_id>
$ python -m llm_bench export <run_id>
$ python -m llm_bench hardware-report
$ python -m llm_bench cost-estimate
$ python -m llm_bench list-runs
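Typer turns plain functions into subcommands, which is how a CLI this broad stays small. A minimal sketch of how two of the commands above could be declared; illustrative only, not llm_bench's actual source.

```python
import typer

app = typer.Typer(help="LLM benchmark lifecycle: run, judge, analyze, report.")

@app.command()
def run(
    model: str = typer.Option(None, "--model", "-m", help="Run a single model"),
    category: str = typer.Option(None, "--category", "-c", help="Limit to one category"),
    hardware_metrics: bool = typer.Option(False, "--hardware-metrics", help="Capture hardware metrics"),
):
    """Execute the prompt bank against the selected models."""
    typer.echo(f"running model={model or 'all'} category={category or 'all'}")

@app.command()
def judge(run_id: str):
    """Score a completed run with the LLM judge."""
    typer.echo(f"judging {run_id}")

if __name__ == "__main__":
    app()
```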
06 / TECH STACK
Built With
Python-first toolkit with React dashboard and interactive comparison websites.
Python + Typer
CLI framework
Ollama
Local model runtime
Anthropic API
Cloud model access
React + Recharts
Dashboard UI
Chart.js
Comparison website charts
python-docx
Word report generation
python-pptx
PowerPoint generation
Dive Into the Data
Interactive Chart.js visualizations, hardware analysis, and the full comparison.