Cognitive Revolution · July 10, 2025 · 70m

Why AI Benchmarks Are Broken (And How to Fix Them)

A critical examination of AI benchmarking. Current benchmarks are saturated, gameable, and measure the wrong things. Labenz proposes principles for better evaluation.

Canon

• Goodhart's law as hedonic adaptation for metrics

When a metric becomes a target, it stops being a good metric. AI benchmarks follow this pattern: labs optimize for benchmark scores until the scores no longer measure real capability.


• Benchmark gaming as the false self of AI: optimized appearances over genuine capability

Models trained to maximize benchmark scores develop a false self: they look impressive on standardized tests while lacking the genuine understanding those tests were meant to measure.
