TO BEnchmark OR NOT TO BEnchmark

I have spent twenty-five years measuring whether systems are any good, and only recently have I understood what I was really doing all that time.

The measuring took more forms than I could neatly list, though I can try. There was dependability benchmarking and robustness testing, fault injection and the injection of vulnerabilities and attacks, the security benchmarking of web applications and services and the frameworks beneath them, the evaluation of intrusion detection systems, trustworthiness benchmarking, the prediction of failures before they arrive, and, in these last few years, the question of what large language models actually produce when we hand them the job of writing code. Different tools, different communities, different decades, and for most of my career I lived them as genuinely different problems, each with its own literature and its own hard-won tricks. I am now convinced they were always the same problem, and that I spent the better part of a career circling something I could see well enough to work on but never quite name.

What the work kept teaching me

If you benchmark for long enough, and across enough domains, you begin to notice that the failures rhyme with one another.

Years ago, in security, some colleagues and I set out to do something that sounds almost trivial: take a handful of vulnerability detection tools, run them against the same software, and say which was best. We had everything we needed, and still could not answer, because the ranking depended entirely on which metric we chose to believe. Detection rate favored one tool, false positives another, and a different class of vulnerability reshuffled the order again. We were measuring carefully and reproducibly, and still could not say what “best” was supposed to mean.
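
A minimal sketch of that reshuffling, using invented counts for three hypothetical tools (the names and numbers are assumptions for illustration, not figures from the study described above). It ranks the same results twice, once by detection rate and once by precision, which stands in here for the false-positive view:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolResult:
    """Hypothetical outcome of running one detection tool on a shared target."""
    name: str
    true_positives: int
    false_positives: int
    false_negatives: int

    @property
    def detection_rate(self) -> float:
        # Share of real vulnerabilities the tool found (recall).
        return self.true_positives / (self.true_positives + self.false_negatives)

    @property
    def precision(self) -> float:
        # Share of the tool's reports that were real vulnerabilities.
        return self.true_positives / (self.true_positives + self.false_positives)


# Invented numbers, chosen only to make the effect visible.
RESULTS = [
    ToolResult("tool_a", true_positives=90, false_positives=60, false_negatives=10),
    ToolResult("tool_b", true_positives=60, false_positives=5, false_negatives=40),
    ToolResult("tool_c", true_positives=75, false_positives=25, false_negatives=25),
]


def rank(results, metric):
    """Return tool names ordered from best to worst under the given metric."""
    return [r.name for r in sorted(results, key=metric, reverse=True)]


by_detection_rate = rank(RESULTS, lambda r: r.detection_rate)
by_precision = rank(RESULTS, lambda r: r.precision)

print("by detection rate:", by_detection_rate)
print("by precision:     ", by_precision)
```

With these made-up counts the two orderings are exact reverses of each other, so "which tool is best" has no answer until someone decides which metric the question is about.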

I watched a tradition I deeply respect run into its own version of this. Database performance benchmarks have been rigorous and audited since the 1980s, and they measured performance beautifully while saying nothing about dependability. Systems were tuned to those scores and then deployed where staying up mattered far more than being fast. The benchmark measured exactly what it promised; the trouble was only that a performance number came to stand for overall quality, and nothing in the practice ever warned anyone away from that leap.

Then the newest version arrived, at a speed that still unsettles me. Large language models are ranked and funded and deployed on benchmarks that check whether their code passes a set of functional tests, and when we looked closely at what those models actually wrote, we kept finding code that passed the tests and was nonetheless insecure, hard to maintain, and unreliable in precisely the ways the benchmark never examined. A score built to say one narrow thing was being read, everywhere, as evidence of something far larger.
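
To make the gap concrete, here is a small sketch, assuming a toy lookup function and an in-memory SQLite table (both hypothetical, not taken from any benchmark or model output). The function satisfies a functional check and is still open to SQL injection, a property the functional check never exercises:

```python
import sqlite3


def find_user(conn, username):
    """Look up a user by name. Functionally correct, but insecure:
    the query is built by string interpolation rather than parameters."""
    query = f"SELECT name FROM users WHERE name = '{username}'"
    return sorted(row[0] for row in conn.execute(query))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])

# The kind of check a functional benchmark would run: it passes.
functional_pass = (
    find_user(conn, "alice") == ["alice"]
    and find_user(conn, "nobody") == []
)

# An input the functional check never tries: the WHERE clause becomes
# always true, so every row comes back.
leaked = find_user(conn, "x' OR '1'='1")

print("functional check passed:", functional_pass)
print("rows returned for injected input:", leaked)
```

A score based only on the first check would count this function as correct; passing the username as a bound parameter (`conn.execute("... WHERE name = ?", (username,))`) would close the hole without changing the functional result.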

Three domains, three decades, and underneath all of them the same quiet shape. A benchmark that measured something, carefully. A conclusion drawn from it that it could not support. And nowhere a principle that would have caught the gap before it slipped downstream into decisions that mattered. These were never separate problems. They were one absence, wearing different clothes each time I met it.

The thing we never built

We benchmark constantly in computer science. We benchmark performance and dependability, security and robustness, energy, and now the quality and trustworthiness of the AI systems reshaping the field. Benchmarks decide what gets deployed and what gets funded, what is taught to students as progress and what is quietly set aside. In a real and underappreciated sense, benchmarks are how our field decides what is true.

And yet we have never built a discipline concerned with how to do it well. We have a deep science of the systems we measure and almost none of the measuring itself. We teach software engineering, algorithms, and verification, each with its principles and its curriculum, while benchmarking, which sits underneath them all, we mostly learn by imitation and inherit by convention. Even our best foundations only make the gap easier to see. Jim Gray’s Benchmark Handbook codified how to build performance benchmarks; TPC and SPEC turned that into rigorous, audited practice; dependability benchmarking matured in its own corner and security benchmarking in another, each staying largely within its own property and community, patiently rediscovering lessons the others had already learned. It was disciplined practice within silos, but never a discipline that spanned them. I spent my career inside several of those silos at once, which is perhaps why their walls eventually looked less like natural boundaries than accidents of history.

Why now, and not in Coimbra

For a long time I could have written some of this and did not. The pieces fell into place only recently, and for two reasons that arrived together.

One half belongs to the field. The rise of large language models took a problem that had simmered for decades and forced it into the open, at a scale that made it impossible to keep filing away as somebody else’s concern. When billions of dollars and a great deal of public trust ride on benchmark scores that the research community is, at that very moment, publishing papers to question, the absence of a discipline stops being a matter of intellectual tidiness and becomes something more urgent.

The other half belongs to me, and to a move across an ocean. For most of my career I worked inside one strong and beautiful tradition, the dependability community that shaped me in Coimbra, and when you are that deeply embedded in a way of doing things, its assumptions become almost invisible; they feel less like choices than like the water you swim in. Leaving changed that. Arriving in Charlotte, on new soil, among people who came to these questions by other paths, I kept asking which of my habits of mind were genuine principles and which were merely artifacts of a particular place. It is an uncomfortable question, but it is exactly the one that let me see the walls between the benchmarking domains for what they are. You do not notice the assumptions of a place until you have left it. It turns out the same is true of a discipline.

Naming it

I have come to believe the missing thing deserves a name, because naming it is the first real step toward building it. I have been calling it Benchmark Engineering: the principles, the vocabulary, and the practices for designing, validating, maintaining, and eventually retiring benchmarks, treated as an engineering discipline in its own right rather than a chore we perform on the way to work we consider more important.

I do not pretend to have that discipline worked out. The principles still need to be articulated, the vocabulary agreed, the body of knowledge slowly assembled, and none of that is the work of one person or one paper. What I am sure of is that the absence itself has become the bottleneck, and that the crisis in AI benchmarking is not a new problem at all but the oldest one in benchmarking, arriving at last at a speed we can no longer ignore.

Looking back, I suspect nearly every strand of work I listed at the top was, without my quite realizing it, an attempt to practice Benchmark Engineering before I had a name to give it. Looking forward, I would like to spend a good part of whatever career I have left helping to give it one properly: principles that can be taught, a vocabulary that travels across domains, a community that finally stops rediscovering the same mistakes in isolation. I have written the first, careful version of this argument in the compressed register a paper demands. This is the other version, the one that can afford to admit how long it took me to see what had been in front of me the whole time.

A note from the other half of this blog

There is a line I keep returning to. A well-designed system carries its assumptions forward whether or not you remember writing them down. I used it in my first post to describe something personal, the way a place you leave keeps quietly shaping what you notice. But it is also, almost exactly, what an unexamined benchmark is: an assumption about what matters, propagated silently through everything built on top of it, long after anyone remembers choosing it.

Adler argued that we are pulled forward by the goals we choose rather than pushed by our past, and I find the idea more useful the older I get. Naming a discipline is a forward-pulled act. I am not trying to catalog what benchmarking has been; I am trying to say what it might become, and to persuade enough of us that it is worth the work of building. That, it turns out, was never really a separate question from the ones this blog was always going to be about.
