Benchmark Engineering

Benchmarking is among the most consequential activities in computer science, yet it has never matured into a discipline of its own. For over two decades, my research has examined how benchmarks are designed, what makes them fail, and what principled foundations would look like across domains including security, dependability, and AI. This page brings that work together under a single agenda: Benchmark Engineering.

Benchmark Design  ·  Dependability Benchmarking  ·  Security Benchmarking  ·  Resilience Benchmarking  ·  AI Benchmarking  ·  Measurement Science

Opinion & Perspectives

Provocation, argument, and reflection on why computer science needs benchmarking as a first-class discipline.

Opinion  ·  In Preparation

TO BEnchmark OR NOT TO BEnchmark

Marco Vieira  ·  ?
A short provocation arguing that computer science has never built the discipline benchmarking deserves.
Position Paper  ·  TPCTC  ·  2009

From Performance to Dependability Benchmarking: A Mandatory Path

Marco Vieira, Henrique Madeira  ·  Performance Evaluation and Benchmarking, TPCTC 2009
Argues that dependability must become a first-class citizen of benchmarking alongside performance — and that the community has no credible alternative.

Keynotes & Tutorials

Invited keynotes and tutorials on benchmark engineering across security, dependability, and AI.

Keynote  ·  Nov 2025

Benchmarking GenAI for Software Engineering: Challenges and Insights

AISM @ ASE 2025  ·  Seoul, South Korea
Keynote  ·  2018

Perspectives on Dependability and Security Benchmarking: TO BEnchmark OR NOT TO BEnchmark

QRS 2018  ·  Lisbon, Portugal
Keynote  ·  2019

Trustworthiness Benchmarking of (Safety) Critical Systems

SAFECOMP 2019  ·  Turku, Finland
Keynote  ·  2018

Benchmarking the Security of Software Systems OR TO BEnchmark or NOT TO Benchmark

DESSERT 2018  ·  Kyiv, Ukraine
Tutorial  ·  2018

From Software Security Assessment to Security Benchmark

ISSRE 2018  ·  Memphis, TN, USA  ·  with N. Antunes
Tutorial  ·  2013

Benchmarking the Dependability of Computer Systems

QEST 2013  ·  Buenos Aires, Argentina  ·  with N. Antunes
Tutorial  ·  2009

Dependability Benchmarking of Computer Systems

EuroSys 2009  ·  Nuremberg, Germany

Academic Seminars

Seminar  ·  Feb 2022

Benchmarking Machine Learning-based Online Failure Prediction Models

UNC Charlotte  ·  Charlotte, NC, USA
Seminar  ·  Feb 2019

Benchmarking the Security of Software Systems

UFPE  ·  Recife, Brazil
Seminar  ·  Jun 2018

Benchmarking the Security of Software Systems

LASIGE Workshop  ·  University of Lisbon
Seminar  ·  Apr 2016

On the Metrics for Benchmarking Vulnerability Detection Tools

City University London  ·  London, UK

Foundational Research

Over two decades of publications forming the empirical foundation of the Benchmark Engineering agenda, spanning AI evaluation, security, and dependability. Full list available on the publications page.

AI & LLM Evaluation
PROBE: Benchmarking Code Generation in Large Language Models EMSE 2026
TestForge: A Benchmarking Framework for LLM-Based Test Case Generation SANER 2026
Polyglot: An Extensible Framework to Benchmark Code Translation with LLMs ASE 2025
Beyond Functional Correctness: An Empirical Evaluation of LLMs for Text-to-Code Generation ISSRE 2025
Security & Vulnerability Benchmarking
Dependability & Resilience Benchmarking
Benchmarking Software Aging Effects in Container Platforms IEEE Trans. Reliability 2025
A benchmarking process to assess software requirements documentation for space applications JSS 2015
Towards a Resilience Benchmarking Description Language for the Context of Satellite Simulators EDCC 2014
A Research Agenda for Benchmarking the Resilience of Software Defined Networks ISSREW 2014
SCoRe: An across-the-board metric for computer systems resilience benchmarking DSN-W 2013
Resilience Benchmarking Book Chapter 2012
Changeloads for Resilience Benchmarking of Self-Adaptive Systems: A Risk-Based Approach EDCC 2012
Changeloads: A Fundamental Piece on the SASO Systems Benchmarking Puzzle SASOW 2012
Incorporating Recovery from Failures into a Data Integration Benchmark TPCTC 2012
Benchmarking the resilience of self-adaptive software systems: perspectives and challenges ICSE-W 2011
From Performance to Resilience Benchmarking ICDCS-W 2010
Benchmarking the Resilience of Self-Adaptive Systems: A New Research Challenge SRDS 2010
How to Advance TPC Benchmarks with Dependability Aspects TPCTC 2010
Benchmarking Software Requirements Documentation for Space Applications SAFECOMP 2010
From Performance to Dependability Benchmarking: A Mandatory Path TPCTC 2009
wsrbench: An On-Line Tool for Robustness Benchmarking SCC 2008
Benchmarking the Robustness of Web Services PRDC 2007
Dependability Benchmarking of Web-Servers SAFECOMP 2004
Portable Faultloads Based on Operator Faults for DBMS Dependability Benchmarking COMPSAC 2004
Benchmarking the Dependability of Different OLTP Systems DSN 2003
Plug and Play Fault Injector for Dependability Benchmarking LADC 2003
A Dependability Benchmark for OLTP Application Environments VLDB 2003
Definition of Faultloads Based on Operator Faults for DBMS Recovery Benchmarking PRDC 2002