Benchmark Engineering

Benchmarking is among the most consequential activities in computer science, yet it has never matured into a discipline of its own. For over two decades, my research has examined how benchmarks are designed, what makes them fail, and what principled foundations would look like across domains including security, dependability, and AI. This page brings that work together under a single agenda: Benchmark Engineering.

Benchmark Design · Dependability Benchmarking · Security Benchmarking · Resilience Benchmarking · AI Benchmarking · Measurement Science

Opinion & Perspectives

Provocation, argument, and reflection on why computer science needs benchmarking as a first-class discipline.

Opinion · In Preparation

TO BEnchmark OR NOT TO BEnchmark

Marco Vieira · ?

A short provocation arguing that computer science has never built the discipline benchmarking deserves.

Position Paper · TPCTC · 2009

From Performance to Dependability Benchmarking: A Mandatory Path

Marco Vieira, Henrique Madeira · Performance Evaluation and Benchmarking, TPCTC 2009

Argues that dependability must become a first-class citizen of benchmarking alongside performance — and that the community has no credible alternative.

Read paper

Keynotes & Tutorials

Invited keynotes and tutorials on benchmark engineering across security, dependability, and AI.

Keynote · Nov 2025

Benchmarking GenAI for Software Engineering: Challenges and Insights

AISM @ ASE 2025 · Seoul, South Korea

Slides · Event

Keynote · 2018

Perspectives on Dependability and Security Benchmarking: TO BEnchmark OR NOT TO BEnchmark

QRS 2018 · Lisbon, Portugal

Slides · Event

Keynote · 2019

Trustworthiness Benchmarking of (Safety) Critical Systems

SAFECOMP 2019 · Turku, Finland

Keynote · 2018

Benchmarking the Security of Software Systems OR TO BEnchmark or NOT TO Benchmark

DESSERT 2018 · Kyiv, Ukraine

Tutorial · 2018

From Software Security Assessment to Security Benchmark

ISSRE 2018 · Memphis, TN, USA · with N. Antunes

Tutorial · 2013

Benchmarking the Dependability of Computer Systems

QEST 2013 · Buenos Aires, Argentina · with N. Antunes

Tutorial · 2009

Dependability Benchmarking of Computer Systems

EuroSys 2009 · Nuremberg, Germany

Academic Seminars

Seminar · Feb 2022

Benchmarking Machine Learning-based Online Failure Prediction Models

UNC Charlotte · Charlotte, NC, USA

Seminar · Feb 2019

Benchmarking the Security of Software Systems

UFPE · Recife, Brazil

Seminar · Jun 2018

Benchmarking the Security of Software Systems

LASIGE Workshop · University of Lisbon

Seminar · Apr 2016

On the Metrics for Benchmarking Vulnerability Detection Tools

City University London · London, UK

Foundational Research

Over two decades of publications forming the empirical foundation of the Benchmark Engineering agenda, spanning AI evaluation, security, and dependability. Full list available on the publications page.

AI & LLM Evaluation

PROBE: Benchmarking Code Generation in Large Language Models EMSE 2026

TestForge: A Benchmarking Framework for LLM-Based Test Case Generation SANER 2026

Polyglot: An Extensible Framework to Benchmark Code Translation with LLMs ASE 2025

Beyond Functional Correctness: An Empirical Evaluation of LLMs for Text-to-Code Generation ISSRE 2025

Security & Vulnerability Benchmarking

A Multi-Criteria Analysis of Benchmark Results With Expert Support for Security Tools IEEE TDSC 2022

An approach for benchmarking the security of web service frameworks FGCS 2020

Benchmarking Static Analysis Tools for Web Security IEEE Trans. Reliability 2018

An Approach for Trustworthiness Benchmarking Using Software Metrics PRDC 2018

Practical Evaluation of Static Analysis Tools for Cryptography: Benchmarking Method and Case Study ISSRE 2017

On the Metrics for Benchmarking Vulnerability Detection Tools DSN 2015

Assessing and Comparing Vulnerability Detection Tools for Web Services IEEE TSC 2015

Evaluating Computer Intrusion Detection Systems: A Survey of Common Practices ACM Comp. Surveys 2015

Selecting Secure Web Applications Using Trustworthiness Benchmarking IJDTIS 2011

Trustworthiness Benchmarking of Web Applications Using Static Code Analysis ARES 2011

Towards benchmarking the trustworthiness of web applications code EWDC 2011

TO BEnchmark or NOT TO BEnchmark security: That is the question DSN-W 2011

Benchmarking Untrustworthiness: An Alternative to Security Measurement IJDTIS 2010

Benchmarking Vulnerability Detection Tools for Web Services ICWS 2010

Benchmarking Untrustworthiness in DBMS Configurations LADC 2009

A Trust-Based Benchmark for DBMS Configurations PRDC 2009

Towards a Security Benchmark for Database Management Systems DSN 2005

Dependability & Resilience Benchmarking

Benchmarking Software Aging Effects in Container Platforms IEEE Trans. Reliability 2025

A benchmarking process to assess software requirements documentation for space applications JSS 2015

Towards a Resilience Benchmarking Description Language for the Context of Satellite Simulators EDCC 2014

A Research Agenda for Benchmarking the Resilience of Software Defined Networks ISSREW 2014

SCoRe: An across-the-board metric for computer systems resilience benchmarking DSN-W 2013

Resilience Benchmarking Book Chapter 2012

Changeloads for Resilience Benchmarking of Self-Adaptive Systems: A Risk-Based Approach EDCC 2012

Changeloads: A Fundamental Piece on the SASO Systems Benchmarking Puzzle SASOW 2012

Incorporating Recovery from Failures into a Data Integration Benchmark TPCTC 2012

Benchmarking the resilience of self-adaptive software systems: perspectives and challenges ICSE-W 2011

From Performance to Resilience Benchmarking ICDCS-W 2010

Benchmarking the Resilience of Self-Adaptive Systems: A New Research Challenge SRDS 2010

How to Advance TPC Benchmarks with Dependability Aspects TPCTC 2010

Benchmarking Software Requirements Documentation for Space Applications SAFECOMP 2010

From Performance to Dependability Benchmarking: A Mandatory Path TPCTC 2009

wsrbench: An On-Line Tool for Robustness Benchmarking SCC 2008

Benchmarking the Robustness of Web Services PRDC 2007

Dependability Benchmarking of Web-Servers SAFECOMP 2004

Portable Faultloads Based on Operator Faults for DBMS Dependability Benchmarking COMPSAC 2004

Benchmarking the Dependability of Different OLTP Systems DSN 2003

Plug and Play Fault Injector for Dependability Benchmarking LADC 2003

A Dependability Benchmark for OLTP Application Environments VLDB 2003

Definition of Faultloads Based on Operator Faults for DBMS Recovery Benchmarking PRDC 2002