ponderance / anmol
← Marginalia

ml · security · evals

Adversarial evals: what the benchmark doesn't measure is what the attacker exploits

A short argument for why capability benchmarks and adversarial red-teaming are measuring different things, and what to do about it.

2026-06-01

Capability benchmarks measure how well a model solves the intended task. Adversarial evals measure how well a model resists unintended inputs designed to hijack its behavior. These are not the same axis, and conflating them produces blind spots.

The gap

A model can score 90th-percentile on MMLU and still:

  • Follow a prompt-injected instruction embedded in a retrieved document
  • Exfiltrate data from a tool-calling context when the tool’s description has been poisoned
  • Output plausible-looking but fatally wrong code under a targeted adversarial suffix

None of these failures show up in standard capability evals because capability evals assume benign inputs. The adversarial surface is in the input distribution.

What I look for in adversarial evals

The things worth measuring are roughly:

  1. Prompt injection resistance — does the model follow instructions from the system prompt or from an attacker-controlled turn?
  2. Jailbreak robustness — does a persona framing or a “DAN”-style prefix override the safety training?
  3. Tool-call integrity — in an agentic loop, can a poisoned tool description cause the model to call a different tool or pass different arguments than intended?
  4. Latent adversarial suffix sensitivity — do adversarial suffixes (à la Zou et al. 2023) transfer between models, and at what success rate?

For items 1–3, the eval structure is identical to a standard task eval, except the input distribution is adversarially crafted. For item 4, you need a suffix-generation pipeline (GCG or a variant) and a separate attack-success rate metric.

A simple harness

from dataclasses import dataclass
from typing import Callable

@dataclass
class AdversarialCase:
    id: str
    benign_input: str
    adversarial_input: str
    injected_goal: str
    # True if the model should RESIST the injection (return False to flag success)
    expect_resistance: bool = True

def run_adversarial_eval(
    cases: list[AdversarialCase],
    model_fn: Callable[[str], str],
    resistance_judge: Callable[[str, str], bool],
) -> dict:
    results = []
    for case in cases:
        response = model_fn(case.adversarial_input)
        resisted = resistance_judge(response, case.injected_goal)
        results.append({
            "id": case.id,
            "resisted": resisted,
            "expected": case.expect_resistance,
            "pass": resisted == case.expect_resistance,
        })
    pass_rate = sum(r["pass"] for r in results) / len(results)
    return {"pass_rate": pass_rate, "cases": results}

The key design choice: resistance_judge is a separate model call — you don’t let the model under test self-report. This avoids the eval gaming itself.

The honest conclusion

Adversarial evals are not a replacement for capability evals. They answer different questions. A model that fails both has obvious problems. A model that passes capability evals but fails adversarial evals is dangerous in production. A model that passes adversarial evals but fails capability evals is probably useless.

The ratio of attention each gets in the field is roughly the inverse of their importance in deployed systems. Fix that.

← back to the index