Pipeline Replay Benchmark

How the Pennyfarthing agent pipeline (TEA → Dev → Reviewer) performs on seeded-defect scenarios, across persona themes — weighted catch rate, per-finding hit rate, and phase attribution.

Score vs Consistency ›

Each theme plotted by mean weighted catch rate against run-to-run consistency, with quadrants at the control baseline. Color by OCEAN trait.

Finding Hit Rate ›

A heatmap of how reliably each individual ground-truth finding is caught, per theme.

Phase Attribution ›

Which phase — TEA, Dev, or Reviewer — actually catches the defects, on average per run.

Ground truth = real findings external reviewers flagged on PRs the pipeline had already approved. See the Peloton guide for methodology. Data is a snapshot; regenerate with pf benchmark viz.