Pipeline Replay Benchmark

How the Pennyfarthing agent pipeline (TEA → Dev → Reviewer) performs on seeded-defect scenarios, across persona themes — weighted catch rate, per-finding hit rate, and phase attribution.
Mean weighted score vs consistency scatter
Score vs Consistency ›
Each theme plotted by mean weighted catch rate against run-to-run consistency, with quadrants at the control baseline. Color by OCEAN trait.
Finding Hit Rate ›
A heatmap of how reliably each individual ground-truth finding is caught, per theme.
Phase Attribution ›
Which phase — TEA, Dev, or Reviewer — actually catches the defects, on average per run.
Ground truth = real findings external reviewers flagged on PRs the pipeline had already approved. See the Peloton guide for methodology. Data is a snapshot; regenerate with pf benchmark viz.