Publications

Our research at the intersection of AI and cybersecurity.

With five lines of config, a benchmark harness can handle most of what a naive setup gets wrong. What it can't catch is what's already inside the model. Pretraining is the ceiling. Everything else is engineering.

Blog Draft 2026-04-17

How Benchmark Runs Contaminate Each Other

S. Lafrance

Giving each run its own workspace isn't enough. I ran 90 benchmarks across three agents. Codex read a three-day-old session file. Claude read a previous run's exploit and called it recon.

Blog 2026-04-13

The Problem of Internet Access in Benchmarks

S. Lafrance

Most benchmarks look solid until you ask where the answers came from. I ran three agents against public CTF challenges. Codex scored 28/35. Twelve of those wins came from search results.

Harbor, and the One Thing It Can't Fix

How Benchmark Runs Contaminate Each Other

The Problem of Internet Access in Benchmarks