Publications
Our research at the intersection of AI and cybersecurity.
Blog Draft 2026-04-22
Harbor, and the One Thing It Can't Fix
In five lines of config, a benchmark harness can handle most of what a naive setup gets wrong. What it can't catch is what's already inside the model. Pretraining is the ceiling. Everything else is engineering.
Blog Draft 2026-04-17
How Benchmark Runs Contaminate Each Other
Giving each run its own workspace isn't enough. I ran 90 benchmarks across three agents. Codex read a three-day-old session file. Claude read a previous run's exploit and called it recon.
Blog 2026-04-13
The Problem of Internet Access in Benchmarks
Most benchmarks look solid until you ask where the answers came from. I ran three agents against public CTF challenges. Codex scored 28/35. Twelve of those wins came from search results.