Publications
Our research at the intersection of AI and cybersecurity.
Blog Draft 2026-04-14
Harbor, and the One Thing It Can't Fix
With five lines of config, a benchmark harness can handle most of what a naive setup gets wrong. What it can't catch is what's already inside the model. Pretraining is the ceiling. Everything else is engineering.
Blog Draft 2026-04-10
Your Agents See Everything
Giving each run its own working directory sounds like enough isolation. It isn't. One of my agents grepped flags out of a three-day-old session file. Another read the previous run's exploit off a shared machine. Both reported correct answers. Neither did the work.
Blog Draft 2026-04-01
The Problem of Internet Access in Benchmarks
Most benchmarks look solid until you ask where the answers came from. I ran three agents against public CTF challenges. Codex scored 28/35. Twelve of those wins came from search results.