Publications

Our research at the intersection of AI and cybersecurity.

Blog Draft

Harbor, and the One Thing It Can't Fix

In five lines of config, a benchmark harness can handle most of what a naive setup gets wrong. What it can't catch is what's already inside the model. Pretraining is the ceiling. Everything else is engineering.

Blog Draft

Your Agents See Everything

Giving each run its own working directory sounds like enough isolation. It isn't. One of my agents grepped flags out of a three-day-old session file. Another read the previous run's exploit off a shared machine. Both reported correct answers. Neither did the work.

Blog Draft

The Problem of Internet Access in Benchmarks

Most benchmarks look solid until you ask where the answers came from. I ran three agents against public CTF challenges. Codex scored 28/35. Of those wins, 12 came from search results.