Publications

Our research at the intersection of AI and cybersecurity.

Blog Draft

Harbor, and the One Thing It Can't Fix

With five lines of config, a benchmark harness can handle most of what a naive setup gets wrong. What it can't catch is what's already inside the model. Pretraining is the ceiling. Everything else is engineering.

Blog Draft

How Benchmark Runs Contaminate Each Other

Giving each run its own workspace isn't enough. I ran 90 benchmarks across three agents. Codex read a three-day-old session file. Claude read a previous run's exploit and called it recon.

Blog

The Problem of Internet Access in Benchmarks

Most benchmarks look solid until you ask where the answers came from. I ran three agents against public CTF challenges. Codex scored 28/35. Twelve of those wins came from search results.