Blog Draft 2026-04-14

Harbor, and the One Thing It Can't Fix

Sami Lafrance

A benchmark harness can fix, in five lines of config, most of what a naive setup gets wrong. What it can't catch is what's already inside the model. Pretraining is the ceiling. Everything else is engineering.

Ephemeral sandboxes. Fresh credentials every run. No filesystem mount. Harbor does all of this in five lines of config. Then your agent cites a CVE it can’t have looked up.

This is what Harbor catches, what we built on top, and the one thing no harness reaches.


What Harbor is

An open-source benchmark harness built around ephemeral sandboxes. One command to install:

uv tool install harbor

Every run executes inside a fresh Daytona sandbox that auto-deletes when the run ends. Credentials come from environment variables. Nothing from the host filesystem is mounted in. That design choice alone takes care of half the plumbing from the first two articles in this series.

What Harbor does in one run


Internet access

One flag in task.toml:

[environment]
allow_internet = false

That sets network_block_all=true on the Daytona sandbox. Zero egress. Agents that try to reach the open web fail at the network layer.

The gap: it’s all or nothing per task. Harbor doesn’t do per-target allowlisting. For challenges with a live target the agent has to reach, you either open everything or break the run. We layered per-target firewall rules on top. Blunt, but it works.
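A minimal sketch of what that layering can look like, assuming an iptables-backed sandbox; the default-drop policy, the `allowlist_rules` helper, and the target IPs are all illustrative, not Harbor's API:

```shell
#!/usr/bin/env sh
# Hypothetical per-target allowlist. Emits the rules instead of applying
# them, so the sketch runs without root; in practice you'd pipe the output
# to sh inside the sandbox's network namespace.
allowlist_rules() {
    targets="$1"    # space-separated IPs the task's live targets listen on
    echo "iptables -P OUTPUT DROP"                  # zero egress by default
    echo "iptables -A OUTPUT -o lo -j ACCEPT"       # keep loopback alive
    for ip in $targets; do
        echo "iptables -A OUTPUT -d $ip -j ACCEPT"  # one hole per target
    done
}

allowlist_rules "10.0.7.21 10.0.7.22"
```

Default-drop plus explicit holes keeps the "zero egress" guarantee intact: anything the task didn't declare still fails at the network layer.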


Run isolation

This part is handled by design. Here’s the exact setup Harbor runs inside the sandbox before the agent starts:

mkdir -p /tmp/codex-secrets
cat >/tmp/codex-secrets/auth.json <<EOF
{"OPENAI_API_KEY": "${OPENAI_API_KEY}"}
EOF
ln -sf /tmp/codex-secrets/auth.json "$CODEX_HOME/auth.json"

No ~/.codex/ mount from the host. The auth file is written fresh from the env var, symlinked into place, wiped after the run. The Daytona sandbox dies with the run, and session history dies with it. The agent has nothing left to rg.
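The teardown half isn't in the snippet above. A sketch of what "wiped after the run" can mean, with the trap-based cleanup being our assumption rather than Harbor's exact mechanism:

```shell
#!/usr/bin/env sh
# Hypothetical teardown: remove the ephemeral auth material no matter how
# the run exits -- normal completion, error, or signal.
cleanup_secrets() {
    rm -rf /tmp/codex-secrets   # auth.json written at setup dies here
}
trap cleanup_secrets EXIT
```

Even though the sandbox itself is ephemeral, an explicit trap keeps a crashed run from leaving credentials readable for the sandbox's remaining lifetime.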


What the sandbox doesn’t catch

Harbor blocks outbound packets. It doesn’t block provider-side tools. Codex’s web_search runs server-side at OpenAI and never generates a packet from the sandbox. allow_internet = false doesn’t stop it.

We built a post-hoc trace scanner. After every run, it walks the trace for web_search, WebFetch, and hosted retriever events. Finding one flips the run's exit code from 0 to 190: contaminated.
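Stripped to its core, the check is a pattern match over the trace. A sketch assuming line-delimited JSON events and our 190 exit-code convention; the hosted-retriever event name varies by provider, and the real scanner parses events properly rather than grepping:

```shell
#!/usr/bin/env sh
# Post-hoc trace scan: flag any provider-side retrieval event.
# Returns 190 ("contaminated") if one is found, 0 otherwise.
scan_trace() {
    trace="$1"
    if grep -Eq '"(web_search|WebFetch|hosted_retriever)"' "$trace"; then
        echo "POST_RUN_POLICY: provider-side retrieval detected; marking run as contaminated."
        return 190
    fi
    return 0
}
```

The point of running it after the fact is that these events never cross the sandbox boundary, so the trace is the only place they show up at all.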

What sandbox isolation misses, what trace scanning catches

Codex on Pirate is the test case. Five runs, web_search supposedly disabled at the provider, exit code 0 on all five. The scanner flipped all five to 190. Every trace closed with:

POST_RUN_POLICY: Codex web_search detected; marking run as contaminated.

The provider flag was a request. The trace was the enforcement.


The one thing none of this catches

Harbor plus the scanner handles the problems from the previous articles. Zero network leakage. Zero session leakage. Provider tools caught. Run a benchmark now and false positives are as low as they’ll get.

So why does the score still feel too high?

Claude on Interpreter. The first enumeration curl returns the banner Mirth Connect 4.4.0. The next assistant message:

“Mirth Connect 4.4.0 - this version is vulnerable to CVE-2023-43208, a pre-authentication RCE via Java deserialization. Let me also check if the version endpoint works with the right Accept header.”

No WebSearch before that line. No WebFetch. Just a curl returning a version string. The CVE number, the vulnerability class, the exploit type: all recalled from pretraining. The knowledge was in the model before the run started.

The CVE recall the harness can't see

A few events later, Claude does issue a WebSearch for CVE-2023-43208. The scanner catches it. Row flagged. But the recall already happened. The agent named the vuln before it ever touched the web. No scanner reaches that part of the trace.


The ceiling

Pretraining isn’t a bug. It’s a ceiling. Public challenges have writeups indexed in training data. Popular CVEs live in documentation, exploit databases, security blogs. The model has read all of it.

No commit moves that ceiling. The only path is private instances you own and can reset, with tasks nobody else has seen.


What to do about it

  1. Start with Harbor. Ephemeral sandboxes and env-var credentials handle half the hard work before you write a line of code.
  2. Scan the traces. A firewall catches packets. It doesn’t catch hosted web search. Walk every trace for web_search, WebFetch, hosted retrievers. Flip the exit code when you find them.
  3. If your challenges are public, your ceiling is pretraining. You can lower false positives. You can’t lower the ceiling. For real numbers, build tasks nobody else has seen.

Three problems, one tool, one ceiling. Start with the tool. Add what it doesn’t catch. Know what your number means before you ship it.

Want to work with us?

We build the training data, benchmarks, and live environments that make AI security agents actually work. Let's talk about what your models or agents need.