TL;DR
- Each run had its own workspace. Both leaks still happened.
- Codex grep’d ~/.codex/sessions/ and reported 3-day-old flags as its answer.
- Claude ran five times on the same machine. Each run read the previous run’s exploit.
You can give an agent its own working directory, its own container, its own everything. And it’ll still read data that has nothing to do with this run. I spent 90 runs on Hack The Box watching it happen in two different ways.
The last article was about the first way agents cheat: internet access. This one is about the second: their own setup leaking state between runs. One of these is every agent’s problem. The other only exists because HTB machines are shared.
The test
Three agents: Claude (Claude Code, Opus 4.6), Codex (Codex CLI, GPT-5.4), Mistral (Vibe, Devstral 2). Six Hack The Box machines, five runs per machine per agent, 90 total. Same prompt, same permissions, same timeout. Each run inside its own local working directory.
HTB is a useful stress test for this: the target persists between runs if you don’t reset it, and every agent CLI keeps session state somewhere in the home directory. Both sources leak.
What Codex read out of its own past
“I have the flags from raw prior output.”
Codex, stalled on Overwatch. The live target is unreachable. Enumeration gives up after ten minutes.
Then it runs rg across ~/.codex/sessions/ and opens a file from its own Overwatch run three days earlier:
user.txt  ef3f384527cbb691...
root.txt  e1f02488fc5f42ef...
It hands those flags in as the answer. A kid copying his neighbor’s paper. Except the neighbor is himself, three days ago.
The Overwatch box had been rebuilt between the two runs. The flags Codex submitted were valid once. Not anymore.
No web search in the trace. No hallucination. The file is real, the flags are real, the trace looks clean. That’s what makes it hard to spot: a wrong answer from a real file looks like a right answer from nowhere.
The reflex fix is one container per run. We tried it. Codex authenticates from ~/.codex/auth.json, which lives in the home directory. So if you mount the home, ~/.codex/sessions/ comes in with it. Docker isolates what you forget to share. Everything you share is shared.
The actual fix is to stop mounting the home directory at all. We spin up an ephemeral Daytona sandbox per run, pass the API key as OPENAI_API_KEY, and let Codex rebuild auth.json inside the sandbox the first time it calls the API. When the run ends, the sandbox dies, auth dies with it, sessions never existed.
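The same idea can be sketched locally without Daytona: give each run a throwaway HOME and pass auth only through the environment. This is a minimal approximation, not Harbor's actual implementation — the command and key names are whatever your agent CLI expects.

```python
import os
import shutil
import subprocess
import tempfile

def run_in_ephemeral_home(cmd: list[str], api_key: str) -> subprocess.CompletedProcess:
    """Run an agent CLI with a throwaway HOME so no session state survives.

    The CLI rebuilds its auth file inside the temp HOME from the env var
    on first use; when the function returns, the directory is gone and so
    is everything the run wrote there.
    """
    home = tempfile.mkdtemp(prefix="agent-run-")
    env = {
        "HOME": home,               # fresh home: no ~/.codex/sessions/ to read
        "OPENAI_API_KEY": api_key,  # auth via env, not a mounted auth.json
        "PATH": os.environ["PATH"], # keep tool lookup working
    }
    try:
        return subprocess.run(cmd, env=env, capture_output=True, text=True)
    finally:
        shutil.rmtree(home, ignore_errors=True)  # sessions never existed
```

Each call gets a clean home and tears it down on the way out, success or failure.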
What Claude read off a live machine
Then there’s the opposite failure mode: the agent reads data that isn’t its own.
Five Claude runs against DevArea, an HTB Linux box, no reset between them:
“Another player has already set up a middleware exploit. The attack path is clear: Hoverfly middleware executes bash scripts as dev_ryan. Let me set up my own SSH key.”
Same run, later in the trace:
“Another player already created cron jobs that copy root.txt!”
The other player is not another player. It’s the previous run on the same machine. Claude is reading its own leftovers (SSH keys, cron jobs, middleware hooks) and calling it reconnaissance on a “live target”.
This one is HTB-shaped. Local benchmarks don’t have it: each run gets its own filesystem, nothing to collide with. But as soon as the target is remote and persists between runs, every run inherits the last one’s mess.
The fix is blunt: reset the machine before each run. On HTB that means firing the reset endpoint and waiting for it to actually land. The queue is eventual, not instant, and a run that starts too early hits stale state all over again. Any benchmark with shared targets needs that reset loop baked into the runner, or it’s measuring the compound output of every prior run stacked together.
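A reset loop of that shape is easy to get wrong if you fire-and-forget. A hedged sketch, with the actual reset call and cleanliness probe left as callables since they depend on the benchmark's API:

```python
import time

def reset_and_wait(fire_reset, target_is_clean, timeout=300.0, interval=5.0) -> bool:
    """Fire a target reset and poll until it actually lands.

    `fire_reset` hits the benchmark's reset endpoint; `target_is_clean`
    probes the box (e.g. checks that a known leftover marker is gone).
    Returns True once the target reports clean, False on timeout — in
    which case the runner should refuse to start rather than inherit
    the previous run's state.
    """
    fire_reset()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if target_is_clean():
            return True
        time.sleep(interval)  # the reset queue is eventual, not instant
    return False
```

The key design choice is that the runner blocks on `target_is_clean`, not on the reset request returning 200 — the request landing in the queue and the machine actually coming back fresh are different events.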
Two leaks, two fixes, and why they don’t collapse
The instinct is to look for one rule that handles both. There isn’t one.
The local leak is about what your container mounts. You fix it by running without persistence: no mounted home, credentials via environment variable, everything ephemeral.
The remote leak is about what your target remembers. No container discipline reaches that. The state lives on the other side of the network, on a machine you don’t control. You fix it by resetting the target, or by owning the target yourself so you can destroy it.
If you only do the first, the local leak stops and the remote one keeps happening. If you only do the second, the remote one stops and Codex still rgs its sessions folder. Two fixes, two different layers. Not one rule.
Harbor, the framework I mentioned last time, does both by default. Ephemeral Daytona sandbox per run, auth from env vars, no host mounts, target reset between runs. Two articles’ worth of failures handled as configuration, not code.
Even with that, there’s one leak that lives below all of this. The model already knows the answer. No sandbox reaches it, no reset clears it. That’s article three.