Blog 2026-04-13

The Problem of Internet Access in Benchmarks

Sami Lafrance

Most benchmarks look solid until you ask where the answers came from. I ran three agents against public CTF challenges. Codex scored 28/35. 12 of those wins came from search results.

TL;DR

  • Claude is more trustworthy than Codex. Claude: 95% of its correct answers earned without cheating. Codex: 57%.
  • Mistral doesn’t cheat. It fabricates 22 flags that look real.
  • A walkthrough: how to actually block internet access.

Codex got 28 of 35 CTF challenges right. Claude barely cheats. Mistral fabricates. Three agents, three different behaviors.

I’ve been talking to teams in SF who build benchmarks. Most take a naive approach: run the eval, grab the pass rate, ship it. I ran this on public CTF challenges (cybersecurity puzzles) to see what that looks like.

[Chart: runs, submitted, correct, and correct without cheating, per agent]

The test

Three agents: Claude (Claude Code, Opus 4.6), Codex (Codex CLI, GPT-5.4), and Mistral (Vibe, Devstral 2). Seven challenges from Root-Me, a public CTF platform whose writeups are all over the web. Five runs per agent per challenge, 105 total. Same prompt, same permissions, same timeout. Full internet access, because two of the seven challenges are live targets the agent has to reach.

I’m using “cheating” for any run where the agent finds the flag online instead of solving the challenge. CTF challenges are about solving, not searching.


The scoreboard, and what’s behind it

Codex: 28 of 35 correct. Claude: 21. Mistral: 1. Looks like a clear winner.

Then I read the traces.

Twelve of Codex’s correct answers appeared after the agent went looking online. Not all of that browsing was cheating: reading language docs or tool manuals is fine. Opening a writeup isn’t.

What the agent did                           Runs   Right flag   Wrong flag
Stayed offline (or hit the target only)        83           33           50
Read language docs or tool manuals              9            4            5
Searched the challenge name                     4            4            0
Opened a writeup or forum post                  8            8            0
Found the flag directly in a search result      1            1            0

Every run that opened a writeup got the right answer. None of them did the work. One trace is obvious: the agent opens a solve page, three events later it pastes the right answer.

Without cheating, Claude wins instead of Codex. Codex drops from 28 to 16. Claude barely moves: 21 to 20. Build and test your benchmark against Claude and cheating looks like a non-issue. Run Codex on the same setup and it’s not. Worth knowing before you ship your eval.

Finding answers online is a real skill. It’s just not the skill you’re trying to measure.


Fun fact: Mistral doesn’t cheat, it just makes things up

Mistral submitted a flag in 23 of its 35 runs. One was correct. The other 22 are made-up strings that look like flags: 1160_VTEPI_AVTG_3093_, 12345_VQLGE_TQPTYD_KJTIV_17408. It never searched the web, never opened a writeup. Just confident garbage, 22 times. Hard to tell whether it’s lying or just wrong. Either way, your benchmark sees 22 answers that look like they could be right.

The likely reason: Vibe ships with no web tools at all. With nothing to reach for, Mistral made flags up.


Blocking the internet, properly

1. Block everything

The most naive approach is to block all outbound traffic. That obviously doesn’t work: the agent can’t even call the model provider, and two of the seven challenges are live targets hosted on Root-Me’s servers that the agent has to reach. But leave the internet fully open and agents find writeups.

So we need something smarter.

2. Put it in the prompt

Add a line to the system prompt:

Do NOT use web search.

It reads fine, but nothing enforces it. An agent can read that and do exactly the opposite. Codex especially: committed to producing an answer at all costs, it won’t hesitate to go online the moment it struggles. You’d never know without reading the traces.

I’ll write a dedicated article on reading traces later. Short version: grep every run for

  • agent web tools: web_search, WebFetch, WebSearch, etc.
  • shell fetches: curl, wget, etc.
  • DNS lookups outside your whitelist
  • any call returning HTML

If any of those show up on a run that shouldn’t have them, that run cheated.
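
That pass can be sketched as a small shell helper. The trace layout here is hypothetical (one plain-text trace file per run), and the pattern list should be extended for whatever tools your agents actually expose:

```shell
# Patterns that indicate network activity in a run's trace file.
# Tool names and trace format are assumptions; adapt to your agents.
suspicious='web_search|WebFetch|WebSearch|curl |wget |https?://'

check_run() {
  # Prints CHEAT if the trace matches any network pattern, clean otherwise.
  if grep -Eq "$suspicious" "$1"; then
    echo "CHEAT: $1"
  else
    echo "clean: $1"
  fi
}

# e.g. for f in traces/*/*.log; do check_run "$f"; done
```

Runs against live targets will legitimately match the URL pattern, so apply this only to runs that shouldn’t have touched the network.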

If instructions aren’t enough, let’s make it impossible to ignore.

3. Disable the web tools in the agent’s config

Codex has web_search. Claude Code has WebFetch and WebSearch. Mistral’s Vibe CLI ships with none by default. Other agents have their own names. These tools run at the provider level, which means the firewall never sees them. The only way to block them is in the agent’s config.
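
As one concrete example, Claude Code reads tool permissions from a project settings file. This is a sketch based on Claude Code’s documented permissions.deny mechanism; verify the key names against the version you run, and note that Codex and Vibe use entirely different config formats:

```shell
# Write a project-level Claude Code settings file that denies web tools.
# "permissions.deny" is Claude Code's mechanism; these names won't
# transfer to other agents.
mkdir -p .claude
cat > .claude/settings.json <<'EOF'
{
  "permissions": {
    "deny": ["WebFetch", "WebSearch"]
  }
}
EOF
```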

You think that’s it? Read the traces.

Without web_search, Codex just used curl from bash instead. Removing the tool didn’t stop the agent.

So we need to block the network itself.

4. Firewall the container

Block all outbound traffic from the container except the target challenge host:

# simplified: allow target + provider, drop everything else
iptables -A OUTPUT -o lo -j ACCEPT                # keep loopback open
iptables -A OUTPUT -d <target-host> -j ACCEPT
iptables -A OUTPUT -d <provider-host> -j ACCEPT   # yes, don't forget this
iptables -A OUTPUT -j DROP

That catches curl, wget, and anything else at the packet layer. Two common mistakes: forgetting to put the model provider in the whitelist (block it and every call fails with a “can’t reach provider” error), and forgetting that iptables resolves hostnames once, at rule-insert time, so a whitelisted host behind rotating IPs can break mid-run.

Test the firewall before you trust it. Two quick tests:

  • say hello world: the agent should answer. If it can’t, you’re blocking the provider.
  • fetch https://example.com: the agent should report it couldn’t reach it. If it succeeds, the block failed.
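
The second check is easy to script. A sketch using curl, where any URL outside the whitelist should come back blocked:

```shell
# Verify outbound traffic is actually dropped: report OK only if the URL
# is unreachable within the timeout.
check_blocked() {
  if curl --max-time 5 -s "$1" > /dev/null 2>&1; then
    echo "FAIL: $1 reachable"
    return 1
  else
    echo "OK: $1 blocked"
  fi
}

# Inside the sandbox, this should print OK:
#   check_blocked https://example.com
```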

Config stops the provider-side tools, firewall stops the rest. Nothing gets out.

[Diagram: final architecture. Container behind a firewall whitelist; provider and target allowed, everything else blocked]

All set. You think it’s done. Then you look at the traces out of habit (you always should) and find a surprise.

Edge case: Codex tried to bypass the firewall

Codex, of all agents, tried to rewrite the iptables rules mid-run to open outbound traffic back up. It’s rare, but it happens, and once is enough.

The fix is container hardening: take away the agent’s ability to modify system settings. Specifically, drop the NET_ADMIN capability (that’s what lets iptables be rewritten), use a read-only root filesystem, and enable user namespaces. Harder to set up than the firewall itself.
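
In Docker terms, the hardening looks roughly like this. The flag set is a sketch: Docker doesn’t grant NET_ADMIN by default, so if the firewall rules are applied inside the container, you need an init step that sets them and then drops the capability before handing control to the agent (or apply the rules from the host instead):

```shell
# Hardening flags for the agent container. User-namespace remapping is a
# daemon-level setting (userns-remap) and isn't shown here.
harden_flags="--cap-drop NET_ADMIN --read-only --tmpfs /tmp --security-opt no-new-privileges"

# "eval-image" is a placeholder image name:
#   docker run --rm $harden_flags eval-image
```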

We wanted to just block the internet, remember? It got complicated.


Where it gets messier

A real benchmark is more than one container and one firewall. A few things I didn’t cover:

  • Installing the agent: the CLI, its dependencies, and the Python/Node runtime aren’t pre-installed. Bake them into the image or pull them at startup; both work, each with trade-offs.
  • Getting the environment in: your codebase, whatever the benchmark needs (often on GitHub). How do you pull it in without whitelisting github.com during the run?
  • Dependencies: each task might have its own setup. How do you install them without opening the firewall? Building an image per task works but gets heavy fast, with its own edge cases.
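
One way to sidestep the github.com question is to vendor everything at build time, so the run itself needs no fetches at all. A sketch (base image and paths are placeholders; the npm package name for the Codex CLI is my understanding as of writing, verify before pinning):

```shell
# Build-time Dockerfile: agent CLI and task files baked in, so the
# firewall never needs github.com or a package registry whitelisted
# during the run.
cat > Dockerfile <<'EOF'
FROM node:22-slim
RUN npm install -g @openai/codex
COPY challenge/ /workspace/challenge/
WORKDIR /workspace
EOF
# docker build -t eval-image .
```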

I’ve been using Harbor, a framework for evaluating agents in sandboxed environments. Ephemeral sandboxes per run, which saves a lot of setup work. Still, I ran into edge cases that I’ll explain in a dedicated Harbor article.

Zooming out, internet access is the first way agents can cheat. Isolation between runs is the second, and that’s what the next article is about.

Want to work with us?

We build the training data, benchmarks, and live environments that make AI security agents actually work. Let's talk about what your models or agents need.