How to Secure AI Agents With Container Sandboxing
2026-7-1 18:1:4 Author: hackernoon.com

During a closed beta of a data-analysis agent we were building, a tester asked it to “clean up the temp files from this session.” The agent wrote a short Python script to do exactly that and ran it, and the script wandered beyond its intended scope, deleting documents it should never have touched. Nothing irreplaceable was lost, since it was a beta environment with backups, but it was a gift of a near-miss, landing three weeks before launch instead of three weeks after.

The mistake wasn't the agent's. The mistake was ours: we'd given it a working directory convention and called that a sandbox. A convention is something that well-behaved code respects. It is not a boundary, and an LLM-generated script, whether through a genuine mistake or a manipulated prompt, has no reason to respect a boundary that only exists in a comment.
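To make the gap concrete, here is a minimal sketch of why a path convention is weaker than it looks. The paths and function names are hypothetical and are not taken from our codebase; the point is only that a prefix check accepts a path that climbs out of the workspace with `..` segments.

```python
from pathlib import Path


def naive_in_workspace(workspace: str, candidate: str) -> bool:
    """Prefix check: the kind of 'convention' that looks like a boundary."""
    return candidate.startswith(workspace)


def resolved_in_workspace(workspace: str, candidate: str) -> bool:
    """Normalize '..' segments and symlinks before comparing."""
    root = Path(workspace).resolve()
    target = Path(candidate).resolve()
    return target == root or root in target.parents


# Hypothetical paths, for illustration only.
WORKSPACE = "/srv/agent/session-42/scratch"
ESCAPE = WORKSPACE + "/../../../documents/report.docx"

if __name__ == "__main__":
    print("naive check accepts escape:", naive_in_workspace(WORKSPACE, ESCAPE))
    print("resolved check accepts escape:", resolved_in_workspace(WORKSPACE, ESCAPE))
```

Even the stricter check only helps when the code doing the deleting chooses to call it. A generated script has no obligation to, which is why the boundary has to live outside the process.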

Why Agent Sandboxing Is a Different Problem Than CI Sandboxing

Plenty of teams already run untrusted code in containers; CI runners do it constantly. It's tempting to assume agent code execution is the same problem with a fancier name. The distinction matters for how you design the system. A CI sandbox is adversarial against bugs: flaky tests, runaway memory, and the occasional infinite loop from a real mistake. An agent sandbox has to be adversarial against intent, because the code being executed was generated by a model that can be steered by anything in its context: a tool result, a document, or a website it was asked to summarize. Prompt injection effectively makes your agent an occasionally hostile actor, and the sandbox has to be built for that, not for honest bugs.

What We Tried, in Order of How Wrong It Went

The first version executed agent-generated code directly in the backend process, using a scratch directory that lacked real isolation, which led to the deleted-files incident. The obvious next step was a fresh Docker container per agent action, which fixed isolation cleanly: a container can only see what you mount into it, no convention required. It also made the product noticeably slower. Spinning up a new container per tool call added a few hundred milliseconds onto an already multi-second agent loop, and once a handful of concurrent sessions were each doing this several times per task, the Docker daemon itself became a contention point, with container counts climbing faster than we'd sized the host for.

We overcorrected from there into a prewarmed pool of standby containers, checked out per session, and returned afterward to save the startup cost. That introduced a worse problem: state leakage between unrelated sessions. A package one agent had pip installed mid-task was still sitting in the container's writable layer when a different user's session checked it out next, silently shaping one agent's behavior with an unrelated session's history. We'd built a warm pool of containers that weren't actually clean.

What Actually Shipped

The fix was to stop reusing the writable layer at all, while still reusing everything that's expensive to set up. Each pool slot keeps the image already pulled and the cgroup and network namespace already configured, but every checkout starts a genuinely fresh container from that warm image, and the old one is destroyed on return. The warmth is in the environment, never in the state:

docker run --rm \
  --memory=512m --cpus=1 --pids-limit=128 \
  --network=agent-egress-restricted \
  --read-only --tmpfs /workspace:rw,size=256m \
  -v /dev/null:/root/.ssh/id_rsa:ro \
  sandbox-runtime:py311
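As a rough sketch of the checkout logic, the pool can be reduced to "start a new container on every checkout, destroy it on return." The class name, the `start` and `destroy` callables, and the added `--name` flag are assumptions for illustration; the flags and image name are the ones from the command above. In a real system, `start` would invoke the Docker CLI or API and `destroy` would remove the container.

```python
import itertools
from typing import Callable, Dict, List

# Flags and image copied from the docker run command in this article.
SANDBOX_ARGS: List[str] = [
    "--rm",
    "--memory=512m", "--cpus=1", "--pids-limit=128",
    "--network=agent-egress-restricted",
    "--read-only", "--tmpfs", "/workspace:rw,size=256m",
    "-v", "/dev/null:/root/.ssh/id_rsa:ro",
]
IMAGE = "sandbox-runtime:py311"


def run_argv(name: str) -> List[str]:
    """Build the argv for one fresh container (a list, so no shell quoting)."""
    return ["docker", "run", "--name", name, *SANDBOX_ARGS, IMAGE]


class FreshPerCheckoutPool:
    """Hypothetical pool: nothing written by one session survives to the next."""

    def __init__(
        self,
        start: Callable[[List[str]], None],   # assumed hook: launches the argv
        destroy: Callable[[str], None],       # assumed hook: removes a container by name
    ) -> None:
        self._start = start
        self._destroy = destroy
        self._counter = itertools.count(1)
        self._live: Dict[str, str] = {}       # session id -> container name

    def checkout(self, session_id: str) -> str:
        # Always a new container name, so a writable layer is never reused.
        name = f"sandbox-{next(self._counter)}"
        self._start(run_argv(name))
        self._live[session_id] = name
        return name

    def give_back(self, session_id: str) -> None:
        # Returning a container means destroying it, not parking it.
        self._destroy(self._live.pop(session_id))
```

The design choice worth noticing is that there is no "return to idle" state at all: the only thing that stays warm is whatever the host already has cached, such as the pulled image.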

The resource limits weren't optional polish; they were load-bearing. An agent debugging broken code may create infinite loops or aggressive forks, and a container without limits can take down the entire host. We also enforce a strict wall-clock timeout on every container, so that a hung loop is killed from outside no matter what the agent's code is doing, rather than left to quietly consume resources.
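The wall-clock limit is not part of the flags shown earlier, so here is one way it could be enforced from the caller's side. This is a sketch under assumptions: `run_with_deadline` and the `on_timeout` hook are hypothetical names. The hook matters because killing the `docker run` client process does not necessarily stop the container itself, so the timeout path should also issue something like `docker kill <name>`.

```python
import subprocess
import sys
from typing import Callable, List, Optional


def run_with_deadline(
    argv: List[str],
    seconds: float,
    on_timeout: Optional[Callable[[], None]] = None,
) -> Optional[int]:
    """Run argv; return its exit code, or None if the deadline was hit."""
    try:
        # subprocess.run kills the child process itself when the timeout expires.
        return subprocess.run(argv, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        if on_timeout is not None:
            # For a container, this is where the caller would run
            # ["docker", "kill", name] to stop the sandbox as well.
            on_timeout()
        return None


if __name__ == "__main__":
    # Stand-in for a hung agent script: sleeps far past the deadline.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_deadline(hung, 0.5, lambda: print("deadline hit, killing sandbox")))
```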

Network access got the same default-deny treatment. We'd initially left it open so agents could pip install whatever a task needed, which felt generous and turned out to be the same kind of mistake as the deleted files. An open egress path is also an exfiltration path if a manipulated agent decides to read something sensitive from its workspace and phone it home. We moved to an explicit allowlist (a package mirror and nothing else by default) and made broader access something a specific task type has to request rather than something every sandbox gets for free.
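The policy side of that allowlist can be sketched in a few lines. The host names and the task type below are invented for illustration, and this only models the decision; the actual enforcement would sit at the network layer (the restricted Docker network in the command earlier), not in Python.

```python
from typing import Dict, FrozenSet

# Hypothetical names: by default only the package mirror is reachable.
DEFAULT_ALLOW: FrozenSet[str] = frozenset({"pypi-mirror.internal.example"})

# Broader access is opt-in per task type, never a global default.
EXTRA_BY_TASK: Dict[str, FrozenSet[str]] = {
    "web-summarize": frozenset({"fetch-proxy.internal.example"}),
}


def allowed_hosts(task_type: str) -> FrozenSet[str]:
    """Default allowlist plus whatever this task type explicitly requested."""
    return DEFAULT_ALLOW | EXTRA_BY_TASK.get(task_type, frozenset())


def egress_allowed(host: str, task_type: str) -> bool:
    """Default-deny: a host is reachable only if it is on the list."""
    return host.lower().rstrip(".") in allowed_hosts(task_type)
```

An unknown task type falls through to the default set, so forgetting to register a task fails closed rather than open.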

Where We Drew a Hard Line

Later, the product wanted agents that could build and test a user's own codebase, including their Dockerfiles. The quick path was mounting the host's Docker socket into the agent sandbox so its generated docker build commands would just work. We'd already learned, in a different project, that handing a container the Docker socket is functionally equivalent to handing it root on the host, and here the thing holding that socket would be code an LLM wrote, sometimes from instructions buried in a file it had just read. We rejected it outright and routed image builds through a separate, narrowly scoped build service, with no daemon access from inside the agent's own sandbox. It was the same pattern we'd used elsewhere, repurposed for a much higher-stakes caller.
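One cheap way to keep that line from eroding is a guard in whatever code assembles the sandbox launch command. The sketch below is an assumption about how such a check might look, not something described above: it refuses any argument list that mentions the Docker socket. It is deliberately crude (a substring match), so it is a tripwire for accidental regressions rather than a security control in itself.

```python
from typing import List


def socket_mounts(argv: List[str]) -> List[str]:
    """Return every argument that mentions the Docker socket."""
    return [arg for arg in argv if "docker.sock" in arg]


def assert_no_daemon_access(argv: List[str]) -> None:
    """Raise before launch if the sandbox would be handed the daemon socket."""
    offending = socket_mounts(argv)
    if offending:
        raise ValueError(
            f"refusing to launch sandbox with daemon access: {offending}"
        )
```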

The Contrarian Part

Before any of this, I'd push back on the instinct to build a fancy warm pool at all. We sized our warm pool before verifying whether container cold start was actually the bottleneck users experienced, and in many agent products, it isn't: LLM generation often takes several seconds, and a few hundred milliseconds of container startup is negligible by comparison. Many teams will reach for pooling complexity to solve a latency problem that the model's own response time was already masking. Measure first; the pool earns its keep at real concurrency, not by assumption.
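A back-of-the-envelope version of that measurement is below. The numbers are illustrative placeholders chosen to match "a few hundred milliseconds" and "several seconds," not measurements from our system; substitute your own traces.

```python
def startup_share(cold_start_s: float, llm_s: float, exec_s: float) -> float:
    """Fraction of one agent step spent waiting on container cold start."""
    return cold_start_s / (cold_start_s + llm_s + exec_s)


if __name__ == "__main__":
    # Illustrative: 0.3 s cold start, 4.0 s LLM generation, 0.7 s code execution.
    share = startup_share(0.3, 4.0, 0.7)
    print(f"cold start is {share:.0%} of the step")
```

With those placeholder numbers, a perfect pool would remove roughly 6% of per-step latency, which is the kind of figure worth knowing before building one.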

Key Takeaways

  • A working-directory convention does not serve as a security boundary; only a true container or VM boundary provides that protection.
  • Agent sandboxes need to assume hostile input by default; prompt injection makes an occasionally adversarial agent the realistic threat model, not an edge case.
  • Reuse the warm environment, never the writable layer; state leaking between sessions is worse than the latency it was meant to save.
  • Strict memory, CPU, process count, and wall-clock limits are load-bearing, not optional, once an agent can write its own debugging loops.
  • Never give an agent's sandbox the Docker socket or daemon access; route build steps through a separate, narrowly scoped service instead.

Closing Thought

None of this was really about Docker syntax; every flag above is ordinary. The real shift is treating the process running inside the container as a potential threat to your interests, not because it is malicious, but because it may be influenced by something it has read. As agents get handed more capability (more tools, more autonomy, and more access to act on a user's behalf), I keep wondering whether our isolation habits are keeping pace, or whether most teams are still sandboxing for the threat model of two years ago while quietly handing agents the keys to a much bigger one.


Source: https://hackernoon.com/how-to-secure-ai-agents-with-container-sandboxing?source=rss