Toolchain · ANALYSIS · IMPACT 8/10

How we contain Claude across products

Anthropic detailed their sandboxing techniques for constraining Claude across products, revealing core security engineering practices for building trustworthy AI agents.

KEY POINTS
  • Anthropic systematically disclosed the sandboxing techniques used in Claude.ai, Claude Code, and Claude Cowork.
  • Different products use different levels of isolation: gVisor for Claude.ai, Seatbelt/Bubblewrap for local Claude Code, and full VMs for Claude Cowork.
  • The article candidly shares previously missed security risks, such as data exfiltration via the file API, highlighting the complexity of secure design.
  • This marks AI safety moving from 'principles' to 'concrete engineering practice', providing a reference blueprint for the industry.
ANALYSIS

The Context: Why Talk About "Containing AI" Now?

When AI agents move beyond chat boxes into code execution, file operations, and system calls, "safety" turns from a philosophical question into an urgent engineering challenge. We all discuss "alignment" and "safety," but how are they actually implemented? Anthropic's technical overview matters because it fills this gap: it's not about ideals, but the first systematic disclosure of "how we actually draw boundaries for Claude in real products." For developers building or integrating AI applications, this is a rare, first-hand "safety construction blueprint."

The Breakdown: What "Cages" Are They Actually Using?

Anthropic's approach centers on "defense in depth," creating different isolation layers for different scenarios. It's like building venues with varying security levels for different activities:

  1. Claude.ai (Web Version): Uses gVisor. Think of it as a user-space kernel that intercepts the system calls a program makes and handles them itself, rather than passing them directly to the host's Linux kernel. It's akin to giving Claude a highly realistic "playroom" where all actions are translated and scrutinized, preventing access to systems outside the room.
  2. Claude Code (Local Execution): Uses Seatbelt on macOS and Bubblewrap on Linux. These are lighter-weight "process-level sandboxes." They directly restrict the files, network, and system resources a single process can access. Imagine handcuffing and shackling the code-executing process, strictly dictating where it can go and what it can do.
  3. Claude Cowork (Collaborative Environment): Uses full Virtual Machines (VMs). On macOS, it employs Apple's Virtualization framework; on Windows, HCS. This is the highest level of isolation—equivalent to giving Claude a completely independent "virtual computer." Even if this virtual machine is fully compromised, it's extremely difficult to affect the host machine or other user environments.

This layered strategy is highly pragmatic: choosing the most appropriate tool based on risk level and performance overhead, rather than "using one hammer for every nail."
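
To make the process-level approach in item 2 concrete, here is a minimal Python sketch that assembles a Bubblewrap (bwrap) command line. It is an illustration under stated assumptions, not Claude Code's actual configuration: the function name bwrap_argv, the choice of mounts, and the example paths are all hypothetical. The flags themselves (--ro-bind, --bind, --dev, --proc, --tmpfs, --chdir, --unshare-net, --die-with-parent) are standard bwrap options.

```python
import os


def bwrap_argv(project_dir: str, command: list[str], allow_network: bool = False) -> list[str]:
    """Build a bwrap command line (hypothetical helper, for illustration).

    The sandboxed command sees the host filesystem read-only, gets a private
    /tmp, may write only inside project_dir, and has no network by default.
    """
    project = os.path.abspath(project_dir)
    argv = [
        "bwrap",
        "--ro-bind", "/", "/",          # host filesystem visible, but read-only
        "--dev", "/dev",                # minimal device nodes
        "--proc", "/proc",              # fresh procfs for the sandbox
        "--tmpfs", "/tmp",              # private scratch space
        "--bind", project, project,     # the one writable host directory
        "--chdir", project,             # start inside the project
        "--die-with-parent",            # kill the sandbox if the launcher exits
    ]
    if not allow_network:
        argv.append("--unshare-net")    # new, empty network namespace
    return argv + list(command)


if __name__ == "__main__":
    # Example paths are placeholders; print the command instead of running it.
    print(" ".join(bwrap_argv("/work/proj", ["python3", "task.py"])))
```

On a Linux machine with bwrap installed, passing the returned list to subprocess.run would execute the command inside the sandbox. Mount order matters here: the read-only bind of / comes first, and the writable project bind is layered on top of it.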

Trend Insight: AI Safety Becomes Engineered and Transparent

This article reveals a deeper trend: AI safety is moving beyond model-level approaches like "RLHF" and "Constitutional AI" into concrete, auditable system architecture and engineering practices. The more capable an agent becomes, the greater its destructive potential. Therefore, "constraints" must be built into its underlying infrastructure by default, not bolted on as a post-hoc fix.

More notably, there's a move toward "transparency." Anthropic proactively disclosed details of its security architecture, including past mistakes (like the file API exfiltration vulnerability). This sends a strong signal: in AI safety, building trust cannot rely on black boxes but requires verifiable engineering practices. This might push the industry toward new standards—much like security compliance certifications in the cloud computing era, future AI services may need to prove their "sandboxes" are trustworthy.

Practical Value: What Can Developers Gain From This?

For AI application developers, this overview offers direct action guidelines:

  • Re-evaluate Your Agent Architecture: Is your agent running "naked"? Have you considered isolation for code execution or file access? Anthropic's approach is a ready-made reference architecture.
  • Understand Security Trade-offs: There's no perfect solution. gVisor has performance overhead; VMs offer the strongest isolation but are the heaviest. You need to choose the appropriate isolation level based on your application scenario (e.g., chat-only vs. code execution).
  • Keep an Eye on Open-Source Tools: The srt (Anthropic Sandbox Runtime) mentioned at the end is an open-source tool that likely encapsulates some of the sandboxing capabilities described. For teams wanting to run agents securely in their own environments, this is an option worth evaluating.
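
One way to turn the trade-off reasoning above into something reviewable is to write the decision down as code. The sketch below is purely illustrative: the Workload fields, the pick_isolation function, and the decision order are assumptions made for this example, not Anthropic's criteria. The tiers loosely mirror the three mechanisms described earlier.

```python
from dataclasses import dataclass
from enum import Enum


class Isolation(Enum):
    NONE = "no sandbox"
    PROCESS_SANDBOX = "process-level sandbox (Seatbelt / Bubblewrap)"
    USERSPACE_KERNEL = "user-space kernel (gVisor)"
    FULL_VM = "full virtual machine"


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of what an agent is allowed to do."""
    executes_code: bool          # runs generated code or shell commands
    touches_user_files: bool     # reads or writes files outside a scratch area
    multi_tenant_host: bool      # shares a server with other users' sessions
    long_running_autonomy: bool  # works unattended for extended periods


def pick_isolation(w: Workload) -> Isolation:
    """Pick the lightest tier that still matches the risk (illustrative rules)."""
    if not (w.executes_code or w.touches_user_files):
        return Isolation.NONE              # chat-only: nothing to contain
    if w.long_running_autonomy:
        return Isolation.FULL_VM           # strongest isolation, heaviest cost
    if w.multi_tenant_host:
        return Isolation.USERSPACE_KERNEL  # keep tenants off the host kernel
    return Isolation.PROCESS_SANDBOX       # local, supervised execution


if __name__ == "__main__":
    local_coding = Workload(True, True, False, False)
    print(pick_isolation(local_coding).value)
```

The value of such a function is less the specific rules than the fact that the choice becomes explicit and testable, so a change in agent capabilities forces a visible change in isolation level.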

Counter-Intuitive Insight: Security is an Ongoing Process

Perhaps the most surprising aspect isn't the sophistication of the technology used, but Anthropic's candid admission of "previously missed risks." For example, an attacker could exfiltrate data from the sandbox by inducing the model to access endpoints like api.anthropic.com/v1/files. This demonstrates that even with robust sandboxes, security remains a continuous "cat-and-mouse game" requiring constant iteration. It shatters the illusion that "deploying a sandbox is a one-time fix," emphasizing the importance of security monitoring and vulnerability response mechanisms.
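
The file API example shows why a host-level allowlist alone can be insufficient: a trusted domain may itself expose an upload endpoint. The following Python sketch contrasts a host-only check with a host-plus-path check. It is a hedged illustration, not a description of Anthropic's actual fix. The policy tables and function names are hypothetical, and the sketch assumes the sandbox permits outbound traffic to api.anthropic.com.

```python
from urllib.parse import urlsplit

# Hypothetical policy data, for illustration only.
ALLOWED_HOSTS = {"api.anthropic.com"}
BLOCKED_PATH_PREFIXES = {"api.anthropic.com": ("/v1/files",)}


def host_only_allows(url: str) -> bool:
    """Naive egress check: any request to an allowlisted host passes."""
    return urlsplit(url).hostname in ALLOWED_HOSTS


def host_and_path_allows(url: str) -> bool:
    """Stricter check: the host must be allowlisted and the path must not
    fall under a prefix that the policy marks as an upload surface."""
    parts = urlsplit(url)
    host = parts.hostname
    if host not in ALLOWED_HOSTS:
        return False
    blocked = BLOCKED_PATH_PREFIXES.get(host, ())
    return not any(
        parts.path == prefix or parts.path.startswith(prefix + "/")
        for prefix in blocked
    )


if __name__ == "__main__":
    upload = "https://api.anthropic.com/v1/files"
    print(host_only_allows(upload), host_and_path_allows(upload))
```

Note that enforcing a path rule like this on HTTPS traffic requires a proxy that can see the request path, which a plain hostname-based filter cannot. A deny-list of paths is also inherently reactive, which fits the article's point that sandbox policy needs ongoing review.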

In summary, Anthropic's write-up marks a new phase in how leading AI companies handle security: moving from opacity to openness, from principles to engineering. For practitioners building on AI, it is essential reading.

Originally from Simon Willison · Analyzed by BitByAI