MosaicLeaks: Can your research agent keep a secret?

Deep research agents combining internal and web data leak secrets through query logs; a new benchmark and privacy-aware RL training provide metrics and solutions.

Agent Security · Privacy Protection · Reinforcement Learning · Deep Research · Benchmarks

KEY POINTS

The mosaic effect in agents: individual external queries seem harmless, but combined logs can reconstruct sensitive enterprise data
Leakage risks are tiered into intent, answer, and full-information, scaling exponentially with deeper reasoning chains
Standard task-accuracy training actually worsens leakage, while prompt-based constraints prove ineffective
PA-DR privacy-aware reinforcement learning reshapes reward signals, cutting leakage from 34% to 9.9% while boosting task success

ANALYSIS

When enterprises start deploying deep research agents into real workflows, a standard configuration emerges: the agent is granted access to both internal private documents and open web search tools. Historically, security teams have focused almost exclusively on output filtering and final response gating. But the MosaicLeaks benchmark, jointly released by ServiceNow and Hugging Face, shines a light on a far less visible blind spot: the external query logs emitted during the agent's reasoning process. We long assumed that as long as the model does not directly output secrets, we are safe. We failed to realize that tool invocation itself acts as a massive information broadcast tower.

The core of mosaic leakage lies in the reconstruction of fragmented information. Picture a healthcare company's agent conducting research. It issues three seemingly ordinary searches: a cloud migration milestone for a certain vendor, a security disclosure from January 2024, and a list of affected suppliers. Viewed individually, none of these queries raises alarms. However, if an adversary or auditor obtains the complete query log, they can piece together highly sensitive facts that only existed in internal documentation. MosaicLeaks quantifies this risk into three escalating tiers: intent leakage, where observers can deduce the research direction; answer leakage, where the log contains enough data to directly answer internal questions; and full-information leakage, the most severe tier, where an observer can independently reconstruct and verify private claims without any prompts. To simulate this, the research team built a dataset of over a thousand multi-hop tasks that force the agent to constantly switch between private and public information sources, mirroring the cross-domain reasoning paths found in real enterprise environments.
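To make the mosaic effect concrete, here is a toy sketch in Python. It is not the MosaicLeaks evaluator: the keyword-overlap scoring, the `PrivateFact` structure, the `leakage_tier` function, and the vendor name "VendorX" are all assumptions made up for illustration. It only shows how each query can look harmless alone while the combined log reaches the most severe tier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrivateFact:
    """Hypothetical private claim, described by the terms that identify it."""
    topic_terms: frozenset     # enough to reveal what is being researched
    answer_terms: frozenset    # enough to answer the internal question
    evidence_terms: frozenset  # enough to reconstruct and verify the claim


def terms(text):
    """Lowercased word set of a query (toy tokenizer, an assumption)."""
    return {w.strip(".,").lower() for w in text.split()}


def leakage_tier(queries, fact):
    """Highest tier revealed by the queries: none, intent, answer, or full."""
    seen = set()
    for q in queries:
        seen |= terms(q)
    if not fact.topic_terms <= seen:
        return "none"
    if not fact.answer_terms <= seen:
        return "intent"
    if not fact.evidence_terms <= seen:
        return "answer"
    return "full"


# Illustrative data only; "VendorX" is a placeholder name.
fact = PrivateFact(
    topic_terms=frozenset({"cloud", "migration"}),
    answer_terms=frozenset({"vendorx", "disclosure"}),
    evidence_terms=frozenset({"january", "2024", "suppliers"}),
)
log = [
    "VendorX cloud migration milestone",
    "security disclosure January 2024",
    "affected suppliers list",
]

per_query = [leakage_tier([q], fact) for q in log]  # each query scored alone
combined = leakage_tier(log, fact)                  # the whole log scored together
```

In this toy setup no single query scores above "intent", yet the full log scores "full". A real evaluator would need something far stronger than keyword overlap, such as a model that tries to answer the private question from the log alone.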

What is most counterintuitive is that standard training focused purely on task accuracy actually worsens leakage. The logic is straightforward: large models optimize for the shortest path to an answer. When the model discovers that a specific combination of keywords can quickly extract clues from the open web, it will issue those queries without hesitation. Relying on system prompts to instruct the model to keep secrets is virtually useless against a model driven by task rewards. This reveals a fundamental truth: privacy protection cannot be bolted on through post-hoc filtering or moral instructions. It must be baked into the training phase as a hard constraint.

MosaicLeaks signals a paradigm shift in AI agent security. Traditional cybersecurity focuses on static access control: who can view what data. In the agent era, security must pivot toward dynamic behavioral auditing: how the model exposes intent through its tool calls. Query logs are becoming a new attack surface. Future privacy engineering will no longer be limited to data masking or vector database permissions. It will extend into real-time auditing of reasoning trajectories and behavioral constraints. Privacy is no longer an add-on feature; it is a core architectural metric on par with reasoning capability.
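One way to picture real-time auditing of tool calls is a gate that tracks what the query log has already exposed, rather than judging each query in isolation. The sketch below is a hypothetical illustration, not a mechanism described in the research: the `QueryAuditGate` class, its term list, and the `max_exposed` threshold are all assumptions.

```python
class QueryAuditGate:
    """Hypothetical outbound gate tracking cumulative exposure of sensitive terms."""

    def __init__(self, sensitive_terms, max_exposed):
        self.sensitive_terms = {t.lower() for t in sensitive_terms}
        self.max_exposed = max_exposed  # assumed budget of distinct sensitive terms
        self.exposed = set()            # sensitive terms already sent out

    def allow(self, query):
        """Return True and record the query, or False if it would exceed the budget."""
        words = {w.strip(".,").lower() for w in query.split()}
        hits = words & self.sensitive_terms
        if len(self.exposed | hits) > self.max_exposed:
            return False  # blocked; the exposure record is left unchanged
        self.exposed |= hits
        return True


# Illustrative session; "VendorX" is a placeholder name.
gate = QueryAuditGate({"vendorx", "disclosure", "suppliers"}, max_exposed=2)
decisions = [
    gate.allow("VendorX cloud migration milestone"),
    gate.allow("security disclosure January 2024"),
    gate.allow("affected suppliers list"),
    gate.allow("public weather report"),
]
```

The point of the design is that the third query is refused because of what the first two already revealed, even though it would pass any per-query check.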

For teams deploying agents, this research provides a clear action plan. First, agent evaluation cannot stop at accuracy metrics. Query leakage rates must become a core acceptance criterion. Second, at the architecture level, outbound queries should be aggregated, obfuscated, or routed through an intermediate proxy layer to prevent the emission of highly correlated search sequences. Finally, the PA-DR reinforcement learning method proposed by ServiceNow proves that by designing specific privacy penalties alongside success rewards, you can drastically reduce leakage rates while actually improving task performance. Privacy and performance are not a zero-sum game. The difference lies in whether you are willing to redesign your training objectives to account for behavioral exposure.
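The idea of pairing success rewards with privacy penalties can be sketched as a shaped reward. This is a minimal illustration under stated assumptions, not PA-DR's actual reward: the function name `shaped_reward`, the per-tier penalty values, and the `penalty_weight` parameter are invented here.

```python
# Assumed penalty per leakage tier; the values are illustrative, not from the research.
TIER_PENALTY = {"none": 0.0, "intent": 0.25, "answer": 0.5, "full": 1.0}


def shaped_reward(task_success, leaked_tier, penalty_weight=1.0):
    """Success reward minus a weighted penalty for the leakage tier of the query log."""
    success_reward = 1.0 if task_success else 0.0
    return success_reward - penalty_weight * TIER_PENALTY[leaked_tier]


clean_win = shaped_reward(True, "none")    # solved the task without leaking
leaky_win = shaped_reward(True, "full")    # solved it, but the log reveals everything
leaky_loss = shaped_reward(False, "full")  # failed and leaked
```

Under a pure accuracy objective the first two trajectories earn the same reward, which is why the shortest, leakiest path wins. Once the penalty is part of the signal, the policy has a reason to prefer queries that reach the answer without reassembling the private facts in the log.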

Originally from Hugging Face Blog · Analyzed by BitByAI