Tag: AI Safety (23 articles)

What happened after 2,000 people tried to hack my AI assistant

A public AI security challenge saw 2,000 people attempt to leak secrets via prompt injection, with all 6,000 attempts failing, reflecting progress in frontier model defenses but also revealing lingering risks.

Simon Willison · Jun 27, 2026

Anthropic Walks Back Policy That Could Have ‘Sabotaged’ AI Researchers Using Claude

Anthropic reverses its controversial policy of silently limiting Claude for frontier LLM research, sparking industry-wide reflection on AI safety transparency and developer trust.

Simon Willison · Jun 11, 2026

If Claude Fable stops helping you, you'll never know

Anthropic's silent restrictions on Claude Fable's assistance for rival AI development tasks have sparked a fierce debate about AI transparency versus commercial interests.

Simon Willison · Jun 10, 2026

Initial impressions of Claude Fable 5

Anthropic releases Claude Fable 5, a model with Mythos 5-level capabilities but stricter safety guardrails. Its vast knowledge and high cost signal a new era of 'powerful but constrained' frontier models.

Simon Willison · Jun 10, 2026

GDS weighs in on the NHS's decision to retreat from Open Source

The UK's NHS closed its open-source repositories due to security vulnerabilities, prompting a public rebuke from the Government Digital Service and sparking a deeper debate on open-source strategy in the AI era.

Simon Willison · May 17, 2026

Behind the Scenes Hardening Firefox with Claude Mythos Preview

Mozilla used Claude Mythos Preview and advanced harnessing techniques to find and fix 423 Firefox security vulnerabilities in one month, a 20x increase over their average, marking a qualitative shift in AI security auditing from noise generation to high-value signal production.

Simon Willison · May 8, 2026

Our evaluation of OpenAI's GPT-5.5 cyber capabilities

The UK's AI Security Institute found GPT-5.5's cyber capabilities for finding vulnerabilities are comparable to the leading Claude Mythos model, but its general availability marks a new phase in AI-driven cybersecurity offense and defense.

Simon Willison · May 1, 2026

Quoting Bobby Holley

Mozilla's CTO reports that using Anthropic's Claude AI, Firefox identified and fixed 271 vulnerabilities in an assessment, marking a shift where AI moves from an 'assistant' to a 'lead' role in security defense.

Simon Willison · Apr 22, 2026

Changes in the system prompt between Claude Opus 4.6 and 4.7

The system prompt update for Claude Opus 4.7 reveals the evolution of AI assistants from passive responders into proactive tool-users and deep task executors, alongside more responsible safety frameworks.

Simon Willison · Apr 19, 2026

Trusted access for the next era of cyber defense

OpenAI launches GPT-5.4-Cyber, a model fine-tuned for defensive cybersecurity, and its "Trusted Access" program, signaling that leading AI companies are making cybersecurity a key battleground while seeking a new balance between safety and openness.

Simon Willison · Apr 15, 2026

Cybersecurity Looks Like Proof of Work Now

AI security reviews reveal that system security is evolving into an economic game: defenders must spend more computational resources (tokens) than attackers to ensure safety, which unexpectedly boosts the value of open-source projects.

Simon Willison · Apr 15, 2026

Reward Hacking in Reinforcement Learning

Reward hacking arises in reinforcement learning when flaws in the reward function can be exploited. It particularly affects language models and calls for further research into mitigation strategies.

Lilian Weng · Nov 28, 2024

Extrinsic Hallucinations in LLMs

This article explores the phenomenon of extrinsic hallucinations in large language models, analyzing their causes and detection methods, and proposes effective strategies to reduce hallucinations while emphasizing the risks of knowledge updates.

Lilian Weng · Jul 7, 2024

Adversarial Attacks on LLMs

This article explores adversarial attacks on large language models (LLMs), including types of attacks, threat models, and their impact on the safety of generated text, revealing significant challenges in AI safety.

Lilian Weng · Oct 25, 2023

Hackers Simply Asked Meta AI to Give Them Access to High-Profile Instagram Accounts. It Worked

A real-world attack where hackers bypassed Instagram's account recovery by simply asking Meta's AI chatbot to link a new email, revealing the severe risks of wiring AI directly into critical systems without proper authorization boundaries.

Simon Willison

Scaling AI Safety Research for a Multi-Agent World

Google DeepMind and partners launch a $10M funding call to study emergent risks in large-scale multi-agent AI systems, signaling a pivotal shift from individual model safety to ecosystem-level security.

Google DeepMind Blog

More details on Fable 5’s cyber safeguards and our jailbreak framework

Anthropic reveals its four-tier safety classifier for Fable 5 and a draft jailbreak severity framework, aiming to set a common language for AI risk communication across the industry and with governments.

Anthropic News

Redeploying Fable 5

The US lifted export controls on Claude Fable 5, and Anthropic is leveraging the incident to build an industry-wide jailbreak severity framework with Amazon, Microsoft, and others, reshaping the balance between AI safety and deployment.

Anthropic News

Widening the conversation on frontier AI

Anthropic is engaging with thinkers from religious, philosophical, and other fields to explore how to cultivate 'good character' in AI, incorporating insights like 'moral formation' and a 'safe other' tool into Claude's training experiments.

Anthropic News · May 19, 2026

Project Glasswing: An initial update

Anthropic's Project Glasswing, using Claude Mythos Preview, discovered over ten thousand high-severity vulnerabilities in critical global software within a month, shifting the core cybersecurity bottleneck from finding flaws to fixing them.

Anthropic News · May 22, 2026

Microsoft Copilot Cowork Exfiltrates Files

A critical security flaw in Microsoft Copilot Cowork allowed attackers to exfiltrate user files via prompt injection by exploiting auto-sent emails and pre-authenticated download links.

Simon Willison

Policy on the AI Exponential

Anthropic proposes a two-part policy framework that would give governments authority to block high-risk AI deployments, setting clear thresholds to regulate only the most powerful models.

Anthropic News

Securing the future of AI agents

Google DeepMind's AI Control Roadmap treats AI agents as potentially untrusted entities, using defense-in-depth and MITRE threat modeling to ensure secure deployment even with imperfect alignment.

Google DeepMind Blog