Tag: AI Safety (14 articles)

sqlite AGENTS.md

The SQLite project's AGENTS.md file explicitly rejects AI-generated code while creating a dedicated channel for AI-reported bugs, revealing a pragmatic strategy for open-source communities to handle the AI wave.

Simon Willison · May 28, 2026

The pressure

AI-assisted security research is putting unprecedented pressure on foundational open-source projects like curl by flooding them with high-quality vulnerability reports, revealing the double-edged sword of AI in security.

Simon Willison · May 27, 2026

Microsoft Copilot Cowork Exfiltrates Files

A critical security flaw in Microsoft Copilot Cowork allows attackers to use prompt injection to trick the AI agent into exfiltrating sensitive files like OneDrive data using the user's own permissions.

Simon Willison · May 26, 2026

Behind the Scenes Hardening Firefox with Claude Mythos Preview

Mozilla used Claude Mythos Preview and advanced harnessing techniques to find and fix 423 Firefox security vulnerabilities in one month, a 20x increase over their average, marking a qualitative shift in AI security auditing from noise generation to high-value signal production.

Simon Willison · May 8, 2026

Our evaluation of OpenAI's GPT-5.5 cyber capabilities

The UK's AI Security Institute found GPT-5.5's cyber capabilities for finding vulnerabilities are comparable to the leading Claude Mythos model, but its general availability marks a new phase in AI-driven cybersecurity offense and defense.

Simon Willison · May 1, 2026

Quoting Bobby Holley

Mozilla's CTO reports that Firefox, using Anthropic's Claude AI, identified and fixed 271 vulnerabilities in an assessment, marking a shift where AI moves from an 'assistant' to a 'lead' role in security defense.

Simon Willison · Apr 22, 2026

Trusted access for the next era of cyber defense

OpenAI launches GPT-5.4-Cyber, a model fine-tuned for defensive cybersecurity, and its "Trusted Access" program, signaling that leading AI companies are making cybersecurity a key battleground while seeking a new balance between safety and openness.

Simon Willison · Apr 15, 2026

Cybersecurity Looks Like Proof of Work Now

AI security reviews reveal that system security is evolving into an economic game: defenders must spend more computational resources (tokens) than attackers to ensure safety, which unexpectedly boosts the value of open-source projects.

Simon Willison · Apr 15, 2026

Reward Hacking in Reinforcement Learning

Reward hacking arises from flaws in reward functions and poses a challenge for reinforcement learning, particularly for language models, calling for further research and mitigation strategies.

Lilian Weng · Nov 28, 2024

Extrinsic Hallucinations in LLMs

This article explores the phenomenon of extrinsic hallucinations in large language models, analyzing their causes and detection methods, and proposes effective strategies to reduce hallucinations while emphasizing the risks of knowledge updates.

Lilian Weng · Jul 7, 2024

Adversarial Attacks on LLMs

This article explores adversarial attacks on large language models (LLMs), including types of attacks, threat models, and their impact on the safety of generated text, revealing significant challenges in AI safety.

Lilian Weng · Oct 25, 2023

Project Glasswing: An initial update

Anthropic's 'Project Glasswing', leveraging its latest AI model Mythos Preview, has helped partners discover over ten thousand high or critical-severity vulnerabilities in one month, shifting the core bottleneck in software security from 'finding vulnerabilities' to 'verifying and patching them'.

Anthropic News · May 22, 2026
BitByAI — AI-powered, AI-evolved AI News