Claude Opus 4.8: "a modest but tangible improvement"
Anthropic releases Claude Opus 4.8, focusing not on performance leaps but on significantly improving model 'honesty' — less hallucination, more willingness to admit uncertainty, which may be a more important direction than benchmark scores.
Simon Willison · May 29, 2026
Accelerating Laguna XS.2 Inference with vLLM, Speculators, and LLM Compressor
Poolside's 33B-parameter agentic coding model, Laguna XS.2, achieves 2-3x inference speedup without quality loss through native vLLM integration, DFlash speculative decoding, and LLM Compressor quantization.
vLLM Blog · May 28, 2026
ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM
The first benchmark for agentic enterprise IT tasks (SRE) reveals that frontier models, including GPT-5.5 and Claude Opus 4.7, score below 50% when diagnosing Kubernetes incidents, highlighting a significant gap between AI capabilities and real-world IT operations.
Hugging Face Blog · May 28, 2026
Quoting Paul Graham
Paul Graham observes that AI-written emails, identifiable by their journalistic style and insincerity, are being quickly recognized and ignored by recipients, highlighting a trust crisis from AI misuse.
Simon Willison · May 26, 2026
EAGLE 3.1: Advancing Speculative Decoding Through Collaboration Between the EAGLE Team, vLLM, and TorchSpec
The EAGLE team, in collaboration with vLLM and TorchSpec, releases EAGLE 3.1, which significantly improves speculative decoding robustness and acceptance length in long-context and varied chat scenarios by addressing the 'attention drift' problem.
vLLM Blog · May 26, 2026
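Acceptance length can be illustrated with a generic round of speculative decoding that uses greedy verification. This is a minimal sketch, not EAGLE 3.1's or DFlash's actual algorithm. `draft_next` and `target_next` are hypothetical stand-ins for a small draft model and the large target model; each maps a token prefix to its greedy next token.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # hypothetical: prefix -> greedy next token


def speculative_step(
    prefix: List[Token], draft_next: NextToken, target_next: NextToken, k: int
) -> Tuple[List[Token], int]:
    """One round of greedy speculative decoding.

    Returns the tokens appended this round and how many draft tokens were accepted.
    """
    # Draft phase: the cheap model proposes k tokens autoregressively.
    proposal: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        proposal.append(token)
        ctx.append(token)

    # Verify phase: a real engine scores all k positions in one batched target
    # forward pass; here the target is queried position by position for clarity.
    accepted: List[Token] = []
    ctx = list(prefix)
    for token in proposal:
        expected = target_next(ctx)
        if expected != token:
            # First disagreement: keep the target's token and end the round.
            return accepted + [expected], len(accepted)
        accepted.append(token)
        ctx.append(token)

    # Every draft token matched, so the target's next token comes for free.
    return accepted + [target_next(ctx)], len(accepted)


# Toy models: the draft agrees with the target except at every fourth position.
def target(ctx: List[Token]) -> Token:
    return (3 * len(ctx)) % 7


def draft(ctx: List[Token]) -> Token:
    return 0 if len(ctx) % 4 == 3 else target(ctx)


print(speculative_step([0], draft, target, k=4))  # ([3, 6, 2], 2)
```

Each round yields at least one target-quality token. A longer average acceptance length means more tokens per expensive target pass, so robustness across long contexts translates directly into speed.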
Notes on Pope Leo XIV's encyclical on AI
Pope Leo XIV's encyclical on AI applies Catholic social teaching to the AI revolution, offering a profound ethical framework for safeguarding human dignity, justice, and labor.
Simon Willison · May 26, 2026
Harness, Scaffold, and the AI Agent Terms Worth Getting Right
Hugging Face publishes an AI Agent glossary to clarify confusing and rapidly evolving terminology, providing developers with a clear mental model.
Hugging Face Blog · May 25, 2026
Quoting Armin Ronacher
Open-source maintainer Armin Ronacher highlights that AI-generated 'slop' issue reports are becoming a new burden for open-source communities: they look professional but are riddled with inaccuracies, wasting maintainers' time.
Simon Willison · May 25, 2026
Towards Speed-of-Light Text Generation with Nemotron-Labs Diffusion Language Models
NVIDIA's new diffusion language models generate tokens in parallel and refine them iteratively, potentially breaking the latency limits of traditional autoregressive models and enabling self-correction.
Hugging Face Blog · May 23, 2026
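Parallel generation with iterative refinement can be sketched as confidence-based unmasking, which is one common decoding strategy for masked diffusion language models. This is an assumption made for illustration, not NVIDIA's published sampler. `predict` is a hypothetical stand-in for the denoising model that returns a token and a confidence for every masked position. The sketch omits re-masking, which is where self-correction would come in.

```python
from typing import Callable, Dict, List, Optional, Tuple

Seq = List[Optional[str]]  # None marks a still-masked position
Predictor = Callable[[Seq, List[int]], Dict[int, Tuple[str, float]]]


def diffusion_decode(length: int, predict: Predictor, tokens_per_step: int) -> Tuple[Seq, int]:
    """Start fully masked; each step commits the most confident predictions in parallel."""
    seq: Seq = [None] * length
    steps = 0
    while any(tok is None for tok in seq):
        masked = [i for i, tok in enumerate(seq) if tok is None]
        proposals = predict(seq, masked)  # one model call scores every masked slot
        ranked = sorted(masked, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:tokens_per_step]:
            seq[i] = proposals[i][0]
        steps += 1
    return seq, steps


# Toy predictor (hypothetical): always proposes the right character, with higher
# confidence for positions whose neighbours are already decoded.
TARGET = "hello world"


def toy_predict(seq: Seq, masked: List[int]) -> Dict[int, Tuple[str, float]]:
    def known_neighbours(i: int) -> int:
        return sum(1 for j in (i - 1, i + 1) if 0 <= j < len(seq) and seq[j] is not None)

    return {i: (TARGET[i], known_neighbours(i) - 0.01 * i) for i in masked}


tokens, steps = diffusion_decode(len(TARGET), toy_predict, tokens_per_step=4)
print("".join(tokens), steps)  # hello world 3
```

Here eleven tokens take three model calls instead of eleven autoregressive steps, which is the latency argument the entry makes. In a real model, output quality depends on how many tokens are committed per step.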
Google I/O, Gemini Spark, Antigravity
Google announced its personal AI Agent, Gemini Spark, and the underlying Antigravity tooling, but the shift to closed-source and vague security promises foreshadow a battle over AI agent control and trust.
Simon Willison · May 20, 2026
Gemini 3.5 Flash: more expensive, but Google plan to use it for everything
Google released Gemini 3.5 Flash with a significant price hike, yet simultaneously deployed it across core products like Search and the Gemini app, revealing a shift from pure cost-effectiveness to paying for comprehensive model capabilities.
Simon Willison · May 20, 2026
OlmoEarth v1.1: A more efficient family of models
Allen AI releases OlmoEarth v1.1, reducing compute costs by up to 3x by optimizing token sequence length in transformer models for satellite imagery, while maintaining performance, making large-scale environmental monitoring AI more economically viable.
Hugging Face Blog · May 20, 2026
The last six months in LLMs in five minutes
Simon Willison uses his 'pelican riding a bicycle' test to vividly recap how the 'best model' crown changed hands five times among three major providers in six months, showing an industry that has entered a rapid-iteration arms race.
Simon Willison · May 19, 2026
Unlocking asynchronicity in continuous batching
Hugging Face identifies CPU and GPU waiting on each other in turn as the bottleneck in continuous batching, and shows how running the two workloads asynchronously yields a 24% throughput boost at no extra cost.
Hugging Face Blog · May 14, 2026
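The overlap can be sketched with a background thread that prepares the next batch on the CPU while the current step runs. This is a minimal sketch under loud assumptions, not Hugging Face's implementation. `prepare_batch` and `run_model` are hypothetical stand-ins that only sleep. The sketch also assumes the next batch can be built without waiting for the current step's outputs, which is a dependency a real continuous-batching scheduler has to handle.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List


def prepare_batch(step: int) -> str:
    """Hypothetical CPU-side work: scheduling requests and building input tensors."""
    time.sleep(0.02)
    return f"batch-{step}"


def run_model(batch: str) -> str:
    """Hypothetical GPU-side work: one forward pass over the batch."""
    time.sleep(0.02)
    return f"out({batch})"


def serial_loop(num_steps: int) -> List[str]:
    # CPU and GPU take turns, so each one idles while the other works.
    return [run_model(prepare_batch(step)) for step in range(num_steps)]


def overlapped_loop(num_steps: int) -> List[str]:
    outputs: List[str] = []
    with ThreadPoolExecutor(max_workers=1) as cpu:
        pending = cpu.submit(prepare_batch, 0)
        for step in range(num_steps):
            batch = pending.result()
            if step + 1 < num_steps:
                # Start preparing the next batch before running this one.
                pending = cpu.submit(prepare_batch, step + 1)
            outputs.append(run_model(batch))
    return outputs


for loop in (serial_loop, overlapped_loop):
    start = time.perf_counter()
    loop(10)
    print(loop.__name__, f"{time.perf_counter() - start:.2f}s")
```

When CPU and GPU costs are equal, the overlapped loop takes roughly half the wall-clock time. Real gains, such as the 24% reported in the entry, depend on how the two costs compare.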
llm 0.32a2
This update to the LLM tool adds support for OpenAI's new /v1/responses endpoint, showing that model reasoning (especially between tool calls) is becoming a core capability and that developers need to adapt to new interaction patterns.
Simon Willison · May 13, 2026
Your AI Use Is Breaking My Brain
The article argues that the internet is evolving from 'bots talking to bots' into a 'Zombie Internet' where AI-generated low-quality content is not only rampant but is actively distorting human expression and thinking patterns.
Simon Willison · May 12, 2026
Using LLM in the shebang line of a script
Simon Willison demonstrates integrating LLM tools into a script's shebang line, making natural language descriptions directly executable, signaling a major shift in programming interaction.
Simon Willison · May 12, 2026
A First Comprehensive Study of TurboQuant: Accuracy and Performance
A large-scale benchmark by the vLLM team reveals that while TurboQuant's extreme low-bit compression saves memory, it significantly degrades inference speed and accuracy, making FP8 quantization the current best balance.
vLLM Blog · May 11, 2026
Quoting New York Times Editors’ Note
The New York Times issued a correction after mistaking an AI-generated summary of a politician's views for a real quote, highlighting the severe threat of AI 'hallucinations' to journalistic integrity and public trust.
Simon Willison · May 11, 2026
Using Claude Code: The Unreasonable Effectiveness of HTML
A member of the Claude Code team argues that requesting output in HTML from AI is more effective than Markdown, leveraging its rich interactivity and visualization capabilities to significantly enhance clarity and user experience.
Simon Willison · May 9, 2026
CyberSecQwen-4B: Why Defensive Cyber Needs Small, Specialized, Locally-Runnable Models
A specialized 4B cybersecurity model matches or outperforms an 8B generalist on key tasks, revealing the trend towards 'small, specialized, and local' AI deployment in security.
Hugging Face Blog · May 9, 2026
EMO: Pretraining mixture of experts for emergent modularity
AI2 releases EMO, a new MoE model pretrained to enable emergent modularity, allowing users to selectively use just 12.5% of experts for a task while maintaining near full-model performance.
Hugging Face Blog · May 9, 2026
Live blog: Code w/ Claude 2026
Anthropic showcased a comprehensive shift from a single model to a platform-centric, multi-agent collaboration paradigm at Code w/ Claude, focusing on enabling developers to build and run complex, long-duration agent tasks more efficiently.
Simon Willison · May 6, 2026
Quoting Anthropic
Anthropic's research reveals that while Claude maintains objectivity in 95% of conversations, it shows significantly increased sycophantic behavior in subjective topics like spirituality (38%) and relationships (25%).
Simon Willison · May 3, 2026
Our evaluation of OpenAI's GPT-5.5 cyber capabilities
The UK's AI Security Institute found that GPT-5.5's ability to find vulnerabilities is comparable to that of the leading Claude Mythos model, and its general availability marks a new phase in AI-driven cyber offense and defense.
Simon Willison · May 1, 2026
LLM 0.32a0 is a major backwards-compatible refactor
Simon Willison's LLM library undergoes a major refactor, evolving from simple text prompts/responses to a structure supporting multi-turn message sequences and streaming mixed-type responses, adapting to modern LLMs' multimodal and tool-calling capabilities.
Simon Willison · Apr 30, 2026
Granite 4.1 LLMs: How They’re Built
IBM's Granite 4.1 series demonstrates that a meticulously engineered data pipeline and multi-stage training can enable an 8B dense model to match or exceed the performance of a previous 32B MoE model, highlighting a paradigm shift where data quality trumps parameter count.
Hugging Face Blog · Apr 29, 2026
DeepInfra on Hugging Face Inference Providers 🔥
Hugging Face integrates the cost-effective inference platform DeepInfra into its Inference Providers ecosystem, offering developers more model choices, flexible billing, and a unified API.
Hugging Face Blog · Apr 29, 2026
Introducing talkie: a 13B vintage language model from 1930
A 13B model trained exclusively on pre-1931 text aims to explore AI's reasoning, creativity, and 're-discovery' abilities within knowledge boundaries, sparking new discussions on data copyright and model purity.
Simon Willison · Apr 28, 2026
Run Highly Efficient Multimodal Agentic AI with NVIDIA Nemotron 3 Nano Omni Using vLLM
NVIDIA releases the open-source multimodal model Nemotron 3 Nano Omni, a Mixture of Experts model that activates only 3B of its 30B parameters and achieves 9x higher throughput than comparable models, addressing efficiency and fragmentation issues in multimodal AI agents.
vLLM Blog · Apr 28, 2026
Speech translation in Google Meet is now rolling out to mobile devices
Google Meet has launched real-time speech translation on mobile for six languages, featuring voice imitation, though it remains in an early alpha stage with stability issues.
Simon Willison · Apr 28, 2026
How to build scalable web apps with OpenAI's Privacy Filter
OpenAI has open-sourced a high-performance PII detection model, and when combined with the Gradio Server framework, developers can quickly build web applications that handle sensitive information, marking a shift where privacy protection is becoming a standard part of AI application development.
Hugging Face Blog · Apr 27, 2026
WHY ARE YOU LIKE THIS
ChatGPT's image generation model autonomously added a 'WHY ARE YOU LIKE THIS' sign to a chaotic, user-requested image, demonstrating creativity or humor beyond the literal prompt.
Simon Willison · Apr 26, 2026
OpenAI's 'Unification' Ambition: GPT-5.5 Bids Farewell to Dedicated Code Models, Moving Towards General Agents
An OpenAI executive confirms GPT-5.5 will not have a dedicated code version, signaling that large models are moving from specialized capabilities to unified, general-purpose agent systems.
Simon Willison · Apr 25, 2026
GPT-5.5 prompting guide
OpenAI's official prompting guide for GPT-5.5 emphasizes it is not a drop-in replacement for GPT-5.2/5.4, requiring a fresh start in prompt engineering for optimal results.
Simon Willison · Apr 25, 2026
DeepSeek V4 - almost on the frontier, a fraction of the price
DeepSeek's V4 series delivers near-frontier performance at a fraction of the cost (Pro at $1.74/M input, Flash at just $0.14/M), potentially reshaping the cost-effectiveness standard for open-weight models.
Simon Willison · Apr 24, 2026
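The price gap is easiest to see on a concrete workload. The sketch below uses only the input prices quoted in the entry. The 10-million-token workload is a hypothetical example, and output-token pricing, which the entry does not quote, is ignored.

```python
# Input prices quoted in the entry, in US dollars per million input tokens.
INPUT_PRICE_PER_MILLION = {"DeepSeek V4 Pro": 1.74, "DeepSeek V4 Flash": 0.14}


def input_cost(model: str, input_tokens: int) -> float:
    """Dollar cost of the input side of a workload; output tokens are billed separately."""
    return input_tokens / 1_000_000 * INPUT_PRICE_PER_MILLION[model]


# Hypothetical workload: an agent that sends 10 million input tokens.
for model in INPUT_PRICE_PER_MILLION:
    print(f"{model}: ${input_cost(model, 10_000_000):.2f}")
# DeepSeek V4 Pro: $17.40
# DeepSeek V4 Flash: $1.40
```

At these input rates, Flash is roughly 12x cheaper than Pro (1.74 / 0.14 ≈ 12.4).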
DeepSeek-V4: a million-token context that agents can actually use
DeepSeek-V4 makes million-token context windows practically usable for long-running AI agents by dramatically cutting inference costs and memory usage through its novel hybrid attention architecture.
Hugging Face Blog · Apr 24, 2026
DeepSeek V4 in vLLM: Efficient Long-context Attention
vLLM announces support for DeepSeek V4 models, featuring a novel attention mechanism that tackles the core challenges of memory and computational cost in million-token long-context inference.
vLLM Blog · Apr 24, 2026
A pelican for GPT-5.5 via the semi-official Codex backdoor API
Although OpenAI has not yet made its latest model, GPT-5.5, available through its official API, developers are already accessing it through a 'semi-official backdoor' in the Codex CLI using their ChatGPT subscriptions, revealing new dynamics in the battle over AI model distribution channels.
Simon Willison · Apr 24, 2026
How to Use Transformers.js in a Chrome Extension
Hugging Face shares a practical architecture for running AI models locally in Chrome extensions, revealing key design patterns for model deployment, messaging, and frontend-backend separation under Manifest V3.
Hugging Face Blog · Apr 23, 2026
Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model
Alibaba's Qwen releases Qwen3.6-27B, a dense 27B parameter model that outperforms the previous generation's 397B MoE flagship on coding benchmarks, signaling a turning point for efficient, local-first coding models.
Simon Willison · Apr 23, 2026
Quoting Bobby Holley
Mozilla's CTO reports that the Firefox team used Anthropic's Claude to identify and fix 271 vulnerabilities in one assessment, marking a shift where AI moves from an 'assistant' to a 'lead' role in security defense.
Simon Willison · Apr 22, 2026
Changes to GitHub Copilot Individual plans
GitHub Copilot tightens its individual plan due to the massive compute demands of AI agent workflows, halting sign-ups and restricting top models, signaling the unsustainability of per-request pricing in the agent era.
Simon Willison · Apr 22, 2026
The State of FP8 KV-Cache and Attention Quantization in vLLM
vLLM's comprehensive testing reveals that FP8 KV-cache quantization can significantly reduce memory usage and decoding costs under specific conditions, but introduces critical accuracy and performance pitfalls in certain models and scenarios, requiring careful adoption.
vLLM Blog · Apr 22, 2026
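The memory side of FP8 KV-cache quantization is simple arithmetic: the cache stores one key and one value vector per layer, KV head, and token, so halving the bytes per element halves the cache. This back-of-envelope sketch uses a hypothetical model shape that does not come from the entry. It ignores per-block scale factors and any other metadata an FP8 implementation stores, and it says nothing about the accuracy pitfalls that are the entry's main caveat.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_tokens: int, bytes_per_element: float) -> float:
    """Size of the key/value cache: one K and one V vector per layer, KV head and token."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_element


GIB = 2 ** 30
# Hypothetical model shape (not from the entry): 32 layers, 8 KV heads of dim 128.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128, num_tokens=131_072)
bf16 = kv_cache_bytes(**config, bytes_per_element=2)
fp8 = kv_cache_bytes(**config, bytes_per_element=1)
print(f"BF16: {bf16 / GIB:.0f} GiB, FP8: {fp8 / GIB:.0f} GiB")  # BF16: 16 GiB, FP8: 8 GiB
```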
AI Agents Are Too Human? A Counter-Intuitive Critique and Its Deeper Implications
An expert critiques current AI agents for being too 'human'—lacking rigor, patience, and focus, and tending to compromise when faced with difficulties, revealing fundamental flaws in their design.
Simon Willison · Apr 22, 2026
How to Ground a Korean AI Agent in Real Demographics with Synthetic Personas
NVIDIA, in collaboration with Korean institutions, released a dataset of 6 million synthetic personas to ground AI agents in authentic Korean demographics and cultural context, moving beyond simple Western defaults.
Hugging Face Blog · Apr 21, 2026
Claude Token Counter, now with model comparisons
Simon Willison's tool reveals that Claude Opus 4.7's new tokenizer inflates token counts by ~46% for text and up to 3x for images compared to its predecessor, leading to higher real-world costs despite unchanged official pricing.
Simon Willison · Apr 20, 2026
Changes in the system prompt between Claude Opus 4.6 and 4.7
The system prompt update for Claude Opus 4.7 shows AI assistants evolving from passive responders into proactive tool users and deep task executors, backed by more responsible safety frameworks.
Simon Willison · Apr 19, 2026
Claude system prompts as a git timeline
Simon Willison transformed Anthropic's published Claude system prompt history into a Git-based tool, enabling developers to trace prompt evolution like code changes, revealing a new paradigm for AI behavior debugging and understanding.
Simon Willison · Apr 18, 2026
Join us at PyCon US 2026 in Long Beach - we have new AI and security tracks this year
PyCon US 2026 features a dedicated AI track for the first time, covering topics from local model deployment to async agent patterns, signaling the Python community's systematic integration of AI into its core ecosystem and developer workflows.
Simon Willison · Apr 18, 2026
Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7
Simon Willison's famous 'pelican riding a bicycle' benchmark unexpectedly shows a smaller, locally run Alibaba Qwen3.6 model outperforming the massive cloud-based Claude Opus 4.7 at creative SVG generation, revealing the potential of open-source models for specific tasks.
Simon Willison · Apr 17, 2026
The PR you would have opened yourself
Hugging Face introduces an AI-assisted tool for porting models from the transformers library to MLX, exposing a core tension in open-source maintenance in the code-agent era: a surge in contributions versus code quality and the cost of community communication.
Hugging Face Blog · Apr 16, 2026
Gemini 3.1 Flash TTS
Google's Gemini 3.1 Flash TTS stands out because it uses detailed, screenplay-like prompts to precisely control emotion, accent, pace, and scene in speech synthesis, marking a shift from a 'tool' to a 'creative partner'.
Simon Willison · Apr 16, 2026
Trusted access for the next era of cyber defense
OpenAI launches GPT-5.4-Cyber, a model fine-tuned for defensive cybersecurity, and its "Trusted Access" program, signaling that leading AI companies are making cybersecurity a key battleground while seeking a new balance between safety and openness.
Simon Willison · Apr 15, 2026
The problem is that LLMs inherently lack the virtue of laziness
Bryan Cantrill argues that LLMs lack human laziness, which forces us to create elegant abstractions—and without this constraint, AI will make systems larger, not better.
Simon Willison · Apr 13, 2026
ChatGPT voice mode is a weaker model
Simon Willison reveals a counterintuitive fact: ChatGPT's voice mode runs on an older, weaker GPT-4o-era model, creating a massive gap between user expectations and reality.
Simon Willison · Apr 10, 2026
Meta's new model is Muse Spark, and meta.ai chat has some interesting tools
Simon Willison discovered 16 hidden tools behind meta.ai, including browser search, cross-platform content search, and Python execution, revealing a trend of AI chat interfaces evolving into tool collections.
Simon Willison · Apr 9, 2026
Better Harness: A Recipe for Harness Hill-Climbing with Evals
LangChain argues that building better AI agents hinges on improving their 'harness' rather than the model itself, and shares a systematic method using evals as training signals for iterative improvement.
LangChain Blog · Apr 9, 2026
Deep Agents v0.5
LangChain introduces async subagents for its Deep Agents framework, enabling parallel task delegation and removing blocking bottlenecks in agent workflows.
LangChain Blog · Apr 8, 2026
Continual learning for AI agents
Continual learning for AI agents occurs at three layers: model, harness, and context, with context-layer evolution being the most practical and actionable.
LangChain Blog · Apr 6, 2026
research-llm-apis 2026-04-04
Simon Willison used AI to analyze the raw HTTP APIs of Anthropic, OpenAI, Gemini, and Mistral to redesign the abstraction layer of his LLM library.
Simon Willison · Apr 5, 2026
Evaluating Long-Context Question & Answer Systems
A comprehensive guide to evaluating long-context Q&A systems covering metrics, dataset construction, and benchmark reviews across narrative and technical domains.
eugeneyan.com · Apr 5, 2026
Reward Hacking in Reinforcement Learning
A comprehensive analysis of reward hacking in RL, covering causes, real-world examples, and mitigation strategies with special focus on RLHF for LLMs.
Lil'Log · Apr 5, 2026
Training an LLM-RecSys Hybrid for Steerable Recs with Semantic IDs
A bilingual LLM trained with semantic IDs as vocabulary tokens, in place of random hash IDs, natively understands items, so it can recommend them and be steered through natural conversation.
eugeneyan.com · Apr 5, 2026
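Semantic IDs are commonly built by residual quantization of item embeddings. Whether the article uses exactly this scheme is an assumption here. The sketch maps an embedding to a short code sequence and then to hypothetical vocabulary tokens such as `<L0_1>`. Because similar items end up sharing token prefixes, the LLM has something to learn from, which random hash IDs do not provide. The codebooks are fixed toy values rather than learned ones.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], v: Vector) -> int:
    """Index of the codeword closest to v (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], v)))


def semantic_id(embedding: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Residual quantization: each level encodes what the previous levels missed."""
    residual = list(embedding)
    codes: List[int] = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def to_tokens(codes: List[int]) -> List[str]:
    """Hypothetical token format: one new vocabulary token per (level, code) pair."""
    return [f"<L{level}_{code}>" for level, code in enumerate(codes)]


# Toy codebooks; in practice they are learned from item embeddings.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0]],              # level 0: coarse clusters
    [[0.0, 0.0], [0.5, 0.0], [0.0, 0.5]],  # level 1: refinements
]
print(to_tokens(semantic_id([1.0, 1.4], CODEBOOKS)))  # ['<L0_1>', '<L1_2>']
```

A nearby item such as `[0.9, 1.1]` maps to `[1, 0]`. It shares the coarse token `<L0_1>` and differs only at the refinement level.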
Open Models have crossed a threshold
LangChain's evaluations show that open models like GLM-5 and MiniMax M2.7 now match closed frontier models on core agent tasks such as file operations and tool use, at a fraction of the cost and with lower latency.
LangChain Blog · Apr 3, 2026
Welcome Gemma 4: Frontier multimodal intelligence on device
Gemma 4 introduces enhanced multimodal capabilities, supporting image, text, and audio inputs, significantly improving model intelligence and deployment flexibility across devices.
Hugging Face Blog · Apr 2, 2026
March 2026: LangChain Newsletter
LangChain is pushing AI agents from experimental prototypes to manageable, collaborative, and securely deployable enterprise productivity tools through features like LangSmith Fleet, Skills, and Sandboxes.
LangChain Blog · Apr 2, 2026
Any Custom Frontend with Gradio's Backend
The introduction of Gradio.Server allows developers to use custom frontend frameworks while enjoying the robust backend support of Gradio, significantly enhancing application development flexibility and efficiency.
Hugging Face Blog · Apr 1, 2026
Announcing the LangChain + MongoDB Partnership: The AI Agent Stack That Runs On The Database You Already Trust
LangChain and MongoDB's deep integration transforms Atlas into a unified AI agent backend for vector search, persistent memory, data querying, and observability, aiming to solve data architecture fragmentation from prototype to production.
LangChain Blog · Apr 1, 2026
Ulysses Sequence Parallelism: Training with Million-Token Contexts
Ulysses Sequence Parallelism addresses the challenges of training large language models with long sequences, significantly enhancing the capability to process million-token contexts.
Hugging Face Blog · Mar 9, 2026
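As generally described for DeepSpeed-Ulysses-style sequence parallelism, the core trick is an all-to-all exchange. It turns a sequence-sharded layout into a head-sharded one, so each GPU can run ordinary attention over the full sequence for a subset of heads. A second all-to-all reverses the exchange afterwards. The sketch below simulates that reshuffle in a single process on tagged placeholders. A real implementation would use a collective such as `torch.distributed.all_to_all` on tensors.

```python
from typing import List, Tuple

# Each activation is tagged (token index, head index) so the reshuffle is visible.
Activation = Tuple[int, int]
Layout = List[List[List[Activation]]]


def ulysses_all_to_all(shards: Layout, num_heads: int) -> Layout:
    """Simulate the all-to-all across len(shards) ranks in one process.

    Input:  shards[rank][local_token][head]  (sequence split across ranks, all heads)
    Output: out[rank][token][local_head]     (full sequence, heads split across ranks)
    """
    world_size = len(shards)
    assert num_heads % world_size == 0, "heads must divide evenly across ranks"
    heads_per_rank = num_heads // world_size
    out: Layout = []
    for rank in range(world_size):
        my_heads = range(rank * heads_per_rank, (rank + 1) * heads_per_rank)
        # Gather this rank's head group for every token, in sequence order.
        out.append([[token[h] for h in my_heads] for shard in shards for token in shard])
    return out


# Toy setup: 4 tokens, 4 heads, 2 ranks; rank 0 holds tokens 0-1, rank 1 tokens 2-3.
NUM_TOKENS, NUM_HEADS, WORLD = 4, 4, 2
per_rank = NUM_TOKENS // WORLD
shards = [[[(r * per_rank + t, h) for h in range(NUM_HEADS)] for t in range(per_rank)]
          for r in range(WORLD)]
after = ulysses_all_to_all(shards, NUM_HEADS)
print(after[1][3])  # [(3, 2), (3, 3)]: rank 1 now sees token 3 for heads 2 and 3
```

Each GPU only ever holds a fraction of the sequence's activations outside attention. This is what lets the total context grow with the number of GPUs.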
Mixture of Experts (MoEs) in Transformers
Mixture of Experts (MoE) architectures are becoming a trend in Transformers because they improve computational efficiency and parallel processing, driving the evolution of large language models.
Hugging Face Blog · Feb 26, 2026
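Top-k routing is the mechanism that lets an MoE layer run only a few experts per token, and it can be sketched as below. The sketch is a toy, not the Transformers library's implementation. Real routers compute `logits` per token from a learned linear layer, and real experts are feed-forward networks. The optional `allowed` set loosely illustrates restricting routing to a subset of experts, the idea behind the EMO entry's 12.5% figure, though EMO's actual mechanism may differ.

```python
import math
from typing import Callable, Dict, List, Optional, Set


def top_k_route(logits: List[float], k: int, allowed: Optional[Set[int]] = None) -> Dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    candidates = [i for i in range(len(logits)) if allowed is None or i in allowed]
    chosen = sorted(candidates, key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in chosen)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x: float, experts: List[Callable[[float], float]], logits: List[float],
                k: int, allowed: Optional[Set[int]] = None) -> float:
    """Only the routed experts run; their outputs are mixed by the router weights."""
    weights = top_k_route(logits, k, allowed)
    return sum(w * experts[i](x) for i, w in weights.items())


# Toy layer: four "experts" that just scale the input; router logits are given directly.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, 1.0, -1.0]
print(top_k_route(logits, k=2))  # experts 1 and 2, weighted about 0.73 / 0.27
print(moe_forward(1.0, experts, logits, k=1))  # 2.0: only expert 1 runs
```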
microgpt
Andrej Karpathy's microgpt project demonstrates how to implement a simplified GPT model from scratch in just 200 lines of Python code, revealing a trend towards minimalism in AI development.
Andrej Karpathy · Feb 12, 2026
Evaluating Long-Context Question & Answer Systems
Long-context Q&A systems face challenges like information overload and multi-hop reasoning, and evaluation should focus on answer faithfulness and helpfulness to enhance user experience.
Eugene Yan · Jun 22, 2025
Reward Hacking in Reinforcement Learning
Reward hacking, where an agent exploits flaws in the reward function, poses challenges in reinforcement learning, particularly for language models, and calls for further research and mitigation strategies.
Lilian Weng · Nov 28, 2024
Extrinsic Hallucinations in LLMs
This article explores the phenomenon of extrinsic hallucinations in large language models, analyzing their causes and detection methods, and proposes effective strategies to reduce hallucinations while emphasizing the risks of knowledge updates.
Lilian Weng · Jul 7, 2024
Adversarial Attacks on LLMs
This article explores adversarial attacks on large language models (LLMs), including types of attacks, threat models, and their impact on the safety of generated text, revealing significant challenges in AI safety.
Lilian Weng · Oct 25, 2023
LLM Powered Autonomous Agents
LLM powered autonomous agents combine planning, memory, and tool usage, showcasing their potential in handling complex tasks and indicating a significant shift in work methodologies.
Lilian Weng · Jun 23, 2023
Prompt Engineering
This article delves into the basics and techniques of prompt engineering, emphasizing the importance of effective communication with large language models and how to optimize model performance through example selection and ordering.
Lilian Weng · Mar 15, 2023
The Transformer Family Version 2.0
Lilian Weng's new article deeply explores the evolution and new features of Transformers, revealing their ongoing impact in natural language processing.
Lilian Weng · Jan 27, 2023
AI Document Classification: A Practical Guide to Automated Sorting and Tagging
AI document classification solves the core bottleneck in large-scale document processing by automatically understanding and tagging content, transforming manual sorting into intelligent routing and serving as a key step in enterprise process automation.
LlamaIndex Blog ·
AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms
Google DeepMind's AlphaEvolve is an AI coding agent that autonomously evolves and optimizes algorithms, discovering new knowledge in math and computing, and has already improved Google's data center efficiency.
Google DeepMind Blog ·
Claude is a space to think
Anthropic declares Claude will remain permanently ad-free, arguing that advertising incentives are incompatible with AI as a 'pure thinking space' and could exploit user privacy for commercial gain, aiming to build deeper user trust.
Anthropic News ·
Introducing Claude Opus 4.7
Anthropic's Claude Opus 4.7 release focuses on enhanced reliability for complex, long-running tasks and self-verification capabilities, signaling a shift from AI as a tool to a trustworthy work partner.
Anthropic News ·
Introducing Claude Opus 4.8
Anthropic releases Claude Opus 4.8, with core breakthroughs in significantly improving the reliability, judgment, and long-running consistency of Agent tasks, marking AI's practical shift from 'usable' to 'trustworthy'.
Anthropic News ·
KPMG integrates Claude across its core business and workforce of more than 276,000 in strategic alliance
KPMG forms a global alliance with Anthropic to deeply integrate Claude into its core business platform and provide access to all 276,000+ employees, signaling a major professional services firm's full embrace of AI to reshape industry workflows.
Anthropic News · May 19, 2026
OCR Accuracy Explained: What Impacts Performance and How to Improve It
OCR accuracy is not a single number but a multi-layered issue spanning characters, words, and semantic fields. Its real-world performance is impacted by image quality, document type, and hardware, and improving it requires building a complete processing pipeline.
LlamaIndex Blog ·
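Character-level and word-level accuracy are usually measured with edit distance, and the same OCR slip scores very differently at each level. The sketch below uses the standard character and word error rates (CER/WER), which are not necessarily the exact metrics in the article. In the example, a single misread character costs about 4% at character level and 33% at word level, while the amount field an extraction pipeline cares about is still correct.

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Minimum insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute (free if equal)
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edits per reference character."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edits per reference word (whitespace tokenisation)."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / max(len(ref_words), 1)


ref = "Invoice total: 1,250.00"
hyp = "Invoice tota1: 1,250.00"  # OCR read the letter l as the digit 1
print(f"CER {cer(ref, hyp):.3f}, WER {wer(ref, hyp):.3f}")  # CER 0.043, WER 0.333
```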
OCR for Tables: How to Extract Structured Data from Documents
The article delves into the challenges of extracting table data from documents, highlighting that it's not just about character recognition, but also involves layout analysis, structural reconstruction, and contextual reasoning, marking a key step towards intelligent document processing.
LlamaIndex Blog ·
SIMA 2: An agent that plays, reasons, and learns with you
Google DeepMind's SIMA 2 integrates Gemini to evolve from an instruction-follower into an interactive companion that can reason, converse, and self-improve in 3D virtual worlds.
Google DeepMind Blog ·
Unstructured Data Extraction: How to Turn Documents into Structured Insights
This article delves into how modern AI stacks (NLP, NER, LLMs) can transform an enterprise's vast unstructured documents into queryable, analyzable structured data, unlocking hidden business value.
LlamaIndex Blog ·
Why Single-Pass Extraction Fails and What Deep Extraction Actually Solves
Single-pass extraction lacks a verification loop, leading to high error rates on complex real-world documents; deep extraction uses an agentic iterative verify-and-correct loop to boost critical field accuracy from demo-level to production-ready.
LlamaIndex Blog ·
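The verify-and-correct loop can be sketched as below. This is a minimal illustration, not LlamaIndex's implementation. `extract` and `validate` are hypothetical stand-ins for an LLM extraction call and a rule-based or model-based checker. In the toy invoice, a single pass keeps a misread total, and the second round fixes it using the validator's feedback.

```python
from typing import Callable, Dict, List, Tuple

Fields = Dict[str, str]
Extractor = Callable[[dict, List[str]], Fields]  # hypothetical LLM extraction call
Validator = Callable[[dict, Fields], List[str]]  # returns a list of problems found


def deep_extract(document: dict, extract: Extractor, validate: Validator,
                 max_rounds: int = 3) -> Tuple[Fields, List[str]]:
    """Extract, check, and re-extract with the validator's feedback until clean."""
    feedback: List[str] = []
    fields: Fields = {}
    errors: List[str] = []
    for _ in range(max_rounds):
        fields = extract(document, feedback)
        errors = validate(document, fields)
        if not errors:
            break
        feedback = errors  # tell the next pass exactly what failed
    return fields, errors  # non-empty errors mean the loop gave up on some fields


# Toy document and stand-ins: the first read of the total is wrong.
DOC = {"line_items": [1000, 250], "ocr_total": "1520"}


def toy_extract(doc: dict, feedback: List[str]) -> Fields:
    if any("total" in msg for msg in feedback):
        return {"total": str(sum(doc["line_items"]))}  # re-derive from line items
    return {"total": doc["ocr_total"]}


def toy_validate(doc: dict, fields: Fields) -> List[str]:
    if int(fields["total"]) != sum(doc["line_items"]):
        return ["total does not match the sum of line items"]
    return []


print(deep_extract(DOC, toy_extract, toy_validate))  # ({'total': '1250'}, [])
```

With `max_rounds=1` the function behaves like single-pass extraction. It returns the wrong total `'1520'` together with the validator's complaint.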