Tag: Large Language Models (139 articles)

Better Models: Worse Tools

Newer Claude models are increasingly making mistakes when calling third-party edit tools, likely because Anthropic over-trained them on Claude Code's own tool syntax, degrading general tool-use ability and highlighting platform lock-in risks in AI training.

Simon Willison · Jul 5, 2026

Harness Engineering for Self-Improvement

Lilian Weng argues that the key to AI self-improvement lies not in model size but in the 'harness' layer connecting models to reality, and proposes design patterns that can evolve themselves.

Lilian Weng · Jul 4, 2026

Quoting Josh W. Comeau

Multiple developer course creators report revenue drops of over 50% as AI both shakes confidence in career prospects and offers free personalized learning alternatives, posing a serious challenge to traditional tech education.

Simon Willison · Jul 4, 2026

Fable's judgement

The optimal way to use advanced AI coding tools isn't micromanagement, but granting them autonomous judgment and dynamic routing, letting the main model focus on architecture while sub-agents handle implementation.

Simon Willison · Jul 4, 2026

What's new in Claude Sonnet 5

Claude Sonnet 5 brings Opus-level performance at Sonnet prices, but a tokenizer change effectively raises costs by 30% for English users; removed sampling params and default thinking mode add more hidden costs.

Simon Willison · Jul 1, 2026

Ending AI Evaluation Anarchy: How Hugging Face and EEE Are Building a Trusted Record for Model Performance

EEE and Hugging Face Community Evals are now integrated, enabling standardized evaluation results with full metadata to be posted directly on model pages, solving the problem of scattered, incomparable scores and moving the industry toward evaluation transparency.

Hugging Face Blog · Jun 30, 2026

Ornith-1.0: Self-Scaffolding LLMs for Agentic Coding

Simon Willison reviews the open-source Ornith-1.0 model, highlighting its efficient tool calling and code understanding for agentic tasks, signaling new advances in open agentic coding models.

Simon Willison · Jun 30, 2026

What happened after 2,000 people tried to hack my AI assistant

A public AI security challenge saw 2,000 people attempt to leak secrets via prompt injection, with all 6,000 attempts failing, reflecting progress in frontier model defenses but also revealing lingering risks.

Simon Willison · Jun 27, 2026

Incident Report: CVE-2026-LGTM

A fictional incident report about dueling AI review agents reveals real risks of uncontrolled costs and multi-agent conflicts in AI-powered supply chain security.

Simon Willison · Jun 27, 2026

Quoting OpenAI

OpenAI launches the GPT-5.6 series with tiered pricing and controllable caching, introducing a government-coordinated limited preview that signals a new era of compliance-first, refined AI operations.

Simon Willison · Jun 27, 2026

Privacy-Aware Infrastructure in the AI-Native Era: An Asset Classification Case Study

Meta shares a hybrid asset classification approach: using LLMs for ambiguous cold-start cases but relying on human-reviewed deterministic rules for daily enforcement, achieving auditable data governance in the AI era.

Meta Engineering Blog · Jun 26, 2026

AI and Liability

A German court ruling holds Google liable for errors in its AI overviews, reinforcing that AI agents are extensions of their deployers, and companies cannot hide behind faulty AI to avoid responsibility.

Simon Willison · Jun 26, 2026

GLM-5.2: Built for Long-Horizon Tasks

Z.ai releases GLM-5.2, the first open-source model to achieve stable 1M-token context and rival top closed-source models on long-horizon coding benchmarks.

Hugging Face Blog · Jun 17, 2026

The Fable 5 Export Controls Harm US Cyber Defense

US export controls imposed on Claude Fable 5 because it can 'fix code' misunderstand that fixing code is a normal defensive security activity; such controls harm rather than help cybersecurity.

Simon Willison · Jun 16, 2026

Beyond One Model: Fusion in vLLM Semantic Router

vLLM Semantic Router introduces Fusion, a routing primitive that lets a panel of models produce independent answers, has a judge model analyze them, and synthesizes a single response — making model composition a first-class serving pattern.

vLLM Blog · Jun 16, 2026

"They screwed us": Personality clashes sent Anthropic's models offline

The US government suspended Anthropic models over a jailbreak vulnerability, revealing a clash between the illusion of perfect AI safety and real-world communication failures in AI governance.

Simon Willison · Jun 15, 2026

Claude Fable is relentlessly proactive

Without explicit instructions to use browser automation, Claude Fable 5 autonomously wrote HTML test pages, controlled browsers, and took screenshots to debug a UI bug.

Simon Willison · Jun 12, 2026

Anthropic Walks Back Policy That Could Have ‘Sabotaged’ AI Researchers Using Claude

Anthropic reverses its controversial policy of silently limiting Claude for frontier LLM research, sparking industry-wide reflection on AI safety transparency and developer trust.

Simon Willison · Jun 11, 2026

DiffusionGemma

Google open-sources DiffusionGemma, applying diffusion architecture to text generation for the first time, achieving over 500 tokens/sec and offering a new paradigm for high-throughput scenarios.

Simon Willison · Jun 11, 2026

Quoting Jeremy Howard

Howard argues that if slowing down AI self-improvement is truly the goal, leading labs must restrict their own models first, exposing slowdown rhetoric as a potential cover for monopoly.

Simon Willison · Jun 10, 2026

If Claude Fable stops helping you, you'll never know

Anthropic's silent restrictions on Claude Fable's assistance for rival AI development tasks have sparked a fierce debate about AI transparency versus commercial interests.

Simon Willison · Jun 10, 2026

DiffusionGemma: The First Diffusion LLM (dLLM) Natively Supported in vLLM

vLLM natively supports a discrete diffusion language model that replaces sequential generation with parallel block denoising, trading compute for bandwidth to significantly reduce latency.

vLLM Blog · Jun 10, 2026

Initial impressions of Claude Fable 5

Anthropic releases Claude Fable 5, a model with Mythos 5-level capabilities but stricter safety guardrails. Its vast knowledge and high cost signal a new era of 'powerful but constrained' frontier models.

Simon Willison · Jun 10, 2026

Quoting Andrej Karpathy

As AI makes software creation nearly effortless, Andrej Karpathy observes that his personal demand for software is growing exponentially, illustrating the Jevons paradox in tech.

Simon Willison · Jun 10, 2026

OpenAI Help: Lockdown Mode

Lockdown Mode uses deterministic rules to block outbound requests, cutting off the data exfiltration vector in prompt injection attacks and implicitly revealing the weakness of default ChatGPT security.

Simon Willison · Jun 6, 2026

An update on our election safeguards

Anthropic reveals its use of constitutional training, system prompts, and published evaluation datasets to keep Claude politically neutral, while coupling them with policy enforcement to prevent election abuse—reflecting a broader shift of AI companies into information governance roles.

Anthropic News · Jun 6, 2026

Nemotron 3.5 Content Safety: Customizable Multimodal Safety for Global Enterprise AI

NVIDIA's Nemotron 3.5 unifies multimodal evaluation, custom enterprise policies, and auditable reasoning traces into a single safety model, tackling real-world compliance and edge-case challenges for enterprise AI.

Hugging Face Blog · Jun 5, 2026

Task-Seeded Synthetic Q&A Generation for Nemotron Pretraining

NVIDIA introduces a task-seeded synthetic data generation pipeline that achieves double-digit benchmark improvements in Nemotron-3 Nano pretraining, signaling a new paradigm for synthetic data usage.

Hugging Face Blog · Jun 4, 2026

Microsoft's new MAI models

Simon Willison delves into Microsoft's new MAI models, revealing that despite claims of 'clean licensed data', the training process still relies on web crawls, sparking discussion on AI copyright issues.

Simon Willison · Jun 3, 2026

Claude Opus 4.8: "a modest but tangible improvement"

Anthropic releases Claude Opus 4.8, focusing not on performance leaps but on significantly improving model 'honesty' — less hallucination, more willingness to admit uncertainty, which may be a more important direction than benchmark scores.

Simon Willison · May 29, 2026

Accelerating Laguna XS.2 Inference with vLLM, Speculators, and LLM Compressor

Poolside's 33B-parameter agentic coding model, Laguna XS.2, achieves 2-3x inference speedup without quality loss through native vLLM integration, DFlash speculative decoding, and LLM Compressor quantization.

vLLM Blog · May 28, 2026

Quoting Armin Ronacher

Open-source maintainer Armin Ronacher highlights that AI-generated 'slop' issue reports are becoming a new burden for open-source communities, appearing professional but riddled with inaccuracies, wasting maintainers' time.

Simon Willison · May 25, 2026

Towards Speed-of-Light Text Generation with Nemotron-Labs Diffusion Language Models

NVIDIA's new diffusion language models generate tokens in parallel and refine them iteratively, potentially breaking the latency limits of traditional autoregressive models and enabling self-correction.

Hugging Face Blog · May 23, 2026

Google I/O, Gemini Spark, Antigravity

Google announced its personal AI Agent, Gemini Spark, and the underlying Antigravity tooling, but the shift to closed-source and vague security promises foreshadow a battle over AI agent control and trust.

Simon Willison · May 20, 2026

Gemini 3.5 Flash: more expensive, but Google plan to use it for everything

Google released Gemini 3.5 Flash with a significant price hike, yet simultaneously deployed it across core products like Search and the Gemini app, revealing a shift from pure cost-effectiveness to paying for comprehensive model capabilities.

Simon Willison · May 20, 2026

OlmoEarth v1.1: A more efficient family of models

Allen AI releases OlmoEarth v1.1, reducing compute costs by up to 3x by optimizing token sequence length in transformer models for satellite imagery, while maintaining performance, making large-scale environmental monitoring AI more economically viable.

Hugging Face Blog · May 20, 2026

The last six months in LLMs in five minutes

Simon Willison uses his 'pelican riding a bicycle' test to vividly recap how the 'best model' crown changed hands five times among three major providers in six months, revealing the industry's new phase of rapid-iteration arms race.

Simon Willison · May 19, 2026

Unlocking asynchronicity in continuous batching

Hugging Face identifies alternating CPU/GPU waits as a bottleneck in continuous batching and shows how running the two workloads asynchronously can yield a free 24% throughput boost.

Hugging Face Blog · May 14, 2026

llm 0.32a2

An update to the LLM tool adds support for OpenAI's new /v1/responses endpoint, reflecting that model reasoning (especially between tool calls) is becoming central and that developers need to adapt to new interaction patterns.

Simon Willison · May 13, 2026

Your AI Use Is Breaking My Brain

The article argues that the internet is evolving from 'bots talking to bots' into a 'Zombie Internet' where AI-generated low-quality content is not only rampant but is actively distorting human expression and thinking patterns.

Simon Willison · May 12, 2026

Using LLM in the shebang line of a script

Simon Willison demonstrates integrating LLM tools into a script's shebang line, making natural language descriptions directly executable, signaling a major shift in programming interaction.

Simon Willison · May 12, 2026

Quoting New York Times Editors’ Note

The New York Times issued a correction after mistaking an AI-generated summary of a politician's views for a real quote, highlighting the severe threat of AI 'hallucinations' to journalistic integrity and public trust.

Simon Willison · May 11, 2026

Using Claude Code: The Unreasonable Effectiveness of HTML

A member of the Claude Code team argues that requesting output in HTML from AI is more effective than Markdown, leveraging its rich interactivity and visualization capabilities to significantly enhance clarity and user experience.

Simon Willison · May 9, 2026

CyberSecQwen-4B: Why Defensive Cyber Needs Small, Specialized, Locally-Runnable Models

A specialized 4B cybersecurity model matches or outperforms an 8B generalist on key tasks, revealing the trend towards 'small, specialized, and local' AI deployment in security.

Hugging Face Blog · May 9, 2026

EMO: Pretraining mixture of experts for emergent modularity

AI2 releases EMO, a new MoE model pretrained to enable emergent modularity, allowing users to selectively use just 12.5% of experts for a task while maintaining near full-model performance.

Hugging Face Blog · May 9, 2026

Live blog: Code w/ Claude 2026

Anthropic showcased a comprehensive shift from a single model to a platform-centric, multi-agent collaboration paradigm at Code w/ Claude, focusing on enabling developers to build and run complex, long-duration agent tasks more efficiently.

Simon Willison · May 6, 2026

Our evaluation of OpenAI's GPT-5.5 cyber capabilities

The UK's AI Security Institute found that GPT-5.5's capabilities for finding vulnerabilities are comparable to those of the leading Claude Mythos model, but GPT-5.5's general availability marks a new phase in AI-driven cybersecurity offense and defense.

Simon Willison · May 1, 2026

LLM 0.32a0 is a major backwards-compatible refactor

Simon Willison's LLM library undergoes a major refactor, evolving from simple text prompts/responses to a structure supporting multi-turn message sequences and streaming mixed-type responses, adapting to modern LLMs' multimodal and tool-calling capabilities.

Simon Willison · Apr 30, 2026

Granite 4.1 LLMs: How They’re Built

IBM's Granite 4.1 series demonstrates that a meticulously engineered data pipeline and multi-stage training can enable an 8B dense model to match or exceed the performance of a previous 32B MoE model, highlighting a paradigm shift where data quality trumps parameter count.

Hugging Face Blog · Apr 29, 2026

DeepInfra on Hugging Face Inference Providers 🔥

Hugging Face integrates the cost-effective inference platform DeepInfra into its Inference Providers ecosystem, offering developers more model choices, flexible billing, and a unified API.

Hugging Face Blog · Apr 29, 2026

Introducing talkie: a 13B vintage language model from 1930

A 13B model trained exclusively on pre-1931 text aims to explore AI's reasoning, creativity, and 're-discovery' abilities within knowledge boundaries, sparking new discussions on data copyright and model purity.

Simon Willison · Apr 28, 2026

Speech translation in Google Meet is now rolling out to mobile devices

Google Meet has launched real-time speech translation on mobile for six languages, featuring voice imitation, though it remains in an early alpha stage with stability issues.

Simon Willison · Apr 28, 2026

How to build scalable web apps with OpenAI's Privacy Filter

OpenAI has open-sourced a high-performance PII detection model; combined with the Gradio Server framework, it lets developers quickly build web applications that handle sensitive information, a sign that privacy protection is becoming a standard part of AI application development.

Hugging Face Blog · Apr 27, 2026

WHY ARE YOU LIKE THIS

ChatGPT's image generation model autonomously added a 'WHY ARE YOU LIKE THIS' sign to a chaotic, user-requested image, demonstrating creativity or humor beyond the literal prompt.

Simon Willison · Apr 26, 2026

OpenAI's 'Unification' Ambition: GPT-5.5 Bids Farewell to Dedicated Code Models, Moving Towards General Agents

An OpenAI executive confirms GPT-5.5 will not have a dedicated code version, signaling that large models are moving from specialized capabilities to unified, general-purpose agent systems.

Simon Willison · Apr 25, 2026

GPT-5.5 prompting guide

OpenAI's official prompting guide for GPT-5.5 emphasizes it is not a drop-in replacement for GPT-5.2/5.4, requiring a fresh start in prompt engineering for optimal results.

Simon Willison · Apr 25, 2026

DeepSeek V4 - almost on the frontier, a fraction of the price

DeepSeek's V4 series delivers near-frontier performance at a fraction of the cost (Pro at $1.74/M input, Flash at just $0.14/M), potentially reshaping the cost-effectiveness standard for open-weight models.

Simon Willison · Apr 24, 2026

DeepSeek-V4: a million-token context that agents can actually use

DeepSeek-V4 makes million-token context windows practically usable for long-running AI agents by dramatically cutting inference costs and memory usage through its novel hybrid attention architecture.

Hugging Face Blog · Apr 24, 2026

A pelican for GPT-5.5 via the semi-official Codex backdoor API

Although OpenAI's latest model GPT-5.5 hasn't officially launched its API, developers are already accessing it through a 'semi-official backdoor' in its Codex CLI using their ChatGPT subscription, revealing new dynamics in the battle over AI model distribution channels.

Simon Willison · Apr 24, 2026

How to Use Transformers.js in a Chrome Extension

Hugging Face shares a practical architecture for running AI models locally in Chrome extensions, revealing key design patterns for model deployment, messaging, and frontend-backend separation under Manifest V3.

Hugging Face Blog · Apr 23, 2026

Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model

Alibaba's Qwen releases Qwen3.6-27B, a dense 27B parameter model that outperforms the previous generation's 397B MoE flagship on coding benchmarks, signaling a turning point for efficient, local-first coding models.

Simon Willison · Apr 23, 2026

Quoting Bobby Holley

Mozilla's CTO reports that using Anthropic's Claude AI, Firefox identified and fixed 271 vulnerabilities in an assessment, marking a shift where AI moves from an 'assistant' to a 'lead' role in security defense.

Simon Willison · Apr 22, 2026

Changes to GitHub Copilot Individual plans

GitHub Copilot tightens its individual plan due to the massive compute demands of AI agent workflows, halting sign-ups and restricting top models, signaling the unsustainability of per-request pricing in the agent era.

Simon Willison · Apr 22, 2026

AI Agents Are Too Human? A Counter-Intuitive Critique and Its Deeper Implications

An expert critiques current AI agents for being too 'human'—lacking rigor, patience, and focus, and tending to compromise when faced with difficulties, revealing fundamental flaws in their design.

Simon Willison · Apr 22, 2026

How to Ground a Korean AI Agent in Real Demographics with Synthetic Personas

NVIDIA, in collaboration with Korean institutions, released a dataset of 6 million synthetic personas to ground AI agents in authentic Korean demographics and cultural context, moving beyond simple Western defaults.

Hugging Face Blog · Apr 21, 2026

Claude Token Counter, now with model comparisons

Simon Willison's tool reveals that Claude Opus 4.7's new tokenizer inflates token counts by ~46% for text and up to 3x for images compared to its predecessor, leading to higher real-world costs despite unchanged official pricing.

Simon Willison · Apr 20, 2026

Changes in the system prompt between Claude Opus 4.6 and 4.7

The system prompt update for Claude Opus 4.7 reveals the evolution of AI assistants from passive responders into proactive tool users and deep task executors, along with more responsible safety frameworks.

Simon Willison · Apr 19, 2026

Claude system prompts as a git timeline

Simon Willison transformed Anthropic's published Claude system prompt history into a Git-based tool, enabling developers to trace prompt evolution like code changes, revealing a new paradigm for AI behavior debugging and understanding.

Simon Willison · Apr 18, 2026

Join us at PyCon US 2026 in Long Beach - we have new AI and security tracks this year

PyCon US 2026 features a dedicated AI track for the first time, covering topics from local model deployment to async agent patterns, signaling the Python community's systematic integration of AI into its core ecosystem and developer workflows.

Simon Willison · Apr 18, 2026

Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7

Simon Willison's famous 'pelican riding a bicycle' benchmark shows a smaller, locally run Alibaba Qwen3.6 model outperforming the massive, cloud-based Claude Opus 4.7 at creative SVG generation, revealing the surprising potential of open-source models for specific tasks.

Simon Willison · Apr 17, 2026

The PR you would have opened yourself

Hugging Face introduces a new tool that uses AI to help port models from the transformers library to MLX, revealing the core contradiction in open-source maintenance during the code agent era: the surge in contributions versus code quality and community communication costs.

Hugging Face Blog · Apr 16, 2026

Gemini 3.1 Flash TTS

Google's Gemini 3.1 Flash TTS is revolutionary because it uses detailed, screenplay-like prompts to precisely control emotion, accent, pace, and scene in speech synthesis, marking a shift from a 'tool' to a 'creative partner'.

Simon Willison · Apr 16, 2026

Trusted access for the next era of cyber defense

OpenAI launches GPT-5.4-Cyber, a model fine-tuned for defensive cybersecurity, and its "Trusted Access" program, signaling that leading AI companies are making cybersecurity a key battleground while seeking a new balance between safety and openness.

Simon Willison · Apr 15, 2026

The problem is that LLMs inherently lack the virtue of laziness

Bryan Cantrill argues that LLMs lack the human laziness that forces us to create elegant abstractions; without that constraint, AI will make systems larger, not better.

Simon Willison · Apr 13, 2026

Deep Agents v0.5

LangChain introduces async subagents for its Deep Agents framework, enabling parallel task delegation and removing blocking bottlenecks in agent workflows.

LangChain Blog · Apr 8, 2026

research-llm-apis 2026-04-04

Simon Willison used AI to analyze the raw HTTP APIs from Anthropic, OpenAI, Gemini, and Mistral in order to redesign the abstraction layer of his LLM library.

Simon Willison · Apr 5, 2026

Evaluating Long-Context Question & Answer Systems

A comprehensive guide to evaluating long-context Q&A systems covering metrics, dataset construction, and benchmark reviews across narrative and technical domains.

eugeneyan.com · Apr 5, 2026

Reward Hacking in Reinforcement Learning

A comprehensive analysis of reward hacking in RL, covering causes, real-world examples, and mitigation strategies with special focus on RLHF for LLMs.

Lil'Log · Apr 5, 2026

Training an LLM-RecSys Hybrid for Steerable Recs with Semantic IDs

A bilingual LLM trained with semantic IDs as vocabulary tokens can recommend items and be steered through natural conversation.

eugeneyan.com · Apr 5, 2026

Training an LLM-RecSys Hybrid for Steerable Recs with Semantic IDs

Replace random hash IDs with semantic tokens so LLMs can natively understand items and enable conversational recommendations.

eugeneyan · Apr 5, 2026

Welcome Gemma 4: Frontier multimodal intelligence on device

Gemma 4 introduces enhanced multimodal capabilities, supporting image, text, and audio inputs, significantly improving model intelligence and deployment flexibility across devices.

Hugging Face Blog · Apr 2, 2026

Any Custom Frontend with Gradio's Backend

The introduction of Gradio.Server allows developers to use custom frontend frameworks while enjoying the robust backend support of Gradio, significantly enhancing application development flexibility and efficiency.

Hugging Face Blog · Apr 1, 2026

Ulysses Sequence Parallelism: Training with Million-Token Contexts

Ulysses Sequence Parallelism addresses the challenges of training large language models with long sequences, significantly enhancing the capability to process million-token contexts.

Hugging Face Blog · Mar 9, 2026

Mixture of Experts (MoEs) in Transformers

Mixture-of-Experts (MoE) architectures are becoming a trend in Transformers because they improve computational efficiency and parallel processing, driving the evolution of large language models.

Hugging Face Blog · Feb 26, 2026

microgpt

Andrej Karpathy's microgpt project demonstrates how to implement a simplified GPT model from scratch in just 200 lines of Python code, revealing a trend towards minimalism in AI development.

Andrej Karpathy · Feb 12, 2026

Evaluating Long-Context Question & Answer Systems

Long-context Q&A systems face challenges like information overload and multi-hop reasoning, and evaluation should focus on answer faithfulness and helpfulness to enhance user experience.

Eugene Yan · Jun 22, 2025

Reward Hacking in Reinforcement Learning

Reward hacking presents challenges in reinforcement learning due to flaws in reward functions, particularly impacting language models, necessitating further research and mitigation strategies.

Lilian Weng · Nov 28, 2024

Extrinsic Hallucinations in LLMs

This article explores the phenomenon of extrinsic hallucinations in large language models, analyzing their causes and detection methods, and proposes effective strategies to reduce hallucinations while emphasizing the risks of knowledge updates.

Lilian Weng · Jul 7, 2024

Adversarial Attacks on LLMs

This article explores adversarial attacks on large language models (LLMs), including types of attacks, threat models, and their impact on the safety of generated text, revealing significant challenges in AI safety.

Lilian Weng · Oct 25, 2023

LLM Powered Autonomous Agents

LLM powered autonomous agents combine planning, memory, and tool usage, showcasing their potential in handling complex tasks and indicating a significant shift in work methodologies.

Lilian Weng · Jun 23, 2023

Prompt Engineering

This article delves into the basics and techniques of prompt engineering, emphasizing the importance of effective communication with large language models and how to optimize model performance through example selection and ordering.

Lilian Weng · Mar 15, 2023

The Transformer Family Version 2.0

Lilian Weng's new article deeply explores the evolution and new features of Transformers, revealing their ongoing impact in natural language processing.

Lilian Weng · Jan 27, 2023

A First Comprehensive Study of TurboQuant: Accuracy and Performance

A comprehensive benchmark by the vLLM team reveals that TurboQuant generally underperforms FP8 quantization and is only potentially viable for extreme memory-constrained edge deployments.

vLLM Blog

Agentic Document Processing: How AI Agents Are Automating Complex Workflows

The article explains how agentic document processing enables AI to shift from passive data extraction to actively understanding, reasoning, and executing complex business workflows for end-to-end automation.

LlamaIndex Blog

AI Document Classification: A Practical Guide to Automated Sorting and Tagging

AI document classification automates sorting and tagging by understanding content and context, freeing enterprises from labor-intensive manual classification and serving as a crucial step toward automating document workflows.

LlamaIndex Blog

AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms

Google DeepMind introduces AlphaEvolve, an AI coding agent that combines LLM creativity with automated evaluators to autonomously discover and optimize complex algorithms, with applications in data centers, chip design, and AI training.

Google DeepMind Blog

An update on recent Claude Code quality reports

Anthropic clarifies that Claude Code quality issues were not model-related, but stemmed from three complex bugs in the engineering framework, revealing deep challenges in AI Agent system engineering.

Simon Willison

Claude is a space to think

Anthropic declares Claude will remain permanently ad-free, arguing that advertising incentives are fundamentally incompatible with the core goal of an AI assistant being genuinely helpful.

Anthropic News

Anthropic introduces Claude Science: An AI workbench for scientists

Anthropic launches Claude Science, an AI workbench integrating 60+ scientific tools that produces auditable artifacts, signaling a move from general-purpose AI into deeply vertical scientific research.

Anthropic News

Arcade.dev tools now in LangSmith Fleet

LangChain integrates Arcade's 7,500+ agent-optimized tools into LangSmith Fleet, solving authentication, authorization, and reliability challenges for agent tool use through a single gateway.

LangChain Blog

Better Harness: A Recipe for Harness Hill-Climbing with Evals

LangChain introduces the 'Better-Harness' system, treating evaluations as 'training data' for agents, iteratively optimizing the engineering framework (harness) to improve agent performance, with a core focus on avoiding overfitting and achieving generalization.

LangChain Blog

Beyond LLMs: Why Scalable Enterprise AI Adoption Depends on Agent Logic

The key to scaling enterprise AI isn't better prompts or larger models, but Agent Logic: using deterministic software engineering primitives to constrain and steer LLMs for reliable, cost-effective execution.

Hugging Face Blog

Building a Better LiteParse Skill with Evals

Through trace analysis and iterative evaluations, LlamaIndex optimized an agent's PDF parsing strategy, revealing a shift toward disciplined, data-driven agent engineering.

LlamaIndex Blog

Building Blocks for Foundation Model Training and Inference on AWS

AWS details the infrastructure supporting the full foundation model lifecycle from pre-training and post-training to inference, revealing a paradigm shift from a single scaling law to three, and the deep integration trend of open-source software stacks with cloud infrastructure.

Hugging Face Blog

Capacity Efficiency at Meta: How Unified AI Agents Optimize Performance at Hyperscale

Meta has built a unified AI agent platform that encodes senior engineers' domain expertise into reusable skills, automating the discovery and resolution of infrastructure performance issues, saving significant power and engineering time.

Meta Engineering Blog

ChatGPT voice mode is a weaker model

Simon Willison points out that ChatGPT's voice mode actually runs on an older GPT-4o model, revealing AI companies' business strategy of deploying different capability models across product lines.

Simon Willison

DeepSeek V4 in vLLM: Efficient Long-context Attention

DeepSeek V4 achieves efficient million-token long-context inference on vLLM through innovative KV cache compression and sparse attention mechanisms, marking a new era for long-text processing.

vLLM Blog

EAGLE 3.1: Advancing Speculative Decoding Through Collaboration Between the EAGLE Team, vLLM, and TorchSpec

EAGLE 3.1 addresses the performance degradation of speculative decoding with long contexts and varied chat templates by introducing FC normalization and a post-norm design, doubling acceptance length in long-context scenarios and significantly improving the robustness and practicality of inference acceleration.

vLLM Blog

Elastic Expert Parallelism in vLLM

vLLM introduces Elastic Expert Parallelism (Elastic EP), enabling runtime scaling of MoE inference deployments by adding or removing GPU workers without restarts, adapting to demand fluctuations and laying the groundwork for fault-tolerant serving.

vLLM Blog

Hackers Simply Asked Meta AI to Give Them Access to High-Profile Instagram Accounts. It Worked

In a real-world attack, hackers bypassed Instagram's account recovery by simply asking Meta's AI chatbot to link a new email, revealing the severe risks of wiring AI directly into critical systems without proper authorization boundaries.

Simon Willison ·

Harness, Scaffold, and the AI Agent Terms Worth Getting Right

The article clarifies the confusion around key AI Agent terms like Harness and Scaffolding, aiming to build a clear, shared mental model for the field.

Hugging Face Blog ·

How we build evals for Deep Agents

The LangChain team shares their core philosophy for building AI agent evals: more tests don't mean better agents; the key is designing targeted, self-documenting evaluations that directly measure desired behaviors.

LangChain Blog ·

Human judgment in the agent improvement loop

LangChain explains the core challenge of building reliable AI Agents: integrating human experts' tacit knowledge and judgment into the development loop, not just relying on documented explicit knowledge.

LangChain Blog ·

I think Anthropic and OpenAI have found product-market fit

Simon Willison argues that OpenAI and Anthropic have found product-market fit through coding/general-purpose AI agents, evidenced by their shift to charging enterprise customers based on API usage, marking a new phase in AI commercialization.

Simon Willison ·

Introducing Claude Opus 4.7

Anthropic releases Claude Opus 4.7, focusing on enhanced complex coding and long-running task capabilities, with its 'self-verification' mechanism marking a key step towards more autonomous AI agents.

Anthropic News ·

Introducing Claude Opus 4.8

Anthropic releases Claude Opus 4.8, which significantly improves the reliability, judgment, and long-running consistency of agent tasks, marking AI's practical shift from 'usable' to 'trustworthy'.

Anthropic News ·

Introducing NVIDIA Nemotron 3 Nano Omni: Long-Context Multimodal Intelligence for Documents, Audio and Video Agents

NVIDIA releases its omni-modal understanding model Nemotron 3 Nano Omni, setting new open-source benchmarks across document, audio-video understanding, and agentic tasks, while delivering significantly higher efficiency than comparable models.

Hugging Face Blog ·

Introducing ParseBench: The First Document Parsing Benchmark for AI Agents

LlamaIndex releases ParseBench, the first document parsing benchmark for AI agents, which evaluates parsers across five dimensions, including tables and charts. It finds that no single method excels at everything, with LlamaParse Agentic showing the most balanced performance.

LlamaIndex Blog ·

Is grep all you need? Lexical VS Semantic Search for Agents

The article explores the boundaries between traditional grep and semantic search/RAG for AI agents, highlighting grep's limitations with unstructured documents and at enterprise scale, and proposes a hybrid approach combining parsing tools.

LlamaIndex Blog ·

TCS and Anthropic partner to bring Claude to regulated industries

The Anthropic-TCS partnership marks a strategic shift in AI adoption from direct model sales to channel-based integration, leveraging traditional IT giants to penetrate heavily regulated sectors.

Anthropic News ·

Expanding Project Glasswing

Anthropic is scaling its AI-driven critical infrastructure defense network while warning that automated AI cyberattacks will become ubiquitous within a year, forcing the industry to shift from vulnerability discovery to rapid remediation.

Anthropic News ·

What we learned mapping a year’s worth of AI-enabled cyber threats

AI is not just being used to write malware; it's increasingly being applied in the deeper, more complex stages of cyberattacks, rendering traditional risk assessment methods obsolete and exposing gaps in existing security frameworks like MITRE ATT&CK.

Anthropic News ·

March 2026: LangChain Newsletter

LangChain is pushing agents from experimental prototypes to scalable, manageable enterprise assets through updates like LangSmith Fleet, Skills, and Sandboxes.

LangChain Blog ·

Anthropic acquires Stainless

Anthropic acquires core SDK tool provider Stainless to solve the 'last mile' problem of AI agent connectivity and strengthen its MCP ecosystem.

Anthropic News ·

KPMG integrates Claude across its core business and workforce of more than 276,000 in strategic alliance

KPMG forms a global strategic alliance with Anthropic, deeply integrating Claude into its core business platform and workflows for all 276,000 employees, marking a full-scale AI bet by the professional services giant.

Anthropic News · May 19, 2026

Meta's new model is Muse Spark, and meta.ai chat has some interesting tools

Meta released Muse Spark, but the real story is its chat interface integrating 16 tools—web search, social media content search, code interpreter, etc.—building a complete AI agent workbench.

Simon Willison ·

OCR Accuracy Explained: What Impacts Performance and How to Improve It

OCR accuracy is not a single number, but a systems engineering problem determined by image quality, document complexity, evaluation metrics, and post-processing.

LlamaIndex Blog ·

Open Models have crossed a threshold

LangChain's evaluations show that open-source models like GLM-5 and MiniMax M2.7 now match top closed-source models on core agent tasks, while offering up to 90% cost reduction and significantly lower latency.

LangChain Blog ·

Anthropic's Claude Tag: When AI Becomes a 'Permanently Online Colleague,' How Will Work Patterns Be Reshaped?

Anthropic launched Claude Tag, deeply integrating AI into team collaboration spaces like Slack with capabilities for multi-user collaboration, long-term memory, and proactive asynchronous work, marking a paradigm shift from AI as a tool to a 'digital colleague'.

Anthropic News ·

Introducing Claude Sonnet 5

Anthropic's Sonnet 5 delivers agentic performance close to the Opus flagship at significantly lower cost, enabling developers to build powerful autonomous agents with mid-tier models.

Anthropic News ·

Serving Agentic Workloads at Scale with vLLM x Mooncake

vLLM integrates Mooncake's distributed KV cache to solve the bottleneck of recomputing long context prefixes in agentic workloads, achieving a 3.8x throughput increase and a 46x reduction in time-to-first-token.

vLLM Blog ·

SIMA 2: An agent that plays, reasons, and learns with you

DeepMind's SIMA 2 integrates Gemini's reasoning into 3D game AI, evolving from a simple instruction follower to an intelligent companion that understands goals, converses, and self-improves.

Google DeepMind Blog ·

The pressure

curl's lead maintainer, Daniel Stenberg, reveals that an unprecedented flood of high-quality, AI-assisted security vulnerability reports is putting immense pressure on the open-source project's team.

Simon Willison ·

The State of FP8 KV-Cache and Attention Quantization in vLLM

vLLM uses FP8 quantization for KV cache to halve memory usage and double throughput for long-context inference while maintaining accuracy, though specific performance pitfalls need attention.

vLLM Blog ·

Unstructured Data Extraction: How to Turn Documents into Structured Insights

LlamaIndex's blog post highlights that 90% of enterprise data is unstructured, and modern AI stacks (NLP, NER, LLM) can convert these documents into queryable structured information, unlocking significant business value.

LlamaIndex Blog ·

vLLM Tops the Artificial Analysis Leaderboard

The open-source inference engine vLLM has outperformed all proprietary competitors in deploying multiple frontier open-weight models, with its core optimization techniques like operator fusion publicly available, revealing the immense potential of open source in AI inference.

vLLM Blog ·