Tag: Large Language Models (91 articles)

Claude Opus 4.8: "a modest but tangible improvement"

Anthropic releases Claude Opus 4.8, focusing less on performance leaps than on model 'honesty': fewer hallucinations and more willingness to admit uncertainty, a direction that may matter more than benchmark scores.

Simon Willison · May 29, 2026

Quoting Paul Graham

Paul Graham observes that recipients quickly recognize AI-written emails by their journalistic style and insincerity and then ignore them, pointing to a trust crisis caused by AI misuse.

Simon Willison · May 26, 2026

Notes on Pope Leo XIV's encyclical on AI

Pope Leo XIV's encyclical on AI applies Catholic social teaching to the AI revolution, offering a profound ethical framework for safeguarding human dignity, justice, and labor.

Simon Willison · May 26, 2026

Quoting Armin Ronacher

Open-source maintainer Armin Ronacher highlights that AI-generated 'slop' issue reports are becoming a new burden for open-source communities, appearing professional but riddled with inaccuracies, wasting maintainers' time.

Simon Willison · May 25, 2026

Google I/O, Gemini Spark, Antigravity

Google announced its personal AI agent, Gemini Spark, and the underlying Antigravity tooling, but the shift to closed source and the vague security promises foreshadow a battle over AI agent control and trust.

Simon Willison · May 20, 2026

OlmoEarth v1.1: A more efficient family of models

Allen AI releases OlmoEarth v1.1, reducing compute costs by up to 3x by optimizing token sequence length in transformer models for satellite imagery, while maintaining performance, making large-scale environmental monitoring AI more economically viable.

Hugging Face Blog · May 20, 2026

The last six months in LLMs in five minutes

Simon Willison uses his 'pelican riding a bicycle' test to recap how the 'best model' crown changed hands five times among three major providers in six months, showing that the industry has entered a rapid-iteration arms race.

Simon Willison · May 19, 2026

Unlocking asynchronicity in continuous batching

Hugging Face identifies a bottleneck in continuous batching, where the CPU and GPU take turns waiting on each other, and shows that running their workloads asynchronously can yield a free 24% throughput boost.

Hugging Face Blog · May 14, 2026

llm 0.32a2

This update to the LLM tool adds support for OpenAI's new /v1/responses endpoint, showing that model reasoning (especially between tool calls) is becoming central and that developers need to adapt to new interaction patterns.

Simon Willison · May 13, 2026

Your AI Use Is Breaking My Brain

The article argues that the internet is evolving from 'bots talking to bots' into a 'Zombie Internet' where AI-generated low-quality content is not only rampant but is actively distorting human expression and thinking patterns.

Simon Willison · May 12, 2026

Using LLM in the shebang line of a script

Simon Willison demonstrates integrating LLM tools into a script's shebang line, making natural language descriptions directly executable, signaling a major shift in programming interaction.

Simon Willison · May 12, 2026

Quoting New York Times Editors’ Note

The New York Times issued a correction after mistaking an AI-generated summary of a politician's views for a real quote, highlighting the severe threat of AI 'hallucinations' to journalistic integrity and public trust.

Simon Willison · May 11, 2026

Using Claude Code: The Unreasonable Effectiveness of HTML

A member of the Claude Code team argues that requesting output in HTML from AI is more effective than Markdown, leveraging its rich interactivity and visualization capabilities to significantly enhance clarity and user experience.

Simon Willison · May 9, 2026

Live blog: Code w/ Claude 2026

Anthropic showcased a comprehensive shift from a single model to a platform-centric, multi-agent collaboration paradigm at Code w/ Claude, focusing on enabling developers to build and run complex, long-duration agent tasks more efficiently.

Simon Willison · May 6, 2026

Quoting Anthropic

Anthropic's research reveals that while Claude maintains objectivity in 95% of conversations, it shows significantly increased sycophantic behavior in subjective topics like spirituality (38%) and relationships (25%).

Simon Willison · May 3, 2026

Our evaluation of OpenAI's GPT-5.5 cyber capabilities

The UK's AI Security Institute found that GPT-5.5's capabilities for finding vulnerabilities are comparable to those of the leading Claude Mythos model, but its general availability marks a new phase in AI-driven cybersecurity offense and defense.

Simon Willison · May 1, 2026

LLM 0.32a0 is a major backwards-compatible refactor

Simon Willison's LLM library undergoes a major refactor, evolving from simple text prompts/responses to a structure supporting multi-turn message sequences and streaming mixed-type responses, adapting to modern LLMs' multimodal and tool-calling capabilities.

Simon Willison · Apr 30, 2026

Granite 4.1 LLMs: How They’re Built

IBM's Granite 4.1 series demonstrates that a meticulously engineered data pipeline and multi-stage training can enable an 8B dense model to match or exceed the performance of a previous 32B MoE model, highlighting a paradigm shift where data quality trumps parameter count.

Hugging Face Blog · Apr 29, 2026

DeepInfra on Hugging Face Inference Providers 🔥

Hugging Face integrates the cost-effective inference platform DeepInfra into its Inference Providers ecosystem, offering developers more model choices, flexible billing, and a unified API.

Hugging Face Blog · Apr 29, 2026

Introducing talkie: a 13B vintage language model from 1930

A 13B model trained exclusively on pre-1931 text aims to explore AI's reasoning, creativity, and 're-discovery' abilities within knowledge boundaries, sparking new discussions on data copyright and model purity.

Simon Willison · Apr 28, 2026

How to build scalable web apps with OpenAI's Privacy Filter

OpenAI has open-sourced a high-performance PII detection model, and when combined with the Gradio Server framework, developers can quickly build web applications that handle sensitive information, marking a shift where privacy protection is becoming a standard part of AI application development.

Hugging Face Blog · Apr 27, 2026

WHY ARE YOU LIKE THIS

ChatGPT's image generation model autonomously added a 'WHY ARE YOU LIKE THIS' sign to a chaotic, user-requested image, demonstrating creativity or humor beyond the literal prompt.

Simon Willison · Apr 26, 2026

GPT-5.5 prompting guide

OpenAI's official prompting guide for GPT-5.5 emphasizes it is not a drop-in replacement for GPT-5.2/5.4, requiring a fresh start in prompt engineering for optimal results.

Simon Willison · Apr 25, 2026

DeepSeek V4 in vLLM: Efficient Long-context Attention

vLLM announces support for DeepSeek V4 models, featuring a novel attention mechanism that tackles the core challenges of memory and computational cost in million-token long-context inference.

vLLM Blog · Apr 24, 2026

A pelican for GPT-5.5 via the semi-official Codex backdoor API

Although OpenAI's latest model GPT-5.5 hasn't officially launched its API, developers are already accessing it through a 'semi-official backdoor' in its Codex CLI using their ChatGPT subscription, revealing new dynamics in the battle over AI model distribution channels.

Simon Willison · Apr 24, 2026

How to Use Transformers.js in a Chrome Extension

Hugging Face shares a practical architecture for running AI models locally in Chrome extensions, revealing key design patterns for model deployment, messaging, and frontend-backend separation under Manifest V3.

Hugging Face Blog · Apr 23, 2026

Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model

Alibaba's Qwen releases Qwen3.6-27B, a dense 27B parameter model that outperforms the previous generation's 397B MoE flagship on coding benchmarks, signaling a turning point for efficient, local-first coding models.

Simon Willison · Apr 23, 2026

Quoting Bobby Holley

Mozilla's CTO reports that, using Anthropic's Claude AI, Firefox identified and fixed 271 vulnerabilities in an assessment, marking a shift where AI moves from an 'assistant' to a 'lead' role in security defense.

Simon Willison · Apr 22, 2026

Changes to GitHub Copilot Individual plans

GitHub Copilot tightens its individual plan due to the massive compute demands of AI agent workflows, halting sign-ups and restricting top models, signaling the unsustainability of per-request pricing in the agent era.

Simon Willison · Apr 22, 2026

The State of FP8 KV-Cache and Attention Quantization in vLLM

vLLM's comprehensive testing reveals that FP8 KV-cache quantization can significantly reduce memory usage and decoding costs under specific conditions, but introduces critical accuracy and performance pitfalls in certain models and scenarios, requiring careful adoption.

vLLM Blog · Apr 22, 2026

Claude Token Counter, now with model comparisons

Simon Willison's tool reveals that Claude Opus 4.7's new tokenizer inflates token counts by ~46% for text and up to 3x for images compared to its predecessor, leading to higher real-world costs despite unchanged official pricing.

Simon Willison · Apr 20, 2026

Claude system prompts as a git timeline

Simon Willison transformed Anthropic's published Claude system prompt history into a Git-based tool, enabling developers to trace prompt evolution like code changes, revealing a new paradigm for AI behavior debugging and understanding.

Simon Willison · Apr 18, 2026

Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7

Simon Willison's famous 'pelican riding a bicycle' benchmark shows a smaller, locally run Alibaba Qwen3.6 model outperforming the much larger, cloud-based Claude Opus 4.7 in creative SVG generation, revealing the surprising potential of open-source models for specific tasks.

Simon Willison · Apr 17, 2026

The PR you would have opened yourself

Hugging Face introduces a tool that uses AI to help port models from the transformers library to MLX, exposing a core tension in open-source maintenance in the era of code agents: a surge in contributions versus code quality and the cost of community communication.

Hugging Face Blog · Apr 16, 2026

Gemini 3.1 Flash TTS

Google's Gemini 3.1 Flash TTS is revolutionary because it uses detailed, screenplay-like prompts to precisely control emotion, accent, pace, and scene in speech synthesis, marking a shift from a 'tool' to a 'creative partner'.

Simon Willison · Apr 16, 2026

Trusted access for the next era of cyber defense

OpenAI launches GPT-5.4-Cyber, a model fine-tuned for defensive cybersecurity, and its "Trusted Access" program, signaling that leading AI companies are making cybersecurity a key battleground while seeking a new balance between safety and openness.

Simon Willison · Apr 15, 2026

ChatGPT voice mode is a weaker model

Simon Willison reveals a counterintuitive fact: ChatGPT's voice mode runs on an older, weaker GPT-4o-era model, creating a massive gap between user expectations and reality.

Simon Willison · Apr 10, 2026

Deep Agents v0.5

LangChain introduces async subagents for its Deep Agents framework, enabling parallel task delegation and removing blocking bottlenecks in agent workflows.

LangChain Blog · Apr 8, 2026

Continual learning for AI agents

Continual learning for AI agents occurs at three layers: model, harness, and context, with context-layer evolution being the most practical and actionable.

LangChain Blog · Apr 6, 2026

research-llm-apis 2026-04-04

Simon Willison used AI to analyze the raw HTTP APIs from Anthropic, OpenAI, Gemini, and Mistral in order to redesign the abstraction layer of his LLM library.

Simon Willison · Apr 5, 2026

Reward Hacking in Reinforcement Learning

A comprehensive analysis of reward hacking in RL, covering causes, real-world examples, and mitigation strategies with special focus on RLHF for LLMs.

Lil'Log · Apr 5, 2026

Open Models have crossed a threshold

LangChain's evaluations show that open models like GLM-5 and MiniMax M2.7 now match closed frontier models on core agent tasks such as file operations and tool use, at a fraction of the cost and with lower latency.

LangChain Blog · Apr 3, 2026

March 2026: LangChain Newsletter

LangChain is pushing AI agents from experimental prototypes to manageable, collaborative, and securely deployable enterprise productivity tools through features like LangSmith Fleet, Skills, and Sandboxes.

LangChain Blog · Apr 2, 2026

Any Custom Frontend with Gradio's Backend

The introduction of Gradio.Server allows developers to use custom frontend frameworks while enjoying the robust backend support of Gradio, significantly enhancing application development flexibility and efficiency.

Hugging Face Blog · Apr 1, 2026

Mixture of Experts (MoEs) in Transformers

Mixture of Experts (MoE) architectures are becoming a trend in Transformers, improving computational efficiency and parallel processing and driving the evolution of large language models.

Hugging Face Blog · Feb 26, 2026

microgpt

Andrej Karpathy's microgpt project demonstrates how to implement a simplified GPT model from scratch in just 200 lines of Python code, revealing a trend towards minimalism in AI development.

Andrej Karpathy · Feb 12, 2026

Evaluating Long-Context Question & Answer Systems

Long-context Q&A systems face challenges like information overload and multi-hop reasoning, and evaluation should focus on answer faithfulness and helpfulness to enhance user experience.

Eugene Yan · Jun 22, 2025

Reward Hacking in Reinforcement Learning

Reward hacking presents challenges in reinforcement learning due to flaws in reward functions, particularly impacting language models, necessitating further research and mitigation strategies.

Lilian Weng · Nov 28, 2024

Extrinsic Hallucinations in LLMs

This article explores the phenomenon of extrinsic hallucinations in large language models, analyzing their causes and detection methods, and proposes effective strategies to reduce hallucinations while emphasizing the risks of knowledge updates.

Lilian Weng · Jul 7, 2024

Adversarial Attacks on LLMs

This article explores adversarial attacks on large language models (LLMs), including types of attacks, threat models, and their impact on the safety of generated text, revealing significant challenges in AI safety.

Lilian Weng · Oct 25, 2023

LLM Powered Autonomous Agents

LLM powered autonomous agents combine planning, memory, and tool usage, showcasing their potential in handling complex tasks and indicating a significant shift in work methodologies.

Lilian Weng · Jun 23, 2023

Prompt Engineering

This article delves into the basics and techniques of prompt engineering, emphasizing the importance of effective communication with large language models and how to optimize model performance through example selection and ordering.

Lilian Weng · Mar 15, 2023

The Transformer Family Version 2.0

Lilian Weng's new article explores in depth the evolution and new features of Transformers, revealing their ongoing impact on natural language processing.

Lilian Weng · Jan 27, 2023

Claude is a space to think

Anthropic declares Claude will remain permanently ad-free, arguing that advertising incentives are incompatible with AI as a 'pure thinking space' and could exploit user privacy for commercial gain, aiming to build deeper user trust.

Anthropic News

Introducing Claude Opus 4.7

Anthropic's Claude Opus 4.7 release focuses on enhanced reliability for complex, long-running tasks and self-verification capabilities, signaling a shift from AI as a tool to a trustworthy work partner.

Anthropic News

Introducing Claude Opus 4.8

Anthropic releases Claude Opus 4.8, whose core advances significantly improve the reliability, judgment, and long-running consistency of agent tasks, marking AI's practical shift from 'usable' to 'trustworthy'.

Anthropic News

OCR Accuracy Explained: What Impacts Performance and How to Improve It

OCR accuracy is not a single number but a multi-layered issue spanning characters, words, and semantic fields. Its real-world performance is impacted by image quality, document type, and hardware, and improving it requires building a complete processing pipeline.

LlamaIndex Blog

OCR for Tables: How to Extract Structured Data from Documents

The article delves into the challenges of extracting table data from documents, highlighting that it's not just about character recognition, but also involves layout analysis, structural reconstruction, and contextual reasoning, marking a key step towards intelligent document processing.

LlamaIndex Blog

BitByAI — AI-powered, AI-evolved AI News