The Open Agent Leaderboard
Hugging Face and IBM launch the Open Agent Leaderboard, shifting evaluation from standalone models to full agent systems (including tools, planning, memory), while measuring both performance and cost.
OpenAI's Codex CLI introduces a /goal command that lets the coding agent loop automatically until a goal is met or the token budget is exhausted, signaling a shift from single-shot Q&A to persistent task execution.
NVIDIA releases Nemotron 3 Nano Omni, a hybrid Mamba-Transformer model enabling long-context multimodal understanding of documents, audio, and video, leading multiple benchmarks and offering an efficient new option for AI agents handling complex real-world tasks.
NVIDIA releases the open-source multimodal model Nemotron 3 Nano Omni, which uses a Mixture of Experts architecture to activate only 3B of its 30B parameters, achieving 9x higher throughput than comparable models to solve efficiency and fragmentation issues in multimodal AI agents.
An OpenAI executive confirms GPT-5.5 will not have a dedicated code version, signaling that large models are moving from specialized capabilities to unified, general-purpose agent systems.
The culprit behind Claude Code's quality decline over the past two months wasn't model degradation but three harness-level bugs, with a 'session state cleanup' glitch exposing hidden complexities in AI agent engineering.
DeepSeek-V4 makes million-token context windows practically usable for long-running AI agents by dramatically cutting inference costs and memory usage through its novel hybrid attention architecture.
An end-to-end multimodal agent demo runs on the NVIDIA Jetson Orin Nano Super, showing how the model autonomously decides when to use the camera and answers questions with visual context, a sign that powerful AI capabilities are reaching edge devices.
An expert critiques current AI agents for being too 'human': they lack rigor, patience, and focus, and tend to compromise when faced with difficulties, which points to fundamental flaws in their design.
NVIDIA, in collaboration with Korean institutions, releases a dataset of 6 million synthetic personas to ground AI agents in authentic Korean demographics and cultural context, moving beyond simple Western defaults.
IBM and Hugging Face introduce the VAKRA benchmark, revealing that current AI agents perform poorly on complex multi-step tasks, with key failure modes including tool-chain planning, parameter passing, and error recovery.
LangChain's annual conference focuses on the challenges of scaling AI agents from production validation to enterprise-wide deployment, revealing how major companies build platforms, evaluate performance, and structure teams.
LangChain argues that building reliable AI agents requires systematically integrating domain experts' tacit knowledge and judgment throughout the development lifecycle, rather than relying solely on the model's own capabilities.
LangChain argues that building better AI agents hinges on improving their 'harness' rather than the model itself, and shares a systematic method using evals as training signals for iterative improvement.
LangChain introduces async subagents for its Deep Agents framework, enabling parallel task delegation and removing blocking bottlenecks in agent workflows.
LangChain partners with Arcade.dev to integrate over 7,500 agent-optimized tools into LangSmith Fleet, simplifying tool integration, authentication, and authorization through a single MCP gateway.
Continual learning for AI agents occurs at three layers: model, harness, and context, with context-layer evolution being the most practical and actionable.
A LangChain engineer shares a complete pipeline for AI agents to automatically detect regressions, diagnose issues, and submit fix PRs after deployment, combining statistical methods with intelligent triage to reduce false positives.
LangChain's evaluations show that open models like GLM-5 and MiniMax M2.7 now match closed frontier models on core agent tasks such as file operations and tool use, at a fraction of the cost and with lower latency.
LangChain is pushing AI agents from experimental prototypes to manageable, collaborative, and securely deployable enterprise productivity tools through features like LangSmith Fleet, Skills, and Sandboxes.
LangChain and MongoDB's deep integration transforms Atlas into a unified AI agent backend for vector search, persistent memory, data querying, and observability, aiming to solve data architecture fragmentation from prototype to production.
LangChain proposes a 6-point checklist before building agent evaluations, emphasizing manual analysis of 20-50 real failure traces before automating tests.
LangChain shares its core philosophy for building AI agent evaluation systems: more evals aren't better; instead, precisely define and measure the agent behaviors you care about to guide its evolution.
Google DeepMind's AlphaEvolve is an AI coding agent that autonomously evolves and optimizes algorithms, discovering new knowledge in math and computing, and has already improved Google's data center efficiency.
Anthropic's Claude Opus 4.7 release focuses on enhanced reliability for complex, long-running tasks and self-verification capabilities, signaling a shift from AI as a tool to a trustworthy work partner.
LlamaIndex introduces ParseBench, the first OCR benchmark designed specifically for AI agents, and open-sources a local document parsing server and a secure sandboxed CLI agent, signaling a shift in document processing towards agent-native infrastructure.
Google DeepMind's SIMA 2 integrates Gemini to evolve from an instruction-follower into an interactive companion that can reason, converse, and self-improve in 3D virtual worlds.
Single-pass extraction lacks a verification loop, leading to high error rates on complex real-world documents; deep extraction uses an agentic, iterative verify-and-correct loop to raise accuracy on critical fields from demo level to production-ready.
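
The verify-and-correct idea in the last item can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names deep_extract, toy_extract, and toy_verify are hypothetical and are not taken from any product or library, and the toy extractor and verifier are simple string functions standing in for model calls. The sketch only shows the control flow: extract, verify, feed the complaints back, and stop when verification passes or a round limit is reached.

```python
from typing import Callable, Dict, List

Fields = Dict[str, str]


def deep_extract(
    document: str,
    extract: Callable[[str, List[str]], Fields],
    verify: Callable[[str, Fields], List[str]],
    max_rounds: int = 3,
) -> Fields:
    """Extract fields, re-extracting with verifier feedback until it passes.

    Hypothetical helper: `extract` and `verify` are caller-supplied stand-ins
    for whatever model or rule set does the real work.
    """
    feedback: List[str] = []
    fields: Fields = {}
    for _ in range(max_rounds):
        fields = extract(document, feedback)
        feedback = verify(document, fields)
        if not feedback:  # no complaints left: accept this result
            break
    return fields


def toy_extract(document: str, feedback: List[str]) -> Fields:
    """Toy extractor: naive on the first pass, stricter once it has feedback."""
    lines = document.splitlines()
    if feedback:
        # Corrective pass: only consider lines that start with "Total:".
        lines = [ln for ln in lines if ln.startswith("Total:")]
    return {"total": lines[0].split(":", 1)[1].strip()}


def toy_verify(document: str, fields: Fields) -> List[str]:
    """Toy verifier: the value must appear on a line of the form 'Total: <value>'."""
    if f"Total: {fields['total']}" in document.splitlines():
        return []
    return ["total must come from the line starting with 'Total:'"]


if __name__ == "__main__":
    doc = "Subtotal: 90\nTotal: 100"
    print(deep_extract(doc, toy_extract, toy_verify, max_rounds=1))
    print(deep_extract(doc, toy_extract, toy_verify))
```

With max_rounds=1 the loop degenerates to single-pass extraction and keeps the naive first answer; allowing further rounds lets the verifier's feedback steer the second pass to the correct line.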