Tag: AI Agents (28 articles)

The Open Agent Leaderboard

Hugging Face and IBM launch the Open Agent Leaderboard, shifting evaluation from standalone models to full agent systems (including tools, planning, and memory) while measuring both performance and cost.

Hugging Face Blog · May 18, 2026

Codex CLI 0.128.0 adds /goal

OpenAI's Codex CLI introduces a /goal command that lets the coding agent loop automatically until a goal is met or the token budget is exhausted, signaling a shift from single-shot Q&A to persistent task execution.

Simon Willison · May 1, 2026

Introducing NVIDIA Nemotron 3 Nano Omni: Long-Context Multimodal Intelligence for Documents, Audio and Video Agents

NVIDIA releases Nemotron 3 Nano Omni, a hybrid Mamba-Transformer model enabling long-context multimodal understanding of documents, audio, and video, leading multiple benchmarks and offering an efficient new option for AI agents handling complex real-world tasks.

Hugging Face Blog · Apr 28, 2026

Run Highly Efficient Multimodal Agentic AI with NVIDIA Nemotron 3 Nano Omni Using vLLM

NVIDIA releases the open-source multimodal model Nemotron 3 Nano Omni, which uses a Mixture of Experts architecture to activate only 3B of its 30B parameters and achieves 9x higher throughput than comparable models, addressing efficiency and fragmentation issues in multimodal AI agents.

vLLM Blog · Apr 28, 2026

OpenAI's 'Unification' Ambition: GPT-5.5 Bids Farewell to Dedicated Code Models, Moving Towards General Agents

An OpenAI executive confirms GPT-5.5 will not have a dedicated code version, signaling that large models are moving from specialized capabilities to unified, general-purpose agent systems.

Simon Willison · Apr 25, 2026

An update on recent Claude Code quality reports

The culprit behind Claude Code's quality decline over the past two months wasn't model degradation but three harness-level bugs, with a 'session state cleanup' glitch exposing hidden complexities in AI agent engineering.

Simon Willison · Apr 24, 2026

DeepSeek-V4: a million-token context that agents can actually use

DeepSeek-V4 makes million-token context windows practically usable for long-running AI agents by dramatically cutting inference costs and memory usage through its novel hybrid attention architecture.

Hugging Face Blog · Apr 24, 2026

Gemma 4 VLA Demo on Jetson Orin Nano Super

An end-to-end multimodal agent demo running on NVIDIA Jetson Orin Nano Super shows the model autonomously deciding when to use the camera and answering questions with visual context, a sign that powerful AI capabilities are reaching edge devices.

Hugging Face Blog · Apr 22, 2026

AI Agents Are Too Human? A Counter-Intuitive Critique and Its Deeper Implications

An expert critiques current AI agents for being too 'human': they lack rigor, patience, and focus, and tend to compromise when faced with difficulties, which points to fundamental flaws in their design.

Simon Willison · Apr 22, 2026

How to Ground a Korean AI Agent in Real Demographics with Synthetic Personas

NVIDIA, in collaboration with Korean institutions, released a dataset of 6 million synthetic personas to ground AI agents in authentic Korean demographics and cultural context, moving beyond simple Western defaults.

Hugging Face Blog · Apr 21, 2026

Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents

IBM and Hugging Face introduce the VAKRA benchmark, revealing that current AI agents perform poorly on complex multi-step tasks, with key failure modes including tool-chain planning, parameter passing, and error recovery.

Hugging Face Blog · Apr 15, 2026

Previewing Interrupt 2026: Agents at Enterprise Scale

LangChain's annual conference focuses on the challenges of scaling AI agents from production validation to enterprise-wide deployment, revealing how major companies build platforms, evaluate performance, and structure teams.

LangChain Blog · Apr 10, 2026

Human judgment in the agent improvement loop

LangChain argues that building reliable AI agents requires systematically integrating domain experts' tacit knowledge and judgment throughout the development lifecycle, rather than relying solely on the model's own capabilities.

LangChain Blog · Apr 9, 2026

Better Harness: A Recipe for Harness Hill-Climbing with Evals

LangChain argues that building better AI agents hinges on improving their 'harness' rather than the model itself, and shares a systematic method using evals as training signals for iterative improvement.

LangChain Blog · Apr 9, 2026

Deep Agents v0.5

LangChain introduces async subagents for its Deep Agents framework, enabling parallel task delegation and removing blocking bottlenecks in agent workflows.

LangChain Blog · Apr 8, 2026

Arcade.dev tools now in LangSmith Fleet

LangChain partners with Arcade.dev to integrate over 7,500 agent-optimized tools into LangSmith Fleet, simplifying tool integration, authentication, and authorization through a single MCP gateway.

LangChain Blog · Apr 7, 2026

Continual learning for AI agents

Continual learning for AI agents occurs at three layers: model, harness, and context, with context-layer evolution being the most practical and actionable.

LangChain Blog · Apr 6, 2026

How My Agents Self-Heal in Production

A LangChain engineer shares a complete pipeline for AI agents to automatically detect regressions, diagnose issues, and submit fix PRs after deployment, combining statistical methods with intelligent triage to reduce false positives.

LangChain Blog · Apr 4, 2026

Open Models have crossed a threshold

LangChain's evaluations show that open models like GLM-5 and MiniMax M2.7 now match closed frontier models on core agent tasks such as file operations and tool use, at a fraction of the cost and with lower latency.

LangChain Blog · Apr 3, 2026

March 2026: LangChain Newsletter

LangChain is pushing AI agents from experimental prototypes to manageable, collaborative, and securely deployable enterprise productivity tools through features like LangSmith Fleet, Skills, and Sandboxes.

LangChain Blog · Apr 2, 2026

Announcing the LangChain + MongoDB Partnership: The AI Agent Stack That Runs On The Database You Already Trust

LangChain and MongoDB's deep integration transforms Atlas into a unified AI agent backend for vector search, persistent memory, data querying, and observability, aiming to solve data architecture fragmentation from prototype to production.

LangChain Blog · Apr 1, 2026

Agent Evaluation Readiness Checklist

LangChain proposes a 6-point checklist before building agent evaluations, emphasizing manual analysis of 20-50 real failure traces before automating tests.

LangChain Blog · Mar 27, 2026

How we build evals for Deep Agents

LangChain shares its core philosophy for building AI agent evaluation systems: more evals aren't better; instead, precisely define and measure the agent behaviors you care about to guide its evolution.

LangChain Blog · Mar 26, 2026

AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms

Google DeepMind's AlphaEvolve is an AI coding agent that autonomously evolves and optimizes algorithms, discovering new knowledge in math and computing, and has already improved Google's data center efficiency.

Google DeepMind Blog

Introducing Claude Opus 4.7

Anthropic's Claude Opus 4.7 release focuses on enhanced reliability for complex, long-running tasks and self-verification capabilities, signaling a shift from AI as a tool to a trustworthy work partner.

Anthropic News

LlamaIndex Newsletter 5-19-26

LlamaIndex introduces ParseBench, the first OCR benchmark designed specifically for AI agents, alongside open-sourcing a local document parsing server and a secure sandboxed CLI agent, signaling a shift in document processing towards agent-native infrastructure.

LlamaIndex Blog

SIMA 2: An agent that plays, reasons, and learns with you

Google DeepMind's SIMA 2 integrates Gemini to evolve from an instruction-follower into an interactive companion that can reason, converse, and self-improve in 3D virtual worlds.

Google DeepMind Blog

Why Single-Pass Extraction Fails and What Deep Extraction Actually Solves

Single-pass extraction lacks a verification loop, leading to high error rates on complex real-world documents; deep extraction uses an agentic iterative verify-and-correct loop to boost critical field accuracy from demo-level to production-ready.

LlamaIndex Blog