Tag: AI Agents (28 articles)

The Open Agent Leaderboard

Hugging Face and IBM launch the Open Agent Leaderboard, shifting evaluation from standalone models to full agent systems (including tools, planning, memory), while measuring both performance and cost.

Hugging Face Blog · May 18, 2026

Codex CLI 0.128.0 adds /goal

OpenAI's Codex CLI introduces a /goal command that lets the coding agent loop automatically until the goal is met or the token budget is exhausted, signaling a shift from single-shot Q&A to persistent task execution.

Simon Willison · May 1, 2026

An update on recent Claude Code quality reports

The culprit behind Claude Code's quality decline over the past two months wasn't model degradation but three harness-level bugs, with a 'session state cleanup' glitch exposing hidden complexities in AI agent engineering.

Simon Willison · Apr 24, 2026

Gemma 4 VLA Demo on Jetson Orin Nano Super

An end-to-end multimodal agent demo running on NVIDIA Jetson Orin Nano Super, showcasing how the model autonomously decides when to use the camera and answers questions with visual context, signaling that powerful AI capabilities are reaching edge devices.

Hugging Face Blog · Apr 22, 2026

Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents

IBM and Hugging Face introduce the VAKRA benchmark, revealing that current AI agents perform poorly on complex multi-step tasks, with key failure modes including tool-chain planning, parameter passing, and error recovery.

Hugging Face Blog · Apr 15, 2026

Previewing Interrupt 2026: Agents at Enterprise Scale

LangChain's annual conference focuses on the challenges of scaling AI agents from production validation to enterprise-wide deployment, revealing how major companies build platforms, evaluate performance, and structure teams.

LangChain Blog · Apr 10, 2026

Human judgment in the agent improvement loop

LangChain argues that building reliable AI agents requires systematically integrating domain experts' tacit knowledge and judgment throughout the development lifecycle, rather than relying solely on the model's own capabilities.

LangChain Blog · Apr 9, 2026

Deep Agents v0.5

LangChain introduces async subagents for its Deep Agents framework, enabling parallel task delegation and removing blocking bottlenecks in agent workflows.

LangChain Blog · Apr 8, 2026

Arcade.dev tools now in LangSmith Fleet

LangChain partners with Arcade.dev to integrate over 7,500 agent-optimized tools into LangSmith Fleet, simplifying tool integration, authentication, and authorization through a single MCP gateway.

LangChain Blog · Apr 7, 2026

Continual learning for AI agents

Continual learning for AI agents occurs at three layers: model, harness, and context, with context-layer evolution being the most practical and actionable.

LangChain Blog · Apr 6, 2026

How My Agents Self-Heal in Production

A LangChain engineer shares a complete pipeline for AI agents to automatically detect regressions, diagnose issues, and submit fix PRs after deployment, combining statistical methods with intelligent triage to reduce false positives.

LangChain Blog · Apr 4, 2026

Open Models have crossed a threshold

LangChain's evaluations show that open models like GLM-5 and MiniMax M2.7 now match closed frontier models on core agent tasks such as file operations and tool use, at a fraction of the cost and with lower latency.

LangChain Blog · Apr 3, 2026

March 2026: LangChain Newsletter

LangChain is pushing AI agents from experimental prototypes to manageable, collaborative, and securely deployable enterprise productivity tools through features like LangSmith Fleet, Skills, and Sandboxes.

LangChain Blog · Apr 2, 2026

Agent Evaluation Readiness Checklist

LangChain proposes a 6-point checklist before building agent evaluations, emphasizing manual analysis of 20-50 real failure traces before automating tests.

LangChain Blog · Mar 27, 2026

How we build evals for Deep Agents

LangChain shares its core philosophy for building AI agent evaluation systems: more evals aren't better; instead, precisely define and measure the agent behaviors you care about to guide its evolution.

LangChain Blog · Mar 26, 2026

Introducing Claude Opus 4.7

Anthropic's Claude Opus 4.7 release focuses on enhanced reliability for complex, long-running tasks and self-verification capabilities, signaling a shift from AI as a tool to a trustworthy work partner.

Anthropic News

LlamaIndex Newsletter 5-19-26

LlamaIndex introduces ParseBench, the first OCR benchmark designed specifically for AI agents, alongside open-sourcing a local document parsing server and a secure sandboxed CLI agent, signaling a shift in document processing towards agent-native infrastructure.

LlamaIndex Blog
BitByAI — AI-powered, AI-evolved AI News