Anthropic releases Agent paper again: Thinking like human engineers to solve the "long-range task" problem

Wallstreetcn
2025.11.27 05:50

Anthropic released a new article on engineering practices for long-running agents, proposing solutions to the challenges of long-horizon tasks. The article describes a two-part solution built on the Claude Agent SDK, an initializer agent and a coding agent, which mimics how human engineers work in order to cope with context window limitations and task complexity.

Anthropic releases another masterpiece on Agent engineering practice: Effective harnesses for long-running agents, highly recommended for everyone to read.

Previously, I introduced the Anthropic Agent article collection here:

  • The second half of Agents! More important than Prompt is "context engineering," Anthropic's first systematic elaboration.
  • Everything can be an Agent! Anthropic's official "three-step cycle method" teaches you how to create the strongest intelligent agent step by step.
  • Anthropic strikes again! A folder that trains Claude into a truly functional Agent.
  • Another masterpiece on Agent development from Anthropic, a new paradigm reduces Token consumption by 98.7%.

As the capabilities of AI Agents improve, developers are beginning to require them to undertake complex tasks that span hours or even days. However, how to maintain consistent progress across multiple context windows remains an unsolved challenge.

The core challenge faced by long-running Agents is that they must work in "sessions," and each new session starts like a new engineer with no past memory. Due to the limited context window and the complexity of projects that cannot be completed in a single window, Agents need a mechanism to bridge the gap between coding sessions.

The Anthropic engineering team developed a two-part solution for the Claude Agent SDK by observing how human engineers work: Initializer Agent and Coding Agent.

Core Challenge: Context Compression is Not Enough

The Claude Agent SDK is a general-purpose Agent framework with context management capabilities (such as compression), which theoretically should allow Agents to work indefinitely.

However, in practical tests (for example, asking the latest Opus 4.5 to build a clone of claude.ai), relying solely on context compression is insufficient. Claude primarily exhibits two failure modes:

  1. Trying to do everything at once: the Agent tends to take on too much in a single session, exhausting its context midway and leaving behind half-implemented features with no documentation. The next session's Agent must guess what happened previously and wastes significant time repairing the half-built foundation.

  2. Declaring completion too early: late in a project, a fresh Agent instance sees that some features already work and mistakenly concludes the entire job is done.

Solution: Dual Agent Architecture

Anthropic breaks down the problem and proposes a dual solution:

Initializer Agent: The first session uses a dedicated prompt to set up the environment. This includes generating an init.sh startup script, a claude-progress.txt file for recording progress, and an initial Git commit capturing the newly added files.

Coding Agent: Each subsequent session is dedicated to achieving incremental progress and leaving structured updates.

The key to this solution is enabling the Agent to quickly understand the work status when opening a new window—this is primarily achieved through the claude-progress.txt file and Git history.
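A minimal Python sketch of what the Initializer Agent's first session might produce. The file names init.sh, claude-progress.txt, and the feature list come from the article; the server command and the file contents here are illustrative placeholders, not Anthropic's actual prompts or scripts:

```python
import json
from pathlib import Path

def initialize_workspace(root: str) -> None:
    """Scaffold the environment a long-running agent needs to survive
    session boundaries (sketch; contents are placeholders)."""
    base = Path(root)
    base.mkdir(parents=True, exist_ok=True)

    # Startup script so every later session boots the dev server the same way.
    (base / "init.sh").write_text(
        "#!/bin/sh\n"
        "# Start the development server (placeholder command)\n"
        "npm run dev &\n"
    )

    # Progress file seeded with an initial entry for the next session to read.
    (base / "claude-progress.txt").write_text(
        "Session 0: environment initialized; no features implemented yet.\n"
    )

    # Feature list: every feature starts out marked as failing.
    features = [
        {
            "category": "functional",
            "description": "New chat button creates a fresh conversation",
            "passes": False,
        },
    ]
    (base / "feature_list.json").write_text(json.dumps(features, indent=2))
```

An initial Git commit of these files (not shown) would complete the setup described in the article.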

Three Pillars of Environment Management

To support this workflow, the environment setup includes the following key components:

1. Feature List

To keep the Agent from rushing ahead or stopping too early, the Initializer Agent is required to write a detailed document listing every feature requirement. For the claude.ai clone, this list contains more than 200 features.

These features are initially marked as "failing," providing a clear overview of the work for subsequent Agents.

Example JSON File:

{
    "category": "functional",
    "description": "New chat button creates a fresh conversation",
    "steps": [
      "Navigate to main interface",
      "Click the 'New Chat' button",
      "Verify a new conversation is created",
      "Check that chat area shows welcome state",
      "Verify conversation appears in sidebar"
    ],
    "passes": false
}

Experiments have shown that using JSON format is superior to Markdown, as the model is less likely to mistakenly alter or overwrite JSON files. Additionally, the prompts must contain strict instructions prohibiting the deletion or editing of tests, allowing only changes to the passes field status.
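The "only the passes field may change" rule can also be enforced mechanically, not just via prompt instructions. A small Python sketch of such a guard (the helper name and check are ours, not part of the SDK):

```python
def only_passes_changed(old: list[dict], new: list[dict]) -> bool:
    """Return True if the updated feature list differs from the original
    only in the boolean `passes` field. Descriptions, steps, and the
    number of entries must be untouched, so the agent cannot delete or
    water down tests to make them 'pass'."""
    if len(old) != len(new):
        return False
    for a, b in zip(old, new):
        # Compare everything except the pass/fail status.
        if {k: v for k, v in a.items() if k != "passes"} != \
           {k: v for k, v in b.items() if k != "passes"}:
            return False
    return True
```

A harness could run this check before accepting a session's edits to feature_list.json and reject the change otherwise.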

2. Incremental Progress

With the initial scaffolding in place, the Coding Agent is required to work on one feature at a time. To keep the environment clean, the Agent must:

  • Commit the code via Git after each change, with a descriptive message;

  • Write a summary in the progress file.

This allows the model to utilize Git to roll back erroneous code and restore to a working state, avoiding situations where subsequent Agents have to guess what their predecessors did.
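The end-of-feature routine could look like the following Python sketch: append a summary to claude-progress.txt and make a descriptive commit so the next session reads history instead of guessing. The function name and git invocation details are our assumptions; error handling is deliberately minimal:

```python
import shutil
import subprocess
from pathlib import Path

def record_progress(repo: str, summary: str) -> None:
    """After finishing one feature, log it in the progress file and
    commit all changes with the summary as the message (sketch;
    git is skipped if unavailable)."""
    base = Path(repo)
    with (base / "claude-progress.txt").open("a") as f:
        f.write(summary + "\n")
    if shutil.which("git"):
        subprocess.run(["git", "-C", str(base), "add", "-A"],
                       check=False, capture_output=True)
        subprocess.run(
            ["git", "-C", str(base),
             "-c", "user.name=agent", "-c", "user.email=agent@example.com",
             "commit", "-m", summary],
            check=False, capture_output=True,
        )
```

Because every working state is committed, a later session can `git revert` or check out the last good commit when a change breaks the build.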

3. End-to-End Testing

Another major failure mode for Claude is marking features as complete without proper testing. It often only performs unit tests or simple curl commands, neglecting end-to-end validation.

The solution is to explicitly prompt Claude to use browser automation tools (such as Puppeteer MCP server) to test like a human user. By allowing Claude to see screenshots, it can identify and fix non-obvious bugs in the code.
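The gating logic, stripped of the browser machinery, is simple: a feature may only be flipped to passing after the end-to-end check succeeds. In the article that check is browser automation (e.g. a Puppeteer MCP server driving the real UI and taking screenshots); in this Python sketch a callback stands in for it so the rule itself is visible:

```python
from typing import Callable

def verify_feature(feature: dict, e2e_check: Callable[[dict], bool]) -> dict:
    """Return a copy of the feature marked passing only if the
    end-to-end check succeeds; otherwise leave it failing. A unit
    test or a quick curl is NOT sufficient to flip the flag."""
    if e2e_check(feature):
        return {**feature, "passes": True}
    return feature
```

In a real harness, `e2e_check` would walk the feature's "steps" list against the running application through the browser.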

Getting Up to Speed

Based on the above architecture, each coding Agent is prompted to execute a series of standard steps at the beginning of a session:

  1. Run pwd to check the current working directory.

  2. Read the Git log and progress file to understand what has recently been completed.

  3. Read the feature list file and select the highest priority unfinished feature.

  4. Run init.sh to start the development server.

  5. Before implementing new features, run basic end-to-end tests to ensure the application is not in a broken state.

Typical Session Flow Example:

[Assistant] Let me first understand the current status of the project.

[Tool Use] <bash - pwd>

[Tool Use] <read - claude-progress.txt>

[Tool Use] <read - feature_list.json>

[Assistant] Checking the Git log...

[Tool Use] <bash - git log --oneline -20>

[Assistant] Checking for startup scripts and restarting the server...

[Assistant] Now verifying if basic functionalities are working...

[Assistant] Verification passed. Now I will check tests.json to decide what to do next.

Common Failure Modes and Fixes Summary

  • Attempting everything in one session → work on one feature per session, with a Git commit and progress-file update after each change.

  • Declaring completion prematurely → maintain a JSON feature list in which every feature starts as failing and only the passes field may change.

  • Marking features done without real testing → require end-to-end verification through browser automation before a feature may pass.

Conclusion

This work demonstrates a viable harness for long-running Agents, but some questions remain unresolved:

Single Agent vs. Multi-Agent: It is currently unclear whether a general-purpose coding Agent performs best or if a multi-Agent architecture (such as dedicated testing Agents, QA Agents, and code cleanup Agents) is superior.

Domain Generalization: This demonstration focuses on full-stack web development. Future directions involve extending these experiences to other long-range task areas such as scientific research or financial modeling.

Risk Warning and Disclaimer

The market carries risks, and investment requires caution. This article does not constitute personal investment advice and does not take into account the specific investment goals, financial situation, or needs of individual users. Users should consider whether any opinions, views, or conclusions in this article align with their specific circumstances. Any investment made on this basis is at the user's own risk.