AI forecasting authority: "I still underestimated the pace of AI," and "automated AI R&D" by the end of this year is genuinely possible

Wallstreetcn
2026.03.10 01:07

AI iteration is outrunning the forecasts. Shocked by the performance of Claude Opus 4.6, respected forecasting researcher Ajeya Cotra admitted that her predictions for AI progress in 2026 are already obsolete. She now puts the probability of "automated AI R&D" arriving by the end of this year at 10%, stating, "This is the first time I can't find any solid trends to extrapolate that would assert this won't happen soon."

The rapid advancement of artificial intelligence capabilities is catching even the most rigorous forecasters off guard.

Renowned AI forecasting researcher Ajeya Cotra recently acknowledged publicly that her forecast for AI progress in 2026, released just two months ago, has already proven significantly too conservative. The self-correction was triggered by the performance of Anthropic's latest model, Claude Opus 4.6, on benchmarks run by the evaluation organization METR: the model's software-engineering "time span" has reached roughly 12 hours, against Cotra's forecast of about 24 hours by the end of 2026. Reaching that level this early means actual progress in software engineering is running nearly ten months ahead of her projected trajectory.

Even more strikingly, Cotra reaffirmed her probability assessment for "full automation of AI research and development." She maintains a 10% probability that, by the end of this year, AI will take over the conception and implementation of research entirely without human intervention, stating plainly: "This is the first time I can't find any solid trends to extrapolate that would assert this won't happen soon." The statement has drawn wide attention in the AI forecasting community.

Cotra previously led AI safety research funding at Coefficient Giving, one of the largest AI safety funders globally, and now works at METR, an organization focused on evaluating AI capabilities.

Prediction Failure: Judgments from Two Months Ago Are Outdated

On January 14 of this year, Cotra predicted that the time span at which the most advanced models achieve a 50% success rate would reach approximately 24 hours by the end of 2026, extrapolating from a 2019-2025 trend in which the time span doubled slightly less than twice a year; her 80th-percentile estimate was 40 hours.
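The arithmetic behind such a forecast is a simple exponential extrapolation. The sketch below illustrates it; the starting horizon, dates, and ~7-month doubling time are illustrative assumptions, not METR's actual fitted parameters.

```python
from datetime import date

def extrapolate_horizon(h0_hours: float, t0: date, t: date,
                        doubling_months: float) -> float:
    """Project a time span forward assuming steady exponential growth:
    horizon(t) = h0 * 2^(elapsed_months / doubling_months)."""
    elapsed_months = (t.year - t0.year) * 12 + (t.month - t0.month)
    return h0_hours * 2 ** (elapsed_months / doubling_months)

# Hypothetical inputs: a 12-hour time span measured in March 2026,
# doubling every ~7 months (roughly "less than twice a year").
h_end_2026 = extrapolate_horizon(12, date(2026, 3, 1), date(2026, 12, 31), 7)
```

Under these assumed numbers the projection lands near 29 hours by year's end, already past the 24-hour mark, which is why an early 12-hour result pulls the whole forecast forward.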

However, only about two months after she published the forecast, Opus 4.6 was measured at a time span of approximately 12 hours. On the METR test set, among 19 software engineering tasks estimated to require more than 8 hours of human time, Opus 4.6 at least partially completed 14 and reliably solved 4. Cotra conceded that, with a full ten months of progress still to come, the claim that AI agents will still be failing half the time on 24-hour tasks at year's end "is no longer credible."

It is worth noting that Cotra also pointed out that uncertainty in current time span estimates has grown substantially: Opus 4.6's 95% confidence interval spans 5.3 to 66 hours, owing partly to the scarcity of long tasks, the difficulty of estimating human completion times, and the benchmark itself nearing saturation.

Capability Boundaries: Traditional Assessment Frameworks Are Failing

As AI agents' capabilities approach or even exceed the task scale of dozens of hours, Cotra believes that the applicability of the concept of "time span" itself is being challenged.

She noted that tasks become markedly more decomposable as they scale: a one-hour debugging task is nearly impossible to split and parallelize; a day-long development task can barely be divided and has fuzzy boundaries; but projects spanning a month or several months naturally break down into multiple parallel sub-tasks. Once AI agents can reliably complete tasks on the scale of 80 hours, they could in principle push forward projects of any scale, with a "management-layer" AI assigning tasks and "execution-layer" AIs working in parallel.

Tom, a colleague of Cotra's, proposed that the calendar time a large team needs to complete a task, rather than individual person-hours, is a better measure of "intrinsic difficulty." Cotra believes that as AI enters this new scale, the "individual time" metric may begin to grow exponentially, making it extremely difficult to estimate the ceiling of software engineering capability by the end of the year.

She also acknowledges that this large-scale task decomposition will not operate perfectly in practice—project participants' intuitive grasp of the overall context is difficult to fully replace with Jira tickets or Asana tasks. However, she believes that for a significant category of software projects, this model "may be surprisingly effective."

Key Point: AI Research and Development Automation May Become a Reality This Year

Among all predictions, the most attention-grabbing is Cotra's probability assessment of "full automation of AI research and development."

She defines the milestone as AI systems fully taking over the conception and implementation of research without human involvement. In her January forecast she assigned it a 10% probability, and several peers in the AI forecasting community told her the figure was too high. After Opus 4.6's results came out, however, she said 10% "once again feels within a reasonable range."

Cotra remains cautious. She points out that fully automating AI research and development requires not only software engineering capability but also breakthroughs in "research judgment" and "creativity," precisely the areas where current AI systems still lag far behind human researchers. She believes the goal is far more likely to be reached within the next three to five years than within this year.

However, her wording has undergone a fundamental shift: "This is the first time I can't find any solid trends to extrapolate that would assert it won't happen soon."