OpenAI launches "PaperBench" test to prove the strongest AI agent has not surpassed humans

Stheadline
2025.04.03 02:41

OpenAI launched a new benchmark test, "PaperBench," yesterday, aimed at assessing the ability of AI Agents to replicate top AI research. PaperBench requires AI Agents to replicate 20 papers from the ICML 2024 conference from scratch, and even the most advanced models did not surpass the human baseline: the best-performing AI Agent achieved a replication score of only 21%. OpenAI has open-sourced the relevant code to facilitate research on the engineering capabilities of AI Agents.

OpenAI announced yesterday (the 2nd) the launch of a new benchmark test called "PaperBench," aimed at assessing the ability of AI Agents to replicate top AI research. The results indicate that even the most advanced models have not yet surpassed the human baseline.

PaperBench requires AI Agents to replicate 20 Spotlight and Oral papers presented at the ICML 2024 conference from scratch, including understanding the core contributions of the papers, independently developing codebases, and successfully executing related experiments. To ensure a fair and objective assessment, the research team designed a hierarchical scoring system that breaks down each replication task into 8,316 independently scoreable subtasks.
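Such a hierarchical scoring system can be pictured as a weighted tree: leaf requirements are graded pass/fail, and each internal node aggregates its children's scores as a weighted average. The sketch below is illustrative only; the node names, weights, and aggregation rule are assumptions for this example, not PaperBench's actual rubric.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RubricNode:
    """One node of a hierarchical grading rubric (names/weights invented here).

    Leaves carry a binary pass/fail verdict; internal nodes aggregate
    their children's scores as a weighted average.
    """
    name: str
    weight: float = 1.0
    passed: Optional[bool] = None                 # set on leaves only
    children: List["RubricNode"] = field(default_factory=list)

    def score(self) -> float:
        if not self.children:                     # leaf: graded directly
            return 1.0 if self.passed else 0.0
        total = sum(c.weight for c in self.children)
        return sum(c.weight * c.score() for c in self.children) / total

# Toy rubric: one paper's replication split into code development and
# experiment execution, each with pass/fail sub-requirements.
rubric = RubricNode("paper", children=[
    RubricNode("codebase", weight=2.0, children=[
        RubricNode("model implemented", passed=True),
        RubricNode("training loop runs", passed=False),
    ]),
    RubricNode("experiments", weight=1.0, children=[
        RubricNode("main result reproduced", passed=False),
    ]),
])
print(round(rubric.score(), 3))  # → 0.333
```

Decomposing each replication into thousands of such leaves is what lets partial credit accumulate into a single percentage score per paper.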

OpenAI stated that all scoring criteria were developed in collaboration with the original paper authors to ensure the accuracy and practicality of the evaluation. The team also developed a judgment system based on large language models that can automatically score the AI Agent's replication attempts.

The test results show that the best-performing AI Agent, Anthropic's Claude 3.5 Sonnet (new version), achieved an average replication score of only 21%. The research team also invited top machine learning PhD students to complete the same test; the comparison indicates that AI models have not yet surpassed human experts in research replication. OpenAI has now open-sourced the relevant code to promote further industry research on the engineering capabilities of AI Agents.