
After more than 400 days, have domestic large models completely surpassed GPT-4?

Pay even a little attention to recent news and you'll notice that "catching up with GPT-4" has become the new hot topic for domestic large models.
Baidu's ERNIE Bot, SenseTime's SenseNova, and Alibaba Cloud's recently released Tongyi Qianwen 2.5 have all joined the "comprehensive catch-up with GPT-4" camp.
Extending the timeline a bit further, over the past half year, news of "surpassing GPT-4" has emerged one after another. Even when reports carefully added qualifiers like "on multiple benchmarks" or "on some metrics," the claims still drew plenty of attention and became a go-to way for domestic large models to prove their capabilities.
To briefly recap, domestic large models have been chasing GPT-4 for over 400 days, and the "catch-up process" can be roughly divided into three stages.
Stage 1: Partial Performance Surpassing GPT-4
On March 14, 2023, OpenAI officially launched GPT-4. At that time, most domestic large models were not yet open to the public, and the few in beta testing were still being compared to GPT-3. As an industry benchmark, GPT-4 was like science fiction becoming reality, elevated to a godlike status by many.
But just half a year later, GPT-4 appeared on the comparison lists of domestic large model developers.
At the end of August 2023, SenseTime announced a new breakthrough: its "InternLM" model with 123 billion parameters ranked second globally on a collection of 300,000 questions across 51 well-known evaluation sets. It also ranked first in ten evaluations covering comprehensive exams (AGIEval), knowledge Q&A (CommonsenseQA), reading comprehension, and reasoning, surpassing the then-dominant GPT-4.
On October 17, 2023, at the "Generating the Future" launch event, Baidu officially released ERNIE Bot 4.0. Robin Li demonstrated the model's four core capabilities—understanding, generation, logic, and memory—along with their application scenarios. Although no evaluation data was provided, Li confidently stated that ERNIE Bot 4.0's overall performance was "on par with GPT-4."
The curtain had officially risen on domestic large models' pursuit of GPT-4. In the following months, many large models adopted the same marketing angle: overall capabilities no worse than GPT-3.5, with certain performance metrics already surpassing GPT-4.
Stage 2: Overall Performance Approaching GPT-4
By early 2024, the domestic "hundred-model war" had entered a consolidation phase. Large models that failed to win favor with the capital market gradually faded into mere statistics, leaving only a handful of tech giants and unicorns active on the front lines. The survivors had to prove their capabilities.
"Overall performance approaching GPT-4" became the new marketing slogan.
At Zhipu AI's Technology Open Day in mid-January 2024, the company officially released its new-generation foundation model GLM-4. According to Zhipu AI, on authoritative English benchmarks GLM-4 approached GPT-4 overall, averaging more than 90% of GPT-4's performance and matching it on individual tasks. On Chinese tasks, which domestic companies value more, GLM-4 comprehensively surpassed GPT-4.
Also in January 2024, iFlytek released Spark Cognitive Model V3.5, with significant improvements in core capabilities such as logical reasoning, language understanding, text generation, math problem-solving, coding, and multimodal understanding. Its language understanding and math abilities had surpassed GPT-4 Turbo, its coding reached 96% of GPT-4 Turbo, and its multimodal understanding hit 91% of GPT-4V. "In Chinese understanding, it is even far ahead."
In hindsight, Zhipu AI's and iFlytek's marketing was relatively "conservative." Baichuan Intelligence's Baichuan 3, released around the same time, claimed to have surpassed GPT-4 on Chinese evaluations such as CMMLU and GAOKAO.
Stage 3: Comprehensive Surpassing of GPT-4 Turbo
At OpenAI's first developer conference in November 2023, GPT-4 Turbo was the centerpiece. It was not only smarter than GPT-4 but also offered higher text-processing limits, faster inference, and lower prices. Domestic large models soon had a new benchmark to chase.
First came SenseNova 5.0, released in April 2024 with 600 billion parameters. At its launch, SenseTime cited OpenCompass evaluation data showing that the model had matched or surpassed GPT-4 Turbo and outclassed the concurrently released Llama 3-70B in almost every respect.
Then came Alibaba Cloud's recently released Tongyi Qianwen 2.5. According to media reports, its performance comprehensively overtook GPT-4 Turbo, making it the "strongest" Chinese large model. Its open-source 110-billion-parameter model topped multiple benchmark tests, surpassing Meta's Llama-3-70B to become the most powerful open-source model.
It is safe to say that SenseNova 5.0 and Tongyi Qianwen 2.5 are just the beginning, with more domestic large models set to surpass GPT-4 Turbo in capability.
After all, iFlytek has already teased the release of Spark Cognitive Model V4.0 in the first half of the year, which will benchmark directly against the GPT-4 series. ERNIE Bot 4.0 has been out for over half a year, and a new version is likely in the works, one that will most probably take performance up another notch...
What’s the Point of "Benchmarking"?
Whether it's the early "partial performance surpassing" or the ongoing "comprehensive overtaking," the claims rest on third-party evaluation results or the developers' own subjective judgments. For example, OpenCompass, frequently cited by SenseTime and Alibaba Cloud, is an open-source large model evaluation platform from Shanghai AI Lab.
Regarding some large models' obsession with gaming the leaderboards, Professor Lin Dahua, a leading scientist at Shanghai AI Lab, said bluntly in a media interview: inflating scores by drilling on massive question banks distorts how a model's actual capabilities are reflected, misleading R&D teams' improvement directions and hurting commercial deployment. "High scores but low ability" harms the institutions themselves. Any particular spot on a leaderboard is just one of countless tests in a model's development, and a temporary ranking does not truly reflect its capabilities.
Moreover, many large model test sets are open and transparent, with their questions or outlines publicly available, so it is not hard for developers to boost scores through "targeted training." Feed the model enough test questions, and it won't score too low on what amounts to an open-book exam.
In other words, high scores do not necessarily mean strong capabilities. The point of "benchmarking" is merely to give clients or developers a preliminary sense of a model's abilities. The decisive factors are always "can it solve problems" and "can it deliver tangible productivity in real scenarios."
Especially as large models move toward real-world applications, relentlessly hyping "surpassing GPT-4" or "topping the benchmarks" while ignoring practical results may backfire. Take financial report analysis, a common large model application: if a model can't even make sense of a company's financial reports, no benchmark score will convince clients; it will simply be struck from their partner shortlists.
According to research reports from CITIC Securities and others, OpenAI's GPT-5 is currently in red-team testing and is expected to launch this summer, potentially achieving breakthroughs in multimodal understanding, long-text input, and zero-shot learning, with performance far exceeding GPT-4. Even after spending 400-plus days catching up to GPT-4, domestic models will likely remain in catch-up mode for a long time to come.
The value of large models lies in being productivity tools that solve everyday problems. Each stage of the catch-up with GPT-4 can be read as a marker of domestic models' orderly iteration and of a narrowing gap. But as with smartphone benchmark scores, over-marketing risks turning them into objects of collective ridicule.
