Altman's failure in expectation management: GPT-5 falls short of AGI

Wallstreetcn
2025.08.09 04:00

On August 8, OpenAI released the GPT-5 model in multiple versions, touting it as "the smartest, fastest, and most practical." Despite limited free access for consumer users and sharply reduced API prices, experts believe GPT-5's actual performance has fallen short of expectations, with modest progress and low marginal gains. Liu Guang noted that although data processing has been optimized, reasoning ability still has shortcomings and has not met users' high expectations.

Amid expectations of "superhuman capabilities," GPT-5 arrived in August.

In the early hours of August 8, OpenAI officially released the GPT-5 model, which includes four versions: GPT-5, GPT-5 mini, GPT-5 nano, and GPT-5 Pro.

OpenAI described the new model as "the smartest, fastest, and most practical." If one more label had to be added, the industry consensus would be "affordable." First, it is offered to end users with limited free access; second, API prices saw a "significant reduction": $1.25 per million input tokens and $10 per million output tokens.
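To make that pricing concrete, here is a back-of-the-envelope sketch in Python. Only the per-token rates come from the announcement; the workload numbers are hypothetical:

```python
# Back-of-the-envelope cost at the published GPT-5 API rates:
# $1.25 per million input tokens, $10 per million output tokens.
INPUT_RATE = 1.25 / 1_000_000   # dollars per input token
OUTPUT_RATE = 10.0 / 1_000_000  # dollars per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single API call at the quoted rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Hypothetical workload: a 2,000-token prompt with a 500-token reply.
print(f"${request_cost(2_000, 500):.4f} per call")  # $0.0075
```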

So, aside from cementing Altman's reputation as the industry's "price butcher," does OpenAI's latest release meet expectations?

"Quantitative change rather than qualitative change," "Cheaper OpenAI and a return to open source," "A better balance between safety and usability."

On August 8, during a live interpretation of the path to AGI hosted by Tencent Technology, the Zhiyuan Community, and the Tsinghua University Institute for International Governance of Artificial Intelligence, the guests offered their assessments: Liu Guang, head of the data research group at the Zhiyuan Institute; Xiong Yuxuan, assistant professor at Central China Normal University; and Zhang Hui, young scholar from the University of Science and Technology of China and young scientist at the Tsinghua University Institute for International Governance of Artificial Intelligence.

"Expectations were too high (previously); although there has been improvement (in reasoning and tool usage), it did not meet expectations, which is 'an expected progress,'" Liu Guang said. Zhang Hui also gave a similar evaluation, stating, "The progress of GPT-5 is not 'stunning,' the marginal efficiency is low, and it has not met the high expectations generated by long-term groundwork."

In Liu Guang's view, the extensive use of synthetic data, the establishment of a data grading and classification system, and the development of a general data-quality assessment model have somewhat alleviated the scarcity of high-quality data. But as an interim model for OpenAI, GPT-5 also has shortcomings, particularly around reasoning.

"We are neither clear about the true source of reasoning capabilities nor can we determine what constitutes truly effective reasoning forms," Liu Guang said.

Before the launch of GPT-5, Anthropic and Google made moves in quick succession, launching Claude Opus 4.1 and the Genie 3 world model; the former in particular publicly hinted at upcoming releases, pointing to a "hot war" among base models in August.

"OpenAI had to launch (GPT-5) under pressure," Xiong Yuxuan said.

In Xiong Yuxuan's view, competitors both domestic and international, such as Anthropic, Google, DeepSeek, and Kimi, are pushing OpenAI to release new products. Meanwhile, external concerns about safety are gradually easing, and the "multi-model routing (Router)" also leans towards more commercial considerations. "All of these are pushing OpenAI towards 'cheaper and more open,' which is a good thing for the industry." Regarding the new multi-model routing capability, Xiong Yuxuan defines it as an extension of the early MoE and not a disruptive technological breakthrough. "Sometimes it feels like driving a sports car, and sometimes like using an off-road vehicle, flexibly adjusting according to the task."

Zhang Hui has long been engaged in security research, and like Xiong Yuxuan, he focuses on security, especially the "safe completion" capability of GPT-5.

"It achieves a good balance between safety and usability," Zhang Hui said, particularly regarding GPT-5's shift from "refusal to answer" to "safe completion." Zhang Hui considers this a positive signal. "In the past, strong refusals could harm normal usage, while 'safe completion' has achieved a dual enhancement of safety and performance through dynamic thresholds and user intent classification, proving that safety and innovation can coexist."

The following is the full transcript of the live broadcast (adjusted and shortened without changing the original meaning):

Tencent Technology: First, let's do a quick fill-in-the-blank. GPT-3 brought the Scaling Law, GPT-3.5 brought RLHF, GPT-4 brought MoE, GPT-4o brought multimodality, and o1 opened a new paradigm for reasoning. So what does GPT-5 bring? Answer with one or two keywords, no more than two.

Xiong Yuxuan: The launch of GPT-5 is more like a shift in OpenAI's business strategy: cheaper and more open. Coupled with the previous GPT-OSS, it can be seen as a return to and embrace of the open-source community.

Zhang Hui: I think its safe completion mechanism is very noteworthy and has inspired me a lot—finding a relatively good balance between safety and usability.

Liu Guang: There are incremental improvements across the board, but nothing beyond expectations. It's a quantitative change rather than a qualitative change.

Tencent Technology: Altman laid a great deal of groundwork for GPT-5. So what does this model really mean for OpenAI?

Liu Guang: GPT-4o has been out for some time, and everyone has been looking forward to GPT-5. Reportedly there were multiple rounds of internal "horse races," with many candidate versions eliminated, mainly because data issues and unexpected incidents during training left performance below expectations. One key obstacle was the "data wall": there was almost no incremental high-quality data. The team later alleviated this through synthetic data and other means.

Judging from the final version, GPT-5 shows significant improvements in reasoning and tool invocation. The technical report and system card mention that training used not only scholar-contributed data but also model-generated data; in data processing, sources were graded and classified to distinguish credible from non-credible ones, supplemented by joint human-and-model screening. After this series of data work, GPT-5's gains in safety and reasoning were to be expected.

However, because the R&D cycle was prolonged and external expectations were high, the post-release effect felt more like a "reasonable outcome."

Zhang Hui: I agree with Teacher Liu's view - GPT-5 is indeed not impressive enough.

After Altman spent one or two years building up anticipation, users' expectations were raised very high, and objectively speaking, GPT-5 has not met them. That said, from my own sporadic tests and experience, there are still aspects worth learning from and referencing, including its failures.

For this product, falling short of expectations was itself to be expected. It is particularly noteworthy that Altman's own statements have shifted, from the initial "general artificial intelligence has been achieved" to the later "it has not been achieved yet." That change in rhetoric itself reveals the gap between external expectations and actual progress.

Xiong Yuxuan: I believe that GPT-5 is more like a signal, marking a new turning point in the industry. It may not be a disruptive technological innovation, but from model architecture to data collection, it reflects that the development of large models has entered a new stage.

Looking back at the release of GPT-4, safety issues raised widespread concern, and some even called for a pause in developing GPT-5, to which Altman responded that it would not be released in the short term. Years have passed, and GPT-5 has still arrived, which indicates at least two things:

First, the industry is changing rapidly and competition is fierce, forcing OpenAI to accelerate its progress;

Second, from a safety perspective, many previous concerns have proven not to pose as direct a threat to humanity as imagined, and there are already various feasible control measures.

For OpenAI, GPT-5 is more about commercial layout. For example, the new version of model routing (Router) is essentially just an extension of the early MoE - sometimes like driving a sports car, sometimes like using an off-road vehicle, flexibly allocating based on tasks, rather than a disruptive technological breakthrough.

But the result it brings is that the model is cheaper, more open, and easier to use. This is not only beneficial for OpenAI itself but will also drive the entire industry towards a more usable and widespread direction.

Tencent Technology: Plus users can only see GPT-5 in their accounts; the other historical models are gone. Does OpenAI plan to fold everything into one unified model going forward?

Zhang Hui: I think there may be multiple reasons. Intuitively, the first is that it simplifies things from the user's perspective - amid fierce commercial competition, with competitors also simplifying their settings, this improves usability. Second, from a brand-management perspective, OpenAI's brand effect is currently strong, and folding the product line into GPT-5 helps form a unified brand image, which is sound management.

Finally, we cannot rule out the factor of technological innovation - for example, the introduction of Router, what kind of innovation does it belong to? How does it automatically identify user prompts and intentions? Whether there are deeper technological breakthroughs hidden within still needs further observation.

Tencent Technology: What do you think of this multi-model routing architecture? How significant an innovation is the ability to freely choose which model to call based on the task?

Xiong Yuxuan: In fact, there has always been a division of labor between models, reflected in functionality and scale. This idea is not new; for example, Professor Huang Gao from Tsinghua University proposed "dynamic neural networks" a long time ago - dynamically selecting network structures based on task difficulty. In the era of large models, the concept remains the same.

GPT-5 includes both reasoning models that require thinking and models that do not; the former consumes more resources and has longer running times. This approach can save users money while also reducing operational load, thereby providing a better overall experience. The so-called "unified large model" is more of a conceptual packaging, with both technical considerations and strategic factors in commercial operations behind it.

Liu Guang: The concept of routing has long been studied in academia and practiced in industry; for example, 360 recently attempted a routing mechanism with multiple agents or models. The difference with OpenAI may be that it performs better in terms of effectiveness, experience, and engineering optimization.

From an innovation perspective, it is more about taking existing ideas to the extreme—just like in the GPT-3 era, where simply scaling up significantly enhanced capabilities. In this direction, GPT-5 represents an extreme optimization at the engineering level, resulting in a better experience and a certain enhancement in capabilities, which is a form of engineering innovation.
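To make the "sports car versus off-road vehicle" analogy concrete, here is a minimal routing sketch. The difficulty heuristic, threshold, and model names are illustrative assumptions, not OpenAI's actual Router mechanism:

```python
# Minimal model-routing sketch: dispatch to a cheap fast model or an
# expensive reasoning model based on an estimated task difficulty.
# The heuristic and the model names are placeholders, not real router logic.

def estimate_difficulty(prompt: str) -> float:
    """Toy difficulty score in [0, 1] based on surface cues."""
    cues = ["prove", "step by step", "debug", "why", "calculate"]
    score = sum(cue in prompt.lower() for cue in cues) / len(cues)
    return min(1.0, score + len(prompt) / 4000)  # longer prompts skew harder

def route(prompt: str, threshold: float = 0.3) -> str:
    """Pick a backend; reasoning models cost more and run longer."""
    if estimate_difficulty(prompt) >= threshold:
        return "reasoning-model"   # slower, pricier, thinks before answering
    return "fast-model"            # low latency for routine queries

print(route("What's the capital of France?"))                     # fast-model
print(route("Prove that sqrt(2) is irrational, step by step."))   # reasoning-model
```

The design question the panelists raise is exactly where this sketch cheats: a real router must learn when reasoning is needed, rather than relying on surface cues.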

Tencent Technology: GPT-5 gets "9.9 - 9.11" wrong without thinking mode activated, and right once it is activated. Why? Is the technical difficulty here still significant?

Liu Guang: The difficulty lies in determining when reasoning is needed and when it is not. Humans instinctively judge and use tools when necessary; models lack prior knowledge and are not good at precise numerical calculations. From this perspective, it is not surprising that it calculates incorrectly. In theory, it can call built-in tools (like Python), but how to automatically invoke the appropriate tool at the right moment and whether to trigger reasoning remains a bottleneck.
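A minimal sketch of the tool-invocation pattern Liu Guang describes: route numeric expressions to an exact evaluator instead of letting the model "guess" at arithmetic token by token. The detection regex here is a stand-in for the much harder real problem of deciding when a tool call is needed:

```python
# Sketch: hand arithmetic to an exact tool rather than the model's
# token-level intuition. Decimal avoids binary float artifacts.
import re
from decimal import Decimal

def try_arithmetic_tool(prompt: str):
    """If the prompt is a simple subtraction, compute it exactly."""
    match = re.fullmatch(r"\s*([\d.]+)\s*-\s*([\d.]+)\s*", prompt)
    if match:
        a, b = Decimal(match.group(1)), Decimal(match.group(2))
        return a - b            # exact decimal arithmetic
    return None                 # fall through to the language model

print(try_arithmetic_tool("9.9 - 9.11"))  # 0.79, exact
```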

Tencent Technology: Tasks like "elementary school math problems" or "counting fingers in a picture," which humans solve at a glance, remain unsolved. Is this a flaw "written into the genes" of large models?

Zhang Hui: From a safety governance perspective, this is more like an endogenous risk. The model lacks human prior knowledge and common sense about the world, leading to insufficient understanding of "when reasoning is needed." It mainly learns through a "drilling" method, and it may rarely encounter boundary cases like "9.9 - 9.11" in the data.

As for whether clicking the thinking button triggers architectural innovation or simply activates external tools, the significance of the two is very different. But fundamentally, this is still an endogenous issue: lacking prior knowledge makes it difficult to adaptively judge whether reasoning is needed and when to call tools.

Tencent Technology: In complex fields like AI for healthcare or for science, to what extent can the capability evolution of GPT-5 provide assistance? Has it unlocked new application scenarios after its release?

Xiong Yuxuan: OpenAI claims that its performance in fields like healthcare and economic data is very good, reaching SOTA levels. However, the core issue in these fields is whether the results are trustworthy. In the past, we discussed the hallucination problem, which can be alleviated by introducing external knowledge through RAG. GPT-5 indeed has its own methods for data construction, cleaning, and quality enhancement, but this is more about engineering optimization. As for reasoning, we certainly hope it can truly play a role, but how far it can go remains to be verified. Whether the current CoT (Chain of Thought) is true reasoning is also debated in academia. Especially in the medical field, where life safety is at stake, its development should be viewed with a cautious yet optimistic attitude.
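For reference, the RAG pattern mentioned above can be sketched in a few lines: ground the answer in retrieved text rather than the model's parametric memory alone. The documents and the deliberately naive lexical-overlap retrieval are illustrative assumptions:

```python
# Minimal RAG sketch: pick the most relevant snippet from a small
# knowledge base and prepend it to the prompt, so the answer is grounded
# in external text rather than parametric memory alone.
def retrieve(query: str, docs: list[str]) -> str:
    """Naive word-overlap retrieval; real systems use vector search."""
    q_words = set(query.lower().split())
    return max(docs, key=lambda d: len(q_words & set(d.lower().split())))

docs = [
    "Metformin is a first-line medication for type 2 diabetes.",
    "The 2024 CPI rose 2.9 percent year over year.",
]
query = "What is a first-line medication for type 2 diabetes?"
context = retrieve(query, docs)
prompt = f"Answer using only this source:\n{context}\n\nQuestion: {query}"
print(prompt)
```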

Tencent Technology: So OpenAI has not really solved the question of when to let the model reason and what constitutes the most appropriate reasoning?

Xiong Yuxuan: It has indeed raised this question and claims that the current model Router is addressing this issue. However, how effective it is still needs to be observed for a while.

Liu Guang: Following Professor Xiong's point, academia is still exploring what constitutes true large model reasoning. Is it explicit CoT (Chain of Thought) or implicit reasoning? What should the form of reasoning look like? More critically, we actually do not understand how reasoning ability is generated and how to enhance it. This is a significant mystery.

For example, Bengio's article points out that the current CoT is often not faithful. If the reasoning process and the final answer are inconsistent, then the monitoring based on the reasoning chain becomes ineffective. Without resolving such fundamental issues, it is difficult to pave the way for further development.

From a data perspective, does constructing a large amount of reasoning data enhance reasoning ability? This raises a new question: is this ability naturally emergent or acquired through reinforcement learning? The answer is uncertain. Therefore, many functions proposed by OpenAI may still lack sufficient explanation regarding their underlying mechanisms.

Tencent Technology: What innovations does GPT-5 bring in the field of data? How does it address the issue of insufficient high-quality data?

Liu Guang: In the past, everyone said they had hit a "data wall": high-quality data was basically exhausted. One feasible path is synthetic data; when the model is strong enough, it can generate data close to human output. I believe GPT-5 likely employs this method extensively. Many domestic players, such as DeepSeek and Qwen, are also introducing synthetic data in pre-training, which has become an industry-recognized approach.

Another interesting example is the IMO (International Mathematical Olympiad) gold-medal episode. OpenAI announced the result ahead of schedule, crediting a set of general reinforcement learning methods and a general-purpose reasoning model. The challenge with such models lies in scoring questions that lack standard answers; if that mechanism is solved, it can also be used to filter and evaluate data.

From the technical report of GPT-5, they categorized information sources and scored them using statistical methods and models, which is closely related to the design of the reward model. Some believe this may be a technical legacy left by the previous "super alignment" team.
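A minimal sketch of the kind of source-grading pipeline the technical report gestures at: assign each document a trust tier and a quality score, and gate inclusion in the training pool on their combination. The tiers, the stub scorer, and the threshold are illustrative assumptions, not OpenAI's actual pipeline:

```python
# Sketch of a data-grading filter: source tier plus a (stub) quality
# model jointly decide whether a document enters the training pool.
SOURCE_TRUST = {"curated": 1.0, "web": 0.6, "synthetic": 0.5}

def quality_score(text: str) -> float:
    """Stub for a learned quality model; here, a crude length proxy."""
    return min(1.0, len(text.split()) / 50)

def keep(doc: dict, threshold: float = 0.4) -> bool:
    """Combine source trust and content quality into one gate."""
    trust = SOURCE_TRUST.get(doc["source"], 0.3)  # unknown sources rank low
    return trust * quality_score(doc["text"]) >= threshold

corpus = [
    {"source": "curated", "text": "A carefully edited encyclopedia entry " * 10},
    {"source": "web", "text": "buy now!!!"},
]
print([keep(d) for d in corpus])  # [True, False]
```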

Zhang Hui: From a safety perspective, ensuring the integrity of data sources is essential. One major highlight of GPT-5 is its proposal of an output-centered safety strategy. It sounds like common sense, but it is worth singling out because it involves recognizing user intent at the input stage. Perhaps the model initially had some deviations here, but it has now returned to the correct path. This is enlightening for me: discussing content safety from the output perspective is indeed correct. As for real-world challenges such as data sources and hallucinations, they can all be addressed technically. If the intent recognition GPT-5 emphasizes can accurately classify user intentions, it will allow for more diverse safety strategies.

However, OpenAI also acknowledges that even with intent classification, harmful outputs will still occur, which is unavoidable. I think this acknowledgment itself reflects a pragmatic attitude.

Tencent Technology: Even if the capability leap of GPT-5 is below expectations this time, can it bring more users and API revenue from a business perspective?

Xiong Yuxuan: There should still be growth globally. After all, OpenAI is the industry leader, and whenever a new product is released, there will always be people willing to try it. But domestically, the effect may not be the same. The large models in China are already very strong and can fully handle daily work scenarios.

Sometimes the gap is not significant, and I actually prefer domestic models such as DeepSeek, Doubao, and Qwen, which are already very useful for daily programming tasks.

Liu Guang: OpenAI still leads in some application scenarios, such as deep research and information collection and organization, with an extremely low hallucination rate. But the competitive pressure is also enormous, especially from Claude. GPT-5 has proposed support for parallel Agents this time, but it still needs time to verify whether it can outperform Claude in actual performance.

Many domestic companies, such as Kimi's K2, have already integrated into Claude's ecosystem. This actually reflects the difference between two philosophies: OpenAI insists on a single large model dominating the market, while Claude follows an industrialized path—small steps and rapid iterations, building a tool ecosystem. Currently, the industry and individual developers tend to prefer Claude's approach.

Xiong Yuxuan: But this time it can also be seen that OpenAI is trying new directions. Its router can, to some extent, be seen as a kind of orchestration for Agents. From this perspective, OpenAI has indeed been forced to make such attempts.

Tencent Technology: Will these model companies in the future turn towards a "general Agent ecosystem + platform" model?

Xiong Yuxuan: It is possible to move towards a general Agent ecosystem, or as Professor Liu said, everyone is already quite similar in model development. The focus of competition may shift to data, such as data synthesis, diversity, effectiveness, safety, and alignment with human values.

In addition, attention should be paid to the match between data and models: generating more data is not automatically better; it also has to be compatible with the model's architecture and scale.

Tencent Technology: From the perspective of ordinary people, it seems that these model companies have returned to similar paths, becoming increasingly homogenized, and prices may continue to drop?

Liu Guang: Yes, I believe prices will definitely continue to fall. Many models are now free or even open-sourced. Since DeepSeek open-sourced its models, the dynamic has become: if you don't open-source your best models, it is very hard to stay competitive on the stage.

However, open source cuts both ways: on one hand, it greatly helps market promotion and lets many traditional enterprises adopt it quickly; on the other hand, it significantly disrupts the ToB business model. Whether open source is sustainable I cannot yet judge, but what is certain is that it will keep lowering the cost of using models and make them more widespread.

Liu Guang: I think we are still in the initial stage and cannot say there is no gap at all. For example, the video model recently launched by Google may be a new direction—no longer focusing on text but on video generation. Has the language model reached its limit? We cannot conclude that yet.

Xiong Yuxuan: We are not saying that we will stop developing models in the future; it's just that at this stage, from the user's perspective, the differences are not that significant. It's like driving a Ferrari versus a regular car; under city speed limits, even if you can go faster, there are limits. User demand is there, so even if there are performance differences in models, they are likely to lead to homogenization in the end.

For example, if Google has developed a video model, do you think Meta won't follow suit? Once everyone starts doing it, we will return to the cycle we mentioned earlier—from models to data, then to security, with data, algorithms, and computing power continuously rising in a spiral manner.

Tencent Technology: We are entering the next more challenging problem, which is the "Safe Completion" proposed this time. In the past, models would directly refuse to answer when encountering serious security issues, but now it has shifted to safe completion. Can this approach truly achieve a balance between usability and security? Or is it sacrificing security to accommodate usability more?

Zhang Hui: From the information released about GPT-5, it has indeed made efforts in this area. The previous overly rigid refusals were not good—many times, users are not malicious. For example, if a primary school student asks a chemistry question, refusing to answer directly would harm its usefulness.

We often have a habitual mindset that sees safety and innovation as opposites, but that's not the case. From the results of GPT-5, safe completion through some algorithmic innovations has not only maintained performance but has also led to overall improvements. This shows that safety and innovation can complement each other.

I think this is a good start. Even if it's just a simple safe completion, such as setting dynamically changing harmful thresholds and a series of small micro-innovations, it can enhance both safety and performance. This is also very enlightening for subsequent Agent development.
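A minimal sketch of the "safe completion" idea as described here: classify the user's intent, then pick a response strategy against a dynamic harm threshold, rather than issuing a blanket refusal. The labels, thresholds, and the classifier input are illustrative assumptions, not OpenAI's actual policy:

```python
# Sketch of "safe completion": instead of a binary answer/refuse gate,
# combine an intent estimate with a dynamic harm threshold to choose
# among full answer, high-level safe answer, or refusal.
def choose_strategy(harm_score: float, intent: str) -> str:
    """Pick a response mode from estimated harm and classified intent."""
    # Benign contexts (e.g. a student's chemistry homework) tolerate a
    # higher harm score than clearly malicious ones.
    threshold = {"educational": 0.7, "unknown": 0.5, "malicious": 0.2}[intent]
    if harm_score < threshold:
        return "answer_fully"
    if harm_score < threshold + 0.2:
        return "safe_completion"   # answer at a high level, omit specifics
    return "refuse"

print(choose_strategy(0.6, "educational"))  # answer_fully
print(choose_strategy(0.6, "malicious"))    # refuse
```

The point of the dynamic threshold is exactly what Zhang Hui describes: the same nominally risky content can be served, summarized, or refused depending on who is asking and why.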

Tencent Technology: What went wrong with the chart OpenAI was criticized for yesterday? How did such a basic mistake slip into such an important release?

Liu Guang: I just discussed this with the other teachers. Such a basic error likely happened because, after drawing the chart, they felt the performance looked particularly good and nudged the scores up slightly. Whether other team members reviewed it, or whether it was simply checked by GPT-5 itself, is hard to say, but any of these could produce such a result.

Zhang Hui: I don't find this surprising, as OpenAI has made similar "perceptual errors" before. It could be due to the style setting of the prompts, leading the model to emphasize "I am particularly good, and the other party is particularly bad," thus continuously amplifying its advantages in the bar chart. The model is quite satisfied with this result and feels it has fulfilled the user's intent.

Xiong Yuxuan: I actually think it's more likely that they are indeed too anxious internally, which also indicates that there is a lot of pressure for this press conference, and the preparation was relatively rushed.

Tencent Technology: Recently, Altman mentioned in an interview that he feels "completely useless compared to AI." Has AI's capability evolved to the point where humans begin to doubt their sense of purpose? How much time is left in this window?

Zhang Hui: This question is indeed critical and hard to predict. Some say 2025, others 2027. In my view, Altman's way of using large models may differ from mine. I use them mostly for literature retrieval; although they generate many non-existent references, some are real and come with original links, which is very helpful in RAG-style workflows. As for large models making one shudder and doubt life, I haven't encountered that yet.

Xiong Yuxuan: I think this issue should be viewed from two aspects. First, the stronger AI is, as long as we can control it, it is definitely a good thing that can improve our work efficiency. As for fear, I think we should adopt a dynamic perspective. It will indeed replace some jobs, but it may also give rise to new business forms.

For example, the teaching profession may evolve into a "teacher-machine-student" interaction model in the future, where the teacher's role shifts from merely imparting knowledge to also teaching students how to interact with large models. So it can be both concerning and encouraging. I still tend to view it cautiously but optimistically.

Liu Guang: Overall, I am quite optimistic. Picking up on what Teacher Xiong said about education, I noticed that OpenAI's GPT-5 has a dedicated entry point for education, which may detail the thinking process more thoroughly and even provide APIs to check whether homework was generated by AI. On one hand it assists students in learning; on the other it aims to prevent cheating, a somewhat contradictory pair of goals.

I believe the key is still controllability. If AI operates within a controllable range, there is no problem; but if it becomes uncontrollable, like the rumors that it refuses to shut down or pretends to shut down but actually doesn't, that would be very creepy. However, from the current perspective, I remain optimistic as long as the safety mechanisms and sandbox mechanisms are designed well enough, such risks could potentially be mitigated to some extent.

Xiong Yuxuan: I will return to the router mechanism mentioned at the beginning. Although its innovation is not particularly significant from an academic perspective, it has actually redefined the competitive direction of the industry and greatly promoted the subsequent development of Agentic AI. From a business perspective, it may make GPT more affordable for more people, which is quite meaningful.

Zhang Hui: Setting performance aside, I think there has been progress in safety measures and governance. Through innovative mechanisms like safe completion, it has delivered simultaneous gains in performance and content safety consistent with user experience, offering a new idea: safety does not necessarily have to sacrifice usability.

Liu Guang: I believe OpenAI has made some compromises in the definition and path of AGI. In the past, it emphasized that a single model could do everything; now it is packaging a series of models and advancing with agents and tool calls. This adjustment in direction is, in itself, a realistic choice.
