Scaling Law is stuck in a dilemma. Is reinforcement learning the whole village's only hope?
Scaling Law was once seen as the key to AI breakthroughs, but it is now hitting a bottleneck. A recent Q3 summary of the AI industry pointed out that the pre-training Scaling Law is no longer delivering, and 80% of companies may abandon this strategy. Self-play RL is instead seen as the next hope, especially for coding, where Claude 3.5 Sonnet has outperformed GPT-4o and demonstrated the potential of RL. Meanwhile, OpenAI is about to release a new model, and the ChatGPT Pro subscription plan has launched at $200 per month.

Orange Soda Shop, by orangesai. Original title: "The Dilemma of Scaling Law, the Hope of Self-play RL, and $200 per Pound Strawberries". Cover image: AI-generated.
Scaling Law has hit a bottleneck; reinforcement learning is becoming AI's path to a breakthrough.
• 🚀 Reinforcement learning achieves breakthroughs in coding, mathematics, and other fields
• 🍓 OpenAI is about to release a new strawberry model
• 💰 ChatGPT Pro subscription launches at $200/month, built around the performance of a powerful new model
In the past few days, three things have happened:
- Listened to Xiaojun and Guangmi's podcast summarizing the AI industry in Q3, discussing the bottleneck of the pre-training Scaling Law and the importance of Self-play RL;
- The Information revealed that OpenAI will release a new strawberry model within 2 weeks;
- The ChatGPT Pro subscription plan has started its soft launch, priced at $200 per month, but upon trying it out, it seems to offer nothing new.
These three events are interconnected; pieced together, they point to something that is likely to become industry consensus.
I. The Dilemma of Scaling Law
The entire large language model industry has not made significant progress for a long time.
This is a common feeling among everyone.
Once models pass tens of billions of parameters, training cost and difficulty rise sharply, yet even trillion-parameter models do not seem to bring a qualitative leap.
Ilya even bluntly stated that people no longer know what they are scaling when discussing Scaling Law, and he has some new ideas about scaling.
Guangmi mentioned that the consensus gradually forming in Silicon Valley is that RL (reinforcement learning) is the next breakthrough.
As for pre-training, 80% of companies will simply give it up.
II. Self-play RL, the Hope of the Whole Village
If pre-training is unaffordable, reinforcement learning becomes the hope for the whole village.
The first unexpected breakthrough is Claude 3.5 Sonnet, whose coding ability surpasses GPT-4o, powering Cursor and making AI coding the hottest topic. That breakthrough in coding ability may well have been achieved through RL.
The second unexpected breakthrough is DeepSeek, which started late but focused on improving reasoning, coding, and mathematical ability. It recently merged its Coder and Chat models, reaching a coding level close to GPT-4o and leading the industry in China.
What these two companies have in common is a breakthrough at a single point.
If the hallmark of large language models is broad improvement in general intelligence, the hallmark of RL is a breakthrough at a single point.
And reasoning, coding, mathematics, and agents are currently the most valuable productivity domains, and the ones best suited to a single-point breakthrough.
III. Strawberry Model, Arriving in Two Weeks
This information comes from a report by The Information yesterday:
OpenAI is planning to release a text-only version of "Strawberry" within the next two weeks, according to two testers involved with the model.
Early impressions indicate it’s somewhat underwhelming, primarily using chain-of-thought prompting. Responses take 10-20 seconds, making it slower than expected.
While testers found its performance slightly better than GPT-4o, Strawberry struggles with short, simple queries and has issues with memory integration.
The model lacks image integration, making it exclusively text-based for now.
It is expected that Strawberry will have rate limits and might introduce a higher-priced tier for users seeking faster response times, diverging from the current pricing structure of ChatGPT.
IV. ChatGPT Pro Is Now Online, Priced at $200 per Month
There were reports a few days ago that OpenAI was considering a subscription price of $2000 per month, which seemed crazy. But today, OpenAI finally announced the actual subscription price: $200 per month...
I don't know if it's because of the $2000 price point setting the stage, but $200 seems reasonable?
After buying and using the $200 subscription, I found that apart from unlimited use of 4o, there is nothing new.
So the only explanation is probably the upcoming release of Strawberry.
Kazek's summary of Strawberry: based on the new Self-play RL paradigm, it is extremely strong at mathematics and coding, and it is a new model capable of autonomously performing browser- and system-level operations on the user's behalf. More intelligent, slower, and more expensive.
Why is it so expensive? Simply put, this is the cost of higher intelligence.
From a utility perspective, the Strawberry model's defining traits are strong coding, mathematical, reasoning, and agent abilities, all of which are highly valuable. If its coding ability is clearly better than the current Claude 3.5 Sonnet, $200 per month is acceptable.
From a cost perspective, every Strawberry response involves a great deal of internal "thinking" and can take 10-20 seconds, so its compute cost is likely more than 10 times that of GPT-4.
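To see where a 10x-plus multiplier could come from, here is a rough back-of-envelope sketch. Every number in it (answer length, decoding speed, number of sampled reasoning chains) is my own illustrative assumption, not a figure from OpenAI or The Information:

```python
# Back-of-envelope estimate of how hidden chain-of-thought inflates per-reply compute.
# All constants below are illustrative assumptions, not published figures.

VISIBLE_ANSWER_TOKENS = 300   # assumed length of a typical visible chat reply
DECODE_TOKENS_PER_SEC = 60    # assumed decoding speed of the serving stack
THINKING_SECONDS = 15         # midpoint of the reported 10-20 second delay
SAMPLED_CHAINS = 4            # assumed number of candidate reasoning chains,
                              # sampled in parallel so wall-clock time stays ~15 s
                              # while compute scales with the count

# Tokens spent on hidden reasoning before the visible answer appears.
hidden_tokens_per_chain = DECODE_TOKENS_PER_SEC * THINKING_SECONDS   # ~900
total_hidden_tokens = hidden_tokens_per_chain * SAMPLED_CHAINS       # ~3600

# Compare with a plain completion that only generates the visible answer.
strawberry_tokens = total_hidden_tokens + VISIBLE_ANSWER_TOKENS      # ~3900
plain_tokens = VISIBLE_ANSWER_TOKENS                                 # 300

print(f"approximate cost multiplier: {strawberry_tokens / plain_tokens:.1f}x")  # ~13x
```

Under these made-up assumptions the multiplier already lands above 10x, before even accounting for the model itself possibly being larger or more expensive per token.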
In principle, the Self-play RL approach behind Strawberry itself requires enormous inference compute during training, and because it is not real-time, the model's value may lie more in synthesizing data than in direct use. High-quality data is very expensive, so $200 may just be enough for a few doctoral students.
Now that the pricing is out, we await the model announcement at OpenAI's dev day in November, the highlight of the AI industry this year. Will it be a new milestone or as mundane as an Apple product launch? Let's wait and see!
Orangesai, author of Orange Soda Shop