
The birth of a non-mainstream large model: questioned, plagiarized, and modified

RWKV founder Peng Bo spent three years building the RWKV architecture in an era dominated by the Transformer. Unlike the Transformer, RWKV is an RNN-based architecture. Despite doubts and financing difficulties, Peng Bo persisted on his unconventional path and even received an invitation to join OpenAI, a sign of his potential to disrupt it. Views in the industry are mixed, with some arguing that RWKV is not fundamentally different from the Transformer.
Before meeting RWKV's founder, Peng Bo, I, like many others, doubted whether he was a fraud or a "pseudoscientist."
Employees of his company, Yuanshi Intelligence, have published posts on Xiaohongshu with titles like "OpenAI is at a dead end." Beneath these "extreme views," the comment sections fill with skepticism and abuse, demanding evidence that their self-developed large-model architecture, RWKV-6, really is superior to OpenAI's.
But Peng Bo ignores these voices entirely; perhaps this is the price of being unconventional. While most large-model companies chose the same Transformer architecture as OpenAI, Peng Bo spent three years on a different path, rewriting the Transformer into an RNN form with lower time and space complexity. "Give me 100,000 GPUs and 100 people, and I'll beat OpenAI," Peng Bo told Huxiu, half-joking and half-serious.
Although they are on a completely different path from OpenAI, during the 2023 Spring Festival Peng Bo received a job offer from this "competitor." Luo Xuan, co-founder of Yuanshi Intelligence, sees it as evidence that they really do have the potential to disrupt OpenAI: the offer was more an attempt to recruit Peng Bo into the fold, a different kind of "recognition."
However, there are also voices in the industry that believe that the RWKV architecture is not fundamentally different from the Transformer architecture.
In addition, their financing has not been "smooth," and recognition from capital has been low. Peng Bo told me frankly that some investors dismiss them as "fringe scientists."
This has left them constrained in computing power and talent, which in turn affects RWKV's hard benchmark numbers. "Right now everyone else is training at larger scale, with more tokens; in that respect we are relatively behind," Peng Bo said. Models trained on more tokens start with a significant advantage, but he also believes that "once our token count goes up, they won't be able to criticize us so easily."
The road less traveled
However, there are still believers in the RWKV route.
For example, at the end of 2023 a well-known investor with a technical background gave the RWKV team a seed round in the tens of millions of RMB. To win the team's trust, he came in person to the coffee shop below Peng Bo's building, talked for two hours, and subscribed for 1% of the shares.
It is rare for such a well-known investor to visit in person. But Peng Bo rarely leaves his home; he needs long stretches alone to "refine the elixir," as Chinese AI circles call model training. While most people work at the model layer, Peng Bo chose to research the underlying architecture of models. He believes today's AI companies focus on improving mechanical intelligence, while he is also concerned with creativity and wisdom. The former points to the brain, with clear pathways (such as synthetic data); the latter ultimately points to the heart and mind, with paths yet to be explored - but that, he says, is the truly interesting problem, one that requires understanding and creating spirituality.
Initially, Peng Bo took on this "world-line-changing" work because he wanted to explore whether AI could write "truly powerful" novels, especially speculative fiction. He likens the work to the fable of the Foolish Old Man who moved the mountains: "this is really physical labor, implementing and testing an enormous number of details."
In 2020, Peng Bo started working on models, initially improving the Transformer architecture by introducing explicit decay and short convolutions.
While optimizing the attention mechanism, Peng Bo found that it could be rewritten as an RNN, keeping its performance while gaining an RNN's efficiency and elegance. This is how RWKV-2 was born.
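The idea of rewriting attention as a recurrence can be illustrated with a minimal sketch of causal linear attention (a hypothetical simplification for illustration, not RWKV's actual formulas): by accumulating a fixed-size state, each step depends only on the previous state rather than on all past tokens.

```python
import numpy as np

def linear_attention_rnn(qs, ks, vs):
    """Causal linear attention computed as an RNN.

    Instead of comparing each query with all past keys (O(T^2)),
    we keep a fixed-size state S (d x d) and normalizer z (d,),
    updated once per step -- O(T) overall.
    """
    d = qs.shape[1]
    S = np.zeros((d, d))   # accumulated sum of outer(phi(k_i), v_i)
    z = np.zeros(d)        # accumulated sum of phi(k_i)
    outs = []
    for q, k, v in zip(qs, ks, vs):
        q, k = np.exp(q), np.exp(k)      # simple positive feature map
        S += np.outer(k, v)              # fold the new token into the state
        z += k
        outs.append(S.T @ q / (z @ q))   # weighted average of past values
    return np.array(outs)
```

The recurrence produces exactly the same outputs as the quadratic "compare every query with every past key" form, but the per-step memory is constant rather than growing with the sequence.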
In 2016, the most popular architecture in AI research was a variant of the recurrent neural network (RNN) - the LSTM; but half a year later, in 2017, the Transformer emerged, benchmarked against the RNN, and pushed the once-dominant RNN into the minority. In that sense, RWKV can be seen as a revival of the RNN.
In 2023, RWKV quickly gained attention in the open-source community. Sepp Hochreiter, the father of the LSTM, retweeted RWKV's announcement, introducing it as an RNN architecture that matches Transformer performance without using attention and runs very fast.
With the attention came doubts. A technical director at a leading AI company told Huxiu bluntly: "There is no fundamental difference between the RWKV architecture and the Transformer architecture."
At a meeting in July, hoping to change my skeptical attitude, Luo Xuan, co-founder of Yuanshi Intelligence, showed me the researchers who have studied the RWKV architecture and published papers on it - scholars from the Shanghai Artificial Intelligence Laboratory, Alibaba DAMO Academy, Tencent YouTu Lab, and other institutions. More than twenty papers applying RWKV across different modalities are displayed on the official website, rwkv.cn.
Academia has been exploring how to compress and replicate the physical world inside a large model - a "world model" - and Peng Bo firmly believes that RWKV, as an RNN, is the most suitable path to it.
Peng Bo's logic runs: the RNN is closer to how the human brain and the universe operate. RWKV is an RNN with a constant state size, and it is precisely this fixed-size limitation that forces the model to learn real things, compressing the world into its state - just as in "Stardew Valley," where limited backpack space forces players to keep only the most important items.

Peng Bo also explained RWKV's advantages from a physics perspective. In physics, the next state of the universe depends only on the previous state - the principles of locality and causality, which quantum field theory obeys. The Transformer, by contrast, is an RNN whose state (the KV cache) keeps growing: each new word must be compared against all previous words, a kind of "action at a distance" that does not match the physics of our universe. In Peng Bo's view, RWKV is therefore closer to the essence of this world.
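The "limited backpack" contrast can be made concrete with a back-of-the-envelope memory count (the sizes and layer counts below are hypothetical, chosen only for illustration): a Transformer's KV cache grows linearly with the number of tokens processed, while an RNN-style state does not grow at all.

```python
def kv_cache_floats(n_tokens, d_model, n_layers):
    # a Transformer caches one key and one value vector per token, per layer
    return 2 * n_tokens * d_model * n_layers

def rnn_state_floats(n_tokens, d_model, n_layers):
    # an RNN-style state (here, one d x d matrix per layer) ignores n_tokens
    return d_model * d_model * n_layers

# after 100k tokens the cache dwarfs a fixed state of the same width
print(kv_cache_floats(100_000, 1024, 24))   # grows with context length
print(rnn_state_floats(100_000, 1024, 24))  # constant, however long the input
```

Doubling the context doubles the first number and leaves the second unchanged, which is the "fixed backpack" Peng Bo describes.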
Peng Bo put it even more radically: our goal is true intelligence. A hybrid model can serve as a transitional solution for now, but the future will definitely be pure RWKV.
His confidence rests on the advantages above, but also on something else: "luck is also very important." Peng Bo considers himself a lucky person.
Breaking the loop
To my surprise, Peng Bo did not fit the arrogant, conceited stereotype I had assumed. He stressed to me several times that he is actually humble and cautious. For example, when investors ask about RWKV's current weaknesses and flaws, he discusses them openly and offers improvement plans. He also admitted that the current limits on computing power and manpower have, to some extent, hurt their ability to prove how advanced the RWKV architecture is.
These resource constraints exist because they receive little support from capital.
Although Lu Qi introduced them at a 2024 Qiji Innovation Forum roadshow as a "company respected in North America," most of the time they are not darlings of venture capital.
Almost every time he meets investors, Peng Bo must spend a long time explaining RWKV. To Luo Xuan, "they seem to be here to take a class," and after the class some still say they were cheated - "this is just pseudoscience." Many investors hesitate because they do not understand the underlying technology of large models, and with most players on the Transformer route, they are even more cautious about a non-mainstream route like RWKV. Voices from the mainstream Transformer camp also interfere with their judgment.
Peng Bo refuses to spend excessive time explaining himself to investors. He therefore only meets investors downstairs from his own building, which doubles as a screening mechanism: "If an investor insists that I come to them before they invest, it means they don't understand what we are doing."
Having worked at a hedge fund, he knows this investment logic well: everyone chases the lower-risk mainstream route, preferably with endorsements from big names or star teams, because it promises a more reliable exit.
So he quite understands these investors' choices: "after all, investors with both understanding and courage are rare."
But developing large models consumes massive resources. By my rough calculation, one hundred thousand GPUs would currently cost about 20 billion RMB. Computing power and manpower are Peng Bo's primary constraints. "If our computing power grows, proving ourselves becomes much easier." Although RWKV gets stronger with every generation, "without sufficient computing power, some hard metrics really are difficult to reach." When everyone judges by hard metrics, falling into a passive loop is inevitable.

After voicing some regret about this loop, Peng Bo was calm and self-consistent: "I don't think there is anything to regret; this is the test for innovators. If you choose the harder path, you have to bear such things. If you can't handle it, don't innovate. I think that's entirely reasonable."
Peng Bo believes time is on RWKV's side. The shortage of resources may limit the training of large models for now, but it does not slow RWKV's iteration speed - if anything, it motivates him: "Betting small to win big is more fun. Since we want to do something big - building the infrastructure for all of humanity's AI models - why not prove we can do it in any environment? It will only make the story more dramatic."
A non-consensus encounter
An architectural innovation like this requires plenty of solitary space. Peng Bo hardly attends any public events, because he has a key helper: Luo Xuan, co-founder of Yuanshi Intelligence.
Like the reliable supporting character who always stands beside the protagonist in American blockbusters, Luo Xuan acts as Peng Bo's spokesperson, appearing frequently at technology events to promote the RWKV architecture.
Luo Xuan follows Peng Bo because he believes Peng Bo is a genius - "he seems born to do this" - while Peng Bo says only that he is "very good at seeing what others find hard to see," and that he looks at the relationship between AI and humans from a higher vantage point.
When we met, I found that Peng Bo, with long hair like an artist's, is far more vivid than I had imagined. In his spare time he plays "Honkai Impact 3rd," follows society and human nature, and this year even started a "new business" in relationship counseling.
After hearing Luo Xuan's account of Peng Bo, I gradually came to understand his conviction: Peng Bo was in third grade at age 6 and took the gaokao at 16, listing the Physics Department of Nanjing University as his first choice even though his score exceeded Tsinghua University's admission line in Guangdong Province by 40 points. After entering Nanjing University, he soon transferred to the University of Hong Kong on a full scholarship.
He did not choose the computer science department because he felt there was no need to study it formally: his parents, both university teachers, believed programming was the future, so Peng Bo started coding at age six and published a book on game programming while still in high school.
In 2006, after graduating from the University of Hong Kong, Peng Bo joined what was then the world's largest forex hedge fund, worked on quantitative models, and later became one of its fund managers, overseeing more than 60 million US dollars while still in his twenties.
In 2013, Peng Bo moved from Hong Kong back to Shenzhen and founded a smart-hardware startup, Linglin Technology. In 2019, spotting market demand, he pivoted to making not-so-smart full-spectrum lights; some on Zhihu jokingly call Peng Bo the "light bulb seller."
Luo Xuan, then at the Tmall Genie AI Lab, had concluded that the smart-speaker story had yet to pan out, so he left to start a logistics-robot company.
When the pandemic hit, the company's sales suffered. Peng Bo kept it running while devoting himself to research on the underlying architecture of AI models - the road to RWKV had begun. Luo Xuan, meanwhile, began organizing and joining hackathons, offline events where programmers solve practical problems. At one hackathon Luo Xuan organized, the two met for the first time, and their paths intersected from then on.
At that meeting, Peng Bo told Luo Xuan that he might be the best candidate to achieve AGI - by then he had single-handedly developed RWKV-1 through RWKV-4 and gathered many followers overseas. Luo Xuan had met plenty of geniuses before, but Peng Bo was the more interesting kind. Though the words sounded a bit crazy at the time, Luo Xuan strongly agreed with what Peng Bo was doing and with its underlying logic, so he decided to join Yuanshi Intelligence.
Peng Bo's reason for choosing Luo Xuan was just as simple: Luo Xuan could handle, and handle well, the many things Peng Bo could not attend to. The two complemented each other.
"We are heading toward a correct non-consensus, and a non-consensus is by definition without consensus." This is how Luo Xuan described his feelings after joining Yuanshi Intelligence.
"A long road, chosen by oneself, walked by oneself." Going forward, Peng Bo will follow his plan, iterating generation by generation to break the loop. He says the future RWKV-8 will be something very interesting.
"What do you think is the fundamental difference between people?"
- Peng Bo, who likes to start from first principles, asked me at the end of our conversation.
"It's cognition," he told me, "I can only say that the direction I am taking is something they couldn't even dream of."
