Wallstreetcn
2023.10.21 07:36

Musk's xAI releases its first research result! Co-first authors: founding member Greg Yang and a Yao Class alumnus.


Musk's xAI, the first public research result is here!

One of the co-first authors is Greg Yang, a founding member of xAI and a former student of Shing-Tung Yau.

Previously, Greg Yang publicly stated that his research directions at xAI are "Math for AI" and "AI for Math".

One of the key focuses is to continue his previous research:

Tensor Programs, a unified programming language for describing neural network architectures, which has already been applied in GPT-4.

This new paper belongs to this series and focuses on "how to train infinitely deep networks".

For this work, Greg Yang himself also hosted a livestream to share the details.

Let's take a look at the highlights worth bookmarking~

Training Infinitely Deep Neural Networks

In simple terms, this article studies the extension of residual networks (ResNet) in the depth direction.

We know that residual networks (ResNets) solved the degradation problem that deep convolutional networks run into as the depth increases. Even so, it is still not easy to train a good residual network once the network keeps getting deeper:

  • as the network deepens, the scale of the features keeps growing, which destabilizes training;

  • every time the network is deepened, the hyperparameters have to be retuned, which is a considerable amount of work.

The idea of Greg Yang and his colleagues is to find a depth parameterization that can both learn features and allow hyperparameters to transfer across depths.

They started from the two limiting cases of infinite-width neural networks: such a network behaves either as a kernel machine or as a feature learner. In the latter case, the optimal hyperparameters do not change with the width.
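To get a feel for what that width-direction transfer looks like in practice, here is a rough sketch based on the open-source `mup` package that accompanies the earlier Tensor Programs work (μTransfer). The model, widths, and learning rate below are illustrative assumptions, not values from the new paper:

```python
# Rough sketch of width-direction hyperparameter transfer (muTransfer) with the
# open-source `mup` package. The model, widths, and learning rate are assumptions
# chosen for illustration only.
import torch.nn as nn
from mup import MuReadout, set_base_shapes, MuAdam

class MLP(nn.Module):
    def __init__(self, width: int, d_in: int = 784, d_out: int = 10):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(d_in, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        # The output layer uses MuReadout so that it scales correctly with width.
        self.readout = MuReadout(width, d_out)

    def forward(self, x):
        return self.readout(self.body(x))

# Base and delta models only record which dimensions count as "width".
base, delta = MLP(width=8), MLP(width=16)

# Narrow proxy model: cheap to sweep the learning rate on.
proxy = MLP(width=128)
set_base_shapes(proxy, base, delta=delta)
# ... suppose the sweep on `proxy` finds that lr = 3e-3 works best ...

# Wide target model: under muP, the same learning rate should stay near-optimal.
target = MLP(width=4096)
set_base_shapes(target, base, delta=delta)
optimizer = MuAdam(target.parameters(), lr=3e-3)
```

The point of the workflow is that the expensive learning-rate sweep is done once on the narrow model and then reused at the full width; the new paper asks whether the same can be done in the depth direction.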

To analyze these infinite-width limits, they use the Tensor Programs framework.

As mentioned earlier, Tensor Programs is one of Greg Yang's long-term research pursuits: to establish a low-level programming language that can describe and analyze neural network architectures in mathematical terms.

Specifically, a Tensor Program is built from matrix multiplications and activation functions. Greg Yang found that if a neural network's function can be expressed in this language, its initialization analysis can be carried out automatically and completely. We won't reproduce the mathematical derivations here; they can be found in the paper.
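As a loose illustration of what "a network written as matrix multiplications and activation functions" can look like, here is a toy sketch in our own simplified notation; it is not the formal language defined in the Tensor Programs papers:

```python
# Toy illustration of the idea behind Tensor Programs: a network expressed as a
# straight-line program whose only instructions are matrix multiplications and
# coordinatewise nonlinearities. (Simplified notation of our own, not the formal
# language from the papers.)
import numpy as np

def run_program(program, x):
    """Execute a list of ("matmul", W) / ("nonlin", f) instructions on input x."""
    h = x
    for kind, op in program:
        if kind == "matmul":
            h = op @ h        # a matrix-multiplication instruction
        elif kind == "nonlin":
            h = op(h)         # a coordinatewise nonlinearity
    return h

rng = np.random.default_rng(0)
width = 64
# A two-hidden-layer MLP, written as such a program.
program = [
    ("matmul", rng.normal(0, 1 / np.sqrt(width), (width, width))),
    ("nonlin", np.tanh),
    ("matmul", rng.normal(0, 1 / np.sqrt(width), (width, width))),
    ("nonlin", np.tanh),
    ("matmul", rng.normal(0, 1 / np.sqrt(width), (1, width))),
]
print(run_program(program, rng.normal(size=width)))
```

Once a network is written in such a form, how each intermediate quantity behaves in the infinite-width limit can be analyzed mechanically, which is what makes the automatic initialization analysis possible.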

Based on these analyses, the authors propose the Depth-μP method, which enables hyperparameter transfer in the depth direction and greatly simplifies hyperparameter tuning at different depths.

Depth-μP includes the following points:

  • Each residual branch is scaled by a coefficient a/sqrt(L), i.e., inversely proportional to the square root of the depth L.

  • The learning rate of each weight matrix is scaled with the depth L in a way that depends on the optimizer: for SGD the learning rate stays a constant η, while for adaptive optimizers such as Adam it is scaled to η/sqrt(L) (a minimal code sketch of both rules follows below).
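Here is a minimal PyTorch sketch of these two rules. The block architecture, width, depth, and base learning rate are illustrative assumptions, not code released with the paper:

```python
# Minimal sketch of the two Depth-muP rules above (illustrative assumptions only):
# each residual branch is multiplied by a / sqrt(L), and the Adam learning rate is
# scaled by 1 / sqrt(L).
import math
import torch
import torch.nn as nn

class DepthMuPResNet(nn.Module):
    def __init__(self, width: int, depth_L: int, a: float = 1.0):
        super().__init__()
        # Residual blocks of depth 1 (one weight matrix plus a nonlinearity each).
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(width, width), nn.ReLU()) for _ in range(depth_L)
        )
        self.branch_scale = a / math.sqrt(depth_L)  # rule 1: a / sqrt(L) per branch

    def forward(self, x):
        for block in self.blocks:
            x = x + self.branch_scale * block(x)    # scaled residual branch
        return x

L, width, base_lr = 256, 512, 1e-3
model = DepthMuPResNet(width, L)
# Rule 2: with an adaptive optimizer such as Adam, use eta / sqrt(L)
# (applied globally here for simplicity); with SGD, eta would stay constant.
optimizer = torch.optim.Adam(model.parameters(), lr=base_lr / math.sqrt(L))
```

With this parameterization, the same base learning rate η can in principle be reused as L grows, which is exactly the depth-direction transfer described above.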

Notably, the authors found that when each residual block has depth 1, Depth-μP is the optimal way to parameterize depth: it ensures that the optimal hyperparameters converge as the depth increases, so hyperparameters transfer along the depth direction.

However, when each residual block has depth ≥ 2, hyperparameter transfer can still fail and training performance can degrade.

In addition, the paper explores the concept of "feature diversity," arguing that it plays a crucial role in deep networks.

The other co-first author of the paper is Dingli Yu, who graduated from Tsinghua University's Yao Class and is now a Ph.D. student in the Department of Computer Science at Princeton.

What did Greg Yang say in the livestream?

During the livestream, Greg Yang also answered questions the audience cared about. Quantum Bit has summarized some of them without changing their original meaning.

**Q:** For many of us, the content of the paper may be beyond our understanding. But I want to know: what is the difference between the model you mentioned and ChatGPT and the OpenAI technology we can actually use? What are the significant differences or innovations of this paper compared with OpenAI's work?

**Greg Yang:** Let me comment on this briefly. I would say this work currently has no direct connection to practical applications; it is more research-oriented.

Of course, the ultimate goal of doing all this is to make models better and safer, and ultimately to benefit humanity. What we are doing now is laying out the expected effects, which may not have a direct impact yet.

We are all in the same boat now, doing what we can, whether it is short-term work or long-term applied research, all for everyone's benefit.

**Q:** It sounds like you are building an artificial computer brain capable of reasoning; is that what you are researching? Also, I am a mother and my 7-year-old son is very interested in mathematics. Do you have any suggestions for keeping him interested in and passionate about the field of AI?

**Yang Ge: ** "New type of network" refers to artificial neural networks, which I believe are the backbone of many modern technologies, including Google, Facebook, Instagram, and other services that you use every day. These services are built on these artificial neural networks. These networks were inspired by the real neural networks of animals and humans about sixty or seventy years ago, but they have deviated from real neuroscience.

At their core, these networks are mathematical problems, so we analyze them extensively in order to understand them deeply.

Although we do not yet fully understand how real neurons are connected, through mathematical research we can optimize these artificial neural networks and help technology companies improve people's lives.

Regarding your second question, it's great to hear that your son is very interested in mathematics. This is the foundation for creating great achievements in the field of technology and improving everyone's lives.

My advice would be to first keep your son's passion for mathematics alive, as this is very important. Once that passion is lost, it becomes difficult to continue learning.

Also, pay attention to what he enjoys and make the learning process interesting to further stimulate his interest. At the same time, cultivate his curiosity about how things work and encourage a scientific mindset, where he investigates driven by curiosity. For example, dismantling things and trying to understand how they work.

If someone loses their enthusiasm for exploring the mathematical truths of the universe, it can be difficult to find the motivation to move forward. Overall, I suggest you cultivate in your son a strong interest and curiosity about the world, especially in mathematics and the essence of science.

**Q:** I have a more abstract question. You had the idea of approaching infinity and then wrote this paper based on that idea. Have you considered using different architectures for neural networks? Not the standard architecture with neurons and countless layers, but something completely different. For example, a completely different way of connecting these neurons, maybe in a square shape?

**Greg Yang:** Actually, our findings about nonlinearities and the number of layers in this work are still very preliminary. There are many open questions about what the appropriate structure is, or what it should look like.

The Meta team, for example, has previously studied what happens when neurons are randomly connected and obtained some interesting results. So, there is definitely a lot more that can be done. Right now, I don't have a specific answer to what would be the right or better structure.

About Greg Yang

Greg Yang was born in Hunan Province and moved to the United States after finishing elementary school. He studied at Harvard University under Professor Shing-Tung Yau.

Greg Yang with Shing-Tung Yau. Source: Greg Yang's Twitter

In 2017, Greg Yang graduated from Harvard and joined Microsoft on the recommendation of Harry Shum (Shen Xiangyang).

At Microsoft, Greg Yang received high praise from Harry Shum. A few months ago, at a forum on "Fundamental Science and Artificial Intelligence," Shum publicly stated:

Microsoft Research usually only hires PhDs, but Greg Yang joined Microsoft Research as a fresh undergraduate. Not only did he get in, he has also been outstanding over the past five years, especially in his significant contributions to the development of GPT.

It is worth mentioning that he has himself confirmed that his μTransfer method (part of the Tensor Programs series) was used in GPT-4.

Greg Yang's research on Tensor Programs started early: he published "Tensor Programs I" in 2019 and kept pushing the line of work forward while at Microsoft. He believes that almost any computation in deep learning can be expressed as a Tensor Program.

In July this year, Musk announced the founding of a new company, xAI. Greg Yang left Microsoft and joined xAI's founding team as a mathematician.

Since joining xAI, Greg Yang has said on several occasions that the long-term goal of the Tensor Programs project is to develop a "Theory of Everything" for large-scale deep learning, that is, to find theoretical rules that truly explain the behavior of AI models.

He also stated:

AI will enable everyone to understand our mathematical universe in ways that were previously unimaginable.

Authors: Xifeng, Yuyang. Source: Quantum Bit. Original title: "Musk's xAI Releases Its First Research Result! Founding Member Greg Yang & a Yao Class Alumnus as Co-First Authors"