The Next Fifteen Years: The Future of Cloud and AI Integration

China Finance Online
2024.09.23 16:09

In the new stage of cloud-AI integration, Alibaba Cloud is committed to meeting the massive computing power demands of the AI era. With its large-scale cluster architecture HPN7.0, Alibaba Cloud has improved end-to-end model training performance by more than 10% and supports clusters of tens of thousands of cards. This forward-looking positioning keeps Alibaba Cloud ahead in AI infrastructure, attracting numerous enterprise customers and driving innovation in areas such as intelligent driving and embodied intelligent robots.

The cloud will be the backbone and fuel depot of AI.

Since Alibaba Cloud wrote its first line of code in 2009, fifteen years have passed. After two waves of cloud computing, the first powering the takeoff of Internet companies and the second the deep digital transformation of traditional enterprises, we are now entering a third: a new stage of integration between cloud and AI. Like a tide, the arrival of AI does not overturn the industrial logic of the previous two waves; rather, the technology accumulated on cloud platforms will accelerate the release of platform value in the AI infrastructure stage.

Looking ahead to the next fifteen years, it is not difficult to predict: the cloud will be the backbone and fuel depot of AI.

The same pattern can be traced in the history of the internet. Around 2000, when China's internet was just emerging, telecom operators provided the network infrastructure, and users' needs went little beyond email and online shopping. With the arrival of cloud computing and big data, the network entered the SDN era and supported the rise of online games, live streaming, and recommendation algorithms. Now, in the era of AI intelligence, the demands placed on the network, from intelligent driving to LLM training and inference, are completely different from those of the past.

How can we match the massive computing power demand in the AI era and unleash performance to the extreme? Alibaba Cloud already has the answer.

To meet the network requirements of the AI era, Alibaba Cloud developed the large-scale cluster architecture HPN7.0 last year, supporting cluster computing with tens of thousands of cards. According to the latest announcement at this year's Yunqi Conference, HPN7.0 has improved end-to-end model training performance by more than 10%. It implements front-end/back-end network separation: a 400G front-end network provides high-speed storage access and node communication, while a 3.2T back-end GPU interconnect meets the demands of large-scale AI computing.
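As a rough sanity check on those bandwidth figures (a back-of-the-envelope sketch; the per-GPU NIC layout below is an assumption, not a published HPN7.0 specification), a 3.2 Tbps back-end is what eight GPUs per node with 400 Gbps each would add up to:

```python
# Back-of-the-envelope check of the HPN7.0 per-node bandwidth figures quoted above.
# Assumption (not stated in the article): 8 GPUs per node, each with a dedicated
# 400 Gbps back-end NIC, which would add up to the 3.2 Tbps interconnect cited.

GBPS = 1e9  # bits per second

gpus_per_node = 8                    # assumed
backend_nic_per_gpu = 400 * GBPS     # assumed 400 Gbps per GPU
frontend_bw = 400 * GBPS             # front-end network, per the article

backend_total = gpus_per_node * backend_nic_per_gpu
print(f"back-end per node: {backend_total / 1e12:.1f} Tbps")  # -> 3.2 Tbps

# Time to move a 70B-parameter model's fp16 weights (~140 GB) over each plane:
model_bytes = 140e9
print(f"via back-end : {model_bytes * 8 / backend_total:.2f} s")
print(f"via front-end: {model_bytes * 8 / frontend_bw:.2f} s")
```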

In fact, Alibaba Cloud was already building HPN6.0, its first-generation ten-thousand-card cluster, as early as 2021, mainly to serve autonomous driving customers' vision model training at a time when large models were not yet in vogue. In October 2022, Alibaba Cloud was also the first in the industry to propose the concept of MaaS (Model as a Service), a concept whose adoption it has led ever since. All of this shows that Alibaba Cloud has consistently maintained a forward-looking position in the lower and middle layers of AI infrastructure.

In an era where cloud and AI are inseparable, this forward-looking positioning has enabled Alibaba Cloud to quickly win a large number of new enterprise customers, including co-creation-minded innovators in intelligent driving, embodied intelligent robotics, and other fields. Competition in AI infra is ushering in a new round of industry transformation.

The Collision of Addition and Subtraction

The leap in intelligence is particularly evident in new energy vehicles, which have an even stronger demand for intelligent infrastructure.

At the Yunqi Conference in Hangzhou on September 19, He Xiaopeng, chairman of XPeng Motors, maker of what it calls the "world's first AI car," predicted that the value of end-to-end large models in autonomous driving is that, in the future, cars in every city will drive like an experienced driver.

Conventionally, more code implies more capability. But building an end-to-end neural network that integrates perception, decision, and execution, folding all three steps into a single large model, has completely replaced the old serial architecture. The practical effect: the end-to-end model "bypasses" map data and generates vehicle acceleration, steering, and braking signals directly from real-time image data collected by cameras and sensors, making the car's responses smoother.
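The shape of such a model can be sketched in a few lines. The toy network below is purely illustrative, not any automaker's actual stack (real systems fuse many cameras and sensors and are orders of magnitude larger); it only shows the idea of mapping raw frames straight to control signals in one trainable module:

```python
# Toy sketch of "perception-decision-execution in one network" as described
# above. Illustrative only: not XPeng's or Tesla's actual model.
import torch
import torch.nn as nn

class EndToEndDriver(nn.Module):
    def __init__(self):
        super().__init__()
        # Perception: encode raw camera frames into features (no HD-map input).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Decision + execution: map features straight to control signals.
        self.head = nn.Sequential(
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 3),  # [acceleration, steering, braking]
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(frames))

model = EndToEndDriver()
controls = model(torch.randn(1, 3, 224, 224))  # one camera frame in
print(controls)  # -> tensor([[accel, steer, brake]]), trained end to end
```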

In this paradigm, the codebase shrinks further. Take Tesla's FSD v12.5.1: more than 300,000 lines of C++ were cut down to roughly 3,000. At the same time, however, Musk bought 350,000 GPUs from NVIDIA to support faster data processing. Adding computing power is the basic precondition for making the front end progressively simpler.

Over the past two years, XPeng Motors has likewise been scaling up its computing power. The compute reserve of the intelligent computing center it built jointly with Alibaba Cloud has expanded more than fourfold, to 2.51 EFLOPS. Autonomous driving large-model training that used to take a week can now be finished in as little as an hour. To accelerate end-to-end large-model development and raise the ceiling of autonomous driving, He Xiaopeng said XPeng will deepen its cooperation with Alibaba Cloud on AI computing power, with planned annual R&D investment of 3.5 billion RMB, of which 700 million RMB is earmarked for training compute, speeding the rollout of end-to-end large models.

XPeng Motors has used Alibaba Cloud since 2015, moved its connected-vehicle R&D to the cloud in 2019, and in 2022 built an autonomous driving intelligent computing center with Alibaba Cloud in Ulanqab. It has since integrated its self-developed "global large language model" X-GPT with Alibaba Cloud's Tongyi Qianwen to comprehensively upgrade the intelligence of its in-car assistant. XPeng has also embraced Alibaba Cloud's Tongyi Wanxiang and introduced Tongyi Lingma into its R&D workflow, significantly improving code review efficiency... The automaker is now all-in on AI, deeply integrating Alibaba Cloud resources across manufacturing, connected vehicles, autonomous driving, the intelligent cockpit, and digital marketing on its official website.

Another automaker, Geely, is also sprinting down the intelligent driving road. It has worked with Alibaba Cloud for nine years on a hybrid cloud architecture: 1,000 servers plus 20 PB of storage in its offline private cloud, and 70,000 cores of public-cloud ECS plus 28 PB of storage online. For intelligent driving, Geely uses the Feitian private cloud together with the PAI Lingjun computing platform, OSS, big data, and database services; for the intelligent cockpit, it uses Alibaba Cloud's EGS with the DeepGPU acceleration engine to move its self-developed large models to the cloud, achieving a 40% speedup, and calls the Tongyi large model API. With the support of Tongyi Wanxiang's VL capability, the cockpit can recognize objects outside the vehicle, and Tongyi Qianwen Plus supports emotional chat with customers.

By industry estimates, a traditional fuel vehicle has roughly 30,000 components and about 500 chips, while a new energy AI vehicle has fewer than 20,000 parts but around 5,500 chips. Between this addition and subtraction, the interaction model and production logic of vehicles have changed, deepening the reliance on dedicated chips, large-scale computing clusters, and cloud-native databases.

According to the latest news, NVIDIA's widely adopted automotive DRIVE Orin system-on-chip has been deeply adapted to Alibaba Cloud's Tongyi Qianwen multimodal large model Qwen2-VL and applied in Zebra Zhixing's intelligent cockpit scenarios. Bringing large models into the car cabin to expand the boundaries of human-machine interaction has become a trend.

Full-Stack Innovation in the "Ten-Thousand-Card Era"

Popular as large models are, they are not yet mature. Nearly every large model on the market suffers training interruptions from errors of various kinds, and training efficiency is crucial to business innovation: if training is too slow and constantly interrupted, innovation stalls. The usual remedy is to add more GPUs. When Meta trains Llama, for example, it uses a 16,000-card cluster, and roughly every two to three hours the entire training job has to restart from the most recent checkpoint.
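That checkpoint-and-resume loop is straightforward to sketch. The snippet below is a minimal illustration of the pattern, not Meta's actual training code; the path and checkpoint interval are placeholders:

```python
# Minimal sketch of checkpoint-and-resume: when a fault kills the job,
# training restarts from the last saved step instead of from scratch.
import os
import torch

CKPT = "ckpt.pt"  # placeholder path

model = torch.nn.Linear(1024, 1024)
opt = torch.optim.AdamW(model.parameters())
start_step = 0

if os.path.exists(CKPT):  # resume after an interruption
    state = torch.load(CKPT)
    model.load_state_dict(state["model"])
    opt.load_state_dict(state["opt"])
    start_step = state["step"] + 1

for step in range(start_step, 10_000):
    loss = model(torch.randn(32, 1024)).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 500 == 0:  # interval trades checkpoint I/O cost vs. lost work
        torch.save({"model": model.state_dict(),
                    "opt": opt.state_dict(), "step": step}, CKPT)
```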

From 128 cards to 1,024, from thousands of cards to tens of thousands and on toward hundreds of thousands, "stacking cards" looks simple in theory: per-GPU compute multiplied by the number of GPUs gives the cluster's total compute. In practice, as the card count grows rapidly, that linear relationship is hard to maintain, and compute "decays"; this is the real operational challenge.
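A quick bit of arithmetic makes the decay concrete. The per-GPU throughput and efficiency figures below are illustrative placeholders, not measured values for any real cluster:

```python
# Total throughput equals per-GPU throughput times GPU count only at 100%
# scaling efficiency. Efficiency numbers here are made-up placeholders.
per_gpu_tflops = 300.0  # assumed per-GPU throughput

for n_gpus, efficiency in [(128, 0.98), (1_024, 0.95),
                           (16_384, 0.85), (100_000, 0.70)]:
    ideal = per_gpu_tflops * n_gpus
    actual = ideal * efficiency
    print(f"{n_gpus:>7} GPUs: ideal {ideal/1e3:8.0f} PFLOPS, "
          f"delivered {actual/1e3:8.0f} PFLOPS ({efficiency:.0%})")
```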

At this point the network becomes pivotal. "Gradient synchronization" exchanges large volumes of data across the cluster and takes time, and the length of that time directly determines how long GPUs sit idle during computation, which is why traditional network clusters are no longer suitable for AI computing.
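To see why, consider a standard ring all-reduce, one common way to synchronize gradients. The model size and link speeds below are assumptions chosen for illustration:

```python
# A ring all-reduce moves roughly 2*(N-1)/N times the gradient volume through
# each GPU's link, so per-step stall time scales with gradient size over
# link bandwidth. All numbers below are illustrative assumptions.
def allreduce_seconds(grad_bytes: float, link_gbps: float, n_gpus: int) -> float:
    traffic = 2 * (n_gpus - 1) / n_gpus * grad_bytes   # bytes per link
    return traffic * 8 / (link_gbps * 1e9)

grad_bytes = 140e9  # e.g. fp16 gradients of a 70B-parameter model (assumed)
for gbps in (100, 400, 3200):
    t = allreduce_seconds(grad_bytes, gbps, n_gpus=10_000)
    print(f"{gbps:>5} Gbps per link -> ~{t:.1f} s of GPU wait per sync")
```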

In response, the HPN7.0 high-performance network architecture mentioned above adopts a design built specifically for AI computing: a single network tier connects a thousand cards, two tiers connect ten thousand, storage and compute are separated, and the architecture can scale to hundred-thousand-card clusters. The two-tier network not only reduces latency but also simplifies the number and topology of network connections, arriving at the optimal design.
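The rough capacity math behind "one tier for a thousand cards, two for ten thousand" can be sketched as follows; the switch radix and rail count here are assumptions for illustration, not HPN7.0's actual parameters:

```python
# Rough capacity math for single-tier vs. two-tier designs. All parameters
# are illustrative assumptions, not HPN7.0's published specifications.
ports = 128          # assumed switch radix (e.g. a 51.2T switch of 400G ports)
rails = 8            # assumed: GPU NICs spread across parallel rail planes

tier1_cards = ports // 2 * rails          # half the ports face hosts
tier2_cards = tier1_cards * (ports // 2)  # uplinks fan out one more tier
print(tier1_cards, tier2_cards)  # ~512 and ~32,768 with these numbers:
                                 # the thousand / ten-thousand order of magnitude
```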

The scale advantages Alibaba Cloud accumulated in the past are now underwriting a new round of technical advantages. Qwen2.5-72B, the latest release at the Yunqi Conference, outperforms Llama 3.1 405B, while model compute costs have fallen yet again, by as much as 85% for the three flagship models. AI infrastructure must become not only more powerful but also more affordable in order to drive broader innovation, and Alibaba Cloud is accelerating in that direction.

According to Zhou Jingren, CTO of Alibaba Cloud, the transformation wrought by AI now reaches every part of the computer system and demands innovation across the board: not just the network, but the entire technical stack of servers, storage, data processing, and model training and inference platforms must be upgraded around AI. "Alibaba Cloud is setting a new standard for AI infrastructure in the AI era."

Eddie Wu, CEO of Alibaba Group and chairman and CEO of Alibaba Cloud Intelligence, shared at the Yunqi Conference that over the past year Alibaba Cloud has invested in building a large amount of AI computing power, yet still cannot meet customers' strong demand, which only reinforces Alibaba Cloud's resolve to invest further.

Specifically, on the server side, Alibaba Cloud's latest Pangu AI server supports 16 GPUs in a single machine with 1.5 TB of shared memory, and provides AI-based GPU fault prediction with 92% accuracy. The AI era is shifting from CPU cores to GPU-based computing instances, which must support heterogeneous chips from around the world and face more architectural innovation than the CPU era ever did. Pangu servers are optimized specifically for AI, including rapid chip adaptation and heat dissipation.
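As a purely illustrative sketch of what telemetry-based fault prediction can look like (the scoring rule below is a toy assumption; Alibaba Cloud's actual model behind the 92% figure is not public):

```python
# Toy GPU fault-risk score from telemetry. A real system would train a model
# on labeled failure data rather than use fixed hand-tuned weights.
def fault_risk(ecc_errors: int, temp_c: float, xid_events: int) -> float:
    score = 0.4 * min(ecc_errors / 10, 1) \
          + 0.3 * min(max(temp_c - 80, 0) / 15, 1) \
          + 0.3 * min(xid_events / 3, 1)
    return score  # in [0, 1]; above a threshold, drain the node pre-emptively

print(fault_risk(ecc_errors=12, temp_c=88, xid_events=1))  # high risk
```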

On storage, Alibaba Cloud's CPFS file storage has evolved into a fully managed service over the past year, freeing customers from operations and maintenance. It now scales to 20 TB/s of bandwidth and uses a tiered storage design that places the hottest data on the lowest-latency tier. Data transfers between CPFS and OSS, the unified data lake, reach 100 GB/s. All of these designs are tailored to AI computing.
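The tiering idea itself is simple to sketch. The tier names and heat thresholds below are illustrative stand-ins, not CPFS's actual placement policy:

```python
# Toy tiered-placement rule: hotter data lands on lower-latency storage.
def pick_tier(accesses_per_hour: float) -> str:
    if accesses_per_hour > 100:
        return "nvme-cache"   # lowest latency, smallest capacity (assumed tier)
    if accesses_per_hour > 1:
        return "ssd"          # middle tier (assumed)
    return "oss-data-lake"    # coldest tier, bulk capacity

for path, heat in [("/train/shard-0001", 500), ("/ckpt/step-90000", 4),
                   ("/raw/2023-archive", 0.01)]:
    print(f"{path:>20} -> {pick_tier(heat)}")
```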

Pangu AI servers, the HPN network, CPFS storage, and the ACS container service together form Lingjun, Alibaba Cloud's AI computing platform: infrastructure purpose-built at the AI infra level for GPU computing and AI model training. The CFFF cloud-based intelligent computing platform built jointly with Fudan University, and the autonomous driving intelligent computing center built with XPeng Motors in Ulanqab, are both industrial applications of Lingjun.

These full-stack capabilities for AI development and application deployment are delivered externally through the PAI and Alibaba Cloud Bailian platforms. Both announced upgrades at the Yunqi Conference: PAI model training is now markedly more stable, with cluster-level fault detection at thousand-card scale completing within minutes and covering 98.6% of faults, and Bailian 2.0 Exclusive Edition was released, optimized specifically for government and enterprise customers.

It is precisely these innovations that have enabled multiple rounds of price cuts for general-purpose large models and basic computing power, lowering the cost of enterprise AI development, which is crucial for raising AI's penetration across industries.

Looking back at the early days of cloud computing, the stack was divided into IaaS, PaaS, and SaaS by hosting level. With AI, that architecture now extends upward to MaaS and open source, and downward to the chip layer and heterogeneous computing power. AI not only expands the boundaries of the cloud; it is pushing the cloud through another full-stack upgrade down to the physical layer. Now is truly the time that tests cloud vendors' capacity for innovation.

Over the next fifteen years, a new wave will surge atop the "AI + cloud" infrastructure.

Looking back on fifteen years of ups and downs in Chinese cloud computing, this series takes the three waves of cloud computing as its main thread, systematically reviewing and reflecting, across three articles, on the past, present, and future of the integration of industry and the cloud.