Integrating 100,000 H100 GPUs! Musk launches super AI cluster project aiming to build the world's "most powerful artificial intelligence"

Zhitong
2024.07.23 07:11

Elon Musk has launched a super AI cluster project to build the world's most powerful artificial intelligence. The cluster will comprise 100,000 NVIDIA H100 AI GPUs, with Super Micro Computer supplying most of the underlying hardware infrastructure. Musk aims to complete construction of the system by December this year.

According to the financial news app Zhitong Finance, Tesla CEO and AI startup xAI founder Elon Musk has officially launched construction of the "world's most powerful AI training cluster." The goal is for xAI to build the "world's most powerful artificial intelligence (AI)" by December, using a massive AI supercomputing system powered by 100,000 NVIDIA H100 AI GPUs. Industry insiders commented that once NVIDIA's new Blackwell-architecture AI GPUs, billed as the world's most powerful, begin shipping in the fourth quarter, xAI may be among the first customers to try them.

Musk has been posting continuously on the social media platform X (formerly Twitter), emphasizing that his startup xAI has launched the "world's most powerful AI training cluster" project, which he intends to use to create what he calls the "world's most powerful artificial intelligence" by December this year. In one post on X, he noted that xAI's super AI cluster in Memphis had that day begun AI training on 100,000 liquid-cooled NVIDIA H100 GPUs connected over a single RDMA (Remote Direct Memory Access) fabric.

AI server industry leader Super Micro Computer (SMCI.US) has provided most of the basic hardware infrastructure for xAI's super AI cluster. The company's CEO, Charles Liang, commented on Musk's post, praising the xAI team's strong execution; he had previously expressed admiration for Musk's liquid-cooled AI data center.

Super Micro Computer's major clients include OpenAI, the developer of ChatGPT and Sora; many other AI startups, among them Musk's xAI; and Oracle and cloud service giants such as Amazon AWS. Thanks to a close, long-standing partnership with NVIDIA and a strong supply chain, Super Micro Computer often secures larger allocations of NVIDIA AI GPUs. Through that collaboration it ships servers built around NVIDIA's latest GPUs with the full CUDA acceleration stack integrated, providing the GPU acceleration for AI training and inference workloads that is an indispensable technical link for enterprises worldwide deploying AI. Super Micro Computer is also long known in the server field for customized solutions, optimizing hardware configurations to specific customer needs, a capability that is crucial for startups like xAI.

In a subsequent post, Musk explained that the new super AI training cluster will "train the world's most powerful artificial intelligence according to various metrics." Based on earlier statements of intent, industry analysts expect the 100,000-H100 supercluster to be used to train xAI's Grok 3 large model. Musk said the improved large language model (LLM) is expected to complete its training phase "before December of this year."

Judged by current scale, the new xAI Memphis training cluster easily surpasses, in NVIDIA AI GPU compute, any AI cluster on the TOP500 list of the world's fastest supercomputers. Even the most powerful systems, such as Frontier (37,888 AMD GPUs), Aurora (60,000 Intel GPUs), and Microsoft's Eagle (14,400 NVIDIA H100 GPUs), appear to lag far behind xAI's cluster.

Demand for NVIDIA AI GPUs remains strong! A new round of stock price surge is imminent

In May of this year, media reports indicated that Musk planned to have an xAI supercomputing factory built by the fall of 2025. He then abruptly announced construction of the super AI training cluster, along with a large-scale purchase of NVIDIA's Hopper-architecture H100 AI GPUs. The move suggested Musk lacked the patience to wait for NVIDIA's next-generation upgraded H200 AI GPU, let alone the B100, B200, and GB200 AI GPUs based on the newly announced Blackwell architecture.

However, with NVIDIA expected to begin shipping Blackwell-architecture AI GPUs in the fourth quarter, some industry analysts expect Musk's xAI to be among the first customers to try them. When NVIDIA unveiled the Blackwell architecture in March, Musk publicly called NVIDIA's AI hardware "the best AI hardware." He has also likened the tech industry's AI arms race to a high-stakes "poker game" in which companies must invest billions of dollars annually in AI hardware to stay competitive.

NVIDIA's next-generation Blackwell AI GPU family promises another leap in performance. Tech giants such as Amazon, Dell, Google, Meta, and Microsoft will deploy Blackwell AI GPUs heavily in their latest data center AI server systems, and Wall Street analysts generally expect these companies' demand for NVIDIA hardware to far exceed market expectations. Industry insiders recently revealed that, given strong global demand for the upcoming Blackwell AI GPUs, NVIDIA has increased its AI GPU foundry orders with chip giant TSMC by at least 25%.

NVIDIA's hottest current AI chips, the H100/H200 GPU accelerators, are based on its breakthrough Hopper architecture and deliver far greater computing power than the previous generation, especially in floating-point throughput, Tensor Core performance, and AI-specific acceleration. The Blackwell-based GPUs go further still: on the 175-billion-parameter GPT-3 LLM benchmark, the Blackwell-based GB200 delivers 7 times the inference performance and 4 times the training speed of an H100 system.

With the new-generation Blackwell GPUs about to launch and demand for the H100/H200 AI GPUs still strong, some Wall Street analysts predict a new round of earnings and stock price growth for NVIDIA, and have accordingly raised their 12-month price targets on the stock.

The well-known Wall Street firm Piper Sandler recently reiterated its "overweight" rating on NVIDIA and raised its target price for the next 12 months from $120 to $140 (NVIDIA closed at $123.54 on Monday). Another firm, Loop Capital, recently raised NVIDIA's target price for the next 12 months from $120 to $175 and maintained a "buy" rating on the stock. The international bank UBS reiterated its "buy" rating on NVIDIA and raised its target price from $120 to $150.

In a report, Piper Sandler wrote: "Research data shows strong pre-orders for NVIDIA's new Blackwell products, and pre-orders for existing products like the H100 and H200 remain very strong." The firm expects NVIDIA's revenue for the quarter ending in July to come in roughly $2 billion above the market consensus; in the previous quarter, NVIDIA's revenue exceeded market expectations by around $1.5 billion.