Track Hyper | AMD's Lisa Su Takes On NVIDIA's Jensen Huang
AMD launches its most powerful AI accelerator yet, but the launch-day trading ends up helping push NVIDIA's market value back above $1 trillion.
Drawing the sword even when you know you are outmatched: that is the fighting spirit of Li Yunlong, commander of the Independent Regiment in the TV drama "Drawing Sword," and it is also the spirit of Lisa Su, CEO of AMD.
On June 13, Eastern Time, AMD, widely regarded as NVIDIA's most credible challenger, released the AMD Instinct MI300X, an AI accelerator pitched against NVIDIA's current flagship AI compute chip, the H100.
On paper, AMD's accelerator surpasses the NVIDIA H100 in its specifications. Whether those specifications translate into equivalent real-world performance, however, is where opinions in the capital markets diverge.
After touching its highest level since January 19, 2022, AMD's stock slid through the session to close at $124.53, down 3.61%, while NVIDIA rose 3.9% to close at $410.22, pushing its market value past $1 trillion for the second time.
MI300 Series: Born for Generative AI
The AMD Instinct MI300X is an accelerator (AI chip) developed specifically for generative AI.
Unlike the AMD Instinct MI300A, which AMD announced in June 2022, the MI300X integrates no CPU cores. Instead it adopts a design of 8 GPU chiplets (based on the CDNA 3 architecture) plus 4 I/O dies, bringing its total transistor count to 153 billion.
To ease the memory constraints facing AI large language models (LLMs), AMD equipped the chip with 192GB of HBM3 (High Bandwidth Memory) and up to 5.2TB/s of memory bandwidth. In the launch demo a single MI300X ran the 40-billion-parameter Falcon-40B model, and AMD says one chip can hold a model of as many as 80 billion parameters in memory.
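A rough capacity check makes that claim plausible. Assuming 16-bit (2-byte) weights and ignoring activation and KV-cache overhead, an 80-billion-parameter model needs:

$$80 \times 10^{9}\ \text{parameters} \times 2\ \text{bytes/parameter} = 160\,\text{GB} < 192\,\text{GB}$$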
In short, the MI300X can be understood as customized for LLMs: 192GB of HBM3, 5.2TB/s of memory bandwidth, 896GB/s of Infinity Fabric bandwidth, and 153 billion transistors spread across a dozen chiplets built on 5nm and 6nm processes.
HBM is a type of DRAM designed for data-intensive applications that demand high throughput. It works like a data "staging post": it holds data, such as the image data in a frame buffer, until the GPU calls for it.
The biggest difference between HBM and other DRAM is its ultra-high bandwidth. The latest generation, HBM3, delivers up to 819GB/s per stack; a GDDR6 chip tops out at about 96GB/s, and DDR4, the memory commonly attached to CPUs and other host processors, offers only about a tenth of HBM's bandwidth. Such bandwidth has made HBM a core component of high-performance GPUs; NVIDIA's DGX GH200 supercomputing cluster likewise uses HBM3 memory.
The US standards body JEDEC (the Solid State Technology Association) divides DRAM by application into three types: standard DDR, mobile DDR, and graphics DDR, with HBM belonging to the last category.
Over the past 20 years, compute performance has grown rapidly while I/O (read/write) bandwidth has lagged badly behind: the former has increased roughly 90,000-fold, the latter only about 30-fold. The result is the "memory wall": data moves slowly, and moving it consumes a disproportionate amount of energy.
To clear this data-transfer bottleneck, raising memory bandwidth became the problem the industry had to crack. Memory bandwidth is simply the rate at which a processor can read data from, or write data to, memory.
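As a back-of-the-envelope formula, peak bandwidth is interface width times per-pin data rate (real workloads achieve less). Plugging in HBM3's 1024-bit interface at 6.4Gb/s per pin reproduces the 819GB/s figure quoted above:

$$\text{bandwidth} = \frac{\text{bus width} \times \text{data rate}}{8} = \frac{1024\ \text{bits} \times 6.4\ \text{Gb/s}}{8} = 819.2\ \text{GB/s}$$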
GDDR takes the traditional route: individually packaged, tested DRAM chips are mounted on a standard PCB alongside the SoC. The aim is to push higher data rates over relatively narrow data channels, achieving the necessary throughput with good bandwidth and energy efficiency.
In the era of decision-making AI, GDDR's bandwidth could still meet application needs. But with the arrival of generative AI, memory vendors turned to stacking, layering DRAM dies vertically and co-packaging them with the GPU, to solve the bandwidth problem. Thus HBM was born.
Physically, GDDR chips are packaged individually and placed around the GPU on the PCB, while HBM dies are stacked in 3D on a silicon interposer and packaged together with the GPU. Handled this way, HBM2 occupies roughly 94% less board area than GDDR5.
HBM has since been upgraded to HBM3. From HBM1's 1GB of capacity and 128GB/s of bandwidth per stack, it has grown to 64GB of capacity and 819GB/s of bandwidth under the HBM3 standard JEDEC published in January 2022.
Once generative AI found its defining application (OpenAI's release of ChatGPT, built on GPT-3.5), NVIDIA's H100 accelerator, announced in March 2022, fell behind the AMD Instinct MI300X on memory: the MI300X offers 2.4 times the H100's HBM density and 1.6 times its bandwidth.
In terms of capacity, the AMD Instinct MI300X can use 192GB of memory, while a standard NVIDIA H100 tops out at 80GB.
Perhaps AMD still felt that was not enough to match NVIDIA; after all, NVIDIA has yet to launch an accelerator designed squarely for generative AI. So AMD adds that eight Instinct MI300X accelerators can be combined in one system over the AMD Infinity Architecture, with 896GB/s of interconnect bandwidth, giving AI training and inference a higher-compute alternative to NVIDIA.
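Scaled out that way, the aggregate memory alone is substantial. By the 2-bytes-per-parameter estimate above, such a system could hold models several times larger than a single chip can:

$$8 \times 192\,\text{GB} = 1{,}536\,\text{GB of HBM3 per system}$$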
The AMD Instinct MI300X has not yet entered mass production; it will sample in the third quarter of this year at the earliest and officially launch in the fourth quarter.
Competition Among Relatives
AMD CEO Lisa Su said that as language models grow ever larger, supporting ultra-large-scale data requires many GPUs, but with AMD's dedicated accelerator chip, developers would not need nearly as many.
Su also said the potential market for data center AI accelerators will grow from $30 billion this year to more than $150 billion in 2027, a compound annual growth rate above 50%.
AMD has launched an LLM-dedicated accelerator with formidable AI training and inference performance, yet its stock fell 3.61% on the day. Why?
According to people in China's computing-power supply chain, AMD did not disclose a list of major customers for the Instinct MI300 series, which amounts to not directly answering the capital market's speculation about who is buying the chip.
The same observers pointed out that AMD also disclosed neither the cost nor a sales plan for the MI300 series. "Considering the large HBM3 count (24), the large die area, and the tight capacity of TSMC's CoWoS packaging, you get this phenomenon: a chip with powerful specs launches, and the capital market votes with its feet."
CoWoS (Chip-on-Wafer-on-Substrate) is part of 3DFabric, TSMC's portfolio of advanced packaging technologies, which spans front-end 3D chip stacking (TSMC-SoIC, system-on-integrated-chips) and the back-end CoWoS and InFO packaging families. Together they deliver better performance, power, size, and functionality, enabling system-level integration at the chip scale.
The compute demands of large pretrained AI models will drive further development of both advanced packaging and data center construction. For models such as ChatGPT the demand is enormous, and chiplet-based advanced packaging is urgently needed to break past the limits of Moore's Law; it will become an effective means of speeding up data center build-out.
NVIDIA's formula for dominating generative AI rests not only on powerful GPU hardware; the supporting software it gives developers is an equally critical foundation of Jensen Huang's AI empire. AMD has naturally followed suit, promoting its own dedicated AI chip software stack, ROCm, its counterpart to NVIDIA's CUDA.
Such software greatly lowers the threshold for tapping a GPU's performance. Before CUDA, extracting general-purpose compute from a GPU meant working through specialized graphics programming interfaces such as OpenGL; with CUDA, developers can drive the GPU from mainstream languages such as C++ (and, through bindings, Java). CUDA acts as a bridge from everyday application code to the high-threshold world of graphics programming.

From the remarks of AMD President and AI strategy lead Victor Peng, it is clear that AMD began studying NVIDIA's integrated software-and-hardware approach very early on, though "this process is very long. (Of course) in building an open ecosystem of models, libraries, frameworks, and tools, we have made great progress."
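To make that "bridge" concrete, here is a minimal sketch of what GPU programming looks like in CUDA C++. The vecAdd kernel is a hypothetical example for illustration, not code from NVIDIA or AMD:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Each GPU thread adds one pair of elements. No graphics API is involved:
// this is ordinary C++ plus the __global__ qualifier and a launch syntax.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;            // 1M elements
    const size_t bytes = n * sizeof(float);

    // Unified (managed) memory is visible to both CPU and GPU.
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Launch enough 256-thread blocks to cover all n elements.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();          // wait for the GPU to finish

    printf("c[0] = %.1f\n", c[0]);    // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Porting such code to AMD's ROCm is largely mechanical: the HIP API offers near-identical calls (hipMallocManaged, hipDeviceSynchronize) and the same kernel-launch syntax, which is precisely the ecosystem parity Peng describes.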
Peng, AMD's president and AI strategy lead, is himself a reflection of Lisa Su's strategic vision for challenging NVIDIA.
One of Su's means of challenging NVIDIA's dominant position in the generative AI era is acquisition. In 2022, AMD completed its $48.8 billion purchase of Xilinx, the leading maker of programmable chips (FPGAs), whose devices help accelerate tasks such as video compression. Peng was Xilinx's CEO and was "packaged" into the deal, becoming AMD's president.
In addition, AMD plays to its CPU strengths by focusing on APUs (accelerated processing units that pair CPU and GPU), differentiating itself from NVIDIA's flagship A100 and H100 GPUs.
From a market perspective, two strong competitors are better than a single player with monopoly power like NVIDIA. So if challenger AMD's push on integrated software and hardware produces genuine performance and cost advantages, it is not entirely without opportunity.
Just like "Nirvana in Fire" and "Romance of the Three Kingdoms" are wars between relatives, AMD and NVIDIA also have similar dramatic colors.
Reports say Jensen Huang and Lisa Su are in fact relatives: Huang's mother and Su's maternal grandfather are said to be siblings, though the precise degree of kinship is not publicly known.
Huang moved to the United States from Thailand at the age of 9, earned a bachelor's degree in electrical engineering from Oregon State University, and then a master's degree in electrical engineering from Stanford University. After graduating he joined AMD as a chip design engineer, and at 30 he founded NVIDIA.
Su settled in the United States with her parents at the age of 5 and earned a PhD in electrical engineering from the Massachusetts Institute of Technology at 24. She then worked at TI, IBM, and AMD, taking the helm of AMD in 2014. From 2014 to this year (2023), AMD's share price has risen nearly 30-fold under her leadership.