Track Hyper | The Evolution of Intel Chip Design
The unique AI card Gaudi 3 and the transformative Xeon 6
Author: Zhou Yuan / Wall Street News
On September 25th, Intel officially launched the Gaudi 3 AI accelerator card and the "Granite Rapids" Xeon 6 server CPU.
Gaudi 3 competes with Nvidia's H100 and AMD's Instinct MI300 in generative AI (GenAI) and HPC workloads; Xeon 6 is designed for artificial intelligence and high-performance computing scenarios.
Weaker on Paper, Stronger in Practice?
In April of this year, Intel announced that Gaudi 3 would launch in the fourth quarter. Now the market can finally see the actual performance of the AI accelerator card Intel has spared no effort to develop, and judge how strong it really is. After all, Intel is counting on this chip to compete head-on with Nvidia's popular H100.
Apart from the somewhat puzzling choice of HBM2E (third-generation HBM) for memory, where the H100 uses HBM3 (the H100 SXM5 was the world's first GPU with HBM3, delivering up to 3 TB/s of memory bandwidth), the other upgrades look very impressive, at least on paper.
Gaudi 3 is built on TSMC's 5nm process with two chiplets. Each chiplet carries 4 MMEs (Matrix Multiplication Engines, 8 in total), each a 256x256 MAC array with FP32 accumulators, and 32 Tensor Processor Cores (TPCs, 64 in total). SRAM cache capacity doubles to 96MB and its bandwidth doubles to 19.2TB/s; HBM2E capacity grows from 96GB to 128GB (8 stacks), with 3.7TB/s of memory bandwidth.
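Those MME figures line up with the headline throughput numbers quoted below. As a back-of-the-envelope check (a sketch: the ~1.75 GHz MME clock is our assumption, not an Intel-published figure, and each MAC counts as two floating-point operations):

```python
# Back-of-the-envelope peak MME throughput for Gaudi 3.
# Assumption: ~1.75 GHz MME clock (not officially published).
NUM_MMES = 8              # 4 per chiplet x 2 chiplets
MACS_PER_MME = 256 * 256  # one 256x256 MAC array per MME
FLOPS_PER_MAC = 2         # each MAC = one multiply + one add
CLOCK_HZ = 1.75e9         # assumed MME clock frequency

peak_tflops = NUM_MMES * MACS_PER_MME * FLOPS_PER_MAC * CLOCK_HZ / 1e12
print(f"Peak MME throughput: {peak_tflops:.0f} TFLOPS")  # ~1835 TFLOPS
```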
Compared to its predecessor Gaudi 2, Gaudi 3 also marks a significant step up in physical configuration: Gaudi 2 used TSMC's 7nm process, with 24 TPCs, 2 MMEs, and 96GB of HBM2E high-bandwidth memory. For reasons unknown, however, Gaudi 3 supports only FP8 matrix operations and BFloat16 matrix and vector operations, dropping support for FP32, TF32, and FP16.
On raw numbers, Gaudi 3's MME and vector BF16 figures fall short of Nvidia's H100.
Gaudi 3's MME delivers 1835 TFlops (1.835 quadrillion operations per second) in both BF16 and FP8, and its vector BF16 reaches 28.7 TFlops (28.7 trillion operations per second), gains that Intel puts at 3.2 times, 1.1 times, and 1.6 times over Gaudi 2. Nvidia's H100, by comparison, is rated at 1979 TFlops for BF16 (above Gaudi 3's 1835 TFlops) and 3958 TFlops for FP8. On these core performance parameters, the gap between Gaudi 3 and the H100 is visible to the naked eye. Yet Intel claims Gaudi 3 leads the H100 by 50% in LLM inference performance, is 40% faster in training time, and offers twice NVIDIA's overall price-performance.
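To put the paper gap in numbers, a quick sketch using only the figures above:

```python
# Ratio of H100's rated MME throughput to Gaudi 3's (figures quoted above).
gaudi3 = {"BF16": 1835, "FP8": 1835}  # TFLOPS
h100 = {"BF16": 1979, "FP8": 3958}    # TFLOPS

for precision in gaudi3:
    ratio = h100[precision] / gaudi3[precision]
    print(f"{precision}: H100 rated at {ratio:.2f}x Gaudi 3")
# BF16: ~1.08x; FP8: ~2.16x; on paper, FP8 is where the gap is widest
```

And yet Intel says the slower-on-paper card wins on real workloads.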
How did they achieve this? Is it because Intel's software capabilities (especially software development tools) and AI ecosystem are stronger than NVIDIA's? After all, strong software capabilities and a complete ecosystem are needed to fully unleash the potential of hardware performance.
Intel has not provided much explanation for this. Its claim of being stronger than NVIDIA is based solely on a few PowerPoint slides. Therefore, whether it can truly deliver as Intel claims still needs to be verified by the market and time.
The one thing that lends real weight to Intel's confidence against NVIDIA is price. Earlier this year, Intel said the AI accelerator kit built on eight Gaudi 3 cards is priced at $125,000, or roughly $15,625 per Gaudi 3. By comparison, the H100 currently sells for about $30,678, putting Gaudi 3 at 50.93% of the H100's price.
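Combine the list prices with Intel's claimed inference lead and you can see roughly where a "twice the price-performance" headline could come from (a sketch using only the numbers above; the arithmetic is only as good as those inputs, and street prices vary):

```python
# Rough price-performance arithmetic from the figures quoted above.
gaudi3_price = 125_000 / 8  # $15,625 per card (8-card kit at $125,000)
h100_price = 30_678         # quoted per-card price

price_ratio = gaudi3_price / h100_price
print(f"Gaudi 3 costs {price_ratio:.2%} of an H100")  # ~50.93%

# If Intel's claimed 1.5x LLM inference lead held up in practice:
claimed_lead = 1.5
print(f"Implied inference per dollar: {claimed_lead / price_ratio:.1f}x H100")
```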
At least Intel's top management also recognizes the value of the ecosystem, and its understanding there looks fairly complete.
Justin Hotard, Intel Executive Vice President and General Manager of the Data Center and Artificial Intelligence Group, said: "Demand for AI is driving a massive transformation in the data center, and the industry is demanding choice in hardware, software, and development tools. With the launch of Xeon 6 with P-cores and Gaudi 3 AI accelerators, Intel is enabling an open ecosystem that allows our customers to implement all of their workloads with greater performance, efficiency, and security."
As that statement suggests, Intel's AI accelerator ecosystem is still being built. On the software side, Gaudi 3 integrates with the PyTorch framework and supports Hugging Face transformer and diffusion models out of the box; Gaudi 3 will also be offered on IBM Cloud and Intel Tiber Developer Cloud.
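In practice, "PyTorch support" means existing model code targets Gaudi through Intel's Habana PyTorch bridge rather than CUDA. A minimal sketch, assuming the Gaudi software stack and its habana_frameworks package are installed (the device name and mark_step call follow Intel's public Gaudi documentation; treat the details as illustrative):

```python
# Minimal sketch: running a PyTorch module on a Gaudi device ("hpu").
# Assumes Intel's Gaudi software stack (habana_frameworks) is installed.
import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device

device = torch.device("hpu")

# BF16 matches the matrix datatypes Gaudi 3 actually supports (see above).
model = torch.nn.Linear(1024, 1024).to(device, dtype=torch.bfloat16)
x = torch.randn(8, 1024, dtype=torch.bfloat16, device=device)

y = model(x)        # ops are queued for the accelerator
htcore.mark_step()  # flushes queued ops in Gaudi's lazy-execution mode
print(y.shape)      # torch.Size([8, 1024])
```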
In addition, the Gaudi 3 accelerator comes in three deployment forms:
- OAM 2.0-compliant mezzanine card: 900W peak power with passive cooling, 1200W with liquid cooling;
- HLB-325 universal baseboard: power consumption undisclosed;
- HL-338 add-in card: PCIe 5.0 x16 interface, 600W peak power with passive cooling.
Systems based on Intel Gaudi 3 from Dell, HPE, and Supermicro will roll out broadly in the fourth quarter of this year, with Dell and Supermicro systems shipping in October and HPE systems in December.
A Design Approach Aligned with MediaTek?
On the same day, Intel also released the long-awaited but lately overlooked "Granite Rapids" Xeon 6 CPU: on September 25th, the high-end parts of the "Granite Rapids" server CPU line finally made their debut. The market believes that the combination of "Granite Rapids" Xeon 6 and the "Sierra Forest" Xeon 6 released in June this year can still compete, at least stemming Intel's losses in the IDC (data center) market.
The outcome is admittedly not glorious: launching a powerful new processor merely to stem losses is somewhat disappointing. Even so, it is hard for competitor AMD to surpass Intel across technology, cost, performance, and market reach all at once, so stemming losses is a relatively acceptable result for Intel.
Because the packaging and architecture of the Xeon 6's E-core (efficiency) and P-core (performance) variants were already disclosed publicly at Hot Chips 2023, the real news on September 25th was the performance uplift of "Granite Rapids" Xeon 6. At the very least, it has boosted Intel's confidence and offered the market a glimmer of hope: an improvement in Xeon 6's design.
Design quality determines final performance. The hardest part of chip design is the trade-offs, which turn on the chip's positioning, performance targets, process technology, cost, competition, and market demand.
For example, MediaTek's core consideration in designing flagship chips is to raise performance while keeping power consumption relatively low, whereas Qualcomm chases peak performance and does not balance power as aggressively as MediaTek, hence the heat complaints that once dogged Snapdragon flagships.
Intel's design thinking here is somewhat similar to MediaTek's. IPC (Instructions Per Clock), for example, is often treated as the key measure of CPU performance. Should a designer then push IPC ever higher without limit?
Not without minding the power budget. Desktops and servers tolerate higher power draw than phones, but energy costs still weigh on the decision. How should one choose?
Ronak Singhal, Intel Senior Fellow and chief architect of the Xeon 6 product line, recently addressed exactly this topic. His core point: Xeon 6's design philosophy is to cut energy consumption while preserving as much performance as possible, rather than chasing IPC at any cost.
That philosophy shows in the results: in "Granite Rapids" Xeon 6, Intel raised the core count from the previous flagship's 56 P-cores to 128, roughly a 2.3-fold increase, while peak power rose only to 500W, about a 1.4-fold increase.
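The trade-off becomes concrete with rough arithmetic. A sketch using the figures above (the 350W baseline TDP for the previous 56-core flagship is our assumption, chosen to be consistent with the quoted 1.4-fold power increase):

```python
# Per-core power budget across generations (baseline TDP is assumed).
prev_cores, prev_tdp = 56, 350.0   # assumed ~350W for the 56-core flagship
new_cores, new_tdp = 128, 500.0    # Granite Rapids Xeon 6 top SKU

print(f"Core count: {new_cores / prev_cores:.1f}x")  # ~2.3x
print(f"Peak power: {new_tdp / prev_tdp:.1f}x")      # ~1.4x

per_core_prev = prev_tdp / prev_cores  # ~6.3 W per core
per_core_new = new_tdp / new_cores     # ~3.9 W per core
print(f"Per-core budget: {per_core_prev:.1f}W -> {per_core_new:.1f}W "
      f"({1 - per_core_new / per_core_prev:.0%} lower)")
```

Each core now has to do its work on roughly 38% less power than before, which is precisely why the architects chose efficiency over an all-out IPC chase.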
Overall, Xeon 6 has plenty of performance headlines, such as the Ultra Core Count (UCC) variant, the Xeon 6 6900P, which carries up to 504 MB of L3 cache, far beyond typical Intel cache capacities. But Xeon 6 also has some curious design choices, such as top variants that support only one- and two-socket servers rather than four- and eight-way configurations, a choice as puzzling as Gaudi 3's use of HBM2E.