
Nvidia’s Blackwell GPU Offers FP4, Transformer Engine, Sharp

Nvidia has unveiled its new Blackwell GPU generation, which the company says delivers up to 30× the generative AI inference performance of the previous generation. The B200 will supersede the H100 as the state-of-the-art AI accelerator in the data center. The platform, named after game theorist David Harold Blackwell, pairs a second-generation transformer engine, which enables micro-scaling down to 4-bit representation, with much faster communication between GPUs, since exchanging data among the multiple experts in today's largest generative AI models can become a bottleneck.
SAN JOSE, CALIF.—Market leader Nvidia unveiled its new generation of GPU technology, designed to accelerate training and inference of generative AI. The new technology platform is named Blackwell, after game theorist David Harold Blackwell, and will replace the previous generation, Hopper.

“Clearly, AI has hit the point where every application in the industry can benefit by applying generative AI to augment how we make PowerPoints, write documents, understand our data and ask questions of it,” Ian Buck, VP and general manager of Nvidia’s hyperscale and HPC computing business, told EE Times. “It’s such an incredibly valuable tool that the world can’t build up infrastructure fast enough to meet the promise, and make it accessible, affordable and ubiquitous.”
The B200, two reticle-sized GPU die built on a custom TSMC 4NP process node and paired with 192 GB of HBM3e memory, will supersede the H100 as the state-of-the-art AI accelerator in the data center. The GB200, or "Grace Blackwell," is the successor to Grace Hopper: the same Arm-based Grace CPU combined with two B200s. There is also a B100, a single-die version of Blackwell that will mainly be used to replace Hopper systems where the same form factor is required.
B200’s two CoWoS-mounted die are connected by a 10-TB/s NV-HBI (high-bandwidth interconnect) link.

“That fabric is not just a network, the fabric of the GPU extends from every core and every memory, across the two die, into every core, which means software sees one fully coherent GPU,” Buck said. “There’s no locality, no programming differences – there is just one giant GPU.”

B200 will offer 2.5× the FLOPS of H100 at the same precision, but it also supports lower-precision formats, including FP6 and FP4. A second-generation version of the transformer engine reduces precision as far as possible during inference and training to maximize throughput.
Buck described how hardware support for dynamic scaling meant the first-gen transformer engine could dynamically adjust scale and bias while maintaining accuracy as far as possible, on a layer-by-layer basis. The transformer engine effectively “does the bookkeeping,” he said.
“For the next step [in the calculation], where do you need to move the tensor in that dynamic range to keep everything in range? Once you fall out, you’re out of range,” he said. “We have to predict it…the transformer engine looks all the way back, a thousand [operations] back in history to project where it needs to dynamically move the tensor so that the forward calculation stays within range.”
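In rough terms, that bookkeeping resembles a delayed-scaling scheme: track the dynamic range a tensor has actually used over recent steps and pick the next scale factor from that history. The Python sketch below is purely illustrative; the class name, history length and FP8 limit are assumptions, not details of Nvidia's transformer engine.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest representable magnitude in the FP8 E4M3 format

class TensorScaler:
    """Illustrative per-tensor delayed scaling: choose the next scale factor
    from a rolling history of observed absolute maxima, so the forward
    calculation stays inside the FP8 representable range."""

    def __init__(self, history_len=1024):
        self.history_len = history_len
        self.amax_history = []  # recent absolute maxima, newest last

    def update(self, tensor):
        # Record the dynamic range this step's tensor actually used.
        self.amax_history.append(float(np.abs(tensor).max()))
        self.amax_history = self.amax_history[-self.history_len:]

    def next_scale(self):
        # Project the range for the next step from the history and map it
        # onto the FP8 representable range.
        projected_amax = max(self.amax_history) if self.amax_history else 1.0
        return FP8_E4M3_MAX / projected_amax

def quantize_fp8_like(tensor, scale):
    # Scale, then clamp to the representable range (a stand-in for a real FP8 cast).
    return np.clip(tensor * scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)

# Usage: track a layer's activations over several steps, then scale the next one.
scaler = TensorScaler()
for _ in range(10):
    scaler.update(np.random.randn(4, 8) * 3.0)
scaled = quantize_fp8_like(np.random.randn(4, 8) * 3.0, scaler.next_scale())
```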
For the Blackwell generation, the transformer engine has been upgraded to enable micro-scaling not just at the tensor level, but for elements within the tensor. Groups of “tens of elements” can now have different scaling factors, with that level of granularity supported in hardware down to FP4.
“With Blackwell, I can have a separate range for every group of elements within the tensor, and that’s how I can go below FP8 down to 4-bit representation,” Buck said. “Blackwell has hardware to do that micro-scaling…so now the transformer engine is tracking every tensor in every layer, but also every group of elements in the tensor.”
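The difference from per-tensor scaling can be sketched the same way. In the illustrative code below, every group of 32 consecutive elements gets its own scale factor, so one outlier no longer forces the rest of the tensor into a handful of values; the group size, the FP4 E2M1 maximum of 6.0 and the clamp standing in for a real FP4 cast are all assumptions, not Blackwell's hardware format.

```python
import numpy as np

GROUP_SIZE = 32   # "tens of elements" per shared scale; 32 is an assumption
FP4_MAX = 6.0     # largest magnitude of an FP4 E2M1 value

def microscale_quantize(tensor):
    """Illustrative micro-scaling: instead of one scale per tensor, every group
    of GROUP_SIZE consecutive elements gets its own scale factor, so locally
    small values are not lost because of a single global outlier."""
    flat = tensor.reshape(-1, GROUP_SIZE)
    # One scale per group, chosen from that group's own dynamic range.
    group_amax = np.maximum(np.abs(flat).max(axis=1, keepdims=True), 1e-12)
    scales = FP4_MAX / group_amax
    # Scale and clamp each group independently (a stand-in for a real FP4 cast).
    return np.clip(flat * scales, -FP4_MAX, FP4_MAX), scales

def microscale_dequantize(quantized, scales):
    return (quantized / scales).reshape(-1)

# Usage: four groups of 32 elements with wildly different ranges.
x = np.concatenate([np.random.randn(GROUP_SIZE) * s for s in (0.01, 0.1, 1.0, 100.0)])
q, s = microscale_quantize(x)
x_hat = microscale_dequantize(q, s)  # close to x despite the per-group 4-bit-style range
```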

Communication
With B200 boosting performance 2.5× over H100, where do Nvidia’s 25×-30× performance claims come from? The key is in communication for large generative AI models, Buck said.
While earlier generative AI models were single monolithic transformers, today's largest generative AI models use a technique called mixture of experts (MoE). With MoE, layers are composed of multiple smaller expert sub-networks, each more focused on particular tasks, and a router model decides which of these experts to use for any given input at each MoE layer. Models like Gemini, Mixtral and Grok are built this way.
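As a rough illustration of what that routing step does, the sketch below picks the top two of eight toy experts for each token and blends their outputs with the router's weights; the sizes, the dense per-token loop and the linear "experts" are simplifications for readability, not how production MoE models are implemented.

```python
import numpy as np

def moe_layer(tokens, router_w, experts, top_k=2):
    """Illustrative mixture-of-experts layer: a small router scores all experts
    per token, keeps the top_k, and combines their outputs with softmaxed
    router weights."""
    logits = tokens @ router_w                        # (n_tokens, n_experts)
    top = np.argsort(logits, axis=1)[:, -top_k:]      # indices of the chosen experts
    out = np.zeros_like(tokens)
    for t, token in enumerate(tokens):
        chosen = logits[t, top[t]]
        weights = np.exp(chosen) / np.exp(chosen).sum()   # softmax over the chosen experts
        for w, e in zip(weights, top[t]):
            # In a real deployment each expert may sit on a different GPU,
            # which is where the all-to-all communication described below comes from.
            out[t] += w * experts[e](token)
    return out

# Usage: eight tiny "experts," each just a different linear map.
rng = np.random.default_rng(0)
d, n_experts = 16, 8
expert_ws = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_experts)]
experts = [lambda x, W=W: x @ W for W in expert_ws]
tokens = rng.standard_normal((4, d))
router_w = rng.standard_normal((d, n_experts))
y = moe_layer(tokens, router_w, experts)
```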
The trouble is these models are so large that somewhere between four and 32 individual experts are likely being run on separate GPUs. Communication between them becomes the bottleneck; all-to-all and all-reduce operations are required to combine results from different experts. While large attention and feedforward layers in monolithic transformers are often split across multiple GPUs, the problem is particularly acute for MoE models.
Hopper has eight GPUs per NVLink (short-range chip-to-chip communication) domain at 900 GB/s, but when moving from, say, eight to 16 experts, half the communication has to go over InfiniBand (used for server-to-server communication) at only 100 GB/s.
“So if your data center has Hoppers, the best you can do is half of your time is going to be spent on experts communicating, and when that’s happening, the GPUs are sitting idle—you’ve built a billion-dollar data center and at best, it’s only 50% utilized,” Buck said. “This is a problem for modern generative AI. It’s do-able—people do it—but it’s something we wanted to solve in the Blackwell generation.”
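A back-of-the-envelope reading of that 50% figure, using the bandwidth numbers above and an assumed, purely illustrative volume of expert-exchange traffic per step:

```python
# Bandwidth figures are from the article; the traffic volume is an assumption.
NVLINK_GBPS = 900        # GB/s per GPU inside a Hopper NVLink domain
INFINIBAND_GBPS = 100    # GB/s per GPU once traffic leaves the domain
traffic_gb = 1.0         # hypothetical expert-exchange traffic per step, per GPU

# Eight experts, all inside one 8-GPU NVLink domain: everything rides NVLink.
t_8_experts = traffic_gb / NVLINK_GBPS

# Sixteen experts across two domains: roughly half the exchange crosses domains
# and is limited by InfiniBand, which dominates the communication time.
t_16_experts = 0.5 * traffic_gb / NVLINK_GBPS + 0.5 * traffic_gb / INFINIBAND_GBPS

print(f"expert exchange slows by ~{t_16_experts / t_8_experts:.0f}x "
      f"once half the traffic falls onto InfiniBand")   # ~5x
```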

For Blackwell, Nvidia doubled NVLink speeds to 1800 GB/s per GPU and extended NVLink domains to 72 GPUs in the same rack. Nvidia's NVL72 rack-scale system, also announced at GTC, has 36 Grace Blackwells, for a total of 72 B200 GPUs.
Nvidia also built a new switch chip, NVLink Switch, with 144 NVLink ports and a non-blocking switching capacity of 14.4 TB/s. There are 18 of these switches in the NVL72 rack, arranged in an all-to-all network topology, meaning every GPU in the rack can talk to every other GPU at the full bidirectional bandwidth of 1800 GB/s, 18× what it would have been over InfiniBand.
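Those figures hang together arithmetically; a quick check using only the numbers quoted above:

```python
# All constants below are the figures quoted in the article.
PER_GPU_NVLINK_GBPS = 1800     # NVLink bandwidth per Blackwell GPU, GB/s
SWITCH_CAPACITY_GBPS = 14_400  # 14.4 TB/s non-blocking per NVLink Switch
PORTS_PER_SWITCH = 144
SWITCHES_PER_RACK = 18
INFINIBAND_GBPS = 100          # the InfiniBand figure used for comparison

port_gbps = SWITCH_CAPACITY_GBPS / PORTS_PER_SWITCH            # 100 GB/s per port
links_per_gpu = PER_GPU_NVLINK_GBPS / port_gbps                # 18 links per GPU
# Consistent with one link from each GPU to each switch (an inference, not a quoted spec).
assert links_per_gpu == SWITCHES_PER_RACK

speedup_vs_infiniband = PER_GPU_NVLINK_GBPS / INFINIBAND_GBPS  # the quoted 18x
print(port_gbps, links_per_gpu, speedup_vs_infiniband)         # 100.0 18.0 18.0
```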
“We crushed it,” Buck said.
The new switches can also do math. They support Nvidia's Scalable Hierarchical Aggregation and Reduction Protocol (Sharp) technology, which can perform certain types of simple math in the switch. This means the same data does not have to be sent to different endpoints multiple times, which reduces the time spent communicating.
“If we need to add tensors or something like that, we don’t even need to bother the GPUs any more, we can do that in the network, giving it an effective bandwidth for all-reduce operations of 3600 GB/s,” Buck said. “That’s how we get to 30 times faster.”
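Conceptually, in-network reduction looks like the sketch below. The byte accounting compares against a naive endpoint exchange purely for illustration; it is not how Nvidia derives the 3600-GB/s effective-bandwidth figure.

```python
import numpy as np

def allreduce_in_switch(gpu_tensors):
    """Conceptual in-network reduction: each GPU sends its tensor to the switch
    once, the switch performs the addition (the "math in the network"), and
    every GPU receives one finished sum back."""
    reduced = np.sum(gpu_tensors, axis=0)            # done by the switch
    traffic_per_gpu = 2 * gpu_tensors[0].nbytes      # one send up, one result down
    return [reduced.copy() for _ in gpu_tensors], traffic_per_gpu

def allreduce_at_endpoints(gpu_tensors):
    """Naive endpoint exchange for contrast: every GPU ships its tensor to every
    peer and sums locally, so the same data crosses the links many times."""
    n = len(gpu_tensors)
    reduced = np.sum(gpu_tensors, axis=0)            # each GPU computes this itself
    traffic_per_gpu = 2 * (n - 1) * gpu_tensors[0].nbytes   # n-1 sends plus n-1 receives
    return [reduced.copy() for _ in gpu_tensors], traffic_per_gpu

# Usage: four "GPUs," each contributing a gradient tensor to the reduction.
grads = [np.random.randn(1024) for _ in range(4)]
_, switch_bytes = allreduce_in_switch(grads)
_, endpoint_bytes = allreduce_at_endpoints(grads)
print(f"in-switch reduction moves {endpoint_bytes / switch_bytes:.0f}x less data per GPU link")
```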
B200 GPUs can run in a 1000-W power envelope with air cooling, but with liquid cooling they can run at 1200 W. The jump to liquid cooling was not necessarily about wanting to boost the power supplied to each GPU, Buck said.
“The reason for liquid cooling is for the NVL72, we wanted to build a bigger NVLink domain,” he said. “We couldn’t build a bigger PCB, so we built it rack scale. We could do that with multiple racks, but to do fast signaling, we’d have to go to optics…that would be a lot of transceivers. It would need another 20 kW of power, and it would be six times more expensive to do that versus copper, which is a direct connection to the GPU SerDes.”
Copper’s distances are shorter than optics’—limited to around a meter—so the GPUs have to be close together in the same rack.
“In the rack, the two compute trays are sandwiched between the switches, it wouldn’t work if you did a top-of-rack NVLink switch, because the distance from the bottom to the top of the rack would not be able to run with 1800 GB/s or 200 Gb/s SerDes—it’s too far,” Buck said. “We move the NVSwitch to the middle, we can do everything in 200 Gb/s SerDes, all in copper, six times lower cost for 72 GPUs. That’s why liquid cooling is so important—we have to do everything within a meter.”
Trillion-parameter models can now be deployed on a single rack, reducing overall cost. Buck said that, compared with delivering the same performance on Hopper GPUs, Grace Blackwell can do it with 25× less power and at 25× lower cost.
“What that means is that trillion-parameter generative AI will be everywhere—it will democratize AI,” he said. “Every company will have access to that level of AI interactivity, capability, creativity…I am super excited.”