Did NVIDIA, the explosive player, "blow itself up"?

Huxiu
2024.08.05 10:47

After NVIDIA's market value soared and then fell, analysts warned investors to hit the brakes. Chip design issues and customer defections are challenging NVIDIA, and problems at the server productization stage may further affect NVIDIA's performance. Analysts believe that top customers' spending on NVIDIA's GPU products will weaken, and that NVIDIA's stock price may see a double-digit decline. In addition, negative rumors surround NVIDIA's Blackwell architecture chips, including low yield rates, shipping delays, and other issues. Although the problems have been identified, a re-spin is needed.

Tencent Technology, Author: Leslie Wu (Former TSMC Factory Expert), Editor: Su Yang, Cover Image Source: Visual China

NVIDIA, which frequently experiences market volatility, failed to maintain its $3 trillion market value.

On June 19th Beijing time, NVIDIA's market value reached $3.335 trillion, surpassing Microsoft and Apple to become the global leader. After this moment of glory, NVIDIA's market value began to decline, with a 26% decrease as of the close on August 2nd.

Prior to this, analysts had already advised investors to "hit the brakes." Citing the views of DA Davidson analyst Gil Luria, Daily Economic News reported that NVIDIA's record-breaking performance of $26 billion was driven by top customers' spending on its GPU products. He believes that this trend will weaken in the future, and that NVIDIA's stock price will experience a double-digit decline within 18 months.

In the eyes of analysts like Gil Luria, top customers have already shown "hesitation," and NVIDIA's own "mistakes" have opened a window for customers to defect and for competitors to seize opportunities. It all starts with negative rumors about the Blackwell architecture chip, including low CoWoS yield, abandonment of the B100 SKU, delayed shipments of the B200, and a re-spin to fix key issues.

According to internal sources at TSMC, the news that NVIDIA is re-spinning the Blackwell chip is true, but it mainly involves the B100 series base chip. The issue lies in the underlying standard cells, which are pre-designed circuit modules with a fixed function and size. If chip design is thought of as building with blocks, the standard cell is the smallest block. Abnormal behavior occurs under high pressure; the problem has been identified and requires a redesign of the mask.

However, the overall wafer manufacturing time from wafer-in to wafer-out cannot be shortened. Fortunately, only small batches were due to ship in 2024, and that is not when Blackwell servers ship. Recovering the small-batch schedule by expanding capacity before the end of this year is, from my personal industry experience, not difficult for TSMC.

I. Taking the blame for delayed shipments

The claim that the B100 was abandoned and that B200 shipments were delayed by a re-spin is a one-sided reading of the "Blackwell chip delay incident," and it stems from NVIDIA's complex naming.

The Blackwell series includes two base chips, B100 and B102. SKUs including the B200/GB200 are based on the B100 series chiplet solution, while the B200A is based on B102.

To facilitate understanding, a table has been prepared for everyone to compare the basic chips B102 and B100, as well as the corresponding server SKUs. For different server applications, more styles can be combined, such as HGX B200A / HGX B200 / NVL36/72, or even the air-cooled versions of NVL8 or GB210A.

The naming of Blackwell chips and various SKUs has caused confusion among outsiders, which is understandable. However, the statement "CoWoS yield is only 66%, with only 10 Good die per wafer" is out of touch with common sense.

Let's briefly explain the concept of "yield" in the front-end and back-end of wafer manufacturing.

For the front-end GPU die, NVIDIA, like Apple, Qualcomm, and AMD, is using the mature N4P process this time, so there is no need to worry about yield at all.

In back-end packaging, especially the "oS" part of CoWoS, the package includes not only the GPU die but also HBM memory, and the 8 HBM dies are themselves very expensive. If the GPU die fails, the entire package becomes scrap. Production therefore cannot proceed at a yield below 80%, because costs would escalate out of control and profit margins could not be guaranteed. At a yield of 66%, production simply would not happen.

In terms of managing the risk of abnormal yield in manufacturing, a fabless company, whether NVIDIA, Apple, or another, cannot stake everything on a new solution. If the new solution runs into problems, the entire product generation could be scrapped, which is too risky. Therefore, when orders are placed, there are always alternative plans in place. In other words, even if there are issues with CoWoS-L yield, they will not affect the shipment of Blackwell chips.

For example, Apple's A18 chip next year may adopt TSMC's new 2nm process, but they will also have a backup plan using the N3P process to ensure no losses. NVIDIA will naturally do the same.

According to the data we have obtained, the current yield of Blackwell using CoWoS-L packaging is around 90% and still climbing, which is consistent with the findings of the Nomura team, whose CoWoS research is the most thorough in the industry. In addition, at the beginning of the year, TSMC's expected yield for CoWoS-L was 95%. Compared with products using CoWoS-S packaging, such as the H200 and H100, which yield 99%, 90% is indeed a poor performance. For a new process, however, it can be grudgingly accepted.
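To make the yield argument concrete, here is a minimal sketch of how packaging yield translates into cost per good package. It assumes that every failed package is scrapped in full (GPU die plus HBM), so the cost per good package scales as 1/yield. The function and dictionary names are illustrative, and the 80% entry is the break-even threshold stated earlier rather than a measured yield.

```python
# Relative cost of one good CoWoS package as a function of packaging yield.
# Assumption: a failed package is scrapped in full (GPU die plus HBM),
# so cost per good package = cost per attempt / yield.

def cost_multiplier(yield_rate: float) -> float:
    """Return cost per good package relative to a hypothetical 100% yield."""
    if not 0.0 < yield_rate <= 1.0:
        raise ValueError("yield_rate must be in (0, 1]")
    return 1.0 / yield_rate

# Yield figures quoted in the article (labels are illustrative).
yields = {
    "rumored CoWoS-L": 0.66,
    "minimum viable": 0.80,
    "current CoWoS-L": 0.90,
    "TSMC CoWoS-L target": 0.95,
    "CoWoS-S (H100/H200)": 0.99,
}

multipliers = {name: cost_multiplier(y) for name, y in yields.items()}

for name, m in multipliers.items():
    print(f"{name:22s} yield={yields[name]:.0%}  cost x{m:.2f}")
```

On this simplified model, a 66% yield makes each good package roughly 1.5 times as expensive as at perfect yield, against about 1.1 times at 90%, which is why the rumored figure is hard to square with a product that is actually in production.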

The current yield of CoWoS-L is therefore indeed below expectations. But it is the standard cell issue in the front-end GPU die that requires a mask redesign and prevents Blackwell chips from being produced smoothly, which in turn indirectly idles CoWoS-L capacity in the back end. In summary, the claim that a significant anomaly in CoWoS-L yield is what hinders the smooth shipment of Blackwell chips deviates from both the facts and industry norms.

In fact, before the re-spin of the B100 series base chip came up, NVIDIA had already made adjustments because CoWoS-L yield was below 95%. For the B200A, which uses the B102 base chip, it switched to CoWoS-S packaging. The original aim was to relieve capacity pressure on CoWoS-L and ensure that more Blackwell chips are produced by 2025. This adjustment can also help NVIDIA make up for the delay caused by the GPU die design problem and increase overall Blackwell shipments in 2025.

II. Who is squeezing NVIDIA's "neck"?

There has been much discussion in the past about NVIDIA squeezing the neck of computing power, but NVIDIA's own "neck" is actually being squeezed by upstream suppliers, such as HBM memory makers.

It should be noted that the supply of HBM and of liquid-cooling QCD quick-connect modules is currently relatively tight. However, tight supply will not delay shipments; at most it will reduce shipment volume. Moreover, the delivery progress of these scarce components is still assured at this stage. For example, Samsung has already been confirmed as joining NVIDIA's HBM supplier system.

What will truly affect the shipment of Blackwell chips are the later milestones in productizing the various servers.

According to supply-chain information, it is not only chips that have entered the production stage, but also board components, switching equipment, racks, cooling solutions, and more.

Expanding from an 8-card cabinet to a 72-card cabinet involves considerations such as network bandwidth convergence, as well as various parallel strategies (model data segmentation, segmented calculation, copying and reorganization) for the optimal operation of the entire cabinet. In addition, with more trays and higher density, the complex issues of internal wiring quantity, high-speed switching, and heat dissipation mean that the racks also need to be redesigned, which should currently be in the testing phase.

Since the NVL36/72 servers are all new technical solutions, the completeness and integration of every subsystem is also one of the risk points. While outside attention has focused on performance in the past, the maturity and reliability of the overall system are also key factors in judging whether this generation of products succeeds.

With the water-cooled GB200 series, leakage must also be considered. It mainly involves four components: the water-cooled plate, the manifold, the CDU (coolant distribution unit), and the QCD quick connector. Of these, the quick connector is the most prone to leakage, which makes it the biggest headache for server manufacturers: its quality is crucial and bears directly on who is held liable. In general, if leakage occurs, NVIDIA will compensate customers first and then claim against system manufacturers like Foxconn and Pegatron. An AI server rack can cost millions of dollars, and compensation for leakage could potentially bankrupt a small enterprise.

From the information we have received, currently NVIDIA is still conducting water-cooling tests with system manufacturers like Foxconn and Pegatron, and mass adoption has not yet taken place.

As mentioned earlier, whether they are chip manufacturers, system manufacturers, or cooling manufacturers, no company facing potential compensation in the millions of dollars is willing to take on such risk lightly. Actual mass deployment can only happen after real-world implementation, once there are "guinea pigs" to go first.

III. Will NVIDIA "flip over"?

As mentioned at the beginning of the article, NVIDIA's market value has dropped from a historical high of over $3.3 trillion to the current $2.6 trillion, a decline of over 26%. During the release of the first-quarter report, NVIDIA confidently forecasted second-quarter revenue of $28 billion, with an error margin within ±2%.

Now, with design issues in the GPU die, CoWoS packaging yield below the expected 95%, and various server technical solutions not yet finalized, the smooth shipment of the Blackwell chip will be affected. Will these issues escalate further and knock NVIDIA out of the $2 trillion market value club?

It can be said that there won't be significant issues in the short term. The key point is that the Blackwell chip is scheduled for small-scale production in the third quarter and will ramp up in the fourth quarter. This is just TSMC's production schedule. After completing the production of the GPU die, the next steps involve CoWoS, followed by the Bumping factory, and finally assembly by companies like Foxconn and Wistron, leading to server shipments and performance realization.

In short, server shipments will impact NVIDIA's revenue, not the chip shipments from TSMC.

At the current pace, large-scale server deliveries are expected at the earliest in the first quarter of 2025. In other words, NVIDIA will only see a significant business increment on the Blackwell chip in the first quarter of next year. This aligns with the market's existing reasonable expectations and will not be reflected in the second quarter, or even the third-quarter performance.

For NVIDIA, discovering the design issue in the third quarter, providing a fix, and then running a super hot run at TSMC points to the middle to late fourth quarter, around November-December. By then, this portion of capacity will already be reserved, and production can continue for about 3 months. Whether for N4P or CoWoS-S/L, TSMC's capacity is currently sufficient, with utilization of up to 120%, so making up for chips that were originally planned for small-scale shipment in the third quarter but were delayed by the design flaw poses basically no difficulty. In other words, on an annual basis, Blackwell shipments will be reduced this year, but not significantly.

For NVIDIA and the entire downstream industry chain, the chip issue has now been exposed, and the various server subsystems must still go through a range of real-world environmental tests. The good news is that the chips currently produced only have problems in specific high-pressure environments, so they can be handed to server system manufacturers such as Foxconn for tuning and testing. The server subsystems will still have half a year, once the chips arrive, to simulate the various environmental tests. Bulk shipments will ultimately fall in February-March 2025.

Given the current situation, with a flood of H200 shipments in the second quarter as the backdrop, results are likely to meet guidance and exceed expectations. Moreover, the main revenue in 2023 is the H200 series. As mentioned earlier, small-batch Blackwell shipments this year will be slightly reduced from the original plan, to approximately 20,000 wafers (CoWoS-L cut from 41K to less than 20K). Converted into NVIDIA's results, this is estimated at around 8-9.5 billion US dollars. However, with emergency measures, namely incremental H-series sales and re-spun B-series wafers to lift capacity, the loss this time will be around 5 billion US dollars. This may show up in the fourth-quarter financial report, and it will certainly affect the stock price; after all, it is a product failure.
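The figures in this estimate can be laid out as simple arithmetic. The sketch below is one reading of them, not a model of NVIDIA's accounting: it assumes that the 8-9.5 billion dollar figure is the gross revenue effect of the reduced wafer count and that the roughly 5 billion dollar figure is what remains after the emergency measures. All variable names are illustrative.

```python
# Figures quoted in the article, in billions of US dollars.
gross_impact = (8.0, 9.5)  # estimated revenue effect of the reduced wafer count
net_loss = 5.0             # loss remaining after the emergency measures

# Assumption: gross impact minus mitigation equals the net loss,
# so the mitigation implied by the article's numbers is the difference.
implied_mitigation = tuple(g - net_loss for g in gross_impact)

# CoWoS-L wafer plan for 2024, in thousands of wafers.
planned_wafers_k = 41
revised_wafers_k = 20      # the article says "less than 20K", so an upper bound
min_wafer_cut_k = planned_wafers_k - revised_wafers_k

print(f"Implied mitigation: {implied_mitigation[0]:.1f}-{implied_mitigation[1]:.1f} billion USD")
print(f"CoWoS-L wafer cut: at least {min_wafer_cut_k}K wafers")
```

Read this way, incremental H-series sales and the re-spun B-series wafers would need to recover roughly 3 to 4.5 billion dollars for the net loss to land near 5 billion.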

Compared with the "product failure" of the Blackwell chip itself, an issue more worth thinking about is that NVIDIA launches new SKUs every year, each requiring many innovative technologies. This pace is very fast. If there is not enough time to optimize and improve reliability, some product could fail completely in the next few years. This is the part of NVIDIA's development logic that we need to re-examine, and it is also the opportunity competitors are eagerly waiting for.

From a more macro perspective, although NVIDIA's growth logic over the past two years holds up, its longer-term development carries increasing risks. These risks show up not only in the aggressive technological iteration of each generation but also in application-side and follow-on demand, which is simply the well-known "AI bubble," and in whether strong competitors will emerge with new technologies, such as new chip technologies or upstream companies that have mastered large models starting to develop their own chips.

In the past few days, there have indeed been many reports about major companies in China and the United States all embarking on self-development. As a side note, OpenAI's self-developed chip project has almost reached an agreement with TSMC.
