Wallstreetcn
2024.02.21 12:45

Why did Cathie Wood sell off, and why is Groq making waves? Plus the 12 unicorns aiming to "overtake" NVIDIA on the curve.

The soaring popularity of Groq may suggest that the main battlefield of AI chips will shift from training to inference. When more next-generation dedicated inference chips that can replace NVIDIA GPUs emerge, can NVIDIA's "throne" still be maintained?

Cathie Wood, the Wall Street star fund manager and CEO of Ark Investment Management, recently revealed in a media interview that she had cashed out approximately $4.5 million worth of NVIDIA stock. She believes future demand for NVIDIA's chips will not be as hot as expected and that NVIDIA will face more competition.

Wood pointed out that tech giants such as Meta, Amazon, and Alphabet (Google) are developing their own AI chips. These chips are more specialized and purpose-built, whereas NVIDIA's chips are general-purpose.

Just as Wood made this statement, Groq's LPU (Language Processing Unit) chip made a stunning appearance, claiming to be "100 times more cost-effective than NVIDIA": roughly 10 times faster than NVIDIA's chips, at only one-tenth the price and power consumption. Together with Google's self-developed TPU AI chip, this has many people saying that a serious competitor to NVIDIA has finally emerged.

According to Groq's official website, the LPU is a chip designed specifically for AI inference. Unlike NVIDIA GPUs, which rely on high-bandwidth memory (HBM) for high-speed data transfer, Groq's LPU does not use HBM at all; it uses on-chip SRAM instead, which Groq says is about 20 times faster than the memory used by GPUs.

Groq even challenged NVIDIA, stating that the LPU can replace the GPU in executing inference tasks. Through its specialized design, the LPU can provide optimized performance and energy efficiency for specific AI applications.

The cost-effectiveness of Groq is the key issue

Jia Yangqing, former Vice President of Technology at Alibaba, argued in a post that matching the H100's throughput requires far more cards: a single LPU card carries only 230 MB of memory and sells for over $20,000, so running the Llama 2 70B model takes 305 Groq cards, compared with just 8 H100 cards.

At current prices, that means Groq's hardware cost is roughly 40 times that of the H100 at the same throughput, and its energy cost roughly 10 times. Over three years of operation, Groq's hardware procurement would cost $11.44 million and its operation $762,000; by comparison, 8 H100 cards cost $300,000 in hardware and $72,000 to operate.
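To make the arithmetic concrete, the quoted figures can be plugged into a short back-of-envelope script. This is only a sketch using the numbers from Jia Yangqing's post as quoted above, not Groq's or NVIDIA's own accounting:

```python
# Back-of-envelope check of the 3-year cost comparison,
# using only the figures quoted above from Jia Yangqing's post.
groq_hw_3yr = 11_440_000   # Groq hardware procurement over three years
groq_op_3yr = 762_000      # Groq operating cost over three years
h100_hw_3yr = 300_000      # hardware cost of 8 H100 cards
h100_op_3yr = 72_000       # operating cost of 8 H100 cards over three years

print(f"Hardware cost ratio:  {groq_hw_3yr / h100_hw_3yr:.0f}x")   # ~38x, the "roughly 40x" claim
print(f"Operating cost ratio: {groq_op_3yr / h100_op_3yr:.1f}x")   # ~10.6x, the "roughly 10x" claim
```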

However, Zuo Pengfei, one of Huawei's "genius youth" recruits, pushed back against Jia Yangqing's argument on Zhihu, saying that people are conflating selling price with cost price: inference cost should be calculated from the cost price of the cards Groq produces itself. Although a Groq card sells for $20,000, its cost price is very low, estimated at around $1,200 per card. Zuo Pengfei bluntly noted that the biggest cost of a GPU lies in its HBM, which Groq has dropped entirely, so the card's cost can be estimated from the cost of its SRAM. At $1,200 per card, 500 Groq cards cost about $600,000 in total, a budget that buys only about two 8-card H100 servers. Can two H100 servers deliver 500 tokens/s?
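Zuo Pengfei's rebuttal can be sketched the same way; the per-card cost below is his estimate, and the 8-card H100 server price is Jia Yangqing's figure from above:

```python
# Sketch of the cost-price argument, using the estimates quoted above.
groq_cards = 500
groq_cost_per_card = 1_200                       # Zuo Pengfei's estimated cost price per card
groq_total = groq_cards * groq_cost_per_card     # $600,000

h100_server_price = 300_000                      # 8-card H100 system, per Jia Yangqing's figure
servers_affordable = groq_total // h100_server_price

print(f"Groq deployment at cost price: ${groq_total:,}")           # $600,000
print(f"Buys about {servers_affordable} eight-card H100 servers")  # ~2 servers (16 cards)
```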

How does Groq reduce costs?

Analysts point out that, at today's hardware prices, Groq's LPU servers are undoubtedly far more expensive than NVIDIA's. On one hand, as Zuo Pengfei noted, the official price of the LPU card is inflated; on the other, a key point is that the LPU architecture is better suited to scenarios with many concurrent user requests, where high throughput and high concurrency sharply reduce the cost of each individual request.

This leads us to the unique architecture of the LPU, which is different from traditional CPU and GPU architectures. Groq has designed a Tensor Streaming Processor (TSP) architecture from scratch to accelerate complex workloads in artificial intelligence, machine learning, and high-performance computing.

Each TSP also has built-in networking, allowing it to exchange data directly with other TSPs without relying on external networking equipment, which strengthens the system's parallel processing capability and efficiency.

Groq's lightning-fast response quickly sparked discussions on the internet, and the key to its speed lies in the fact that the LPU does not use HBM but instead utilizes SRAM. This design also significantly increases throughput.

This means the LPU does not need to repeatedly load data from external memory the way GPUs load it from HBM. That not only helps sidestep HBM shortages but also effectively reduces costs.
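A rough way to see why this matters: in single-stream decoding, every generated token must stream essentially all of the model's weights through the compute units, so token throughput is roughly bounded by weight bandwidth. The sketch below uses illustrative, assumed bandwidth figures rather than vendor specifications, purely to show the shape of the argument:

```python
# Illustrative estimate: batch-1 LLM decoding is roughly memory-bandwidth bound,
# so tokens/sec ~= effective weight bandwidth / bytes of weights per token.
# The bandwidth numbers below are assumptions for illustration, not vendor specs.

def decode_tokens_per_sec(params: float, bytes_per_param: float, bandwidth: float) -> float:
    """Each generated token must read (roughly) all model weights once."""
    return bandwidth / (params * bytes_per_param)

params = 70e9            # a Llama 2 70B-class model
bpp = 2                  # FP16/BF16 weights

hbm_bw = 3.3e12          # assumed ~3.3 TB/s for one HBM-based GPU
sram_bw = 80e12          # assumed aggregate on-chip SRAM bandwidth of a sharded LPU cluster

print(f"HBM-bound single GPU: ~{decode_tokens_per_sec(params, bpp, hbm_bw):.0f} tokens/s")
print(f"SRAM-bound cluster:   ~{decode_tokens_per_sec(params, bpp, sram_bw):.0f} tokens/s")
```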

And because AI inference moves far less data than model training, Groq's LPU is also more energy-efficient: when running inference tasks, it reads less data from external memory and consumes less power than an NVIDIA GPU.

However, some analyses point out that Groq's blistering speed is built on very limited single-card capacity. SRAM also has two drawbacks: it is large in area and power-hungry. At the same capacity, SRAM takes up 5 to 6 times the area of DRAM, and area translates into cost, so per unit of capacity SRAM is anything but cheap.
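The capacity constraint behind the card counts cited above is easy to quantify. A minimal sketch, assuming standard weight precisions and ignoring KV cache and activations:

```python
# How many 230 MB-SRAM cards does it take just to hold a 70B-parameter model's weights?
sram_per_card = 230e6                    # 230 MB of on-chip SRAM per LPU card

fp16_bytes = 70e9 * 2                    # ~140 GB of weights at 2 bytes/parameter
int8_bytes = 70e9 * 1                    # ~70 GB at 1 byte/parameter

print(f"FP16 weights: ~{fp16_bytes / sram_per_card:.0f} cards")  # ~609 cards
print(f"INT8 weights: ~{int8_bytes / sram_per_card:.0f} cards")  # ~304, close to the 305-card figure above
```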

Will the main battlefield of AI chips shift from training to inference?

Before the Groq LPU appeared, large-model training and inference were both built around NVIDIA hardware and the CUDA software stack. Groq's sudden popularity has led the market to speculate that the main battlefield of AI chips will shift from training to inference. AI workloads fall into two main categories: training and inference. Training places heavy demands on raw computing power and memory capacity and is relatively insensitive to access speed. Inference is entirely different: the model must run at extreme speed in order to deliver as many tokens as possible to end users, shortening the response time to each request.

Some analysts believe the AI inference market will grow significantly in the coming year. Compared with training, inference sits much closer to end-user scenarios: trained large models have to be deployed through inference services before they can serve real applications. Today's NVIDIA-based inference solutions are still costly, and their performance and latency weigh on the user experience.

Groq's sudden rise this time also owes to the launch of cloud services running the Llama 2 and Mistral models on Groq hardware. If the Groq LPU inference chip can solve the performance and cost problems at the hardware level and enable large-scale deployment of AI inference services, more AI inference applications may land in the future.

Analysis suggests that the Ampere architecture used in NVIDIA's A100 supports a wide range of computing tasks, including but not limited to machine learning. The A100's Tensor Core technology and support for multiple data types do provide powerful acceleration for deep learning, but the TSP's dedicated optimizations may deliver better performance and energy efficiency on machine-learning tasks.

Given the TSP's high energy efficiency, Groq chose to build it as an application-specific integrated circuit (ASIC). ASICs are highly optimized for a specific application or algorithm, trading generality for the best possible performance, power consumption, and die area. Because they are designed to perform one task or a group of related tasks, they are usually more efficient at those tasks than general-purpose chips, especially in inference.

Data shows the data-center custom-chip market is currently worth about $30 billion. As more next-generation dedicated inference chips capable of replacing NVIDIA GPUs enter data centers, and as growth in cloud AI training chips gradually slows, this may also be a major reason NVIDIA itself is starting to move into the custom-chip business.

Inference chip companies are trying to carve out a piece of the pie from NVIDIA's vast market

The media has compiled a list of the 12 companies currently at the forefront of this race. These startups are on average only five years old, and the best-funded among them has raised $720 million. In the future, any of them could become a serious challenger to NVIDIA's "throne".