Four Solutions for Inference Chips, Written by David Patterson

Wallstreetcn
2026.01.19 07:13

Recently, an article titled "Challenges and Research Directions for Large Language Model Inference Hardware," co-authored by Xiaoyu Ma and David Patterson, discussed the challenges facing inference chips for large language models (LLMs) and possible solutions. The article points out that the main challenges of LLM inference lie in memory and interconnects rather than computational power, and proposes four architectural research directions: high-bandwidth flash memory, near-memory processing, 3D memory-logic stacking, and low-latency interconnects. Annual sales of inference chips are expected to grow 4-6 times over the next 5-8 years.

Recently, an article co-authored by Xiaoyu Ma and David Patterson titled "Challenges and Research Directions for Large Language Model Inference Hardware" was officially released. After its publication, the article attracted widespread attention. In the article, the authors provide suggestions regarding the challenges and solutions for LLM inference chips.

The following is the main text of the article:

Large language model (LLM) inference is highly challenging. The autoregressive decoding phase of the underlying Transformer model fundamentally distinguishes LLM inference from training. Driven by recent trends in artificial intelligence, the main challenges lie in memory and interconnects rather than computational power.

To address these challenges, we focus on four architectural research directions: high-bandwidth flash memory, which can provide 10 times the memory capacity with bandwidth comparable to HBM; near-memory processing and 3D memory logic stacking, which can achieve high memory bandwidth; and low-latency interconnects, which can accelerate communication. Although our research primarily focuses on data center AI, we also explore the application of these solutions in mobile devices.

Introduction

When one of the authors began his career in 1976, about 40% of the papers at the leading computer architecture conference came from industry. By ISCA 2025, that share had fallen below 4%, signaling a near-complete disconnect between research and practice. To help restore the historical connection between the two, we propose several research directions that, if advanced, could address some of the most severe hardware challenges facing the AI industry.

LLM inference is facing a crisis. Rapid hardware progress has driven advances in artificial intelligence, and annual sales of inference chips are expected to grow 4-6 times over the next 5-8 years. While training produces the headline AI breakthroughs, the cost of inference determines their economic viability. As usage of these models skyrockets, companies are finding state-of-the-art models expensive to serve.

New trends are making inference more difficult. Recent advancements in LLMs require more resources for inference:

  • Mixture of Experts (MoE). Instead of a single dense feed-forward module, MoE selectively invokes a subset of dozens to hundreds of experts (DeepSeek-V3 has 256) at inference time. This sparsity allows a large increase in model size, and thus model quality, while training cost rises only modestly. But while MoE helps training, it burdens inference by expanding memory and communication demands.
  • Reasoning Models. Reasoning is a technique that emphasizes thinking before answering, aimed at improving model quality. The extra "thinking" step generates a long chain of "thoughts" before the final answer, much as people work through problems step by step. This thinking significantly increases generation latency, and the long thought sequences also consume a large amount of memory.
  • Multimodal. LLMs have evolved from text generation to image, audio, and video generation. Larger data types require more resources than text generation.
  • Long Context. The context window refers to the amount of information that the LLM model can access when generating answers. Longer contexts help improve model quality but increase computational and memory demands.
  • Retrieval-Augmented Generation (RAG). RAG accesses user-specific knowledge bases to obtain relevant information as additional context to improve LLM results, but this increases resource demands.
  • Diffusion. Unlike the autoregressive method of sequentially generating tokens, the new diffusion method generates all tokens (e.g., an entire image) in one step and then iteratively denoises the image until the desired quality is achieved. Unlike the aforementioned methods, diffusion methods only increase computational demands.

The growing market and challenges faced by LLM inference indicate that innovation is both an opportunity and a necessity!

Current LLM Inference Hardware and Its Inefficiencies

We first review the basics of LLM inference and the main bottlenecks in mainstream AI architectures, focusing on LLMs in data centers. LLMs on mobile devices face different limitations and thus require different solutions (e.g., HBM is not feasible).

At the core of LLMs is the Transformer, which has two distinctly different inference phases: prefill and decode (Figure 1). Prefill resembles training: it processes all tokens of the input sequence at once, so it is essentially parallel and typically compute-bound. Decode, in contrast, is inherently sequential, generating one output token per step ("autoregressive"), and is therefore memory-bound. The KV (key-value) cache connects the two phases, and its size is proportional to the lengths of the input and output sequences. Although prefill and decode appear together in Figure 1, they are not tightly coupled and often run on different servers. Such disaggregated inference enables software optimizations such as batching that reduce memory pressure during decode. A survey of efficient LLM inference reviews many of these software techniques.
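
To make the decode-side memory pressure concrete, here is a back-of-the-envelope estimate of KV cache size; the model dimensions and FP16 precision below are illustrative assumptions, not figures from the article.

```python
# Rough KV-cache size for one request of a dense Transformer.
# All model dimensions are illustrative assumptions.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, bytes_per_elem=2):
    # Each layer stores one key and one value vector per token.
    per_token = 2 * num_kv_heads * head_dim * bytes_per_elem
    return num_layers * seq_len * per_token

# Hypothetical 80-layer model, 8 KV heads of dimension 128,
# serving a 32k-token context in FP16.
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                      seq_len=32_000)
print(f"{size / 2**30:.1f} GiB per request")  # ~9.8 GiB
```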

GPUs and TPUs are the accelerators commonly used in data centers for training and inference. Historically, inference systems have typically been scaled-down training systems, for example with fewer chips or smaller chips with less memory or performance. So far, there has been no GPU/TPU designed specifically for LLM inference. Since prefill is similar to training while decode is entirely different, GPUs/TPUs face two challenges in decode that lead to inefficiencies.

Decoding Challenge 1: Memory

Autoregressive decoding makes inference fundamentally memory-bound, and new software trends exacerbate this challenge, while hardware development is trending in the opposite direction.

1. AI processors face memory bottlenecks

Current data center GPUs/TPUs rely on High Bandwidth Memory (HBM), connecting multiple HBM stacks to a single accelerator ASIC (Figure 2 and Table 1). However, memory bandwidth has grown far more slowly than floating-point operations per second (FLOPS). For example, the 64-bit floating-point performance of NVIDIA GPUs grew 80 times from 2012 to 2022, while bandwidth grew only 17 times. This gap is expected to keep widening.
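
One way to see why decode is memory-bound is to compare its arithmetic intensity with an accelerator's FLOPS-to-bandwidth ratio ("machine balance"). The sketch below uses illustrative hardware numbers, not those of any specific chip discussed in the article.

```python
# Compare decode arithmetic intensity with accelerator machine balance.
# Hardware numbers are illustrative assumptions.
peak_flops = 1000e12   # hypothetical 1000 TFLOP/s accelerator
hbm_bandwidth = 4e12   # hypothetical 4 TB/s of HBM bandwidth

# FLOPs that must be done per byte moved to stay compute-bound.
machine_balance = peak_flops / hbm_bandwidth   # 250 FLOPs/byte

# Batch-1 decode is dominated by matrix-vector products: each FP16
# weight (2 bytes) is read once and used for ~2 FLOPs, i.e. ~1 FLOP/byte.
decode_intensity = 2 / 2

print(f"machine balance:  {machine_balance:.0f} FLOPs/byte")
print(f"decode intensity: {decode_intensity:.0f} FLOP/byte")
# Decode sits two orders of magnitude below the balance point, so its
# throughput is set by memory bandwidth, not FLOPS.
```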

2. The cost of HBM is becoming increasingly high

Consider a single HBM stack: the normalized prices of its capacity (USD/GB) and bandwidth (USD/GBps) have both risen over time. Figure 3(a) shows that from 2023 to 2025 both rose by about 1.35x, driven by the growing manufacturing and packaging difficulty as the number of dies per HBM stack and DRAM density increase. In contrast, Figure 3(b) shows that the equivalent costs of standard DDR4 DRAM have fallen over time: from 2022 to 2025, capacity cost fell to 0.54x and bandwidth cost to 0.45x of their earlier levels. Although prices of all memory and storage devices are expected to rise sharply in 2026 due to unexpected demand, we believe the diverging HBM and DRAM price trends will persist in the long run.

3. The growth of DRAM density is slowing down

The scalability of individual DRAM chips is also concerning. Since the introduction of the 8Gb DRAM chip in 2014, achieving a fourfold increase has taken more than 10 years. Previously, such a fourfold increase typically occurred every 3-6 years.

4. Solutions that only use SRAM are insufficient to meet the challenges

Cerebras and Groq tried to sidestep the DRAM and HBM challenges with large chips filled with SRAM (Cerebras even adopted wafer-scale integration). While this approach seemed feasible when these companies were founded a decade ago, LLMs quickly outgrew on-chip SRAM capacity, and both companies later had to retrofit external DRAM.

Decoding Challenge 2: End-to-End Latency

1. User-Facing Inference Demands Low Latency

Unlike training, which can run for weeks, inference serves real-time requests and must respond within seconds or less, so low latency is crucial for user-facing inference. (Batch or offline inference has no low-latency requirement.) Depending on the application, the latency metric may be the time to complete all output tokens or the time to generate the first token. Both present challenges:

  • Completion Time Challenge. Decode generates one token at a time, so the longer the output sequence, the longer the latency. Longer input sequences also increase latency, because accessing the KV cache during decode and prefill takes more time. And because decode is memory-bound, each decode iteration pays a high memory-access cost.
  • First Token Generation Time Challenge. Longer input sequences and retrieval-augmented generation (RAG) increase the work done before any token is generated, lengthening the time to first token. Reasoning models add to this latency because they generate many "thought" tokens before the first user-visible token. (A first-order latency model is sketched after this list.)
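
The sketch below is a first-order latency model for these two metrics, treating prefill as compute-bound and each decode step as memory-bound; all hardware and workload numbers are illustrative assumptions.

```python
# First-order model of time-to-first-token (TTFT) and completion time.
# All numbers are illustrative assumptions.
def first_token_time_s(prefill_flops, peak_flops):
    # Prefill is roughly compute-bound.
    return prefill_flops / peak_flops

def completion_time_s(ttft_s, out_tokens, bytes_per_step, mem_bw):
    # At batch size 1, each decode step re-reads the weights and the
    # KV cache, so it is roughly memory-bound.
    return ttft_s + out_tokens * bytes_per_step / mem_bw

ttft = first_token_time_s(prefill_flops=2e15,   # long prompt
                          peak_flops=1e15)      # 1 PFLOP/s
total = completion_time_s(ttft, out_tokens=1000,
                          bytes_per_step=150e9, # weights + KV cache read
                          mem_bw=4e12)          # 4 TB/s
print(f"TTFT ≈ {ttft:.1f} s, completion ≈ {total:.1f} s")  # ~2 s, ~39.5 s
```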

2. Interconnect Latency is More Important than Bandwidth

Before LLMs, data center inference typically ran on a single chip, while training required supercomputers, whose interconnects were clearly designed to prioritize bandwidth over latency. LLM inference changed all of this:

  • Due to the large weights, LLM inference now requires a multi-chip system and employs software sharding techniques, which means frequent communication is necessary. MoE and long sequence models further increase system scale to meet greater memory capacity demands.
  • Unlike training, decode uses smaller batch sizes, so network messages are usually smaller as well. For small messages sent frequently across a large network, latency matters more than bandwidth (see the cost-model sketch after this list).
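
A simple cost model shows why small decode-time messages are dominated by latency rather than bandwidth; the link parameters below are illustrative assumptions.

```python
# Message transfer time ≈ fixed network latency + size / bandwidth.
# Link parameters are illustrative assumptions.
def transfer_time_us(size_bytes, latency_us=5.0, bandwidth_gbps=400):
    return latency_us + size_bytes * 8 / (bandwidth_gbps * 1e3)

for size in (8_000, 1_000_000, 100_000_000):   # 8 KB, 1 MB, 100 MB
    t = transfer_time_us(size)
    print(f"{size/1e6:>8.3f} MB: {t:8.1f} us  ({5.0/t:4.0%} of it fixed latency)")
# Small messages spend almost all their time in fixed latency, so for
# decode traffic cutting latency matters more than adding bandwidth.
```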

Table 2 summarizes the main challenges of decode inference. Only diffusion, which is fundamentally different from Transformer decode, calls for more computational power, and that is relatively easy to provide. We therefore focus on promising ways to improve memory and interconnect latency rather than compute. The last four rows of the table list research opportunities that address these needs, introduced next.

Rethinking Four Research Opportunities for LLM Inference Hardware

Performance/cost metrics measure the efficiency of AI systems. Modern metrics normalize delivered performance by total cost of ownership (TCO), average power consumption, and carbon-dioxide-equivalent emissions (CO2e), providing new goals for system design:

  • Performance must be meaningful. For LLM decoding inference, achieving high FLOPS on large chips does not necessarily mean high performance. Instead, we need to efficiently scale memory bandwidth and capacity, and optimize interconnect speed.
  • Performance must be achieved within the data center capacity, which is often limited by power consumption, space, and CO2e budget.
  • Power consumption and CO2e are primary optimization targets. Power affects TCO and data center capacity. Power and the cleanliness of the energy supply determine operational CO2e, while manufacturing yield and product lifetime determine embodied CO2e.

Next, we introduce four promising research directions that address the decode challenges (bottom of Table 2). Although described independently, they are synergistic: one architecture can combine many of them for comprehensive improvements in performance/TCO, performance/CO2e, and performance/power.

  1. High Bandwidth Flash, 10x Capacity Increase

High Bandwidth Flash (HBF) combines HBM-like bandwidth with flash-like capacity by stacking flash chips much as HBM stacks DRAM (Figure 4(a)). HBF can increase memory capacity per node by 10 times, thereby shrinking system size, power consumption, TCO, CO2e, and network overhead. Table 3 compares HBF with HBM and DDR DRAM; each option has its drawbacks: DDR5's limited bandwidth, HBM's limited capacity, and HBF's write restrictions and higher read latency. Another advantage of HBF is sustainable capacity scaling: flash capacity doubles roughly every three years, while, as noted, DRAM growth is slowing.

Two well-known limitations of flash memory need to be addressed:

  • Limited write endurance. Write/erase cycles wear out flash memory, so HBF must store data that is updated infrequently, such as weights during inference or slowly changing context.
  • High page-based read latency. Flash memory reads in pages (tens of KB), with latency far exceeding that of DRAM (tens of microseconds). Small data reads reduce effective bandwidth.

These issues mean that HBF cannot completely replace HBM; systems still require conventional DRAM to store data that is not suitable for HBF storage.
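
One way to picture the resulting two-tier memory is a placement policy that keeps write-hot or small random-access data in HBM/DRAM and puts write-cold, bulk-read data in HBF. The thresholds and categories below are illustrative assumptions, not a design from the article.

```python
# Toy placement policy for an HBM + HBF node.
# Thresholds are illustrative assumptions.
FLASH_PAGE = 32 * 1024          # assumed flash read granularity

def place(write_freq_per_s, typical_read_bytes):
    if write_freq_per_s > 1e-3:          # e.g. KV cache, updated per token
        return "HBM/DRAM"
    if typical_read_bytes < FLASH_PAGE:  # small reads waste flash bandwidth
        return "HBM/DRAM"
    return "HBF"                         # frozen weights, slow-changing context

for name, freq, rd in [("weights",    0,    1 << 20),
                       ("kv_cache",   50,   4 << 10),
                       ("rag_corpus", 1e-5, 256 << 10)]:
    print(f"{name:>10}: {place(freq, rd)}")
# weights -> HBF, kv_cache -> HBM/DRAM, rag_corpus -> HBF
```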

The addition of HBF brings exciting new capabilities for LLM inference:

  • 10x weight memory. Weights are frozen during inference, so HBF's 10x capacity can hold more weights (e.g., giant MoE models), thus supporting larger models than are currently feasible.

  • 10x context memory. Due to limited write endurance, HBF is not suited to KV-cache data that is updated with every query or generated token, but it is suited to slowly changing context. For example:

    • the web corpus used by LLM search, which stores billions of internet documents;

    • the code base used by AI coding, which stores billions of lines of code;

    • the paper corpus used by AI tutoring, which tracks millions of research papers.

  • Smaller inference systems. Memory capacity determines the minimum hardware required to run the model. HBF can scale down the system, thereby improving communication, reliability, and resource allocation efficiency.

  • Larger resource capacity. HBF will reduce reliance on HBM-only architectures and alleviate the global shortage of mainstream memory devices.

HBF also raises new research questions:

  • How should software cope with limited write endurance and high-latency page-based reads?
  • What should be the ratio of traditional memory to HBF in the system?
  • Can we reduce the limitations of HBF technology itself?
  • How should HBF configurations differ between mobile devices and data centers?
  2. Processing-Near-Memory for High Bandwidth

Processing-in-Memory (PIM) technology dates to the 1990s: it adds small, low-power processors on memory chips to exploit their high internal bandwidth. While PIM offers excellent bandwidth, its main drawbacks are software sharding and memory-logic coupling. The former limits the number of software kernels that run well on PIM, while the latter reduces the power and area efficiency of the compute logic. In contrast, Processing-Near-Memory (PNM) places memory and logic close together but on separate chips. One form of PNM is 3D stacking of compute logic with memory.

Unfortunately, some recent papers blur the distinction between PIM and PNM, using PIM as an umbrella term whether or not the compute logic sits directly on the memory die. We propose a simple, clear distinction: PIM places the processor and memory on the same die, while PNM places them on adjacent but separate dies. This keeps the two concepts distinct.

If software is hard to use, a hardware advantage is meaningless, and that is exactly our experience with PIM and data center LLMs. Table 4 lists the reasons PNM beats PIM for LLM inference, despite PNM's weaker bandwidth and power efficiency. Specifically, PIM requires software to shard an LLM's memory structures into many small, nearly independent pieces that fit into 32-64 MB memory blocks; PNM shards can be about 1000 times larger, making it far easier to partition LLMs with very low communication overhead. In addition, given the very limited power and thermal budget of DRAM process nodes, it remains unclear whether PIM can supply enough compute.
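
The sharding burden can be made concrete by counting how many pieces a model must be split into at each granularity; the model size and the PNM block size below are illustrative assumptions (the 32-64 MB PIM block size is from the text).

```python
# How many shards does a model need at PIM vs PNM granularity?
# Model size and PNM block size are illustrative assumptions.
model_bytes = 700e9    # hypothetical ~700 GB MoE checkpoint
pim_block   = 48e6     # midpoint of the 32-64 MB PIM blocks
pnm_block   = 48e9     # ~1000x larger per PNM device

print(f"PIM shards: {model_bytes / pim_block:,.0f}")   # ~14,583
print(f"PNM shards: {model_bytes / pnm_block:,.0f}")   # ~15
# Tens of thousands of tiny shards force fine-grained communication and
# awkward kernels; a dozen or so large shards are far easier to map.
```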

Although PNM is superior to PIM for data center LLMs, the comparison between the two is not so clear for mobile devices. The energy consumption of mobile devices is more constrained, and due to single-user operation, their LLMs have fewer weights, shorter contexts, smaller data types, and smaller batch sizes. These differences simplify sharding, reduce computational and heat dissipation demands, making the weaknesses of PIM less prominent, thus PIM may be more feasible on mobile devices.

  3. 3D Memory Logic Stacking for High Bandwidth

Unlike 2D hardware where memory I/O is located at the chip edge, 3D stacking (see Figure 4(b)) uses Through-Silicon Vias (TSV) to achieve wide and dense memory interfaces, thereby achieving high bandwidth at low power consumption.

There are two versions of 3D memory logic stacking:

  1. HBM-based compute solutions: reuse the HBM design by inserting compute logic into the HBM base die. Since the memory interface is unchanged, bandwidth matches HBM, while power consumption drops 2-3x thanks to shorter data paths.

  2. Custom 3D solutions: Achieve higher bandwidth and bandwidth per watt than reusing HBM by using wider and denser memory interfaces and more advanced packaging technologies.

Despite its bandwidth and power advantages, 3D stacking still faces the following challenges:

  1. Heat dissipation. With less exposed surface area, 3D designs are harder to cool than 2D designs. One mitigation is to cap the compute logic's FLOPS by lowering clock frequency and voltage, since the arithmetic intensity of LLM decode inference is inherently low (a rough sizing sketch follows this list).

  2. Memory logic coupling. The memory interface of 3D computing logic stacking may require an industry standard.
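
As a rough check on the thermal mitigation above, the compute a stacked logic die needs in order to keep up with its memory bandwidth at decode-time arithmetic intensity is modest; the numbers below are illustrative assumptions.

```python
# How much compute must a 3D-stacked logic die supply to keep pace with
# its memory bandwidth during decode? Numbers are illustrative assumptions.
stack_bandwidth = 2e12    # 2 TB/s from one memory stack
decode_intensity = 2      # a few FLOPs per byte with modest batching

required_flops = stack_bandwidth * decode_intensity
print(f"required compute ≈ {required_flops / 1e12:.0f} TFLOP/s")  # ~4
# A few TFLOP/s is far below a full GPU/TPU die, so the logic die can run
# at reduced clock and voltage and stay inside the stack's thermal budget.
```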

3D stacking brings new research questions:

  • The ratio of memory bandwidth to capacity or floating-point operations is significantly different from existing systems. How should software adapt?
  • How do we efficiently map an LLM onto a system that contains multiple types of memory?
  • How should a stack communicate with other memory-logic stacks and with the main AI processor (if one is present)?
  • What trade-offs do various design choices (e.g., whether the computing chip is on the top or bottom, the number of memory chips per stack, etc.) have in terms of bandwidth, power consumption, heat dissipation, and reliability?
  • How do these opportunities differ for mobile devices and data center LLM accelerators?
  4. Low-Latency Interconnects

Directions ①-③ already help reduce latency and increase throughput: higher memory bandwidth shortens each decode iteration, while higher memory capacity per accelerator chip shrinks the system and so saves communication overhead. Another promising way to cut data center latency is to rethink the trade-off between network latency and bandwidth, since inference is more sensitive to interconnect latency. For example:

  • High-connectivity topologies. Topologies with high connectivity (such as trees, dragonflies, and high-dimensional tori) need fewer hops, thereby reducing latency. They may give up some bandwidth but improve latency.

  • In-network processing. The communication collectives used by LLMs (such as broadcast, all-reduce, MoE dispatch, and gather) are well suited to in-network acceleration, which improves both bandwidth and latency. For example, a tree topology with in-network aggregation can deliver low latency and high throughput for all-reduce (see the hop-count sketch after this list).

  • AI chip optimizations. Latency concerns also affect chip design and suggest several optimizations:

    • store incoming small packets directly in on-chip SRAM instead of off-chip DRAM;

    • place the compute engine close to the network interface to reduce transfer time.

  • Reliability. Co-designing reliability with the interconnect helps in two ways:

    • local spare nodes reduce system failures and lower the latency and throughput loss of migrating failed jobs to distant healthy nodes, a cost that is unavoidable without spares;

    • if LLM inference can tolerate imperfect communication, substituting dummy data or earlier results when a message times out, instead of waiting for the late message, reduces latency while still delivering acceptable result quality.
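
To see why hop count and in-network aggregation matter for latency-bound collectives, here is a first-order comparison of ring all-reduce with a tree that aggregates in the switches; link parameters are illustrative assumptions.

```python
import math

# First-order all-reduce latency: per-hop fixed latency dominates for the
# small messages typical of decode. Link parameters are illustrative.
def ring_allreduce_us(n_chips, msg_bytes, hop_us=2.0, bw_gbps=400):
    steps = 2 * (n_chips - 1)                 # reduce-scatter + all-gather
    per_step = hop_us + (msg_bytes / n_chips) * 8 / (bw_gbps * 1e3)
    return steps * per_step

def tree_allreduce_us(n_chips, msg_bytes, hop_us=2.0, bw_gbps=400):
    # Reduce up the tree, then broadcast down, with in-switch aggregation.
    steps = 2 * math.ceil(math.log2(n_chips))
    per_step = hop_us + msg_bytes * 8 / (bw_gbps * 1e3)
    return steps * per_step

n, msg = 64, 64_000                           # 64 chips, 64 KB message
print(f"ring: {ring_allreduce_us(n, msg):6.1f} us")   # ~254 us
print(f"tree: {tree_allreduce_us(n, msg):6.1f} us")   # ~39 us
# The tree's O(log n) hop count wins for small, latency-bound messages.
```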

Related Work

High Bandwidth Flash (HBF). HBF was first proposed by SanDisk as an HBM-like flash architecture designed to overcome flash's bandwidth limitations. (SK Hynix later joined the effort.) Researchers at Microsoft proposed a new class of memory for AI inference that favors read performance and high density over write performance and retention time; although they did not say so explicitly, HBF is a concrete instance of such AI memory. Another paper proposed integrating flash into mobile processors for on-device LLM inference, using LPDDR interfaces for prefill's modest bandwidth needs and near-flash processing for decode's high bandwidth needs.

Near-memory processing. 3D compute logic stacking, as a technology with bandwidth exceeding HBM, is gaining increasing attention, such as compute solutions based on HBM chips and AMD's concepts.

In the non-3D domain, Samsung AXDIMM and Marvell Structera-A attach processors to commodity DDR DRAM: the former integrates compute logic into the DIMM buffer chip, and the latter improves programmability and simplifies system integration via a CXL interface. (A survey article provides more PNM/PIM examples.) Many papers discuss PIM/PNM for mobile devices, but that is not the focus of this article.

Low-latency interconnects. A large body of work describes low-hop network topologies, including trees, dragonflies, and high-dimensional tori. (Due to this journal's reference limit, citations cannot be provided.) Commercial examples of in-network processing include NVIDIA NVLink and InfiniBand switches that support in-switch reduction and multicast acceleration via SHARP; Ethernet switches have recently gained similar capabilities for AI workloads.

Software innovation. Beyond the hardware innovations this article focuses on, there is a rich hardware-software co-design space, as well as room for algorithm and software innovations, to improve LLM inference. For example, the autoregressive nature of Transformer decoding is a root cause of the challenges described above; a new algorithm that avoids autoregressive generation (such as the diffusion approach used for image generation) could greatly simplify AI inference hardware.

Conclusion

The importance of LLM inference keeps growing, and so does its difficulty, while the pressure to cut its cost and latency is urgent. LLM inference is therefore becoming a highly attractive research area. Autoregressive decoding itself puts significant pressure on memory and interconnect latency, and mixture of experts (MoE), reasoning, multimodality, RAG, and long input/output sequences further exacerbate the challenge.

The field of computer architecture has made major contributions to such challenges when realistic simulators were available, as with its past work on branch prediction and cache design. Since the main bottlenecks of LLM inference are memory and latency, roofline-based performance simulators can provide useful first-order performance estimates in many scenarios. Such frameworks should also track memory capacity, explore the partitioning techniques that are critical to performance, and use modern performance/cost metrics. We hope academic researchers seize this opportunity to accelerate AI research.
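
In the spirit of such roofline-based simulators, here is a minimal estimator of one decode step's time; the hardware and model numbers are illustrative assumptions, not a tool from the article.

```python
# Minimal roofline estimate of one decode step: the step takes at least
# as long as the slower of compute and memory traffic.
def decode_step_ms(flops, bytes_moved, peak_flops, mem_bw):
    compute_s = flops / peak_flops
    memory_s = bytes_moved / mem_bw
    return 1e3 * max(compute_s, memory_s), memory_s > compute_s

t, memory_bound = decode_step_ms(
    flops=2 * 70e9,        # ~2 FLOPs per parameter, 70B active parameters
    bytes_moved=140e9,     # FP16 weights read once per step
    peak_flops=1000e12,    # hypothetical 1000 TFLOP/s accelerator
    mem_bw=4e12)           # hypothetical 4 TB/s of memory bandwidth
print(f"step ≈ {t:.0f} ms, memory-bound: {memory_bound}")  # ~35 ms, True
```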

The prevailing AI hardware recipe of reticle-sized chips with high FLOPS, multiple HBM stacks, and bandwidth-optimized interconnects is a poor match for LLM decode inference. While many researchers focus on data center compute, we suggest improving memory and networking along four directions: HBF, PNM, 3D stacking, and low-latency interconnects. In addition, new performance/cost metrics centered on data center capacity, system power, and carbon footprint open opportunities that traditional metrics miss. Limited versions of HBF, PNM, PIM, and 3D stacking may also apply to LLMs on mobile devices.

We hope these directions foster collaboration among all parties to advance the innovations the world urgently needs for affordable AI inference.
