
Liang Wenfeng's new paper: give large models a "dictionary," separate computation from memory, and IQ skyrockets. A spoiler for DeepSeek V4?

This is a reconstruction of the AI "cortex." The latest paper from the DeepSeek team led by Liang Wenfeng reveals that when we separate "memory" from "computation," entrusting what needs to be memorized to a "dictionary" and what needs to be computed to the brain, AI's reasoning ability grows explosively and counterintuitively. This moment may be the eve of the birth of DeepSeek V4.
This is a moment for the reconstruction of the underlying logic of AI.
For a long time, the Transformer architecture has been trapped in an expensive paradox: we use the most advanced GPU computing power to make AI models "memorize" static knowledge that can be found in a dictionary.
The DeepSeek team led by Liang Wenfeng, together with collaborators from Peking University, released a groundbreaking paper early this morning titled "Conditional Memory via Scalable Lookup," breaking this deadlock. They propose a brand-new Engram module, which opens up a second axis of sparsity beyond traditional "conditional computation" (MoE): "conditional memory."
This is not just a technical patch, but a supply-side reform regarding the model's "brain capacity." It proves that: when we separate "memory" from "computation," delegate what needs to be memorized to the "dictionary," and assign what needs to be computed to the brain, the reasoning ability of AI will experience an explosive growth that defies intuition.


DeepSeek plans to officially release V4 around the Spring Festival in February, and this moment may be the eve of the birth of DeepSeek V4.
Prologue: The "Futile Efforts" of a Six-Layer Neural Network
The story begins with the DeepSeek team's "MRI scan" of the Transformer's internal workings.
In the black box of artificial intelligence, when the large model sees the phrase "Diana, Princess of Wales," an inexplicable and extremely costly "internal friction" occurs within it.
Researchers found that to recognize this fixed entity, the model actually utilized a full 6 layers of the network:
- Layers 1-2: The model is still pondering that "Wales" is probably a country;
- Layer 3: It realizes that this is a geographical concept in Europe;
- Layer 4: It begins to piece together that "Princess of Wales" seems to be a title;
- Layer 5: It associates it with "the wife of the Prince of Wales";
- Layer 6: Only at this point does it finally confirm that this refers to the famous "Princess Diana."

In the eyes of an architect pursuing extreme efficiency, this is simply a waste of computing power. "Diana, Princess of Wales" is an objectively existing, static entity that does not change its essence with context. To extract a fact that could be obtained by simply looking it up in a dictionary, the Transformer runs a costly matrix computation six layers deep to "reconstruct" the concept.
This is like a genius who, before solving a calculus problem, has to spend half an hour rewriting the multiplication table every time. This mechanism of "implicit memory" forces the model to waste valuable parameter capacity and network depth on simple pattern matching.
DeepSeek poses a soul-searching question in this 33-page paper: Why not directly equip large models with a "super dictionary" that can be referenced at any time?
Chapter One: Architectural Restructuring — The Aesthetic of Violence in the Engram Module
To address this issue, DeepSeek proposed a new module called "Engram (Conditional Memory)."
If MoE (Mixture of Experts) divides the "brain" into different areas, allowing different experts to handle different thoughts (conditional computations); then Engram is like adding a huge "hippocampus" to the brain, specifically responsible for storing static knowledge (conditional memory).

1. Reviving "N-gram": Seeking Answers from Ancient Wisdom
The core inspiration for Engram surprisingly comes from the "ancient artifact" of the NLP (Natural Language Processing) field — N-gram. Before deep learning dominated the world, we relied on statistics of "the probability of N words appearing together" to understand language.
DeepSeek modernized this classic concept:
- Traditional Transformer: Knowledge is dispersed in the weights of neurons, and extracting it requires complex linear-layer computation, leading to high complexity.
- Engram module: A massive, scalable embedding table. When the model encounters a fixed phrase (N-gram) such as "Zhang Zhongjing" or "Four Great Inventions," it does not need to engage the cerebral cortex for reasoning; it can directly "look up" the corresponding vector in the memory table through a hash index.
This process has a time complexity of O(1) — meaning that regardless of how large the knowledge base expands (even to 100 billion parameters), the lookup speed remains almost unchanged and extremely fast.
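To make the O(1) claim concrete, here is a minimal sketch of a hashed N-gram lookup. The sizes, hash scheme, and function names are illustrative assumptions, not taken from the paper; the point is that retrieving a memory vector costs one hash plus one array index, regardless of how large the table grows.

```python
import numpy as np

# Toy hashed N-gram memory table (illustrative sizes; the real table would be vastly larger).
NUM_SLOTS = 100_000   # fixed number of memory slots
EMBED_DIM = 256       # width of each stored memory vector

rng = np.random.default_rng(0)
memory_table = rng.standard_normal((NUM_SLOTS, EMBED_DIM)).astype(np.float32)

def ngram_slot(token_ids: tuple[int, ...]) -> int:
    """Map a token N-gram to a table slot with a cheap deterministic hash (FNV-1a style)."""
    h = 1469598103934665603
    for t in token_ids:
        h = ((h ^ t) * 1099511628211) % (1 << 64)
    return h % NUM_SLOTS

def lookup(token_ids: tuple[int, ...]) -> np.ndarray:
    """O(1) retrieval: hash the N-gram, index the table, done. No matrix multiplies."""
    return memory_table[ngram_slot(token_ids)]

print(lookup((4821, 1093)).shape)  # (256,) -- the memory vector for one fixed 2-gram
```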

2. Three Major Technical Moats

If table lookup is so good, why hasn't anyone done it before? Because there are three roadblocks: storage explosion, polysemy conflicts, and parameter allocation. DeepSeek provides a textbook-level answer to each:
A. Vocabulary Compression: Extreme Deduplication
The combinations of phrases in the world are astronomical. DeepSeek first performs a step of "lossless compression." At the tokenizer level, it normalizes words that have the same meaning but different spellings.
For example, "Apple" (capitalized) and "apple" (lowercase) usually refer to the same thing semantically. Through mapping and merging, the effective vocabulary was reduced by 23%. This not only saves space but also significantly increases the density of knowledge.
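As a rough illustration of what tokenizer-level normalization might look like (the rules below are hypothetical; the paper's actual merging scheme is not spelled out here), different surface forms collapse onto one canonical key before the memory table is built:

```python
# Hypothetical normalization rules for illustration only: surface forms with the same meaning
# are mapped to one canonical key, so "Apple" and "apple" share a single memory slot.

def canonical(token: str) -> str:
    """Collapse case and a common Unicode variant into one canonical surface form."""
    return token.lower().replace("\u2019", "'")  # case-fold and normalize curly apostrophes

raw_vocab = ["Apple", "apple", "APPLE", "Wales", "wales", "don't", "don\u2019t"]
merged = sorted({canonical(t) for t in raw_vocab})

print(merged)                                        # ['apple', "don't", 'wales']
print(f"effective vocab: {len(raw_vocab)} -> {len(merged)}")
```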
B. Multi-Head Hashing: Taming Hash Collisions
It is impossible to store every N-gram explicitly, so Engram uses multi-head hashing: several hash functions map the effectively unlimited space of N-grams onto a limited number of memory slots. Hash collisions will happen (two different phrases landing in the same slot), but the multi-head design lets the model piece the correct information together from multiple candidate sub-vectors, greatly improving robustness.
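Here is a minimal sketch of the multi-head idea, with sizes and a hash scheme assumed for illustration rather than taken from the paper: each head hashes the same N-gram with a different seed into its own sub-table, so a collision in one head is unlikely to repeat in the others, and the concatenated sub-vectors still identify the phrase.

```python
import numpy as np

# Multi-head hashed lookup (illustrative, not the paper's exact scheme).
NUM_HEADS = 4
SLOTS_PER_HEAD = 50_000
SUBVEC_DIM = 64  # concatenated memory vector is NUM_HEADS * SUBVEC_DIM = 256 wide

rng = np.random.default_rng(0)
tables = [rng.standard_normal((SLOTS_PER_HEAD, SUBVEC_DIM)).astype(np.float32)
          for _ in range(NUM_HEADS)]

def head_slot(token_ids: tuple[int, ...], seed: int) -> int:
    """Hash the N-gram with a head-specific seed so heads collide independently."""
    h = 1469598103934665603 ^ seed
    for t in token_ids:
        h = ((h ^ t) * 1099511628211) % (1 << 64)
    return h % SLOTS_PER_HEAD

def multi_head_lookup(token_ids: tuple[int, ...]) -> np.ndarray:
    """Concatenate one sub-vector per head; a single collision corrupts only 1/NUM_HEADS of it."""
    parts = [tables[k][head_slot(token_ids, seed=k)] for k in range(NUM_HEADS)]
    return np.concatenate(parts)

print(multi_head_lookup((4821, 1093)).shape)  # (256,)
```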
C. Contextual Gating: Assigning a "Referee" to Memory
This is the most ingenious part. Table lookup is static, while language is dynamic.
For example, the word "apple." In the context of "eating an apple," it refers to the fruit; in the context of "Apple launch event," it refers to the tech company. Directly looking up the table may introduce noise.
DeepSeek designed a "Context-aware Gating."
- Query: The hidden state of the current context.
- Key/Value: The static vector obtained from the table lookup.
This gating acts like a referee. If the "static knowledge" retrieved does not match the current "context," the referee will lower the weight (Gate value approaches 0), allowing the model to ignore this noise; if it perfectly fits (for example, "Shanghan Zhabing Lun" followed by "Zhang Zhongjing"), the referee will open the gate (Gate value approaches 1), directly injecting knowledge into the model.
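A minimal sketch of such a gate, with illustrative shapes and a simple dot-product parameterization that is assumed here rather than taken from the paper:

```python
import numpy as np

# Context-aware gating sketch: the current hidden state is the query, the vector retrieved
# from the memory table is the key/value, and a scalar gate in (0, 1) decides how much of
# the static memory is injected back into the residual stream.

D_MODEL = 256
rng = np.random.default_rng(0)
W_q = rng.standard_normal((D_MODEL, D_MODEL)).astype(np.float32) / np.sqrt(D_MODEL)
W_k = rng.standard_normal((D_MODEL, D_MODEL)).astype(np.float32) / np.sqrt(D_MODEL)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_memory(hidden: np.ndarray, memory: np.ndarray) -> np.ndarray:
    """Inject the looked-up memory only when it agrees with the current context."""
    q = hidden @ W_q                          # query derived from the context
    k = memory @ W_k                          # key derived from the static lookup
    gate = sigmoid(q @ k / np.sqrt(D_MODEL))  # ~0: ignore the memory, ~1: inject it
    return hidden + gate * memory             # residual update scaled by the gate

hidden_state = rng.standard_normal(D_MODEL).astype(np.float32)
retrieved = rng.standard_normal(D_MODEL).astype(np.float32)
print(gated_memory(hidden_state, retrieved).shape)  # (256,)
```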

Chapter Two: The Golden Ratio—Discovering the "U-shaped Curve" of AI Models
With the architecture designed, the next question is: how should the inheritance be divided?
Suppose GPU memory is limited and the total parameter budget is fixed: how many parameters should go to MoE's "experts" (responsible for computation), and how many to Engram's "dictionary" (responsible for memory)? This is a classic resource-allocation game. The DeepSeek team ran a large-scale ablation, sweeping the allocation ratio from 0% to 100%, and obtained a clean "U-shaped" scaling-law curve.

This chart reveals the underlying laws of AI model design:
- Left extreme (pure Engram): If all parameters go to the dictionary, the loss is very high. The model becomes a "bookworm," relying on rote memorization with no logical reasoning ability.
- Right extreme (pure MoE): If all parameters go to the experts, the loss is also very high. The experts are forced to spend all their capacity memorizing static knowledge and have little left for actual reasoning.
- Golden-ratio point (ρ ≈ 75%-80%): When roughly 20%-25% of the sparse parameter budget goes to Engram and the rest to MoE, validation loss drops to its lowest point.
This is a highly instructive finding: For large models with hundreds of billions of parameters, simply stacking computational units (MoE experts) has diminishing marginal returns; it is necessary to introduce a dedicated static memory module to achieve "storage-computation balance."
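As a back-of-the-envelope illustration (interpreting ρ as the share of the sparse budget kept by the MoE experts, which is consistent with the figures above; the numbers are purely illustrative):

```python
# Splitting a fixed sparse-parameter budget between MoE experts and the Engram table.
TOTAL_SPARSE_PARAMS = 27e9   # e.g. a 27B sparse budget, as in the comparison models below
rho = 0.78                   # fraction of the budget kept by the MoE experts

moe_params = rho * TOTAL_SPARSE_PARAMS
engram_params = (1 - rho) * TOTAL_SPARSE_PARAMS

print(f"MoE experts: {moe_params / 1e9:.1f}B, Engram table: {engram_params / 1e9:.1f}B")
# -> roughly 21.1B for the experts and 5.9B for the memory table, close to the
#    55-expert + 5.7B Engram configuration described in the next chapter.
```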
Chapter Three: Counterintuitive Explosion—Why Can "Looking Up Words" Improve "Math Scores"?
If Engram merely makes the model "remember better," the weight of this paper would not be enough to shake the community. After all, RAG (Retrieval-Augmented Generation) can also solve knowledge problems.
What truly shocked the industry were the unexpected gains in the experimental results.
DeepSeek built three comparison models with strictly matched activated parameters (3.8B) and training data (262B tokens):
- Dense-4B: Traditional dense model.
- MoE-27B: Pure MoE model (72 experts).
- Engram-27B: Hybrid model (55 experts + 5.7B Engram parameters).
The results were astonishing:
1. As expected: Knowledge tasks dominate the rankings
On MMLU (comprehensive knowledge), the Engram model improved by 3.4 points; on CMMLU (Chinese knowledge), it improved by 4.0 points. This is understandable; with the addition of a dictionary, common sense naturally improves, and hallucinations decrease.
2. Unexpected: Logic, code, and math skyrocketed
Logically, "looking up words" should have no relation to "doing math problems." However, on BBH (comprehensive reasoning), Engram-27B surprisingly improved by a full 5.0 points compared to the pure MoE baseline with the same parameters!
- MATH: Improved by 2.4 points.
- HumanEval: Improved by 3.0 points.
- ARC-Challenge: Improved by 3.7 points.

3. In-depth Analysis: Effective Depth Theory
Why? How can a "rote memorization" module improve IQ?
The DeepSeek team used Logit Lens and CKA (Centered Kernel Alignment) to "dissect" the model's internals. They discovered an astonishing phenomenon:
Do you remember "Princess Diana" from the beginning?
In the pure MoE model, the early layers of the network are busy "piecing together concepts."
In the Engram model, because the Engram module is inserted at the second layer, the retrieval of static knowledge is completed at a very early stage.
This means that the early layers of the network originally used for "rote memorization" are liberated!
This is equivalent to virtually increasing the model's depth. The freed-up layers and attention heads no longer need to handle trivial local dependencies (such as recognizing "Zhang Zhongjing") and can devote themselves entirely to more complex global reasoning, long-range logical construction, and code generation.
The essence of Engram is not to "replace" reasoning, but to allow the brain to focus on higher-dimensional thinking by "diverting" trivial tasks.
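For reference, the CKA measure cited above has a standard linear form; the sketch below is that textbook formulation applied to toy matrices standing in for layer activations, not DeepSeek's analysis code.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two representation matrices of shape (n_samples, n_features)."""
    X = X - X.mean(axis=0, keepdims=True)            # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, ord="fro") ** 2   # unnormalized similarity
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return float(hsic / (norm_x * norm_y))

# Comparing, say, an early layer of the Engram model with a deeper layer of the MoE baseline
# would quantify how much "earlier" the same information appears.
rng = np.random.default_rng(0)
layer_a = rng.standard_normal((2048, 64))
layer_b = layer_a + 0.1 * rng.standard_normal((2048, 64))   # nearly identical representation
print(round(linear_cka(layer_a, layer_b), 3))                # close to 1.0
print(round(linear_cka(layer_a, rng.standard_normal((2048, 64))), 3))  # near 0: unrelated
```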

Chapter Four: Engineering Marvels - Breaking NVIDIA's "Memory Dominance"
For Wall Street investors and the operators of computing centers, the most attractive part of this paper is not the benchmark scores but the cost.
In the AI era, the most expensive resource is not computing power (FLOPs) but memory (HBM). The high price of NVIDIA's H100 is driven largely by scarce high-bandwidth memory.
Engram brings a disruptive feature: complete separation of storage and computation.
1. Pain Points of MoE: Memory Hogs
Traditional MoE models use dynamic routing: the model must first compute the current token's features, and only after finishing this layer does it know which expert the next layer will consult. As a result, all experts must stay resident in expensive GPU memory, ready to respond at any moment.

2. Breakthrough of Engram: Deterministic Prediction
The lookup logic of Engram is deterministic.
As long as the input text is determined (for example, "A New Axis of Sparsity"), its corresponding N-gram index is also determined. We do not need to wait for the model to finish calculating the previous layer; the moment the token enters the model, we already know which table and which row it needs to look up.
3. The Comeback of the CPU: Stuffing Large Models into Main Memory
This feature brings significant engineering benefits:
- Offload: The Engram table, with hundreds of billions or even trillions of parameters, can be placed directly in cheap, plentiful, easily expandable CPU memory (DRAM), or even on NVMe SSDs.
- Prefetch: While the GPU is busy computing the current Transformer layer, the CPU uses the PCIe channel to asynchronously "prefetch" the memory rows needed by the next layer and push them to the GPU, hiding the transfer latency behind computation (see the sketch below).
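A toy simulation of this overlap, with hypothetical names and plain NumPy plus a thread pool standing in for the GPU/CPU split: because the slot indices depend only on the input tokens, the host-side gather can start immediately and run concurrently with the earlier layers.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

NUM_SLOTS, EMBED_DIM = 200_000, 256
host_table = np.zeros((NUM_SLOTS, EMBED_DIM), dtype=np.float32)  # lives in cheap CPU DRAM

def slots_for(tokens: list[int]) -> list[int]:
    """Deterministic: the indices are known the moment the tokens are known."""
    return [(t * 2654435761) % NUM_SLOTS for t in tokens]

def prefetch(indices: list[int]) -> np.ndarray:
    """Gather the rows the Engram layer will need (stands in for a DRAM-to-HBM copy)."""
    return host_table[indices]

def run_earlier_layers(tokens: list[int]) -> np.ndarray:
    """Placeholder for the GPU work that overlaps with the prefetch."""
    return np.tanh(np.outer(tokens, np.ones(EMBED_DIM, dtype=np.float32)))

tokens = [4821, 1093, 77, 30521]
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(prefetch, slots_for(tokens))  # start the copy immediately...
    hidden = run_earlier_layers(tokens)                # ...while the dense layers run
    memory_rows = future.result()                      # ready by the time Engram needs them

print(hidden.shape, memory_rows.shape)  # (4, 256) (4, 256)
```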
DeepSeek's test data shows that even when mounting a 100B (hundred billion) parameter Engram table to CPU memory, the throughput drop is less than 3% compared to pure GPU inference.
This is a conclusion that will delight everyone anxious about being unable to buy HBM. It means that in the future, large models can expand their "memory capacity" almost without limit at low cost, without being bottlenecked by NVIDIA's GPU memory.

Chapter Five: A Victory for Long Context—The Leap in NIAH Testing
In addition to general reasoning, Engram's performance in the long text (Long Context) field also proves the value of "division of labor."
In long text processing, the attention mechanism's window is limited. If the attention is occupied by a large amount of local information (such as fixed phrases), its ability to process global information will decline.
Once Engram takes over local dependencies, the attention mechanism can finally lift its head and look at the road ahead.
In the strict RULER benchmark test, the performance of Engram-27B is astonishing:
- Multi-Query NIAH: From a MoE baseline of 84.2 points, it skyrocketed to 97.0 points.
- Variable Tracking: Improved from 77.0 points to 89.0 points.
This indicates that when "local memory" is outsourced to Engram, the Transformer's original attention mechanism can more efficiently capture the "subtle clues" scattered through documents tens of thousands of words long.

Epilogue: The Puzzle of DeepSeek V4 is Revealed
Putting all of the above together, the outline of DeepSeek's next-generation model, DeepSeek V4, begins to emerge.
Wallstreetcn reported that DeepSeek plans to officially release V4 in February, around the Spring Festival. Looking back at DeepSeek's rhythm: from R1 in January 2025, to V3.2 beating GPT-5 on benchmarks at the end of the year, and now to the upcoming V4, each step has tracked the pulse of technological iteration.
If R1 demonstrated the depth of "reasoning" and V3 showcased the efficiency of MoE, then the upcoming V4 may take aim at decoupling memory from computation by introducing Engram, as the recap below suggests:
- DeepSeek V2: Introduced MLA (Multi-Head Latent Attention), compressing the KV cache and easing the inference memory bottleneck.
- DeepSeek V3: Optimized MoE (Mixture of Experts) with auxiliary-loss-free load balancing, addressing training stability and computational cost.
- DeepSeek V4 (speculative): Introduces Engram (conditional memory), decoupling memory from computation and achieving a symbiosis between the "electronic brain" (computation) and "external memory" (Engram).
This is not a simple version iteration; it is systematic surgery on the underlying defects of the Transformer architecture. After DeepSeek V3 swept the globe with rock-bottom API prices and strong performance, a V4 that integrates Engram would be even more formidable: a larger knowledge base (low-cost memory expansion), stronger logical reasoning (freed-up network depth), and lower inference costs (separation of storage and computation).
More importantly, the report mentioned improvements in V4's understanding of data patterns, "avoiding the performance degradation seen in previous models under long training." This aligns with Engram's characteristic of solidifying static knowledge and reducing the burden on dynamic networks—it makes the model more stable and less prone to "forgetting" or "mental confusion."
At the end of the paper, the DeepSeek team confidently wrote:
“We envision conditional memory as an indispensable modeling primitive for next-generation sparse models.”
This paper, released on the eve of the Spring Festival, is not only a technical showcase from DeepSeek but also a signal to the entire industry: the era of simply piling up computing power and parameters is ending, and the dividend period of architectural innovation has just begun. In this competition to define the standards of next-generation AI, Chinese large models are not only keeping pace but are even redefining the rules of the game.
In 2026, China's commercial aerospace "D-Day" has just passed; the "separation of storage and computation" moment for AI may be arriving right now.
Paper link: https://github.com/deepseek-ai/Engram/blob/main/Engram_paper.pdf
Open source link: https://github.com/deepseek-ai/Engram
