
SemiAnalysis: Agents Explode in Popularity, CPU Becomes New "AI Bottleneck"
CPUs are sold out! Dylan Patel, chief analyst at the renowned semiconductor analysis firm SemiAnalysis, points out that as the AI workload paradigm evolves from simple text generation to complex "Agents" and "Reinforcement Learning (RL)", CPUs are facing an extremely severe capacity shortage.

With the explosive growth of AI agents and Reinforcement Learning (RL), general-purpose processors (CPUs), which were marginalized in the early stages of the AI wave, are now facing an unprecedented compute crunch, becoming the new infrastructure bottleneck following GPUs.
Recently, as major tech giants have released their earnings reports, the market's focus on AI infrastructure has been subtly shifting. Investors are not only keeping a close eye on GPU orders and deliveries but are also beginning to look for the new growth drivers created as AI applications are deployed.
On April 8th, Dylan Patel, chief analyst at the renowned semiconductor analysis firm SemiAnalysis, pointed out in a deep-dive interview that because the AI workload paradigm is evolving from simple text generation to complex "Agents" and "Reinforcement Learning (RL)," CPUs are facing an extremely severe capacity shortage.
In the first few years of AI development, core computing power demand was almost entirely occupied by GPUs. As Dylan Patel put it: "In the first few years of AI, CPUs indeed lagged behind... the workload was light. You send a string, it returns a string—simple inference didn't require much from the CPU."
However, this situation has changed dramatically in the past few months. The core driving force is a new generation of models, exemplified by OpenAI's o1, that possess logical reasoning and agentic capabilities.

Agents and Reinforcement Learning Drive Up CPU Demand
Models are no longer just "generating text"; instead, they are beginning to autonomously execute tasks, call databases, and self-validate, which causes the CPU workload to rise exponentially.
Dylan Patel offered a striking data point:
"Just in the past six months, revenue from code agents has risen from a few billion dollars to over $10 billion in a very short time. The task duration of these agents has also significantly increased: for example, Claude Code can work continuously for six, seven, or even seven or eight hours... it can ping, crawl, and work autonomously in an agentic manner. This also requires a massive amount of CPU."
Meanwhile, the training loops for Reinforcement Learning are becoming increasingly tight. Future AI will not only solve math problems but also navigate within physics simulators, requiring every step generated by the generator (model) to undergo high-frequency validation on CPU clusters.
"This loop has become tighter and tighter over the past few years... in the past six months, we've seen the entire cloud market's CPU capacity run out."
Cloud Vendors Expand Frantically; Microsoft "Sells Out" of CPUs, Leaving GitHub Unstable
The sudden surge in market demand has directly led to the depletion of cloud computing power. To meet the needs of leading AI labs, large cloud vendors have even sacrificed the stability of other business lines. Dylan Patel stated frankly:
"I don't know if you've been dealing with GitHub much lately, but it's really unstable... that's because Microsoft sold all their idle CPUs to others."
This shortage is forcing enterprises into extreme engineering migrations. It was revealed that OpenAI previously ran almost exclusively on x86 CPUs, but to secure computing power, it went directly to Amazon for the processors Amazon already had.
"Amazon has a massive number of ARM CPUs, so they ported their entire stack over—as long as I can get a CPU, I'm willing to port my codebase anywhere."
Regarding CPU market prices, Dylan Patel said:
"CPU profit margins aren't that high, but they are climbing because Intel and AMD are raising prices and supply is tight."
From the data, expansion is happening across the industry. "The number of CPU servers installed by Amazon has grown 3x year-over-year. There is no capacity anywhere."
Furthermore, to avoid leaving expensive GPUs idle while waiting, customers must keep a "hot pool" of CPUs running continuously. This business logic further amplifies the demand for CPUs.
Hardware Gold Rush Spreads: Storage Prices Skyrocket, 3nm Capacity on Full Alert
The shortage of computing power has rapidly transmitted upstream along the industry chain. Not only have Intel and AMD issued price increase notices, but even the consumer-facing PC market has been affected (such as the Apple Mac mini being out of stock).
Dylan Patel used an extremely vivid phrase to describe the current hardware market:
"Usually, when a gold rush occurs, even the guy with a broken pickaxe can sell his pickaxe."
He also cited price-increase data for the storage and chip manufacturing sectors, which the market is watching closely:
"Memory prices have increased 4x in the past year and will continue to rise. Now SSD prices have also risen 3-4x, and will rise by at least another 60%."
Even more concerning for the market is the squeeze on wafer foundry capacity. AI chips are sucking dry TSMC's most advanced process capacity:
"AI is buying up all 3nm and 2nm capacity... now all AI chips are migrating to 3nm: AMD's MI350 series, Amazon and Google's Trainium 3 and TPU v7, NVIDIA's Rubin—all of these are on 3nm."
This has even pushed mobile giants like Apple and Qualcomm off TSMC's 3nm node and led NVIDIA to shift some orders to Samsung.
The following is the transcript of the interview:
Host:
Hello everyone. It's a pleasure to have Dylan here. The first time I saw his video was an interview where he talked about: although we have CPUs, we need to discuss Neo clouds and why they have a right to exist. That was very interesting to me because today's theme is precisely: when agents have arrived, what are the new infrastructure primitives? You clearly articulated the differences between Neo clouds and hyperscalers, and why they should exist. Can you share that with us?
Dylan Patel:
Of course. In the AI era, hyperscalers were a bit slow, right? Google, Amazon, Microsoft were all a bit slow to enter the AI field. So a whole new batch of companies popped up, and a new low barrier to entry emerged—many of the complex software products built by Amazon, Microsoft, and Google aren't actually needed. In fact, that complex software actually slowed down AI development: they have custom networks, but those networks aren't very AI-friendly, focusing more on reliability and storage traffic rather than things like doing all-reduce on the network.
So these large cloud vendors and hyperscalers have many things that Neo clouds can skip directly, then build focused, optimized solutions and provide lower costs because their overhead is much lower—there aren't 20,000 Google project managers sitting in conference rooms in these Neo clouds (although some Neo clouds have started hiring Google project managers, thereby slowing down). They move quickly on energy and move quickly on setting up GPU clusters, so they were able to carve out a market. Those were the early ones. Since then, many imitators or followers have appeared—many didn't succeed, and many are succeeding. It's actually a battle over who is most capable.
Host:
So, are all—I think there are about 200 or so of these Neo clouds, right?—do you see differentiation among them? Are some just replicating the software stack of the earliest ones? Are some doing other things? Have you seen which of these Neo clouds have been successful and which have been less so?
Dylan Patel:
Yes, there are many factors that distinguish them. We have something called "Cluster Max" that ranks all Neo clouds. We test various things: observability, reliability, networking, security, management, orchestration, etc., and these are all different. For example, some will test if their GPUs work properly when the user is idle—is it an active health check or a passive health check? Is the fan speed appropriate? Is the power consumption correct? Are there node issues? Network issues? Is the performance up to standard? There are all kinds of checks and tests because GPUs are unreliable. There's also the type of software on top of the GPU: many people started with bare metal; for example, Microsoft's initial contract with CoreWeave was all bare metal—you just SSH in, and Microsoft builds the environment itself.
But as things evolve, people want more: some want to install Slurm, which is simple; some want to install Kubernetes, a bit harder but still simple; some want to install Slurm on Kubernetes because it's easier to push jobs, etc. Now people are starting to do managed Ray services and the like for Reinforcement Learning (RL). So there's one class of Neo clouds building these things, while another class says, "I don't care, I'll just build GPUs and rent them out as bare metal." There are also differences in cost: Neo clouds with good software tend to charge more, which in a way returns to the traditional model—Google, Microsoft, and Amazon have good software and charge much more. And you will see many of these cloud companies starting to try to launch inference services and other things.
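To make the active-versus-passive health check distinction above a bit more concrete, here is a minimal sketch, assuming a Linux node with nvidia-smi available, of the sort of passive check a Neo cloud might run while a customer's GPUs are idle. The metrics mirror the ones Dylan mentions (fan speed, power draw); the threshold values are purely illustrative assumptions, not from the interview.

```python
# A minimal sketch (not from the interview) of a passive GPU health check:
# poll nvidia-smi on an idle node and flag GPUs whose fan speed, power draw,
# or temperature look wrong. The limits below are illustrative assumptions.
import subprocess

LIMITS = {"fan.speed": 90.0, "power.draw": 500.0, "temperature.gpu": 85.0}  # assumed limits

def check_idle_gpus():
    fields = "index," + ",".join(LIMITS)
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={fields}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    alerts = []
    for line in out.strip().splitlines():
        index, *values = [v.strip() for v in line.split(",")]
        for (metric, limit), value in zip(LIMITS.items(), values):
            if value not in ("[N/A]", "N/A") and float(value) > limit:
                alerts.append(f"GPU {index}: {metric}={value} exceeds {limit}")
    return alerts

if __name__ == "__main__":
    for alert in check_idle_gpus():
        print(alert)
```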
Host:
Similarly, following that line of thought, I wanted to talk about the CPU. Historically, we've had customers asking for things like OpenClaw—some say, "I need my sandbox or CPU box to run for a long time, can you give me a $5 product like Hetzner?" I say no, because those are bare metal machines and the cost is very low. But when you provide a larger software product, the cost is higher, so it's hard to compete. So I guess it's the same in Neo clouds; as you mentioned, the cost of selling bare metal is lower than for those moving in the software direction.
I'm just curious why they have a right to exist—it's a bit like an analogy for what we do. But the real question is, you mentioned this, and it's a direction we're thinking about: the CPU has become the new bottleneck. Previously, every investor and everyone I talked to only talked about GPUs. Now you've released a major report on CPUs, and I thought, "Okay, right, thank you." Your report says this year will be a bottleneck. So please give us the high-level, TL;DR view: why is the CPU a bottleneck now? What have you seen?
Dylan Patel:
Yes, in the first few years of AI, CPUs indeed lagged far behind. They were used for some storage, some checkpoints, some data preprocessing and pre-training, but the workload was light. In terms of inference, models weren't good enough to become agents—you couldn't let them act step-by-step. So at that time, there was no capability to have models execute actions and string them together; basically, you sent a string, it returned a string—simple inference didn't require much from the CPU.
But over the past few years—not just the past few years, for example, starting with Q*, then the turmoil at OpenAI, and finally the release of the o1 preview—honestly, that was 15 or 16 months ago (it feels much longer). o1 was the first of these models. Then a large number of models emerged. Previously, people did simple things, like using regex to check model output to see if it was correct, or doing structured output for function calls, etc. But over time, the checks on models became much larger in scale and have been fully integrated into training—through reinforcement learning.
It's no longer just using regex, but all kinds of classifiers; it's no longer just classifiers, but doing code unit testing and compilation; going further, you run agentic workflows that actually go and call databases or interact with a CPU-heavy environment (such as physics or biological simulations). The model outputs content, then checks it—this environment (the RL environment)—and then goes back to train based on it. This loop has become tighter and tighter over the past few years.
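As a rough illustration of the loop described above—the generator (the model, on GPUs) emits candidate code, and a CPU-side validator compiles or unit-tests it to produce a training signal—here is a minimal sketch. The generate_candidate call is a hypothetical stand-in for whatever inference API a lab uses; the pytest-based check is just one plausible way to implement the validator.

```python
# A minimal sketch of a generator-validator RL loop: the GPU-side model proposes
# code, and a CPU-side validator runs the candidate against unit tests to produce
# a reward. generate_candidate is a hypothetical placeholder for the model call.
import os
import subprocess
import sys
import tempfile

def validate_on_cpu(candidate_code: str, test_code: str) -> float:
    """Write the candidate and its tests to a scratch dir and run pytest; reward 1.0 on pass."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "solution.py"), "w") as f:
            f.write(candidate_code)
        with open(os.path.join(tmp, "test_solution.py"), "w") as f:
            f.write(test_code)
        try:
            result = subprocess.run(
                [sys.executable, "-m", "pytest", "-q", "test_solution.py"],
                cwd=tmp, capture_output=True, text=True, timeout=60,
            )
        except subprocess.TimeoutExpired:
            return 0.0
        return 1.0 if result.returncode == 0 else 0.0

def rl_step(prompt: str, test_code: str, n_samples: int = 8):
    """One step of the loop: sample candidates on the GPU side, score them on the CPU side."""
    candidates = [generate_candidate(prompt) for _ in range(n_samples)]  # hypothetical model call
    rewards = [validate_on_cpu(c, test_code) for c in candidates]
    return candidates, rewards  # fed back into the policy update
```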
And recently—let's say the past six months—revenue from code agents has risen from a few billion dollars to over $10 billion in a very short time. The task duration for these agents has also increased significantly: for example, Claude Code (or similar models) can work continuously for six, seven, even eight hours. In this process, it will call databases, call various things (at least we use a lot of Cron servers)—anyway, it does everything: it can ping, crawl, and work autonomously in an agentic manner. This also requires a massive amount of CPU. So this area has exploded in the past six months, and combined with the RL training loops getting tighter, the result is that in the past six months we've seen all the CPUs in the entire cloud market run out—I don't know if you've been dealing with GitHub much lately, but it's really unstable.
Host:
I think you're the third person to mention that today.
Dylan Patel:
Okay. We've been checking GitHub's statistics: how frequent are the outages? How frequent are the commit failures? The situation is bad. That's because Microsoft sold all their idle CPUs—some for internal labs to use, but more often to external labs. They have contracts with Anthropic and OpenAI, so they have almost no CPUs left for themselves.
We're seeing the same situation at many other companies. Previously, there were many GPU servers for every CPU server; for example, 100 megawatts of GPUs might only be served by 1 megawatt or even less of CPU capacity. But now that ratio is becoming very close, whether for RL training or inference (agentic inference). Then you see CPUs running out everywhere. The number of CPU servers installed by Amazon this year is up 3x year-over-year. There is no capacity anywhere. This not only makes GitHub very unstable, but possibly other places as well.
Host:
I mean, we've talked a lot about infrastructure-related things today. It's become very common to see some infrastructure provider—whether it's GitHub or something else (not naming names)—experiencing an outage. This could be due to CPU shortages, or it could be due to the scale of workloads, etc.
Dylan Patel:
Or it could be that everyone's infrastructure code is "vibe coded."
Host:
Right, it could be that everyone's infrastructure code is vibe coded. I don't think it all is, but maybe a part of it is. What I find interesting is the number of CPU workloads running with us—Daytona basically has three use cases: code and command execution (like Claude Code-style things that need to run on a CPU); then there's the computer-use use case, which we're actually seeing grow very fast. We just announced Windows Sandbox today, which also runs on the CPU.
If you need an agent to handle legacy software (like in finance, customer service, etc., it's all there). Also, as you said, regarding reinforcement learning, we have many people who usually use Kubernetes who are now starting to use us. But what's interesting is that the scale and volume of these workloads are massive and growing extremely fast. And we are the smallest cloud in the world. So I'm curious: if a company as small as ours has such a large volume, what does it look like at scale?
And we've encountered—I wonder if you have any insights on this—just RL alone, let alone long-running agents (just for the latter, we saw a customer come in yesterday who ran 1 million vCPU workloads in 6 hours—just one customer). So how many customers are doing RL? They will all need this. I don't know if you have any insights, but I'm curious.
Dylan Patel:
I mean, some of the metrics are quite startling—1 million vCPUs sounds crazy. But some of the contracts and workload scales that people are signing are even more absurd than that.
Host:
I'm sure that's because we are indeed very small.
Dylan Patel:
Right, that's why. So I think when you look at companies like Anthropic and OpenAI again, they've completely eaten up the entire capacity of multiple clouds. A big driver of the recent deal between Amazon and OpenAI—yes, OpenAI wanted money, they needed compute, but they also went directly to Amazon and said, "Give us your CPUs."
Previously, OpenAI's stack ran almost exclusively on x86 CPUs, but Amazon has a massive number of ARM CPUs, so they ported the entire stack over—as long as I can get a CPU, I'm willing to port my codebase anywhere. That shows the level of engineering effort people are willing to put in, because usually developers are too lazy to move and would just go elsewhere for capacity, but now there's no capacity elsewhere.
Host:
Yes, interesting. We're all x86—only that; we don't have ARM yet. But besides those two, NVIDIA has its own CPU, and others are building their own CPUs. There are differences between these CPUs too. Are they all just general-purpose CPUs? You probably know more than I do, I'm super curious.
Dylan Patel:
Regarding the types of CPUs.
Host:
There are too many types now. Previously, it was basically just x86 and ARM. Now there are different types of CPUs. Is it because everyone has run out, or are they actually better in some ways? Is there anything special?
Dylan Patel:
Usually, when a gold rush occurs, even the guy with a broken pickaxe can sell his pickaxe. The CPU market is very dynamic right now. Currently, it's mainly Intel and AMD—I assume you mainly use Intel and AMD CPUs. Both companies are saying they are completely sold out and have issued price increase notices to customers. They aren't even competing with each other anymore; they're just thinking, "How much can I build and sell?" Similarly, Amazon has Graviton CPUs, which have reached the fifth or sixth generation. NVIDIA has Grace and Vera CPUs.
But previously, no one really deployed Grace standalone CPU chassis—NVIDIA did some small-scale deployments for PR, but actual standalone CPU deployments were very few. Why? Simply because they weren't good enough. But looking ahead, maybe their CPUs have improved, maybe they're bundled better, but more importantly, because they have capacity (since everyone else is out of capacity), they can get more contracts for their various CPUs, likely starting deployment later this year or early next year.
So it's a very dynamic market. Then Microsoft and Google are also starting to deploy their own CPUs and are starting to scale up. Arm is releasing a CPU in a few weeks that Meta will adopt, as will a few companies like Cloudflare. So there will be more ARM standalone solutions, rather than just Arm licensing IP to others. More diversification will appear in the market—that's what happens in a gold rush. Then we'll see, as the supply-demand gap gradually closes, whose quality is actually the best and who can stay.
Host:
But it looks like the scale of demand is still going to grow. At least from what I see: first, RL—it seems most RL today is post-training, but there are already vendors and companies pushing and building services for real-time RL. Because you have agents—you have SaaS products that are agents running in the background—and then they do RL at the end of the day, basically to learn from their own behavior.
So that's growing. Also, these long-running agents—if they can work longer and solve more problems, you can basically let them do more, and people will launch more and more of these agents, which means more and more CPU boxes. So from your perspective, understanding the market dynamics, it might eventually converge, but I feel like the gap is going to widen before demand shrinks.
Dylan Patel:
Yes, absolutely. Because initially all RL was "come do math proofs," and math proofs have low resource requirements. And the model (the generator) would produce a lot of output and then send the correct answer (or what it thought was the answer) to the server for the server to verify. But over time, it's not like that anymore: the model submits multiple times, or tries to compile multiple times in its agentic workflow, or tries to do unit tests multiple times. This increases the frequency with which the generator (the model) sends to the validator, and this loop is getting tighter. As we enter more complex RL, the model will actually constantly validate its own output.
For example, imagine models trained in the next year or two—like a robot model, validating in a world model: a vision-language model (VLM) navigating in the world, trying to pick things up and put them down. Every step needs to be validated, and the physics model runs on some CPU cluster. The amount of CPU needed for that would be crazy, far more than for unit testing or running math proofs. Look at o1; it can basically only do math. Look at models like GPT-5.4 or Opus 4.6; they can do agentic software. But when we enter the next stage—whatever that is—there will be models that can understand, "I need to tie my shoelaces; when I tie them, what is the strength of the shoelace? What is the tensile strength?" All of that requires computation, because the generator is just producing the next step, but every step needs to be checked more frequently, and the computational intensity of checking each step will also increase over time.
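To illustrate the shift being described—from checking only a final answer to checking every step the model takes—here is a toy sketch. ToyEnv is a made-up stand-in for a CPU-hosted physics or biology simulator; the point is simply that the validator is invoked once per step rather than once per episode.

```python
# A toy sketch of per-step validation: every action the (GPU-side) policy proposes
# is checked by a (CPU-side) environment before the next step. ToyEnv is purely
# illustrative and stands in for a real physics or biology simulator.
import random

class ToyEnv:
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        self.pos += action
        ok = abs(self.pos) <= 10          # the "physics check" for this single step
        return self.pos, ok

def rollout(policy, env, max_steps=50):
    obs = env.reset()
    reward = 0.0
    for _ in range(max_steps):
        action = policy(obs)              # GPU-side model proposes the next step
        obs, ok = env.step(action)        # CPU-side check of that step
        reward += 1.0 if ok else -1.0     # per-step signal, not end-of-episode only
        if not ok:
            break
    return reward

if __name__ == "__main__":
    print(rollout(lambda obs: random.choice([-1, 1]), ToyEnv()))
```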
Host:
But there's another thing you might know better than I do: a GPU's throughput determines how many CPU boxes it can drive in parallel. With the advent of the next generation of GPUs, I think they will be able to launch, manage, or feed more CPUs than they do now, which will put yet more pressure on CPUs.
Dylan Patel:
Yes, definitely. GPU power consumption is also increasing, so one GPU will correspond to more and more CPU usage over time. And every generation of GPU becomes more expensive, while vCPU prices stay flat or decrease slightly. So the scale is indeed different: a Blackwell compared to a Rubin—performance increases by X times, and the price also increases by X times.
As for the CPU, whether you buy the previous generation or the new generation—this generation has 192 vCPUs, the previous generation had about 96—you get more vCPUs, but the price rises roughly in proportion to the extra vCPUs you get. So the ratio of vCPUs per GPU will grow, and the cost trend might also favor the CPU, but it's uncertain by how much.
Host:
Yes, another pressure—we've seen this with larger customers: they have GPU time quotas and don't want GPUs to sit idle. So they would rather pay for a hot pool of CPUs so that when a GPU task comes in, the CPUs are already hot (actually working). For them—given what we do—CPUs are a relatively cheap resource (although we don't think of ourselves as cheap), at least in this scenario. And this actually consumes even more CPUs, because the cost of idle GPUs is too high.
Dylan Patel:
Right, that's a really interesting point. Business-model-wise, no one—of course, there are on-demand GPUs, but for example, Lambda has over 50,000 GPUs, of which only 4,000 are on-demand, and they are always sold out. So essentially, no one really has on-demand GPUs. Everyone signs at least long-term contracts (multi-month), and in most cases, multi-year. CPU usage, on the other hand, can be started and stopped at any time—that's why everyone moved to the cloud in the first place. But under these workloads, the GPU generator (the model running on the GPU) produces a bunch of stuff and sends it to the validator. If the validator isn't ready and waiting, the GPU is spinning idle. You've already paid for the GPU, so if you can't get resources on the CPU side instantaneously, you should pre-provision—otherwise, while you're still loading a simulator or an environment, you're just wasting money. So that's true.
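The "hot pool" pattern being discussed here—paying to keep CPU-side workers warm so the GPU-side generator never waits—might look roughly like the following in miniature. The environment-loading and validation logic are illustrative placeholders; the point is that the expensive setup happens once per worker, up front, rather than on the critical path of each GPU batch.

```python
# A miniature sketch of a pre-provisioned "hot pool" of CPU validators: the pool
# starts (and each worker loads its environment) before any GPU output arrives,
# so validation never blocks the generator. The setup and check below are
# illustrative stand-ins, not a real workload.
import time
from concurrent.futures import ProcessPoolExecutor

_ENV = None  # per-worker state, populated once at startup

def warm_up():
    """Runs once per worker process: pay the environment-load cost up front."""
    global _ENV
    time.sleep(2)  # stands in for loading a simulator, datasets, DB connections, etc.
    _ENV = {"ready": True}

def validate(sample: str) -> float:
    """CPU-side check of one generator output, using the already-loaded environment."""
    assert _ENV is not None and _ENV["ready"]
    return 1.0 if len(sample) % 2 == 0 else 0.0  # trivial stand-in for a real check

if __name__ == "__main__":
    # Keep the pool warm for the lifetime of the job; in practice, loop over GPU batches.
    with ProcessPoolExecutor(max_workers=8, initializer=warm_up) as pool:
        gpu_batch = ["output_a", "output_b", "output_xyz"]  # stand-in for generator output
        rewards = list(pool.map(validate, gpu_batch))
        print(rewards)
```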
Host:
Not only that—once these are running, you start another hot pool, so each new iteration keeps growing. So what does this mean overall? We skipped RAM and haven't talked about memory. Previously the GPU was the bottleneck—we measured that—and now it's the CPU. For most people, what's more tangible is that PCs are hard to buy now because they're super expensive. Will CPUs be like that too? You just mentioned they're relatively cheap and prices are flat. Will market pressure drive their prices up?
Dylan Patel:
Yes, PCs, laptops, and custom-built PCs are all hard to buy. For example, the Apple Mac mini is basically sold out. We bought a large batch of Mac minis because people who used to use Excel and Windows now want to use Claude Code, and OSX clearly has a better development environment. So people are buying a lot of Mac minis to deploy and use. I think the whole sector is similar, and data centers are even less elastic in their resource purchasing, which also leads to price increases. GPUs have always been expensive, and NVIDIA's profit margins have consistently been over 70%.
CPU profit margins aren't that high, but they are climbing because Intel and AMD are raising prices and supply is tight. Memory prices have risen 4x in the past year and will continue to rise. SSDs have risen too—all resources are rising: SSD prices are up 3-4x and will rise by at least another 60%—not as much as DRAM, but still a lot. So overall, Intel and AMD's CPU capacity can be shifted between PCs and data centers to some extent, and memory and storage are very interchangeable. The result is: forget the average user—you have to buy a Mac mini now, or you'll be part of the permanent underclass forever, roughly that kind of mindset.
Host:
One last question, we're almost out of time. Not financial advice, but Intel was in a really bad spot before and has recently started to pick up. Will CPU demand pull them out of their predicament?
Dylan Patel:
They will be better off, but that's not to say the company is saved—company valuations are based on future cash flows. And there is a possibility they will get Apple or other customers. It's not just that CPU demand is so high that Intel can capture some short-term profit from it—others (AMD, Amazon, etc.) will catch up and fill their own capacity. More importantly, AI is buying up all the 3nm and 2nm capacity, and in a few years people will have to move in other directions. For example, with NVIDIA acquiring Groq, people have come up with all sorts of nonsensical reasons—partly because they wanted extremely fast inference, but partly because Groq is manufactured at Samsung, since TSMC didn't have any 3nm capacity for them and they needed to tape out elsewhere.
If AI is really as crazy as we believe and demand is as crazy as we believe, next year will be even crazier. Then as long as you build any decent chip, it will sell—that's roughly the philosophy. Obviously, they've done more in terms of architecture and so on, but the same situation applies to Apple: TSMC told Apple, "Hey, get off 3nm, migrate to 2nm quickly, I can do it." Because all the AI chips are on 3nm, and that migration takes time—small mobile chips are easier to manufacture than large AI chips. Now all AI chips are migrating to 3nm: AMD's MI350 series, Amazon and Google's Trainium 3 and TPU v7, and NVIDIA is releasing Rubin next week—all of these are on 3nm. TSMC told Apple to move off, and told Qualcomm and MediaTek to move off. These three companies might think, "Maybe we should use Intel, because Intel isn't telling us to move"—but Intel can't do it. So it's hard for everyone.
Host:
I still have many questions to ask, but we only have 20 seconds left. My next question will definitely run over. Let's leave it there for now. Thank you very much for coming to talk with us. Thank you!
