The Diligence Stack - By Creative Strategies

Secret Agent CPU

Why Agentic AI Changes the CPU-to-GPU Ratio in the Datacenter

Ben Bajarin
Mar 24, 2026 ∙ Paid

The Thesis in 60 Seconds

We believe the shift from monolithic LLM inference to multi-step agentic workflows structurally changes the compute mix inside datacenters. Training-era architectures assumed GPUs would dominate every phase of inference. Agentic workloads have challenged that assumption. When an agent calls a tool, queries a database, waits for human approval, or orchestrates sub-agents, the GPU sits idle while the CPU does the work. Our model estimates put that idle window at ~12 to ~22 percent of total inference time, and it grows with agent complexity. The datacenter CPUs deployed today are what we call “cloud native,” meaning built to run cloud and web software. We believe a new class of CPU, one that is agent native, will grow the CPU market even more than most forecasts anticipate. We position agent-native CPUs as a dedicated CPU tier, architected specifically for agentic workflows and deployed alongside the GPU cluster, that recovers the idle capacity and, at modest allocations, improves throughput while lowering cost per token.
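
To make the idle-window estimate concrete, here is a minimal Python sketch of the accounting: it walks a hypothetical five-step agent trajectory and totals GPU compute time against CPU orchestration time. The step timings are illustrative assumptions, not measurements from our model.

```python
# Toy accounting behind the idle-window estimate.
# Step timings are hypothetical illustrations, not measured data.

# Each step is (gpu_seconds, cpu_seconds): model forward passes vs.
# orchestration work (tool calls, retrieval, serialization, policy checks).
steps = [
    (2.5, 0.2),  # plan: model reasons about the task
    (0.6, 0.7),  # tool call: CPU runs a database query and parses results
    (1.8, 0.2),  # synthesize: model incorporates the tool output
    (0.5, 0.4),  # rerank and memory lookup on the CPU tier
    (2.0, 0.1),  # final answer generation
]

gpu_time = sum(g for g, _ in steps)
cpu_time = sum(c for _, c in steps)
total = gpu_time + cpu_time  # serialized case: the GPU waits on every CPU step

print(f"GPU busy {gpu_time:.1f}s, idle {cpu_time:.1f}s "
      f"-> idle share {cpu_time / total:.0%}")
# With these illustrative timings the idle share lands around 18 percent,
# inside the ~12 to ~22 percent window estimated above.
```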

Early estimates put a CPU rack at roughly $300K fully loaded in a base configuration (about $500K in our most bullish modeling). A GPU rack costs roughly $4M or more, and draws about seven times the power. At a 5 percent CPU power allocation in a 1GW facility, our model shows a ~2 percent increase in effective token throughput and a ~3.7 percent decrease in cost per token, with only a ~1.7 percent increase in total capex. Breakeven sits at roughly 10 percent allocation, beyond which token throughput declines faster than cost savings accrue. The sweet spot is narrow but real, and it scales with facility size.
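
The sketch below is a toy Python version of that allocation trade-off. The rack sizes, baseline utilization, and linear idle-recovery curve are illustrative assumptions rather than the parameters of our full stress test, so its outputs land near, but not exactly on, the figures above.

```python
# Toy version of the 1GW allocation trade-off.
# Rack sizes, utilization, and the recovery curve are illustrative assumptions.

GPU_RACK_KW, GPU_RACK_COST = 140.0, 4.0e6  # ~7:1 power draw vs. a CPU rack
CPU_RACK_KW, CPU_RACK_COST = 20.0, 3.0e5   # ~$300K fully loaded, base config
FACILITY_KW = 1.0e6                        # 1GW facility
BASE_UTIL = 0.82     # assumed GPU utilization with no CPU tier (~18% idle)
RECOVERABLE = 0.12   # utilization points recoverable via orchestration
SATURATION = 0.10    # CPU power share at which recovery saturates

def metrics(cpu_share):
    gpu_kw = FACILITY_KW * (1 - cpu_share)
    cpu_kw = FACILITY_KW * cpu_share
    util = BASE_UTIL + RECOVERABLE * min(cpu_share / SATURATION, 1.0)
    throughput = gpu_kw * util  # token throughput ~ GPU power x utilization
    capex = (gpu_kw / GPU_RACK_KW) * GPU_RACK_COST + (
        cpu_kw / CPU_RACK_KW) * CPU_RACK_COST
    return throughput, capex

base_tp, base_capex = metrics(0.0)
for share in (0.05, 0.10, 0.15):
    tp, capex = metrics(share)
    cpt = (capex / tp) / (base_capex / base_tp) - 1
    print(f"{share:.0%} CPU power: throughput {tp / base_tp - 1:+.1%}, "
          f"capex {capex / base_capex - 1:+.1%}, cost/token {cpt:+.1%}")
# At 5% this toy shows ~+2% throughput, echoing the model above; past the
# ~10% saturation share, throughput declines, turning negative by 15%.
# Note: this version shifts power between tiers, so capex falls slightly;
# adding CPU racks on top of a fixed GPU fleet would show a small increase.
```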

The Context

How fast things change in AI land. If we had been talking a year ago, the prevailing thesis would have been that datacenter CPUs were being commoditized while the GPU absorbed the vast majority of the value layer. While GPU/XPU racks will still command the largest dollar share, the relevance of the CPU has come full circle as the compute layer shifts to agentic workloads. As inference evolves from prompt-response interactions toward agentic systems that retrieve information, call tools, manage state, take actions, and coordinate multi-step workflows, the bottleneck begins to move. In that environment, the limiting factor is less often raw model throughput in isolation and more often the system’s ability to orchestrate work around the model. That has direct implications for how future AI datacenters should be architected and, by extension, how stakeholders should think about where incremental infrastructure dollars will be allocated.

Our view is that training-era CPU-to-GPU assumptions have been challenged by agentic inference. The older framework treated the CPU largely as a control layer attached to a GPU-heavy architecture, essentially a basic head-node CPU. That may have been sufficient when the dominant job was training large models or serving relatively simple, turn-based LLM requests. The assumption breaks down when the workload shifts to reasoning models and agents. In agentic environments, each model step can trigger retrieval, reranking, serialization, memory lookup, policy checks, browser or application interaction, logging, compliance handling, computer use, and workflow management. All of that consumes CPU cycles, memory capacity, and system-level coordination, often at a rate that is out of proportion to how the market currently thinks about CPU as a percentage of AI infrastructure.

Agentic workloads introduce latency and utilization penalties that do not show up in conventional compute framing. GPUs are highly efficient when fed continuously. They become materially less efficient when they are forced into intermittent work patterns because the orchestration layer cannot keep up. Said differently, if the GPU is waiting on the CPU tier to prepare the next step, the most expensive part of the cluster is underutilized because the cheaper part of the cluster is underprovisioned. That is a poor trade even before we get to the question of power and capex allocation. We think this is where much of the current market framework is still lagging the workload transition.
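
A simple steady-state sketch shows how sharply this bites. Assume a pool of concurrent agent sessions, each alternating a GPU step with CPU orchestration work, all sharing one CPU tier: once the tier saturates, every session's cycle stretches and the GPUs absorb the wait. All parameters here are hypothetical.

```python
# Steady-state sketch: GPU utilization when agent sessions share a CPU tier.
# All parameters are hypothetical; the point is the cliff, not the values.

SESSIONS = 10_000   # concurrent agent sessions (assumed)
G = 1.2             # GPU seconds per agent step (assumed)
C = 0.25            # CPU core-seconds of orchestration per step (assumed)

def gpu_utilization(cores):
    cycle = G + C                  # unsaturated: each step pays G, then C
    demand = SESSIONS * C / cycle  # CPU core-seconds needed per second
    if demand > cores:
        # Saturated tier: cycles stretch until aggregate CPU demand matches
        # capacity, and the GPUs absorb the extra wait as idle time.
        cycle = SESSIONS * C / cores
    return G / cycle

for cores in (1_000, 1_800, 2_600):
    print(f"{cores:>5} CPU cores -> GPU utilization {gpu_utilization(cores):.0%}")
# An undersized tier drags utilization to ~48%; once adequately provisioned,
# utilization caps near 83%, i.e. the ~17% serialized orchestration share
# that matches the idle window discussed earlier.
```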

This is also why we believe the CPU side of AI infrastructure is being underappreciated in both architecture planning and market modeling. This thesis is not that GPU demand weakens. In fact, the opposite is likely true. A better-provisioned CPU orchestration layer can improve effective GPU utilization and increase the economic output of the GPU fleet already being deployed. That is an important distinction in a quickly changing market. The CPU does not compete with the GPU in this framework. It raises the return on the GPU asset by reducing idle time and smoothing workflow execution. In practical terms, that means AI infrastructure can become more CPU-intensive even as the center of gravity remains decisively GPU-led.

CPU Architectures: Cloud Native and Agent Native

Our work suggests that this ratio shift is fundamentally underappreciated. The training-era world looked closer to a 1:4 CPU-to-GPU/XPU resource relationship in AI deployments. Agentic inference moves toward 1:1, and some forms of enterprise workflow automation can push the requirement higher still. That should not be confused with a claim that power or capex allocation becomes balanced in the same way. CPU server-equivalents consume a fraction of the power and cost of GPU racks. That is precisely why the economics matter as a factor of true TCO. In our datacenter modeling, even a relatively modest dedicated CPU tier can improve overall system economics by lifting effective throughput and reducing cost per unit of useful work.

We find it useful to separate this CPU demand into two layers: cloud-native CPUs, meaning the existing fleet that is being repurposed or reconfigured to handle agentic orchestration tasks, and agent-native CPUs, meaning greenfield racks purpose-built for agentic workloads from the ground up. NVIDIA's progression from Grace to Vera dedicated CPU racks, and now Arm's AGI CPU with 136 cores at 300 watts, a dedicated core per thread, and rack densities above 45,000 cores in liquid-cooled configurations, represents a new class of agent-native silicon purpose-built for orchestration-heavy workloads rather than adapted from existing cloud infrastructure. The market, in our view, still tends to assume that the best answer is to push nearly every available watt into GPUs. That may remain directionally right for training clusters. It is unlikely to be right for agentic serving environments where coordination overhead is a critical part of the workload.
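
As a sanity check on those density figures, the arithmetic below derives socket count and CPU power per rack from the quoted 136-core, 300-watt part and the 45,000-core rack. The derived numbers are ours, not vendor specifications.

```python
# Back-of-envelope check on the agent-native rack figures quoted above.
# The 136-core / 300W part and the 45,000-core rack are taken as given;
# socket count and rack power are derived here, not vendor specifications.

CORES_PER_SOCKET = 136
WATTS_PER_SOCKET = 300
RACK_CORES = 45_000

sockets = RACK_CORES / CORES_PER_SOCKET
cpu_kw = sockets * WATTS_PER_SOCKET / 1_000
print(f"~{sockets:.0f} sockets per rack, ~{cpu_kw:.0f} kW for CPUs alone")
# -> ~331 sockets and ~99 kW before memory, networking, and cooling
# overhead, which is why these densities imply liquid cooling.
```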

While we remain convinced the datacenter CPU TAM will experience growth not seen in years, questions remain. The cloud-native installed base will absorb some of this demand, which is why our model applies a shared infrastructure discount that narrows over time as agent-native deployments scale. The exact pace of this transition will vary by workload mix, model architecture, enterprise adoption patterns, specific customer needs, and how quickly AI systems move from answer engines to execution engines. Not every inference workload is equally agentic, and not every agentic workload is equally CPU-constrained. We also need to keep monitoring what hyperscalers actually optimize for in production, because architecture decisions in the field have a way of clarifying debates faster than conference slides do. What would change our view is evidence that orchestration overhead compresses materially as model and serving stacks mature, or that agentic software patterns prove narrower than current adoption signals suggest. For now, what stands out to us is that the software stack is moving in the opposite direction. It is becoming more stateful, more tool-heavy, and more operational.

The market broadly understands the GPU buildout. We are less convinced it fully appreciates the CPU infrastructure required to make agentic AI economically work at scale. As AI shifts from generating answers to completing work, the orchestration layer becomes more consequential, and the infrastructure conversation broadens with it. That does not reduce the importance of GPUs. It changes what must be attached to them, what improves their utilization, and which parts of the stack are quietly becoming more valuable than consensus currently reflects.

What subscribers get in the full report

  • The full architecture case for why training-era CPU-to-GPU assumptions break down as workloads move from chat and copilots toward agentic inference and enterprise automation.

  • Our framework for the main CPU demand drivers, including multi-agent fan-out, retrieval and reranking, state management, policy and compliance logic, software-operation agents, and workflow coordination.

  • A detailed 1GW datacenter stress test showing how a modest CPU power allocation can improve effective throughput and lower cost per unit of useful work.

  • The installed-base argument for why existing enterprise CPU infrastructure becomes a force multiplier for future AI deployment rather than a displaced legacy layer.

  • An updated datacenter CPU forecast, with TAM expansion from agentic CPUs.

  • The monitoring framework we would use to validate, refine, or challenge the thesis as real-world deployment data comes in.

  • The strategic implications across hyperscaler silicon, merchant CPUs, memory, networking, and the broader attach opportunity created by agentic serving.
