AI Inference Infrastructure: The 2026 Operator's Playbook

Training built the headlines. Inference pays the electric bill. What AI inference infrastructure is, why it has to live close to people, and how we serve it on our own power.

Chad Harris · May 13, 2026 · 9 min read
Training built the headlines. AI inference infrastructure pays the electric bill. The frontier-model launches get the press, but the machine that serves those models to real people — every chat, every query, every agent step — is the one that runs around the clock and shows up on the power statement. So in 2026 the operator question is no longer how to train the next model. It is how to serve a trillion tokens a day at a cost per million you can live with.

Here is the number that reorganized the industry. Per Stanford’s 2025 AI Index, the cost to serve a million tokens from a GPT-3.5-class model fell from twenty dollars in November 2022 to seven cents by October 2024, a drop of more than 280-fold in under two years. Cheaper tokens do not shrink the bill, however; they explode the volume. Inference, not training, is therefore the workload that decides whether an AI business has a cost line it can survive.

Every figure in this brief is sourced in full on our AI inference research page.

Cost to serve a million tokens (GPT-3.5-class), 2022 → 2024
November 2022: $20.00
October 2024: $0.07
A ~280× drop between November 2022 and October 2024, per Stanford's 2025 AI Index. Cheaper tokens don't shrink the bill; they explode the volume. That is why inference, not training, now drives the power statement.
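The arithmetic behind that claim is easy to check. The sketch below uses the two prices quoted above; the one-billion-token workload is a hypothetical chosen only to make the numbers concrete. It shows how much volume has to grow before the cheaper price produces the same bill.

```python
# Prices quoted above from Stanford's 2025 AI Index (USD per million tokens).
PRICE_NOV_2022 = 20.00
PRICE_OCT_2024 = 0.07


def fold_drop(old_price: float, new_price: float) -> float:
    """How many times cheaper the new price is than the old one."""
    return old_price / new_price


def spend(price_per_million: float, tokens: float) -> float:
    """Total spend in USD for a given token volume."""
    return price_per_million * tokens / 1_000_000


drop = fold_drop(PRICE_NOV_2022, PRICE_OCT_2024)

# Hypothetical workload: 1 billion tokens a month at the 2022 price.
baseline_tokens = 1e9
baseline_spend = spend(PRICE_NOV_2022, baseline_tokens)

# Volume at which the 2024 bill matches the 2022 bill again.
break_even_tokens = baseline_tokens * drop

print(f"price drop: about {drop:.0f}x")
print(f"2022 bill for 1B tokens: ${baseline_spend:,.0f}")
print(f"2024 volume for the same bill: {break_even_tokens / 1e9:,.0f}B tokens")
```

Any usage growth beyond that break-even multiple means the total bill is larger than it was at the old price, even though each token is far cheaper.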

Inference is not training, and treating them alike is expensive

It is tempting to assume the cluster that trained a model can serve it. However, the two workloads diverge on nearly every axis that matters — power profile, density, cooling, latency budget, hardware mix, and unit economics. Specifically, training is a bursty batch job that tolerates delay and runs in one place; inference is a continuous, latency-sensitive service that has to sit close to the people using it. Treat them as one problem, and within two quarters the cost line pulls away from the revenue line. Moreover, this is where the power concentrates: the Electric Power Research Institute projects data-center load climbing steeply through the decade, with serving — not training — as the durable driver.

Inference power is continuous, not bursty

Start with the power, because it is the part most plans get wrong. A training run spikes and idles around checkpoints and scheduling. By contrast, inference serves requests every second of every day, so its racks draw firm, continuous power at a high duty cycle. Consequently, an inference site behaves like baseload, not like a variable industrial load.

That single fact decides how you should power it. Because the draw is continuous and predictable, on-site generation fits it perfectly — you size firm generation to a steady load and never enter the grid queue. It is the same behind-the-meter logic we lay out in our power field guide: generate on the campus, island from the grid, and the town’s electricity stays the town’s. Take no grid power — that is the refusal, made out of a power plant.

How continuously the racks draw power
Training (bursty): ~55% duty
Inference (continuous): ~92% duty
Inference serves requests around the clock, so it behaves like baseload — which is exactly what on-site generation is best at. Figures illustrative.
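To see what the duty cycle means for generation sizing, here is a minimal sketch. It reuses the illustrative duty figures above; the 30 MW campus size is a hypothetical, and the calculation ignores cooling overhead and maintenance outages.

```python
HOURS_PER_YEAR = 8760


def annual_energy_mwh(peak_mw: float, duty: float) -> float:
    """Energy drawn over a year at a given average duty cycle."""
    return peak_mw * duty * HOURS_PER_YEAR


def idle_share(duty: float) -> float:
    """Fraction of installed generation that sits unused on average."""
    return 1.0 - duty


# Hypothetical 30 MW campus; duty cycles are the illustrative figures above.
PEAK_MW = 30.0
PROFILES = {"training": 0.55, "inference": 0.92}

for name, duty in PROFILES.items():
    energy = annual_energy_mwh(PEAK_MW, duty)
    print(f"{name}: {energy:,.0f} MWh/year, "
          f"{idle_share(duty):.0%} of capacity idle on average")
```

Generation sized to the peak is used far more fully under the inference profile, which is the sense in which a continuous load suits firm on-site power.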

Same density, same cooling, the same closed loop

Inference does not let you off the thermal hook either. The accelerators that serve frontier models run at the same rack densities as training — well past the point where air cooling works — so liquid cooling is the default, not an upgrade. Furthermore, because our cooling loop is sealed and closed, an inference campus draws zero municipal water. The town keeps its water; we keep the heat and put it to work. Refusal and gift, side by side, in the plumbing.

Where inference has to live — and why that changes everything

Here is the part that turns a power-and-cooling problem into a community story. Inference is latency-sensitive: the closer the compute sits to the person making the request, the faster the first token returns and the better the product feels. Therefore inference does not want to be one enormous box in a remote desert. It wants to be distributed — many right-sized campuses, near the people they serve.

That is the whole reason our model is built around campuses in real communities rather than a single distant site. For example, a 20-to-50-megawatt campus near a regional population center serves inference better than a gigawatt complex a thousand miles away. And because each campus lands in an actual town, the question of what it takes and what it gives becomes the central design problem. We answer it the same way every time: take nothing the community needs — no grid power, no municipal water — and give back what the town cannot build on its own — skilled work and a free training institute that opens with the campus. Inference’s appetite for proximity is exactly why give-more-than-you-take is not a slogan for us. It is the operating constraint.
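Distance sets a floor on latency that no amount of hardware removes. The sketch below estimates only the best-case propagation delay, assuming light in optical fiber covers roughly 200 km per millisecond and that the fiber runs in a straight line. Real routes are longer and add switching, queuing, and multiple round trips per request, so treat these as lower bounds. The distances are hypothetical examples.

```python
KM_PER_MILE = 1.609344
FIBER_KM_PER_MS = 200.0  # approximate speed of light in optical fiber


def round_trip_ms(distance_miles: float) -> float:
    """Best-case round-trip propagation delay over a straight fiber run."""
    return 2 * distance_miles * KM_PER_MILE / FIBER_KM_PER_MS


# Hypothetical distances: a regional campus versus a remote site.
for miles in (50, 1000):
    print(f"{miles} miles: at least {round_trip_ms(miles):.2f} ms per round trip")
```

Each extra round trip in a request, such as connection setup or an agent making several sequential calls, pays that delay again.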

The hardware reality: memory is the bottleneck

On the silicon, inference is gated less by raw compute than by memory. Specifically, high-bandwidth memory capacity and bandwidth determine how many tokens an accelerator can serve, so the HBM supply chain — not the GPU count — is often the real constraint. Moreover, inference-only optimizations such as quantization let an operator serve more tokens per watt from the same hardware. As a result, the efficient operator wins on cost per million tokens without buying more chips.
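A common back-of-envelope estimate makes the memory argument concrete. When generating tokens one sequence at a time, each new token requires streaming the model weights from HBM, so bandwidth divided by weight size gives a rough ceiling on speed. The sketch below is an estimate under stated assumptions (dense model, a single sequence, KV-cache traffic ignored), and the accelerator and model figures are hypothetical round numbers, not a specific product.

```python
# Rough decode-speed ceiling when weight reads from HBM are the bottleneck.
# Assumes a dense model, one sequence at a time, and ignores KV-cache traffic.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def decode_tokens_per_second(params: float, hbm_bytes_per_s: float,
                             precision: str) -> float:
    """Upper bound: each generated token streams every weight once."""
    weight_bytes = params * BYTES_PER_PARAM[precision]
    return hbm_bytes_per_s / weight_bytes


# Hypothetical accelerator and model, chosen only for round numbers.
HBM_BANDWIDTH = 3.0e12   # 3 TB/s
MODEL_PARAMS = 70e9      # 70B parameters

for precision in BYTES_PER_PARAM:
    rate = decode_tokens_per_second(MODEL_PARAMS, HBM_BANDWIDTH, precision)
    print(f"{precision}: ~{rate:.0f} tokens/s per sequence")
```

Halving the bytes per parameter doubles the ceiling on the same silicon, which is why quantization shows up directly in tokens per watt. Batching many requests changes the picture, since the weights are read once for the whole batch.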

The unit economics that actually decide it

Three numbers determine whether an inference operation makes money. First, the cost per kilowatt-hour landed at the rack — which on-site generation drives down and stabilizes even as grid prices climb. Second, the tokens served per kilowatt-hour — which liquid cooling and quantization push up. Third, the revenue per million tokens — which the market sets and keeps compressing. Put plainly, the operator who owns the power and tunes the whole stack to tokens-per-watt-per-dollar wins the unit economics; the one who rents power and treats inference like training does not.
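The three numbers combine into a single line of arithmetic. The sketch below is an illustration only: every figure is a hypothetical placeholder, and it covers energy cost alone, leaving out hardware, staff, and financing.

```python
from dataclasses import dataclass


@dataclass
class InferenceEconomics:
    usd_per_kwh: float          # landed cost of energy at the rack
    tokens_per_kwh: float       # tokens served per kilowatt-hour
    revenue_per_million: float  # market price per million tokens (USD)

    @property
    def energy_cost_per_million(self) -> float:
        """Energy cost in USD to serve one million tokens."""
        return self.usd_per_kwh / self.tokens_per_kwh * 1_000_000

    @property
    def energy_margin_per_million(self) -> float:
        """Revenue left per million tokens after paying for energy."""
        return self.revenue_per_million - self.energy_cost_per_million


# Hypothetical operators: one with cheap owned power and a tuned stack,
# one renting pricier power with a less efficient stack.
owned = InferenceEconomics(0.05, 2_000_000, 0.07)
rented = InferenceEconomics(0.12, 1_500_000, 0.07)

for name, op in (("owned power", owned), ("rented power", rented)):
    print(f"{name}: energy ${op.energy_cost_per_million:.3f}/M tokens, "
          f"margin ${op.energy_margin_per_million:+.3f}/M tokens")
```

With these placeholder inputs the first operator keeps a positive margin on energy and the second does not, at the same market price.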

How to evaluate an inference partner

If you are signing a five-year inference contract, three gates separate the operators from the slideware. First, latency and topology: can they put compute near your users, or only in one distant site? Second, timeline: can they show first tokens in months, with owned power and a manufacturing line behind the date, or are they quoting you into a grid queue? Finally, data control: who owns the campus, the stack, and the data path — you, or a third party whose terms can change? Score those three plainly, and most vendors fall out.

Why we build it this way

I will say the personal part, because it drives the engineering. Inference is the workload that finally lets us build where people actually live, and that is exactly where I want to build. A campus that serves a region’s tokens can also reskill that region’s workforce, on power we generate and water we never take. My grandparents’ rule was that you give more than you take, and an inference campus next to a community is that rule with a power plant and a school attached. Build it the usual way and you get a remote box nobody nearby benefits from. Build it ours and you get a campus a town has every reason to want. If you want the wider model, read our other field notes or explore the rest of SAVRN.

Frequently asked questions

Can I serve inference on the same cluster I used for training?

Technically yes, economically usually not. Because inference optimizes for latency, memory bandwidth, and continuous duty rather than raw throughput, a training cluster serves tokens inefficiently. Therefore most operators separate the two once volume becomes real.

Does inference have to run at the edge, or can it be regional?

Regional is the sweet spot for most workloads. Specifically, a campus near a population center captures most of the latency benefit without the cost and sprawl of true edge. Only the most latency-critical applications need to sit closer than that.

What is "tokens per kilowatt-hour," and why does it matter?

It is how much sellable output each unit of energy produces, which makes it the inference equivalent of yield. The higher it is, the lower your energy cost per million tokens at any given electricity price.

How do I forecast inference cost as usage grows?

Multiply cost per million tokens by projected volume in millions of tokens, and watch both lines. Because per-token cost keeps falling while volume rises faster, total spend usually grows even as unit cost drops, which is why owning the power matters more every year.
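A simple month-by-month projection shows the effect. This is a minimal sketch assuming constant monthly rates; the starting cost, volume, and both rates are hypothetical placeholders to replace with your own figures.

```python
def forecast_spend(unit_cost: float, tokens_millions: float,
                   cost_change: float, volume_growth: float,
                   months: int) -> list[tuple[int, float, float, float]]:
    """Return (month, unit cost, volume in millions of tokens, spend) rows.

    cost_change and volume_growth are monthly fractional changes,
    e.g. -0.05 for a 5% monthly decline.
    """
    rows = []
    for month in range(months + 1):
        rows.append((month, unit_cost, tokens_millions,
                     unit_cost * tokens_millions))
        unit_cost *= 1 + cost_change
        tokens_millions *= 1 + volume_growth
    return rows


# Hypothetical inputs: $0.07 per million tokens, 1 trillion tokens a month,
# unit cost falling 5% a month, volume growing 12% a month.
rows = forecast_spend(0.07, 1_000_000, -0.05, 0.12, months=12)

for month, cost, volume, total in rows[::6]:
    print(f"month {month:2d}: ${cost:.4f}/M x {volume:,.0f}M = ${total:,.0f}")
```

Whenever the volume growth factor times the cost change factor is above one, total spend rises every month despite the falling unit cost.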

What role does quantization play?

It lets you serve a model at lower numerical precision, which means more tokens per watt from the same accelerators. As a result, it is one of the cheapest ways to improve inference economics without new hardware.

Does inference hardware become obsolete as fast as training hardware?

Generally slower. Because serving an existing model is less demanding than training the next one, inference accelerators often have a longer useful life. Therefore we design the campus envelope to outlive several hardware generations, protecting the capital that is hardest to replace.

How does data residency affect where inference runs?

Directly. Specifically, regulated and sensitive workloads often require the data path and the compute to stay inside a defined jurisdiction, which favors owned, in-region campuses over shared, far-away capacity.

Why does continuous load matter for the power contract?

Because firm, predictable demand is the easiest kind to generate on site. As a result, inference’s around-the-clock profile is what makes behind-the-meter generation not merely possible but ideal.

Can one campus serve multiple customers or models?

Yes. A well-designed inference campus is multi-tenant by nature, serving many models and customers from the same power and cooling envelope. Therefore the fixed cost spreads across more revenue as utilization rises.

How fast can inference capacity come online compared to training?

Faster, when the power is owned. Because we generate on site and build the pods on a line, an inference campus reaches token-bearing operation in 6 to 12 months — while a grid-tied project is still waiting on its interconnection study.