Deployment Guides

Self-hosting an LLM in production: hardware, cost, latency tradeoffs

What it actually costs to run a 70B model in-house vs. via API, with concrete latency budgets and GPU sizing.

20 min read · Reviewed 2026-04-19

Why this guide exists

"Should we self-host?" is the most common question we field from engineering leaders in their first year of serious LLM deployment, and it is almost always asked without the numbers that would answer it. The question is tractable. The answer depends on four variables: your sustained request volume, your latency tolerance, your data sensitivity, and your willingness to operate GPU infrastructure. It produces different conclusions for different workloads inside the same company.

This guide is the reference material we wish existed when those conversations start. It is deliberately vendor-neutral: we name hardware and model families by their public specifications, not by commercial relationships. It is written for engineering teams who need to produce a defensible recommendation to a CFO, not a research paper.

The worked example throughout is a 70B-parameter dense model, the current pragmatic workhorse for production deployments that need to be materially better than a small model but not so large that the economics collapse. The framework generalises to 7B, 13B, and 100B+ class models; we flag where the numbers shift.

1. The decision, in one paragraph

Self-host when your sustained inference volume is high enough that amortised GPU cost beats per-token API pricing, when your latency budget is tight enough that a round-trip to a hyperscaler is material, or when your data classification genuinely prohibits external processing. Use APIs when your traffic is bursty, your volume is modest, or your team cannot commit to operating GPU infrastructure as a first-class production concern. Most organisations end up with both, routed by workload, and the interesting engineering work is in the routing layer, not in picking a side.

The rest of this guide is what you need to make that paragraph concrete for your workload.

2. What "70B in production" actually means

A 70B-parameter dense transformer (the class that includes Llama 3.3 70B, Qwen 2.5 72B, and their successors) has a few characteristics that determine everything downstream.

At 16-bit precision (bf16 or fp16), the weights alone occupy roughly 140 GB of GPU memory. At 8-bit quantisation, roughly 70 GB. At 4-bit, roughly 35 GB. The KV cache, the per-request memory that holds attention state, adds meaningfully on top of this, scaling linearly with context length and batch size. For a 70B model at 16-bit precision with an 8K context and a batch of 16 concurrent requests, budget another 20–30 GB of KV cache.
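
These figures can be checked with a back-of-envelope calculation. The KV-cache shape below (80 layers, 8 grouped-query KV heads of dimension 128) is an assumption modelled on Llama-70B-class checkpoints, not a published spec for any particular model:

```python
# Back-of-envelope memory budget for dense-transformer inference.
# KV-cache shape is an assumed Llama-70B-like layout (see lead-in).

def weights_gb(params_billions: float, bits: int) -> float:
    """Weights alone: parameter count times bytes per parameter."""
    return params_billions * 1e9 * (bits / 8) / 1e9

def kv_cache_gb(batch: int, context: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, kv_bits: int = 16) -> float:
    """KV cache: 2 tensors (K and V) per layer, per token, per KV head."""
    bytes_per_token = 2 * layers * kv_heads * head_dim * (kv_bits / 8)
    return batch * context * bytes_per_token / 1e9

print(f"weights, bf16: {weights_gb(70, 16):.0f} GB")   # ~140 GB
print(f"weights, fp4:  {weights_gb(70, 4):.0f} GB")    # ~35 GB
print(f"KV, batch 16 x 8K ctx, 16-bit: {kv_cache_gb(16, 8192):.0f} GB")
print(f"KV, batch 16 x 8K ctx, fp8:    {kv_cache_gb(16, 8192, kv_bits=8):.0f} GB")
```

Note that a 16-bit KV cache at batch 16 and 8K context lands nearer 40 GB under these assumptions; the 20–30 GB budget corresponds to a cache held at reduced precision, which current inference servers commonly support.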

This memory footprint is what dictates your hardware floor. You cannot run a 70B model at full precision on a single 80 GB GPU; you need either two of them connected by NVLink, a single GPU with more memory (the H200 at 141 GB and the B200 at 192 GB cross that threshold), or aggressive quantisation. Every production decision for this model class is downstream of that fact.

The second thing that matters is the compute profile. Inference on a dense 70B model is memory-bandwidth-bound for single-request generation and compute-bound for batched prefill. This means single-user latency is governed by how fast your GPU can stream weights from HBM to the tensor cores, while throughput under load is governed by raw FLOPs. You will optimise for one or the other, and the choice shapes your hardware selection.

3. Hardware options, as of mid-2026

The practical GPU inventory for serious 70B inference is narrower than vendor marketing suggests. We will cover the five classes that matter.

H100 80GB SXM / PCIe

The workhorse of the last two years, still the most common starting point for organisations building their first inference cluster. Two H100s with NVLink comfortably serve a 70B model at 16-bit precision with reasonable batch sizes. The PCIe variant saves capex but costs you aggregate memory bandwidth and inter-GPU throughput: fine for lower-concurrency workloads, painful once you start batching aggressively. Street availability is good; pricing has softened considerably from the 2024 peak.

H200 141GB

The quiet upgrade. Same compute architecture as the H100, but with HBM3e giving you 141 GB per GPU and materially higher memory bandwidth. For 70B inference this is the sweet spot for 2026: a single H200 fits the model weights at 16-bit precision with room for a healthy KV cache, and the bandwidth increase translates directly into single-request latency improvements. If you are building a new cluster today and the procurement calendar allows, this is usually where the analysis lands.

B200 192GB

The generational step. Roughly 2.5x the inference throughput of an H100 on large models, 192 GB of HBM3e per GPU, and native fp4 support that collapses the memory footprint of 70B inference to something a single GPU handles with ease. The caveat is availability, power (1000W TDP per GPU, and system-level thermal envelopes that most existing datacentre rows cannot accommodate), and cost. For organisations with genuine scale (sustained tens of thousands of requests per minute) the B200 economics are compelling. For anyone else, H200 is the more pragmatic choice.

AMD MI300X 192GB

The credible alternative. 192 GB of HBM3 per GPU, competitive memory bandwidth, and a software stack (ROCm) that has matured substantially. vLLM, SGLang, and TensorRT-LLM all have functional MI300X support; the ecosystem gap that existed in 2024 is largely closed for inference workloads. Pricing typically lands 20–30% below comparable NVIDIA hardware, and availability has been better through most of the last eighteen months. The risk remains operational: if your team has no ROCm experience, budget real time for the learning curve, and do not underestimate the cost of debugging a kernel-level issue on a less-trodden path.

Custom silicon

AWS Trainium2, Google TPU v5p/v6, Groq LPU, Cerebras, SambaNova. Each of these can be the right answer for specific workloads, and each carries lock-in considerations that sit outside pure economics. Groq's LPU produces remarkable single-stream latency, useful for interactive applications where tokens-per-second-per-user dominates perception, but the architectural assumptions differ enough from GPU deployment that porting is non-trivial. TPUs are excellent inside the Google Cloud ecosystem and awkward outside it. Trainium2 is now economically competitive for large-batch inference if your stack already lives in AWS. We treat these as workload-specific accelerators rather than general-purpose infrastructure, and we would not recommend any of them as a first deployment unless you have a specific latency or cost target that only they can hit.

For the rest of this guide, we will use the H200 as the reference hardware. The methodology transfers; the constants will shift.

4. Sizing the deployment

The sizing question reduces to four inputs: expected concurrent requests at peak, the median and p99 context length, your tokens-per-second-per-user latency target, and your tolerable queue depth.

Most teams get this wrong in the same direction. They size for average load, discover at peak that tail latency collapses, and either over-provision reactively or start dropping requests. The correct default is to size for your p95 concurrent load with a 30% headroom, then autoscale the tail.

A single H200 running a 70B model in bf16 with a well-tuned inference server (vLLM, SGLang, or TensorRT-LLM) will sustain, roughly:

  • Single-user generation: 35–55 tokens/sec, depending on context length and prefill vs. decode phase
  • Batched throughput: 1,500–2,500 tokens/sec aggregate at batch 16–32, with 50–80 tokens/sec perceived per request via continuous batching
  • Prefill: 4K prompt ~200–400 ms; 8K prompt ~500–800 ms; 32K prompt, several seconds

These numbers assume 16-bit precision. Quantising to fp8 typically gives 1.3–1.6x throughput improvement with negligible quality loss on well-trained 70B models; fp4 gives another 1.3x on top but with measurable quality degradation on some tasks, so test carefully against your evaluation set.

For sizing, the useful abstraction is requests per GPU-hour. A single H200 delivering 2,000 aggregate output tokens per second at an average of 400 tokens generated per request handles five requests per second, or 18,000 requests per hour. Translate your expected peak request volume into this unit, add the headroom you need, and divide by the per-GPU capacity; that gives you the minimum GPU count.
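
The sizing arithmetic can be sketched as a small helper, assuming the representative constants from this section (2,000 aggregate tokens/sec per GPU, 400 output tokens per request, 30% headroom); substitute your own measurements:

```python
# Sizing sketch: translate request volume into a minimum GPU count.
# Defaults are the worked figures from the text, not guarantees.
import math

def requests_per_gpu_hour(agg_tokens_per_sec: float,
                          avg_output_tokens: float) -> float:
    """Sustained per-GPU request capacity under healthy batching."""
    return agg_tokens_per_sec / avg_output_tokens * 3600

def min_gpus(peak_requests_per_hour: float, agg_tokens_per_sec: float = 2000,
             avg_output_tokens: float = 400, headroom: float = 0.30) -> int:
    """Minimum GPU count for peak load plus headroom."""
    capacity = requests_per_gpu_hour(agg_tokens_per_sec, avg_output_tokens)
    return math.ceil(peak_requests_per_hour * (1 + headroom) / capacity)

print(requests_per_gpu_hour(2000, 400))   # 18000.0 requests per GPU-hour
print(min_gpus(50_000))                   # 50k peak req/hour -> 4 GPUs
```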

5. Latency, honestly

Latency is where self-hosting earns its keep or loses its case, and the conversation is usually sloppy. "Faster than the API" is not a useful claim; the question that matters is: faster on which latency percentile, at which token position, under which load?

Break the latency budget into four components.

Time to first token (TTFT)

The interval between request arrival and the first generated token landing in the client. TTFT is dominated by prefill: tokenisation, attention computation over the full input context, and the first decode step. For a 70B model on an H200, TTFT on a warm cache with a 2K input context lands around 250–400ms; on a 16K input context it is 1.5–2.5s. API providers operating at scale achieve comparable TTFT on their dedicated infrastructure. The difference is in the tail: a self-hosted deployment with no network hop and no cross-tenant queueing can deliver deterministic p99 TTFT, while hyperscaler APIs periodically spike to several seconds under load.

Inter-token latency (ITL)

The interval between generated tokens after the first. This is the number that governs the perceived speed of streaming responses. On an H200 running bf16, expect 18–28ms per token under moderate batching. Quantised to fp8, this drops to 12–20ms. Groq's LPU and Cerebras' wafer-scale inference can push ITL below 5ms for 70B models, which materially changes what interactive applications feel like.

End-to-end latency

TTFT plus (ITL × output tokens). For a typical 300-token response with a 2K input context on an H200, this is roughly 400ms + (300 × 22ms) = 7 seconds. For the same workload hitting a major commercial API from an application server in the same region, expect 500–800ms + (300 × 25ms) = 8 seconds, plus the network round-trip and TLS overhead, which add 30–150ms. The difference at the median is usually less than a second. The difference at p99 is frequently three to ten seconds, and that is where real products live or die.
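
The budget above reduces to a one-line function; the inputs below are the representative midpoints quoted in this section, not measurements of any specific deployment:

```python
# End-to-end latency budget: network RTT + TTFT + ITL * output tokens.
# Inputs are representative midpoints from the text.

def e2e_latency_ms(ttft_ms: float, itl_ms: float, output_tokens: int,
                   network_rtt_ms: float = 0.0) -> float:
    return network_rtt_ms + ttft_ms + itl_ms * output_tokens

self_hosted = e2e_latency_ms(400, 22, 300, network_rtt_ms=10)
api_call    = e2e_latency_ms(650, 25, 300, network_rtt_ms=80)
print(f"self-hosted: {self_hosted / 1000:.1f}s")   # 7.0s
print(f"API:         {api_call / 1000:.1f}s")      # 8.2s
```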

Network and serialisation

A self-hosted deployment inside your VPC adds 2–15ms of network round-trip. A commercial API from the same cloud region adds 30–80ms typically, 200ms+ under load or across regions. If your application makes multiple LLM calls per user action (increasingly common with agentic workloads) this compounds quickly.

6. Cost, with the actual arithmetic

This is the section that matters. We will work through the total cost of operating a 70B inference deployment at three representative scales and compare it directly to current commercial API pricing. The methodology is what transfers; plug your own numbers in.

The cost model

A self-hosted inference deployment has five cost components: GPU capex or lease, host infrastructure (CPU, memory, networking, storage), power and cooling, personnel, and opportunity cost (the engineering time spent operating the system rather than building product). Most internal business cases ignore the last two and overstate the savings as a result.

For a meaningful comparison, we amortise hardware over three years, include a fully-loaded operational overhead, and compare against current commercial per-token pricing for 70B-class models.

Reference deployment: 2× H200 cluster

Two H200 GPUs connected via NVLink, hosted on a dual-socket server with 1 TB of system RAM, 100 GbE networking, and appropriate storage. Purchased outright, the GPUs land at roughly $55,000–65,000 per H200 in mid-2026 pricing, plus $30,000–40,000 for the host. Total capex: approximately $150,000.

Annual cost by component:

  • Hardware depreciation (3-year amortisation): ~$50,000
  • Power and cooling (2 kW sustained at $0.12/kWh, PUE 1.4): ~$2,900
  • Colocation / on-prem overhead: $8,000–15,000
  • Personnel (0.2 FTE, fully loaded): $40,000–60,000
  • Total (owned deployment): ~$120,000/year
  • Total (cloud-reserved deployment): ~$220,000/year

Cloud-hosted equivalent (reserved GPU instances): approximately $9.00–12.50 per GPU-hour for H200 class on major clouds, or roughly $80,000–110,000 per GPU per year. Substantially more expensive than owned hardware amortised, but with no capex and full elasticity.
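
The cost model can be checked directly. The colocation and personnel inputs below are midpoints of the quoted ranges, and the cloud figure is the midpoint of the roughly $80,000–110,000 per-GPU annual rate; both are our assumptions, not vendor quotes:

```python
# Annual cost sketch for the 2x H200 reference deployment.
# Midpoint inputs are assumptions within the quoted ranges.

HOURS_PER_YEAR = 8760

def owned_annual_cost(capex: float, amort_years: int, sustained_kw: float,
                      pue: float, kwh_price: float, colo: float,
                      personnel: float) -> float:
    depreciation = capex / amort_years
    power = sustained_kw * pue * HOURS_PER_YEAR * kwh_price
    return depreciation + power + colo + personnel

owned = owned_annual_cost(capex=150_000, amort_years=3, sustained_kw=2.0,
                          pue=1.4, kwh_price=0.12, colo=11_500,
                          personnel=50_000)
cloud_gpu_rental = 2 * 95_000   # midpoint of ~$80-110k per GPU per year

print(f"owned: ~${owned:,.0f}/year")             # ~$114k at midpoints
print(f"cloud rental: ~${cloud_gpu_rental:,}")   # $190k; ops on top -> ~$220k
```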

Throughput at this scale

At the performance envelope discussed earlier (2,000 aggregate output tokens per second per GPU under healthy batching) a 2×H200 cluster produces 4,000 output tokens per second sustained, or roughly 345 million output tokens per day. Assume realistic utilisation of 40–60% (production traffic is not uniform), and you get 140–210 million output tokens per day, or 50–75 billion output tokens per year.

Cost per million output tokens:

  • Self-hosted, owned hardware: ~$2.00
  • Self-hosted, cloud-reserved: ~$3.70
  • Commercial API (70B-class, mid-2026): $0.50–2.50

The conclusion

For a 70B workload processing in the range of 50–75 billion output tokens per year (which corresponds to, roughly, hundreds of thousands to low millions of requests per day) self-hosting on owned hardware is competitive with commercial APIs, and often modestly cheaper at the upper end of that volume. Below that scale, APIs win on unit economics, sometimes by a factor of five or more. Above that scale, self-hosting wins progressively as you amortise fixed costs across larger throughput.

There is a threshold, and for 70B workloads in 2026 it sits somewhere between 10 and 30 billion output tokens per year, roughly 30,000–100,000 requests per day of typical length. Below it, APIs are the correct answer on cost alone. Above it, the economics flip, and the question becomes whether you want to build the operational capability.
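
The threshold is sensitive to both inputs: annual fixed cost divided by API price per token gives the matching volume. A sketch of that sensitivity, where the $60,000 single-GPU quantised configuration is a hypothetical of ours, not a figure from the text:

```python
# Break-even sensitivity: fixed self-host cost vs per-token API pricing.
# This simplification ignores variable costs and utilisation effects.

def breakeven_tokens(annual_fixed_cost: float, api_price_per_m: float) -> float:
    """Output tokens/year at which self-host fixed cost equals API spend."""
    return annual_fixed_cost / api_price_per_m * 1e6

scenarios = [
    ("2xH200, API mid-range ($1.50/M)", 120_000, 1.50),   # -> 80B
    ("2xH200, API top ($2.50/M)",       120_000, 2.50),   # -> 48B
    ("1 GPU fp8, hypothetical, API top", 60_000, 2.50),   # -> 24B
]
for name, cost, price in scenarios:
    print(f"{name}: ~{breakeven_tokens(cost, price) / 1e9:.0f}B tokens/year")
```

The lower end of the quoted 10–30 billion range corresponds to leaner deployments priced against the upper end of API rates; at mid-range API pricing, a full 2×H200 cluster needs materially more volume to pay for itself.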

7. The operational cost nobody models

The financial case above undercounts the real cost of self-hosting in one important way: it assumes the inference stack works. In production, it does not, without ongoing engineering investment. The list of things you are now responsible for includes:

  • Model weight management and versioning
  • Evaluation infrastructure that runs against every new model or configuration change
  • Inference server upgrades (vLLM, SGLang, and TensorRT-LLM all move quickly and introduce breaking changes)
  • GPU driver and CUDA stack maintenance
  • Kernel-level debugging when a batch pattern tickles a rare bug
  • Capacity planning that keeps pace with product growth
  • Incident response at 3am when the inference tier falls over
  • Security patching of the full stack
  • Regulatory compliance documentation for the infrastructure
  • Monitoring, alerting, and SLO definition for a fundamentally probabilistic service

None of this is exotic; it is table stakes for running any production system. But teams new to GPU operations routinely underestimate the magnitude. A fair way to sanity-check a self-hosting business case is to ask whether the team has operated a similarly demanding stateful workload before. If the answer is no, the break-even volume is higher than the pure cost model suggests, and the quality of the first six months of service will be worse than an API.

8. The quality question

Throughout this guide we have treated "70B-class" as a single capability category. It is not, and the gap between the best commercial models and the best open-weight 70B models, while narrowed substantially over the last two years, is still real on the workloads that demand the most capability: complex reasoning, long-context coherence, tool use, and the kind of nuanced instruction-following that distinguishes a product from a demo.

For a large fraction of production use cases (classification, extraction, summarisation, retrieval-augmented generation, structured output, and the bulk of agentic tool-calling) a well-tuned open 70B model is indistinguishable from a frontier API in output quality, and self-hosting is a straightforward cost conversation. For the hardest 10–20% of use cases, the quality gap still favours the frontier APIs, and no amount of GPU investment closes it. The honest architecture for most organisations is to route by workload: self-host the volume, call out to a frontier API for the hard cases, and treat the routing layer as a first-class piece of infrastructure.

9. Data, sovereignty, and the cases where cost is not the question

Some workloads are not a cost conversation. If your regulatory posture prohibits data leaving your infrastructure (certain categories of healthcare, financial services, defence, and increasingly anything covered by strict national sovereignty regimes) the question is not whether self-hosting is cheaper. It is whether self-hosting is possible, at what quality, and on what timeline.

Two patterns are worth knowing.

Air-gapped inference

The model runs inside infrastructure that has no route to the public internet, typically for classified or regulated workloads. Every piece of the stack (weights, inference server, monitoring) must be vendored and auditable. Open-weight models are the only viable option here; API access is definitionally out. This is a mature pattern with real reference deployments, and the tradeoffs are well understood.

Sovereign cloud inference

The model runs in infrastructure that is legally and operationally within a specific jurisdiction, typically to satisfy GDPR, DORA, the EU AI Act's data-handling requirements, or equivalent regimes in other jurisdictions. This is a middle path: you get the operational convenience of a cloud deployment and the legal posture of on-premises, at a cost premium. Every major cloud now offers some version of this for GPU workloads; terms vary substantially, and the contractual due diligence is as important as the technical architecture.

Both patterns materially change the economics, and both are increasingly common. If your organisation falls into one of these categories, the self-hosting question is not whether, but how, and the rest of this guide's cost analysis applies only as an upper bound on what the hyperscaler alternative would have cost.

10. A deployment architecture that works

For teams building their first serious self-hosted 70B deployment, the architecture that reliably works in production has five components.

  1. An inference server (vLLM, SGLang, or TensorRT-LLM). All three are credible in 2026; vLLM has the broadest community, SGLang has the best structured-output performance, TensorRT-LLM has the best raw NVIDIA throughput. Pick one, commit to it, and do not switch without a specific reason.
  2. A gateway layer: a thin service in front of the inference server that handles authentication, rate limiting, request routing, logging, and the retry logic that inevitably becomes necessary. Do not put these concerns inside the inference server. LiteLLM, Portkey, and Kong AI Gateway are common choices; an in-house FastAPI service is equally valid for most teams.
  3. An evaluation harness: automated regression testing that runs on every model or configuration change, anchored to a real workload trace, not a benchmark. This is the single most important piece of infrastructure and the most frequently deferred. Without it, you cannot safely upgrade anything.
  4. A routing layer: the hybrid-deployment glue that decides which requests go to self-hosted infrastructure, which go to external APIs, and which go to smaller local models. Even if you start with 100% self-hosted traffic, build the routing seam now; it is the foundation of every cost optimisation you will want to make later.
  5. Observability: request-level tracing, token-level accounting, p50/p95/p99 latency by model and endpoint, and cost attribution back to the business unit that generated the traffic. The last point is where most programmes fail: without cost attribution, self-hosting looks free to downstream consumers, and demand expands to fill capacity.

This architecture scales from a two-GPU pilot to a hundred-GPU production fleet without structural change. It also makes the self-host-vs-API decision reversible at the workload level, which is the correct default.
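
The routing seam in component 4 can be sketched in plain Python. The workload names and rules below are illustrative assumptions, not a recommended taxonomy:

```python
# Minimal routing seam sketch: decide which backend serves a request.
# Workload names and rules are illustrative, not prescriptive.
from dataclasses import dataclass

@dataclass
class Request:
    workload: str        # e.g. "extraction", "complex_reasoning"
    contains_pii: bool   # data that must not leave your infrastructure
    output_tokens: int   # expected response length

def route(req: Request) -> str:
    """Return the backend this request should be served by."""
    if req.contains_pii:
        return "self_hosted"      # sovereignty overrides cost and quality
    if req.workload == "complex_reasoning":
        return "frontier_api"     # the hardest 10-20% of cases
    if req.workload == "classification" and req.output_tokens <= 8:
        return "small_local"      # don't spend 70B capacity on a label
    return "self_hosted"          # default: keep the volume in-house

print(route(Request("extraction", contains_pii=False, output_tokens=300)))
# -> self_hosted
```

Even this toy version makes the decision reversible per workload: changing a routing rule is a config change, not a migration.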

11. A checklist for the decision

If you have read this far, the decision you are making is probably one of these four.

"Should we move our existing API traffic to self-hosted?"

Answer by calculating your current monthly API spend and comparing it to a two- or four-GPU amortised annual cost. If API spend exceeds $15,000–25,000 per month sustained on a single model family, the arithmetic likely favours self-hosting for that family. Below that threshold, the answer is almost always no on cost alone, and the question becomes whether data sensitivity or latency forces the move.

"We have not deployed AI yet. Should we start self-hosted?"

Almost always no. Start with an API, learn what your traffic shape actually looks like, build the evaluation harness and routing layer against real usage, then migrate the volume to self-hosted when the economics justify it. Self-hosting as a first deployment optimises the wrong variable.

"We need latency we cannot get from an API."

Verify the claim with real measurements before building. Most "API is too slow" complaints resolve on inspection to a network or application-layer issue rather than inference latency. If the measurement holds up, self-hosting, or a specialty inference provider like Groq, Cerebras, or a dedicated endpoint from a major provider, is the right answer. The specialty providers often win on this axis without requiring you to build operational capability.

"Our compliance posture requires it."

Then the question is how, not whether. The cost analysis in this guide sets an upper bound on what you would have paid if you had the choice; the actual number will be higher, and that is the price of sovereignty. Budget accordingly, and do not expect the board conversation to treat this as a cost centre in the same frame as the rest of your infrastructure.

12. What will change in the next twelve months

Any reference document on LLM infrastructure has a shelf life measured in quarters, not years. The shifts we are watching:

  • GPU supply continues to loosen. B200 availability will be mainstream by late 2026; H200 pricing will continue to soften as a result. MI300X successors will arrive. The capex side of the self-hosting equation gets cheaper, which lowers the break-even volume.
  • Quantisation continues to mature. fp4 inference on 70B models at production quality is now borderline viable; it will be routine within the year, and that effectively doubles the throughput of every GPU you own.
  • Commercial API pricing continues its downward trajectory, though the rate has slowed as the economics normalise. The break-even volume threshold moves up as API pricing falls, and down as hardware pricing falls; the two forces are partially offsetting.
  • The open-weight quality gap continues to narrow on most workloads and persists on the hardest. This is the single most consequential trend for the self-host decision, and the honest answer is that it is impossible to predict where it settles.

Appendix: Quick reference numbers

  • 70B memory footprint: ~140 GB at bf16, ~70 GB at fp8, ~35 GB at fp4; add 10–40 GB for KV cache depending on batch and context
  • Single-H200 throughput (70B, bf16): 35–55 tokens/sec single-user; 1,500–2,500 tokens/sec aggregate under batching
  • Typical latency (70B, bf16, H200): 2K context, 300-token output: TTFT 250–400 ms; ITL 18–28 ms; end-to-end ~7 s
  • Break-even volume (self-host vs API): ~10–30 billion output tokens/year, equivalent to 30,000–100,000 typical requests/day
  • Amortised cost (2×H200 owned, 3-year): ~$120,000/year including realistic personnel overhead
  • Cost per million output tokens: ~$2.00 self-hosted owned; ~$3.70 cloud-reserved; $0.50–2.50 commercial API

Working through this decision?

This guide is part of 01's Deployment Guides series. If your team is working through a self-host-vs-API decision and wants a second pair of eyes on the numbers, the evaluation harness, or the routing architecture, we can help.