Training costs for a model are one-time, but inference costs accrue every day. That’s where the math is shifting: with native FP4 Tensor Cores on NVIDIA Blackwell and a serving layer like vLLM that exploits these formats, GPU hours and latency can be reduced significantly without retraining the model. For teams running AI workloads, the decision now extends beyond model selection to the numerical format used for computation.
Key Takeaways
- Quantization is the cost lever: Switching from FP16 to FP8 or FP4 halves or quarters the memory needed per parameter, easing the load on memory bandwidth, the real bottleneck in inference.
- Hardware and software must align: Blackwell introduces native FP4 Tensor Cores, but only a serving layer with compatible kernels unlocks the advantage. Without a synchronized stack, the format remains untapped.
- Gains are measurable, risks manageable: Throughput can reach up to four times that of Hopper at comparable latency, while quality loss is contained through selective quantization and evaluation suites.
Why Inference Costs Are Becoming an Architectural Issue
A production language model doesn’t burn its GPU hours during one-time training; it burns them with every single request. When a service scales from hundreds to hundreds of thousands of daily calls, inference becomes the largest line item in the cloud bill, growing linearly with usage. That makes it an architectural challenge, not one that can be solved with a bigger reservation discount.
The bottleneck rarely lies in raw compute power. During token generation, memory bandwidth is usually the limiting factor: the model must reload its weights from GPU memory for every token generated. The more compactly weights are stored, the less bandwidth each step consumes. That’s precisely where quantization comes into play.
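A back-of-the-envelope sketch makes the bandwidth argument concrete. It assumes a single request stream in which every weight is read from GPU memory once per generated token, so the decode rate is capped at memory bandwidth divided by the weight footprint. The parameter count and bandwidth figure are illustrative assumptions, not measurements of any specific GPU, and the function name is hypothetical.

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "FP4": 0.5}


def max_tokens_per_second(num_params, bandwidth_bytes_per_s, fmt):
    """Upper bound on single-stream decode rate when weight reads dominate.

    Simplification: ignores KV-cache traffic, activations and compute time.
    """
    weight_bytes = num_params * BYTES_PER_PARAM[fmt]
    return bandwidth_bytes_per_s / weight_bytes


if __name__ == "__main__":
    params = 70e9      # assumed model size: 70 billion parameters
    bandwidth = 4e12   # assumed memory bandwidth: 4 TB/s
    for fmt in BYTES_PER_PARAM:
        bound = max_tokens_per_second(params, bandwidth, fmt)
        print(f"{fmt}: at most {bound:.1f} tokens/s per stream")
```

Under these assumptions the bound scales inversely with bytes per parameter, so FP4 allows roughly four times the FP16 ceiling. Batching changes the picture in practice, because one pass over the weights then serves several requests at once.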
What is quantization? Quantization reduces the numerical precision used to store and compute model weights and activations, for example from 16-bit (FP16) to 8-bit (FP8) or 4-bit (FP4). This shrinks memory requirements and bandwidth load while speeding up matrix multiplications, ideally without noticeable quality loss.
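To illustrate the idea, here is a minimal sketch of rounding a block of weights onto a 4-bit floating-point grid with one shared scale. It assumes the E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit) and keeps the scale as an ordinary Python float for simplicity. The helper names are hypothetical, and this is not how Blackwell kernels or vLLM implement quantization.

```python
# Magnitudes representable in the 4-bit E2M1 layout.
FP4_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(values):
    """Map a non-empty block of floats to (scale, codes).

    The scale is chosen so the largest magnitude in the block lands on the
    largest grid value (6.0). Each code is the nearest signed grid value.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / FP4_MAGNITUDES[-1] if max_abs > 0 else 1.0
    codes = []
    for v in values:
        scaled = abs(v) / scale
        magnitude = min(FP4_MAGNITUDES, key=lambda m: abs(scaled - m))
        codes.append(-magnitude if v < 0 else magnitude)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate values from the shared scale and the codes."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    weights = [0.1, -0.4, 1.2, -3.0]
    scale, codes = quantize_block(weights)
    print("scale:", scale)
    print("codes:", codes)
    print("restored:", dequantize_block(scale, codes))
```

The restored values differ from the originals, and the small weights lose the most relative precision. That rounding error is the quality risk that calibration and selective quantization aim to contain.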
FP8 and FP4: How the Number Formats Change with Blackwell
Hopper GPUs from the previous generation already support FP8 natively. Blackwell takes it a step further by integrating FP4 Tensor Cores directly into the hardware, alongside increased memory bandwidth. This makes practical a format that compresses weights to a quarter of their FP16 size. According to NVIDIA, the GB200 achieves significantly higher throughput with FP4 and FP8 operations than the older H200.
| Format | Memory per Parameter | Native Hardware | Classification |
|---|---|---|---|
| FP16 | 2 bytes | All common GPUs | Baseline: highest fidelity, most bandwidth-intensive |
| FP8 | 1 byte | Hopper, Blackwell | Robust standard with minimal quality risk |
| FP4 | 0.5 bytes | Blackwell | Maximum efficiency gain, requires careful calibration |
Memory values rounded; quality behavior model-dependent.
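One reason the table values are rounded: low-precision formats typically store a scale factor per block of weights, which adds a small overhead on top of the raw bit width. The sketch below shows the arithmetic. The block size of 16, the 8-bit scale and the 70-billion-parameter model are illustrative assumptions rather than a description of a specific format, and the function names are hypothetical.

```python
def effective_bytes_per_param(value_bits, scale_bits=0, block_size=1):
    """Bytes per parameter including the amortized per-block scale factor."""
    return (value_bits + scale_bits / block_size) / 8


def weight_footprint_gb(num_params, value_bits, scale_bits=0, block_size=1):
    """Total weight memory in gigabytes (10**9 bytes)."""
    per_param = effective_bytes_per_param(value_bits, scale_bits, block_size)
    return num_params * per_param / 1e9


if __name__ == "__main__":
    params = 70e9  # assumed model size
    print("FP16:", weight_footprint_gb(params, 16), "GB")
    print("FP8: ", weight_footprint_gb(params, 8), "GB")
    # Assumed: one 8-bit scale shared by every block of 16 four-bit values.
    print("FP4: ", weight_footprint_gb(params, 4, scale_bits=8, block_size=16), "GB")
```

With these assumptions the 4-bit format costs slightly more than the nominal 0.5 bytes per parameter, which is still close to a quarter of the FP16 footprint.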
The key takeaway? The format alone isn’t enough. The serving layer must include kernels that fully leverage FP4 and FP8 on the hardware. vLLM has integrated the FlashInfer library for this purpose, utilizing FP8 attention, fast FP8 and FP4 matrix multiplications, and GEMM kernels tailored for the GB200. The result is throughput close to the theoretical FP4 limit while maintaining model quality.
These leaps aren’t a one-off. A single round of targeted kernel optimization recently delivered up to 38% higher peak throughput and up to 13% lower minimum latency, with gains across the entire Pareto curve. Keeping your stack up to date means reaping these improvements without any in-house development effort.
What Teams Need to Clarify Before the Switch
Switching to lower precision isn’t a simple toggle you can flip without consequences. FP4 can degrade answer quality when applied to sensitive layers or on specific tasks. In practice, teams draw a clear line: attention mechanisms and sensitive layers often stay in FP8, while most weights shift to FP4. Without a dedicated evaluation suite that tests real queries against the quantized version, quality loss remains a blind spot.
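The selective approach described above can be sketched as a simple rule table that assigns a format to each layer by name. The layer names, the patterns and the choice to treat the output head as sensitive are hypothetical assumptions for illustration. Real quantization tools have their own configuration formats.

```python
import fnmatch

# Hypothetical rules, checked in order; the first matching pattern wins.
DEFAULT_RULES = (
    ("*attn*", "FP8"),     # attention layers stay in FP8
    ("*lm_head*", "FP8"),  # assumed sensitive: output projection
    ("*", "FP4"),          # everything else moves to FP4
)


def plan_formats(layer_names, rules=DEFAULT_RULES):
    """Return a dict mapping each layer name to its target format."""
    plan = {}
    for name in layer_names:
        for pattern, fmt in rules:
            if fnmatch.fnmatchcase(name, pattern):
                plan[name] = fmt
                break
    return plan


if __name__ == "__main__":
    layers = [
        "layers.0.self_attn.q_proj",
        "layers.0.mlp.up_proj",
        "layers.0.mlp.down_proj",
        "lm_head",
    ]
    for name, fmt in plan_formats(layers).items():
        print(f"{name}: {fmt}")
```

Keeping the plan as data rather than code makes it easy to tighten or relax the rules after each evaluation run.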
What Breaks
- Blindly quantizing every layer erodes answer quality
- Without an evaluation suite, losses stay hidden until complaints roll in
- FP4 gains vanish on hardware lacking native Tensor Cores
What Works
- Selective quantization: sensitive layers in FP8, the rest in FP4
- Keep the serving layer current; kernel gains travel with you
- Track cost per token instead of raw GPU utilization as the key metric
Economically, the effort pays off where volume is high. For a service under constant heavy load, the price per token, not raw GPU utilization, decides the margin. That metric belongs in your monitoring dashboard alongside latency and answer quality. Skip it, and you’re optimizing in the dark.
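As a minimal sketch, cost per token can be derived from three numbers most teams already have: the hourly GPU price, the number of GPUs serving the model and the sustained token throughput. The function name and the example figures are assumptions for illustration, not real prices or benchmarks.

```python
def cost_per_million_tokens(gpu_hour_price, num_gpus, tokens_per_second):
    """Serving cost per one million generated tokens.

    gpu_hour_price: price of one GPU for one hour, in any currency
    num_gpus: GPUs serving the model
    tokens_per_second: sustained aggregate throughput across those GPUs
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_price * num_gpus / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Assumed figures: same hardware price, throughput differs by format.
    for label, tps in (("baseline", 2500.0), ("quantized", 10000.0)):
        cost = cost_per_million_tokens(4.0, 8, tps)
        print(f"{label}: {cost:.2f} per million tokens")
```

At a fixed hardware price, the cost falls in direct proportion to throughput, which is why a format change shows up in this metric even when GPU utilization looks unchanged.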
Architecturally, this means model choice, numeric format, and serving stack form one joint decision, not three separate ones. The cheapest inference doesn’t come from the smallest model alone, but from the right model in the right format running on tuned hardware.
Frequently Asked Questions
Does a model lose noticeable quality with FP4?
It can, but not necessarily. Quantizing all layers indiscriminately measurably reduces fidelity. With selective approaches that keep sensitive layers in FP8, the loss usually remains minimal. The only reliable way to assess this is with your own evaluation suite on real queries.
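An evaluation suite of this kind can start very small. The sketch below runs the same queries through a baseline and a quantized model and lists the queries where the quantized answer scores worse. The two model callables and the scoring function are placeholders you would supply yourself; all names here are hypothetical.

```python
def compare_models(queries, baseline, candidate, score, tolerance=0.0):
    """Compare two models on the same queries.

    baseline, candidate: callables that take a query and return an answer
    score: callable (query, answer) -> float, where higher is better
    tolerance: score drop that is still considered acceptable
    """
    regressions = []
    for query in queries:
        base_score = score(query, baseline(query))
        cand_score = score(query, candidate(query))
        if cand_score < base_score - tolerance:
            regressions.append(query)
    rate = len(regressions) / len(queries) if queries else 0.0
    return {
        "total": len(queries),
        "regressions": regressions,
        "regression_rate": rate,
    }


if __name__ == "__main__":
    # Stand-in models and scorer so the sketch runs without a GPU.
    def baseline(q):
        return q.upper()

    def candidate(q):
        return q.upper() if len(q) < 4 else q

    def score(q, answer):
        return 1.0 if answer == q.upper() else 0.0

    print(compare_models(["abc", "hello"], baseline, candidate, score))
```

Running such a comparison on logged production queries before each rollout turns quality loss from a blind spot into a tracked number.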
Do I absolutely need Blackwell hardware?
Not for FP8: Hopper GPUs support it natively. Blackwell is the first to deliver the full FP4 advantage with native Tensor Cores. On older hardware, FP8 remains the sensible compromise between efficiency and quality.
What does the serving layer add if the hardware already supports the format?
The hardware provides the compute units, but the right kernels are what actually tap into them. vLLM leverages FP8 and FP4 kernels through the FlashInfer library, tailored to each GPU. Without this layer, the hardware advantage goes unused.
How significant is the real-world throughput gain?
Documented results show up to four times the throughput on Blackwell compared to Hopper at comparable latency for common models. Ongoing kernel optimizations can add double-digit percentage improvements on top, as a recent optimization round did.
Which metric best controls inference costs?
The cost per generated token. It combines GPU hours, format, and model size into a single metric that directly impacts a service’s margins. GPU utilization alone obscures whether a call was cheap or expensive.
Source of featured image: AI-generated (June 2026), C2PA certificate embedded in image