{"id":42870,"date":"2026-06-03T11:39:32","date_gmt":"2026-06-03T09:39:32","guid":{"rendered":"https:\/\/www.cloudmagazin.com\/2026\/06\/03\/fp8-fp4-und-vllm-wie-quantisierung-die-gpu-kosten-der-ki-2\/"},"modified":"2026-07-23T15:59:30","modified_gmt":"2026-07-23T13:59:30","slug":"fp8-fp4-and-vllm-reducing-gpu-costs-for-ai-inference","status":"publish","type":"post","link":"https:\/\/www.cloudmagazin.com\/en\/2026\/06\/03\/fp8-fp4-and-vllm-reducing-gpu-costs-for-ai-inference\/","title":{"rendered":"FP8, FP4 and vLLM: Reducing GPU Costs for AI Inference"},"content":{"rendered":"<p style=\"color:#0bb7fd;font-size:0.9em;margin:0 0 16px;padding:0;\">8 min read<\/p>\n<p><strong>Training costs for a model are one-time, but inference costs accrue every day. That\u2019s where the math is shifting: with native FP4 Tensor Cores on NVIDIA Blackwell and a serving layer like vLLM that leverages these formats, GPU hours and latency can be significantly reduced without retraining the model. For those running AI workloads, the decision now extends beyond model selection to the numerical format used for computation.<\/strong><\/p>\n<h2>Key Takeaways<\/h2>\n<ul>\n<li><strong>Quantization is the cost lever:<\/strong> Switching from FP16 to FP8 or FP4 halves or quarters the memory required per parameter, easing the load on memory bandwidth, the real bottleneck in inference.<\/li>\n<li><strong>Hardware and software must align:<\/strong> Blackwell introduces native FP4 Tensor Cores, but only a serving layer with compatible kernels unlocks the advantage. 
Without a synchronized stack, the format\u2019s advantage goes unused.<\/li>\n<li><strong>Gains are measurable, risks manageable:<\/strong> Throughput can quadruple at comparable latency, while quality loss is contained through selective quantization and evaluation suites.<\/li>\n<\/ul>\n<p style=\"font-size:0.88em;color:#666;margin:20px 0 32px 0;border-top:1px solid #e5e5e5;border-bottom:1px solid #e5e5e5;padding:10px 0;\"><span style=\"color:#004a59;font-weight:700;text-transform:uppercase;font-size:0.72em;letter-spacing:0.14em;margin-right:14px;\">Related:<\/span><a href=\"https:\/\/www.cloudmagazin.com\/en\/2026\/05\/29\/finops-sees-everything-but-cant-do-anything-why-cloud-waste-isnt-decreasing\/\" style=\"color:#333;text-decoration:underline;\">FinOps sees all but can\u2019t act<\/a>&nbsp;&nbsp;<span style=\"color:#ccc;\">\/<\/span>&nbsp;&nbsp;<a href=\"https:\/\/www.cloudmagazin.com\/en\/2026\/05\/29\/cloud-native-matures-what-knative-and-kubernetes-1-34-mean-for-ai-workloads\/\" style=\"color:#333;text-decoration:underline;\">Cloud-native matures: Knative and Kubernetes 1.34<\/a><\/p>\n<h2 style=\"margin-top:64px;margin-bottom:20px;padding-top:16px;\">Why Inference Costs Are Becoming an Architectural Issue<\/h2>\n<p>A production language model doesn\u2019t burn its GPU hours during one-time training; it burns them with every single request. When a service scales from hundreds to hundreds of thousands of daily calls, inference becomes the largest line item in the cloud bill, growing linearly with usage. That makes it an architectural challenge, not one that can be solved with a bigger reservation discount.<\/p>\n<p>The bottleneck rarely lies in raw compute power. During token generation, memory bandwidth is usually the limiting factor: the model must reload its weights from GPU memory for every token generated. The more compactly weights are stored, the less bandwidth each step consumes. 
That\u2019s precisely where quantization comes into play.<\/p>\n<p><strong>What is quantization?<\/strong> Quantization reduces the numerical precision used to store and compute model weights and activations, for example from 16-bit (FP16) to 8-bit (FP8) or 4-bit (FP4). This shrinks memory requirements and bandwidth load while speeding up matrix multiplications, ideally without noticeable quality loss.<\/p>\n<h2 style=\"margin-top:64px;margin-bottom:20px;padding-top:16px;\">FP8 and FP4: How Blackwell Changes the Number Formats<\/h2>\n<p>Hopper GPUs from the previous generation already support FP8 natively. Blackwell takes it a step further by integrating FP4 Tensor Cores directly into the hardware, alongside increased memory bandwidth. This makes practical a format that compresses weights to just a quarter of their FP16 size. According to NVIDIA, the GB200 achieves significantly higher throughput with FP4 and FP8 operations than the older H200.<\/p>\n<div style=\"overflow-x:auto;-webkit-overflow-scrolling:touch;margin:16px 0 32px 0;\" data-element=\"comparison_table\">\n<table style=\"width:100%;min-width:560px;border-collapse:collapse;font-size:0.95em;\">\n<thead>\n<tr style=\"background:#004a59;color:#fff;\">\n<th style=\"padding:12px 16px;text-align:left;border:1px solid #004a59;\">Format<\/th>\n<th style=\"padding:12px 16px;text-align:left;border:1px solid #004a59;\">Memory per Parameter<\/th>\n<th style=\"padding:12px 16px;text-align:left;border:1px solid #004a59;\">Native Hardware<\/th>\n<th style=\"padding:12px 16px;text-align:left;border:1px solid #004a59;\">Assessment<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"padding:12px 16px;border:1px solid #ddd;\"><strong>FP16<\/strong><\/td>\n<td style=\"padding:12px 16px;border:1px solid #ddd;\">2 bytes<\/td>\n<td style=\"padding:12px 16px;border:1px solid #ddd;\">All common GPUs<\/td>\n<td style=\"padding:12px 16px;border:1px solid #ddd;color:#004a59;font-weight:600;\">Baseline, highest 
fidelity, most bandwidth-intensive<\/td>\n<\/tr>\n<tr>\n<td style=\"padding:12px 16px;border:1px solid #ddd;\"><strong>FP8<\/strong><\/td>\n<td style=\"padding:12px 16px;border:1px solid #ddd;\">1 byte<\/td>\n<td style=\"padding:12px 16px;border:1px solid #ddd;\">Hopper, Blackwell<\/td>\n<td style=\"padding:12px 16px;border:1px solid #ddd;color:#004a59;font-weight:600;\">Robust standard with minimal quality risk<\/td>\n<\/tr>\n<tr>\n<td style=\"padding:12px 16px;border:1px solid #ddd;\"><strong>FP4<\/strong><\/td>\n<td style=\"padding:12px 16px;border:1px solid #ddd;\">0.5 bytes<\/td>\n<td style=\"padding:12px 16px;border:1px solid #ddd;\">Blackwell<\/td>\n<td style=\"padding:12px 16px;border:1px solid #ddd;color:#004a59;font-weight:600;\">Maximum efficiency gain, requires careful calibration<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p style=\"font-size:0.8em;color:#888;margin-top:8px;\">Memory values rounded; quality behavior model-dependent.<\/p>\n<\/div>\n<p>The key takeaway? The format alone isn\u2019t enough. The serving layer must include kernels that fully leverage FP4 and FP8 on the hardware. vLLM has integrated the FlashInfer library for this purpose, utilizing FP8 attention, fast FP8 and FP4 matrix multiplications, and GEMM kernels tailored for the GB200. The result is throughput close to the theoretical FP4 limit while maintaining model quality.<\/p>\n<div class=\"evm-stat-highlight\" style=\"text-align:center;background:#004a59;border-radius:12px;padding:32px 24px;margin:32px 0;\">\n<div style=\"font-size:48px;font-weight:700;color:#fff;letter-spacing:-0.03em;\">up to 4x<\/div>\n<div style=\"font-size:15px;color:#fff;margin-top:8px;max-width:420px;margin-left:auto;margin-right:auto;\">higher throughput on Blackwell compared to Hopper at comparable latency, measured on models like Llama 3.3 70B.<\/div>\n<div style=\"font-size:12px;color:#0bb7fd;margin-top:8px;\">Source: vLLM \/ SemiAnalysis InferenceMAX<\/div>\n<\/div>\n<p>These leaps aren\u2019t a one-off. 
A single round of targeted kernel optimization recently delivered up to 38% higher peak throughput and 13% lower minimum latency, with improvements across the entire Pareto curve. Keeping your stack up to date means reaping these improvements without any in-house development effort.<\/p>\n<h2 style=\"margin-top:64px;margin-bottom:20px;padding-top:16px;\">What Teams Need to Clarify Before the Switch<\/h2>\n<p>Switching to lower precision isn\u2019t a simple toggle you can flip without consequences. FP4 can degrade answer quality when applied to sensitive layers, or on specific tasks. In practice, teams draw a clear line: attention mechanisms and sensitive layers often stay in FP8, while most weights shift to FP4. Without a dedicated evaluation suite that runs real queries against the quantized model, quality loss remains a blind spot.<\/p>\n<div class=\"evm-pros-cons\" style=\"display:grid;grid-template-columns:repeat(auto-fit,minmax(280px,1fr));gap:16px;margin:28px 0;\">\n<div style=\"background:#fafafa;border-top:3px solid #c0392b;padding:18px 20px;border-radius:4px;\">\n<p style=\"margin:0 0 10px 0;font-size:0.78em;font-weight:700;text-transform:uppercase;letter-spacing:0.12em;color:#c0392b;\">What Breaks<\/p>\n<ul style=\"margin:0;padding-left:18px;color:#333;line-height:1.55;font-size:0.95em;\">\n<li style=\"margin-bottom:6px;\">Blindly quantizing every layer erodes answer quality<\/li>\n<li style=\"margin-bottom:6px;\">Without an evaluation suite, losses stay hidden until complaints roll in<\/li>\n<li>FP4 gains vanish on hardware lacking native FP4 Tensor Cores<\/li>\n<\/ul><\/div>\n<div style=\"background:#fafafa;border-top:3px solid #2d7a3e;padding:18px 20px;border-radius:4px;\">\n<p style=\"margin:0 0 10px 0;font-size:0.78em;font-weight:700;text-transform:uppercase;letter-spacing:0.12em;color:#2d7a3e;\">What Works<\/p>\n<ul style=\"margin:0;padding-left:18px;color:#333;line-height:1.55;font-size:0.95em;\">\n<li style=\"margin-bottom:6px;\">Selective quantization: sensitive layers in FP8, 
the rest in FP4<\/li>\n<li style=\"margin-bottom:6px;\">Keep the serving layer current; kernel gains arrive with each upgrade<\/li>\n<li>Track cost per token instead of raw GPU utilization as the key metric<\/li>\n<\/ul><\/div>\n<\/div>\n<p>Economically, the effort pays off where volume is high. For a service under constant heavy load, the price per token, not raw GPU utilization, decides the margin. That metric belongs in your monitoring dashboard alongside latency and answer quality. Skip it, and you\u2019re optimizing in the dark.<\/p>\n<p>Architecturally, this means model choice, numeric format, and serving stack form one joint decision, not three separate ones. The cheapest inference doesn\u2019t come from the smallest model alone, but from the right model in the right format running on hardware the serving stack is tuned for.<\/p>\n<h2 style=\"padding-top:64px;margin-bottom:20px;\">Frequently Asked Questions<\/h2>\n<details>\n<summary><strong>Does a model lose noticeable quality with FP4?<\/strong><\/summary>\n<p style=\"margin:8px 0 4px 24px;color:#555;line-height:1.6;\">It can, but not necessarily. Quantizing all layers indiscriminately measurably reduces fidelity. With selective approaches that keep sensitive layers in FP8, the loss usually remains minimal. The only reliable way to assess this is with your own evaluation suite on real queries.<\/p>\n<\/details>\n<details>\n<summary><strong>Do I absolutely need Blackwell hardware?<\/strong><\/summary>\n<p style=\"margin:8px 0 4px 24px;color:#555;line-height:1.6;\">Not for FP8: Hopper GPUs support it natively. Blackwell is the first to deliver the full FP4 advantage with native Tensor Cores. 
On Hopper-generation hardware, FP8 remains the sensible compromise between efficiency and quality.<\/p>\n<\/details>\n<details>\n<summary><strong>What does the serving layer add if the hardware already supports the format?<\/strong><\/summary>\n<p style=\"margin:8px 0 4px 24px;color:#555;line-height:1.6;\">The hardware provides the compute units, but the right kernels are what actually tap into them. vLLM leverages FP8 and FP4 kernels through the FlashInfer library, tailored to each GPU. Without this layer, the hardware advantage goes unused.<\/p>\n<\/details>\n<details>\n<summary><strong>How significant is the real-world throughput gain?<\/strong><\/summary>\n<p style=\"margin:8px 0 4px 24px;color:#555;line-height:1.6;\">Documented results show up to four times the throughput on Blackwell compared to Hopper at comparable latency for common models. Ongoing kernel optimizations add to this: a single recent round delivered up to 38% higher peak throughput.<\/p>\n<\/details>\n<details>\n<summary><strong>Which metric best controls inference costs?<\/strong><\/summary>\n<p style=\"margin:8px 0 4px 24px;color:#555;line-height:1.6;\">The cost per generated token. It combines GPU hours, format, and model size into a single metric that directly impacts a service\u2019s margins. 
GPU utilization alone obscures whether a call was cheap or expensive.<\/p>\n<\/details>\n<div style=\"margin:40px 0 24px 0;\">\n<p style=\"margin:32px 0 12px 0;font-size:0.78em;font-weight:700;text-transform:uppercase;letter-spacing:0.18em;color:#666;\">More from the MBF Media Network<\/p>\n<div style=\"padding:14px 18px;border-left:3px solid #202528;background:#fafafa;margin-bottom:6px;\">\n<div style=\"font-size:0.7em;font-weight:700;color:#202528;text-transform:uppercase;letter-spacing:0.12em;margin-bottom:4px;\">mybusinessfuture<\/div>\n<p><a href=\"https:\/\/mybusinessfuture.com\/ki-mittelstand-einfuehrung-reihenfolge-bitkom-dihk-2026\/\" style=\"font-weight:600;line-height:1.4;color:#1a1a1a;text-decoration:none;\">Why AI adoption in SMEs fails due to sequencing<\/a><\/p>\n<\/div>\n<div style=\"padding:14px 18px;border-left:3px solid #d65663;background:#fafafa;margin-bottom:6px;\">\n<div style=\"font-size:0.7em;font-weight:700;color:#d65663;text-transform:uppercase;letter-spacing:0.12em;margin-bottom:4px;\">digital-chiefs<\/div>\n<p><a href=\"https:\/\/www.digital-chiefs.de\/ki-budget-cio-outcome-deadline-sommer-2026-gartner-governance\/\" style=\"font-weight:600;line-height:1.4;color:#1a1a1a;text-decoration:none;\">AI budgets before summer: What CIOs must deliver now<\/a><\/p>\n<\/div>\n<div style=\"padding:14px 18px;border-left:3px solid #69d8ed;background:#fafafa;\">\n<div style=\"font-size:0.7em;font-weight:700;color:#69d8ed;text-transform:uppercase;letter-spacing:0.12em;margin-bottom:4px;\">securitytoday<\/div>\n<p><a href=\"https:\/\/www.securitytoday.de\/2026\/05\/31\/type-confusion-in-chromes-v8-warum-dieselbe-luecken-klasse-immer-wiederkommt\/\" style=\"font-weight:600;line-height:1.4;color:#1a1a1a;text-decoration:none;\">Type Confusion in Chrome\u2019s V8: Why the same class of vulnerabilities keeps reappearing<\/a><\/p>\n<\/div>\n<\/div>\n<p style=\"text-align:right;color:#868e96;font-size:0.85em;margin-top:48px;\"><em>Source of featured image: 
AI-generated (June 2026), C2PA certificate embedded in image<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"FP8 and FP4 on NVIDIA Blackwell slash AI inference costs without requiring a model swap.","protected":false},"author":31,"featured_media":46704,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_yoast_wpseo_meta-robots-noindex":"","_yoast_wpseo_meta-robots-nofollow":"","_yoast_wpseo_meta-robots-adv":"","_yoast_wpseo_canonical":"","_yoast_wpseo_opengraph-title":"","_yoast_wpseo_opengraph-description":"","_yoast_wpseo_opengraph-image":"https:\/\/www.cloudmagazin.com\/wp-content\/uploads\/2026\/06\/fp8-fp4-und-vllm-wie-quantisierung-die-gpu-kosten-der-ki-inferenz-drueckt-cover-hero-1.jpg","_yoast_wpseo_opengraph-image-id":0,"_yoast_wpseo_twitter-title":"","_yoast_wpseo_twitter-description":"","_yoast_wpseo_twitter-image":"https:\/\/www.cloudmagazin.com\/wp-content\/uploads\/2026\/06\/fp8-fp4-und-vllm-wie-quantisierung-die-gpu-kosten-der-ki-inferenz-drueckt-cover-hero-1.jpg","_yoast_wpseo_twitter-image-id":0,"pre_headline":"","bildquelle":"","teasertext":"","language":"de","_evm_translation_lang":"","featured_post":0,"featured_post_sortierung":0,"_wp_old_slug":["fp8-fp4-und-vllm-wie-quantisierung-die-gpu-kosten-der-ki-2"],"footnotes":""},"categories":[929],"tags":[],"industry":[],"class_list":["post-42870","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cm-guides"],"evm_reading_time_minutes":6,"wpml_language":"en","wpml_translation_of":42818,"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.9 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>FP8, FP4 and vLLM: Reducing GPU Costs for AI Inference<\/title>\n<meta name=\"description\" content=\"Cut FP8 &amp; FP4 costs on NVIDIA 
Blackwell. Boost vLLM GPU hours &amp; slash latency without model changes.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.cloudmagazin.com\/en\/2026\/06\/03\/fp8-fp4-and-vllm-reducing-gpu-costs-for-ai-inference\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"FP8, FP4 and vLLM: Reducing GPU Costs for AI Inference\" \/>\n<meta property=\"og:description\" content=\"Cut FP8 &amp; FP4 costs on NVIDIA Blackwell. Boost vLLM GPU hours &amp; slash latency without model changes.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.cloudmagazin.com\/en\/2026\/06\/03\/fp8-fp4-and-vllm-reducing-gpu-costs-for-ai-inference\/\" \/>\n<meta property=\"og:site_name\" content=\"cloudmagazin\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cloudmagazincom\/\" \/>\n<meta property=\"article:published_time\" content=\"2026-06-03T09:39:32+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2026-07-23T13:59:30+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.cloudmagazin.com\/wp-content\/uploads\/2026\/06\/fp8-fp4-und-vllm-wie-quantisierung-die-gpu-kosten-der-ki-inferenz-drueckt-cover-hero-1.jpg\" \/>\n<meta name=\"author\" content=\"Alec Chizhik\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/www.cloudmagazin.com\/wp-content\/uploads\/2026\/06\/fp8-fp4-und-vllm-wie-quantisierung-die-gpu-kosten-der-ki-inferenz-drueckt-cover-hero-1.jpg\" \/>\n<meta name=\"twitter:creator\" content=\"@cloudmagazin\" \/>\n<meta name=\"twitter:site\" content=\"@cloudmagazin\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Alec Chizhik\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"NewsArticle\",\"@id\":\"https:\\\/\\\/www.cloudmagazin.com\\\/en\\\/2026\\\/06\\\/03\\\/fp8-fp4-and-vllm-reducing-gpu-costs-for-ai-inference\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.cloudmagazin.com\\\/en\\\/2026\\\/06\\\/03\\\/fp8-fp4-and-vllm-reducing-gpu-costs-for-ai-inference\\\/\"},\"author\":{\"name\":\"Alec Chizhik\",\"@id\":\"https:\\\/\\\/www.cloudmagazin.com\\\/en\\\/#\\\/schema\\\/person\\\/ce38baaa19a580268aedce096597eb3c\"},\"headline\":\"FP8, FP4 and vLLM: Reducing GPU Costs for AI Inference\",\"datePublished\":\"2026-06-03T09:39:32+00:00\",\"dateModified\":\"2026-07-23T13:59:30+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.cloudmagazin.com\\\/en\\\/2026\\\/06\\\/03\\\/fp8-fp4-and-vllm-reducing-gpu-costs-for-ai-inference\\\/\"},\"wordCount\":1101,\"publisher\":{\"@id\":\"https:\\\/\\\/www.cloudmagazin.com\\\/en\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/www.cloudmagazin.com\\\/en\\\/2026\\\/06\\\/03\\\/fp8-fp4-and-vllm-reducing-gpu-costs-for-ai-inference\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.cloudmagazin.com\\\/wp-content\\\/uploads\\\/2026\\\/06\\\/fp8-fp4-und-vllm-wie-quantisierung-die-gpu-kosten-der-ki-inferenz-drueckt-cover-hero-1.jpg\",\"articleSection\":[\"Guides\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.cloudmagazin.com\\\/en\\\/2026\\\/06\\\/03\\\/fp8-fp4-and-vllm-reducing-gpu-costs-for-ai-inference\\\/\",\"url\":\"https:\\\/\\\/www.cloudmagazin.com\\\/en\\\/2026\\\/06\\\/03\\\/fp8-fp4-and-vllm-reducing-gpu-costs-for-ai-inference\\\/\",\"name\":\"FP8, FP4 and vLLM: Reducing GPU Costs for AI 
Inference\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.cloudmagazin.com\\\/en\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.cloudmagazin.com\\\/en\\\/2026\\\/06\\\/03\\\/fp8-fp4-and-vllm-reducing-gpu-costs-for-ai-inference\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.cloudmagazin.com\\\/en\\\/2026\\\/06\\\/03\\\/fp8-fp4-and-vllm-reducing-gpu-costs-for-ai-inference\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.cloudmagazin.com\\\/wp-content\\\/uploads\\\/2026\\\/06\\\/fp8-fp4-und-vllm-wie-quantisierung-die-gpu-kosten-der-ki-inferenz-drueckt-cover-hero-1.jpg\",\"datePublished\":\"2026-06-03T09:39:32+00:00\",\"dateModified\":\"2026-07-23T13:59:30+00:00\",\"description\":\"Cut FP8 & FP4 costs on NVIDIA Blackwell. Boost vLLM GPU hours & slash latency without model changes.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.cloudmagazin.com\\\/en\\\/2026\\\/06\\\/03\\\/fp8-fp4-and-vllm-reducing-gpu-costs-for-ai-inference\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.cloudmagazin.com\\\/en\\\/2026\\\/06\\\/03\\\/fp8-fp4-and-vllm-reducing-gpu-costs-for-ai-inference\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.cloudmagazin.com\\\/en\\\/2026\\\/06\\\/03\\\/fp8-fp4-and-vllm-reducing-gpu-costs-for-ai-inference\\\/#primaryimage\",\"url\":\"https:\\\/\\\/www.cloudmagazin.com\\\/wp-content\\\/uploads\\\/2026\\\/06\\\/fp8-fp4-und-vllm-wie-quantisierung-die-gpu-kosten-der-ki-inferenz-drueckt-cover-hero-1.jpg\",\"contentUrl\":\"https:\\\/\\\/www.cloudmagazin.com\\\/wp-content\\\/uploads\\\/2026\\\/06\\\/fp8-fp4-und-vllm-wie-quantisierung-die-gpu-kosten-der-ki-inferenz-drueckt-cover-hero-1.jpg\",\"width\":1376,\"height\":768,\"caption\":\"GPU-Kosten der KI-Inferenz senken durch FP8- und FP4-Quantisierung und vLLM-Serving; Architektur-Sketchnote: GPU, Inferenz-Pipeline, 
Kostenk\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.cloudmagazin.com\\\/en\\\/2026\\\/06\\\/03\\\/fp8-fp4-and-vllm-reducing-gpu-costs-for-ai-inference\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.cloudmagazin.com\\\/en\\\/home\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"FP8, FP4 and vLLM: Reducing GPU Costs for AI Inference\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.cloudmagazin.com\\\/en\\\/#website\",\"url\":\"https:\\\/\\\/www.cloudmagazin.com\\\/en\\\/\",\"name\":\"cloudmagazin\",\"description\":\"Inspiration f\u00fcr Businessentscheider\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.cloudmagazin.com\\\/en\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.cloudmagazin.com\\\/en\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.cloudmagazin.com\\\/en\\\/#organization\",\"name\":\"cloudmagazin\",\"url\":\"https:\\\/\\\/www.cloudmagazin.com\\\/en\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.cloudmagazin.com\\\/en\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/www.cloudmagazin.com\\\/wp-content\\\/uploads\\\/2020\\\/04\\\/cloudmagazin-logo-klein_menu.jpg\",\"contentUrl\":\"https:\\\/\\\/www.cloudmagazin.com\\\/wp-content\\\/uploads\\\/2020\\\/04\\\/cloudmagazin-logo-klein_menu.jpg\",\"width\":150,\"height\":150,\"caption\":\"cloudmagazin\"},\"image\":{\"@id\":\"https:\\\/\\\/www.cloudmagazin.com\\\/en\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/cloudmagazincom\\\/\",\"https:\\\/\\\/x.com\\\/cloudmagazin\",\"https:\\\/\\\/www.linkedin.com\\\/showcase\\\/cloudmagazin\\\/\"]}
,{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.cloudmagazin.com\\\/en\\\/#\\\/schema\\\/person\\\/ce38baaa19a580268aedce096597eb3c\",\"name\":\"Alec Chizhik\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.cloudmagazin.com\\\/wp-content\\\/uploads\\\/2026\\\/03\\\/alec-chizhik.jpg\",\"url\":\"https:\\\/\\\/www.cloudmagazin.com\\\/wp-content\\\/uploads\\\/2026\\\/03\\\/alec-chizhik.jpg\",\"contentUrl\":\"https:\\\/\\\/www.cloudmagazin.com\\\/wp-content\\\/uploads\\\/2026\\\/03\\\/alec-chizhik.jpg\",\"caption\":\"Alec Chizhik\"},\"description\":\"Alec is the Chief Digital Officer at Evernine and writes about cloud architectures, IT security, and digital operations practices.\",\"sameAs\":[\"https:\\\/\\\/www.linkedin.com\\\/in\\\/alecchizhik\\\/\"],\"url\":\"https:\\\/\\\/www.cloudmagazin.com\\\/en\\\/author\\\/alec\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"FP8, FP4 and vLLM: Reducing GPU Costs for AI Inference","description":"Cut FP8 & FP4 costs on NVIDIA Blackwell. Boost vLLM GPU hours & slash latency without model changes.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.cloudmagazin.com\/en\/2026\/06\/03\/fp8-fp4-and-vllm-reducing-gpu-costs-for-ai-inference\/","og_locale":"en_US","og_type":"article","og_title":"FP8, FP4 and vLLM: Reducing GPU Costs for AI Inference","og_description":"Cut FP8 & FP4 costs on NVIDIA Blackwell. 
Boost vLLM GPU hours & slash latency without model changes.","og_url":"https:\/\/www.cloudmagazin.com\/en\/2026\/06\/03\/fp8-fp4-and-vllm-reducing-gpu-costs-for-ai-inference\/","og_site_name":"cloudmagazin","article_publisher":"https:\/\/www.facebook.com\/cloudmagazincom\/","article_published_time":"2026-06-03T09:39:32+00:00","article_modified_time":"2026-07-23T13:59:30+00:00","og_image":[{"url":"https:\/\/www.cloudmagazin.com\/wp-content\/uploads\/2026\/06\/fp8-fp4-und-vllm-wie-quantisierung-die-gpu-kosten-der-ki-inferenz-drueckt-cover-hero-1.jpg","type":"","width":"","height":""}],"author":"Alec Chizhik","twitter_card":"summary_large_image","twitter_image":"https:\/\/www.cloudmagazin.com\/wp-content\/uploads\/2026\/06\/fp8-fp4-und-vllm-wie-quantisierung-die-gpu-kosten-der-ki-inferenz-drueckt-cover-hero-1.jpg","twitter_creator":"@cloudmagazin","twitter_site":"@cloudmagazin","twitter_misc":{"Written by":"Alec Chizhik","Est. reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"NewsArticle","@id":"https:\/\/www.cloudmagazin.com\/en\/2026\/06\/03\/fp8-fp4-and-vllm-reducing-gpu-costs-for-ai-inference\/#article","isPartOf":{"@id":"https:\/\/www.cloudmagazin.com\/en\/2026\/06\/03\/fp8-fp4-and-vllm-reducing-gpu-costs-for-ai-inference\/"},"author":{"name":"Alec Chizhik","@id":"https:\/\/www.cloudmagazin.com\/en\/#\/schema\/person\/ce38baaa19a580268aedce096597eb3c"},"headline":"FP8, FP4 and vLLM: Reducing GPU Costs for AI 
Inference","datePublished":"2026-06-03T09:39:32+00:00","dateModified":"2026-07-23T13:59:30+00:00","mainEntityOfPage":{"@id":"https:\/\/www.cloudmagazin.com\/en\/2026\/06\/03\/fp8-fp4-and-vllm-reducing-gpu-costs-for-ai-inference\/"},"wordCount":1101,"publisher":{"@id":"https:\/\/www.cloudmagazin.com\/en\/#organization"},"image":{"@id":"https:\/\/www.cloudmagazin.com\/en\/2026\/06\/03\/fp8-fp4-and-vllm-reducing-gpu-costs-for-ai-inference\/#primaryimage"},"thumbnailUrl":"https:\/\/www.cloudmagazin.com\/wp-content\/uploads\/2026\/06\/fp8-fp4-und-vllm-wie-quantisierung-die-gpu-kosten-der-ki-inferenz-drueckt-cover-hero-1.jpg","articleSection":["Guides"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.cloudmagazin.com\/en\/2026\/06\/03\/fp8-fp4-and-vllm-reducing-gpu-costs-for-ai-inference\/","url":"https:\/\/www.cloudmagazin.com\/en\/2026\/06\/03\/fp8-fp4-and-vllm-reducing-gpu-costs-for-ai-inference\/","name":"FP8, FP4 and vLLM: Reducing GPU Costs for AI Inference","isPartOf":{"@id":"https:\/\/www.cloudmagazin.com\/en\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.cloudmagazin.com\/en\/2026\/06\/03\/fp8-fp4-and-vllm-reducing-gpu-costs-for-ai-inference\/#primaryimage"},"image":{"@id":"https:\/\/www.cloudmagazin.com\/en\/2026\/06\/03\/fp8-fp4-and-vllm-reducing-gpu-costs-for-ai-inference\/#primaryimage"},"thumbnailUrl":"https:\/\/www.cloudmagazin.com\/wp-content\/uploads\/2026\/06\/fp8-fp4-und-vllm-wie-quantisierung-die-gpu-kosten-der-ki-inferenz-drueckt-cover-hero-1.jpg","datePublished":"2026-06-03T09:39:32+00:00","dateModified":"2026-07-23T13:59:30+00:00","description":"Cut FP8 & FP4 costs on NVIDIA Blackwell. 
Boost vLLM GPU hours & slash latency without model changes.","breadcrumb":{"@id":"https:\/\/www.cloudmagazin.com\/en\/2026\/06\/03\/fp8-fp4-and-vllm-reducing-gpu-costs-for-ai-inference\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.cloudmagazin.com\/en\/2026\/06\/03\/fp8-fp4-and-vllm-reducing-gpu-costs-for-ai-inference\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.cloudmagazin.com\/en\/2026\/06\/03\/fp8-fp4-and-vllm-reducing-gpu-costs-for-ai-inference\/#primaryimage","url":"https:\/\/www.cloudmagazin.com\/wp-content\/uploads\/2026\/06\/fp8-fp4-und-vllm-wie-quantisierung-die-gpu-kosten-der-ki-inferenz-drueckt-cover-hero-1.jpg","contentUrl":"https:\/\/www.cloudmagazin.com\/wp-content\/uploads\/2026\/06\/fp8-fp4-und-vllm-wie-quantisierung-die-gpu-kosten-der-ki-inferenz-drueckt-cover-hero-1.jpg","width":1376,"height":768,"caption":"GPU-Kosten der KI-Inferenz senken durch FP8- und FP4-Quantisierung und vLLM-Serving; Architektur-Sketchnote: GPU, Inferenz-Pipeline, Kostenk"},{"@type":"BreadcrumbList","@id":"https:\/\/www.cloudmagazin.com\/en\/2026\/06\/03\/fp8-fp4-and-vllm-reducing-gpu-costs-for-ai-inference\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.cloudmagazin.com\/en\/home\/"},{"@type":"ListItem","position":2,"name":"FP8, FP4 and vLLM: Reducing GPU Costs for AI Inference"}]},{"@type":"WebSite","@id":"https:\/\/www.cloudmagazin.com\/en\/#website","url":"https:\/\/www.cloudmagazin.com\/en\/","name":"cloudmagazin","description":"Inspiration f\u00fcr 
Businessentscheider","publisher":{"@id":"https:\/\/www.cloudmagazin.com\/en\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.cloudmagazin.com\/en\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.cloudmagazin.com\/en\/#organization","name":"cloudmagazin","url":"https:\/\/www.cloudmagazin.com\/en\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.cloudmagazin.com\/en\/#\/schema\/logo\/image\/","url":"https:\/\/www.cloudmagazin.com\/wp-content\/uploads\/2020\/04\/cloudmagazin-logo-klein_menu.jpg","contentUrl":"https:\/\/www.cloudmagazin.com\/wp-content\/uploads\/2020\/04\/cloudmagazin-logo-klein_menu.jpg","width":150,"height":150,"caption":"cloudmagazin"},"image":{"@id":"https:\/\/www.cloudmagazin.com\/en\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cloudmagazincom\/","https:\/\/x.com\/cloudmagazin","https:\/\/www.linkedin.com\/showcase\/cloudmagazin\/"]},{"@type":"Person","@id":"https:\/\/www.cloudmagazin.com\/en\/#\/schema\/person\/ce38baaa19a580268aedce096597eb3c","name":"Alec Chizhik","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.cloudmagazin.com\/wp-content\/uploads\/2026\/03\/alec-chizhik.jpg","url":"https:\/\/www.cloudmagazin.com\/wp-content\/uploads\/2026\/03\/alec-chizhik.jpg","contentUrl":"https:\/\/www.cloudmagazin.com\/wp-content\/uploads\/2026\/03\/alec-chizhik.jpg","caption":"Alec Chizhik"},"description":"Alec is the Chief Digital Officer at Evernine and writes about cloud architectures, IT security, and digital operations 
practices.","sameAs":["https:\/\/www.linkedin.com\/in\/alecchizhik\/"],"url":"https:\/\/www.cloudmagazin.com\/en\/author\/alec\/"}]}},"_links":{"self":[{"href":"https:\/\/www.cloudmagazin.com\/en\/wp-json\/wp\/v2\/posts\/42870","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.cloudmagazin.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.cloudmagazin.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.cloudmagazin.com\/en\/wp-json\/wp\/v2\/users\/31"}],"replies":[{"embeddable":true,"href":"https:\/\/www.cloudmagazin.com\/en\/wp-json\/wp\/v2\/comments?post=42870"}],"version-history":[{"count":2,"href":"https:\/\/www.cloudmagazin.com\/en\/wp-json\/wp\/v2\/posts\/42870\/revisions"}],"predecessor-version":[{"id":46706,"href":"https:\/\/www.cloudmagazin.com\/en\/wp-json\/wp\/v2\/posts\/42870\/revisions\/46706"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.cloudmagazin.com\/en\/wp-json\/wp\/v2\/media\/46704"}],"wp:attachment":[{"href":"https:\/\/www.cloudmagazin.com\/en\/wp-json\/wp\/v2\/media?parent=42870"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.cloudmagazin.com\/en\/wp-json\/wp\/v2\/categories?post=42870"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.cloudmagazin.com\/en\/wp-json\/wp\/v2\/tags?post=42870"},{"taxonomy":"industry","embeddable":true,"href":"https:\/\/www.cloudmagazin.com\/en\/wp-json\/wp\/v2\/industry?post=42870"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}