Kubernetes 1.36 “Haru” has been generally available since April 22, 2026, and brings two changes that demand immediate attention in DACH enterprise clusters: cgroup v1 is not merely deprecated but removed entirely, and Dynamic Resource Allocation for GPU workloads is now stable. Anyone still running nodes on cgroup v1 cannot upgrade without migrating first.
Key Takeaways
- Kubernetes 1.36 “Haru” GA since April 22, 2026: cgroup v1 completely removed, no path back after upgrade
- cgroup v2 mandatory: all nodes must be migrated to cgroup v2 before upgrading – RHEL 9+, Ubuntu 22.04+, and SLES 15 SP5+ have it enabled by default
- Dynamic Resource Allocation (DRA) stable: structured GPU/FPGA scheduling without device plugin workarounds is now production-ready
- Job API: control over successive parallelism is now stable – max-in-flight management for batch workloads is significantly improved
- Recommendation: lift upgrade freeze only after validating cgroup v2 on all worker nodes across all clusters
What is cgroup v2?
Control Groups Version 2 (cgroup v2) is the Linux kernel mechanism for resource control of processes and containers. Unlike cgroup v1, cgroup v2 operates with a single unified hierarchy instead of multiple parallel per-subsystem hierarchies – this simplifies resource accounting and enables more precise memory/OOM control, pressure signaling, and I/O latency control for container workloads.
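To see what the unified hierarchy looks like in practice, inspect the cgroup mount on a node (illustrative commands; exact output varies by kernel and distro):

```bash
# cgroup v2 exposes a single tree under /sys/fs/cgroup with one
# controllers file listing everything available in that tree
cat /sys/fs/cgroup/cgroup.controllers
# typical output: cpuset cpu io memory hugetlb pids rdma misc

# under cgroup v1 the same path instead contains one mount per
# controller (cpu/, memory/, blkio/, ...), each with its own hierarchy
ls /sys/fs/cgroup/
```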
cgroup v1 is gone – not deprecated, but removed
The key difference from previous announcements: the Kubernetes project has not deprecated cgroup v1, it has removed it outright. There is no “deprecation grace period” in 1.36. Clusters whose nodes have not all been migrated to cgroup v2 cannot complete the upgrade.
Specifically: kubelet and the container runtime (containerd, CRI-O) expect cgroup v2 as an active kernel feature on every node. On nodes still running in cgroup v1 mode, kubelet won’t start after the upgrade. This particularly affects organizations still running older RHEL 8 or Ubuntu 20.04 nodes – both systems use cgroup v1 by default.
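On cgroup v2 nodes, kubelet and the container runtime should also use the systemd cgroup driver consistently. A minimal sketch of the relevant settings, assuming a default containerd 1.7 and kubeadm layout (file paths and section names differ for containerd 2.x and other setups):

```bash
# /etc/containerd/config.toml -- enable the systemd cgroup driver for runc:
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
#     SystemdCgroup = true

# /var/lib/kubelet/config.yaml -- make kubelet use the same driver:
#   cgroupDriver: systemd

# apply both changes
sudo systemctl restart containerd
sudo systemctl restart kubelet
```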
Migration status by distro
| Distribution | cgroup v2 default since | Action required |
|---|---|---|
| RHEL 9 / Rocky 9 / Alma 9 | Release (2022) | None – already on cgroup v2 |
| RHEL 8 / CentOS 8 | Not default | OS upgrade to RHEL 9 or manual activation (see below) |
| Ubuntu 22.04 LTS | Release (2022) | None – already on cgroup v2 |
| Ubuntu 20.04 LTS | Not default | Set the systemd kernel parameter (see below) or upgrade to 22.04 |
| SLES 15 SP5+ | SP5 | None from SP5 onward – check older service packs |
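For RHEL 8 or Ubuntu 20.04 nodes that cannot be reimaged right away, cgroup v2 can be enabled via the systemd kernel command-line parameter. A sketch for Ubuntu 20.04 with GRUB (drain the node first; the RHEL 8 variant uses grubby):

```bash
# add the unified-hierarchy parameter to the kernel command line (Ubuntu/GRUB)
sudo sed -i 's/^GRUB_CMDLINE_LINUX="/GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=1 /' /etc/default/grub
sudo update-grub
sudo reboot

# RHEL 8 / CentOS 8 equivalent:
#   sudo grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=1"

# after the reboot, verify:
stat -fc %T /sys/fs/cgroup/   # must print cgroup2fs
```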
“The decision to remove cgroup v1 entirely instead of deprecating it reflects the maturity of cgroup v2 in the kernel and in container runtimes. There’s no technical reason to stay on v1.”
Kubernetes SIG-Node, Release Notes 1.36
Dynamic Resource Allocation (DRA) Stable: What it Means for GPU Workloads
Dynamic Resource Allocation has transitioned from beta to stable in version 1.36. DRA solves a fundamental problem with GPU and accelerator workloads: device plugins like the NVIDIA device plugin have so far implemented a static 1:1 mapping of GPU slots to pods. Multiple pods couldn’t share a GPU in a structured way. With DRA, cluster administrators can granularly allocate GPU resources via ResourceClasses and ResourceClaims – one GPU slice for an inference pod, another for a training job.
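A minimal sketch of the DRA objects involved: a cluster-scoped DeviceClass that selects GPUs published by a DRA driver, and a namespaced ResourceClaimTemplate that workloads reference. The driver name gpu.nvidia.com and the API version resource.k8s.io/v1beta1 are assumptions for illustration – verify the group/version and schema in your cluster with kubectl api-resources and kubectl explain resourceclaim.

```bash
# Sketch only: apply a DeviceClass and a ResourceClaimTemplate for single-GPU claims.
# Assumptions: DRA driver "gpu.nvidia.com", API version v1beta1 (check your cluster).
cat <<'EOF' | kubectl apply -f -
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: single-gpu
spec:
  selectors:
  - cel:
      expression: device.driver == "gpu.nvidia.com"
---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: single-gpu
EOF
```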
DRA Benefits for Enterprise
- GPU sharing without NVIDIA MIG hardware requirements
- ResourceClaims are namespaced – multi-tenant capable
- Better utilization of expensive accelerator hardware
- Declarative configuration instead of device plugin hacks
Migration Requirements
- Existing device plugins must be migrated to the DRA API (see the pod sketch after this list)
- NVIDIA device plugin v0.17+ required for DRA compatibility
- Scheduler and kubelet must have DRA feature gate enabled
- No automatic migration path from old device plugin deployments
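On the workload side, a pod then requests a claim declaratively instead of asking for nvidia.com/gpu through the device plugin API. A sketch that reuses the ResourceClaimTemplate from above (the container image and command are placeholders):

```bash
# Sketch only: a pod whose ResourceClaim is generated from the template "single-gpu".
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference-demo
spec:
  restartPolicy: Never
  containers:
  - name: app
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # placeholder image
    command: ["nvidia-smi"]
    resources:
      claims:
      - name: gpu            # references the entry in spec.resourceClaims
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
EOF
```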
What Enterprise Teams in DACH Need to Check Now
Kubernetes 1.36 Upgrade Checklist
- Check all worker nodes for cgroup v2: stat -fc %T /sys/fs/cgroup/ must return cgroup2fs (see the sweep script after this list)
- Check container runtime version: containerd 1.7+ and CRI-O 1.28+ support cgroup v2 natively
- Check managed clusters: EKS, GKE, and AKS migrate nodes automatically during the Kubernetes version upgrade
- Self-managed clusters: rotate or manually migrate all nodes before the upgrade
- Check NVIDIA device plugin version if GPU workloads are in use
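A simple way to sweep every worker node before lifting the upgrade freeze – a sketch that assumes the node names reported by kubectl are SSH-reachable hostnames (adapt for bastion hosts or kubectl debug as needed):

```bash
#!/usr/bin/env bash
# Report cgroup mode and containerd version for every node in the cluster.
set -euo pipefail

for node in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
  fstype=$(ssh "$node" stat -fc %T /sys/fs/cgroup/)
  runtime=$(ssh "$node" containerd --version 2>/dev/null || echo "containerd not found")
  if [ "$fstype" = "cgroup2fs" ]; then
    echo "OK   $node  cgroup v2   $runtime"
  else
    echo "FAIL $node  cgroup v1 ($fstype)   $runtime"
  fi
done
```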
Source: Kubernetes Blog, Kubernetes 1.36 Release Notes, CNCF TAG-Runtime, NVIDIA Blog April 2026.
Frequently Asked Questions
Can I still enable cgroup v1 under Kubernetes 1.36?
No. With Kubernetes 1.36, the cgroup v1 code is completely removed from kubelet and the internal Kubernetes libraries. There is no feature gate that would reactivate cgroup v1. Those who stay on 1.35 or older can continue to use cgroup v1, but an upgrade to 1.36+ requires cgroup v2 on all nodes.
How can I check if my nodes are already using cgroup v2?
On each node: stat -fc %T /sys/fs/cgroup/. If the output is cgroup2fs, cgroup v2 is active; if it is tmpfs, the node is still on cgroup v1. Alternatively: cat /proc/1/cgroup – if all entries collapse into a single line starting with 0::/, the system is running the cgroup v2 unified hierarchy.
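Illustrative output on a migrated node (exact paths vary; the filesystem type and the 0:: prefix are what matters):

```bash
$ stat -fc %T /sys/fs/cgroup/
cgroup2fs

$ cat /proc/1/cgroup
0::/init.scope
```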
Does the cgroup v1 removal also affect managed Kubernetes on AWS/GCP/Azure?
For EKS, GKE, and AKS, the node migration to cgroup v2 is performed automatically by the cloud provider during the Kubernetes version upgrade – new node pools or managed node groups rotate onto cgroup v2-capable base images. Problems arise with self-managed node groups or spot-instance pools whose custom AMIs still run Ubuntu 20.04 or Amazon Linux 2 without a cgroup v2 configuration.
When should I start upgrading to 1.36?
Only after complete cgroup v2 validation of all worker nodes in the cluster. The recommended path: upgrade staging clusters first, monitor all workloads for two weeks, then perform a rolling upgrade of production clusters node by node. For managed clusters with auto-upgrade, open the auto-upgrade window only after node compatibility has been validated.
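For self-managed clusters, the per-node rolling upgrade follows the usual drain/upgrade/uncordon cycle. A sketch assuming a kubeadm-managed cluster with Debian/Ubuntu packages (the node name and package versions are examples):

```bash
NODE=worker-01   # example node name

# move workloads off the node
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data

# on the node itself (example package versions):
#   sudo apt-get update
#   sudo apt-get install -y kubeadm=1.36.0-* kubelet=1.36.0-*
#   sudo kubeadm upgrade node
#   sudo systemctl restart kubelet

# put the node back into service
kubectl uncordon "$NODE"
```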
What changes for teams operating GPU workloads with the NVIDIA device plugin?
The classic NVIDIA device plugin continues to work in Kubernetes 1.36 – it uses the device plugin API, not DRA. Those who want to migrate from device plugin to DRA need NVIDIA device plugin v0.17+ and must redefine ResourceClasses and ResourceClaims for their GPU workloads. The migration is optional but recommended for multi-tenant GPU clusters that need to share accelerator hardware.
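To see which device plugin version a cluster is running, read the image of the deployed DaemonSet – the name and namespace below are the upstream defaults and may differ for Helm installs with custom values:

```bash
# Print the image (and thus the version) of the NVIDIA device plugin DaemonSet.
kubectl -n kube-system get daemonset nvidia-device-plugin-daemonset \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
```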
Title image source: Pexels | Sources: Kubernetes Blog, CNCF, NVIDIA Developer Blog