Assigning various GPU types at runtime in Kubernetes

Summary

This postmortem analyzes the failure to schedule a GPU-accelerated container after a node outage when using preferred node affinity in Kubernetes. The core issue was an overly restrictive pod specification that assumed the availability of a specific GPU resource or node label. While the scheduling directive was “preferred,” the pod’s container spec likely defined a specific resource request (e.g., nvidia.com/gpu) that a non-GPU node could not fulfill, or the affinity rules were too tight to allow fallback to a GPU-less node. The key takeaway is that “optional” resource requests require different configuration patterns than mandatory ones, because the scheduler treats any extended resource request as a hard constraint regardless of affinity weighting.

Root Cause

The root cause lies in the interaction between the pod’s resource requests and the node’s capacity.

  • Inflexible Resource Requests: The container likely requested a specific GPU resource, such as nvidia.com/gpu: 1. This makes the GPU a hard requirement. Even if the scheduling affinity is “preferred,” the scheduler will not place the pod on a node that lacks that specific resource.
  • Missing Fallback Logic: The pod definition did not accommodate a scenario where no GPU is available. Kubernetes does not support “optional” resource requests in the traditional sense (e.g., requesting a resource only if it exists).
  • Label vs. Capacity Mismatch: The affinity rule relied on a label (gpu.type), but the actual hardware requirement was a specific resource capacity. A node might have the label but lack the device plugin registration, or vice versa.
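
For illustration, a hypothetical pod spec like the following reproduces the failure mode. Even though the affinity is only preferred, the nvidia.com/gpu request is a hard constraint, so when the labeled GPU node goes down the pod stays Pending rather than falling back to a CPU-only node:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-required-workload     # illustrative name
spec:
  affinity:
    nodeAffinity:
      # "Preferred" only biases node selection; it does not relax the
      # resource request below.
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: gpu.type
            operator: In
            values: ["nvidia"]
  containers:
  - name: worker
    image: my-workload:latest
    resources:
      limits:
        nvidia.com/gpu: 1         # hard requirement: a CPU-only node can never satisfy it
```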

Why This Happens in Real Systems

  • Default Behavior Misunderstanding: Developers often treat GPU acceleration as a performance enhancement rather than a hard dependency. They rely on affinity to direct traffic but forget that the pod spec itself defines the minimum requirements for the container to run.
  • Heterogeneous Environments: Managing clusters with multiple GPU vendors (NVIDIA, AMD, Intel) and non-GPU nodes is complex. Standard deployment manifests often hardcode vendor-specific resources (e.g., nvidia.com/gpu), which fail on nodes without that specific device plugin.
  • Stateful vs. Stateless Assumptions: For stateful workloads, losing a specific node type can be catastrophic if the pod spec doesn’t allow for dynamic relocation to a node with a different configuration (or no GPU).

Real-World Impact

  • Reduced Availability: When the primary GPU node fails, the pod remains in a Pending state indefinitely because no other node satisfies the resource request.
  • Operational Complexity: Operators must manually intervene to modify deployments or tolerations to reschedule workloads.
  • Resource Underutilization: Nodes without GPUs sit idle while the scheduler waits for a specific node type to become available, increasing costs.
  • Degraded Performance: If the fallback node is CPU-only, the workload may run significantly slower or fail entirely if it strictly requires hardware acceleration.

Example or Code

To handle optional GPUs, use a strategy that lets the pod run without a GPU when necessary. This means removing the explicit GPU resource request from the container spec and instead using a RuntimeClass, or tolerations and node selectors that route the pod to nodes whose runtime handler conditionally injects devices.

A common pattern for “optional” GPU assignment is a mutating admission webhook, or a DaemonSet-based device plugin that virtualizes the resource. Strictly speaking, Kubernetes has no native notion of an optional resource request, so in practice you either expose a generic “accelerator” resource through a multi-vendor device plugin or rely on environment variables and runtime-level device injection that avoids a hard resource request.

If you must control this without webhooks, use a RuntimeClass whose handler sets up the GPU environment; the pod spec should not request a specific vendor resource if the GPU is optional.

Here is a conceptual example of a Deployment that is agnostic to the GPU type. Note that it does not request a specific GPU resource like nvidia.com/gpu. Instead, it relies on a node label and a custom device plugin or sidecar to inject the necessary drivers/devices (often handled by operators like the NVIDIA GPU Operator, but with specific configuration for “optional” modes).

apiVersion: apps/v1
kind: Deployment
metadata:
  name: optional-gpu-workload
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
    spec:
      # Use a RuntimeClass that handles GPU injection if available
      runtimeClassName: gpu-optional
      containers:
      - name: worker
        image: my-workload:latest
        # Note: No explicit GPU resource request here.
        # The RuntimeClass 'gpu-optional' would set up the container
        # with access to GPUs if the node has them, or run in CPU mode otherwise.
        resources:
          limits:
            cpu: "2"
            memory: "4Gi"
          requests:
            cpu: "1"
            memory: "2Gi"
      # Affinity prefers a node labeled with a GPU type, but does not require one
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: gpu.type
                operator: In
                values: ["nvidia", "amd", "intel"]

If you are using a device plugin that supports multiple vendors (like a suitably configured k8s-device-plugin), the plugin advertises the resource under its own name; a node label (e.g., accelerator: generic) is a separate mechanism and does not by itself imply that the resource capacity exists. Strictly speaking, Kubernetes does not support “optional” resource requests natively; the solution usually involves a RuntimeClass or Operator-based injection.
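
As a sketch, the RuntimeClass referenced by the Deployment above might be defined as follows. The handler name gpu-optional is an assumption; it must correspond to a runtime entry configured in containerd on each node (e.g., one wrapping nvidia-container-runtime on GPU nodes and falling back to plain runc on CPU-only nodes):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gpu-optional    # the name pods reference via runtimeClassName
# 'handler' must match a runtime handler in each node's containerd config.
# Both names here are assumptions for illustration, not a standard handler.
handler: gpu-optional
```

Note that a RuntimeClass is cluster-scoped but the handler is resolved per node, which is what makes node-dependent behavior possible at all.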

How Senior Engineers Fix It

  • Use RuntimeClass: Define a RuntimeClass with a handler (e.g., nvidia) that configures the container runtime (containerd) to mount GPU devices if they are present. If the node has no GPUs, the handler does nothing, and the container runs on CPU.
  • Decouple Dependency: Do not request specific vendor resources (nvidia.com/gpu, amd.com/gpu) in the pod spec for optional workloads. Instead, use environment variables or feature gates within the application to detect GPU availability at startup.
  • Node Feature Discovery (NFD): Use NFD to label nodes with capabilities. Use nodeSelector or preferredDuringSchedulingIgnoredDuringExecution on generic labels (e.g., accelerator=true) rather than specific vendor resources.
  • Fallback Deployment Strategy: Deploy two separate ReplicaSets: one with GPU resource requests (nodeSelector for GPU nodes) and one without (nodeSelector for non-GPU nodes). Use a single Service to load balance or an HPA based on custom metrics to scale the appropriate set.
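
The application-level detection described in “Decouple Dependency” can be sketched in Python. This is a minimal example, assuming NVIDIA conventions (the NVIDIA_VISIBLE_DEVICES environment variable set by runtime injection, or /dev/nvidia* device files); a real application would typically ask its ML framework instead (e.g., torch.cuda.is_available()):

```python
import glob
import os


def gpu_available() -> bool:
    """Best-effort GPU detection at startup (assumes NVIDIA conventions)."""
    # Runtime-level injection (nvidia-container-runtime) usually sets this.
    visible = os.environ.get("NVIDIA_VISIBLE_DEVICES", "")
    if visible and visible.lower() not in ("void", "none"):
        return True
    # Fall back to checking for device files mounted into the container.
    return bool(glob.glob("/dev/nvidia[0-9]*"))


def select_backend() -> str:
    """Pick an execution backend so the same image runs on any node."""
    return "gpu" if gpu_available() else "cpu"


if __name__ == "__main__":
    print(f"running with backend: {select_backend()}")
```

Because the pod spec no longer encodes the dependency, the same image can be scheduled anywhere and simply degrade to the CPU path.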

Why Juniors Miss It

  • Confusion of “Preferred” vs “Required”: Juniors often mistake preferredDuringSchedulingIgnoredDuringExecution for a way to make a resource optional. It only influences scheduling preference; it does not change the fact that the container’s resource limits (e.g., nvidia.com/gpu: 1) are a hard requirement. If a node lacks that resource, the pod cannot run there, regardless of affinity.
  • Copy-Pasting Manifests: Most tutorials and examples assume a homogeneous GPU cluster. Juniors copy these manifests without adapting them for heterogeneous environments where some nodes lack GPUs.
  • Lack of Runtime Awareness: Juniors may not be aware of how the Container Runtime Interface (CRI) interacts with device plugins. They focus purely on the Kubernetes API (Pod/Service) without understanding that the actual device passthrough is configured at the runtime level (via RuntimeClass or explicit device mounts).
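
The fallback deployment strategy described under “How Senior Engineers Fix It” can be sketched as two Deployments behind one Service: the GPU variant pins itself to labeled GPU nodes with a hard resource request, while the CPU variant omits the request entirely. All names and labels below are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-app-accelerated
spec:
  replicas: 1
  selector:
    matchLabels: {app: gpu-app, variant: gpu}
  template:
    metadata:
      labels: {app: gpu-app, variant: gpu}
    spec:
      nodeSelector:
        gpu.type: nvidia          # only schedules onto labeled GPU nodes
      containers:
      - name: worker
        image: my-workload:latest
        resources:
          limits:
            nvidia.com/gpu: 1     # hard requirement is safe here
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-app-cpu-fallback
spec:
  replicas: 1
  selector:
    matchLabels: {app: gpu-app, variant: cpu}
  template:
    metadata:
      labels: {app: gpu-app, variant: cpu}
    spec:
      containers:
      - name: worker
        image: my-workload:latest # same image, no GPU request
```

A Service selecting only app: gpu-app spans both variants, so traffic keeps flowing even when every GPU node is down.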