
[WIP]: Adds support for migrating from clusterpolicy to nvidiadriver and vice-versa #2117

Draft
rahulait wants to merge 1 commit into NVIDIA:main from rahulait:upgrade-to-nvidiadriver

Conversation

@rahulait (Contributor) commented Feb 11, 2026

This change supports:

  1. Migrating a cluster from ClusterPolicy-managed drivers to NVIDIADriver CR-managed drivers
  2. Migrating a cluster from NVIDIADriver CR-managed drivers back to ClusterPolicy-managed drivers

Migration Flow: ClusterPolicy → NVIDIADriver CR

This PR implements seamless migration from ClusterPolicy-managed drivers to NVIDIADriver CR-managed drivers using cascade=orphan deletion and controlled pod upgrades.

Overview

The migration ensures zero downtime by:

  • Using cascade=orphan to keep driver pods running during transition
  • Labeling nodes with orphaned pods to trigger controlled upgrades
  • Leveraging the upgrade controller to orchestrate one-by-one pod replacement
  • Using common component labels that work across both management modes

Migration Steps

1. Initial State

  • ClusterPolicy manages drivers with useNvidiaDriverCRD: false
  • Driver daemonset owned by ClusterPolicy
  • Driver pods running normally on GPU nodes

2. Enable NVIDIADriver CRD

Update ClusterPolicy:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
spec:
  driver:
    useNvidiaDriverCRD: true  # Changed from false

3. ClusterPolicy Controller Reconciles

In the state-driver step (state_manager.go):

  1. Detects useNvidiaDriverCRD=true
  2. Calls cleanupAllDriverDaemonSets(ctx, orphanPods=true)
  3. Deletes ClusterPolicy-owned driver daemonsets with PropagationPolicy: Orphan
  4. Result: Driver pods continue running but lose owner references (become orphaned)
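
A minimal sketch of the orphan-cascade deletion described above, using the controller-runtime client. The helper name and surrounding plumbing are illustrative, not the actual cleanupAllDriverDaemonSets implementation:

package sketch

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteDaemonSetOrphaningPods deletes a driver daemonset while leaving its
// pods running: the Orphan propagation policy tells the garbage collector not
// to cascade the delete down to the pods, which therefore lose their owner
// references but keep running.
func deleteDaemonSetOrphaningPods(ctx context.Context, c client.Client, ds *appsv1.DaemonSet) error {
	return c.Delete(ctx, ds, client.PropagationPolicy(metav1.DeletePropagationOrphan))
}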

State transition:

ClusterPolicy-owned DaemonSet (cascade=orphan delete)
    ↓
Orphaned driver pods (no owner, still running)

4. Create NVIDIADriver CR(s)

apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: default
spec:
  driverType: gpu
  # Optional: nodeSelector for node partitioning

What happens:

  1. Watch on NVIDIADriver triggers ClusterPolicy reconciliation
  2. NVIDIADriver controller creates new driver daemonsets
  3. New daemonset pods try to start but can't (anti-affinity prevents duplicate driver pods per node)
  4. New pods remain in Pending state
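
The Pending pods above are blocked by an anti-affinity rule on the driver pods. A hedged sketch of what such a rule could look like, keyed on the common component label; the operator's actual manifests may express this differently:

package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// driverPodAntiAffinity rejects any node that already runs a pod carrying the
// common driver component label, so at most one driver pod can be scheduled
// per node and the new pods stay Pending until the orphaned pod is removed.
var driverPodAntiAffinity = &corev1.Affinity{
	PodAntiAffinity: &corev1.PodAntiAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
			LabelSelector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app.kubernetes.io/component": "nvidia-driver"},
			},
			// Each node is its own topology domain, so the rule is per-node.
			TopologyKey: "kubernetes.io/hostname",
		}},
	},
}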

State:

Orphaned pods (running, no owner)
    +
New NVIDIADriver daemonset pods (pending, anti-affinity blocked)

5. ClusterPolicy Labels Nodes

Post-step() reconciliation loop (clusterpolicy_controller.go):

Conditions for labeling:

  • useNvidiaDriverCRD=true AND hasNVIDIADriverCRs()=true (ClusterPolicy→NVIDIADriver transition)
  • OR useNvidiaDriverCRD=false (NVIDIADriver→ClusterPolicy transition)
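
Expressed as a sketch of the condition (function and variable names are illustrative, not the controller's actual identifiers):

package sketch

// shouldLabelOrphanedNodes mirrors the two conditions above: during the
// ClusterPolicy→NVIDIADriver transition, labeling waits until at least one
// NVIDIADriver CR exists; during the reverse transition it always applies.
func shouldLabelOrphanedNodes(useNvidiaDriverCRD, hasNVIDIADriverCRs bool) bool {
	return (useNvidiaDriverCRD && hasNVIDIADriverCRs) || !useNvidiaDriverCRD
}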

Process:

  1. Finds all driver pods with app.kubernetes.io/component=nvidia-driver label
  2. Identifies orphaned pods (no owner references, running)
  3. For each orphaned pod:
    • Verifies that the node matches an NVIDIADriver CR's nodeSelector
    • Labels node: nvidia.com/gpu-driver-upgrade-state=upgrade-required
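
A condensed sketch of this loop with the controller-runtime client. The label keys are the ones named above; structure and error handling are simplified, and the nodeSelector check against the matching NVIDIADriver CR is omitted:

package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// labelNodesWithOrphanedDriverPods marks every node that hosts a running,
// ownerless driver pod as requiring an upgrade, so the upgrade controller
// will later replace that pod with the pending NVIDIADriver-owned one.
func labelNodesWithOrphanedDriverPods(ctx context.Context, c client.Client) error {
	var pods corev1.PodList
	if err := c.List(ctx, &pods,
		client.MatchingLabels{"app.kubernetes.io/component": "nvidia-driver"}); err != nil {
		return err
	}

	for i := range pods.Items {
		pod := &pods.Items[i]
		// Orphaned: still running, but no owner references after the
		// cascade=orphan daemonset deletion.
		if len(pod.OwnerReferences) != 0 || pod.Status.Phase != corev1.PodRunning {
			continue
		}

		var node corev1.Node
		if err := c.Get(ctx, client.ObjectKey{Name: pod.Spec.NodeName}, &node); err != nil {
			return err
		}
		if node.Labels == nil {
			node.Labels = map[string]string{}
		}
		node.Labels["nvidia.com/gpu-driver-upgrade-state"] = "upgrade-required"
		if err := c.Update(ctx, &node); err != nil {
			return err
		}
	}
	return nil
}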

State:

Orphaned pods (running) → Node labeled "upgrade-required"
    +
New pods (pending, waiting for old pod removal)

6. Upgrade Controller Orchestrates Transition

If driver.upgradePolicy.autoUpgrade=true in ClusterPolicy:

The upgrade controller uses the common label app.kubernetes.io/component=nvidia-driver to discover all driver pods (works for both ClusterPolicy and NVIDIADriver managed pods).

Per-node orchestration (respecting maxParallelUpgrades and maxUnavailable):

  1. Detect: Node has upgrade-state=upgrade-required
  2. Cordon: Make node unschedulable
  3. Drain: Evict GPU workloads (respects PodDisruptionBudgets)
  4. Delete: Remove orphaned driver pod
  5. Wait: New NVIDIADriver pod starts and becomes ready
  6. Uncordon: Make node schedulable again
  7. Update: Set node label upgrade-state=upgrade-done
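
A rough sketch of how the maxParallelUpgrades budget gates this per-node loop. This is not the upgrade controller's actual state machine (which tracks several intermediate states); only the two labels named in this description are used:

package sketch

// nodesToUpgradeNext picks which "upgrade-required" nodes may start the
// cordon/drain/delete sequence, given how many nodes are already mid-upgrade
// and the configured parallelism limit. Purely illustrative.
func nodesToUpgradeNext(stateByNode map[string]string, inProgress, maxParallelUpgrades int) []string {
	budget := maxParallelUpgrades - inProgress
	var next []string
	for node, state := range stateByNode {
		if budget <= 0 {
			break
		}
		if state == "upgrade-required" {
			next = append(next, node)
			budget--
		}
	}
	return next
}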

Final state:

All nodes: NVIDIADriver-owned pods running
Node labels: nvidia.com/gpu-driver-upgrade-state=upgrade-done

Migration Flow: NVIDIADriver CR → ClusterPolicy

This PR also implements the reverse: seamless migration from NVIDIADriver CR-managed drivers back to ClusterPolicy-managed drivers, again using cascade=orphan deletion and controlled pod upgrades.

Overview

The reverse migration ensures zero downtime by:

  • Using cascade=orphan to keep driver pods running during transition
  • Labeling all nodes with orphaned pods to trigger controlled upgrades
  • Leveraging the upgrade controller to orchestrate one-by-one pod replacement
  • Using common component labels that work across both management modes

Migration Steps

1. Initial State

  • NVIDIADriver CR(s) manage drivers
  • Driver daemonsets owned by NVIDIADriver controller
  • Driver pods running normally on GPU nodes
  • ClusterPolicy has useNvidiaDriverCRD: true

2. Disable NVIDIADriver CRD in ClusterPolicy

Update ClusterPolicy:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
spec:
  driver:
    useNvidiaDriverCRD: false  # Changed from true

3. ClusterPolicy Controller Reconciles

In the state-driver step (state_manager.go):

  1. Detects useNvidiaDriverCRD=false
  2. Calls cleanupNVIDIADriverOwnedDaemonSets(ctx, orphanPods=true)
  3. Deletes all NVIDIADriver-owned driver daemonsets with PropagationPolicy: Orphan
  4. ClusterPolicy creates its own driver daemonset (owned by ClusterPolicy)
  5. Result:
    • Old NVIDIADriver pods continue running but lose owner references (become orphaned)
    • New ClusterPolicy daemonset created but pods remain Pending
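
A sketch of how the reverse cleanup can recognize which daemonsets to orphan-delete, by inspecting owner references for the NVIDIADriver kind (the helper name and exact selection criteria are illustrative):

package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
)

// isNVIDIADriverOwned reports whether a driver daemonset is owned by an
// NVIDIADriver CR, i.e. one of the daemonsets that gets orphan-deleted when
// switching back to ClusterPolicy-managed drivers.
func isNVIDIADriverOwned(ds *appsv1.DaemonSet) bool {
	for _, ref := range ds.OwnerReferences {
		if ref.Kind == "NVIDIADriver" && ref.APIVersion == "nvidia.com/v1alpha1" {
			return true
		}
	}
	return false
}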

State transition:

NVIDIADriver-owned DaemonSet (cascade=orphan delete)
    ↓
Orphaned driver pods (no owner, still running)
    +
New ClusterPolicy DaemonSet (pods pending, anti-affinity blocked)

4. Delete NVIDIADriver CR(s)

# List existing NVIDIADriver CRs
kubectl get nvidiadrivers -A

# Delete them
kubectl delete nvidiadriver <name>

What happens:

  • NVIDIADriver CR deletion is processed
  • There is no finalizer to remove, so the CR is deleted immediately
  • Any remaining orphaned daemonsets are cleaned up by ClusterPolicy
  • The ClusterPolicy daemonset already exists and is managing the new pods

State:

Orphaned NVIDIADriver pods (running, no owner)
    +
ClusterPolicy daemonset pods (pending, anti-affinity blocked)

5. ClusterPolicy Labels Nodes

Post-step() reconciliation loop (clusterpolicy_controller.go):

Conditions for labeling:

  • useNvidiaDriverCRD=false (in this mode, every node hosting an orphaned driver pod is labeled)

Process:

  1. Finds all driver pods with app.kubernetes.io/component=nvidia-driver label
  2. Identifies orphaned pods (no owner references, running)
  3. For each orphaned pod:
    • Labels node: nvidia.com/gpu-driver-upgrade-state=upgrade-required
    • No nodeSelector validation needed - ClusterPolicy will manage all nodes

State:

Orphaned pods (running) → All nodes labeled "upgrade-required"
    +
ClusterPolicy pods (pending, waiting for old pod removal)

6. Upgrade Controller Orchestrates Transition

If driver.upgradePolicy.autoUpgrade=true in ClusterPolicy:

The upgrade controller uses the common label app.kubernetes.io/component=nvidia-driver to discover all driver pods.

Per-node orchestration (respecting maxParallelUpgrades and maxUnavailable):

  1. Detect: Node has upgrade-state=upgrade-required
  2. Cordon: Make node unschedulable
  3. Drain: Evict GPU workloads (respects PodDisruptionBudgets)
  4. Delete: Remove orphaned NVIDIADriver pod
  5. Wait: New ClusterPolicy pod starts and becomes ready
  6. Uncordon: Make node schedulable again
  7. Update: Set node label upgrade-state=upgrade-done

Final state:

All nodes: ClusterPolicy-owned pods running
Node labels: nvidia.com/gpu-driver-upgrade-state=upgrade-done
DaemonSets: Single ClusterPolicy-owned driver daemonset

Checklist

  • No secrets, sensitive information, or unrelated changes
  • Lint checks passing (make lint)
  • Generated assets in-sync (make validate-generated-assets)
  • Go mod artifacts in-sync (make validate-modules)
  • Test cases are added for new code paths

Testing

