
[WIP]: Adds support for migrating from clusterpolicy to nvidiadriver and vice-versa #2117

Draft
rahulait wants to merge 1 commit into NVIDIA:main from rahulait:upgrade-to-nvidiadriver

Conversation

@rahulait (Contributor) commented Feb 11, 2026

This change supports:

  1. Migrating a cluster from ClusterPolicy-managed drivers to NVIDIADriver CR-managed drivers
  2. Migrating a cluster from NVIDIADriver CR-managed drivers back to ClusterPolicy-managed drivers

Migration Flow: ClusterPolicy → NVIDIADriver CR

This PR implements seamless migration from ClusterPolicy-managed drivers to NVIDIADriver CR-managed drivers using cascade=orphan deletion and controlled pod upgrades.

Overview

The migration ensures zero downtime by:

  • Using cascade=orphan to keep driver pods running during transition
  • Labeling nodes with orphaned pods to trigger controlled upgrades
  • Leveraging the upgrade controller to orchestrate one-by-one pod replacement
  • Using common component labels that work across both management modes

Migration Steps

1. Initial State

  • ClusterPolicy manages drivers with useNvidiaDriverCRD: false
  • Driver daemonset owned by ClusterPolicy
  • Driver pods running normally on GPU nodes

2. Enable NVIDIADriver CRD

Update ClusterPolicy:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
spec:
  driver:
    useNvidiaDriverCRD: true  # Changed from false

3. ClusterPolicy Controller Reconciles

In the state-driver step (state_manager.go):

  1. Detects useNvidiaDriverCRD=true
  2. Calls cleanupAllDriverDaemonSets(ctx, orphanPods=true)
  3. Deletes ClusterPolicy-owned driver daemonsets with PropagationPolicy: Orphan
  4. Result: Driver pods continue running but lose owner references (become orphaned)
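
A minimal sketch of the orphan-cascade deletion described above, using the controller-runtime client. The helper name and surrounding plumbing are illustrative, not the actual cleanupAllDriverDaemonSets implementation:

package sketch

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteDaemonSetOrphaningPods deletes a driver daemonset while leaving its
// pods running: the Orphan propagation policy tells the garbage collector not
// to cascade the delete down to the pods, which therefore lose their owner
// references but keep running.
func deleteDaemonSetOrphaningPods(ctx context.Context, c client.Client, ds *appsv1.DaemonSet) error {
	return c.Delete(ctx, ds, client.PropagationPolicy(metav1.DeletePropagationOrphan))
}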

State transition:

ClusterPolicy-owned DaemonSet (cascade=orphan delete)
    ↓
Orphaned driver pods (no owner, still running)

4. Create NVIDIADriver CR(s)

apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: default
spec:
  driverType: gpu
  # Optional: nodeSelector for node partitioning

What happens:

  1. Watch on NVIDIADriver triggers ClusterPolicy reconciliation
  2. NVIDIADriver controller creates new driver daemonsets
  3. New daemonset pods try to start but can't (anti-affinity prevents duplicate driver pods per node)
  4. New pods remain in Pending state
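
The Pending pods above are blocked by an anti-affinity rule on the driver pods. A hedged sketch of what such a rule could look like, keyed on the common component label; the operator's actual manifests may express this differently:

package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// driverPodAntiAffinity rejects any node that already runs a pod carrying the
// common driver component label, so at most one driver pod can be scheduled
// per node and the new pods stay Pending until the orphaned pod is removed.
var driverPodAntiAffinity = &corev1.Affinity{
	PodAntiAffinity: &corev1.PodAntiAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
			LabelSelector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app.kubernetes.io/component": "nvidia-driver"},
			},
			// Each node is its own topology domain, so the rule is per-node.
			TopologyKey: "kubernetes.io/hostname",
		}},
	},
}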

State:

Orphaned pods (running, no owner)
    +
New NVIDIADriver daemonset pods (pending, anti-affinity blocked)

5. ClusterPolicy Labels Nodes

Post-step() reconciliation loop (clusterpolicy_controller.go):

Conditions for labeling:

  • useNvidiaDriverCRD=true AND hasNVIDIADriverCRs()=true (ClusterPolicy→NVIDIADriver transition)
  • OR useNvidiaDriverCRD=false (NVIDIADriver→ClusterPolicy transition)
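
Expressed as a sketch of the condition (function and variable names are illustrative, not the controller's actual identifiers):

package sketch

// shouldLabelOrphanedNodes mirrors the two conditions above: during the
// ClusterPolicy→NVIDIADriver transition, labeling waits until at least one
// NVIDIADriver CR exists; during the reverse transition it always applies.
func shouldLabelOrphanedNodes(useNvidiaDriverCRD, hasNVIDIADriverCRs bool) bool {
	return (useNvidiaDriverCRD && hasNVIDIADriverCRs) || !useNvidiaDriverCRD
}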

Process:

  1. Finds all driver pods with app.kubernetes.io/component=nvidia-driver label
  2. Identifies orphaned pods (no owner references, running)
  3. For each orphaned pod:
    • Verifies that the node matches an NVIDIADriver CR's nodeSelector
    • Labels node: nvidia.com/gpu-driver-upgrade-state=upgrade-required
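
A condensed sketch of this loop with the controller-runtime client. The label keys are the ones named above; structure and error handling are simplified, and the nodeSelector check against the matching NVIDIADriver CR is omitted:

package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// labelNodesWithOrphanedDriverPods marks every node that hosts a running,
// ownerless driver pod as requiring an upgrade, so the upgrade controller
// will later replace that pod with the pending NVIDIADriver-owned one.
func labelNodesWithOrphanedDriverPods(ctx context.Context, c client.Client) error {
	var pods corev1.PodList
	if err := c.List(ctx, &pods,
		client.MatchingLabels{"app.kubernetes.io/component": "nvidia-driver"}); err != nil {
		return err
	}

	for i := range pods.Items {
		pod := &pods.Items[i]
		// Orphaned: still running, but no owner references after the
		// cascade=orphan daemonset deletion.
		if len(pod.OwnerReferences) != 0 || pod.Status.Phase != corev1.PodRunning {
			continue
		}

		var node corev1.Node
		if err := c.Get(ctx, client.ObjectKey{Name: pod.Spec.NodeName}, &node); err != nil {
			return err
		}
		if node.Labels == nil {
			node.Labels = map[string]string{}
		}
		node.Labels["nvidia.com/gpu-driver-upgrade-state"] = "upgrade-required"
		if err := c.Update(ctx, &node); err != nil {
			return err
		}
	}
	return nil
}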

State:

Orphaned pods (running) → Node labeled "upgrade-required"
    +
New pods (pending, waiting for old pod removal)

6. Upgrade Controller Orchestrates Transition

If driver.upgradePolicy.autoUpgrade=true in ClusterPolicy:

The upgrade controller uses the common label app.kubernetes.io/component=nvidia-driver to discover all driver pods (works for both ClusterPolicy and NVIDIADriver managed pods).

Per-node orchestration (respecting maxParallelUpgrades and maxUnavailable):

  1. Detect: Node has upgrade-state=upgrade-required
  2. Cordon: Make node unschedulable
  3. Drain: Evict GPU workloads (respects PodDisruptionBudgets)
  4. Delete: Remove orphaned driver pod
  5. Wait: New NVIDIADriver pod starts and becomes ready
  6. Uncordon: Make node schedulable again
  7. Update: Set node label upgrade-state=upgrade-done
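
A rough sketch of how the maxParallelUpgrades budget gates this per-node loop. This is not the upgrade controller's actual state machine (which tracks several intermediate states); only the two labels named in this description are used:

package sketch

// nodesToUpgradeNext picks which "upgrade-required" nodes may start the
// cordon/drain/delete sequence, given how many nodes are already mid-upgrade
// and the configured parallelism limit. Purely illustrative.
func nodesToUpgradeNext(stateByNode map[string]string, inProgress, maxParallelUpgrades int) []string {
	budget := maxParallelUpgrades - inProgress
	var next []string
	for node, state := range stateByNode {
		if budget <= 0 {
			break
		}
		if state == "upgrade-required" {
			next = append(next, node)
			budget--
		}
	}
	return next
}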

Final state:

All nodes: NVIDIADriver-owned pods running
Node labels: nvidia.com/gpu-driver-upgrade-state=upgrade-done

Migration Flow: NVIDIADriver CR → ClusterPolicy

This PR also implements the reverse: seamless migration from NVIDIADriver CR-managed drivers back to ClusterPolicy-managed drivers, again using cascade=orphan deletion and controlled pod upgrades.

Overview

The reverse migration ensures zero downtime by:

  • Using cascade=orphan to keep driver pods running during transition
  • Labeling all nodes with orphaned pods to trigger controlled upgrades
  • Leveraging the upgrade controller to orchestrate one-by-one pod replacement
  • Using common component labels that work across both management modes

Migration Steps

1. Initial State

  • NVIDIADriver CR(s) manage drivers
  • Driver daemonsets owned by NVIDIADriver controller
  • Driver pods running normally on GPU nodes
  • ClusterPolicy has useNvidiaDriverCRD: true

2. Disable NVIDIADriver CRD in ClusterPolicy

Update ClusterPolicy:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
spec:
  driver:
    useNvidiaDriverCRD: false  # Changed from true

3. ClusterPolicy Controller Reconciles

In the state-driver step (state_manager.go):

  1. Detects useNvidiaDriverCRD=false
  2. Calls cleanupNVIDIADriverOwnedDaemonSets(ctx, orphanPods=true)
  3. Deletes all NVIDIADriver-owned driver daemonsets with PropagationPolicy: Orphan
  4. ClusterPolicy creates its own driver daemonset (owned by ClusterPolicy)
  5. Result:
    • Old NVIDIADriver pods continue running but lose owner references (become orphaned)
    • New ClusterPolicy daemonset created but pods remain Pending
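
A sketch of how the reverse cleanup can recognize which daemonsets to orphan-delete, by inspecting owner references for the NVIDIADriver kind (the helper name and exact selection criteria are illustrative):

package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
)

// isNVIDIADriverOwned reports whether a driver daemonset is owned by an
// NVIDIADriver CR, i.e. one of the daemonsets that gets orphan-deleted when
// switching back to ClusterPolicy-managed drivers.
func isNVIDIADriverOwned(ds *appsv1.DaemonSet) bool {
	for _, ref := range ds.OwnerReferences {
		if ref.Kind == "NVIDIADriver" && ref.APIVersion == "nvidia.com/v1alpha1" {
			return true
		}
	}
	return false
}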

State transition:

NVIDIADriver-owned DaemonSet (cascade=orphan delete)
    ↓
Orphaned driver pods (no owner, still running)
    +
New ClusterPolicy DaemonSet (pods pending, anti-affinity blocked)

4. Delete NVIDIADriver CR(s)

# List existing NVIDIADriver CRs
kubectl get nvidiadrivers -A

# Delete them
kubectl delete nvidiadriver <name>

What happens:

  • NVIDIADriver CR deletion is processed
  • There is no finalizer to remove, so the CR is deleted immediately
  • Any remaining orphaned daemonsets are cleaned up by ClusterPolicy
  • The ClusterPolicy daemonset already exists and is managing the new pods

State:

Orphaned NVIDIADriver pods (running, no owner)
    +
ClusterPolicy daemonset pods (pending, anti-affinity blocked)

5. ClusterPolicy Labels Nodes

Post-step() reconciliation loop (clusterpolicy_controller.go):

Conditions for labeling:

  • useNvidiaDriverCRD=false (in this mode, every node hosting an orphaned driver pod is labeled)

Process:

  1. Finds all driver pods with app.kubernetes.io/component=nvidia-driver label
  2. Identifies orphaned pods (no owner references, running)
  3. For each orphaned pod:
    • Labels node: nvidia.com/gpu-driver-upgrade-state=upgrade-required
    • No nodeSelector validation needed - ClusterPolicy will manage all nodes

State:

Orphaned pods (running) → All nodes labeled "upgrade-required"
    +
ClusterPolicy pods (pending, waiting for old pod removal)

6. Upgrade Controller Orchestrates Transition

If driver.upgradePolicy.autoUpgrade=true in ClusterPolicy:

The upgrade controller uses the common label app.kubernetes.io/component=nvidia-driver to discover all driver pods.

Per-node orchestration (respecting maxParallelUpgrades and maxUnavailable):

  1. Detect: Node has upgrade-state=upgrade-required
  2. Cordon: Make node unschedulable
  3. Drain: Evict GPU workloads (respects PodDisruptionBudgets)
  4. Delete: Remove orphaned NVIDIADriver pod
  5. Wait: New ClusterPolicy pod starts and becomes ready
  6. Uncordon: Make node schedulable again
  7. Update: Set node label upgrade-state=upgrade-done

Final state:

All nodes: ClusterPolicy-owned pods running
Node labels: nvidia.com/gpu-driver-upgrade-state=upgrade-done
DaemonSets: Single ClusterPolicy-owned driver daemonset

Checklist

  • No secrets, sensitive information, or unrelated changes
  • Lint checks passing (make lint)
  • Generated assets in-sync (make validate-generated-assets)
  • Go mod artifacts in-sync (make validate-modules)
  • Test cases are added for new code paths

Testing

