[WIP]: Adds support for migrating from clusterpolicy to nvidiadriver and vice-versa #2117
Draft
rahulait wants to merge 1 commit into NVIDIA:main from
…e-versa
This change supports:
1. Migrating cluster from clusterpolicy to nvidiadriver CR
2. Migrating cluster from nvidiadriver CR to clusterpolicy
Signed-off-by: Rahul Sharma <rahulsharm@nvidia.com>
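This change supports:
1. Migrating a cluster from ClusterPolicy to the NVIDIADriver CR
2. Migrating a cluster from the NVIDIADriver CR to ClusterPolicy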
Migration Flow: ClusterPolicy → NVIDIADriver CR
This PR implements seamless migration from ClusterPolicy-managed drivers to NVIDIADriver CR-managed drivers using cascade=orphan deletion and controlled pod upgrades.
Overview
The migration ensures zero downtime by:
Using cascade=orphan to keep driver pods running during transition
Migration Steps
1. Initial State
useNvidiaDriverCRD: false
2. Enable NVIDIADriver CRD
Update ClusterPolicy:
3. ClusterPolicy Controller Reconciles
In state-driver step (state_manager.go):
useNvidiaDriverCRD=true
cleanupAllDriverDaemonSets(ctx, orphanPods=true)
PropagationPolicy: Orphan
State transition:
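The effect of deleting a DaemonSet with PropagationPolicy: Orphan can be illustrated with a small in-memory model. This is only a sketch: the `Cluster`, `Pod`, and `DeleteDaemonSet` names are hypothetical and are not taken from this PR. In client-go the corresponding real setting is `metav1.DeletePropagationOrphan` in `DeleteOptions.PropagationPolicy`, and `kubectl delete --cascade=orphan` on the command line.

```go
package main

// Illustrative in-memory model only; none of these types come from the operator.

// PropagationPolicy mirrors the three Kubernetes deletion propagation modes.
type PropagationPolicy string

const (
	Orphan     PropagationPolicy = "Orphan"
	Background PropagationPolicy = "Background"
	Foreground PropagationPolicy = "Foreground"
)

// Pod is a minimal stand-in for a driver pod.
type Pod struct {
	Name  string
	Node  string
	Owner string // name of the owning DaemonSet; empty once orphaned
}

// Cluster holds DaemonSet names and the pods they own.
type Cluster struct {
	DaemonSets map[string]bool
	Pods       []Pod
}

// DeleteDaemonSet removes the DaemonSet. With Orphan, its pods stay
// in place and only lose their owner; with any other policy the pods
// are removed together with the DaemonSet.
func (c *Cluster) DeleteDaemonSet(name string, policy PropagationPolicy) {
	delete(c.DaemonSets, name)
	kept := c.Pods[:0]
	for _, p := range c.Pods {
		if p.Owner != name {
			kept = append(kept, p) // unrelated pod, untouched
			continue
		}
		if policy == Orphan {
			p.Owner = "" // pod keeps running, now without an owner
			kept = append(kept, p)
		}
	}
	c.Pods = kept
}
```

With Orphan the driver pod survives the DaemonSet deletion, which is what lets the driver stay loaded on the node until the upgrade controller replaces the pod.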
4. Create NVIDIADriver CR(s)
What happens:
Pending state
State:
5. ClusterPolicy Labels Nodes
Post-step() reconciliation loop (clusterpolicy_controller.go):
Conditions for labeling:
useNvidiaDriverCRD=true AND hasNVIDIADriverCRs()=true (ClusterPolicy→NVIDIADriver transition)
useNvidiaDriverCRD=false (NVIDIADriver→ClusterPolicy transition)
Process:
app.kubernetes.io/component=nvidia-driver label
nvidia.com/gpu-driver-upgrade-state=upgrade-required
State:
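The two labeling conditions in this step reduce to a small predicate. The following is a minimal sketch; `shouldLabelOrphanedDriverPods` is a hypothetical helper name, and its two parameters stand in for the `useNvidiaDriverCRD` setting and the result of `hasNVIDIADriverCRs()` mentioned above. The actual check in clusterpolicy_controller.go may be structured differently.

```go
package main

// shouldLabelOrphanedDriverPods reports whether the controller should
// mark nodes as upgrade-required in the current reconcile pass.
// Hypothetical helper written for illustration only.
func shouldLabelOrphanedDriverPods(useNvidiaDriverCRD, hasNVIDIADriverCRs bool) bool {
	if !useNvidiaDriverCRD {
		// NVIDIADriver -> ClusterPolicy: always label in this mode.
		return true
	}
	// ClusterPolicy -> NVIDIADriver: wait until at least one
	// NVIDIADriver CR exists, so a replacement driver is available.
	return hasNVIDIADriverCRs
}
```

The one case that returns false (CRD mode enabled, no NVIDIADriver CR yet) corresponds to the window between steps 3 and 4, where the orphaned pods are left alone.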
6. Upgrade Controller Orchestrates Transition
If driver.upgradePolicy.autoUpgrade=true in ClusterPolicy:
The upgrade controller uses the common label app.kubernetes.io/component=nvidia-driver to discover all driver pods (works for both ClusterPolicy and NVIDIADriver managed pods).
Per-node orchestration (respecting maxParallelUpgrades and maxUnavailable):
upgrade-state=upgrade-required
upgrade-state=upgrade-done
Final state:
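How the two limits bound each pass can be sketched as a batch-selection function. This is a simplified model, not the upgrade controller's code: `NodeState` and `nextUpgradeBatch` are hypothetical names, both limits are treated as plain integers, a `maxParallelUpgrades` of 0 is assumed to mean "no limit", and every node that is neither upgrade-required nor upgrade-done is assumed to count as both in progress and unavailable.

```go
package main

const (
	upgradeRequired = "upgrade-required"
	upgradeDone     = "upgrade-done"
)

// NodeState pairs a node name with the value of its upgrade-state label.
type NodeState struct {
	Name  string
	State string
}

// nextUpgradeBatch returns the nodes that may start upgrading now,
// in input order, without exceeding either limit.
func nextUpgradeBatch(nodes []NodeState, maxParallelUpgrades, maxUnavailable int) []string {
	inProgress := 0
	var waiting []string
	for _, n := range nodes {
		switch n.State {
		case upgradeRequired:
			waiting = append(waiting, n.Name)
		case upgradeDone, "":
			// finished or never scheduled for upgrade
		default:
			inProgress++ // any intermediate state
		}
	}
	budget := maxUnavailable - inProgress
	if maxParallelUpgrades > 0 {
		if p := maxParallelUpgrades - inProgress; p < budget {
			budget = p
		}
	}
	if budget <= 0 {
		return nil
	}
	if budget > len(waiting) {
		budget = len(waiting)
	}
	return waiting[:budget]
}
```

For example, with three nodes waiting, one node mid-upgrade, `maxParallelUpgrades=2`, and `maxUnavailable=3`, only one additional node is started in this pass.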
Migration Flow: NVIDIADriver CR → ClusterPolicy
This document describes the seamless migration from NVIDIADriver CR-managed drivers back to ClusterPolicy-managed drivers using cascade=orphan deletion and controlled pod upgrades.
Overview
The reverse migration ensures zero downtime by:
Using cascade=orphan to keep driver pods running during transition
Migration Steps
1. Initial State
useNvidiaDriverCRD: true
2. Disable NVIDIADriver CRD in ClusterPolicy
Update ClusterPolicy:
3. ClusterPolicy Controller Reconciles
In state-driver step (state_manager.go):
useNvidiaDriverCRD=false
cleanupNVIDIADriverOwnedDaemonSets(ctx, orphanPods=true)
PropagationPolicy: Orphan
Pending
State transition:
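The name cleanupNVIDIADriverOwnedDaemonSets suggests that only DaemonSets owned by an NVIDIADriver CR are selected for orphan deletion in this direction. A minimal sketch of such a filter follows, assuming ownership is determined from the owner reference kind; the `OwnerRef` and `DaemonSet` types and the `ownedByNVIDIADriver` function are hypothetical stand-ins, not the PR's code.

```go
package main

// OwnerRef is a reduced stand-in for a Kubernetes owner reference.
type OwnerRef struct {
	Kind string
	Name string
}

// DaemonSet is a reduced stand-in carrying only what the filter needs.
type DaemonSet struct {
	Name   string
	Owners []OwnerRef
}

// ownedByNVIDIADriver returns the DaemonSets that have at least one
// owner reference of kind NVIDIADriver, preserving input order.
func ownedByNVIDIADriver(all []DaemonSet) []DaemonSet {
	var out []DaemonSet
	for _, ds := range all {
		for _, o := range ds.Owners {
			if o.Kind == "NVIDIADriver" {
				out = append(out, ds)
				break // one matching owner is enough
			}
		}
	}
	return out
}
```

DaemonSets owned by ClusterPolicy, or by nothing at all, are left out of the result and would not be touched by the cleanup.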
4. Delete NVIDIADriver CR(s)
What happens:
State:
5. ClusterPolicy Labels Nodes
Post-step() reconciliation loop (clusterpolicy_controller.go):
Conditions for labeling:
useNvidiaDriverCRD=false (always labels all orphaned pods in this mode)
Process:
app.kubernetes.io/component=nvidia-driver label
nvidia.com/gpu-driver-upgrade-state=upgrade-required
State:
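The labeling pass in this step can be sketched as follows. The two label strings are taken from the description above; everything else is an assumption made for illustration: `DriverPod`, `markNodesForUpgrade`, and in particular the rule that a pod counts as orphaned when it has no owner, which is how orphan deletion normally leaves its dependents.

```go
package main

import "sort"

const (
	componentLabelKey   = "app.kubernetes.io/component"
	componentLabelValue = "nvidia-driver"
	upgradeStateLabel   = "nvidia.com/gpu-driver-upgrade-state"
	upgradeRequired     = "upgrade-required"
)

// DriverPod is a reduced stand-in for a pod object.
type DriverPod struct {
	Name     string
	NodeName string
	Labels   map[string]string
	HasOwner bool // false once the owning DaemonSet was orphan-deleted
}

// markNodesForUpgrade sets the upgrade-required label on every known
// node that hosts an orphaned driver pod. nodeLabels maps node name to
// that node's labels and is modified in place. The sorted names of the
// nodes that were changed are returned.
func markNodesForUpgrade(pods []DriverPod, nodeLabels map[string]map[string]string) []string {
	var changed []string
	for _, p := range pods {
		if p.Labels[componentLabelKey] != componentLabelValue || p.HasOwner {
			continue // not a driver pod, or still managed
		}
		labels, known := nodeLabels[p.NodeName]
		if !known {
			continue // pod references a node we do not track
		}
		if labels == nil {
			labels = map[string]string{}
			nodeLabels[p.NodeName] = labels
		}
		if labels[upgradeStateLabel] == upgradeRequired {
			continue // already marked
		}
		labels[upgradeStateLabel] = upgradeRequired
		changed = append(changed, p.NodeName)
	}
	sort.Strings(changed)
	return changed
}
```

Skipping nodes that already carry the label keeps the pass idempotent, so running it on every reconcile does not restart upgrades that are already queued.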
6. Upgrade Controller Orchestrates Transition
If driver.upgradePolicy.autoUpgrade=true in ClusterPolicy:
The upgrade controller uses the common label app.kubernetes.io/component=nvidia-driver to discover all driver pods.
Per-node orchestration (respecting maxParallelUpgrades and maxUnavailable):
upgrade-state=upgrade-required
upgrade-state=upgrade-done
Final state:
Checklist
make lint
make validate-generated-assets
make validate-modules
Testing