Description
Hi All,
Question
I'm trying to run both CDI-enabled workloads and workloads that rely on NVIDIA_VISIBLE_DEVICES in the same cluster. Is that possible?
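To make this concrete, the CDI-enabled workloads are ordinary pods that request GPUs through the device plugin, roughly like the sketch below (the pod name and image are just illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cdi-workload        # illustrative name
spec:
  runtimeClassName: nvidia-cdi
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # illustrative image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # GPU allocated via the device plugin
```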
Spec
gpu-operator version: 25.10.1
K8s: EKS running 1.35
```yaml
cdi:
  enabled: true
devicePlugin:
  enabled: true
  env:
    - name: DEVICE_LIST_STRATEGY
      value: "envvar,cdi-annotations,cdi-cri"
driver:
  enabled: false
toolkit:
  enabled: true
  env:
    - name: RUNTIME_CONFIG_SOURCE
      value: "file"
```

Errors I'm seeing
With CDI disabled, my non-privileged workloads fail with:

```
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: unresolvable CDI devices
```
With CDI enabled, my privileged containers get NVIDIA_VISIBLE_DEVICES=void, even with runtimeClassName: nvidia-legacy set on the pods and the nvidia-legacy runtime configured in containerd on the nodes.
It seems like this is an either/or setup, even though I thought the whole point of having multiple runtime classes (nvidia, nvidia-legacy, and nvidia-cdi) was to allow both at the same time?
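For reference, the privileged, env-var based workloads look roughly like this (simplified; the pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: legacy-workload     # illustrative name
spec:
  runtimeClassName: nvidia-legacy
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # illustrative image
      command: ["nvidia-smi"]
      securityContext:
        privileged: true
      env:
        - name: NVIDIA_VISIBLE_DEVICES   # legacy env-var based GPU selection
          value: all
```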
Many thanks,
Kryz