Use device-local host NUMA for IPC pinned pools #1575
rwgk wants to merge 2 commits into NVIDIA:main from
Conversation
Align IPC-enabled pinned pools with the host NUMA node closest to the active device to avoid allocation failures on multi-NUMA systems. Update tests to validate dynamic NUMA selection. Co-authored-by: Cursor <cursoragent@cursor.com>
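For context, the driver-level mechanics behind this look roughly as follows. This is a minimal illustrative sketch using the cuda.bindings driver API, not the PR's actual code; error checking is elided, and CU_DEVICE_ATTRIBUTE_HOST_NUMA_ID requires CUDA 12.2+:

```python
from cuda.bindings import driver

driver.cuInit(0)
err, dev = driver.cuDeviceGet(0)

# Ask the driver which host NUMA node is closest to this device.
err, numa_id = driver.cuDeviceGetAttribute(
    driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_HOST_NUMA_ID, dev
)

# Build an IPC-capable pinned pool on that NUMA node instead of
# hard-coding location id 0.
props = driver.CUmemPoolProps()
props.allocType = driver.CUmemAllocationType.CU_MEM_ALLOCATION_TYPE_PINNED
props.handleTypes = driver.CUmemAllocationHandleType.CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR
props.location.type = driver.CUmemLocationType.CU_MEM_LOCATION_TYPE_HOST_NUMA
props.location.id = numa_id

err, pool = driver.cuMemPoolCreate(props)
```

Hard-coding `location.id = 0` is what fails on multi-NUMA systems where the active device sits closer to a different host NUMA node.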
Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.
/ok to test
assert mr.is_ipc_enabled
assert mr.is_device_accessible
assert mr.is_host_accessible
assert mr.device_id == 0  # IPC-enabled uses location id 0
Shouldn't the device_id still be zero?
Also, perhaps line 981 should be updated to say Device(0) explicitly? Otherwise, it uses the currently active device, but that doesn't make sense if tests are supposed to be isolated.
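For instance, something like the following (an illustrative sketch using cuda.core's public Device API, not the PR's actual test code):

```python
from cuda.core.experimental import Device

# Explicit ordinal: the test no longer depends on whichever device
# happens to be current when the test runs.
device = Device(0)
device.set_current()
```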
Or is device_id not really the device ID, but rather the NUMA ID? That seems confusing.
Cursor told me:
> On Device() vs Device(0): Device() resolves to the current device if a context exists, else it defaults to 0 (cudart-like behavior). We immediately call set_current(), so the test is deterministic and follows the device whose host NUMA ID we should use. I'm happy to make it explicit (Device(0)) if you prefer stricter isolation; it doesn't change the intent.
I then asked it to make Device(0) explicit.
> Shouldn't the device_id still be zero?
Cursor explained:
> device_id here isn't a GPU ordinal for pinned host pools. For host allocations we use the CUmemLocation.id as the "device_id" field: -1 for plain host memory, and the host NUMA node ID for CU_MEM_LOCATION_TYPE_HOST_NUMA. So with IPC-enabled pinned pools, device_id reflects the host NUMA node closest to the current device, not necessarily 0. That's why the test now checks device.properties.host_numa_id (falling back to 0 if the attribute is unavailable).
Do you think that's correct? I assumed so and asked it to add a terse comment.
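To make the described behavior concrete, the dynamic check might look like the sketch below. It assumes device.properties exposes a host_numa_id attribute with a fallback to 0 when unavailable, per the explanation above; it is not the PR's exact test code:

```python
expected = getattr(device.properties, "host_numa_id", None)
if expected is None:
    expected = 0  # attribute unavailable: fall back to NUMA node 0

# For an IPC-enabled pinned pool, device_id reflects the host NUMA
# node closest to the device, not a GPU ordinal.
assert mr.device_id == expected
```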
Maybe @leofang can chime in. For most memory resources, the device_id is the device ordinal. It seems like, here, it is overloaded with another meaning. I'm not sure it has a clear meaning for pinned memory that is not associated with a specific device. Perhaps it should be None? Another opinion would help.
The underlying driver overloads the id member of the CUmemLocation_v1 struct, but I don't think we ultimately do the same. Maybe we should introduce some specific properties like device_id and numa_id?
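If the fields were split as suggested, the resource-facing API could look something like this purely hypothetical sketch (none of these property names exist in the library):

```python
class PinnedPoolResourceSketch:
    """Hypothetical shape for a host-pinned, NUMA-located pool resource."""

    def __init__(self, location_type, location_id):
        self._location_type = location_type  # e.g. "host_numa"
        self._location_id = location_id      # raw CUmemLocation.id from the driver

    @property
    def numa_id(self):
        # Meaningful only when the pool lives on a host NUMA node.
        return self._location_id if self._location_type == "host_numa" else None

    @property
    def device_id(self):
        # A host-pinned pool has no GPU ordinal; returning None avoids
        # overloading the field with CUmemLocation.id semantics.
        return None
```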
Make the pinned IPC mempool test explicitly target device 0 and note that device_id reflects the host NUMA location for IPC-enabled pinned pools. Co-authored-by: Cursor <cursoragent@cursor.com>
Closes nvbug 5823243