Use device-local host NUMA for IPC pinned pools#1575

Open
rwgk wants to merge 2 commits into NVIDIA:main from rwgk:numa_dynamic

Collaborator

rwgk commented Feb 4, 2026

Closes nvbug 5823243

Align IPC-enabled pinned pools with the host NUMA node closest to the active device to avoid allocation failures on multi-NUMA systems. Update tests to validate dynamic NUMA selection.

Co-authored-by: Cursor <cursoragent@cursor.com>
Contributor

copy-pr-bot bot commented Feb 4, 2026

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Collaborator Author

rwgk commented Feb 4, 2026

/ok to test

rwgk self-assigned this Feb 4, 2026
rwgk added the cuda.core label Feb 4, 2026

github-actions bot commented Feb 4, 2026

assert mr.is_ipc_enabled
assert mr.is_device_accessible
assert mr.is_host_accessible
assert mr.device_id == 0 # IPC-enabled uses location id 0
Contributor

Shouldn't the device_id still be zero?

Also, perhaps line 981 should be updated to say Device(0) explicitly? Otherwise, it uses the currently active device, but that doesn't make sense if tests are supposed to be isolated.

Contributor

Or is device_id not really the device ID; rather it's the NUMA ID? That seems confusing.

Collaborator Author

Cursor told me:

On Device() vs Device(0): Device() resolves to the current device if a context exists, else it defaults to 0 (cudart‑like behavior). We immediately call set_current(), so the test is deterministic and follows the device whose host NUMA ID we should use. I’m happy to make it explicit (Device(0)) if you prefer stricter isolation; it doesn’t change the intent.

I then asked it to make Device(0) explicit.

Shouldn't the device_id still be zero?

Cursor explained:

device_id here isn’t a GPU ordinal for pinned host pools. For host allocations we use the CUmemLocation.id as the “device_id” field: -1 for plain host memory, and host NUMA node ID for CU_MEM_LOCATION_TYPE_HOST_NUMA. So with IPC‑enabled pinned pools, device_id reflects the host NUMA node closest to the current device, not necessarily 0. That’s why the test now checks device.properties.host_numa_id (falling back to 0 if the attribute is unavailable).

Do you think that's correct? I assumed so and asked it to add a terse comment.
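
For reference, here is a minimal pure-Python sketch of the semantics described above. It is not the cuda.core implementation: `LocationKind` and `reported_device_id` are hypothetical names, the enum members are illustrative stand-ins for the driver's location types, and treating a missing or negative NUMA ID as "unavailable" is an assumption.

```python
from enum import Enum
from typing import Optional


class LocationKind(Enum):
    # Illustrative stand-ins for the driver's memory location types.
    HOST = "host"            # plain pinned host memory
    HOST_NUMA = "host_numa"  # pinned host memory bound to a host NUMA node


def reported_device_id(kind: LocationKind, host_numa_id: Optional[int]) -> int:
    """Return the value that would land in the "device_id" field.

    Plain host memory reports -1. A host-NUMA location reports the NUMA
    node ID, falling back to 0 when the ID is unavailable.
    """
    if kind is LocationKind.HOST:
        return -1
    # Assumption: None or a negative value means the attribute is unavailable.
    if host_numa_id is None or host_numa_id < 0:
        return 0
    return host_numa_id
```

Under these assumptions, a test for an IPC-enabled pinned pool would compare `mr.device_id` against `reported_device_id(LocationKind.HOST_NUMA, device.properties.host_numa_id)` rather than against a hard-coded 0.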

Contributor

Maybe @leofang can chime in. For most memory resources, the device_id is the device ordinal. It seems like, here, it is overloaded with another meaning. I'm not sure it has a clear meaning for pinned memory that is not associated with a specific device. Perhaps it should be None? Another opinion would help.

Collaborator

See https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TYPES.html#group__CUDA__TYPES_1g75cfd5b9fa5c1c6ee2be2547bfbe882e

The underlying driver overloads the id member of the CUmemLocation_v1 struct, but I don't think we ultimately do the same. Maybe we should introduce some specific properties like device_id and numa_id?
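
One possible shape for that split, as a rough sketch only: `PoolLocation`, its `kind` strings, and `raw_id` are hypothetical and not part of cuda.core. The idea is to keep the driver's overloaded id in one place and expose it through two properties that each return None when they do not apply.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PoolLocation:
    """Hypothetical wrapper around an overloaded (type, id) location pair."""

    kind: str    # assumed values: "device", "host", or "host_numa"
    raw_id: int  # the overloaded id as the driver would report it

    @property
    def device_id(self) -> Optional[int]:
        # Only meaningful when the pool lives on a specific GPU.
        return self.raw_id if self.kind == "device" else None

    @property
    def numa_id(self) -> Optional[int]:
        # Only meaningful when the pool is bound to a host NUMA node.
        return self.raw_id if self.kind == "host_numa" else None
```

With this layout, `PoolLocation("host_numa", 1).device_id` is None while its `numa_id` is 1, so callers never have to guess which meaning the integer carries.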

Make the pinned IPC mempool test explicitly target device 0 and note that
device_id reflects the host NUMA location for IPC-enabled pinned pools.

Co-authored-by: Cursor <cursoragent@cursor.com>