Statically-allocate small arrays of numbers#1545
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
/ok to test |
There was a problem hiding this comment.
Pull request overview
This PR implements a performance optimization to reduce Python-to-C overhead in the CUDA bindings by statically allocating small arrays (≤5 elements) instead of using dynamic allocation. This addresses the performance gap reported in issue #659, where CUDA API calls through cuda-bindings were observed to be 3x slower than direct CUDA C++ API calls.
Changes:
- Replaced dynamic memory allocation (calloc/free) with static stack allocation for array parameters with 5 or fewer elements in tensor map encoding functions
- Removed explicit type validation checks that were previously performed before array processing
- Simplified memory management by eliminating the need for explicit free() calls for small arrays
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| cdef cydriver.cuuint32_t* cyelementStrides | ||
| cdef size_t elementStridesLen | ||
| cdef cydriver.cuuint32_t[5] elementStridesStatic | ||
| elementStridesLen = 0 if elementStrides is None else len(elementStrides) | ||
| if elementStridesLen == 0: | ||
| cyelementStrides = NULL | ||
| elif elementStridesLen == 1: | ||
| cyelementStrides = <cydriver.cuuint32_t *>(<cuuint32_t?> elementStrides[0])._pvt_ptr | ||
| elif elementStridesLen <= 5: | ||
| for idx in range(elementStridesLen): | ||
| elementStridesStatic[idx] = <cydriver.cuuint32_t>(<cuuint32_t?> elementStrides[idx])._pvt_ptr[0] | ||
| cyelementStrides = elementStridesStatic | ||
| else: | ||
| raise ValueError("Argument 'elementStrides' too long, must be <= 5") |
There was a problem hiding this comment.
The removal of type validation for the elementStrides parameter changes the error handling behavior. The old code explicitly checked that all elements were cuuint32_t instances and raised a clear TypeError. The new code relies on Cython's type casting, which may produce less clear error messages if invalid types are passed. Consider adding explicit type validation before the casting logic to maintain clear error messages for users.
There was a problem hiding this comment.
Good catch. I think that the Cython checking is sufficient -- it will raise an exception in the same cases as before, but is faster and non-duplicative.
This comment has been minimized.
This comment has been minimized.
|
Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
/ok to test |
|
Successfully created backport PR for |
(cherry picked from commit a35cdd3)
|
See #659 for a discussion of this solution.
There weren't as many instances of this pattern as I had hoped, but nonetheless, for functions where it is used, we should see ~25% reduction in Python-to-C overhead.
Keeping this as a draft for testing while the generator changes get reviewed.