|
@RDambrosio016 whenever you get some time (no rush), let me know what you think. I am testing this out as I go on a fairly large project of mine which brought about this need in the first place. Overall, the bridging code is quite simple. I've given an outline of how I think this should be exposed overall. Let me know what you think, happy to modify things as I go. Also, for this first pass, I would like to keep focused only on the grid-level components of the cooperative groups API, as well as the basic cooperative launch host-side function. We can add multi-device and the other cooperative group components later. |
4bbc882 to e44a8bc
This works as follows:
- Users build their CUDA code via `CudaBuilder` as normal.
- If they want to use the cooperative groups API, then in their `build.rs`, just after building their PTX, they will:
    - Create a `cuda_builder::cg::CooperativeGroups` instance,
    - Add any needed opts for building the Cooperative Groups API bridge code (`-arch=sm_*` and so on),
    - Add their newly built PTX code to be linked with the CG API, which can include multiple PTX, cubin or fatbin files,
    - Call `.compile(..)`, which will spit out a fully linked `cubin`.
- In the user's main application code, instead of using `launch!` to schedule their GPU work, they will now use `launch_cooperative!`.
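The `build.rs` steps above can be sketched roughly as follows. This is not the real `cuda_builder::cg::CooperativeGroups` type, whose method names and signatures are not shown in this thread. `CgLinkPlan` and its `opt`, `include` and `compile` methods are hypothetical stand-ins that only record the options and inputs and check the file kinds listed above. No compiler or linker is invoked, and `-arch=sm_75` is just an example value.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for `cuda_builder::cg::CooperativeGroups`.
/// It only records what a `build.rs` would hand to the real builder.
#[derive(Debug, Default)]
pub struct CgLinkPlan {
    /// Options for building the bridge code, e.g. `-arch=sm_75`.
    pub opts: Vec<String>,
    /// Device-code files to link with the bridge code.
    pub inputs: Vec<PathBuf>,
}

impl CgLinkPlan {
    pub fn new() -> Self {
        Self::default()
    }

    /// Record one option for building the bridge code.
    pub fn opt(mut self, opt: &str) -> Self {
        self.opts.push(opt.to_string());
        self
    }

    /// Record one input file; only PTX, cubin and fatbin files are accepted.
    pub fn include(mut self, path: impl AsRef<Path>) -> Result<Self, String> {
        let path = path.as_ref();
        match path.extension().and_then(|ext| ext.to_str()) {
            Some("ptx") | Some("cubin") | Some("fatbin") => {
                self.inputs.push(path.to_path_buf());
                Ok(self)
            }
            _ => Err(format!("unsupported input: {}", path.display())),
        }
    }

    /// Validate the plan and return the path the linked cubin would be
    /// written to. Nothing is compiled here.
    pub fn compile(&self, output: impl AsRef<Path>) -> Result<PathBuf, String> {
        if self.inputs.is_empty() {
            return Err("nothing to link: add at least one input".to_string());
        }
        Ok(output.as_ref().with_extension("cubin"))
    }
}
```

In a real `build.rs`, the same chain would run right after the `CudaBuilder` step, and the resulting cubin is what the application would load before calling `launch_cooperative!`.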
e44a8bc to aefa92a
|
This looks neat, but if I'm not mistaken, those functions map directly to single PTX intrinsics. Wouldn't it be easier to use inline assembly? I haven't actually looked into this, though, so I'm not sure whether they map to more than one PTX instruction. |
I started down that path at first, and for a few of the pertinent functions the corresponding PTX was clear. I was using a base C++ program compiled down to PTX to verify in addition to cross-referencing with the PTX ISA spec. However, I will say, many of the interfaces were not as clear, and this seemed to be a potentially more reliable way to generate the needed code. Perhaps we can replace some of the clear interfaces with some ASM instead. Happy to iterate on this in the future. |
|
Hello! We are rebooting this project. Sorry for your PR not getting merged! Is this still relevant? |
Todo:
- `cuLaunchCooperativeKernel` in a nice interface. We can add the cooperative multi-device bits later, along with all of the other bits from the cooperative API.
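One thing a friendlier wrapper around `cuLaunchCooperativeKernel` could do is check the grid size before launching. A cooperative launch needs every block to be resident on the device at the same time, so the total block count must not exceed the multiprocessor count times the maximum number of active blocks per multiprocessor for that kernel. The sketch below is plain host-side Rust with no CUDA calls. `CoopLimits`, `max_cooperative_blocks` and `check_cooperative_grid` are hypothetical names, and it assumes the caller has already queried both limits from the driver.

```rust
/// Device limits the caller is assumed to have queried beforehand
/// (hypothetical type, not part of any existing crate).
#[derive(Debug, Clone, Copy)]
pub struct CoopLimits {
    /// Number of multiprocessors on the device.
    pub multiprocessor_count: u32,
    /// Occupancy result for this kernel at the chosen block size
    /// and dynamic shared memory size.
    pub max_active_blocks_per_sm: u32,
}

/// Largest total block count that can be co-resident on the device.
pub fn max_cooperative_blocks(limits: CoopLimits) -> u64 {
    u64::from(limits.multiprocessor_count) * u64::from(limits.max_active_blocks_per_sm)
}

/// Return the total block count if the grid fits, or an error message if
/// the grid is empty or too large for a cooperative launch.
pub fn check_cooperative_grid(grid: (u32, u32, u32), limits: CoopLimits) -> Result<u64, String> {
    // Use u128 so the product of three u32 values cannot overflow.
    let blocks = u128::from(grid.0) * u128::from(grid.1) * u128::from(grid.2);
    if blocks == 0 {
        return Err("grid dimensions must all be non-zero".to_string());
    }
    let max = max_cooperative_blocks(limits);
    if blocks > u128::from(max) {
        return Err(format!(
            "grid of {blocks} blocks exceeds the co-resident limit of {max}"
        ));
    }
    // Safe: blocks <= max, and max fits in u64.
    Ok(blocks as u64)
}
```

Running this check up front would let a wrapper report a readable error instead of only surfacing the driver's error code after the launch is rejected.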