[libc] Update the GPU allocator to work under post-Volta ITS
Summary:
Several gaps kept the allocator from working under NVIDIA's independent
thread scheduling (ITS) model. This commit fixes the ones I know of.
Generally this required using the correct lane masks, synchronizing
before a few dependent operations, and overhauling the allocate function
to stick with its existing mask instead of re-querying it.
The general idiom is to obtain a single lane mask at the start and use
it opportunistically. Every later use must explicitly synchronize on
exactly this subset of lanes. In other words, query the mask once and
never change it.
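
For illustration only, here is a minimal CUDA sketch of the
query-once idiom. It is not the allocator code from this commit:
`reserve_slot`, `demo_kernel`, and `run_demo` are hypothetical names, and
the sketch assumes a Volta-or-newer GPU and a single warp of 32 threads.
The mask is read once with `__activemask()`, and the dependent broadcast
synchronizes on that same mask instead of querying again.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical helper, not the libc allocator: reserves one slot per calling
// lane using a mask that is queried exactly once.
__device__ uint32_t reserve_slot(uint32_t *counter) {
  // Lanes executing together right now. Under ITS this may be only a subset
  // of the lanes that took the branch, which is fine as long as every later
  // step keeps using this exact subset.
  const uint32_t mask = __activemask();
  const uint32_t lane = threadIdx.x % warpSize;
  const uint32_t leader = __ffs(mask) - 1; // lowest lane in the mask

  uint32_t base = 0;
  if (lane == leader) // one atomic for the whole group
    base = atomicAdd(counter, static_cast<uint32_t>(__popc(mask)));

  // Dependent operation: synchronize and broadcast over the same mask. Do not
  // call __activemask() again here, the lanes may have diverged since.
  base = __shfl_sync(mask, base, leader);

  // Offset by the rank of this lane among the lanes in the mask.
  return base + __popc(mask & ((1u << lane) - 1u));
}

__global__ void demo_kernel(uint32_t *counter, uint32_t *slots) {
  // Divergent call site: only even lanes reserve a slot.
  if (threadIdx.x % 2 == 0)
    slots[threadIdx.x] = reserve_slot(counter);
}

// Launches one warp and returns the final counter value.
uint32_t run_demo() {
  uint32_t *counter = nullptr, *slots = nullptr;
  cudaMalloc(&counter, sizeof(uint32_t));
  cudaMalloc(&slots, 32 * sizeof(uint32_t));
  cudaMemset(counter, 0, sizeof(uint32_t));
  demo_kernel<<<1, 32>>>(counter, slots);
  uint32_t result = 0;
  // cudaMemcpy on the default stream waits for the kernel to finish.
  cudaMemcpy(&result, counter, sizeof(uint32_t), cudaMemcpyDeviceToHost);
  cudaFree(counter);
  cudaFree(slots);
  return result;
}
```

Even if the hardware splits the even lanes into several groups, each
group aggregates over its own mask, so the counter still ends up equal
to the number of calling lanes.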
This passes most tests; however, I have encountered two issues:
1. A bug in `nvlink` where it fails to link symbols called in `free`.
2. A deadlock under heavy divergence caused by IPSCCP altering control
   flow.
I will address these later, but for now this makes the *source* correct
so it can be enabled by anyone else if they need it.
GitOrigin-RevId: eac18e783f034bd294a82cd0e69a7abf73583d28
1 file changed