234ff6b4012919fab48dd47492dfd9bb8bc55eed - llvm-project/openmp

commit	234ff6b4012919fab48dd47492dfd9bb8bc55eed
author	Joseph Huber <huberjn@outlook.com>	Tue Jan 02 16:53:53 2024 -0600
committer	Copybara-Service <copybara-worker@google.com>	Tue Jan 02 14:55:45 2024 -0800
tree	c06c1748cae5aaa11033ebb9fda78c75f096e8c9
parent	74b1a4db3f1e5da47f0bb6b3d9a9699939a73643

[Libomptarget] Fix RPC-based malloc on NVPTX (#72440)

Summary:
The device allocator on NVPTX architectures is enqueued to a stream that
the kernel is potentially executing on. This can lead to deadlocks: the
kernel will not proceed until the allocation is complete, and the
allocation will not proceed until the kernel is complete. CUDA 11.2
introduced async allocations that we can manually place on separate
streams to combat this. This patch adds a new allocation type that is
guaranteed to be non-blocking, so it will actually make progress. Only
Nvidia needs to care about this, as the other targets do not block in
this way by default.
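
The circular wait described above can be illustrated with a small,
self-contained model. This is only a sketch under assumptions: `Item`,
`Stream`, and `drains` are hypothetical names, and a stream is modeled
as nothing more than an in-order queue. It is not how libomptarget or
CUDA are implemented.

```cpp
#include <deque>
#include <vector>

// Hypothetical model: a stream is an in-order queue of work items.
// An item may need another item (by id) to finish before it can retire.
struct Item {
  int Id;
  int WaitsOn; // id of the item this one depends on, or -1 for none
};

using Stream = std::deque<Item>;

constexpr int Kernel = 0;
constexpr int Alloc = 1;

// Repeatedly retire the head of every stream whose dependency is met.
// Returns true if all streams drain, false if no head can make progress.
bool drains(std::vector<Stream> Streams) {
  std::vector<int> Done;
  auto IsDone = [&Done](int Id) {
    if (Id < 0)
      return true;
    for (int D : Done)
      if (D == Id)
        return true;
    return false;
  };
  for (;;) {
    bool Progress = false;
    bool AllEmpty = true;
    for (Stream &S : Streams) {
      if (S.empty())
        continue;
      AllEmpty = false;
      if (IsDone(S.front().WaitsOn)) {
        Done.push_back(S.front().Id);
        S.pop_front();
        Progress = true;
      }
    }
    if (AllEmpty)
      return true;
    if (!Progress)
      return false; // every remaining head is stuck: deadlock
  }
}

// The allocation is queued behind the kernel that is waiting for it.
bool sameStreamDrains() {
  Stream S;
  S.push_back({Kernel, Alloc});
  S.push_back({Alloc, -1});
  std::vector<Stream> Streams;
  Streams.push_back(S);
  return drains(Streams);
}

// The allocation sits on its own stream and can retire first.
bool separateStreamDrains() {
  Stream KernelStream;
  KernelStream.push_back({Kernel, Alloc});
  Stream AllocStream;
  AllocStream.push_back({Alloc, -1});
  std::vector<Stream> Streams;
  Streams.push_back(KernelStream);
  Streams.push_back(AllocStream);
  return drains(Streams);
}
```

In this model the same-stream case never drains, while moving the
allocation to a separate stream lets both items retire.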

I had originally tried to make the `alloc` and `free` methods take a
`__tgt_async_info`. However, I observed that with the large volume of
streams being created by a parallel test, it quickly locked up the
system, presumably because too many streams were being created. This
implementation now just creates a new stream and immediately destroys
it. This obviously isn't very fast, but it at least gets the cases to
stop deadlocking for now.
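
The create-use-destroy pattern can be sketched as follows. The driver
functions here are hypothetical stand-ins (`streamCreate`, `allocAsync`,
and so on) so that the example builds without CUDA. On a real system the
corresponding CUDA driver calls would presumably be `cuStreamCreate`
with `CU_STREAM_NON_BLOCKING`, `cuMemAllocAsync`, `cuStreamSynchronize`,
and `cuStreamDestroy`, but that mapping is an assumption and is not
taken from the patch itself.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-ins for a driver API; none of these are real CUDA calls.
struct FakeStream {};

static int LiveStreams = 0; // number of streams currently alive

FakeStream *streamCreate() {
  ++LiveStreams;
  return new FakeStream();
}

void streamDestroy(FakeStream *S) {
  --LiveStreams;
  delete S;
}

// Stand-in for an asynchronous allocation enqueued on stream S.
void *allocAsync(std::size_t Size, FakeStream *S) {
  (void)S;
  return std::malloc(Size);
}

// Stand-in for waiting until all work on stream S has finished.
void streamSynchronize(FakeStream *S) { (void)S; }

// Allocate on a fresh stream so the request never queues behind a
// running kernel, then destroy the stream so none accumulate.
void *allocNonBlocking(std::size_t Size) {
  FakeStream *S = streamCreate();
  void *Ptr = allocAsync(Size, S);
  streamSynchronize(S);
  streamDestroy(S);
  return Ptr;
}
```

Because each call tears its stream down before returning, the number of
live streams does not grow with the number of allocations, at the cost
of one stream creation and destruction per call.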

GitOrigin-RevId: fb32977ac768f27890af28308a6968c30af2aa3e

8 files changed
