[AMDGPU] Correct rmw atomics s_waitcnt generation

The AMDGPU SIMemoryLegalizer used the ordering address space rather
than the instruction address space when determining which s_waitcnt
to generate to ensure that a read-modify-write atomic has completed.
As a result, the generated s_waitcnt waited on unnecessary counters.
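
A minimal sketch of the difference follows. It is a simplified model,
not the SIMemoryLegalizer code: the names (AddrSpace, Counter,
AtomicRmw, countersFor, completionWaitOld, completionWaitNew) are
hypothetical, and the mapping assumes only that global/scratch accesses
are tracked by vmcnt and LDS/GDS accesses by lgkmcnt.

```cpp
#include <cstdint>

// Hypothetical, simplified model; these names do not come from LLVM.
enum AddrSpace : std::uint8_t { Global = 1, LDS = 2, Scratch = 4, GDS = 8 };
enum Counter : std::uint8_t { NoCnt = 0, VmCnt = 1, LgkmCnt = 2 };

// Counters that track outstanding accesses to the given address spaces
// (assumed mapping: global/scratch -> vmcnt, LDS/GDS -> lgkmcnt).
constexpr std::uint8_t countersFor(std::uint8_t spaces) {
  return static_cast<std::uint8_t>(
      ((spaces & (Global | Scratch)) ? VmCnt : NoCnt) |
      ((spaces & (LDS | GDS)) ? LgkmCnt : NoCnt));
}

struct AtomicRmw {
  std::uint8_t instrAddrSpace;    // spaces the instruction itself accesses
  std::uint8_t orderingAddrSpace; // spaces covered by its memory ordering
};

// Before: completion wait derived from the ordering address space.
constexpr std::uint8_t completionWaitOld(const AtomicRmw &A) {
  return countersFor(A.orderingAddrSpace);
}

// After: completion wait derived from the instruction address space.
constexpr std::uint8_t completionWaitNew(const AtomicRmw &A) {
  return countersFor(A.instrAddrSpace);
}

// An LDS atomic whose ordering covers both global memory and LDS.
constexpr AtomicRmw LdsAtomic{LDS, Global | LDS};

static_assert(completionWaitOld(LdsAtomic) == (VmCnt | LgkmCnt),
              "old behaviour also waits on vmcnt");
static_assert(completionWaitNew(LdsAtomic) == LgkmCnt,
              "new behaviour waits only on lgkmcnt");
```

In this model the LDS atomic can only have outstanding lgkmcnt work, so
waiting on vmcnt to know that the atomic itself has completed is
unnecessary.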

Differential Revision: https://reviews.llvm.org/D96743

GitOrigin-RevId: c62b737ad655f189cf76f4324ba04317133d6648
11 files changed