54067f47e6b116e8d11d87f33b60157fe605646b - llvm-project/libc

commit	54067f47e6b116e8d11d87f33b60157fe605646b
author	Leandro Lacerda <leandrolcampos@yahoo.com.br>	Sat Aug 16 17:14:26 2025 -0300
committer	Copybara-Service <copybara-worker@google.com>	Sat Aug 16 13:15:17 2025 -0700
tree	f7cd1f3d7e911dcfa774db799fd778874655fb56
parent	f7b6ed379a35a91a87348499f849d2c0d3db70bc

[libc][gpu] Disable loop unrolling in the throughput benchmark loop (#153971)

This patch makes GPU throughput benchmark results more comparable across
targets by disabling loop unrolling in the benchmark loop.

Motivation:
* PTX (post-LTO) evidence on NVPTX: for libc `sin`, the generated PTX
shows the `throughput` loop unrolled 8x at `N=128` (one iteration
advances the input pointer by 64 bytes = 8 doubles), interleaving eight
independent chains before the back-edge. This hides latency and
significantly reduces cycles/call as the batch size `N` grows.
* Observed scaling (NVPTX measurements): with unrolling enabled, `sin`
dropped from ~3,100 cycles/call at `N=1` to ~360 at `N=128`. After
enforcing `#pragma clang loop unroll(disable)`, results stabilized
(e.g., ~3,100 cycles/call at `N=1` vs. ~2,700 at `N=128`).
* libdevice contrast: the libdevice `sin` path did not exhibit a similar
drop in our measurements, and the PTX appears as compact internal calls
rather than a long FMA chain, leaving less ILP for the outer loop to
extract.
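
The arithmetic behind these observations can be checked with a small
sketch. The helper names `unroll_factor` and `scaling_ratio` are
hypothetical and exist only for this illustration; the inputs are the
approximate figures quoted above, not new measurements.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: the unroll factor implied by the pointer stride
// seen in the PTX, i.e. bytes advanced per loop iteration divided by the
// size of one element.
constexpr std::size_t unroll_factor(std::size_t stride_bytes,
                                    std::size_t elem_bytes) {
  return stride_bytes / elem_bytes;
}

// Hypothetical helper: how many times cheaper a call appears at a larger
// batch size compared with N=1.
constexpr double scaling_ratio(double cycles_at_n1, double cycles_at_n) {
  return cycles_at_n1 / cycles_at_n;
}

// 64 bytes per iteration over 8-byte doubles means 8 elements per
// iteration.
static_assert(unroll_factor(64, 8) == 8, "64-byte stride covers 8 doubles");
```

With the quoted figures, `scaling_ratio(3100.0, 360.0)` is roughly 8.6
with unrolling enabled, while `scaling_ratio(3100.0, 2700.0)` is roughly
1.15 with unrolling disabled.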

What this change does:
* Applies `#pragma clang loop unroll(disable)` to the GPU `throughput()`
loop in both NVPTX and AMDGPU backends.
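
As a rough illustration of the pattern (not the actual benchmark code),
a batch loop with unrolling disabled might look like the sketch below.
`run_batch` and `square_plus_one` are hypothetical names, the cycle
timing done by the real benchmark is omitted, and the pragma is guarded
so that non-Clang compilers simply skip it.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the function being benchmarked.
static double square_plus_one(double x) { return x * x + 1.0; }

// Hypothetical, simplified batch loop: calls `f` once per input and
// returns the sum of the results. The real benchmark also reads GPU
// cycle counters around the loop, which is not modeled here.
template <typename F>
double run_batch(F f, const double *inputs, std::size_t n) {
  double sum = 0.0;
#if defined(__clang__)
// Ask Clang not to unroll the loop that follows, so each iteration
// issues exactly one call before the back-edge.
#pragma clang loop unroll(disable)
#endif
  for (std::size_t i = 0; i < n; ++i)
    sum += f(inputs[i]);
  return sum;
}
```

Calling `run_batch(square_plus_one, inputs, n)` on an array of `n`
doubles evaluates the function once per element; the pragma changes
only how the loop is compiled, not what it computes.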

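Leaving unrolling entirely to the optimizer skews comparisons that are
meant to be apples-to-apples (e.g., libc vs. vendor), because one
implementation's loop may benefit from unrolling far more than the
other's. Disabling unrolling yields fairer, more consistent numbers.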

GitOrigin-RevId: 75bf7392089d027bb6fa78ded21acaa97b16a412

2 files changed
