[X86] Lower `minimum`/`maximum`/`minimumnum`/`maximumnum` using bitwise operations (#170069)

I got somewhat nerd-sniped when looking at a Rust issue and seeing [this
comment about how various min/max operations are compiled on various
architectures](https://github.com/rust-lang/rust/issues/91079#issuecomment-3592393680).
The current `minimum`/`maximum`/`minimumnum`/`maximumnum` code is very
branchy because of the signed-zero handling. Even though we emit select
operations, LLVM *really* prefers to lower them to branches, to the
point of scalarizing vector code to do so, even if `blendv` is
supported. (Should I open a separate issue for that? It seems concerning
that LLVM would rather scalarize a vector operation than emit a couple
`blendv` operations in a row.)

It turns out that handling signed zero operands properly can be done
using a couple bitwise operations, which is branchless and easily
vectorizable, by taking advantage of the following properties:
- When you take the maximum of two floats, the output sign bit will be
the bitwise AND of the input sign bits (since 0 means positive, and the
maximum always prefers the positive number).
- When you take the minimum of two floats, the output sign bit will be
the bitwise OR of the input sign bits (since 1 means negative, and the
minimum always prefers the negative number).

We can further optimize this by taking advantage of the fact that x86's
min/max instructions operate like a floating-point compare+select,
returning the second operand if both are (positive or negative) zero.
Altogether, the operations go as follows:

- For taking the minimum:
- Call `minps`/`minpd`/etc. on the input operands. This will return the
minimum, unless both are zero, in which case it will return the second
operand.
- Take the bitwise AND of the first operand and the highest bit, so that
everything is zero except the sign bit.
- Finally, OR that with the minimum from earlier. The only incorrect
case was when the second operand was +0.0 and the first operand was
-0.0. By OR-ing the first operand's sign bit with the existing minimum,
we correct this.

- Analogously, for taking the maximum:
- Call `maxps`/`maxpd`/etc. on the input operands. This will return the
maximum, unless both are zero, in which case it will return the second
operand.
- Take the bitwise OR of the first operand and a bit pattern which is
all ones except for the highest bit, so that everything is ones except
the sign bit.
  - Finally, AND that with the maximum from earlier.

In the case of NaNs, this approach might change the output NaN's sign
bit. We don't have to worry about this for a couple reasons: firstly,
LLVM's language reference [allows NaNs to have a nondeterministic sign
bit](https://llvm.org/docs/LangRef.html#floatnan); secondly, there's
already a step after this that selects one of the input NaNs anyway.

[Here's an Alive2 proof.](https://alive2.llvm.org/ce/z/EfQZ-G) It
obviously can't verify that the implementation is sound, but shows that
at least the theory is.

I believe this approach is faster than even properly-vectorized `blendv`
operations because it eliminates a data dependency chain. Furthermore on
AVX-512, the load, AND, and OR can become a single `vpternlogd`. My
highly-unrepresentative microbenchmarks (compiled for x86-64-v2, so
SSE4.1) say ~7.5%-10% faster than `blendv`, which makes me confident
this is at least not a regression.
6 files changed
tree: 14882d559ecd9e0b7e755d100699b479f54ef3af
  1. .ci/
  2. .github/
  3. bolt/
  4. clang/
  5. clang-tools-extra/
  6. cmake/
  7. compiler-rt/
  8. cross-project-tests/
  9. flang/
  10. flang-rt/
  11. libc/
  12. libclc/
  13. libcxx/
  14. libcxxabi/
  15. libsycl/
  16. libunwind/
  17. lld/
  18. lldb/
  19. llvm/
  20. llvm-libgcc/
  21. mlir/
  22. offload/
  23. openmp/
  24. orc-rt/
  25. polly/
  26. runtimes/
  27. third-party/
  28. utils/
  29. .clang-format
  30. .clang-format-ignore
  31. .clang-tidy
  32. .git-blame-ignore-revs
  33. .gitattributes
  34. .gitignore
  35. .mailmap
  36. CODE_OF_CONDUCT.md
  37. CONTRIBUTING.md
  38. LICENSE.TXT
  39. pyproject.toml
  40. README.md
  41. SECURITY.md
README.md

The LLVM Compiler Infrastructure

OpenSSF Scorecard OpenSSF Best Practices libc++

Welcome to the LLVM project!

This repository contains the source code for LLVM, a toolkit for the construction of highly optimized compilers, optimizers, and run-time environments.

The LLVM project has multiple components. The core of the project is itself called “LLVM”. This contains all of the tools, libraries, and header files needed to process intermediate representations and convert them into object files. Tools include an assembler, disassembler, bitcode analyzer, and bitcode optimizer.

C-like languages use the Clang frontend. This component compiles C, C++, Objective-C, and Objective-C++ code into LLVM bitcode -- and from there into object files, using LLVM.

Other components include: the libc++ C++ standard library, the LLD linker, and more.

Getting the Source Code and Building LLVM

Consult the Getting Started with LLVM page for information on building and running LLVM.

For information on how to contribute to the LLVM project, please take a look at the Contributing to LLVM guide.

Getting in touch

Join the LLVM Discourse forums, Discord chat, LLVM Office Hours or Regular sync-ups.

The LLVM project has adopted a code of conduct for participants to all modes of communication within the project.