[libc][math] Implement atan2f correctly rounded to all rounding modes. (#86716)

We compute atan2f(y, x) in 2 stages:
- Fast step: perform computations in double precision , with relative
errors < 2^-50
- Accurate step: if the result from the Fast step fails Ziv's rounding
test, then we perform computations in double-double precision, with
relative errors < 2^-100.

On Ryzen 5900X, worst-case latency is ~ 200 clocks, compared to average
latency ~ 60 clocks, and average reciprocal throughput ~ 20 clocks.

GitOrigin-RevId: 2be722587f5987891ed8b2904a03f983e987f226
22 files changed