[CUDA] Fixed sm version constraint for __bmma_m8n8k128_mma_and_popc_b1.

As stated in
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-wmma-mma:
".and operation in single-bit wmma requires sm_80 or higher."

tra@: Fixed a bug in builtins-nvptx-mma.py test generator and regenerated the tests.

Differential Revision: https://reviews.llvm.org/D131265

GitOrigin-RevId: 3e0e5568a6a8c744d26f79a1e55360fe2655867c
3 files changed