[test-suite, CUDA] Run test kernel with just one thread.

This test is flaky on sm_60+ for reasons not yet understood, and the failures
are unrelated to what the test checks. Launching the kernel with a single
thread shrinks the grid, which should reduce the failure rate.
1 file changed