[x86] split more v8f32/v8i32 shuffles in lowering

Similar to D57867 - this is a small patch with lots of test diffs.
With half-vector-width narrowing potential, using an extract + 128-bit vshufps
is a win because it replaces a 256-bit shuffle with a 128-bit shufle.

This seems like it should be a win even for targets with 'fast-variable-shuffle',
but we are intentionally deferring that to an independent change to make sure
that is true.

Differential Revision: https://reviews.llvm.org/D58181

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@354279 91177308-0d34-0410-b5e6-96231b3b80d8
20 files changed