This change handles the case where the application places the AS->MS
payload in group-shared memory (as MSFT's samples do). (TIL, thanks to
pow2clk, that the spec says the payload counts against the group-shared
total, which implies, without explicitly stating it, that on at least
some platforms the payload will be in group-shared anyway.)

The MS pass needs data from the AS about the AS's thread group
topology, and this is passed by extending the payload struct with three
uints. The struct can't be extended in place when the payload is
resident in group-shared, of course, because that would change the
layout of group-shared memory.

So the new approach here is to copy the payload into a new alloca (in
the default address space) whose struct type has the members of the
base struct plus the extended data the MS needs. The copy is done
piece-wise because llvm.memcpy isn't appropriate for copies from
group-shared to the default address space.
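
As a rough illustration only, here is the shape of the idea in plain
C++ rather than in terms of the actual LLVM IR the pass emits. The
struct names, the payload members, the asTopology field and the
extendPayload helper are all hypothetical and not taken from this
change; the real payload is whatever the application declares, and the
meaning of the three uints is assumed here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical application payload (stands in for the group-shared copy).
struct Payload {
    uint32_t meshletIndices[4];
    float    scale;
};

// Hypothetical extended struct: the base members first, then the three
// extra uints carrying AS thread-group data for the MS.
struct ExtendedPayload {
    uint32_t meshletIndices[4];
    float    scale;
    uint32_t asTopology[3];  // assumed contents, for illustration only
};

// Piece-wise copy into a fresh local object (the analogue of the new
// alloca), written as individual element assignments instead of one
// block memcpy. The original Payload is left untouched, so its layout
// does not change.
ExtendedPayload extendPayload(const Payload &src,
                              uint32_t t0, uint32_t t1, uint32_t t2) {
    ExtendedPayload dst{};
    for (int i = 0; i < 4; ++i)
        dst.meshletIndices[i] = src.meshletIndices[i];
    dst.scale = src.scale;
    dst.asTopology[0] = t0;
    dst.asTopology[1] = t1;
    dst.asTopology[2] = t2;
    return dst;
}
```

The point of the sketch is that the extra three uints only exist in the
local copy, so the size and layout of the original payload object stay
exactly as the application declared them.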