
The code goes to the trouble of ensuring that the data is aligned on a 16-byte boundary, then goes ahead and uses the unaligned form of the load intrinsic, _mm_loadu_si128. Either the code shouldn't bother aligning the data to the start of the whitespace, or it should use the aligned form of the intrinsic, _mm_load_si128.
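
To make the second option concrete, here is a minimal sketch of what pairing the alignment step with the aligned load might look like. It is not the code under review: the function name count_leading_spaces, its (buf, len) signature, and the choice to match only the ' ' character are assumptions made for illustration.

```c
#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_cmpeq_epi8, ... */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the reviewed code): returns the number of
 * leading ' ' bytes in buf[0..len). */
static size_t count_leading_spaces(const char *buf, size_t len)
{
    size_t i = 0;

    /* Scalar prologue: step forward until buf + i is 16-byte aligned. */
    while (i < len && ((uintptr_t)(buf + i) & 15) != 0) {
        if (buf[i] != ' ')
            return i;
        ++i;
    }

    const __m128i spaces = _mm_set1_epi8(' ');

    /* Main loop: the pointer is now aligned, so the aligned load is valid.
     * Only full 16-byte chunks inside the buffer are read. */
    while (len - i >= 16) {
        __m128i chunk = _mm_load_si128((const __m128i *)(buf + i));
        unsigned mask =
            (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, spaces));
        if (mask != 0xFFFFu) {
            /* At least one byte is not a space: locate the first one. */
            unsigned not_space = ~mask & 0xFFFFu;
            size_t k = 0;
            while ((not_space & 1u) == 0) {
                not_space >>= 1;
                ++k;
            }
            return i + k;
        }
        i += 16;
    }

    /* Scalar tail: fewer than 16 bytes remain. */
    while (i < len && buf[i] == ' ')
        ++i;
    return i;
}
```

If the prologue were dropped instead, the main loop would have to keep _mm_loadu_si128, since buf + i would no longer be guaranteed to sit on a 16-byte boundary.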