From 29016f3e2bedc3080e30ae03be35787752807eba Mon Sep 17 00:00:00 2001
From: Milo Yip
Date: Tue, 2 Feb 2016 22:49:33 +0800
Subject: [PATCH] Add notes about SIMD optimization issue. [skip ci] #499

---
 doc/internals.md | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/doc/internals.md b/doc/internals.md
index de482cb..40cf03e 100644
--- a/doc/internals.md
+++ b/doc/internals.md
@@ -199,6 +199,20 @@ To enable this optimization, need to define `RAPIDJSON_SSE2` or `RAPIDJSON_SSE42
 
 Note that, these are compile-time settings. Running the executable on a machine without such instruction set support will make it crash.
 
+### Page boundary issue
+
+In an early version of RapidJSON, [an issue](https://code.google.com/archive/p/rapidjson/issues/104) reported that `SkipWhitespace_SIMD()` crashes very rarely (around 1 in 500,000). After investigation, it is suspected that `_mm_loadu_si128()` accessed bytes after '\0', and across a protected page boundary.
+
+In [Intel® 64 and IA-32 Architectures Optimization Reference Manual
+](http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html), section 10.2.1:
+
+> To support algorithms requiring unaligned 128-bit SIMD memory accesses, memory buffer allocation by a caller function should consider adding some pad space so that a callee function can safely use the address pointer safely with unaligned 128-bit SIMD memory operations.
+> The minimal padding size should be the width of the SIMD register that might be used in conjunction with unaligned SIMD memory access.
+
+This is not feasible, as RapidJSON should not impose such a requirement on its callers.
+
+To fix this issue, the routine currently processes bytes up to the next aligned address. After that, it uses aligned reads to perform SIMD processing. Also see [#85](https://github.com/miloyip/rapidjson/issues/85).
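
The fix described above can be sketched roughly as follows. This is a minimal illustration, not RapidJSON's actual `SkipWhitespace_SIMD()`: the names `IsJsonWhitespace` and `SkipWhitespaceAligned` are hypothetical, and it assumes a null-terminated input and an SSE2-capable target (with a scalar fallback otherwise). The point is that a 16-byte aligned load never straddles a page boundary, so it stays inside the page that holds at least one valid byte of the string.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#endif

// Hypothetical helper: the four JSON whitespace characters.
inline bool IsJsonWhitespace(char c) {
    return c == ' ' || c == '\n' || c == '\r' || c == '\t';
}

// Hypothetical sketch of "scalar until aligned, then aligned SIMD".
// Returns a pointer to the first non-whitespace byte (possibly the '\0').
inline const char* SkipWhitespaceAligned(const char* p) {
    // Step 1: process bytes one by one up to the next 16-byte aligned address.
    const char* nextAligned = reinterpret_cast<const char*>(
        (reinterpret_cast<std::uintptr_t>(p) + 15) & ~static_cast<std::uintptr_t>(15));
    while (p != nextAligned) {
        if (!IsJsonWhitespace(*p))
            return p;  // also stops at the terminating '\0'
        ++p;
    }

#ifdef SKETCH_HAS_SSE2
    // Step 2: p is now 16-byte aligned, so aligned loads are valid here.
    const __m128i sp = _mm_set1_epi8(' ');
    const __m128i nl = _mm_set1_epi8('\n');
    const __m128i cr = _mm_set1_epi8('\r');
    const __m128i tb = _mm_set1_epi8('\t');
    for (;; p += 16) {
        const __m128i s = _mm_load_si128(reinterpret_cast<const __m128i*>(p));
        __m128i x = _mm_cmpeq_epi8(s, sp);
        x = _mm_or_si128(x, _mm_cmpeq_epi8(s, nl));
        x = _mm_or_si128(x, _mm_cmpeq_epi8(s, cr));
        x = _mm_or_si128(x, _mm_cmpeq_epi8(s, tb));
        // Bit i is set when byte i is NOT whitespace (this includes '\0').
        unsigned r = static_cast<unsigned>(~_mm_movemask_epi8(x)) & 0xFFFFu;
        if (r != 0) {
            int i = 0;
            while ((r & 1u) == 0) {  // locate the lowest set bit
                r >>= 1;
                ++i;
            }
            return p + i;
        }
    }
#else
    // Scalar fallback for targets without SSE2.
    while (IsJsonWhitespace(*p))
        ++p;
    return p;
#endif
}
```

Note that the aligned load may still read bytes past the terminator within the same 16-byte block; this avoids the page fault but can be reported by memory checkers, so the sketch is about hardware behavior rather than strict language-level guarantees.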
+
 ## Local Stream Copy {#LocalStreamCopy}
 
 During optimization, it is found that some compilers cannot localize some member data access of streams into local variables or registers. Experimental results show that for some stream types, making a copy of the stream and used it in inner-loop can improve performance. For example, the actual (non-SIMD) implementation of `SkipWhitespace()` is implemented as: