The internal HLL raw encoding used by PFCOUNT when merging multiple keys
is aligned to 8 bits (1 byte per register) so we can exploit this to
improve performances by processing multiple bytes per iteration.
In benchmarks the new code was several times faster with HLLs with many
registers set to zero, while no slowdown was observed with populated
HLLs.
The internal HLL raw encoding used by PFCOUNT when merging multiple keys
is aligned to 8 bits (1 byte per register) so we can exploit this to
improve performances by processing multiple bytes per iteration.
In benchmarks the new code was several times faster with HLLs with many
registers set to zero, while no slowdown was observed with populated
HLLs.
When the register is set to zero, we need to add 2^-0 to E, which is 1,
but it is faster to just add 'ez' at the end, which is the number of
registers set to zero, a value we need to compute anyway.
When the register is set to zero, we need to add 2^-0 to E, which is 1,
but it is faster to just add 'ez' at the end, which is the number of
registers set to zero, a value we need to compute anyway.
Given that the code was written with a 2 years pause... something
strange happened in the middle. So there was no function to free a
lex range min/max objects, and in some places the range was passed by
value.
Given that the code was written with a 2 years pause... something
strange happened in the middle. So there was no function to free a
lex range min/max objects, and in some places the range was passed by
value.
After running a few benchmarks, 3000 looks like a reasonable value to
keep HLLs with a few thousand elements small while the CPU cost is
still not huge.
This covers all the cases where the dense representation would use N
orders of magnitude more space, like in the case of many HLLs with
carinality of a few tens or hundreds.
It is not impossible that in the future this gets user configurable,
however it is easy to pick an unreasoable value just looking at savings
in the space dimension without checking what happens in the time
dimension.
After running a few benchmarks, 3000 looks like a reasonable value to
keep HLLs with a few thousand elements small while the CPU cost is
still not huge.
This covers all the cases where the dense representation would use N
orders of magnitude more space, like in the case of many HLLs with
carinality of a few tens or hundreds.
It is not impossible that in the future this gets user configurable,
however it is easy to pick an unreasoable value just looking at savings
in the space dimension without checking what happens in the time
dimension.