Free strings during BGSAVE/BGAOFRW to reduce copy-on-write (#905)

**Motivation**

Copy-on-write (COW) amplification refers to the issue where writing to a
small object causes the entire page to be cloned, resulting in
inefficient memory usage. This happens during the BGSAVE process and can
be particularly problematic on instances with limited memory.
If the BGSAVE process could release memory it no longer needs, it would
reduce overall memory consumption. To address this, the BGSAVE process
calls `madvise` to signal the operating system that a buffer can be
reclaimed. However, this approach does not work for buffers smaller than
a page (usually 4 KiB): even after multiple such calls that together
cover a full page, the operating system will not reclaim it, because
each individual call covers less than a page.
To solve this, we can call `zfree` directly instead. This lets the
allocator (jemalloc) handle the bookkeeping and release pages once the
buffers on them are no longer needed, which reduces copy-on-write
events.
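To illustrate the page-granularity limitation, here is a minimal standalone sketch (not Valkey's code; the `dismiss_buffer` helper and the buffer sizes are made up for illustration): `madvise(MADV_DONTNEED)` only operates on whole pages, so a sub-page buffer is never released, while freeing through the allocator lets it merge freed buffers into reclaimable pages.

```c
/* Minimal sketch, not Valkey code: why madvise() cannot release a
 * buffer smaller than a page, while an allocator-level free can. */
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

static void dismiss_buffer(void *ptr, size_t size) {
    uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE); /* usually 4 KiB */
    /* madvise() works on whole pages: round the start up and the end
     * down to page boundaries. */
    uintptr_t start = ((uintptr_t)ptr + page - 1) & ~(page - 1);
    uintptr_t end = ((uintptr_t)ptr + size) & ~(page - 1);
    /* For a sub-page buffer end <= start, so nothing is released,
     * no matter how many such buffers we dismiss. */
    if (end > start) madvise((void *)start, end - start, MADV_DONTNEED);
}

int main(void) {
    char *small = malloc(512);
    dismiss_buffer(small, 512); /* effectively a no-op: less than a page */
    /* Freeing via the allocator instead lets it do the bookkeeping and
     * return whole pages once enough neighboring buffers are freed. */
    free(small);
    return 0;
}
```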

**Benchmarks**

To understand how using `zfree` affects BGSAVE and memory consumption,
I ran 45 benchmarks comparing my clone with the vanilla version. Each
benchmark has the following steps:
1. Start a new Valkey process
2. Fill the DB with data sequentially
3. Run a warmup to randomize the memory layout
4. Introduce fragmentation by deleting part of the keys
5. In parallel:
    1. Trigger BGSAVE
    2. Start 80/20 get/set load

I varied the following parameters to understand their influence:

1. Number of keys: 3M, 6M, and 12M.
2. Data size. While the keys themselves have a fixed length of ~30 bytes,
the value size is 120, 250, 500, 1000, or 2000 bytes.
3. Fragmentation. I delete 5%, 10%, or 15% of the original key range.

I'm attaching graphs of the BGSAVE process's memory consumption. Instead
of all benchmarks, I show the runs I consider most representative.

<img width="1570" alt="3m-fixed"
src="https://github.com/user-attachments/assets/3dbbc528-01c1-4821-a3c2-6be455e7f78a">


For 2000-byte values, peak memory usage is ~53% of vanilla's; the peak
happens at 57% BGSAVE progress.
For 500-byte values, the peak is ~80% of vanilla's and happens at ~80%
progress.
For 120-byte values, the difference is under 5%, and the patched version
can even use more memory.



![500b-fixed](https://github.com/user-attachments/assets/b09451d3-4bce-4f33-b3db-2b5df2178ed2)


For 12M keys, the peak is ~85% of vanilla’s and happens at the ~70% mark.
For 6M keys, the peak is ~87% of vanilla’s and happens at the ~77% mark.
For 3M keys, the peak is ~87% of vanilla’s and happens at the ~80% mark.

**Changes**

The PR contains two changes:
1. Static buffer for RDB compression.
RDB compression leads to COW events even without any write load if we
use `zfree`. This happens because the compression function allocates a
new buffer for each object; combined with freeing objects via `zfree`,
it makes the child reuse (and write to) memory that is shared with the
main process. To deal with this, we use a pre-allocated, constant 8K
buffer for compression (see the `rdbSaveLzfStringObject` diff below). If
the object is too big for this buffer, we fall back to the ad hoc
allocation behavior.

2. Freeing string objects instead of dismissing them.
A call to `zfree` is more expensive than a direct call to `madvise`. But
with #453, strings use the fast path, `zfree_with_size`, which avoids
the allocation-size lookup because the size is already known (a sketch
of this idea follows the list). As a possible next step, we can optimize
freeing for other data types as well.
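To make the `zfree_with_size` point concrete, here is a minimal sketch of a sized free (my own illustration, not Valkey's implementation; `sized_free` and the `USE_JEMALLOC` guard are assumptions): when the caller already knows the allocation size, jemalloc's `sdallocx()` can skip the size lookup that a plain `free()` has to perform.

```c
/* Sketch only: free a buffer whose size the caller already knows. */
#include <stdlib.h>
#ifdef USE_JEMALLOC
#include <jemalloc/jemalloc.h>
#endif

static void sized_free(void *ptr, size_t size) {
#ifdef USE_JEMALLOC
    sdallocx(ptr, size, 0); /* sized deallocation: no metadata lookup */
#else
    (void)size;             /* without jemalloc, fall back to free() */
    free(ptr);
#endif
}
```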

---------

Signed-off-by: Vadym Khoptynets <vadymkh@amazon.com>
Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

src/object.c

@@ -398,9 +398,14 @@ void decrRefCount(robj *o) {
     }
 }
 
-/* See dismissObject() */
+/* See dismissObject(). sds is an exception, because the allocation
+ * size is known. Instead of dismissing it with madvise(MADV_DONTNEED)
+ * we free it via the allocator, which has minimal overhead when the
+ * size is known. This has advantage that it allows the allocator to
+ * accumulate free buffers to free whole pages, while madvise is nop
+ * if the buffer is less than a page. */
 void dismissSds(sds s) {
-    dismissMemory(sdsAllocPtr(s), sdsAllocSize(s));
+    sdsfree(s);
 }
 
 /* See dismissObject() */

src/rdb.c

@@ -49,6 +49,9 @@
 #include <sys/stat.h>
 #include <sys/param.h>
 
+/* Size of the static buffer used for rdbcompression */
+#define LZF_STATIC_BUFFER_SIZE (8 * 1024)
+
 /* This macro is called when the internal RDB structure is corrupt */
 #define rdbReportCorruptRDB(...) rdbReportError(1, __LINE__, __VA_ARGS__)
 /* This macro is called when RDB read failed (possibly a short read) */
@@ -388,18 +391,20 @@ writeerr:
 ssize_t rdbSaveLzfStringObject(rio *rdb, unsigned char *s, size_t len) {
     size_t comprlen, outlen;
     void *out;
+    static void *buffer = NULL;
 
     /* We require at least four bytes compression for this to be worth it */
     if (len <= 4) return 0;
     outlen = len - 4;
-    if ((out = zmalloc(outlen + 1)) == NULL) return 0;
-    comprlen = lzf_compress(s, len, out, outlen);
-    if (comprlen == 0) {
-        zfree(out);
-        return 0;
+
+    if (outlen < LZF_STATIC_BUFFER_SIZE) {
+        if (!buffer) buffer = zmalloc(LZF_STATIC_BUFFER_SIZE);
+        out = buffer;
+    } else {
+        if ((out = zmalloc(outlen + 1)) == NULL) return 0;
     }
-    ssize_t nwritten = rdbSaveLzfBlob(rdb, out, comprlen, len);
-    zfree(out);
+
+    comprlen = lzf_compress(s, len, out, outlen);
+    ssize_t nwritten = comprlen ? rdbSaveLzfBlob(rdb, out, comprlen, len) : 0;
+    if (out != buffer) zfree(out);
     return nwritten;
 }