Free strings during BGSAVE/BGAOFRW to reduce copy-on-write (#905)

**Motivation**

Copy-on-write (COW) amplification refers to the issue where writing to a
small object causes the entire page to be cloned, resulting in
inefficient memory usage. This happens during the BGSAVE process and can
be particularly problematic on instances with limited memory.
If the BGSAVE process could release memory it no longer needs, it would
reduce overall memory consumption. To address this, the BGSAVE process
calls `madvise` to signal the operating system that a buffer can be
reclaimed. However, this approach does not work for buffers smaller than
a page (usually 4 KiB): even after multiple such calls that together
cover a full page, the operating system will not reclaim it, because
each individual call covers less than a page.
To solve this, we can call `zfree` directly instead. This lets the
allocator (jemalloc) handle the bookkeeping and release pages once the
buffers on them are no longer needed, which reduces copy-on-write
events.
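To illustrate the page-granularity limitation, here is a minimal standalone sketch (not Valkey's code; the `dismiss_buffer` helper and the buffer sizes are made up for illustration): `madvise(MADV_DONTNEED)` only operates on whole pages, so a sub-page buffer is never released, while freeing through the allocator lets it merge freed buffers into reclaimable pages.

```c
/* Minimal sketch, not Valkey code: why madvise() cannot release a
 * buffer smaller than a page, while an allocator-level free can. */
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

static void dismiss_buffer(void *ptr, size_t size) {
    uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE); /* usually 4 KiB */
    /* madvise() works on whole pages: round the start up and the end
     * down to page boundaries. */
    uintptr_t start = ((uintptr_t)ptr + page - 1) & ~(page - 1);
    uintptr_t end = ((uintptr_t)ptr + size) & ~(page - 1);
    /* For a sub-page buffer end <= start, so nothing is released,
     * no matter how many such buffers we dismiss. */
    if (end > start) madvise((void *)start, end - start, MADV_DONTNEED);
}

int main(void) {
    char *small = malloc(512);
    dismiss_buffer(small, 512); /* effectively a no-op: less than a page */
    /* Freeing via the allocator instead lets it do the bookkeeping and
     * return whole pages once enough neighboring buffers are freed. */
    free(small);
    return 0;
}
```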

**Benchmarks**

To understand how using `zfree` affects BGSAVE and memory consumption,
I ran 45 benchmarks comparing my clone with the vanilla version. Each
benchmark has the following steps:
1. Start a new Valkey process
2. Fill the DB with data sequentially
3. Run a warmup to randomize the memory layout
4. Introduce fragmentation by deleting part of the keys
5. In parallel:
    1. Trigger BGSAVE
    2. Start 80/20 get/set load

I varied the following parameters to understand their influence:

1. Number of keys: 3M, 6M, and 12M.
2. Data size. While the keys themselves have a fixed length of ~30 bytes,
the value size is 120, 250, 500, 1000, or 2000 bytes.
3. Fragmentation. I delete 5%, 10%, or 15% of the original key range.

I'm attaching graphs of the BGSAVE process's memory consumption. Instead
of all benchmarks, I show the runs I consider most representative.

<img width="1570" alt="3m-fixed"
src="https://github.com/user-attachments/assets/3dbbc528-01c1-4821-a3c2-6be455e7f78a">


For 2000-byte values, peak memory usage is ~53% of vanilla's; the peak
happens at 57% BGSAVE progress.
For 500-byte values, the peak is ~80% of vanilla's and happens at ~80%
progress.
For 120-byte values, the difference is under 5%, and the patched version
can even use more memory.



![500b-fixed](https://github.com/user-attachments/assets/b09451d3-4bce-4f33-b3db-2b5df2178ed2)


For 12M keys, the peak is ~85% of vanilla’s and happens at the ~70% mark.
For 6M keys, the peak is ~87% of vanilla’s and happens at the ~77% mark.
For 3M keys, the peak is ~87% of vanilla’s and happens at the ~80% mark.

**Changes**

The PR contains two changes:
1. Static buffer for RDB compression.
RDB compression leads to COW events even without any write load if we
use `zfree`. This happens because the compression function allocates a
new buffer for each object; combined with freeing objects via `zfree`,
it makes the child reuse (and write to) memory that is shared with the
main process. To deal with this, we use a pre-allocated, constant 8K
buffer for compression (see the `rdbSaveLzfStringObject` diff below). If
the object is too big for this buffer, we fall back to the ad hoc
allocation behavior.

2. Freeing string objects instead of dismissing them.
A call to `zfree` is more expensive than a direct call to `madvise`. But
with #453, strings use the fast path, `zfree_with_size`, which avoids
the allocation-size lookup because the size is already known (a sketch
of this idea follows the list). As a possible next step, we can optimize
freeing for other data types as well.
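To make the `zfree_with_size` point concrete, here is a minimal sketch of a sized free (my own illustration, not Valkey's implementation; `sized_free` and the `USE_JEMALLOC` guard are assumptions): when the caller already knows the allocation size, jemalloc's `sdallocx()` can skip the size lookup that a plain `free()` has to perform.

```c
/* Sketch only: free a buffer whose size the caller already knows. */
#include <stdlib.h>
#ifdef USE_JEMALLOC
#include <jemalloc/jemalloc.h>
#endif

static void sized_free(void *ptr, size_t size) {
#ifdef USE_JEMALLOC
    sdallocx(ptr, size, 0); /* sized deallocation: no metadata lookup */
#else
    (void)size;             /* without jemalloc, fall back to free() */
    free(ptr);
#endif
}
```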

---------

Signed-off-by: Vadym Khoptynets <vadymkh@amazon.com>
Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

src/object.c

@@ -398,9 +398,14 @@ void decrRefCount(robj *o) {
     }
 }
 
-/* See dismissObject() */
+/* See dismissObject(). sds is an exception, because the allocation
+ * size is known. Instead of dismissing it with madvise(MADV_DONTNEED)
+ * we free it via the allocator, which has minimal overhead when the
+ * size is known. This has advantage that it allows the allocator to
+ * accumulate free buffers to free whole pages, while madvise is nop
+ * if the buffer is less than a page. */
 void dismissSds(sds s) {
-    dismissMemory(sdsAllocPtr(s), sdsAllocSize(s));
+    sdsfree(s);
 }
 
 /* See dismissObject() */

src/rdb.c

@@ -49,6 +49,9 @@
 #include <sys/stat.h>
 #include <sys/param.h>
 
+/* Size of the static buffer used for rdbcompression */
+#define LZF_STATIC_BUFFER_SIZE (8 * 1024)
+
 /* This macro is called when the internal RDB structure is corrupt */
 #define rdbReportCorruptRDB(...) rdbReportError(1, __LINE__, __VA_ARGS__)
 /* This macro is called when RDB read failed (possibly a short read) */
@@ -388,18 +391,20 @@ writeerr:
 ssize_t rdbSaveLzfStringObject(rio *rdb, unsigned char *s, size_t len) {
     size_t comprlen, outlen;
     void *out;
+    static void *buffer = NULL;
 
     /* We require at least four bytes compression for this to be worth it */
     if (len <= 4) return 0;
     outlen = len - 4;
-    if ((out = zmalloc(outlen + 1)) == NULL) return 0;
-    comprlen = lzf_compress(s, len, out, outlen);
-    if (comprlen == 0) {
-        zfree(out);
-        return 0;
+
+    if (outlen < LZF_STATIC_BUFFER_SIZE) {
+        if (!buffer) buffer = zmalloc(LZF_STATIC_BUFFER_SIZE);
+        out = buffer;
+    } else {
+        if ((out = zmalloc(outlen + 1)) == NULL) return 0;
     }
-    ssize_t nwritten = rdbSaveLzfBlob(rdb, out, comprlen, len);
-    zfree(out);
+
+    comprlen = lzf_compress(s, len, out, outlen);
+    ssize_t nwritten = comprlen ? rdbSaveLzfBlob(rdb, out, comprlen, len) : 0;
+    if (out != buffer) zfree(out);
     return nwritten;
 }