
### Summary of the change

This is a base PR for refactoring defrag. It moves the defrag logic to rely on the jemalloc [native api](https://github.com/jemalloc/jemalloc/pull/1463#issuecomment-479706489) instead of relying on custom code changes made by valkey in the jemalloc library ([`je_defrag_hint`](9f8185f5c8/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h (L382))). This enables valkey to use the latest vanilla jemalloc without the need to maintain code changes across jemalloc versions. The change requires some modifications because the new api provides only information, not a yes/no defrag decision; that logic now has to be implemented in valkey code. Additionally, the api does not provide, within a single call, all the information needed to make a decision; the missing pieces are available through an additional api call. To reduce the calls into jemalloc, in this PR the required information is collected during `computeDefragCycles` rather than for every single ptr, which avoids the additional api call. Follow-up work will utilize the newly opened options and will further improve the defrag decision and process.
### Added files:

`allocator_defrag.c` / `allocator_defrag.h` - These files implement the allocator-specific knowledge needed for making the defrag decision. The knowledge about slabs, allocation logic, and so on all goes into this file. This improves the separation between jemalloc-specific code and other possible implementations.

### Moved functions:

[`zmalloc_no_tcache`, `zfree_no_tcache`](4593dc2f05/src/zmalloc.c (L215)) - these encode very jemalloc-specific assumptions and are specific to how we defrag with jemalloc (a no-tcache sketch follows this section). This is also with the vision that, from a performance perspective, we should consider using tcache; we only need to make sure we don't recycle entries without going through the arena [for example: we can use a private tcache, one for free and one for alloc].

`frag_smallbins_bytes` - the logic and implementation moved to the new file.
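For context, a minimal sketch (not the PR's code) of what a no-tcache allocation path looks like with jemalloc's non-standard API; it assumes the `je_` symbol prefix valkey builds jemalloc with and the `MALLOCX_TCACHE_NONE` flag, and it omits the OOM handling and used-memory accounting that the real `zmalloc_no_tcache`/`zfree_no_tcache` perform:

```c
#include <stddef.h>
#include <jemalloc/jemalloc.h>

/* Bypass the thread cache so a free returns the region to the arena
 * immediately and the following alloc is served from a (possibly different)
 * slab, which is what the defrag copy-and-free step relies on. */
static void *defrag_alloc_no_tcache(size_t size) {
    return je_mallocx(size, MALLOCX_TCACHE_NONE);
}

static void defrag_free_no_tcache(void *ptr) {
    je_dallocx(ptr, MALLOCX_TCACHE_NONE);
}
```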
### Existing API:

* [once a second + when a full cycle completes] [`computeDefragCycles`](4593dc2f05/src/defrag.c (L916))
  * `zmalloc_get_allocator_info`: gets _allocated, active, resident, retained, muzzy_ and `frag_smallbins_bytes` from jemalloc
  * [`frag_smallbins_bytes`](4593dc2f05/src/zmalloc.c (L690)): for each bin, gets `curr_regs` and `cur_slabs` from the jemalloc `bin_info`
* [during defrag, for each pointer]
  * `je_defrag_hint` takes a memory pointer and returns {0,1}. [Internally it uses](4593dc2f05/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h (L368)) these information points:
    * #`nonfull_slabs`
    * #`total_slabs`
    * # of free regs in the ptr's slab

## Jemalloc API (via ctl interface)

[BATCH] [`experimental_utilization_batch_query_ctl`](4593dc2f05/deps/jemalloc/src/ctl.c (L4114)): gets an array of pointers and returns 3 values for each pointer (a usage sketch follows this list):

* number of free regions in the extent
* number of regions in the extent
* size of the extent in bytes

[EXTENDED] [`experimental_utilization_query_ctl`](4593dc2f05/deps/jemalloc/src/ctl.c (L3989)):

* memory address of the extent a potential reallocation would go into
* number of free regions in the extent
* number of regions in the extent
* size of the extent in bytes
* [stats-enabled] total number of free regions in the bin the extent belongs to
* [stats-enabled] total number of regions in the bin the extent belongs to
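To make the batch ctl concrete, here is a hedged sketch of calling it through `je_mallctl` (illustrative only; it assumes the `je_` symbol prefix valkey uses and the `experimental.utilization.batch_query` ctl name, while the PR itself goes through a cached MIB inside `allocator_defrag.c` rather than a per-call name lookup):

```c
#include <stdio.h>
#include <jemalloc/jemalloc.h>

#define BATCH 2

int main(void) {
    void *ptrs[BATCH] = {je_malloc(64), je_malloc(100)};
    /* Three output fields per pointer: free regions, total regions, extent size. */
    size_t out[BATCH * 3];
    size_t out_len = sizeof(out);

    if (je_mallctl("experimental.utilization.batch_query", out, &out_len, ptrs, sizeof(ptrs)) == 0) {
        for (int i = 0; i < BATCH; i++) {
            printf("ptr %d: %zu of %zu regions free, extent size %zu bytes\n",
                   i, out[3 * i], out[3 * i + 1], out[3 * i + 2]);
        }
    }
    je_free(ptrs[0]);
    je_free(ptrs[1]);
    return 0;
}
```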
### `experimental_utilization_batch_query_ctl` vs valkey `je_defrag_hint`?

[good]

- We can query pointers in a batch, reducing the overall overhead.
- The per-ptr decision algorithm is not part of the jemalloc api; jemalloc only provides information, so valkey can tune/configure/optimize it easily.

[bad]

- In the batch API we only know the utilization of the slab (of that memory ptr); we don't get the data about #`nonfull_slabs` and total allocated regs.

## New functions:

1. `defrag_jemalloc_init`: reduces the cost of calls to je_ctl by using the [MIB interface](https://jemalloc.net/jemalloc.3.html) to get faster calls (see the sketch after this list). Quoting the jemalloc documentation: "The mallctlnametomib() function provides a way to avoid repeated name lookups for applications that repeatedly query the same portion of the namespace, by translating a name to a 'Management Information Base' (MIB) that can be passed repeatedly to mallctlbymib()."
2. `jemalloc_sz2binind_lgq*`: this api supports a reverse map between a bin size and its info without a lookup. The mapping depends on the number of size classes we have, which is derived from [`lg_quantum`](4593dc2f05/deps/Makefile (L115)).
3. `defrag_jemalloc_get_frag_smallbins`: replaces `frag_smallbins_bytes`; the logic moved to the new file `allocator_defrag`.
4. `should_defrag`: implements the same logic as the existing implementation [inside](9f8185f5c8/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h (L382)) `je_defrag_hint`.
5. `defrag_jemalloc_should_defrag_multi`: implements the hint for an array of pointers, utilizing the new batch api (its `handle_results` helper unpacks the results). Currently only 1 pointer is passed.
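As referenced in item 1, here is a minimal sketch of the MIB pattern from the jemalloc man page (illustrative, not the PR's `defrag_jemalloc_init`): the name is resolved to a MIB once, and subsequent reads reuse the MIB. The `stats.arenas.0.bins.0.curregs` name is just an example of a per-bin statistic; the arena/bin indices inside a resolved MIB can be rewritten between calls, which is what keeps repeated per-size-class reads cheap.

```c
#include <stdio.h>
#include <jemalloc/jemalloc.h>

int main(void) {
    size_t mib[8];
    size_t miblen = sizeof(mib) / sizeof(mib[0]);

    /* One-time name -> MIB translation (done at init in this PR). */
    if (je_mallctlnametomib("stats.arenas.0.bins.0.curregs", mib, &miblen) != 0) return 1;

    /* Repeated fast-path reads via the cached MIB (e.g. once per defrag cycle). */
    for (int cycle = 0; cycle < 3; cycle++) {
        size_t curregs = 0;
        size_t len = sizeof(curregs);
        if (je_mallctlbymib(mib, miblen, &curregs, &len, NULL, 0) == 0) {
            printf("cycle %d: curregs=%zu\n", cycle, curregs);
        }
    }
    return 0;
}
```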
### Logical differences:

In order to get the information about #`nonfull_slabs` and #`regs`, we use the query cycle to collect the information per size class. In order to find the index of the bin information for a given bin size in O(1), we use `jemalloc_sz2binind_lgq*`.

## Testing

This is the first draft. I did some initial testing that basically creates fragmentation by reducing max memory and then waiting for defrag to reach the desired level. The test only serves as a sanity check that defrag eventually succeeds; no data is provided here regarding efficiency and performance.

### Test:

1. disable `activedefrag`
2. run valkey benchmark on overlapping address ranges with different block sizes
3. wait until `used_memory` reaches 10GB
4. set `maxmemory` to 5GB and `maxmemory-policy` to `allkeys-lru`
5. stop the load
6. wait for `mem_fragmentation_ratio` to reach 2
7. enable `activedefrag` - start the test timer
8. wait until `mem_fragmentation_ratio` reaches 1.1

#### Results*:

(With this PR) Test results: `56 sec`
(Without this PR) Test results: `67 sec`

*Both runs perform the same "work": the same number of buffers moved to reach the fragmentation target.

Next benchmarking is to compare against:

- DONE // existing `je_get_defrag_hint`
- compare with naive defrag-all: `int defrag_hint() {return 1;}`

---------

Signed-off-by: Zvi Schneider <ezvisch@amazon.com>
Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com>
Signed-off-by: zvi-code <54795925+zvi-code@users.noreply.github.com>
Co-authored-by: Zvi Schneider <ezvisch@amazon.com>
Co-authored-by: Zvi Schneider <zvi.schneider22@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
/* zmalloc - total amount of allocated memory aware version of malloc()
 *
 * Copyright (c) 2009-2010, Redis Ltd.
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are met:
 *
 *   * Redistributions of source code must retain the above copyright notice,
 *     this list of conditions and the following disclaimer.
 *   * Redistributions in binary form must reproduce the above copyright
 *     notice, this list of conditions and the following disclaimer in the
 *     documentation and/or other materials provided with the distribution.
 *   * Neither the name of Redis nor the names of its contributors may be used
 *     to endorse or promote products derived from this software without
 *     specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
 * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
 * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
 * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
 * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
 * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
 * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
 * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 * POSSIBILITY OF SUCH DAMAGE.
 */

#ifndef __ZMALLOC_H
#define __ZMALLOC_H

#include <stddef.h>

/* Double expansion needed for stringification of macro values. */
#define __xstr(s) __str(s)
#define __str(s) #s

#if defined(USE_TCMALLOC)
#define ZMALLOC_LIB ("tcmalloc-" __xstr(TC_VERSION_MAJOR) "." __xstr(TC_VERSION_MINOR))
#include <gperftools/tcmalloc.h>
#if (TC_VERSION_MAJOR == 1 && TC_VERSION_MINOR >= 6) || (TC_VERSION_MAJOR > 1)
#define HAVE_MALLOC_SIZE 1
#define zmalloc_size(p) tc_malloc_size(p)
#else
#error "Newer version of tcmalloc required"
#endif

#elif defined(USE_JEMALLOC)
#define ZMALLOC_LIB \
    ("jemalloc-" __xstr(JEMALLOC_VERSION_MAJOR) "." __xstr(JEMALLOC_VERSION_MINOR) "." __xstr(JEMALLOC_VERSION_BUGFIX))
#include <jemalloc/jemalloc.h>
#if (JEMALLOC_VERSION_MAJOR == 2 && JEMALLOC_VERSION_MINOR >= 1) || (JEMALLOC_VERSION_MAJOR > 2)
#define HAVE_MALLOC_SIZE 1
#define zmalloc_size(p) je_malloc_usable_size(p)
#else
#error "Newer version of jemalloc required"
#endif

#elif defined(__APPLE__)
#include <malloc/malloc.h>
#define HAVE_MALLOC_SIZE 1
#define zmalloc_size(p) malloc_size(p)
#endif

/* On native libc implementations, we should still do our best to provide a
 * HAVE_MALLOC_SIZE capability. This can be set explicitly as well:
 *
 * NO_MALLOC_USABLE_SIZE disables it on all platforms, even if they are
 * known to support it.
 * USE_MALLOC_USABLE_SIZE forces use of malloc_usable_size() regardless
 * of platform.
 */
#ifndef ZMALLOC_LIB
#define ZMALLOC_LIB "libc"
#define USE_LIBC 1

#if !defined(NO_MALLOC_USABLE_SIZE) && (defined(__GLIBC__) || defined(__FreeBSD__) || defined(__DragonFly__) || \
                                        defined(__HAIKU__) || defined(USE_MALLOC_USABLE_SIZE))

/* Includes for malloc_usable_size() */
#ifdef __FreeBSD__
#include <malloc_np.h>
#else
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <malloc.h>
#endif

#define HAVE_MALLOC_SIZE 1
#define zmalloc_size(p) malloc_usable_size(p)

#endif
#endif

/* Includes for malloc_trim(), see zlibc_trim(). */
#if defined(__GLIBC__) && !defined(USE_LIBC)
#include <malloc.h>
#endif

/* The zcalloc symbol is a symbol name already used by zlib, which is defining
 * other names using the "z" prefix specific to zlib. In practice, linking
 * valkey with a static openssl, which itself might depend on a static libz
 * will result in link time error rejecting multiple symbol definitions. */
#define zmalloc valkey_malloc
#define zcalloc valkey_calloc
#define zrealloc valkey_realloc
#define zfree valkey_free

/* 'noinline' attribute is intended to prevent the `-Wstringop-overread` warning
 * when using gcc-12 later with LTO enabled. It may be removed once the
 * bug[https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96503] is fixed. */
__attribute__((malloc, alloc_size(1), noinline)) void *zmalloc(size_t size);
__attribute__((malloc, alloc_size(1), noinline)) void *zcalloc(size_t size);
__attribute__((malloc, alloc_size(1, 2), noinline)) void *zcalloc_num(size_t num, size_t size);
__attribute__((alloc_size(2), noinline)) void *zrealloc(void *ptr, size_t size);
__attribute__((malloc, alloc_size(1), noinline)) void *ztrymalloc(size_t size);
__attribute__((malloc, alloc_size(1), noinline)) void *ztrycalloc(size_t size);
__attribute__((alloc_size(2), noinline)) void *ztryrealloc(void *ptr, size_t size);
void zfree(void *ptr);
void zfree_with_size(void *ptr, size_t size);
void *zmalloc_usable(size_t size, size_t *usable);
void *zcalloc_usable(size_t size, size_t *usable);
void *zrealloc_usable(void *ptr, size_t size, size_t *usable);
void *ztrymalloc_usable(size_t size, size_t *usable);
void *ztrycalloc_usable(size_t size, size_t *usable);
void *ztryrealloc_usable(void *ptr, size_t size, size_t *usable);
__attribute__((malloc)) char *zstrdup(const char *s);
size_t zmalloc_used_memory(void);
void zmalloc_set_oom_handler(void (*oom_handler)(size_t));
size_t zmalloc_get_rss(void);
int zmalloc_get_allocator_info(size_t *allocated, size_t *active, size_t *resident, size_t *retained, size_t *muzzy);
void set_jemalloc_bg_thread(int enable);
int jemalloc_purge(void);
size_t zmalloc_get_private_dirty(long pid);
size_t zmalloc_get_smap_bytes_by_field(char *field, long pid);
size_t zmalloc_get_memory_size(void);
void zlibc_free(void *ptr);
void zlibc_trim(void);
void zmadvise_dontneed(void *ptr);

#ifndef HAVE_MALLOC_SIZE
size_t zmalloc_size(void *ptr);
size_t zmalloc_usable_size(void *ptr);
#else
/* If we use 'zmalloc_usable_size()' to obtain additional available memory size
 * and manipulate it, we need to call 'extend_to_usable()' afterwards to ensure
 * the compiler recognizes this extra memory. However, if we use the pointer
 * obtained from z[*]_usable() family functions, there is no need for this step. */
#define zmalloc_usable_size(p) zmalloc_size(p)

/* derived from https://github.com/systemd/systemd/pull/25688
 * We use zmalloc_usable_size() everywhere to use memory blocks, but that is an abuse since the
 * malloc_usable_size() isn't meant for this kind of use, it is for diagnostics only. That is also why the
 * behavior is flaky when built with _FORTIFY_SOURCE, the compiler can sense that we reach outside
 * the allocated block and SIGABRT.
 * We use a dummy allocator function to tell the compiler that the new size of ptr is newsize.
 * The implementation returns the pointer as is; the only reason for its existence is as a conduit for the
 * alloc_size attribute. This cannot be a static inline because gcc then loses the attributes on the function.
 * See: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96503 */
__attribute__((alloc_size(2), noinline)) void *extend_to_usable(void *ptr, size_t size);
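/* Illustrative example of the pattern described above: code that grows into the
 * slack reported by zmalloc_usable_size() is expected to do roughly
 *
 *     size_t usable = zmalloc_usable_size(buf);
 *     buf = extend_to_usable(buf, usable);
 *     ... buf may now be treated as 'usable' bytes long ...
 *
 * while pointers obtained from the z[*]_usable() family need no such step. */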
#endif

int get_proc_stat_ll(int i, long long *res);

#endif /* __ZMALLOC_H */