Remove valkey specific changes in jemalloc source code (#1266)

### Summary of the change

This is a base PR for refactoring defrag. It moves the defrag logic to
rely on the jemalloc [native
api](https://github.com/jemalloc/jemalloc/pull/1463#issuecomment-479706489)
instead of relying on the custom changes valkey made to the jemalloc
library
([je_defrag_hint](9f8185f5c8/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h (L382))).
This enables valkey to use the latest vanilla jemalloc without the
need to maintain custom changes across jemalloc versions.

This change requires some modifications because the new api provides
only the information, not a yes/no defrag decision; that logic now needs
to be implemented in valkey code. Additionally, the api does not
provide, within a single call, all the information needed to make a
decision; that information is available through an additional api call.
To reduce the calls to jemalloc, this PR collects the required
information during `computeDefragCycles` rather than for every single
ptr, avoiding the additional api call.
Followup work will utilize the options that are now open and will
further improve the defrag decision and process.

### Added files: 

`allocator_defrag.c` / `allocator_defrag.h` - These files implement the
allocator-specific knowledge for making defrag decisions. The knowledge
about slabs, allocation logic, and so on all goes into these files.
This improves the separation between jemalloc-specific code and other
possible implementations.


### Moved functions: 

[`zmalloc_no_tcache`, `zfree_no_tcache`](4593dc2f05/src/zmalloc.c (L215))
- these embody very jemalloc-specific assumptions and are specific to
how we defrag with jemalloc. This also fits the vision that, from a
performance perspective, we should consider using tcache; we only need
to make sure we don't recycle entries without going through the arena
[for example: we can use a private tcache, one for free and one for
alloc].
`frag_smallbins_bytes` - the logic and implementation moved to the new
file.

### Existing API:

* [once a second + when a full cycle completes]
[`computeDefragCycles`](4593dc2f05/src/defrag.c (L916))
    * `zmalloc_get_allocator_info`: gets _allocated, active, resident, retained, muzzy_ and `frag_smallbins_bytes` from jemalloc
    * [`frag_smallbins_bytes`](4593dc2f05/src/zmalloc.c (L690)): for each bin, gets bin_info, `curr_regs`, `cur_slabs` from jemalloc
* [during defrag, for each pointer]
    * `je_defrag_hint` takes a memory pointer and returns 0 or 1.
[Internally it uses](4593dc2f05/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h (L368))
the following data points:
        * number of `nonfull_slabs`
        * number of `total_slabs`
        * number of free regs in the ptr's slab

## Jemalloc API (via ctl interface)


[BATCH] [`experimental_utilization_batch_query_ctl`](4593dc2f05/deps/jemalloc/src/ctl.c (L4114)): gets an array of pointers and returns, for each pointer, 3 values:

* number of free regions in the extent
* number of regions in the extent
* size of the extent in terms of bytes


[EXTENDED] [`experimental_utilization_query_ctl`](4593dc2f05/deps/jemalloc/src/ctl.c (L3989)): gets a single pointer and returns:

* memory address of the extent a potential reallocation would go into
* number of free regions in the extent
* number of regions in the extent
* size of the extent in terms of bytes
* [stats-enabled] total number of free regions in the bin the extent belongs to
* [stats-enabled] total number of regions in the bin the extent belongs to

### `experimental_utilization_batch_query_ctl` vs valkey
`je_defrag_hint`?

[good]
- We can query pointers in a batch, reducing the overall overhead
- The per-ptr decision algorithm is not part of the jemalloc api; jemalloc
only provides information, so valkey can tune/configure/optimize it easily

[bad]
- In the batch API we only know the utilization of the slab (of that
memory ptr); we don't get the data about the number of `nonfull_slabs` and
total allocated regs.


## New functions:
1. `defrag_jemalloc_init`: Reduces the cost of calls to je_ctl by using the
[MIB interface](https://jemalloc.net/jemalloc.3.html) for faster calls.
See this quote from the jemalloc documentation:

    The mallctlnametomib() function provides a way to avoid repeated name
    lookups for applications that repeatedly query the same portion of the
    namespace, by translating a name to a “Management Information Base” (MIB)
    that can be passed repeatedly to mallctlbymib().

2. `jemalloc_sz2binind_lgq*`: supports a reverse map between a bin size and
its info without a lookup. This mapping depends on the number of size
classes, which is derived from
[`lg_quantum`](4593dc2f05/deps/Makefile (L115))
3. `defrag_jemalloc_get_frag_smallbins`: replaces `frag_smallbins_bytes`;
the logic moved to the new allocator_defrag file.
`defrag_jemalloc_should_defrag_multi` → `handle_results` unpacks the
results.
4. `should_defrag`: implements the same logic as the existing
implementation
[inside](9f8185f5c8/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h (L382))
je_defrag_hint.
5. `defrag_jemalloc_should_defrag_multi`: implements the hint for an
array of pointers, utilizing the new batch api. Currently only 1 pointer
is passed.


### Logical differences:

In order to get the information about the number of `nonfull_slabs` and
`regs`, we use the query cycle to collect the information per size class.
To find the index of bin information for a given bin size in O(1), we use
`jemalloc_sz2binind_lgq*`.


## Testing
This is the first draft. I did some initial testing that basically creates
fragmentation by reducing max memory and then waiting for defrag to
reach the desired level. The test only serves as a sanity check that
defrag eventually succeeds; no data is provided here regarding efficiency
and performance.

### Test: 
1. disable `activedefrag`
2. run valkey benchmark on overlapping address ranges with different
block sizes
3. wait until `used_memory` reaches 10GB
4. set `maxmemory` to 5GB and `maxmemory-policy` to `allkeys-lru`
5. stop load
6. wait for `mem_fragmentation_ratio` to reach 2
7. enable `activedefrag` - start test timer
8. wait until `mem_fragmentation_ratio` reaches 1.1

#### Results*:
(With this PR) Test result: `56 sec`
(Without this PR) Test result: `67 sec`

*Both runs perform the same "work": the same number of buffers moved to
reach the fragmentation target

Next benchmarking is to compare against:
- DONE // the existing `je_get_defrag_hint`
- a naive defrag-all hint: `int defrag_hint() {return 1;}`

---------

Signed-off-by: Zvi Schneider <ezvisch@amazon.com>
Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com>
Signed-off-by: zvi-code <54795925+zvi-code@users.noreply.github.com>
Co-authored-by: Zvi Schneider <ezvisch@amazon.com>
Co-authored-by: Zvi Schneider <zvi.schneider22@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Committed by zvi-code on 2024-11-22 02:29:21 +02:00 via GitHub
parent b486a41500
commit b56eed2479
GPG Key ID: B5690EEEBB952194
12 changed files with 466 additions and 171 deletions


@@ -74,6 +74,7 @@ set(VALKEY_SERVER_SRCS
${CMAKE_SOURCE_DIR}/src/geohash.c
${CMAKE_SOURCE_DIR}/src/geohash_helper.c
${CMAKE_SOURCE_DIR}/src/childinfo.c
${CMAKE_SOURCE_DIR}/src/allocator_defrag.c
${CMAKE_SOURCE_DIR}/src/defrag.c
${CMAKE_SOURCE_DIR}/src/siphash.c
${CMAKE_SOURCE_DIR}/src/rax.c


@@ -337,55 +337,4 @@ imalloc_fastpath(size_t size, void *(fallback_alloc)(size_t)) {
return fallback_alloc(size);
}
JEMALLOC_ALWAYS_INLINE int
iget_defrag_hint(tsdn_t *tsdn, void* ptr) {
int defrag = 0;
emap_alloc_ctx_t alloc_ctx;
emap_alloc_ctx_lookup(tsdn, &arena_emap_global, ptr, &alloc_ctx);
if (likely(alloc_ctx.slab)) {
/* Small allocation. */
edata_t *slab = emap_edata_lookup(tsdn, &arena_emap_global, ptr);
arena_t *arena = arena_get_from_edata(slab);
szind_t binind = edata_szind_get(slab);
unsigned binshard = edata_binshard_get(slab);
bin_t *bin = arena_get_bin(arena, binind, binshard);
malloc_mutex_lock(tsdn, &bin->lock);
arena_dalloc_bin_locked_info_t info;
arena_dalloc_bin_locked_begin(&info, binind);
/* Don't bother moving allocations from the slab currently used for new allocations */
if (slab != bin->slabcur) {
int free_in_slab = edata_nfree_get(slab);
if (free_in_slab) {
const bin_info_t *bin_info = &bin_infos[binind];
/* Find number of non-full slabs and the number of regs in them */
unsigned long curslabs = 0;
size_t curregs = 0;
/* Run on all bin shards (usually just one) */
for (uint32_t i=0; i< bin_info->n_shards; i++) {
bin_t *bb = arena_get_bin(arena, binind, i);
curslabs += bb->stats.nonfull_slabs;
/* Deduct the regs in full slabs (they're not part of the game) */
unsigned long full_slabs = bb->stats.curslabs - bb->stats.nonfull_slabs;
curregs += bb->stats.curregs - full_slabs * bin_info->nregs;
if (bb->slabcur) {
/* Remove slabcur from the overall utilization (not a candidate to move from) */
curregs -= bin_info->nregs - edata_nfree_get(bb->slabcur);
curslabs -= 1;
}
}
/* Compare the utilization ratio of the slab in question to the total average
* among non-full slabs. To avoid precision loss in division, we do that by
* extrapolating the usage of the slab as if all slabs have the same usage.
* If this slab is less used than the average, we'll prefer to move the data
* to hopefully more used ones. To avoid stagnation when all slabs have the same
* utilization, we give additional 12.5% weight to the decision to defrag. */
defrag = (bin_info->nregs - free_in_slab) * curslabs <= curregs + curregs / 8;
}
}
arena_dalloc_bin_locked_finish(tsdn, arena, bin, &info);
malloc_mutex_unlock(tsdn, &bin->lock);
}
return defrag;
}
#endif /* JEMALLOC_INTERNAL_INLINES_C_H */


@@ -147,7 +147,3 @@
#else
# define JEMALLOC_SYS_NOTHROW JEMALLOC_NOTHROW
#endif
/* This version of Jemalloc, modified for Redis, has the je_get_defrag_hint()
* function. */
#define JEMALLOC_FRAG_HINT


@@ -4474,12 +4474,3 @@ jemalloc_postfork_child(void) {
}
/******************************************************************************/
/* Helps the application decide if a pointer is worth re-allocating in order to reduce fragmentation.
* returns 1 if the allocation should be moved, and 0 if the allocation be kept.
* If the application decides to re-allocate it should use MALLOCX_TCACHE_NONE when doing so. */
JEMALLOC_EXPORT int JEMALLOC_NOTHROW
get_defrag_hint(void* ptr) {
assert(ptr != NULL);
return iget_defrag_hint(TSDN_NULL, ptr);
}


@@ -411,7 +411,7 @@ endif
ENGINE_NAME=valkey
SERVER_NAME=$(ENGINE_NAME)-server$(PROG_SUFFIX)
ENGINE_SENTINEL_NAME=$(ENGINE_NAME)-sentinel$(PROG_SUFFIX)
ENGINE_SERVER_OBJ=threads_mngr.o adlist.o quicklist.o ae.o anet.o dict.o kvstore.o server.o sds.o zmalloc.o lzf_c.o lzf_d.o pqsort.o zipmap.o sha1.o ziplist.o release.o memory_prefetch.o io_threads.o networking.o util.o object.o db.o replication.o rdb.o t_string.o t_list.o t_set.o t_zset.o t_hash.o config.o aof.o pubsub.o multi.o debug.o sort.o intset.o syncio.o cluster.o cluster_legacy.o cluster_slot_stats.o crc16.o endianconv.o slowlog.o eval.o bio.o rio.o rand.o memtest.o syscheck.o crcspeed.o crccombine.o crc64.o bitops.o sentinel.o notify.o setproctitle.o blocked.o hyperloglog.o latency.o sparkline.o valkey-check-rdb.o valkey-check-aof.o geo.o lazyfree.o module.o evict.o expire.o geohash.o geohash_helper.o childinfo.o defrag.o siphash.o rax.o t_stream.o listpack.o localtime.o lolwut.o lolwut5.o lolwut6.o acl.o tracking.o socket.o tls.o sha256.o timeout.o setcpuaffinity.o monotonic.o mt19937-64.o resp_parser.o call_reply.o script_lua.o script.o functions.o function_lua.o commands.o strl.o connection.o unix.o logreqres.o
ENGINE_SERVER_OBJ=threads_mngr.o adlist.o quicklist.o ae.o anet.o dict.o kvstore.o server.o sds.o zmalloc.o lzf_c.o lzf_d.o pqsort.o zipmap.o sha1.o ziplist.o release.o memory_prefetch.o io_threads.o networking.o util.o object.o db.o replication.o rdb.o t_string.o t_list.o t_set.o t_zset.o t_hash.o config.o aof.o pubsub.o multi.o debug.o sort.o intset.o syncio.o cluster.o cluster_legacy.o cluster_slot_stats.o crc16.o endianconv.o slowlog.o eval.o bio.o rio.o rand.o memtest.o syscheck.o crcspeed.o crccombine.o crc64.o bitops.o sentinel.o notify.o setproctitle.o blocked.o hyperloglog.o latency.o sparkline.o valkey-check-rdb.o valkey-check-aof.o geo.o lazyfree.o module.o evict.o expire.o geohash.o geohash_helper.o childinfo.o allocator_defrag.o defrag.o siphash.o rax.o t_stream.o listpack.o localtime.o lolwut.o lolwut5.o lolwut6.o acl.o tracking.o socket.o tls.o sha256.o timeout.o setcpuaffinity.o monotonic.o mt19937-64.o resp_parser.o call_reply.o script_lua.o script.o functions.o function_lua.o commands.o strl.o connection.o unix.o logreqres.o
ENGINE_CLI_NAME=$(ENGINE_NAME)-cli$(PROG_SUFFIX)
ENGINE_CLI_OBJ=anet.o adlist.o dict.o valkey-cli.o zmalloc.o release.o ae.o serverassert.o crcspeed.o crccombine.o crc64.o siphash.o crc16.o monotonic.o cli_common.o mt19937-64.o strl.o cli_commands.o
ENGINE_BENCHMARK_NAME=$(ENGINE_NAME)-benchmark$(PROG_SUFFIX)

src/allocator_defrag.c (new file, 426 lines)

@@ -0,0 +1,426 @@
/* Copyright 2024- Valkey contributors
* All rights reserved.
* SPDX-License-Identifier: BSD-3-Clause
*/
/*
* This file implements allocator-specific defragmentation logic used
* within the Valkey engine. Below is the relationship between various
* components involved in allocation and defragmentation:
*
* Application code
* / \
* allocation / \ defrag
* / \
* zmalloc allocator_defrag
* / | \ / \
* / | \ / \
* / | \ / \
* libc tcmalloc jemalloc other
*
* Explanation:
* - **Application code**: High-level application logic that uses memory
* allocation and may trigger defragmentation.
* - **zmalloc**: An abstraction layer over the memory allocator, providing
* a uniform allocation interface to the application code. It can delegate
* to various underlying allocators (e.g., libc, tcmalloc, jemalloc, or others).
* It is not dependent on the defrag implementation logic, and it's possible to use a jemalloc
* version that does not support defrag.
* - **allocator_defrag**: This file contains allocator-specific logic for
* defragmentation, invoked from `defrag.c` when memory defragmentation is needed.
* Currently jemalloc is the only allocator with implemented defrag logic. It is possible that
* future implementations will include non-allocator defragmentation (think of data-structure
* compaction, for example).
* - **Underlying allocators**: These are the actual memory allocators, such as
* libc, tcmalloc, jemalloc, or other custom allocators. The defragmentation
* logic in `allocator_defrag` interacts with these allocators to reorganize
* memory and reduce fragmentation.
*
* The `defrag.c` file acts as the central entry point for defragmentation,
* invoking allocator-specific implementations provided here in `allocator_defrag.c`.
*
* Note: Developers working on `zmalloc` or `allocator_defrag` should refer to
* the other component to ensure both are using the same allocator configuration.
*/
#include <stdio.h>
#include "serverassert.h"
#include "allocator_defrag.h"
#define UNUSED(x) (void)(x)
#if defined(HAVE_DEFRAG) && defined(USE_JEMALLOC)
#define STRINGIFY_(x) #x
#define STRINGIFY(x) STRINGIFY_(x)
#define BATCH_QUERY_ARGS_OUT 3
#define SLAB_NFREE(out, i) out[(i) * BATCH_QUERY_ARGS_OUT]
#define SLAB_LEN(out, i) out[(i) * BATCH_QUERY_ARGS_OUT + 2]
#define SLAB_NUM_REGS(out, i) out[(i) * BATCH_QUERY_ARGS_OUT + 1]
#define UTILIZATION_THRESHOLD_FACTOR_MILI (125) // 12.5% additional utilization
/*
* Represents a precomputed key for querying jemalloc statistics.
*
* The `jeMallctlKey` structure stores a key corresponding to a specific jemalloc
* statistics field name. This key is used with the `je_mallctlbymib` interface
* to query statistics more efficiently, bypassing the need for runtime string
* lookup and translation performed by `je_mallctl`.
*
* - `je_mallctlnametomib` is called once for each statistics field to precompute
* and store the key corresponding to the field name.
* - Subsequent queries use `je_mallctlbymib` with the stored key, avoiding the
* overhead of repeated string-based lookups.
*
*/
typedef struct jeMallctlKey {
size_t key[6]; /* The precomputed key used to query jemalloc statistics. */
size_t keylen; /* The length of the key array. */
} jeMallctlKey;
/* Stores MIB (Management Information Base) keys for jemalloc bin queries.
*
* This struct holds precomputed `jeMallctlKey` values for querying various
* jemalloc bin-related statistics efficiently.
*/
typedef struct jeBinInfoKeys {
jeMallctlKey curr_slabs; /* Key to query the current number of slabs in the bin. */
jeMallctlKey nonfull_slabs; /* Key to query the number of non-full slabs in the bin. */
jeMallctlKey curr_regs; /* Key to query the current number of regions in the bin. */
} jeBinInfoKeys;
/* Represents detailed information about a jemalloc bin.
*
* This struct provides metadata about a jemalloc bin, including the size of
* its regions, total number of regions, and related MIB keys for efficient
* queries.
*/
typedef struct jeBinInfo {
size_t reg_size; /* Size of each region in the bin. */
uint32_t nregs; /* Total number of regions in the bin. */
jeBinInfoKeys info_keys; /* Precomputed MIB keys for querying bin statistics. */
} jeBinInfo;
/* Represents the configuration for jemalloc bins.
*
* This struct contains information about the number of bins and metadata for
* each bin, as well as precomputed keys for batch utility queries and epoch updates.
*/
typedef struct jemallocCB {
unsigned nbins; /* Number of bins in the jemalloc configuration. */
jeBinInfo *bin_info; /* Array of `jeBinInfo` structs, one for each bin. */
jeMallctlKey util_batch_query; /* Key to query batch utilization information. */
jeMallctlKey epoch; /* Key to trigger statistics sync between threads. */
} jemallocCB;
/* Represents the latest usage statistics for a jemalloc bin.
*
* This struct tracks the current usage of a bin, including the number of slabs
* and regions, and calculates the number of full slabs from other fields.
*/
typedef struct jemallocBinUsageData {
size_t curr_slabs; /* Current number of slabs in the bin. */
size_t curr_nonfull_slabs; /* Current number of non-full slabs in the bin. */
size_t curr_regs; /* Current number of regions in the bin. */
} jemallocBinUsageData;
static int defrag_supported = 0;
/* Control block holding information about bins and query helper -
* this structure is initialized once when calling allocatorDefragInit. It does not change afterwards*/
static jemallocCB je_cb = {0, NULL, {{0}, 0}, {{0}, 0}};
/* Holds the latest usage statistics for each bin. This structure is updated when calling
* allocatorDefragGetFragSmallbins and later is used to make a defrag decision for a memory pointer. */
static jemallocBinUsageData *je_usage_info = NULL;
/* -----------------------------------------------------------------------------
* Alloc/Free API that are cooperative with defrag
* -------------------------------------------------------------------------- */
/* Allocation and free functions that bypass the thread cache
* and go straight to the allocator arena bins.
* Currently implemented only for jemalloc. Used for online defragmentation.
*/
void *allocatorDefragAlloc(size_t size) {
void *ptr = je_mallocx(size, MALLOCX_TCACHE_NONE);
return ptr;
}
void allocatorDefragFree(void *ptr, size_t size) {
if (ptr == NULL) return;
je_sdallocx(ptr, size, MALLOCX_TCACHE_NONE);
}
/* -----------------------------------------------------------------------------
* Helper functions for jemalloc translation between size and index
* -------------------------------------------------------------------------- */
/* Get the bin index in bin array from the reg_size.
*
* This is a reverse-engineered mapping of reg_size -> binind. We need this information because the utilization query
* returns the size of the buffer and not the bin index, and we need the bin index to access its usage information
*
* Note: in case a future PR returns the binind (which is a better API anyway) we can get rid of
* these conversion functions
*/
static inline unsigned jeSize2BinIndexLgQ3(size_t sz) {
/* Number of bins in each power-of-2 size class group */
const size_t size_class_group_size = 4;
/* Log2 of the smallest quantum for binning (8 bytes) */
const size_t lg_quantum_3_first_pow2 = 3;
/* Offset for exponential bins */
const size_t lg_quantum_3_offset = ((64 >> lg_quantum_3_first_pow2) - 1);
/* Small sizes (8-64 bytes) use linear binning */
if (sz <= 64) { // 64 = 1 << (lg_quantum_3_first_pow2 + 3)
return (sz >> 3) - 1; // Divide by 8 and subtract 1
}
/* For larger sizes, use exponential binning */
/* Calculate leading zeros of (sz - 1) to properly handle power-of-2 sizes */
unsigned leading_zeros = __builtin_clzll(sz - 1);
unsigned exp = 64 - leading_zeros; // Effective log2(sz)
/* Calculate the size's position within its group */
unsigned within_group_offset = size_class_group_size -
(((1ULL << exp) - sz) >> (exp - lg_quantum_3_first_pow2));
/* Calculate the final bin index */
return within_group_offset +
((exp - (lg_quantum_3_first_pow2 + 3)) - 1) * size_class_group_size +
lg_quantum_3_offset;
}
/* -----------------------------------------------------------------------------
* Interface functions to get fragmentation info from jemalloc
* -------------------------------------------------------------------------- */
#define ARENA_TO_QUERY MALLCTL_ARENAS_ALL
static inline void jeRefreshStats(const jemallocCB *je_cb) {
uint64_t epoch = 1; // Value doesn't matter
size_t sz = sizeof(epoch);
/* Refresh stats */
je_mallctlbymib(je_cb->epoch.key, je_cb->epoch.keylen, &epoch, &sz, &epoch, sz);
}
/* Extract key that corresponds to the given name for fast query. This should be called once for each key_name */
static inline int jeQueryKeyInit(const char *key_name, jeMallctlKey *key_info) {
key_info->keylen = sizeof(key_info->key) / sizeof(key_info->key[0]);
int res = je_mallctlnametomib(key_name, key_info->key, &key_info->keylen);
/* sanity check that returned value is not larger than provided */
assert(key_info->keylen <= sizeof(key_info->key) / sizeof(key_info->key[0]));
return res;
}
/* Query jemalloc control interface using previously extracted key (with jeQueryKeyInit) instead of name string.
* This interface (named MIB in jemalloc) is faster as it avoids string dict lookup at run-time. */
static inline int jeQueryCtlInterface(const jeMallctlKey *key_info, void *value) {
size_t sz = sizeof(size_t);
return je_mallctlbymib(key_info->key, key_info->keylen, value, &sz, NULL, 0);
}
static inline int binQueryHelperInitialization(jeBinInfoKeys *helper, unsigned bin_index) {
char mallctl_name[128];
/* Mib of fetch number of used regions in the bin */
snprintf(mallctl_name, sizeof(mallctl_name), "stats.arenas." STRINGIFY(ARENA_TO_QUERY) ".bins.%d.curregs", bin_index);
if (jeQueryKeyInit(mallctl_name, &helper->curr_regs) != 0) return -1;
/* Mib of fetch number of current slabs in the bin */
snprintf(mallctl_name, sizeof(mallctl_name), "stats.arenas." STRINGIFY(ARENA_TO_QUERY) ".bins.%d.curslabs", bin_index);
if (jeQueryKeyInit(mallctl_name, &helper->curr_slabs) != 0) return -1;
/* Mib of fetch nonfull slabs */
snprintf(mallctl_name, sizeof(mallctl_name), "stats.arenas." STRINGIFY(ARENA_TO_QUERY) ".bins.%d.nonfull_slabs", bin_index);
if (jeQueryKeyInit(mallctl_name, &helper->nonfull_slabs) != 0) return -1;
return 0;
}
/* Initializes the defragmentation system for the jemalloc memory allocator.
*
* This function performs the necessary setup and initialization steps for the defragmentation system.
* It retrieves the configuration information for the jemalloc arenas and bins, and initializes the usage
* statistics data structure.
*
* return 0 on success, or a non-zero error code on failure.
*
* The initialization process involves the following steps:
* 1. Check if defragmentation is supported by the current jemalloc version.
* 2. Retrieve the arena bin configuration information using the `je_mallctlbymib` function.
* 3. Initialize the `usage_latest` structure with the bin usage statistics and configuration data.
* 4. Set the `defrag_supported` flag to indicate that defragmentation is enabled.
*
* Note: This function must be called before using any other defragmentation-related functionality.
* It should be called during the initialization phase of the code that uses the
* defragmentation feature.
*/
int allocatorDefragInit(void) {
char mallctl_name[100];
jeBinInfo *bin_info;
size_t sz;
int je_res;
/* the init should be called only once, fail if unexpected call */
assert(!defrag_supported);
/* Get the mib of the per memory pointers query command that is used during defrag scan over memory */
if (jeQueryKeyInit("experimental.utilization.batch_query", &je_cb.util_batch_query) != 0) return -1;
je_res = jeQueryKeyInit("epoch", &je_cb.epoch);
assert(je_res == 0);
jeRefreshStats(&je_cb);
/* get quantum for verification only, current code assumes lg-quantum should be 3 */
size_t jemalloc_quantum;
sz = sizeof(jemalloc_quantum);
je_mallctl("arenas.quantum", &jemalloc_quantum, &sz, NULL, 0);
/* lg-quantum should be 3 so jemalloc_quantum should be 1<<3 */
assert(jemalloc_quantum == 8);
sz = sizeof(je_cb.nbins);
je_res = je_mallctl("arenas.nbins", &je_cb.nbins, &sz, NULL, 0);
assert(je_res == 0 && je_cb.nbins != 0);
je_cb.bin_info = je_calloc(je_cb.nbins, sizeof(jeBinInfo));
assert(je_cb.bin_info != NULL);
je_usage_info = je_calloc(je_cb.nbins, sizeof(jemallocBinUsageData));
assert(je_usage_info != NULL);
for (unsigned j = 0; j < je_cb.nbins; j++) {
bin_info = &je_cb.bin_info[j];
/* The size of the current bin */
snprintf(mallctl_name, sizeof(mallctl_name), "arenas.bin.%d.size", j);
sz = sizeof(bin_info->reg_size);
je_res = je_mallctl(mallctl_name, &bin_info->reg_size, &sz, NULL, 0);
assert(je_res == 0);
/* Number of regions per slab */
snprintf(mallctl_name, sizeof(mallctl_name), "arenas.bin.%d.nregs", j);
sz = sizeof(bin_info->nregs);
je_res = je_mallctl(mallctl_name, &bin_info->nregs, &sz, NULL, 0);
assert(je_res == 0);
/* init bin specific fast query keys */
je_res = binQueryHelperInitialization(&bin_info->info_keys, j);
assert(je_res == 0);
/* verify the reverse map of reg_size to bin index */
assert(jeSize2BinIndexLgQ3(bin_info->reg_size) == j);
}
/* defrag is supported mark it to enable defrag queries */
defrag_supported = 1;
return 0;
}
/* Total size of consumed memory in unused regs in small bins (AKA external fragmentation).
* The function will refresh the epoch.
*
* return total fragmentation bytes
*/
unsigned long allocatorDefragGetFragSmallbins(void) {
assert(defrag_supported);
unsigned long frag = 0;
jeRefreshStats(&je_cb);
for (unsigned j = 0; j < je_cb.nbins; j++) {
jeBinInfo *bin_info = &je_cb.bin_info[j];
jemallocBinUsageData *bin_usage = &je_usage_info[j];
/* Number of used regions in the bin */
jeQueryCtlInterface(&bin_info->info_keys.curr_regs, &bin_usage->curr_regs);
/* Number of current slabs in the bin */
jeQueryCtlInterface(&bin_info->info_keys.curr_slabs, &bin_usage->curr_slabs);
/* Number of non full slabs in the bin */
jeQueryCtlInterface(&bin_info->info_keys.nonfull_slabs, &bin_usage->curr_nonfull_slabs);
/* Calculate the fragmentation bytes for the current bin and add it to the total. */
frag += ((bin_info->nregs * bin_usage->curr_slabs) - bin_usage->curr_regs) * bin_info->reg_size;
}
return frag;
}
/* Determines whether defragmentation should be performed on a pointer based on jemalloc information.
*
* bin_info Pointer to the bin information structure.
* bin_usage Pointer to the bin usage structure.
* nalloced Number of allocated regions in the bin.
*
* return 1 if defragmentation should be performed, 0 otherwise.
*
* This function checks the following conditions to determine if defragmentation should be performed:
* 1. If the number of allocated regions (nalloced) is equal to the total number of regions (bin_info->nregs),
* defragmentation is not necessary as moving regions is guaranteed not to change the fragmentation ratio.
* 2. If the number of non-full slabs (bin_usage->curr_nonfull_slabs) is less than 2, defragmentation is not performed
* because there is no other slab to move regions to.
* 3. If slab utilization < 'avg utilization' * 1.125 [in code, 1.125 == (1000 + UTILIZATION_THRESHOLD_FACTOR_MILI) / 1000]
* then we should defrag. This is aligned with the previous je_defrag_hint implementation.
*/
static inline int makeDefragDecision(jeBinInfo *bin_info, jemallocBinUsageData *bin_usage, unsigned long nalloced) {
unsigned long curr_full_slabs = bin_usage->curr_slabs - bin_usage->curr_nonfull_slabs;
size_t allocated_nonfull = bin_usage->curr_regs - curr_full_slabs * bin_info->nregs;
if (bin_info->nregs == nalloced || bin_usage->curr_nonfull_slabs < 2 ||
1000 * nalloced * bin_usage->curr_nonfull_slabs > (1000 + UTILIZATION_THRESHOLD_FACTOR_MILI) * allocated_nonfull) {
return 0;
}
return 1;
}
/*
* Performs defragmentation analysis for a given ptr.
*
* ptr - ptr to memory region to be analyzed.
*
* return - the function returns 1 if defrag should be performed, 0 otherwise.
*/
int allocatorShouldDefrag(void *ptr) {
assert(defrag_supported);
size_t out[BATCH_QUERY_ARGS_OUT];
size_t out_sz = sizeof(out);
size_t in_sz = sizeof(ptr);
for (unsigned j = 0; j < BATCH_QUERY_ARGS_OUT; j++) {
out[j] = -1;
}
je_mallctlbymib(je_cb.util_batch_query.key,
je_cb.util_batch_query.keylen,
out, &out_sz,
&ptr, in_sz);
/* handle results with appropriate quantum value */
assert(SLAB_NUM_REGS(out, 0) > 0);
assert(SLAB_LEN(out, 0) > 0);
assert(SLAB_NFREE(out, 0) != (size_t)-1);
unsigned region_size = SLAB_LEN(out, 0) / SLAB_NUM_REGS(out, 0);
/* check that the allocation size is in range of small bins */
if (region_size > je_cb.bin_info[je_cb.nbins - 1].reg_size) {
return 0;
}
/* get the index based on quantum used */
unsigned binind = jeSize2BinIndexLgQ3(region_size);
/* make sure binind is in range and reverse map is correct */
assert(binind < je_cb.nbins && region_size == je_cb.bin_info[binind].reg_size);
return makeDefragDecision(&je_cb.bin_info[binind],
&je_usage_info[binind],
je_cb.bin_info[binind].nregs - SLAB_NFREE(out, 0));
}
#else
int allocatorDefragInit(void) {
return -1;
}
void allocatorDefragFree(void *ptr, size_t size) {
UNUSED(ptr);
UNUSED(size);
}
__attribute__((malloc)) void *allocatorDefragAlloc(size_t size) {
UNUSED(size);
return NULL;
}
unsigned long allocatorDefragGetFragSmallbins(void) {
return 0;
}
int allocatorShouldDefrag(void *ptr) {
UNUSED(ptr);
return 0;
}
#endif

src/allocator_defrag.h (new file, 22 lines)

@@ -0,0 +1,22 @@
#ifndef __ALLOCATOR_DEFRAG_H
#define __ALLOCATOR_DEFRAG_H
#if defined(USE_JEMALLOC)
#include <jemalloc/jemalloc.h>
/* We can enable the server defrag capabilities only if we are using Jemalloc
* and the version that has the experimental.utilization namespace in mallctl . */
#if defined(JEMALLOC_VERSION_MAJOR) && \
(JEMALLOC_VERSION_MAJOR > 5 || \
(JEMALLOC_VERSION_MAJOR == 5 && JEMALLOC_VERSION_MINOR > 2) || \
(JEMALLOC_VERSION_MAJOR == 5 && JEMALLOC_VERSION_MINOR == 2 && JEMALLOC_VERSION_BUGFIX >= 1))
#define HAVE_DEFRAG
#endif
#endif
int allocatorDefragInit(void);
void allocatorDefragFree(void *ptr, size_t size);
__attribute__((malloc)) void *allocatorDefragAlloc(size_t size);
unsigned long allocatorDefragGetFragSmallbins(void);
int allocatorShouldDefrag(void *ptr);
#endif /* __ALLOCATOR_DEFRAG_H */


@@ -49,10 +49,6 @@ typedef struct defragPubSubCtx {
dict *(*clientPubSubChannels)(client *);
} defragPubSubCtx;
/* this method was added to jemalloc in order to help us understand which
* pointers are worthwhile moving and which aren't */
int je_get_defrag_hint(void *ptr);
/* Defrag helper for generic allocations.
*
* returns NULL in case the allocation wasn't moved.
@@ -61,7 +57,7 @@ int je_get_defrag_hint(void *ptr);
void *activeDefragAlloc(void *ptr) {
size_t size;
void *newptr;
if (!je_get_defrag_hint(ptr)) {
if (!allocatorShouldDefrag(ptr)) {
server.stat_active_defrag_misses++;
return NULL;
}
@ -69,9 +65,9 @@ void *activeDefragAlloc(void *ptr) {
* make sure not to use the thread cache. so that we don't get back the same
* pointers we try to free */
size = zmalloc_size(ptr);
newptr = zmalloc_no_tcache(size);
newptr = allocatorDefragAlloc(size);
memcpy(newptr, ptr, size);
zfree_no_tcache(ptr);
allocatorDefragFree(ptr, size);
server.stat_active_defrag_hits++;
return newptr;
}
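The move in activeDefragAlloc is an allocate-copy-free sequence; a minimal standalone sketch of the pattern using plain malloc (the real code routes through allocatorDefragAlloc/allocatorDefragFree precisely so the replacement region cannot be handed back from the thread cache):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Move a buffer of known size to a fresh allocation and free the old
 * one. Returns the new pointer, or NULL (leaving the original intact)
 * if the allocation failed. */
static void *move_alloc(void *ptr, size_t size) {
    void *newptr = malloc(size);
    if (newptr == NULL) return NULL;
    memcpy(newptr, ptr, size);
    free(ptr);
    return newptr;
}
```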
@ -756,8 +752,8 @@ void defragScanCallback(void *privdata, const dictEntry *de) {
* without the possibility of getting any results. */
float getAllocatorFragmentation(size_t *out_frag_bytes) {
size_t resident, active, allocated, frag_smallbins_bytes;
zmalloc_get_allocator_info(&allocated, &active, &resident, NULL, NULL, &frag_smallbins_bytes);
zmalloc_get_allocator_info(&allocated, &active, &resident, NULL, NULL);
frag_smallbins_bytes = allocatorDefragGetFragSmallbins();
/* Calculate the fragmentation ratio as the proportion of wasted memory in small
* bins (which are defraggable) relative to the total allocated memory (including large bins).
* This is because otherwise, if most of the memory usage is large bins, we may show high percentage,

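The ratio the comment describes is wasted small-bin bytes as a fraction of total allocated bytes; a minimal sketch (`frag_ratio` is an illustrative helper, not the real `getAllocatorFragmentation`):

```c
#include <assert.h>
#include <stddef.h>

/* Wasted small-bin bytes as a fraction of all allocated bytes; large
 * bins inflate the denominator but are not defraggable themselves. */
static float frag_ratio(size_t frag_smallbins_bytes, size_t allocated) {
    if (allocated == 0) return 0.0f;
    return (float)frag_smallbins_bytes / (float)allocated;
}
```

For instance, 25 wasted bytes out of 100 allocated gives a ratio of 0.25.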

@ -1297,8 +1297,8 @@ void cronUpdateMemoryStats(void) {
 * allocations, and allocator reserved pages that can be purged (all not actual frag) */
zmalloc_get_allocator_info(
&server.cron_malloc_stats.allocator_allocated, &server.cron_malloc_stats.allocator_active,
&server.cron_malloc_stats.allocator_resident, NULL, &server.cron_malloc_stats.allocator_muzzy,
&server.cron_malloc_stats.allocator_frag_smallbins_bytes);
&server.cron_malloc_stats.allocator_resident, NULL, &server.cron_malloc_stats.allocator_muzzy);
server.cron_malloc_stats.allocator_frag_smallbins_bytes = allocatorDefragGetFragSmallbins();
/* in case the allocator isn't providing these stats, fake them so that
* fragmentation info still shows some (inaccurate metrics) */
if (!server.cron_malloc_stats.allocator_resident) {
@ -6794,7 +6794,10 @@ __attribute__((weak)) int main(int argc, char **argv) {
#endif
tzset(); /* Populates 'timezone' global. */
zmalloc_set_oom_handler(serverOutOfMemoryHandler);
#if defined(HAVE_DEFRAG)
int res = allocatorDefragInit();
serverAssert(res == 0);
#endif
/* To achieve entropy, in case of containers, their time() and getpid() can
* be the same. But value of tv_usec is fast enough to make the difference */
gettimeofday(&tv, NULL);


@ -35,6 +35,7 @@
#include "solarisfixes.h"
#include "rio.h"
#include "commands.h"
#include "allocator_defrag.h"
#include <stdio.h>
#include <stdlib.h>


@ -84,8 +84,6 @@ void zlibc_free(void *ptr) {
#define calloc(count, size) je_calloc(count, size)
#define realloc(ptr, size) je_realloc(ptr, size)
#define free(ptr) je_free(ptr)
#define mallocx(size, flags) je_mallocx(size, flags)
#define dallocx(ptr, flags) je_dallocx(ptr, flags)
#endif
#define thread_local _Thread_local
@ -207,25 +205,6 @@ void *zmalloc_usable(size_t size, size_t *usable) {
return ptr;
}
/* Allocation and free functions that bypass the thread cache
* and go straight to the allocator arena bins.
* Currently implemented only for jemalloc. Used for online defragmentation. */
#ifdef HAVE_DEFRAG
void *zmalloc_no_tcache(size_t size) {
if (size >= SIZE_MAX / 2) zmalloc_oom_handler(size);
void *ptr = mallocx(size + PREFIX_SIZE, MALLOCX_TCACHE_NONE);
if (!ptr) zmalloc_oom_handler(size);
update_zmalloc_stat_alloc(zmalloc_size(ptr));
return ptr;
}
void zfree_no_tcache(void *ptr) {
if (ptr == NULL) return;
update_zmalloc_stat_free(zmalloc_size(ptr));
dallocx(ptr, MALLOCX_TCACHE_NONE);
}
#endif
/* Try allocating memory and zero it, and return NULL if failed.
* '*usable' is set to the usable size if non NULL. */
static inline void *ztrycalloc_usable_internal(size_t size, size_t *usable) {
@ -683,52 +662,7 @@ size_t zmalloc_get_rss(void) {
#define STRINGIFY_(x) #x
#define STRINGIFY(x) STRINGIFY_(x)
/* Compute the total memory wasted on fragmentation inside small arena bins.
* Done by summing the memory in unused regs in all slabs of all small bins. */
size_t zmalloc_get_frag_smallbins(void) {
unsigned nbins;
size_t sz, frag = 0;
char buf[100];
sz = sizeof(unsigned);
assert(!je_mallctl("arenas.nbins", &nbins, &sz, NULL, 0));
for (unsigned j = 0; j < nbins; j++) {
size_t curregs, curslabs, reg_size;
uint32_t nregs;
/* The size of the current bin */
snprintf(buf, sizeof(buf), "arenas.bin.%d.size", j);
sz = sizeof(size_t);
assert(!je_mallctl(buf, &reg_size, &sz, NULL, 0));
/* Number of used regions in the bin */
snprintf(buf, sizeof(buf), "stats.arenas." STRINGIFY(MALLCTL_ARENAS_ALL) ".bins.%d.curregs", j);
sz = sizeof(size_t);
assert(!je_mallctl(buf, &curregs, &sz, NULL, 0));
/* Number of regions per slab */
snprintf(buf, sizeof(buf), "arenas.bin.%d.nregs", j);
sz = sizeof(uint32_t);
assert(!je_mallctl(buf, &nregs, &sz, NULL, 0));
/* Number of current slabs in the bin */
snprintf(buf, sizeof(buf), "stats.arenas." STRINGIFY(MALLCTL_ARENAS_ALL) ".bins.%d.curslabs", j);
sz = sizeof(size_t);
assert(!je_mallctl(buf, &curslabs, &sz, NULL, 0));
/* Calculate the fragmentation bytes for the current bin and add it to the total. */
frag += ((nregs * curslabs) - curregs) * reg_size;
}
return frag;
}
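The per-bin arithmetic in the loop above can be exercised in isolation; a small sketch of the same formula (the stats values are illustrative, not queried from jemalloc):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Wasted bytes in one small bin: regions reserved by current slabs but
 * holding no live allocation, times the region size. Mirrors the
 * ((nregs * curslabs) - curregs) * reg_size line above. */
static size_t bin_frag_bytes(uint32_t nregs, size_t curslabs,
                             size_t curregs, size_t reg_size) {
    return ((size_t)nregs * curslabs - curregs) * reg_size;
}
```

A 64-byte bin with 128 regs per slab, 10 slabs, and 1000 regs in use wastes (1280 - 1000) * 64 = 17920 bytes.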
int zmalloc_get_allocator_info(size_t *allocated,
size_t *active,
size_t *resident,
size_t *retained,
size_t *muzzy,
size_t *frag_smallbins_bytes) {
int zmalloc_get_allocator_info(size_t *allocated, size_t *active, size_t *resident, size_t *retained, size_t *muzzy) {
uint64_t epoch = 1;
size_t sz;
*allocated = *resident = *active = 0;
@ -763,8 +697,6 @@ int zmalloc_get_allocator_info(size_t *allocated,
*muzzy = pmuzzy * page;
}
/* Total size of consumed memory in unused regs in small bins (AKA external fragmentation). */
*frag_smallbins_bytes = zmalloc_get_frag_smallbins();
return 1;
}
@ -789,13 +721,8 @@ int jemalloc_purge(void) {
#else
int zmalloc_get_allocator_info(size_t *allocated,
size_t *active,
size_t *resident,
size_t *retained,
size_t *muzzy,
size_t *frag_smallbins_bytes) {
*allocated = *resident = *active = *frag_smallbins_bytes = 0;
int zmalloc_get_allocator_info(size_t *allocated, size_t *active, size_t *resident, size_t *retained, size_t *muzzy) {
*allocated = *resident = *active = 0;
if (retained) *retained = 0;
if (muzzy) *muzzy = 0;
return 1;


@ -100,13 +100,6 @@
#include <malloc.h>
#endif
/* We can enable the server defrag capabilities only if we are using Jemalloc
* and the version used is our special version modified for the server having
* the ability to return per-allocation fragmentation hints. */
#if defined(USE_JEMALLOC) && defined(JEMALLOC_FRAG_HINT)
#define HAVE_DEFRAG
#endif
/* The zcalloc symbol is a symbol name already used by zlib, which is defining
* other names using the "z" prefix specific to zlib. In practice, linking
* valkey with a static openssl, which itself might depend on a static libz
@ -138,12 +131,7 @@ __attribute__((malloc)) char *zstrdup(const char *s);
size_t zmalloc_used_memory(void);
void zmalloc_set_oom_handler(void (*oom_handler)(size_t));
size_t zmalloc_get_rss(void);
int zmalloc_get_allocator_info(size_t *allocated,
size_t *active,
size_t *resident,
size_t *retained,
size_t *muzzy,
size_t *frag_smallbins_bytes);
int zmalloc_get_allocator_info(size_t *allocated, size_t *active, size_t *resident, size_t *retained, size_t *muzzy);
void set_jemalloc_bg_thread(int enable);
int jemalloc_purge(void);
size_t zmalloc_get_private_dirty(long pid);
@ -153,11 +141,6 @@ void zlibc_free(void *ptr);
void zlibc_trim(void);
void zmadvise_dontneed(void *ptr);
#ifdef HAVE_DEFRAG
void zfree_no_tcache(void *ptr);
__attribute__((malloc)) void *zmalloc_no_tcache(size_t size);
#endif
#ifndef HAVE_MALLOC_SIZE
size_t zmalloc_size(void *ptr);
size_t zmalloc_usable_size(void *ptr);