futriix/src/zmalloc.h
zvi-code b56eed2479
Remove valkey specific changes in jemalloc source code (#1266)
### Summary of the change

This is a base PR for refactoring defrag. It moves the defrag logic to
rely on jemalloc [native
api](https://github.com/jemalloc/jemalloc/pull/1463#issuecomment-479706489)
instead of relying on custom code changes made by valkey in the jemalloc
([je_defrag_hint](9f8185f5c8/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h (L382)))
library. This enables valkey to use latest vanila jemalloc without the
need to maintain code changes cross jemalloc versions.

This change requires some modifications because the new api is providing
only the information, not a yes\no defrag. The logic needs to be
implemented at valkey code. Additionally, the api does not provide,
within single call, all the information needed to make a decision, this
information is available through additional api call. To reduce the
calls to jemalloc, in this PR the required information is collected
during the `computeDefragCycles` and not for every single ptr, this way
we are avoiding the additional api call.
Followup work will utilize the new options that are now open and will
further improve the defrag decision and process.

### Added files: 

`allocator_defrag.c` / `allocator_defrag.h` - This files implement the
allocator specific knowledge for making defrag decision. The knowledge
about slabs and allocation logic and so on, all goes into this file.
This improves the separation between jemalloc specific code and other
possible implementation.


### Moved functions: 

[`zmalloc_no_tcache` , `zfree_no_tcache`
](4593dc2f05/src/zmalloc.c (L215))
- these are very jemalloc specific logic assumptions, and are very
specific to how we defrag with jemalloc. This is also with the vision
that from performance perspective we should consider using tcache, we
only need to make sure we don't recycle entries without going through
the arena [for example: we can use private tcache, one for free and one
for alloc].
`frag_smallbins_bytes` - the logic and implementation moved to the new
file

### Existing API:

* [once a second + when completed full cycle]
[`computeDefragCycles`](4593dc2f05/src/defrag.c (L916))
* `zmalloc_get_allocator_info` : gets from jemalloc _allocated, active,
resident, retained, muzzy_, `frag_smallbins_bytes`
*
[`frag_smallbins_bytes`](4593dc2f05/src/zmalloc.c (L690))
: for each bin; gets from jemalloc bin_info, `curr_regs`, `cur_slabs`
* [during defrag, for each pointer]
* `je_defrag_hint` is getting a memory pointer and returns {0,1} .
[Internally it
uses](4593dc2f05/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h (L368))
this information points:
        * #`nonfull_slabs`
        * #`total_slabs`
        * #free regs in the ptr slab

## Jemalloc API (via ctl interface)


[BATCH][`experimental_utilization_batch_query_ctl`](4593dc2f05/deps/jemalloc/src/ctl.c (L4114))
: gets an array of pointers, returns for each pointer 3 values,

* number of free regions in the extent
* number of regions in the extent
* size of the extent in terms of bytes


[EXTENDED][`experimental_utilization_query_ctl`](4593dc2f05/deps/jemalloc/src/ctl.c (L3989))
:

* memory address of the extent a potential reallocation would go into
* number of free regions in the extent
* number of regions in the extent
* size of the extent in terms of bytes
* [stats-enabled]total number of free regions in the bin the extent
belongs to
* [stats-enabled]total number of regions in the bin the extent belongs
to

### `experimental_utilization_batch_query_ctl` vs valkey
`je_defrag_hint`?
[good]
   - We can query pointers in a batch, reduce the overall overhead
- The per ptr decision algorithm is not within jemalloc api, jemalloc
only provides information, valkey can tune\configure\optimize easily

 
[bad]
- In the batch API we only know the utilization of the slab (of that
memory ptr), we don’t get the data about #`nonfull_slabs` and total
allocated regs.


## New functions:
1. `defrag_jemalloc_init`: Reducing the cost of call to je_ctl: use the
[MIB interface](https://jemalloc.net/jemalloc.3.html) to get a faster
calls. See this quote from the jemalloc documentation:
    
The mallctlnametomib() function provides a way to avoid repeated name
lookups for
applications that repeatedly query the same portion of the namespace,by
translating
a name to a “Management Information Base” (MIB) that can be passed
repeatedly to
    mallctlbymib().

6. `jemalloc_sz2binind_lgq*` : this api is to support reverse map
between bin size and it’s info without lookup. This mapping depends on
the number of size classes we have that are derived from
[`lg_quantum`](4593dc2f05/deps/Makefile (L115))
7. `defrag_jemalloc_get_frag_smallbins` : This function replaces
`frag_smallbins_bytes` the logic moved to the new file allocator_defrag
`defrag_jemalloc_should_defrag_multi` → `handle_results` - unpacks the
results
8. `should_defrag` : implements the same logic as the existing
implementation
[inside](9f8185f5c8/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h (L382))
je_defrag_hint
9. `defrag_jemalloc_should_defrag_multi` : implements the hint for an
array of pointers, utilizing the new batch api. currently only 1 pointer
is passed.


### Logical differences:

In order to get the information about #`nonfull_slabs` and #`regs`, we
use the query cycle to collect the information per size class. In order
to find the index of bin information given bin size, in o(1), we use
`jemalloc_sz2binind_lgq*` .


## Testing
This is the first draft. I did some initial testing that basically
fragmentation by reducing max memory and than waiting for defrag to
reach desired level. The test only serves as sanity that defrag is
succeeding eventually, no data provided here regarding efficiency and
performance.

### Test: 
1. disable `activedefrag`
2. run valkey benchmark on overlapping address ranges with different
block sizes
3. wait untill `used_memory` reaches 10GB
4. set `maxmemory` to 5GB and `maxmemory-policy` to `allkeys-lru`
5. stop load
6. wait for `mem_fragmentation_ratio` to reach 2
7. enable `activedefrag` - start test timer
8. wait until reach `mem_fragmentation_ratio` = 1.1

#### Results*:
(With this PR)Test results: ` 56 sec`
(Without this PR)Test results: `67 sec`

*both runs perform same "work" number of buffers moved to reach
fragmentation target

Next benchmarking is to compare to:
- DONE // existing `je_get_defrag_hint` 
- compare with naive defrag all: `int defrag_hint() {return 1;}`

---------

Signed-off-by: Zvi Schneider <ezvisch@amazon.com>
Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com>
Signed-off-by: zvi-code <54795925+zvi-code@users.noreply.github.com>
Co-authored-by: Zvi Schneider <ezvisch@amazon.com>
Co-authored-by: Zvi Schneider <zvi.schneider22@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-11-21 16:29:21 -08:00

169 lines
7.2 KiB
C

/* zmalloc - total amount of allocated memory aware version of malloc()
*
* Copyright (c) 2009-2010, Redis Ltd.
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* * Redistributions of source code must retain the above copyright notice,
* this list of conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* * Neither the name of Redis nor the names of its contributors may be used
* to endorse or promote products derived from this software without
* specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
* ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
* LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
* CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
* SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
* INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
* CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
* ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
* POSSIBILITY OF SUCH DAMAGE.
*/
#ifndef __ZMALLOC_H
#define __ZMALLOC_H
#include <stddef.h>
/* Double expansion needed for stringification of macro values. */
#define __xstr(s) __str(s)
#define __str(s) #s
#if defined(USE_TCMALLOC)
#define ZMALLOC_LIB ("tcmalloc-" __xstr(TC_VERSION_MAJOR) "." __xstr(TC_VERSION_MINOR))
#include <gperftools/tcmalloc.h>
#if (TC_VERSION_MAJOR == 1 && TC_VERSION_MINOR >= 6) || (TC_VERSION_MAJOR > 1)
#define HAVE_MALLOC_SIZE 1
#define zmalloc_size(p) tc_malloc_size(p)
#else
#error "Newer version of tcmalloc required"
#endif
#elif defined(USE_JEMALLOC)
#define ZMALLOC_LIB \
("jemalloc-" __xstr(JEMALLOC_VERSION_MAJOR) "." __xstr(JEMALLOC_VERSION_MINOR) "." __xstr(JEMALLOC_VERSION_BUGFIX))
#include <jemalloc/jemalloc.h>
#if (JEMALLOC_VERSION_MAJOR == 2 && JEMALLOC_VERSION_MINOR >= 1) || (JEMALLOC_VERSION_MAJOR > 2)
#define HAVE_MALLOC_SIZE 1
#define zmalloc_size(p) je_malloc_usable_size(p)
#else
#error "Newer version of jemalloc required"
#endif
#elif defined(__APPLE__)
#include <malloc/malloc.h>
#define HAVE_MALLOC_SIZE 1
#define zmalloc_size(p) malloc_size(p)
#endif
/* On native libc implementations, we should still do our best to provide a
* HAVE_MALLOC_SIZE capability. This can be set explicitly as well:
*
* NO_MALLOC_USABLE_SIZE disables it on all platforms, even if they are
* known to support it.
* USE_MALLOC_USABLE_SIZE forces use of malloc_usable_size() regardless
* of platform.
*/
#ifndef ZMALLOC_LIB
#define ZMALLOC_LIB "libc"
#define USE_LIBC 1
#if !defined(NO_MALLOC_USABLE_SIZE) && (defined(__GLIBC__) || defined(__FreeBSD__) || defined(__DragonFly__) || \
defined(__HAIKU__) || defined(USE_MALLOC_USABLE_SIZE))
/* Includes for malloc_usable_size() */
#ifdef __FreeBSD__
#include <malloc_np.h>
#else
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <malloc.h>
#endif
#define HAVE_MALLOC_SIZE 1
#define zmalloc_size(p) malloc_usable_size(p)
#endif
#endif
/* Includes for malloc_trim(), see zlibc_trim(). */
#if defined(__GLIBC__) && !defined(USE_LIBC)
#include <malloc.h>
#endif
/* The zcalloc symbol is a symbol name already used by zlib, which is defining
* other names using the "z" prefix specific to zlib. In practice, linking
* valkey with a static openssl, which itself might depend on a static libz
* will result in link time error rejecting multiple symbol definitions. */
#define zmalloc valkey_malloc
#define zcalloc valkey_calloc
#define zrealloc valkey_realloc
#define zfree valkey_free
/* 'noinline' attribute is intended to prevent the `-Wstringop-overread` warning
* when using gcc-12 later with LTO enabled. It may be removed once the
* bug[https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96503] is fixed. */
__attribute__((malloc, alloc_size(1), noinline)) void *zmalloc(size_t size);
__attribute__((malloc, alloc_size(1), noinline)) void *zcalloc(size_t size);
__attribute__((malloc, alloc_size(1, 2), noinline)) void *zcalloc_num(size_t num, size_t size);
__attribute__((alloc_size(2), noinline)) void *zrealloc(void *ptr, size_t size);
__attribute__((malloc, alloc_size(1), noinline)) void *ztrymalloc(size_t size);
__attribute__((malloc, alloc_size(1), noinline)) void *ztrycalloc(size_t size);
__attribute__((alloc_size(2), noinline)) void *ztryrealloc(void *ptr, size_t size);
void zfree(void *ptr);
void zfree_with_size(void *ptr, size_t size);
void *zmalloc_usable(size_t size, size_t *usable);
void *zcalloc_usable(size_t size, size_t *usable);
void *zrealloc_usable(void *ptr, size_t size, size_t *usable);
void *ztrymalloc_usable(size_t size, size_t *usable);
void *ztrycalloc_usable(size_t size, size_t *usable);
void *ztryrealloc_usable(void *ptr, size_t size, size_t *usable);
__attribute__((malloc)) char *zstrdup(const char *s);
size_t zmalloc_used_memory(void);
void zmalloc_set_oom_handler(void (*oom_handler)(size_t));
size_t zmalloc_get_rss(void);
int zmalloc_get_allocator_info(size_t *allocated, size_t *active, size_t *resident, size_t *retained, size_t *muzzy);
void set_jemalloc_bg_thread(int enable);
int jemalloc_purge(void);
size_t zmalloc_get_private_dirty(long pid);
size_t zmalloc_get_smap_bytes_by_field(char *field, long pid);
size_t zmalloc_get_memory_size(void);
void zlibc_free(void *ptr);
void zlibc_trim(void);
void zmadvise_dontneed(void *ptr);
#ifndef HAVE_MALLOC_SIZE
size_t zmalloc_size(void *ptr);
size_t zmalloc_usable_size(void *ptr);
#else
/* If we use 'zmalloc_usable_size()' to obtain additional available memory size
* and manipulate it, we need to call 'extend_to_usable()' afterwards to ensure
* the compiler recognizes this extra memory. However, if we use the pointer
* obtained from z[*]_usable() family functions, there is no need for this step. */
#define zmalloc_usable_size(p) zmalloc_size(p)
/* derived from https://github.com/systemd/systemd/pull/25688
* We use zmalloc_usable_size() everywhere to use memory blocks, but that is an abuse since the
* malloc_usable_size() isn't meant for this kind of use, it is for diagnostics only. That is also why the
* behavior is flaky when built with _FORTIFY_SOURCE, the compiler can sense that we reach outside
* the allocated block and SIGABRT.
* We use a dummy allocator function to tell the compiler that the new size of ptr is newsize.
* The implementation returns the pointer as is; the only reason for its existence is as a conduit for the
* alloc_size attribute. This cannot be a static inline because gcc then loses the attributes on the function.
* See: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96503 */
__attribute__((alloc_size(2), noinline)) void *extend_to_usable(void *ptr, size_t size);
#endif
int get_proc_stat_ll(int i, long long *res);
#endif /* __ZMALLOC_H */