zvi-code b56eed2479
Remove valkey specific changes in jemalloc source code (#1266)
### Summary of the change

This is a base PR for refactoring defrag. It moves the defrag logic to
rely on jemalloc [native
api](https://github.com/jemalloc/jemalloc/pull/1463#issuecomment-479706489)
instead of relying on custom code changes made by valkey in the jemalloc
([je_defrag_hint](9f8185f5c8/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h (L382)))
library. This enables valkey to use latest vanila jemalloc without the
need to maintain code changes cross jemalloc versions.

This change requires some modifications because the new api is providing
only the information, not a yes\no defrag. The logic needs to be
implemented at valkey code. Additionally, the api does not provide,
within single call, all the information needed to make a decision, this
information is available through additional api call. To reduce the
calls to jemalloc, in this PR the required information is collected
during the `computeDefragCycles` and not for every single ptr, this way
we are avoiding the additional api call.
Followup work will utilize the new options that are now open and will
further improve the defrag decision and process.

### Added files: 

`allocator_defrag.c` / `allocator_defrag.h` - This files implement the
allocator specific knowledge for making defrag decision. The knowledge
about slabs and allocation logic and so on, all goes into this file.
This improves the separation between jemalloc specific code and other
possible implementation.


### Moved functions: 

[`zmalloc_no_tcache` , `zfree_no_tcache`
](4593dc2f05/src/zmalloc.c (L215))
- these are very jemalloc specific logic assumptions, and are very
specific to how we defrag with jemalloc. This is also with the vision
that from performance perspective we should consider using tcache, we
only need to make sure we don't recycle entries without going through
the arena [for example: we can use private tcache, one for free and one
for alloc].
`frag_smallbins_bytes` - the logic and implementation moved to the new
file

### Existing API:

* [once a second + when completed full cycle]
[`computeDefragCycles`](4593dc2f05/src/defrag.c (L916))
* `zmalloc_get_allocator_info` : gets from jemalloc _allocated, active,
resident, retained, muzzy_, `frag_smallbins_bytes`
*
[`frag_smallbins_bytes`](4593dc2f05/src/zmalloc.c (L690))
: for each bin; gets from jemalloc bin_info, `curr_regs`, `cur_slabs`
* [during defrag, for each pointer]
* `je_defrag_hint` is getting a memory pointer and returns {0,1} .
[Internally it
uses](4593dc2f05/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h (L368))
this information points:
        * #`nonfull_slabs`
        * #`total_slabs`
        * #free regs in the ptr slab

## Jemalloc API (via ctl interface)


[BATCH][`experimental_utilization_batch_query_ctl`](4593dc2f05/deps/jemalloc/src/ctl.c (L4114))
: gets an array of pointers, returns for each pointer 3 values,

* number of free regions in the extent
* number of regions in the extent
* size of the extent in terms of bytes


[EXTENDED][`experimental_utilization_query_ctl`](4593dc2f05/deps/jemalloc/src/ctl.c (L3989))
:

* memory address of the extent a potential reallocation would go into
* number of free regions in the extent
* number of regions in the extent
* size of the extent in terms of bytes
* [stats-enabled]total number of free regions in the bin the extent
belongs to
* [stats-enabled]total number of regions in the bin the extent belongs
to

### `experimental_utilization_batch_query_ctl` vs valkey
`je_defrag_hint`?
[good]
   - We can query pointers in a batch, reduce the overall overhead
- The per ptr decision algorithm is not within jemalloc api, jemalloc
only provides information, valkey can tune\configure\optimize easily

 
[bad]
- In the batch API we only know the utilization of the slab (of that
memory ptr), we don’t get the data about #`nonfull_slabs` and total
allocated regs.


## New functions:
1. `defrag_jemalloc_init`: Reducing the cost of call to je_ctl: use the
[MIB interface](https://jemalloc.net/jemalloc.3.html) to get a faster
calls. See this quote from the jemalloc documentation:
    
The mallctlnametomib() function provides a way to avoid repeated name
lookups for
applications that repeatedly query the same portion of the namespace,by
translating
a name to a “Management Information Base” (MIB) that can be passed
repeatedly to
    mallctlbymib().

6. `jemalloc_sz2binind_lgq*` : this api is to support reverse map
between bin size and it’s info without lookup. This mapping depends on
the number of size classes we have that are derived from
[`lg_quantum`](4593dc2f05/deps/Makefile (L115))
7. `defrag_jemalloc_get_frag_smallbins` : This function replaces
`frag_smallbins_bytes` the logic moved to the new file allocator_defrag
`defrag_jemalloc_should_defrag_multi` → `handle_results` - unpacks the
results
8. `should_defrag` : implements the same logic as the existing
implementation
[inside](9f8185f5c8/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h (L382))
je_defrag_hint
9. `defrag_jemalloc_should_defrag_multi` : implements the hint for an
array of pointers, utilizing the new batch api. currently only 1 pointer
is passed.


### Logical differences:

In order to get the information about #`nonfull_slabs` and #`regs`, we
use the query cycle to collect the information per size class. In order
to find the index of bin information given bin size, in o(1), we use
`jemalloc_sz2binind_lgq*` .


## Testing
This is the first draft. I did some initial testing that basically
fragmentation by reducing max memory and than waiting for defrag to
reach desired level. The test only serves as sanity that defrag is
succeeding eventually, no data provided here regarding efficiency and
performance.

### Test: 
1. disable `activedefrag`
2. run valkey benchmark on overlapping address ranges with different
block sizes
3. wait untill `used_memory` reaches 10GB
4. set `maxmemory` to 5GB and `maxmemory-policy` to `allkeys-lru`
5. stop load
6. wait for `mem_fragmentation_ratio` to reach 2
7. enable `activedefrag` - start test timer
8. wait until reach `mem_fragmentation_ratio` = 1.1

#### Results*:
(With this PR)Test results: ` 56 sec`
(Without this PR)Test results: `67 sec`

*both runs perform same "work" number of buffers moved to reach
fragmentation target

Next benchmarking is to compare to:
- DONE // existing `je_get_defrag_hint` 
- compare with naive defrag all: `int defrag_hint() {return 1;}`

---------

Signed-off-by: Zvi Schneider <ezvisch@amazon.com>
Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com>
Signed-off-by: zvi-code <54795925+zvi-code@users.noreply.github.com>
Co-authored-by: Zvi Schneider <ezvisch@amazon.com>
Co-authored-by: Zvi Schneider <zvi.schneider22@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-11-21 16:29:21 -08:00
..

This directory contains all Valkey dependencies, except for the libc that should be provided by the operating system.

  • Jemalloc is our memory allocator, used as replacement for libc malloc on Linux by default. It has good performances and excellent fragmentation behavior. This component is upgraded from time to time.
  • hiredis is the official C client library for Redis. It is used by redis-cli, redis-benchmark and Redis Sentinel. It is part of the Redis official ecosystem but is developed externally from the Redis repository, so we just upgrade it as needed.
  • linenoise is a readline replacement. It is developed by the same authors of Valkey but is managed as a separated project and updated as needed.
  • lua is Lua 5.1 with minor changes for security and additional libraries.
  • hdr_histogram Used for per-command latency tracking histograms.

How to upgrade the above dependencies

Jemalloc

Jemalloc is modified with changes that allow us to implement the Valkey active defragmentation logic. However this feature of Valkey is not mandatory and Valkey is able to understand if the Jemalloc version it is compiled against supports such Valkey-specific modifications. So in theory, if you are not interested in the active defragmentation, you can replace Jemalloc just following these steps:

  1. Remove the jemalloc directory.
  2. Substitute it with the new jemalloc source tree.
  3. Edit the Makefile located in the same directory as the README you are reading, and change the --with-version in the Jemalloc configure script options with the version you are using. This is required because otherwise Jemalloc configuration script is broken and will not work nested in another git repository.

However note that we change Jemalloc settings via the configure script of Jemalloc using the --with-lg-quantum option, setting it to the value of 3 instead of 4. This provides us with more size classes that better suit the Valkey data structures, in order to gain memory efficiency.

If you want to upgrade Jemalloc while also providing support for active defragmentation, in addition to the above steps you need to perform the following additional steps:

  1. In Jemalloc tree, file include/jemalloc/jemalloc_macros.h.in, make sure to add #define JEMALLOC_FRAG_HINT.
  2. Implement the function je_get_defrag_hint() inside src/jemalloc.c. You can see how it is implemented in the current Jemalloc source tree shipped with Valkey, and rewrite it according to the new Jemalloc internals, if they changed, otherwise you could just copy the old implementation if you are upgrading just to a similar version of Jemalloc.

Updating/upgrading jemalloc

The jemalloc directory is pulled as a subtree from the upstream jemalloc github repo. To update it you should run from the project root:

  1. git subtree pull --prefix deps/jemalloc https://github.com/jemalloc/jemalloc.git <version-tag> --squash
    This should hopefully merge the local changes into the new version.
  2. In case any conflicts arise (due to our changes) you'll need to resolve them and commit.
  3. Reconfigure jemalloc:
rm deps/jemalloc/VERSION deps/jemalloc/configure
cd deps/jemalloc
./autogen.sh --with-version=<version-tag>-0-g0
  1. Update jemalloc's version in deps/Makefile: search for "--with-version=<old-version-tag>-0-g0" and update it accordingly.
  2. Commit the changes (VERSION,configure,Makefile).

Hiredis

Hiredis is used by Sentinel, valkey-cli and valkey-benchmark. Like Valkey, uses the SDS string library, but not necessarily the same version. In order to avoid conflicts, this version has all SDS identifiers prefixed by hi.

  1. git subtree pull --prefix deps/hiredis https://github.com/redis/hiredis.git <version-tag> --squash
    This should hopefully merge the local changes into the new version.
  2. Conflicts will arise (due to our changes) you'll need to resolve them and commit.

Linenoise

Linenoise is rarely upgraded as needed. The upgrade process is trivial since Valkey uses a non modified version of linenoise, so to upgrade just do the following:

  1. Remove the linenoise directory.
  2. Substitute it with the new linenoise source tree.

Lua

We use Lua 5.1 and no upgrade is planned currently, since we don't want to break Lua scripts for new Lua features: in the context of Valkey Lua scripts the capabilities of 5.1 are usually more than enough, the release is rock solid, and we definitely don't want to break old scripts.

So upgrading of Lua is up to the Valkey project maintainers and should be a manual procedure performed by taking a diff between the different versions.

Currently we have at least the following differences between official Lua 5.1 and our version:

  1. Makefile is modified to allow a different compiler than GCC.
  2. We have the implementation source code, and directly link to the following external libraries: lua_cjson.o, lua_struct.o, lua_cmsgpack.o and lua_bit.o.
  3. There is a security fix in ldo.c, line 498: The check for LUA_SIGNATURE[0] is removed in order to avoid direct bytecode execution.
  4. In lstring.c, the luaS_newlstr function's hash calculation has been upgraded from a simple hash function to MurmurHash3, implemented within the same file, to enhance performance, particularly for operations involving large strings.

Hdr_Histogram

Updated source can be found here: https://github.com/HdrHistogram/HdrHistogram_c We use a customized version based on master branch commit e4448cf6d1cd08fff519812d3b1e58bd5a94ac42.

  1. Compare all changes under /hdr_histogram directory to upstream master commit e4448cf6d1cd08fff519812d3b1e58bd5a94ac42
  2. Copy updated files from newer version onto files in /hdr_histogram.
  3. Apply the changes from 1 above to the updated files.