This PR builds upon the [previous entry prefetching
optimization](https://github.com/valkey-io/valkey/pull/1501) to further
enhance performance by implementing value prefetching for hashtable
iterators.
## Implementation
Modified `hashtableInitIterator` to accept a new flags parameter,
allowing control over iterator behavior.
Implemented conditional value prefetching within `hashtableNext` based
on the new `HASHTABLE_ITER_PREFETCH_VALUES` flag.
When the flag is set, hashtableNext now calls `prefetchBucketValues` at
the start of each new bucket, preemptively loading the values of filled
entries into the CPU cache.
The actual prefetching of values is performed using type-specific
callback functions implemented in `server.c`:
- For `robj` the `hashtableObjectPrefetchValue` callback is used to
prefetch the value if not embeded.
This implementation is specifically focused on main database iterations
at this stage. Applying it to hashtables that hold other object types
should not be problematic, but its performance benefits for those cases
will need to be proven through testing and benchmarking.
## Performance
### Setup:
- 64cores Graviton 3 Amazon EC2 instance.
- 50 mil keys with different value sizes.
- Running valkey server over RAM file system.
- crc checksum and comperssion off.
### Action
- save command.
### Results
The results regarding the duration of “save” command was taken from
“info all” command.
```
+--------------------+------------------+------------------+
| Prefetching | Value size (byte)| Time (seconds) |
+--------------------+------------------+------------------+
| No | 100 | 20.112279 |
| Yes | 100 | 12.758519 |
| No | 40 | 16.945366 |
| Yes | 40 | 10.902022 |
| No | 20 | 9.817000 |
| Yes | 20 | 9.626821 |
| No | 10 | 9.71510 |
| Yes | 10 | 9.510565 |
+--------------------+------------------+------------------+
```
The results largely align with our expectations, showing significant
improvements for larger values (100 bytes and 40 bytes) that are stored
outside the robj. For smaller values (20 bytes and 10 bytes) that are
embedded within the robj, we see almost no improvement, which is as
expected.
However, the small improvement observed even for these embedded values
is somewhat surprising. Given that we are not actively prefetching these
embedded values, this minor performance gain was not anticipated.
perf record on save command **without** value prefetching:
```
--99.98%--rdbSaveDb
|
|--91.38%--rdbSaveKeyValuePair
| |
| |--42.72%--rdbSaveRawString
| | |
| | |--26.69%--rdbWriteRaw
| | | |
| | | --25.75%--rioFileWrite.lto_priv.0
| | |
| | --15.41%--rdbSaveLen
| | |
| | |--7.58%--rdbWriteRaw
| | | |
| | | --7.08%--rioFileWrite.lto_priv.0
| | | |
| | | --6.54%--_IO_fwrite
| | |
| | |
| | --7.42%--rdbWriteRaw.constprop.1
| | |
| | --7.18%--rioFileWrite.lto_priv.0
| | |
| | --6.73%--_IO_fwrite
| |
| |
| |--40.44%--rdbSaveStringObject
| |
| --7.62%--rdbSaveObjectType
| |
| --7.39%--rdbWriteRaw.constprop.1
| |
| --7.04%--rioFileWrite.lto_priv.0
| |
| --6.59%--_IO_fwrite
|
|
--7.33%--hashtableNext.constprop.1
|
--6.28%--prefetchNextBucketEntries.lto_priv.0
```
perf record on save command **with** value prefetching:
```
rdbSaveRio
|
--99.93%--rdbSaveDb
|
|--79.81%--rdbSaveKeyValuePair
| |
| |--66.79%--rdbSaveRawString
| | |
| | |--42.31%--rdbWriteRaw
| | | |
| | | --40.74%--rioFileWrite.lto_priv.0
| | |
| | --23.37%--rdbSaveLen
| | |
| | |--11.78%--rdbWriteRaw
| | | |
| | | --11.03%--rioFileWrite.lto_priv.0
| | | |
| | | --10.30%--_IO_fwrite
| | | |
| | |
| | --10.98%--rdbWriteRaw.constprop.1
| | |
| | --10.44%--rioFileWrite.lto_priv.0
| | |
| | --9.74%--_IO_fwrite
| | |
| |
| |--11.33%--rdbSaveObjectType
| | |
| | --10.96%--rdbWriteRaw.constprop.1
| | |
| | --10.51%--rioFileWrite.lto_priv.0
| | |
| | --9.75%--_IO_fwrite
| | |
| |
| --0.77%--rdbSaveStringObject
|
--18.39%--hashtableNext
|
|--10.04%--hashtableObjectPrefetchValue
|
--6.06%--prefetchNextBucketEntries
```
Conclusions:
The prefetching strategy appears to be working as intended, shifting the
performance bottleneck from data access to I/O operations.
The significant reduction in rdbSaveStringObject time suggests that
string objects(which are the values) are being accessed more
efficiently.
Signed-off-by: NadavGigi <nadavgigi102@gmail.com>
This PR replaces dict with the new hashtable data structure in the HASH
datatype. There is a new struct for hashtable items which contains a
pointer to value sds string and the embedded key sds string. These
values were previously stored in dictEntry. This structure is kept
opaque so we can easily add small value embedding or other optimizations
in the future.
closes#1095
---------
Signed-off-by: Rain Valentine <rsg000@gmail.com>
This PR replaces dict with hashtable in the ZSET datatype. Instead of
mapping key to score as dict did, the hashtable maps key to a node in
the skiplist, which contains the score. This takes advantage of
hashtable performance improvements and saves 15 bytes per set item - 24
bytes overhead before, 9 bytes after.
Closes#1096
---------
Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
In this commit we move all structures and functions declarations related
to Valkey modules from `server.h` to the recently added `module.h` file.
This re-organization makes it easier for new contributors to find the
valkey modules related code, as well as reducing the compilation times
when changes are made to the modules code.
---------
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
The new `hashtable` provides faster lookups and uses less memory than
`dict`.
A TCL test case "SRANDMEMBER with a dict containing long chain" is
deleted because it's covered by a hashtable unit test
"test_random_entry_with_long_chain", which is already present.
This change also moves some logic from dismissMemory (object.c) to
zmadvise_dontneed (zmalloc.c), so the hashtable implementation which
needs the dismiss functionality doesn't need to depend on object.c and
server.h.
This PR follows #1186.
---------
Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Instead of a dictEntry with pointers to key and value, the hashtable
has a pointer directly to the value (robj) which can hold an embedded
key and acts as a key-value in the hashtable. This minimizes the number
of pointers to follow and thus the number of memory accesses to lookup
a key-value pair.
Keys robj
hashtable
+-------+ +-----------------------+
| 0 | | type, encoding, LRU |
| 1 ------->| refcount, expire |
| 2 | | ptr |
| ... | | optional embedded key |
+-------+ | optional embedded val |
+-----------------------+
The expire timestamp (TTL) is also stored in the robj, if any. The expire
hash table points to the same robj.
Overview of changes:
* Replace dict with hashtable in kvstore (kvstore.c)
* Add functions for embedding key and expire in robj (object.c)
* When there's unused space, reserve an expire field to avoid realloting
it later if expire is added.
* Always reserve space for expire for large key names to avoid realloc
if it's set later.
* Update db functions (db.c)
* dbAdd, setKey and setExpire reallocate the object when embedding a key
* setKey does not increment the reference counter, since it would require
duplicating the object. This responsibility is moved to the caller.
* Remove logic for shared integer objects as values in the database. The keys
are now embedded in the objects, so all objects in the database need to be
unique. Thus, we can't use shared objects as values. Also delete test cases
for shared integers.
* Adjust various commands to the changes mentioned above.
* Adjust defrag code
* Improvement: Don't access the expires table before defrag has actually
reallocated the object.
* Adjust test cases that were using hard-coded sizes for dict when realloc
would happen, and some other adjustments in test cases.
* Adjust memory prefetch for new hash table implementation in IO-threading,
using new `hashtableIncrementalFind` API
* Adjust offloading of free() to IO threads: Object free to be done in main
thread while keeping obj->ptr offloading in IO-thread since the DB object is
now allocated by the main-thread and not by the IO-thread as it used to be.
* Let expireIfNeeded take an optional value, to avoid looking up the expires
table when possible.
---------
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: uriyage <78144248+uriyage@users.noreply.github.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Uri Yagelnik <uriy@amazon.com>
**Motivation**
Copy-on-write (COW) amplification refers to the issue where writing to a
small object leads to the entire page being cloned, resulting in
inefficient memory usage. This issue arises during the BGSAVE process,
which can be particularly problematic on instances with limited memory.
If the BGSAVE process could release unneeded memory, it could reduce
memory consumption. To address this, the BGSAVE process calls the
`madvise` function to signal the operating system to reclaim the buffer.
However, this approach does not work for buffers smaller than a page
(usually 4KiB). Even after multiple such calls, where a full page may be
free, the operating system will not reclaim it.
To solve this issue, we can call `zfree` directly. This allows the
allocator (jemalloc) to handle the bookkeeping and release pages when
buffers are no longer needed. This approach reduces copy-on-write
events.
**Benchmarks**
To understand how usage of `zfree` affects BGSAVE and the memory
consumption I ran 45 benchmarks that compares my clonewith the vanilla
version. The benchmark has the following steps:
1. Start a new Valkey process
2. Fill the DB with data sequentially
3. Run a warmup to randomize the memory layout
4. Introduce fragmentation by deleting part of the keys
5. In parallel:
1. Trigger BGSAVE
2. Start 80/20 get/set load
I played the following parameters to understand their influence:
1. Number of keys: 3M, 6M, and 12M.
2. Data size. While key themselves are of fixed length ~30 bytes, the
value size is 120, 250, 500, 1000, and 2000 bytes.
3. Fragmentation. I delete 5%, 10%, and 15% of the original key range.
I'm attaching a graph of BGSAVE process memory consumption. Instead of
all benchmarks, I show the most representative runs IMO.
<img width="1570" alt="3m-fixed"
src="https://github.com/user-attachments/assets/3dbbc528-01c1-4821-a3c2-6be455e7f78a">
For 2000 bytes values peak memory usage is ~53% compared to vanilla. The
peak happens at 57% BGSAVE progress.
For 500 bytes values the peak is ~80% compared to vanilla. And happens
at ~80% progress.
For 120 bytes the difference is under 5%, and the patched version could
even use more memory.

For 12M keys, the peak is ~85% of the vanilla’s. Happens at ~70% mark.
For 6M keys, the peak is ~87% of the vanilla’s. Happens at ~77% mark.
For 3M keys, the peak is ~87% of the vanilla’s Happens at ~80% mark.
**Changes**
The PR contains 2 changes:
1. Static buffer for RDB comrpession.
RDB compression leads to COW events even without any write load if we
use `zfree`. It happens because the compression functions allocates a
new buffer for each object. Together with freeing objects with `zfree`
it leads to reusing of the memory shared with the main process.
To deal with this problem, we use a pre-allocated constant 8K buffer for
compression. If the object size is too big for this buffer, than we fall
back to the ad hoc allocation behavior.
2. Freeing string objects instead of dismissing them
Call to `zfree` is more expensive than direct call to `madvise`. But
with #453 strings use the fast path – `zfree_with_size`. As a possible
next step we can optimize `zfree` for other data types as well.
---------
Signed-off-by: Vadym Khoptynets <vadymkh@amazon.com>
Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Introduce a `size_t` field into the rax struct to track allocation size.
Update the allocation size on rax insert and deletes.
Return the allocation size when `raxAllocSize` is called.
This size tracking is now used in MEMORY USAGE and MEMORY STATS in place
of the previous method based on sampling.
The module API allows to create sorted dictionaries, which are backed by
rax. Users now also get precise memory allocation for them (through
`ValkeyModule_MallocSizeDict`).
Fixes#677.
For the release notes:
* MEMORY USAGE and MEMORY STATS are now exact for streams, rather than
based on sampling.
---------
Signed-off-by: Guillaume Koenig <knggk@amazon.com>
Signed-off-by: Guillaume Koenig <106696198+knggk@users.noreply.github.com>
Co-authored-by: Joey <yzhaon@amazon.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This commit hopefully improves the formatting of the codebase by setting
ColumnLimit to 0 and hence stopping clang-format from trying to put as
much stuff in one line as possible.
This change enabled us to remove most of `clang-format off` directives
and fixed a bunch of lines that looked like this:
```c
#define KEY \
VALUE /* comment */
```
Additionally, one pair of `clang-format off` / `clang-format on` had
`clang-format off` as the second comment and hence didn't enable the
formatting for the rest of the file. This commit addresses this issue as
well.
Please tell me if anything in the changes seem off. If everything is
fine, I will add this commit to `.git-blame-ignore-revs` later.
---------
Signed-off-by: Mikhail Koviazin <mikhail.koviazin@aiven.io>
sdsAllocSize returns the correct size without consulting the
allocator. Which is much faster than consulting the allocator.
The only exception is SDS_TYPE_5, for which it has to
consult the allocator.
This PR also sets alloc field correctly for embedded string objects.
It assumes that no allocator would allocate a buffer larger
than `259 + sizeof(robj)` for embedded string. We use embedded strings
for strings up to 44 bytes. If this assumption is wrong, the whole
function would require a rewrite. In general case sds type adjustment
might be needed. Such logic should go to sds.c.
---------
Signed-off-by: Vadym Khoptynets <vadymkh@amazon.com>
Update references of copyright being assigned to Salvatore when it was
transferred to Redis Ltd. as per
https://github.com/valkey-io/valkey/issues/544.
---------
Signed-off-by: Pieter Cailliau <pieter@redis.com>
This PR incorporates changes related to key embedding described in the
https://github.com/redis/redis/issues/12216
With this change there will be no `key` pointer and embedded the key
within the `dictEntry`. 1 byte is used for additional bookkeeping.
Overall the saving would be 7 bytes on average.
Key changes:
New dict entry type introduced, which is now used as an entry for the
main dictionary:
```c
typedef struct {
union {
void *val;
uint64_t u64;
int64_t s64;
double d;
} v;
struct dictEntry *next; /* Next entry in the same hash bucket. */
uint8_t key_header_size; /* offset into key_buf where the key is located at. */
unsigned char key_buf[]; /* buffer with embedded key. */
} embeddedDictEntry;
```
One new function has been added to the dictType:
```c
size_t (*embedKey)(unsigned char *buf, size_t buf_len, const void *key, unsigned char *header_size);
```
Change is opt-in per dict type, hence sets, hashes and other types that
are using dictionary are not impacted.
With this change main dictionary now owns the data, so copy on insert in
dbAdd is no longer needed.
### Benchmarking results
TLDR; Around 9-10% memory usage reduction in overall memory usage for
scenario with key of 16 bytes and value of 8 bytes and 16 bytes. The
throughput per second varies but is similar or greater in most of the
run(s) with the changes against unstable (ae2d421).
---------
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
We added some clang-format off comments before we had decided on the
format configuration. Now, it turns out that turning formatting off is
often not necessary.
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
When LRU/LFU enabled, Valkey does not allow using shared objects, as
value objects may be shared among many different keys and they can't
share LRU/LFU information.
However `maxmemory-policy` is modifiable at runtime. If LRU/LFU is not
enabled at start, but then enabled when some shared objects are already
used, there could be some confusion in LRU/LFU information.
For `set` command it is OK since it is going to create a new object when
LRU/LFU enabled, but `get` command will not unshare the object and just
update LRU/LFU information.
So we may duplicate the object in this case. It is a one-time task for
each key using shared objects, unless this is the case for so many keys,
there should be no serious performance degradation.
Still, LRU will be updated anyway, no matter LRU/LFU is enabled or not,
because `OBJECT IDLETIME` needs it, unless `maxmemory-policy` is set to
LFU. So idle time of a key may still be messed up.
---------
Signed-off-by: chentianjie.ctj <chentianjie.ctj@alibaba-inc.com>
Signed-off-by: Chen Tianjie <TJ_Chen@outlook.com>
I have validated that these settings closely match the existing coding
style with one major exception on `BreakBeforeBraces`, which will be
`Attach` going forward. The mixed `BreakBeforeBraces` styles in the
current codebase are hard to imitate and also very odd IMHO - see below
```
if (a == 1) { /*Attach */
}
```
```
if (a == 1 ||
b == 2)
{ /* Why? */
}
```
Please do NOT merge just yet. Will add the github action next once the
style is reviewed/approved.
---------
Signed-off-by: Ping Xie <pingxie@google.com>
This is a preparation for adding clang-format.
These comments prevent automatic formatting in some places. With these
exceptions, we will be able to run clang-format on the rest of the code.
This is a preparation for #323.
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
1. Rename `REDIS_*` macros defined in object.c to `VALKEY_*`,
2. Rename `Redis` to `Valkey` , `redis-cli` to `valkey-cli` in logs
(i.e. put statement) and descriptions in object.c and
utils/create-cluster/README
---------
Signed-off-by: Sher Sun <sher.sun@huawei.com>
Co-authored-by: Sher Sun <sher.sun@huawei.com>
This includes comments used for module API documentation.
* Strategy for replacement: Regex search: `(//|/\*| \*|#).* ("|\()?(r|R)edis( |\.
|'|\n|,|-|\)|")(?!nor the names of its contributors)(?!Ltd.)(?!Labs)(?!Contributors.)`
* Don't edit copyright comments
* Replace "Redis version X.X" -> "Redis OSS version X.X" to distinguish
from newly licensed repository
* Replace "Redis Object" -> "Object"
* Exclude markdown for now
* Don't edit Lua scripting comments referring to redis.X API
* Replace "Redis Protocol" -> "RESP"
* Replace redis-benchmark, -cli, -server, -check-aof/rdb with "valkey-"
prefix
* Most other places, I use best judgement to either remove "Redis", or
replace with "the server" or "server"
Fixes#148
---------
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Sometimes we need to make fast judgement about why Redis is suddenly
taking more memory. One of the reasons is main DB's dicts doing
rehashing.
We may use `MEMORY STATS` to monitor the overhead memory of each DB, but
there still lacks a total sum to show an overall trend. So this PR adds
the total overhead of all DBs to `INFO MEMORY` section, together with
the total count of rehashing DB dicts, providing some intuitive metrics
about main dicts rehashing.
This PR adds the following metrics to INFO MEMORY
* `mem_overhead_db_hashtable_rehashing` - only size of ht[0] in
dictionaries we're rehashing (i.e. the memory that's gonna get released
soon)
and a similar ones to MEMORY STATS:
* `overhead.db.hashtable.lut` (complements the existing
`overhead.hashtable.main` and `overhead.hashtable.expires` which also
counts the `dictEntry` structs too)
* `overhead.db.hashtable.rehashing` - temporary rehashing overhead.
* `db.dict.rehashing.count` - number of top level dictionaries being
rehashed.
---------
Co-authored-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
Implement #12963
## Changes
1. large bins don't have external fragmentation or are at least
non-defraggable, so we should ignore the effect of
large bins when measuring fragmentation, and only measure fragmentation
of small bins. this affects both the allocator_frag* metrics and also
the active-defrag trigger
2. Adding INFO metrics for `muzzy` memory, which is memory returned to
the OS but still shows as RSS until the OS reclaims it.
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
We forgot to call quicklistSetOptions after createQuicklistObject,
in the sort store scenario, we will create a quicklist with default
fill or compress options.
This PR adds fill and depth parameters to createQuicklistObject to
specify that options need to be set after creating a quicklist.
This closes#12871.
release notes:
> Fix lists created by SORT STORE to respect list compression and
packing configs.
# Description
Gather most of the scattered `redisDb`-related code from the per-slot
dict PR (#11695) and turn it to a new data structure, `kvstore`. i.e.
it's a class that represents an array of dictionaries.
# Motivation
The main motivation is code cleanliness, the idea of using an array of
dictionaries is very well-suited to becoming a self-contained data
structure.
This allowed cleaning some ugly code, among others: loops that run twice
on the main dict and expires dict, and duplicate code for allocating and
releasing this data structure.
# Notes
1. This PR reverts the part of https://github.com/redis/redis/pull/12848
where the `rehashing` list is global (handling rehashing `dict`s is
under the responsibility of `kvstore`, and should not be managed by the
server)
2. This PR also replaces the type of `server.pubsubshard_channels` from
`dict**` to `kvstore` (original PR:
https://github.com/redis/redis/pull/12804). After that was done,
server.pubsub_channels was also chosen to be a `kvstore` (with only one
`dict`, which seems odd) just to make the code cleaner by making it the
same type as `server.pubsubshard_channels`, see
`pubsubtype.serverPubSubChannels`
3. the keys and expires kvstores are currenlty configured to allocate
the individual dicts only when the first key is added (unlike before, in
which they allocated them in advance), but they won't release them when
the last key is deleted.
Worth mentioning that due to the recent change the reply of DEBUG
HTSTATS changed, in case no keys were ever added to the db.
before:
```
127.0.0.1:6379> DEBUG htstats 9
[Dictionary HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
[Expires HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
```
after:
```
127.0.0.1:6379> DEBUG htstats 9
[Dictionary HT]
[Expires HT]
```
Change the calculation method of bytes_per_key to make it closer to
the true average key size. The calculation method is as follows:
mh->bytes_per_key = mh->total_keys ? (mh->dataset / mh->total_keys) : 0;
This is an implementation of https://github.com/redis/redis/issues/10589 that eliminates 16 bytes per entry in cluster mode, that are currently used to create a linked list between entries in the same slot. Main idea is splitting main dictionary into 16k smaller dictionaries (one per slot), so we can perform all slot specific operations, such as iteration, without any additional info in the `dictEntry`. For Redis cluster, the expectation is that there will be a larger number of keys, so the fixed overhead of 16k dictionaries will be The expire dictionary is also split up so that each slot is logically decoupled, so that in subsequent revisions we will be able to atomically flush a slot of data.
## Important changes
* Incremental rehashing - one big change here is that it's not one, but rather up to 16k dictionaries that can be rehashing at the same time, in order to keep track of them, we introduce a separate queue for dictionaries that are rehashing. Also instead of rehashing a single dictionary, cron job will now try to rehash as many as it can in 1ms.
* getRandomKey - now needs to not only select a random key, from the random bucket, but also needs to select a random dictionary. Fairness is a major concern here, as it's possible that keys can be unevenly distributed across the slots. In order to address this search we introduced binary index tree). With that data structure we are able to efficiently find a random slot using binary search in O(log^2(slot count)) time.
* Iteration efficiency - when iterating dictionary with a lot of empty slots, we want to skip them efficiently. We can do this using same binary index that is used for random key selection, this index allows us to find a slot for a specific key index. For example if there are 10 keys in the slot 0, then we can quickly find a slot that contains 11th key using binary search on top of the binary index tree.
* scan API - in order to perform a scan across the entire DB, the cursor now needs to not only save position within the dictionary but also the slot id. In this change we append slot id into LSB of the cursor so it can be passed around between client and the server. This has interesting side effect, now you'll be able to start scanning specific slot by simply providing slot id as a cursor value. The plan is to not document this as defined behavior, however. It's also worth nothing the SCAN API is now technically incompatible with previous versions, although practically we don't believe it's an issue.
* Checksum calculation optimizations - During command execution, we know that all of the keys are from the same slot (outside of a few notable exceptions such as cross slot scripts and modules). We don't want to compute the checksum multiple multiple times, hence we are relying on cached slot id in the client during the command executions. All operations that access random keys, either should pass in the known slot or recompute the slot.
* Slot info in RDB - in order to resize individual dictionaries correctly, while loading RDB, it's not enough to know total number of keys (of course we could approximate number of keys per slot, but it won't be precise). To address this issue, we've added additional metadata into RDB that contains number of keys in each slot, which can be used as a hint during loading.
* DB size - besides `DBSIZE` API, we need to know size of the DB in many places want, in order to avoid scanning all dictionaries and summing up their sizes in a loop, we've introduced a new field into `redisDb` that keeps track of `key_count`. This way we can keep DBSIZE operation O(1). This is also kept for O(1) expires computation as well.
## Performance
This change improves SET performance in cluster mode by ~5%, most of the gains come from us not having to maintain linked lists for keys in slot, non-cluster mode has same performance. For workloads that rely on evictions, the performance is similar because of the extra overhead for finding keys to evict.
RDB loading performance is slightly reduced, as the slot of each key needs to be computed during the load.
## Interface changes
* Removed `overhead.hashtable.slot-to-keys` to `MEMORY STATS`
* Scan API will now require 64 bits to store the cursor, even on 32 bit systems, as the slot information will be stored.
* New RDB version to support the new op code for SLOT information.
---------
Co-authored-by: Vitaly Arbuzov <arvit@amazon.com>
Co-authored-by: Harkrishn Patro <harkrisp@amazon.com>
Co-authored-by: Roshan Khatri <rvkhatri@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
A value of type long long is always less than 21 bytes when convert to a
string, so always meets the conditions for using embedded string object
which can always get memory reduction and performance gain (less calls
to the heap allocator).
Additionally, for the conversion of longlong type to sds, we also use a faster
algorithm (the one in util.c instead of the one that used to be in sds.c).
For the DECR command on 32-bit Redis, we get about a 5.7% performance
improvement. There will also be some performance gains for some commands
that heavily use sdscatfmt to convert numbers, such as INFO.
Co-authored-by: Oran Agra <oran@redislabs.com>
This pr can get two performance benefits:
1. Stop redundant initialization when most robj objects are created
2. LRU_CLOCK will no longer be called in io threads, so we can avoid the `atomicGet`
Another code optimization:
deleted the redundant judgment in dbSetValue, no matter in LFU or LRU, the lru field inold
robj is always the freshest (it is always updated in lookupkey), so we don't need to judge if in LFU
This change attempts to alleviate a minor memory usage degradation for Redis 6.2 and onwards when using rather large objects (~2k) in streams. Introduced in #6281, we pre-allocate the head nodes of a stream to be 4kb, to limit the amount of unnecessary initial reallocations that are done. However, if we only ever allocate one object because 2 objects exceeds the max_stream_entry_size, we never actually shrink it to fit the single item. This can lead to a lot of excessive memory usage. For smaller item sizes this becomes less of an issue, as the overhead decreases as the items become smaller in size.
This commit also changes the MEMORY USAGE of streams, since it was reporting the lpBytes instead of the allocated size. This introduced an observability issue when diagnosing the memory issue, since Redis reported the same amount of used bytes pre and post change, even though the new implementation allocated more memory.
As `sdsRemoveFreeSpace` have an impact on performance even if it is a no-op (see details at #11508).
Only call the function when there is a possibility that the string contains free space.
* For strings coming from the network, it's only if they're bigger than PROTO_MBULK_BIG_ARG
* For strings coming from scripts, it's only if they're smaller than LUA_CMD_OBJCACHE_MAX_LEN
* For strings coming from modules, it could be anything.
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: sundb <sundbcn@gmail.com>
In #7875 (Redis 6.2), we changed the sds alloc to be the usable allocation
size in order to:
> reduce the need for realloc calls by making the sds implicitly take over
the internal fragmentation
This change was done most sds functions, excluding `sdsRemoveFreeSpace` and
`sdsResize`, the reason is that in some places (e.g. clientsCronResizeQueryBuffer)
we call sdsRemoveFreeSpace when we see excessive free space and want to trim it.
so if we don't trim it exactly to size, the caller may still see excessive free space and
call it again and again.
However, this resulted in some excessive calls to realloc, even when there's no need
and it's gonna be a no-op (e.g. when reducing 15 bytes allocation to 13).
It turns out that a call for realloc with jemalloc can be expensive even if it ends up
doing nothing, so this PR adds a check using `je_nallocx`, which is cheap to avoid
the call for realloc.
in addition to that this PR unifies sdsResize and sdsRemoveFreeSpace into common
code. the difference between them was that sdsResize would avoid using SDS_TYPE_5,
since it want to keep the string ready to be resized again, while sdsRemoveFreeSpace
would permit using SDS_TYPE_5 and get an optimal memory consumption.
now both methods take a `would_regrow` argument that makes it more explicit.
the only actual impact of that is that in clientsCronResizeQueryBuffer we call both sdsResize
and sdsRemoveFreeSpace for in different cases, and we now prevent the use of SDS_TYPE_5 in both.
The new test that was added to cover this concern used to pass before this PR as well,
this PR is just a performance optimization and cleanup.
Benchmark:
`redis-benchmark -c 100 -t set -d 512 -P 10 -n 100000000`
on i7-9850H with jemalloc, shows improvement from 1021k ops/sec to 1067k (average of 3 runs).
some 4.5% improvement.
Co-authored-by: Oran Agra <oran@redislabs.com>
This change deletes the dictGetNext and dictGetNextRef functions, so the
dict API doesn't expose the next field at all.
The bucket function in dictScan is deleted. A separate dictScanDefrag function
is added which takes a defrag alloc function to defrag-reallocate the dict entries.
"Dirty" code accessing the dict internals in active defrag is removed.
An 'afterReplaceEntry' is added to dictType, which allows the dict user
to keep the dictEntry metadata up to date after reallocation/defrag/move.
Additionally, for updating the cluster slot-to-key mapping, after a dictEntry
has been reallocated, we need to know which db a dict belongs to, so we store
a pointer to the db in a new metadata section in the dict struct, which is
a new mechanism similar to dictEntry metadata. This adds some complexity but
provides better isolation.
PR #11290 added listpack encoding for sets, but was missing two things:
1. Correct handling of MEMORY USAGE (leading to an assertion).
2. Had an uncontrolled scratch buffer size in SRANDMEMBER leading to
OOM panic (reported in #11668). Fixed by copying logic from ZRANDMEMBER.
note that both issues didn't exist in any redis release.
In #11290, we added listpack encoding for SET object.
But forgot to support it in zuiFind, causes ZINTER, ZINTERSTORE,
ZINTERCARD, ZIDFF, ZDIFFSTORE to crash.
And forgot to support it in RM_ScanKey, causes it hang.
This PR add support SET listpack in zuiFind, and in RM_ScanKey.
And add tests for related commands to cover this case.
Other changes:
- There is no reason for zuiFind to go into the internals of the SET.
It can simply use setTypeIsMember and don't care about encoding.
- Remove the `#include "intset.h"` from server.h reduce the chance of
accidental intset API use.
- Move setTypeAddAux, setTypeRemoveAux and setTypeIsMemberAux
interfaces to the header.
- In scanGenericCommand, use setTypeInitIterator and setTypeNext
to handle OBJ_SET scan.
- In RM_ScanKey, improve hash scan mode, use lpGetValue like zset,
they can share code and better performance.
The zuiFind part fixes#11578
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
## Issue
During the client input/output buffer processing, the memory usage is
incrementally updated to keep track of clients going beyond a certain
threshold `maxmemory-clients` to be evicted. However, this additional
tracking activity leads to unnecessary CPU cycles wasted when no
client-eviction is required. It is applicable in two cases.
* `maxmemory-clients` is set to `0` which equates to no client eviction
(applicable to all clients)
* `CLIENT NO-EVICT` flag is set to `ON` which equates to a particular
client not applicable for eviction.
## Solution
* Disable client memory usage tracking during the read/write flow when
`maxmemory-clients` is set to `0` or `client no-evict` is `on`.
The memory usage is tracked only during the `clientCron` i.e. it gets
periodically updated.
* Cleanup the clients from the memory usage bucket when client eviction
is disabled.
* When the maxmemory-clients config is enabled or disabled at runtime,
we immediately update the memory usage buckets for all clients (tested
scanning 80000 took some 20ms)
Benchmark shown that this can improve performance by about 5% in
certain situations.
Co-authored-by: Oran Agra <oran@redislabs.com>
Improve memory efficiency of list keys
## Description of the feature
The new listpack encoding uses the old `list-max-listpack-size` config
to perform the conversion, which we can think it of as a node inside a
quicklist, but without 80 bytes overhead (internal fragmentation included)
of quicklist and quicklistNode structs.
For example, a list key with 5 items of 10 chars each, now takes 128 bytes
instead of 208 it used to take.
## Conversion rules
* Convert listpack to quicklist
When the listpack length or size reaches the `list-max-listpack-size` limit,
it will be converted to a quicklist.
* Convert quicklist to listpack
When a quicklist has only one node, and its length or size is reduced to half
of the `list-max-listpack-size` limit, it will be converted to a listpack.
This is done to avoid frequent conversions when we add or remove at the bounding size or length.
## Interface changes
1. add list entry param to listTypeSetIteratorDirection
When list encoding is listpack, `listTypeIterator->lpi` points to the next entry of current entry,
so when changing the direction, we need to use the current node (listTypeEntry->p) to
update `listTypeIterator->lpi` to the next node in the reverse direction.
## Benchmark
### Listpack VS Quicklist with one node
* LPUSH - roughly 0.3% improvement
* LRANGE - roughly 13% improvement
### Both are quicklist
* LRANGE - roughly 3% improvement
* LRANGE without pipeline - roughly 3% improvement
From the benchmark, as we can see from the results
1. When list is quicklist encoding, LRANGE improves performance by <5%.
2. When list is listpack encoding, LRANGE improves performance by ~13%,
the main enhancement is brought by `addListListpackRangeReply()`.
## Memory usage
1M lists(key:0~key:1000000) with 5 items of 10 chars ("hellohello") each.
shows memory usage down by 35.49%, from 214MB to 138MB.
## Note
1. Add conversion callback to support doing some work before conversion
Since the quicklist iterator decompresses the current node when it is released, we can
no longer decompress the quicklist after we convert the list.
Small sets with not only integer elements are listpack encoded, by default
up to 128 elements, max 64 bytes per element, new config `set-max-listpack-entries`
and `set-max-listpack-value`. This saves memory for small sets compared to using a hashtable.
Sets with only integers, even very small sets, are still intset encoded (up to 1G
limit, etc.). Larger sets are hashtable encoded.
This PR increments the RDB version, and has an effect on OBJECT ENCODING
Possible conversions when elements are added:
intset -> listpack
listpack -> hashtable
intset -> hashtable
Note: No conversion happens when elements are deleted. If all elements are
deleted and then added again, the set is deleted and recreated, thus implicitly
converted to a smaller encoding.
Following #10996, it forgot to modify RM_StringCompare in module.c
Modified RM_StringCompare, compareStringObjectsWithFlags,
compareStringObjects and collateStringObjects.
Remove some dead code in object.c, ziplist is no longer used in 7.0
Some backgrounds:
zipmap - hash: replaced by ziplist in #285
ziplist - hash: replaced by listpack in #8887
ziplist - zset: replaced by listpack in #9366
ziplist - list: replaced by quicklist (listpack) in #2143 / #9740
Moved the location of ziplist.h in the server.c
Add APIs to allow modules to compute the memory consumption of opaque objects owned by redis.
Without these, the mem_usage callbacks of module data types are useless in many cases.
Other changes:
Fix streamRadixTreeMemoryUsage to include the size of the rax structure itself