/* Hash Tables Implementation.
 *
 * This file implements in-memory hash tables with insert/del/replace/find/
 * get-random-element operations. Hash tables will auto-resize if needed;
 * tables of power-of-two sizes are used, and collisions are handled by
 * chaining. See the source code for more information... :)
 *
 * Copyright (c) 2006-2012, Redis Ltd.
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are met:
 *
 *   * Redistributions of source code must retain the above copyright notice,
 *     this list of conditions and the following disclaimer.
 *   * Redistributions in binary form must reproduce the above copyright
 *     notice, this list of conditions and the following disclaimer in the
 *     documentation and/or other materials provided with the distribution.
 *   * Neither the name of Redis nor the names of its contributors may be used
 *     to endorse or promote products derived from this software without
 *     specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
 * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
 * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
 * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
 * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
 * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
 * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
 * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 * POSSIBILITY OF SUCH DAMAGE.
 */

#ifndef __DICT_H
#define __DICT_H

#include "mt19937-64.h"
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

#define DICT_OK 0
#define DICT_ERR 1

/* Hash table parameters */
#define HASHTABLE_MIN_FILL 8 /* Minimal hash table fill 12.5% (100/8) */
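
/* A worked sketch of the thresholds (derived from the change history, not an
 * API guarantee): utilization is used/size. With HASHTABLE_MIN_FILL == 8, a
 * dict expands at 100% utilization and shrinks below 100/8 = 12.5%. While
 * resizing is being avoided (e.g. during fork), both triggers are relaxed by
 * a factor of 4: expand at 400%, shrink below 12.5/4 = 3.125%, to limit
 * copy-on-write damage. */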

typedef struct dictEntry dictEntry; /* opaque */
typedef struct dict dict;

typedef struct dictType {
    /* Callbacks */
    uint64_t (*hashFunction)(const void *key);
    void *(*keyDup)(const void *key);
    int (*keyCompare)(const void *key1, const void *key2);
    void (*keyDestructor)(void *key);
    void (*valDestructor)(void *obj);
    int (*resizeAllowed)(size_t moreMem, double usedRatio);
    /* Invoked at the start of dict initialization/rehashing (old and new ht are already created) */
    void (*rehashingStarted)(dict *d);
    /* Invoked at the end of dict initialization/rehashing of all the entries from old to new ht. Both tables
     * still exist and are cleaned up after this callback. */
    void (*rehashingCompleted)(dict *d);
    /* Allow a dict to carry extra caller-defined metadata. The
     * extra memory is initialized to 0 when a dict is allocated. */
    size_t (*dictMetadataBytes)(dict *d);
    /* Method for copying a given key into a buffer of buf_len. Also used for
     * computing the length of the key + header when buf is NULL. */
    size_t (*embedKey)(unsigned char *buf, size_t buf_len, const void *key, unsigned char *header_size);

    /* Flags */
    /* The 'no_value' flag, if set, indicates that values are not used, i.e. the
     * dict is a set. When this flag is set, it's not possible to access the
     * value of a dictEntry and it's also impossible to use dictSetKey(). Entry
     * metadata can also not be used. */
    unsigned int no_value : 1;
    /* If no_value = 1 and all keys are odd (LSB=1), setting keys_are_odd = 1
     * enables one more optimization: to store a key without an allocated
     * dictEntry. */
    unsigned int keys_are_odd : 1;
    /* If the embedded_entry flag is set, it indicates that a copy of the key is created and the key is
     * embedded as part of the dict entry. */
    unsigned int embedded_entry : 1;
    /* Perform rehashing during resizing instead of incrementally rehashing across multiple steps. */
    unsigned int no_incremental_rehash : 1;
} dictType;
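
/* Usage sketch (illustrative, not part of this header's API): a dictType for
 * NUL-terminated C-string keys. exampleStrHash and exampleStrCompare are
 * hypothetical user code; only dictType and dictCreate() come from here.
 * Fields left out of the initializer stay NULL, which selects the default
 * behavior (e.g. pointer-equality compare, no key/value destructors).
 *
 *     static uint64_t exampleStrHash(const void *key) {
 *         const unsigned char *s = key;
 *         uint64_t h = 5381;
 *         while (*s) h = h * 33 + *s++;
 *         return h;
 *     }
 *     static int exampleStrCompare(const void *key1, const void *key2) {
 *         return strcmp(key1, key2) == 0;
 *     }
 *     dictType exampleStrDictType = {
 *         .hashFunction = exampleStrHash,
 *         .keyCompare = exampleStrCompare,
 *     };
 *     dict *d = dictCreate(&exampleStrDictType);
 */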

#define DICTHT_SIZE(exp) ((exp) == -1 ? 0 : (unsigned long)1 << (exp))
#define DICTHT_SIZE_MASK(exp) ((exp) == -1 ? 0 : (DICTHT_SIZE(exp)) - 1)

struct dict {
    dictType *type;

    dictEntry **ht_table[2];
    unsigned long ht_used[2];

    long rehashidx; /* rehashing not in progress if rehashidx == -1 */

    /* Keep small vars at end for optimal (minimal) struct padding */
    int16_t pauserehash;        /* If >0 rehashing is paused (<0 indicates coding error) */
    signed char ht_size_exp[2]; /* exponent of size. (size = 1<<exp) */
    int16_t pauseAutoResize;    /* If >0 automatic resizing is disallowed (<0 indicates coding error) */
    void *metadata[];
};
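
/* Bucket addressing sketch (illustrative; dictEntry internals are opaque here
 * and live in dict.c): because table sizes are powers of two, a mask replaces
 * a modulo when mapping a hash to a chain head:
 *
 *     uint64_t h = dictHashKey(d, key);
 *     unsigned long idx = h & DICTHT_SIZE_MASK(d->ht_size_exp[0]);
 *     dictEntry *head = d->ht_table[0][idx];
 *
 * While dictIsRehashing(d) is true, a lookup must probe ht_table[1] the same
 * way, since entries migrate incrementally from table 0 to table 1. */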

/* If safe is set to 1 this is a safe iterator, meaning you can call
 * dictAdd, dictFind, and other functions against the dictionary even while
 * iterating. Otherwise it is a non-safe iterator, and only dictNext()
 * should be called while iterating. */
typedef struct dictIterator {
    dict *d;
    long index;
    int table, safe;
    dictEntry *entry, *nextEntry;
    /* unsafe iterator fingerprint for misuse detection. */
    unsigned long long fingerprint;
} dictIterator;
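
/* Iteration sketch (illustrative; dictGetSafeIterator, dictNext,
 * dictReleaseIterator and the dictGetKey accessor are assumed to be declared
 * further down in this header):
 *
 *     dictIterator *it = dictGetSafeIterator(d);
 *     dictEntry *de;
 *     while ((de = dictNext(it)) != NULL) {
 *         void *key = dictGetKey(de);
 *         ... dictAdd, dictFind, etc. may be called here, since 'safe' is set ...
 *     }
 *     dictReleaseIterator(it);
 */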

typedef struct dictStats {
    int htidx;
    unsigned long buckets;
    unsigned long maxChainLen;
    unsigned long totalChainLen;
    unsigned long htSize;
    unsigned long htUsed;
    unsigned long *clvector;
} dictStats;

typedef void(dictScanFunction)(void *privdata, const dictEntry *de);
typedef void *(dictDefragAllocFunction)(void *ptr);
typedef void(dictDefragEntryCb)(void *privdata, void *ptr);
typedef struct {
    dictDefragAllocFunction *defragAlloc;   /* Used for entries etc. */
    dictDefragAllocFunction *defragKey;     /* Defrag-realloc keys (optional) */
    dictDefragAllocFunction *defragVal;     /* Defrag-realloc values (optional) */
    dictDefragEntryCb *defragEntryStartCb;  /* Callback invoked prior to the start of defrag of dictEntry. */
    dictDefragEntryCb *defragEntryFinishCb; /* Callback invoked after the defrag of dictEntry is tried. */
} dictDefragFunctions;
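
/* Initialization sketch (illustrative; exampleDefragAlloc is hypothetical
 * caller code that returns the new pointer when an allocation was moved, or
 * NULL to keep the old one). Callbacks marked optional above may be left NULL:
 *
 *     dictDefragFunctions defragfns = {
 *         .defragAlloc = exampleDefragAlloc,
 *     };
 */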

/* This is the initial size of every hash table */
#define DICT_HT_INITIAL_EXP 2
#define DICT_HT_INITIAL_SIZE (1 << (DICT_HT_INITIAL_EXP))

/* ------------------------------- Macros ------------------------------------*/
static inline int dictCompareKeys(dict *d, const void *key1, const void *key2) {
    if (d->type->keyCompare) {
        return d->type->keyCompare(key1, key2);
    } else {
        return (key1 == key2);
    }
}

#define dictMetadata(d) (&(d)->metadata)
#define dictMetadataSize(d) ((d)->type->dictMetadataBytes ? (d)->type->dictMetadataBytes(d) : 0)

#define dictHashKey(d, key) ((d)->type->hashFunction(key))
#define dictBuckets(d) (DICTHT_SIZE((d)->ht_size_exp[0]) + DICTHT_SIZE((d)->ht_size_exp[1]))
#define dictSize(d) ((d)->ht_used[0] + (d)->ht_used[1])
#define dictIsEmpty(d) ((d)->ht_used[0] == 0 && (d)->ht_used[1] == 0)
#define dictIsRehashing(d) ((d)->rehashidx != -1)
#define dictPauseRehashing(d) ((d)->pauserehash++)
#define dictResumeRehashing(d) ((d)->pauserehash--)
#define dictIsRehashingPaused(d) ((d)->pauserehash > 0)
#define dictPauseAutoResize(d) ((d)->pauseAutoResize++)
#define dictResumeAutoResize(d) ((d)->pauseAutoResize--)
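
/* Pairing sketch (illustrative; dictDelete is assumed to be declared further
 * down in this header): flows that do mass deletions can pause auto-resize so
 * any shrink is only attempted once, at the end:
 *
 *     dictPauseAutoResize(d);
 *     for (size_t i = 0; i < nkeys; i++) dictDelete(d, keys[i]);
 *     dictResumeAutoResize(d);
 *
 * Every pause must be balanced by a resume; the counter going negative
 * indicates a coding error, as noted on the struct field. */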

/* If our unsigned long type can store a 64 bit number, use a 64 bit PRNG. */
#if ULONG_MAX >= 0xffffffffffffffff
#define randomULong() ((unsigned long)genrand64_int64())
#else
#define randomULong() random()
#endif

typedef enum {
    DICT_RESIZE_ENABLE,
    DICT_RESIZE_AVOID,
    DICT_RESIZE_FORBID,
} dictResizeEnable;
|
|
|
|
|
2009-03-22 10:30:00 +01:00
|
|
|
/* API */
|
2021-08-05 08:25:58 +03:00
|
|
|
dict *dictCreate(dictType *type);
|
2010-04-15 18:07:57 +02:00
|
|
|
int dictExpand(dict *d, unsigned long size);
|
2020-11-22 21:22:49 +02:00
|
|
|
int dictTryExpand(dict *d, unsigned long size);
|
Shrink dict when deleting dictEntry (#12850)
2024-01-15 14:20:53 +08:00
|
|
|
int dictShrink(dict *d, unsigned long size);
|
2010-04-15 18:07:57 +02:00
|
|
|
int dictAdd(dict *d, void *key, void *val);
|
2016-05-09 18:01:09 +03:00
|
|
|
dictEntry *dictAddRaw(dict *d, void *key, dictEntry **existing);
|
Key as dict entry - memory optimization for sets (#11595)
If a dict has only keys, and no use of values, then a key can be stored directly in a
dict's hashtable. The key replaces the dictEntry. To distinguish between a key and
a dictEntry, we only use this optimization if the key pointer is odd, i.e. has the least
significant bit set. This is true for sds strings, since the sds header is always an odd
number of bytes.
Dict entries are used as a fallback when there is a hash collision. A special dict entry
without a value (only key and next) is used, so we save one word in this case too.
This saves 24 bytes per set element for large sets, and also gains some speed improvement
as a side effect (fewer allocations and cache misses).
A quick test adding 1M elements to a set using the command below resulted in memory
usage of 28.83M, compared to 46.29M on unstable.
That's about 18 bytes saved per set element on average.
eval 'for i=1,1000000,1 do redis.call("sadd", "myset", "x"..i) end' 0
Other changes:
Allocations are ensured to have at least 8-byte alignment on all systems. This affects 32-bit
builds compiled without HAVE_MALLOC_SIZE (not jemalloc or glibc), in which Redis
stores the size of each allocation, after this change in 8 bytes instead of the previous 4 bytes
per allocation. This is done so we can reliably use the 3 least significant bits in a pointer to
encode stuff.
2023-01-20 17:45:29 +01:00
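The tagging trick above comes down to a single bit test; an illustrative
sketch (the real dict.c encoding has more cases than this):

#include <stdint.h>

/* A bucket pointer is either a dictEntry * (even, since allocations are
 * at least 8-byte aligned) or an sds key stored directly (odd, because
 * the sds header is an odd number of bytes). */
static inline int entryIsKey(const void *bucket) {
    return ((uintptr_t)bucket) & 1;
}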
|
|
|
void *dictFindPositionForInsert(dict *d, const void *key, dictEntry **existing);
|
|
|
|
dictEntry *dictInsertAtPosition(dict *d, void *key, void *position);
|
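A hedged sketch of the two-phase insert pair above (assuming the dict is not
mutated between the two calls; `setOrUpdate` is illustrative): find the
position once, then either update the existing entry or insert without
hashing the key again.

static void setOrUpdate(dict *d, void *key, void *val) {
    dictEntry *existing;
    void *pos = dictFindPositionForInsert(d, key, &existing);
    if (pos == NULL) {
        dictSetVal(d, existing, val); /* key already present: update */
    } else {
        dictEntry *de = dictInsertAtPosition(d, key, pos);
        dictSetVal(d, de, val);
    }
}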
2016-09-14 16:43:38 +02:00
|
|
|
dictEntry *dictAddOrFind(dict *d, void *key);
|
2010-04-15 18:07:57 +02:00
|
|
|
int dictReplace(dict *d, void *key, void *val);
|
|
|
|
int dictDelete(dict *d, const void *key);
|
2021-08-05 08:25:58 +03:00
|
|
|
dictEntry *dictUnlink(dict *d, const void *key);
|
2016-05-09 18:01:09 +03:00
|
|
|
void dictFreeUnlinkedEntry(dict *d, dictEntry *he);
|
2022-11-30 17:56:36 +08:00
|
|
|
dictEntry *dictTwoPhaseUnlinkFind(dict *d, const void *key, dictEntry ***plink, int *table_index);
|
|
|
|
void dictTwoPhaseUnlinkFree(dict *d, dictEntry *he, dictEntry **plink, int table_index);
|
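Similarly, a sketch of the two-phase unlink flow above (`unlinkAndFree` is
illustrative): between the find and the free the entry stays linked, so its
key and value can still be examined, e.g. to decide how to reclaim the
value, before removal.

static int unlinkAndFree(dict *d, const void *key) {
    dictEntry **plink;
    int table;
    dictEntry *de = dictTwoPhaseUnlinkFind(d, key, &plink, &table);
    if (de == NULL) return 0; /* not found */
    /* de is still reachable here: dictGetKey(de)/dictGetVal(de) are valid */
    dictTwoPhaseUnlinkFree(d, de, plink, table);
    return 1;
}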
2010-04-15 18:07:57 +02:00
|
|
|
void dictRelease(dict *d);
|
|
|
|
dictEntry *dictFind(dict *d, const void *key);
|
2010-04-16 10:04:51 +02:00
|
|
|
void *dictFetchValue(dict *d, const void *key);
|
Optimize resizing hash table to resize not only non-empty dicts. (#12819)
The function `tryResizeHashTables` only attempts to shrink the dicts
that have keys (a change from #11695). This was a serious problem until the
change in #12850, since it meant that if all keys are deleted, we won't
shrink the dict.
But still, both dictShrink and dictExpand may be blocked by a fork child
process; therefore, the cron job needs to perform both dictShrink and
dictExpand, not just for non-empty dicts, but for all dicts in DBs.
What this PR does:
1. Try to resize all dicts in DBs (not just non-empty ones, as it was
since #12850)
2. Handle both shrink and expand (not just shrink, as it was since
forever)
3. Refactor some APIs about dict resizing (get rid of `htNeedsResize`,
`htNeedsShrink`, and `dictShrinkToFit`, and expose `dictShrinkIfNeeded`
and `dictExpandIfNeeded`, which already contain all the code of the
functions we got rid of, to make the APIs neater)
4. In the `Don't rehash if redis has child process` test, now that cron
does the resizing, we no longer need to write to the DB after the child
process is killed, and can wait for the cron to expand the hash table.
2024-01-30 03:02:07 +08:00
|
|
|
int dictShrinkIfNeeded(dict *d);
|
|
|
|
int dictExpandIfNeeded(dict *d);
|
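A minimal sketch of the cron behavior described above (`dicts` stands in
for iterating all DB dicts): both directions are attempted, and each call
is expected to be a no-op when the dict is already well sized or the
current resize policy disallows resizing.

static void tryResizeAll(dict **dicts, int n) {
    for (int i = 0; i < n; i++) {
        dictShrinkIfNeeded(dicts[i]); /* utilization below ~10% */
        dictExpandIfNeeded(dicts[i]); /* utilization above 100% */
    }
}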
2023-01-11 09:57:10 +01:00
|
|
|
void dictSetKey(dict *d, dictEntry *de, void *key);
|
|
|
|
void dictSetVal(dict *d, dictEntry *de, void *val);
|
|
|
|
void dictSetSignedIntegerVal(dictEntry *de, int64_t val);
|
|
|
|
void dictSetUnsignedIntegerVal(dictEntry *de, uint64_t val);
|
|
|
|
void dictSetDoubleVal(dictEntry *de, double val);
|
|
|
|
int64_t dictIncrSignedIntegerVal(dictEntry *de, int64_t val);
|
|
|
|
uint64_t dictIncrUnsignedIntegerVal(dictEntry *de, uint64_t val);
|
|
|
|
double dictIncrDoubleVal(dictEntry *de, double val);
|
|
|
|
void *dictGetKey(const dictEntry *de);
|
|
|
|
void *dictGetVal(const dictEntry *de);
|
|
|
|
int64_t dictGetSignedIntegerVal(const dictEntry *de);
|
|
|
|
uint64_t dictGetUnsignedIntegerVal(const dictEntry *de);
|
|
|
|
double dictGetDoubleVal(const dictEntry *de);
|
|
|
|
double *dictGetDoubleValPtr(dictEntry *de);
|
|
|
|
size_t dictMemUsage(const dict *d);
|
2024-07-02 15:45:37 -07:00
|
|
|
size_t dictEntryMemUsage(dictEntry *de);
|
2010-04-15 18:07:57 +02:00
|
|
|
dictIterator *dictGetIterator(dict *d);
|
2011-05-10 10:15:50 +02:00
|
|
|
dictIterator *dictGetSafeIterator(dict *d);
|
2022-09-08 04:57:43 +03:00
|
|
|
void dictInitIterator(dictIterator *iter, dict *d);
|
|
|
|
void dictInitSafeIterator(dictIterator *iter, dict *d);
|
|
|
|
void dictResetIterator(dictIterator *iter);
|
2009-03-22 10:30:00 +01:00
|
|
|
dictEntry *dictNext(dictIterator *iter);
|
Improve multithreaded performance with memory prefetching (#861)
This PR utilizes the IO threads to execute commands in batches, allowing
us to prefetch the dictionary data in advance.
After making the IO threads asynchronous and offloading more work to
them in the first two PRs, the `lookupKey` function becomes a main
bottleneck, taking about 50% of the main-thread time (tested with the
SET command). This is because the Valkey dictionary is a straightforward
but inefficient chained hash implementation. While traversing the hash
linked lists, every access to either a dictEntry structure, pointer to
key, or a value object requires, with high probability, an expensive
external memory access.
### Memory Access Amortization
Memory Access Amortization (MAA) is a technique designed to optimize the
performance of dynamic data structures by reducing the impact of memory
access latency. It is applicable when multiple operations need to be
executed concurrently. The principle behind it is that for certain
dynamic data structures, executing operations in a batch is more
efficient than executing each one separately.
Rather than executing operations sequentially, this approach interleaves
the execution of all operations. This is done in such a way that
whenever a memory access is required during an operation, the program
prefetches the necessary memory and transitions to another operation.
This ensures that when one operation is blocked awaiting memory access,
other memory accesses are executed in parallel, thereby reducing the
average access latency.
We applied this method in the development of `dictPrefetch`, which takes
as parameters a vector of keys and dictionaries. It ensures that all
memory addresses required to execute dictionary operations for these
keys are loaded into the L1-L3 caches when executing commands.
Essentially, `dictPrefetch` is an interleaved execution of dictFind for
all the keys.
**Implementation details**
When the main thread iterates over the `clients-pending-io-read`, for
clients with ready-to-execute commands (i.e., clients for which the IO
thread has parsed the commands), a batch of up to 16 commands is
created. Initially, the command's argv, which was allocated by the IO
thread, is prefetched into the main thread's L1 cache. Subsequently, all
the dict entries and values required for the commands are prefetched
from the dictionary before the command execution. Only then will the
commands be executed.
---------
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
2024-08-27 07:10:44 +03:00
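The interleaving idea reduces to staged passes over a batch; a simplified
sketch, not the actual dictPrefetch code (the chained-table layout and the
names here are hypothetical):

#include <stdint.h>

struct entry { void *key; void *val; struct entry *next; };

/* Pay the batch's cache misses in parallel: prefetch every key's bucket
 * head first, then the first entry of each chain, before any lookup
 * actually dereferences them. */
static void prefetchBatch(struct entry **table, unsigned long mask,
                          const uint64_t *hashes, int n) {
    for (int i = 0; i < n; i++)              /* pass 1: bucket heads */
        __builtin_prefetch(&table[hashes[i] & mask]);
    for (int i = 0; i < n; i++) {            /* pass 2: first entries */
        struct entry *he = table[hashes[i] & mask];
        if (he) __builtin_prefetch(he);
    }
    /* A full implementation keeps interleaving down each chain and into
     * the key and value objects, as the description above explains. */
}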
|
|
|
dictEntry *dictGetNext(const dictEntry *de);
|
2009-03-22 10:30:00 +01:00
|
|
|
void dictReleaseIterator(dictIterator *iter);
|
2010-04-15 18:07:57 +02:00
|
|
|
dictEntry *dictGetRandomKey(dict *d);
|
2019-02-18 18:27:18 +01:00
|
|
|
dictEntry *dictGetFairRandomKey(dict *d);
|
2015-02-06 15:48:42 +01:00
|
|
|
unsigned int dictGetSomeKeys(dict *d, dictEntry **des, unsigned int count);
|
2023-05-24 16:27:44 +03:00
|
|
|
void dictGetStats(char *buf, size_t bufsize, dict *d, int full);
|
2021-12-13 20:16:25 +01:00
|
|
|
uint64_t dictGenHashFunction(const void *key, size_t len);
|
|
|
|
uint64_t dictGenCaseHashFunction(const unsigned char *buf, size_t len);
|
2021-08-05 08:25:58 +03:00
|
|
|
void dictEmpty(dict *d, void(callback)(dict *));
|
2023-01-10 08:40:40 +02:00
|
|
|
void dictSetResizeEnabled(dictResizeEnable enable);
|
2010-04-15 11:59:13 +02:00
|
|
|
int dictRehash(dict *d, int n);
|
Unified db rehash method for both standalone and cluster (#12848)
After #11695, we added two functions `rehashingStarted` and
`rehashingCompleted` to the dict structure. We also registered two
handlers for the main database's dict and expire structures. This allows
the main database to record the dict in `rehashing` list when rehashing
starts. Later, in `serverCron`, the `incrementallyRehash` function is
continuously called to perform the rehashing operation. However,
currently, when rehashing is completed, `rehashingCompleted` does not
remove the dict from the `rehashing` list. This results in the
`rehashing` list containing many invalid dicts. Although subsequent cron
checks and removes dicts that don't require rehashing, it is still
inefficient.
This PR implements the functionality to remove the dict from the
`rehashing` list in `rehashingCompleted`. This is achieved by adding
`metadata` to the dict structure, which keeps track of its position in
the `rehashing` list, allowing for quick removal. This approach avoids
storing duplicate dicts in the `rehashing` list.
Additionally, there are other modifications:
1. Whether in standalone or cluster mode, the dict in database is
inserted into the rehashing linked list when rehashing starts. This
eliminates the need to distinguish between standalone and cluster mode
in `incrementallyRehash`. The function only needs to focus on the dicts
in the `rehashing` list that require rehashing.
2. `rehashing` list is moved from per-database to Redis server level.
This decouples `incrementallyRehash` from the database ID, and in
standalone mode, there is no need to iterate over all databases,
avoiding unnecessary access to databases that do not require rehashing.
In the future, even if unsharded-cluster mode supports multiple
databases, there will be no risk involved.
3. The insertion and removal operations of dict structures in the
`rehashing` list are decoupled from `activerehashing` config.
`activerehashing` only controls whether `incrementallyRehash` is
executed in serverCron. There is no need for additional steps when
modifying the `activerehashing` switch, as in #12705.
2023-12-15 10:42:53 +08:00
|
|
|
int dictRehashMicroseconds(dict *d, uint64_t us);
|
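With the rehashing list in place, the cron reduces to a time-boxed loop
over it; a hedged sketch (the list accessor is hypothetical):

extern dict *nextRehashingDict(void); /* hypothetical: next dict in the rehashing list */

static void incrementallyRehash(uint64_t slice_us, int max_dicts) {
    dict *d;
    while (max_dicts-- > 0 && (d = nextRehashingDict()) != NULL)
        dictRehashMicroseconds(d, slice_us); /* bounded work per dict */
}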
2017-02-20 16:09:54 +01:00
|
|
|
void dictSetHashFunctionSeed(uint8_t *seed);
|
|
|
|
uint8_t *dictGetHashFunctionSeed(void);
|
2022-11-20 23:23:54 +01:00
|
|
|
unsigned long dictScan(dict *d, unsigned long v, dictScanFunction *fn, void *privdata);
|
Key as dict entry - memory optimization for sets (#11595)
2023-01-20 17:45:29 +01:00
|
|
|
unsigned long
|
2024-12-03 08:42:29 -08:00
|
|
|
dictScanDefrag(dict *d, unsigned long v, dictScanFunction *fn, const dictDefragFunctions *defragfns, void *privdata);
|
2017-11-30 11:34:37 +08:00
|
|
|
uint64_t dictGetHash(dict *d, const void *key);
|
2023-12-10 16:55:30 +08:00
|
|
|
void dictRehashingInfo(dict *d, unsigned long long *from_size, unsigned long long *to_size);
|
2009-03-22 10:30:00 +01:00
|
|
|
|
Replace cluster metadata with slot specific dictionaries (#11695)
This is an implementation of https://github.com/redis/redis/issues/10589 that eliminates 16 bytes per entry in cluster mode, which are currently used to create a linked list between entries in the same slot. The main idea is splitting the main dictionary into 16k smaller dictionaries (one per slot), so we can perform all slot-specific operations, such as iteration, without any additional info in the `dictEntry`. For Redis cluster, the expectation is that there will be a larger number of keys, so the fixed overhead of 16k dictionaries will be negligible. The expire dictionary is also split up so that each slot is logically decoupled, so that in subsequent revisions we will be able to atomically flush a slot of data.
## Important changes
* Incremental rehashing - one big change here is that it's not one, but rather up to 16k dictionaries that can be rehashing at the same time. In order to keep track of them, we introduce a separate queue for dictionaries that are rehashing. Also, instead of rehashing a single dictionary, the cron job will now try to rehash as many as it can in 1ms.
* getRandomKey - now needs to not only select a random key from a random bucket, but also to select a random dictionary. Fairness is a major concern here, as it's possible that keys can be unevenly distributed across the slots. In order to address this, we introduced a binary index tree (also known as a Fenwick tree). With that data structure we are able to efficiently find a random slot using binary search in O(log^2(slot count)) time.
* Iteration efficiency - when iterating a dictionary with a lot of empty slots, we want to skip them efficiently. We can do this using the same binary index that is used for random key selection; this index allows us to find the slot for a specific key index. For example, if there are 10 keys in slot 0, we can quickly find the slot that contains the 11th key using binary search on top of the binary index tree.
* scan API - in order to perform a scan across the entire DB, the cursor now needs to save not only the position within the dictionary but also the slot id. In this change we append the slot id into the LSBs of the cursor so it can be passed around between client and server. This has an interesting side effect: you'll now be able to start scanning a specific slot by simply providing the slot id as a cursor value. The plan is to not document this as defined behavior, however. It's also worth noting the SCAN API is now technically incompatible with previous versions, although practically we don't believe it's an issue.
* Checksum calculation optimizations - during command execution, we know that all of the keys are from the same slot (outside of a few notable exceptions such as cross-slot scripts and modules). We don't want to compute the checksum multiple times, hence we rely on the cached slot id in the client during command execution. All operations that access random keys should either pass in the known slot or recompute the slot.
* Slot info in RDB - in order to resize individual dictionaries correctly while loading an RDB, it's not enough to know the total number of keys (of course we could approximate the number of keys per slot, but it won't be precise). To address this issue, we've added additional metadata into the RDB that contains the number of keys in each slot, which can be used as a hint during loading.
* DB size - besides the `DBSIZE` API, we need to know the size of the DB in many places. In order to avoid scanning all dictionaries and summing up their sizes in a loop, we've introduced a new field into `redisDb` that keeps track of `key_count`. This way we can keep the DBSIZE operation O(1). The same is kept for O(1) expires computation as well.
## Performance
This change improves SET performance in cluster mode by ~5%; most of the gains come from not having to maintain linked lists for keys in a slot. Non-cluster mode has the same performance. For workloads that rely on evictions, the performance is similar because of the extra overhead of finding keys to evict.
RDB loading performance is slightly reduced, as the slot of each key needs to be computed during the load.
## Interface changes
* Removed `overhead.hashtable.slot-to-keys` from `MEMORY STATS`
* Scan API will now require 64 bits to store the cursor, even on 32-bit systems, as the slot information will be stored.
* New RDB version to support the new op code for SLOT information.
---------
Co-authored-by: Vitaly Arbuzov <arvit@amazon.com>
Co-authored-by: Harkrishn Patro <harkrisp@amazon.com>
Co-authored-by: Roshan Khatri <rvkhatri@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
2023-10-14 23:58:26 -07:00
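The cursor encoding mentioned above can be pictured with a small sketch
(constants and names are hypothetical; the real packing lives outside this
header):

#include <stdint.h>

#define SLOT_BITS 14 /* 16384 slots */

/* Slot id in the LSBs, per-dict scan cursor in the upper bits; this is
 * why the cursor now needs 64 bits even on 32-bit systems. */
static uint64_t composeCursor(uint64_t dictCursor, uint64_t slot) {
    return (dictCursor << SLOT_BITS) | slot;
}
static uint64_t cursorSlot(uint64_t cursor) {
    return cursor & ((1ULL << SLOT_BITS) - 1);
}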
|
|
|
size_t dictGetStatsMsg(char *buf, size_t bufsize, dictStats *stats, int full);
|
|
|
|
dictStats *dictGetStatsHt(dict *d, int htidx, int full);
|
|
|
|
void dictCombineStats(dictStats *from, dictStats *into);
|
|
|
|
void dictFreeStats(dictStats *stats);
|
|
|
|
|
2009-03-22 10:30:00 +01:00
|
|
|
#endif /* __DICT_H */
|