2009-03-22 10:30:00 +01:00
|
|
|
/* Hash Tables Implementation.
|
|
|
|
*
|
2012-10-02 21:58:36 -07:00
|
|
|
* This file implements in-memory hash tables with insert/del/replace/find/
|
|
|
|
* get-random-element operations. Hash tables will auto-resize if needed
|
2009-03-22 10:30:00 +01:00
|
|
|
* tables of power of two in size are used, collisions are handled by
|
|
|
|
* chaining. See the source code for more information... :)
|
|
|
|
*
|
2012-10-02 21:58:36 -07:00
|
|
|
* Copyright (c) 2006-2012, Salvatore Sanfilippo <antirez at gmail dot com>
|
2009-03-22 10:30:00 +01:00
|
|
|
* All rights reserved.
|
|
|
|
*
|
|
|
|
* Redistribution and use in source and binary forms, with or without
|
|
|
|
* modification, are permitted provided that the following conditions are met:
|
|
|
|
*
|
|
|
|
* * Redistributions of source code must retain the above copyright notice,
|
|
|
|
* this list of conditions and the following disclaimer.
|
|
|
|
* * Redistributions in binary form must reproduce the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer in the
|
|
|
|
* documentation and/or other materials provided with the distribution.
|
|
|
|
* * Neither the name of Redis nor the names of its contributors may be used
|
|
|
|
* to endorse or promote products derived from this software without
|
|
|
|
* specific prior written permission.
|
|
|
|
*
|
|
|
|
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
|
|
|
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
|
|
|
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
|
|
|
* ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
|
|
|
|
* LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
|
|
|
|
* CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
|
|
|
|
* SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
|
|
|
|
* INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
|
|
|
|
* CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
|
|
|
|
* ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
|
|
|
|
* POSSIBILITY OF SUCH DAMAGE.
|
|
|
|
*/
|
|
|
|
|
|
|
|
#ifndef __DICT_H
|
|
|
|
#define __DICT_H
|
|
|
|
|
2020-12-23 05:52:07 -08:00
|
|
|
#include "mt19937-64.h"
|
|
|
|
#include <limits.h>
|
|
|
|
#include <stdint.h>
|
|
|
|
#include <stdlib.h>
|
|
|
|
|
2009-03-22 10:30:00 +01:00
|
|
|
#define DICT_OK 0
|
|
|
|
#define DICT_ERR 1
|
|
|
|
|
Shrink dict when deleting dictEntry (#12850)
When we insert entries into dict, it may autonomously expand if needed.
However, when we delete entries from dict, it doesn't shrink to the
proper size. If there are few entries in a very large dict, it may cause
huge waste of memory and inefficiency when iterating.
The main keyspace dicts (keys and expires), are shrinked by cron
(`tryResizeHashTables` calls `htNeedsResize` and `dictResize`),
And some data structures such as zset and hash also do that (call
`htNeedsResize`) right after a loop of calls to `dictDelete`,
But many other dicts are completely missing that call (they can only
expand).
In this PR, we provide the ability to automatically shrink the dict when
deleting. The conditions triggering the shrinking is the same as
`htNeedsResize` used to have. i.e. we expand when we're over 100%
utilization, and shrink when we're below 10% utilization.
Additionally:
* Add `dictPauseAutoResize` so that flows that do mass deletions, will
only trigger shrinkage at the end.
* Rename `dictResize` to `dictShrinkToFit` (same logic as it used to
have, but better name describing it)
* Rename `_dictExpand` to `_dictResize` (same logic as it used to have,
but better name describing it)
related to discussion
https://github.com/redis/redis/pull/12819#discussion_r1409293878
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
2024-01-15 14:20:53 +08:00
|
|
|
/* Hash table parameters */
|
Change the threshold of dict expand, shrink and rehash (#12948)
Before this change (most recently modified in
https://github.com/redis/redis/pull/12850#discussion_r1421406393), The
trigger for normal expand threshold was 100% utilization and the trigger
for normal shrink threshold was 10% (HASHTABLE_MIN_FILL).
While during fork (DICT_RESIZE_AVOID), when we want to avoid rehash, the
trigger thresholds were multiplied by 5 (`dict_force_resize_ratio`),
meaning 500% for expand and 2% (100/10/5) for shrink.
However, in `dictRehash` (the incremental rehashing), the rehashing
threshold for shrinking during fork (DICT_RESIZE_AVOID) was 20% by
mistake.
This meant that if a shrinking is triggered when `dict_can_resize` is
`DICT_RESIZE_ENABLE` which the threshold is 10%, the rehashing can
continue when `dict_can_resize` is `DICT_RESIZE_AVOID`.
This would cause unwanted CopyOnWrite damage.
It'll make sense to change the thresholds of the rehash trigger and the
thresholds of the incremental rehashing the same, however, in one we
compare the size of the hash table to the number of records, and in the
other we compare the size of ht[0] to the size of ht[1], so the formula
is not exactly the same.
to make things easier we change all the thresholds to powers of 2, so
the normal shrinking threshold is changed from 100/10 (i.e. 10%) to
100/8 (i.e. 12.5%), and we change the threshold during forks from 5 to
4, i.e. from 500% to 400% for expand, and from 2% (100/10/5) to 3.125%
(100/8/4)
2024-01-19 23:00:43 +08:00
|
|
|
#define HASHTABLE_MIN_FILL 8 /* Minimal hash table fill 12.5%(100/8) */
|
Shrink dict when deleting dictEntry (#12850)
When we insert entries into dict, it may autonomously expand if needed.
However, when we delete entries from dict, it doesn't shrink to the
proper size. If there are few entries in a very large dict, it may cause
huge waste of memory and inefficiency when iterating.
The main keyspace dicts (keys and expires), are shrinked by cron
(`tryResizeHashTables` calls `htNeedsResize` and `dictResize`),
And some data structures such as zset and hash also do that (call
`htNeedsResize`) right after a loop of calls to `dictDelete`,
But many other dicts are completely missing that call (they can only
expand).
In this PR, we provide the ability to automatically shrink the dict when
deleting. The conditions triggering the shrinking is the same as
`htNeedsResize` used to have. i.e. we expand when we're over 100%
utilization, and shrink when we're below 10% utilization.
Additionally:
* Add `dictPauseAutoResize` so that flows that do mass deletions, will
only trigger shrinkage at the end.
* Rename `dictResize` to `dictShrinkToFit` (same logic as it used to
have, but better name describing it)
* Rename `_dictExpand` to `_dictResize` (same logic as it used to have,
but better name describing it)
related to discussion
https://github.com/redis/redis/pull/12819#discussion_r1409293878
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
2024-01-15 14:20:53 +08:00
|
|
|
|
2023-01-11 09:57:10 +01:00
|
|
|
typedef struct dictEntry dictEntry; /* opaque */
|
2021-08-05 08:25:58 +03:00
|
|
|
typedef struct dict dict;
|
|
|
|
|
2009-03-22 10:30:00 +01:00
|
|
|
typedef struct dictType {
|
2017-02-20 16:09:54 +01:00
|
|
|
uint64_t (*hashFunction)(const void *key);
|
2021-08-05 08:25:58 +03:00
|
|
|
void *(*keyDup)(dict *d, const void *key);
|
|
|
|
void *(*valDup)(dict *d, const void *obj);
|
|
|
|
int (*keyCompare)(dict *d, const void *key1, const void *key2);
|
|
|
|
void (*keyDestructor)(dict *d, void *key);
|
|
|
|
void (*valDestructor)(dict *d, void *obj);
|
Shrink dict when deleting dictEntry (#12850)
When we insert entries into dict, it may autonomously expand if needed.
However, when we delete entries from dict, it doesn't shrink to the
proper size. If there are few entries in a very large dict, it may cause
huge waste of memory and inefficiency when iterating.
The main keyspace dicts (keys and expires), are shrinked by cron
(`tryResizeHashTables` calls `htNeedsResize` and `dictResize`),
And some data structures such as zset and hash also do that (call
`htNeedsResize`) right after a loop of calls to `dictDelete`,
But many other dicts are completely missing that call (they can only
expand).
In this PR, we provide the ability to automatically shrink the dict when
deleting. The conditions triggering the shrinking is the same as
`htNeedsResize` used to have. i.e. we expand when we're over 100%
utilization, and shrink when we're below 10% utilization.
Additionally:
* Add `dictPauseAutoResize` so that flows that do mass deletions, will
only trigger shrinkage at the end.
* Rename `dictResize` to `dictShrinkToFit` (same logic as it used to
have, but better name describing it)
* Rename `_dictExpand` to `_dictResize` (same logic as it used to have,
but better name describing it)
related to discussion
https://github.com/redis/redis/pull/12819#discussion_r1409293878
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
2024-01-15 14:20:53 +08:00
|
|
|
int (*resizeAllowed)(size_t moreMem, double usedRatio);
|
2023-10-27 12:05:40 -07:00
|
|
|
/* Invoked at the start of dict initialization/rehashing (old and new ht are already created) */
|
Replace cluster metadata with slot specific dictionaries (#11695)
This is an implementation of https://github.com/redis/redis/issues/10589 that eliminates 16 bytes per entry in cluster mode, that are currently used to create a linked list between entries in the same slot. Main idea is splitting main dictionary into 16k smaller dictionaries (one per slot), so we can perform all slot specific operations, such as iteration, without any additional info in the `dictEntry`. For Redis cluster, the expectation is that there will be a larger number of keys, so the fixed overhead of 16k dictionaries will be The expire dictionary is also split up so that each slot is logically decoupled, so that in subsequent revisions we will be able to atomically flush a slot of data.
## Important changes
* Incremental rehashing - one big change here is that it's not one, but rather up to 16k dictionaries that can be rehashing at the same time, in order to keep track of them, we introduce a separate queue for dictionaries that are rehashing. Also instead of rehashing a single dictionary, cron job will now try to rehash as many as it can in 1ms.
* getRandomKey - now needs to not only select a random key, from the random bucket, but also needs to select a random dictionary. Fairness is a major concern here, as it's possible that keys can be unevenly distributed across the slots. In order to address this search we introduced binary index tree). With that data structure we are able to efficiently find a random slot using binary search in O(log^2(slot count)) time.
* Iteration efficiency - when iterating dictionary with a lot of empty slots, we want to skip them efficiently. We can do this using same binary index that is used for random key selection, this index allows us to find a slot for a specific key index. For example if there are 10 keys in the slot 0, then we can quickly find a slot that contains 11th key using binary search on top of the binary index tree.
* scan API - in order to perform a scan across the entire DB, the cursor now needs to not only save position within the dictionary but also the slot id. In this change we append slot id into LSB of the cursor so it can be passed around between client and the server. This has interesting side effect, now you'll be able to start scanning specific slot by simply providing slot id as a cursor value. The plan is to not document this as defined behavior, however. It's also worth nothing the SCAN API is now technically incompatible with previous versions, although practically we don't believe it's an issue.
* Checksum calculation optimizations - During command execution, we know that all of the keys are from the same slot (outside of a few notable exceptions such as cross slot scripts and modules). We don't want to compute the checksum multiple multiple times, hence we are relying on cached slot id in the client during the command executions. All operations that access random keys, either should pass in the known slot or recompute the slot.
* Slot info in RDB - in order to resize individual dictionaries correctly, while loading RDB, it's not enough to know total number of keys (of course we could approximate number of keys per slot, but it won't be precise). To address this issue, we've added additional metadata into RDB that contains number of keys in each slot, which can be used as a hint during loading.
* DB size - besides `DBSIZE` API, we need to know size of the DB in many places want, in order to avoid scanning all dictionaries and summing up their sizes in a loop, we've introduced a new field into `redisDb` that keeps track of `key_count`. This way we can keep DBSIZE operation O(1). This is also kept for O(1) expires computation as well.
## Performance
This change improves SET performance in cluster mode by ~5%, most of the gains come from us not having to maintain linked lists for keys in slot, non-cluster mode has same performance. For workloads that rely on evictions, the performance is similar because of the extra overhead for finding keys to evict.
RDB loading performance is slightly reduced, as the slot of each key needs to be computed during the load.
## Interface changes
* Removed `overhead.hashtable.slot-to-keys` to `MEMORY STATS`
* Scan API will now require 64 bits to store the cursor, even on 32 bit systems, as the slot information will be stored.
* New RDB version to support the new op code for SLOT information.
---------
Co-authored-by: Vitaly Arbuzov <arvit@amazon.com>
Co-authored-by: Harkrishn Patro <harkrisp@amazon.com>
Co-authored-by: Roshan Khatri <rvkhatri@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
2023-10-14 23:58:26 -07:00
|
|
|
void (*rehashingStarted)(dict *d);
|
2023-10-27 12:05:40 -07:00
|
|
|
/* Invoked at the end of dict initialization/rehashing of all the entries from old to new ht. Both ht still exists
|
|
|
|
* and are cleaned up after this callback. */
|
|
|
|
void (*rehashingCompleted)(dict *d);
|
Unified db rehash method for both standalone and cluster (#12848)
After #11695, we added two functions `rehashingStarted` and
`rehashingCompleted` to the dict structure. We also registered two
handlers for the main database's dict and expire structures. This allows
the main database to record the dict in `rehashing` list when rehashing
starts. Later, in `serverCron`, the `incrementallyRehash` function is
continuously called to perform the rehashing operation. However,
currently, when rehashing is completed, `rehashingCompleted` does not
remove the dict from the `rehashing` list. This results in the
`rehashing` list containing many invalid dicts. Although subsequent cron
checks and removes dicts that don't require rehashing, it is still
inefficient.
This PR implements the functionality to remove the dict from the
`rehashing` list in `rehashingCompleted`. This is achieved by adding
`metadata` to the dict structure, which keeps track of its position in
the `rehashing` list, allowing for quick removal. This approach avoids
storing duplicate dicts in the `rehashing` list.
Additionally, there are other modifications:
1. Whether in standalone or cluster mode, the dict in database is
inserted into the rehashing linked list when rehashing starts. This
eliminates the need to distinguish between standalone and cluster mode
in `incrementallyRehash`. The function only needs to focus on the dicts
in the `rehashing` list that require rehashing.
2. `rehashing` list is moved from per-database to Redis server level.
This decouples `incrementallyRehash` from the database ID, and in
standalone mode, there is no need to iterate over all databases,
avoiding unnecessary access to databases that do not require rehashing.
In the future, even if unsharded-cluster mode supports multiple
databases, there will be no risk involved.
3. The insertion and removal operations of dict structures in the
`rehashing` list are decoupled from `activerehashing` config.
`activerehashing` only controls whether `incrementallyRehash` is
executed in serverCron. There is no need for additional steps when
modifying the `activerehashing` switch, as in #12705.
2023-12-15 10:42:53 +08:00
|
|
|
/* Allow a dict to carry extra caller-defined metadata. The
|
|
|
|
* extra memory is initialized to 0 when a dict is allocated. */
|
|
|
|
size_t (*dictMetadataBytes)(dict *d);
|
Key as dict entry - memory optimization for sets (#11595)
If a dict has only keys, and no use of values, then a key can be stored directly in a
dict's hashtable. The key replaces the dictEntry. To distinguish between a key and
a dictEntry, we only use this optimization if the key is odd, i.e. if the key has the least
significant bit set. This is true for sds strings, since the sds header is always an odd
number of bytes.
Dict entries are used as a fallback when there is a hash collision. A special dict entry
without a value (only key and next) is used so we save one word in this case too.
This saves 24 bytes per set element for larges sets, and also gains some speed improvement
as a side effect (less allocations and cache misses).
A quick test adding 1M elements to a set using the command below resulted in memory
usage of 28.83M, compared to 46.29M on unstable.
That's 18 bytes per set element on average.
eval 'for i=1,1000000,1 do redis.call("sadd", "myset", "x"..i) end' 0
Other changes:
Allocations are ensured to have at least 8 bits alignment on all systems. This affects 32-bit
builds compiled without HAVE_MALLOC_SIZE (not jemalloc or glibc) in which Redis
stores the size of each allocation, after this change in 8 bytes instead of previously 4 bytes
per allocation. This is done so we can reliably use the 3 least significant bits in a pointer to
encode stuff.
2023-01-20 17:45:29 +01:00
|
|
|
/* Flags */
|
|
|
|
/* The 'no_value' flag, if set, indicates that values are not used, i.e. the
|
|
|
|
* dict is a set. When this flag is set, it's not possible to access the
|
|
|
|
* value of a dictEntry and it's also impossible to use dictSetKey(). Entry
|
|
|
|
* metadata can also not be used. */
|
|
|
|
unsigned int no_value:1;
|
|
|
|
/* If no_value = 1 and all keys are odd (LSB=1), setting keys_are_odd = 1
|
|
|
|
* enables one more optimization: to store a key without an allocated
|
|
|
|
* dictEntry. */
|
|
|
|
unsigned int keys_are_odd:1;
|
|
|
|
/* TODO: Add a 'keys_are_even' flag and use a similar optimization if that
|
|
|
|
* flag is set. */
|
2009-03-22 10:30:00 +01:00
|
|
|
} dictType;
|
|
|
|
|
2021-08-05 08:25:58 +03:00
|
|
|
#define DICTHT_SIZE(exp) ((exp) == -1 ? 0 : (unsigned long)1<<(exp))
|
|
|
|
#define DICTHT_SIZE_MASK(exp) ((exp) == -1 ? 0 : (DICTHT_SIZE(exp))-1)
|
2010-04-15 11:59:13 +02:00
|
|
|
|
2021-08-05 08:25:58 +03:00
|
|
|
struct dict {
|
2010-04-15 11:59:13 +02:00
|
|
|
dictType *type;
|
2021-08-05 08:25:58 +03:00
|
|
|
|
|
|
|
dictEntry **ht_table[2];
|
|
|
|
unsigned long ht_used[2];
|
|
|
|
|
2014-08-26 10:18:56 +02:00
|
|
|
long rehashidx; /* rehashing not in progress if rehashidx == -1 */
|
2021-08-05 08:25:58 +03:00
|
|
|
|
|
|
|
/* Keep small vars at end for optimal (minimal) struct padding */
|
2021-02-20 02:56:30 -08:00
|
|
|
int16_t pauserehash; /* If >0 rehashing is paused (<0 indicates coding error) */
|
2021-08-15 21:37:44 +03:00
|
|
|
signed char ht_size_exp[2]; /* exponent of size. (size = 1<<exp) */
|
Shrink dict when deleting dictEntry (#12850)
When we insert entries into dict, it may autonomously expand if needed.
However, when we delete entries from dict, it doesn't shrink to the
proper size. If there are few entries in a very large dict, it may cause
huge waste of memory and inefficiency when iterating.
The main keyspace dicts (keys and expires), are shrinked by cron
(`tryResizeHashTables` calls `htNeedsResize` and `dictResize`),
And some data structures such as zset and hash also do that (call
`htNeedsResize`) right after a loop of calls to `dictDelete`,
But many other dicts are completely missing that call (they can only
expand).
In this PR, we provide the ability to automatically shrink the dict when
deleting. The conditions triggering the shrinking is the same as
`htNeedsResize` used to have. i.e. we expand when we're over 100%
utilization, and shrink when we're below 10% utilization.
Additionally:
* Add `dictPauseAutoResize` so that flows that do mass deletions, will
only trigger shrinkage at the end.
* Rename `dictResize` to `dictShrinkToFit` (same logic as it used to
have, but better name describing it)
* Rename `_dictExpand` to `_dictResize` (same logic as it used to have,
but better name describing it)
related to discussion
https://github.com/redis/redis/pull/12819#discussion_r1409293878
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
2024-01-15 14:20:53 +08:00
|
|
|
int16_t pauseAutoResize; /* If >0 automatic resizing is disallowed (<0 indicates coding error) */
|
Unified db rehash method for both standalone and cluster (#12848)
After #11695, we added two functions `rehashingStarted` and
`rehashingCompleted` to the dict structure. We also registered two
handlers for the main database's dict and expire structures. This allows
the main database to record the dict in `rehashing` list when rehashing
starts. Later, in `serverCron`, the `incrementallyRehash` function is
continuously called to perform the rehashing operation. However,
currently, when rehashing is completed, `rehashingCompleted` does not
remove the dict from the `rehashing` list. This results in the
`rehashing` list containing many invalid dicts. Although subsequent cron
checks and removes dicts that don't require rehashing, it is still
inefficient.
This PR implements the functionality to remove the dict from the
`rehashing` list in `rehashingCompleted`. This is achieved by adding
`metadata` to the dict structure, which keeps track of its position in
the `rehashing` list, allowing for quick removal. This approach avoids
storing duplicate dicts in the `rehashing` list.
Additionally, there are other modifications:
1. Whether in standalone or cluster mode, the dict in database is
inserted into the rehashing linked list when rehashing starts. This
eliminates the need to distinguish between standalone and cluster mode
in `incrementallyRehash`. The function only needs to focus on the dicts
in the `rehashing` list that require rehashing.
2. `rehashing` list is moved from per-database to Redis server level.
This decouples `incrementallyRehash` from the database ID, and in
standalone mode, there is no need to iterate over all databases,
avoiding unnecessary access to databases that do not require rehashing.
In the future, even if unsharded-cluster mode supports multiple
databases, there will be no risk involved.
3. The insertion and removal operations of dict structures in the
`rehashing` list are decoupled from `activerehashing` config.
`activerehashing` only controls whether `incrementallyRehash` is
executed in serverCron. There is no need for additional steps when
modifying the `activerehashing` switch, as in #12705.
2023-12-15 10:42:53 +08:00
|
|
|
void *metadata[];
|
2021-08-05 08:25:58 +03:00
|
|
|
};
|
2009-03-22 10:30:00 +01:00
|
|
|
|
2012-10-02 21:58:36 -07:00
|
|
|
/* If safe is set to 1 this is a safe iterator, that means, you can call
|
2011-05-10 10:15:50 +02:00
|
|
|
* dictAdd, dictFind, and other functions against the dictionary even while
|
|
|
|
* iterating. Otherwise it is a non safe iterator, and only dictNext()
|
|
|
|
* should be called while iterating. */
|
2009-03-22 10:30:00 +01:00
|
|
|
typedef struct dictIterator {
|
2010-04-15 11:59:13 +02:00
|
|
|
dict *d;
|
2014-08-26 10:18:56 +02:00
|
|
|
long index;
|
|
|
|
int table, safe;
|
2009-03-22 10:30:00 +01:00
|
|
|
dictEntry *entry, *nextEntry;
|
2014-08-26 10:18:56 +02:00
|
|
|
/* unsafe iterator fingerprint for misuse detection. */
|
2021-11-11 14:51:33 +03:00
|
|
|
unsigned long long fingerprint;
|
2009-03-22 10:30:00 +01:00
|
|
|
} dictIterator;
|
|
|
|
|
Replace cluster metadata with slot specific dictionaries (#11695)
This is an implementation of https://github.com/redis/redis/issues/10589 that eliminates 16 bytes per entry in cluster mode, that are currently used to create a linked list between entries in the same slot. Main idea is splitting main dictionary into 16k smaller dictionaries (one per slot), so we can perform all slot specific operations, such as iteration, without any additional info in the `dictEntry`. For Redis cluster, the expectation is that there will be a larger number of keys, so the fixed overhead of 16k dictionaries will be The expire dictionary is also split up so that each slot is logically decoupled, so that in subsequent revisions we will be able to atomically flush a slot of data.
## Important changes
* Incremental rehashing - one big change here is that it's not one, but rather up to 16k dictionaries that can be rehashing at the same time, in order to keep track of them, we introduce a separate queue for dictionaries that are rehashing. Also instead of rehashing a single dictionary, cron job will now try to rehash as many as it can in 1ms.
* getRandomKey - now needs to not only select a random key, from the random bucket, but also needs to select a random dictionary. Fairness is a major concern here, as it's possible that keys can be unevenly distributed across the slots. In order to address this search we introduced binary index tree). With that data structure we are able to efficiently find a random slot using binary search in O(log^2(slot count)) time.
* Iteration efficiency - when iterating dictionary with a lot of empty slots, we want to skip them efficiently. We can do this using same binary index that is used for random key selection, this index allows us to find a slot for a specific key index. For example if there are 10 keys in the slot 0, then we can quickly find a slot that contains 11th key using binary search on top of the binary index tree.
* scan API - in order to perform a scan across the entire DB, the cursor now needs to not only save position within the dictionary but also the slot id. In this change we append slot id into LSB of the cursor so it can be passed around between client and the server. This has interesting side effect, now you'll be able to start scanning specific slot by simply providing slot id as a cursor value. The plan is to not document this as defined behavior, however. It's also worth nothing the SCAN API is now technically incompatible with previous versions, although practically we don't believe it's an issue.
* Checksum calculation optimizations - During command execution, we know that all of the keys are from the same slot (outside of a few notable exceptions such as cross slot scripts and modules). We don't want to compute the checksum multiple multiple times, hence we are relying on cached slot id in the client during the command executions. All operations that access random keys, either should pass in the known slot or recompute the slot.
* Slot info in RDB - in order to resize individual dictionaries correctly, while loading RDB, it's not enough to know total number of keys (of course we could approximate number of keys per slot, but it won't be precise). To address this issue, we've added additional metadata into RDB that contains number of keys in each slot, which can be used as a hint during loading.
* DB size - besides `DBSIZE` API, we need to know size of the DB in many places want, in order to avoid scanning all dictionaries and summing up their sizes in a loop, we've introduced a new field into `redisDb` that keeps track of `key_count`. This way we can keep DBSIZE operation O(1). This is also kept for O(1) expires computation as well.
## Performance
This change improves SET performance in cluster mode by ~5%, most of the gains come from us not having to maintain linked lists for keys in slot, non-cluster mode has same performance. For workloads that rely on evictions, the performance is similar because of the extra overhead for finding keys to evict.
RDB loading performance is slightly reduced, as the slot of each key needs to be computed during the load.
## Interface changes
* Removed `overhead.hashtable.slot-to-keys` to `MEMORY STATS`
* Scan API will now require 64 bits to store the cursor, even on 32 bit systems, as the slot information will be stored.
* New RDB version to support the new op code for SLOT information.
---------
Co-authored-by: Vitaly Arbuzov <arvit@amazon.com>
Co-authored-by: Harkrishn Patro <harkrisp@amazon.com>
Co-authored-by: Roshan Khatri <rvkhatri@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
2023-10-14 23:58:26 -07:00
|
|
|
typedef struct dictStats {
|
|
|
|
int htidx;
|
|
|
|
unsigned long buckets;
|
|
|
|
unsigned long maxChainLen;
|
|
|
|
unsigned long totalChainLen;
|
|
|
|
unsigned long htSize;
|
|
|
|
unsigned long htUsed;
|
|
|
|
unsigned long *clvector;
|
|
|
|
} dictStats;
|
|
|
|
|
2012-07-09 01:00:26 -07:00
|
|
|
typedef void (dictScanFunction)(void *privdata, const dictEntry *de);
|
2022-11-20 23:23:54 +01:00
|
|
|
typedef void *(dictDefragAllocFunction)(void *ptr);
|
Key as dict entry - memory optimization for sets (#11595)
If a dict has only keys, and no use of values, then a key can be stored directly in a
dict's hashtable. The key replaces the dictEntry. To distinguish between a key and
a dictEntry, we only use this optimization if the key is odd, i.e. if the key has the least
significant bit set. This is true for sds strings, since the sds header is always an odd
number of bytes.
Dict entries are used as a fallback when there is a hash collision. A special dict entry
without a value (only key and next) is used so we save one word in this case too.
This saves 24 bytes per set element for larges sets, and also gains some speed improvement
as a side effect (less allocations and cache misses).
A quick test adding 1M elements to a set using the command below resulted in memory
usage of 28.83M, compared to 46.29M on unstable.
That's 18 bytes per set element on average.
eval 'for i=1,1000000,1 do redis.call("sadd", "myset", "x"..i) end' 0
Other changes:
Allocations are ensured to have at least 8 bits alignment on all systems. This affects 32-bit
builds compiled without HAVE_MALLOC_SIZE (not jemalloc or glibc) in which Redis
stores the size of each allocation, after this change in 8 bytes instead of previously 4 bytes
per allocation. This is done so we can reliably use the 3 least significant bits in a pointer to
encode stuff.
2023-01-20 17:45:29 +01:00
|
|
|
typedef struct {
|
|
|
|
dictDefragAllocFunction *defragAlloc; /* Used for entries etc. */
|
|
|
|
dictDefragAllocFunction *defragKey; /* Defrag-realloc keys (optional) */
|
|
|
|
dictDefragAllocFunction *defragVal; /* Defrag-realloc values (optional) */
|
|
|
|
} dictDefragFunctions;
|
2012-07-09 01:00:26 -07:00
|
|
|
|
2009-03-22 10:30:00 +01:00
|
|
|
/* This is the initial size of every hash table */
|
2021-08-05 08:25:58 +03:00
|
|
|
#define DICT_HT_INITIAL_EXP 2
|
|
|
|
#define DICT_HT_INITIAL_SIZE (1<<(DICT_HT_INITIAL_EXP))
|
2009-03-22 10:30:00 +01:00
|
|
|
|
|
|
|
/* ------------------------------- Macros ------------------------------------*/
|
2022-08-03 19:38:08 +03:00
|
|
|
#define dictFreeVal(d, entry) do { \
|
|
|
|
if ((d)->type->valDestructor) \
|
2023-01-11 09:57:10 +01:00
|
|
|
(d)->type->valDestructor((d), dictGetVal(entry)); \
|
2022-08-03 19:38:08 +03:00
|
|
|
} while(0)
|
2009-03-22 10:30:00 +01:00
|
|
|
|
2011-11-08 17:07:55 +01:00
|
|
|
#define dictFreeKey(d, entry) \
|
2010-04-15 11:59:13 +02:00
|
|
|
if ((d)->type->keyDestructor) \
|
2023-01-11 09:57:10 +01:00
|
|
|
(d)->type->keyDestructor((d), dictGetKey(entry))
|
2009-03-22 10:30:00 +01:00
|
|
|
|
2011-11-08 17:07:55 +01:00
|
|
|
#define dictCompareKeys(d, key1, key2) \
|
2010-04-15 11:59:13 +02:00
|
|
|
(((d)->type->keyCompare) ? \
|
2021-08-05 08:25:58 +03:00
|
|
|
(d)->type->keyCompare((d), key1, key2) : \
|
2009-03-22 10:30:00 +01:00
|
|
|
(key1) == (key2))
|
|
|
|
|
Unified db rehash method for both standalone and cluster (#12848)
After #11695, we added two functions `rehashingStarted` and
`rehashingCompleted` to the dict structure. We also registered two
handlers for the main database's dict and expire structures. This allows
the main database to record the dict in `rehashing` list when rehashing
starts. Later, in `serverCron`, the `incrementallyRehash` function is
continuously called to perform the rehashing operation. However,
currently, when rehashing is completed, `rehashingCompleted` does not
remove the dict from the `rehashing` list. This results in the
`rehashing` list containing many invalid dicts. Although subsequent cron
checks and removes dicts that don't require rehashing, it is still
inefficient.
This PR implements the functionality to remove the dict from the
`rehashing` list in `rehashingCompleted`. This is achieved by adding
`metadata` to the dict structure, which keeps track of its position in
the `rehashing` list, allowing for quick removal. This approach avoids
storing duplicate dicts in the `rehashing` list.
Additionally, there are other modifications:
1. Whether in standalone or cluster mode, the dict in database is
inserted into the rehashing linked list when rehashing starts. This
eliminates the need to distinguish between standalone and cluster mode
in `incrementallyRehash`. The function only needs to focus on the dicts
in the `rehashing` list that require rehashing.
2. `rehashing` list is moved from per-database to Redis server level.
This decouples `incrementallyRehash` from the database ID, and in
standalone mode, there is no need to iterate over all databases,
avoiding unnecessary access to databases that do not require rehashing.
In the future, even if unsharded-cluster mode supports multiple
databases, there will be no risk involved.
3. The insertion and removal operations of dict structures in the
`rehashing` list are decoupled from `activerehashing` config.
`activerehashing` only controls whether `incrementallyRehash` is
executed in serverCron. There is no need for additional steps when
modifying the `activerehashing` switch, as in #12705.
2023-12-15 10:42:53 +08:00
|
|
|
#define dictMetadata(d) (&(d)->metadata)
|
|
|
|
#define dictMetadataSize(d) ((d)->type->dictMetadataBytes \
|
|
|
|
? (d)->type->dictMetadataBytes(d) : 0)
|
|
|
|
|
2022-08-03 19:38:08 +03:00
|
|
|
#define dictHashKey(d, key) ((d)->type->hashFunction(key))
|
Replace cluster metadata with slot specific dictionaries (#11695)
This is an implementation of https://github.com/redis/redis/issues/10589 that eliminates 16 bytes per entry in cluster mode, that are currently used to create a linked list between entries in the same slot. Main idea is splitting main dictionary into 16k smaller dictionaries (one per slot), so we can perform all slot specific operations, such as iteration, without any additional info in the `dictEntry`. For Redis cluster, the expectation is that there will be a larger number of keys, so the fixed overhead of 16k dictionaries will be The expire dictionary is also split up so that each slot is logically decoupled, so that in subsequent revisions we will be able to atomically flush a slot of data.
## Important changes
* Incremental rehashing - one big change here is that it's not one, but rather up to 16k dictionaries that can be rehashing at the same time, in order to keep track of them, we introduce a separate queue for dictionaries that are rehashing. Also instead of rehashing a single dictionary, cron job will now try to rehash as many as it can in 1ms.
* getRandomKey - now needs to not only select a random key, from the random bucket, but also needs to select a random dictionary. Fairness is a major concern here, as it's possible that keys can be unevenly distributed across the slots. In order to address this search we introduced binary index tree). With that data structure we are able to efficiently find a random slot using binary search in O(log^2(slot count)) time.
* Iteration efficiency - when iterating dictionary with a lot of empty slots, we want to skip them efficiently. We can do this using same binary index that is used for random key selection, this index allows us to find a slot for a specific key index. For example if there are 10 keys in the slot 0, then we can quickly find a slot that contains 11th key using binary search on top of the binary index tree.
* scan API - in order to perform a scan across the entire DB, the cursor now needs to not only save position within the dictionary but also the slot id. In this change we append slot id into LSB of the cursor so it can be passed around between client and the server. This has interesting side effect, now you'll be able to start scanning specific slot by simply providing slot id as a cursor value. The plan is to not document this as defined behavior, however. It's also worth nothing the SCAN API is now technically incompatible with previous versions, although practically we don't believe it's an issue.
* Checksum calculation optimizations - During command execution, we know that all of the keys are from the same slot (outside of a few notable exceptions such as cross slot scripts and modules). We don't want to compute the checksum multiple multiple times, hence we are relying on cached slot id in the client during the command executions. All operations that access random keys, either should pass in the known slot or recompute the slot.
* Slot info in RDB - in order to resize individual dictionaries correctly, while loading RDB, it's not enough to know total number of keys (of course we could approximate number of keys per slot, but it won't be precise). To address this issue, we've added additional metadata into RDB that contains number of keys in each slot, which can be used as a hint during loading.
* DB size - besides `DBSIZE` API, we need to know size of the DB in many places want, in order to avoid scanning all dictionaries and summing up their sizes in a loop, we've introduced a new field into `redisDb` that keeps track of `key_count`. This way we can keep DBSIZE operation O(1). This is also kept for O(1) expires computation as well.
## Performance
This change improves SET performance in cluster mode by ~5%, most of the gains come from us not having to maintain linked lists for keys in slot, non-cluster mode has same performance. For workloads that rely on evictions, the performance is similar because of the extra overhead for finding keys to evict.
RDB loading performance is slightly reduced, as the slot of each key needs to be computed during the load.
## Interface changes
* Removed `overhead.hashtable.slot-to-keys` to `MEMORY STATS`
* Scan API will now require 64 bits to store the cursor, even on 32 bit systems, as the slot information will be stored.
* New RDB version to support the new op code for SLOT information.
---------
Co-authored-by: Vitaly Arbuzov <arvit@amazon.com>
Co-authored-by: Harkrishn Patro <harkrisp@amazon.com>
Co-authored-by: Roshan Khatri <rvkhatri@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
2023-10-14 23:58:26 -07:00
|
|
|
#define dictBuckets(d) (DICTHT_SIZE((d)->ht_size_exp[0])+DICTHT_SIZE((d)->ht_size_exp[1]))
|
2021-08-05 08:25:58 +03:00
|
|
|
#define dictSize(d) ((d)->ht_used[0]+(d)->ht_used[1])
|
Replace cluster metadata with slot specific dictionaries (#11695)
This is an implementation of https://github.com/redis/redis/issues/10589 that eliminates 16 bytes per entry in cluster mode, that are currently used to create a linked list between entries in the same slot. Main idea is splitting main dictionary into 16k smaller dictionaries (one per slot), so we can perform all slot specific operations, such as iteration, without any additional info in the `dictEntry`. For Redis cluster, the expectation is that there will be a larger number of keys, so the fixed overhead of 16k dictionaries will be The expire dictionary is also split up so that each slot is logically decoupled, so that in subsequent revisions we will be able to atomically flush a slot of data.
## Important changes
* Incremental rehashing - one big change here is that it's not one, but rather up to 16k dictionaries that can be rehashing at the same time, in order to keep track of them, we introduce a separate queue for dictionaries that are rehashing. Also instead of rehashing a single dictionary, cron job will now try to rehash as many as it can in 1ms.
* getRandomKey - now needs to not only select a random key, from the random bucket, but also needs to select a random dictionary. Fairness is a major concern here, as it's possible that keys can be unevenly distributed across the slots. In order to address this search we introduced binary index tree). With that data structure we are able to efficiently find a random slot using binary search in O(log^2(slot count)) time.
* Iteration efficiency - when iterating dictionary with a lot of empty slots, we want to skip them efficiently. We can do this using same binary index that is used for random key selection, this index allows us to find a slot for a specific key index. For example if there are 10 keys in the slot 0, then we can quickly find a slot that contains 11th key using binary search on top of the binary index tree.
* scan API - in order to perform a scan across the entire DB, the cursor now needs to not only save position within the dictionary but also the slot id. In this change we append slot id into LSB of the cursor so it can be passed around between client and the server. This has interesting side effect, now you'll be able to start scanning specific slot by simply providing slot id as a cursor value. The plan is to not document this as defined behavior, however. It's also worth nothing the SCAN API is now technically incompatible with previous versions, although practically we don't believe it's an issue.
* Checksum calculation optimizations - During command execution, we know that all of the keys are from the same slot (outside of a few notable exceptions such as cross slot scripts and modules). We don't want to compute the checksum multiple multiple times, hence we are relying on cached slot id in the client during the command executions. All operations that access random keys, either should pass in the known slot or recompute the slot.
* Slot info in RDB - in order to resize individual dictionaries correctly, while loading RDB, it's not enough to know total number of keys (of course we could approximate number of keys per slot, but it won't be precise). To address this issue, we've added additional metadata into RDB that contains number of keys in each slot, which can be used as a hint during loading.
* DB size - besides `DBSIZE` API, we need to know size of the DB in many places want, in order to avoid scanning all dictionaries and summing up their sizes in a loop, we've introduced a new field into `redisDb` that keeps track of `key_count`. This way we can keep DBSIZE operation O(1). This is also kept for O(1) expires computation as well.
## Performance
This change improves SET performance in cluster mode by ~5%, most of the gains come from us not having to maintain linked lists for keys in slot, non-cluster mode has same performance. For workloads that rely on evictions, the performance is similar because of the extra overhead for finding keys to evict.
RDB loading performance is slightly reduced, as the slot of each key needs to be computed during the load.
## Interface changes
* Removed `overhead.hashtable.slot-to-keys` to `MEMORY STATS`
* Scan API will now require 64 bits to store the cursor, even on 32 bit systems, as the slot information will be stored.
* New RDB version to support the new op code for SLOT information.
---------
Co-authored-by: Vitaly Arbuzov <arvit@amazon.com>
Co-authored-by: Harkrishn Patro <harkrisp@amazon.com>
Co-authored-by: Roshan Khatri <rvkhatri@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
2023-10-14 23:58:26 -07:00
|
|
|
#define dictIsEmpty(d) ((d)->ht_used[0] == 0 && (d)->ht_used[1] == 0)
|
2012-05-18 20:34:10 +08:00
|
|
|
#define dictIsRehashing(d) ((d)->rehashidx != -1)
|
2022-08-03 19:38:08 +03:00
|
|
|
#define dictPauseRehashing(d) ((d)->pauserehash++)
|
|
|
|
#define dictResumeRehashing(d) ((d)->pauserehash--)
|
Shrink dict when deleting dictEntry (#12850)
When we insert entries into dict, it may autonomously expand if needed.
However, when we delete entries from dict, it doesn't shrink to the
proper size. If there are few entries in a very large dict, it may cause
huge waste of memory and inefficiency when iterating.
The main keyspace dicts (keys and expires), are shrinked by cron
(`tryResizeHashTables` calls `htNeedsResize` and `dictResize`),
And some data structures such as zset and hash also do that (call
`htNeedsResize`) right after a loop of calls to `dictDelete`,
But many other dicts are completely missing that call (they can only
expand).
In this PR, we provide the ability to automatically shrink the dict when
deleting. The conditions triggering the shrinking is the same as
`htNeedsResize` used to have. i.e. we expand when we're over 100%
utilization, and shrink when we're below 10% utilization.
Additionally:
* Add `dictPauseAutoResize` so that flows that do mass deletions, will
only trigger shrinkage at the end.
* Rename `dictResize` to `dictShrinkToFit` (same logic as it used to
have, but better name describing it)
* Rename `_dictExpand` to `_dictResize` (same logic as it used to have,
but better name describing it)
related to discussion
https://github.com/redis/redis/pull/12819#discussion_r1409293878
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
2024-01-15 14:20:53 +08:00
|
|
|
#define dictPauseAutoResize(d) ((d)->pauseAutoResize++)
|
|
|
|
#define dictResumeAutoResize(d) ((d)->pauseAutoResize--)
|
2009-03-22 10:30:00 +01:00
|
|
|
|
2020-12-23 05:52:07 -08:00
|
|
|
/* If our unsigned long type can store a 64 bit number, use a 64 bit PRNG. */
|
|
|
|
#if ULONG_MAX >= 0xffffffffffffffff
|
|
|
|
#define randomULong() ((unsigned long) genrand64_int64())
|
|
|
|
#else
|
|
|
|
#define randomULong() random()
|
|
|
|
#endif
|
|
|
|
|
2023-01-10 08:40:40 +02:00
|
|
|
typedef enum {
|
|
|
|
DICT_RESIZE_ENABLE,
|
|
|
|
DICT_RESIZE_AVOID,
|
|
|
|
DICT_RESIZE_FORBID,
|
|
|
|
} dictResizeEnable;
|
|
|
|
|
2009-03-22 10:30:00 +01:00
|
|
|
/* API */
|
2021-08-05 08:25:58 +03:00
|
|
|
dict *dictCreate(dictType *type);
|
Replace cluster metadata with slot specific dictionaries (#11695)
This is an implementation of https://github.com/redis/redis/issues/10589 that eliminates 16 bytes per entry in cluster mode, that are currently used to create a linked list between entries in the same slot. Main idea is splitting main dictionary into 16k smaller dictionaries (one per slot), so we can perform all slot specific operations, such as iteration, without any additional info in the `dictEntry`. For Redis cluster, the expectation is that there will be a larger number of keys, so the fixed overhead of 16k dictionaries will be The expire dictionary is also split up so that each slot is logically decoupled, so that in subsequent revisions we will be able to atomically flush a slot of data.
## Important changes
* Incremental rehashing - one big change here is that it's not one, but rather up to 16k dictionaries that can be rehashing at the same time, in order to keep track of them, we introduce a separate queue for dictionaries that are rehashing. Also instead of rehashing a single dictionary, cron job will now try to rehash as many as it can in 1ms.
* getRandomKey - now needs to not only select a random key, from the random bucket, but also needs to select a random dictionary. Fairness is a major concern here, as it's possible that keys can be unevenly distributed across the slots. In order to address this search we introduced binary index tree). With that data structure we are able to efficiently find a random slot using binary search in O(log^2(slot count)) time.
* Iteration efficiency - when iterating dictionary with a lot of empty slots, we want to skip them efficiently. We can do this using same binary index that is used for random key selection, this index allows us to find a slot for a specific key index. For example if there are 10 keys in the slot 0, then we can quickly find a slot that contains 11th key using binary search on top of the binary index tree.
* scan API - in order to perform a scan across the entire DB, the cursor now needs to not only save position within the dictionary but also the slot id. In this change we append slot id into LSB of the cursor so it can be passed around between client and the server. This has interesting side effect, now you'll be able to start scanning specific slot by simply providing slot id as a cursor value. The plan is to not document this as defined behavior, however. It's also worth nothing the SCAN API is now technically incompatible with previous versions, although practically we don't believe it's an issue.
* Checksum calculation optimizations - During command execution, we know that all of the keys are from the same slot (outside of a few notable exceptions such as cross slot scripts and modules). We don't want to compute the checksum multiple multiple times, hence we are relying on cached slot id in the client during the command executions. All operations that access random keys, either should pass in the known slot or recompute the slot.
* Slot info in RDB - in order to resize individual dictionaries correctly, while loading RDB, it's not enough to know total number of keys (of course we could approximate number of keys per slot, but it won't be precise). To address this issue, we've added additional metadata into RDB that contains number of keys in each slot, which can be used as a hint during loading.
* DB size - besides `DBSIZE` API, we need to know size of the DB in many places want, in order to avoid scanning all dictionaries and summing up their sizes in a loop, we've introduced a new field into `redisDb` that keeps track of `key_count`. This way we can keep DBSIZE operation O(1). This is also kept for O(1) expires computation as well.
## Performance
This change improves SET performance in cluster mode by ~5%, most of the gains come from us not having to maintain linked lists for keys in slot, non-cluster mode has same performance. For workloads that rely on evictions, the performance is similar because of the extra overhead for finding keys to evict.
RDB loading performance is slightly reduced, as the slot of each key needs to be computed during the load.
## Interface changes
* Removed `overhead.hashtable.slot-to-keys` to `MEMORY STATS`
* Scan API will now require 64 bits to store the cursor, even on 32 bit systems, as the slot information will be stored.
* New RDB version to support the new op code for SLOT information.
---------
Co-authored-by: Vitaly Arbuzov <arvit@amazon.com>
Co-authored-by: Harkrishn Patro <harkrisp@amazon.com>
Co-authored-by: Roshan Khatri <rvkhatri@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
2023-10-14 23:58:26 -07:00
|
|
|
dict **dictCreateMultiple(dictType *type, int count);
|
2010-04-15 18:07:57 +02:00
|
|
|
int dictExpand(dict *d, unsigned long size);
|
2020-11-22 21:22:49 +02:00
|
|
|
int dictTryExpand(dict *d, unsigned long size);
|
Shrink dict when deleting dictEntry (#12850)
When we insert entries into dict, it may autonomously expand if needed.
However, when we delete entries from dict, it doesn't shrink to the
proper size. If there are few entries in a very large dict, it may cause
huge waste of memory and inefficiency when iterating.
The main keyspace dicts (keys and expires), are shrinked by cron
(`tryResizeHashTables` calls `htNeedsResize` and `dictResize`),
And some data structures such as zset and hash also do that (call
`htNeedsResize`) right after a loop of calls to `dictDelete`,
But many other dicts are completely missing that call (they can only
expand).
In this PR, we provide the ability to automatically shrink the dict when
deleting. The conditions triggering the shrinking is the same as
`htNeedsResize` used to have. i.e. we expand when we're over 100%
utilization, and shrink when we're below 10% utilization.
Additionally:
* Add `dictPauseAutoResize` so that flows that do mass deletions, will
only trigger shrinkage at the end.
* Rename `dictResize` to `dictShrinkToFit` (same logic as it used to
have, but better name describing it)
* Rename `_dictExpand` to `_dictResize` (same logic as it used to have,
but better name describing it)
related to discussion
https://github.com/redis/redis/pull/12819#discussion_r1409293878
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
2024-01-15 14:20:53 +08:00
|
|
|
int dictShrink(dict *d, unsigned long size);
|
2010-04-15 18:07:57 +02:00
|
|
|
int dictAdd(dict *d, void *key, void *val);
|
2016-05-09 18:01:09 +03:00
|
|
|
dictEntry *dictAddRaw(dict *d, void *key, dictEntry **existing);
|
Key as dict entry - memory optimization for sets (#11595)
If a dict has only keys, and no use of values, then a key can be stored directly in a
dict's hashtable. The key replaces the dictEntry. To distinguish between a key and
a dictEntry, we only use this optimization if the key is odd, i.e. if the key has the least
significant bit set. This is true for sds strings, since the sds header is always an odd
number of bytes.
Dict entries are used as a fallback when there is a hash collision. A special dict entry
without a value (only key and next) is used so we save one word in this case too.
This saves 24 bytes per set element for larges sets, and also gains some speed improvement
as a side effect (less allocations and cache misses).
A quick test adding 1M elements to a set using the command below resulted in memory
usage of 28.83M, compared to 46.29M on unstable.
That's 18 bytes per set element on average.
eval 'for i=1,1000000,1 do redis.call("sadd", "myset", "x"..i) end' 0
Other changes:
Allocations are ensured to have at least 8 bits alignment on all systems. This affects 32-bit
builds compiled without HAVE_MALLOC_SIZE (not jemalloc or glibc) in which Redis
stores the size of each allocation, after this change in 8 bytes instead of previously 4 bytes
per allocation. This is done so we can reliably use the 3 least significant bits in a pointer to
encode stuff.
2023-01-20 17:45:29 +01:00
|
|
|
void *dictFindPositionForInsert(dict *d, const void *key, dictEntry **existing);
|
|
|
|
dictEntry *dictInsertAtPosition(dict *d, void *key, void *position);
|
2016-09-14 16:43:38 +02:00
|
|
|
dictEntry *dictAddOrFind(dict *d, void *key);
|
2010-04-15 18:07:57 +02:00
|
|
|
int dictReplace(dict *d, void *key, void *val);
|
|
|
|
int dictDelete(dict *d, const void *key);
|
2021-08-05 08:25:58 +03:00
|
|
|
dictEntry *dictUnlink(dict *d, const void *key);
|
2016-05-09 18:01:09 +03:00
|
|
|
void dictFreeUnlinkedEntry(dict *d, dictEntry *he);
|
2022-11-30 17:56:36 +08:00
|
|
|
dictEntry *dictTwoPhaseUnlinkFind(dict *d, const void *key, dictEntry ***plink, int *table_index);
|
|
|
|
void dictTwoPhaseUnlinkFree(dict *d, dictEntry *he, dictEntry **plink, int table_index);
|
2010-04-15 18:07:57 +02:00
|
|
|
void dictRelease(dict *d);
|
|
|
|
dictEntry * dictFind(dict *d, const void *key);
|
2010-04-16 10:04:51 +02:00
|
|
|
void *dictFetchValue(dict *d, const void *key);
|
Optimize resizing hash table to resize not only non-empty dicts. (#12819)
The function `tryResizeHashTables` only attempts to shrink the dicts
that has keys (change from #11695), this was a serious problem until the
change in #12850 since it meant if all keys are deleted, we won't shrink
the dick.
But still, both dictShrink and dictExpand may be blocked by a fork child
process, therefore, the cron job needs to perform both dictShrink and
dictExpand, for not just non-empty dicts, but all dicts in DBs.
What this PR does:
1. Try to resize all dicts in DBs (not just non-empty ones, as it was
since #12850)
2. handle both shrink and expand (not just shrink, as it was since
forever)
3. Refactor some APIs about dict resizing (get rid of `htNeedsShrink`
`htNeedsShrink` `dictShrinkToFit`, and expose `dictShrinkIfNeeded`
`dictExpandIfNeeded` which already contains all the code of those
functions we get rid of, to make APIs more neat)
4. In the `Don't rehash if redis has child process` test, now that cron
would do resizing, we no longer need to write to DB after the child
process got killed, and can wait for the cron to expand the hash table.
2024-01-30 03:02:07 +08:00
|
|
|
int dictShrinkIfNeeded(dict *d);
|
|
|
|
int dictExpandIfNeeded(dict *d);
|
2023-01-11 09:57:10 +01:00
|
|
|
void dictSetKey(dict *d, dictEntry* de, void *key);
|
|
|
|
void dictSetVal(dict *d, dictEntry *de, void *val);
|
|
|
|
void dictSetSignedIntegerVal(dictEntry *de, int64_t val);
|
|
|
|
void dictSetUnsignedIntegerVal(dictEntry *de, uint64_t val);
|
|
|
|
void dictSetDoubleVal(dictEntry *de, double val);
|
|
|
|
int64_t dictIncrSignedIntegerVal(dictEntry *de, int64_t val);
|
|
|
|
uint64_t dictIncrUnsignedIntegerVal(dictEntry *de, uint64_t val);
|
|
|
|
double dictIncrDoubleVal(dictEntry *de, double val);
|
2022-11-20 23:23:54 +01:00
|
|
|
void *dictEntryMetadata(dictEntry *de);
|
2023-01-11 09:57:10 +01:00
|
|
|
void *dictGetKey(const dictEntry *de);
|
|
|
|
void *dictGetVal(const dictEntry *de);
|
|
|
|
int64_t dictGetSignedIntegerVal(const dictEntry *de);
|
|
|
|
uint64_t dictGetUnsignedIntegerVal(const dictEntry *de);
|
|
|
|
double dictGetDoubleVal(const dictEntry *de);
|
|
|
|
double *dictGetDoubleValPtr(dictEntry *de);
|
|
|
|
size_t dictMemUsage(const dict *d);
|
|
|
|
size_t dictEntryMemUsage(void);
|
2010-04-15 18:07:57 +02:00
|
|
|
dictIterator *dictGetIterator(dict *d);
|
2011-05-10 10:15:50 +02:00
|
|
|
dictIterator *dictGetSafeIterator(dict *d);
|
2022-09-08 04:57:43 +03:00
|
|
|
void dictInitIterator(dictIterator *iter, dict *d);
|
|
|
|
void dictInitSafeIterator(dictIterator *iter, dict *d);
|
|
|
|
void dictResetIterator(dictIterator *iter);
|
2009-03-22 10:30:00 +01:00
|
|
|
dictEntry *dictNext(dictIterator *iter);
|
|
|
|
void dictReleaseIterator(dictIterator *iter);
|
2010-04-15 18:07:57 +02:00
|
|
|
dictEntry *dictGetRandomKey(dict *d);
|
2019-02-18 18:27:18 +01:00
|
|
|
dictEntry *dictGetFairRandomKey(dict *d);
|
2015-02-06 15:48:42 +01:00
|
|
|
unsigned int dictGetSomeKeys(dict *d, dictEntry **des, unsigned int count);
|
2023-05-24 16:27:44 +03:00
|
|
|
void dictGetStats(char *buf, size_t bufsize, dict *d, int full);
|
2021-12-13 20:16:25 +01:00
|
|
|
uint64_t dictGenHashFunction(const void *key, size_t len);
|
|
|
|
uint64_t dictGenCaseHashFunction(const unsigned char *buf, size_t len);
|
2021-08-05 08:25:58 +03:00
|
|
|
void dictEmpty(dict *d, void(callback)(dict*));
|
2023-01-10 08:40:40 +02:00
|
|
|
void dictSetResizeEnabled(dictResizeEnable enable);
|
2010-04-15 11:59:13 +02:00
|
|
|
int dictRehash(dict *d, int n);
|
Unified db rehash method for both standalone and cluster (#12848)
After #11695, we added two functions `rehashingStarted` and
`rehashingCompleted` to the dict structure. We also registered two
handlers for the main database's dict and expire structures. This allows
the main database to record the dict in `rehashing` list when rehashing
starts. Later, in `serverCron`, the `incrementallyRehash` function is
continuously called to perform the rehashing operation. However,
currently, when rehashing is completed, `rehashingCompleted` does not
remove the dict from the `rehashing` list. This results in the
`rehashing` list containing many invalid dicts. Although subsequent cron
checks and removes dicts that don't require rehashing, it is still
inefficient.
This PR implements the functionality to remove the dict from the
`rehashing` list in `rehashingCompleted`. This is achieved by adding
`metadata` to the dict structure, which keeps track of its position in
the `rehashing` list, allowing for quick removal. This approach avoids
storing duplicate dicts in the `rehashing` list.
Additionally, there are other modifications:
1. Whether in standalone or cluster mode, the dict in database is
inserted into the rehashing linked list when rehashing starts. This
eliminates the need to distinguish between standalone and cluster mode
in `incrementallyRehash`. The function only needs to focus on the dicts
in the `rehashing` list that require rehashing.
2. `rehashing` list is moved from per-database to Redis server level.
This decouples `incrementallyRehash` from the database ID, and in
standalone mode, there is no need to iterate over all databases,
avoiding unnecessary access to databases that do not require rehashing.
In the future, even if unsharded-cluster mode supports multiple
databases, there will be no risk involved.
3. The insertion and removal operations of dict structures in the
`rehashing` list are decoupled from `activerehashing` config.
`activerehashing` only controls whether `incrementallyRehash` is
executed in serverCron. There is no need for additional steps when
modifying the `activerehashing` switch, as in #12705.
2023-12-15 10:42:53 +08:00
|
|
|
int dictRehashMicroseconds(dict *d, uint64_t us);
|
2017-02-20 16:09:54 +01:00
|
|
|
void dictSetHashFunctionSeed(uint8_t *seed);
|
|
|
|
uint8_t *dictGetHashFunctionSeed(void);
|
2022-11-20 23:23:54 +01:00
|
|
|
unsigned long dictScan(dict *d, unsigned long v, dictScanFunction *fn, void *privdata);
|
Key as dict entry - memory optimization for sets (#11595)
If a dict has only keys, and no use of values, then a key can be stored directly in a
dict's hashtable. The key replaces the dictEntry. To distinguish between a key and
a dictEntry, we only use this optimization if the key is odd, i.e. if the key has the least
significant bit set. This is true for sds strings, since the sds header is always an odd
number of bytes.
Dict entries are used as a fallback when there is a hash collision. A special dict entry
without a value (only key and next) is used so we save one word in this case too.
This saves 24 bytes per set element for larges sets, and also gains some speed improvement
as a side effect (less allocations and cache misses).
A quick test adding 1M elements to a set using the command below resulted in memory
usage of 28.83M, compared to 46.29M on unstable.
That's 18 bytes per set element on average.
eval 'for i=1,1000000,1 do redis.call("sadd", "myset", "x"..i) end' 0
Other changes:
Allocations are ensured to have at least 8 bits alignment on all systems. This affects 32-bit
builds compiled without HAVE_MALLOC_SIZE (not jemalloc or glibc) in which Redis
stores the size of each allocation, after this change in 8 bytes instead of previously 4 bytes
per allocation. This is done so we can reliably use the 3 least significant bits in a pointer to
encode stuff.
2023-01-20 17:45:29 +01:00
|
|
|
unsigned long dictScanDefrag(dict *d, unsigned long v, dictScanFunction *fn, dictDefragFunctions *defragfns, void *privdata);
|
2017-11-30 11:34:37 +08:00
|
|
|
uint64_t dictGetHash(dict *d, const void *key);
|
2022-11-20 23:23:54 +01:00
|
|
|
dictEntry *dictFindEntryByPtrAndHash(dict *d, const void *oldptr, uint64_t hash);
|
2023-12-10 16:55:30 +08:00
|
|
|
void dictRehashingInfo(dict *d, unsigned long long *from_size, unsigned long long *to_size);
|
2009-03-22 10:30:00 +01:00
|
|
|
|
Replace cluster metadata with slot specific dictionaries (#11695)
This is an implementation of https://github.com/redis/redis/issues/10589 that eliminates 16 bytes per entry in cluster mode, that are currently used to create a linked list between entries in the same slot. Main idea is splitting main dictionary into 16k smaller dictionaries (one per slot), so we can perform all slot specific operations, such as iteration, without any additional info in the `dictEntry`. For Redis cluster, the expectation is that there will be a larger number of keys, so the fixed overhead of 16k dictionaries will be The expire dictionary is also split up so that each slot is logically decoupled, so that in subsequent revisions we will be able to atomically flush a slot of data.
## Important changes
* Incremental rehashing - one big change here is that it's not one, but rather up to 16k dictionaries that can be rehashing at the same time, in order to keep track of them, we introduce a separate queue for dictionaries that are rehashing. Also instead of rehashing a single dictionary, cron job will now try to rehash as many as it can in 1ms.
* getRandomKey - now needs to not only select a random key, from the random bucket, but also needs to select a random dictionary. Fairness is a major concern here, as it's possible that keys can be unevenly distributed across the slots. In order to address this search we introduced binary index tree). With that data structure we are able to efficiently find a random slot using binary search in O(log^2(slot count)) time.
* Iteration efficiency - when iterating dictionary with a lot of empty slots, we want to skip them efficiently. We can do this using same binary index that is used for random key selection, this index allows us to find a slot for a specific key index. For example if there are 10 keys in the slot 0, then we can quickly find a slot that contains 11th key using binary search on top of the binary index tree.
* scan API - in order to perform a scan across the entire DB, the cursor now needs to not only save position within the dictionary but also the slot id. In this change we append slot id into LSB of the cursor so it can be passed around between client and the server. This has interesting side effect, now you'll be able to start scanning specific slot by simply providing slot id as a cursor value. The plan is to not document this as defined behavior, however. It's also worth nothing the SCAN API is now technically incompatible with previous versions, although practically we don't believe it's an issue.
* Checksum calculation optimizations - During command execution, we know that all of the keys are from the same slot (outside of a few notable exceptions such as cross slot scripts and modules). We don't want to compute the checksum multiple multiple times, hence we are relying on cached slot id in the client during the command executions. All operations that access random keys, either should pass in the known slot or recompute the slot.
* Slot info in RDB - in order to resize individual dictionaries correctly, while loading RDB, it's not enough to know total number of keys (of course we could approximate number of keys per slot, but it won't be precise). To address this issue, we've added additional metadata into RDB that contains number of keys in each slot, which can be used as a hint during loading.
* DB size - besides `DBSIZE` API, we need to know size of the DB in many places want, in order to avoid scanning all dictionaries and summing up their sizes in a loop, we've introduced a new field into `redisDb` that keeps track of `key_count`. This way we can keep DBSIZE operation O(1). This is also kept for O(1) expires computation as well.
## Performance
This change improves SET performance in cluster mode by ~5%, most of the gains come from us not having to maintain linked lists for keys in slot, non-cluster mode has same performance. For workloads that rely on evictions, the performance is similar because of the extra overhead for finding keys to evict.
RDB loading performance is slightly reduced, as the slot of each key needs to be computed during the load.
## Interface changes
* Removed `overhead.hashtable.slot-to-keys` to `MEMORY STATS`
* Scan API will now require 64 bits to store the cursor, even on 32 bit systems, as the slot information will be stored.
* New RDB version to support the new op code for SLOT information.
---------
Co-authored-by: Vitaly Arbuzov <arvit@amazon.com>
Co-authored-by: Harkrishn Patro <harkrisp@amazon.com>
Co-authored-by: Roshan Khatri <rvkhatri@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
2023-10-14 23:58:26 -07:00
|
|
|
size_t dictGetStatsMsg(char *buf, size_t bufsize, dictStats *stats, int full);
|
|
|
|
dictStats* dictGetStatsHt(dict *d, int htidx, int full);
|
|
|
|
void dictCombineStats(dictStats *from, dictStats *into);
|
|
|
|
void dictFreeStats(dictStats *stats);
|
|
|
|
|
2021-03-10 15:13:11 +08:00
|
|
|
#ifdef REDIS_TEST
|
2021-11-16 14:55:10 +08:00
|
|
|
int dictTest(int argc, char *argv[], int flags);
|
2021-03-10 15:13:11 +08:00
|
|
|
#endif
|
|
|
|
|
2009-03-22 10:30:00 +01:00
|
|
|
#endif /* __DICT_H */
|