This PR replaces dict with hashtable in the ZSET datatype. Instead of
mapping key to score as dict did, the hashtable maps key to a node in
the skiplist, which contains the score. This takes advantage of
hashtable performance improvements and saves 15 bytes per set item - 24
bytes overhead before, 9 bytes after.
Closes#1096
---------
Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Instead of a dictEntry with pointers to key and value, the hashtable
has a pointer directly to the value (robj) which can hold an embedded
key and acts as a key-value in the hashtable. This minimizes the number
of pointers to follow and thus the number of memory accesses to lookup
a key-value pair.
Keys robj
hashtable
+-------+ +-----------------------+
| 0 | | type, encoding, LRU |
| 1 ------->| refcount, expire |
| 2 | | ptr |
| ... | | optional embedded key |
+-------+ | optional embedded val |
+-----------------------+
The expire timestamp (TTL) is also stored in the robj, if any. The expire
hash table points to the same robj.
Overview of changes:
* Replace dict with hashtable in kvstore (kvstore.c)
* Add functions for embedding key and expire in robj (object.c)
* When there's unused space, reserve an expire field to avoid realloting
it later if expire is added.
* Always reserve space for expire for large key names to avoid realloc
if it's set later.
* Update db functions (db.c)
* dbAdd, setKey and setExpire reallocate the object when embedding a key
* setKey does not increment the reference counter, since it would require
duplicating the object. This responsibility is moved to the caller.
* Remove logic for shared integer objects as values in the database. The keys
are now embedded in the objects, so all objects in the database need to be
unique. Thus, we can't use shared objects as values. Also delete test cases
for shared integers.
* Adjust various commands to the changes mentioned above.
* Adjust defrag code
* Improvement: Don't access the expires table before defrag has actually
reallocated the object.
* Adjust test cases that were using hard-coded sizes for dict when realloc
would happen, and some other adjustments in test cases.
* Adjust memory prefetch for new hash table implementation in IO-threading,
using new `hashtableIncrementalFind` API
* Adjust offloading of free() to IO threads: Object free to be done in main
thread while keeping obj->ptr offloading in IO-thread since the DB object is
now allocated by the main-thread and not by the IO-thread as it used to be.
* Let expireIfNeeded take an optional value, to avoid looking up the expires
table when possible.
---------
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: uriyage <78144248+uriyage@users.noreply.github.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Uri Yagelnik <uriy@amazon.com>
New config: `import-mode (yes|no)`
New command: `CLIENT IMPORT-SOURCE (ON|OFF)`
The config, when set to `yes`, disables eviction and deletion of expired
keys, except for commands coming from a client which has marked itself
as an import-source, the data source when importing data from another
node, using the CLIENT IMPORT-SOURCE command.
When we sync data from the source Valkey to the destination Valkey using
some sync tools like
[redis-shake](https://github.com/tair-opensource/RedisShake), the
destination Valkey can perform expiration and eviction, which may cause
data corruption. This problem has been discussed in
https://github.com/redis/redis/discussions/9760#discussioncomment-1681041
and Redis already have a solution. But in Valkey we haven't fixed it by
now.
E.g. we call `set key 1 ex 1` on the source server and transfer this
command to the destination server. Then we call `incr key` on the source
server before the key expired, we will have a key on the source server
with a value of 2. But when the command arrived at the destination
server, the key may be expired and has deleted. So we will have a key on
the destination server with a value of 1, which is inconsistent with the
source server.
In standalone mode, we can use writable replica to simplify the sync
process. However, in cluster mode, we still need a sync tool to help us
transfer the source data to the destination. The sync tool usually work
as a normal client and the destination works as a primary which keep
expiration and eviction.
In this PR, we add a new mode named 'import-mode'. In this mode, server
stop expiration and eviction just like a replica. Notice that this mode
exists only in sync state to avoid data inconsistency caused by
expiration and eviction. Import mode only takes effect on the primary.
Sync tools can mark their clients as an import source by `CLIENT
IMPORT-SOURCE`, which work like a client from primary and can visit
expired keys in `lookupkey`.
**Notice: during the migration, other clients, apart from the import
source, should not access the data imported by import source.**
---------
Signed-off-by: lvyanqi.lyq <lvyanqi.lyq@alibaba-inc.com>
Signed-off-by: Yanqi Lv <lvyanqi.lyq@alibaba-inc.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
moduleTimerHandler is aeTimeProc handler and event loop gets created
with this. However, found that the function return type is int but
actually returns "long long" value(i.e., next_period). and return value
being assigned to int variable in processTimeEvents(where time events
are processed), this might cause an overflow of the timer values. So
changed the return type of the function to long long. And also updated
other callback function return type to be consistent.
I found this when I was checking functions reported in
https://github.com/valkey-io/valkey/issues/1054 issue stacktrace. (FYI,
this is just to update the return type to be consistent and it will not
the fix for the issue reported)
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
Update references of copyright being assigned to Salvatore when it was
transferred to Redis Ltd. as per
https://github.com/valkey-io/valkey/issues/544.
---------
Signed-off-by: Pieter Cailliau <pieter@redis.com>
- Replaces custom atomics logic with C11 default atomics logic.
- Drops "atomicvar_api" field from server info
Closes#485
---------
Signed-off-by: adetunjii <adetunjithomas1@outlook.com>
Signed-off-by: Samuel Adetunji <adetunjithomas1@outlook.com>
Co-authored-by: teej4y <samuel.adetunji@prunny.com>
I have validated that these settings closely match the existing coding
style with one major exception on `BreakBeforeBraces`, which will be
`Attach` going forward. The mixed `BreakBeforeBraces` styles in the
current codebase are hard to imitate and also very odd IMHO - see below
```
if (a == 1) { /*Attach */
}
```
```
if (a == 1 ||
b == 2)
{ /* Why? */
}
```
Please do NOT merge just yet. Will add the github action next once the
style is reviewed/approved.
---------
Signed-off-by: Ping Xie <pingxie@google.com>
Fix the mem_freed variable to be initialized with init.
with this PR prevents the variable from acting unknowingly.
Signed-off-by: NAM UK KIM <namuk2004@naver.com>
This includes comments used for module API documentation.
* Strategy for replacement: Regex search: `(//|/\*| \*|#).* ("|\()?(r|R)edis( |\.
|'|\n|,|-|\)|")(?!nor the names of its contributors)(?!Ltd.)(?!Labs)(?!Contributors.)`
* Don't edit copyright comments
* Replace "Redis version X.X" -> "Redis OSS version X.X" to distinguish
from newly licensed repository
* Replace "Redis Object" -> "Object"
* Exclude markdown for now
* Don't edit Lua scripting comments referring to redis.X API
* Replace "Redis Protocol" -> "RESP"
* Replace redis-benchmark, -cli, -server, -check-aof/rdb with "valkey-"
prefix
* Most other places, I use best judgement to either remove "Redis", or
replace with "the server" or "server"
Fixes#148
---------
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
# Description
Gather most of the scattered `redisDb`-related code from the per-slot
dict PR (#11695) and turn it to a new data structure, `kvstore`. i.e.
it's a class that represents an array of dictionaries.
# Motivation
The main motivation is code cleanliness, the idea of using an array of
dictionaries is very well-suited to becoming a self-contained data
structure.
This allowed cleaning some ugly code, among others: loops that run twice
on the main dict and expires dict, and duplicate code for allocating and
releasing this data structure.
# Notes
1. This PR reverts the part of https://github.com/redis/redis/pull/12848
where the `rehashing` list is global (handling rehashing `dict`s is
under the responsibility of `kvstore`, and should not be managed by the
server)
2. This PR also replaces the type of `server.pubsubshard_channels` from
`dict**` to `kvstore` (original PR:
https://github.com/redis/redis/pull/12804). After that was done,
server.pubsub_channels was also chosen to be a `kvstore` (with only one
`dict`, which seems odd) just to make the code cleaner by making it the
same type as `server.pubsubshard_channels`, see
`pubsubtype.serverPubSubChannels`
3. the keys and expires kvstores are currenlty configured to allocate
the individual dicts only when the first key is added (unlike before, in
which they allocated them in advance), but they won't release them when
the last key is deleted.
Worth mentioning that due to the recent change the reply of DEBUG
HTSTATS changed, in case no keys were ever added to the db.
before:
```
127.0.0.1:6379> DEBUG htstats 9
[Dictionary HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
[Expires HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
```
after:
```
127.0.0.1:6379> DEBUG htstats 9
[Dictionary HT]
[Expires HT]
```
The decrement time was replaced by access time in
583c31472577fb8175e17ee0ce243972f4dd8425.
The halved and doubled LFU_INIT_VAL logic has been changed in
06ca9d683920da19ad53532f8cd55b54584027bc.
Now we just decrement the counter by num_periods. This has been
previously fixed in redis.conf, #11108.
Additional optimizations for the eviction logic in #11695:
To make the eviction logic clearer and decouple the number of sampled
keys from the running mode (cluster or standalone).
* When sampling in each database, we only care about the number of keys
in the current database (not the dicts we sampled from).
* If there are a insufficient number of keys in the current database
(e.g. 10 times the value of `maxmemory_samples`), we can break out
sooner (to avoid looping on a sparse database).
* We'll never try to sample the db dicts more times than the number of
non-empty dicts in the db (max 1 in non-cluster mode).
And it also ensures that each database has a sufficient amount of
sampled keys, so even if unsharded-cluster supports multiple databases,
there won't be any issues.
other changes:
1. keep track of the number of non-empty dicts in each database.
2. move key_count tracking into cumulativeKeyCountAdd rather than all
it's callers
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
Redis 7.2 (#9406) introduced a new modules event, `RedisModuleEvent_Key`.
This new event allows the module to read the key data just before it is removed
from the database (either deleted, expired, evicted, or overwritten).
When the key is removed from the database, either by active expire or eviction.
The new event was not called as part of an execution unit. This can cause an
issue if the module registers a post notification job inside the event. This job will
not be executed atomically with the expiration/eviction operation and will not
replicated inside a Multi/Exec. Moreover, the post notification job will be executed
right after the event where it is still not safe to perform any write operation, this will
violate the promise that post notification job will be called atomically with the
operation that triggered it and **only when it is safe to write**.
This PR fixes the issue by wrapping each expiration/eviction of a key with an execution
unit. This makes sure the entire operation will run atomically and all the post notification
jobs will be executed at the end where it is safe to write.
Tests were modified to verify the fix.
This is an implementation of https://github.com/redis/redis/issues/10589 that eliminates 16 bytes per entry in cluster mode, that are currently used to create a linked list between entries in the same slot. Main idea is splitting main dictionary into 16k smaller dictionaries (one per slot), so we can perform all slot specific operations, such as iteration, without any additional info in the `dictEntry`. For Redis cluster, the expectation is that there will be a larger number of keys, so the fixed overhead of 16k dictionaries will be The expire dictionary is also split up so that each slot is logically decoupled, so that in subsequent revisions we will be able to atomically flush a slot of data.
## Important changes
* Incremental rehashing - one big change here is that it's not one, but rather up to 16k dictionaries that can be rehashing at the same time, in order to keep track of them, we introduce a separate queue for dictionaries that are rehashing. Also instead of rehashing a single dictionary, cron job will now try to rehash as many as it can in 1ms.
* getRandomKey - now needs to not only select a random key, from the random bucket, but also needs to select a random dictionary. Fairness is a major concern here, as it's possible that keys can be unevenly distributed across the slots. In order to address this search we introduced binary index tree). With that data structure we are able to efficiently find a random slot using binary search in O(log^2(slot count)) time.
* Iteration efficiency - when iterating dictionary with a lot of empty slots, we want to skip them efficiently. We can do this using same binary index that is used for random key selection, this index allows us to find a slot for a specific key index. For example if there are 10 keys in the slot 0, then we can quickly find a slot that contains 11th key using binary search on top of the binary index tree.
* scan API - in order to perform a scan across the entire DB, the cursor now needs to not only save position within the dictionary but also the slot id. In this change we append slot id into LSB of the cursor so it can be passed around between client and the server. This has interesting side effect, now you'll be able to start scanning specific slot by simply providing slot id as a cursor value. The plan is to not document this as defined behavior, however. It's also worth nothing the SCAN API is now technically incompatible with previous versions, although practically we don't believe it's an issue.
* Checksum calculation optimizations - During command execution, we know that all of the keys are from the same slot (outside of a few notable exceptions such as cross slot scripts and modules). We don't want to compute the checksum multiple multiple times, hence we are relying on cached slot id in the client during the command executions. All operations that access random keys, either should pass in the known slot or recompute the slot.
* Slot info in RDB - in order to resize individual dictionaries correctly, while loading RDB, it's not enough to know total number of keys (of course we could approximate number of keys per slot, but it won't be precise). To address this issue, we've added additional metadata into RDB that contains number of keys in each slot, which can be used as a hint during loading.
* DB size - besides `DBSIZE` API, we need to know size of the DB in many places want, in order to avoid scanning all dictionaries and summing up their sizes in a loop, we've introduced a new field into `redisDb` that keeps track of `key_count`. This way we can keep DBSIZE operation O(1). This is also kept for O(1) expires computation as well.
## Performance
This change improves SET performance in cluster mode by ~5%, most of the gains come from us not having to maintain linked lists for keys in slot, non-cluster mode has same performance. For workloads that rely on evictions, the performance is similar because of the extra overhead for finding keys to evict.
RDB loading performance is slightly reduced, as the slot of each key needs to be computed during the load.
## Interface changes
* Removed `overhead.hashtable.slot-to-keys` to `MEMORY STATS`
* Scan API will now require 64 bits to store the cursor, even on 32 bit systems, as the slot information will be stored.
* New RDB version to support the new op code for SLOT information.
---------
Co-authored-by: Vitaly Arbuzov <arvit@amazon.com>
Co-authored-by: Harkrishn Patro <harkrisp@amazon.com>
Co-authored-by: Roshan Khatri <rvkhatri@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
This pr can get two performance benefits:
1. Stop redundant initialization when most robj objects are created
2. LRU_CLOCK will no longer be called in io threads, so we can avoid the `atomicGet`
Another code optimization:
deleted the redundant judgment in dbSetValue, no matter in LFU or LRU, the lru field inold
robj is always the freshest (it is always updated in lookupkey), so we don't need to judge if in LFU
Technically declaring a prototype with an empty declaration has been deprecated since the early days of C, but we never got a warning for it. C2x will apparently be introducing a breaking change if you are using this type of declarator, so Clang 15 has started issuing a warning with -pedantic. Although not apparently a problem for any of the compiler we build on, if feels like the right thing is to properly adhere to the C standard and use (void).
1. Get rid of server.core_propagates - we can just rely on module/call nesting levels
2. Rename in_nested_call to execution_nesting and update the comment
3. Remove module_ctx_nesting (redundant, we can use execution_nesting)
4. Modify postExecutionUnitOperations according to the comment (The main purpose of this PR)
5. trackingHandlePendingKeyInvalidations: Check the nesting level inside this function
Add a new module event `RedisModule_Event_Key`, this event is fired
when a key is removed from the keyspace.
The event includes an open key that can be used for reading the key before
it is removed. Modules can also extract the key-name, and use RM_Open
or RM_Call to access key from within that event, but shouldn't modify anything
from within this event.
The following sub events are available:
- `REDISMODULE_SUBEVENT_KEY_DELETED`
- `REDISMODULE_SUBEVENT_KEY_EXPIRED`
- `REDISMODULE_SUBEVENT_KEY_EVICTED`
- `REDISMODULE_SUBEVENT_KEY_OVERWRITE`
The data pointer can be casted to a RedisModuleKeyInfo structure
with the following fields:
```
RedisModuleKey *key; // Opened Key
```
### internals
* We also add two dict functions:
`dictTwoPhaseUnlinkFind` finds an element from the table, also get the plink of the entry.
The entry is returned if the element is found. The user should later call `dictTwoPhaseUnlinkFree`
with it in order to unlink and release it. Otherwise if the key is not found, NULL is returned.
These two functions should be used in pair. `dictTwoPhaseUnlinkFind` pauses rehash and
`dictTwoPhaseUnlinkFree` resumes rehash.
* We change `dbOverwrite` to `dbReplaceValue` which just replaces the value of the key and
doesn't fire any events. The "overwrite" part (which emits events) is just when called from `setKey`,
the other places that called dbOverwrite were ones that just update the value in-place (INCR*, SPOP,
and dbUnshareStringValue). This should not have any real impact since `moduleNotifyKeyUnlink` and
`signalDeletedKeyAsReady` wouldn't have mattered in these cases anyway (i.e. module keys and
stream keys didn't have direct calls to dbOverwrite)
* since we allow doing RM_OpenKey from withing these callbacks, we temporarily disable lazy expiry.
* We also temporarily disable lazy expiry when we are in unlink/unlink2 callback and keyspace
notification callback.
* Move special definitions to the top of redismodule.h
This is needed to resolve compilation errors with RedisModuleKeyInfoV1
that carries a RedisModuleKey member.
Co-authored-by: Oran Agra <oran@redislabs.com>
### Summary of API additions
* `RedisModule_AddPostNotificationJob` - new API to call inside a key space
notification (and on more locations in the future) and allow to add a post job as describe above.
* New module option, `REDISMODULE_OPTIONS_ALLOW_NESTED_KEYSPACE_NOTIFICATIONS`,
allows to disable Redis protection of nested key-space notifications.
* `RedisModule_GetModuleOptionsAll` - gets the mask of all supported module options so a module
will be able to check if a given option is supported by the current running Redis instance.
### Background
The following PR is a proposal of handling write operations inside module key space notifications.
After a lot of discussions we came to a conclusion that module should not perform any write
operations on key space notification.
Some examples of issues that such write operation can cause are describe on the following links:
* Bad replication oreder - https://github.com/redis/redis/pull/10969
* Used after free - https://github.com/redis/redis/pull/10969#issuecomment-1223771006
* Used after free - https://github.com/redis/redis/pull/9406#issuecomment-1221684054
There are probably more issues that are yet to be discovered. The underline problem with writing
inside key space notification is that the notification runs synchronously, this means that the notification
code will be executed in the middle on Redis logic (commands logic, eviction, expire).
Redis **do not assume** that the data might change while running the logic and such changes
can crash Redis or cause unexpected behaviour.
The solution is to state that modules **should not** perform any write command inside key space
notification (we can chose whether or not we want to force it). To still cover the use-case where
module wants to perform a write operation as a reaction to key space notifications, we introduce
a new API , `RedisModule_AddPostNotificationJob`, that allows to register a callback that will be
called by Redis when the following conditions hold:
* It is safe to perform any write operation.
* The job will be called atomically along side the operation that triggers it (in our case, key
space notification).
Module can use this new API to safely perform any write operation and still achieve atomicity
between the notification and the write.
Although currently the API is supported on key space notifications, the API is written in a generic
way so that in the future we will be able to use it on other places (server events for example).
### Technical Details
Whenever a module uses `RedisModule_AddPostNotificationJob` the callback is added to a list
of callbacks (called `modulePostExecUnitJobs`) that need to be invoke after the current execution
unit ends (whether its a command, eviction, or active expire). In order to trigger those callback
atomically with the notification effect, we call those callbacks on `postExecutionUnitOperations`
(which was `propagatePendingCommands` before this PR). The new function fires the post jobs
and then calls `propagatePendingCommands`.
If the callback perform more operations that triggers more key space notifications. Those keys
space notifications might register more callbacks. Those callbacks will be added to the end
of `modulePostExecUnitJobs` list and will be invoke atomically after the current callback ends.
This raises a concerns of entering an infinite loops, we consider infinite loops as a logical bug
that need to be fixed in the module, an attempt to protect against infinite loops by halting the
execution could result in violation of the feature correctness and so **Redis will make no attempt
to protect the module from infinite loops**
In addition, currently key space notifications are not nested. Some modules might want to allow
nesting key-space notifications. To allow that and keep backward compatibility, we introduce a
new module option called `REDISMODULE_OPTIONS_ALLOW_NESTED_KEYSPACE_NOTIFICATIONS`.
Setting this option will disable the Redis key-space notifications nesting protection and will
pass this responsibility to the module.
### Redis infrastructure
This PR promotes the existing `propagatePendingCommands` to an "Execution Unit" concept,
which is called after each atomic unit of execution,
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: Yossi Gottlieb <yossigo@gmail.com>
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
Renamed from "Pause Clients" to "Pause Actions" since the mechanism can pause
several actions in redis, not just clients (e.g. eviction, expiration).
Previously each pause purpose (which has a timeout that's tracked separately from others purposes),
also implicitly dictated what it pauses (reads, writes, eviction, etc). Now it is explicit, and
the actions that are paused (bit flags) are defined separately from the purpose.
- Previously, when using feature pause-client it also implicitly means to make the server static:
- Pause replica traffic
- Pauses eviction processing
- Pauses expire processing
Making the server static is used also for failover and shutdown. This PR internally rebrand
pause-client API to become pause-action API. It also Simplifies pauseClients structure
by replacing pointers array with static array.
The context of this PR is to add another trigger to pause-client which will activated in case
of OOM as throttling mechanism ([see here](https://github.com/redis/redis/issues/10907)).
In this case we want only to pause client, and eviction actions.
This bug is introduced in #7653. (Redis 6.2.0)
When `server.maxmemory_eviction_tenacity` is 100, `eviction_time_limit_us` is
`ULONG_MAX`, and if we cannot find the best key to delete (e.g. maxmemory-policy
is `volatile-lru` and all keys with ttl have been evicted), in `cant_free` redis will sleep
forever if some items are being freed in the lazyfree thread.
The PR reverts the changes made on #10969.
The reason for revert was trigger because of occasional test failure
that started after the PR was merged.
The issue is that if there is a lazy expire during the command invocation,
the `del` command is added to the replication stream after the command
placeholder. So the logical order on the primary is:
* Delete the key (lazy expiration)
* Command invocation
But the replication stream gets it the other way around:
* Command invocation (because the command is written into the placeholder)
* Delete the key (lazy expiration)
So if the command write to the key that was just lazy expired we will get
inconsistency between primary and replica.
One solution we considered is to add another lazy expire replication stream
and write all the lazy expire there. Then when replicating, we will replicate the
lazy expire replication stream first. This will solve this specific test failure but
we realize that the issues does not ends here and the more we dig the more
problems we find.One of the example we thought about (that can actually
crashes Redis) is as follow:
* User perform SINTERSTORE
* When Redis tries to fetch the second input key it triggers lazy expire
* The lazy expire trigger a module logic that deletes the first input key
* Now Redis hold the robj of the first input key that was actually freed
We believe we took the wrong approach and we will come up with another
PR that solve the problem differently, for now we revert the changes so we
will not have the tests failure.
Notice that not the entire code was revert, some parts of the PR are changes
that we would like to keep. The changes that **was** reverted are:
* Saving a placeholder for replication at the beginning of the command (`call` function)
* Order of the replication stream on active expire and eviction (we will decide how
to handle it correctly on follow up PR)
* `Spop` changes are no longer needed (because we reverted the placeholder code)
Changes that **was not** reverted:
* On expire/eviction, wrap the `del` and the notification effect in a multi exec.
* `PropagateNow` function can still accept a special dbid, -1, indicating not to replicate select.
* Keep optimisation for reusing the `alsoPropagate` array instead of allocating it each time.
Tests:
* All tests was kept and only few tests was modify to work correctly with the changes
* Test was added to verify that the revert fixes the issues.
Fix replication inconsistency on modules that uses key space notifications.
### The Problem
In general, key space notifications are invoked after the command logic was
executed (this is not always the case, we will discuss later about specific
command that do not follow this rules). For example, the `set x 1` will trigger
a `set` notification that will be invoked after the `set` logic was performed, so
if the notification logic will try to fetch `x`, it will see the new data that was written.
Consider the scenario on which the notification logic performs some write
commands. for example, the notification logic increase some counter,
`incr x{counter}`, indicating how many times `x` was changed.
The logical order by which the logic was executed is has follow:
```
set x 1
incr x{counter}
```
The issue is that the `set x 1` command is added to the replication buffer
at the end of the command invocation (specifically after the key space
notification logic was invoked and performed the `incr` command).
The replication/aof sees the commands in the wrong order:
```
incr x{counter}
set x 1
```
In this specific example the order is less important.
But if, for example, the notification would have deleted `x` then we would
end up with primary-replica inconsistency.
### The Solution
Put the command that cause the notification in its rightful place. In the
above example, the `set x 1` command logic was executed before the
notification logic, so it should be added to the replication buffer before
the commands that is invoked by the notification logic. To achieve this,
without a major code refactoring, we save a placeholder in the replication
buffer, when finishing invoking the command logic we check if the command
need to be replicated, and if it does, we use the placeholder to add it to the
replication buffer instead of appending it to the end.
To be efficient and not allocating memory on each command to save the
placeholder, the replication buffer array was modified to reuse memory
(instead of allocating it each time we want to replicate commands).
Also, to avoid saving a placeholder when not needed, we do it only for
WRITE or MAY_REPLICATE commands.
#### Additional Fixes
* Expire and Eviction notifications:
* Expire/Eviction logical order was to first perform the Expire/Eviction
and then the notification logic. The replication buffer got this in the
other way around (first notification effect and then the `del` command).
The PR fixes this issue.
* The notification effect and the `del` command was not wrap with
`multi-exec` (if needed). The PR also fix this issue.
* SPOP command:
* On spop, the `spop` notification was fired before the command logic
was executed. The change in this PR would have cause the replication
order to be change (first `spop` command and then notification `logic`)
although the logical order is first the notification logic and then the
`spop` logic. The right fix would have been to move the notification to
be fired after the command was executed (like all the other commands),
but this can be considered a breaking change. To overcome this, the PR
keeps the current behavior and changes the `spop` code to keep the right
logical order when pushing commands to the replication buffer. Another PR
will follow to fix the SPOP properly and match it to the other command (we
split it to 2 separate PR's so it will be easy to cherry-pick this PR to 7.0 if
we chose to).
#### Unhanded Known Limitations
* key miss event:
* On key miss event, if a module performed some write command on the
event (using `RM_Call`), the `dirty` counter would increase and the read
command that cause the key miss event would be replicated to the replication
and aof. This problem can also happened on a write command that open
some keys but eventually decides not to perform any action. We decided
not to handle this problem on this PR because the solution is complex
and will cause additional risks in case we will want to cherry-pick this PR.
We should decide if we want to handle it in future PR's. For now, modules
writers is advice not to perform any write commands on key miss event.
#### Testing
* We already have tests to cover cases where a notification is invoking write
commands that are also added to the replication buffer, the tests was modified
to verify that the replica gets the command in the correct logical order.
* Test was added to verify that `spop` behavior was kept unchanged.
* Test was added to verify key miss event behave as expected.
* Test was added to verify the changes do not break lazy expiration.
#### Additional Changes
* `propagateNow` function can accept a special dbid, -1, indicating not
to replicate `select`. We use this to replicate `multi/exec` on `propagatePendingCommands`
function. The side effect of this change is that now the `select` command
will appear inside the `multi/exec` block on the replication stream (instead of
outside of the `multi/exec` block). Tests was modified to match this new behavior.
* Fix broken protocol when redis can't persist to RDB (general commands, not
modules), excessive newline. regression of #10372 (7.0 RC3)
* Fix broken protocol when Redis can't persist to AOF (modules and
scripts), missing newline.
* Fix bug in OOM check of EVAL scripts called from RM_Call.
set the cached OOM state for scripts before executing module commands too,
so that it can serve scripts that are executed by modules.
i.e. in the past EVAL executed by RM_Call could have either falsely
fail or falsely succeeded because of a wrong cached OOM state flag.
* Fix bugs with RM_Yield:
1. SHUTDOWN should only accept the NOSAVE mode
2. Avoid eviction during yield command processing.
3. Avoid processing master client commands while yielding from another client
* Add new two more checks to RM_Call script mode.
1. READONLY You can't write against a read only replica
2. MASTERDOWN Link with MASTER is down and `replica-serve-stale-data` is set to `no`
* Add new RM_Call flag to let redis automatically refuse `deny-oom` commands
while over the memory limit.
* Add tests to cover various errors from Scripts, Modules, Modules
calling scripts, and Modules calling commands in script mode.
Add tests:
* Looks like the MISCONF error was completely uncovered by the tests,
add tests for it, including from scripts, and modules
* Add tests for NOREPLICAS from scripts
* Add tests for the various errors in module RM_Call, including RM_Call that
calls EVAL, and RM_call in "eval mode". that includes:
NOREPLICAS, READONLY, MASTERDOWN, MISCONF
If was first added in #9890 to solve the problem of
CONFIG SET maxmemory causing eviction inside MULTI/EXEC,
but that problem is already fixed (CONFIG SET doesn't evict
directly, it just schedules a later eviction)
Keep that condition may hide bugs (i.e. performEvictions
should always expect to have an empty server.also_propagate)
This would mean that the effects of `CONFIG SET maxmemory` may not be visible once the command returns.
That could anyway happen since incremental eviction was added in redis 6.2 (see #7653)
We do this to fix one of the propagation bugs about eviction see #9890 and #10014.
Implement Multi-Part AOF mechanism to avoid overheads during AOFRW.
Introducing a folder with multiple AOF files tracked by a manifest file.
The main issues with the the original AOFRW mechanism are:
* buffering of commands that are processed during rewrite (consuming a lot of RAM)
* freezes of the main process when the AOFRW completes to drain the remaining part of the buffer and fsync it.
* double disk IO for the data that arrives during AOFRW (had to be written to both the old and new AOF files)
The main modifications of this PR:
1. Remove the AOF rewrite buffer and related code.
2. Divide the AOF into multiple files, they are classified as two types, one is the the `BASE` type,
it represents the full amount of data (Maybe AOF or RDB format) after each AOFRW, there is only
one `BASE` file at most. The second is `INCR` type, may have more than one. They represent the
incremental commands since the last AOFRW.
3. Use a AOF manifest file to record and manage these AOF files mentioned above.
4. The original configuration of `appendfilename` will be the base part of the new file name, for example:
`appendonly.aof.1.base.rdb` and `appendonly.aof.2.incr.aof`
5. Add manifest-related TCL tests, and modified some existing tests that depend on the `appendfilename`
6. Remove the `aof_rewrite_buffer_length` field in info.
7. Add `aof-disable-auto-gc` configuration. By default we're automatically deleting HISTORY type AOFs.
It also gives users the opportunity to preserve the history AOFs. just for testing use now.
8. Add AOFRW limiting measure. When the AOFRW failures reaches the threshold (3 times now),
we will delay the execution of the next AOFRW by 1 minute. If the next AOFRW also fails, it will be
delayed by 2 minutes. The next is 4, 8, 16, the maximum delay is 60 minutes (1 hour). During the limit
period, we can still use the 'bgrewriteaof' command to execute AOFRW immediately.
9. Support upgrade (load) data from old version redis.
10. Add `appenddirname` configuration, as the directory name of the append only files. All AOF files and
manifest file will be placed in this directory.
11. Only the last AOF file (BASE or INCR) can be truncated. Otherwise redis will exit even if
`aof-load-truncated` is enabled.
Co-authored-by: Oran Agra <oran@redislabs.com>
The mess:
Some parts use alsoPropagate for late propagation, others using an immediate one (propagate()),
causing edge cases, ugly/hacky code, and the tendency for bugs
The basic idea is that all commands are propagated via alsoPropagate (i.e. added to a list) and the
top-most call() is responsible for going over that list and actually propagating them (and wrapping
them in MULTI/EXEC if there's more than one command). This is done in the new function,
propagatePendingCommands.
Callers to propagatePendingCommands:
1. top-most call() (we want all nested call()s to add to the also_propagate array and just the top-most
one to propagate them) - via `afterCommand`
2. handleClientsBlockedOnKeys: it is out of call() context and it may propagate stuff - via `afterCommand`.
3. handleClientsBlockedOnKeys edge case: if the looked-up key is already expired, we will propagate the
expire but will not unblock any client so `afterCommand` isn't called. in that case, we have to propagate
the deletion explicitly.
4. cron stuff: active-expire and eviction may also propagate stuff
5. modules: the module API allows to propagate stuff from just about anywhere (timers, keyspace notifications,
threads). I could have tried to catch all the out-of-call-context places but it seemed easier to handle it in one
place: when we free the context. in the spirit of what was done in call(), only the top-most freeing of a module
context may cause propagation.
6. modules: when using a thread-safe ctx it's not clear when/if the ctx will be freed. we do know that the module
must lock the GIL before calling RM_Replicate/RM_Call so we propagate the pending commands when
releasing the GIL.
A "known limitation", which were actually a bug, was fixed because of this commit (see propagate.tcl):
When using a mix of RM_Call with `!` and RM_Replicate, the command would propagate out-of-order:
first all the commands from RM_Call, and then the ones from RM_Replicate
Another thing worth mentioning is that if, in the past, a client would issue a MULTI/EXEC with just one
write command the server would blindly propagate the MULTI/EXEC too, even though it's redundant.
not anymore.
This commit renames propagate() to propagateNow() in order to cause conflicts in pending PRs.
propagatePendingCommands is the only caller of propagateNow, which is now a static, internal helper function.
Optimizations:
1. alsoPropagate will not add stuff to also_propagate if there's no AOF and replicas
2. alsoPropagate reallocs also_propagagte exponentially, to save calls to memmove
Bugfixes:
1. CONFIG SET can create evictions, sending notifications which can cause to dirty++ with modules.
we need to prevent it from propagating to AOF/replicas
2. We need to set current_client in RM_Call. buggy scenario:
- CONFIG SET maxmemory, eviction notifications, module hook calls RM_Call
- assertion in lookupKey crashes, because current_client has CONFIG SET, which isn't CMD_WRITE
3. minor: in eviction, call propagateDeletion after notification, like active-expire and all commands
(we always send a notification before propagating the command)
Script unit is a new unit located on script.c.
Its purpose is to provides an API for functions (and eval)
to interact with Redis. Interaction includes mostly
executing commands, but also functionalities like calling
Redis back on long scripts or check if the script was killed.
The interaction is done using a scriptRunCtx object that
need to be created by the user and initialized using scriptPrepareForRun.
Detailed list of functionalities expose by the unit:
1. Calling commands (including all the validation checks such as
acl, cluster, read only run, ...)
2. Set Resp
3. Set Replication method (AOF/REPLICATION/NONE)
4. Call Redis back to on long running scripts to allow Redis reply
to clients and perform script kill
The commit introduce the new unit and uses it on eval commands to
interact with Redis.
The following variable was renamed:
1. lua_caller -> script_caller
2. lua_time_limit -> script_time_limit
3. lua_timedout -> script_timedout
4. lua_oom -> script_oom
5. lua_disable_deny_script -> script_disable_deny_script
6. in_eval -> in_script
The following variables was moved to lctx under eval.c
1. lua
2. lua_client
3. lua_cur_script
4. lua_scripts
5. lua_scripts_mem
6. lua_replicate_commands
7. lua_write_dirty
8. lua_random_dirty
9. lua_multi_emitted
10. lua_repl
11. lua_kill
12. lua_time_start
13. lua_time_snapshot
This commit is in a low risk of introducing any issues and it
is just moving varibales around and not changing any logic.
## Background
For redis master, one replica uses one copy of replication buffer, that is a big waste of memory,
more replicas more waste, and allocate/free memory for every reply list also cost much.
If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with
replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big,
master may be OOM when there are many replicas that separately keep much memory.
Because replication buffers of different replica client are the same, one simple idea is that
all replicas only use one replication buffer, that will effectively save memory.
Since replication backlog content is the same as replicas' output buffer, now we
can discard replication backlog memory and use global shared replication buffer
to implement replication backlog mechanism.
## Implementation
I create one global "replication buffer" which contains content of replication stream.
The structure of "replication buffer" is similar to the reply list that exists in every client.
But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields.
```c
/* Replication buffer blocks is the list of replBufBlock.
*
* +--------------+ +--------------+ +--------------+
* | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 |
* +--------------+ +--------------+ +--------------+
* | / \
* | / \
* | / \
* Repl Backlog Replia_A Replia_B
*
* Each replica or replication backlog increments only the refcount of the
* 'ref_repl_buf_node' which it points to. So when replica walks to the next
* node, it should first increase the next node's refcount, and when we trim
* the replication buffer nodes, we remove node always from the head node which
* refcount is 0. If the refcount of the head node is not 0, we must stop
* trimming and never iterate the next node. */
/* Similar with 'clientReplyBlock', it is used for shared buffers between
* all replica clients and replication backlog. */
typedef struct replBufBlock {
int refcount; /* Number of replicas or repl backlog using. */
long long id; /* The unique incremental number. */
long long repl_offset; /* Start replication offset of the block. */
size_t size, used;
char buf[];
} replBufBlock;
```
So now when we feed replication stream into replication backlog and all replicas, we only need
to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of
replication backlog and replicas to references of the global replication buffer blocks. And we also
need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim
replication backlog if exceeding `repl-backlog-size`.
When sending reply to replicas, we also need to iterate replication buffer blocks and send its
content, when totally sending one block for replica, we decrease current node count and
increase the next current node count, and then free the block which reference is 0 from the
head of replication buffer blocks.
Since now we use linked list to manage replication backlog, it may cost much time for iterating
all linked list nodes to find corresponding replication buffer node. So we create a rax tree to
store some nodes for index, but to avoid rax tree occupying too much memory, i record
one per 64 nodes for index.
Currently, to make partial resynchronization as possible as much, we always let replication
backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting
if slow replicas that reference vast replication buffer blocks, and this method doesn't increase
memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced
replication buffer blocks when we need to trim backlog for exceeding backlog size setting,
we trim backlog incrementally (free 64 blocks per call now), and make it faster in
`beforeSleep` (free 640 blocks).
### Other changes
- `mem_total_replication_buffers`: we add this field in INFO command, it means the total
memory of replication buffers used.
- `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory
is not 0, but it still may be 0, since replication backlog and replicas share one global replication
buffer, only if replication buffer memory is more than the repl backlog setting size, we consider
the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption
of repl backlog.
- Key eviction
Since all replicas and replication backlog share global replication buffer, we think only the
part of exceeding backlog size the extra separate consumption of replicas.
Because we trim backlog incrementally in the background, backlog size may exceeds our
setting if slow replicas that reference vast replication buffer blocks disconnect.
To avoid massive eviction loop, we don't count the delayed freed replication backlog into
used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory.
- `client-output-buffer-limit` check for replica clients
It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size
config (partial sync will succeed and then replica will get disconnected). Such a configuration is
ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption
implications since the replica client will share the backlog buffers memory.
- Drop replication backlog after loading data if needed
We always create replication backlog if server is a master, we need it because we put DELs in
it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb,
it is not possible to support partial resynchronization, to avoid extra memory of replication backlog,
we drop it.
- Multi IO threads
Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled,
to guarantee data accessing thread safe, we must let main thread handle sending the output buffer
to all replicas. But before, other IO threads could handle sending output buffer of all replicas.
## Other optimizations
This solution resolve some other problem:
- When replicas disconnect with master since of out of output buffer limit, releasing the output
buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now,
it doesn't cause freezing.
- This implementation may mitigate reply list copy cost time(also freezes server) when one replication
has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy
reference info, it is very light.
- If we set replication backlog size big, it also may cost much time to copy replication backlog into
replica's output buffer. But this commit eliminates this problem.
- Resizing replication backlog size doesn't empty current replication backlog content.
Minor optimize getMaxmemoryState, when server.maxmemory is not set,
don't count AOF and replicas buffers.
Co-authored-by: Viktor Söderqvist <viktor@zuiderkwast.se>
Add two INFO metrics:
```
total_eviction_exceeded_time:69734
current_eviction_exceeded_time:10230
```
`current_eviction_exceeded_time` if greater than 0, means how much time current used memory is greater than `maxmemory`. And we are still over the maxmemory. If used memory is below `maxmemory`, this metric is reset to 0.
`total_eviction_exceeded_time` means total time used memory is greater than `maxmemory` since server startup.
The units of these two metrics are ms.
Co-authored-by: Oran Agra <oran@redislabs.com>
This PR adds a spell checker CI action that will fail future PRs if they introduce typos and spelling mistakes.
This spell checker is based on blacklist of common spelling mistakes, so it will not catch everything,
but at least it is also unlikely to cause false positives.
Besides that, the PR also fixes many spelling mistakes and types, not all are a result of the spell checker we use.
Here's a summary of other changes:
1. Scanned the entire source code and fixes all sorts of typos and spelling mistakes (including missing or extra spaces).
2. Outdated function / variable / argument names in comments
3. Fix outdated keyspace masks error log when we check `config.notify-keyspace-events` in loadServerConfigFromString.
4. Trim the white space at the end of line in `module.c`. Check: https://github.com/redis/redis/pull/7751
5. Some outdated https link URLs.
6. Fix some outdated comment. Such as:
- In README: about the rdb, we used to said create a `thread`, change to `process`
- dbRandomKey function coment (about the dictGetRandomKey, change to dictGetFairRandomKey)
- notifyKeyspaceEvent fucntion comment (add type arg)
- Some others minor fix in comment (Most of them are incorrectly quoted by variable names)
7. Modified the error log so that users can easily distinguish between TCP and TLS in `changeBindAddr`
Make aof rewrite buffer memory size more accurate, before, there may be 20%
deviation with its real memory usage.
The implication are both lower memory usage, and also a more accurate INFO.
In theory we should have used zmalloc_usable_size, but it may be slower,
and now after #7875, there's no actual difference.
So this change isn't making a dramatic change.
Co-authored-by: Oran Agra <oran@redislabs.com>
When client tracking is enabled signalModifiedKey can increase memory usage,
this can cause the loop in performEvictions to keep running since it was measuring
the memory usage impact of signalModifiedKey.
The section that measures the memory impact of the eviction should be just on dbDelete,
excluding keyspace notification, client tracking, and propagation to AOF and replicas.
This resolves part of the problem described in #8069
p.s. fix took 1 minute, test took about 3 hours to write.
As we know, redis may reject user's requests or evict some keys if
used memory is over maxmemory. Dictionaries expanding may make
things worse, some big dictionaries, such as main db and expires dict,
may eat huge memory at once for allocating a new big hash table and be
far more than maxmemory after expanding.
There are related issues: #4213#4583
More details, when expand dict in redis, we will allocate a new big
ht[1] that generally is double of ht[0], The size of ht[1] will be
very big if ht[0] already is big. For db dict, if we have more than
64 million keys, we need to cost 1GB for ht[1] when dict expands.
If the sum of used memory and new hash table of dict needed exceeds
maxmemory, we shouldn't allow the dict to expand. Because, if we
enable keys eviction, we still couldn't add much more keys after
eviction and rehashing, what's worse, redis will keep less keys when
redis only remains a little memory for storing new hash table instead
of users' data. Moreover users can't write data in redis if disable
keys eviction.
What this commit changed ?
Add a new member function expandAllowed for dict type, it provide a way
for caller to allow expand or not. We expose two parameters for this
function: more memory needed for expanding and dict current load factor,
users can implement a function to make a decision by them.
For main db dict and expires dict type, these dictionaries may be very
big and cost huge memory for expanding, so we implement a judgement
function: we can stop dict to expand provisionally if used memory will
be over maxmemory after dict expands, but to guarantee the performance
of redis, we still allow dict to expand if dict load factor exceeds the
safe load factor.
Add test cases to verify we don't allow main db to expand when left
memory is not enough, so that avoid keys eviction.
Other changes:
For new hash table size when expand. Before this commit, the size is
that double used of dict and later _dictNextPower. Actually we aim to
control a dict load factor between 0.5 and 1.0. Now we replace *2 with
+1, since the first check is that used >= size, the outcome of before
will usually be the same as _dictNextPower(used+1). The only case where
it'll differ is when dict_can_resize is false during fork, so that later
the _dictNextPower(used*2) will cause the dict to jump to *4 (i.e.
_dictNextPower(1025*2) will return 4096).
Fix rehash test cases due to changing algorithm of new hash table size
when expand.
If 'delta' is negative 'mem_freed' may underflow and cause
the while loop to exit prematurely (And not evicting enough
memory)
mem_freed can be negative when:
1. We use lazy free (consuming memory by appending to a list)
2. Thread doing an allocation between the two calls to zmalloc_used_memory.
Redis 6.0 introduces I/O threads, it is so cool and efficient, we use C11
_Atomic to establish inter-thread synchronization without mutex. But the
compiler that must supports C11 _Atomic can compile redis code, that brings a
lot of inconvenience since some common platforms can't support by default such
as CentOS7, so we want to implement redis atomic type to make it more portable.
We have implemented our atomic variable for redis that only has 'relaxed'
operations in src/atomicvar.h, so we implement some operations with
'sequentially-consistent', just like the default behavior of C11 _Atomic that
can establish inter-thread synchronization. And we replace all uses of C11
_Atomic with redis atomic variable.
Our implementation of redis atomic variable uses C11 _Atomic, __atomic or
__sync macros if available, it supports most common platforms, and we will
detect automatically which feature we use. In Makefile we use a dummy file to
detect if the compiler supports C11 _Atomic. Now for gcc, we can compile redis
code theoretically if your gcc version is not less than 4.1.2(starts to support
__sync_xxx operations). Otherwise, we remove use mutex fallback to implement
redis atomic variable for performance and test. You will get compiling errors
if your compiler doesn't support all features of above.
For cover redis atomic variable tests, we add other CI jobs that build redis on
CentOS6 and CentOS7 and workflow daily jobs that run the tests on them.
For them, we just install gcc by default in order to cover different compiler
versions, gcc is 4.4.7 by default installation on CentOS6 and 4.8.5 on CentOS7.
We restore the feature that we can test redis with Helgrind to find data race
errors. But you need install Valgrind in the default path configuration firstly
before running your tests, since we use macros in helgrind.h to tell Helgrind
inter-thread happens-before relationship explicitly for avoiding false positives.
Please open an issue on github if you find data race errors relate to this commit.
Unrelated:
- Fix redefinition of typedef 'RedisModuleUserChangedFunc'
For some old version compilers, they will report errors or warnings, if we
re-define function type.
Rather than blindly evicting until maxmemory limit is achieved, this
update adds a time limit to eviction. While over the maxmemory limit,
eviction will process before each command AND as a timeProc when no
commands are running.
This will reduce the latency impact on many cases, especially pathological
cases like massive used memory increase during dict rehashing.
There is a risk that some other edge cases (like massive pipelined use
of MGET) could cause Redis memory usage to keep growing despite the
eviction attempts, so a new maxmemory-eviction-tenacity config is
introduced to let users mitigate that.