* Till now, replicas that were unable to persist, would still execute the commands
they got from the master, now they'll panic by default, and we add a new
`replica-ignore-disk-errors` config to change that.
* Till now, when a command failed on a replica or AOF-loading, it only logged a
warning and a stat, we add a new `propagation-error-behavior` config to allow
panicking in that state (may become the default one day)
Note that commands that fail on the replica can either indicate a bug that could
cause data inconsistency between the replica and the master, or they could be
in some cases (specifically in previous versions), a result of a command (e.g. EVAL)
that failed on the master, but still had to be propagated to fail on the replica as well.
Adds the `allow-cross-slot-keys` flag to Eval scripts and Functions to allow
scripts to access keys from multiple slots.
The default behavior is now that they are not allowed to do that (unlike before).
This is a breaking change for 7.0 release candidates (to be part of 7.0.0), but
not for previous redis releases since EVAL without shebang isn't doing this check.
Note that the check is done on both the keys declared by the EVAL / FCALL command
arguments, and also the ones used by the script when making a `redis.call`.
A note about the implementation, there seems to have been some confusion
about allowing access to non local keys. I thought I missed something in our
wider conversation, but Redis scripts do block access to non-local keys.
So the issue was just about cross slots being accessed.
1. Disk error and slave count checks didn't flag the transactions or counted correctly in command stats (regression from #10372 , 7.0 RC3)
2. RM_Call will reply the same way Redis does, in case of non-exisitng command or arity error
3. RM_WrongArtiy will consider the full command name
4. Use lowercase 'u' in "unknonw subcommand" (to align with "unknown command")
Followup work of #10127
Durability of database is a big and old topic, in this regard Redis use AOF to
support it, and `appendfsync=alwasys` policy is the most strict level, guarantee
all data is both written and synced on disk before reply success to client.
But there are some cases have been overlooked, and could lead to durability broken.
1. The most clear one is about threaded-io mode
we should also set client's write handler with `ae_barrier` in
`handleClientsWithPendingWritesUsingThreads`, or the write handler would be
called after read handler in the next event loop, it means the write command result
could be replied to client before flush to AOF.
2. About blocked client (mostly by module)
in `beforeSleep()`, `handleClientsBlockedOnKeys()` should be called before
`flushAppendOnlyFile()`, in case the unblocked clients modify data without persistence
but send reply.
3. When handling `ProcessingEventsWhileBlocked`
normally it takes place when lua/function/module timeout, and we give a chance to users
to kill the slow operation, but we should call `flushAppendOnlyFile()` before
`handleClientsWithPendingWrites()`, in case the other clients in the last event loop get
acknowledge before data persistence.
for a instance:
```
in the same event loop
client A executes set foo bar
client B executes eval "for var=1,10000000,1 do end" 0
```
after the script timeout, client A will get `OK` but lose data after restart (kill redis when
timeout) if we don't flush the write command to AOF.
4. A more complex case about `ProcessingEventsWhileBlocked`
it is lua timeout in transaction, for example
`MULTI; set foo bar; eval "for var=1,10000000,1 do end" 0; EXEC`, then client will get set
command's result before the whole transaction done, that breaks atomicity too.
fortunately, it's already fixed by #5428 (although it's not the original purpose just a side
effect : )), but module timeout should be fixed too.
case 1, 2, 3 are fixed in this commit, the module issue in case 4 needs a followup PR.
To remove `pending_querybuf`, the key point is reusing `querybuf`, it means master client's `querybuf` is not only used to parse command, but also proxy to sub-replicas.
1. add a new variable `repl_applied` for master client to record how many data applied (propagated via `replicationFeedStreamFromMasterStream()`) but not trimmed in `querybuf`.
2. don't sdsrange `querybuf` in `commandProcessed()`, we trim it to `repl_applied` after the whole replication pipeline processed to avoid fragmented `sdsrange`. And here are some scenarios we cannot trim to `qb_pos`:
* we don't receive complete command from master
* master client blocked because of client pause
* IO threads operate read, master client flagged with CLIENT_PENDING_COMMAND
In these scenarios, `qb_pos` points to the part of the current command or the beginning of next command, and the current command is not applied yet, so the `repl_applied` is not equal to `qb_pos`.
Some other notes:
* Do not do big arg optimization on master client, since we can only sdsrange `querybuf` after data sent to replicas.
* Set `qb_pos` and `repl_applied` to 0 when `freeClient` in `replicationCacheMaster`.
* Rewrite `processPendingCommandsAndResetClient` to `processPendingCommandAndInputBuffer`, let `processInputBuffer` to be called successively after `processCommandAndResetClient`.
In a benchmark we noticed we spend a relatively long time updating the client
memory usage leading to performance degradation.
Before #8687 this was performed in the client's cron and didn't affect performance.
But since introducing client eviction we need to perform this after filling the input
buffers and after processing commands. This also lead me to write this code to be
thread safe and perform it in the i/o threads.
It turns out that the main performance issue here is related to atomic operations
being performed while updating the total clients memory usage stats used for client
eviction (`server.stat_clients_type_memory[]`). This update needed to be atomic
because `updateClientMemUsage()` was called from the IO threads.
In this commit I make sure to call `updateClientMemUsage()` only from the main thread.
In case of threaded IO I call it for each client during the "fan-in" phase of the read/write
operation. This also means I could chuck the `updateClientMemUsageBucket()` function
which was called during this phase and embed it into `updateClientMemUsage()`.
Profiling shows this makes `updateClientMemUsage()` (on my x86_64 linux) roughly x4 faster.
Deleting a stream while a client is blocked XREADGROUP should unblock the client.
The idea is that if a client is blocked via XREADGROUP is different from
any other blocking type in the sense that it depends on the existence of both
the key and the group. Even if the key is deleted and then revived with XADD
it won't help any clients blocked on XREADGROUP because the group no longer
exist, so they would fail with -NOGROUP anyway.
The conclusion is that it's better to unblock these clients (with error) upon
the deletion of the key, rather than waiting for the first XADD.
Other changes:
1. Slightly optimize all `serveClientsBlockedOn*` functions by checking `server.blocked_clients_by_type`
2. All `serveClientsBlockedOn*` functions now use a list iterator rather than looking at `listFirst`, relying
on `unblockClient` to delete the head of the list. Before this commit, only `serveClientsBlockedOnStreams`
used to work like that.
3. bugfix: CLIENT UNBLOCK ERROR should work even if the command doesn't have a timeout_callback
(only relevant to module commands)
This PR fix 2 issues on Lua scripting:
* Server error reply statistics (some errors were counted twice).
* Error code and error strings returning from scripts (error code was missing / misplaced).
## Statistics
a Lua script user is considered part of the user application, a sophisticated transaction,
so we want to count an error even if handled silently by the script, but when it is
propagated outwards from the script we don't wanna count it twice. on the other hand,
if the script decides to throw an error on its own (using `redis.error_reply`), we wanna
count that too.
Besides, we do count the `calls` in command statistics for the commands the script calls,
we we should certainly also count `failed_calls`.
So when a simple `eval "return redis.call('set','x','y')" 0` fails, it should count the failed call
to both SET and EVAL, but the `errorstats` and `total_error_replies` should be counted only once.
The PR changes the error object that is raised on errors. Instead of raising a simple Lua
string, Redis will raise a Lua table in the following format:
```
{
err='<error message (including error code)>',
source='<User source file name>',
line='<line where the error happned>',
ignore_error_stats_update=true/false,
}
```
The `luaPushError` function was modified to construct the new error table as describe above.
The `luaRaiseError` was renamed to `luaError` and is now simply called `lua_error` to raise
the table on the top of the Lua stack as the error object.
The reason is that since its functionality is changed, in case some Redis branch / fork uses it,
it's better to have a compilation error than a bug.
The `source` and `line` fields are enriched by the error handler (if possible) and the
`ignore_error_stats_update` is optional and if its not present then the default value is `false`.
If `ignore_error_stats_update` is true, the error will not be counted on the error stats.
When parsing Redis call reply, each error is translated to a Lua table on the format describe
above and the `ignore_error_stats_update` field is set to `true` so we will not count errors
twice (we counted this error when we invoke the command).
The changes in this PR might have been considered as a breaking change for users that used
Lua `pcall` function. Before, the error was a string and now its a table. To keep backward
comparability the PR override the `pcall` implementation and extract the error message from
the error table and return it.
Example of the error stats update:
```
127.0.0.1:6379> lpush l 1
(integer) 2
127.0.0.1:6379> eval "return redis.call('get', 'l')" 0
(error) WRONGTYPE Operation against a key holding the wrong kind of value. script: e471b73f1ef44774987ab00bdf51f21fd9f7974a, on @user_script:1.
127.0.0.1:6379> info Errorstats
# Errorstats
errorstat_WRONGTYPE:count=1
127.0.0.1:6379> info commandstats
# Commandstats
cmdstat_eval:calls=1,usec=341,usec_per_call=341.00,rejected_calls=0,failed_calls=1
cmdstat_info:calls=1,usec=35,usec_per_call=35.00,rejected_calls=0,failed_calls=0
cmdstat_lpush:calls=1,usec=14,usec_per_call=14.00,rejected_calls=0,failed_calls=0
cmdstat_get:calls=1,usec=10,usec_per_call=10.00,rejected_calls=0,failed_calls=1
```
## error message
We can now construct the error message (sent as a reply to the user) from the error table,
so this solves issues where the error message was malformed and the error code appeared
in the middle of the error message:
```diff
127.0.0.1:6379> eval "return redis.call('set','x','y')" 0
-(error) ERR Error running script (call to 71e6319f97b0fe8bdfa1c5df3ce4489946dda479): @user_script:1: OOM command not allowed when used memory > 'maxmemory'.
+(error) OOM command not allowed when used memory > 'maxmemory' @user_script:1. Error running script (call to 71e6319f97b0fe8bdfa1c5df3ce4489946dda479)
```
```diff
127.0.0.1:6379> eval "redis.call('get', 'l')" 0
-(error) ERR Error running script (call to f_8a705cfb9fb09515bfe57ca2bd84a5caee2cbbd1): @user_script:1: WRONGTYPE Operation against a key holding the wrong kind of value
+(error) WRONGTYPE Operation against a key holding the wrong kind of value script: 8a705cfb9fb09515bfe57ca2bd84a5caee2cbbd1, on @user_script:1.
```
Notica that `redis.pcall` was not change:
```
127.0.0.1:6379> eval "return redis.pcall('get', 'l')" 0
(error) WRONGTYPE Operation against a key holding the wrong kind of value
```
## other notes
Notice that Some commands (like GEOADD) changes the cmd variable on the client stats so we
can not count on it to update the command stats. In order to be able to update those stats correctly
we needed to promote `realcmd` variable to be located on the client struct.
Tests was added and modified to verify the changes.
Related PR's: #10279, #10218, #10278, #10309
Co-authored-by: Oran Agra <oran@redislabs.com>
Avoid deferred array reply on genericZrangebyrankCommand() when consumer type is client.
I.e. any ZRANGE / ZREVRNGE (when tank is used).
This was a performance regression introduced in #7844 (v 6.2) mainly affecting pipelined workloads.
Co-authored-by: Oran Agra <oran@redislabs.com>
Avoid sprintf/ll2string on setDeferredAggregateLen()/addReplyLongLongWithPrefix() when we can used shared objects.
In some pipelined workloads this achieves about 10% improvement.
Co-authored-by: Oran Agra <oran@redislabs.com>
There are scenarios where it results in many small objects in the reply list,
such as commands heavily using deferred array replies (`addReplyDeferredLen`).
E.g. what COMMAND command and CLUSTER SLOTS used to do (see #10056, #7123),
but also in case of a transaction or a pipeline of commands that use just one differed array reply.
We used to have to run multiple loops along with multiple calls to `write()` to send data back to
peer based on the current code, but by means of `writev()`, we can gather those scattered
objects in reply list and include the static reply buffer as well, then send it by one system call,
that ought to achieve higher performance.
In the case of TLS, we simply check and concatenate buffers into one big buffer and send it
away by one call to `connTLSWrite()`, if the amount of all buffers exceeds `NET_MAX_WRITES_PER_EVENT`,
then invoke `connTLSWrite()` multiple times to avoid a huge massive of memory copies.
Note that aside of reducing system calls, this change will also reduce the amount of
small TCP packets sent.
Current implementation simple idle client which serves no traffic still
use ~17Kb of memory. this is mainly due to a fixed size reply buffer
currently set to 16kb.
We have encountered some cases in which the server operates in a low memory environments.
In such cases a user who wishes to create large connection pools to support potential burst period,
will exhaust a large amount of memory to maintain connected Idle clients.
Some users may choose to "sacrifice" performance in order to save memory.
This commit introduce a dynamic mechanism to shrink and expend the client reply buffer based on
periodic observed peak.
the algorithm works as follows:
1. each time a client reply buffer has been fully written, the last recorded peak is updated:
new peak = MAX( last peak, current written size)
2. during clients cron we check for each client if the last observed peak was:
a. matching the current buffer size - in which case we expend (resize) the buffer size by 100%
b. less than half the buffer size - in which case we shrink the buffer size by 50%
3. In any case we will **not** resize the buffer in case:
a. the current buffer peak is less then the current buffer usable size and higher than 1/2 the
current buffer usable size
b. the value of (current buffer usable size/2) is less than 1Kib
c. the value of (current buffer usable size*2) is larger than 16Kib
4. the peak value is reset to the current buffer position once every **5** seconds. we maintain a new
field in the client structure (buf_peak_last_reset_time) which is used to keep track of how long it
passed since the last buffer peak reset.
### **Interface changes:**
**CIENT LIST** - now contains 2 new extra fields:
rbs= < the current size in bytes of the client reply buffer >
rbp=< the current value in bytes of the last observed buffer peak position >
**INFO STATS** - now contains 2 new statistics:
reply_buffer_shrinks = < total number of buffer shrinks performed >
reply_buffer_expends = < total number of buffer expends performed >
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: Yoav Steinberg <yoav@redislabs.com>
This is a followup work for #10278, and a discussion about #10279
The changes:
- fix failed_calls in command stats for blocked clients that got error.
including CLIENT UNBLOCK, and module replying an error from a thread.
- fix latency stats for XREADGROUP that filed with -NOGROUP
Theory behind which errors should be counted:
- error stats represents errors returned to the user, so an error handled by a
module should not be counted.
- total error counter should be the same.
- command stats represents execution of commands (even with RM_Call, and if
they fail or get rejected it counts these calls in commandstats, so it should
also count failed_calls)
Some thoughts about Scripts:
for scripts it could be different since they're part of user code, not the infra (not an extension to redis)
we certainly want commandstats to contain all calls and errors
a simple script is like mult-exec transaction so an error inside it should be counted in error stats
a script that replies with an error to the user (using redis.error_reply) should also be counted in error stats
but then the problem is that a plain `return redis.call("SET")` should not be counted twice (once for the SET
and once for EVAL)
so that's something left to be resolved in #10279
This PR handles several aspects
1. Calls to RM_ReplyWithError from thread safe contexts don't violate thread safety.
2. Errors returning from RM_Call to the module aren't counted in the statistics (they
might be handled silently by the module)
3. When a module propagates a reply it got from RM_Call to it's client, then the error
statistics are counted.
This is done by:
1. When appending an error reply to the output buffer, we avoid updating the global
error statistics, instead we cache that error in a deferred list in the client struct.
2. When creating a RedisModuleCallReply object, the deferred error list is moved from
the client into that object.
3. when a module calls RM_ReplyWithCallReply we copy the deferred replies to the dest
client (if that's a real client, then that's when the error statistics are updated to the server)
Note about RM_ReplyWithCallReply: if the original reply had an array with errors, and the module
replied with just a portion of the original reply, and not the entire reply, the errors are currently not
propagated and the errors stats will not get propagated.
Fix#10180
Summary of changes:
1. Rename `redisCommand->name` to `redisCommand->declared_name`, it is a
const char * for native commands and SDS for module commands.
2. Store the [sub]command fullname in `redisCommand->fullname` (sds).
3. List subcommands in `ACL CAT`
4. List subcommands in `COMMAND LIST`
5. `moduleUnregisterCommands` now will also free the module subcommands.
6. RM_GetCurrentCommandName returns full command name
Other changes:
1. Add `addReplyErrorArity` and `addReplyErrorExpireTime`
2. Remove `getFullCommandName` function that now is useless.
3. Some cleanups about `fullname` since now it is SDS.
4. Delete `populateSingleCommand` function from server.h that is useless.
5. Added tests to cover this change.
6. Add some module unload tests and fix the leaks
7. Make error messages uniform, make sure they always contain the full command
name and that it's quoted.
7. Fixes some typos
see the history in #9504, fixes#10124
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: guybe7 <guy.benoish@redislabs.com>
Some modules might perform a long-running logic in different stages of Redis lifetime, for example:
* command execution
* RDB loading
* thread safe context
During this long-running logic Redis is not responsive.
This PR offers
1. An API to process events while a busy command is running (`RM_Yield`)
2. A new flag (`ALLOW_BUSY`) to mark the commands that should be handled during busy
jobs which can also be used by modules (`allow-busy`)
3. In slow commands and thread safe contexts, this flag will start rejecting commands with -BUSY only
after `busy-reply-threshold`
4. During loading (`rdb_load` callback), it'll process events right away (not wait for `busy-reply-threshold`),
but either way, the processing is throttled to the server hz rate.
5. Allow modules to Yield to redis background tasks, but not to client commands
* rename `script-time-limit` to `busy-reply-threshold` (an alias to the pre-7.0 `lua-time-limit`)
Co-authored-by: Oran Agra <oran@redislabs.com>
1. enable diskless replication by default
2. add a new config named repl-diskless-sync-max-replicas that enables
replication to start before the full repl-diskless-sync-delay was
reached.
3. put replica online sooner on the master (see below)
4. test suite uses repl-diskless-sync-delay of 0 to be faster
5. a few tests that use multiple replica on a pre-populated master, are
now using the new repl-diskless-sync-max-replicas
6. fix possible timing issues in a few cluster tests (see below)
put replica online sooner on the master
----------------------------------------------------
there were two tests that failed because they needed for the master to
realize that the replica is online, but the test code was actually only
waiting for the replica to realize it's online, and in diskless it could
have been before the master realized it.
changes include two things:
1. the tests wait on the right thing
2. issues in the master, putting the replica online in two steps.
the master used to put the replica as online in 2 steps. the first
step was to mark it as online, and the second step was to enable the
write event (only after getting ACK), but in fact the first step didn't
contains some of the tasks to put it online (like updating good slave
count, and sending the module event). this meant that if a test was
waiting to see that the replica is online form the point of view of the
master, and then confirm that the module got an event, or that the
master has enough good replicas, it could fail due to timing issues.
so now the full effect of putting the replica online, happens at once,
and only the part about enabling the writes is delayed till the ACK.
fix cluster tests
--------------------
I added some code to wait for the replica to sync and avoid race
conditions.
later realized the sentinel and cluster tests where using the original 5
seconds delay, so changed it to 0.
this means the other changes are probably not needed, but i suppose
they're still better (avoid race conditions)
Added a pool for temporary client objects to reuse in module operations.
By reusing temporary clients, we are avoiding expensive createClient()/freeClient()
calls and improving performance of RM_BlockClient() and RM_GetThreadSafeContext() calls.
This commit contains two optimizations:
1 - RM_BlockClient() and RM_GetThreadSafeContext() calls create temporary clients and they are freed in
RM_UnblockClient() and RM_FreeThreadSafeContext() calls respectively. Creating/destroying client object
takes quite time. To avoid that, added a pool of temporary clients. Pool expands when more clients are needed.
Also, added a cron function to shrink the pool and free unused clients after some time. Pool starts with zero
clients in it. It does not have max size and can grow unbounded as we need it. We will keep minimum of 8
temporary clients in the pool once created. Keeping small amount of clients to avoid client allocation costs
if temporary clients are required after some idle period.
2 - After unblocking a client (RM_UnblockClient()), one byte is written to pipe to wake up Redis main thread.
If there are many clients that will be unblocked, each operation requires one write() call which is quite expensive.
Changed code to avoid subsequent calls if possible.
There are a few more places that need temporary client objects (e.g RM_Call()). These are now using the same
temporary client pool to make things more centralized.
The following steps will crash redis-server:
```
[root]# cat crash
PSYNC replicationid -1
SLOWLOG GET
GET key
[root]# nc 127.0.0.1 6379 < crash
```
This one following #10020 and the crash was reported in #10076.
Other changes about the output info:
1. Cmd with a full name by using `getFullCommandName`, now it will print the right
subcommand name like `slowlog|get`.
2. Print the full client info by using `catClientInfoString`, the info is also valuable.:
The following error commands will crash redis-server:
```
> get|
Error: Server closed the connection
> get|set
Error: Server closed the connection
> get|other
```
The reason is in #9504, we use `lookupCommandBySds` for find the
container command. And it split the command (argv[0]) with `|`.
If we input something like `get|other`, after the split, `get`
will become a valid command name, pass the `ERR unknown command`
check, and finally crash in `addReplySubcommandSyntaxError`
In this case we do not need to split the command name with `|`
and just look in the commands dict to find if `argv[0]` is a
container command.
So this commit introduce a new function call `isContainerCommandBySds`
that it will return true if a command name is a container command.
Also with the old code, there is a incorrect error message:
```
> config|get set
(error) ERR Unknown subcommand or wrong number of arguments for 'set'. Try CONFIG|GET HELP.
```
The crash was reported in #10070.
This commit implements a sharded pubsub implementation based off of shard channels.
Co-authored-by: Harkrishn Patro <harkrisp@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Since #9166 we have an assertion here to make sure replica clients don't write anything to their buffer.
But in reality a replica may attempt write data to it's buffer simply by sending a command on the replication link.
This command in most cases will be rejected since #8868 but it'll still generate an error.
Actually the only valid command to send on a replication link is 'REPCONF ACK` which generates no response.
We want to keep the design so that replicas can send commands but we need to avoid any situation where we start
putting data in their response buffers, especially since they aren't used anymore. This PR makes sure to disconnect
a rogue client which generated a write on the replication link that cause something to be written to the response buffer.
To recreate the bug this fixes simply connect via telnet to a redis server and write sync\r\n wait for the the payload to
be written and then write any command (valid or invalid), such as ping\r\n on the telnet connection. It'll crash the server.
To avoid data loss, this commit adds a grace period for lagging replicas to
catch up the replication offset.
Done:
* Wait for replicas when shutdown is triggered by SIGTERM and SIGINT.
* Wait for replicas when shutdown is triggered by the SHUTDOWN command. A new
blocked client type BLOCKED_SHUTDOWN is introduced, allowing multiple clients
to call SHUTDOWN in parallel.
Note that they don't expect a response unless an error happens and shutdown is aborted.
* Log warning for each replica lagging behind when finishing shutdown.
* CLIENT_PAUSE_WRITE while waiting for replicas.
* Configurable grace period 'shutdown-timeout' in seconds (default 10).
* New flags for the SHUTDOWN command:
- NOW disables the grace period for lagging replicas.
- FORCE ignores errors writing the RDB or AOF files which would normally
prevent a shutdown.
- ABORT cancels ongoing shutdown. Can't be combined with other flags.
* New field in the output of the INFO command: 'shutdown_in_milliseconds'. The
value is the remaining maximum time to wait for lagging replicas before
finishing the shutdown. This field is present in the Server section **only**
during shutdown.
Not directly related:
* When shutting down, if there is an AOF saving child, it is killed **even** if AOF
is disabled. This can happen if BGREWRITEAOF is used when AOF is off.
* Client pause now has end time and type (WRITE or ALL) per purpose. The
different pause purposes are *CLIENT PAUSE command*, *failover* and
*shutdown*. If clients are unpaused for one purpose, it doesn't affect client
pause for other purposes. For example, the CLIENT UNPAUSE command doesn't
affect client pause initiated by the failover or shutdown procedures. A completed
failover or a failed shutdown doesn't unpause clients paused by the CLIENT
PAUSE command.
Notes:
* DEBUG RESTART doesn't wait for replicas.
* We already have a warning logged when a replica disconnects. This means that
if any replica connection is lost during the shutdown, it is either logged as
disconnected or as lagging at the time of exit.
Co-authored-by: Oran Agra <oran@redislabs.com>
This is needed in order to ease the deployment of functions for ephemeral cases, where user
needs to spin up a server with functions pre-loaded.
#### Details:
* Added `--functions-rdb` option to _redis-cli_.
* Functions only rdb via `REPLCONF rdb-filter-only functions`. This is a placeholder for a space
separated inclusion filter for the RDB. In the future can be `REPLCONF rdb-filter-only
"functions db:3 key-patten:user*"` and a complementing `rdb-filter-exclude` `REPLCONF`
can also be added.
* Handle "slave requirements" specification to RDB saving code so we can use the same RDB
when different slaves express the same requirements (like functions-only) and not share the
RDB when their requirements differ. This is currently just a flags `int`, but can be extended to
a more complex structure with various filter fields.
* make sure to support filters only in diskless replication mode (not to override the persistence file),
we do that by forcing diskless (even if disabled by config)
other changes:
* some refactoring in rdb.c (extract portion of a big function to a sub-function)
* rdb_key_save_delay used in AOFRW too
* sendChildInfo takes the number of updated keys (incremental, rather than absolute)
Co-authored-by: Oran Agra <oran@redislabs.com>
The mess:
Some parts use alsoPropagate for late propagation, others using an immediate one (propagate()),
causing edge cases, ugly/hacky code, and the tendency for bugs
The basic idea is that all commands are propagated via alsoPropagate (i.e. added to a list) and the
top-most call() is responsible for going over that list and actually propagating them (and wrapping
them in MULTI/EXEC if there's more than one command). This is done in the new function,
propagatePendingCommands.
Callers to propagatePendingCommands:
1. top-most call() (we want all nested call()s to add to the also_propagate array and just the top-most
one to propagate them) - via `afterCommand`
2. handleClientsBlockedOnKeys: it is out of call() context and it may propagate stuff - via `afterCommand`.
3. handleClientsBlockedOnKeys edge case: if the looked-up key is already expired, we will propagate the
expire but will not unblock any client so `afterCommand` isn't called. in that case, we have to propagate
the deletion explicitly.
4. cron stuff: active-expire and eviction may also propagate stuff
5. modules: the module API allows to propagate stuff from just about anywhere (timers, keyspace notifications,
threads). I could have tried to catch all the out-of-call-context places but it seemed easier to handle it in one
place: when we free the context. in the spirit of what was done in call(), only the top-most freeing of a module
context may cause propagation.
6. modules: when using a thread-safe ctx it's not clear when/if the ctx will be freed. we do know that the module
must lock the GIL before calling RM_Replicate/RM_Call so we propagate the pending commands when
releasing the GIL.
A "known limitation", which were actually a bug, was fixed because of this commit (see propagate.tcl):
When using a mix of RM_Call with `!` and RM_Replicate, the command would propagate out-of-order:
first all the commands from RM_Call, and then the ones from RM_Replicate
Another thing worth mentioning is that if, in the past, a client would issue a MULTI/EXEC with just one
write command the server would blindly propagate the MULTI/EXEC too, even though it's redundant.
not anymore.
This commit renames propagate() to propagateNow() in order to cause conflicts in pending PRs.
propagatePendingCommands is the only caller of propagateNow, which is now a static, internal helper function.
Optimizations:
1. alsoPropagate will not add stuff to also_propagate if there's no AOF and replicas
2. alsoPropagate reallocs also_propagagte exponentially, to save calls to memmove
Bugfixes:
1. CONFIG SET can create evictions, sending notifications which can cause to dirty++ with modules.
we need to prevent it from propagating to AOF/replicas
2. We need to set current_client in RM_Call. buggy scenario:
- CONFIG SET maxmemory, eviction notifications, module hook calls RM_Call
- assertion in lookupKey crashes, because current_client has CONFIG SET, which isn't CMD_WRITE
3. minor: in eviction, call propagateDeletion after notification, like active-expire and all commands
(we always send a notification before propagating the command)
## background
Till now CONFIG SET was blocked during loading.
(In the not so distant past, GET was disallowed too)
We recently (not released yet) added an async-loading mode, see #9323,
and during that time it'll serve CONFIG SET and any other command.
And now we realized (#9770) that some configs, and commands are dangerous
during async-loading.
## changes
* Allow most CONFIG SET during loading (both on async-loading and normal loading)
* Allow CONFIG REWRITE and CONFIG RESETSTAT during loading
* Block a few config during loading (`appendonly`, `repl-diskless-load`, and `dir`)
* Block a few commands during loading (list below)
## the blocked commands:
* SAVE - obviously we don't wanna start a foregreound save during loading 8-)
* BGSAVE - we don't mind to schedule one, but we don't wanna fork now
* BGREWRITEAOF - we don't mind to schedule one, but we don't wanna fork now
* MODULE - we obviously don't wanna unload a module during replication / rdb loading
(MODULE HELP and MODULE LIST are not blocked)
* SYNC / PSYNC - we're in the middle of RDB loading from master, must not allow sync
requests now.
* REPLICAOF / SLAVEOF - we're in the middle of replicating, maybe it makes sense to let
the user abort it, but he couldn't do that so far, i don't wanna take any risk of bugs due to odd state.
* CLUSTER - only allow [HELP, SLOTS, NODES, INFO, MYID, LINKS, KEYSLOT, COUNTKEYSINSLOT,
GETKEYSINSLOT, RESET, REPLICAS, COUNT_FAILURE_REPORTS], for others, preserve the status quo
## other fixes
* processEventsWhileBlocked had an issue when being nested, this could happen with a busy script
during async loading (new), but also in a busy script during AOF loading (old). this lead to a crash in
the scenario described in #6988
Block sensitive configs and commands by default.
* `enable-protected-configs` - block modification of configs with the new `PROTECTED_CONFIG` flag.
Currently we add this flag to `dbfilename`, and `dir` configs,
all of which are non-mutable configs that can set a file redis will write to.
* `enable-debug-command` - block the `DEBUG` command
* `enable-module-command` - block the `MODULE` command
These have a default value set to `no`, so that these features are not
exposed by default to client connections, and can only be set by modifying the config file.
Users can change each of these to either `yes` (allow all access), or `local` (allow access from
local TCP connections and unix domain connections)
Note that this is a **breaking change** (specifically the part about MODULE command being disabled by default).
I.e. we don't consider DEBUG command being blocked as an issue (people shouldn't have been using it),
and the few configs we protected are unlikely to have been set at runtime anyway.
On the other hand, it's likely to assume some users who use modules, load them from the config file anyway.
Note that's the whole point of this PR, for redis to be more secure by default and reduce the attack surface on
innocent users, so secure defaults will necessarily mean a breaking change.
Delete the hardcoded command table and replace it with an auto-generated table, based
on a JSON file that describes the commands (each command must have a JSON file).
These JSON files are the SSOT of everything there is to know about Redis commands,
and it is reflected fully in COMMAND INFO.
These JSON files are used to generate commands.c (using a python script), which is then
committed to the repo and compiled.
The purpose is:
* Clients and proxies will be able to get much more info from redis, instead of relying on hard coded logic.
* drop the dependency between Redis-user and the commands.json in redis-doc.
* delete help.h and have redis-cli learn everything it needs to know just by issuing COMMAND (will be
done in a separate PR)
* redis.io should stop using commands.json and learn everything from Redis (ultimately one of the release
artifacts should be a large JSON, containing all the information about all of the commands, which will be
generated from COMMAND's reply)
* the byproduct of this is:
* module commands will be able to provide that info and possibly be more of a first-class citizens
* in theory, one may be able to generate a redis client library for a strictly typed language, by using this info.
### Interface changes
#### COMMAND INFO's reply change (and arg-less COMMAND)
Before this commit the reply at index 7 contained the key-specs list
and reply at index 8 contained the sub-commands list (Both unreleased).
Now, reply at index 7 is a map of:
- summary - short command description
- since - debut version
- group - command group
- complexity - complexity string
- doc-flags - flags used for documentation (e.g. "deprecated")
- deprecated-since - if deprecated, from which version?
- replaced-by - if deprecated, which command replaced it?
- history - a list of (version, what-changed) tuples
- hints - a list of strings, meant to provide hints for clients/proxies. see https://github.com/redis/redis/issues/9876
- arguments - an array of arguments. each element is a map, with the possibility of nesting (sub-arguments)
- key-specs - an array of keys specs (already in unstable, just changed location)
- subcommands - a list of sub-commands (already in unstable, just changed location)
- reply-schema - will be added in the future (see https://github.com/redis/redis/issues/9845)
more details on these can be found in https://github.com/redis/redis-doc/pull/1697
only the first three fields are mandatory
#### API changes (unreleased API obviously)
now they take RedisModuleCommand opaque pointer instead of looking up the command by name
- RM_CreateSubcommand
- RM_AddCommandKeySpec
- RM_SetCommandKeySpecBeginSearchIndex
- RM_SetCommandKeySpecBeginSearchKeyword
- RM_SetCommandKeySpecFindKeysRange
- RM_SetCommandKeySpecFindKeysKeynum
Currently, we did not add module API to provide additional information about their commands because
we couldn't agree on how the API should look like, see https://github.com/redis/redis/issues/9944.
### Somehow related changes
1. Literals should be in uppercase while placeholder in lowercase. Now all the GEO* command
will be documented with M|KM|FT|MI and can take both lowercase and uppercase
### Unrelated changes
1. Bugfix: no_madaory_keys was absent in COMMAND's reply
2. expose CMD_MODULE as "module" via COMMAND
3. have a dedicated uint64 for ACL categories (instead of having them in the same uint64 as command flags)
Co-authored-by: Itamar Haber <itamar@garantiadata.com>
Script unit is a new unit located on script.c.
Its purpose is to provides an API for functions (and eval)
to interact with Redis. Interaction includes mostly
executing commands, but also functionalities like calling
Redis back on long scripts or check if the script was killed.
The interaction is done using a scriptRunCtx object that
need to be created by the user and initialized using scriptPrepareForRun.
Detailed list of functionalities expose by the unit:
1. Calling commands (including all the validation checks such as
acl, cluster, read only run, ...)
2. Set Resp
3. Set Replication method (AOF/REPLICATION/NONE)
4. Call Redis back to on long running scripts to allow Redis reply
to clients and perform script kill
The commit introduce the new unit and uses it on eval commands to
interact with Redis.
The following variable was renamed:
1. lua_caller -> script_caller
2. lua_time_limit -> script_time_limit
3. lua_timedout -> script_timedout
4. lua_oom -> script_oom
5. lua_disable_deny_script -> script_disable_deny_script
6. in_eval -> in_script
The following variables was moved to lctx under eval.c
1. lua
2. lua_client
3. lua_cur_script
4. lua_scripts
5. lua_scripts_mem
6. lua_replicate_commands
7. lua_write_dirty
8. lua_random_dirty
9. lua_multi_emitted
10. lua_repl
11. lua_kill
12. lua_time_start
13. lua_time_snapshot
This commit is in a low risk of introducing any issues and it
is just moving varibales around and not changing any logic.
* Fix CLIENT KILL kill all clients with id 0 or with skipme
CLIENT KILL with ID argument should only kill the client with the provided ID. In old code,
CLIENT KILL with id 0 will kill all the connected clients.
Co-authored-by: Ofir Luzon <ofirluzon@gmail.com>
Some people complain that QUIT is missing from help/command table.
Not appearing in COMMAND command, command stats, ACL, etc.
and instead, there's a hack in processCommand with a comment that looks outdated.
Note that it is [documented](https://redis.io/commands/quit)
At the same time, HOST: and POST are there in the command table although these are not real commands.
They would appear in the COMMAND command, and even in commandstats.
Other changes:
1. Initialize the static logged_time static var in securityWarningCommand
2. add `no-auth` flag to RESET so it can always be executed.
With dynamically growing argc (#9528), it is necessary to initialize
argv_len. Normally createClient() handles that, but in the case of a
module shared_client, this needs to be done explicitly.
This also addresses an issue with rewriteClientCommandArgument() which
doesn't properly handle the case where the new element extends beyond
argc but not beyond argv_len.
- Added sanitizer support. `address`, `undefined` and `thread` sanitizers are available.
- To build Redis with desired sanitizer : `make SANITIZER=undefined`
- There were some sanitizer findings, cleaned up codebase
- Added tests with address and undefined behavior sanitizers to daily CI.
- Added tests with address sanitizer to the per-PR CI (smoke out mem leaks sooner).
Basically, there are three types of issues :
**1- Unaligned load/store** : Most probably, this issue may cause a crash on a platform that
does not support unaligned access. Redis does unaligned access only on supported platforms.
**2- Signed integer overflow.** Although, signed overflow issue can be problematic time to time
and change how compiler generates code, current findings mostly about signed shift or simple
addition overflow. For most platforms Redis can be compiled for, this wouldn't cause any issue
as far as I can tell (checked generated code on godbolt.org).
**3 -Minor leak** (redis-cli), **use-after-free**(just before calling exit());
UB means nothing guaranteed and risky to reason about program behavior but I don't think any
of the fixes here worth backporting. As sanitizers are now part of the CI, preventing new issues
will be the real benefit.
## Background
For redis master, one replica uses one copy of replication buffer, that is a big waste of memory,
more replicas more waste, and allocate/free memory for every reply list also cost much.
If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with
replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big,
master may be OOM when there are many replicas that separately keep much memory.
Because replication buffers of different replica client are the same, one simple idea is that
all replicas only use one replication buffer, that will effectively save memory.
Since replication backlog content is the same as replicas' output buffer, now we
can discard replication backlog memory and use global shared replication buffer
to implement replication backlog mechanism.
## Implementation
I create one global "replication buffer" which contains content of replication stream.
The structure of "replication buffer" is similar to the reply list that exists in every client.
But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields.
```c
/* Replication buffer blocks is the list of replBufBlock.
*
* +--------------+ +--------------+ +--------------+
* | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 |
* +--------------+ +--------------+ +--------------+
* | / \
* | / \
* | / \
* Repl Backlog Replia_A Replia_B
*
* Each replica or replication backlog increments only the refcount of the
* 'ref_repl_buf_node' which it points to. So when replica walks to the next
* node, it should first increase the next node's refcount, and when we trim
* the replication buffer nodes, we remove node always from the head node which
* refcount is 0. If the refcount of the head node is not 0, we must stop
* trimming and never iterate the next node. */
/* Similar with 'clientReplyBlock', it is used for shared buffers between
* all replica clients and replication backlog. */
typedef struct replBufBlock {
int refcount; /* Number of replicas or repl backlog using. */
long long id; /* The unique incremental number. */
long long repl_offset; /* Start replication offset of the block. */
size_t size, used;
char buf[];
} replBufBlock;
```
So now when we feed replication stream into replication backlog and all replicas, we only need
to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of
replication backlog and replicas to references of the global replication buffer blocks. And we also
need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim
replication backlog if exceeding `repl-backlog-size`.
When sending reply to replicas, we also need to iterate replication buffer blocks and send its
content, when totally sending one block for replica, we decrease current node count and
increase the next current node count, and then free the block which reference is 0 from the
head of replication buffer blocks.
Since now we use linked list to manage replication backlog, it may cost much time for iterating
all linked list nodes to find corresponding replication buffer node. So we create a rax tree to
store some nodes for index, but to avoid rax tree occupying too much memory, i record
one per 64 nodes for index.
Currently, to make partial resynchronization as possible as much, we always let replication
backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting
if slow replicas that reference vast replication buffer blocks, and this method doesn't increase
memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced
replication buffer blocks when we need to trim backlog for exceeding backlog size setting,
we trim backlog incrementally (free 64 blocks per call now), and make it faster in
`beforeSleep` (free 640 blocks).
### Other changes
- `mem_total_replication_buffers`: we add this field in INFO command, it means the total
memory of replication buffers used.
- `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory
is not 0, but it still may be 0, since replication backlog and replicas share one global replication
buffer, only if replication buffer memory is more than the repl backlog setting size, we consider
the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption
of repl backlog.
- Key eviction
Since all replicas and replication backlog share global replication buffer, we think only the
part of exceeding backlog size the extra separate consumption of replicas.
Because we trim backlog incrementally in the background, backlog size may exceeds our
setting if slow replicas that reference vast replication buffer blocks disconnect.
To avoid massive eviction loop, we don't count the delayed freed replication backlog into
used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory.
- `client-output-buffer-limit` check for replica clients
It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size
config (partial sync will succeed and then replica will get disconnected). Such a configuration is
ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption
implications since the replica client will share the backlog buffers memory.
- Drop replication backlog after loading data if needed
We always create replication backlog if server is a master, we need it because we put DELs in
it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb,
it is not possible to support partial resynchronization, to avoid extra memory of replication backlog,
we drop it.
- Multi IO threads
Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled,
to guarantee data accessing thread safe, we must let main thread handle sending the output buffer
to all replicas. But before, other IO threads could handle sending output buffer of all replicas.
## Other optimizations
This solution resolve some other problem:
- When replicas disconnect with master since of out of output buffer limit, releasing the output
buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now,
it doesn't cause freezing.
- This implementation may mitigate reply list copy cost time(also freezes server) when one replication
has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy
reference info, it is very light.
- If we set replication backlog size big, it also may cost much time to copy replication backlog into
replica's output buffer. But this commit eliminates this problem.
- Resizing replication backlog size doesn't empty current replication backlog content.
## Intro
The purpose is to allow having different flags/ACL categories for
subcommands (Example: CONFIG GET is ok-loading but CONFIG SET isn't)
We create a small command table for every command that has subcommands
and each subcommand has its own flags, etc. (same as a "regular" command)
This commit also unites the Redis and the Sentinel command tables
## Affected commands
CONFIG
Used to have "admin ok-loading ok-stale no-script"
Changes:
1. Dropped "ok-loading" in all except GET (this doesn't change behavior since
there were checks in the code doing that)
XINFO
Used to have "read-only random"
Changes:
1. Dropped "random" in all except CONSUMERS
XGROUP
Used to have "write use-memory"
Changes:
1. Dropped "use-memory" in all except CREATE and CREATECONSUMER
COMMAND
No changes.
MEMORY
Used to have "random read-only"
Changes:
1. Dropped "random" in PURGE and USAGE
ACL
Used to have "admin no-script ok-loading ok-stale"
Changes:
1. Dropped "admin" in WHOAMI, GENPASS, and CAT
LATENCY
No changes.
MODULE
No changes.
SLOWLOG
Used to have "admin random ok-loading ok-stale"
Changes:
1. Dropped "random" in RESET
OBJECT
Used to have "read-only random"
Changes:
1. Dropped "random" in ENCODING and REFCOUNT
SCRIPT
Used to have "may-replicate no-script"
Changes:
1. Dropped "may-replicate" in all except FLUSH and LOAD
CLIENT
Used to have "admin no-script random ok-loading ok-stale"
Changes:
1. Dropped "random" in all except INFO and LIST
2. Dropped "admin" in ID, TRACKING, CACHING, GETREDIR, INFO, SETNAME, GETNAME, and REPLY
STRALGO
No changes.
PUBSUB
No changes.
CLUSTER
Changes:
1. Dropped "admin in countkeysinslots, getkeysinslot, info, nodes, keyslot, myid, and slots
SENTINEL
No changes.
(note that DEBUG also fits, but we decided not to convert it since it's for
debugging and anyway undocumented)
## New sub-command
This commit adds another element to the per-command output of COMMAND,
describing the list of subcommands, if any (in the same structure as "regular" commands)
Also, it adds a new subcommand:
```
COMMAND LIST [FILTERBY (MODULE <module-name>|ACLCAT <cat>|PATTERN <pattern>)]
```
which returns a set of all commands (unless filters), but excluding subcommands.
## Module API
A new module API, RM_CreateSubcommand, was added, in order to allow
module writer to define subcommands
## ACL changes:
1. Now, that each subcommand is actually a command, each has its own ACL id.
2. The old mechanism of allowed_subcommands is redundant
(blocking/allowing a subcommand is the same as blocking/allowing a regular command),
but we had to keep it, to support the widespread usage of allowed_subcommands
to block commands with certain args, that aren't subcommands (e.g. "-select +select|0").
3. I have renamed allowed_subcommands to allowed_firstargs to emphasize the difference.
4. Because subcommands are commands in ACL too, you can now use "-" to block subcommands
(e.g. "+client -client|kill"), which wasn't possible in the past.
5. It is also possible to use the allowed_firstargs mechanism with subcommand.
For example: `+config -config|set +config|set|loglevel` will block all CONFIG SET except
for setting the log level.
6. All of the ACL changes above required some amount of refactoring.
## Misc
1. There are two approaches: Either each subcommand has its own function or all
subcommands use the same function, determining what to do according to argv[0].
For now, I took the former approaches only with CONFIG and COMMAND,
while other commands use the latter approach (for smaller blamelog diff).
2. Deleted memoryGetKeys: It is no longer needed because MEMORY USAGE now uses the "range" key spec.
4. Bugfix: GETNAME was missing from CLIENT's help message.
5. Sentinel and Redis now use the same table, with the same function pointer.
Some commands have a different implementation in Sentinel, so we redirect
them (these are ROLE, PUBLISH, and INFO).
6. Command stats now show the stats per subcommand (e.g. instead of stats just
for "config" you will have stats for "config|set", "config|get", etc.)
7. It is now possible to use COMMAND directly on subcommands:
COMMAND INFO CONFIG|GET (The pipeline syntax was inspired from ACL, and
can be used in functions lookupCommandBySds and lookupCommandByCString)
8. STRALGO is now a container command (has "help")
## Breaking changes:
1. Command stats now show the stats per subcommand (see (5) above)
This change sets a low limit for multibulk and bulk length in the
protocol for unauthenticated connections, so that they can't easily
cause redis to allocate massive amounts of memory by sending just a few
characters on the network.
The new limits are 10 arguments of 16kb each (instead of 1m of 512mb)
Remove hard coded multi-bulk limit (was 1,048,576), new limit is INT_MAX.
When client sends an m-bulk that's higher than 1024, we initially only allocate
the argv array for 1024 arguments, and gradually grow that allocation as arguments
are received.
Fixing CI test issues introduced in #8687
- valgrind warnings in readQueryFromClient when client was freed by processInputBuffer
- adding DEBUG pause-cron for tests not to be time dependent.
- skipping a test that depends on socket buffers / events not compatible with TLS
- making sure client got subscribed by not using deferring client
### Description
A mechanism for disconnecting clients when the sum of all connected clients is above a
configured limit. This prevents eviction or OOM caused by accumulated used memory
between all clients. It's a complimentary mechanism to the `client-output-buffer-limit`
mechanism which takes into account not only a single client and not only output buffers
but rather all memory used by all clients.
#### Design
The general design is as following:
* We track memory usage of each client, taking into account all memory used by the
client (query buffer, output buffer, parsed arguments, etc...). This is kept up to date
after reading from the socket, after processing commands and after writing to the socket.
* Based on the used memory we sort all clients into buckets. Each bucket contains all
clients using up up to x2 memory of the clients in the bucket below it. For example up
to 1m clients, up to 2m clients, up to 4m clients, ...
* Before processing a command and before sleep we check if we're over the configured
limit. If we are we start disconnecting clients from larger buckets downwards until we're
under the limit.
#### Config
`maxmemory-clients` max memory all clients are allowed to consume, above this threshold
we disconnect clients.
This config can either be set to 0 (meaning no limit), a size in bytes (possibly with MB/GB
suffix), or as a percentage of `maxmemory` by using the `%` suffix (e.g. setting it to `10%`
would mean 10% of `maxmemory`).
#### Important code changes
* During the development I encountered yet more situations where our io-threads access
global vars. And needed to fix them. I also had to handle keeps the clients sorted into the
memory buckets (which are global) while their memory usage changes in the io-thread.
To achieve this I decided to simplify how we check if we're in an io-thread and make it
much more explicit. I removed the `CLIENT_PENDING_READ` flag used for checking
if the client is in an io-thread (it wasn't used for anything else) and just used the global
`io_threads_op` variable the same way to check during writes.
* I optimized the cleanup of the client from the `clients_pending_read` list on client freeing.
We now store a pointer in the `client` struct to this list so we don't need to search in it
(`pending_read_list_node`).
* Added `evicted_clients` stat to `INFO` command.
* Added `CLIENT NO-EVICT ON|OFF` sub command to exclude a specific client from the
client eviction mechanism. Added corrosponding 'e' flag in the client info string.
* Added `multi-mem` field in the client info string to show how much memory is used up
by buffered multi commands.
* Client `tot-mem` now accounts for buffered multi-commands, pubsub patterns and
channels (partially), tracking prefixes (partially).
* CLIENT_CLOSE_ASAP flag is now handled in a new `beforeNextClient()` function so
clients will be disconnected between processing different clients and not only before sleep.
This new function can be used in the future for work we want to do outside the command
processing loop but don't want to wait for all clients to be processed before we get to it.
Specifically I wanted to handle output-buffer-limit related closing before we process client
eviction in case the two race with each other.
* Added a `DEBUG CLIENT-EVICTION` command to print out info about the client eviction
buckets.
* Each client now holds a pointer to the client eviction memory usage bucket it belongs to
and listNode to itself in that bucket for quick removal.
* Global `io_threads_op` variable now can contain a `IO_THREADS_OP_IDLE` value
indicating no io-threading is currently being executed.
* In order to track memory used by each clients in real-time we can't rely on updating
these stats in `clientsCron()` alone anymore. So now I call `updateClientMemUsage()`
(used to be `clientsCronTrackClientsMemUsage()`) after command processing, after
writing data to pubsub clients, after writing the output buffer and after reading from the
socket (and maybe other places too). The function is written to be fast.
* Clients are evicted if needed (with appropriate log line) in `beforeSleep()` and before
processing a command (before performing oom-checks and key-eviction).
* All clients memory usage buckets are grouped as follows:
* All clients using less than 64k.
* 64K..128K
* 128K..256K
* ...
* 2G..4G
* All clients using 4g and up.
* Added client-eviction.tcl with a bunch of tests for the new mechanism.
* Extended maxmemory.tcl to test the interaction between maxmemory and
maxmemory-clients settings.
* Added an option to flag a numeric configuration variable as a "percent", this means that
if we encounter a '%' after the number in the config file (or config set command) we
consider it as valid. Such a number is store internally as a negative value. This way an
integer value can be interpreted as either a percent (negative) or absolute value (positive).
This is useful for example if some numeric configuration can optionally be set to a percentage
of something else.
Co-authored-by: Oran Agra <oran@redislabs.com>
A write request may be paused unexpectedly because `server.client_pause_end_time` is old.
**Recreate this:**
redis-cli -p 6379
127.0.0.1:6379> client pause 500000000 write
OK
127.0.0.1:6379> client unpause
OK
127.0.0.1:6379> client pause 10000 write
OK
127.0.0.1:6379> set key value
The write request `set key value` is paused util the timeout of 500000000 milliseconds was reached.
**Fix:**
reset `server.client_pause_end_time` = 0 in `unpauseClients`
When a replica paused, it would not apply any commands event the command comes from master, if we feed the non-applied command to replication stream, the replication offset would be wrong, and data would be lost after failover(since replica's `master_repl_offset` grows but command is not applied).
To fix it, here are the changes:
* Don't update replica's replication offset or propagate commands to sub-replicas when it's paused in `commandProcessed`.
* Show `slave_read_repl_offset` in info reply.
* Add an assert to make sure master client should never be blocked unless pause or module (some modules may use block way to do background (parallel) processing and forward original block module command to the replica, it's not a good way but it can work, so the assert excludes module now, but someday in future all modules should rewrite block command to propagate like what `BLPOP` does).