Add a defensive checks to prevent double freeing a node from the cluster blacklist.
(cherry picked from commit 8d675950e6528e164454300d54aaf7bc024b075c)
In #10536, we introduced the assert, some older versions of servers
(like 7.0) doesn't gossip shard_id, so we will not add the node to
cluster->shards, and node->shard_id is filled in randomly and may not
be found here.
It causes that if we add a 7.2 node to a 7.0 cluster and allocate slots
to the 7.2 node, the 7.2 node will crash when it hits this assert. Somehow
like #12538.
In this PR, we remove the assert and replace it with an unconditional removal.
(cherry picked from commit e5ef161374155bb84a5720387836415ef3217963)
When entering pubsub mode and using the redis-cli only
connect command, we need to reset pubsub_mode because
we switch to a different connection.
This will affect the prompt when the connection is successful,
and redis-cli will crash when the connect fails:
```
127.0.0.1:6379> subscribe ch
1) "subscribe"
2) "ch"
3) (integer) 1
127.0.0.1:6379(subscribed mode)> connect 127.0.0.1 6380
127.0.0.1:6380(subscribed mode)> ping
PONG
127.0.0.1:6380(subscribed mode)> connect a b
Could not connect to Redis at a:0: Name or service not known
Segmentation fault
```
(cherry picked from commit 4de4fcf280648519239927d408624ea34953c0e0)
If we set `fsynced_reploff_pending` in `startAppendOnly`, and the fork doesn't start
immediately (e.g. there's another fork active at the time), any subsequent commands
will increment `server.master_repl_offset`, but will not cause a fsync (given they were
executed before the fork started, they just ended up in the RDB part of it)
Therefore, any WAITAOF will wait on the new master_repl_offset, but it will time out
because no fsync will be executed.
Release notes:
```
WAITAOF could timeout in the absence of write traffic in case a new AOF is created and
an AOFRW can't immediately start.
This can happen by the appendonly config is changed at runtime, but also after FLUSHALL,
and replica full sync.
```
(cherry picked from commit bfa3931a04bddec8eb37c91c25d3a77c70e33000)
The `retval` variable is defined as an `int`, so with 4 bytes, it cannot properly represent
microsecond values greater than the equivalent of about 35 minutes.
This bug shouldn't impact standard Redis behavior because Redis doesn't have timer
events that are scheduled as far as 35 minutes out, but it may affect custom Redis modules
which interact with the event timers via the RM_CreateTimer API.
The impact is that `usUntilEarliestTimer` may return 0 for as long as `retval` is scaled to
an overflowing value. While `usUntilEarliestTimer` continues to return `0`, `aeApiPoll`
will have a zero timeout, and so Redis will use significantly more CPU iterating through
its event loop without pause. For timers scheduled far enough into the future, Redis will
cycle between ~35 minute periods of high CPU usage and ~35 minute periods of standard
CPU usage.
(cherry picked from commit 24187ed8e396625cc44a6bbeeb87e01aec55c27d)
Starting a change in #12233 (released in 7.2), CLUSTER commands use client's
connection to decide whether to return TLS port or non-TLS port, but commands
called by Lua script and module's RM_Call don't have a real client with connection,
and would currently be regarded as non-TLS connections.
We can use server.current_client instead when it is available. When it is not (module calls
commands without a real client), we may see this as an undefined behavior, and return null
or default port (currently in this PR it returns default port, judged by server.tls_cluster).
(cherry picked from commit 2aad03fa399f520febb8b7db972459bc97484104)
An unintentional change was introduced in #10536, we used
to use addReplyLongLong and now it is addReplyBulkLonglong,
revert it back the previous behavior.
(cherry picked from commit 4031a187321be26fc96c044fec3ee759841e8379)
Use the __MAC_OS_X_VERSION_MIN_REQUIRED macro to detect the
macOS system version instead of using MAC_OS_X_VERSION_10_6.
From MacOSX14.0.sdk, the default definitions of MAC_OS_X_VERSION_xxx have
been removed in usr/include/AvailabilityMacros.h. It includes AvailabilityVersions.h,
where the following condition must be met:
`#if (!defined(_POSIX_C_SOURCE) && !defined(_XOPEN_SOURCE)) || defined(_DARWIN_C_SOURCE)`
Only then will MAC_OS_X_VERSION_xxx be defined.
However, in the project, _DARWIN_C_SOURCE is not defined, which leads to the
loss of the definition for MAC_OS_X_VERSION_10_6.
(cherry picked from commit a2b0701d2cf64c62f9bb41c01ad44586f01c2c5b)
The current commands.json doesn't mention the special NO ONE arguments.
This change is also applied to SLAVEOF
(cherry picked from commit 8d92f7f2b7ff2d3e42425f65c5f65fe80888dfaf)
Before this commit, Unix socket setup performed chmod(2) on the socket
file after calling listen(2). Depending on what umask is used, this
could leave the file with the wrong permissions for a short period of
time. As a result, another process could exploit this race condition and
establish a connection that would otherwise not be possible.
We now make sure the socket permissions are set up prior to calling
listen(2).
Updated the command tips for ACL SAVE / SETUSER / DELUSER, CLIENT SETNAME / SETINFO, and LATENCY RESET.
The tips now match CONFIG SET, since there's a similar behavior for all of these commands - the
user expects to update the various configurations & states on all nodes, not only on a single, random node.
For LATENCY RESET the response tip is now agg_sum.
Co-authored-by: Shachar Langbeheim <shachlan@amazon.com>
(cherry picked from commit 90e9fc387c81d48d3d82ea7fcd2d3b6d896bf923)
When connecting between a 7.0 and 7.2 cluster, the 7.0 cluster will not populate the shard_id field, which is expect on the 7.2 cluster. This is not intended behavior, as the 7.2 cluster is supposed to use a temporary shard_id while the node is in the upgrading state, but it wasn't being correctly set in this case.
(cherry picked from commit a2046c1eb1bcfcdeffadfffffad3b1f635965652)
Since the three commands have similar behavior (change config, return
OK), the tips that govern how they should behave should be similar.
Co-authored-by: Shachar Langbeheim <shachlan@amazon.com>
(cherry picked from commit 4b281ce519974061d22eef7a9a8072d3611e64f9)
When adding a new ACL rule was added, an attempt was made to remove
any "overlapping" rules. However, there when a match was found, the search
was not resumed at the right location, but instead after the original position of
the original command.
For example, if the current rules were `-config +config|get` and a rule `+config`
was added. It would identify that `-config` was matched, but it would skip over
`+config|get`, leaving the compacted rule `-config +config`. This would be evaluated
safely, but looks weird.
This bug can only be triggered with subcommands, since that is the only way to
have sequential matching rules. Resolves#12470. This is also only present in 7.2.
I think there was also a minor risk of removing another valid rule, since it would start
the search of the next command at an arbitrary point. I couldn't find a valid offset that
would have cause a match using any of the existing commands that have subcommands
with another command.
Fix the assertion when a busy script (timeout) signal ready keys (like LPUSH),
and then an arbitrary client's `allow-busy` command steps into `handleClientsBlockedOnKeys`
try wake up clients blocked on keys (like BLPOP).
Reproduction process:
1. start a redis with aof
`./redis-server --appendonly yes`
2. exec blpop
`127.0.0.1:6379> blpop a 0`
3. use another client call a busy script and this script push the blocked key
`127.0.0.1:6379> eval "redis.call('lpush','a','b') while(1) do end" 0`
4. user a new client call an allow-busy command like auth
`127.0.0.1:6379> auth a`
BTW, this issue also break the atomicity of script.
This bug has been around for many years, the old versions only have the
atomic problem, only 7.0/7.2 has the assertion problem.
Co-authored-by: Oran Agra <oran@redislabs.com>
This PR mainly fixes a possible integer overflow in `json_append_string()`.
When we use `cjson.encoding()` to encode a string larger than 2GB, at specific
compilation flags, an integer overflow may occur leading to truncation, resulting
in the part of the string larger than 2GB not being encoded.
On the other hand, this overflow doesn't cause any read or write out-of-range or segment fault.
1) using -O0 for lua_cjson (`make LUA_DEBUG=yes`)
In this case, `i` will overflow and leads to truncation.
When `i` reaches `INT_MAX+1` and overflows to INT_MIN, when compared to
len, `i` (1000000..00) is expanded to 64 bits signed integer (1111111.....000000) .
At this point i will be greater than len and jump out of the loop, so `for (i = 0; i < len; i++)`
will loop up to 2^31 times, and the part of larger than 2GB will be truncated.
```asm
`i` => -0x24(%rbp)
<+253>: addl $0x1,-0x24(%rbp) ; overflow if i large than 2^31
<+257>: mov -0x24(%rbp),%eax
<+260>: movslq %eax,%rdx ; move a 32-bit value with sign extension into a 64-bit signed
<+263>: mov -0x20(%rbp),%rax
<+267>: cmp %rax,%rdx ; check `i < len`
<+270>: jb 0x212600 <json_append_string+148>
```
2) using -O2/-O3 for lua_cjson (`make LUA_DEBUG=no`, **the default**)
In this case, because singed integer overflow is an undefined behavior, `i` will not overflow.
`i` will be optimized by the compiler and use 64-bit registers for all subsequent instructions.
```asm
<+180>: add $0x1,%rbx ; Using 64-bit register `rbx` for i++
<+184>: lea 0x1(%rdx),%rsi
<+188>: mov %rsi,0x10(%rbp)
<+192>: mov %al,(%rcx,%rdx,1)
<+195>: cmp %rbx,(%rsp) ; check `i < len`
<+199>: ja 0x20b63a <json_append_string+154>
```
3) using 32bit
Because `strbuf_ensure_empty_length()` preallocates memory of length (len * 6 + 2),
in 32-bit `cjson.encode()` can only handle strings smaller than ((2 ^ 32) - 3 ) / 6.
So 32bit is not affected.
Also change `i` in `strbuf_append_string()` to `size_t`.
Since its second argument `str` is taken from the `char2escape` string array which is never
larger than 6, so `strbuf_append_string()` is not at risk of overflow (the bug was unreachable).
GEOHASH / GEODIST / GEOPOS use zsetScore to get the score, in skiplist encoding,
we use dictFind to get the score, which is O(1), same as ZSCORE command.
It is not clear why these commands had O(Log(N)), and O(N) until now.
When loading a function from either RDB/AOF or a replica, it is essential not to
fail on timeout errors. The loading time may vary due to various factors, such as
hardware specifications or the system's workload during the loading process.
Once a function has been successfully loaded, it should be allowed to load from
persistence or on replicas without encountering a timeout failure.
To maintain a clear separation between the engine and Redis internals, the
implementation refrains from directly checking the state of Redis within the
engine itself. Instead, the engine receives the desired timeout as part of the
library creation and duly respects this timeout value. If Redis wishes to disable
any timeout, it can simply send a value of 0.
When doing merge selector, we should check whether the merge
has started (i.e., whether open_bracket_start is -1) every time.
Otherwise, encountering an illegal selector pattern could succeed
and also cause memory leaks, for example:
```
acl setuser test1 (+PING (+SELECT (+DEL )
```
The above would leak memory and succeed with only DEL being applied,
and would now error after the fix.
Co-authored-by: Oran Agra <oran@redislabs.com>
Additional test coverage for incr/decr operation.
integer number could be present in raw encoding format due to operation like append. A incr/decr operation following it optimize the string to int encoding format.
A bug introduced in #11657 (7.2 RC1), causes client-eviction (#8687)
and INFO to have inaccurate memory usage metrics of MONITOR clients.
Because the type in `c->type` and the type in `getClientType()` are confusing
(in the later, `CLIENT_TYPE_NORMAL` not `CLIENT_TYPE_SLAVE`), the comment
we wrote in `updateClientMemUsageAndBucket` was wrong, and in fact that function
didn't skip monitor clients.
And since it doesn't skip monitor clients, it was wrong to delete the call for it from
`replicationFeedMonitors` (it wasn't a NOP).
That deletion could mean that the monitor client memory usage is not always up to
date (updated less frequently, but still a candidate for client eviction).
changing the response and request policy of a few commands,
see https://redis.io/docs/reference/command-tips
1. RANDOMKEY used to have no response policy, which means
that when sent to multiple shards, the responses should be aggregated.
this normally applies to commands that return arrays, but since RANDOMKEY
replies with a simple string, it actually requires a SPECIAL response policy
(for the client to select just one)
2. SCAN used to have no response policy, but although the key names part of
the response can be aggregated, the cursor part certainly can't.
3. MSETNX had a request policy of MULTI_SHARD and response policy of AGG_MIN,
but in fact the contract with MSETNX is that when one key exists, it returns 0
and doesn't set any key, routing it to multiple shards would mean that if one failed
and another succeeded, it's atomicity is broken and it's impossible to return a valid
response to the caller.
Co-authored-by: Shachar Langbeheim <shachlan@amazon.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
When the redis server cluster running on cluster-preferred-endpoint-type unknown-endpoint mode, and receive a request that should be redirected to another redis server node, it does not reply the hostip, but a empty host like MOVED 3999 :6381.
The redis-cli would try to connect to an address without a host, which cause the issue:
```
127.0.0.1:7002> set bar bar
-> Redirected to slot [5061] located at :7000
Could not connect to Redis at :7000: No address associated with hostname
Could not connect to Redis at :7000: No address associated with hostname
not connected> exit
```
In this case, the redis-cli should use the previous hostip when there's no host provided by the server.
---------
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Madelyn Olson <madelynolson@gmail.com>
According to the format shown in https://redis.io/commands/cluster-nodes/
```
<ip:port@cport[,hostname[,auxiliary_field=value]*]>
```
when there is no hostname, and the auxiliary fields are hidden, the cluster topology should be
```
<ip:port@cport>
```
However in the code we always print the hostname even when it is an empty string, leaving an unnecessary comma tailing after cport, which is weird and conflicts with the doc.
```
94ca2f6cf85228a49fde7b738ee1209de7bee325 127.0.0.1:6379@16379, myself,master - 0 0 0 connected 0-16383
```
60e5075d4 Version 1.2.0 (#1202)
adef139a7 Remove support in deprecated TLS versions 1.0 and 1.1 (#1205)
d543baba6 Install major version symlink of shared objects.
052f99ab2 Ensure functionality without _MSC_VER definition
ded32c7d1 Add a test for the TCP_USER_TIMEOUT option. (#1192)
5cbd1f296 Add -Werror as a default. (#1193)
af1445638 CI: Update homebrew Redis version. (#1191)
6de326e87 Fix a typo in b6a052f.
git-subtree-dir: deps/hiredis
git-subtree-split: 60e5075d4ac77424809f855ba3e398df7aacefe8
* Fix integer overflows due to using wrong integer size.
* Add assertions / panic when overflow still happens.
* Deletion of dead code to avoid need to maintain it
* Some changes are not because of bugs, but rather paranoia.
* Improve cmsgpack and cjson test coverage.
Co-authored-by: Yossi Gottlieb <yossigo@gmail.com>
valgrind report a Uninitialised warning:
```
==25508== Uninitialised value was created by a heap allocation
==25508== at 0x4848899: malloc (in
/usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==25508== by 0x1A35A1: ztrymalloc_usable_internal (zmalloc.c:117)
==25508== by 0x1A368D: zmalloc (zmalloc.c:145)
==25508== by 0x21FDEA: clusterInit (cluster.c:973)
==25508== by 0x19DC09: main (server.c:7306)
```
Introduced in #12344
This is an addition to #12380, to prevent potential bugs when collecting
keys from multiple commands in the future.
Note that this function also resets numkeys in some cases.