Introduce Shard IDs to logically group nodes in cluster mode.
1. Added a new "shard_id" field to "cluster nodes" output and nodes.conf after "hostname"
2. Added a new PING extension to propagate "shard_id"
3. Handled upgrade from pre-7.2 releases automatically
4. Refactored PING extension assembling/parsing logic
Behavior of Shard IDs:
Replicas will always follow the shard of their reported primary: if a primary updates its shard ID, the replica will follow. (This need not carry over to cluster v2, as it is not an expected use case.)
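As a rough illustration of item 2, the extension payload could look something like the sketch below; the struct and constant names here are assumptions for illustration, not necessarily the ones used in cluster.h.
```
/* Illustrative sketch of a shard-id cluster bus ping extension payload. */
#define CLUSTER_NAMELEN 40          /* hex id length used for node/shard ids */

typedef struct {
    char shard_id[CLUSTER_NAMELEN]; /* shard id of the sending node, propagated
                                     * via PING so peers can learn/adopt it */
} clusterMsgPingExtShardId;
```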
* Print IP and port on cluster bus message sanity check
Add a print statement to indicate which IP/port is sending
the error messages. That way we can at least check to see
whether it is a node in the cluster or some other nefarious node.
This was proposed in #11339.
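A minimal sketch of the added logging, assuming the usual serverLog/LL_WARNING helpers and the connAddrPeerName wrapper mentioned below; the exact call site and message wording are illustrative.
```
/* Sketch: report which peer sent the malformed cluster bus packet. */
char ip[46] = "";   /* large enough for an IPv6 address string */
int port = 0;
if (connAddrPeerName(link->conn, ip, sizeof(ip), &port) == -1) {
    serverLog(LL_WARNING, "Received malformed cluster bus packet from unknown peer");
} else {
    serverLog(LL_WARNING, "Received malformed cluster bus packet from %s:%d", ip, port);
}
```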
Unrelated change: the return value check for connAddrPeerName should be
against -1 instead of C_ERR, although the value of C_ERR is also -1.
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
Renamed from "Pause Clients" to "Pause Actions" since the mechanism can pause
several actions in redis, not just clients (e.g. eviction, expiration).
Previously, each pause purpose (which has a timeout that's tracked separately from other purposes)
also implicitly dictated what it pauses (reads, writes, eviction, etc). Now it is explicit, and
the actions that are paused (bit flags) are defined separately from the purpose.
- Previously, using the pause-client feature also implicitly meant making the server static:
  - Pause replica traffic
  - Pause eviction processing
  - Pause expire processing
Making the server static is also used for failover and shutdown. This PR internally rebrands
the pause-client API into a pause-action API. It also simplifies the pauseClients structure
by replacing the pointers array with a static array.
The context of this PR is adding another pause trigger, which will be activated in case
of OOM as a throttling mechanism ([see here](https://github.com/redis/redis/issues/10907)).
In that case we want to pause only the client and eviction actions.
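A rough sketch of the resulting split between pause purposes and paused actions; the flag names and groupings below are illustrative, not the actual definitions.
```
/* Illustrative sketch: what is paused is now an explicit bit mask,
 * decoupled from the purpose that requested the pause. */
#define PAUSE_ACTION_CLIENT_WRITE  (1 << 0)
#define PAUSE_ACTION_CLIENT_ALL    (1 << 1)   /* pause reads too */
#define PAUSE_ACTION_EXPIRE        (1 << 2)
#define PAUSE_ACTION_EVICT         (1 << 3)
#define PAUSE_ACTION_REPLICA       (1 << 4)   /* pause replica traffic */

/* "Make the server static", as needed by failover and shutdown. */
#define PAUSE_ACTIONS_STATIC_SET (PAUSE_ACTION_CLIENT_WRITE | \
    PAUSE_ACTION_EXPIRE | PAUSE_ACTION_EVICT | PAUSE_ACTION_REPLICA)

/* The OOM throttling trigger would only pause clients and eviction. */
#define PAUSE_ACTIONS_OOM_SET (PAUSE_ACTION_CLIENT_ALL | PAUSE_ACTION_EVICT)
```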
As discussed in #11084, `propagatePendingCommands` should happen after the del
notification is fired, so that the notification effect and the `del` are replicated inside the same MULTI/EXEC.
Test was added to verify the fix.
PR #9320 introduced initialization order changes: the cluster is now initialized after modules.
This change causes a crash if a module uses RM_Call inside its load function
in cluster mode (the code will try to access `server.cluster`, which at that point is NULL).
To solve it, separate cluster initialization into 2 phases:
1. Structure initialization, which happens before module initialization
2. Listener initialization, which happens after
Test was added to verify the fix.
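A sketch of the intended startup order; the function names and call sites here are assumptions for illustration, not the literal server.c code.
```
/* Illustrative startup order after the split. */
if (server.cluster_enabled)
    clusterInit();            /* Phase 1: allocate server.cluster and its data
                               * structures, so a module's OnLoad can use
                               * RM_Call safely in cluster mode. */
/* ... modules are loaded here; RM_Call now finds server.cluster != NULL ... */
if (server.cluster_enabled)
    clusterInitListeners();   /* Phase 2: set up the cluster bus listener,
                               * after modules are ready. */
```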
Freeze time during execution of scripts and all other commands.
This means that a key is either expired or not, and doesn't change
state during a script execution. resolves #10182
This PR adds a new `commandTimeSnapshot` function.
The function logic is extracted from `keyIsExpired`, but the related
calls to `fixed_time_expire` and `mstime()` are removed, see below.
In commands, we will avoid calling `mstime()` multiple times
and just use the time sampled in `call`. The background is that,
e.g., using `PEXPIRE 1` under valgrind sometimes results in the key
being deleted rather than expired. The reason is that the `PEXPIRE`
command and `checkAlreadyExpired` each call `mstime()` separately.
There are other more important changes in this PR:
1. Eliminate `fixed_time_expire`, it is no longer needed.
When we want to sample time we should always use a time snapshot.
We will use `in_nested_call` instead to update the cached time in `call`.
2. Move the call for `updateCachedTime` from `serverCron` to `afterSleep`.
Since `commandTimeSnapshot` now always returns the sampled time, the
`lookupKeyReadWithFlags` call in `getNodeByQuery` would get an outdated
cached time (because `processCommand` is outside the `call` context), so
we put the call to `updateCachedTime` in `afterSleep`.
3. Cache the time each time a module locks Redis.
Call `updateCachedTime` in `moduleGILAfterLock`, affecting `RM_ThreadSafeContextLock`
and `RM_ThreadSafeContextTryLock`.
Currently the commandTimeSnapshot change affects the following TTL commands:
- SET EX / SET PX
- EXPIRE / PEXPIRE
- SETEX / PSETEX
- GETEX EX / GETEX PX
- TTL / PTTL
- EXPIRETIME / PEXPIRETIME
- RESTORE key TTL
And other commands just use the cached mstime (including TIME).
This is considered to be a breaking change since it can break a script
that uses a loop to wait for a key to expire.
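A minimal sketch of the idea, assuming a cached time field on the server struct (the exact field and refresh points may differ from the real code): the clock is sampled once and every expiry check compares against that snapshot.
```
/* Sketch: all expiry checks read one cached sample instead of calling
 * mstime() repeatedly. */
mstime_t commandTimeSnapshot(void) {
    return server.mstime;   /* refreshed in afterSleep()/call(), per this PR */
}

int keyIsExpired(redisDb *db, robj *key) {
    mstime_t when = getExpire(db, key);
    if (when < 0) return 0;                  /* key has no TTL */
    return commandTimeSnapshot() > when;     /* compare against the snapshot */
}
```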
This PR introduces a couple of changes to improve cluster test stability:
1. Increase the cluster node timeout to 3 seconds, which is similar to the
normal cluster tests, but introduce a new mechanism to increase the ping
period so that the tests are still fast. This new config is a debug config.
2. Set `cluster-replica-no-failover yes` on a wider array of tests which are
sensitive to failovers; such failovers were occurring on the ARM CI.
- fix `the the` typo
- `LPOPRPUSH` does not exist, should be `RPOPLPUSH`
- `CLUSTER GETKEYINSLOT` 's time complexity should be O(N)
- `there bytes` should be `three bytes`, this closes #11266
- change the word `slave` to `replica` in a log message (the front part was changed earlier, but the back part was missed)
- remove useless aofReadDiffFromParent in server.h
- `trackingHandlePendingKeyInvalidations` method adds a void parameter
* Fix CLUSTER SHARDS showing empty hostname
In #10290, we changed the clusterNode hostname from `char*`
to `sds`, and the old `node->hostname` check was changed to
`sdslen(node->hostname)!=0`.
But that change was missed in `addNodeDetailsToShardReply`,
so the CLUSTER SHARDS command returns an empty string as the
hostname when it is unavailable.
Like this (note that we listed it as optional in the doc):
```
9) "hostname"
10) ""
```
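A sketch of the missing check in addNodeDetailsToShardReply (the surrounding reply bookkeeping is abbreviated): only emit the optional "hostname" field when the sds is non-empty, matching the `sdslen(node->hostname) != 0` convention from #10290.
```
/* Abbreviated sketch of the fix. */
if (sdslen(node->hostname) != 0) {
    addReplyBulkCString(c, "hostname");
    addReplyBulkCBuffer(c, node->hostname, sdslen(node->hostname));
    reply_count += 2;   /* illustrative: however the reply length is tracked */
}
```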
EVAL scripts are by default not considered `write` commands, so they were allowed on a replica.
But when a shebang is added, they become `write` commands (unless the `no-writes` flag is added).
With this change we'll handle them as write commands, and reply with MOVED instead of
READONLY when they are executed on a Redis cluster replica.
Co-authored-by: chendianqiang <chendianqiang@meituan.com>
Bugfix:
when a slot is forcibly assigned to another master, the old master
loses ownership of the slot and then calls delKeysInSlot() to delete
all keys in that slot. These delete operations should be replicated
to replicas, to avoid data divergence between the master and its replicas.
Additionally, in this case, we now call (see the sketch after this list):
* signalModifiedKey (to invalidate WATCH)
* moduleNotifyKeyspaceEvent (key space notification for modules)
* dirty++ (to signal that the persistence file may be outdated)
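An abbreviated sketch of the per-key handling in delKeysInSlot() after this change; the key iteration and the replication call are omitted, and the exact argument shapes may differ from the diff.
```
/* For each key found in the lost slot: */
dbDelete(db, key);
/* ... a DEL for this key is propagated to replicas/AOF here ... */
signalModifiedKey(NULL, db, key);                               /* invalidate WATCH */
moduleNotifyKeyspaceEvent(NOTIFY_GENERIC, "del", key, db->id);  /* module keyspace events */
server.dirty++;                                                 /* persistence may be outdated */
```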
Co-authored-by: weimeng <weimeng@didiglobal.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Redis 7.0 has #9890 which added an assertion when the propagation queue
was not flushed and we got to beforeSleep.
But it turns out that when processCommand calls getNodeByQuery and
decides to reject the command, it can lead to a key being lazily
expired and deleted without the propagation queue being flushed afterwards.
This change prevents lazy expiry from deleting the key at this stage
(i.e. when it is not part of a command being processed in `call`).
Introduce a listen method into the connection type, so the listen logic
no longer needs to be hard-coded. Originally, we initialized the server
during startup like this:
```
if (server.port)
    listenToPort(server.port,&server.ipfd);
if (server.tls_port)
    listenToPort(server.tls_port,&server.tlsfd);
if (server.unixsocket)
    anetUnixServer(...server.unixsocket...);
...
if (createSocketAcceptHandler(&server.ipfd, acceptTcpHandler) != C_OK)
if (createSocketAcceptHandler(&server.tlsfd, acceptTLSHandler) != C_OK)
if (createSocketAcceptHandler(&server.sofd, acceptUnixHandler) != C_OK)
...
```
...
If a new connection type gets supported, we have to add more hard-coded
logic to set up its listener.
Introduce .listen and refactor the listeners (Unix sockets support this too);
this allows setting up listener arguments and creating the listeners in a loop.
What's more, '.listen' is defined in connection.h, so we would need to include
server.h to import 'struct socketFds', but server.h already includes
'connection.h'. To avoid an include loop (and also to keep the code reasonable),
define 'struct connListener' in connection.h instead of using 'struct socketFds'
from server.h. This is why this commit includes more changes.
There are more fields in 'struct connListener', hence it's possible to
simplify changeBindAddr & applyTLSPort() & updatePort() into a single
piece of logic: update the listener config from server.xxx and re-create
the listener.
Because of the new 'priv' field in struct connListener, we pass it to
the accept handler (even though it's not used currently); this may be used
in the future.
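A rough sketch of the direction of this refactor; the struct layout, field names, and constants below are illustrative rather than the literal connection.h definitions.
```
typedef struct connListener {
    int fd[CONFIG_BINDADDR_MAX];   /* bound listening sockets */
    int count;                     /* number of bound sockets */
    char **bindaddr;
    int bindaddr_count;
    int port;                      /* 0 means "not configured" */
    struct ConnectionType *ct;     /* provides the .listen method */
    void *priv;                    /* passed through to the accept handler */
} connListener;

/* Startup becomes one loop instead of per-type hard-coded branches. */
static void listenToListeners(void) {
    for (int j = 0; j < CONN_TYPE_MAX; j++) {
        connListener *listener = &server.listeners[j];
        if (!listener->ct || !listener->port) continue;   /* type not configured */
        if (listener->ct->listen(listener) == C_ERR) {
            serverLog(LL_WARNING, "Failed listening on port %d", listener->port);
            exit(1);   /* fail fast, like listenToPort() used to */
        }
    }
}
```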
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
As suggested by Oran, use an array to store all the connection types
instead of a linked list, and identify each connection type by a string
name. The index of a connection type is allocated dynamically.
Currently we support a maximum of 8 connection types, including:
- tcp
- unix socket
- tls
RDMA is planned as well, which leaves another 4 free slots; that should be
enough for a long time.
Introduce 3 functions to get connection type by a fast path:
- connectionTypeTcp()
- connectionTypeTls()
- connectionTypeUnix()
Note that connectionByType() is designed to be used only in unlikely code paths.
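A sketch of how such a fast-path accessor might avoid repeated name lookups; the constant and assertion helper are assumptions.
```
/* Illustrative: resolve the type once via the slow lookup, then cache it. */
ConnectionType *connectionTypeTcp(void) {
    static ConnectionType *ct_tcp = NULL;
    if (ct_tcp != NULL) return ct_tcp;
    ct_tcp = connectionByType(CONN_TYPE_SOCKET);  /* slow, name-based lookup */
    serverAssert(ct_tcp != NULL);
    return ct_tcp;
}
```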
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Abstract a common interface for connection types, so Redis can hide the
implementation and the upper layer only calls the connection API, without macros.
uplayer
|
connection layer
/ \
socket TLS
Currently, for both socket and TLS, all the methods of connection type
are declared as static functions.
It's possible to build TLS (or even the socket type) as a shared library,
and have Redis load it dynamically as a next step.
Also add helper function connTypeOfCluster() and
connTypeOfReplication() to simplify the code:
link->conn = server.tls_cluster ? connCreateTLS() : connCreateSocket();
-> link->conn = connCreate(connTypeOfCluster());
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Originally, connPeerToString was designed to get the address info from a
socket only (for both TCP & TLS), and the API 'connPeerToString' operated
on an FD, like:
```
int connPeerToString(connection *conn, char *ip, size_t ip_len, int *port) {
    return anetFdToString(conn ? conn->fd : -1, ip, ip_len, port, FD_TO_PEER_NAME);
}
```
Introduce connAddr and implement the .addr method for socket and TLS,
so that the APIs 'connAddr' and 'connFormatAddr' become oriented to a
connection, like:
```
static inline int connAddr(connection *conn, char *ip, size_t ip_len, int *port, int remote) {
    if (conn && conn->type->addr) {
        return conn->type->addr(conn, ip, ip_len, port, remote);
    }
    return -1;
}
```
Also remove 'FD_TO_PEER_NAME' & 'FD_TO_SOCK_NAME', and use a boolean
'remote' argument to get the local/remote address of a connection.
With these changes, it's possible to support other connection
types which do not use a socket (e.g. RDMA).
Thanks to Oran for suggestions!
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
In clusterMsgPingExtForgottenNode, sizeof(name) is CLUSTER_NAMELEN,
while sizeof(clusterMsgPingExtForgottenNode) is > CLUSTER_NAMELEN.
Computing (name + sizeof(clusterMsgPingExtForgottenNode)) makes the
sanitizer report an out-of-bounds error, which is a false positive here.
Gossip the cluster node blacklist in ping and pong messages.
This means that CLUSTER FORGET doesn't need to be sent to all nodes in a cluster.
It can be sent to one or more nodes and then be propagated to the rest of them.
For each blacklisted node, its node id and its remaining blacklist TTL are gossiped in a
cluster bus ping extension (introduced in #9530).
As an outstanding item mentioned in #10737, we can make cluster config file and
ACL file saving use a safer, atomic pattern (write to a temp file, fsync, rename, fsync the directory).
The cluster config file was using an in-place overwrite and truncation (which the
main config file also used before #7824).
The ACL file was already using the temp file and rename approach, but was missing an fsync.
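For reference, a self-contained sketch of the write-temp/fsync/rename/fsync-dir pattern (error handling abbreviated; the helper name is ours, not the one used in the PR):
```
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Atomically replace `path` (living in directory `dir`) with `len` bytes of `buf`. */
int atomicWriteFile(const char *path, const char *dir, const char *buf, size_t len) {
    char tmp[1024];
    snprintf(tmp, sizeof(tmp), "%s.tmp-%d", path, (int)getpid());

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) return -1;
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) == -1) {
        close(fd); unlink(tmp); return -1;
    }
    close(fd);

    if (rename(tmp, path) == -1) { unlink(tmp); return -1; } /* atomic swap */

    int dirfd = open(dir, O_RDONLY);  /* fsync the directory to persist the rename */
    if (dirfd == -1) return -1;
    int ret = fsync(dirfd);
    close(dirfd);
    return ret;
}
```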
Co-authored-by: 朱天 <zhutian03@meituan.com>
replace use of:
sprintf --> snprintf
strcpy/strncpy --> redis_strlcpy
strcat/strncat --> redis_strlcat
**Why are we making this change?**
Much of the code uses unsafe variants or deprecated buffer-handling
functions.
While most cases are probably not presenting any issue on the known paths,
programming errors and unterminated strings might lead to potential
buffer overflows which are not covered by tests.
**As part of this PR we change**
1. Added implementations of redis_strlcpy and redis_strlcat based on the strl implementation: https://linux.die.net/man/3/strl (a sketch follows this list)
2. Changed all occurrences of sprintf to snprintf
3. Changed occurrences of strcpy/strncpy to redis_strlcpy
4. Changed occurrences of strcat/strncat to redis_strlcat
5. Changed the behavior of ll2string/ull2string/ld2string so that they always place a null
terminator ('\0') at the first index of the output buffer. This was done to make
these functions safer in cases where the caller does not check the output
returned by them (for example in rdbRemoveTempFile)
6. Added a compiler directive that issues a deprecation error whenever a use of
sprintf/strcpy/strcat is found during compilation, so such usage fails at compile time.
However, keep in mind that since the deprecation attribute is not supported by all compilers,
this is only expected to fail during the push workflows.
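As referenced in item 1, here is a sketch of the strlcpy-style semantics (the actual redis_strlcpy may differ in minor details): copy at most dsize-1 bytes, always NUL-terminate when dsize > 0, and return strlen(src) so callers can detect truncation.
```
#include <stddef.h>

size_t redis_strlcpy(char *dst, const char *src, size_t dsize) {
    const char *osrc = src;
    size_t nleft = dsize;

    if (nleft != 0) {
        while (--nleft != 0) {                /* copy as much as fits */
            if ((*dst++ = *src++) == '\0') break;
        }
    }
    if (nleft == 0) {                         /* src didn't fit in dst */
        if (dsize != 0) *dst = '\0';          /* always NUL-terminate */
        while (*src++) ;                      /* walk to the end of src */
    }
    return (size_t)(src - osrc - 1);          /* strlen(src); >= dsize means truncated */
}
```
A caller can then check `redis_strlcpy(buf, s, sizeof(buf)) >= sizeof(buf)` to detect truncation.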
**NOTE:** this is only an initial milestone. We might also consider
using the *_s implementations provided by the C11 extensions (however, these
are not yet widely supported). I would also suggest starting to
look at static code analyzers to track unsafe use cases.
For example, the LLVM clang checker supports security.insecureAPI.DeprecatedOrUnsafeBufferHandling,
which can help locate unsafe function usage.
https://clang.llvm.org/docs/analyzer/checkers.html#security-insecureapi-deprecatedorunsafebufferhandling-c
The main reason not to onboard it at this stage is that the alternative
accepted by clang is to use the C11 extensions, which are not always
supported by the stdlib.
Currently in cluster mode, the Redis process locks the cluster config file when
starting up and holds the lock for the entire lifetime of the process.
When the server shuts down, it doesn't explicitly release the lock on the
cluster config file. We noticed a problem in restart testing: if you shut down
a very large redis-server process (i.e. with several hundred GB of data stored),
it takes the OS a while to free the resources and unlock the cluster config file.
So if we immediately try to restart the redis server process, it might fail to acquire
the lock on the cluster config file and fail to come up.
This fix explicitly releases the lock on the cluster config file upon shutdown rather
than relying on the OS to release it, which is a cleaner and safer way to
free up the acquired resources.
* In cluster getNodeByQuery, when the target slot is in migrating state and
some of the requested keys are missing but at least one key exists, return TRYAGAIN.
Before this commit, when a node in migrating state received a multi-key
command and some of the keys didn't exist, the command emitted
an `ASK` redirection.
After this commit, if some keys exist and some don't, the command
emits a TRYAGAIN error; if none of the keys exist, the command
emits an `ASK` redirection.
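A sketch of the resulting decision in getNodeByQuery for a slot in migrating state; the variable names are abbreviated, though the redirection constants follow the usual cluster.h naming.
```
if (migrating_slot && missing_keys > 0) {
    if (existing_keys > 0) {
        /* Mixed case: some keys are here, some already migrated.
         * The client must retry once the slot settles: -TRYAGAIN. */
        if (error_code) *error_code = CLUSTER_REDIR_UNSTABLE;
        return NULL;
    } else {
        /* No key is here: redirect the client with -ASK to the importing node. */
        if (error_code) *error_code = CLUSTER_REDIR_ASK;
        return server.cluster->migrating_slots_to[slot];
    }
}
```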
It used to return slots as strings, like:
```
redis> cluster shards
1) 1) "slots"
2) 1) "10923"
2) "16383"
```
The CLUSTER SHARDS docs and the top comment of #10293 say that it returns integers.
Note that other commands like CLUSTER SLOTS return slots as integers.
Use addReplyLongLong instead of addReplyBulkLongLong, so slots are now returned as integers:
```
redis> cluster shards
1) 1) "slots"
2) 1) (integer) 10923
2) (integer) 16383
```
This is a small breaking change, introduced in 7.0.0 (7.0 RC3, #10293).
Fixes #10680
Since PUBLISH and SPUBLISH use different dictionaries for channels and clients,
and we already have an API for PUBLISH, it only makes sense to have one for SPUBLISH.
Add test coverage and unify some test infrastructure.
* Limit cluster node id length for CLUSTER commands loading
* Cluster node name sanity check for length and values
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
* Fix race condition where node loses its last slot and turns into replica
When a node had lost its last slot and found out via the SETSLOT command
before the cluster bus PONG from the new owner arrived, the
node didn't turn itself into a replica of the new slot owner.
This commit adds the same logic to the SETSLOT command as already exists
for the cluster bus PONG processing.
* Revert "Fix new / failing cluster slot migration test (#10482)"
This reverts commit 0b21ef8d49c47a5dd47a0bcc0cf1b2c235f24fd0.
In this test, the old slot owner finds out that it has lost its last
slot in a nondeterministic way: either from the cluster bus PONG from the
new slot owner, or from a SETSLOT command issued by redis-cli. In
both cases, the result should be the same and the old owner should
turn itself into a replica of the new slot owner.
This commit improves the malloc efficiency of the slots_info_pairs mechanism in cluster.c
by changing the adlist into an array that is realloc'ed with a greedy growth mechanism.
Recently the cluster tests have been consistently failing when executed with ASAN in the CI.
I tried to track down the commit that started it, and it appears to be #10293.
Looking at the commit, I realized it didn't affect this test / flow, other than the
replacement of the slots_info_pairs from sds to list.
I concluded that what could be happening is that the slot range is very fragmented,
and that results in many allocations.
With sds, this resulted in one allocation, and we also have a greedy growth mechanism,
but with adlist, we just had many, many small allocations.
This probably causes stress on ASAN, and causes it to be slow at termination.
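A sketch of the greedy-growth array replacing the adlist (field names follow the description above; the exact code in cluster.c may differ):
```
/* Append one (start, end) slot range; grow the array geometrically so that a
 * very fragmented slot range doesn't cause one realloc per pair. */
void clusterAddSlotsInfoPair(clusterNode *n, uint16_t start, uint16_t end) {
    if (n->slot_info_pairs_count + 2 > n->slot_info_pairs_alloc) {
        n->slot_info_pairs_alloc = n->slot_info_pairs_alloc ?
                                   n->slot_info_pairs_alloc * 2 : 8;
        n->slot_info_pairs = zrealloc(n->slot_info_pairs,
            n->slot_info_pairs_alloc * sizeof(uint16_t));
    }
    n->slot_info_pairs[n->slot_info_pairs_count++] = start;
    n->slot_info_pairs[n->slot_info_pairs_count++] = end;
}
```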
Implement a new cluster shards command, which provides a flexible and extensible API for topology discovery.
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Adds the ability to track the lag of a consumer group (CG), that is, the number
of entries yet-to-be-delivered from the stream.
The proposed constant-time solution is in the spirit of "best-effort."
Partially addresses #8737.
## Description of approach
We add a new "entries_added" property to the stream. This starts at 0 for a new
stream and is incremented by 1 with every `XADD`. It is essentially an all-time
counter of the entries added to the stream.
Given the stream's length and this counter value, we can trivially find the logical
"entries_added" counter of the first ID if and only if the stream is contiguous.
A fragmented stream contains one or more tombstones generated by `XDEL`s.
The new "xdel_max_id" stream property tracks the latest tombstone.
The CG also tracks its last delivered ID as an "entries_read" counter and
increments it independently when delivering new messages, unless this
read counter is invalid (-1 means an invalid offset). When the CG's counter is
available, the reported lag is the difference between the added and read counters.
Lastly, this also adds a "first_id" field to the stream structure in order to make
looking it up cheaper in most cases.
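In code terms, the constant-time lag described above boils down to something like the sketch below; the field names follow this description, and `streamRangeHasTombstones` stands in for whatever check the implementation uses to detect fragmentation.
```
/* Returns the CG's lag, or -1 when it cannot be tracked (reported as null). */
long long streamCGLag(stream *s, streamCG *cg) {
    if (cg->entries_read == -1) return -1;            /* invalid read counter */
    if (streamRangeHasTombstones(s, cg)) return -1;   /* XDELs fragment the range */
    return (long long)s->entries_added - cg->entries_read;
}
```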
## Limitations
There are two cases in which the mechanism isn't able to track the lag.
In these cases, `XINFO` replies with `null` in the "lag" field.
The first case is when a CG is created with an arbitrary last delivered ID
that is neither "0-0" nor the first or the last entry of the stream. In this case,
it is impossible to obtain a valid read counter (short of an O(N) operation).
The second case is when there are one or more tombstones fragmenting
the stream's entries range.
In both cases, given enough time and assuming that the consumers are
active (reading and acking) and advancing, the CG should be able to
catch up with the tip of the stream and report zero lag.
Once that's achieved, lag tracking would resume as normal (until the
next tombstone is set).
## API changes
* `XGROUP CREATE` added with the optional named argument `[ENTRIESREAD entries-read]`
for explicitly specifying the new CG's counter.
* `XGROUP SETID` added with an optional positional argument `[ENTRIESREAD entries-read]`
for specifying the CG's counter.
* `XINFO` reports the maximal tombstone ID, the recorded first entry ID, and total
number of entries added to the stream.
* `XINFO` reports the current lag and logical read counter of CGs.
* `XSETID` is an internal command that's used in replication/aof. It has been added with
the optional positional arguments `[ENTRIESADDED entries-added] [MAXDELETEDID max-deleted-entry-id]`
for propagating the CG's offset and maximal tombstone ID of the stream.
## The generic unsolved problem
The current stream implementation doesn't provide an efficient way to obtain the
approximate/exact size of a range of entries. While it could've been nice to have
that ability (#5813) in general, let alone specifically in the context of CGs, the risk
and complexities involved in such implementation are in all likelihood prohibitive.
## A refactoring note
The `streamGetEdgeID` has been refactored to accommodate both the existing seek
of any entry as well as seeking non-deleted entries (the addition of the `skip_tombstones`
argument). Furthermore, this refactoring also migrated the seek logic to use the
`streamIterator` (rather than `raxIterator`) that was, in turn, extended with the
`skip_tombstones` Boolean struct field to control the emission of these.
Co-authored-by: Guy Benoish <guy.benoish@redislabs.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
publishshard was added in #8621 (7.0 RC1), but the publishshard_sent
stat is not shown in the CLUSTER INFO command.
Other changes:
1. Remove useless `needhelp` statements, it was removed in 3dad819.
2. Use `LL_WARNING` log level for some error logs (I/O error, Connection failed).
3. Fix typos noticed along the way.