This PR adjusts the logging conditions of clusterLogCantFailover
in this two ways.
1. For the same cant_failover_reason, we will print the log once
in CLUSTER_CANT_FAILOVER_RELOG_PERIOD, but its value is 10s, which
is a bit long, shorten it to 1s, so we can better track its state.
We get to see the system making progress by watching the message.
Using 1s also covers pretty much all cases as i don't see a reason
for using a <1s node timeout, test or prod.
2. We will not print logs before the nolog_fail_time, its value
is cluster-node-timeout+5000. This may casue us to lose some logs,
for example, if cluster-node-timeout is small, auth_timeout will
be 2000, and auth_retry_time will be 4000. In this case, we will
lose all the reasons during the election if the failover is timedout.
So remove the nolog_fail_time logic, since we still do have the
CLUSTER_CANT_FAILOVER_RELOG_PERIOD logic, we won't print too many
logs.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Release candidates have a version that is lower than 8.0.0 to allow for
8.0.0 to have 0x080000 as a release number. However, we did an explicit
check to make sure a version was 8.0 or greater to validate a replica
supports a feature. Now we are using the highest patch version of latest
minor to do the comparison to accommodate future versions.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Adds light-weight cluster bus header for pubsub message. Closes#557.
This also supports sending to and receiving non-light messages from
older versions of the engine.
The light-weight cluster bus message supports multiple pubsub messages
(payloads) for one pubsub channel. Receiving messages with multiple
payloads is supported but we're not yet sending such multi-payload
messages to other nodes.
---------
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
In some cases, like read more than write scenario, the replication
offset of the replicas are the same. When the primary fails, the
replicas have the same rankings (rank == 0). They issue the election
at the same time (although we have a random 500), the simultaneous
elections may lead to the failure of the election due to quorum.
In clusterGetReplicaRank, when we calculates the rank, if the offsets
are the same, the one with the smaller node name will have a better
rank to avoid this situation.
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
The metric tracks cpu time in micro-seconds, sharing the same value as
`INFO COMMANDSTATS`, aggregated under per-slot context.
---------
Signed-off-by: Kyle Kim <kimkyle@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
We will not reset failover_auth_time after setting it, this is used
to check auth_timeout and auth_retry_time, but we should at least
reset it after a successful failover.
Let's assume the following scenario:
1. Two replicas initiate an election.
2. Replica 1 is elected as the primary node, and replica 2 does not have
enough votes.
3. Replica 1 is down, ie the new primary node down again in a short
time.
4. Replica 2 know that the new primary node is down and wants to
initiate
a failover, but because the failover_auth_time of the previous round
has not been reset, it needs to wait for it to time out and then wait
for the next retry time, which will take cluster-node-timeout * 4 times,
this adds a lot of delay.
There is another problem. Like we will set additional random time for
failover_auth_time, such as random 500ms and replicas ranking 1s. If
replica 2 receives PONG from the new primary node before sending the
FAILOVER_AUTH_REQUEST, that is, before the failover_auth_time, it will
change itself to a replica. If the new primary node goes down again at
this time, replica 2 will use the previous failover_auth_time to
initiate
an election instead of going through the logic of random 500ms and
replicas ranking 1s again, which may lead to unexpected consequences
(for example, a low-ranking replica initiates an election and becomes
the new primary node).
That is, we need to reset failover_auth_time at the appropriate time.
When the replica switches to a new primary, we reset it, because the
existing failover_auth_time is already out of date in this case.
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
Fix#784
Prior to the change, `CLUSTER SHARDS` command processing might pick a
failed primary node which won't have the slot coverage information and
the slots `output` in turn would be empty. This change finds an
appropriate node which has the slot coverage information served by a
given shard and correctly displays it as part of `CLUSTER SHARDS`
output.
Before:
```
1) 1) "slots"
2) (empty array)
3) "nodes"
4) 1) 1) "id"
2) "2936f22a490095a0a851b7956b0a88f2b67a5d44"
...
9) "role"
10) "master"
...
13) "health"
14) "fail"
```
After:
```
1) 1) "slots"
2) 1) 0
2) 5461
3) "nodes"
4) 1) 1) "id"
2) "2936f22a490095a0a851b7956b0a88f2b67a5d44"
...
9) "role"
10) "master"
...
13) "health"
14) "fail"
```
---------
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Fixes a regression introduced in PR #445, which allowed a message from a
replica
to update the slot ownership of its primary. The regression results in a
`replicaof` cycle, causing server crashes due to the cycle detection
assert. The
fix restores the previous behavior where only primary senders can
trigger
`clusterUpdateSlotsConfigWith`.
Additional changes:
* Handling of primaries without slots is obsoleted by new handling of
when a
sender that was a replica announces that it is now a primary.
* Replication loop detection code is unchanged but shifted downwards.
* Some variables are renamed for better readability and some are
introduced to
avoid repeated memcmp() calls.
Fixes#753.
---------
Signed-off-by: Ping Xie <pingxie@google.com>
Allows cluster admins to configure the blacklist TTL as needed to allow
sufficient time for `CLUSTER FORGET` to be executed on every node in the
cluster.
Config name `cluster-blacklist-ttl`; unit seconds; deault 60.
---------
Signed-off-by: Brennan Cathcart <brennancathcart@gmail.com>
New configs:
* `cluster-announce-client-ipv4`
* `cluster-announce-client-ipv6`
New module API function:
* `ValkeyModule_GetClusterNodeInfoForClient`, takes a client id and is
otherwise just like its non-ForClient cousin.
If configured, one of these IP addresses are reported to each client in
CLUSTER SLOTS, CLUSTER SHARDS, CLUSTER NODES and redirects, replacing
the IP (`custer-announce-ip` or the auto-detected IP) of each node.
Which one is reported to the client depends on whether the client is
connected over IPv4 or IPv6.
Benefits:
* This allows clients using IPv4 to get the IPv4 addresses of all
cluster nodes and IPv6 clients to get the IPv6 clients.
* This allows the IPs visible to clients to be different to the IPs used
between the cluster nodes due to NAT'ing.
The information is propagated in the cluster bus using new Ping
extensions. (Old nodes without this feature ignore unknown Ping
extensions.)
This adds another dimension to CLUSTER SLOTS reply. It now depends on
the client's use of TLS, the IP address family and RESP version.
Refactoring: The cached connection type definition is moved from
connection.h (it actually has nothing to do with the connection
abstraction) to server.h and is changed to a bitmap, with one bit for
each of TLS, IPv6 and RESP3.
Fixes#337
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
We forgot to call sdsfreesplitres in the error path during a
nodes.conf corruption check, this function exits on the error
paths so this is just a cleanup.
Signed-off-by: bentotten <59932872+bentotten@users.noreply.github.com>
Add check in CLUSTERLINK KILL cmd to avoid freeing cluster bus links to
myself. Also add an assert in `freeClusterLink()`.
Testing:
```
127.0.0.1:6379> debug clusterlink kill all c0404ee68574c6aa1048aaebfe90283afe51d2fc
(error) ERR Cannot free cluster link(s) to myself
```
Signed-off-by: Pierre Turin <pieturin@amazon.com>
I think the log of pfail status changes will be very useful.
The other parts were scanned and found that it can be modified.
Changes:
1. Changing pfail status releated logs from VERBOSE to NOTICE.
2. Changing configEpoch collision log from VERBOSE(warning) to NOTICE.
3. Changing some logs from DEBUG to NOTICE.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
In markNodeAsFailingIfNeeded we will count needed_quorum and failures,
needed_quorum is the half the cluster->size and plus one, and
cluster-size
is the size of primary node which contain slots, but when counting
failures, we dit not check if primary has slots.
Only the primary has slots that has the rights to vote, adding a new
clusterNodeIsVotingPrimary to formalize this concept.
Release notes:
bugfix where nodes not in the quorum group might spuriously mark nodes
as failed
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
When there is a link failure while an ongoing MEET request is sent the
sending node stops sending anymore MEET and starts sending PINGs. Since
every node responds to PINGs from unknown nodes with a PONG, the
receiving node never adds the sending node. But the sending node adds
the receiving node when it sees a PONG. This can lead to asymmetry in
cluster membership. This changes makes the sender keep sending MEET
until it sees a PONG, avoiding the asymmetry.
---------
Signed-off-by: Sankar <1890648+srgsanky@users.noreply.github.com>
Safe iterators must call resetIterators when they are done being used.
Fix one issue where a safe iterator was not correctly calling reset
during cluster slot caching and fixed a second issue where reset
iterator was being called twice. For the double release case,
kvstoreIteratorNextDict is responsible for patching up the iterator, but
we were calling it a second time in kvstoreIteratorNext.
In addition, I added some documentation around initializing iterators,
added an assert to prevent double initialization, and remove a function
from the public interface which isn't needed and might lead to incorrect
usage of the safe iterators.
Bumping srgsanky for finding it here:
c4782066e7 (r142867004).
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
More rebranding of
* Log messages (#252)
* The DENIED error reply
* Internal function names and comments, mainly Lua API
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Remove the unused value duplicate API from dict. It's unused in the codebase and introduces unnecessary overhead.
---------
Signed-off-by: Eran Liberty <eran.liberty@gmail.com>
There are currently three block types: BLOCKED_WAIT, BLOCKED_WAITAOF,
and BLOCKED_WAIT_PREREPL, used to block clients executing `WAIT`,
`WAITAOF`, and `CLUSTER SETSLOT`, respectively. They share the same
workflow: the client is blocked until replication to the expected number
of replicas completes. However, they provide different responses
depending on the commands involved. Using distinct block types leads to
code duplication and reduced readability. This PR consolidates the three
types into a single WAIT type, differentiating them using the pending
command to ensure the appropriate response is returned.
Fix#427
---------
Signed-off-by: Ping Xie <pingxie@google.com>
In #53, we will cache the CLUSTER SLOTS response to improve the
throughput and reduct the latency.
In the code snippet below, the second cluster slots will use the
old hostname:
```
config set cluster-preferred-endpoint-type hostname
config set cluster-announce-hostname old-hostname.com
multi
cluster slots
config set cluster-announce-hostname new-hostname.com
cluster slots
exec
```
When updating the hostname, in updateAnnouncedHostname, we will set
CLUSTER_TODO_SAVE_CONFIG and we will do a clearCachedClusterSlotsResponse
in clusterSaveConfigOrDie, so harmless in most cases.
Move the clearCachedClusterSlotsResponse call to clusterDoBeforeSleep
instead of scheduling it to be called in clusterSaveConfigOrDie.
Signed-off-by: Binbin <binloveplay1314@qq.com>
This aligns the behaviour with established Valkey commands with a
TIMEOUT argument, such as BLPOP.
Fix#422
Signed-off-by: Ping Xie <pingxie@google.com>
I have validated that these settings closely match the existing coding
style with one major exception on `BreakBeforeBraces`, which will be
`Attach` going forward. The mixed `BreakBeforeBraces` styles in the
current codebase are hard to imitate and also very odd IMHO - see below
```
if (a == 1) { /*Attach */
}
```
```
if (a == 1 ||
b == 2)
{ /* Why? */
}
```
Please do NOT merge just yet. Will add the github action next once the
style is reviewed/approved.
---------
Signed-off-by: Ping Xie <pingxie@google.com>
This commit adds a logic to cache `CLUSTER SLOTS` response for reduced
latency and also updates the cache when a change in the cluster is
detected.
Historically, `CLUSTER SLOTS` command was deprecated, however all the
server clients have been using `CLUSTER SLOTS` and have not migrated to
`CLUSTER SHARDS`. In future this logic can be added to any other
commands to improve the performance of the engine.
---------
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Cleaned up the minor cluster refactoring notes that were intended to be
follow ups that never happened. Basically:
1. Minor style nitpicks
2. Generalized clusterNodeIsMyself so that it wasn't implementation
dependent.
3. Removed getMyClusterId, and just make it an explicit call to myself's
name, which seems more straightforward and removes unnecessary
abstraction.
4. Remove clusterNodeGetSlaveof infavor of clusterNodeGetMaster. We
already do a check if it's a replica, and if it wasn't working it would
have been crashing.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Renamed redis to valkey/server in aof.c serverlogs.
The AOF rewrite child process title is set to "redis-aof-rewrite" if
Valkey was started from a redis-server symlink, otherwise to
"valkey-aof-rewrite".
This is a breaking changes since logs are changed.
Part of #207.
---------
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
This patch try to do following things:
1. Rename `redis_*` and `REDIS_*` macros defined in config.h to
`valkey_*`, `VALKEY_*` and update associated used files. (`redis_fstat`,
`redis_fsync`, `REDIS_THREAD_STACK_SIZE`, etc.)
2. Remove the leading double underscore for guard macro in config.h.
---------
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
Log messages containing "Redis" in some files are changed.
Add macro SERVER_TITLE defined to "Valkey" (uppercase V) is introduced
and used in log messages, so at least it will be easy to patch this
definition to get Redis or any other name in the logs instead of Valkey.
Change "Redis" in some log messages to use %s and SERVER_TITLE.
This is a partial implementation of
https://github.com/valkey-io/valkey/issues/207
---------
Signed-off-by: 0del <bany.y0599@gmail.com>
This includes comments used for module API documentation.
* Strategy for replacement: Regex search: `(//|/\*| \*|#).* ("|\()?(r|R)edis( |\.
|'|\n|,|-|\)|")(?!nor the names of its contributors)(?!Ltd.)(?!Labs)(?!Contributors.)`
* Don't edit copyright comments
* Replace "Redis version X.X" -> "Redis OSS version X.X" to distinguish
from newly licensed repository
* Replace "Redis Object" -> "Object"
* Exclude markdown for now
* Don't edit Lua scripting comments referring to redis.X API
* Replace "Redis Protocol" -> "RESP"
* Replace redis-benchmark, -cli, -server, -check-aof/rdb with "valkey-"
prefix
* Most other places, I use best judgement to either remove "Redis", or
replace with "the server" or "server"
Fixes#148
---------
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Ref: https://github.com/redis/redis/pull/12760
### Description
#### Fixes compatibilty of PlaceholderKV cluster (7.2 - extensions
enabled by default) with older Redis cluster (< 7.0 - extensions not
handled) .
With some of the extensions enabled by default in 7.2 version, new nodes
running 7.2 and above start sending out larger clusterbus message
payload including the ping extensions. This caused an incompatibility
with node running engine versions < 7.0. Old nodes (< 7.0) would receive
the payload from new nodes (> 7.2) would observe a payload length
(totlen) > (estlen) and would perform an early exit and won't process
the message.
This fix introduces a flag `extensions_supported` on the clusterMsg
indicating the sender node can handle extensions parsing. Once, a
receiver nodes receives a message with this flag set to 1, it would
update clusterNode new field extensions_supported and start sending out
extensions if it has any.
This PR also introduces a DEBUG sub command to enable/disable cluster
message extensions `process-clustermsg-extensions` feature.
Note: A successful `PING`/`PONG` is required as a sender for a given
node to be marked as `extensions_supported` and then extensions message
will be sent to it. This could cause a slight delay in receiving the
extensions message(s).
### Testing
TCL test verifying the cluster state is healthy irrespective of
enabling/disabling cluster message extensions feature.
---------
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
This commit updates the processing of PONG gossip messages in the
cluster. When a node (B) becomes a replica due to a failover, its PONG
messages include its new primary node's (A) information and B's
configuration epoch is aligned with A's. This allows observer nodes to
identify changes in primary-ship, addressing issues of intermediate
states and enhancing cluster state consistency during topology changes.
Fix#13018
The receiver does not update any of its cluster state based on gossip
about itself. This commit explicitly avoids sending or processing gossip
about the receiver.
Currently cluster bus gossips include 10% of nodes in the cluster with a
minimum of 3 nodes. For up to 30 node clusters, this commit makes sure
that 1/3 of the gossip (1 out of 3 gossips) is never discarded. This
should help with relatively faster convergence of cluster state in
general.
After fix for #13033, address sanitizer reports this heap-use-after-free
error. When the pubsubshard_channels dict becomes empty, we will delete
the dict, and the dictReleaseIterator will call dictResetIterator, it
will use the dict so we will trigger the error.
This PR introduced a new struct kvstoreDictIterator to wrap
dictIterator.
Replace the original dict iterator with the new kvstore dict iterator.
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: guybe7 <guy.benoish@redislabs.com>