futriix

Author	SHA1	Message	Date
Binbin	211b250aad	Do election in order based on failed primary rank to avoid voting conflicts (#1018 ) When multiple primary nodes fail simultaneously, the cluster can not recover within the default effective time (data_age limit). The main reason is that there is no ranking among the replica nodes during the vote, which causes too many epoch conflicts. Therefore, we introduced ranking based on the failed primary shard-id. Introduced a new failed_primary_rank var, which is the rank of this instance (myself) among all failed primaries. This var will be used in failover and we will send the failover election packets in order based on the rank, which can effectively avoid the voting conflicts. If a single primary is down, the behavior is the same as before. If multiple primaries are down, their replica election initiation time will be delayed by 500ms according to the ranking. Signed-off-by: Binbin <binloveplay1314@qq.com>	2025-01-11 10:43:18 +08:00
Pierre	e4179f1f3b	Only (re-)send MEET packet once every handshake timeout period (#1441 ) Add `meet_sent` field in `clusterNode` indicating the last time we sent a MEET packet. Use this field to only (re-)send a MEET packet once every handshake timeout period when detecting a node without an inbound link. When receiving multiple MEET packets on the same link while the node is in handshake state, instead of dropping the packet, we now simply prevent the creation of a new node. This way we still process the MEET packet's gossip and reply with a PONG as any other packets. Improve some logging messages to include `human_nodename`. Add `nodeExceedsHandshakeTimeout()` function. This is a follow-up to this previous PR: https://github.com/valkey-io/valkey/pull/1307 And a partial fix to the crash described in: https://github.com/valkey-io/valkey/pull/1436 --------- Signed-off-by: Pierre Turin <pieturin@amazon.com>	2024-12-30 15:56:39 -05:00
Binbin	ad24220681	Automatic failover vote is not limited by two times the node timeout (#1356 ) This is a follow-up to #1305; we have now decided to apply the same change to automatic failover as well, that is, move forward with removing it for both automatic and manual failovers. Quote from Ping during the review: Note that we already debounce transient primary failures with node timeout, ensuring failover is only triggered after sustained outages. Election timing is naturally staggered by replica spacing, making the likelihood of simultaneous elections from replicas of the same shard very low. The one-vote-per-epoch rule further throttles retries and ensures orderly elections. On top of that, quorum-based primary failure confirmation, cluster-state convergence, and slot ownership validation are all built into the process. Quote from Madelyn during the review: It against the specific primary. It's to prevent double failovers. If a primary just took over we don't want someone else to try to take over and give the new primary some amount of time to take over. I have not seen this issue though, it might have been over optimizing? The double failure mode, where a node fails and then another node fails within the nodetimeout also doesn't seem that common either though. So the conclusion is that we all agreed to remove it completely, it will make the code a lot simpler. And if there are other specific edge cases we are missing, we will fix them in another way. See discussion #1305 for more information. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-12-15 12:09:53 +08:00
Pierre	5f7fe9ef21	Send MEET packet to node if there is no inbound link to fix inconsistency when handshake timedout (#1307 ) In some cases, when meeting a new node, if the handshake times out, we can end up with an inconsistent view of the cluster where the new node knows about all the nodes in the cluster, but the cluster does not know about this new node (or vice versa). To detect this inconsistency, we now check if a node has an outbound link but no inbound link, in this case it probably means this node does not know us. In this case we (re-)send a MEET packet to this node to do a new handshake with it. If we receive a MEET packet from a known node, we disconnect the outbound link to force a reconnect and sending of a PING packet so that the other node recognizes the link as belonging to us. This prevents cases where a node could send MEET packets in a loop because it thinks the other node does not have an inbound link. This fixes the bug described in #1251. --------- Signed-off-by: Pierre Turin <pieturin@amazon.com>	2024-12-11 17:26:06 -08:00
Binbin	fbbfe5d3d3	Print logs when the cluster state changes to fail or the fail reason changes (#1188 ) This log allows us to easily distinguish between full coverage and minority partition when the cluster fails. Sometimes it is not easy to see the minority partition in healthy shards (both primary and replicas). We decided not to add a cluster_fail_reason field to cluster info, given that there are only two reasons and both are well-known; if we end up adding more down the road, we can add it in the future. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-12-02 15:55:24 +08:00
Binbin	b9d224097a	Broadcast a PONG to all nodes in the cluster when the role changes (#1295 ) When a node role changes, we should broadcast the change to notify other nodes. For example, one primary and one replica, after a failover, the replica became a new primary, the primary became a new replica. And then we trigger a second cluster failover for the new replica, the new replica will send a MFSTART to its primary, ie, the new primary. But the new primary may reject the MFSTART due to this logic: ``` } else if (type == CLUSTERMSG_TYPE_MFSTART) { if (!sender || sender->replicaof != myself) return 1; ``` In the new primary's view, sender is still a primary, and sender->replicaof is NULL, so we will return. Then the manual failover times out. Another possibility is that other primaries refuse to vote after receiving the FAILOVER_AUTH_REQUEST, since in their view sender is still a primary, and then the manual failover times out. ``` void clusterSendFailoverAuthIfNeeded(clusterNode *node, clusterMsg *request) { ... if (clusterNodeIsPrimary(node)) { serverLog(LL_WARNING, "Failover auth denied to... ``` The reason is that, currently, we only update the node->replicaof information when we receive a PING/PONG from the sender. For details, see clusterProcessPacket. Therefore, in some scenarios, such as clusters with many nodes and a large cluster-ping-interval (that is, cluster-node-timeout), the role change of the node will be very delayed. Added a DEBUG DISABLE-CLUSTER-RANDOM-PING command, send cluster ping to a random node every second (see clusterCron). Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-11-23 00:22:04 +08:00
Binbin	ee386c92ff	Manual failover vote is not limited by two times the node timeout (#1305 ) This limit should not restrict manual failover, otherwise in some scenarios, manual failover will time out. For example, if some FAILOVER_AUTH_REQUESTs or some FAILOVER_AUTH_ACKs are lost during a manual failover, it cannot vote in the second manual failover. Or in a mixed scenario of plain failover and manual failover, it cannot vote for the subsequent manual failover. The problem with the manual failover retry is that the mf will pause the client 5s in the primary side. So retrying every time a manual failover times out is a bad move. --------- Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>	2024-11-19 11:17:20 -05:00
Mikhail Koviazin	af811748e7	clang-format: set ColumnLimit to 0 and reformat (#1045 ) This commit hopefully improves the formatting of the codebase by setting ColumnLimit to 0 and hence stopping clang-format from trying to put as much stuff in one line as possible. This change enabled us to remove most of `clang-format off` directives and fixed a bunch of lines that looked like this: ```c #define KEY \ VALUE /* comment */ ``` Additionally, one pair of `clang-format off` / `clang-format on` had `clang-format off` as the second comment and hence didn't enable the formatting for the rest of the file. This commit addresses this issue as well. Please tell me if anything in the changes seem off. If everything is fine, I will add this commit to `.git-blame-ignore-revs` later. --------- Signed-off-by: Mikhail Koviazin <mikhail.koviazin@aiven.io>	2024-09-25 01:22:54 +02:00
Binbin	380f700816	Improve cluster cant failover log conditions (#780 ) This PR adjusts the logging conditions of clusterLogCantFailover in these two ways. 1. For the same cant_failover_reason, we will print the log once in CLUSTER_CANT_FAILOVER_RELOG_PERIOD, but its value is 10s, which is a bit long, shorten it to 1s, so we can better track its state. We get to see the system making progress by watching the message. Using 1s also covers pretty much all cases as i don't see a reason for using a <1s node timeout, test or prod. 2. We will not print logs before the nolog_fail_time, its value is cluster-node-timeout+5000. This may cause us to lose some logs, for example, if cluster-node-timeout is small, auth_timeout will be 2000, and auth_retry_time will be 4000. In this case, we will lose all the reasons during the election if the failover times out. So remove the nolog_fail_time logic, since we still do have the CLUSTER_CANT_FAILOVER_RELOG_PERIOD logic, we won't print too many logs. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-08-06 21:14:18 +08:00
Kyle Kim (kimkyle@)	e1d936b339	Add network-bytes-in and network-bytes-out metric support under CLUSTER SLOT-STATS command (#20 ) (#720 ) Adds two new metrics for per-slot statistics, network-bytes-in and network-bytes-out. The network bytes are inclusive of replication bytes but exclude other types of network traffic such as clusterbus traffic. #### network-bytes-in The metric tracks network ingress bytes under per-slot context, by reverse calculation of `c->argv_len_sum` and `c->argc`, stored under a newly introduced field `c->net_input_bytes_curr_cmd`. #### network-bytes-out The metric tracks network egress bytes under per-slot context, by hooking onto COB buffer mutations. #### sample response Both metrics are reported under the `CLUSTER SLOT-STATS` command. ``` 127.0.0.1:6379> cluster slot-stats slotsrange 0 0 1) 1) (integer) 0 2) 1) "key-count" 2) (integer) 0 3) "cpu-usec" 4) (integer) 0 5) "network-bytes-in" 6) (integer) 0 7) "network-bytes-out" 8) (integer) 0 ``` --------- Signed-off-by: Kyle Kim <kimkyle@amazon.com> Signed-off-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>	2024-07-26 16:06:16 -07:00
Roshan Khatri	e745e9c240	Adds Light-weight cluster bus header for pubsub message. (#654 ) Adds light-weight cluster bus header for pubsub message. Closes #557. This also supports sending to and receiving non-light messages from older versions of the engine. The light-weight cluster bus message supports multiple pubsub messages (payloads) for one pubsub channel. Receiving messages with multiple payloads is supported but we're not yet sending such multi-payload messages to other nodes. --------- Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>	2024-07-26 10:49:18 -07:00
Kyle Kim (kimkyle@)	5000c050b5	Add cpu-usec metric support under CLUSTER SLOT-STATS command (#20 ). (#712 ) The metric tracks cpu time in micro-seconds, sharing the same value as `INFO COMMANDSTATS`, aggregated under per-slot context. --------- Signed-off-by: Kyle Kim <kimkyle@amazon.com> Signed-off-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>	2024-07-22 18:03:28 -07:00
Viktor Söderqvist	a323dce890	Dual stack and client-specific IPs in cluster (#736 ) New configs: * `cluster-announce-client-ipv4` * `cluster-announce-client-ipv6` New module API function: * `ValkeyModule_GetClusterNodeInfoForClient`, takes a client id and is otherwise just like its non-ForClient cousin. If configured, one of these IP addresses is reported to each client in CLUSTER SLOTS, CLUSTER SHARDS, CLUSTER NODES and redirects, replacing the IP (`cluster-announce-ip` or the auto-detected IP) of each node. Which one is reported to the client depends on whether the client is connected over IPv4 or IPv6. Benefits: * This allows clients using IPv4 to get the IPv4 addresses of all cluster nodes and IPv6 clients to get the IPv6 addresses. * This allows the IPs visible to clients to be different to the IPs used between the cluster nodes due to NAT'ing. The information is propagated in the cluster bus using new Ping extensions. (Old nodes without this feature ignore unknown Ping extensions.) This adds another dimension to CLUSTER SLOTS reply. It now depends on the client's use of TLS, the IP address family and RESP version. Refactoring: The cached connection type definition is moved from connection.h (it actually has nothing to do with the connection abstraction) to server.h and is changed to a bitmap, with one bit for each of TLS, IPv6 and RESP3. Fixes #337 --------- Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>	2024-07-10 13:53:52 +02:00
Binbin	2d6791bb11	Use clusterNodeIsVotingPrimary function to check the right (#735 ) Minor cleanups. --------- Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-07-03 20:42:25 +02:00
Harkrishn Patro	76fc041685	represent cluster node flags with bitwise shift value (#642 ) While debugging a cluster bus issue, found the cluster node flags were represented in numbers. I generally find it easy when these are represented as bitwise shift operation. It improves readability a bit. Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>	2024-06-14 00:58:03 +02:00
Ping Xie	54c9747935	Remove `master` and `slave` from source code (#591 ) External facing interfaces are not affected. --------- Signed-off-by: Ping Xie <pingxie@google.com>	2024-06-07 14:21:33 -07:00
Ping Xie	c41dd77a3e	Add clang-format configs (#323 ) I have validated that these settings closely match the existing coding style with one major exception on `BreakBeforeBraces`, which will be `Attach` going forward. The mixed `BreakBeforeBraces` styles in the current codebase are hard to imitate and also very odd IMHO - see below ``` if (a == 1) { /* Attach */ } ``` ``` if (a == 1 || b == 2) { /* Why? */ } ``` Please do NOT merge just yet. Will add the github action next once the style is reviewed/approved. --------- Signed-off-by: Ping Xie <pingxie@google.com>	2024-05-22 23:24:12 -07:00
Roshan Khatri	c4782066e7	Cache CLUSTER SLOTS response for improving throughput and reduced latency. (#53 ) This commit adds logic to cache the `CLUSTER SLOTS` response for reduced latency and also updates the cache when a change in the cluster is detected. Historically, the `CLUSTER SLOTS` command was deprecated; however, all the server clients have been using `CLUSTER SLOTS` and have not migrated to `CLUSTER SHARDS`. In the future this logic can be added to other commands to improve the performance of the engine. --------- Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>	2024-05-22 14:21:41 -07:00
Ping Xie	6e7af9471c	Slot migration improvement (#445 )	2024-05-06 21:40:28 -07:00
bentotten	19c4c647e0	Fix incorrect comment for count in clusterMsg (#381 ) The "count" field of clusterMsg is only used for gossip. Signed-off-by: Ben Totten <btotten@amazon.com> Co-authored-by: Ben Totten <btotten@amazon.com>	2024-04-25 16:33:44 -07:00
bentotten	6975242529	Update comment in cluster_legacy.h (#277 ) Update comment suggesting clusterMsgPingExtTypes to clusterMsgPingtypes as clusterMsgPingExtTypes does not exist Signed-off-by: Ben Totten <btotten@amazon.com>	2024-04-11 13:18:20 -07:00
Jacob Murphy	df5db0627f	Remove trademarked language in code comments (#223 ) This includes comments used for module API documentation. * Strategy for replacement: Regex search: `(//\|/\\| \\|#).* ("\|\()?(r\|R)edis( \|\. \|'\|\n\|,\|-\|\)\|")(?!nor the names of its contributors)(?!Ltd.)(?!Labs)(?!Contributors.)` * Don't edit copyright comments * Replace "Redis version X.X" -> "Redis OSS version X.X" to distinguish from newly licensed repository * Replace "Redis Object" -> "Object" * Exclude markdown for now * Don't edit Lua scripting comments referring to redis.X API * Replace "Redis Protocol" -> "RESP" * Replace redis-benchmark, -cli, -server, -check-aof/rdb with "valkey-" prefix * Most other places, I use best judgement to either remove "Redis", or replace with "the server" or "server" Fixes #148 --------- Signed-off-by: Jacob Murphy <jkmurphy@google.com> Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>	2024-04-09 10:24:03 +02:00
Harkrishn Patro	ebfb440629	Pass extensions to node if extension processing is handled by it (#52 ) Ref: https://github.com/redis/redis/pull/12760 ### Description #### Fixes compatibility of PlaceholderKV cluster (7.2 - extensions enabled by default) with older Redis cluster (< 7.0 - extensions not handled). With some of the extensions enabled by default in 7.2 version, new nodes running 7.2 and above start sending out larger clusterbus message payload including the ping extensions. This caused an incompatibility with nodes running engine versions < 7.0. Old nodes (< 7.0) receiving the payload from new nodes (> 7.2) would observe a payload length (totlen) > (estlen), perform an early exit, and not process the message. This fix introduces a flag `extensions_supported` on the clusterMsg indicating the sender node can handle extensions parsing. Once a receiver node receives a message with this flag set to 1, it would update clusterNode new field extensions_supported and start sending out extensions if it has any. This PR also introduces a DEBUG sub command to enable/disable cluster message extensions `process-clustermsg-extensions` feature. Note: A successful `PING`/`PONG` is required as a sender for a given node to be marked as `extensions_supported` and then extensions message will be sent to it. This could cause a slight delay in receiving the extensions message(s). ### Testing TCL test verifying the cluster state is healthy irrespective of enabling/disabling cluster message extensions feature. --------- Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>	2024-04-08 09:01:30 -07:00
Chen Tianjie	8527959598	Replace slots_to_channels radix tree with slot specific dictionaries for shard channels. (#12804 ) We have achieved replacing `slots_to_keys` radix tree with key->slot linked list (#9356), and then replacing the list with slot specific dictionaries for keys (#11695). Shard channels behave just like keys in many ways, and we also need a slots->channels mapping. Currently this is still done by using a radix tree. So we should split `server.pubsubshard_channels` into 16384 dicts and drop the radix tree, just like what we did to DBs. Some benefits (basically the benefits of what we've done to DBs): 1. Optimize counting channels in a slot. This is currently used only in removing channels in a slot. But this is potentially more useful: sometimes we need to know how many channels there are in a specific slot when doing slot migration. Counting is now implemented by traversing the radix tree, and with this PR it will be as simple as calling `dictSize`, from O(n) to O(1). 2. The radix tree in the cluster has been removed. The shard channel names no longer require additional storage, which can save memory. 3. Potentially useful in slot migration, as shard channels are logically split by slots, thus making it easier to migrate, remove or add as a whole. 4. Avoid rehashing a big dict when there is a large number of channels. Drawbacks: 1. Takes more memory than using radix tree when there are relatively few shard channels. What this PR does: 1. in cluster mode, split `server.pubsubshard_channels` into 16384 dicts, in standalone mode, still use only one dict. 2. drop the `slots_to_channels` radix tree. 3. to save memory (to solve the drawback above), all 16384 dicts are created lazily, which means only when a channel is about to be inserted to the dict will the dict be initialized, and when all channels are deleted, the dict would delete itself. 5. use `server.shard_channel_count` to keep track of the number of all shard channels. --------- Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>	2023-12-27 17:40:45 +08:00
Josh Hershberg	4944eda696	Cluster refactor: Move more stuff from cluster.h to cluster_legacy.h More declarations can be moved into cluster_legacy.h as they are not required for the cluster api. The code was simply moved, not changed in any way. Signed-off-by: Josh Hershberg <yehoshua@redis.com>	2023-11-22 05:54:03 +02:00
Josh Hershberg	d9a0478599	Cluster refactor: Make clusterNode private Move clusterNode into cluster_legacy.h. In order to achieve this some accessor methods were added and also a refactor of how debugCommand handles cluster related subcommands. Signed-off-by: Josh Hershberg <yehoshua@redis.com>	2023-11-22 05:50:46 +02:00
Josh Hershberg	98a6c44b75	Cluster refactor: Make clusterState private Move clusterState into cluster_legacy.h. In order to achieve this some "accessor" methods needed to be added to the cluster API and some other minor refactors. Signed-off-by: Josh Hershberg <yehoshua@redis.com>	2023-11-22 05:44:10 +02:00
Josh Hershberg	5292adb985	Cluster refactor: Move trivial stuff into cluster_legacy.h Move some declarations from cluster.h to cluster_legacy.h. The items moved are specific to the legacy clustering implementation and DO NOT require any other refactoring other than moving them from one file to another. Signed-off-by: Josh Hershberg <yehoshua@redis.com>	2023-11-21 12:49:14 +02:00
Josh Hershberg	86915775f1	Cluster refactor: rename cluster.c -> cluster_legacy.c Signed-off-by: Josh Hershberg <yehoshua@redis.com>	2023-11-21 12:49:14 +02:00

29 Commits