futriix

Author	SHA1	Message	Date
Binbin	b803f7aeff	Cleaned up getSlotOrReply is return -1 instead of C_ERR (#1211 ) Minor cleanup since getSlotOrReply return -1 on error, not return C_ERR. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-10-23 17:11:42 +08:00
Binbin	5d70ccd70e	Make replica CLUSTER RESET flush async based on lazyfree-lazy-user-flush (#1190 ) Currently, if the replica has a lot of data, CLUSTER RESET will block for a while and report the slowlog, and it seems that there is no harm in making it async so external components can be easier when monitoring it. Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Ping Xie <pingxie@outlook.com>	2024-10-23 10:22:25 +08:00
Madelyn Olson	e617bf2ddc	Removing incorrect comment about a warning (#1132 ) There is a lot of bad legacy usage of `default:` with enums, which is an anti-pattern. If you omit the default, the compiler will tell you if a new enum value was added and that it is missing from a switch statement. Someone mentioned on another PR they used `default:` because of this warning, so just removing it, but might create an issue to do a wider cleanup. Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>	2024-10-07 11:56:15 -07:00
Binbin	6e0216471d	Trigger the election as soon as possible when doing a forced manual failover (#1067 ) In CLUSTER FAILOVER FORCE case, we will set mf_can_start to 1 and wait for a cron to trigger the election. We can also set a CLUSTER_TODO_HANDLE_MANUALFAILOVER flag so that we can start the election as soon as possible instead of waiting for the cron, so that we won't have a 100ms delay (clusterCron). Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-09-25 12:08:48 +08:00
Mikhail Koviazin	af811748e7	clang-format: set ColumnLimit to 0 and reformat (#1045 ) This commit hopefully improves the formatting of the codebase by setting ColumnLimit to 0 and hence stopping clang-format from trying to put as much stuff in one line as possible. This change enabled us to remove most of `clang-format off` directives and fixed a bunch of lines that looked like this: ```c #define KEY \ VALUE /* comment */ ``` Additionally, one pair of `clang-format off` / `clang-format on` had `clang-format off` as the second comment and hence didn't enable the formatting for the rest of the file. This commit addresses this issue as well. Please tell me if anything in the changes seem off. If everything is fine, I will add this commit to `.git-blame-ignore-revs` later. --------- Signed-off-by: Mikhail Koviazin <mikhail.koviazin@aiven.io>	2024-09-25 01:22:54 +02:00
Binbin	56fba564b6	Print an empty primary log when primary lost its last slot (#1064 ) The one in CLUSTER SETSLOT help us keep track of state better, of course it also can make the test case happy. The one in gossip process fixes a problem that a replica can print a log saying it is an empty primary. Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Ping Xie <pingxie@outlook.com>	2024-09-23 13:14:09 +08:00
Binbin	7fab15795f	Add log about old primary after myself failover (#1058 ) Sometimes it is hard to see the old primary during a multi primaries failover, adding this log can help us find the old primary node. Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Ping Xie <pingxie@outlook.com>	2024-09-20 14:15:19 +08:00
Binbin	dcc7678fc4	Fix replica unable trigger migration when it received CLUSTER SETSLOT in advance (#981 ) Fix timing issue in evaluating `cluster-allow-replica-migration` for replicas There is a timing bug where the primary and replica have different `cluster-allow-replica-migration` settings. In issue #970, we found that if the replica receives `CLUSTER SETSLOT` before the gossip update, it remains in the original shard. This happens because we only process the `cluster-allow-replica-migration` flag for primaries during `CLUSTER SETSLOT`. This commit fixes the issue by also evaluating this flag for replicas in the `CLUSTER SETSLOT` path, ensuring correct replica migration behavior. Closes #970 --------- Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Ping Xie <pingxie@outlook.com>	2024-09-13 15:32:20 -07:00
Binbin	38457b7320	Trigger a save of the cluster configuration file before shutting down (#822 ) The cluster configuration file is the metadata "database" for the cluster. It is best to trigger a save when shutdown the server, to avoid inconsistent content that is not refreshed. We save the nodes.conf whenever something that affects the nodes.conf has changed. But we are saving nodes.conf in clusterBeforeSleep, and some events may save it without a fsync, there is a time gap. And shutdown has its own save seems good to me, it doesn't need to care about the others. At the same time, a comment is added in unlock nodes.conf to explain why we actively unlock when shutdown. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-09-12 15:43:12 +08:00
bentotten	affbea5dc1	For MEETs, save the extensions support flag immediately during MEET processing (#778 ) For backwards compatibility reasons, a node will wait until it receives a cluster message with the extensions flag before sending its own extensions. This leads to a delay in shard ID propagation that can corrupt nodes.conf with inaccurate shard IDs if a node is restarted before this can stabilize. This fixes much of that delay by immediately triggering the extensions-supported flag during the MEET processing and attaching the node to the link, allowing the PONG reply to contain OSS extensions. Partially fixes #774 --------- Signed-off-by: Ben Totten <btotten@amazon.com> Co-authored-by: Ben Totten <btotten@amazon.com>	2024-09-09 20:46:02 -07:00
Binbin	c642cf0134	Add client info to SHUTDOWN / CLUSTER FAILOVER logs (#875 ) Print the full client info by using catClientInfoString, the info is useful when we want to identify the source of request. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-09-08 16:26:56 +08:00
Binbin	9b51949abe	Fix missing replication link re-connection when primary's IP/port is updated in `clusterProcessGossipSection` (#965 ) `clusterProcessGossipSection` currently doesn't trigger a check and call `replicationSetPrimary` when `myself`'s primary node’s IP/port is updated. This fix ensures that after every node address update, `replicationSetPrimary` is called if the updated node is `myself`'s primary. This prevents missed updates and ensures that replicas reconnect properly to maintain their replication link with the primary.	2024-09-05 22:19:50 -07:00
Binbin	ecbfb6a7ec	Fix reconfiguring sub-replica causing data loss when myself change shard_id (#944 ) When reconfiguring sub-replica, there may be a case where the sub-replica uses the old offset, wins the election, and causes data loss if the old primary went down. In this case, sender is myself's primary, when executing updateShardId, not only the sender's shard_id is updated, but also the shard_id of myself is updated, causing the subsequent areInSameShard check, that is, the full_sync_required check to fail. As part of the recent fix of #885, the sub-replica needs to decide whether a full sync is required or not when switching shards. This shard membership check is supposed to be done against sub-replica's current shard_id, which however was lost in this code path. This then leads to sub-replica joining the other shard with a completely different and incorrect replication history. This is the only place where replicaof state can be updated on this path so the most natural fix would be to pull the chain replication reduction logic into this code block and before the updateShardId call. This one follows #885 and closes #942. Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Ping Xie <pingxie@outlook.com>	2024-08-29 22:39:53 +08:00
Binbin	c7d1daea05	Add epoch information to failover auth denied logs (#816 ) When failover deny to vote, sometimes due to network or some blocking operations, the time of FAILOVER_AUTH_REQUEST packet arrival is very uncertain. Since there is no epoch information in these logs, it is hard to associate the log with other logs. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-08-24 18:03:24 +08:00
Binbin	5d97f5133c	Fix CLUSTER SETSLOT block and unblock error when all replicas are down (#879 ) In CLUSTER SETSLOT propagation logic, if the replicas are down, the client will get blocked during command processing and then unblocked with `NOREPLICAS Not enough good replicas to write`. The reason is that all replicas are down (or some are down), but myself->num_replicas is including all replicas, so the client will get blocked and always time out. We should only wait for those online replicas, otherwise the waiting propagation will always time out since there are not enough replicas. The admin can easily check if there are replicas that are down for an extended period of time. If they decide to move forward anyways, we should not block it. If a replica failed right before the replication and was not included in the replication, it would also be unlikely to win the election. Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Ping Xie <pingxie@google.com>	2024-08-23 16:21:53 +08:00
Yunxiao Du	0a11c4a140	Delete redundant declaration clusterNodeCoversSlot and countKeysInSlot (#930 ) Delete redundant declaration, clusterNodeCoversSlot and countKeysInSlot has been declared in cluster.h Signed-off-by: Yunxiao Du <me@jackdu.cn>	2024-08-23 12:17:27 +08:00
Binbin	08aaeea4b7	Avoid to re-establish replication if node is already myself primary in CLUSTER REPLICATE (#884 ) If n is already myself primary, there is no need to re-establish the replication connection. In the past we allow a replica node to reconnect with its primary via this CLUSTER REPLICATE command, it will use psync. But since #885, we will assume that a full sync is needed in this case, so if we don't do this, the replica will always use full sync. Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Ping Xie <pingxie@google.com>	2024-08-22 11:00:18 +08:00
Binbin	e1b3629186	Fix data loss when replica do a failover with a old history repl offset (#885 ) Our current replica can initiate a failover without restriction when it detects that the primary node is offline. This is generally not a problem. However, consider the following scenarios: 1. In slot migration, a primary loses its last slot and then becomes a replica. When it is fully synchronized with the new primary, the new primary downs. 2. In CLUSTER REPLICATE command, a replica becomes a replica of another primary. When it is fully synchronized with the new primary, the new primary downs. In the above scenario, case 1 may cause the empty primary to be elected as the new primary, resulting in primary data loss. Case 2 may cause the non-empty replica to be elected as the new primary, resulting in data loss and confusion. The reason is that we have cached primary logic, which is used for psync. In the above scenario, when clusterSetPrimary is called, myself will cache server.primary in server.cached_primary for psync. In replicationGetReplicaOffset, we get server.cached_primary->reploff for offset, gossip it and rank it, which causes the replica to use the old historical offset to initiate failover, and it get a good rank, initiates election first, and then is elected as the new primary. The main problem here is that when the replica has not completed full sync, it may get the historical offset in replicationGetReplicaOffset. The fix is to clear cached_primary in these places where full sync is obviously needed, and let the replica use offset == 0 to participate in the election. In this way, this unhealthy replica has a worse rank and is not easy to be elected. Of course, it is possible that it will be elected with offset == 0. In the future, we may need to prohibit the replica with offset == 0 from having the right to initiate elections. Another point worth mentioning, in above cases: 1. In the ROLE command, the replica status will be handshake, and the offset will be -1. 2. Before this PR, in the CLUSTER SHARD command, the replica status will be online, and the offset will be the old cached value (which is wrong). 3. After this PR, in the CLUSTER SHARD, the replica status will be loading, and the offset will be 0. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-08-21 13:11:21 +08:00
Pieter Cailliau	4d284daefd	Copyright update to reflect IP transfer from salvatore to Redis (#740 ) Update references of copyright being assigned to Salvatore when it was transferred to Redis Ltd. as per https://github.com/valkey-io/valkey/issues/544. --------- Signed-off-by: Pieter Cailliau <pieter@redis.com>	2024-08-14 09:20:36 -07:00
Binbin	380f700816	Improve cluster cant failover log conditions (#780 ) This PR adjusts the logging conditions of clusterLogCantFailover in these two ways. 1. For the same cant_failover_reason, we will print the log once in CLUSTER_CANT_FAILOVER_RELOG_PERIOD, but its value is 10s, which is a bit long, shorten it to 1s, so we can better track its state. We get to see the system making progress by watching the message. Using 1s also covers pretty much all cases as i don't see a reason for using a <1s node timeout, test or prod. 2. We will not print logs before the nolog_fail_time, its value is cluster-node-timeout+5000. This may cause us to lose some logs, for example, if cluster-node-timeout is small, auth_timeout will be 2000, and auth_retry_time will be 4000. In this case, we will lose all the reasons during the election if the failover is timed out. So remove the nolog_fail_time logic, since we still do have the CLUSTER_CANT_FAILOVER_RELOG_PERIOD logic, we won't print too many logs. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-08-06 21:14:18 +08:00
Madelyn Olson	4b8de6b1be	Update replica version comparison to handle version 8 RC candidates (#851 ) Release candidates have a version that is lower than 8.0.0 to allow for 8.0.0 to have 0x080000 as a release number. However, we did an explicit check to make sure a version was 8.0 or greater to validate a replica supports a feature. Now we are using the highest patch version of latest minor to do the comparison to accommodate future versions. Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>	2024-07-31 10:01:48 -07:00
Madelyn Olson	b4d96caa78	Remove static to avoid compiler warning (#836 )	2024-07-28 13:08:09 -07:00
Roshan Khatri	e745e9c240	Adds Light-weight cluster bus header for pubsub message. (#654 ) Adds light-weight cluster bus header for pubsub message. Closes #557. This also supports sending to and receiving non-light messages from older versions of the engine. The light-weight cluster bus message supports multiple pubsub messages (payloads) for one pubsub channel. Receiving messages with multiple payloads is supported but we're not yet sending such multi-payload messages to other nodes. --------- Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>	2024-07-26 10:49:18 -07:00
Binbin	f00c8f6214	Modify clusterSaveConfig function call to use C_OK / C_ERR return value (#818 ) Minor cleanups. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-07-24 09:58:44 +08:00
Binbin	59aa00823c	Replicas with the same offset queue up for election (#762 ) In some cases, like read more than write scenario, the replication offset of the replicas are the same. When the primary fails, the replicas have the same rankings (rank == 0). They issue the election at the same time (although we have a random 500), the simultaneous elections may lead to the failure of the election due to quorum. In clusterGetReplicaRank, when we calculates the rank, if the offsets are the same, the one with the smaller node name will have a better rank to avoid this situation. --------- Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-07-22 23:43:16 -07:00
Kyle Kim (kimkyle@)	5000c050b5	Add cpu-usec metric support under CLUSTER SLOT-STATS command (#20 ). (#712 ) The metric tracks cpu time in micro-seconds, sharing the same value as `INFO COMMANDSTATS`, aggregated under per-slot context. --------- Signed-off-by: Kyle Kim <kimkyle@amazon.com> Signed-off-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>	2024-07-22 18:03:28 -07:00
Binbin	15a8290231	Optimize failover time when the new primary node is down again (#782 ) We will not reset failover_auth_time after setting it, this is used to check auth_timeout and auth_retry_time, but we should at least reset it after a successful failover. Let's assume the following scenario: 1. Two replicas initiate an election. 2. Replica 1 is elected as the primary node, and replica 2 does not have enough votes. 3. Replica 1 is down, ie the new primary node down again in a short time. 4. Replica 2 know that the new primary node is down and wants to initiate a failover, but because the failover_auth_time of the previous round has not been reset, it needs to wait for it to time out and then wait for the next retry time, which will take cluster-node-timeout * 4 times, this adds a lot of delay. There is another problem. Like we will set additional random time for failover_auth_time, such as random 500ms and replicas ranking 1s. If replica 2 receives PONG from the new primary node before sending the FAILOVER_AUTH_REQUEST, that is, before the failover_auth_time, it will change itself to a replica. If the new primary node goes down again at this time, replica 2 will use the previous failover_auth_time to initiate an election instead of going through the logic of random 500ms and replicas ranking 1s again, which may lead to unexpected consequences (for example, a low-ranking replica initiates an election and becomes the new primary node). That is, we need to reset failover_auth_time at the appropriate time. When the replica switches to a new primary, we reset it, because the existing failover_auth_time is already out of date in this case. --------- Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-07-19 15:27:49 -04:00
Harkrishn Patro	816accea76	Generate correct slot information in cluster shards command on primary failure (#790 ) Fix #784 Prior to the change, `CLUSTER SHARDS` command processing might pick a failed primary node which won't have the slot coverage information and the slots `output` in turn would be empty. This change finds an appropriate node which has the slot coverage information served by a given shard and correctly displays it as part of `CLUSTER SHARDS` output. Before: ``` 1) 1) "slots" 2) (empty array) 3) "nodes" 4) 1) 1) "id" 2) "2936f22a490095a0a851b7956b0a88f2b67a5d44" ... 9) "role" 10) "master" ... 13) "health" 14) "fail" ``` After: ``` 1) 1) "slots" 2) 1) 0 2) 5461 3) "nodes" 4) 1) 1) "id" 2) "2936f22a490095a0a851b7956b0a88f2b67a5d44" ... 9) "role" 10) "master" ... 13) "health" 14) "fail" ``` --------- Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>	2024-07-19 09:32:39 -07:00
Ping Xie	66d0f7d9a1	Ensure only primary sender drives slot ownership updates (#754 ) Fixes a regression introduced in PR #445, which allowed a message from a replica to update the slot ownership of its primary. The regression results in a `replicaof` cycle, causing server crashes due to the cycle detection assert. The fix restores the previous behavior where only primary senders can trigger `clusterUpdateSlotsConfigWith`. Additional changes: * Handling of primaries without slots is obsoleted by new handling of when a sender that was a replica announces that it is now a primary. * Replication loop detection code is unchanged but shifted downwards. * Some variables are renamed for better readability and some are introduced to avoid repeated memcmp() calls. Fixes #753. --------- Signed-off-by: Ping Xie <pingxie@google.com>	2024-07-16 13:05:49 -07:00
Brennan	34649bd034	Configurable cluster blacklist TTL (#738 ) Allows cluster admins to configure the blacklist TTL as needed to allow sufficient time for `CLUSTER FORGET` to be executed on every node in the cluster. Config name `cluster-blacklist-ttl`; unit seconds; default 60. --------- Signed-off-by: Brennan Cathcart <brennancathcart@gmail.com>	2024-07-13 20:38:25 +02:00
Viktor Söderqvist	a323dce890	Dual stack and client-specific IPs in cluster (#736 ) New configs: * `cluster-announce-client-ipv4` * `cluster-announce-client-ipv6` New module API function: * `ValkeyModule_GetClusterNodeInfoForClient`, takes a client id and is otherwise just like its non-ForClient cousin. If configured, one of these IP addresses is reported to each client in CLUSTER SLOTS, CLUSTER SHARDS, CLUSTER NODES and redirects, replacing the IP (`cluster-announce-ip` or the auto-detected IP) of each node. Which one is reported to the client depends on whether the client is connected over IPv4 or IPv6. Benefits: * This allows clients using IPv4 to get the IPv4 addresses of all cluster nodes and IPv6 clients to get the IPv6 addresses. * This allows the IPs visible to clients to be different to the IPs used between the cluster nodes due to NAT'ing. The information is propagated in the cluster bus using new Ping extensions. (Old nodes without this feature ignore unknown Ping extensions.) This adds another dimension to CLUSTER SLOTS reply. It now depends on the client's use of TLS, the IP address family and RESP version. Refactoring: The cached connection type definition is moved from connection.h (it actually has nothing to do with the connection abstraction) to server.h and is changed to a bitmap, with one bit for each of TLS, IPv6 and RESP3. Fixes #337 --------- Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>	2024-07-10 13:53:52 +02:00
bentotten	f2bbd1ff0f	Fix minor memory leak in clusterLoadConfig (#741 ) We forgot to call sdsfreesplitres in the error path during a nodes.conf corruption check, this function exits on the error paths so this is just a cleanup. Signed-off-by: bentotten <59932872+bentotten@users.noreply.github.com>	2024-07-04 16:55:55 -07:00
Binbin	2d6791bb11	Use clusterNodeIsVotingPrimary function to check the right (#735 ) Minor cleanups. --------- Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-07-03 20:42:25 +02:00
skyfirelee	e4c1f6d45a	Replace client flags to bitfield (#614 )	2024-06-30 11:33:10 -07:00
zhaozhao.zz	4fbe31ab87	Fix the TLS and RESP issues about CLUSTER SLOTS cache (#581 ) PR #53 introduced the cache of CLUSTER SLOTS response, but the cache has some problems for different types of clients: 1. the RESP version is wrongly ignored: ``` $./valkey-cli 127.0.0.1:6379> cluster slots 1) 1) (integer) 0 2) (integer) 16383 3) 1) "" 2) (integer) 6379 3) "f1aeceb352401ce57acd432c68c60b359c00ef85" 4) (empty array) 127.0.0.1:6379> hello 3 1# "server" => "valkey" 2# "version" => "255.255.255" 3# "proto" => (integer) 3 4# "id" => (integer) 3 5# "mode" => "cluster" 6# "role" => "master" 7# "modules" => (empty array) 127.0.0.1:6379> cluster slots 1) 1) (integer) 0 2) (integer) 16383 3) 1) "" 2) (integer) 6379 3) "f1aeceb352401ce57acd432c68c60b359c00ef85" 4) (empty array) ``` RESP3 should get "empty hash" but gets RESP2's "empty array" 3. we should use the original client's connection type, or lua/function and module would get the wrong port: ``` $./valkey-cli --tls --insecure -p 6789 127.0.0.1:6789> config get port tls-port 1) "tls-port" 2) "6789" 3) "port" 4) "6379" 127.0.0.1:6789> cluster slots 1) 1) (integer) 0 2) (integer) 16383 3) 1) "" 2) (integer) 6789 3) "f1aeceb352401ce57acd432c68c60b359c00ef85" 4) (empty array) 127.0.0.1:6789> eval "return redis.call('cluster','slots')" 0 1) 1) (integer) 0 2) (integer) 16383 3) 1) "" 2) (integer) 6379 3) "f1aeceb352401ce57acd432c68c60b359c00ef85" 4) (empty array) ``` --------- Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>	2024-06-28 14:56:13 +08:00
Pierre	495c35d918	Add check in CLUSTERLINK KILL cmd to avoid freeing links to myself (#689 ) Add check in CLUSTERLINK KILL cmd to avoid freeing cluster bus links to myself. Also add an assert in `freeClusterLink()`. Testing: ``` 127.0.0.1:6379> debug clusterlink kill all c0404ee68574c6aa1048aaebfe90283afe51d2fc (error) ERR Cannot free cluster link(s) to myself ``` Signed-off-by: Pierre Turin <pieturin@amazon.com>	2024-06-25 15:18:30 -07:00
Ping Xie	32ca6e5b38	Improve `CLUSTER SETSLOT` replication handling to support older replica versions. (#686 )	2024-06-23 22:08:52 -07:00
Ping Xie	4135894a5d	Update remaining `master` references to `primary` (#660 ) Signed-off-by: Ping Xie <pingxie@google.com>	2024-06-17 20:31:15 -07:00
Binbin	495a121f19	Adjust the log level of some logs in the cluster (#633 ) I think the log of pfail status changes will be very useful. The other parts were scanned and found that it can be modified. Changes: 1. Changing pfail status related logs from VERBOSE to NOTICE. 2. Changing configEpoch collision log from VERBOSE(warning) to NOTICE. 3. Changing some logs from DEBUG to NOTICE. Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>	2024-06-18 10:46:56 +08:00
Binbin	db6d3c1138	Only primary with slots has the right to mark a node as failed (#634 ) In markNodeAsFailingIfNeeded we will count needed_quorum and failures, needed_quorum is half of cluster->size plus one, and cluster->size is the number of primary nodes that contain slots, but when counting failures, we did not check if the primary has slots. Only primaries that have slots have the right to vote, adding a new clusterNodeIsVotingPrimary to formalize this concept. Release notes: bugfix where nodes not in the quorum group might spuriously mark nodes as failed --------- Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Ping Xie <pingxie@outlook.com>	2024-06-16 20:46:08 -07:00
Sankar	a81c32079c	Make cluster meet reliable under link failures (#461 ) When there is a link failure while an ongoing MEET request is sent, the sending node stops sending any more MEETs and starts sending PINGs. Since every node responds to PINGs from unknown nodes with a PONG, the receiving node never adds the sending node. But the sending node adds the receiving node when it sees a PONG. This can lead to asymmetry in cluster membership. This change makes the sender keep sending MEET until it sees a PONG, avoiding the asymmetry. --------- Signed-off-by: Sankar <1890648+srgsanky@users.noreply.github.com>	2024-06-16 20:37:09 -07:00
Ping Xie	8a776c3509	Fix potential infinite loop in `clusterNodeGetPrimary` (#651 )	2024-06-13 23:43:36 -07:00
Madelyn Olson	a3f1535b57	Fix misuse of safe iterators (#612 ) Safe iterators must call resetIterators when they are done being used. Fix one issue where a safe iterator was not correctly calling reset during cluster slot caching and fixed a second issue where reset iterator was being called twice. For the double release case, kvstoreIteratorNextDict is responsible for patching up the iterator, but we were calling it a second time in kvstoreIteratorNext. In addition, I added some documentation around initializing iterators, added an assert to prevent double initialization, and remove a function from the public interface which isn't needed and might lead to incorrect usage of the safe iterators. Bumping srgsanky for finding it here: `c4782066e7 (r142867004)`. --------- Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>	2024-06-10 12:30:57 -07:00
flowerysong	d28ae52004	Remove redundant function `nextPingExt()` (#613 ) Functionally identical to the older, documented `getNextPingExt()`. Fixes #610. Signed-off-by: Paul Arthur <paul.arthur@flowerysong.com>	2024-06-08 21:55:58 -07:00
Ping Xie	aad6769a80	Replicate slot migration states via RDB aux fields (#586 )	2024-06-07 20:32:27 -07:00
Ping Xie	54c9747935	Remove `master` and `slave` from source code (#591 ) External facing interfaces are not affected. --------- Signed-off-by: Ping Xie <pingxie@google.com>	2024-06-07 14:21:33 -07:00
Viktor Söderqvist	ad5fd5b95c	More rebranding (#606 ) More rebranding of * Log messages (#252) * The DENIED error reply * Internal function names and comments, mainly Lua API --------- Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>	2024-06-07 01:40:55 +02:00
Shivshankar	9319f7aeca	Replace valkey in log and panic messages (#550 ) Part of #207 --------- Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>	2024-06-04 20:46:59 +02:00
Eran Liberty	0700c441c6	Remove unused valDup (#443 ) Remove the unused value duplicate API from dict. It's unused in the codebase and introduces unnecessary overhead. --------- Signed-off-by: Eran Liberty <eran.liberty@gmail.com>	2024-06-03 12:22:06 -07:00
Ping Xie	f927565d28	Consolidate various BLOCKED_WAIT* states (#562 ) There are currently three block types: BLOCKED_WAIT, BLOCKED_WAITAOF, and BLOCKED_WAIT_PREREPL, used to block clients executing `WAIT`, `WAITAOF`, and `CLUSTER SETSLOT`, respectively. They share the same workflow: the client is blocked until replication to the expected number of replicas completes. However, they provide different responses depending on the commands involved. Using distinct block types leads to code duplication and reduced readability. This PR consolidates the three types into a single WAIT type, differentiating them using the pending command to ensure the appropriate response is returned. Fix #427 --------- Signed-off-by: Ping Xie <pingxie@google.com>	2024-05-30 23:45:47 -07:00
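
The tie-break described in #762 above (replicas with equal replication offsets are ordered by node name, smaller name first) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from clusterGetReplicaRank: the `replica_t` struct, the `replica_rank` function and the `NODE_NAME_LEN` constant are hypothetical stand-ins introduced only for this example.

```c
#include <assert.h>
#include <string.h>

#define NODE_NAME_LEN 40 /* hypothetical: assumed length of a node name */

/* Hypothetical, simplified replica entry; not the real cluster node struct. */
typedef struct {
    char name[NODE_NAME_LEN + 1]; /* NUL-terminated node name */
    long long repl_offset;        /* replication offset seen for this replica */
} replica_t;

/* Returns the rank of `self` among `siblings` (0 is the best rank).
 * A sibling pushes `self` down by one if it has a larger offset, or if it
 * has the same offset and a lexicographically smaller name. */
static int replica_rank(const replica_t *self, const replica_t *siblings, int n) {
    assert(self != NULL && siblings != NULL && n >= 0);
    int rank = 0;
    for (int i = 0; i < n; i++) {
        const replica_t *r = &siblings[i];
        if (r == self) continue; /* do not compare a replica with itself */
        if (r->repl_offset > self->repl_offset) {
            rank++;
        } else if (r->repl_offset == self->repl_offset &&
                   strncmp(r->name, self->name, NODE_NAME_LEN) < 0) {
            rank++; /* tie on offset: the smaller name wins */
        }
    }
    return rank;
}
```

With this ordering, two replicas that share an offset never end up with the same rank, so their election delays differ instead of colliding.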

95 Commits