futriix

Author	SHA1	Message	Date
Madelyn Olson	2b207ee1b3	Improve stability of hostnames test (#1016 ) Maybe partially resolves https://github.com/valkey-io/valkey/issues/952. The hostnames test relies on an assumption that node zero and node six don't communicate with each other to test a bunch of behavior in the handshake stake. This was done by previously dropping all meet packets, however it seems like there was some case where node zero was sending a single pong message to node 6, which was partially initializing the state. I couldn't track down why this happened, but I adjusted the test to simply pause node zero which also correctly emulates the state we want to be in since we're just testing state on node 6, and removes the chance of errant messages. The test was failing about 5% of the time locally, and I wasn't able to reproduce a failure with this new configuration. --------- Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>	2024-09-11 09:52:34 -07:00
Ping Xie	66d0f7d9a1	Ensure only primary sender drives slot ownership updates (#754 ) Fixes a regression introduced in PR #445, which allowed a message from a replica to update the slot ownership of its primary. The regression results in a `replicaof` cycle, causing server crashes due to the cycle detection assert. The fix restores the previous behavior where only primary senders can trigger `clusterUpdateSlotsConfigWith`. Additional changes: * Handling of primaries without slots is obsoleted by new handling of when a sender that was a replica announces that it is now a primary. * Replication loop detection code is unchanged but shifted downwards. * Some variables are renamed for better readability and some are introduced to avoid repeated memcmp() calls. Fixes #753. --------- Signed-off-by: Ping Xie <pingxie@google.com>	2024-07-16 13:05:49 -07:00
Ping Xie	aad6769a80	Replicate slot migration states via RDB aux fields (#586 )	2024-06-07 20:32:27 -07:00
Binbin	6bab2d7968	Make sure clear the CLUSTER SLOTS cache on time when updating hostname (#564 ) In #53, we will cache the CLUSTER SLOTS response to improve the throughput and reduct the latency. In the code snippet below, the second cluster slots will use the old hostname: ``` config set cluster-preferred-endpoint-type hostname config set cluster-announce-hostname old-hostname.com multi cluster slots config set cluster-announce-hostname new-hostname.com cluster slots exec ``` When updating the hostname, in updateAnnouncedHostname, we will set CLUSTER_TODO_SAVE_CONFIG and we will do a clearCachedClusterSlotsResponse in clusterSaveConfigOrDie, so harmless in most cases. Move the clearCachedClusterSlotsResponse call to clusterDoBeforeSleep instead of scheduling it to be called in clusterSaveConfigOrDie. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-05-30 10:44:12 +08:00
Ping Xie	6e7af9471c	Slot migration improvement (#445 )	2024-05-06 21:40:28 -07:00
Harkrishn Patro	ebfb440629	Pass extensions to node if extension processing is handled by it (#52 ) Ref: https://github.com/redis/redis/pull/12760 ### Description #### Fixes compatibilty of PlaceholderKV cluster (7.2 - extensions enabled by default) with older Redis cluster (< 7.0 - extensions not handled) . With some of the extensions enabled by default in 7.2 version, new nodes running 7.2 and above start sending out larger clusterbus message payload including the ping extensions. This caused an incompatibility with node running engine versions < 7.0. Old nodes (< 7.0) would receive the payload from new nodes (> 7.2) would observe a payload length (totlen) > (estlen) and would perform an early exit and won't process the message. This fix introduces a flag `extensions_supported` on the clusterMsg indicating the sender node can handle extensions parsing. Once, a receiver nodes receives a message with this flag set to 1, it would update clusterNode new field extensions_supported and start sending out extensions if it has any. This PR also introduces a DEBUG sub command to enable/disable cluster message extensions `process-clustermsg-extensions` feature. Note: A successful `PING`/`PONG` is required as a sender for a given node to be marked as `extensions_supported` and then extensions message will be sent to it. This could cause a slight delay in receiving the extensions message(s). ### Testing TCL test verifying the cluster state is healthy irrespective of enabling/disabling cluster message extensions feature. --------- Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>	2024-04-08 09:01:30 -07:00
Madelyn Olson	73cf0243df	Make nodename test more consistent (#12330 ) To determine when everything was stable, we couldn't just query the nodename since they aren't API visible by design. Instead, we were using a proxy piece of information which was bumping the epoch and waiting for everyone to observe that. This works for making source Node 0 and Node 1 had pinged, and Node 0 and Node 2 had pinged, but did not guarantee that Node 1 and Node 2 had pinged. Although unlikely, this can cause this failure message. To fix it I hijacked hostnames and used its validation that it has been propagated, since we know that it is stable. I also noticed while stress testing this sometimes the test took almost 4.5 seconds to finish, which is really close to the current 5 second limit of the log check, so I bumped that up as well just to make it a bit more consistent.	2023-06-20 18:00:55 -07:00
Wen Hui	070453eef3	Cluster human readable nodename feature (#9564 ) This PR adds a human readable name to a node in clusters that are visible as part of error logs. This is useful so that admins and operators of Redis cluster have better visibility into failures without having to cross-reference the generated ID with some logical identifier (such as pod-ID or EC2 instance ID). This is mentioned in #8948. Specific nodenames can be set by using the variable cluster-announce-human-nodename. The nodename is gossiped using the clusterbus extension in #9530. Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>	2023-06-17 21:16:51 -07:00
Madelyn Olson	663fbd3459	Stabilize cluster hostnames tests (#11307 ) This PR introduces a couple of changes to improve cluster test stability: 1. Increase the cluster node timeout to 3 seconds, which is similar to the normal cluster tests, but introduce a new mechanism to increase the ping period so that the tests are still fast. This new config is a debug config. 2. Set `cluster-replica-no-failover yes` on a wider array of tests which are sensitive to failovers. This was occurring on the ARM CI.	2022-10-03 09:25:16 +03:00
Viktor Söderqvist	5032de50f2	Gossip forgotten nodes on `CLUSTER FORGET` (#10869 ) Gossip the cluster node blacklist in ping and pong messages. This means that CLUSTER FORGET doesn't need to be sent to all nodes in a cluster. It can be sent to one or more nodes and then be propagated to the rest of them. For each blacklisted node, its node id and its remaining blacklist TTL is gossiped in a cluster bus ping extension (introduced in #9530).	2022-07-26 10:28:13 +03:00
Madelyn Olson	3abdec9969	Fix cluster hostnames test causing failover while running valgrind (#10991 ) In the newly added cluster hostnames test, the primary is failing over during the reboot for valgrind so we are validating the wrong node. This change just sets the replica to prevent taking over, which seems to fix the test. We could have also set the timeout higher, but it slows down the test.	2022-07-17 09:57:34 +03:00
Madelyn Olson	8a4e3bcd8d	Cluster test improvements (#10920 ) * Restructured testing to allow running cluster tests easily as part of the normal testing	2022-07-12 10:41:29 -07:00

12 Commits