Cluster: master nodes wait before rejoining the cluster after reboot.

One of the simple heuristics Redis Cluster uses to avoid losing data in the typical failure modes created by asynchronous replication to the slaves is to wait some time before rejoining the cluster after being partitioned away from the majority of instances. The underlying problem is that a master, when accepting a write, cannot immediately tell whether the write should really be accepted or refused because of a configuration change. A similar condition happens when a master is restarted: it does not know whether it was already failed over, nor whether all the clients already have an updated cluster map, so clients may try to write to stale masters that were restarted. In the same spirit, this commit changes the behavior of masters so that they wait 2000 milliseconds before accepting writes after a reboot. There is nothing special about 2 seconds, other than being a value that is supposedly a few orders of magnitude larger than the cluster bus communication latencies.
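The gating logic described above can be sketched in isolation as follows. This is a minimal illustration and not the code from the patch: `demo_node`, `demo_update_state`, and `WRITABLE_DELAY_MS` are hypothetical names, and the clock is passed in as a parameter (instead of calling `mstime()`) so the behavior can be checked deterministically.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: mirrors the 2000 ms value of REDIS_CLUSTER_WRITABLE_DELAY. */
#define WRITABLE_DELAY_MS 2000

typedef long long mstime_t;

/* Hypothetical stand-in for the node state; not the real Redis structs. */
typedef struct {
    int is_master;            /* 1 for a master, 0 for a slave */
    int state_ok;             /* 1 once the node considers the cluster OK */
    mstime_t first_call_time; /* 0 means "not called yet" */
} demo_node;

/* Returns 1 if the state was evaluated, 0 if the call returned early
 * because a master is still inside its post-restart delay window.
 * The delay is measured from the first call, not from process start,
 * so time spent loading the dataset is not counted. */
int demo_update_state(demo_node *n, mstime_t now) {
    assert(n != NULL);
    if (n->first_call_time == 0) n->first_call_time = now;
    if (n->is_master && now - n->first_call_time < WRITABLE_DELAY_MS)
        return 0;
    /* The real function would evaluate slot coverage and reachability
     * here; this sketch simply marks the state as OK. */
    n->state_ok = 1;
    return 1;
}
```

Under these assumptions, a master first evaluated at time 1000 keeps returning early until time 3000, while a slave is evaluated on its first call because the delay applies to masters only.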
2014-01-20 11:52:52 +01:00 · 80e80668f4
commit 80e80668f4
parent e6970e204f
1 changed file with 16 additions and 1 deletion
--- a/src/cluster.c
+++ b/src/cluster.c
@@ -222,7 +222,6 @@ int clusterLoadConfig(char *filename) {
    redisLog(REDIS_NOTICE,"Node configuration loaded, I'm %.40s",
        server.cluster->myself->name);
    clusterSetStartupEpoch();
-    clusterUpdateState();
    return REDIS_OK;

 fmterr:
@@ -2320,13 +2319,29 @@ int clusterDelNodeSlots(clusterNode *node) {
 * Cluster state evaluation function
 * -------------------------------------------------------------------------- */

+/* The following are defines that are only used in the evaluation function
+ * and are based on heuristics. Actually the main point about the rejoin and
+ * writable delay is that they should be a few orders of magnitude larger
+ * than the network latency. */
 #define REDIS_CLUSTER_MAX_REJOIN_DELAY 5000
 #define REDIS_CLUSTER_MIN_REJOIN_DELAY 500
+#define REDIS_CLUSTER_WRITABLE_DELAY 2000

 void clusterUpdateState(void) {
    int j, new_state;
    int unreachable_masters = 0;
    static mstime_t among_minority_time;
+    static mstime_t first_call_time = 0;
+
+    /* If this is a master node, wait some time before turning the state
+     * into OK, since it is not a good idea to rejoin the cluster as a writable
+     * master, after a reboot, without giving the cluster a chance to
+     * reconfigure this node. Note that the delay is calculated starting from
+     * the first call to this function and not since the server start, in order
+     * not to count the DB loading time. */
+    if (first_call_time == 0) first_call_time = mstime();
+    if (server.cluster->myself->flags & REDIS_NODE_MASTER &&
+        mstime() - first_call_time < REDIS_CLUSTER_WRITABLE_DELAY) return;

    /* Start assuming the state is OK. We'll turn it into FAIL if there
     * are the right conditions. */