Cluster: master nodes wait before rejoining the cluster after reboot.

One of the simple heuristics Redis Cluster uses to avoid losing data
in the typical failure modes created by asynchronous replication with
the slaves (when accepting a write, a master cannot immediately tell
whether it should really be accepted or refused because of a
configuration change) is to wait some time before rejoining the
cluster after being partitioned away from the majority of instances.

A similar condition happens when a master is restarted. It does not know
whether it was already failed over, nor whether all the clients already
have an updated cluster map, so clients may try to write to stale
masters that were restarted.

Similarly, this commit changes the behavior of masters so that they
wait 2000 milliseconds before accepting writes after a reboot. There is
nothing special about 2 seconds, other than being a value that should
be a few orders of magnitude larger than the cluster bus communication
latencies.
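
As a rough illustration of the startup delay described above, here is a
minimal, self-contained sketch. The names node_state, update_state and
WRITABLE_DELAY_MS are hypothetical and chosen for this example only; the
2000 ms value is the one stated in this commit message. The current time
is passed in as a parameter so the logic can be exercised without a real
clock.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for illustration; only the 2000 ms value comes
 * from the commit message. */
#define WRITABLE_DELAY_MS 2000

typedef int64_t mstime_t;

typedef struct {
    mstime_t first_call_time; /* 0 means "never evaluated yet" */
    bool is_master;
    bool state_ok;
} node_state;

/* Returns true if the state evaluation ran, false if it was skipped
 * because the node is a master still inside its startup delay.
 * Assumes now_ms is never 0, since 0 is used as the "unset" marker. */
bool update_state(node_state *n, mstime_t now_ms) {
    /* The delay is measured from the first evaluation, not from
     * process start, so time spent before it is not counted. */
    if (n->first_call_time == 0) n->first_call_time = now_ms;

    if (n->is_master && now_ms - n->first_call_time < WRITABLE_DELAY_MS)
        return false;

    n->state_ok = true; /* placeholder for the real state evaluation */
    return true;
}
```

In this sketch a master first evaluated at t=1000 ms keeps skipping the
evaluation until t=3000 ms, while a non-master node is evaluated
immediately.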
antirez 2014-01-20 11:52:52 +01:00
parent e6970e204f
commit 80e80668f4


@@ -222,7 +222,6 @@ int clusterLoadConfig(char *filename) {
    redisLog(REDIS_NOTICE,"Node configuration loaded, I'm %.40s",
        server.cluster->myself->name);
    clusterSetStartupEpoch();
    clusterUpdateState();
    return REDIS_OK;
fmterr:
@@ -2320,13 +2319,29 @@ int clusterDelNodeSlots(clusterNode *node) {
 * Cluster state evaluation function
 * -------------------------------------------------------------------------- */
/* The following are defines that are only used in the evaluation function
 * and are based on heuristics. Actually the main point about the rejoin and
 * writable delay is that they should be a few orders of magnitude larger
 * than the network latency. */
#define REDIS_CLUSTER_MAX_REJOIN_DELAY 5000
#define REDIS_CLUSTER_MIN_REJOIN_DELAY 500
#define REDIS_CLUSTER_WRITABLE_DELAY 2000
void clusterUpdateState(void) {
    int j, new_state;
    int unreachable_masters = 0;
    static mstime_t among_minority_time;
    static mstime_t first_call_time = 0;
    /* If this is a master node, wait some time before turning the state
     * into OK, since it is not a good idea to rejoin the cluster as a writable
     * master, after a reboot, without giving the cluster a chance to
     * reconfigure this node. Note that the delay is calculated starting from
     * the first call to this function and not since the server start, in order
     * not to count the DB loading time. */
    if (first_call_time == 0) first_call_time = mstime();
    if (server.cluster->myself->flags & REDIS_NODE_MASTER &&
        mstime() - first_call_time < REDIS_CLUSTER_WRITABLE_DELAY) return;
    /* Start assuming the state is OK. We'll turn it into FAIL if there
     * are the right conditions. */