3921 Commits

Author SHA1 Message Date
antirez
ef82c59716 Cluster: swap two code blocks to have a more obvious flow. 2014-01-30 16:34:23 +01:00
antirez
b582a91335 Cluster: remove not needed return statement breaking failover. 2014-01-29 17:28:46 +01:00
antirez
378e9745db Cluster: broadcast pong to other slaves in the same ring.
When we schedule a failover, broadcast a PONG to the slaves.
The other slaves that plan to get elected will do the same too, this way
it is likely that every slave will have a good picture of its own rank.

Note that this is N*N messages where N is the number of slaves for the
failing master, however usually even large clusters have many master
nodes but a limited number of replicas per node, so this is harmless.
2014-01-29 17:19:55 +01:00
antirez
6546c6ff3b Cluster: log offset when announcing the failover election delay. 2014-01-29 17:16:10 +01:00
antirez
f5c5e2707c Cluster: added progressive election delay according to slave rank.
Note that when we compute the initial delay, there are probably still
more up to date information to receive from slaves with new offsets, so
the delay is recomputed when new data is available.
2014-01-29 16:53:45 +01:00
antirez
6ece04b1fc Cluster: function clusterGetSlaveRank() added.
Return the number of slaves for the same master having a better
replication offset of the current slave, that is, the slave "rank" used
to pick a delay before the request for election.
2014-01-29 16:39:04 +01:00
antirez
e529f813ae Cluster: update node replication offset from bus packets headers. 2014-01-29 16:01:00 +01:00
antirez
0c36b17916 Cluster: refactoring: new macros to check node flags. 2014-01-29 12:17:16 +01:00
antirez
604fe17883 Cluster: use myself instead of server->cluster.myself. 2014-01-29 11:38:14 +01:00
antirez
9e91cc8b28 Cluster: added a global myself pointer in cluster.c.
Accessing to the 'myself' node, the node representing the currently
running instance, is handy without the need to type
server.cluster->myself every time.
2014-01-29 11:22:22 +01:00
antirez
5124365a8f Cluster: clusterBroadcastPong() improved with target selection.
Now we can broadcast a pong to all the instances or just the local
slaves (that is useful for replication offset propagation).
2014-01-29 11:08:52 +01:00
antirez
b8a0220498 Cluster: broadcast master/slave replication offset in bus header. 2014-01-28 16:51:50 +01:00
antirez
0ae1d07b01 Cluster: limit cluster.h to 80 cols. 2014-01-28 16:34:23 +01:00
antirez
91afff8fee Cluster: introduced repl_offset fields in clusterNode.
The two fields are used in order to remember the latest known
replication offset and the time we received it from other slave nodes.

This will be used by slaves in order to start the election procedure
with a delay that is proportional to the rank of the slave among the
other slaves for this master, when sorted for replication offset.

Usually this allows the slave with the most updated offset to win the
election and replace the failing master in the cluster.
2014-01-28 16:28:07 +01:00
antirez
68fdf31201 Fixed inverted if condition in MISCONF error code path. 2014-01-28 10:11:12 +01:00
antirez
e13076867f Don't log MONITOR clients as disconnecting slaves. 2014-01-25 11:53:53 +01:00
antirez
505e3836a7 Cluster: redis-trib set-timeout implemented. 2014-01-24 15:06:01 +01:00
antirez
ee6f19d7cb Cluster: update slaves lists in clusterSetMaster(). 2014-01-22 18:46:53 +01:00
antirez
423e5e601b Cluster: CLUSTER SLAVES subcommand added. 2014-01-22 18:38:42 +01:00
antirez
ae66039bb1 Cluster: clusterGenNodesDescription() refactored into two functions. 2014-01-22 18:36:12 +01:00
antirez
a166f49794 redis-cli --help output improved with --scan and periods. 2014-01-22 12:07:42 +01:00
antirez
d97e8ccd5c redis-cli: support for --scan option. 2014-01-22 12:04:08 +01:00
antirez
54dcf13614 Use fflush() before fsync() in rio.c.
Incremental flushing in rio.c is only used to avoid huge kernel buffers
synched to slow disks creating big latency spikes, so this fix has no
durability implications, however it is certainly more correct to make
sure that the FILE buffers are flushed to the kernel before calling
fsync on the file descriptor.

Thanks to Li Shao Kai for reporting this issue in the Redis mailing
list.
2014-01-22 09:54:55 +01:00
antirez
f57d70571f Cluster: master nodes wait before rejoining the cluster after reboot.
One of the simple heuristics used by Redis Cluster in order to avoid
losing data in the typical failure modes created by the asynchronous
replication with the slaves (a master is unable, when accepting a
write, to immediately tell if it should be really accepted or refused
because of a configuration change), is to wait some time before to
rejoin the cluster after being partitioned away from the majority of
instances.

A similar condition happens when a master is restarted. It does not know
if it was already failed over, nor if all the clients have already an
updated configuration about the cluster map, so it is possible that
clients will try to write to stale masters that were restarted.

In a similar way this commit changes masters behavior so they wait
2000 milliseconds before accepting writes after a reboot. There is
nothing special about 2 seconds if not to be a value supposedly larger
a few orders of magnitude compared to the cluster bus communication
latencies.
2014-01-20 11:52:52 +01:00
antirez
c7de167cb1 Cluster: debug printf statemets removed.
These were committed for error after being inserted in order to fix an
issue.
2014-01-20 11:19:04 +01:00
antirez
26c1bf2a61 Cluster: don't rewrite slaveof config directive in cluster mode. 2014-01-20 11:10:42 +01:00
antirez
4dff037c2b Cluster: fix error reporting when slaveof is found in config. 2014-01-20 11:08:14 +01:00
antirez
aa79af76fc Cluster: allow CLUSTER REPLICATE to switch master.
The code was doing checks for slaves that should be done only when the
instance is currently a master. Switching a slave from a master to
another one should just work.
2014-01-17 18:22:35 +01:00
antirez
1caae15fdd Set server.repl_down_since to 0 when changing master.
When an instance is potentially set to replicate with another master, it
is conceptually disconnected forever, since we have no old copy of the
dataset for this master in memory.
2014-01-17 18:20:31 +01:00
antirez
c40ca44941 Cluster: redis-trib shows number of replicas of masters. 2014-01-17 17:56:45 +01:00
antirez
383b2d63ea Cluster: redis-trib help output format modified. 2014-01-17 12:32:49 +01:00
antirez
9fe6577942 Cluster: redis-trib shows what a slave replicates + fixes.
Also the :replicates info field in the node object is now correctly
populated. This also fixes the :replicas field computation.
2014-01-17 12:06:18 +01:00
antirez
82859e547d Cluster: redis-trib addnode is now able to add replicas. 2014-01-17 11:48:42 +01:00
antirez
a422dceef7 Cluster: fix redis-trib help subcommand. 2014-01-17 10:29:40 +01:00
antirez
ef73ddbca9 Cluster: redis-trib delnode implementation. 2014-01-16 18:22:03 +01:00
antirez
3c581e39ce Cluster: don't let a node forget its own master.
redis-trib should make sure to reconfigure slaves of a node to remove
from the cluster to replicate with other nodes before sending CLUSTER
FORGET.
2014-01-16 17:49:35 +01:00
antirez
2c9f8fc22b Cluster: redis-trib help output improved.
Show options if any. Clarify that for some command any node address is
ok.
2014-01-16 16:23:33 +01:00
antirez
595ab5f26b Cluster: don't forget yourself with CLUSTER FORGET. 2014-01-16 09:46:23 +01:00
antirez
ed0cfc31f3 Cluster: use the node blacklist in CLUSTER FORGET.
CLUSTER FORGET is not useful if we can't remove a node from all the
nodes of our cluster because of the Gossip protocol that keeps adding
a given node to nodes where we already tried to remove it.

So now CLUSTER FORGET implements a nodes blacklist that is set and
checked by the Gossip section processing function. This way before a
node is re-added at least 60 seconds must elapse since the FORGET
execution.

This means that redis-trib has some time to remove a node from a whole
cluster. It is possible that in the future it will be uesful to raise
the 60 sec figure to something bigger.
2014-01-15 16:50:45 +01:00
antirez
7f48ad0629 Cluster: fix clusterBlacklistAddNode() by setting right expire time.
The hash table value should be set to now + 60 seconds otherwise it
expires immediately.
2014-01-15 16:49:31 +01:00
antirez
8602cbc8b4 Cluster: clusterBlacklistAddNode() key lookup fixed.
We can't lookup by node->name that's not an SDS string but a plain C
array in the node structure.
2014-01-15 16:45:07 +01:00
antirez
90197bd012 Cluster: clusterBlacklistExists() requires blacklist cleanup before lookup. 2014-01-15 16:06:54 +01:00
antirez
25cfa0817e Cluster: set a minimum rejoin delay if node_timeout is too small.
The rejoin delay usually is the node timeout. However if the node
timeout is too small, we set it to 500 milliseconds, that is a value
chosen to be greater than most setups RTT / instances latency figures
so that likely communication with other nodes happen before rejoining.
2014-01-15 12:34:33 +01:00
antirez
58459f96cd Cluster: periodically call clusterUpdateState() when cluster is down.
Usually we update the cluster state (to understand if we should accept
queries or reply with an error) only when there is a change in the state
of the nodes. However for the "delayed rejoin" feature to work, that is,
for a master to wait some time before accepting queries again after it
rejoins the majority, we need to periodically update the last time when
the node was partitioned away from the majority.

With this commit if the cluster is down we update the state ten times
per second.
2014-01-15 12:26:12 +01:00
antirez
d5fdcc269f Cluster: range checking in getSlotOrReply() fixed.
See issue #1426 on Github.
2014-01-15 11:33:46 +01:00
antirez
adc613f456 Cluster: ignore empty lines in nodes.conf.
Even without the user messing manually with the file, it is still
possible to have blank lines (just a single "\n" per line) because of
how the nodes.conf update/write process works.
2014-01-15 11:23:41 +01:00
antirez
e4a1d6bb5d Cluster: atomic update of nodes.conf file.
The way the file was generated was unsafe and leaded to nodes.conf file
corruption (zero length file) on server stop/crash during the creation
of the file.

The previous file update method was as simple as open with O_TRUNC
followed by the write call. While the write call was a single one with
the full payload, ensuring no half-written files for POSIX semantics,
stopping the server just after the open call resulted into a zero-length
file (all the nodes information lost!).
2014-01-15 10:31:20 +01:00
antirez
fdab41fe65 Cluster: support to read from slave nodes.
A client can enter a special cluster read-only mode using the READONLY
command: if the client read from a slave instance after this command,
for slots that are actually served by the instance's master, the queries
will be processed without redirection, allowing clients to read from
slaves (but without any kind fo read-after-write guarantee).

The READWRITE command can be used in order to exit the readonly state.
2014-01-14 16:33:16 +01:00
antirez
67141ef4e1 Fix typo in aofRewriteBufferAppend() comment. 2014-01-14 15:37:49 +01:00
antirez
639613b3f0 Set REDIS_AOF_REWRITE_MIN_SIZE to 64mb.
64mb is the default value in redis.conf. For some reason instead the
hard-coded default was 1mb that is too small.
2014-01-14 11:27:28 +01:00