632 Commits

Author SHA1 Message Date
antirez
adc613f456 Cluster: ignore empty lines in nodes.conf.
Even without the user messing manually with the file, it is still
possible to have blank lines (just a single "\n" per line) because of
how the nodes.conf update/write process works.
2014-01-15 11:23:41 +01:00
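A minimal sketch of the idea behind this commit, assuming a loader that walks the file contents line by line. The function name `count_node_lines` and the in-memory buffer are illustrative assumptions, not the actual Redis loader.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts the lines of a NUL-terminated buffer that a
 * loader would actually parse, skipping lines made of a single "\n". */
size_t count_node_lines(const char *buf) {
    assert(buf != NULL);
    size_t count = 0;
    const char *p = buf;
    while (*p != '\0') {
        const char *eol = strchr(p, '\n');
        size_t len = eol ? (size_t)(eol - p) : strlen(p);
        if (len > 0) count++;   /* zero-length line: nothing to parse, skip */
        p += len;
        if (*p == '\n') p++;    /* step over the line terminator */
    }
    return count;
}
```

With this approach a file such as "a\n\n\nb\n" yields two parsed lines instead of an error on the blank ones.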
antirez
e4a1d6bb5d Cluster: atomic update of nodes.conf file.
The way the file was generated was unsafe and led to nodes.conf file
corruption (zero length file) on server stop/crash during the creation
of the file.

The previous file update method was as simple as open with O_TRUNC
followed by the write call. While the write call was a single one with
the full payload, which under POSIX semantics ensures no half-written
files, stopping the server just after the open call resulted in a
zero-length file (all the nodes information lost!).
2014-01-15 10:31:20 +01:00
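The message above does not say which replacement strategy the commit uses. One common way to avoid the truncate-then-write window is to write a temporary file and rename it over the target, sketched below. The helper name `save_config_atomically` and the ".tmp" suffix are assumptions for illustration, and this is not presented as the method used in Redis.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: replaces `path` with `payload` so that a reader sees
 * either the old content or the new content, never an empty file.
 * Returns 0 on success, -1 on error. */
int save_config_atomically(const char *path, const char *payload) {
    assert(path != NULL && payload != NULL);
    char tmp[1024];
    int n = snprintf(tmp, sizeof(tmp), "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof(tmp)) return -1;

    /* O_TRUNC is harmless here: only the temporary file is truncated. */
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) return -1;

    size_t len = strlen(payload), written = 0;
    while (written < len) {
        ssize_t w = write(fd, payload + written, len - written);
        if (w <= 0) goto err;           /* any failed write aborts the save */
        written += (size_t)w;
    }
    if (fsync(fd) == -1) goto err;      /* flush data before the rename */
    if (close(fd) == -1) { unlink(tmp); return -1; }
    if (rename(tmp, path) == -1) { unlink(tmp); return -1; }
    return 0;

err:
    close(fd);
    unlink(tmp);
    return -1;
}
```

A crash at any point before the rename leaves the previous file untouched. The sketch does not fsync the containing directory, which a stricter durability requirement would also need.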
antirez
fdab41fe65 Cluster: support to read from slave nodes.
A client can enter a special cluster read-only mode using the READONLY
command: if the client reads from a slave instance after this command,
for slots that are actually served by the instance's master, the queries
will be processed without redirection, allowing clients to read from
slaves (but without any kind of read-after-write guarantee).

The READWRITE command can be used in order to exit the readonly state.
2014-01-14 16:33:16 +01:00
antirez
ed3c6c0124 Fix RESTORE ttl handling in 32 bit archs.
A long was used where a long long is needed in order to handle a 64 bit
resolution millisecond timestamp.

This fixes issue #1483.
2014-01-09 11:09:23 +01:00
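To illustrate why the type matters, the sketch below checks whether a millisecond timestamp fits in a 32 bit signed integer, which is the width of `long` on typical 32 bit architectures. The function names are hypothetical, and it assumes the TTL is turned into an absolute millisecond timestamp.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if `value` is representable in a 32 bit signed integer. */
int fits_in_32bit_long(long long value) {
    return value >= INT32_MIN && value <= INT32_MAX;
}

/* Hypothetical helper: absolute expire time in milliseconds, computed
 * with a 64 bit type so that it cannot be truncated. */
long long absolute_expire_ms(long long now_ms, long long ttl_ms) {
    assert(ttl_ms >= 0);
    return now_ms + ttl_ms;
}
```

For a January 2014 clock value such as 1389262163000 ms, `fits_in_32bit_long()` returns 0, so storing the result in a 32 bit `long` would lose information.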
antirez
98901950f9 Cluster: clusterProcessPacket() was not 80 cols friendly.
The function actually needs to be split into sub-functions at some
point in the future.
2013-12-25 17:57:36 +01:00
antirez
a75b334bdf Redis Cluster: add repl_ping_slave_period to slave data validity time.
When the configured node timeout is very small, the data validity time
(maximum data age for a slave to try a failover) is too little (ten
times the configured node timeout) when the replication link with the
master is mostly idle. In this case we'll receive some data from the
master only every server.repl_ping_slave_period to refresh the last
interaction with the master.

This commit adds the slave ping period to the max data validity time to
avoid this problem of slaves considering their data too old without a
good reason.
However this max data validity time is likely a setting that should be
configurable by the Redis Cluster user in a way completely independent
from the node timeout.
2013-12-22 10:05:16 +01:00
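The computation described above can be sketched as follows. The multiplier of ten comes from the message; the function names, the `mstime_t` typedef, and the unit choice (node timeout in milliseconds, ping period in seconds) are assumptions made for this example.

```c
#include <assert.h>

typedef long long mstime_t;       /* assumed millisecond time type */

#define SLAVE_VALIDITY_MULT 10    /* "ten times the configured node timeout" */

/* Maximum data age allowed for a slave to try a failover. */
mstime_t max_data_age_ms(mstime_t node_timeout_ms, int repl_ping_period_sec) {
    assert(node_timeout_ms >= 0 && repl_ping_period_sec >= 0);
    return node_timeout_ms * SLAVE_VALIDITY_MULT +
           (mstime_t)repl_ping_period_sec * 1000;
}

/* Returns 1 if the slave data is recent enough to attempt a failover. */
int slave_data_is_valid(mstime_t data_age_ms, mstime_t node_timeout_ms,
                        int repl_ping_period_sec) {
    return data_age_ms <= max_data_age_ms(node_timeout_ms,
                                          repl_ping_period_sec);
}
```

With a 100 ms node timeout and a 10 second ping period, the limit grows from 1000 ms to 11000 ms, so an idle replication link no longer disqualifies the slave between pings.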
antirez
db016acb7f Redis Cluster: move node failure reports logging from VERBOSE to NOTICE level. 2013-12-21 00:04:53 +01:00
antirez
8527ba1eea Redis Cluster: remove no longer relevant comment. 2013-12-20 14:40:11 +01:00
antirez
dd10efb31a Redis Cluster: reconfigure replication when master changes address. 2013-12-20 12:47:22 +01:00
antirez
4d11d4c86c Redis Cluster: handshake code refactoring + Gossip IP switch detection.
This commit makes it simple to start a handshake with a specific node
address, and uses this in order to detect a node IP change and start a
new handshake in order to fix the IP if possible.
2013-12-20 12:38:03 +01:00
antirez
f42e0277ab Redis Cluster: delay state change when in the majority again.
As specified in the Redis Cluster specification, when a node can reach
the majority again after a period in which it was partitioned away with
the minority of masters, wait some time before accepting queries, to
provide a reasonable amount of time for other nodes to update their
configuration.

This lowers the probability that both a client and a master with a
stale configuration rejoin the cluster at the same time, with the
stale master accepting writes.
2013-12-20 09:56:18 +01:00
antirez
e48365e2c2 Cluster: set n->slaves to NULL in clusterNodeResetSlaves().
The value was otherwise undefined, so next time the node was promoted
again from slave to master, adding a slave to the list of slaves
would likely crash the server or result in undefined behavior.
2013-12-17 14:50:24 +01:00
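A minimal sketch of the bug class, assuming the slave list is a heap array grown with `realloc()`. The `node` struct and function names below are simplified stand-ins, not the real Redis definitions.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in for a cluster node with a dynamic slave array. */
typedef struct node {
    int numslaves;
    struct node **slaves;
} node;

/* Appends `slave` to the list. Relies on n->slaves being either NULL or a
 * live allocation: realloc() on a dangling pointer is undefined behavior. */
int node_add_slave(node *n, node *slave) {
    assert(n != NULL);
    node **grown = realloc(n->slaves, sizeof(node *) * (n->numslaves + 1));
    if (grown == NULL) return -1;
    n->slaves = grown;
    n->slaves[n->numslaves++] = slave;
    return 0;
}

/* Frees the list and leaves the node in a state that is safe to reuse. */
void node_reset_slaves(node *n) {
    assert(n != NULL);
    free(n->slaves);
    n->slaves = NULL;   /* without this, the next realloc() sees freed memory */
    n->numslaves = 0;
}
```

Because `realloc(NULL, size)` behaves like `malloc(size)`, resetting the pointer to NULL is enough to make a later `node_add_slave()` call safe.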
antirez
7f51cf8b56 Cluster: check link is valid before sending UPDATE. 2013-12-17 12:28:37 +01:00
antirez
c17be18035 Cluster: initialize todo_before_sleep flags to 0. 2013-12-17 12:22:02 +01:00
antirez
195aab3345 Cluster: use proper type mstime_t for ping delay var. 2013-12-17 10:27:36 +01:00
antirez
118d0fb533 Fixed clearNodeFailureIfNeeded() time type to mstime_t.
This prevented 32bit cluster instances from clearing the FAIL flag when
needed.
2013-12-17 09:45:52 +01:00
antirez
9180fb7931 Cluster: use long long for timestamps in clusterGenNodesDescription().
Ping sent and pong received fields need to be cast to long long to be
printed correctly on 32 bit systems.
2013-12-17 09:38:11 +01:00
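The pattern can be sketched like this: whatever the underlying time type is, the value is cast to `long long` so that it always matches the `%lld` conversion. The function name and the `mstime_t` typedef are assumptions for the example.

```c
#include <assert.h>
#include <stdio.h>

typedef long long mstime_t;   /* assumed millisecond time type */

/* Hypothetical formatter for the two timestamp fields of a node line.
 * The explicit casts keep the arguments in sync with "%lld" even if the
 * typedef above were changed to another integer type. */
int format_ping_pong(char *buf, size_t buflen,
                     mstime_t ping_sent, mstime_t pong_received) {
    assert(buf != NULL && buflen > 0);
    return snprintf(buf, buflen, "%lld %lld",
                    (long long)ping_sent, (long long)pong_received);
}
```

Passing an argument whose type does not match the conversion specifier is undefined behavior in `printf`-style functions, which is why the mismatch only showed up on some architectures.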
antirez
7a5a646df9 Fixed grammar: before H the article is a, not an. 2013-12-05 16:35:32 +01:00
antirez
b7c955046d Cluster: nodes re-addition blacklist API. 2013-12-02 11:12:23 +01:00
antirez
5502face59 Cluster: basic data structures for nodes black list. 2013-11-29 17:37:06 +01:00
antirez
a829c85988 Cluster: some code about clusterHandleSlaveFailover() marginally improved.
80 cols friendly, some minor change to the code to make it simpler.
2013-11-29 16:17:05 +01:00
antirez
e159239f9c Cluster: removed not needed newline at end of redisLog() msg. 2013-11-08 17:28:02 +01:00
antirez
a67935e5e3 Cluster: send a single UPDATE packet for now. 2013-11-08 17:25:49 +01:00
antirez
a146482c83 Cluster: replace hardcoded 4096 for bus msg len with sizeof(). 2013-11-08 17:19:19 +01:00
antirez
36db83ac50 Cluster: slots update refactored + UPDATE msg processing.
Now there is a function that handles the update of the local slot
configuration every time we have some new info about a node and its set
of served slots and configEpoch.

Moreover the UPDATE packets are now processed when received (it was a
work in progress in the previous commit).
2013-11-08 17:02:10 +01:00
antirez
a19147c2fa Cluster: UPDATE msg data structure and sending function. 2013-11-08 16:26:50 +01:00
antirez
4666966c9f Cluster: refactoring of slots update code and more.
The commit also introduces detection of nodes publishing not updated
configuration. More work in progress to send an UPDATE packet to inform
of the config change.
2013-11-08 10:32:16 +01:00
antirez
f6738923a6 Cluster: initialize senderConfigEpoch and senderCurrentEpoch for warnings suppression. 2013-11-05 12:01:07 +01:00
antirez
e45d9420e0 Cluster: there is a lower limit for the handshake timeout. 2013-10-11 10:34:32 +02:00
antirez
39c90945e0 Cluster: data_age conversion to milliseconds fixed. 2013-10-09 16:36:06 +02:00
antirez
aa0e7dbcf3 Cluster: clusterCron() freq is now 10 hz. Still ping 1 node every sec.
After the change in clusterCron() frequency of call, we still want to
ping just one random node every second.
2013-10-09 16:29:17 +02:00
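One simple way to keep once-per-second work inside a function that runs ten times per second is an iteration counter, sketched below. The function name and the counter-based approach are assumptions; the commit message does not describe how the real code does it.

```c
#include <assert.h>

/* Returns 1 on the cron calls where the once-per-second work should run.
 * `iteration` counts cron invocations, `hz` is the number of calls per
 * second (10 in the commit above). */
int runs_once_per_second(unsigned long long iteration, int hz) {
    assert(hz > 0);
    return iteration % (unsigned long long)hz == 0;
}
```

A cron function would increment its counter on every call and ping a random node only when this returns 1, leaving the rest of its work at the higher frequency.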
antirez
e4b341a335 Cluster: time switched from seconds to milliseconds.
All the internal state of cluster involving time is now using mstime_t
and mstime() in order to use milliseconds resolution.

Also the clusterCron() function is called with a 10 hz frequency instead
of 1 hz.

The cluster node_timeout must be also configured in milliseconds by the
user in redis.conf.
2013-10-09 16:19:26 +02:00
antirez
1560b70889 Cluster: cluster stuff moved from redis.h to cluster.h. 2013-10-09 15:38:05 +02:00
antirez
0f079966c7 Cluster: masters don't vote for a slave with stale config.
When a slave requests our vote, the configEpoch it claims for its master
must be greater than or equal to the configEpoch of the nodes serving
the claimed slots in the current configuration of the master granting
its vote.

In other terms, masters don't vote for slaves having a stale
configuration for the slots they want to serve.
2013-10-08 12:45:35 +02:00
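The rule can be sketched as a check over the slots the slave claims. The bitmap layout, the per-slot epoch array, and the function names are assumptions made for this illustration, not the real Redis data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns non-zero if bit `pos` is set in `bitmap`. */
static int bitmap_test(const unsigned char *bitmap, int pos) {
    return (bitmap[pos >> 3] & (1 << (pos & 7))) != 0;
}

/* Hypothetical vote check. `claimed_slots` is the bitmap of slots the
 * requesting slave wants to serve, `request_config_epoch` is the
 * configEpoch it claims for its master, and `slot_owner_epoch[j]` is the
 * configEpoch of the node we currently believe serves slot j.
 * Returns 1 if the vote can be granted, 0 if the request is stale. */
int vote_allowed(const unsigned char *claimed_slots,
                 uint64_t request_config_epoch,
                 const uint64_t *slot_owner_epoch, int num_slots) {
    assert(claimed_slots != NULL && slot_owner_epoch != NULL);
    for (int j = 0; j < num_slots; j++) {
        if (!bitmap_test(claimed_slots, j)) continue;  /* slot not claimed */
        if (slot_owner_epoch[j] > request_config_epoch) return 0;
    }
    return 1;
}
```

Slots the slave does not claim are ignored, so a newer epoch elsewhere in the cluster does not block the vote.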
antirez
26ea55b7f5 Cluster: fix slave data age computation when master is still connected. 2013-10-07 16:07:13 +02:00
antirez
acd9ec222e Cluster: log message improved when FAIL is cleared from a slave node. 2013-10-07 15:44:58 +02:00
antirez
e9b8b30c81 Cluster: slave nodes advertise master slots bitmap and configEpoch. 2013-10-07 11:31:12 +02:00
antirez
dbf6c85d5e Cluster: new clusterDoBeforeSleep() API.
The new API is able to remember operations to perform before returning
to the event loop, such as checking if there is the failover quorum for
a slave, saving and fsyncing the configuration file, and so forth.

Because these operations are performed before returning to the event
loop we are sure that messages that are sent in the same event loop run
will be delivered *after* the configuration is already saved, which is
sometimes a requirement. For instance we want to publish a new epoch only
when it is already stored in nodes.conf in order to avoid going back
in the logical clock when a node is restarted.

This new API provides a big performance advantage compared to saving and
possibly fsyncing the configuration file multiple times in the same
event loop run, especially in the case of big clusters with tens or
hundreds of nodes.
2013-10-03 09:58:06 +02:00
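The deferral described above can be sketched with a bitmask of pending operations that is drained once per event loop run. The flag names, the `cluster_state` struct, and the counters standing in for real work are all assumptions for this example.

```c
#include <assert.h>
#include <stddef.h>

#define TODO_HANDLE_FAILOVER (1 << 0)
#define TODO_SAVE_CONFIG     (1 << 1)
#define TODO_FSYNC_CONFIG    (1 << 2)

/* Simplified state: pending flags plus counters that stand in for the
 * real operations so the effect of coalescing is observable. */
typedef struct {
    int todo_before_sleep;
    int failover_checks;
    int saves;
    int fsyncs;
} cluster_state;

/* Remember that an operation is needed; calling it many times in the
 * same event loop run still results in a single execution. */
void do_before_sleep(cluster_state *cs, int flags) {
    assert(cs != NULL);
    cs->todo_before_sleep |= flags;
}

/* Called once before returning to the event loop: run what was requested,
 * then clear the pending flags. */
void before_sleep(cluster_state *cs) {
    assert(cs != NULL);
    if (cs->todo_before_sleep & TODO_HANDLE_FAILOVER) cs->failover_checks++;
    if (cs->todo_before_sleep & TODO_SAVE_CONFIG) {
        cs->saves++;
        if (cs->todo_before_sleep & TODO_FSYNC_CONFIG) cs->fsyncs++;
    }
    cs->todo_before_sleep = 0;
}
```

Requesting a save three times in one run produces one save and at most one fsync, which is where the performance gain for large clusters comes from.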
antirez
43f3df99c8 Cluster: update cluster config when slave changes master. 2013-10-02 12:27:12 +02:00
antirez
5cbb913994 Cluster: bus messages stats in CLUSTER info. 2013-10-02 10:10:08 +02:00
antirez
90b06ab7b5 Cluster: FAIL messages from unknown senders are handled better.
Previously the event was not logged but instead the node reported an
unknown packet type received.
2013-10-02 09:42:45 +02:00
antirez
3be5010adb Cluster: senderCurrentEpoch == node currentEpoch was too strict.
We can accept a vote as long as its epoch is >= the epoch at which we
started the voting process. There is no need for it to be exactly the
same.
2013-10-01 17:21:28 +02:00
antirez
0000cfbf38 Cluster: fix typo in clusterProcessPacket() comment. 2013-10-01 15:40:20 +02:00
antirez
6ed0dee927 Cluster: time field removed from cluster messages header.
The new algorithm does not check replies time as checking for the
currentEpoch in the reply ensures that the reply is about the current
election process.
2013-09-30 16:19:44 +02:00
antirez
60d4ae49be Cluster: log message shortened. 2013-09-30 11:51:58 +02:00
antirez
1239f49065 Cluster: detect cluster reconfiguration when master slots drop to 0.
The old algorithm used a PROMOTED flag and explicitly checked for
slave->master conversions. With the new cluster meta-data propagation
algorithm we just look at the configEpoch to check if we need to
reconfigure slots, then:

1) If a node is a master but it reaches zero served slots because of
reconfiguration.
2) If a node is a slave but the master reaches zero served slots because
of a reconfiguration.

In both cases we switch to being a replica of the new slots owner.
2013-09-30 11:45:26 +02:00
antirez
2a391b8bac Cluster: re-order failover operations to make it safer.
We need to:

1) Increment the configEpoch.
2) Save it to disk and fsync the file.
3) Broadcast the PONG with the new configuration.

If other nodes receive the updated configuration we need to be sure
to restart with this new config in the event of a crash.
2013-09-30 10:16:48 +02:00
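The ordering of the three steps above can be sketched as follows. The struct and the two stand-in functions are hypothetical: they only record which epoch was persisted and which was announced, so the invariant "never announce what is not on disk" is easy to see.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    unsigned long long config_epoch;     /* in-memory value */
    unsigned long long persisted_epoch;  /* stand-in for nodes.conf content */
    unsigned long long announced_epoch;  /* stand-in for what peers were told */
} failover_state;

/* Stand-in for saving the config and fsyncing it; always succeeds here. */
static int persist_config(failover_state *s) {
    s->persisted_epoch = s->config_epoch;
    return 0;
}

/* Stand-in for broadcasting a PONG with the new configuration. */
static void announce_config(failover_state *s) {
    s->announced_epoch = s->config_epoch;
}

/* Returns 0 on success, -1 if the config could not be made durable. */
int promote_with_safe_ordering(failover_state *s) {
    assert(s != NULL);
    s->config_epoch++;                      /* 1) increment the configEpoch */
    if (persist_config(s) != 0) return -1;  /* 2) save and fsync first */
    announce_config(s);                     /* 3) only then broadcast */
    return 0;
}
```

If the process crashes between steps 2 and 3, it restarts with the new epoch and no peer has seen anything that is missing from disk.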
antirez
0b63dc2841 Cluster: when updating the configEpoch for a node, save config on disk ASAP. 2013-09-30 10:16:25 +02:00
antirez
5d393adeac Cluster: fsync data when saving the cluster config. 2013-09-30 10:13:07 +02:00
antirez
8fa4e7817a Cluster: update the node configEpoch when newer is detected. 2013-09-27 09:55:41 +02:00