599 Commits

Author SHA1 Message Date
antirez
0f079966c7 Cluster: masters don't vote for a slave with stale config.
When a slave requests our vote, the configEpoch he claims for its master
and the set of served slots must be greater or equal to the configEpoch
of the nodes serving these slots in the current configuraiton of the
master granting its vote.

In other terms, masters don't vote for slaves having a stale
configuration for the slots they want to serve.
2013-10-08 12:45:35 +02:00
antirez
26ea55b7f5 Cluster: fix slave data age computation when master is still connected. 2013-10-07 16:07:13 +02:00
antirez
acd9ec222e Cluster: log message improved when FAIL is cleared from a slave node. 2013-10-07 15:44:58 +02:00
antirez
e9b8b30c81 Cluster: slave nodes advertise master slots bitmap and configEpoch. 2013-10-07 11:31:12 +02:00
antirez
dbf6c85d5e Cluster: new clusterDoBeforeSleep() API.
The new API is able to remember operations to perform before returning
to the event loop, such as checking if there is the failover quorum for
a slave, save and fsync the configuraiton file, and so forth.

Because this operations are performed before returning on the event
loop we are sure that messages that are sent in the same event loop run
will be delivered *after* the configuration is already saved, that is a
requirement sometimes. For instance we want to publish a new epoch only
when it is already stored in nodes.conf in order to avoid returning back
in the logical clock when a node is restarted.

This new API provides a big performance advantage compared to saving and
possibly fsyncing the configuration file multiple times in the same
event loop run, especially in the case of big clusters with tens or
hundreds of nodes.
2013-10-03 09:58:06 +02:00
antirez
43f3df99c8 Cluster: update cluster config when slave changes master. 2013-10-02 12:27:12 +02:00
antirez
5cbb913994 Cluster: bus messages stats in CLUSTER info. 2013-10-02 10:10:08 +02:00
antirez
90b06ab7b5 Cluster: FAIL messages from unknown senders are handled better.
Previously the event was not logged but instead the node reported an
unknown packet type received.
2013-10-02 09:42:45 +02:00
antirez
3be5010adb Cluster: senderCurrentEpoch == node currentEpoch was too strict.
We can accept a vote as long as its epoch is >= the epoch at which we
started the voting process. There is no need for it to be exactly the
same.
2013-10-01 17:21:28 +02:00
antirez
0000cfbf38 Cluster: fix typo in clusterProcessPacket() comment. 2013-10-01 15:40:20 +02:00
antirez
6ed0dee927 Cluster: time field removed from cluster messages header.
The new algorithm does not check replies time as checking for the
currentEpoch in the reply ensures that the reply is about the current
election process.
2013-09-30 16:19:44 +02:00
antirez
60d4ae49be Cluster: log message shortened. 2013-09-30 11:51:58 +02:00
antirez
1239f49065 Cluster: detect cluster reconfiguration when master slots drop to 0.
The old algorithm used a PROMOTED flag and explicitly checks about
slave->master convertions. Wit the new cluster meta-data propagation
algorithm we just look at the configEpoch to check if we need to
reconfigure slots, then:

1) If a node is a master but it reaches zero served slots becuase of
reconfiguration.
2) If a node is a slave but the master reaches zero served slots because
of a reconfiguration.

We switch as a replica of the new slots owner.
2013-09-30 11:45:26 +02:00
antirez
2a391b8bac Cluster: re-order failover operations to make it safer.
We need to:

1) Increment the configEpoch.
2) Save it to disk and fsync the file.
3) Broadcast the PONG with the new configuration.

If other nodes will receive the updated configuration we need to be sure
to restart with this new config in the event of a crash.
2013-09-30 10:16:48 +02:00
antirez
0b63dc2841 Cluster: when upading the configEpoch for a node, save config on disk ASAP. 2013-09-30 10:16:25 +02:00
antirez
5d393adeac Cluster: fsync data when saving the cluster config. 2013-09-30 10:13:07 +02:00
antirez
8fa4e7817a Cluster: update the node configEpoch when newer is detected. 2013-09-27 09:55:41 +02:00
antirez
c8d6bc94e4 Cluster: react faster when a slave wins an election. 2013-09-26 16:54:43 +02:00
antirez
7dfa4c5981 Cluster: removed an old source of delay to start the slave failover. 2013-09-26 13:28:19 +02:00
antirez
3bd69bcdf1 Cluster: master node now uses new protocol to vote. 2013-09-26 13:00:41 +02:00
antirez
f941650091 Cluster: slave node now uses the new protocol to get elected. 2013-09-26 11:13:17 +02:00
antirez
24b2894194 Cluster: add currentEpoch to CLUSTER INFO. 2013-09-25 12:38:36 +02:00
antirez
6dbd939a24 Cluster: update our currentEpoch when a greater one is seen. 2013-09-25 12:36:29 +02:00
antirez
1adf457b5b Cluster: broadcast currentEpoch and configEpoch in packets header. 2013-09-25 11:53:35 +02:00
antirez
cdf4eede58 Cluster: configEpoch added in cluster nodes description. 2013-09-25 11:47:13 +02:00
antirez
3a9bf5e618 Cluster: PFAIL -> FAIL transition allowed for slaves.
First change: now there is no need to be a master in order to detect a
failure, however the majority of masters signaling PFAIL or FAIL is needed.

This change is important because it allows slaves rejoining the cluster
after a partition to sense the FAIL condition so that eventually all the
nodes agree on failures.
2013-09-20 11:26:44 +02:00
antirez
3f5034d1d7 Cluster: added time field in cluster bus messages.
The time is sent in requests, and copied back in reply packets.
This way the receiver can compare the time field in a reply with its
local clock and check the age of the request associated with this reply.

This is an easy way to discard delayed replies. Note that only a clock
is used here, that is the one of the node sending the packet. The
receiver only copies the field back into the reply, so no
synchronization is needed between clocks of different hosts.
2013-09-20 09:22:21 +02:00
antirez
c7cb80c8bb Cluster: don't add an handshake node for the same ip:port pair multiple times. 2013-09-04 15:52:16 +02:00
antirez
79a1deac28 Cluster: free HANDSHAKE nodes after node_timeout.
Handshake nodes should turn into normal nodes or be freed in a
reasonable amount of time, otherwise they'll keep accumulating if the
address they are associated with is not reachable for some reason.
2013-09-04 12:41:21 +02:00
antirez
232f84ec2d Cluster: CLUSTER SAVECONFIG command added. 2013-09-04 10:33:00 +02:00
antirez
61eb16c4da Cluster: don't save HANDSHAKE nodes in nodes.conf. 2013-09-04 10:25:26 +02:00
antirez
6e460eac58 Cluster: always use safe iteartors to iterate server.cluster->nodes. 2013-09-04 10:07:50 +02:00
antirez
853defe071 Cluster: clusterReadHandler() reworked to be more correct and simpler to follow. 2013-09-03 11:43:52 +02:00
antirez
e307150c21 Cluster: use non-blocking I/O for the cluster bus. 2013-09-03 11:43:52 +02:00
antirez
d3726385c2 Cluster: fixed a bug in clusterSendPublish() due to inverted statements.
The code used to copy the header *after* the 'hdr' pointer was already
switched to the new buffer. Of course we need to do the reverse.
2013-09-03 11:43:43 +02:00
antirez
33b286bf68 Don't update node pong time via gossip.
This feature was implemented in the initial days of the Redis Cluster
implementaiton but is not a good idea at all.

1) It depends on clocks to be synchronized, that is already very bad.
2) Moreover it adds a bug where the pong time is updated via gossip so
no new PING is ever sent by the current node, with the effect of no PONG
received, no update of tables, no clearing of PFAIL flag.

In general to trust other nodes about the reachability of other nodes is
a broken distributed programming model.
2013-08-26 16:16:25 +02:00
antirez
d7d90442f5 Cluster: set event handler in cluster bus listening socket.
The commit using listenToPort() introduced this bug by no longer
creating the event handler to handle incoming messages from the cluster
bus.
2013-08-22 14:53:53 +02:00
antirez
a8dc4ecd21 Use listenToPort() in cluster.c as well. 2013-08-22 14:05:07 +02:00
antirez
f45f05531d Cluster: fix CLUSTER MEET ip address validation.
This was broken by the IPv6 support patches.
2013-08-22 11:54:28 +02:00
antirez
77c71b2046 Cluster: process MEET packets as PING packets.
Somewhat a previous commit broken this so CLUSTER MEET was no longer
working.
2013-08-22 11:53:28 +02:00
antirez
487951c9b4 Use a safe dict.c iterator in clusterCron(). 2013-08-21 15:51:15 +02:00
antirez
fc11a99390 sdsrange() does not need to return a value.
Actaully the string is modified in-place and a reallocation is never
needed, so there is no need to return the new sds string pointer as
return value of the function, that is now just "void".
2013-07-24 11:21:39 +02:00
antirez
aa32f92338 Introduction of a new string encoding: EMBSTR
Previously two string encodings were used for string objects:

1) REDIS_ENCODING_RAW: a string object with obj->ptr pointing to an sds
stirng.

2) REDIS_ENCODING_INT: a string object where the obj->ptr void pointer
is casted to a long.

This commit introduces a experimental new encoding called
REDIS_ENCODING_EMBSTR that implements an object represented by an sds
string that is not modifiable but allocated in the same memory chunk as
the robj structure itself.

The chunk looks like the following:

+--------------+-----------+------------+--------+----+
| robj data... | robj->ptr | sds header | string | \0 |
+--------------+-----+-----+------------+--------+----+
                     |                       ^
                     +-----------------------+

The robj->ptr points to the contiguous sds string data, so the object
can be manipulated with the same functions used to manipulate plan
string objects, however we need just on malloc and one free in order to
allocate or release this kind of objects. Moreover it has better cache
locality.

This new allocation strategy should benefit both the memory usage and
the performances. A performance gain between 60 and 70% was observed
during micro-benchmarks, however there is more work to do to evaluate
the performance impact and the memory usage behavior.
2013-07-22 10:31:38 +02:00
antirez
e4d2e6fc9d All IP string repr buffers are now REDIS_IP_STR_LEN bytes. 2013-07-09 11:32:52 +02:00
Geoff Garside
5d702e012e Mark places that might want changing for IPv6.
Any places which I feel might want to be updated to work differently
with IPv6 have been marked with a comment starting "IPV6:".

Currently the only comments address places where an IP address is
combined with a port using the standard : separated form. These may want
to be changed when printing IPv6 addresses to wrap the address in []
such as

	[2001:db8::c0:ffee]:6379

instead of

	2001:db8::c0:ffee:6379

as the latter format is a technically valid IPv6 address and it is hard
to distinguish the IPv6 address component from the port unless you know
the port is supposed to be there.
2013-07-08 15:58:14 +02:00
Geoff Garside
15e37522ff Mark ip string buffers which could be reduced.
In two places buffers have been created with a size of 128 bytes which
could be reduced to INET6_ADDRSTRLEN to still hold a full IP address.
These places have been marked as they are presently big enough to handle
the needs of storing a printable IPv6 address.
2013-07-08 15:57:23 +02:00
Geoff Garside
9c994de435 Update clusterCommand to handle AF_INET6 addresses
Changes the sockaddr_in to a sockaddr_storage. Attempts to convert the
IP address into an AF_INET or AF_INET6 before returning an "Invalid IP
address" error. Handles converting the sockaddr from either AF_INET or
AF_INET6 back into a string for storage in the clusterNode ip field.
2013-07-08 15:57:23 +02:00
Geoff Garside
241e41a527 Update node2IpString to handle AF_INET6 addresses.
Change the sockaddr_in to sockaddr_storage which is capable of storing
both AF_INET and AF_INET6 sockets. Uses the sockaddr_storage ss_family
to correctly return the printable IP address and port.

Function makes the assumption that the buffer is of at least
REDIS_CLUSTER_IPLEN bytes in size.
2013-07-08 15:57:23 +02:00
Geoff Garside
cc9c474c60 Add missing includes for getpeername.
getpeername(2) requires <sys/socket.h> which on some systems also
requires <sys/types.h>. Include both to avoid compilation warnings.
2013-07-08 15:55:39 +02:00
Geoff Garside
a6c9ad267c Add macro to define clusterNode.ip buffer size.
Add REDIS_CLUSTER_IPLEN macro to define the size of the clusterNode ip
character array. Additionally use this macro in inet_ntop(3) calls where
the size of the array was being defined manually.

The REDIS_CLUSTER_IPLEN is defined as INET_ADDRSTRLEN which defines the
correct size of a buffer to store an IPv4 address in. The
INET_ADDRSTRLEN macro itself is defined in the <netinet/in.h> header
file and should be portable across the majority of systems.
2013-07-08 15:55:39 +02:00