7364 Commits

Author SHA1 Message Date
antirez
271733f4f8 Cluster: discard pong times in the future.
However we allow for 500 milliseconds of tolerance, in order to
avoid often discarding semantically valid info (the node is up)
because of natural few milliseconds desync among servers even when
NTP is used.

Note that anyway we should ping the node from time to time regardless and
discover if it's actually down from our point of view, since no update
is accepted while we have an active ping on the node.

Related to #3929.
2017-04-15 10:12:08 +02:00
antirez
4d4f22843c Cluster: discard pong times in the future.
However we allow for 500 milliseconds of tolerance, in order to
avoid often discarding semantically valid info (the node is up)
because of natural few milliseconds desync among servers even when
NTP is used.

Note that anyway we should ping the node from time to time regardless and
discover if it's actually down from our point of view, since no update
is accepted while we have an active ping on the node.

Related to #3929.
2017-04-15 10:12:08 +02:00
antirez
02777bb252 Cluster: always add PFAIL nodes at end of gossip section.
To rely on the fact that nodes in PFAIL state will be shared around by
randomly adding them in the gossip section is a weak assumption,
especially after changes related to sending less ping/pong packets.

We want to always include gossip entries for all the nodes that are in
PFAIL state, so that the PFAIL -> FAIL state promotion can happen much
faster and reliably.

Related to #3929.
2017-04-14 13:39:49 +02:00
antirez
0530c7e564 Cluster: always add PFAIL nodes at end of gossip section.
To rely on the fact that nodes in PFAIL state will be shared around by
randomly adding them in the gossip section is a weak assumption,
especially after changes related to sending less ping/pong packets.

We want to always include gossip entries for all the nodes that are in
PFAIL state, so that the PFAIL -> FAIL state promotion can happen much
faster and reliably.

Related to #3929.
2017-04-14 13:39:49 +02:00
antirez
8c829d9e43 Cluster: fix gossip section ping/pong times encoding.
The gossip section times are 32 bit, so cannot store the milliseconds
time but just the seconds approximation, which is good enough for our
uses. At the same time however, when comparing the gossip section times
of other nodes with our node's view, we need to convert back to
milliseconds.

Related to #3929. Without this change the patch to reduce the traffic in
the bus message does not work.
2017-04-14 11:01:22 +02:00
antirez
06de4577d6 Cluster: fix gossip section ping/pong times encoding.
The gossip section times are 32 bit, so cannot store the milliseconds
time but just the seconds approximation, which is good enough for our
uses. At the same time however, when comparing the gossip section times
of other nodes with our node's view, we need to convert back to
milliseconds.

Related to #3929. Without this change the patch to reduce the traffic in
the bus message does not work.
2017-04-14 11:01:22 +02:00
antirez
6878a3fedd Cluster: add clean-logs command to create-cluster script. 2017-04-14 10:52:00 +02:00
antirez
81a36f41ec Cluster: add clean-logs command to create-cluster script. 2017-04-14 10:52:00 +02:00
antirez
8f7bf2841a Cluster: decrease ping/pong traffic by trusting other nodes reports.
Cluster of bigger sizes tend to have a lot of traffic in the cluster bus
just for failure detection: a node will try to get a ping reply from
another node no longer than when the half the node timeout would elapsed,
in order to avoid a false positive.

However this means that if we have N nodes and the node timeout is set
to, for instance M seconds, we'll have to ping N nodes every M/2
seconds. This N*M/2 pings will receive the same number of pongs, so
a total of N*M packets per node. However given that we have a total of N
nodes doing this, the total number of messages will be N*N*M.

In a 100 nodes cluster with a timeout of 60 seconds, this translates
to a total of 100*100*30 packets per second, summing all the packets
exchanged by all the nodes.

This is, as you can guess, a lot... So this patch changes the
implementation in a very simple way in order to trust the reports of
other nodes: if a node A reports a node B as alive at least up to
a given time, we update our view accordingly.

The problem with this approach is that it could result into a subset of
nodes being able to reach a given node X, and preventing others from
detecting that is actually not reachable from the majority of nodes.
So the above algorithm is refined by trusting other nodes only if we do
not have currently a ping pending for the node X, and if there are no
failure reports for that node.

Since each node, anyway, pings 10 other nodes every second (one node
every 100 milliseconds), anyway eventually even trusting the other nodes
reports, we will detect if a given node is down from our POV.

Now to understand the number of packets that the cluster would exchange
for failure detection with the patch, we can start considering the
random PINGs that the cluster sent anyway as base line:
Each node sends 10 packets per second, so the total traffic if no
additioal packets would be sent, including PONG packets, would be:

    Total messages per second = N*10*2

However by trusting other nodes gossip sections will not AWALYS prevent
pinging nodes for the "half timeout reached" rule all the times. The
math involved in computing the actual rate as N and M change is quite
complex and depends also on another parameter, which is the number of
entries in the gossip section of PING and PONG packets. However it is
possible to compare what happens in cluster of different sizes
experimentally. After applying this patch a very important reduction in
the number of packets exchanged is trivial to observe, without apparent
impacts on the failure detection performances.

Actual numbers with different cluster sizes should be published in the
Reids Cluster documentation in the future.

Related to #3929.
2017-04-14 10:43:53 +02:00
antirez
de2eed4838 Cluster: decrease ping/pong traffic by trusting other nodes reports.
Cluster of bigger sizes tend to have a lot of traffic in the cluster bus
just for failure detection: a node will try to get a ping reply from
another node no longer than when the half the node timeout would elapsed,
in order to avoid a false positive.

However this means that if we have N nodes and the node timeout is set
to, for instance M seconds, we'll have to ping N nodes every M/2
seconds. This N*M/2 pings will receive the same number of pongs, so
a total of N*M packets per node. However given that we have a total of N
nodes doing this, the total number of messages will be N*N*M.

In a 100 nodes cluster with a timeout of 60 seconds, this translates
to a total of 100*100*30 packets per second, summing all the packets
exchanged by all the nodes.

This is, as you can guess, a lot... So this patch changes the
implementation in a very simple way in order to trust the reports of
other nodes: if a node A reports a node B as alive at least up to
a given time, we update our view accordingly.

The problem with this approach is that it could result into a subset of
nodes being able to reach a given node X, and preventing others from
detecting that is actually not reachable from the majority of nodes.
So the above algorithm is refined by trusting other nodes only if we do
not have currently a ping pending for the node X, and if there are no
failure reports for that node.

Since each node, anyway, pings 10 other nodes every second (one node
every 100 milliseconds), anyway eventually even trusting the other nodes
reports, we will detect if a given node is down from our POV.

Now to understand the number of packets that the cluster would exchange
for failure detection with the patch, we can start considering the
random PINGs that the cluster sent anyway as base line:
Each node sends 10 packets per second, so the total traffic if no
additioal packets would be sent, including PONG packets, would be:

    Total messages per second = N*10*2

However by trusting other nodes gossip sections will not AWALYS prevent
pinging nodes for the "half timeout reached" rule all the times. The
math involved in computing the actual rate as N and M change is quite
complex and depends also on another parameter, which is the number of
entries in the gossip section of PING and PONG packets. However it is
possible to compare what happens in cluster of different sizes
experimentally. After applying this patch a very important reduction in
the number of packets exchanged is trivial to observe, without apparent
impacts on the failure detection performances.

Actual numbers with different cluster sizes should be published in the
Reids Cluster documentation in the future.

Related to #3929.
2017-04-14 10:43:53 +02:00
antirez
c5d6f577f0 Cluster: collect more specific bus messages stats.
First step in order to change Cluster in order to use less messages.
Related to issue #3929.
2017-04-13 19:22:35 +02:00
antirez
c628ae114b Cluster: collect more specific bus messages stats.
First step in order to change Cluster in order to use less messages.
Related to issue #3929.
2017-04-13 19:22:35 +02:00
Itamar Haber
b8286d1fc9 Changes command stats iteration to being dict-based
With the addition of modules, looping over the redisCommandTable
misses any added commands. By moving to dictionary iteration this
is resolved.
2017-04-13 17:03:46 +03:00
Itamar Haber
cac5a8b65d Changes command stats iteration to being dict-based
With the addition of modules, looping over the redisCommandTable
misses any added commands. By moving to dictionary iteration this
is resolved.
2017-04-13 17:03:46 +03:00
antirez
104584b95e Fix typo in feedReplicationBacklog() top comment. 2017-04-12 12:28:05 +02:00
antirez
341adb516b Fix typo in feedReplicationBacklog() top comment. 2017-04-12 12:28:05 +02:00
antirez
1210af3804 Add a top comment in crucial functions inside networking.c. 2017-04-12 10:12:27 +02:00
antirez
6081a5873c Add a top comment in crucial functions inside networking.c. 2017-04-12 10:12:27 +02:00
antirez
4a850be4dc Set lua-time-limit default value at safe place.
Otherwise, as it was, it will overwrite whatever the user set.

Close #3703.
2017-04-11 16:56:00 +02:00
antirez
7c415014b0 Set lua-time-limit default value at safe place.
Otherwise, as it was, it will overwrite whatever the user set.

Close #3703.
2017-04-11 16:56:00 +02:00
antirez
f47607af02 Fix preprocessor if/else chain broken in order to fix #3927. 2017-04-11 16:54:27 +02:00
antirez
ffce9ebf9b Fix preprocessor if/else chain broken in order to fix #3927. 2017-04-11 16:54:27 +02:00
antirez
74720ea993 Merge branch 'unstable' of github.com:/antirez/redis into unstable 2017-04-11 16:45:49 +02:00
antirez
de3e670282 Merge branch 'unstable' of github.com:/antirez/redis into unstable 2017-04-11 16:45:49 +02:00
antirez
aa5b4be02e Fix zmalloc_get_memory_size() ifdefs to actually use the else branch.
Close #3927.
2017-04-11 16:45:11 +02:00
antirez
7e0c3177d6 Fix zmalloc_get_memory_size() ifdefs to actually use the else branch.
Close #3927.
2017-04-11 16:45:11 +02:00
Salvatore Sanfilippo
69ce5c5d10 Merge pull request #3924 from lorneli/unstable
Expire: Update comment of activeExpireCycle function
2017-04-11 16:31:55 +02:00
Salvatore Sanfilippo
3b0a442dd2 Merge pull request #3924 from lorneli/unstable
Expire: Update comment of activeExpireCycle function
2017-04-11 16:31:55 +02:00
antirez
531647bb1b Make more obvious why there was issue #3843. 2017-04-10 13:17:05 +02:00
antirez
71c350f73b Make more obvious why there was issue #3843. 2017-04-10 13:17:05 +02:00
Salvatore Sanfilippo
01b6966afc Merge pull request #3843 from dvirsky/fix_bc_free
fixed free of blocked client before refering to it
2017-04-10 13:14:52 +02:00
Salvatore Sanfilippo
cccd21d5c6 Merge pull request #3843 from dvirsky/fix_bc_free
fixed free of blocked client before refering to it
2017-04-10 13:14:52 +02:00
antirez
ffefc9f92d Fix modules blocking commands awake delay.
If a thread unblocks a client blocked in a module command, by using the
RedisMdoule_UnblockClient() API, the event loop may not be awaken until
the next timeout of the multiplexing API or the next unrelated I/O
operation on other clients. We actually want the client to be served
ASAP, so a mechanism is needed in order for the unblocking API to inform
Redis that there is a client to serve ASAP.

This commit fixes the issue using the old trick of the pipe: when a
client needs to be unblocked, a byte is written in a pipe. When we run
the list of clients blocked in modules, we consume all the bytes
written in the pipe. Writes and reads are performed inside the context
of the mutex, so no race is possible in which we consume the bytes that
are actually related to an awake request for a client that should still
be put into the list of clients to unblock.

It was verified that after the fix the server handles the blocked
clients with the expected short delay.

Thanks to @dvirsky for understanding there was such a problem and
reporting it.
2017-04-10 09:33:21 +02:00
antirez
60ffbb72b4 Fix modules blocking commands awake delay.
If a thread unblocks a client blocked in a module command, by using the
RedisMdoule_UnblockClient() API, the event loop may not be awaken until
the next timeout of the multiplexing API or the next unrelated I/O
operation on other clients. We actually want the client to be served
ASAP, so a mechanism is needed in order for the unblocking API to inform
Redis that there is a client to serve ASAP.

This commit fixes the issue using the old trick of the pipe: when a
client needs to be unblocked, a byte is written in a pipe. When we run
the list of clients blocked in modules, we consume all the bytes
written in the pipe. Writes and reads are performed inside the context
of the mutex, so no race is possible in which we consume the bytes that
are actually related to an awake request for a client that should still
be put into the list of clients to unblock.

It was verified that after the fix the server handles the blocked
clients with the expected short delay.

Thanks to @dvirsky for understanding there was such a problem and
reporting it.
2017-04-10 09:33:21 +02:00
antirez
91999fce40 Rax library updated.
Important bugs fixed.
2017-04-08 17:31:13 +02:00
antirez
f9db3144d6 Rax library updated.
Important bugs fixed.
2017-04-08 17:31:13 +02:00
lorneli
98db5739cc Expire: Update comment of activeExpireCycle function
The macro REDIS_EXPIRELOOKUPS_TIME_PERC has been replaced by
ACTIVE_EXPIRE_CYCLE_SLOW_TIME_PERC in commit
6500fabfb881a7ffaadfbff74ab801c55d4591fc.
2017-04-08 15:15:24 +08:00
lorneli
e8b44eb33a Expire: Update comment of activeExpireCycle function
The macro REDIS_EXPIRELOOKUPS_TIME_PERC has been replaced by
ACTIVE_EXPIRE_CYCLE_SLOW_TIME_PERC in commit
6500fabfb881a7ffaadfbff74ab801c55d4591fc.
2017-04-08 15:15:24 +08:00
antirez
3f9e2322ec Rax library updated. 2017-04-07 08:46:39 +02:00
antirez
fcad87788a Rax library updated. 2017-04-07 08:46:39 +02:00
antirez
1409c545da Cluster: hash slots tracking using a radix tree. 2017-03-27 16:37:22 +02:00
antirez
de52b6375b Cluster: hash slots tracking using a radix tree. 2017-03-27 16:37:22 +02:00
itamar
443f279a3a Sets up fake client to select current db in RM_Call() 2017-03-06 14:37:10 +02:00
itamar
a04ba58d9a Sets up fake client to select current db in RM_Call() 2017-03-06 14:37:10 +02:00
Guy Benoish
71a8df6a2b Merge branch 'unstable' of https://github.com/antirez/redis into unstable 2017-03-02 13:25:05 +02:00
Guy Benoish
dcbf01295b Merge branch 'unstable' of https://github.com/antirez/redis into unstable 2017-03-02 13:25:05 +02:00
Dvir Volk
4b2229e4b8 fixed free of blocked client before refering to it 2017-03-01 16:51:01 +02:00
Dvir Volk
c40322945a fixed free of blocked client before refering to it 2017-03-01 16:51:01 +02:00
Salvatore Sanfilippo
9cc83d2ad9 Makefile: fix building with Solaris C compiler, 64 bit. 2017-02-23 16:53:39 +01:00
Salvatore Sanfilippo
1f0dae3c7f Makefile: fix building with Solaris C compiler, 64 bit. 2017-02-23 16:53:39 +01:00