20963 Commits

Author SHA1 Message Date
antirez
75084e057d Sentinel: abort failover when in wait-start if master is back.
When we are a Leader Sentinel in wait-start state, starting with this
commit the failover is aborted if the master returns online.

This improves the way we handle a notable case of net split, that is the
split between Sentinels and Redis servers, that will be a very common
case of split becase Sentinels will often be installed in the client's
network and servers can be in a differnt arm of the network.

When Sentinels and Redis servers are isolated the master is in ODOWN
condition since the Sentinels can agree about this state, however the
failover does not start since there are no good slaves to promote (in
this specific case all the slaves are unreachable).

However when the split is resolved, Sentinels may sense the slave back
a moment before they sense the master is back, so the failover may start
without a good reason (since the master is actually working too).

Now this condition is reversible, so the failover will be aborted
immediately after if the master is detected to be working again, that
is, not in SDOWN nor in ODOWN condition.
2012-07-31 10:19:34 +02:00
antirez
ba383a34e3 Merge remote-tracking branch 'origin/unstable' into unstable 2012-07-28 20:55:17 +02:00
antirez
7f5bdba434 Merge remote-tracking branch 'origin/unstable' into unstable 2012-07-28 20:55:17 +02:00
antirez
756ad8da8b Sentinel: scripts execution engine improved.
We no longer use a vanilla fork+execve but take a queue of jobs of
scripts to execute, with retry on error, timeouts, and so forth.

Currently this is used only for notifications but soon the ability to
also call clients reconfiguration scripts will be added.
2012-07-28 20:54:27 +02:00
antirez
3f194a9d25 Sentinel: scripts execution engine improved.
We no longer use a vanilla fork+execve but take a queue of jobs of
scripts to execute, with retry on error, timeouts, and so forth.

Currently this is used only for notifications but soon the ability to
also call clients reconfiguration scripts will be added.
2012-07-28 20:54:27 +02:00
Jan-Erik Rediger
ac03c5dd89 Include sys/wait.h to avoid compiler warning
gcc warned about an implicit declaration of function 'wait3'. 
Including this header fixes this.
2012-07-28 12:33:01 +03:00
Jan-Erik Rediger
c6c19c8372 Include sys/wait.h to avoid compiler warning
gcc warned about an implicit declaration of function 'wait3'. 
Including this header fixes this.
2012-07-28 12:33:01 +03:00
Salvatore Sanfilippo
7cf677f7d6 Merge pull request #587 from saj/truncate-short-write-from-aof
Truncate short write from the AOF
2012-07-27 03:56:48 -07:00
Salvatore Sanfilippo
ed7b308c1c Merge pull request #587 from saj/truncate-short-write-from-aof
Truncate short write from the AOF
2012-07-27 03:56:48 -07:00
Salvatore Sanfilippo
8736a89ca4 Merge pull request #586 from saj/aof_last_bgrewrite_status
New in INFO: aof_last_bgrewrite_status
2012-07-27 03:55:20 -07:00
Salvatore Sanfilippo
04950a9e4d Merge pull request #586 from saj/aof_last_bgrewrite_status
New in INFO: aof_last_bgrewrite_status
2012-07-27 03:55:20 -07:00
antirez
4fd46a9861 Sentinel: don't start a failover as leader if there is no good slave. 2012-07-26 12:09:40 +02:00
antirez
ce7b838fb9 Sentinel: don't start a failover as leader if there is no good slave. 2012-07-26 12:09:40 +02:00
antirez
4661258e05 Sentinel: ability to execute notification scripts. 2012-07-25 16:33:37 +02:00
antirez
baace5fc42 Sentinel: ability to execute notification scripts. 2012-07-25 16:33:37 +02:00
Salvatore Sanfilippo
0d4941a782 Merge pull request #603 from mrb/fix_sentinel_config_warning
Fix warning in redis.c for sentinel config load
2012-07-25 07:15:53 -07:00
Salvatore Sanfilippo
42c571864e Merge pull request #603 from mrb/fix_sentinel_config_warning
Fix warning in redis.c for sentinel config load
2012-07-25 07:15:53 -07:00
mrb
290344e8d4 Fix warning in redis.c for sentinel config load 2012-07-25 09:55:53 -04:00
mrb
f1c8661e74 Fix warning in redis.c for sentinel config load 2012-07-25 09:55:53 -04:00
antirez
e7975f719a Sentinel: abort failover if no good slave is available.
The previous behavior of the state machine was to wait some time and
retry the slave selection, but this is not robust enough against drastic
changes in the conditions of the monitored instances.

What we do now when the slave selection fails is to abort the failover
and return back monitoring the master. If the ODOWN condition is still
present a new failover will be triggered and so forth.

This commit also refactors the code we use to abort a failover.
2012-07-25 11:32:19 +02:00
antirez
672102c2ce Sentinel: abort failover if no good slave is available.
The previous behavior of the state machine was to wait some time and
retry the slave selection, but this is not robust enough against drastic
changes in the conditions of the monitored instances.

What we do now when the slave selection fails is to abort the failover
and return back monitoring the master. If the ODOWN condition is still
present a new failover will be triggered and so forth.

This commit also refactors the code we use to abort a failover.
2012-07-25 11:32:19 +02:00
antirez
4013599560 Sentinel: reset pending_commands in a more generic way. 2012-07-24 18:57:26 +02:00
antirez
9e5bef38e6 Sentinel: reset pending_commands in a more generic way. 2012-07-24 18:57:26 +02:00
antirez
b94bca01fc Prevent a spurious +sdown event on switch.
When we reset the master we should start with clean timestamps for ping
replies otherwise we'll detect a spurious +sdown event, because on
+master-switch event the previous master instance was probably in +sdown
condition. Since we updated the address we should count time from
scratch again.

Also this commit makes sure to explicitly reset the count of pending
commands, now we can do this because of the new way the hiredis link
is closed.
2012-07-24 18:46:04 +02:00
antirez
a23a5b6c7d Prevent a spurious +sdown event on switch.
When we reset the master we should start with clean timestamps for ping
replies otherwise we'll detect a spurious +sdown event, because on
+master-switch event the previous master instance was probably in +sdown
condition. Since we updated the address we should count time from
scratch again.

Also this commit makes sure to explicitly reset the count of pending
commands, now we can do this because of the new way the hiredis link
is closed.
2012-07-24 18:46:04 +02:00
antirez
66b8296c55 Sentinel: debugging message removed. 2012-07-24 18:20:05 +02:00
antirez
d918e6f127 Sentinel: debugging message removed. 2012-07-24 18:20:05 +02:00
antirez
b41b05c487 Sentinel: changes to connection handling and redirection.
We disconnect the Redis instances hiredis link in a more robust way now.
Also we change the way we perform the redirection for the +switch-master
event, that is not just an instance reset with an address change.

Using the same system we now implement the +redirect-to-master event
that is triggered by an instance that is configured to be master but
found to be a slave at the first INFO reply. In that case we monitor the
master instead, logging the incident as an event.
2012-07-24 18:15:44 +02:00
antirez
75fb6e5b8a Sentinel: changes to connection handling and redirection.
We disconnect the Redis instances hiredis link in a more robust way now.
Also we change the way we perform the redirection for the +switch-master
event, that is not just an instance reset with an address change.

Using the same system we now implement the +redirect-to-master event
that is triggered by an instance that is configured to be master but
found to be a slave at the first INFO reply. In that case we monitor the
master instead, logging the incident as an event.
2012-07-24 18:15:44 +02:00
antirez
87701d32b5 Sentinel: check that instance still exists in reply callbacks.
We can't be sure the instance object still exists when the reply
callback is called.
2012-07-24 16:37:57 +02:00
antirez
2179c26916 Sentinel: check that instance still exists in reply callbacks.
We can't be sure the instance object still exists when the reply
callback is called.
2012-07-24 16:37:57 +02:00
antirez
439e2d101e Sentinel: more robust failover detection as observer.
Sentinel observers detect failover checking if a slave attached to the
monitored master turns into its replication state from slave to master.
However while this change may in theory only happen after a SLAVEOF NO
ONE command, in practie it is very easy to reboot a slave instance with
a wrong configuration that turns it into a master, especially if it was
a past master before a successfull failover.

This commit changes the detection policy so that if an instance goes
from slave to master, but at the same time the runid has changed, we
sense a reboot, and in that case we don't detect a failover at all.

This commit also introduces the "reboot" sentinel event, that is logged
at "warning" level (so this will trigger an admin notification).

The commit also fixes a problem in the disconnect handler that assumed
that the instance object always existed, that is not the case. Now we
no longer assume that redisAsyncFree() will call the disconnection
handler before returning.
2012-07-24 12:42:40 +02:00
antirez
d876d6feac Sentinel: more robust failover detection as observer.
Sentinel observers detect failover checking if a slave attached to the
monitored master turns into its replication state from slave to master.
However while this change may in theory only happen after a SLAVEOF NO
ONE command, in practie it is very easy to reboot a slave instance with
a wrong configuration that turns it into a master, especially if it was
a past master before a successfull failover.

This commit changes the detection policy so that if an instance goes
from slave to master, but at the same time the runid has changed, we
sense a reboot, and in that case we don't detect a failover at all.

This commit also introduces the "reboot" sentinel event, that is logged
at "warning" level (so this will trigger an admin notification).

The commit also fixes a problem in the disconnect handler that assumed
that the instance object always existed, that is not the case. Now we
no longer assume that redisAsyncFree() will call the disconnection
handler before returning.
2012-07-24 12:42:40 +02:00
antirez
43cc64d30e First implementation of Redis Sentinel.
This commit implements the first, beta quality implementation of Redis
Sentinel, a distributed monitoring system for Redis with notification
and automatic failover capabilities.

More info at http://redis.io/topics/sentinel
2012-07-23 13:14:44 +02:00
antirez
6b5daa2df2 First implementation of Redis Sentinel.
This commit implements the first, beta quality implementation of Redis
Sentinel, a distributed monitoring system for Redis with notification
and automatic failover capabilities.

More info at http://redis.io/topics/sentinel
2012-07-23 13:14:44 +02:00
antirez
a7cb2683f0 Merge remote-tracking branch 'origin/unstable' into unstable 2012-07-22 17:18:42 +02:00
antirez
03f412ddef Merge remote-tracking branch 'origin/unstable' into unstable 2012-07-22 17:18:42 +02:00
antirez
30e9bb4ee8 Allow Pub/Sub in contexts where other commands are blocked.
Redis loading data from disk, and a Redis slave disconnected from its
master with serve-stale-data disabled, are two conditions where
commands are normally refused by Redis, returning an error.

However there is no reason to disable Pub/Sub commands as well, given
that this layer does not interact with the dataset. To allow Pub/Sub in
as many contexts as possible is especially interesting now that Redis
Sentinel uses Pub/Sub of a Redis master as a communication channel
between Sentinels.

This commit allows Pub/Sub to be used in the above two contexts where
it was previously denied.
2012-07-22 17:18:16 +02:00
antirez
5d73073f6e Allow Pub/Sub in contexts where other commands are blocked.
Redis loading data from disk, and a Redis slave disconnected from its
master with serve-stale-data disabled, are two conditions where
commands are normally refused by Redis, returning an error.

However there is no reason to disable Pub/Sub commands as well, given
that this layer does not interact with the dataset. To allow Pub/Sub in
as many contexts as possible is especially interesting now that Redis
Sentinel uses Pub/Sub of a Redis master as a communication channel
between Sentinels.

This commit allows Pub/Sub to be used in the above two contexts where
it was previously denied.
2012-07-22 17:18:16 +02:00
antirez
3fbd3b28a0 Don't assume that "char" is signed.
For the C standard char can be either signed or unsigned, it's up to the
compiler, but Redis assumed that it was signed in a few places.

The practical effect of this patch is that now Redis 2.6 will run
correctly in every system where char is unsigned, notably the RaspBerry
PI and other ARM systems with GCC.

Thanks to Georgi Marinov (@eesn on twitter) that reported the problem
and allowed me to use his RaspBerry via SSH to trace and fix the issue!
2012-07-18 12:04:58 +02:00
antirez
b62bdf1c64 Don't assume that "char" is signed.
For the C standard char can be either signed or unsigned, it's up to the
compiler, but Redis assumed that it was signed in a few places.

The practical effect of this patch is that now Redis 2.6 will run
correctly in every system where char is unsigned, notably the RaspBerry
PI and other ARM systems with GCC.

Thanks to Georgi Marinov (@eesn on twitter) that reported the problem
and allowed me to use his RaspBerry via SSH to trace and fix the issue!
2012-07-18 12:04:58 +02:00
Saj Goonatilleke
7b7d54befd Truncate short write from the AOF
If Redis only manages to write out a partial buffer, the AOF file won't
load back into Redis the next time it starts up.  It is better to
discard the short write than waste time running redis-check-aof.
2012-07-18 10:35:17 +10:00
Saj Goonatilleke
55302e9e28 Truncate short write from the AOF
If Redis only manages to write out a partial buffer, the AOF file won't
load back into Redis the next time it starts up.  It is better to
discard the short write than waste time running redis-check-aof.
2012-07-18 10:35:17 +10:00
Saj Goonatilleke
fdda6c5b19 New in INFO: aof_last_bgrewrite_status
Behaves like rdb_last_bgsave_status -- even down to reporting 'ok' when
no rewrite has been done yet.  (You might want to check that
aof_last_rewrite_time_sec is not -1.)
2012-07-18 09:54:55 +10:00
Saj Goonatilleke
48553a29e8 New in INFO: aof_last_bgrewrite_status
Behaves like rdb_last_bgsave_status -- even down to reporting 'ok' when
no rewrite has been done yet.  (You might want to check that
aof_last_rewrite_time_sec is not -1.)
2012-07-18 09:54:55 +10:00
Steeve Lennmark
16609e29b4 Check that we have connection before enabling pipe mode 2012-07-15 14:35:02 +02:00
Steeve Lennmark
e9828cb6f7 Check that we have connection before enabling pipe mode 2012-07-15 14:35:02 +02:00
Saj Goonatilleke
bc235fe40f Bug fix: slaves being pinged every second
REDIS_REPL_PING_SLAVE_PERIOD controls how often the master should
transmit a heartbeat (PING) to its slaves.  This period, which defaults
to 10, is measured in seconds.

Redis 2.4 masters used to ping their slaves every ten seconds, just like
it says on the tin.

The Redis 2.6 masters I have been experimenting with, on the other hand,
ping their slaves *every second*.  (master_last_io_seconds_ago never
approaches 10.)  I think the ping period was inadvertently slashed to
one-tenth of its nominal value around the time REDIS_HZ was introduced.
This commit reintroduces correct ping schedule behaviour.
2012-07-05 14:29:27 +10:00
Saj Goonatilleke
9edfe63553 Bug fix: slaves being pinged every second
REDIS_REPL_PING_SLAVE_PERIOD controls how often the master should
transmit a heartbeat (PING) to its slaves.  This period, which defaults
to 10, is measured in seconds.

Redis 2.4 masters used to ping their slaves every ten seconds, just like
it says on the tin.

The Redis 2.6 masters I have been experimenting with, on the other hand,
ping their slaves *every second*.  (master_last_io_seconds_ago never
approaches 10.)  I think the ping period was inadvertently slashed to
one-tenth of its nominal value around the time REDIS_HZ was introduced.
This commit reintroduces correct ping schedule behaviour.
2012-07-05 14:29:27 +10:00
jokea
1324697567 mark fd as writable when EPOLLERR or EPOLLHUP is returned by epoll_wait. 2012-06-29 12:06:38 +08:00