Fix race in sentinel manual failover test ()

In , we added some SENTINEL DEBUG calls to reduce the default
timeouts and let tests run faster. The change
in 05-manual.tcl can cause a race in which SENTINEL FAILOVER
responds with NOGOODSLAVE:
```
Manual failover works: FAILED: Expected NOGOODSLAVE No suitable replica to promote eq "OK" (context: type eval line 6 cmd {assert {$reply eq "OK"}} proc ::test)
(Jumping to next unit after error)
FAILED: caught an error in the test
assertion:Expected NOGOODSLAVE No suitable replica to promote eq "OK" (context: type eval line 6 cmd {assert {$reply eq "OK"}} proc ::test)
```

The reason is that the info-period value was reduced in 
(the default value is 10000) and the manual failover was then
performed immediately, so INFO may not yet have been exchanged between
the sentinel and the replicas. This causes the sentinel to skip all
the replicas in sentinelSelectSlave (because each replica's info_refresh
has not been updated, see the code snippet below) and return NOGOODSLAVE,
which breaks the test.

Code snippet from sentinelSelectSlave:
```
while((de = dictNext(di)) != NULL) {
    sentinelRedisInstance *slave = dictGetVal(de);
    mstime_t info_validity_time;
    if (master->flags & SRI_S_DOWN)
        info_validity_time = sentinel_ping_period*5;
    else
        info_validity_time = sentinel_info_period*3;
    if (mstime() - slave->info_refresh > info_validity_time) continue;
}
```

Adding a wait_for_condition gives the sentinel a chance to
refresh the replicas' INFO (and thus their info_refresh)
before the test checks the failover result.
Binbin 2023-03-12 19:25:10 +08:00 committed by GitHub
parent 4ba47d2d21
commit 4e7eb16ae7
GPG Key ID: 4AEE18F83AFDEB23

@@ -12,8 +12,21 @@ test "Manual failover works" {
     set old_port [RPort $master_id]
     set addr [S 0 SENTINEL GET-MASTER-ADDR-BY-NAME mymaster]
     assert {[lindex $addr 1] == $old_port}
+
+    # Since we reduced the info-period (default 10000) above immediately,
+    # sentinel - replica may not have enough time to exchange INFO and update
+    # the replica's info-period, so the test may get a NOGOODSLAVE.
+    wait_for_condition 300 50 {
+        [catch {S 0 SENTINEL FAILOVER mymaster}] == 0
+    } else {
+        catch {S 0 SENTINEL FAILOVER mymaster} reply
+        puts [S 0 SENTINEL REPLICAS mymaster]
+        fail "Sentinel manual failover did not work, got: $reply"
+    }
+
     catch {S 0 SENTINEL FAILOVER mymaster} reply
-    assert {$reply eq "OK"}
+    assert_match {*INPROG*} $reply ;# Failover already in progress
+
     foreach_sentinel_id id {
         wait_for_condition 1000 50 {
             [lindex [S $id SENTINEL GET-MASTER-ADDR-BY-NAME mymaster] 1] != $old_port