Fix data loss when a replica does a failover with an old history repl offset (#885)

Our current replica can initiate a failover without restriction when
it detects that the primary node is offline. This is generally not a
problem. However, consider the following scenarios:
1. In slot migration, a primary loses its last slot and then becomes
a replica. While it is still performing a full sync with the new primary,
the new primary goes down.
2. With the CLUSTER REPLICATE command, a replica becomes a replica of another
primary. While it is still performing a full sync with the new primary,
the new primary goes down.

In the above scenarios, case 1 may cause the empty primary to be elected
as the new primary, resulting in primary data loss. Case 2 may cause the
non-empty replica to be elected as the new primary, resulting in data
loss and confusion.

The reason is the cached primary logic, which is used for psync.
In the above scenarios, when clusterSetPrimary is called, myself caches
server.primary in server.cached_primary for psync. In replicationGetReplicaOffset,
we take server.cached_primary->reploff as the offset, gossip it and rank on it,
which causes the replica to use the old historical offset to initiate the
failover; it gets a good rank, starts the election first, and is then
elected as the new primary.

The main problem here is that when the replica has not completed the full
sync, it may get the historical offset in replicationGetReplicaOffset.

The fix is to clear cached_primary in the places where a full sync is
obviously needed, and let the replica use offset == 0 to participate
in the election. In this way, this unhealthy replica gets a worse rank
and is less likely to be elected.

Of course, it is still possible for it to be elected with offset == 0.
In the future, we may need to prohibit replicas with offset == 0
from initiating elections at all.

Another point worth mentioning, in the above cases:
1. In the ROLE command, the replica status will be handshake, and the
offset will be -1.
2. Before this PR, in the CLUSTER SHARDS command, the replica status would
be online, and the offset would be the old cached value (which is wrong).
3. After this PR, in CLUSTER SHARDS, the replica status will be loading,
and the offset will be 0.
Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-08-21 12:42:59 +08:00
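
For context on the offsets and ranks asserted in the tests below, here is a minimal Tcl sketch of the election ranking described above (an illustration of the standard cluster failover rules, not code from this PR; the proc name is made up). A replica forced to offset 0 ranks behind every sibling that advertises real history, so it starts its election later.

proc election_delay_sketch {my_offset sibling_offsets} {
    # Rank = number of sibling replicas advertising a larger replication offset.
    set rank 0
    foreach offset $sibling_offsets {
        if {$offset > $my_offset} { incr rank }
    }
    # Roughly: a fixed 500 ms, up to 500 ms of random jitter, plus 1000 ms per rank.
    return [expr {500 + int(rand() * 500) + $rank * 1000}]
}

With offset 0 every sibling holding data counts against the rank, so the delay is at least one second longer than that of the best sibling, which is why the tests below expect the healthy replica to log rank #0 and win.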

# Allocate slot 0 to the last primary and evenly distribute the remaining
# slots to the remaining primaries.
proc my_slot_allocation {masters replicas} {
    set avg [expr double(16384) / [expr $masters-1]]
    set slot_start 1
    for {set j 0} {$j < $masters-1} {incr j} {
        set slot_end [expr int(ceil(($j + 1) * $avg) - 1)]
        R $j cluster addslotsrange $slot_start $slot_end
        set slot_start [expr $slot_end + 1]
    }
    R [expr $masters-1] cluster addslots 0
}
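# For the 4-primary clusters below, the arithmetic works out to (illustration):
# avg = 16384 / 3 ~= 5461.33, so primary 0 gets slots 1-5461, primary 1 gets
# 5462-10922, primary 2 gets 10923-16383, and primary 3 holds only slot 0,
# so losing that single slot empties it.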

Fix reconfiguring sub-replica causing data loss when myself changes shard_id (#944)

When reconfiguring a sub-replica, there may be a case where the sub-replica
uses the old offset, wins the election, and causes data loss if the old
primary went down.

In this case, the sender is myself's primary. When executing updateShardId,
not only the sender's shard_id is updated, but also the shard_id of
myself, causing the subsequent areInSameShard check, that is, the
full_sync_required check, to fail.

As part of the recent fix of #885, the sub-replica needs to decide whether
a full sync is required or not when switching shards. This shard membership
check is supposed to be done against the sub-replica's current shard_id, which
however was lost in this code path. This then leads to the sub-replica joining
the other shard with a completely different and incorrect replication history.

This is the only place where the replicaof state can be updated on this path,
so the most natural fix is to pull the chain replication reduction logic into
this code block, before the updateShardId call.

This follows #885 and closes #942.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
2024-08-29 22:39:53 +08:00
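
To make the shard-membership decision above concrete, here is a minimal Tcl sketch (an assumption drawn from the commit message, not the server code; the proc name is hypothetical): a replica repointed at a primary in a different shard shares no replication history with it, so a full sync is required and the cached offset must not be reused.

proc full_sync_required_sketch {my_shard_id new_primary_shard_id} {
    # A different shard means a different replication history: require a full
    # sync and report offset 0 until it completes.
    return [expr {$my_shard_id ne $new_primary_shard_id}]
}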

proc get_my_primary_peer {srv_idx} {
    set role_response [R $srv_idx role]
    set primary_ip [lindex $role_response 1]
    set primary_port [lindex $role_response 2]
    set primary_peer "$primary_ip:$primary_port"
    return $primary_peer
}
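# Example ROLE reply from a replica (illustrative values):
#   slave 127.0.0.1 30001 connected 123456
# Index 1 is the primary's IP and index 2 is its port, which is what
# get_my_primary_peer extracts.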

proc test_migrated_replica {type} {
    test "Migrated replica reports zero repl offset and rank, and fails to win election - $type" {
        # Write some data to primary 0, slot 1, make a small repl_offset.
        for {set i 0} {$i < 1024} {incr i} {
            R 0 incr key_991803
        }
        assert_equal {1024} [R 0 get key_991803]

        # Write some data to primary 3, slot 0, make a big repl_offset.
        for {set i 0} {$i < 10240} {incr i} {
            R 3 incr key_977613
        }
        assert_equal {10240} [R 3 get key_977613]

        # Make sure primary 0 will hang in the save by setting a large rdb-key-save-delay.
        R 0 config set rdb-key-save-delay 100000000

        # Move slot 0 from primary 3 to primary 0.
        set addr "[srv 0 host]:[srv 0 port]"
        set myid [R 3 CLUSTER MYID]
        set code [catch {
            exec src/valkey-cli {*}[valkeycli_tls_config "./tests"] --cluster rebalance $addr --cluster-weight $myid=0
        } result]
        if {$code != 0} {
            fail "valkey-cli --cluster rebalance returns non-zero exit code, output below:\n$result"
        }
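        # A rebalance weight of 0 asks valkey-cli to move all of node 3's slots
        # away, so primary 3 has now lost its last slot -- the slot-migration
        # scenario described in #885.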

        # Validate that shard 3's primary and replica can convert to replicas after
        # they lose the last slot.
        R 3 config set cluster-replica-validity-factor 0
        R 7 config set cluster-replica-validity-factor 0
        R 3 config set cluster-allow-replica-migration yes
        R 7 config set cluster-allow-replica-migration yes

        if {$type == "shutdown"} {
            # Shutdown primary 0.
            catch {R 0 shutdown nosave}
        } elseif {$type == "sigstop"} {
            # Pause primary 0.
            set primary0_pid [s 0 process_id]
            pause_process $primary0_pid
        }

        # Wait for the replica to become a primary, and make sure
        # the other primary becomes a replica.
        wait_for_condition 1000 50 {
            [s -4 role] eq {master} &&
            [s -3 role] eq {slave} &&
            [s -7 role] eq {slave}
        } else {
            puts "s -4 role: [s -4 role]"
            puts "s -3 role: [s -3 role]"
            puts "s -7 role: [s -7 role]"
            fail "Failover did not happen"
        }

        # Make sure the offset of server 3 / 7 is 0.
        verify_log_message -3 "*Start of election*offset 0*" 0
        verify_log_message -7 "*Start of election*offset 0*" 0

        # Make sure the right replica gets the best rank (rank #0).
        verify_log_message -4 "*Start of election*rank #0*" 0

        # Wait for the cluster to be ok.
        wait_for_condition 1000 50 {
            [CI 3 cluster_state] eq "ok" &&
            [CI 4 cluster_state] eq "ok" &&
            [CI 7 cluster_state] eq "ok"
        } else {
            puts "R 3: [R 3 cluster info]"
            puts "R 4: [R 4 cluster info]"
            puts "R 7: [R 7 cluster info]"
            fail "Cluster is down"
        }

        # Make sure the keys exist and are consistent.
        R 3 readonly
        R 7 readonly
        wait_for_condition 1000 50 {
            [R 3 get key_991803] == 1024 && [R 3 get key_977613] == 10240 &&
            [R 4 get key_991803] == 1024 && [R 4 get key_977613] == 10240 &&
            [R 7 get key_991803] == 1024 && [R 7 get key_977613] == 10240
        } else {
            puts "R 3: [R 3 keys *]"
            puts "R 4: [R 4 keys *]"
            puts "R 7: [R 7 keys *]"
            fail "Keys not consistent"
        }
if {$type == "sigstop"} {
|
|
|
|
resume_process $primary0_pid
|
|
|
|
|
|
|
|
# Wait for the old primary to go online and become a replica.
|
|
|
|
wait_for_condition 1000 50 {
|
|
|
|
[s 0 role] eq {slave}
|
|
|
|
} else {
|
|
|
|
fail "The old primary was not converted into replica"
|
|
|
|
}
|
|
|
|
}
    }
} ;# proc

start_cluster 4 4 {tags {external:skip cluster} overrides {cluster-node-timeout 1000 cluster-migration-barrier 999}} {
    test_migrated_replica "shutdown"
} my_slot_allocation cluster_allocate_replicas ;# start_cluster

start_cluster 4 4 {tags {external:skip cluster} overrides {cluster-node-timeout 1000 cluster-migration-barrier 999}} {
    test_migrated_replica "sigstop"
} my_slot_allocation cluster_allocate_replicas ;# start_cluster

proc test_nonempty_replica {type} {
    test "New non-empty replica reports zero repl offset and rank, and fails to win election - $type" {
        # Write some data to primary 0, slot 1, make a small repl_offset.
        for {set i 0} {$i < 1024} {incr i} {
            R 0 incr key_991803
        }
        assert_equal {1024} [R 0 get key_991803]

        # Write some data to primary 3, slot 0, make a big repl_offset.
        for {set i 0} {$i < 10240} {incr i} {
            R 3 incr key_977613
        }
        assert_equal {10240} [R 3 get key_977613]

        # Make sure primary 0 will hang in the save by setting a large rdb-key-save-delay.
        R 0 config set rdb-key-save-delay 100000000

        # Make server 7 a replica of server 0.
        R 7 config set cluster-replica-validity-factor 0
        R 7 config set cluster-allow-replica-migration yes
        R 7 cluster replicate [R 0 cluster myid]
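        # Server 7 now needs a full sync with primary 0, but primary 0 is stuck
        # in the delayed RDB save, so the sync cannot complete; per #885 it
        # should report offset 0 rather than its old cached offset while in
        # this state.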

        if {$type == "shutdown"} {
            # Shutdown primary 0.
            catch {R 0 shutdown nosave}
        } elseif {$type == "sigstop"} {
            # Pause primary 0.
            set primary0_pid [s 0 process_id]
            pause_process $primary0_pid
        }

        # Wait for the replica to become a primary.
        wait_for_condition 1000 50 {
            [s -4 role] eq {master} &&
            [s -7 role] eq {slave}
        } else {
            puts "s -4 role: [s -4 role]"
            puts "s -7 role: [s -7 role]"
            fail "Failover did not happen"
        }

        # Make sure server 4 gets the best rank (rank #0) and server 7 reports offset 0.
        verify_log_message -4 "*Start of election*rank #0*" 0
        verify_log_message -7 "*Start of election*offset 0*" 0

        # Wait for the cluster to be ok.
        wait_for_condition 1000 50 {
            [CI 4 cluster_state] eq "ok" &&
            [CI 7 cluster_state] eq "ok"
        } else {
            puts "R 4: [R 4 cluster info]"
            puts "R 7: [R 7 cluster info]"
            fail "Cluster is down"
        }

        # Make sure the key exists and is consistent.
        R 7 readonly
        wait_for_condition 1000 50 {
            [R 4 get key_991803] == 1024 &&
            [R 7 get key_991803] == 1024
        } else {
            puts "R 4: [R 4 get key_991803]"
            puts "R 7: [R 7 get key_991803]"
            fail "Key not consistent"
        }

        if {$type == "sigstop"} {
            resume_process $primary0_pid

            # Wait for the old primary to come back online and become a replica.
            wait_for_condition 1000 50 {
                [s 0 role] eq {slave}
            } else {
                fail "The old primary was not converted into a replica"
            }
        }
    }
} ;# proc

start_cluster 4 4 {tags {external:skip cluster} overrides {cluster-node-timeout 1000 cluster-migration-barrier 999}} {
    test_nonempty_replica "shutdown"
} my_slot_allocation cluster_allocate_replicas ;# start_cluster

start_cluster 4 4 {tags {external:skip cluster} overrides {cluster-node-timeout 1000 cluster-migration-barrier 999}} {
    test_nonempty_replica "sigstop"
} my_slot_allocation cluster_allocate_replicas ;# start_cluster

proc test_sub_replica {type} {
    test "Sub-replica reports zero repl offset and rank, and fails to win election - $type" {
        # Write some data to primary 0, slot 1, make a small repl_offset.
        for {set i 0} {$i < 1024} {incr i} {
            R 0 incr key_991803
        }
        assert_equal {1024} [R 0 get key_991803]

        # Write some data to primary 3, slot 0, make a big repl_offset.
        for {set i 0} {$i < 10240} {incr i} {
            R 3 incr key_977613
        }
        assert_equal {10240} [R 3 get key_977613]

        R 3 config set cluster-replica-validity-factor 0
        R 7 config set cluster-replica-validity-factor 0
        R 3 config set cluster-allow-replica-migration yes
        R 7 config set cluster-allow-replica-migration no

        # Make sure primary 0 will hang in the save by setting a large rdb-key-save-delay.
        R 0 config set rdb-key-save-delay 100000000

        # Move slot 0 from primary 3 to primary 0.
        set addr "[srv 0 host]:[srv 0 port]"
        set myid [R 3 CLUSTER MYID]
        set code [catch {
            exec src/valkey-cli {*}[valkeycli_tls_config "./tests"] --cluster rebalance $addr --cluster-weight $myid=0
        } result]
        if {$code != 0} {
            fail "valkey-cli --cluster rebalance returns non-zero exit code, output below:\n$result"
        }

        # Make sure server 3 and server 7 become replicas of primary 0.
        wait_for_condition 1000 50 {
            [get_my_primary_peer 3] eq $addr &&
            [get_my_primary_peer 7] eq $addr
        } else {
            puts "R 3 role: [R 3 role]"
            puts "R 7 role: [R 7 role]"
            fail "Server 3 and 7 role responses have not changed"
        }

        # Make sure server 7 got a sub-replica log.
        verify_log_message -7 "*I'm a sub-replica!*" 0
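        # A sub-replica is a replica whose primary is itself a replica. Per the
        # #944 message above, the cluster reduces this chain by repointing the
        # sub-replica at the top-level primary, and that path must check shard
        # membership against the sub-replica's own shard_id.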

        if {$type == "shutdown"} {
            # Shutdown primary 0.
            catch {R 0 shutdown nosave}
        } elseif {$type == "sigstop"} {
            # Pause primary 0.
            set primary0_pid [s 0 process_id]
            pause_process $primary0_pid
        }

        # Wait for the replica to become a primary, and make sure
        # the other primary becomes a replica.
        wait_for_condition 1000 50 {
            [s -4 role] eq {master} &&
            [s -3 role] eq {slave} &&
            [s -7 role] eq {slave}
        } else {
            puts "s -4 role: [s -4 role]"
            puts "s -3 role: [s -3 role]"
            puts "s -7 role: [s -7 role]"
            fail "Failover did not happen"
        }

        # Make sure the offset of server 3 / 7 is 0.
        verify_log_message -3 "*Start of election*offset 0*" 0
        verify_log_message -7 "*Start of election*offset 0*" 0

        # Wait for the cluster to be ok.
        wait_for_condition 1000 50 {
            [CI 3 cluster_state] eq "ok" &&
            [CI 4 cluster_state] eq "ok" &&
            [CI 7 cluster_state] eq "ok"
        } else {
            puts "R 3: [R 3 cluster info]"
            puts "R 4: [R 4 cluster info]"
            puts "R 7: [R 7 cluster info]"
            fail "Cluster is down"
        }

        # Make sure the keys exist and are consistent.
        R 3 readonly
        R 7 readonly
        wait_for_condition 1000 50 {
            [R 3 get key_991803] == 1024 && [R 3 get key_977613] == 10240 &&
            [R 4 get key_991803] == 1024 && [R 4 get key_977613] == 10240 &&
            [R 7 get key_991803] == 1024 && [R 7 get key_977613] == 10240
        } else {
            puts "R 3: [R 3 keys *]"
            puts "R 4: [R 4 keys *]"
            puts "R 7: [R 7 keys *]"
            fail "Keys not consistent"
        }

        if {$type == "sigstop"} {
            resume_process $primary0_pid

            # Wait for the old primary to come back online and become a replica.
            wait_for_condition 1000 50 {
                [s 0 role] eq {slave}
            } else {
                fail "The old primary was not converted into a replica"
            }
        }
    }
}

start_cluster 4 4 {tags {external:skip cluster} overrides {cluster-node-timeout 1000 cluster-migration-barrier 999}} {
    test_sub_replica "shutdown"
} my_slot_allocation cluster_allocate_replicas ;# start_cluster

start_cluster 4 4 {tags {external:skip cluster} overrides {cluster-node-timeout 1000 cluster-migration-barrier 999}} {
    test_sub_replica "sigstop"
} my_slot_allocation cluster_allocate_replicas ;# start_cluster

start_cluster 4 4 {tags {external:skip cluster} overrides {cluster-node-timeout 1000 cluster-migration-barrier 999}} {
    test "valkey-cli make source node ignores NOREPLICAS error when doing the last CLUSTER SETSLOT" {
        R 3 config set cluster-allow-replica-migration no
        R 7 config set cluster-allow-replica-migration yes
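        # Node 3 disallows replica migration, so once it is emptied below it
        # should remain a primary with no replicas, while node 7 allows it and
        # should end up as a replica of primary 0 -- exactly what the
        # wait_for_condition at the end of this test verifies.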

        # Move slot 0 from primary 3 to primary 0.
        set addr "[srv 0 host]:[srv 0 port]"
        set myid [R 3 CLUSTER MYID]
        set code [catch {
            exec src/valkey-cli {*}[valkeycli_tls_config "./tests"] --cluster rebalance $addr --cluster-weight $myid=0
        } result]
        if {$code != 0} {
            fail "valkey-cli --cluster rebalance returns non-zero exit code, output below:\n$result"
        }

        # Make sure server 3 lost its replica (server 7) and server 7 becomes a replica of primary 0.
        wait_for_condition 1000 50 {
            [s -3 role] eq {master} &&
            [s -3 connected_slaves] eq 0 &&
            [s -7 role] eq {slave} &&
            [get_my_primary_peer 7] eq $addr
        } else {
            puts "R 3 role: [R 3 role]"
            puts "R 7 role: [R 7 role]"
            fail "Server 3 and 7 role responses have not changed"
        }
    }
} my_slot_allocation cluster_allocate_replicas ;# start_cluster