
This PR introduces the main benefit of dual channel replication: continuously streaming the COB (client output buffer) to the replica in parallel with the RDB, thus keeping the primary-side COB small AND accelerating the overall sync process. By streaming the replication data to the replica during the full sync, we reduce:

1. Memory load on the primary node.
2. CPU load on the primary's main process.

[Latest performance tests](#data)

## Motivation

* Reduce primary memory load. We do that by moving the COB tracking to the replica side. This also decreases the chance of COB overruns. Note that the input buffer limits on the replica side are less restrictive than the primary's COB limits, as the replica plays a less critical part in the replication group. While increasing the primary's COB may push the primary into swap and leave clients suffering, on the replica side we can afford a larger buffer, and a larger COB means a better chance of syncing successfully.
* Reduce the primary main process's CPU load. By opening a new, dedicated connection for the RDB transfer, the child process gets direct access to the new connection. Due to TLS connection restrictions, this was not possible using the one main connection. We eliminate the need for the child process to use the primary's child-proc -> main-proc pipeline, thus freeing up the main process to handle client queries.

## Dual channel replication high-level interface design

- Dual channel replication begins when the replica sends `REPLCONF CAPA DUALCHANNEL` to the primary during the initial handshake. This states that the replica is capable of dual channel sync and that this connection is the replica's main channel, which is not used for snapshot transfer.
- When the replica lacks sufficient data for a PSYNC, the primary sends a `-FULLSYNCNEEDED` response instead of RDB data. As a next step, the replica creates a new connection (the rdb-channel) and configures it against the primary with the appropriate capabilities and requirements. The replica then requests a sync using the rdb-channel.
- Prior to forking, the primary sends the replica the snapshot's end repl-offset and attaches the replica to the replication backlog to retain repl data until the replica requests a psync. The replica uses the main channel to request a PSYNC starting at the snapshot end offset.
- The primary's main thread sends incremental changes via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. On the replica, the incremental changes are stored in a local buffer, while the RDB is loaded into memory.
- Once the replica completes loading the RDB, it drops the rdb-connection and streams the accumulated incremental changes into memory. Replication steady state continues normally. (A sketch of this exchange follows the list below.)
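To make the message ordering concrete, here is a minimal sketch of the replica side of this exchange over raw Tcl sockets (matching the language of the attached test). Only the messages named above (`REPLCONF CAPA DUALCHANNEL`, `-FULLSYNCNEEDED`, `PSYNC`) come from this PR's description; the host/port values and the command used to request the snapshot on the rdb-channel are illustrative placeholders, not the PR's literal wire protocol.

```tcl
# Illustrative sketch only; a real replica speaks full RESP and
# handles errors, replies, and binary RDB payloads.
set primary_host 127.0.0.1   ;# placeholder address
set primary_port 6379        ;# placeholder port

# Main channel: declare dual channel capability during the handshake.
set main [socket $primary_host $primary_port]
fconfigure $main -translation crlf
puts $main "REPLCONF CAPA DUALCHANNEL"
flush $main
gets $main reply                      ;# expect +OK

# Request a partial sync. A primary that cannot serve it replies
# -FULLSYNCNEEDED on this channel instead of streaming RDB data.
puts $main "PSYNC ? -1"
flush $main
gets $main reply

if {[string match "-FULLSYNCNEEDED*" $reply]} {
    # rdb-channel: a second, dedicated connection for the snapshot,
    # served directly by the primary's bgsave child.
    set rdb [socket $primary_host $primary_port]
    fconfigure $rdb -translation crlf
    puts $rdb "SYNC"                  ;# placeholder for the rdb-channel sync request
    flush $rdb
    # The primary sends the snapshot's end repl-offset before forking.
    # The replica then asks on the main channel for a PSYNC starting at
    # that offset, buffers the incremental stream locally while the RDB
    # bytes arrive on $rdb, loads the RDB, drops the rdb connection, and
    # finally applies the buffered increments.
    close $rdb
}
close $main
```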
## New replica state machine

(replica state machine diagram)

## Data <a name="data"></a>

(three performance graphs, discussed below)

## Explanation

These graphs demonstrate the performance improvements during full sync sessions when using the rdb-channel and streaming the RDB directly from the background process to the replica.

First graph: with at most 50 clients and lightweight commands, we saw a 5%-7.5% improvement in write latency during the sync session.

Two graphs below: full sync was tested under heavy read commands on the primary (such as SDIFF and SUNION on large sets). In that case, the child process writes to the replica without sharing CPU with the loaded main process. As a result, this not only improves client response time, but may also shorten sync time by about 50%. The shorter sync time results in less memory being used to store replication diffs (>60% less in some of the tested cases).

## Test setup

Both the primary and the replica ran on the same machine in the performance tests. The RDB size in all tests is 3.7gb. Write load was generated using valkey-benchmark: `./valkey-benchmark -r 100000 -n 6000000 lpush my_list __rand_int__`.

---------

Signed-off-by: naglera <anagler123@gmail.com>
Signed-off-by: naglera <58042354+naglera@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
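The Tcl test attached below adjusts its `sync_partial_ok` expectations based on the `dual-channel-replication-enabled` config it reads from the server. For completeness, a minimal sketch of exercising the feature from the same test framework; the client handles and the `wait_for_sync` helper follow the suite's conventions, and runtime settability of the config is an assumption here:

```tcl
# Sketch: enable dual channel replication on both ends, then trigger a
# full sync. $primary/$replica are test-suite client handles as used in
# the attached test.
$primary config set dual-channel-replication-enabled yes
$replica config set dual-channel-replication-enabled yes
$replica replicaof $primary_host $primary_port
wait_for_sync $replica    ;# blocks until master_link_status is up
```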
start_server {tags {"psync2 external:skip"}} {
start_server {} {
start_server {} {
    set master [srv 0 client]
    set master_host [srv 0 host]
    set master_port [srv 0 port]

    set replica [srv -1 client]
    set replica_host [srv -1 host]
    set replica_port [srv -1 port]

    set sub_replica [srv -2 client]

    # Make sure the server saves an RDB on shutdown
    $master config set save "3600 1"

    # Because we will test partial resync later, we don't want a timeout to cause
    # the master-replica disconnect, then the extra reconnections will break the
    # sync_partial_ok stat test
    $master config set repl-timeout 3600
    $replica config set repl-timeout 3600
    $sub_replica config set repl-timeout 3600

    # Avoid PINGs
    $master config set repl-ping-replica-period 3600
    $master config rewrite

    # Build replication chain
    $replica replicaof $master_host $master_port
    $sub_replica replicaof $replica_host $replica_port

    wait_for_condition 50 100 {
        [status $replica master_link_status] eq {up} &&
        [status $sub_replica master_link_status] eq {up}
    } else {
        fail "Replication not started."
    }

    test "PSYNC2: Partial resync after Master restart using RDB aux fields when offset is 0" {
        assert {[status $master master_repl_offset] == 0}

        set replid [status $master master_replid]
        $replica config resetstat

        catch {
            restart_server 0 true false true now
            set master [srv 0 client]
        }
        wait_for_condition 50 1000 {
            [status $replica master_link_status] eq {up} &&
            [status $sub_replica master_link_status] eq {up}
        } else {
            fail "Replicas didn't sync after master restart"
        }

        # Make sure the master restores replication info correctly
        assert {[status $master master_replid] != $replid}
        assert {[status $master master_repl_offset] == 0}
        assert {[status $master master_replid2] eq $replid}
        assert {[status $master second_repl_offset] == 1}

        # Make sure the master sets up the replication backlog correctly
        assert {[status $master repl_backlog_active] == 1}
        assert {[status $master repl_backlog_first_byte_offset] == 1}
        assert {[status $master repl_backlog_histlen] == 0}

        # Partial resync after Master restart
        assert {[status $master sync_partial_ok] == 1}
        assert {[status $replica sync_partial_ok] == 1}
    }

    # Generate some data
    createComplexDataset $master 1000

    test "PSYNC2: Partial resync after Master restart using RDB aux fields with data" {
        wait_for_condition 500 100 {
            [status $master master_repl_offset] == [status $replica master_repl_offset] &&
            [status $master master_repl_offset] == [status $sub_replica master_repl_offset]
        } else {
            fail "Replicas and master offsets were unable to match *exactly*."
        }

        set replid [status $master master_replid]
        set offset [status $master master_repl_offset]
        $replica config resetstat

        catch {
            # SHUTDOWN NOW ensures master doesn't send GETACK to replicas before
            # shutting down which would affect the replication offset.
            restart_server 0 true false true now
            set master [srv 0 client]
        }
        wait_for_condition 50 1000 {
            [status $replica master_link_status] eq {up} &&
            [status $sub_replica master_link_status] eq {up}
        } else {
            fail "Replicas didn't sync after master restart"
        }

        # Make sure the master restores replication info correctly
        assert {[status $master master_replid] != $replid}
        assert {[status $master master_repl_offset] == $offset}
        assert {[status $master master_replid2] eq $replid}
        assert {[status $master second_repl_offset] == [expr $offset+1]}

        # Make sure the master sets up the replication backlog correctly
        assert {[status $master repl_backlog_active] == 1}
        assert {[status $master repl_backlog_first_byte_offset] == [expr $offset+1]}
        assert {[status $master repl_backlog_histlen] == 0}

        # Partial resync after Master restart
        assert {[status $master sync_partial_ok] == 1}
        assert {[status $replica sync_partial_ok] == 1}
    }

    test "PSYNC2: Partial resync after Master restart using RDB aux fields with expire" {
        $master debug set-active-expire 0
        for {set j 0} {$j < 1024} {incr j} {
            $master select [expr $j%16]
            $master set $j somevalue px 10
        }

        after 20

        # Wait until master has received ACK from replica. If the master thinks
        # that any replica is lagging when it shuts down, master would send
        # GETACK to the replicas, affecting the replication offset.
        set offset [status $master master_repl_offset]
        wait_for_condition 500 100 {
            [string match "*slave0:*,offset=$offset,*" [$master info replication]] &&
            $offset == [status $replica master_repl_offset] &&
            $offset == [status $sub_replica master_repl_offset]
        } else {
            show_cluster_status
            fail "Replicas and master offsets were unable to match *exactly*."
        }

        set offset [status $master master_repl_offset]
        $replica config resetstat

        catch {
            # Unlike the test above, here we use SIGTERM, which behaves
            # differently compared to SHUTDOWN NOW if there are lagging
            # replicas. This is just to increase coverage and let each test use
            # a different shutdown approach. In this case there are no lagging
            # replicas though.
            restart_server 0 true false
            set master [srv 0 client]
        }
        wait_for_condition 50 1000 {
            [status $replica master_link_status] eq {up} &&
            [status $sub_replica master_link_status] eq {up}
        } else {
            fail "Replicas didn't sync after master restart"
        }

        set expired_offset [status $master repl_backlog_histlen]
        # Stale keys expired and master_repl_offset grows correctly
        assert {[status $master rdb_last_load_keys_expired] == 1024}
        assert {[status $master master_repl_offset] == [expr $offset+$expired_offset]}

        # Partial resync after Master restart
        assert {[status $master sync_partial_ok] == 1}
        assert {[status $replica sync_partial_ok] == 1}

        set digest [$master debug digest]
        assert {$digest eq [$replica debug digest]}
        assert {$digest eq [$sub_replica debug digest]}
    }

    test "PSYNC2: Full resync after Master restart when too many key expired" {
        $master config set repl-backlog-size 16384
        $master config rewrite

        $master debug set-active-expire 0
        # Make sure replication backlog is full and will be trimmed.
        for {set j 0} {$j < 2048} {incr j} {
            $master select [expr $j%16]
            $master set $j somevalue px 10
        }

        after 20

        wait_for_condition 500 100 {
            [status $master master_repl_offset] == [status $replica master_repl_offset] &&
            [status $master master_repl_offset] == [status $sub_replica master_repl_offset]
        } else {
            fail "Replicas and master offsets were unable to match *exactly*."
        }

        $replica config resetstat

        catch {
            # Unlike the test above, here we use SIGTERM. This is just to
            # increase coverage and let each test use a different shutdown
            # approach.
            restart_server 0 true false
            set master [srv 0 client]
        }
        wait_for_condition 50 1000 {
            [status $replica master_link_status] eq {up} &&
            [status $sub_replica master_link_status] eq {up}
        } else {
            fail "Replicas didn't sync after master restart"
        }
        set dualchannel [lindex [r config get dual-channel-replication-enabled] 1]
        set psync_count 0
        if {$dualchannel == "yes"} {
            # Expect one fake psync
            set psync_count 1
        }

        # Replication backlog is full
        assert {[status $master repl_backlog_first_byte_offset] > [status $master second_repl_offset]}
        assert {[status $master sync_partial_ok] == $psync_count}
        assert {[status $master sync_full] == 1}
        assert {[status $master rdb_last_load_keys_expired] == 2048}
        assert {[status $replica sync_full] == 1}

        set digest [$master debug digest]
        assert {$digest eq [$replica debug digest]}
        assert {$digest eq [$sub_replica debug digest]}
    }
}}}