futriix/tests/integration/replication-buffer.tcl
naglera ff6b780fe6
Dual channel replication (#60)
In this PR we introduce the main benefit of dual channel replication:
continuously streaming the COB (client output buffer) in parallel to the
RDB transfer, thus keeping the primary-side COB small AND accelerating
the overall sync process. By streaming the replication data to the
replica during the full sync, we reduce
1. Memory load on the primary node.
2. CPU load on the primary's main process. [Latest performance
tests](#data)

## Motivation
* Reduce primary memory load. We do that by moving the COB tracking to
the replica side. This also decreases the chance of COB overruns. Note
that the primary's input buffer limits on the replica side are less
restrictive than the primary's COB limits, since the replica plays a
less critical part in the replication group. While increasing the
primary's COB may end up with the primary hitting swap and clients
suffering, on the replica side we are more at ease with it: a larger
buffer means a better chance to sync successfully.
* Reduce primary main process CPU load. By opening a new, dedicated
connection for the RDB transfer, child processes get direct access to
that connection. Due to TLS connection restrictions, this was not
possible using the single main connection. We eliminate the need for the
child process to use the primary's child-proc -> main-proc pipeline,
thus freeing up the main process to serve client queries.


## Dual channel replication high-level interface design
- Dual channel replication begins when the replica sends `REPLCONF CAPA
DUALCHANNEL` to the primary during the initial handshake. This states
that the replica is capable of dual channel sync and that this
connection is the replica's main channel, which is not used for the
snapshot transfer.
- When the replica lacks sufficient data for a PSYNC, the primary sends
a `-FULLSYNCNEEDED` response instead of RDB data. As a next step, the
replica creates a new connection (the rdb-channel) and configures it
against the primary with the appropriate capabilities and requirements.
The replica then requests a sync using the RDB channel.
- Prior to forking, the primary sends the replica the snapshot's end
repl-offset and attaches the replica to the replication backlog so that
repl data is kept until the replica requests a psync. The replica uses
the main channel to request a PSYNC starting at the snapshot end offset.
- The primary's main thread sends incremental changes via the main
channel, while the bgsave process sends the RDB directly to the replica
via the rdb-channel. On the replica side, the incremental changes are
stored in a local buffer, while the RDB is loaded into memory.
- Once the replica completes loading the RDB, it drops the
rdb-connection and streams the accumulated incremental changes into
memory. Replication steady state then continues normally. A minimal
sketch of driving this flow from a test's point of view is shown below.
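
The flow is transparent once `dual-channel-replication-enabled` is set on
both sides. As an illustration only (not code from this PR), here is a
minimal sketch of a dual channel full sync using the test-suite helpers
that appear later in this file (`start_server`, `srv`, `populate`,
`wait_for_sync`, `s`, `assert_equal`):

```tcl
# Illustrative sketch; relies on the Tcl test helpers used elsewhere in
# this file rather than on code introduced by this PR.
start_server {tags {"repl external:skip"}} {
    start_server {} {
        set primary [srv 0 client]
        set replica [srv -1 client]

        # Enable the feature on both sides; the channel handling is automatic.
        $primary config set dual-channel-replication-enabled yes
        $replica config set dual-channel-replication-enabled yes

        # Seed enough data on the primary to require a full sync.
        populate 1000 "" 1024

        # The replica advertises REPLCONF CAPA DUALCHANNEL, receives
        # -FULLSYNCNEEDED, opens the rdb-channel and syncs.
        $replica replicaof [srv 0 host] [srv 0 port]
        wait_for_sync $replica

        # Exactly one full sync, served over the dedicated rdb-channel.
        assert_equal 1 [s sync_full]
    }
}
```

The replica performs the `REPLCONF CAPA DUALCHANNEL` handshake, opens the
rdb-channel, and requests the trailing PSYNC on its own; the test only has
to enable the config and wait for the sync to complete.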

## New replica state machine


![image](https://github.com/user-attachments/assets/38fbfff0-60b9-4066-8b13-becdb87babc3)

## Data <a name="data"></a>

![image](https://github.com/user-attachments/assets/d73631a7-0a58-4958-a494-a7f4add9108f)


![image](https://github.com/user-attachments/assets/f44936ed-c59a-4223-905d-0fe48a6d31a6)


![image](https://github.com/user-attachments/assets/bd333ee2-3c47-47e5-b244-4ea75f77c836)

## Explanation 
These graphs demonstrate the performance improvement during full sync
sessions when using the rdb-channel and streaming the RDB directly from
the background process to the replica.

First graph: with at most 50 clients and lightweight commands, we saw a
5%-7.5% improvement in write latency during the sync session.
Two graphs below: full sync was tested while the primary served heavy
read commands (such as SDIFF and SUNION on large sets). In that case,
the child process writes to the replica without sharing CPU with the
loaded main process. As a result, this not only improves client response
time, but may also shorten the sync time by about 50%. The shorter sync
time results in less memory being used to store replication diffs (>60%
less in some of the tested cases).

## Test setup 
Both the primary and the replica in the performance tests ran on the
same machine. The RDB size in all tests was 3.7 GB. I generated the
write load using valkey-benchmark:
`./valkey-benchmark -r 100000 -n 6000000 lpush my_list __rand_int__`.

---------

Signed-off-by: naglera <anagler123@gmail.com>
Signed-off-by: naglera <58042354+naglera@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-07-17 13:59:33 -07:00


# This test group aims to test that all replicas share one global replication buffer,
# that two replicas don't double the replication buffer size, and that when there is
# no replica, the replication buffer will shrink.
foreach dualchannel {"yes" "no"} {
start_server {tags {"repl external:skip"}} {
start_server {} {
start_server {} {
start_server {} {
set replica1 [srv -3 client]
set replica2 [srv -2 client]
set replica3 [srv -1 client]
$replica1 config set dual-channel-replication-enabled $dualchannel
$replica2 config set dual-channel-replication-enabled $dualchannel
$replica3 config set dual-channel-replication-enabled $dualchannel
set master [srv 0 client]
set master_host [srv 0 host]
set master_port [srv 0 port]
$master config set save ""
$master config set repl-backlog-size 16384
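# Keep the replication backlog small (16KB); the tests below deliberately write much more than that.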
$master config set repl-diskless-sync-delay 5
$master config set repl-diskless-sync-max-replicas 1
$master config set client-output-buffer-limit "replica 0 0 0"
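# A limit of "replica 0 0 0" disables the replica output buffer limit, so replicas are not dropped for buffer overruns in this setup.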
$master config set dual-channel-replication-enabled $dualchannel
# Make sure replica3 is synchronized with master
$replica3 replicaof $master_host $master_port
wait_for_sync $replica3
# Generating the RDB will take about 100 seconds (100 keys, 1 second save delay each)
$master config set rdb-key-save-delay 1000000
populate 100 "" 16
# Make sure replica1 and replica2 are waiting for bgsave
$master config set repl-diskless-sync-max-replicas 2
$replica1 replicaof $master_host $master_port
$replica2 replicaof $master_host $master_port
wait_for_condition 50 100 {
([s rdb_bgsave_in_progress] == 1) &&
[lindex [$replica1 role] 3] eq {sync} &&
[lindex [$replica2 role] 3] eq {sync}
} else {
fail "fail to sync with replicas"
}
test "All replicas share one global replication buffer dualchannel $dualchannel" {
set before_used [s used_memory]
populate 1024 "" 1024 ; # Write extra 1M data
# New data uses 1M memory, but all replicas use only one
# replication buffer, so all replicas output memory is not
# more than double of replication buffer.
set repl_buf_mem [s mem_total_replication_buffers]
set extra_mem [expr {[s used_memory]-$before_used-1024*1024}]
if {$dualchannel == "yes"} {
# master's replication buffers should not grow during dual channel replication
assert {$extra_mem < 1024*1024}
assert {$repl_buf_mem < 1024*1024}
} else {
assert {$extra_mem < 2*$repl_buf_mem}
}
# Kill replica1; the replication buffer will not become smaller
catch {$replica1 shutdown nosave}
set cur_slave_count 2
if {$dualchannel == "yes"} {
# replica3 is connected, replica2 is syncing (has two connections)
set cur_slave_count 3
}
wait_for_condition 500 100 {
[s connected_slaves] eq $cur_slave_count
} else {
fail "replica doesn't disconnect with master"
}
assert_equal $repl_buf_mem [s mem_total_replication_buffers]
}
test "Replication buffer will become smaller when no replica uses dualchannel $dualchannel" {
# Make sure replica3 catches up with the master
wait_for_ofs_sync $master $replica3
set repl_buf_mem [s mem_total_replication_buffers]
# Kill replica2; the replication buffer will become smaller
catch {$replica2 shutdown nosave}
wait_for_condition 50 100 {
[s connected_slaves] eq {1}
} else {
fail "replica2 doesn't disconnect with master"
}
if {$dualchannel == "yes"} {
# master's replication buffers should not grow during dual channel replication
assert {1024*512 > [s mem_total_replication_buffers]}
} else {
assert {[expr $repl_buf_mem - 1024*1024] > [s mem_total_replication_buffers]}
}
}
}
}
}
}
}
# This test group aims to test that the replication backlog size can outgrow the
# backlog limit config if there is a slow replica which keeps massive replication
# buffers, and that replicas can use this replication buffer (beyond the backlog
# config) for partial re-synchronization. Of course, replication backlog memory
# can also become smaller when the master disconnects slow replicas because the
# output buffer limit is reached.
foreach dualchannel {yes no} {
start_server {tags {"repl external:skip"}} {
start_server {} {
start_server {} {
set replica1 [srv -2 client]
set replica1_pid [s -2 process_id]
set replica2 [srv -1 client]
set replica2_pid [s -1 process_id]
$replica1 config set dual-channel-replication-enabled $dualchannel
set master [srv 0 client]
set master_host [srv 0 host]
set master_port [srv 0 port]
$master config set save ""
$master config set repl-backlog-size 16384
$master config set client-output-buffer-limit "replica 0 0 0"
$master config set dual-channel-replication-enabled $dualchannel
# Executing 'debug digest' on a master with many keys takes a long time
# (especially in valgrind), which could cause replica1 and replica2 to
# disconnect from the master.
$master config set repl-timeout 1000
$replica1 config set repl-timeout 1000
$replica2 config set repl-timeout 1000
$replica2 config set client-output-buffer-limit "replica 0 0 0"
$replica2 config set dual-channel-replication-enabled $dualchannel
$replica1 replicaof $master_host $master_port
wait_for_sync $replica1
test "Replication backlog size can outgrow the backlog limit config dualchannel $dualchannel" {
# Generating RDB will take 1000 seconds
$master config set rdb-key-save-delay 1000000
populate 1000 master 10000
$replica2 replicaof $master_host $master_port
# Make sure replica2 is waiting for bgsave
wait_for_condition 5000 100 {
([s rdb_bgsave_in_progress] == 1) &&
[lindex [$replica2 role] 3] eq {sync}
} else {
fail "fail to sync with replicas"
}
# The actual replication backlog grows beyond the backlog-size setting since
# the slow replica2 keeps holding on to the replication buffer.
populate 20000 master 10000
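# 20000 keys with 10000-byte values write well over the 100MB (10000*10000) threshold asserted below.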
assert {[s repl_backlog_histlen] > [expr 10000*10000]}
}
# Wait for replica1 to catch up with the master
wait_for_condition 1000 100 {
[s -2 master_repl_offset] eq [s master_repl_offset]
} else {
fail "Replica offset didn't catch up with the master after too long time"
}
test "Replica could use replication buffer (beyond backlog config) for partial resynchronization dualchannel $dualchannel" {
# replica1 disconnects from the master
$replica1 replicaof [srv -1 host] [srv -1 port]
# Write a mass of data that exceeds repl-backlog-size
populate 10000 master 10000
# replica1 reconnects to the master
$replica1 replicaof $master_host $master_port
wait_for_condition 1000 100 {
[s -2 master_repl_offset] eq [s master_repl_offset]
} else {
fail "Replica offset didn't catch up with the master after too long time"
}
# replica2 is still waiting for the bgsave to finish
assert {[s rdb_bgsave_in_progress] eq {1} && [lindex [$replica2 role] 3] eq {sync}}
# master accepted replica1 partial resync
if { $dualchannel == "yes" } {
# 2 psyncs over the main channel (one for each dual channel full sync)
# +1 "real" partial resync
assert_equal [s sync_partial_ok] {3}
} else {
assert_equal [s sync_partial_ok] {1}
}
assert_equal [$master debug digest] [$replica1 debug digest]
}
test "Replication backlog memory will become smaller if disconnecting with replica dualchannel $dualchannel" {
assert {[s repl_backlog_histlen] > [expr 2*10000*10000]}
if {$dualchannel == "yes"} {
# 1 connection of replica1
# +2 connections during sync of replica2
assert_equal [s connected_slaves] {3}
} else {
assert_equal [s connected_slaves] {2}
}
pause_process $replica2_pid
r config set client-output-buffer-limit "replica 128k 0 0"
# trigger output buffer limit check
r set key [string repeat A [expr 64*2048]]
# The master will close replica2's connection since replica2's output
# buffer limit is reached, so only replica1 remains.
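# With dual channel, replica2's rdb-channel connection may still be listed in
# wait_bgsave state, which is why the second outcome below is also accepted.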
wait_for_condition 100 100 {
[s connected_slaves] eq {1} ||
([s connected_slaves] eq {2} &&
[string match {*slave*state=wait_bgsave*type=rdb-channel*} [$master info]])
} else {
fail "master didn't disconnect with replica2"
}
# Since we trim the replication backlog incrementally, replication backlog
# memory may take time to be reclaimed.
wait_for_condition 1000 100 {
[s repl_backlog_histlen] < [expr 10000*10000]
} else {
fail "Replication backlog memory is not smaller"
}
resume_process $replica2_pid
}
# speed up termination
$master config set shutdown-timeout 0
}
}
}
}
foreach dualchannel {"yes" "no"} {
test "Partial resynchronization is successful even client-output-buffer-limit is less than repl-backlog-size. dualchannel $dualchannel" {
start_server {tags {"repl external:skip"}} {
start_server {} {
r config set save ""
r config set repl-backlog-size 100mb
r config set client-output-buffer-limit "replica 512k 0 0"
r config set dual-channel-replication-enabled $dualchannel
set replica [srv -1 client]
$replica config set dual-channel-replication-enabled $dualchannel
$replica replicaof [srv 0 host] [srv 0 port]
wait_for_sync $replica
set big_str [string repeat A [expr 10*1024*1024]] ;# 10mb big string
r multi
r client kill type replica
r set key $big_str
r set key $big_str
r debug sleep 2 ;# wait for replica reconnecting
r exec
# When the replica reconnects to the master, the master accepts the partial
# resync and doesn't close the replica client even though the client output
# buffer limit is reached.
r set key $big_str ;# trigger output buffer limit check
wait_for_ofs_sync r $replica
# master accepted replica partial resync
set psync_count 1
if {$dualchannel == "yes"} {
# One fake and one real psync
set psync_count 2
}
assert_equal [s sync_full] {1}
assert_equal [s sync_partial_ok] $psync_count
r multi
r set key $big_str
r set key $big_str
r exec
# The replica's reply buffer size is more than client-output-buffer-limit but
# doesn't exceed repl-backlog-size, so we don't close the replica client.
wait_for_condition 1000 100 {
[s -1 master_repl_offset] eq [s master_repl_offset]
} else {
fail "Replica offset didn't catch up with the master after too long time"
}
assert_equal [s sync_full] {1}
assert_equal [s sync_partial_ok] $psync_count
}
}
}
# This test was added to make sure big keys added to the backlog do not trigger a psync loop.
test "Replica client-output-buffer size is limited to backlog_limit/16 when no replication data is pending. dualchannel $dualchannel" {
proc client_field {r type f} {
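# Extract the value of field $f (e.g. tot-mem) from the CLIENT LIST output for clients of the given type.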
set client [$r client list type $type]
if {![regexp $f=(\[a-zA-Z0-9-\]+) $client - res]} {
error "field $f not found for in $client"
}
return $res
}
start_server {tags {"repl external:skip"}} {
start_server {} {
set replica [srv -1 client]
set replica_host [srv -1 host]
set replica_port [srv -1 port]
set master [srv 0 client]
set master_host [srv 0 host]
set master_port [srv 0 port]
$master config set maxmemory-policy allkeys-lru
$master config set repl-backlog-size 16384
$master config set client-output-buffer-limit "replica 32768 32768 60"
$master config set dual-channel-replication-enabled $dualchannel
$replica config set dual-channel-replication-enabled $dualchannel
# The key has to be larger than the replica client-output-buffer limit.
set keysize [expr 256*1024]
$replica replicaof $master_host $master_port
wait_for_condition 50 100 {
[lindex [$replica role] 0] eq {slave} &&
[string match {*master_link_status:up*} [$replica info replication]]
} else {
fail "Can't turn the instance into a replica"
}
# Write a big key that is going to breach the obuf limit and cause the replica to disconnect,
# then in the same event loop, add at least 16 more keys, and enable eviction, so that the
# eviction code has a chance to call flushSlavesOutputBuffers, and then run PING to trigger the eviction code
set _v [prepare_value $keysize]
$master write "[format_command mset key $_v k1 1 k2 2 k3 3 k4 4 k5 5 k6 6 k7 7 k8 8 k9 9 ka a kb b kc c kd d ke e kf f kg g kh h]config set maxmemory 1\r\nping\r\n"
$master flush
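# Read the three replies from the pipelined MSET, CONFIG SET and PING.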
$master read
$master read
$master read
wait_for_ofs_sync $master $replica
# Write another key to force the test to wait for another event loop iteration so that we
# give the serverCron a chance to disconnect replicas with COB size exceeding the limits
$master config set maxmemory 0
$master set key1 1
wait_for_ofs_sync $master $replica
assert {[status $master connected_slaves] == 1}
wait_for_condition 50 100 {
[client_field $master replica tot-mem] < $keysize
} else {
fail "replica client-output-buffer usage is higher than expected."
}
# Now we expect the replica to re-connect but fail the partial sync (it doesn't
# have a large enough COB limit, so it must fall back to a full sync)
if {$dualchannel == "yes"} {
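# With dual channel, every full sync also issues a psync on the main channel, so the two counters should stay equal.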
assert {[status $master sync_partial_ok] == [status $master sync_full]}
} else {
assert {[status $master sync_partial_ok] == 0}
}
# Before this fix (#11905), the test would trigger an assertion in 'o->used >= c->ref_block_pos'
test {The update of replBufBlock's repl_offset is ok - Regression test for #11666} {
set rd [valkey_deferring_client]
set replid [status $master master_replid]
set offset [status $master repl_backlog_first_byte_offset]
$rd psync $replid $offset
assert_equal {PONG} [$master ping] ;# Make sure the master doesn't crash.
$rd close
}
}
}
}
}