From ffb691f6f1b0f176b1f03c474f1dfd854ee6bf78 Mon Sep 17 00:00:00 2001
From: Binbin
Date: Wed, 1 Feb 2023 20:48:16 +0800
Subject: [PATCH] Fix handshake timeout replication test race (#11773)

The test fails on x86 + TLS with this error:
```
*** [err]: Slave is able to detect timeout during handshake in tests/integration/replication.tcl
Replica is not able to detect timeout
```

The replica log is:
```
### Starting test Slave is able to detect timeout during handshake in tests/integration/replication.tcl
7681:S 05 Jan 2023 00:21:56.635 * Non blocking connect for SYNC fired the event.
7681:S 05 Jan 2023 00:21:56.638 * Master replied to PING, replication can continue...
7681:S 05 Jan 2023 00:21:56.638 * Trying a partial resynchronization (request ef70638885500aad12dd673c68ca1541116a59fe:1).
7681:S 05 Jan 2023 00:22:56.894 # Failed to read response from the server: error:0A000126:SSL routines::unexpected eof while reading
7681:S 05 Jan 2023 00:22:56.894 # Master did not reply to PSYNC, will try later
```

This is another issue that appeared after #11640 was merged.
This PR tries to fix it. The idea is to make the test wait until the
replica is stably in `wait_bgsave`; for example, the wait may last until
the next PSYNC retry in the following situation:
`Master did not reply to PSYNC, will try later`

Other than that, the change makes the test more consistent and
predictable, since the master is always frozen in the desired state
(waiting for repl-diskless-sync-delay to happen, rather than in earlier
stages of the handshake).
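
For illustration, the polling idea behind the new check can be sketched
in plain Tcl. This is a minimal sketch, not the test suite's actual
helper: `poll_until` and `fake_info_replication` are hypothetical names,
and the INFO output is canned so the snippet runs without a server.

```tcl
# Hypothetical stand-in for a polling helper: evaluate `cond` in the
# caller's scope up to `maxtries` times, sleeping `delay_ms` between
# attempts. Returns 1 as soon as the condition holds, otherwise 0.
proc poll_until {maxtries delay_ms cond} {
    for {set i 0} {$i < $maxtries} {incr i} {
        if {[uplevel 1 [list expr $cond]]} {
            return 1
        }
        after $delay_ms
    }
    return 0
}

# Canned "INFO replication" output (assumption, for illustration only):
# no replica is listed for the first two polls, and from the third poll
# onward the replica is reported in the wait_bgsave state.
set polls 0
proc fake_info_replication {} {
    global polls
    incr polls
    if {$polls < 3} {
        return "role:master\nconnected_slaves:0"
    }
    return "role:master\nslave0:ip=127.0.0.1,port=6380,state=wait_bgsave,offset=0,lag=0"
}

# Same shape as the condition in the patch: match the state substring
# against the master's replication info until it appears.
set reached [poll_until 50 10 {
    [string match *state=wait_bgsave* [fake_info_replication]]
}]
puts "reached=$reached polls=$polls"
```

In this sketch the loop gives up after 50 attempts and returns 0, which
is where the real test reports a failure instead of continuing.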
---
 tests/integration/replication.tcl | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/tests/integration/replication.tcl b/tests/integration/replication.tcl
index 319462695..e23edad9d 100644
--- a/tests/integration/replication.tcl
+++ b/tests/integration/replication.tcl
@@ -31,6 +31,14 @@ start_server {tags {"repl network external:skip"}} {
             }
         }
 
+        test {Slave enters wait_bgsave} {
+            wait_for_condition 50 1000 {
+                [string match *state=wait_bgsave* [$master info replication]]
+            } else {
+                fail "Replica does not enter wait_bgsave state"
+            }
+        }
+
         # Use a short replication timeout on the slave, so that if there
         # are no bugs the timeout is triggered in a reasonable amount
         # of time.