futriix/tests/unit/cluster/failover.tcl

# Check the basic monitoring and failover capabilities.

# Start a cluster with 5 primaries and 5 replicas.
start_cluster 5 5 {tags {external:skip cluster}} {

test "Cluster is up" {
    wait_for_cluster_state ok
}

test "Cluster is writable" {
    cluster_write_test [srv 0 port]
}

test "Instance #5 is a slave" {
    assert {[s -5 role] eq {slave}}
}

test "Instance #5 synced with the master" {
    # Poll up to 1000 times, 50 ms apart, until the replication link is up.
    wait_for_condition 1000 50 {
        [s -5 master_link_status] eq {up}
    } else {
        fail "Instance #5 master link status is not up"
    }
}

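# Record the epoch seen by node #1, which stays alive, so that a later
# increase can be taken as evidence of a failover.
set current_epoch [CI 1 cluster_current_epoch]

# The master is paused rather than terminated, so it can be resumed later.
set paused_pid [srv 0 pid]
test "Killing one master node" {
    pause_process $paused_pid
}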

test "Wait for failover" {
    wait_for_condition 1000 50 {
        [CI 1 cluster_current_epoch] > $current_epoch
    } else {
        fail "No failover detected"
    }
}

test "Cluster should eventually be up again" {
    for {set j 0} {$j < [llength $::servers]} {incr j} {
        # Skip the paused node: it cannot report its cluster state.
        if {[process_is_paused [srv -$j pid]]} continue
        wait_for_condition 1000 50 {
            [CI $j cluster_state] eq "ok"
        } else {
            fail "Cluster node $j cluster_state:[CI $j cluster_state]"
        }
    }
}

test "Cluster is writable" {
    # Node #0 is still paused, so write through node #1 this time.
    cluster_write_test [srv -1 port]
}

test "Instance #5 is now a master" {
    assert {[s -5 role] eq {master}}
}

test "Restarting the previously killed master node" {
    resume_process $paused_pid
}

test "Instance #0 gets converted into a slave" {
    wait_for_condition 1000 50 {
        [s 0 role] eq {slave}
    } else {
        fail "Old master was not converted into slave"
    }
    wait_for_cluster_propagation
}

} ;# start_cluster

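# Start a cluster with 3 primaries and 6 replicas, so that more than one
# replica can compete in the election for a failed primary.
start_cluster 3 6 {tags {external:skip cluster}} {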

    test "Cluster is up" {
        wait_for_cluster_state ok
    }

    test "Cluster is writable" {
        cluster_write_test [srv 0 port]
    }

    set current_epoch [CI 1 cluster_current_epoch]

    set paused_pid [srv 0 pid]
    test "Killing the first primary node" {
        pause_process $paused_pid
    }

    test "Wait for failover" {
        wait_for_condition 1000 50 {
            [CI 1 cluster_current_epoch] > $current_epoch
        } else {
            fail "No failover detected"
        }
    }

    test "Cluster should eventually be up again" {
        for {set j 0} {$j < [llength $::servers]} {incr j} {
            if {[process_is_paused [srv -$j pid]]} continue
            wait_for_condition 1000 50 {
                [CI $j cluster_state] eq "ok"
            } else {
                fail "Cluster node $j cluster_state:[CI $j cluster_state]"
            }
        }
    }

    test "Restarting the previously killed primary node" {
        resume_process $paused_pid
    }

    test "Instance #0 gets converted into a replica" {
        # The role field still uses the legacy name "slave" for a replica.
        wait_for_condition 1000 50 {
            [s 0 role] eq {slave}
        } else {
            fail "Old primary was not converted into replica"
        }
        wait_for_cluster_propagation
    }

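    # Nodes #3 and #6 are the candidates checked here: the one that was
    # promoted must have started its election with rank #0, the other with
    # rank #1. The trailing 0 makes the log search start from the first line.
    test "Make sure the replicas always get different ranks" {
        if {[s -3 role] eq {master}} {
            verify_log_message -3 "*Start of election*rank #0*" 0
            verify_log_message -6 "*Start of election*rank #1*" 0
        } else {
            verify_log_message -3 "*Start of election*rank #1*" 0
            verify_log_message -6 "*Start of election*rank #0*" 0
        }
    }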

} ;# start_cluster
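
The assertions above lean on a particular node layout: node #5 takes over from node #0 in the 5+5 cluster, and nodes #3 and #6 compete for node #0 in the 3+6 cluster. The following is a minimal sketch of that layout, assuming `start_cluster` numbers the primaries first and then hands out replicas round-robin. That assumption is inferred from the tests and is not confirmed by this file. `replica_layout` is a hypothetical helper and not part of the test framework.

```tcl
# Hypothetical helper (not part of the test framework).
# Assumption: primaries get indexes 0..masters-1 and replicas are then
# assigned to primaries round-robin, in index order.
# Returns a dict mapping replica index -> primary index.
proc replica_layout {masters replicas} {
    set layout [dict create]
    for {set j 0} {$j < $replicas} {incr j} {
        set replica_idx [expr {$masters + $j}]
        set primary_idx [expr {$j % $masters}]
        dict set layout $replica_idx $primary_idx
    }
    return $layout
}

puts [replica_layout 5 5]  ;# 5 0 6 1 7 2 8 3 9 4
puts [replica_layout 3 6]  ;# 3 0 4 1 5 2 6 0 7 1 8 2
```

Under this assumption, pausing node #0 leaves exactly one candidate (#5) in the first cluster and two candidates (#3 and #6) in the second, which matches what the tests check.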
Cluster test: basic failover unit added. 2014-05-23 11:47:47 +02:00

Cluster test: check master -> slave role switch. 2014-06-10 13:54:38 +02:00

Cluster test: unit 02 should wait for failover. 2014-06-10 14:18:54 +02:00

Cluster test: 02 unit more reliable waiting for slave sync. 2014-06-10 15:00:39 +02:00

Migrate cluster mode tests to normal framework (#442) 2024-05-09 10:14:47 +08:00
We currently have two disjoint TCL frameworks:
1. Normal testing framework, which is triggered by runtest and individually launches nodes for testing.
2. Cluster framework, which is triggered by runtest-cluster, pre-allocates N nodes and uses them for testing large configurations.
The normal TCL testing framework is much more readily tested and is also automatically run as part of the CI for new PRs. Because runtest-cluster runs very slowly (it cannot be parallelized), it currently runs only in the daily CI, so some changes to the cluster are not exposed in PR CI in time. This PR migrates the cluster mode tests to the normal framework. Some cluster tests are kept in runtest-cluster because of timing issues or because they are not yet supported; we can process them later.
Signed-off-by: Binbin <binloveplay1314@qq.com>

Cache CLUSTER SLOTS response for improving throughput and reduced latency. (#53) 2024-05-23 02:51:41 +05:30
This commit adds logic to cache the `CLUSTER SLOTS` response for reduced latency and updates the cache when a change in the cluster is detected. Historically, the `CLUSTER SLOTS` command was deprecated; however, all the server clients have been using `CLUSTER SLOTS` and have not migrated to `CLUSTER SHARDS`. In the future this logic can be added to other commands to improve the performance of the engine.
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Blamed line: `wait_for_cluster_propagation`

Fix incorrect usage of process_is_paused in tests (#783) 2024-07-19 11:25:58 +08:00
It was introduced incorrectly in #442.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Blamed line: `if {[process_is_paused [srv -$j pid]]} continue`

Replicas with the same offset queue up for election (#762) 2024-07-23 14:43:16 +08:00
In some cases, such as a scenario with more reads than writes, the replicas have the same replication offset. When the primary fails, the replicas then have the same rank (rank == 0) and start their elections at the same time (although we have a random 500); the simultaneous elections may fail for lack of quorum. In clusterGetReplicaRank, when we calculate the rank and the offsets are the same, the replica with the smaller node name gets the better rank to avoid this situation.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Blamed lines: the whole `start_cluster 3 6 {tags {external:skip cluster}} { ... }` block
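
The tie-break described in the #762 commit message can be illustrated with a small standalone sketch. It is not the server's `clusterGetReplicaRank` code: `replica_rank` is a hypothetical helper, node names are plain strings, and offsets are integers. It only shows the rule as the commit message states it, namely that a replica ranks behind every sibling with a greater offset and, among equal offsets, behind every sibling with a smaller node name.

```tcl
# Hypothetical helper illustrating the tie-break rule from #762.
# replicas: dict mapping node name -> replication offset.
# Returns the rank of $me (0 is the best rank).
proc replica_rank {me replicas} {
    set my_offset [dict get $replicas $me]
    set rank 0
    dict for {name offset} $replicas {
        if {$name eq $me} continue
        if {$offset > $my_offset} {
            # A sibling with more replicated data always ranks ahead.
            incr rank
        } elseif {$offset == $my_offset && [string compare $name $me] < 0} {
            # Same offset: the smaller node name ranks ahead.
            incr rank
        }
    }
    return $rank
}

set replicas [dict create aaa 100 bbb 100 ccc 90]
puts [replica_rank aaa $replicas]  ;# 0
puts [replica_rank bbb $replicas]  ;# 1
puts [replica_rank ccc $replicas]  ;# 2
```

With the tie-break in place, two replicas at the same offset no longer both get rank 0, which is the property the last test in the second cluster asserts through the election log messages.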