Limit CLUSTER_CANT_FAILOVER_DATA_AGE log to 10 times period (#1189)

If a replica enters the "data age too old" state, it cannot trigger a failover, and it currently cannot recover from that state automatically. While it stays there, we print a log every CLUSTER_CANT_FAILOVER_RELOG_PERIOD, which is every second. If the primary does not recover and no manual failover is performed, this message floods the log file.

In this case, limit the log frequency to 10 times the period, which is 10 seconds in our code. In the "data age too old" state, the repeated logs can also serve as an indication of the failover's progress.

See also #780 for more details.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
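
The throttling rule described above can be sketched in isolation. This is a minimal illustration and not the code from this commit: `relog_state`, `relog_should_log`, and the `REASON_*` and `RELOG_PERIOD` names are hypothetical stand-ins, and the one-second period is taken from the commit message. Unlike the real function, which relies on static and server state, the sketch passes the state and the current time explicitly so the behavior is easy to check.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-ins for the real constants (assumptions for this sketch).
 * The commit message states that the relog period is one second. */
#define RELOG_PERIOD 1
#define REASON_NONE 0
#define REASON_DATA_AGE 1
#define REASON_OTHER 2

/* State that the real function keeps in a static variable and in the
 * cluster state; held in a struct here so it can be passed explicitly. */
typedef struct {
    int last_reason;      /* reason of the most recent log line */
    time_t last_log_time; /* time of the most recent log line */
} relog_state;

/* Returns true if a message for `reason` should be logged at time `now`,
 * and records that log in `st`. A repeated reason is suppressed for one
 * period; a repeated data-age reason is suppressed for ten periods. */
static bool relog_should_log(relog_state *st, int reason, time_t now) {
    time_t period = RELOG_PERIOD;
    if (reason == REASON_DATA_AGE) period *= 10;

    if (reason == st->last_reason && now - st->last_log_time < period) {
        return false;
    }

    st->last_reason = reason;
    st->last_log_time = now;
    return true;
}
```

With this sketch, a caller that reports `REASON_DATA_AGE` once per second gets one log line every 10 seconds, while any change of reason is logged immediately.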
2024-10-24 16:38:47 +08:00
commit a21fe718f4
parent c419524c05
1 changed file with 10 additions and 3 deletions
--- a/src/cluster_legacy.c
+++ b/src/cluster_legacy.c
@@ -4433,11 +4433,18 @@ int clusterGetReplicaRank(void) {
 void clusterLogCantFailover(int reason) {
    char *msg;
    static time_t lastlog_time = 0;
+    time_t now = time(NULL);

-    /* Don't log if we have the same reason for some time. */
-    if (reason == server.cluster->cant_failover_reason &&
-        time(NULL) - lastlog_time < CLUSTER_CANT_FAILOVER_RELOG_PERIOD)
+    /* General logging suppression if the same reason has occurred recently. */
+    if (reason == server.cluster->cant_failover_reason && now - lastlog_time < CLUSTER_CANT_FAILOVER_RELOG_PERIOD) {
        return;
+    }
+
+    /* Special case: If the failure reason is due to data age, log 10 times less frequently. */
+    if (reason == server.cluster->cant_failover_reason && reason == CLUSTER_CANT_FAILOVER_DATA_AGE &&
+        now - lastlog_time < 10 * CLUSTER_CANT_FAILOVER_RELOG_PERIOD) {
+        return;
+    }

    server.cluster->cant_failover_reason = reason;