11 Commits

Author SHA1 Message Date
Madelyn Olson
b728e4170f
Disable empty shard slot migration test until test is de-flaked (#859)
We have a number of test failures in the empty shard migration which
seem to be related to race conditions in the failover, but could be more
pervasive. For now disable the tests to prevent so many false negative
test failures.

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2024-07-31 16:52:20 -07:00
w. ian douglas
b59762f734
Very minor misspelling in some tests (#705)
Fix misspelling "faiover" instead of "failover" in two test cases.

Signed-off-by: w. ian douglas <ian.douglas@iandouglas.com>
2024-06-28 23:56:30 +02:00
Ping Xie
5d9d41868d
Replace DEBUG RESTART with pause_server and resume_server (#652) 2024-06-13 17:52:50 -07:00
Ping Xie
aad6769a80
Replicate slot migration states via RDB aux fields (#586) 2024-06-07 20:32:27 -07:00
Madelyn Olson
b95e7c384f
Skip tls for xgroup read regression since it doesn't matter (#595)
"Client blocked on XREADGROUP while stream's slot is migrated" uses the
migrate command, which requires special handling for TLS and non-tls.
This was not being handled, so was throwing an error.

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2024-06-03 11:49:15 -07:00
nitaicaro
6fb90adf4b
Fix crash where command duration is not reset when client is blocked … (#526)
In #11012, we changed the way command durations were computed to handle
the same command being executed multiple times. In #11970, we added an
assert if the duration is not properly reset, potentially indicating
that a call to report statistics was missed.

I found an edge case where this happens - easily reproduced by blocking
a client on `XGROUPREAD` and migrating the stream's slot. This causes
the engine to process the `XGROUPREAD` command twice:

1. First time, we are blocked on the stream, so we wait for unblock to
come back to it a second time. In most cases, when we come back to
process the command second time after unblock, we process the command
normally, which includes recording the duration and then resetting it.
2. After unblocking we come back to process the command, and this is
where we hit the edge case - at this point, we had already migrated the
slot to another node, so we return a `MOVED` response. But when we do
that, we don’t reset the duration field.

Fix: also reset the duration when returning a `MOVED` response. I think
this is right, because the client should redirect the command to the
right node, which in turn will calculate the execution duration.

Also wrote a test which reproduces this, it fails without the fix and
passes with it.

---------

Signed-off-by: Nitai Caro <caronita@amazon.com>
Co-authored-by: Nitai Caro <caronita@amazon.com>
2024-05-30 12:55:00 -07:00
Ping Xie
e4ead9442b
Make CLUSTER SETSLOT with TIMEOUT 0 block indefinitely (#556)
This aligns the behaviour with established Valkey commands with a
TIMEOUT argument, such as BLPOP.

Fix #422

Signed-off-by: Ping Xie <pingxie@google.com>
2024-05-27 07:11:24 -07:00
Viktor Söderqvist
d72ba06dd0
Make cluster replicas return ASK and TRYAGAIN (#495)
After READONLY, make a cluster replica behave as its primary regarding
returning ASK redirects and TRYAGAIN.

Without this patch, a client reading from a replica cannot tell if a key
doesn't exist or if it has already been migrated to another shard as
part of an ongoing slot migration. Therefore, without an ASK redirect in
this situation, offloading reads to cluster replicas wasn't reliable.

Note: The target of a redirect is always a primary. If a client wants to
continue reading from a replica after following a redirect, it needs to
figure out the replicas of that new primary using CLUSTER SHARDS or
similar.

This is related to #21 and has been made possible by the introduction of
Replication of Slot Migration States in #445.

----

Release notes:

During cluster slot migration, replicas are able to return -ASK
redirects and -TRYAGAIN.

---------

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-05-24 17:58:03 +02:00
Ping Xie
fd53f17a61
Use pause_process to stop a node to make Valgrind happy, hopefully (#508)
Signed-off-by: Ping Xie <pingxie@google.com>
2024-05-16 22:59:00 -07:00
Ping Xie
ac47ca2d47
Suppress ASAN errors on tests that intentially crash the server via crash-memcheck-enabled no (#489)
Fix daily CI run errors like
https://github.com/valkey-io/valkey/actions/runs/9039450198/job/24842308071#step:6:4176

Signed-off-by: Ping Xie <pingxie@google.com>
2024-05-12 16:08:47 -07:00
Ping Xie
6e7af9471c
Slot migration improvement (#445) 2024-05-06 21:40:28 -07:00