1532 Commits

Author SHA1 Message Date
Madelyn Olson
150c197bdd
Apply CVE patches for CVE-2024-31449, CVE-2024-31227, CVE-2024-31228 (#1115)
Applying the CVEs against mainline.

(CVE-2024-31449) Lua library commands may lead to stack overflow and
potential RCE.
(CVE-2024-31227) Potential Denial-of-service due to malformed ACL
selectors.
(CVE-2024-31228) Potential Denial-of-service due to unbounded pattern
matching.

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2024-10-02 19:22:09 -04:00
Binbin
9827eef4d0
Avoid timing issue in diskless-load-swapdb test (#1077)
Since we paused the primary node earlier, the replica may enter
cluster down due to primary node pfail. Here we allow reads to
prevent subsequent read errors.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-10-01 13:14:30 -07:00
Viktor Söderqvist
69eddb4874
Speed up AOF rewrite test case (#1093)
These two test cases run in a loop:

* AOF rewrite during write load: RDB preamble=yes
* AOF rewrite during write load: RDB preamble=no

Both of the test cases build up a lot of data (3-4 million keys when I
run locally), so we should empty the data before the second test case.
Otherwise, the second test case adds keys on top of the keys added in
the first test case, resulting in double the number of keys and taking
more time.

Before this commit:

    [ok]: AOF rewrite during write load: RDB preamble=yes (18225 ms)
    [ok]: AOF rewrite during write load: RDB preamble=no (37249 ms)

After:

    [ok]: AOF rewrite during write load: RDB preamble=yes (18777 ms)
    [ok]: AOF rewrite during write load: RDB preamble=no (19940 ms)

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-09-30 19:55:23 +02:00
Viktor Söderqvist
99865b197c
Fix bug for CLUSTER SLOTS from EVAL over TLS (#1072)
For fake clients like the ones used for Lua and modules, we don't
determine TLS in the right way, causing CLUSTER SLOTS from EVAL over TLS
to fail a debug-assert.

This error was introduced when the caching of CLUSTER SLOTS was
introduced, i.e. in 8.0.0.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-09-25 03:55:53 -04:00
Binbin
80fcbd3fec
Fix module / script call CLUSTER SLOTS / SHARDS fake client check crash (#1063)
The reason is that VM_Call uses a fake client without a connection,
so we also need to check whether c->conn is NULL.

This also affects scripts. If these commands are called in a script,
the server will crash. Injecting commands into the AOF will also cause
startup failure.

Fixes #1054.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-09-25 14:50:48 +08:00
Binbin
56fba564b6
Print an empty primary log when primary lost its last slot (#1064)
The one in CLUSTER SETSLOT helps us keep track of the state better,
and of course it also makes the test case happy.

The one in the gossip process fixes a problem where a replica could
print a log saying it is an empty primary.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
2024-09-23 13:14:09 +08:00
Binbin
d9c41e9ef9
Fix timing issue in the new tot-net-out replica test (#1060)
Apparently there is a timing issue when using wait_for_ofs_sync:
```
[exception]: Executing test client: can't read "out_before": no such variable.
can't read "out_before": no such variable
```

The reason is that if the connection between the primary
and the replica is not established yet, the master_repl_offset
of the primary and replica in wait_for_ofs_sync is 0, and
the check fails, resulting in no replica client in the
client list below.

In this case, we need to make sure the replica is online
before proceeding.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-09-20 14:25:05 +08:00
Shivshankar
56fd97733b
Move printver test to info-command file (#1056)
This fixes: #219

Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
2024-09-20 10:18:19 +08:00
Binbin
dcc7678fc4
Fix replica unable to trigger migration when it receives CLUSTER SETSLOT in advance (#981)
Fix timing issue in evaluating `cluster-allow-replica-migration` for replicas

There is a timing bug where the primary and replica have different 
`cluster-allow-replica-migration` settings. In issue #970, we found that if 
the replica receives `CLUSTER SETSLOT` before the gossip update, it remains 
in the original shard. This happens because we only process the 
`cluster-allow-replica-migration` flag for primaries during `CLUSTER SETSLOT`.

This commit fixes the issue by also evaluating this flag for replicas in the 
`CLUSTER SETSLOT` path, ensuring correct replica migration behavior.

Closes #970
---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
2024-09-13 15:32:20 -07:00
Ping Xie
3cc619f637
Disable flaky empty shard slot migration tests (#1027)
Will continue my investigation offline

Signed-off-by: Ping Xie <pingxie@google.com>
2024-09-13 00:02:39 -07:00
Binbin
f7c5b40183
Avoid false positive in election tests (#984)
The node may not be able to initiate an election in time due to
problems with cluster communication. If an election is initiated,
make sure its offset is 0.

Closes #967.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-09-13 14:53:39 +08:00
Ping Xie
76a59788e6
Re-enable empty-shard slot migration tests (#1024)
Related to #734 and #858

Signed-off-by: Ping Xie <pingxie@google.com>
2024-09-11 23:19:32 -07:00
uriyage
8cca11ac54
Fix wrong count for replica's tot-net-out (#1013)
Fix duplicate calculation of replica's `net_output_bytes`

- Remove redundant calculation leftover from previous refactor
- Add test to prevent regression

Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
2024-09-12 10:36:40 +08:00
Madelyn Olson
2b207ee1b3
Improve stability of hostnames test (#1016)
Maybe partially resolves https://github.com/valkey-io/valkey/issues/952.

The hostnames test relies on an assumption that node zero and node six
don't communicate with each other to test a bunch of behavior in the
handshake state. This was done by previously dropping all meet packets,
however it seems like there was some case where node zero was sending a
single pong message to node 6, which was partially initializing the
state.

I couldn't track down why this happened, but I adjusted the test to
simply pause node zero which also correctly emulates the state we want
to be in since we're just testing state on node 6, and removes the
chance of errant messages. The test was failing about 5% of the time
locally, and I wasn't able to reproduce a failure with this new
configuration.

---------

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2024-09-11 09:52:34 -07:00
Binbin
4033c99ef5
Fix module RdbLoad wrongly disabling the AOF (#1001)
In RdbLoad, we disable AOF before emptyData and rdbLoad to prevent copy-on-write issues. After rdbLoad completes, AOF should be re-enabled, but the code incorrectly checks server.aof_state, which has been reset to AOF_OFF in stopAppendOnly. This leads to AOF not being re-enabled after being disabled.
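
In essence, the fix must capture whether AOF was enabled before calling stopAppendOnly, rather than re-checking server.aof_state afterwards. A simplified sketch; the struct and helper names below are illustrative stand-ins, not the real module code:

```c
/* Illustrative sketch; server, AOF_* and the helpers stand in for the real ones. */
enum { AOF_OFF = 0, AOF_ON = 1 };
static struct { int aof_state; } server;

void stopAppendOnly(void);    /* among other things, sets server.aof_state = AOF_OFF */
int  startAppendOnly(void);
void empty_data_and_load_rdb(const char *filename);

void module_rdb_load(const char *filename) {
    /* Remember the state *before* disabling AOF; re-checking server.aof_state
     * after stopAppendOnly() always sees AOF_OFF, which was the bug. */
    int aof_was_enabled = (server.aof_state == AOF_ON);
    if (aof_was_enabled) stopAppendOnly();

    empty_data_and_load_rdb(filename);  /* emptyData + rdbLoad without COW issues */

    if (aof_was_enabled) startAppendOnly();
}
```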
---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-09-10 21:00:08 -07:00
Binbin
50c1fe59f7
Add missing moduleapi getchannels test and fix tests (#1002)
Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-09-10 10:13:54 +08:00
Binbin
5693fe4664
Fix set expire test due to the new lazyfree configs changes (#980)
Test failed because of these two PRs: #865 and #913.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-09-02 22:43:09 +08:00
Binbin
70624ea63d
Change all the lazyfree configurations to yes by default (#913)
## Set replica-lazy-flush and lazyfree-lazy-user-flush to yes by
default.
There are many problems with running flush synchronously. Even in
single CPU environments, the thread managers should balance between
the freeing and serving incoming requests.

## Set lazy eviction, expire, server-del, user-del to yes by default
We now have a del and a lazyfree del, and we also have these configuration
items to control them: lazyfree-lazy-eviction, lazyfree-lazy-expire,
lazyfree-lazy-server-del, lazyfree-lazy-user-del. In most cases lazyfree
is better since it reduces the risk of blocking the main thread, and
because we have lazyfreeGetFreeEffort, only objects with high effort
(currently above 64) will actually use lazyfree.
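
Roughly, the decision looks like the sketch below; the names are illustrative and only the 64 threshold is the effort value mentioned above:

```c
#include <stdbool.h>

#define LAZYFREE_THRESHOLD 64

/* Illustrative stand-ins: the real lazyfreeGetFreeEffort estimates the work
 * needed to free an object (e.g. number of elements in a container). */
extern unsigned long free_effort(const void *obj);
extern void free_now(void *obj);                /* free on the main thread   */
extern void free_in_background(void *obj);      /* hand off to a bio thread  */

/* With the lazyfree-lazy-* configs enabled, objects whose free effort is
 * above the threshold are released by a background thread; cheap objects are
 * still freed inline, since queueing them would cost more than freeing them. */
void del_object(void *obj, bool lazyfree_enabled) {
    if (lazyfree_enabled && free_effort(obj) > LAZYFREE_THRESHOLD)
        free_in_background(obj);
    else
        free_now(obj);
}
```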

Part of #653.

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-09-02 07:07:17 -07:00
Binbin
e3af1a30e4
Fast path in SET if the expiration time is expired (#865)
If the expiration time passed to SET is already expired, for example
because of the machine time (DTS) or a wrong expiration argument being
passed in, we don't need to set the key and wait for the active expire
scan before deleting the key.

Compared with previous changes:
1. If the key does not exist, previously we would set the key and wait
for the active expire to delete it, so it is a SET + DEL from the
perspective of propagation. Now we do not set the key and just return,
so it is a NOP.

2. If the key exists, previously we would set the key and wait
for the active expire to delete it, so it is a SET + DEL from the
perspective of propagation. Now we delete it and return, so it is a DEL.

Adding a new deleteExpiredKeyFromOverwriteAndPropagate function
to reduce the duplicate code.
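
A hedged sketch of the fast path described above; all function and helper names here are illustrative, not the actual Valkey code:

```c
#include <stdbool.h>

/* Illustrative stand-ins for server state and helpers, not the real API. */
extern long long mstime(void);                            /* current time in ms */
extern bool db_key_exists(const char *key);
extern void db_delete_and_propagate_del(const char *key);
extern void reply_ok(void);

/* If the absolute expire passed to SET is already in the past, skip storing
 * the value: delete an existing key (propagated as a DEL) or turn the
 * command into a NOP when the key does not exist. Returns true when the
 * fast path was taken and the caller should not perform the normal SET. */
bool set_expired_fast_path(const char *key, long long expire_at_ms) {
    if (expire_at_ms > mstime()) return false;  /* not expired: normal SET path */
    if (db_key_exists(key))
        db_delete_and_propagate_del(key);       /* previously SET + DEL, now just DEL */
    reply_ok();
    return true;
}
```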

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-08-31 22:39:07 +08:00
Binbin
fea49bce2c
Fix timing issue in replica migration test (#968)
The reason is that server 3 still has server 7 as its replica
because the wait was too short; we should wait for the server
to lose its replica.
```
*** [err]: valkey-cli make source node ignores NOREPLICAS error when doing the last CLUSTER SETSLOT
Expected '{127.0.0.1 21497 267}' to be equal to '' (context: type eval line 34 cmd {assert_equal [lindex [R 3 role] 2] {}} proc ::test)
```

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-08-30 19:58:46 +08:00
zhaozhao.zz
743f5ac2ae
standalone -REDIRECT handles special case of MULTI context (#895)
In standalone mode, when a `-REDIRECT` error occurs, special handling is
required if the client is in the `MULTI` context.

We have adopted the same handling method as the cluster mode:

1. If a command in the transaction encounters a `REDIRECT` at the time
of queuing, the execution of `EXEC` will return an `EXECABORT` error (we
expect the client to redirect and discard the transaction upon receiving
a `REDIRECT`). That is:

    ```
    MULTI    ==>  +OK
    SET x y  ==>  -REDIRECT
    EXEC     ==>  -EXECABORT
    ```
2. If all commands are successfully queued (i.e., `QUEUED` results are
received) but a redirect is detected during `EXEC` execution (such as a
primary-replica switch), a `REDIRECT` is returned to instruct the client
to perform a redirect. That is:

    ```
    MULTI    ==>  +OK
    SET x y  ==>  +QUEUED
    failover
    EXEC     ==>  -REDIRECT
    ```

---------

Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
2024-08-30 10:17:53 +08:00
Binbin
ecbfb6a7ec
Fix reconfiguring sub-replica causing data loss when myself change shard_id (#944)
When reconfiguring a sub-replica, there may be a case where the sub-replica
uses the old offset, wins the election, and causes data loss if the old
primary goes down.

In this case, the sender is myself's primary. When executing updateShardId,
not only the sender's shard_id is updated, but also the shard_id of
myself is updated, causing the subsequent areInSameShard check, that is,
the full_sync_required check, to fail.

As part of the recent fix of #885, the sub-replica needs to decide whether
a full sync is required or not when switching shards. This shard membership
check is supposed to be done against sub-replica's current shard_id, which
however was lost in this code path. This then leads to sub-replica joining
the other shard with a completely different and incorrect replication history.

This is the only place where replicaof state can be updated on this path
so the most natural fix would be to pull the chain replication reduction
logic into this code block and before the updateShardId call.

This one follows #885 and closes #942.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
2024-08-29 22:39:53 +08:00
Ping Xie
ad0ede302c
Exclude '.' and ':' from isValidAuxChar's banned charset (#963)
Fix a bug in isValidAuxChar where valid characters '.' and ':' were
incorrectly included in the banned charset. This issue affected the
validation of auxiliary fields in the nodes.conf file used by Valkey in
cluster mode, particularly when handling IPv4 and IPv6 addresses. The
code now correctly allows '.' and ':' as valid characters, ensuring
proper handling of these fields. Comments were added to clarify the use
of the banned charset.
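
In essence the validation is a membership test against a banned character set; a minimal sketch, where the function name and the exact banned set below are illustrative rather than the real isValidAuxChar code:

```c
#include <string.h>

/* Characters banned from nodes.conf aux field values; the exact set here is
 * illustrative. Note that '.' and ':' are deliberately not in it, so IPv4
 * and IPv6 addresses stored in aux fields remain valid. */
static int is_valid_aux_char(int c) {
    return strchr("!\"#$%&'()*+,;<>?@[]{}| \t\r\n", c) == NULL;
}
```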
 
Related to #736

---------

Signed-off-by: Ping Xie <pingxie@google.com>
2024-08-28 23:35:31 -07:00
Binbin
75b824052d
Revert make KEYS to be an exact match if there is no pattern (#964)
In #792, the time complexity became ambiguous, fluctuating between
O(1) and O(n), which is a significant difference. We agree that uncertainty
can potentially bring disaster to the business; the right thing to do is
to persuade users to use EXISTS instead of KEYS in this case, to do the
right thing the right way, rather than accommodating this incorrect usage.

This reverts commit d66a06e8183818c035bb78706f46fd62645db07e.
This reverts #792.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-08-29 10:58:19 +08:00
Binbin
4fe8320711
Add pause path coverage to replica migration tests (#937)
In #885, we only added a shutdown path; there is another path
where the server might hang on a slow command (slowlog). This PR adds
the pause path coverage to cover it.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-08-28 11:08:27 +08:00
Binbin
6a84e06b05
Wait for the role change and fix the timing issue in the new test (#947)
The test might run fast enough that the role has not changed yet,
causing the test to fail. Add a wait to avoid the timing issue:
```
*** [err]: valkey-cli make source node ignores NOREPLICAS error when doing the last CLUSTER SETSLOT
Expected '{127.0.0.1 23154 267}' to be equal to '' (context: type eval line 24 cmd {assert_equal [lindex [R 3 role] 2] {}} proc ::test)
```

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-08-28 09:51:10 +08:00
uriyage
04d76d8b02
Improve multithreaded performance with memory prefetching (#861)
This PR utilizes the IO threads to execute commands in batches, allowing
us to prefetch the dictionary data in advance.

After making the IO threads asynchronous and offloading more work to
them in the first 2 PRs, the `lookupKey` function becomes a main
bottle-neck and it takes about 50% of the main-thread time (Tested with
SET command). This is because the Valkey dictionary is a straightforward
but inefficient chained hash implementation. While traversing the hash
linked lists, every access to either a dictEntry structure, pointer to
key, or a value object requires, with high probability, an expensive
external memory access.

### Memory Access Amortization

Memory Access Amortization (MAA) is a technique designed to optimize the
performance of dynamic data structures by reducing the impact of memory
access latency. It is applicable when multiple operations need to be
executed concurrently. The principle behind it is that for certain
dynamic data structures, executing operations in a batch is more
efficient than executing each one separately.

Rather than executing operations sequentially, this approach interleaves
the execution of all operations. This is done in such a way that
whenever a memory access is required during an operation, the program
prefetches the necessary memory and transitions to another operation.
This ensures that when one operation is blocked awaiting memory access,
other memory accesses are executed in parallel, thereby reducing the
average access latency.

We applied this method in the development of `dictPrefetch`, which takes
as parameters a vector of keys and dictionaries. It ensures that all
memory addresses required to execute dictionary operations for these
keys are loaded into the L1-L3 caches when executing commands.
Essentially, `dictPrefetch` is an interleaved execution of dictFind for
all the keys.


**Implementation details**

When the main thread iterates over the `clients-pending-io-read`, for
clients with ready-to-execute commands (i.e., clients for which the IO
thread has parsed the commands), a batch of up to 16 commands is
created. Initially, the command's argv, which was allocated by the IO
thread, is prefetched into the main thread's L1 cache. Subsequently, all
the dict entries and values required for the commands are prefetched
from the dictionary before the command execution. Only then will the
commands be executed.
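
For illustration, a minimal sketch of the interleaving idea with a simplified chained-hash layout; `entry`, `lookup_state` and `prefetch_batch` are illustrative stand-ins, not the real dict types or dictPrefetch:

```c
#include <stddef.h>

/* Simplified chained-hash node; stand-in for the real dictEntry. */
typedef struct entry {
    const char *key;
    void *value;
    struct entry *next;
} entry;

/* Hypothetical per-key cursor for one in-flight chain walk. */
typedef struct {
    entry *cursor;
} lookup_state;

/* Interleave up to n chain walks. Each round first issues prefetches for
 * every live cursor, then advances each cursor one step; by the time a
 * cursor is dereferenced, the prefetches issued for the other cursors are
 * already in flight, hiding part of the memory latency. */
static void prefetch_batch(lookup_state *st, size_t n) {
    for (;;) {
        size_t live = 0;
        for (size_t i = 0; i < n; i++) {
            if (st[i].cursor == NULL) continue;
            __builtin_prefetch(st[i].cursor);
            live++;
        }
        if (live == 0) break;
        for (size_t i = 0; i < n; i++) {
            if (st[i].cursor == NULL) continue;
            __builtin_prefetch(st[i].cursor->key); /* key bytes for the later compare */
            st[i].cursor = st[i].cursor->next;     /* move to the next chain node */
        }
    }
}
```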

---------

Signed-off-by: Uri Yagelnik <uriy@amazon.com>
2024-08-26 21:10:44 -07:00
Binbin
d66a06e818
Make KEYS to be an exact match if there is no pattern (#792)
Although KEYS is a dangerous command and we recommend people
to avoid using it, some people who are not familiar with it
still use it, and even use KEYS with no pattern at all.

Once KEYS is used with no pattern, we can convert it to an
exact match to avoid iterating over all the data.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-08-27 12:04:27 +08:00
Ayush Sharma
b48596a914
Add support for setting the group on a unix domain socket (#901)
Add new optional, immutable string config called `unixsocketgroup`. 
Change the group of the unix socket to `unixsocketgroup` after `bind()`
if specified.
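
Mechanically this boils down to a POSIX group lookup and a chown on the socket path after bind(); a sketch under that assumption, with an illustrative helper name:

```c
#include <grp.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* After bind() has created the socket file, look up the configured group and
 * change the file's group ownership, leaving the owning user untouched. */
static int apply_unixsocket_group(const char *path, const char *groupname) {
    struct group *grp = getgrnam(groupname);
    if (grp == NULL) {
        fprintf(stderr, "unixsocketgroup: unknown group '%s'\n", groupname);
        return -1;
    }
    if (chown(path, (uid_t)-1, grp->gr_gid) == -1) {
        perror("chown unix socket");
        return -1;
    }
    return 0;
}
```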

Adds tests to validate the behavior.

Fixes #873.

Signed-off-by: Ayush Sharma <mrayushs933@gmail.com>
2024-08-23 11:52:08 -07:00
Binbin
8045994972
valkey-cli make source node ignores NOREPLICAS when doing the last CLUSTER SETSLOT (#928)
This fixes #899. In that issue, the primary is cluster-allow-replica-migration no
and its replica is cluster-allow-replica-migration yes.

And during the slot migration:
1. The primary calls blockClientForReplicaAck, waiting for its replica.
2. Its replica reconfigures itself as a replica of another shard due to
replica migration and disconnects from the old primary.
3. The old primary never got the chance to receive the ack, so it hit a
timeout and got a NOREPLICAS error.

In this case, the replicas might automatically migrate to another primary,
resulting in the client being unblocked with the NOREPLICAS error. In this
case, since the configuration will eventually propagate itself, we can safely
ignore this error on the source node.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-08-23 16:22:30 +08:00
Binbin
5d97f5133c
Fix CLUSTER SETSLOT block and unblock error when all replicas are down (#879)
In the CLUSTER SETSLOT propagation logic, if the replicas are down, the
client will get blocked during command processing and then unblocked
with `NOREPLICAS Not enough good replicas to write`.

The reason is that all replicas are down (or some are down), but
myself->num_replicas includes all replicas, so the client will
get blocked and always time out.

We should only wait for the online replicas, otherwise the waiting
propagation will always time out since there are not enough replicas.
The admin can easily check if there are replicas that have been down for an
extended period of time. If they decide to move forward anyways, we
should not block it. If a replica failed right before the replication and
was not included in it, it would also be unlikely to win the election.
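
Conceptually, the ack wait should target a count like the sketch below instead of myself->num_replicas; the struct and field names are illustrative, not the real cluster node layout:

```c
#include <stdbool.h>

typedef struct cluster_node {
    bool online;                    /* illustrative: not flagged FAIL/PFAIL */
    struct cluster_node **replicas;
    int num_replicas;
} cluster_node;

/* Count only replicas that can actually acknowledge the new slot ownership,
 * so CLUSTER SETSLOT does not block until timeout when some replicas are
 * known to be down. */
int num_online_replicas(const cluster_node *node) {
    int n = 0;
    for (int i = 0; i < node->num_replicas; i++)
        if (node->replicas[i]->online) n++;
    return n;
}
```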

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@google.com>
2024-08-23 16:21:53 +08:00
Wen Hui
959dd3485b
Decline unsubscribe related command in non-subscribed mode (#759)
Now, when clients run the unsubscribe, sunsubscribe and punsubscribe
commands in non-subscribed mode, the server returns 0.
Indeed this is a bug; we should not allow clients to run these kinds of
commands here.

Thus, this PR fixes the bug, but it is a breaking change for existing
clients.

---------

Signed-off-by: hwware <wen.hui.ware@gmail.com>
2024-08-22 11:21:33 -04:00
uriyage
39f8bcb91b
Skip tracking clients OOM test when I/O threads are enabled (#764)
Fix feedback loop in key eviction with tracking clients when using I/O
threads.

Current issue:
Evicting keys while tracking clients or keyspace notifications exist
creates a feedback loop when using I/O threads:

While evicting keys we send tracking async writes to I/O threads,
preventing immediate release of tracking clients' COB memory
consumption.

Before the I/O thread finishes its write, we recheck used_memory, which
now includes the tracking clients' COB and thus continue to evict more
keys.

**Fix:**
We will skip the test for now while IO threads are active. We may
consider avoiding sending writes in `processPendingWrites` to I/O
threads for tracking clients when we are out of memory.

---------

Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-08-21 17:02:57 -07:00
Binbin
e1b3629186 Fix data loss when replica does a failover with an old historical repl offset (#885)
Our current replica can initiate a failover without restriction when
it detects that the primary node is offline. This is generally not a
problem. However, consider the following scenarios:

1. In slot migration, a primary loses its last slot and then becomes
a replica. When it is fully synchronized with the new primary, the new
primary goes down.

2. In the CLUSTER REPLICATE command, a replica becomes a replica of another
primary. When it is fully synchronized with the new primary, the new
primary goes down.

In the above scenario, case 1 may cause the empty primary to be elected
as the new primary, resulting in primary data loss. Case 2 may cause the
non-empty replica to be elected as the new primary, resulting in data
loss and confusion.

The reason is that we have cached primary logic, which is used for psync.
In the above scenario, when clusterSetPrimary is called, myself will cache
server.primary in server.cached_primary for psync. In replicationGetReplicaOffset,
we get server.cached_primary->reploff for the offset, gossip it and rank it,
which causes the replica to use the old historical offset to initiate
a failover; it gets a good rank, initiates the election first, and then is
elected as the new primary.

The main problem here is that when the replica has not completed full
sync, it may get the historical offset in replicationGetReplicaOffset.

The fix is to clear cached_primary in these places where full sync is
obviously needed, and let the replica use offset == 0 to participate
in the election. In this way, this unhealthy replica has a worse rank
and is not easy to be elected.

Of course, it is possible that it will be elected with offset == 0.
In the future, we may need to prohibit the replica with offset == 0
from having the right to initiate elections.
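
For reference, the offset lookup falls back to the cached primary when there is no live primary link; a simplified sketch of that shape (illustrative structs, not the exact replicationGetReplicaOffset code):

```c
/* Illustrative structs; the shape mirrors the description above. */
typedef struct client { long long reploff; } client;

static struct {
    const char *primary_host;
    client *primary;         /* live replication link, if any */
    client *cached_primary;  /* kept around for a possible PSYNC */
} server;

/* With no live link, the cached primary's reploff is reported. Clearing
 * cached_primary where a full sync is clearly required makes this return 0,
 * which gives the unhealthy replica the worst rank in the election. */
long long replication_get_replica_offset(void) {
    long long offset = 0;
    if (server.primary_host != NULL) {
        if (server.primary != NULL)
            offset = server.primary->reploff;
        else if (server.cached_primary != NULL)
            offset = server.cached_primary->reploff;
    }
    if (offset < 0) offset = 0;
    return offset;
}
```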

Another point worth mentioning, in the above cases:
1. In the ROLE command, the replica status will be handshake, and the
offset will be -1.
2. Before this PR, in the CLUSTER SHARDS command, the replica status would
be online, and the offset would be the old cached value (which is wrong).
3. After this PR, in CLUSTER SHARDS, the replica status will be loading,
and the offset will be 0.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-08-21 13:11:21 +08:00
Binbin
5166d489da
Correctly record client information to the slowlog when running script (#805)
Currently when we are running a script, we pass in a fake client.
So if the command executed in the script is recorded in the slowlog,
the client's ip:port and name information will be empty.

before:
```
127.0.0.1:6379> client setname myclient
OK
127.0.0.1:6379> config set slowlog-log-slower-than 0
OK
127.0.0.1:6379> eval "redis.call('ping')" 0
(nil)
127.0.0.1:6379> slowlog get 2
1) 1) (integer) 2
   2) (integer) 1721314289
   3) (integer) 96
   4) 1) "eval"
      2) "redis.call('ping')"
      3) "0"
   5) "127.0.0.1:61106"
   6) "myclient"
2) 1) (integer) 1
   2) (integer) 1721314289
   3) (integer) 4
   4) 1) "ping"
   5) ""
   6) ""
```

after:
```
127.0.0.1:6379> client setname myclient
OK
127.0.0.1:6379> config set slowlog-log-slower-than 0
OK
127.0.0.1:6379> eval "redis.call('ping')" 0
(nil)
127.0.0.1:6379> slowlog get 2
1) 1) (integer) 2
   2) (integer) 1721314371
   3) (integer) 53
   4) 1) "eval"
      2) "redis.call('ping')"
      3) "0"
   5) "127.0.0.1:51983"
   6) "myclient"
2) 1) (integer) 1
   2) (integer) 1721314371
   3) (integer) 1
   4) 1) "ping"
   5) "127.0.0.1:51983"
   6) "myclient"
```

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-08-10 23:46:56 +08:00
Madelyn Olson
b728e4170f
Disable empty shard slot migration test until test is de-flaked (#859)
We have a number of test failures in the empty shard migration which
seem to be related to race conditions in the failover, but could be more
pervasive. For now disable the tests to prevent so many false negative
test failures.

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2024-07-31 16:52:20 -07:00
Kyle Kim (kimkyle@)
e1d936b339
Add network-bytes-in and network-bytes-out metric support under CLUSTER SLOT-STATS command (#20) (#720)
Adds two new metrics for per-slot statistics, network-bytes-in and
network-bytes-out. The network bytes are inclusive of replication bytes
but exclude other types of network traffic such as clusterbus traffic.

#### network-bytes-in
The metric tracks network ingress bytes under per-slot context, by
reverse calculation of `c->argv_len_sum` and `c->argc`, stored under a
newly introduced field `c->net_input_bytes_curr_cmd`.

#### network-bytes-out
The metric tracks network egress bytes under per-slot context, by
hooking onto COB buffer mutations.

#### sample response
Both metrics are reported under the `CLUSTER SLOT-STATS` command.
```
127.0.0.1:6379> cluster slot-stats slotsrange 0 0
1) 1) (integer) 0
    2) 1) "key-count"
       2) (integer) 0
       3) "cpu-usec"
       4) (integer) 0
       5) "network-bytes-in"
       6) (integer) 0
       7) "network-bytes-out"
       8) (integer) 0
```

---------

Signed-off-by: Kyle Kim <kimkyle@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-07-26 16:06:16 -07:00
Binbin
59aa00823c
Replicas with the same offset queue up for election (#762)
In some cases, like a read-more-than-write scenario, the replication
offsets of the replicas are the same. When the primary fails, the
replicas have the same rank (rank == 0). They issue the election
at the same time (although we have a random 500 ms delay), and the
simultaneous elections may lead to the failure of the election due to
quorum.

In clusterGetReplicaRank, when we calculate the rank, if the offsets
are the same, the one with the smaller node name will have a better
rank to avoid this situation.
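
The tie-break can be sketched roughly as follows; the types and function name are simplified, not the actual clusterGetReplicaRank code:

```c
#include <string.h>

#define NODE_NAME_LEN 40

typedef struct replica_info {
    char name[NODE_NAME_LEN];
    long long repl_offset;
} replica_info;

/* Rank 0 is best. A replica is ranked behind every sibling with a larger
 * offset and, when offsets are equal, behind every sibling whose node name
 * compares smaller, so equal-offset replicas no longer all end up at rank 0
 * and start their elections at the same time. */
int get_replica_rank(const replica_info *me, const replica_info *siblings, int n) {
    int rank = 0;
    for (int i = 0; i < n; i++) {
        if (siblings[i].repl_offset > me->repl_offset) {
            rank++;
        } else if (siblings[i].repl_offset == me->repl_offset &&
                   memcmp(siblings[i].name, me->name, NODE_NAME_LEN) < 0) {
            rank++;
        }
    }
    return rank;
}
```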

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-07-22 23:43:16 -07:00
Kyle Kim (kimkyle@)
5000c050b5
Add cpu-usec metric support under CLUSTER SLOT-STATS command (#20). (#712)
The metric tracks cpu time in micro-seconds, sharing the same value as
`INFO COMMANDSTATS`, aggregated under per-slot context.

---------

Signed-off-by: Kyle Kim <kimkyle@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-07-22 18:03:28 -07:00
Binbin
14e09e981e
Fix the wrong woff when execute WAIT / WAITAOF in script (#776)
When executing the script, the client passed in is a fake
client, and its woff is always 0.

This results in woff always being 0 when executing wait/waitaof
in the script, and the command returns a wrong number.

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-07-22 10:33:10 +02:00
Binbin
15a8290231
Optimize failover time when the new primary node is down again (#782)
We do not reset failover_auth_time after setting it; it is used
to check auth_timeout and auth_retry_time, but we should at least
reset it after a successful failover.

Let's assume the following scenario:
1. Two replicas initiate an election.
2. Replica 1 is elected as the primary node, and replica 2 does not have
   enough votes.
3. Replica 1 goes down, i.e. the new primary node goes down again in a
   short time.
4. Replica 2 knows that the new primary node is down and wants to initiate
   a failover, but because the failover_auth_time of the previous round
   has not been reset, it needs to wait for it to time out and then wait
   for the next retry time, which takes cluster-node-timeout * 4;
   this adds a lot of delay.

There is another problem. We set additional random time for
failover_auth_time, such as a random 500ms and a replica-ranking 1s. If
replica 2 receives a PONG from the new primary node before sending the
FAILOVER_AUTH_REQUEST, that is, before the failover_auth_time, it will
change itself to a replica. If the new primary node goes down again at
this time, replica 2 will use the previous failover_auth_time to initiate
an election instead of going through the logic of the random 500ms and
replica-ranking 1s again, which may lead to unexpected consequences
(for example, a low-ranking replica initiates an election and becomes
the new primary node).

That is, we need to reset failover_auth_time at the appropriate time.
When the replica switches to a new primary, we reset it, because the
existing failover_auth_time is already out of date in this case.

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-07-19 15:27:49 -04:00
Harkrishn Patro
816accea76
Generate correct slot information in cluster shards command on primary failure (#790)
Fix #784 

Prior to the change, `CLUSTER SHARDS` command processing might pick a
failed primary node which won't have the slot coverage information and
the slots `output` in turn would be empty. This change finds an
appropriate node which has the slot coverage information served by a
given shard and correctly displays it as part of `CLUSTER SHARDS`
output.

 Before:
 
 ```
 1) 1) "slots"
   2)  (empty array)
   3) "nodes"
   4) 1)  1) "id"
          2) "2936f22a490095a0a851b7956b0a88f2b67a5d44"
          ...
          9) "role"
         10) "master"
         ...
         13) "health"
         14) "fail"
 ```
 
 After:
 

 ```
 1) 1) "slots"
   2) 1) 0
       2) 5461
   3) "nodes"
   4) 1)  1) "id"
          2) "2936f22a490095a0a851b7956b0a88f2b67a5d44"
          ...
          9) "role"
         10) "master"
         ...
         13) "health"
         14) "fail"
 ```

---------

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
2024-07-19 09:32:39 -07:00
Binbin
35a1888333
Fix incorrect usage of process_is_paused in tests (#783)
It was introduced incorrectly in #442.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-07-19 11:25:58 +08:00
naglera
ff6b780fe6
Dual channel replication (#60)
In this PR we introduce the main benefit of dual channel replication by
continuously streaming the COB (client output buffers) in parallel to the
RDB, thus keeping the primary-side COB small AND accelerating the
overall sync process. By streaming the replication data to the replica
during the full sync, we reduce
1. Memory load from the primary's node.
2. CPU load from the primary's main process. [Latest performance
tests](#data)

## Motivation
* Reduce primary memory load. We do that by moving the COB tracking to
the replica side. This also decreases the chance of COB overruns. Note
that the primary's input buffer limits on the replica side are less
restricted than the primary's COB, as the replica plays a less critical
part in the replication group. While increasing the primary’s COB may end
up with the primary reaching swap and clients suffering, at the replica
side we’re more at ease with it. A larger COB means a better chance to sync
successfully.
* Reduce primary main process CPU load. By opening a new, dedicated
connection for the RDB transfer, child processes can have direct access
to the new connection. Due to TLS connection restrictions, this was not
possible using one main connection. We eliminate the need for the child
process to use the primary's child-proc -> main-proc pipeline, thus
freeing up the main process to process client queries.


 ## Dual Channel Replication high level interface design
- Dual channel replication begins when the replica sends a `REPLCONF
CAPA DUALCHANNEL` to the primary during initial
handshake. This is used to state that the replica is capable of dual
channel sync and that this is the replica's main channel, which is not
used for snapshot transfer.
- When replica lacks sufficient data for PSYNC, the primary will send
`-FULLSYNCNEEDED` response instead
of RDB data. As a next step, the replica creates a new connection
(rdb-channel) and configures it against
the primary with the appropriate capabilities and requirements. The
replica then requests a sync
     using the RDB channel. 
- Prior to forking, the primary sends the replica the snapshot's end
repl-offset, and attaches the replica
to the replication backlog to keep repl data until the replica requests
psync. The replica uses the main
     channel to request a PSYNC starting at the snapshot end offset. 
- The primary main thread sends incremental changes via the main
channel, while the bgsave process
sends the RDB directly to the replica via the rdb-channel. As for the
replica, the incremental
changes are stored on a local buffer, while the RDB is loaded into
memory.
- Once the replica completes loading the rdb, it drops the
rdb-connection and streams the accumulated incremental
     changes into memory. Repl steady state continues normally.

## New replica state machine


![image](https://github.com/user-attachments/assets/38fbfff0-60b9-4066-8b13-becdb87babc3)





## Data <a name="data"></a>

![image](https://github.com/user-attachments/assets/d73631a7-0a58-4958-a494-a7f4add9108f)


![image](https://github.com/user-attachments/assets/f44936ed-c59a-4223-905d-0fe48a6d31a6)


![image](https://github.com/user-attachments/assets/bd333ee2-3c47-47e5-b244-4ea75f77c836)

## Explanation 
These graphs demonstrate performance improvements during full sync
sessions using rdb-channel + streaming rdb directly from the background
process to the replica.

First graph: with at most 50 clients and lightweight commands, we saw a
5%-7.5% improvement in write latency during the sync session.
The two graphs below: full sync was tested during heavy read commands on
the primary (such as sdiff, sunion on large sets). In that case, the
child process writes to the replica without sharing CPU with the loaded
main process. As a result, this not only improves client response time,
but may also shorten sync time by about 50%. The shorter sync time
results in less memory being used to store replication diffs (>60% in
some of the tested cases).

## Test setup 
Both primary and replica in the performance tests ran on the same
machine. RDB size in all tests is 3.7gb. I generated write load using
valkey-benchmark ` ./valkey-benchmark -r 100000 -n 6000000 lpush my_list
__rand_int__`.

---------

Signed-off-by: naglera <anagler123@gmail.com>
Signed-off-by: naglera <58042354+naglera@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-07-17 13:59:33 -07:00
Ping Xie
66d0f7d9a1
Ensure only primary sender drives slot ownership updates (#754)
Fixes a regression introduced in PR #445, which allowed a message from a
replica to update the slot ownership of its primary. The regression results
in a `replicaof` cycle, causing server crashes due to the cycle detection
assert. The fix restores the previous behavior where only primary senders
can trigger `clusterUpdateSlotsConfigWith`.

Additional changes:

* Handling of primaries without slots is obsoleted by new handling of
when a
  sender that was a replica announces that it is now a primary.
* Replication loop detection code is unchanged but shifted downwards.
* Some variables are renamed for better readability and some are
introduced to
  avoid repeated memcmp() calls.

Fixes #753.

---------

Signed-off-by: Ping Xie <pingxie@google.com>
2024-07-16 13:05:49 -07:00
KarthikSubbarao
418901dec4
Limit tracking custom errors (e.g. from LUA) while allowing non custom errors to be tracked normally (#500)
Implementing the change proposed here:
https://github.com/valkey-io/valkey/issues/487

In this PR, we prevent tracking new custom error messages (e.g. LUA) if
the number of error messages (in the errors RAX) is greater than 128.
Instead, we will track any additional custom error prefix in a new
counter: `errorstat_ERRORSTATS_OVERFLOW ` and if any non-custom flagged
errors (e.g. MOVED / CLUSTERDOWN) occur, they will continue to be
tracked as usual.

This will address the issue of spammed error messages / memory usage of
the errors RAX. Additionally, we will not have to execute `CONFIG
RESETSTAT` to restore error stats functionality because normal error
messages continue to be tracked.

Example:
```
# Errorstats
.
.
.
errorstat_127:count=2
errorstat_128:count=2
errorstat_ERR:count=1
errorstat_ERRORSTATS_OVERFLOW:count=2
```
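
A hedged sketch of the guard; the helper names below are illustrative, not the actual Valkey code:

```c
#include <stdbool.h>

#define ERRORSTATS_LIMIT 128

/* Illustrative stand-ins for the errors RAX and its helpers. */
extern long long errorstats_count(void);                /* number of tracked prefixes */
extern bool errorstats_exists(const char *error_prefix);
extern void errorstats_incr(const char *error_prefix);  /* create or bump a counter */

/* Non-custom errors (e.g. MOVED, CLUSTERDOWN) are always tracked. A custom
 * error (e.g. raised from Lua) only gets a new entry while the table is
 * below the limit; once full, new custom prefixes are folded into a single
 * ERRORSTATS_OVERFLOW counter. */
void track_error(const char *error_prefix, bool is_custom) {
    if (is_custom && !errorstats_exists(error_prefix) &&
        errorstats_count() >= ERRORSTATS_LIMIT) {
        errorstats_incr("ERRORSTATS_OVERFLOW");
        return;
    }
    errorstats_incr(error_prefix);
}
```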

---------

Signed-off-by: Karthik Subbarao <karthikrs2021@gmail.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2024-07-14 20:04:47 -07:00
Binbin
a4ee8dada4
Fix WAITAOF test in external test due to appendonly is enabled (#775)
The test fails because, in external mode, another test may have
enabled appendonly, causing acklocal to return 1.

We could add a CONFIG SET to disable appendonly, but this
is not safe either unless we use MULTI. The test does not actually
rely on appendonly, so we can just * it.

Fixes #770.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-07-12 23:32:39 +08:00
Madelyn Olson
9948f07a01
Temporary skip blockwait aof test until it's fixed (#773)
See https://github.com/valkey-io/valkey/issues/770 for details about
failure. Want to prevent the test failures.

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2024-07-11 13:10:13 -04:00
Viktor Söderqvist
a323dce890
Dual stack and client-specific IPs in cluster (#736)
New configs:

* `cluster-announce-client-ipv4`
* `cluster-announce-client-ipv6`

New module API function:

* `ValkeyModule_GetClusterNodeInfoForClient`, takes a client id and is
otherwise just like its non-ForClient cousin.

If configured, one of these IP addresses is reported to each client in
CLUSTER SLOTS, CLUSTER SHARDS, CLUSTER NODES and redirects, replacing
the IP (`cluster-announce-ip` or the auto-detected IP) of each node.
Which one is reported to the client depends on whether the client is
connected over IPv4 or IPv6.

Benefits:

* This allows clients using IPv4 to get the IPv4 addresses of all
cluster nodes and IPv6 clients to get the IPv6 addresses.
* This allows the IPs visible to clients to be different from the IPs used
between the cluster nodes due to NAT'ing.

The information is propagated in the cluster bus using new Ping
extensions. (Old nodes without this feature ignore unknown Ping
extensions.)

This adds another dimension to CLUSTER SLOTS reply. It now depends on
the client's use of TLS, the IP address family and RESP version.
Refactoring: The cached connection type definition is moved from
connection.h (it actually has nothing to do with the connection
abstraction) to server.h and is changed to a bitmap, with one bit for
each of TLS, IPv6 and RESP3.
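
The bitmap mentioned above can be sketched like this; the macro and struct names are illustrative, not the exact definitions:

```c
/* One bit per client property that changes the CLUSTER SLOTS reply. */
#define CACHED_REPLY_TLS   (1 << 0)
#define CACHED_REPLY_IPV6  (1 << 1)
#define CACHED_REPLY_RESP3 (1 << 2)
#define CACHED_REPLY_MAX   (1 << 3)   /* number of distinct cached reply slots */

typedef struct client_props {
    int tls;
    int ipv6;
    int resp;   /* 2 or 3 */
} client_props;

/* Index into an array of CACHED_REPLY_MAX cached CLUSTER SLOTS replies. */
static int cached_reply_index(const client_props *c) {
    int idx = 0;
    if (c->tls) idx |= CACHED_REPLY_TLS;
    if (c->ipv6) idx |= CACHED_REPLY_IPV6;
    if (c->resp == 3) idx |= CACHED_REPLY_RESP3;
    return idx;
}
```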

Fixes #337

---------

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-07-10 13:53:52 +02:00
Viktor Söderqvist
b99c7237f4
Fix unstable test case EVAL+WAITAOF (#766)
Test case "EVAL - Scripts do not block on waitaof" observed to fail in
e.g.
https://github.com/valkey-io/valkey/actions/runs/9860131487/job/27233756421?pr=688

It can happen that the local AOF has been written and 1 is returned here
where 0 is expected. Writing a key inside the EVAL script makes sure
there's no time to write the AOF.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-07-09 21:40:49 +02:00