This PR optimizes the performance of HyperLogLog commands (PFCOUNT,
PFMERGE) by adding AVX2 fast paths.
Two AVX2 functions are added for conversion between the raw representation
and the dense representation. They are 15 to 30 times faster than the scalar
implementation. Note that the sparse representation is not accelerated.
AVX2 fast paths are enabled when the CPU supports AVX2 (checked at
runtime) and the HyperLogLog configuration is the default (HLL_REGISTERS ==
16384 && HLL_BITS == 6).
A `PFDEBUG SIMD (ON|OFF)` subcommand is added for unit tests. A new TCL
unit test checks that the results produced by the non-AVX2 and AVX2
implementations are exactly equal.
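A minimal sketch of how the subcommand can be used (key name and elements are illustrative):
```
PFADD hll a b c d e
PFDEBUG SIMD OFF     # force the scalar implementation
PFCOUNT hll          # scalar result
PFDEBUG SIMD ON      # re-enable the AVX2 fast path (when supported)
PFCOUNT hll          # must be exactly equal to the scalar result
```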
When merging 3 dense hll structures, the benchmark shows a 12x speedup
compared to the scalar version.
```
pfcount key1 key2 key3
pfmerge keyall key1 key2 key3
```
```
======================================================================================================
Type Ops/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
------------------------------------------------------------------------------------------------------
PFCOUNT-scalar 5665.56 35.29839 32.25500 63.99900 67.58300 608.60
PFCOUNT-avx2 72377.83 2.75834 2.67100 5.34300 6.81500 7774.96
------------------------------------------------------------------------------------------------------
PFMERGE-scalar 9851.29 20.28806 20.09500 36.86300 39.16700 615.71
PFMERGE-avx2 125621.89 1.59126 1.55100 3.11900 4.70300 15702.74
------------------------------------------------------------------------------------------------------
scalar: valkey:unstable 2df56d87c0ebe802f38e8922bb2ea1e4ca9cfa76
avx2: Nugine:hll-simd 8f9adc34021080d96e60bd0abe06b043f3ed0275
CPU: 13th Gen Intel® Core™ i9-13900H × 20
Memory: 32.0 GiB
OS: Ubuntu 22.04.5 LTS
```
Experiment repo: https://github.com/Nugine/redis-hyperloglog
Benchmark script:
https://github.com/Nugine/redis-hyperloglog/blob/main/scripts/memtier.sh
Algorithm:
https://github.com/Nugine/redis-hyperloglog/blob/main/cpp/bench.cpp
---------
Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>
This log allows us to easily distinguish between full coverage and a
minority partition when the cluster fails. Sometimes it is not easy
to see the minority partition when the shards look healthy (both
primaries and replicas).
We decided not to add a cluster_fail_reason field to CLUSTER INFO,
given that there are only two reasons and both are well known; if
we end up adding more down the road, we can add the field in the future.
Signed-off-by: Binbin <binloveplay1314@qq.com>
There are several patches in this PR:
* Abstract set/rewrite config bind option: the `bind` option is a special
config; `socket` and `tls` share the same one, while RDMA uses a similar
style but a different one. Add a small abstraction to make it flexible
for both `socket` and `RDMA` (and even for QUIC in the future).
* Introduce closeListener for connection type: closing a socket with a
simple syscall is fine, but RDMA has more complex logic. Introduce a
connection-type-specific close-listener method.
* RDMA: Use valkey.conf style instead of module parameters: use the
`--rdma-bind` and `--rdma-port` style instead of module parameters. The
module style configs `rdma.bind` and `rdma.port` are removed.
* RDMA: Support builtin: support `make BUILD_RDMA=yes`. The module style is
still kept for now. (A usage sketch follows this list.)
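A rough usage sketch under the new style (the build flag and option names come from this PR; the address and port values are illustrative):
```
# build RDMA support directly into the server
make BUILD_RDMA=yes

# valkey.conf style options replacing the removed rdma.bind / rdma.port module parameters
./valkey-server --rdma-bind 192.168.122.100 --rdma-port 6379
```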
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
After #1185, a client in the import-source state can visit expired keys
in both read commands and write commands. This commit updates the
keyIsExpired function to handle the import-source state as well, so
KEYS can visit expired keys.
This is not particularly important, but it keeps the definition consistent.
Also did some cleanup around the test and verified that the client
can indeed visit the expired key.
Signed-off-by: Binbin <binloveplay1314@qq.com>
When a node's role changes, we should broadcast the change to notify other nodes.
For example, with one primary and one replica, after a failover the replica becomes
the new primary and the primary becomes the new replica.
If we then trigger a second cluster failover for the new replica, the
new replica will send an MFSTART to its primary, i.e. the new primary.
But the new primary may reject the MFSTART due to this logic:
```
} else if (type == CLUSTERMSG_TYPE_MFSTART) {
if (!sender || sender->replicaof != myself) return 1;
```
In the new primary's view, the sender is still a primary and sender->replicaof
is NULL, so we return early and the manual failover times out.
Another possibility is that other primaries refuse to vote after receiving
the FAILOVER_AUTH_REQUEST, since in their view the sender is still a primary;
they refuse to vote and the manual failover times out.
```
void clusterSendFailoverAuthIfNeeded(clusterNode *node, clusterMsg *request) {
...
if (clusterNodeIsPrimary(node)) {
serverLog(LL_WARNING, "Failover auth denied to...
```
The reason is that, currently, we only update the node->replicaof information
when we receive a PING/PONG from the sender. For details, see clusterProcessPacket.
Therefore, in some scenarios, such as clusters with many nodes and a large
cluster-ping-interval (that is, cluster-node-timeout), the role change of the node
is noticed only after a long delay.
Also added a DEBUG DISABLE-CLUSTER-RANDOM-PING command to disable the cluster
ping that is sent to a random node every second (see clusterCron).
Signed-off-by: Binbin <binloveplay1314@qq.com>
If a manual failover times out, for example because the election did not
get enough votes, then, since we have an auth_timeout and an auth_retry_time, a
new manual failover will not be able to proceed on the replica side.
For instance, if we initiate a new manual failover after an election timed out,
we will pause the primary, but on the replica side, due to the retry time, the
replica does not trigger the new election and the manual failover will
eventually time out.
With this change, if we initiate a manual failover again and there is an
ongoing election, we reset it so that the replica can initiate
a new election at the manual failover's request.
Signed-off-by: Binbin <binloveplay1314@qq.com>
New config: `import-mode (yes|no)`
New command: `CLIENT IMPORT-SOURCE (ON|OFF)`
The config, when set to `yes`, disables eviction and deletion of expired
keys, except for commands coming from a client that has marked itself
as an import source (the data source when importing data from another
node) using the CLIENT IMPORT-SOURCE command.
When we sync data from the source Valkey to the destination Valkey using
a sync tool like
[redis-shake](https://github.com/tair-opensource/RedisShake), the
destination Valkey can perform expiration and eviction, which may cause
data corruption. This problem has been discussed in
https://github.com/redis/redis/discussions/9760#discussioncomment-1681041
and Redis already has a solution, but Valkey hasn't fixed it until now.
E.g. we call `set key 1 ex 1` on the source server and transfer this
command to the destination server. Then we call `incr key` on the source
server before the key expires, so we have a key on the source server
with a value of 2. But when the command arrives at the destination
server, the key may already have expired and been deleted. So we end up
with a key on the destination server with a value of 1, which is
inconsistent with the source server.
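The same sequence laid out as a timeline:
```
# source server
set key 1 ex 1    # transferred to the destination
incr key          # before the key expires; source value is now 2

# destination server
# the key may have already expired and been deleted when incr arrives,
# so incr re-creates it with a value of 1, inconsistent with the source
```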
In standalone mode, we can use a writable replica to simplify the sync
process. However, in cluster mode, we still need a sync tool to help us
transfer the source data to the destination. The sync tool usually works
as a normal client and the destination works as a primary, which keeps
performing expiration and eviction.
In this PR, we add a new mode named 'import-mode'. In this mode, the server
stops expiration and eviction just like a replica. Notice that this mode
exists only during the sync, to avoid data inconsistency caused by
expiration and eviction. Import mode only takes effect on the primary.
Sync tools can mark their clients as an import source with `CLIENT
IMPORT-SOURCE`; such clients work like a client from the primary and can
visit expired keys in `lookupKey`.
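A minimal sketch of how a sync tool might drive this (assuming `import-mode` can also be changed at runtime via CONFIG SET; otherwise it can be set in valkey.conf):
```
# on the destination primary
CONFIG SET import-mode yes     # stop expiration and eviction, like a replica

# on the sync tool's own connection
CLIENT IMPORT-SOURCE ON        # this client may now visit expired keys
# ... replay the data from the source ...
CLIENT IMPORT-SOURCE OFF

# when the migration is finished
CONFIG SET import-mode no
```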
**Notice: during the migration, other clients, apart from the import
source, should not access the data imported by import source.**
---------
Signed-off-by: lvyanqi.lyq <lvyanqi.lyq@alibaba-inc.com>
Signed-off-by: Yanqi Lv <lvyanqi.lyq@alibaba-inc.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
This limit should not restrict manual failover, otherwise in some
scenarios, manual failover will time out.
For example, if some FAILOVER_AUTH_REQUESTs or some FAILOVER_AUTH_ACKs
are lost during a manual failover, the node cannot vote in the second manual
failover. Or in a mixed scenario of plain failover and manual failover,
it cannot vote for the subsequent manual failover.
The problem with relying on the manual failover retry is that a manual
failover pauses clients for 5 seconds on the primary side, so letting
every manual failover time out and retry is a bad move.
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
CI reports this failure:
```
[exception]: Executing test client: MOVED 1 127.0.0.1:22128.
MOVED 1 127.0.0.1:22128
while executing
"wait_for_condition 1000 50 {
[R 3 get key_991803] == 1024 && [R 3 get key_977613] == 10240 &&
[R 4 get key_991803] == 1024 && ..."
```
This may be because, even though the cluster state becomes OK,
the cluster still has an inconsistent configuration for a short period
of time. We now make sure to wait for the config to be consistent.
Signed-off-by: Binbin <binloveplay1314@qq.com>
We have an assert in propagateNow. If the primary node receives a
CLUSTER UPDATE with dirty slots while waiting to exit after SIGTERM, during
a manual failover pause, or during a client pause, the delKeysInSlot
call will trigger this assert and crash the primary.
To fix this, we added a new server_del_keys_in_slot state, just like
client_pause_in_transaction, to track the state and avoid the assert
in propagateNow. The dirty slots are still deleted in the end without
affecting data consistency.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
If we become an empty primary for some reason, we still need to
check if we need to delete dirty slots, because we may have dirty
slot data left over from a bad migration, for example when the target node
forcibly executes CLUSTER SETSLOT NODE to take over the slot without
performing key migration.
Signed-off-by: Binbin <binloveplay1314@qq.com>
If multiple primary nodes go down at the same time, their replica nodes will
initiate elections at the same time. There is a certain probability that
the replicas will initiate the elections in the same epoch.
And obviously, in our current election mechanism, only one replica node can
eventually get enough votes; the other replica node will fail to win
due to the insufficient majority, its election will time out and
we will wait for the retry, which results in a long failure time.
If another node has won the election in the failover epoch, we can assume
that my election has failed and we can retry as soon as possible.
Signed-off-by: Binbin <binloveplay1314@qq.com>
The CI reports that the replica returns this error when performing CLUSTER
FAILOVER:
```
-ERR Master is down or failed, please use CLUSTER FAILOVER FORCE
```
This may be because the primary state is fail or the cluster connection
is disconnected during the primary pause. In this PR, we added some
waits in wait_for_role: if the role is replica, we wait for the
replication link and the cluster link to be ok.
Signed-off-by: Binbin <binloveplay1314@qq.com>
We set this to EEXIST in 568c2e039bac388003068cd8debb2f93619dd462,
which prints "File exists" and is not quite accurate.
Change it to EALREADY, which prints "Operation already in progress".
Signed-off-by: Binbin <binloveplay1314@qq.com>
We don't think we really care about optimizing for the empty DB case,
which should be uncommon. Adding branches hurts branch prediction.
Signed-off-by: Binbin <binloveplay1314@qq.com>
The cluster-node-timeout is 3000 in our tests and the timing test wasn't
succeeding, so extending the wait_for makes it much more reliable.
Signed-off-by: Binbin <binloveplay1314@qq.com>
The special pattern '#' is used to get the element itself;
it should not actually participate in the slot check.
Previously, passing `GET #` caused '#' to participate
in the slot check, making the command fail with a
`pattern may be in different slots` error.
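For example, in cluster mode (key and pattern names are illustrative):
```
# all real patterns hash to the same slot via {tag}; '#' only returns the element itself
SORT mylist{tag} BY weight_*{tag} GET # GET data_*{tag}
```
Before this fix, the literal '#' was included in the slot check, so the command failed even though every real key and pattern shares a slot.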
Signed-off-by: Binbin <binloveplay1314@qq.com>
There is no limitation in Valkey preventing the creation of a cluster with
1 or 2 primaries, only that it cannot do automatic failover. Remove this
restriction and add an `are you sure` prompt to warn the user.
This allows us to create a test cluster with the CLI or with
create-cluster.
Signed-off-by: Binbin <binloveplay1314@qq.com>
https://github.com/valkey-io/valkey/issues/1145
First part of a two-step effort to add `WithSlot` API for expiry. This
PR is to fix a crash that occurs when a RANDOMKEY uses a different slot
than the cached slot of a client during a multi-exec.
The next part will be to utilize the new API as an optimization to
prevent duplicate work when calculating the slot for a key.
---------
Signed-off-by: Nadav Levanoni <nadavl@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Nadav Levanoni <nadavl@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
The client that was killed by FUNCTION KILL received a reply of
SCRIPT KILL and the server log also showed SCRIPT KILL.
Signed-off-by: Binbin <binloveplay1314@qq.com>
CI reports this failure:
```
*** [err]: SHUTDOWN NOSAVE can kill a timedout script anyway in tests/unit/scripting.tcl
Expected 'BUSY Valkey is busy running a script. *' to match '*connection refused*' (context: type eval line 8 cmd {assert_match {*connection refused*} $e} proc ::test)
```
We can see in the logs that the shutdown got rejected because there is an
AOFRW pending:
```
Writing initial AOF, can't exit.
Errors trying to shut down the server. Check the logs for more information.
```
The reason is that the previous test enabled AOF.
Signed-off-by: Binbin <binloveplay1314@qq.com>
This fixes: https://github.com/valkey-io/valkey/issues/1116
_Issue details from #1116 by @zuiderkwast_
> This config is undocumented since #758. The default was changed to
"yes" and it is quite useless to set it to "no". Yet, it can happen that
some user has an old config file where it is explicitly set to "no". The
result will be bad performance, since I/O threads will not do all the
I/O.
>
> It's indeed confusing.
>
> 1. Either remove the whole option from the code. And thus no need for
documentation. _OR:_
> 2. Introduce the option back in the configuration, just as a comment
is fine. And showing the default value "yes": `# io-threads-do-reads
yes` with additional text.
>
> _Originally posted by @melroy89 in [#1019 (reply in
thread)](https://github.com/orgs/valkey-io/discussions/1019#discussioncomment-10824778)_
---------
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
The module commands that were added to ACL categories were getting
skipped when the `ACL CAT category` command was executed.
This PR fixes the bug.
Before:
```
127.0.0.1:6379> ACL CAT foocategory
(empty array)
```
After:
```
127.0.0.1:6379> ACL CAT foocategory
aclcheck.module.command.test.add.new.aclcategories
```
---------
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Co-authored-by: Harkrishn Patro <bunty.hari@gmail.com>
Currently, when module loading fails due to a busy name, we
don't have a clean way to assist troubleshooting.
Case 1: when loading the same module multiple times, we cannot
determine the cause of the failure without referring to
the module list or the earliest module load log. The log
may not exist, and sometimes it is difficult for people
to associate it with the module list.
Case 2: when multiple modules use the same module name,
we cannot quickly associate the busy name without referring
to the module list and the earliest module load log.
Different people wrote modules with the same module name, and
they cannot easily associate the module name.
So in this PR, when doing module onload, we will try to
print a busy-name log if this happens. Currently we check
ctx.module, since if it is NULL it means the Init call
failed, and Init currently only fails with a busy name.
It's kind of ugly. It would have been nice if we could have had a
better way for onload to signal why the load failed.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Applying the CVE fixes against mainline.
(CVE-2024-31449) Lua library commands may lead to stack overflow and
potential RCE.
(CVE-2024-31227) Potential Denial-of-service due to malformed ACL
selectors.
(CVE-2024-31228) Potential Denial-of-service due to unbounded pattern
matching.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Since we paused the primary node earlier, the replica may enter the
cluster-down state due to the primary node being in pfail. Here we allow
reads to prevent subsequent read errors.
Signed-off-by: Binbin <binloveplay1314@qq.com>
These two test cases run in a loop:
* AOF rewrite during write load: RDB preamble=yes
* AOF rewrite during write load: RDB preamble=no
Both of the test cases build up a lot of data (3-4 million keys when I
run locally) so we should empty the data before the second test case.
Otherwise, the second test cases adds keys on top of the keys added in
the first test case, resulting in the double number of keys and takes
more time.
Before this commit:
```
[ok]: AOF rewrite during write load: RDB preamble=yes (18225 ms)
[ok]: AOF rewrite during write load: RDB preamble=no (37249 ms)
```
After:
```
[ok]: AOF rewrite during write load: RDB preamble=yes (18777 ms)
[ok]: AOF rewrite during write load: RDB preamble=no (19940 ms)
```
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
For fake clients like the ones used for Lua and modules, we don't
determine TLS in the right way, causing CLUSTER SLOTS from EVAL over TLS
to fail a debug-assert.
This error was introduced when the caching of CLUSTER SLOTS was
introduced, i.e. in 8.0.0.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The reason is that VM_Call uses a fake client without a connection,
so we also need to check whether c->conn is NULL.
This also affects scripts: if the commands are called from a script, the
server will crash. Injecting the commands into the AOF will also cause
startup failure.
Fixes #1054.
Signed-off-by: Binbin <binloveplay1314@qq.com>
The one in CLUSTER SETSLOT helps us keep track of the state better,
and of course it also makes the test case happy.
The one in the gossip process fixes a problem where a replica could
print a log saying it is an empty primary.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Apparently there is a timing issue when using wait_for_ofs_sync:
```
[exception]: Executing test client: can't read "out_before": no such variable.
can't read "out_before": no such variable
```
The reason is that if the connection between the primary
and the replica is not established yet, the master_repl_offset
of the primary and replica in wait_for_ofs_sync is 0, and
the check fails, resulting in no replica client in the
client list below.
In this case, we need to make sure the replica is online
before proceeding.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Fix timing issue in evaluating `cluster-allow-replica-migration` for replicas
There is a timing bug where the primary and replica have different
`cluster-allow-replica-migration` settings. In issue #970, we found that if
the replica receives `CLUSTER SETSLOT` before the gossip update, it remains
in the original shard. This happens because we only process the
`cluster-allow-replica-migration` flag for primaries during `CLUSTER SETSLOT`.
This commit fixes the issue by also evaluating this flag for replicas in the
`CLUSTER SETSLOT` path, ensuring correct replica migration behavior.
Closes #970
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
The node may not be able to initiate an election in time due to
problems with cluster communication. If an election is initiated,
make sure its offset is 0.
Closes #967.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Maybe partially resolves https://github.com/valkey-io/valkey/issues/952.
The hostnames test relies on an assumption that node zero and node six
don't communicate with each other, to test a bunch of behavior in the
handshake state. This was done by previously dropping all meet packets;
however, it seems like there was some case where node zero was sending a
single pong message to node 6, which was partially initializing the
state.
I couldn't track down why this happened, but I adjusted the test to
simply pause node zero which also correctly emulates the state we want
to be in since we're just testing state on node 6, and removes the
chance of errant messages. The test was failing about 5% of the time
locally, and I wasn't able to reproduce a failure with this new
configuration.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
In RdbLoad, we disable AOF before emptyData and rdbLoad to prevent copy-on-write issues. After rdbLoad completes, AOF should be re-enabled, but the code incorrectly checks server.aof_state, which has been reset to AOF_OFF in stopAppendOnly. This leads to AOF not being re-enabled after being disabled.
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
## Set replica-lazy-flush and lazyfree-lazy-user-flush to yes by default
There are many problems with running flush synchronously. Even in
single-CPU environments, the thread managers should balance between
freeing memory and serving incoming requests.
## Set lazy eviction, expire, server-del, user-del to yes by default
We now have a plain del and a lazyfree del, and we have these configuration
items to control them: lazyfree-lazy-eviction, lazyfree-lazy-expire,
lazyfree-lazy-server-del, lazyfree-lazy-user-del. In most cases lazyfree
is better since it reduces the risk of blocking the main thread, and
because we have lazyfreeGetFreeEffort, only objects with a high free
effort (currently 64) will actually use lazyfree.
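For reference, the resulting defaults correspond to the following valkey.conf settings (an excerpt listing only the options named above):
```
replica-lazy-flush yes
lazyfree-lazy-user-flush yes
lazyfree-lazy-eviction yes
lazyfree-lazy-expire yes
lazyfree-lazy-server-del yes
lazyfree-lazy-user-del yes
```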
Part of #653.
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
If the expiration time passed to SET is already in the past, for example
because the machine time changed (e.g. a DST change) or because a wrong
expiration argument was passed in, then we don't need to set the key
and wait for the active expire scan before deleting the key.
Compared with the previous behavior:
1. If the key does not exist, previously we would set the key and wait
for the active expire to delete it, so it was a set + del from the
perspective of propagation. Now we will not set the key and just return,
so it is a NOP.
2. If the key exists, previously we would set the key and wait
for the active expire to delete it, so it was a set + del from the
perspective of propagation. Now we delete it and return, so it is a del.
(See the example after this list.)
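A rough illustration of the new behavior (the timestamp is illustrative):
```
SET key val EXAT 1    # the absolute expiration is already in the past
EXISTS key            # 0: the key is not created, or is deleted if it existed
```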
Adding a new deleteExpiredKeyFromOverwriteAndPropagate function
to reduce the duplicate code.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
The reason is that server 3 still has server 7 as its replica
due to a short wait; the wait is not long enough, and we should wait for
the server to lose its replica.
```
*** [err]: valkey-cli make source node ignores NOREPLICAS error when doing the last CLUSTER SETSLOT
Expected '{127.0.0.1 21497 267}' to be equal to '' (context: type eval line 34 cmd {assert_equal [lindex [R 3 role] 2] {}} proc ::test)
```
Signed-off-by: Binbin <binloveplay1314@qq.com>
In standalone mode, when a `-REDIRECT` error occurs, special handling is
required if the client is in the `MULTI` context.
We have adopted the same handling method as in cluster mode:
1. If a command in the transaction encounters a `REDIRECT` at the time
of queuing, the execution of `EXEC` will return an `EXECABORT` error (we
expect the client to redirect and discard the transaction upon receiving
a `REDIRECT`). That is:
```
MULTI ==> +OK
SET x y ==> -REDIRECT
EXEC ==> -EXECABORT
```
2. If all commands are successfully queued (i.e., `QUEUED` results are
received) but a redirect is detected during `EXEC` execution (such as a
primary-replica switch), a `REDIRECT` is returned to instruct the client
to perform a redirect. That is:
```
MULTI ==> +OK
SET x y ==> +QUEUED
failover
EXEC ==> -REDIRECT
```
---------
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
When reconfiguring a sub-replica, there may be a case where the sub-replica
uses the old offset and wins the election, causing data loss if the old
primary went down.
In this case, the sender is myself's primary; when executing updateShardId,
not only the sender's shard_id is updated, but also the shard_id of
myself, causing the subsequent areInSameShard check, that is,
the full_sync_required check, to fail.
As part of the recent fix of #885, the sub-replica needs to decide whether
a full sync is required or not when switching shards. This shard membership
check is supposed to be done against sub-replica's current shard_id, which
however was lost in this code path. This then leads to sub-replica joining
the other shard with a completely different and incorrect replication history.
This is the only place where replicaof state can be updated on this path
so the most natural fix would be to pull the chain replication reduction
logic into this code block and before the updateShardId call.
This follows #885 and closes #942.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Fix a bug in isValidAuxChar where valid characters '.' and ':' were
incorrectly included in the banned charset. This issue affected the
validation of auxiliary fields in the nodes.conf file used by Valkey in
cluster mode, particularly when handling IPv4 and IPv6 addresses. The
code now correctly allows '.' and ':' as valid characters, ensuring
proper handling of these fields. Comments were added to clarify the use
of the banned charset.
Related to #736
---------
Signed-off-by: Ping Xie <pingxie@google.com>