The one in CLUSTER SETSLOT help us keep track of state better,
of course it also can make the test case happy.
The one in gossip process fixes a problem that a replica can
print a log saying it is an empty primary.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
The reason is VM_Call will use a fake client without connection,
so we also need to check if c->conn is NULL.
This also affects scripts. If they are called in the script, the
server will crash. Injecting commands into AOF will also cause
startup failure.
Fixes#1054.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Apparently this will fail to compile in some masOS version.
And internet claims _Thread_local is portable.
Fixes#1051.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Replaced "Could not connect to Redis" with "Could not connect to server" in the log
output for connection errors in `getRedisContext` and `createClient`.
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
Signed-off-by: Ping Xie <pingxie@google.com>
Call emptyData right before rdbLoad to prevent errors in the middle
and we drop the replication stream and leaving an empty database.
The real changes is in disk-based part, the rest is just code movement.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Ping Xie <pingxie@google.com>
This pull request improves code readability, as a follow up of #749.
- Internal Naming Conventions: Removed the use of underscores (_) for
internal static structures/functions.
- Descriptive Function Names: Updated function names to be more
descriptive, making their purpose clearer. For instance, `_dictExpand`
is renamed to `dictExpandIfAutoResizeAllowed`.
---------
Signed-off-by: Ping Xie <pingxie@google.com>
Fix timing issue in evaluating `cluster-allow-replica-migration` for replicas
There is a timing bug where the primary and replica have different
`cluster-allow-replica-migration` settings. In issue #970, we found that if
the replica receives `CLUSTER SETSLOT` before the gossip update, it remains
in the original shard. This happens because we only process the
`cluster-allow-replica-migration` flag for primaries during `CLUSTER SETSLOT`.
This commit fixes the issue by also evaluating this flag for replicas in the
`CLUSTER SETSLOT` path, ensuring correct replica migration behavior.
Closes#970
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Signed-off-by: Ping Xie <pingxie@google.com>
The cluster configuration file is the metadata "database" for the
cluster. It is best to trigger a save when shutdown the server, to
avoid inconsistent content that is not refreshed.
We save the nodes.conf whenever something that affects the nodes.conf
has changed. But we are saving nodes.conf in clusterBeforeSleep, and
some events may save it without a fsync, there is a time gap.
And shutdown has its own save seems good to me, it doesn't need to
care about the others.
At the same time, a comment is added in unlock nodes.conf to explain
why we actively unlock when shutdown.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Ping Xie <pingxie@google.com>
clang generates warning if there is no newline at the end of the source
file.
Update .clang-format to handle the missing newline at eof.
Signed-off-by: haoqixu <hq.xu0o0@gmail.com>
Signed-off-by: Ping Xie <pingxie@google.com>
While doing some profiling, I noticed that getKeySlot() was a fairly
large part (~0.7%) of samples doing perf with high pipeline during
standalone. I think this is because we do a very late check for
server.cluster_mode, we first call getKeySlot() and then call
calculateKeySlot(). (calculateKeySlot was surprisingly not automatically
inlined, we were doing a jump into it and then immediately returning
zero). We then also do useless work in the form of caching zero in
client->slot, which will further mess with cache lines.
So, this PR tries to accomplish a few things things.
1) The usage of the `slot` name made a lot more sense before the
introduction of the kvstore. Now with kvstore, we call this the database
index, so all the references to slot in standalone are no longer really
accurate.
2) Pull the cluster mode check all the way out of getKeySlot(), so
hopefully a bit more performant.
3) Remove calculateKeySlot() as independent from getKeySlot().
calculateKeySlot used to have 3 call sites outside of db.c, which
warranted it's own function. It's now only called in two places,
pubsub.c and networking.c.
I ran some profiling, and saw about ~0.3% improvement, but don't really
trust it because you'll see a much higher (~2%) variance in test runs
just by how the branch predictions will get changed with a new memory
layout. Running perf again showed no samples in getKeySlot() and a
reduction in samples in lookupKey(), so maybe this will help a little
bit.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: Ping Xie <pingxie@google.com>
In RdbLoad, we disable AOF before emptyData and rdbLoad to prevent copy-on-write issues. After rdbLoad completes, AOF should be re-enabled, but the code incorrectly checks server.aof_state, which has been reset to AOF_OFF in stopAppendOnly. This leads to AOF not being re-enabled after being disabled.
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Ping Xie <pingxie@google.com>
Prior to comparing the replica buffer against the configured limit, we
need to ensure that the limit configuration is enabled. If the limit is
set to zero, it indicates that there is no limit, and we should skip the
buffer limit check.
---------
Signed-off-by: naglera <anagler123@gmail.com>
Signed-off-by: Ping Xie <pingxie@google.com>
Fix for https://github.com/valkey-io/valkey/issues/997
Root Cause Analysis:
1. Two different jobs (READ and WRITE) may be sent to the same IO
thread.
2. When processing the read job in `processIOThreadsReadDone`, the IO
thread may find that the write job has also been completed.
3. In this case, the IO thread calls `processClientIOWriteDone` to first
process the completed write job and free the COBs
affbea5dc1/src/networking.c (L4666)
4. If there are pending writes (resulting from pipeline commands), a new
async IO write job is sent before processing the completed read job
affbea5dc1/src/networking.c (L2417)
When sending the write job, the `TLS_CONN_FLAG_POSTPONE_UPDATE_STATE`
flag is set to prevent the IO thread from updating the event loop, which
is not thread-safe.
5. Upon resuming the read job processing, the flag is cleared,
affbea5dc1/src/networking.c (L4685)
causing the IO thread to update the event loop.
Fix:
Prevent sending async write job for pending writes when a read job is
about to be processed.
Testing:
The issue could not be reproduced due to its rare occurrence, which
requires multiple specific conditions to align simultaneously.
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: Ping Xie <pingxie@google.com>
For backwards compatibility reasons, a node will wait until it receives
a cluster message with the extensions flag before sending its own
extensions. This leads to a delay in shard ID propagation that can
corrupt nodes.conf with inaccurate shard IDs if a node is restarted
before this can stabilize.
This fixes much of that delay by immediately triggering the
extensions-supported flag during the MEET processing and attaching the
node to the link, allowing the PONG reply to contain OSS extensions.
Partially fixes#774
---------
Signed-off-by: Ben Totten <btotten@amazon.com>
Co-authored-by: Ben Totten <btotten@amazon.com>
Signed-off-by: Ping Xie <pingxie@google.com>
This PR migrates the tests related to dict into new test framework as
part of #428.
Signed-off-by: haoqixu <hq.xu0o0@gmail.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Ping Xie <pingxie@google.com>
This PR migrates the tests related to listpack into new test framework
as part of #428.
Signed-off-by: haoqixu <hq.xu0o0@gmail.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Ping Xie <pingxie@google.com>
Print the full client info by using catClientInfoString, the
info is useful when we want to identify the source of request.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Ping Xie <pingxie@google.com>
If we modify aof-use-rdb-preamble in the middle of rewrite,
we may get a wrong aof base suffix. This is because the suffix
is concatenated by the main process afterwards, and it may be
different from the beginning.
We cache this value when we start the rewrite.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Ping Xie <pingxie@google.com>
`clusterProcessGossipSection` currently doesn't trigger a check and call `replicationSetPrimary` when `myself`'s primary node’s IP/port is updated. This fix ensures that after every node address update, `replicationSetPrimary` is called if the updated node is `myself`'s primary. This prevents missed updates and ensures that replicas reconnect properly to maintain their replication link with the primary.
Signed-off-by: Ping Xie <pingxie@google.com>
This pull request introduces several changes to improve the type safety
of Valkey's dictionary implementation:
- Getter/Setter Macros: Implemented macros `DICT_SET_VALUE` and
`DICT_GET_VALUE` to centralize type casting within these macros. This
change emulates the behavior of C++ templates in C, limiting type
casting to specific low-level operations and preventing it from being
spread across the codebase.
- Reduced Assert Overhead: Removed unnecessary asserts from critical hot
paths in the dictionary implementation.
- Consistent Naming: Standardized the naming of dictionary entry types.
For example, all dictionary entry types start their names with
`dictEntry`.
Fix#737
---------
Signed-off-by: Ping Xie <pingxie@google.com>
Signed-off-by: Ping Xie <pingxie@outlook.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: Ping Xie <pingxie@google.com>
Follow up to https://github.com/valkey-io/valkey/pull/966, which didn't
update the kvstore tests. I'm not actually entirely clear why it fixes
it, but the consistency prevents the crash very reliably so will merge
it now and maybe see if Zhao has a better explanation.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Implement data masking for user data in server logs and diagnostic output. This change prevents potential exposure of confidential information, such as PII, and enhances privacy protection. It masks all command arguments, client names, and client usernames.
Added a new hide-user-data-from-log configuration item, default yes.
---------
Signed-off-by: Amit Nagler <anagler123@gmail.com>
Feature `one-dict-per-slot` refactors the database, and part of it
involved splitting the rehashing list from the global level back to the
database level, or more specifically, the kvstore level. This change is
fine, and it also simplifies the process of swapping databases, which is
good. And it should not have a major impact on the efficiency of
incremental rehashing.
To implement the kvstore-level rehashing list, each `dict` under the
`kvstore` needs to know which `kvstore` it belongs. However, kvstore did
not insert the reference relationship into the `dict` itself, instead,
it placed it in the `dictType`. In my view, this is a somewhat odd way.
Theoretically, `dictType` is just a collection of function handles, a
kind of virtual type that can be referenced globally, not an entity. But
now the `dictType` is instantiated, with each `kvstore` owning an actual
`dictType`, which in turn holds a reverse reference to the `kvstore`'s
resource pointer. This design is somewhat uncomfortable for me.
I think the `dictType` should not be instantiated. The references
between actual resources (`kvstore` and `dict`) should occur between
specific objects, rather than force materializing the `dictType`, which
is supposed to be virtual.
---------
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
## Set replica-lazy-flush and lazyfree-lazy-user-flush to yes by
default.
There are many problems with running flush synchronously. Even in
single CPU environments, the thread managers should balance between
the freeing and serving incoming requests.
## Set lazy eviction, expire, server-del, user-del to yes by default
We now have a del and a lazyfree del, we also have these configuration
items to control: lazyfree-lazy-eviction, lazyfree-lazy-expire,
lazyfree-lazy-server-del, lazyfree-lazy-user-del. In most cases lazyfree
is better since it reduces the risk of blocking the main thread, and
because we have lazyfreeGetFreeEffort, on those with high effor
(currently
64) will use lazyfree.
Part of #653.
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
If the expiration time passed in SET is expired, for example, it
has expired due to the machine time (DTS) or the expiration time
passed in (wrong arg). In this case, we don't need to set the key
and wait for the active expire scan before deleting the key.
Compared with previous changes:
1. If the key does not exist, previously we would set the key and wait
for the active expire to delete it, so it is a set + del from the
perspective
of propaganda. Now we will no set the key and return, so it a NOP.
2. If the key exists, previously we woule set the key and wait
for the active expire to delete it, so it is a set + del From the
perspective
of propaganda. Now we will delete it and return, so it is a del.
Adding a new deleteExpiredKeyFromOverwriteAndPropagate function
to reduce the duplicate code.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Deletes zipmapSet, zipmapGet, etc. Only keep iterator and validate
integrity, what we use when loading an old RDB file.
Adjust unit tests to not use zipmapSet, etc.
Solves a build failure where when compiling with fortify source.
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
In standalone mode, when a `-REDIRECT` error occurs, special handling is
required if the client is in the `MULTI` context.
We have adopted the same handling method as the cluster mode:
1. If a command in the transaction encounters a `REDIRECT` at the time
of queuing, the execution of `EXEC` will return an `EXECABORT` error (we
expect the client to redirect and discard the transaction upon receiving
a `REDIRECT`). That is:
```
MULTI ==> +OK
SET x y ==> -REDIRECT
EXEC ==> -EXECABORT
```
2. If all commands are successfully queued (i.e., `QUEUED` results are
received) but a redirect is detected during `EXEC` execution (such as a
primary-replica switch), a `REDIRECT` is returned to instruct the client
to perform a redirect. That is:
```
MULTI ==> +OK
SET x y ==> +QUEUED
failover
EXEC ==> -REDIRECT
```
---------
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Migrate zipmap unit test to new unit test framework, parent ticket #428
.
---------
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
Signed-off-by: hwware <wen.hui.ware@gmail.com>
Co-authored-by: hwware <wen.hui.ware@gmail.com>
When reconfiguring sub-replica, there may a case that the sub-replica will
use the old offset and win the election and cause the data loss if the old
primary went down.
In this case, sender is myself's primary, when executing updateShardId,
not only the sender's shard_id is updated, but also the shard_id of
myself is updated, casuing the subsequent areInSameShard check, that is,
the full_sync_required check to fail.
As part of the recent fix of #885, the sub-replica needs to decide whether
a full sync is required or not when switching shards. This shard membership
check is supposed to be done against sub-replica's current shard_id, which
however was lost in this code path. This then leads to sub-replica joining
the other shard with a completely different and incorrect replication history.
This is the only place where replicaof state can be updated on this path
so the most natural fix would be to pull the chain replication reduction
logic into this code block and before the updateShardId call.
This one follow #885 and closes#942.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Fix a bug in isValidAuxChar where valid characters '.' and ':' were
incorrectly included in the banned charset. This issue affected the
validation of auxiliary fields in the nodes.conf file used by Valkey in
cluster mode, particularly when handling IPv4 and IPv6 addresses. The
code now correctly allows '.' and ':' as valid characters, ensuring
proper handling of these fields. Comments were added to clarify the use
of the banned charset.
Related to #736
---------
Signed-off-by: Ping Xie <pingxie@google.com>
In #792, the time complexity became ambiguous, fluctuating between
O(1) and O(n), which is a significant difference. And we agree uncertainty
can potentially bring disaster to the business, the right thing to do is
to persuade users to use EXISTS instead of KEYS in this case, to do the
right thing the right way, rather than accommodating this incorrect usage.
This reverts commit d66a06e8183818c035bb78706f46fd62645db07e.
This reverts #792.
Signed-off-by: Binbin <binloveplay1314@qq.com>
## Description
When I explore the cycles distributions for `lrange` test (
`valkey-benchmark -p 9001 -t lrange -d 100 -r 1000000 -n 1000000 -c 50
--threads 4`). I found the `prepareClientToWrite` and
`clientHasPendingReplies` could be reduced to single call outside
instead of called in a loop, ideally we can gain 3% performance. The
corresponding `LRANG_100`, `LRANG_300`, `LRANGE_500`, `LRANGE_600` have
~2% - 3% performance boost, the benchmark test prove it helps.
This patch try to move the `prepareClientToWrite` and its child
`clientHasPendingReplies` out of the loop to reduce the function
overhead.
---------
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
sdsAllocSize returns the correct size without consulting the
allocator. Which is much faster than consulting the allocator.
The only exception is SDS_TYPE_5, for which it has to
consult the allocator.
This PR also sets alloc field correctly for embedded string objects.
It assumes that no allocator would allocate a buffer larger
than `259 + sizeof(robj)` for embedded string. We use embedded strings
for strings up to 44 bytes. If this assumption is wrong, the whole
function would require a rewrite. In general case sds type adjustment
might be needed. Such logic should go to sds.c.
---------
Signed-off-by: Vadym Khoptynets <vadymkh@amazon.com>
Currently, the `dual-channel-replication` feature flag is immutable if
`enable-protected-configs` is enabled, which is the default behavior.
This PR proposes to make the `dual-channel-replication` flag mutable,
allowing it to be changed dynamically without restarting the cluster.
**Motivation:**
The ability to change the `dual-channel-replication` flag dynamically is
essential for testing and validating the feature on real clusters
running in production environments. By making the flag mutable, we can
enable or disable the feature without disrupting the cluster's
operations, facilitating easier testing and experimentation.
Additionally, this change would provide more flexibility for users to
enable or disable the feature based on their specific requirements or
operational needs without requiring a cluster restart.
---------
Signed-off-by: naglera <anagler123@gmail.com>
1. Remove redundant connIncrRefs/connDecrRefs
In socket.c, the reference counter is incremented before calling
callHandler, but the same reference counter is also incremented inside
callHandler before calling the actual callback.
static inline int callHandler(connection *conn, ConnectionCallbackFunc handler) {
connIncrRefs(conn);
if (handler) handler(conn);
connDecrRefs(conn);
...
}
This commit removes the redundant incr/decr calls in socket.c
2. Correct return value of connRead for TLS when peer closed
According to comments in connection.h, connRead returns 0 when the peer
has closed the connection. This patch corrects the return value for TLS
connections. (Without this patch, it returns -1 which means error.)
There is an observable difference in what is logged in the verbose
level: "Client closed connection" vs "Reading from client: (null)".
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This PR utilizes the IO threads to execute commands in batches, allowing
us to prefetch the dictionary data in advance.
After making the IO threads asynchronous and offloading more work to
them in the first 2 PRs, the `lookupKey` function becomes a main
bottle-neck and it takes about 50% of the main-thread time (Tested with
SET command). This is because the Valkey dictionary is a straightforward
but inefficient chained hash implementation. While traversing the hash
linked lists, every access to either a dictEntry structure, pointer to
key, or a value object requires, with high probability, an expensive
external memory access.
### Memory Access Amortization
Memory Access Amortization (MAA) is a technique designed to optimize the
performance of dynamic data structures by reducing the impact of memory
access latency. It is applicable when multiple operations need to be
executed concurrently. The principle behind it is that for certain
dynamic data structures, executing operations in a batch is more
efficient than executing each one separately.
Rather than executing operations sequentially, this approach interleaves
the execution of all operations. This is done in such a way that
whenever a memory access is required during an operation, the program
prefetches the necessary memory and transitions to another operation.
This ensures that when one operation is blocked awaiting memory access,
other memory accesses are executed in parallel, thereby reducing the
average access latency.
We applied this method in the development of `dictPrefetch`, which takes
as parameters a vector of keys and dictionaries. It ensures that all
memory addresses required to execute dictionary operations for these
keys are loaded into the L1-L3 caches when executing commands.
Essentially, `dictPrefetch` is an interleaved execution of dictFind for
all the keys.
**Implementation details**
When the main thread iterates over the `clients-pending-io-read`, for
clients with ready-to-execute commands (i.e., clients for which the IO
thread has parsed the commands), a batch of up to 16 commands is
created. Initially, the command's argv, which were allocated by the IO
thread, is prefetched to the main thread's L1 cache. Subsequently, all
the dict entries and values required for the commands are prefetched
from the dictionary before the command execution. Only then will the
commands be executed.
---------
Signed-off-by: Uri Yagelnik <uriy@amazon.com>