1596 Commits

Author SHA1 Message Date
zhaozhao.zz
c5a1585547
add paused_actions for INFO Clients (#1519)
Add `paused_actions` and `paused_timeout_milliseconds` for INFO Clients
to inform users about if clients are paused.

---------

Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
2025-01-14 19:01:00 +08:00
secwall
fdc89c56b7
Escape unix socket group in unit tests (#1554)
In some cases unix groups could have whitespace and/or `\` in them.
One example is my workstation. It's a MacOS in an Active Directory
domain. So my user has group `LD\Domain Users`.
Running `make test` on `unstable` and `8.0` branches fails with:

I'm not sure if we need to fix this in 8.0. But it seems that it should
be fixed in unstable.

Signed-off-by: secwall <secwall@yandex-team.ru>
2025-01-13 20:05:04 -08:00
Viktor Söderqvist
dc9ca1b98d
Test coverage for ECHO for reply schema validation (#1549)
After #1545 disabled some tests for reply schema validation, we now have
another issue that ECHO is not covered.

```
WARNING! The following commands were not hit at all:
  echo
ERROR! at least one command was not hit by the tests
```

This patch adds a test case for ECHO in the unit/other test suite. I
haven't checked if there are more commands that aren't covered.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2025-01-13 10:14:09 +08:00
Binbin
11cb8ee27c
Add latency stats around cluster config file operations (#1534)
When the cluster changes, we need to persist the cluster configuration,
and these file IO operations may cause latency.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-01-11 11:03:10 +08:00
Binbin
10357ceda5
Mark the node as FAIL when the node is marked as NOADDR and broadcast the FAIL (#1191)
Imagine we have a cluster, for example a three-shard cluster,
if shard 1 doing a CLUSTER RESET HARD, it will change the node
name, and then other nodes will mark it as NOADR since the node
name received by PONG has changed.

In the eyes of other nodes, there is one working primary node
left but with no address, and in this case, the address report
in MOVED will be invalid and will confuse the clients. And in
the same time, the replica will not failover since its primary
is not in the FAIL state. And the cluster looks OK to everyone.

This leaves a cluster that appears OK, but with no coverage for
shard 1, obviously we should do something like CLUSTER FORGET
to remove the node and fix the cluster before using it.

But the point in here, we can mark the NOADDR node as FAIL to
advance the cluster state. If a node is NOADDR means it does
not have a valid address, so we won't reconnect it, we won't
send PING, we won't gossip it, it seems reasonable to mark it
as FAIL.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-01-11 11:02:05 +08:00
Binbin
211b250aad
Do election in order based on failed primary rank to avoid voting conflicts (#1018)
When multiple primary nodes fail simultaneously, the cluster can not recover
within the default effective time (data_age limit). The main reason is that
the vote is without ranking among multiple replica nodes, which case too many
epoch conflicts.

Therefore, we introduced into ranking based on the failed primary shard-id.
Introduced a new failed_primary_rank var, this var means the rank of this
myself instance in the context of all failed primary list. This var will be
used in failover and we will do the failover election packets in order based
on the rank, this can effectively avoid the voting conflicts.

If a single primary is down, the behavior is the same as before. If multiple
primaries are down, their replica election initiation time will be delayed
by 500ms according to the ranking.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-01-11 10:43:18 +08:00
Binbin
d6bdd9e7d7
Fix module LatencyAddSample still work when latency-monitor-threshold is 0 (#1541)
When latency-monitor-threshold is set to 0, it means the latency monitor
is disabled, and in VM_LatencyAddSample, we wrote the condition
incorrectly, causing us to record latency when latency was turned off.

This bug was introduced in the very first day, see e3b1d6d, it was merged
in 2019.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-01-11 10:32:58 +08:00
Rueian
0a89571dcc
Skip logreqres on tests for the HELLO command (#1528)
Skip logreqres on tests for the HELLO command

Signed-off-by: Rueian <rueiancsie@gmail.com>
2025-01-08 10:05:20 -08:00
Rueian
3b52186b6a
Add availability_zone to the HELLO response (#1487)
It's inconvenient for client implementations to extract the
`availability_zone` information from the `INFO` response. The `INFO`
response contains a lot of information that a client implementation
typically doesn't need.

This PR adds the availability zone to the `HELLO` response. Clients
usually already use the `HELLO` command for protocol negotiation and
also get the server `version` and `role` from its response. To keep the
`HELLO` response small, the field is only added if availability zone is
configured.

---------

Signed-off-by: Rueian <rueiancsie@gmail.com>
2025-01-07 22:54:55 +01:00
Madelyn Olson
e1db553834
Add tests for acl selectors with no permissions or patterns (#1515)
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2025-01-06 15:46:55 -08:00
Madelyn Olson
7977c55ac9
Fix Read/Write key pattern selector (CVE-2024-51741) (#1514)
The explanation on the original commit was wrong. Key based access must
have a `~` in order to correctly configure whey key prefixes to apply
the selector to. If this is missing, a server assert will be triggered
later.

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: YaacovHazan <yaacov.hazan@redis.com>
2025-01-06 14:02:16 -08:00
gmbnomis
26a72fa89c
Use the correct command proc for the LOOKUP_NOTOUCH exception in lookupKey (#1499)
When looking up a key in no-touch mode, `LOOKUP_NOTOUCH` is set to avoid
updating the last access time in `lookupKey`. An exception must be made
for the `TOUCH` command which must always update the key.

When called from a script, `server.executing_client` will point to the
`TOUCH` command, while `server.current_client` will point to e.g. an
`EVAL` command. So, we must use the former to find out the currently
executing command if defined.

This fix addresses the issue where TOUCH wasn't updating key access
times when called from scripts like EVAL.

Fixes #1498

Signed-off-by: Simon Baatz <gmbnomis@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
2025-01-03 09:41:15 +08:00
ranshid
0f273bb648
Align rejected unblocked commands to update the correct error statistic (#577)
Currently, in case a blocked command is unblocked externally (eg. due to
the relevant slot being migrated or the CLIENT UNBLOCK command was
issued, the command statistics will always update the failed_calls error
statistic. This leads to missalignment with
90b9f08e5d
as well as some inconsistencies. For example when a key is migrated
during cluster slot migration, clients blocked on XREADGROUP will be
unblocked and update the rejected_calls stat, while clients blocked on
BLPOP will get unblocked updating the failed_calls stat.

In this PR we add explicit indication in updateStatsOnUnblock thet
indicates if the command was rejected or failed.

---------

Signed-off-by: ranshid <ranshid@amazon.com>
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
2025-01-01 16:33:09 +02:00
Pierre
e4179f1f3b
Only (re-)send MEET packet once every handshake timeout period (#1441)
Add `meet_sent` field in `clusterNode` indicating the last time we sent
a MEET packet. Use this field to only (re-)send a MEET packet once every
handshake timeout period when detecting a node without an inbound link.

When receiving multiple MEET packets on the same link while the node is
in handshake state, instead of dropping the packet, we now simply
prevent the creation of a new node. This way we still process the MEET
packet's gossip and reply with a PONG as any other packets.

Improve some logging messages to include `human_nodename`. Add
`nodeExceedsHandshakeTimeout()` function.

This is a follow-up to this previous PR:
https://github.com/valkey-io/valkey/pull/1307
And a partial fix to the crash described in:
https://github.com/valkey-io/valkey/pull/1436

---------

Signed-off-by: Pierre Turin <pieturin@amazon.com>
2024-12-30 15:56:39 -05:00
Ricardo Dias
6adef8e2f9
Adds support for scripting engines as Valkey modules (#1277)
This PR extends the module API to support the addition of different
scripting engines to execute user defined functions.

The scripting engine can be implemented as a Valkey module, and can be
dynamically loaded with the `loadmodule` config directive, or with the
`MODULE LOAD` command.

This PR also adds an example of a dummy scripting engine module, to show
how to use the new module API. The dummy module is implemented in
`tests/modules/helloscripting.c`.

The current module API support, only allows to load scripting engines to
run functions using `FCALL` command.

The additions to the module API are the following:

```c
/* This struct represents a scripting engine function that results from the
 * compilation of a script by the engine implementation. */
struct ValkeyModuleScriptingEngineCompiledFunction

typedef ValkeyModuleScriptingEngineCompiledFunction **(*ValkeyModuleScriptingEngineCreateFunctionsLibraryFunc)(
    ValkeyModuleScriptingEngineCtx *engine_ctx,
    const char *code,
    size_t timeout,
    size_t *out_num_compiled_functions,
    char **err);

typedef void (*ValkeyModuleScriptingEngineCallFunctionFunc)(
    ValkeyModuleCtx *module_ctx,
    ValkeyModuleScriptingEngineCtx *engine_ctx,
    ValkeyModuleScriptingEngineFunctionCtx *func_ctx,
    void *compiled_function,
    ValkeyModuleString **keys,
    size_t nkeys,
    ValkeyModuleString **args,
    size_t nargs);

typedef size_t (*ValkeyModuleScriptingEngineGetUsedMemoryFunc)(
    ValkeyModuleScriptingEngineCtx *engine_ctx);

typedef size_t (*ValkeyModuleScriptingEngineGetFunctionMemoryOverheadFunc)(
    void *compiled_function);

typedef size_t (*ValkeyModuleScriptingEngineGetEngineMemoryOverheadFunc)(
    ValkeyModuleScriptingEngineCtx *engine_ctx);

typedef void (*ValkeyModuleScriptingEngineFreeFunctionFunc)(
    ValkeyModuleScriptingEngineCtx *engine_ctx,
    void *compiled_function);

/* This struct stores the callback functions implemented by the scripting
 * engine to provide the functionality for the `FUNCTION *` commands. */
typedef struct ValkeyModuleScriptingEngineMethodsV1 {
    uint64_t version; /* Version of this structure for ABI compat. */

    /* Library create function callback. When a new script is loaded, this
     * callback will be called with the script code, and returns a list of
     * ValkeyModuleScriptingEngineCompiledFunc objects. */
    ValkeyModuleScriptingEngineCreateFunctionsLibraryFunc create_functions_library;

    /* The callback function called when `FCALL` command is called on a function
     * registered in this engine. */
    ValkeyModuleScriptingEngineCallFunctionFunc call_function;

    /* Function callback to get current used memory by the engine. */
    ValkeyModuleScriptingEngineGetUsedMemoryFunc get_used_memory;

    /* Function callback to return memory overhead for a given function. */
    ValkeyModuleScriptingEngineGetFunctionMemoryOverheadFunc get_function_memory_overhead;

    /* Function callback to return memory overhead of the engine. */
    ValkeyModuleScriptingEngineGetEngineMemoryOverheadFunc get_engine_memory_overhead;

    /* Function callback to free the memory of a registered engine function. */
    ValkeyModuleScriptingEngineFreeFunctionFunc free_function;
} ValkeyModuleScriptingEngineMethodsV1;

/* Registers a new scripting engine in the server.
 *
 * - `engine_name`: the name of the scripting engine. This name will match
 *   against the engine name specified in the script header using a shebang.
 *
 * - `engine_ctx`: engine specific context pointer.
 *
 * - `engine_methods`: the struct with the scripting engine callback functions
 * pointers.
 */
int ValkeyModule_RegisterScriptingEngine(ValkeyModuleCtx *ctx,
                                         const char *engine_name,
                                         void *engine_ctx,
                                         ValkeyModuleScriptingEngineMethods engine_methods);

/* Removes the scripting engine from the server.
 *
 * `engine_name` is the name of the scripting engine.
 *
 */
int ValkeyModule_UnregisterScriptingEngine(ValkeyModuleCtx *ctx, const char *engine_name);
```

---------

Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
2024-12-21 23:09:35 +01:00
Madelyn Olson
b56f4f70d2
Update info.tcl test to revert client output limits sooner (#1462)
We set the client output buffer limits to 10 bytes, and then execute
`info stats` which produces more than 10 bytes of output, which can
cause that command to throw an error.

I'm not sure why it wasn't consistently erroring before, might have been
some change related to the ubuntu upgrade though.

Issues related to ubuntu-tls are hopefully resolved now.

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2024-12-20 10:16:46 +08:00
Madelyn Olson
60197b30e2
Attempt to read secondary error from info test (#1452)
The test attempts to write 1MB of data in order to trigger a disconnect.
Normally, the data is fully flushed and we get the error on the read
(I/O error). However, it's possible we might fail the write, which
leaves the client in an inconsistent state. On the next command, we
finally process the I/O error on the FD. So, the simple fix is to
consume any secondary errors.

---------

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2024-12-18 09:17:11 -08:00
ranshid
ba25b586d5
Introduce FORCE_DEFRAG compilation option to allow activedefrag run when allocator is not jemalloc (#1303)
Introduce compile time option to force activedefrag to run even when
jemalloc is not used as the allocator.
This is in order to be able to run tests with defrag enabled
while using memory instrumentation tools.

fixes: https://github.com/valkey-io/valkey/issues/1241

---------

Signed-off-by: ranshid <ranshid@amazon.com>
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-12-17 19:07:55 +02:00
Binbin
ad24220681
Automatic failover vote is not limited by two times the node timeout (#1356)
This is a follow of #1305, we now decided to apply the same change
to automatic failover as well, that is, move forward with removing
it for both automatic and manual failovers.

Quote from Ping during the review:
Note that we already debounce transient primary failures with node
timeout, ensuring failover is only triggered after sustained outages.
Election timing is naturally staggered by replica spacing, making the
likelihood of simultaneous elections from replicas of the same shard
very low. The one-vote-per-epoch rule further throttles retries and
ensures orderly elections. On top of that, quorum-based primary failure
confirmation, cluster-state convergence, and slot ownership validation
are all built into the process.

Quote from Madelyn during the review:
It against the specific primary. It's to prevent double failovers.
If a primary just took over we don't want someone else to try to
take over and give the new primary some amount of time to take over.
I have not seen this issue though, it might have been over optimizing?
The double failure mode, where a node fails and then another node fails
within the nodetimeout also doesn't seem that common either though.

So the conclusion is that we all agreed to remove it completely,
it will make the code a lot simpler. And if there is other specific
edge cases we are missing, we will fix it in other way.

See discussion #1305 for more information.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-12-15 12:09:53 +08:00
Rain Valentine
88942c8e61
Replace dict with new hashtable for sets datatype (#1176)
The new `hashtable` provides faster lookups and uses less memory than
`dict`.

A TCL test case "SRANDMEMBER with a dict containing long chain" is
deleted because it's covered by a hashtable unit test
"test_random_entry_with_long_chain", which is already present.

This change also moves some logic from dismissMemory (object.c) to
zmadvise_dontneed (zmalloc.c), so the hashtable implementation which
needs the dismiss functionality doesn't need to depend on object.c and
server.h.

This PR follows #1186.

---------

Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-12-14 20:53:48 +01:00
Madelyn Olson
0e96bb311e
Synchronously delete data during defrag tests (#1443)
The creation of fragmentation is delayed when we use lazy-free. You can
induce some of the active-defrag tests to fail by artificially adding a
delay in the lazyfree process, similar to the issues seen in #1433 and
issues like
https://github.com/valkey-io/valkey/actions/runs/12267010712/job/34226304803#step:7:6538.
The solution is to always do sync free during tests.

Might close https://github.com/valkey-io/valkey/issues/1433.

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2024-12-14 19:14:01 +01:00
Jim Brunner
32f2c73cb5
defrag: eliminate persistent kvstore pointer and edge case fixes (#1430)
This update addresses several issues in defrag:
1. In the defrag redesign
(https://github.com/valkey-io/valkey/pull/1242), a bug was introduced
where `server.cronloops` was no longer being incremented in the
`whileBlockedCron()`. This resulted in some memory statistics not being
updated while blocked.
2. In the test case for AOF loading, we were seeing errors due to defrag
latencies. However, running the math, the latencies are justified given
the extremely high CPU target of the testcase. Adjusted the expected
latency check to allow longer latencies for this case where defrag is
undergoing starvation while AOF loading is in progress.
3. A "stage" is passed a "target". For the main dictionary and expires,
we were passing in a `kvstore*`. However, on flushall or swapdb, the
pointer may change. It's safer and more stable to use an index for the
DB (a DBID). Then if the pointer changes, we can detect the change, and
simply abort the stage. (If there's still fragmentation to deal with,
we'll pick it up again on the next cycle.)
4. We always start a new stage on a new defrag cycle. This gives the new
stage time to run, and prevents latency issues for certain stages which
don't operate incrementally. However, often several stages will require
almost no work, and this will leave a chunk of our CPU allotment unused.
This is mainly an issue in starvation situations (like AOF loading or
LUA script) - where defrag is running infrequently, with a large
duty-cycle. This change allows a new stage to be initiated if we still
have a standard duty-cycle remaining. (This can happen during starvation
situations where the planned duty cycle is larger than the standard
cycle. Most likely this isn't a concern for real scenarios, but it was
observed in testing.)
5. Minor comment correction in `server.h`

Signed-off-by: Jim Brunner <brunnerj@amazon.com>
2024-12-12 14:55:57 -08:00
Pierre
5f7fe9ef21
Send MEET packet to node if there is no inbound link to fix inconsistency when handshake timedout (#1307)
In some cases, when meeting a new node, if the handshake times out, we
can end up with an inconsistent view of the cluster where the new node
knows about all the nodes in the cluster, but the cluster does not know
about this new node (or vice versa).
To detect this inconsistency, we now check if a node has an outbound
link but no inbound link, in this case it probably means this node does
not know us. In this case we (re-)send a MEET packet to this node to do
a new handshake with it.
If we receive a MEET packet from a known node, we disconnect the
outbound link to force a reconnect and sending of a PING packet so that
the other node recognizes the link as belonging to us. This prevents
cases where a node could send MEET packets in a loop because it thinks
the other node does not have an inbound link.

This fixes the bug described in #1251.

---------

Signed-off-by: Pierre Turin <pieturin@amazon.com>
2024-12-11 17:26:06 -08:00
Jim Brunner
0c8ad5cd34
defrag: allow defrag to start during AOF loading (#1420)
Addresses https://github.com/valkey-io/valkey/issues/1393

Changes:
* During AOF loading or long running script, this allows defrag to be
initiated.
* The AOF defrag test was corrected to eliminate the wait period and
rely on non-timer invocations.
* Logic for "overage" time in defrag was changed. It previously
accumulated underage leading to large latencies in extreme tests having
very high CPU percentage. After several simple stages were completed
during infrequent blocked processing, a large cycle time would be
experienced.

Signed-off-by: Jim Brunner <brunnerj@amazon.com>
2024-12-11 19:47:06 +02:00
Viktor Söderqvist
3eb8314be6 Replace dict with hashtable for keys, expires and pubsub channels
Instead of a dictEntry with pointers to key and value, the hashtable
has a pointer directly to the value (robj) which can hold an embedded
key and acts as a key-value in the hashtable. This minimizes the number
of pointers to follow and thus the number of memory accesses to lookup
a key-value pair.

        Keys         robj
      hashtable
      +-------+   +-----------------------+
      | 0     |   | type, encoding, LRU   |
      | 1 ------->| refcount, expire      |
      | 2     |   | ptr                   |
      | ...   |   | optional embedded key |
      +-------+   | optional embedded val |
                  +-----------------------+

The expire timestamp (TTL) is also stored in the robj, if any. The expire
hash table points to the same robj.

Overview of changes:

* Replace dict with hashtable in kvstore (kvstore.c)
* Add functions for embedding key and expire in robj (object.c)
  * When there's unused space, reserve an expire field to avoid realloting
    it later if expire is added.
  * Always reserve space for expire for large key names to avoid realloc
    if it's set later.
* Update db functions (db.c)
  * dbAdd, setKey and setExpire reallocate the object when embedding a key
  * setKey does not increment the reference counter, since it would require
    duplicating the object. This responsibility is moved to the caller.
* Remove logic for shared integer objects as values in the database. The keys
  are now embedded in the objects, so all objects in the database need to be
  unique. Thus, we can't use shared objects as values. Also delete test cases
  for shared integers.
* Adjust various commands to the changes mentioned above.
* Adjust defrag code
  * Improvement: Don't access the expires table before defrag has actually
    reallocated the object.
* Adjust test cases that were using hard-coded sizes for dict when realloc
  would happen, and some other adjustments in test cases.
* Adjust memory prefetch for new hash table implementation in IO-threading,
  using new `hashtableIncrementalFind` API
* Adjust offloading of free() to IO threads: Object free to be done in main
  thread while keeping obj->ptr offloading in IO-thread since the DB object is
  now allocated by the main-thread and not by the IO-thread as it used to be.
* Let expireIfNeeded take an optional value, to avoid looking up the expires
  table when possible.

---------

Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: uriyage <78144248+uriyage@users.noreply.github.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Uri Yagelnik <uriy@amazon.com>
2024-12-10 21:30:56 +01:00
Viktor Söderqvist
b4c2a1804a
Fix flaky init_test proc in maxmemory test suite (#1419)
The following error has been seen, but not reliably reproduced:

```
*** [err]: eviction due to output buffers of pubsub, client eviction: true in tests/unit/maxmemory.tcl
Expected '42' to be equal to '50' (context: type proc line 17 cmd {assert_equal [r dbsize] 50} proc ::init_test level 2)
```

The reason is probably that FLUSHDB is asynchronous and when we start
populating new keys, they are evicted because the background flush is
too slow. Changing this to FLUSHDB SYNC prevents this.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-12-10 20:52:06 +02:00
Viktor Söderqvist
2dfe25b408
Fix race in test "CLUSTER SLOT-STATS cpu-usec for blocking commands, unblocked on timeout" (#1416)
This fix changes the timeout for BLPOP in this test case from 1 second
to 0.5 seconds.

In the test case quoted below, the procedure
`wait_for_blocked_clients_count` waits for one second by default. If
BLPOP has 1 second timeout and the first
`wait_for_blocked_clients_count` finishes very fast, then the second
`wait_for_blocked_clients_count` can time out before the BLPOP has been
unblocked.

```TCL
    test "CLUSTER SLOT-STATS cpu-usec for blocking commands, unblocked on timeout." {
        # Blocking command with 1 second timeout.
        set rd [valkey_deferring_client]
        $rd BLPOP $key 1

        # Confirm that the client is blocked, then unblocked after 1 second timeout.
        wait_for_blocked_clients_count 1
        wait_for_blocked_clients_count 0
```

As seen in the definition of `wait_for_blocked_clients_count`, the total
time to wait is 1 second by default.

```TCL
proc wait_for_blocked_clients_count {count {maxtries 100} {delay 10} {idx 0}} {
    wait_for_condition $maxtries $delay  {
        [s $idx blocked_clients] == $count
    } else {
        fail "Timeout waiting for blocked clients"
    }
}
```

Fixes #1121

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-12-10 14:46:21 +01:00
Yanqi Lv
f951a1ca73
Add new flag in CLIENT LIST for import-source client (#1398)
- Add new flag "I" in `CLIENT LIST` for import-source client
- Add `DEBUG_CONFIG` for import-mode
- Allow import-source status to be turned off when import-mode is off

Fixes #1350 and
https://github.com/valkey-io/valkey/pull/1185#discussion_r1851049362.

---------

Signed-off-by: lvyanqi.lyq <lvyanqi.lyq@alibaba-inc.com>
Signed-off-by: Yanqi Lv <lvyanqi.lyq@alibaba-inc.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Binbin <binloveplay1314@qq.com>
2024-12-10 13:35:07 +01:00
Sarthak Aggarwal
9cfe1b3d81
Set Command with IFEQ Support (#1324)
This PR allows the Valkey users to perform conditional updates where the
SET command is completed if the given comparison-value matches the key’s
current value.

Syntax:

```
SET key value IFEQ comparison-value
```

Behavior:

If the values match, the SET completes as expected. If they do not
match, the command returns a (nil), except if the GET argument is also
given (see below).

Behavior with Additional Flags:

1. ```SET key value IFEQ comparison-value GET``` returns the existing
value, regardless of whether it matches comparison-value or not. The
conditional set operation is performed if the given comparison value
matches the existing value. To check if the SET succeeded, the caller
needs to check if the returned string matches the comparison-value.
2. ```SET key value IFEQ comparison-value XX``` is a syntax error.
3.  ```SET key value IFEQ comparison-value NX``` is a syntax error.

Closes: #1215

---------

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
2024-12-10 12:54:49 +01:00
Binbin
924729eb16
Fix the election was reset wrongly before failover epoch was obtained (#1339)
After #1009, we will reset the election when we received
a claim with an equal or higher epoch since a node can win
an election in the past.

But we need to consider the time before the node actually
obtains the failover_auth_epoch. The failover_auth_epoch
default is 0, so before the node actually get the failover
epoch, we might wrongly reset the election.

This is probably harmless, but will produce misleading log
output and may delay election by a cron cycle or beforesleep.
Now we will only reset the election when a node is actually
obtains the failover epoch.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-12-09 16:19:02 +08:00
Roman Gershman
b09db3ef78
Fix typo in streams seen-time / active-time test (#1409)
This variable name is wrong, it causes the wrong variable to be asserted.

Signed-off-by: Roman Gershman <romange@gmail.com>
2024-12-09 16:01:43 +08:00
Guillaume Koenig
e8078b7315
Allow MEMORY MALLOC-STATS and MEMORY PURGE during loading phase (#1317)
- Enable investigation of memory issues during loading
- Previously, all memory commands were rejected with LOADING error
(except memory help)
- `MEMORY MALLOC-STATS` and `MEMORTY PURGE` are now allowed
as they don't depend on the dataset
- `MEMORY STATS` and `MEMORY USAGE KEY` remain disallowed

Fixes #1299

Signed-off-by: Guillaume Koenig <knggk@amazon.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
2024-12-08 20:30:07 +08:00
Binbin
176fafcaf7
Add a note to conf about the dangers of modifying dir at runtime (#887)
We've had security issues in the past with it, which is why
we marked it as PROTECTED. But, modifying during runtime
is also a dangerous action. For example, when child processes
are running, persistent temp files and log files may have
unexpected effects.

A scenario for modifying dir at runtime is to migrate a disk
failure, such as using disk-based replication to migrate a node,
writing nodes.conf to save the cluster configuration.

We decided to leave it as is and add a note in the conf
about the dangers of modifying dir at runtime.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-12-08 20:28:14 +08:00
Viktor Söderqvist
a2fe6af457
Fix Module Update Args test when other modules are loaded (#1403)
Fixes #1400

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-12-07 10:25:40 +01:00
Wen Hui
71560a2a4a
Add API UpdateRuntimeArgs for updating the module arguments during runtime (#1041)
Before Redis OSS 7, if we load a module with some arguments during
runtime,
and run the command "config rewrite", the module information will not be
saved into the
config file.

Since Redis OSS 7 and Valkey 7.2, if we load a module with some
arguments during runtime,
the module information (path, arguments number, and arguments value) can
be saved into the config file after config rewrite command is called.
Thus, the module will be loaded automatically when the server startup
next time.

Following is one example:

bind 172.25.0.58
port 7000
protected-mode no
enable-module-command yes

Generated by CONFIG REWRITE
latency-tracking-info-percentiles 50 99 99.9
dir "/home/ubuntu/valkey"
save 3600 1 300 100 60 10000
user default on nopass sanitize-payload ~* &* +https://github.com/ALL
loadmodule tests/modules/datatype.so 10 20

However, there is one problem.
If developers write a module, and update the running arguments by
someway, the updated arguments can not be saved into the config file
even "config rewrite" is called.
The reason comes from the following function
rewriteConfigLoadmoduleOption (src/config.c)

void rewriteConfigLoadmoduleOption(struct rewriteConfigState *state) {
..........
struct ValkeyModule *module = dictGetVal(de);
line = sdsnew("loadmodule ");
line = sdscatsds(line, module->loadmod->path);
for (int i = 0; i < module->loadmod->argc; i++) {
line = sdscatlen(line, " ", 1);
line = sdscatsds(line, module->loadmod->argv[i]->ptr);
}
rewriteConfigRewriteLine(state, "loadmodule", line, 1);
.......
}

The function only save the initial arguments information
(module->loadmod) into the configfile.

After core members discuss, ref
https://github.com/valkey-io/valkey/issues/1177


We decide add the following API to implement this feature:

Original proposal:

int VM_UpdateRunTimeArgs(ValkeyModuleCtx *ctx, int index, char *value);


Updated proposal:

ValkeyModuleString **values VM_GetRuntimeArgs(ValkeyModuleCtx *ctx);
**int VM_UpdateRuntimeArgs(ValkeyModuleCtx *ctx, int argc,
ValkeyModuleString **values);



Why we do not recommend the following way: 


MODULE UNLOAD
Update module args in the conf file
MODULE LOAD

I think there are the following disadvantages:

1. Some modules can not be unloaded. Such as the example module
datatype.so, which is tests/modules/datatype.so
2. it is not atomic operation for MODULE UNLOAD + MODULE LOAD
3. sometimes, if we just run the module unload, the client business
could be interrupted

---------

Signed-off-by: hwware <wen.hui.ware@gmail.com>
2024-12-05 11:58:24 -05:00
uriyage
9f8b174c2e
Optimize IO thread offload for modified argv (#1360)
### Improve expired commands performance with IO threads

#### Background
In our IO threads architecture, IO threads allocate client argv's and
later when we free it after processCommand we offload its free to the IO
threads.
With jemalloc, it's crucial that the same thread that allocates memory
also frees it.

For some commands we modify the client's argv in the main thread during
command processing (for example in `SET EX` command we rewrite the
command to use absolute time for replication propagation).

#### Current issues
1. When commands are rewritten (e.g., expire commands), we store the
original argv
   in `c->original_argv`. However, we're currently:
   - Freeing new argv (allocated by main thread) in IO threads
   - Freeing original argv (allocated by IO threads) in main thread
2. Currently, `c->original_argv` points to new array with old 
objects, while `c->argv` has old array with new objects, making memory
free management complicated.

#### Changes
1. Refactored argv modification handling code to ensure consistency -
both array and objects are now either all new or all old
2. Moved original_argv cleanup to happen in resetClient after argv
cleanup
3. Modified IO threads code to properly handle original argv cleanup
when argv are modified.

#### Performance Impact
Benchmark with `SET EX` commands (650 clients, 512 byte value, 8 IO
threads):
- New implementation: **729,548 ops/sec**
- Old implementation: **633,243 ops/sec**
Representing a **~15%** performance improvement due to more efficient
memory handling.

---------

Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>
2024-12-03 19:20:31 +02:00
Jim Brunner
397201c48f
Refactor of ActiveDefrag to reduce latencies (#1242)
Refer to:  https://github.com/valkey-io/valkey/issues/1141

This update refactors the defrag code to:
* Make the overall code more readable and maintainable
* Reduce latencies incurred during defrag processing

With this update, the defrag cycle time is reduced to 500us, with more
frequent cycles. This results in much more predictable latencies, with a
dramatic reduction in tail latencies.

(See https://github.com/valkey-io/valkey/issues/1141 for more complete
details.)

This update is focused mostly on the high-level processing, and does NOT
address lower level functions which aren't currently timebound (e.g.
`activeDefragSdsDict()`, and `moduleDefragGlobals()`). These are out of
scope for this update and left for a future update.

I fixed `kvstoreDictLUTDefrag` because it was using up to 7ms on a CME
single shard. See original github issue for performance details.

---------

Signed-off-by: Jim Brunner <brunnerj@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-12-03 08:42:29 -08:00
Nugine
3df609ef06
Optimize PFCOUNT, PFMERGE command by SIMD acceleration (#1293)
This PR optimizes the performance of HyperLogLog commands (PFCOUNT,
PFMERGE) by adding AVX2 fast paths.

Two AVX2 functions are added for conversion between raw representation
and dense representation. They are 15 ~ 30 times faster than scalar
implementaion. Note that sparse representation is not accelerated.

AVX2 fast paths are enabled when the CPU supports AVX2 (checked at
runtime) and the hyperloglog configuration is default (HLL_REGISTERS ==
16384 && HLL_BITS == 6).

`PFDEBUG SIMD (ON|OFF)` subcommand is added for unit tests. A new TCL
unit test checks that the results produced by non-AVX2 and AVX2
implementations are exactly equal.

When merging 3 dense hll structures, the benchmark shows a 12x speedup
compared to the scalar version.

```
pfcount key1 key2 key3
pfmerge keyall key1 key2 key3
```

```
======================================================================================================
Type             Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
------------------------------------------------------------------------------------------------------
PFCOUNT-scalar    5665.56        35.29839        32.25500        63.99900        67.58300       608.60
PFCOUNT-avx2     72377.83         2.75834         2.67100         5.34300         6.81500      7774.96
------------------------------------------------------------------------------------------------------
PFMERGE-scalar    9851.29        20.28806        20.09500        36.86300        39.16700       615.71
PFMERGE-avx2    125621.89         1.59126         1.55100         3.11900         4.70300     15702.74
------------------------------------------------------------------------------------------------------

scalar: valkey:unstable  2df56d87c0ebe802f38e8922bb2ea1e4ca9cfa76
avx2:   Nugine:hll-simd  8f9adc34021080d96e60bd0abe06b043f3ed0275

CPU:    13th Gen Intel® Core™ i9-13900H × 20
Memory: 32.0 GiB
OS:     Ubuntu 22.04.5 LTS
```

Experiment repo: https://github.com/Nugine/redis-hyperloglog
Benchmark script:
https://github.com/Nugine/redis-hyperloglog/blob/main/scripts/memtier.sh
Algorithm:
https://github.com/Nugine/redis-hyperloglog/blob/main/cpp/bench.cpp

---------

Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>
2024-12-02 19:40:38 +01:00
Binbin
fbbfe5d3d3
Print logs when the cluster state changes to fail or the fail reason changes (#1188)
This log allows us to easily distinguish between full coverage and
minority partition when the cluster fails. Sometimes it is not easy
to see the minority partition in a healthy shards (both primary and
replicas).

And we decided not to add a cluster_fail_reason field to cluster info.
Given that there are only two reasons and both are well-known and if
we ended up adding more down the road we can add it in the furture.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-12-02 15:55:24 +08:00
zhenwei pi
4695d118dd
RDMA builtin support (#1209)
There are several patches in this PR:

* Abstract set/rewrite config bind option: `bind` option is a special
config, `socket` and `tls` are using the same one. However RDMA uses the
similar style but different one. Use a bit abstract work to make it
flexible for both `socket` and `RDMA`. (Even for QUIC in the future.)
* Introduce closeListener for connection type: closing socket by a
simple syscall would be fine, RDMA has complex logic. Introduce
connection type specific close listener method.
* RDMA: Use valkey.conf style instead of module parameters: use
`--rdma-bind` and `--rdma-port` style instead of module parameters. The
module style config `rdma.bind` and `rdma.port` are removed.
* RDMA: Support builtin: support `make BUILD_RDMA=yes`. module style is
still kept for now.

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
2024-11-29 11:13:34 +01:00
zvi-code
fd58f8d058
Disable lazy free in defrag test to fix 32bit daily failure (#1370)
Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com>
Co-authored-by: Zvi Schneider <zvi.schneider22@gmail.com>
2024-11-28 16:27:00 +01:00
Binbin
db7b7396ff
Make KEYS can visit expired key in import-source state (#1326)
After #1185, a client in import-source state can visit expired key
both in read commands and write commands, this commit handle
keyIsExpired function to handle import-source state as well, so
KEYS can visit the expired key.

This is not particularly important, but it ensures the definition,
also doing some cleanup around the test, verified that the client
can indeed visit the expired key.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-11-28 00:16:55 +08:00
Binbin
b9d224097a
Brocast a PONG to all node in cluster when role changed (#1295)
When a node role changes, we should brocast the change to notify other nodes.
For example, one primary and one replica, after a failover, the replica became
a new primary, the primary became a new replica.

And then we trigger a second cluster failover for the new replica, the
new replica will send a MFSTART to its primary, ie, the new primary.

But the new primary may reject the MFSTART due to this logic:
```
    } else if (type == CLUSTERMSG_TYPE_MFSTART) {
        if (!sender || sender->replicaof != myself) return 1;
```

In the new primary views, sender is still a primary, and sender->replicaof
is NULL, so we will return. Then the manual failover timedout.

Another possibility is that other primaries refuse to vote after receiving
the FAILOVER_AUTH_REQUEST, since in their's views, sender is still a primary,
so it refuse to vote, and then manual failover timedout.
```
void clusterSendFailoverAuthIfNeeded(clusterNode *node, clusterMsg *request) {
    ...
        if (clusterNodeIsPrimary(node)) {
            serverLog(LL_WARNING, "Failover auth denied to...
```

The reason is that, currently, we only update the node->replicaof information
when we receive a PING/PONG from the sender. For details, see clusterProcessPacket.
Therefore, in some scenarios, such as clusters with many nodes and a large
cluster-ping-interval (that is, cluster-node-timeout), the role change of the node
will be very delayed.

Added a DEBUG DISABLE-CLUSTER-RANDOM-PING command, send cluster ping
to a random node every second (see clusterCron).

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-11-23 00:22:04 +08:00
Binbin
c4be326c32
Make manual failover reset the on-going election to promote failover (#1274)
If a manual failover got timed out, like the election don't get the
enough votes, since we have a auth_timeout and a auth_retry_time, a
new manual failover will not be able to proceed on the replica side.

Like if we initiate a new manual failover after a election timed out,
we will pause the primary, but on the replica side, due to retry_time,
replica does not trigger the new election and the manual failover will
eventually time out.

In this case, if we initiate manual failover again and there is an
ongoing election, we will reset it so that the replica can initiate
a new election at the manual failover's request.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-11-22 10:28:59 +08:00
Yanqi Lv
4986310945
Import-mode: Avoid expiration and eviction during data syncing (#1185)
New config: `import-mode (yes|no)`

New command: `CLIENT IMPORT-SOURCE (ON|OFF)`

The config, when set to `yes`, disables eviction and deletion of expired
keys, except for commands coming from a client which has marked itself
as an import-source, the data source when importing data from another
node, using the CLIENT IMPORT-SOURCE command.

When we sync data from the source Valkey to the destination Valkey using
some sync tools like
[redis-shake](https://github.com/tair-opensource/RedisShake), the
destination Valkey can perform expiration and eviction, which may cause
data corruption. This problem has been discussed in
https://github.com/redis/redis/discussions/9760#discussioncomment-1681041
and Redis already have a solution. But in Valkey we haven't fixed it by
now.

E.g. we call `set key 1 ex 1` on the source server and transfer this
command to the destination server. Then we call `incr key` on the source
server before the key expired, we will have a key on the source server
with a value of 2. But when the command arrived at the destination
server, the key may be expired and has deleted. So we will have a key on
the destination server with a value of 1, which is inconsistent with the
source server.

In standalone mode, we can use writable replica to simplify the sync
process. However, in cluster mode, we still need a sync tool to help us
transfer the source data to the destination. The sync tool usually work
as a normal client and the destination works as a primary which keep
expiration and eviction.

In this PR, we add a new mode named 'import-mode'. In this mode, server
stop expiration and eviction just like a replica. Notice that this mode
exists only in sync state to avoid data inconsistency caused by
expiration and eviction. Import mode only takes effect on the primary.
Sync tools can mark their clients as an import source by `CLIENT
IMPORT-SOURCE`, which work like a client from primary and can visit
expired keys in `lookupkey`.

**Notice: during the migration, other clients, apart from the import
source, should not access the data imported by import source.**

---------

Signed-off-by: lvyanqi.lyq <lvyanqi.lyq@alibaba-inc.com>
Signed-off-by: Yanqi Lv <lvyanqi.lyq@alibaba-inc.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-11-19 21:53:19 +01:00
Binbin
ee386c92ff
Manual failover vote is not limited by two times the node timeout (#1305)
This limit should not restrict manual failover, otherwise in some
scenarios, manual failover will time out.

For example, if some FAILOVER_AUTH_REQUESTs or some FAILOVER_AUTH_ACKs
are lost during a manual failover, it cannot vote in the second manual
failover. Or in a mixed scenario of plain failover and manual failover,
it cannot vote for the subsequent manual failover.

The problem with the manual failover retry is that the mf will pause
the client 5s in the primary side. So every retry every manual failover
timed out is a bad move.

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-11-19 11:17:20 -05:00
Binbin
aa2dd3ecb8
Stabilize replica migration test to make sure cluster config is consistent (#1311)
CI report this failure:
```
[exception]: Executing test client: MOVED 1 127.0.0.1:22128.
MOVED 1 127.0.0.1:22128
    while executing
"wait_for_condition 1000 50 {
            [R 3 get key_991803] == 1024 && [R 3 get key_977613] == 10240 &&
            [R 4 get key_991803] == 1024 && ..."
```

This may be because, even though the cluster state becomes OK,
The cluster still has inconsistent configuration for a short period
of time. We make sure to wait for the config to be consistent.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-11-16 18:58:25 +08:00
Binbin
92181b6797
Fix primary crash when processing dirty slots during shutdown wait / failover wait / client pause (#1131)
We have an assert in propagateNow. If the primary node receives a
CLUSTER UPDATE such as dirty slots during SIGTERM waitting or during
a manual failover pausing or during a client pause, the delKeysInSlot
call will trigger this assert and cause primary crash.

In this case, we added a new server_del_keys_in_slot state just like
client_pause_in_transaction to track the state to avoid the assert
in propagateNow, the dirty slots will be deleted in the end without
affecting the data consistency.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-11-15 16:47:15 +08:00
Binbin
2df56d87c0
Fix empty primary may have dirty slots data due to bad migration (#1285)
If we become an empty primary for some reason, we still need to
check if we need to delete dirty slots, because we may have dirty
slots data left over from a bad migration. Like the target node forcibly
executes CLUSTER SETSLOT NODE to take over the slot without
performing key migration.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-11-11 22:13:47 +08:00
Binbin
a2d22c63c0
Fix replica not able to initate election in time when epoch fails (#1009)
If multiple primary nodes go down at the same time, their replica nodes will
initiate the elections at the same time. There is a certain probability that
the replicas will initate the elections in the same epoch.

And obviously, in our current election mechanism, only one replica node can
eventually get the enough votes, and the other replica node will fail to win
due the the insufficient majority, and then its election will time out and
we will wait for the retry, which result in a long failure time.

If another node has been won the election in the failover epoch, we can assume
that my election has failed and we can retry as soom as possible.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-11-11 22:12:49 +08:00