2494 Commits

Author SHA1 Message Date
Viktor Söderqvist
12ec3d5932
Increase timeout for cross-version-replication test (#1644)
Fixes #1641

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2025-01-29 13:29:35 -08:00
Madelyn Olson
d3aabd7f13
Hex encode the data in dump test (#1637)
Addresses the failure here:
https://github.com/valkey-io/valkey/actions/runs/13000845302/job/36259016156#step:5:7272.

This change does three things:
1. For some reason TCL 8.5 (which is used on macos) is handling `\x03ba`
as `\0xba`, according to
https://www.tcl-lang.org/man/tcl8.5/TclCmd/Tcl.htm#M27 so we encode
"bar" using hex escapes too.
2. Fix a spacing issue. 
3. Make it so that if the restore fails, it immediately errors.

---------

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2025-01-29 16:20:38 -05:00
Binbin
4b8f3ed9ac
Do command existence and arity checks when loading AOF to avoid crash (#1614)
Do command existence and arity checks when loading AOF to avoid crash

Currently, loading commands such as `cluster` or `cluster slots xxx`
from AOF will cause the server to crash.
1. `cluster` is a container command, and executing proc will cause a
    crash because we do not check subcommand and arity.
2. `cluster slots xxx`, arity check fail, reply with an error from the
    AOF client and trigger a panic.

Of course, there are many other ways for a problematic AOF to cause the
panic, but it is still necessary do some basic checks before executing.
In this way, in these basic cases, we can print useful error messages
instead of crashing directly.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-01-30 01:06:13 +08:00
Madelyn Olson
f695c52acb
Fix timing issue in pause test (#1631) 2025-01-28 06:35:24 -08:00
ranshid
230efa4fbf
deflake tracking-redir-broken test (#1628)
This address 2 issues:

1. It is possible (somehow) that the inner server client (r) was not
working resp 3 when entering this test.
this makes sure it does.

2. in case the test failed it might leave the redirection client closed.
there is a cross test assumption it should be open, so moved most of the
assert checks to the end of the test.

example fail:
https://github.com/valkey-io/valkey/actions/runs/12979601179/job/36195523412

---------

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
2025-01-28 12:35:32 +02:00
Viktor Söderqvist
e9b8970e72
Relaxed RDB version check (#1604)
New config `rdb-version-check` with values:

* `strict`: Reject future RDB versions.
* `relaxed`: Try parsing future RDB versions and fail only when an
unknown RDB opcode or type is encountered.

This can make it possible for Valkey 8.1 to try read a dump from for
example Valkey 9.0 or later on a best-effort basis. The conditions for
when this is expected to work can be defined when the future Valkey
versions are released. Loading is expected to fail in the following
cases:

* If the data set contains any new key types or other data elements not
supported by the current version.
* If the RDB contains new representations or encodings of existing key
types or other data elements.

This change also prepares for the next RDB version bump. A range of RDB
versions (12-79) is reserved, since it's expected to be used by foreign
software RDB versions, so Valkey will not accept versions in this range
even with the `relaxed` version check. The DUMP/RESTORE format has no
magic string; only the RDB version number.

This change also prepares for the magic string to change from REDIS to
VALKEY next time we bump the RDB version.

Related to #1108.

---------

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2025-01-27 18:44:24 +01:00
Madelyn Olson
88a68303c0
Make sure to disable pause after fork for dual channel test (#1612)
Might close https://github.com/valkey-io/valkey/issues/1484.

I noticed that we don't disable pause after fork on the last test that
was getting executed, so it might getting stuck in pause loops after the
test ends if it tries another psync for any reason.

---------

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2025-01-27 06:44:48 -08:00
Viktor Söderqvist
66577573f2
Test coverage for COMMANDLOG HELP (#1617)
Fixes reply-schema-validator test job which needs coverage for all
commands.

Failing job:
https://github.com/valkey-io/valkey/actions/runs/12969591890/job/36173810824

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2025-01-27 04:38:54 +01:00
zhaozhao.zz
3f21705a6c
Feature COMMANDLOG to record slow execution and large request/reply (#1294)
As discussed in PR #336.

We have different types of resources like CPU, memory, network, etc. The
`slowlog` can only record commands eat lots of CPU during the processing
phase (doesn't include read/write network time), but can not record
commands eat too many memory and network. For example:

1. run "SET key value(10 megabytes)" command would not be recored in
slowlog, since when processing it the SET command only insert the
value's pointer into db dict. But that command eats huge memory in query
buffer and bandwidth from network. In this case, just 1000 tps can cause
10GB/s network flow.
2. run "GET key" command and the key's value length is 10 megabytes. The
get command can eat huge memory in output buffer and bandwidth to
network.

This PR introduces a new command `COMMANDLOG`, to log commands that
consume significant network bandwidth, including both input and output.
Users can retrieve the results using `COMMANDLOG get <count>
large-request` and `COMMANDLOG get <count> large-reply`, all subcommands
for `COMMANDLOG` are:

* `COMMANDLOG HELP`
* `COMMANDLOG GET <count> <slow|large-request|large-reply>`
* `COMMANDLOG LEN <slow|large-request|large-reply>`
* `COMMANDLOG RESET <slow|large-request|large-reply>`

And the slowlog is also incorporated into the commandlog.

For each of these three types, additional configs have been added for
control:

* `commandlog-request-larger-than` and
`commandlog-large-request-max-len` represent the threshold for large
requests(the unit is Bytes) and the maximum number of commands that can
be recorded.
* `commandlog-reply-larger-than` and `commandlog-large-reply-max-len`
represent the threshold for large replies(the unit is Bytes) and the
maximum number of commands that can be recorded.
* `commandlog-execution-slower-than` and
`commandlog-slow-execution-max-len` represent the threshold for slow
executions(the unit is microseconds) and the maximum number of commands
that can be recorded.
* Additionally, `slowlog-log-slower-than` and `slowlog-max-len` are now
set as aliases for these two new configs.

---------

Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
2025-01-24 11:41:40 +08:00
Viktor Söderqvist
99ed308817
Add cross-version test framework (and a simple test) (#1371)
This includes a way to run two versions of the server from the TCL test
framework. It's a preparation to add more cross-version tests. The
runtest script accepts a new parameter

    ./runtest --other-server-path path/to/valkey-server

and a new tag "needs:other-server" for test cases and start_server.
Tests with this tag are automatically skipped if `--other-server-path`
is not provided.

This PR adds it in a CI job with Valkey 7.2.7 by downloading a binary
release.

Fixes #76

---------

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2025-01-23 11:26:54 +01:00
ranshid
7fc958da52
fix test Protocol desync regression test with TLS (#1593)
remove socket nonblocking and simplify the validation

fixes https://github.com/valkey-io/valkey/issues/1592

Signed-off-by: ranshid <ranshid@amazon.com>
2025-01-21 08:57:01 +02:00
ranshid
dd92d079dc
Fix Protocol desync regression test (#1590)
The desync regression test was created as a regression test for the
following bug:
in case we embed NULL termination inside inline/multi-bulk message we
will not be able to perform strchr in order to
identify the newline(\n)/carriage-return(\r) in the client query buffer.
this can influence (for example) replica reading primary stream and keep
filling it's query buffer endlessly consuming more and more memory.

In order to handle the above risk, a check was added to verify the
inline bulk and multi-bulk size are not exceeding the 64K bytes in the
query-buffer. A test was placed in order to verify this.

This PR introduce the following fixes to the desync regression test:
1. fix the sent payload to flush 1024 bytes block of 'A's instead of
'payload' which was sent by mistake.
2. Make sure that the connection is correctly terminated on protocol
error by the server after exceeding the 64K and not over 64K.
3. add another test intrinsic which will also verify the nested bulk
with embedded null termination (was not verified before)

fixes https://github.com/valkey-io/valkey/issues/1583


NOTE: Although it is possible to change the use of strchr to a more
"safe" utility (eg memchr) which will not pause scan at first occurrence
of '\0', we still like to protect against over excessive usage of the
query buffer and also preserve the current behavior(?). We will look
into improving this though in a followup issue.

---------

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com>
2025-01-20 20:28:45 +02:00
ranshid
3032ccd48a
Change the shared format for dual channel replication logs (#1586)
change the format of the dual channel replication logs so that it will
not
conflict with existing log formats like modules. 

Fixes: https://github.com/valkey-io/valkey/issues/1509

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
2025-01-20 08:04:47 +02:00
Viktor Söderqvist
b2e4155f54
Lower latenct-monitor-threashold in expire-cycle test case (#1584)
The test case checks for expire-cycle in LATENCY LATEST, but with the
new hash table, the expiry-cycle is too fast to be logged by latency
monitor. Lower the latency monitor threshold to make it more likely to
be logged.

Fixes #1580

---------

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2025-01-19 19:23:00 +01:00
Harkrishn Patro
87cc3d7a71
Fix cluster info sent stats for message with light header (#1563)
This issue affected only two message types (CLUSTERMSG_TYPE_PUBLISH and CLUSTERMSG_TYPE_PUBLISHSHARD) because they used a light message header, which caused the CLUSTER INFO stats to miss sent/received message information for those types.

---------

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
2025-01-16 11:25:37 -08:00
Ricardo Dias
af71619c45
Extract the scripting engine code from the functions unit (#1312)
This commit creates a new compilation unit for the scripting engine code
by extracting the existing code from the functions unit.
We're doing this refactor to prepare the code for running the `EVAL`
command using different scripting engines.

This PR has a module API change: we changed the type of error messages
returned by the callback
`ValkeyModuleScriptingEngineCreateFunctionsLibraryFunc` to be a
`ValkeyModuleString` (aka `robj`);

This PR also fixes #1470.

---------

Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
2025-01-16 10:08:16 +01:00
Ray Cao
921ba19acb
Incr expired_keys if the unix-time is already expired for EXPIREAT and other commands(#1517)
Some commands that use unix-time, such as `EXPIREAT` and `SET EXAT`, should include the deleted keys in the `expired_keys` statistics if the specified time has already expired, and notifications should be sent in the manner of expired.

---------

Signed-off-by: Ray Cao <zisong.cw@alibaba-inc.com>
2025-01-16 16:40:34 +08:00
Sarthak Aggarwal
6a8f068e36
Adding Missing filters to CLIENT LIST and Dedup Parsing (#1401)
Adds filter options to CLIENT LIST:

    * USER <username>
      Return clients authenticated by <username>.
    * ADDR <ip:port>
      Return clients connected from the specified address.
    * LADDR <ip:port>
      Return clients connected to the specified local address.
    * SKIPME (YES|NO)
      Exclude the current client from the list (default: no).
    * MAXAGE <maxage>
      Only list connections older than the specified age.

Modifies the ID filter to CLIENT KILL to allow multiple IDs

    * ID <client-id> [<client-id>...]
      Kill connections by client ids.


This makes CLIENT LIST and CLIENT KILL accept the same options.

For backward compatibility, the default value for SKIPME is NO for
CLIENT LIST and YES for CLIENT KILL.

The MAXAGE comes from CLIENT KILL, where it *keeps* clients with the
given max age and kills the older ones. This logic becomes weird for
CLIENT LIST, but is kept for similary with CLIENT KILL, for the use case
of first testing manually using CLIENT LIST, and then running CLIENT
KILL with the same filters.

The `ID client-id [client-id ...]` no longer needs to be the last
filter. The parsing logic determines if an argument is an ID or not
based on whether it can be parsed as an integer or not.

Partly addresses: #668

---------

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
2025-01-15 20:44:13 +01:00
zhaozhao.zz
c5a1585547
add paused_actions for INFO Clients (#1519)
Add `paused_actions` and `paused_timeout_milliseconds` for INFO Clients
to inform users about if clients are paused.

---------

Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
2025-01-14 19:01:00 +08:00
Amit Nagler
6be1c77b1e
Fix valgrind test (#1555)
Introduced at https://github.com/valkey-io/valkey/pull/1165/files

Signed-off-by: naglera <anagler123@gmail.com>
2025-01-14 10:49:46 +02:00
secwall
fdc89c56b7
Escape unix socket group in unit tests (#1554)
In some cases unix groups could have whitespace and/or `\` in them.
One example is my workstation. It's a MacOS in an Active Directory
domain. So my user has group `LD\Domain Users`.
Running `make test` on `unstable` and `8.0` branches fails with:

I'm not sure if we need to fix this in 8.0. But it seems that it should
be fixed in unstable.

Signed-off-by: secwall <secwall@yandex-team.ru>
2025-01-13 20:05:04 -08:00
Viktor Söderqvist
dc9ca1b98d
Test coverage for ECHO for reply schema validation (#1549)
After #1545 disabled some tests for reply schema validation, we now have
another issue that ECHO is not covered.

```
WARNING! The following commands were not hit at all:
  echo
ERROR! at least one command was not hit by the tests
```

This patch adds a test case for ECHO in the unit/other test suite. I
haven't checked if there are more commands that aren't covered.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2025-01-13 10:14:09 +08:00
Viktor Söderqvist
ad592f73d7
Skip CLI tests with reply schema validation (#1545)
The commands used in valkey-cli tests are not important the reply schema
validation. Skip them to avoid the problem if tests hanging. This has
failed lately in the daily job:

```
[TIMEOUT]: clients state report follows.
sock55fedcc19be0 => (IN PROGRESS) valkey-cli pubsub mode with single standard channel subscription
Killing still running Valkey server 33357
```

These test cases use a special valkey-cli command `:get pubsub` command,
which is an internal command to valkey-cli rather than a Valkey server
command. This command hangs when compiled with with logreqres enabled.
Easy solution is to skip the tests in this setup.

The test cases were introduced in #1432.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2025-01-12 08:02:39 +08:00
Binbin
11cb8ee27c
Add latency stats around cluster config file operations (#1534)
When the cluster changes, we need to persist the cluster configuration,
and these file IO operations may cause latency.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-01-11 11:03:10 +08:00
Binbin
10357ceda5
Mark the node as FAIL when the node is marked as NOADDR and broadcast the FAIL (#1191)
Imagine we have a cluster, for example a three-shard cluster,
if shard 1 doing a CLUSTER RESET HARD, it will change the node
name, and then other nodes will mark it as NOADR since the node
name received by PONG has changed.

In the eyes of other nodes, there is one working primary node
left but with no address, and in this case, the address report
in MOVED will be invalid and will confuse the clients. And in
the same time, the replica will not failover since its primary
is not in the FAIL state. And the cluster looks OK to everyone.

This leaves a cluster that appears OK, but with no coverage for
shard 1, obviously we should do something like CLUSTER FORGET
to remove the node and fix the cluster before using it.

But the point in here, we can mark the NOADDR node as FAIL to
advance the cluster state. If a node is NOADDR means it does
not have a valid address, so we won't reconnect it, we won't
send PING, we won't gossip it, it seems reasonable to mark it
as FAIL.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-01-11 11:02:05 +08:00
Binbin
211b250aad
Do election in order based on failed primary rank to avoid voting conflicts (#1018)
When multiple primary nodes fail simultaneously, the cluster can not recover
within the default effective time (data_age limit). The main reason is that
the vote is without ranking among multiple replica nodes, which case too many
epoch conflicts.

Therefore, we introduced into ranking based on the failed primary shard-id.
Introduced a new failed_primary_rank var, this var means the rank of this
myself instance in the context of all failed primary list. This var will be
used in failover and we will do the failover election packets in order based
on the rank, this can effectively avoid the voting conflicts.

If a single primary is down, the behavior is the same as before. If multiple
primaries are down, their replica election initiation time will be delayed
by 500ms according to the ranking.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-01-11 10:43:18 +08:00
Binbin
d6bdd9e7d7
Fix module LatencyAddSample still work when latency-monitor-threshold is 0 (#1541)
When latency-monitor-threshold is set to 0, it means the latency monitor
is disabled, and in VM_LatencyAddSample, we wrote the condition
incorrectly, causing us to record latency when latency was turned off.

This bug was introduced in the very first day, see e3b1d6d, it was merged
in 2019.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-01-11 10:32:58 +08:00
Binbin
b207b421bc
Fix new cli subscribed mode test in cluster mode (#1533)
We need to add a hash tag in cluster mode.
Fixes #1531.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-01-09 12:21:31 +08:00
Nikhil Manglore
9e0204941d
valkey-cli auto-exit from subscribed mode (#1432)
Resolves issue with valkey-cli not auto exiting from subscribed mode on
reaching zero pub/sub subscription (previously filed on Redis)
https://github.com/redis/redis/issues/12592

---------

Signed-off-by: Nikhil Manglore <nmanglor@amazon.com>
2025-01-08 21:03:06 +01:00
Rueian
0a89571dcc
Skip logreqres on tests for the HELLO command (#1528)
Skip logreqres on tests for the HELLO command

Signed-off-by: Rueian <rueiancsie@gmail.com>
2025-01-08 10:05:20 -08:00
Rueian
3b52186b6a
Add availability_zone to the HELLO response (#1487)
It's inconvenient for client implementations to extract the
`availability_zone` information from the `INFO` response. The `INFO`
response contains a lot of information that a client implementation
typically doesn't need.

This PR adds the availability zone to the `HELLO` response. Clients
usually already use the `HELLO` command for protocol negotiation and
also get the server `version` and `role` from its response. To keep the
`HELLO` response small, the field is only added if availability zone is
configured.

---------

Signed-off-by: Rueian <rueiancsie@gmail.com>
2025-01-07 22:54:55 +01:00
Madelyn Olson
e1db553834
Add tests for acl selectors with no permissions or patterns (#1515)
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2025-01-06 15:46:55 -08:00
Madelyn Olson
7977c55ac9
Fix Read/Write key pattern selector (CVE-2024-51741) (#1514)
The explanation on the original commit was wrong. Key based access must
have a `~` in order to correctly configure whey key prefixes to apply
the selector to. If this is missing, a server assert will be triggered
later.

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: YaacovHazan <yaacov.hazan@redis.com>
2025-01-06 14:02:16 -08:00
gmbnomis
26a72fa89c
Use the correct command proc for the LOOKUP_NOTOUCH exception in lookupKey (#1499)
When looking up a key in no-touch mode, `LOOKUP_NOTOUCH` is set to avoid
updating the last access time in `lookupKey`. An exception must be made
for the `TOUCH` command which must always update the key.

When called from a script, `server.executing_client` will point to the
`TOUCH` command, while `server.current_client` will point to e.g. an
`EVAL` command. So, we must use the former to find out the currently
executing command if defined.

This fix addresses the issue where TOUCH wasn't updating key access
times when called from scripts like EVAL.

Fixes #1498

Signed-off-by: Simon Baatz <gmbnomis@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
2025-01-03 09:41:15 +08:00
Amit Nagler
8aff235721
Fix unreliable dual channel Valgrind tests (#1500)
Used same approach as PR #1165 to solve random failures.

Resolves #1491

Signed-off-by: naglera <anagler123@gmail.com>
2025-01-02 10:00:29 +08:00
ranshid
0f273bb648
Align rejected unblocked commands to update the correct error statistic (#577)
Currently, in case a blocked command is unblocked externally (eg. due to
the relevant slot being migrated or the CLIENT UNBLOCK command was
issued, the command statistics will always update the failed_calls error
statistic. This leads to missalignment with
90b9f08e5d
as well as some inconsistencies. For example when a key is migrated
during cluster slot migration, clients blocked on XREADGROUP will be
unblocked and update the rejected_calls stat, while clients blocked on
BLPOP will get unblocked updating the failed_calls stat.

In this PR we add explicit indication in updateStatsOnUnblock thet
indicates if the command was rejected or failed.

---------

Signed-off-by: ranshid <ranshid@amazon.com>
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
2025-01-01 16:33:09 +02:00
Pierre
e4179f1f3b
Only (re-)send MEET packet once every handshake timeout period (#1441)
Add `meet_sent` field in `clusterNode` indicating the last time we sent
a MEET packet. Use this field to only (re-)send a MEET packet once every
handshake timeout period when detecting a node without an inbound link.

When receiving multiple MEET packets on the same link while the node is
in handshake state, instead of dropping the packet, we now simply
prevent the creation of a new node. This way we still process the MEET
packet's gossip and reply with a PONG as any other packets.

Improve some logging messages to include `human_nodename`. Add
`nodeExceedsHandshakeTimeout()` function.

This is a follow-up to this previous PR:
https://github.com/valkey-io/valkey/pull/1307
And a partial fix to the crash described in:
https://github.com/valkey-io/valkey/pull/1436

---------

Signed-off-by: Pierre Turin <pieturin@amazon.com>
2024-12-30 15:56:39 -05:00
uriyage
bb325bde35
Fix restore replica output bytes stat update (#1486)
This PR fixes the missing stat update for `total_net_repl_output_bytes`
that was removed during the refactoring in PR #758. The metric was not
being updated when writing to replica connections.

Changes:
- Restored the stat update in postWriteToClient for replica connections
- Added integration test to verify the metric is properly updated

Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
2024-12-25 10:58:49 +08:00
Amit Nagler
9f4503ca50
Add scoped RDB loading context and immediate abort flag (#1173)
This PR introduces a new mechanism for temporarily changing the
server's loading_rio context during RDB loading operations. The new
`RDB_SCOPED_LOADING_RIO` macro allows for a scoped change of the
`server.loading_rio` value, ensuring that it's automatically restored
to its original value when the scope ends.

Introduces a dedicated flag to `rio` to signal immediate abort,
preventing
potential use-after-free scenarios during replication disconnection in 
dual-channel load. This ensures proper termination of
`rdbLoadRioWithLoadingCtx`
when replication is cancelled due to connection loss on main connection.

Fixes https://github.com/valkey-io/valkey/issues/1152

---------

Signed-off-by: naglera <anagler123@gmail.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: Amit Nagler <58042354+naglera@users.noreply.github.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>
2024-12-24 08:14:32 +02:00
Amit Nagler
f1b7f3072c
Reduce dual channel testing time (#1477)
- By not waiting `repl-diskless-sync-delay` when we don't have to, we
can reduce ~30% of dual channel tests execution time.
- This commit also drops one test which is not required for regular sync
(`Sync should continue if not all slaves dropped`).
- Skip dual channel test with master diskless disabled because it will
initiate the same synchronization process as the non-dual channel test,
making it redundant.


Before:
```
Execution time of different units:
  171 seconds - integration/dual-channel-replication
  305 seconds - integration/replication-psync

\o/ All tests passed without errors!
```
After:
```
Execution time of different units:
  120 seconds - integration/dual-channel-replication
  236 seconds - integration/replication-psync

\o/ All tests passed without errors!
```

Discused on https://github.com/valkey-io/valkey/pull/1173

---------

Signed-off-by: naglera <anagler123@gmail.com>
2024-12-24 08:13:25 +02:00
Binbin
d00c856448
Fix switch case compilation error in the new helloscripting (#1472)
It is missing the curly braces for variable declaration after case.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-12-22 22:57:56 +01:00
Ricardo Dias
6adef8e2f9
Adds support for scripting engines as Valkey modules (#1277)
This PR extends the module API to support the addition of different
scripting engines to execute user defined functions.

The scripting engine can be implemented as a Valkey module, and can be
dynamically loaded with the `loadmodule` config directive, or with the
`MODULE LOAD` command.

This PR also adds an example of a dummy scripting engine module, to show
how to use the new module API. The dummy module is implemented in
`tests/modules/helloscripting.c`.

The current module API support, only allows to load scripting engines to
run functions using `FCALL` command.

The additions to the module API are the following:

```c
/* This struct represents a scripting engine function that results from the
 * compilation of a script by the engine implementation. */
struct ValkeyModuleScriptingEngineCompiledFunction

typedef ValkeyModuleScriptingEngineCompiledFunction **(*ValkeyModuleScriptingEngineCreateFunctionsLibraryFunc)(
    ValkeyModuleScriptingEngineCtx *engine_ctx,
    const char *code,
    size_t timeout,
    size_t *out_num_compiled_functions,
    char **err);

typedef void (*ValkeyModuleScriptingEngineCallFunctionFunc)(
    ValkeyModuleCtx *module_ctx,
    ValkeyModuleScriptingEngineCtx *engine_ctx,
    ValkeyModuleScriptingEngineFunctionCtx *func_ctx,
    void *compiled_function,
    ValkeyModuleString **keys,
    size_t nkeys,
    ValkeyModuleString **args,
    size_t nargs);

typedef size_t (*ValkeyModuleScriptingEngineGetUsedMemoryFunc)(
    ValkeyModuleScriptingEngineCtx *engine_ctx);

typedef size_t (*ValkeyModuleScriptingEngineGetFunctionMemoryOverheadFunc)(
    void *compiled_function);

typedef size_t (*ValkeyModuleScriptingEngineGetEngineMemoryOverheadFunc)(
    ValkeyModuleScriptingEngineCtx *engine_ctx);

typedef void (*ValkeyModuleScriptingEngineFreeFunctionFunc)(
    ValkeyModuleScriptingEngineCtx *engine_ctx,
    void *compiled_function);

/* This struct stores the callback functions implemented by the scripting
 * engine to provide the functionality for the `FUNCTION *` commands. */
typedef struct ValkeyModuleScriptingEngineMethodsV1 {
    uint64_t version; /* Version of this structure for ABI compat. */

    /* Library create function callback. When a new script is loaded, this
     * callback will be called with the script code, and returns a list of
     * ValkeyModuleScriptingEngineCompiledFunc objects. */
    ValkeyModuleScriptingEngineCreateFunctionsLibraryFunc create_functions_library;

    /* The callback function called when `FCALL` command is called on a function
     * registered in this engine. */
    ValkeyModuleScriptingEngineCallFunctionFunc call_function;

    /* Function callback to get current used memory by the engine. */
    ValkeyModuleScriptingEngineGetUsedMemoryFunc get_used_memory;

    /* Function callback to return memory overhead for a given function. */
    ValkeyModuleScriptingEngineGetFunctionMemoryOverheadFunc get_function_memory_overhead;

    /* Function callback to return memory overhead of the engine. */
    ValkeyModuleScriptingEngineGetEngineMemoryOverheadFunc get_engine_memory_overhead;

    /* Function callback to free the memory of a registered engine function. */
    ValkeyModuleScriptingEngineFreeFunctionFunc free_function;
} ValkeyModuleScriptingEngineMethodsV1;

/* Registers a new scripting engine in the server.
 *
 * - `engine_name`: the name of the scripting engine. This name will match
 *   against the engine name specified in the script header using a shebang.
 *
 * - `engine_ctx`: engine specific context pointer.
 *
 * - `engine_methods`: the struct with the scripting engine callback functions
 * pointers.
 */
int ValkeyModule_RegisterScriptingEngine(ValkeyModuleCtx *ctx,
                                         const char *engine_name,
                                         void *engine_ctx,
                                         ValkeyModuleScriptingEngineMethods engine_methods);

/* Removes the scripting engine from the server.
 *
 * `engine_name` is the name of the scripting engine.
 *
 */
int ValkeyModule_UnregisterScriptingEngine(ValkeyModuleCtx *ctx, const char *engine_name);
```

---------

Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
2024-12-21 23:09:35 +01:00
Madelyn Olson
b56f4f70d2
Update info.tcl test to revert client output limits sooner (#1462)
We set the client output buffer limits to 10 bytes, and then execute
`info stats` which produces more than 10 bytes of output, which can
cause that command to throw an error.

I'm not sure why it wasn't consistently erroring before, might have been
some change related to the ubuntu upgrade though.

Issues related to ubuntu-tls are hopefully resolved now.

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2024-12-20 10:16:46 +08:00
Madelyn Olson
ffef236dbb
Fix storing the wrong PID in active servers (#1464)
In #1459, I missed that the data was also used to keep track of the PID
files so if the testing framework crashed it would no longer be able to
cleanup the extra servers. So now we properly extract the PID and store
it so we can clean up PIDs.

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2024-12-20 10:14:56 +08:00
Madelyn Olson
079f4edf2d
Add a hint about the current file for TCL debugging (#1459)
There are some tests that fail and give no useful information since they are
outside of a test context. Now we will at least get the file we are located in.

We can sort of reverse engineer where we are in the test by seeing which
tests have finished in a file.

```
[TIMEOUT]: clients state report follows.
sock6 => (SPAWNED SERVER) pid:30375 - tests/unit/info.tcl
Killing still running Valkey server 30375 - tests/unit/info.tcl
```

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2024-12-19 14:18:02 +08:00
Madelyn Olson
60197b30e2
Attempt to read secondary error from info test (#1452)
The test attempts to write 1MB of data in order to trigger a disconnect.
Normally, the data is fully flushed and we get the error on the read
(I/O error). However, it's possible we might fail the write, which
leaves the client in an inconsistent state. On the next command, we
finally process the I/O error on the FD. So, the simple fix is to
consume any secondary errors.

---------

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2024-12-18 09:17:11 -08:00
Viktor Szépe
b66698b887
Discover and fix new typos (#1446)
Upgrade `typos` and fix corresponding typos

---------

Signed-off-by: Viktor Szépe <viktor@szepe.net>
2024-12-17 17:45:43 -08:00
ranshid
ba25b586d5
Introduce FORCE_DEFRAG compilation option to allow activedefrag run when allocator is not jemalloc (#1303)
Introduce compile time option to force activedefrag to run even when
jemalloc is not used as the allocator.
This is in order to be able to run tests with defrag enabled
while using memory instrumentation tools.

fixes: https://github.com/valkey-io/valkey/issues/1241

---------

Signed-off-by: ranshid <ranshid@amazon.com>
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-12-17 19:07:55 +02:00
Binbin
ad24220681
Automatic failover vote is not limited by two times the node timeout (#1356)
This is a follow of #1305, we now decided to apply the same change
to automatic failover as well, that is, move forward with removing
it for both automatic and manual failovers.

Quote from Ping during the review:
Note that we already debounce transient primary failures with node
timeout, ensuring failover is only triggered after sustained outages.
Election timing is naturally staggered by replica spacing, making the
likelihood of simultaneous elections from replicas of the same shard
very low. The one-vote-per-epoch rule further throttles retries and
ensures orderly elections. On top of that, quorum-based primary failure
confirmation, cluster-state convergence, and slot ownership validation
are all built into the process.

Quote from Madelyn during the review:
It against the specific primary. It's to prevent double failovers.
If a primary just took over we don't want someone else to try to
take over and give the new primary some amount of time to take over.
I have not seen this issue though, it might have been over optimizing?
The double failure mode, where a node fails and then another node fails
within the nodetimeout also doesn't seem that common either though.

So the conclusion is that we all agreed to remove it completely,
it will make the code a lot simpler. And if there is other specific
edge cases we are missing, we will fix it in other way.

See discussion #1305 for more information.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-12-15 12:09:53 +08:00
Rain Valentine
88942c8e61
Replace dict with new hashtable for sets datatype (#1176)
The new `hashtable` provides faster lookups and uses less memory than
`dict`.

A TCL test case "SRANDMEMBER with a dict containing long chain" is
deleted because it's covered by a hashtable unit test
"test_random_entry_with_long_chain", which is already present.

This change also moves some logic from dismissMemory (object.c) to
zmadvise_dontneed (zmalloc.c), so the hashtable implementation which
needs the dismiss functionality doesn't need to depend on object.c and
server.h.

This PR follows #1186.

---------

Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-12-14 20:53:48 +01:00