317 Commits

Viktor Söderqvist
12ec3d5932
Increase timeout for cross-version-replication test (#1644)
Fixes #1641

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2025-01-29 13:29:35 -08:00
Viktor Söderqvist
99ed308817
Add cross-version test framework (and a simple test) (#1371)
This includes a way to run two versions of the server from the TCL test
framework, as preparation for adding more cross-version tests. The
runtest script accepts a new parameter

    ./runtest --other-server-path path/to/valkey-server

and a new tag "needs:other-server" for test cases and start_server.
Tests with this tag are automatically skipped if `--other-server-path`
is not provided.
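
For example, a cross-version test case might look roughly like this (a minimal
sketch; the tag name is from this commit, the test body is a hypothetical
placeholder):

```tcl
start_server {tags {"needs:other-server"}} {
    test "Replica of the current version can sync from the other version" {
        # Skipped automatically unless ./runtest --other-server-path ... is given.
        # ... start the other-version server and assert replication works ...
    }
}
```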

This PR also adds a CI job that runs the cross-version tests against
Valkey 7.2.7 by downloading a binary release.

Fixes #76

---------

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2025-01-23 11:26:54 +01:00
Madelyn Olson
079f4edf2d
Add a hint about the current file for TCL debugging (#1459)
Some tests fail and give no useful information since they are outside of a
test context. Now we will at least get the file we are located in.

We can roughly reverse-engineer where we are in that file by seeing which
tests have already finished.

```
[TIMEOUT]: clients state report follows.
sock6 => (SPAWNED SERVER) pid:30375 - tests/unit/info.tcl
Killing still running Valkey server 30375 - tests/unit/info.tcl
```

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2024-12-19 14:18:02 +08:00
Viktor Szépe
b66698b887
Discover and fix new typos (#1446)
Upgrade `typos` and fix corresponding typos

---------

Signed-off-by: Viktor Szépe <viktor@szepe.net>
2024-12-17 17:45:43 -08:00
ranshid
ba25b586d5
Introduce FORCE_DEFRAG compilation option to allow activedefrag run when allocator is not jemalloc (#1303)
Introduce a compile-time option to force activedefrag to run even when
jemalloc is not used as the allocator. This makes it possible to run tests
with defrag enabled while using memory instrumentation tools.

fixes: https://github.com/valkey-io/valkey/issues/1241

---------

Signed-off-by: ranshid <ranshid@amazon.com>
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-12-17 19:07:55 +02:00
Pierre
5f7fe9ef21
Send MEET packet to node if there is no inbound link to fix inconsistency when handshake timed out (#1307)
In some cases, when meeting a new node, if the handshake times out, we
can end up with an inconsistent view of the cluster where the new node
knows about all the nodes in the cluster, but the cluster does not know
about this new node (or vice versa).
To detect this inconsistency, we now check if a node has an outbound
link but no inbound link; this probably means that node does not know
us, so we (re-)send a MEET packet to it to do a new handshake with it.
If we receive a MEET packet from a known node, we disconnect the
outbound link to force a reconnect and the sending of a PING packet so
that the other node recognizes the link as belonging to us. This prevents
cases where a node could send MEET packets in a loop because it thinks
the other node does not have an inbound link.

This fixes the bug described in #1251.

---------

Signed-off-by: Pierre Turin <pieturin@amazon.com>
2024-12-11 17:26:06 -08:00
Viktor Söderqvist
3eb8314be6 Replace dict with hashtable for keys, expires and pubsub channels
Instead of a dictEntry with pointers to key and value, the hashtable
has a pointer directly to the value (robj), which can hold an embedded
key and acts as a key-value pair in the hashtable. This minimizes the
number of pointers to follow and thus the number of memory accesses to
look up a key-value pair.

        Keys         robj
      hashtable
      +-------+   +-----------------------+
      | 0     |   | type, encoding, LRU   |
      | 1 ------->| refcount, expire      |
      | 2     |   | ptr                   |
      | ...   |   | optional embedded key |
      +-------+   | optional embedded val |
                  +-----------------------+

The expire timestamp (TTL) is also stored in the robj, if any. The expire
hash table points to the same robj.

Overview of changes:

* Replace dict with hashtable in kvstore (kvstore.c)
* Add functions for embedding key and expire in robj (object.c)
  * When there's unused space, reserve an expire field to avoid reallocating
    it later if an expire is added.
  * Always reserve space for the expire field for large key names to avoid a
    realloc if it's set later.
* Update db functions (db.c)
  * dbAdd, setKey and setExpire reallocate the object when embedding a key
  * setKey does not increment the reference counter, since it would require
    duplicating the object. This responsibility is moved to the caller.
* Remove logic for shared integer objects as values in the database. The keys
  are now embedded in the objects, so all objects in the database need to be
  unique. Thus, we can't use shared objects as values. Also delete test cases
  for shared integers.
* Adjust various commands to the changes mentioned above.
* Adjust defrag code
  * Improvement: Don't access the expires table before defrag has actually
    reallocated the object.
* Adjust test cases that were using hard-coded sizes for dict when realloc
  would happen, and some other adjustments in test cases.
* Adjust memory prefetch for new hash table implementation in IO-threading,
  using new `hashtableIncrementalFind` API
* Adjust offloading of free() to I/O threads: the object itself is freed in the
  main thread, while obj->ptr is still offloaded to the I/O thread, since the DB
  object is now allocated by the main thread and not by the I/O thread as it
  used to be.
* Let expireIfNeeded take an optional value, to avoid looking up the expires
  table when possible.

---------

Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: uriyage <78144248+uriyage@users.noreply.github.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Uri Yagelnik <uriy@amazon.com>
2024-12-10 21:30:56 +01:00
Binbin
ee386c92ff
Manual failover vote is not limited by two times the node timeout (#1305)
This limit should not restrict manual failover, otherwise in some
scenarios, manual failover will time out.

For example, if some FAILOVER_AUTH_REQUESTs or some FAILOVER_AUTH_ACKs
are lost during a manual failover, it cannot vote in the second manual
failover. Or in a mixed scenario of plain failover and manual failover,
it cannot vote for the subsequent manual failover.

The problem with retrying a manual failover is that it pauses clients for
5 seconds on the primary side, so every manual failover that times out and
has to be retried is costly.

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-11-19 11:17:20 -05:00
Binbin
22bc49c4a6
Try to stabilize the failover call in the slot migration test (#1078)
CI reports that the replica returns this error when performing CLUSTER
FAILOVER:
```
-ERR Master is down or failed, please use CLUSTER FAILOVER FORCE
```

This may be because the primary state is fail or the cluster connection
is disconnected during the primary pause. In this PR, we added some
waits in wait_for_role: if the role is replica, we wait for both the
replication link and the cluster link to be ok.
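
The added wait is conceptually like the sketch below (a minimal sketch; the
exact checks inside wait_for_role are assumptions based on this description):

```tcl
wait_for_condition 100 50 {
    [s -1 master_link_status] eq "up" &&
    [string match "*cluster_state:ok*" [r -1 cluster info]]
} else {
    fail "Replica is not ready to fail over after the primary pause"
}
```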

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-11-07 13:42:20 +08:00
Viktor Söderqvist
00c97979d9
Make ./runtest --dump-logs dump logs on crash (#1117)
Until now, this flag only dumped logs on a failed assert in a test case.
It is useful for this flag to dump logs on a crash as well.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-10-06 10:40:36 -07:00
Binbin
f7c5b40183
Avoid false positive in election tests (#984)
The node may not be able to initiate an election in time due to
problems with cluster communication. If an election is initiated,
make sure its offset is 0.

Closes #967.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-09-13 14:53:39 +08:00
uriyage
39f8bcb91b
Skip tracking clients OOM test when I/O threads are enabled (#764)
Fix feedback loop in key eviction with tracking clients when using I/O
threads.

Current issue:
Evicting keys while tracking clients or keyspace notifications exist
creates a feedback loop when using I/O threads:

While evicting keys we send tracking async writes to I/O threads,
preventing immediate release of tracking clients' COB memory
consumption.

Before the I/O thread finishes its write, we recheck used_memory, which
now includes the tracking clients' COB, and thus we continue to evict
more keys.

**Fix:**
We will skip the test for now while IO threads are active. We may
consider avoiding sending writes in `processPendingWrites` to I/O
threads for tracking clients when we are out of memory.
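
Roughly, the guard amounts to something like this (a minimal sketch; the
io-threads config is real, but the exact skip mechanism used in the test is an
assumption):

```tcl
if {[lindex [r config get io-threads] 1] > 1} {
    # Evicting keys with tracking clients creates a COB feedback loop when
    # writes are offloaded to I/O threads, so skip this test for now.
    return
}
```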

---------

Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-08-21 17:02:57 -07:00
Binbin
76ad8f7a76
Skip IPv6 tests when TCLSH version is < 8.6 (#910)
In #786, we skipped it in the daily CI, but not for the others.
When running ./runtest on macOS, we get this failure:
```
couldn't open socket: host is unreachable (nodename nor servname provided, or not known)
```

The reason is that TCL 8.5 doesn't support IPv6, so we skip tests
tagged with ipv6. This also reverts #786.
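
A minimal sketch of the gate (the ::denytags variable is an assumption about
where the framework keeps skipped tags):

```tcl
if {$::tcl_version < 8.6} {
    # TCL 8.5 has no IPv6 socket support, so skip everything tagged ipv6.
    lappend ::denytags "ipv6"
}
```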

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-08-15 15:11:38 +08:00
Pieter Cailliau
4d284daefd
Copyright update to reflect IP transfer from Salvatore to Redis (#740)
Update references of copyright being assigned to Salvatore when it was
transferred to Redis Ltd. as per
https://github.com/valkey-io/valkey/issues/544.

---------

Signed-off-by: Pieter Cailliau <pieter@redis.com>
2024-08-14 09:20:36 -07:00
Binbin
59aa00823c
Replicas with the same offset queue up for election (#762)
In some cases, such as a read-heavy (more reads than writes) scenario, the
replication offsets of the replicas are the same. When the primary fails,
the replicas have the same ranking (rank == 0). They issue the election at
the same time (although we add a random delay of up to 500 ms), and the
simultaneous elections may fail to reach a quorum.

In clusterGetReplicaRank, when we calculate the rank, if the offsets are
the same, the replica with the smaller node name now gets a better rank to
avoid this situation.

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-07-22 23:43:16 -07:00
Binbin
15a8290231
Optimize failover time when the new primary node is down again (#782)
We do not reset failover_auth_time after setting it. It is used to check
auth_timeout and auth_retry_time, but we should at least reset it after a
successful failover.

Let's assume the following scenario:
1. Two replicas initiate an election.
2. Replica 1 is elected as the primary node, and replica 2 does not have
   enough votes.
3. Replica 1 goes down, i.e. the new primary node goes down again within a
   short time.
4. Replica 2 knows that the new primary node is down and wants to initiate
   a failover, but because the failover_auth_time of the previous round has
   not been reset, it needs to wait for it to time out and then wait for the
   next retry time, which can take up to cluster-node-timeout * 4; this adds
   a lot of delay.

There is another problem. We add extra time to failover_auth_time, such as
a random 500 ms and 1 s per replica rank. If replica 2 receives a PONG from
the new primary node before sending the FAILOVER_AUTH_REQUEST, that is,
before the failover_auth_time, it will turn itself into a replica. If the
new primary node goes down again at this time, replica 2 will use the
previous failover_auth_time to initiate an election instead of going
through the random 500 ms and replica-ranking 1 s logic again, which may
lead to unexpected consequences (for example, a low-ranking replica
initiates an election and becomes the new primary node).

That is, we need to reset failover_auth_time at the appropriate time.
When the replica switches to a new primary, we reset it, because the
existing failover_auth_time is already out of date in this case.

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-07-19 15:27:49 -04:00
naglera
ff6b780fe6
Dual channel replication (#60)
In this PR we introduce the main benefit of dual channel replication:
continuously streaming the COB (client output buffers) in parallel to the
RDB, thus keeping the primary-side COB small AND accelerating the overall
sync process. By streaming the replication data to the replica during the
full sync, we reduce
1. Memory load on the primary node.
2. CPU load on the primary's main process. [Latest performance
tests](#data)

## Motivation
* Reduce primary memory load. We do that by moving the COB tracking to
the replica side. This also decreases the chance of COB overruns. Note
that the primary's input buffer limits on the replica side are less
restrictive than the primary's COB, as the replica plays a less critical
part in the replication group. While increasing the primary's COB may end
up with the primary reaching swap and clients suffering, on the replica
side we're more at ease with it. A larger COB means a better chance to
sync successfully.
* Reduce primary main process CPU load. By opening a new, dedicated
connection for the RDB transfer, child processes can have direct access
to the new connection. Due to TLS connection restrictions, this was not
possible using one main connection. We eliminate the need for the child
process to use the primary's child-proc -> main-proc pipeline, thus
freeing up the main process to handle client queries.


## Dual Channel Replication high level interface design
- Dual channel replication begins when the replica sends a `REPLCONF CAPA
  DUALCHANNEL` to the primary during the initial handshake. This states that
  the replica is capable of dual channel sync and that this is the replica's
  main channel, which is not used for snapshot transfer.
- When the replica lacks sufficient data for PSYNC, the primary sends a
  `-FULLSYNCNEEDED` response instead of RDB data. As a next step, the replica
  creates a new connection (rdb-channel) and configures it against the primary
  with the appropriate capabilities and requirements. The replica then
  requests a sync using the RDB channel.
- Prior to forking, the primary sends the replica the snapshot's end
  repl-offset and attaches the replica to the replication backlog to keep repl
  data until the replica requests psync. The replica uses the main channel to
  request a PSYNC starting at the snapshot end offset.
- The primary's main thread sends incremental changes via the main channel,
  while the bgsave process sends the RDB directly to the replica via the
  rdb-channel. On the replica, the incremental changes are stored in a local
  buffer, while the RDB is loaded into memory.
- Once the replica completes loading the RDB, it drops the rdb-connection and
  streams the accumulated incremental changes into memory. Replication steady
  state then continues normally.
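
For reference, the feature is expected to be opt-in via a config in
valkey.conf; the config name below is an assumption based on the released
feature and is not stated in this commit message:

```
dual-channel-replication-enabled yes
```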

## New replica state machine


![image](https://github.com/user-attachments/assets/38fbfff0-60b9-4066-8b13-becdb87babc3)

## Data <a name="data"></a>

![image](https://github.com/user-attachments/assets/d73631a7-0a58-4958-a494-a7f4add9108f)


![image](https://github.com/user-attachments/assets/f44936ed-c59a-4223-905d-0fe48a6d31a6)


![image](https://github.com/user-attachments/assets/bd333ee2-3c47-47e5-b244-4ea75f77c836)

## Explanation
These graphs demonstrate performance improvements during full sync
sessions using the rdb-channel and streaming the RDB directly from the
background process to the replica.

First graph: with at most 50 clients and lightweight commands, we saw a
5%-7.5% improvement in write latency during the sync session.
Two graphs below: full sync was tested during heavy read commands on
the primary (such as SDIFF, SUNION on large sets). In that case, the
child process writes to the replica without sharing CPU with the loaded
main process. As a result, this not only improves client response time,
but may also shorten sync time by about 50%. The shorter sync time
results in less memory being used to store replication diffs (>60% in
some of the tested cases).

## Test setup
Both primary and replica in the performance tests ran on the same
machine. The RDB size in all tests is 3.7 GB. I generated write load using
valkey-benchmark: `./valkey-benchmark -r 100000 -n 6000000 lpush my_list
__rand_int__`.

---------

Signed-off-by: naglera <anagler123@gmail.com>
Signed-off-by: naglera <58042354+naglera@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-07-17 13:59:33 -07:00
Viktor Söderqvist
a323dce890
Dual stack and client-specific IPs in cluster (#736)
New configs:

* `cluster-announce-client-ipv4`
* `cluster-announce-client-ipv6`

New module API function:

* `ValkeyModule_GetClusterNodeInfoForClient`, takes a client id and is
otherwise just like its non-ForClient cousin.

If configured, one of these IP addresses is reported to each client in
CLUSTER SLOTS, CLUSTER SHARDS, CLUSTER NODES and redirects, replacing
the IP (`cluster-announce-ip` or the auto-detected IP) of each node.
Which one is reported to the client depends on whether the client is
connected over IPv4 or IPv6.
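
For example, a node behind NAT might announce client-facing addresses in
valkey.conf like this (the config names are from this commit; the addresses
are example values only):

```
cluster-announce-client-ipv4 203.0.113.10
cluster-announce-client-ipv6 2001:db8::10
```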

Benefits:

* This allows clients using IPv4 to get the IPv4 addresses of all
cluster nodes and IPv6 clients to get the IPv6 addresses.
* This allows the IPs visible to clients to be different from the IPs used
between the cluster nodes due to NAT'ing.

The information is propagated in the cluster bus using new Ping
extensions. (Old nodes without this feature ignore unknown Ping
extensions.)

This adds another dimension to CLUSTER SLOTS reply. It now depends on
the client's use of TLS, the IP address family and RESP version.
Refactoring: The cached connection type definition is moved from
connection.h (it actually has nothing to do with the connection
abstraction) to server.h and is changed to a bitmap, with one bit for
each of TLS, IPv6 and RESP3.

Fixes #337

---------

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-07-10 13:53:52 +02:00
Kyle Kim (kimkyle@)
b49eaad367
Introduce a minimal debugger for .tcl integration test suite. (#683)
Introduce a break-point function called `bp`, based on the tcl wiki's
minimal debugger.

```tcl
 proc bp {{s {}}} {
    if ![info exists ::bp_skip] {
        set ::bp_skip [list]
    } elseif {[lsearch -exact $::bp_skip $s]>=0} return
    if [catch {info level -1} who] {set who ::}
    while 1 {
        puts -nonewline "$who/$s> "; flush stdout
        gets stdin line
        if {$line=="c"} {puts "continuing.."; break}
        if {$line=="i"} {set line "info locals"}
        catch {uplevel 1 $line} res
        puts $res
    }
 }
```

```
... your test code before break-point
bp 1
... your test code after break-point
```

The `bp 1` will give back the tcl interpreter to the developer, and
allow you to interactively print local variables (through `puts`), run
functions and so forth.

Source: https://wiki.tcl-lang.org/page/A+minimal+debugger

---------

Signed-off-by: Kyle Kim <kimkyle@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-06-25 10:24:53 -07:00
Madelyn Olson
ce79539047
Fail tests immediately if the server is no longer running (#678)
Fix a minor inconvenience I have when writing tests. If I have a typo or
forget to generate the TLS certificates, the start_server handle will
just loop for 2 minutes before printing the error. Now it fails and
prints the error as soon as it is seen.

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2024-06-21 15:29:05 +08:00
Ping Xie
aad6769a80
Replicate slot migration states via RDB aux fields (#586) 2024-06-07 20:32:27 -07:00
Arthur Lee
3de5c71f48
[Feat] Support fast fail option for tcl test cases (#482)
This PR adds a new option for the TCL test suite which makes it fail fast as soon as any test case fails.
This can be useful when running the CI pipeline and you want to accelerate it.

usage for example

> ./runtest --single unit/type/hash --fast-fail

---------

Signed-off-by: arthur.lee <arthur-lee@qq.com>
2024-05-15 06:55:24 -07:00
Binbin
fdd023ff82
Migrate cluster mode tests to normal framework (#442)
We currently have two disjoint TCL frameworks:
1. The normal testing framework, triggered by runtest, which individually
launches nodes for testing.
2. The cluster framework, triggered by runtest-cluster, which pre-allocates
N nodes and uses them for testing large configurations.

The normal TCL testing framework is much more readily tested and is also
automatically run as part of the CI for new PRs. Since runtest-cluster runs
very slowly (it cannot be parallelized), it currently only runs in the daily
CI; this results in some cluster changes not being exposed in PR CI in time.

This PR migrates the cluster mode tests to the normal framework. Some cluster
tests are kept in runtest-cluster because of timing issues or features not
yet supported; we can handle them later.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-05-09 10:14:47 +08:00
Shivshankar
52f9291f79
Rename redis to valkey in test suite logs and test names. (#366)
This PR covers the cases below:
1. The test suite's prints (i.e., puts statement logs).
2. Links referring to redis issues.
3. Test names containing redis.

Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
2024-04-25 15:13:21 +08:00
Wen Hui
191be272b4
Rename redis.tcl to valkey.tcl (#283)
Includes some more changes, e.g. the README under tests and some scripts under utils.

Signed-off-by: hwware <wen.hui.ware@gmail.com>
2024-04-24 20:54:52 +02:00
Shivshankar
8baf322742
Rename remaining test procedures (#355)
Renamed the procedures and variables below (missed in #287) as follows:

redis_cluster             ->     valkey_cluster
redis1                    ->     valkey1
redis2                    ->     valkey2
get_redis_dir             ->     get_valkey_dir
test_redis_cli_rdb_dump   ->     test_valkey_cli_rdb_dump
test_redis_cli_repl       ->     test_valkey_cli_repl
redis-cli                 ->     valkey-cli
redis_reset_state         ->     valkey_reset_state
redisHandle               ->     valkeyHandle
redis_safe_read           ->     valkey_safe_read
redis_safe_gets           ->     valkey_safe_gets
redis_write               ->     valkey_write
redis_read_reply          ->     valkey_read_reply
redis_readable            ->     valkey_readable
redis_readnl              ->     valkey_readnl
redis_writenl             ->     valkey_writenl
redis_read_map            ->     valkey_read_map
redis_read_line           ->     valkey_read_line
redis_read_null           ->     valkey_read_null
redis_read_bool           ->     valkey_read_bool
redis_read_double         ->     valkey_read_double
redis_read_verbatim_str   ->     valkey_read_verbatim_str
redis_call_callback       ->     valkey_call_callback

---------

Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
2024-04-24 18:01:33 +02:00
Shivshankar
669f1d3014
redisbenchmark to valkeybenchmark in test directory and some test name rename. (#347)
This PR covers the following changes:
1. redisbenchmark to valkeybenchmark in the test directory.
2. Removed redis from some test titles and changed the names
accordingly.
3. Updated the test suite name and redis-server to valkey in the README in
the test directory.

---------

Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
2024-04-23 10:51:53 -07:00
Lipeng Zhu
87a5bfc002
Support connection schemes valkey:// and valkeys:// (#199)
1. Support the valkey:// and valkeys:// schemes in valkey-cli and
valkey-benchmark. The original Redis schemes are retained for compatibility.
2. Add unit tests for valid URIs with all schemes.
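
For example, with the pre-existing -u URI option (shown here as an assumption
of how the new schemes are used):

    ./valkey-cli -u valkey://127.0.0.1:6379 ping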

Fixes: https://github.com/valkey-io/valkey/issues/198
Fixes: https://github.com/valkey-io/valkey/issues/200

---------

Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
2024-04-23 03:02:41 +02:00
Shivshankar
34413e0862
Replace "redis" with "valkey" test code (#287)
Occurrences of "redis" in TCL test suites and helpers, such as TCL
client used in tests, are replaced with "valkey".

Occurrences of uppercase "Redis" are not changed in this PR.

No files are renamed in this PR.

---------

Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
2024-04-18 15:57:17 +02:00
Viktor Söderqvist
9e2b7838ea
Add 'extended-redis-compatibility' config (#306)
New config 'extended-redis-compatibility' (yes/no) default no

* When yes:
  * Use "Redis" in the following error replies:
    - `-LOADING Redis is loading the dataset in memory`
    - `-BUSY Redis is busy`...
    - `-MISCONF Redis is configured to`...
  * Use `=== REDIS BUG REPORT` in the crash log delimiters (START and END).
  * The HELLO command returns `"server" => "redis"` and `"version" => "7.2.4"`
    (our Redis OSS compatibility version).
  * The INFO field for mode is called `"redis_mode"`.
* When no:
  * Use "Valkey" instead of "Redis" in the mentioned errors and crash log
    delimiters.
  * The HELLO command returns `"server" => "valkey"` and the Valkey version
    for `"version"`.
  * The INFO field for mode is called `"server_mode"`.

* Documentation added in valkey.conf:

  > Valkey is largely compatible with Redis OSS, apart from a few cases where
  > Redis OSS compatibility mode makes Valkey pretend to be Redis. Enable this
  > only if you have problems with tools or clients. This is a temporary
  > configuration added in Valkey 8.0 and is scheduled to have no effect in
  > Valkey 9.0 and be completely removed in Valkey 10.0.

* A test case for the config is added. It is designed to fail if the
config is not deprecated (has no effect) in Valkey 9 and deleted in
Valkey 10.

* Other test cases are adjusted to work regardless of this config.

Fixes #274
Fixes #61

---------

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-04-18 14:10:24 +02:00
jonghoonpark
c090874ed4
List test files dynamically (#313)
Motivation: Currently we have to manually update the all_tests variable
when introducing new test files.

Fix: I've modified it to list test files dynamically. Rather than
modifying it to add all test files, I've first modified it to only add test
files from the following 4 paths, so that it doesn't deviate too much
from what we already do (see the sketch after this list):

- unit
- unit/type
- unit/cluster
- integration
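
A minimal sketch of the discovery logic (the ::all_tests variable is from the
motivation above; the exact path handling is an assumption):

```tcl
set ::all_tests {}
foreach dir {unit unit/type unit/cluster integration} {
    foreach file [lsort [glob -nocomplain tests/$dir/*.tcl]] {
        # Strip the leading "tests/" and the ".tcl" suffix to get the test name.
        lappend ::all_tests [string range $file 6 end-4]
    }
}
```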

Related issue: https://github.com/valkey-io/valkey/issues/302

---------

Signed-off-by: jonghoonpark <dev@jonghoonpark.com>
2024-04-15 14:25:33 +02:00
Madelyn Olson
c0cef48e98
Fix reference to redis-tls module (#273)
Update test usage of the redis-tls.so module to use valkey-tls.so instead.

Fixes tests failures like
https://github.com/valkey-io/valkey/actions/runs/8592855995/job/23543475478.

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2024-04-09 07:15:59 -07:00
Jacob Murphy
df5db0627f
Remove trademarked language in code comments (#223)
This includes comments used for module API documentation.

* Strategy for replacement: Regex search: `(//|/\*| \*|#).* ("|\()?(r|R)edis( |\.
  |'|\n|,|-|\)|")(?!nor the names of its contributors)(?!Ltd.)(?!Labs)(?!Contributors.)`
* Don't edit copyright comments
* Replace "Redis version X.X" -> "Redis OSS version X.X" to distinguish
from newly licensed repository
* Replace "Redis Object" -> "Object"
* Exclude markdown for now
* Don't edit Lua scripting comments referring to redis.X API
* Replace "Redis Protocol" -> "RESP"
* Replace redis-benchmark, -cli, -server, -check-aof/rdb with "valkey-"
prefix
* Most other places, I use best judgement to either remove "Redis", or
replace with "the server" or "server"

Fixes #148

---------

Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-04-09 10:24:03 +02:00
VoletiRam
d89ef06ce5
Wait for cluster fully online in cluster_config_consistent (#272)
Wait for the cluster to be in a fully consistent and online state in
`cluster_config_consistent`. We expect `start_server` to create the
desired primaries and replicas before the start of the tests. With the
current setup, the replicas may not have completed the sync with the
primaries and can be in the loading state. In some cases, the role of the
replicas can still be master due to delayed propagation of the replicate
command. The tests can show flaky behavior in such cases. Add a check that
verifies the nodes' health status is 'online' for cluster consistency.
Leverage the deterministic order of `CLUSTER SLOTS` to consider the cluster
consistent, along with the nodes' health status.
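
Conceptually, the health part of the check looks like this (a minimal sketch;
the R helper addressing node j and reading the CLUSTER SHARDS reply as a dict
are assumptions about the test helpers):

```tcl
proc cluster_nodes_all_online {num_nodes} {
    for {set j 0} {$j < $num_nodes} {incr j} {
        foreach shard [R $j cluster shards] {
            foreach node [dict get $shard nodes] {
                # Every node reported by every member must already be online.
                if {[dict get $node health] ne "online"} { return 0 }
            }
        }
    }
    return 1
}
```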

---------

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>
Co-authored-by: Harkrishn Patro <harkrisp@amazon.com>
Co-authored-by: Ram Prasad Voleti <ramvolet@amazon.com>
2024-04-08 20:03:56 -07:00
Wen Hui
7f5bcc96f0
Update some valkey-cli related code in tcl (#236)
Signed-off-by: hwware <wen.hui.ware@gmail.com>
2024-04-05 16:46:33 -04:00
Shivshankar
f3ccfbb01f
Rename TLS test cert files to valkey (#186)
This PR covers changing the redis.crt and redis.key to valkey certs for
TLS testing.

The files are generated by the gen-test-certs.sh script under tests/tls/.

Also addresses the review comments provided.

Signed-off-by: hwware <wen.hui.ware@gmail.com>
Co-authored-by: hwware <wen.hui.ware@gmail.com>
2024-04-03 23:04:52 +02:00
Harkrishn Patro
1736018aa9
Remove trademarked wording on configuration file and individual configs (#29)
Remove trademarked wording in the configuration layer.

The following changes are relevant for release notes:

1. Rename redis.conf to valkey.conf.
2. Pre-filled config in the template config file: change pidfile to `/var/run/valkey_6379.pid`.

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
2024-04-03 19:47:26 +02:00
John Vandenberg
253fe9dced
Fix typos and replace 'codespell' with 'typos' (#72)
Uses https://github.com/taiki-e/install-action to install
https://github.com/crate-ci/typos in CI

This finds many more/different typos than
https://github.com/codespell-project/codespell , while having very few
false positives.

Signed-off-by: John Vandenberg <jayvdb@gmail.com>
2024-03-31 12:38:22 -07:00
Madelyn Olson
57789d4d08
Update naming to Valkey (#62)
Documentation references should use `Valkey` while server and cli
references are all under `valkey`.

---------

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2024-03-28 09:58:28 -07:00
Madelyn Olson
3586485355 Moved to correct cli and benchmark 2024-03-21 19:16:03 -07:00
Madelyn Olson
ee107481e5 Refactor some tests to reference new executable 2024-03-21 19:11:08 -07:00
Binbin
3c2ea1ea95
Fix watched client test timing issue caused by late close (#13062)
There is a timing issue in the test: the close may arrive late, or in
freeClientAsync we free the client asynchronously, which can lead to
errors in the watching_clients statistics, since we only unwatch all
keys when we truly freeClient.

Add a wait here to avoid this problem. Also fixed some outdated
comments I saw. The test was introduced in #12966.
2024-02-20 11:12:19 +02:00
bentotten
b3aaa0a136
When one shard, sole primary node marks potentially failed replica as FAIL instead of PFAIL (#12824)
Fixes issue where a single primary cannot mark a replica as failed in a
single-shard cluster.
2024-01-11 15:48:19 -08:00
Madelyn Olson
8bb9a2895e
Address some failures with new tests for improving debug report (#12915)
Fix a daily test failure because alpine doesn't support stack traces and
add in an extra assertion related to making sure the stack trace was
printed twice.
2024-01-08 17:56:06 -08:00
Binbin
c85a9b7896
Fix delKeysInSlot server events are not executed inside an execution unit (#12745)
This is a follow-up fix to #12733. We need to apply the same changes to
delKeysInSlot. Refer to #12733 for more details.

This PR contains some other minor cleanups / improvements to the test
suite and docs.
It uses the postnotifications test module in a cluster mode test which
revealed a leak in the test module (fixed).
2023-12-11 20:15:19 +02:00
Vitaly
0270abda82
Replace cluster metadata with slot specific dictionaries (#11695)
This is an implementation of https://github.com/redis/redis/issues/10589 that eliminates 16 bytes per entry in cluster mode, which are currently used to create a linked list between entries in the same slot. The main idea is splitting the main dictionary into 16k smaller dictionaries (one per slot), so we can perform all slot-specific operations, such as iteration, without any additional info in the `dictEntry`. For Redis cluster, the expectation is that there will be a larger number of keys, so the fixed overhead of 16k dictionaries will be negligible. The expire dictionary is also split up so that each slot is logically decoupled, so that in subsequent revisions we will be able to atomically flush a slot of data.

## Important changes
* Incremental rehashing - one big change here is that it's not one, but rather up to 16k dictionaries that can be rehashing at the same time, in order to keep track of them, we introduce a separate queue for dictionaries that are rehashing. Also instead of rehashing a single dictionary, cron job will now try to rehash as many as it can in 1ms.
* getRandomKey - now needs to not only select a random key from a random bucket, but also needs to select a random dictionary. Fairness is a major concern here, as it's possible that keys can be unevenly distributed across the slots. To address this, we introduced a binary index tree. With that data structure we are able to efficiently find a random slot using binary search in O(log^2(slot count)) time.
* Iteration efficiency - when iterating dictionary with a lot of empty slots, we want to skip them efficiently. We can do this using same binary index that is used for random key selection, this index allows us to find a slot for a specific key index. For example if there are 10 keys in the slot 0, then we can quickly find a slot that contains 11th key using binary search on top of the binary index tree.
* scan API - in order to perform a scan across the entire DB, the cursor now needs to save not only the position within the dictionary but also the slot id. In this change we append the slot id into the LSB of the cursor so it can be passed around between the client and the server. This has an interesting side effect: now you'll be able to start scanning a specific slot by simply providing the slot id as a cursor value. The plan is to not document this as defined behavior, however. It's also worth noting the SCAN API is now technically incompatible with previous versions, although practically we don't believe it's an issue.
* Checksum calculation optimizations - During command execution, we know that all of the keys are from the same slot (outside of a few notable exceptions such as cross-slot scripts and modules). We don't want to compute the checksum multiple times, hence we rely on the cached slot id in the client during command execution. All operations that access random keys should either pass in the known slot or recompute the slot.
* Slot info in RDB - in order to resize individual dictionaries correctly, while loading RDB, it's not enough to know total number of keys (of course we could approximate number of keys per slot, but it won't be precise). To address this issue, we've added additional metadata into RDB that contains number of keys in each slot, which can be used as a hint during loading.
* DB size - besides the `DBSIZE` API, we need to know the size of the DB in many places. In order to avoid scanning all dictionaries and summing up their sizes in a loop, we've introduced a new field into `redisDb` that keeps track of `key_count`. This way we can keep the DBSIZE operation O(1). The same is kept for O(1) expires computation as well.

## Performance
This change improves SET performance in cluster mode by ~5%, most of the gains come from us not having to maintain linked lists for keys in slot, non-cluster mode has same performance. For workloads that rely on evictions, the performance is similar because of the extra overhead for finding keys to evict. 

RDB loading performance is slightly reduced, as the slot of each key needs to be computed during the load.

## Interface changes
* Removed `overhead.hashtable.slot-to-keys` from `MEMORY STATS`
* Scan API will now require 64 bits to store the cursor, even on 32 bit systems, as the slot information will be stored.
* New RDB version to support the new op code for SLOT information. 

---------

Co-authored-by: Vitaly Arbuzov <arvit@amazon.com>
Co-authored-by: Harkrishn Patro <harkrisp@amazon.com>
Co-authored-by: Roshan Khatri <rvkhatri@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
2023-10-14 23:58:26 -07:00
Oran Agra
f0c1c730d4
test suite: clean server pids after server crashed (#12639)
When a server in the test suite crashes and is restarted by restart_server, we didn't clean its pid from the list.
We can see that when the corrupt-dump-fuzzer hangs, it has a long list of servers to clean, but in fact they're all already dead.
2023-10-13 16:28:52 +03:00
YaacovHazan
2e0f6724e0
Stabilization and improvements around aof tests (#12626)
In some tests, the code manually searches for a log message, and it
uses tail -1 with a delay of 1 second, which can miss the expected line.

Also, because the aof tests use start_server_aof and not start_server,
the test name doesn't log into the server log.

To fix the above, I made the following changes:
- Change the start_server_aof to wrap the start_server.
  This will add the created aof server to the servers list, and make
  srv() and wait_for_log_messages() available for the tests.

- Introduce a new option for start_server.
  'wait_ready' - an option to let the caller start the test code without
  waiting for the server to be ready. Useful for tests on a server that
  is expected to exit on startup.

- Create a new start_server_aof_ex.
  The new proc also accepts options as an argument and makes use of the
  new 'short_life' option for tests that are expected to exit on startup
  because of some error in the aof file(s).

Because of the above, I had to change many lines and replace every
local srv variable (a server config) usage with srv().
2023-10-02 08:20:53 +03:00
DarrenJiang13
6abb3c4038
change log match to line match in tcl sanitizer_errors_from_file. (#12446)
In the tcl foreach loop, the function should compare each line rather than the whole file.
2023-07-30 08:48:29 +03:00
Chen Tianjie
22a29935ff
Support TLS service when "tls-cluster" is not enabled and persist both plain and TLS port in nodes.conf (#12233)
Originally, when "tls-cluster" is enabled, `port` is set to TLS port. In order to support non-TLS clients, `pport` is used to propagate TCP port across cluster nodes. However when "tls-cluster" is disabled, `port` is set to TCP port, and `pport` is not used, which means the cluster cannot provide TLS service unless "tls-cluster" is on.
```
typedef struct {
    // ...
    uint16_t port;  /* Latest known clients port (TLS or plain). */
    uint16_t pport; /* Latest known clients plaintext port. Only used if the main clients port is for TLS. */
    // ...
} clusterNode;
```
```
typedef struct {
    // ...
    uint16_t port;   /* TCP base port number. */
    uint16_t pport;  /* Sender TCP plaintext port, if base port is TLS */
    // ...
} clusterMsg;
```
This PR renames `port` and `pport` in `clusterNode` to `tcp_port` and `tls_port`, to record both ports regardless of whether "tls-cluster" is enabled or disabled.

This allows the server to provide TLS service to clients when "tls-cluster" is disabled: when displaying the cluster topology, or giving a `MOVED` error, the server can provide the TLS or TCP port according to the client's connection type, no matter what type of connection the cluster bus is using.

For backwards compatibility, `port` and `pport` in `clusterMsg` are preserved: when "tls-cluster" is enabled, `port` is set to the TLS port and `pport` is set to the TCP port; when "tls-cluster" is disabled, `port` is set to the TCP port and `pport` is set to the TLS port (instead of 0).

Also, in the nodes.conf file, a new aux field displaying an extra port is added to complete the persisted info. We may have `tls_port=xxxxx` or `tcp_port=xxxxx` in the aux field, to complete the cluster topology, while the other port is stored in the normal `<ip>:<port>` field. The format is shown below.
```
<node-id> <ip>:<tcp_port>@<cport>,<hostname>,shard-id=...,tls-port=6379 myself,master - 0 0 0 connected 0-1000
```
Or we can switch the position of two ports, both can be correctly resolved.
```
<node-id> <ip>:<tls_port>@<cport>,<hostname>,shard-id=...,tcp-port=6379 myself,master - 0 0 0 connected 0-1000
```
2023-06-26 07:43:38 -07:00