416 Commits

Author SHA1 Message Date
Viktor Söderqvist
12ec3d5932
Increase timeout for cross-version-replication test (#1644)
Fixes #1641

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2025-01-29 13:29:35 -08:00
Binbin
4b8f3ed9ac
Do command existence and arity checks when loading AOF to avoid crash (#1614)
Do command existence and arity checks when loading AOF to avoid crash

Currently, loading commands such as `cluster` or `cluster slots xxx`
from AOF will cause the server to crash.
1. `cluster` is a container command, and executing proc will cause a
    crash because we do not check subcommand and arity.
2. `cluster slots xxx`, arity check fail, reply with an error from the
    AOF client and trigger a panic.

Of course, there are many other ways for a problematic AOF to cause the
panic, but it is still necessary do some basic checks before executing.
In this way, in these basic cases, we can print useful error messages
instead of crashing directly.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-01-30 01:06:13 +08:00
Viktor Söderqvist
e9b8970e72
Relaxed RDB version check (#1604)
New config `rdb-version-check` with values:

* `strict`: Reject future RDB versions.
* `relaxed`: Try parsing future RDB versions and fail only when an
unknown RDB opcode or type is encountered.

This can make it possible for Valkey 8.1 to try read a dump from for
example Valkey 9.0 or later on a best-effort basis. The conditions for
when this is expected to work can be defined when the future Valkey
versions are released. Loading is expected to fail in the following
cases:

* If the data set contains any new key types or other data elements not
supported by the current version.
* If the RDB contains new representations or encodings of existing key
types or other data elements.

This change also prepares for the next RDB version bump. A range of RDB
versions (12-79) is reserved, since it's expected to be used by foreign
software RDB versions, so Valkey will not accept versions in this range
even with the `relaxed` version check. The DUMP/RESTORE format has no
magic string; only the RDB version number.

This change also prepares for the magic string to change from REDIS to
VALKEY next time we bump the RDB version.

Related to #1108.

---------

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2025-01-27 18:44:24 +01:00
Madelyn Olson
88a68303c0
Make sure to disable pause after fork for dual channel test (#1612)
Might close https://github.com/valkey-io/valkey/issues/1484.

I noticed that we don't disable pause after fork on the last test that
was getting executed, so it might getting stuck in pause loops after the
test ends if it tries another psync for any reason.

---------

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2025-01-27 06:44:48 -08:00
Viktor Söderqvist
99ed308817
Add cross-version test framework (and a simple test) (#1371)
This includes a way to run two versions of the server from the TCL test
framework. It's a preparation to add more cross-version tests. The
runtest script accepts a new parameter

    ./runtest --other-server-path path/to/valkey-server

and a new tag "needs:other-server" for test cases and start_server.
Tests with this tag are automatically skipped if `--other-server-path`
is not provided.

This PR adds it in a CI job with Valkey 7.2.7 by downloading a binary
release.

Fixes #76

---------

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2025-01-23 11:26:54 +01:00
ranshid
3032ccd48a
Change the shared format for dual channel replication logs (#1586)
change the format of the dual channel replication logs so that it will
not
conflict with existing log formats like modules. 

Fixes: https://github.com/valkey-io/valkey/issues/1509

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
2025-01-20 08:04:47 +02:00
Amit Nagler
6be1c77b1e
Fix valgrind test (#1555)
Introduced at https://github.com/valkey-io/valkey/pull/1165/files

Signed-off-by: naglera <anagler123@gmail.com>
2025-01-14 10:49:46 +02:00
Viktor Söderqvist
ad592f73d7
Skip CLI tests with reply schema validation (#1545)
The commands used in valkey-cli tests are not important the reply schema
validation. Skip them to avoid the problem if tests hanging. This has
failed lately in the daily job:

```
[TIMEOUT]: clients state report follows.
sock55fedcc19be0 => (IN PROGRESS) valkey-cli pubsub mode with single standard channel subscription
Killing still running Valkey server 33357
```

These test cases use a special valkey-cli command `:get pubsub` command,
which is an internal command to valkey-cli rather than a Valkey server
command. This command hangs when compiled with with logreqres enabled.
Easy solution is to skip the tests in this setup.

The test cases were introduced in #1432.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2025-01-12 08:02:39 +08:00
Binbin
b207b421bc
Fix new cli subscribed mode test in cluster mode (#1533)
We need to add a hash tag in cluster mode.
Fixes #1531.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2025-01-09 12:21:31 +08:00
Nikhil Manglore
9e0204941d
valkey-cli auto-exit from subscribed mode (#1432)
Resolves issue with valkey-cli not auto exiting from subscribed mode on
reaching zero pub/sub subscription (previously filed on Redis)
https://github.com/redis/redis/issues/12592

---------

Signed-off-by: Nikhil Manglore <nmanglor@amazon.com>
2025-01-08 21:03:06 +01:00
Amit Nagler
8aff235721
Fix unreliable dual channel Valgrind tests (#1500)
Used same approach as PR #1165 to solve random failures.

Resolves #1491

Signed-off-by: naglera <anagler123@gmail.com>
2025-01-02 10:00:29 +08:00
ranshid
0f273bb648
Align rejected unblocked commands to update the correct error statistic (#577)
Currently, in case a blocked command is unblocked externally (eg. due to
the relevant slot being migrated or the CLIENT UNBLOCK command was
issued, the command statistics will always update the failed_calls error
statistic. This leads to missalignment with
90b9f08e5d
as well as some inconsistencies. For example when a key is migrated
during cluster slot migration, clients blocked on XREADGROUP will be
unblocked and update the rejected_calls stat, while clients blocked on
BLPOP will get unblocked updating the failed_calls stat.

In this PR we add explicit indication in updateStatsOnUnblock thet
indicates if the command was rejected or failed.

---------

Signed-off-by: ranshid <ranshid@amazon.com>
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
2025-01-01 16:33:09 +02:00
uriyage
bb325bde35
Fix restore replica output bytes stat update (#1486)
This PR fixes the missing stat update for `total_net_repl_output_bytes`
that was removed during the refactoring in PR #758. The metric was not
being updated when writing to replica connections.

Changes:
- Restored the stat update in postWriteToClient for replica connections
- Added integration test to verify the metric is properly updated

Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
2024-12-25 10:58:49 +08:00
Amit Nagler
9f4503ca50
Add scoped RDB loading context and immediate abort flag (#1173)
This PR introduces a new mechanism for temporarily changing the
server's loading_rio context during RDB loading operations. The new
`RDB_SCOPED_LOADING_RIO` macro allows for a scoped change of the
`server.loading_rio` value, ensuring that it's automatically restored
to its original value when the scope ends.

Introduces a dedicated flag to `rio` to signal immediate abort,
preventing
potential use-after-free scenarios during replication disconnection in 
dual-channel load. This ensures proper termination of
`rdbLoadRioWithLoadingCtx`
when replication is cancelled due to connection loss on main connection.

Fixes https://github.com/valkey-io/valkey/issues/1152

---------

Signed-off-by: naglera <anagler123@gmail.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: Amit Nagler <58042354+naglera@users.noreply.github.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>
2024-12-24 08:14:32 +02:00
Amit Nagler
f1b7f3072c
Reduce dual channel testing time (#1477)
- By not waiting `repl-diskless-sync-delay` when we don't have to, we
can reduce ~30% of dual channel tests execution time.
- This commit also drops one test which is not required for regular sync
(`Sync should continue if not all slaves dropped`).
- Skip dual channel test with master diskless disabled because it will
initiate the same synchronization process as the non-dual channel test,
making it redundant.


Before:
```
Execution time of different units:
  171 seconds - integration/dual-channel-replication
  305 seconds - integration/replication-psync

\o/ All tests passed without errors!
```
After:
```
Execution time of different units:
  120 seconds - integration/dual-channel-replication
  236 seconds - integration/replication-psync

\o/ All tests passed without errors!
```

Discused on https://github.com/valkey-io/valkey/pull/1173

---------

Signed-off-by: naglera <anagler123@gmail.com>
2024-12-24 08:13:25 +02:00
Viktor Szépe
b66698b887
Discover and fix new typos (#1446)
Upgrade `typos` and fix corresponding typos

---------

Signed-off-by: Viktor Szépe <viktor@szepe.net>
2024-12-17 17:45:43 -08:00
Viktor Söderqvist
3eb8314be6 Replace dict with hashtable for keys, expires and pubsub channels
Instead of a dictEntry with pointers to key and value, the hashtable
has a pointer directly to the value (robj) which can hold an embedded
key and acts as a key-value in the hashtable. This minimizes the number
of pointers to follow and thus the number of memory accesses to lookup
a key-value pair.

        Keys         robj
      hashtable
      +-------+   +-----------------------+
      | 0     |   | type, encoding, LRU   |
      | 1 ------->| refcount, expire      |
      | 2     |   | ptr                   |
      | ...   |   | optional embedded key |
      +-------+   | optional embedded val |
                  +-----------------------+

The expire timestamp (TTL) is also stored in the robj, if any. The expire
hash table points to the same robj.

Overview of changes:

* Replace dict with hashtable in kvstore (kvstore.c)
* Add functions for embedding key and expire in robj (object.c)
  * When there's unused space, reserve an expire field to avoid realloting
    it later if expire is added.
  * Always reserve space for expire for large key names to avoid realloc
    if it's set later.
* Update db functions (db.c)
  * dbAdd, setKey and setExpire reallocate the object when embedding a key
  * setKey does not increment the reference counter, since it would require
    duplicating the object. This responsibility is moved to the caller.
* Remove logic for shared integer objects as values in the database. The keys
  are now embedded in the objects, so all objects in the database need to be
  unique. Thus, we can't use shared objects as values. Also delete test cases
  for shared integers.
* Adjust various commands to the changes mentioned above.
* Adjust defrag code
  * Improvement: Don't access the expires table before defrag has actually
    reallocated the object.
* Adjust test cases that were using hard-coded sizes for dict when realloc
  would happen, and some other adjustments in test cases.
* Adjust memory prefetch for new hash table implementation in IO-threading,
  using new `hashtableIncrementalFind` API
* Adjust offloading of free() to IO threads: Object free to be done in main
  thread while keeping obj->ptr offloading in IO-thread since the DB object is
  now allocated by the main-thread and not by the IO-thread as it used to be.
* Let expireIfNeeded take an optional value, to avoid looking up the expires
  table when possible.

---------

Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: uriyage <78144248+uriyage@users.noreply.github.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Uri Yagelnik <uriy@amazon.com>
2024-12-10 21:30:56 +01:00
Amit Nagler
7043ef0bbb
Split dual-channel COB overrun tests to separate servers (#1374)
1. The test isn't waiting long enough for the output buffer to overrun.
This problem is happening because an error from the previous test is
bleeding into the current test's logs. The simplest fix would be to
split these tests.
2. Increased replication timeout to ensure sync fails due to output
buffer overrun before a timeout occurs.

Fixes #1367

Signed-off-by: naglera <anagler123@gmail.com>
2024-12-01 21:33:43 +08:00
Amit Nagler
9305b49145
Add tag for dual-channel logs (#999)
This PR introduces a consistent tagging system for dual-channel logs.
The goal is to improve log readability and filterability, making it
easier for operators to manage and analyze log entries.

Resolves https://github.com/valkey-io/valkey/issues/986

---------

Signed-off-by: naglera <anagler123@gmail.com>
2024-11-26 16:51:52 +02:00
Binbin
4aacffa32d
Stabilize dual replication test to avoid getting LOADING error (#1288)
When doing `$replica replicaof no one`, we may get a LOADING
error, this is because during the test execution, the replica
may reconnect very quickly, and the full sync is initiated,
and the replica has entered the LOADING state.

In this commit, we make sure the primary is pasued after the
fork, so the replica won't enter the LOADING state, and with
this fix, this test seems more natural and predictable.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-11-11 21:42:34 +08:00
ranshid
29b83f1ac8
Introduce bgsave cancel (#757)
In some cases bgsave child process can run for a long time exhausting
system resources. Although it is possible to kill the bgsave child
process from the system shell, sometimes it is not possible allowing OS
level access.

This PR adds a new subcommand to the BGSAVE command.
When user will issue `BGSAVE CANCEL`, it will do one of the 2:

1. In case a bgsave child process is currently running, the child
   process would be immediately killed thus terminating any
   save/replication full sync process.
2. In case a bgsave child process is SCHEDULED to run, the scheduled
   execution will be cancelled.

---------

Signed-off-by: ranshid <ranshid@amazon.com>
Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com>
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-10-21 11:56:44 +02:00
ranshid
36d438ba27
Deflake test ync should continue if not all slaves dropped dual-channel-replication (#1164)
Sometimes when dual-channel is turned off the tested replica might
disconnect on COB overrun. disable the replica COB limit in order to
prevent such cases.

Fixes: #1153

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
2024-10-14 15:31:59 +08:00
ranshid
597aa037cc
Deflake test Primary COB growth with inactive replica (#1165)
in case of valgrind run, the replica might get disconnected from the
primary due to repl-timeout reached. Fix is to configure larger timeout
in case of valgrind test.

**Partially** fixes: #1152

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
2024-10-14 15:30:29 +08:00
kronwerk
cd8de095c4
Add flush-before-load option for repl-diskless-load (#909)
A new option for diskless replication on the replica side.

After a network failure, the replica may need to perform a full sync.
The other option for diskless full sync is `swapdb`, but it uses twice
as much memory, temporarily. In situations where this is not acceptable,
and where losing data is acceptable, the `flush-before-load` can be
useful. If the full sync fails, the old data is lost though. Therefore,
the new option is marked as "dangerous".

---------

Signed-off-by: kronwerk <ca11e5e22g@gmail.com>
Signed-off-by: kronwerk <kronwerk@users.noreply.github.com>
Co-authored-by: kronwerk <ca11e5e22g@gmail.com>
2024-10-09 13:11:53 +02:00
Binbin
80fcbd3fec
Fix module / script call CLUSTER SLOTS / SHARDS fake client check crash (#1063)
The reason is VM_Call will use a fake client without connection,
so we also need to check if c->conn is NULL.

This also affects scripts. If they are called in the script, the
server will crash. Injecting commands into AOF will also cause
startup failure.

Fixes #1054.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-09-25 14:50:48 +08:00
Binbin
6ce75cdea8
Fix replica online timing issue in failover test (#1044)
Ci reported this failure:
```
[exception]: Executing test client: ERR FAILOVER target replica is not online..
ERR FAILOVER target replica is not online.
    while executing
"$node_0 failover to $node_1_host $node_1_port"
```

We can see somehow the replica is not online in time and
casuing this failure, added a verify_replica_online to make
sure the replica is online for the test.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-09-23 17:35:02 +08:00
Binbin
17390383b5
Replica flush the old data after RDB file is ok in disk-based replication (#926)
Call emptyData right before rdbLoad to prevent errors in the middle
and we drop the replication stream and leaving an empty database.
The real changes is in disk-based part, the rest is just code movement.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-09-14 11:49:49 +08:00
Amit Nagler
1b24168450
Dual Channel Replication - Verify Replica Local Buffer Limit Configuration (#989)
Prior to comparing the replica buffer against the configured limit, we
need to ensure that the limit configuration is enabled. If the limit is
set to zero, it indicates that there is no limit, and we should skip the
buffer limit check.

---------

Signed-off-by: naglera <anagler123@gmail.com>
2024-09-10 17:26:28 -07:00
Binbin
6478526597
Fix aof base suffix when modifying aof-use-rdb-preamble during rewrite (#886)
If we modify aof-use-rdb-preamble in the middle of rewrite,
we may get a wrong aof base suffix. This is because the suffix
is concatenated by the main process afterwards, and it may be
different from the beginning.

We cache this value when we start the rewrite.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-09-07 23:27:59 +08:00
zhaozhao.zz
743f5ac2ae
standalone -REDIRECT handles special case of MULTI context (#895)
In standalone mode, when a `-REDIRECT` error occurs, special handling is
required if the client is in the `MULTI` context.

We have adopted the same handling method as the cluster mode:

1. If a command in the transaction encounters a `REDIRECT` at the time
of queuing, the execution of `EXEC` will return an `EXECABORT` error (we
expect the client to redirect and discard the transaction upon receiving
a `REDIRECT`). That is:

    ```
    MULTI    ==>  +OK
    SET x y  ==>  -REDIRECT
    EXEC     ==>  -EXECABORT
    ```
2. If all commands are successfully queued (i.e., `QUEUED` results are
received) but a redirect is detected during `EXEC` execution (such as a
primary-replica switch), a `REDIRECT` is returned to instruct the client
to perform a redirect. That is:

    ```
    MULTI    ==>  +OK
    SET x y  ==>  +QUEUED
    failover
    EXEC     ==>  -REDIRECT
    ```

---------

Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
2024-08-30 10:17:53 +08:00
Amit Nagler
1ff2a3b6ae
Remove dual-channel-replication Feature Flag's Protection (#908)
Currently, the `dual-channel-replication` feature flag is immutable if
`enable-protected-configs` is enabled, which is the default behavior.
This PR proposes to make the `dual-channel-replication` flag mutable,
allowing it to be changed dynamically without restarting the cluster.

**Motivation:**
The ability to change the `dual-channel-replication` flag dynamically is
essential for testing and validating the feature on real clusters
running in production environments. By making the flag mutable, we can
enable or disable the feature without disrupting the cluster's
operations, facilitating easier testing and experimentation.
Additionally, this change would provide more flexibility for users to
enable or disable the feature based on their specific requirements or
operational needs without requiring a cluster restart.

---------

Signed-off-by: naglera <anagler123@gmail.com>
2024-08-27 10:18:48 -07:00
Madelyn Olson
b12668af7a
Revert repl backlog size back to 1mb for dual channel tests (#934)
There is a test that assumes that the backlog will get overrun, but
because of the recent changes to the default it no longer fails. It
seems like it is a bit flakey now though, so resetting the value in the
test back to 1mb. (This relates to the CoB of 1100k. So it should
consistently work with a 1mb limit).

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2024-08-22 15:35:28 -07:00
Binbin
a1ac459ef1
Set repl-backlog-size from 1mb to 10mb by default (#911)
The repl-backlog-size 1mb is too small in most cases, now network
transmission and bandwidth performance have improved rapidly in more
than ten years.

The bigger the replication backlog, the longer the replica can endure
the disconnect and later be able to perform a partial resynchronization.

Part of #653.

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-08-21 11:59:02 -04:00
gmbnomis
7795152fff
Fix valgrind timing issue failure in replica-redirect test (#917)
Wait for the replica to become online before starting the actual test.

Signed-off-by: Simon Baatz <gmbnomis@gmail.com>
2024-08-18 21:20:53 +08:00
zhaozhao.zz
131857e80a
To avoid bouncing -REDIRECT during FAILOVER (#871)
Fix #821

During the `FAILOVER` process, when conditions are met (such as when the
force time is reached or the primary and replica offsets are
consistent), the primary actively becomes the replica and transitions to
the `FAILOVER_IN_PROGRESS` state. After the primary becomes the replica,
and after handshaking and other operations, it will eventually send the
`PSYNC FAILOVER` command to the replica, after which the replica will
become the primary. This means that the upgrade of the replica to the
primary is an asynchronous operation, which implies that during the
`FAILOVER_IN_PROGRESS` state, there may be a period of time where both
nodes are replicas. In this scenario, if a `-REDIRECT` is returned, the
request will be redirected to the replica and then redirected back,
causing back and forth redirection. To avoid this situation, during the
`FAILOVER_IN_PROGRESS state`, we temporarily suspend the clients that
need to be redirected until the replica truly becomes the primary, and
then resume the execution.

---------

Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
2024-08-14 14:04:29 +08:00
Amit Nagler
6cb86fff51
Fix dual-channel replication test under valgrind (#904)
Test dual-channel-replication primary gets cob overrun during replica
rdb load` fails during the Valgrind run. This is due to the load
handlers disconnecting before the tests complete, resulting in a low
primary COB. Increasing the handlers' timeout should resolve this issue.

Failure:
https://github.com/valkey-io/valkey/actions/runs/10361286333/job/28681321393

Server logs reveals that the load handler clients were disconnected
before the test started

Also the two previus test took about 20 seconds which is the handler
timeout.

---------

Signed-off-by: naglera <anagler123@gmail.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-08-13 10:40:19 -07:00
naglera
27fce29500
Fix dual-channel-replication related issues (#837)
- Fix TLS bug where connection were shutdown by primary's main process
while the child process was still writing- causing main process to be
blocked.
- TLS connection fix -file descriptors are set to blocking mode in the
main thread, followed by a blocking write. This sets the file
descriptors to non-blocking if TLS is used (see `connTLSSyncWrite()`)
(@xbasel).
- Improve the reliability of dual-channel tests. Modify the pause
mechanism to verify process status directly, rather than relying on log.
- Ensure that `server.repl_offset` and `server.replid` are updated
correctly when dual channel synchronization completes successfully.
Thist led to failures in replication tests that validate replication IDs
or compare replication offsets.

---------

Signed-off-by: naglera <anagler123@gmail.com>
Signed-off-by: naglera <58042354+naglera@users.noreply.github.com>
Signed-off-by: xbasel <103044017+xbasel@users.noreply.github.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
2024-08-12 13:03:12 -07:00
Binbin
929283fc6f
Dual channel replication should not update lastbgsave_status when transfer error (#811)
Currently lastbgsave_status is used in bgsave or disk-replication,
and the target is the disk. In #60, we update it when transfer error,
i think it is mainly used in tests, so we can use log to replace it.

It changes lastbgsave_status to err in this case, but it is strange
that it does not set ok or err in the above if and the following else.
Also noted this will affect stop-writes-on-bgsave-error.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-08-12 11:25:55 -07:00
naglera
9211aed72e
Improve reliability of dual-channel-replication pause resume tests (#835)
Update the dual channel-replication tests to wait for the pause to begin
before attempting to unpause.

---------

Signed-off-by: naglera <anagler123@gmail.com>
2024-07-28 11:14:56 -07:00
Binbin
e32518d655
Fix unexpected behavior when turning appendonly on and off within a transaction (#826)
If we do `config set appendonly yes` and `config set appendonly no`
in a multi, there are some unexpected behavior.

When doing appendonly yes, we will schedule a AOFRW, and when we
are doding appendonly no, we will call stopAppendOnly to stop it.
In stopAppendOnly, the aof_fd is -1 since the aof is not start yet
and the fsync and close will take the -1 and call it, so they will
all fail with EBADF. And stopAppendOnly will emit a server log, the
close(-1) should be no problem but it is still an undefined behavior.

This PR also adds a log `Background append only file rewriting
scheduled.` to bgrewriteaofCommand when it was scheduled. 
And adds a log in stopAppendOnly when a scheduled AOF is canceled,
it will print  `AOF was disabled but there is a scheduled AOF background, cancel it.`

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-07-27 18:38:23 +08:00
naglera
48ca2c9176
Improve dual channel replication stability and fix compatibility issues (#804)
Introduce several improvements to improve the stability of dual-channel
replication and fix compatibility issues.

1. Make dual-channel-replication tests more reliable: use pause instead
of forced sleep.
2. Fix race conditions when freeing RDB client.
3. Check if sync was stopped during local buffer streaming.
4. Fix $ENDOFFSET reply format to work on 32-bit machines too.

---------

Signed-off-by: naglera <anagler123@gmail.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-07-25 09:34:39 -07:00
naglera
ff6b780fe6
Dual channel replication (#60)
In this PR we introduce the main benefit of dual channel replication by
continuously steaming the COB (client output buffers) in parallel to the
RDB and thus keeping the primary's side COB small AND accelerating the
overall sync process. By streaming the replication data to the replica
during the full sync, we reduce
1. Memory load from the primary's node.
2. CPU load from the primary's main process. [Latest performance
tests](#data)

## Motivation
* Reduce primary memory load. We do that by moving the COB tracking to
the replica side. This also decrease the chance for COB overruns. Note
that primary's input buffer limits at the replica side are less
restricted then primary's COB as the replica plays less critical part in
the replication group. While increasing the primary’s COB may end up
with primary reaching swap and clients suffering, at replica side we’re
more at ease with it. Larger COB means better chance to sync
successfully.
* Reduce primary main process CPU load. By opening a new, dedicated
connection for the RDB transfer, child processes can have direct access
to the new connection. Due to TLS connection restrictions, this was not
possible using one main connection. We eliminate the need for the child
process to use the primary's child-proc -> main-proc pipeline, thus
freeing up the main process to process clients queries.


 ## Dual Channel Replication high level interface design
- Dual channel replication begins when the replica sends a `REPLCONF
CAPA DUALCHANNEL` to the primary during initial
handshake. This is used to state that the replica is capable of dual
channel sync and that this is the replica's main channel, which is not
used for snapshot transfer.
- When replica lacks sufficient data for PSYNC, the primary will send
`-FULLSYNCNEEDED` response instead
of RDB data. As a next step, the replica creates a new connection
(rdb-channel) and configures it against
the primary with the appropriate capabilities and requirements. The
replica then requests a sync
     using the RDB channel. 
- Prior to forking, the primary sends the replica the snapshot's end
repl-offset, and attaches the replica
to the replication backlog to keep repl data until the replica requests
psync. The replica uses the main
     channel to request a PSYNC starting at the snapshot end offset. 
- The primary main threads sends incremental changes via the main
channel, while the bgsave process
sends the RDB directly to the replica via the rdb-channel. As for the
replica, the incremental
changes are stored on a local buffer, while the RDB is loaded into
memory.
- Once the replica completes loading the rdb, it drops the
rdb-connection and streams the accumulated incremental
     changes into memory. Repl steady state continues normally.

## New replica state machine


![image](https://github.com/user-attachments/assets/38fbfff0-60b9-4066-8b13-becdb87babc3)





## Data <a name="data"></a>

![image](https://github.com/user-attachments/assets/d73631a7-0a58-4958-a494-a7f4add9108f)


![image](https://github.com/user-attachments/assets/f44936ed-c59a-4223-905d-0fe48a6d31a6)


![image](https://github.com/user-attachments/assets/bd333ee2-3c47-47e5-b244-4ea75f77c836)

## Explanation 
These graphs demonstrate performance improvements during full sync
sessions using rdb-channel + streaming rdb directly from the background
process to the replica.

First graph- with at most 50 clients and light weight commands, we saw
5%-7.5% improvement in write latency during sync session.
Two graphs below- full sync was tested during heavy read commands from
the primary (such as sdiff, sunion on large sets). In that case, the
child process writes to the replica without sharing CPU with the loaded
main process. As a result, this not only improves client response time,
but may also shorten sync time by about 50%. The shorter sync time
results in less memory being used to store replication diffs (>60% in
some of the tested cases).

## Test setup 
Both primary and replica in the performance tests ran on the same
machine. RDB size in all tests is 3.7gb. I generated write load using
valkey-benchmark ` ./valkey-benchmark -r 100000 -n 6000000 lpush my_list
__rand_int__`.

---------

Signed-off-by: naglera <anagler123@gmail.com>
Signed-off-by: naglera <58042354+naglera@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-07-17 13:59:33 -07:00
uriyage
bbfd041895
Async IO threads (#758)
This PR is 1 of 3 PRs intended to achieve the goal of 1 million requests
per second, as detailed by [dan touitou](https://github.com/touitou-dan)
in https://github.com/valkey-io/valkey/issues/22. This PR modifies the
IO threads to be fully asynchronous, which is a first and necessary step
to allow more work offloading and better utilization of the IO threads.

### Current IO threads state:

Valkey IO threads were introduced in Redis 6.0 to allow better
utilization of multi-core machines. Before this, Redis was
single-threaded and could only use one CPU core for network and command
processing. The introduction of IO threads helps in offloading the IO
operations to multiple threads.

**Current IO Threads flow:**

1. Initialization: When Redis starts, it initializes a specified number
of IO threads. These threads are in addition to the main thread, each
thread starts with an empty list, the main thread will populate that
list in each event-loop with pending-read-clients or
pending-write-clients.
2. Read Phase: The main thread accepts incoming connections and reads
requests from clients. The reading of requests are offloaded to IO
threads. The main thread puts the clients ready-to-read in a list and
set the global io_threads_op to IO_THREADS_OP_READ, the IO threads pick
the clients up, perform the read operation and parse the first incoming
command.
3. Command Processing: After reading the requests, command processing is
still single-threaded and handled by the main thread.
4. Write Phase: Similar to the read phase, the write phase is also be
offloaded to IO threads. The main thread prepares the response in the
clients’ output buffer then the main thread puts the client in the list,
and sets the global io_threads_op to the IO_THREADS_OP_WRITE. The IO
threads then pick the clients up and perform the write operation to send
the responses back to clients.
5. Synchronization: The main-thread communicate with the threads on how
many jobs left per each thread with atomic counter. The main-thread
doesn’t access the clients while being handled by the IO threads.

**Issues with current implementation:**

* Underutilized Cores: The current implementation of IO-threads leads to
the underutilization of CPU cores.
* The main thread remains responsible for a significant portion of
IO-related tasks that could be offloaded to IO-threads.
* When the main-thread is processing client’s commands, the IO threads
are idle for a considerable amount of time.
* Notably, the main thread's performance during the IO-related tasks is
constrained by the speed of the slowest IO-thread.
* Limited Offloading: Currently, Since the Main-threads waits
synchronously for the IO threads, the Threads perform only read-parse,
and write operations, with parsing done only for the first command. If
the threads can do work asynchronously we may offload more work to the
threads reducing the load from the main-thread.
* TLS: Currently, we don't support IO threads with TLS (where offloading
IO would be more beneficial) since TLS read/write operations are not
thread-safe with the current implementation.

### Suggested change

Non-blocking main thread - The main thread and IO threads will operate
in parallel to maximize efficiency. The main thread will not be blocked
by IO operations. It will continue to process commands independently of
the IO thread's activities.

**Implementation details**

**Inter-thread communication.**

* We use a static, lock-free ring buffer of fixed size (2048 jobs) for
the main thread to send jobs and for the IO to receive them. If the ring
buffer fills up, the main thread will handle the task itself, acting as
back pressure (in case IO operations are more expensive than command
processing). A static ring buffer is a better candidate than a dynamic
job queue as it eliminates the need for allocation/freeing per job.
* An IO job will be in the format: ` [void* function-call-back | void
*data] `where data is either a client to read/write from and the
function-ptr is the function to be called with the data for example
readQueryFromClient using this format we can use it later to offload
other types of works to the IO threads.
* The Ring buffer is one way from the main-thread to the IO thread, Upon
read/write event the main thread will send a read/write job then in
before sleep it will iterate over the pending read/write clients to
checking for each client if the IO threads has already finished handling
it. The IO thread signals it has finished handling a client read/write
by toggling an atomic flag read_state / write_state on the client
struct.

**Thread Safety**

As suggested in this solution, the IO threads are reading from and
writing to the clients' buffers while the main thread may access those
clients.
We must ensure no race conditions or unsafe access occurs while keeping
the Valkey code simple and lock free.

Minimal Action in the IO Threads
The main change is to limit the IO thread operations to the bare
minimum. The IO thread will access only the client's struct and only the
necessary fields in this struct.
The IO threads will be responsible for the following:

* Read Operation: The IO thread will only read and parse a single
command. It will not update the server stats, handle read errors, or
parsing errors. These tasks will be taken care of by the main thread.
* Write Operation: The IO thread will only write the available data. It
will not free the client's replies, handle write errors, or update the
server statistics.


To achieve this without code duplication, the read/write code has been
refactored into smaller, independent components:

* Functions that perform only the read/parse/write calls.
* Functions that handle the read/parse/write results.

This refactor accounts for the majority of the modifications in this PR.

**Client Struct Safe Access**

As we ensure that the IO threads access memory only within the client
struct, we need to ensure thread safety only for the client's struct's
shared fields.

* Query Buffer 
* Command parsing - The main thread will not try to parse a command from
the query buffer when a client is offloaded to the IO thread.
* Client's memory checks in client-cron - The main thread will not
access the client query buffer if it is offloaded and will handle the
querybuf grow/shrink when the client is back.
* CLIENT LIST command - The main thread will busy-wait for the IO thread
to finish handling the client, falling back to the current behavior
where the main thread waits for the IO thread to finish their
processing.
* Output Buffer 
* The IO thread will not change the client's bufpos and won't free the
client's reply lists. These actions will be done by the main thread on
the client's return from the IO thread.
* bufpos / block→used: As the main thread may change the bufpos, the
reply-block→used, or add/delete blocks to the reply list while the IO
thread writes, we add two fields to the client struct: io_last_bufpos
and io_last_reply_block. The IO thread will write until the
io_last_bufpos, which was set by the main-thread before sending the
client to the IO thread. If more data has been added to the cob in
between, it will be written in the next write-job. In addition, the main
thread will not trim or merge reply blocks while the client is
offloaded.
* Parsing Fields 
    * Client's cmd, argc, argv, reqtype, etc., are set during parsing.
* The main thread will indicate to the IO thread not to parse a cmd if
the client is not reset. In this case, the IO thread will only read from
the network and won't attempt to parse a new command.
* The main thread won't access the c→cmd/c→argv in the CLIENT LIST
command as stated before it will busy wait for the IO threads.
* Client Flags 
* c→flags, which may be changed by the main thread in multiple places,
won't be accessed by the IO thread. Instead, the main thread will set
the c→io_flags with the information necessary for the IO thread to know
the client's state.
* Client Close 
* On freeClient, the main thread will busy wait for the IO thread to
finish processing the client's read/write before proceeding to free the
client.
* Client's Memory Limits 
* The IO thread won't handle the qb/cob limits. In case a client crosses
the qb limit, the IO thread will stop reading for it, letting the main
thread know that the client crossed the limit.

**TLS**

TLS is currently not supported with IO threads for the following
reasons:

1. Pending reads - If SSL has pending data that has already been read
from the socket, there is a risk of not calling the read handler again.
To handle this, a list is used to hold the pending clients. With IO
threads, multiple threads can access the list concurrently.
2. Event loop modification - Currently, the TLS code
registers/unregisters the file descriptor from the event loop depending
on the read/write results. With IO threads, multiple threads can modify
the event loop struct simultaneously.
3. The same client can be sent to 2 different threads concurrently
(https://github.com/redis/redis/issues/12540).

Those issues were handled in the current PR:

1. The IO thread only performs the read operation. The main thread will
check for pending reads after the client returns from the IO thread and
will be the only one to access the pending list.
2. The registering/unregistering of events will be similarly postponed
and handled by the main thread only.
3. Each client is being sent to the same dedicated thread (c→id %
num_of_threads).


**Sending Replies Immediately with IO threads.**

Currently, after processing a command, we add the client to the
pending_writes_list. Only after processing all the clients do we send
all the replies. Since the IO threads are now working asynchronously, we
can send the reply immediately after processing the client’s requests,
reducing the command latency. However, if we are using AOF=always, we
must wait for the AOF buffer to be written, in which case we revert to
the current behavior.

**IO threads dynamic adjustment**

Currently, we use an all-or-nothing approach when activating the IO
threads. The current logic is as follows: if the number of pending write
clients is greater than twice the number of threads (including the main
thread), we enable all threads; otherwise, we enable none. For example,
if 8 IO threads are defined, we enable all 8 threads if there are 16
pending clients; else, we enable none.
It makes more sense to enable partial activation of the IO threads. If
we have 10 pending clients, we will enable 5 threads, and so on. This
approach allows for a more granular and efficient allocation of
resources based on the current workload.

In addition, the user will now be able to change the number of I/O
threads at runtime. For example, when decreasing the number of threads
from 4 to 2, threads 3 and 4 will be closed after flushing their job
queues.

**Tests**

Currently, we run the io-threads tests with 4 IO threads
(443d80f168/.github/workflows/daily.yml (L353)).
This means that we will not activate the IO threads unless there are 8
(threads * 2) pending write clients per single loop, which is unlikely
to happened in most of tests, meaning the IO threads are not currently
being tested.

To enforce the main thread to always offload work to the IO threads,
regardless of the number of pending events, we add an
events-per-io-thread configuration with a default value of 2. When set
to 0, this configuration will force the main thread to always offload
work to the IO threads.

When we offload every single read/write operation to the IO threads, the
IO-threads are running with 100% CPU when running multiple tests
concurrently some tests fail as a result of larger than expected command
latencies. To address this issue, we have to add some after or wait_for
calls to some of the tests to ensure they pass with IO threads as well.

Signed-off-by: Uri Yagelnik <uriy@amazon.com>
2024-07-08 20:01:39 -07:00
zhaozhao.zz
28c5a17edf
replica redirect read&write to primary in standalone mode (#325)
To implement #319 

1. replica is able to redirect read and write commands to it's primary
in standalone mode
    * reply with "-REDIRECT primary-ip:port"
2. add a subcommand `CLIENT CAPA redirect`, a client can announce the
capability to handle redirection
    * if a client can handle redirection, the data access commands (read and
write) will be redirected
3. allow `readonly` and `readwrite` command in standalone mode, may be a
breaking change
    * a client with redirect capability cannot process read commands on a
replica by default
    * use READONLY command can allow read commands on a replica

---------

Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
2024-06-27 19:00:45 +08:00
Ping Xie
4135894a5d
Update remaining master references to primary (#660)
Signed-off-by: Ping Xie <pingxie@google.com>
2024-06-17 20:31:15 -07:00
Madelyn Olson
bce240eab7
Replace masteruser and masterauth with primaryuser and primaryauth (#598)
Make the one backwards compatible config change we are allowed to
replace for removing master from our API.

`masterauth` and `masteruser` are still used as an alias, but aren't
explicitly referenced. As an addendum to
https://github.com/valkey-io/valkey/pull/591, it would be good to have
this in 8. Given the related PR for updated other references for master,
I just updated the ones around this specific change.

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2024-06-07 00:46:52 -07:00
Viktor Söderqvist
ad5fd5b95c
More rebranding (#606)
More rebranding of

* Log messages (#252)
* The DENIED error reply
* Internal function names and comments, mainly Lua API

---------

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-06-07 01:40:55 +02:00
naglera
28e055af0b
Deflake chained replicas disconnect (#574)
Deflake chained replicas disconnect when replica re-connect with the
same master. sync_partial_ok counter might get incremented if replica
timed out during test.

Signed-off-by: naglera <anagler123@gmail.com>
2024-06-02 20:53:39 -07:00
Shivshankar
52f9291f79
Rename redis to valkey in test suite logs and test names. (#366)
This PR covers below cases.
1. test suite's prints(i.e., puts statement logs).
2. Links refering to redis issues.
3. test names contains redis.

Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
2024-04-25 15:13:21 +08:00
Shivshankar
8baf322742
Rename remaining test procedures (#355)
Renamed below procedures and variables (missed in #287) as follows.

redis_cluster             ->     valkey_cluster
redis1                    ->     valkey1
redis2                    ->     valkey2
get_redis_dir             ->     get_valkey_dir
test_redis_cli_rdb_dump   ->     test_valkey_cli_rdb_dump
test_redis_cli_repl       ->     test_valkey_cli_repl
redis-cli                 ->     valkey-cli
redis_reset_state         ->     valkey_reset_state
redisHandle               ->     valkeyHandle
redis_safe_read           ->     valkey_safe_read
redis_safe_gets           ->     valkey_safe_gets
redis_write               ->     valkey_write
redis_read_reply          ->     valkey_read_reply
redis_readable            ->     valkey_readable
redis_readnl              ->     valkey_readnl
redis_writenl             ->     valkey_writenl
redis_read_map            ->     valkey_read_map
redis_read_line           ->     valkey_read_line
redis_read_null           ->     valkey_read_null
redis_read_bool           ->     valkey_read_bool
redis_read_double         ->     valkey_read_double
redis_read_verbatim_str   ->     valkey_read_verbatim_str
redis_call_callback       ->     valkey_call_callback

---------

Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
2024-04-24 18:01:33 +02:00