Two jobs were missing from the job list for failure notification
* test-ubuntu-tls-io-threads
* test-sanitizer-force-defrag
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Addresses the failure here:
https://github.com/valkey-io/valkey/actions/runs/13000845302/job/36259016156#step:5:7272.
This change does three things:
1. For some reason TCL 8.5 (which is used on macos) is handling `\x03ba`
as `\0xba`, according to
https://www.tcl-lang.org/man/tcl8.5/TclCmd/Tcl.htm#M27 so we encode
"bar" using hex escapes too.
2. Fix a spacing issue.
3. Make it so that if the restore fails, it immediately errors.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Do command existence and arity checks when loading AOF to avoid crash
Currently, loading commands such as `cluster` or `cluster slots xxx`
from AOF will cause the server to crash.
1. `cluster` is a container command, and executing proc will cause a
crash because we do not check subcommand and arity.
2. `cluster slots xxx`, arity check fail, reply with an error from the
AOF client and trigger a panic.
Of course, there are many other ways for a problematic AOF to cause the
panic, but it is still necessary do some basic checks before executing.
In this way, in these basic cases, we can print useful error messages
instead of crashing directly.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Use Linux syscall mmap/munmap to manage a RDMA memory region, then we
have a guard page protected VMA like (cat /proc/PID/maps):
785018afe000-785018aff000 ---p 00000000 00:00 0 -> top guard page
785018aff000-785018bff000 rw-p 00000000 00:00 0 -> RDMA memory region
785018bff000-785018c00000 ---p 00000000 00:00 0 -> bottom guard page
Once any code accesses memory unexpectedly, segment fault occurs.
Signed-off-by: zhenwei pi <zhenwei.pi@linux.dev>
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
In C, we had better initialize every variable in struct, this PR fixes
one missed variable Initialization.
---------
Signed-off-by: hwware <wen.hui.ware@gmail.com>
This address 2 issues:
1. It is possible (somehow) that the inner server client (r) was not
working resp 3 when entering this test.
this makes sure it does.
2. in case the test failed it might leave the redirection client closed.
there is a cross test assumption it should be open, so moved most of the
assert checks to the end of the test.
example fail:
https://github.com/valkey-io/valkey/actions/runs/12979601179/job/36195523412
---------
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
New config `rdb-version-check` with values:
* `strict`: Reject future RDB versions.
* `relaxed`: Try parsing future RDB versions and fail only when an
unknown RDB opcode or type is encountered.
This can make it possible for Valkey 8.1 to try read a dump from for
example Valkey 9.0 or later on a best-effort basis. The conditions for
when this is expected to work can be defined when the future Valkey
versions are released. Loading is expected to fail in the following
cases:
* If the data set contains any new key types or other data elements not
supported by the current version.
* If the RDB contains new representations or encodings of existing key
types or other data elements.
This change also prepares for the next RDB version bump. A range of RDB
versions (12-79) is reserved, since it's expected to be used by foreign
software RDB versions, so Valkey will not accept versions in this range
even with the `relaxed` version check. The DUMP/RESTORE format has no
magic string; only the RDB version number.
This change also prepares for the magic string to change from REDIS to
VALKEY next time we bump the RDB version.
Related to #1108.
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Use-after-free has been detect by address sanitizer, such as in this
test run:
https://github.com/valkey-io/valkey/actions/runs/12981530413/job/36200075972?pr=1620#step:5:1339
`hashtableShrinkIfNeeded` may free one of the hash tables and invalidate
the variables used by the `fillBucketHole(ht, b, pos_in_bucket,
table_index)` just after, causing use-after-free. Fill bucket hole first
and shrink afterwards is assumed to solve the issue. (Not reproduced
locally.)
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Might close https://github.com/valkey-io/valkey/issues/1484.
I noticed that we don't disable pause after fork on the last test that
was getting executed, so it might getting stuck in pause loops after the
test ends if it tries another psync for any reason.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Fixes the unit test for hashtable random fairness intermittent failures when
running with the `--accurate` flag.
https://github.com/valkey-io/valkey/actions/runs/12969591890/job/36173815884#step:10:105
The test case picks a random element out of 400, repeated 1M times, and
then checks that 60% of the elements are picked within 3 standard
deviations from the number of times they're expected to be picked. In
this test run (with `--accurate`), the expected number is 2500 and the
standard deviation is 50, which is only 2% of the expected value. This
makes the check too strict and makes the test flaky.
As an alternative, we allow 80% of the elements to be picked within 10%
of the expected number. With this alternative condition, we can also
raise the check for the non-edge case from 60% to 80% of the elements to
be within 3 standard deviations. (With fewer repetitions, 3 standard
deviations is greater than 10% of the expected value, so this new
condition only affects the `--accurate` test run.)
Additional change: Set a random seed to the hash function in the test
suite. Until now, we only seeded the random number generator.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
As discussed in PR #336.
We have different types of resources like CPU, memory, network, etc. The
`slowlog` can only record commands eat lots of CPU during the processing
phase (doesn't include read/write network time), but can not record
commands eat too many memory and network. For example:
1. run "SET key value(10 megabytes)" command would not be recored in
slowlog, since when processing it the SET command only insert the
value's pointer into db dict. But that command eats huge memory in query
buffer and bandwidth from network. In this case, just 1000 tps can cause
10GB/s network flow.
2. run "GET key" command and the key's value length is 10 megabytes. The
get command can eat huge memory in output buffer and bandwidth to
network.
This PR introduces a new command `COMMANDLOG`, to log commands that
consume significant network bandwidth, including both input and output.
Users can retrieve the results using `COMMANDLOG get <count>
large-request` and `COMMANDLOG get <count> large-reply`, all subcommands
for `COMMANDLOG` are:
* `COMMANDLOG HELP`
* `COMMANDLOG GET <count> <slow|large-request|large-reply>`
* `COMMANDLOG LEN <slow|large-request|large-reply>`
* `COMMANDLOG RESET <slow|large-request|large-reply>`
And the slowlog is also incorporated into the commandlog.
For each of these three types, additional configs have been added for
control:
* `commandlog-request-larger-than` and
`commandlog-large-request-max-len` represent the threshold for large
requests(the unit is Bytes) and the maximum number of commands that can
be recorded.
* `commandlog-reply-larger-than` and `commandlog-large-reply-max-len`
represent the threshold for large replies(the unit is Bytes) and the
maximum number of commands that can be recorded.
* `commandlog-execution-slower-than` and
`commandlog-slow-execution-max-len` represent the threshold for slow
executions(the unit is microseconds) and the maximum number of commands
that can be recorded.
* Additionally, `slowlog-log-slower-than` and `slowlog-max-len` are now
set as aliases for these two new configs.
---------
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
This PR builds upon the [previous entry prefetching
optimization](https://github.com/valkey-io/valkey/pull/1501) to further
enhance performance by implementing value prefetching for hashtable
iterators.
## Implementation
Modified `hashtableInitIterator` to accept a new flags parameter,
allowing control over iterator behavior.
Implemented conditional value prefetching within `hashtableNext` based
on the new `HASHTABLE_ITER_PREFETCH_VALUES` flag.
When the flag is set, hashtableNext now calls `prefetchBucketValues` at
the start of each new bucket, preemptively loading the values of filled
entries into the CPU cache.
The actual prefetching of values is performed using type-specific
callback functions implemented in `server.c`:
- For `robj` the `hashtableObjectPrefetchValue` callback is used to
prefetch the value if not embeded.
This implementation is specifically focused on main database iterations
at this stage. Applying it to hashtables that hold other object types
should not be problematic, but its performance benefits for those cases
will need to be proven through testing and benchmarking.
## Performance
### Setup:
- 64cores Graviton 3 Amazon EC2 instance.
- 50 mil keys with different value sizes.
- Running valkey server over RAM file system.
- crc checksum and comperssion off.
### Action
- save command.
### Results
The results regarding the duration of “save” command was taken from
“info all” command.
```
+--------------------+------------------+------------------+
| Prefetching | Value size (byte)| Time (seconds) |
+--------------------+------------------+------------------+
| No | 100 | 20.112279 |
| Yes | 100 | 12.758519 |
| No | 40 | 16.945366 |
| Yes | 40 | 10.902022 |
| No | 20 | 9.817000 |
| Yes | 20 | 9.626821 |
| No | 10 | 9.71510 |
| Yes | 10 | 9.510565 |
+--------------------+------------------+------------------+
```
The results largely align with our expectations, showing significant
improvements for larger values (100 bytes and 40 bytes) that are stored
outside the robj. For smaller values (20 bytes and 10 bytes) that are
embedded within the robj, we see almost no improvement, which is as
expected.
However, the small improvement observed even for these embedded values
is somewhat surprising. Given that we are not actively prefetching these
embedded values, this minor performance gain was not anticipated.
perf record on save command **without** value prefetching:
```
--99.98%--rdbSaveDb
|
|--91.38%--rdbSaveKeyValuePair
| |
| |--42.72%--rdbSaveRawString
| | |
| | |--26.69%--rdbWriteRaw
| | | |
| | | --25.75%--rioFileWrite.lto_priv.0
| | |
| | --15.41%--rdbSaveLen
| | |
| | |--7.58%--rdbWriteRaw
| | | |
| | | --7.08%--rioFileWrite.lto_priv.0
| | | |
| | | --6.54%--_IO_fwrite
| | |
| | |
| | --7.42%--rdbWriteRaw.constprop.1
| | |
| | --7.18%--rioFileWrite.lto_priv.0
| | |
| | --6.73%--_IO_fwrite
| |
| |
| |--40.44%--rdbSaveStringObject
| |
| --7.62%--rdbSaveObjectType
| |
| --7.39%--rdbWriteRaw.constprop.1
| |
| --7.04%--rioFileWrite.lto_priv.0
| |
| --6.59%--_IO_fwrite
|
|
--7.33%--hashtableNext.constprop.1
|
--6.28%--prefetchNextBucketEntries.lto_priv.0
```
perf record on save command **with** value prefetching:
```
rdbSaveRio
|
--99.93%--rdbSaveDb
|
|--79.81%--rdbSaveKeyValuePair
| |
| |--66.79%--rdbSaveRawString
| | |
| | |--42.31%--rdbWriteRaw
| | | |
| | | --40.74%--rioFileWrite.lto_priv.0
| | |
| | --23.37%--rdbSaveLen
| | |
| | |--11.78%--rdbWriteRaw
| | | |
| | | --11.03%--rioFileWrite.lto_priv.0
| | | |
| | | --10.30%--_IO_fwrite
| | | |
| | |
| | --10.98%--rdbWriteRaw.constprop.1
| | |
| | --10.44%--rioFileWrite.lto_priv.0
| | |
| | --9.74%--_IO_fwrite
| | |
| |
| |--11.33%--rdbSaveObjectType
| | |
| | --10.96%--rdbWriteRaw.constprop.1
| | |
| | --10.51%--rioFileWrite.lto_priv.0
| | |
| | --9.75%--_IO_fwrite
| | |
| |
| --0.77%--rdbSaveStringObject
|
--18.39%--hashtableNext
|
|--10.04%--hashtableObjectPrefetchValue
|
--6.06%--prefetchNextBucketEntries
```
Conclusions:
The prefetching strategy appears to be working as intended, shifting the
performance bottleneck from data access to I/O operations.
The significant reduction in rdbSaveStringObject time suggests that
string objects(which are the values) are being accessed more
efficiently.
Signed-off-by: NadavGigi <nadavgigi102@gmail.com>
This includes a way to run two versions of the server from the TCL test
framework. It's a preparation to add more cross-version tests. The
runtest script accepts a new parameter
./runtest --other-server-path path/to/valkey-server
and a new tag "needs:other-server" for test cases and start_server.
Tests with this tag are automatically skipped if `--other-server-path`
is not provided.
This PR adds it in a CI job with Valkey 7.2.7 by downloading a binary
release.
Fixes#76
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The desync regression test was created as a regression test for the
following bug:
in case we embed NULL termination inside inline/multi-bulk message we
will not be able to perform strchr in order to
identify the newline(\n)/carriage-return(\r) in the client query buffer.
this can influence (for example) replica reading primary stream and keep
filling it's query buffer endlessly consuming more and more memory.
In order to handle the above risk, a check was added to verify the
inline bulk and multi-bulk size are not exceeding the 64K bytes in the
query-buffer. A test was placed in order to verify this.
This PR introduce the following fixes to the desync regression test:
1. fix the sent payload to flush 1024 bytes block of 'A's instead of
'payload' which was sent by mistake.
2. Make sure that the connection is correctly terminated on protocol
error by the server after exceeding the 64K and not over 64K.
3. add another test intrinsic which will also verify the nested bulk
with embedded null termination (was not verified before)
fixes https://github.com/valkey-io/valkey/issues/1583
NOTE: Although it is possible to change the use of strchr to a more
"safe" utility (eg memchr) which will not pause scan at first occurrence
of '\0', we still like to protect against over excessive usage of the
query buffer and also preserve the current behavior(?). We will look
into improving this though in a followup issue.
---------
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com>
change the format of the dual channel replication logs so that it will
not
conflict with existing log formats like modules.
Fixes: https://github.com/valkey-io/valkey/issues/1509
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
The test case checks for expire-cycle in LATENCY LATEST, but with the
new hash table, the expiry-cycle is too fast to be logged by latency
monitor. Lower the latency monitor threshold to make it more likely to
be logged.
Fixes#1580
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
When processing a cluster bus PING extension, there is a memory leak
when adding a new key to the `nodes_black_list` dict. We now make sure
to free the key `sds` if the dict did not take ownership of it.
Signed-off-by: Pierre Turin <pieturin@amazon.com>
This issue affected only two message types (CLUSTERMSG_TYPE_PUBLISH and CLUSTERMSG_TYPE_PUBLISHSHARD) because they used a light message header, which caused the CLUSTER INFO stats to miss sent/received message information for those types.
---------
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
This commit creates a new compilation unit for the scripting engine code
by extracting the existing code from the functions unit.
We're doing this refactor to prepare the code for running the `EVAL`
command using different scripting engines.
This PR has a module API change: we changed the type of error messages
returned by the callback
`ValkeyModuleScriptingEngineCreateFunctionsLibraryFunc` to be a
`ValkeyModuleString` (aka `robj`);
This PR also fixes#1470.
---------
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
Some commands that use unix-time, such as `EXPIREAT` and `SET EXAT`, should include the deleted keys in the `expired_keys` statistics if the specified time has already expired, and notifications should be sent in the manner of expired.
---------
Signed-off-by: Ray Cao <zisong.cw@alibaba-inc.com>
Just like spell-check workflow, we should allow to trigger it
in push events, so that the forks repo can notice the format
thing way before submitting the PR.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Adds filter options to CLIENT LIST:
* USER <username>
Return clients authenticated by <username>.
* ADDR <ip:port>
Return clients connected from the specified address.
* LADDR <ip:port>
Return clients connected to the specified local address.
* SKIPME (YES|NO)
Exclude the current client from the list (default: no).
* MAXAGE <maxage>
Only list connections older than the specified age.
Modifies the ID filter to CLIENT KILL to allow multiple IDs
* ID <client-id> [<client-id>...]
Kill connections by client ids.
This makes CLIENT LIST and CLIENT KILL accept the same options.
For backward compatibility, the default value for SKIPME is NO for
CLIENT LIST and YES for CLIENT KILL.
The MAXAGE comes from CLIENT KILL, where it *keeps* clients with the
given max age and kills the older ones. This logic becomes weird for
CLIENT LIST, but is kept for similary with CLIENT KILL, for the use case
of first testing manually using CLIENT LIST, and then running CLIENT
KILL with the same filters.
The `ID client-id [client-id ...]` no longer needs to be the last
filter. The parsing logic determines if an argument is an ID or not
based on whether it can be parsed as an integer or not.
Partly addresses: #668
---------
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Add `paused_actions` and `paused_timeout_milliseconds` for INFO Clients
to inform users about if clients are paused.
---------
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
`sds` is a typedef of `char *`.
`const sds` means `char * const`, i.e. a const-pointer to non-const
content.
More often, you would want `const char *`, i.e. a pointer to
const-content. Until now, it's not possible to express that. This PR
adds `const_sds` which is a pointer to const-content sds.
To get a const-pointer to const-content sds, you can use `const
const_sds`.
In this PR, some uses of `const sds` are replaced by `const_sds`. We can
use it more later.
Fixes#1542
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
In some cases unix groups could have whitespace and/or `\` in them.
One example is my workstation. It's a MacOS in an Active Directory
domain. So my user has group `LD\Domain Users`.
Running `make test` on `unstable` and `8.0` branches fails with:
I'm not sure if we need to fix this in 8.0. But it seems that it should
be fixed in unstable.
Signed-off-by: secwall <secwall@yandex-team.ru>
This PR replaces dict with the new hashtable data structure in the HASH
datatype. There is a new struct for hashtable items which contains a
pointer to value sds string and the embedded key sds string. These
values were previously stored in dictEntry. This structure is kept
opaque so we can easily add small value embedding or other optimizations
in the future.
closes#1095
---------
Signed-off-by: Rain Valentine <rsg000@gmail.com>
After #1545 disabled some tests for reply schema validation, we now have
another issue that ECHO is not covered.
```
WARNING! The following commands were not hit at all:
echo
ERROR! at least one command was not hit by the tests
```
This patch adds a test case for ECHO in the unit/other test suite. I
haven't checked if there are more commands that aren't covered.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The commands used in valkey-cli tests are not important the reply schema
validation. Skip them to avoid the problem if tests hanging. This has
failed lately in the daily job:
```
[TIMEOUT]: clients state report follows.
sock55fedcc19be0 => (IN PROGRESS) valkey-cli pubsub mode with single standard channel subscription
Killing still running Valkey server 33357
```
These test cases use a special valkey-cli command `:get pubsub` command,
which is an internal command to valkey-cli rather than a Valkey server
command. This command hangs when compiled with with logreqres enabled.
Easy solution is to skip the tests in this setup.
The test cases were introduced in #1432.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
When the cluster changes, we need to persist the cluster configuration,
and these file IO operations may cause latency.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Imagine we have a cluster, for example a three-shard cluster,
if shard 1 doing a CLUSTER RESET HARD, it will change the node
name, and then other nodes will mark it as NOADR since the node
name received by PONG has changed.
In the eyes of other nodes, there is one working primary node
left but with no address, and in this case, the address report
in MOVED will be invalid and will confuse the clients. And in
the same time, the replica will not failover since its primary
is not in the FAIL state. And the cluster looks OK to everyone.
This leaves a cluster that appears OK, but with no coverage for
shard 1, obviously we should do something like CLUSTER FORGET
to remove the node and fix the cluster before using it.
But the point in here, we can mark the NOADDR node as FAIL to
advance the cluster state. If a node is NOADDR means it does
not have a valid address, so we won't reconnect it, we won't
send PING, we won't gossip it, it seems reasonable to mark it
as FAIL.
Signed-off-by: Binbin <binloveplay1314@qq.com>
When multiple primary nodes fail simultaneously, the cluster can not recover
within the default effective time (data_age limit). The main reason is that
the vote is without ranking among multiple replica nodes, which case too many
epoch conflicts.
Therefore, we introduced into ranking based on the failed primary shard-id.
Introduced a new failed_primary_rank var, this var means the rank of this
myself instance in the context of all failed primary list. This var will be
used in failover and we will do the failover election packets in order based
on the rank, this can effectively avoid the voting conflicts.
If a single primary is down, the behavior is the same as before. If multiple
primaries are down, their replica election initiation time will be delayed
by 500ms according to the ranking.
Signed-off-by: Binbin <binloveplay1314@qq.com>
When latency-monitor-threshold is set to 0, it means the latency monitor
is disabled, and in VM_LatencyAddSample, we wrote the condition
incorrectly, causing us to record latency when latency was turned off.
This bug was introduced in the very first day, see e3b1d6d, it was merged
in 2019.
Signed-off-by: Binbin <binloveplay1314@qq.com>
In #1441, we found a assert, and decided remove this assert and instead
just free the newly created node and close the link, since if we cannot
get the IP from the link it probably means the connection was closed.
```
=== VALKEY BUG REPORT START: Cut & paste starting from here ===
17847:M 19 Dec 2024 00:15:58.021 # === ASSERTION FAILED ===
17847:M 19 Dec 2024 00:15:58.021 # ==> cluster_legacy.c:3252 'nodeIp2String(node->ip, link, hdr->myip) == C_OK' is not true
------ STACK TRACE ------
17847 valkey-server *
src/valkey-server 127.0.0.1:27131 [cluster](clusterProcessPacket+0x1304) [0x4e5634]
src/valkey-server 127.0.0.1:27131 [cluster](clusterReadHandler+0x11e) [0x4e59de]
/__w/valkey/valkey/src/valkey-tls.so(+0x2f1e) [0x7f083983ff1e]
src/valkey-server 127.0.0.1:27131 [cluster](aeMain+0x8a) [0x41afea]
src/valkey-server 127.0.0.1:27131 [cluster](main+0x4d7) [0x40f547]
/lib64/libc.so.6(+0x40c8) [0x7f083985a0c8]
/lib64/libc.so.6(__libc_start_main+0x8b) [0x7f083985a18b]
src/valkey-server 127.0.0.1:27131 [cluster](_start+0x25) [0x410ef5]
```
But it also introduces another assert. The reason is that this new node
is not added to the cluster nodes dict.
```
17128:M 08 Jan 2025 10:51:44.061 # === ASSERTION FAILED ===
17128:M 08 Jan 2025 10:51:44.061 # ==> cluster_legacy.c:1693 'dictDelete(server.cluster->nodes, nodename) == DICT_OK' is not true
------ STACK TRACE ------
17128 valkey-server *
src/valkey-server 127.0.0.1:28627 [cluster][0x4ebdc4]
src/valkey-server 127.0.0.1:28627 [cluster][0x4e81d2]
src/valkey-server 127.0.0.1:28627 [cluster](clusterReadHandler+0x268)[0x4e8618]
/__w/valkey/valkey/src/valkey-tls.so(+0xb278)[0x7f109480b278]
src/valkey-server 127.0.0.1:28627 [cluster](aeMain+0x89)[0x592b09]
src/valkey-server 127.0.0.1:28627 [cluster](main+0x4b3)[0x453e23]
/lib64/libc.so.6(__libc_start_main+0xe5)[0x7f10958bf7e5]
src/valkey-server 127.0.0.1:28627 [cluster](_start+0x2e)[0x454a5e]
```
This closes#1527.
Signed-off-by: Binbin <binloveplay1314@qq.com>
The fix that Redis gave us for the CVE-2024-46981 was freeing lctx.lua,
and I didn't merge it correctly. We made some changes so that we are
able to async free the lua context, so we need to free the passed in
context. This was applied correctly on the two released versions (8.0
and 7.2) just not on unstable.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
This PR is to cleanup the `SERVER_TEST` compiler flag from cmake compile
definitions, as it is no longer required in the new unit test framework, see #428.
Signed-off-by: Karthick Ariyaratnam <karthyuom@gmail.com>
This PR introduces improvements to the hashtable iterator, implementing
prefetching technique described in the blog post [Unlock One Million RPS
- Part 2](https://valkey.io/blog/unlock-one-million-rps-part2/) . The
changes lay the groundwork for further enhancements in use cases
involving iterators. Future PRs will build upon this foundation to
improve performance and functionality in various iterator-dependent
operations.
In the pursuit of maximizing iterator performance, I conducted a
comprehensive series of experiments. My tests encompassed a wide range
of approaches, including processing multiple bucket indices in parallel,
prefetching the next bucket upon completion of the current one, and
several other timing and quantity variations. Surprisingly, after
rigorous testing and performance analysis, the simplest implementation
presented in this PR consistently outperformed all other more complex
strategies.
## Implementation
Each time we start iterating over a bucket, we prefetch data for future
iterations:
- We prefetch the entries of the next bucket (if it exists).
- We prefetch the structure (but not the entries) of the bucket after
the next.
This prefetching is done when we pick up a new bucket, increasing the
chance that the data will be in cache by the time we need it.
## Performance
The data below was taken by conducting keys command on 64cores Graviton
3 Amazon EC2 instance with 50 mil keys in size of 100 bytes each. The
results regarding the duration of “keys *” command was taken from “info
all” command.
```
+--------------------+------------------+-----------------------------+
| prefetching | Time (seconds) | Keys Processed per Second |
+--------------------+------------------+-----------------------------+
| No | 11.112279 | 4,499,529 |
| Yes | 3.141916 | 15,913,862 |
+--------------------+------------------+-----------------------------+
Improvement:
Comparing the iterator without prefetching to the one with prefetching,
we can see a speed improvement of 11.112279 / 3.141916 ≈ 3.54 times faster.
```
### Save command improvment
#### Setup:
- 64cores Graviton 3 Amazon EC2 instance.
- 50 mil keys in size of 100 bytes each.
- Running valkey server over RAM file system.
- crc checksum and comperssion off.
#### Results
```
+--------------------+------------------+-----------------------------+
| prefetching | Time (seconds) | Keys Processed per Second |
+--------------------+------------------+-----------------------------+
| No | 28 | 1,785,700 |
| Yes | 19.6 | 2,550,000 |
+--------------------+------------------+-----------------------------+
Improvement:
- Reduced SAVE time by 30% (8.4 seconds faster)
- Increased key processing rate by 42.8% (764,300 more keys/second)
```
Signed-off-by: NadavGigi <nadavgigi102@gmail.com>
Resolves issue with valkey-cli not auto exiting from subscribed mode on
reaching zero pub/sub subscription (previously filed on Redis)
https://github.com/redis/redis/issues/12592
---------
Signed-off-by: Nikhil Manglore <nmanglor@amazon.com>
This PR replaces dict with hashtable in the ZSET datatype. Instead of
mapping key to score as dict did, the hashtable maps key to a node in
the skiplist, which contains the score. This takes advantage of
hashtable performance improvements and saves 15 bytes per set item - 24
bytes overhead before, 9 bytes after.
Closes#1096
---------
Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
# Refactor client structure to use modular data components
## Current State
The client structure allocates memory for replication / pubsub /
multi-keys / module / blocked data for every client, despite these
features being used by only a small subset of clients. In addition the
current field layout in the client struct is suboptimal, with poor
alignment and unnecessary padding between fields, leading to a larger
than necessary memory footprint of 896 bytes per client. Furthermore,
fields that are frequently accessed together during operations are
scattered throughout the struct, resulting in poor cache locality.
## This PR's Change
1. Lazy Initialization
- **Components are only allocated when first used:**
- PubSubData: Created on first SUBSCRIBE/PUBLISH operation
- ReplicationData: Initialized only for replica connections
- ModuleData: Allocated when module interaction begins
- BlockingState: Created when first blocking command is issued
- MultiState: Initialized on MULTI command
2. Memory Layout Optimization:
- Grouped related fields for better locality
- Moved rarely accessed fields (e.g., client->name) to struct end
- Optimized field alignment to eliminate padding
3. Additional changes:
- Moved watched_keys to be static allocated in the `mstate` struct
- Relocated replication init logic to replication.c
### Key Benefits
- **Efficient Memory Usage:**
- 45% smaller base client structure - Basic clients now use 528 bytes
(down from 896).
- Better memory locality for related operations
- Performance improvement in high throughput scenarios. No performance
regressions in other cases.
### Performance Impact
Tested with 650 clients and 512 bytes values.
#### Single Thread Performance
| Operation | Dataset | New (ops/sec) | Old (ops/sec) | Change % |
|------------|---------|---------------|---------------|-----------|
| SET | 1 key | 261,799 | 258,261 | +1.37% |
| SET | 3M keys | 209,134 | ~209,000 | ~0% |
| GET | 1 key | 281,564 | 277,965 | +1.29% |
| GET | 3M keys | 231,158 | 228,410 | +1.20% |
#### 8 IO Threads Performance
| Operation | Dataset | New (ops/sec) | Old (ops/sec) | Change % |
|------------|---------|---------------|---------------|-----------|
| SET | 1 key | 1,331,578 | 1,331,626 | -0.00% |
| SET | 3M keys | 1,254,441 | 1,152,645 | +8.83% |
| GET | 1 key | 1,293,149 | 1,289,503 | +0.28% |
| GET | 3M keys | 1,152,898 | 1,101,791 | +4.64% |
#### Pipeline Performance (3M keys)
| Operation | Pipeline Size | New (ops/sec) | Old (ops/sec) | Change % |
|-----------|--------------|---------------|---------------|-----------|
| SET | 10 | 548,964 | 538,498 | +1.94% |
| SET | 20 | 606,148 | 594,872 | +1.89% |
| SET | 30 | 631,122 | 616,606 | +2.35% |
| GET | 10 | 628,482 | 624,166 | +0.69% |
| GET | 20 | 687,371 | 681,659 | +0.84% |
| GET | 30 | 725,855 | 721,102 | +0.66% |
### Observations:
1. Single-threaded operations show consistent improvements (1-1.4%)
2. Multi-threaded performance shows significant gains for large
datasets:
- SET with 3M keys: +8.83% improvement
- GET with 3M keys: +4.64% improvement
3. Pipeline operations show consistent improvements:
- SET operations: +1.89% to +2.35%
- GET operations: +0.66% to +0.84%
4. No performance regressions observed in any test scenario
Related issue:https://github.com/valkey-io/valkey/issues/761
---------
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: uriyage <78144248+uriyage@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This commit, https://github.com/valkey-io/valkey/pull/1504, moved the
wrong worker to ubuntu 22. We wanted to move codecov and not coverity.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
It's inconvenient for client implementations to extract the
`availability_zone` information from the `INFO` response. The `INFO`
response contains a lot of information that a client implementation
typically doesn't need.
This PR adds the availability zone to the `HELLO` response. Clients
usually already use the `HELLO` command for protocol negotiation and
also get the server `version` and `role` from its response. To keep the
`HELLO` response small, the field is only added if availability zone is
configured.
---------
Signed-off-by: Rueian <rueiancsie@gmail.com>
Reset GC state before closing the lua VM to prevent user data to be
wrongly freed while still might be used on destructor callbacks.
Created and publish by Redis in their OSS branch.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: YaacovHazan <yaacov.hazan@redis.com>
The explanation on the original commit was wrong. Key based access must
have a `~` in order to correctly configure whey key prefixes to apply
the selector to. If this is missing, a server assert will be triggered
later.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: YaacovHazan <yaacov.hazan@redis.com>
This may speed up the transition to the fail state a bit.
Previously we would only check when we received a pfail/fail
report from others in gossip. If myself is the last vote,
we can directly switch to fail in here without waiting for
the next gossip packet.
Signed-off-by: Binbin <binloveplay1314@qq.com>
When building with `CMake` (especially the targets `valkey-cli`,
`valkey-server` and `valkey-benchmark`) it is possible to have a
successful build while having warnings.
This PR fixes this - which is aligned with how the `Makefile` is working
today:
- Enable `-Wall` + `-Werror` for valkey targets
- Fixed warning in valkey-cli:jsonStringOutput method
Signed-off-by: Eran Ifrah <eifrah@amazon.com>
The issues in #1453 seem to
have only shown up since we moved to ubuntu 24, as part of the rolling
`ubunut-latest` migration from 22->24.
Closes#1453.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
When looking up a key in no-touch mode, `LOOKUP_NOTOUCH` is set to avoid
updating the last access time in `lookupKey`. An exception must be made
for the `TOUCH` command which must always update the key.
When called from a script, `server.executing_client` will point to the
`TOUCH` command, while `server.current_client` will point to e.g. an
`EVAL` command. So, we must use the former to find out the currently
executing command if defined.
This fix addresses the issue where TOUCH wasn't updating key access
times when called from scripts like EVAL.
Fixes#1498
Signed-off-by: Simon Baatz <gmbnomis@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
In this commit we move all structures and functions declarations related
to Valkey modules from `server.h` to the recently added `module.h` file.
This re-organization makes it easier for new contributors to find the
valkey modules related code, as well as reducing the compilation times
when changes are made to the modules code.
---------
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
### Fix race with pending writes in replica state transition
#### The Problem
In #60 (Dual channel replication) a new `connWrite` call was added
before the `waitForClientIO` check. This created a race condition where
the main thread may attempt to write to a client that could have pending
writes in IO threads.
#### The Fix
Moved the `waitForClientIO()` call earlier in `syncCommand`, before any
`connWrite` call. This ensures all pending IO operations are completed
before attempting to write to the client.
---------
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Currently, in case a blocked command is unblocked externally (eg. due to
the relevant slot being migrated or the CLIENT UNBLOCK command was
issued, the command statistics will always update the failed_calls error
statistic. This leads to missalignment with
90b9f08e5d
as well as some inconsistencies. For example when a key is migrated
during cluster slot migration, clients blocked on XREADGROUP will be
unblocked and update the rejected_calls stat, while clients blocked on
BLPOP will get unblocked updating the failed_calls stat.
In this PR we add explicit indication in updateStatsOnUnblock thet
indicates if the command was rejected or failed.
---------
Signed-off-by: ranshid <ranshid@amazon.com>
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Add `meet_sent` field in `clusterNode` indicating the last time we sent
a MEET packet. Use this field to only (re-)send a MEET packet once every
handshake timeout period when detecting a node without an inbound link.
When receiving multiple MEET packets on the same link while the node is
in handshake state, instead of dropping the packet, we now simply
prevent the creation of a new node. This way we still process the MEET
packet's gossip and reply with a PONG as any other packets.
Improve some logging messages to include `human_nodename`. Add
`nodeExceedsHandshakeTimeout()` function.
This is a follow-up to this previous PR:
https://github.com/valkey-io/valkey/pull/1307
And a partial fix to the crash described in:
https://github.com/valkey-io/valkey/pull/1436
---------
Signed-off-by: Pierre Turin <pieturin@amazon.com>
In the `arguments` section, the `arguments` key is only used for
arguments of type `block` or `oneof`.
Consequently, the `arguments` given for `IFEQ` are ignored by the
server. However, they lead to strange results when rendering the
command's page for the web documentation.
Fix this by removing `arguments` for `IFEQ`.
Signed-off-by: Simon Baatz <gmbnomis@gmail.com>
This PR fixes the missing stat update for `total_net_repl_output_bytes`
that was removed during the refactoring in PR #758. The metric was not
being updated when writing to replica connections.
Changes:
- Restored the stat update in postWriteToClient for replica connections
- Added integration test to verify the metric is properly updated
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
This PR introduces a new mechanism for temporarily changing the
server's loading_rio context during RDB loading operations. The new
`RDB_SCOPED_LOADING_RIO` macro allows for a scoped change of the
`server.loading_rio` value, ensuring that it's automatically restored
to its original value when the scope ends.
Introduces a dedicated flag to `rio` to signal immediate abort,
preventing
potential use-after-free scenarios during replication disconnection in
dual-channel load. This ensures proper termination of
`rdbLoadRioWithLoadingCtx`
when replication is cancelled due to connection loss on main connection.
Fixes https://github.com/valkey-io/valkey/issues/1152
---------
Signed-off-by: naglera <anagler123@gmail.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: Amit Nagler <58042354+naglera@users.noreply.github.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>
- By not waiting `repl-diskless-sync-delay` when we don't have to, we
can reduce ~30% of dual channel tests execution time.
- This commit also drops one test which is not required for regular sync
(`Sync should continue if not all slaves dropped`).
- Skip dual channel test with master diskless disabled because it will
initiate the same synchronization process as the non-dual channel test,
making it redundant.
Before:
```
Execution time of different units:
171 seconds - integration/dual-channel-replication
305 seconds - integration/replication-psync
\o/ All tests passed without errors!
```
After:
```
Execution time of different units:
120 seconds - integration/dual-channel-replication
236 seconds - integration/replication-psync
\o/ All tests passed without errors!
```
Discused on https://github.com/valkey-io/valkey/pull/1173
---------
Signed-off-by: naglera <anagler123@gmail.com>
As part of #1463, I made a small refactor between the PR and the daily
test I submitted to try to improve readability by adding a function to
abstract the extraction of the message types. However, that change
apparently caused GCC to throw another warning, so reverting the
abstraction on just one line.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
This PR extends the module API to support the addition of different
scripting engines to execute user defined functions.
The scripting engine can be implemented as a Valkey module, and can be
dynamically loaded with the `loadmodule` config directive, or with the
`MODULE LOAD` command.
This PR also adds an example of a dummy scripting engine module, to show
how to use the new module API. The dummy module is implemented in
`tests/modules/helloscripting.c`.
The current module API support, only allows to load scripting engines to
run functions using `FCALL` command.
The additions to the module API are the following:
```c
/* This struct represents a scripting engine function that results from the
* compilation of a script by the engine implementation. */
struct ValkeyModuleScriptingEngineCompiledFunction
typedef ValkeyModuleScriptingEngineCompiledFunction **(*ValkeyModuleScriptingEngineCreateFunctionsLibraryFunc)(
ValkeyModuleScriptingEngineCtx *engine_ctx,
const char *code,
size_t timeout,
size_t *out_num_compiled_functions,
char **err);
typedef void (*ValkeyModuleScriptingEngineCallFunctionFunc)(
ValkeyModuleCtx *module_ctx,
ValkeyModuleScriptingEngineCtx *engine_ctx,
ValkeyModuleScriptingEngineFunctionCtx *func_ctx,
void *compiled_function,
ValkeyModuleString **keys,
size_t nkeys,
ValkeyModuleString **args,
size_t nargs);
typedef size_t (*ValkeyModuleScriptingEngineGetUsedMemoryFunc)(
ValkeyModuleScriptingEngineCtx *engine_ctx);
typedef size_t (*ValkeyModuleScriptingEngineGetFunctionMemoryOverheadFunc)(
void *compiled_function);
typedef size_t (*ValkeyModuleScriptingEngineGetEngineMemoryOverheadFunc)(
ValkeyModuleScriptingEngineCtx *engine_ctx);
typedef void (*ValkeyModuleScriptingEngineFreeFunctionFunc)(
ValkeyModuleScriptingEngineCtx *engine_ctx,
void *compiled_function);
/* This struct stores the callback functions implemented by the scripting
* engine to provide the functionality for the `FUNCTION *` commands. */
typedef struct ValkeyModuleScriptingEngineMethodsV1 {
uint64_t version; /* Version of this structure for ABI compat. */
/* Library create function callback. When a new script is loaded, this
* callback will be called with the script code, and returns a list of
* ValkeyModuleScriptingEngineCompiledFunc objects. */
ValkeyModuleScriptingEngineCreateFunctionsLibraryFunc create_functions_library;
/* The callback function called when `FCALL` command is called on a function
* registered in this engine. */
ValkeyModuleScriptingEngineCallFunctionFunc call_function;
/* Function callback to get current used memory by the engine. */
ValkeyModuleScriptingEngineGetUsedMemoryFunc get_used_memory;
/* Function callback to return memory overhead for a given function. */
ValkeyModuleScriptingEngineGetFunctionMemoryOverheadFunc get_function_memory_overhead;
/* Function callback to return memory overhead of the engine. */
ValkeyModuleScriptingEngineGetEngineMemoryOverheadFunc get_engine_memory_overhead;
/* Function callback to free the memory of a registered engine function. */
ValkeyModuleScriptingEngineFreeFunctionFunc free_function;
} ValkeyModuleScriptingEngineMethodsV1;
/* Registers a new scripting engine in the server.
*
* - `engine_name`: the name of the scripting engine. This name will match
* against the engine name specified in the script header using a shebang.
*
* - `engine_ctx`: engine specific context pointer.
*
* - `engine_methods`: the struct with the scripting engine callback functions
* pointers.
*/
int ValkeyModule_RegisterScriptingEngine(ValkeyModuleCtx *ctx,
const char *engine_name,
void *engine_ctx,
ValkeyModuleScriptingEngineMethods engine_methods);
/* Removes the scripting engine from the server.
*
* `engine_name` is the name of the scripting engine.
*
*/
int ValkeyModule_UnregisterScriptingEngine(ValkeyModuleCtx *ctx, const char *engine_name);
```
---------
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
We are getting a number of errors like:
```
array subscript ‘clusterMsg[0]’ is partly outside array bounds of ‘unsigned char[2272]’
```
Which is basically GCC telling us that we have an object which is longer
than the underlying storage of the allocation. We actually do this a
lot, but GCC is generally not aware of how big the underlying allocation
is, so it doesn't throw this error. We are specifically getting this
error because the msgBlock can be of variable length depending on the
type of message, but GCC assumes it's the longest one possible. The
solution I went with here was make the message type optional, so that it
wasn't included in the size. I think this also makes some sense, since
it's really just a helper for us to easily cast the object around.
I considered disabling this error, but it is generally pretty useful
since it can catch real issues. Another solution would be to
over-allocate to the largest possible object, which could hurt
performance as we initialize it to zero.
Results:
https://github.com/madolson/valkey/actions/runs/12423414811/job/34686899884
This is a slightly cleaned up version of
https://github.com/valkey-io/valkey/pull/1439. I thought I had another
strategy but alas, it didn't work out.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
We set the client output buffer limits to 10 bytes, and then execute
`info stats` which produces more than 10 bytes of output, which can
cause that command to throw an error.
I'm not sure why it wasn't consistently erroring before, might have been
some change related to the ubuntu upgrade though.
Issues related to ubuntu-tls are hopefully resolved now.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
In #1459, I missed that the data was also used to keep track of the PID
files so if the testing framework crashed it would no longer be able to
cleanup the extra servers. So now we properly extract the PID and store
it so we can clean up PIDs.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
There are two changes here:
1. The one in clusterNodeCleanupFailureReports, only primary with slots can
report the failure report, if the primary became a replica its failure report
should be cleared. This may lead to inaccurate node fail judgment in some network
partition cases i guess, it will also affect the CLUSTER COUNT-FAILURE-REPORTS
command.
2. The one in clusterProcessGossipSection, it is not that important, but it can
print a "node is back online" log helps us troubleshoot the problem, although
it may conflict with 1 at some points.
Signed-off-by: Binbin <binloveplay1314@qq.com>
- Moves `build-config.json` to workflow dir to build old versions with
new configs.
- Enables contributors to test release Wf on private repo by adding
`github.event_name == 'workflow_dispatch' ||`
---------
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
**Background**
When conducting performance tests using `valkey-benchmark`, reading from
replicas was not supported. Consequently, even in cluster mode, all
reads were directed to the primary nodes. This limitation made it
challenging to obtain accurate metrics during workload stress testing
for performance measurement or before a version upgrade.
Related issue : https://github.com/valkey-io/valkey/issues/900
**Changes**
1. Replaced the use of `CLUSTER NODES` with `CLUSTER SLOTS` when
fetching cluster configuration. This allows for easier identification of
replica slots.
2. Support for reading from replicas by executing the client in
`READONLY` mode.
3. Support reading from replicas even during slot migrations.
4. Introduced two CLI options `--rfr` to enable reading from replicas
only or all cluster nodes. A warning added to indicate that write
requests might not be handled correctly when using this option.
---------
Signed-off-by: bluayer <ijacsong98@gmail.com>
Signed-off-by: bluayer <bluayer@gmail.com>
Signed-off-by: Jungwoo Song <37579681+bluayer@users.noreply.github.com>
Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>
The old reqEpoch mainly refers to requestCurrentEpoch, see:
```
if (requestCurrentEpoch < server.cluster->currentEpoch) {
serverLog(LL_WARNING, "Failover auth denied to %.40s (%s): reqEpoch (%llu) < curEpoch(%llu)", node->name,
node->human_nodename, (unsigned long long)requestCurrentEpoch,
(unsigned long long)server.cluster->currentEpoch);
return;
}
```
And in here we refer to requestConfigEpoch, it's a bit misleading,
so change it to reqConfigEpoch to make it clear.
Signed-off-by: Binbin <binloveplay1314@qq.com>
There are some tests that fail and give no useful information since they are
outside of a test context. Now we will at least get the file we are located in.
We can sort of reverse engineer where we are in the test by seeing which
tests have finished in a file.
```
[TIMEOUT]: clients state report follows.
sock6 => (SPAWNED SERVER) pid:30375 - tests/unit/info.tcl
Killing still running Valkey server 30375 - tests/unit/info.tcl
```
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
The test attempts to write 1MB of data in order to trigger a disconnect.
Normally, the data is fully flushed and we get the error on the read
(I/O error). However, it's possible we might fail the write, which
leaves the client in an inconsistent state. On the next command, we
finally process the I/O error on the FD. So, the simple fix is to
consume any secondary errors.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Asan now supports making sure you are passing in the correct pointer
type, which seems useful but we can't support it since we pass in an
incorrect pointer in several places. This is most commonly done with
generic free functions, where we simply cast it to the correct type.
It's not a lot of code to clean up, so it seems appropriate to cleanup
instead of disabling the check.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Introduce compile time option to force activedefrag to run even when
jemalloc is not used as the allocator.
This is in order to be able to run tests with defrag enabled
while using memory instrumentation tools.
fixes: https://github.com/valkey-io/valkey/issues/1241
---------
Signed-off-by: ranshid <ranshid@amazon.com>
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Avoid tmpfs as fadvise(FADV_DONTNEED) has no effect on memory-backed
filesystems.
Fixes https://github.com/valkey-io/valkey/issues/897
---------
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
We have set the secret as `AWS_S3_TEST_BUCKET` for test bucket and I
missed it in the initial review.
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
After #1307 got merged, we notice there is a assert happen in setClusterNodeToInboundClusterLink:
```
=== ASSERTION FAILED ===
==> '!link->node' is not true
```
In #778, we will call setClusterNodeToInboundClusterLink to attach the node to the link
during the MEET processing, so if we receive a another MEET packet in a short time, the
node is still in handshake state, we will meet this assert and crash the server.
If the link is bound to a node and the node is in the handshake state, and we receive
a MEET packet, it may be that the sender sent multiple MEET packets so in here we are
dropping the MEET to avoid the assert in setClusterNodeToInboundClusterLink. The assert
will happen if the other sends a MEET packet because it detects that there is no inbound
link, this node creates a new node in HANDSHAKE state (with a random node name), and
respond with a PONG. The other node receives the PONG and removes the CLUSTER_NODE_MEET
flag. This node is supposed to open an outbound connection to the other node in the next
cron cycle, but before this happens, the other node re-sends a MEET on the same link
because it still detects no inbound connection.
Note that in getNodeFromLinkAndMsg, the node in the handshake state has a random name
and not truly "known", so we don't know the sender. Dropping the MEET packet can prevent
us from creating a random node, avoid incorrect link binding, and avoid duplicate MEET
packet eliminate the handshake state.
Signed-off-by: Binbin <binloveplay1314@qq.com>
This is a follow of #1305, we now decided to apply the same change
to automatic failover as well, that is, move forward with removing
it for both automatic and manual failovers.
Quote from Ping during the review:
Note that we already debounce transient primary failures with node
timeout, ensuring failover is only triggered after sustained outages.
Election timing is naturally staggered by replica spacing, making the
likelihood of simultaneous elections from replicas of the same shard
very low. The one-vote-per-epoch rule further throttles retries and
ensures orderly elections. On top of that, quorum-based primary failure
confirmation, cluster-state convergence, and slot ownership validation
are all built into the process.
Quote from Madelyn during the review:
It against the specific primary. It's to prevent double failovers.
If a primary just took over we don't want someone else to try to
take over and give the new primary some amount of time to take over.
I have not seen this issue though, it might have been over optimizing?
The double failure mode, where a node fails and then another node fails
within the nodetimeout also doesn't seem that common either though.
So the conclusion is that we all agreed to remove it completely,
it will make the code a lot simpler. And if there is other specific
edge cases we are missing, we will fix it in other way.
See discussion #1305 for more information.
Signed-off-by: Binbin <binloveplay1314@qq.com>
The new `hashtable` provides faster lookups and uses less memory than
`dict`.
A TCL test case "SRANDMEMBER with a dict containing long chain" is
deleted because it's covered by a hashtable unit test
"test_random_entry_with_long_chain", which is already present.
This change also moves some logic from dismissMemory (object.c) to
zmadvise_dontneed (zmalloc.c), so the hashtable implementation which
needs the dismiss functionality doesn't need to depend on object.c and
server.h.
This PR follows #1186.
---------
Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
We deprecate the usage of classic malloc and free, but under certain
circumstances they might get imported from intrinsics. The original
thought is we should just override malloc and free to use zmalloc and
zfree, but I think we should continue to deprecate it to avoid
accidental imports of allocations.
Closes https://github.com/valkey-io/valkey/issues/1434.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Fixes four cases where `stringmatchlen` could overrun the pattern if it
is not terminated with NUL.
These commits are cherry-picked from my
[fork](https://github.com/thaliaarchi/antirez-stringmatch) which
extracts `stringmatch` as a library and compares it to other projects by
antirez which use the same matcher.
Signed-off-by: Thalia Archibald <thalia@archibald.dev>
This update addresses several issues in defrag:
1. In the defrag redesign
(https://github.com/valkey-io/valkey/pull/1242), a bug was introduced
where `server.cronloops` was no longer being incremented in the
`whileBlockedCron()`. This resulted in some memory statistics not being
updated while blocked.
2. In the test case for AOF loading, we were seeing errors due to defrag
latencies. However, running the math, the latencies are justified given
the extremely high CPU target of the testcase. Adjusted the expected
latency check to allow longer latencies for this case where defrag is
undergoing starvation while AOF loading is in progress.
3. A "stage" is passed a "target". For the main dictionary and expires,
we were passing in a `kvstore*`. However, on flushall or swapdb, the
pointer may change. It's safer and more stable to use an index for the
DB (a DBID). Then if the pointer changes, we can detect the change, and
simply abort the stage. (If there's still fragmentation to deal with,
we'll pick it up again on the next cycle.)
4. We always start a new stage on a new defrag cycle. This gives the new
stage time to run, and prevents latency issues for certain stages which
don't operate incrementally. However, often several stages will require
almost no work, and this will leave a chunk of our CPU allotment unused.
This is mainly an issue in starvation situations (like AOF loading or
LUA script) - where defrag is running infrequently, with a large
duty-cycle. This change allows a new stage to be initiated if we still
have a standard duty-cycle remaining. (This can happen during starvation
situations where the planned duty cycle is larger than the standard
cycle. Most likely this isn't a concern for real scenarios, but it was
observed in testing.)
5. Minor comment correction in `server.h`
Signed-off-by: Jim Brunner <brunnerj@amazon.com>
This change makes the binary build on the target ubuntu version.
This PR also deprecated ubuntu18 and valkey will not support:
- X86:
- Ubuntu 20
- Ubuntu 22
- Ubuntu 24
- ARM:
- Ubuntu 20
- Ubuntu 22
Removed ARM ubuntu 24 as the action we are using for ARM builds does not
support Ubuntu 24.
---------
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
## Summary
This PR fixes#1346 where we can get rid of the long term credentials by
using OpenID Connect. OpenID Connect (OIDC) allows your GitHub Actions
workflows to access resources in Amazon Web Services (AWS), without
needing to store the AWS credentials as long-lived GitHub secrets.
---------
Signed-off-by: vudiep411 <vdiep@amazon.com>
This can happen when scripts are running for long period of time and the server attempts to defrag it in the whileBlockedCron.
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
In some cases, when meeting a new node, if the handshake times out, we
can end up with an inconsistent view of the cluster where the new node
knows about all the nodes in the cluster, but the cluster does not know
about this new node (or vice versa).
To detect this inconsistency, we now check if a node has an outbound
link but no inbound link, in this case it probably means this node does
not know us. In this case we (re-)send a MEET packet to this node to do
a new handshake with it.
If we receive a MEET packet from a known node, we disconnect the
outbound link to force a reconnect and sending of a PING packet so that
the other node recognizes the link as belonging to us. This prevents
cases where a node could send MEET packets in a loop because it thinks
the other node does not have an inbound link.
This fixes the bug described in #1251.
---------
Signed-off-by: Pierre Turin <pieturin@amazon.com>
Addresses https://github.com/valkey-io/valkey/issues/1393
Changes:
* During AOF loading or long running script, this allows defrag to be
initiated.
* The AOF defrag test was corrected to eliminate the wait period and
rely on non-timer invocations.
* Logic for "overage" time in defrag was changed. It previously
accumulated underage leading to large latencies in extreme tests having
very high CPU percentage. After several simple stages were completed
during infrequent blocked processing, a large cycle time would be
experienced.
Signed-off-by: Jim Brunner <brunnerj@amazon.com>
There is a leak in here, hashtableTwoPhasePopDelete won't call the entry
destructor and like hashtablePop we need to call it by myself.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Instead of a dictEntry with pointers to key and value, the hashtable
has a pointer directly to the value (robj) which can hold an embedded
key and acts as a key-value in the hashtable. This minimizes the number
of pointers to follow and thus the number of memory accesses to lookup
a key-value pair.
Keys robj
hashtable
+-------+ +-----------------------+
| 0 | | type, encoding, LRU |
| 1 ------->| refcount, expire |
| 2 | | ptr |
| ... | | optional embedded key |
+-------+ | optional embedded val |
+-----------------------+
The expire timestamp (TTL) is also stored in the robj, if any. The expire
hash table points to the same robj.
Overview of changes:
* Replace dict with hashtable in kvstore (kvstore.c)
* Add functions for embedding key and expire in robj (object.c)
* When there's unused space, reserve an expire field to avoid realloting
it later if expire is added.
* Always reserve space for expire for large key names to avoid realloc
if it's set later.
* Update db functions (db.c)
* dbAdd, setKey and setExpire reallocate the object when embedding a key
* setKey does not increment the reference counter, since it would require
duplicating the object. This responsibility is moved to the caller.
* Remove logic for shared integer objects as values in the database. The keys
are now embedded in the objects, so all objects in the database need to be
unique. Thus, we can't use shared objects as values. Also delete test cases
for shared integers.
* Adjust various commands to the changes mentioned above.
* Adjust defrag code
* Improvement: Don't access the expires table before defrag has actually
reallocated the object.
* Adjust test cases that were using hard-coded sizes for dict when realloc
would happen, and some other adjustments in test cases.
* Adjust memory prefetch for new hash table implementation in IO-threading,
using new `hashtableIncrementalFind` API
* Adjust offloading of free() to IO threads: Object free to be done in main
thread while keeping obj->ptr offloading in IO-thread since the DB object is
now allocated by the main-thread and not by the IO-thread as it used to be.
* Let expireIfNeeded take an optional value, to avoid looking up the expires
table when possible.
---------
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: uriyage <78144248+uriyage@users.noreply.github.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Uri Yagelnik <uriy@amazon.com>
This changes the type of command tables from dict to hashtable. Command
table lookup takes ~3% of overall CPU time in benchmarks, so it is a
good candidate for optimization.
My initial SET benchmark comparison suggests that hashtable is about 4.5
times faster than dict and this replacement reduced overall CPU time by
2.79% 🥳
---------
Signed-off-by: Rain Valentine <rainval@amazon.com>
Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Rain Valentine <rainval@amazon.com>
A cache-line aware hash table with a user-defined key-value entry type,
supporting incremental rehashing, scan, iterator, random sampling,
incremental lookup and more...
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The following error has been seen, but not reliably reproduced:
```
*** [err]: eviction due to output buffers of pubsub, client eviction: true in tests/unit/maxmemory.tcl
Expected '42' to be equal to '50' (context: type proc line 17 cmd {assert_equal [r dbsize] 50} proc ::init_test level 2)
```
The reason is probably that FLUSHDB is asynchronous and when we start
populating new keys, they are evicted because the background flush is
too slow. Changing this to FLUSHDB SYNC prevents this.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
events-per-io-thread is for testing purposes that allow us to force the
main thread to always offload the works to the IO threads, see
adjustIOThreadsByEventLoad for more details.
Signed-off-by: Binbin <binloveplay1314@qq.com>
This fix changes the timeout for BLPOP in this test case from 1 second
to 0.5 seconds.
In the test case quoted below, the procedure
`wait_for_blocked_clients_count` waits for one second by default. If
BLPOP has 1 second timeout and the first
`wait_for_blocked_clients_count` finishes very fast, then the second
`wait_for_blocked_clients_count` can time out before the BLPOP has been
unblocked.
```TCL
test "CLUSTER SLOT-STATS cpu-usec for blocking commands, unblocked on timeout." {
# Blocking command with 1 second timeout.
set rd [valkey_deferring_client]
$rd BLPOP $key 1
# Confirm that the client is blocked, then unblocked after 1 second timeout.
wait_for_blocked_clients_count 1
wait_for_blocked_clients_count 0
```
As seen in the definition of `wait_for_blocked_clients_count`, the total
time to wait is 1 second by default.
```TCL
proc wait_for_blocked_clients_count {count {maxtries 100} {delay 10} {idx 0}} {
wait_for_condition $maxtries $delay {
[s $idx blocked_clients] == $count
} else {
fail "Timeout waiting for blocked clients"
}
}
```
Fixes#1121
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
- Add new flag "I" in `CLIENT LIST` for import-source client
- Add `DEBUG_CONFIG` for import-mode
- Allow import-source status to be turned off when import-mode is off
Fixes#1350 and
https://github.com/valkey-io/valkey/pull/1185#discussion_r1851049362.
---------
Signed-off-by: lvyanqi.lyq <lvyanqi.lyq@alibaba-inc.com>
Signed-off-by: Yanqi Lv <lvyanqi.lyq@alibaba-inc.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Binbin <binloveplay1314@qq.com>
This PR allows the Valkey users to perform conditional updates where the
SET command is completed if the given comparison-value matches the key’s
current value.
Syntax:
```
SET key value IFEQ comparison-value
```
Behavior:
If the values match, the SET completes as expected. If they do not
match, the command returns a (nil), except if the GET argument is also
given (see below).
Behavior with Additional Flags:
1. ```SET key value IFEQ comparison-value GET``` returns the existing
value, regardless of whether it matches comparison-value or not. The
conditional set operation is performed if the given comparison value
matches the existing value. To check if the SET succeeded, the caller
needs to check if the returned string matches the comparison-value.
2. ```SET key value IFEQ comparison-value XX``` is a syntax error.
3. ```SET key value IFEQ comparison-value NX``` is a syntax error.
Closes: #1215
---------
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
We added two more committers, but according to our governance document
that makes them TSC members. As we discussed, for now we want to keep
the balance of corporate interests, so so updating the governance to
explicitly list TSC members compared to folks with just write
permissions.
Also adds the new new folks with commit permissions.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Change assert crash log to also use binary representation like 5bdd72bea77d4bb237441c9a671e80edcdc998ad.
And do not print the password in assert crash log like 56eef6fb5ab7a755485c19f358761954ca459472.
In addition, for 5bdd72bea77d4bb237441c9a671e80edcdc998ad, we will print '"argv"',
because originally the code would print a '', and sdscatrepr will add an extra "",
so now removing the extra '' here.
Extract the getArgvReprString method and clean up the code a bit.
Examples:
```
debug assert "\x00abc"
before:
client->argv[0] = "debug" (refcount: 1)
client->argv[1] = "assert" (refcount: 1)
client->argv[2] = "" (refcount: 1)
after:
client->argv[0] = "debug" (refcount: 1)
client->argv[1] = "assert" (refcount: 1)
client->argv[2] = "\x00abc" (refcount: 1)
debug panic "\x00abc"
before:
argc: '3'
argv[0]: '"debug"'
argv[1]: '"panic"'
argv[2]: '"\x00abc"'
after:
argc: 3
argv[0]: "debug"
argv[1]: "panic"
argv[2]: "\x00abc"
```
Signed-off-by: Binbin <binloveplay1314@qq.com>
ZRANK is a widly used command for workloads using sorted-sets. For
example, in leaderboards It enables query the specific rank of a player.
The way ZRANK is currently implemented is:
1. locate the element in the SortedSet hashtable.
2. take the score of the element and use it in order to locate the
element in the SkipList (when listpack encoding is not used)
3. During the SkipLis scan for the elemnt we keep the path and use it in
order to sum the span in each path node in order to calculate the elemnt
rank
One problem with this approach is that it involves multiple compare
operations in order to locate the element. Specifically string
comparison can be expensive since it will require access multiple memory
locations for the items the element string is compared against.
Perf analysis showed this can take up to 20% of the rank scan time. (TBD
- provide the perf results for example)
We can improve the rank search by taking advantage of the fact that the
element node in the skiplist is pointed by the hashtable value!
Our Skiplist implementation is using FatKeys, where each added node is
assigned a randomly chosen height. Say we keep a height record for every
skiplist element. In order to get an element rank we simply:
1. locate the element in the SortedSet hashtable.
2. we go directly to the node in the skiplist.
3. we jump to the full height of the node and take the span value.
4. we continue going foreward and always jump to the heighst point in
each node we get to, making sure to sum all the spans.
5. we take off the summed spans from the SkipList length and we now have
the specific node rank. :)
In order to test this method I created several benchmarks. All
benchmarks used the same seeds and the lists contained 1M elements.
Since a very important factor is the number of scores compared to the
number of elements (since small ratio means more string compares during
searches) each benchmark test used different number of scores (1, 10K,
100K, 1M)
some results:
**TPS**
Scores range | non-optimized | optimized | gain
-- | -- | -- | --
1 | 416042 | 605363 | 45.51%
10K | 359776 | 459200 | 27.63%
100K | 380387 | 459157 | 20.71%
1M | 416059 | 450853 | 8.36%
**Latency**
Scores range | non-optimized | optimized | gain
-- | -- | -- | --
1 | 1.191000 | 0.831000 | -30.23%
10K | 1.383000 | 1.095000 | -20.82%
100K | 1.311000 | 1.087000 | -17.09%
1M | 1.191000 | 1.119000 | -6.05%
### Memory efficiency
adding another field to each skiplist node can cause degredation in
memory efficiency for large sortedsets. We use the fact that level 0
recorded span of ALL nodes can either be 1 or zero (for the last node).
So we use wrappers in order to get a node span and override the span for
level 0 to hold the node height.
---------
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
After #1009, we will reset the election when we received
a claim with an equal or higher epoch since a node can win
an election in the past.
But we need to consider the time before the node actually
obtains the failover_auth_epoch. The failover_auth_epoch
default is 0, so before the node actually get the failover
epoch, we might wrongly reset the election.
This is probably harmless, but will produce misleading log
output and may delay election by a cron cycle or beforesleep.
Now we will only reset the election when a node is actually
obtains the failover epoch.
Signed-off-by: Binbin <binloveplay1314@qq.com>
- Enable investigation of memory issues during loading
- Previously, all memory commands were rejected with LOADING error
(except memory help)
- `MEMORY MALLOC-STATS` and `MEMORTY PURGE` are now allowed
as they don't depend on the dataset
- `MEMORY STATS` and `MEMORY USAGE KEY` remain disallowed
Fixes#1299
Signed-off-by: Guillaume Koenig <knggk@amazon.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
We've had security issues in the past with it, which is why
we marked it as PROTECTED. But, modifying during runtime
is also a dangerous action. For example, when child processes
are running, persistent temp files and log files may have
unexpected effects.
A scenario for modifying dir at runtime is to migrate a disk
failure, such as using disk-based replication to migrate a node,
writing nodes.conf to save the cluster configuration.
We decided to leave it as is and add a note in the conf
about the dangers of modifying dir at runtime.
Signed-off-by: Binbin <binloveplay1314@qq.com>
By including <stdatomic.h> after the other includes in the unit test, we
can avoid redefining a macro which led to a build failure.
Fixes#1394
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
In the ValkeyModule_OnLoad method of the file hellodict.c, the parameter
keystep of ValkeyModule_CreateCommand should be 1. Otherwise, execute
command will coredump.
MODULE LOAD /home/tiger/valkey/src/modules/hellodict.so
COMMAND GETKEYS HELLODICT.SET key value
Signed-off-by: Codebells <1347103071@qq.com>
Determine the status of the Client when attempting to read data. If
state=CLIENT_COMPLETED_IO, no read attempt is made and I/O operations on
the Client are rescheduled by the main thread.
> And 20474 Byte = PROTO_IOBUF_LEN(16KB) + SDS_HDR_VAR(16, s)(4090 Byte)
Fixes#1385
---------
Signed-off-by: fengquyoumo <1455117463@qq.com>
Before Redis OSS 7, if we load a module with some arguments during
runtime,
and run the command "config rewrite", the module information will not be
saved into the
config file.
Since Redis OSS 7 and Valkey 7.2, if we load a module with some
arguments during runtime,
the module information (path, arguments number, and arguments value) can
be saved into the config file after config rewrite command is called.
Thus, the module will be loaded automatically when the server startup
next time.
Following is one example:
bind 172.25.0.58
port 7000
protected-mode no
enable-module-command yes
Generated by CONFIG REWRITE
latency-tracking-info-percentiles 50 99 99.9
dir "/home/ubuntu/valkey"
save 3600 1 300 100 60 10000
user default on nopass sanitize-payload ~* &* +https://github.com/ALL
loadmodule tests/modules/datatype.so 10 20
However, there is one problem.
If developers write a module, and update the running arguments by
someway, the updated arguments can not be saved into the config file
even "config rewrite" is called.
The reason comes from the following function
rewriteConfigLoadmoduleOption (src/config.c)
void rewriteConfigLoadmoduleOption(struct rewriteConfigState *state) {
..........
struct ValkeyModule *module = dictGetVal(de);
line = sdsnew("loadmodule ");
line = sdscatsds(line, module->loadmod->path);
for (int i = 0; i < module->loadmod->argc; i++) {
line = sdscatlen(line, " ", 1);
line = sdscatsds(line, module->loadmod->argv[i]->ptr);
}
rewriteConfigRewriteLine(state, "loadmodule", line, 1);
.......
}
The function only save the initial arguments information
(module->loadmod) into the configfile.
After core members discuss, ref
https://github.com/valkey-io/valkey/issues/1177
We decide add the following API to implement this feature:
Original proposal:
int VM_UpdateRunTimeArgs(ValkeyModuleCtx *ctx, int index, char *value);
Updated proposal:
ValkeyModuleString **values VM_GetRuntimeArgs(ValkeyModuleCtx *ctx);
**int VM_UpdateRuntimeArgs(ValkeyModuleCtx *ctx, int argc,
ValkeyModuleString **values);
Why we do not recommend the following way:
MODULE UNLOAD
Update module args in the conf file
MODULE LOAD
I think there are the following disadvantages:
1. Some modules can not be unloaded. Such as the example module
datatype.so, which is tests/modules/datatype.so
2. it is not atomic operation for MODULE UNLOAD + MODULE LOAD
3. sometimes, if we just run the module unload, the client business
could be interrupted
---------
Signed-off-by: hwware <wen.hui.ware@gmail.com>
Since 4695d118dd (#1209), RDMA supports builtin.
And module connection type may be removed in future. So run a builtin
RDMA support for CI workflow.
RDMA module is complied only in CI, keep it building check only until
module connection type gets obsolete.
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
The recent PR (https://github.com/valkey-io/valkey/pull/1242) converted
Active Defrag to use `monotime`. In that change, a conversion was
performed to continue to use `ustime()` as part of the module interface.
Since this time is only used internally, and never actually exposed to
the module, we can convert this to use `monotime` directly.
Signed-off-by: Jim Brunner <brunnerj@amazon.com>
### Improve expired commands performance with IO threads
#### Background
In our IO threads architecture, IO threads allocate client argv's and
later when we free it after processCommand we offload its free to the IO
threads.
With jemalloc, it's crucial that the same thread that allocates memory
also frees it.
For some commands we modify the client's argv in the main thread during
command processing (for example in `SET EX` command we rewrite the
command to use absolute time for replication propagation).
#### Current issues
1. When commands are rewritten (e.g., expire commands), we store the
original argv
in `c->original_argv`. However, we're currently:
- Freeing new argv (allocated by main thread) in IO threads
- Freeing original argv (allocated by IO threads) in main thread
2. Currently, `c->original_argv` points to new array with old
objects, while `c->argv` has old array with new objects, making memory
free management complicated.
#### Changes
1. Refactored argv modification handling code to ensure consistency -
both array and objects are now either all new or all old
2. Moved original_argv cleanup to happen in resetClient after argv
cleanup
3. Modified IO threads code to properly handle original argv cleanup
when argv are modified.
#### Performance Impact
Benchmark with `SET EX` commands (650 clients, 512 byte value, 8 IO
threads):
- New implementation: **729,548 ops/sec**
- Old implementation: **633,243 ops/sec**
Representing a **~15%** performance improvement due to more efficient
memory handling.
---------
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>
Refer to: https://github.com/valkey-io/valkey/issues/1141
This update refactors the defrag code to:
* Make the overall code more readable and maintainable
* Reduce latencies incurred during defrag processing
With this update, the defrag cycle time is reduced to 500us, with more
frequent cycles. This results in much more predictable latencies, with a
dramatic reduction in tail latencies.
(See https://github.com/valkey-io/valkey/issues/1141 for more complete
details.)
This update is focused mostly on the high-level processing, and does NOT
address lower level functions which aren't currently timebound (e.g.
`activeDefragSdsDict()`, and `moduleDefragGlobals()`). These are out of
scope for this update and left for a future update.
I fixed `kvstoreDictLUTDefrag` because it was using up to 7ms on a CME
single shard. See original github issue for performance details.
---------
Signed-off-by: Jim Brunner <brunnerj@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
This PR optimizes the performance of HyperLogLog commands (PFCOUNT,
PFMERGE) by adding AVX2 fast paths.
Two AVX2 functions are added for conversion between raw representation
and dense representation. They are 15 ~ 30 times faster than scalar
implementaion. Note that sparse representation is not accelerated.
AVX2 fast paths are enabled when the CPU supports AVX2 (checked at
runtime) and the hyperloglog configuration is default (HLL_REGISTERS ==
16384 && HLL_BITS == 6).
`PFDEBUG SIMD (ON|OFF)` subcommand is added for unit tests. A new TCL
unit test checks that the results produced by non-AVX2 and AVX2
implementations are exactly equal.
When merging 3 dense hll structures, the benchmark shows a 12x speedup
compared to the scalar version.
```
pfcount key1 key2 key3
pfmerge keyall key1 key2 key3
```
```
======================================================================================================
Type Ops/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
------------------------------------------------------------------------------------------------------
PFCOUNT-scalar 5665.56 35.29839 32.25500 63.99900 67.58300 608.60
PFCOUNT-avx2 72377.83 2.75834 2.67100 5.34300 6.81500 7774.96
------------------------------------------------------------------------------------------------------
PFMERGE-scalar 9851.29 20.28806 20.09500 36.86300 39.16700 615.71
PFMERGE-avx2 125621.89 1.59126 1.55100 3.11900 4.70300 15702.74
------------------------------------------------------------------------------------------------------
scalar: valkey:unstable 2df56d87c0ebe802f38e8922bb2ea1e4ca9cfa76
avx2: Nugine:hll-simd 8f9adc34021080d96e60bd0abe06b043f3ed0275
CPU: 13th Gen Intel® Core™ i9-13900H × 20
Memory: 32.0 GiB
OS: Ubuntu 22.04.5 LTS
```
Experiment repo: https://github.com/Nugine/redis-hyperloglog
Benchmark script:
https://github.com/Nugine/redis-hyperloglog/blob/main/scripts/memtier.sh
Algorithm:
https://github.com/Nugine/redis-hyperloglog/blob/main/cpp/bench.cpp
---------
Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>
This log allows us to easily distinguish between full coverage and
minority partition when the cluster fails. Sometimes it is not easy
to see the minority partition in a healthy shards (both primary and
replicas).
And we decided not to add a cluster_fail_reason field to cluster info.
Given that there are only two reasons and both are well-known and if
we ended up adding more down the road we can add it in the furture.
Signed-off-by: Binbin <binloveplay1314@qq.com>
**Motivation**
Copy-on-write (COW) amplification refers to the issue where writing to a
small object leads to the entire page being cloned, resulting in
inefficient memory usage. This issue arises during the BGSAVE process,
which can be particularly problematic on instances with limited memory.
If the BGSAVE process could release unneeded memory, it could reduce
memory consumption. To address this, the BGSAVE process calls the
`madvise` function to signal the operating system to reclaim the buffer.
However, this approach does not work for buffers smaller than a page
(usually 4KiB). Even after multiple such calls, where a full page may be
free, the operating system will not reclaim it.
To solve this issue, we can call `zfree` directly. This allows the
allocator (jemalloc) to handle the bookkeeping and release pages when
buffers are no longer needed. This approach reduces copy-on-write
events.
**Benchmarks**
To understand how usage of `zfree` affects BGSAVE and the memory
consumption I ran 45 benchmarks that compares my clonewith the vanilla
version. The benchmark has the following steps:
1. Start a new Valkey process
2. Fill the DB with data sequentially
3. Run a warmup to randomize the memory layout
4. Introduce fragmentation by deleting part of the keys
5. In parallel:
1. Trigger BGSAVE
2. Start 80/20 get/set load
I played the following parameters to understand their influence:
1. Number of keys: 3M, 6M, and 12M.
2. Data size. While key themselves are of fixed length ~30 bytes, the
value size is 120, 250, 500, 1000, and 2000 bytes.
3. Fragmentation. I delete 5%, 10%, and 15% of the original key range.
I'm attaching a graph of BGSAVE process memory consumption. Instead of
all benchmarks, I show the most representative runs IMO.
<img width="1570" alt="3m-fixed"
src="https://github.com/user-attachments/assets/3dbbc528-01c1-4821-a3c2-6be455e7f78a">
For 2000 bytes values peak memory usage is ~53% compared to vanilla. The
peak happens at 57% BGSAVE progress.
For 500 bytes values the peak is ~80% compared to vanilla. And happens
at ~80% progress.
For 120 bytes the difference is under 5%, and the patched version could
even use more memory.

For 12M keys, the peak is ~85% of the vanilla’s. Happens at ~70% mark.
For 6M keys, the peak is ~87% of the vanilla’s. Happens at ~77% mark.
For 3M keys, the peak is ~87% of the vanilla’s Happens at ~80% mark.
**Changes**
The PR contains 2 changes:
1. Static buffer for RDB comrpession.
RDB compression leads to COW events even without any write load if we
use `zfree`. It happens because the compression functions allocates a
new buffer for each object. Together with freeing objects with `zfree`
it leads to reusing of the memory shared with the main process.
To deal with this problem, we use a pre-allocated constant 8K buffer for
compression. If the object size is too big for this buffer, than we fall
back to the ad hoc allocation behavior.
2. Freeing string objects instead of dismissing them
Call to `zfree` is more expensive than direct call to `madvise`. But
with #453 strings use the fast path – `zfree_with_size`. As a possible
next step we can optimize `zfree` for other data types as well.
---------
Signed-off-by: Vadym Khoptynets <vadymkh@amazon.com>
Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
1. The test isn't waiting long enough for the output buffer to overrun.
This problem is happening because an error from the previous test is
bleeding into the current test's logs. The simplest fix would be to
split these tests.
2. Increased replication timeout to ensure sync fails due to output
buffer overrun before a timeout occurs.
Fixes#1367
Signed-off-by: naglera <anagler123@gmail.com>
We should reset repl_down_since only on state change, in the
current code, if the rdb channel in the dual channel is normal,
that is, rdb is loaded normally, but the psync channel is
abnormal, we will set repl_down_since 0 here. If the primary
is down at this time, the replica may be abnormal when calculating
data_age in cluster failover, since repl_state != REPL_STATE_CONNECTED,
this causes the replica to be unable to initiate an election due
to the old data_age.
In dualChannelSyncHandleRdbLoadCompletion, if the psync channel
is not established, the function will return. We will set repl_state
to REPL_STATE_CONNECTED and set repl_down_since to 0 in
dualChannelSyncSuccess, that is, in establishPrimaryConnection.
See also 677d10b2a8ff7f13033ccfe56ffcd246dbe70fb6 for more details.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Replace occurrences of 'zfree' with 'zfree_with_size' to improve
performance.
'zfree_with_size' function avoids calling 'zmalloc_size' to retrieve
buffer size and
uses previuos calculation of size for calling 'zfree_with_size'. This
results in faster
memory deallocation and reduces overhead.
Signed-off-by: stav bentov <stavbt@amazon.com>
Co-authored-by: stav bentov <stavbt@amazon.com>
There are several patches in this PR:
* Abstract set/rewrite config bind option: `bind` option is a special
config, `socket` and `tls` are using the same one. However RDMA uses the
similar style but different one. Use a bit abstract work to make it
flexible for both `socket` and `RDMA`. (Even for QUIC in the future.)
* Introduce closeListener for connection type: closing socket by a
simple syscall would be fine, RDMA has complex logic. Introduce
connection type specific close listener method.
* RDMA: Use valkey.conf style instead of module parameters: use
`--rdma-bind` and `--rdma-port` style instead of module parameters. The
module style config `rdma.bind` and `rdma.port` are removed.
* RDMA: Support builtin: support `make BUILD_RDMA=yes`. module style is
still kept for now.
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
In #1326 we make KEYS can visit expired key in import-source state
by updating keyIsExpired to check for import mode. But after #1205,
we now use keyIsExpiredWithDictIndex to optimize and remove the
redundant dict_index, and keyIsExpiredWithDictIndex does not handle
this logic.
In this commit, we handle keyIsExpiredWithDictIndex to make it check
for import mode as well so that KEYS can visit the expired key.
Signed-off-by: Binbin <binloveplay1314@qq.com>
After #1185, a client in import-source state can visit expired key
both in read commands and write commands, this commit handle
keyIsExpired function to handle import-source state as well, so
KEYS can visit the expired key.
This is not particularly important, but it ensures the definition,
also doing some cleanup around the test, verified that the client
can indeed visit the expired key.
Signed-off-by: Binbin <binloveplay1314@qq.com>
This PR introduces a consistent tagging system for dual-channel logs.
The goal is to improve log readability and filterability, making it
easier for operators to manage and analyze log entries.
Resolves https://github.com/valkey-io/valkey/issues/986
---------
Signed-off-by: naglera <anagler123@gmail.com>
The code is ok before 2de544cfcc6d1aa7cf6d0c75a6116f7fc27b6fd6,
but now we will set server.repl_transfer_fd right after dfd was
initiated, and in here we have a double close error since dfd and
server.repl_transfer_fd are the same fd.
Also move the declaration of dfd/maxtries to a small scope to avoid
the confusion since they are only used in this code.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Apparently on Mac, sleep will modify errno to ETIMEDOUT, and then it
prints the misleading message: Operation timed out.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Optimize sdscatrepr by reducing realloc calls, furthermore, we can reduce memcpy calls by
batch processing of consecutive printable characters.
Signed-off-by: Ray Cao <zisong.cw@alibaba.com>
Co-authored-by: Ray Cao <zisong.cw@alibaba.com>
Fast_float is a C++ header-only library to parse doubles using SIMD
instructions. The purpose is to speed up sorted sets and other commands
that use doubles. A single-file copy of fast_float is included in this
repo. This introduces an optional dependency on a C++ compiler.
The use of fast_float is enabled at compile time using the make variable
`USE_FAST_FLOAT=yes`. It is disabled by default.
Fixes#1069.
---------
Signed-off-by: Parth Patel <661497+parthpatel@users.noreply.github.com>
Signed-off-by: Parth <661497+parthpatel@users.noreply.github.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Roshan Swain <swainroshan001@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
We have a replicationEmptyDbCallback, it is a callback used by emptyData
while flushing away old data. Previously, we did not add this callback
logic for function, in case of abuse, there may be a lot of functions,
and also to make the code consistent, we add the same callback logic
for function.
Changes around this commit:
1. Extend emptyData / functionsLibCtxClear to support passing callback
when flushing functions.
2. Added disklessLoad function create and discard helper function, just
like disklessLoadInitTempDb and disklessLoadDiscardTempDb), we wll
always flush the temp function in a async way to avoid any block.
3. Cleanup around discardTempDb, remove the callback pointer since in
async way we don't need the callback.
4. Remove functionsLibCtxClear call in readSyncBulkPayload, because we
called emptyData in the previous lines, which also empty functions.
We are doing this callback in replication is because during the flush,
replica may block a while if the flush is doing in the sync way, to
avoid the primary to detect the replica is timing out, replica will use this
callback to notify the primary (we also do this callback when loading
a RDB). And in the async way, we empty the data in the bio and there is
no slw operation, so it will ignores the callback.
Signed-off-by: Binbin <binloveplay1314@qq.com>
These commands are all administrator commands. If they are operated
incorrectly, serious consequences may occur. Print the full client
info by using catClientInfoString, the info is useful when we want
to identify the source of request.
Since the origin client info is very large and might complicate the
output, we added a catClientInfoShortString function, it will only
print some basic fields, we want these fields that are useful to
identify the client. These fields are:
- id
- addr
- laddr
- connection info
- name
- user
- lib-name
- lib-ver
And also used it to replace the origin client info where it has the
same purpose. Some logging is changed from full client info to short
client info:
- CLUSTER FAILOVER
- FAILOVER / PSYNC
- REPLICAOF NO ONE
- SHUTDOWN
Signed-off-by: Binbin <binloveplay1314@qq.com>
When a node role changes, we should brocast the change to notify other nodes.
For example, one primary and one replica, after a failover, the replica became
a new primary, the primary became a new replica.
And then we trigger a second cluster failover for the new replica, the
new replica will send a MFSTART to its primary, ie, the new primary.
But the new primary may reject the MFSTART due to this logic:
```
} else if (type == CLUSTERMSG_TYPE_MFSTART) {
if (!sender || sender->replicaof != myself) return 1;
```
In the new primary views, sender is still a primary, and sender->replicaof
is NULL, so we will return. Then the manual failover timedout.
Another possibility is that other primaries refuse to vote after receiving
the FAILOVER_AUTH_REQUEST, since in their's views, sender is still a primary,
so it refuse to vote, and then manual failover timedout.
```
void clusterSendFailoverAuthIfNeeded(clusterNode *node, clusterMsg *request) {
...
if (clusterNodeIsPrimary(node)) {
serverLog(LL_WARNING, "Failover auth denied to...
```
The reason is that, currently, we only update the node->replicaof information
when we receive a PING/PONG from the sender. For details, see clusterProcessPacket.
Therefore, in some scenarios, such as clusters with many nodes and a large
cluster-ping-interval (that is, cluster-node-timeout), the role change of the node
will be very delayed.
Added a DEBUG DISABLE-CLUSTER-RANDOM-PING command, send cluster ping
to a random node every second (see clusterCron).
Signed-off-by: Binbin <binloveplay1314@qq.com>
We need to start making use of the new `WithDictIndex` APIs which allow
us to reuse the dict_index calculation (avoid over-calling `getKeySlot`
for no good reason).
In this PR I optimized `lookupKey` so it now calls `getKeySlot` to reuse
the dict_index two additional times. It also optimizes the keys command
to avoid unnecessary computation of the slot id.
---------
Signed-off-by: Nadav Levanoni <nadavl@amazon.com>
Co-authored-by: Nadav Levanoni <nadavl@amazon.com>
`cluster_legacy.c`: `slot_info_pairs` has `uint16_t` values, but they
were cast to `unsigned long` and `%i` was used.
`valkey-cli.c`: `node->replicas_count` is `int`, not `unsigned long`.
Signed-off-by: ArtSin <artsin666@gmail.com>
The test is incompatible with valgrind. Added a new `--valgrind`
argument to test suite, which will cause that test to be skipped.
We skipped it in the past, see 5b61b0dc6d2579ee484fa6cf29bfac59513f84ab
Signed-off-by: Binbin <binloveplay1314@qq.com>
If a manual failover got timed out, like the election don't get the
enough votes, since we have a auth_timeout and a auth_retry_time, a
new manual failover will not be able to proceed on the replica side.
Like if we initiate a new manual failover after a election timed out,
we will pause the primary, but on the replica side, due to retry_time,
replica does not trigger the new election and the manual failover will
eventually time out.
In this case, if we initiate manual failover again and there is an
ongoing election, we will reset it so that the replica can initiate
a new election at the manual failover's request.
Signed-off-by: Binbin <binloveplay1314@qq.com>
### Summary of the change
This is a base PR for refactoring defrag. It moves the defrag logic to
rely on jemalloc [native
api](https://github.com/jemalloc/jemalloc/pull/1463#issuecomment-479706489)
instead of relying on custom code changes made by valkey in the jemalloc
([je_defrag_hint](9f8185f5c8/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h (L382)))
library. This enables valkey to use latest vanila jemalloc without the
need to maintain code changes cross jemalloc versions.
This change requires some modifications because the new api is providing
only the information, not a yes\no defrag. The logic needs to be
implemented at valkey code. Additionally, the api does not provide,
within single call, all the information needed to make a decision, this
information is available through additional api call. To reduce the
calls to jemalloc, in this PR the required information is collected
during the `computeDefragCycles` and not for every single ptr, this way
we are avoiding the additional api call.
Followup work will utilize the new options that are now open and will
further improve the defrag decision and process.
### Added files:
`allocator_defrag.c` / `allocator_defrag.h` - This files implement the
allocator specific knowledge for making defrag decision. The knowledge
about slabs and allocation logic and so on, all goes into this file.
This improves the separation between jemalloc specific code and other
possible implementation.
### Moved functions:
[`zmalloc_no_tcache` , `zfree_no_tcache`
](4593dc2f05/src/zmalloc.c (L215))
- these are very jemalloc specific logic assumptions, and are very
specific to how we defrag with jemalloc. This is also with the vision
that from performance perspective we should consider using tcache, we
only need to make sure we don't recycle entries without going through
the arena [for example: we can use private tcache, one for free and one
for alloc].
`frag_smallbins_bytes` - the logic and implementation moved to the new
file
### Existing API:
* [once a second + when completed full cycle]
[`computeDefragCycles`](4593dc2f05/src/defrag.c (L916))
* `zmalloc_get_allocator_info` : gets from jemalloc _allocated, active,
resident, retained, muzzy_, `frag_smallbins_bytes`
*
[`frag_smallbins_bytes`](4593dc2f05/src/zmalloc.c (L690))
: for each bin; gets from jemalloc bin_info, `curr_regs`, `cur_slabs`
* [during defrag, for each pointer]
* `je_defrag_hint` is getting a memory pointer and returns {0,1} .
[Internally it
uses](4593dc2f05/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h (L368))
this information points:
* #`nonfull_slabs`
* #`total_slabs`
* #free regs in the ptr slab
## Jemalloc API (via ctl interface)
[BATCH][`experimental_utilization_batch_query_ctl`](4593dc2f05/deps/jemalloc/src/ctl.c (L4114))
: gets an array of pointers, returns for each pointer 3 values,
* number of free regions in the extent
* number of regions in the extent
* size of the extent in terms of bytes
[EXTENDED][`experimental_utilization_query_ctl`](4593dc2f05/deps/jemalloc/src/ctl.c (L3989))
:
* memory address of the extent a potential reallocation would go into
* number of free regions in the extent
* number of regions in the extent
* size of the extent in terms of bytes
* [stats-enabled]total number of free regions in the bin the extent
belongs to
* [stats-enabled]total number of regions in the bin the extent belongs
to
### `experimental_utilization_batch_query_ctl` vs valkey
`je_defrag_hint`?
[good]
- We can query pointers in a batch, reduce the overall overhead
- The per ptr decision algorithm is not within jemalloc api, jemalloc
only provides information, valkey can tune\configure\optimize easily
[bad]
- In the batch API we only know the utilization of the slab (of that
memory ptr), we don’t get the data about #`nonfull_slabs` and total
allocated regs.
## New functions:
1. `defrag_jemalloc_init`: Reducing the cost of call to je_ctl: use the
[MIB interface](https://jemalloc.net/jemalloc.3.html) to get a faster
calls. See this quote from the jemalloc documentation:
The mallctlnametomib() function provides a way to avoid repeated name
lookups for
applications that repeatedly query the same portion of the namespace,by
translating
a name to a “Management Information Base” (MIB) that can be passed
repeatedly to
mallctlbymib().
6. `jemalloc_sz2binind_lgq*` : this api is to support reverse map
between bin size and it’s info without lookup. This mapping depends on
the number of size classes we have that are derived from
[`lg_quantum`](4593dc2f05/deps/Makefile (L115))
7. `defrag_jemalloc_get_frag_smallbins` : This function replaces
`frag_smallbins_bytes` the logic moved to the new file allocator_defrag
`defrag_jemalloc_should_defrag_multi` → `handle_results` - unpacks the
results
8. `should_defrag` : implements the same logic as the existing
implementation
[inside](9f8185f5c8/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h (L382))
je_defrag_hint
9. `defrag_jemalloc_should_defrag_multi` : implements the hint for an
array of pointers, utilizing the new batch api. currently only 1 pointer
is passed.
### Logical differences:
In order to get the information about #`nonfull_slabs` and #`regs`, we
use the query cycle to collect the information per size class. In order
to find the index of bin information given bin size, in o(1), we use
`jemalloc_sz2binind_lgq*` .
## Testing
This is the first draft. I did some initial testing that basically
fragmentation by reducing max memory and than waiting for defrag to
reach desired level. The test only serves as sanity that defrag is
succeeding eventually, no data provided here regarding efficiency and
performance.
### Test:
1. disable `activedefrag`
2. run valkey benchmark on overlapping address ranges with different
block sizes
3. wait untill `used_memory` reaches 10GB
4. set `maxmemory` to 5GB and `maxmemory-policy` to `allkeys-lru`
5. stop load
6. wait for `mem_fragmentation_ratio` to reach 2
7. enable `activedefrag` - start test timer
8. wait until reach `mem_fragmentation_ratio` = 1.1
#### Results*:
(With this PR)Test results: ` 56 sec`
(Without this PR)Test results: `67 sec`
*both runs perform same "work" number of buffers moved to reach
fragmentation target
Next benchmarking is to compare to:
- DONE // existing `je_get_defrag_hint`
- compare with naive defrag all: `int defrag_hint() {return 1;}`
---------
Signed-off-by: Zvi Schneider <ezvisch@amazon.com>
Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com>
Signed-off-by: zvi-code <54795925+zvi-code@users.noreply.github.com>
Co-authored-by: Zvi Schneider <ezvisch@amazon.com>
Co-authored-by: Zvi Schneider <zvi.schneider22@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
This change prevents unintended side effects on connection state and
improves consistency with non-TLS sync operations.
For example, when invoking `connTLSSyncRead` with a blocking file
descriptor, the mode is switched to non-blocking upon `connTLSSyncRead`
exit. If the code assumes the file descriptor remains blocking and calls
the normal `read` expecting it to block, it may result in a short read.
This caused a crash in dual-channel, which was fixed in this PR by
relocating `connBlock()`:
https://github.com/valkey-io/valkey/pull/837
Signed-off-by: xbasel <103044017+xbasel@users.noreply.github.com>
FUNCTION RESTORE have a FLUSH option, it will delete all the existing
libraries before restoring the payload. If for some reasons, there are
a lot of libraries, we will block a while in here.
Signed-off-by: Binbin <binloveplay1314@qq.com>
The goto error label is the same as the error return, use goto
to reduce the references.
```
error:
cancelReplicationHandshake(1);
return;
```
Also this can make the log printing more continuous under the
error, that is, we print the error log first, and then print
the reconnecting log at the last (in cancelReplicationHandshake).
Signed-off-by: Binbin <binloveplay1314@qq.com>
New config: `import-mode (yes|no)`
New command: `CLIENT IMPORT-SOURCE (ON|OFF)`
The config, when set to `yes`, disables eviction and deletion of expired
keys, except for commands coming from a client which has marked itself
as an import-source, the data source when importing data from another
node, using the CLIENT IMPORT-SOURCE command.
When we sync data from the source Valkey to the destination Valkey using
some sync tools like
[redis-shake](https://github.com/tair-opensource/RedisShake), the
destination Valkey can perform expiration and eviction, which may cause
data corruption. This problem has been discussed in
https://github.com/redis/redis/discussions/9760#discussioncomment-1681041
and Redis already have a solution. But in Valkey we haven't fixed it by
now.
E.g. we call `set key 1 ex 1` on the source server and transfer this
command to the destination server. Then we call `incr key` on the source
server before the key expired, we will have a key on the source server
with a value of 2. But when the command arrived at the destination
server, the key may be expired and has deleted. So we will have a key on
the destination server with a value of 1, which is inconsistent with the
source server.
In standalone mode, we can use writable replica to simplify the sync
process. However, in cluster mode, we still need a sync tool to help us
transfer the source data to the destination. The sync tool usually work
as a normal client and the destination works as a primary which keep
expiration and eviction.
In this PR, we add a new mode named 'import-mode'. In this mode, server
stop expiration and eviction just like a replica. Notice that this mode
exists only in sync state to avoid data inconsistency caused by
expiration and eviction. Import mode only takes effect on the primary.
Sync tools can mark their clients as an import source by `CLIENT
IMPORT-SOURCE`, which work like a client from primary and can visit
expired keys in `lookupkey`.
**Notice: during the migration, other clients, apart from the import
source, should not access the data imported by import source.**
---------
Signed-off-by: lvyanqi.lyq <lvyanqi.lyq@alibaba-inc.com>
Signed-off-by: Yanqi Lv <lvyanqi.lyq@alibaba-inc.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
This limit should not restrict manual failover, otherwise in some
scenarios, manual failover will time out.
For example, if some FAILOVER_AUTH_REQUESTs or some FAILOVER_AUTH_ACKs
are lost during a manual failover, it cannot vote in the second manual
failover. Or in a mixed scenario of plain failover and manual failover,
it cannot vote for the subsequent manual failover.
The problem with the manual failover retry is that the mf will pause
the client 5s in the primary side. So every retry every manual failover
timed out is a bad move.
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This add the missing return when repl_state change to RECEIVE_VERSION_REPLY,
this way we won’t be blocked if the primary doesn’t reply with REPLCONF
VERSION.
In practice i guess this is no likely to block in this context, reading
small responses are are likely to be received in one packet, so this
is just a cleanup (consistent with the previous state machine
processing).
Also update the state machine diagram to mention the VERSION reply.
Signed-off-by: Binbin <binloveplay1314@qq.com>
### Problem
Valkey stores scripts in a dictionary (lua_scripts) keyed by their SHA1
hashes, but it needs a way to know which scripts are least recently
used. It uses an LRU list (lua_scripts_lru_list) to keep track of
scripts in usage order. When the list reaches a maximum length, Valkey
evicts the oldest scripts to free memory in both the list and
dictionary. The problem here is that the sds from the LRU list can be
pointing to already freed/moved memory by active defrag that the sds in
the dictionary used to point to. It results in assertion error at [this
line](https://github.com/valkey-io/valkey/blob/unstable/src/eval.c#L519)
### Solution
If we duplicate the sds when adding it to the LRU list, we can create an
independent copy of the script identifier (sha). This duplication
ensures that the sha string in the LRU list remains stable and
unaffected by any defragmentation that could alter or free the original
sds. In addition, dictUnlink doesn't require exact pointer
match([ref](https://github.com/valkey-io/valkey/blob/unstable/src/eval.c#L71-L78))
so this change makes sense to unlink the right dictEntry with the copy
of the sds.
### Reproduce
To reproduce it with tcl test:
1. Disable je_get_defrag_hint in defrag.c to trigger defrag often
2. Execute test script
```
start_server {tags {"auth external:skip"}} {
test {Regression for script LRU crash} {
r config set activedefrag yes
r config set active-defrag-ignore-bytes 1
r config set active-defrag-threshold-lower 0
r config set active-defrag-threshold-upper 1
r config set active-defrag-cycle-min 99
r config set active-defrag-cycle-max 99
for {set i 0} {$i < 100000} {incr i} {
r eval "return $i" 0
}
after 5000;
}
}
```
### Crash info
Crash report:
```
=== REDIS BUG REPORT START: Cut & paste starting from here ===
14044:M 12 Nov 2024 14:51:27.054 # === ASSERTION FAILED ===
14044:M 12 Nov 2024 14:51:27.054 # ==> eval.c:556 'de' is not true
------ STACK TRACE ------
Backtrace:
/usr/bin/redis-server 127.0.0.1:6379 [cluster](luaDeleteFunction+0x148)[0x723708]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](luaCreateFunction+0x26c)[0x724450]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](evalCommand+0x2bc)[0x7254dc]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](call+0x574)[0x5b8d14]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](processCommand+0xc84)[0x5b9b10]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](processCommandAndResetClient+0x11c)[0x6db63c]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](processInputBuffer+0x1b0)[0x6dffd4]
/usr/bin/redis-server 127.0.0.1:6379 [cluster][0x6bd968]
/usr/bin/redis-server 127.0.0.1:6379 [cluster][0x659634]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](amzTLSEventHandler+0x194)[0x6588d8]
/usr/bin/redis-server 127.0.0.1:6379 [cluster][0x750c88]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](aeProcessEvents+0x228)[0x757fa8]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](redisMain+0x478)[0x7786b8]
/lib64/libc.so.6(__libc_start_main+0xe4)[0xffffa7763da4]
/usr/bin/redis-server 127.0.0.1:6379 [cluster][0x5ad3b0]
```
Defrag info:
```
mem_fragmentation_ratio:1.18
mem_fragmentation_bytes:47229992
active_defrag_hits:20561
active_defrag_misses:5878518
active_defrag_key_hits:77
active_defrag_key_misses:212
total_active_defrag_time:29009
```
### Test:
Run the test script to push 100,000 scripts to ensure the LRU list keeps
500 maximum length without any crash.
```
27489:M 14 Nov 2024 20:56:41.583 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.583 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
[ok]: Regression for script LRU crash (6811 ms)
[1/1 done]: unit/test (7 seconds)
```
---------
Signed-off-by: Seungmin Lee <sungming@amazon.com>
Signed-off-by: Seungmin Lee <155032684+sungming2@users.noreply.github.com>
Co-authored-by: Seungmin Lee <sungming@amazon.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
### Problem
GitHub Actions is starting the deprecation process for macOS 12.
Deprecation will begin on 10/7/24 and the image will be fully
unsupported by 12/3/24.
For more details, see
https://github.com/actions/runner-images/issues/10721
Signed-off-by: Seungmin Lee <sungming@amazon.com>
Co-authored-by: Seungmin Lee <sungming@amazon.com>
This PR addresses two issues:
1. Performance Degradation Fix - Resolves a significant performance
issue during RDB load on replica nodes.
- The problem was causing replicas to rehash multiple times during the
load process. Local testing demonstrated up to 50% degradation in BGSAVE
time.
- The problem occurs when the replica tries to expand pre-created slot
dictionaries. This operation fails quietly, resulting in undetected
performance issues.
- This fix aims to optimize the RDB load process and restore expected
performance levels.
2. Bug fix when reading `RDB_OPCODE_RESIZEDB` in Valkey 8.0 cluster
mode-
- Use the shard's master slots count when processing this opcode, as
`clusterNodeCoversSlot` is not initialized for the currently syncing
replica.
- Previously, this problem went unnoticed because `RDB_OPCODE_RESIZEDB`
had no practical impact (due to 1).
These improvements will enhance overall system performance and ensure
smoother upgrades to Valkey 8.0 in the future.
Testing:
- Conducted local tests to verify the performance improvement during RDB
load.
- Verified that ignoring `RDB_OPCODE_RESIZEDB` does not negatively
impact functionality in the current version.
Signed-off-by: naglera <anagler123@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
s_malloc_size == zmalloc_size, currently sdsAllocSize does not
calculate PREFIX_SIZE when no malloc_size available, this casue
test_typesAndAllocSize fail in the new unittest, what we want to
check is actually zmalloc_usable_size.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Main thread profiling revealed significant overhead in TLS operations,
even with read/write offloaded to I/O threads:
Perf results:
**10.82%** 8.82% `valkey-server libssl.so.3 [.] SSL_pending` # Called by
main thread after I/O completion
**10.16%** 5.06% `valkey-server libcrypto.so.3 [.] ERR_clear_error` #
Called for every event regardless of thread handling
This commit further optimizes TLS operations by moving more work from
the main thread to I/O threads:
Improve TLS offloading to I/O threads with two main changes:
1. Move `ERR_clear_error()` calls closer to SSL operations
- Currently, error queue is cleared for every TLS event
- Now only clear before actual SSL function calls
- This prevents unnecessary clearing in main thread when operations
are handled by I/O threads
2. Optimize `SSL_pending()` checks
- Add `TLS_CONN_FLAG_HAS_PENDING` flag to track pending data
- Move pending check to follow read operations immediately
- I/O thread sets flag when pending data exists
- Main thread uses flag to update pending list
Performance improvements:
Testing setup based on
https://valkey.io/blog/unlock-one-million-rps-part2/
Before:
- SET: 896,047 ops/sec
- GET: 875,794 ops/sec
After:
- SET: 985,784 ops/sec (+10% improvement)
- GET: 1,066,171 ops/sec (+22% improvement)
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
CI report this failure:
```
[exception]: Executing test client: MOVED 1 127.0.0.1:22128.
MOVED 1 127.0.0.1:22128
while executing
"wait_for_condition 1000 50 {
[R 3 get key_991803] == 1024 && [R 3 get key_977613] == 10240 &&
[R 4 get key_991803] == 1024 && ..."
```
This may be because, even though the cluster state becomes OK,
The cluster still has inconsistent configuration for a short period
of time. We make sure to wait for the config to be consistent.
Signed-off-by: Binbin <binloveplay1314@qq.com>
If bgsaveerr is error, there is no need to protect the rdb channel.
The impact of this may be that when bgsave fails, we will protect
the rdb channel for 60s. It may occupy the reference of the repl
buf block, making it impossible to recycle it until we free the
client due to COB or free the client after 60s.
We kept the RDB channel open as long as the replica hadn't established
a main connection, even if the snapshot process failed. There is no
value in keeping the RDB client in this case.
Signed-off-by: Binbin <binloveplay1314@qq.com>
We have an assert in propagateNow. If the primary node receives a
CLUSTER UPDATE such as dirty slots during SIGTERM waitting or during
a manual failover pausing or during a client pause, the delKeysInSlot
call will trigger this assert and cause primary crash.
In this case, we added a new server_del_keys_in_slot state just like
client_pause_in_transaction to track the state to avoid the assert
in propagateNow, the dirty slots will be deleted in the end without
affecting the data consistency.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
We originally checked the replica connection to whether to kill the
diskless child only when rdbPipeReadHandler is triggered. Actually
we can check it when the replica is disconnected, so that we don't
have to wait for rdbPipeReadHandler to be triggered and can kill
the forkless child as soon as possible.
In this way, when the child or rdbPipeReadHandler is stuck for some
reason, we can kill the child faster and release the fork resources.
Signed-off-by: Binbin <binloveplay1314@qq.com>
This adds a log when a handshake fails for a timeout. This can help
troubleshoot cluster asymmetry issues caused by failed MEETs
---------
Signed-off-by: Ben Totten <btotten@amazon.com>
Signed-off-by: bentotten <59932872+bentotten@users.noreply.github.com>
Co-authored-by: Ben Totten <btotten@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Remove the dict pointer argument to the `dictType` callbacks `keyDup`,
`keyCompare`, `keyDestructor` and `valDestructor`. This argument was
unused in all of the callback implementations.
The macros `dictFreeKey()` and `dictFreeVal()` are made internal to dict
and moved from dict.h to dict.c. They're also changed from macros to
static inline functions.
Signed-off-by: Qu Chen <quchen@amazon.com>
* We compile various c files into object and package them into library
(.a file) using ar to feed to unit tests. With new GCC versions, the
objects inside such library don't participate in LTO process without
additional flags.
* Here is a direct quote from gcc documentation explaining this issue:
"If you are not using a linker with plugin support and/or do not enable
the linker plugin, then the objects inside libfoo.a are extracted and
linked as usual, but they do not participate in the LTO optimization
process. In order to make a static library suitable for both LTO
optimization and usual linkage, compile its object files with
-flto-ffat-lto-objects."
* Read full documentation about -flto at
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
* Without this additional flag, I get following errors while executing
"make test-unit". With this change, those errors go away.
```
ARCHIVE libvalkey.a
ar: threads_mngr.o: plugin needed to handle lto object
...
..
.
/tmp/ccDYbMXL.ltrans0.ltrans.o: In function `dictClear':
/local/workplace/elasticache/valkey/src/unit/../dict.c:776: undefined
reference to `valkey_free'
/local/workplace/elasticache/valkey/src/unit/../dict.c:770: undefined
reference to `valkey_free'
/tmp/ccDYbMXL.ltrans0.ltrans.o: In function `dictGetVal':
```
Fixes#1290
---------
Signed-off-by: Parth Patel <661497+parthpatel@users.noreply.github.com>
Migrate quicklist unit test to new unit test framework, and cleanup
remaining references of SERVER_TEST, parent ticket #428.
Closes#428.
Signed-off-by: artikell <739609084@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
In #1282, we init server.pid earlier to keep log message role
consistent, but we forgot to consider daemonize. In daemonize
mode, we will always print the child role.
We need to reset server.pid after daemonize(), otherwise the
log printing role will always be the child. It also causes a
incorrect server.pid value, affecting the concatenation of
some pid names.
Signed-off-by: Binbin <binloveplay1314@qq.com>
If we become an empty primary for some reason, we still need to
check if we need to delete dirty slots, because we may have dirty
slots data left over from a bad migration. Like the target node forcibly
executes CLUSTER SETSLOT NODE to take over the slot without
performing key migration.
Signed-off-by: Binbin <binloveplay1314@qq.com>
If multiple primary nodes go down at the same time, their replica nodes will
initiate the elections at the same time. There is a certain probability that
the replicas will initate the elections in the same epoch.
And obviously, in our current election mechanism, only one replica node can
eventually get the enough votes, and the other replica node will fail to win
due the the insufficient majority, and then its election will time out and
we will wait for the retry, which result in a long failure time.
If another node has been won the election in the failover epoch, we can assume
that my election has failed and we can retry as soom as possible.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Currently when a manual failover is triggeded, we will set a
CLUSTER_TODO_HANDLE_FAILOVER to start the election as soon as
possible in the next beforeSleep. But in fact, we won't delay
the election in manual failover, waitting for the next beforeSleep
to kick in will delay the election a some milliseconds.
We can trigger the election immediately in this case in the
same function call, without waitting for beforeSleep, which
can save us some milliseconds.
Signed-off-by: Binbin <binloveplay1314@qq.com>
When doing `$replica replicaof no one`, we may get a LOADING
error, this is because during the test execution, the replica
may reconnect very quickly, and the full sync is initiated,
and the replica has entered the LOADING state.
In this commit, we make sure the primary is pasued after the
fork, so the replica won't enter the LOADING state, and with
this fix, this test seems more natural and predictable.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Null out several references after freeing the object in `freeClient()`.
This is just to make the code more safe, to protect against
use-after-free for future changes.
Signed-off-by: Qu Chen <quchen@amazon.com>
Init server.pid earlier to keep log message role consistent.
Closes#1206.
Before:
```text
24881:C 21 Oct 2024 21:10:57.165 * oO0OoO0OoO0Oo Valkey is starting oO0OoO0OoO0Oo
24881:C 21 Oct 2024 21:10:57.165 * Valkey version=255.255.255, bits=64, commit=814e0f55, modified=1, pid=24881, just started
24881:C 21 Oct 2024 21:10:57.165 * Configuration loaded
24881:M 21 Oct 2024 21:10:57.167 * Increased maximum number of open files to 10032 (it was originally set to 1024).
```
After:
```text
68560:M 08 Nov 2024 16:10:12.257 * oO0OoO0OoO0Oo Valkey is starting oO0OoO0OoO0Oo
68560:M 08 Nov 2024 16:10:12.257 * Valkey version=255.255.255, bits=64, commit=45d596e1, modified=1, pid=68560, just started
68560:M 08 Nov 2024 16:10:12.257 * Configuration loaded
68560:M 08 Nov 2024 16:10:12.258 * monotonic clock: POSIX clock_gettime
```
Signed-off-by: azuredream <zhaozixuan67@gmail.com>
RDMA: Use connection reference counter style
The reference counter of connection is used to protect re-entry of closenmethod.
Use this style instead the unsafe one.
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
As @madolson pointed out, these do have proper null terminators. This
cleans them up to follow the rest of the code which copies the last byte
explicitly, which should help reduce cognitive load and make it more
resilient should code refactors occur (e.g. non-static allocation of
memory, changes to other functions).
---------
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
With this commit, users are able to build valkey using `CMake`.
## Example usage:
Build `valkey-server` in Release mode with TLS enabled and using
`jemalloc` as the allocator:
```bash
mkdir build-release
cd $_
cmake .. -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INSTALL_PREFIX=/tmp/valkey-install \
-DBUILD_MALLOC=jemalloc -DBUILD_TLS=1
make -j$(nproc) install
# start valkey
/tmp/valkey-install/bin/valkey-server
```
Build `valkey-unit-tests`:
```bash
mkdir build-release-ut
cd $_
cmake .. -DCMAKE_BUILD_TYPE=Release \
-DBUILD_MALLOC=jemalloc -DBUILD_UNIT_TESTS=1
make -j$(nproc)
# Run the tests
./bin/valkey-unit-tests
```
Current features supported by this PR:
- Building against different allocators: (`jemalloc`, `tcmalloc`,
`tcmalloc_minimal` and `libc`), e.g. to enable `jemalloc` pass
`-DBUILD_MALLOC=jemalloc` to `cmake`
- OpenSSL builds (to enable TLS, pass `-DBUILD_TLS=1` to `cmake`)
- Sanitizier: pass `-DBUILD_SANITIZER=<address|thread|undefined>` to
`cmake`
- Install target + redis symbolic links
- Build `valkey-unit-tests` executable
- Standard CMake variables are supported. e.g. to install `valkey` under
`/home/you/root` pass `-DCMAKE_INSTALL_PREFIX=/home/you/root`
Why using `CMake`? To list *some* of the advantages of using `CMake`:
- Superior IDE integrations: cmake generates the file
`compile_commands.json` which is required by `clangd` to get a compiler
accuracy code completion (in other words: your VScode will thank you)
- Out of the source build tree: with the current build system, object
files are created all over the place polluting the build source tree,
the best practice is to build the project on a separate folder
- Multiple build types co-existing: with the current build system, it is
often hard to have multiple build configurations. With cmake you can do
it easily:
- It is the de-facto standard for C/C++ project these days
More build examples:
ASAN build:
```bash
mkdir build-asan
cd $_
cmake .. -DBUILD_SANITIZER=address -DBUILD_MALLOC=libc
make -j$(nproc)
```
ASAN with jemalloc:
```bash
mkdir build-asan-jemalloc
cd $_
cmake .. -DBUILD_SANITIZER=address -DBUILD_MALLOC=jemalloc
make -j$(nproc)
```
As seen by the previous examples, any combination is allowed and
co-exist on the same source tree.
## Valkey installation
With this new `CMake`, it is possible to install the binary by running
`make install` or creating a package `make package` (currently supported
on Debian like distros)
### Example 1: build & install using `make install`:
```bash
mkdir build-release
cd $_
cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/valkey-install -DCMAKE_BUILD_TYPE=Release
make -j$(nproc) install
# valkey is now installed under $HOME/valkey-install
```
### Example 2: create a `.deb` installer:
```bash
mkdir build-release
cd $_
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc) package
# ... CPack deb generation output
sudo gdebi -n ./valkey_8.1.0_amd64.deb
# valkey is now installed under /opt/valkey
```
### Example 3: create installer for non Debian systems (e.g. FreeBSD or
macOS):
```bash
mkdir build-release
cd $_
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc) package
mkdir -p /opt/valkey && ./valkey-8.1.0-Darwin.sh --prefix=/opt/valkey --exclude-subdir
# valkey-server is now installed under /opt/valkey
```
Signed-off-by: Eran Ifrah <eifrah@amazon.com>
functionsLibCtxClear should clear the provided lib_ctx parameter,
not the static variable curr_functions_lib_ctx, as this contradicts
the function's intended purpose.
The impact i guess is minor, like in some unhappy paths (diskless load
fails, function restore fails?), we will mess up the functions_caches
field, which is used in used_memory_functions / used_memory_scripts
fileds in INFO.
Signed-off-by: Binbin <binloveplay1314@qq.com>
The CI report replica will return the error when performing CLUSTER
FAILOVER:
```
-ERR Master is down or failed, please use CLUSTER FAILOVER FORCE
```
This may because the primary state is fail or the cluster connection
is disconnected during the primary pause. In this PR, we added some
waits in wait_for_role, if the role is replica, we will wait for the
replication link and the cluster link to be ok.
Signed-off-by: Binbin <binloveplay1314@qq.com>
We set this to EEXIST in 568c2e039bac388003068cd8debb2f93619dd462,
it prints "File exists" which is not quite accurate,
change it to EALREADY, it will print "Operation already in progress".
Signed-off-by: Binbin <binloveplay1314@qq.com>
We don't think we really care about optimizing for the empty DB case,
which should be uncommon. Adding branches hurts branch prediction.
Signed-off-by: Binbin <binloveplay1314@qq.com>
This commit addresses issues that were likely introduced during a rebase
related to:
b0f23df165
Change dual channel replication state in main handler only
Signed-off-by: naglera <anagler123@gmail.com>
A minor debugging change that helped in the investigation of
https://github.com/valkey-io/valkey/issues/1251. Basically there are
some edge cases where we want to fully isolate a note from receiving
packets, but can't suspend the process because we need it to continue
sending outbound traffic. So, added a filter for that.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
The cluster-node-timeout is 3000 in our tests, the timing test wasn't
succeeding, so extending the wait_for made them much more reliable.
Signed-off-by: Binbin <binloveplay1314@qq.com>
All of the internal variables related to number of dicts in the kvstore
are type `int`. Not sure why these 2 items were declared as `long long`.
Signed-off-by: Jim Brunner <brunnerj@amazon.com>
RDMA MR (memory region) is not forkable, the VMA (virtual memory area)
of a MR gets empty in a child process. Prevent IO for child process to
avoid server crash.
In the check for whether read and write is allowed in an RDMA
connection, a check that if we're in a child process is added. If we
are, the function returns an error, which will cause the RDMA client to
be disconnected.
Suggested-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
The current implementation of `sdssplitargs` does repeated `sdscatlen`
to build the parsed arguments, which isn't very efficient because it
does a lot of extra reallocations and moves through the sds code a lot.
It also typically results in memory overhead, because `sdscatlen`
over-allocates, which is usually not needed since args are usually not
modified after being created.
The new implementation of sdssplitargs does two passes, the first to
parse the argument to figure out the final length and the second to
actually copy the string. It's generally about 2x faster for larger
strings (~100 bytes), and about 20% faster for small strings (~10
bytes). This is generally faster since as long as everything is in the
CPU cache, it's going to be fast.
There are a couple of sanity tests, none existed before, as well as some
fuzzying which was used to find some bugs and also to do the
benchmarking. The original benchmarking code can be seen
6576aeb86a.
```
test_sdssplitargs_benchmark - unit/test_sds.c:530] Using random seed: 1729883235
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 56.44%, new:13039us, old:29930us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 56.58%, new:12057us, old:27771us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 59.18%, new:9048us, old:22165us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 54.61%, new:12381us, old:27278us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 51.17%, new:16012us, old:32793us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 49.18%, new:16041us, old:31563us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 58.40%, new:12450us, old:29930us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 56.49%, new:13066us, old:30031us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 58.75%, new:12744us, old:30894us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 52.44%, new:16885us, old:35504us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 62.57%, new:8107us, old:21659us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 62.12%, new:8320us, old:21966us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 45.23%, new:13960us, old:25487us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 57.95%, new:9188us, old:21849us
```
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
to align with how we encode the length at `_addReplyLongLongWithPrefix`
Signed-off-by: Masahiro Ide <masahiro.ide@lycorp.co.jp>
Co-authored-by: Masahiro Ide <masahiro.ide@lycorp.co.jp>
Typically, RDMA connection gets closed by client side, the server side
handles diconnected CM event, and delete keepalive timer correctly.
However, the server side may close connection voluntarily, for example
the maxium connections exceed. Handle this case to avoid invalid memory
access.
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
clusterNodeGetReplica agrumnets are missed to migrate during the slave
to replication migration so updated the argument slave to replica.
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
At some point unit tests stopped building on MacOS because of duplicate
symbols. I had originally solved this problem by using a flag that
overrides symbols, but the much better solution is to mark the duplicate
symbols as weak and they can be overridden during linking. (Symbols by
default are strong, strong symbols override weak symbols)
I also added macos unit build to the CI, so that this doesn't silently
break in the future again.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Add config options for log format and timestamp format introduced by
#1022
Related to #1225
This change adds two new configs into valkey.conf:
log-format
log-timestamp-format
---------
Signed-off-by: azuredream <zhaozixuan67@gmail.com>
### Description
This patch try to increase the max number of io-threads from 16(128) to
256 for below reasons:
1. The core number increases a lot in the modern server processors, for
example, the [Sierra
Forest](https://en.wikipedia.org/wiki/Sierra_Forest) processors are
targeted towards with up to **288** cores.
Due to limitation of **_io-threads_** number (16 and 128 ), benchmark
like https://openbenchmarking.org/test/pts/valkey even cannot run on a
high core count server.
2. For some workloads, the bottleneck could be main thread, but for the
other workloads, big key/value which caused heavy io, the bottleneck
could be the io-threads, for example benchmark `memtier_benchmark -s
127.0.0.1 -p 9001 "--data-size" "20000" --ratio 1:0 --key-pattern P:P
--key-minimum=1 --key-maximum 1000000 --test-time 180 -c 50 -t 16
--hide-histogram`. The QPS is still scalable after 16 io-threads.

**Fig 1. QPS Scale factor with io-threads number grows.**
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
Co-authored-by: Wangyang Guo <wangyang.guo@intel.com>
In the old code, if fstat fails, replica->repldbfd will hold the
fd and we are doing a free client. And in freeClient, we check and
close only if repl_state == REPLICA_STATE_SEND_BULK. So if fstat
fails, we will leak the fd.
We can also extend freeClient to handle REPLICA_STATE_WAIT_BGSAVE_END
as well, but here seems to be a more friendly (and safer) way.
Signed-off-by: Binbin <binloveplay1314@qq.com>
When explored the cycles distribution for main thread with io-threads
enabled. We found this security attack check takes significant time in
main thread, **~3%** cycles were used to do the commands security check
in main thread.
This patch try to completely avoid doing it in the hot path. We can do
it only after we looked up the command and it wasn't found, just before
we call commandCheckExistence.
---------
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
Co-authored-by: Wangyang Guo <wangyang.guo@intel.com>
Add ability to configure log output format and timestamp format in the
logs.
This change adds two new configs:
* `log-format`: Either legacy or logfmt (See https://brandur.org/logfmt)
* `log-timestamp-format`: legacy, iso8601 or milliseconds (since the
eppch).
Related to #1006.
Example:
```
$ ./valkey-server /home/zhaoz12/git/valkey/valkey/valkey.conf
pid=109463 role=RDB/AOF timestamp="2024-09-10T20:37:25.738-04:00" level=warning message="WARNING Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. Being disabled, it can also cause failures without low memory condition, see https://github.com/jemalloc/jemalloc/issues/1328. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect."
pid=109463 role=RDB/AOF timestamp="2024-09-10T20:37:25.738-04:00" level=notice message="oO0OoO0OoO0Oo Valkey is starting oO0OoO0OoO0Oo"
pid=109463 role=RDB/AOF timestamp="2024-09-10T20:37:25.738-04:00" level=notice message="Valkey version=255.255.255, bits=64, commit=affbea5d, modified=1, pid=109463, just started"
pid=109463 role=RDB/AOF timestamp="2024-09-10T20:37:25.738-04:00" level=notice message="Configuration loaded"
pid=109463 role=master timestamp="2024-09-10T20:37:25.738-04:00" level=notice message="monotonic clock: POSIX clock_gettime"
pid=109463 role=master timestamp="2024-09-10T20:37:25.739-04:00" level=warning message="Failed to write PID file: Permission denied"
```
---------
Signed-off-by: azuredream <zhaozixuan67@gmail.com>
All other places written in this function are maintained it,
although the caller of rdbSaveDb does not reply on it, it is
maintained to be consistent with other places, is its duty.
Signed-off-by: Binbin <binloveplay1314@qq.com>
If a replica is step into data_age too old stage, it can not
trigger the failover and currently it can not be automatically
recovered and we will print a log every
CLUSTER_CANT_FAILOVER_RELOG_PERIOD,
which is every second. If the primary has not recovered or there is
no manual failover, this log will flood the log file.
In this case, limit its frequency to 10 times period, which is
10 seconds in our code. Also in this data_age too old stage,
the repeated logs also can stand for the progress of the failover.
See also #780 for more details about it.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
The command argument strings created while parsing inline commands (see
`processInlineBuffer()`) can contain free capacity. Since some commands
,such as `SET`, store these strings in the database, that free capacity
increases the memory usage. In the worst case, it could double the
memory usage.
This only occurs if the inline command format is used. The argument
strings are built by appending character by character in
`sdssplitargs()`. Regular RESP commands are not affected.
This change trims the strings within `processInlineBuffer()`.
### Why `trimStringObjectIfNeeded()` within `object.c` is not solving
this?
When the command argument string is packed into an object,
`trimStringObjectIfNeeded()` is called.
This does only trim the string if it is larger than
`PROTO_MBULK_BIG_ARG` (32kB), as only strings larger than this would
ever need trimming if the command it sent using the bulk string format.
We could modify this condition, but that would potentially have a
performance impact on commands using the bulk format. Since those make
up for the vast majority of executed commands, limiting this change to
inline commands seems prudent.
### Experiment Results
* 1 million `SET [key] [value]` commands
* Random keys (16 bytes)
* 600 bytes values
Memory usage without this change:
```
used_memory:1089327888
used_memory_human:1.01G
used_memory_rss:1131696128
used_memory_rss_human:1.05G
used_memory_peak:1089348264
used_memory_peak_human:1.01G
used_memory_peak_perc:100.00%
used_memory_overhead:49302800
used_memory_startup:911808
used_memory_dataset:1040025088
used_memory_dataset_perc:95.55%
```
Memory usage with this change:
```
used_memory:705327888
used_memory_human:672.65M
used_memory_rss:718802944
used_memory_rss_human:685.50M
used_memory_peak:705348256
used_memory_peak_human:672.67M
used_memory_peak_perc:100.00%
used_memory_overhead:49302800
used_memory_startup:911808
used_memory_dataset:656025088
used_memory_dataset_perc:93.13%
```
If the same experiment is repeated using the normal RESP array of bulk
string format (`*3\r\n$3\r\nSET\r\n...`) then the memory usage is 672MB
with and without of this change.
If a replica is attached, its memory usage is 672MB with and without
this change, since the replication link never uses inline commands.
Signed-off-by: Stefan Mueller <muelstef@amazon.com>
Currently, if the replica has a lot of data, CLUSTER RESET
will block for a while and report the slowlog, and it seems
that there is no harm in making it async so external components
can be easier when monitoring it.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Part of https://github.com/valkey-io/valkey/pull/1200 PR, since feild is
changed. Looks like commands.def is missed to get genereated based on
the changes so that is causing CI failure on unstable.
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
In some cases bgsave child process can run for a long time exhausting
system resources. Although it is possible to kill the bgsave child
process from the system shell, sometimes it is not possible allowing OS
level access.
This PR adds a new subcommand to the BGSAVE command.
When user will issue `BGSAVE CANCEL`, it will do one of the 2:
1. In case a bgsave child process is currently running, the child
process would be immediately killed thus terminating any
save/replication full sync process.
2. In case a bgsave child process is SCHEDULED to run, the scheduled
execution will be cancelled.
---------
Signed-off-by: ranshid <ranshid@amazon.com>
Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com>
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
struct connListener.priv should be used by connection type specific
data, static local listener data should not use this.
A RDMA config structure is going to be introduced in the next step:
```
typedef struct serverRdmaContextConfig {
char *bindaddr;
int bindaddr_count;
int port;
int rx_size;
int comp_vector;
...
} serverRdmaContextConfig;
```
Then a builtin RDMA will be supported.
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
This special pattern '#' is used to get the element itself,
it does not actually participate in the slot check.
In this case, passing `GET #` will cause '#' to participate
in the slot check, causing the command to get an
`pattern may be in different slots` error.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Hide 'unixsocketgroup' and 'unixsocketperm' into a Unix socket specific
data structure. A single opaque pointer 'void *priv' is enough for a
listener. Once any new config is added, we don't need 'void *priv2',
'void *priv3' and so on.
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
When profiling some workloads with `io-threads` enabled. We found the
false sharing issue is heavy.
This patch try to split the the elements accessed by main thread and
io-threads into different cache line by padding the elements in the head
of `used_memory_thread_padded` array.
This design helps mitigate the false sharing between main
thread and io-threads, because the main thread has been the bottleneck
with io-threads enabled. We didn't put each element in an individual
cache line is that we don't want to bring the additional cache line
fetch operation (3 vs 16 cache line) when call function like
`zmalloc_used_memory()`.
---------
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
Signed-off-by: Lipeng Zhu <zhu.lipeng@outlook.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Wangyang Guo <wangyang.guo@intel.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
There is no limitation in Valkey to create a cluster with 1 or 2 primaries,
only that it cannot do automatic failover. Remove this restriction and
add `are you sure` prompt to prompt the user.
This allow we use it to create a test cluster by cli or by
create-cluster.
Signed-off-by: Binbin <binloveplay1314@qq.com>
https://github.com/valkey-io/valkey/issues/1145
First part of a two-step effort to add `WithSlot` API for expiry. This
PR is to fix a crash that occurs when a RANDOMKEY uses a different slot
than the cached slot of a client during a multi-exec.
The next part will be to utilize the new API as an optimization to
prevent duplicate work when calculating the slot for a key.
---------
Signed-off-by: Nadav Levanoni <nadavl@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Nadav Levanoni <nadavl@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
**Overview**
This PR introduces the use of
[MurmurHash3](https://en.wikipedia.org/wiki/MurmurHash) as the hashing
function for Lua's luaS_newlstr function, replacing the previous simple
hash function. The change aims to improve performance, particularly for
large strings.
**Changes**
Implemented MurmurHash3 algorithm in lstring.c
Updated luaS_newlstr to use MurmurHash3 for string hashing
**Performance Testing:**
Test Setup:
1. Ran a valkey server
2. Loaded 1000 keys with large values (100KB each) to the server using a
Lua script
```
local numKeys = 1000
for i = 1, numKeys do
local key = "large_key_" .. i
local largeValue = string.rep("x", 1024*100)
redis.call("SET", key, largeValue)
end
```
3. Used a Lua script to randomly select and retrieve keys
```
local randomKey = redis.call("RANDOMKEY")
local result = redis.call("GET", randomKey)
```
4. Benchmarked using valkey-benchmark:
`./valkey-benchmark -n 100000 evalsha
c157a37967e69569339a39a953c046fc2ecb4258 0`
Results:
A | Unstable | This PR | Change
-- | -- | -- | --
Throughput | 6,835.74 requests per second | 17,061.94 requests per
second | **+150% increase**
Avg Latency | 7.218 ms | 2.838 ms | **-61% decrease**
Min Latency | 3.144 ms | 1.320 ms | **-58% decrease**
P50 Latency | 8.463 ms | 3.167 ms | **-63% decrease**
P95 Latency | 8.863 ms | 3.527 ms | **-60% decrease**
P99 Latency | 9.063 ms | 3.663 ms | **-60% decrease**
Max Latency | 63.871 ms | 55.327 ms | **-13% decrease**
Summary:
* Throughput: Improved by 150%.
* Latency: Significant reductions in average, minimum, and percentile
latencies (P50, P95, P99), leading to much faster response times.
* Max Latency: Slightly decreased by 13%, indicating fewer outlier
delays after the fix.
---------
Signed-off-by: Shai Zarka <zarkash@amazon.com>
Signed-off-by: zarkash-aws <zarkash@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The client that was killed by FUNCTION KILL received a reply of
SCRIPT KILL and the server log also showed SCRIPT KILL.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Currently in conf we describe activerehashing as: Active rehashing
uses 1 millisecond every 100 milliseconds of CPU time. This is the
case for hz = 10.
If we change hz, the description in conf will be inaccurate. Users
may notice that the server spends some CPU (used in activerehashing)
at high hz but don't know why, since our cron calls are fixed to 1ms.
This PR takes hz into account and fixed the CPU usage at 1% (this may
not be accurate in some cases because we do 100 step rehashing in
dictRehashMicroseconds but it can avoid CPU spikes in this case).
This PR also improves the description of the activerehashing
configuration item to explain this change.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The zcalloc symbol is a symbol name already used by zlib, which is
defining other names using the "z" prefix specific to zlib. In practice,
linking valkey with a static openssl, which itself might depend on a
static libz will result in link time error rejecting multiple symbol
definitions.
Fixes: #1157
Signed-off-by: Romain Geissler <romain.geissler@amadeus.com>
Currently in our daily, if a job fails, it will cancel the other jobs
in the same matrix, we want to avoid this so that all jobs in a matrix
can eventually run to completion.
Docs: jobs.<job_id>.strategy.fail-fast applies to the entire matrix.
If jobs.<job_id>.strategy.fail-fast is set to true or its expression
evaluates to true, GitHub will cancel all in-progress and queued jobs
in the matrix if any job in the matrix fails. This property defaults
to true.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Sometimes when dual-channel is turned off the tested replica might
disconnect on COB overrun. disable the replica COB limit in order to
prevent such cases.
Fixes: #1153
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
in case of valgrind run, the replica might get disconnected from the
primary due to repl-timeout reached. Fix is to configure larger timeout
in case of valgrind test.
**Partially** fixes: #1152
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Ci report this failure:
```
*** [err]: SHUTDOWN NOSAVE can kill a timedout script anyway in tests/unit/scripting.tcl
Expected 'BUSY Valkey is busy running a script. *' to match '*connection refused*' (context: type eval line 8 cmd {assert_match {*connection refused*} $e} proc ::test)
```
We can see the logs the shutdown got rejected because there is an AOFRW
pending:
```
Writing initial AOF, can't exit.
Errors trying to shut down the server. Check the logs for more information.
```
The reason is that the previous test enabled the aof.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Similar to #860 but this is for HGETALL families (HGETALL/HKEYS/HVALS).
This patch moves `prepareClientToWrite` out of the loop to reduce the
function overhead.
Signed-off-by: Masahiro Ide <imasahiro9@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
this fixes: https://github.com/valkey-io/valkey/issues/1116
_Issue details from #1116 by @zuiderkwast_
> This config is undocumented since #758. The default was changed to
"yes" and it is quite useless to set it to "no". Yet, it can happen that
some user has an old config file where it is explicitly set to "no". The
result will be bad performace, since I/O threads will not do all the
I/O.
>
> It's indeed confusing.
>
> 1. Either remove the whole option from the code. And thus no need for
documentation. _OR:_
> 2. Introduce the option back in the configuration, just as a comment
is fine. And showing the default value "yes": `# io-threads-do-reads
yes` with additional text.
>
> _Originally posted by @melroy89 in [#1019 (reply in
thread)](https://github.com/orgs/valkey-io/discussions/1019#discussioncomment-10824778)_
---------
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
The module commands which were added to acl categories were getting
skipped when `ACL CAT category` command was executed.
This PR fixes the bug.
Before:
```
127.0.0.1:6379> ACL CAT foocategory
(empty array)
```
After:
```
127.0.0.1:6379> ACL CAT foocategory
aclcheck.module.command.test.add.new.aclcategories
```
---------
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Co-authored-by: Harkrishn Patro <bunty.hari@gmail.com>
A new option for diskless replication on the replica side.
After a network failure, the replica may need to perform a full sync.
The other option for diskless full sync is `swapdb`, but it uses twice
as much memory, temporarily. In situations where this is not acceptable,
and where losing data is acceptable, the `flush-before-load` can be
useful. If the full sync fails, the old data is lost though. Therefore,
the new option is marked as "dangerous".
---------
Signed-off-by: kronwerk <ca11e5e22g@gmail.com>
Signed-off-by: kronwerk <kronwerk@users.noreply.github.com>
Co-authored-by: kronwerk <ca11e5e22g@gmail.com>
Currently when module loading fails due to busy name, we
don't have a clean way to assist to troubleshooting.
Case 1: when loading the same module multiple times, we can
not detemine the cause of its failure without referring to
the module list or the earliest module load log. The log
may not exist and sometimes it is difficult for people
to associate module list.
Case 2: when multiple modules use the same module name,
we can not quickly associate the busy name without referring
to the module list and the earliest module load log.
Different people wrote modules with the same module name,
they don't easily associate module name.
So in this PR, when doing module onload, we will try to
print a busy name log if this happen. Currently we check
ctx.module since if it is NULL it means the Init call
failed, and Init currently only fails with busy name.
It's kind of ugly. It would have been nice if we could have had a
better way for onload to signal why the load failed.
Signed-off-by: Binbin <binloveplay1314@qq.com>
There is a lot of bad legacy usage of `default:` with enums, which is an
anti-pattern. If you omit the default, the compiler will tell you if a
new enum value was added and that it is missing from a switch statement.
Someone mentioned on another PR they used `default:` because of this
warning, so just removing it, but might create an issue to do a wider
cleanup.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: Masahiro Ide <masahiro.ide@lycorp.co.jp>
Signed-off-by: Masahiro Ide <imasahiro9@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Masahiro Ide <masahiro.ide@lycorp.co.jp>
I’ve prepared a minor fix for `processCommand()` function.
In `processCommand()`, the `obey_client` variable is created, but some
conditional statements call the `mustObeyClient()` function instead of
reusing `obey_client`.
I’ve modified these statements to `reuse obey_client`.
Since I’m relatively new to Redis, please let me know if there are any
reasons why the conditional statements need to call `mustObeyClient()`
again.
Thank you for taking the time to review my PR.
Signed-off-by: otheng03 <07c00h@gmail.com>
Until now, this flag only dumped logs on a failed assert in test case.
It is useful that this flag dumps logs on a crash as well.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Since a7cbca40661 ("RDMA: Support .is_local method (#1089)"),
valkey-server started to support auto-detect local connection, then we
can use protected mode for local RDMA device for test.
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Improved documentation and readability of lua code as well as removed references to Redis.
---------
Signed-off-by: Parth Patel <661497+parthpatel@users.noreply.github.com>
This change counts both solo test executions to give an accurate total number of tests being run.
---------
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
This commit adds initialization code for the fields
`io_last_reply_block` and `io_last_bufpos` of the `client` struct.
While in the current code flow, these fields are only accessed after
being written in the `trySendWriteToIOThreads`, I discovered that they
were not being initialized while doing some changes to the code flow of
IO threads.
I believe it's good pratice to initialize all fields of a struct upon
creation, and will avoid future bugs which are usually hard to debug.
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
Fix the warning introduced in #688:
```
unit/test_rax.c:168:15: runtime error: left shift of 36625 by 16 places cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior unit/test_rax.c:168:15 in
Fuzz test in mode 1 [7504]:
```
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Applying the CVEs against mainline.
(CVE-2024-31449) Lua library commands may lead to stack overflow and
potential RCE.
(CVE-2024-31227) Potential Denial-of-service due to malformed ACL
selectors.
(CVE-2024-31228) Potential Denial-of-service due to unbounded pattern
matching.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
- Add systemd support to the build artifact tarballs, so people can use
it under systemd compatible distros. As discussed here:
https://github.com/orgs/valkey-io/discussions/1103#discussioncomment-10815549.
Adding `libsystemd-dev` to install and add `USE_SYSTEMD=yes` to the
build.
- Cleanup & bring the arm & x86 workflow files in-sync. It was a bit of
a mess ;) (removing `jq wget awscli` from the 'Tarball' step)
Signed-off-by: Melroy van den Berg <melroy@melroy.org>
As discussed here:
https://github.com/orgs/valkey-io/discussions/1103#discussioncomment-10814006
`cp` can't be used anymore, `rsync` is more powerful and allow to
exclude files.
Alternatively:
1. Remove the c, d and o files. Which isn't ideal either.
2. Improve the build. Eg. by building inside a `build` directory instead
of in the src folder.
Ps. I know these workflows aren't trigger in this PR. Only via "Build
Release Packages" workflow action:
https://github.com/valkey-io/valkey/actions/workflows/build-release-packages.yml..
So I can't fully test in this PR. But it should work ^^
Ps. ps. I did test `rsync -av --exclude='*.c' --exclude='*.d'
--exclude='*.o' src/valkey-*` command in isolation and that works as
expected!
---------
Signed-off-by: Melroy van den Berg <melroy@melroy.org>
Introduce a `size_t` field into the rax struct to track allocation size.
Update the allocation size on rax insert and deletes.
Return the allocation size when `raxAllocSize` is called.
This size tracking is now used in MEMORY USAGE and MEMORY STATS in place
of the previous method based on sampling.
The module API allows to create sorted dictionaries, which are backed by
rax. Users now also get precise memory allocation for them (through
`ValkeyModule_MallocSizeDict`).
Fixes#677.
For the release notes:
* MEMORY USAGE and MEMORY STATS are now exact for streams, rather than
based on sampling.
---------
Signed-off-by: Guillaume Koenig <knggk@amazon.com>
Signed-off-by: Guillaume Koenig <106696198+knggk@users.noreply.github.com>
Co-authored-by: Joey <yzhaon@amazon.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Since we paused the primary node earlier, the replica may enter
cluster down due to primary node pfail. Here set allow read to
prevent subsequent read errors.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Currently, we create a Lua table without initial capacity even when the
capacity is known. As a result, we need to resize the Lua tables
repeatedly when converting RESP serialized object to Lua object and it
consumes extra cpu resources a bit when we need to transfer
RESP-serialized data to Lua world.
This patch try to remove this extra resize to reduce (re-)allocation
overhead.
| name | unstable bb57dfe6303 (rps) | this patch(rps) | improvements |
| --------------- | -------- | --------- | -------------- |
| evalsha - hgetall h1 | 60565.68 | 64487.01 | 6.47% |
| evalsha - hgetall h10 | 47023.41 | 50602.17 | 7.61% |
| evalsha - hgetall h25 | 33572.82 | 37345.48 | 11.23% |
| evalsha - hgetall h50 | 24206.63 | 25276.14 | 4.42% |
| evalsha - hgetall h100 | 15068.87 | 15656.8 | 3.90% |
| evalsha - hgetall h300 | 5948.56 | 6094.74 | 2.46% |
Signed-off-by: Masahiro Ide <masahiro.ide@lycorp.co.jp>
Co-authored-by: Masahiro Ide <masahiro.ide@lycorp.co.jp>
These two test cases run in a loop:
* AOF rewrite during write load: RDB preamble=yes
* AOF rewrite during write load: RDB preamble=no
Both of the test cases build up a lot of data (3-4 million keys when I
run locally) so we should empty the data before the second test case.
Otherwise, the second test cases adds keys on top of the keys added in
the first test case, resulting in the double number of keys and takes
more time.
Before this commit:
[ok]: AOF rewrite during write load: RDB preamble=yes (18225 ms)
[ok]: AOF rewrite during write load: RDB preamble=no (37249 ms)
After:
[ok]: AOF rewrite during write load: RDB preamble=yes (18777 ms)
[ok]: AOF rewrite during write load: RDB preamble=no (19940 ms)
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
There is no ethernet style virtual device (like lo 127.0.0.1) for RDMA,
however a connection with the same local address and peer address are
considered as local.
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
moduleTimerHandler is aeTimeProc handler and event loop gets created
with this. However, found that the function return type is int but
actually returns "long long" value(i.e., next_period). and return value
being assigned to int variable in processTimeEvents(where time events
are processed), this might cause an overflow of the timer values. So
changed the return type of the function to long long. And also updated
other callback function return type to be consistent.
I found this when I was checking functions reported in
https://github.com/valkey-io/valkey/issues/1054 issue stacktrace. (FYI,
this is just to update the return type to be consistent and it will not
the fix for the issue reported)
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
Currently cluster tests in unit/cluster are run as part of
the ./runtest. Sometims we change the cluster code and only
want to run cluster tests. This PR added a --cluster option
to runtest so that we can run only cluster tests.
Signed-off-by: Binbin <binloveplay1314@qq.com>
For fake clients like the ones used for Lua and modules, we don't
determine TLS in the right way, causing CLUSTER SLOTS from EVAL over TLS
to fail a debug-assert.
This error was introduced when the caching of CLUSTER SLOTS was
introduced, i.e. in 8.0.0.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The reason is VM_Call will use a fake client without connection,
so we also need to check if c->conn is NULL.
This also affects scripts. If they are called in the script, the
server will crash. Injecting commands into AOF will also cause
startup failure.
Fixes#1054.
Signed-off-by: Binbin <binloveplay1314@qq.com>
In CLUSTER FAILOVER FORCE case, we will set mf_can_start to
1 and wait for a cron to trigger the election. We can also set a
CLUSTER_TODO_HANDLE_MANUALFAILOVER flag so that we
can start the election as soon as possible instead of waiting for
the cron, so that we won't have a 100ms delay (clusterCron).
Signed-off-by: Binbin <binloveplay1314@qq.com>
This commit hopefully improves the formatting of the codebase by setting
ColumnLimit to 0 and hence stopping clang-format from trying to put as
much stuff in one line as possible.
This change enabled us to remove most of `clang-format off` directives
and fixed a bunch of lines that looked like this:
```c
#define KEY \
VALUE /* comment */
```
Additionally, one pair of `clang-format off` / `clang-format on` had
`clang-format off` as the second comment and hence didn't enable the
formatting for the rest of the file. This commit addresses this issue as
well.
Please tell me if anything in the changes seem off. If everything is
fine, I will add this commit to `.git-blame-ignore-revs` later.
---------
Signed-off-by: Mikhail Koviazin <mikhail.koviazin@aiven.io>
Ci reported this failure:
```
[exception]: Executing test client: ERR FAILOVER target replica is not online..
ERR FAILOVER target replica is not online.
while executing
"$node_0 failover to $node_1_host $node_1_port"
```
We can see somehow the replica is not online in time and
casuing this failure, added a verify_replica_online to make
sure the replica is online for the test.
Signed-off-by: Binbin <binloveplay1314@qq.com>
This commit changes the `tcmalloc.h` header location from the deprecated
location `google/` to `gperftools/`.
**Why we're doing this now?**
The location `google/tcmalloc.h` has been deprecated for more than 10
years in favor of `gperftools/tcmalloc.h`, and the deprecated location
will be removed in the next release of gperftools.
Fixes#1033
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
The one in CLUSTER SETSLOT help us keep track of state better,
of course it also can make the test case happy.
The one in gossip process fixes a problem that a replica can
print a log saying it is an empty primary.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Apparently this will fail to compile in some masOS version.
And internet claims _Thread_local is portable.
Fixes#1051.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Since in here the monitor value is mymaster, we need to make sure the
primary name is the same, otherwise the default configuration cannot start
sentinel.
```
sentinel monitor mymaster 127.0.0.1 6379 2
```
The following error occurs when the default configuration is started:
```
*** FATAL CONFIG FILE ERROR (Version 255.255.255) ***
Reading the configuration file, at line 358
>>> 'SENTINEL primary-reboot-down-after-period myprimary 0'
No such master with specified name.
```
Introduced in #647.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Apparently there is a timing issue when using wait_for_ofs_sync:
```
[exception]: Executing test client: can't read "out_before": no such variable.
can't read "out_before": no such variable
```
The reason is that if the connection between the primary
and the replica is not established yet, the master_repl_offset
of the primary and replica in wait_for_ofs_sync is 0, and
the check fails, resulting in no replica client in the
client list below.
In this case, we need to make sure the replica is online
before proceeding.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Sometims it is hard to see the old primary during a
multi primaries failover, adding this log can help
use to find the old primary node.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
errno is global variable and shared with system calls, so there is
chance it may be overwritten during io free or close socket in migrate
command code. It would be better it is copied before the free or
closesocket and use copied value to check for retry in socket_err block.
So added new variable to take copy and used the copy variable for the
check.
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
Replaced "Could not connect to Redis" with "Could not connect to server" in the log
output for connection errors in `getRedisContext` and `createClient`.
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
Call emptyData right before rdbLoad to prevent errors in the middle
and we drop the replication stream and leaving an empty database.
The real changes is in disk-based part, the rest is just code movement.
Signed-off-by: Binbin <binloveplay1314@qq.com>
This pull request improves code readability, as a follow up of #749.
- Internal Naming Conventions: Removed the use of underscores (_) for
internal static structures/functions.
- Descriptive Function Names: Updated function names to be more
descriptive, making their purpose clearer. For instance, `_dictExpand`
is renamed to `dictExpandIfAutoResizeAllowed`.
---------
Signed-off-by: Ping Xie <pingxie@google.com>
Fix timing issue in evaluating `cluster-allow-replica-migration` for replicas
There is a timing bug where the primary and replica have different
`cluster-allow-replica-migration` settings. In issue #970, we found that if
the replica receives `CLUSTER SETSLOT` before the gossip update, it remains
in the original shard. This happens because we only process the
`cluster-allow-replica-migration` flag for primaries during `CLUSTER SETSLOT`.
This commit fixes the issue by also evaluating this flag for replicas in the
`CLUSTER SETSLOT` path, ensuring correct replica migration behavior.
Closes#970
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
The node may not be able to initiate an election in time due to
problems with cluster communication. If an election is initiated,
make sure its offset is 0.
Closes#967.
Signed-off-by: Binbin <binloveplay1314@qq.com>
The cluster configuration file is the metadata "database" for the
cluster. It is best to trigger a save when shutdown the server, to
avoid inconsistent content that is not refreshed.
We save the nodes.conf whenever something that affects the nodes.conf
has changed. But we are saving nodes.conf in clusterBeforeSleep, and
some events may save it without a fsync, there is a time gap.
And shutdown has its own save seems good to me, it doesn't need to
care about the others.
At the same time, a comment is added in unlock nodes.conf to explain
why we actively unlock when shutdown.
Signed-off-by: Binbin <binloveplay1314@qq.com>
clang generates warning if there is no newline at the end of the source
file.
Update .clang-format to handle the missing newline at eof.
Signed-off-by: haoqixu <hq.xu0o0@gmail.com>
While doing some profiling, I noticed that getKeySlot() was a fairly
large part (~0.7%) of samples doing perf with high pipeline during
standalone. I think this is because we do a very late check for
server.cluster_mode, we first call getKeySlot() and then call
calculateKeySlot(). (calculateKeySlot was surprisingly not automatically
inlined, we were doing a jump into it and then immediately returning
zero). We then also do useless work in the form of caching zero in
client->slot, which will further mess with cache lines.
So, this PR tries to accomplish a few things things.
1) The usage of the `slot` name made a lot more sense before the
introduction of the kvstore. Now with kvstore, we call this the database
index, so all the references to slot in standalone are no longer really
accurate.
2) Pull the cluster mode check all the way out of getKeySlot(), so
hopefully a bit more performant.
3) Remove calculateKeySlot() as independent from getKeySlot().
calculateKeySlot used to have 3 call sites outside of db.c, which
warranted it's own function. It's now only called in two places,
pubsub.c and networking.c.
I ran some profiling, and saw about ~0.3% improvement, but don't really
trust it because you'll see a much higher (~2%) variance in test runs
just by how the branch predictions will get changed with a new memory
layout. Running perf again showed no samples in getKeySlot() and a
reduction in samples in lookupKey(), so maybe this will help a little
bit.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Maybe partially resolves https://github.com/valkey-io/valkey/issues/952.
The hostnames test relies on an assumption that node zero and node six
don't communicate with each other to test a bunch of behavior in the
handshake stake. This was done by previously dropping all meet packets,
however it seems like there was some case where node zero was sending a
single pong message to node 6, which was partially initializing the
state.
I couldn't track down why this happened, but I adjusted the test to
simply pause node zero which also correctly emulates the state we want
to be in since we're just testing state on node 6, and removes the
chance of errant messages. The test was failing about 5% of the time
locally, and I wasn't able to reproduce a failure with this new
configuration.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
This file enables developers to ignore the certain revisions in
git-blame. This is quite handy considering there was a commit that
reformatted the large amount of code in valkey.
As a downside, one has to do a manual step for each clone of valkey to
enable this feature. The instructions are available in the file itself.
---------
Signed-off-by: Mikhail Koviazin <mikhail.koviazin@aiven.io>
In RdbLoad, we disable AOF before emptyData and rdbLoad to prevent copy-on-write issues. After rdbLoad completes, AOF should be re-enabled, but the code incorrectly checks server.aof_state, which has been reset to AOF_OFF in stopAppendOnly. This leads to AOF not being re-enabled after being disabled.
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
Prior to comparing the replica buffer against the configured limit, we
need to ensure that the limit configuration is enabled. If the limit is
set to zero, it indicates that there is no limit, and we should skip the
buffer limit check.
---------
Signed-off-by: naglera <anagler123@gmail.com>
Fix for https://github.com/valkey-io/valkey/issues/997
Root Cause Analysis:
1. Two different jobs (READ and WRITE) may be sent to the same IO
thread.
2. When processing the read job in `processIOThreadsReadDone`, the IO
thread may find that the write job has also been completed.
3. In this case, the IO thread calls `processClientIOWriteDone` to first
process the completed write job and free the COBs
affbea5dc1/src/networking.c (L4666)
4. If there are pending writes (resulting from pipeline commands), a new
async IO write job is sent before processing the completed read job
affbea5dc1/src/networking.c (L2417)
When sending the write job, the `TLS_CONN_FLAG_POSTPONE_UPDATE_STATE`
flag is set to prevent the IO thread from updating the event loop, which
is not thread-safe.
5. Upon resuming the read job processing, the flag is cleared,
affbea5dc1/src/networking.c (L4685)
causing the IO thread to update the event loop.
Fix:
Prevent sending async write job for pending writes when a read job is
about to be processed.
Testing:
The issue could not be reproduced due to its rare occurrence, which
requires multiple specific conditions to align simultaneously.
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
For backwards compatibility reasons, a node will wait until it receives
a cluster message with the extensions flag before sending its own
extensions. This leads to a delay in shard ID propagation that can
corrupt nodes.conf with inaccurate shard IDs if a node is restarted
before this can stabilize.
This fixes much of that delay by immediately triggering the
extensions-supported flag during the MEET processing and attaching the
node to the link, allowing the PONG reply to contain OSS extensions.
Partially fixes#774
---------
Signed-off-by: Ben Totten <btotten@amazon.com>
Co-authored-by: Ben Totten <btotten@amazon.com>
This PR migrates the tests related to dict into new test framework as
part of #428.
Signed-off-by: haoqixu <hq.xu0o0@gmail.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
This PR migrates the tests related to listpack into new test framework
as part of #428.
Signed-off-by: haoqixu <hq.xu0o0@gmail.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Print the full client info by using catClientInfoString, the
info is useful when we want to identify the source of request.
Signed-off-by: Binbin <binloveplay1314@qq.com>
If we modify aof-use-rdb-preamble in the middle of rewrite,
we may get a wrong aof base suffix. This is because the suffix
is concatenated by the main process afterwards, and it may be
different from the beginning.
We cache this value when we start the rewrite.
Signed-off-by: Binbin <binloveplay1314@qq.com>
`clusterProcessGossipSection` currently doesn't trigger a check and call `replicationSetPrimary` when `myself`'s primary node’s IP/port is updated. This fix ensures that after every node address update, `replicationSetPrimary` is called if the updated node is `myself`'s primary. This prevents missed updates and ensures that replicas reconnect properly to maintain their replication link with the primary.
Before this doc update, the comments in valkey.conf said that DEL is a
blocking command, and even refered to other synchronous freeing as "in a
blocking way, like if DEL was called". This has now become confusing and
incorrect, since DEL is now non-blocking by default.
The comments also mentioned too much about the "old default" and only
later explain that the "new default" is non-blocking.
This doc update focuses on the current default and expresses it like
"Starting from Valkey 8.0, lazy freeing is enabled by default", rather
than using words like old and new.
This is a follow-up to #913.
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This pull request introduces several changes to improve the type safety
of Valkey's dictionary implementation:
- Getter/Setter Macros: Implemented macros `DICT_SET_VALUE` and
`DICT_GET_VALUE` to centralize type casting within these macros. This
change emulates the behavior of C++ templates in C, limiting type
casting to specific low-level operations and preventing it from being
spread across the codebase.
- Reduced Assert Overhead: Removed unnecessary asserts from critical hot
paths in the dictionary implementation.
- Consistent Naming: Standardized the naming of dictionary entry types.
For example, all dictionary entry types start their names with
`dictEntry`.
Fix#737
---------
Signed-off-by: Ping Xie <pingxie@google.com>
Signed-off-by: Ping Xie <pingxie@outlook.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Follow up to https://github.com/valkey-io/valkey/pull/966, which didn't
update the kvstore tests. I'm not actually entirely clear why it fixes
it, but the consistency prevents the crash very reliably so will merge
it now and maybe see if Zhao has a better explanation.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Implement data masking for user data in server logs and diagnostic output. This change prevents potential exposure of confidential information, such as PII, and enhances privacy protection. It masks all command arguments, client names, and client usernames.
Added a new hide-user-data-from-log configuration item, default yes.
---------
Signed-off-by: Amit Nagler <anagler123@gmail.com>
Feature `one-dict-per-slot` refactors the database, and part of it
involved splitting the rehashing list from the global level back to the
database level, or more specifically, the kvstore level. This change is
fine, and it also simplifies the process of swapping databases, which is
good. And it should not have a major impact on the efficiency of
incremental rehashing.
To implement the kvstore-level rehashing list, each `dict` under the
`kvstore` needs to know which `kvstore` it belongs. However, kvstore did
not insert the reference relationship into the `dict` itself, instead,
it placed it in the `dictType`. In my view, this is a somewhat odd way.
Theoretically, `dictType` is just a collection of function handles, a
kind of virtual type that can be referenced globally, not an entity. But
now the `dictType` is instantiated, with each `kvstore` owning an actual
`dictType`, which in turn holds a reverse reference to the `kvstore`'s
resource pointer. This design is somewhat uncomfortable for me.
I think the `dictType` should not be instantiated. The references
between actual resources (`kvstore` and `dict`) should occur between
specific objects, rather than force materializing the `dictType`, which
is supposed to be virtual.
---------
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
## Set replica-lazy-flush and lazyfree-lazy-user-flush to yes by
default.
There are many problems with running flush synchronously. Even in
single CPU environments, the thread managers should balance between
the freeing and serving incoming requests.
## Set lazy eviction, expire, server-del, user-del to yes by default
We now have a del and a lazyfree del, we also have these configuration
items to control: lazyfree-lazy-eviction, lazyfree-lazy-expire,
lazyfree-lazy-server-del, lazyfree-lazy-user-del. In most cases lazyfree
is better since it reduces the risk of blocking the main thread, and
because we have lazyfreeGetFreeEffort, on those with high effor
(currently
64) will use lazyfree.
Part of #653.
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
If the expiration time passed in SET is expired, for example, it
has expired due to the machine time (DTS) or the expiration time
passed in (wrong arg). In this case, we don't need to set the key
and wait for the active expire scan before deleting the key.
Compared with previous changes:
1. If the key does not exist, previously we would set the key and wait
for the active expire to delete it, so it is a set + del from the
perspective
of propaganda. Now we will no set the key and return, so it a NOP.
2. If the key exists, previously we woule set the key and wait
for the active expire to delete it, so it is a set + del From the
perspective
of propaganda. Now we will delete it and return, so it is a del.
Adding a new deleteExpiredKeyFromOverwriteAndPropagate function
to reduce the duplicate code.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Deletes zipmapSet, zipmapGet, etc. Only keep iterator and validate
integrity, what we use when loading an old RDB file.
Adjust unit tests to not use zipmapSet, etc.
Solves a build failure where when compiling with fortify source.
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The reason is the server 3 still have the server 7 as its replica
due to a short wait, the wait is not enough, we should wait for
server loss its replica.
```
*** [err]: valkey-cli make source node ignores NOREPLICAS error when doing the last CLUSTER SETSLOT
Expected '{127.0.0.1 21497 267}' to be equal to '' (context: type eval line 34 cmd {assert_equal [lindex [R 3 role] 2] {}} proc ::test)
```
Signed-off-by: Binbin <binloveplay1314@qq.com>
In standalone mode, when a `-REDIRECT` error occurs, special handling is
required if the client is in the `MULTI` context.
We have adopted the same handling method as the cluster mode:
1. If a command in the transaction encounters a `REDIRECT` at the time
of queuing, the execution of `EXEC` will return an `EXECABORT` error (we
expect the client to redirect and discard the transaction upon receiving
a `REDIRECT`). That is:
```
MULTI ==> +OK
SET x y ==> -REDIRECT
EXEC ==> -EXECABORT
```
2. If all commands are successfully queued (i.e., `QUEUED` results are
received) but a redirect is detected during `EXEC` execution (such as a
primary-replica switch), a `REDIRECT` is returned to instruct the client
to perform a redirect. That is:
```
MULTI ==> +OK
SET x y ==> +QUEUED
failover
EXEC ==> -REDIRECT
```
---------
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Migrate zipmap unit test to new unit test framework, parent ticket #428
.
---------
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
Signed-off-by: hwware <wen.hui.ware@gmail.com>
Co-authored-by: hwware <wen.hui.ware@gmail.com>
When reconfiguring sub-replica, there may a case that the sub-replica will
use the old offset and win the election and cause the data loss if the old
primary went down.
In this case, sender is myself's primary, when executing updateShardId,
not only the sender's shard_id is updated, but also the shard_id of
myself is updated, casuing the subsequent areInSameShard check, that is,
the full_sync_required check to fail.
As part of the recent fix of #885, the sub-replica needs to decide whether
a full sync is required or not when switching shards. This shard membership
check is supposed to be done against sub-replica's current shard_id, which
however was lost in this code path. This then leads to sub-replica joining
the other shard with a completely different and incorrect replication history.
This is the only place where replicaof state can be updated on this path
so the most natural fix would be to pull the chain replication reduction
logic into this code block and before the updateShardId call.
This one follow #885 and closes#942.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Fix a bug in isValidAuxChar where valid characters '.' and ':' were
incorrectly included in the banned charset. This issue affected the
validation of auxiliary fields in the nodes.conf file used by Valkey in
cluster mode, particularly when handling IPv4 and IPv6 addresses. The
code now correctly allows '.' and ':' as valid characters, ensuring
proper handling of these fields. Comments were added to clarify the use
of the banned charset.
Related to #736
---------
Signed-off-by: Ping Xie <pingxie@google.com>
In #792, the time complexity became ambiguous, fluctuating between
O(1) and O(n), which is a significant difference. And we agree uncertainty
can potentially bring disaster to the business, the right thing to do is
to persuade users to use EXISTS instead of KEYS in this case, to do the
right thing the right way, rather than accommodating this incorrect usage.
This reverts commit d66a06e8183818c035bb78706f46fd62645db07e.
This reverts #792.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Most of the content of TLS.md has already been copied to README.md in
#927.
The description of how to run tests with TLS is moved to
tests/README.md.
Descriptions of the additional scripts runtest-cluster, runtest-sentinel
and runtest-module are added in tests/README.md.
Links to tests/README.md and src/unit/README.md are added in the
top-level README.md along with a brief overview of the `make test-*`
commands.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The MANIFESTO is not Valkey's manifesto and it doesn't even mention open
source. Let's write another one later, or some other document about our
project principles.
The other two files are one-line files with no relevant info. They're
polluting the file listing at root level. It's the first thing you see
when you start exploring the repo for the first time.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Migrate the contents in TLS.md into TLS sections including building,
running and detail supports. TODO list in the TLS.md is almost done
except the implementation of benchmark support is still not the best
approach which should migrate to hiredis async mode.
Closes#888
---------
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
In #885, we only add a shutdown path, there is another path
is that the server might got hang by slowlog. This PR added
the pause path coverage to cover it.
Signed-off-by: Binbin <binloveplay1314@qq.com>
## Description
When I explore the cycles distributions for `lrange` test (
`valkey-benchmark -p 9001 -t lrange -d 100 -r 1000000 -n 1000000 -c 50
--threads 4`). I found the `prepareClientToWrite` and
`clientHasPendingReplies` could be reduced to single call outside
instead of called in a loop, ideally we can gain 3% performance. The
corresponding `LRANG_100`, `LRANG_300`, `LRANGE_500`, `LRANGE_600` have
~2% - 3% performance boost, the benchmark test prove it helps.
This patch try to move the `prepareClientToWrite` and its child
`clientHasPendingReplies` out of the loop to reduce the function
overhead.
---------
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
The test might be fast enough and then there is no change in the role
causing the test to fail. Adding a wait to avoid the timing issue:
```
*** [err]: valkey-cli make source node ignores NOREPLICAS error when doing the last CLUSTER SETSLOT
Expected '{127.0.0.1 23154 267}' to be equal to '' (context: type eval line 24 cmd {assert_equal [lindex [R 3 role] 2] {}} proc ::test)
```
Signed-off-by: Binbin <binloveplay1314@qq.com>
sdsAllocSize returns the correct size without consulting the
allocator. Which is much faster than consulting the allocator.
The only exception is SDS_TYPE_5, for which it has to
consult the allocator.
This PR also sets alloc field correctly for embedded string objects.
It assumes that no allocator would allocate a buffer larger
than `259 + sizeof(robj)` for embedded string. We use embedded strings
for strings up to 44 bytes. If this assumption is wrong, the whole
function would require a rewrite. In general case sds type adjustment
might be needed. Such logic should go to sds.c.
---------
Signed-off-by: Vadym Khoptynets <vadymkh@amazon.com>
Currently, the `dual-channel-replication` feature flag is immutable if
`enable-protected-configs` is enabled, which is the default behavior.
This PR proposes to make the `dual-channel-replication` flag mutable,
allowing it to be changed dynamically without restarting the cluster.
**Motivation:**
The ability to change the `dual-channel-replication` flag dynamically is
essential for testing and validating the feature on real clusters
running in production environments. By making the flag mutable, we can
enable or disable the feature without disrupting the cluster's
operations, facilitating easier testing and experimentation.
Additionally, this change would provide more flexibility for users to
enable or disable the feature based on their specific requirements or
operational needs without requiring a cluster restart.
---------
Signed-off-by: naglera <anagler123@gmail.com>
1. Remove redundant connIncrRefs/connDecrRefs
In socket.c, the reference counter is incremented before calling
callHandler, but the same reference counter is also incremented inside
callHandler before calling the actual callback.
static inline int callHandler(connection *conn, ConnectionCallbackFunc handler) {
connIncrRefs(conn);
if (handler) handler(conn);
connDecrRefs(conn);
...
}
This commit removes the redundant incr/decr calls in socket.c
2. Correct return value of connRead for TLS when peer closed
According to comments in connection.h, connRead returns 0 when the peer
has closed the connection. This patch corrects the return value for TLS
connections. (Without this patch, it returns -1 which means error.)
There is an observable difference in what is logged in the verbose
level: "Client closed connection" vs "Reading from client: (null)".
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This PR utilizes the IO threads to execute commands in batches, allowing
us to prefetch the dictionary data in advance.
After making the IO threads asynchronous and offloading more work to
them in the first 2 PRs, the `lookupKey` function becomes a main
bottle-neck and it takes about 50% of the main-thread time (Tested with
SET command). This is because the Valkey dictionary is a straightforward
but inefficient chained hash implementation. While traversing the hash
linked lists, every access to either a dictEntry structure, pointer to
key, or a value object requires, with high probability, an expensive
external memory access.
### Memory Access Amortization
Memory Access Amortization (MAA) is a technique designed to optimize the
performance of dynamic data structures by reducing the impact of memory
access latency. It is applicable when multiple operations need to be
executed concurrently. The principle behind it is that for certain
dynamic data structures, executing operations in a batch is more
efficient than executing each one separately.
Rather than executing operations sequentially, this approach interleaves
the execution of all operations. This is done in such a way that
whenever a memory access is required during an operation, the program
prefetches the necessary memory and transitions to another operation.
This ensures that when one operation is blocked awaiting memory access,
other memory accesses are executed in parallel, thereby reducing the
average access latency.
We applied this method in the development of `dictPrefetch`, which takes
as parameters a vector of keys and dictionaries. It ensures that all
memory addresses required to execute dictionary operations for these
keys are loaded into the L1-L3 caches when executing commands.
Essentially, `dictPrefetch` is an interleaved execution of dictFind for
all the keys.
**Implementation details**
When the main thread iterates over the `clients-pending-io-read`, for
clients with ready-to-execute commands (i.e., clients for which the IO
thread has parsed the commands), a batch of up to 16 commands is
created. Initially, the command's argv, which were allocated by the IO
thread, is prefetched to the main thread's L1 cache. Subsequently, all
the dict entries and values required for the commands are prefetched
from the dictionary before the command execution. Only then will the
commands be executed.
---------
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
This example was for script replication which we have
completely removed in 7.0, so this example is outdated now.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Although KEYS is a dangerous command and we recommend people
to avoid using it, some people who are not familiar with it
still using it, and even use KEYS with no pattern at all.
Once KEYS is using with no pattern, we can convert it to an
exact match to avoid iterating over all data.
Signed-off-by: Binbin <binloveplay1314@qq.com>
According to the Python document[1], any invalid escape sequences in
string literals now generate a DeprecationWarning (SyntaxWarning as of
3.12) and in the future this will become a SyntaxError.
This Change uses Python’s raw string notation for regular expression
patterns to avoid it.
[1]: https://docs.python.org/3.10/library/re.html
Signed-off-by: haoqixu <hq.xu0o0@gmail.com>
When failover deny to vote, sometimes due to network or
some blocking operations, the time of FAILOVER_AUTH_REQUEST
packet arrival is very uncertain. Since there is no epoch
information in these logs, it is hard to associate the log
with other logs.
Signed-off-by: Binbin <binloveplay1314@qq.com>
At the VERBOSE/DEBUG log level, which is output once every 5 seconds,
added to show the "Total" message of all clients and to show memory
usage (used_memory) with used_memory_human.
Also, it seems clearer to show "total" number of keys and the number of
volatile in entire keys.
---------
Signed-off-by: NAM UK KIM <namuk2004@naver.com>
Add new optional, immutable string config called `unixsocketgroup`.
Change the group of the unix socket to `unixsocketgroup` after `bind()`
if specified.
Adds tests to validate the behavior.
Fixes#873.
Signed-off-by: Ayush Sharma <mrayushs933@gmail.com>
Today if we attached the "run-extra-tests" tag it adds at least 20
minutes because the dump-fuzzer test runs with full accuracy. This
fuzzer is useful, but probably only really needed for the daily, so
removing it from the PRs. We still run the fuzzers, just not for as
long.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
This fixes#899. In that issue, the primary is cluster-allow-replica-migration no
and its replica is cluster-allow-replica-migration yes.
And during the slot migration:
1. Primary calling blockClientForReplicaAck, waiting its replica.
2. Its replica reconfiguring itself as a replica of other shards due to
replica migration and disconnect from the old primary.
3. The old primary never got the chance to receive the ack, so it got a
timeout and got a NOREPLICAS error.
In this case, the replicas might automatically migrate to another primary,
resulting in the client being unblocked with the NOREPLICAS error. In this
case, since the configuration will eventually propagate itself, we can safely
ignore this error on the source node.
Signed-off-by: Binbin <binloveplay1314@qq.com>
In CLUSTER SETSLOT propagation logic, if the replicas are down, the
client will get block during command processing and then unblock
with `NOREPLICAS Not enough good replicas to write`.
The reason is that all replicas are down (or some are down), but
myself->num_replicas is including all replicas, so the client will
get block and always get timeout.
We should only wait for those online replicas, otherwise the waiting
propagation will always timeout since there are not enough replicas.
The admin can easily check if there are replicas that are down for an
extended period of time. If they decide to move forward anyways, we
should not block it. If a replica failed right before the replication and
was not included in the replication, it would also unlikely win the election.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@google.com>
There is a test that assumes that the backlog will get overrun, but
because of the recent changes to the default it no longer fails. It
seems like it is a bit flakey now though, so resetting the value in the
test back to 1mb. (This relates to the CoB of 1100k. So it should
consistently work with a 1mb limit).
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Now, when clients run the unsubscribe, sunsubscribe and punsubscribe
commands in the non-subscribed mode, it returns 0.
Indeed this is a bug, we should not allow client run these kind of
commands here.
Thus, this PR fixes this bug, but it is a break change for existing
clients
---------
Signed-off-by: hwware <wen.hui.ware@gmail.com>
In #764, we add a --io-threads mode in test, but forgot
to handle runtest-cluster, they are different framework.
Currently runtest-cluster does not support tags, and we
don't have plan to support it. And currently cluster tests
does not have any io-threads tests, so this PR just align
--io-threads option with #764.
Signed-off-by: Binbin <binloveplay1314@qq.com>
If n is already myself primary, there is no need to re-establish the
replication connection.
In the past we allow a replica node to reconnect with its primary via
this CLUSTER REPLICATE command, it will use psync. But since #885, we
will assume that a full sync is needed in this case, so if we don't do
this, the replica will always use full sync.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@google.com>
Fix feedback loop in key eviction with tracking clients when using I/O
threads.
Current issue:
Evicting keys while tracking clients or key space-notification exist
creates a feedback loop when using I/O threads:
While evicting keys we send tracking async writes to I/O threads,
preventing immediate release of tracking clients' COB memory
consumption.
Before the I/O thread finishes its write, we recheck used_memory, which
now includes the tracking clients' COB and thus continue to evict more
keys.
**Fix:**
We will skip the test for now while IO threads are active. We may
consider avoiding sending writes in `processPendingWrites` to I/O
threads for tracking clients when we are out of memory.
---------
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Update README.md since the project is no longer under construction, and
can reference the main website.
---------
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
The repl-backlog-size 1mb is too small in most cases, now network
transmission and bandwidth performance have improved rapidly in more
than ten years.
The bigger the replication backlog, the longer the replica can endure
the disconnect and later be able to perform a partial resynchronization.
Part of #653.
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
Just add one more test for command "sentinel IS-PRIMARY-DOWN-BY-ADDR" to
make the reply-schemas-validator
run successfully.
Note: test result here
https://github.com/hwware/valkey/actions/runs/10457516111
Signed-off-by: hwware <wen.hui.ware@gmail.com>
If the client side crashes by any issue or exits normally, the kernel
will try to disconnect RDMA QPs. Then the kernel of server side receives
CM packets, valkey-server handles CM disconnected event and close
connection.
However, there is a lack of keepalive mechanism from RDMA transport
layer. Once the kernel of client side crashes, the server side will not
be notified. To avoid this issue, valkey server sents Keepaliv command
periodically to detect any dead QPs.
An example of mlx-cx5:
```
# RDMA: CQ handle error status: transport retry counter exceeded[0xc], opcode : 0x0
# RDMA: CQ handle error status: transport retry counter exceeded[0xc], opcode : 0x0
# RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0
# RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0
# RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0
# RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0
```
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Our current replica can initiate a failover without restriction when
it detects that the primary node is offline. This is generally not a
problem. However, consider the following scenarios:
1. In slot migration, a primary loses its last slot and then becomes
a replica. When it is fully synchronized with the new primary, the new
primary downs.
2. In CLUSTER REPLICATE command, a replica becomes a replica of another
primary. When it is fully synchronized with the new primary, the new
primary downs.
In the above scenario, case 1 may cause the empty primary to be elected
as the new primary, resulting in primary data loss. Case 2 may cause the
non-empty replica to be elected as the new primary, resulting in data
loss and confusion.
The reason is that we have cached primary logic, which is used for psync.
In the above scenario, when clusterSetPrimary is called, myself will cache
server.primary in server.cached_primary for psync. In replicationGetReplicaOffset,
we get server.cached_primary->reploff for offset, gossip it and rank it,
which causes the replica to use the old historical offset to initiate
failover, and it get a good rank, initiates election first, and then is
elected as the new primary.
The main problem here is that when the replica has not completed full
sync, it may get the historical offset in replicationGetReplicaOffset.
The fix is to clear cached_primary in these places where full sync is
obviously needed, and let the replica use offset == 0 to participate
in the election. In this way, this unhealthy replica has a worse rank
and is not easy to be elected.
Of course, it is possible that it will be elected with offset == 0.
In the future, we may need to prohibit the replica with offset == 0
from having the right to initiate elections.
Another point worth mentioning, in above cases:
1. In the ROLE command, the replica status will be handshake, and the
offset will be -1.
2. Before this PR, in the CLUSTER SHARD command, the replica status will
be online, and the offset will be the old cached value (which is wrong).
3. After this PR, in the CLUSTER SHARD, the replica status will be loading,
and the offset will be 0.
Signed-off-by: Binbin <binloveplay1314@qq.com>
In these places we should use RDB_EOF_MARK_SIZE, but we mixed
it with CONFIG_RUN_ID_SIZE. This is not an issue since they are
all 40, just a cleanup.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Saves some memory (one pointer) in the client struct.
Since a client cannot be blocked multiple times, we can assume
it will be held in only one extra utility list, so it is ok to maintain
a union of these listNode references.
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Currently, if the client enters a blocked state, it will be
added to the server.clients_waiting_acks list. When the client
is unblocked, that is, when unblockClient is called, we will
need to linearly traverse server.clients_waiting_acks to delete
the client, and this search is O(N).
When WAIT (or WAITAOF) is used extensively in some cases, this
O(N) search may be time-consuming. We can remember the list node
and store it in the blockingState struct and it can avoid the
linear search in unblockClientWaitingReplicas.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Add 4 new commands for Sentinel (reference
https://github.com/valkey-io/valkey/issues/36)
Sentinel GET-PRIMARY-ADDR-BY-NAME
Sentinel PRIMARY
Sentinel PRIMARIES
Sentinel IS-PRIMARY-DOWN-BY-ADDR
and deprecate 4 old commands:
Sentinel GET-MASTER-ADDR-BY-NAME
Sentinel MASTER
Sentinel MASTERS
Sentinel IS-MASTER-DOWN-BY-ADDR
and all sentinel tests pass here
https://github.com/hwware/valkey/actions/runs/9962102363/job/27525124583
Note:
1. runtest-sentinel pass all test cases
2. I finished a sentinel rolling upgrade test: 1 primary 2 replicas 3
sentinel
there are 4 steps in this test scenario:
step 1: all 3 sentinel nodes run old sentinel, shutdown primary, and
then new primary can be voted successfully.
step 2: replace sentinel 1 with new sentinel bin file, and then shutdown
primary, and then another new primary can be voted successfully
step 3: replace sentinel 2 with new sentinel bin file, and then shutdown
primary, and then another new primary can be voted successfully
step 4: replace sentinel 3 with new sentinel bin file, and then shutdown
primary, and then another new primary can be voted successfully
We can see, even mixed version sentinel running, whole system still
works.
---------
Signed-off-by: hwware <wen.hui.ware@gmail.com>
For `debug object` command, we use `val->lru` but ignore the `lfu` mode.
So in `lfu` mode, `debug object` would return meaningless `lru` descriptions.
Added two new fields lfu_freq and lfu_access_time_minutes.
Signed-off-by: jiangyujie.jyj <yjjiang1996@163.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Adding FAST option to DEBUG OBJECT command.
The light version only shows the light weight infomation,
which mostly O(1). The pre-existing version that show more
stats such as serializedlength sometimes is time consuming.
This should allow looking into debug stats (the key expired
but not deleted), even on huge object, on which we're afraid
to run the command for fear of causing a server freeze.
Somehow like 3ca451c46fed894bf49e7561fa0282d2583f1c06.
Signed-off-by: Binbin <binloveplay1314@qq.com>
In #786, we did skip it in the daily, but not for the others.
When running ./runtest on MacOS, we will get the failure.
```
couldn't open socket: host is unreachable (nodename nor servname provided, or not known)
```
The reason is that TCL 8.5 doesn't support ipv6, so we skip tests
tagged with ipv6. This also revert #786.
Signed-off-by: Binbin <binloveplay1314@qq.com>
I've tried to test a dual channel replication but forgot to add +sync
for my replication user. As a result replica entered silent cycle like
this:
```
* Connecting to PRIMARY 127.0.0.1:6379
* PRIMARY <-> REPLICA sync started
* Non blocking connect for SYNC fired the event.
* Primary replied to PING, replication can continue...
* Trying a partial resynchronization (request ...)
* PSYNC is not possible, initialize RDB channel.
* Aborting dual channel sync
```
And primary got endless cycle like this:
```
* Replica 127.0.0.1:6380 asks for synchronization
* Partial resynchronization not accepted: Replication ID mismatch (Replica asked for '...', my replication IDs are '...' and '...')
* Replica 127.0.0.1:6380 is capable of dual channel synchronization, and partial sync isn't possible. Full sync will continue with dedicated RDB channel.
```
There was no way to understand that replication user is missing +sync
acl on notice log level. With this one-line change we get a warning
message in our replica log.
---------
Signed-off-by: secwall <secwall@yandex-team.ru>
Update references of copyright being assigned to Salvatore when it was
transferred to Redis Ltd. as per
https://github.com/valkey-io/valkey/issues/544.
---------
Signed-off-by: Pieter Cailliau <pieter@redis.com>
A pointer to dtype is stored in the dict forever.
dtype is stack-allocated while the dict created is global.
The dict (and the pointer to dtype in it) will live past the lifetime of
dtype.
clusterManagerLinkDictType is a global object that has the same values
as dtype.
Signed-off-by: Salvatore Mesoraca <salvatore.mesoraca@aiven.io>
Fix#821
During the `FAILOVER` process, when conditions are met (such as when the
force time is reached or the primary and replica offsets are
consistent), the primary actively becomes the replica and transitions to
the `FAILOVER_IN_PROGRESS` state. After the primary becomes the replica,
and after handshaking and other operations, it will eventually send the
`PSYNC FAILOVER` command to the replica, after which the replica will
become the primary. This means that the upgrade of the replica to the
primary is an asynchronous operation, which implies that during the
`FAILOVER_IN_PROGRESS` state, there may be a period of time where both
nodes are replicas. In this scenario, if a `-REDIRECT` is returned, the
request will be redirected to the replica and then redirected back,
causing back and forth redirection. To avoid this situation, during the
`FAILOVER_IN_PROGRESS state`, we temporarily suspend the clients that
need to be redirected until the replica truly becomes the primary, and
then resume the execution.
---------
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
We are updating this variable in the main thread, and the
child threads can printing the logs at the same time. This
generating a warning in SANITIZER=thread:
```
WARNING: ThreadSanitizer: data race (pid=74208)
Read of size 4 at 0x000102875c10 by thread T3:
#0 serverLogRaw <null>:52173615 (valkey-server:x86_64+0x10003c556)
#1 _serverLog <null>:52173615 (valkey-server:x86_64+0x10003ca89)
#2 bioProcessBackgroundJobs <null>:52173615 (valkey-server:x86_64+0x1001402c9)
Previous write of size 4 at 0x000102875c10 by main thread (mutexes: write M0):
#0 afterSleep <null>:52173615 (valkey-server:x86_64+0x10004989b)
#1 aeProcessEvents <null>:52173615 (valkey-server:x86_64+0x100031e52)
#2 main <null>:52173615 (valkey-server:x86_64+0x100064a3c)
#3 start <null>:52173615 (dyld:x86_64+0xfffffffffff5c365)
#4 start <null>:52173615 (dyld:x86_64+0xfffffffffff5c365)
```
The refresh of daylight_active is not real time, we update
it in aftersleep, so we don't need a strong synchronization,
so using memory_order_relaxed. But also noted we are doing
load/store operations only for daylight_active, which is an
aligned 32-bit integer, so using memory_order_relaxed will
not provide more consistency than what we have today.
So this is just a cleanup that to clear the warning.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Test dual-channel-replication primary gets cob overrun during replica
rdb load` fails during the Valgrind run. This is due to the load
handlers disconnecting before the tests complete, resulting in a low
primary COB. Increasing the handlers' timeout should resolve this issue.
Failure:
https://github.com/valkey-io/valkey/actions/runs/10361286333/job/28681321393
Server logs reveals that the load handler clients were disconnected
before the test started
Also the two previus test took about 20 seconds which is the handler
timeout.
---------
Signed-off-by: naglera <anagler123@gmail.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
The clusterManagerCompareKeysValues is introduced in
143bfa1e6e65cf8be1eaad0b8169e2d95ca62f9a, which calls
DEBUG DIGEST-VALUE to check whether the value of the
source node and the target node are consistent.
However, the DEBUG DIGEST-VALUE command is not supported
in older version, such as version 4, and the node will return
unknown subcommand. Or the DEBUG command can be disabled
in version 7, and the node will return DEBUG not allowed.
In these cases, we need to output friendly message to
allow users to proceed to the next step, instead of just
outputing `Value check failed!`.
Unknown subcommand example:
```
*** Target key exists
*** Checking key values on both nodes...
Node 127.0.0.1:30001 replied with error:
ERR unknown subcommand or wrong number of arguments for 'DIGEST-VALUE'. Try DEBUG HELP.
Node 127.0.0.1:30003 replied with error:
ERR unknown subcommand or wrong number of arguments for 'DIGEST-VALUE'. Try DEBUG HELP.
*** Value check failed!
DEBUG DIGEST-VALUE command is not supported.
You can relaunch the command with --cluster-replace option to force key overriding.
```
DEBUG not allowed example:
```
*** Target key exists
*** Checking key values on both nodes...
Node 127.0.0.1:30001 replied with error:
ERR DEBUG command not allowed. If the enable-debug-command option is ...
Node 127.0.0.1:30003 replied with error:
ERR DEBUG command not allowed. If the enable-debug-command option is ...
*** Value check failed!
DEBUG command is not allowed.
You can turn on the enable-debug-command option.
Or you can relaunch the command with --cluster-replace option to force key overriding.
```
Signed-off-by: Binbin <binloveplay1314@qq.com>
In the past implementation of `ZUNION` and `ZUNIONSTORE` commands, we
first create a temporary dict called `accumulator`. After adding all
member-score mappings to `accumulator`, we still need to convert
`accumulator` back to the final dict `dstzset->dict`. However, we can
directly use `dstzset->dict` to avoid the additional copy operation.
This PR removes the `accumulator` dict and directly uses` dstzset->dict
`to store the member-score mappings.
- **Test**
First, I added 300 unique elements to two sorted sets called
'zunion_test1' and 'zunion_test2'. Then, I tested `zunion` and
`zunionstore` on these two sorted sets. The test results shown below
indicate that the performance of both zunion and zunionstore improved
about 31%.
### ZUNION
#### unstable
```
./valkey-benchmark -P 10 -n 100000 zunion 2 zunion_test1 zunion_test2
Summary:
throughput summary: 2713.41 requests per second
latency summary (msec):
avg min p50 p95 p99 max
146.252 3.464 153.343 182.015 184.959 192.895
```
#### This PR
```
./valkey-benchmark -P 10 -n 100000 zunion 2 zunion_test1 zunion_test2
Summary:
throughput summary: 3543.84 requests per second
latency summary (msec):
avg min p50 p95 p99 max
108.259 2.984 114.239 141.695 145.151 160.255
```
### ZUNIONSTORE
#### unstable
```
./valkey-benchmark -P 10 -n 100000 zunionstore out 2 zunion_test1 zunion_test2
Summary:
throughput summary: 3168.07 requests per second
latency summary (msec):
avg min p50 p95 p99 max
157.511 3.368 183.167 189.311 193.535 231.679
```
#### This PR
```
./valkey-benchmark -P 10 -n 100000 zunionstore out 2 zunion_test1 zunion_test2
Summary:
throughput summary: 4144.73 requests per second
latency summary (msec):
avg min p50 p95 p99 max
120.374 2.648 141.823 149.119 153.855 183.167
```
---------
Signed-off-by: RayCao <zisong.cw@alibaba-inc.com>
Signed-off-by: zisong.cw <zisong.cw@alibaba-inc.com>
A configuration option with zero impact on server operation but is
printed out on server crash and can be accessed by gdb for debugging. It
can be used by the user/operator to store any free-form string. This
string will persist as long as the server is running and will be
accessible in the following ways:
And printed in crash reports:
```
------ CONFIG DEBUG OUTPUT ------
lazyfree-lazy-eviction no
...
io-threads-do-reads yes
debug-context "test2"
proto-max-bulk-len 512mb
```
---------
Signed-off-by: Eran Liberty <eranl@amazon.com>
Co-authored-by: Eran Liberty <eranl@amazon.com>
- Fix TLS bug where connection were shutdown by primary's main process
while the child process was still writing- causing main process to be
blocked.
- TLS connection fix -file descriptors are set to blocking mode in the
main thread, followed by a blocking write. This sets the file
descriptors to non-blocking if TLS is used (see `connTLSSyncWrite()`)
(@xbasel).
- Improve the reliability of dual-channel tests. Modify the pause
mechanism to verify process status directly, rather than relying on log.
- Ensure that `server.repl_offset` and `server.replid` are updated
correctly when dual channel synchronization completes successfully.
Thist led to failures in replication tests that validate replication IDs
or compare replication offsets.
---------
Signed-off-by: naglera <anagler123@gmail.com>
Signed-off-by: naglera <58042354+naglera@users.noreply.github.com>
Signed-off-by: xbasel <103044017+xbasel@users.noreply.github.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
When debug assert mode enabled, verify that we don't insert the same
client twice into server.clients_to_close.
Signed-off-by: naglera <anagler123@gmail.com>
Currently lastbgsave_status is used in bgsave or disk-replication,
and the target is the disk. In #60, we update it when transfer error,
i think it is mainly used in tests, so we can use log to replace it.
It changes lastbgsave_status to err in this case, but it is strange
that it does not set ok or err in the above if and the following else.
Also noted this will affect stop-writes-on-bgsave-error.
Signed-off-by: Binbin <binloveplay1314@qq.com>
I think we should first check if the server is currently enabled in
cluster mode or if it has modules loaded prior to the throttled cron run
(`run_with_period`) condition.
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Added an assertion to avoid incorrect usage of the network bytes out for
replication code flow in slot stats computation.
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
This PR adjusts the logging conditions of clusterLogCantFailover
in this two ways.
1. For the same cant_failover_reason, we will print the log once
in CLUSTER_CANT_FAILOVER_RELOG_PERIOD, but its value is 10s, which
is a bit long, shorten it to 1s, so we can better track its state.
We get to see the system making progress by watching the message.
Using 1s also covers pretty much all cases as i don't see a reason
for using a <1s node timeout, test or prod.
2. We will not print logs before the nolog_fail_time, its value
is cluster-node-timeout+5000. This may casue us to lose some logs,
for example, if cluster-node-timeout is small, auth_timeout will
be 2000, and auth_retry_time will be 4000. In this case, we will
lose all the reasons during the election if the failover is timedout.
So remove the nolog_fail_time logic, since we still do have the
CLUSTER_CANT_FAILOVER_RELOG_PERIOD logic, we won't print too many
logs.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Few CI improvements witch will reduce occupation CI queue and eliminate
stale runs.
1. Kill CI jobs on PRs once PR branch gets a new push. This will prevent
situation happened today - a huge job triggered twice in less than an
hour and occupied all **org** (for all repositories) runners queue for
the rest of the day (see pic). This completely blocked valkey-glide
team.
2. Distribute nightly croned jobs on time to prevent them running
together. Keep in mind, cron's TZ is UTC, so midnight tasks incur
developers located in other timezones.
This must be backported to all release branches (`valkey-x.y` and `x.y`)

---------
Signed-off-by: Yury-Fridlyand <yury.fridlyand@improving.com>
This example of a minimal user account in your Valkey server
for Sentinel is incorrect. If you add this ACL as-is to your
valkey users.acl, valkey will add resetchannels -@all before
the +client which prevents sentinel from publishing messages
to the __sentinel__:hello pubsub for sentinel discovery.
Fix#744.
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
We have a number of test failures in the empty shard migration which
seem to be related to race conditions in the failover, but could be more
pervasive. For now disable the tests to prevent so many false negative
test failures.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Release candidates have a version that is lower than 8.0.0 to allow for
8.0.0 to have 0x080000 as a release number. However, we did an explicit
check to make sure a version was 8.0 or greater to validate a replica
supports a feature. Now we are using the highest patch version of latest
minor to do the comparison to accommodate future versions.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Fix bug in writeToClient
In https://github.com/valkey-io/valkey/pull/758, a major refactor was
done to `networking.c`.
As part of this refactor, a new bug was introduced: we don't advance the
`c->buf` pointer in repeated writes.
This bug should be very unlikely to manifest, as it requires the
client's TCP buffer to be filled in the first try and then released
immediately after in the second try.
Despite all my efforts to reproduce this scenario, I was unable to do
so.
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
I think it is a good idea to mention this.
The Cluster config file is written relative this directory, if the
'cluster-config-file' configuration directive is a relative path.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Fixes test failure
(https://github.com/valkey-io/valkey/actions/runs/10146979329/job/28056316421?pr=837)
on 32 bit system for slot stats metric underflow on the following
condition:
```
server.cluster->slot_stats[c->slot].network_bytes_out += (len * listLength(server.replicas));
```
* Here listLength accesses `len` which is of `unsigned long` type and
multiplied with `len` (which could be negative). This is a risky
operation and behaves differently based on the architecture.
```
clusterSlotStatsAddNetworkBytesOutForReplication(-sdslen(selectcmd->ptr));
```
* `sdslen` method returns `size_t`. applying `-` operation to decrement
network bytes out is also incorrect.
This change adds assertion on `len` being negative and handles the
wrapping of overall value.
---------
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Update the dual channel-replication tests to wait for the pause to begin
before attempting to unpause.
---------
Signed-off-by: naglera <anagler123@gmail.com>
If we do `config set appendonly yes` and `config set appendonly no`
in a multi, there are some unexpected behavior.
When doing appendonly yes, we will schedule a AOFRW, and when we
are doding appendonly no, we will call stopAppendOnly to stop it.
In stopAppendOnly, the aof_fd is -1 since the aof is not start yet
and the fsync and close will take the -1 and call it, so they will
all fail with EBADF. And stopAppendOnly will emit a server log, the
close(-1) should be no problem but it is still an undefined behavior.
This PR also adds a log `Background append only file rewriting
scheduled.` to bgrewriteaofCommand when it was scheduled.
And adds a log in stopAppendOnly when a scheduled AOF is canceled,
it will print `AOF was disabled but there is a scheduled AOF background, cancel it.`
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Adds two new metrics for per-slot statistics, network-bytes-in and
network-bytes-out. The network bytes are inclusive of replication bytes
but exclude other types of network traffic such as clusterbus traffic.
#### network-bytes-in
The metric tracks network ingress bytes under per-slot context, by
reverse calculation of `c->argv_len_sum` and `c->argc`, stored under a
newly introduced field `c->net_input_bytes_curr_cmd`.
#### network-bytes-out
The metric tracks network egress bytes under per-slot context, by
hooking onto COB buffer mutations.
#### sample response
Both metrics are reported under the `CLUSTER SLOT-STATS` command.
```
127.0.0.1:6379> cluster slot-stats slotsrange 0 0
1) 1) (integer) 0
2) 1) "key-count"
2) (integer) 0
3) "cpu-usec"
4) (integer) 0
5) "network-bytes-in"
6) (integer) 0
7) "network-bytes-out"
8) (integer) 0
```
---------
Signed-off-by: Kyle Kim <kimkyle@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Adds light-weight cluster bus header for pubsub message. Closes#557.
This also supports sending to and receiving non-light messages from
older versions of the engine.
The light-weight cluster bus message supports multiple pubsub messages
(payloads) for one pubsub channel. Receiving messages with multiple
payloads is supported but we're not yet sending such multi-payload
messages to other nodes.
---------
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Introduce several improvements to improve the stability of dual-channel
replication and fix compatibility issues.
1. Make dual-channel-replication tests more reliable: use pause instead
of forced sleep.
2. Fix race conditions when freeing RDB client.
3. Check if sync was stopped during local buffer streaming.
4. Fix $ENDOFFSET reply format to work on 32-bit machines too.
---------
Signed-off-by: naglera <anagler123@gmail.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
During the ZADD operation, a conversion from listpack to skiplist might
be necessary for the sorted set. Currently, the function
zsetTypeMaybeConvert only examines the number of elements but does not
check the Max size of the elements. It is advisable to include a check
for value_len_hint for a more robust conversion check mechanism.
---------
Signed-off-by: RayCao <zisong.cw@alibaba-inc.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Currently, when running tests with IO threads, we set the
`events-per-io-thread` config to 0. This activated IO threads 100% of
the time, regardless of the number of IO events.
This is causing issues with tests running multiple server instances, as
it drained machine CPU resources. As a result, tests could have very
long runtimes, especially on limited instances.
For example, in
https://github.com/valkey-io/valkey/actions/runs/10066315827/job/27827426986?pr=804,
the `Cluster consistency during live resharding` test ran for 1 hour and
41 minutes.
This PR addresses the issue by:
1. Deactivating IO threads when there are no IO events
2. Continuing to offload all IO events to IO threads
Tested on 16 cores instance, after implementing these changes, the
runtime for the `Cluster consistency during live resharding` test
dropped from 7 minutes an 14 seconds to 3 minutes and 28 seconds.
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
In some cases, like read more than write scenario, the replication
offset of the replicas are the same. When the primary fails, the
replicas have the same rankings (rank == 0). They issue the election
at the same time (although we have a random 500), the simultaneous
elections may lead to the failure of the election due to quorum.
In clusterGetReplicaRank, when we calculates the rank, if the offsets
are the same, the one with the smaller node name will have a better
rank to avoid this situation.
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
The metric tracks cpu time in micro-seconds, sharing the same value as
`INFO COMMANDSTATS`, aggregated under per-slot context.
---------
Signed-off-by: Kyle Kim <kimkyle@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
This PR allows running a subset of the daily tests with a PR by
attaching the `run-extra-tests` flag. This is done by conditionally
running the daily tests when the label is attached. (I will do that for
this PR to demonstrate).
One downside of this PR is that a lot of tests will forever show-up as
"skipped" for most PRs, as long as that doesn't bother us it should be
OK. Skipped tests don't take up any of our runner compute.
Another note, if the label isn't attached on the first commit, the
submitter will need to push something to get the tests to run again.
There is a way to make it kick off tests during a label, but that added
a bunch more complexity so just wanted to start with this.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
When executing the script, the client passed in is a fake
client, and its woff is always 0.
This results in woff always being 0 when executing wait/waitaof
in the script, and the command returns a wrong number.
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
We will not reset failover_auth_time after setting it, this is used
to check auth_timeout and auth_retry_time, but we should at least
reset it after a successful failover.
Let's assume the following scenario:
1. Two replicas initiate an election.
2. Replica 1 is elected as the primary node, and replica 2 does not have
enough votes.
3. Replica 1 is down, ie the new primary node down again in a short
time.
4. Replica 2 know that the new primary node is down and wants to
initiate
a failover, but because the failover_auth_time of the previous round
has not been reset, it needs to wait for it to time out and then wait
for the next retry time, which will take cluster-node-timeout * 4 times,
this adds a lot of delay.
There is another problem. Like we will set additional random time for
failover_auth_time, such as random 500ms and replicas ranking 1s. If
replica 2 receives PONG from the new primary node before sending the
FAILOVER_AUTH_REQUEST, that is, before the failover_auth_time, it will
change itself to a replica. If the new primary node goes down again at
this time, replica 2 will use the previous failover_auth_time to
initiate
an election instead of going through the logic of random 500ms and
replicas ranking 1s again, which may lead to unexpected consequences
(for example, a low-ranking replica initiates an election and becomes
the new primary node).
That is, we need to reset failover_auth_time at the appropriate time.
When the replica switches to a new primary, we reset it, because the
existing failover_auth_time is already out of date in this case.
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
Fix#784
Prior to the change, `CLUSTER SHARDS` command processing might pick a
failed primary node which won't have the slot coverage information and
the slots `output` in turn would be empty. This change finds an
appropriate node which has the slot coverage information served by a
given shard and correctly displays it as part of `CLUSTER SHARDS`
output.
Before:
```
1) 1) "slots"
2) (empty array)
3) "nodes"
4) 1) 1) "id"
2) "2936f22a490095a0a851b7956b0a88f2b67a5d44"
...
9) "role"
10) "master"
...
13) "health"
14) "fail"
```
After:
```
1) 1) "slots"
2) 1) 0
2) 5461
3) "nodes"
4) 1) 1) "id"
2) "2936f22a490095a0a851b7956b0a88f2b67a5d44"
...
9) "role"
10) "master"
...
13) "health"
14) "fail"
```
---------
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
### IO-Threads Work Offloading
This PR is the 2nd of 3 PRs intended to achieve the goal of 1M requests
per second.
(1st PR: https://github.com/valkey-io/valkey/pull/758)
This PR offloads additional work to the I/O threads, beyond the current
read-parse/write operations, to better utilize the I/O threads and
reduce the load on the main thread.
It contains the following 3 commits:
### Poll Offload
Currently, the main thread is responsible for executing the poll-wait
system call, while the IO threads wait for tasks from the main thread.
The poll-wait operation is expensive and can consume up to 30% of the
main thread's time. We could have let the IO threads do the poll-wait by
themselves, with each thread listening to some of the clients and
notifying the main thread when a client's command is ready to execute.
However, the current approach, where the main thread listens for events
from the network, has several benefits. The main thread remains in
charge, allowing it to know the state of each client
(idle/read/write/close) at any given time. Additionally, it makes the
threads flexible, enabling us to drain an IO thread's job queue and stop
a thread when the load is light without modifying the event loop and
moving its clients to a different IO thread. Furthermore, with this
approach, the IO threads don't need to wait for both messages from the
network and from the main thread instead, the threads wait only for
tasks from the main thread.
To enjoy the benefits of both the main thread remaining in charge and
the poll being offloaded, we propose offloading the poll-wait as a
single-time, non-blocking job to one of the IO threads. The IO thread
will perform a poll-wait non-blocking call while the main thread
processes the client commands. Later, in `aeProcessEvents`, instead of
sleeping on the poll, we check for the IO thread's poll-wait results.
The poll-wait will be offloaded in `beforeSleep` only when there are
ready events for the main thread to process. If no events are pending,
the main thread will revert to the current behavior and sleep on the
poll by itself.
**Implementation Details**
A new call back `custompoll` was added to the `aeEventLoop` when not set
to `NULL` the ae will call the `custompoll` callback instead of the
`aeApiPoll`.
When the poll is offloaded we will set the `custompoll` to
`getIOThreadPollResults` and send a poll-job to the thread. the thread
will take a mutex, call a non-blocking (with timeout 0) to `aePoll`
which will populate the fired events array. the IO thread will set the
`server.io_fired_events` to the number of the returning `numevents`,
later the main-thread in `custompoll` will return the
`server.io_fired_events` and will set the `customPoll` back to `NULL`.
To ensure thread safety when accessing server.el, all functions that
modify the eventloop events were wrapped with a mutex to ensure mutual
exclusion when modifying the events.
### Command Lookup Offload
As the IO thread parses the command from the client's Querybuf, it can
perform a command lookup in the commands dictionary, which can consume
up to ~5% of the main-thread runtime.
**Implementation details**
The IO thread will store the looked-up command in the client's new field
`io_parsed_cmd` field. We can't use `c->cmd` for that since we use
`c->cmd `to check if a command was reprocessed or not.
To ensure thread safety when accessing the command dictionary, we make
sure the main thread isn't changing the dictionary while IO threads are
accessing it. This is accomplished by introducing a new flag called
`no_incremental_rehash` for the `dictType` commands. When performing
`dictResize`, we will rehash the entire dictionary in place rather than
deferring the process.
### Free Offload
Since the command arguments are allocated by the I/O thread, it would be
beneficial if they were also freed by the same thread. If the main
thread frees objects allocated by the I/O thread, two issues arise:
1. During the freeing process, the main thread needs to access the SDS
pointed to by the object to get its length.
2. With Jemalloc, each thread manages thread local pool (`tcache`) of
buffers for quick reallocation without accessing the arena. If the main
thread constantly frees objects allocated by other threads, those
threads will have to frequently access the shared arena to obtain new
memory allocations
**Implementation Details**
When freeing the client's argv, we will send the argv array to the
thread that allocated it. The thread will be identified by the client
ID. When freeing an object during `dbOverwrite`, we will offload the
object free as well. We will extend this to offload the free during
`dbDelete` in a future PR, as its effects on defrag/memory evictions
need to be studied.
---------
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Primary side: Remove read handler upon RDB connection close.
At this stage we do not expect any writed form that connection
so it should be safe to remove the read handler. Otherwise the
read handler will keep printing the `Client closed connection`
logs, see handleReadResult.
Signed-off-by: naglera <anagler123@gmail.com>
server.rdb_child_exit_pipe is init in !dual_channel block,
so the call here would be close(-1) in !dual_channel way.
It will also generate a warning in valgrind:
Warning: invalid file descriptor -1 in syscall close()
Introduced in #60.
Signed-off-by: Binbin <binloveplay1314@qq.com>
After #60, the CI report this warning:
```
rdb.c: In function 'rdbSaveToReplicasSockets':
rdb.c:3661:28: error: 'safe_to_exit_pipe' may be used uninitialized [-Werror=maybe-uninitialized]
3661 | if (!dual_channel) close(safe_to_exit_pipe);
| ^~~~~~~~~~~~~~~~~~~~~~~~
rdb.c:3512:37: note: 'safe_to_exit_pipe' was declared here
3512 | int pipefds[2], rdb_pipe_write, safe_to_exit_pipe;
| ^~~~~~~~~~~~~~~~~
rdb.c:3654:17: error: 'rdb_pipe_write' may be used uninitialized [-Werror=maybe-uninitialized]
3654 | close(rdb_pipe_write); /* close write in parent so that it can detect the close on the child. */
| ^~~~~~~~~~~~~~~~~~~~~
rdb.c:3512:21: note: 'rdb_pipe_write' was declared here
3512 | int pipefds[2], rdb_pipe_write, safe_to_exit_pipe;
| ^~~~~~~~~~~~~~
cc1: all warnings being treated as errors
```
Signed-off-by: Binbin <binloveplay1314@qq.com>
In this PR we introduce the main benefit of dual channel replication by
continuously steaming the COB (client output buffers) in parallel to the
RDB and thus keeping the primary's side COB small AND accelerating the
overall sync process. By streaming the replication data to the replica
during the full sync, we reduce
1. Memory load from the primary's node.
2. CPU load from the primary's main process. [Latest performance
tests](#data)
## Motivation
* Reduce primary memory load. We do that by moving the COB tracking to
the replica side. This also decrease the chance for COB overruns. Note
that primary's input buffer limits at the replica side are less
restricted then primary's COB as the replica plays less critical part in
the replication group. While increasing the primary’s COB may end up
with primary reaching swap and clients suffering, at replica side we’re
more at ease with it. Larger COB means better chance to sync
successfully.
* Reduce primary main process CPU load. By opening a new, dedicated
connection for the RDB transfer, child processes can have direct access
to the new connection. Due to TLS connection restrictions, this was not
possible using one main connection. We eliminate the need for the child
process to use the primary's child-proc -> main-proc pipeline, thus
freeing up the main process to process clients queries.
## Dual Channel Replication high level interface design
- Dual channel replication begins when the replica sends a `REPLCONF
CAPA DUALCHANNEL` to the primary during initial
handshake. This is used to state that the replica is capable of dual
channel sync and that this is the replica's main channel, which is not
used for snapshot transfer.
- When replica lacks sufficient data for PSYNC, the primary will send
`-FULLSYNCNEEDED` response instead
of RDB data. As a next step, the replica creates a new connection
(rdb-channel) and configures it against
the primary with the appropriate capabilities and requirements. The
replica then requests a sync
using the RDB channel.
- Prior to forking, the primary sends the replica the snapshot's end
repl-offset, and attaches the replica
to the replication backlog to keep repl data until the replica requests
psync. The replica uses the main
channel to request a PSYNC starting at the snapshot end offset.
- The primary main threads sends incremental changes via the main
channel, while the bgsave process
sends the RDB directly to the replica via the rdb-channel. As for the
replica, the incremental
changes are stored on a local buffer, while the RDB is loaded into
memory.
- Once the replica completes loading the rdb, it drops the
rdb-connection and streams the accumulated incremental
changes into memory. Repl steady state continues normally.
## New replica state machine

## Data <a name="data"></a>



## Explanation
These graphs demonstrate performance improvements during full sync
sessions using rdb-channel + streaming rdb directly from the background
process to the replica.
First graph- with at most 50 clients and light weight commands, we saw
5%-7.5% improvement in write latency during sync session.
Two graphs below- full sync was tested during heavy read commands from
the primary (such as sdiff, sunion on large sets). In that case, the
child process writes to the replica without sharing CPU with the loaded
main process. As a result, this not only improves client response time,
but may also shorten sync time by about 50%. The shorter sync time
results in less memory being used to store replication diffs (>60% in
some of the tested cases).
## Test setup
Both primary and replica in the performance tests ran on the same
machine. RDB size in all tests is 3.7gb. I generated write load using
valkey-benchmark ` ./valkey-benchmark -r 100000 -n 6000000 lpush my_list
__rand_int__`.
---------
Signed-off-by: naglera <anagler123@gmail.com>
Signed-off-by: naglera <58042354+naglera@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Fixes a regression introduced in PR #445, which allowed a message from a
replica
to update the slot ownership of its primary. The regression results in a
`replicaof` cycle, causing server crashes due to the cycle detection
assert. The
fix restores the previous behavior where only primary senders can
trigger
`clusterUpdateSlotsConfigWith`.
Additional changes:
* Handling of primaries without slots is obsoleted by new handling of
when a
sender that was a replica announces that it is now a primary.
* Replication loop detection code is unchanged but shifted downwards.
* Some variables are renamed for better readability and some are
introduced to
avoid repeated memcmp() calls.
Fixes#753.
---------
Signed-off-by: Ping Xie <pingxie@google.com>
Update sentinel.conf config parameter,
From:
SENTINEL master-reboot-down-after-period mymaster 0
To:
SENTINEL primary-reboot-down-after-period myprimary 0
But we still keep the backward compatibility, clients could use SENTINEL
master-reboot-down-after-period mymaster 0 OR
SENTINEL primary-reboot-down-after-period myprimary 0
---------
Signed-off-by: hwware <wen.hui.ware@gmail.com>
Adds an option to build RDMA support as a module:
make BUILD_RDMA=module
To start valkey-server with RDMA, use a command line like the following:
./src/valkey-server --loadmodule src/valkey-rdma.so \
port=6379 bind=xx.xx.xx.xx
* Implement server side of connection module only, this means we can
*NOT*
compile RDMA support as built-in.
* Add necessary information in README.md
* Support 'CONFIG SET/GET', for example, 'CONFIG Set rdma.port 6380',
then
check this by 'rdma res show cm_id' and valkey-cli (with RDMA support,
but not implemented in this patch).
* The full listeners show like:
listener0:name=tcp,bind=*,bind=-::*,port=6379
listener1:name=unix,bind=/var/run/valkey.sock
listener2:name=rdma,bind=xx.xx.xx.xx,bind=yy.yy.yy.yy,port=6379
listener3:name=tls,bind=*,bind=-::*,port=16379
Because the lack of RDMA support from TCL, use a simple C program to
test
Valkey Over RDMA (under tests/rdma/). This is a quite raw version with
basic
library dependence: libpthread, libibverbs, librdmacm. Run using the
script:
./runtest-rdma [ OPTIONS ]
To run RDMA in GitHub actions, a kernel module RXE for emulated soft
RDMA, needs
to be installed. The kernel module source code is fetched a repo
containing only
the RXE kernel driver from the Linux kernel, but stored in an separate
repo to
avoid cloning the whole Linux kernel repo.
----
Since 2021/06, I created a
[PR](https://github.com/redis/redis/pull/9161) for *Redis Over RDMA*
proposal. Then I did some work to [fully abstract connection and make
TLS dynamically loadable](https://github.com/redis/redis/pull/9320), a
new connection type could be built into Redis statically, or a separated
shared library(loaded by Redis on startup) since Redis 7.2.0.
Base on the new connection framework, I created a new
[PR](https://github.com/redis/redis/pull/11182), some
guys(@xiezhq-hermann @zhangyiming1201 @JSpewock @uvletter @FujiZ)
noticed, played and tested this PR. However, because of the lack of time
and knowledge from the maintainers, this PR has been pending about 2
years.
Related doc: [Introduce *Valkey Over RDMA*
specification](https://github.com/valkey-io/valkey-doc/pull/123). (same
as Redis, and this should be same)
Changes in this PR:
- implement *Valkey Over RDMA*. (compact the Valkey style)
Finally, if this feature is considered to merge, I volunteer to maintain
it.
---------
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Implementing the change proposed here:
https://github.com/valkey-io/valkey/issues/487
In this PR, we prevent tracking new custom error messages (e.g. LUA) if
the number of error messages (in the errors RAX) is greater than 128.
Instead, we will track any additional custom error prefix in a new
counter: `errorstat_ERRORSTATS_OVERFLOW ` and if any non-custom flagged
errors (e.g. MOVED / CLUSTERDOWN) occur, they will continue to be
tracked as usual.
This will address the issue of spammed error messages / memory usage of
the errors RAX. Additionally, we will not have to execute `CONFIG
RESETSTAT` to restore error stats functionality because normal error
messages continue to be tracked.
Example:
```
# Errorstats
.
.
.
errorstat_127:count=2
errorstat_128:count=2
errorstat_ERR:count=1
errorstat_ERRORSTATS_OVERFLOW:count=2
```
---------
Signed-off-by: Karthik Subbarao <karthikrs2021@gmail.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Allows cluster admins to configure the blacklist TTL as needed to allow
sufficient time for `CLUSTER FORGET` to be executed on every node in the
cluster.
Config name `cluster-blacklist-ttl`; unit seconds; deault 60.
---------
Signed-off-by: Brennan Cathcart <brennancathcart@gmail.com>
Normally we can create a test cluster directly in the directory
using `./utils/create-cluster/create-cluster`, which would keep
the test files under `./` and messed up the git.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Inputing a negative number equivalent to --loop, and inputing a
number greater than or equal to 0 will cause the tests to be run
one more time.
Signed-off-by: Binbin <binloveplay1314@qq.com>
The test fails because, in external, another test may have
enabled appendonly, causing acklocal to return 1.
We can add a CONFIG SET to disable the appendonly, but this
is not safe too unless we use multi. The test does not actually
rely on appendonly, so we can just * it.
Fixes#770.
Signed-off-by: Binbin <binloveplay1314@qq.com>
New configs:
* `cluster-announce-client-ipv4`
* `cluster-announce-client-ipv6`
New module API function:
* `ValkeyModule_GetClusterNodeInfoForClient`, takes a client id and is
otherwise just like its non-ForClient cousin.
If configured, one of these IP addresses are reported to each client in
CLUSTER SLOTS, CLUSTER SHARDS, CLUSTER NODES and redirects, replacing
the IP (`custer-announce-ip` or the auto-detected IP) of each node.
Which one is reported to the client depends on whether the client is
connected over IPv4 or IPv6.
Benefits:
* This allows clients using IPv4 to get the IPv4 addresses of all
cluster nodes and IPv6 clients to get the IPv6 clients.
* This allows the IPs visible to clients to be different to the IPs used
between the cluster nodes due to NAT'ing.
The information is propagated in the cluster bus using new Ping
extensions. (Old nodes without this feature ignore unknown Ping
extensions.)
This adds another dimension to CLUSTER SLOTS reply. It now depends on
the client's use of TLS, the IP address family and RESP version.
Refactoring: The cached connection type definition is moved from
connection.h (it actually has nothing to do with the connection
abstraction) to server.h and is changed to a bitmap, with one bit for
each of TLS, IPv6 and RESP3.
Fixes#337
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
I noticed in #738 that we don't properly check ULong config boundaries
and made the change there. I'm pulling out that particular commit into
this PR since we don't know if we want to merge the configurable cluster
blacklist TTL yet.
---------
Signed-off-by: Brennan Cathcart <brennancathcart@gmail.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Test case "EVAL - Scripts do not block on waitaof" observed to fail in
e.g.
https://github.com/valkey-io/valkey/actions/runs/9860131487/job/27233756421?pr=688
It can happen that the local AOF has been written and 1 is returned here
where 0 is expected. Writing a key inside the EVAL script makes sure
there's no time to write the AOF.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
for kqueue:
EV_DELETE fails if the specified fd is not associated with the kqfd. If
EVFILT_WRITE is associated but EVFILT_READ is not, then calling
aeApiDelEvent with mask = -1 or `(AE_READABLE|AE_WRITABLE)` will
cause the kevent() to fail with errno = 2(No such file or directory) and
EVFILT_WRITE not dissociated. So we need to calculate the actual mask
to be removed, instead of passing in whatever user provides.
for evport:
The comment clearly states that aeApiDelEvent "rely on the fact that our
caller has already updated the mask in the eventLoop".
for epoll:
There's no need to calculate the "actual mask" twice, once in
`aeDeleteFileEvent` and another in `aeApiDelEvent`, let's just use the
mask recorded in the eventLoop.
Fixes#715
Signed-off-by: wkgcass <wkgcass@hotmail.com>
Co-authored-by: Andy Pan <i@andypan.me>
Co-authored-by: Binbin <binloveplay1314@qq.com>
This PR is 1 of 3 PRs intended to achieve the goal of 1 million requests
per second, as detailed by [dan touitou](https://github.com/touitou-dan)
in https://github.com/valkey-io/valkey/issues/22. This PR modifies the
IO threads to be fully asynchronous, which is a first and necessary step
to allow more work offloading and better utilization of the IO threads.
### Current IO threads state:
Valkey IO threads were introduced in Redis 6.0 to allow better
utilization of multi-core machines. Before this, Redis was
single-threaded and could only use one CPU core for network and command
processing. The introduction of IO threads helps in offloading the IO
operations to multiple threads.
**Current IO Threads flow:**
1. Initialization: When Redis starts, it initializes a specified number
of IO threads. These threads are in addition to the main thread, each
thread starts with an empty list, the main thread will populate that
list in each event-loop with pending-read-clients or
pending-write-clients.
2. Read Phase: The main thread accepts incoming connections and reads
requests from clients. The reading of requests are offloaded to IO
threads. The main thread puts the clients ready-to-read in a list and
set the global io_threads_op to IO_THREADS_OP_READ, the IO threads pick
the clients up, perform the read operation and parse the first incoming
command.
3. Command Processing: After reading the requests, command processing is
still single-threaded and handled by the main thread.
4. Write Phase: Similar to the read phase, the write phase is also be
offloaded to IO threads. The main thread prepares the response in the
clients’ output buffer then the main thread puts the client in the list,
and sets the global io_threads_op to the IO_THREADS_OP_WRITE. The IO
threads then pick the clients up and perform the write operation to send
the responses back to clients.
5. Synchronization: The main-thread communicate with the threads on how
many jobs left per each thread with atomic counter. The main-thread
doesn’t access the clients while being handled by the IO threads.
**Issues with current implementation:**
* Underutilized Cores: The current implementation of IO-threads leads to
the underutilization of CPU cores.
* The main thread remains responsible for a significant portion of
IO-related tasks that could be offloaded to IO-threads.
* When the main-thread is processing client’s commands, the IO threads
are idle for a considerable amount of time.
* Notably, the main thread's performance during the IO-related tasks is
constrained by the speed of the slowest IO-thread.
* Limited Offloading: Currently, Since the Main-threads waits
synchronously for the IO threads, the Threads perform only read-parse,
and write operations, with parsing done only for the first command. If
the threads can do work asynchronously we may offload more work to the
threads reducing the load from the main-thread.
* TLS: Currently, we don't support IO threads with TLS (where offloading
IO would be more beneficial) since TLS read/write operations are not
thread-safe with the current implementation.
### Suggested change
Non-blocking main thread - The main thread and IO threads will operate
in parallel to maximize efficiency. The main thread will not be blocked
by IO operations. It will continue to process commands independently of
the IO thread's activities.
**Implementation details**
**Inter-thread communication.**
* We use a static, lock-free ring buffer of fixed size (2048 jobs) for
the main thread to send jobs and for the IO to receive them. If the ring
buffer fills up, the main thread will handle the task itself, acting as
back pressure (in case IO operations are more expensive than command
processing). A static ring buffer is a better candidate than a dynamic
job queue as it eliminates the need for allocation/freeing per job.
* An IO job will be in the format: ` [void* function-call-back | void
*data] `where data is either a client to read/write from and the
function-ptr is the function to be called with the data for example
readQueryFromClient using this format we can use it later to offload
other types of works to the IO threads.
* The Ring buffer is one way from the main-thread to the IO thread, Upon
read/write event the main thread will send a read/write job then in
before sleep it will iterate over the pending read/write clients to
checking for each client if the IO threads has already finished handling
it. The IO thread signals it has finished handling a client read/write
by toggling an atomic flag read_state / write_state on the client
struct.
**Thread Safety**
As suggested in this solution, the IO threads are reading from and
writing to the clients' buffers while the main thread may access those
clients.
We must ensure no race conditions or unsafe access occurs while keeping
the Valkey code simple and lock free.
Minimal Action in the IO Threads
The main change is to limit the IO thread operations to the bare
minimum. The IO thread will access only the client's struct and only the
necessary fields in this struct.
The IO threads will be responsible for the following:
* Read Operation: The IO thread will only read and parse a single
command. It will not update the server stats, handle read errors, or
parsing errors. These tasks will be taken care of by the main thread.
* Write Operation: The IO thread will only write the available data. It
will not free the client's replies, handle write errors, or update the
server statistics.
To achieve this without code duplication, the read/write code has been
refactored into smaller, independent components:
* Functions that perform only the read/parse/write calls.
* Functions that handle the read/parse/write results.
This refactor accounts for the majority of the modifications in this PR.
**Client Struct Safe Access**
As we ensure that the IO threads access memory only within the client
struct, we need to ensure thread safety only for the client's struct's
shared fields.
* Query Buffer
* Command parsing - The main thread will not try to parse a command from
the query buffer when a client is offloaded to the IO thread.
* Client's memory checks in client-cron - The main thread will not
access the client query buffer if it is offloaded and will handle the
querybuf grow/shrink when the client is back.
* CLIENT LIST command - The main thread will busy-wait for the IO thread
to finish handling the client, falling back to the current behavior
where the main thread waits for the IO thread to finish their
processing.
* Output Buffer
* The IO thread will not change the client's bufpos and won't free the
client's reply lists. These actions will be done by the main thread on
the client's return from the IO thread.
* bufpos / block→used: As the main thread may change the bufpos, the
reply-block→used, or add/delete blocks to the reply list while the IO
thread writes, we add two fields to the client struct: io_last_bufpos
and io_last_reply_block. The IO thread will write until the
io_last_bufpos, which was set by the main-thread before sending the
client to the IO thread. If more data has been added to the cob in
between, it will be written in the next write-job. In addition, the main
thread will not trim or merge reply blocks while the client is
offloaded.
* Parsing Fields
* Client's cmd, argc, argv, reqtype, etc., are set during parsing.
* The main thread will indicate to the IO thread not to parse a cmd if
the client is not reset. In this case, the IO thread will only read from
the network and won't attempt to parse a new command.
* The main thread won't access the c→cmd/c→argv in the CLIENT LIST
command as stated before it will busy wait for the IO threads.
* Client Flags
* c→flags, which may be changed by the main thread in multiple places,
won't be accessed by the IO thread. Instead, the main thread will set
the c→io_flags with the information necessary for the IO thread to know
the client's state.
* Client Close
* On freeClient, the main thread will busy wait for the IO thread to
finish processing the client's read/write before proceeding to free the
client.
* Client's Memory Limits
* The IO thread won't handle the qb/cob limits. In case a client crosses
the qb limit, the IO thread will stop reading for it, letting the main
thread know that the client crossed the limit.
**TLS**
TLS is currently not supported with IO threads for the following
reasons:
1. Pending reads - If SSL has pending data that has already been read
from the socket, there is a risk of not calling the read handler again.
To handle this, a list is used to hold the pending clients. With IO
threads, multiple threads can access the list concurrently.
2. Event loop modification - Currently, the TLS code
registers/unregisters the file descriptor from the event loop depending
on the read/write results. With IO threads, multiple threads can modify
the event loop struct simultaneously.
3. The same client can be sent to 2 different threads concurrently
(https://github.com/redis/redis/issues/12540).
Those issues were handled in the current PR:
1. The IO thread only performs the read operation. The main thread will
check for pending reads after the client returns from the IO thread and
will be the only one to access the pending list.
2. The registering/unregistering of events will be similarly postponed
and handled by the main thread only.
3. Each client is being sent to the same dedicated thread (c→id %
num_of_threads).
**Sending Replies Immediately with IO threads.**
Currently, after processing a command, we add the client to the
pending_writes_list. Only after processing all the clients do we send
all the replies. Since the IO threads are now working asynchronously, we
can send the reply immediately after processing the client’s requests,
reducing the command latency. However, if we are using AOF=always, we
must wait for the AOF buffer to be written, in which case we revert to
the current behavior.
**IO threads dynamic adjustment**
Currently, we use an all-or-nothing approach when activating the IO
threads. The current logic is as follows: if the number of pending write
clients is greater than twice the number of threads (including the main
thread), we enable all threads; otherwise, we enable none. For example,
if 8 IO threads are defined, we enable all 8 threads if there are 16
pending clients; else, we enable none.
It makes more sense to enable partial activation of the IO threads. If
we have 10 pending clients, we will enable 5 threads, and so on. This
approach allows for a more granular and efficient allocation of
resources based on the current workload.
In addition, the user will now be able to change the number of I/O
threads at runtime. For example, when decreasing the number of threads
from 4 to 2, threads 3 and 4 will be closed after flushing their job
queues.
**Tests**
Currently, we run the io-threads tests with 4 IO threads
(443d80f168/.github/workflows/daily.yml (L353)).
This means that we will not activate the IO threads unless there are 8
(threads * 2) pending write clients per single loop, which is unlikely
to happened in most of tests, meaning the IO threads are not currently
being tested.
To enforce the main thread to always offload work to the IO threads,
regardless of the number of pending events, we add an
events-per-io-thread configuration with a default value of 2. When set
to 0, this configuration will force the main thread to always offload
work to the IO threads.
When we offload every single read/write operation to the IO threads, the
IO-threads are running with 100% CPU when running multiple tests
concurrently some tests fail as a result of larger than expected command
latencies. To address this issue, we have to add some after or wait_for
calls to some of the tests to ensure they pass with IO threads as well.
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
We forgot to call sdsfreesplitres in the error path during a
nodes.conf corruption check, this function exits on the error
paths so this is just a cleanup.
Signed-off-by: bentotten <59932872+bentotten@users.noreply.github.com>
This PR updates all Redis/redis keywords to Valkey/valkey, including
variable names, comments, function names.
All sentinel test cases passed.
---------
Signed-off-by: hwware <wen.hui.ware@gmail.com>
Currently, for nested MULTI or executing WATCH in MULTI, we will return
an error but we will not abort the transaction.
```
127.0.0.1:6379> multi
OK
127.0.0.1:6379(TX)> multi
(error) ERR MULTI calls can not be nested
127.0.0.1:6379(TX)> set key value
QUEUED
127.0.0.1:6379(TX)> exec
1) OK
127.0.0.1:6379> multi
OK
127.0.0.1:6379(TX)> watch key
(error) ERR WATCH inside MULTI is not allowed
127.0.0.1:6379(TX)> set key value
QUEUED
127.0.0.1:6379(TX)> exec
1) OK
```
This is an unexpected behavior that should abort the transaction.
The number of elements returned by EXEC also doesn't match the number
of commands in MULTI.
Add the NO_MULTI flag to them so that they will
be rejected in processCommand and rejectCommand will abort the
transaction.
So there are two visible changes:
- Different words in the error messages. (Command not allowed inside a
transaction)
- Exec returns error.
Signed-off-by: Binbin <binloveplay1314@qq.com>
This PR incorporates changes related to key embedding described in the
https://github.com/redis/redis/issues/12216
With this change there will be no `key` pointer and embedded the key
within the `dictEntry`. 1 byte is used for additional bookkeeping.
Overall the saving would be 7 bytes on average.
Key changes:
New dict entry type introduced, which is now used as an entry for the
main dictionary:
```c
typedef struct {
union {
void *val;
uint64_t u64;
int64_t s64;
double d;
} v;
struct dictEntry *next; /* Next entry in the same hash bucket. */
uint8_t key_header_size; /* offset into key_buf where the key is located at. */
unsigned char key_buf[]; /* buffer with embedded key. */
} embeddedDictEntry;
```
One new function has been added to the dictType:
```c
size_t (*embedKey)(unsigned char *buf, size_t buf_len, const void *key, unsigned char *header_size);
```
Change is opt-in per dict type, hence sets, hashes and other types that
are using dictionary are not impacted.
With this change main dictionary now owns the data, so copy on insert in
dbAdd is no longer needed.
### Benchmarking results
TLDR; Around 9-10% memory usage reduction in overall memory usage for
scenario with key of 16 bytes and value of 8 bytes and 16 bytes. The
throughput per second varies but is similar or greater in most of the
run(s) with the changes against unstable (ae2d421).
---------
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
#### Description
This patch try to introduce a thread-local storage variable for each
thread to update its own `used_memory`, and then sum them together when
reading in `zmalloc_used_memory`. Then we can reduce unnecessary `lock
add` contention from atomic variable. We also add a protection if too
many threads created and the total threads number greater than 132, then
fall back to atomic operation for the threads index >= 132.
#### Problem Statement
`zmalloc` and `zfree` related functions will update the `used_memory`
atomicity for each operation, and they are called very frequency. From
the benchmark of
[memtier_benchmark-1Mkeys-load-stream-5-fields-with-100B-values-pipeline-10.yml](https://github.com/redis/redis-benchmarks-specification/blob/main/redis_benchmarks_specification/test-suites/memtier_benchmark-1Mkeys-load-stream-5-fields-with-100B-values-pipeline-10.yml)
, the cycles ratio of `zmalloc` and `zfree` are high, they are wrappers
for the lower allocator library, it should not take too much cycles. And
most of the cycles are contributed by `lock add` and `lock sub` , they
are expensive instructions. From the profiling, the metrics' update
mainly come from the main thread, use a TLS will reduce a lot of
contention.
#### Performance Boost
**Note:** This optimization should benefit common benchmark widely. I
choose below 2 scenarios to validate the performance boost in my local
environment.
| Test Suites | Performance Boost |
|-|-|
|[memtier_benchmark-1Mkeys-load-stream-5-fields-with-100B-values-pipeline-10](https://github.com/redis/redis-benchmarks-specification/blob/main/redis_benchmarks_specification/test-suites/memtier_benchmark-1Mkeys-load-stream-5-fields-with-100B-values-pipeline-10.yml)|8%|
|[memtier_benchmark-1Mkeys-load-string-with-100B-values-pipeline-10](https://github.com/redis/redis-benchmarks-specification/blob/main/redis_benchmarks_specification/test-suites/memtier_benchmark-1Mkeys-load-string-with-100B-values-pipeline-10.yml)|4%|
##### Test Env
- OS: Ubuntu 22.04.4 LTS
- Platform: Intel Xeon Platinum 8380
- Server and Client in same socket
##### Start Server
```sh
taskset -c 0-3 ~/valkey/src/valkey-server /tmp/valkey_1.conf
port 9001
bind * -::*
daemonize yes
protected-mode no
save ""
```
---------
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
Co-authored-by: Wangyang Guo <wangyang.guo@intel.com>
Module Authentication using a blocking implementation currently gets
rejected when the "cluster is down" from the client timeout cron job
(`clientsCronHandleTimeout`).
This PR exempts clients blocked on Module Authentication from being
rejected here.
---------
Signed-off-by: KarthikSubbarao <karthikrs2021@gmail.com>
cluster-slots test is tesing a very fragmented slots range of a
relatively large cluster. For this reason, when run under valgrind, some
of the nodes are timing out when cluster is attempting to converge and
propagate.
This pr sets the test's cluster-node-timeout to 90000 and
cluster-ping-interval to 1000.
Signed-off-by: ranshid <ranshid@amazon.com>
Test failed in my local:
```
*** [err]: CLUSTER SLOT-STATS ORDERBY LIMIT correct response pagination, where limit is less than number of assigned slots in tests/unit/cluster/slot-stats.tcl
Expected [dict exists 0 0 1 0 2 0 3 0 4 0 16383] (context: type source line 64 file /xxx/tests/unit/cluster/slot-stats.tcl cmd {assert {[dict exists $expected_slots $slot]}} proc ::assert_slot_visibility level 1)
```
It seems that when the stat is equal, that is, when the key-count is
equal,
the qsort performance will be different. When the stat is equal, we
compare
by slot (in ascending order).
Signed-off-by: Binbin <binloveplay1314@qq.com>
We did not set a default value for limit, but it will be used
in addReplyOrderBy later, the undefined behavior may crash the
server since the value could be negative and crash will happen
in addReplyArrayLen.
An interesting reproducible example (limit reuses the value of -1):
```
> cluster slot-stats orderby key-count desc limit -1
(error) ERR Limit has to lie in between 1 and 16384 (maximum number of slots).
> cluster slot-stats orderby key-count desc
Error: Server closed the connection
```
Set the default value of limit to 16384.
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
The command provides detailed slot usage statistics upon invocation,
with initial support for key-count metric. cpu-usec (approved) and
memory-bytes (pending-approval) metrics will soon follow after the
merger of this PR.
---------
Signed-off-by: Kyle Kim <kimkyle@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
When Redis/Valkey/KeyDB is run in a cloud environment across multiple
AZ's it is preferable to keep traffic local to an AZ both for cost
reasons and for latency. This is typically done when you are enabling
reads on replicas with the READONLY command.
For this change we are creating a setting that is echo'd back in the
info command. We do not want to add the cloud SDKs as dependencies and
this is the easiest way around that. It is fairly trivial to grab the AZ
from the cloud and push that into your setting file.
Currently at Snapchat we have a custom client that after connecting
reads this from the server and will preferentially use that server if
the AZ string matches its internally configured AZ.
In the future it would be ideal if we used this information when
performing failover or even exposed it in cluster nodes.
Signed-off-by: John Sully <john@csquare.ca>
The create-cluster in utils mainly used to create a test cluster,
turning on these options is useful for testing purposes.
Signed-off-by: Binbin <binloveplay1314@qq.com>
To implement #319
1. replica is able to redirect read and write commands to it's primary
in standalone mode
* reply with "-REDIRECT primary-ip:port"
2. add a subcommand `CLIENT CAPA redirect`, a client can announce the
capability to handle redirection
* if a client can handle redirection, the data access commands (read and
write) will be redirected
3. allow `readonly` and `readwrite` command in standalone mode, may be a
breaking change
* a client with redirect capability cannot process read commands on a
replica by default
* use READONLY command can allow read commands on a replica
---------
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
As discussed, we want to remove the old `REDIS_STATIC` flag which is no
longer relevant.
When moving from Redis to Valkey we renamed all REDIS flags in Makefile.
The REDIS_STATIC flag was renamed to SERVER_STATIC, but this change was
not updated in some of the files.
After discussing it with @madolson and @ranshid, we decided that since
this was introduced 10 years ago, and in many places in the code base we
simply use `static`, we should simplify and remove the flag entirely.
---------
Signed-off-by: Ouri Half <ourih@amazon.com>
Add check in CLUSTERLINK KILL cmd to avoid freeing cluster bus links to
myself. Also add an assert in `freeClusterLink()`.
Testing:
```
127.0.0.1:6379> debug clusterlink kill all c0404ee68574c6aa1048aaebfe90283afe51d2fc
(error) ERR Cannot free cluster link(s) to myself
```
Signed-off-by: Pierre Turin <pieturin@amazon.com>
Introduce a break-point function called `bp`, based on the tcl wiki's
minimal debugger.
```tcl
proc bp {{s {}}} {
if ![info exists ::bp_skip] {
set ::bp_skip [list]
} elseif {[lsearch -exact $::bp_skip $s]>=0} return
if [catch {info level -1} who] {set who ::}
while 1 {
puts -nonewline "$who/$s> "; flush stdout
gets stdin line
if {$line=="c"} {puts "continuing.."; break}
if {$line=="i"} {set line "info locals"}
catch {uplevel 1 $line} res
puts $res
}
}
```
```
... your test code before break-point
bp 1
... your test code after break-point
```
The `bp 1` will give back the tcl interpreter to the developer, and
allow you to interactively print local variables (through `puts`), run
functions and so forth.
Source: https://wiki.tcl-lang.org/page/A+minimal+debugger
---------
Signed-off-by: Kyle Kim <kimkyle@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Issue Introduced by #453.
When we check the SDS _TYPE_5 allocation size we mistakenly used
zmalloc_size which DOES take the PREFIX size into account when no
malloc_size support.
Later when we free we add the PREFIX_SIZE again which leads to negative
memory accounting on some tests.
Example test failure:
https://github.com/valkey-io/valkey/actions/runs/9654170962/job/26627901497
Signed-off-by: ranshid <ranshid@amazon.com>
## Description
While exploring hotspots with profiling some benchmark workloads, we
noticed the high cycles ratio of `prepareClientToWrite`, taking about 9%
of the CPU of `smembers`, `lrange` commands. After deep dive the code
logic, we thought we can gain the performance by reducing the redundant
call of `prepareClientToWrite` when call addReply* continuously.
For example: In
https://github.com/valkey-io/valkey/blob/unstable/src/networking.c#L1080-L1082,
`prepareClientToWrite` is called three times in a row.
---------
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
Co-authored-by: Wangyang Guo <wangyang.guo@intel.com>
Fix a minor inconvenience I have when writing tests. If I have a typo or
forget to generate the tls certificates, the start_server handle will
just loop for 2 minutes before printing the error. This just fails and
prints as soon as it sees the error.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
### Description ###
zfree updates memory statistics. It gets the size of the buffer from
jemalloc by calling zmalloc_size. This operation is costly. We can avoid
it if we know the buffer size. For example, we can calculate size of sds
from the data we have in its header.
This commit introduces zfree_with_size function that accepts both
pointer to a buffer, and its size. zfree is refactored to call
zfree_with_size.
sdsfree uses the new interface for all but SDS_TYPE_5.
### Benchmark ###
Dataset is 3 million strings. Each benchmark run uses its own value size
(8192, 512, and 120). The benchmark is 100% write load for 5 minutes.
```
value size new tps old tps % new us/call old us/call %
8k 272088.53 269971.75 0.78 1.83 1.92 -4.69
512 356881.91 352856.72 1.14 1.27 1.35 -5.93
120 377523.81 368774.78 2.37 1.14 1.19 -4.20
```
---------
Signed-off-by: Vadym Khoptynets <vadymkh@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
In this PR, we update master keyword to primary keyword several in
sentinel command json file and sentinelCommand function.
And there is no update for configurable parameters in sentinel.conf file
Signed-off-by: hwware <wen.hui.ware@gmail.com>
In some scenarios, the business may not be able to find the
previously used Lua script and only have a SHA signature.
Or there are multiple identical evalsha's args in monitor/slowlog,
and admin is not able to distinguish the script body.
Add a new script subcommmand to show the contents of script
given the scripts sha1. Returns a NOSCRIPT error if the script
is not present in the cache.
Usage: `SCRIPT SHOW sha1`
Complexity: `O(1)`
Closes#604.
Doc PR: https://github.com/valkey-io/valkey-doc/pull/143
---------
Signed-off-by: wei.kukey <wei.kukey@gmail.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This PR makes our current RDB format compatible with the Redis 7.2.4 RDB
format. there are 2 changes introduced in this PR:
1. Move back the RDB version to 11
2. Make slot info section persist as AUX data instead of dedicated
section.
We have introduced slot-info as part of the work to replace cluster
metadata with slot specific dictionaries. This caused us to bump the RDB
version and thus we prevent downgrade (which is conceptualy O.K but
better be prevented). We do not require the slot-info section to exist,
so making it an AUX section will help suppport version downgrade from
Valkey 8.
fixes: [#645](https://github.com/valkey-io/valkey/issues/645)
NOTE: tested manually by:
1. connecting Redis 7.2.4 replica to a Valkey 8(RC)
2. upgrade/downgrade Redis 7.2.4 cluster and Valkey 8(RC) cluster
---------
Signed-off-by: ranshid <ranshid@amazon.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
When we goto eoferr, we need to release the auxkey and auxval,
this is a cleanup, also explicitly check that decoder return
value is C_ERR.
Introduced in #586.
Signed-off-by: Binbin <binloveplay1314@qq.com>
I think the log of pfail status changes will be very useful.
The other parts were scanned and found that it can be modified.
Changes:
1. Changing pfail status releated logs from VERBOSE to NOTICE.
2. Changing configEpoch collision log from VERBOSE(warning) to NOTICE.
3. Changing some logs from DEBUG to NOTICE.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
In markNodeAsFailingIfNeeded we will count needed_quorum and failures,
needed_quorum is the half the cluster->size and plus one, and
cluster-size
is the size of primary node which contain slots, but when counting
failures, we dit not check if primary has slots.
Only the primary has slots that has the rights to vote, adding a new
clusterNodeIsVotingPrimary to formalize this concept.
Release notes:
bugfix where nodes not in the quorum group might spuriously mark nodes
as failed
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
When there is a link failure while an ongoing MEET request is sent the
sending node stops sending anymore MEET and starts sending PINGs. Since
every node responds to PINGs from unknown nodes with a PONG, the
receiving node never adds the sending node. But the sending node adds
the receiving node when it sees a PONG. This can lead to asymmetry in
cluster membership. This changes makes the sender keep sending MEET
until it sees a PONG, avoiding the asymmetry.
---------
Signed-off-by: Sankar <1890648+srgsanky@users.noreply.github.com>
getKeysResults is typically initialized with 2kb of zeros (16 * 256),
which isn't strictly necessary since the only thing we have to
initialize is some of the metadata fields. The rest of the data can
remain junk as long as we don't access it. This was a bit of a
regression in 7.0 with the keyspecs, since we doubled the size of the
zeros, but hopefully this recovers a lot of the performance drop.
I saw a modest performance bump for deep pipeline of cluster mode (~8%).
I think we would see some comparable improvements in the other places
where we are using it such as tracking and ACLs.
---------
Signed-off-by: Madelyn Olson <matolson@amazon.com>
In ad28d222edcef9d4496fd7a94656013f07dd08e5, we added a Lua eval
scripts eviction. If the script was previously added via EVAL, we
promote it to SCRIPT LOAD, prevent it from being evicted later.
Signed-off-by: Binbin <binloveplay1314@qq.com>
This patch try to correct the latency report.
1. Rename Redis to Valkey.
2. Remove redundant Dave dialog, and refine the output message.
---------
Signed-off-by: Wenwen Chen <wenwen.chen@samsung.com>
Signed-off-by: hwware <wen.hui.ware@gmail.com>
Until now, these configuration items allowed typing empty strings,
but empty strings behave strangely.
Empty dir will fail in chdir with No such file or directory:
```
./src/valkey-server --dir ""
*** FATAL CONFIG FILE ERROR (Version 255.255.255) ***
Reading the configuration file, at line 2
>>> 'dir ""'
No such file or directory
```
Empty dbfilename will cause shutdown to fail since it will
always fail in rdb save:
```
./src/valkey-server --dbfilename ""
* User requested shutdown...
* Saving the final RDB snapshot before exiting.
# Error moving temp DB file temp-19530.rdb on the final destination (in server root dir /xxx/xxx/valkey): No such file or directory
# Error trying to save the DB, can't exit.
# Errors trying to shut down the server. Check the logs for more information.
```
Empty cluster-config-file will fail in clusterLockConfig:
```
./src/valkey-server --cluster-enabled yes --cluster-config-file ""
Can't open in order to acquire a lock: No such file or directory
```
With this patch, now we will just reject it in config set like:
```
*** FATAL CONFIG FILE ERROR (Version 255.255.255) ***
Reading the configuration file, at line 2
>>> 'xxx ""'
xxx can't be empty
```
Signed-off-by: Binbin <binloveplay1314@qq.com>
While debugging a cluster bus issue, found the cluster node flags were
represented in numbers. I generally find it easy when these are
represented as bitwise shift operation. It improves readability a bit.
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Added a wait_for_condition to avoid the timing issue.
```
*** [err]: query buffer resized correctly in tests/unit/querybuf.tcl
Expected 11 >= 16384 && 11 <= 32770 (context: type eval line 24 cmd {assert {$orig_test_client_qbuf >= 16384 && $orig_test_client_qbuf <= $MAX_QUERY_BUFFER_SIZE}} proc ::test)
*** [err]: query buffer resized correctly when not idle in tests/unit/querybuf.tcl
Expected 11 > 32768 (context: type eval line 14 cmd {assert {$orig_test_client_qbuf > 32768}} proc ::test)
*** [err]: query buffer resized correctly with fat argv in tests/unit/querybuf.tcl
query buffer should not be resized when client idle time smaller than 2s
```
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Allow CLUSTER subcommands NODES, INFO, MYID, MYSHARDID while loading
data.
It's safe to allow them and it's helpful for clients to get cluster
nodes/info information
during a node failover and while loading data to monitor the
state of the cluster.
---------
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
We've been seeing some pretty consistent failures from
`test-valgrind-test` and `test-sanitizer-address` because of the
querybuf test periodically failing. I tracked it down to the test
periodically taking too long and the client cron getting triggered. A
simple solution is to just disable the cron during the key race
condition. I was able to run this locally for 100 iterations without
seeing a failure.
Example:
https://github.com/valkey-io/valkey/actions/runs/9474458354/job/26104103514
and
https://github.com/valkey-io/valkey/actions/runs/9474458354/job/26104106830.
Signed-off-by: Madelyn Olson <matolson@amazon.com>
We added some clang-format off comments before we had decided on the
format configuration. Now, it turns out that turning formatting off is
often not necessary.
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Currently, "config rewrite" writes some default value in the config file
incase of empty config file specified.
But it adds multiple "save" config entries as follows:
```
save 3600 1
save 300 100
save 60 10000
```
After the fix the save will look like:
```
save 3600 1 300 100 60 10000
```
---------
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
Safe iterators must call resetIterators when they are done being used.
Fix one issue where a safe iterator was not correctly calling reset
during cluster slot caching and fixed a second issue where reset
iterator was being called twice. For the double release case,
kvstoreIteratorNextDict is responsible for patching up the iterator, but
we were calling it a second time in kvstoreIteratorNext.
In addition, I added some documentation around initializing iterators,
added an assert to prevent double initialization, and remove a function
from the public interface which isn't needed and might lead to incorrect
usage of the safe iterators.
Bumping srgsanky for finding it here:
c4782066e7 (r142867004).
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Make the one backwards compatible config change we are allowed to
replace for removing master from our API.
`masterauth` and `masteruser` are still used as an alias, but aren't
explicitly referenced. As an addendum to
https://github.com/valkey-io/valkey/pull/591, it would be good to have
this in 8. Given the related PR for updated other references for master,
I just updated the ones around this specific change.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
More rebranding of
* Log messages (#252)
* The DENIED error reply
* Internal function names and comments, mainly Lua API
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
makes SERVER_CFLAGS='-DSERVER_TEST' compile as well
Introduced in #443.
Signed-off-by: Eran Liberty <eranl@amazon.com>
Co-authored-by: Eran Liberty <eranl@amazon.com>
Remove the unused value duplicate API from dict. It's unused in the codebase and introduces unnecessary overhead.
---------
Signed-off-by: Eran Liberty <eran.liberty@gmail.com>
"Client blocked on XREADGROUP while stream's slot is migrated" uses the
migrate command, which requires special handling for TLS and non-tls.
This was not being handled, so was throwing an error.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Test `query buffer resized correctly` start to fail
(https://github.com/valkey-io/valkey/actions/runs/9278013807) with
non-jemalloc allocators after
https://github.com/valkey-io/valkey/pull/258 PR.
With Jemalloc we allocate ~20K for the query buffer, in the test we read
1 byte in the first read, in the second read we make sure we have at
least 16KB free place in the query buffer and we have as Jemalloc
allocated 20KB, But with non jemalloc we allocate in the first read
exactly 16KB. in the second read we check and see that we don't have
16KB free space as we already read 1 byte hence we reallocate this time
greedly (*2 of the requested size of 16KB+1) hence the test condition
that the querybuf size is < 32KB is no longer true
The `query buffer resized correctly test` starts
[failing](https://github.com/valkey-io/valkey/actions/runs/9278013807)
with non-jemalloc allocators after PR #258 .
With jemalloc, we allocate ~20KB for the query buffer. In the test, we
read 1 byte initially and then ensure there is at least 16KB of free
space in the buffer for the second read, which is satisfied by
jemalloc's 20KB allocation. However, with non-jemalloc allocators, the
first read allocates exactly 16KB. When we check again, we don't have
16KB free due to the 1 byte already read. This triggers a greedy
reallocation (doubling the requested size of 16KB+1), causing the query
buffer size to exceed the 32KB limit, thus failing the test condition.
This PR adjusted the test query buffer upper limit to be 32KB +2.
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
sds type should be determined based on the size of the underlying
buffer, not the logical length of the sds. Currently we truncate the
alloc field in case buffer is larger than we can handle. It leads to a
mismatch between alloc field and the actual size of the buffer. Even
considering that alloc doesn't include header size and the null
terminator.
It also leads to a waste of memory with jemalloc. For example, let's
consider creation of sds of length 253. According to the length, the
appropriate type is SDS_TYPE_8. But we allocate `253 + sizeof(struct
sdshdr8) + 1` bytes, which sums to 257 bytes. In this case jemalloc
allocates buffer from the next size bucket. With current configuration
on Linux it's 320 bytes. So we end up with 320 bytes buffer, while we
can't address more than 255.
The same happens with other types and length close enough to the
appropriate powers of 2.
The downside of the adjustment is that with allocators that do not
allocate larger than requested chunks (like GNU allocator), we switch to
a larger type "too early". It leads to small waste of memory.
Specifically: sds of length 31 takes 35 bytes instead of 33 (2 bytes
wasted) sds of length 255 takes 261 bytes instead of 259 (2 bytes
wasted) sds of length 65,535 takes 65,545 bytes instead of 65,541 (4
bytes wasted) sds of length 4,294,967,295 takes 4,294,967,313 bytes
instead of 4,294,967,305 (8 bytes wasted)
---------
Signed-off-by: Vadym Khoptynets <vadymkh@amazon.com>
Deflake chained replicas disconnect when replica re-connect with the
same master. sync_partial_ok counter might get incremented if replica
timed out during test.
Signed-off-by: naglera <anagler123@gmail.com>
When LRU/LFU enabled, Valkey does not allow using shared objects, as
value objects may be shared among many different keys and they can't
share LRU/LFU information.
However `maxmemory-policy` is modifiable at runtime. If LRU/LFU is not
enabled at start, but then enabled when some shared objects are already
used, there could be some confusion in LRU/LFU information.
For `set` command it is OK since it is going to create a new object when
LRU/LFU enabled, but `get` command will not unshare the object and just
update LRU/LFU information.
So we may duplicate the object in this case. It is a one-time task for
each key using shared objects, unless this is the case for so many keys,
there should be no serious performance degradation.
Still, LRU will be updated anyway, no matter LRU/LFU is enabled or not,
because `OBJECT IDLETIME` needs it, unless `maxmemory-policy` is set to
LFU. So idle time of a key may still be messed up.
---------
Signed-off-by: chentianjie.ctj <chentianjie.ctj@alibaba-inc.com>
Signed-off-by: Chen Tianjie <TJ_Chen@outlook.com>
Introduce a new hidden server configuration, `enable-debug-assert`, which
allows selectively enabling or disabling, at runtime, expensive or risky
assertions used primarily for debugging and testing.
Fix#569
---------
Signed-off-by: Ping Xie <pingxie@google.com>
There are currently three block types: BLOCKED_WAIT, BLOCKED_WAITAOF,
and BLOCKED_WAIT_PREREPL, used to block clients executing `WAIT`,
`WAITAOF`, and `CLUSTER SETSLOT`, respectively. They share the same
workflow: the client is blocked until replication to the expected number
of replicas completes. However, they provide different responses
depending on the commands involved. Using distinct block types leads to
code duplication and reduced readability. This PR consolidates the three
types into a single WAIT type, differentiating them using the pending
command to ensure the appropriate response is returned.
Fix#427
---------
Signed-off-by: Ping Xie <pingxie@google.com>
In #11012, we changed the way command durations were computed to handle
the same command being executed multiple times. In #11970, we added an
assert if the duration is not properly reset, potentially indicating
that a call to report statistics was missed.
I found an edge case where this happens - easily reproduced by blocking
a client on `XGROUPREAD` and migrating the stream's slot. This causes
the engine to process the `XGROUPREAD` command twice:
1. First time, we are blocked on the stream, so we wait for unblock to
come back to it a second time. In most cases, when we come back to
process the command second time after unblock, we process the command
normally, which includes recording the duration and then resetting it.
2. After unblocking we come back to process the command, and this is
where we hit the edge case - at this point, we had already migrated the
slot to another node, so we return a `MOVED` response. But when we do
that, we don’t reset the duration field.
Fix: also reset the duration when returning a `MOVED` response. I think
this is right, because the client should redirect the command to the
right node, which in turn will calculate the execution duration.
Also wrote a test which reproduces this, it fails without the fix and
passes with it.
---------
Signed-off-by: Nitai Caro <caronita@amazon.com>
Co-authored-by: Nitai Caro <caronita@amazon.com>
In #53, we will cache the CLUSTER SLOTS response to improve the
throughput and reduct the latency.
In the code snippet below, the second cluster slots will use the
old hostname:
```
config set cluster-preferred-endpoint-type hostname
config set cluster-announce-hostname old-hostname.com
multi
cluster slots
config set cluster-announce-hostname new-hostname.com
cluster slots
exec
```
When updating the hostname, in updateAnnouncedHostname, we will set
CLUSTER_TODO_SAVE_CONFIG and we will do a clearCachedClusterSlotsResponse
in clusterSaveConfigOrDie, so harmless in most cases.
Move the clearCachedClusterSlotsResponse call to clusterDoBeforeSleep
instead of scheduling it to be called in clusterSaveConfigOrDie.
Signed-off-by: Binbin <binloveplay1314@qq.com>
ref:
- https://github.com/valkey-io/valkey/pull/118 (my pervious change)
- https://github.com/valkey-io/valkey/pull/461 (issuing that clang
format checker fails due to my change)
There was an issue that clang-format cheker failed.
I don't know why I missed it and why it didn't catch.
just running `clang-format -i bitops.c` was all.
Signed-off-by: LiiNen <kjeonghoon065@gmail.com>
_This change is the thing I suggested to redis when it was BSD, and is
not just migration - this is of course more advanced_
### Issue
There is weird difference in syntax between BITPOS and BITCOUNT:
```
BITPOS key bit [start [end [BYTE | BIT]]]
BITCOUNT key [start end [BYTE | BIT]]
```
I think this might cause confusion in terms of usability.
It was not just a syntax typo error, and really works differently.
The results below are with unstable build:
```
> get TEST:ABCD
"ABCD"
> BITPOS TEST:ABCD 1 0 -1
(integer) 1
> BITCOUNT TEST:ABCD 0 -1
(integer) 9
> BITPOS TEST:ABCD 1 0
(integer) 1
> BITCOUNT TEST:ABCD 0
(error) ERR syntax error
```
### What did I fix
simply changes logic, to accept BITCOUNT also without 'end' - 'end'
become optional, like BITPOS
```
> GET TEST:ABCD
"ABCD"
> BITPOS TEST:ABCD 1 0 -1
(integer) 1
> BITCOUNT TEST:ABCD 0 -1
(integer) 9
> BITPOS TEST:ABCD 1 0
(integer) 1
> BITCOUNT TEST:ABCD 0
(integer) 9
```
Of course, I also fixed syntax hint:
```
# ASIS
> BITCOUNT key [start end [BYTE|BIT]]
# TOBE
> BITCOUNT key [start [end [BYTE|BIT]]]
```

### Moreover ...
I hadn't noticed that there was very small dead code in these command
logic, when I wrote PR to redis.
I found it now, when write code again, so I wrote it in valkey.
``` c
/* asis unstable */
/* bitcountCommand() */
if (!strcasecmp(c->argv[4]->ptr,"bit")) isbit = 1;
// ...
if (c->argc < 4) {
if (isbit) end = (totlen<<3) + 7;
else end = totlen-1;
}
/* bitposCommand() */
if (!strcasecmp(c->argv[5]->ptr,"bit")) isbit = 1;
// ...
if (c->argc < 5) {
if (isbit) end = (totlen<<3) + 7;
else end = totlen-1;
}
```
Bit variable (actually int) "isbit" is only being set as 1, when 'BIT'
is declared.
But we were checking whether 'isbit' is true or false in this 'if'
phrase, even if isbit could never be 1, because argc is always less than
4 (or 5 in bitpos).
I think this minor fixes will make valkey command operation more
consistent.
Of course, this PR contains just changing args from "required" to
"optional", so it will never hurt previous users.
Thanks,
---------
Signed-off-by: LiiNen <kjeonghoon065@gmail.com>
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
This PR optimizes client query buffer handling in Valkey by introducing
a shared query buffer that is used by default for client reads. This
reduces memory usage by ~20KB per client by avoiding allocations for
most clients using short (<16KB) complete commands. For larger or
partial commands, the client still gets its own private buffer.
The primary changes are:
* Adding a shared query buffer `shared_qb` that clients use by default
* Modifying client querybuf initialization and reset logic
* Copying any partial query from shared to private buffer before command
execution
* Freeing idle client query buffers when empty to allow reuse of shared
buffer
* Master client query buffers are kept private as their contents need to
be preserved for replication stream
In addition to the memory savings, this change shows a 3% improvement in
latency and throughput when running with 1000 active clients.
The memory reduction may also help reduce the need to evict clients when
reaching max memory limit, as the query buffer is the main memory
consumer per client.
---------
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
The races are between the '$rd' client and the 'r' client in the test
case.
Test case "client input output and command process statistics" in
unit/introspection.
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The replica sends its version when initiating replication, in
pipeline with other REPLCONF commands.
The primary stores it in the client struct. Other fields are made
smaller to avoid making the client struct consume more memory.
Fixes#414.
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This aligns the behaviour with established Valkey commands with a
TIMEOUT argument, such as BLPOP.
Fix#422
Signed-off-by: Ping Xie <pingxie@google.com>
- Replaces custom atomics logic with C11 default atomics logic.
- Drops "atomicvar_api" field from server info
Closes#485
---------
Signed-off-by: adetunjii <adetunjithomas1@outlook.com>
Signed-off-by: Samuel Adetunji <adetunjithomas1@outlook.com>
Co-authored-by: teej4y <samuel.adetunji@prunny.com>
Although I think this improves the readability of individual configs,
the fact there are now 1k more lines of configs makes this overall much
harder to parse. So reverting it back to the way it was before.
`,\n [ ]+` replace with `, `.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
After READONLY, make a cluster replica behave as its primary regarding
returning ASK redirects and TRYAGAIN.
Without this patch, a client reading from a replica cannot tell if a key
doesn't exist or if it has already been migrated to another shard as
part of an ongoing slot migration. Therefore, without an ASK redirect in
this situation, offloading reads to cluster replicas wasn't reliable.
Note: The target of a redirect is always a primary. If a client wants to
continue reading from a replica after following a redirect, it needs to
figure out the replicas of that new primary using CLUSTER SHARDS or
similar.
This is related to #21 and has been made possible by the introduction of
Replication of Slot Migration States in #445.
----
Release notes:
During cluster slot migration, replicas are able to return -ASK
redirects and -TRYAGAIN.
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Undeprecate cluster slots command. This command is widely used by
clients to form the cluster topology and with the recent change to
improve performance of `CLUSTER SLOTS` command via #53 as well as us
looking to further improve the usability via #517, it makes sense to
undeprecate this command.
---------
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
I have validated that these settings closely match the existing coding
style with one major exception on `BreakBeforeBraces`, which will be
`Attach` going forward. The mixed `BreakBeforeBraces` styles in the
current codebase are hard to imitate and also very odd IMHO - see below
```
if (a == 1) { /*Attach */
}
```
```
if (a == 1 ||
b == 2)
{ /* Why? */
}
```
Please do NOT merge just yet. Will add the github action next once the
style is reviewed/approved.
---------
Signed-off-by: Ping Xie <pingxie@google.com>
This commit adds a logic to cache `CLUSTER SLOTS` response for reduced
latency and also updates the cache when a change in the cluster is
detected.
Historically, `CLUSTER SLOTS` command was deprecated, however all the
server clients have been using `CLUSTER SLOTS` and have not migrated to
`CLUSTER SHARDS`. In future this logic can be added to any other
commands to improve the performance of the engine.
---------
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Issue #428.
Moved the SERVER_TEST block from ziplist.c into unit tests in
test_ziplist.c. I left the benchmark related tasks alone in their own
test, as I am not sure what to do with them.
Some of the assertions are a little vague/useless, but I will try to
refine them.
---------
Signed-off-by: Mason Hall <hallmason17@gmail.com>
Implemented in
3945a32177 (diff-a154d1fa454a9868e2c455acdae971e3605151516f9a8efac7f2c9b2845d914d),
this function is never called and never used. I was trying to understand
whether we could use this for another PR, but couldn't really find a
point for it because it didn't do exactly what I expected.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
In 3f725b8, we introduced a change back in march to reduce the
entropy of ASLR, because ASAN didn't support it. Now the
vm.mmap_rnd_bits
was reverted in actions/runner-images#9491 so can remove this changes.
Closes#519.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Commit 07ed0eafa98a66 introduced some SCAN improvements, but some
changes were postponed to a later version (8.0), which this PR finishes:
1. Prepare to move the TYPE filtering to the scan callback as well. this
was put on hold since it has side effects that can be considered a
breaking change, which is that we will not attempt to do lazy expire
(delete) a key that was filtered by not matching the TYPE (changing it
would mean TYPE filter starts behaving the same as MATCH filter already
does in that respect).
2. when the specified key TYPE filter is an unknown type, server will
reply a error immediately instead of doing a full scan that comes back
empty handed.
Fixes#235
Release notes:
> SCAN: Expired keys that don't match the TYPE argument for the SCAN are
no longer deleted by SCAN
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This patch try to correct the actual allocated size from allocator
when call sdsRedize to align the logic with sdsnewlen function.
Maybe the https://github.com/valkey-io/valkey/pull/453 optimization
should depend on this.
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
This PR added a new option for tcl test case which will fail fast once any test cases fail.
This can be useful while running redis CI pipeline, and you want to accelerate the CI pipeline.
usage for example
> ./runtest --single unit/type/hash --fast-fail
---------
Signed-off-by: arthur.lee <arthur-lee@qq.com>
The test logic is not smart enough to realize that a test is fully
#ifdef'd out, so it will try to attach it to the test suite anyways.
This is a minor work around for the reclaim file page test so that it
will still attach the test, it will just always succeed. Also remove an
unnecessary print statement that was missed in the same test.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Cleaned up the minor cluster refactoring notes that were intended to be
follow ups that never happened. Basically:
1. Minor style nitpicks
2. Generalized clusterNodeIsMyself so that it wasn't implementation
dependent.
3. Removed getMyClusterId, and just make it an explicit call to myself's
name, which seems more straightforward and removes unnecessary
abstraction.
4. Remove clusterNodeGetSlaveof infavor of clusterNodeGetMaster. We
already do a check if it's a replica, and if it wasn't working it would
have been crashing.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
This is the actual PR which is created to migrate all tests related to
zmalloc into new test framework as part of the parent issue
https://github.com/valkey-io/valkey/issues/428.
Signed-off-by: Karthick Ariyaratnam <karthyuom@gmail.com>
The script extracts the comments and prototypes from module.c and does
some pre-processing, e.g. converts URLs to markdown links. The URL
regexp didn't account for '#', '?' (and a few more chars) so an URL like
`https://example.com/#section` was converted to markdown as
[https://example.com/](https://example.com/)#section
With this change, it's instead correctly converted to
[https://example.com/#section](https://example.com/#section)
Additional change: Removes an unused metadata field.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
In c7ad9feb52,
we missed removed endian coverage from the legacy unit tests, so it failed to find it when building.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
This PR migrates all tests related to endianconv into new test framework
as part of the parent issue https://github.com/valkey-io/valkey/issues/428.
---------
Signed-off-by: Karthick Ariyaratnam <karthyuom@gmail.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Currently, we use select(2) on DragonFlyBSD while
`kqueue` is available on DragonFlyBSD since FreeBSD 4.1
and DragonFlyBSD was originally forked from FreeBSD 4.8
`select(2)` is a pretty old technique that has many defects
compared to `kqueue`, we should switch to `kqueue` on DragonFlyBSD.
Signed-off-by: Andy Pan <i@andypan.me>
If `valkey-server` was started with the `redis-server` symlink, the old
proc names are used, for backwards compatibility.
---------
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
This migrates unit tests related to sha1 to new framework, ref: #428.
---------
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
This patch migrates all tests in sds.c into new test framework as part
of the parent issue https://github.com/valkey-io/valkey/issues/428.
---------
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Move dependency from `slowlog.c` into `slowlog.h`, make sure the
language server can work properly under `slowlog.h`
Signed-off-by: arthur.lee <arthur-lee@qq.com>
We currently has two disjoint TCL frameworks:
1. Normal testing framework, which trigger by runtest, which individually
launches nodes for testing.
2. Cluster framework, which trigger by runtest-cluster, which pre-allocates
N nodes and uses them for testing large configurations.
The normal TCL testing framework is much more readily tested and is also
automatically run as part of the CI for new PRs. The runtest-cluster since
it runs very slowly (cannot be parallelized), it currently only runs in daily
CI, this results in some changes to the cluster not being exposed in PR CI
in time.
This PR migrate the Cluster mode tests to normal framework. Some cluster
tests are kept in runtest-cluster because of timing issues or not yet
supported, we can process them later.
Signed-off-by: Binbin <binloveplay1314@qq.com>
All the intset unit tests are migrated to new test framework as part of
https://github.com/valkey-io/valkey/pull/344, but the old framework
declaration is missed to remove from intset.h. So removed the code.
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
Updated serverPanic output in db.c based on the
extended-redis-compatibility config. and also updated comments in other
files.
---------
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
This is a preparation for adding clang-format.
These comments prevent automatic formatting in some places. With these
exceptions, we will be able to run clang-format on the rest of the code.
This is a preparation for #323.
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Fix the compile error with the following command:
`make all-with-unit-tests SERVER_CFLAGS='-Werror -DSERVER_TEST'
`
```
/usr/bin/ld: /home/ubuntu/valkey-shiv-repo/valkey/src/eval.c:1172: undefined reference to `lua_next'
/usr/bin/ld: /home/ubuntu/valkey-shiv-repo/valkey/src/eval.c:1154: undefined reference to `lua_toboolean'
/usr/bin/ld: /home/ubuntu/valkey-shiv-repo/valkey/src/eval.c:1175: undefined reference to `lua_type'
/usr/bin/ld: /home/ubuntu/valkey-shiv-repo/valkey/src/eval.c:1176: undefined reference to `lua_tonumber'
collect2: error: ld returned 1 exit status
make[1]: *** [Makefile:469: valkey-unit-tests] Error 1
make[1]: Leaving directory '/home/ubuntu/valkey-shiv-repo/valkey/src'
make: *** [Makefile:6: all-with-unit-tests] Error 2
```
Issue is happened as all deps libraries not linked for
valkey-unit-tests, so linked all libraries to the binary.
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
This PR migrates all tests related to kvstore into new test framework as
part of the parent issue https://github.com/valkey-io/valkey/issues/428.
---------
Signed-off-by: Karthick Ariyaratnam <karthyuom@gmail.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
This PR migrates all tests related to util into new test framework as
part of the parent issue https://github.com/valkey-io/valkey/issues/428.
---------
Signed-off-by: Karthick Ariyaratnam <karthyuom@gmail.com>
Despite the fact that SO_REUSEADDR can be set on a Unix domain socket
via setsockopt() without reporting an error, SO_REUSEADDR was actually
created for ipv4/ipv6 and it's not supported for sockets of AF_UNIX.
Therefore, setting this option on a Unix domain socket does nothing but
costs one extra system call.
Signed-off-by: Andy Pan <i@andypan.me>
Command syntax is now:
```
ZSCAN key cursor [MATCH pattern] [COUNT count] [NOSCORES]
```
Return format:
```
127.0.0.1:6379> zadd z 1 a 2 b 3 c
(integer) 3
127.0.0.1:6379> zscan z 0
1) "0"
2) 1) "a"
2) "1"
3) "b"
4) "2"
5) "c"
6) "3"
127.0.0.1:6379> zscan z 0 noscores
1) "0"
2) 1) "a"
2) "b"
3) "c"
```
when NOSCORES is on, the command will only return members in the zset,
without scores.
For client side parsing the command return, I believe it is fine as long
as the command is backwards compatible. The return structure are still
lists, what has changed is the content. And clients can tell the
difference by the subcommand they use.
Since `novalues` option of `HSCAN` is already accepted
(redis/redis#12765), I think similar thing can be done to `ZSCAN`.
---------
Signed-off-by: Chen Tianjie <TJ_Chen@outlook.com>
This PR fixes an error in the unit/READMED.md (see
https://github.com/valkey-io/valkey/issues/428) in order to correct the
steps for running single unit test file.
Signed-off-by: Karthick Ariyaratnam <karthyuom@gmail.com>
Updated strlcat function and macros name (redis_strlcat ->
valkey_strlcat).
I think the standard strcat function is not safe.
(https://codeql.github.com/codeql-query-help/cpp/cpp-unsafe-strcat/)
So, it would be better to keep it as a safe function.
Signed-off-by: NAM UK KIM <namuk2004@naver.com>
This commit does four things:
1. On various images, the linker was not able to correctly load the flto
optimizations from the archive generated for unit tests, and was
throwing errors. I was able to solve this by updating the plugin for the
fortify test, but was unable to reproduce it on the ASAN tests or find a
solution. So I decided to go with a single solution for now, which was
to just disable the linker optimizations for those tests. This shouldn't
weaken the protections provided by ASAN.
2. The change to remove flto for some reason caused some odd inlining
behavior in the intset test, that I wasn't really able to understand.
The error was basically that we were doing a 4 byte write, starting at
byte offset 8, for the first addition to listpack that was of size 10.
Practically this has no effect, since I'm not aware of any allocator
that would give us a 10 byte block as opposed to 12 (or more likely 16)
bytes. The isn't the correct behavior, since an uninitialized listpack
defaults to 16bit encoding, which should only be writing 2 bytes. I
rabbit holed like 2 hours into this, and gave up and just ignored the
warning on the file.
3. Now that address sanitizer was correctly running, it picked up two
issues. A memory leak and uninitialized value, so those were easy to
fix.
4. There is also a small change to the fortify to build the test up
front instead of later, this is just to be consistent with other tests
and has no functional change.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
We already have global stats for input traffic, output traffic and how
many commands have been executed.
However, some users have the difficulty of locating the IP(s) which have
heavy network traffic. So here some stats for single client are
introduced.
```
tot-net-in // Total network input bytes read from the client
tot-net-out // Total network output bytes sent to the client
tot-cmds // Total commands the client has executed
```
These three stats are shown in `CLIENT LIST` and `CLIENT INFO`.
Though the metrics are handled in hot paths of the code, personally I
don't think it will slow down the server. Considering all other complex
operations handled nearby, this is only a small and simple operation.
However we do need to be cautious when adding more and more metrics, as
discussed in redis/redis#12640, we may need to find a way to tell
whether this has obvious performance degradation.
---------
Signed-off-by: Chen Tianjie <TJ_Chen@outlook.com>
Existing code does not build on macOS < 10.7. There are two issues:
1. The check for `MAC_OS_10_6_DETECTED` does the opposite of what it
should, since `AvailabilityMacros.h` does not define underscore-prefixed
versions of macros. `__MAC_OS_X_VERSION_MAX_ALLOWED` evaluates to 0, and
on 10.6 everything breaks down:
Credits to @mohd-akram who pointed out at possible origin of the this
problem.
2. Once that is fixed, on 10.6 when building for `ppc` there are new
errors, because the code uses inaccurate assumptions for archs. Fix that
too.
---------
Signed-off-by: Sergey Fedorov <vital.had@gmail.com>
I added details at the end with the Valkey entity information under an
"About Valkey" heading. Please feel free to adjust the heading if that's
not an appropriate one to use.
---------
Signed-off-by: Mike Dolan <mikedolan@gmail.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
`reclaimFilePageCache` did not set `errno` but `rdbSaveInternal` which
is logging the error assumed it did. This makes sure `errno` is set.
Signed-off-by: Ted Lyngmo <ted@lyncon.se>
The core idea was to take a lot of the stuff from the C unity framework
and adapt it a bit here. Each file in the `unit` directory that starts
with `test_` is automatically assumed to be a test suite. Within each
file, all functions that start with `test_` are assumed to be a test.
See unit/README.md for details about the implementation.
Instead of compiling basically a net new binary, the way the tests are
compiled is that the main valkey server is compiled as a static archive,
which we then compile the individual test files against to create a new
test executable. This is not all that important now, other than it makes
the compilation simpler, but what it will allow us to do is overwrite
functions in the archive to enable mocking for cross compilation unit
functions. There are also ways to enable mocking from within the same
compilation unit, but I don't know how important this is.
Tests are also written in one of two styles:
1. Including the header file and directly calling functions from the
archive.
2. Importing the original file, and then calling the functions. This
second approach is cool because we can call static functions. It won't
mess up the archive either.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
These JSON files were originally not intended to be used directly, since
they contain internals and some fiels like "acl_categories" that are not
the final ACL categories. (Valkey will apply some implicit rules to
compute the final ACL categories.) However, people see JSON files
and use them directly anyway.
So it's better to document them.
In a later PR, we can get rid of all implicit ACL categories and instead
populate them explicitly in the JSON files. Then, we'll add a validation
(e.g. in generate-command-code.py) that the implied categories are set.
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Binbin <binloveplay1314@qq.com>
This is a minor change where only naming and links now points properly
to valkey.
Fixes#388
---------
Signed-off-by: Rolandas Šimkus <rolandas@simkus.io>
Signed-off-by: simkusr <rolandas.s@wilibox.com>
Signed-off-by: simkusr <rolandas@simkus.io>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: simkusr <rolandas.s@wilibox.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Renamed redis to valkey/server in aof.c serverlogs.
The AOF rewrite child process title is set to "redis-aof-rewrite" if
Valkey was started from a redis-server symlink, otherwise to
"valkey-aof-rewrite".
This is a breaking changes since logs are changed.
Part of #207.
---------
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
Improve the performance of crc64 for large batches by processing large
number of bytes in parallel and combining the results.
## Performance
* 53-73% faster on Xeon 2670 v0 @ 2.6ghz
* 2-2.5x faster on Core i3 8130U @ 2.2 ghz
* 1.6-2.46 bytes/cycle on i3 8130U
* likely >2x faster than crcspeed on newer CPUs with more resources than
a 2012-era Xeon 2670
* crc64 combine function runs in <50 nanoseconds typical with vector +
cache optimizations (~8 *microseconds* without vector optimizations, ~80
*microseconds without cache, the combination is extra effective)
* still single-threaded
* valkey-server test crc64 --help (requires `make distclean && make
SERVER_TEST=yes`)
---------
Signed-off-by: Josiah Carlson <josiah.carlson@gmail.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Serverassert is a drop-in replacement of assert. We use it even in code
copied from other sources. To make these files usable outside of Valkey,
it should be enough to replace the `serverassert.h` include with
`<assert.h>`. Therefore, this file shouldn't have any dependencies to
the rest of the valkey code.
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Fix the mem_freed variable to be initialized with init.
with this PR prevents the variable from acting unknowingly.
Signed-off-by: NAM UK KIM <namuk2004@naver.com>
Initial PR to add a governance doc outlining permissions for the main
Valkey project as well as define responsibilities for sub-projects.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Co-authored-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Co-authored-by: hwware <wen.hui.ware@gmail.com>
Co-authored-by: binyan <binbin.yan@nokia.com
Changes the default value for the `pidfile` config.
The template config file `valkey.conf` already contains `pidfile
/var/run/valkey_6379.pid`. This is not a default. The default is what
you get when you start valkey without config.
Tests suites config pidfile changed to valkey accordingly.
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
The output of this script becomes the contents of
`topics/module-api-ref.md` in the `valkey-doc` repo. (Updating it is a
manual process.)
The script uses git tags to find the version that first added an API
function. To preserve the history from old Redis OSS versions, for which
we don't keep git tags, a mapping is stored in a file.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Default value for the "syslog-ident" config changed from "redis" to
"valkey".
Fixes#301.
---------
Signed-off-by: Karthick Ariyaratnam <karthyuom@gmail.com>
What this PR mainly does:
1. Refactor the `anetCreateSocket()` to make it more generic for more
socket arguments, and use `SOCK_NONBLOCK` if available, which will
reduce two system calls (`F_GETFL` and `F_SETFL`) of enabling the
non-blocking mode on each newly created socket.
2. Clean up the unused `anetUnixGenericConnect()` that calls
`anetCreateSocket()`.
`SOCK_NONBLOCK` for system call `socket()` is supported on most
UNIX-like platforms (`linux`, `dragonfly`, `freebsd`, `netbsd`,
`openbsd`, `solaris`, etc.). This improvement will significantly reduce
the system calls considering how massively `anetTcpGenericConnect()`
will be called when needed.
As for the cleanup, `anetUnixGenericConnect` was introduced in c61e692
and the only reference back then was from the `createClient()` in
`redis-benchmark.c` which had been removed in ec8f066 and made it the
dead code. Most of that dead code was also cleaned up in f657315e, and
it seems that the `anetUnixGenericConnect` got left out. Therefore, I
also cleaned it up, ***but I'm not so certain about doing this cleanup
in this PR. Maybe you would prefer to do it in a separate PR?***
References:
- [socket(2) on Linux](https://man7.org/linux/man-pages/man2/socket.2.html)
- [socket(2) on FreeBSD](https://man.freebsd.org/cgi/man.cgi?socket(2))
- [socket(2) on DragonFly](https://man.dragonflybsd.org/?command=socket§ion=2)
- [socket(2) on NetBSD](https://man.netbsd.org/socket.2)
- [socket(2) on OpenBSD](https://man.openbsd.org/socket.2)
- [socket(3c) on Solaris](https://docs.oracle.com/cd/E88353_01/html/E37843/socket-3c.html)
- [socket(3socket) on illumos](https://illumos.org/man/3SOCKET/socket)
---------
Signed-off-by: Andy Pan <i@andypan.me>
I found a few cases where serverAssert() resulted in abort being added, but abort requires
stdlib. So, when serverAssert() uses abort, also automatically include stdlib.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
1. Rename `REDIS_*` macros defined in object.c to `VALKEY_*`,
2. Rename `Redis` to `Valkey` , `redis-cli` to `valkey-cli` in logs
(i.e. put statement) and descriptions in object.c and
utils/create-cluster/README
---------
Signed-off-by: Sher Sun <sher.sun@huawei.com>
Co-authored-by: Sher Sun <sher.sun@huawei.com>
The "count" field of clusterMsg is only used for gossip.
Signed-off-by: Ben Totten <btotten@amazon.com>
Co-authored-by: Ben Totten <btotten@amazon.com>
Don't let the Make valiable `USE_REDIS_SYMLINKS` affect the build.
If it does, it causes the second line in the example below (`make
install`) to recompile what was already compiled on the line above, and
this time it's built without BUILD_TLS=yes USE_SYSTEMD=yes.
make BUILD_TLS=yes USE_SYSTEMD=yes
make PREFIX=custom/usr USE_REDIS_SYMLINKS=no install
Fixes#377
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
When using a TLS scheme for valkey-cli and benchmark-cli compiled
without TLS, make the error message only mention the scheme used.
Before:
"valkeys:// and rediss:// are only supported when valkey-cli is compiled
with OpenSSL"
After, depending on which scheme the user tried to use:
"valkeys:// is only supported when valkey-cli is compiled with OpenSSL"
"rediss:// is only supported when valkey-cli is compiled with OpenSSL"
Follow up of #199.
---------
Signed-off-by: karthyuom <karthyuom@gmail.com>
Co-authored-by: k00809413 <karthick.ariyaratnam1@huawei.com>
Rename the init script and a related `.tpl` file and rename variable
names inside (redis to valkey). Update variables in
`utils/install_server.sh`.
Fixes#354
Signed-off-by: hwware <wen.hui.ware@gmail.com>
This pr covers following chnages.
1. redisbenchmark to valkeybenchmark in test directory
2. Removed redis from some test's title and changed the name
accordingly.
3. Updated test suite name and redis-server to valkey readme in test
directory.
---------
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
Scope of the changes:
- updated example modules to reference Valkey vs Redis in variable names
- updated tests to use valkeymodule.h
- updated vars in tests/modules to use valkey vs redis in variable names
Summary of the testing:
- ran make for all modules, loaded them into valkey-server and tested
commands
- ran make for test/modules
- ran make test for the entire codebase
---------
Signed-off-by: Dmitry Polyakovsky <dmitry.polyakovky@oracle.com>
Co-authored-by: Dmitry Polyakovsky <dmitry.polyakovky@oracle.com>
This patch try to do following things:
1. Rename `redis_*` and `REDIS_*` macros defined in config.h to
`valkey_*`, `VALKEY_*` and update associated used files. (`redis_fstat`,
`redis_fsync`, `REDIS_THREAD_STACK_SIZE`, etc.)
2. Remove the leading double underscore for guard macro in config.h.
---------
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
we already have valkeymodule.h with new naming convention and reference
it from modules examples
Signed-off-by: Dmitry Polyakovsky <dmitry.polyakovky@oracle.com>
Co-authored-by: Dmitry Polyakovsky <dmitry.polyakovky@oracle.com>
Readme in github shows that install script will help to install valkey
server, However the logs of thes cripts and variables in the script still
points to redis so renamed redis to valkey/server accordingly.
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
Previously these scripts were updated but still some places are left so
updated the valkey.
---------
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
Occurrences of "redis" in TCL test suites and helpers, such as TCL
client used in tests, are replaced with "valkey".
Occurrences of uppercase "Redis" are not changed in this PR.
No files are renamed in this PR.
---------
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
New config 'extended-redis-compatibility' (yes/no) default no
* When yes:
* Use "Redis" in the following error replies:
- `-LOADING Redis is loading the dataset in memory`
- `-BUSY Redis is busy`...
- `-MISCONF Redis is configured to`...
* Use `=== REDIS BUG REPORT` in the crash log delimiters (START and
END).
* The HELLO command returns `"server" => "redis"` and `"version" =>
"7.2.4"` (our Redis OSS compatibility version).
* The INFO field for mode is called `"redis_mode"`.
* When no:
* Use "Valkey" instead of "Redis" in the mentioned errors and crash log
delimiters.
* The HELLO command returns `"server" => "valkey"` and the Valkey
version for `"version"`.
* The INFO field for mode is called `"server_mode"`.
* Documentation added in valkey.conf:
> Valkey is largely compatible with Redis OSS, apart from a few cases
where
> Redis OSS compatibility mode makes Valkey pretend to be Redis. Enable
this
> only if you have problems with tools or clients. This is a temporary
> configuration added in Valkey 8.0 and is scheduled to have no effect
in Valkey
> 9.0 and be completely removed in Valkey 10.0.
* A test case for the config is added. It is designed to fail if the
config is not deprecated (has no effect) in Valkey 9 and deleted in
Valkey 10.
* Other test cases are adjusted to work regardless of this config.
Fixes#274Fixes#61
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This PR introduces Codecov to automate code coverage tracking for our
project's tests.
For more information about the Codecov platform, please refer to
https://docs.codecov.com/docs/quick-start
---------
Signed-off-by: Vitah Lin <vitahlin@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
This macros is used to add rewrite string in src/config.c and removing
the redis will not effect log or output.
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
Log messages containing "Redis" in some files are changed.
Add macro SERVER_TITLE defined to "Valkey" (uppercase V) is introduced
and used in log messages, so at least it will be easy to patch this
definition to get Redis or any other name in the logs instead of Valkey.
Change "Redis" in some log messages to use %s and SERVER_TITLE.
This is a partial implementation of
https://github.com/valkey-io/valkey/issues/207
---------
Signed-off-by: 0del <bany.y0599@gmail.com>
Low-risk error replies containing "Redis" are changed.
In most cases, the word "Redis" is simply removed from the error message,
such as in "This Redis instance is not configured to use an ACL file. (...)",
the message is changed to "This instance is not configured to use an ACL
file. (...)".
Additionally, error replies from `redis.call` in a Lua script are
affected, such as
* "Please specify at least one argument for this redis lib call"
* "Wrong number of args calling Redis command from script"
* "Unknown Redis command called from script"
* "Invalid command passed to redis.acl_check_cmd()"
The name Redis is simply removed from these error message. In the last
one above, "redis.acl_check_cmd()" is replaced by
"server.acl_check_cmd()" in the error message.
The following error replies are considered high of causing problems for
clients, so they are not changed in this commit:
* (not in scope) "-MISCONF Redis is configured to save RDB snapshots
(...)"
* (not in scope) "-LOADING Redis is loading the dataset in memory"
* (not in scope) "-BUSY Redis is busy running a script (...)"
Fixes#204
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
[related to](https://github.com/valkey-io/valkey/issues/230)
Adds workflows to build Valkey binaries and push to S3 to make it
available to download from the website
The Workflows can be triggered by pushing a release to the repo and the
other option is manually by one of the Maintainers.
Once the workflow triggers, it will generate a matrix of Jobs for the
platforms we need to build from `utils/releasetools/build-config.json`
and then the respective Jobs are triggered. These jobs make Valkey with
respect to the platform binaries we want to release and would push to a
private S3 bucket.
---------
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Update CONTRIBUTING.md:
* A more friendly approach, hopefully.
* The note about receiving patches is moved to the DCO section.
* Some "Get started" links in a bullet list (inspired by OpenTofu's
contributing file).
* For questions, refer to GitHub Discussions and Discord instead of only
Discord.
* Minor edits and formatting.
Update issue templates:
* The issue template for questions is replaced by a link to Discussions
and to Matrix and Discord chats.
* Add a link to the valkey-doc repo.
* The crash report template is extended into a form, with separate
fields for the crash report and the additional info.
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Motivation: Currently we have to manually update the all_tests variable
when introducing new test files.
Fix: I've modified it to list test files dynamically, but rather than
modify it to add all test files, I've first modified it to only add test
files from the following 4 paths so that it doesn't deviate too much
from what we already do
- unit
- unit/type
- unit/cluster
- integration
Related issue: https://github.com/valkey-io/valkey/issues/302
---------
Signed-off-by: jonghoonpark <dev@jonghoonpark.com>
The time between each consequent probes is set by TCP_KEEPINTVL in
seconds. The minimum value is ten seconds. The maximum is ten days,
while the default is two hours. The TCP connection will be aborted after
certain amount of probes, which is set by TCP_KEEPCNT, without receiving
response.
## References
[Solaris
11.4](https://docs.oracle.com/cd/E88353_01/html/E37851/tcp-4p.html)
Signed-off-by: Andy Pan <i@andypan.me>
Updated release-tool scripts to valkey, This PR covered only the file
names for the tarball but still location needs to be updated
accordingly.
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
This utils/lru/README incorrectly refers to REDIS_LRU_CLOCK_RESOLUTION
in server.h to modify the LRU clock resolution. However, the actual
constant in server.h has been updated to LRU_CLOCK_RESOLUTION, but the
README was not updated to reflect this change.
1. Replaced REDIS_LRU_CLOCK_RESOLUTION with LRU_CLOCK_RESOLUTION in the
text of utils/lru/README.
2. Updated references from "Redis" to "Valkey" within the same README
file as part of the ongoing rebranding efforts:)
---------
Signed-off-by: Sher Sun <sher.sun@huawei.com>
Co-authored-by: Sher Sun <sher.sun@huawei.com>
Fixes#241
Removes the empty `history` arrays in 6 command JSON files. This
normalizes these 6 with the rest of the command JSON which omit the
`history` array entirely when there is no history.
This makes parsing these files slightly less annoying in languages where
empty arrays are falsey.
Signed-off-by: Kyle J. Davis <kyledvs@amazon.com>
This sets the default permission for current CI workflows to only be
able to read from the repository (scope: "contents").
When a used Github Action require additional permissions (like CodeQL)
we grant that permission on job-level instead.
This means that a compromised action will not be able to modify the repo
or even steal secrets since all other permission-scopes are implicit set
to "none", i.e. not permitted. This is recommended by
[OpenSSF](https://github.com/ossf/scorecard/blob/main/docs/checks.md#token-permissions).
This PR includes a small fix for the possibility of missing server logs
artifacts, found while verifying the permission.
The `upload-artifact@v3` action will replace artifacts which already
exists. Since both CI-jobs `test-external-standalone` and
`test-external-nodebug` uses the same artifact name, when both jobs
fail, we only get logs from the last finished job. This can be avoided
by using unique artifact names.
This PR is part of #211
More about permissions and scope can be found here:
https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#permissions
---------
Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
* Tested it on local instance. This was originally part of
https://github.com/valkey-io/valkey/pull/288 but I am pushing this
separately, so that we can easily merge it into the upcoming release.
```
lua debugger> server ping
<redis> ping
<reply> "+PONG"
lua debugger> redis ping
<redis> ping
<reply> "+PONG"
```
* I also searched for lua debugger related unit tests to add coverage
for this but did not find any relevant test to modify. Leaving it at
that for now.
---------
Signed-off-by: Parth Patel <661497+parthpatel@users.noreply.github.com>
Add an initial security release page. In the fullness of time I would
like to also include our version support here, but until that has been
decided I would like to keep this simple and just include links.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Update comment suggesting clusterMsgPingExtTypes to clusterMsgPingtypes
as clusterMsgPingExtTypes does not exist
Signed-off-by: Ben Totten <btotten@amazon.com>
ValkeyModuleEvent_MasterLinkChange was updated to use more inclusive
language, but it was done in the compatibility layer as well
(RedisModuleEvent_).
---------
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Signed-off-by: Daniel House <daniel.house@huawei.com>
Signed-off-by: daniel-house <danny@cs.toronto.edu>
Co-authored-by: Daniel House <daniel.house@huawei.com>
Actually, this script doesn't do anything except printing that it is
replaced by redis-cli.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Updated to Valkey in valkey-cli.c file's comments and prints.
* The output of valkey-cli --help
* The output of the cli built-in HELP command
* The prompt in interactive valkey-cli -s unixsocket
* The history file and the default rc file (changed filename)
---------
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
The default pid file is created at /var/run/redis.pid based on the code
at
da831c0d22/src/server.h (L132).
Until we update it, we should reflect that in the conf file.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Updated procedure redis_deferring_client in test environent to
valkey_deferring_client.
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
This includes comments used for module API documentation.
* Strategy for replacement: Regex search: `(//|/\*| \*|#).* ("|\()?(r|R)edis( |\.
|'|\n|,|-|\)|")(?!nor the names of its contributors)(?!Ltd.)(?!Labs)(?!Contributors.)`
* Don't edit copyright comments
* Replace "Redis version X.X" -> "Redis OSS version X.X" to distinguish
from newly licensed repository
* Replace "Redis Object" -> "Object"
* Exclude markdown for now
* Don't edit Lua scripting comments referring to redis.X API
* Replace "Redis Protocol" -> "RESP"
* Replace redis-benchmark, -cli, -server, -check-aof/rdb with "valkey-"
prefix
* Most other places, I use best judgement to either remove "Redis", or
replace with "the server" or "server"
Fixes#148
---------
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Wait for cluster to be in a fully consistent and online state in
`cluster_config_consistent`. We expect the `start_server` to create the
desired primaries and replicas before the start of the tests. With the
current setup, the replicas may not complete the sync with primaries and
can be in loading state. In some cases, the role of replicas can still
be master with the delay of propagation of replicate command. The tests
can show flaky behavior in such cases. Add a check that verifies the
nodes health status 'online' for the cluster consistency. Leverage the
deterministic order of `CLUSTER SLOTS` to consider the cluster as
consistent along with the nodes health status.
---------
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>
Co-authored-by: Harkrishn Patro <harkrisp@amazon.com>
Co-authored-by: Ram Prasad Voleti <ramvolet@amazon.com>
Ref: https://github.com/redis/redis/pull/12760
### Description
#### Fixes compatibilty of PlaceholderKV cluster (7.2 - extensions
enabled by default) with older Redis cluster (< 7.0 - extensions not
handled) .
With some of the extensions enabled by default in 7.2 version, new nodes
running 7.2 and above start sending out larger clusterbus message
payload including the ping extensions. This caused an incompatibility
with node running engine versions < 7.0. Old nodes (< 7.0) would receive
the payload from new nodes (> 7.2) would observe a payload length
(totlen) > (estlen) and would perform an early exit and won't process
the message.
This fix introduces a flag `extensions_supported` on the clusterMsg
indicating the sender node can handle extensions parsing. Once, a
receiver nodes receives a message with this flag set to 1, it would
update clusterNode new field extensions_supported and start sending out
extensions if it has any.
This PR also introduces a DEBUG sub command to enable/disable cluster
message extensions `process-clustermsg-extensions` feature.
Note: A successful `PING`/`PONG` is required as a sender for a given
node to be marked as `extensions_supported` and then extensions message
will be sent to it. This could cause a slight delay in receiving the
extensions message(s).
### Testing
TCL test verifying the cluster state is healthy irrespective of
enabling/disabling cluster message extensions feature.
---------
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Apply new logo at startup.
It is one character wider and 2 characters taller than the original
Redis logo.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Small changes to the log messages printed during startup and shutdown,
for Valkey branding.
SERVER_NAME is replaced by verbatim "Valkey" in one place, because
SERVER_NAME expands to "valkey" in lowercase. (Should we introduce
another macro that expands to "Valkey"?)
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
These were flagged on the 7.2 build system, which is using the old spell
check. I think we should consider re-adding it as it missed some typos.
Relevant: https://github.com/valkey-io/valkey/pull/72
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Adds a new make variable called `USE_REDIS_SYMLINKS`, with default value
`yes`. If yes, then `make install` creates additional symlinks to the
installed binaries:
* `valkey-server`
* `valkey-cli`
* `valkey-benchmark`
* `valkey-check-rdb`
* `valkey-check-aof`
* `valkey-sentinel`
The names of the symlinks are the legacy redis binary names
(`redis-server`, etc.). The purpose is to provide backward compatibility
for scripts expecting the these filenames. The symlinks are installed in
the same directory as the binaries (typically `/usr/local/bin/` or
similar).
Similarly, `make uninstall` removes these symlinks if
`USE_REDIS_SYMLINKS` is `yes`.
This is described in a note in README.md.
Fixes#147
---------
Signed-off-by: Vitah Lin <vitahlin@gmail.com>
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
This commit does not remove redis.call/pcall just yet. It also does not
rename Redis in error messages such as "Please specify at least one
argument for this redis lib call". This allows users to maintain full
backwards compatibility while introducing an option to use server.call
for new code.
I verified that the unit tests pass. Also manually verified that the
redis-server responds to server.call invocations within lua scripting.
Also verified that function registration works as expected.
```
[ok]: EVAL - is Lua able to call Redis API? (0 ms)
[ok]: EVAL - is Lua able to call Server API? (1 ms)
[ok]: EVAL - No arguments to redis.call/pcall is considered an error (0 ms)
[ok]: EVAL - No arguments to server.call/pcall is considered an error (1 ms)
```
---------
Signed-off-by: Parth Patel <661497+parthpatel@users.noreply.github.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
This commit updates the following fields:
1. server_version -> valkey_version in server info. Since we would like
to advertise specific compatibility, we are making the version specific
to valkey. servername will remain as an optional indicator, and other
valkey compatible stores might choose to advertise something else.
1. We dropped redis-ver from the API. This isn't related to API
compatibility, but we didn't want to "fake" that valkey was creating an
rdb from a Redis version.
1. Renamed server-ver -> valkey_version in rdb info. Same as point one,
we want to explicitly indicate this was created by a valkey server.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Fix#146
Removed REDISMODULE_ prefixes from the core source code to align with
the new SERVERMODULE_ naming convention. Added a new 'redismodule.h'
header file to ensure full backward compatibility with existing modules.
This compatibility layer maps all legacy REDISMODULE_ prefixed
identifiers to their new SERVERMODULE_ equivalents, allowing existing
Redis modules to function without modification.
---------
Signed-off-by: Ping Xie <pingxie@google.com>
Mostly comments, but one pre-filled config in this template config file
is changed:
pidfile /var/run/valkey-sentinel.pid
---------
Signed-off-by: hwware <wen.hui.ware@gmail.com>
This little suggestion can help contributors to link their PRs to
issues. This, in turn, helps the maintainers.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
New info information to be used to determine the valkey versioning info.
Internally, introduce new define values for "SERVER_VERSION" which is
different from the Redis compatibility version, "REDIS_VERSION".
Add two new info fields:
`server_version`: The Valkey server version
`server_name`: Indicates that the server is valkey.
Add one new RDB field: `server_ver`, which indicates the valkey version
that produced the server.
Add 3 new LUA globals: `SERVER_VERSION_NUM`, `SERVER_VERSION`, and
`SERVER_NAME`. Which reflect the valkey version instead of the Redis
compatibility version.
Also clean up various places where Redis and configuration was being
used that is no longer necessary.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Fixing redis -> valkey in the output of valkey-server --help.
Signed-off-by: Daniel House <daniel.house@huawei.com>
Co-authored-by: Daniel House <daniel.house@huawei.com>
Rename include guard macros (redis -> valkey) and remove the leading double underscore.
---------
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
As part of earlier PRs binary names and some file names renamed to
valkey, but still cli and benchmark and other source files still with redis
name. So changed the file names and makefile accordingly.
Signed-off-by: hwware <wen.hui.ware@gmail.com>
Co-authored-by: hwware <wen.hui.ware@gmail.com>
This PR covers changing the redis.crt and redis.key to valkey certs for
TLS testing.
The files are generated by the gen-test-certs.sh script under tests/tls/.
Also covers comments provided.
Signed-off-by: hwware <wen.hui.ware@gmail.com>
Co-authored-by: hwware <wen.hui.ware@gmail.com>
Remove trademarked wording on configuration layer.
Following changes for release notes:
1. Rename redis.conf to valkey.conf
2. Pre-filled config in the template config file: Changing pidfile to `/var/run/valkey_6379.pid`
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
We use `#ifdef SERVER_TEST` to run the relavent tests, we can now remove
the dead code `#ifdef REDIS_TEST_MAIN`.
Signed-off-by: Vitah Lin <vitahlin@gmail.com>
Change syntax hint from
SHUTDOWN [NOSAVE | SAVE] [NOW] [FORCE] [ABORT]
to
SHUTDOWN [[NOSAVE | SAVE] [NOW] [FORCE] | ABORT]
It's not that important for docs, but the latter is preferred for valkey-cli's automatic syntax hints.
Signed-off-by: LiiNen <kjeonghoon065@gmail.com>
The test flag `REDIS_TEST` has already be changed to `SERVER_TEST` in
`.github/workflows/daily.yml`, the name in the src directory need to be
changed as well.
```shell
run: |
sudo apt-get update && sudo apt-get install libc6-dev-i386
make 32bit SERVER_CFLAGS='-Werror -DSERVER_TEST'
```
Signed-off-by: Vitah Lin <vitahlin@gmail.com>
desc is set to NULL, never set to anything, and then
checked if it should be freed on the error path if it's NULL.
This can be cleaned up, since it's really unused.
Fixes#129
Signed-off-by: 0del <bany.y0599@gmail.com>
Fix the coverity name to reflect the actual name in coverity. See
successful build here:
https://github.com/valkey-io/valkey/actions/runs/8516329554. Also
removed unnecessary TCL dependency from the install.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
This resolves (1.viii) from
https://github.com/valkey-io/valkey/issues/43
> REDIS_FLAGS will be updated to SERVER_FLAGS. Maybe we should also
allow REDIS_FLAGS -> SERVER_FLAGS as well, for an extra layer of
compatibility.
---------
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Documentation references should use `Valkey` while server and cli
references are all under `valkey`.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Update the contributor file to include a DCO, which are going to require
for all incoming contributions to provide indication that the
contributor has the rights to contribute the code.
I also fixed the formatting of contribution HOWTO, since it was getting
mangled.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
In ASAN CI, we find server may crash because of NULL ptr in `kvstoreIncrementallyRehash`.
the reason is that we use two phase unlink in `dbGenericDelete`. After `kvstoreDictTwoPhaseUnlinkFind`,
the dict may be in rehashing and only have one element in ht[0] of `db->keys`.
When we delete the last element in `db->keys` meanwhile `db->keys` is in rehashing, we may free the
dict in `kvstoreDictTwoPhaseUnlinkFree` without deleting the node in `kvs->rehashing`. Then we may
use this freed ptr in `kvstoreIncrementallyRehash` in the `serverCron` and cause the crash.
This is indeed a use-after-free problem.
The fix is to call rehashingCompleted in dictRelease and dictEmpty, so that every call for
rehashingStarted is always matched with a rehashingCompleted.
Adding a test in the unit test to catch it consistently
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
In `beginResultEmission`, -1 means the result length is not known in
advance. But after #12185, if we pass -1 to `zrangeResultBeginStore`, it
will convert to SIZE_MAX in `zsetTypeCreate` and try to `dictExpand`.
Although `dictExpand` won't succeed because the size overflows, I think
we'd better to avoid this wrong conversion.
This bug can be triggered when the source of `zrangestore` doesn't exist
or we use `zrangestore` command with `byscore` or `bylex`.
The impact is that dst keys will be converted to use skiplist instead of
listpack.
Users who abuse lua error_reply will generate a new error object on each
error call, which can make server.errors get bigger and bigger. This
will
cause the server to block when calling INFO (we also return errorstats
by
default).
To prevent the damage it can cause, when a misuse is detected, we will
print a warning log and disable the errorstats to avoid adding more new
errors. It can be re-enabled via CONFIG RESETSTAT.
Because server.errors may be very large (it may be better now since we
have the limit), config resetstat may block for a while. So in
resetErrorTableStats, we will try to lazyfree server.errors.
See the related discussion at the end of #8217.
After #13072, there is an use-after-free error. In expireScanCallback, we
will delete the dict, and then in dictScan we will continue to use the dict,
like we will doing `dictResumeRehashing(d)` in the end, this casued an error.
In this PR, in freeDictIfNeeded, if the dict's pauserehash is set, don't
delete the dict yet, and then when scan returns try to delete it again.
At the same time, we noticed that there will be similar problems in iterator.
We may also delete elements during the iteration process, causing the dict
to be deleted, so the part related to iter in the PR has also been modified.
dictResetIterator was also missing from the previous kvstoreIteratorNextDict,
we currently have no scenario that elements will be deleted in kvstoreIterator
process, deal with it together to avoid future problems. Added some simple
tests to verify the changes.
In addition, the modification in #13072 omitted initTempDb and emptyDbAsync,
and they were also added. This PR also remove the slow flag from the expire
test (consumes 1.3s) so that problems can be found in CI in the future.
This change to the Makefile makes it possible to opt out of
`-fno-omit-frame-pointer` added in #12973 and `-flto` (#11350). Those
features were implemented by conditionally modifying the `REDIS_CFLAGS`
and `REDIS_LDFLAGS` variables. Historically, those variables provided a
way for users to pass options to the compiler and linker unchanged.
Instead of conditionally appending optimization flags to REDIS_CFLAGS
and REDIS_LDFLAGS, I want to append them to the OPTIMIZATION variable.
Later in the Makefile, we have `OPT=$(OPTIMIZATION)` (meaning
OPTIMIZATION is only a default for OPT, but OPT can be overridden by the
user), and later the flags are combined like this:
FINAL_CFLAGS=$(STD) $(WARN) $(OPT) $(DEBUG) $(CFLAGS) $(REDIS_CFLAGS)
FINAL_LDFLAGS=$(LDFLAGS) $(OPT) $(REDIS_LDFLAGS) $(DEBUG)
This makes it possible for the the user to override all optimization
flags with e.g. `make OPT=-O1` or just `make OPT=`.
For some reason `-O3` was also already added to REDIS_LDFLAGS by default
in #12339, so I added OPT to FINAL_LDFLAGS to avoid more complex logic
(such as introducing a separate LD_OPT variable).
Currently (following #11695, and #12822), keys kvstore and expires
kvstore both flag with ON_DEMAND, it means that a cluster node will
only allocate a dict when the slot is assigned to it and populated,
but on the other hand, when the slot is unassigned, the dict will
remain allocated.
We considered releasing the dict when the slot is unassigned, but it
causes complications on replicas. On the other hand, from benchmarks
we conducted, it looks like the performance impact of releasing the
dict when it becomes empty and re-allocate it when a key is added
again, isn't huge.
This PR add KVSTORE_FREE_EMPTY_DICTS to cluster mode keys / expires
kvstore.
The impact is about about 2% performance drop, for this hopefully
uncommon scenario.
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
In some cases, users will abuse lua eval. Each EVAL call generates
a new lua script, which is added to the lua interpreter and cached
to redis-server, consuming a large amount of memory over time.
Since EVAL is mostly the one that abuses the lua cache, and these
won't have pipeline issues (i.e. the script won't disappear
unexpectedly,
and cause errors like it would with SCRIPT LOAD and EVALSHA),
we implement a plain FIFO LRU eviction only for these (not for
scripts loaded with SCRIPT LOAD).
### Implementation notes:
When not abused we'll probably have less than 100 scripts, and when
abused we'll have many thousands. So we use a hard coded value of 500
scripts. And considering that we don't have many scripts, then unlike
keys, we don't need to worry about the memory usage of keeping a true
sorted LRU linked list. We compute the SHA of each script anyway,
and put the script in a dict, we can store a listNode there, and use
it for quick removal and re-insertion into an LRU list each time the
script is used.
### New interfaces:
At the same time, a new `evicted_scripts` field is added to
INFO, which represents the number of evicted eval scripts. Users
can check it to see if they are abusing EVAL.
### benchmark:
`./src/redis-benchmark -P 10 -n 1000000 -r 10000000000 eval "return
__rand_int__" 0`
The simple abuse of eval benchmark test that will create 1 million EVAL
scripts. The performance has been improved by 50%, and the max latency
has dropped from 500ms to 13ms (this may be caused by table expansion
inside Lua when the number of scripts is large). And in the INFO memory,
it used to consume 120MB (server cache) + 310MB (lua engine), but now
it only consumes 70KB (server cache) + 210KB (lua_engine) because of
the scripts eviction.
For non-abusive case of about 100 EVAL scripts, there's no noticeable
change in performance or memory usage.
### unlikely potentially breaking change:
in theory, a user can maybe load a
script with EVAL and then use EVALSHA to call it (by calculating the
SHA1 value on the client side), it could be that if we read the docs
carefully we'll realized it's a valid scenario, but we suppose it's
extremely rare. So it may happen that EVALSHA acts on a script created
by EVAL, and the script is evicted and EVALSHA returns a NOSCRIPT error.
that is if you have more than 500 scripts being used in the same
transaction / pipeline.
This solves the second point in #13102.
Allow using `+` as a special ID for last item in stream on XREAD
command.
This would allow to iterate on a stream with XREAD starting with the
last available message instead of the next one which `$` is used for.
I.e. the caller can use `BLOCK` and `+` on the first call, and change to
`$` on the next call.
Closes#7388
---------
Co-authored-by: Felipe Machado <462154+felipou@users.noreply.github.com>
Sometimes it's useful to compute a key's cluster slot in a module.
This API function is just like the command CLUSTER KEYSLOT (but faster).
A "reverse" API is also added:
`RedisModule_ClusterCanonicalKeyNameInSlot`. Given a slot, it returns a
short string that we can call a canonical key for the slot.
The check in fileIsManifest misjudged the manifest file. For example,
if resp aof contains "file", it will be considered a manifest file and
the check will fail:
```
*3
$3
set
$4
file
$4
file
```
In #12951, if the preamble aof also contains it, it will also fail.
Fixes#12951.
the bug was happening if the the word "file" is mentioned
in the first 1024 lines of the AOF. and now as soon as it finds
a non-comment line it'll break (if it contains "file" or doesn't)
Distribute operations via `redis-benchmark` on different slots owned by
node.
`current_slot_index` is never updated hence the value is always `0` and
the tag used is always the first slot owned by the node. Hence any
read/write operation via `redis-benchmark` in cluster mode always
happens on a particular slot.
This is inconvenient to load data uniformly via `redis-benchmark`.
Since lua_Number is not explicitly an integer or a double, we need to
make an effort
to convert it as an integer when that's possible, since the string could
later be used
in a context that doesn't support scientific notation (e.g. 1e9 instead
of 100000000).
Since fpconv_dtoa converts numbers with the equivalent of `%f` or `%e`,
which ever is shorter,
this would break if we try to pass a long integer number to a command
that takes integer.
we'll get an implicit conversion to string in Lua, and then the parsing
in getLongLongFromObjectOrReply will fail.
```
> eval "redis.call('hincrby', 'key', 'field', '1000000000')" 0
(nil)
> eval "redis.call('hincrby', 'key', 'field', tonumber('1000000000'))" 0
(error) ERR value is not an integer or out of range script: ac99c32e4daf7e300d593085b611de261954a946, on @user_script:1.
```
Switch to using ll2string if the number can be safely represented as a
long long.
The problem was introduced in #10587 (Redis 7.2).
closes#13113.
---------
Co-authored-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
Update the crc16 hash lookup table to use fixed size character arrays instead of pointers
to static string addresses. Since the actual values are so short, we can just store them
in a uniform array instead. This saves about 128kb of memory and should improve the
performance as well since we should have much better memory locality.
`CONFIG SET oom-score-adj handles configuration failures` test failed in
some CI jobs today.
Failed CI: https://github.com/redis/redis/actions/runs/8152519326
Not sure why the github action's docker image perssions have changed,
but the issue is similar to #12887,
where we can't assume the range of oom_score_adj that a user can change.
## Solution:
Modify the way of determining whether the current user has no privileges
or not,
instead of relying on whether the user id is 0 or not.
This commit updates the processing of PONG gossip messages in the
cluster. When a node (B) becomes a replica due to a failover, its PONG
messages include its new primary node's (A) information and B's
configuration epoch is aligned with A's. This allows observer nodes to
identify changes in primary-ship, addressing issues of intermediate
states and enhancing cluster state consistency during topology changes.
Fix#13018
After #13013
### This PR make effort to defrag the pubsub kvstore in the following
ways:
1. Till now server.pubsub(shard)_channels only share channel name obj
with the first subscribed client, now change it so that the clients and
the pubsub kvstore share the channel name robj.
This would save a lot of memory when there are many subscribers to the
same channel.
It also means that we only need to defrag the channel name robj in the
pubsub kvstore, and then update
all client references for the current channel, avoiding the need to
iterate through all the clients to do the same things.
2. Refactor the code to defragment pubsub(shard) in the same way as
defragment of keys and EXPIRES, with the exception that we only
defragment pubsub(without shard) when slot is zero.
### Other
Fix an overlook in #11695, if defragment doesn't reach the end time, we
should wait for the current
db's keys and expires, pubsub and pubsubshard to finish before leaving,
now it's possible to exit
early when the keys are defragmented.
---------
Co-authored-by: oranagra <oran@redislabs.com>
The new regular expression break the validator:
```
In file included from commands.c:10:
commands_with_reply_schema.def:14528:72: error: stray ‘\’ in program
14528 | struct jsonObjectElement MEMORY_STATS_ReplySchema_patternProperties__db\_\d+__properties_overhead_hashtable_main_elements[] = {
```
The reason is that special characters are not added to to_c_name,
causes special characters to appear in the structure name, causing
c file compilation to fail.
Broken by #12913
- In removeUnsupportedArgs, trying to access the next item after the
last one and causing an out of bounds read.
- In versionIsSupported, when the 'version' is equal to 'since', the
return value is 0 (not supported).
Also, change the function to return `not supported` in case they have
different numbers of digits
Both issues were found by `Non-interactive non-TTY CLI: Test
command-line hinting - old server` under `test-sanitizer-address` (When
changing the `src/version.h` locally to `8.0.0`)
The new `MAXAGE` argument inside `client-kill` triggered the issue (new
argument at the end of the list)
---------
Co-authored-by: YaacovHazan <yaacov.hazan@redislabs.com>
Sometimes we need to make fast judgement about why Redis is suddenly
taking more memory. One of the reasons is main DB's dicts doing
rehashing.
We may use `MEMORY STATS` to monitor the overhead memory of each DB, but
there still lacks a total sum to show an overall trend. So this PR adds
the total overhead of all DBs to `INFO MEMORY` section, together with
the total count of rehashing DB dicts, providing some intuitive metrics
about main dicts rehashing.
This PR adds the following metrics to INFO MEMORY
* `mem_overhead_db_hashtable_rehashing` - only size of ht[0] in
dictionaries we're rehashing (i.e. the memory that's gonna get released
soon)
and a similar ones to MEMORY STATS:
* `overhead.db.hashtable.lut` (complements the existing
`overhead.hashtable.main` and `overhead.hashtable.expires` which also
counts the `dictEntry` structs too)
* `overhead.db.hashtable.rehashing` - temporary rehashing overhead.
* `db.dict.rehashing.count` - number of top level dictionaries being
rehashed.
---------
Co-authored-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
In XREADGROUP ACK, because streamPropagateXCLAIM does not propagate
entries-read, entries-read will be inconsistent between master and
replicas.
I.e. if no entries were claimed, it would have propagated correctly, but
if some
were claimed, then the entries-read field would be inconsistent on the
replica.
The fix was suggested by guybe7, call streamPropagateGroupID
unconditionally,
so that we will normalize entries_read on the replicas. In the past, we
would
only set propagate_last_id when NOACK was specified. And in #9127,
XCLAIM did
not propagate entries_read in ACK, which would cause entries_read to be
inconsistent between master and replicas.
Another approach is add another arg to XCLAIM and let it propagate
entries_read,
but we decided not to use it. Because we want minimal damage in case
there's an
old target and new source (in the worst case scenario, the new source
doesn't
recognize XGROUP SETID ... ENTRIES READ and the lag is lost. If we
change XCLAIM,
the damage is much more severe).
In this patch, now if the user uses XREADGROUP .. COUNT 1 there will be
an additional
overhead of MULTI, EXEC and XGROUPSETID. We assume the extra command in
case of
COUNT 1 (4x factor, changing from one XCLAIM to
MULTI+XCLAIM+XSETID+EXEC), is probably
ok since reading just one entry is in any case very inefficient (a
client round trip
per record), so we're hoping it's not a common case.
Issue was introduced in #9127.
Even if we have SCRIPT FLUSH ASYNC now, when there are a lot of
lua scripts, SCRIPT FLUSH ASYNC will still block the main thread.
This is because lua_close is executed in the main thread, and lua
heap needs to release a lot of memory.
In this PR, we take the current lua instance on lctx.lua and call
lua_close on it in a background thread, to close it in async way.
This is MeirShpilraien's idea.
The --count option for redis-cli has been released in redis 7.2.
https://github.com/redis/redis/pull/12042
But I have found in code, that some logic was missing for using this
'count' option.
```
static redisReply *sendScan(unsigned long long *it) {
redisReply *reply;
if (config.pattern)
reply = redisCommand(context, "SCAN %llu MATCH %b COUNT %d",
*it, config.pattern, sdslen(config.pattern), config.count);
else
reply = redisCommand(context,"SCAN %llu",*it);
```
The intention was being able to using scan count.
But in this case, the --count will be only applied when 'pattern' is
declared.
So, I had fix it simply, to be worked properly - even if --pattern
option is not being used.
I tested it simply with time() command several times, and I could see it
works as intended with this commit.
The examples of test results are below:
```
# unstable build
time(./redis-cli -a $AUTH -p $PORT -h $HOST --scan >/dev/null 2>/dev/null)
real 0m1.287s
user 0m0.011s
sys 0m0.022s
# count is not applied
time(./redis-cli -a $AUTH -p $PORT -h $HOST --scan --count 1000 >/dev/null 2>/dev/null)
real 0m1.117s
user 0m0.011s
sys 0m0.020s
# count is applied with --pattern
time(./redis-cli -a $AUTH -p $PORT -h $HOST --scan --count 1000 --pattern "hash:*" >/dev/null 2>/dev/null)
real 0m0.045s
user 0m0.002s
sys 0m0.002s
```
```
# fix-redis-cli-scan-count build
time(./redis-cli -a $AUTH -p $PORT -h $HOST --scan >/dev/null 2>/dev/null)
real 0m1.084s
user 0m0.008s
sys 0m0.024s
# count is applied even if --pattern is not declared
time(./redis-cli -a $AUTH -p $PORT -h $HOST --scan --count 1000 >/dev/null 2>/dev/null)
real 0m0.043s
user 0m0.000s
sys 0m0.004s
# of course this also applied
time(./redis-cli -a $AUTH -p $PORT -h $HOST --scan --count 1000 --pattern "hash:*" >/dev/null 2>/dev/null)
real 0m0.031s
user 0m0.002s
sys 0m0.002s
```
Thanks a lot.
If we call `DEL` on expired keys, keys may be deleted in
`expireIfNeeded` and we don't need to call `dbSyncDelete` or
`dbAsyncDelete` after, which repeat the deletion process(i.e. find keys
in main db).
In this PR, I refine the return values of `expireIfNeeded` to indicate
whether we have deleted the expired key to avoid the potential redundant
deletion logic in `delGenericCommand`. Besides, because both KEY_EXPIRED
and KEY_DELETED are non-zero, this PR won't affect other functions
calling `expireIfNeeded`.
I also make a performance test. I first close active expiration by
`debug set-active-expire 0` and write 1 million keys with 1ms TTL. Then
I repeatedly delete 100 expired keys in one `DEL`. The results are as
follow, which shows that this PR can improve performance by about 10% in
this situation.
**unstable**
```
Summary:
throughput summary: 10080.65 requests per second
latency summary (msec):
avg min p50 p95 p99 max
0.953 0.136 0.959 1.215 1.335 2.247
```
**This PR**
```
Summary:
throughput summary: 11074.20 requests per second
latency summary (msec):
avg min p50 p95 p99 max
0.865 0.128 0.879 1.055 1.175 2.159
```
---------
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Oran Agra <oran@redislabs.com>
In #8554, we added a MALLOC_MIN_SIZE to use a minimum allocation
size when using malloc(0). However, we did not update the size,
when malloc_size is missing.
When malloc_size exists, we record the size that was allocated
instead of the size that was requested. This would work with both
jemalloc, and libc malloc (the change in #8554, doesn't break this).
When malloc_size is missing, we allocate extra size_t bytes and
store the requested size in it. In that case, the requested size
is probably different than the allocated size anyway (the change
in #8554 doesn't conceptually change that).
So we have room for improvement since in this case we are aware
of the extra bytes we asked for. Same as we're also aware of the
extra size_t bytes we asked for.
In addition, some cleaning was done:
1. fixes some outupdated comments.
2. test cleanups
Implement #12699
This PR exposing Lua os.clock() api for getting the elapsed time of Lua
code execution.
Using:
```lua
local start = os.clock()
...
do something
...
local elpased = os.clock() - start
```
---------
Co-authored-by: Meir Shpilraien (Spielrein) <meir@redis.com>
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
Following #12568
In issue #9357, when inserting an element larger than 1GB, we currently
store it in a plain node instead of a listpack.
Presently, when we insert an element that exceeds the maximum size of a
packed node, it cannot be accommodated in any other nodes, thus ending
up isolated like a large element.
I.e. it's a node with only one element, but it's listpack encoded rather
than a plain buffer.
This PR lowers the threshold for considering an element as 'large' from
1GB to the maximum size of a node.
While this change doesn't completely resolve the bug mentioned in the
previous PR, it does mitigate its potential impact.
As a result of this change, we can now only use LSET to replace an
element with another element that falls below the maximum size
threshold.
In the worst-case scenario, with a fill of -5, the largest packed node
we can create is 2GB (32k * 64k):
* 32k: The smallest element in a listpack is 2 bytes, which allows us to
store up to 32k elements.
* 64k: This is the maximum size for a single quicklist node.
## Others
To fully fix#9357, we need more work, as discussed in #12568, when we
insert an element into a quicklistNode, it may be created in a new node,
put into another node, or merged, and we can't correctly delete the node
that was supposed to be deleted.
I'm not sure it's worth it, since it involves a lot of modifications.
Implement #12963
## Changes
1. large bins don't have external fragmentation or are at least
non-defraggable, so we should ignore the effect of
large bins when measuring fragmentation, and only measure fragmentation
of small bins. this affects both the allocator_frag* metrics and also
the active-defrag trigger
2. Adding INFO metrics for `muzzy` memory, which is memory returned to
the OS but still shows as RSS until the OS reclaims it.
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
Recently I saw in CI that reply-schemas-validator fails here:
```
Failed validating 'minimum' in schema[1]['properties']['groups']['items']['properties']['consumers']['items']['properties']['active-time']:
{'description': 'Last time this consumer was active (successful '
'reading/claiming).',
'minimum': 0,
'type': 'integer'}
On instance['groups'][0]['consumers'][0]['active-time']:
-1729380548878722639
```
The reason is that in fuzzer, we may restore corrupted active-time,
which will cause the reply schema CI to fail.
The fuzzer can cause corrupt the state in many places, which will
bugs that mess up the reply, so we decided to skip logreqres.
Also, seen-time is the same type as active-time, adding the minimum.
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
There is a timing issue in the test, close may arrive late, or in
freeClientAsync we will free the client in async way, which will
lead to errors in watching_clients statistics, since we will only
unwatch all keys when we truly freeClient.
Add a wait here to avoid this problem. Also fixed some outdated
comments i saw. The test was introduced in #12966.
We can see that the past time here happens to be busy_time_limit,
causing the test to fail:
```
[err]: RM_Call from blocked client in tests/unit/moduleapi/blockedclient.tcl
Expected '50' to be more than '50' (context: type eval line 26 cmd {assert_morethan [expr [clock clicks -milliseconds]-$start] $busy_time_limit} proc ::test)
```
It is reasonable for them to be equal, so equal is added here.
It should be noted that in the previous `Busy module command` test,
we also used assert_morethan_equal, so this should have been missed
at the time.
Currently redis uses O3 level optimization would remove the frame pointer
in the target bin.
In the very old past, when gcc optimized at O1 and above levels, the
frame pointer is deleted by default to improve performance. This saves
the RBP registers and reduces the pop/push instructions. But it makes it
difficult for us to observe the running status of the program. For
example, the perf tool cannot be used effectively, especially the modern
eBPF tools such as bcc/memleak.
Add readme about the command json folder, what it does, and who should
(not) use it.
see discussion
https://github.com/redis/redis/issues/9359#issuecomment-1936420698
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
In the `databasesCron()`, the time consumed by
`kvstoreIncrementallyRehash()` is used to calculate the exit condition.
However, within `kvstoreIncrementallyRehash()`, the loop first checks
for timeout before performing rehashing. Therefore, the time for the
last rehash isn't accounted for, making the consumed time inaccurate. We
need to precisely calculate all the time spent on rehashing.
Additionally, the time allocated to `kvstoreIncrementallyRehash()`
should be the remaining time, which is
`INCREMENTAL_REHASHING_THRESHOLD_US` minus the already consumed
`elapsed_us`.
Currently aof_last_fsync is using a low resolution unixtime is really
bad,
it checks if the absolute number of (full) seconds changed by one.
depending on which side of the second barrier it falls, we can get very
different results.
This PR change the resolution to use milliseconds instead of complete
seconds.
In cases where the event loop cycle duration is short and their rapid
(e.g. running
many fast commands with short pipeline, or a high `hz` config), this
change will not
make much difference, since in anyway, we'll be quick to detect that
we're on a "new
second", and it's likely that these fsync will always be executed close
to the second
switch barrier.
But in cases of rare or slow event loops cycles (e.g. either slow
commands, or very
low rate of traffic to redis, and low `hz`), it could easily be that
with the old code,
in some cases we'll have over 1.5 seconds between fsyncs, and in others
less than 0.5.
see discussion in #8612
This PR also handle aof_flush_postponed_start as well, the damage there
is smaller
since the threshold is 2 seconds, and not 1.
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
Redis has some special commands that mark the client's state, such as
`subscribe` and `blpop`, which mark the client as `CLIENT_PUBSUB` or
`CLIENT_BLOCKED`, and we have metrics for the special use cases.
However, there are also other special commands, like `WATCH`, which
although do not have a specific flags, and should also be considered
stateful client types. For stateful clients, in many scenarios, the
connections cannot be shared in "connection pool", meaning connection
pool cannot be used. For example, whenever the `WATCH` command is
executed, a new connection is required to put the client into the "watch
state" because the watched keys are stored in the client.
If different business logic requires watching different keys, separate
connections must be used; otherwise, there will be contamination. This
also means that if a user's business heavily relies on the `WATCH`
command, a large number of connections will be required.
Recently we have encountered this situation in our platform, where some
users consume a significant number of connections when using Redis
because of `WATCH`.
I hope we can have a way to observe these special use cases and special
client connections. Here I add a few monitoring metrics:
1. `watching_clients` in `INFO` reply: The number of clients currently
in the "watching" state.
2. `total_watched_keys` in `INFO` reply: The total number of keys being
watched.
3. `watch` in `CLIENT LIST` reply: The number of keys each client is
currently watching.
Usually, the probability that a dict exists is much greater than the
probability that it does not exist. In kvstoreDictAddRaw, we will call
kvstoreGetDict multiple times. Based on this assumption, we change
createDictIfNeeded to something like get or create function:
```
before:
dict exist: 2 kvstoreGetDict
dict non-exist: 2 kvstoreGetDict
after:
dict exist: 1 kvstoreGetDict
dict non-exist: 3 kvstoreGetDict
```
A possible 3% performance improvement was observed:
In addition, some typos/comments i saw have been cleaned up.
In low memory situations, sending a big number of arguments (sets)
may cause OOM panic. Use ztrycalloc, like we do on LCS and XAUTOCLAIM,
and fail gracefully.
This change affects the following commands: ZUNION, ZINTER, ZDIFF,
ZUNIONSTORE, ZINTERSTORE, ZDIFFSTORE, ZINTERCARD.
These tests have all failed in daily CI:
```
*** [err]: Blocking XREADGROUP for stream key that has clients blocked on stream - reprocessing command in tests/unit/type/stream-cgroups.tcl
Expected '1101' to be between to '1000' and '1100' (context: type eval line 23 cmd {assert_range [expr $end-$start] 1000 1100} proc ::test)
*** [err]: BLPOP unblock but the key is expired and then block again - reprocessing command in tests/unit/type/list.tcl
Expected '1101' to be between to '1000' and '1100' (context: type eval line 23 cmd {assert_range [expr $end-$start] 1000 1100} proc ::test)
*** [err]: BZPOPMIN unblock but the key is expired and then block again - reprocessing command in tests/unit/type/zset.tcl
Expected '1103' to be between to '1000' and '1100' (context: type eval line 23 cmd {assert_range [expr $end-$start] 1000 1100} proc ::test)
```
Increase the range to avoid failures, and improve the comment to be
clearer.
tests was introduced in #13004.
The receiver does not update any of its cluster state based on gossip
about itself. This commit explicitly avoids sending or processing gossip
about the receiver.
Currently cluster bus gossips include 10% of nodes in the cluster with a
minimum of 3 nodes. For up to 30 node clusters, this commit makes sure
that 1/3 of the gossip (1 out of 3 gossips) is never discarded. This
should help with relatively faster convergence of cluster state in
general.
Following the changes introduced by 8cd62f82c, the dbExpandExpires used
the db_size instead of expires_size.
Co-authored-by: YaacovHazan <yaacov.hazan@redislabs.com>
Following the changes introduced by 8cd62f82c, the kvstoreDictExpand for
the expires kvstore used the slot_size instead of expires_slot_size.
Co-authored-by: YaacovHazan <yaacov.hazan@redislabs.com>
This test fails occasionally:
```
*** [err]: CLIENT KILL maxAGE will kill old clients in tests/unit/introspection.tcl
Expected 2 == 1 (context: type eval line 14 cmd {assert {$res == 1}} proc ::test)
```
This test is very likely to do a false positive if the execute time
takes longer than the max age, for example, if the execution time
between sleep and kill exceeds 1s, rd2 will also be killed due to
the max age.
The test can adjust the order of execution statements to increase
the probability of passing, but this is still will be a timing issue
in some slow machines, so decided give it a few more chances.
The test was introduced in #12299.
Fail CI:
https://github.com/redis/redis/actions/runs/7837608438/job/21387609715
## Why defragment tests only failed under 32-bit
First of all, under 32-bit jemalloc will allocate more small bins and
less large bins, which will also lead to more external fragmentation,
therefore, the fragmentation ratio is higher in 32-bit than in 64-bit,
so the defragment tests(`Active defrag eval scripts: cluster` and
`Active defrag big keys: cluster`) always fails in 32-bit.
## Why defragment tests only failed with cluster
The fowllowing is the result of `Active defrag eval scripts: cluster`
test.
1) Before #11695, the fragmentation ratio is 3.11%.
2) After #11695, the fragmentation ratio grew to 4.58%.
Since we are using per-slot dictionary to manage slots, we will only
defragment the contents of these dictionaries (keys, values), but not
the dictionaries' struct and ht_table, which means that frequent
shrinking and expanding of the dictionaries, will make more fragments.
3) After #12850 and #12948, In cluster mode, a large number of cluster
slot dicts will be shrunk, creating additional fragmention, and the
dictionary will not be defragged.
## Solution
* Add defragmentation of the per-slot dictionary's own structures, dict
struct and ht_table.
## Other change
* Increase floating point print precision of `frags` and `rss` in debug
logs for defrag
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
The test fails here and there:
```
*** [err]: expire scan should skip dictionaries with lot's of empty buckets in tests/unit/expire.tcl
scan didn't handle slot skipping logic.
```
There are two case:
1. In the case of passing the test, we use child process to avoid the
dict resize, but it can not completely limit it, since in the dictDelete
we still have chance to trigger the resize (hit the force radio). The
reason why our test passed before is because the expire dict is still
in the rehashing process, so the dictDelete, the dictShrinkIfNeeded can
not trigger the resize.
2. In the case of failing the test, the expire dict finished the
rehashing,
so the last dictDelete, the dictShrinkIfNeeded trigger the dict resize
since it hit the force radio, so the skipping logic fail.
This PR add a new DEBUG command to disbale the dict resize.
We forgot to call quicklistSetOptions after createQuicklistObject,
in the sort store scenario, we will create a quicklist with default
fill or compress options.
This PR adds fill and depth parameters to createQuicklistObject to
specify that options need to be set after creating a quicklist.
This closes#12871.
release notes:
> Fix lists created by SORT STORE to respect list compression and
packing configs.
Fix two crash introducted by #12955
When a quicklist node can't be inserted and split, we eventually merge
the current node with its neighboring
nodes after inserting, and compress the current node and its siblings.
1. When the current node is merged with another node, the current node
may become invalid and can no longer be used.
Solution: let `_quicklistMergeNodes()` return the merged nodes.
3. If the current node is a LZF quicklist node, its recompress will be
1. If the split node can be merged with a sibling node to become head or
tail, recompress may cause the head and tail to be compressed, which is
not allowed.
Solution: always recompress to 0 after merging.
After fix for #13033, address sanitizer reports this heap-use-after-free
error. When the pubsubshard_channels dict becomes empty, we will delete
the dict, and the dictReleaseIterator will call dictResetIterator, it
will use the dict so we will trigger the error.
This PR introduced a new struct kvstoreDictIterator to wrap
dictIterator.
Replace the original dict iterator with the new kvstore dict iterator.
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: guybe7 <guy.benoish@redislabs.com>
The reason is the same as #13016. The reason is that in #12819,
in cron, in addition to trying to shrink, we will also tyring
to expand. The dict was expanded by cron before we trigger the
bgsave since we do have the enough keys (4096) to hit the radio.
Before the bgsave, we only add 4095 keys to avoid this issue.
Fix#12864
The main reason for this crash is that when replacing a element of a
quicklist packed node with lpReplace() method,
if the final size is larger than 4GB, lpReplace() will fail and returns
NULL, causing `node->entry` to be incorrectly set to NULL.
Since the inserted data is not a large element, we can't just replace it
like a large element, first quicklistInsertAfter()
and then quicklistDelIndex(), because the current node may be merged and
invalidated in quicklistInsertAfter().
The solution of this PR:
When replacing a node fails (listpack exceeds 4GB), split the current
node, create a new node to put in the middle, and try to merge them.
This is the same as inserting a large element.
In the worst case, its size will not exceed 4GB.
This commit addresses a problem in connSocketBlockingConnect where
different types of connection failures, including timeouts and other
errors, were not consistently handled. Previously, the function did not
return C_ERR immediately after detecting a connection failure, which
could lead to inconsistent states and misinterpretation of the
connection status.
With this update, connSocketBlockingConnect now correctly returns C_ERR
upon encountering any connection error, ensuring that all types of
connection failures are handled consistently and the behavior of the
function aligns with expected outcomes in case of connection issues.
Closes#12900
After #12822, when pubsubshard_channels became empty, kvstoreDictDelete
will delete the dict (which is the only one currently deleting dicts
that
become empty) and in the next loop, we will make an invalid call to
dictNext.
After the dict becomes empty, we break out of the loop without calling
dictNext.
Currently, once active defrag starts, we can not adjust
active_defrag_running
downwards. This is because active_defrag_running will be dynamically
compute
based on the fragmentation, we think we should not lower the effort when
the
fragmentation drops.
However, we need to note that active_defrag_running will also be
dynamically
computed based on configurations. In this case, we are not respecting
cycle-min
or cycle-max. Some people may realize halfway through that defrag
consumes a
lot and want to adjust it.
Previously we could only turn off activedefrag and then turn it on again
to
adjust active_defrag_running downwards. So in this PR, when a active
defrag
configuration change is made, we will re-compute it.
These configuration items are:
- active-defrag-cycle-min
- active-defrag-cycle-max
- active-defrag-threshold-upper
Since now a DB in cluster mode is divided into 16384 dicts, here
we directly check kvstoreDictSize instead of kvstoreSize, which
may have a higher probability that we can save the lookup.
The other change is a cleanup, obviously kvstoreGetHash should be
applied to the db->expires dicts.
When the dict is NULL, we also need to push resize_cursor, otherwise it
will keep doing useless continue here, and there is no way to resize the
other dict behind it.
Introduced in #12822.
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
# Description
Gather most of the scattered `redisDb`-related code from the per-slot
dict PR (#11695) and turn it to a new data structure, `kvstore`. i.e.
it's a class that represents an array of dictionaries.
# Motivation
The main motivation is code cleanliness, the idea of using an array of
dictionaries is very well-suited to becoming a self-contained data
structure.
This allowed cleaning some ugly code, among others: loops that run twice
on the main dict and expires dict, and duplicate code for allocating and
releasing this data structure.
# Notes
1. This PR reverts the part of https://github.com/redis/redis/pull/12848
where the `rehashing` list is global (handling rehashing `dict`s is
under the responsibility of `kvstore`, and should not be managed by the
server)
2. This PR also replaces the type of `server.pubsubshard_channels` from
`dict**` to `kvstore` (original PR:
https://github.com/redis/redis/pull/12804). After that was done,
server.pubsub_channels was also chosen to be a `kvstore` (with only one
`dict`, which seems odd) just to make the code cleaner by making it the
same type as `server.pubsubshard_channels`, see
`pubsubtype.serverPubSubChannels`
3. the keys and expires kvstores are currenlty configured to allocate
the individual dicts only when the first key is added (unlike before, in
which they allocated them in advance), but they won't release them when
the last key is deleted.
Worth mentioning that due to the recent change the reply of DEBUG
HTSTATS changed, in case no keys were ever added to the db.
before:
```
127.0.0.1:6379> DEBUG htstats 9
[Dictionary HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
[Expires HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
```
after:
```
127.0.0.1:6379> DEBUG htstats 9
[Dictionary HT]
[Expires HT]
```
When db->expires_cursor==0, it means the DB is done the scanning,
we should exit the loop to avoid the useless scanning.
It is easy to see the active expire timeout in the modified test,
for example, let's assume that there is only 1 expired key in the
DB, and the size / buckets ratio is less than 1%, which means that
we will skip it in isExpiryDictValidForSamplingCb, and the return
value of expires_cursor is 0.
Because `data.sampled == 0` is always true, so `repeat` is also
always true, we will keep scanning the DB, but every time it is
skipped by the previous judgment (expires_cursor = 0), until the
timelimit is finally exhausted.
The JSON file lacks the following structural API changes:
- GEORADIUSBYMEMBER: add the ANY option for COUNT since 6.2.0.
- GEORADIUSBYMEMBER_RO: add the ANY option for COUNT since 6.2.0.
- GEORADIUS_RO: Added support for uppercase unit names since 7.0.0.
- GEORADIUSBYMEMBER_RO: Added support for uppercase unit names since
7.0.0.
---------
Signed-off-by: daz-3ux <daz-3ux@proton.me>
Co-authored-by: bodong.ybd <bodong.ybd@alibaba-inc.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: yangpengda.333 <yangpengda.333@bytedance.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
Currently, We compute `db->avg_ttl` after each short `dbScan` sweep (a
few buckets without checking the time limit). But after each `dbScan`
sweep, we don't have much data and this makes the db->avg_ttl less
precise. For example, even if we scan the whole db, we can't get the
exact avg_ttl because we separate the data.
i.e. because of the running average, if we issue 16 calls to scan, we'll
give lower weight to the first one, and higher weight to the last one.
I think we should calculate `db->avg_ttl` until completing more of the
db iteration (judgement of time limit or the beginning of iterating next
db) because we have more sample data in this db and can get more
accurate result. In the best case, if we scan the whole db, we can get
the exact avg_ttl.
In this PR, we postpone the avg_ttl calculation until the judgement of
time limit or iteration of next db, so we can accumulate more data to
get more precise avg_ttl.
Note that we still need to make sure to decay the old TTLs at the same
speed as before, which is why we want to run the decay mechanism several
times, or use the Pow formula, see the comment in the code.
In my experiment, this PR can improve 89% or 52% accuracy in different
workload.
Co-authored-by: Oran Agra <oran@redislabs.com>
In Redis, rdb is produced in three scenarios mainly.
- backup, such as `bgsave` and `save` command
- full sync in replication
- aof rewrite if `aof-use-rdb-preamble` is yes
We also have some RDB flags to identify the purpose of rdb saving.
```C
/* flags on the purpose of rdb save or load */
#define RDBFLAGS_NONE 0 /* No special RDB loading. */
#define RDBFLAGS_AOF_PREAMBLE (1<<0) /* Load/save the RDB as AOF preamble. */
#define RDBFLAGS_REPLICATION (1<<1) /* Load/save for SYNC. */
```
But currently, it seems that these flags and purposes of rdb saving
don't exactly match. I find it in `rdbSaveRioWithEOFMark` which calls
`startSaving` with `RDBFLAGS_REPLICATION` but `rdbSaveRio` with
`RDBFLAGS_NONE`.
```C
int rdbSaveRioWithEOFMark(int req, rio *rdb, int *error, rdbSaveInfo *rsi) {
char eofmark[RDB_EOF_MARK_SIZE];
startSaving(RDBFLAGS_REPLICATION);
getRandomHexChars(eofmark,RDB_EOF_MARK_SIZE);
if (error) *error = 0;
if (rioWrite(rdb,"$EOF:",5) == 0) goto werr;
if (rioWrite(rdb,eofmark,RDB_EOF_MARK_SIZE) == 0) goto werr;
if (rioWrite(rdb,"\r\n",2) == 0) goto werr;
if (rdbSaveRio(req,rdb,error,RDBFLAGS_NONE,rsi) == C_ERR) goto werr;
if (rioWrite(rdb,eofmark,RDB_EOF_MARK_SIZE) == 0) goto werr;
stopSaving(1);
return C_OK;
werr: /* Write error. */
/* Set 'error' only if not already set by rdbSaveRio() call. */
if (error && *error == 0) *error = errno;
stopSaving(0);
return C_ERR;
}
```
In this PR, I refine the purpose of rdb saving with accurate flags.
Ci report this failure:
```
*** [err]: Don't rehash if used memory exceeds maxmemory after rehash in tests/unit/maxmemory.tcl
Expected '4098' to equal or match '4002'
WARNING: the new maxmemory value set via CONFIG SET (1176088) is smaller than the current memory usage (1231083)
```
It can be seen from the log that used_memory changed before we set
maxmemory.
The reason is that in #12819, in cron, in addition to trying to shrink,
we will
also tyring to expand. The dict was expanded by cron before we set
maxmemory,
causing the test to fail.
Before setting maxmemory, we only add 4095 keys to avoid triggering
resize.
When we use a timer to unblock a client in module, if the timer
period and the block timeout are very close, they will unblock the
client in the same event loop, and it will trigger the assertion.
The reason is that in moduleBlockedClientTimedOut we will protect
against re-processing, so we don't actually call updateStatsOnUnblock
(see #12817), so we are not able to reset the c->duration.
The reason is unblockClientOnTimeout() didn't realize that bc had
been unblocked. We add a function to the module to determine if bc
is blocked, and then use it in unblockClientOnTimeout() to exit.
There is the stack:
```
beforeSleep
blockedBeforeSleep
handleBlockedClientsTimeout
checkBlockedClientTimeout
unblockClientOnTimeout
unblockClient
resetClient
-- assertion, crash the server
'c->duration == 0' is not true
```
The block timeout is passed in the test case, but we do not pass
in the timeout_callback, and it will crash when unlocking. In this
case, in moduleBlockedClientTimedOut we will check timeout_callback.
There is the stack:
```
beforeSleep
blockedBeforeSleep
handleBlockedClientsTimeout
checkBlockedClientTimeout
unblockClientOnTimeout
replyToBlockedClientTimedOut
moduleBlockedClientTimedOut
-- timeout_callback is NULL, invalidFunctionWasCalled
bc->timeout_callback(&ctx,(void**)c->argv,c->argc);
```
Add a way to HSCAN a hash key, and get only the filed names.
Command syntax is now:
```
HSCAN key cursor [MATCH pattern] [COUNT count] [NOVALUES]
```
when `NOVALUES` is on, the command will only return keys in the hash.
---------
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Adds an ability to kill clients older than a specified age.
Also, fixed the age calculation in `catClientInfoString` to use
`commandTimeSnapshot`
instead of the old `server.unixtime`, and added missing documentation
for
`CLIENT KILL ID` to output of `CLIENT help`.
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
tests consistently fail on timeout (sleep that's too short).
it now takes more time because in #12819 we iterate on all dicts, not
just non-empty ones.
it passed the PR's CI because it skips the `slow` tag, which might have
been misplaced, but now it is probably required.
with the fix, the tests take quite a lot of time:
```
[ok]: Redis can trigger resizing (1860 ms)
[ok]: Redis can rewind and trigger smaller slot resizing (744 ms)
```
before #12819:
```
[ok]: Redis can trigger resizing (309 ms)
[ok]: Redis can rewind and trigger smaller slot resizing (295 ms)
```
failure:
https://github.com/redis/redis/actions/runs/7704158180/job/20995931735
```
*** [err]: expire scan should skip dictionaries with lot's of empty buckets in tests/unit/expire.tcl
scan didn't handle slot skipping logic.
*** [err]: Redis can trigger resizing in tests/unit/other.tcl
Expected '[Dictionary HT]
Hash table 0 stats (main hash table):
table size: 128
number of elements: 5
[Expires HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
' to match '*table size: 8*' (context: type eval line 29 cmd {assert_match "*table size: 8*" [r debug HTSTATS 0]} proc ::test)
*** [err]: Redis can rewind and trigger smaller slot resizing in tests/unit/other.tcl
Expected '[Dictionary HT]
Hash table 0 stats (main hash table):
table size: 256
number of elements: 10
[Expires HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
' to match '*table size: 16*' (context: type eval line 27 cmd {assert_match "*table size: 16*" [r debug HTSTATS 0]} proc ::test)
```
This was introduced in #13004, missing this assignment.
It causes timeout to be a random value (may be less than now),
and then in `Unblock by timer` test, the client is unblocked
and then it call timeout_callback, since the callback is NULL,
the server will crash.
The crash stack is:
```
beforesleep
handleBlockedClientsTimeout
checkBlockedClientTimeout
unblockClientOnTimeout
replyToBlockedClientTimedOut
moduleBlockedClientTimedOut
-- the timeout_callback is NULL, invalidFunctionWasCalled
bc->timeout_callback(&ctx,(void**)c->argv,c->argc);
```
This allows specifying the timeout value for opening the TCP
connection to a server. The timeout, default 0 means no limit,
depending on the OS. It can be specified using the new `-t` switch.
revive #3764, fixes#3763
---------
Co-authored-by: Itamar Haber <itamar@redislabs.com>
Co-authored-by: yoav-steinberg <yoav@redislabs.com>
In #11012, we will reprocess command when client is unblocked on keys,
in some blocking commands, for example, in the XREADGROUP BLOCK
scenario,
because of the re-processing command, we will recalculate the block
timeout,
causing the blocking time to be reset.
This commit add a new CLIENT_REPROCESSING_COMMAND clent flag, explicitly
let the command know that it is being re-processed, later in
blockForKeys
we will not reset the timeout.
Affected BLOCK cases:
- list / zset / stream, added test cases for each.
Unaffected cases:
- module (never re-process the commands).
- WAIT / WAITAOF (never re-process the commands).
Fixes#12998.
The function `tryResizeHashTables` only attempts to shrink the dicts
that has keys (change from #11695), this was a serious problem until the
change in #12850 since it meant if all keys are deleted, we won't shrink
the dick.
But still, both dictShrink and dictExpand may be blocked by a fork child
process, therefore, the cron job needs to perform both dictShrink and
dictExpand, for not just non-empty dicts, but all dicts in DBs.
What this PR does:
1. Try to resize all dicts in DBs (not just non-empty ones, as it was
since #12850)
2. handle both shrink and expand (not just shrink, as it was since
forever)
3. Refactor some APIs about dict resizing (get rid of `htNeedsShrink`
`htNeedsShrink` `dictShrinkToFit`, and expose `dictShrinkIfNeeded`
`dictExpandIfNeeded` which already contains all the code of those
functions we get rid of, to make APIs more neat)
4. In the `Don't rehash if redis has child process` test, now that cron
would do resizing, we no longer need to write to DB after the child
process got killed, and can wait for the cron to expand the hash table.
Modules may want to handle allocation failures gracefully. Adding
RM_TryCalloc() and RM_TryRealloc() for it.
RM_TryAlloc() was added before:
https://github.com/redis/redis/pull/10541
We have not limited the value of maxmemory-samples in the past, it can
be set very large. If it is set very large, we will have stack overflow
in evictionPoolPopulate when we trigger the key eviction.
There is no reason for this config to be set too high, so just limit its
range to [1,64].
#### Problem Statement:
For any read/update operation during rehashing, we're doing ~10+ random
DRAM lookups to do the rehashing, as we are using the `rehashidx` to
rehash 10 buckets, whose dict entries most likely aren't cached in the
CPU or near the bucket we are operating on. If these random bucket are
empty, the rehashing process during that command execution is skipped.
#### Implementation:
For reducing the performance recession while dict is rehashing, we
determine the index at which the key would be stored in the 0th HT, we
check if that index has already been rehashed, if not we will rehash the
bucket containing the key and the bucket will be moved from 0th HT to
the 1st HT.
If the key has already been rehashed, we perform the random access
bucket rehash (using `rehashidx`) and we again verify if rehashing is
still ongoing and look up the key in the respective HT.
This ensures rehashing is not skipped in any command call and that we
rehash a particular bucket or random bucket in each call.
#### Changes in this PR:
- Added a new method `dictBucketRehash` to perform rehash on a single
bucket.
- Helper function `moveKeysInBucketOldtoNew` for `dictRehash` and
`dictBucketRehash` to move all the keys in a bucket from the old to the
new hash HT.
- Helper function `verifyMoreRehashRequired` for `dictRehash` and
`dictBucketRehash` to check if we have already rehashed the whole table
and if more rehashing is required.
### Benchmark:
- This PR still shows **~13%** improvement in the latency during
rehashing.
- Rehashing is now **~2%** faster for this PR when compared to unstable.
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
Fix#9926 , and introduce an alternative method to prevent abuse of
transactions:
1. revert #5454 (which was blocking read-only transactions in OOM
state), and break the tie of MULTI state memory usage and the server OOM
state. Meaning that we'll limit the total memory a single client can
queue, and do that unconditionally regardless of the server being OOM or
not.
2. to prevent abuse of transactions, we use the
`client-query-buffer-limit` to restrict the size of the transaction.
Because the commands cached in the MULTI/EXEC queue have not been
executed yet, so they are also considered a part of the "query buffer"
in a broader sense. In other words, the commands in the MULTI queue and
the `querybuf` of the client together constitute the "query buffer".
When they exceed the limit, the connection will be disconnected.
The reasoning is that it's sensible to sends a single command with a
huge (1GB) argument, and it's sensible to sends a transaction with many
small commands, but it's probably not common to sends a long transaction
with many huge arguments (will consume a lot of memory before even being
executed).
If anyone runs into that, they can simply increase the
`client-query-buffer-limit` config.
P.S. To prevent DDoS attacks, unauthenticated clients have a separate
hard limit. Their query buffer should not exceed a maximum of 1MB. In
other words, if the query buffer of an unauthenticated client exceeds
1MB or the `client-query-buffer-limit` (if it is set to a value smaller
than 1MB,), the connection will be disconnected.
In the following case sender may be unknown, so we need to set up a
NULL check for sender:
```
/* If this is a MEET packet from an unknown node, we still process
* the gossip section here since we have to trust the sender because
* of the message type. */
if (!sender && type == CLUSTERMSG_TYPE_MEET)
clusterProcessGossipSection(hdr,link);
```
We have test cases for incr related commands with no key exist and
spaces in key and wrong type of key. However, we dont have test cases
covered for INCRBY INCRBYFLOAT DECRBY INCR DECR HINCRBY HINCRBYFLOAT
ZINCRBY with valid key and invalid value as argument, and float value to
incrby and decrby. So added test cases for the scenarios in incr.tcl.
Thank you!
In #11568 we removed the NOSCRIPT flag from commands, e.g. removing
NOSCRIPT flag from WAIT. Aiming to allow them in scripts and let them
implicitly behave in the non-blocking way.
This PR remove NOSCRIPT flag from WAITAOF just like WAIT (to be
symmetrical)).
And this PR also add BLOCKING flag for WAIT and WAITAOF.
This PR did some cleanups around function:
- drop the comment about Libraries Ctx, since we do have comment
in functionsLibCtx, no need to maintain multiple copies.
- remove outdated comment about the dropped Library description.
- remove unused desc and code vars in functionExtractLibMetaData.
- fix engines_nemory typo, changed it to engines_memory.
- remove outdated comment about FUNCTION CREATE and FUNCTION INFO,
FUNCTION CREATE was renamed to FUNCTION LOAD.
- Check in initServer whether the return of functionsInit is OK.
Currently slowlog gets disabled if slowlog-log-slower-than is set to less than zero. I think we should also disable it if slowlog-max-len is set to zero. We apply the same logic to acllog-max-len.
There have been occasional instances of memory corruption (though code bugs or bit flips) leading to invalid node information being gossiped around. To prevent this invalid information spreading, we verify the node IDs in received gossip are in an acceptable format, and disregard any gossiped nodes with invalid IDs. This PR uses the existing verifyClusterNodeId function to check the validity of the gossiped node IDs and if an invalid one is encountered, logs raw byte information to help debug the corruption.
---------
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
background: some modules need to know the `dbid` information, such as
the function used during RDB loading:
```
robj *rdbLoadObject(int rdbtype, rio *rdb, sds key, int dbid, int *error) {
....
moduleInitIOContext(io,mt,rdb,&keyobj,dbid);
```
However, during replication, the "tempDb" created for diskless RDB
loading is not correctly set with the dbid. This leads to passing the
wrong dbid to the `rdbLoadObject` function (as tempDb uses zcalloc, all
ids are 0).
```
disklessLoadInitTempDb()->rdbLoadRioWithLoadingCtx()->
/* Read value */
val = rdbLoadObject(type,rdb,key,db->id,&error);
```
To fix it, set the correct ID (relative index) for the tempdb.
In #12838, we misuse the safe iterator of the client dict, so we can't
catch the synchronous release of the client if there is a bug.
Since we realize that clients (even subscribers) are released with async
free, we change the safe iterators of the client dict into unsafe
iterators in `pubsub.c`. And I also remove redundant code.
Before this change (most recently modified in
https://github.com/redis/redis/pull/12850#discussion_r1421406393), The
trigger for normal expand threshold was 100% utilization and the trigger
for normal shrink threshold was 10% (HASHTABLE_MIN_FILL).
While during fork (DICT_RESIZE_AVOID), when we want to avoid rehash, the
trigger thresholds were multiplied by 5 (`dict_force_resize_ratio`),
meaning 500% for expand and 2% (100/10/5) for shrink.
However, in `dictRehash` (the incremental rehashing), the rehashing
threshold for shrinking during fork (DICT_RESIZE_AVOID) was 20% by
mistake.
This meant that if a shrinking is triggered when `dict_can_resize` is
`DICT_RESIZE_ENABLE` which the threshold is 10%, the rehashing can
continue when `dict_can_resize` is `DICT_RESIZE_AVOID`.
This would cause unwanted CopyOnWrite damage.
It'll make sense to change the thresholds of the rehash trigger and the
thresholds of the incremental rehashing the same, however, in one we
compare the size of the hash table to the number of records, and in the
other we compare the size of ht[0] to the size of ht[1], so the formula
is not exactly the same.
to make things easier we change all the thresholds to powers of 2, so
the normal shrinking threshold is changed from 100/10 (i.e. 10%) to
100/8 (i.e. 12.5%), and we change the threshold during forks from 5 to
4, i.e. from 500% to 400% for expand, and from 2% (100/10/5) to 3.125%
(100/8/4)
Fix#12785 and other race condition issues.
See the following isolated comments.
The following report was obtained using SANITIZER thread.
```sh
make SANITIZER=thread
./runtest-moduleapi --config io-threads 4 --config io-threads-do-reads yes --accurate
```
1. Fixed thread-safe issue in RM_UnblockClient()
Related discussion:
https://github.com/redis/redis/pull/12817#issuecomment-1831181220
* When blocking a client in a module using `RM_BlockClientOnKeys()` or
`RM_BlockClientOnKeysWithFlags()`
with a timeout_callback, calling RM_UnblockClient() in module threads
can lead to race conditions
in `updateStatsOnUnblock()`.
- Introduced:
Version: 6.2
PR: #7491
- Touch:
`server.stat_numcommands`, `cmd->latency_histogram`, `server.slowlog`,
and `server.latency_events`
- Harm Level: High
Potentially corrupts the memory data of `cmd->latency_histogram`,
`server.slowlog`, and `server.latency_events`
- Solution:
Differentiate whether the call to moduleBlockedClientTimedOut() comes
from the module or the main thread.
Since we can't know if RM_UnblockClient() comes from module threads, we
always assume it does and
let `updateStatsOnUnblock()` asynchronously update the unblock status.
* When error reply is called in timeout_callback(), ctx is not
thread-safe, eventually lead to race conditions in `afterErrorReply`.
- Introduced:
Version: 6.2
PR: #8217
- Touch
`server.stat_total_error_replies`, `server.errors`,
- Harm Level: High
Potentially corrupts the memory data of `server.errors`
- Solution:
Make the ctx in `timeout_callback()` with `REDISMODULE_CTX_THREAD_SAFE`,
and asynchronously reply errors to the client.
2. Made RM_Reply*() family API thread-safe
Related discussion:
https://github.com/redis/redis/pull/12817#discussion_r1408707239
Call chain: `RM_Reply*()` -> `_addReplyToBufferOrList()` -> touch
server.current_client
- Introduced:
Version: 7.2.0
PR: #12326
- Harm Level: None
Since the module fake client won't have the `CLIENT_PUSHING` flag, even
if we touch server.current_client,
we can still exit after `c->flags & CLIENT_PUSHING`.
- Solution
Checking `c->flags & CLIENT_PUSHING` earlier.
3. Made freeClient() thread-safe
Fix#12785
- Introduced:
Version: 4.0
Commit:
3fcf959e60
- Harm Level: Moderate
* Trigger assertion
It happens when the module thread calls freeClient while the io-thread
is in progress,
which just triggers an assertion, and doesn't make any race condiaions.
* Touch `server.current_client`, `server.stat_clients_type_memory`, and
`clientMemUsageBucket->clients`.
It happens between the main thread and the module threads, may cause
data corruption.
1. Error reset `server.current_client` to NULL, but theoretically this
won't happen,
because the module has already reset `server.current_client` to old
value before entering freeClient.
2. corrupts `clientMemUsageBucket->clients` in
updateClientMemUsageAndBucket().
3. Causes server.stat_clients_type_memory memory statistics to be
inaccurate.
- Solution:
* No longer counts memory usage on fake clients, to avoid updating
`server.stat_clients_type_memory` in freeClient.
* No longer resetting `server.current_client` in unlinkClient, because
the fake client won't be evicted or disconnected in the mid of the
process.
* Judgment assertion `io_threads_op == IO_THREADS_OP_IDLE` only if c is
not a fake client.
4. Fixed free client args without GIL
Related discussion:
https://github.com/redis/redis/pull/12817#discussion_r1408706695
When freeing retained strings in the module thread (refcount decr), or
using them in some way (refcount incr), we should do so while holding
the GIL,
otherwise, they might be simultaneously freed while the main thread is
processing the unblock client state.
- Introduced:
Version: 6.2.0
PR: #8141
- Harm Level: Low
Trigger assertion or double free or memory leak.
- Solution:
Documenting that module API users need to ensure any access to these
retained strings is done with the GIL locked
5. Fix adding fake client to server.clients_pending_write
It will incorrectly log the memory usage for the fake client.
Related discussion:
https://github.com/redis/redis/pull/12817#issuecomment-1851899163
- Introduced:
Version: 4.0
Commit:
9b01b64430
- Harm Level: None
Only result in NOP
- Solution:
* Don't add fake client into server.clients_pending_write
* Add c->conn assertion for updateClientMemUsageAndBucket() and
updateClientMemoryUsage() to avoid same
issue in the future.
So now it will be the responsibility of the caller of both of them to
avoid passing in fake client.
6. Fix calling RM_BlockedClientMeasureTimeStart() and
RM_BlockedClientMeasureTimeEnd() without GIL
- Introduced:
Version: 6.2
PR: #7491
- Harm Level: Low
Causes inaccuracies in command latency histogram and slow logs, but does
not corrupt memory.
- Solution:
Module API users, if know that non-thread-safe APIs will be used in
multi-threading, need to take responsibility for protecting them with
their own locks instead of the GIL, as using the GIL is too expensive.
### Other issue
1. RM_Yield is not thread-safe, fixed via #12905.
### Summarize
1. Fix thread-safe issues for `RM_UnblockClient()`, `freeClient()` and
`RM_Yield`, potentially preventing memory corruption, data disorder, or
assertion.
2. Updated docs and module test to clarify module API users'
responsibility for locking non-thread-safe APIs in multi-threading, such
as RM_BlockedClientMeasureTimeStart/End(), RM_FreeString(),
RM_RetainString(), and RM_HoldString().
### About backpot to 7.2
1. The implement of (1) is not too satisfying, would like to get more
eyes.
2. (2), (3) can be safely for backport
3. (4), (6) just modifying the module tests and updating the
documentation, no need for a backpot.
4. (5) is harmless, no need for a backpot.
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
When doing dict resizing, dictTypeResizeAllowed is used to judge whether
the new allocated memory for rehashing would cause OOM.
However when shrinking, we alloc `_dictNextExp(d->ht_used[0])` bytes of
memory, while in `dictTypeResizeAllowed` we still use
`_dictNextExp(d->ht_used[0]+1)` as the new allocated memory size. This
will overestimate the memory used by shrinking at special conditions,
causing a false OOM judgement.
We doing this in diskless on-empty-db mode, when diskless
loading fails, we will call emptyData to remove the half-loaded
data in case we started with an empty replica.
Now when a disk-based sync rdbLoad fails, we will call emptyData
too in case it loads partially incomplete data.
when the replica attempts another re-sync, it'll empty the dataset
again anyway, so this affects two things:
1. memory consumption in the time gap until the next rdb loading begins
2. if the unsynced replica is for some reason promoted, it would have kept
the partial dataset instead of being empty.
Before #12850, we will only try to shrink the dict in serverCron,
which we can control by using a child process, but now every time
we delete a key, the shrink check will be called.
In these test (added in #12802), we meant to disable the resizing,
but druing the delete, the dict will meet the force shrink, like
2 / 128 = 0.015 < 0.2, the delete will trigger a force resize and
will cause the test to fail.
In this commit, we try to keep the load factor at 3 / 128 = 0.023,
that is, do not meet the force shrink.
In the past we used integers to compare ratios, let us assume that
we have the following data in expanding:
```
used / size > 5
`80 / 16 > 5` is false
`81 / 16 > 5` is false
`95 / 16 > 5` is false
`96 / 16 > 5` is true
```
Because the integer result is rounded, our resize breaks the ratio
constraint, this has existed since the beginning, which resulted in
us not strictly following the ratio (shrink also has the same issue).
This PR change it to multiplication to avoid floating point
calculations.
The test have a race:
```
*** [err]: Redis can rewind and trigger smaller slot resizing in tests/unit/other.tcl
Expected '[Dictionary HT]
Hash table 0 stats (main hash table):
table size: 12
number of elements: 2
[Expires HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
' to match '*table size: 8*' (context: type eval line 12 cmd {assert_match "*table size: 8*" [r debug HTSTATS 0]} proc ::test)
```
When `r del "{alice}$j"` is executed in the loop, when the key is
deleted to [9, 12], the load factor has meet HASHTABLE_MIN_FILL,
if serverCron happens to trigger slot dict resize, then the test
will fail. Because there is not way to meet HASHTABLE_MIN_FILL in
the subsequent dels.
The solution is to avoid triggering the resize in advance. We can
use multi to delete them at once, or we can disable the resize.
Since we disabled resize in the previous test, the fix also uses
the method of disabling resize.
The test is introduced in #12802.
When we insert entries into dict, it may autonomously expand if needed.
However, when we delete entries from dict, it doesn't shrink to the
proper size. If there are few entries in a very large dict, it may cause
huge waste of memory and inefficiency when iterating.
The main keyspace dicts (keys and expires), are shrinked by cron
(`tryResizeHashTables` calls `htNeedsResize` and `dictResize`),
And some data structures such as zset and hash also do that (call
`htNeedsResize`) right after a loop of calls to `dictDelete`,
But many other dicts are completely missing that call (they can only
expand).
In this PR, we provide the ability to automatically shrink the dict when
deleting. The conditions triggering the shrinking is the same as
`htNeedsResize` used to have. i.e. we expand when we're over 100%
utilization, and shrink when we're below 10% utilization.
Additionally:
* Add `dictPauseAutoResize` so that flows that do mass deletions, will
only trigger shrinkage at the end.
* Rename `dictResize` to `dictShrinkToFit` (same logic as it used to
have, but better name describing it)
* Rename `_dictExpand` to `_dictResize` (same logic as it used to have,
but better name describing it)
related to discussion
https://github.com/redis/redis/pull/12819#discussion_r1409293878
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Regarding how to obtain the hash slot of a key, there is an optimization
in `getKeySlot()`, it is used to avoid redundant hash calculations for
keys: when the current client is in the process of executing a command,
it can directly use the slot of the current client because the slot to
access has already been calculated in advance in `processCommand()`.
However, scripts are a special case where, in default mode or with
`allow-cross-slot-keys` enabled, they are allowed to access keys beyond
the pre-declared range. This means that the keys they operate on may not
belong to the slot of the pre-declared keys. Currently, when the
commands in a script are executed, the slot of the original client
(i.e., the current client) is not correctly updated, leading to
subsequent access to the wrong slot.
This PR fixes the above issue. When checking the cluster constraints in
a script, the slot to be accessed by the current command is set for the
original client (i.e., the current client). This ensures that
`getKeySlot()` gets the correct slot cache.
Additionally, the following modifications are made:
1. The 'sort' and 'sort_ro' commands use `getKeySlot()` instead of
`c->slot` because the client could be an engine client in a script and
can lead to potential bug.
2. `getKeySlot()` is also used in pubsub to obtain the slot for the
channel, standardizing the way slots are retrieved.
Change the calculation method of bytes_per_key to make it closer to
the true average key size. The calculation method is as follows:
mh->bytes_per_key = mh->total_keys ? (mh->dataset / mh->total_keys) : 0;
The test was introduced in #10745, but we forgot to add it to the
test_helper.tcl, so our CI did not actually run it. This PR adds it
and ensures it passes CI tests.
#11766 introduced a bug in sdsResize where it could forget to update the
sds type in the sds header and then cause an overflow in sdsalloc. it
looks like the only implication of that is a possible assertion in HLL,
but it's hard to rule out possible heap corruption issues with
clientsCronResizeQueryBuffer
Fix a daily test failure because alpine doesn't support stack traces and
add in an extra assertion related to making sure the stack trace was
printed twice.
This is a follow-up PR for #12782, in which we introduced nested
preprocessor directives for TCP keep-alive on Solaris and added
redundant indentation for code. Besides, it could result in unreachable
code due to the lack of `#else` on the latest Solaris 11.4 where
`TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT` are available. As a
result, this PR does three main things:
- To eliminate the redundant indention for C code in nested preprocessor
directives
- To add `#else` directives and move `TCP_KEEPALIVE_THRESHOLD` +
`TCP_KEEPALIVE_ABORT_THRESHOLD` settings under it, avoid unreachable
code and compiler warnings when `#if defined(TCP_KEEPIDLE) &&
defined(TCP_KEEPINTVL) && defined(TCP_KEEPCNT)` is met on Solaris 11.4
- To remove a few trailing whitespace in comments
I'm testing the performance of Pub/Sub command recently. I find if many
clients unsubscribe or are killed simultaneously, Redis needs a long
time to deal with it.
In my experiment, I set 5000 clients and each client subscribes 100
channels. Then I call `client kill type pubsub` to simulate the
situation where clients unsubscribe all channels at the same time and
calculate the execution time. The result shows that it takes about 23s.
I use the _perf_ and find that `listSearchKey` in
`pubsubUnsubscribeChannel` costs more than 90% cpu time. I think we can
optimize this situation.
In this PR, I replace list with dict to track the clients subscribing
the channel more efficiently. It changes O(N) to O(1) in the search
phase. Then I repeat the experiment as above. The results are as
follows.
| | Execution Time(s) |used_memory(MB) |
| :---------------- | :------: | :----: |
| unstable(1bd0b54) | 23.734 | 65.41 |
| optimize-pubsub | 0.288 | 67.66 |
Thanks for #11595 , I use a no-value dict and the results shows that the
performance improves significantly but the memory usage only increases
slightly.
Notice:
- This PR will cause the performance degradation about 20% in
`[p|s]subscribe` command but won't freeze Redis.
In #10122, we set the destination key's flag of SINTERSTORE to `RW`,
however, this command doesn't actually read or modify the destination
key, just overwrites it.
Therefore, we change it to `OW` similarly to all other *STORE commands.
Crash reported in #12695. In the process of upgrading the cluster from
7.0 to 7.2, because the 7.0 nodes will not gossip shard id, in 7.2 we
will rely on shard id to build the server.cluster->shards dict.
In some cases, for example, the 7.0 master node and the 7.2 replica node.
From the view of 7.2 replica node, the cluster->shards dictionary does not
have its master node. In this case calling CLUSTER SHARDS on the 7.2 replica
node may crash.
We should fix the underlying assumption of updateShardId, which is that the
shard dict should be always in sync with the node's shard_id. The fix was
suggested by PingXie, see more details in #12695.
## Issues and solutions from #12817
1. Touch ProcessingEventsWhileBlocked and calling moduleCount() without
GIL in afterSleep()
- Introduced:
Version: 7.0.0
PR: #9963
- Harm Level: Very High
If the module thread calls `RM_Yield()` before the main thread enters
afterSleep(),
and modifies `ProcessingEventsWhileBlocked`(+1), it will cause the main
thread to not wait for GIL,
which can lead to all kinds of unforeseen problems, including memory
data corruption.
- Initial / Abandoned Solution:
* Added `__thread` specifier for ProcessingEventsWhileBlocked.
`ProcessingEventsWhileBlocked` is used to protect against nested event
processing, but event processing
in the main thread and module threads should be completely independent
and unaffected, so it is safer
to use TLS.
* Adding a cached module count to keep track of the current number of
modules, to avoid having to use `dictSize()`.
- Related Warnings:
```
WARNING: ThreadSanitizer: data race (pid=1136)
Write of size 4 at 0x0001045990c0 by thread T4 (mutexes: write M0):
#0 processEventsWhileBlocked networking.c:4135 (redis-server:arm64+0x10006d124)
#1 RM_Yield module.c:2410 (redis-server:arm64+0x10018b66c)
#2 bg_call_worker <null>:83232836 (blockedclient.so:arm64+0x16a8)
Previous read of size 4 at 0x0001045990c0 by main thread:
#0 afterSleep server.c:1861 (redis-server:arm64+0x100024f98)
#1 aeProcessEvents ae.c:408 (redis-server:arm64+0x10000fd64)
#2 aeMain ae.c:496 (redis-server:arm64+0x100010f0c)
#3 main server.c:7220 (redis-server:arm64+0x10003f38c)
```
2. aeApiPoll() is not thread-safe
When using RM_Yield to handle events in a module thread, if the main
thread has not yet
entered `afterSleep()`, both the module thread and the main thread may
touch `server.el` at the same time.
- Introduced:
Version: 7.0.0
PR: #9963
- Old / Abandoned Solution:
Adding a new mutex to protect timing between after beforeSleep() and
before afterSleep().
Defect: If the main thread enters the ae loop without any IO events, it
will wait until
the next timeout or until there is any event again, and the module
thread will
always hang until the main thread leaves the event loop.
- Related Warnings:
```
SUMMARY: ThreadSanitizer: data race ae_kqueue.c:55 in addEventMask
==================
==================
WARNING: ThreadSanitizer: data race (pid=14682)
Write of size 4 at 0x000100b54000 by thread T9 (mutexes: write M0):
#0 aeApiPoll ae_kqueue.c:175 (redis-server:arm64+0x100010588)
#1 aeProcessEvents ae.c:399 (redis-server:arm64+0x10000fb84)
#2 processEventsWhileBlocked networking.c:4138 (redis-server:arm64+0x10006d3c4)
#3 RM_Yield module.c:2410 (redis-server:arm64+0x10018b66c)
#4 bg_call_worker <null>:16042052 (blockedclient.so:arm64+0x169c)
Previous write of size 4 at 0x000100b54000 by main thread:
#0 aeApiPoll ae_kqueue.c:175 (redis-server:arm64+0x100010588)
#1 aeProcessEvents ae.c:399 (redis-server:arm64+0x10000fb84)
#2 aeMain ae.c:496 (redis-server:arm64+0x100010da8)
#3 main server.c:7238 (redis-server:arm64+0x10003f51c)
```
## The final fix as the comments:
https://github.com/redis/redis/pull/12817#discussion_r1436427232
Optimized solution based on the above comment:
First, we add `module_gil_acquring` to indicate whether the main thread
is currently in the acquiring GIL state.
When the module thread starts to yield, there are two possibilities(we
assume the caller keeps the GIL):
1. The main thread is in the mid of beforeSleep() and afterSleep(), that
is, `module_gil_acquring` is not 1 now.
At this point, the module thread will wake up the main thread through
the pipe and leave the yield,
waiting for the next yield when the main thread may already in the
acquiring GIL state.
2. The main thread is in the acquiring GIL state.
The module thread release the GIL, yielding CPU to give the main thread
an opportunity to start
event processing, and then acquire the GIL again until the main thread
releases it.
This is what
https://github.com/redis/redis/pull/12817#discussion_r1436427232
mentioned direction.
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
If there are nodes in the cluster that do not support shard-id, they
will gossip shard-id. From the perspective of nodes that support shard-id,
their shard-id is meaningless (since shard-id is randomly generated when
we create a node.)
Nodes that support shard-id will save the shard-id information in nodes.conf.
If the node is restarted according to nodes.conf, the server will report a
corrupted cluster config file error. Because auxShardIdSetter will reject
configurations with inconsistent master-replica shard-ids.
A cluster-wide consensus for the node's shard_id is not necessary. The key
is maintaining consistency of the shard_id on each individual 7.2 node.
As the cluster progressively upgrades to version 7.2, we can expect the
shard_ids across all nodes to naturally converge and align.
In this PR, when processing the gossip, if sender is a replica and does not
support shard-id, set the shard_id to the shard_id of its master.
Bumps
[cross-platform-actions/action](https://github.com/cross-platform-actions/action)
from 0.21.1 to 0.22.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/cross-platform-actions/action/releases">cross-platform-actions/action's
releases</a>.</em></p>
<blockquote>
<h2>Cross Platform Action 0.22.0</h2>
<h3>Added</h3>
<ul>
<li>
<p>Added support for using the action in multiple steps in the same job
(<a
href="https://redirect.github.com/cross-platform-actions/action/issues/26">#26</a>).
All the inputs need to be the same for all steps, except for the
following
inputs: <code>sync_files</code>, <code>shutdown_vm</code> and
<code>run</code>.</p>
</li>
<li>
<p>Added support for specifying that the VM should not shutdown after
the action
has run. This adds a new input parameter: <code>shutdown_vm</code>. When
set to <code>false</code>,
this will hopefully mitigate very frequent freezing of VM during
teardown (<a
href="https://redirect.github.com/cross-platform-actions/action/issues/61">#61</a>,
<a
href="https://redirect.github.com/cross-platform-actions/action/issues/72">#72</a>).</p>
</li>
</ul>
<h3>Changed</h3>
<ul>
<li>
<p>Always terminate VM instead of shutting down. This is more efficient
and this
will hopefully mitigate very frequent freezing of VM during teardown
(<a
href="https://redirect.github.com/cross-platform-actions/action/issues/61">#61</a>,
<a
href="https://redirect.github.com/cross-platform-actions/action/issues/72">#72</a>).</p>
</li>
<li>
<p>Use <code>unsafe</code> as the cache mode for QEMU disks. This should
improve performance (<a
href="https://redirect.github.com/cross-platform-actions/action/issues/67">#67</a>).</p>
</li>
</ul>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/cross-platform-actions/action/blob/master/changelog.md">cross-platform-actions/action's
changelog</a>.</em></p>
<blockquote>
<h2>[0.22.0] - 2023-12-27</h2>
<h3>Added</h3>
<ul>
<li>
<p>Added support for using the action in multiple steps in the same job
(<a
href="https://redirect.github.com/cross-platform-actions/action/issues/26">#26</a>).
All the inputs need to be the same for all steps, except for the
following
inputs: <code>sync_files</code>, <code>shutdown_vm</code> and
<code>run</code>.</p>
</li>
<li>
<p>Added support for specifying that the VM should not shutdown after
the action
has run. This adds a new input parameter: <code>shutdown_vm</code>. When
set to <code>false</code>,
this will hopefully mitigate very frequent freezing of VM during
teardown (<a
href="https://redirect.github.com/cross-platform-actions/action/issues/61">#61</a>,
<a
href="https://redirect.github.com/cross-platform-actions/action/issues/72">#72</a>).</p>
</li>
</ul>
<h3>Changed</h3>
<ul>
<li>
<p>Always terminate VM instead of shutting down. This is more efficient
and this
will hopefully mitigate very frequent freezing of VM during teardown
(<a
href="https://redirect.github.com/cross-platform-actions/action/issues/61">#61</a>,
<a
href="https://redirect.github.com/cross-platform-actions/action/issues/72">#72</a>).</p>
</li>
<li>
<p>Use <code>unsafe</code> as the cache mode for QEMU disks. This should
improve performance (<a
href="https://redirect.github.com/cross-platform-actions/action/issues/67">#67</a>).</p>
</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="5800fa0060"><code>5800fa0</code></a>
Release 0.22.0</li>
<li><a
href="20ad4b2ceb"><code>20ad4b2</code></a>
Fix <a
href="https://redirect.github.com/cross-platform-actions/action/issues/67">#67</a>:
Use <code>unsafe</code> as the cache mode disks</li>
<li><a
href="d9184930c3"><code>d918493</code></a>
Always terminate VM instead of shutting down.</li>
<li><a
href="626f1d6c95"><code>626f1d6</code></a>
Fix error when terminating the VM</li>
<li><a
href="d59f08dc5c"><code>d59f08d</code></a>
Print stack trace for uncaught exceptions</li>
<li><a
href="7f2fab9c56"><code>7f2fab9</code></a>
Revert "Run SSH in verbose mode when debug mode is
enabled"</li>
<li><a
href="0f566c356e"><code>0f566c3</code></a>
[no ci] Update the changelog</li>
<li><a
href="b7f77446bb"><code>b7f7744</code></a>
[no ci] Fix spelling</li>
<li><a
href="9894a9b118"><code>9894a9b</code></a>
Wrap <code>host</code> module in namespace</li>
<li><a
href="87fdd346a2"><code>87fdd34</code></a>
Fix broken test-vm-shutdown tests</li>
<li>Additional commits viewable in <a
href="https://github.com/cross-platform-actions/action/compare/v0.21.1...v0.22.0">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
This change is trying to make two failure modes a bit easier to deep dive:
1. If a serverPanic or serverAssert occurs during the info (or module)
printing, it will recursively panic, which is a lot of fun as it will
just keep recursively printing. It will eventually stack overflow, but
will generate a lot of text in the process.
2. When a segfault happens during the segfault handler, no information
is communicated other than it happened. This can be problematic because
`info` may help diagnose the real issue, but without fixing the
recursive crash it might be hard to get at that info.
There are situations (especially in TLS) in which the engine gets too occupied managing a large number of new connections. Existing connections may time-out while the server is processing the new connections initial TLS handshakes, which may cause cause new connections to be established, perpetuating the problem. To better manage the tradeoff between new connection rate and other workloads, this change adds a new config to manage maximum number of new connections per event loop cycle, instead of using a predetermined number (currently 1000).
This change introduces two new configurations, max-new-connections-per-cycle and max-new-tls-connections-per-cycle. The default value of the tcp connections is 10 per cycle and the default value of tls connections per cycle is 1.
---------
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
This PR is a supplement to #11144, moving `signalModifiedKey` in
`xaddCommand` after the trimming, to ensure the state of key eventual
consistency. Currently there is no problem with Redis, but it is better
to avoid issues in future development on Redis.
In case server.fsynced_reploff changed (e.g. flushAppendOnly set it to
server.master_repl_offset in case there was nothing to fsync) we want to
avoid sleeping before the next beforeSleep so we can call
blockedBeforeSleep ASAP.
without that, in case there's no incoming traffic, we could be waiting
for the next cron timer event to wake us up.
The code does not delete the corresponding node when traversing clients,
resulting in a loop, causing the dictDelete() == DICT_OK assertion to
fail.
In addition, did a cleanup, in the dictCreate scenario, we can avoid a
dictFind call since the dict is empty.
Issue was introduced in #12804.
When arity is -2, it allows us to input two nodes, but returns:
```
*** ERROR: Invalid configuration for cluster creation.
*** Redis Cluster requires at least 3 master nodes.
*** This is not possible with 2 nodes and 0 replicas per node.
*** At least 3 nodes are required.
```
When we input one node, it return:
```
[ERR] Wrong number of arguments for specified --cluster sub command
```
Strictly speaking -2 should also be rejected, because redis-cli
requires at least three nodes. However, the error message was not
very friendly, so decided to change it arity -1.
This closes#12891.
We have achieved replacing `slots_to_keys` radix tree with key->slot
linked list (#9356), and then replacing the list with slot specific
dictionaries for keys (#11695).
Shard channels behave just like keys in many ways, and we also need a
slots->channels mapping. Currently this is still done by using a radix
tree. So we should split `server.pubsubshard_channels` into 16384 dicts
and drop the radix tree, just like what we did to DBs.
Some benefits (basically the benefits of what we've done to DBs):
1. Optimize counting channels in a slot. This is currently used only in
removing channels in a slot. But this is potentially more useful:
sometimes we need to know how many channels there are in a specific slot
when doing slot migration. Counting is now implemented by traversing the
radix tree, and with this PR it will be as simple as calling `dictSize`,
from O(n) to O(1).
2. The radix tree in the cluster has been removed. The shard channel
names no longer require additional storage, which can save memory.
3. Potentially useful in slot migration, as shard channels are logically
split by slots, thus making it easier to migrate, remove or add as a
whole.
4. Avoid rehashing a big dict when there is a large number of channels.
Drawbacks:
1. Takes more memory than using radix tree when there are relatively few
shard channels.
What this PR does:
1. in cluster mode, split `server.pubsubshard_channels` into 16384
dicts, in standalone mode, still use only one dict.
2. drop the `slots_to_channels` radix tree.
3. to save memory (to solve the drawback above), all 16384 dicts are
created lazily, which means only when a channel is about to be inserted
to the dict will the dict be initialized, and when all channels are
deleted, the dict would delete itself.
5. use `server.shard_channel_count` to keep track of the number of all
shard channels.
---------
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
If fopen() is successful and redis_fstat determines that the file is 0
bytes, the file handle stored in fp will leak. This change closes the
filehandle stored in fp if the file is 0 bytes.
Second attempt at fixing Coverity 390029
This is a follow-up to #12796
Fix#12792
On ubuntu 23(lunar), non-root users will not be allowed to change the
oom_score_adj of a process to a value that is too low.
Since terminal's default oom_score_adj is 200, if we run the test on
terminal, we won't be able to set the oom_score_adj of the redis process
to 9 or 22, which is too low.
Reproduction on ubuntu 23(lunar) terminal:
```sh
$ cat /proc/`pgrep redis-server`/oom_score_adj
200
$ echo 100 > /proc/`pgrep redis-server`/oom_score_adj
# success without error
$ echo 99 > /proc/`pgrep redis-server`/oom_score_adj
echo: write error: Permission denied
```
As from the output above, we can only set the minimum oom score of redis
processes to 100.
By modifying the test, make oom_score_adj only increase upwards and not
decrease.
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
The implementation of aeProcessEvents seems have different behavior from
the top comment.
The implementation process file events first, then process time events.
When testing and debugging the cluster code before, you need
to stop the cluster after making changes, and then start the
cluster again. Add a restart option for ease of use.
Previous implementation would disconnect _all_ clients when running `ACL
LOAD`, which wasn't very useful.
This change brings the behavior in line with that of `ACL SETUSER`, `ACL
DELUSER`, in that only clients whose user is deleted or clients
subscribed to channels which they no longer have access to will be
disconnected.
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
This PR, we added -4 and -6 options to redis-cli to determine
IPV4 / IPV6 priority in DNS lookup.
This was mentioned in
https://github.com/redis/redis/pull/11151#issuecomment-1231570651
For now it's only used in CLUSTER MEET.
The options also made it possible to reliably test dns lookup in CI,
using this option, we can add some localhost tests for #11151.
The commit was cherry-picked from #11151, back then we decided to split
the PR.
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Let us see which version of redis this tool is part of.
Similarly to redis-cli, redis-benchmark and redis-check-rdb
redis-rdb-check and redis-aof-check are actually symlinks to redis,
so they will directly use getVersion in server, the format became:
```
{title} v={redis_version} sha={sha}:{dirty} malloc={malloc} bits={bits} build={build}
```
Move cliVersion into cli_common, redis-cli and redis-benchmark will
use it, and the format is not change:
```
{title} {redis_version} (git:{sha})
```
We dont have test for hgetall against key doesnot exist so added the
test in test suite and along with this, added wrong type cases for other
missing commands.
redis-cli avoids saving sensitive commands in it's history (doesn't
persist them to the history file).
this means that if you had a typo and you wanna re-run the command, you
can't easily do that.
This PR changes that to keep an in-memory history of all the redacted
commands, and just
not persist them to disk. This way we would be able to press the up
arrow and
re-try the command freely, and it'll just not survive a redis-cli
restart.
After #11695, we added two functions `rehashingStarted` and
`rehashingCompleted` to the dict structure. We also registered two
handlers for the main database's dict and expire structures. This allows
the main database to record the dict in `rehashing` list when rehashing
starts. Later, in `serverCron`, the `incrementallyRehash` function is
continuously called to perform the rehashing operation. However,
currently, when rehashing is completed, `rehashingCompleted` does not
remove the dict from the `rehashing` list. This results in the
`rehashing` list containing many invalid dicts. Although subsequent cron
checks and removes dicts that don't require rehashing, it is still
inefficient.
This PR implements the functionality to remove the dict from the
`rehashing` list in `rehashingCompleted`. This is achieved by adding
`metadata` to the dict structure, which keeps track of its position in
the `rehashing` list, allowing for quick removal. This approach avoids
storing duplicate dicts in the `rehashing` list.
Additionally, there are other modifications:
1. Whether in standalone or cluster mode, the dict in database is
inserted into the rehashing linked list when rehashing starts. This
eliminates the need to distinguish between standalone and cluster mode
in `incrementallyRehash`. The function only needs to focus on the dicts
in the `rehashing` list that require rehashing.
2. `rehashing` list is moved from per-database to Redis server level.
This decouples `incrementallyRehash` from the database ID, and in
standalone mode, there is no need to iterate over all databases,
avoiding unnecessary access to databases that do not require rehashing.
In the future, even if unsharded-cluster mode supports multiple
databases, there will be no risk involved.
3. The insertion and removal operations of dict structures in the
`rehashing` list are decoupled from `activerehashing` config.
`activerehashing` only controls whether `incrementallyRehash` is
executed in serverCron. There is no need for additional steps when
modifying the `activerehashing` switch, as in #12705.
The raxFind implementation uses a special pointer value (the address of
a static string) as the "not found" value. It works as long as actual
pointers were used. However we've seen usages where long long,
non-pointer values have been used. It creates a risk that one of the
long long value precisely is the address of the special "not found"
value. This commit changes raxFind to return 1 or 0 to indicate
elementhood, and take in a new void **value to optionally return the
associated value.
By extension, this also allow the RedisModule_DictSet/Replace operations
to also safely insert integers instead of just pointers.
The by/get options of sort/sort_ro command used to be forbidden in
cluster mode, since we are not sure which slot the pattern may be in.
As the optimization done in #12536, patterns now can be mapped to slots,
we should allow by/get options in cluster mode when the pattern maps to
the same slot as the key.
In #11489, we consider acl username to be sensitive information,
and consider the ACL GETUSER a sensitive command and remove it
from redis-cli historyfile.
This PR redact username information in ACL GETUSER and ACL DELUSER
from SLOWLOG, and also remove ACL DELUSER from redis-cli historyfile.
This PR also mark tls-key-file-pass and tls-client-key-file-pass
as sensitive config, will redact it from SLOWLOG and also
remove them from redis-cli historyfile.
In INFO CLIENTS section, we already have blocked_clients and
tracking_clients. We should add a new metric showing the number of
pubsub connections, which helps performance monitoring and trouble
shooting.
This is a follow-up fix to #12733. We need to apply the same changes to
delKeysInSlot. Refer to #12733 for more details.
This PR contains some other minor cleanups / improvements to the test
suite and docs.
It uses the postnotifications test module in a cluster mode test which
revealed a leak in the test module (fixed).
The change in dbSwapDatabases seems harmless. Because in non-clustered
mode, dbBuckets calculations are strictly accurate and in cluster mode,
we only have one DB. Modify it for uniformity (just like resize_cursor).
The change in swapMainDbWithTempDb is needed in case we swap with the
temp db, otherwise the overhead memory usage of db can be miscalculated.
In addition we will swap all fields (including rehashing list), just for
completeness (and reduce the chance of surprises in the future).
Introduced in #12697.
In the old dictRehashingInfo implementation, for the initialization
scenario,
it mistakenly directly set to_size to DICTHT_SIZE(DICT_HT_INITIAL_EXP),
which
is 4 in our code by default.
In scenarios where dictExpand directly passes the target size as
initialization,
the code will calculate bucket_count incorrectly. For example, in DEBUG
POPULATE
or RDB load scenarios, it will cause the final bucket_count to be
initialized to
65536 (16384 * 4), see:
```
before:
DB 0: 10000000 keys (0 volatile) in 65536 slots HT.
it should be:
DB 0: 10000000 keys (0 volatile) in 16777216 slots HT.
```
In PR, new ht will also be initialized before calling rehashingStarted
in
_dictExpand, so that the calls in dictRehashingInfo can be unified.
Bug was introduced in #12697.
when dbExpand is called from rdb.c with try_expand set to 0, it will
either panic panic on OOM, or be non-fatal (should not fail RDB loading)
At the same time, the log text has been slightly adjusted to make it
more unified.
If a dict is empty before `dictExpand`, we don't need to do rehashing
actually. We can just create a new dict and set it as ht_table[0] to
skip incremental rehashing and free the memory quickly.
When loading RDB on cluster nodes, it is necessary to consider the
scenario where a node is a replica.
For example, during a rolling upgrade, new version instances are often
mounted as replicas on old version instances. In this case, the full
synchronization legacy RDB does not contain slot information, and the
new version instance, acting as a replica, should be able to handle the
legacy RDB correctly for `dbExpand`.
Additionally, renaming `getMyClusterSlotCount` to `getMyShardSlotCount`
would be appropriate.
Introduced in #11695
The decrement time was replaced by access time in
583c31472577fb8175e17ee0ce243972f4dd8425.
The halved and doubled LFU_INIT_VAL logic has been changed in
06ca9d683920da19ad53532f8cd55b54584027bc.
Now we just decrement the counter by num_periods. This has been
previously fixed in redis.conf, #11108.
I've noticed that https://scan.coverity.com/projects/redis already
exists, but appears to be only updated on an ad-hoc basis. creating
[redis-unstable](https://scan.coverity.com/projects/redis-unstable?tab=project_settings)
project in coverity for this CI.
This PR adds a GitHub Action-based CI job to create a new Coverity build
once daily, so that there is always a recent scan available.
This is within the limit, as Redis is ~150K LOC and per
https://scan.coverity.com/faq#frequency :
> Up to 21 builds per week, with a maximum of 3 builds per day, for
projects with 100K to 500K lines of code
Before this is merged in, two new secrets will need to be created:
COVERITY_SCAN_EMAIL with the email address used for accessing Coverity
COVERITY_SCAN_TOKEN with the Project token from
https://scan.coverity.com/projects/redis-unstable?tab=project_settings
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
Currently, during RDB loading, once a `dbExpand` is performed, the
`should_expand_db` flag is set to 0. This causes the remaining DBs
unable to do `dbExpand` when there are multiple DBs.
To fix this issue, we need to set `should_expand_db` back to 1 whenever
we encounter `RDB_OPCODE_RESIZEDB`. This ensures that each DB can
perform `dbExpand` correctly.
Additionally, the initial value of `should_expand_db` should also be set
to 0 to prevent invalid `dbExpand` in older versions of RDB where
`RDB_OPCODE_RESIZEDB` is not present.
problem introduced in #11695
Additional optimizations for the eviction logic in #11695:
To make the eviction logic clearer and decouple the number of sampled
keys from the running mode (cluster or standalone).
* When sampling in each database, we only care about the number of keys
in the current database (not the dicts we sampled from).
* If there are a insufficient number of keys in the current database
(e.g. 10 times the value of `maxmemory_samples`), we can break out
sooner (to avoid looping on a sparse database).
* We'll never try to sample the db dicts more times than the number of
non-empty dicts in the db (max 1 in non-cluster mode).
And it also ensures that each database has a sufficient amount of
sampled keys, so even if unsharded-cluster supports multiple databases,
there won't be any issues.
other changes:
1. keep track of the number of non-empty dicts in each database.
2. move key_count tracking into cumulativeKeyCountAdd rather than all
it's callers
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
in #12536 we made a similar optimization for SCAN, now that hashtags in
patterns. When we can make sure all keys matching the pettern will be in
the same slot, we can limit the iteration to run only one one.
`make test` will unnecessary rebuild `redis-server` if tests are modified because mkheader will touch releae.c.
This changes scopes down when we re-generate release.c to only when actual source files are modified.
It was first added to load lua from RDB, see 28dfdca7. After #9812,
we no longer save lua in RDB. luaCreateFunction will only be called
in script load and eval*, both of which are available in the client.
It could be that that some day we'll still want to load scripts from
somewhere that's not a client. This fix is in dead code.
Warning:
```
postnotifications.c:216:77: warning: format specifies type 'long' but the argument has type 'uint64_t' (aka 'unsigned long long') [-Wformat]
RedisModule_Log(ctx, "warning", "Got an unexpected subevent '%ld'", subevent);
~~~ ^~~~~~~~
%llu
```
CI:
https://github.com/redis/redis/actions/runs/6937308713/job/18871124342#step:6:115
## Other
Add `CFLAGS=-Werror` flag for module CI.
---------
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This is just a cleanup, although they are both correct, the change
is normatively better, and addReplyError is also much faster.
Although not important, speed is not important for these error cases.
In the past, we did not call _dictNextExp frequently. It was only
called when the dictionary was expanded.
Later, dictTypeExpandAllowed was introduced in #7954, which is 6.2.
For the data dict and the expire dict, we can check maxmemory before
actually expanding the dict. This is a good optimization to avoid
maxmemory being exceeded due to the dict expansion.
And in #11692, we moved the dictTypeExpandAllowed check before the
threshold check, this caused a bit of performance degradation, every
time a key is added to the dict, dictTypeExpandAllowed is called to
check.
The main reason for degradation is that in a large dict, we need to
call _dictNextExp frequently, that is, every time we add a key, we
need to call _dictNextExp once. Then the threshold is checked to see
if the dict needs to be expanded. We can see that the order of checks
here can be optimized.
So we moved the dictTypeExpandAllowed check back to after the threshold
check in #12789. In this way, before the dict is actually expanded (that
is, before the threshold is reached), we will not do anything extra
compared to before, that is, we will not call _dictNextExp frequently.
But note we'll still hit the degradation when we over the thresholds.
When the threshold is reached, because #7954, we may delay the dict
expansion due to maxmemory limitations. In this case, we will call
_dictNextExp every time we add a key during this period.
This PR use CLZ in _dictNextExp to get the next power of two. CLZ (count
leading zeros) can easily give you the next power of two. It should be
noted that we have actually introduced the use of __builtin_clzl in
#8687,
which is 7.0. So i suppose all the platforms we use have it (even if the
CPU doesn't have an instruction).
We build 67108864 (2**26) keys through DEBUG POPULTE, which will use
approximately 5.49G memory (used_memory:5898522936). If expansion is
triggered, the additional hash table will consume approximately 1G
memory (2 ** 27 * 8). So we set maxmemory to 6871947673 (that is, 6.4G),
which will be less than 5.49G + 1G, so we will delay the dict rehash
while addint the keys.
After that, each time an element is added to the dict, an allow check
will be performed, that is, we can frequently call _dictNextExp to test
the comparison before and after the optimization. Using DEBUG HTSTATS 0
to
check and make sure that our dict expansion is dealyed.
Using `./src/redis-server redis.conf --save "" --maxmemory 6871947673`.
Using `./src/redis-benchmark -P 100 -r 1000000000 -t set -n 5000000`.
After ten rounds of testing:
```
unstable: this PR:
769585.94 816860.00
771724.00 818196.69
775674.81 822368.44
781983.12 822503.69
783576.25 828088.75
784190.75 828637.75
791389.69 829875.50
794659.94 835660.69
798212.00 830013.25
801153.62 833934.56
```
We can see there is about 4-5% performance improvement in this case.
The following four configurations are renamed to align with Redis style:
1. server_cpulist renamed to server-cpulist
2. bio_cpulist renamed to bio-cpulist
3. aof_rewrite_cpulist renamed to aof-rewrite-cpulist
4. bgsave_cpulist renamed to bgsave-cpulist
The original names are retained as aliases to ensure compatibility with
old configuration files. We recommend users to gradually transition to
using the new configuration names to maintain consistency in style.
Introduced in #11695 .
The tryResizeHashTables function gets stuck on the last non-empty slot
while iterating through dictionaries. It does not restart from the
beginning. The reason for this issue is a problem with the usage of
dbIteratorNextDict:
/* Returns next dictionary from the iterator, or NULL if iteration is complete. */
dict *dbIteratorNextDict(dbIterator *dbit) {
if (dbit->next_slot == -1) return NULL;
dbit->slot = dbit->next_slot;
dbit->next_slot = dbGetNextNonEmptySlot(dbit->db, dbit->slot, dbit->keyType);
return dbGetDictFromIterator(dbit);
}
When iterating to the last non-empty slot, next_slot is set to -1,
causing it to loop indefinitely on that slot. We need to modify the code
to ensure that after iterating to the last non-empty slot, it returns to
the first non-empty slot.
BTW, function tryResizeHashTables is actually iterating over slots
that have keys. However, in its implementation, it leverages the
dbIterator (which is a key iterator) to obtain slot and dictionary
information. While this approach works fine, but it is not very
intuitive. This PR also improves readability by changing the iteration
to directly iterate over slots, thereby enhancing clarity.
The current comment for `findSlotByKeyIndex` is a bit ambiguous and can
be misleading, as it may be misunderstood as getting the next slot
corresponding to target.
When we register notification or server event in RedisModule_OnLoad, but
RedisModule_OnLoad eventually fails, triggering notification or server
event
will cause the server to crash.
If the loading fails on a later stage of moduleLoad, we do call
moduleUnload
which handles all un-registration, but when it fails on the
RedisModule_OnLoad
call, we only un-register several specific things and these were
missing:
- moduleUnsubscribeNotifications
- moduleUnregisterFilters
- moduleUnsubscribeAllServerEvents
Refactored the code to reuse the code from moduleUnload.
Fixes#12808.
Currently, when the number of databases exceeds 16,
the efficiency of cleaning expired keys is relatively low.
The reason is that by default only 16 databases are scanned when
attempting to clean expired keys (CRON_DBS_PER_CALL is 16). But users
may set databases higher than 16, such as 256, but it does not
necessarily mean that all 256 databases have expiration time set. If
only one database has expiration time set, this database needs 16
activeExpireCycle rounds in order to be scanned once, and 15 of those
rounds are meaningless.
To optimize the efficiency of expiration in such scenarios, we use dbs_per_call
to control the number of databases with expired keys being scanned.
Additionally, add a condition to limit the maximum number of rounds
to server.dbnum to prevent excessive spinning. This ensures that even if
only one database has expired keys, it can be triggered within a single cron job.
Fix `signalModifiedKey()` order, call it after key modification
completed, to ensure the state of key eventual consistency.
When a key is modified, Redis calls `signalModifiedKey` to notify other
systems, such as the watch system of transactions and the tracking
system of client side caching. However, in some commands, the
`signalModifiedKey` call happens during the key modification process
instead of after the key modification is completed. This can potentially
cause issues, as systems relying on `signalModifiedKey` may receive the
"write in flight" status of the key rather than its final state.
These commands include:
1. PFADD
2. LSET, LMOVE, LREM
3. ZPOPMIN, ZPOPMAX, BZPOPMIN, BZPOPMAX, ZMPOP, BZMPOP
Currently there is no problem with Redis, but it is better to adjust the
order of `signalModifiedKey()`, to avoid issues in future development on
Redis.
---------
Co-authored-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
see discussion from after https://github.com/redis/redis/pull/12453 was
merged
----
This PR replaces signals that are not considered async-signal-safe
(AS-safe) with safe calls.
#### **1. serverLog() and serverLogFromHandler()**
`serverLog` uses unsafe calls. It was decided that we will **avoid**
`serverLog` calls by the signal handlers when:
* The signal is not fatal, such as SIGALRM. In these cases, we prefer
using `serverLogFromHandler` which is the safe version of `serverLog`.
Note they have different prompts:
`serverLog`: `62220:M 26 Oct 2023 14:39:04.526 # <msg>`
`serverLogFromHandler`: `62220:signal-handler (1698331136) <msg>`
* The code was added recently. Calls to `serverLog` by the signal
handler have been there ever since Redis exists and it hasn't caused
problems so far. To avoid regression, from now we should use
`serverLogFromHandler`
#### **2. `snprintf` `fgets` and `strtoul`(base = 16) -------->
`_safe_snprintf`, `fgets_async_signal_safe`, `string_to_hex`**
The safe version of `snprintf` was taken from
[here](8cfc4ca5e7/src/mc_util.c (L754))
#### **3. fopen(), fgets(), fclose() --------> open(), read(), close()**
#### **4. opendir(), readdir(), closedir() --------> open(),
syscall(SYS_getdents64), close()**
#### **5. Threads_mngr sync mechanisms**
* waiting for the thread to generate stack trace: semaphore -------->
busy-wait
* `globals_rw_lock` was removed: as we are not using malloc and the
semaphore anymore we don't need to protect `ThreadsManager_cleanups`.
#### **6. Stacktraces buffer**
The initial problem was that we were not able to safely call malloc
within the signal handler.
To solve that we created a buffer on the stack of `writeStacktraces` and
saved it in a global pointer, assuming that under normal circumstances,
the function `writeStacktraces` would complete before any thread
attempted to write to it. However, **if threads lag behind, they might
access this global pointer after it no longer belongs to the
`writeStacktraces` stack, potentially corrupting memory.**
To address this, various solutions were discussed
[here](https://github.com/redis/redis/pull/12658#discussion_r1390442896)
Eventually, we decided to **create a pipe** at server startup that will
remain valid as long as the process is alive.
We chose this solution due to its minimal memory usage, and since
`write()` and `read()` are atomic operations. It ensures that stack
traces from different threads won't mix.
**The stacktraces collection process is now as follows:**
* Cleaning the pipe to eliminate writes of late threads from previous
runs.
* Each thread writes to the pipe its stacktrace
* Waiting for all the threads to mark completion or until a timeout (2
sec) is reached
* Reading from the pipe to print the stacktraces.
#### **7. Changes that were considered and eventually were dropped**
* replace watchdog timer with a POSIX timer:
according to [settimer man](https://linux.die.net/man/2/setitimer)
> POSIX.1-2008 marks getitimer() and setitimer() obsolete, recommending
the use of the POSIX timers API
([timer_gettime](https://linux.die.net/man/2/timer_gettime)(2),
[timer_settime](https://linux.die.net/man/2/timer_settime)(2), etc.)
instead.
However, although it is supposed to conform to POSIX std, POSIX timers
API is not supported on Mac.
You can take a look here at the Linux implementation:
[here](c7562ee135)
To avoid messing up the code, and uncertainty regarding compatibility,
it was decided to drop it for now.
* avoid using sds (uses malloc) in logConfigDebugInfo
It was considered to print config info instead of using sds, however
apparently, `logConfigDebugInfo` does more than just print the sds, so
it was decided this fix is out of this issue scope.
#### **8. fix Signal mask check**
The check `signum & sig_mask` intended to indicate whether the signal is
blocked by the thread was incorrect. Actually, the bit position in the
signal mask corresponds to the signal number. We fixed this by changing
the condition to: `sig_mask & (1L << (sig_num - 1))`
#### **9. Unrelated changes**
both `fork.tcl `and `util.tcl` implemented a function called
`count_log_message` expecting different parameters. This caused
confusion when trying to run daily tests with additional test parameters
to run a specific test.
The `count_log_message` in `fork.tcl` was removed and the calls were
replaced with calls to `count_log_message` located in `util.tcl`
---------
Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
`open()` can return any non-negative integer on success, including zero.
This change modifies the check from open()'s return value to also
include checking for a return value of zero (e.g., if stdin were closed
and then `open()` was called).
Fixes Coverity 404720
Can't happen in Redis. just a cleanup.
If fopen() is successful, but redis_fstat() determines the file is zero
bytes, the file handle stored in fp will leak. This change closes the
filehandle stored in fp if the file is zero bytes.
An FD leak on a tool like redis-check-aof isn't an issue (it'll exit soon anyway).
This is just a cleanup.
If fopen() is successful, but redis_fstat() fails, the file handle
stored in fp will leak. This change closes the filehandle stored in fp
if redis_fstat() fails.
Fixes Coverity 390029
If a recursive call to dirRemove() returns -1, dirRemove() the directory
handle stored in dir will leak. This change closes the directory handle
stored in dir even if the recursive call to dirRemove() returns -1.
Fixed Coverity 371073
dictExpandAllowed (for the main db dict and the expire dict) seems to
involve a few function calls and memory accesses, and we can do it only
after the thresholds checks and can get some performance improvements.
A simple benchmark test: there are 11032768 fixed keys in the database,
start a redis-server with `--maxmemory big_number --save ""`,
start a redis-benchmark with `-P 100 -r 1000000000 -t set -n 5000000`,
collect `throughput summary: n requests per second` result.
After five rounds of testing:
```
unstable this PR
848032.56 897988.56
854408.69 913408.88
858663.94 914076.81
871839.56 916758.31
882612.56 920640.75
```
We can see a 5% performance improvement in general condition.
But note we'll still hit the degradation when we over the thresholds.
A followup PR for #12742
Add some brief comments explaining the purpose of the file to the head
of cluster_legacy.c and cluster.c.
Add copyright notice to cluster.c
Signed-off-by: Josh Hershberg <yehoshua@redis.com>
Co-authored-by: Josh Hershberg <yehoshua@redis.com>
This PR reworks the clustering code to support multiple clustering
implementations, specifically, the current "legacy" clustering
implementation or, although not part of this PR, flotilla (see #10875).
Which implementation is used could be a compile-time flag (will be added
later). Legacy clustering functionality remains unchanged.
The basic idea is as follows. The header cluster.h now contains function
declarations that define the "Cluster API." These are the contract and
interface between any clustering implementation and the rest of the
Redis source code. Some of the function definitions are shared between
all clustering implementations. These functions are in cluster.c. The
functions and data structures specific to legacy clustering are in
cluster-legacy.c/h. One consequence of this is that the structs
clusterNode and clusterState which were previously "public" to the rest
of Redis are now hidden behind the Cluster API.
The PR is divided up into commits, each with a commit message explaining
the changes. some are just mass rename or moving code between files (may
not require close inspection / review), others are manual changes.
One other, related change is:
- The "failover" command is now plugged into the Cluster API so that the
clustering implementation can (a) enable/disable the command to begin
with and if enabled (b) perform the actual failover. The "failover"
command remains disabled for legacy clustering.
The failover command is up until now not supported
in cluster mode. This commit allows a cluster
implementation to support the command. The legacy
clustering implementation still does not support
this command.
Signed-off-by: Josh Hershberg <yehoshua@redis.com>
Move primary functions used to implement datapath
clustering into cluster.c, making them shared. This
required adding "accessor" and other functions to
abstract access to node details and cluster state.
Signed-off-by: Josh Hershberg <yehoshua@redis.com>
Divide up clusterCommand into clusterCommand for shared
sub-commands and clusterCommandSpecial for implementation
specific sub-commands. So to, the cluster command help
sub-command has been divided into two implementations,
clusterCommandHelp and clusterCommandHelpSpecial. Some
common sub-subcommand implementations have been extracted
and their implemenations either made shared or else
implementation specific.
Signed-off-by: Josh Hershberg <yehoshua@redis.com>
Move (but do not change) some items from cluster_legacy.c
back info cluster.c. These items are shared code that all
clustering implementations will use.
Signed-off-by: Josh Hershberg <yehoshua@redis.com>
More declerations can be moved into cluster_legacy.h
as they are not requied for the cluster api. The code
was simply moved, not changed in any way.
Signed-off-by: Josh Hershberg <yehoshua@redis.com>
Move clusterNode into cluster_legacy.h.
In order to achieve this some accessor methods
were added and also a refactor of how debugCommand
handles cluster related subcommands.
Signed-off-by: Josh Hershberg <yehoshua@redis.com>
Move clusterState into cluster_legacy.h. In order to achieve
this some "accessor" methods needed to be added to the
cluster API and some other minor refactors.
Signed-off-by: Josh Hershberg <yehoshua@redis.com>
When we are loading data, it is not safe to generate data
through DEBUG POPULATE. POPULATE may generate keys, causing
panic when loading data with duplicate keys.
Move some declerations from cluster.h to cluster_legacy.h.
The items moved are specific to the legacy clustering
implementation and DO NOT require any other refactoring
other than moving them from one file to another.
Signed-off-by: Josh Hershberg <yehoshua@redis.com>
Make redis-cli --bigkeys and --memkeys usable on a replicas in cluster
mode, by sending the READONLY command. This is only if -c is also given.
We don't detect if a node is a master or a replica so we send READONLY
in both cases. The READONLY has no effect on masters.
Release notes:
Make redis-cli --bigkeys and --memkeys usable on cluster replicas
---------
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Oran Agra <oran@redislabs.com>
Add documentation of the URI format in the `--help` output of
`redis-cli` and `redis-benchmark`.
In particular, it's good for users to know that they need to specify
"default" as the username when authenticating without a username. Other
details of the URI format are described too, like scheme and dbnum.
It used to be possible to connect to Redis using an URL with an empty
username, like `redis-cli -u redis://:PASSWORD@localhost:6379/0`. This
was broken in 6.2 (#8048), and there was a discussion about it #9186.
Now, users need to specify "default" as the username and it's better to
document it.
Refer to #12746 for more details.
We have some test cases of swapdb with watchkey but missing seperate
basic swapdb test cases, unhappy path and flushdb after swapdb. So added
the test cases in keyspace.tcl.
CI reports that this test failed, the reason is because during
the command processing, the node processed PING/PONG, resulting
in ping_sent or pong_received mismatch.
Change to use MULTI to avoid timing issue. The test was introduced
in #12224.
This is currently harmless, since we have already cleared the dict
before, it will reset the rehashidx to -1, and in incrementallyRehash
we will call dictIsRehashing to check.
It would be nice to empty the list to avoid meaningless attempts, and
the code is also unified to reduce misunderstandings.
When using DB iterator, it will use dictInitSafeIterator to init a old safe
dict iterator. When dbIteratorNext is used, it will jump to the next slot db
dict when we are done a dict. During this process, we do not have any calls to
dictResumeRehashing, which causes the dict's pauserehash to always be > 0.
And at last, it will be returned directly in dictRehashMilliseconds, which causes
us to have slot dict in a state where rehash cannot be completed.
In the "expire scan should skip dictionaries with lot's of empty buckets" test,
adding a `keys *` can reproduce the problem stably. `keys *` will call dbIteratorNext
to trigger a traversal of all slot dicts.
Added dbReleaseIterator and dbIteratorInitNextSafeIterator methods to call dictResetIterator.
Issue was introduced in #11695.
This PR introduces a new macro, serverAssertWithInfoDebug, to do complex assertions only for debugging. The main intention is to allow running complex operations during tests without impacting runtime performance. This assertion is enabled when setting DEBUG_ASSERTIONS.
The DEBUG_ASSERTIONS flag is set for the daily and CI variants of `test-sanitizer-address`.
Test recently added fails on timeout in valgrind in GH actions.
Locally with valgrind the test finishes within 1.5 sec(s). Couldn't find
any issue due to lack of reproducibility. Increasing the timeout and
adding an additional log to the test to understand how many keys
were left at the end.
Redis 7.2 (#9406) introduced a new modules event, `RedisModuleEvent_Key`.
This new event allows the module to read the key data just before it is removed
from the database (either deleted, expired, evicted, or overwritten).
When the key is removed from the database, either by active expire or eviction.
The new event was not called as part of an execution unit. This can cause an
issue if the module registers a post notification job inside the event. This job will
not be executed atomically with the expiration/eviction operation and will not
replicated inside a Multi/Exec. Moreover, the post notification job will be executed
right after the event where it is still not safe to perform any write operation, this will
violate the promise that post notification job will be called atomically with the
operation that triggered it and **only when it is safe to write**.
This PR fixes the issue by wrapping each expiration/eviction of a key with an execution
unit. This makes sure the entire operation will run atomically and all the post notification
jobs will be executed at the end where it is safe to write.
Tests were modified to verify the fix.
This change overcomes many stability issues experienced with the
vmactions action.
We need to limit VMs to 8GB for better stability, as the 13GB default
seems to hang them occasionally.
Shell code has been simplified since this action seem to use `bash -e`
which will abort on non-zero exit codes anyway.
changes the `gettimeofday` caller, by removing an unused optional output argument.
It would take 2 benefits:
- simplify code, discard unnecessary arg.
- possibly faster due to the implementation in kernel.
Clarify the errors related to the cluster mode in the script, return the command that encountered an execution error along with the specific error message.
---------
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Currently, we do not write the following sensitive commands into the ~/.rediscli_history file:
ACL SETUSER username [rule [rule ...]]
AUTH [username] password
HELLO [AUTH username password]
MIGRATE host port <key | ""> destination-db timeout [[AUTH password | AUTH2 username password]]
CONFIG SET masterauth master-password
CONFIG SET masteruser username
CONFIG SET requirepass foobared
However, we still write the following sensitive commands into the ~/.rediscli_history file:
ACL GETUSER username
Sentinel CONFIG set sentinel-pass password
Sentinel CONFIG set sentinel-user username
Sentinel set mastername auth-pass password
Sentinel set mastername auth-user username
This change adds the commands of the second list to be skipped from being written to the history file.
Reverts the skipping defrag tests in cluster mode (done in #12672.
instead it skips only some defrag tests that are relevant for cluster modes.
The test now run well after investigating and making the changes in #12674 and #12694.
Co-authored-by: Oran Agra <oran@redislabs.com>
Optimize the performance of SCAN commands when a match pattern can only contain keys from a
single slot in cluster mode. This can happen when the pattern contains a hash tag before any
wildcard matchers or when the key contains no matchers.
In #12476 server.stat_client_qbuf_limit_disconnections was added. It is written in readQueryFromClient, which may be called by multiple threads when io-threads and io-threads-do-reads are turned on. Somehow we missed to make it an atomic variable.
when find a key ,if redis is rehashing, currently we should lookup both tables (ht[0] and ht[1]).
if we use the key's index comparing to the rehashidx,if index < rehashidx,then we can conclude:
1. it is rehashing(rehashidx is -1 if it is not rehashing)
2. we can't find key in ht[0] so just continue to find key in ht[1]
The possible performance gain here, is not the looping over the linked list (which is empty),
but rather the lookup in the table (which could be a cache miss).
---------
Co-authored-by: zhangshihua003 <zhangshihua003@ke.com>
Co-authored-by: sundb <sundbcn@gmail.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: judeng <abc3844@126.com>
This PR optimizes the time complexity of findSlotByKeyIndex from O(log^2(N)) to O(log(N)) by using the
tree structure of binary index tree to find a slot in one search of the index.
As part of #11695 independent dictionaries were introduced per slot.
Time complexity to discover total no. of buckets across all dictionaries
increased to O(N) with straightforward implementation of iterating over
all dictionaries and adding dictBuckets of each.
To optimize the time complexity, we could maintain a global counter at
db level to keep track of the count of buckets and update it on the start
and end of rehashing.
---------
Co-authored-by: Roshan Khatri <rvkhatri@amazon.com>
Fixing issues described in #12672, started after #11695
Related to #12674
Fixes the `defrag didn't stop' issue.
In some cases of how the keys were stored in memory
defrag_later_item_in_progress was not getting reset once we finish
defragging the later items and we move to the next slot. This stopped
the scan to happen in the later slots and did not get
Fix some outdated comments and add comment for moduleNotifyKeyspaceEvent
we added in #11084 since it seems a bit implicit.
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
Test failure on freebsd CI:
```
*** [err]: expire scan should skip dictionaries with lot's of empty buckets in tests/unit/expire.tcl
scan didn't handle slot skipping logic.
```
Observation:
expiry of keys might happen before the empty buckets are formed and won't help with the expiry skip logic validation.
Solution:
Disable expiration until the empty buckets are formed.
Fixing issues started after #11695 when the defrag tests are being executed in cluster mode too.
For some reason, it looks like the defragmentation is over too quickly, before the test is able to
detect that it's running.
so now instead of waiting to see that it's active, we wait to see that it did some work
```
[err]: Active defrag big list: cluster in tests/unit/memefficiency.tcl
defrag not started.
[err]: Active defrag big keys: cluster in tests/unit/memefficiency.tcl
defrag didn't stop.
```
Temporarily disabling few of the defrag tests in cluster mode to make the daily run stable:
Active defrag eval scripts
Active defrag big keys
Active defrag big list
Active defrag edge case
Dictionary iterator logic in the `tryResizeHashTables` method is picking the next
(incorrect) dictionary while the cursor is at a given slot. This could lead to some
dictionary/slot getting skipped from resizing.
Also stabilize the test.
problem introduced recently in #11695
Before this commit, Unix socket setup performed chmod(2) on the socket
file after calling listen(2). Depending on what umask is used, this
could leave the file with the wrong permissions for a short period of
time. As a result, another process could exploit this race condition and
establish a connection that would otherwise not be possible.
We now make sure the socket permissions are set up prior to calling
listen(2).
(cherry picked from commit 1119ecae6fd8796fa337df2212f09173ab6c7b0a)
Co-authored-by: Yossi Gottlieb <yossigo@gmail.com>
Using heap allocation during signal handlers is unsafe.
This PR purpose is to replace all the heap allocations done within the signal
handlers raised upon server crash and assertions.
These were added in #12453.
writeStacktraces(): allocates the stacktraces output array on the calling thread's
stack and assigns the address to a global variable.
It calls `ThreadsManager_runOnThreads()` that invokes `collect_stacktrace_data()`
by each thread: each thread writes to a different location in the above array to allow
sync writes.
get_ready_to_signal_threads_tids(): instead of allocating the `tids` array, it receives it
as a fixed size array parameter, allocated on on the stack of the calling function, and
returns the number of valid threads. The array size is hard-coded to 50.
`ThreadsManager_runOnThreads():` To avoid the outputs array allocation, the
**callback signature** was changed. Now it should return void. This function return type
has also changed to int - returns 1 if successful, and 0 otherwise.
Other unsafe calls will be handled in following PRs
This is an implementation of https://github.com/redis/redis/issues/10589 that eliminates 16 bytes per entry in cluster mode, that are currently used to create a linked list between entries in the same slot. Main idea is splitting main dictionary into 16k smaller dictionaries (one per slot), so we can perform all slot specific operations, such as iteration, without any additional info in the `dictEntry`. For Redis cluster, the expectation is that there will be a larger number of keys, so the fixed overhead of 16k dictionaries will be The expire dictionary is also split up so that each slot is logically decoupled, so that in subsequent revisions we will be able to atomically flush a slot of data.
## Important changes
* Incremental rehashing - one big change here is that it's not one, but rather up to 16k dictionaries that can be rehashing at the same time, in order to keep track of them, we introduce a separate queue for dictionaries that are rehashing. Also instead of rehashing a single dictionary, cron job will now try to rehash as many as it can in 1ms.
* getRandomKey - now needs to not only select a random key, from the random bucket, but also needs to select a random dictionary. Fairness is a major concern here, as it's possible that keys can be unevenly distributed across the slots. In order to address this search we introduced binary index tree). With that data structure we are able to efficiently find a random slot using binary search in O(log^2(slot count)) time.
* Iteration efficiency - when iterating dictionary with a lot of empty slots, we want to skip them efficiently. We can do this using same binary index that is used for random key selection, this index allows us to find a slot for a specific key index. For example if there are 10 keys in the slot 0, then we can quickly find a slot that contains 11th key using binary search on top of the binary index tree.
* scan API - in order to perform a scan across the entire DB, the cursor now needs to not only save position within the dictionary but also the slot id. In this change we append slot id into LSB of the cursor so it can be passed around between client and the server. This has interesting side effect, now you'll be able to start scanning specific slot by simply providing slot id as a cursor value. The plan is to not document this as defined behavior, however. It's also worth nothing the SCAN API is now technically incompatible with previous versions, although practically we don't believe it's an issue.
* Checksum calculation optimizations - During command execution, we know that all of the keys are from the same slot (outside of a few notable exceptions such as cross slot scripts and modules). We don't want to compute the checksum multiple multiple times, hence we are relying on cached slot id in the client during the command executions. All operations that access random keys, either should pass in the known slot or recompute the slot.
* Slot info in RDB - in order to resize individual dictionaries correctly, while loading RDB, it's not enough to know total number of keys (of course we could approximate number of keys per slot, but it won't be precise). To address this issue, we've added additional metadata into RDB that contains number of keys in each slot, which can be used as a hint during loading.
* DB size - besides `DBSIZE` API, we need to know size of the DB in many places want, in order to avoid scanning all dictionaries and summing up their sizes in a loop, we've introduced a new field into `redisDb` that keeps track of `key_count`. This way we can keep DBSIZE operation O(1). This is also kept for O(1) expires computation as well.
## Performance
This change improves SET performance in cluster mode by ~5%, most of the gains come from us not having to maintain linked lists for keys in slot, non-cluster mode has same performance. For workloads that rely on evictions, the performance is similar because of the extra overhead for finding keys to evict.
RDB loading performance is slightly reduced, as the slot of each key needs to be computed during the load.
## Interface changes
* Removed `overhead.hashtable.slot-to-keys` to `MEMORY STATS`
* Scan API will now require 64 bits to store the cursor, even on 32 bit systems, as the slot information will be stored.
* New RDB version to support the new op code for SLOT information.
---------
Co-authored-by: Vitaly Arbuzov <arvit@amazon.com>
Co-authored-by: Harkrishn Patro <harkrisp@amazon.com>
Co-authored-by: Roshan Khatri <rvkhatri@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
when a server in the test suite crashes and is restarted by redstart_server, we didn't clean it's pid from the list.
we can see that when the corrupt-dump-fuzzer hangs, it has a long list of servers to lean, but in fact they're all already dead.
In #11568 we removed the NOSCRIPT flag from commands and keep the BLOCKING flag.
Aiming to allow them in scripts and let them implicitly behave in the non-blocking way.
In that sense, the old behavior was to allow LPOP and reject BLPOP, and the new behavior,
is to allow BLPOP too, and fail it only in case it ends up blocking.
So likewise, so far we allowed XREAD and rejected XREAD BLOCK, and we will now allow
that too, and only reject it if it ends up blocking.
In #10536, we introduced the assert, some older versions of servers
(like 7.0) doesn't gossip shard_id, so we will not add the node to
cluster->shards, and node->shard_id is filled in randomly and may not
be found here.
It causes that if we add a 7.2 node to a 7.0 cluster and allocate slots
to the 7.2 node, the 7.2 node will crash when it hits this assert. Somehow
like #12538.
In this PR, we remove the assert and replace it with an unconditional removal.
When entering pubsub mode and using the redis-cli only
connect command, we need to reset pubsub_mode because
we switch to a different connection.
This will affect the prompt when the connection is successful,
and redis-cli will crash when the connect fails:
```
127.0.0.1:6379> subscribe ch
1) "subscribe"
2) "ch"
3) (integer) 1
127.0.0.1:6379(subscribed mode)> connect 127.0.0.1 6380
127.0.0.1:6380(subscribed mode)> ping
PONG
127.0.0.1:6380(subscribed mode)> connect a b
Could not connect to Redis at a:0: Name or service not known
Segmentation fault
```
recently there are some incidents of hanged tests in the CI
when we try to reproduce them, we get an assertion, not a hang.
maybe the server logs will reveal some info.
Use the __MAC_OS_X_VERSION_MIN_REQUIRED macro to detect the
macOS system version instead of using MAC_OS_X_VERSION_10_6.
From MacOSX14.0.sdk, the default definitions of MAC_OS_X_VERSION_xxx have
been removed in usr/include/AvailabilityMacros.h. It includes AvailabilityVersions.h,
where the following condition must be met:
`#if (!defined(_POSIX_C_SOURCE) && !defined(_XOPEN_SOURCE)) || defined(_DARWIN_C_SOURCE)`
Only then will MAC_OS_X_VERSION_xxx be defined.
However, in the project, _DARWIN_C_SOURCE is not defined, which leads to the
loss of the definition for MAC_OS_X_VERSION_10_6.
Recently we added a way for the module to declare that it wishes to
receive nested KSN, by setting ALLOW_NESTED_KEYSPACE_NOTIFICATIONS.
but it looks like this flow has a bug, clearing the `active` member
when it was previously set. however, since nesting is permitted,
this bug has no implications, since regardless of the active member,
the notification is permitted.
We use the C standard assert() in various places in the codebase, which requires NDEBUG to be undefined. We introduced the redisassert.h file in order to allow low level files to access the assert that maps to serverPanic, but this was only applied tactically and is not available broadly.
This PR removes all usage of the standard library asserts and replaces them with an assert that maps to serverPanic. It makes us immune to accidentally setting the NDEBUG flag preventing assertions. I also marked marked the server asserts as "likely" to not execute. I spot checked various points in the code, and it didn't change the code layout on my x86 mac, but it is more consistent with redisassert.h and seems more correct overall.
Fixed some usages of tabs which caused weird indentation in the code. Tried to find all of the places so their was one PR. I ignored all of the usages of tabs which don't really affect readability.
## Crash fix
### Current behavior
We might crash if we fail to collect some of the threads' output. If it exceeds timeout for example.
The threads mngr API guarantees that the output array length will be `tids_len`, however, some
indices can be NULL, in case it fails to collect some of the threads' outputs.
When we use the threads mngr to collect the threads' stacktraces, we rely on this and skip NULL
entries. Since the output array was allocated with malloc, instead of NULL, it contained garbage,
so we got a segmentation fault when trying to read this garbage. (in debug.c:writeStacktraces() )
### fix
Allocate the global output array with zcalloc.
### To reproduce the bug, you'll have to change the code:
**in threadsmngr:ThreadsManager_runOnThreads():**
make sure the g_output_array allocation is initialized with garbage and not 0s
(add `memset(g_output_array, 2, sizeof(void*) * tids_len);` below the allocation).
Force one of the threads to write to the array:
add a global var: `static redisAtomic size_t return_now = 0;`
add to `invoke_callback()` before writing to the output array:
```
size_t i_return;
atomicGetIncr(return_now, i_return, 1);
if(i_return == 1) return;
```
compile, start the server with `--enable-debug-command local` and run `redis-cli debug assert`
The assertion triggers the the stacktrace collection.
Expect to get 2 prints of the stack trace - since we get the segmentation fault after we return from
the threads mngr, it can be safely triggered again.
## Added global variables r/w lock in ThreadsManager
To avoid a situation where the main thread runs `ThreadsManager_cleanups` while threads are still
invoking the signal handler, we use a r/w lock.
For cleanups, we will acquire the write lock.
The threads will acquire the read lock to enable them to write simultaneously.
If we fail to acquire the read lock, it means cleanups are in progress and we return immediately.
After acquiring the lock we can safely check that the global output array wasn't nullified and proceed
to write to it.
This way we ensure the threads are not modifying the global variables/ trying to write to the output
array after they were zeroed/nullified/destroyed(the semaphore).
## other minor logging change
1. removed logging if the semaphore times out because the threads can still write to the output array
after this check. Instead, we print the total number of printed stacktraces compared to the exacted
number (len_tids).
2. use noinline attribute to make sure the uplevel number of ignored stack trace entries stays correct.
3. improve testing
Co-authored-by: Oran Agra <oran@redislabs.com>
In some tests, the code manually searches for a log message, and it
uses tail -1 with a delay of 1 second, which can miss the expected line.
Also, because the aof tests use start_server_aof and not start_server,
the test name doesn't log into the server log.
To fix the above, I made the following changes:
- Change the start_server_aof to wrap the start_server.
This will add the created aof server to the servers list, and make
srv() and wait_for_log_messages() available for the tests.
- Introduce a new option for start_server.
'wait_ready' - an option to let the caller start the test code without
waiting for the server to be ready. useful for tests on a server that
is expected to exit on startup.
- Create a new start_server_aof_ex.
The new proc also accept options as argument and make use of the
new 'short_life' option for tests that are expected to exit on startup
because of some error in the aof file(s).
Because of the above, I had to change many lines and replace every
local srv variable (a server config) usage with the srv().
The problem is that WAITAOF could have hang in case commands were
propagated only to replicas.
This can happen if a module uses RM_Call with the REDISMODULE_ARGV_NO_AOF flag.
In that case, master_repl_offset would increase, but there would be nothing to fsync, so
in the absence of other traffic, fsynced_reploff_pending would stay the static, and WAITAOF can hang.
This commit updates fsynced_reploff_pending to the latest offset in flushAppendOnlyFile in case
there's nothing to fsync. i.e. in case it's behind because of the above mentions case it'll be refreshed
and release the WAITAOF.
Other changes:
Fix a race in wait.tcl (client getting blocked vs. the fsync thread)
If we set `fsynced_reploff_pending` in `startAppendOnly`, and the fork doesn't start
immediately (e.g. there's another fork active at the time), any subsequent commands
will increment `server.master_repl_offset`, but will not cause a fsync (given they were
executed before the fork started, they just ended up in the RDB part of it)
Therefore, any WAITAOF will wait on the new master_repl_offset, but it will time out
because no fsync will be executed.
Release notes:
```
WAITAOF could timeout in the absence of write traffic in case a new AOF is created and
an AOFRW can't immediately start.
This can happen by the appendonly config is changed at runtime, but also after FLUSHALL,
and replica full sync.
```
In a long printf call with many placeholders, it's hard to see which argument
belongs to which placeholder.
The long printf-like calls in the INFO and CLIENT commands are rewritten into
pairs of (format, argument). These pairs are then rewritten to a single call with
a long format string and a long list of arguments, using a macro called FMTARGS.
The file `fmtargs.h` is added to the repo.
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
The `retval` variable is defined as an `int`, so with 4 bytes, it cannot properly represent
microsecond values greater than the equivalent of about 35 minutes.
This bug shouldn't impact standard Redis behavior because Redis doesn't have timer
events that are scheduled as far as 35 minutes out, but it may affect custom Redis modules
which interact with the event timers via the RM_CreateTimer API.
The impact is that `usUntilEarliestTimer` may return 0 for as long as `retval` is scaled to
an overflowing value. While `usUntilEarliestTimer` continues to return `0`, `aeApiPoll`
will have a zero timeout, and so Redis will use significantly more CPU iterating through
its event loop without pause. For timers scheduled far enough into the future, Redis will
cycle between ~35 minute periods of high CPU usage and ~35 minute periods of standard
CPU usage.
In this PR we are adding the functionality to collect all the process's threads' backtraces.
## Changes made in this PR
### **introduce threads mngr API**
The **threads mngr API** which has 2 abilities:
* `ThreadsManager_init() `- register to SIGUSR2. called on the server start-up.
* ` ThreadsManager_runOnThreads()` - receives a list of a pid_t and a callback, tells every
thread in the list to invoke the callback, and returns the output collected by each invocation.
**Elaborating atomicvar API**
* `atomicIncrGet(var,newvalue_var,count) `-- Increment and get the atomic counter new value
* `atomicFlagGetSet` -- Get and set the atomic counter value to 1
### **Always set SIGALRM handler**
SIGALRM handler prints the process's stacktrace to the log file. Up until now, it was set only if the
`server.watchdog_period` > 0. This can be also useful if debugging is needed. However, in situations
where the server can't get requests, (a deadlock, for example) we weren't able to change the signal handler.
To make it available at run time we set SIGALRM handler on server startup. The signal handler name was
changed to a more general `sigalrmSignalHandler`.
### **Print all the process' threads' stacktraces**
`logStackTrace()` now calls `writeStacktraces()`, instead of logging the current thread stacktrace.
`writeStacktraces()`:
* On Linux systems we use the threads manager API to collect the backtraces of all the process' threads.
To get the `tids` list (threads ids) we read the `/proc/<redis-server-pid>/tasks` file which includes a list of directories.
Each directory name corresponds to one tid (including the main thread). For each thread, we also need to check if it
can get the signal from the threads manager (meaning it is not blocking/ignoring that signal). We send the threads
manager this tids list and `collect_stacktrace_data()` callback, which collects the thread's backtrace addresses,
its name, and tid.
* On other systems, the behavior remained as it was (writing only the current thread stacktrace to the log file).
## compatibility notes
1. **The threads mngr API is only supported in linux.**
2. glibc earlier than 2.3 We use `syscall(SYS_gettid)` and `syscall(SYS_tgkill...)` because their dedicated
alternatives (`gettid()` and `tgkill`) were added in glibc 2.3.
## Output example
Each thread backtrace will have the following format:
`<tid> <thread_name> [additional_info]`
* **tid**: as read from the `/proc/<redis-server-pid>/tasks` file
* **thread_name**: the tread name as it is registered in the os/
* **additional_info**: Sometimes we want to add specific information about one of the threads. currently.
it is only used to mark the thread that handles the backtraces collection by adding "*".
In case of crash - this also indicates which thread caused the crash. The handling thread in won't
necessarily appear first.
```
------ STACK TRACE ------
EIP:
/lib/aarch64-linux-gnu/libc.so.6(epoll_pwait+0x9c)[0xffffb9295ebc]
67089 redis-server *
linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffffb9437790]
/lib/aarch64-linux-gnu/libc.so.6(epoll_pwait+0x9c)[0xffffb9295ebc]
redis-server *:6379(+0x75e0c)[0xaaaac2fe5e0c]
redis-server *:6379(aeProcessEvents+0x18c)[0xaaaac2fe6c00]
redis-server *:6379(aeMain+0x24)[0xaaaac2fe7038]
redis-server *:6379(main+0xe0c)[0xaaaac3001afc]
/lib/aarch64-linux-gnu/libc.so.6(+0x273fc)[0xffffb91d73fc]
/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98)[0xffffb91d74cc]
redis-server *:6379(_start+0x30)[0xaaaac2fe0370]
67093 bio_lazy_free
/lib/aarch64-linux-gnu/libc.so.6(+0x79dfc)[0xffffb9229dfc]
/lib/aarch64-linux-gnu/libc.so.6(pthread_cond_wait+0x208)[0xffffb922c8fc]
redis-server *:6379(bioProcessBackgroundJobs+0x174)[0xaaaac30976e8]
/lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8)[0xffffb922d5c8]
/lib/aarch64-linux-gnu/libc.so.6(+0xe5d1c)[0xffffb9295d1c]
67091 bio_close_file
/lib/aarch64-linux-gnu/libc.so.6(+0x79dfc)[0xffffb9229dfc]
/lib/aarch64-linux-gnu/libc.so.6(pthread_cond_wait+0x208)[0xffffb922c8fc]
redis-server *:6379(bioProcessBackgroundJobs+0x174)[0xaaaac30976e8]
/lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8)[0xffffb922d5c8]
/lib/aarch64-linux-gnu/libc.so.6(+0xe5d1c)[0xffffb9295d1c]
67092 bio_aof
/lib/aarch64-linux-gnu/libc.so.6(+0x79dfc)[0xffffb9229dfc]
/lib/aarch64-linux-gnu/libc.so.6(pthread_cond_wait+0x208)[0xffffb922c8fc]
redis-server *:6379(bioProcessBackgroundJobs+0x174)[0xaaaac30976e8]
/lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8)[0xffffb922d5c8]
/lib/aarch64-linux-gnu/libc.so.6(+0xe5d1c)[0xffffb9295d1c]
67089:signal-handler (1693824528) --------
```
Starting a change in #12233 (released in 7.2), CLUSTER commands use client's
connection to decide whether to return TLS port or non-TLS port, but commands
called by Lua script and module's RM_Call don't have a real client with connection,
and would currently be regarded as non-TLS connections.
We can use server.current_client instead when it is available. When it is not (module calls
commands without a real client), we may see this as an undefined behavior, and return null
or default port (currently in this PR it returns default port, judged by server.tls_cluster).
An unintentional change was introduced in #10536, we used
to use addReplyLongLong and now it is addReplyBulkLonglong,
revert it back the previous behavior.
Updated the command tips for ACL SAVE / SETUSER / DELUSER, CLIENT SETNAME / SETINFO, and LATENCY RESET.
The tips now match CONFIG SET, since there's a similar behavior for all of these commands - the
user expects to update the various configurations & states on all nodes, not only on a single, random node.
For LATENCY RESET the response tip is now agg_sum.
Co-authored-by: Shachar Langbeheim <shachlan@amazon.com>
When connecting between a 7.0 and 7.2 cluster, the 7.0 cluster will not populate the shard_id field, which is expect on the 7.2 cluster. This is not intended behavior, as the 7.2 cluster is supposed to use a temporary shard_id while the node is in the upgrading state, but it wasn't being correctly set in this case.
Recently, the option of sending an argument from stdin using `-x` flag
was added to redis-benchmark (this option is available in redis-cli as well).
However, using the `-x` option for sending a blobs that contains null-characters
doesn't work as expected - the argument is trimmed in the first occurrence of
`\X00` (unlike in redis-cli).
This PR aims to fix this issue and add the support for every binary string input,
by sending arguments length to `redisFormatCommandArgv` when processing
redis-benchmark command, so we won't treat the arguments as C-strings.
Additionally, we add a simple test coverage for `-x` (without binary strings, and
also remove an excessive server started in tests, and make sure to select db 0
so that `r` and the benchmark work on the same db.
Co-authored-by: Oran Agra <oran@redislabs.com>
The new test added in #12476 causes reply-schemas-validator to fail.
When doing `catch {r get key}`, the req-res output is:
```
3
get
3
key
12
__argv_end__
$100000
aaaaaaaaaaaaaaaaaaaa...4
info
5
stats
12
__argv_end__
=1670
txt:# Stats
...
```
And we can see the link after `$100000`, there is a 4 in the last,
it break the req-res-log-validator script since the format is wrong.
The reason i guess is after the client reconnection (after the output
buf limit), we will not add newlines, but append args directly.
Since obuf-limits.tcl is doing the same thing, and it had the logreqres:skip
flag, so this PR is following it.
Found that in moduleConfigValidityCheck and isModuleConfigNameRegistered, sds is not required. This also allowed to remove unnecessary memcopy from some of the config registering APIs.
ZRANGE BYSCORE/BYLEX with [LIMIT offset count] option was
using every level in skiplist to jump to the first/last node in range,
but only use level[0] in skiplist to locate the node at offset, resulting
in sub-optimal performance using LIMIT:
```
while (ln && offset--) {
if (reverse) {
ln = ln->backward;
} else {
ln = ln->level[0].forward;
}
}
```
It could be slow when offset is very big. We can get the total rank of
the offset location and use skiplist to jump to it. It is an improvement
from O(offset) to O(log rank).
Below shows how this is implemented (if the offset is positve):
Use the skiplist to seach for the first element in the range, record its
rank `rank_0`, so we can have the rank of the target node `rank_t`.
Meanwhile we record the last node we visited which has zsl->level-1
levels and its rank `rank_1`. Then we start from the zsl->level-1 node,
use skiplist to go forward `rank_t-rank_1` nodes to reach the target node.
It is very similiar when the offset is reversed.
Note that if `rank_t` is very close to `rank_0`, we just start from the first
element in range and go node by node, this for the case when zsl->level-1
node is to far away and it is quicker to reach the target node by node.
Here is a test using a random generated zset including 10000 elements
(with different positive scores), doing a bench mark which compares how
fast the `ZRANGE` command is exucuted before and after the optimization.
The start score is set to 0 and the count is set to 1 to make sure that
most of the time is spent on locating the offset.
```
memtier_benchmark -h 127.0.0.1 -p 6379 --command="zrange test 0 +inf byscore limit <offset> 1"
```
| offset | QPS(unstable) | QPS(optimized) |
|--------|--------|--------|
| 10 | 73386.02 | 74819.82 |
| 1000 | 48084.96 | 73177.73 |
| 2000 | 31156.79 | 72805.83 |
| 5000 | 10954.83 | 71218.21 |
With the result above, we can see that the original code is greatly
slowed down when offset gets bigger, and with the optimization the
speed is almost not affected.
Similiar results are generated when testing reversed offset:
```
memtier_benchmark -h 127.0.0.1 -p 6379 --command="zrange test +inf 0 byscore rev limit <offset> 1"
```
| offset | QPS(unstable) | QPS(optimized) |
|--------|--------|--------|
| 10 | 74505.14 | 71653.67 |
| 1000 | 46829.25 | 72842.75 |
| 2000 | 28985.48 | 73669.01 |
| 5000 | 11066.22 | 73963.45 |
And the same conclusion is drawn from the tests of ZRANGE BYLEX.
This PR adds a new Module API int RM_AddACLCategory(RedisModuleCtx *ctx, const char *category_name) to add a new ACL command category.
Here, we initialize the ACLCommandCategories array by allocating space for 64 categories and duplicate the 21 default categories from the predefined array 'ACLDefaultCommandCategories' into the ACLCommandCategories array while ACL initialization. Valid ACL category names can only contain alphanumeric characters, underscores, and dashes.
The API when called, checks for the onload flag, category name validity, and for duplicate category name if present. If the conditions are satisfied, the API adds the new category to the trailing end of the ACLCommandCategories array and assigns the acl_categories flag bit according to the index at which the category is added.
If any error is encountered the errno is set accordingly by the API.
---------
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Add these INFO metrics:
* client_query_buffer_limit_disconnections
* client_output_buffer_limit_disconnections
Sometimes it is useful to monitor whether clients reaches size limit of
query buffer and output buffer, to decide whether we need to adjust the
buffer size limit or reduce client query payload.
Since the three commands have similar behavior (change config, return
OK), the tips that govern how they should behave should be similar.
Co-authored-by: Shachar Langbeheim <shachlan@amazon.com>
This test failed several times:
```
*** [err]: LATENCY GRAPH can output the event graph in tests/unit/latency-monitor.tcl
Expected '478' to be more than or equal to '500' (context: type eval
line 8 cmd {assert_morethan_equal $high 500} proc ::test)
```
Not sure why, adding some verbose printing that'll print the command
result on the next time.
warning against editing the config file and restarting the server.
which will attempt to load an AOF file and disregard the RDB.
Co-authored-by: Oran Agra <oran@redislabs.com>
BITCOUNT and BITPOS with non-existing key will return 0 even the
arguments are error, before this commit:
```
> flushall
OK
> bitcount s 0
(integer) 0
> bitpos s 0 0 1 hello
(integer) 0
> set s 1
OK
> bitcount s 0
(error) ERR syntax error
> bitpos s 0 0 1 hello
(error) ERR syntax error
```
The reason is that we judged non-existing before parameter checking and
returned. This PR fixes it, and after this commit:
```
> flushall
OK
> bitcount s 0
(error) ERR syntax error
> bitpos s 0 0 1 hello
(error) ERR syntax error
```
Also BITPOS made the same fix as #12394, check for wrong argument, before
checking for key.
```
> lpush mylist a b c
(integer) 3
> bitpos mylist 1 a b
(error) WRONGTYPE Operation against a key holding the wrong kind of value
```
Generally, In any command we first check for the argument and then check if key exist.
Some of the examples are
```
127.0.0.1:6379> getrange no-key invalid1 invalid2
(error) ERR value is not an integer or out of range
127.0.0.1:6379> setbit no-key 1 invalid
(error) ERR bit is not an integer or out of range
127.0.0.1:6379> xrange no-key invalid1 invalid2
(error) ERR Invalid stream ID specified as stream command argument
```
**Before change**
```
bitcount no-key invalid1 invalid2
0
```
**After change**
```
bitcount no-key invalid1 invalid2
(error) ERR value is not an integer or out of range
```
Limit the range of LREM count to -LONG_MAX ~ LONG_MAX.
Before the fix, passing -LONG_MAX would cause an overflow
and would effectively be the same as passing 0. (Because
this condition `toremove && removed == toremove `can never
be satisfied).
This is a minor fix as it shouldn't really affect users,
more like a cleanup.
In our test case, now we missed some test coverage for client sub-commands.
This pr goal is to add some test coverage cases of the following commands:
Client caching
Client kill
Client no-evict
Client pause
Client reply
Client tracking
Client setname
At the very least, this is useful to make sure there are no leaks and crashes in these code paths.
This PR purpose is to make the crash report process thread safe.
main changes include:
1. `setupSigSegvHandler()` is introduced to initialize the signal handler.
This function first initializes the signal handler mutex (if not initialized yet)
and then registers the process to the signal handler.
2. **sigsegvHandler** flags :
SA_NODEFER - don't add the signal to the process signal mask. We use this
flag because we want to be able to handle a second call to the signal manually.
removed SA_RESETHAND: this flag resets the signal handler function upon the first
entrance to the registered function. The reason to use this flag is to protect from
recursively entering the signal handler by the same thread. But, it also means
that if a second thread crashes while handling a signal, the process will be
terminated immediately and we won't get the crash report.
In this PR we discard this flag. The signal handler guard described below purpose
is to solve the above issues.
3. Add a **signal handler lock** with ERRORCHECK attributes.
The lock's purpose is to ensure that only one thread generates a crash report.
Once a second thread enters the signal handler it will be blocked.
We use the ERRORCHECK lock in order to protect from possible deadlock in
case the thread handling the crash gets a signal. In the latest scenario, we log
what we have collected until the handler crashed.
At the end of the crash report we reset the signal handler SIG_DFL, with no flags, and
rethrow the signal to generate a core dump (if enabled) and exit the process.
During the work on this PR we wanted to understand the historical reasons for
how crash is handled.
With respect to the choice of the flag, we believe the **SA_RESETHAND** was not
added for any specific purpose.
**SA_ONSTACK** which is removed here from bugReportEnd(), was originally also
set in the initial registration to signal handler, but removed in 3ada43e73. In addition,
it was removed from another location in deee2c1ef with the following description,
which is also relevant to why it should be removed from bugReportEnd:
> it seems to be some valgrind bug with SA_ONSTACK.
> SA_ONSTACK seems unneeded since WD is not recursive (SA_NODEFER was removed),
> also, not sure if it's even valid without a call to sigaltstack()
When running cluster info, and the number of keys overflows the integer
value, the summary no longer makes sense. This fixes by using an appropriate
type to handle values over the max int value.
If dict is rehashing, the entries in the head of table[0] is moved to table[1]
and all entries in `table[0][0:rehashidx]` is NULL.
`dictNext` start looking for non-NULL entry from table 0 index 0, and the first call
of `dictNext` on a rehashing dict will Iterate many times to skip those NULL entries.
We can easily skip those entries by setting `iter->index` as `iter->d->rehashidx` when
dict is rehashing and it's the first call of dictNext (`iter->index == -1 && iter->table == 0`).
Co-authored-by: sundb <sundbcn@gmail.com>
Currently rdbSaveMillisecondTime, rdbSaveDoubleValue api's return type is
int but they return the value directly from rdbWriteRaw function which has the
return type of ssize_t. As this may cause overflow to int so changed to ssize_t.
It is also possible that your question was already asked here, so please do a quick issues search before submitting. Lastly, if your question is about one of Redis' [clients](https://redis.io/clients), you may to contact your client's developers for help.
That said, please feel free to replace all this with your question :)
or [Valkey's Discord](https://discord.gg/zbcPa5umUB)
or [Valkey's Matrix](https://matrix.to/#/#valkey:matrix.org)
* Found a bug? [Report it here](https://github.com/valkey-io/valkey/issues/new?template=bug_report.md&title=%5BBUG%5D)
* Valkey crashed? [Submit a crash report here](https://github.com/valkey-io/valkey/issues/new?template=crash_report.md&title=%5BCRASH%5D+%3Cshort+description%3E)
* Suggest a new feature? [Post your detailed feature request here](https://github.com/valkey-io/valkey/issues/new?template=feature_request.md&title=%5BNEW%5D)
* Want to help with documentation? [Move on to valkey-doc](https://github.com/valkey-io/valkey-doc)
* Report a vulnerability? See [SECURITY.md](SECURITY.md)
Issues and pull requests for documentation belong on the redis-doc repo:
## Developer Certificate of Origin
https://github.com/redis/redis-doc
We respect the intellectual property rights of others and we want to make sure
all incoming contributions are correctly attributed and licensed. A Developer
Certificate of Origin (DCO) is a lightweight mechanism to do that. The DCO is
a declaration attached to every commit. In the commit message of the contribution,
the developer simply adds a `Signed-off-by` statement and thereby agrees to the DCO,
which you can find below or at [DeveloperCertificate.org](http://developercertificate.org/).
If you are reporting a security bug or vulnerability, see SECURITY.md.
```text
Developer's Certificate of Origin 1.1
# How to provide a patch for a new feature
By making a contribution to this project, I certify that:
(a) The contribution was created in whole or in part by me and I
have the right to submit it under the open source license
indicated in the file; or
(b) The contribution is based upon previous work that, to the
best of my knowledge, is covered under an appropriate open
source license and I have the right under that license to
submit that work with modifications, whether created in whole
or in part by me, under the same open source license (unless
I am permitted to submit under a different license), as
Indicated in the file; or
(c) The contribution was provided directly to me by some other
person who certified (a), (b) or (c) and I have not modified
it.
(d) I understand and agree that this project and the contribution
are public and that a record of the contribution (including
all personal information I submit with it, including my
sign-off) is maintained indefinitely and may be redistributed
consistent with this project or the open source license(s)
involved.
```
We require that every contribution to Valkey to be signed with a DCO. We require the
usage of known identity (such as a real or preferred name). We do not accept anonymous
contributors nor those utilizing pseudonyms. A DCO signed commit will contain a line like:
```text
Signed-off-by: Jane Smith <jane.smith@email.com>
```
You may type this line on your own when writing your commit messages. However, if your
user.name and user.email are set in your git configs, you can use `git commit` with `-s`
or `--signoff` to add the `Signed-off-by` line to the end of the commit message. We also
require revert commits to include a DCO.
If you're contributing code to the Valkey project in any other form, including
sending a code fragment or patch via private email or public discussion groups,
you need to ensure that the contribution is in accordance with the DCO.
# How to provide a patch or a new feature
1. If it is a major feature or a semantical change, please don't start coding
straight away: if your feature is not a conceptual fit you'll lose a lot of
time writing the code without any reason. Start by posting in the mailing list
and creating an issue at Github with the description of, exactly, what you want
to accomplish and why. Use cases are important for features to be accepted.
Here you'll see if there is consensus about your idea.
time writing the code without any reason. Start by creating an issue at Github with the
description of, exactly, what you want to accomplish and why. Use cases are important for
features to be accepted. Here you can see if there is consensus about your idea.
2. If in step 1 you get an acknowledgment from the project leaders, use the
following procedure to submit a patch:
a. Fork Redis on github ( https://docs.github.com/en/github/getting-started-with-github/fork-a-repo )
b. Create a topic branch (git checkout -b my_branch)
c. Push to your branch (git push origin my_branch)
d. Initiate a pull request on github ( https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/creating-a-pull-request )
e. Done :)
2. If in step 1 you get an acknowledgment from the project leaders, use the following
procedure to submit a patch:
1. Fork Valkey on GitHub ([HOWTO](https://docs.github.com/en/github/getting-started-with-github/fork-a-repo))
1. Create a topic branch (`git checkout -b my_branch`)
1. Make the needed changes and commit with a DCO. (`git commit -s`)
1. Push to your branch (`git push origin my_branch`)
1. Initiate a pull request on GitHub ([HOWTO](https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/creating-a-pull-request))
1. Done :)
3. Keep in mind that we are very overloaded, so issues and PRs sometimes wait
for a *very* long time. However this is not lack of interest, as the project
for a *very* long time. However this is not a lack of interest, as the project
gets more and more users, we find ourselves in a constant need to prioritize
certain issues/PRs over others. If you think your issue/PR is very important
try to popularize it, have other users commenting and sharing their point of
view and so forth. This helps.
view, and so forth. This helps.
4. For minor fixes just open a pull request on Github.
4. For minor fixes, open a pull request on GitHub.
To link a pull request to an existing issue, please write "Fixes #xyz" somewhere
in the pull request description, where xyz is the issue number.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
* Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# License 2
BSD 3-Clause License
Copyright (c) 2024-present, Valkey contributors
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
* Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# License 3
BSD 3-Clause License
Copyright (c) 2006-2020, Redis Ltd.
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
The Valkey project is managed by a Technical Steering Committee (TSC) composed of the maintainers of the Valkey repository.
The Valkey project includes all of the current and future repositories under the Valkey-io organization.
Committers are defined as individuals with write access to the code within a repository.
Maintainers are defined as individuals with full access to a repository and own its governance.
Both maintainers and committers should be clearly listed in the MAINTAINERS.md file in a given projects repository.
Maintainers of other repositories within the Valkey project are not members of the TSC unless explicitly added.
## Technical Steering Committee
The TSC is responsible for oversight of all technical, project, approval, and policy matters for Valkey.
The TSC members are listed in the [MAINTAINERS.md](MAINTAINERS.md) file in the Valkey repository.
Maintainers (and accordingly, TSC members) may be added or removed by no less than 2/3 affirmative vote of the current TSC.
The TSC shall appoint a Chair responsible for organizing TSC meetings.
If the TSC Chair is removed from the TSC (or the Chair steps down from that role), it is the responsibility of the TSC to appoint a new Chair.
The TSC can amend this governance document by no less than a 2/3 affirmative vote.
The TSC may, at its discretion, add or remove members who are not maintainers of the main Valkey repository.
The TSC may, at its discretion, add or remove maintainers from other repositories within the Valkey project.
## Voting
The TSC shall strive for all decisions to be made by consensus.
While explicit agreement of the entire TSC is preferred, it is not required for consensus.
Rather, the TSC shall determine consensus based on their good faith consideration of a number of factors, including the dominant view of the TSC and nature of support and objections.
The TSC shall document evidence of consensus in accordance with these requirements.
If consensus cannot be reached, the TSC shall make the decision by a vote.
A vote shall also be called when an issue or pull request is marked as a major decision, which are decisions that have a significant impact on the Valkey architecture or design.
Examples of major decisions:
* Fundamental changes to the Valkey core datastructures
* Adding a new data structure or API
* Changes that affect backward compatibility
* New user visible fields that need to be maintained
* Modifications to the TSC or other governance documents
* Adding members to other roles within the Valkey project
* Delegation of maintainership for projects to other groups or individuals
* Adding or removing a new external library such as a client
or module to the project.
Any member of the TSC can call a vote with reasonable notice to the TSC, setting out a discussion period and a separate voting period.
Any discussion may be conducted in person or electronically by text, voice, or video.
The discussion shall be open to the public, with the notable exception of discussions involving embargoed security issues or the addition or removal of maintainers, which will be private.
In any vote, each voting TSC member will have one vote.
The TSC shall give at least two weeks for all members to submit their vote.
Except as specifically noted elsewhere in this document, decisions by vote require a simple majority vote of all voting members.
It is the responsibility of the TSC chair to help facilitate the voting process as needed to make sure it completes within the voting period.
## Termination of Membership
A maintainer's access (and accordingly, their position on the TSC) will be removed if any of the following occur:
* Resignation: Written notice of resignation to the TSC.
* TSC Vote: 2/3 affirmative vote of the TSC to remove a member
* Unreachable Member: If a member is unresponsive for more than six months, the remaining active members of the TSC may vote to remove the unreachable member by a simple majority.
## Technical direction for other Valkey projects
The TSC may delegate decision making for other projects within the Valkey organization to the maintainers responsible for those projects.
Delegation of decision making for a project is considered a major decision, and shall happen with an explicit vote.
Projects within the Valkey organization must indicate the individuals with commit permissions by updating the MAINTAINERS.md within their repositories.
The TSC may, at its discretion, overrule the decisions made by other projects within the Valkey organization, although they should show restraint in doing so.
## License of this document
This document may be used, modified, and/or distributed under the terms of the
[Creative Commons Attribution 4.0 International (CC-BY) license](https://creativecommons.org/licenses/by/4.0/legalcode).
Redis is often referred to as a *data structures* server. What this means is that Redis provides access to mutable data structures via a set of commands, which are sent using a *server-client* model with TCP sockets and a simple protocol. So different processes can query and modify the same data structures in a shared way.
<h3align="center">Futriix</h3>
Data structures implemented into Redis have a few special properties:
<palign="center">
Futriix's полная документация (команды идентичны)
<br/>
<ahref="https://valkey.io/"><strong>Изучить полную документацию</strong></a>
<br/>
<ahref="">Сообщить об ошибке</a>
·
<ahref="">Предложение новой функциональности</a>
</p>
</div>
* Redis cares to store them on disk, even if they are always served and modified into the server memory. This means that Redis is fast, but that it is also non-volatile.
* The implementation of data structures emphasizes memory efficiency, so data structures inside Redis will likely use less memory compared to the same data structure modelled using a high-level programming language.
* Redis offers a number of features that are natural to find in a database, like replication, tunable levels of durability, clustering, and high availability.
## Краткая документация проекта Futriix
Another good example is to think of Redis as a more complex version of memcached, where the operations are not just SETs and GETs, but operations that work with complex data types like Lists, Sets, ordered data structures, and so forth.
If you want to know more, this is a list of selected starting points:
* Introduction to Redis data types. https://redis.io/topics/data-types-intro
* Try Redis directly inside your browser. https://try.redis.io
* The full list of Redis commands. https://redis.io/commands
* There is much more inside the official Redis documentation. https://redis.io/documentation
Building Redis
--------------
Redis can be compiled and used on Linux, OSX, OpenBSD, NetBSD, FreeBSD.
We support big endian and little endian architectures, and both 32 bit
and 64 bit systems.
It may compile on Solaris derived systems (for instance SmartOS) but our
support for this platform is *best effort* and Redis is not guaranteed to
work as well as in Linux, OSX, and \*BSD.
It is as simple as:
% make
To build with TLS support, you'll need OpenSSL development libraries (e.g.
libssl-dev on Debian/Ubuntu) and run:
% make BUILD_TLS=yes
To build with systemd support, you'll need systemd development libraries (such
as libsystemd-dev on Debian/Ubuntu or systemd-devel on CentOS) and run:
% make USE_SYSTEMD=yes
To append a suffix to Redis program names, use:
% make PROG_SUFFIX="-alt"
You can build a 32 bit Redis binary using:
% make 32bit
After building Redis, it is a good idea to test it using:
% make test
If TLS is built, running the tests with TLS enabled (you will need `tcl-tls`
Fixing build problems with dependencies or cached build options
---------
<!-- ABOUT THE PROJECT -->
## О проекте
Redis has some dependencies which are included in the `deps` directory.
`make` does not automatically rebuild dependencies even if something in
the source code of dependencies changes.
Проект Futriix является форком проекта Valkey.
Futriix-Распределённая СУБД на языке "C", построенная на базе [Valkey](https://valkey.io/), с поддержкой модулей на базе Искусственного интеллекта и модулей на языке Golang.
When you update the source code with `git pull` or when code inside the
dependencies tree is modified in any other way, make sure to use the following
command in order to really clean everything and rebuild from scratch:
СУБД поддерживает модуль c распределённым [JSON](https://source.futriix.ru/gvsafronov/futriix-json), [ИИ-модуль "Виртуальный помощник"](), [SQL-модуль](https://source.futriix.ru/gvsafronov/fdx).
% make distclean
Ниже приведён пример того, инструкции по настройке вашего проекта локально.
Чтобы запустить локальную копию проекта, выполните следующие простые шаги.
This will clean: jemalloc, lua, hiredis, linenoise and other dependencies.
Also if you force certain build options like 32bit target, no C compiler
optimizations (for debugging purposes), and other similar build time options,
those options are cached indefinitely until you issue a `make distclean`
command.
Fixing problems building 32 bit binaries
---------
### Подготовка
If after building Redis with a 32 bit target you need to rebuild it
with a 64 bit target, or the other way around, you need to perform a
`make distclean` in the root directory of the Redis distribution.
Ниже приведены шаги, которые помогут вам скомпилировать и установить Futriix.
* Устанавливаем язык программирования C, соопутствующие утилиты (autoconf и другие)
In case of build errors when trying to build a 32 bit binary of Redis, try
## Исправление проблем сборки с зависимостями или кэшированными параметрами сборки.
Futriix содержит некоторые зависимости, которые хранятся в директории `deps`.
Утилита `make` автоматически не пересобирает зависимости даже если вносятся каие-либо изменения в код зависимостей.
Когда вы обновляете код проекта командой `git pull` или когда код внутри
дерева зависимостей изменен каким-либо другим способом, обязательно используйте следующее
команду для того, чтобы действительно все почистить и пересобрать с нуля:
```sh
unix:$ make distclean
```
В результате работы команды выше будут очищены: аллокатор памяти jemalloc, язык lua, библиотеку hiredis, библиотеку linenoise а также другие зависимости.
Кроме того, если вы принудительно используете определенные параметры сборки, такие как 32-битная версия для 32-битной системы, оптимизации компилятора C в данном случае не будут выполнены. Оптимизации (для целей отладки) и другие подобные параметры времени сборки,
кэшируются на неопределенный срок, пока вы не выполните команду `make distclean`.
To build with support for the processor's internal instruction clock, use:
Для сборки с поддержкой внутренней тактовой частоты процессора, используйте команду ниже:
% make CFLAGS="-DUSE_PROCESSOR_CLOCK"
```sh
unix:$ make CFLAGS="-DUSE_PROCESSOR_CLOCK"
```
Verbose build
-------------
## Расширенный вариант сборки
Redis will build with a user-friendly colorized output by default.
If you want to see a more verbose output, use the following:
Futriix по умолчанию создает удобный для пользователя цветной вывод.
Если вы хотите увидеть более подробный вывод, выполните следующую команду:
% make V=1
```sh
unix:$ make V=1
```
4. Если вы хотите запустить сервер Futriix с параметрами по-умолчанию (без указания файла конфигурации) выполните следующую команду:
```sh
`./futriix-server`
```
5. Также вы можете использовать файл конфигурации, располагающийся в директории "Futriix" `futriix.conf` для конфигурирования вашего сервера.
Для запуска Futriix с файлом конфигурации используйте команду ниже:
```sh
./futriix-server /path/to/futriix.conf
```
6. Запустите утилиту futriix-cli (Client Futriix) для подключения к **локальному** серверу Futriix, а также для того чтобы начать работу с инстансом:
Running Redis
-------------
```sh
./futriix-cli
```
To run Redis with the default configuration, just type:
7. Для подключения с помощью утилиты futriix-cli к конкретному узлу в сети, добавьте параметр `h`-указание удалённого хоста по его ip-адресу и параметр `p`- указания номера порта:
% cd src
% ./redis-server
```sh
./futriix-cli -h 11.164.22.7 -p 50000
```
If you want to provide your redis.conf, you have to run it using an additional
If you are reading this README you are likely in front of a Github page
or you just untarred the Redis distribution tar ball. In both the cases
you are basically one step away from the source code, so here we explain
the Redis source code layout, what is in each file as a general idea, the
most important functions and structures inside the Redis server and so forth.
We keep all the discussion at a high level without digging into the details
since this document would be huge otherwise and our code base changes
continuously, but a general idea should be a good starting point to
understand more. Moreover most of the code is heavily commented and easy
to follow.
Source code layout
---
The Redis root directory just contains this README, the Makefile which
calls the real Makefile inside the `src` directory and an example
configuration for Redis and Sentinel. You can find a few shell
scripts that are used in order to execute the Redis, Redis Cluster and
Redis Sentinel unit tests, which are implemented inside the `tests`
directory.
Inside the root are the following important directories:
* `src`: contains the Redis implementation, written in C.
* `tests`: contains the unit tests, implemented in Tcl.
* `deps`: contains libraries Redis uses. Everything needed to compile Redis is inside this directory; your system just needs to provide `libc`, a POSIX compatible interface and a C compiler. Notably `deps` contains a copy of `jemalloc`, which is the default allocator of Redis under Linux. Note that under `deps` there are also things which started with the Redis project, but for which the main repository is not `redis/redis`.
There are a few more directories but they are not very important for our goals
here. We'll focus mostly on `src`, where the Redis implementation is contained,
exploring what there is inside each file. The order in which files are
exposed is the logical one to follow in order to disclose different layers
of complexity incrementally.
Note: lately Redis was refactored quite a bit. Function names and file
names have been changed, so you may find that this documentation reflects the
`unstable` branch more closely. For instance, in Redis 3.0 the `server.c`
and `server.h` files were named `redis.c` and `redis.h`. However the overall
structure is the same. Keep in mind that all the new developments and pull
requests should be performed against the `unstable` branch.
server.h
---
The simplest way to understand how a program works is to understand the
data structures it uses. So we'll start from the main header file of
Redis, which is `server.h`.
All the server configuration and in general all the shared state is
defined in a global structure called `server`, of type `struct redisServer`.
A few important fields in this structure are:
* `server.db` is an array of Redis databases, where data is stored.
* `server.commands` is the command table.
* `server.clients` is a linked list of clients connected to the server.
* `server.master` is a special client, the master, if the instance is a replica.
There are tons of other fields. Most fields are commented directly inside
the structure definition.
Another important Redis data structure is the one defining a client.
In the past it was called `redisClient`, now just `client`. The structure
has many fields, here we'll just show the main ones:
```c
struct client {
int fd;
sds querybuf;
int argc;
robj **argv;
redisDb *db;
int flags;
list *reply;
// ... many other fields ...
char buf[PROTO_REPLY_CHUNK_BYTES];
}
```
The client structure defines a *connected client*:
* The `fd` field is the client socket file descriptor.
* `argc` and `argv` are populated with the command the client is executing, so that functions implementing a given Redis command can read the arguments.
* `querybuf` accumulates the requests from the client, which are parsed by the Redis server according to the Redis protocol and executed by calling the implementations of the commands the client is executing.
* `reply` and `buf` are dynamic and static buffers that accumulate the replies the server sends to the client. These buffers are incrementally written to the socket as soon as the file descriptor is writable.
As you can see in the client structure above, arguments in a command
are described as `robj` structures. The following is the full `robj`
structure, which defines a *Redis object*:
```c
struct redisObject {
unsigned type:4;
unsigned encoding:4;
unsigned lru:LRU_BITS; /* LRU time (relative to global lru_clock) or
* LFU data (least significant 8 bits frequency
* and most significant 16 bits access time). */
int refcount;
void *ptr;
};
```
Basically this structure can represent all the basic Redis data types like
strings, lists, sets, sorted sets and so forth. The interesting thing is that
it has a `type` field, so that it is possible to know what type a given
object has, and a `refcount`, so that the same object can be referenced
in multiple places without allocating it multiple times. Finally the `ptr`
field points to the actual representation of the object, which might vary
even for the same type, depending on the `encoding` used.
2. Откройте файл конфигурации futriix.conf в любом текстовом редакторе, например nano, как в примере приведённом ниже:
Redis objects are used extensively in the Redis internals, however in order
to avoid the overhead of indirect accesses, recently in many places
we just use plain dynamic strings not wrapped inside a Redis object.
```sh
unix:$ nano futriix/futriix.conf
server.c
---
This is the entry point of the Redis server, where the `main()` function
is defined. The following are the most important steps in order to startup
the Redis server.
* `initServerConfig()` sets up the default values of the `server` structure.
* `initServer()` allocates the data structures needed to operate, setup the listening socket, and so forth.
* `aeMain()` starts the event loop which listens for new connections.
There are two special functions called periodically by the event loop:
1. `serverCron()` is called periodically (according to `server.hz` frequency), and performs tasks that must be performed from time to time, like checking for timed out clients.
2. `beforeSleep()` is called every time the event loop fired, Redis served a few requests, and is returning back into the event loop.
Inside server.c you can find code that handles other vital things of the Redis server:
* `call()` is used in order to call a given command in the context of a given client.
* `activeExpireCycle()` handles eviction of keys with a time to live set via the `EXPIRE` command.
* `performEvictions()` is called when a new write command should be performed but Redis is out of memory according to the `maxmemory` directive.
* The global variable `redisCommandTable` defines all the Redis commands, specifying the name of the command, the function implementing the command, the number of arguments required, and other properties of each command.
commands.c
---
This file is auto generated by utils/generate-command-code.py, the content is based on the JSON files in the src/commands folder.
These are meant to be the single source of truth about the Redis commands, and all the metadata about them.
These JSON files are not meant to be used by anyone directly, instead that metadata can be obtained via the `COMMAND` command.
networking.c
---
This file defines all the I/O functions with clients, masters and replicas
(which in Redis are just special clients):
* `createClient()` allocates and initializes a new client.
* The `addReply*()` family of functions are used by command implementations in order to append data to the client structure, that will be transmitted to the client as a reply for a given command executed.
* `writeToClient()` transmits the data pending in the output buffers to the client and is called by the *writable event handler*`sendReplyToClient()`.
* `readQueryFromClient()` is the *readable event handler* and accumulates data read from the client into the query buffer.
* `processInputBuffer()` is the entry point in order to parse the client query buffer according to the Redis protocol. Once commands are ready to be processed, it calls `processCommand()` which is defined inside `server.c` in order to actually execute the command.
* `freeClient()` deallocates, disconnects and removes a client.
aof.c and rdb.c
---
As you can guess from the names, these files implement the RDB and AOF
persistence for Redis. Redis uses a persistence model based on the `fork()`
system call in order to create a process with the same (shared) memory
content of the main Redis process. This secondary process dumps the content
of the memory on disk. This is used by `rdb.c` to create the snapshots
on disk and by `aof.c` in order to perform the AOF rewrite when the
append only file gets too big.
The implementation inside `aof.c` has additional functions in order to
implement an API that allows commands to append new commands into the AOF
file as clients execute them.
The `call()` function defined inside `server.c` is responsible for calling
the functions that in turn will write the commands into the AOF.
db.c
---
Certain Redis commands operate on specific data types; others are general.
Examples of generic commands are `DEL` and `EXPIRE`. They operate on keys
and not on their values specifically. All those generic commands are
defined inside `db.c`.
Moreover `db.c` implements an API in order to perform certain operations
on the Redis dataset without directly accessing the internal data structures.
The most important functions inside `db.c` which are used in many command
implementations are the following:
* `lookupKeyRead()` and `lookupKeyWrite()` are used in order to get a pointer to the value associated to a given key, or `NULL` if the key does not exist.
* `dbAdd()` and its higher level counterpart `setKey()` create a new key in a Redis database.
* `dbDelete()` removes a key and its associated value.
* `emptyDb()` removes an entire single database or all the databases defined.
The rest of the file implements the generic commands exposed to the client.
object.c
---
The `robj` structure defining Redis objects was already described. Inside
`object.c` there are all the functions that operate with Redis objects at
a basic level, like functions to allocate new objects, handle the reference
counting and so forth. Notable functions inside this file:
* `incrRefCount()` and `decrRefCount()` are used in order to increment or decrement an object reference count. When it drops to 0 the object is finally freed.
* `createObject()` allocates a new object. There are also specialized functions to allocate string objects having a specific content, like `createStringObjectFromLongLong()` and similar functions.
This file also implements the `OBJECT` command.
replication.c
---
This is one of the most complex files inside Redis, it is recommended to
approach it only after getting a bit familiar with the rest of the code base.
In this file there is the implementation of both the master and replica role
of Redis.
One of the most important functions inside this file is `replicationFeedSlaves()` that writes commands to the clients representing replica instances connected
to our master, so that the replicas can get the writes performed by the clients:
this way their data set will remain synchronized with the one in the master.
This file also implements both the `SYNC` and `PSYNC` commands that are
used in order to perform the first synchronization between masters and
replicas, or to continue the replication after a disconnection.
Script
---
The script unit is composed of 3 units:
* `script.c` - integration of scripts with Redis (commands execution, set replication/resp, ...)
* `script_lua.c` - responsible to execute Lua code, uses script.c to interact with Redis from within the Lua code.
* `function_lua.c` - contains the Lua engine implementation, uses script_lua.c to execute the Lua code.
* `functions.c` - contains Redis Functions implementation (FUNCTION command), uses functions_lua.c if the function it wants to invoke needs the Lua engine.
* `eval.c` - contains the `eval` implementation using `script_lua.c` to invoke the Lua code.
Other C files
---
* `t_hash.c`, `t_list.c`, `t_set.c`, `t_string.c`, `t_zset.c` and `t_stream.c` contains the implementation of the Redis data types. They implement both an API to access a given data type, and the client command implementations for these data types.
* `ae.c` implements the Redis event loop, it's a self contained library which is simple to read and understand.
* `sds.c` is the Redis string library, check https://github.com/antirez/sds for more information.
* `anet.c` is a library to use POSIX networking in a simpler way compared to the raw interface exposed by the kernel.
* `dict.c` is an implementation of a non-blocking hash table which rehashes incrementally.
* `cluster.c` implements the Redis Cluster. Probably a good read only after being very familiar with the rest of the Redis code base. If you want to read `cluster.c` make sure to read the [Redis Cluster specification][4].
[4]: https://redis.io/topics/cluster-spec
Anatomy of a Redis command
---
All the Redis commands are defined in the following way:
```c
void foobarCommand(client *c) {
printf("%s",c->argv[1]->ptr); /* Do something with the argument. */
addReply(c,shared.ok); /* Reply something to the client. */
}
```
The command function is referenced by a JSON file, together with its metadata, see `commands.c` described above for details.
The command flags are documented in the comment above the `struct redisCommand` in `server.h`.
For other details, please refer to the `COMMAND` command. https://redis.io/commands/command/
3. Найдите и установите значения "yes" для параметров "active-replica" и "multi-master". После чего добавьте в файл конфигурации ip-адреса, узлов вашего кластера. Если вы всё сделали правильно у вас должны отробразится строки в файле конфигурации `futriix.conf` как показано ниже:
After the command operates in some way, it returns a reply to the client,
usually using `addReply()` or a similar function defined inside `networking.c`.
```sh
There are tons of command implementations inside the Redis source code
that can serve as examples of actual commands implementations (e.g. pingCommand). Writing
a few toy commands can be a good exercise to get familiar with the code base.
active-replica yes
multi-master yes
replicaof 192.168.11.5 9880
replicaof 192.168.11.6 9880
replicaof 192.168.11.7 9880
There are also many other files not described here, but it is useless to
cover everything. We just want to help you with the first steps.
Eventually you'll find your way inside the Redis code base :-)
5. Перейдите в директорию Futriix и запустите скрипт `cluster.sh`с параметрами `pick` (скрипт запущенный с данным параметром "соберёт кластер"), и `run`,(скрипт запущенный с данным параметром "запустит кластер") как указано ниже:
```sh
unix:$ ./cluster pick
unix:$ ./cluster run
```
6. Установите права на исполнение на скрипт `cluster.sh` , воспользовавшись командой ниже:
```sh
unix:$ chmod +x cluster.sh
```
7. Для остановки кластера запустите скрипт `cluster.sh`с параметром `stop`
Вклады — это то, что делает сообщество открытого исходного кода таким замечательным местом для обучения, вдохновения и творчества. Любой ваш вклад **очень ценится**.
Если у вас есть предложение, которое могло бы улучшить ситуацию, создайте форк репозитория и создайте запрос на включение. Также можно просто открыть задачу с тегом «улучшение».
Не забудьте поставить проекту звезду! Еще раз спасибо!
1. Форкните проект
2. Создайте свою ветку функций (`git checkout -b Feature/AmazingFeature`)
3. Зафиксируйте свои изменения (git commit -m 'Add some AmazingFeature'`)
4. Отправьте в ветку (`git push origin Feature/AmazingFeature`)
5. Откройте запрос на включение
<!-- LICENSE -->
## Лицензия
Проект распространяется под 3-пунктной лицензией BSD. Подробнсти смотрите в файле `COPYING.txt`.
Redis is generally backwards compatible with very few exceptions, so we
recommend users to always use the latest version to experience stability,
performance and security.
We generally backport security issues to a single previous major version,
unless this is not possible or feasible with a reasonable effort.
| Version | Supported |
| ------- | ------------------ |
| 7.0.x | :white_check_mark: |
| 6.2.x | :white_check_mark: |
| 6.0.x | :white_check_mark: |
| <6.0|:x:|
## Reporting a Vulnerability
If you believe you’ve discovered a serious vulnerability, please contact the
Redis core team at redis@redis.io. We will evaluate your report and if
necessary issue a fix and an advisory. If the issue was previously undisclosed,
we’ll also mention your name in the credits.
## Responsible Disclosure
In some cases, we may apply a responsible disclosure process to reported or
otherwise discovered vulnerabilities. We will usually do that for a critical
vulnerability, and only if we have a good reason to believe information about
it is not yet public.
This process involves providing an early notification about the vulnerability,
its impact and mitigations to a short list of vendors under a time-limited
embargo on public disclosure.
Vendors on the list are individuals or organizations that maintain Redis
distributions or provide Redis as a service, who have third party users who
will benefit from the vendor’s ability to prepare for a new version or deploy a
fix early.
If you believe you should be on the list, please contact us and we will
consider your request based on the above criteria.
If you believe you've discovered a security vulnerability, please contact the Valkey team at security@lists.valkey.io.
Please *DO NOT* create an issue.
We follow a responsible disclosure procedure, so depending on the severity of the issue we may notify Valkey vendors about the issue before releasing it publicly.
If you would like to be added to our list of vendors, please reach out to the Valkey team at maintainers@lists.valkey.io.
This directory contains all Redis dependencies, except for the libc that
This directory contains all Valkey dependencies, except for the libc that
should be provided by the operating system.
* **Jemalloc** is our memory allocator, used as replacement for libc malloc on Linux by default. It has good performances and excellent fragmentation behavior. This component is upgraded from time to time.
* **hiredis** is the official C client library for Redis. It is used by redis-cli, redis-benchmark and Redis Sentinel. It is part of the Redis official ecosystem but is developed externally from the Redis repository, so we just upgrade it as needed.
* **linenoise** is a readline replacement. It is developed by the same authors of Redis but is managed as a separated project and updated as needed.
* **linenoise** is a readline replacement. It is developed by the same authors of Valkey but is managed as a separated project and updated as needed.
* **lua** is Lua 5.1 with minor changes for security and additional libraries.
* **hdr_histogram** Used for per-command latency tracking histograms.
* **fast_float** is a replacement for strtod to convert strings to floats efficiently.
How to upgrade the above dependencies
===
@ -13,10 +14,10 @@ How to upgrade the above dependencies
Jemalloc
---
Jemalloc is modified with changes that allow us to implement the Redis
active defragmentation logic. However this feature of Redis is not mandatory
and Redis is able to understand if the Jemalloc version it is compiled
against supports such Redis-specific modifications. So in theory, if you
Jemalloc is modified with changes that allow us to implement the Valkey
active defragmentation logic. However this feature of Valkey is not mandatory
and Valkey is able to understand if the Jemalloc version it is compiled
against supports such Valkey-specific modifications. So in theory, if you
are not interested in the active defragmentation, you can replace Jemalloc
just following these steps:
@ -28,7 +29,7 @@ just following these steps:
Jemalloc configuration script is broken and will not work nested in another
git repository.
However note that we change Jemalloc settings via the `configure` script of Jemalloc using the `--with-lg-quantum` option, setting it to the value of 3 instead of 4. This provides us with more size classes that better suit the Redis data structures, in order to gain memory efficiency.
However note that we change Jemalloc settings via the `configure` script of Jemalloc using the `--with-lg-quantum` option, setting it to the value of 3 instead of 4. This provides us with more size classes that better suit the Valkey data structures, in order to gain memory efficiency.
If you want to upgrade Jemalloc while also providing support for
active defragmentation, in addition to the above steps you need to perform
@ -38,7 +39,7 @@ the following additional steps:
to add `#define JEMALLOC_FRAG_HINT`.
6. Implement the function `je_get_defrag_hint()` inside `src/jemalloc.c`. You
can see how it is implemented in the current Jemalloc source tree shipped
with Redis, and rewrite it according to the new Jemalloc internals, if they
with Valkey, and rewrite it according to the new Jemalloc internals, if they
changed, otherwise you could just copy the old implementation if you are
upgrading just to a similar version of Jemalloc.
@ -61,7 +62,7 @@ cd deps/jemalloc
Hiredis
---
Hiredis is used by Sentinel, `redis-cli` and `redis-benchmark`. Like Redis, uses the SDS string library, but not necessarily the same version. In order to avoid conflicts, this version has all SDS identifiers prefixed by `hi`.
Hiredis is used by Sentinel, `valkey-cli` and `valkey-benchmark`. Like Valkey, uses the SDS string library, but not necessarily the same version. In order to avoid conflicts, this version has all SDS identifiers prefixed by `hi`.
This should hopefully merge the local changes into the new version.
@ -71,7 +72,7 @@ Linenoise
---
Linenoise is rarely upgraded as needed. The upgrade process is trivial since
Redis uses a non modified version of linenoise, so to upgrade just do the
Valkey uses a non modified version of linenoise, so to upgrade just do the
following:
1. Remove the linenoise directory.
@ -81,11 +82,11 @@ Lua
---
We use Lua 5.1 and no upgrade is planned currently, since we don't want to break
Lua scripts for new Lua features: in the context of Redis Lua scripts the
Lua scripts for new Lua features: in the context of Valkey Lua scripts the
capabilities of 5.1 are usually more than enough, the release is rock solid,
and we definitely don't want to break old scripts.
So upgrading of Lua is up to the Redis project maintainers and should be a
So upgrading of Lua is up to the Valkey project maintainers and should be a
manual procedure performed by taking a diff between the different versions.
Currently we have at least the following differences between official Lua 5.1
@ -94,6 +95,7 @@ and our version:
1. Makefile is modified to allow a different compiler than GCC.
2. We have the implementation source code, and directly link to the following external libraries: `lua_cjson.o`, `lua_struct.o`, `lua_cmsgpack.o` and `lua_bit.o`.
3. There is a security fix in `ldo.c`, line 498: The check for `LUA_SIGNATURE[0]` is removed in order to avoid direct bytecode execution.
4. In `lstring.c`, the luaS_newlstr function's hash calculation has been upgraded from a simple hash function to MurmurHash3, implemented within the same file, to enhance performance, particularly for operations involving large strings.
Hdr_Histogram
---
@ -104,3 +106,17 @@ We use a customized version based on master branch commit e4448cf6d1cd08fff51981
2. Copy updated files from newer version onto files in /hdr_histogram.
3. Apply the changes from 1 above to the updated files.
fast_float
---
The fast_float library provides fast header-only implementations for the C++ from_chars functions for `float` and `double` types as well as integer types. These functions convert ASCII strings representing decimal values (e.g., `1.3e10`) into binary types. The functions are much faster than comparable number-parsing functions from existing C++ standard libraries.
Specifically, `fast_float` provides the following function to parse floating-point numbers with a C++17-like syntax (the library itself only requires C++11):
@ -17,7 +17,7 @@ Line editing with some support for history is a really important feature for com
So what usually happens is either:
* Large programs with configure scripts disabling line editing if readline is not present in the system, or not supporting it at all since readline is GPL licensed and libedit (the BSD clone) is not as known and available as readline is (Real world example of this problem: Tclsh).
* Smaller programs not using a configure script not supporting line editing at all (A problem we had with Redis-cli for instance).
* Smaller programs not using a configure script not supporting line editing at all (A problem we had with Valkey-cli for instance).
The result is a pollution of binaries without line editing support.
@ -108,7 +108,7 @@ to search and re-edit already inserted lines of text.
The followings are the history API calls:
int linenoiseHistoryAdd(const char *line);
int linenoiseHistoryAdd(const char *line, int is_sensitive);
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.