Two jobs were missing from the job list for failure notification
* test-ubuntu-tls-io-threads
* test-sanitizer-force-defrag
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Addresses the failure here:
https://github.com/valkey-io/valkey/actions/runs/13000845302/job/36259016156#step:5:7272.
This change does three things:
1. For some reason TCL 8.5 (which is used on macOS) handles `\x03ba`
as `\xba`: according to
https://www.tcl-lang.org/man/tcl8.5/TclCmd/Tcl.htm#M27, `\x` consumes all
following hex digits and ignores all but the last two, so we encode
"bar" using hex escapes too.
2. Fix a spacing issue.
3. Make it so that if the restore fails, it immediately errors.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Do command existence and arity checks when loading AOF to avoid crash
Currently, loading commands such as `cluster` or `cluster slots xxx`
from AOF will cause the server to crash.
1. `cluster` is a container command, and executing its proc will cause a
crash because we do not check the subcommand and arity.
2. For `cluster slots xxx`, the arity check fails, an error is replied to
the AOF client, and a panic is triggered.
Of course, there are many other ways for a problematic AOF to cause a
panic, but it is still necessary to do some basic checks before executing.
This way, in these basic cases, we can print useful error messages
instead of crashing directly.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Use the Linux mmap/munmap syscalls to manage an RDMA memory region, so
that we get a guard-page-protected VMA like this (cat /proc/PID/maps):
785018afe000-785018aff000 ---p 00000000 00:00 0 -> top guard page
785018aff000-785018bff000 rw-p 00000000 00:00 0 -> RDMA memory region
785018bff000-785018c00000 ---p 00000000 00:00 0 -> bottom guard page
Once any code accesses this memory unexpectedly, a segmentation fault occurs.
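As a rough illustration (not the actual RDMA code), a guard-page-protected
region can be built by reserving the whole range with PROT_NONE and enabling
access only on the middle part:
```c
/* Minimal sketch: reserve len + 2 pages with no access, then open up only
 * the middle region. Touching either guard page faults immediately. */
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static void *alloc_guarded(size_t len) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *base = mmap(NULL, len + 2 * page, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) return NULL;
    if (mprotect(base + page, len, PROT_READ | PROT_WRITE) != 0) {
        munmap(base, len + 2 * page);
        return NULL;
    }
    return base + page; /* base-1 and base+len land on guard pages */
}

int main(void) {
    char *p = alloc_guarded(1024 * 1024);
    if (!p) return 1;
    p[0] = 'x';      /* fine: inside the RW region */
    /* p[-1] = 'x';  would SIGSEGV on a guard page */
    printf("guarded region at %p\n", (void *)p);
    return 0;
}
```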
Signed-off-by: zhenwei pi <zhenwei.pi@linux.dev>
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
In C, it is best to initialize every variable in a struct; this PR fixes
one missed variable initialization.
---------
Signed-off-by: hwware <wen.hui.ware@gmail.com>
This addresses 2 issues:
1. It is possible (somehow) that the inner server client (r) was not
using RESP3 when entering this test. This makes sure it is.
2. In case the test fails, it might leave the redirection client closed.
There is a cross-test assumption that it should be open, so most of the
assert checks were moved to the end of the test.
example fail:
https://github.com/valkey-io/valkey/actions/runs/12979601179/job/36195523412
---------
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
New config `rdb-version-check` with values:
* `strict`: Reject future RDB versions.
* `relaxed`: Try parsing future RDB versions and fail only when an
unknown RDB opcode or type is encountered.
This can make it possible for Valkey 8.1 to try to read a dump from, for
example, Valkey 9.0 or later on a best-effort basis. The conditions for
when this is expected to work can be defined when the future Valkey
versions are released. Loading is expected to fail in the following
cases:
* If the data set contains any new key types or other data elements not
supported by the current version.
* If the RDB contains new representations or encodings of existing key
types or other data elements.
This change also prepares for the next RDB version bump. A range of RDB
versions (12-79) is reserved, since it's expected to be used by foreign
software RDB versions, so Valkey will not accept versions in this range
even with the `relaxed` version check. The DUMP/RESTORE format has no
magic string; only the RDB version number.
This change also prepares for the magic string to change from REDIS to
VALKEY next time we bump the RDB version.
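A minimal sketch of the acceptance rule described above (constant names and
values here are illustrative, not the actual server code):
```c
#include <stdbool.h>

#define RDB_VERSION 11             /* illustrative current version */
#define RDB_FOREIGN_VERSION_MIN 12 /* reserved range, see above */
#define RDB_FOREIGN_VERSION_MAX 79

typedef enum { RDB_CHECK_STRICT, RDB_CHECK_RELAXED } rdbCheckMode;

static bool rdbVersionAcceptable(int ver, rdbCheckMode mode) {
    if (ver <= RDB_VERSION) return true; /* current or older is always fine */
    /* The reserved range is expected to be used by foreign software, so it
     * is rejected even in relaxed mode. */
    if (ver >= RDB_FOREIGN_VERSION_MIN && ver <= RDB_FOREIGN_VERSION_MAX)
        return false;
    /* Future versions: best-effort parsing in relaxed mode only; loading
     * still fails later on any unknown opcode or type. */
    return mode == RDB_CHECK_RELAXED;
}
```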
Related to #1108.
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
A use-after-free has been detected by the address sanitizer, such as in this
test run:
https://github.com/valkey-io/valkey/actions/runs/12981530413/job/36200075972?pr=1620#step:5:1339
`hashtableShrinkIfNeeded` may free one of the hash tables and invalidate
the variables used by the `fillBucketHole(ht, b, pos_in_bucket,
table_index)` call just after it, causing a use-after-free. Filling the
bucket hole first and shrinking afterwards is assumed to solve the issue.
(Not reproduced locally.)
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Might close https://github.com/valkey-io/valkey/issues/1484.
I noticed that we don't disable pause-after-fork on the last test that
was executed, so it might get stuck in pause loops after the test ends
if it tries another psync for any reason.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Fixes intermittent failures in the unit test for hashtable random
fairness when running with the `--accurate` flag.
https://github.com/valkey-io/valkey/actions/runs/12969591890/job/36173815884#step:10:105
The test case picks a random element out of 400, repeated 1M times, and
then checks that 60% of the elements are picked within 3 standard
deviations from the number of times they're expected to be picked. In
this test run (with `--accurate`), the expected number is 2500 and the
standard deviation is 50 (each element's pick count is roughly Poisson
distributed, so σ ≈ √2500 = 50), which is only 2% of the expected value.
This makes the check too strict and makes the test flaky.
As an alternative, we allow 80% of the elements to be picked within 10%
of the expected number. With this alternative condition, we can also
raise the check for the non-edge case from 60% to 80% of the elements to
be within 3 standard deviations. (With fewer repetitions, 3 standard
deviations is greater than 10% of the expected value, so this new
condition only affects the `--accurate` test run.)
Additional change: Set a random seed to the hash function in the test
suite. Until now, we only seeded the random number generator.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
As discussed in PR #336.
We have different types of resources like CPU, memory, network, etc. The
`slowlog` can only record commands that consume a lot of CPU during the
processing phase (which doesn't include network read/write time), but it
cannot record commands that consume a lot of memory or network bandwidth.
For example:
1. Running a "SET key value(10 megabytes)" command would not be recorded
in the slowlog, since when processing it the SET command only inserts the
value's pointer into the db dict. But that command consumes huge memory
in the query buffer and bandwidth from the network. In this case, just
1000 TPS can cause 10 GB/s of network traffic.
2. Running a "GET key" command where the key's value is 10 megabytes long
consumes huge memory in the output buffer and bandwidth to the network.
This PR introduces a new command `COMMANDLOG`, to log commands that
consume significant network bandwidth, including both input and output.
Users can retrieve the results using `COMMANDLOG get <count>
large-request` and `COMMANDLOG get <count> large-reply`, all subcommands
for `COMMANDLOG` are:
* `COMMANDLOG HELP`
* `COMMANDLOG GET <count> <slow|large-request|large-reply>`
* `COMMANDLOG LEN <slow|large-request|large-reply>`
* `COMMANDLOG RESET <slow|large-request|large-reply>`
And the slowlog is also incorporated into the commandlog.
For each of these three types, additional configs have been added for
control:
* `commandlog-request-larger-than` and
`commandlog-large-request-max-len` represent the threshold for large
requests (in bytes) and the maximum number of commands that can be
recorded.
* `commandlog-reply-larger-than` and `commandlog-large-reply-max-len`
represent the threshold for large replies (in bytes) and the maximum
number of commands that can be recorded.
* `commandlog-execution-slower-than` and
`commandlog-slow-execution-max-len` represent the threshold for slow
executions (in microseconds) and the maximum number of commands that
can be recorded.
* Additionally, `slowlog-log-slower-than` and `slowlog-max-len` are now
aliases for the two slow-execution configs above.
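As a rough sketch of how such a threshold could gate a log entry (the names
and the "negative disables" semantics here are hypothetical, merely mirroring
the slowlog-log-slower-than convention, not the actual implementation):
```c
#include <stddef.h>

/* Hypothetical config holder for the three thresholds described above. */
typedef struct {
    long long request_larger_than;   /* bytes; assumed: negative disables */
    long long reply_larger_than;     /* bytes */
    long long execution_slower_than; /* microseconds */
} commandlogConfig;

static int shouldLogLargeRequest(const commandlogConfig *cfg, size_t request_bytes) {
    if (cfg->request_larger_than < 0) return 0; /* log disabled */
    return request_bytes >= (size_t)cfg->request_larger_than;
}
```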
---------
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
This PR builds upon the [previous entry prefetching
optimization](https://github.com/valkey-io/valkey/pull/1501) to further
enhance performance by implementing value prefetching for hashtable
iterators.
## Implementation
Modified `hashtableInitIterator` to accept a new flags parameter,
allowing control over iterator behavior.
Implemented conditional value prefetching within `hashtableNext` based
on the new `HASHTABLE_ITER_PREFETCH_VALUES` flag.
When the flag is set, hashtableNext now calls `prefetchBucketValues` at
the start of each new bucket, preemptively loading the values of filled
entries into the CPU cache.
The actual prefetching of values is performed using type-specific
callback functions implemented in `server.c`:
- For `robj` the `hashtableObjectPrefetchValue` callback is used to
prefetch the value if it is not embedded.
This implementation is specifically focused on main database iterations
at this stage. Applying it to hashtables that hold other object types
should not be problematic, but its performance benefits for those cases
will need to be proven through testing and benchmarking.
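As an illustration of the value-prefetch callback, here is a minimal sketch
assuming a typical robj layout and the GCC/Clang __builtin_prefetch
intrinsic (this is a sketch, not the code in server.c):
```c
#define OBJ_ENCODING_EMBSTR 8 /* illustrative value */

typedef struct robj {
    unsigned type : 4;
    unsigned encoding : 4;
    void *ptr; /* points outside the robj unless the value is embedded */
} robj;

static void prefetchObjectValue(const robj *o) {
    /* Embedded strings live inside the robj allocation, which is already
     * being touched, so there is nothing extra to prefetch. */
    if (o->encoding != OBJ_ENCODING_EMBSTR && o->ptr != NULL)
        __builtin_prefetch(o->ptr, 0 /* read */, 1 /* low temporal locality */);
}
```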
## Performance
### Setup:
- 64-core Graviton 3 Amazon EC2 instance.
- 50 million keys with different value sizes.
- Running the valkey server over a RAM file system.
- CRC checksum and compression off.
### Action
- save command.
### Results
The results regarding the duration of the "save" command were taken from
the "info all" command.
```
+--------------------+------------------+------------------+
| Prefetching | Value size (byte)| Time (seconds) |
+--------------------+------------------+------------------+
| No | 100 | 20.112279 |
| Yes | 100 | 12.758519 |
| No | 40 | 16.945366 |
| Yes | 40 | 10.902022 |
| No | 20 | 9.817000 |
| Yes | 20 | 9.626821 |
| No | 10 | 9.71510 |
| Yes | 10 | 9.510565 |
+--------------------+------------------+------------------+
```
The results largely align with our expectations, showing significant
improvements for larger values (100 bytes and 40 bytes) that are stored
outside the robj. For smaller values (20 bytes and 10 bytes) that are
embedded within the robj, we see almost no improvement, which is as
expected.
However, the small improvement observed even for these embedded values
is somewhat surprising. Given that we are not actively prefetching these
embedded values, this minor performance gain was not anticipated.
perf record on save command **without** value prefetching:
```
--99.98%--rdbSaveDb
|
|--91.38%--rdbSaveKeyValuePair
| |
| |--42.72%--rdbSaveRawString
| | |
| | |--26.69%--rdbWriteRaw
| | | |
| | | --25.75%--rioFileWrite.lto_priv.0
| | |
| | --15.41%--rdbSaveLen
| | |
| | |--7.58%--rdbWriteRaw
| | | |
| | | --7.08%--rioFileWrite.lto_priv.0
| | | |
| | | --6.54%--_IO_fwrite
| | |
| | |
| | --7.42%--rdbWriteRaw.constprop.1
| | |
| | --7.18%--rioFileWrite.lto_priv.0
| | |
| | --6.73%--_IO_fwrite
| |
| |
| |--40.44%--rdbSaveStringObject
| |
| --7.62%--rdbSaveObjectType
| |
| --7.39%--rdbWriteRaw.constprop.1
| |
| --7.04%--rioFileWrite.lto_priv.0
| |
| --6.59%--_IO_fwrite
|
|
--7.33%--hashtableNext.constprop.1
|
--6.28%--prefetchNextBucketEntries.lto_priv.0
```
perf record on save command **with** value prefetching:
```
rdbSaveRio
|
--99.93%--rdbSaveDb
|
|--79.81%--rdbSaveKeyValuePair
| |
| |--66.79%--rdbSaveRawString
| | |
| | |--42.31%--rdbWriteRaw
| | | |
| | | --40.74%--rioFileWrite.lto_priv.0
| | |
| | --23.37%--rdbSaveLen
| | |
| | |--11.78%--rdbWriteRaw
| | | |
| | | --11.03%--rioFileWrite.lto_priv.0
| | | |
| | | --10.30%--_IO_fwrite
| | | |
| | |
| | --10.98%--rdbWriteRaw.constprop.1
| | |
| | --10.44%--rioFileWrite.lto_priv.0
| | |
| | --9.74%--_IO_fwrite
| | |
| |
| |--11.33%--rdbSaveObjectType
| | |
| | --10.96%--rdbWriteRaw.constprop.1
| | |
| | --10.51%--rioFileWrite.lto_priv.0
| | |
| | --9.75%--_IO_fwrite
| | |
| |
| --0.77%--rdbSaveStringObject
|
--18.39%--hashtableNext
|
|--10.04%--hashtableObjectPrefetchValue
|
--6.06%--prefetchNextBucketEntries
```
Conclusions:
The prefetching strategy appears to be working as intended, shifting the
performance bottleneck from data access to I/O operations.
The significant reduction in rdbSaveStringObject time suggests that
string objects (which are the values) are being accessed more
efficiently.
Signed-off-by: NadavGigi <nadavgigi102@gmail.com>
This includes a way to run two versions of the server from the TCL test
framework. It's a preparation to add more cross-version tests. The
runtest script accepts a new parameter
./runtest --other-server-path path/to/valkey-server
and a new tag "needs:other-server" for test cases and start_server.
Tests with this tag are automatically skipped if `--other-server-path`
is not provided.
This PR adds it in a CI job with Valkey 7.2.7 by downloading a binary
release.
Fixes #76
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The desync regression test was created as a regression test for the
following bug:
if we embed a NUL termination inside an inline/multi-bulk message, we
will not be able to perform strchr in order to identify the
newline (\n) / carriage return (\r) in the client query buffer.
This can cause (for example) a replica reading the primary stream to
keep filling its query buffer endlessly, consuming more and more memory.
In order to handle the above risk, a check was added to verify the
inline bulk and multi-bulk size are not exceeding the 64K bytes in the
query-buffer. A test was placed in order to verify this.
This PR introduces the following fixes to the desync regression test:
1. Fix the sent payload to flush 1024-byte blocks of 'A's instead of
the string 'payload', which was sent by mistake.
2. Make sure that the connection is correctly terminated by the server
with a protocol error after exceeding 64K, and not before reaching 64K.
3. Add another test which will also verify a nested bulk with embedded
NUL termination (this was not verified before).
fixes https://github.com/valkey-io/valkey/issues/1583
NOTE: Although it is possible to change the use of strchr to a "safer"
utility (e.g. memchr) which will not stop scanning at the first occurrence
of '\0', we still want to protect against excessive usage of the query
buffer and also preserve the current behavior. We will look into
improving this in a follow-up issue.
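To make the strchr/memchr distinction concrete, a small standalone demo
(illustrative only): strchr stops scanning at the embedded NUL, while
memchr scans the full buffer length.
```c
#include <stdio.h>
#include <string.h>

int main(void) {
    /* An inline command with an embedded NUL before the newline. */
    const char buf[] = {'P', 'I', 'N', 'G', '\0', 'x', '\n'};
    const char *s = strchr(buf, '\n');              /* NULL: stops at the NUL */
    const char *m = memchr(buf, '\n', sizeof(buf)); /* finds the newline */
    printf("strchr: %s, memchr: %s\n",
           s ? "found" : "not found", m ? "found" : "not found");
    return 0;
}
```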
---------
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com>
Change the format of the dual-channel replication logs so that it does
not conflict with existing log formats such as the one used by modules.
Fixes: https://github.com/valkey-io/valkey/issues/1509
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
The test case checks for expire-cycle in LATENCY LATEST, but with the
new hash table, the expire-cycle is too fast to be logged by the latency
monitor. Lower the latency monitor threshold to make it more likely to
be logged.
Fixes #1580
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
When processing a cluster bus PING extension, there is a memory leak
when adding a new key to the `nodes_black_list` dict. We now make sure
to free the key `sds` if the dict did not take ownership of it.
Signed-off-by: Pierre Turin <pieturin@amazon.com>
This issue affected only two message types (CLUSTERMSG_TYPE_PUBLISH and CLUSTERMSG_TYPE_PUBLISHSHARD) because they used a light message header, which caused the CLUSTER INFO stats to miss sent/received message information for those types.
---------
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
This commit creates a new compilation unit for the scripting engine code
by extracting the existing code from the functions unit.
We're doing this refactor to prepare the code for running the `EVAL`
command using different scripting engines.
This PR has a module API change: we changed the type of error messages
returned by the callback
`ValkeyModuleScriptingEngineCreateFunctionsLibraryFunc` to be a
`ValkeyModuleString` (aka `robj`).
This PR also fixes #1470.
---------
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
Some commands that use unix-time, such as `EXPIREAT` and `SET EXAT`, should include the deleted keys in the `expired_keys` statistics if the specified time has already expired, and notifications should be sent in the same manner as for expired keys.
---------
Signed-off-by: Ray Cao <zisong.cw@alibaba-inc.com>
Just like the spell-check workflow, we should allow triggering it
on push events, so that fork repos can notice formatting issues
well before submitting a PR.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Adds filter options to CLIENT LIST:
* USER <username>
Return clients authenticated by <username>.
* ADDR <ip:port>
Return clients connected from the specified address.
* LADDR <ip:port>
Return clients connected to the specified local address.
* SKIPME (YES|NO)
Exclude the current client from the list (default: no).
* MAXAGE <maxage>
Only list connections older than the specified age.
Modifies the ID filter to CLIENT KILL to allow multiple IDs
* ID <client-id> [<client-id>...]
Kill connections by client ids.
This makes CLIENT LIST and CLIENT KILL accept the same options.
For backward compatibility, the default value for SKIPME is NO for
CLIENT LIST and YES for CLIENT KILL.
The MAXAGE option comes from CLIENT KILL, where it *keeps* clients with the
given max age and kills the older ones. This logic becomes weird for
CLIENT LIST, but is kept for similarity with CLIENT KILL, for the use case
of first testing manually using CLIENT LIST, and then running CLIENT
KILL with the same filters.
The `ID client-id [client-id ...]` no longer needs to be the last
filter. The parsing logic determines if an argument is an ID or not
based on whether it can be parsed as an integer or not.
Partly addresses: #668
---------
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Add `paused_actions` and `paused_timeout_milliseconds` to INFO Clients
to inform users whether clients are paused.
---------
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
`sds` is a typedef of `char *`.
`const sds` means `char * const`, i.e. a const-pointer to non-const
content.
More often, you would want `const char *`, i.e. a pointer to
const content. Until now, it has not been possible to express that. This
PR adds `const_sds`, which is a pointer to const-content sds.
To get a const-pointer to const-content sds, you can use `const
const_sds`.
In this PR, some uses of `const sds` are replaced by `const_sds`. We can
use it more later.
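To illustrate the difference (the function and variable names here are just
for demonstration):
```c
typedef char *sds;             /* as in sds.h */
typedef const char *const_sds; /* the new typedef */

void example(sds s, const sds cs, const_sds ccs) {
    s[0] = 'x';       /* ok: mutable pointer to mutable content */
    cs[0] = 'x';      /* also ok (!): `const sds` is `char *const`, so only
                       * the pointer is const; the content stays mutable */
    /* cs = s; */     /* error: the pointer itself is const */
    /* ccs[0] = 'x'; */ /* error: const_sds points to const content */
    ccs = s;          /* ok: the pointer is mutable, the content is not */
    (void)ccs;
}
```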
Fixes #1542
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
In some cases unix groups can contain whitespace and/or `\` characters.
One example is my workstation. It's a macOS machine in an Active Directory
domain, so my user has the group `LD\Domain Users`.
Running `make test` on `unstable` and `8.0` branches fails with:
I'm not sure if we need to fix this in 8.0. But it seems that it should
be fixed in unstable.
Signed-off-by: secwall <secwall@yandex-team.ru>
This PR replaces dict with the new hashtable data structure in the HASH
datatype. There is a new struct for hashtable items which contains a
pointer to value sds string and the embedded key sds string. These
values were previously stored in dictEntry. This structure is kept
opaque so we can easily add small value embedding or other optimizations
in the future.
Closes #1095
---------
Signed-off-by: Rain Valentine <rsg000@gmail.com>
After #1545 disabled some tests for reply schema validation, we now have
another issue that ECHO is not covered.
```
WARNING! The following commands were not hit at all:
echo
ERROR! at least one command was not hit by the tests
```
This patch adds a test case for ECHO in the unit/other test suite. I
haven't checked if there are more commands that aren't covered.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The commands used in valkey-cli tests are not important for reply schema
validation. Skip them to avoid the problem of tests hanging. This has
failed lately in the daily job:
```
[TIMEOUT]: clients state report follows.
sock55fedcc19be0 => (IN PROGRESS) valkey-cli pubsub mode with single standard channel subscription
Killing still running Valkey server 33357
```
These test cases use a special `:get pubsub` command, which is an
internal valkey-cli command rather than a Valkey server command. This
command hangs when compiled with logreqres enabled. The easy solution
is to skip the tests in this setup.
The test cases were introduced in #1432.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
When the cluster changes, we need to persist the cluster configuration,
and these file IO operations may cause latency.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Imagine we have a cluster, for example a three-shard cluster.
If shard 1 does a CLUSTER RESET HARD, it will change its node
name, and then the other nodes will mark it as NOADDR since the
node name received by PONG has changed.
In the eyes of the other nodes, there is one working primary node
left but with no address, and in this case, the address reported
in MOVED will be invalid and will confuse the clients. At the same
time, the replica will not failover since its primary is not in
the FAIL state. And the cluster looks OK to everyone.
This leaves a cluster that appears OK, but with no coverage for
shard 1. Obviously we should do something like CLUSTER FORGET
to remove the node and fix the cluster before using it.
But the point here is that we can mark the NOADDR node as FAIL to
advance the cluster state. If a node is NOADDR, it means it does
not have a valid address, so we won't reconnect to it, we won't
send PING, and we won't gossip it; it seems reasonable to mark it
as FAIL.
Signed-off-by: Binbin <binloveplay1314@qq.com>
When multiple primary nodes fail simultaneously, the cluster cannot recover
within the default effective time (data_age limit). The main reason is that
the vote is cast without ranking among the replica nodes, which causes too
many epoch conflicts.
Therefore, we introduce ranking based on the failed primary shard-id.
A new failed_primary_rank variable holds the rank of this instance in the
context of the list of all failed primaries. This variable is used in
failover: we send the failover election packets in order based on the
rank, which effectively avoids voting conflicts.
If a single primary is down, the behavior is the same as before. If multiple
primaries are down, their replicas' election initiation times will be delayed
by 500ms according to the ranking.
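A sketch of the staggering arithmetic (hypothetical helper; the real election
delay also includes the usual fixed, random, and replica-rank components):
```c
#include <stdint.h>

/* Replicas whose failed primary is ranked N in the failed-primary list
 * start their election N * 500 ms later, so elections from different
 * failed shards are less likely to collide in the same epoch. */
static uint64_t electionExtraDelayMs(int failed_primary_rank) {
    return (uint64_t)failed_primary_rank * 500;
}
```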
Signed-off-by: Binbin <binloveplay1314@qq.com>
When latency-monitor-threshold is set to 0, it means the latency monitor
is disabled, but in VM_LatencyAddSample we wrote the condition
incorrectly, causing us to record latency even when the monitor was
turned off.
This bug was introduced on the very first day; see e3b1d6d, which was
merged in 2019.
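A minimal sketch of the intended guard (a simplified stand-in, not the
literal VM_LatencyAddSample code):
```c
#include <stdbool.h>

/* Threshold 0 means the latency monitor is disabled: record nothing. */
static bool shouldRecordLatency(long long threshold_ms, long long sample_ms) {
    if (threshold_ms == 0) return false;
    return sample_ms >= threshold_ms;
}
```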
Signed-off-by: Binbin <binloveplay1314@qq.com>
In #1441, we found an assert and decided to remove it and instead
just free the newly created node and close the link, since if we cannot
get the IP from the link it probably means the connection was closed.
```
=== VALKEY BUG REPORT START: Cut & paste starting from here ===
17847:M 19 Dec 2024 00:15:58.021 # === ASSERTION FAILED ===
17847:M 19 Dec 2024 00:15:58.021 # ==> cluster_legacy.c:3252 'nodeIp2String(node->ip, link, hdr->myip) == C_OK' is not true
------ STACK TRACE ------
17847 valkey-server *
src/valkey-server 127.0.0.1:27131 [cluster](clusterProcessPacket+0x1304) [0x4e5634]
src/valkey-server 127.0.0.1:27131 [cluster](clusterReadHandler+0x11e) [0x4e59de]
/__w/valkey/valkey/src/valkey-tls.so(+0x2f1e) [0x7f083983ff1e]
src/valkey-server 127.0.0.1:27131 [cluster](aeMain+0x8a) [0x41afea]
src/valkey-server 127.0.0.1:27131 [cluster](main+0x4d7) [0x40f547]
/lib64/libc.so.6(+0x40c8) [0x7f083985a0c8]
/lib64/libc.so.6(__libc_start_main+0x8b) [0x7f083985a18b]
src/valkey-server 127.0.0.1:27131 [cluster](_start+0x25) [0x410ef5]
```
But it also introduces another assert. The reason is that this new node
is not added to the cluster nodes dict.
```
17128:M 08 Jan 2025 10:51:44.061 # === ASSERTION FAILED ===
17128:M 08 Jan 2025 10:51:44.061 # ==> cluster_legacy.c:1693 'dictDelete(server.cluster->nodes, nodename) == DICT_OK' is not true
------ STACK TRACE ------
17128 valkey-server *
src/valkey-server 127.0.0.1:28627 [cluster][0x4ebdc4]
src/valkey-server 127.0.0.1:28627 [cluster][0x4e81d2]
src/valkey-server 127.0.0.1:28627 [cluster](clusterReadHandler+0x268)[0x4e8618]
/__w/valkey/valkey/src/valkey-tls.so(+0xb278)[0x7f109480b278]
src/valkey-server 127.0.0.1:28627 [cluster](aeMain+0x89)[0x592b09]
src/valkey-server 127.0.0.1:28627 [cluster](main+0x4b3)[0x453e23]
/lib64/libc.so.6(__libc_start_main+0xe5)[0x7f10958bf7e5]
src/valkey-server 127.0.0.1:28627 [cluster](_start+0x2e)[0x454a5e]
```
This closes #1527.
Signed-off-by: Binbin <binloveplay1314@qq.com>
The fix that Redis gave us for CVE-2024-46981 was freeing lctx.lua,
and I didn't merge it correctly. We made some changes so that we are
able to async free the lua context, so we need to free the passed in
context. This was applied correctly on the two released versions (8.0
and 7.2) just not on unstable.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
This PR is to cleanup the `SERVER_TEST` compiler flag from cmake compile
definitions, as it is no longer required in the new unit test framework, see #428.
Signed-off-by: Karthick Ariyaratnam <karthyuom@gmail.com>
This PR introduces improvements to the hashtable iterator, implementing
the prefetching technique described in the blog post [Unlock One Million
RPS - Part 2](https://valkey.io/blog/unlock-one-million-rps-part2/). The
changes lay the groundwork for further enhancements in use cases
involving iterators. Future PRs will build upon this foundation to
improve performance and functionality in various iterator-dependent
operations.
In the pursuit of maximizing iterator performance, I conducted a
comprehensive series of experiments. My tests encompassed a wide range
of approaches, including processing multiple bucket indices in parallel,
prefetching the next bucket upon completion of the current one, and
several other timing and quantity variations. Surprisingly, after
rigorous testing and performance analysis, the simplest implementation
presented in this PR consistently outperformed all other more complex
strategies.
## Implementation
Each time we start iterating over a bucket, we prefetch data for future
iterations:
- We prefetch the entries of the next bucket (if it exists).
- We prefetch the structure (but not the entries) of the bucket after
the next.
This prefetching is done when we pick up a new bucket, increasing the
chance that the data will be in cache by the time we need it.
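A simplified sketch of the scheme, using the GCC/Clang __builtin_prefetch
intrinsic (the types and bucket fanout are illustrative, not the actual
hashtable.c code):
```c
#include <stddef.h>

#define ENTRIES_PER_BUCKET 7 /* illustrative fanout */

typedef struct bucket {
    void *entries[ENTRIES_PER_BUCKET];
} bucket;

static void prefetchAhead(bucket *table, size_t i, size_t num_buckets) {
    if (i + 1 < num_buckets) {
        /* Prefetch the entries of the next bucket. */
        bucket *next = &table[i + 1];
        for (size_t j = 0; j < ENTRIES_PER_BUCKET; j++)
            if (next->entries[j]) __builtin_prefetch(next->entries[j], 0, 1);
    }
    /* Prefetch only the structure (not the entries) of the bucket after
     * the next. */
    if (i + 2 < num_buckets) __builtin_prefetch(&table[i + 2], 0, 1);
}
```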
## Performance
The data below was taken by running the KEYS command on a 64-core Graviton
3 Amazon EC2 instance with 50 million keys of 100 bytes each. The results
regarding the duration of the "keys *" command were taken from the
"info all" command.
```
+--------------------+------------------+-----------------------------+
| prefetching | Time (seconds) | Keys Processed per Second |
+--------------------+------------------+-----------------------------+
| No | 11.112279 | 4,499,529 |
| Yes | 3.141916 | 15,913,862 |
+--------------------+------------------+-----------------------------+
Improvement:
Comparing the iterator without prefetching to the one with prefetching,
we can see a speed improvement of 11.112279 / 3.141916 ≈ 3.54 times faster.
```
### Save command improvement
#### Setup:
- 64-core Graviton 3 Amazon EC2 instance.
- 50 million keys of 100 bytes each.
- Running the valkey server over a RAM file system.
- CRC checksum and compression off.
#### Results
```
+--------------------+------------------+-----------------------------+
| prefetching | Time (seconds) | Keys Processed per Second |
+--------------------+------------------+-----------------------------+
| No | 28 | 1,785,700 |
| Yes | 19.6 | 2,550,000 |
+--------------------+------------------+-----------------------------+
Improvement:
- Reduced SAVE time by 30% (8.4 seconds faster)
- Increased key processing rate by 42.8% (764,300 more keys/second)
```
Signed-off-by: NadavGigi <nadavgigi102@gmail.com>
Resolves an issue with valkey-cli not automatically exiting subscribed
mode on reaching zero pub/sub subscriptions (previously filed against Redis):
https://github.com/redis/redis/issues/12592
---------
Signed-off-by: Nikhil Manglore <nmanglor@amazon.com>
This PR replaces dict with hashtable in the ZSET datatype. Instead of
mapping key to score as dict did, the hashtable maps key to a node in
the skiplist, which contains the score. This takes advantage of
hashtable performance improvements and saves 15 bytes per set item - 24
bytes overhead before, 9 bytes after.
Closes #1096
---------
Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
# Refactor client structure to use modular data components
## Current State
The client structure allocates memory for replication / pubsub /
multi-keys / module / blocked data for every client, despite these
features being used by only a small subset of clients. In addition, the
current field layout in the client struct is suboptimal, with poor
alignment and unnecessary padding between fields, leading to a larger
than necessary memory footprint of 896 bytes per client. Furthermore,
fields that are frequently accessed together during operations are
scattered throughout the struct, resulting in poor cache locality.
## This PR's Change
1. Lazy Initialization (see the sketch after this list)
- **Components are only allocated when first used:**
- PubSubData: Created on first SUBSCRIBE/PUBLISH operation
- ReplicationData: Initialized only for replica connections
- ModuleData: Allocated when module interaction begins
- BlockingState: Created when first blocking command is issued
- MultiState: Initialized on MULTI command
2. Memory Layout Optimization:
- Grouped related fields for better locality
- Moved rarely accessed fields (e.g., client->name) to struct end
- Optimized field alignment to eliminate padding
3. Additional changes:
- Moved watched_keys to be statically allocated in the `mstate` struct
- Relocated the replication init logic to replication.c
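A minimal sketch of the lazy-allocation pattern from point 1 (struct and
accessor names are illustrative, not the exact server code):
```c
#include <stdlib.h>

typedef struct MultiState { int count; /* queued commands, ... */ } MultiState;

typedef struct client {
    int fd;             /* hot, always-present fields stay inline... */
    MultiState *mstate; /* ...rare feature state is NULL until first use */
} client;

/* Allocated on the first MULTI; clients that never use transactions pay
 * only for one pointer. */
static MultiState *clientGetMultiState(client *c) {
    if (c->mstate == NULL) c->mstate = calloc(1, sizeof(MultiState));
    return c->mstate;
}
```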
### Key Benefits
- **Efficient Memory Usage:**
- 45% smaller base client structure - Basic clients now use 528 bytes
(down from 896).
- Better memory locality for related operations
- Performance improvement in high throughput scenarios. No performance
regressions in other cases.
### Performance Impact
Tested with 650 clients and 512-byte values.
#### Single Thread Performance
| Operation | Dataset | New (ops/sec) | Old (ops/sec) | Change % |
|------------|---------|---------------|---------------|-----------|
| SET | 1 key | 261,799 | 258,261 | +1.37% |
| SET | 3M keys | 209,134 | ~209,000 | ~0% |
| GET | 1 key | 281,564 | 277,965 | +1.29% |
| GET | 3M keys | 231,158 | 228,410 | +1.20% |
#### 8 IO Threads Performance
| Operation | Dataset | New (ops/sec) | Old (ops/sec) | Change % |
|------------|---------|---------------|---------------|-----------|
| SET | 1 key | 1,331,578 | 1,331,626 | -0.00% |
| SET | 3M keys | 1,254,441 | 1,152,645 | +8.83% |
| GET | 1 key | 1,293,149 | 1,289,503 | +0.28% |
| GET | 3M keys | 1,152,898 | 1,101,791 | +4.64% |
#### Pipeline Performance (3M keys)
| Operation | Pipeline Size | New (ops/sec) | Old (ops/sec) | Change % |
|-----------|--------------|---------------|---------------|-----------|
| SET | 10 | 548,964 | 538,498 | +1.94% |
| SET | 20 | 606,148 | 594,872 | +1.89% |
| SET | 30 | 631,122 | 616,606 | +2.35% |
| GET | 10 | 628,482 | 624,166 | +0.69% |
| GET | 20 | 687,371 | 681,659 | +0.84% |
| GET | 30 | 725,855 | 721,102 | +0.66% |
### Observations:
1. Single-threaded operations show consistent improvements (1-1.4%)
2. Multi-threaded performance shows significant gains for large
datasets:
- SET with 3M keys: +8.83% improvement
- GET with 3M keys: +4.64% improvement
3. Pipeline operations show consistent improvements:
- SET operations: +1.89% to +2.35%
- GET operations: +0.66% to +0.84%
4. No performance regressions observed in any test scenario
Related issue: https://github.com/valkey-io/valkey/issues/761
---------
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: uriyage <78144248+uriyage@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This commit, https://github.com/valkey-io/valkey/pull/1504, moved the
wrong worker to ubuntu 22. We wanted to move codecov and not coverity.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
It's inconvenient for client implementations to extract the
`availability_zone` information from the `INFO` response. The `INFO`
response contains a lot of information that a client implementation
typically doesn't need.
This PR adds the availability zone to the `HELLO` response. Clients
usually already use the `HELLO` command for protocol negotiation and
also get the server `version` and `role` from its response. To keep the
`HELLO` response small, the field is only added if availability zone is
configured.
---------
Signed-off-by: Rueian <rueiancsie@gmail.com>
Reset the GC state before closing the lua VM to prevent user data from
being wrongly freed while it might still be used in destructor callbacks.
Created and published by Redis in their OSS branch.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: YaacovHazan <yaacov.hazan@redis.com>
The explanation on the original commit was wrong. Key-based access must
have a `~` in order to correctly configure which key prefixes to apply
the selector to. If this is missing, a server assert will be triggered
later.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: YaacovHazan <yaacov.hazan@redis.com>
This may speed up the transition to the fail state a bit.
Previously we would only check when we received a pfail/fail
report from others in gossip. If myself is the last vote,
we can directly switch to fail here without waiting for
the next gossip packet.
Signed-off-by: Binbin <binloveplay1314@qq.com>
When building with `CMake` (especially the targets `valkey-cli`,
`valkey-server` and `valkey-benchmark`) it is possible to have a
successful build while having warnings.
This PR fixes this - which is aligned with how the `Makefile` is working
today:
- Enable `-Wall` + `-Werror` for valkey targets
- Fixed warning in valkey-cli:jsonStringOutput method
Signed-off-by: Eran Ifrah <eifrah@amazon.com>
The issues in #1453 seem to
have only shown up since we moved to ubuntu 24, as part of the rolling
`ubuntu-latest` migration from 22->24.
Closes #1453.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
When looking up a key in no-touch mode, `LOOKUP_NOTOUCH` is set to avoid
updating the last access time in `lookupKey`. An exception must be made
for the `TOUCH` command which must always update the key.
When called from a script, `server.executing_client` will point to the
`TOUCH` command, while `server.current_client` will point to e.g. an
`EVAL` command. So, we must use the former to find out the currently
executing command if defined.
This fix addresses the issue where TOUCH wasn't updating key access
times when called from scripts like EVAL.
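In sketch form, with simplified stand-ins for the server globals (not the
actual lookup code):
```c
#include <strings.h>

typedef struct client { const char *cmd_name; } client;

/* Stand-ins for server.executing_client / server.current_client. */
static client *executing_client, *current_client;

static int currentCommandIsTouch(void) {
    /* Prefer the executing client: inside EVAL it points at TOUCH, while
     * the current client points at EVAL itself. */
    client *c = executing_client ? executing_client : current_client;
    return c && strcasecmp(c->cmd_name, "touch") == 0;
}
```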
Fixes #1498
Signed-off-by: Simon Baatz <gmbnomis@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
In this commit we move all structure and function declarations related
to Valkey modules from `server.h` to the recently added `module.h` file.
This re-organization makes it easier for new contributors to find the
valkey modules related code, as well as reducing the compilation times
when changes are made to the modules code.
---------
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
### Fix race with pending writes in replica state transition
#### The Problem
In #60 (Dual channel replication) a new `connWrite` call was added
before the `waitForClientIO` check. This created a race condition where
the main thread may attempt to write to a client that could have pending
writes in IO threads.
#### The Fix
Moved the `waitForClientIO()` call earlier in `syncCommand`, before any
`connWrite` call. This ensures all pending IO operations are completed
before attempting to write to the client.
---------
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Currently, in case a blocked command is unblocked externally (e.g. due to
the relevant slot being migrated or the CLIENT UNBLOCK command being
issued), the command statistics will always update the failed_calls error
statistic. This leads to misalignment with
90b9f08e5d
as well as some inconsistencies. For example, when a key is migrated
during cluster slot migration, clients blocked on XREADGROUP will be
unblocked and update the rejected_calls stat, while clients blocked on
BLPOP will get unblocked updating the failed_calls stat.
In this PR we add an explicit indication in updateStatsOnUnblock that
tells whether the command was rejected or failed.
---------
Signed-off-by: ranshid <ranshid@amazon.com>
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Add `meet_sent` field in `clusterNode` indicating the last time we sent
a MEET packet. Use this field to only (re-)send a MEET packet once every
handshake timeout period when detecting a node without an inbound link.
When receiving multiple MEET packets on the same link while the node is
in handshake state, instead of dropping the packet, we now simply
prevent the creation of a new node. This way we still process the MEET
packet's gossip and reply with a PONG as with any other packet.
Improve some logging messages to include `human_nodename`. Add
`nodeExceedsHandshakeTimeout()` function.
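A sketch of the throttling check (simplified types; the real code uses
`mstime_t` and the cluster node struct):
```c
#include <stdint.h>

typedef struct {
    uint64_t meet_sent; /* ms timestamp of the last MEET we sent; 0 if none */
} nodeSketch;

/* Only (re-)send a MEET if none was sent within the handshake timeout. */
static int shouldResendMeet(const nodeSketch *n, uint64_t now_ms,
                            uint64_t handshake_timeout_ms) {
    return n->meet_sent == 0 || now_ms - n->meet_sent > handshake_timeout_ms;
}
```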
This is a follow-up to this previous PR:
https://github.com/valkey-io/valkey/pull/1307
And a partial fix to the crash described in:
https://github.com/valkey-io/valkey/pull/1436
---------
Signed-off-by: Pierre Turin <pieturin@amazon.com>
In the `arguments` section, the `arguments` key is only used for
arguments of type `block` or `oneof`.
Consequently, the `arguments` given for `IFEQ` are ignored by the
server. However, they lead to strange results when rendering the
command's page for the web documentation.
Fix this by removing `arguments` for `IFEQ`.
Signed-off-by: Simon Baatz <gmbnomis@gmail.com>
This PR fixes the missing stat update for `total_net_repl_output_bytes`
that was removed during the refactoring in PR #758. The metric was not
being updated when writing to replica connections.
Changes:
- Restored the stat update in postWriteToClient for replica connections
- Added integration test to verify the metric is properly updated
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
This PR introduces a new mechanism for temporarily changing the
server's loading_rio context during RDB loading operations. The new
`RDB_SCOPED_LOADING_RIO` macro allows for a scoped change of the
`server.loading_rio` value, ensuring that it's automatically restored
to its original value when the scope ends.
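The general save/set/restore shape of such a scoped macro looks roughly
like this (illustrative stand-ins, not the literal `RDB_SCOPED_LOADING_RIO`
definition):
```c
typedef struct rio { int dummy; } rio;      /* stand-in */
static struct { rio *loading_rio; } server; /* stand-in */

/* Swap in new_rio for the duration of body, then restore the old value. */
#define SCOPED_LOADING_RIO(new_rio, body)    \
    do {                                     \
        rio *prev_rio_ = server.loading_rio; \
        server.loading_rio = (new_rio);      \
        body;                                \
        server.loading_rio = prev_rio_;      \
    } while (0)
```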
Introduces a dedicated flag to `rio` to signal immediate abort, preventing
potential use-after-free scenarios during replication disconnection in
dual-channel load. This ensures proper termination of
`rdbLoadRioWithLoadingCtx` when replication is cancelled due to connection
loss on the main connection.
Fixes https://github.com/valkey-io/valkey/issues/1152
---------
Signed-off-by: naglera <anagler123@gmail.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: Amit Nagler <58042354+naglera@users.noreply.github.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>
- By not waiting `repl-diskless-sync-delay` when we don't have to, we
can reduce the dual channel tests' execution time by ~30%.
- This commit also drops one test which is not required for regular sync
(`Sync should continue if not all slaves dropped`).
- Skip dual channel test with master diskless disabled because it will
initiate the same synchronization process as the non-dual channel test,
making it redundant.
Before:
```
Execution time of different units:
171 seconds - integration/dual-channel-replication
305 seconds - integration/replication-psync
\o/ All tests passed without errors!
```
After:
```
Execution time of different units:
120 seconds - integration/dual-channel-replication
236 seconds - integration/replication-psync
\o/ All tests passed without errors!
```
Discused on https://github.com/valkey-io/valkey/pull/1173
---------
Signed-off-by: naglera <anagler123@gmail.com>
As part of #1463, I made a small refactor between the PR and the daily
test I submitted to try to improve readability by adding a function to
abstract the extraction of the message types. However, that change
apparently caused GCC to throw another warning, so reverting the
abstraction on just one line.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
This PR extends the module API to support the addition of different
scripting engines to execute user defined functions.
The scripting engine can be implemented as a Valkey module, and can be
dynamically loaded with the `loadmodule` config directive, or with the
`MODULE LOAD` command.
This PR also adds an example of a dummy scripting engine module, to show
how to use the new module API. The dummy module is implemented in
`tests/modules/helloscripting.c`.
The current module API support only allows loading scripting engines to
run functions using the `FCALL` command.
The additions to the module API are the following:
```c
/* This struct represents a scripting engine function that results from the
* compilation of a script by the engine implementation. */
struct ValkeyModuleScriptingEngineCompiledFunction
typedef ValkeyModuleScriptingEngineCompiledFunction **(*ValkeyModuleScriptingEngineCreateFunctionsLibraryFunc)(
ValkeyModuleScriptingEngineCtx *engine_ctx,
const char *code,
size_t timeout,
size_t *out_num_compiled_functions,
char **err);
typedef void (*ValkeyModuleScriptingEngineCallFunctionFunc)(
ValkeyModuleCtx *module_ctx,
ValkeyModuleScriptingEngineCtx *engine_ctx,
ValkeyModuleScriptingEngineFunctionCtx *func_ctx,
void *compiled_function,
ValkeyModuleString **keys,
size_t nkeys,
ValkeyModuleString **args,
size_t nargs);
typedef size_t (*ValkeyModuleScriptingEngineGetUsedMemoryFunc)(
ValkeyModuleScriptingEngineCtx *engine_ctx);
typedef size_t (*ValkeyModuleScriptingEngineGetFunctionMemoryOverheadFunc)(
void *compiled_function);
typedef size_t (*ValkeyModuleScriptingEngineGetEngineMemoryOverheadFunc)(
ValkeyModuleScriptingEngineCtx *engine_ctx);
typedef void (*ValkeyModuleScriptingEngineFreeFunctionFunc)(
ValkeyModuleScriptingEngineCtx *engine_ctx,
void *compiled_function);
/* This struct stores the callback functions implemented by the scripting
* engine to provide the functionality for the `FUNCTION *` commands. */
typedef struct ValkeyModuleScriptingEngineMethodsV1 {
uint64_t version; /* Version of this structure for ABI compat. */
/* Library create function callback. When a new script is loaded, this
* callback will be called with the script code, and returns a list of
* ValkeyModuleScriptingEngineCompiledFunc objects. */
ValkeyModuleScriptingEngineCreateFunctionsLibraryFunc create_functions_library;
/* The callback function called when `FCALL` command is called on a function
* registered in this engine. */
ValkeyModuleScriptingEngineCallFunctionFunc call_function;
/* Function callback to get current used memory by the engine. */
ValkeyModuleScriptingEngineGetUsedMemoryFunc get_used_memory;
/* Function callback to return memory overhead for a given function. */
ValkeyModuleScriptingEngineGetFunctionMemoryOverheadFunc get_function_memory_overhead;
/* Function callback to return memory overhead of the engine. */
ValkeyModuleScriptingEngineGetEngineMemoryOverheadFunc get_engine_memory_overhead;
/* Function callback to free the memory of a registered engine function. */
ValkeyModuleScriptingEngineFreeFunctionFunc free_function;
} ValkeyModuleScriptingEngineMethodsV1;
/* Registers a new scripting engine in the server.
*
* - `engine_name`: the name of the scripting engine. This name will match
* against the engine name specified in the script header using a shebang.
*
* - `engine_ctx`: engine specific context pointer.
*
* - `engine_methods`: the struct with the scripting engine callback functions
* pointers.
*/
int ValkeyModule_RegisterScriptingEngine(ValkeyModuleCtx *ctx,
const char *engine_name,
void *engine_ctx,
ValkeyModuleScriptingEngineMethods engine_methods);
/* Removes the scripting engine from the server.
*
* `engine_name` is the name of the scripting engine.
*
*/
int ValkeyModule_UnregisterScriptingEngine(ValkeyModuleCtx *ctx, const char *engine_name);
```
---------
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
We are getting a number of errors like:
```
array subscript ‘clusterMsg[0]’ is partly outside array bounds of ‘unsigned char[2272]’
```
Which is basically GCC telling us that we have an object which is longer
than the underlying storage of the allocation. We actually do this a
lot, but GCC is generally not aware of how big the underlying allocation
is, so it doesn't throw this error. We are specifically getting this
error because the msgBlock can be of variable length depending on the
type of message, but GCC assumes it's the longest one possible. The
solution I went with here was make the message type optional, so that it
wasn't included in the size. I think this also makes some sense, since
it's really just a helper for us to easily cast the object around.
I considered disabling this error, but it is generally pretty useful
since it can catch real issues. Another solution would be to
over-allocate to the largest possible object, which could hurt
performance as we initialize it to zero.
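In sketch form, the idea is to keep the variable-length payload out of the
base struct's size, e.g. with a flexible array member, so a shorter
allocation is never typed as the longest possible message (fields here are
illustrative):
```c
#include <stdint.h>

typedef struct clusterMsgSketch {
    uint32_t totlen;
    uint16_t type;
    /* ...fixed header fields... */
    unsigned char data[]; /* flexible: cast to the per-type payload as needed */
} clusterMsgSketch;
```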
Results:
https://github.com/madolson/valkey/actions/runs/12423414811/job/34686899884
This is a slightly cleaned up version of
https://github.com/valkey-io/valkey/pull/1439. I thought I had another
strategy but alas, it didn't work out.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
We set the client output buffer limits to 10 bytes and then execute
`info stats`, which produces more than 10 bytes of output and can
cause that command to throw an error.
I'm not sure why it wasn't consistently erroring before, might have been
some change related to the ubuntu upgrade though.
Issues related to ubuntu-tls are hopefully resolved now.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
In #1459, I missed that the data was also used to keep track of the PID
files, so if the testing framework crashed it would no longer be able to
clean up the extra servers. So now we properly extract the PID and store
it so we can clean up the PIDs.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
There are two changes here:
1. The one in clusterNodeCleanupFailureReports: only a primary with slots
can send a failure report, so if the primary became a replica its failure
report should be cleared. This may lead to inaccurate node-fail judgment
in some network partition cases, I guess; it will also affect the CLUSTER
COUNT-FAILURE-REPORTS command.
2. The one in clusterProcessGossipSection is not that important, but it can
print a "node is back online" log that helps us troubleshoot the problem,
although it may conflict with 1 at some points.
Signed-off-by: Binbin <binloveplay1314@qq.com>
- Moves `build-config.json` to the workflow dir to build old versions with
new configs.
- Enables contributors to test the release workflow on a private repo by
adding `github.event_name == 'workflow_dispatch' ||`
---------
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
**Background**
When conducting performance tests using `valkey-benchmark`, reading from
replicas was not supported. Consequently, even in cluster mode, all
reads were directed to the primary nodes. This limitation made it
challenging to obtain accurate metrics during workload stress testing
for performance measurement or before a version upgrade.
Related issue : https://github.com/valkey-io/valkey/issues/900
**Changes**
1. Replaced the use of `CLUSTER NODES` with `CLUSTER SLOTS` when
fetching cluster configuration. This allows for easier identification of
replica slots.
2. Support for reading from replicas by executing the client in
`READONLY` mode.
3. Support reading from replicas even during slot migrations.
4. Introduced the CLI option `--rfr` to enable reading from replicas
only or from all cluster nodes. A warning was added to indicate that
write requests might not be handled correctly when using this option.
---------
Signed-off-by: bluayer <ijacsong98@gmail.com>
Signed-off-by: bluayer <bluayer@gmail.com>
Signed-off-by: Jungwoo Song <37579681+bluayer@users.noreply.github.com>
Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>
The old reqEpoch mainly refers to requestCurrentEpoch, see:
```
if (requestCurrentEpoch < server.cluster->currentEpoch) {
serverLog(LL_WARNING, "Failover auth denied to %.40s (%s): reqEpoch (%llu) < curEpoch(%llu)", node->name,
node->human_nodename, (unsigned long long)requestCurrentEpoch,
(unsigned long long)server.cluster->currentEpoch);
return;
}
```
And here we refer to requestConfigEpoch, which is a bit misleading,
so change it to reqConfigEpoch to make it clear.
Signed-off-by: Binbin <binloveplay1314@qq.com>
There are some tests that fail and give no useful information since they are
outside of a test context. Now we will at least get the file we are located in.
We can sort of reverse engineer where we are in the test by seeing which
tests have finished in a file.
```
[TIMEOUT]: clients state report follows.
sock6 => (SPAWNED SERVER) pid:30375 - tests/unit/info.tcl
Killing still running Valkey server 30375 - tests/unit/info.tcl
```
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
The test attempts to write 1MB of data in order to trigger a disconnect.
Normally, the data is fully flushed and we get the error on the read
(I/O error). However, it's possible we might fail the write, which
leaves the client in an inconsistent state. On the next command, we
finally process the I/O error on the FD. So, the simple fix is to
consume any secondary errors.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
ASan now supports making sure you are passing in the correct pointer
type, which seems useful, but we can't currently support it since we pass
an incorrect pointer type in several places. This is most commonly done
with generic free functions, where we simply cast it to the correct type.
It's not a lot of code to clean up, so it seems appropriate to cleanup
instead of disabling the check.
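The typical cleanup looks like this: instead of passing a function such as
sdsfree (taking `sds`) where a `void (*)(void *)` is expected, add a thin
wrapper with the generic signature (sdsfree is stubbed here to keep the
sketch self-contained):
```c
#include <stdlib.h>

typedef char *sds;
static void sdsfree(sds s) { free(s); } /* stand-in for the real sdsfree */

/* Matches the generic callback signature, so the call through the function
 * pointer uses the correct type; the cast happens inside. */
static void freeSdsGeneric(void *item) {
    sdsfree((sds)item);
}
```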
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Introduce a compile-time option to force activedefrag to run even when
jemalloc is not used as the allocator.
This makes it possible to run tests with defrag enabled
while using memory instrumentation tools.
fixes: https://github.com/valkey-io/valkey/issues/1241
---------
Signed-off-by: ranshid <ranshid@amazon.com>
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Avoid tmpfs as fadvise(FADV_DONTNEED) has no effect on memory-backed
filesystems.
Fixes https://github.com/valkey-io/valkey/issues/897
---------
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
We have set the secret as `AWS_S3_TEST_BUCKET` for the test bucket, and I
missed it in the initial review.
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
After #1307 got merged, we noticed an assert happening in setClusterNodeToInboundClusterLink:
```
=== ASSERTION FAILED ===
==> '!link->node' is not true
```
In #778, we call setClusterNodeToInboundClusterLink to attach the node to
the link during MEET processing, so if we receive another MEET packet in a
short time while the node is still in the handshake state, we hit this
assert and crash the server.
If the link is bound to a node, the node is in the handshake state, and we
receive a MEET packet, it may be that the sender sent multiple MEET
packets, so we now drop the MEET to avoid the assert in
setClusterNodeToInboundClusterLink. The assert happens like this: the
other node sends a MEET packet because it detects that there is no inbound
link, this node creates a new node in HANDSHAKE state (with a random node
name) and responds with a PONG. The other node receives the PONG and
removes the CLUSTER_NODE_MEET flag. This node is supposed to open an
outbound connection to the other node in the next cron cycle, but before
this happens, the other node re-sends a MEET on the same link because it
still detects no inbound connection.
Note that in getNodeFromLinkAndMsg, the node in the handshake state has a
random name and is not truly "known", so we don't know the sender.
Dropping the MEET packet prevents us from creating a random node, avoids
incorrect link binding, and avoids duplicate MEET packets prolonging the
handshake state.
Signed-off-by: Binbin <binloveplay1314@qq.com>
This is a follow-up of #1305; we have now decided to apply the same change
to automatic failover as well, that is, to move forward with removing
it for both automatic and manual failovers.
Quote from Ping during the review:
Note that we already debounce transient primary failures with node
timeout, ensuring failover is only triggered after sustained outages.
Election timing is naturally staggered by replica spacing, making the
likelihood of simultaneous elections from replicas of the same shard
very low. The one-vote-per-epoch rule further throttles retries and
ensures orderly elections. On top of that, quorum-based primary failure
confirmation, cluster-state convergence, and slot ownership validation
are all built into the process.
Quote from Madelyn during the review:
It's against the specific primary. It's to prevent double failovers.
If a primary just took over we don't want someone else to try to
take over and give the new primary some amount of time to take over.
I have not seen this issue though, it might have been over optimizing?
The double failure mode, where a node fails and then another node fails
within the nodetimeout also doesn't seem that common either though.
So the conclusion is that we all agreed to remove it completely,
it will make the code a lot simpler. And if there is other specific
edge cases we are missing, we will fix it in other way.
See discussion #1305 for more information.
Signed-off-by: Binbin <binloveplay1314@qq.com>
The new `hashtable` provides faster lookups and uses less memory than
`dict`.
A TCL test case "SRANDMEMBER with a dict containing long chain" is
deleted because it's covered by a hashtable unit test
"test_random_entry_with_long_chain", which is already present.
This change also moves some logic from dismissMemory (object.c) to
zmadvise_dontneed (zmalloc.c), so the hashtable implementation which
needs the dismiss functionality doesn't need to depend on object.c and
server.h.
This PR follows #1186.
---------
Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
We deprecate the usage of classic malloc and free, but under certain
circumstances they might get imported from intrinsics. The original
thought was that we should just override malloc and free to use zmalloc
and zfree, but I think we should continue to deprecate them to avoid
accidental imports of allocations.
Closes https://github.com/valkey-io/valkey/issues/1434.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Fixes four cases where `stringmatchlen` could overrun the pattern if it
is not terminated with NUL.
These commits are cherry-picked from my
[fork](https://github.com/thaliaarchi/antirez-stringmatch) which
extracts `stringmatch` as a library and compares it to other projects by
antirez which use the same matcher.
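A reduced example of the overrun class being fixed: after a backslash, the
matcher must check the remaining pattern length before reading the escaped
character (heavily simplified from the real stringmatchlen):
```c
#include <stddef.h>

static int matchOneEscaped(const char *pattern, size_t patternLen, char c) {
    if (patternLen >= 1 && pattern[0] == '\\') {
        if (patternLen < 2) return 0; /* the fix: don't read past the end */
        return pattern[1] == c;       /* safe: length checked first */
    }
    return patternLen >= 1 && (pattern[0] == c || pattern[0] == '?');
}
```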
Signed-off-by: Thalia Archibald <thalia@archibald.dev>
This update addresses several issues in defrag:
1. In the defrag redesign
(https://github.com/valkey-io/valkey/pull/1242), a bug was introduced
where `server.cronloops` was no longer being incremented in the
`whileBlockedCron()`. This resulted in some memory statistics not being
updated while blocked.
2. In the test case for AOF loading, we were seeing errors due to defrag
latencies. However, running the math, the latencies are justified given
the extremely high CPU target of the testcase. Adjusted the expected
latency check to allow longer latencies for this case where defrag is
undergoing starvation while AOF loading is in progress.
3. A "stage" is passed a "target". For the main dictionary and expires,
we were passing in a `kvstore*`. However, on flushall or swapdb, the
pointer may change. It's safer and more stable to use an index for the
DB (a DBID). Then if the pointer changes, we can detect the change, and
simply abort the stage. (If there's still fragmentation to deal with,
we'll pick it up again on the next cycle.)
4. We always start a new stage on a new defrag cycle. This gives the new
stage time to run, and prevents latency issues for certain stages which
don't operate incrementally. However, often several stages will require
almost no work, and this will leave a chunk of our CPU allotment unused.
This is mainly an issue in starvation situations (like AOF loading or
LUA script) - where defrag is running infrequently, with a large
duty-cycle. This change allows a new stage to be initiated if we still
have a standard duty-cycle remaining. (This can happen during starvation
situations where the planned duty cycle is larger than the standard
cycle. Most likely this isn't a concern for real scenarios, but it was
observed in testing.)
5. Minor comment correction in `server.h`
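A minimal sketch of the DBID-based target described in item 3, with hypothetical names for the stage state:
```c
/* Hypothetical stage state: remember the DBID and the kvstore pointer
 * observed when the stage started. */
typedef struct {
    int dbid;            /* stable identifier for the DB being defragged */
    void *kvs_at_start;  /* kvstore* observed when the stage began */
} defragStageSketch;

/* Returns 1 when the keyspace was replaced (e.g. FLUSHALL or SWAPDB)
 * since the stage started, in which case the stage simply aborts;
 * remaining fragmentation is picked up on the next cycle. */
static int defragStageTargetChanged(defragStageSketch *st, void *current_kvs) {
    return st->kvs_at_start != current_kvs;
}
```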
Signed-off-by: Jim Brunner <brunnerj@amazon.com>
This change makes the binary build on the target Ubuntu versions.
This PR also deprecates Ubuntu 18, and Valkey will now support:
- X86:
- Ubuntu 20
- Ubuntu 22
- Ubuntu 24
- ARM:
- Ubuntu 20
- Ubuntu 22
Removed ARM ubuntu 24 as the action we are using for ARM builds does not
support Ubuntu 24.
---------
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
## Summary
This PR fixes #1346, where we can get rid of the long-term credentials by
using OpenID Connect. OpenID Connect (OIDC) allows your GitHub Actions
workflows to access resources in Amazon Web Services (AWS), without
needing to store the AWS credentials as long-lived GitHub secrets.
---------
Signed-off-by: vudiep411 <vdiep@amazon.com>
This can happen when a script is running for a long period of time and the server attempts to defrag it in the whileBlockedCron.
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
In some cases, when meeting a new node, if the handshake times out, we
can end up with an inconsistent view of the cluster where the new node
knows about all the nodes in the cluster, but the cluster does not know
about this new node (or vice versa).
To detect this inconsistency, we now check if a node has an outbound
link but no inbound link; that probably means the node does not know
us, so we (re-)send a MEET packet to it to do a new handshake.
If we receive a MEET packet from a known node, we disconnect the
outbound link to force a reconnect and sending of a PING packet so that
the other node recognizes the link as belonging to us. This prevents
cases where a node could send MEET packets in a loop because it thinks
the other node does not have an inbound link.
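A sketch of the two rules, with hypothetical helper and type names standing in for the actual cluster internals:
```c
/* Hypothetical, trimmed-down node view; the real clusterNode and link
 * types carry much more state. */
typedef struct nodeLinksSketch {
    void *outbound_link; /* link we opened to the node */
    void *inbound_link;  /* link the node opened to us */
} nodeLinksSketch;

void sendMeetPacket(nodeLinksSketch *node);    /* hypothetical helpers */
void freeOutboundLink(nodeLinksSketch *node);

/* Periodic check: we reach the node, but it never connected back, so it
 * probably does not know us; re-introduce ourselves. */
void cronCheckInboundLinkSketch(nodeLinksSketch *node) {
    if (node->outbound_link != NULL && node->inbound_link == NULL)
        sendMeetPacket(node);
}

/* On MEET from an already-known node: drop our outbound link so the
 * reconnect sends a PING that binds the link to us, ending the loop. */
void onMeetFromKnownNodeSketch(nodeLinksSketch *sender) {
    freeOutboundLink(sender);
}
```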
This fixes the bug described in #1251.
---------
Signed-off-by: Pierre Turin <pieturin@amazon.com>
Addresses https://github.com/valkey-io/valkey/issues/1393
Changes:
* During AOF loading or long running script, this allows defrag to be
initiated.
* The AOF defrag test was corrected to eliminate the wait period and
rely on non-timer invocations.
* Logic for "overage" time in defrag was changed. It previously
accumulated underage leading to large latencies in extreme tests having
very high CPU percentage. After several simple stages were completed
during infrequent blocked processing, a large cycle time would be
experienced.
Signed-off-by: Jim Brunner <brunnerj@amazon.com>
There is a leak in here: hashtableTwoPhasePopDelete won't call the entry
destructor, so like with hashtablePop we need to call it ourselves.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Instead of a dictEntry with pointers to key and value, the hashtable
has a pointer directly to the value (robj) which can hold an embedded
key and acts as a key-value in the hashtable. This minimizes the number
of pointers to follow and thus the number of memory accesses to lookup
a key-value pair.
```
 Keys              robj
hashtable
+-------+      +-----------------------+
|   0   |      | type, encoding, LRU   |
|   1 -------->| refcount, expire      |
|   2   |      | ptr                   |
|  ...  |      | optional embedded key |
+-------+      | optional embedded val |
               +-----------------------+
```
The expire timestamp (TTL) is also stored in the robj, if any. The expire
hash table points to the same robj.
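A simplified illustration of the idea (not the actual Valkey struct layout): the header, the optional expire field and the key live in one allocation, so the table entry is a single pointer:
```c
#include <stdint.h>

/* Simplified: one allocation acts as both the hashtable entry and the
 * value object; the key is stored inline after the fixed header. */
typedef struct valobj_sketch {
    uint32_t type : 4;
    uint32_t encoding : 4;
    uint32_t lru : 24;
    int32_t refcount;
    void *ptr;      /* the value (or NULL when the value is embedded) */
    int64_t expire; /* optional TTL; the expires table points to the
                     * same object instead of holding a separate entry */
    char key[];     /* embedded key bytes (length-prefixed in practice) */
} valobj_sketch;
```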
Overview of changes:
* Replace dict with hashtable in kvstore (kvstore.c)
* Add functions for embedding key and expire in robj (object.c)
* When there's unused space, reserve an expire field to avoid reallocating
it later if an expire is added.
* Always reserve space for the expire field for large key names to avoid a realloc
if it's set later.
* Update db functions (db.c)
* dbAdd, setKey and setExpire reallocate the object when embedding a key
* setKey does not increment the reference counter, since it would require
duplicating the object. This responsibility is moved to the caller.
* Remove logic for shared integer objects as values in the database. The keys
are now embedded in the objects, so all objects in the database need to be
unique. Thus, we can't use shared objects as values. Also delete test cases
for shared integers.
* Adjust various commands to the changes mentioned above.
* Adjust defrag code
* Improvement: Don't access the expires table before defrag has actually
reallocated the object.
* Adjust test cases that were using hard-coded sizes for dict when realloc
would happen, and some other adjustments in test cases.
* Adjust memory prefetch for new hash table implementation in IO-threading,
using new `hashtableIncrementalFind` API
* Adjust offloading of free() to IO threads: object free is done in the main
thread, while keeping obj->ptr offloading in the IO thread, since the DB object is
now allocated by the main thread and not by the IO thread as it used to be.
* Let expireIfNeeded take an optional value, to avoid looking up the expires
table when possible.
---------
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: uriyage <78144248+uriyage@users.noreply.github.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Uri Yagelnik <uriy@amazon.com>
This changes the type of command tables from dict to hashtable. Command
table lookup takes ~3% of overall CPU time in benchmarks, so it is a
good candidate for optimization.
My initial SET benchmark comparison suggests that hashtable is about 4.5
times faster than dict and this replacement reduced overall CPU time by
2.79% 🥳
---------
Signed-off-by: Rain Valentine <rainval@amazon.com>
Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Rain Valentine <rainval@amazon.com>
A cache-line aware hash table with a user-defined key-value entry type,
supporting incremental rehashing, scan, iterator, random sampling,
incremental lookup and more...
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The following error has been seen, but not reliably reproduced:
```
*** [err]: eviction due to output buffers of pubsub, client eviction: true in tests/unit/maxmemory.tcl
Expected '42' to be equal to '50' (context: type proc line 17 cmd {assert_equal [r dbsize] 50} proc ::init_test level 2)
```
The reason is probably that FLUSHDB is asynchronous and when we start
populating new keys, they are evicted because the background flush is
too slow. Changing this to FLUSHDB SYNC prevents this.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
events-per-io-thread is a config for testing purposes that allows us to force the
main thread to always offload work to the IO threads; see
adjustIOThreadsByEventLoad for more details.
Signed-off-by: Binbin <binloveplay1314@qq.com>
This fix changes the timeout for BLPOP in this test case from 1 second
to 0.5 seconds.
In the test case quoted below, the procedure
`wait_for_blocked_clients_count` waits for one second by default. If
BLPOP has 1 second timeout and the first
`wait_for_blocked_clients_count` finishes very fast, then the second
`wait_for_blocked_clients_count` can time out before the BLPOP has been
unblocked.
```TCL
test "CLUSTER SLOT-STATS cpu-usec for blocking commands, unblocked on timeout." {
# Blocking command with 1 second timeout.
set rd [valkey_deferring_client]
$rd BLPOP $key 1
# Confirm that the client is blocked, then unblocked after 1 second timeout.
wait_for_blocked_clients_count 1
wait_for_blocked_clients_count 0
```
As seen in the definition of `wait_for_blocked_clients_count`, the total
time to wait is 1 second by default.
```TCL
proc wait_for_blocked_clients_count {count {maxtries 100} {delay 10} {idx 0}} {
wait_for_condition $maxtries $delay {
[s $idx blocked_clients] == $count
} else {
fail "Timeout waiting for blocked clients"
}
}
```
Fixes #1121
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
- Add new flag "I" in `CLIENT LIST` for import-source client
- Add `DEBUG_CONFIG` for import-mode
- Allow import-source status to be turned off when import-mode is off
Fixes #1350 and
https://github.com/valkey-io/valkey/pull/1185#discussion_r1851049362.
---------
Signed-off-by: lvyanqi.lyq <lvyanqi.lyq@alibaba-inc.com>
Signed-off-by: Yanqi Lv <lvyanqi.lyq@alibaba-inc.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Binbin <binloveplay1314@qq.com>
This PR allows Valkey users to perform conditional updates, where the
SET command completes only if the given comparison-value matches the key's
current value.
Syntax:
```
SET key value IFEQ comparison-value
```
Behavior:
If the values match, the SET completes as expected. If they do not
match, the command returns (nil), except when the GET argument is also
given (see below).
Behavior with Additional Flags:
1. ```SET key value IFEQ comparison-value GET``` returns the existing
value, regardless of whether it matches comparison-value or not. The
conditional set operation is performed if the given comparison value
matches the existing value. To check if the SET succeeded, the caller
needs to check if the returned string matches the comparison-value.
2. ```SET key value IFEQ comparison-value XX``` is a syntax error.
3. ```SET key value IFEQ comparison-value NX``` is a syntax error.
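An illustrative session based on the behavior described above (key and values are examples):
```
127.0.0.1:6379> SET color blue
OK
127.0.0.1:6379> SET color green IFEQ blue
OK
127.0.0.1:6379> SET color red IFEQ blue
(nil)
127.0.0.1:6379> SET color red IFEQ green GET
"green"
127.0.0.1:6379> SET color red IFEQ blue XX
(error) ERR syntax error
```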
Closes: #1215
---------
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
We added two more committers, but according to our governance document
that makes them TSC members. As we discussed, for now we want to keep
the balance of corporate interests, so we are updating the governance to
explicitly list TSC members as distinct from folks with just write
permissions.
Also adds the new folks with commit permissions.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Change the assert crash log to also use the binary representation, like 5bdd72bea77d4bb237441c9a671e80edcdc998ad.
And do not print the password in the assert crash log, like 56eef6fb5ab7a755485c19f358761954ca459472.
In addition, for 5bdd72bea77d4bb237441c9a671e80edcdc998ad, we now print "argv" without
the surrounding '': originally the code printed a '' pair, and sdscatrepr already adds
its own "", so the extra '' is removed here.
Extract the getArgvReprString method and clean up the code a bit.
Examples:
```
debug assert "\x00abc"
before:
client->argv[0] = "debug" (refcount: 1)
client->argv[1] = "assert" (refcount: 1)
client->argv[2] = "" (refcount: 1)
after:
client->argv[0] = "debug" (refcount: 1)
client->argv[1] = "assert" (refcount: 1)
client->argv[2] = "\x00abc" (refcount: 1)
debug panic "\x00abc"
before:
argc: '3'
argv[0]: '"debug"'
argv[1]: '"panic"'
argv[2]: '"\x00abc"'
after:
argc: 3
argv[0]: "debug"
argv[1]: "panic"
argv[2]: "\x00abc"
```
Signed-off-by: Binbin <binloveplay1314@qq.com>
ZRANK is a widely used command for workloads using sorted sets. For
example, in leaderboards it enables querying the specific rank of a player.
The way ZRANK is currently implemented is:
1. Locate the element in the sorted set's hashtable.
2. Take the score of the element and use it to locate the
element in the skiplist (when listpack encoding is not used).
3. During the skiplist scan for the element, we keep the path and sum
the span of each path node in order to calculate the element's rank.
One problem with this approach is that it involves multiple compare
operations in order to locate the element. Specifically, string
comparison can be expensive, since it requires accessing multiple memory
locations for the items the element string is compared against.
Perf analysis showed this can take up to 20% of the rank scan time. (TBD
- provide the perf results for example)
We can improve the rank search by taking advantage of the fact that the
element node in the skiplist is pointed to by the hashtable value!
Our skiplist implementation uses FatKeys, where each added node is
assigned a randomly chosen height. Say we keep a height record for every
skiplist element. In order to get an element's rank we simply:
1. Locate the element in the sorted set's hashtable.
2. Go directly to the node in the skiplist.
3. Jump to the full height of the node and take the span value.
4. Continue going forward, always jumping to the highest level of
each node we reach, making sure to sum all the spans.
5. Subtract the summed spans from the skiplist length, and we now have
the specific node's rank. :)
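A sketch of this bottom-up walk, with a hypothetical node layout (the actual change reuses the level-0 span to store the height rather than adding a field):
```c
/* Hypothetical node layout: level[i].span counts the elements skipped
 * when following the forward pointer at level i; `height` is the extra
 * per-node record described above. */
typedef struct zslNodeSketch {
    int height;
    struct { struct zslNodeSketch *forward; unsigned long span; } level[1];
} zslNodeSketch;

/* Walk forward from the node itself, always jumping at each node's top
 * level and summing spans; subtracting from the list length yields the
 * node's rank without any key comparisons. */
unsigned long zslRankFromNodeSketch(zslNodeSketch *x, unsigned long length) {
    unsigned long behind = 0;
    while (x) {
        int top = x->height - 1;
        behind += x->level[top].span;
        x = x->level[top].forward;
    }
    return length - behind; /* exact +/-1 depends on the span convention */
}
```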
In order to test this method I created several benchmarks. All
benchmarks used the same seeds and the lists contained 1M elements.
Since a very important factor is the number of scores compared to the
number of elements (a small ratio means more string compares during
searches), each benchmark test used a different number of scores (1, 10K,
100K, 1M).
some results:
**TPS**
Scores range | non-optimized | optimized | gain
-- | -- | -- | --
1 | 416042 | 605363 | 45.51%
10K | 359776 | 459200 | 27.63%
100K | 380387 | 459157 | 20.71%
1M | 416059 | 450853 | 8.36%
**Latency**
Scores range | non-optimized | optimized | gain
-- | -- | -- | --
1 | 1.191000 | 0.831000 | -30.23%
10K | 1.383000 | 1.095000 | -20.82%
100K | 1.311000 | 1.087000 | -17.09%
1M | 1.191000 | 1.119000 | -6.05%
### Memory efficiency
Adding another field to each skiplist node can cause degradation in
memory efficiency for large sorted sets. We use the fact that the recorded
level-0 span of ALL nodes can only be 1 or zero (for the last node).
So we use wrappers to get a node's span, and override the level-0 span
to hold the node height.
---------
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
After #1009, we reset the election when we receive
a claim with an equal or higher epoch, since a node can win
an election in the past.
But we need to consider the time before the node actually
obtains the failover_auth_epoch. failover_auth_epoch
defaults to 0, so before the node actually gets the failover
epoch, we might wrongly reset the election.
This is probably harmless, but it produces misleading log
output and may delay the election by a cron cycle or beforesleep.
Now we only reset the election when the node has actually
obtained the failover epoch.
Signed-off-by: Binbin <binloveplay1314@qq.com>
- Enable investigation of memory issues during loading
- Previously, all memory commands were rejected with LOADING error
(except memory help)
- `MEMORY MALLOC-STATS` and `MEMORY PURGE` are now allowed,
as they don't depend on the dataset
- `MEMORY STATS` and `MEMORY USAGE KEY` remain disallowed
Fixes #1299
Signed-off-by: Guillaume Koenig <knggk@amazon.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
We've had security issues in the past with it, which is why
we marked it as PROTECTED. But modifying it at runtime
is also a dangerous action. For example, when child processes
are running, persistent temp files and log files may have
unexpected effects.
A scenario for modifying dir at runtime is working around a disk
failure, such as using disk-based replication to migrate a node,
writing nodes.conf to save the cluster configuration.
We decided to leave it as is and add a note in the conf
about the dangers of modifying dir at runtime.
Signed-off-by: Binbin <binloveplay1314@qq.com>
By including <stdatomic.h> after the other includes in the unit test, we
can avoid redefining a macro which led to a build failure.
Fixes #1394
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
In the ValkeyModule_OnLoad method of the file hellodict.c, the keystep
parameter of ValkeyModule_CreateCommand should be 1. Otherwise, executing
the following will coredump:
MODULE LOAD /home/tiger/valkey/src/modules/hellodict.so
COMMAND GETKEYS HELLODICT.SET key value
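A sketch of the corrected registration (cmd_SET stands in for the module's actual command handler; flags are illustrative): with a single key at argv position 1, first key = 1, last key = 1, keystep = 1.
```c
#include "valkeymodule.h"

int cmd_SET(ValkeyModuleCtx *ctx, ValkeyModuleString **argv, int argc);

int ValkeyModule_OnLoad(ValkeyModuleCtx *ctx, ValkeyModuleString **argv, int argc) {
    VALKEYMODULE_NOT_USED(argv);
    VALKEYMODULE_NOT_USED(argc);
    if (ValkeyModule_Init(ctx, "hellodict", 1, VALKEYMODULE_APIVER_1) == VALKEYMODULE_ERR)
        return VALKEYMODULE_ERR;
    /* keystep must be 1 here so COMMAND GETKEYS can locate the key. */
    if (ValkeyModule_CreateCommand(ctx, "hellodict.set", cmd_SET, "write",
                                   1, 1, 1) == VALKEYMODULE_ERR)
        return VALKEYMODULE_ERR;
    return VALKEYMODULE_OK;
}
```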
Signed-off-by: Codebells <1347103071@qq.com>
Determine the status of the client when attempting to read data. If
state=CLIENT_COMPLETED_IO, no read attempt is made, and I/O operations on
the client are rescheduled by the main thread.
> And 20474 Byte = PROTO_IOBUF_LEN(16KB) + SDS_HDR_VAR(16, s)(4090 Byte)
Fixes #1385
---------
Signed-off-by: fengquyoumo <1455117463@qq.com>
Before Redis OSS 7, if we loaded a module with some arguments during
runtime and ran the command "config rewrite", the module information would
not be saved into the config file.
Since Redis OSS 7 and Valkey 7.2, if we load a module with some arguments
during runtime, the module information (path, number of arguments, and
argument values) can be saved into the config file after the config rewrite
command is called. Thus, the module will be loaded automatically the next
time the server starts up.
Following is one example:
```
bind 172.25.0.58
port 7000
protected-mode no
enable-module-command yes
# Generated by CONFIG REWRITE
latency-tracking-info-percentiles 50 99 99.9
dir "/home/ubuntu/valkey"
save 3600 1 300 100 60 10000
user default on nopass sanitize-payload ~* &* +@ALL
loadmodule tests/modules/datatype.so 10 20
```
However, there is one problem.
If developers write a module and update the running arguments in some way,
the updated arguments cannot be saved into the config file even when
"config rewrite" is called. The reason comes from the following function,
rewriteConfigLoadmoduleOption (src/config.c):
```
void rewriteConfigLoadmoduleOption(struct rewriteConfigState *state) {
    ..........
    struct ValkeyModule *module = dictGetVal(de);
    line = sdsnew("loadmodule ");
    line = sdscatsds(line, module->loadmod->path);
    for (int i = 0; i < module->loadmod->argc; i++) {
        line = sdscatlen(line, " ", 1);
        line = sdscatsds(line, module->loadmod->argv[i]->ptr);
    }
    rewriteConfigRewriteLine(state, "loadmodule", line, 1);
    .......
}
```
The function only saves the initial argument information
(module->loadmod) into the config file.
After discussion among the core members (ref
https://github.com/valkey-io/valkey/issues/1177),
we decided to add the following APIs to implement this feature.
Original proposal:
```
int VM_UpdateRunTimeArgs(ValkeyModuleCtx *ctx, int index, char *value);
```
Updated proposal:
```
ValkeyModuleString **VM_GetRuntimeArgs(ValkeyModuleCtx *ctx);
int VM_UpdateRuntimeArgs(ValkeyModuleCtx *ctx, int argc, ValkeyModuleString **values);
```
Why we do not recommend the following way:
MODULE UNLOAD
Update module args in the conf file
MODULE LOAD
I think there are the following disadvantages:
1. Some modules cannot be unloaded, such as the example module
datatype.so (tests/modules/datatype.so).
2. MODULE UNLOAD + MODULE LOAD is not an atomic operation.
3. Sometimes, if we just unload the module, the client's business
could be interrupted.
---------
Signed-off-by: hwware <wen.hui.ware@gmail.com>
Since 4695d118dd (#1209), RDMA is supported as a builtin,
and the module connection type may be removed in the future. So run the
builtin RDMA support in the CI workflow.
The RDMA module is compiled only in CI; keep it as a build check only until
the module connection type becomes obsolete.
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
The recent PR (https://github.com/valkey-io/valkey/pull/1242) converted
Active Defrag to use `monotime`. In that change, a conversion was
performed to continue to use `ustime()` as part of the module interface.
Since this time is only used internally, and never actually exposed to
the module, we can convert this to use `monotime` directly.
Signed-off-by: Jim Brunner <brunnerj@amazon.com>
### Improve expired commands performance with IO threads
#### Background
In our IO threads architecture, IO threads allocate a client's argv, and
later, when we free it after processCommand, we offload the free to the IO
threads.
With jemalloc, it's crucial that the same thread that allocates memory
also frees it.
For some commands we modify the client's argv in the main thread during
command processing (for example in `SET EX` command we rewrite the
command to use absolute time for replication propagation).
#### Current issues
1. When commands are rewritten (e.g., expire commands), we store the
original argv in `c->original_argv`. However, we're currently:
- Freeing the new argv (allocated by the main thread) in IO threads
- Freeing the original argv (allocated by IO threads) in the main thread
2. Currently, `c->original_argv` points to a new array with old
objects, while `c->argv` has the old array with new objects, making memory
free management complicated.
#### Changes
1. Refactored argv modification handling code to ensure consistency -
both array and objects are now either all new or all old
2. Moved original_argv cleanup to happen in resetClient after argv
cleanup
3. Modified IO threads code to properly handle original argv cleanup
when argv are modified.
#### Performance Impact
Benchmark with `SET EX` commands (650 clients, 512 byte value, 8 IO
threads):
- New implementation: **729,548 ops/sec**
- Old implementation: **633,243 ops/sec**
Representing a **~15%** performance improvement due to more efficient
memory handling.
---------
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>
Refer to: https://github.com/valkey-io/valkey/issues/1141
This update refactors the defrag code to:
* Make the overall code more readable and maintainable
* Reduce latencies incurred during defrag processing
With this update, the defrag cycle time is reduced to 500us, with more
frequent cycles. This results in much more predictable latencies, with a
dramatic reduction in tail latencies.
(See https://github.com/valkey-io/valkey/issues/1141 for more complete
details.)
This update is focused mostly on the high-level processing, and does NOT
address lower level functions which aren't currently timebound (e.g.
`activeDefragSdsDict()`, and `moduleDefragGlobals()`). These are out of
scope for this update and left for a future update.
I fixed `kvstoreDictLUTDefrag` because it was using up to 7ms on a CME
single shard. See original github issue for performance details.
---------
Signed-off-by: Jim Brunner <brunnerj@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
This PR optimizes the performance of HyperLogLog commands (PFCOUNT,
PFMERGE) by adding AVX2 fast paths.
Two AVX2 functions are added for conversion between the raw representation
and the dense representation. They are 15~30 times faster than the scalar
implementation. Note that the sparse representation is not accelerated.
AVX2 fast paths are enabled when the CPU supports AVX2 (checked at
runtime) and the hyperloglog configuration is default (HLL_REGISTERS ==
16384 && HLL_BITS == 6).
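A sketch of such runtime dispatch (hypothetical function names; the GCC/Clang builtin is used for the CPU check):
```c
/* Choose the dense-conversion implementation once at startup. */
typedef void (*hll_convert_fn)(const unsigned char *in, unsigned char *out);

extern void hllDenseToRaw_scalar(const unsigned char *in, unsigned char *out);
extern void hllDenseToRaw_avx2(const unsigned char *in, unsigned char *out);

static hll_convert_fn hll_dense_to_raw = hllDenseToRaw_scalar;

void hllSimdInitSketch(int hll_registers, int hll_bits) {
    /* AVX2 path only for the default geometry, as described above. */
    if (__builtin_cpu_supports("avx2") && hll_registers == 16384 && hll_bits == 6)
        hll_dense_to_raw = hllDenseToRaw_avx2;
}
```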
`PFDEBUG SIMD (ON|OFF)` subcommand is added for unit tests. A new TCL
unit test checks that the results produced by non-AVX2 and AVX2
implementations are exactly equal.
When merging 3 dense hll structures, the benchmark shows a 12x speedup
compared to the scalar version.
```
pfcount key1 key2 key3
pfmerge keyall key1 key2 key3
```
```
======================================================================================================
Type Ops/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
------------------------------------------------------------------------------------------------------
PFCOUNT-scalar 5665.56 35.29839 32.25500 63.99900 67.58300 608.60
PFCOUNT-avx2 72377.83 2.75834 2.67100 5.34300 6.81500 7774.96
------------------------------------------------------------------------------------------------------
PFMERGE-scalar 9851.29 20.28806 20.09500 36.86300 39.16700 615.71
PFMERGE-avx2 125621.89 1.59126 1.55100 3.11900 4.70300 15702.74
------------------------------------------------------------------------------------------------------
scalar: valkey:unstable 2df56d87c0ebe802f38e8922bb2ea1e4ca9cfa76
avx2: Nugine:hll-simd 8f9adc34021080d96e60bd0abe06b043f3ed0275
CPU: 13th Gen Intel® Core™ i9-13900H × 20
Memory: 32.0 GiB
OS: Ubuntu 22.04.5 LTS
```
Experiment repo: https://github.com/Nugine/redis-hyperloglog
Benchmark script:
https://github.com/Nugine/redis-hyperloglog/blob/main/scripts/memtier.sh
Algorithm:
https://github.com/Nugine/redis-hyperloglog/blob/main/cpp/bench.cpp
---------
Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>
This log allows us to easily distinguish between full coverage and a
minority partition when the cluster fails. Sometimes it is not easy
to see the minority partition in healthy shards (both primary and
replicas).
We decided not to add a cluster_fail_reason field to cluster info,
given that there are only two reasons and both are well-known; if
we end up adding more down the road, we can add it in the future.
Signed-off-by: Binbin <binloveplay1314@qq.com>
**Motivation**
Copy-on-write (COW) amplification refers to the issue where writing to a
small object leads to the entire page being cloned, resulting in
inefficient memory usage. This issue arises during the BGSAVE process,
which can be particularly problematic on instances with limited memory.
If the BGSAVE process could release unneeded memory, it could reduce
memory consumption. To address this, the BGSAVE process calls the
`madvise` function to signal the operating system to reclaim the buffer.
However, this approach does not work for buffers smaller than a page
(usually 4KiB). Even after multiple such calls, where a full page may be
free, the operating system will not reclaim it.
To solve this issue, we can call `zfree` directly. This allows the
allocator (jemalloc) to handle the bookkeeping and release pages when
buffers are no longer needed. This approach reduces copy-on-write
events.
**Benchmarks**
To understand how the usage of `zfree` affects BGSAVE and memory
consumption, I ran 45 benchmarks comparing my clone with the vanilla
version. The benchmark has the following steps:
1. Start a new Valkey process
2. Fill the DB with data sequentially
3. Run a warmup to randomize the memory layout
4. Introduce fragmentation by deleting part of the keys
5. In parallel:
1. Trigger BGSAVE
2. Start 80/20 get/set load
I played with the following parameters to understand their influence:
1. Number of keys: 3M, 6M, and 12M.
2. Data size. While the keys themselves are of fixed length (~30 bytes), the
value size is 120, 250, 500, 1000, or 2000 bytes.
3. Fragmentation. I deleted 5%, 10%, or 15% of the original key range.
I'm attaching a graph of BGSAVE process memory consumption. Instead of
all benchmarks, I show the most representative runs IMO.
<img width="1570" alt="3m-fixed"
src="https://github.com/user-attachments/assets/3dbbc528-01c1-4821-a3c2-6be455e7f78a">
For 2000 bytes values peak memory usage is ~53% compared to vanilla. The
peak happens at 57% BGSAVE progress.
For 500 bytes values the peak is ~80% compared to vanilla. And happens
at ~80% progress.
For 120 bytes the difference is under 5%, and the patched version could
even use more memory.

For 12M keys, the peak is ~85% of the vanilla's, happening at the ~70% mark.
For 6M keys, the peak is ~87% of the vanilla's, happening at the ~77% mark.
For 3M keys, the peak is ~87% of the vanilla's, happening at the ~80% mark.
**Changes**
The PR contains 2 changes:
1. Static buffer for RDB compression (see the sketch after this list).
RDB compression leads to COW events even without any write load if we
use `zfree`. This happens because the compression function allocates a
new buffer for each object; together with freeing objects with `zfree`,
this leads to reuse of memory shared with the main process.
To deal with this problem, we use a pre-allocated constant 8K buffer for
compression. If the object size is too big for this buffer, then we fall
back to the ad hoc allocation behavior.
2. Freeing string objects instead of dismissing them.
A call to `zfree` is more expensive than a direct call to `madvise`. But
with #453 strings use the fast path – `zfree_with_size`. As a possible
next step we can optimize `zfree` for other data types as well.
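A sketch of the static-buffer fallback described in change 1, using lzf as the compressor (buffer size, names, and the exact output bound are illustrative):
```c
#include <stdlib.h>

/* Real lzf API (lzf.h): returns 0 when out_len is too small. */
unsigned int lzf_compress(const void *in, unsigned int in_len,
                          void *out, unsigned int out_len);

#define RDB_COMP_STATIC_BUF (8 * 1024)
static char rdb_comp_buf[RDB_COMP_STATIC_BUF]; /* reused, avoids per-object COW */

/* Compress into the static buffer when it fits; fall back to an ad hoc
 * allocation for big objects. Caller frees *out only when *heap is set. */
unsigned int compressForRdbSketch(const void *in, unsigned int in_len,
                                  void **out, int *heap) {
    unsigned int n = lzf_compress(in, in_len, rdb_comp_buf, sizeof(rdb_comp_buf));
    if (n) { *out = rdb_comp_buf; *heap = 0; return n; }
    /* Require the output to be at least a few bytes smaller, so that
     * compression actually saves space (illustrative bound). */
    unsigned int cap = in_len > 4 ? in_len - 4 : in_len;
    void *buf = malloc(cap);
    if (!buf) return 0;
    n = lzf_compress(in, in_len, buf, cap);
    if (!n) { free(buf); return 0; }
    *out = buf; *heap = 1;
    return n;
}
```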
---------
Signed-off-by: Vadym Khoptynets <vadymkh@amazon.com>
Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
1. The test isn't waiting long enough for the output buffer to overrun.
This happens because an error from the previous test bleeds into the
current test's logs. The simplest fix is to split these tests.
2. Increased the replication timeout to ensure the sync fails due to output
buffer overrun before a timeout occurs.
Fixes #1367
Signed-off-by: naglera <anagler123@gmail.com>
We should reset repl_down_since only on state change. In the
current code, if the rdb channel of the dual channel is normal
(that is, the rdb is loaded normally) but the psync channel is
abnormal, we set repl_down_since to 0 here. If the primary
goes down at this time, the replica may misbehave when calculating
data_age in cluster failover: since repl_state != REPL_STATE_CONNECTED,
the replica is unable to initiate an election due
to the old data_age.
In dualChannelSyncHandleRdbLoadCompletion, if the psync channel
is not established, the function returns. We set repl_state
to REPL_STATE_CONNECTED and repl_down_since to 0 in
dualChannelSyncSuccess, that is, in establishPrimaryConnection.
See also 677d10b2a8ff7f13033ccfe56ffcd246dbe70fb6 for more details.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Replace occurrences of 'zfree' with 'zfree_with_size' to improve
performance.
The 'zfree_with_size' function avoids calling 'zmalloc_size' to retrieve
the buffer size and instead uses the previously computed size. This
results in faster memory deallocation and reduced overhead.
Signed-off-by: stav bentov <stavbt@amazon.com>
Co-authored-by: stav bentov <stavbt@amazon.com>
There are several patches in this PR:
* Abstract the set/rewrite of the config bind option: `bind` is a special
config; `socket` and `tls` use the same one. However, RDMA uses a
similar style but a different option. Add a bit of abstraction to make it
flexible for both `socket` and `RDMA` (and even QUIC in the future).
* Introduce closeListener for connection types: closing a socket with a
simple syscall would be fine, but RDMA has complex logic. Introduce a
connection-type-specific close listener method.
* RDMA: use valkey.conf style instead of module parameters: use the
`--rdma-bind` and `--rdma-port` style instead of module parameters. The
module-style configs `rdma.bind` and `rdma.port` are removed.
* RDMA: support builtin: support `make BUILD_RDMA=yes`. The module style is
still kept for now.
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
In #1326 we made KEYS able to visit expired keys in the import-source state
by updating keyIsExpired to check for import mode. But after #1205,
we now use keyIsExpiredWithDictIndex to optimize and remove the
redundant dict_index, and keyIsExpiredWithDictIndex did not handle
this logic.
In this commit, we make keyIsExpiredWithDictIndex check
for import mode as well, so that KEYS can visit expired keys.
Signed-off-by: Binbin <binloveplay1314@qq.com>
After #1185, a client in the import-source state can visit expired keys
in both read and write commands. This commit makes the
keyIsExpired function handle the import-source state as well, so that
KEYS can visit expired keys.
This is not particularly important, but it keeps the behavior consistent
with the definition. We also do some cleanup around the test and verified
that the client can indeed visit expired keys.
Signed-off-by: Binbin <binloveplay1314@qq.com>
This PR introduces a consistent tagging system for dual-channel logs.
The goal is to improve log readability and filterability, making it
easier for operators to manage and analyze log entries.
Resolves https://github.com/valkey-io/valkey/issues/986
---------
Signed-off-by: naglera <anagler123@gmail.com>
The code was OK before 2de544cfcc6d1aa7cf6d0c75a6116f7fc27b6fd6,
but now we set server.repl_transfer_fd right after dfd is
initiated, so here we have a double close error, since dfd and
server.repl_transfer_fd are the same fd.
Also move the declaration of dfd/maxtries into a smaller scope to avoid
confusion, since they are only used in this code.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Apparently on Mac, sleep modifies errno to ETIMEDOUT, and then the
misleading message is printed: Operation timed out.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Optimize sdscatrepr by reducing realloc calls; furthermore, we can reduce memcpy calls by
batch processing consecutive printable characters.
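The core idea as a standalone sketch (not the actual sds code): find the longest run of plain printable characters and copy it with a single memcpy instead of appending byte by byte:
```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Copies p[0..len) into dst, batching printable runs; dst is assumed
 * to be large enough (sketch only). Returns bytes written. */
static size_t catrepr_sketch(char *dst, const char *p, size_t len) {
    size_t w = 0, i = 0;
    while (i < len) {
        size_t start = i;
        while (i < len && isprint((unsigned char)p[i]) &&
               p[i] != '"' && p[i] != '\\') i++;
        if (i > start) {
            memcpy(dst + w, p + start, i - start); /* one copy per run */
            w += i - start;
        }
        if (i < len) {
            /* quote, backslash or unprintable byte: its escaped form
             * ("\n", "\xHH", ...) would be appended here; elided. */
            i++;
        }
    }
    return w;
}
```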
Signed-off-by: Ray Cao <zisong.cw@alibaba.com>
Co-authored-by: Ray Cao <zisong.cw@alibaba.com>
Fast_float is a C++ header-only library to parse doubles using SIMD
instructions. The purpose is to speed up sorted sets and other commands
that use doubles. A single-file copy of fast_float is included in this
repo. This introduces an optional dependency on a C++ compiler.
The use of fast_float is enabled at compile time using the make variable
`USE_FAST_FLOAT=yes`. It is disabled by default.
Fixes #1069.
---------
Signed-off-by: Parth Patel <661497+parthpatel@users.noreply.github.com>
Signed-off-by: Parth <661497+parthpatel@users.noreply.github.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Roshan Swain <swainroshan001@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
We have replicationEmptyDbCallback, a callback used by emptyData
while flushing away old data. Previously, we did not add this callback
logic for functions; in case of abuse there may be a lot of functions,
and also to make the code consistent, we add the same callback logic
for functions.
Changes around this commit:
1. Extend emptyData / functionsLibCtxClear to support passing a callback
when flushing functions.
2. Added disklessLoad function create and discard helper functions (just
like disklessLoadInitTempDb and disklessLoadDiscardTempDb); we will
always flush the temp functions asynchronously to avoid any blocking.
3. Cleanup around discardTempDb: remove the callback pointer, since in
the async way we don't need the callback.
4. Remove the functionsLibCtxClear call in readSyncBulkPayload, because we
call emptyData in the previous lines, which also empties functions.
The reason we do this callback in replication is that during the flush,
the replica may block for a while if the flush is done synchronously; to
avoid the primary detecting the replica as timing out, the replica uses this
callback to notify the primary (we also do this callback when loading
an RDB). In the async way, we empty the data in the bio thread and there is
no slow operation, so the callback is ignored.
Signed-off-by: Binbin <binloveplay1314@qq.com>
These commands are all administrator commands; if they are operated
incorrectly, serious consequences may occur. Print the full client
info by using catClientInfoString; the info is useful when we want
to identify the source of a request.
Since the original client info is very large and might complicate the
output, we added a catClientInfoShortString function that only
prints some basic fields that are useful to identify the client.
These fields are:
- id
- addr
- laddr
- connection info
- name
- user
- lib-name
- lib-ver
We also used it to replace the original client info where it serves the
same purpose. Some logging is changed from full client info to short
client info:
- CLUSTER FAILOVER
- FAILOVER / PSYNC
- REPLICAOF NO ONE
- SHUTDOWN
Signed-off-by: Binbin <binloveplay1314@qq.com>
When a node's role changes, we should broadcast the change to notify other nodes.
For example, with one primary and one replica: after a failover, the replica
becomes the new primary and the primary becomes a new replica.
If we then trigger a second cluster failover for the new replica, the
new replica will send an MFSTART to its primary, i.e., the new primary.
But the new primary may reject the MFSTART due to this logic:
```
} else if (type == CLUSTERMSG_TYPE_MFSTART) {
if (!sender || sender->replicaof != myself) return 1;
```
In the new primary's view, the sender is still a primary and sender->replicaof
is NULL, so we return. Then the manual failover times out.
Another possibility is that other primaries refuse to vote after receiving
the FAILOVER_AUTH_REQUEST: in their view the sender is still a primary,
so they refuse to vote, and the manual failover times out.
```
void clusterSendFailoverAuthIfNeeded(clusterNode *node, clusterMsg *request) {
...
if (clusterNodeIsPrimary(node)) {
serverLog(LL_WARNING, "Failover auth denied to...
```
The reason is that, currently, we only update the node->replicaof information
when we receive a PING/PONG from the sender. For details, see clusterProcessPacket.
Therefore, in some scenarios, such as clusters with many nodes and a large
cluster-ping-interval (that is, cluster-node-timeout), the role change of a node
can be very delayed.
Also added a DEBUG DISABLE-CLUSTER-RANDOM-PING command to disable the
cluster ping that is otherwise sent to a random node every second (see clusterCron).
Signed-off-by: Binbin <binloveplay1314@qq.com>
We need to start making use of the new `WithDictIndex` APIs which allow
us to reuse the dict_index calculation (avoid over-calling `getKeySlot`
for no good reason).
In this PR I optimized `lookupKey` so that the dict_index from `getKeySlot`
is reused two additional times. It also optimizes the KEYS command
to avoid unnecessary computation of the slot id.
---------
Signed-off-by: Nadav Levanoni <nadavl@amazon.com>
Co-authored-by: Nadav Levanoni <nadavl@amazon.com>
`cluster_legacy.c`: `slot_info_pairs` has `uint16_t` values, but they
were cast to `unsigned long` and `%i` was used.
`valkey-cli.c`: `node->replicas_count` is `int`, not `unsigned long`.
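The portable fix is to use the matching length modifier or cast explicitly, for example:
```c
#include <inttypes.h>
#include <stdio.h>

void print_examples(uint16_t slot_start, int replicas_count) {
    printf("slot %" PRIu16 "\n", slot_start);  /* uint16_t length modifier */
    printf("slot %u\n", (unsigned)slot_start); /* or promote explicitly */
    printf("replicas %d\n", replicas_count);   /* int uses %d, not %lu */
}
```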
Signed-off-by: ArtSin <artsin666@gmail.com>
The test is incompatible with valgrind. Added a new `--valgrind`
argument to the test suite, which causes that test to be skipped.
We skipped it in the past; see 5b61b0dc6d2579ee484fa6cf29bfac59513f84ab.
Signed-off-by: Binbin <binloveplay1314@qq.com>
If a manual failover times out, e.g. the election does not get
enough votes, then since we have an auth_timeout and an auth_retry_time, a
new manual failover will not be able to proceed on the replica side.
For example, if we initiate a new manual failover after an election timed
out, we pause the primary, but on the replica side, due to the retry time,
the replica does not trigger the new election, and the manual failover
eventually times out.
In this case, if we initiate a manual failover again and there is an
ongoing election, we reset it so that the replica can initiate
a new election at the manual failover's request.
Signed-off-by: Binbin <binloveplay1314@qq.com>
### Summary of the change
This is a base PR for refactoring defrag. It moves the defrag logic to
rely on the jemalloc [native
api](https://github.com/jemalloc/jemalloc/pull/1463#issuecomment-479706489)
instead of relying on custom code changes made by valkey in the jemalloc
([je_defrag_hint](9f8185f5c8/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h (L382)))
library. This enables valkey to use the latest vanilla jemalloc without the
need to maintain code changes across jemalloc versions.
This change requires some modifications because the new api provides
only the information, not a yes/no defrag decision; that logic needs to be
implemented in valkey code. Additionally, the api does not provide,
within a single call, all the information needed to make a decision; that
information is available through an additional api call. To reduce the
calls to jemalloc, in this PR the required information is collected
during `computeDefragCycles` and not for every single ptr; this way
we avoid the additional api call.
Followup work will utilize the new options that are now open and will
further improve the defrag decision and process.
### Added files:
`allocator_defrag.c` / `allocator_defrag.h` - these files implement the
allocator-specific knowledge for making the defrag decision. The knowledge
about slabs, allocation logic and so on all goes into these files.
This improves the separation between jemalloc-specific code and other
possible implementations.
### Moved functions:
[`zmalloc_no_tcache`, `zfree_no_tcache`](4593dc2f05/src/zmalloc.c (L215))
- these contain very jemalloc-specific logic assumptions and are very
specific to how we defrag with jemalloc. This is also with the vision
that, from a performance perspective, we should consider using tcache; we
only need to make sure we don't recycle entries without going through
the arena (for example, we can use a private tcache, one for free and one
for alloc).
`frag_smallbins_bytes` - the logic and implementation moved to the new
file.
### Existing API:
* [once a second + when a full cycle completed]
[`computeDefragCycles`](4593dc2f05/src/defrag.c (L916))
* `zmalloc_get_allocator_info`: gets from jemalloc _allocated, active,
resident, retained, muzzy_, `frag_smallbins_bytes`
* [`frag_smallbins_bytes`](4593dc2f05/src/zmalloc.c (L690)):
for each bin, gets from jemalloc bin_info, `curr_regs`, `cur_slabs`
* [during defrag, for each pointer]
* `je_defrag_hint` gets a memory pointer and returns {0,1}.
[Internally it
uses](4593dc2f05/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h (L368))
these information points:
  * #`nonfull_slabs`
  * #`total_slabs`
  * # of free regs in the ptr's slab
## Jemalloc API (via ctl interface)
[BATCH] [`experimental_utilization_batch_query_ctl`](4593dc2f05/deps/jemalloc/src/ctl.c (L4114)):
gets an array of pointers and returns, for each pointer, 3 values:
* number of free regions in the extent
* number of regions in the extent
* size of the extent in bytes
[EXTENDED] [`experimental_utilization_query_ctl`](4593dc2f05/deps/jemalloc/src/ctl.c (L3989)):
* memory address of the extent a potential reallocation would go into
* number of free regions in the extent
* number of regions in the extent
* size of the extent in bytes
* [stats-enabled] total number of free regions in the bin the extent
belongs to
* [stats-enabled] total number of regions in the bin the extent belongs
to
### `experimental_utilization_batch_query_ctl` vs valkey
`je_defrag_hint`?
[good]
- We can query pointers in a batch, reducing the overall overhead
- The per-ptr decision algorithm is not within the jemalloc api; jemalloc
only provides information, so valkey can tune/configure/optimize easily
[bad]
- In the batch API we only know the utilization of the slab (of that
memory ptr); we don't get the data about #`nonfull_slabs` and total
allocated regs.
## New functions:
1. `defrag_jemalloc_init`: reduces the cost of calls to je_ctl by using the
[MIB interface](https://jemalloc.net/jemalloc.3.html) for faster
calls (see the sketch after this list). See this quote from the jemalloc
documentation:
The mallctlnametomib() function provides a way to avoid repeated name
lookups for applications that repeatedly query the same portion of the
namespace, by translating a name to a "Management Information Base" (MIB)
that can be passed repeatedly to mallctlbymib().
2. `jemalloc_sz2binind_lgq*`: this api supports a reverse map between a
bin size and its info without a lookup. The mapping depends on
the number of size classes, which are derived from
[`lg_quantum`](4593dc2f05/deps/Makefile (L115))
3. `defrag_jemalloc_get_frag_smallbins`: this function replaces
`frag_smallbins_bytes`; the logic moved to the new file allocator_defrag
4. `defrag_jemalloc_should_defrag_multi` → `handle_results`: unpacks the
results
5. `should_defrag`: implements the same logic as the existing
implementation
[inside](9f8185f5c8/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h (L382))
je_defrag_hint
6. `defrag_jemalloc_should_defrag_multi`: implements the hint for an
array of pointers, utilizing the new batch api; currently only 1 pointer
is passed.
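A sketch of the MIB pattern from item 1, using the documented jemalloc ctl API (the name queried here is just an example; with Valkey's build the symbols carry a je_ prefix):
```c
#include <jemalloc/jemalloc.h>

/* Translate the ctl name once at init, then query repeatedly by MIB. */
static size_t stats_mib[8];
static size_t stats_miblen = sizeof(stats_mib) / sizeof(stats_mib[0]);

int defragMibInitSketch(void) {
    return mallctlnametomib("stats.allocated", stats_mib, &stats_miblen);
}

size_t allocatedSketch(void) {
    size_t allocated = 0, sz = sizeof(allocated);
    mallctlbymib(stats_mib, stats_miblen, &allocated, &sz, NULL, 0);
    return allocated;
}
```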
### Logical differences:
In order to get the information about #`nonfull_slabs` and #`regs`, we
use the query cycle to collect the information per size class. In order
to find the index of the bin information for a given bin size in O(1), we use
`jemalloc_sz2binind_lgq*`.
## Testing
This is the first draft. I did some initial testing that basically
creates fragmentation by reducing max memory and then waiting for defrag to
reach the desired level. The test only serves as a sanity check that defrag
eventually succeeds; no data is provided here regarding efficiency and
performance.
### Test:
1. disable `activedefrag`
2. run valkey benchmark on overlapping address ranges with different
block sizes
3. wait until `used_memory` reaches 10GB
4. set `maxmemory` to 5GB and `maxmemory-policy` to `allkeys-lru`
5. stop load
6. wait for `mem_fragmentation_ratio` to reach 2
7. enable `activedefrag` - start test timer
8. wait until reach `mem_fragmentation_ratio` = 1.1
#### Results*:
(With this PR) Test results: `56 sec`
(Without this PR) Test results: `67 sec`
*Both runs perform the same "work": the same number of buffers moved to
reach the fragmentation target
Next benchmarking is to compare to:
- DONE // existing `je_get_defrag_hint`
- compare with naive defrag all: `int defrag_hint() {return 1;}`
---------
Signed-off-by: Zvi Schneider <ezvisch@amazon.com>
Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com>
Signed-off-by: zvi-code <54795925+zvi-code@users.noreply.github.com>
Co-authored-by: Zvi Schneider <ezvisch@amazon.com>
Co-authored-by: Zvi Schneider <zvi.schneider22@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
This change prevents unintended side effects on connection state and
improves consistency with non-TLS sync operations.
For example, when invoking `connTLSSyncRead` with a blocking file
descriptor, the mode is switched to non-blocking upon `connTLSSyncRead`
exit. If the code assumes the file descriptor remains blocking and calls
the normal `read` expecting it to block, it may result in a short read.
This caused a crash in dual-channel, which was fixed in this PR by
relocating `connBlock()`:
https://github.com/valkey-io/valkey/pull/837
Signed-off-by: xbasel <103044017+xbasel@users.noreply.github.com>
FUNCTION RESTORE has a FLUSH option that deletes all the existing
libraries before restoring the payload. If for some reason there are
a lot of libraries, we will block for a while here.
Signed-off-by: Binbin <binloveplay1314@qq.com>
The goto error label does the same thing as the inline error return, so
use goto to reduce the duplication:
```
error:
cancelReplicationHandshake(1);
return;
```
This also makes the log printing more continuous on error: we print
the error log first, and then print the reconnecting log last (in
cancelReplicationHandshake).
Signed-off-by: Binbin <binloveplay1314@qq.com>
New config: `import-mode (yes|no)`
New command: `CLIENT IMPORT-SOURCE (ON|OFF)`
The config, when set to `yes`, disables eviction and deletion of expired
keys, except for commands coming from a client which has marked itself
as an import-source, the data source when importing data from another
node, using the CLIENT IMPORT-SOURCE command.
When we sync data from the source Valkey to the destination Valkey using
some sync tools like
[redis-shake](https://github.com/tair-opensource/RedisShake), the
destination Valkey can perform expiration and eviction, which may cause
data corruption. This problem has been discussed in
https://github.com/redis/redis/discussions/9760#discussioncomment-1681041
and Redis already has a solution. But in Valkey we haven't fixed it until
now.
E.g., we call `set key 1 ex 1` on the source server and transfer this
command to the destination server. Then we call `incr key` on the source
server before the key expires, so we have a key on the source server
with a value of 2. But when the command arrives at the destination
server, the key may have already expired and been deleted. So we end up
with a key on the destination server with a value of 1, which is
inconsistent with the source server.
In standalone mode, we can use a writable replica to simplify the sync
process. However, in cluster mode, we still need a sync tool to help us
transfer the source data to the destination. The sync tool usually works
as a normal client, and the destination works as a primary which keeps
doing expiration and eviction.
In this PR, we add a new mode named 'import-mode'. In this mode, the server
stops expiration and eviction, just like a replica. Notice that this mode
exists only during the sync, to avoid data inconsistency caused by
expiration and eviction. Import mode only takes effect on the primary.
Sync tools can mark their clients as an import source with `CLIENT
IMPORT-SOURCE`; such a client works like a client from the primary and can
visit expired keys in `lookupKey`.
**Notice: during the migration, other clients, apart from the import
source, should not access the data imported by import source.**
---------
Signed-off-by: lvyanqi.lyq <lvyanqi.lyq@alibaba-inc.com>
Signed-off-by: Yanqi Lv <lvyanqi.lyq@alibaba-inc.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
This limit should not restrict manual failover; otherwise, in some
scenarios, manual failover will time out.
For example, if some FAILOVER_AUTH_REQUESTs or some FAILOVER_AUTH_ACKs
are lost during a manual failover, the node cannot vote in the second
manual failover. Or, in a mixed scenario of plain failover and manual
failover, it cannot vote for the subsequent manual failover.
The problem with retrying the manual failover is that the mf pauses
clients for 5s on the primary side, so retrying every time a manual
failover times out is a bad move.
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This adds the missing return when repl_state changes to RECEIVE_VERSION_REPLY;
this way we won't be blocked if the primary doesn't reply with REPLCONF
VERSION.
In practice, I guess this is not likely to block in this context, since
small responses are likely to be received in one packet, so this
is just a cleanup (consistent with the previous state machine
processing).
Also update the state machine diagram to mention the VERSION reply.
Signed-off-by: Binbin <binloveplay1314@qq.com>
### Problem
Valkey stores scripts in a dictionary (lua_scripts) keyed by their SHA1
hashes, but it needs a way to know which scripts are least recently
used. It uses an LRU list (lua_scripts_lru_list) to keep track of
scripts in usage order. When the list reaches a maximum length, Valkey
evicts the oldest scripts to free memory in both the list and
dictionary. The problem here is that an sds in the LRU list can end up
pointing to memory that has already been freed or moved by active defrag,
memory that the sds in the dictionary used to point to. This results in an
assertion error at [this
line](https://github.com/valkey-io/valkey/blob/unstable/src/eval.c#L519)
### Solution
If we duplicate the sds when adding it to the LRU list, we create an
independent copy of the script identifier (sha). This duplication
ensures that the sha string in the LRU list remains stable and
unaffected by any defragmentation that could alter or free the original
sds. In addition, dictUnlink doesn't require an exact pointer
match ([ref](https://github.com/valkey-io/valkey/blob/unstable/src/eval.c#L71-L78)),
so the copy of the sds still unlinks the right dictEntry.
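A sketch of the fix, assuming the usual sds/dict/adlist helpers from the codebase (the surrounding function names are hypothetical):
```c
#include "server.h" /* sds, dict, list helpers */

/* Store an independent copy of the sha in the LRU list, so active
 * defrag moving the dictionary's sds cannot invalidate the list entry. */
void scriptLruTrackSketch(list *lru_list, const sds sha) {
    listAddNodeTail(lru_list, sdsdup(sha));
}

/* On eviction: the copy compares equal by content, so dictUnlink still
 * finds the right entry; the copy is freed separately. */
void scriptLruEvictOneSketch(dict *lua_scripts, list *lru_list) {
    listNode *ln = listFirst(lru_list);
    sds sha = listNodeValue(ln);
    dictEntry *de = dictUnlink(lua_scripts, sha);
    if (de) dictFreeUnlinkedEntry(lua_scripts, de);
    listDelNode(lru_list, ln);
    sdsfree(sha);
}
```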
### Reproduce
To reproduce it with tcl test:
1. Disable je_get_defrag_hint in defrag.c to trigger defrag often
2. Execute test script
```
start_server {tags {"auth external:skip"}} {
test {Regression for script LRU crash} {
r config set activedefrag yes
r config set active-defrag-ignore-bytes 1
r config set active-defrag-threshold-lower 0
r config set active-defrag-threshold-upper 1
r config set active-defrag-cycle-min 99
r config set active-defrag-cycle-max 99
for {set i 0} {$i < 100000} {incr i} {
r eval "return $i" 0
}
after 5000;
}
}
```
### Crash info
Crash report:
```
=== REDIS BUG REPORT START: Cut & paste starting from here ===
14044:M 12 Nov 2024 14:51:27.054 # === ASSERTION FAILED ===
14044:M 12 Nov 2024 14:51:27.054 # ==> eval.c:556 'de' is not true
------ STACK TRACE ------
Backtrace:
/usr/bin/redis-server 127.0.0.1:6379 [cluster](luaDeleteFunction+0x148)[0x723708]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](luaCreateFunction+0x26c)[0x724450]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](evalCommand+0x2bc)[0x7254dc]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](call+0x574)[0x5b8d14]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](processCommand+0xc84)[0x5b9b10]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](processCommandAndResetClient+0x11c)[0x6db63c]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](processInputBuffer+0x1b0)[0x6dffd4]
/usr/bin/redis-server 127.0.0.1:6379 [cluster][0x6bd968]
/usr/bin/redis-server 127.0.0.1:6379 [cluster][0x659634]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](amzTLSEventHandler+0x194)[0x6588d8]
/usr/bin/redis-server 127.0.0.1:6379 [cluster][0x750c88]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](aeProcessEvents+0x228)[0x757fa8]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](redisMain+0x478)[0x7786b8]
/lib64/libc.so.6(__libc_start_main+0xe4)[0xffffa7763da4]
/usr/bin/redis-server 127.0.0.1:6379 [cluster][0x5ad3b0]
```
Defrag info:
```
mem_fragmentation_ratio:1.18
mem_fragmentation_bytes:47229992
active_defrag_hits:20561
active_defrag_misses:5878518
active_defrag_key_hits:77
active_defrag_key_misses:212
total_active_defrag_time:29009
```
### Test:
Run the test script to push 100,000 scripts and verify that the LRU list
keeps its maximum length of 500 without any crash.
```
27489:M 14 Nov 2024 20:56:41.583 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.583 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
[ok]: Regression for script LRU crash (6811 ms)
[1/1 done]: unit/test (7 seconds)
```
---------
Signed-off-by: Seungmin Lee <sungming@amazon.com>
Signed-off-by: Seungmin Lee <155032684+sungming2@users.noreply.github.com>
Co-authored-by: Seungmin Lee <sungming@amazon.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
### Problem
GitHub Actions is starting the deprecation process for macOS 12.
Deprecation will begin on 10/7/24 and the image will be fully
unsupported by 12/3/24.
For more details, see
https://github.com/actions/runner-images/issues/10721
Signed-off-by: Seungmin Lee <sungming@amazon.com>
Co-authored-by: Seungmin Lee <sungming@amazon.com>
This PR addresses two issues:
1. Performance degradation fix - resolves a significant performance
issue during RDB load on replica nodes.
- The problem was causing replicas to rehash multiple times during the
load process. Local testing demonstrated up to 50% degradation in BGSAVE
time.
- The problem occurs when the replica tries to expand pre-created slot
dictionaries. This operation fails quietly, resulting in undetected
performance issues.
- This fix optimizes the RDB load process and restores the expected
performance levels.
2. Bug fix when reading `RDB_OPCODE_RESIZEDB` in Valkey 8.0 cluster mode:
- Use the shard's master slot count when processing this opcode, as
`clusterNodeCoversSlot` is not initialized for the currently syncing
replica.
- Previously, this problem went unnoticed because `RDB_OPCODE_RESIZEDB`
had no practical impact (due to 1).
These improvements will enhance overall system performance and ensure
smoother upgrades to Valkey 8.0 in the future.
Testing:
- Conducted local tests to verify the performance improvement during RDB
load.
- Verified that ignoring `RDB_OPCODE_RESIZEDB` does not negatively
impact functionality in the current version.
Signed-off-by: naglera <anagler123@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
s_malloc_size == zmalloc_size. Currently sdsAllocSize does not
account for PREFIX_SIZE when no malloc_size is available, which causes
test_typesAndAllocSize to fail in the new unit test. What we actually
want to check is zmalloc_usable_size.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Main thread profiling revealed significant overhead in TLS operations,
even with read/write offloaded to I/O threads:
Perf results:
**10.82%** 8.82% `valkey-server libssl.so.3 [.] SSL_pending` # Called by
main thread after I/O completion
**10.16%** 5.06% `valkey-server libcrypto.so.3 [.] ERR_clear_error` #
Called for every event regardless of thread handling
This commit further optimizes TLS operations by moving more work from
the main thread to I/O threads:
Improve TLS offloading to I/O threads with two main changes:
1. Move `ERR_clear_error()` calls closer to SSL operations
- Currently, error queue is cleared for every TLS event
- Now only clear before actual SSL function calls
- This prevents unnecessary clearing in main thread when operations
are handled by I/O threads
2. Optimize `SSL_pending()` checks
- Add `TLS_CONN_FLAG_HAS_PENDING` flag to track pending data
- Move pending check to follow read operations immediately
- I/O thread sets flag when pending data exists
- Main thread uses flag to update pending list
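A minimal sketch of the I/O-thread side of these two changes, assuming illustrative struct and helper names (the real logic lives in the TLS connection code):
```c
#include <openssl/ssl.h>
#include <openssl/err.h>

#define TLS_CONN_FLAG_HAS_PENDING (1 << 0) /* illustrative flag value */

typedef struct tls_connection {
    SSL *ssl;
    int flags;
} tls_connection;

/* I/O thread: clear the error queue only right before the SSL call,
 * and record whether data is still buffered immediately after the
 * read, so the main thread can consult the flag without calling
 * SSL_pending() itself. */
static int tlsReadSketch(tls_connection *conn, void *buf, int len) {
    ERR_clear_error();
    int nread = SSL_read(conn->ssl, buf, len);
    if (SSL_pending(conn->ssl) > 0)
        conn->flags |= TLS_CONN_FLAG_HAS_PENDING;
    else
        conn->flags &= ~TLS_CONN_FLAG_HAS_PENDING;
    return nread;
}
```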
Performance improvements:
Testing setup based on
https://valkey.io/blog/unlock-one-million-rps-part2/
Before:
- SET: 896,047 ops/sec
- GET: 875,794 ops/sec
After:
- SET: 985,784 ops/sec (+10% improvement)
- GET: 1,066,171 ops/sec (+22% improvement)
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
CI report this failure:
```
[exception]: Executing test client: MOVED 1 127.0.0.1:22128.
MOVED 1 127.0.0.1:22128
while executing
"wait_for_condition 1000 50 {
[R 3 get key_991803] == 1024 && [R 3 get key_977613] == 10240 &&
[R 4 get key_991803] == 1024 && ..."
```
This may be because, even though the cluster state becomes OK,
the cluster still has an inconsistent configuration for a short
period of time. We make sure to wait for the config to be consistent.
Signed-off-by: Binbin <binloveplay1314@qq.com>
If bgsaveerr is an error, there is no need to protect the rdb channel.
The impact of this is that when bgsave fails, we would protect
the rdb channel for 60s. It may hold the reference of the repl
buf block, making it impossible to recycle it until we free the
client due to COB or free the client after 60s.
We kept the RDB channel open as long as the replica hadn't established
a main connection, even if the snapshot process failed. There is no
value in keeping the RDB client in this case.
Signed-off-by: Binbin <binloveplay1314@qq.com>
We have an assert in propagateNow. If the primary node receives a
CLUSTER UPDATE, such as dirty slots, during the SIGTERM wait, during
a manual failover pause, or during a client pause, the delKeysInSlot
call will trigger this assert and cause a primary crash.
In this case, we added a new server_del_keys_in_slot state, just like
client_pause_in_transaction, to track the state and avoid the assert
in propagateNow; the dirty slots will be deleted in the end without
affecting data consistency.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
We originally checked the replica connection (to decide whether to
kill the diskless child) only when rdbPipeReadHandler was triggered.
Actually we can check it when the replica is disconnected, so that we
don't have to wait for rdbPipeReadHandler to be triggered and can kill
the diskless child as soon as possible.
In this way, when the child or rdbPipeReadHandler is stuck for some
reason, we can kill the child faster and release the fork resources.
Signed-off-by: Binbin <binloveplay1314@qq.com>
This adds a log when a handshake fails due to a timeout. This can help
troubleshoot cluster asymmetry issues caused by failed MEETs.
---------
Signed-off-by: Ben Totten <btotten@amazon.com>
Signed-off-by: bentotten <59932872+bentotten@users.noreply.github.com>
Co-authored-by: Ben Totten <btotten@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Remove the dict pointer argument to the `dictType` callbacks `keyDup`,
`keyCompare`, `keyDestructor` and `valDestructor`. This argument was
unused in all of the callback implementations.
The macros `dictFreeKey()` and `dictFreeVal()` are made internal to dict
and moved from dict.h to dict.c. They're also changed from macros to
static inline functions.
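For illustration, a sketch of the resulting callback shapes (the real `dictType` in dict.h has more members; this only shows the removed `dict *d` argument):
```c
#include <stdint.h>

typedef struct dictType {
    uint64_t (*hashFunction)(const void *key);
    void *(*keyDup)(const void *key);                /* no dict *d arg */
    int (*keyCompare)(const void *key1, const void *key2);
    void (*keyDestructor)(void *key);                /* no dict *d arg */
    void (*valDestructor)(void *obj);                /* no dict *d arg */
} dictType;
```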
Signed-off-by: Qu Chen <quchen@amazon.com>
* We compile various c files into object and package them into library
(.a file) using ar to feed to unit tests. With new GCC versions, the
objects inside such library don't participate in LTO process without
additional flags.
* Here is a direct quote from gcc documentation explaining this issue:
"If you are not using a linker with plugin support and/or do not enable
the linker plugin, then the objects inside libfoo.a are extracted and
linked as usual, but they do not participate in the LTO optimization
process. In order to make a static library suitable for both LTO
optimization and usual linkage, compile its object files with
-flto-ffat-lto-objects."
* Read full documentation about -flto at
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
* Without this additional flag, I get following errors while executing
"make test-unit". With this change, those errors go away.
```
ARCHIVE libvalkey.a
ar: threads_mngr.o: plugin needed to handle lto object
...
..
.
/tmp/ccDYbMXL.ltrans0.ltrans.o: In function `dictClear':
/local/workplace/elasticache/valkey/src/unit/../dict.c:776: undefined
reference to `valkey_free'
/local/workplace/elasticache/valkey/src/unit/../dict.c:770: undefined
reference to `valkey_free'
/tmp/ccDYbMXL.ltrans0.ltrans.o: In function `dictGetVal':
```
Fixes #1290
---------
Signed-off-by: Parth Patel <661497+parthpatel@users.noreply.github.com>
Migrate quicklist unit test to new unit test framework, and cleanup
remaining references of SERVER_TEST, parent ticket #428.
Closes #428.
Signed-off-by: artikell <739609084@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
In #1282, we init server.pid earlier to keep the log message role
consistent, but we forgot to consider daemonize. In daemonize
mode, we would always print the child role.
We need to reset server.pid after daemonize(), otherwise the
log printing role will always be the child. It also causes an
incorrect server.pid value, affecting the concatenation of
some pid names.
Signed-off-by: Binbin <binloveplay1314@qq.com>
If we become an empty primary for some reason, we still need to
check if we need to delete dirty slots, because we may have dirty
slot data left over from a bad migration, for example when the target
node forcibly executes CLUSTER SETSLOT NODE to take over the slot
without performing key migration.
Signed-off-by: Binbin <binloveplay1314@qq.com>
If multiple primary nodes go down at the same time, their replica nodes will
initiate elections at the same time. There is a certain probability that
the replicas will initiate the elections in the same epoch.
And obviously, in our current election mechanism, only one replica node can
eventually get enough votes; the other replica node will fail to win
due to the insufficient majority, and then its election will time out and
we will wait for the retry, which results in a long failure time.
If another node has won the election in the failover epoch, we can assume
that our election has failed and we can retry as soon as possible.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Currently when a manual failover is triggered, we set a
CLUSTER_TODO_HANDLE_FAILOVER flag to start the election as soon as
possible in the next beforeSleep. But in fact, we don't delay
the election in a manual failover, and waiting for the next beforeSleep
to kick in delays the election by some milliseconds.
We can trigger the election immediately in this case in the
same function call, without waiting for beforeSleep, which
can save us some milliseconds.
Signed-off-by: Binbin <binloveplay1314@qq.com>
When doing `$replica replicaof no one`, we may get a LOADING
error. This is because during the test execution, the replica
may reconnect very quickly, the full sync is initiated,
and the replica has entered the LOADING state.
In this commit, we make sure the primary is paused after the
fork, so the replica won't enter the LOADING state, and with
this fix, this test seems more natural and predictable.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Null out several references after freeing the object in `freeClient()`.
This is just to make the code more safe, to protect against
use-after-free for future changes.
Signed-off-by: Qu Chen <quchen@amazon.com>
Init server.pid earlier to keep log message role consistent.
Closes #1206.
Before:
```text
24881:C 21 Oct 2024 21:10:57.165 * oO0OoO0OoO0Oo Valkey is starting oO0OoO0OoO0Oo
24881:C 21 Oct 2024 21:10:57.165 * Valkey version=255.255.255, bits=64, commit=814e0f55, modified=1, pid=24881, just started
24881:C 21 Oct 2024 21:10:57.165 * Configuration loaded
24881:M 21 Oct 2024 21:10:57.167 * Increased maximum number of open files to 10032 (it was originally set to 1024).
```
After:
```text
68560:M 08 Nov 2024 16:10:12.257 * oO0OoO0OoO0Oo Valkey is starting oO0OoO0OoO0Oo
68560:M 08 Nov 2024 16:10:12.257 * Valkey version=255.255.255, bits=64, commit=45d596e1, modified=1, pid=68560, just started
68560:M 08 Nov 2024 16:10:12.257 * Configuration loaded
68560:M 08 Nov 2024 16:10:12.258 * monotonic clock: POSIX clock_gettime
```
Signed-off-by: azuredream <zhaozixuan67@gmail.com>
RDMA: Use connection reference counter style
The reference counter of the connection is used to protect re-entry of
the close method. Use this style instead of the unsafe one.
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
As @madolson pointed out, these do have proper null terminators. This
cleans them up to follow the rest of the code which copies the last byte
explicitly, which should help reduce cognitive load and make it more
resilient should code refactors occur (e.g. non-static allocation of
memory, changes to other functions).
---------
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
With this commit, users are able to build valkey using `CMake`.
## Example usage:
Build `valkey-server` in Release mode with TLS enabled and using
`jemalloc` as the allocator:
```bash
mkdir build-release
cd $_
cmake .. -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INSTALL_PREFIX=/tmp/valkey-install \
-DBUILD_MALLOC=jemalloc -DBUILD_TLS=1
make -j$(nproc) install
# start valkey
/tmp/valkey-install/bin/valkey-server
```
Build `valkey-unit-tests`:
```bash
mkdir build-release-ut
cd $_
cmake .. -DCMAKE_BUILD_TYPE=Release \
-DBUILD_MALLOC=jemalloc -DBUILD_UNIT_TESTS=1
make -j$(nproc)
# Run the tests
./bin/valkey-unit-tests
```
Current features supported by this PR:
- Building against different allocators: (`jemalloc`, `tcmalloc`,
`tcmalloc_minimal` and `libc`), e.g. to enable `jemalloc` pass
`-DBUILD_MALLOC=jemalloc` to `cmake`
- OpenSSL builds (to enable TLS, pass `-DBUILD_TLS=1` to `cmake`)
- Sanitizer: pass `-DBUILD_SANITIZER=<address|thread|undefined>` to
`cmake`
- Install target + redis symbolic links
- Build `valkey-unit-tests` executable
- Standard CMake variables are supported. e.g. to install `valkey` under
`/home/you/root` pass `-DCMAKE_INSTALL_PREFIX=/home/you/root`
Why use `CMake`? To list *some* of the advantages of using `CMake`:
- Superior IDE integrations: cmake generates the file
`compile_commands.json` which is required by `clangd` to get
compiler-accurate code completion (in other words: your VSCode will
thank you)
- Out-of-source build tree: with the current build system, object
files are created all over the place, polluting the source tree;
the best practice is to build the project in a separate folder
- Multiple build types co-existing: with the current build system, it is
often hard to have multiple build configurations. With cmake you can do
it easily
- It is the de-facto standard for C/C++ projects these days
More build examples:
ASAN build:
```bash
mkdir build-asan
cd $_
cmake .. -DBUILD_SANITIZER=address -DBUILD_MALLOC=libc
make -j$(nproc)
```
ASAN with jemalloc:
```bash
mkdir build-asan-jemalloc
cd $_
cmake .. -DBUILD_SANITIZER=address -DBUILD_MALLOC=jemalloc
make -j$(nproc)
```
As seen in the previous examples, any combination is allowed, and the
builds co-exist on the same source tree.
## Valkey installation
With this new `CMake`, it is possible to install the binary by running
`make install` or to create a package with `make package` (currently
supported on Debian-like distros)
### Example 1: build & install using `make install`:
```bash
mkdir build-release
cd $_
cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/valkey-install -DCMAKE_BUILD_TYPE=Release
make -j$(nproc) install
# valkey is now installed under $HOME/valkey-install
```
### Example 2: create a `.deb` installer:
```bash
mkdir build-release
cd $_
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc) package
# ... CPack deb generation output
sudo gdebi -n ./valkey_8.1.0_amd64.deb
# valkey is now installed under /opt/valkey
```
### Example 3: create an installer for non-Debian systems (e.g. FreeBSD
or macOS):
```bash
mkdir build-release
cd $_
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc) package
mkdir -p /opt/valkey && ./valkey-8.1.0-Darwin.sh --prefix=/opt/valkey --exclude-subdir
# valkey-server is now installed under /opt/valkey
```
Signed-off-by: Eran Ifrah <eifrah@amazon.com>
functionsLibCtxClear should clear the provided lib_ctx parameter,
not the static variable curr_functions_lib_ctx, as this contradicts
the function's intended purpose.
The impact, I guess, is minor: in some unhappy paths (diskless load
fails, function restore fails?), we will mess up the functions_caches
field, which is used in the used_memory_functions / used_memory_scripts
fields in INFO.
Signed-off-by: Binbin <binloveplay1314@qq.com>
The CI reports that the replica will return an error when performing
CLUSTER FAILOVER:
```
-ERR Master is down or failed, please use CLUSTER FAILOVER FORCE
```
This may be because the primary state is fail or the cluster connection
is disconnected during the primary pause. In this PR, we added some
waits in wait_for_role: if the role is replica, we will wait for the
replication link and the cluster link to be ok.
Signed-off-by: Binbin <binloveplay1314@qq.com>
We set this to EEXIST in 568c2e039bac388003068cd8debb2f93619dd462,
it prints "File exists" which is not quite accurate,
change it to EALREADY, it will print "Operation already in progress".
Signed-off-by: Binbin <binloveplay1314@qq.com>
We don't think we really care about optimizing for the empty DB case,
which should be uncommon. Adding branches hurts branch prediction.
Signed-off-by: Binbin <binloveplay1314@qq.com>
This commit addresses issues that were likely introduced during a rebase
related to:
b0f23df165
Change dual channel replication state in main handler only
Signed-off-by: naglera <anagler123@gmail.com>
A minor debugging change that helped in the investigation of
https://github.com/valkey-io/valkey/issues/1251. Basically there are
some edge cases where we want to fully isolate a node from receiving
packets, but can't suspend the process because we need it to continue
sending outbound traffic. So, added a filter for that.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
The cluster-node-timeout is 3000 in our tests; the timing test wasn't
succeeding, so extending the wait_for made them much more reliable.
Signed-off-by: Binbin <binloveplay1314@qq.com>
All of the internal variables related to number of dicts in the kvstore
are type `int`. Not sure why these 2 items were declared as `long long`.
Signed-off-by: Jim Brunner <brunnerj@amazon.com>
RDMA MR (memory region) is not forkable; the VMA (virtual memory area)
of an MR becomes empty in a child process. Prevent IO in the child
process to avoid a server crash.
In the check for whether read and write is allowed on an RDMA
connection, a check for whether we're in a child process is added. If we
are, the function returns an error, which will cause the RDMA client to
be disconnected.
Suggested-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
The current implementation of `sdssplitargs` does repeated `sdscatlen`
to build the parsed arguments, which isn't very efficient because it
does a lot of extra reallocations and moves through the sds code a lot.
It also typically results in memory overhead, because `sdscatlen`
over-allocates, which is usually not needed since args are usually not
modified after being created.
The new implementation of sdssplitargs does two passes, the first to
parse the argument to figure out the final length and the second to
actually copy the string. It's generally about 2x faster for larger
strings (~100 bytes), and about 20% faster for small strings (~10
bytes). This is generally faster since as long as everything is in the
CPU cache, it's going to be fast.
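A simplified sketch of the two-pass structure, splitting on whitespace only (the real `sdssplitargs` also handles quoting, escapes and sds allocation):
```c
#include <stdlib.h>
#include <string.h>

/* Pass 1 counts tokens; pass 2 copies each token into an exact-size
 * allocation, avoiding the byte-by-byte growth of the old approach. */
static char **splitTwoPassSketch(const char *line, int *argc) {
    const char *p = line;
    int count = 0;

    /* Pass 1: count tokens. */
    while (*p) {
        while (*p == ' ') p++;
        if (!*p) break;
        count++;
        while (*p && *p != ' ') p++;
    }

    char **argv = malloc(sizeof(char *) * count);
    int i = 0;
    p = line;

    /* Pass 2: measure each token, then copy with no over-allocation. */
    while (*p) {
        while (*p == ' ') p++;
        if (!*p) break;
        const char *start = p;
        while (*p && *p != ' ') p++;
        size_t len = (size_t)(p - start);
        argv[i] = malloc(len + 1);
        memcpy(argv[i], start, len);
        argv[i][len] = '\0';
        i++;
    }
    *argc = count;
    return argv;
}
```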
There are a couple of sanity tests, where none existed before, as well
as some fuzzing which was used to find some bugs and also to do the
benchmarking. The original benchmarking code can be seen
6576aeb86a.
```
test_sdssplitargs_benchmark - unit/test_sds.c:530] Using random seed: 1729883235
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 56.44%, new:13039us, old:29930us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 56.58%, new:12057us, old:27771us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 59.18%, new:9048us, old:22165us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 54.61%, new:12381us, old:27278us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 51.17%, new:16012us, old:32793us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 49.18%, new:16041us, old:31563us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 58.40%, new:12450us, old:29930us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 56.49%, new:13066us, old:30031us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 58.75%, new:12744us, old:30894us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 52.44%, new:16885us, old:35504us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 62.57%, new:8107us, old:21659us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 62.12%, new:8320us, old:21966us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 45.23%, new:13960us, old:25487us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 57.95%, new:9188us, old:21849us
```
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
To align with how we encode the length at `_addReplyLongLongWithPrefix`.
Signed-off-by: Masahiro Ide <masahiro.ide@lycorp.co.jp>
Co-authored-by: Masahiro Ide <masahiro.ide@lycorp.co.jp>
Typically, an RDMA connection gets closed by the client side; the
server side handles the disconnected CM event and deletes the keepalive
timer correctly. However, the server side may close the connection
voluntarily, for example when the maximum number of connections is
exceeded. Handle this case to avoid invalid memory access.
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
clusterNodeGetReplica arguments were missed during the slave-to-replica
naming migration, so the argument slave is updated to replica.
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
At some point unit tests stopped building on macOS because of duplicate
symbols. I had originally solved this problem by using a flag that
overrides symbols, but the much better solution is to mark the duplicate
symbols as weak and they can be overridden during linking. (Symbols by
default are strong, strong symbols override weak symbols)
I also added a macOS unit build to the CI, so that this doesn't
silently break in the future again.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Add config options for log format and timestamp format introduced by
#1022
Related to #1225
This change adds two new configs into valkey.conf:
log-format
log-timestamp-format
---------
Signed-off-by: azuredream <zhaozixuan67@gmail.com>
### Description
This patch tries to increase the max number of io-threads from 16 (128)
to 256 for the reasons below:
1. The core count increases a lot in modern server processors; for
example, the [Sierra
Forest](https://en.wikipedia.org/wiki/Sierra_Forest) processors are
targeted at up to **288** cores.
Due to the limitation of the **_io-threads_** number (16 and 128), a
benchmark like https://openbenchmarking.org/test/pts/valkey cannot even
run on a high core count server.
2. For some workloads, the bottleneck could be the main thread, but for
other workloads with big keys/values causing heavy io, the bottleneck
could be the io-threads, for example the benchmark `memtier_benchmark -s
127.0.0.1 -p 9001 "--data-size" "20000" --ratio 1:0 --key-pattern P:P
--key-minimum=1 --key-maximum 1000000 --test-time 180 -c 50 -t 16
--hide-histogram`. The QPS still scales after 16 io-threads.
**Fig 1. QPS scale factor as the io-threads number grows.**
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
Co-authored-by: Wangyang Guo <wangyang.guo@intel.com>
In the old code, if fstat fails, replica->repldbfd holds the
fd and we free the client. And in freeClient, we check and
close the fd only if repl_state == REPLICA_STATE_SEND_BULK. So if fstat
fails, we leak the fd.
We could also extend freeClient to handle REPLICA_STATE_WAIT_BGSAVE_END
as well, but this seems to be a more friendly (and safer) way.
Signed-off-by: Binbin <binloveplay1314@qq.com>
When exploring the cycle distribution for the main thread with
io-threads enabled, we found this security attack check takes
significant time in the main thread: **~3%** of cycles were used to do
the command security check in the main thread.
This patch tries to completely avoid doing it in the hot path. We can do
it only after we looked up the command and it wasn't found, just before
we call commandCheckExistence.
---------
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
Co-authored-by: Wangyang Guo <wangyang.guo@intel.com>
Add ability to configure log output format and timestamp format in the
logs.
This change adds two new configs:
* `log-format`: Either legacy or logfmt (See https://brandur.org/logfmt)
* `log-timestamp-format`: legacy, iso8601 or milliseconds (since the
epoch).
Related to #1006.
Example:
```
$ ./valkey-server /home/zhaoz12/git/valkey/valkey/valkey.conf
pid=109463 role=RDB/AOF timestamp="2024-09-10T20:37:25.738-04:00" level=warning message="WARNING Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. Being disabled, it can also cause failures without low memory condition, see https://github.com/jemalloc/jemalloc/issues/1328. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect."
pid=109463 role=RDB/AOF timestamp="2024-09-10T20:37:25.738-04:00" level=notice message="oO0OoO0OoO0Oo Valkey is starting oO0OoO0OoO0Oo"
pid=109463 role=RDB/AOF timestamp="2024-09-10T20:37:25.738-04:00" level=notice message="Valkey version=255.255.255, bits=64, commit=affbea5d, modified=1, pid=109463, just started"
pid=109463 role=RDB/AOF timestamp="2024-09-10T20:37:25.738-04:00" level=notice message="Configuration loaded"
pid=109463 role=master timestamp="2024-09-10T20:37:25.738-04:00" level=notice message="monotonic clock: POSIX clock_gettime"
pid=109463 role=master timestamp="2024-09-10T20:37:25.739-04:00" level=warning message="Failed to write PID file: Permission denied"
```
---------
Signed-off-by: azuredream <zhaozixuan67@gmail.com>
All the other places that write in this function maintain it.
Although the caller of rdbSaveDb does not rely on it, it is maintained
to be consistent with the other places, as is its duty.
Signed-off-by: Binbin <binloveplay1314@qq.com>
If a replica steps into the data_age too old stage, it cannot
trigger the failover, currently it cannot automatically recover, and we
will print a log every CLUSTER_CANT_FAILOVER_RELOG_PERIOD,
which is every second. If the primary has not recovered or there is
no manual failover, this log will flood the log file.
In this case, limit its frequency to once every 10 periods, which is
10 seconds in our code. Also, in this data_age too old stage,
the repeated logs can still indicate the progress of the failover.
See also #780 for more details about it.
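A sketch of the throttling, with illustrative names and reason code (the real check lives in the cluster failover logging path):
```c
#include <time.h>

#define CLUSTER_CANT_FAILOVER_RELOG_PERIOD 1 /* seconds */
#define CANT_FAILOVER_DATA_AGE 2             /* illustrative reason code */

/* In the data_age too old stage, relog only once every 10 periods
 * (10 seconds) instead of every period (1 second). */
static int shouldRelogCantFailover(int reason, time_t now, time_t lastlog) {
    time_t period = CLUSTER_CANT_FAILOVER_RELOG_PERIOD;
    if (reason == CANT_FAILOVER_DATA_AGE) period *= 10;
    return (now - lastlog) >= period;
}
```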
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
The command argument strings created while parsing inline commands (see
`processInlineBuffer()`) can contain free capacity. Since some commands,
such as `SET`, store these strings in the database, that free capacity
increases the memory usage. In the worst case, it could double the
memory usage.
This only occurs if the inline command format is used. The argument
strings are built by appending character by character in
`sdssplitargs()`. Regular RESP commands are not affected.
This change trims the strings within `processInlineBuffer()`.
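A minimal sketch of the idea, assuming an `sdsRemoveFreeSpace()`-style helper (the exact call site and helper signature inside `processInlineBuffer()` may differ):
```c
#include "sds.h" /* sds type and sdsRemoveFreeSpace() */

/* Reallocate each parsed inline argument so its capacity matches its
 * length, dropping the free space accumulated while appending. */
static void trimInlineArgsSketch(sds *argv, int argc) {
    for (int j = 0; j < argc; j++)
        argv[j] = sdsRemoveFreeSpace(argv[j], 0);
}
```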
### Why `trimStringObjectIfNeeded()` within `object.c` is not solving
this?
When the command argument string is packed into an object,
`trimStringObjectIfNeeded()` is called.
This only trims the string if it is larger than
`PROTO_MBULK_BIG_ARG` (32kB), as only strings larger than this would
ever need trimming if the command is sent using the bulk string format.
We could modify this condition, but that would potentially have a
performance impact on commands using the bulk format. Since those make
up the vast majority of executed commands, limiting this change to
inline commands seems prudent.
### Experiment Results
* 1 million `SET [key] [value]` commands
* Random keys (16 bytes)
* 600 bytes values
Memory usage without this change:
```
used_memory:1089327888
used_memory_human:1.01G
used_memory_rss:1131696128
used_memory_rss_human:1.05G
used_memory_peak:1089348264
used_memory_peak_human:1.01G
used_memory_peak_perc:100.00%
used_memory_overhead:49302800
used_memory_startup:911808
used_memory_dataset:1040025088
used_memory_dataset_perc:95.55%
```
Memory usage with this change:
```
used_memory:705327888
used_memory_human:672.65M
used_memory_rss:718802944
used_memory_rss_human:685.50M
used_memory_peak:705348256
used_memory_peak_human:672.67M
used_memory_peak_perc:100.00%
used_memory_overhead:49302800
used_memory_startup:911808
used_memory_dataset:656025088
used_memory_dataset_perc:93.13%
```
If the same experiment is repeated using the normal RESP array of bulk
string format (`*3\r\n$3\r\nSET\r\n...`) then the memory usage is 672MB
with and without this change.
If a replica is attached, its memory usage is 672MB with and without
this change, since the replication link never uses inline commands.
Signed-off-by: Stefan Mueller <muelstef@amazon.com>
Currently, if the replica has a lot of data, CLUSTER RESET
will block for a while and show up in the slowlog, and it seems
that there is no harm in making it async, so that it is easier for
external components to monitor it.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Part of the https://github.com/valkey-io/valkey/pull/1200 PR, since a
field is changed. Looks like commands.def was not regenerated based on
the changes, which is causing the CI failure on unstable.
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
In some cases the bgsave child process can run for a long time,
exhausting system resources. Although it is possible to kill the bgsave
child process from the system shell, sometimes OS-level access is not
possible.
This PR adds a new subcommand to the BGSAVE command.
When the user issues `BGSAVE CANCEL`, it will do one of two things:
1. In case a bgsave child process is currently running, the child
process will be immediately killed, thus terminating any
save/replication full sync process.
2. In case a bgsave child process is SCHEDULED to run, the scheduled
execution will be cancelled.
---------
Signed-off-by: ranshid <ranshid@amazon.com>
Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com>
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
struct connListener.priv should be used by connection type specific
data, static local listener data should not use this.
A RDMA config structure is going to be introduced in the next step:
```
typedef struct serverRdmaContextConfig {
char *bindaddr;
int bindaddr_count;
int port;
int rx_size;
int comp_vector;
...
} serverRdmaContextConfig;
```
Then built-in RDMA will be supported.
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
This special pattern '#' is used to get the element itself;
it does not actually participate in the slot check.
In this case, passing `GET #` will cause '#' to participate
in the slot check, causing the command to get a
`pattern may be in different slots` error.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Hide 'unixsocketgroup' and 'unixsocketperm' into a Unix socket specific
data structure. A single opaque pointer 'void *priv' is enough for a
listener. Once any new config is added, we don't need 'void *priv2',
'void *priv3' and so on.
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
When profiling some workloads with `io-threads` enabled, we found the
false sharing issue is heavy.
This patch tries to split the elements accessed by the main thread and
the io-threads into different cache lines by padding the elements at
the head of the `used_memory_thread_padded` array.
This design helps mitigate the false sharing between the main
thread and io-threads, because the main thread has been the bottleneck
with io-threads enabled. The reason we didn't put each element in an
individual cache line is that we don't want to add extra cache line
fetch operations (3 vs 16 cache lines) when calling functions like
`zmalloc_used_memory()`.
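One possible shape of the idea (illustrative, not the actual zmalloc.c layout): pad the main thread's slot onto its own cache line while the io-thread slots stay packed:
```c
#include <stdatomic.h>

#define CACHE_LINE_SIZE 64
#define MAX_IO_THREADS 16 /* illustrative */

typedef struct {
    /* Main-thread counter isolated on its own cache line, so io-thread
     * updates to the packed slots below never dirty it. */
    _Atomic long long main_thread_used_memory;
    char pad[CACHE_LINE_SIZE - sizeof(_Atomic long long)];
    /* io-thread counters stay packed so summing them in a function like
     * zmalloc_used_memory() touches few cache lines. */
    _Atomic long long io_thread_used_memory[MAX_IO_THREADS];
} used_memory_padded_sketch;
```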
---------
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
Signed-off-by: Lipeng Zhu <zhu.lipeng@outlook.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Wangyang Guo <wangyang.guo@intel.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
There is no limitation in Valkey to create a cluster with 1 or 2 primaries,
only that it cannot do automatic failover. Remove this restriction and
add an `are you sure` prompt for the user.
This allows us to create a test cluster with the CLI or with
create-cluster.
Signed-off-by: Binbin <binloveplay1314@qq.com>
https://github.com/valkey-io/valkey/issues/1145
First part of a two-step effort to add `WithSlot` API for expiry. This
PR is to fix a crash that occurs when a RANDOMKEY uses a different slot
than the cached slot of a client during a multi-exec.
The next part will be to utilize the new API as an optimization to
prevent duplicate work when calculating the slot for a key.
---------
Signed-off-by: Nadav Levanoni <nadavl@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Nadav Levanoni <nadavl@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
**Overview**
This PR introduces the use of
[MurmurHash3](https://en.wikipedia.org/wiki/MurmurHash) as the hashing
function for Lua's luaS_newlstr function, replacing the previous simple
hash function. The change aims to improve performance, particularly for
large strings.
**Changes**
Implemented MurmurHash3 algorithm in lstring.c
Updated luaS_newlstr to use MurmurHash3 for string hashing
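For reference, a self-contained sketch of the MurmurHash3 x86 32-bit variant (seeding and the exact variant integrated in lstring.c may differ):
```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint32_t rotl32(uint32_t x, int r) { return (x << r) | (x >> (32 - r)); }

static uint32_t murmur3_32(const void *key, size_t len, uint32_t seed) {
    const uint8_t *data = key;
    const uint32_t c1 = 0xcc9e2d51, c2 = 0x1b873593;
    uint32_t h = seed;
    size_t i;

    /* Body: mix 4-byte blocks. */
    for (i = 0; i + 4 <= len; i += 4) {
        uint32_t k;
        memcpy(&k, data + i, 4); /* assumes little-endian */
        k *= c1; k = rotl32(k, 15); k *= c2;
        h ^= k; h = rotl32(h, 13); h = h * 5 + 0xe6546b64;
    }

    /* Tail: up to 3 remaining bytes. */
    uint32_t k = 0;
    switch (len & 3) {
    case 3: k ^= (uint32_t)data[i + 2] << 16; /* fall through */
    case 2: k ^= (uint32_t)data[i + 1] << 8;  /* fall through */
    case 1: k ^= data[i];
        k *= c1; k = rotl32(k, 15); k *= c2;
        h ^= k;
    }

    /* Finalization: avalanche the bits. */
    h ^= (uint32_t)len;
    h ^= h >> 16; h *= 0x85ebca6b;
    h ^= h >> 13; h *= 0xc2b2ae35;
    h ^= h >> 16;
    return h;
}
```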
**Performance Testing:**
Test Setup:
1. Ran a valkey server
2. Loaded 1000 keys with large values (100KB each) to the server using a
Lua script
```
local numKeys = 1000
for i = 1, numKeys do
local key = "large_key_" .. i
local largeValue = string.rep("x", 1024*100)
redis.call("SET", key, largeValue)
end
```
3. Used a Lua script to randomly select and retrieve keys
```
local randomKey = redis.call("RANDOMKEY")
local result = redis.call("GET", randomKey)
```
4. Benchmarked using valkey-benchmark:
`./valkey-benchmark -n 100000 evalsha
c157a37967e69569339a39a953c046fc2ecb4258 0`
Results:
| Metric | Unstable | This PR | Change |
| -- | -- | -- | -- |
| Throughput | 6,835.74 requests per second | 17,061.94 requests per second | **+150% increase** |
| Avg Latency | 7.218 ms | 2.838 ms | **-61% decrease** |
| Min Latency | 3.144 ms | 1.320 ms | **-58% decrease** |
| P50 Latency | 8.463 ms | 3.167 ms | **-63% decrease** |
| P95 Latency | 8.863 ms | 3.527 ms | **-60% decrease** |
| P99 Latency | 9.063 ms | 3.663 ms | **-60% decrease** |
| Max Latency | 63.871 ms | 55.327 ms | **-13% decrease** |
Summary:
* Throughput: Improved by 150%.
* Latency: Significant reductions in average, minimum, and percentile
latencies (P50, P95, P99), leading to much faster response times.
* Max Latency: Slightly decreased by 13%, indicating fewer outlier
delays after the fix.
---------
Signed-off-by: Shai Zarka <zarkash@amazon.com>
Signed-off-by: zarkash-aws <zarkash@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The client that was killed by FUNCTION KILL received a reply of
SCRIPT KILL and the server log also showed SCRIPT KILL.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Currently in the conf we describe activerehashing as: Active rehashing
uses 1 millisecond every 100 milliseconds of CPU time. This is the
case for hz = 10.
If we change hz, the description in the conf becomes inaccurate. Users
may notice that the server spends some CPU (used in activerehashing)
at high hz but don't know why, since our cron calls are fixed to 1ms.
This PR takes hz into account and fixes the CPU usage at 1% (this may
not be accurate in some cases because we do 100-step rehashing in
dictRehashMicroseconds, but it can avoid CPU spikes in this case).
This PR also improves the description of the activerehashing
configuration item to explain this change.
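The budget arithmetic, as a sketch (illustrative helper, not the actual serverCron code):
```c
#include <stdint.h>

/* A cron tick covers 1000/hz ms of wall time; 1% of that is 10/hz ms,
 * i.e. 10000/hz microseconds of rehashing budget per tick. With the
 * default hz = 10 this yields the old 1 ms per 100 ms. */
static uint64_t rehashBudgetMicroseconds(int hz) {
    return 10000 / (uint64_t)hz;
}
```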
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The zcalloc symbol is a symbol name already used by zlib, which is
defining other names using the "z" prefix specific to zlib. In practice,
linking valkey with a static openssl, which itself might depend on a
static libz will result in link time error rejecting multiple symbol
definitions.
Fixes: #1157
Signed-off-by: Romain Geissler <romain.geissler@amadeus.com>
Currently in our daily CI, if a job fails, it will cancel the other
jobs in the same matrix. We want to avoid this so that all jobs in a
matrix can eventually run to completion.
Docs: jobs.<job_id>.strategy.fail-fast applies to the entire matrix.
If jobs.<job_id>.strategy.fail-fast is set to true or its expression
evaluates to true, GitHub will cancel all in-progress and queued jobs
in the matrix if any job in the matrix fails. This property defaults
to true.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Sometimes when dual-channel is turned off, the tested replica might
disconnect on COB overrun. Disable the replica COB limit in order to
prevent such cases.
Fixes: #1153
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
In case of a valgrind run, the replica might get disconnected from the
primary due to repl-timeout being reached. The fix is to configure a
larger timeout for valgrind tests.
**Partially** fixes: #1152
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
CI reports this failure:
```
*** [err]: SHUTDOWN NOSAVE can kill a timedout script anyway in tests/unit/scripting.tcl
Expected 'BUSY Valkey is busy running a script. *' to match '*connection refused*' (context: type eval line 8 cmd {assert_match {*connection refused*} $e} proc ::test)
```
We can see in the logs that the shutdown was rejected because there is
an AOFRW pending:
```
Writing initial AOF, can't exit.
Errors trying to shut down the server. Check the logs for more information.
```
The reason is that the previous test enabled AOF.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Similar to #860 but this is for HGETALL families (HGETALL/HKEYS/HVALS).
This patch moves `prepareClientToWrite` out of the loop to reduce the
function overhead.
Signed-off-by: Masahiro Ide <imasahiro9@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
this fixes: https://github.com/valkey-io/valkey/issues/1116
_Issue details from #1116 by @zuiderkwast_
> This config is undocumented since #758. The default was changed to
"yes" and it is quite useless to set it to "no". Yet, it can happen that
some user has an old config file where it is explicitly set to "no". The
result will be bad performance, since I/O threads will not do all the
I/O.
>
> It's indeed confusing.
>
> 1. Either remove the whole option from the code. And thus no need for
documentation. _OR:_
> 2. Introduce the option back in the configuration, just as a comment
is fine. And showing the default value "yes": `# io-threads-do-reads
yes` with additional text.
>
> _Originally posted by @melroy89 in [#1019 (reply in
thread)](https://github.com/orgs/valkey-io/discussions/1019#discussioncomment-10824778)_
---------
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
The module commands which were added to ACL categories were getting
skipped when the `ACL CAT category` command was executed.
This PR fixes the bug.
Before:
```
127.0.0.1:6379> ACL CAT foocategory
(empty array)
```
After:
```
127.0.0.1:6379> ACL CAT foocategory
aclcheck.module.command.test.add.new.aclcategories
```
---------
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Co-authored-by: Harkrishn Patro <bunty.hari@gmail.com>
A new option for diskless replication on the replica side.
After a network failure, the replica may need to perform a full sync.
The other option for diskless full sync is `swapdb`, but it uses twice
as much memory, temporarily. In situations where this is not acceptable,
and where losing data is acceptable, the `flush-before-load` can be
useful. If the full sync fails, the old data is lost though. Therefore,
the new option is marked as "dangerous".
---------
Signed-off-by: kronwerk <ca11e5e22g@gmail.com>
Signed-off-by: kronwerk <kronwerk@users.noreply.github.com>
Co-authored-by: kronwerk <ca11e5e22g@gmail.com>
Currently when module loading fails due to a busy name, we
don't have a clean way to assist troubleshooting.
Case 1: when loading the same module multiple times, we cannot
determine the cause of its failure without referring to
the module list or the earliest module load log. The log
may not exist and sometimes it is difficult for people
to make the association with the module list.
Case 2: when multiple modules use the same module name,
we cannot quickly associate the busy name without referring
to the module list and the earliest module load log.
Different people wrote modules with the same module name,
and they don't easily associate the module name.
So in this PR, when doing the module onload, we will try to
print a busy name log if this happens. Currently we check
ctx.module since if it is NULL it means the Init call
failed, and Init currently only fails with a busy name.
It's kind of ugly. It would have been nice if we could have had a
better way for onload to signal why the load failed.
Signed-off-by: Binbin <binloveplay1314@qq.com>
There is a lot of bad legacy usage of `default:` with enums, which is an
anti-pattern. If you omit the default, the compiler will tell you if a
new enum value was added and that it is missing from a switch statement.
Someone mentioned on another PR they used `default:` because of this
warning, so just removing it, but might create an issue to do a wider
cleanup.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: Masahiro Ide <masahiro.ide@lycorp.co.jp>
Signed-off-by: Masahiro Ide <imasahiro9@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Masahiro Ide <masahiro.ide@lycorp.co.jp>
I’ve prepared a minor fix for `processCommand()` function.
In `processCommand()`, the `obey_client` variable is created, but some
conditional statements call the `mustObeyClient()` function instead of
reusing `obey_client`.
I’ve modified these statements to reuse `obey_client`.
Since I’m relatively new to Redis, please let me know if there are any
reasons why the conditional statements need to call `mustObeyClient()`
again.
Thank you for taking the time to review my PR.
Signed-off-by: otheng03 <07c00h@gmail.com>
Until now, this flag only dumped logs on a failed assert in test case.
It is useful that this flag dumps logs on a crash as well.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Since a7cbca40661 ("RDMA: Support .is_local method (#1089)"),
valkey-server started to support auto-detect local connection, then we
can use protected mode for local RDMA device for test.
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Improved documentation and readability of lua code as well as removed references to Redis.
---------
Signed-off-by: Parth Patel <661497+parthpatel@users.noreply.github.com>
This change counts both solo test executions to give an accurate total number of tests being run.
---------
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
This commit adds initialization code for the fields
`io_last_reply_block` and `io_last_bufpos` of the `client` struct.
While in the current code flow, these fields are only accessed after
being written in the `trySendWriteToIOThreads`, I discovered that they
were not being initialized while doing some changes to the code flow of
IO threads.
I believe it's good practice to initialize all fields of a struct upon
creation; it will avoid future bugs which are usually hard to debug.
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
Fix the warning introduced in #688:
```
unit/test_rax.c:168:15: runtime error: left shift of 36625 by 16 places cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior unit/test_rax.c:168:15 in
Fuzz test in mode 1 [7504]:
```
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Applying the CVEs against mainline.
(CVE-2024-31449) Lua library commands may lead to stack overflow and
potential RCE.
(CVE-2024-31227) Potential Denial-of-service due to malformed ACL
selectors.
(CVE-2024-31228) Potential Denial-of-service due to unbounded pattern
matching.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
- Add systemd support to the build artifact tarballs, so people can use
it under systemd compatible distros. As discussed here:
https://github.com/orgs/valkey-io/discussions/1103#discussioncomment-10815549.
Adding `libsystemd-dev` to install and add `USE_SYSTEMD=yes` to the
build.
- Cleanup & bring the arm & x86 workflow files in-sync. It was a bit of
a mess ;) (removing `jq wget awscli` from the 'Tarball' step)
Signed-off-by: Melroy van den Berg <melroy@melroy.org>
As discussed here:
https://github.com/orgs/valkey-io/discussions/1103#discussioncomment-10814006
`cp` can't be used anymore; `rsync` is more powerful and allows
excluding files.
Alternatively:
1. Remove the c, d and o files. Which isn't ideal either.
2. Improve the build. Eg. by building inside a `build` directory instead
of in the src folder.
Ps. I know these workflows aren't triggered in this PR. Only via the
"Build Release Packages" workflow action:
https://github.com/valkey-io/valkey/actions/workflows/build-release-packages.yml..
So I can't fully test in this PR. But it should work ^^
Ps. ps. I did test `rsync -av --exclude='*.c' --exclude='*.d'
--exclude='*.o' src/valkey-*` command in isolation and that works as
expected!
---------
Signed-off-by: Melroy van den Berg <melroy@melroy.org>
Introduce a `size_t` field into the rax struct to track allocation size.
Update the allocation size on rax insert and deletes.
Return the allocation size when `raxAllocSize` is called.
This size tracking is now used in MEMORY USAGE and MEMORY STATS in place
of the previous method based on sampling.
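A sketch of the struct change (the field name is illustrative; the real rax struct lives in rax.h):
```c
#include <stddef.h>
#include <stdint.h>

typedef struct raxNode raxNode;

typedef struct rax {
    raxNode *head;
    uint64_t numele;
    uint64_t numnodes;
    /* Running total of bytes allocated for the tree, updated on every
     * insert/delete reallocation, so raxAllocSize() is O(1) instead of
     * being estimated by sampling. */
    size_t alloc_size;
} rax;
```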
The module API allows creating sorted dictionaries, which are backed by
rax. Users now also get precise memory allocation for them (through
`ValkeyModule_MallocSizeDict`).
Fixes #677.
For the release notes:
* MEMORY USAGE and MEMORY STATS are now exact for streams, rather than
based on sampling.
---------
Signed-off-by: Guillaume Koenig <knggk@amazon.com>
Signed-off-by: Guillaume Koenig <106696198+knggk@users.noreply.github.com>
Co-authored-by: Joey <yzhaon@amazon.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Since we paused the primary node earlier, the replica may enter
cluster down due to primary node pfail. Here we set allow read to
prevent subsequent read errors.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Currently, we create a Lua table without initial capacity even when the
capacity is known. As a result, we need to resize the Lua tables
repeatedly when converting a RESP-serialized object to a Lua object,
which consumes a bit of extra CPU when we need to transfer
RESP-serialized data to the Lua world.
This patch tries to remove this extra resize to reduce (re-)allocation
overhead.
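A sketch of the idea using the Lua C API (illustrative function; the real conversion decodes RESP elements rather than pushing placeholders):
```c
#include <lua.h>

/* lua_createtable pre-allocates the array part, avoiding the repeated
 * rehash/resize that growing an empty table would incur. */
static void pushArrayWithCapacity(lua_State *lua, int num_elements) {
    lua_createtable(lua, num_elements, 0);
    for (int i = 0; i < num_elements; i++) {
        lua_pushinteger(lua, i + 1); /* placeholder element */
        lua_rawseti(lua, -2, i + 1);
    }
}
```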
| name | unstable bb57dfe6303 (rps) | this patch(rps) | improvements |
| --------------- | -------- | --------- | -------------- |
| evalsha - hgetall h1 | 60565.68 | 64487.01 | 6.47% |
| evalsha - hgetall h10 | 47023.41 | 50602.17 | 7.61% |
| evalsha - hgetall h25 | 33572.82 | 37345.48 | 11.23% |
| evalsha - hgetall h50 | 24206.63 | 25276.14 | 4.42% |
| evalsha - hgetall h100 | 15068.87 | 15656.8 | 3.90% |
| evalsha - hgetall h300 | 5948.56 | 6094.74 | 2.46% |
Signed-off-by: Masahiro Ide <masahiro.ide@lycorp.co.jp>
Co-authored-by: Masahiro Ide <masahiro.ide@lycorp.co.jp>
These two test cases run in a loop:
* AOF rewrite during write load: RDB preamble=yes
* AOF rewrite during write load: RDB preamble=no
Both of the test cases build up a lot of data (3-4 million keys when I
run locally) so we should empty the data before the second test case.
Otherwise, the second test cases adds keys on top of the keys added in
the first test case, resulting in the double number of keys and takes
more time.
Before this commit:
[ok]: AOF rewrite during write load: RDB preamble=yes (18225 ms)
[ok]: AOF rewrite during write load: RDB preamble=no (37249 ms)
After:
[ok]: AOF rewrite during write load: RDB preamble=yes (18777 ms)
[ok]: AOF rewrite during write load: RDB preamble=no (19940 ms)
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
There is no ethernet-style virtual device (like lo 127.0.0.1) for RDMA;
however, a connection with the same local address and peer address is
considered local.
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
moduleTimerHandler is an aeTimeProc handler and the event loop gets
created with this. However, we found that the function's return type is
int but it actually returns a "long long" value (i.e., next_period), and
the return value is assigned to an int variable in processTimeEvents
(where time events are processed); this might cause an overflow of the
timer values. So we changed the return type of the function to long
long, and also updated other callback function return types to be
consistent.
I found this when I was checking the functions reported in the
https://github.com/valkey-io/valkey/issues/1054 issue stacktrace. (FYI,
this is just to update the return type to be consistent; it is not
the fix for the issue reported.)
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
Currently cluster tests in unit/cluster are run as part of
./runtest. Sometimes we change the cluster code and only
want to run cluster tests. This PR adds a --cluster option
to runtest so that we can run only cluster tests.
Signed-off-by: Binbin <binloveplay1314@qq.com>
For fake clients like the ones used for Lua and modules, we don't
determine TLS in the right way, causing CLUSTER SLOTS from EVAL over TLS
to fail a debug-assert.
This error was introduced when the caching of CLUSTER SLOTS was
introduced, i.e. in 8.0.0.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The reason is that VM_Call will use a fake client without a connection,
so we also need to check if c->conn is NULL.
This also affects scripts: if it is called in a script, the
server will crash. Injecting commands into the AOF will also cause
a startup failure.
Fixes #1054.
Signed-off-by: Binbin <binloveplay1314@qq.com>
In the CLUSTER FAILOVER FORCE case, we will set mf_can_start to
1 and wait for a cron to trigger the election. We can also set a
CLUSTER_TODO_HANDLE_MANUALFAILOVER flag so that we
can start the election as soon as possible instead of waiting for
the cron, so that we won't have a 100ms delay (clusterCron).
Signed-off-by: Binbin <binloveplay1314@qq.com>
This commit hopefully improves the formatting of the codebase by setting
ColumnLimit to 0 and hence stopping clang-format from trying to put as
much stuff in one line as possible.
This change enabled us to remove most of `clang-format off` directives
and fixed a bunch of lines that looked like this:
```c
#define KEY \
VALUE /* comment */
```
Additionally, one pair of `clang-format off` / `clang-format on` had
`clang-format off` as the second comment and hence didn't enable the
formatting for the rest of the file. This commit addresses this issue as
well.
Please tell me if anything in the changes seems off. If everything is
fine, I will add this commit to `.git-blame-ignore-revs` later.
---------
Signed-off-by: Mikhail Koviazin <mikhail.koviazin@aiven.io>
CI reported this failure:
```
[exception]: Executing test client: ERR FAILOVER target replica is not online..
ERR FAILOVER target replica is not online.
while executing
"$node_0 failover to $node_1_host $node_1_port"
```
We can see that somehow the replica is not online in time, causing
this failure; added a verify_replica_online check to make sure the
replica is online for the test.
Signed-off-by: Binbin <binloveplay1314@qq.com>
This commit changes the `tcmalloc.h` header location from the deprecated
location `google/` to `gperftools/`.
**Why we're doing this now?**
The location `google/tcmalloc.h` has been deprecated for more than 10
years in favor of `gperftools/tcmalloc.h`, and the deprecated location
will be removed in the next release of gperftools.
Fixes #1033
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
The one in CLUSTER SETSLOT helps us keep track of the state better,
and of course it also makes the test case happy.
The one in gossip processing fixes a problem where a replica can
print a log saying it is an empty primary.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Apparently this will fail to compile in some macOS versions.
And the internet claims _Thread_local is portable.
Fixes #1051.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Since the monitor value here is mymaster, we need to make sure the
primary name is the same; otherwise the default configuration cannot
start sentinel.
```
sentinel monitor mymaster 127.0.0.1 6379 2
```
The following error occurs when the default configuration is started:
```
*** FATAL CONFIG FILE ERROR (Version 255.255.255) ***
Reading the configuration file, at line 358
>>> 'SENTINEL primary-reboot-down-after-period myprimary 0'
No such master with specified name.
```
Introduced in #647.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Apparently there is a timing issue when using wait_for_ofs_sync:
```
[exception]: Executing test client: can't read "out_before": no such variable.
can't read "out_before": no such variable
```
The reason is that if the connection between the primary
and the replica is not established yet, the master_repl_offset
of the primary and replica in wait_for_ofs_sync is 0, and
the check fails, resulting in no replica client in the
client list below.
In this case, we need to make sure the replica is online
before proceeding.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Sometimes it is hard to see the old primary during a
multi-primary failover; adding this log can help
us find the old primary node.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
errno is a global variable shared with system calls, so there is a
chance it may be overwritten during io free or close socket in the
migrate command code. It is better to copy it before the free or
closesocket and use the copied value to check for retry in the
socket_err block. So a new variable was added to take the copy, and
the copy is used for the check.
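A sketch of the pattern (illustrative, not the exact migrate command code): copy errno before any cleanup call that can clobber it.
```c
#include <errno.h>
#include <unistd.h>

static int writeWithRetryCheck(int fd, const void *buf, size_t len) {
    ssize_t nwritten = write(fd, buf, len);
    if (nwritten == -1) {
        int saved_errno = errno; /* take the copy first */
        close(fd);               /* close() may overwrite errno */
        return (saved_errno == EAGAIN) ? 1 /* retry */ : -1;
    }
    return 0;
}
```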
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
Replaced "Could not connect to Redis" with "Could not connect to server" in the log
output for connection errors in `getRedisContext` and `createClient`.
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
Call emptyData right before rdbLoad to prevent errors in the middle
that would drop the replication stream and leave an empty database.
The real change is in the disk-based part; the rest is just code
movement.
Signed-off-by: Binbin <binloveplay1314@qq.com>
This pull request improves code readability, as a follow up of #749.
- Internal Naming Conventions: Removed the use of underscores (_) for
internal static structures/functions.
- Descriptive Function Names: Updated function names to be more
descriptive, making their purpose clearer. For instance, `_dictExpand`
is renamed to `dictExpandIfAutoResizeAllowed`.
---------
Signed-off-by: Ping Xie <pingxie@google.com>
Fix timing issue in evaluating `cluster-allow-replica-migration` for replicas
There is a timing bug where the primary and replica have different
`cluster-allow-replica-migration` settings. In issue #970, we found that if
the replica receives `CLUSTER SETSLOT` before the gossip update, it remains
in the original shard. This happens because we only process the
`cluster-allow-replica-migration` flag for primaries during `CLUSTER SETSLOT`.
This commit fixes the issue by also evaluating this flag for replicas in the
`CLUSTER SETSLOT` path, ensuring correct replica migration behavior.
Closes #970
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
The node may not be able to initiate an election in time due to
problems with cluster communication. If an election is initiated,
make sure its offset is 0.
Closes #967.
Signed-off-by: Binbin <binloveplay1314@qq.com>
The cluster configuration file is the metadata "database" for the
cluster. It is best to trigger a save when shutting down the server, to
avoid inconsistent content that is not refreshed.
We save nodes.conf whenever something that affects it has changed. But
we are saving nodes.conf in clusterBeforeSleep, and some events may save
it without an fsync, so there is a time gap.
And shutdown having its own save seems good to me; it doesn't need to
care about the others.
At the same time, a comment is added to the nodes.conf unlocking to
explain why we actively unlock on shutdown.
Signed-off-by: Binbin <binloveplay1314@qq.com>
clang generates warning if there is no newline at the end of the source
file.
Update .clang-format to handle the missing newline at eof.
Signed-off-by: haoqixu <hq.xu0o0@gmail.com>
While doing some profiling, I noticed that getKeySlot() was a fairly
large part (~0.7%) of samples when running perf with high pipeline
during standalone. I think this is because we do a very late check for
server.cluster_mode: we first call getKeySlot() and then call
calculateKeySlot(). (calculateKeySlot was surprisingly not automatically
inlined; we were doing a jump into it and then immediately returning
zero.) We then also do useless work in the form of caching zero in
client->slot, which will further mess with cache lines.
So, this PR tries to accomplish a few things:
1) The usage of the `slot` name made a lot more sense before the
introduction of the kvstore. Now with kvstore, we call this the database
index, so all the references to slot in standalone are no longer really
accurate.
2) Pull the cluster mode check all the way out of getKeySlot(), so
hopefully a bit more performant.
3) Remove calculateKeySlot() as independent from getKeySlot().
calculateKeySlot used to have 3 call sites outside of db.c, which
warranted its own function. It's now only called in two places,
pubsub.c and networking.c.
I ran some profiling, and saw about ~0.3% improvement, but don't really
trust it because you'll see a much higher (~2%) variance in test runs
just by how the branch predictions will get changed with a new memory
layout. Running perf again showed no samples in getKeySlot() and a
reduction in samples in lookupKey(), so maybe this will help a little
bit.
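A minimal sketch of the caller-side check in point 2, using simplified stand-ins for the real server globals and hash function (the actual code lives in db.c and also avoids the client->slot caching):

```c
#include <stddef.h>

static int cluster_mode = 0; /* stand-in for server.cluster_mode */

/* Stand-in for the CRC16-based hash slot computation in cluster.c. */
static int keyHashSlot(const char *key, size_t keylen) {
    unsigned int h = 5381;
    for (size_t i = 0; i < keylen; i++) h = h * 33 + (unsigned char)key[i];
    return (int)(h % 16384);
}

/* With the cluster-mode check hoisted out here, standalone mode returns
 * database index 0 immediately: no hash computation, no jump into a
 * non-inlined helper, and no pointless caching of zero. */
int getKeyDatabaseIndex(const char *key, size_t keylen) {
    if (!cluster_mode) return 0;
    return keyHashSlot(key, keylen);
}
```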
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Maybe partially resolves https://github.com/valkey-io/valkey/issues/952.
The hostnames test relies on an assumption that node zero and node six
don't communicate with each other, to test a bunch of behavior in the
handshake state. This was previously done by dropping all meet packets;
however, it seems there was some case where node zero was sending a
single pong message to node 6, which was partially initializing the
state.
I couldn't track down why this happened, but I adjusted the test to
simply pause node zero which also correctly emulates the state we want
to be in since we're just testing state on node 6, and removes the
chance of errant messages. The test was failing about 5% of the time
locally, and I wasn't able to reproduce a failure with this new
configuration.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
This file enables developers to ignore certain revisions in
git-blame. This is quite handy considering there was a commit that
reformatted a large amount of code in valkey.
As a downside, one has to do a manual step for each clone of valkey to
enable this feature. The instructions are available in the file itself.
---------
Signed-off-by: Mikhail Koviazin <mikhail.koviazin@aiven.io>
In RdbLoad, we disable AOF before emptyData and rdbLoad to prevent copy-on-write issues. After rdbLoad completes, AOF should be re-enabled, but the code incorrectly checks server.aof_state, which has been reset to AOF_OFF in stopAppendOnly. This leads to AOF not being re-enabled after being disabled.
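A hedged sketch of the fix pattern (stopAppendOnly, startAppendOnly and AOF_OFF come from the codebase; the control flow here is heavily simplified):

```c
/* Remember whether AOF was on *before* stopAppendOnly() resets
 * server.aof_state to AOF_OFF; re-checking server.aof_state afterwards
 * always sees AOF_OFF, which is the bug described above. */
int aof_was_enabled = (server.aof_state != AOF_OFF);
stopAppendOnly();
/* ... emptyData() and rdbLoad() run here ... */
if (aof_was_enabled) startAppendOnly();
```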
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
Prior to comparing the replica buffer against the configured limit, we
need to ensure that the limit configuration is enabled. If the limit is
set to zero, it indicates that there is no limit, and we should skip the
buffer limit check.
---------
Signed-off-by: naglera <anagler123@gmail.com>
Fix for https://github.com/valkey-io/valkey/issues/997
Root Cause Analysis:
1. Two different jobs (READ and WRITE) may be sent to the same IO
thread.
2. When processing the read job in `processIOThreadsReadDone`, the IO
thread may find that the write job has also been completed.
3. In this case, the IO thread calls `processClientIOWriteDone` to first
process the completed write job and free the COBs
affbea5dc1/src/networking.c (L4666)
4. If there are pending writes (resulting from pipeline commands), a new
async IO write job is sent before processing the completed read job
affbea5dc1/src/networking.c (L2417)
When sending the write job, the `TLS_CONN_FLAG_POSTPONE_UPDATE_STATE`
flag is set to prevent the IO thread from updating the event loop, which
is not thread-safe.
5. Upon resuming the read job processing, the flag is cleared,
affbea5dc1/src/networking.c (L4685)
causing the IO thread to update the event loop.
Fix:
Prevent sending async write job for pending writes when a read job is
about to be processed.
Testing:
The issue could not be reproduced due to its rare occurrence, which
requires multiple specific conditions to align simultaneously.
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
For backwards compatibility reasons, a node will wait until it receives
a cluster message with the extensions flag before sending its own
extensions. This leads to a delay in shard ID propagation that can
corrupt nodes.conf with inaccurate shard IDs if a node is restarted
before this can stabilize.
This fixes much of that delay by immediately triggering the
extensions-supported flag during the MEET processing and attaching the
node to the link, allowing the PONG reply to contain OSS extensions.
Partially fixes #774
---------
Signed-off-by: Ben Totten <btotten@amazon.com>
Co-authored-by: Ben Totten <btotten@amazon.com>
This PR migrates the tests related to dict into new test framework as
part of #428.
Signed-off-by: haoqixu <hq.xu0o0@gmail.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
This PR migrates the tests related to listpack into new test framework
as part of #428.
Signed-off-by: haoqixu <hq.xu0o0@gmail.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Print the full client info by using catClientInfoString; the
info is useful when we want to identify the source of a request.
Signed-off-by: Binbin <binloveplay1314@qq.com>
If we modify aof-use-rdb-preamble in the middle of a rewrite,
we may get a wrong AOF base suffix. This is because the suffix
is concatenated by the main process afterwards, and it may differ
from the value at the start.
We now cache this value when we start the rewrite.
Signed-off-by: Binbin <binloveplay1314@qq.com>
`clusterProcessGossipSection` currently doesn't trigger a check and call `replicationSetPrimary` when `myself`'s primary node’s IP/port is updated. This fix ensures that after every node address update, `replicationSetPrimary` is called if the updated node is `myself`'s primary. This prevents missed updates and ensures that replicas reconnect properly to maintain their replication link with the primary.
Before this doc update, the comments in valkey.conf said that DEL is a
blocking command, and even referred to other synchronous freeing as "in a
blocking way, like if DEL was called". This has now become confusing and
incorrect, since DEL is now non-blocking by default.
The comments also mentioned too much about the "old default" and only
later explain that the "new default" is non-blocking.
This doc update focuses on the current default and expresses it like
"Starting from Valkey 8.0, lazy freeing is enabled by default", rather
than using words like old and new.
This is a follow-up to #913.
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This pull request introduces several changes to improve the type safety
of Valkey's dictionary implementation:
- Getter/Setter Macros: Implemented macros `DICT_SET_VALUE` and
`DICT_GET_VALUE` to centralize type casting within these macros. This
change emulates the behavior of C++ templates in C, limiting type
casting to specific low-level operations and preventing it from being
spread across the codebase (see the sketch after this list).
- Reduced Assert Overhead: Removed unnecessary asserts from critical hot
paths in the dictionary implementation.
- Consistent Naming: Standardized the naming of dictionary entry types.
For example, all dictionary entry types start their names with
`dictEntry`.
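A hedged sketch of the macro idea from the first bullet (the real definitions in dict.h may differ in names and arguments):

```c
/* Centralize the void* casts in one place instead of scattering them
 * across call sites; illustrative only, not the actual dict.h macros. */
#define DICT_SET_VALUE(entry, field, val) \
    do { (entry)->field = (void *)(val); } while (0)
#define DICT_GET_VALUE(entry, field, type) ((type)((entry)->field))
```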
Fix #737
---------
Signed-off-by: Ping Xie <pingxie@google.com>
Signed-off-by: Ping Xie <pingxie@outlook.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Follow up to https://github.com/valkey-io/valkey/pull/966, which didn't
update the kvstore tests. I'm not actually entirely clear why it fixes
it, but the consistency prevents the crash very reliably so will merge
it now and maybe see if Zhao has a better explanation.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Implement data masking for user data in server logs and diagnostic output. This change prevents potential exposure of confidential information, such as PII, and enhances privacy protection. It masks all command arguments, client names, and client usernames.
Added a new hide-user-data-from-log configuration item, default yes.
---------
Signed-off-by: Amit Nagler <anagler123@gmail.com>
Feature `one-dict-per-slot` refactors the database, and part of it
involved splitting the rehashing list from the global level back to the
database level, or more specifically, the kvstore level. This change is
fine, and it also simplifies the process of swapping databases, which is
good. And it should not have a major impact on the efficiency of
incremental rehashing.
To implement the kvstore-level rehashing list, each `dict` under the
`kvstore` needs to know which `kvstore` it belongs to. However, kvstore
did not insert the reference relationship into the `dict` itself;
instead, it placed it in the `dictType`. In my view, this is a somewhat
odd approach. Theoretically, `dictType` is just a collection of function
handles, a kind of virtual type that can be referenced globally, not an
entity. But now the `dictType` is instantiated, with each `kvstore`
owning an actual `dictType`, which in turn holds a reverse reference to
the `kvstore`'s resource pointer. This design is somewhat uncomfortable
to me.
I think the `dictType` should not be instantiated. The references
between actual resources (`kvstore` and `dict`) should occur between
specific objects, rather than force materializing the `dictType`, which
is supposed to be virtual.
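A rough sketch contrasting the two designs discussed above (struct shapes are illustrative, not the actual kvstore code):

```c
/* Current approach criticized above: dictType is instantiated per
 * kvstore so it can carry a back-pointer to its owner. */
struct instantiatedDictType {
    /* ... function pointers shared by all dicts ... */
    void *owner_kvstore; /* reverse reference forced into the "type" */
};

/* Preferred direction: keep dictType shared and virtual, and put the
 * owner reference on the concrete object instead. */
struct dictWithOwner {
    /* ... hash table state ... */
    void *owner_kvstore; /* reference lives on the entity, not the type */
};
```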
---------
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
## Set replica-lazy-flush and lazyfree-lazy-user-flush to yes by default
There are many problems with running flush synchronously. Even in
single-CPU environments, the thread manager should balance between
freeing memory and serving incoming requests.
## Set lazy eviction, expire, server-del, user-del to yes by default
We now have a del and a lazyfree del, with these configuration
items to control them: lazyfree-lazy-eviction, lazyfree-lazy-expire,
lazyfree-lazy-server-del, lazyfree-lazy-user-del. In most cases lazyfree
is better since it reduces the risk of blocking the main thread, and
because we have lazyfreeGetFreeEffort, only objects with high effort
(currently 64) will use lazyfree.
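For reference, a valkey.conf snippet reflecting the defaults described above (the option names all appear in this change; shown only as illustration):

```
replica-lazy-flush yes
lazyfree-lazy-user-flush yes
lazyfree-lazy-eviction yes
lazyfree-lazy-expire yes
lazyfree-lazy-server-del yes
lazyfree-lazy-user-del yes
```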
Part of #653.
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
If the expiration time passed to SET is already expired, for example
because of the machine time (DTS) or because the expiration time passed
in is a wrong arg, we don't need to set the key and wait for the active
expire scan to delete the key.
Compared with the previous behavior:
1. If the key does not exist, previously we would set the key and wait
for the active expire to delete it, so it was a set + del from the
perspective of propagation. Now we do not set the key and return, so it
is a NOP.
2. If the key exists, previously we would set the key and wait
for the active expire to delete it, so it was a set + del from the
perspective of propagation. Now we delete it and return, so it is a del.
Adding a new deleteExpiredKeyFromOverwriteAndPropagate function
to reduce the duplicate code.
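A hand-written illustration of the resulting behavior, assuming a SET with an already-expired unix timestamp (the session is a sketch, not a captured test):

```
127.0.0.1:6379> SET foo bar EXAT 1
OK
127.0.0.1:6379> EXISTS foo
(integer) 0
```

With this change, nothing waits for the active expire cycle: depending on whether foo existed, either a NOP or only a del is propagated.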
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Deletes zipmapSet, zipmapGet, etc. Only the iterator and the validate-
integrity function are kept, which is what we use when loading an old
RDB file.
Adjust unit tests to not use zipmapSet, etc.
Solves a build failure when compiling with fortify source.
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The reason is that server 3 still has server 7 as its replica
due to a short wait; the wait is not enough, and we should wait
until the server loses its replica.
```
*** [err]: valkey-cli make source node ignores NOREPLICAS error when doing the last CLUSTER SETSLOT
Expected '{127.0.0.1 21497 267}' to be equal to '' (context: type eval line 34 cmd {assert_equal [lindex [R 3 role] 2] {}} proc ::test)
```
Signed-off-by: Binbin <binloveplay1314@qq.com>
In standalone mode, when a `-REDIRECT` error occurs, special handling is
required if the client is in the `MULTI` context.
We have adopted the same handling method as the cluster mode:
1. If a command in the transaction encounters a `REDIRECT` at the time
of queuing, the execution of `EXEC` will return an `EXECABORT` error (we
expect the client to redirect and discard the transaction upon receiving
a `REDIRECT`). That is:
```
MULTI ==> +OK
SET x y ==> -REDIRECT
EXEC ==> -EXECABORT
```
2. If all commands are successfully queued (i.e., `QUEUED` results are
received) but a redirect is detected during `EXEC` execution (such as a
primary-replica switch), a `REDIRECT` is returned to instruct the client
to perform a redirect. That is:
```
MULTI ==> +OK
SET x y ==> +QUEUED
failover
EXEC ==> -REDIRECT
```
---------
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Migrate the zipmap unit test to the new unit test framework; parent ticket #428.
---------
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
Signed-off-by: hwware <wen.hui.ware@gmail.com>
Co-authored-by: hwware <wen.hui.ware@gmail.com>
When reconfiguring a sub-replica, there may be a case where the sub-replica
will use the old offset, win the election, and cause data loss if the old
primary went down.
In this case, the sender is myself's primary; when executing updateShardId,
not only is the sender's shard_id updated, but also the shard_id of
myself, causing the subsequent areInSameShard check, that is,
the full_sync_required check, to fail.
As part of the recent fix of #885, the sub-replica needs to decide whether
a full sync is required or not when switching shards. This shard membership
check is supposed to be done against sub-replica's current shard_id, which
however was lost in this code path. This then leads to sub-replica joining
the other shard with a completely different and incorrect replication history.
This is the only place where replicaof state can be updated on this path
so the most natural fix would be to pull the chain replication reduction
logic into this code block and before the updateShardId call.
This one follows #885 and closes #942.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Fix a bug in isValidAuxChar where valid characters '.' and ':' were
incorrectly included in the banned charset. This issue affected the
validation of auxiliary fields in the nodes.conf file used by Valkey in
cluster mode, particularly when handling IPv4 and IPv6 addresses. The
code now correctly allows '.' and ':' as valid characters, ensuring
proper handling of these fields. Comments were added to clarify the use
of the banned charset.
Related to #736
---------
Signed-off-by: Ping Xie <pingxie@google.com>
In #792, the time complexity became ambiguous, fluctuating between
O(1) and O(n), which is a significant difference. We agree that
uncertainty can potentially bring disaster to the business; the right
thing to do is to persuade users to use EXISTS instead of KEYS in this
case, doing the right thing the right way rather than accommodating this
incorrect usage.
This reverts commit d66a06e8183818c035bb78706f46fd62645db07e.
This reverts #792.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Most of the content of TLS.md has already been copied to README.md in
#927.
The description of how to run tests with TLS is moved to
tests/README.md.
Descriptions of the additional scripts runtest-cluster, runtest-sentinel
and runtest-module are added in tests/README.md.
Links to tests/README.md and src/unit/README.md are added in the
top-level README.md along with a brief overview of the `make test-*`
commands.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The MANIFESTO is not Valkey's manifesto and it doesn't even mention open
source. Let's write another one later, or some other document about our
project principles.
The other two files are one-line files with no relevant info. They're
polluting the file listing at root level. It's the first thing you see
when you start exploring the repo for the first time.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Migrate the contents of TLS.md into TLS sections covering building,
running and detailed support. The TODO list in TLS.md is almost done,
except that the implementation of benchmark support is still not the
best approach; it should migrate to hiredis async mode.
Closes #888
---------
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
In #885, we only added a shutdown path; there is another path
where the server might get hung by slowlog. This PR adds
pause path coverage to cover it.
Signed-off-by: Binbin <binloveplay1314@qq.com>
## Description
While exploring the cycle distribution for the `lrange` test
(`valkey-benchmark -p 9001 -t lrange -d 100 -r 1000000 -n 1000000 -c 50
--threads 4`), I found that `prepareClientToWrite` and
`clientHasPendingReplies` could be reduced to a single call outside the
loop instead of being called inside it; ideally we can gain 3%
performance. The corresponding `LRANGE_100`, `LRANGE_300`, `LRANGE_500`
and `LRANGE_600` show a ~2% - 3% performance boost; the benchmark test
proves it helps.
This patch moves `prepareClientToWrite` and its child
`clientHasPendingReplies` out of the loop to reduce the function-call
overhead.
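A minimal self-contained sketch of the hoisting pattern with stand-in functions (the real calls are `prepareClientToWrite` and `clientHasPendingReplies` in networking.c):

```c
#include <stdio.h>

static int prepareToWrite(void) { return 0; /* stand-in; 0 means ok */ }
static void emitItem(int item) { printf("%d\n", item); }

static void replyWithRange(const int *items, int n) {
    /* The check is loop-invariant for the whole reply, so pay the
     * function-call overhead once rather than once per element. */
    if (prepareToWrite() != 0) return;
    for (int i = 0; i < n; i++) emitItem(items[i]);
}

int main(void) {
    int items[] = {100, 200, 300};
    replyWithRange(items, 3);
    return 0;
}
```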
---------
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
The test might run fast enough that there is no change in the role,
causing the test to fail. Add a wait to avoid the timing issue:
```
*** [err]: valkey-cli make source node ignores NOREPLICAS error when doing the last CLUSTER SETSLOT
Expected '{127.0.0.1 23154 267}' to be equal to '' (context: type eval line 24 cmd {assert_equal [lindex [R 3 role] 2] {}} proc ::test)
```
Signed-off-by: Binbin <binloveplay1314@qq.com>
sdsAllocSize returns the correct size without consulting the
allocator, which is much faster than consulting the allocator.
The only exception is SDS_TYPE_5, for which it still has to
consult the allocator.
This PR also sets alloc field correctly for embedded string objects.
It assumes that no allocator would allocate a buffer larger
than `259 + sizeof(robj)` for embedded string. We use embedded strings
for strings up to 44 bytes. If this assumption is wrong, the whole
function would require a rewrite. In the general case, an sds type
adjustment might be needed; such logic should go in sds.c.
---------
Signed-off-by: Vadym Khoptynets <vadymkh@amazon.com>
Currently, the `dual-channel-replication` feature flag is immutable if
`enable-protected-configs` is enabled, which is the default behavior.
This PR proposes to make the `dual-channel-replication` flag mutable,
allowing it to be changed dynamically without restarting the cluster.
**Motivation:**
The ability to change the `dual-channel-replication` flag dynamically is
essential for testing and validating the feature on real clusters
running in production environments. By making the flag mutable, we can
enable or disable the feature without disrupting the cluster's
operations, facilitating easier testing and experimentation.
Additionally, this change would provide more flexibility for users to
enable or disable the feature based on their specific requirements or
operational needs without requiring a cluster restart.
---------
Signed-off-by: naglera <anagler123@gmail.com>
1. Remove redundant connIncrRefs/connDecrRefs
In socket.c, the reference counter is incremented before calling
callHandler, but the same reference counter is also incremented inside
callHandler before calling the actual callback.
```
static inline int callHandler(connection *conn, ConnectionCallbackFunc handler) {
    connIncrRefs(conn);
    if (handler) handler(conn);
    connDecrRefs(conn);
    ...
}
```
This commit removes the redundant incr/decr calls in socket.c
2. Correct return value of connRead for TLS when peer closed
According to comments in connection.h, connRead returns 0 when the peer
has closed the connection. This patch corrects the return value for TLS
connections. (Without this patch, it returns -1 which means error.)
There is an observable difference in what is logged in the verbose
level: "Client closed connection" vs "Reading from client: (null)".
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This PR utilizes the IO threads to execute commands in batches, allowing
us to prefetch the dictionary data in advance.
After making the IO threads asynchronous and offloading more work to
them in the first 2 PRs, the `lookupKey` function becomes a main
bottleneck, taking about 50% of the main-thread time (tested with the
SET command). This is because the Valkey dictionary is a straightforward
but inefficient chained hash implementation. While traversing the hash
linked lists, every access to either a dictEntry structure, a pointer to
a key, or a value object requires, with high probability, an expensive
external memory access.
### Memory Access Amortization
Memory Access Amortization (MAA) is a technique designed to optimize the
performance of dynamic data structures by reducing the impact of memory
access latency. It is applicable when multiple operations need to be
executed concurrently. The principle behind it is that for certain
dynamic data structures, executing operations in a batch is more
efficient than executing each one separately.
Rather than executing operations sequentially, this approach interleaves
the execution of all operations. This is done in such a way that
whenever a memory access is required during an operation, the program
prefetches the necessary memory and transitions to another operation.
This ensures that when one operation is blocked awaiting memory access,
other memory accesses are executed in parallel, thereby reducing the
average access latency.
We applied this method in the development of `dictPrefetch`, which takes
as parameters a vector of keys and dictionaries. It ensures that all
memory addresses required to execute dictionary operations for these
keys are loaded into the L1-L3 caches when executing commands.
Essentially, `dictPrefetch` is an interleaved execution of dictFind for
all the keys.
**Implementation details**
When the main thread iterates over the `clients-pending-io-read`, for
clients with ready-to-execute commands (i.e., clients for which the IO
thread has parsed the commands), a batch of up to 16 commands is
created. Initially, the command's argv, which was allocated by the IO
thread, is prefetched into the main thread's L1 cache. Subsequently, all
the dict entries and values required for the commands are prefetched
from the dictionary before the command execution. Only then will the
commands be executed.
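A simplified sketch of the interleaving idea, assuming GCC/Clang's `__builtin_prefetch`; the dictEntry shape is a stand-in, and the real dictPrefetch also handles rehashing and value objects:

```c
typedef struct dictEntry {
    void *key;
    void *val;
    struct dictEntry *next;
} dictEntry;

/* Issue prefetches for every bucket head before touching any of them,
 * so the cache misses overlap instead of being paid one at a time. */
void prefetchBuckets(dictEntry **heads, int n) {
    for (int i = 0; i < n; i++) __builtin_prefetch(heads[i], 0, 3);
    /* Then walk the chains; each key pointer is prefetched while the
     * loop moves on, hiding the next miss behind current work. */
    for (int i = 0; i < n; i++)
        for (dictEntry *de = heads[i]; de != NULL; de = de->next)
            __builtin_prefetch(de->key, 0, 3);
}
```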
---------
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
This example was for script replication which we have
completely removed in 7.0, so this example is outdated now.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Although KEYS is a dangerous command and we recommend people
avoid using it, some people who are not familiar with it
still use it, sometimes with no pattern at all.
When KEYS is used with no pattern, we can convert it to an
exact match to avoid iterating over all the data.
Signed-off-by: Binbin <binloveplay1314@qq.com>
According to the Python document[1], any invalid escape sequences in
string literals now generate a DeprecationWarning (SyntaxWarning as of
3.12) and in the future this will become a SyntaxError.
This change uses Python's raw string notation for regular expression
patterns to avoid this.
[1]: https://docs.python.org/3.10/library/re.html
Signed-off-by: haoqixu <hq.xu0o0@gmail.com>
When a failover vote is denied, sometimes due to the network or
some blocking operations, the arrival time of the FAILOVER_AUTH_REQUEST
packet is very uncertain. Since there is no epoch
information in these logs, it is hard to associate a log line
with other logs.
Signed-off-by: Binbin <binloveplay1314@qq.com>
At the VERBOSE/DEBUG log level, which is output once every 5 seconds,
now also show the "Total" count of all clients and show memory
usage (used_memory) together with used_memory_human.
Also, it seems clearer to show the total number of keys and the number
of volatile keys among them.
---------
Signed-off-by: NAM UK KIM <namuk2004@naver.com>
Add new optional, immutable string config called `unixsocketgroup`.
Change the group of the unix socket to `unixsocketgroup` after `bind()`
if specified.
Adds tests to validate the behavior.
Fixes #873.
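For illustration, a hypothetical valkey.conf snippet using the new option (the socket path and group name are made up):

```
unixsocket /var/run/valkey.sock
unixsocketgroup valkey
```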
Signed-off-by: Ayush Sharma <mrayushs933@gmail.com>
Today if we attached the "run-extra-tests" tag it adds at least 20
minutes because the dump-fuzzer test runs with full accuracy. This
fuzzer is useful, but probably only really needed for the daily, so
removing it from the PRs. We still run the fuzzers, just not for as
long.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
This fixes #899. In that issue, the primary is cluster-allow-replica-migration no
and its replica is cluster-allow-replica-migration yes.
During the slot migration:
1. The primary calls blockClientForReplicaAck, waiting for its replica.
2. Its replica reconfigures itself as a replica of another shard due to
replica migration and disconnects from the old primary.
3. The old primary never gets the chance to receive the ack, so it
times out and gets a NOREPLICAS error.
In this case, the replicas might automatically migrate to another primary,
resulting in the client being unblocked with the NOREPLICAS error. In this
case, since the configuration will eventually propagate itself, we can safely
ignore this error on the source node.
Signed-off-by: Binbin <binloveplay1314@qq.com>
In the CLUSTER SETSLOT propagation logic, if the replicas are down, the
client will get blocked during command processing and then unblocked
with `NOREPLICAS Not enough good replicas to write`.
The reason is that all replicas are down (or some are down), but
myself->num_replicas includes all replicas, so the client gets blocked
and always times out.
We should only wait for the online replicas, otherwise the waiting
propagation will always time out since there are not enough replicas.
The admin can easily check if there are replicas that have been down for an
extended period of time. If they decide to move forward anyway, we
should not block it. If a replica failed right before the replication and
was not included in it, it would also be unlikely to win the election.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@google.com>
There is a test that assumes that the backlog will get overrun, but
because of the recent changes to the default it no longer fails. It
seems a bit flaky now though, so reset the value in the
test back to 1mb. (This relates to the CoB of 1100k, so it should
consistently work with a 1mb limit.)
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Now, when clients run the unsubscribe, sunsubscribe and punsubscribe
commands in non-subscribed mode, the commands return 0.
This is a bug; we should not allow clients to run these kinds of
commands here.
Thus, this PR fixes the bug, but it is a breaking change for existing
clients.
---------
Signed-off-by: hwware <wen.hui.ware@gmail.com>
In #764, we added an --io-threads mode to the tests, but forgot
to handle runtest-cluster; they are different frameworks.
Currently runtest-cluster does not support tags, and we
have no plan to support them. And since the cluster tests currently
do not have any io-threads tests, this PR just aligns the
--io-threads option with #764.
Signed-off-by: Binbin <binloveplay1314@qq.com>
If n is already myself's primary, there is no need to re-establish the
replication connection.
In the past we allowed a replica node to reconnect with its primary via
this CLUSTER REPLICATE command, and it would use psync. But since #885,
we assume that a full sync is needed in this case, so if we don't do
this, the replica will always use a full sync.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@google.com>
Fix feedback loop in key eviction with tracking clients when using I/O
threads.
Current issue:
Evicting keys while tracking clients or keyspace notifications exist
creates a feedback loop when using I/O threads:
while evicting keys, we send tracking async writes to the I/O threads,
preventing immediate release of the tracking clients' COB memory
consumption.
Before the I/O thread finishes its write, we recheck used_memory, which
now includes the tracking clients' COB, and thus continue to evict more
keys.
**Fix:**
We will skip the test for now while IO threads are active. We may
consider avoiding sending writes in `processPendingWrites` to I/O
threads for tracking clients when we are out of memory.
---------
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Update README.md since the project is no longer under construction, and
can reference the main website.
---------
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
The repl-backlog-size 1mb is too small in most cases, now network
transmission and bandwidth performance have improved rapidly in more
than ten years.
The bigger the replication backlog, the longer the replica can endure
the disconnect and later be able to perform a partial resynchronization.
Part of #653.
---------
Signed-off-by: Binbin <binloveplay1314@qq.com>
Just add one more test for the command "sentinel IS-PRIMARY-DOWN-BY-ADDR"
to make the reply-schemas-validator run successfully.
Note: test result here
https://github.com/hwware/valkey/actions/runs/10457516111
Signed-off-by: hwware <wen.hui.ware@gmail.com>
If the client side crashes due to any issue or exits normally, the kernel
will try to disconnect the RDMA QPs. The kernel on the server side then
receives CM packets, and valkey-server handles the CM disconnected event
and closes the connection.
However, the RDMA transport layer lacks a keepalive mechanism. Once the
kernel on the client side crashes, the server side will not
be notified. To avoid this issue, valkey-server sends a keepalive
command periodically to detect any dead QPs.
An example of mlx-cx5:
```
# RDMA: CQ handle error status: transport retry counter exceeded[0xc], opcode : 0x0
# RDMA: CQ handle error status: transport retry counter exceeded[0xc], opcode : 0x0
# RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0
# RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0
# RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0
# RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0
```
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Our current replica can initiate a failover without restriction when
it detects that the primary node is offline. This is generally not a
problem. However, consider the following scenarios:
1. In slot migration, a primary loses its last slot and then becomes
a replica. When it is fully synchronized with the new primary, the new
primary goes down.
2. In the CLUSTER REPLICATE command, a replica becomes a replica of another
primary. When it is fully synchronized with the new primary, the new
primary goes down.
In the above scenario, case 1 may cause the empty primary to be elected
as the new primary, resulting in primary data loss. Case 2 may cause the
non-empty replica to be elected as the new primary, resulting in data
loss and confusion.
The reason is that we have the cached primary logic, which is used for psync.
In the above scenarios, when clusterSetPrimary is called, myself will cache
server.primary in server.cached_primary for psync. In replicationGetReplicaOffset,
we get server.cached_primary->reploff as the offset, gossip it and rank it,
which causes the replica to use the old historical offset to initiate
failover; it gets a good rank, initiates the election first, and is then
elected as the new primary.
The main problem here is that when the replica has not completed full
sync, it may get the historical offset in replicationGetReplicaOffset.
The fix is to clear cached_primary in the places where a full sync is
obviously needed, and let the replica use offset == 0 to participate
in the election. In this way, this unhealthy replica has a worse rank
and is less likely to be elected.
Of course, it is still possible that it will be elected with offset == 0.
In the future, we may need to prohibit replicas with offset == 0
from having the right to initiate elections.
Another point worth mentioning, in above cases:
1. In the ROLE command, the replica status will be handshake, and the
offset will be -1.
2. Before this PR, in the CLUSTER SHARD command, the replica status will
be online, and the offset will be the old cached value (which is wrong).
3. After this PR, in the CLUSTER SHARD, the replica status will be loading,
and the offset will be 0.
Signed-off-by: Binbin <binloveplay1314@qq.com>
In these places we should use RDB_EOF_MARK_SIZE, but we mixed
it up with CONFIG_RUN_ID_SIZE. This is not an issue since both are
40; this is just a cleanup.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Saves some memory (one pointer) in the client struct.
Since a client cannot be blocked multiple times, we can assume
it will be held in only one extra utility list, so it is ok to maintain
a union of these listNode references.
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Currently, if the client enters a blocked state, it will be
added to the server.clients_waiting_acks list. When the client
is unblocked, that is, when unblockClient is called, we will
need to linearly traverse server.clients_waiting_acks to delete
the client, and this search is O(N).
When WAIT (or WAITAOF) is used extensively in some cases, this
O(N) search may be time-consuming. We can remember the list node
and store it in the blockingState struct, which avoids the
linear search in unblockClientWaitingReplicas.
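A hedged sketch of the pattern (listAddNodeTail, listLast and listDelNode are the adlist API from the codebase; the blockingState field name below is an assumption for illustration):

```c
/* On block: append the client and remember our own node so the later
 * removal is O(1). */
listAddNodeTail(server.clients_waiting_acks, c);
c->bstate.client_waiting_acks_list_node = listLast(server.clients_waiting_acks);

/* On unblock: O(1) delete instead of an O(N) scan of the list. */
listDelNode(server.clients_waiting_acks, c->bstate.client_waiting_acks_list_node);
c->bstate.client_waiting_acks_list_node = NULL;
```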
Signed-off-by: Binbin <binloveplay1314@qq.com>
Add 4 new commands for Sentinel (reference
https://github.com/valkey-io/valkey/issues/36)
Sentinel GET-PRIMARY-ADDR-BY-NAME
Sentinel PRIMARY
Sentinel PRIMARIES
Sentinel IS-PRIMARY-DOWN-BY-ADDR
and deprecate 4 old commands:
Sentinel GET-MASTER-ADDR-BY-NAME
Sentinel MASTER
Sentinel MASTERS
Sentinel IS-MASTER-DOWN-BY-ADDR
and all sentinel tests pass here
https://github.com/hwware/valkey/actions/runs/9962102363/job/27525124583
Note:
1. runtest-sentinel pass all test cases
2. I finished a sentinel rolling upgrade test: 1 primary 2 replicas 3
sentinel
there are 4 steps in this test scenario:
step 1: all 3 sentinel nodes run old sentinel, shutdown primary, and
then new primary can be voted successfully.
step 2: replace sentinel 1 with new sentinel bin file, and then shutdown
primary, and then another new primary can be voted successfully
step 3: replace sentinel 2 with new sentinel bin file, and then shutdown
primary, and then another new primary can be voted successfully
step 4: replace sentinel 3 with new sentinel bin file, and then shutdown
primary, and then another new primary can be voted successfully
We can see that even with mixed-version sentinels running, the whole
system still works.
---------
Signed-off-by: hwware <wen.hui.ware@gmail.com>
For the `debug object` command, we use `val->lru` but ignore `lfu` mode.
So in `lfu` mode, `debug object` would return meaningless `lru` descriptions.
Added two new fields: lfu_freq and lfu_access_time_minutes.
Signed-off-by: jiangyujie.jyj <yjjiang1996@163.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Adding a FAST option to the DEBUG OBJECT command.
The fast version only shows lightweight information,
which is mostly O(1). The pre-existing version shows more
stats, such as serializedlength, which is sometimes time-consuming.
This should allow looking into debug stats (e.g. a key expired
but not yet deleted), even on huge objects on which we're afraid
to run the command for fear of causing a server freeze.
Somewhat like 3ca451c46fed894bf49e7561fa0282d2583f1c06.
Signed-off-by: Binbin <binloveplay1314@qq.com>
In #786, we skipped it in the daily run, but not for the others.
When running ./runtest on macOS, we get this failure:
```
couldn't open socket: host is unreachable (nodename nor servname provided, or not known)
```
The reason is that TCL 8.5 doesn't support IPv6, so we skip tests
tagged with ipv6. This also reverts #786.
Signed-off-by: Binbin <binloveplay1314@qq.com>
I tried to test dual channel replication but forgot to add +sync
for my replication user. As a result, the replica entered a silent cycle
like this:
```
* Connecting to PRIMARY 127.0.0.1:6379
* PRIMARY <-> REPLICA sync started
* Non blocking connect for SYNC fired the event.
* Primary replied to PING, replication can continue...
* Trying a partial resynchronization (request ...)
* PSYNC is not possible, initialize RDB channel.
* Aborting dual channel sync
```
And primary got endless cycle like this:
```
* Replica 127.0.0.1:6380 asks for synchronization
* Partial resynchronization not accepted: Replication ID mismatch (Replica asked for '...', my replication IDs are '...' and '...')
* Replica 127.0.0.1:6380 is capable of dual channel synchronization, and partial sync isn't possible. Full sync will continue with dedicated RDB channel.
```
There was no way to tell at the notice log level that the replication
user is missing the +sync acl. With this one-line change we get a
warning message in the replica log.
---------
Signed-off-by: secwall <secwall@yandex-team.ru>
Update references of copyright being assigned to Salvatore when it was
transferred to Redis Ltd. as per
https://github.com/valkey-io/valkey/issues/544.
---------
Signed-off-by: Pieter Cailliau <pieter@redis.com>
A pointer to dtype is stored in the dict forever.
dtype is stack-allocated while the dict created is global.
The dict (and the pointer to dtype in it) will live past the lifetime of
dtype.
clusterManagerLinkDictType is a global object that has the same values
as dtype.
Signed-off-by: Salvatore Mesoraca <salvatore.mesoraca@aiven.io>
Fix #821
During the `FAILOVER` process, when conditions are met (such as when the
force time is reached or the primary and replica offsets are
consistent), the primary actively becomes the replica and transitions to
the `FAILOVER_IN_PROGRESS` state. After the primary becomes the replica,
and after handshaking and other operations, it will eventually send the
`PSYNC FAILOVER` command to the replica, after which the replica will
become the primary. This means that the upgrade of the replica to the
primary is an asynchronous operation, which implies that during the
`FAILOVER_IN_PROGRESS` state, there may be a period of time where both
nodes are replicas. In this scenario, if a `-REDIRECT` is returned, the
request will be redirected to the replica and then redirected back,
causing back-and-forth redirection. To avoid this situation, during the
`FAILOVER_IN_PROGRESS` state, we temporarily suspend the clients that
need to be redirected until the replica truly becomes the primary, and
then resume the execution.
---------
Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
We update this variable in the main thread, and the
child threads can be printing logs at the same time. This
generates a warning with SANITIZER=thread:
```
WARNING: ThreadSanitizer: data race (pid=74208)
Read of size 4 at 0x000102875c10 by thread T3:
#0 serverLogRaw <null>:52173615 (valkey-server:x86_64+0x10003c556)
#1 _serverLog <null>:52173615 (valkey-server:x86_64+0x10003ca89)
#2 bioProcessBackgroundJobs <null>:52173615 (valkey-server:x86_64+0x1001402c9)
Previous write of size 4 at 0x000102875c10 by main thread (mutexes: write M0):
#0 afterSleep <null>:52173615 (valkey-server:x86_64+0x10004989b)
#1 aeProcessEvents <null>:52173615 (valkey-server:x86_64+0x100031e52)
#2 main <null>:52173615 (valkey-server:x86_64+0x100064a3c)
#3 start <null>:52173615 (dyld:x86_64+0xfffffffffff5c365)
#4 start <null>:52173615 (dyld:x86_64+0xfffffffffff5c365)
```
The refresh of daylight_active is not real time; we update
it in afterSleep, so we don't need strong synchronization,
and memory_order_relaxed suffices. Note that we are doing
load/store operations only on daylight_active, which is an
aligned 32-bit integer, so using memory_order_relaxed will
not provide more consistency than what we have today.
So this is just a cleanup to clear the warning.
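A minimal C11 sketch of the pattern described above (the variable name matches the text; the wrapper functions are illustrative):

```c
#include <stdatomic.h>

static _Atomic int daylight_active;

/* Main thread, refreshed in afterSleep(): no ordering needed, the
 * value only affects log timestamp formatting. */
void refreshDaylightActive(int active) {
    atomic_store_explicit(&daylight_active, active, memory_order_relaxed);
}

/* Bio threads, while formatting a log line: a stale value is harmless. */
int getDaylightActive(void) {
    return atomic_load_explicit(&daylight_active, memory_order_relaxed);
}
```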
Signed-off-by: Binbin <binloveplay1314@qq.com>
The test `dual-channel-replication primary gets cob overrun during
replica rdb load` fails during the Valgrind run. This is due to the load
handlers disconnecting before the tests complete, resulting in a low
primary COB. Increasing the handlers' timeout should resolve this issue.
Failure:
https://github.com/valkey-io/valkey/actions/runs/10361286333/job/28681321393
Server logs reveal that the load handler clients were disconnected
before the test started.
Also, the two previous tests took about 20 seconds, which is the handler
timeout.
---------
Signed-off-by: naglera <anagler123@gmail.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
clusterManagerCompareKeysValues was introduced in
143bfa1e6e65cf8be1eaad0b8169e2d95ca62f9a; it calls
DEBUG DIGEST-VALUE to check whether the values on the
source node and the target node are consistent.
However, the DEBUG DIGEST-VALUE command is not supported
in older versions, such as version 4, and the node will return
an unknown subcommand error. Or the DEBUG command can be disabled
in version 7, and the node will return DEBUG not allowed.
In these cases, we need to output a friendly message to
allow users to proceed to the next step, instead of just
outputting `Value check failed!`.
Unknown subcommand example:
```
*** Target key exists
*** Checking key values on both nodes...
Node 127.0.0.1:30001 replied with error:
ERR unknown subcommand or wrong number of arguments for 'DIGEST-VALUE'. Try DEBUG HELP.
Node 127.0.0.1:30003 replied with error:
ERR unknown subcommand or wrong number of arguments for 'DIGEST-VALUE'. Try DEBUG HELP.
*** Value check failed!
DEBUG DIGEST-VALUE command is not supported.
You can relaunch the command with --cluster-replace option to force key overriding.
```
DEBUG not allowed example:
```
*** Target key exists
*** Checking key values on both nodes...
Node 127.0.0.1:30001 replied with error:
ERR DEBUG command not allowed. If the enable-debug-command option is ...
Node 127.0.0.1:30003 replied with error:
ERR DEBUG command not allowed. If the enable-debug-command option is ...
*** Value check failed!
DEBUG command is not allowed.
You can turn on the enable-debug-command option.
Or you can relaunch the command with --cluster-replace option to force key overriding.
```
Signed-off-by: Binbin <binloveplay1314@qq.com>
In the previous implementation of the `ZUNION` and `ZUNIONSTORE` commands, we
first create a temporary dict called `accumulator`. After adding all
member-score mappings to `accumulator`, we still need to convert
`accumulator` back to the final dict `dstzset->dict`. However, we can
directly use `dstzset->dict` to avoid the additional copy operation.
This PR removes the `accumulator` dict and directly uses `dstzset->dict`
to store the member-score mappings.
- **Test**
First, I added 300 unique elements to two sorted sets called
'zunion_test1' and 'zunion_test2'. Then, I tested `zunion` and
`zunionstore` on these two sorted sets. The test results shown below
indicate that the performance of both zunion and zunionstore improved
by about 31%.
### ZUNION
#### unstable
```
./valkey-benchmark -P 10 -n 100000 zunion 2 zunion_test1 zunion_test2
Summary:
throughput summary: 2713.41 requests per second
latency summary (msec):
avg min p50 p95 p99 max
146.252 3.464 153.343 182.015 184.959 192.895
```
#### This PR
```
./valkey-benchmark -P 10 -n 100000 zunion 2 zunion_test1 zunion_test2
Summary:
throughput summary: 3543.84 requests per second
latency summary (msec):
avg min p50 p95 p99 max
108.259 2.984 114.239 141.695 145.151 160.255
```
### ZUNIONSTORE
#### unstable
```
./valkey-benchmark -P 10 -n 100000 zunionstore out 2 zunion_test1 zunion_test2
Summary:
throughput summary: 3168.07 requests per second
latency summary (msec):
avg min p50 p95 p99 max
157.511 3.368 183.167 189.311 193.535 231.679
```
#### This PR
```
./valkey-benchmark -P 10 -n 100000 zunionstore out 2 zunion_test1 zunion_test2
Summary:
throughput summary: 4144.73 requests per second
latency summary (msec):
avg min p50 p95 p99 max
120.374 2.648 141.823 149.119 153.855 183.167
```
---------
Signed-off-by: RayCao <zisong.cw@alibaba-inc.com>
Signed-off-by: zisong.cw <zisong.cw@alibaba-inc.com>
A configuration option with zero impact on server operation, but which
is printed out on server crash and can be accessed by gdb for debugging.
It can be used by the user/operator to store any free-form string. This
string will persist as long as the server is running and will be
accessible in several ways, for example printed in crash reports:
```
------ CONFIG DEBUG OUTPUT ------
lazyfree-lazy-eviction no
...
io-threads-do-reads yes
debug-context "test2"
proto-max-bulk-len 512mb
```
---------
Signed-off-by: Eran Liberty <eranl@amazon.com>
Co-authored-by: Eran Liberty <eranl@amazon.com>
- Fix a TLS bug where connections were shut down by the primary's main
process while the child process was still writing, causing the main
process to be blocked.
- TLS connection fix: file descriptors are set to blocking mode in the
main thread, followed by a blocking write. This sets the file
descriptors to non-blocking if TLS is used (see `connTLSSyncWrite()`)
(@xbasel).
- Improve the reliability of dual-channel tests. Modify the pause
mechanism to verify process status directly, rather than relying on logs.
- Ensure that `server.repl_offset` and `server.replid` are updated
correctly when dual channel synchronization completes successfully.
Previously this led to failures in replication tests that validate
replication IDs or compare replication offsets.
---------
Signed-off-by: naglera <anagler123@gmail.com>
Signed-off-by: naglera <58042354+naglera@users.noreply.github.com>
Signed-off-by: xbasel <103044017+xbasel@users.noreply.github.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
When debug assert mode is enabled, verify that we don't insert the same
client twice into server.clients_to_close.
Signed-off-by: naglera <anagler123@gmail.com>
Currently lastbgsave_status is used in bgsave or disk-based replication,
and the target is the disk. In #60, we updated it on transfer errors; I
think it is mainly used in tests, so we can use a log to replace it.
That change set lastbgsave_status to err in this case, but it is strange
that it does not set ok or err in the above if and the following else.
Also note this will affect stop-writes-on-bgsave-error.
Signed-off-by: Binbin <binloveplay1314@qq.com>
I think we should first check if the server is currently enabled in
cluster mode or if it has modules loaded prior to the throttled cron run
(`run_with_period`) condition.
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Added an assertion to avoid incorrect usage of the network bytes out for
replication code flow in slot stats computation.
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
This PR adjusts the logging conditions of clusterLogCantFailover
in these two ways.
1. For the same cant_failover_reason, we print the log once
per CLUSTER_CANT_FAILOVER_RELOG_PERIOD, but its value is 10s, which
is a bit long; shorten it to 1s so we can better track its state.
We get to see the system making progress by watching the message.
Using 1s also covers pretty much all cases, as I don't see a reason
for using a <1s node timeout, in test or prod.
2. We do not print logs before nolog_fail_time, whose value
is cluster-node-timeout+5000. This may cause us to lose some logs;
for example, if cluster-node-timeout is small, auth_timeout will
be 2000 and auth_retry_time will be 4000. In this case, we will
lose all the reasons during the election if the failover times out.
So remove the nolog_fail_time logic; since we still have the
CLUSTER_CANT_FAILOVER_RELOG_PERIOD logic, we won't print too many
logs.
Signed-off-by: Binbin <binloveplay1314@qq.com>
A few CI improvements which will reduce CI queue occupation and
eliminate stale runs.
1. Kill CI jobs on PRs once the PR branch gets a new push. This will
prevent the situation that happened today: a huge job triggered twice in
less than an hour occupied the whole **org** (all repositories) runner
queue for the rest of the day (see the picture). This completely blocked
the valkey-glide team.
2. Distribute nightly cron jobs over time to prevent them from running
together. Keep in mind that cron's TZ is UTC, so midnight tasks affect
developers located in other timezones.
This must be backported to all release branches (`valkey-x.y` and `x.y`)
---------
Signed-off-by: Yury-Fridlyand <yury.fridlyand@improving.com>
This example of a minimal user account in your Valkey server
for Sentinel is incorrect. If you add this ACL as-is to your
valkey users.acl, valkey will add resetchannels -@all before
the +client, which prevents sentinel from publishing messages
to the __sentinel__:hello pubsub channel for sentinel discovery.
Fix #744.
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
We have a number of test failures in the empty shard migration which
seem to be related to race conditions in the failover, but could be more
pervasive. For now disable the tests to prevent so many false negative
test failures.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Fixing build problems with dependencies or cached build options
---------
<!-- ABOUT THE PROJECT -->
## О проекте
Valkey has some dependencies which are included in the `deps` directory.
`make` does not automatically rebuild dependencies even if something in
the source code of dependencies changes.
Проект Futriix является форком проекта Valkey.
Futriix-Распределённая СУБД на языке "C", построенная на базе [Valkey](https://valkey.io/), с поддержкой модулей на базе Искусственного интеллекта и модулей на языке Golang.
When you update the source code with `git pull` or when code inside the
dependencies tree is modified in any other way, make sure to use the following
command in order to really clean everything and rebuild from scratch:
СУБД поддерживает модуль c распределённым [JSON](https://source.futriix.ru/gvsafronov/futriix-json), [ИИ-модуль "Виртуальный помощник"](), [SQL-модуль](https://source.futriix.ru/gvsafronov/fdx).
% make distclean
Ниже приведён пример того, инструкции по настройке вашего проекта локально.
Чтобы запустить локальную копию проекта, выполните следующие простые шаги.
This will clean: jemalloc, lua, hiredis, linenoise and other dependencies.
Also if you force certain build options like 32bit target, no C compiler
optimizations (for debugging purposes), and other similar build time options,
those options are cached indefinitely until you issue a `make distclean`
command.
Fixing problems building 32 bit binaries
---------
### Подготовка
If after building Valkey with a 32 bit target you need to rebuild it
with a 64 bit target, or the other way around, you need to perform a
`make distclean` in the root directory of the Valkey distribution.
Ниже приведены шаги, которые помогут вам скомпилировать и установить Futriix.
* Устанавливаем язык программирования C, соопутствующие утилиты (autoconf и другие)
In case of build errors when trying to build a 32 bit binary of Valkey, try
## Исправление проблем сборки с зависимостями или кэшированными параметрами сборки.
Futriix содержит некоторые зависимости, которые хранятся в директории `deps`.
Утилита `make` автоматически не пересобирает зависимости даже если вносятся каие-либо изменения в код зависимостей.
Когда вы обновляете код проекта командой `git pull` или когда код внутри
дерева зависимостей изменен каким-либо другим способом, обязательно используйте следующее
команду для того, чтобы действительно все почистить и пересобрать с нуля:
```sh
unix:$ make distclean
```
В результате работы команды выше будут очищены: аллокатор памяти jemalloc, язык lua, библиотеку hiredis, библиотеку linenoise а также другие зависимости.
Кроме того, если вы принудительно используете определенные параметры сборки, такие как 32-битная версия для 32-битной системы, оптимизации компилятора C в данном случае не будут выполнены. Оптимизации (для целей отладки) и другие подобные параметры времени сборки,
кэшируются на неопределенный срок, пока вы не выполните команду `make distclean`.
To build with support for the processor's internal instruction clock, use:
% make CFLAGS="-DUSE_PROCESSOR_CLOCK"
To build Futriix with support for the processor's internal instruction clock, run the command below:
```sh
unix:$ make CFLAGS="-DUSE_PROCESSOR_CLOCK"
```
Verbose build
-------------
Valkey will build with a user-friendly colorized output by default.
If you want to see a more verbose output, use the following:
% make V=1
## Verbose build
Futriix builds with user-friendly colorized output by default.
If you want to see more verbose output, run the following command:
```sh
unix:$ make V=1
```
Running Valkey
-------------
To run Valkey with the default configuration, just type:
% cd src
% ./valkey-server
If you want to provide your valkey.conf, you have to run it using an additional
parameter (the path of the configuration file):
% cd src
% ./valkey-server /path/to/valkey.conf
It is possible to alter the Valkey configuration by passing parameters directly
as options using the command line.
4. If you want to start the Futriix server with default parameters (without specifying a configuration file), run the following command:
```sh
./futriix-server
```
5. You can also use the configuration file `futriix.conf`, located in the "Futriix" directory, to configure your server.
To start Futriix with the configuration file, use the command below:
```sh
./futriix-server /path/to/futriix.conf
```
6. Run the futriix-cli utility (the Futriix client) to connect to a **local** Futriix server and start working with the instance:
```sh
./futriix-cli
```
7. To connect with the futriix-cli utility to a specific node in the network, add the `h` parameter to specify the remote host by its IP address and the `p` parameter to specify the port number, as in the sketch below.
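A hypothetical invocation, assuming `futriix-cli` follows the same `-h`/`-p` flag convention as valkey-cli; the address and port below are placeholders:
```sh
./futriix-cli -h 192.168.1.10 -p 6379
```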
You can use `make PREFIX=/some/other/directory install` if you wish to use a
different destination.
_Note_: For compatibility with Redis, we create symlinks from the Redis names (`redis-server`, `redis-cli`, etc.) to the Valkey binaries installed by `make install`.
The symlinks are created in the same directory as the Valkey binaries.
The symlinks are removed when using `make uninstall`.
The creation of the symlinks can be skipped by setting the makefile variable `USE_REDIS_SYMLINKS=no`.
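For example, a one-line sketch combining the two options above (the path is a placeholder):

% make PREFIX=/opt/valkey USE_REDIS_SYMLINKS=no install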
`make install` will just install binaries in your system, but will not configure
init scripts and configuration files in the appropriate place. This is not
needed if you just want to play a bit with Valkey, but if you are installing
it the proper way for a production system, we have a script that does this
for Ubuntu and Debian systems:
% cd utils
% ./install_server.sh
_Note_: `install_server.sh` will not work on Mac OSX; it is built for Linux only.
The script will ask you a few questions and will setup everything you need
to run Valkey properly as a background daemon that will start again on
system reboots.
You'll be able to stop and start Valkey using the script named
`/etc/init.d/valkey_<portnumber>`, for instance `/etc/init.d/valkey_6379`.
## Cluster
1. Open the Futriix directory:
```sh
unix:$ cd futriix
```
2. Open the `futriix.conf` configuration file in any text editor, for example nano, as in the example below:
```sh
unix:$ nano futriix/futriix.conf
```
Code contributions
-----------------
Please see the [CONTRIBUTING.md][2]. For security bugs and vulnerabilities, please see [SECURITY.md][3].
Valkey is an open community project under LF Projects.
3. Find the "active-replica" and "multi-master" parameters and set both to "yes". Then add the IP addresses of the nodes of your cluster to the configuration file. If you did everything correctly, you should see lines in the `futriix.conf` configuration file like the sketch shown below:
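A hypothetical sketch only, assuming the directive names match the parameter names used above and that the cluster node addresses follow, one per line:
```
active-replica yes
multi-master yes
# IP addresses of the cluster nodes, one per line (placeholders)
192.168.1.10
192.168.1.11
```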
5. Go to the Futriix directory and run the `cluster.sh` script, first with the `pick` parameter (run with this parameter, the script "assembles" the cluster) and then with the `run` parameter (run with this parameter, the script "starts" the cluster), as shown below:
```sh
unix:$ ./cluster pick
unix:$ ./cluster run
```
6. Set execute permissions on the `cluster.sh` script using the command below:
```sh
unix:$ chmod +x cluster.sh
```
7. To stop the cluster, run the `cluster.sh` script with the `stop` parameter, as sketched below.
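A sketch of the stop invocation, following the pattern of the commands above:
```sh
unix:$ ./cluster stop
```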
Contributions are what make the open source community such an amazing place to learn, be inspired, and create. Any contribution you make is **greatly appreciated**.
If you have a suggestion that would make the project better, fork the repository and create a pull request. You can also simply open an issue with the tag "enhancement".
Don't forget to give the project a star! Thanks again!
1. Fork the project
2. Create your feature branch (`git checkout -b Feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin Feature/AmazingFeature`)
5. Open a pull request
<!-- LICENSE -->
## License
The project is distributed under the BSD 3-Clause License. See the `COPYING.txt` file for details.
This directory contains all Valkey dependencies, except for the libc that should be provided by the operating system.
* **linenoise** is a readline replacement. It is developed by the same authors of Valkey but is managed as a separated project and updated as needed.
* **lua** is Lua 5.1 with minor changes for security and additional libraries.
* **hdr_histogram** Used for per-command latency tracking histograms.
* **fast_float** is a replacement for strtod to convert strings to floats efficiently.
How to upgrade the above dependencies
===
Lua
---
Differences between the original Lua and our version:
1. Makefile is modified to allow a different compiler than GCC.
2. We have the implementation source code, and directly link to the following external libraries: `lua_cjson.o`, `lua_struct.o`, `lua_cmsgpack.o` and `lua_bit.o`.
3. There is a security fix in `ldo.c`, line 498: The check for `LUA_SIGNATURE[0]` is removed in order to avoid direct bytecode execution.
4. In `lstring.c`, the luaS_newlstr function's hash calculation has been upgraded from a simple hash function to MurmurHash3, implemented within the same file, to enhance performance, particularly for operations involving large strings.
Hdr_Histogram
---
We use a customized version based on master branch commit e4448cf6d1cd08fff51981
2. Copy updated files from newer version onto files in /hdr_histogram.
3. Apply the changes from 1 above to the updated files.
fast_float
---
The fast_float library provides fast header-only implementations for the C++ from_chars functions for `float` and `double` types as well as integer types. These functions convert ASCII strings representing decimal values (e.g., `1.3e10`) into binary types. The functions are much faster than comparable number-parsing functions from existing C++ standard libraries.
Specifically, `fast_float` provides a `from_chars` function to parse floating-point numbers with a C++17-like syntax (the library itself only requires C++11).