If a replica is step into data_age too old stage, it can not
trigger the failover and currently it can not be automatically
recovered and we will print a log every
CLUSTER_CANT_FAILOVER_RELOG_PERIOD,
which is every second. If the primary has not recovered or there is
no manual failover, this log will flood the log file.
In this case, limit its frequency to 10 times period, which is
10 seconds in our code. Also in this data_age too old stage,
the repeated logs also can stand for the progress of the failover.
See also #780 for more details about it.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
The command argument strings created while parsing inline commands (see
`processInlineBuffer()`) can contain free capacity. Since some commands
,such as `SET`, store these strings in the database, that free capacity
increases the memory usage. In the worst case, it could double the
memory usage.
This only occurs if the inline command format is used. The argument
strings are built by appending character by character in
`sdssplitargs()`. Regular RESP commands are not affected.
This change trims the strings within `processInlineBuffer()`.
### Why `trimStringObjectIfNeeded()` within `object.c` is not solving
this?
When the command argument string is packed into an object,
`trimStringObjectIfNeeded()` is called.
This does only trim the string if it is larger than
`PROTO_MBULK_BIG_ARG` (32kB), as only strings larger than this would
ever need trimming if the command it sent using the bulk string format.
We could modify this condition, but that would potentially have a
performance impact on commands using the bulk format. Since those make
up for the vast majority of executed commands, limiting this change to
inline commands seems prudent.
### Experiment Results
* 1 million `SET [key] [value]` commands
* Random keys (16 bytes)
* 600 bytes values
Memory usage without this change:
```
used_memory:1089327888
used_memory_human:1.01G
used_memory_rss:1131696128
used_memory_rss_human:1.05G
used_memory_peak:1089348264
used_memory_peak_human:1.01G
used_memory_peak_perc:100.00%
used_memory_overhead:49302800
used_memory_startup:911808
used_memory_dataset:1040025088
used_memory_dataset_perc:95.55%
```
Memory usage with this change:
```
used_memory:705327888
used_memory_human:672.65M
used_memory_rss:718802944
used_memory_rss_human:685.50M
used_memory_peak:705348256
used_memory_peak_human:672.67M
used_memory_peak_perc:100.00%
used_memory_overhead:49302800
used_memory_startup:911808
used_memory_dataset:656025088
used_memory_dataset_perc:93.13%
```
If the same experiment is repeated using the normal RESP array of bulk
string format (`*3\r\n$3\r\nSET\r\n...`) then the memory usage is 672MB
with and without of this change.
If a replica is attached, its memory usage is 672MB with and without
this change, since the replication link never uses inline commands.
Signed-off-by: Stefan Mueller <muelstef@amazon.com>
Currently, if the replica has a lot of data, CLUSTER RESET
will block for a while and report the slowlog, and it seems
that there is no harm in making it async so external components
can be easier when monitoring it.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Part of https://github.com/valkey-io/valkey/pull/1200 PR, since feild is
changed. Looks like commands.def is missed to get genereated based on
the changes so that is causing CI failure on unstable.
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
In some cases bgsave child process can run for a long time exhausting
system resources. Although it is possible to kill the bgsave child
process from the system shell, sometimes it is not possible allowing OS
level access.
This PR adds a new subcommand to the BGSAVE command.
When user will issue `BGSAVE CANCEL`, it will do one of the 2:
1. In case a bgsave child process is currently running, the child
process would be immediately killed thus terminating any
save/replication full sync process.
2. In case a bgsave child process is SCHEDULED to run, the scheduled
execution will be cancelled.
---------
Signed-off-by: ranshid <ranshid@amazon.com>
Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com>
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
struct connListener.priv should be used by connection type specific
data, static local listener data should not use this.
A RDMA config structure is going to be introduced in the next step:
```
typedef struct serverRdmaContextConfig {
char *bindaddr;
int bindaddr_count;
int port;
int rx_size;
int comp_vector;
...
} serverRdmaContextConfig;
```
Then a builtin RDMA will be supported.
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
This special pattern '#' is used to get the element itself,
it does not actually participate in the slot check.
In this case, passing `GET #` will cause '#' to participate
in the slot check, causing the command to get an
`pattern may be in different slots` error.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Hide 'unixsocketgroup' and 'unixsocketperm' into a Unix socket specific
data structure. A single opaque pointer 'void *priv' is enough for a
listener. Once any new config is added, we don't need 'void *priv2',
'void *priv3' and so on.
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
When profiling some workloads with `io-threads` enabled. We found the
false sharing issue is heavy.
This patch try to split the the elements accessed by main thread and
io-threads into different cache line by padding the elements in the head
of `used_memory_thread_padded` array.
This design helps mitigate the false sharing between main
thread and io-threads, because the main thread has been the bottleneck
with io-threads enabled. We didn't put each element in an individual
cache line is that we don't want to bring the additional cache line
fetch operation (3 vs 16 cache line) when call function like
`zmalloc_used_memory()`.
---------
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
Signed-off-by: Lipeng Zhu <zhu.lipeng@outlook.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Wangyang Guo <wangyang.guo@intel.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
There is no limitation in Valkey to create a cluster with 1 or 2 primaries,
only that it cannot do automatic failover. Remove this restriction and
add `are you sure` prompt to prompt the user.
This allow we use it to create a test cluster by cli or by
create-cluster.
Signed-off-by: Binbin <binloveplay1314@qq.com>
https://github.com/valkey-io/valkey/issues/1145
First part of a two-step effort to add `WithSlot` API for expiry. This
PR is to fix a crash that occurs when a RANDOMKEY uses a different slot
than the cached slot of a client during a multi-exec.
The next part will be to utilize the new API as an optimization to
prevent duplicate work when calculating the slot for a key.
---------
Signed-off-by: Nadav Levanoni <nadavl@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Nadav Levanoni <nadavl@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
The client that was killed by FUNCTION KILL received a reply of
SCRIPT KILL and the server log also showed SCRIPT KILL.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Currently in conf we describe activerehashing as: Active rehashing
uses 1 millisecond every 100 milliseconds of CPU time. This is the
case for hz = 10.
If we change hz, the description in conf will be inaccurate. Users
may notice that the server spends some CPU (used in activerehashing)
at high hz but don't know why, since our cron calls are fixed to 1ms.
This PR takes hz into account and fixed the CPU usage at 1% (this may
not be accurate in some cases because we do 100 step rehashing in
dictRehashMicroseconds but it can avoid CPU spikes in this case).
This PR also improves the description of the activerehashing
configuration item to explain this change.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The zcalloc symbol is a symbol name already used by zlib, which is
defining other names using the "z" prefix specific to zlib. In practice,
linking valkey with a static openssl, which itself might depend on a
static libz will result in link time error rejecting multiple symbol
definitions.
Fixes: #1157
Signed-off-by: Romain Geissler <romain.geissler@amadeus.com>
Similar to #860 but this is for HGETALL families (HGETALL/HKEYS/HVALS).
This patch moves `prepareClientToWrite` out of the loop to reduce the
function overhead.
Signed-off-by: Masahiro Ide <imasahiro9@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
this fixes: https://github.com/valkey-io/valkey/issues/1116
_Issue details from #1116 by @zuiderkwast_
> This config is undocumented since #758. The default was changed to
"yes" and it is quite useless to set it to "no". Yet, it can happen that
some user has an old config file where it is explicitly set to "no". The
result will be bad performace, since I/O threads will not do all the
I/O.
>
> It's indeed confusing.
>
> 1. Either remove the whole option from the code. And thus no need for
documentation. _OR:_
> 2. Introduce the option back in the configuration, just as a comment
is fine. And showing the default value "yes": `# io-threads-do-reads
yes` with additional text.
>
> _Originally posted by @melroy89 in [#1019 (reply in
thread)](https://github.com/orgs/valkey-io/discussions/1019#discussioncomment-10824778)_
---------
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
The module commands which were added to acl categories were getting
skipped when `ACL CAT category` command was executed.
This PR fixes the bug.
Before:
```
127.0.0.1:6379> ACL CAT foocategory
(empty array)
```
After:
```
127.0.0.1:6379> ACL CAT foocategory
aclcheck.module.command.test.add.new.aclcategories
```
---------
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Co-authored-by: Harkrishn Patro <bunty.hari@gmail.com>
A new option for diskless replication on the replica side.
After a network failure, the replica may need to perform a full sync.
The other option for diskless full sync is `swapdb`, but it uses twice
as much memory, temporarily. In situations where this is not acceptable,
and where losing data is acceptable, the `flush-before-load` can be
useful. If the full sync fails, the old data is lost though. Therefore,
the new option is marked as "dangerous".
---------
Signed-off-by: kronwerk <ca11e5e22g@gmail.com>
Signed-off-by: kronwerk <kronwerk@users.noreply.github.com>
Co-authored-by: kronwerk <ca11e5e22g@gmail.com>
Currently when module loading fails due to busy name, we
don't have a clean way to assist to troubleshooting.
Case 1: when loading the same module multiple times, we can
not detemine the cause of its failure without referring to
the module list or the earliest module load log. The log
may not exist and sometimes it is difficult for people
to associate module list.
Case 2: when multiple modules use the same module name,
we can not quickly associate the busy name without referring
to the module list and the earliest module load log.
Different people wrote modules with the same module name,
they don't easily associate module name.
So in this PR, when doing module onload, we will try to
print a busy name log if this happen. Currently we check
ctx.module since if it is NULL it means the Init call
failed, and Init currently only fails with busy name.
It's kind of ugly. It would have been nice if we could have had a
better way for onload to signal why the load failed.
Signed-off-by: Binbin <binloveplay1314@qq.com>
There is a lot of bad legacy usage of `default:` with enums, which is an
anti-pattern. If you omit the default, the compiler will tell you if a
new enum value was added and that it is missing from a switch statement.
Someone mentioned on another PR they used `default:` because of this
warning, so just removing it, but might create an issue to do a wider
cleanup.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: Masahiro Ide <masahiro.ide@lycorp.co.jp>
Signed-off-by: Masahiro Ide <imasahiro9@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Masahiro Ide <masahiro.ide@lycorp.co.jp>
I’ve prepared a minor fix for `processCommand()` function.
In `processCommand()`, the `obey_client` variable is created, but some
conditional statements call the `mustObeyClient()` function instead of
reusing `obey_client`.
I’ve modified these statements to `reuse obey_client`.
Since I’m relatively new to Redis, please let me know if there are any
reasons why the conditional statements need to call `mustObeyClient()`
again.
Thank you for taking the time to review my PR.
Signed-off-by: otheng03 <07c00h@gmail.com>
Improved documentation and readability of lua code as well as removed references to Redis.
---------
Signed-off-by: Parth Patel <661497+parthpatel@users.noreply.github.com>
This commit adds initialization code for the fields
`io_last_reply_block` and `io_last_bufpos` of the `client` struct.
While in the current code flow, these fields are only accessed after
being written in the `trySendWriteToIOThreads`, I discovered that they
were not being initialized while doing some changes to the code flow of
IO threads.
I believe it's good pratice to initialize all fields of a struct upon
creation, and will avoid future bugs which are usually hard to debug.
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
Fix the warning introduced in #688:
```
unit/test_rax.c:168:15: runtime error: left shift of 36625 by 16 places cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior unit/test_rax.c:168:15 in
Fuzz test in mode 1 [7504]:
```
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Applying the CVEs against mainline.
(CVE-2024-31449) Lua library commands may lead to stack overflow and
potential RCE.
(CVE-2024-31227) Potential Denial-of-service due to malformed ACL
selectors.
(CVE-2024-31228) Potential Denial-of-service due to unbounded pattern
matching.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Introduce a `size_t` field into the rax struct to track allocation size.
Update the allocation size on rax insert and deletes.
Return the allocation size when `raxAllocSize` is called.
This size tracking is now used in MEMORY USAGE and MEMORY STATS in place
of the previous method based on sampling.
The module API allows to create sorted dictionaries, which are backed by
rax. Users now also get precise memory allocation for them (through
`ValkeyModule_MallocSizeDict`).
Fixes#677.
For the release notes:
* MEMORY USAGE and MEMORY STATS are now exact for streams, rather than
based on sampling.
---------
Signed-off-by: Guillaume Koenig <knggk@amazon.com>
Signed-off-by: Guillaume Koenig <106696198+knggk@users.noreply.github.com>
Co-authored-by: Joey <yzhaon@amazon.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Currently, we create a Lua table without initial capacity even when the
capacity is known. As a result, we need to resize the Lua tables
repeatedly when converting RESP serialized object to Lua object and it
consumes extra cpu resources a bit when we need to transfer
RESP-serialized data to Lua world.
This patch try to remove this extra resize to reduce (re-)allocation
overhead.
| name | unstable bb57dfe6303 (rps) | this patch(rps) | improvements |
| --------------- | -------- | --------- | -------------- |
| evalsha - hgetall h1 | 60565.68 | 64487.01 | 6.47% |
| evalsha - hgetall h10 | 47023.41 | 50602.17 | 7.61% |
| evalsha - hgetall h25 | 33572.82 | 37345.48 | 11.23% |
| evalsha - hgetall h50 | 24206.63 | 25276.14 | 4.42% |
| evalsha - hgetall h100 | 15068.87 | 15656.8 | 3.90% |
| evalsha - hgetall h300 | 5948.56 | 6094.74 | 2.46% |
Signed-off-by: Masahiro Ide <masahiro.ide@lycorp.co.jp>
Co-authored-by: Masahiro Ide <masahiro.ide@lycorp.co.jp>
There is no ethernet style virtual device (like lo 127.0.0.1) for RDMA,
however a connection with the same local address and peer address are
considered as local.
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
moduleTimerHandler is aeTimeProc handler and event loop gets created
with this. However, found that the function return type is int but
actually returns "long long" value(i.e., next_period). and return value
being assigned to int variable in processTimeEvents(where time events
are processed), this might cause an overflow of the timer values. So
changed the return type of the function to long long. And also updated
other callback function return type to be consistent.
I found this when I was checking functions reported in
https://github.com/valkey-io/valkey/issues/1054 issue stacktrace. (FYI,
this is just to update the return type to be consistent and it will not
the fix for the issue reported)
Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
For fake clients like the ones used for Lua and modules, we don't
determine TLS in the right way, causing CLUSTER SLOTS from EVAL over TLS
to fail a debug-assert.
This error was introduced when the caching of CLUSTER SLOTS was
introduced, i.e. in 8.0.0.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The reason is VM_Call will use a fake client without connection,
so we also need to check if c->conn is NULL.
This also affects scripts. If they are called in the script, the
server will crash. Injecting commands into AOF will also cause
startup failure.
Fixes#1054.
Signed-off-by: Binbin <binloveplay1314@qq.com>
In CLUSTER FAILOVER FORCE case, we will set mf_can_start to
1 and wait for a cron to trigger the election. We can also set a
CLUSTER_TODO_HANDLE_MANUALFAILOVER flag so that we
can start the election as soon as possible instead of waiting for
the cron, so that we won't have a 100ms delay (clusterCron).
Signed-off-by: Binbin <binloveplay1314@qq.com>
This commit hopefully improves the formatting of the codebase by setting
ColumnLimit to 0 and hence stopping clang-format from trying to put as
much stuff in one line as possible.
This change enabled us to remove most of `clang-format off` directives
and fixed a bunch of lines that looked like this:
```c
#define KEY \
VALUE /* comment */
```
Additionally, one pair of `clang-format off` / `clang-format on` had
`clang-format off` as the second comment and hence didn't enable the
formatting for the rest of the file. This commit addresses this issue as
well.
Please tell me if anything in the changes seem off. If everything is
fine, I will add this commit to `.git-blame-ignore-revs` later.
---------
Signed-off-by: Mikhail Koviazin <mikhail.koviazin@aiven.io>
This commit changes the `tcmalloc.h` header location from the deprecated
location `google/` to `gperftools/`.
**Why we're doing this now?**
The location `google/tcmalloc.h` has been deprecated for more than 10
years in favor of `gperftools/tcmalloc.h`, and the deprecated location
will be removed in the next release of gperftools.
Fixes#1033
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
The one in CLUSTER SETSLOT help us keep track of state better,
of course it also can make the test case happy.
The one in gossip process fixes a problem that a replica can
print a log saying it is an empty primary.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Apparently this will fail to compile in some masOS version.
And internet claims _Thread_local is portable.
Fixes#1051.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Sometims it is hard to see the old primary during a
multi primaries failover, adding this log can help
use to find the old primary node.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>