futriix

Author	SHA1	Message	Date
Roman Gershman	b09db3ef78	Fix typo in streams seen-time / active-time test (#1409 ) This variable name is wrong, it causes the wrong variable to be asserted. Signed-off-by: Roman Gershman <romange@gmail.com>	2024-12-09 16:01:43 +08:00
Guillaume Koenig	e8078b7315	Allow MEMORY MALLOC-STATS and MEMORY PURGE during loading phase (#1317 ) - Enable investigation of memory issues during loading - Previously, all memory commands were rejected with LOADING error (except memory help) - `MEMORY MALLOC-STATS` and `MEMORY PURGE` are now allowed as they don't depend on the dataset - `MEMORY STATS` and `MEMORY USAGE KEY` remain disallowed Fixes #1299 Signed-off-by: Guillaume Koenig <knggk@amazon.com> Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Binbin <binloveplay1314@qq.com>	2024-12-08 20:30:07 +08:00
Binbin	176fafcaf7	Add a note to conf about the dangers of modifying dir at runtime (#887 ) We've had security issues in the past with it, which is why we marked it as PROTECTED. But, modifying during runtime is also a dangerous action. For example, when child processes are running, persistent temp files and log files may have unexpected effects. A scenario for modifying dir at runtime is to migrate a disk failure, such as using disk-based replication to migrate a node, writing nodes.conf to save the cluster configuration. We decided to leave it as is and add a note in the conf about the dangers of modifying dir at runtime. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-12-08 20:28:14 +08:00
Viktor Söderqvist	f20d629dbe	Fix sanitizer builds with clang (#1402 ) By including <stdatomic.h> after the other includes in the unit test, we can avoid redefining a macro which led to a build failure. Fixes #1394 --------- Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>	2024-12-07 10:26:31 +01:00
Viktor Söderqvist	a2fe6af457	Fix Module Update Args test when other modules are loaded (#1403 ) Fixes #1400 Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>	2024-12-07 10:25:40 +01:00
Caiyi Wu	6df376d68a	Fix coredump when use hellodict example module (#1395 ) In the ValkeyModule_OnLoad method of the file hellodict.c, the parameter keystep of ValkeyModule_CreateCommand should be 1. Otherwise, execute command will coredump. MODULE LOAD /home/tiger/valkey/src/modules/hellodict.so COMMAND GETKEYS HELLODICT.SET key value Signed-off-by: Codebells <1347103071@qq.com>	2024-12-05 20:01:38 +01:00
风去幽墨	6b3e1228cd	RDMA: Fix dead loop when transfer large data (20KB) (#1386 ) Determine the status of the Client when attempting to read data. If state=CLIENT_COMPLETED_IO, no read attempt is made and I/O operations on the Client are rescheduled by the main thread. > And 20474 Byte = PROTO_IOBUF_LEN(16KB) + SDS_HDR_VAR(16, s)(4090 Byte) Fixes #1385 --------- Signed-off-by: fengquyoumo <1455117463@qq.com>	2024-12-05 18:26:56 +01:00
Wen Hui	71560a2a4a	Add API UpdateRuntimeArgs for updating the module arguments during runtime (#1041 ) Before Redis OSS 7, if we load a module with some arguments during runtime, and run the command "config rewrite", the module information will not be saved into the config file. Since Redis OSS 7 and Valkey 7.2, if we load a module with some arguments during runtime, the module information (path, arguments number, and arguments value) can be saved into the config file after config rewrite command is called. Thus, the module will be loaded automatically when the server startup next time. Following is one example: bind 172.25.0.58 port 7000 protected-mode no enable-module-command yes Generated by CONFIG REWRITE latency-tracking-info-percentiles 50 99 99.9 dir "/home/ubuntu/valkey" save 3600 1 300 100 60 10000 user default on nopass sanitize-payload ~* &* +https://github.com/ALL loadmodule tests/modules/datatype.so 10 20 However, there is one problem. If developers write a module, and update the running arguments by someway, the updated arguments can not be saved into the config file even "config rewrite" is called. The reason comes from the following function rewriteConfigLoadmoduleOption (src/config.c) void rewriteConfigLoadmoduleOption(struct rewriteConfigState state) { .......... struct ValkeyModule module = dictGetVal(de); line = sdsnew("loadmodule "); line = sdscatsds(line, module->loadmod->path); for (int i = 0; i < module->loadmod->argc; i++) { line = sdscatlen(line, " ", 1); line = sdscatsds(line, module->loadmod->argv[i]->ptr); } rewriteConfigRewriteLine(state, "loadmodule", line, 1); ....... } The function only save the initial arguments information (module->loadmod) into the configfile. 
After discussion among core members (see https://github.com/valkey-io/valkey/issues/1177), we decided to add the following API to implement this feature: Original proposal: int VM_UpdateRunTimeArgs(ValkeyModuleCtx ctx, int index, char value); Updated proposal: ValkeyModuleString *values VM_GetRuntimeArgs(ValkeyModuleCtx ctx); *int VM_UpdateRuntimeArgs(ValkeyModuleCtx ctx, int argc, ValkeyModuleString **values); Why we do not recommend the following approach: MODULE UNLOAD, update module args in the conf file, MODULE LOAD. It has the following disadvantages: 1. Some modules cannot be unloaded, such as the example module tests/modules/datatype.so. 2. MODULE UNLOAD + MODULE LOAD is not an atomic operation. 3. Simply running MODULE UNLOAD can interrupt client traffic. --------- Signed-off-by: hwware <wen.hui.ware@gmail.com>	2024-12-05 11:58:24 -05:00
Madelyn Olson	a401e3789d	Update code of conduct maintainers email address (#1391 ) Updating code of conduct maintainer's email address Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>	2024-12-04 10:33:14 -08:00
zhenwei pi	105509cdad	Run RDMA builtin in CI workflow (#1380 ) Since 4695d118dd (#1209), RDMA supports builtin mode, and the module connection type may be removed in the future. So run builtin RDMA support in the CI workflow. The RDMA module is compiled only in CI; keep it as a build check only until the module connection type becomes obsolete. Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>	2024-12-03 23:09:56 +01:00
Jim Brunner	349bc7547b	defrag: use monotime in module interface (#1388 ) The recent PR (https://github.com/valkey-io/valkey/pull/1242) converted Active Defrag to use `monotime`. In that change, a conversion was performed to continue to use `ustime()` as part of the module interface. Since this time is only used internally, and never actually exposed to the module, we can convert this to use `monotime` directly. Signed-off-by: Jim Brunner <brunnerj@amazon.com>	2024-12-03 11:19:53 -08:00
uriyage	9f8b174c2e	Optimize IO thread offload for modified argv (#1360 ) ### Improve expired commands performance with IO threads #### Background In our IO threads architecture, IO threads allocate client argv's and later when we free it after processCommand we offload its free to the IO threads. With jemalloc, it's crucial that the same thread that allocates memory also frees it. For some commands we modify the client's argv in the main thread during command processing (for example in `SET EX` command we rewrite the command to use absolute time for replication propagation). #### Current issues 1. When commands are rewritten (e.g., expire commands), we store the original argv in `c->original_argv`. However, we're currently: - Freeing new argv (allocated by main thread) in IO threads - Freeing original argv (allocated by IO threads) in main thread 2. Currently, `c->original_argv` points to new array with old objects, while `c->argv` has old array with new objects, making memory free management complicated. #### Changes 1. Refactored argv modification handling code to ensure consistency - both array and objects are now either all new or all old 2. Moved original_argv cleanup to happen in resetClient after argv cleanup 3. Modified IO threads code to properly handle original argv cleanup when argv are modified. #### Performance Impact Benchmark with `SET EX` commands (650 clients, 512 byte value, 8 IO threads): - New implementation: 729,548 ops/sec - Old implementation: 633,243 ops/sec Representing a ~15% performance improvement due to more efficient memory handling. --------- Signed-off-by: Uri Yagelnik <uriy@amazon.com> Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>	2024-12-03 19:20:31 +02:00
Jim Brunner	397201c48f	Refactor of ActiveDefrag to reduce latencies (#1242 ) Refer to: https://github.com/valkey-io/valkey/issues/1141 This update refactors the defrag code to: * Make the overall code more readable and maintainable * Reduce latencies incurred during defrag processing With this update, the defrag cycle time is reduced to 500us, with more frequent cycles. This results in much more predictable latencies, with a dramatic reduction in tail latencies. (See https://github.com/valkey-io/valkey/issues/1141 for more complete details.) This update is focused mostly on the high-level processing, and does NOT address lower level functions which aren't currently timebound (e.g. `activeDefragSdsDict()`, and `moduleDefragGlobals()`). These are out of scope for this update and left for a future update. I fixed `kvstoreDictLUTDefrag` because it was using up to 7ms on a CME single shard. See original github issue for performance details. --------- Signed-off-by: Jim Brunner <brunnerj@amazon.com> Signed-off-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>	2024-12-03 08:42:29 -08:00
Nugine	3df609ef06	Optimize PFCOUNT, PFMERGE command by SIMD acceleration (#1293 ) This PR optimizes the performance of HyperLogLog commands (PFCOUNT, PFMERGE) by adding AVX2 fast paths. Two AVX2 functions are added for conversion between raw representation and dense representation. They are 15 ~ 30 times faster than the scalar implementation. Note that sparse representation is not accelerated. AVX2 fast paths are enabled when the CPU supports AVX2 (checked at runtime) and the hyperloglog configuration is default (HLL_REGISTERS == 16384 && HLL_BITS == 6). `PFDEBUG SIMD (ON|OFF)` subcommand is added for unit tests. A new TCL unit test checks that the results produced by non-AVX2 and AVX2 implementations are exactly equal. When merging 3 dense hll structures, the benchmark shows a 12x speedup compared to the scalar version. ``` pfcount key1 key2 key3 pfmerge keyall key1 key2 key3 ``` ``` ====================================================================================================== Type Ops/sec Avg. 
Latency p50 Latency p99 Latency p99.9 Latency KB/sec ------------------------------------------------------------------------------------------------------ PFCOUNT-scalar 5665.56 35.29839 32.25500 63.99900 67.58300 608.60 PFCOUNT-avx2 72377.83 2.75834 2.67100 5.34300 6.81500 7774.96 ------------------------------------------------------------------------------------------------------ PFMERGE-scalar 9851.29 20.28806 20.09500 36.86300 39.16700 615.71 PFMERGE-avx2 125621.89 1.59126 1.55100 3.11900 4.70300 15702.74 ------------------------------------------------------------------------------------------------------ scalar: valkey:unstable 2df56d87c0ebe802f38e8922bb2ea1e4ca9cfa76 avx2: Nugine:hll-simd 8f9adc34021080d96e60bd0abe06b043f3ed0275 CPU: 13th Gen Intel® Core™ i9-13900H × 20 Memory: 32.0 GiB OS: Ubuntu 22.04.5 LTS ``` Experiment repo: https://github.com/Nugine/redis-hyperloglog Benchmark script: https://github.com/Nugine/redis-hyperloglog/blob/main/scripts/memtier.sh Algorithm: https://github.com/Nugine/redis-hyperloglog/blob/main/cpp/bench.cpp --------- Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>	2024-12-02 19:40:38 +01:00
Binbin	fbbfe5d3d3	Print logs when the cluster state changes to fail or the fail reason changes (#1188 ) This log allows us to easily distinguish between full coverage and minority partition when the cluster fails. Sometimes it is not easy to see the minority partition in a healthy shard (both primary and replicas). We decided not to add a cluster_fail_reason field to cluster info: there are only two reasons and both are well known, and if we end up adding more down the road we can add the field in the future. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-12-02 15:55:24 +08:00
Vadym Khoptynets	90475af594	Free strings during BGSAVE/BGAOFRW to reduce copy-on-write (#905 ) Motivation Copy-on-write (COW) amplification refers to the issue where writing to a small object leads to the entire page being cloned, resulting in inefficient memory usage. This issue arises during the BGSAVE process, which can be particularly problematic on instances with limited memory. If the BGSAVE process could release unneeded memory, it could reduce memory consumption. To address this, the BGSAVE process calls the `madvise` function to signal the operating system to reclaim the buffer. However, this approach does not work for buffers smaller than a page (usually 4KiB). Even after multiple such calls, where a full page may be free, the operating system will not reclaim it. To solve this issue, we can call `zfree` directly. This allows the allocator (jemalloc) to handle the bookkeeping and release pages when buffers are no longer needed. This approach reduces copy-on-write events. Benchmarks To understand how usage of `zfree` affects BGSAVE and the memory consumption I ran 45 benchmarks that compare my clone with the vanilla version. The benchmark has the following steps: 1. Start a new Valkey process 2. Fill the DB with data sequentially 3. Run a warmup to randomize the memory layout 4. Introduce fragmentation by deleting part of the keys 5. In parallel: 1. Trigger BGSAVE 2. Start 80/20 get/set load I varied the following parameters to understand their influence: 1. Number of keys: 3M, 6M, and 12M. 2. Data size. While the keys themselves are of fixed length ~30 bytes, the value size is 120, 250, 500, 1000, and 2000 bytes. 3. Fragmentation. I delete 5%, 10%, and 15% of the original key range. I'm attaching a graph of BGSAVE process memory consumption. Instead of all benchmarks, I show the most representative runs IMO. 
<img width="1570" alt="3m-fixed" src="https://github.com/user-attachments/assets/3dbbc528-01c1-4821-a3c2-6be455e7f78a"> For 2000 bytes values peak memory usage is ~53% compared to vanilla. The peak happens at 57% BGSAVE progress. For 500 bytes values the peak is ~80% compared to vanilla. And happens at ~80% progress. For 120 bytes the difference is under 5%, and the patched version could even use more memory. ![500b-fixed](https://github.com/user-attachments/assets/b09451d3-4bce-4f33-b3db-2b5df2178ed2) For 12M keys, the peak is ~85% of the vanilla’s. Happens at ~70% mark. For 6M keys, the peak is ~87% of the vanilla’s. Happens at ~77% mark. For 3M keys, the peak is ~87% of the vanilla’s. Happens at ~80% mark. Changes The PR contains 2 changes: 1. Static buffer for RDB compression. RDB compression leads to COW events even without any write load if we use `zfree`. It happens because the compression function allocates a new buffer for each object. Together with freeing objects with `zfree` it leads to reusing of the memory shared with the main process. To deal with this problem, we use a pre-allocated constant 8K buffer for compression. If the object size is too big for this buffer, then we fall back to the ad hoc allocation behavior. 2. Freeing string objects instead of dismissing them. A call to `zfree` is more expensive than a direct call to `madvise`. But with #453 strings use the fast path – `zfree_with_size`. As a possible next step we can optimize `zfree` for other data types as well. --------- Signed-off-by: Vadym Khoptynets <vadymkh@amazon.com> Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>	2024-12-01 17:12:27 +02:00
Amit Nagler	7043ef0bbb	Split dual-channel COB overrun tests to separate servers (#1374 ) 1. The test isn't waiting long enough for the output buffer to overrun. This problem is happening because an error from the previous test is bleeding into the current test's logs. The simplest fix would be to split these tests. 2. Increased replication timeout to ensure sync fails due to output buffer overrun before a timeout occurs. Fixes #1367 Signed-off-by: naglera <anagler123@gmail.com>	2024-12-01 21:33:43 +08:00
Binbin	9c48f56790	Reset repl_down_since to zero only on state change (#1149 ) We should reset repl_down_since only on state change, in the current code, if the rdb channel in the dual channel is normal, that is, rdb is loaded normally, but the psync channel is abnormal, we will set repl_down_since 0 here. If the primary is down at this time, the replica may be abnormal when calculating data_age in cluster failover, since repl_state != REPL_STATE_CONNECTED, this causes the replica to be unable to initiate an election due to the old data_age. In dualChannelSyncHandleRdbLoadCompletion, if the psync channel is not established, the function will return. We will set repl_state to REPL_STATE_CONNECTED and set repl_down_since to 0 in dualChannelSyncSuccess, that is, in establishPrimaryConnection. See also 677d10b2a8ff7f13033ccfe56ffcd246dbe70fb6 for more details. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-12-01 21:33:21 +08:00
Stav Ben-Tov	c8ceb2ee25	Use zfree_with_size for client buffer (#1376 ) Replace occurrences of 'zfree' with 'zfree_with_size' to improve performance. The 'zfree_with_size' function avoids calling 'zmalloc_size' to retrieve the buffer size and instead reuses the previously calculated size. This results in faster memory deallocation and reduces overhead. Signed-off-by: stav bentov <stavbt@amazon.com> Co-authored-by: stav bentov <stavbt@amazon.com>	2024-12-01 12:24:18 +01:00
zhenwei pi	4695d118dd	RDMA builtin support (#1209 ) There are several patches in this PR: * Abstract set/rewrite config bind option: `bind` option is a special config, `socket` and `tls` are using the same one. However RDMA uses the similar style but different one. Use a bit abstract work to make it flexible for both `socket` and `RDMA`. (Even for QUIC in the future.) * Introduce closeListener for connection type: closing socket by a simple syscall would be fine, RDMA has complex logic. Introduce connection type specific close listener method. * RDMA: Use valkey.conf style instead of module parameters: use `--rdma-bind` and `--rdma-port` style instead of module parameters. The module style config `rdma.bind` and `rdma.port` are removed. * RDMA: Support builtin: support `make BUILD_RDMA=yes`. module style is still kept for now. Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>	2024-11-29 11:13:34 +01:00
zvi-code	fd58f8d058	Disable lazy free in defrag test to fix 32bit daily failure (#1370 ) Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com> Co-authored-by: Zvi Schneider <zvi.schneider22@gmail.com>	2024-11-28 16:27:00 +01:00
Binbin	a939cb88ee	Handle keyIsExpiredWithDictIndex to make it check for import mode (#1368 ) In #1326 we make KEYS can visit expired key in import-source state by updating keyIsExpired to check for import mode. But after #1205, we now use keyIsExpiredWithDictIndex to optimize and remove the redundant dict_index, and keyIsExpiredWithDictIndex does not handle this logic. In this commit, we handle keyIsExpiredWithDictIndex to make it check for import mode as well so that KEYS can visit the expired key. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-11-28 14:10:48 +08:00
Binbin	db7b7396ff	Make KEYS can visit expired key in import-source state (#1326 ) After #1185, a client in import-source state can visit expired key both in read commands and write commands, this commit handle keyIsExpired function to handle import-source state as well, so KEYS can visit the expired key. This is not particularly important, but it ensures the definition, also doing some cleanup around the test, verified that the client can indeed visit the expired key. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-11-28 00:16:55 +08:00
Binbin	5d08149e72	Use fake client flag to replace not conn check (#1198 ) The fake client flag was introduced in #1063, we want this to replace all !conn fake client checks. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-11-27 18:02:07 +08:00
ranshid	66ae8b7135	change the container image to ubuntu:plucky (#1359 ) Our fortify workflow is running on an ubuntu lunar container that has been EOL since [January 25, 2024](https://lists.ubuntu.com/archives/ubuntu-announce/2024-January/000298.html). This causes the workflow to fail during update actions like: ``` apt-get update && apt-get install -y make gcc-13 update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-13 100 make all-with-unit-tests CC=gcc OPT=-O3 SERVER_CFLAGS='-Werror -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=3' shell: sh -e {0} Ign:1 http://security.ubuntu.com/ubuntu lunar-security InRelease Err:2 http://security.ubuntu.com/ubuntu lunar-security Release 404 Not Found [IP: 91.189.91.82 80] Ign:3 http://archive.ubuntu.com/ubuntu lunar InRelease Ign:4 http://archive.ubuntu.com/ubuntu lunar-updates InRelease Ign:5 http://archive.ubuntu.com/ubuntu lunar-backports InRelease Err:6 http://archive.ubuntu.com/ubuntu lunar Release 404 Not Found [IP: 185.125.190.81 80] Err:7 http://archive.ubuntu.com/ubuntu lunar-updates Release 404 Not Found [IP: 185.125.190.81 80] Err:8 http://archive.ubuntu.com/ubuntu lunar-backports Release 404 Not Found [IP: 185.125.190.81 80] Reading package lists... E: The repository 'http://security.ubuntu.com/ubuntu lunar-security Release' does not have a Release file. E: The repository 'http://archive.ubuntu.com/ubuntu lunar Release' does not have a Release file. E: The repository 'http://archive.ubuntu.com/ubuntu lunar-updates Release' does not have a Release file. E: The repository 'http://archive.ubuntu.com/ubuntu lunar-backports Release' does not have a Release file. 
update-alternatives: error: alternative path /usr/bin/gcc-13 doesn't exist Error: Process completed with exit code 2. ``` example: https://github.com/valkey-io/valkey/actions/runs/12021130026/job/33547460209 This PR uses the latest stable ubuntu image release [plucky](https://hub.docker.com/layers/library/ubuntu/plucky/images/sha256-dc4565c7636f006c26d54c988faae576465e825ea349fef6fd3af6bf5100e8b6?context=explore) Signed-off-by: Ran Shidlansik <ranshid@amazon.com>	2024-11-27 07:34:02 +02:00
Amit Nagler	9305b49145	Add tag for dual-channel logs (#999 ) This PR introduces a consistent tagging system for dual-channel logs. The goal is to improve log readability and filterability, making it easier for operators to manage and analyze log entries. Resolves https://github.com/valkey-io/valkey/issues/986 --------- Signed-off-by: naglera <anagler123@gmail.com>	2024-11-26 16:51:52 +02:00
Binbin	469d41fb37	Avoid double close on repl_transfer_fd (#1349 ) The code is ok before 2de544cfcc6d1aa7cf6d0c75a6116f7fc27b6fd6, but now we will set server.repl_transfer_fd right after dfd was initiated, and in here we have a double close error since dfd and server.repl_transfer_fd are the same fd. Also move the declaration of dfd/maxtries to a small scope to avoid the confusion since they are only used in this code. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-11-26 00:00:47 +08:00
Binbin	2d48a39c27	Save open's errno when opening temp rdb fails to prevent it from being modified (#1347 ) Apparently on Mac, sleep will modify errno to ETIMEDOUT, and then it prints the misleading message: Operation timed out. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-11-25 23:56:51 +08:00
Ray Cao	cf1a1e0931	Optimize sdscatrepr by batch processing printable characters (#1342 ) Optimize sdscatrepr by reducing realloc calls, furthermore, we can reduce memcpy calls by batch processing of consecutive printable characters. Signed-off-by: Ray Cao <zisong.cw@alibaba.com> Co-authored-by: Ray Cao <zisong.cw@alibaba.com>	2024-11-25 07:16:46 -08:00
Parth	c4920bca4a	Integrating fast_float to optionally replace strtod (#1260 ) Fast_float is a C++ header-only library to parse doubles using SIMD instructions. The purpose is to speed up sorted sets and other commands that use doubles. A single-file copy of fast_float is included in this repo. This introduces an optional dependency on a C++ compiler. The use of fast_float is enabled at compile time using the make variable `USE_FAST_FLOAT=yes`. It is disabled by default. Fixes #1069. --------- Signed-off-by: Parth Patel <661497+parthpatel@users.noreply.github.com> Signed-off-by: Parth <661497+parthpatel@users.noreply.github.com> Signed-off-by: Madelyn Olson <madelyneolson@gmail.com> Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Roshan Swain <swainroshan001@gmail.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>	2024-11-25 10:01:43 +01:00
Binbin	653d5f7fe3	Support empty callback on function and free temp function in async way (#1334 ) We have a replicationEmptyDbCallback, a callback used by emptyData while flushing away old data. Previously, we did not apply this callback logic to functions. In case of abuse there may be a lot of functions, and to keep the code consistent we add the same callback logic for functions. Changes around this commit: 1. Extend emptyData / functionsLibCtxClear to support passing a callback when flushing functions. 2. Add disklessLoad helper functions to create and discard temp functions (just like disklessLoadInitTempDb and disklessLoadDiscardTempDb); we will always flush the temp functions in an async way to avoid any blocking. 3. Clean up around discardTempDb: remove the callback pointer, since in the async way we don't need the callback. 4. Remove the functionsLibCtxClear call in readSyncBulkPayload, because we called emptyData in the previous lines, which also empties functions. We use this callback in replication because, during the flush, the replica may block for a while if the flush is done synchronously. To prevent the primary from detecting the replica as timed out, the replica uses this callback to notify the primary (we also use this callback when loading an RDB). In the async way, we empty the data in the bio and there is no slow operation, so the callback is ignored. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-11-25 09:59:37 +08:00
eifrah-aws	33f42d7fb5	CMake fixes + README update (#1276 )	2024-11-22 12:17:53 -08:00
Binbin	9851006d6d	Add short client info log to CLUSTER MEET / FORGET / RESET commands (#1249 ) These commands are all administrator commands. If they are operated incorrectly, serious consequences may occur. Print the full client info by using catClientInfoString, the info is useful when we want to identify the source of request. Since the origin client info is very large and might complicate the output, we added a catClientInfoShortString function, it will only print some basic fields, we want these fields that are useful to identify the client. These fields are: - id - addr - laddr - connection info - name - user - lib-name - lib-ver And also used it to replace the origin client info where it has the same purpose. Some logging is changed from full client info to short client info: - CLUSTER FAILOVER - FAILOVER / PSYNC - REPLICAOF NO ONE - SHUTDOWN Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-11-23 00:23:38 +08:00
Binbin	b9d224097a	Brocast a PONG to all node in cluster when role changed (#1295 ) When a node's role changes, we should broadcast the change to notify other nodes. For example, with one primary and one replica, after a failover the replica becomes the new primary and the primary becomes the new replica. If we then trigger a second cluster failover for the new replica, the new replica will send a MFSTART to its primary, i.e. the new primary. But the new primary may reject the MFSTART due to this logic: ``` } else if (type == CLUSTERMSG_TYPE_MFSTART) { if (!sender || sender->replicaof != myself) return 1; ``` In the new primary's view, sender is still a primary and sender->replicaof is NULL, so we will return. Then the manual failover times out. Another possibility is that other primaries refuse to vote after receiving the FAILOVER_AUTH_REQUEST, since in their view sender is still a primary, and then the manual failover times out. ``` void clusterSendFailoverAuthIfNeeded(clusterNode *node, clusterMsg *request) { ... if (clusterNodeIsPrimary(node)) { serverLog(LL_WARNING, "Failover auth denied to... ``` The reason is that, currently, we only update the node->replicaof information when we receive a PING/PONG from the sender. For details, see clusterProcessPacket. Therefore, in some scenarios, such as clusters with many nodes and a large cluster-ping-interval (that is, cluster-node-timeout), the role change of the node will be very delayed. Added a DEBUG DISABLE-CLUSTER-RANDOM-PING command to disable the cluster ping that is sent to a random node every second (see clusterCron). Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-11-23 00:22:04 +08:00
Binbin	979f4c1ceb	Add cmake-build-debug and cmake-build-release to gitignore (#1340 ) Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-11-22 16:49:16 +08:00
Alan Scherger	377ed22c97	[feat] add Ubuntu 24.04 Noble package support (#971 ) add Ubuntu 24.04 Noble package support Signed-off-by: Alan Scherger <alan.scherger@gmail.com>	2024-11-21 19:26:30 -08:00
Yury-Fridlyand	109d2dadc0	Add slack link for users (#1273 ) Add slack link for users --------- Signed-off-by: Yury-Fridlyand <yury.fridlyand@improving.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>	2024-11-21 19:19:10 -08:00
Nadav Levanoni	18d1eb5a85	Remove redundant dict_index calculations (#1205 ) We need to start making use of the new `WithDictIndex` APIs which allow us to reuse the dict_index calculation (avoid over-calling `getKeySlot` for no good reason). In this PR I optimized `lookupKey` so it now calls `getKeySlot` to reuse the dict_index two additional times. It also optimizes the keys command to avoid unnecessary computation of the slot id. --------- Signed-off-by: Nadav Levanoni <nadavl@amazon.com> Co-authored-by: Nadav Levanoni <nadavl@amazon.com>	2024-11-21 19:14:28 -08:00
Sinkevich Artem	43b5026162	Fix argument types of formatting functions (#1253 ) `cluster_legacy.c`: `slot_info_pairs` has `uint16_t` values, but they were cast to `unsigned long` and `%i` was used. `valkey-cli.c`: `node->replicas_count` is `int`, not `unsigned long`. Signed-off-by: ArtSin <artsin666@gmail.com>	2024-11-21 18:58:15 -08:00
Binbin	50aae13b0a	Skip reclaim file page cache test in valgrind (#1327 ) The test is incompatible with valgrind. Added a new `--valgrind` argument to test suite, which will cause that test to be skipped. We skipped it in the past, see 5b61b0dc6d2579ee484fa6cf29bfac59513f84ab Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-11-22 10:29:24 +08:00
Binbin	c4be326c32	Make manual failover reset the on-going election to promote failover (#1274 ) If a manual failover got timed out, like the election don't get the enough votes, since we have a auth_timeout and a auth_retry_time, a new manual failover will not be able to proceed on the replica side. Like if we initiate a new manual failover after a election timed out, we will pause the primary, but on the replica side, due to retry_time, replica does not trigger the new election and the manual failover will eventually time out. In this case, if we initiate manual failover again and there is an ongoing election, we will reset it so that the replica can initiate a new election at the manual failover's request. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-11-22 10:28:59 +08:00
zvi-code	b56eed2479	Remove valkey specific changes in jemalloc source code (#1266 ) ### Summary of the change This is a base PR for refactoring defrag. It moves the defrag logic to rely on the jemalloc [native api](https://github.com/jemalloc/jemalloc/pull/1463#issuecomment-479706489) instead of relying on custom code changes made by valkey in the jemalloc ([je_defrag_hint](`9f8185f5c8/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h (L382)`)) library. This enables valkey to use the latest vanilla jemalloc without the need to maintain code changes across jemalloc versions. This change requires some modifications because the new api provides only the information, not a yes\no defrag decision. The decision logic needs to be implemented in valkey code. Additionally, the api does not provide, within a single call, all the information needed to make a decision; this information is available through an additional api call. To reduce the calls to jemalloc, in this PR the required information is collected during `computeDefragCycles` and not for every single ptr; this way we avoid the additional api call. Followup work will utilize the new options that are now open and will further improve the defrag decision and process. ### Added files: `allocator_defrag.c` / `allocator_defrag.h` - These files implement the allocator specific knowledge for making the defrag decision. The knowledge about slabs, allocation logic and so on all goes into these files. This improves the separation between jemalloc specific code and other possible implementations. ### Moved functions: [`zmalloc_no_tcache` , `zfree_no_tcache` ](`4593dc2f05/src/zmalloc.c (L215)`) - these are very jemalloc specific logic assumptions, and are very specific to how we defrag with jemalloc. 
This also anticipates that, from a performance perspective, we should consider using tcache; we only need to make sure we don't recycle entries without going through the arena [for example: we can use private tcache, one for free and one for alloc]. `frag_smallbins_bytes` - the logic and implementation moved to the new file ### Existing API: * [once a second + when completed full cycle] [`computeDefragCycles`](`4593dc2f05/src/defrag.c (L916)`) * `zmalloc_get_allocator_info` : gets from jemalloc _allocated, active, resident, retained, muzzy_, `frag_smallbins_bytes` * [`frag_smallbins_bytes`](`4593dc2f05/src/zmalloc.c (L690)`) : for each bin; gets from jemalloc bin_info, `curr_regs`, `cur_slabs` * [during defrag, for each pointer] * `je_defrag_hint` takes a memory pointer and returns {0,1} . [Internally it uses](`4593dc2f05/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h (L368)`) these data points: * #`nonfull_slabs` * #`total_slabs` * #free regs in the ptr slab ## Jemalloc API (via ctl interface) [BATCH][`experimental_utilization_batch_query_ctl`](`4593dc2f05/deps/jemalloc/src/ctl.c (L4114)`) : gets an array of pointers, returns for each pointer 3 values, * number of free regions in the extent * number of regions in the extent * size of the extent in terms of bytes [EXTENDED][`experimental_utilization_query_ctl`](`4593dc2f05/deps/jemalloc/src/ctl.c (L3989)`) : * memory address of the extent a potential reallocation would go into * number of free regions in the extent * number of regions in the extent * size of the extent in terms of bytes * [stats-enabled]total number of free regions in the bin the extent belongs to * [stats-enabled]total number of regions in the bin the extent belongs to ### `experimental_utilization_batch_query_ctl` vs valkey `je_defrag_hint`? 
[good] - We can query pointers in a batch, reducing the overall overhead - The per ptr decision algorithm is not within the jemalloc api; jemalloc only provides information, so valkey can tune\configure\optimize easily [bad] - In the batch API we only know the utilization of the slab (of that memory ptr), we don’t get the data about #`nonfull_slabs` and total allocated regs. ## New functions: 1. `defrag_jemalloc_init`: Reducing the cost of calls to je_ctl: use the [MIB interface](https://jemalloc.net/jemalloc.3.html) to make faster calls. See this quote from the jemalloc documentation: The mallctlnametomib() function provides a way to avoid repeated name lookups for applications that repeatedly query the same portion of the namespace, by translating a name to a “Management Information Base” (MIB) that can be passed repeatedly to mallctlbymib(). 6. `jemalloc_sz2binind_lgq` : this api supports a reverse map between a bin size and its info without lookup. This mapping depends on the number of size classes we have, which are derived from [`lg_quantum`](`4593dc2f05/deps/Makefile (L115)`) 7. `defrag_jemalloc_get_frag_smallbins` : This function replaces `frag_smallbins_bytes`; the logic moved to the new file allocator_defrag `defrag_jemalloc_should_defrag_multi` → `handle_results` - unpacks the results 8. `should_defrag` : implements the same logic as the existing implementation [inside](`9f8185f5c8/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h (L382)`) je_defrag_hint 9. `defrag_jemalloc_should_defrag_multi` : implements the hint for an array of pointers, utilizing the new batch api. Currently only 1 pointer is passed. ### Logical differences: In order to get the information about #`nonfull_slabs` and #`regs`, we use the query cycle to collect the information per size class. In order to find the index of the bin information given the bin size, in O(1), we use `jemalloc_sz2binind_lgq` . ## Testing This is the first draft. 
I did some initial testing that basically creates fragmentation by reducing max memory and then waiting for defrag to reach the desired level. The test only serves as a sanity check that defrag eventually succeeds; no data is provided here regarding efficiency and performance. ### Test: 1. disable `activedefrag` 2. run valkey benchmark on overlapping address ranges with different block sizes 3. wait until `used_memory` reaches 10GB 4. set `maxmemory` to 5GB and `maxmemory-policy` to `allkeys-lru` 5. stop load 6. wait for `mem_fragmentation_ratio` to reach 2 7. enable `activedefrag` - start test timer 8. wait until `mem_fragmentation_ratio` reaches 1.1 #### Results: (With this PR)Test results: `56 sec` (Without this PR)Test results: `67 sec` both runs perform the same "work": number of buffers moved to reach the fragmentation target Next benchmarking is to compare to: - DONE // existing `je_get_defrag_hint` - compare with naive defrag all: `int defrag_hint() {return 1;}` --------- Signed-off-by: Zvi Schneider <ezvisch@amazon.com> Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com> Signed-off-by: zvi-code <54795925+zvi-code@users.noreply.github.com> Co-authored-by: Zvi Schneider <ezvisch@amazon.com> Co-authored-by: Zvi Schneider <zvi.schneider22@gmail.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>	2024-11-21 16:29:21 -08:00
xbasel	b486a41500	Preserve original fd blocking state in TLS I/O operations (#1298 ) This change prevents unintended side effects on connection state and improves consistency with non-TLS sync operations. For example, when invoking `connTLSSyncRead` with a blocking file descriptor, the mode is switched to non-blocking upon `connTLSSyncRead` exit. If the code assumes the file descriptor remains blocking and calls the normal `read` expecting it to block, it may result in a short read. This caused a crash in dual-channel, which was fixed in this PR by relocating `connBlock()`: https://github.com/valkey-io/valkey/pull/837 Signed-off-by: xbasel <103044017+xbasel@users.noreply.github.com>	2024-11-21 18:22:16 +02:00
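The idea behind #1298 (leave the descriptor's blocking mode as the caller found it) can be sketched with plain POSIX `fcntl`. This is an illustration under stated assumptions, not the code in valkey's TLS layer: the function name `read_nonblocking_preserving_flags` is hypothetical, and a single `read` stands in for the TLS sync read.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not valkey's connTLSSyncRead): performs one read()
 * with O_NONBLOCK forced on, then restores the flags the fd had on entry
 * so callers that expect a blocking fd are not surprised afterwards. */
ssize_t read_nonblocking_preserving_flags(int fd, void *buf, size_t len) {
    int saved = fcntl(fd, F_GETFL);
    if (saved == -1) return -1;

    int was_blocking = !(saved & O_NONBLOCK);
    if (was_blocking && fcntl(fd, F_SETFL, saved | O_NONBLOCK) == -1) return -1;

    ssize_t n = read(fd, buf, len);
    int saved_errno = errno; /* keep read()'s errno across the restore call */

    if (was_blocking) fcntl(fd, F_SETFL, saved);
    errno = saved_errno;
    return n;
}
```

Saving and restoring the full flag word, rather than unconditionally clearing `O_NONBLOCK` on exit, also keeps a descriptor that was already non-blocking in that state.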
Binbin	6038eda010	Make FUNCTION RESTORE FLUSH flush async based on lazyfree-lazy-user-flush (#1254 ) FUNCTION RESTORE has a FLUSH option, which deletes all the existing libraries before restoring the payload. If for some reason there are a lot of libraries, this blocks for a while. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-11-21 21:02:05 +08:00
Binbin	f553ccbda6	Use goto to cleanup error handling in readSyncBulkPayload (#1332 ) The code under the goto error label is the same as each error return, so use goto to reduce the repetition. ``` error: cancelReplicationHandshake(1); return; ``` This also keeps the log output contiguous on error, that is, we print the error log first, and then print the reconnecting log last (in cancelReplicationHandshake). Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-11-21 20:01:30 +08:00
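The single-exit cleanup pattern described in #1332 might look like the following minimal sketch. It is not the body of `readSyncBulkPayload`: `process_payload`, `cancel_handshake` and the `cleanup_calls` counter are hypothetical stand-ins chosen so the example is self-contained.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for cancelReplicationHandshake(): it only counts
 * how many times the shared cleanup path ran. */
static int cleanup_calls = 0;
static void cancel_handshake(void) { cleanup_calls++; }

/* Hypothetical function: returns 0 on success, -1 on failure. Each failure
 * logs its own message first, then jumps to the one shared error label. */
int process_payload(const char *payload, size_t len) {
    char *copy = NULL;

    if (payload == NULL) {
        fprintf(stderr, "payload is missing\n");
        goto error;
    }
    if (len == 0) {
        fprintf(stderr, "payload is empty\n");
        goto error;
    }
    copy = malloc(len);
    if (copy == NULL) {
        fprintf(stderr, "out of memory\n");
        goto error;
    }
    memcpy(copy, payload, len);
    free(copy);
    return 0;

error:
    free(copy); /* free(NULL) is a no-op, so this is safe on every path */
    cancel_handshake();
    return -1;
}
```

Because every failure path logs before jumping, the specific error message always precedes whatever the shared cleanup prints.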
Yanqi Lv	4986310945	Import-mode: Avoid expiration and eviction during data syncing (#1185 ) New config: `import-mode (yes\|no)` New command: `CLIENT IMPORT-SOURCE (ON\|OFF)` The config, when set to `yes`, disables eviction and deletion of expired keys, except for commands coming from a client which has marked itself as an import-source, the data source when importing data from another node, using the CLIENT IMPORT-SOURCE command. When we sync data from the source Valkey to the destination Valkey using a sync tool like [redis-shake](https://github.com/tair-opensource/RedisShake), the destination Valkey can perform expiration and eviction, which may cause data corruption. This problem has been discussed in https://github.com/redis/redis/discussions/9760#discussioncomment-1681041 and Redis already has a solution, but Valkey has not fixed it yet. E.g. we call `set key 1 ex 1` on the source server and transfer this command to the destination server. Then we call `incr key` on the source server before the key expires, so we have a key on the source server with a value of 2. But when the command arrives at the destination server, the key may already have expired and been deleted. So we end up with a key on the destination server with a value of 1, which is inconsistent with the source server. In standalone mode, we can use a writable replica to simplify the sync process. However, in cluster mode, we still need a sync tool to help us transfer the source data to the destination. The sync tool usually works as a normal client and the destination works as a primary which keeps performing expiration and eviction. In this PR, we add a new mode named 'import-mode'. In this mode, the server stops expiration and eviction just like a replica. Note that this mode is meant to be used only while syncing, to avoid data inconsistency caused by expiration and eviction. Import mode only takes effect on the primary. 
Sync tools can mark their clients as an import source with `CLIENT IMPORT-SOURCE`; such clients work like a client from the primary and can access expired keys in `lookupkey`. Notice: during the migration, other clients, apart from the import source, should not access the data imported by the import source. --------- Signed-off-by: lvyanqi.lyq <lvyanqi.lyq@alibaba-inc.com> Signed-off-by: Yanqi Lv <lvyanqi.lyq@alibaba-inc.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>	2024-11-19 21:53:19 +01:00
Binbin	ee386c92ff	Manual failover vote is not limited by two times the node timeout (#1305 ) This limit should not restrict manual failover, otherwise in some scenarios, manual failover will time out. For example, if some FAILOVER_AUTH_REQUESTs or some FAILOVER_AUTH_ACKs are lost during a manual failover, the node cannot vote in the second manual failover. Or in a mixed scenario of plain failover and manual failover, it cannot vote for the subsequent manual failover. The problem with retrying a manual failover is that the mf pauses clients for 5s on the primary side, so retrying every time a manual failover times out is a bad move. --------- Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>	2024-11-19 11:17:20 -05:00
Binbin	132798b57d	Receipt of REPLCONF VERSION reply should be triggered by event (#1320 ) This adds the missing return when repl_state changes to RECEIVE_VERSION_REPLY, so that we won’t be blocked if the primary doesn’t reply to REPLCONF VERSION. In practice I guess this is not likely to block in this context, since small responses are likely to be received in one packet, so this is just a cleanup (consistent with the previous state machine processing). Also update the state machine diagram to mention the VERSION reply. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-11-19 23:42:50 +08:00
Seungmin Lee	3d0c834203	Fix LRU crash when getting too many random lua scripts (#1310 ) ### Problem Valkey stores scripts in a dictionary (lua_scripts) keyed by their SHA1 hashes, but it needs a way to know which scripts are least recently used. It uses an LRU list (lua_scripts_lru_list) to keep track of scripts in usage order. When the list reaches a maximum length, Valkey evicts the oldest scripts to free memory in both the list and dictionary. The problem here is that the sds in the LRU list can point to memory that the sds in the dictionary used to point to, but that active defrag has since freed or moved. This results in an assertion error at [this line](https://github.com/valkey-io/valkey/blob/unstable/src/eval.c#L519) ### Solution If we duplicate the sds when adding it to the LRU list, we create an independent copy of the script identifier (sha). This duplication ensures that the sha string in the LRU list remains stable and unaffected by any defragmentation that could alter or free the original sds. In addition, dictUnlink doesn't require an exact pointer match ([ref](https://github.com/valkey-io/valkey/blob/unstable/src/eval.c#L71-L78)), so the copy of the sds can still be used to unlink the right dictEntry. ### Reproduce To reproduce it with tcl test: 1. Disable je_get_defrag_hint in defrag.c to trigger defrag often 2. 
Execute test script ``` start_server {tags {"auth external:skip"}} { test {Regression for script LRU crash} { r config set activedefrag yes r config set active-defrag-ignore-bytes 1 r config set active-defrag-threshold-lower 0 r config set active-defrag-threshold-upper 1 r config set active-defrag-cycle-min 99 r config set active-defrag-cycle-max 99 for {set i 0} {$i < 100000} {incr i} { r eval "return $i" 0 } after 5000; } } ``` ### Crash info Crash report: ``` === REDIS BUG REPORT START: Cut & paste starting from here === 14044:M 12 Nov 2024 14:51:27.054 # === ASSERTION FAILED === 14044:M 12 Nov 2024 14:51:27.054 # ==> eval.c:556 'de' is not true ------ STACK TRACE ------ Backtrace: /usr/bin/redis-server 127.0.0.1:6379 [cluster](luaDeleteFunction+0x148)[0x723708] /usr/bin/redis-server 127.0.0.1:6379 [cluster](luaCreateFunction+0x26c)[0x724450] /usr/bin/redis-server 127.0.0.1:6379 [cluster](evalCommand+0x2bc)[0x7254dc] /usr/bin/redis-server 127.0.0.1:6379 [cluster](call+0x574)[0x5b8d14] /usr/bin/redis-server 127.0.0.1:6379 [cluster](processCommand+0xc84)[0x5b9b10] /usr/bin/redis-server 127.0.0.1:6379 [cluster](processCommandAndResetClient+0x11c)[0x6db63c] /usr/bin/redis-server 127.0.0.1:6379 [cluster](processInputBuffer+0x1b0)[0x6dffd4] /usr/bin/redis-server 127.0.0.1:6379 [cluster][0x6bd968] /usr/bin/redis-server 127.0.0.1:6379 [cluster][0x659634] /usr/bin/redis-server 127.0.0.1:6379 [cluster](amzTLSEventHandler+0x194)[0x6588d8] /usr/bin/redis-server 127.0.0.1:6379 [cluster][0x750c88] /usr/bin/redis-server 127.0.0.1:6379 [cluster](aeProcessEvents+0x228)[0x757fa8] /usr/bin/redis-server 127.0.0.1:6379 [cluster](redisMain+0x478)[0x7786b8] /lib64/libc.so.6(__libc_start_main+0xe4)[0xffffa7763da4] /usr/bin/redis-server 127.0.0.1:6379 [cluster][0x5ad3b0] ``` Defrag info: ``` mem_fragmentation_ratio:1.18 mem_fragmentation_bytes:47229992 active_defrag_hits:20561 active_defrag_misses:5878518 active_defrag_key_hits:77 active_defrag_key_misses:212 
total_active_defrag_time:29009 ``` ### Test: Run the test script to push 100,000 scripts to ensure the LRU list keeps 500 maximum length without any crash. ``` 27489:M 14 Nov 2024 20:56:41.583 * LRU List length: 500 27489:M 14 Nov 2024 20:56:41.583 * LRU List length: 500 27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500 27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500 27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500 27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500 27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500 27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500 27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500 27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500 27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500 27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500 27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500 [ok]: Regression for script LRU crash (6811 ms) [1/1 done]: unit/test (7 seconds) ``` --------- Signed-off-by: Seungmin Lee <sungming@amazon.com> Signed-off-by: Seungmin Lee <155032684+sungming2@users.noreply.github.com> Co-authored-by: Seungmin Lee <sungming@amazon.com> Co-authored-by: Binbin <binloveplay1314@qq.com>	2024-11-18 18:06:35 -08:00
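The fix in #1310 rests on the LRU list owning its own copy of each script identifier. A minimal sketch of that ownership rule follows, assuming plain C strings instead of sds and a tiny fixed-size array instead of valkey's list type; `script_lru`, `lru_add`, `dup_string` and `relocate` are all hypothetical names, and `relocate` merely imitates a defragmenter moving the dictionary's key.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define LRU_MAX 3

/* Hypothetical fixed-size LRU of script identifiers (oldest at index 0).
 * Every stored string is a private copy owned by the LRU. */
typedef struct {
    char *ids[LRU_MAX];
    int len;
} script_lru;

static char *dup_string(const char *s) {
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    assert(copy != NULL);
    memcpy(copy, s, n);
    return copy;
}

/* Appends a private copy of `id`. When the LRU is full, the oldest entry is
 * removed and returned (the caller frees it); otherwise NULL is returned. */
char *lru_add(script_lru *lru, const char *id) {
    char *evicted = NULL;
    if (lru->len == LRU_MAX) {
        evicted = lru->ids[0];
        memmove(&lru->ids[0], &lru->ids[1], (LRU_MAX - 1) * sizeof(char *));
        lru->len--;
    }
    lru->ids[lru->len++] = dup_string(id);
    return evicted;
}

/* Imitates what a defragmenter may do to the dictionary's key: move the
 * bytes to a new allocation and free the old one. */
char *relocate(char *old) {
    char *moved = dup_string(old);
    free(old);
    return moved;
}
```

After `relocate`, the LRU's copy is still valid, and the evicted identifier can be matched against the dictionary key by content rather than by pointer, which is the property the commit relies on for `dictUnlink`.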
Seungmin Lee	f9d0b87622	Upgrade macos-12 to macos-13 in workflows (#1318 ) ### Problem GitHub Actions is starting the deprecation process for macOS 12. Deprecation will begin on 10/7/24 and the image will be fully unsupported by 12/3/24. For more details, see https://github.com/actions/runner-images/issues/10721 Signed-off-by: Seungmin Lee <sungming@amazon.com> Co-authored-by: Seungmin Lee <sungming@amazon.com>	2024-11-18 18:00:30 -08:00

12919 Commits