futriix

Author	SHA1	Message	Date
Binbin	211b250aad	Do election in order based on failed primary rank to avoid voting conflicts (#1018) When multiple primary nodes fail simultaneously, the cluster cannot recover within the default effective time (data_age limit). The main reason is that the replicas vote without any ranking among them, which causes too many epoch conflicts. Therefore, we introduce a ranking based on the failed primary's shard-id. This adds a new failed_primary_rank variable, which holds the rank of this instance (myself) among all failed primaries. The variable is used during failover: election packets are sent in order based on the rank, which effectively avoids voting conflicts. If a single primary is down, the behavior is the same as before. If multiple primaries are down, their replicas' election initiation times are delayed by 500ms according to the ranking. Signed-off-by: Binbin <binloveplay1314@qq.com>	2025-01-11 10:43:18 +08:00
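The rank-based delay in 211b250aad can be pictured with a small sketch. It assumes the delay grows linearly, 500ms per rank, so rank 0 (which includes the single-failure case) adds nothing; the constant and function names are hypothetical and are not taken from the Valkey source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constant: the 500ms step named in the commit message. */
#define FAILED_PRIMARY_RANK_DELAY_MS 500

/* Extra delay, in milliseconds, before a replica starts its election.
 * Assumption: the delay is linear in the rank of the failed primary. */
int64_t election_delay_for_rank(int failed_primary_rank) {
    assert(failed_primary_rank >= 0);
    return (int64_t)failed_primary_rank * FAILED_PRIMARY_RANK_DELAY_MS;
}
```

Under this assumption, replicas of different failed primaries start their elections at staggered times instead of competing for the same epoch.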
Binbin	d6bdd9e7d7	Fix module LatencyAddSample still work when latency-monitor-threshold is 0 (#1541) When latency-monitor-threshold is set to 0, the latency monitor is disabled. In VM_LatencyAddSample, the condition was written incorrectly, so latency samples were recorded even when the monitor was turned off. This bug has been present since the very first day, see e3b1d6d, which was merged in 2019. Signed-off-by: Binbin <binloveplay1314@qq.com>	2025-01-11 10:32:58 +08:00
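A minimal sketch of the condition d6bdd9e7d7 describes, assuming a threshold of 0 means "disabled" and that a sample qualifies when it reaches the threshold. The function name and parameters are hypothetical, not the module API itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when a latency sample should be recorded.
 * threshold_ms == 0 means the latency monitor is disabled, so nothing
 * is ever recorded in that case. */
bool should_record_latency(int64_t threshold_ms, int64_t latency_ms) {
    return threshold_ms > 0 && latency_ms >= threshold_ms;
}
```

Checking the threshold first keeps the disabled case from falling through to the comparison, where any non-negative latency would pass against a threshold of 0.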
Binbin	e60990e579	Fix crash when freeing newly created node when nodeIp2String fail (#1535) In #1441, we found an assert and decided to remove it and instead just free the newly created node and close the link, since if we cannot get the IP from the link it probably means the connection was closed.
```
=== VALKEY BUG REPORT START: Cut & paste starting from here ===
17847:M 19 Dec 2024 00:15:58.021 # === ASSERTION FAILED ===
17847:M 19 Dec 2024 00:15:58.021 # ==> cluster_legacy.c:3252 'nodeIp2String(node->ip, link, hdr->myip) == C_OK' is not true

------ STACK TRACE ------

17847 valkey-server *
src/valkey-server 127.0.0.1:27131 [cluster](clusterProcessPacket+0x1304) [0x4e5634]
src/valkey-server 127.0.0.1:27131 [cluster](clusterReadHandler+0x11e) [0x4e59de]
/__w/valkey/valkey/src/valkey-tls.so(+0x2f1e) [0x7f083983ff1e]
src/valkey-server 127.0.0.1:27131 [cluster](aeMain+0x8a) [0x41afea]
src/valkey-server 127.0.0.1:27131 [cluster](main+0x4d7) [0x40f547]
/lib64/libc.so.6(+0x40c8) [0x7f083985a0c8]
/lib64/libc.so.6(__libc_start_main+0x8b) [0x7f083985a18b]
src/valkey-server 127.0.0.1:27131 [cluster](_start+0x25) [0x410ef5]
```
But that change also introduced another assert, because the new node is not added to the cluster nodes dict.
```
17128:M 08 Jan 2025 10:51:44.061 # === ASSERTION FAILED ===
17128:M 08 Jan 2025 10:51:44.061 # ==> cluster_legacy.c:1693 'dictDelete(server.cluster->nodes, nodename) == DICT_OK' is not true

------ STACK TRACE ------

17128 valkey-server *
src/valkey-server 127.0.0.1:28627 [cluster][0x4ebdc4]
src/valkey-server 127.0.0.1:28627 [cluster][0x4e81d2]
src/valkey-server 127.0.0.1:28627 [cluster](clusterReadHandler+0x268)[0x4e8618]
/__w/valkey/valkey/src/valkey-tls.so(+0xb278)[0x7f109480b278]
src/valkey-server 127.0.0.1:28627 [cluster](aeMain+0x89)[0x592b09]
src/valkey-server 127.0.0.1:28627 [cluster](main+0x4b3)[0x453e23]
/lib64/libc.so.6(__libc_start_main+0xe5)[0x7f10958bf7e5]
src/valkey-server 127.0.0.1:28627 [cluster](_start+0x2e)[0x454a5e]
```
This closes #1527. Signed-off-by: Binbin <binloveplay1314@qq.com>	2025-01-10 10:19:04 +08:00
Madelyn Olson	d99457c09c	Free the passed in lua context instead of the global (#1536 ) The fix that Redis gave us for the CVE-2024-46981 was freeing lctx.lua, and I didn't merge it correctly. We made some changes so that we are able to async free the lua context, so we need to free the passed in context. This was applied correctly on the two released versions (8.0 and 7.2) just not on unstable. Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>	2025-01-09 14:35:48 +08:00
Karthick Ariyaratnam	80c35402bc	Remove legacy SERVER_TEST compiler flag from cmake. (#1530 ) This PR is to cleanup the `SERVER_TEST` compiler flag from cmake compile definitions, as it is no longer required in the new unit test framework, see #428. Signed-off-by: Karthick Ariyaratnam <karthyuom@gmail.com>	2025-01-09 11:52:45 +08:00
Nadav Gigi	9f4815a224	Accelerate hash table iterator with prefetching (#1501) This PR introduces improvements to the hashtable iterator, implementing the prefetching technique described in the blog post [Unlock One Million RPS - Part 2](https://valkey.io/blog/unlock-one-million-rps-part2/). The changes lay the groundwork for further enhancements in use cases involving iterators. Future PRs will build upon this foundation to improve performance and functionality in various iterator-dependent operations. In the pursuit of maximizing iterator performance, I conducted a comprehensive series of experiments. My tests encompassed a wide range of approaches, including processing multiple bucket indices in parallel, prefetching the next bucket upon completion of the current one, and several other timing and quantity variations. Surprisingly, after rigorous testing and performance analysis, the simplest implementation presented in this PR consistently outperformed all other more complex strategies.

## Implementation

Each time we start iterating over a bucket, we prefetch data for future iterations:
- We prefetch the entries of the next bucket (if it exists).
- We prefetch the structure (but not the entries) of the bucket after the next.

This prefetching is done when we pick up a new bucket, increasing the chance that the data will be in cache by the time we need it.

## Performance

The data below was collected by running the KEYS command on a 64-core Graviton 3 Amazon EC2 instance with 50 million keys of 100 bytes each. The duration of the "keys *" command was taken from the "info all" command.
```
+--------------------+------------------+-----------------------------+
| prefetching        | Time (seconds)   | Keys Processed per Second   |
+--------------------+------------------+-----------------------------+
| No                 | 11.112279        | 4,499,529                   |
| Yes                | 3.141916         | 15,913,862                  |
+--------------------+------------------+-----------------------------+

Improvement:
Comparing the iterator without prefetching to the one with prefetching,
we can see a speed improvement of 11.112279 / 3.141916 ≈ 3.54 times faster.
```

### Save command improvement

#### Setup:
- 64-core Graviton 3 Amazon EC2 instance.
- 50 million keys of 100 bytes each.
- Running valkey server over RAM file system.
- CRC checksum and compression off.

#### Results
```
+--------------------+------------------+-----------------------------+
| prefetching        | Time (seconds)   | Keys Processed per Second   |
+--------------------+------------------+-----------------------------+
| No                 | 28               | 1,785,700                   |
| Yes                | 19.6             | 2,550,000                   |
+--------------------+------------------+-----------------------------+

Improvement:
- Reduced SAVE time by 30% (8.4 seconds faster)
- Increased key processing rate by 42.8% (764,300 more keys/second)
```
Signed-off-by: NadavGigi <nadavgigi102@gmail.com>	2025-01-08 23:18:55 +01:00
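The two prefetch steps listed in 9f4815a224 can be sketched as below. This is an illustration under stated assumptions, not the Valkey hashtable code: the `bucket` layout, the bucket width, and the function name are hypothetical, and `__builtin_prefetch` is the GCC/Clang builtin.

```c
#include <stddef.h>
#include <stdint.h>

#define ENTRIES_PER_BUCKET 7 /* hypothetical bucket width */

typedef struct bucket {
    uint8_t used;                          /* number of filled slots */
    uint64_t *entries[ENTRIES_PER_BUCKET]; /* pointers to stored values */
} bucket;

/* Sums all stored values, prefetching ahead each time a new bucket is
 * picked up. */
uint64_t sum_entries_with_prefetch(const bucket *buckets, size_t nbuckets) {
    uint64_t sum = 0;
    for (size_t i = 0; i < nbuckets; i++) {
        if (i + 1 < nbuckets) {
            /* Prefetch the entries of the next bucket. */
            const bucket *next = &buckets[i + 1];
            for (uint8_t j = 0; j < next->used; j++) __builtin_prefetch(next->entries[j]);
        }
        /* Prefetch only the structure of the bucket after the next. */
        if (i + 2 < nbuckets) __builtin_prefetch(&buckets[i + 2]);

        for (uint8_t j = 0; j < buckets[i].used; j++) sum += *buckets[i].entries[j];
    }
    return sum;
}
```

Prefetching the structure two buckets ahead means that, one iteration later, reading `next->used` and the entry pointers is likely to hit cache, which is what makes the entry prefetch cheap.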
Nikhil Manglore	9e0204941d	valkey-cli auto-exit from subscribed mode (#1432 ) Resolves issue with valkey-cli not auto exiting from subscribed mode on reaching zero pub/sub subscription (previously filed on Redis) https://github.com/redis/redis/issues/12592 --------- Signed-off-by: Nikhil Manglore <nmanglor@amazon.com>	2025-01-08 21:03:06 +01:00
Rain Valentine	ab627d6721	Replace dict with new hashtable: sorted set datatype (#1427 ) This PR replaces dict with hashtable in the ZSET datatype. Instead of mapping key to score as dict did, the hashtable maps key to a node in the skiplist, which contains the score. This takes advantage of hashtable performance improvements and saves 15 bytes per set item - 24 bytes overhead before, 9 bytes after. Closes #1096 --------- Signed-off-by: Rain Valentine <rsg000@gmail.com> Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>	2025-01-08 18:34:02 +01:00
uriyage	6c09eea2bc	client struct: lazy init components and optimize struct layout (#1405)

# Refactor client structure to use modular data components

## Current State

The client structure allocates memory for replication / pubsub / multi-keys / module / blocked data for every client, despite these features being used by only a small subset of clients. In addition the current field layout in the client struct is suboptimal, with poor alignment and unnecessary padding between fields, leading to a larger than necessary memory footprint of 896 bytes per client. Furthermore, fields that are frequently accessed together during operations are scattered throughout the struct, resulting in poor cache locality.

## This PR's Change

1. Lazy Initialization
   - Components are only allocated when first used:
     - PubSubData: Created on first SUBSCRIBE/PUBLISH operation
     - ReplicationData: Initialized only for replica connections
     - ModuleData: Allocated when module interaction begins
     - BlockingState: Created when first blocking command is issued
     - MultiState: Initialized on MULTI command
2. Memory Layout Optimization:
   - Grouped related fields for better locality
   - Moved rarely accessed fields (e.g., client->name) to struct end
   - Optimized field alignment to eliminate padding
3. Additional changes:
   - Moved watched_keys to be static allocated in the `mstate` struct
   - Relocated replication init logic to replication.c

### Key Benefits

- Efficient Memory Usage:
  - 45% smaller base client structure
  - Basic clients now use 528 bytes (down from 896).
- Better memory locality for related operations
- Performance improvement in high throughput scenarios. No performance regressions in other cases.

### Performance Impact

Tested with 650 clients and 512 bytes values.

#### Single Thread Performance

| Operation | Dataset | New (ops/sec) | Old (ops/sec) | Change % |
|-----------|---------|---------------|---------------|----------|
| SET | 1 key | 261,799 | 258,261 | +1.37% |
| SET | 3M keys | 209,134 | ~209,000 | ~0% |
| GET | 1 key | 281,564 | 277,965 | +1.29% |
| GET | 3M keys | 231,158 | 228,410 | +1.20% |

#### 8 IO Threads Performance

| Operation | Dataset | New (ops/sec) | Old (ops/sec) | Change % |
|-----------|---------|---------------|---------------|----------|
| SET | 1 key | 1,331,578 | 1,331,626 | -0.00% |
| SET | 3M keys | 1,254,441 | 1,152,645 | +8.83% |
| GET | 1 key | 1,293,149 | 1,289,503 | +0.28% |
| GET | 3M keys | 1,152,898 | 1,101,791 | +4.64% |

#### Pipeline Performance (3M keys)

| Operation | Pipeline Size | New (ops/sec) | Old (ops/sec) | Change % |
|-----------|---------------|---------------|---------------|----------|
| SET | 10 | 548,964 | 538,498 | +1.94% |
| SET | 20 | 606,148 | 594,872 | +1.89% |
| SET | 30 | 631,122 | 616,606 | +2.35% |
| GET | 10 | 628,482 | 624,166 | +0.69% |
| GET | 20 | 687,371 | 681,659 | +0.84% |
| GET | 30 | 725,855 | 721,102 | +0.66% |

### Observations:

1. Single-threaded operations show consistent improvements (1-1.4%)
2. Multi-threaded performance shows significant gains for large datasets:
   - SET with 3M keys: +8.83% improvement
   - GET with 3M keys: +4.64% improvement
3. Pipeline operations show consistent improvements:
   - SET operations: +1.89% to +2.35%
   - GET operations: +0.66% to +0.84%
4. No performance regressions observed in any test scenario

Related issue: https://github.com/valkey-io/valkey/issues/761 --------- Signed-off-by: Uri Yagelnik <uriy@amazon.com> Signed-off-by: uriyage <78144248+uriyage@users.noreply.github.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>	2025-01-08 10:28:54 +02:00
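The lazy initialization idea in 6c09eea2bc can be shown with a minimal sketch: the base struct holds only a pointer, and the component is allocated on first use. The `pubsub_data` and `client` types here are simplified, hypothetical stand-ins, not the real Valkey structures.

```c
#include <stdlib.h>

/* Hypothetical, simplified component used by only some clients. */
typedef struct pubsub_data {
    int channel_count;
} pubsub_data;

typedef struct client {
    int id;
    pubsub_data *pubsub; /* NULL until the client first needs pub/sub state */
} client;

/* Returns the pub/sub component, allocating it on first use.
 * Returns NULL only if the allocation fails. */
pubsub_data *client_get_pubsub(client *c) {
    if (c->pubsub == NULL) c->pubsub = calloc(1, sizeof(*c->pubsub));
    return c->pubsub;
}

/* Releases the lazily allocated component, if any. */
void client_free_components(client *c) {
    free(c->pubsub);
    c->pubsub = NULL;
}
```

A client that never subscribes pays only for one pointer, which is the kind of trade that shrinks the base struct at the cost of one extra indirection on the less common paths.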
Rueian	dc4628d444	Add `availability_zone` to the HELLO command history (#1524 ) This PR is a followup for #1487. Signed-off-by: Rueian <rueiancsie@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com>	2025-01-08 10:04:58 +08:00
Rueian	3b52186b6a	Add `availability_zone` to the HELLO response (#1487 ) It's inconvenient for client implementations to extract the `availability_zone` information from the `INFO` response. The `INFO` response contains a lot of information that a client implementation typically doesn't need. This PR adds the availability zone to the `HELLO` response. Clients usually already use the `HELLO` command for protocol negotiation and also get the server `version` and `role` from its response. To keep the `HELLO` response small, the field is only added if availability zone is configured. --------- Signed-off-by: Rueian <rueiancsie@gmail.com>	2025-01-07 22:54:55 +01:00
Madelyn Olson	4ffd3ebdeb	Fix LUA garbage collector (CVE-2024-46981) (#1513) Reset the GC state before closing the Lua VM to prevent user data from being wrongly freed while it might still be used in destructor callbacks. Created and published by Redis in their OSS branch. Signed-off-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: YaacovHazan <yaacov.hazan@redis.com>	2025-01-06 14:02:22 -08:00
Madelyn Olson	7977c55ac9	Fix Read/Write key pattern selector (CVE-2024-51741) (#1514) The explanation on the original commit was wrong. Key based access must have a `~` in order to correctly configure which key prefixes to apply the selector to. If this is missing, a server assert will be triggered later. Signed-off-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: YaacovHazan <yaacov.hazan@redis.com>	2025-01-06 14:02:16 -08:00
Binbin	c0014ef15e	Check whether to switch to fail when setting the node to pfail in cron (#1061 ) This may speed up the transition to the fail state a bit. Previously we would only check when we received a pfail/fail report from others in gossip. If myself is the last vote, we can directly switch to fail in here without waiting for the next gossip packet. Signed-off-by: Binbin <binloveplay1314@qq.com>	2025-01-06 09:26:17 +08:00
Binbin	33b824137e	Explicitly check C_ERR condition to improve readability in clusterSaveConfig (#1505 ) It's not obvious to see it at first, modify it to use C_ERR. Signed-off-by: Binbin <binloveplay1314@qq.com>	2025-01-04 10:47:32 +08:00
eifrah-aws	b3b4bdcda4	CMake: fail on warnings (#1503 ) When building with `CMake` (especially the targets `valkey-cli`, `valkey-server` and `valkey-benchmark`) it is possible to have a successful build while having warnings. This PR fixes this - which is aligned with how the `Makefile` is working today: - Enable `-Wall` + `-Werror` for valkey targets - Fixed warning in valkey-cli:jsonStringOutput method Signed-off-by: Eran Ifrah <eifrah@amazon.com>	2025-01-03 09:44:41 +08:00
gmbnomis	26a72fa89c	Use the correct command proc for the LOOKUP_NOTOUCH exception in lookupKey (#1499 ) When looking up a key in no-touch mode, `LOOKUP_NOTOUCH` is set to avoid updating the last access time in `lookupKey`. An exception must be made for the `TOUCH` command which must always update the key. When called from a script, `server.executing_client` will point to the `TOUCH` command, while `server.current_client` will point to e.g. an `EVAL` command. So, we must use the former to find out the currently executing command if defined. This fix addresses the issue where TOUCH wasn't updating key access times when called from scripts like EVAL. Fixes #1498 Signed-off-by: Simon Baatz <gmbnomis@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com>	2025-01-03 09:41:15 +08:00
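The client selection described in 26a72fa89c can be sketched as follows. The `client` struct, the string-based command check, and both function names are hypothetical simplifications; the real code compares command procs rather than names.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified client: only the name of its current command. */
typedef struct client {
    const char *cmd_name;
} client;

/* Prefer the executing client (e.g. TOUCH running inside a script) and
 * fall back to the current client (e.g. the EVAL that started it). */
const client *effective_client(const client *executing, const client *current) {
    return executing != NULL ? executing : current;
}

/* Returns 1 when the lookup should skip the access-time update.
 * no_touch_mode is 1 when the caller asked for no-touch lookups. */
int should_set_lookup_notouch(int no_touch_mode, const client *executing, const client *current) {
    const client *c = effective_client(executing, current);
    int is_touch = c != NULL && c->cmd_name != NULL && strcmp(c->cmd_name, "TOUCH") == 0;
    return no_touch_mode && !is_touch;
}
```

With only the current client consulted, a TOUCH issued from EVAL would look like EVAL and skip the update, which is the behavior the fix removes.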
Ricardo Dias	8d764f27b3	Refactor: move all valkey modules related declarations to `module.h` (#1489 ) In this commit we move all structures and functions declarations related to Valkey modules from `server.h` to the recently added `module.h` file. This re-organization makes it easier for new contributors to find the valkey modules related code, as well as reducing the compilation times when changes are made to the modules code. --------- Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>	2025-01-02 18:35:10 +01:00
uriyage	35abb68b79	Offload reading the replication stream to IO threads (#1449 ) Support Primary client IO offload. Related issue: https://github.com/valkey-io/valkey/issues/761 --------- Signed-off-by: Uri Yagelnik <uriy@amazon.com>	2025-01-02 10:42:39 +01:00
uriyage	ae70c5459b	replication: fix io-threads possible race by moving waitForClientIO (#1422 ) ### Fix race with pending writes in replica state transition #### The Problem In #60 (Dual channel replication) a new `connWrite` call was added before the `waitForClientIO` check. This created a race condition where the main thread may attempt to write to a client that could have pending writes in IO threads. #### The Fix Moved the `waitForClientIO()` call earlier in `syncCommand`, before any `connWrite` call. This ensures all pending IO operations are completed before attempting to write to the client. --------- Signed-off-by: Uri Yagelnik <uriy@amazon.com>	2025-01-02 10:01:55 +02:00
ranshid	0f273bb648	Align rejected unblocked commands to update the correct error statistic (#577) Currently, when a blocked command is unblocked externally (e.g. due to the relevant slot being migrated or the CLIENT UNBLOCK command being issued), the command statistics always update the failed_calls error statistic. This leads to misalignment with `90b9f08e5d` as well as some inconsistencies. For example, when a key is migrated during cluster slot migration, clients blocked on XREADGROUP are unblocked and update the rejected_calls stat, while clients blocked on BLPOP are unblocked and update the failed_calls stat. In this PR we add an explicit indication in updateStatsOnUnblock of whether the command was rejected or failed. --------- Signed-off-by: ranshid <ranshid@amazon.com> Signed-off-by: Ran Shidlansik <ranshid@amazon.com>	2025-01-01 16:33:09 +02:00
zhenwei pi	a136ad9a50	Make global configs as static (#1159 ) Don't expose static configs symbol, and make configEnumGetValue as static function. Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>	2024-12-30 15:58:06 -05:00
Pierre	e4179f1f3b	Only (re-)send MEET packet once every handshake timeout period (#1441 ) Add `meet_sent` field in `clusterNode` indicating the last time we sent a MEET packet. Use this field to only (re-)send a MEET packet once every handshake timeout period when detecting a node without an inbound link. When receiving multiple MEET packets on the same link while the node is in handshake state, instead of dropping the packet, we now simply prevent the creation of a new node. This way we still process the MEET packet's gossip and reply with a PONG as any other packets. Improve some logging messages to include `human_nodename`. Add `nodeExceedsHandshakeTimeout()` function. This is a follow-up to this previous PR: https://github.com/valkey-io/valkey/pull/1307 And a partial fix to the crash described in: https://github.com/valkey-io/valkey/pull/1436 --------- Signed-off-by: Pierre Turin <pieturin@amazon.com>	2024-12-30 15:56:39 -05:00
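The throttling that e4179f1f3b adds with `meet_sent` amounts to a timestamp comparison. The sketch below assumes millisecond timestamps and that 0 means "never sent"; the struct and function names are hypothetical, not the `clusterNode` definition.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified node record. */
typedef struct cluster_node {
    int64_t meet_sent; /* ms timestamp of the last MEET sent, 0 if never (assumption) */
} cluster_node;

/* Returns true when a MEET may be (re-)sent: either none was sent yet,
 * or more than one handshake timeout has elapsed since the last one. */
bool should_send_meet(const cluster_node *n, int64_t now_ms, int64_t handshake_timeout_ms) {
    return n->meet_sent == 0 || now_ms - n->meet_sent > handshake_timeout_ms;
}

/* Records that a MEET was sent at now_ms. */
void mark_meet_sent(cluster_node *n, int64_t now_ms) {
    n->meet_sent = now_ms;
}
```

Gating the resend on the elapsed time keeps a node without an inbound link from receiving a new MEET on every cron cycle.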
Madelyn Olson	e470735d91	Immediately restart the defrag cycle if we still need to defrag (#1492 )	2024-12-29 08:22:49 -08:00
gmbnomis	8b40341295	Fix JSON description of SET command (#1473 ) In the `arguments` section, the `arguments` key is only used for arguments of type `block` or `oneof`. Consequently, the `arguments` given for `IFEQ` are ignored by the server. However, they lead to strange results when rendering the command's page for the web documentation. Fix this by removing `arguments` for `IFEQ`. Signed-off-by: Simon Baatz <gmbnomis@gmail.com>	2024-12-27 00:55:20 +01:00
uriyage	bb325bde35	Fix restore replica output bytes stat update (#1486 ) This PR fixes the missing stat update for `total_net_repl_output_bytes` that was removed during the refactoring in PR #758. The metric was not being updated when writing to replica connections. Changes: - Restored the stat update in postWriteToClient for replica connections - Added integration test to verify the metric is properly updated Signed-off-by: Uri Yagelnik <uriy@amazon.com> Co-authored-by: Binbin <binloveplay1314@qq.com>	2024-12-25 10:58:49 +08:00
Binbin	da92c1d6c8	Document all command flags near serverCommand (#1474 ) These flags are not documented here. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-12-25 10:57:42 +08:00
Amit Nagler	9f4503ca50	Add scoped RDB loading context and immediate abort flag (#1173 ) This PR introduces a new mechanism for temporarily changing the server's loading_rio context during RDB loading operations. The new `RDB_SCOPED_LOADING_RIO` macro allows for a scoped change of the `server.loading_rio` value, ensuring that it's automatically restored to its original value when the scope ends. Introduces a dedicated flag to `rio` to signal immediate abort, preventing potential use-after-free scenarios during replication disconnection in dual-channel load. This ensures proper termination of `rdbLoadRioWithLoadingCtx` when replication is cancelled due to connection loss on main connection. Fixes https://github.com/valkey-io/valkey/issues/1152 --------- Signed-off-by: naglera <anagler123@gmail.com> Signed-off-by: Madelyn Olson <madelyneolson@gmail.com> Signed-off-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>	2024-12-24 08:14:32 +02:00
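One way a scoped change like the one in 9f4503ca50 can be built in C is with the GCC/Clang `cleanup` variable attribute. Whether `RDB_SCOPED_LOADING_RIO` is implemented this way is an assumption; every name below (`demo_server`, `loading_rio_guard`, `SCOPED_LOADING_RIO`) is hypothetical.

```c
#include <stddef.h>

typedef struct rio { int id; } rio; /* hypothetical stand-in for the real rio */

/* Hypothetical global standing in for the server's loading_rio field. */
struct demo_server_state { rio *loading_rio; };
struct demo_server_state demo_server = {NULL};

/* Guard object that remembers the value to restore. */
typedef struct loading_rio_guard { rio *saved; } loading_rio_guard;

/* Runs automatically when the guard variable goes out of scope. */
void restore_loading_rio(loading_rio_guard *guard) {
    demo_server.loading_rio = guard->saved;
}

/* Saves the current value, installs new_rio, restores on scope exit. */
#define SCOPED_LOADING_RIO(new_rio)                                   \
    __attribute__((cleanup(restore_loading_rio)))                     \
    loading_rio_guard scoped_guard_ = {demo_server.loading_rio};      \
    demo_server.loading_rio = (new_rio)

/* Returns 1 if tmp was the active loading rio inside the scope. */
int loads_with_temporary_rio(rio *tmp) {
    SCOPED_LOADING_RIO(tmp);
    return demo_server.loading_rio == tmp;
}
```

The restore runs on every exit path from the scope, including early returns, so an error path cannot leave the temporary value installed.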
Madelyn Olson	2ee06e7983	Remove readability refactor for failover auth to fix clang warning (#1481 ) As part of #1463, I made a small refactor between the PR and the daily test I submitted to try to improve readability by adding a function to abstract the extraction of the message types. However, that change apparently caused GCC to throw another warning, so reverting the abstraction on just one line. Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>	2024-12-24 13:07:15 +08:00
Ricardo Dias	6adef8e2f9	Adds support for scripting engines as Valkey modules (#1277) This PR extends the module API to support the addition of different scripting engines to execute user defined functions. The scripting engine can be implemented as a Valkey module, and can be dynamically loaded with the `loadmodule` config directive, or with the `MODULE LOAD` command. This PR also adds an example of a dummy scripting engine module, to show how to use the new module API. The dummy module is implemented in `tests/modules/helloscripting.c`. The current module API support only allows loading scripting engines to run functions using the `FCALL` command. The additions to the module API are the following:
```c
/* This struct represents a scripting engine function that results from the
 * compilation of a script by the engine implementation.
 */
struct ValkeyModuleScriptingEngineCompiledFunction

typedef ValkeyModuleScriptingEngineCompiledFunction **(*ValkeyModuleScriptingEngineCreateFunctionsLibraryFunc)(
    ValkeyModuleScriptingEngineCtx *engine_ctx,
    const char *code,
    size_t timeout,
    size_t *out_num_compiled_functions,
    char **err);

typedef void (*ValkeyModuleScriptingEngineCallFunctionFunc)(
    ValkeyModuleCtx *module_ctx,
    ValkeyModuleScriptingEngineCtx *engine_ctx,
    ValkeyModuleScriptingEngineFunctionCtx *func_ctx,
    void *compiled_function,
    ValkeyModuleString **keys,
    size_t nkeys,
    ValkeyModuleString **args,
    size_t nargs);

typedef size_t (*ValkeyModuleScriptingEngineGetUsedMemoryFunc)(
    ValkeyModuleScriptingEngineCtx *engine_ctx);

typedef size_t (*ValkeyModuleScriptingEngineGetFunctionMemoryOverheadFunc)(
    void *compiled_function);

typedef size_t (*ValkeyModuleScriptingEngineGetEngineMemoryOverheadFunc)(
    ValkeyModuleScriptingEngineCtx *engine_ctx);

typedef void (*ValkeyModuleScriptingEngineFreeFunctionFunc)(
    ValkeyModuleScriptingEngineCtx *engine_ctx,
    void *compiled_function);

/* This struct stores the callback functions implemented by the scripting
 * engine to provide the functionality for the `FUNCTION *` commands.
 */
typedef struct ValkeyModuleScriptingEngineMethodsV1 {
    uint64_t version; /* Version of this structure for ABI compat. */

    /* Library create function callback. When a new script is loaded, this
     * callback will be called with the script code, and returns a list of
     * ValkeyModuleScriptingEngineCompiledFunc objects.
     */
    ValkeyModuleScriptingEngineCreateFunctionsLibraryFunc create_functions_library;

    /* The callback function called when `FCALL` command is called on a function
     * registered in this engine.
     */
    ValkeyModuleScriptingEngineCallFunctionFunc call_function;

    /* Function callback to get current used memory by the engine. */
    ValkeyModuleScriptingEngineGetUsedMemoryFunc get_used_memory;

    /* Function callback to return memory overhead for a given function. */
    ValkeyModuleScriptingEngineGetFunctionMemoryOverheadFunc get_function_memory_overhead;

    /* Function callback to return memory overhead of the engine. */
    ValkeyModuleScriptingEngineGetEngineMemoryOverheadFunc get_engine_memory_overhead;

    /* Function callback to free the memory of a registered engine function. */
    ValkeyModuleScriptingEngineFreeFunctionFunc free_function;
} ValkeyModuleScriptingEngineMethodsV1;

/* Registers a new scripting engine in the server.
 *
 * - `engine_name`: the name of the scripting engine. This name will match
 *   against the engine name specified in the script header using a shebang.
 *
 * - `engine_ctx`: engine specific context pointer.
 *
 * - `engine_methods`: the struct with the scripting engine callback functions
 *   pointers.
 */
int ValkeyModule_RegisterScriptingEngine(ValkeyModuleCtx *ctx,
                                         const char *engine_name,
                                         void *engine_ctx,
                                         ValkeyModuleScriptingEngineMethods *engine_methods);

/* Removes the scripting engine from the server.
 *
 * `engine_name` is the name of the scripting engine.
 *
 */
int ValkeyModule_UnregisterScriptingEngine(ValkeyModuleCtx *ctx, const char *engine_name);
```
--------- Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>	2024-12-21 23:09:35 +01:00
Madelyn Olson	1c97317518	Resolve bounds checks on cluster_legacy.c (#1463 ) We are getting a number of errors like: ``` array subscript ‘clusterMsg[0]’ is partly outside array bounds of ‘unsigned char[2272]’ ``` Which is basically GCC telling us that we have an object which is longer than the underlying storage of the allocation. We actually do this a lot, but GCC is generally not aware of how big the underlying allocation is, so it doesn't throw this error. We are specifically getting this error because the msgBlock can be of variable length depending on the type of message, but GCC assumes it's the longest one possible. The solution I went with here was make the message type optional, so that it wasn't included in the size. I think this also makes some sense, since it's really just a helper for us to easily cast the object around. I considered disabling this error, but it is generally pretty useful since it can catch real issues. Another solution would be to over-allocate to the largest possible object, which could hurt performance as we initialize it to zero. Results: https://github.com/madolson/valkey/actions/runs/12423414811/job/34686899884 This is a slightly cleaned up version of https://github.com/valkey-io/valkey/pull/1439. I thought I had another strategy but alas, it didn't work out. Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>	2024-12-20 12:10:48 -08:00
Binbin	ca0b0c662a	Clear outdated failure reports more accurately (#1184) There are two changes here: 1. The one in clusterNodeCleanupFailureReports: only a primary with slots can file a failure report, so if the primary became a replica, its failure report should be cleared. This may lead to inaccurate node fail judgment in some network partition cases, I guess; it will also affect the CLUSTER COUNT-FAILURE-REPORTS command. 2. The one in clusterProcessGossipSection: it is not that important, but it prints a "node is back online" log that helps us troubleshoot the problem, although it may conflict with 1 at some points. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-12-20 10:14:01 +08:00
Jungwoo Song	e9a1fe0b32	Support for reading from replicas in valkey-benchmark (#1392 ) Background When conducting performance tests using `valkey-benchmark`, reading from replicas was not supported. Consequently, even in cluster mode, all reads were directed to the primary nodes. This limitation made it challenging to obtain accurate metrics during workload stress testing for performance measurement or before a version upgrade. Related issue : https://github.com/valkey-io/valkey/issues/900 Changes 1. Replaced the use of `CLUSTER NODES` with `CLUSTER SLOTS` when fetching cluster configuration. This allows for easier identification of replica slots. 2. Support for reading from replicas by executing the client in `READONLY` mode. 3. Support reading from replicas even during slot migrations. 4. Introduced two CLI options `--rfr` to enable reading from replicas only or all cluster nodes. A warning added to indicate that write requests might not be handled correctly when using this option. --------- Signed-off-by: bluayer <ijacsong98@gmail.com> Signed-off-by: bluayer <bluayer@gmail.com> Signed-off-by: Jungwoo Song <37579681+bluayer@users.noreply.github.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>	2024-12-19 18:32:31 +02:00
Binbin	97029953a0	Minor log fixes when failover auth denied due to slot epoch (#1341 ) The old reqEpoch mainly refers to requestCurrentEpoch, see: ``` if (requestCurrentEpoch < server.cluster->currentEpoch) { serverLog(LL_WARNING, "Failover auth denied to %.40s (%s): reqEpoch (%llu) < curEpoch(%llu)", node->name, node->human_nodename, (unsigned long long)requestCurrentEpoch, (unsigned long long)server.cluster->currentEpoch); return; } ``` And in here we refer to requestConfigEpoch, it's a bit misleading, so change it to reqConfigEpoch to make it clear. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-12-19 16:12:34 +08:00
uriyage	8060c86d20	Offload TLS negotiation to I/O threads (#1338)

## TLS Negotiation Offloading to I/O Threads

### Overview

This PR introduces the ability to offload TLS handshake negotiations to I/O threads, significantly improving performance under high TLS connection loads.

### Key Changes

- Added infrastructure to offload TLS negotiations to I/O threads
- Refactored SSL event handling to allow I/O threads to modify conn flags.
- Introduced new connection flag to identify client connections

### Performance Impact

Testing with 650 clients with SET commands and 160 new TLS connections per second in the background:

#### Throughput Impact of new TLS connections

- With Offloading: Minimal impact (1050K → 990K ops/sec)
- Without Offloading: Significant drop (1050K → 670K ops/sec)

#### New Connection Rate

- With Offloading: 1,757 conn/sec
- Without Offloading: 477 conn/sec

### Implementation Details

1. Main Thread:
   - Initiates negotiation-offload jobs to I/O threads
   - Adds connections to pending-read clients list (using existing read offload mechanism)
   - Post-negotiation handling:
     - Creates read/write events if needed for incomplete negotiations
     - Calls accept handler for completed negotiations
2. I/O Thread:
   - Performs TLS negotiation
   - Updates connection flags based on negotiation result

Related issue: https://github.com/valkey-io/valkey/issues/761 --------- Signed-off-by: Uri Yagelnik <uriy@amazon.com> Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>	2024-12-18 09:03:30 +02:00
Madelyn Olson	e203ca35b7	Fix undefined behavior defined by ASAN (#1451 ) Asan now supports making sure you are passing in the correct pointer type, which seems useful but we can't support it since we pass in an incorrect pointer in several places. This is most commonly done with generic free functions, where we simply cast it to the correct type. It's not a lot of code to clean up, so it seems appropriate to cleanup instead of disabling the check. --------- Signed-off-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>	2024-12-17 17:48:53 -08:00
Viktor Szépe	b66698b887	Discover and fix new typos (#1446 ) Upgrade `typos` and fix corresponding typos --------- Signed-off-by: Viktor Szépe <viktor@szepe.net>	2024-12-17 17:45:43 -08:00
ranshid	ba25b586d5	Introduce FORCE_DEFRAG compilation option to allow activedefrag run when allocator is not jemalloc (#1303 ) Introduce compile time option to force activedefrag to run even when jemalloc is not used as the allocator. This is in order to be able to run tests with defrag enabled while using memory instrumentation tools. fixes: https://github.com/valkey-io/valkey/issues/1241 --------- Signed-off-by: ranshid <ranshid@amazon.com> Signed-off-by: Ran Shidlansik <ranshid@amazon.com> Signed-off-by: Madelyn Olson <madelyneolson@gmail.com> Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>	2024-12-17 19:07:55 +02:00
xbasel	7892bf808b	Fix test_reclaimFilePageCache to avoid tmpfs (#1379 ) Avoid tmpfs as fadvise(FADV_DONTNEED) has no effect on memory-backed filesystems. Fixes https://github.com/valkey-io/valkey/issues/897 --------- Signed-off-by: Ran Shidlansik <ranshid@amazon.com> Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com>	2024-12-17 18:04:27 +02:00
Binbin	e024b4bd27	Drop the MEET packet if the link node is in handshake state (#1436 ) After #1307 got merged, we noticed an assert being hit in setClusterNodeToInboundClusterLink: ``` === ASSERTION FAILED === ==> '!link->node' is not true ``` Since #778, we call setClusterNodeToInboundClusterLink to attach the node to the link during MEET processing, so if we receive another MEET packet within a short time, while the node is still in handshake state, we hit this assert and crash the server. If the link is bound to a node and the node is in the handshake state, and we receive a MEET packet, it may be that the sender sent multiple MEET packets, so here we drop the MEET to avoid the assert in setClusterNodeToInboundClusterLink. The assert happens in the following sequence: the other node sends a MEET packet because it detects that there is no inbound link; this node creates a new node in HANDSHAKE state (with a random node name) and responds with a PONG. The other node receives the PONG and removes the CLUSTER_NODE_MEET flag. This node is supposed to open an outbound connection to the other node in the next cron cycle, but before this happens, the other node re-sends a MEET on the same link because it still detects no inbound connection. Note that in getNodeFromLinkAndMsg, the node in the handshake state has a random name and is not truly "known", so we don't know the sender. Dropping the MEET packet prevents us from creating a random node, avoids incorrect link binding, and prevents a duplicate MEET packet from eliminating the handshake state. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-12-16 13:43:48 +08:00
Binbin	ad24220681	Automatic failover vote is not limited by two times the node timeout (#1356 ) This is a follow-up to #1305; we have now decided to apply the same change to automatic failover as well, that is, to move forward with removing the limit for both automatic and manual failovers. Quote from Ping during the review: Note that we already debounce transient primary failures with node timeout, ensuring failover is only triggered after sustained outages. Election timing is naturally staggered by replica spacing, making the likelihood of simultaneous elections from replicas of the same shard very low. The one-vote-per-epoch rule further throttles retries and ensures orderly elections. On top of that, quorum-based primary failure confirmation, cluster-state convergence, and slot ownership validation are all built into the process. Quote from Madelyn during the review: It against the specific primary. It's to prevent double failovers. If a primary just took over we don't want someone else to try to take over and give the new primary some amount of time to take over. I have not seen this issue though, it might have been over optimizing? The double failure mode, where a node fails and then another node fails within the nodetimeout also doesn't seem that common either though. So the conclusion is that we all agreed to remove it completely, which makes the code a lot simpler. And if there are other specific edge cases we are missing, we will fix them another way. See discussion #1305 for more information. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-12-15 12:09:53 +08:00
Rain Valentine	88942c8e61	Replace dict with new hashtable for sets datatype (#1176 ) The new `hashtable` provides faster lookups and uses less memory than `dict`. A TCL test case "SRANDMEMBER with a dict containing long chain" is deleted because it's covered by a hashtable unit test "test_random_entry_with_long_chain", which is already present. This change also moves some logic from dismissMemory (object.c) to zmadvise_dontneed (zmalloc.c), so the hashtable implementation which needs the dismiss functionality doesn't need to depend on object.c and server.h. This PR follows #1186. --------- Signed-off-by: Rain Valentine <rsg000@gmail.com> Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>	2024-12-14 20:53:48 +01:00
Madelyn Olson	3cd176dc39	Avoid importing memory aligned malloc (#1442 ) We deprecate the usage of classic malloc and free, but under certain circumstances they might get imported from intrinsics. The original thought is we should just override malloc and free to use zmalloc and zfree, but I think we should continue to deprecate it to avoid accidental imports of allocations. Closes https://github.com/valkey-io/valkey/issues/1434. --------- Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>	2024-12-14 19:13:04 +01:00
Thalia Archibald	b60097ba07	Check length before reading in `stringmatchlen` (#1431 ) Fixes four cases where `stringmatchlen` could overrun the pattern if it is not terminated with NUL. These commits are cherry-picked from my [fork](https://github.com/thaliaarchi/antirez-stringmatch) which extracts `stringmatch` as a library and compares it to other projects by antirez which use the same matcher. Signed-off-by: Thalia Archibald <thalia@archibald.dev>	2024-12-13 11:05:19 +01:00
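A minimal sketch of the "check length before reading" idea behind the `stringmatchlen` fix above (#1431). `bounded_match` is a hypothetical, heavily simplified matcher written for illustration (only `*`, `?` and `\` escapes); it is not the Valkey implementation, and the four cases the commit actually fixes may differ. The point it shows is that every read of `pat[1]` is guarded by the remaining length, so the pattern need not be NUL-terminated.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified glob matcher (NOT the real stringmatchlen).
 * Both inputs are length-delimited; neither has to end with NUL.
 * Returns 1 on match, 0 otherwise. */
int bounded_match(const char *pat, size_t plen, const char *str, size_t slen) {
    assert(pat != NULL && str != NULL);
    while (plen > 0) {
        if (pat[0] == '*') {
            /* Collapse runs of '*': look at pat[1] only if it exists. */
            while (plen > 1 && pat[1] == '*') {
                pat++;
                plen--;
            }
            if (plen == 1) return 1; /* trailing '*' matches the rest */
            for (size_t i = 0; i <= slen; i++) {
                if (bounded_match(pat + 1, plen - 1, str + i, slen - i)) return 1;
            }
            return 0;
        }
        if (slen == 0) return 0;
        if (pat[0] != '?') {
            /* Treat '\' as an escape only when another pattern byte follows;
             * a trailing '\' is compared literally instead of reading past
             * the end of the pattern. */
            if (pat[0] == '\\' && plen >= 2) {
                pat++;
                plen--;
            }
            if (pat[0] != str[0]) return 0;
        }
        pat++;
        plen--;
        str++;
        slen--;
    }
    return slen == 0;
}
```

With this shape, a pattern that ends in a lone backslash, such as the two bytes `a\`, matches the string `a\` literally rather than touching the byte after the pattern buffer.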
Jim Brunner	32f2c73cb5	defrag: eliminate persistent kvstore pointer and edge case fixes (#1430 ) This update addresses several issues in defrag: 1. In the defrag redesign (https://github.com/valkey-io/valkey/pull/1242), a bug was introduced where `server.cronloops` was no longer being incremented in the `whileBlockedCron()`. This resulted in some memory statistics not being updated while blocked. 2. In the test case for AOF loading, we were seeing errors due to defrag latencies. However, running the math, the latencies are justified given the extremely high CPU target of the testcase. Adjusted the expected latency check to allow longer latencies for this case where defrag is undergoing starvation while AOF loading is in progress. 3. A "stage" is passed a "target". For the main dictionary and expires, we were passing in a `kvstore*`. However, on flushall or swapdb, the pointer may change. It's safer and more stable to use an index for the DB (a DBID). Then if the pointer changes, we can detect the change, and simply abort the stage. (If there's still fragmentation to deal with, we'll pick it up again on the next cycle.) 4. We always start a new stage on a new defrag cycle. This gives the new stage time to run, and prevents latency issues for certain stages which don't operate incrementally. However, often several stages will require almost no work, and this will leave a chunk of our CPU allotment unused. This is mainly an issue in starvation situations (like AOF loading or LUA script) - where defrag is running infrequently, with a large duty-cycle. This change allows a new stage to be initiated if we still have a standard duty-cycle remaining. (This can happen during starvation situations where the planned duty cycle is larger than the standard cycle. Most likely this isn't a concern for real scenarios, but it was observed in testing.) 5. Minor comment correction in `server.h` Signed-off-by: Jim Brunner <brunnerj@amazon.com>	2024-12-12 14:55:57 -08:00
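Item 3 of the defrag commit above (#1430) can be illustrated with a small sketch: a stage holds a DB index instead of a long-lived store pointer, re-resolves the pointer on every step, and aborts if it changed. All names here (`Store`, `Db`, `Stage`, `stage_step`) are hypothetical stand-ins, not Valkey's actual defrag types, and the remembered pointer is only compared, never dereferenced.

```c
#include <stddef.h>

/* Hypothetical stand-ins for kvstore and the per-DB struct. */
typedef struct Store { int placeholder; } Store;
typedef struct Db { Store *keys; } Db;

/* A stage remembers WHICH database it works on (stable index) plus the
 * pointer it first observed, used only to detect replacement. */
typedef struct Stage {
    int dbid;
    Store *seen; /* NULL until the first step */
} Stage;

enum { STAGE_CONTINUE = 0, STAGE_ABORT = 1 };

/* Run one increment of the stage. Returns STAGE_ABORT when the store
 * behind dbid was swapped out (e.g. by a flush or swap), so the caller
 * can drop the stage and let a later cycle pick the work up again. */
int stage_step(Stage *st, Db *dbs, int ndbs) {
    if (st->dbid < 0 || st->dbid >= ndbs) return STAGE_ABORT;
    Store *cur = dbs[st->dbid].keys; /* re-resolve on every step */
    if (st->seen == NULL) st->seen = cur;
    if (cur != st->seen) return STAGE_ABORT;
    /* ... incremental work on cur would go here ... */
    return STAGE_CONTINUE;
}
```

The design choice sketched here is that the index stays valid across pointer replacement, so a stale stage is detected by comparison rather than by dereferencing a freed store.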
ranshid	2d92404522	Avoid defragging scripts during EVAL command execution (#1414 ) This can happen when scripts are running for a long period of time and the server attempts to defrag them in the whileBlockedCron. Signed-off-by: Ran Shidlansik <ranshid@amazon.com>	2024-12-12 13:52:58 -08:00
Pierre	5f7fe9ef21	Send MEET packet to node if there is no inbound link to fix inconsistency when handshake timedout (#1307 ) In some cases, when meeting a new node, if the handshake times out, we can end up with an inconsistent view of the cluster where the new node knows about all the nodes in the cluster, but the cluster does not know about this new node (or vice versa). To detect this inconsistency, we now check if a node has an outbound link but no inbound link, in this case it probably means this node does not know us. In this case we (re-)send a MEET packet to this node to do a new handshake with it. If we receive a MEET packet from a known node, we disconnect the outbound link to force a reconnect and sending of a PING packet so that the other node recognizes the link as belonging to us. This prevents cases where a node could send MEET packets in a loop because it thinks the other node does not have an inbound link. This fixes the bug described in #1251. --------- Signed-off-by: Pierre Turin <pieturin@amazon.com>	2024-12-11 17:26:06 -08:00
Jim Brunner	0c8ad5cd34	defrag: allow defrag to start during AOF loading (#1420 ) Addresses https://github.com/valkey-io/valkey/issues/1393 Changes: * During AOF loading or long running script, this allows defrag to be initiated. * The AOF defrag test was corrected to eliminate the wait period and rely on non-timer invocations. * Logic for "overage" time in defrag was changed. It previously accumulated underage leading to large latencies in extreme tests having very high CPU percentage. After several simple stages were completed during infrequent blocked processing, a large cycle time would be experienced. Signed-off-by: Jim Brunner <brunnerj@amazon.com>	2024-12-11 19:47:06 +02:00
Binbin	1acf7f71c0	Fix memory leak in the new hashtable unittest (#1421 ) There is a leak here: hashtableTwoPhasePopDelete won't call the entry destructor, so, as with hashtablePop, we need to call it ourselves. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-12-11 06:40:18 +01:00
Viktor Söderqvist	3eb8314be6	Replace dict with hashtable for keys, expires and pubsub channels Instead of a dictEntry with pointers to key and value, the hashtable has a pointer directly to the value (robj) which can hold an embedded key and acts as a key-value in the hashtable. This minimizes the number of pointers to follow and thus the number of memory accesses to lookup a key-value pair.
      Keys         robj
    hashtable
    +-------+   +-----------------------+
    | 0     |   | type, encoding, LRU   |
    | 1 ------->| refcount, expire      |
    | 2     |   | ptr                   |
    | ...   |   | optional embedded key |
    +-------+   | optional embedded val |
                +-----------------------+
The expire timestamp (TTL) is also stored in the robj, if any. The expire hash table points to the same robj. Overview of changes: * Replace dict with hashtable in kvstore (kvstore.c) * Add functions for embedding key and expire in robj (object.c) * When there's unused space, reserve an expire field to avoid reallocating it later if expire is added. * Always reserve space for expire for large key names to avoid realloc if it's set later. * Update db functions (db.c) * dbAdd, setKey and setExpire reallocate the object when embedding a key * setKey does not increment the reference counter, since it would require duplicating the object. This responsibility is moved to the caller. * Remove logic for shared integer objects as values in the database. The keys are now embedded in the objects, so all objects in the database need to be unique. Thus, we can't use shared objects as values. Also delete test cases for shared integers. * Adjust various commands to the changes mentioned above. * Adjust defrag code * Improvement: Don't access the expires table before defrag has actually reallocated the object. * Adjust test cases that were using hard-coded sizes for dict when realloc would happen, and some other adjustments in test cases.
* Adjust memory prefetch for new hash table implementation in IO-threading, using new `hashtableIncrementalFind` API * Adjust offloading of free() to IO threads: Object free to be done in main thread while keeping obj->ptr offloading in IO-thread since the DB object is now allocated by the main-thread and not by the IO-thread as it used to be. * Let expireIfNeeded take an optional value, to avoid looking up the expires table when possible. --------- Signed-off-by: Uri Yagelnik <uriy@amazon.com> Signed-off-by: uriyage <78144248+uriyage@users.noreply.github.com> Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Uri Yagelnik <uriy@amazon.com>	2024-12-10 21:30:56 +01:00

9374 Commits