12936 Commits

Author SHA1 Message Date
Binbin
b9d224097a
Broadcast a PONG to all nodes in the cluster when the role changes (#1295)
When a node's role changes, we should broadcast the change to notify other nodes.
For example, with one primary and one replica, after a failover the replica
becomes the new primary and the primary becomes a new replica.

If we then trigger a second cluster failover for the new replica, the
new replica will send an MFSTART to its primary, i.e., the new primary.

But the new primary may reject the MFSTART due to this logic:
```
    } else if (type == CLUSTERMSG_TYPE_MFSTART) {
        if (!sender || sender->replicaof != myself) return 1;
```

In the new primary's view, the sender is still a primary, and sender->replicaof
is NULL, so we will return. The manual failover then times out.

Another possibility is that other primaries refuse to vote after receiving
the FAILOVER_AUTH_REQUEST, since in their view the sender is still a primary,
so they refuse to vote, and then the manual failover times out.
```
void clusterSendFailoverAuthIfNeeded(clusterNode *node, clusterMsg *request) {
    ...
        if (clusterNodeIsPrimary(node)) {
            serverLog(LL_WARNING, "Failover auth denied to...
```

The reason is that, currently, we only update the node->replicaof information
when we receive a PING/PONG from the sender. For details, see clusterProcessPacket.
Therefore, in some scenarios, such as clusters with many nodes and a large
cluster-ping-interval (that is, cluster-node-timeout), the node's role change
can be propagated with a significant delay.

Also added a DEBUG DISABLE-CLUSTER-RANDOM-PING command, which disables the
periodic cluster ping sent to a random node every second (see clusterCron).
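
Roughly, the fix boils down to broadcasting a PONG at the moment the role
flips. A minimal sketch of that idea (hedged: the wrapper function below is
made up for illustration, while clusterBroadcastPong()/CLUSTER_BROADCAST_ALL
are the existing cluster broadcast helpers):
```c
/* Sketch only: after a failover changes this node's role, push a PONG to
 * every node so peers refresh the sender's role (and sender->replicaof)
 * immediately instead of on the next periodic ping. */
#define CLUSTER_BROADCAST_ALL 0        /* redeclared here just for the sketch */
void clusterBroadcastPong(int target); /* provided by the cluster implementation */

static void roleChangedSketch(void) {
    /* ...promotion/demotion bookkeeping happens before this point... */
    clusterBroadcastPong(CLUSTER_BROADCAST_ALL);
}
```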

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-11-23 00:22:04 +08:00
Binbin
979f4c1ceb
Add cmake-build-debug and cmake-build-release to gitignore (#1340)
Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-11-22 16:49:16 +08:00
Alan Scherger
377ed22c97
[feat] add Ubuntu 24.04 Noble package support (#971)
add Ubuntu 24.04 Noble package support

Signed-off-by: Alan Scherger <alan.scherger@gmail.com>
2024-11-21 19:26:30 -08:00
Yury-Fridlyand
109d2dadc0
Add slack link for users (#1273)
Add slack link for users

---------

Signed-off-by: Yury-Fridlyand <yury.fridlyand@improving.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-11-21 19:19:10 -08:00
Nadav Levanoni
18d1eb5a85
Remove redundant dict_index calculations (#1205)
We need to start making use of the new `WithDictIndex` APIs which allow
us to reuse the dict_index calculation (avoid over-calling `getKeySlot`
for no good reason).

In this PR I optimized `lookupKey` so it now calls `getKeySlot` to reuse
the dict_index two additional times. It also optimizes the keys command
to avoid unnecessary computation of the slot id.

---------

Signed-off-by: Nadav Levanoni <nadavl@amazon.com>
Co-authored-by: Nadav Levanoni <nadavl@amazon.com>
2024-11-21 19:14:28 -08:00
Sinkevich Artem
43b5026162
Fix argument types of formatting functions (#1253)
`cluster_legacy.c`: `slot_info_pairs` has `uint16_t` values, but they
were cast to `unsigned long` and `%i` was used.

`valkey-cli.c`: `node->replicas_count` is `int`, not `unsigned long`.

Signed-off-by: ArtSin <artsin666@gmail.com>
2024-11-21 18:58:15 -08:00
Binbin
50aae13b0a
Skip reclaim file page cache test in valgrind (#1327)
The test is incompatible with valgrind. Added a new `--valgrind`
argument to test suite, which will cause that test to be skipped.

We skipped it in the past, see 5b61b0dc6d2579ee484fa6cf29bfac59513f84ab

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-11-22 10:29:24 +08:00
Binbin
c4be326c32
Make manual failover reset the on-going election to promote failover (#1274)
If a manual failover times out, for example because the election does not
get enough votes, then since we have an auth_timeout and an auth_retry_time,
a new manual failover will not be able to proceed on the replica side.

For example, if we initiate a new manual failover after an election has timed
out, we will pause the primary, but on the replica side, due to the retry
time, the replica does not trigger the new election and the manual failover
will eventually time out.

In this case, if we initiate manual failover again and there is an
ongoing election, we will reset it so that the replica can initiate
a new election at the manual failover's request.
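
A rough sketch of the reset, with hypothetical field names modeled on the
election state described above (not the actual cluster code):
```c
/* Sketch: when the primary announces a new manual failover while the replica
 * still has a previous election pending (auth time set, retry timer armed),
 * drop that stale election so the new manual failover can proceed right away
 * instead of timing out behind the retry time. */
typedef struct {
    long long failover_auth_time; /* when the pending election may start, 0 = none */
    int failover_auth_sent;       /* were votes already requested? */
    long long mf_end;             /* manual failover deadline, 0 = no MF in progress */
} cluster_state_sketch;

static void manual_failover_start_sketch(cluster_state_sketch *cs, long long now_ms) {
    if (cs->failover_auth_time != 0) {
        /* reset the ongoing election left over from the previous attempt */
        cs->failover_auth_time = 0;
        cs->failover_auth_sent = 0;
    }
    cs->mf_end = now_ms + 5000; /* manual failover window */
}
```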

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-11-22 10:28:59 +08:00
zvi-code
b56eed2479
Remove valkey specific changes in jemalloc source code (#1266)
### Summary of the change

This is a base PR for refactoring defrag. It moves the defrag logic to
rely on jemalloc [native
api](https://github.com/jemalloc/jemalloc/pull/1463#issuecomment-479706489)
instead of relying on custom code changes made by valkey in the jemalloc
([je_defrag_hint](9f8185f5c8/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h (L382)))
library. This enables valkey to use the latest vanilla jemalloc without the
need to maintain code changes across jemalloc versions.

This change requires some modifications because the new api provides
only the information, not a yes/no defrag decision; that logic needs to be
implemented in valkey code. Additionally, the api does not provide,
within a single call, all the information needed to make a decision; that
information is available through an additional api call. To reduce the
calls to jemalloc, in this PR the required information is collected
during `computeDefragCycles` and not for every single ptr, this way
we avoid the additional api call per pointer.
Followup work will utilize the new options that are now open and will
further improve the defrag decision and process.

### Added files: 

`allocator_defrag.c` / `allocator_defrag.h` - These files implement the
allocator-specific knowledge for making the defrag decision. The knowledge
about slabs, allocation logic and so on all goes into these files.
This improves the separation between jemalloc-specific code and other
possible implementations.


### Moved functions: 

[`zmalloc_no_tcache` , `zfree_no_tcache`
](4593dc2f05/src/zmalloc.c (L215))
- these embed very jemalloc-specific assumptions and are very
specific to how we defrag with jemalloc. This is also with the vision
that, from a performance perspective, we should consider using tcache; we
only need to make sure we don't recycle entries without going through
the arena [for example: we can use a private tcache, one for free and one
for alloc].
`frag_smallbins_bytes` - the logic and implementation moved to the new
file.

### Existing API:

* [once a second + when completed full cycle]
[`computeDefragCycles`](4593dc2f05/src/defrag.c (L916))
* `zmalloc_get_allocator_info` : gets from jemalloc _allocated, active,
resident, retained, muzzy_, `frag_smallbins_bytes`
*
[`frag_smallbins_bytes`](4593dc2f05/src/zmalloc.c (L690))
: for each bin; gets from jemalloc bin_info, `curr_regs`, `cur_slabs`
* [during defrag, for each pointer]
* `je_defrag_hint` takes a memory pointer and returns {0,1}.
[Internally it
uses](4593dc2f05/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h (L368))
these data points:
        * #`nonfull_slabs`
        * #`total_slabs`
        * #free regs in the ptr slab

## Jemalloc API (via ctl interface)


[BATCH][`experimental_utilization_batch_query_ctl`](4593dc2f05/deps/jemalloc/src/ctl.c (L4114))
: gets an array of pointers, returns for each pointer 3 values,

* number of free regions in the extent
* number of regions in the extent
* size of the extent in terms of bytes


[EXTENDED][`experimental_utilization_query_ctl`](4593dc2f05/deps/jemalloc/src/ctl.c (L3989))
:

* memory address of the extent a potential reallocation would go into
* number of free regions in the extent
* number of regions in the extent
* size of the extent in terms of bytes
* [stats-enabled]total number of free regions in the bin the extent
belongs to
* [stats-enabled]total number of regions in the bin the extent belongs
to

### `experimental_utilization_batch_query_ctl` vs valkey
`je_defrag_hint`?
[good]
- We can query pointers in a batch, reducing the overall overhead
- The per-ptr decision algorithm is not within the jemalloc api; jemalloc
only provides information, so valkey can tune/configure/optimize it easily

[bad]
- In the batch API we only know the utilization of the slab (of that
memory ptr), we don’t get the data about #`nonfull_slabs` and total
allocated regs.


## New functions:
1. `defrag_jemalloc_init`: reduces the cost of calls to je_ctl by using the
[MIB interface](https://jemalloc.net/jemalloc.3.html) to get faster
calls (a sketch of this MIB lookup follows this list). See this quote from
the jemalloc documentation:

    The mallctlnametomib() function provides a way to avoid repeated name
    lookups for applications that repeatedly query the same portion of the
    namespace, by translating a name to a “Management Information Base”
    (MIB) that can be passed repeatedly to mallctlbymib().

2. `jemalloc_sz2binind_lgq*`: this api supports a reverse map between a
bin size and its info without a lookup. This mapping depends on
the number of size classes we have, which are derived from
[`lg_quantum`](4593dc2f05/deps/Makefile (L115))
3. `defrag_jemalloc_get_frag_smallbins`: this function replaces
`frag_smallbins_bytes`; the logic moved to the new file allocator_defrag.
`defrag_jemalloc_should_defrag_multi` → `handle_results` - unpacks the
results.
4. `should_defrag`: implements the same logic as the existing
implementation
[inside](9f8185f5c8/deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h (L382))
je_defrag_hint.
5. `defrag_jemalloc_should_defrag_multi`: implements the hint for an
array of pointers, utilizing the new batch api. Currently only 1 pointer
is passed.
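
As a standalone, hedged illustration of the MIB interface mentioned in item 1
(unprefixed jemalloc names are used here; valkey's bundled jemalloc exposes
them with a je_ prefix, e.g. je_mallctlnametomib):
```c
/* Resolve the ctl name once with mallctlnametomib(), then query it
 * repeatedly with mallctlbymib() without paying the name-lookup cost. */
#include <stdio.h>
#include <jemalloc/jemalloc.h>

int main(void) {
    size_t mib[3];
    size_t miblen = sizeof(mib) / sizeof(mib[0]);
    if (mallctlnametomib("stats.allocated", mib, &miblen) != 0) return 1;

    size_t allocated, sz = sizeof(allocated);
    /* This call can now be repeated with no string parsing involved. */
    if (mallctlbymib(mib, miblen, &allocated, &sz, NULL, 0) != 0) return 1;
    printf("allocated: %zu bytes\n", allocated);
    return 0;
}
```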


### Logical differences:

In order to get the information about #`nonfull_slabs` and #`regs`, we
use the query cycle to collect the information per size class. To
find the index of the bin information for a given bin size in O(1), we use
`jemalloc_sz2binind_lgq*`.
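
For illustration only (this is not the exact algorithm in allocator_defrag.c),
a decision of this shape can be built from the per-bin stats gathered once per
cycle plus the per-pointer utilization returned by the batch query:
```c
#include <stddef.h>

/* Per size class, collected once per computeDefragCycles (names illustrative). */
typedef struct {
    size_t curr_slabs;    /* total slabs in this size class */
    size_t nonfull_slabs; /* slabs with at least one free region */
    size_t curr_regs;     /* currently allocated regions in this size class */
} bin_stats_sketch;

/* nfree/nregs come from the batch utilization query for the pointer's slab. */
static int should_defrag_sketch(const bin_stats_sketch *bin, size_t nfree, size_t nregs) {
    if (nfree == 0) return 0;              /* slab is full: nothing to gain */
    if (bin->nonfull_slabs <= 1) return 0; /* nowhere else to move regions to */
    /* move out of slabs that are emptier than the bin-wide average utilization */
    size_t used = nregs - nfree;
    return used * bin->curr_slabs < bin->curr_regs;
}
```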


## Testing
This is the first draft. I did some initial testing that basically creates
fragmentation by reducing max memory and then waits for defrag to
reach the desired level. The test only serves as a sanity check that defrag
eventually succeeds; no data is provided here regarding efficiency and
performance.

### Test: 
1. disable `activedefrag`
2. run valkey benchmark on overlapping address ranges with different
block sizes
3. wait until `used_memory` reaches 10GB
4. set `maxmemory` to 5GB and `maxmemory-policy` to `allkeys-lru`
5. stop load
6. wait for `mem_fragmentation_ratio` to reach 2
7. enable `activedefrag` - start test timer
8. wait until `mem_fragmentation_ratio` reaches 1.1

#### Results*:
(With this PR) Test results: `56 sec`
(Without this PR) Test results: `67 sec`

*both runs perform the same "work" (number of buffers moved) to reach the
fragmentation target

Next benchmarking is to compare to:
- DONE // existing `je_get_defrag_hint` 
- compare with naive defrag all: `int defrag_hint() {return 1;}`

---------

Signed-off-by: Zvi Schneider <ezvisch@amazon.com>
Signed-off-by: Zvi Schneider <zvi.schneider22@gmail.com>
Signed-off-by: zvi-code <54795925+zvi-code@users.noreply.github.com>
Co-authored-by: Zvi Schneider <ezvisch@amazon.com>
Co-authored-by: Zvi Schneider <zvi.schneider22@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-11-21 16:29:21 -08:00
xbasel
b486a41500
Preserve original fd blocking state in TLS I/O operations (#1298)
This change prevents unintended side effects on connection state and
improves consistency with non-TLS sync operations.

For example, when invoking `connTLSSyncRead` with a blocking file
descriptor, the mode is switched to non-blocking upon `connTLSSyncRead`
exit. If the code assumes the file descriptor remains blocking and calls
the normal `read` expecting it to block, it may result in a short read.

This caused a crash in dual-channel, which was fixed in this PR by
relocating `connBlock()`:
https://github.com/valkey-io/valkey/pull/837
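
The underlying pattern, sketched in plain POSIX terms (this is not valkey's
connection code): remember the fd's original blocking mode and restore it
before returning from the sync helper.
```c
#include <fcntl.h>
#include <unistd.h>

static ssize_t sync_read_preserving_mode(int fd, void *buf, size_t len) {
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1) return -1;
    int was_blocking = !(flags & O_NONBLOCK);

    if (was_blocking) fcntl(fd, F_SETFL, flags | O_NONBLOCK); /* temporary switch */
    ssize_t n = read(fd, buf, len);                           /* the sync operation */
    if (was_blocking) fcntl(fd, F_SETFL, flags);              /* restore original mode */
    return n;
}
```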

Signed-off-by: xbasel <103044017+xbasel@users.noreply.github.com>
2024-11-21 18:22:16 +02:00
Binbin
6038eda010
Make FUNCTION RESTORE FLUSH flush async based on lazyfree-lazy-user-flush (#1254)
FUNCTION RESTORE has a FLUSH option that deletes all the existing
libraries before restoring the payload. If for some reason there are
a lot of libraries, we will block for a while here.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-11-21 21:02:05 +08:00
Binbin
f553ccbda6
Use goto to cleanup error handling in readSyncBulkPayload (#1332)
The goto error label does the same thing as the error return, so use goto
to reduce the duplicated error handling.
```
error:
    cancelReplicationHandshake(1);
    return;
```

Also, this can make the log printing more continuous in the error
case, that is, we print the error log first, and then print
the reconnecting log at the end (in cancelReplicationHandshake).

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-11-21 20:01:30 +08:00
Yanqi Lv
4986310945
Import-mode: Avoid expiration and eviction during data syncing (#1185)
New config: `import-mode (yes|no)`

New command: `CLIENT IMPORT-SOURCE (ON|OFF)`

The config, when set to `yes`, disables eviction and deletion of expired
keys, except for commands coming from a client which has marked itself
as an import source (the data source when importing data from another
node) using the CLIENT IMPORT-SOURCE command.

When we sync data from the source Valkey to the destination Valkey using
some sync tools like
[redis-shake](https://github.com/tair-opensource/RedisShake), the
destination Valkey can perform expiration and eviction, which may cause
data corruption. This problem has been discussed in
https://github.com/redis/redis/discussions/9760#discussioncomment-1681041
and Redis already has a solution. But in Valkey we haven't fixed it until
now.

E.g. we call `set key 1 ex 1` on the source server and transfer this
command to the destination server. Then we call `incr key` on the source
server before the key expires, so we have a key on the source server
with a value of 2. But when the command arrives at the destination
server, the key may have expired and been deleted. So we end up with a key on
the destination server with a value of 1, which is inconsistent with the
source server.

In standalone mode, we can use a writable replica to simplify the sync
process. However, in cluster mode, we still need a sync tool to help us
transfer the source data to the destination. The sync tool usually works
as a normal client and the destination works as a primary which keeps
performing expiration and eviction.

In this PR, we add a new mode named 'import-mode'. In this mode, the server
stops expiration and eviction just like a replica. Notice that this mode
is meant to be enabled only while syncing, to avoid data inconsistency
caused by expiration and eviction. Import mode only takes effect on the primary.
Sync tools can mark their clients as an import source via `CLIENT
IMPORT-SOURCE`, which works like a client from a primary and can visit
expired keys in `lookupkey`.

**Notice: during the migration, other clients, apart from the import
source, should not access the data imported by import source.**

---------

Signed-off-by: lvyanqi.lyq <lvyanqi.lyq@alibaba-inc.com>
Signed-off-by: Yanqi Lv <lvyanqi.lyq@alibaba-inc.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-11-19 21:53:19 +01:00
Binbin
ee386c92ff
Manual failover vote is not limited by two times the node timeout (#1305)
This limit should not restrict manual failover, otherwise in some
scenarios, manual failover will time out.

For example, if some FAILOVER_AUTH_REQUESTs or some FAILOVER_AUTH_ACKs
are lost during a manual failover, a primary cannot vote in the second manual
failover. Or in a mixed scenario of plain failover and manual failover,
it cannot vote for the subsequent manual failover.

The problem with retrying the manual failover is that a manual failover will
pause clients for 5s on the primary side. So retrying every time a manual
failover times out is a bad move.

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-11-19 11:17:20 -05:00
Binbin
132798b57d
Receipt of REPLCONF VERSION reply should be triggered by event (#1320)
This adds the missing return when repl_state changes to RECEIVE_VERSION_REPLY;
this way we won't be blocked if the primary doesn't reply to REPLCONF
VERSION.

In practice I guess this is not likely to block in this context, since
small responses are likely to be received in one packet, so this
is just a cleanup (consistent with the previous state machine
processing).

Also update the state machine diagram to mention the VERSION reply.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-11-19 23:42:50 +08:00
Seungmin Lee
3d0c834203
Fix LRU crash when getting too many random lua scripts (#1310)
### Problem
Valkey stores scripts in a dictionary (lua_scripts) keyed by their SHA1
hashes, but it needs a way to know which scripts are least recently
used. It uses an LRU list (lua_scripts_lru_list) to keep track of
scripts in usage order. When the list reaches a maximum length, Valkey
evicts the oldest scripts to free memory in both the list and
dictionary. The problem here is that the sds from the LRU list can be
pointing to already freed/moved memory by active defrag that the sds in
the dictionary used to point to. It results in an assertion error at [this
line](https://github.com/valkey-io/valkey/blob/unstable/src/eval.c#L519)

### Solution
If we duplicate the sds when adding it to the LRU list, we can create an
independent copy of the script identifier (sha). This duplication
ensures that the sha string in the LRU list remains stable and
unaffected by any defragmentation that could alter or free the original
sds. In addition, dictUnlink doesn't require an exact pointer
match ([ref](https://github.com/valkey-io/valkey/blob/unstable/src/eval.c#L71-L78)),
so it is valid to unlink the right dictEntry with the copy
of the sds.
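
A self-contained sketch of the technique, with plain strdup() and a toy list
standing in for sdsdup() and the real lua_scripts_lru_list:
```c
#include <stdlib.h>
#include <string.h>

struct lru_node_sketch {
    char *sha;                    /* independent copy, owned by the list */
    struct lru_node_sketch *next;
};

/* The LRU list never aliases the dictionary's key, so active defrag moving
 * or freeing the dict's copy cannot invalidate the list entry. */
static struct lru_node_sketch *lru_push_sketch(struct lru_node_sketch *head,
                                               const char *sha_in_dict) {
    struct lru_node_sketch *n = malloc(sizeof(*n));
    if (n == NULL) return head;
    n->sha = strdup(sha_in_dict); /* duplicate instead of storing the dict's pointer */
    n->next = head;
    return n;
}
```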

### Reproduce
To reproduce it with tcl test:
1. Disable je_get_defrag_hint in defrag.c to trigger defrag often
2. Execute test script
```
start_server {tags {"auth external:skip"}} {

    test {Regression for script LRU crash} {
        r config set activedefrag yes
        r config set active-defrag-ignore-bytes 1
        r config set active-defrag-threshold-lower 0
        r config set active-defrag-threshold-upper 1
        r config set active-defrag-cycle-min 99
        r config set active-defrag-cycle-max 99

        for {set i 0} {$i < 100000} {incr i} {
            r eval "return $i" 0
        }
        after 5000;
    }
}
```


### Crash info
Crash report:
```
=== REDIS BUG REPORT START: Cut & paste starting from here ===
14044:M 12 Nov 2024 14:51:27.054 # === ASSERTION FAILED ===
14044:M 12 Nov 2024 14:51:27.054 # ==> eval.c:556 'de' is not true

------ STACK TRACE ------

Backtrace:
/usr/bin/redis-server 127.0.0.1:6379 [cluster](luaDeleteFunction+0x148)[0x723708]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](luaCreateFunction+0x26c)[0x724450]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](evalCommand+0x2bc)[0x7254dc]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](call+0x574)[0x5b8d14]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](processCommand+0xc84)[0x5b9b10]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](processCommandAndResetClient+0x11c)[0x6db63c]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](processInputBuffer+0x1b0)[0x6dffd4]
/usr/bin/redis-server 127.0.0.1:6379 [cluster][0x6bd968]
/usr/bin/redis-server 127.0.0.1:6379 [cluster][0x659634]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](amzTLSEventHandler+0x194)[0x6588d8]
/usr/bin/redis-server 127.0.0.1:6379 [cluster][0x750c88]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](aeProcessEvents+0x228)[0x757fa8]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](redisMain+0x478)[0x7786b8]
/lib64/libc.so.6(__libc_start_main+0xe4)[0xffffa7763da4]
/usr/bin/redis-server 127.0.0.1:6379 [cluster][0x5ad3b0]
```
Defrag info:
```
mem_fragmentation_ratio:1.18
mem_fragmentation_bytes:47229992
active_defrag_hits:20561
active_defrag_misses:5878518
active_defrag_key_hits:77
active_defrag_key_misses:212
total_active_defrag_time:29009
```

### Test:
Run the test script to push 100,000 scripts to ensure the LRU list keeps
its maximum length of 500 without any crash.
```
27489:M 14 Nov 2024 20:56:41.583 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.583 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
[ok]: Regression for script LRU crash (6811 ms)
[1/1 done]: unit/test (7 seconds)
```

---------

Signed-off-by: Seungmin Lee <sungming@amazon.com>
Signed-off-by: Seungmin Lee <155032684+sungming2@users.noreply.github.com>
Co-authored-by: Seungmin Lee <sungming@amazon.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
2024-11-18 18:06:35 -08:00
Seungmin Lee
f9d0b87622
Upgrade macos-12 to macos-13 in workflows (#1318)
### Problem
GitHub Actions is starting the deprecation process for macOS 12.
Deprecation will begin on 10/7/24 and the image will be fully
unsupported by 12/3/24.
For more details, see
https://github.com/actions/runner-images/issues/10721

Signed-off-by: Seungmin Lee <sungming@amazon.com>
Co-authored-by: Seungmin Lee <sungming@amazon.com>
2024-11-18 18:00:30 -08:00
Amit Nagler
c5012cc630
Optimize RDB load performance and fix cluster mode resizing on replica side (#1199)
This PR addresses two issues:

1. Performance Degradation Fix - Resolves a significant performance
issue during RDB load on replica nodes.
- The problem was causing replicas to rehash multiple times during the
load process. Local testing demonstrated up to 50% degradation in BGSAVE
time.
- The problem occurs when the replica tries to expand pre-created slot
dictionaries. This operation fails quietly, resulting in undetected
performance issues.
- This fix aims to optimize the RDB load process and restore expected
performance levels.

2. Bug fix when reading `RDB_OPCODE_RESIZEDB` in Valkey 8.0 cluster
mode:
- Use the shard's master slots count when processing this opcode, as
`clusterNodeCoversSlot` is not initialized for the currently syncing
replica.
- Previously, this problem went unnoticed because `RDB_OPCODE_RESIZEDB`
had no practical impact (due to 1).

These improvements will enhance overall system performance and ensure
smoother upgrades to Valkey 8.0 in the future.

Testing:
- Conducted local tests to verify the performance improvement during RDB
load.
- Verified that ignoring `RDB_OPCODE_RESIZEDB` does not negatively
impact functionality in the current version.

Signed-off-by: naglera <anagler123@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
2024-11-18 19:09:35 +08:00
Binbin
d07674fc01
Fix sds unittest tests to check for zmalloc_usable_size (#1314)
s_malloc_size == zmalloc_size. Currently sdsAllocSize does not
account for PREFIX_SIZE when no malloc_size is available, which causes
test_typesAndAllocSize to fail in the new unittest; what we want to
check is actually zmalloc_usable_size.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-11-18 14:55:26 +08:00
uriyage
94113fde7f
Improvements for TLS with I/O threads (#1271)
Main thread profiling revealed significant overhead in TLS operations,
even with read/write offloaded to I/O threads:

Perf results:

**10.82%** 8.82% `valkey-server libssl.so.3 [.] SSL_pending` # Called by
main thread after I/O completion

**10.16%** 5.06% `valkey-server libcrypto.so.3 [.] ERR_clear_error` #
Called for every event regardless of thread handling

This commit further optimizes TLS operations by moving more work from
the main thread to I/O threads:

Improve TLS offloading to I/O threads with two main changes (a sketch of both follows the list):

1. Move `ERR_clear_error()` calls closer to SSL operations
   - Currently, error queue is cleared for every TLS event
   - Now only clear before actual SSL function calls
   - This prevents unnecessary clearing in main thread when operations
     are handled by I/O threads

2. Optimize `SSL_pending()` checks
   - Add `TLS_CONN_FLAG_HAS_PENDING` flag to track pending data
   - Move pending check to follow read operations immediately
   - I/O thread sets flag when pending data exists
   - Main thread uses flag to update pending list
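
A hedged sketch of both changes (the struct and the flag name below are
illustrative, not the actual tls.c code):
```c
#include <openssl/ssl.h>
#include <openssl/err.h>

typedef struct {
    SSL *ssl;
    int has_pending; /* set by the I/O thread, read by the main thread */
} tls_conn_sketch;

static int tls_read_sketch(tls_conn_sketch *conn, void *buf, int len) {
    ERR_clear_error();                /* clear only right before the SSL call */
    int n = SSL_read(conn->ssl, buf, len);
    /* Check for buffered TLS data immediately after the read, in the same
     * thread, so the main thread can consult the flag instead of calling
     * SSL_pending() itself. */
    conn->has_pending = (n > 0 && SSL_pending(conn->ssl) > 0);
    return n;
}
```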

Performance improvements:
Testing setup based on
https://valkey.io/blog/unlock-one-million-rps-part2/

Before:
- SET: 896,047 ops/sec
- GET: 875,794 ops/sec

After:
- SET: 985,784 ops/sec (+10% improvement)
- GET: 1,066,171 ops/sec (+22% improvement)

Signed-off-by: Uri Yagelnik <uriy@amazon.com>
2024-11-17 21:52:35 -08:00
Binbin
aa2dd3ecb8
Stabilize replica migration test to make sure cluster config is consistent (#1311)
CI reported this failure:
```
[exception]: Executing test client: MOVED 1 127.0.0.1:22128.
MOVED 1 127.0.0.1:22128
    while executing
"wait_for_condition 1000 50 {
            [R 3 get key_991803] == 1024 && [R 3 get key_977613] == 10240 &&
            [R 4 get key_991803] == 1024 && ..."
```

This may be because, even though the cluster state becomes OK,
the cluster still has an inconsistent configuration for a short period
of time. We make sure to wait for the config to be consistent.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-11-16 18:58:25 +08:00
Binbin
86f33ea2b0
Unprotect rdb channel when bgsave child fails in dual channel replication (#1297)
If bgsaveerr indicates an error, there is no need to protect the rdb channel.
The impact of the old behavior is that when bgsave fails, we protect
the rdb channel for 60s. It may hold the reference of the repl
buf block, making it impossible to recycle it until we free the
client due to COB or free the client after 60s.

We kept the RDB channel open as long as the replica hadn't established
a main connection, even if the snapshot process failed. There is no
value in keeping the RDB client in this case.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-11-15 16:48:13 +08:00
Binbin
92181b6797
Fix primary crash when processing dirty slots during shutdown wait / failover wait / client pause (#1131)
We have an assert in propagateNow. If the primary node receives a
CLUSTER UPDATE, such as dirty slots, while waiting on SIGTERM, during
a manual failover pause, or during a client pause, the delKeysInSlot
call will trigger this assert and cause the primary to crash.

To handle this case, we added a new server_del_keys_in_slot state, just like
client_pause_in_transaction, to track the state and avoid the assert
in propagateNow; the dirty slots will be deleted in the end without
affecting data consistency.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-11-15 16:47:15 +08:00
Binbin
4e2493e5c9
Kill diskless fork child asap when the last replica drop (#1227)
We originally checked the replica connection to decide whether to kill the
diskless child only when rdbPipeReadHandler is triggered. Actually
we can check it when the replica is disconnected, so that we don't
have to wait for rdbPipeReadHandler to be triggered and can kill
the diskless fork child as soon as possible.

In this way, when the child or rdbPipeReadHandler is stuck for some
reason, we can kill the child faster and release the fork resources.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-11-15 16:34:32 +08:00
Binbin
d3f3b9cc3a
Fix daily valgrind build with unit tests (#1309)
This was introduced in #515.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-11-15 14:27:28 +08:00
bentotten
b9994030e9
Log clusterbus handshake timeout failures (#1247)
This adds a log when a handshake fails due to a timeout. This can help
troubleshoot cluster asymmetry issues caused by failed MEETs.

---------

Signed-off-by: Ben Totten <btotten@amazon.com>
Signed-off-by: bentotten <59932872+bentotten@users.noreply.github.com>
Co-authored-by: Ben Totten <btotten@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-11-14 20:48:48 -08:00
Qu Chen
32f7541fe3
Simplify dictType callbacks and move some macros from dict.h to dict.c (#1281)
Remove the dict pointer argument to the `dictType` callbacks `keyDup`,
`keyCompare`, `keyDestructor` and `valDestructor`. This argument was
unused in all of the callback implementations.

The macros `dictFreeKey()` and `dictFreeVal()` are made internal to dict
and moved from dict.h to dict.c. They're also changed from macros to
static inline functions.
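
An illustrative sketch of the resulting shape (not a verbatim copy of dict.h):
```c
/* Callbacks no longer receive the dict pointer, since no implementation used it. */
typedef struct dictTypeSketch {
    void *(*keyDup)(const void *key);
    int (*keyCompare)(const void *key1, const void *key2);
    void (*keyDestructor)(void *key);
    void (*valDestructor)(void *val);
} dictTypeSketch;

/* dictFreeVal()-style helper as a static inline function instead of a macro. */
static inline void dictFreeValSketch(const dictTypeSketch *t, void *val) {
    if (t->valDestructor) t->valDestructor(val);
}
```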

Signed-off-by: Qu Chen <quchen@amazon.com>
2024-11-14 09:45:47 +01:00
Parth
863d312803
Fix link-time optimization to work correctly for unit tests (i.e. -flto flag) (#1290) (#1296)
* We compile various C files into objects and package them into a library
(.a file) using ar to feed to the unit tests. With new GCC versions, the
objects inside such a library don't participate in the LTO process without
additional flags.
* Here is a direct quote from gcc documentation explaining this issue:
"If you are not using a linker with plugin support and/or do not enable
the linker plugin, then the objects inside libfoo.a are extracted and
linked as usual, but they do not participate in the LTO optimization
process. In order to make a static library suitable for both LTO
optimization and usual linkage, compile its object files with
-flto -ffat-lto-objects."
* Read full documentation about -flto at
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
* Without this additional flag, I get the following errors while executing
"make test-unit". With this change, those errors go away.

```
ARCHIVE libvalkey.a
ar: threads_mngr.o: plugin needed to handle lto object
...
..
.
/tmp/ccDYbMXL.ltrans0.ltrans.o: In function `dictClear':
/local/workplace/elasticache/valkey/src/unit/../dict.c:776: undefined
reference to `valkey_free'
/local/workplace/elasticache/valkey/src/unit/../dict.c:770: undefined
reference to `valkey_free'
/tmp/ccDYbMXL.ltrans0.ltrans.o: In function `dictGetVal':
```

Fixes #1290

---------

Signed-off-by: Parth Patel <661497+parthpatel@users.noreply.github.com>
2024-11-13 21:50:55 -08:00
skyfirelee
4a9864206f
Migrate quicklist unit test to new framework (#515)
Migrate quicklist unit test to new unit test framework, and cleanup
remaining references of SERVER_TEST, parent ticket #428.

Closes #428.

Signed-off-by: artikell <739609084@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
2024-11-14 10:37:44 +08:00
Binbin
6fba747c39
Fix log printing always shows the role as child under daemonize (#1301)
In #1282, we init server.pid earlier to keep the log message role
consistent, but we forgot to consider daemonize. In daemonize
mode, we would always print the child role.

We need to reset server.pid after daemonize(), otherwise the
log printing role will always be the child. It also causes an
incorrect server.pid value, affecting the concatenation of
some pid names.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-11-14 10:26:23 +08:00
Binbin
2df56d87c0
Fix empty primary may have dirty slots data due to bad migration (#1285)
If we become an empty primary for some reason, we still need to
check if we need to delete dirty slots, because we may have dirty
slots data left over from a bad migration. For example, the target node forcibly
executes CLUSTER SETSLOT NODE to take over the slot without
performing key migration.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-11-11 22:13:47 +08:00
Binbin
a2d22c63c0
Fix replica not able to initiate election in time when epoch fails (#1009)
If multiple primary nodes go down at the same time, their replica nodes will
initiate elections at the same time. There is a certain probability that
the replicas will initiate the elections in the same epoch.

And obviously, in our current election mechanism, only one replica node can
eventually get enough votes; the other replica node will fail to win
due to the insufficient majority, and then its election will time out and
we will wait for the retry, which results in a long failure time.

If another node has won the election in the failover epoch, we can assume
that our election has failed and we can retry as soon as possible.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-11-11 22:12:49 +08:00
Binbin
167e8ab8de
Trigger the election immediately when doing a manual failover (#1081)
Currently, when a manual failover is triggered, we set
CLUSTER_TODO_HANDLE_FAILOVER to start the election as soon as
possible in the next beforeSleep. But in fact, we don't delay
the election in a manual failover, and waiting for the next beforeSleep
to kick in will delay the election by some milliseconds.

We can trigger the election immediately in this case, in the
same function call, without waiting for beforeSleep, which
saves us some milliseconds.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-11-11 21:43:46 +08:00
Binbin
4aacffa32d
Stabilize dual replication test to avoid getting LOADING error (#1288)
When doing `$replica replicaof no one`, we may get a LOADING
error. This is because during the test execution the replica
may reconnect very quickly, the full sync is initiated,
and the replica has entered the LOADING state.

In this commit, we make sure the primary is paused after the
fork, so the replica won't enter the LOADING state. With
this fix, the test seems more natural and predictable.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-11-11 21:42:34 +08:00
Qu Chen
9300a7ebc8
Set fields to NULL after free in freeClient() (#1279)
Null out several references after freeing the object in `freeClient()`.

This is just to make the code safer, to protect against
use-after-free in future changes.

Signed-off-by: Qu Chen <quchen@amazon.com>
2024-11-11 10:39:48 +01:00
zixuan zhao
0b5b2c7484
Log as primary role (M) instead of child process (C) during startup (#1282)
Init server.pid earlier to keep log message role consistent.

Closes #1206.

Before:
```text
24881:C 21 Oct 2024 21:10:57.165 * oO0OoO0OoO0Oo Valkey is starting oO0OoO0OoO0Oo
24881:C 21 Oct 2024 21:10:57.165 * Valkey version=255.255.255, bits=64, commit=814e0f55, modified=1, pid=24881, just started
24881:C 21 Oct 2024 21:10:57.165 * Configuration loaded
24881:M 21 Oct 2024 21:10:57.167 * Increased maximum number of open files to 10032 (it was originally set to 1024).
```
After:
```text
68560:M 08 Nov 2024 16:10:12.257 * oO0OoO0OoO0Oo Valkey is starting oO0OoO0OoO0Oo
68560:M 08 Nov 2024 16:10:12.257 * Valkey version=255.255.255, bits=64, commit=45d596e1, modified=1, pid=68560, just started
68560:M 08 Nov 2024 16:10:12.257 * Configuration loaded
68560:M 08 Nov 2024 16:10:12.258 * monotonic clock: POSIX clock_gettime
```

Signed-off-by: azuredream <zhaozixuan67@gmail.com>
2024-11-11 10:33:26 +01:00
zhenwei pi
45d596e121
RDMA: Use conn ref counter to prevent double close (#1250)
RDMA: Use connection reference counter style

The reference counter of the connection is used to protect against re-entry of the close method.
Use this style instead of the unsafe one.

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
2024-11-08 09:33:01 +01:00
Jacob Murphy
e972d56460
Make sure to copy null terminator byte in dual channel code (#1272)
As @madolson pointed out, these do have proper null terminators. This
cleans them up to follow the rest of the code which copies the last byte
explicitly, which should help reduce cognitive load and make it more
resilient should code refactors occur (e.g. non-static allocation of
memory, changes to other functions).
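
The convention in question, as a tiny standalone sketch:
```c
#include <string.h>

/* Copy exactly len bytes of the source and write the terminator explicitly,
 * so correctness does not depend on how the source buffer was allocated. */
static void copy_cstring_sketch(char *dst, size_t dst_size, const char *src) {
    if (dst_size == 0) return;
    size_t len = strlen(src);
    if (len >= dst_size) len = dst_size - 1; /* leave room for the terminator */
    memcpy(dst, src, len);
    dst[len] = '\0';                         /* explicit null terminator byte */
}
```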

---------

Signed-off-by: Jacob Murphy <jkmurphy@google.com>
2024-11-07 18:25:43 -08:00
eifrah-aws
07b3e7ae7a
Add CMake build system for valkey (#1196)
With this commit, users are able to build valkey using `CMake`.

## Example usage:

Build `valkey-server` in Release mode with TLS enabled and using
`jemalloc` as the allocator:

```bash
mkdir build-release
cd $_
cmake .. -DCMAKE_BUILD_TYPE=Release \
         -DCMAKE_INSTALL_PREFIX=/tmp/valkey-install \
         -DBUILD_MALLOC=jemalloc -DBUILD_TLS=1
make -j$(nproc) install

# start valkey
/tmp/valkey-install/bin/valkey-server
```

Build `valkey-unit-tests`:

```bash
mkdir build-release-ut
cd $_
cmake .. -DCMAKE_BUILD_TYPE=Release \
         -DBUILD_MALLOC=jemalloc -DBUILD_UNIT_TESTS=1
make -j$(nproc)

# Run the tests
./bin/valkey-unit-tests 
```

Current features supported by this PR:

- Building against different allocators: (`jemalloc`, `tcmalloc`,
`tcmalloc_minimal` and `libc`), e.g. to enable `jemalloc` pass
`-DBUILD_MALLOC=jemalloc` to `cmake`
- OpenSSL builds (to enable TLS, pass `-DBUILD_TLS=1` to `cmake`)
- Sanitizer: pass `-DBUILD_SANITIZER=<address|thread|undefined>` to
`cmake`
- Install target + redis symbolic links
- Build `valkey-unit-tests` executable
- Standard CMake variables are supported. e.g. to install `valkey` under
`/home/you/root` pass `-DCMAKE_INSTALL_PREFIX=/home/you/root`

Why use `CMake`? To list *some* of the advantages of using `CMake`:

- Superior IDE integration: cmake generates the file
`compile_commands.json`, which is required by `clangd` to get
compiler-accurate code completion (in other words: your VSCode will thank you)
- Out-of-source build tree: with the current build system, object
files are created all over the place, polluting the source tree;
the best practice is to build the project in a separate folder
- Multiple build types co-existing: with the current build system, it is
often hard to have multiple build configurations. With cmake you can do
it easily
- It is the de-facto standard for C/C++ projects these days

More build examples: 

ASAN build:

```bash
mkdir build-asan
cd $_
cmake .. -DBUILD_SANITIZER=address -DBUILD_MALLOC=libc
make -j$(nproc)
```

ASAN with jemalloc:

```bash
mkdir build-asan-jemalloc
cd $_
cmake .. -DBUILD_SANITIZER=address -DBUILD_MALLOC=jemalloc 
make -j$(nproc)
```

As seen by the previous examples, any combination is allowed and
co-exist on the same source tree.

## Valkey installation

With this new `CMake`, it is possible to install the binary by running
`make install` or to create a package with `make package` (currently supported
on Debian-like distros)

### Example 1: build & install using `make install`:

```bash
mkdir build-release
cd $_
cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/valkey-install -DCMAKE_BUILD_TYPE=Release
make -j$(nproc) install
# valkey is now installed under $HOME/valkey-install
```

### Example 2: create a `.deb` installer:

```bash
mkdir build-release
cd $_
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc) package
# ... CPack deb generation output
sudo gdebi -n ./valkey_8.1.0_amd64.deb
# valkey is now installed under /opt/valkey
```

### Example 3: create installer for non Debian systems (e.g. FreeBSD or
macOS):

```bash
mkdir build-release
cd $_
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc) package
mkdir -p /opt/valkey && ./valkey-8.1.0-Darwin.sh --prefix=/opt/valkey  --exclude-subdir
# valkey-server is now installed under /opt/valkey

```

Signed-off-by: Eran Ifrah <eifrah@amazon.com>
2024-11-07 18:01:37 -08:00
Wen Hui
3672f9b2c3
Revert "Decline unsubscribe related command in non-subscribed mode" (#1265)
This PR's goal is to revert the changes of PR
https://github.com/valkey-io/valkey/pull/759

Recently, we got some reports that in Valkey 8.0 the PR
https://github.com/valkey-io/valkey/pull/759 (Decline unsubscribe
related command in non-subscribed mode) causes a breaking change
(https://github.com/valkey-io/valkey/issues/1228).

Although in my opinion, calling the commands "unsubscribeCommand",
"sunsubscribeCommand" and "punsubscribeCommand" in request-response mode
makes no sense. This is why I created PR
https://github.com/valkey-io/valkey/pull/759

But a breaking change is never good. @valkey-io/core-team what do you
think about reverting this PR's code changes?

Signed-off-by: hwware <wen.hui.ware@gmail.com>
2024-11-07 20:05:16 -05:00
Binbin
1c18c80844
Fix incorrect cache_memory reset in functionsLibCtxClear (#1255)
functionsLibCtxClear should clear the provided lib_ctx parameter,
not the static variable curr_functions_lib_ctx, as this contradicts
the function's intended purpose.

The impact, I guess, is minor: in some unhappy paths (diskless load
fails, function restore fails?), we will mess up the functions_caches
field, which is used in the used_memory_functions / used_memory_scripts
fields in INFO.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-11-07 13:44:21 +08:00
Binbin
22bc49c4a6
Try to stabilize the failover call in the slot migration test (#1078)
The CI report replica will return the error when performing CLUSTER
FAILOVER:
```
-ERR Master is down or failed, please use CLUSTER FAILOVER FORCE
```

This may be because the primary state is fail or the cluster connection
is disconnected during the primary pause. In this PR, we added some
waits in wait_for_role: if the role is replica, we will wait for the
replication link and the cluster link to be ok.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-11-07 13:42:20 +08:00
Binbin
a0b1cbad83
Change errno from EEXIST to EALREADY in serverFork if child process exists (#1258)
We set this to EEXIST in 568c2e039bac388003068cd8debb2f93619dd462;
it prints "File exists", which is not quite accurate.
Change it to EALREADY, which will print "Operation already in progress".

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-11-07 12:13:00 +08:00
Binbin
12c5af03b8
Remove empty DB check branch in KEYS command (#1259)
We don't think we really care about optimizing for the empty DB case,
which should be uncommon. Adding branches hurts branch prediction.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-11-06 10:32:00 +08:00
Amit Nagler
48ebe21ad1
fix: clean up refactoring leftovers (#1264)
This commit addresses issues that were likely introduced during a rebase
related to:
b0f23df165

Change dual channel replication state in main handler only

Signed-off-by: naglera <anagler123@gmail.com>
2024-11-05 04:57:34 -08:00
Madelyn Olson
3c32ee1bda
Add a filter option to drop all cluster packets (#1252)
A minor debugging change that helped in the investigation of
https://github.com/valkey-io/valkey/issues/1251. Basically there are
some edge cases where we want to fully isolate a node from receiving
packets, but can't suspend the process because we need it to continue
sending outbound traffic. So, added a filter for that.

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2024-11-04 12:36:20 -08:00
Binbin
a102852d5e
Fix timing issue in cluster-shards tests (#1243)
The cluster-node-timeout is 3000 in our tests; the timing tests weren't
reliably succeeding, so extending the wait_for made them much more reliable.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-11-02 19:51:14 +08:00
Jim Brunner
0d7b2344b2
correct type internal to kvstore (minor) (#1246)
All of the internal variables related to number of dicts in the kvstore
are type `int`. Not sure why these 2 items were declared as `long long`.

Signed-off-by: Jim Brunner <brunnerj@amazon.com>
2024-11-01 15:16:18 -07:00
zhenwei pi
e985ead7f9
RDMA: Prevent IO for child process (#1244)
An RDMA MR (memory region) is not forkable; the VMA (virtual memory area)
of an MR becomes empty in a child process. Prevent IO for a child process to
avoid a server crash.

In the check for whether read and write is allowed in an RDMA
connection, a check that if we're in a child process is added. If we
are, the function returns an error, which will cause the RDMA client to
be disconnected.
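
A sketch of the guard with assumed names (the real check lives in the RDMA
connection type's read/write paths):
```c
#include <sys/types.h>
#include <unistd.h>

static pid_t rdma_owner_pid; /* set to getpid() when the RDMA listener is created */

/* In a forked child (e.g. a BGSAVE fork) the MR-backed VMAs are empty, so
 * refuse the I/O; the caller then errors out and the RDMA client is
 * disconnected instead of the server crashing. */
static int rdma_io_allowed_sketch(void) {
    return getpid() == rdma_owner_pid;
}
```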

Suggested-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
2024-11-01 13:28:09 +01:00
Madelyn Olson
1c222f77ce
Improve performance of sdssplitargs (#1230)
The current implementation of `sdssplitargs` does repeated `sdscatlen`
to build the parsed arguments, which isn't very efficient because it
does a lot of extra reallocations and moves through the sds code a lot.
It also typically results in memory overhead, because `sdscatlen`
over-allocates, which is usually not needed since args are usually not
modified after being created.

The new implementation of sdssplitargs does two passes, the first to
parse the argument to figure out the final length and the second to
actually copy the string. It's generally about 2x faster for larger
strings (~100 bytes), and about 20% faster for small strings (~10
bytes). This is generally faster since as long as everything is in the
CPU cache, it's going to be fast.
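
A simplified, self-contained sketch of the two-pass shape (splitting on spaces
only, no quoting rules, allocation failures mostly unhandled); this shows the
structure of the optimization, not the real sdssplitargs() parser:
```c
#include <stdlib.h>
#include <string.h>

static char **split_spaces_two_pass(const char *line, int *argc) {
    /* pass 1: count tokens so the output array is allocated exactly once */
    int count = 0;
    for (const char *p = line; *p;) {
        while (*p == ' ') p++;
        if (*p == '\0') break;
        count++;
        while (*p && *p != ' ') p++;
    }
    char **argv = malloc(sizeof(char *) * (size_t)count);
    if (argv == NULL) return NULL;

    /* pass 2: measure each token, then copy it with one exact-sized allocation */
    int i = 0;
    for (const char *p = line; *p;) {
        while (*p == ' ') p++;
        if (*p == '\0') break;
        const char *start = p;
        while (*p && *p != ' ') p++;
        size_t len = (size_t)(p - start);
        argv[i] = malloc(len + 1);
        memcpy(argv[i], start, len);
        argv[i][len] = '\0';
        i++;
    }
    *argc = count;
    return argv;
}
```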

There are a couple of sanity tests (none existed before), as well as some
fuzzing, which was used to find some bugs and also to do the
benchmarking. The original benchmarking code can be seen
6576aeb86a.

```
test_sdssplitargs_benchmark - unit/test_sds.c:530] Using random seed: 1729883235
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 56.44%, new:13039us, old:29930us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 56.58%, new:12057us, old:27771us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 59.18%, new:9048us, old:22165us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 54.61%, new:12381us, old:27278us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 51.17%, new:16012us, old:32793us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 49.18%, new:16041us, old:31563us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 58.40%, new:12450us, old:29930us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 56.49%, new:13066us, old:30031us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 58.75%, new:12744us, old:30894us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 52.44%, new:16885us, old:35504us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 62.57%, new:8107us, old:21659us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 62.12%, new:8320us, old:21966us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 45.23%, new:13960us, old:25487us
[test_sdssplitargs_benchmark - unit/test_sds.c:577] Improvement: 57.95%, new:9188us, old:21849us
```

---------

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2024-10-31 11:37:53 -07:00