futriix

Author	SHA1	Message	Date
Kyle Kim (kimkyle@)	e1d936b339	Add network-bytes-in and network-bytes-out metric support under CLUSTER SLOT-STATS command (#20 ) (#720 ) Adds two new metrics for per-slot statistics, network-bytes-in and network-bytes-out. The network bytes are inclusive of replication bytes but exclude other types of network traffic such as clusterbus traffic. #### network-bytes-in The metric tracks network ingress bytes under per-slot context, by reverse calculation of `c->argv_len_sum` and `c->argc`, stored under a newly introduced field `c->net_input_bytes_curr_cmd`. #### network-bytes-out The metric tracks network egress bytes under per-slot context, by hooking onto COB buffer mutations. #### sample response Both metrics are reported under the `CLUSTER SLOT-STATS` command. ``` 127.0.0.1:6379> cluster slot-stats slotsrange 0 0 1) 1) (integer) 0 2) 1) "key-count" 2) (integer) 0 3) "cpu-usec" 4) (integer) 0 5) "network-bytes-in" 6) (integer) 0 7) "network-bytes-out" 8) (integer) 0 ``` --------- Signed-off-by: Kyle Kim <kimkyle@amazon.com> Signed-off-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>	2024-07-26 16:06:16 -07:00
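The "reverse calculation" of ingress bytes from `c->argv_len_sum` and `c->argc` can be illustrated with a minimal sketch. It assumes the command arrived as a RESP multi-bulk request (an array of bulk strings); the helper names `resp_encode` and `estimate_input_bytes` are hypothetical and this is not the server's actual code.

```python
def resp_encode(args):
    """Encode a command as a RESP multi-bulk request (array of bulk strings)."""
    out = b"*%d\r\n" % len(args)
    for arg in args:
        out += b"$%d\r\n%s\r\n" % (len(arg), arg)
    return out


def estimate_input_bytes(args):
    """Reconstruct the wire size from the argument count and lengths only.

    Assumption: the request used the multi-bulk format, so the size is the
    payload (argv_len_sum) plus the fixed framing around each argument.
    """
    argc = len(args)
    argv_len_sum = sum(len(a) for a in args)
    # "*<argc>\r\n"
    header = 1 + len(str(argc)) + 2
    # "$<len>\r\n" before each argument and "\r\n" after it
    framing = sum(1 + len(str(len(a))) + 2 + 2 for a in args)
    return header + framing + argv_len_sum


command = [b"SET", b"key", b"value"]
print(estimate_input_bytes(command), len(resp_encode(command)))
```

The estimate matches the encoded length because every framing byte is determined by `argc` and the individual argument lengths.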
Binbin	59aa00823c	Replicas with the same offset queue up for election (#762 ) In some cases, such as read-heavy workloads, the replicas all have the same replication offset. When the primary fails, the replicas then have the same rank (rank == 0) and start their elections at the same time (despite the random delay of up to 500 ms). The simultaneous elections may fail because no replica reaches a quorum. In clusterGetReplicaRank, when we calculate the rank and the offsets are the same, the replica with the smaller node name now gets the better rank, which avoids this situation. --------- Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-07-22 23:43:16 -07:00
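The tie-break described for clusterGetReplicaRank can be sketched as follows. This is a simplified model, assuming each replica is a `(node_name, replication_offset)` pair; the function name `replica_rank` is hypothetical and the real function works on cluster node structs.

```python
def replica_rank(me, replicas):
    """Return the rank of `me` among `replicas` (0 is the best rank).

    A sibling outranks us if it has a larger replication offset, or the
    same offset and a lexicographically smaller node name.
    """
    my_name, my_offset = me
    rank = 0
    for name, offset in replicas:
        if offset > my_offset:
            rank += 1
        elif offset == my_offset and name < my_name:
            rank += 1
    return rank


replicas = [("aaa", 100), ("bbb", 100), ("ccc", 100)]
for replica in replicas:
    print(replica[0], replica_rank(replica, replicas))
```

With identical offsets each replica now gets a distinct rank, so the rank-based election delays no longer coincide.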
Kyle Kim (kimkyle@)	5000c050b5	Add cpu-usec metric support under CLUSTER SLOT-STATS command (#20 ). (#712 ) The metric tracks cpu time in micro-seconds, sharing the same value as `INFO COMMANDSTATS`, aggregated under per-slot context. --------- Signed-off-by: Kyle Kim <kimkyle@amazon.com> Signed-off-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>	2024-07-22 18:03:28 -07:00
Binbin	14e09e981e	Fix the wrong woff when execute WAIT / WAITAOF in script (#776 ) When executing the script, the client passed in is a fake client, and its woff is always 0. This results in woff always being 0 when executing wait/waitaof in the script, and the command returns a wrong number. --------- Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-07-22 10:33:10 +02:00
Binbin	15a8290231	Optimize failover time when the new primary node is down again (#782 ) We do not reset failover_auth_time after setting it, because it is used to check auth_timeout and auth_retry_time, but we should at least reset it after a successful failover. Consider the following scenario: 1. Two replicas initiate an election. 2. Replica 1 is elected as the primary node, and replica 2 does not get enough votes. 3. Replica 1 goes down, i.e. the new primary node fails again within a short time. 4. Replica 2 knows that the new primary node is down and wants to initiate a failover, but because the failover_auth_time of the previous round has not been reset, it has to wait for it to time out and then wait for the next retry time, which takes cluster-node-timeout * 4 and adds a lot of delay. There is another problem. We add extra delay to failover_auth_time, such as the random 500ms and the 1s per replica rank. If replica 2 receives a PONG from the new primary node before sending the FAILOVER_AUTH_REQUEST, that is, before failover_auth_time, it turns itself into a replica of the new primary. If the new primary node then goes down again, replica 2 uses the previous failover_auth_time to initiate an election instead of going through the random 500ms and rank-based 1s logic again, which may lead to unexpected consequences (for example, a low-ranking replica initiates an election and becomes the new primary node). That is, we need to reset failover_auth_time at the appropriate time. When the replica switches to a new primary, we reset it, because the existing failover_auth_time is already out of date in this case. --------- Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-07-19 15:27:49 -04:00
Harkrishn Patro	816accea76	Generate correct slot information in cluster shards command on primary failure (#790 ) Fix #784 Prior to the change, `CLUSTER SHARDS` command processing might pick a failed primary node which won't have the slot coverage information and the slots `output` in turn would be empty. This change finds an appropriate node which has the slot coverage information served by a given shard and correctly displays it as part of `CLUSTER SHARDS` output. Before: ``` 1) 1) "slots" 2) (empty array) 3) "nodes" 4) 1) 1) "id" 2) "2936f22a490095a0a851b7956b0a88f2b67a5d44" ... 9) "role" 10) "master" ... 13) "health" 14) "fail" ``` After: ``` 1) 1) "slots" 2) 1) 0 2) 5461 3) "nodes" 4) 1) 1) "id" 2) "2936f22a490095a0a851b7956b0a88f2b67a5d44" ... 9) "role" 10) "master" ... 13) "health" 14) "fail" ``` --------- Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>	2024-07-19 09:32:39 -07:00
Binbin	35a1888333	Fix incorrect usage of process_is_paused in tests (#783 ) The incorrect usage was introduced in #442. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-07-19 11:25:58 +08:00
naglera	ff6b780fe6	Dual channel replication (#60 ) In this PR we introduce the main benefit of dual channel replication by continuously streaming the COB (client output buffers) in parallel to the RDB, thus keeping the primary-side COB small AND accelerating the overall sync process. By streaming the replication data to the replica during the full sync, we reduce 1. Memory load on the primary node. 2. CPU load on the primary's main process. [Latest performance tests](#data) ## Motivation * Reduce primary memory load. We do that by moving the COB tracking to the replica side. This also decreases the chance of COB overruns. Note that the primary's input buffer limits at the replica side are less restrictive than the primary's COB limits, as the replica plays a less critical part in the replication group. While increasing the primary’s COB may end up with the primary reaching swap and clients suffering, at the replica side we’re more at ease with it. A larger COB means a better chance to sync successfully. * Reduce primary main process CPU load. By opening a new, dedicated connection for the RDB transfer, child processes can have direct access to the new connection. Due to TLS connection restrictions, this was not possible using one main connection. We eliminate the need for the child process to use the primary's child-proc -> main-proc pipeline, thus freeing up the main process to process client queries. ## Dual Channel Replication high level interface design - Dual channel replication begins when the replica sends a `REPLCONF CAPA DUALCHANNEL` to the primary during the initial handshake. This is used to state that the replica is capable of dual channel sync and that this is the replica's main channel, which is not used for snapshot transfer. - When the replica lacks sufficient data for PSYNC, the primary will send a `-FULLSYNCNEEDED` response instead of RDB data. 
As a next step, the replica creates a new connection (rdb-channel) and configures it against the primary with the appropriate capabilities and requirements. The replica then requests a sync using the RDB channel. - Prior to forking, the primary sends the replica the snapshot's end repl-offset, and attaches the replica to the replication backlog to keep repl data until the replica requests psync. The replica uses the main channel to request a PSYNC starting at the snapshot end offset. - The primary's main thread sends incremental changes via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. On the replica, the incremental changes are stored in a local buffer while the RDB is loaded into memory. - Once the replica completes loading the rdb, it drops the rdb-connection and streams the accumulated incremental changes into memory. Repl steady state continues normally. ## New replica state machine ![image](https://github.com/user-attachments/assets/38fbfff0-60b9-4066-8b13-becdb87babc3) ## Data <a name="data"></a> ![image](https://github.com/user-attachments/assets/d73631a7-0a58-4958-a494-a7f4add9108f) ![image](https://github.com/user-attachments/assets/f44936ed-c59a-4223-905d-0fe48a6d31a6) ![image](https://github.com/user-attachments/assets/bd333ee2-3c47-47e5-b244-4ea75f77c836) ## Explanation These graphs demonstrate performance improvements during full sync sessions using rdb-channel + streaming rdb directly from the background process to the replica. First graph: with at most 50 clients and lightweight commands, we saw a 5%-7.5% improvement in write latency during the sync session. The two graphs below it: full sync was tested during heavy read commands on the primary (such as sdiff, sunion on large sets). In that case, the child process writes to the replica without sharing CPU with the loaded main process. As a result, this not only improves client response time, but may also shorten sync time by about 50%. 
The shorter sync time results in less memory being used to store replication diffs (>60% in some of the tested cases). ## Test setup Both primary and replica in the performance tests ran on the same machine. RDB size in all tests is 3.7gb. I generated write load using valkey-benchmark ` ./valkey-benchmark -r 100000 -n 6000000 lpush my_list __rand_int__`. --------- Signed-off-by: naglera <anagler123@gmail.com> Signed-off-by: naglera <58042354+naglera@users.noreply.github.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>	2024-07-17 13:59:33 -07:00
Ping Xie	66d0f7d9a1	Ensure only primary sender drives slot ownership updates (#754 ) Fixes a regression introduced in PR #445, which allowed a message from a replica to update the slot ownership of its primary. The regression results in a `replicaof` cycle, causing server crashes due to the cycle detection assert. The fix restores the previous behavior where only primary senders can trigger `clusterUpdateSlotsConfigWith`. Additional changes: * Handling of primaries without slots is obsoleted by new handling of when a sender that was a replica announces that it is now a primary. * Replication loop detection code is unchanged but shifted downwards. * Some variables are renamed for better readability and some are introduced to avoid repeated memcmp() calls. Fixes #753. --------- Signed-off-by: Ping Xie <pingxie@google.com>	2024-07-16 13:05:49 -07:00
KarthikSubbarao	418901dec4	Limit tracking custom errors (e.g. from LUA) while allowing non custom errors to be tracked normally (#500 ) Implementing the change proposed here: https://github.com/valkey-io/valkey/issues/487 In this PR, we prevent tracking new custom error messages (e.g. LUA) if the number of error messages (in the errors RAX) is greater than 128. Instead, we will track any additional custom error prefix in a new counter: `errorstat_ERRORSTATS_OVERFLOW ` and if any non-custom flagged errors (e.g. MOVED / CLUSTERDOWN) occur, they will continue to be tracked as usual. This will address the issue of spammed error messages / memory usage of the errors RAX. Additionally, we will not have to execute `CONFIG RESETSTAT` to restore error stats functionality because normal error messages continue to be tracked. Example: ``` # Errorstats . . . errorstat_127:count=2 errorstat_128:count=2 errorstat_ERR:count=1 errorstat_ERRORSTATS_OVERFLOW:count=2 ``` --------- Signed-off-by: Karthik Subbarao <karthikrs2021@gmail.com> Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>	2024-07-14 20:04:47 -07:00
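The tracking rule from this commit can be sketched as below. This is an illustrative model only: the class `ErrorStats` is hypothetical, a dict stands in for the errors RAX, and the exact boundary check (`>=` against 128) and the choice to keep incrementing already-tracked custom prefixes are assumptions rather than details stated in the commit message.

```python
class ErrorStats:
    """Toy model of error-prefix tracking with an overflow counter."""

    def __init__(self, limit=128):
        self.limit = limit      # assumed cap on tracked prefixes
        self.counts = {}        # stands in for the errors RAX
        self.overflow = 0       # reported as errorstat_ERRORSTATS_OVERFLOW

    def record(self, prefix, custom):
        # New custom prefixes (e.g. from Lua scripts) are not added once the
        # table is full; they only bump the overflow counter.
        is_new = prefix not in self.counts
        if custom and is_new and len(self.counts) >= self.limit:
            self.overflow += 1
            return
        # Non-custom errors (e.g. MOVED, CLUSTERDOWN) are always tracked.
        self.counts[prefix] = self.counts.get(prefix, 0) + 1


stats = ErrorStats()
for i in range(130):
    stats.record(f"e{i}", custom=True)
stats.record("MOVED", custom=False)
print(len(stats.counts), stats.overflow)
```

Because normal errors bypass the cap, error stats stay useful without running `CONFIG RESETSTAT`.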
Binbin	a4ee8dada4	Fix WAITAOF test in external test due to appendonly is enabled (#775 ) The test fails because, in external mode, another test may have enabled appendonly, causing acklocal to return 1. We could add a CONFIG SET to disable appendonly, but this is not safe either unless we use MULTI. The test does not actually rely on appendonly, so we can just * it. Fixes #770. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-07-12 23:32:39 +08:00
Madelyn Olson	9948f07a01	Temporary skip blockwait aof test until it's fixed (#773 ) See https://github.com/valkey-io/valkey/issues/770 for details about failure. Want to prevent the test failures. Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>	2024-07-11 13:10:13 -04:00
Viktor Söderqvist	a323dce890	Dual stack and client-specific IPs in cluster (#736 ) New configs: * `cluster-announce-client-ipv4` * `cluster-announce-client-ipv6` New module API function: * `ValkeyModule_GetClusterNodeInfoForClient`, takes a client id and is otherwise just like its non-ForClient cousin. If configured, one of these IP addresses is reported to each client in CLUSTER SLOTS, CLUSTER SHARDS, CLUSTER NODES and redirects, replacing the IP (`cluster-announce-ip` or the auto-detected IP) of each node. Which one is reported to the client depends on whether the client is connected over IPv4 or IPv6. Benefits: * This allows clients using IPv4 to get the IPv4 addresses of all cluster nodes and IPv6 clients to get the IPv6 addresses. * This allows the IPs visible to clients to be different from the IPs used between the cluster nodes due to NAT'ing. The information is propagated in the cluster bus using new Ping extensions. (Old nodes without this feature ignore unknown Ping extensions.) This adds another dimension to the CLUSTER SLOTS reply. It now depends on the client's use of TLS, the IP address family and the RESP version. Refactoring: The cached connection type definition is moved from connection.h (it actually has nothing to do with the connection abstraction) to server.h and is changed to a bitmap, with one bit for each of TLS, IPv6 and RESP3. Fixes #337 --------- Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>	2024-07-10 13:53:52 +02:00
Viktor Söderqvist	b99c7237f4	Fix unstable test case EVAL+WAITAOF (#766 ) Test case "EVAL - Scripts do not block on waitaof" observed to fail in e.g. https://github.com/valkey-io/valkey/actions/runs/9860131487/job/27233756421?pr=688 It can happen that the local AOF has been written and 1 is returned here where 0 is expected. Writing a key inside the EVAL script makes sure there's no time to write the AOF. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>	2024-07-09 21:40:49 +02:00
uriyage	bbfd041895	Async IO threads (#758 ) This PR is 1 of 3 PRs intended to achieve the goal of 1 million requests per second, as detailed by [dan touitou](https://github.com/touitou-dan) in https://github.com/valkey-io/valkey/issues/22. This PR modifies the IO threads to be fully asynchronous, which is a first and necessary step to allow more work offloading and better utilization of the IO threads. ### Current IO threads state: Valkey IO threads were introduced in Redis 6.0 to allow better utilization of multi-core machines. Before this, Redis was single-threaded and could only use one CPU core for network and command processing. The introduction of IO threads helps in offloading the IO operations to multiple threads. Current IO Threads flow: 1. Initialization: When Redis starts, it initializes a specified number of IO threads. These threads are in addition to the main thread. Each thread starts with an empty list, which the main thread populates in each event loop with pending-read-clients or pending-write-clients. 2. Read Phase: The main thread accepts incoming connections and reads requests from clients. The reading of requests is offloaded to IO threads. The main thread puts the ready-to-read clients in a list and sets the global io_threads_op to IO_THREADS_OP_READ; the IO threads pick the clients up, perform the read operation and parse the first incoming command. 3. Command Processing: After reading the requests, command processing is still single-threaded and handled by the main thread. 4. Write Phase: Similar to the read phase, the write phase is also offloaded to IO threads. The main thread prepares the response in the clients’ output buffer, then puts the client in the list and sets the global io_threads_op to IO_THREADS_OP_WRITE. The IO threads then pick the clients up and perform the write operation to send the responses back to clients. 5. 
Synchronization: The main thread communicates to the threads how many jobs are left for each thread using an atomic counter. The main thread doesn’t access the clients while they are being handled by the IO threads. Issues with current implementation: * Underutilized Cores: The current implementation of IO threads leads to the underutilization of CPU cores. * The main thread remains responsible for a significant portion of IO-related tasks that could be offloaded to IO threads. * When the main thread is processing clients’ commands, the IO threads are idle for a considerable amount of time. * Notably, the main thread's performance during the IO-related tasks is constrained by the speed of the slowest IO thread. * Limited Offloading: Currently, since the main thread waits synchronously for the IO threads, the threads perform only read-parse and write operations, with parsing done only for the first command. If the threads can do work asynchronously, we may offload more work to them, reducing the load on the main thread. * TLS: Currently, we don't support IO threads with TLS (where offloading IO would be more beneficial) since TLS read/write operations are not thread-safe with the current implementation. ### Suggested change Non-blocking main thread - The main thread and IO threads will operate in parallel to maximize efficiency. The main thread will not be blocked by IO operations. It will continue to process commands independently of the IO threads' activities. Implementation details Inter-thread communication. * We use a static, lock-free ring buffer of fixed size (2048 jobs) for the main thread to send jobs and for the IO thread to receive them. If the ring buffer fills up, the main thread will handle the task itself, acting as back pressure (in case IO operations are more expensive than command processing). A static ring buffer is a better candidate than a dynamic job queue as it eliminates the need for allocation/freeing per job. 
* An IO job has the format ` [void* function-call-back \| void data] `, where data is a client to read from or write to and function-call-back is the function to be called with the data, for example readQueryFromClient. With this format we can later offload other types of work to the IO threads. The ring buffer is one-way, from the main thread to the IO thread. Upon a read/write event the main thread sends a read/write job; then, in before-sleep, it iterates over the pending read/write clients, checking for each client whether the IO thread has already finished handling it. The IO thread signals that it has finished handling a client read/write by toggling an atomic flag read_state / write_state on the client struct. Thread Safety As suggested in this solution, the IO threads are reading from and writing to the clients' buffers while the main thread may access those clients. We must ensure no race conditions or unsafe access occurs while keeping the Valkey code simple and lock free. Minimal Action in the IO Threads The main change is to limit the IO thread operations to the bare minimum. The IO thread will access only the client's struct and only the necessary fields in this struct. The IO threads will be responsible for the following: * Read Operation: The IO thread will only read and parse a single command. It will not update the server stats, handle read errors, or handle parsing errors. These tasks will be taken care of by the main thread. * Write Operation: The IO thread will only write the available data. It will not free the client's replies, handle write errors, or update the server statistics. To achieve this without code duplication, the read/write code has been refactored into smaller, independent components: * Functions that perform only the read/parse/write calls. * Functions that handle the read/parse/write results. This refactor accounts for the majority of the modifications in this PR. 
Client Struct Safe Access As we ensure that the IO threads access memory only within the client struct, we need to ensure thread safety only for the client's struct's shared fields. * Query Buffer * Command parsing - The main thread will not try to parse a command from the query buffer when a client is offloaded to the IO thread. * Client's memory checks in client-cron - The main thread will not access the client query buffer if it is offloaded and will handle the querybuf grow/shrink when the client is back. * CLIENT LIST command - The main thread will busy-wait for the IO thread to finish handling the client, falling back to the current behavior where the main thread waits for the IO thread to finish their processing. * Output Buffer * The IO thread will not change the client's bufpos and won't free the client's reply lists. These actions will be done by the main thread on the client's return from the IO thread. * bufpos / block→used: As the main thread may change the bufpos, the reply-block→used, or add/delete blocks to the reply list while the IO thread writes, we add two fields to the client struct: io_last_bufpos and io_last_reply_block. The IO thread will write until the io_last_bufpos, which was set by the main-thread before sending the client to the IO thread. If more data has been added to the cob in between, it will be written in the next write-job. In addition, the main thread will not trim or merge reply blocks while the client is offloaded. * Parsing Fields * Client's cmd, argc, argv, reqtype, etc., are set during parsing. * The main thread will indicate to the IO thread not to parse a cmd if the client is not reset. In this case, the IO thread will only read from the network and won't attempt to parse a new command. * The main thread won't access the c→cmd/c→argv in the CLIENT LIST command as stated before it will busy wait for the IO threads. 
* Client Flags * c→flags, which may be changed by the main thread in multiple places, won't be accessed by the IO thread. Instead, the main thread will set the c→io_flags with the information necessary for the IO thread to know the client's state. * Client Close * On freeClient, the main thread will busy wait for the IO thread to finish processing the client's read/write before proceeding to free the client. * Client's Memory Limits * The IO thread won't handle the qb/cob limits. In case a client crosses the qb limit, the IO thread will stop reading for it, letting the main thread know that the client crossed the limit. TLS TLS is currently not supported with IO threads for the following reasons: 1. Pending reads - If SSL has pending data that has already been read from the socket, there is a risk of not calling the read handler again. To handle this, a list is used to hold the pending clients. With IO threads, multiple threads can access the list concurrently. 2. Event loop modification - Currently, the TLS code registers/unregisters the file descriptor from the event loop depending on the read/write results. With IO threads, multiple threads can modify the event loop struct simultaneously. 3. The same client can be sent to 2 different threads concurrently (https://github.com/redis/redis/issues/12540). Those issues were handled in the current PR: 1. The IO thread only performs the read operation. The main thread will check for pending reads after the client returns from the IO thread and will be the only one to access the pending list. 2. The registering/unregistering of events will be similarly postponed and handled by the main thread only. 3. Each client is being sent to the same dedicated thread (c→id % num_of_threads). Sending Replies Immediately with IO threads. Currently, after processing a command, we add the client to the pending_writes_list. Only after processing all the clients do we send all the replies. 
Since the IO threads are now working asynchronously, we can send the reply immediately after processing the client’s requests, reducing the command latency. However, if we are using AOF=always, we must wait for the AOF buffer to be written, in which case we revert to the current behavior. IO threads dynamic adjustment Currently, we use an all-or-nothing approach when activating the IO threads. The current logic is as follows: if the number of pending write clients is greater than twice the number of threads (including the main thread), we enable all threads; otherwise, we enable none. For example, if 8 IO threads are defined, we enable all 8 threads if there are 16 pending clients; else, we enable none. It makes more sense to enable partial activation of the IO threads. If we have 10 pending clients, we will enable 5 threads, and so on. This approach allows for a more granular and efficient allocation of resources based on the current workload. In addition, the user will now be able to change the number of I/O threads at runtime. For example, when decreasing the number of threads from 4 to 2, threads 3 and 4 will be closed after flushing their job queues. Tests Currently, we run the io-threads tests with 4 IO threads (`443d80f168/.github/workflows/daily.yml (L353)`). This means that we will not activate the IO threads unless there are 8 (threads * 2) pending write clients in a single loop, which is unlikely to happen in most tests, meaning the IO threads are not currently being tested. To force the main thread to always offload work to the IO threads, regardless of the number of pending events, we add an events-per-io-thread configuration with a default value of 2. When set to 0, this configuration will force the main thread to always offload work to the IO threads. 
When we offload every single read/write operation to the IO threads, the IO threads run at 100% CPU, and when multiple tests run concurrently some tests fail as a result of larger than expected command latencies. To address this issue, we add some after or wait_for calls to some of the tests to ensure they pass with IO threads as well. Signed-off-by: Uri Yagelnik <uriy@amazon.com>	2024-07-08 20:01:39 -07:00
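The partial activation and the client-to-thread pinning described in this commit can be sketched as follows. This is a simplified model built only from the numbers in the commit message; the function names are hypothetical, and whether the main thread is counted among the active threads is not modeled.

```python
def active_io_threads(pending_clients, max_threads, events_per_io_thread=2):
    """How many threads to activate for the current number of pending clients.

    Assumption: one thread is enabled per `events_per_io_thread` pending
    clients, capped at `max_threads`; a value of 0 always offloads.
    """
    if events_per_io_thread == 0:
        return max_threads
    return min(max_threads, pending_clients // events_per_io_thread)


def thread_for_client(client_id, num_threads):
    """Pin a client to one dedicated thread (c->id % num_of_threads)."""
    return client_id % num_threads


print(active_io_threads(10, 8))   # partial activation
print(active_io_threads(16, 8))   # all threads
print(thread_for_client(10, 4))
```

Pinning each client to a fixed thread is what prevents the same client from being handed to two threads concurrently.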
Binbin	6bf1d02edf	Nested MULTI or WATCH in MULTI now will abort the transaction (#723 ) Currently, for nested MULTI or executing WATCH in MULTI, we will return an error but we will not abort the transaction. ``` 127.0.0.1:6379> multi OK 127.0.0.1:6379(TX)> multi (error) ERR MULTI calls can not be nested 127.0.0.1:6379(TX)> set key value QUEUED 127.0.0.1:6379(TX)> exec 1) OK 127.0.0.1:6379> multi OK 127.0.0.1:6379(TX)> watch key (error) ERR WATCH inside MULTI is not allowed 127.0.0.1:6379(TX)> set key value QUEUED 127.0.0.1:6379(TX)> exec 1) OK ``` This is an unexpected behavior that should abort the transaction. The number of elements returned by EXEC also doesn't match the number of commands in MULTI. Add the NO_MULTI flag to them so that they will be rejected in processCommand and rejectCommand will abort the transaction. So there are two visible changes: - Different words in the error messages. (Command not allowed inside a transaction) - Exec returns error. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-07-03 21:27:45 +02:00
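The new behavior can be modeled with a small state machine. This is a toy sketch, not server code: the class `Transaction` is hypothetical, it only models MULTI, WATCH, EXEC and queued commands, and every queued command is assumed to reply OK.

```python
NO_MULTI = {"MULTI", "WATCH"}  # commands flagged as not allowed inside MULTI


class Transaction:
    """Toy model of a client connection's transaction state."""

    def __init__(self):
        self.in_multi = False
        self.aborted = False
        self.queued = []

    def send(self, command):
        name = command.split()[0].upper()
        if not self.in_multi:
            if name == "MULTI":
                self.in_multi = True
            return "OK"  # commands outside a transaction are not modeled
        if name == "EXEC":
            aborted, queued = self.aborted, self.queued
            self.in_multi, self.aborted, self.queued = False, False, []
            if aborted:
                return "EXECABORT Transaction discarded because of previous errors."
            return ["OK"] * len(queued)
        if name in NO_MULTI:
            # Rejecting the command also marks the transaction as aborted.
            self.aborted = True
            return "ERR Command not allowed inside a transaction"
        self.queued.append(command)
        return "QUEUED"


tx = Transaction()
for cmd in ["MULTI", "MULTI", "SET key value", "EXEC"]:
    print(cmd, "->", tx.send(cmd))
```

After the nested MULTI is rejected, EXEC returns an error instead of running the queued SET.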
Sankar	eff45f5467	Fix flakiness of cluster-multiple-meets and cluster-reliable-meet (#728 ) Tests in cluster-multiple-meets were flaky as reported by @madolson * https://github.com/valkey-io/valkey/actions/runs/9688455588/job/26776953320 * https://github.com/valkey-io/valkey/actions/runs/9688455588/job/26776953585 I wasn't able to reproduce this locally, but I suspect that the flakiness is coming from the fact that nodes are reported as "connected" as long as there is an outgoing link. An outgoing link is created before MEET is sent out. Signed-off-by: Sankar <1890648+srgsanky@users.noreply.github.com>	2024-07-01 22:27:38 -07:00
KarthikSubbarao	fa01a29365	Allow Module authentication to succeed when cluster is down (#693 ) Module Authentication using a blocking implementation currently gets rejected when the "cluster is down" from the client timeout cron job (`clientsCronHandleTimeout`). This PR exempts clients blocked on Module Authentication from being rejected here. --------- Signed-off-by: KarthikSubbarao <karthikrs2021@gmail.com>	2024-07-01 13:59:06 -07:00
ranshid	24208812a6	Increase ping and cluster timeout for cluster-slots test (#717 ) The cluster-slots test is testing a very fragmented slot range of a relatively large cluster. For this reason, when run under valgrind, some of the nodes time out while the cluster is attempting to converge and propagate. This PR sets the test's cluster-node-timeout to 90000 and cluster-ping-interval to 1000. Signed-off-by: ranshid <ranshid@amazon.com>	2024-06-30 16:30:46 -07:00
w. ian douglas	b59762f734	Very minor misspelling in some tests (#705 ) Fix misspelling "faiover" instead of "failover" in two test cases. Signed-off-by: w. ian douglas <ian.douglas@iandouglas.com>	2024-06-28 23:56:30 +02:00
Binbin	2979fe6060	CLUSTER SLOT-STATS ORDERBY when stats are the same, compare by slot in ascending order (#710 ) Test failed locally: ``` *** [err]: CLUSTER SLOT-STATS ORDERBY LIMIT correct response pagination, where limit is less than number of assigned slots in tests/unit/cluster/slot-stats.tcl Expected [dict exists 0 0 1 0 2 0 3 0 4 0 16383] (context: type source line 64 file /xxx/tests/unit/cluster/slot-stats.tcl cmd {assert {[dict exists $expected_slots $slot]}} proc ::assert_slot_visibility level 1) ``` It seems that when the stat is equal, that is, when the key-count is equal, qsort may order the slots differently. When the stat is equal, we now compare by slot (in ascending order). Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-06-28 08:03:03 -07:00
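The effect of the tie-break can be shown with a short sketch. It assumes the stats are available as a `slot -> value` mapping; the function name `order_slots` is hypothetical and stands in for the comparator used by the server.

```python
def order_slots(stats, desc=True, limit=None):
    """Order slots by stat value, breaking ties by slot number ascending."""
    def key(item):
        slot, value = item
        # Negate the value for descending order; the slot always ascends.
        return (-value if desc else value, slot)

    ordered = [slot for slot, _ in sorted(stats.items(), key=key)]
    return ordered if limit is None else ordered[:limit]


key_counts = {3: 0, 1: 0, 2: 5, 0: 0}
print(order_slots(key_counts, desc=True))
print(order_slots(key_counts, desc=False, limit=2))
```

Including the slot number in the sort key makes the output deterministic when many slots share the same stat, which is what the paginated test relies on.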
Binbin	518f0bf79b	Fix limit undefined behavior crash in CLUSTER SLOT-STATS (#709 ) We did not set a default value for limit, but it will be used in addReplyOrderBy later, the undefined behavior may crash the server since the value could be negative and crash will happen in addReplyArrayLen. An interesting reproducible example (limit reuses the value of -1): ``` > cluster slot-stats orderby key-count desc limit -1 (error) ERR Limit has to lie in between 1 and 16384 (maximum number of slots). > cluster slot-stats orderby key-count desc Error: Server closed the connection ``` Set the default value of limit to 16384. --------- Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-06-28 08:02:52 -07:00
Kyle Kim (kimkyle@)	1269532fbd	Introduce CLUSTER SLOT-STATS command (#20 ). (#351 ) The command provides detailed slot usage statistics upon invocation, with initial support for key-count metric. cpu-usec (approved) and memory-bytes (pending-approval) metrics will soon follow after the merger of this PR. --------- Signed-off-by: Kyle Kim <kimkyle@amazon.com> Signed-off-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>	2024-06-27 16:58:27 -07:00
Binbin	bf1fb1fd36	Fix copy-paste error in scripts eviction test (#671 ) The test needs to test "return 2" but mistakenly uses "return 1". Also remove an extra debug print. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-06-20 10:28:47 +08:00
kukey	ae2d4217e1	Add new SCRIPT SHOW subcommand to dump script via sha1 (#617 ) In some scenarios, the business may not be able to find the previously used Lua script and only have a SHA signature. Or there are multiple identical evalsha's args in monitor/slowlog, and the admin is not able to distinguish the script body. Add a new script subcommand to show the contents of a script given the script's sha1. Returns a NOSCRIPT error if the script is not present in the cache. Usage: `SCRIPT SHOW sha1` Complexity: `O(1)` Closes #604. Doc PR: https://github.com/valkey-io/valkey-doc/pull/143 --------- Signed-off-by: wei.kukey <wei.kukey@gmail.com> Signed-off-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>	2024-06-18 17:48:58 -07:00
Ping Xie	4135894a5d	Update remaining `master` references to `primary` (#660 ) Signed-off-by: Ping Xie <pingxie@google.com>	2024-06-17 20:31:15 -07:00
Binbin	db6d3c1138	Only primary with slots has the right to mark a node as failed (#634 ) In markNodeAsFailingIfNeeded we count needed_quorum and failures. needed_quorum is half of cluster->size plus one, and cluster->size is the number of primary nodes that contain slots, but when counting failures we did not check whether the reporting primary has slots. Only a primary that has slots has the right to vote, so add a new clusterNodeIsVotingPrimary to formalize this concept. Release notes: bugfix where nodes not in the quorum group might spuriously mark nodes as failed --------- Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Ping Xie <pingxie@outlook.com>	2024-06-16 20:46:08 -07:00
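The quorum rule described above can be sketched as follows. This is a simplified model, not the cluster code: the `node_t` struct and all function names here are hypothetical stand-ins, and only the arithmetic (half the cluster size plus one, counting only primaries that own slots) is taken from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced view of a cluster node. */
typedef struct {
    int is_primary; /* non-zero if the node is a primary */
    int numslots;   /* number of slots the node serves */
} node_t;

/* A node may vote only if it is a primary that owns at least one slot. */
int node_is_voting_primary(const node_t *n) {
    return n->is_primary && n->numslots > 0;
}

/* Half of the cluster size plus one. */
int needed_quorum(int cluster_size) {
    assert(cluster_size >= 0);
    return cluster_size / 2 + 1;
}

/* Count failure reports, ignoring reporters that are not voting primaries. */
int count_voting_failure_reports(const node_t *reporters, size_t n) {
    int failures = 0;
    for (size_t i = 0; i < n; i++)
        if (node_is_voting_primary(&reporters[i])) failures++;
    return failures;
}

int should_mark_failed(const node_t *reporters, size_t n, int cluster_size) {
    return count_voting_failure_reports(reporters, n) >= needed_quorum(cluster_size);
}
```

For a cluster of three slot-owning primaries the quorum is 2, so a report from one voting primary plus a report from an empty primary is not enough to mark a node as failed.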
Sankar	a81c32079c	Make cluster meet reliable under link failures (#461 ) When there is a link failure while an ongoing MEET request is being sent, the sending node stops sending any more MEETs and starts sending PINGs. Since every node responds to PINGs from unknown nodes with a PONG, the receiving node never adds the sending node. But the sending node adds the receiving node when it sees a PONG. This can lead to asymmetry in cluster membership. This change makes the sender keep sending MEET until it sees a PONG, avoiding the asymmetry. --------- Signed-off-by: Sankar <1890648+srgsanky@users.noreply.github.com>	2024-06-16 20:37:09 -07:00
Binbin	d5496e42bc	Lua scripts promoted from eval to script load to avoid evict (#637 ) In ad28d222edcef9d4496fd7a94656013f07dd08e5, we added eviction of Lua EVAL scripts. If a script was previously added via EVAL, we promote it to SCRIPT LOAD, preventing it from being evicted later. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-06-14 08:32:19 -07:00
Ping Xie	5d9d41868d	Replace `DEBUG RESTART` with `pause_server` and `resume_server` (#652 )	2024-06-13 17:52:50 -07:00
uriyage	d211078a27	Fix query buffer resized test flakiness (#646 ) Added a wait_for_condition to avoid the timing issue. ``` * [err]: query buffer resized correctly in tests/unit/querybuf.tcl Expected 11 >= 16384 && 11 <= 32770 (context: type eval line 24 cmd {assert {$orig_test_client_qbuf >= 16384 && $orig_test_client_qbuf <= $MAX_QUERY_BUFFER_SIZE}} proc ::test) * [err]: query buffer resized correctly when not idle in tests/unit/querybuf.tcl Expected 11 > 32768 (context: type eval line 14 cmd {assert {$orig_test_client_qbuf > 32768}} proc ::test) *** [err]: query buffer resized correctly with fat argv in tests/unit/querybuf.tcl query buffer should not be resized when client idle time smaller than 2s ``` Signed-off-by: Uri Yagelnik <uriy@amazon.com>	2024-06-13 18:07:07 +08:00
Madelyn Olson	627d387ad8	Improve reliability of querybuf test (#639 ) We've been seeing some pretty consistent failures from `test-valgrind-test` and `test-sanitizer-address` because of the querybuf test periodically failing. I tracked it down to the test periodically taking too long and the client cron getting triggered. A simple solution is to just disable the cron during the key race condition. I was able to run this locally for 100 iterations without seeing a failure. Example: https://github.com/valkey-io/valkey/actions/runs/9474458354/job/26104103514 and https://github.com/valkey-io/valkey/actions/runs/9474458354/job/26104106830. Signed-off-by: Madelyn Olson <matolson@amazon.com>	2024-06-12 14:27:42 -07:00
Ping Xie	aad6769a80	Replicate slot migration states via RDB aux fields (#586 )	2024-06-07 20:32:27 -07:00
Madelyn Olson	bce240eab7	Replace masteruser and masterauth with primaryuser and primaryauth (#598 ) Make the one backwards compatible config change we are allowed to replace for removing master from our API. `masterauth` and `masteruser` are still used as an alias, but aren't explicitly referenced. As an addendum to https://github.com/valkey-io/valkey/pull/591, it would be good to have this in 8. Given the related PR for updated other references for master, I just updated the ones around this specific change. Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>	2024-06-07 00:46:52 -07:00
Viktor Söderqvist	ad5fd5b95c	More rebranding (#606 ) More rebranding of * Log messages (#252) * The DENIED error reply * Internal function names and comments, mainly Lua API --------- Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>	2024-06-07 01:40:55 +02:00
Viktor Söderqvist	278ce0cae0	Rebrand the Lua debugger (#603 ) Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>	2024-06-06 19:53:17 +02:00
Madelyn Olson	b95e7c384f	Skip tls for xgroup read regression since it doesn't matter (#595 ) "Client blocked on XREADGROUP while stream's slot is migrated" uses the migrate command, which requires special handling for TLS and non-tls. This was not being handled, so was throwing an error. Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>	2024-06-03 11:49:15 -07:00
uriyage	b72e43ed16	Adjust query buffer resized correctly test to non-jemalloc allocators. (#593 ) The `query buffer resized correctly` test starts [failing](https://github.com/valkey-io/valkey/actions/runs/9278013807) with non-jemalloc allocators after PR https://github.com/valkey-io/valkey/pull/258. With jemalloc, we allocate ~20KB for the query buffer. In the test, we read 1 byte initially and then ensure there is at least 16KB of free space in the buffer for the second read, which is satisfied by jemalloc's 20KB allocation. However, with non-jemalloc allocators, the first read allocates exactly 16KB. When we check again, we don't have 16KB free due to the 1 byte already read. This triggers a greedy reallocation (doubling the requested size of 16KB+1), causing the query buffer size to exceed the 32KB limit, thus failing the test condition. This PR adjusts the test's query buffer upper limit to 32KB + 2. Signed-off-by: Uri Yagelnik <uriy@amazon.com>	2024-06-03 11:15:28 -07:00
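The arithmetic behind the 32KB + 2 limit can be checked with a small model. This is a sketch under stated assumptions: `make_room_greedy` is a hypothetical simplification of a greedy "make room" call, not the allocator or sds code itself, and 20480 is used as a stand-in for the "~20KB" jemalloc allocation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: if the buffer already has want_free bytes free, keep it;
 * otherwise grow greedily to twice the requested total (used + want_free). */
size_t make_room_greedy(size_t alloc, size_t used, size_t want_free) {
    assert(used <= alloc);
    if (alloc - used >= want_free) return alloc;
    return 2 * (used + want_free);
}
```

With a 20480-byte buffer and 1 byte used, 16384 bytes are still free and the size stays at 20480. With an exact 16384-byte buffer and 1 byte used, only 16383 bytes are free, so the model grows to 2 * (1 + 16384) = 32770 bytes, which is 32KB + 2.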
Chen Tianjie	d16b4ec1b9	Unshare object to avoid LRU/LFU being messed up (#250 ) When LRU/LFU enabled, Valkey does not allow using shared objects, as value objects may be shared among many different keys and they can't share LRU/LFU information. However `maxmemory-policy` is modifiable at runtime. If LRU/LFU is not enabled at start, but then enabled when some shared objects are already used, there could be some confusion in LRU/LFU information. For `set` command it is OK since it is going to create a new object when LRU/LFU enabled, but `get` command will not unshare the object and just update LRU/LFU information. So we may duplicate the object in this case. It is a one-time task for each key using shared objects, unless this is the case for so many keys, there should be no serious performance degradation. Still, LRU will be updated anyway, no matter LRU/LFU is enabled or not, because `OBJECT IDLETIME` needs it, unless `maxmemory-policy` is set to LFU. So idle time of a key may still be messed up. --------- Signed-off-by: chentianjie.ctj <chentianjie.ctj@alibaba-inc.com> Signed-off-by: Chen Tianjie <TJ_Chen@outlook.com>	2024-06-01 10:09:20 +02:00
nitaicaro	6fb90adf4b	Fix crash where command duration is not reset when client is blocked … (#526 ) In #11012, we changed the way command durations were computed to handle the same command being executed multiple times. In #11970, we added an assert if the duration is not properly reset, potentially indicating that a call to report statistics was missed. I found an edge case where this happens - easily reproduced by blocking a client on `XREADGROUP` and migrating the stream's slot. This causes the engine to process the `XREADGROUP` command twice: 1. First time, we are blocked on the stream, so we wait for unblock to come back to it a second time. In most cases, when we come back to process the command a second time after unblock, we process the command normally, which includes recording the duration and then resetting it. 2. After unblocking we come back to process the command, and this is where we hit the edge case - at this point, we had already migrated the slot to another node, so we return a `MOVED` response. But when we do that, we don't reset the duration field. Fix: also reset the duration when returning a `MOVED` response. I think this is right, because the client should redirect the command to the right node, which in turn will calculate the execution duration. Also wrote a test which reproduces this; it fails without the fix and passes with it. --------- Signed-off-by: Nitai Caro <caronita@amazon.com> Co-authored-by: Nitai Caro <caronita@amazon.com>	2024-05-30 12:55:00 -07:00
Binbin	6bab2d7968	Make sure clear the CLUSTER SLOTS cache on time when updating hostname (#564 ) In #53, we started caching the CLUSTER SLOTS response to improve throughput and reduce latency. In the code snippet below, the second cluster slots will use the old hostname: ``` config set cluster-preferred-endpoint-type hostname config set cluster-announce-hostname old-hostname.com multi cluster slots config set cluster-announce-hostname new-hostname.com cluster slots exec ``` When updating the hostname, in updateAnnouncedHostname, we set CLUSTER_TODO_SAVE_CONFIG and then call clearCachedClusterSlotsResponse in clusterSaveConfigOrDie, so this is harmless in most cases. Move the clearCachedClusterSlotsResponse call to clusterDoBeforeSleep instead of scheduling it to be called in clusterSaveConfigOrDie. Signed-off-by: Binbin <binloveplay1314@qq.com>	2024-05-30 10:44:12 +08:00
LiiNen	96dcd1183a	Change BITCOUNT 'end' as optional like BITPOS (#118 ) _This change is the thing I suggested to redis when it was BSD, and is not just migration - this is of course more advanced_ ### Issue There is a weird difference in syntax between BITPOS and BITCOUNT: ``` BITPOS key bit [start [end [BYTE | BIT]]] BITCOUNT key [start end [BYTE | BIT]] ``` I think this might cause confusion in terms of usability. It is not just a typo in the syntax hint; the commands really work differently. The results below are with the unstable build: ``` > get TEST:ABCD "ABCD" > BITPOS TEST:ABCD 1 0 -1 (integer) 1 > BITCOUNT TEST:ABCD 0 -1 (integer) 9 > BITPOS TEST:ABCD 1 0 (integer) 1 > BITCOUNT TEST:ABCD 0 (error) ERR syntax error ``` ### What did I fix This simply changes the logic so that BITCOUNT is also accepted without 'end'; 'end' becomes optional, like in BITPOS: ``` > GET TEST:ABCD "ABCD" > BITPOS TEST:ABCD 1 0 -1 (integer) 1 > BITCOUNT TEST:ABCD 0 -1 (integer) 9 > BITPOS TEST:ABCD 1 0 (integer) 1 > BITCOUNT TEST:ABCD 0 (integer) 9 ``` Of course, I also fixed the syntax hint: ``` # ASIS > BITCOUNT key [start end [BYTE|BIT]] # TOBE > BITCOUNT key [start [end [BYTE|BIT]]] ``` ![image](https://github.com/valkey-io/valkey/assets/38001238/8485f58e-6785-4106-9f3f-45e62f90d24b) ### Moreover ... I hadn't noticed that there was a very small piece of dead code in the logic of these commands when I wrote the PR to redis. I found it now, when writing the code again, so I fixed it in valkey. ```c /* asis unstable */ /* bitcountCommand() */ if (!strcasecmp(c->argv[4]->ptr,"bit")) isbit = 1; // ... if (c->argc < 4) { if (isbit) end = (totlen<<3) + 7; else end = totlen-1; } /* bitposCommand() */ if (!strcasecmp(c->argv[5]->ptr,"bit")) isbit = 1; // ... if (c->argc < 5) { if (isbit) end = (totlen<<3) + 7; else end = totlen-1; } ``` The variable "isbit" (actually an int) is only set to 1 when 'BIT' is given. But we were checking whether 'isbit' is true or false in this 'if' branch, even though isbit could never be 1 there, because argc is always less than 4 (or 5 in bitpos). I think these minor fixes will make valkey command operation more consistent. Of course, this PR only changes args from "required" to "optional", so it will never hurt previous users. Thanks, --------- Signed-off-by: LiiNen <kjeonghoon065@gmail.com> Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>	2024-05-28 15:01:28 -04:00
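The "optional end" behaviour described above can be illustrated with a small byte-mode sketch. `bitcount_bytes` and its `has_end` flag are hypothetical names for this example, BIT mode is left out, and the index normalization is a simplified version of the usual negative-index handling rather than the server's code.

```c
#include <assert.h>
#include <stddef.h>

/* Count set bits in bytes [start, end] (inclusive) of s, in BYTE mode only.
 * When has_end is 0, end defaults to the last byte, as in BITPOS. */
long bitcount_bytes(const unsigned char *s, long totlen, long start, int has_end, long end) {
    assert(totlen >= 0);
    if (!has_end) end = totlen - 1; /* default when 'end' is omitted */
    if (start < 0) start += totlen; /* negative indexes count from the end */
    if (end < 0) end += totlen;
    if (start < 0) start = 0;
    if (end < 0) end = 0;
    if (end >= totlen) end = totlen - 1;
    if (start > end) return 0;
    long bits = 0;
    for (long i = start; i <= end; i++)
        for (unsigned char b = s[i]; b; b &= (unsigned char)(b - 1)) bits++;
    return bits;
}
```

For the value "ABCD" used in the examples above, the bytes 0x41, 0x42, 0x43 and 0x44 contain 2, 2, 3 and 2 set bits, so both `start = 0` without an end and `start = 0, end = -1` give 9.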
uriyage	fd58b73f0a	Introduce shared query buffer for client reads (#258 ) This PR optimizes client query buffer handling in Valkey by introducing a shared query buffer that is used by default for client reads. This reduces memory usage by ~20KB per client by avoiding allocations for most clients using short (<16KB) complete commands. For larger or partial commands, the client still gets its own private buffer. The primary changes are: * Adding a shared query buffer `shared_qb` that clients use by default * Modifying client querybuf initialization and reset logic * Copying any partial query from shared to private buffer before command execution * Freeing idle client query buffers when empty to allow reuse of shared buffer * Master client query buffers are kept private as their contents need to be preserved for replication stream In addition to the memory savings, this change shows a 3% improvement in latency and throughput when running with 1000 active clients. The memory reduction may also help reduce the need to evict clients when reaching max memory limit, as the query buffer is the main memory consumer per client. --------- Signed-off-by: Uri Yagelnik <uriy@amazon.com> Signed-off-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>	2024-05-28 11:09:37 -07:00
Viktor Söderqvist	4e44f5aae9	Fix races in test for tot-net-in, tot-net-out, tot-cmds (#559 ) The races are between the '$rd' client and the 'r' client in the test case. Test case "client input output and command process statistics" in unit/introspection. --------- Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>	2024-05-28 17:13:16 +02:00
Ping Xie	e4ead9442b	Make CLUSTER SETSLOT with TIMEOUT 0 block indefinitely (#556 ) This aligns the behaviour with established Valkey commands with a TIMEOUT argument, such as BLPOP. Fix #422 Signed-off-by: Ping Xie <pingxie@google.com>	2024-05-27 07:11:24 -07:00
Viktor Söderqvist	d72ba06dd0	Make cluster replicas return ASK and TRYAGAIN (#495 ) After READONLY, make a cluster replica behave as its primary regarding returning ASK redirects and TRYAGAIN. Without this patch, a client reading from a replica cannot tell if a key doesn't exist or if it has already been migrated to another shard as part of an ongoing slot migration. Therefore, without an ASK redirect in this situation, offloading reads to cluster replicas wasn't reliable. Note: The target of a redirect is always a primary. If a client wants to continue reading from a replica after following a redirect, it needs to figure out the replicas of that new primary using CLUSTER SHARDS or similar. This is related to #21 and has been made possible by the introduction of Replication of Slot Migration States in #445. ---- Release notes: During cluster slot migration, replicas are able to return -ASK redirects and -TRYAGAIN. --------- Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>	2024-05-24 17:58:03 +02:00
Roshan Khatri	c4782066e7	Cache CLUSTER SLOTS response for improving throughput and reduced latency. (#53 ) This commit adds logic to cache the `CLUSTER SLOTS` response for reduced latency and also updates the cache when a change in the cluster is detected. Historically, the `CLUSTER SLOTS` command was deprecated; however, all the server clients have been using `CLUSTER SLOTS` and have not migrated to `CLUSTER SHARDS`. In the future this logic can be added to other commands to improve the performance of the engine. --------- Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>	2024-05-22 14:21:41 -07:00
Viktor Söderqvist	efa8ba519b	Finish postponed SCAN changes (#501 ) Commit 07ed0eafa98a66 introduced some SCAN improvements, but some changes were postponed to a later version (8.0), which this PR finishes: 1. Prepare to move the TYPE filtering to the scan callback as well. This was put on hold since it has side effects that can be considered a breaking change, which is that we will not attempt to lazy expire (delete) a key that was filtered out by not matching the TYPE (changing it means the TYPE filter starts behaving the same as the MATCH filter already does in that respect). 2. When the specified key TYPE filter is an unknown type, the server replies with an error immediately instead of doing a full scan that comes back empty-handed. Fixes #235 Release notes: > SCAN: Expired keys that don't match the TYPE argument for the SCAN are no longer deleted by SCAN Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>	2024-05-17 13:35:31 +02:00
Ping Xie	fd53f17a61	Use pause_process to stop a node to make Valgrind happy, hopefully (#508 ) Signed-off-by: Ping Xie <pingxie@google.com>	2024-05-16 22:59:00 -07:00
Lipeng Zhu	7a9951fb80	Correct the actual allocated size from allocator when calling sdsResize to align the logic with sdsnewlen function. (#476 ) This patch tries to correct the actual allocated size reported by the allocator when calling sdsResize, to align the logic with the sdsnewlen function. The optimization in https://github.com/valkey-io/valkey/pull/453 may depend on this. Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>	2024-05-15 18:22:50 -07:00


1496 Commits