317 Commits

Viktor Söderqvist
12ec3d5932
Increase timeout for cross-version-replication test (#1644)
Fixes #1641

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2025-01-29 13:29:35 -08:00
Viktor Söderqvist
99ed308817
Add cross-version test framework (and a simple test) (#1371)
This includes a way to run two versions of the server from the TCL test
framework, as preparation for adding more cross-version tests. The
runtest script accepts a new parameter

    ./runtest --other-server-path path/to/valkey-server

and a new tag "needs:other-server" for test cases and start_server.
Tests with this tag are automatically skipped if `--other-server-path`
is not provided.
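
For example, a cross-version test case might look roughly like this (a minimal
sketch; the tag name is from this commit, the test body is a hypothetical
placeholder):

```tcl
start_server {tags {"needs:other-server"}} {
    test "Replica of the current version can sync from the other version" {
        # Skipped automatically unless ./runtest --other-server-path ... is given.
        # ... start the other-version server and assert replication works ...
    }
}
```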

This PR also adds a CI job that runs the cross-version tests against
Valkey 7.2.7 by downloading a binary release.

Fixes #76

---------

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2025-01-23 11:26:54 +01:00
Madelyn Olson
079f4edf2d
Add a hint about the current file for TCL debugging (#1459)
Some tests fail and give no useful information since they are outside of a
test context. Now we will at least get the file we are located in.

We can roughly reverse-engineer where we are in that file by seeing which
tests have already finished.

```
[TIMEOUT]: clients state report follows.
sock6 => (SPAWNED SERVER) pid:30375 - tests/unit/info.tcl
Killing still running Valkey server 30375 - tests/unit/info.tcl
```

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2024-12-19 14:18:02 +08:00
Viktor Szépe
b66698b887
Discover and fix new typos (#1446)
Upgrade `typos` and fix corresponding typos

---------

Signed-off-by: Viktor Szépe <viktor@szepe.net>
2024-12-17 17:45:43 -08:00
ranshid
ba25b586d5
Introduce FORCE_DEFRAG compilation option to allow activedefrag run when allocator is not jemalloc (#1303)
Introduce a compile-time option to force activedefrag to run even when
jemalloc is not used as the allocator. This makes it possible to run tests
with defrag enabled while using memory instrumentation tools.

fixes: https://github.com/valkey-io/valkey/issues/1241

---------

Signed-off-by: ranshid <ranshid@amazon.com>
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-12-17 19:07:55 +02:00
Pierre
5f7fe9ef21
Send MEET packet to node if there is no inbound link to fix inconsistency when handshake timed out (#1307)
In some cases, when meeting a new node, if the handshake times out, we
can end up with an inconsistent view of the cluster where the new node
knows about all the nodes in the cluster, but the cluster does not know
about this new node (or vice versa).
To detect this inconsistency, we now check if a node has an outbound
link but no inbound link; this probably means that node does not know
us, so we (re-)send a MEET packet to it to do a new handshake with it.
If we receive a MEET packet from a known node, we disconnect the
outbound link to force a reconnect and the sending of a PING packet so
that the other node recognizes the link as belonging to us. This prevents
cases where a node could send MEET packets in a loop because it thinks
the other node does not have an inbound link.

This fixes the bug described in #1251.

---------

Signed-off-by: Pierre Turin <pieturin@amazon.com>
2024-12-11 17:26:06 -08:00
Viktor Söderqvist
3eb8314be6 Replace dict with hashtable for keys, expires and pubsub channels
Instead of a dictEntry with pointers to key and value, the hashtable
has a pointer directly to the value (robj), which can hold an embedded
key and acts as a key-value pair in the hashtable. This minimizes the
number of pointers to follow and thus the number of memory accesses to
look up a key-value pair.

        Keys         robj
      hashtable
      +-------+   +-----------------------+
      | 0     |   | type, encoding, LRU   |
      | 1 ------->| refcount, expire      |
      | 2     |   | ptr                   |
      | ...   |   | optional embedded key |
      +-------+   | optional embedded val |
                  +-----------------------+

The expire timestamp (TTL) is also stored in the robj, if any. The expire
hash table points to the same robj.

Overview of changes:

* Replace dict with hashtable in kvstore (kvstore.c)
* Add functions for embedding key and expire in robj (object.c)
  * When there's unused space, reserve an expire field to avoid reallocating
    it later if an expire is added.
  * Always reserve space for the expire field for large key names to avoid a
    realloc if it's set later.
* Update db functions (db.c)
  * dbAdd, setKey and setExpire reallocate the object when embedding a key
  * setKey does not increment the reference counter, since it would require
    duplicating the object. This responsibility is moved to the caller.
* Remove logic for shared integer objects as values in the database. The keys
  are now embedded in the objects, so all objects in the database need to be
  unique. Thus, we can't use shared objects as values. Also delete test cases
  for shared integers.
* Adjust various commands to the changes mentioned above.
* Adjust defrag code
  * Improvement: Don't access the expires table before defrag has actually
    reallocated the object.
* Adjust test cases that were using hard-coded sizes for dict when realloc
  would happen, and some other adjustments in test cases.
* Adjust memory prefetch for new hash table implementation in IO-threading,
  using new `hashtableIncrementalFind` API
* Adjust offloading of free() to I/O threads: the object itself is freed in the
  main thread, while obj->ptr is still offloaded to the I/O thread, since the DB
  object is now allocated by the main thread and not by the I/O thread as it
  used to be.
* Let expireIfNeeded take an optional value, to avoid looking up the expires
  table when possible.

---------

Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: uriyage <78144248+uriyage@users.noreply.github.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Uri Yagelnik <uriy@amazon.com>
2024-12-10 21:30:56 +01:00
Binbin
ee386c92ff
Manual failover vote is not limited by two times the node timeout (#1305)
This limit should not restrict manual failover, otherwise in some
scenarios, manual failover will time out.

For example, if some FAILOVER_AUTH_REQUESTs or some FAILOVER_AUTH_ACKs
are lost during a manual failover, it cannot vote in the second manual
failover. Or in a mixed scenario of plain failover and manual failover,
it cannot vote for the subsequent manual failover.

The problem with retrying a manual failover is that it pauses clients for
5 seconds on the primary side, so every manual failover that times out and
has to be retried is costly.

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-11-19 11:17:20 -05:00
Binbin
22bc49c4a6
Try to stabilize the failover call in the slot migration test (#1078)
CI reports that the replica returns this error when performing CLUSTER
FAILOVER:
```
-ERR Master is down or failed, please use CLUSTER FAILOVER FORCE
```

This may be because the primary state is fail or the cluster connection
is disconnected during the primary pause. In this PR, we added some
waits in wait_for_role: if the role is replica, we wait for both the
replication link and the cluster link to be ok.
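
The added wait is conceptually like the sketch below (a minimal sketch; the
exact checks inside wait_for_role are assumptions based on this description):

```tcl
wait_for_condition 100 50 {
    [s -1 master_link_status] eq "up" &&
    [string match "*cluster_state:ok*" [r -1 cluster info]]
} else {
    fail "Replica is not ready to fail over after the primary pause"
}
```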

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-11-07 13:42:20 +08:00
Viktor Söderqvist
00c97979d9
Make ./runtest --dump-logs dump logs on crash (#1117)
Until now, this flag only dumped logs on a failed assert in a test case.
It is useful for this flag to dump logs on a crash as well.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-10-06 10:40:36 -07:00
Binbin
f7c5b40183
Avoid false positive in election tests (#984)
The node may not be able to initiate an election in time due to
problems with cluster communication. If an election is initiated,
make sure its offset is 0.

Closes #967.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-09-13 14:53:39 +08:00
uriyage
39f8bcb91b
Skip tracking clients OOM test when I/O threads are enabled (#764)
Fix feedback loop in key eviction with tracking clients when using I/O
threads.

Current issue:
Evicting keys while tracking clients or keyspace notifications exist
creates a feedback loop when using I/O threads:

While evicting keys we send tracking async writes to I/O threads,
preventing immediate release of tracking clients' COB memory
consumption.

Before the I/O thread finishes its write, we recheck used_memory, which
now includes the tracking clients' COB, and thus we continue to evict
more keys.

**Fix:**
We will skip the test for now while IO threads are active. We may
consider avoiding sending writes in `processPendingWrites` to I/O
threads for tracking clients when we are out of memory.
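
Roughly, the guard amounts to something like this (a minimal sketch; the
io-threads config is real, but the exact skip mechanism used in the test is an
assumption):

```tcl
if {[lindex [r config get io-threads] 1] > 1} {
    # Evicting keys with tracking clients creates a COB feedback loop when
    # writes are offloaded to I/O threads, so skip this test for now.
    return
}
```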

---------

Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-08-21 17:02:57 -07:00
Binbin
76ad8f7a76
Skip IPv6 tests when TCLSH version is < 8.6 (#910)
In #786, we skipped it in the daily CI, but not for the others.
When running ./runtest on macOS, we get this failure:
```
couldn't open socket: host is unreachable (nodename nor servname provided, or not known)
```

The reason is that TCL 8.5 doesn't support IPv6, so we skip tests
tagged with ipv6. This also reverts #786.
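
A minimal sketch of the gate (the ::denytags variable is an assumption about
where the framework keeps skipped tags):

```tcl
if {$::tcl_version < 8.6} {
    # TCL 8.5 has no IPv6 socket support, so skip everything tagged ipv6.
    lappend ::denytags "ipv6"
}
```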

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-08-15 15:11:38 +08:00
Pieter Cailliau
4d284daefd
Copyright update to reflect IP transfer from Salvatore to Redis (#740)
Update references of copyright being assigned to Salvatore when it was
transferred to Redis Ltd. as per
https://github.com/valkey-io/valkey/issues/544.

---------

Signed-off-by: Pieter Cailliau <pieter@redis.com>
2024-08-14 09:20:36 -07:00
Binbin
59aa00823c
Replicas with the same offset queue up for election (#762)
In some cases, such as a read-heavy (more reads than writes) scenario, the
replication offsets of the replicas are the same. When the primary fails,
the replicas have the same ranking (rank == 0). They issue the election at
the same time (although we add a random delay of up to 500 ms), and the
simultaneous elections may fail to reach a quorum.

In clusterGetReplicaRank, when we calculate the rank, if the offsets are
the same, the replica with the smaller node name now gets a better rank to
avoid this situation.

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-07-22 23:43:16 -07:00
Binbin
15a8290231
Optimize failover time when the new primary node is down again (#782)
We do not reset failover_auth_time after setting it. It is used to check
auth_timeout and auth_retry_time, but we should at least reset it after a
successful failover.

Let's assume the following scenario:
1. Two replicas initiate an election.
2. Replica 1 is elected as the primary node, and replica 2 does not have
   enough votes.
3. Replica 1 goes down, i.e. the new primary node goes down again within a
   short time.
4. Replica 2 knows that the new primary node is down and wants to initiate
   a failover, but because the failover_auth_time of the previous round has
   not been reset, it needs to wait for it to time out and then wait for the
   next retry time, which can take up to cluster-node-timeout * 4; this adds
   a lot of delay.

There is another problem. We add extra time to failover_auth_time, such as
a random 500 ms and 1 s per replica rank. If replica 2 receives a PONG from
the new primary node before sending the FAILOVER_AUTH_REQUEST, that is,
before the failover_auth_time, it will turn itself into a replica. If the
new primary node goes down again at this time, replica 2 will use the
previous failover_auth_time to initiate an election instead of going
through the random 500 ms and replica-ranking 1 s logic again, which may
lead to unexpected consequences (for example, a low-ranking replica
initiates an election and becomes the new primary node).

That is, we need to reset failover_auth_time at the appropriate time.
When the replica switches to a new primary, we reset it, because the
existing failover_auth_time is already out of date in this case.

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-07-19 15:27:49 -04:00
naglera
ff6b780fe6
Dual channel replication (#60)
In this PR we introduce the main benefit of dual channel replication:
continuously streaming the COB (client output buffers) in parallel to the
RDB, thus keeping the primary-side COB small AND accelerating the overall
sync process. By streaming the replication data to the replica during the
full sync, we reduce
1. Memory load on the primary node.
2. CPU load on the primary's main process. [Latest performance
tests](#data)

## Motivation
* Reduce primary memory load. We do that by moving the COB tracking to
the replica side. This also decreases the chance of COB overruns. Note
that the primary's input buffer limits on the replica side are less
restrictive than the primary's COB, as the replica plays a less critical
part in the replication group. While increasing the primary's COB may end
up with the primary reaching swap and clients suffering, on the replica
side we're more at ease with it. A larger COB means a better chance to
sync successfully.
* Reduce primary main process CPU load. By opening a new, dedicated
connection for the RDB transfer, child processes can have direct access
to the new connection. Due to TLS connection restrictions, this was not
possible using one main connection. We eliminate the need for the child
process to use the primary's child-proc -> main-proc pipeline, thus
freeing up the main process to handle client queries.


## Dual Channel Replication high level interface design
- Dual channel replication begins when the replica sends a `REPLCONF CAPA
  DUALCHANNEL` to the primary during the initial handshake. This states that
  the replica is capable of dual channel sync and that this is the replica's
  main channel, which is not used for snapshot transfer.
- When the replica lacks sufficient data for PSYNC, the primary sends a
  `-FULLSYNCNEEDED` response instead of RDB data. As a next step, the replica
  creates a new connection (rdb-channel) and configures it against the primary
  with the appropriate capabilities and requirements. The replica then
  requests a sync using the RDB channel.
- Prior to forking, the primary sends the replica the snapshot's end
  repl-offset and attaches the replica to the replication backlog to keep repl
  data until the replica requests psync. The replica uses the main channel to
  request a PSYNC starting at the snapshot end offset.
- The primary's main thread sends incremental changes via the main channel,
  while the bgsave process sends the RDB directly to the replica via the
  rdb-channel. On the replica, the incremental changes are stored in a local
  buffer, while the RDB is loaded into memory.
- Once the replica completes loading the RDB, it drops the rdb-connection and
  streams the accumulated incremental changes into memory. Replication steady
  state then continues normally.
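
For reference, the feature is expected to be opt-in via a config in
valkey.conf; the config name below is an assumption based on the released
feature and is not stated in this commit message:

```
dual-channel-replication-enabled yes
```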

## New replica state machine


![image](https://github.com/user-attachments/assets/38fbfff0-60b9-4066-8b13-becdb87babc3)

## Data <a name="data"></a>

![image](https://github.com/user-attachments/assets/d73631a7-0a58-4958-a494-a7f4add9108f)


![image](https://github.com/user-attachments/assets/f44936ed-c59a-4223-905d-0fe48a6d31a6)


![image](https://github.com/user-attachments/assets/bd333ee2-3c47-47e5-b244-4ea75f77c836)

## Explanation
These graphs demonstrate performance improvements during full sync
sessions using the rdb-channel and streaming the RDB directly from the
background process to the replica.

First graph: with at most 50 clients and lightweight commands, we saw a
5%-7.5% improvement in write latency during the sync session.
Two graphs below: full sync was tested during heavy read commands on
the primary (such as SDIFF, SUNION on large sets). In that case, the
child process writes to the replica without sharing CPU with the loaded
main process. As a result, this not only improves client response time,
but may also shorten sync time by about 50%. The shorter sync time
results in less memory being used to store replication diffs (>60% in
some of the tested cases).

## Test setup
Both primary and replica in the performance tests ran on the same
machine. The RDB size in all tests is 3.7 GB. I generated write load using
valkey-benchmark: `./valkey-benchmark -r 100000 -n 6000000 lpush my_list
__rand_int__`.

---------

Signed-off-by: naglera <anagler123@gmail.com>
Signed-off-by: naglera <58042354+naglera@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-07-17 13:59:33 -07:00
Viktor Söderqvist
a323dce890
Dual stack and client-specific IPs in cluster (#736)
New configs:

* `cluster-announce-client-ipv4`
* `cluster-announce-client-ipv6`

New module API function:

* `ValkeyModule_GetClusterNodeInfoForClient`, takes a client id and is
otherwise just like its non-ForClient cousin.

If configured, one of these IP addresses is reported to each client in
CLUSTER SLOTS, CLUSTER SHARDS, CLUSTER NODES and redirects, replacing
the IP (`cluster-announce-ip` or the auto-detected IP) of each node.
Which one is reported to the client depends on whether the client is
connected over IPv4 or IPv6.
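
For example, a node behind NAT might announce client-facing addresses in
valkey.conf like this (the config names are from this commit; the addresses
are example values only):

```
cluster-announce-client-ipv4 203.0.113.10
cluster-announce-client-ipv6 2001:db8::10
```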

Benefits:

* This allows clients using IPv4 to get the IPv4 addresses of all
cluster nodes and IPv6 clients to get the IPv6 addresses.
* This allows the IPs visible to clients to be different from the IPs used
between the cluster nodes due to NAT'ing.

The information is propagated in the cluster bus using new Ping
extensions. (Old nodes without this feature ignore unknown Ping
extensions.)

This adds another dimension to CLUSTER SLOTS reply. It now depends on
the client's use of TLS, the IP address family and RESP version.
Refactoring: The cached connection type definition is moved from
connection.h (it actually has nothing to do with the connection
abstraction) to server.h and is changed to a bitmap, with one bit for
each of TLS, IPv6 and RESP3.

Fixes #337

---------

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-07-10 13:53:52 +02:00
Kyle Kim (kimkyle@)
b49eaad367
Introduce a minimal debugger for .tcl integration test suite. (#683)
Introduce a break-point function called `bp`, based on the tcl wiki's
minimal debugger.

```tcl
 proc bp {{s {}}} {
    if ![info exists ::bp_skip] {
        set ::bp_skip [list]
    } elseif {[lsearch -exact $::bp_skip $s]>=0} return
    if [catch {info level -1} who] {set who ::}
    while 1 {
        puts -nonewline "$who/$s> "; flush stdout
        gets stdin line
        if {$line=="c"} {puts "continuing.."; break}
        if {$line=="i"} {set line "info locals"}
        catch {uplevel 1 $line} res
        puts $res
    }
 }
```

```
... your test code before break-point
bp 1
... your test code after break-point
```

The `bp 1` will give back the tcl interpreter to the developer, and
allow you to interactively print local variables (through `puts`), run
functions and so forth.

Source: https://wiki.tcl-lang.org/page/A+minimal+debugger

---------

Signed-off-by: Kyle Kim <kimkyle@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-06-25 10:24:53 -07:00
Madelyn Olson
ce79539047
Fail tests immediately if the server is no longer running (#678)
Fix a minor inconvenience I have when writing tests. If I have a typo or
forget to generate the TLS certificates, the start_server handle will
just loop for 2 minutes before printing the error. Now it fails and
prints the error as soon as it is seen.

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2024-06-21 15:29:05 +08:00
Ping Xie
aad6769a80
Replicate slot migration states via RDB aux fields (#586) 2024-06-07 20:32:27 -07:00
Arthur Lee
3de5c71f48
[Feat] Support fast fail option for tcl test cases (#482)
This PR adds a new option for the TCL test suite which makes it fail fast as soon as any test case fails.
This can be useful when running the CI pipeline and you want to accelerate it.

usage for example

> ./runtest --single unit/type/hash --fast-fail

---------

Signed-off-by: arthur.lee <arthur-lee@qq.com>
2024-05-15 06:55:24 -07:00
Binbin
fdd023ff82
Migrate cluster mode tests to normal framework (#442)
We currently have two disjoint TCL frameworks:
1. The normal testing framework, triggered by runtest, which individually
launches nodes for testing.
2. The cluster framework, triggered by runtest-cluster, which pre-allocates
N nodes and uses them for testing large configurations.

The normal TCL testing framework is much more readily tested and is also
automatically run as part of the CI for new PRs. Since runtest-cluster runs
very slowly (it cannot be parallelized), it currently only runs in the daily
CI; this results in some cluster changes not being exposed in PR CI in time.

This PR migrates the cluster mode tests to the normal framework. Some cluster
tests are kept in runtest-cluster because of timing issues or features not
yet supported; we can handle them later.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-05-09 10:14:47 +08:00
Shivshankar
52f9291f79
Rename redis to valkey in test suite logs and test names. (#366)
This PR covers the cases below:
1. The test suite's prints (i.e., puts statement logs).
2. Links referring to redis issues.
3. Test names containing redis.

Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
2024-04-25 15:13:21 +08:00
Wen Hui
191be272b4
Rename redis.tcl to valkey.tcl (#283)
Includes some more changes, e.g. the README under tests and some scripts under utils.

Signed-off-by: hwware <wen.hui.ware@gmail.com>
2024-04-24 20:54:52 +02:00
Shivshankar
8baf322742
Rename remaining test procedures (#355)
Renamed the procedures and variables below (missed in #287) as follows:

redis_cluster             ->     valkey_cluster
redis1                    ->     valkey1
redis2                    ->     valkey2
get_redis_dir             ->     get_valkey_dir
test_redis_cli_rdb_dump   ->     test_valkey_cli_rdb_dump
test_redis_cli_repl       ->     test_valkey_cli_repl
redis-cli                 ->     valkey-cli
redis_reset_state         ->     valkey_reset_state
redisHandle               ->     valkeyHandle
redis_safe_read           ->     valkey_safe_read
redis_safe_gets           ->     valkey_safe_gets
redis_write               ->     valkey_write
redis_read_reply          ->     valkey_read_reply
redis_readable            ->     valkey_readable
redis_readnl              ->     valkey_readnl
redis_writenl             ->     valkey_writenl
redis_read_map            ->     valkey_read_map
redis_read_line           ->     valkey_read_line
redis_read_null           ->     valkey_read_null
redis_read_bool           ->     valkey_read_bool
redis_read_double         ->     valkey_read_double
redis_read_verbatim_str   ->     valkey_read_verbatim_str
redis_call_callback       ->     valkey_call_callback

---------

Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
2024-04-24 18:01:33 +02:00
Shivshankar
669f1d3014
redisbenchmark to valkeybenchmark in test directory and some test name rename. (#347)
This PR covers the following changes:
1. redisbenchmark to valkeybenchmark in the test directory.
2. Removed redis from some test titles and changed the names
accordingly.
3. Updated the test suite name and redis-server to valkey in the README in
the test directory.

---------

Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
2024-04-23 10:51:53 -07:00
Lipeng Zhu
87a5bfc002
Support connection schemes valkey:// and valkeys:// (#199)
1. Support the valkey:// and valkeys:// schemes in valkey-cli and
valkey-benchmark. The original Redis schemes are retained for compatibility.
2. Add unit tests for valid URIs with all schemes.
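
For example, with the pre-existing -u URI option (shown here as an assumption
of how the new schemes are used):

    ./valkey-cli -u valkey://127.0.0.1:6379 ping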

Fixes: https://github.com/valkey-io/valkey/issues/198
Fixes: https://github.com/valkey-io/valkey/issues/200

---------

Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
2024-04-23 03:02:41 +02:00
Shivshankar
34413e0862
Replace "redis" with "valkey" test code (#287)
Occurrences of "redis" in TCL test suites and helpers, such as TCL
client used in tests, are replaced with "valkey".

Occurrences of uppercase "Redis" are not changed in this PR.

No files are renamed in this PR.

---------

Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
2024-04-18 15:57:17 +02:00
Viktor Söderqvist
9e2b7838ea
Add 'extended-redis-compatibility' config (#306)
New config 'extended-redis-compatibility' (yes/no) default no

* When yes:
  * Use "Redis" in the following error replies:
    - `-LOADING Redis is loading the dataset in memory`
    - `-BUSY Redis is busy`...
    - `-MISCONF Redis is configured to`...
  * Use `=== REDIS BUG REPORT` in the crash log delimiters (START and END).
  * The HELLO command returns `"server" => "redis"` and `"version" => "7.2.4"`
    (our Redis OSS compatibility version).
  * The INFO field for mode is called `"redis_mode"`.
* When no:
  * Use "Valkey" instead of "Redis" in the mentioned errors and crash log
    delimiters.
  * The HELLO command returns `"server" => "valkey"` and the Valkey version
    for `"version"`.
  * The INFO field for mode is called `"server_mode"`.

* Documentation added in valkey.conf:

  > Valkey is largely compatible with Redis OSS, apart from a few cases where
  > Redis OSS compatibility mode makes Valkey pretend to be Redis. Enable this
  > only if you have problems with tools or clients. This is a temporary
  > configuration added in Valkey 8.0 and is scheduled to have no effect in
  > Valkey 9.0 and be completely removed in Valkey 10.0.

* A test case for the config is added. It is designed to fail if the
config is not deprecated (has no effect) in Valkey 9 and deleted in
Valkey 10.

* Other test cases are adjusted to work regardless of this config.

Fixes #274
Fixes #61

---------

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-04-18 14:10:24 +02:00
jonghoonpark
c090874ed4
List test files dynamically (#313)
Motivation: Currently we have to manually update the all_tests variable
when introducing new test files.

Fix: I've modified it to list test files dynamically. Rather than
modifying it to add all test files, I've first modified it to only add test
files from the following 4 paths, so that it doesn't deviate too much
from what we already do (see the sketch after this list):

- unit
- unit/type
- unit/cluster
- integration
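
A minimal sketch of the discovery logic (the ::all_tests variable is from the
motivation above; the exact path handling is an assumption):

```tcl
set ::all_tests {}
foreach dir {unit unit/type unit/cluster integration} {
    foreach file [lsort [glob -nocomplain tests/$dir/*.tcl]] {
        # Strip the leading "tests/" and the ".tcl" suffix to get the test name.
        lappend ::all_tests [string range $file 6 end-4]
    }
}
```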

Related issue: https://github.com/valkey-io/valkey/issues/302

---------

Signed-off-by: jonghoonpark <dev@jonghoonpark.com>
2024-04-15 14:25:33 +02:00
Madelyn Olson
c0cef48e98
Fix reference to redis-tls module (#273)
Update test usage of the redis-tls.so module to use valkey-tls.so instead.

Fixes tests failures like
https://github.com/valkey-io/valkey/actions/runs/8592855995/job/23543475478.

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2024-04-09 07:15:59 -07:00
Jacob Murphy
df5db0627f
Remove trademarked language in code comments (#223)
This includes comments used for module API documentation.

* Strategy for replacement: Regex search: `(//|/\*| \*|#).* ("|\()?(r|R)edis( |\.
  |'|\n|,|-|\)|")(?!nor the names of its contributors)(?!Ltd.)(?!Labs)(?!Contributors.)`
* Don't edit copyright comments
* Replace "Redis version X.X" -> "Redis OSS version X.X" to distinguish
from newly licensed repository
* Replace "Redis Object" -> "Object"
* Exclude markdown for now
* Don't edit Lua scripting comments referring to redis.X API
* Replace "Redis Protocol" -> "RESP"
* Replace redis-benchmark, -cli, -server, -check-aof/rdb with "valkey-"
prefix
* Most other places, I use best judgement to either remove "Redis", or
replace with "the server" or "server"

Fixes #148

---------

Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-04-09 10:24:03 +02:00
VoletiRam
d89ef06ce5
Wait for cluster fully online in cluster_config_consistent (#272)
Wait for the cluster to be in a fully consistent and online state in
`cluster_config_consistent`. We expect `start_server` to create the
desired primaries and replicas before the start of the tests. With the
current setup, the replicas may not have completed the sync with the
primaries and can be in the loading state. In some cases, the role of the
replicas can still be master due to delayed propagation of the replicate
command. The tests can show flaky behavior in such cases. Add a check that
verifies the nodes' health status is 'online' for cluster consistency.
Leverage the deterministic order of `CLUSTER SLOTS` to consider the cluster
consistent, along with the nodes' health status.
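
Conceptually, the health part of the check looks like this (a minimal sketch;
the R helper addressing node j and reading the CLUSTER SHARDS reply as a dict
are assumptions about the test helpers):

```tcl
proc cluster_nodes_all_online {num_nodes} {
    for {set j 0} {$j < $num_nodes} {incr j} {
        foreach shard [R $j cluster shards] {
            foreach node [dict get $shard nodes] {
                # Every node reported by every member must already be online.
                if {[dict get $node health] ne "online"} { return 0 }
            }
        }
    }
    return 1
}
```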

---------

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>
Co-authored-by: Harkrishn Patro <harkrisp@amazon.com>
Co-authored-by: Ram Prasad Voleti <ramvolet@amazon.com>
2024-04-08 20:03:56 -07:00
Wen Hui
7f5bcc96f0
Update some valkey-cli related code in tcl (#236)
Signed-off-by: hwware <wen.hui.ware@gmail.com>
2024-04-05 16:46:33 -04:00
Shivshankar
f3ccfbb01f
Rename TLS test cert files to valkey (#186)
This PR covers changing the redis.crt and redis.key to valkey certs for
TLS testing.

The files are generated by the gen-test-certs.sh script under tests/tls/.

Also addresses the review comments provided.

Signed-off-by: hwware <wen.hui.ware@gmail.com>
Co-authored-by: hwware <wen.hui.ware@gmail.com>
2024-04-03 23:04:52 +02:00
Harkrishn Patro
1736018aa9
Remove trademarked wording on configuration file and individual configs (#29)
Remove trademarked wording in the configuration layer.

The following changes are relevant for release notes:

1. Rename redis.conf to valkey.conf.
2. Pre-filled config in the template config file: change pidfile to `/var/run/valkey_6379.pid`.

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
2024-04-03 19:47:26 +02:00
John Vandenberg
253fe9dced
Fix typos and replace 'codespell' with 'typos' (#72)
Uses https://github.com/taiki-e/install-action to install
https://github.com/crate-ci/typos in CI

This finds many more/different typos than
https://github.com/codespell-project/codespell , while having very few
false positives.

Signed-off-by: John Vandenberg <jayvdb@gmail.com>
2024-03-31 12:38:22 -07:00
Madelyn Olson
57789d4d08
Update naming to Valkey (#62)
Documentation references should use `Valkey` while server and cli
references are all under `valkey`.

---------

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2024-03-28 09:58:28 -07:00
Madelyn Olson
3586485355 Moved to correct cli and benchmark 2024-03-21 19:16:03 -07:00
Madelyn Olson
ee107481e5 Refactor some tests to reference new executable 2024-03-21 19:11:08 -07:00
Binbin
3c2ea1ea95
Fix watched client test timing issue caused by late close (#13062)
There is a timing issue in the test: the close may arrive late, or in
freeClientAsync we free the client asynchronously, which can lead to
errors in the watching_clients statistics, since we only unwatch all
keys when we truly freeClient.

Add a wait here to avoid this problem. Also fixed some outdated
comments I saw. The test was introduced in #12966.
2024-02-20 11:12:19 +02:00
bentotten
b3aaa0a136
When one shard, sole primary node marks potentially failed replica as FAIL instead of PFAIL (#12824)
Fixes issue where a single primary cannot mark a replica as failed in a
single-shard cluster.
2024-01-11 15:48:19 -08:00
Madelyn Olson
8bb9a2895e
Address some failures with new tests for improving debug report (#12915)
Fix a daily test failure because alpine doesn't support stack traces and
add in an extra assertion related to making sure the stack trace was
printed twice.
2024-01-08 17:56:06 -08:00
Binbin
c85a9b7896
Fix delKeysInSlot server events are not executed inside an execution unit (#12745)
This is a follow-up fix to #12733. We need to apply the same changes to
delKeysInSlot. Refer to #12733 for more details.

This PR contains some other minor cleanups / improvements to the test
suite and docs.
It uses the postnotifications test module in a cluster mode test which
revealed a leak in the test module (fixed).
2023-12-11 20:15:19 +02:00
Vitaly
0270abda82
Replace cluster metadata with slot specific dictionaries (#11695)
This is an implementation of https://github.com/redis/redis/issues/10589 that eliminates 16 bytes per entry in cluster mode, which are currently used to create a linked list between entries in the same slot. The main idea is splitting the main dictionary into 16k smaller dictionaries (one per slot), so we can perform all slot-specific operations, such as iteration, without any additional info in the `dictEntry`. For Redis cluster, the expectation is that there will be a larger number of keys, so the fixed overhead of 16k dictionaries will be negligible. The expire dictionary is also split up so that each slot is logically decoupled, so that in subsequent revisions we will be able to atomically flush a slot of data.

## Important changes
* Incremental rehashing - one big change here is that it's not one, but rather up to 16k dictionaries that can be rehashing at the same time, in order to keep track of them, we introduce a separate queue for dictionaries that are rehashing. Also instead of rehashing a single dictionary, cron job will now try to rehash as many as it can in 1ms.
* getRandomKey - now needs to not only select a random key from a random bucket, but also needs to select a random dictionary. Fairness is a major concern here, as it's possible that keys can be unevenly distributed across the slots. To address this, we introduced a binary index tree. With that data structure we are able to efficiently find a random slot using binary search in O(log^2(slot count)) time.
* Iteration efficiency - when iterating dictionary with a lot of empty slots, we want to skip them efficiently. We can do this using same binary index that is used for random key selection, this index allows us to find a slot for a specific key index. For example if there are 10 keys in the slot 0, then we can quickly find a slot that contains 11th key using binary search on top of the binary index tree.
* scan API - in order to perform a scan across the entire DB, the cursor now needs to save not only the position within the dictionary but also the slot id. In this change we append the slot id into the LSB of the cursor so it can be passed around between the client and the server. This has an interesting side effect: now you'll be able to start scanning a specific slot by simply providing the slot id as a cursor value. The plan is to not document this as defined behavior, however. It's also worth noting the SCAN API is now technically incompatible with previous versions, although practically we don't believe it's an issue.
* Checksum calculation optimizations - During command execution, we know that all of the keys are from the same slot (outside of a few notable exceptions such as cross-slot scripts and modules). We don't want to compute the checksum multiple times, hence we rely on the cached slot id in the client during command execution. All operations that access random keys should either pass in the known slot or recompute the slot.
* Slot info in RDB - in order to resize individual dictionaries correctly, while loading RDB, it's not enough to know total number of keys (of course we could approximate number of keys per slot, but it won't be precise). To address this issue, we've added additional metadata into RDB that contains number of keys in each slot, which can be used as a hint during loading.
* DB size - besides the `DBSIZE` API, we need to know the size of the DB in many places. In order to avoid scanning all dictionaries and summing up their sizes in a loop, we've introduced a new field into `redisDb` that keeps track of `key_count`. This way we can keep the DBSIZE operation O(1). The same is kept for O(1) expires computation as well.

## Performance
This change improves SET performance in cluster mode by ~5%, most of the gains come from us not having to maintain linked lists for keys in slot, non-cluster mode has same performance. For workloads that rely on evictions, the performance is similar because of the extra overhead for finding keys to evict. 

RDB loading performance is slightly reduced, as the slot of each key needs to be computed during the load.

## Interface changes
* Removed `overhead.hashtable.slot-to-keys` from `MEMORY STATS`
* Scan API will now require 64 bits to store the cursor, even on 32 bit systems, as the slot information will be stored.
* New RDB version to support the new op code for SLOT information. 

---------

Co-authored-by: Vitaly Arbuzov <arvit@amazon.com>
Co-authored-by: Harkrishn Patro <harkrisp@amazon.com>
Co-authored-by: Roshan Khatri <rvkhatri@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
2023-10-14 23:58:26 -07:00
Oran Agra
f0c1c730d4
test suite: clean server pids after server crashed (#12639)
When a server in the test suite crashes and is restarted by restart_server, we didn't clean its pid from the list.
We can see that when the corrupt-dump-fuzzer hangs, it has a long list of servers to clean, but in fact they're all already dead.
2023-10-13 16:28:52 +03:00
YaacovHazan
2e0f6724e0
Stabilization and improvements around aof tests (#12626)
In some tests, the code manually searches for a log message, and it
uses tail -1 with a delay of 1 second, which can miss the expected line.

Also, because the aof tests use start_server_aof and not start_server,
the test name doesn't log into the server log.

To fix the above, I made the following changes:
- Change the start_server_aof to wrap the start_server.
  This will add the created aof server to the servers list, and make
  srv() and wait_for_log_messages() available for the tests.

- Introduce a new option for start_server.
  'wait_ready' - an option to let the caller start the test code without
  waiting for the server to be ready. Useful for tests on a server that
  is expected to exit on startup.

- Create a new start_server_aof_ex.
  The new proc also accepts options as an argument and makes use of the
  new 'short_life' option for tests that are expected to exit on startup
  because of some error in the aof file(s).

Because of the above, I had to change many lines and replace every
local srv variable (a server config) usage with srv().
2023-10-02 08:20:53 +03:00
DarrenJiang13
6abb3c4038
change log match to line match in tcl sanitizer_errors_from_file. (#12446)
In the tcl foreach loop, the function should compare each line rather than the whole file.
2023-07-30 08:48:29 +03:00
Chen Tianjie
22a29935ff
Support TLS service when "tls-cluster" is not enabled and persist both plain and TLS port in nodes.conf (#12233)
Originally, when "tls-cluster" is enabled, `port` is set to TLS port. In order to support non-TLS clients, `pport` is used to propagate TCP port across cluster nodes. However when "tls-cluster" is disabled, `port` is set to TCP port, and `pport` is not used, which means the cluster cannot provide TLS service unless "tls-cluster" is on.
```
typedef struct {
    // ...
    uint16_t port;  /* Latest known clients port (TLS or plain). */
    uint16_t pport; /* Latest known clients plaintext port. Only used if the main clients port is for TLS. */
    // ...
} clusterNode;
```
```
typedef struct {
    // ...
    uint16_t port;   /* TCP base port number. */
    uint16_t pport;  /* Sender TCP plaintext port, if base port is TLS */
    // ...
} clusterMsg;
```
This PR renames `port` and `pport` in `clusterNode` to `tcp_port` and `tls_port`, to record both ports regardless of whether "tls-cluster" is enabled or disabled.

This allows the server to provide TLS service to clients when "tls-cluster" is disabled: when displaying the cluster topology, or giving a `MOVED` error, the server can provide the TLS or TCP port according to the client's connection type, no matter what type of connection the cluster bus is using.

For backwards compatibility, `port` and `pport` in `clusterMsg` are preserved: when "tls-cluster" is enabled, `port` is set to the TLS port and `pport` is set to the TCP port; when "tls-cluster" is disabled, `port` is set to the TCP port and `pport` is set to the TLS port (instead of 0).

Also, in the nodes.conf file, a new aux field displaying an extra port is added to complete the persisted info. We may have `tls_port=xxxxx` or `tcp_port=xxxxx` in the aux field, to complete the cluster topology, while the other port is stored in the normal `<ip>:<port>` field. The format is shown below.
```
<node-id> <ip>:<tcp_port>@<cport>,<hostname>,shard-id=...,tls-port=6379 myself,master - 0 0 0 connected 0-1000
```
Or we can switch the position of two ports, both can be correctly resolved.
```
<node-id> <ip>:<tls_port>@<cport>,<hostname>,shard-id=...,tcp-port=6379 myself,master - 0 0 0 connected 0-1000
```
2023-06-26 07:43:38 -07:00