35 Commits

Author SHA1 Message Date
John Sully
f49d8f9adb Merge tag '6.2.1' into unstable
Former-commit-id: bfed57e3e0edaa724b9d060a6bb8edc5a6de65fa
2021-05-19 02:59:48 +00:00
Oran Agra
912e22b4f9 Sanitize dump payload: fuzz tester and fixes for segfaults and leaks it exposed
The test creates keys with various encodings, DUMPs them, corrupts the payload,
and RESTOREs it.
It utilizes the recently added use-exit-on-panic config to distinguish between
asserts and segfaults.
If the restore succeeds, it runs random commands on the key to attempt to
trigger a crash.

It runs in two modes, one with deep sanitation enabled and one without.
In the first we don't expect any assertions or segfaults; in the second we
expect assertions, but no segfaults.
We also check for leaks and invalid reads using valgrind, and if we find them
we print the commands that led to that issue.

Changes in the code (other than the test):
- Replace a few NPD (null pointer dereference) flows and division by zero with
  an assertion, so that they don't fail the test (since we set the server to
  use `exit` rather than `abort` on assertion).
- Fix quite a lot of flows in rdb.c that could have led to memory leaks in the
  RESTORE command (since it now responds with an error rather than panicking)
- Add a DEBUG flag for SET-SKIP-CHECKSUM-VALIDATION so that the test doesn't
  need to bother with faking a valid checksum
- Remove a pile of code in serverLogObjectDebugInfo which is actually unsafe to
  run in the crash report (see comments in the code)
- Fix a missing boundary check in lzf_decompress

test suite infra improvements:
- be able to run valgrind checks before the process terminates
- rotate log files when restarting servers
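
The flow described above is implemented as a Tcl test in the commit; as a
rough standalone illustration, a minimal sketch of the same
DUMP/corrupt/RESTORE loop using hiredis could look like this (key name,
setup command, and iteration count are illustrative, not from the commit):

```c
#include <stdlib.h>
#include <string.h>
#include <hiredis/hiredis.h>

int main(void) {
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (!c || c->err) return 1;

    /* The DEBUG flag added by this commit lets the corrupted payload get
     * past checksum validation and reach the deeper decoding paths. */
    freeReplyObject(redisCommand(c, "DEBUG SET-SKIP-CHECKSUM-VALIDATION 1"));
    freeReplyObject(redisCommand(c, "RPUSH fuzzkey a b c"));

    for (int i = 0; i < 1000; i++) {
        redisReply *dump = redisCommand(c, "DUMP fuzzkey");
        if (!dump || dump->type != REDIS_REPLY_STRING) {
            freeReplyObject(dump);
            break;
        }

        /* Flip one random bit in the serialized payload. */
        char *payload = malloc(dump->len);
        memcpy(payload, dump->str, dump->len);
        payload[rand() % dump->len] ^= 1 << (rand() % 8);

        /* RESTORE must reply with an error or succeed; never crash. */
        redisReply *r = redisCommand(c, "RESTORE corrupted 0 %b REPLACE",
                                     payload, (size_t)dump->len);
        if (r && r->type != REDIS_REPLY_ERROR) {
            /* Restore succeeded: run a command against the key to try
             * to trigger a crash, as the test does. */
            freeReplyObject(redisCommand(c, "LRANGE corrupted 0 -1"));
        }
        freeReplyObject(r);
        freeReplyObject(dump);
        free(payload);
    }
    redisFree(c);
    return 0;
}
```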
2020-12-06 14:54:34 +02:00
John Sully
14daf6f909 Merge tag '6.0.8' into unstable
Former-commit-id: 4c7e4b91a6bb2034636856b608b8c386d07f5541
2020-09-30 19:47:55 +00:00
Oran Agra
298e93c360 tests/valgrind: don't use debug restart (#7404)
* tests/valgrind: don't use debug restart

DEBUG RESTART causes two issues:
1. it uses execve which replaces the original process, so valgrind doesn't
   have a chance to check for errors and leaks go unreported.
2. valgrind reports invalid calls to close() which we're unable to resolve.

So now the tests use the restart_server mechanism in the tests, which
terminates the old server and starts a new one: new PID, but same stdout
and stderr.

Since the stderr can contain two or more valgrind reports, it is not enough
to just check for the absence of leaks; we also need to check for some known
errors. We do both, and fail if we either find an error or can't find a
report saying there are no leaks.

Other changes:
- when killing a server that was already terminated, we check for leaks too.
- adding DEBUG LEAK, which was used to test it.
- adding --trace-children to valgrind, although it is no longer needed.
- since the stdout contains two or more runs, we need a slightly different way
  of checking if the new process is up (explicitly looking for the new PID)
- move the code that handles --wait-server to happen earlier (before
  watching the startup message in the log), and serve the restarted server too.

* squashme - CR fixes

(cherry picked from commit 8d4f055e43ab554adfce617c971f10c4b6423484)
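
As a rough sketch of the stderr check described above (path handling and the
exact marker strings are condensed; the real check matches each valgrind
report in the file separately):

```c
#include <stdio.h>
#include <string.h>

/* Return 0 only if the valgrind stderr at `path` contains a clean leak
 * report and none of the known error markers. */
int check_valgrind_stderr(const char *path) {
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    char line[1024];
    int clean_report = 0, known_error = 0;
    while (fgets(line, sizeof(line), f)) {
        /* Lines valgrind prints in a clean leak summary. */
        if (strstr(line, "definitely lost: 0 bytes") ||
            strstr(line, "no leaks are possible"))
            clean_report = 1;
        /* Known errors we must not accept. */
        if (strstr(line, "Invalid read") ||
            strstr(line, "Invalid write") ||
            strstr(line, "Invalid free"))
            known_error = 1;
    }
    fclose(f);
    /* Fail on any error, or on the absence of a "no leaks" report. */
    return (known_error || !clean_report) ? 1 : 0;
}
```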
2020-07-20 21:08:26 +03:00
Oran Agra
8d4f055e43 tests/valgrind: don't use debug restart (#7404)
* tests/valgrind: don't use debug restart

DEBUG RESTART causes two issues:
1. it uses execve which replaces the original process, so valgrind doesn't
   have a chance to check for errors and leaks go unreported.
2. valgrind reports invalid calls to close() which we're unable to resolve.

So now the tests use the restart_server mechanism in the tests, which
terminates the old server and starts a new one: new PID, but same stdout
and stderr.

Since the stderr can contain two or more valgrind reports, it is not enough
to just check for the absence of leaks; we also need to check for some known
errors. We do both, and fail if we either find an error or can't find a
report saying there are no leaks.

Other changes:
- when killing a server that was already terminated, we check for leaks too.
- adding DEBUG LEAK, which was used to test it.
- adding --trace-children to valgrind, although it is no longer needed.
- since the stdout contains two or more runs, we need a slightly different way
  of checking if the new process is up (explicitly looking for the new PID)
- move the code that handles --wait-server to happen earlier (before
  watching the startup message in the log), and serve the restarted server too.

* squashme - CR fixes
2020-07-10 08:26:52 +03:00
John Sully
4820142896 PSYNC test shouldn't wait forever
Former-commit-id: 130613e16636923296a8d5b2c4bc623e62fef2f5
2020-06-01 16:13:58 -04:00
John Sully
92de178bfe PSYNC test reliability improvements (test only issue)
Former-commit-id: 50fd4fa7e62f3996f15f6a8c4dcd892022f111ec
2020-06-01 16:01:26 -04:00
John Sully
ed2e0e66f6 Merge tag '6.0.4' into unstable
Redis 6.0.4.


Former-commit-id: 9c31ac7925edba187e527f506e5e992946bd38a6
2020-05-29 00:57:07 -04:00
antirez
41bb699867 Test: take PSYNC2 test master timeout high during switch.
This will likely avoid false positives due to trailing pings.
2020-05-28 10:56:14 +02:00
antirez
0071eb1311 Test: take PSYNC2 test master timeout high during switch.
This will likely avoid false positives due to trailing pings.
2020-05-28 10:47:30 +02:00
antirez
0163e4e495 Another meaningful offset test removed. 2020-05-28 10:09:51 +02:00
antirez
2411e4e33f Test: PSYNC2 test can now show server logs. 2020-05-28 10:09:51 +02:00
antirez
fafe3502da Another meaningful offset test removed. 2020-05-27 12:50:02 +02:00
antirez
d325091ba6 Test: PSYNC2 test can now show server logs. 2020-05-25 20:26:29 +02:00
antirez
3d478f2e3f Improve the PSYNC2 test reliability. 2020-05-22 12:37:49 +02:00
John Sully
193d7c76cb Fix bad merge in CI.yml
Former-commit-id: 6311d709c39b3bacaeab77b18033010f1b548f81
2020-05-21 22:09:06 -04:00
antirez
5781712458 Improve the PSYNC2 test reliability. 2020-05-17 18:24:34 +02:00
Oran Agra
a8995ce3c9 fix loading race in psync2 tests 2020-04-28 11:20:15 +02:00
Oran Agra
a29e617381 fix loading race in psync2 tests 2020-04-28 09:18:01 +03:00
Oran Agra
58619c1286 Keep track of meaningful replication offset in replicas too
Now both master and replicas keep track of the last replication offset
that contains meaningful data (ignoring the trailing pings), and both
trim that tail from the replication backlog and from the offset they
use for psync.

The implication is that if a replica missed some pings, or even has more
pings than the promoted replica has, it will still be able to psync
(avoiding a full sync).

The downside (which was already committed) is that replicas running old
code may fail to psync, since the promoted replica trims pings from its
backlog.

This commit adds a test that reproduces several cases of promotions and
demotions with stale and non-stale pings.

Background:
The meaningful offset on the master was added recently to solve a problem where
the master is left all alone, injecting PINGs into its backlog when no one is
listening, and then gets demoted and tries to replicate from a replica that
didn't have any of the PINGs (or at least not the last ones).

However, consider this case:
master A has two replicas (B and C) replicating directly from it.
There's no traffic at all, and also no network issues, just many pings in the
tail of the backlog. Now B gets promoted, A becomes a replica of B, and C
remains a replica of A. When A gets demoted, it trims the pings from its
backlog, and successfully replicates from B. However, C is still aware of
these PINGs: when it disconnects and re-connects to A, it will ask for
something that's no longer in the backlog (since A trimmed the tail of its
backlog), and be forced to do a full sync (something it didn't have to do
before the meaningful offset fix).

Besides that, the psync2 test was always failing randomly here and there; it
turns out the reason was PINGs. Investigating it shows the following scenario:

cycle 1: redis #1 is master, and all the rest are direct replicas of #1
cycle 2: redis #2 is promoted to master, #1 is a replica of #2 and #3 is a
replica of #1
Now we see that when #1 is demoted it prints:
17339:S 21 Apr 2020 11:16:38.523 * Using the meaningful offset 3929963 instead of 3929977 to exclude the final PINGs (14 bytes difference)
17339:S 21 Apr 2020 11:16:39.391 * Trying a partial resynchronization (request e2b3f8817735fdfe5fa4626766daa938b61419e5:3929964).
17339:S 21 Apr 2020 11:16:39.392 * Successful partial resynchronization with master.
and when #3 connects to the demoted #1, #1 says:
17339:S 21 Apr 2020 11:16:40.084 * Partial resynchronization not accepted: Requested offset for secondary ID was 3929978, but I can reply up to 3929964

So the issue here is that the meaningful offset feature saved the day for the
demoted master (since it needs to sync from a replica that didn't get the last
ping), but it didn't help one of the other replicas which did get the last
ping.
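
The "14 bytes difference" in the log above is exactly one replicated PING,
which occupies "*1\r\n$4\r\nPING\r\n" in the replication stream. As a rough
sketch of the trimming idea (the function and names are hypothetical; the
actual implementation tracks the meaningful offset as it writes, rather than
scanning the backlog):

```c
#include <stddef.h>
#include <string.h>

/* A replicated PING as it appears in the stream: 14 bytes. */
#define PING_CMD "*1\r\n$4\r\nPING\r\n"
#define PING_LEN (sizeof(PING_CMD) - 1)

/* Given the tail of the replication backlog and the current offset,
 * return the offset with any whole trailing PINGs excluded. */
long long meaningful_offset(const char *tail, size_t len, long long offset) {
    while (len >= PING_LEN &&
           memcmp(tail + len - PING_LEN, PING_CMD, PING_LEN) == 0) {
        len -= PING_LEN;
        offset -= PING_LEN;
    }
    return offset;
}
```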
2020-04-27 15:52:49 +02:00
Oran Agra
5633862924 Keep track of meaningful replication offset in replicas too
Now both master and replicas keep track of the last replication offset
that contains meaningful data (ignoring the trailing pings), and both
trim that tail from the replication backlog and from the offset they
use for psync.

The implication is that if a replica missed some pings, or even has more
pings than the promoted replica has, it will still be able to psync
(avoiding a full sync).

The downside (which was already committed) is that replicas running old
code may fail to psync, since the promoted replica trims pings from its
backlog.

This commit adds a test that reproduces several cases of promotions and
demotions with stale and non-stale pings.

Background:
The meaningful offset on the master was added recently to solve a problem where
the master is left all alone, injecting PINGs into its backlog when no one is
listening, and then gets demoted and tries to replicate from a replica that
didn't have any of the PINGs (or at least not the last ones).

However, consider this case:
master A has two replicas (B and C) replicating directly from it.
There's no traffic at all, and also no network issues, just many pings in the
tail of the backlog. Now B gets promoted, A becomes a replica of B, and C
remains a replica of A. When A gets demoted, it trims the pings from its
backlog, and successfully replicates from B. However, C is still aware of
these PINGs: when it disconnects and re-connects to A, it will ask for
something that's no longer in the backlog (since A trimmed the tail of its
backlog), and be forced to do a full sync (something it didn't have to do
before the meaningful offset fix).

Besides that, the psync2 test was always failing randomly here and there; it
turns out the reason was PINGs. Investigating it shows the following scenario:

cycle 1: redis #1 is master, and all the rest are direct replicas of #1
cycle 2: redis #2 is promoted to master, #1 is a replica of #2 and #3 is a
replica of #1
Now we see that when #1 is demoted it prints:
17339:S 21 Apr 2020 11:16:38.523 * Using the meaningful offset 3929963 instead of 3929977 to exclude the final PINGs (14 bytes difference)
17339:S 21 Apr 2020 11:16:39.391 * Trying a partial resynchronization (request e2b3f8817735fdfe5fa4626766daa938b61419e5:3929964).
17339:S 21 Apr 2020 11:16:39.392 * Successful partial resynchronization with master.
and when #3 connects to the demoted #1, #1 says:
17339:S 21 Apr 2020 11:16:40.084 * Partial resynchronization not accepted: Requested offset for secondary ID was 3929978, but I can reply up to 3929964

So the issue here is that the meaningful offset feature saved the day for the
demoted master (since it needs to sync from a replica that didn't get the last
ping), but it didn't help one of the other replicas which did get the last
ping.
2020-04-27 15:52:23 +02:00
John Sully
0725491043 Merge commit 'c609bf3f2c7f0982f632f82623ee4802868b8ef1' into redis_6_merge
Former-commit-id: 320bc3c0329ff9e5a980b79426b719addae381cf
2020-04-14 21:04:42 -04:00
Oran Agra
cde46df309 fix for flaky psync2 test
*** [err]: PSYNC2: total sum of full synchronizations is exactly 4 in tests/integration/psync2.tcl
Expected 5 == 4 (context: type eval line 6 cmd {assert {$sum == 4}} proc ::test)

The issue was that sometimes the test got an unexpected full sync, since it
tried to switch to the replica before it was in sync with its master.
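
As a rough sketch of the fix's idea (helper names are hypothetical, and the
real test does this in Tcl): before switching roles, wait until the replica's
replication offset has caught up with its master's, so the switch itself
cannot force a full sync.

```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <hiredis/hiredis.h>

/* Parse master_repl_offset out of an INFO replication reply. */
static long long repl_offset(redisContext *c) {
    long long off = -1;
    redisReply *r = redisCommand(c, "INFO replication");
    if (r && r->type == REDIS_REPLY_STRING) {
        char *p = strstr(r->str, "master_repl_offset:");
        if (p) off = atoll(p + strlen("master_repl_offset:"));
    }
    freeReplyObject(r);
    return off;
}

/* Poll until the replica reports the same offset as its master. */
void wait_for_offset_sync(redisContext *master, redisContext *replica) {
    while (repl_offset(replica) < repl_offset(master))
        usleep(100 * 1000); /* 100 ms between polls */
}
```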
2020-03-12 15:53:47 +01:00
Oran Agra
d18e79f91d fix for flaky psync2 test
*** [err]: PSYNC2: total sum of full synchronizations is exactly 4 in tests/integration/psync2.tcl
Expected 5 == 4 (context: type eval line 6 cmd {assert {$sum == 4}} proc ::test)

The issue was that sometimes the test got an unexpected full sync, since it
tried to switch to the replica before it was in sync with its master.
2020-03-05 16:55:14 +02:00
John Sully
397e85befb Merge branch 'unstable' of https://github.com/antirez/redis into MergeRedis
Note: some tests failing

Former-commit-id: 86d7276f24f0cf1a0eceb6cd00a6a0ae2a0fa520
2019-05-11 02:20:34 -04:00
Oran Agra
c76bb465f2 make replication tests more stable on slow machines
solving a few race conditions in replication-related tests which fail on slow
machines

bugfix in the slave buffers test: since the test is executed twice, each time
with a different command count, the threshold for the delta can't be a
constant.
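
As a small sketch of the threshold change (the per-command allowance here is
illustrative, not the test's actual value): the allowed delta scales with the
command count instead of being a constant.

```c
/* The allowed buffer delta grows with the number of commands the run
 * executed, so both executions of the test get a fair bound. */
long long max_allowed_delta(long long commands_count) {
    return commands_count * 64; /* illustrative per-command allowance */
}
```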
2019-05-05 08:25:01 +03:00
John Sully
3e591fe487 Make PSYNC2 tests more reliable on slower hardware
Former-commit-id: 7b1dd0b60d0d65baa43cb69457e06744e0d9094f
2019-03-26 18:59:31 -04:00
antirez
d7dd6b4618 Slave removal: remove slave from integration tests descriptions. 2018-09-11 15:32:28 +02:00
antirez
d76d5e06fa Minor improvements to PR #5187. 2018-07-31 17:30:12 +02:00
antirez
db46df9153 Regression test: Slave restart with EVALSHA in backlog issue #4483. 2017-11-30 18:37:10 +01:00
antirez
e4c6e8a710 PSYNC2 test: check ability to resync after restart. 2016-11-29 11:15:16 +01:00
antirez
bda1dd05b9 PSYNC2 test: 20 seconds are enough... 2016-11-29 10:27:53 +01:00
antirez
4d8362506b PSYNC2 test: modify the test for production. 2016-11-29 10:22:40 +01:00
antirez
c90c479f9e PSYNC2: stop sending newlines to sub-slaves when master is down.
This actually includes two changes:

1) No newlines are sent to keep the master-slave link up when the upstream
master is down. Doing this is dangerous because the sub-slave has often
received the replication protocol only up to half a command, so it can't
receive newlines without desyncing the replication link, even with the code
that cancels out the bytes that PSYNC2 was using. Moreover, this is probably
also not needed or sane, because the slave can keep serving requests anyway,
and because if it's configured to not serve stale data, it's actually a good
idea to break the link.

2) When a +CONTINUE with a different ID is received, we now break the
connection with the sub-slaves: they need to be notified as well. This was
part of the original specification, but for some reason it was not
implemented in the code, and was later found as a PSYNC2 bug in the
integration testing.
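
As a rough sketch of change (2) (names are hypothetical, not the actual Redis
functions): on a +CONTINUE carrying a replication ID different from the
cached one, the sub-slave links are broken so they renegotiate PSYNC against
the new ID.

```c
#include <stdio.h>
#include <string.h>

#define REPLID_LEN 40 /* replication IDs are 40 hex characters */

static char cached_master_replid[REPLID_LEN + 1];

/* Hypothetical stand-in for breaking every sub-slave connection so they
 * re-handshake against the new replication ID. */
static void disconnect_subslaves(void) {
    printf("disconnecting sub-slaves\n");
}

/* Called when the master answers our PSYNC with +CONTINUE <replid>. */
void on_continue_reply(const char *new_replid) {
    if (strncmp(cached_master_replid, new_replid, REPLID_LEN) != 0) {
        /* The master switched history: sub-slaves must be notified too,
         * which this change does by breaking their connections. */
        disconnect_subslaves();
        memcpy(cached_master_replid, new_replid, REPLID_LEN);
        cached_master_replid[REPLID_LEN] = '\0';
    }
}
```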
2016-11-28 17:54:04 +01:00
antirez
a6561f8e30 PSYNC2: Test (WIP).
This is the PSYNC2 test that helped find issues in the code, and that can
still show a protocol desync from time to time. Work is in progress to find
the issue. For now the test is not enabled in "make test" and must be run
manually.
2016-11-28 10:13:24 +01:00