35 Commits

Author SHA1 Message Date
John Sully
f49d8f9adb Merge tag '6.2.1' into unstable
Former-commit-id: bfed57e3e0edaa724b9d060a6bb8edc5a6de65fa
2021-05-19 02:59:48 +00:00
Oran Agra
912e22b4f9 Sanitize dump payload: fuzz tester and fixes for segfaults and leaks it exposed
The test creates keys with various encodings, DUMPs them, corrupts the payload,
and RESTOREs it.
It utilizes the recently added use-exit-on-panic config to distinguish between
asserts and segfaults.
If the restore succeeds, it runs random commands on the key to attempt to
trigger a crash.

It runs in two modes, one with deep sanitation enabled and one without.
In the first we don't expect any assertions or segfaults; in the second we
expect assertions, but no segfaults.
We also check for leaks and invalid reads using valgrind, and if we find them
we print the commands that led to that issue.

Changes in the code (other than the test):
- Replace a few NPD (null pointer dereference) flows and division by zero with
  an assertion, so that they don't fail the test (since we set the server to
  use `exit` rather than `abort` on assertion).
- Fix quite a lot of flows in rdb.c that could have led to memory leaks in the
  RESTORE command (since it now responds with an error rather than panicking)
- Add a DEBUG flag for SET-SKIP-CHECKSUM-VALIDATION so that the test doesn't
  need to bother with faking a valid checksum
- Remove a pile of code in serverLogObjectDebugInfo which is actually unsafe to
  run in the crash report (see comments in the code)
- Fix a missing boundary check in lzf_decompress

test suite infra improvements:
- be able to run valgrind checks before the process terminates
- rotate log files when restarting servers
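
The flow described above is implemented as a Tcl test in the commit; as a
rough standalone illustration, a minimal sketch of the same
DUMP/corrupt/RESTORE loop using hiredis could look like this (key name,
setup command, and iteration count are illustrative, not from the commit):

```c
#include <stdlib.h>
#include <string.h>
#include <hiredis/hiredis.h>

int main(void) {
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (!c || c->err) return 1;

    /* The DEBUG flag added by this commit lets the corrupted payload get
     * past checksum validation and reach the deeper decoding paths. */
    freeReplyObject(redisCommand(c, "DEBUG SET-SKIP-CHECKSUM-VALIDATION 1"));
    freeReplyObject(redisCommand(c, "RPUSH fuzzkey a b c"));

    for (int i = 0; i < 1000; i++) {
        redisReply *dump = redisCommand(c, "DUMP fuzzkey");
        if (!dump || dump->type != REDIS_REPLY_STRING) {
            freeReplyObject(dump);
            break;
        }

        /* Flip one random bit in the serialized payload. */
        char *payload = malloc(dump->len);
        memcpy(payload, dump->str, dump->len);
        payload[rand() % dump->len] ^= 1 << (rand() % 8);

        /* RESTORE must reply with an error or succeed; never crash. */
        redisReply *r = redisCommand(c, "RESTORE corrupted 0 %b REPLACE",
                                     payload, (size_t)dump->len);
        if (r && r->type != REDIS_REPLY_ERROR) {
            /* Restore succeeded: run a command against the key to try
             * to trigger a crash, as the test does. */
            freeReplyObject(redisCommand(c, "LRANGE corrupted 0 -1"));
        }
        freeReplyObject(r);
        freeReplyObject(dump);
        free(payload);
    }
    redisFree(c);
    return 0;
}
```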
2020-12-06 14:54:34 +02:00
John Sully
14daf6f909 Merge tag '6.0.8' into unstable
Former-commit-id: 4c7e4b91a6bb2034636856b608b8c386d07f5541
2020-09-30 19:47:55 +00:00
Oran Agra
298e93c360 tests/valgrind: don't use debug restart (#7404)
* tests/valgrind: don't use debug restart

DEBUG RESTART causes two issues:
1. it uses execve which replaces the original process, so valgrind doesn't
   have a chance to check for errors and leaks go unreported.
2. valgrind reports invalid calls to close() which we're unable to resolve.

So now the tests use the restart_server mechanism in the tests, which
terminates the old server and starts a new one: new PID, but same stdout
and stderr.

Since the stderr can contain two or more valgrind reports, it is not enough
to just check for the absence of leaks; we also need to check for some known
errors. We do both, and fail if we either find an error or can't find a
report saying there are no leaks.

Other changes:
- when killing a server that was already terminated, we check for leaks too.
- adding DEBUG LEAK, which was used to test it.
- adding --trace-children to valgrind, although it is no longer needed.
- since the stdout contains two or more runs, we need a slightly different way
  of checking if the new process is up (explicitly looking for the new PID)
- move the code that handles --wait-server to happen earlier (before
  watching the startup message in the log), and serve the restarted server too.

* squashme - CR fixes

(cherry picked from commit 8d4f055e43ab554adfce617c971f10c4b6423484)
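
As a rough sketch of the stderr check described above (path handling and the
exact marker strings are condensed; the real check matches each valgrind
report in the file separately):

```c
#include <stdio.h>
#include <string.h>

/* Return 0 only if the valgrind stderr at `path` contains a clean leak
 * report and none of the known error markers. */
int check_valgrind_stderr(const char *path) {
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    char line[1024];
    int clean_report = 0, known_error = 0;
    while (fgets(line, sizeof(line), f)) {
        /* Lines valgrind prints in a clean leak summary. */
        if (strstr(line, "definitely lost: 0 bytes") ||
            strstr(line, "no leaks are possible"))
            clean_report = 1;
        /* Known errors we must not accept. */
        if (strstr(line, "Invalid read") ||
            strstr(line, "Invalid write") ||
            strstr(line, "Invalid free"))
            known_error = 1;
    }
    fclose(f);
    /* Fail on any error, or on the absence of a "no leaks" report. */
    return (known_error || !clean_report) ? 1 : 0;
}
```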
2020-07-20 21:08:26 +03:00
Oran Agra
8d4f055e43 tests/valgrind: don't use debug restart (#7404)
* tests/valgrind: don't use debug restart

DEBUG RESTART causes two issues:
1. it uses execve which replaces the original process, so valgrind doesn't
   have a chance to check for errors and leaks go unreported.
2. valgrind reports invalid calls to close() which we're unable to resolve.

So now the tests use the restart_server mechanism in the tests, which
terminates the old server and starts a new one: new PID, but same stdout
and stderr.

Since the stderr can contain two or more valgrind reports, it is not enough
to just check for the absence of leaks; we also need to check for some known
errors. We do both, and fail if we either find an error or can't find a
report saying there are no leaks.

Other changes:
- when killing a server that was already terminated, we check for leaks too.
- adding DEBUG LEAK, which was used to test it.
- adding --trace-children to valgrind, although it is no longer needed.
- since the stdout contains two or more runs, we need a slightly different way
  of checking if the new process is up (explicitly looking for the new PID)
- move the code that handles --wait-server to happen earlier (before
  watching the startup message in the log), and serve the restarted server too.

* squashme - CR fixes
2020-07-10 08:26:52 +03:00
John Sully
4820142896 PSYNC test shouldn't wait forever
Former-commit-id: 130613e16636923296a8d5b2c4bc623e62fef2f5
2020-06-01 16:13:58 -04:00
John Sully
92de178bfe PSYNC test reliability improvements (test only issue)
Former-commit-id: 50fd4fa7e62f3996f15f6a8c4dcd892022f111ec
2020-06-01 16:01:26 -04:00
John Sully
ed2e0e66f6 Merge tag '6.0.4' into unstable
Redis 6.0.4.


Former-commit-id: 9c31ac7925edba187e527f506e5e992946bd38a6
2020-05-29 00:57:07 -04:00
antirez
41bb699867 Test: take PSYNC2 test master timeout high during switch.
This will likely avoid false positives due to trailing pings.
2020-05-28 10:56:14 +02:00
antirez
0071eb1311 Test: take PSYNC2 test master timeout high during switch.
This will likely avoid false positives due to trailing pings.
2020-05-28 10:47:30 +02:00
antirez
0163e4e495 Another meaningful offset test removed. 2020-05-28 10:09:51 +02:00
antirez
2411e4e33f Test: PSYNC2 test can now show server logs. 2020-05-28 10:09:51 +02:00
antirez
fafe3502da Another meaningful offset test removed. 2020-05-27 12:50:02 +02:00
antirez
d325091ba6 Test: PSYNC2 test can now show server logs. 2020-05-25 20:26:29 +02:00
antirez
3d478f2e3f Improve the PSYNC2 test reliability. 2020-05-22 12:37:49 +02:00
John Sully
193d7c76cb Fix bad merge in CI.yml
Former-commit-id: 6311d709c39b3bacaeab77b18033010f1b548f81
2020-05-21 22:09:06 -04:00
antirez
5781712458 Improve the PSYNC2 test reliability. 2020-05-17 18:24:34 +02:00
Oran Agra
a8995ce3c9 fix loading race in psync2 tests 2020-04-28 11:20:15 +02:00
Oran Agra
a29e617381 fix loading race in psync2 tests 2020-04-28 09:18:01 +03:00
Oran Agra
58619c1286 Keep track of meaningful replication offset in replicas too
Now both master and replicas keep track of the last replication offset
that contains meaningful data (ignoring the trailing pings), and both
trim that tail from the replication backlog and from the offset they
use for psync.

The implication is that if a replica missed some pings, or even has more
pings than the promoted replica has, it will still be able to psync
(avoiding a full sync).

The downside (which was already committed) is that replicas running old
code may fail to psync, since the promoted replica trims pings from its
backlog.

This commit adds a test that reproduces several cases of promotions and
demotions with stale and non-stale pings.

Background:
The meaningful offset on the master was added recently to solve a problem where
the master is left all alone, injecting PINGs into its backlog when no one is
listening, and then gets demoted and tries to replicate from a replica that
didn't have any of the PINGs (or at least not the last ones).

However, consider this case:
master A has two replicas (B and C) replicating directly from it.
There's no traffic at all, and also no network issues, just many pings in the
tail of the backlog. Now B gets promoted, A becomes a replica of B, and C
remains a replica of A. When A gets demoted, it trims the pings from its
backlog, and successfully replicates from B. However, C is still aware of
these PINGs: when it disconnects and re-connects to A, it will ask for
something that's no longer in the backlog (since A trimmed the tail of its
backlog), and be forced to do a full sync (something it didn't have to do
before the meaningful offset fix).

Besides that, the psync2 test was always failing randomly here and there; it
turns out the reason was PINGs. Investigating it shows the following scenario:

cycle 1: redis #1 is master, and all the rest are direct replicas of #1
cycle 2: redis #2 is promoted to master, #1 is a replica of #2 and #3 is a
replica of #1
Now we see that when #1 is demoted it prints:
17339:S 21 Apr 2020 11:16:38.523 * Using the meaningful offset 3929963 instead of 3929977 to exclude the final PINGs (14 bytes difference)
17339:S 21 Apr 2020 11:16:39.391 * Trying a partial resynchronization (request e2b3f8817735fdfe5fa4626766daa938b61419e5:3929964).
17339:S 21 Apr 2020 11:16:39.392 * Successful partial resynchronization with master.
and when #3 connects to the demoted #1, #1 says:
17339:S 21 Apr 2020 11:16:40.084 * Partial resynchronization not accepted: Requested offset for secondary ID was 3929978, but I can reply up to 3929964

So the issue here is that the meaningful offset feature saved the day for the
demoted master (since it needs to sync from a replica that didn't get the last
ping), but it didn't help one of the other replicas which did get the last
ping.
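
The "14 bytes difference" in the log above is exactly one replicated PING,
which occupies "*1\r\n$4\r\nPING\r\n" in the replication stream. As a rough
sketch of the trimming idea (the function and names are hypothetical; the
actual implementation tracks the meaningful offset as it writes, rather than
scanning the backlog):

```c
#include <stddef.h>
#include <string.h>

/* A replicated PING as it appears in the stream: 14 bytes. */
#define PING_CMD "*1\r\n$4\r\nPING\r\n"
#define PING_LEN (sizeof(PING_CMD) - 1)

/* Given the tail of the replication backlog and the current offset,
 * return the offset with any whole trailing PINGs excluded. */
long long meaningful_offset(const char *tail, size_t len, long long offset) {
    while (len >= PING_LEN &&
           memcmp(tail + len - PING_LEN, PING_CMD, PING_LEN) == 0) {
        len -= PING_LEN;
        offset -= PING_LEN;
    }
    return offset;
}
```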
2020-04-27 15:52:49 +02:00
Oran Agra
5633862924 Keep track of meaningful replication offset in replicas too
Now both master and replicas keep track of the last replication offset
that contains meaningful data (ignoring the trailing pings), and both
trim that tail from the replication backlog and from the offset they
use for psync.

The implication is that if a replica missed some pings, or even has more
pings than the promoted replica has, it will still be able to psync
(avoiding a full sync).

The downside (which was already committed) is that replicas running old
code may fail to psync, since the promoted replica trims pings from its
backlog.

This commit adds a test that reproduces several cases of promotions and
demotions with stale and non-stale pings.

Background:
The meaningful offset on the master was added recently to solve a problem where
the master is left all alone, injecting PINGs into its backlog when no one is
listening, and then gets demoted and tries to replicate from a replica that
didn't have any of the PINGs (or at least not the last ones).

However, consider this case:
master A has two replicas (B and C) replicating directly from it.
There's no traffic at all, and also no network issues, just many pings in the
tail of the backlog. Now B gets promoted, A becomes a replica of B, and C
remains a replica of A. When A gets demoted, it trims the pings from its
backlog, and successfully replicates from B. However, C is still aware of
these PINGs: when it disconnects and re-connects to A, it will ask for
something that's no longer in the backlog (since A trimmed the tail of its
backlog), and be forced to do a full sync (something it didn't have to do
before the meaningful offset fix).

Besides that, the psync2 test was always failing randomly here and there; it
turns out the reason was PINGs. Investigating it shows the following scenario:

cycle 1: redis #1 is master, and all the rest are direct replicas of #1
cycle 2: redis #2 is promoted to master, #1 is a replica of #2 and #3 is a
replica of #1
Now we see that when #1 is demoted it prints:
17339:S 21 Apr 2020 11:16:38.523 * Using the meaningful offset 3929963 instead of 3929977 to exclude the final PINGs (14 bytes difference)
17339:S 21 Apr 2020 11:16:39.391 * Trying a partial resynchronization (request e2b3f8817735fdfe5fa4626766daa938b61419e5:3929964).
17339:S 21 Apr 2020 11:16:39.392 * Successful partial resynchronization with master.
and when #3 connects to the demoted #1, #1 says:
17339:S 21 Apr 2020 11:16:40.084 * Partial resynchronization not accepted: Requested offset for secondary ID was 3929978, but I can reply up to 3929964

So the issue here is that the meaningful offset feature saved the day for the
demoted master (since it needs to sync from a replica that didn't get the last
ping), but it didn't help one of the other replicas which did get the last
ping.
2020-04-27 15:52:23 +02:00
John Sully
0725491043 Merge commit 'c609bf3f2c7f0982f632f82623ee4802868b8ef1' into redis_6_merge
Former-commit-id: 320bc3c0329ff9e5a980b79426b719addae381cf
2020-04-14 21:04:42 -04:00
Oran Agra
cde46df309 fix for flaky psync2 test
*** [err]: PSYNC2: total sum of full synchronizations is exactly 4 in tests/integration/psync2.tcl
Expected 5 == 4 (context: type eval line 6 cmd {assert {$sum == 4}} proc ::test)

The issue was that sometimes the test got an unexpected full sync, since it
tried to switch to the replica before it was in sync with its master.
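
As a rough sketch of the fix's idea (helper names are hypothetical, and the
real test does this in Tcl): before switching roles, wait until the replica's
replication offset has caught up with its master's, so the switch itself
cannot force a full sync.

```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <hiredis/hiredis.h>

/* Parse master_repl_offset out of an INFO replication reply. */
static long long repl_offset(redisContext *c) {
    long long off = -1;
    redisReply *r = redisCommand(c, "INFO replication");
    if (r && r->type == REDIS_REPLY_STRING) {
        char *p = strstr(r->str, "master_repl_offset:");
        if (p) off = atoll(p + strlen("master_repl_offset:"));
    }
    freeReplyObject(r);
    return off;
}

/* Poll until the replica reports the same offset as its master. */
void wait_for_offset_sync(redisContext *master, redisContext *replica) {
    while (repl_offset(replica) < repl_offset(master))
        usleep(100 * 1000); /* 100 ms between polls */
}
```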
2020-03-12 15:53:47 +01:00
Oran Agra
d18e79f91d fix for flaky psync2 test
*** [err]: PSYNC2: total sum of full synchronizations is exactly 4 in tests/integration/psync2.tcl
Expected 5 == 4 (context: type eval line 6 cmd {assert {$sum == 4}} proc ::test)

The issue was that sometimes the test got an unexpected full sync, since it
tried to switch to the replica before it was in sync with its master.
2020-03-05 16:55:14 +02:00
John Sully
397e85befb Merge branch 'unstable' of https://github.com/antirez/redis into MergeRedis
Note: some tests failing

Former-commit-id: 86d7276f24f0cf1a0eceb6cd00a6a0ae2a0fa520
2019-05-11 02:20:34 -04:00
Oran Agra
c76bb465f2 make replication tests more stable on slow machines
solving a few race conditions in replication-related tests which fail on slow
machines

bugfix in the slave buffers test: since the test is executed twice, each time
with a different command count, the threshold for the delta can't be a
constant.
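
As a small sketch of the threshold change (the per-command allowance here is
illustrative, not the test's actual value): the allowed delta scales with the
command count instead of being a constant.

```c
/* The allowed buffer delta grows with the number of commands the run
 * executed, so both executions of the test get a fair bound. */
long long max_allowed_delta(long long commands_count) {
    return commands_count * 64; /* illustrative per-command allowance */
}
```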
2019-05-05 08:25:01 +03:00
John Sully
3e591fe487 Make PSYNC2 tests more reliable on slower hardware
Former-commit-id: 7b1dd0b60d0d65baa43cb69457e06744e0d9094f
2019-03-26 18:59:31 -04:00
antirez
d7dd6b4618 Slave removal: remove slave from integration tests descriptions. 2018-09-11 15:32:28 +02:00
antirez
d76d5e06fa Minor improvements to PR #5187. 2018-07-31 17:30:12 +02:00
antirez
db46df9153 Regression test: Slave restart with EVALSHA in backlog issue #4483. 2017-11-30 18:37:10 +01:00
antirez
e4c6e8a710 PSYNC2 test: check ability to resync after restart. 2016-11-29 11:15:16 +01:00
antirez
bda1dd05b9 PSYNC2 test: 20 seconds are enough... 2016-11-29 10:27:53 +01:00
antirez
4d8362506b PSYNC2 test: modify the test for production. 2016-11-29 10:22:40 +01:00
antirez
c90c479f9e PSYNC2: stop sending newlines to sub-slaves when master is down.
This actually includes two changes:

1) No newlines are sent to keep the master-slave link up when the upstream
master is down. Doing this is dangerous because the sub-slave has often
received the replication protocol only up to half a command, so it can't
receive newlines without desyncing the replication link, even with the code
that cancels out the bytes that PSYNC2 was using. Moreover, this is probably
also not needed or sane, because the slave can keep serving requests anyway,
and because if it's configured to not serve stale data, it's actually a good
idea to break the link.

2) When a +CONTINUE with a different ID is received, we now break the
connection with the sub-slaves: they need to be notified as well. This was
part of the original specification, but for some reason it was not
implemented in the code, and was later found as a PSYNC2 bug in the
integration testing.
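
As a rough sketch of change (2) (names are hypothetical, not the actual Redis
functions): on a +CONTINUE carrying a replication ID different from the
cached one, the sub-slave links are broken so they renegotiate PSYNC against
the new ID.

```c
#include <stdio.h>
#include <string.h>

#define REPLID_LEN 40 /* replication IDs are 40 hex characters */

static char cached_master_replid[REPLID_LEN + 1];

/* Hypothetical stand-in for breaking every sub-slave connection so they
 * re-handshake against the new replication ID. */
static void disconnect_subslaves(void) {
    printf("disconnecting sub-slaves\n");
}

/* Called when the master answers our PSYNC with +CONTINUE <replid>. */
void on_continue_reply(const char *new_replid) {
    if (strncmp(cached_master_replid, new_replid, REPLID_LEN) != 0) {
        /* The master switched history: sub-slaves must be notified too,
         * which this change does by breaking their connections. */
        disconnect_subslaves();
        memcpy(cached_master_replid, new_replid, REPLID_LEN);
        cached_master_replid[REPLID_LEN] = '\0';
    }
}
```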
2016-11-28 17:54:04 +01:00
antirez
a6561f8e30 PSYNC2: Test (WIP).
This is the PSYNC2 test that helped find issues in the code, and that can
still show a protocol desync from time to time. Work is in progress to find
the issue. For now the test is not enabled in "make test" and must be run
manually.
2016-11-28 10:13:24 +01:00