79 Commits

Author SHA1 Message Date
VivekSainiEQ
7992428583 Merge remote-tracking branch 'opensource/unstable' into PRO_RELEASE_6
Former-commit-id: 60da86471f68b13e2456e113ecd4aa117d95b134
2021-11-12 21:12:53 +00:00
VivekSainiEQ
b84ee93b5d Merge tag '6.2.6' into Redis_626_Merge
Former-commit-id: e6d7e01be6965110d487e12f40511fe0b3497695
2021-10-21 22:33:55 +00:00
Oran Agra
4233b60978 longer timeout in replication test (#8963)
the test normally passes. but we saw one failure in a valgrind run in github actions

(cherry picked from commit 8458baf6a96fa6c6050bac24160f82d32a0b9ed4)
2021-07-21 21:06:49 +03:00
YaacovHazan
4af7a28728 stabilize tests that involved with load handlers (#8967)
When test stop 'load handler' by killing the process that generating the load,
some commands that already in the input buffer, still might be processed by the server.
This may cause some instability in tests, that count on that no more commands
processed after we stop the `load handler'

In this commit, new proc 'wait_load_handlers_disconnected' added, to verify that no more
cammands from any 'load handler' prossesed, by checking that the clients who
genreate the load is disconnceted.

Also, replacing check of dbsize with wait_for_ofs_sync before comparing debug digest, as
it would fail in case the last key the workload wrote was an overridden key (not a new one).

Affected tests
Race fix:
- failover command to specific replica works
- Connect multiple replicas at the same time (issue #141), master diskless=$mdl, replica diskless=$sdl
- AOF rewrite during write load: RDB preamble=$rdbpre

Cleanup and speedup:
- Test replication with blocking lists and sorted sets operations
- Test replication with parallel clients writing in different DBs
- Test replication partial resync: $descr (diskless: $mdl, $sdl, reconnect: $reconnect

(cherry picked from commit 32a2584e079a1b3c2d1e6649e38239381a73a459)
2021-07-21 21:06:49 +03:00
YaacovHazan
bd2785d18b unregister AE_READABLE from the read pipe in backgroundSaveDoneHandlerSocket (#8991)
In diskless replication, we create a read pipe for the RDB, between the child and the parent.
When we close this pipe (fd), the read handler also needs to be removed from the event loop (if it still registered).
Otherwise, next time we will use the same fd, the registration will be fail (panic), because
we will use EPOLL_CTL_MOD (the fd still register in the event loop), on fd that already removed from epoll_ctl

(cherry picked from commit 501d7755831527b4237f9ed6050ec84203934e4d)
2021-06-01 17:03:36 +03:00
John Sully
ef5bf1fdd7 Prevent partial sync in test that requires only full syncs
Former-commit-id: 1b9fea066914d7f23d6bec220f26b8c0112d7f8b
2021-05-29 02:22:20 +00:00
John Sully
f4151f0d6b Merge branch 'unstable' into keydbpro
Former-commit-id: 205d8f18d2bb8df5253bab40578b006b7aa73fd5
2021-05-28 23:32:46 +00:00
John Sully
ea6a0f214b Merge tag '6.2.2' into unstable
Former-commit-id: 93ebb31b17adec5d406d2e30a5b9ea71c07fce5c
2021-05-21 05:54:39 +00:00
John Sully
f49d8f9adb Merge tag '6.2.1' into unstable
Former-commit-id: bfed57e3e0edaa724b9d060a6bb8edc5a6de65fa
2021-05-19 02:59:48 +00:00
Oran Agra
442bd29612 Fix timing of new replication test (#8807)
In github actions CI with valgrind, i saw that even the fast replica
(one that wasn't paused), didn't get to complete the replication fast
enough, and ended up getting disconnected by timeout.

Additionally, due to a typo in uname, we didn't get to actually run the
CPU efficiency part of the test.
2021-04-18 15:12:34 +03:00
guybe7
0bf6c205db Add a timeout mechanism for replicas stuck in fullsync (#8762)
Starting redis 6.0 (part of the TLS feature), diskless master uses pipe from the fork
child so that the parent is the one sending data to the replicas.
This mechanism has an issue in which a hung replica will cause the master to wait
for it to read the data sent to it forever, thus preventing the fork child from terminating
and preventing the creations of any other forks.

This PR adds a timeout mechanism, much like the ACK-based timeout,
we disconnect replicas that aren't reading the RDB file fast enough.
2021-04-15 17:18:51 +03:00
Qu Chen
1372210a03 Properly initialize variable to make valgrind happy in checkChildrenDone(). Removed usage for the obsolete wait3() and wait4() in favor of waitpid(), and properly check for the exit status code. (#8666) 2021-03-24 08:41:05 -07:00
Oran Agra
10929ace19 Fix race in replication test (#8679)
Since redis 6.2, redis immediately tries to connect to the master, not
waiting for replication cron.

in the slow freebsd CI, this test failed and master_link_status was
already "up" when INFO was called.
2021-03-22 10:50:39 +02:00
christianEQ
c068f2cd3d Merge tag 'tags/6.0.10' into redismerge_2021-01-20
Former-commit-id: dadce055f897cee83946c2d3e5cbb76341b94230
2021-01-26 21:43:09 +00:00
Yossi Gottlieb
e4d0ba933d Add io-thread daily CI tests. (#8232)
This adds basic coverage to IO threads by running the cluster and few selected Redis test suite tests with the IO threads enabled.

Also provides some necessary additional improvements to the test suite:

* Add --config to sentinel/cluster tests for arbitrary configuration.
* Fix --tags whitelisting which was broken.
* Add a `network` tag to some tests that are more network intensive. This is work in progress and more tests should be properly tagged in the future.
2021-01-17 15:48:48 +02:00
Oran Agra
54b8623eeb Propagate GETSET and SET-GET as SET (#7957)
- Generates a more backwards compatible command stream
- Slightly more efficient execution in replica/AOF
- Add a test for coverage
2020-11-03 14:56:57 +02:00
Wang Yuan
2916bb4be8 Fix timing dependence in replication tcl tests (#7969)
Remove 'fork child $pid' log in replication.tcl
2020-10-27 09:36:42 +02:00
Yossi Gottlieb
deaa71637a Add a --no-latency tests flag. (#7939)
Useful for running tests on systems which may be way slower than usual.

(cherry picked from commit be43c030df2a352c5fa02d70e4dcaa46ece25b2b)
2020-10-27 09:12:01 +02:00
Yossi Gottlieb
be43c030df Add a --no-latency tests flag. (#7939)
Useful for running tests on systems which may be way slower than usual.
2020-10-22 11:10:53 +03:00
Felipe Machado
57dfccfbf9 Adds new pop-push commands (LMOVE, BLMOVE) (#6929)
Adding [B]LMOVE <src> <dst> RIGHT|LEFT RIGHT|LEFT. deprecating [B]RPOPLPUSH.

Note that when receiving a BRPOPLPUSH we'll still propagate an RPOPLPUSH,
but on BLMOVE RIGHT LEFT we'll propagate an LMOVE

improvement to existing tests
- Replace "after 1000" with "wait_for_condition" when wait for
  clients to block/unblock.
- Add a pre-existing element to target list on basic tests so
  that we can check if the new element was added to the correct
  side of the list.
- check command stats on the replica to make sure the right
  command was replicated

Co-authored-by: Oran Agra <oran@redislabs.com>
2020-10-08 08:33:17 +03:00
John Sully
14daf6f909 Merge tag '6.0.8' into unstable
Former-commit-id: 4c7e4b91a6bb2034636856b608b8c386d07f5541
2020-09-30 19:47:55 +00:00
Wang Yuan
b551c8fdb7 Kill disk-based fork child when all replicas drop and 'save' is not enabled (#7819)
When all replicas waiting for a bgsave get disconnected (possibly due to output buffer limit),
It may be good to kill the bgsave child. in diskless replication it already happens, but in
disk-based, the child may still serve some purpose (for persistence).

By killing the child, we prevent it from eating COW memory in vain, and we also allow a new child fork sooner for the next full synchronization or bgsave.
We do that only if rdb persistence wasn't enabled in the configuration.

Btw, now, rdbRemoveTempFile in killRDBChild won't block server, so we can killRDBChild safely.
2020-09-22 09:47:58 +03:00
Oran Agra
725616534e if diskless repl child is killed, make sure to reap the pid (#7742)
Starting redis 6.0 and the changes we made to the diskless master to be
suitable for TLS, I made the master avoid reaping (wait3) the pid of the
child until we know all replicas are done reading their rdb.

I did that in order to avoid a state where the rdb_child_pid is -1 but
we don't yet want to start another fork (still busy serving that data to
replicas).

It turns out that the solution used so far was problematic in case the
fork child was being killed (e.g. by the kernel OOM killer), in that
case there's a chance that we currently disabled the read event on the
rdb pipe, since we're waiting for a replica to become writable again.
and in that scenario the master would have never realized the child
exited, and the replica will remain hung too.
Note that there's no mechanism to detect a hung replica while it's in
rdb transfer state.

The solution here is to add another pipe which is used by the parent to
tell the child it is safe to exit. this mean that when the child exits,
for whatever reason, it is safe to reap it.

Besides that, i'm re-introducing an adjustment to REPLCONF ACK which was
part of #6271 (Accelerate diskless master connections) but was dropped
when that PR was rebased after the TLS fork/pipe changes (6fd5ff8).
Now that RdbPipeCleanup no longer calls checkChildrenDone, and the ACK
has chance to detect that the child exited, it should be the one to call
it so that we don't have to wait for cron (server.hz) to do that.
2020-09-06 16:43:57 +03:00
Oran Agra
10a8407a4f Fix failing tests due to issues with wait_for_log_message (#7572)
- the test now waits for specific set of log messages rather than wait for
  timeout looking for just one message.
- we don't wanna sample the current length of the log after an action, due
  to a race, we need to start the search from the line number of the last
  message we where waiting for.
- when attempting to trigger a full sync, use multi-exec to avoid a race
  where the replica manages to re-connect before we completed the set of
  actions that should force a full sync.
- fix verify_log_message which was broken and unused

(cherry picked from commit 06aaeabaea9d9b248e8a790dde352cd14d66628a)
2020-09-01 09:27:58 +03:00
Oran Agra
cad93ed273 Accelerate diskless master connections, and general re-connections (#6271)
Diskless master has some inherent latencies.
1) fork starts with delay from cron rather than immediately
2) replica is put online only after an ACK. but the ACK
   was sent only once a second.
3) but even if it would arrive immediately, it will not
   register in case cron didn't yet detect that the fork is done.

Besides that, when a replica disconnects, it doesn't immediately
attempts to re-connect, it waits for replication cron (one per second).
in case it was already online, it may be important to try to re-connect
as soon as possible, so that the backlog at the master doesn't vanish.

In case it disconnected during rdb transfer, one can argue that it's
not very important to re-connect immediately, but this is needed for the
"diskless loading short read" test to be able to run 100 iterations in 5
seconds, rather than 3 (waiting for replication cron re-connection)

changes in this commit:
1) sync command starts a fork immediately if no sync_delay is configured
2) replica sends REPLCONF ACK when done reading the rdb (rather than on 1s cron)
3) when a replica unexpectedly disconnets, it immediately tries to
   re-connect rather than waiting 1s
4) when when a child exits, if there is another replica waiting, we spawn a new
   one right away, instead of waiting for 1s replicationCron.
5) added a call to connectWithMaster from replicationSetMaster. which is called
   from the REPLICAOF command but also in 3 places in cluster.c, in all of
   these the connection attempt will now be immediate instead of delayed by 1
   second.

side note:
we can add a call to rdbPipeReadHandler in replconfCommand when getting
a REPLCONF ACK from the replica to solve a race where the replica got
the entire rdb and EOF marker before we detected that the pipe was
closed.
in the test i did see this race happens in one about of some 300 runs,
but i concluded that this race is unlikely in real life (where the
replica is on another host and we're more likely to first detect the
pipe was closed.
the test runs 100 iterations in 3 seconds, so in some cases it'll take 4
seconds instead (waiting for another REPLCONF ACK).

Removing unneeded startBgsaveForReplication from updateSlavesWaitingForBgsave
Now that CheckChildrenDone is calling the new replicationStartPendingFork
(extracted from serverCron) there's actually no need to call
startBgsaveForReplication from updateSlavesWaitingForBgsave anymore,
since as soon as updateSlavesWaitingForBgsave returns, CheckChildrenDone is
calling replicationStartPendingFork that handles that anyway.
The code in updateSlavesWaitingForBgsave had a bug in which it ignored
repl-diskless-sync-delay, but removing that code shows that this bug was
hiding another bug, which is that the max_idle should have used >= and
not >, this one second delay has a big impact on my new test.
2020-08-06 16:53:06 +03:00
Oran Agra
06aaeabaea Fix failing tests due to issues with wait_for_log_message (#7572)
- the test now waits for specific set of log messages rather than wait for
  timeout looking for just one message.
- we don't wanna sample the current length of the log after an action, due
  to a race, we need to start the search from the line number of the last
  message we where waiting for.
- when attempting to trigger a full sync, use multi-exec to avoid a race
  where the replica manages to re-connect before we completed the set of
  actions that should force a full sync.
- fix verify_log_message which was broken and unused
2020-07-28 11:15:29 +03:00
Oran Agra
c994e73c8e stabilize tests that look for log lines (#7367)
tests were sensitive to additional log lines appearing in the log
causing the search to come empty handed.

instead of just looking for the n last log lines, capture the log lines
before performing the action, and then search from that offset.

(cherry picked from commit efc4189b6227a17f26ed9bd6bbac62bf4bf7ab66)
2020-07-20 21:08:26 +03:00
Oran Agra
efc4189b62 stabilize tests that look for log lines (#7367)
tests were sensitive to additional log lines appearing in the log
causing the search to come empty handed.

instead of just looking for the n last log lines, capture the log lines
before performing the action, and then search from that offset.
2020-07-10 08:28:22 +03:00
John Sully
7384abfe56 replication test race
Former-commit-id: e1f3cd6ec3bf2319484de04c3796dcfa75e0479c
2020-06-07 01:14:57 -04:00
John Sully
9e87395c34 Fix for issue #187 we need to properly handle the case where a key with a subkey expirey itself expires during load
Former-commit-id: e6a9a6b428b91b6108df24ae6285ea9b582b7b23
2020-06-01 15:33:19 -04:00
John Sully
ed2e0e66f6 Merge tag '6.0.4' into unstable
Redis 6.0.4.


Former-commit-id: 9c31ac7925edba187e527f506e5e992946bd38a6
2020-05-29 00:57:07 -04:00
John Sully
fa0be83fd9 Merge tag '6.0.2' into unstable
Redis 6.0.2


Former-commit-id: a010e4a4b2cc2bcad1cb14604b7ebc596c35b05e
2020-05-22 16:45:18 -04:00
Oran Agra
7d8259d151 fix valgrind test failure in replication test
in 00323f342 i added more keys to that test to make it run longer
but in valgrind this now means the test times out, give valgrind more
time.
2020-05-22 12:37:49 +02:00
Oran Agra
5e75739bfd add regression test for the race in #7205
with the original version of 6.0.0, this test detects an excessive full
sync.
with the fix in 146201c69, this test detects memory corruption,
especially when using libc allocator with or without valgrind.
2020-05-22 12:37:49 +02:00
Oran Agra
7c88eca1e6 fix valgrind test failure in replication test
in 00323f342 i added more keys to that test to make it run longer
but in valgrind this now means the test times out, give valgrind more
time.
2020-05-18 10:26:53 +03:00
Oran Agra
ba6f40ea94 add regression test for the race in #7205
with the original version of 6.0.0, this test detects an excessive full
sync.
with the fix in 146201c69, this test detects memory corruption,
especially when using libc allocator with or without valgrind.
2020-05-17 18:26:02 +03:00
Oran Agra
5258341880 fix unstable replication test
this test which has coverage for varoius flows of diskless master was
failing randomly from time to time.

the failure was:
[err]: diskless all replicas drop during rdb pipe in tests/integration/replication.tcl
log message of '*Diskless rdb transfer, last replica dropped, killing fork child*' not found

what seemed to have happened is that the master didn't detect that all
replicas dropped by the time the replication ended, it thought that one
replica is still connected.

now the test takes a few seconds longer but it seems stable.
2020-05-14 11:29:43 +02:00
Oran Agra
00323f342d fix unstable replication test
this test which has coverage for varoius flows of diskless master was
failing randomly from time to time.

the failure was:
[err]: diskless all replicas drop during rdb pipe in tests/integration/replication.tcl
log message of '*Diskless rdb transfer, last replica dropped, killing fork child*' not found

what seemed to have happened is that the master didn't detect that all
replicas dropped by the time the replication ended, it thought that one
replica is still connected.

now the test takes a few seconds longer but it seems stable.
2020-05-12 08:59:09 +03:00
John Sully
1adc5e9832 More threading fixes from merge
Former-commit-id: 4a980f4ddbebe3f62703aa3de67c93cdffb6b4b8
2020-01-28 17:54:00 -05:00
zhaozhao.zz
5c56f82bc3 incrbyfloat: fix issue #5256 ttl lost after propagate 2019-12-18 15:44:51 +08:00
Yossi Gottlieb
85d7f38136 Merge remote-tracking branch 'upstream/unstable' into tls 2019-10-16 17:08:07 +03:00
Daniel Dai
46c1377c83 update typo 2019-10-09 14:15:31 -04:00
Yossi Gottlieb
df08b624bd TLS: Configuration options.
Add configuration options for TLS protocol versions, ciphers/cipher
suites selection, etc.
2019-10-07 21:07:27 +03:00
Oran Agra
6fd5ff8c98 diskless replication rdb transfer uses pipe, and writes to sockets form the parent process.
misc:
- handle SSL_has_pending by iterating though these in beforeSleep, and setting timeout of 0 to aeProcessEvents
- fix issue with epoll signaling EPOLLHUP and EPOLLERR only to the write handlers. (needed to detect the rdb pipe was closed)
- add key-load-delay config for testing
- trim connShutdown which is no longer needed
- rioFdsetWrite -> rioFdWrite - simplified since there's no longer need to write to multiple FDs
- don't detect rdb child exited (don't call wait3) until we detect the pipe is closed
- Cleanup bad optimization from rio.c, add another one
2019-10-07 21:06:30 +03:00
Yossi Gottlieb
10ffeb03e4 TLS: Connections refactoring and TLS support.
* Introduce a connection abstraction layer for all socket operations and
integrate it across the code base.
* Provide an optional TLS connections implementation based on OpenSSL.
* Pull a newer version of hiredis with TLS support.
* Tests, redis-cli updates for TLS support.
2019-10-07 21:06:13 +03:00
Oran Agra
73a945c73c prevent diskless replica from terminating on short read
now that replica can read rdb directly from the socket, it should avoid exiting
on short read and instead try to re-sync.

this commit tries to have minimal effects on non-diskless rdb reading.
and includes a test that tries to trigger this scenario on various read cases.
2019-07-17 16:46:22 +02:00
Oran Agra
29754ebe22 diskless replication on slave side (don't store rdb to file), plus some other related fixes
The implementation of the diskless replication was currently diskless only on the master side.
The slave side was still storing the received rdb file to the disk before loading it back in and parsing it.

This commit adds two modes to load rdb directly from socket:
1) when-empty
2) using "swapdb"
the third mode of using diskless slave by flushdb is risky and currently not included.

other changes:
--------------
distinguish between aof configuration and state so that we can re-enable aof only when sync eventually
succeeds (and not when exiting from readSyncBulkPayload after a failed attempt)
also a CONFIG GET and INFO during rdb loading would have lied

When loading rdb from the network, don't kill the server on short read (that can be a network error)

Fix rdb check when performed on preamble AOF

tests:
run replication tests for diskless slave too
make replication test a bit more aggressive
Add test for diskless load swapdb
2019-07-08 15:37:48 +03:00
antirez
11f6eb50a6 Remove debugging printf from replication.tcl test. 2018-12-12 11:55:30 +01:00
antirez
d7dd6b4618 Slave removal: remove slave from integration tests descriptions. 2018-09-11 15:32:28 +02:00
antirez
3880af7808 Test: processing of master stream in slave -BUSY state.
See #5297.
2018-08-31 16:45:02 +02:00