When client tracking is enabled, signalModifiedKey can increase memory usage.
This can cause the loop in performEvictions to keep running, since the loop was
also measuring the memory usage impact of signalModifiedKey.
The section that measures the memory impact of the eviction should be just on dbDelete,
excluding keyspace notification, client tracking, and propagation to AOF and replicas.
This resolves part of the problem described in #8069
p.s. fix took 1 minute, test took about 3 hours to write.
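A toy model of the measurement scoping described above, as a rough sketch only. This is not Redis code: `ToyStore`, `evict_one` and the 16-byte tracking cost are hypothetical names and numbers chosen for illustration.

```python
class ToyStore:
    """Hypothetical stand-in for a keyspace with client tracking enabled."""

    def __init__(self):
        self.data = {}
        self.tracking_table = []  # grows when invalidations are queued

    def used_memory(self):
        # Assumed cost model: value bytes plus 16 bytes per queued invalidation.
        return sum(len(v) for v in self.data.values()) + 16 * len(self.tracking_table)

    def db_delete(self, key):
        del self.data[key]

    def signal_modified_key(self, key):
        # Stand-in for client tracking: queuing an invalidation costs memory.
        self.tracking_table.append(key)


def evict_one(store, key):
    """Return the bytes freed by the delete itself, ignoring side effects."""
    before = store.used_memory()
    store.db_delete(key)
    freed = before - store.used_memory()  # measurement window ends here
    store.signal_modified_key(key)        # runs outside the window
    return freed
```

If the window also covered `signal_modified_key`, evicting a 10-byte value in this model would appear to free a negative amount, and an eviction loop driven by that number would not converge.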
(cherry picked from commit c4fdf09c0584a3cee32b92f01b7958c72776aedc)
When a Lua script returns a map to redis (a feature which was added in
redis 6 together with RESP3), it would have returned the value first and
the key second.
If the client was using RESP2, it was getting them out of order, and if
the client was in RESP3, it was getting a map of value => key.
This was happening regardless of the Lua script using redis.setresp(3)
or not.
This also affects a case where the script was returning a map which it got
from redis by doing something like: redis.setresp(3); return redis.call()
This fix is a breaking change for redis 6.0 users who happened to rely
on the wrong order (either ones that used redis.setresp(3), or ones that
returned a map explicitly).
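For reference, a minimal sketch of the intended ordering on the wire. It is not the Redis implementation; `encode_bulk` and `encode_map` are hypothetical helpers that only show a map being emitted key first and value second, as a RESP3 map or as a flat RESP2 array.

```python
def encode_bulk(text):
    """Encode one string as a RESP bulk string."""
    data = text.encode()
    return b"$" + str(len(data)).encode() + b"\r\n" + data + b"\r\n"


def encode_map(pairs, resp3):
    """Encode (key, value) pairs as a RESP3 map or a flat RESP2 array."""
    if resp3:
        out = b"%" + str(len(pairs)).encode() + b"\r\n"      # map of N pairs
    else:
        out = b"*" + str(2 * len(pairs)).encode() + b"\r\n"  # array of 2N items
    for key, value in pairs:
        out += encode_bulk(key) + encode_bulk(value)  # key first, then value
    return out
```

A RESP2 client then reads `key, value, key, value, ...`, and a RESP3 client reads a map of key => value.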
This commit also includes two other changes in the tests:
1. The test suite now handles RESP3 maps as dicts rather than nested
lists
2. Remove some redundant (duplicate) tests from tracking.tcl
(cherry picked from commit 2017407b4d1d19a91af1e7c0b199f2c1775dbaf9)
The tests sometimes fail to find a log message.
Recently i added a print that shows the log files that are searched
and it shows that the message was indeed there.
The only reason i can think of for this search to fail is that we
happened to read an incomplete line, which didn't match our pattern, and
then on the next iteration we would continue reading from the line after
it.
The fix is to always re-evaluate the previous line.
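A sketch of the same idea in Python, assuming a byte-offset reader rather than the line-counting approach of the test suite. `scan_new_lines` is a hypothetical helper: it never consumes an unterminated last line, so a partially written line is evaluated again on the next poll.

```python
def scan_new_lines(path, offset, pattern):
    """Scan complete lines starting at byte `offset`.

    Returns (found, new_offset). An unterminated last line is not consumed,
    so the next call re-reads it once the writer has finished it.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        chunk = f.read()
    pos = 0
    while True:
        end = chunk.find(b"\n", pos)
        if end == -1:
            break  # incomplete line: leave it for the next call
        line = chunk[pos:end]
        pos = end + 1
        if pattern in line:
            return True, offset + pos
    return False, offset + pos
```

The caller keeps the returned offset and passes it back in on the next iteration.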
(cherry picked from commit 4e2e5be201439cae4c0a03cfc8b6a60be4bff625)
if there are nested tests and nested servers, we need to restore the
previous value of cur_test when a test exits.
example:
```
test {test 1} {
    start_server {
        test {test 1.1 - master only} {
        }
        start_server {
            test {test 1.2 - with replication} {
            }
        }
    }
}
```
when `test 1.1 - master only` exits, we're still inside `test 1`
(cherry picked from commit 0a1e7341935dbca4bae582de1a4a26d5ed4c652d)
1) cur_test: when restart_server, "no such variable" error occurs
   ./runtest --single integration/rdb
   test {client freed during loading}
     SET ::cur_test
     restart_server
       kill_server
         test "Check for memory leaks (pid $pid)"
           SET ::cur_test
           UNSET ::cur_test
     UNSET ::cur_test // This global variable has been unset.
2) `ps --ppid` is not available on the macOS platform; it can be replaced
with `pgrep -P pid`.
(cherry picked from commit f22fa9594d536cb53f83ed8e508c03d4278778b0)
There is an inherent race condition in port allocation for spawned
servers. If a server fails to start because a port is taken, a new port
is allocated. This fixes a problem where the logs are not truncated and
as a result a large number of unmonitored servers are started.
(cherry picked from commit 2df4cb93acabf10bb0ff39c12030791b0947e719)
- redirect valgrind reports to a dedicated file rather than console
- try to avoid killing instances with SIGKILL so that we get the memory
leak report (killing with SIGTERM before resorting to SIGKILL)
- search for valgrind reports when done, print them and fail the tests
- add --dont-clean option to keep the logs on exit
- fix exit error code when crash is found (would have exited with 0)
changes that affect the normal redis test suite:
- refactor check_valgrind_errors into two functions, one to search and
one to report
- move the search half into util.tcl to serve the cluster tests too
- ignore "address range perms" valgrind warnings, which seem irrelevant.
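The SIGTERM-before-SIGKILL step listed above could look roughly like this in Python. This is a sketch, not the suite's Tcl code; `stop_process` and the 5 second grace period are assumptions.

```python
import subprocess


def stop_process(proc, grace_seconds=5.0):
    """Ask `proc` to exit with SIGTERM; use SIGKILL only if it does not."""
    proc.terminate()  # SIGTERM on POSIX, so the process can still write its report
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()   # SIGKILL on POSIX, last resort
        proc.wait()
        return "killed"
```

A process that is killed outright never reaches its exit handlers, which is why the gentler signal is tried first when a leak report is wanted.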
(cherry picked from commit 2b998de46078c172c6b19ac3b779318e7992c60a)
in some cases a command that returns an error possibly due to a timing
issue causes the tcl code to crash and thus prevents the rest of the
tests from running. this adds an option to make the test proceed despite
the crash.
maybe it should be the default mode some day.
(cherry picked from commit fe5da2e60d8d6d907062f4789673fbe06fa8773e)
reduce code duplication in aof.tcl.
move creation of clients into the test so that it can be skipped
(cherry picked from commit 1b7ba44e7917082ac6d5523666d3b4ab210dfbad)
- skip full units
- skip a single test (not just a list of tests)
- when skipping tag, skip spinning up servers, not just the tests
- skip tags when running against an external server too
- allow using multiple tags (split them)
(cherry picked from commit 677d14c2137ab50fa25c8163d20b14bc563261c7)
- the test now waits for specific set of log messages rather than wait for
timeout looking for just one message.
- we don't want to sample the current length of the log after an action, due
to a race; we need to start the search from the line number of the last
message we were waiting for.
- when attempting to trigger a full sync, use multi-exec to avoid a race
where the replica manages to re-connect before we completed the set of
actions that should force a full sync.
- fix verify_log_message which was broken and unused
(cherry picked from commit 109b5ccdcd6e6b8cecdaeb13a246bc49ce7a61f4)
in cases where you have
    test name {
        start_server {
            start_server {
                assert
            }
        }
    }
the exception will be thrown to the test proc, and the servers are
supposed to be killed on the way out. but it seems there was always a
bug of not cleaning the server stack, and recently (#7404) we started
relying on that stack in order to kill them, so with that bug sometimes
we would have tried to kill the same server twice, and leave one alive.
luckily, in most cases the pattern is:
    start_server {
        test name {
        }
    }
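A sketch of the stack cleanup in Python, under the assumption that each server is popped on every exit path. `servers`, `killed` and `start_server` are hypothetical stand-ins, and `killed` exists only so the example can be inspected.

```python
from contextlib import contextmanager

servers = []  # hypothetical stack of running servers
killed = []   # record of kill calls, for illustration only


@contextmanager
def start_server(name):
    servers.append(name)
    try:
        yield name
    finally:
        # Pop on every exit path, including exceptions, so the stack never
        # keeps an entry for a server that has already been killed.
        killed.append(servers.pop())
```

When an assertion fails inside two nested servers, each one is killed exactly once, innermost first, and the stack ends up empty.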
(cherry picked from commit 36b949438547eb5bf8555fcac2c5040528fd7854)
in the majority of the cases (on this rarely used feature) we want to
stop and be able to connect to the shard with redis-cli.
since these are two different processes interacting with the tty we
need to stop both, and we'll have to hit enter twice, but it's not that
bad considering it is rarely used.
(cherry picked from commit 02ef355f98691adba4126bbdab0d4d2bfe475701)
tests were sensitive to additional log lines appearing in the log,
causing the search to come up empty-handed.
instead of just looking for the n last log lines, capture the log lines
before performing the action, and then search from that offset.
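In Python the capture-then-search idea looks roughly like this. It is a sketch and not the suite's helpers; `count_log_lines` and `find_in_log` are hypothetical names.

```python
def count_log_lines(path):
    """Number of lines currently in the log; call this before the action."""
    with open(path, "r") as f:
        return sum(1 for _ in f)


def find_in_log(path, pattern, from_line):
    """Return the 1-based number of the first line after `from_line`
    that contains `pattern`, or None if there is no such line."""
    with open(path, "r") as f:
        for lineno, line in enumerate(f, start=1):
            if lineno > from_line and pattern in line:
                return lineno
    return None
```

Usage: take `before = count_log_lines(log)`, perform the action, then poll `find_in_log(log, pattern, before)`. Extra lines written after the message no longer matter, and older matches before the offset are ignored.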
(cherry picked from commit 8e76e13472b7d277af78691775c2cf845f68ab90)
* tests/valgrind: don't use debug restart
DEBUG RESTART causes two issues:
1. it uses execve which replaces the original process and valgrind doesn't
have a chance to check for errors, so leaks go unreported.
2. valgrind reports invalid calls to close() which we're unable to resolve.
So now the tests use the restart_server mechanism, which terminates
the old server and starts a new one with a new PID but the same stdout and stderr.
since the stderr can contain two or more valgrind reports, it is not enough
to just check for the absence of leaks; we also need to check for some known
errors. we do both, and fail if we either find an error or can't find a
report saying there are no leaks.
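A rough sketch of that two-sided check in Python. The marker strings are an illustrative subset of valgrind's usual messages and are an assumption here, not the exact list used by the test suite; `check_valgrind_output` is a hypothetical name.

```python
import re

# Assumed, illustrative subset of messages that indicate a problem.
ERROR_PATTERNS = (
    r"Invalid read",
    r"Invalid write",
    r"uninitialised",
    r"definitely lost: [1-9]",  # a non-zero leak in any of the reports
)
# Assumed messages that confirm a leak-free run.
NO_LEAK_MARKERS = ("definitely lost: 0 bytes", "no leaks are possible")


def check_valgrind_output(stderr_text):
    """Return None if the output looks clean, else a reason string."""
    for pattern in ERROR_PATTERNS:
        if re.search(pattern, stderr_text):
            return "found error matching: " + pattern
    if not any(marker in stderr_text for marker in NO_LEAK_MARKERS):
        return "no report confirming absence of leaks"
    return None
```

Checking only for the "no leaks" message would pass a stderr that holds one clean report and one failing report, which is why both sides are checked.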
other changes:
- when killing a server that was already terminated we check for leaks too.
- adding DEBUG LEAK which was used to test it.
- adding --trace-children to valgrind, although no longer needed.
- since the stdout contains two or more runs, we need slightly different way
of checking if the new process is up (explicitly looking for the new PID)
- move the code that handles --wait-server to happen earlier (before
watching the startup message in the log), and serve the restarted server too.
* squashme - CR fixes
(cherry picked from commit 69ade87325eedebdb44760af9a8c28e15381888e)
i.e. don't start the search from scratch hitting the used ones again.
this will also reduce the likelihood of collisions (if there are any
left) by increasing the time until we re-use a port we did use in the
past.
apparently when running tests in parallel (the default of --clients 16),
there's a chance for two tests to use the same port.
specifically, one test might shutdown a master and still have the
replica up, and then another test will re-use the port number of master
for another master, and then that replica will connect to the master of
the other test.
this can cause a master to count too many full syncs and fail a test if
we run the tests with --single integration/psync2 --loop --stop
see Problem 2 in #7314
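A sketch of a port search that resumes where the previous one stopped, in Python. `PortAllocator` and its `is_free` hook are hypothetical; the default probe simply tries to bind on 127.0.0.1, which is an assumption about how availability could be checked.

```python
import socket


def _bind_probe(port):
    """Assumed availability check: try to bind the port on localhost."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind(("127.0.0.1", port))
            return True
        except OSError:
            return False


class PortAllocator:
    def __init__(self, start, count, is_free=_bind_probe):
        self.start = start
        self.count = count
        self.is_free = is_free
        self.next_offset = 0  # where the next search begins

    def find_available_port(self):
        # Resume after the last port handed out instead of restarting at
        # `start`, so a recently used port is retried as late as possible.
        for _ in range(self.count):
            port = self.start + self.next_offset
            self.next_offset = (self.next_offset + 1) % self.count
            if self.is_free(port):
                return port
        raise RuntimeError("no free port in range")
```

This does not remove the race between checking a port and a server binding it; it only lengthens the time before a port number is handed out again.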
sometimes we have several assertions with the same condition in the same test
at different stages, and when these fail (the ones that print the condition
text) you don't know which one it was. other assertions didn't print the
condition text (variable names), just the expected and unexpected values.
So now, all assertions print the context line and the condition text.
besides, one of the major differences between 'assert' and 'assert_equal',
is that the latter is able to print the value that doesn't match the expected.
if there is a rare non-reproducible failure, it is helpful to know what was
the value the test encountered and how far it was from the threshold.
So now, adding assert_lessthan and assert_range that can be used in some places
where we used just 'assert { a > b }' so far.
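To illustrate why value-printing assertions help, here is a Python sketch. The function names mirror the ones above, but the bodies are assumptions and not the Tcl implementation; in particular, treating the range as inclusive is a guess.

```python
def assert_lessthan(value, limit):
    """Fail with both numbers in the message when value >= limit."""
    if not value < limit:
        raise AssertionError(f"Expected {value!r} to be less than {limit!r}")


def assert_range(value, low, high):
    """Fail when value is outside [low, high] (inclusive bounds assumed)."""
    if not (low <= value <= high):
        raise AssertionError(
            f"Expected {value!r} to be between {low!r} and {high!r}"
        )
```

On a rare failure the message then shows the offending value and how far it was from the threshold, instead of only the condition text.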