Checking OOM by `getMaxMemoryState` inside script might get different result
with `freeMemoryIfNeededAndSafe` at script start, because lua stack and
arguments also consume memory.
This leads to memory `borderline` when memory grows near server.maxmemory:
- `freeMemoryIfNeededAndSafe` at script start detects no OOM, no memory freed
- `getMaxMemoryState` inside script detects OOM, script aborted
We solve this 'borderline' issue by saving OOM state at script start to get
stable lua OOM state.
related to issue #6565 and #5250.
There is an inherent race between the deferring client and the
"main" client of the test: While the deferring client issues a blocking
command, we can't know for sure that by the time the "main" client
tries to issue another command (Usually one that unblocks the deferring
client) the deferring client is even blocked...
For lack of a better choice this commit uses TCL's 'after' in order
to give some time for the deferring client to issues its blocking
command before the "main" client does its thing.
This problem probably exists in many other tests but this commit
tries to fix blockonkeys.tcl
Example: Client uses a pipe to send the following to a
stale replica:
MULTI
.. do something ...
DISCARD
The replica will reply the MUTLI with -MASTERDOWN and
execute the rest of the commands... A client using a
pipe might not be aware that MULTI failed until it's
too late.
I can't think of a reason why MULTI/EXEC/DISCARD should
not be executed on stale replicas...
Also, enable MULTI/EXEC/DISCARD during loading
First, we must parse the IDs, so that we abort ASAP.
The return value of this command cannot be an error if
the client successfully acknowledged some messages,
so it should be executed in a "all or nothing" fashion.
propagate_last_id is declared outside of the loop but used
only from within the loop. Once it's '1' it will never go
back to '0' and will replicate XSETID even for IDs that
don't actually change the last_id.
While not a serious bug (XSETID always used group->last_id
so there's no risk), it does causes redundant traffic
between master and its replicas
By using a "circular BRPOPLPUSH"-like scenario it was
possible the get the same client on db->blocking_keys
twice (See comment in moduleTryServeClientBlockedOnKey)
The fix was actually already implememnted in
moduleTryServeClientBlockedOnKey but it had a bug:
the funxction should return 0 or 1 (not OK or ERR)
Other changes:
1. Added two commands to blockonkeys.c test module (To
reproduce the case described above)
2. Simplify blockonkeys.c in order to make testing easier
3. cast raxSize() to avoid warning with format spec
Makse sure call() doesn't wrap replicated commands with
a redundant MULTI/EXEC
Other, unrelated changes:
1. Formatting compiler warning in INFO CLIENTS
2. Use CLIENT_ID_AOF instead of UINT64_MAX
37a10cef introduced automatic wrapping of MULTI/EXEC for the
alsoPropagate API. However this collides with the built-in mechanism
already present in module.c. To avoid complex changes near Redis 6 GA
this commit introduces the ability to exclude call() MUTLI/EXEC wrapping
for also propagate in order to continue to use the old code paths in
module.c.
the AOF will be loaded successfully, but the stream will be missing,
i.e inconsistencies with the original db.
this was because XADD with id of 0-0 would error.
add a test to reproduce.
Now that this mechanism is the sole one used for blocked clients
timeouts, it is more wise to cleanup the table when the client unblocks
for any reason. We use a flag: CLIENT_IN_TO_TABLE, in order to avoid a
radix tree lookup when the client was already removed from the table
because we processed it by scanning the radix tree.
If lots of clients PSUBSCRIBE to same patterns, multiple pattens matching will take place. This commit change it into just one single pattern matching by using a `dict *` to store the unique pattern and which clients subscribe to it.
A very commonly signaled operational problem with Redis master-replicas
sets is that, once the master becomes unavailable for some reason,
especially because of network problems, many times it wont be able to
perform a partial resynchronization with the new master, once it rejoins
the partition, for the following reason:
1. The master becomes isolated, however it keeps sending PINGs to the
replicas. Such PINGs will never be received since the link connection is
actually already severed.
2. On the other side, one of the replicas will turn into the new master,
setting its secondary replication ID offset to the one of the last
command received from the old master: this offset will not include the
PINGs sent by the master once the link was already disconnected.
3. When the master rejoins the partion and is turned into a replica, its
offset will be too advanced because of the PINGs, so a PSYNC will fail,
and a full synchronization will be required.
Related to issue #7002 and other discussion we had in the past around
this problem.