Recently we found two issues in the fuzzer tester: #9302#9285
After fixing them, more problems surfaced and this PR (as well as #9297) aims to fix them.
Here's a list of the fixes
- Prevent an overflow when allocating a dict hashtable
- Prevent OOM when attempting to allocate a huge string
- Prevent a few invalid accesses in listpack
- Improve sanitization of listpack first entry
- Validate integrity of stream consumer groups PEL
- Validate integrity of stream listpack entry IDs
- Validate ziplist tail followed by extra data which start with 0xff
Co-authored-by: sundb <sundbcn@gmail.com>
(cherry picked from commit 0c90370e6d71cc68e4d9cc79a0d8b1e768712a5b)
Fix that there is no sample latency after the key expires via expireIfNeeded().
Some refactoring for shared code.
(cherry picked from commit ca559819f7dcd97ba9ef667bf38360a9527d62f6)
- fix possible heap corruption in ziplist and listpack resulting by trying to
allocate more than the maximum size of 4GB.
- prevent ziplist (hash and zset) from reaching size of above 1GB, will be
converted to HT encoding, that's not a useful size.
- prevent listpack (stream) from reaching size of above 1GB.
- XADD will start a new listpack if the new record may cause the previous
listpack to grow over 1GB.
- XADD will respond with an error if a single stream record is over 1GB
- List type (ziplist in quicklist) was truncating strings that were over 4GB,
now it'll respond with an error.
This change sets a low limit for multibulk and bulk length in the
protocol for unauthenticated connections, so that they can't easily
cause redis to allocate massive amounts of memory by sending just a few
characters on the network.
The new limits are 10 arguments of 16kb each (instead of 1m of 512mb)
GETBIT, SETBIT may access wrong address because of wrap.
BITCOUNT and BITPOS may return wrapped results.
BITFIELD may access the wrong address but also allocate insufficient memory and segfault (see CVE-2021-32761).
This commit uses `uint64_t` or `long long` instead of `size_t`.
related https://github.com/redis/redis/pull/8096
At 32bit platform:
> setbit bit 4294967295 1
(integer) 0
> config set proto-max-bulk-len 536870913
OK
> append bit "\xFF"
(integer) 536870913
> getbit bit 4294967296
(integer) 0
When the bit index is larger than 4294967295, size_t can't hold bit index. In the past, `proto-max-bulk-len` is limit to 536870912, so there is no problem.
After this commit, bit position is stored in `uint64_t` or `long long`. So when `proto-max-bulk-len > 536870912`, 32bit platforms can still be correct.
For 64bit platform, this problem still exists. The major reason is bit pos 8 times of byte pos. When proto-max-bulk-len is very larger, bit pos may overflow.
But at 64bit platform, we don't have so long string. So this bug may never happen.
Additionally this commit add a test cost `512MB` memory which is tag as `large-memory`. Make freebsd ci and valgrind ci ignore this test.
(cherry picked from commit 71d452876ebf8456afaadd6b3c27988abadd1148)
Modules that use background threads with thread safe contexts are likely
to use RM_BlockClient() without a timeout function, because they do not
set up a timeout.
Before this commit, `CLIENT UNBLOCK` would result with a crash as the
`NULL` timeout callback is called. Beyond just crashing, this is also
logically wrong as it may throw the module into an unexpected client
state.
This commits makes `CLIENT UNBLOCK` on such clients behave the same as
any other client that is not in a blocked state and therefore cannot be
unblocked.
(cherry picked from commit aa139e2f02292d668370afde8c91575363c2d611)
- promote the code in DEBUG PROTOCOL to addReplyBigNum
- DEBUG PROTOCOL ATTRIB skips the attribute when client is RESP2
- networking.c addReply for push and attributes generate assertion when
called on a RESP2 client, anything else would produce a broken
protocol that clients can't handle.
(cherry picked from commit 6a5bac309e868deef749c36949723b415de2496f)
There are two issues fixed in this commit:
1. we want to fail the EXEC command in case there is a watched key that's logically
expired but not yet deleted by active expire or lazy expire.
2. we saw that currently cache time is update in every `call()` (including nested calls),
this time is being also being use for the isKeyExpired comparison, we want to update
the cache time only in the first call (execCommand)
Co-authored-by: Oran Agra <oran@redislabs.com>
(cherry picked from commit ac8b1df8850cc80fbf9ce8c2fbde0c1d3a1b4e91)
When client breached the output buffer soft limit but then went idle,
we didn't disconnect on soft limit timeout, now we do.
Note this also resolves some sporadic test failures in due to Linux
buffering data which caused tests to fail if during the test we went
back under the soft COB limit.
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: sundb <sundbcn@gmail.com>
(cherry picked from commit 152fce5e2cbf947a389da414a431f7331981a374)
This prevents a case where NTP moves the system clock
forward resulting in a false detection of a busy script.
Signed-off-by: zyxwvu Shi <i@shiyc.cn>
(cherry picked from commit f61c37cec900ba391541f20f7655aad44a26bafc)
Adding a new type mask for key space notification, REDISMODULE_NOTIFY_MODULE, to enable unique notifications from commands on REDISMODULE_KEYTYPE_MODULE type keys (which is currently unsupported).
Modules can subscribe to a module key keyspace notification by RM_SubscribeToKeyspaceEvents,
and clients by notify-keyspace-events of redis.conf or via the CONFIG SET, with the characters 'd' or 'A'
(REDISMODULE_NOTIFY_MODULE type mask is part of the '**A**ll' notation for key space notifications).
Refactor: move some pubsub test infra from pubsub.tcl to util.tcl to be re-used by other tests.
Starting redis 6.0 (part of the TLS feature), diskless master uses pipe from the fork
child so that the parent is the one sending data to the replicas.
This mechanism has an issue in which a hung replica will cause the master to wait
for it to read the data sent to it forever, thus preventing the fork child from terminating
and preventing the creations of any other forks.
This PR adds a timeout mechanism, much like the ACK-based timeout,
we disconnect replicas that aren't reading the RDB file fast enough.
The bio aof fsync fd may be closed by main thread (AOFRW done handler)
and even possibly reused for another socket, pipe, or file.
This can can an EBADF or EINVAL fsync error, which will lead to -MISCONF errors failing all writes.
We just ignore these errno because aof fsync did not really fail.
We handle errno when fsyncing aof in bio, so we could know the real reason
when users get -MISCONF Errors writing to the AOF file error
Issue created with #8419
Background:
Redis 6.2 added ACL control for pubsub channels (#7993), which were supposed
to be permissive by default to retain compatibility with redis 6.0 ACL.
But due to a bug, only newly created users got this `acl-pubsub-default` applied,
while overwritten (updated) users got reset to `resetchannels` (denied).
Since the "default" user exists before loading the config file,
any ACL change to it, results in an update / overwrite.
So when a "default" user is loaded from config file or include ACL
file with no channels related rules, the user will not have any
permissions to any channels. But other users will have default
permissions to any channels.
When upgraded from 6.0 with config rewrite, this will lead to
"default" user channels permissions lost.
When users are loaded from include file, then call "acl load", users
will also lost channels permissions.
Similarly, the `reset` ACL rule, would have reset the user to be denied
access to any channels, ignoring `acl-pubsub-default` and breaking
compatibility with redis 6.0.
The implication of this fix is that it regains compatibility with redis 6.0,
but breaks compatibility with redis 6.2.0 and 2.0.1. e.g. after the upgrade,
the default user will regain access to pubsub channels.
Other changes:
Additionally this commit rename server.acl_pubusub_default to
server.acl_pubsub_default and fix typo in acl tests.
Previously (and by default after commit) when master loose its last slot
(due to migration, for example), its replicas will migrate to new last slot
holder.
There are cases where this is not desired:
* Consolidation that results with removed nodes (including the replica, eventually).
* Manually configured cluster topologies, which the admin wishes to preserve.
Needlessly migrating a replica triggers a full synchronization and can have a negative impact, so
we prefer to be able to avoid it where possible.
This commit adds 'cluster-allow-replica-migration' configuration option that is
enabled by default to preserve existed behavior. When disabled, replicas will
not be auto-migrated.
Fixes#4896
Co-authored-by: Oran Agra <oran@redislabs.com>
In `aof.c`, we call fsync when stop aof, and now print a log to let user know that if fail.
In `cluster.c`, we now return error, the calling function already handles these write errors.
In `redis-cli.c`, users hope to save rdb, we now print a message if fsync failed.
In `rio.c`, we now treat fsync errors like we do for write errors.
In `server.c`, we try to fsync aof file when shutdown redis, we only can print one log if fail.
In `bio.c`, if failing to fsync aof file, we will set `aof_bio_fsync_status` to error , and reject writing just like last writing aof error, moreover also set INFO command field `aof_last_write_status` to error.
The implications of this change is just that in the past when a config file was missing,
in some cases it was exiting before printing the sever startup prints and sometimes after,
and now it'll always exit before printing them.
The 'sentinel replicas <master>' command will ignore replicas with
`replica-announced` set to no.
The goal of disabling the config setting replica-announced is to allow ghost
replicas. The replica is in the cluster, synchronize with its master, can be
promoted to master and is not exposed to sentinel clients. This way, it is
acting as a live backup or living ghost.
In addition, to prevent the replica to be promoted as master, set
replica-priority to 0.
The cluster bus is established over TLS or non-TLS depending on the configuration tls-cluster. The client ports distributed in the cluster and sent to clients are assumed to be TLS or non-TLS also depending on tls-cluster.
The cluster bus is now extended to also contain the non-TLS port of clients in a TLS cluster, when available. The non-TLS port of a cluster node, when available, is sent to clients connected without TLS in responses to CLUSTER SLOTS, CLUSTER NODES, CLUSTER SLAVES and MOVED and ASK redirects, instead of the TLS port.
The user was able to override the client port by defining cluster-announce-port. Now cluster-announce-tls-port is added, so the user can define an alternative announce port for both TLS and non-TLS clients.
Fixes#8134
Add publish channel permissions check in processCommand.
processCommand didn't check publish channel permissions, so we can
queue a publish command in a transaction. But when exec the transaction,
it will fail with -NOPERM.
We also union keys/commands/channels permissions check togegher in
ACLCheckAllPerm. Remove pubsubCheckACLPermissionsOrReply in
publishCommand/subscribeCommand/psubscribeCommand. Always
check permissions in processCommand/execCommand/
luaRedisGenericCommand.
* SLOWLOG didn't record anything for blocked commands because the client
was reset and argv was already empty. there was a fix for this issue
specifically for modules, now it works for all blocked clients.
* The original command argv (before being re-written) was also reset
before adding the slowlog on behalf of the blocked command.
* Latency monitor is now updated regardless of the slowlog flags of the
command or its execution (their purpose is to hide sensitive info from
the slowlog, not hide the fact the latency happened).
* Latency monitor now uses real_cmd rather than c->cmd (which may be
different if the command got re-written, e.g. GEOADD)
Changes:
* Unify shared code between slowlog insertion in call() and
updateStatsOnUnblock(), hopefully prevent future bugs from happening
due to the later being overlooked.
* Reset CLIENT_PREVENT_LOGGING in resetClient rather than after command
processing.
* Add a test for SLOWLOG and BLPOP
Notes:
- real_cmd == c->lastcmd, except inside MULTI and Lua.
- blocked commands never happen in these cases (MULTI / Lua)
- real_cmd == c->cmd, except for when the command is rewritten (e.g.
GEOADD)
- blocked commands (currently) are never rewritten
- other than the command's CLIENT_PREVENT_LOGGING, and the
execution flag CLIENT_PREVENT_LOGGING, other cases that we want to
avoid slowlog are on AOF loading (specifically CMD_CALL_SLOWLOG will
be off when executed from execCommand that runs from an AOF)
Reading CoW from /proc/<pid>/smaps can be slow with large processes on
some platforms.
This measures the time it takes to read CoW info and limits the duty
cycle of future updates to roughly 1/100.
As current_cow_size no longer represnets a current, fixed interval value
there is also a new current_cow_size_age field that provides information
about the age of the size value, in seconds.