In INFO CLIENTS section, we already have blocked_clients and
tracking_clients. We should add a new metric showing the number of
pubsub connections, which helps performance monitoring and trouble
shooting.
The change in dbSwapDatabases seems harmless. Because in non-clustered
mode, dbBuckets calculations are strictly accurate and in cluster mode,
we only have one DB. Modify it for uniformity (just like resize_cursor).
The change in swapMainDbWithTempDb is needed in case we swap with the
temp db, otherwise the overhead memory usage of db can be miscalculated.
In addition we will swap all fields (including rehashing list), just for
completeness (and reduce the chance of surprises in the future).
Introduced in #12697.
Additional optimizations for the eviction logic in #11695:
To make the eviction logic clearer and decouple the number of sampled
keys from the running mode (cluster or standalone).
* When sampling in each database, we only care about the number of keys
in the current database (not the dicts we sampled from).
* If there are a insufficient number of keys in the current database
(e.g. 10 times the value of `maxmemory_samples`), we can break out
sooner (to avoid looping on a sparse database).
* We'll never try to sample the db dicts more times than the number of
non-empty dicts in the db (max 1 in non-cluster mode).
And it also ensures that each database has a sufficient amount of
sampled keys, so even if unsharded-cluster supports multiple databases,
there won't be any issues.
other changes:
1. keep track of the number of non-empty dicts in each database.
2. move key_count tracking into cumulativeKeyCountAdd rather than all
it's callers
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
Introduced in #11695 .
The tryResizeHashTables function gets stuck on the last non-empty slot
while iterating through dictionaries. It does not restart from the
beginning. The reason for this issue is a problem with the usage of
dbIteratorNextDict:
/* Returns next dictionary from the iterator, or NULL if iteration is complete. */
dict *dbIteratorNextDict(dbIterator *dbit) {
if (dbit->next_slot == -1) return NULL;
dbit->slot = dbit->next_slot;
dbit->next_slot = dbGetNextNonEmptySlot(dbit->db, dbit->slot, dbit->keyType);
return dbGetDictFromIterator(dbit);
}
When iterating to the last non-empty slot, next_slot is set to -1,
causing it to loop indefinitely on that slot. We need to modify the code
to ensure that after iterating to the last non-empty slot, it returns to
the first non-empty slot.
BTW, function tryResizeHashTables is actually iterating over slots
that have keys. However, in its implementation, it leverages the
dbIterator (which is a key iterator) to obtain slot and dictionary
information. While this approach works fine, but it is not very
intuitive. This PR also improves readability by changing the iteration
to directly iterate over slots, thereby enhancing clarity.
see discussion from after https://github.com/redis/redis/pull/12453 was
merged
----
This PR replaces signals that are not considered async-signal-safe
(AS-safe) with safe calls.
#### **1. serverLog() and serverLogFromHandler()**
`serverLog` uses unsafe calls. It was decided that we will **avoid**
`serverLog` calls by the signal handlers when:
* The signal is not fatal, such as SIGALRM. In these cases, we prefer
using `serverLogFromHandler` which is the safe version of `serverLog`.
Note they have different prompts:
`serverLog`: `62220:M 26 Oct 2023 14:39:04.526 # <msg>`
`serverLogFromHandler`: `62220:signal-handler (1698331136) <msg>`
* The code was added recently. Calls to `serverLog` by the signal
handler have been there ever since Redis exists and it hasn't caused
problems so far. To avoid regression, from now we should use
`serverLogFromHandler`
#### **2. `snprintf` `fgets` and `strtoul`(base = 16) -------->
`_safe_snprintf`, `fgets_async_signal_safe`, `string_to_hex`**
The safe version of `snprintf` was taken from
[here](8cfc4ca5e7/src/mc_util.c (L754))
#### **3. fopen(), fgets(), fclose() --------> open(), read(), close()**
#### **4. opendir(), readdir(), closedir() --------> open(),
syscall(SYS_getdents64), close()**
#### **5. Threads_mngr sync mechanisms**
* waiting for the thread to generate stack trace: semaphore -------->
busy-wait
* `globals_rw_lock` was removed: as we are not using malloc and the
semaphore anymore we don't need to protect `ThreadsManager_cleanups`.
#### **6. Stacktraces buffer**
The initial problem was that we were not able to safely call malloc
within the signal handler.
To solve that we created a buffer on the stack of `writeStacktraces` and
saved it in a global pointer, assuming that under normal circumstances,
the function `writeStacktraces` would complete before any thread
attempted to write to it. However, **if threads lag behind, they might
access this global pointer after it no longer belongs to the
`writeStacktraces` stack, potentially corrupting memory.**
To address this, various solutions were discussed
[here](https://github.com/redis/redis/pull/12658#discussion_r1390442896)
Eventually, we decided to **create a pipe** at server startup that will
remain valid as long as the process is alive.
We chose this solution due to its minimal memory usage, and since
`write()` and `read()` are atomic operations. It ensures that stack
traces from different threads won't mix.
**The stacktraces collection process is now as follows:**
* Cleaning the pipe to eliminate writes of late threads from previous
runs.
* Each thread writes to the pipe its stacktrace
* Waiting for all the threads to mark completion or until a timeout (2
sec) is reached
* Reading from the pipe to print the stacktraces.
#### **7. Changes that were considered and eventually were dropped**
* replace watchdog timer with a POSIX timer:
according to [settimer man](https://linux.die.net/man/2/setitimer)
> POSIX.1-2008 marks getitimer() and setitimer() obsolete, recommending
the use of the POSIX timers API
([timer_gettime](https://linux.die.net/man/2/timer_gettime)(2),
[timer_settime](https://linux.die.net/man/2/timer_settime)(2), etc.)
instead.
However, although it is supposed to conform to POSIX std, POSIX timers
API is not supported on Mac.
You can take a look here at the Linux implementation:
[here](c7562ee135)
To avoid messing up the code, and uncertainty regarding compatibility,
it was decided to drop it for now.
* avoid using sds (uses malloc) in logConfigDebugInfo
It was considered to print config info instead of using sds, however
apparently, `logConfigDebugInfo` does more than just print the sds, so
it was decided this fix is out of this issue scope.
#### **8. fix Signal mask check**
The check `signum & sig_mask` intended to indicate whether the signal is
blocked by the thread was incorrect. Actually, the bit position in the
signal mask corresponds to the signal number. We fixed this by changing
the condition to: `sig_mask & (1L << (sig_num - 1))`
#### **9. Unrelated changes**
both `fork.tcl `and `util.tcl` implemented a function called
`count_log_message` expecting different parameters. This caused
confusion when trying to run daily tests with additional test parameters
to run a specific test.
The `count_log_message` in `fork.tcl` was removed and the calls were
replaced with calls to `count_log_message` located in `util.tcl`
---------
Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
Move clusterState into cluster_legacy.h. In order to achieve
this some "accessor" methods needed to be added to the
cluster API and some other minor refactors.
Signed-off-by: Josh Hershberg <yehoshua@redis.com>
When using DB iterator, it will use dictInitSafeIterator to init a old safe
dict iterator. When dbIteratorNext is used, it will jump to the next slot db
dict when we are done a dict. During this process, we do not have any calls to
dictResumeRehashing, which causes the dict's pauserehash to always be > 0.
And at last, it will be returned directly in dictRehashMilliseconds, which causes
us to have slot dict in a state where rehash cannot be completed.
In the "expire scan should skip dictionaries with lot's of empty buckets" test,
adding a `keys *` can reproduce the problem stably. `keys *` will call dbIteratorNext
to trigger a traversal of all slot dicts.
Added dbReleaseIterator and dbIteratorInitNextSafeIterator methods to call dictResetIterator.
Issue was introduced in #11695.
In #12476 server.stat_client_qbuf_limit_disconnections was added. It is written in readQueryFromClient, which may be called by multiple threads when io-threads and io-threads-do-reads are turned on. Somehow we missed to make it an atomic variable.
As part of #11695 independent dictionaries were introduced per slot.
Time complexity to discover total no. of buckets across all dictionaries
increased to O(N) with straightforward implementation of iterating over
all dictionaries and adding dictBuckets of each.
To optimize the time complexity, we could maintain a global counter at
db level to keep track of the count of buckets and update it on the start
and end of rehashing.
---------
Co-authored-by: Roshan Khatri <rvkhatri@amazon.com>
Dictionary iterator logic in the `tryResizeHashTables` method is picking the next
(incorrect) dictionary while the cursor is at a given slot. This could lead to some
dictionary/slot getting skipped from resizing.
Also stabilize the test.
problem introduced recently in #11695
This is an implementation of https://github.com/redis/redis/issues/10589 that eliminates 16 bytes per entry in cluster mode, that are currently used to create a linked list between entries in the same slot. Main idea is splitting main dictionary into 16k smaller dictionaries (one per slot), so we can perform all slot specific operations, such as iteration, without any additional info in the `dictEntry`. For Redis cluster, the expectation is that there will be a larger number of keys, so the fixed overhead of 16k dictionaries will be The expire dictionary is also split up so that each slot is logically decoupled, so that in subsequent revisions we will be able to atomically flush a slot of data.
## Important changes
* Incremental rehashing - one big change here is that it's not one, but rather up to 16k dictionaries that can be rehashing at the same time, in order to keep track of them, we introduce a separate queue for dictionaries that are rehashing. Also instead of rehashing a single dictionary, cron job will now try to rehash as many as it can in 1ms.
* getRandomKey - now needs to not only select a random key, from the random bucket, but also needs to select a random dictionary. Fairness is a major concern here, as it's possible that keys can be unevenly distributed across the slots. In order to address this search we introduced binary index tree). With that data structure we are able to efficiently find a random slot using binary search in O(log^2(slot count)) time.
* Iteration efficiency - when iterating dictionary with a lot of empty slots, we want to skip them efficiently. We can do this using same binary index that is used for random key selection, this index allows us to find a slot for a specific key index. For example if there are 10 keys in the slot 0, then we can quickly find a slot that contains 11th key using binary search on top of the binary index tree.
* scan API - in order to perform a scan across the entire DB, the cursor now needs to not only save position within the dictionary but also the slot id. In this change we append slot id into LSB of the cursor so it can be passed around between client and the server. This has interesting side effect, now you'll be able to start scanning specific slot by simply providing slot id as a cursor value. The plan is to not document this as defined behavior, however. It's also worth nothing the SCAN API is now technically incompatible with previous versions, although practically we don't believe it's an issue.
* Checksum calculation optimizations - During command execution, we know that all of the keys are from the same slot (outside of a few notable exceptions such as cross slot scripts and modules). We don't want to compute the checksum multiple multiple times, hence we are relying on cached slot id in the client during the command executions. All operations that access random keys, either should pass in the known slot or recompute the slot.
* Slot info in RDB - in order to resize individual dictionaries correctly, while loading RDB, it's not enough to know total number of keys (of course we could approximate number of keys per slot, but it won't be precise). To address this issue, we've added additional metadata into RDB that contains number of keys in each slot, which can be used as a hint during loading.
* DB size - besides `DBSIZE` API, we need to know size of the DB in many places want, in order to avoid scanning all dictionaries and summing up their sizes in a loop, we've introduced a new field into `redisDb` that keeps track of `key_count`. This way we can keep DBSIZE operation O(1). This is also kept for O(1) expires computation as well.
## Performance
This change improves SET performance in cluster mode by ~5%, most of the gains come from us not having to maintain linked lists for keys in slot, non-cluster mode has same performance. For workloads that rely on evictions, the performance is similar because of the extra overhead for finding keys to evict.
RDB loading performance is slightly reduced, as the slot of each key needs to be computed during the load.
## Interface changes
* Removed `overhead.hashtable.slot-to-keys` to `MEMORY STATS`
* Scan API will now require 64 bits to store the cursor, even on 32 bit systems, as the slot information will be stored.
* New RDB version to support the new op code for SLOT information.
---------
Co-authored-by: Vitaly Arbuzov <arvit@amazon.com>
Co-authored-by: Harkrishn Patro <harkrisp@amazon.com>
Co-authored-by: Roshan Khatri <rvkhatri@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
We use the C standard assert() in various places in the codebase, which requires NDEBUG to be undefined. We introduced the redisassert.h file in order to allow low level files to access the assert that maps to serverPanic, but this was only applied tactically and is not available broadly.
This PR removes all usage of the standard library asserts and replaces them with an assert that maps to serverPanic. It makes us immune to accidentally setting the NDEBUG flag preventing assertions. I also marked marked the server asserts as "likely" to not execute. I spot checked various points in the code, and it didn't change the code layout on my x86 mac, but it is more consistent with redisassert.h and seems more correct overall.
In a long printf call with many placeholders, it's hard to see which argument
belongs to which placeholder.
The long printf-like calls in the INFO and CLIENT commands are rewritten into
pairs of (format, argument). These pairs are then rewritten to a single call with
a long format string and a long list of arguments, using a macro called FMTARGS.
The file `fmtargs.h` is added to the repo.
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
In this PR we are adding the functionality to collect all the process's threads' backtraces.
## Changes made in this PR
### **introduce threads mngr API**
The **threads mngr API** which has 2 abilities:
* `ThreadsManager_init() `- register to SIGUSR2. called on the server start-up.
* ` ThreadsManager_runOnThreads()` - receives a list of a pid_t and a callback, tells every
thread in the list to invoke the callback, and returns the output collected by each invocation.
**Elaborating atomicvar API**
* `atomicIncrGet(var,newvalue_var,count) `-- Increment and get the atomic counter new value
* `atomicFlagGetSet` -- Get and set the atomic counter value to 1
### **Always set SIGALRM handler**
SIGALRM handler prints the process's stacktrace to the log file. Up until now, it was set only if the
`server.watchdog_period` > 0. This can be also useful if debugging is needed. However, in situations
where the server can't get requests, (a deadlock, for example) we weren't able to change the signal handler.
To make it available at run time we set SIGALRM handler on server startup. The signal handler name was
changed to a more general `sigalrmSignalHandler`.
### **Print all the process' threads' stacktraces**
`logStackTrace()` now calls `writeStacktraces()`, instead of logging the current thread stacktrace.
`writeStacktraces()`:
* On Linux systems we use the threads manager API to collect the backtraces of all the process' threads.
To get the `tids` list (threads ids) we read the `/proc/<redis-server-pid>/tasks` file which includes a list of directories.
Each directory name corresponds to one tid (including the main thread). For each thread, we also need to check if it
can get the signal from the threads manager (meaning it is not blocking/ignoring that signal). We send the threads
manager this tids list and `collect_stacktrace_data()` callback, which collects the thread's backtrace addresses,
its name, and tid.
* On other systems, the behavior remained as it was (writing only the current thread stacktrace to the log file).
## compatibility notes
1. **The threads mngr API is only supported in linux.**
2. glibc earlier than 2.3 We use `syscall(SYS_gettid)` and `syscall(SYS_tgkill...)` because their dedicated
alternatives (`gettid()` and `tgkill`) were added in glibc 2.3.
## Output example
Each thread backtrace will have the following format:
`<tid> <thread_name> [additional_info]`
* **tid**: as read from the `/proc/<redis-server-pid>/tasks` file
* **thread_name**: the tread name as it is registered in the os/
* **additional_info**: Sometimes we want to add specific information about one of the threads. currently.
it is only used to mark the thread that handles the backtraces collection by adding "*".
In case of crash - this also indicates which thread caused the crash. The handling thread in won't
necessarily appear first.
```
------ STACK TRACE ------
EIP:
/lib/aarch64-linux-gnu/libc.so.6(epoll_pwait+0x9c)[0xffffb9295ebc]
67089 redis-server *
linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffffb9437790]
/lib/aarch64-linux-gnu/libc.so.6(epoll_pwait+0x9c)[0xffffb9295ebc]
redis-server *:6379(+0x75e0c)[0xaaaac2fe5e0c]
redis-server *:6379(aeProcessEvents+0x18c)[0xaaaac2fe6c00]
redis-server *:6379(aeMain+0x24)[0xaaaac2fe7038]
redis-server *:6379(main+0xe0c)[0xaaaac3001afc]
/lib/aarch64-linux-gnu/libc.so.6(+0x273fc)[0xffffb91d73fc]
/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98)[0xffffb91d74cc]
redis-server *:6379(_start+0x30)[0xaaaac2fe0370]
67093 bio_lazy_free
/lib/aarch64-linux-gnu/libc.so.6(+0x79dfc)[0xffffb9229dfc]
/lib/aarch64-linux-gnu/libc.so.6(pthread_cond_wait+0x208)[0xffffb922c8fc]
redis-server *:6379(bioProcessBackgroundJobs+0x174)[0xaaaac30976e8]
/lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8)[0xffffb922d5c8]
/lib/aarch64-linux-gnu/libc.so.6(+0xe5d1c)[0xffffb9295d1c]
67091 bio_close_file
/lib/aarch64-linux-gnu/libc.so.6(+0x79dfc)[0xffffb9229dfc]
/lib/aarch64-linux-gnu/libc.so.6(pthread_cond_wait+0x208)[0xffffb922c8fc]
redis-server *:6379(bioProcessBackgroundJobs+0x174)[0xaaaac30976e8]
/lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8)[0xffffb922d5c8]
/lib/aarch64-linux-gnu/libc.so.6(+0xe5d1c)[0xffffb9295d1c]
67092 bio_aof
/lib/aarch64-linux-gnu/libc.so.6(+0x79dfc)[0xffffb9229dfc]
/lib/aarch64-linux-gnu/libc.so.6(pthread_cond_wait+0x208)[0xffffb922c8fc]
redis-server *:6379(bioProcessBackgroundJobs+0x174)[0xaaaac30976e8]
/lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8)[0xffffb922d5c8]
/lib/aarch64-linux-gnu/libc.so.6(+0xe5d1c)[0xffffb9295d1c]
67089:signal-handler (1693824528) --------
```
Add these INFO metrics:
* client_query_buffer_limit_disconnections
* client_output_buffer_limit_disconnections
Sometimes it is useful to monitor whether clients reaches size limit of
query buffer and output buffer, to decide whether we need to adjust the
buffer size limit or reduce client query payload.
This PR purpose is to make the crash report process thread safe.
main changes include:
1. `setupSigSegvHandler()` is introduced to initialize the signal handler.
This function first initializes the signal handler mutex (if not initialized yet)
and then registers the process to the signal handler.
2. **sigsegvHandler** flags :
SA_NODEFER - don't add the signal to the process signal mask. We use this
flag because we want to be able to handle a second call to the signal manually.
removed SA_RESETHAND: this flag resets the signal handler function upon the first
entrance to the registered function. The reason to use this flag is to protect from
recursively entering the signal handler by the same thread. But, it also means
that if a second thread crashes while handling a signal, the process will be
terminated immediately and we won't get the crash report.
In this PR we discard this flag. The signal handler guard described below purpose
is to solve the above issues.
3. Add a **signal handler lock** with ERRORCHECK attributes.
The lock's purpose is to ensure that only one thread generates a crash report.
Once a second thread enters the signal handler it will be blocked.
We use the ERRORCHECK lock in order to protect from possible deadlock in
case the thread handling the crash gets a signal. In the latest scenario, we log
what we have collected until the handler crashed.
At the end of the crash report we reset the signal handler SIG_DFL, with no flags, and
rethrow the signal to generate a core dump (if enabled) and exit the process.
During the work on this PR we wanted to understand the historical reasons for
how crash is handled.
With respect to the choice of the flag, we believe the **SA_RESETHAND** was not
added for any specific purpose.
**SA_ONSTACK** which is removed here from bugReportEnd(), was originally also
set in the initial registration to signal handler, but removed in 3ada43e73. In addition,
it was removed from another location in deee2c1ef with the following description,
which is also relevant to why it should be removed from bugReportEnd:
> it seems to be some valgrind bug with SA_ONSTACK.
> SA_ONSTACK seems unneeded since WD is not recursive (SA_NODEFER was removed),
> also, not sure if it's even valid without a call to sigaltstack()
Fix the assertion when a busy script (timeout) signal ready keys (like LPUSH),
and then an arbitrary client's `allow-busy` command steps into `handleClientsBlockedOnKeys`
try wake up clients blocked on keys (like BLPOP).
Reproduction process:
1. start a redis with aof
`./redis-server --appendonly yes`
2. exec blpop
`127.0.0.1:6379> blpop a 0`
3. use another client call a busy script and this script push the blocked key
`127.0.0.1:6379> eval "redis.call('lpush','a','b') while(1) do end" 0`
4. user a new client call an allow-busy command like auth
`127.0.0.1:6379> auth a`
BTW, this issue also break the atomicity of script.
This bug has been around for many years, the old versions only have the
atomic problem, only 7.0/7.2 has the assertion problem.
Co-authored-by: Oran Agra <oran@redislabs.com>
A bug introduced in #11657 (7.2 RC1), causes client-eviction (#8687)
and INFO to have inaccurate memory usage metrics of MONITOR clients.
Because the type in `c->type` and the type in `getClientType()` are confusing
(in the later, `CLIENT_TYPE_NORMAL` not `CLIENT_TYPE_SLAVE`), the comment
we wrote in `updateClientMemUsageAndBucket` was wrong, and in fact that function
didn't skip monitor clients.
And since it doesn't skip monitor clients, it was wrong to delete the call for it from
`replicationFeedMonitors` (it wasn't a NOP).
That deletion could mean that the monitor client memory usage is not always up to
date (updated less frequently, but still a candidate for client eviction).
The new blockedBeforeSleep was added in #12337, it breaks the order in 2ecb5ed.
This may be related to #2288, quoted from comment in #2288:
```
Moreover the clusterBeforeSleep() call was misplaced at the end of the chain of the
beforeSleep() call in redis.c. It should be at the top, before processing un blocking
clients. This is exactly the reason in the specific instance of the bug as reported,
why the state was not updated in time before clients served.
```
Apart from adding the missing coverage, this PR also adds `blockedBeforeSleep`
that gathers all block-related functions from `beforeSleep`
The order inside `blockedBeforeSleep` is different: now `handleClientsBlockedOnKeys`
(which may unblock clients) is called before `processUnblockedClients` (which handles
unblocked clients).
It makes sense to have this order.
There are no visible effects of the wrong ordering, except some cleanups of the now-unblocked
client would have happen in the next `beforeSleep` (will now happen in the current one)
The reason we even got into it is because i triggers an assertion in logresreq.c (breaking
the assumption that `unblockClient` is called **before** actually flushing the reply to the socket):
`handleClientsBlockedOnKeys` is called, then it calls `moduleUnblockClientOnKey`, which calls
`moduleUnblockClient`, which adds the client to `moduleUnblockedClients` back to `beforeSleep`,
we call `handleClientsWithPendingWritesUsingThreads`, it writes the data of buf to the client, so
`client->bufpos` became 0
On the next `beforeSleep`, we call `moduleHandleBlockedClients`, which calls `unblockClient`,
which calls `reqresAppendResponse`, triggering the assert. (because the `bufpos` is 0) - see https://github.com/redis/redis/pull/12301#discussion_r1226386716
When a connection that's subscribe to a channel emits PUBLISH inside MULTI-EXEC,
the push notification messes up the EXEC response.
e.g. MULTI, PING, PUSH foo bar, PING, EXEC
the EXEC's response will contain: PONG, {message foo bar}, 1. and the second PONG
will be delivered outside the EXEC's response.
Additionally, this PR changes the order of responses in case of a plain PUBLISH (when
the current client also subscribed to it), by delivering the push after the command's
response instead of before it.
This also affects modules calling RM_PublishMessage in a similar way, so that we don't
run the risk of getting that push mixed together with the module command's response.
In 7.2, After 971b177fa we make sure (assert) that
the duration has been recorded when resetting the client.
This is not true for rejected commands.
The use case I found is a blocking command that an ACL rule changed before
it was unblocked, and while reprocessing it, the command rejected and triggered the assert.
The PR reset the command duration inside rejectCommand / rejectCommandSds.
Co-authored-by: Oran Agra <oran@redislabs.com>
This pr can get two performance benefits:
1. Stop redundant initialization when most robj objects are created
2. LRU_CLOCK will no longer be called in io threads, so we can avoid the `atomicGet`
Another code optimization:
deleted the redundant judgment in dbSetValue, no matter in LFU or LRU, the lru field inold
robj is always the freshest (it is always updated in lookupkey), so we don't need to judge if in LFU
The measured latency(duration) includes the list below, which can be shown by `INFO STATS`.
```
eventloop_cycles // ever increasing counter
eventloop_duration_sum // cumulative duration of eventloop in microseconds
eventloop_duration_cmd_sum // cumulative duration of executing commands in microseconds
instantaneous_eventloop_cycles_per_sec // average eventloop count per second in recent 1.6s
instantaneous_eventloop_duration_usec // average single eventloop duration in recent 1.6s
```
Also added some experimental metrics, which are shown only when `INFO DEBUG` is called.
This section isn't included in the default INFO, or even in `INFO ALL` and the fields in this section
can change in the future without considering backwards compatibility.
```
eventloop_duration_aof_sum // cumulative duration of writing AOF
eventloop_duration_cron_sum // cumulative duration cron jobs (serverCron, beforeSleep excluding IO and AOF)
eventloop_cmd_per_cycle_max // max number of commands executed in one eventloop
eventloop_duration_max // max duration of one eventloop
```
All of these are being reset by CONFIG RESETSTAT
When master reboot from RDB, if rsi in RDB is valid we should not free replication backlog, even if master_repl_offset or repl-offset is 0.
Since if master doesn't send any data to replicas master_repl_offset is 0, it's a valid number.
A clear example:
1. start a master and apply some write commands, the master's master_repl_offset is 0 since it has no replicas.
2. stop write commands on master, and start another instance and replicaof the master, trigger an FULLRESYNC
3. the master's master_repl_offset is still 0 (set a large number for repl-ping-replica-period), do BGSAVE and restart the master
4. master load master_repl_offset from RDB's rsi and it's still 0, and we should make sure replica can partially resync with master.
Technically declaring a prototype with an empty declaration has been deprecated since the early days of C, but we never got a warning for it. C2x will apparently be introducing a breaking change if you are using this type of declarator, so Clang 15 has started issuing a warning with -pedantic. Although not apparently a problem for any of the compiler we build on, if feels like the right thing is to properly adhere to the C standard and use (void).
Every time when accept a connection, we add the client to `server.clients` list's tail, but in `clientsCron` we rotate the tail to head at first, and then process the head. It means that the "new" client would be processed before "old" client, moreover if connections established and then freed frequently, the "old" client may have no chance to be processed.
To fix it, we need take a fair way to iterate the list, that is take the current head and process, and then rotate the head to tail, thus we can make sure all clients could be processed step by step.
p.s. client has `client_list_node` pointer, we don't need put the current client to head to avoid O(N) when remove it.
Since we do report the RDB error in below:
```
serverLog(LL_WARNING,"Error trying to save the DB, can't exit.");
if (server.supervised_mode == SUPERVISED_SYSTEMD)
redisCommunicateSystemd("STATUS=Error trying to save the DB, can't exit.\n");
goto error;
```
This may be an overlook in #6052
We currently do not allow the use of bool type in redis project.
We didn't catch it in script.c because we included hdr_histogram.h in server.h
Removing it (but still having it in some c files) reducing
the chance to miss the usage of bool type in the future and catch it
in compiling stage.
It also removes the dependency on hdr_histogram for every unit
that includes server.h
1. it's a bad idea to print these errors and exit after daemonization (if we'll exit,
the failure may not be detected)
2. it's not nice to exit (or even do the check that uses `fork`) after modules already
started (could create threads, or do some changes)
This PR is to fix the compilation warnings and errors generated by the latest
complier toolchain, and to add a new runner of the latest toolchain for daily CI.
## Fix various compilation warnings and errors
1) jemalloc.c
COMPILER: clang-14 with FORTIFY_SOURCE
WARNING:
```
src/jemalloc.c:1028:7: warning: suspicious concatenation of string literals in an array initialization; did you mean to separate the elements with a comma? [-Wstring-concatenation]
"/etc/malloc.conf",
^
src/jemalloc.c:1027:3: note: place parentheses around the string literal to silence warning
"\"name\" of the file referenced by the symbolic link named "
^
```
REASON: the compiler to alert developers to potential issues with string concatenation
that may miss a comma,
just like #9534 which misses a comma.
SOLUTION: use `()` to tell the compiler that these two line strings are continuous.
2) config.h
COMPILER: clang-14 with FORTIFY_SOURCE
WARNING:
```
In file included from quicklist.c:36:
./config.h:319:76: warning: attribute declaration must precede definition [-Wignored-attributes]
char *strcat(char *restrict dest, const char *restrict src) __attribute__((deprecated("please avoid use of unsafe C functions. prefer use of redis_strlcat instead")));
```
REASON: Enabling _FORTIFY_SOURCE will cause the compiler to use `strcpy()` with check,
it results in a deprecated attribute declaration after including <features.h>.
SOLUTION: move the deprecated attribute declaration from config.h to fmacro.h before "#include <features.h>".
3) networking.c
COMPILER: GCC-12
WARNING:
```
networking.c: In function ‘addReplyDouble.part.0’:
networking.c:876:21: warning: writing 1 byte into a region of size 0 [-Wstringop-overflow=]
876 | dbuf[start] = '$';
| ^
networking.c:868:14: note: at offset -5 into destination object ‘dbuf’ of size 5152
868 | char dbuf[MAX_LONG_DOUBLE_CHARS+32];
| ^
networking.c:876:21: warning: writing 1 byte into a region of size 0 [-Wstringop-overflow=]
876 | dbuf[start] = '$';
| ^
networking.c:868:14: note: at offset -6 into destination object ‘dbuf’ of size 5152
868 | char dbuf[MAX_LONG_DOUBLE_CHARS+32];
```
REASON: GCC-12 predicts that digits10() may return 9 or 10 through `return 9 + (v >= 1000000000UL)`.
SOLUTION: add an assert to let the compiler know the possible length;
4) redis-cli.c & redis-benchmark.c
COMPILER: clang-14 with FORTIFY_SOURCE
WARNING:
```
redis-benchmark.c:1621:2: warning: embedding a directive within macro arguments has undefined behavior [-Wembedded-directive] #ifdef USE_OPENSSL
redis-cli.c:3015:2: warning: embedding a directive within macro arguments has undefined behavior [-Wembedded-directive] #ifdef USE_OPENSSL
```
REASON: when _FORTIFY_SOURCE is enabled, the compiler will use the print() with
check, which is a macro. this may result in the use of directives within the macro, which
is undefined behavior.
SOLUTION: move the directives-related code out of `print()`.
5) server.c
COMPILER: gcc-13 with FORTIFY_SOURCE
WARNING:
```
In function 'lookupCommandLogic',
inlined from 'lookupCommandBySdsLogic' at server.c:3139:32:
server.c:3102:66: error: '*(robj **)argv' may be used uninitialized [-Werror=maybe-uninitialized]
3102 | struct redisCommand *base_cmd = dictFetchValue(commands, argv[0]->ptr);
| ~~~~^~~
```
REASON: The compiler thinks that the `argc` returned by `sdssplitlen()` could be 0,
resulting in an empty array of size 0 being passed to lookupCommandLogic.
this should be a false positive, `argc` can't be 0 when strings are not NULL.
SOLUTION: add an assert to let the compiler know that `argc` is positive.
6) sha1.c
COMPILER: gcc-12
WARNING:
```
In function ‘SHA1Update’,
inlined from ‘SHA1Final’ at sha1.c:195:5:
sha1.c:152:13: warning: ‘SHA1Transform’ reading 64 bytes from a region of size 0 [-Wstringop-overread]
152 | SHA1Transform(context->state, &data[i]);
| ^
sha1.c:152:13: note: referencing argument 2 of type ‘const unsigned char[64]’
sha1.c: In function ‘SHA1Final’:
sha1.c:56:6: note: in a call to function ‘SHA1Transform’
56 | void SHA1Transform(uint32_t state[5], const unsigned char buffer[64])
| ^
In function ‘SHA1Update’,
inlined from ‘SHA1Final’ at sha1.c:198:9:
sha1.c:152:13: warning: ‘SHA1Transform’ reading 64 bytes from a region of size 0 [-Wstringop-overread]
152 | SHA1Transform(context->state, &data[i]);
| ^
sha1.c:152:13: note: referencing argument 2 of type ‘const unsigned char[64]’
sha1.c: In function ‘SHA1Final’:
sha1.c:56:6: note: in a call to function ‘SHA1Transform’
56 | void SHA1Transform(uint32_t state[5], const unsigned char buffer[64])
```
REASON: due to the bug[https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80922], when
enable LTO, gcc-12 will not see `diagnostic ignored "-Wstringop-overread"`, resulting in a warning.
SOLUTION: temporarily set SHA1Update to noinline to avoid compiler warnings due
to LTO being enabled until the above gcc bug is fixed.
7) zmalloc.h
COMPILER: GCC-12
WARNING:
```
In function ‘memset’,
inlined from ‘moduleCreateContext’ at module.c:877:5,
inlined from ‘RM_GetDetachedThreadSafeContext’ at module.c:8410:5:
/usr/include/x86_64-linux-gnu/bits/string_fortified.h:59:10: warning: ‘__builtin_memset’ writing 104 bytes into a region of size 0 overflows the destination [-Wstringop-overflow=]
59 | return __builtin___memset_chk (__dest, __ch, __len,
```
REASON: due to the GCC-12 bug [https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96503],
GCC-12 cannot see alloc_size, which causes GCC to think that the actual size of memory
is 0 when checking with __glibc_objsize0().
SOLUTION: temporarily set malloc-related interfaces to `noinline` to avoid compiler warnings
due to LTO being enabled until the above gcc bug is fixed.
## Other changes
1) Fixed `ps -p [pid]` doesn't output `<defunct>` when using procps 4.x causing `replication
child dies when parent is killed - diskless` test to fail.
2) Add a new fortify CI with GCC-13 and ubuntu-lunar docker image.
this pr fix two wrongs:
1. When client’s querybuf is pre-allocated for a fat argv, we need to update the
querybuf_peak of the client immediately to completely avoid the unexpected
shrinking of querybuf in the next clientCron (before data arrives to set the peak).
2. the protocol's bulklen does not include `\r\n`, but the allocation and the data we
read does. so in `clientsCronResizeQueryBuffer`, the `resize` or `querybuf_peak`
should add these 2 bytes.
the first bug is likely to hit us on large payloads over slow connections, in which case
transferring the payload can take longer and a cron event will be triggered (specifically
if there are not a lot of clients)
In cluster mode, when a node restart as a replica, it doesn't immediately
sync with the master, replication is enabled in clusterCron. It means that
sometime server.masterhost is NULL and we wrongly judge it in beforeSleep.
In this case, we may trigger a fast activeExpireCycle in beforeSleep, but the
node's flag is actually a replica, that can lead to data inconsistency. In this
PR, we use iAmMaster to replace the `server.masterhost == NULL`
This is an overlook in #7001, and more discussion in #11783.
Now that the command argument specs are available at runtime (#9656), this PR addresses
#8084 by implementing a complete solution for command-line hinting in `redis-cli`.
It correctly handles nearly every case in Redis's complex command argument definitions, including
`BLOCK` and `ONEOF` arguments, reordering of optional arguments, and repeated arguments
(even when followed by mandatory arguments). It also validates numerically-typed arguments.
It may not correctly handle all possible combinations of those, but overall it is quite robust.
Arguments are only matched after the space bar is typed, so partial word matching is not
supported - that proved to be more confusing than helpful. When the user's current input
cannot be matched against the argument specs, hinting is disabled.
Partial support has been implemented for legacy (pre-7.0) servers that do not support
`COMMAND DOCS`, by falling back to a statically-compiled command argument table.
On startup, if the server does not support `COMMAND DOCS`, `redis-cli` will now issue
an `INFO SERVER` command to retrieve the server version (unless `HELLO` has already
been sent, in which case the server version will be extracted from the reply to `HELLO`).
The server version will be used to filter the commands and arguments in the command table,
removing those not supported by that version of the server. However, the static table only
includes core Redis commands, so with a legacy server hinting will not be supported for
module commands. The auto generated help.h and the scripts that generates it are gone.
Command and argument tables for the server and CLI use different structs, due primarily
to the need to support different runtime data. In order to generate code for both, macros
have been added to `commands.def` (previously `commands.c`) to make it possible to
configure the code generation differently for different use cases (one linked with redis-server,
and one with redis-cli).
Also adding a basic testing framework for the command hints based on new (undocumented)
command line options to `redis-cli`: `--test_hint 'INPUT'` prints out the command-line hint for
a given input string, and `--test_hint_file <filename>` runs a suite of test cases for the hinting
mechanism. The test suite is in `tests/assets/test_cli_hint_suite.txt`, and it is run from
`tests/integration/redis-cli.tcl`.
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
In #11012, we changed the way command durations were computed to handle the same command being executed multiple times. This commit fixes some misses from that commit.
* Wait commands were not correctly reporting their duration if the timeout was reached.
* Multi/scripts/and modules with RM_Call were not properly resetting the duration between inner calls, leading to them reporting cumulative duration.
* When a blocked client is freed, the call and duration are always discarded.
This commit also adds an assert if the duration is not properly reset, potentially indicating that a report to call statistics was missed. The assert potentially be removed in the future, as it's mainly intended to detect misses in tests.
This PR fix several unrelated bugs that were discovered by the same set of tests
(WAITAOF tests in #11713), could make the `WAITAOF` test hang.
The change in `backgroundRewriteDoneHandler` is about MP-AOF.
That leftover / old code assumes that we started a new AOF file just now
(when we have a new base into which we're gonna incrementally write), but
the fact is that with MP-AOF, the fork done handler doesn't really affect the
incremental file being maintained by the parent process, there's no reason to
re-issue `SELECT`, and no reason to update any of the fsync variables in that flow.
This should have been deleted with MP-AOF (introduced in #9788, 7.0).
The damage is that the update to `aof_fsync_offset` will cause us to miss an fsync
in `flushAppendOnlyFile`, that happens if we stop write commands in `AOF_FSYNC_EVERYSEC`
while an AOFRW is in progress. This caused a new `WAITAOF` test to sometime hang forever.
Also because of MP-AOF, we needed to change `aof_fsync_offset` to `aof_last_incr_fsync_offset`
and match it to `aof_last_incr_size` in `flushAppendOnlyFile`. This is because in the past we compared
`aof_fsync_offset` and `aof_current_size`, but with MP-AOF it could be the total AOF file will be
smaller after AOFRW, and the (already existing) incr file still has data that needs to be fsynced.
The change in `flushAppendOnlyFile`, about the `AOF_FSYNC_ALWAYS`, it is follow #6053
(the details is in #5985), we also check `AOF_FSYNC_ALWAYS` to handle a case where
appendfsync is changed from everysec to always while there is data that's written but not yet fsynced.
Previously we would run the module command filters even upon blocked
command reprocessing. This could modify the command, and it's args.
This is irrelevant in the context of a command being reprocessed (it already
went through the filters), as well as breaks the crashed command lookup
that exists in the case of a reprocessed command.
fixes#11894.
Co-authored-by: Oran Agra <oran@redislabs.com>
Allow running blocking commands from within a module using `RM_Call`.
Today, when `RM_Call` is used, the fake client that is used to run command
is marked with `CLIENT_DENY_BLOCKING` flag. This flag tells the command
that it is not allowed to block the client and in case it needs to block, it must
fallback to some alternative (either return error or perform some default behavior).
For example, `BLPOP` fallback to simple `LPOP` if it is not allowed to block.
All the commands must respect the `CLIENT_DENY_BLOCKING` flag (including
module commands). When the command invocation finished, Redis asserts that
the client was not blocked.
This PR introduces the ability to call blocking command using `RM_Call` by
passing a callback that will be called when the client will get unblocked.
In order to do that, the user must explicitly say that he allow to perform blocking
command by passing a new format specifier argument, `K`, to the `RM_Call`
function. This new flag will tell Redis that it is allow to run blocking command
and block the client. In case the command got blocked, Redis will return a new
type of call reply (`REDISMODULE_REPLY_PROMISE`). This call reply indicates
that the command got blocked and the user can set the on_unblocked handler using
`RM_CallReplyPromiseSetUnblockHandler`.
When clients gets unblocked, it eventually reaches `processUnblockedClients` function.
This is where we check if the client is a fake module client and if it is, we call the unblock
callback instead of performing the usual unblock operations.
**Notice**: `RM_CallReplyPromiseSetUnblockHandler` must be called atomically
along side the command invocation (without releasing the Redis lock in between).
In addition, unlike other CallReply types, the promise call reply must be released
by the module when the Redis GIL is acquired.
The module can abort the execution on the blocking command (if it was not yet
executed) using `RM_CallReplyPromiseAbort`. the API will return `REDISMODULE_OK`
on success and `REDISMODULE_ERR` if the operation is already executed.
**Notice** that in case of misbehave module, Abort might finished successfully but the
operation will not really be aborted. This can only happened if the module do not respect
the disconnect callback of the blocked client.
For pure Redis commands this can not happened.
### Atomicity Guarantees
The API promise that the unblock handler will run atomically as an execution unit.
This means that all the operation performed on the unblock handler will be wrapped
with a multi exec transaction when replicated to the replica and AOF.
The API **do not** grantee any other atomicity properties such as when the unblock
handler will be called. This gives us the flexibility to strengthen the grantees (or not)
in the future if we will decide that we need a better guarantees.
That said, the implementation **does** provide a better guarantees when performing
pure Redis blocking command like `BLPOP`. In this case the unblock handler will run
atomically with the operation that got unblocked (for example, in case of `BLPOP`, the
unblock handler will run atomically with the `LPOP` operation that run when the command
got unblocked). This is an implementation detail that might be change in the future and the
module writer should not count on that.
### Calling blocking commands while running on script mode (`S`)
`RM_Call` script mode (`S`) was introduced on #0372. It is used for usecases where the
command that was invoked on `RM_Call` comes from a user input and we want to make
sure the user will not run dangerous commands like `shutdown`. Some command, such
as `BLPOP`, are marked with `NO_SCRIPT` flag, which means they will not be allowed on
script mode. Those commands are marked with `NO_SCRIPT` just because they are
blocking commands and not because they are dangerous. Now that we can run blocking
commands on RM_Call, there is no real reason not to allow such commands on script mode.
The underline problem is that the `NO_SCRIPT` flag is abused to also mark some of the
blocking commands (notice that those commands know not to block the client if it is not
allowed to do so, and have a fallback logic to such cases. So even if those commands
were not marked with `NO_SCRIPT` flag, it would not harm Redis, and today we can
already run those commands within multi exec).
In addition, not all blocking commands are marked with `NO_SCRIPT` flag, for example
`blmpop` are not marked and can run from within a script.
Those facts shows that there are some ambiguity about the meaning of the `NO_SCRIPT`
flag, and its not fully clear where it should be use.
The PR suggest that blocking commands should not be marked with `NO_SCRIPT` flag,
those commands should handle `CLIENT_DENY_BLOCKING` flag and only block when
it's safe (like they already does today). To achieve that, the PR removes the `NO_SCRIPT`
flag from the following commands:
* `blmove`
* `blpop`
* `brpop`
* `brpoplpush`
* `bzpopmax`
* `bzpopmin`
* `wait`
This might be considered a breaking change as now, on scripts, instead of getting
`command is not allowed from script` error, the user will get some fallback behavior
base on the command implementation. That said, the change matches the behavior
of scripts and multi exec with respect to those commands and allow running them on
`RM_Call` even when script mode is used.
### Additional RedisModule API and changes
* `RM_BlockClientSetPrivateData` - Set private data on the blocked client without the
need to unblock the client. This allows up to set the promise CallReply as the private
data of the blocked client and abort it if the client gets disconnected.
* `RM_BlockClientGetPrivateData` - Return the current private data set on a blocked client.
We need it so we will have access to this private data on the disconnect callback.
* On RM_Call, the returned reply will be added to the auto memory context only if auto
memory is enabled, this allows us to keep the call reply for longer time then the context
lifetime and does not force an unneeded borrow relationship between the CallReply and
the RedisModuleContext.
Implementing the WAITAOF functionality which would allow the user to
block until a specified number of Redises have fsynced all previous write
commands to the AOF.
Syntax: `WAITAOF <num_local> <num_replicas> <timeout>`
Response: Array containing two elements: num_local, num_replicas
num_local is always either 0 or 1 representing the local AOF on the master.
num_replicas is the number of replicas that acknowledged the a replication
offset of the last write being fsynced to the AOF.
Returns an error when called on replicas, or when called with non-zero
num_local on a master with AOF disabled, in all other cases the response
just contains number of fsync copies.
Main changes:
* Added code to keep track of replication offsets that are confirmed to have
been fsynced to disk.
* Keep advancing master_repl_offset even when replication is disabled (and
there's no replication backlog, only if there's an AOF enabled).
This way we can use this command and it's mechanisms even when replication
is disabled.
* Extend REPLCONF ACK to `REPLCONF ACK <ofs> FACK <ofs>`, the FACK
will be appended only if there's an AOF on the replica, and already ignored on
old masters (thus backwards compatible)
* WAIT now no longer wait for the replication offset after your last command, but
rather the replication offset after your last write (or read command that caused
propagation, e.g. lazy expiry).
Unrelated changes:
* WAIT command respects CLIENT_DENY_BLOCKING (not just CLIENT_MULTI)
Implementation details:
* Add an atomic var named `fsynced_reploff_pending` that's updated
(usually by the bio thread) and later copied to the main `fsynced_reploff`
variable (only if the AOF base file exists).
I.e. during the initial AOF rewrite it will not be used as the fsynced offset
since the AOF base is still missing.
* Replace close+fsync bio job with new BIO_CLOSE_AOF (AOF specific)
job that will also update fsync offset the field.
* Handle all AOF jobs (BIO_CLOSE_AOF, BIO_AOF_FSYNC) in the same bio
worker thread, to impose ordering on their execution. This solves a
race condition where a job could set `fsynced_reploff_pending` to a higher
value than another pending fsync job, resulting in indicating an offset
for which parts of the data have not yet actually been fsynced.
Imposing an ordering on the jobs guarantees that fsync jobs are executed
in increasing order of replication offset.
* Drain bio jobs when switching `appendfsync` to "always"
This should prevent a write race between updates to `fsynced_reploff_pending`
in the main thread (`flushAppendOnlyFile` when set to ALWAYS fsync), and
those done in the bio thread.
* Drain the pending fsync when starting over a new AOF to avoid race conditions
with the previous AOF offsets overriding the new one (e.g. after switching to
replicate from a new master).
* Make sure to update the fsynced offset at the end of the initial AOF rewrite.
a must in case there are no additional writes that trigger a periodic fsync,
specifically for a replica that does a full sync.
Limitations:
It is possible to write a module and a Lua script that propagate to the AOF and doesn't
propagate to the replication stream. see REDISMODULE_ARGV_NO_REPLICAS and luaRedisSetReplCommand.
These features are incompatible with the WAITAOF command, and can result
in two bad cases. The scenario is that the user executes command that only
propagates to AOF, and then immediately
issues a WAITAOF, and there's no further writes on the replication stream after that.
1. if the the last thing that happened on the replication stream is a PING
(which increased the replication offset but won't trigger an fsync on the replica),
then the client would hang forever (will wait for an fack that the replica will never
send sine it doesn't trigger any fsyncs).
2. if the last thing that happened is a write command that got propagated properly,
then WAITAOF will be released immediately, without waiting for an fsync (since
the offset didn't change)
Refactoring:
* Plumbing to allow bio worker to handle multiple job types
This introduces infrastructure necessary to allow BIO workers to
not have a 1-1 mapping of worker to job-type. This allows in the
future to assign multiple job types to a single worker, either as
a performance/resource optimization, or as a way of enforcing
ordering between specific classes of jobs.
Co-authored-by: Oran Agra <oran@redislabs.com>
Work in progress towards implementing a reply schema as part of COMMAND DOCS, see #9845
Since ironing the details of the reply schema of each and every command can take a long time, we
would like to merge this PR when the infrastructure is ready, and let this mature in the unstable branch.
Meanwhile the changes of this PR are internal, they are part of the repo, but do not affect the produced build.
### Background
In #9656 we add a lot of information about Redis commands, but we are missing information about the replies
### Motivation
1. Documentation. This is the primary goal.
2. It should be possible, based on the output of COMMAND, to be able to generate client code in typed
languages. In order to do that, we need Redis to tell us, in detail, what each reply looks like.
3. We would like to build a fuzzer that verifies the reply structure (for now we use the existing
testsuite, see the "Testing" section)
### Schema
The idea is to supply some sort of schema for the various replies of each command.
The schema will describe the conceptual structure of the reply (for generated clients), as defined in RESP3.
Note that the reply structure itself may change, depending on the arguments (e.g. `XINFO STREAM`, with
and without the `FULL` modifier)
We decided to use the standard json-schema (see https://json-schema.org/) as the reply-schema.
Example for `BZPOPMIN`:
```
"reply_schema": {
"oneOf": [
{
"description": "Timeout reached and no elements were popped.",
"type": "null"
},
{
"description": "The keyname, popped member, and its score.",
"type": "array",
"minItems": 3,
"maxItems": 3,
"items": [
{
"description": "Keyname",
"type": "string"
},
{
"description": "Member",
"type": "string"
},
{
"description": "Score",
"type": "number"
}
]
}
]
}
```
#### Notes
1. It is ok that some commands' reply structure depends on the arguments and it's the caller's responsibility
to know which is the relevant one. this comes after looking at other request-reply systems like OpenAPI,
where the reply schema can also be oneOf and the caller is responsible to know which schema is the relevant one.
2. The reply schemas will describe RESP3 replies only. even though RESP3 is structured, we want to use reply
schema for documentation (and possibly to create a fuzzer that validates the replies)
3. For documentation, the description field will include an explanation of the scenario in which the reply is sent,
including any relation to arguments. for example, for `ZRANGE`'s two schemas we will need to state that one
is with `WITHSCORES` and the other is without.
4. For documentation, there will be another optional field "notes" in which we will add a short description of
the representation in RESP2, in case it's not trivial (RESP3's `ZRANGE`'s nested array vs. RESP2's flat
array, for example)
Given the above:
1. We can generate the "return" section of all commands in [redis-doc](https://redis.io/commands/)
(given that "description" and "notes" are comprehensive enough)
2. We can generate a client in a strongly typed language (but the return type could be a conceptual
`union` and the caller needs to know which schema is relevant). see the section below for RESP2 support.
3. We can create a fuzzer for RESP3.
### Limitations (because we are using the standard json-schema)
The problem is that Redis' replies are more diverse than what the json format allows. This means that,
when we convert the reply to a json (in order to validate the schema against it), we lose information (see
the "Testing" section below).
The other option would have been to extend the standard json-schema (and json format) to include stuff
like sets, bulk-strings, error-string, etc. but that would mean also extending the schema-validator - and that
seemed like too much work, so we decided to compromise.
Examples:
1. We cannot tell the difference between an "array" and a "set"
2. We cannot tell the difference between simple-string and bulk-string
3. we cannot verify true uniqueness of items in commands like ZRANGE: json-schema doesn't cover the
case of two identical members with different scores (e.g. `[["m1",6],["m1",7]]`) because `uniqueItems`
compares (member,score) tuples and not just the member name.
### Testing
This commit includes some changes inside Redis in order to verify the schemas (existing and future ones)
are indeed correct (i.e. describe the actual response of Redis).
To do that, we added a debugging feature to Redis that causes it to produce a log of all the commands
it executed and their replies.
For that, Redis needs to be compiled with `-DLOG_REQ_RES` and run with
`--reg-res-logfile <file> --client-default-resp 3` (the testsuite already does that if you run it with
`--log-req-res --force-resp3`)
You should run the testsuite with the above args (and `--dont-clean`) in order to make Redis generate
`.reqres` files (same dir as the `stdout` files) which contain request-response pairs.
These files are later on processed by `./utils/req-res-log-validator.py` which does:
1. Goes over req-res files, generated by redis-servers, spawned by the testsuite (see logreqres.c)
2. For each request-response pair, it validates the response against the request's reply_schema
(obtained from the extended COMMAND DOCS)
5. In order to get good coverage of the Redis commands, and all their different replies, we chose to use
the existing redis test suite, rather than attempt to write a fuzzer.
#### Notes about RESP2
1. We will not be able to use the testing tool to verify RESP2 replies (we are ok with that, it's time to
accept RESP3 as the future RESP)
2. Since the majority of the test suite is using RESP2, and we want the server to reply with RESP3
so that we can validate it, we will need to know how to convert the actual reply to the one expected.
- number and boolean are always strings in RESP2 so the conversion is easy
- objects (maps) are always a flat array in RESP2
- others (nested array in RESP3's `ZRANGE` and others) will need some special per-command
handling (so the client will not be totally auto-generated)
Example for ZRANGE:
```
"reply_schema": {
"anyOf": [
{
"description": "A list of member elements",
"type": "array",
"uniqueItems": true,
"items": {
"type": "string"
}
},
{
"description": "Members and their scores. Returned in case `WITHSCORES` was used.",
"notes": "In RESP2 this is returned as a flat array",
"type": "array",
"uniqueItems": true,
"items": {
"type": "array",
"minItems": 2,
"maxItems": 2,
"items": [
{
"description": "Member",
"type": "string"
},
{
"description": "Score",
"type": "number"
}
]
}
}
]
}
```
### Other changes
1. Some tests that behave differently depending on the RESP are now being tested for both RESP,
regardless of the special log-req-res mode ("Pub/Sub PING" for example)
2. Update the history field of CLIENT LIST
3. Added basic tests for commands that were not covered at all by the testsuite
### TODO
- [x] (maybe a different PR) add a "condition" field to anyOf/oneOf schemas that refers to args. e.g.
when `SET` return NULL, the condition is `arguments.get||arguments.condition`, for `OK` the condition
is `!arguments.get`, and for `string` the condition is `arguments.get` - https://github.com/redis/redis/issues/11896
- [x] (maybe a different PR) also run `runtest-cluster` in the req-res logging mode
- [x] add the new tests to GH actions (i.e. compile with `-DLOG_REQ_RES`, run the tests, and run the validator)
- [x] (maybe a different PR) figure out a way to warn about (sub)schemas that are uncovered by the output
of the tests - https://github.com/redis/redis/issues/11897
- [x] (probably a separate PR) add all missing schemas
- [x] check why "SDOWN is triggered by misconfigured instance replying with errors" fails with --log-req-res
- [x] move the response transformers to their own file (run both regular, cluster, and sentinel tests - need to
fight with the tcl including mechanism a bit)
- [x] issue: module API - https://github.com/redis/redis/issues/11898
- [x] (probably a separate PR): improve schemas: add `required` to `object`s - https://github.com/redis/redis/issues/11899
Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>
Co-authored-by: Hanna Fadida <hanna.fadida@redislabs.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: Shaya Potter <shaya@redislabs.com>
Currently there is no BUG. However during some internal code changes
I found that it can happen (for example in case new code will not update
the buf_peak) which can currently lead to memory overrun which is much
harder to detect and root cause.
Why did I please the assert here? The reason is to be able to have the
buf_peak value without the risk of it being overriden by the peak_reset
We have cases where we print information (might be important but by
no means an error indicator) with the LL_WARNING level.
Demoting these to LL_NOTICE:
- oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
- User requested shutdown...
This is also true for cases that we encounter a rare but normal situation.
Demoting to LL_NOTICE. Examples:
- AOF was enabled but there is already another background operation. An AOF background was scheduled to start when possible.
- Connection with master lost.
base on yoav-steinberg's https://github.com/redis/redis/pull/10650#issuecomment-1112280554
and yossigo's https://github.com/redis/redis/pull/10650#pullrequestreview-967677676
* Make it clear that current_client is the root client that was called by
external connection
* add executing_client which is the client that runs the current command
(can be a module or a script)
* Remove script_caller that was used for commands that have CLIENT_SCRIPT
to get the client that called the script. in most cases, that's the current_client,
and in others (when being called from a module), it could be an intermediate
client when we actually want the original one used by the external connection.
bugfixes:
* RM_Call with C flag should log ACL errors with the requested user rather than
the one used by the original client, this also solves a crash when RM_Call is used
with C flag from a detached thread safe context.
* addACLLogEntry would have logged info about the script_caller, but in case the
script was issued by a module command we actually want the current_client. the
exception is when RM_Call is called from a timer event, in which case we don't
have a current_client.
behavior changes:
* client side tracking for scripts now tracks the keys that are read by the script
instead of the keys that are declared by the caller for EVAL
other changes:
* Log both current_client and executing_client in the crash log.
* remove prepareLuaClient and resetLuaClient, being dead code that was forgotten.
* remove scriptTimeSnapshot and snapshot_time and instead add cmd_time_snapshot
that serves all commands and is reset only when execution nesting starts.
* remove code to propagate CLIENT_FORCE_REPL from the executed command
to the script caller since scripts aren't propagated anyway these days and anyway
this flag wouldn't have had an effect since CLIENT_PREVENT_PROP is added by scriptResetRun.
* fix a module GIL violation issue in afterSleep that was introduced in #10300 (unreleased)
Starting from Redis 7.0 (#9890) we started wrapping everything a command
propagates with MULTI/EXEC. The problem is that both SCAN and RANDOMKEY can
lazy-expire arbitrary keys (similar behavior to active-expire), and put DELs in a transaction.
Fix: When these commands are called without a parent exec-unit (e.g. not in EVAL or
MULTI) we avoid wrapping their DELs in a transaction (for the same reasons active-expire
and eviction avoids a transaction)
This PR adds a per-command flag that indicates that the command may touch arbitrary
keys (not the ones in the arguments), and uses that flag to avoid the MULTI-EXEC.
For now, this flag is internal, since we're considering other solutions for the future.
Note for cluster mode: if SCAN/RANDOMKEY is inside EVAL/MULTI it can still cause the
same situation (as it always did), but it won't cause a CROSSSLOT because replicas and AOF
do not perform slot checks.
The problem with the above is mainly for 3rd party ecosystem tools that propagate commands
from master to master, or feed an AOF file with redis-cli into a master.
This PR aims to fix the regression in redis 7.0, and we opened #11792 to try to handle the
bigger problem with lazy expire better for another release.
# Background
The RDB file is usually generated and used once and seldom used again, but the content would reside in page cache until OS evicts it. A potential problem is that once the free memory exhausts, the OS have to reclaim some memory from page cache or swap anonymous page out, which may result in a jitters to the Redis service.
Supposing an exact scenario, a high-capacity machine hosts many redis instances, and we're upgrading the Redis together. The page cache in host machine increases as RDBs are generated. Once the free memory drop into low watermark(which is more likely to happen in older Linux kernel like 3.10, before [watermark_scale_factor](https://lore.kernel.org/lkml/1455813719-2395-1-git-send-email-hannes@cmpxchg.org/) is introduced, the `low watermark` is linear to `min watermark`, and there'is not too much buffer space for `kswapd` to be wake up to reclaim memory), a `direct reclaim` happens, which means the process would stall to wait for memory allocation.
# What the PR does
The PR introduces a capability to reclaim the cache when the RDB is operated. Generally there're two cases, read and write the RDB. For read it's a little messy to address the incremental reclaim, so the reclaim is done in one go in background after the load is finished to avoid blocking the work thread. For write, incremental reclaim amortizes the work of reclaim so no need to put it into background, and the peak watermark of cache can be reduced in this way.
Two cases are addresses specially, replication and restart, for both of which the cache is leveraged to speed up the processing, so the reclaim is postponed to a right time. To do this, a flag is added to`rdbSave` and `rdbLoad` to control whether the cache need to be kept, with the default value false.
# Something deserve noting
1. Though `posix_fadvise` is the POSIX standard, but only few platform support it, e.g. Linux, FreeBSD 10.0.
2. In Linux `posix_fadvise` only take effect on writeback-ed pages, so a `sync`(or `fsync`, `fdatasync`) is needed to flush the dirty page before `posix_fadvise` if we reclaim write cache.
# About test
A unit test is added to verify the effect of `posix_fadvise`.
In integration test overall cache increase is checked, as well as the cache backed by RDB as a specific TCL test is executed in isolated Github action job.
In #7875 (Redis 6.2), we changed the sds alloc to be the usable allocation
size in order to:
> reduce the need for realloc calls by making the sds implicitly take over
the internal fragmentation
This change was done most sds functions, excluding `sdsRemoveFreeSpace` and
`sdsResize`, the reason is that in some places (e.g. clientsCronResizeQueryBuffer)
we call sdsRemoveFreeSpace when we see excessive free space and want to trim it.
so if we don't trim it exactly to size, the caller may still see excessive free space and
call it again and again.
However, this resulted in some excessive calls to realloc, even when there's no need
and it's gonna be a no-op (e.g. when reducing 15 bytes allocation to 13).
It turns out that a call for realloc with jemalloc can be expensive even if it ends up
doing nothing, so this PR adds a check using `je_nallocx`, which is cheap to avoid
the call for realloc.
in addition to that this PR unifies sdsResize and sdsRemoveFreeSpace into common
code. the difference between them was that sdsResize would avoid using SDS_TYPE_5,
since it want to keep the string ready to be resized again, while sdsRemoveFreeSpace
would permit using SDS_TYPE_5 and get an optimal memory consumption.
now both methods take a `would_regrow` argument that makes it more explicit.
the only actual impact of that is that in clientsCronResizeQueryBuffer we call both sdsResize
and sdsRemoveFreeSpace for in different cases, and we now prevent the use of SDS_TYPE_5 in both.
The new test that was added to cover this concern used to pass before this PR as well,
this PR is just a performance optimization and cleanup.
Benchmark:
`redis-benchmark -c 100 -t set -d 512 -P 10 -n 100000000`
on i7-9850H with jemalloc, shows improvement from 1021k ops/sec to 1067k (average of 3 runs).
some 4.5% improvement.
Co-authored-by: Oran Agra <oran@redislabs.com>