A minor debugging change that helped in the investigation of
https://github.com/valkey-io/valkey/issues/1251. Basically there are
some edge cases where we want to fully isolate a note from receiving
packets, but can't suspend the process because we need it to continue
sending outbound traffic. So, added a filter for that.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
At some point unit tests stopped building on MacOS because of duplicate
symbols. I had originally solved this problem by using a flag that
overrides symbols, but the much better solution is to mark the duplicate
symbols as weak and they can be overridden during linking. (Symbols by
default are strong, strong symbols override weak symbols)
I also added macos unit build to the CI, so that this doesn't silently
break in the future again.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Masahiro Ide <masahiro.ide@lycorp.co.jp>
Signed-off-by: Masahiro Ide <imasahiro9@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Masahiro Ide <masahiro.ide@lycorp.co.jp>
This commit hopefully improves the formatting of the codebase by setting
ColumnLimit to 0 and hence stopping clang-format from trying to put as
much stuff in one line as possible.
This change enabled us to remove most of `clang-format off` directives
and fixed a bunch of lines that looked like this:
```c
#define KEY \
VALUE /* comment */
```
Additionally, one pair of `clang-format off` / `clang-format on` had
`clang-format off` as the second comment and hence didn't enable the
formatting for the rest of the file. This commit addresses this issue as
well.
Please tell me if anything in the changes seem off. If everything is
fine, I will add this commit to `.git-blame-ignore-revs` later.
---------
Signed-off-by: Mikhail Koviazin <mikhail.koviazin@aiven.io>
Print the full client info by using catClientInfoString, the
info is useful when we want to identify the source of request.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Implement data masking for user data in server logs and diagnostic output. This change prevents potential exposure of confidential information, such as PII, and enhances privacy protection. It masks all command arguments, client names, and client usernames.
Added a new hide-user-data-from-log configuration item, default yes.
---------
Signed-off-by: Amit Nagler <anagler123@gmail.com>
sdsAllocSize returns the correct size without consulting the
allocator. Which is much faster than consulting the allocator.
The only exception is SDS_TYPE_5, for which it has to
consult the allocator.
This PR also sets alloc field correctly for embedded string objects.
It assumes that no allocator would allocate a buffer larger
than `259 + sizeof(robj)` for embedded string. We use embedded strings
for strings up to 44 bytes. If this assumption is wrong, the whole
function would require a rewrite. In general case sds type adjustment
might be needed. Such logic should go to sds.c.
---------
Signed-off-by: Vadym Khoptynets <vadymkh@amazon.com>
For `debug object` command, we use `val->lru` but ignore the `lfu` mode.
So in `lfu` mode, `debug object` would return meaningless `lru` descriptions.
Added two new fields lfu_freq and lfu_access_time_minutes.
Signed-off-by: jiangyujie.jyj <yjjiang1996@163.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Adding FAST option to DEBUG OBJECT command.
The light version only shows the light weight infomation,
which mostly O(1). The pre-existing version that show more
stats such as serializedlength sometimes is time consuming.
This should allow looking into debug stats (the key expired
but not deleted), even on huge object, on which we're afraid
to run the command for fear of causing a server freeze.
Somehow like 3ca451c46fed894bf49e7561fa0282d2583f1c06.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Update references of copyright being assigned to Salvatore when it was
transferred to Redis Ltd. as per
https://github.com/valkey-io/valkey/issues/544.
---------
Signed-off-by: Pieter Cailliau <pieter@redis.com>
Introduce several improvements to improve the stability of dual-channel
replication and fix compatibility issues.
1. Make dual-channel-replication tests more reliable: use pause instead
of forced sleep.
2. Fix race conditions when freeing RDB client.
3. Check if sync was stopped during local buffer streaming.
4. Fix $ENDOFFSET reply format to work on 32-bit machines too.
---------
Signed-off-by: naglera <anagler123@gmail.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
In this PR we introduce the main benefit of dual channel replication by
continuously steaming the COB (client output buffers) in parallel to the
RDB and thus keeping the primary's side COB small AND accelerating the
overall sync process. By streaming the replication data to the replica
during the full sync, we reduce
1. Memory load from the primary's node.
2. CPU load from the primary's main process. [Latest performance
tests](#data)
## Motivation
* Reduce primary memory load. We do that by moving the COB tracking to
the replica side. This also decrease the chance for COB overruns. Note
that primary's input buffer limits at the replica side are less
restricted then primary's COB as the replica plays less critical part in
the replication group. While increasing the primary’s COB may end up
with primary reaching swap and clients suffering, at replica side we’re
more at ease with it. Larger COB means better chance to sync
successfully.
* Reduce primary main process CPU load. By opening a new, dedicated
connection for the RDB transfer, child processes can have direct access
to the new connection. Due to TLS connection restrictions, this was not
possible using one main connection. We eliminate the need for the child
process to use the primary's child-proc -> main-proc pipeline, thus
freeing up the main process to process clients queries.
## Dual Channel Replication high level interface design
- Dual channel replication begins when the replica sends a `REPLCONF
CAPA DUALCHANNEL` to the primary during initial
handshake. This is used to state that the replica is capable of dual
channel sync and that this is the replica's main channel, which is not
used for snapshot transfer.
- When replica lacks sufficient data for PSYNC, the primary will send
`-FULLSYNCNEEDED` response instead
of RDB data. As a next step, the replica creates a new connection
(rdb-channel) and configures it against
the primary with the appropriate capabilities and requirements. The
replica then requests a sync
using the RDB channel.
- Prior to forking, the primary sends the replica the snapshot's end
repl-offset, and attaches the replica
to the replication backlog to keep repl data until the replica requests
psync. The replica uses the main
channel to request a PSYNC starting at the snapshot end offset.
- The primary main threads sends incremental changes via the main
channel, while the bgsave process
sends the RDB directly to the replica via the rdb-channel. As for the
replica, the incremental
changes are stored on a local buffer, while the RDB is loaded into
memory.
- Once the replica completes loading the rdb, it drops the
rdb-connection and streams the accumulated incremental
changes into memory. Repl steady state continues normally.
## New replica state machine

## Data <a name="data"></a>



## Explanation
These graphs demonstrate performance improvements during full sync
sessions using rdb-channel + streaming rdb directly from the background
process to the replica.
First graph- with at most 50 clients and light weight commands, we saw
5%-7.5% improvement in write latency during sync session.
Two graphs below- full sync was tested during heavy read commands from
the primary (such as sdiff, sunion on large sets). In that case, the
child process writes to the replica without sharing CPU with the loaded
main process. As a result, this not only improves client response time,
but may also shorten sync time by about 50%. The shorter sync time
results in less memory being used to store replication diffs (>60% in
some of the tested cases).
## Test setup
Both primary and replica in the performance tests ran on the same
machine. RDB size in all tests is 3.7gb. I generated write load using
valkey-benchmark ` ./valkey-benchmark -r 100000 -n 6000000 lpush my_list
__rand_int__`.
---------
Signed-off-by: naglera <anagler123@gmail.com>
Signed-off-by: naglera <58042354+naglera@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
This PR is 1 of 3 PRs intended to achieve the goal of 1 million requests
per second, as detailed by [dan touitou](https://github.com/touitou-dan)
in https://github.com/valkey-io/valkey/issues/22. This PR modifies the
IO threads to be fully asynchronous, which is a first and necessary step
to allow more work offloading and better utilization of the IO threads.
### Current IO threads state:
Valkey IO threads were introduced in Redis 6.0 to allow better
utilization of multi-core machines. Before this, Redis was
single-threaded and could only use one CPU core for network and command
processing. The introduction of IO threads helps in offloading the IO
operations to multiple threads.
**Current IO Threads flow:**
1. Initialization: When Redis starts, it initializes a specified number
of IO threads. These threads are in addition to the main thread, each
thread starts with an empty list, the main thread will populate that
list in each event-loop with pending-read-clients or
pending-write-clients.
2. Read Phase: The main thread accepts incoming connections and reads
requests from clients. The reading of requests are offloaded to IO
threads. The main thread puts the clients ready-to-read in a list and
set the global io_threads_op to IO_THREADS_OP_READ, the IO threads pick
the clients up, perform the read operation and parse the first incoming
command.
3. Command Processing: After reading the requests, command processing is
still single-threaded and handled by the main thread.
4. Write Phase: Similar to the read phase, the write phase is also be
offloaded to IO threads. The main thread prepares the response in the
clients’ output buffer then the main thread puts the client in the list,
and sets the global io_threads_op to the IO_THREADS_OP_WRITE. The IO
threads then pick the clients up and perform the write operation to send
the responses back to clients.
5. Synchronization: The main-thread communicate with the threads on how
many jobs left per each thread with atomic counter. The main-thread
doesn’t access the clients while being handled by the IO threads.
**Issues with current implementation:**
* Underutilized Cores: The current implementation of IO-threads leads to
the underutilization of CPU cores.
* The main thread remains responsible for a significant portion of
IO-related tasks that could be offloaded to IO-threads.
* When the main-thread is processing client’s commands, the IO threads
are idle for a considerable amount of time.
* Notably, the main thread's performance during the IO-related tasks is
constrained by the speed of the slowest IO-thread.
* Limited Offloading: Currently, Since the Main-threads waits
synchronously for the IO threads, the Threads perform only read-parse,
and write operations, with parsing done only for the first command. If
the threads can do work asynchronously we may offload more work to the
threads reducing the load from the main-thread.
* TLS: Currently, we don't support IO threads with TLS (where offloading
IO would be more beneficial) since TLS read/write operations are not
thread-safe with the current implementation.
### Suggested change
Non-blocking main thread - The main thread and IO threads will operate
in parallel to maximize efficiency. The main thread will not be blocked
by IO operations. It will continue to process commands independently of
the IO thread's activities.
**Implementation details**
**Inter-thread communication.**
* We use a static, lock-free ring buffer of fixed size (2048 jobs) for
the main thread to send jobs and for the IO to receive them. If the ring
buffer fills up, the main thread will handle the task itself, acting as
back pressure (in case IO operations are more expensive than command
processing). A static ring buffer is a better candidate than a dynamic
job queue as it eliminates the need for allocation/freeing per job.
* An IO job will be in the format: ` [void* function-call-back | void
*data] `where data is either a client to read/write from and the
function-ptr is the function to be called with the data for example
readQueryFromClient using this format we can use it later to offload
other types of works to the IO threads.
* The Ring buffer is one way from the main-thread to the IO thread, Upon
read/write event the main thread will send a read/write job then in
before sleep it will iterate over the pending read/write clients to
checking for each client if the IO threads has already finished handling
it. The IO thread signals it has finished handling a client read/write
by toggling an atomic flag read_state / write_state on the client
struct.
**Thread Safety**
As suggested in this solution, the IO threads are reading from and
writing to the clients' buffers while the main thread may access those
clients.
We must ensure no race conditions or unsafe access occurs while keeping
the Valkey code simple and lock free.
Minimal Action in the IO Threads
The main change is to limit the IO thread operations to the bare
minimum. The IO thread will access only the client's struct and only the
necessary fields in this struct.
The IO threads will be responsible for the following:
* Read Operation: The IO thread will only read and parse a single
command. It will not update the server stats, handle read errors, or
parsing errors. These tasks will be taken care of by the main thread.
* Write Operation: The IO thread will only write the available data. It
will not free the client's replies, handle write errors, or update the
server statistics.
To achieve this without code duplication, the read/write code has been
refactored into smaller, independent components:
* Functions that perform only the read/parse/write calls.
* Functions that handle the read/parse/write results.
This refactor accounts for the majority of the modifications in this PR.
**Client Struct Safe Access**
As we ensure that the IO threads access memory only within the client
struct, we need to ensure thread safety only for the client's struct's
shared fields.
* Query Buffer
* Command parsing - The main thread will not try to parse a command from
the query buffer when a client is offloaded to the IO thread.
* Client's memory checks in client-cron - The main thread will not
access the client query buffer if it is offloaded and will handle the
querybuf grow/shrink when the client is back.
* CLIENT LIST command - The main thread will busy-wait for the IO thread
to finish handling the client, falling back to the current behavior
where the main thread waits for the IO thread to finish their
processing.
* Output Buffer
* The IO thread will not change the client's bufpos and won't free the
client's reply lists. These actions will be done by the main thread on
the client's return from the IO thread.
* bufpos / block→used: As the main thread may change the bufpos, the
reply-block→used, or add/delete blocks to the reply list while the IO
thread writes, we add two fields to the client struct: io_last_bufpos
and io_last_reply_block. The IO thread will write until the
io_last_bufpos, which was set by the main-thread before sending the
client to the IO thread. If more data has been added to the cob in
between, it will be written in the next write-job. In addition, the main
thread will not trim or merge reply blocks while the client is
offloaded.
* Parsing Fields
* Client's cmd, argc, argv, reqtype, etc., are set during parsing.
* The main thread will indicate to the IO thread not to parse a cmd if
the client is not reset. In this case, the IO thread will only read from
the network and won't attempt to parse a new command.
* The main thread won't access the c→cmd/c→argv in the CLIENT LIST
command as stated before it will busy wait for the IO threads.
* Client Flags
* c→flags, which may be changed by the main thread in multiple places,
won't be accessed by the IO thread. Instead, the main thread will set
the c→io_flags with the information necessary for the IO thread to know
the client's state.
* Client Close
* On freeClient, the main thread will busy wait for the IO thread to
finish processing the client's read/write before proceeding to free the
client.
* Client's Memory Limits
* The IO thread won't handle the qb/cob limits. In case a client crosses
the qb limit, the IO thread will stop reading for it, letting the main
thread know that the client crossed the limit.
**TLS**
TLS is currently not supported with IO threads for the following
reasons:
1. Pending reads - If SSL has pending data that has already been read
from the socket, there is a risk of not calling the read handler again.
To handle this, a list is used to hold the pending clients. With IO
threads, multiple threads can access the list concurrently.
2. Event loop modification - Currently, the TLS code
registers/unregisters the file descriptor from the event loop depending
on the read/write results. With IO threads, multiple threads can modify
the event loop struct simultaneously.
3. The same client can be sent to 2 different threads concurrently
(https://github.com/redis/redis/issues/12540).
Those issues were handled in the current PR:
1. The IO thread only performs the read operation. The main thread will
check for pending reads after the client returns from the IO thread and
will be the only one to access the pending list.
2. The registering/unregistering of events will be similarly postponed
and handled by the main thread only.
3. Each client is being sent to the same dedicated thread (c→id %
num_of_threads).
**Sending Replies Immediately with IO threads.**
Currently, after processing a command, we add the client to the
pending_writes_list. Only after processing all the clients do we send
all the replies. Since the IO threads are now working asynchronously, we
can send the reply immediately after processing the client’s requests,
reducing the command latency. However, if we are using AOF=always, we
must wait for the AOF buffer to be written, in which case we revert to
the current behavior.
**IO threads dynamic adjustment**
Currently, we use an all-or-nothing approach when activating the IO
threads. The current logic is as follows: if the number of pending write
clients is greater than twice the number of threads (including the main
thread), we enable all threads; otherwise, we enable none. For example,
if 8 IO threads are defined, we enable all 8 threads if there are 16
pending clients; else, we enable none.
It makes more sense to enable partial activation of the IO threads. If
we have 10 pending clients, we will enable 5 threads, and so on. This
approach allows for a more granular and efficient allocation of
resources based on the current workload.
In addition, the user will now be able to change the number of I/O
threads at runtime. For example, when decreasing the number of threads
from 4 to 2, threads 3 and 4 will be closed after flushing their job
queues.
**Tests**
Currently, we run the io-threads tests with 4 IO threads
(443d80f168/.github/workflows/daily.yml (L353)).
This means that we will not activate the IO threads unless there are 8
(threads * 2) pending write clients per single loop, which is unlikely
to happened in most of tests, meaning the IO threads are not currently
being tested.
To enforce the main thread to always offload work to the IO threads,
regardless of the number of pending events, we add an
events-per-io-thread configuration with a default value of 2. When set
to 0, this configuration will force the main thread to always offload
work to the IO threads.
When we offload every single read/write operation to the IO threads, the
IO-threads are running with 100% CPU when running multiple tests
concurrently some tests fail as a result of larger than expected command
latencies. To address this issue, we have to add some after or wait_for
calls to some of the tests to ensure they pass with IO threads as well.
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
This PR incorporates changes related to key embedding described in the
https://github.com/redis/redis/issues/12216
With this change there will be no `key` pointer and embedded the key
within the `dictEntry`. 1 byte is used for additional bookkeeping.
Overall the saving would be 7 bytes on average.
Key changes:
New dict entry type introduced, which is now used as an entry for the
main dictionary:
```c
typedef struct {
union {
void *val;
uint64_t u64;
int64_t s64;
double d;
} v;
struct dictEntry *next; /* Next entry in the same hash bucket. */
uint8_t key_header_size; /* offset into key_buf where the key is located at. */
unsigned char key_buf[]; /* buffer with embedded key. */
} embeddedDictEntry;
```
One new function has been added to the dictType:
```c
size_t (*embedKey)(unsigned char *buf, size_t buf_len, const void *key, unsigned char *header_size);
```
Change is opt-in per dict type, hence sets, hashes and other types that
are using dictionary are not impacted.
With this change main dictionary now owns the data, so copy on insert in
dbAdd is no longer needed.
### Benchmarking results
TLDR; Around 9-10% memory usage reduction in overall memory usage for
scenario with key of 16 bytes and value of 8 bytes and 16 bytes. The
throughput per second varies but is similar or greater in most of the
run(s) with the changes against unstable (ae2d421).
---------
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
When there is a link failure while an ongoing MEET request is sent the
sending node stops sending anymore MEET and starts sending PINGs. Since
every node responds to PINGs from unknown nodes with a PONG, the
receiving node never adds the sending node. But the sending node adds
the receiving node when it sees a PONG. This can lead to asymmetry in
cluster membership. This changes makes the sender keep sending MEET
until it sees a PONG, avoiding the asymmetry.
---------
Signed-off-by: Sankar <1890648+srgsanky@users.noreply.github.com>
More rebranding of
* Log messages (#252)
* The DENIED error reply
* Internal function names and comments, mainly Lua API
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
I have validated that these settings closely match the existing coding
style with one major exception on `BreakBeforeBraces`, which will be
`Attach` going forward. The mixed `BreakBeforeBraces` styles in the
current codebase are hard to imitate and also very odd IMHO - see below
```
if (a == 1) { /*Attach */
}
```
```
if (a == 1 ||
b == 2)
{ /* Why? */
}
```
Please do NOT merge just yet. Will add the github action next once the
style is reviewed/approved.
---------
Signed-off-by: Ping Xie <pingxie@google.com>
Existing code does not build on macOS < 10.7. There are two issues:
1. The check for `MAC_OS_10_6_DETECTED` does the opposite of what it
should, since `AvailabilityMacros.h` does not define underscore-prefixed
versions of macros. `__MAC_OS_X_VERSION_MAX_ALLOWED` evaluates to 0, and
on 10.6 everything breaks down:
Credits to @mohd-akram who pointed out at possible origin of the this
problem.
2. Once that is fixed, on 10.6 when building for `ppc` there are new
errors, because the code uses inaccurate assumptions for archs. Fix that
too.
---------
Signed-off-by: Sergey Fedorov <vital.had@gmail.com>
Fix the mem_freed variable to be initialized with init.
with this PR prevents the variable from acting unknowingly.
Signed-off-by: NAM UK KIM <namuk2004@naver.com>
This patch try to do following things:
1. Rename `redis_*` and `REDIS_*` macros defined in config.h to
`valkey_*`, `VALKEY_*` and update associated used files. (`redis_fstat`,
`redis_fsync`, `REDIS_THREAD_STACK_SIZE`, etc.)
2. Remove the leading double underscore for guard macro in config.h.
---------
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
New config 'extended-redis-compatibility' (yes/no) default no
* When yes:
* Use "Redis" in the following error replies:
- `-LOADING Redis is loading the dataset in memory`
- `-BUSY Redis is busy`...
- `-MISCONF Redis is configured to`...
* Use `=== REDIS BUG REPORT` in the crash log delimiters (START and
END).
* The HELLO command returns `"server" => "redis"` and `"version" =>
"7.2.4"` (our Redis OSS compatibility version).
* The INFO field for mode is called `"redis_mode"`.
* When no:
* Use "Valkey" instead of "Redis" in the mentioned errors and crash log
delimiters.
* The HELLO command returns `"server" => "valkey"` and the Valkey
version for `"version"`.
* The INFO field for mode is called `"server_mode"`.
* Documentation added in valkey.conf:
> Valkey is largely compatible with Redis OSS, apart from a few cases
where
> Redis OSS compatibility mode makes Valkey pretend to be Redis. Enable
this
> only if you have problems with tools or clients. This is a temporary
> configuration added in Valkey 8.0 and is scheduled to have no effect
in Valkey
> 9.0 and be completely removed in Valkey 10.0.
* A test case for the config is added. It is designed to fail if the
config is not deprecated (has no effect) in Valkey 9 and deleted in
Valkey 10.
* Other test cases are adjusted to work regardless of this config.
Fixes#274Fixes#61
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This includes comments used for module API documentation.
* Strategy for replacement: Regex search: `(//|/\*| \*|#).* ("|\()?(r|R)edis( |\.
|'|\n|,|-|\)|")(?!nor the names of its contributors)(?!Ltd.)(?!Labs)(?!Contributors.)`
* Don't edit copyright comments
* Replace "Redis version X.X" -> "Redis OSS version X.X" to distinguish
from newly licensed repository
* Replace "Redis Object" -> "Object"
* Exclude markdown for now
* Don't edit Lua scripting comments referring to redis.X API
* Replace "Redis Protocol" -> "RESP"
* Replace redis-benchmark, -cli, -server, -check-aof/rdb with "valkey-"
prefix
* Most other places, I use best judgement to either remove "Redis", or
replace with "the server" or "server"
Fixes#148
---------
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This commit updates the following fields:
1. server_version -> valkey_version in server info. Since we would like
to advertise specific compatibility, we are making the version specific
to valkey. servername will remain as an optional indicator, and other
valkey compatible stores might choose to advertise something else.
1. We dropped redis-ver from the API. This isn't related to API
compatibility, but we didn't want to "fake" that valkey was creating an
rdb from a Redis version.
1. Renamed server-ver -> valkey_version in rdb info. Same as point one,
we want to explicitly indicate this was created by a valkey server.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Fix#146
Removed REDISMODULE_ prefixes from the core source code to align with
the new SERVERMODULE_ naming convention. Added a new 'redismodule.h'
header file to ensure full backward compatibility with existing modules.
This compatibility layer maps all legacy REDISMODULE_ prefixed
identifiers to their new SERVERMODULE_ equivalents, allowing existing
Redis modules to function without modification.
---------
Signed-off-by: Ping Xie <pingxie@google.com>
New info information to be used to determine the valkey versioning info.
Internally, introduce new define values for "SERVER_VERSION" which is
different from the Redis compatibility version, "REDIS_VERSION".
Add two new info fields:
`server_version`: The Valkey server version
`server_name`: Indicates that the server is valkey.
Add one new RDB field: `server_ver`, which indicates the valkey version
that produced the server.
Add 3 new LUA globals: `SERVER_VERSION_NUM`, `SERVER_VERSION`, and
`SERVER_NAME`. Which reflect the valkey version instead of the Redis
compatibility version.
Also clean up various places where Redis and configuration was being
used that is no longer necessary.
---------
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Following #12568
In issue #9357, when inserting an element larger than 1GB, we currently
store it in a plain node instead of a listpack.
Presently, when we insert an element that exceeds the maximum size of a
packed node, it cannot be accommodated in any other nodes, thus ending
up isolated like a large element.
I.e. it's a node with only one element, but it's listpack encoded rather
than a plain buffer.
This PR lowers the threshold for considering an element as 'large' from
1GB to the maximum size of a node.
While this change doesn't completely resolve the bug mentioned in the
previous PR, it does mitigate its potential impact.
As a result of this change, we can now only use LSET to replace an
element with another element that falls below the maximum size
threshold.
In the worst-case scenario, with a fill of -5, the largest packed node
we can create is 2GB (32k * 64k):
* 32k: The smallest element in a listpack is 2 bytes, which allows us to
store up to 32k elements.
* 64k: This is the maximum size for a single quicklist node.
## Others
To fully fix#9357, we need more work, as discussed in #12568, when we
insert an element into a quicklistNode, it may be created in a new node,
put into another node, or merged, and we can't correctly delete the node
that was supposed to be deleted.
I'm not sure it's worth it, since it involves a lot of modifications.
The test fails here and there:
```
*** [err]: expire scan should skip dictionaries with lot's of empty buckets in tests/unit/expire.tcl
scan didn't handle slot skipping logic.
```
There are two case:
1. In the case of passing the test, we use child process to avoid the
dict resize, but it can not completely limit it, since in the dictDelete
we still have chance to trigger the resize (hit the force radio). The
reason why our test passed before is because the expire dict is still
in the rehashing process, so the dictDelete, the dictShrinkIfNeeded can
not trigger the resize.
2. In the case of failing the test, the expire dict finished the
rehashing,
so the last dictDelete, the dictShrinkIfNeeded trigger the dict resize
since it hit the force radio, so the skipping logic fail.
This PR add a new DEBUG command to disbale the dict resize.
# Description
Gather most of the scattered `redisDb`-related code from the per-slot
dict PR (#11695) and turn it to a new data structure, `kvstore`. i.e.
it's a class that represents an array of dictionaries.
# Motivation
The main motivation is code cleanliness, the idea of using an array of
dictionaries is very well-suited to becoming a self-contained data
structure.
This allowed cleaning some ugly code, among others: loops that run twice
on the main dict and expires dict, and duplicate code for allocating and
releasing this data structure.
# Notes
1. This PR reverts the part of https://github.com/redis/redis/pull/12848
where the `rehashing` list is global (handling rehashing `dict`s is
under the responsibility of `kvstore`, and should not be managed by the
server)
2. This PR also replaces the type of `server.pubsubshard_channels` from
`dict**` to `kvstore` (original PR:
https://github.com/redis/redis/pull/12804). After that was done,
server.pubsub_channels was also chosen to be a `kvstore` (with only one
`dict`, which seems odd) just to make the code cleaner by making it the
same type as `server.pubsubshard_channels`, see
`pubsubtype.serverPubSubChannels`
3. the keys and expires kvstores are currenlty configured to allocate
the individual dicts only when the first key is added (unlike before, in
which they allocated them in advance), but they won't release them when
the last key is deleted.
Worth mentioning that due to the recent change the reply of DEBUG
HTSTATS changed, in case no keys were ever added to the db.
before:
```
127.0.0.1:6379> DEBUG htstats 9
[Dictionary HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
[Expires HT]
Hash table 0 stats (main hash table):
No stats available for empty dictionaries
```
after:
```
127.0.0.1:6379> DEBUG htstats 9
[Dictionary HT]
[Expires HT]
```
This change is trying to make two failure modes a bit easier to deep dive:
1. If a serverPanic or serverAssert occurs during the info (or module)
printing, it will recursively panic, which is a lot of fun as it will
just keep recursively printing. It will eventually stack overflow, but
will generate a lot of text in the process.
2. When a segfault happens during the segfault handler, no information
is communicated other than it happened. This can be problematic because
`info` may help diagnose the real issue, but without fixing the
recursive crash it might be hard to get at that info.
see discussion from after https://github.com/redis/redis/pull/12453 was
merged
----
This PR replaces signals that are not considered async-signal-safe
(AS-safe) with safe calls.
#### **1. serverLog() and serverLogFromHandler()**
`serverLog` uses unsafe calls. It was decided that we will **avoid**
`serverLog` calls by the signal handlers when:
* The signal is not fatal, such as SIGALRM. In these cases, we prefer
using `serverLogFromHandler` which is the safe version of `serverLog`.
Note they have different prompts:
`serverLog`: `62220:M 26 Oct 2023 14:39:04.526 # <msg>`
`serverLogFromHandler`: `62220:signal-handler (1698331136) <msg>`
* The code was added recently. Calls to `serverLog` by the signal
handler have been there ever since Redis exists and it hasn't caused
problems so far. To avoid regression, from now we should use
`serverLogFromHandler`
#### **2. `snprintf` `fgets` and `strtoul`(base = 16) -------->
`_safe_snprintf`, `fgets_async_signal_safe`, `string_to_hex`**
The safe version of `snprintf` was taken from
[here](8cfc4ca5e7/src/mc_util.c (L754))
#### **3. fopen(), fgets(), fclose() --------> open(), read(), close()**
#### **4. opendir(), readdir(), closedir() --------> open(),
syscall(SYS_getdents64), close()**
#### **5. Threads_mngr sync mechanisms**
* waiting for the thread to generate stack trace: semaphore -------->
busy-wait
* `globals_rw_lock` was removed: as we are not using malloc and the
semaphore anymore we don't need to protect `ThreadsManager_cleanups`.
#### **6. Stacktraces buffer**
The initial problem was that we were not able to safely call malloc
within the signal handler.
To solve that we created a buffer on the stack of `writeStacktraces` and
saved it in a global pointer, assuming that under normal circumstances,
the function `writeStacktraces` would complete before any thread
attempted to write to it. However, **if threads lag behind, they might
access this global pointer after it no longer belongs to the
`writeStacktraces` stack, potentially corrupting memory.**
To address this, various solutions were discussed
[here](https://github.com/redis/redis/pull/12658#discussion_r1390442896)
Eventually, we decided to **create a pipe** at server startup that will
remain valid as long as the process is alive.
We chose this solution due to its minimal memory usage, and since
`write()` and `read()` are atomic operations. It ensures that stack
traces from different threads won't mix.
**The stacktraces collection process is now as follows:**
* Cleaning the pipe to eliminate writes of late threads from previous
runs.
* Each thread writes to the pipe its stacktrace
* Waiting for all the threads to mark completion or until a timeout (2
sec) is reached
* Reading from the pipe to print the stacktraces.
#### **7. Changes that were considered and eventually were dropped**
* replace watchdog timer with a POSIX timer:
according to [settimer man](https://linux.die.net/man/2/setitimer)
> POSIX.1-2008 marks getitimer() and setitimer() obsolete, recommending
the use of the POSIX timers API
([timer_gettime](https://linux.die.net/man/2/timer_gettime)(2),
[timer_settime](https://linux.die.net/man/2/timer_settime)(2), etc.)
instead.
However, although it is supposed to conform to POSIX std, POSIX timers
API is not supported on Mac.
You can take a look here at the Linux implementation:
[here](c7562ee135)
To avoid messing up the code, and uncertainty regarding compatibility,
it was decided to drop it for now.
* avoid using sds (uses malloc) in logConfigDebugInfo
It was considered to print config info instead of using sds, however
apparently, `logConfigDebugInfo` does more than just print the sds, so
it was decided this fix is out of this issue scope.
#### **8. fix Signal mask check**
The check `signum & sig_mask` intended to indicate whether the signal is
blocked by the thread was incorrect. Actually, the bit position in the
signal mask corresponds to the signal number. We fixed this by changing
the condition to: `sig_mask & (1L << (sig_num - 1))`
#### **9. Unrelated changes**
both `fork.tcl `and `util.tcl` implemented a function called
`count_log_message` expecting different parameters. This caused
confusion when trying to run daily tests with additional test parameters
to run a specific test.
The `count_log_message` in `fork.tcl` was removed and the calls were
replaced with calls to `count_log_message` located in `util.tcl`
---------
Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
This PR reworks the clustering code to support multiple clustering
implementations, specifically, the current "legacy" clustering
implementation or, although not part of this PR, flotilla (see #10875).
Which implementation is used could be a compile-time flag (will be added
later). Legacy clustering functionality remains unchanged.
The basic idea is as follows. The header cluster.h now contains function
declarations that define the "Cluster API." These are the contract and
interface between any clustering implementation and the rest of the
Redis source code. Some of the function definitions are shared between
all clustering implementations. These functions are in cluster.c. The
functions and data structures specific to legacy clustering are in
cluster-legacy.c/h. One consequence of this is that the structs
clusterNode and clusterState which were previously "public" to the rest
of Redis are now hidden behind the Cluster API.
The PR is divided up into commits, each with a commit message explaining
the changes. some are just mass rename or moving code between files (may
not require close inspection / review), others are manual changes.
One other, related change is:
- The "failover" command is now plugged into the Cluster API so that the
clustering implementation can (a) enable/disable the command to begin
with and if enabled (b) perform the actual failover. The "failover"
command remains disabled for legacy clustering.
Move clusterNode into cluster_legacy.h.
In order to achieve this some accessor methods
were added and also a refactor of how debugCommand
handles cluster related subcommands.
Signed-off-by: Josh Hershberg <yehoshua@redis.com>
When we are loading data, it is not safe to generate data
through DEBUG POPULATE. POPULATE may generate keys, causing
panic when loading data with duplicate keys.
When using DB iterator, it will use dictInitSafeIterator to init a old safe
dict iterator. When dbIteratorNext is used, it will jump to the next slot db
dict when we are done a dict. During this process, we do not have any calls to
dictResumeRehashing, which causes the dict's pauserehash to always be > 0.
And at last, it will be returned directly in dictRehashMilliseconds, which causes
us to have slot dict in a state where rehash cannot be completed.
In the "expire scan should skip dictionaries with lot's of empty buckets" test,
adding a `keys *` can reproduce the problem stably. `keys *` will call dbIteratorNext
to trigger a traversal of all slot dicts.
Added dbReleaseIterator and dbIteratorInitNextSafeIterator methods to call dictResetIterator.
Issue was introduced in #11695.
Using heap allocation during signal handlers is unsafe.
This PR purpose is to replace all the heap allocations done within the signal
handlers raised upon server crash and assertions.
These were added in #12453.
writeStacktraces(): allocates the stacktraces output array on the calling thread's
stack and assigns the address to a global variable.
It calls `ThreadsManager_runOnThreads()` that invokes `collect_stacktrace_data()`
by each thread: each thread writes to a different location in the above array to allow
sync writes.
get_ready_to_signal_threads_tids(): instead of allocating the `tids` array, it receives it
as a fixed size array parameter, allocated on on the stack of the calling function, and
returns the number of valid threads. The array size is hard-coded to 50.
`ThreadsManager_runOnThreads():` To avoid the outputs array allocation, the
**callback signature** was changed. Now it should return void. This function return type
has also changed to int - returns 1 if successful, and 0 otherwise.
Other unsafe calls will be handled in following PRs
This is an implementation of https://github.com/redis/redis/issues/10589 that eliminates 16 bytes per entry in cluster mode, that are currently used to create a linked list between entries in the same slot. Main idea is splitting main dictionary into 16k smaller dictionaries (one per slot), so we can perform all slot specific operations, such as iteration, without any additional info in the `dictEntry`. For Redis cluster, the expectation is that there will be a larger number of keys, so the fixed overhead of 16k dictionaries will be The expire dictionary is also split up so that each slot is logically decoupled, so that in subsequent revisions we will be able to atomically flush a slot of data.
## Important changes
* Incremental rehashing - one big change here is that it's not one, but rather up to 16k dictionaries that can be rehashing at the same time, in order to keep track of them, we introduce a separate queue for dictionaries that are rehashing. Also instead of rehashing a single dictionary, cron job will now try to rehash as many as it can in 1ms.
* getRandomKey - now needs to not only select a random key, from the random bucket, but also needs to select a random dictionary. Fairness is a major concern here, as it's possible that keys can be unevenly distributed across the slots. In order to address this search we introduced binary index tree). With that data structure we are able to efficiently find a random slot using binary search in O(log^2(slot count)) time.
* Iteration efficiency - when iterating dictionary with a lot of empty slots, we want to skip them efficiently. We can do this using same binary index that is used for random key selection, this index allows us to find a slot for a specific key index. For example if there are 10 keys in the slot 0, then we can quickly find a slot that contains 11th key using binary search on top of the binary index tree.
* scan API - in order to perform a scan across the entire DB, the cursor now needs to not only save position within the dictionary but also the slot id. In this change we append slot id into LSB of the cursor so it can be passed around between client and the server. This has interesting side effect, now you'll be able to start scanning specific slot by simply providing slot id as a cursor value. The plan is to not document this as defined behavior, however. It's also worth nothing the SCAN API is now technically incompatible with previous versions, although practically we don't believe it's an issue.
* Checksum calculation optimizations - During command execution, we know that all of the keys are from the same slot (outside of a few notable exceptions such as cross slot scripts and modules). We don't want to compute the checksum multiple multiple times, hence we are relying on cached slot id in the client during the command executions. All operations that access random keys, either should pass in the known slot or recompute the slot.
* Slot info in RDB - in order to resize individual dictionaries correctly, while loading RDB, it's not enough to know total number of keys (of course we could approximate number of keys per slot, but it won't be precise). To address this issue, we've added additional metadata into RDB that contains number of keys in each slot, which can be used as a hint during loading.
* DB size - besides `DBSIZE` API, we need to know size of the DB in many places want, in order to avoid scanning all dictionaries and summing up their sizes in a loop, we've introduced a new field into `redisDb` that keeps track of `key_count`. This way we can keep DBSIZE operation O(1). This is also kept for O(1) expires computation as well.
## Performance
This change improves SET performance in cluster mode by ~5%, most of the gains come from us not having to maintain linked lists for keys in slot, non-cluster mode has same performance. For workloads that rely on evictions, the performance is similar because of the extra overhead for finding keys to evict.
RDB loading performance is slightly reduced, as the slot of each key needs to be computed during the load.
## Interface changes
* Removed `overhead.hashtable.slot-to-keys` to `MEMORY STATS`
* Scan API will now require 64 bits to store the cursor, even on 32 bit systems, as the slot information will be stored.
* New RDB version to support the new op code for SLOT information.
---------
Co-authored-by: Vitaly Arbuzov <arvit@amazon.com>
Co-authored-by: Harkrishn Patro <harkrisp@amazon.com>
Co-authored-by: Roshan Khatri <rvkhatri@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
Use the __MAC_OS_X_VERSION_MIN_REQUIRED macro to detect the
macOS system version instead of using MAC_OS_X_VERSION_10_6.
From MacOSX14.0.sdk, the default definitions of MAC_OS_X_VERSION_xxx have
been removed in usr/include/AvailabilityMacros.h. It includes AvailabilityVersions.h,
where the following condition must be met:
`#if (!defined(_POSIX_C_SOURCE) && !defined(_XOPEN_SOURCE)) || defined(_DARWIN_C_SOURCE)`
Only then will MAC_OS_X_VERSION_xxx be defined.
However, in the project, _DARWIN_C_SOURCE is not defined, which leads to the
loss of the definition for MAC_OS_X_VERSION_10_6.