More declerations can be moved into cluster_legacy.h
as they are not requied for the cluster api. The code
was simply moved, not changed in any way.
Signed-off-by: Josh Hershberg <yehoshua@redis.com>
Move clusterNode into cluster_legacy.h.
In order to achieve this some accessor methods
were added and also a refactor of how debugCommand
handles cluster related subcommands.
Signed-off-by: Josh Hershberg <yehoshua@redis.com>
Move clusterState into cluster_legacy.h. In order to achieve
this some "accessor" methods needed to be added to the
cluster API and some other minor refactors.
Signed-off-by: Josh Hershberg <yehoshua@redis.com>
Move some declerations from cluster.h to cluster_legacy.h.
The items moved are specific to the legacy clustering
implementation and DO NOT require any other refactoring
other than moving them from one file to another.
Signed-off-by: Josh Hershberg <yehoshua@redis.com>
Optimize the performance of SCAN commands when a match pattern can only contain keys from a
single slot in cluster mode. This can happen when the pattern contains a hash tag before any
wildcard matchers or when the key contains no matchers.
This is an implementation of https://github.com/redis/redis/issues/10589 that eliminates 16 bytes per entry in cluster mode, that are currently used to create a linked list between entries in the same slot. Main idea is splitting main dictionary into 16k smaller dictionaries (one per slot), so we can perform all slot specific operations, such as iteration, without any additional info in the `dictEntry`. For Redis cluster, the expectation is that there will be a larger number of keys, so the fixed overhead of 16k dictionaries will be The expire dictionary is also split up so that each slot is logically decoupled, so that in subsequent revisions we will be able to atomically flush a slot of data.
## Important changes
* Incremental rehashing - one big change here is that it's not one, but rather up to 16k dictionaries that can be rehashing at the same time, in order to keep track of them, we introduce a separate queue for dictionaries that are rehashing. Also instead of rehashing a single dictionary, cron job will now try to rehash as many as it can in 1ms.
* getRandomKey - now needs to not only select a random key, from the random bucket, but also needs to select a random dictionary. Fairness is a major concern here, as it's possible that keys can be unevenly distributed across the slots. In order to address this search we introduced binary index tree). With that data structure we are able to efficiently find a random slot using binary search in O(log^2(slot count)) time.
* Iteration efficiency - when iterating dictionary with a lot of empty slots, we want to skip them efficiently. We can do this using same binary index that is used for random key selection, this index allows us to find a slot for a specific key index. For example if there are 10 keys in the slot 0, then we can quickly find a slot that contains 11th key using binary search on top of the binary index tree.
* scan API - in order to perform a scan across the entire DB, the cursor now needs to not only save position within the dictionary but also the slot id. In this change we append slot id into LSB of the cursor so it can be passed around between client and the server. This has interesting side effect, now you'll be able to start scanning specific slot by simply providing slot id as a cursor value. The plan is to not document this as defined behavior, however. It's also worth nothing the SCAN API is now technically incompatible with previous versions, although practically we don't believe it's an issue.
* Checksum calculation optimizations - During command execution, we know that all of the keys are from the same slot (outside of a few notable exceptions such as cross slot scripts and modules). We don't want to compute the checksum multiple multiple times, hence we are relying on cached slot id in the client during the command executions. All operations that access random keys, either should pass in the known slot or recompute the slot.
* Slot info in RDB - in order to resize individual dictionaries correctly, while loading RDB, it's not enough to know total number of keys (of course we could approximate number of keys per slot, but it won't be precise). To address this issue, we've added additional metadata into RDB that contains number of keys in each slot, which can be used as a hint during loading.
* DB size - besides `DBSIZE` API, we need to know size of the DB in many places want, in order to avoid scanning all dictionaries and summing up their sizes in a loop, we've introduced a new field into `redisDb` that keeps track of `key_count`. This way we can keep DBSIZE operation O(1). This is also kept for O(1) expires computation as well.
## Performance
This change improves SET performance in cluster mode by ~5%, most of the gains come from us not having to maintain linked lists for keys in slot, non-cluster mode has same performance. For workloads that rely on evictions, the performance is similar because of the extra overhead for finding keys to evict.
RDB loading performance is slightly reduced, as the slot of each key needs to be computed during the load.
## Interface changes
* Removed `overhead.hashtable.slot-to-keys` to `MEMORY STATS`
* Scan API will now require 64 bits to store the cursor, even on 32 bit systems, as the slot information will be stored.
* New RDB version to support the new op code for SLOT information.
---------
Co-authored-by: Vitaly Arbuzov <arvit@amazon.com>
Co-authored-by: Harkrishn Patro <harkrisp@amazon.com>
Co-authored-by: Roshan Khatri <rvkhatri@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
Process loss of slot ownership in cluster bus
When a node no longer owns a slot, it clears the bit corresponding
to the slot in the cluster bus messages. The receiving nodes
currently don't record the fact that the sender stopped claiming
a slot until some other node in the cluster starts claiming the slot.
This can cause a slot to go missing during slot migration when subjected
to inopportune race with addition of new shards or a failover.
This fix forces the receiving nodes to process the loss of ownership
to avoid spreading wrong information.
Originally, when "tls-cluster" is enabled, `port` is set to TLS port. In order to support non-TLS clients, `pport` is used to propagate TCP port across cluster nodes. However when "tls-cluster" is disabled, `port` is set to TCP port, and `pport` is not used, which means the cluster cannot provide TLS service unless "tls-cluster" is on.
```
typedef struct {
// ...
uint16_t port; /* Latest known clients port (TLS or plain). */
uint16_t pport; /* Latest known clients plaintext port. Only used if the main clients port is for TLS. */
// ...
} clusterNode;
```
```
typedef struct {
// ...
uint16_t port; /* TCP base port number. */
uint16_t pport; /* Sender TCP plaintext port, if base port is TLS */
// ...
} clusterMsg;
```
This PR renames `port` and `pport` in `clusterNode` to `tcp_port` and `tls_port`, to record both ports no matter "tls-cluster" is enabled or disabled.
This allows to provide TLS service to clients when "tls-cluster" is disabled: when displaying cluster topology, or giving `MOVED` error, server can provide TLS or TCP port according to client's connection type, no matter what type of connection cluster bus is using.
For backwards compatibility, `port` and `pport` in `clusterMsg` are preserved, when "tls-cluster" is enabled, `port` is set to TLS port and `pport` is set to TCP port, when "tls-cluster" is disabled, `port` is set to TCP port and `pport` is set to TLS port (instead of 0).
Also, in the nodes.conf file, a new aux field displaying an extra port is added to complete the persisted info. We may have `tls_port=xxxxx` or `tcp_port=xxxxx` in the aux field, to complete the cluster topology, while the other port is stored in the normal `<ip>:<port>` field. The format is shown below.
```
<node-id> <ip>:<tcp_port>@<cport>,<hostname>,shard-id=...,tls-port=6379 myself,master - 0 0 0 connected 0-1000
```
Or we can switch the position of two ports, both can be correctly resolved.
```
<node-id> <ip>:<tls_port>@<cport>,<hostname>,shard-id=...,tcp-port=6379 myself,master - 0 0 0 connected 0-1000
```
This PR adds a human readable name to a node in clusters that are visible as part of error logs. This is useful so that admins and operators of Redis cluster have better visibility into failures without having to cross-reference the generated ID with some logical identifier (such as pod-ID or EC2 instance ID). This is mentioned in #8948. Specific nodenames can be set by using the variable cluster-announce-human-nodename. The nodename is gossiped using the clusterbus extension in #9530.
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
This commit excludes aux fields from the output of the `cluster nodes` and `cluster replicas` command.
We may decide to re-introduce them in some form or another in the future, but not in v7.2.
Technically declaring a prototype with an empty declaration has been deprecated since the early days of C, but we never got a warning for it. C2x will apparently be introducing a breaking change if you are using this type of declarator, so Clang 15 has started issuing a warning with -pedantic. Although not apparently a problem for any of the compiler we build on, if feels like the right thing is to properly adhere to the C standard and use (void).
Currently while a sharded pubsub message publish tries to propagate the message across the cluster, a NULL check is missing for clusterLink. clusterLink could be NULL if the link is causing memory beyond the set threshold cluster-link-sendbuf-limit and server terminates the link.
This change introduces two things:
Avoids the engine crashes on the publishing node if a message is tried to be sent to a node and the link is NULL.
Adds a debugging tool CLUSTERLINK KILL to terminate the clusterLink between two nodes.
This change increase the frequency of the failover log from 5 minutes to 10 seconds. This log is only emitted when a replica has an outstanding election is progress, and waiting 5 minutes for the next log makes debugging and alarming on the log messages too slow. It also now prints out the number of votes the replica has currently received as well as the number of votes it needs to achieve quorum so that we can track the progress if it's running slowly.
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
This change deletes the dictGetNext and dictGetNextRef functions, so the
dict API doesn't expose the next field at all.
The bucket function in dictScan is deleted. A separate dictScanDefrag function
is added which takes a defrag alloc function to defrag-reallocate the dict entries.
"Dirty" code accessing the dict internals in active defrag is removed.
An 'afterReplaceEntry' is added to dictType, which allows the dict user
to keep the dictEntry metadata up to date after reallocation/defrag/move.
Additionally, for updating the cluster slot-to-key mapping, after a dictEntry
has been reallocated, we need to know which db a dict belongs to, so we store
a pointer to the db in a new metadata section in the dict struct, which is
a new mechanism similar to dictEntry metadata. This adds some complexity but
provides better isolation.
The cluster-announce-port/cluster-announce-bus-port/cluster-announce-tls-port should take effect at runtime
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Introduce Shard IDs to logically group nodes in cluster mode.
1. Added a new "shard_id" field to "cluster nodes" output and nodes.conf after "hostname"
2. Added a new PING extension to propagate "shard_id"
3. Handled upgrade from pre-7.2 releases automatically
4. Refactored PING extension assembling/parsing logic
Behavior of Shard IDs:
Replicas will always follow the shards of their reported primaries. If a primary updates its shard ID, the replica will follow. (This need not follow for cluster v2) This is not an expected use case.
PR #9320 introduces initialization order changes. Now cluster is initialized after modules.
This changes causes a crash if the module uses RM_Call inside the load function
on cluster mode (the code will try to access `server.cluster` which at this point is NULL).
To solve it, separate cluster initialization into 2 phases:
1. Structure initialization that happened before the modules initialization
2. Listener initialization that happened after.
Test was added to verify the fix.
Gossip the cluster node blacklist in ping and pong messages.
This means that CLUSTER FORGET doesn't need to be sent to all nodes in a cluster.
It can be sent to one or more nodes and then be propagated to the rest of them.
For each blacklisted node, its node id and its remaining blacklist TTL is gossiped in a
cluster bus ping extension (introduced in #9530).
since PUBLISH and SPUBLISH use different dictionaries for channels and clients,
and we already have an API for PUBLISH, it only makes sense to have one for SPUBLISH
Add test coverage and unifying some test infrastructure.
* Limit cluster node id length for CLUSTER commands loading
* Cluster node name sanity check for length and values
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
This commit improve malloc efficiency of the slots_info_pairs mechanism in cluster.c
by changing adlist into an array being realloced with greedy growth mechanism
Recently the cluster tests are consistently failing when executed with ASAN in the CI.
I tried to track down the commit that started it, and it appears to be #10293.
Looking at the commit, i realize it didn't affect this test / flow, other than the
replacement of the slots_info_pairs from sds to list.
I concluded that what could be happening is that the slot range is very fragmented,
and that results in many allocations.
with sds, it results in one allocation and also, we have a greedy growth mechanism,
but with adlist, we just have many many small allocations.
this probably causes stress on ASAN, and causes it to be slow at termination.
Implement a new cluster shards command, which provides a flexible and extensible API for topology discovery.
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
publishshard was added in #8621 (7.0 RC1), but the publishshard_sent
stat is not shown in CLUSTER INFO command.
Other changes:
1. Remove useless `needhelp` statements, it was removed in 3dad819.
2. Use `LL_WARNING` log level for some error logs (I/O error, Connection failed).
3. Fix typos that saw by the way.
When I used C++ to develop a redis module. i used `string.data()` as the second parameter `ele`
of `RedisModule_DigestAddStringBuffer`, but there is a warning, since we never change the `ele`,
i think we should use `const char` for it.
This PR adds const to just a handful of module APIs that required it, all not very widely used.
The implication is a breaking change in terms of compilation error that's easy to resolve, and no ABI impact.
The affected APIs are around Digest, Info injection, and Cluster bus messages.
This commit implements a sharded pubsub implementation based off of shard channels.
Co-authored-by: Harkrishn Patro <harkrisp@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Introduce memory management on cluster link buffers:
* Introduce a new `cluster-link-sendbuf-limit` config that caps memory usage of cluster bus link send buffers.
* Introduce a new `CLUSTER LINKS` command that displays current TCP links to/from peers.
* Introduce a new `mem_cluster_links` field under `INFO` command output, which displays the overall memory usage by all current cluster links.
* Introduce a new `total_cluster_links_buffer_limit_exceeded` field under `CLUSTER INFO` command output, which displays the accumulated count of cluster links freed due to `cluster-link-sendbuf-limit`.
For diskless replication in swapdb mode, considering we already spend replica memory
having a backup of current db to restore in case of failure, we can have the following benefits
by instead swapping database only in case we succeeded in transferring db from master:
- Avoid `LOADING` response during failed and successful synchronization for cases where the
replica is already up and running with data.
- Faster total time of diskless replication, because now we're moving from Transfer + Flush + Load
time to Transfer + Load only. Flushing the tempDb is done asynchronously after swapping.
- This could be implemented also for disk replication with similar benefits if consumers are willing
to spend the extra memory usage.
General notes:
- The concept of `backupDb` becomes `tempDb` for clarity.
- Async loading mode will only kick in if the replica is syncing from a master that has the same
repl-id the one it had before. i.e. the data it's getting belongs to a different time of the same timeline.
- New property in INFO: `async_loading` to differentiate from the blocking loading
- Slot to Key mapping is now a field of `redisDb` as it's more natural to access it from both server.db
and the tempDb that is passed around.
- Because this is affecting replicas only, we assume that if they are not readonly and write commands
during replication, they are lost after SYNC same way as before, but we're still denying CONFIG SET
here anyways to avoid complications.
Considerations for review:
- We have many cases where server.loading flag is used and even though I tried my best, there may
be cases where async_loading should be checked as well and cases where it shouldn't (would require
very good understanding of whole code)
- Several places that had different behavior depending on the loading flag where actually meant to just
handle commands coming from the AOF client differently than ones coming from real clients, changed
to check CLIENT_ID_AOF instead.
**Additional for Release Notes**
- Bugfix - server.dirty was not incremented for any kind of diskless replication, as effect it wouldn't
contribute on triggering next database SAVE
- New flag for RM_GetContextFlags module API: REDISMODULE_CTX_FLAGS_ASYNC_LOADING
- Deprecated RedisModuleEvent_ReplBackup. Starting from Redis 7.0, we don't fire this event.
Instead, we have the new RedisModuleEvent_ReplAsyncLoad holding 3 sub-events: STARTED,
ABORTED and COMPLETED.
- New module flag REDISMODULE_OPTIONS_HANDLE_REPL_ASYNC_LOAD for RedisModule_SetModuleOptions
to allow modules to declare they support the diskless replication with async loading (when absent, we fall
back to disk-based loading).
Co-authored-by: Eduardo Semprebon <edus@saxobank.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
* Enhance dict to support arbitrary metadata carried in dictEntry
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
* Rewrite slot-to-keys mapping to linked lists using dict entry metadata
This is a memory enhancement for Redis Cluster.
The radix tree slots_to_keys (which duplicates all key names prefixed with their
slot number) is replaced with a linked list for each slot. The dict entries of
the same cluster slot form a linked list and the pointers are stored as metadata
in each dict entry of the main DB dict.
This commit also moves the slot-to-key API from db.c to cluster.c.
Co-authored-by: Jim Brunner <brunnerj@amazon.com>
* Cleaning up the cluster interface by moving almost all related declarations into cluster.h
(no logic change -- just moving declarations/definitions around)
This initial effort leaves two items out of scope - the configuration parsing into the server
struct and the internals exposed by the clusterNode struct.
* Remove unneeded declarations of dictSds*
Ideally all the dictSds functionality would move from server.c into a dedicated module
so we can avoid the duplication in redis-benchmark/cli
* Move crc16 back into server.h, will be moved out once we create a seperate header file for
hashing functions
The cluster bus is established over TLS or non-TLS depending on the configuration tls-cluster. The client ports distributed in the cluster and sent to clients are assumed to be TLS or non-TLS also depending on tls-cluster.
The cluster bus is now extended to also contain the non-TLS port of clients in a TLS cluster, when available. The non-TLS port of a cluster node, when available, is sent to clients connected without TLS in responses to CLUSTER SLOTS, CLUSTER NODES, CLUSTER SLAVES and MOVED and ASK redirects, instead of the TLS port.
The user was able to override the client port by defining cluster-announce-port. Now cluster-announce-tls-port is added, so the user can define an alternative announce port for both TLS and non-TLS clients.
Fixes#8134
Turns out this was broken since version 4.0 when we added sds size
classes.
The cluster code uses sds for the receive buffer, and then casts it to a
struct and accesses a 64 bit variable.
This commit replaces the use of sds with a simple reallocated buffer.
This commit deals with manual failover as well as non-manual failover.
We did tests with manual failover as follows:
1, Setup redis cluster which holds 16 partions, each having only
1 corresponding replica.
2, Write a batch of data to redis cluster and make sure the redis is doing
a active expire in serverCron.
3, Do a manual failover sequentially to each partions with a time interval
of 3 minutes.
4, Collect logs and do some computaiton work.
The result:
case avgTime maxTime minTime
C1 95.8ms 227ms 25ms
C2 47.9ms 96ms 12ms
C3 12.6ms 27ms 7ms
Explanation
case C1: All nodes use the version before optimization
case C2: Masters use the elder version while replicas use the optimized version
case C3: All nodes use the optimized version
failover time: The time between when replica got a `manual failover request` and
when it `won the failover election`.
avgTime: average failover time
maxTime: maximum failover time
minTime: mimimum failover time
ms: millisecond
Co-authored-by: chendq8 <c.d_q@163.com>
We want to send pings and pongs at specific intervals, since our packets
also contain information about the configuration of the cluster and are
used for gossip. However since our cluster bus is used in a mixed way
for data (such as Pub/Sub or modules cluster messages) and metadata,
sometimes a very busy channel may delay the reception of pong packets.
So after discussing it in #7216, this commit introduces a new field that
is not exposed in the cluster, is only an internal information about
the last time we received any data from a given node: we use this field
in order to avoid detecting failures, claiming data reception of new
data from the node is a proof of liveness.
Changes in behavior:
- Change server.stream_node_max_entries from int64_t to long long, so that it can be used by the generic infra
- standard error reply instead of "repl-backlog-size must be 1 or greater" and such
- tls-port and a few TLS booleans were readable (config get) even when USE_OPENSSL was off (now they aren't)
- syslog-enabled, syslog-ident, cluster-enabled, appendfilename, and supervised didn't have a get (now they do)
- pidfile was initialized to NULL in InitServerConfig but had CONFIG_DEFAULT_PID_FILE in rewriteConfig (so the real default was "", but rewrite would cause it to be set), fixed the rewrite.
- TLS config in server.h was uninitialized (if no tls config args were provided)
Adding test for sanity and coverage