382 Commits

Author SHA1 Message Date
Rain Valentine
88942c8e61
Replace dict with new hashtable for sets datatype (#1176)
The new `hashtable` provides faster lookups and uses less memory than
`dict`.

A TCL test case "SRANDMEMBER with a dict containing long chain" is
deleted because it's covered by a hashtable unit test
"test_random_entry_with_long_chain", which is already present.

This change also moves some logic from dismissMemory (object.c) to
zmadvise_dontneed (zmalloc.c), so the hashtable implementation which
needs the dismiss functionality doesn't need to depend on object.c and
server.h.

This PR follows #1186.

---------

Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-12-14 20:53:48 +01:00
Viktor Söderqvist
3eb8314be6 Replace dict with hashtable for keys, expires and pubsub channels
Instead of a dictEntry with pointers to key and value, the hashtable
has a pointer directly to the value (robj) which can hold an embedded
key and acts as a key-value in the hashtable. This minimizes the number
of pointers to follow and thus the number of memory accesses to lookup
a key-value pair.

        Keys         robj
      hashtable
      +-------+   +-----------------------+
      | 0     |   | type, encoding, LRU   |
      | 1 ------->| refcount, expire      |
      | 2     |   | ptr                   |
      | ...   |   | optional embedded key |
      +-------+   | optional embedded val |
                  +-----------------------+

The expire timestamp (TTL) is also stored in the robj, if any. The expire
hash table points to the same robj.

Overview of changes:

* Replace dict with hashtable in kvstore (kvstore.c)
* Add functions for embedding key and expire in robj (object.c)
  * When there's unused space, reserve an expire field to avoid realloting
    it later if expire is added.
  * Always reserve space for expire for large key names to avoid realloc
    if it's set later.
* Update db functions (db.c)
  * dbAdd, setKey and setExpire reallocate the object when embedding a key
  * setKey does not increment the reference counter, since it would require
    duplicating the object. This responsibility is moved to the caller.
* Remove logic for shared integer objects as values in the database. The keys
  are now embedded in the objects, so all objects in the database need to be
  unique. Thus, we can't use shared objects as values. Also delete test cases
  for shared integers.
* Adjust various commands to the changes mentioned above.
* Adjust defrag code
  * Improvement: Don't access the expires table before defrag has actually
    reallocated the object.
* Adjust test cases that were using hard-coded sizes for dict when realloc
  would happen, and some other adjustments in test cases.
* Adjust memory prefetch for new hash table implementation in IO-threading,
  using new `hashtableIncrementalFind` API
* Adjust offloading of free() to IO threads: Object free to be done in main
  thread while keeping obj->ptr offloading in IO-thread since the DB object is
  now allocated by the main-thread and not by the IO-thread as it used to be.
* Let expireIfNeeded take an optional value, to avoid looking up the expires
  table when possible.

---------

Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: uriyage <78144248+uriyage@users.noreply.github.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Uri Yagelnik <uriy@amazon.com>
2024-12-10 21:30:56 +01:00
Sarthak Aggarwal
9cfe1b3d81
Set Command with IFEQ Support (#1324)
This PR allows the Valkey users to perform conditional updates where the
SET command is completed if the given comparison-value matches the key’s
current value.

Syntax:

```
SET key value IFEQ comparison-value
```

Behavior:

If the values match, the SET completes as expected. If they do not
match, the command returns a (nil), except if the GET argument is also
given (see below).

Behavior with Additional Flags:

1. ```SET key value IFEQ comparison-value GET``` returns the existing
value, regardless of whether it matches comparison-value or not. The
conditional set operation is performed if the given comparison value
matches the existing value. To check if the SET succeeded, the caller
needs to check if the returned string matches the comparison-value.
2. ```SET key value IFEQ comparison-value XX``` is a syntax error.
3.  ```SET key value IFEQ comparison-value NX``` is a syntax error.

Closes: #1215

---------

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
2024-12-10 12:54:49 +01:00
Roman Gershman
b09db3ef78
Fix typo in streams seen-time / active-time test (#1409)
This variable name is wrong, it causes the wrong variable to be asserted.

Signed-off-by: Roman Gershman <romange@gmail.com>
2024-12-09 16:01:43 +08:00
Binbin
5693fe4664
Fix set expire test due to the new lazyfree configs changes (#980)
Test failed because these two PRs #865 and #913.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-09-02 22:43:09 +08:00
Binbin
70624ea63d
Change all the lazyfree configurations to yes by default (#913)
## Set replica-lazy-flush and lazyfree-lazy-user-flush to yes by
default.
There are many problems with running flush synchronously. Even in
single CPU environments, the thread managers should balance between
the freeing and serving incoming requests.

## Set lazy eviction, expire, server-del, user-del to yes by default
We now have a del and a lazyfree del, we also have these configuration
items to control: lazyfree-lazy-eviction, lazyfree-lazy-expire,
lazyfree-lazy-server-del, lazyfree-lazy-user-del. In most cases lazyfree
is better since it reduces the risk of blocking the main thread, and
because we have lazyfreeGetFreeEffort, on those with high effor
(currently
64) will use lazyfree.

Part of #653.

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-09-02 07:07:17 -07:00
Binbin
e3af1a30e4
Fast path in SET if the expiration time is expired (#865)
If the expiration time passed in SET is expired, for example, it
has expired due to the machine time (DTS) or the expiration time
passed in (wrong arg). In this case, we don't need to set the key
and wait for the active expire scan before deleting the key.

Compared with previous changes:
1. If the key does not exist, previously we would set the key and wait
for the active expire to delete it, so it is a set + del from the
perspective
of propaganda. Now we will no set the key and return, so it a NOP.

2. If the key exists, previously we woule set the key and wait
for the active expire to delete it, so it is a set + del From the
perspective
of propaganda. Now we will delete it and return, so it is a del.

Adding a new deleteExpiredKeyFromOverwriteAndPropagate function
to reduce the duplicate code.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-08-31 22:39:07 +08:00
uriyage
bbfd041895
Async IO threads (#758)
This PR is 1 of 3 PRs intended to achieve the goal of 1 million requests
per second, as detailed by [dan touitou](https://github.com/touitou-dan)
in https://github.com/valkey-io/valkey/issues/22. This PR modifies the
IO threads to be fully asynchronous, which is a first and necessary step
to allow more work offloading and better utilization of the IO threads.

### Current IO threads state:

Valkey IO threads were introduced in Redis 6.0 to allow better
utilization of multi-core machines. Before this, Redis was
single-threaded and could only use one CPU core for network and command
processing. The introduction of IO threads helps in offloading the IO
operations to multiple threads.

**Current IO Threads flow:**

1. Initialization: When Redis starts, it initializes a specified number
of IO threads. These threads are in addition to the main thread, each
thread starts with an empty list, the main thread will populate that
list in each event-loop with pending-read-clients or
pending-write-clients.
2. Read Phase: The main thread accepts incoming connections and reads
requests from clients. The reading of requests are offloaded to IO
threads. The main thread puts the clients ready-to-read in a list and
set the global io_threads_op to IO_THREADS_OP_READ, the IO threads pick
the clients up, perform the read operation and parse the first incoming
command.
3. Command Processing: After reading the requests, command processing is
still single-threaded and handled by the main thread.
4. Write Phase: Similar to the read phase, the write phase is also be
offloaded to IO threads. The main thread prepares the response in the
clients’ output buffer then the main thread puts the client in the list,
and sets the global io_threads_op to the IO_THREADS_OP_WRITE. The IO
threads then pick the clients up and perform the write operation to send
the responses back to clients.
5. Synchronization: The main-thread communicate with the threads on how
many jobs left per each thread with atomic counter. The main-thread
doesn’t access the clients while being handled by the IO threads.

**Issues with current implementation:**

* Underutilized Cores: The current implementation of IO-threads leads to
the underutilization of CPU cores.
* The main thread remains responsible for a significant portion of
IO-related tasks that could be offloaded to IO-threads.
* When the main-thread is processing client’s commands, the IO threads
are idle for a considerable amount of time.
* Notably, the main thread's performance during the IO-related tasks is
constrained by the speed of the slowest IO-thread.
* Limited Offloading: Currently, Since the Main-threads waits
synchronously for the IO threads, the Threads perform only read-parse,
and write operations, with parsing done only for the first command. If
the threads can do work asynchronously we may offload more work to the
threads reducing the load from the main-thread.
* TLS: Currently, we don't support IO threads with TLS (where offloading
IO would be more beneficial) since TLS read/write operations are not
thread-safe with the current implementation.

### Suggested change

Non-blocking main thread - The main thread and IO threads will operate
in parallel to maximize efficiency. The main thread will not be blocked
by IO operations. It will continue to process commands independently of
the IO thread's activities.

**Implementation details**

**Inter-thread communication.**

* We use a static, lock-free ring buffer of fixed size (2048 jobs) for
the main thread to send jobs and for the IO to receive them. If the ring
buffer fills up, the main thread will handle the task itself, acting as
back pressure (in case IO operations are more expensive than command
processing). A static ring buffer is a better candidate than a dynamic
job queue as it eliminates the need for allocation/freeing per job.
* An IO job will be in the format: ` [void* function-call-back | void
*data] `where data is either a client to read/write from and the
function-ptr is the function to be called with the data for example
readQueryFromClient using this format we can use it later to offload
other types of works to the IO threads.
* The Ring buffer is one way from the main-thread to the IO thread, Upon
read/write event the main thread will send a read/write job then in
before sleep it will iterate over the pending read/write clients to
checking for each client if the IO threads has already finished handling
it. The IO thread signals it has finished handling a client read/write
by toggling an atomic flag read_state / write_state on the client
struct.

**Thread Safety**

As suggested in this solution, the IO threads are reading from and
writing to the clients' buffers while the main thread may access those
clients.
We must ensure no race conditions or unsafe access occurs while keeping
the Valkey code simple and lock free.

Minimal Action in the IO Threads
The main change is to limit the IO thread operations to the bare
minimum. The IO thread will access only the client's struct and only the
necessary fields in this struct.
The IO threads will be responsible for the following:

* Read Operation: The IO thread will only read and parse a single
command. It will not update the server stats, handle read errors, or
parsing errors. These tasks will be taken care of by the main thread.
* Write Operation: The IO thread will only write the available data. It
will not free the client's replies, handle write errors, or update the
server statistics.


To achieve this without code duplication, the read/write code has been
refactored into smaller, independent components:

* Functions that perform only the read/parse/write calls.
* Functions that handle the read/parse/write results.

This refactor accounts for the majority of the modifications in this PR.

**Client Struct Safe Access**

As we ensure that the IO threads access memory only within the client
struct, we need to ensure thread safety only for the client's struct's
shared fields.

* Query Buffer 
* Command parsing - The main thread will not try to parse a command from
the query buffer when a client is offloaded to the IO thread.
* Client's memory checks in client-cron - The main thread will not
access the client query buffer if it is offloaded and will handle the
querybuf grow/shrink when the client is back.
* CLIENT LIST command - The main thread will busy-wait for the IO thread
to finish handling the client, falling back to the current behavior
where the main thread waits for the IO thread to finish their
processing.
* Output Buffer 
* The IO thread will not change the client's bufpos and won't free the
client's reply lists. These actions will be done by the main thread on
the client's return from the IO thread.
* bufpos / block→used: As the main thread may change the bufpos, the
reply-block→used, or add/delete blocks to the reply list while the IO
thread writes, we add two fields to the client struct: io_last_bufpos
and io_last_reply_block. The IO thread will write until the
io_last_bufpos, which was set by the main-thread before sending the
client to the IO thread. If more data has been added to the cob in
between, it will be written in the next write-job. In addition, the main
thread will not trim or merge reply blocks while the client is
offloaded.
* Parsing Fields 
    * Client's cmd, argc, argv, reqtype, etc., are set during parsing.
* The main thread will indicate to the IO thread not to parse a cmd if
the client is not reset. In this case, the IO thread will only read from
the network and won't attempt to parse a new command.
* The main thread won't access the c→cmd/c→argv in the CLIENT LIST
command as stated before it will busy wait for the IO threads.
* Client Flags 
* c→flags, which may be changed by the main thread in multiple places,
won't be accessed by the IO thread. Instead, the main thread will set
the c→io_flags with the information necessary for the IO thread to know
the client's state.
* Client Close 
* On freeClient, the main thread will busy wait for the IO thread to
finish processing the client's read/write before proceeding to free the
client.
* Client's Memory Limits 
* The IO thread won't handle the qb/cob limits. In case a client crosses
the qb limit, the IO thread will stop reading for it, letting the main
thread know that the client crossed the limit.

**TLS**

TLS is currently not supported with IO threads for the following
reasons:

1. Pending reads - If SSL has pending data that has already been read
from the socket, there is a risk of not calling the read handler again.
To handle this, a list is used to hold the pending clients. With IO
threads, multiple threads can access the list concurrently.
2. Event loop modification - Currently, the TLS code
registers/unregisters the file descriptor from the event loop depending
on the read/write results. With IO threads, multiple threads can modify
the event loop struct simultaneously.
3. The same client can be sent to 2 different threads concurrently
(https://github.com/redis/redis/issues/12540).

Those issues were handled in the current PR:

1. The IO thread only performs the read operation. The main thread will
check for pending reads after the client returns from the IO thread and
will be the only one to access the pending list.
2. The registering/unregistering of events will be similarly postponed
and handled by the main thread only.
3. Each client is being sent to the same dedicated thread (c→id %
num_of_threads).


**Sending Replies Immediately with IO threads.**

Currently, after processing a command, we add the client to the
pending_writes_list. Only after processing all the clients do we send
all the replies. Since the IO threads are now working asynchronously, we
can send the reply immediately after processing the client’s requests,
reducing the command latency. However, if we are using AOF=always, we
must wait for the AOF buffer to be written, in which case we revert to
the current behavior.

**IO threads dynamic adjustment**

Currently, we use an all-or-nothing approach when activating the IO
threads. The current logic is as follows: if the number of pending write
clients is greater than twice the number of threads (including the main
thread), we enable all threads; otherwise, we enable none. For example,
if 8 IO threads are defined, we enable all 8 threads if there are 16
pending clients; else, we enable none.
It makes more sense to enable partial activation of the IO threads. If
we have 10 pending clients, we will enable 5 threads, and so on. This
approach allows for a more granular and efficient allocation of
resources based on the current workload.

In addition, the user will now be able to change the number of I/O
threads at runtime. For example, when decreasing the number of threads
from 4 to 2, threads 3 and 4 will be closed after flushing their job
queues.

**Tests**

Currently, we run the io-threads tests with 4 IO threads
(443d80f168/.github/workflows/daily.yml (L353)).
This means that we will not activate the IO threads unless there are 8
(threads * 2) pending write clients per single loop, which is unlikely
to happened in most of tests, meaning the IO threads are not currently
being tested.

To enforce the main thread to always offload work to the IO threads,
regardless of the number of pending events, we add an
events-per-io-thread configuration with a default value of 2. When set
to 0, this configuration will force the main thread to always offload
work to the IO threads.

When we offload every single read/write operation to the IO threads, the
IO-threads are running with 100% CPU when running multiple tests
concurrently some tests fail as a result of larger than expected command
latencies. To address this issue, we have to add some after or wait_for
calls to some of the tests to ensure they pass with IO threads as well.

Signed-off-by: Uri Yagelnik <uriy@amazon.com>
2024-07-08 20:01:39 -07:00
jonghoonpark
c090874ed4
List test files dynamically (#313)
Motivation: Currently we have to manually update the all_tests variable
when introducing new test files.

Fix: I've modified it to list test files dynamically, but rather than
modify it to add all test files, I've first modified it to only add test
files from the following 4 paths so that it doesn't deviate too much
from what we already do

- unit
- unit/type
- unit/cluster
- integration

Related issue: https://github.com/valkey-io/valkey/issues/302

---------

Signed-off-by: jonghoonpark <dev@jonghoonpark.com>
2024-04-15 14:25:33 +02:00
Shivshankar
a054862b72
Rename redis_client* procedure to valkey_client* in test environment (#276)
Renamed redis-client* procedure to valkey_client*

Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
2024-04-10 10:18:47 -04:00
Shivshankar
da831c0d22
rename procedure redis_deferring_client to valkey_deferring_client (#270)
Updated procedure redis_deferring_client in test environent to
valkey_deferring_client.

Signed-off-by: Shivshankar-Reddy <shiva.sheri.github@gmail.com>
2024-04-09 10:38:09 -04:00
Jacob Murphy
df5db0627f
Remove trademarked language in code comments (#223)
This includes comments used for module API documentation.

* Strategy for replacement: Regex search: `(//|/\*| \*|#).* ("|\()?(r|R)edis( |\.
  |'|\n|,|-|\)|")(?!nor the names of its contributors)(?!Ltd.)(?!Labs)(?!Contributors.)`
* Don't edit copyright comments
* Replace "Redis version X.X" -> "Redis OSS version X.X" to distinguish
from newly licensed repository
* Replace "Redis Object" -> "Object"
* Exclude markdown for now
* Don't edit Lua scripting comments referring to redis.X API
* Replace "Redis Protocol" -> "RESP"
* Replace redis-benchmark, -cli, -server, -check-aof/rdb with "valkey-"
prefix
* Most other places, I use best judgement to either remove "Redis", or
replace with "the server" or "server"

Fixes #148

---------

Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-04-09 10:24:03 +02:00
John Vandenberg
253fe9dced
Fix typos and replace 'codespell' with 'typos' (#72)
Uses https://github.com/taiki-e/install-action to install
https://github.com/crate-ci/typos in CI

This finds many more/different typos than
https://github.com/codespell-project/codespell , while having very few
false positives.

Signed-off-by: John Vandenberg <jayvdb@gmail.com>
2024-03-31 12:38:22 -07:00
Yanqi Lv
bad33f8738
fix wrong data type conversion in zrangeResultBeginStore (#13148)
In `beginResultEmission`, -1 means the result length is not known in
advance. But after #12185, if we pass -1 to `zrangeResultBeginStore`, it
will convert to SIZE_MAX in `zsetTypeCreate` and try to `dictExpand`.
Although `dictExpand` won't succeed because the size overflows, I think
we'd better to avoid this wrong conversion.

This bug can be triggered when the source of `zrangestore` doesn't exist
or we use `zrangestore` command with `byscore` or `bylex`.
The impact is that dst keys will be converted to use skiplist instead of
listpack.
2024-03-19 08:52:55 +02:00
Ronen Kalish
a8e745117f
Xread last entry in stream (#7388) (#13117)
Allow using `+` as a special ID for last item in stream on XREAD
command.

This would allow to iterate on a stream with XREAD starting with the
last available message instead of the next one which `$` is used for.
I.e. the caller can use `BLOCK` and `+` on the first call, and change to
`$` on the next call.

Closes #7388

---------

Co-authored-by: Felipe Machado <462154+felipou@users.noreply.github.com>
2024-03-13 08:23:32 +02:00
Binbin
f17381a38d
Fix propagation of entries_read by calling streamPropagateGroupID unconditionally (#12898)
In XREADGROUP ACK, because streamPropagateXCLAIM does not propagate
entries-read, entries-read will be inconsistent between master and
replicas.
I.e. if no entries were claimed, it would have propagated correctly, but
if some
were claimed, then the entries-read field would be inconsistent on the
replica.

The fix was suggested by guybe7, call streamPropagateGroupID
unconditionally,
so that we will normalize entries_read on the replicas. In the past, we
would
only set propagate_last_id when NOACK was specified. And in #9127,
XCLAIM did
not propagate entries_read in ACK, which would cause entries_read to be
inconsistent between master and replicas.

Another approach is add another arg to XCLAIM and let it propagate
entries_read,
but we decided not to use it. Because we want minimal damage in case
there's an
old target and new source (in the worst case scenario, the new source
doesn't
recognize XGROUP SETID ... ENTRIES READ and the lag is lost. If we
change XCLAIM,
the damage is much more severe).

In this patch, now if the user uses XREADGROUP .. COUNT 1 there will be
an additional
overhead of MULTI, EXEC and XGROUPSETID. We assume the extra command in
case of
COUNT 1 (4x factor, changing from one XCLAIM to
MULTI+XCLAIM+XSETID+EXEC), is probably
ok since reading just one entry is in any case very inefficient (a
client round trip
per record), so we're hoping it's not a common case.

Issue was introduced in #9127.
2024-02-29 09:48:20 +02:00
debing.sun
165afc5f2a
Determine the large limit of the quicklist node based on fill (#12659)
Following #12568

In issue #9357, when inserting an element larger than 1GB, we currently
store it in a plain node instead of a listpack.
Presently, when we insert an element that exceeds the maximum size of a
packed node, it cannot be accommodated in any other nodes, thus ending
up isolated like a large element.
I.e. it's a node with only one element, but it's listpack encoded rather
than a plain buffer.

This PR lowers the threshold for considering an element as 'large' from
1GB to the maximum size of a node.
While this change doesn't completely resolve the bug mentioned in the
previous PR, it does mitigate its potential impact.

As a result of this change, we can now only use LSET to replace an
element with another element that falls below the maximum size
threshold.
In the worst-case scenario, with a fill of -5, the largest packed node
we can create is 2GB (32k * 64k):
* 32k: The smallest element in a listpack is 2 bytes, which allows us to
store up to 32k elements.
* 64k: This is the maximum size for a single quicklist node.

## Others
To fully fix #9357, we need more work, as discussed in #12568, when we
insert an element into a quicklistNode, it may be created in a new node,
put into another node, or merged, and we can't correctly delete the node
that was supposed to be deleted.
I'm not sure it's worth it, since it involves a lot of modifications.
2024-02-22 10:02:38 +02:00
Binbin
32f44da510
Increase tolerance range to block reprocess tests to avoid timing issues (#13053)
These tests have all failed in daily CI:
```
*** [err]: Blocking XREADGROUP for stream key that has clients blocked on stream - reprocessing command in tests/unit/type/stream-cgroups.tcl
Expected '1101' to be between to '1000' and '1100' (context: type eval line 23 cmd {assert_range [expr $end-$start] 1000 1100} proc ::test)

*** [err]: BLPOP unblock but the key is expired and then block again - reprocessing command in tests/unit/type/list.tcl
Expected '1101' to be between to '1000' and '1100' (context: type eval line 23 cmd {assert_range [expr $end-$start] 1000 1100} proc ::test)

*** [err]: BZPOPMIN unblock but the key is expired and then block again - reprocessing command in tests/unit/type/zset.tcl
Expected '1103' to be between to '1000' and '1100' (context: type eval line 23 cmd {assert_range [expr $end-$start] 1000 1100} proc ::test)
```

Increase the range to avoid failures, and improve the comment to be
clearer.
tests was introduced in #13004.
2024-02-15 10:44:49 +02:00
debing.sun
1e8dc1da0d
Fix crash due to merge of quicklist node introduced by #12955 (#13040)
Fix two crash introducted by #12955

When a quicklist node can't be inserted and split, we eventually merge
the current node with its neighboring
nodes after inserting, and compress the current node and its siblings.

1. When the current node is merged with another node, the current node
may become invalid and can no longer be used.

   Solution: let `_quicklistMergeNodes()` return the merged nodes.

3. If the current node is a LZF quicklist node, its recompress will be
1. If the split node can be merged with a sibling node to become head or
tail, recompress may cause the head and tail to be compressed, which is
not allowed.

    Solution: always recompress to 0 after merging.
2024-02-08 14:29:16 +02:00
debing.sun
1f00c951c2
Prevent LSET command from causing quicklist plain node size to exceed 4GB (#12955)
Fix #12864

The main reason for this crash is that when replacing a element of a
quicklist packed node with lpReplace() method,
if the final size is larger than 4GB, lpReplace() will fail and returns
NULL, causing `node->entry` to be incorrectly set to NULL.

Since the inserted data is not a large element, we can't just replace it
like a large element, first quicklistInsertAfter()
and then quicklistDelIndex(), because the current node may be merged and
invalidated in quicklistInsertAfter().

The solution of this PR:
When replacing a node fails (listpack exceeds 4GB), split the current
node, create a new node to put in the middle, and try to merge them.
This is the same as inserting a large element.
In the worst case, its size will not exceed 4GB.
2024-02-06 18:21:28 +02:00
Binbin
492021db95
Fix blocking commands timeout is reset due to re-processing command (#13004)
In #11012, we will reprocess command when client is unblocked on keys,
in some blocking commands, for example, in the XREADGROUP BLOCK
scenario,
because of the re-processing command, we will recalculate the block
timeout,
causing the blocking time to be reset.

This commit add a new CLIENT_REPROCESSING_COMMAND clent flag, explicitly
let the command know that it is being re-processed, later in
blockForKeys
we will not reset the timeout.

Affected BLOCK cases: 
- list / zset / stream, added test cases for each.

Unaffected cases:
- module (never re-process the commands).
- WAIT / WAITAOF (never re-process the commands).

Fixes #12998.
2024-01-30 11:32:59 +02:00
Roshan Khatri
5358bd7cdd
Reduce performance impact of dict rehashing and make it shorter. (#12899)
#### Problem Statement:
For any read/update operation during rehashing, we're doing ~10+ random
DRAM lookups to do the rehashing, as we are using the `rehashidx` to
rehash 10 buckets, whose dict entries most likely aren't cached in the
CPU or near the bucket we are operating on. If these random bucket are
empty, the rehashing process during that command execution is skipped.

#### Implementation:
For reducing the performance recession while dict is rehashing, we
determine the index at which the key would be stored in the 0th HT, we
check if that index has already been rehashed, if not we will rehash the
bucket containing the key and the bucket will be moved from 0th HT to
the 1st HT.

If the key has already been rehashed, we perform the random access
bucket rehash (using `rehashidx`) and we again verify if rehashing is
still ongoing and look up the key in the respective HT.

This ensures rehashing is not skipped in any command call and that we
rehash a particular bucket or random bucket in each call.

#### Changes in this PR:
- Added a new method `dictBucketRehash` to perform rehash on a single
bucket.
- Helper function `moveKeysInBucketOldtoNew` for `dictRehash` and
`dictBucketRehash` to move all the keys in a bucket from the old to the
new hash HT.
- Helper function `verifyMoreRehashRequired` for `dictRehash` and
`dictBucketRehash` to check if we have already rehashed the whole table
and if more rehashing is required.

### Benchmark:
- This PR still shows **~13%** improvement in the latency during
rehashing.

- Rehashing is now **~2%** faster for this PR when compared to unstable.

---------

Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
2024-01-27 11:11:53 +02:00
Wen Hui
685409139b
Add INCR type command against wrong argument test cases. (#12836)
We have test cases for incr related commands with no key exist and
spaces in key and wrong type of key. However, we dont have test cases
covered for INCRBY INCRBYFLOAT DECRBY INCR DECR HINCRBY HINCRBYFLOAT
ZINCRBY with valid key and invalid value as argument, and float value to
incrby and decrby. So added test cases for the scenarios in incr.tcl.

Thank you!
2024-01-23 15:39:38 +02:00
Yanqi Lv
e2b7932b34
Shrink dict when deleting dictEntry (#12850)
When we insert entries into dict, it may autonomously expand if needed.
However, when we delete entries from dict, it doesn't shrink to the
proper size. If there are few entries in a very large dict, it may cause
huge waste of memory and inefficiency when iterating.

The main keyspace dicts (keys and expires), are shrinked by cron
(`tryResizeHashTables` calls `htNeedsResize` and `dictResize`),
And some data structures such as zset and hash also do that (call
`htNeedsResize`) right after a loop of calls to `dictDelete`,
But many other dicts are completely missing that call (they can only
expand).

In this PR, we provide the ability to automatically shrink the dict when
deleting. The conditions triggering the shrinking is the same as
`htNeedsResize` used to have. i.e. we expand when we're over 100%
utilization, and shrink when we're below 10% utilization.

Additionally:
* Add `dictPauseAutoResize` so that flows that do mass deletions, will
only trigger shrinkage at the end.
* Rename `dictResize` to `dictShrinkToFit` (same logic as it used to
have, but better name describing it)
* Rename `_dictExpand` to `_dictResize` (same logic as it used to have,
but better name describing it)
 
related to discussion
https://github.com/redis/redis/pull/12819#discussion_r1409293878

---------

Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
2024-01-15 08:20:53 +02:00
Wen Hui
5dc631d880
Add missing test cases for hash commands (#12851)
We dont have test for hgetall against key doesnot exist so added the
test in test suite and along with this, added wrong type cases for other
missing commands.
2023-12-17 14:02:53 +02:00
Binbin
96e9dec419
Bump codespell from 2.2.4 to 2.2.5 (#12557)
and adjustments.
2023-09-08 16:10:17 +03:00
Chen Tianjie
b26e8e3213
Optimize ZRANGE offset location from linear search to skiplist jump. (#12450)
ZRANGE BYSCORE/BYLEX with [LIMIT offset count] option was
using every level in skiplist to jump to the first/last node in range,
but only use level[0] in skiplist to locate the node at offset, resulting
in sub-optimal performance using LIMIT:
```
while (ln && offset--) {
    if (reverse) {
        ln = ln->backward;
    } else {
        ln = ln->level[0].forward;
    }
}
```
It could be slow when offset is very big. We can get the total rank of
the offset location and use skiplist to jump to it. It is an improvement
from O(offset) to O(log rank).

Below shows how this is implemented (if the offset is positve):

Use the skiplist to seach for the first element in the range, record its
rank `rank_0`, so we can have the rank of the target node `rank_t`.
Meanwhile we record the last node we visited which has zsl->level-1
levels and its rank `rank_1`. Then we start from the zsl->level-1 node,
use skiplist to go forward `rank_t-rank_1` nodes to reach the target node.

It is very similiar when the offset is reversed.

Note that if `rank_t` is very close to `rank_0`, we just start from the first
element in range and go node by node, this for the case when zsl->level-1
node is to far away and it is quicker to reach the target node by node.

Here is a test using a random generated zset including 10000 elements
(with different positive scores), doing a bench mark which compares how
fast the `ZRANGE` command is exucuted before and after the optimization. 

The start score is set to 0 and the count is set to 1 to make sure that
most of the time is spent on locating the offset.
```
memtier_benchmark -h 127.0.0.1 -p 6379 --command="zrange test 0 +inf byscore limit <offset> 1"
```
| offset | QPS(unstable) | QPS(optimized) |
|--------|--------|--------|
| 10 | 73386.02 | 74819.82 |
| 1000 | 48084.96 | 73177.73 |
| 2000 | 31156.79 | 72805.83 |
| 5000 | 10954.83 | 71218.21 |

With the result above, we can see that the original code is greatly
slowed down when offset gets bigger, and with the optimization the
speed is almost not affected.

Similiar results are generated when testing reversed offset:
```
memtier_benchmark -h 127.0.0.1 -p 6379 --command="zrange test +inf 0 byscore rev limit <offset> 1"
```
| offset | QPS(unstable) | QPS(optimized) |
|--------|--------|--------|
| 10 | 74505.14 | 71653.67 |
| 1000 | 46829.25 | 72842.75 |
| 2000 | 28985.48 | 73669.01 |
| 5000 | 11066.22 | 73963.45 | 

And the same conclusion is drawn from the tests of ZRANGE BYLEX.
2023-08-31 14:42:08 +03:00
Harkrishn Patro
42985b00ea
Test coverage for incr/decr operation on robj encoding type optimization (#12435)
Additional test coverage for incr/decr operation.

integer number could be present in raw encoding format due to operation like append. A incr/decr operation following it optimize the string to int encoding format.
2023-07-25 16:43:31 -07:00
Harkrishn Patro
34b95f752c
Add test case for APPEND command usage on integer value (#12429)
Add test coverage to validate object encoding update on APPEND command usage on a integer value
2023-07-24 18:25:50 -07:00
Binbin
6b57241fa8
Revert zrangeGenericCommand negative offset check (#12377)
The negative offset check was added in #9052, we realized
that this is a non-mandatory breaking change and we would
like to add it only in 8.0.

This reverts PR #9052, will be re-introduced later in 8.0.
2023-07-02 20:38:30 +03:00
Binbin
d9c2ef8a72
zrangeGenericCommand add check for negative offset (#9052)
Now we will check the offset in zrangeGenericCommand.
With a negative offset, we will throw an error and return.

This also resolve the issue of zeroing the destination key
in case of the "store" variant when we input a negative offset.
```
127.0.0.1:6379> set key value
OK
127.0.0.1:6379> zrangestore key myzset 0 10 byscore limit -1 10
(integer) 0
127.0.0.1:6379> exists key
(integer) 0
```

This change affects the following commands:
- ZRANGE / ZRANGESTORE / ZRANGEBYLEX / ZRANGEBYSCORE
- ZREVRANGE / ZREVRANGEBYSCORE / ZREVRANGEBYLEX
2023-06-20 14:33:17 +03:00
Binbin
d306d86146
Fix ZRANK/ZREVRANK reply_schema description (#12331)
The parameter name is WITHSCORE instead of WITHSCORES.
2023-06-20 11:15:40 +03:00
sundb
b00a235186
Use Reservoir Sampling for random sampling of dict, and fix hang during fork (#12276)
## Issue:
When a dict has a long chain or the length of the chain is longer than
the number of samples, we will never be able to sample the elements
at the end of the chain using dictGetSomeKeys().
This could mean that SRANDMEMBER can be hang in and endless loop.
The most severe case, is the pathological case of when someone uses SCAN+DEL
or SSCAN+SREM creating an unevenly distributed dict.
This was amplified by the recent change in #11692 which prevented a
down-sizing rehashing while there is a fork.

## Solution
1. Before, we will stop sampling when we reach the maximum number
  of samples, even if there is more data after the current chain.
  Now when we reach the maximum we use the Reservoir Sampling
  algorithm to fairly sample the end of the chain that cannot be sampled
2. Fix the rehashing code, so that the same as it allows rehashing for up-sizing
  during fork when the ratio is extreme, it will allow it for down-sizing as well.

Issue was introduced (or became more severe) by #11692

Co-authored-by: Oran Agra <oran@redislabs.com>
2023-06-16 19:13:07 +03:00
Binbin
439b0315c8
Fix SPOP/RESTORE propagation when doing lazy free (#12320)
In SPOP, when COUNT is greater than or equal to set's size,
we will remove the set. In dbDelete, we will do DEL or UNLINK
according to the lazy flag. This is also required for propagate.

In RESTORE, we won't store expired keys into the db, see #7472.
When used together with REPLACE, it should emit a DEL or UNLINK
according to the lazy flag.

This PR also adds tests to cover the propagation. The RESTORE
test will also cover #7472.
2023-06-16 08:14:11 -07:00
Binbin
9dc6f93e9d
Add command being unblocked cause another command to get unblocked execution order test (#12324)
* Add command being unblocked cause another command to get unblocked execution order test

In #12301, we observed that if the
`while(listLength(server.ready_keys) != 0)`
in handleClientsBlockedOnKeys is changed to
`if(listLength(server.ready_keys) != 0)`,
the order of command execution will change.

It is wrong to change that. It means that if a command
being unblocked causes another command to get unblocked
(like a BLMOVE would do), then the new unblocked command
will wait for later to get processed rather than right away.

It'll not have any real implication if we change that since
we do call handleClientsBlockedOnKeys in beforeSleep again,
and redis will still behave correctly, but we don't change that.

An example:
1. $rd1 blmove src{t} dst{t} left right 0
2. $rd2 blmove dst{t} src{t} right left 0
3. $rd3 set key1{t}, $rd3 lpush src{t}, $rd3 set key2{t} in a pipeline

The correct order would be:
1. set key1{t}
2. lpush src{t}
3. lmove src{t} dst{t} left right
4. lmove dst{t} src{t} right left
5. set key2{t}

The wrong order would be:
1. set key1{t}
2. lpush src{t}
3. lmove src{t} dst{t} left right
4. set key2{t}
5. lmove dst{t} src{t} right left

This PR adds corresponding test to cover it.

* Add comment near while(listLength(server.ready_keys) != 0)
2023-06-16 15:39:00 +03:00
Wen Hui
1946083410
Removing the duplicate test case (#12310)
Looks like the Zadd test case was copied to create Zincrby test case ,but missed to change the command.
2023-06-14 10:03:33 +03:00
Binbin
e7129e43e0
Fix XREADGROUP BLOCK stuck in endless loop (#12301)
For the XREADGROUP BLOCK > scenario, there is an endless loop.
Due to #11012, it keep going, reprocess command -> blockForKeys -> reprocess command

The right fix is to avoid an endless loop in handleClientsBlockedOnKey and handleClientsBlockedOnKeys,
looks like there was some attempt in handleClientsBlockedOnKeys but maybe not sufficiently good,
and it looks like using a similar trick in handleClientsBlockedOnKey is complicated.
i.e. stashing the list on the stack and iterating on it after creating a fresh one for future use,
is problematic since the code keeps accessing the global list.

Co-authored-by: Oran Agra <oran@redislabs.com>
2023-06-13 13:27:05 +03:00
Oran Agra
2764dc3768
Optimize MSETNX to avoid double lookup (#11944)
This is a redo of #11594 which got reverted in #11940
It improves performance by avoiding double lookup of the the key.
2023-05-28 10:58:29 +03:00
Wen Hui
1a188e4ed6
[BUG] Incorrect error msg for XREAD command (#12238)
XREAD only supports a special ID of $ and XREADGROUP only supports ^.
make sure not to suggest the wrong one when rerunning an error about unbalanced ID arguments

Co-authored-by: Oran Agra <oran@redislabs.com>
2023-05-28 08:37:32 +03:00
Wen Hui
d664889992
Adding test case for hvals, hkeys, hexists against wrong type (#12198)
HVALS, HKEYS and HEXISTS commands wrong type test cases were not covered so added the test cases.
2023-05-24 09:34:13 +03:00
Binbin
48757934ff
Performance improvement to ZADD and ZRANGESTORE, convert to skiplist and expand dict in advance (#12185)
For zsets that will eventually be stored as the skiplist encoding (has a dict),
we can convert it to skiplist ahead of time. This change checks the number
of arguments in the ZADD command, and converts the data-structure
if the number of new entries exceeds the listpack-max-entries configuration.
This can cause us to over-allocate memory if there are duplicate entries in the
input, which is unexpected.

For ZRANGESTORE, we know the size of the zset, so we can expand
the dict in advance, to avoid the temporary dict from being rehashed
while it grows.

Simple benchmarks shows it provides some 4% improvement in ZADD and 20% in ZRANGESTORE
2023-05-18 15:24:46 +03:00
Binbin
fd566f4050
Fix for set max entries edge case in setTypeCreate / setTypeMaybeConvert (#12183)
In the judgment in setTypeCreate, we should judge size_hint <= max_entries.

This results in the following inconsistencies:
```
127.0.0.1:6379> config set set-max-intset-entries 5 set-max-listpack-entries 5
OK

127.0.0.1:6379> sadd intset_set1 1 2 3 4 5
(integer) 5
127.0.0.1:6379> object encoding intset_set1
"hashtable"
127.0.0.1:6379> sadd intset_set2 1 2 3 4
(integer) 4
127.0.0.1:6379> sadd intset_set2 5
(integer) 1
127.0.0.1:6379> object encoding intset_set2
"intset"

127.0.0.1:6379> sadd listpack_set1 a 1 2 3 4
(integer) 5
127.0.0.1:6379> object encoding listpack_set1
"hashtable"
127.0.0.1:6379> sadd listpack_set2 a 1 2 3
(integer) 4
127.0.0.1:6379> sadd listpack_set2 4
(integer) 1
127.0.0.1:6379> object encoding listpack_set2
"listpack"
```

This was introduced in #12019, added corresponding tests.
2023-05-16 11:32:21 -07:00
Wen Hui
e45272884e
Adding missing test case for smembers scard commands (#12148)
Minor missing test case addition.

SMEMBERS SCARD against non set
SMEMBERS SCARD against non existing key
2023-05-16 09:26:49 -07:00
JJ Lu
11cf5cbdcc
Fix bug: LPOS RANK LONG_ MIN causes overflow (#12167)
Limit the range of RANK to -LONG_ MAX ~ LONG_ MAX.
Without this limit, passing -9223372036854775808 would effectively
be the same as passing -1.
2023-05-14 09:04:33 +03:00
Wen Hui
42dd98ec19
adding missing test cases GET and GETEX (#12125)
adding test case of expired key or not exist for GET and GETEX.
for better test coverage.
2023-05-07 11:46:11 +03:00
Wen Hui
f32d1817e3
Updating missing test cases for Hash commands (#12116)
Adding missing test case against wrong type for HRANDFIELD HGET HGETALL HDEL HINCRBY HINCRBYFLOAT HSTRLEN.
2023-05-01 21:00:07 +03:00
Wen Hui
091412cf62
add test cases for decr decrby on missing key (#12070)
Minor test case addition for DECR and DECRBY.

Currently DECR and DECRBY do not have test case coverage for the
scenarios where they run on a non-existing key.
2023-04-19 09:55:56 +03:00
Wen Hui
d2db4aa753
Added getrange missing testcase (#12061)
Minor test case addition.
Currently GETRANGE command does not have the test case coverage for the scenarios:
An error is returned when key exists but of different type
Added missing test cases for getrange command.
2023-04-18 08:32:10 +03:00
Wen Hui
4375b01cc7
Adding missing test cases for substring (#12039)
There is are some missing test cases for SUBSTR command.
These might already be covered by GETRANGE, but no harm in adding them since they're simple.

Added 3 test case.

* start > stop
* start and stop both greater than string length
* when no key is present.
2023-04-13 21:48:26 +03:00
Wen Hui
bc82309ceb
Adding missing test cases for linsert command (#12040)
Currently LINSERT command does not have the test case coverage for following scenarios.
1. When key does not exist, it is considered an empty list and no operation is performed.
2. An error is returned when key exists but does not hold a list value.

Added above two missing test cases for linsert command.
2023-04-13 19:05:41 +03:00