12799 Commits

Author SHA1 Message Date
ranshid
ba25b586d5
Introduce FORCE_DEFRAG compilation option to allow activedefrag run when allocator is not jemalloc (#1303)
Introduce compile time option to force activedefrag to run even when
jemalloc is not used as the allocator.
This is in order to be able to run tests with defrag enabled
while using memory instrumentation tools.

fixes: https://github.com/valkey-io/valkey/issues/1241

---------

Signed-off-by: ranshid <ranshid@amazon.com>
Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-12-17 19:07:55 +02:00
xbasel
7892bf808b
Fix test_reclaimFilePageCache to avoid tmpfs (#1379)
Avoid tmpfs as fadvise(FADV_DONTNEED) has no effect on memory-backed
filesystems.

Fixes https://github.com/valkey-io/valkey/issues/897

---------

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
2024-12-17 18:04:27 +02:00
Roshan Khatri
980a801159
Fix the secrete for test bucket. (#1447)
We have set the secret as `AWS_S3_TEST_BUCKET` for test bucket and I
missed it in the initial review.

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
2024-12-16 13:01:34 -08:00
Binbin
e024b4bd27
Drop the MEET packet if the link node is in handshake state (#1436)
After #1307 got merged, we notice there is a assert happen in setClusterNodeToInboundClusterLink:
```
=== ASSERTION FAILED ===
==> '!link->node' is not true
```

In #778, we will call setClusterNodeToInboundClusterLink to attach the node to the link
during the MEET processing, so if we receive a another MEET packet in a short time, the
node is still in handshake state, we will meet this assert and crash the server.

If the link is bound to a node and the node is in the handshake state, and we receive
a MEET packet, it may be that the sender sent multiple MEET packets so in here we are
dropping the MEET to avoid the assert in setClusterNodeToInboundClusterLink. The assert
will happen if the other sends a MEET packet because it detects that there is no inbound
link, this node creates a new node in HANDSHAKE state (with a random node name), and
respond with a PONG. The other node receives the PONG and removes the CLUSTER_NODE_MEET
flag. This node is supposed to open an outbound connection to the other node in the next
cron cycle, but before this happens, the other node re-sends a MEET on the same link
because it still detects no inbound connection.

Note that in getNodeFromLinkAndMsg, the node in the handshake state has a random name
and not truly "known", so we don't know the sender. Dropping the MEET packet can prevent
us from creating a random node, avoid incorrect link binding, and avoid duplicate MEET
packet eliminate the handshake state.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-12-16 13:43:48 +08:00
Binbin
ad24220681
Automatic failover vote is not limited by two times the node timeout (#1356)
This is a follow of #1305, we now decided to apply the same change
to automatic failover as well, that is, move forward with removing
it for both automatic and manual failovers.

Quote from Ping during the review:
Note that we already debounce transient primary failures with node
timeout, ensuring failover is only triggered after sustained outages.
Election timing is naturally staggered by replica spacing, making the
likelihood of simultaneous elections from replicas of the same shard
very low. The one-vote-per-epoch rule further throttles retries and
ensures orderly elections. On top of that, quorum-based primary failure
confirmation, cluster-state convergence, and slot ownership validation
are all built into the process.

Quote from Madelyn during the review:
It against the specific primary. It's to prevent double failovers.
If a primary just took over we don't want someone else to try to
take over and give the new primary some amount of time to take over.
I have not seen this issue though, it might have been over optimizing?
The double failure mode, where a node fails and then another node fails
within the nodetimeout also doesn't seem that common either though.

So the conclusion is that we all agreed to remove it completely,
it will make the code a lot simpler. And if there is other specific
edge cases we are missing, we will fix it in other way.

See discussion #1305 for more information.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-12-15 12:09:53 +08:00
Rain Valentine
88942c8e61
Replace dict with new hashtable for sets datatype (#1176)
The new `hashtable` provides faster lookups and uses less memory than
`dict`.

A TCL test case "SRANDMEMBER with a dict containing long chain" is
deleted because it's covered by a hashtable unit test
"test_random_entry_with_long_chain", which is already present.

This change also moves some logic from dismissMemory (object.c) to
zmadvise_dontneed (zmalloc.c), so the hashtable implementation which
needs the dismiss functionality doesn't need to depend on object.c and
server.h.

This PR follows #1186.

---------

Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-12-14 20:53:48 +01:00
Madelyn Olson
0e96bb311e
Synchronously delete data during defrag tests (#1443)
The creation of fragmentation is delayed when we use lazy-free. You can
induce some of the active-defrag tests to fail by artificially adding a
delay in the lazyfree process, similar to the issues seen in #1433 and
issues like
https://github.com/valkey-io/valkey/actions/runs/12267010712/job/34226304803#step:7:6538.
The solution is to always do sync free during tests.

Might close https://github.com/valkey-io/valkey/issues/1433.

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2024-12-14 19:14:01 +01:00
Madelyn Olson
3cd176dc39
Avoid importing memory aligned malloc (#1442)
We deprecate the usage of classic malloc and free, but under certain
circumstances they might get imported from intrinsics. The original
thought is we should just override malloc and free to use zmalloc and
zfree, but I think we should continue to deprecate it to avoid
accidental imports of allocations.

Closes https://github.com/valkey-io/valkey/issues/1434.

---------

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2024-12-14 19:13:04 +01:00
Binbin
7d72fada2c
Fix wrong file name in build-release-packages.yml (#1437)
Introduced in #1363, the file name does not match.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-12-13 14:26:20 -08:00
Binbin
d588bb4406
Skip build-release-packages CI job in forks (#1438)
The CI job was introduced in #1363, we should skip it in forks.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-12-13 16:32:54 -05:00
Thalia Archibald
b60097ba07
Check length before reading in stringmatchlen (#1431)
Fixes four cases where `stringmatchlen` could overrun the pattern if it
is not terminated with NUL.

These commits are cherry-picked from my
[fork](https://github.com/thaliaarchi/antirez-stringmatch) which
extracts `stringmatch` as a library and compares it to other projects by
antirez which use the same matcher.

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
2024-12-13 11:05:19 +01:00
Jim Brunner
32f2c73cb5
defrag: eliminate persistent kvstore pointer and edge case fixes (#1430)
This update addresses several issues in defrag:
1. In the defrag redesign
(https://github.com/valkey-io/valkey/pull/1242), a bug was introduced
where `server.cronloops` was no longer being incremented in the
`whileBlockedCron()`. This resulted in some memory statistics not being
updated while blocked.
2. In the test case for AOF loading, we were seeing errors due to defrag
latencies. However, running the math, the latencies are justified given
the extremely high CPU target of the testcase. Adjusted the expected
latency check to allow longer latencies for this case where defrag is
undergoing starvation while AOF loading is in progress.
3. A "stage" is passed a "target". For the main dictionary and expires,
we were passing in a `kvstore*`. However, on flushall or swapdb, the
pointer may change. It's safer and more stable to use an index for the
DB (a DBID). Then if the pointer changes, we can detect the change, and
simply abort the stage. (If there's still fragmentation to deal with,
we'll pick it up again on the next cycle.)
4. We always start a new stage on a new defrag cycle. This gives the new
stage time to run, and prevents latency issues for certain stages which
don't operate incrementally. However, often several stages will require
almost no work, and this will leave a chunk of our CPU allotment unused.
This is mainly an issue in starvation situations (like AOF loading or
LUA script) - where defrag is running infrequently, with a large
duty-cycle. This change allows a new stage to be initiated if we still
have a standard duty-cycle remaining. (This can happen during starvation
situations where the planned duty cycle is larger than the standard
cycle. Most likely this isn't a concern for real scenarios, but it was
observed in testing.)
5. Minor comment correction in `server.h`

Signed-off-by: Jim Brunner <brunnerj@amazon.com>
2024-12-12 14:55:57 -08:00
Roshan Khatri
3a1043a4f0
Fix Valkey binary build workflow, version support changes. (#1429)
This change makes the binary build on the target ubuntu version.

This PR also deprecated ubuntu18 and valkey will not support:

- X86:
  - Ubuntu 20
  - Ubuntu 22
  - Ubuntu 24
 - ARM:
   - Ubuntu 20
   - Ubuntu 22
   
Removed ARM ubuntu 24 as the action we are using for ARM builds does not
support Ubuntu 24.

---------

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
2024-12-12 14:46:35 -08:00
Vu Diep
ab69a8a55d
Use configure-aws-credentials workflow instead of passing secret_access_key (#1363)
## Summary
This PR fixes #1346 where we can get rid of the long term credentials by
using OpenID Connect. OpenID Connect (OIDC) allows your GitHub Actions
workflows to access resources in Amazon Web Services (AWS), without
needing to store the AWS credentials as long-lived GitHub secrets.

---------

Signed-off-by: vudiep411 <vdiep@amazon.com>
2024-12-12 14:42:52 -08:00
ranshid
2d92404522
Avoid defragging scripts during EVAL command execution (#1414)
This can happen when scripts are running for long period of time and the server attempts to defrag it in the whileBlockedCron.

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
2024-12-12 13:52:58 -08:00
Pierre
5f7fe9ef21
Send MEET packet to node if there is no inbound link to fix inconsistency when handshake timedout (#1307)
In some cases, when meeting a new node, if the handshake times out, we
can end up with an inconsistent view of the cluster where the new node
knows about all the nodes in the cluster, but the cluster does not know
about this new node (or vice versa).
To detect this inconsistency, we now check if a node has an outbound
link but no inbound link, in this case it probably means this node does
not know us. In this case we (re-)send a MEET packet to this node to do
a new handshake with it.
If we receive a MEET packet from a known node, we disconnect the
outbound link to force a reconnect and sending of a PING packet so that
the other node recognizes the link as belonging to us. This prevents
cases where a node could send MEET packets in a loop because it thinks
the other node does not have an inbound link.

This fixes the bug described in #1251.

---------

Signed-off-by: Pierre Turin <pieturin@amazon.com>
2024-12-11 17:26:06 -08:00
Jim Brunner
0c8ad5cd34
defrag: allow defrag to start during AOF loading (#1420)
Addresses https://github.com/valkey-io/valkey/issues/1393

Changes:
* During AOF loading or long running script, this allows defrag to be
initiated.
* The AOF defrag test was corrected to eliminate the wait period and
rely on non-timer invocations.
* Logic for "overage" time in defrag was changed. It previously
accumulated underage leading to large latencies in extreme tests having
very high CPU percentage. After several simple stages were completed
during infrequent blocked processing, a large cycle time would be
experienced.

Signed-off-by: Jim Brunner <brunnerj@amazon.com>
2024-12-11 19:47:06 +02:00
Binbin
1acf7f71c0
Fix memory leak in the new hashtable unittest (#1421)
There is a leak in here, hashtableTwoPhasePopDelete won't call the entry
destructor and like hashtablePop we need to call it by myself.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-12-11 06:40:18 +01:00
Viktor Söderqvist
3eb8314be6 Replace dict with hashtable for keys, expires and pubsub channels
Instead of a dictEntry with pointers to key and value, the hashtable
has a pointer directly to the value (robj) which can hold an embedded
key and acts as a key-value in the hashtable. This minimizes the number
of pointers to follow and thus the number of memory accesses to lookup
a key-value pair.

        Keys         robj
      hashtable
      +-------+   +-----------------------+
      | 0     |   | type, encoding, LRU   |
      | 1 ------->| refcount, expire      |
      | 2     |   | ptr                   |
      | ...   |   | optional embedded key |
      +-------+   | optional embedded val |
                  +-----------------------+

The expire timestamp (TTL) is also stored in the robj, if any. The expire
hash table points to the same robj.

Overview of changes:

* Replace dict with hashtable in kvstore (kvstore.c)
* Add functions for embedding key and expire in robj (object.c)
  * When there's unused space, reserve an expire field to avoid realloting
    it later if expire is added.
  * Always reserve space for expire for large key names to avoid realloc
    if it's set later.
* Update db functions (db.c)
  * dbAdd, setKey and setExpire reallocate the object when embedding a key
  * setKey does not increment the reference counter, since it would require
    duplicating the object. This responsibility is moved to the caller.
* Remove logic for shared integer objects as values in the database. The keys
  are now embedded in the objects, so all objects in the database need to be
  unique. Thus, we can't use shared objects as values. Also delete test cases
  for shared integers.
* Adjust various commands to the changes mentioned above.
* Adjust defrag code
  * Improvement: Don't access the expires table before defrag has actually
    reallocated the object.
* Adjust test cases that were using hard-coded sizes for dict when realloc
  would happen, and some other adjustments in test cases.
* Adjust memory prefetch for new hash table implementation in IO-threading,
  using new `hashtableIncrementalFind` API
* Adjust offloading of free() to IO threads: Object free to be done in main
  thread while keeping obj->ptr offloading in IO-thread since the DB object is
  now allocated by the main-thread and not by the IO-thread as it used to be.
* Let expireIfNeeded take an optional value, to avoid looking up the expires
  table when possible.

---------

Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: uriyage <78144248+uriyage@users.noreply.github.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Uri Yagelnik <uriy@amazon.com>
2024-12-10 21:30:56 +01:00
Rain Valentine
4efff42f04 Replace dict with hashtable in command tables (#1065)
This changes the type of command tables from dict to hashtable. Command
table lookup takes ~3% of overall CPU time in benchmarks, so it is a
good candidate for optimization.

My initial SET benchmark comparison suggests that hashtable is about 4.5
times faster than dict and this replacement reduced overall CPU time by
2.79% 🥳

---------

Signed-off-by: Rain Valentine <rainval@amazon.com>
Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Rain Valentine <rainval@amazon.com>
2024-12-10 21:30:56 +01:00
Viktor Söderqvist
c8ee5c2c46 Hashtable implementation including unit tests
A cache-line aware hash table with a user-defined key-value entry type,
supporting incremental rehashing, scan, iterator, random sampling,
incremental lookup and more...

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-12-10 21:30:56 +01:00
Viktor Söderqvist
b4c2a1804a
Fix flaky init_test proc in maxmemory test suite (#1419)
The following error has been seen, but not reliably reproduced:

```
*** [err]: eviction due to output buffers of pubsub, client eviction: true in tests/unit/maxmemory.tcl
Expected '42' to be equal to '50' (context: type proc line 17 cmd {assert_equal [r dbsize] 50} proc ::init_test level 2)
```

The reason is probably that FLUSHDB is asynchronous and when we start
populating new keys, they are evicted because the background flush is
too slow. Changing this to FLUSHDB SYNC prevents this.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-12-10 20:52:06 +02:00
Binbin
7e564887b9
Set HIDDEN_CONFIG flag on events-per-io-thread (#1408)
events-per-io-thread is for testing purposes that allow us to force the
main thread to always offload the works to the IO threads, see
adjustIOThreadsByEventLoad for more details.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-12-11 00:37:18 +08:00
Viktor Söderqvist
2dfe25b408
Fix race in test "CLUSTER SLOT-STATS cpu-usec for blocking commands, unblocked on timeout" (#1416)
This fix changes the timeout for BLPOP in this test case from 1 second
to 0.5 seconds.

In the test case quoted below, the procedure
`wait_for_blocked_clients_count` waits for one second by default. If
BLPOP has 1 second timeout and the first
`wait_for_blocked_clients_count` finishes very fast, then the second
`wait_for_blocked_clients_count` can time out before the BLPOP has been
unblocked.

```TCL
    test "CLUSTER SLOT-STATS cpu-usec for blocking commands, unblocked on timeout." {
        # Blocking command with 1 second timeout.
        set rd [valkey_deferring_client]
        $rd BLPOP $key 1

        # Confirm that the client is blocked, then unblocked after 1 second timeout.
        wait_for_blocked_clients_count 1
        wait_for_blocked_clients_count 0
```

As seen in the definition of `wait_for_blocked_clients_count`, the total
time to wait is 1 second by default.

```TCL
proc wait_for_blocked_clients_count {count {maxtries 100} {delay 10} {idx 0}} {
    wait_for_condition $maxtries $delay  {
        [s $idx blocked_clients] == $count
    } else {
        fail "Timeout waiting for blocked clients"
    }
}
```

Fixes #1121

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-12-10 14:46:21 +01:00
Yanqi Lv
f951a1ca73
Add new flag in CLIENT LIST for import-source client (#1398)
- Add new flag "I" in `CLIENT LIST` for import-source client
- Add `DEBUG_CONFIG` for import-mode
- Allow import-source status to be turned off when import-mode is off

Fixes #1350 and
https://github.com/valkey-io/valkey/pull/1185#discussion_r1851049362.

---------

Signed-off-by: lvyanqi.lyq <lvyanqi.lyq@alibaba-inc.com>
Signed-off-by: Yanqi Lv <lvyanqi.lyq@alibaba-inc.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Binbin <binloveplay1314@qq.com>
2024-12-10 13:35:07 +01:00
Sarthak Aggarwal
9cfe1b3d81
Set Command with IFEQ Support (#1324)
This PR allows the Valkey users to perform conditional updates where the
SET command is completed if the given comparison-value matches the key’s
current value.

Syntax:

```
SET key value IFEQ comparison-value
```

Behavior:

If the values match, the SET completes as expected. If they do not
match, the command returns a (nil), except if the GET argument is also
given (see below).

Behavior with Additional Flags:

1. ```SET key value IFEQ comparison-value GET``` returns the existing
value, regardless of whether it matches comparison-value or not. The
conditional set operation is performed if the given comparison value
matches the existing value. To check if the SET succeeded, the caller
needs to check if the returned string matches the comparison-value.
2. ```SET key value IFEQ comparison-value XX``` is a syntax error.
3.  ```SET key value IFEQ comparison-value NX``` is a syntax error.

Closes: #1215

---------

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
2024-12-10 12:54:49 +01:00
Madelyn Olson
4f61034934
Update governance and maintainers file for Valkey committers (#1390)
We added two more committers, but according to our governance document
that makes them TSC members. As we discussed, for now we want to keep
the balance of corporate interests, so so updating the governance to
explicitly list TSC members compared to folks with just write
permissions.

Also adds the new new folks with commit permissions.

---------

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2024-12-09 12:28:17 -08:00
Binbin
1ba85d002a
Use binary representation in assert crash log, cleanup in crash log (#1410)
Change assert crash log to also use binary representation like 5bdd72bea77d4bb237441c9a671e80edcdc998ad.
And do not print the password in assert crash log like 56eef6fb5ab7a755485c19f358761954ca459472.

In addition, for 5bdd72bea77d4bb237441c9a671e80edcdc998ad, we will print '"argv"',
because originally the code would print a '', and sdscatrepr will add an extra "",
so now removing the extra '' here.

Extract the getArgvReprString method and clean up the code a bit.

Examples:
```
debug assert "\x00abc"

before:
client->argv[0] = "debug" (refcount: 1)
client->argv[1] = "assert" (refcount: 1)
client->argv[2] = "" (refcount: 1)

after:
client->argv[0] = "debug" (refcount: 1)
client->argv[1] = "assert" (refcount: 1)
client->argv[2] = "\x00abc" (refcount: 1)

debug panic "\x00abc"

before:
argc: '3'
argv[0]: '"debug"'
argv[1]: '"panic"'
argv[2]: '"\x00abc"'

after:
argc: 3
argv[0]: "debug"
argv[1]: "panic"
argv[2]: "\x00abc"
```

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-12-10 00:37:04 +08:00
ranshid
5be4ce6d27
Optimize ZRANK to avoid path comparisons (#1389)
ZRANK is a widly used command for workloads using sorted-sets. For
example, in leaderboards It enables query the specific rank of a player.
The way ZRANK is currently implemented is:

1. locate the element in the SortedSet hashtable.
2. take the score of the element and use it in order to locate the
element in the SkipList (when listpack encoding is not used)
3. During the SkipLis scan for the elemnt we keep the path and use it in
order to sum the span in each path node in order to calculate the elemnt
rank

One problem with this approach is that it involves multiple compare
operations in order to locate the element. Specifically string
comparison can be expensive since it will require access multiple memory
locations for the items the element string is compared against.
Perf analysis showed this can take up to 20% of the rank scan time. (TBD
- provide the perf results for example)

We can improve the rank search by taking advantage of the fact that the
element node in the skiplist is pointed by the hashtable value!
Our Skiplist implementation is using FatKeys, where each added node is
assigned a randomly chosen height. Say we keep a height record for every
skiplist element. In order to get an element rank we simply:

1. locate the element in the SortedSet hashtable.
2. we go directly to the node in the skiplist.
3. we jump to the full height of the node and take the span value.
4. we continue going foreward and always jump to the heighst point in
each node we get to, making sure to sum all the spans.
5. we take off the summed spans from the SkipList length and we now have
the specific node rank. :)

In order to test this method I created several benchmarks. All
benchmarks used the same seeds and the lists contained 1M elements.
Since a very important factor is the number of scores compared to the
number of elements (since small ratio means more string compares during
searches) each benchmark test used different number of scores (1, 10K,
100K, 1M)
some results:

**TPS**

Scores range | non-optimized | optimized | gain
-- | -- | -- | --
1 | 416042 | 605363 | 45.51%
10K | 359776 | 459200 | 27.63%
100K | 380387 | 459157 | 20.71%
1M | 416059 | 450853 | 8.36%

**Latency**

Scores range | non-optimized | optimized | gain
-- | -- | -- | --
1 | 1.191000 | 0.831000 | -30.23%
10K | 1.383000 | 1.095000 | -20.82%
100K | 1.311000 | 1.087000 | -17.09%
1M | 1.191000 | 1.119000 | -6.05%

###  Memory efficiency

adding another field to each skiplist node can cause degredation in
memory efficiency for large sortedsets. We use the fact that level 0
recorded span of ALL nodes can either be 1 or zero (for the last node).
So we use wrappers in order to get a node span and override the span for
level 0 to hold the node height.

---------

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
2024-12-09 15:48:46 +01:00
Binbin
924729eb16
Fix the election was reset wrongly before failover epoch was obtained (#1339)
After #1009, we will reset the election when we received
a claim with an equal or higher epoch since a node can win
an election in the past.

But we need to consider the time before the node actually
obtains the failover_auth_epoch. The failover_auth_epoch
default is 0, so before the node actually get the failover
epoch, we might wrongly reset the election.

This is probably harmless, but will produce misleading log
output and may delay election by a cron cycle or beforesleep.
Now we will only reset the election when a node is actually
obtains the failover epoch.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-12-09 16:19:02 +08:00
Roman Gershman
b09db3ef78
Fix typo in streams seen-time / active-time test (#1409)
This variable name is wrong, it causes the wrong variable to be asserted.

Signed-off-by: Roman Gershman <romange@gmail.com>
2024-12-09 16:01:43 +08:00
Guillaume Koenig
e8078b7315
Allow MEMORY MALLOC-STATS and MEMORY PURGE during loading phase (#1317)
- Enable investigation of memory issues during loading
- Previously, all memory commands were rejected with LOADING error
(except memory help)
- `MEMORY MALLOC-STATS` and `MEMORTY PURGE` are now allowed
as they don't depend on the dataset
- `MEMORY STATS` and `MEMORY USAGE KEY` remain disallowed

Fixes #1299

Signed-off-by: Guillaume Koenig <knggk@amazon.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
2024-12-08 20:30:07 +08:00
Binbin
176fafcaf7
Add a note to conf about the dangers of modifying dir at runtime (#887)
We've had security issues in the past with it, which is why
we marked it as PROTECTED. But, modifying during runtime
is also a dangerous action. For example, when child processes
are running, persistent temp files and log files may have
unexpected effects.

A scenario for modifying dir at runtime is to migrate a disk
failure, such as using disk-based replication to migrate a node,
writing nodes.conf to save the cluster configuration.

We decided to leave it as is and add a note in the conf
about the dangers of modifying dir at runtime.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-12-08 20:28:14 +08:00
Viktor Söderqvist
f20d629dbe
Fix sanitizer builds with clang (#1402)
By including <stdatomic.h> after the other includes in the unit test, we
can avoid redefining a macro which led to a build failure.

Fixes #1394

---------

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-12-07 10:26:31 +01:00
Viktor Söderqvist
a2fe6af457
Fix Module Update Args test when other modules are loaded (#1403)
Fixes #1400

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-12-07 10:25:40 +01:00
Caiyi Wu
6df376d68a
Fix coredump when use hellodict example module (#1395)
In the ValkeyModule_OnLoad method of the file hellodict.c, the parameter
keystep of ValkeyModule_CreateCommand should be 1. Otherwise, execute
command will coredump.

    MODULE LOAD /home/tiger/valkey/src/modules/hellodict.so
    COMMAND GETKEYS HELLODICT.SET key value

Signed-off-by: Codebells <1347103071@qq.com>
2024-12-05 20:01:38 +01:00
风去幽墨
6b3e1228cd
RDMA: Fix dead loop when transfer large data (20KB) (#1386)
Determine the status of the Client when attempting to read data. If
state=CLIENT_COMPLETED_IO, no read attempt is made and I/O operations on
the Client are rescheduled by the main thread.

> And 20474 Byte = PROTO_IOBUF_LEN(16KB) + SDS_HDR_VAR(16, s)(4090 Byte)

Fixes #1385

---------

Signed-off-by: fengquyoumo <1455117463@qq.com>
2024-12-05 18:26:56 +01:00
Wen Hui
71560a2a4a
Add API UpdateRuntimeArgs for updating the module arguments during runtime (#1041)
Before Redis OSS 7, if we load a module with some arguments during
runtime,
and run the command "config rewrite", the module information will not be
saved into the
config file.

Since Redis OSS 7 and Valkey 7.2, if we load a module with some
arguments during runtime,
the module information (path, arguments number, and arguments value) can
be saved into the config file after config rewrite command is called.
Thus, the module will be loaded automatically when the server startup
next time.

Following is one example:

bind 172.25.0.58
port 7000
protected-mode no
enable-module-command yes

Generated by CONFIG REWRITE
latency-tracking-info-percentiles 50 99 99.9
dir "/home/ubuntu/valkey"
save 3600 1 300 100 60 10000
user default on nopass sanitize-payload ~* &* +https://github.com/ALL
loadmodule tests/modules/datatype.so 10 20

However, there is one problem.
If developers write a module, and update the running arguments by
someway, the updated arguments can not be saved into the config file
even "config rewrite" is called.
The reason comes from the following function
rewriteConfigLoadmoduleOption (src/config.c)

void rewriteConfigLoadmoduleOption(struct rewriteConfigState *state) {
..........
struct ValkeyModule *module = dictGetVal(de);
line = sdsnew("loadmodule ");
line = sdscatsds(line, module->loadmod->path);
for (int i = 0; i < module->loadmod->argc; i++) {
line = sdscatlen(line, " ", 1);
line = sdscatsds(line, module->loadmod->argv[i]->ptr);
}
rewriteConfigRewriteLine(state, "loadmodule", line, 1);
.......
}

The function only save the initial arguments information
(module->loadmod) into the configfile.

After core members discuss, ref
https://github.com/valkey-io/valkey/issues/1177


We decide add the following API to implement this feature:

Original proposal:

int VM_UpdateRunTimeArgs(ValkeyModuleCtx *ctx, int index, char *value);


Updated proposal:

ValkeyModuleString **values VM_GetRuntimeArgs(ValkeyModuleCtx *ctx);
**int VM_UpdateRuntimeArgs(ValkeyModuleCtx *ctx, int argc,
ValkeyModuleString **values);



Why we do not recommend the following way: 


MODULE UNLOAD
Update module args in the conf file
MODULE LOAD

I think there are the following disadvantages:

1. Some modules can not be unloaded. Such as the example module
datatype.so, which is tests/modules/datatype.so
2. it is not atomic operation for MODULE UNLOAD + MODULE LOAD
3. sometimes, if we just run the module unload, the client business
could be interrupted

---------

Signed-off-by: hwware <wen.hui.ware@gmail.com>
2024-12-05 11:58:24 -05:00
Madelyn Olson
a401e3789d
Update code of conduct maintainers email address (#1391)
Updating code of conduct maintainer's email address

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2024-12-04 10:33:14 -08:00
zhenwei pi
105509cdad
Run RDMA builtin in CI workflow (#1380)
Since 4695d118dd (#1209), RDMA supports builtin.
And module connection type may be removed in future. So run a builtin
RDMA support for CI workflow.

RDMA module is complied only in CI, keep it building check only until
module connection type gets obsolete.

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
2024-12-03 23:09:56 +01:00
Jim Brunner
349bc7547b
defrag: use monotime in module interface (#1388)
The recent PR (https://github.com/valkey-io/valkey/pull/1242) converted
Active Defrag to use `monotime`. In that change, a conversion was
performed to continue to use `ustime()` as part of the module interface.
Since this time is only used internally, and never actually exposed to
the module, we can convert this to use `monotime` directly.

Signed-off-by: Jim Brunner <brunnerj@amazon.com>
2024-12-03 11:19:53 -08:00
uriyage
9f8b174c2e
Optimize IO thread offload for modified argv (#1360)
### Improve expired commands performance with IO threads

#### Background
In our IO threads architecture, IO threads allocate client argv's and
later when we free it after processCommand we offload its free to the IO
threads.
With jemalloc, it's crucial that the same thread that allocates memory
also frees it.

For some commands we modify the client's argv in the main thread during
command processing (for example in `SET EX` command we rewrite the
command to use absolute time for replication propagation).

#### Current issues
1. When commands are rewritten (e.g., expire commands), we store the
original argv
   in `c->original_argv`. However, we're currently:
   - Freeing new argv (allocated by main thread) in IO threads
   - Freeing original argv (allocated by IO threads) in main thread
2. Currently, `c->original_argv` points to new array with old 
objects, while `c->argv` has old array with new objects, making memory
free management complicated.

#### Changes
1. Refactored argv modification handling code to ensure consistency -
both array and objects are now either all new or all old
2. Moved original_argv cleanup to happen in resetClient after argv
cleanup
3. Modified IO threads code to properly handle original argv cleanup
when argv are modified.

#### Performance Impact
Benchmark with `SET EX` commands (650 clients, 512 byte value, 8 IO
threads):
- New implementation: **729,548 ops/sec**
- Old implementation: **633,243 ops/sec**
Representing a **~15%** performance improvement due to more efficient
memory handling.

---------

Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>
2024-12-03 19:20:31 +02:00
Jim Brunner
397201c48f
Refactor of ActiveDefrag to reduce latencies (#1242)
Refer to:  https://github.com/valkey-io/valkey/issues/1141

This update refactors the defrag code to:
* Make the overall code more readable and maintainable
* Reduce latencies incurred during defrag processing

With this update, the defrag cycle time is reduced to 500us, with more
frequent cycles. This results in much more predictable latencies, with a
dramatic reduction in tail latencies.

(See https://github.com/valkey-io/valkey/issues/1141 for more complete
details.)

This update is focused mostly on the high-level processing, and does NOT
address lower level functions which aren't currently timebound (e.g.
`activeDefragSdsDict()`, and `moduleDefragGlobals()`). These are out of
scope for this update and left for a future update.

I fixed `kvstoreDictLUTDefrag` because it was using up to 7ms on a CME
single shard. See original github issue for performance details.

---------

Signed-off-by: Jim Brunner <brunnerj@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-12-03 08:42:29 -08:00
Nugine
3df609ef06
Optimize PFCOUNT, PFMERGE command by SIMD acceleration (#1293)
This PR optimizes the performance of HyperLogLog commands (PFCOUNT,
PFMERGE) by adding AVX2 fast paths.

Two AVX2 functions are added for conversion between raw representation
and dense representation. They are 15 ~ 30 times faster than scalar
implementaion. Note that sparse representation is not accelerated.

AVX2 fast paths are enabled when the CPU supports AVX2 (checked at
runtime) and the hyperloglog configuration is default (HLL_REGISTERS ==
16384 && HLL_BITS == 6).

`PFDEBUG SIMD (ON|OFF)` subcommand is added for unit tests. A new TCL
unit test checks that the results produced by non-AVX2 and AVX2
implementations are exactly equal.

When merging 3 dense hll structures, the benchmark shows a 12x speedup
compared to the scalar version.

```
pfcount key1 key2 key3
pfmerge keyall key1 key2 key3
```

```
======================================================================================================
Type             Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
------------------------------------------------------------------------------------------------------
PFCOUNT-scalar    5665.56        35.29839        32.25500        63.99900        67.58300       608.60
PFCOUNT-avx2     72377.83         2.75834         2.67100         5.34300         6.81500      7774.96
------------------------------------------------------------------------------------------------------
PFMERGE-scalar    9851.29        20.28806        20.09500        36.86300        39.16700       615.71
PFMERGE-avx2    125621.89         1.59126         1.55100         3.11900         4.70300     15702.74
------------------------------------------------------------------------------------------------------

scalar: valkey:unstable  2df56d87c0ebe802f38e8922bb2ea1e4ca9cfa76
avx2:   Nugine:hll-simd  8f9adc34021080d96e60bd0abe06b043f3ed0275

CPU:    13th Gen Intel® Core™ i9-13900H × 20
Memory: 32.0 GiB
OS:     Ubuntu 22.04.5 LTS
```

Experiment repo: https://github.com/Nugine/redis-hyperloglog
Benchmark script:
https://github.com/Nugine/redis-hyperloglog/blob/main/scripts/memtier.sh
Algorithm:
https://github.com/Nugine/redis-hyperloglog/blob/main/cpp/bench.cpp

---------

Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>
2024-12-02 19:40:38 +01:00
Binbin
fbbfe5d3d3
Print logs when the cluster state changes to fail or the fail reason changes (#1188)
This log allows us to easily distinguish between full coverage and
minority partition when the cluster fails. Sometimes it is not easy
to see the minority partition in a healthy shards (both primary and
replicas).

And we decided not to add a cluster_fail_reason field to cluster info.
Given that there are only two reasons and both are well-known and if
we ended up adding more down the road we can add it in the furture.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-12-02 15:55:24 +08:00
Vadym Khoptynets
90475af594
Free strings during BGSAVE/BGAOFRW to reduce copy-on-write (#905)
**Motivation**

Copy-on-write (COW) amplification refers to the issue where writing to a
small object leads to the entire page being cloned, resulting in
inefficient memory usage. This issue arises during the BGSAVE process,
which can be particularly problematic on instances with limited memory.
If the BGSAVE process could release unneeded memory, it could reduce
memory consumption. To address this, the BGSAVE process calls the
`madvise` function to signal the operating system to reclaim the buffer.
However, this approach does not work for buffers smaller than a page
(usually 4KiB). Even after multiple such calls, where a full page may be
free, the operating system will not reclaim it.
To solve this issue, we can call `zfree` directly. This allows the
allocator (jemalloc) to handle the bookkeeping and release pages when
buffers are no longer needed. This approach reduces copy-on-write
events.

**Benchmarks**
To understand how usage of `zfree` affects BGSAVE and the memory
consumption I ran 45 benchmarks that compares my clonewith the vanilla
version. The benchmark has the following steps:
1. Start a new Valkey process
2. Fill the DB with data sequentially
3. Run a warmup to randomize the memory layout
4. Introduce fragmentation by deleting part of the keys
5. In parallel:
    1. Trigger BGSAVE
    2. Start 80/20 get/set load

I played the following parameters to understand their influence:

1. Number of keys: 3M, 6M, and 12M.
2. Data size. While key themselves are of fixed length ~30 bytes, the
value size is 120, 250, 500, 1000, and 2000 bytes.
3. Fragmentation. I delete 5%, 10%, and 15% of the original key range.

I'm attaching a graph of BGSAVE process memory consumption. Instead of
all benchmarks, I show the most representative runs IMO.

<img width="1570" alt="3m-fixed"
src="https://github.com/user-attachments/assets/3dbbc528-01c1-4821-a3c2-6be455e7f78a">


For 2000 bytes values peak memory usage is ~53% compared to vanilla. The
peak happens at 57% BGSAVE progress.
For 500 bytes values the peak is ~80% compared to vanilla. And happens
at ~80% progress.
For 120 bytes the difference is under 5%, and the patched version could
even use more memory.



![500b-fixed](https://github.com/user-attachments/assets/b09451d3-4bce-4f33-b3db-2b5df2178ed2)


For 12M keys, the peak is ~85% of the vanilla’s. Happens at ~70% mark.
For 6M keys, the peak is ~87% of the vanilla’s. Happens at ~77% mark.
For 3M keys, the peak is ~87% of the vanilla’s Happens at ~80% mark.

**Changes**

The PR contains 2 changes:
1. Static buffer for RDB comrpession.
RDB compression leads to COW events even without any write load if we
use `zfree`. It happens because the compression functions allocates a
new buffer for each object. Together with freeing objects with `zfree`
it leads to reusing of the memory shared with the main process.
To deal with this problem, we use a pre-allocated constant 8K buffer for
compression. If the object size is too big for this buffer, than we fall
back to the ad hoc allocation behavior.

2. Freeing string objects instead of dismissing them
Call to `zfree` is more expensive than direct call to `madvise`. But
with #453 strings use the fast path – `zfree_with_size`. As a possible
next step we can optimize `zfree` for other data types as well.

---------

Signed-off-by: Vadym Khoptynets <vadymkh@amazon.com>
Signed-off-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-12-01 17:12:27 +02:00
Amit Nagler
7043ef0bbb
Split dual-channel COB overrun tests to separate servers (#1374)
1. The test isn't waiting long enough for the output buffer to overrun.
This problem is happening because an error from the previous test is
bleeding into the current test's logs. The simplest fix would be to
split these tests.
2. Increased replication timeout to ensure sync fails due to output
buffer overrun before a timeout occurs.

Fixes #1367

Signed-off-by: naglera <anagler123@gmail.com>
2024-12-01 21:33:43 +08:00
Binbin
9c48f56790
Reset repl_down_since to zero only on state change (#1149)
We should reset repl_down_since only on state change, in the
current code, if the rdb channel in the dual channel is normal,
that is, rdb is loaded normally, but the psync channel is
abnormal, we will set repl_down_since 0 here. If the primary
is down at this time, the replica may be abnormal when calculating
data_age in cluster failover, since repl_state != REPL_STATE_CONNECTED,
this causes the replica to be unable to initiate an election due
to the old data_age.

In dualChannelSyncHandleRdbLoadCompletion, if the psync channel
is not established, the function will return. We will set repl_state
to REPL_STATE_CONNECTED and set repl_down_since to 0 in
dualChannelSyncSuccess, that is, in establishPrimaryConnection.

See also 677d10b2a8ff7f13033ccfe56ffcd246dbe70fb6 for more details.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2024-12-01 21:33:21 +08:00
Stav Ben-Tov
c8ceb2ee25
Use zfree_with_size for client buffer (#1376)
Replace occurrences of 'zfree' with 'zfree_with_size' to improve
performance.
'zfree_with_size' function avoids calling 'zmalloc_size' to retrieve
buffer size and
uses previuos calculation of size for calling 'zfree_with_size'. This
results in faster
memory deallocation and reduces overhead.

Signed-off-by: stav bentov <stavbt@amazon.com>
Co-authored-by: stav bentov <stavbt@amazon.com>
2024-12-01 12:24:18 +01:00
zhenwei pi
4695d118dd
RDMA builtin support (#1209)
There are several patches in this PR:

* Abstract set/rewrite config bind option: `bind` option is a special
config, `socket` and `tls` are using the same one. However RDMA uses the
similar style but different one. Use a bit abstract work to make it
flexible for both `socket` and `RDMA`. (Even for QUIC in the future.)
* Introduce closeListener for connection type: closing socket by a
simple syscall would be fine, RDMA has complex logic. Introduce
connection type specific close listener method.
* RDMA: Use valkey.conf style instead of module parameters: use
`--rdma-bind` and `--rdma-port` style instead of module parameters. The
module style config `rdma.bind` and `rdma.port` are removed.
* RDMA: Support builtin: support `make BUILD_RDMA=yes`. module style is
still kept for now.

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
2024-11-29 11:13:34 +01:00