85 Commits

Author SHA1 Message Date
yoav-steinberg
843a4cdc07
Add warning for suspected slow system clocksource setting (#10636)
This PR does 2 main things:
1) Add warning for suspected slow system clocksource setting. This is Linux specific.
2) Add a `--check-system` argument to redis which runs all system checks and prints a report.

## System checks
Add a command line option `--check-system` which runs all known system checks and provides
a report to stdout of which system checks have failed with details on how to reconfigure the
system for optimized redis performance.
The `--check-system` mode exits with an appropriate error code after running all the checks.
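
A minimal sketch of how such a report runner could look. It assumes the 1 / -1 / 0 return convention listed under "Other changes in the PR"; the type names, the three example checks, and the report wording are hypothetical and are not taken from _syscheck.c_.

```c
#include <stddef.h>
#include <stdio.h>

/* Return convention from this commit message:
 * 1 = check passed, -1 = check failed, 0 = check could not be completed. */
typedef int (*check_fn)(const char **error_msg);

typedef struct {
    const char *name;   /* label printed in the report */
    check_fn fn;        /* the check itself */
} sys_check;

/* Hypothetical checks, used only for illustration. */
static int check_always_ok(const char **error_msg) {
    (void)error_msg;
    return 1;
}

static int check_always_fails(const char **error_msg) {
    *error_msg = "example failure: reconfigure the system";
    return -1;
}

static int check_skipped(const char **error_msg) {
    (void)error_msg;
    return 0;
}

/* Runs every check, prints one report line per check to `out`, and returns
 * the number of failed checks (a possible basis for the process exit code). */
static int run_checks(const sys_check *checks, size_t n, FILE *out) {
    int failed = 0;
    for (size_t i = 0; i < n; i++) {
        const char *msg = NULL;
        int rc = checks[i].fn(&msg);
        if (rc == 1) {
            fprintf(out, "[%s]...OK\n", checks[i].name);
        } else if (rc == 0) {
            fprintf(out, "[%s]...skipped\n", checks[i].name);
        } else {
            fprintf(out, "[%s]...WARNING: %s\n", checks[i].name,
                    msg ? msg : "unknown");
            failed++;
        }
    }
    return failed;
}
```

A caller could turn a non-zero return value of `run_checks()` into a non-zero exit status.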

## Slow clocksource details
We check the system's clocksource performance by running `clock_gettime()` in a loop and then
checking how much time was spent in a system call (via `getrusage()`). If we spend more than
10% of the time in the kernel then we print a warning. I verified that using the slow clock sources:
`acpi_pm` (~90% in the kernel on my laptop) and `xen` (~30% in the kernel on an ec2 `m4.large`)
we get this warning.

The check runs 5 system ticks so we can detect time spent in kernel at 20% jumps (0%,20%,40%...).
Anything more accurate will require the test to run longer. Typically 5 ticks are 50ms. This means
running the test on startup will delay startup by 50ms. To avoid this we make sure the test is only
executed in the `--check-system` mode.

For a quick startup check, we specifically warn if we see the system is using the `xen` clocksource,
which we know has bad performance and isn't recommended (at least on ec2). In such a case the
user should manually run redis with `--check-system` to force the slower clocksource test described
above.

## Other changes in the PR

* All the system checks are now implemented as functions in _syscheck.c_.
  They are implemented using a standard interface (see details in _syscheck.c_).
  To do this I moved the checking functions `linuxOvercommitMemoryValue()`,
  `THPIsEnabled()`, `linuxMadvFreeForkBugCheck()` out of _server.c_ and _latency.c_
  and into the new _syscheck.c_. When moving these functions I made sure they don't
  depend on other functionality provided in _server.c_ and made them use a standard
  "check functions" interface. Specifically:
  * I removed all logging out of `linuxMadvFreeForkBugCheck()`. In case there's some
    unexpected error during the check, it aborts as before, but without any logging.
    It returns an error code 0 meaning the check did not complete.
  * All these functions now return 1 on success, -1 on failure, 0 in case the check itself
    cannot be completed.
  * The `linuxMadvFreeForkBugCheck()` function now internally calls `exit()` and not
    `exitFromChild()` because the latter is only available in _server.c_ and I wanted to
    remove that dependency. This isn't an issue because we don't need to worry about the
    child process created by the test doing anything related to the rdb/aof files which
    is why `exitFromChild()` was created.

* This also fixes parsing of other /proc/\<pid\>/stat fields to correctly handle spaces
  in the process name and be more robust in general. Note that before this fix the rss
  info in `INFO memory` was corrupt in case of spaces in the process name. To
  recreate just rename `redis-server` to `redis server`, start it, and run `INFO memory`.
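
One way to make the `/proc/<pid>/stat` parsing robust against spaces in the process name is to skip past the last `)` before splitting the rest of the line. The sketch below illustrates that idea only; the function name and the error convention are hypothetical, and it is not the code from this PR.

```c
#include <stdlib.h>
#include <string.h>

/* Returns the 1-based `field` (field >= 3) of a /proc/<pid>/stat line as a
 * long, or -1 on malformed input or a non-numeric field. Field 2 (the process
 * name) is wrapped in parentheses and may itself contain spaces or ')', so we
 * skip past the LAST ')' before splitting the remainder on spaces.
 * Note: -1 doubles as the error value, so this sketch is only suitable for
 * fields that cannot be negative, such as rss (field 24, counted in pages). */
long proc_stat_field(const char *line, int field) {
    if (field < 3) return -1;
    const char *p = strrchr(line, ')');
    if (p == NULL) return -1;
    p++;                                     /* just after the closing paren */
    for (int cur = 3; ; cur++) {
        while (*p == ' ') p++;               /* skip separators */
        if (*p == '\0') return -1;           /* ran out of fields */
        if (cur == field) {
            char *end;
            long v = strtol(p, &end, 10);
            return (end == p) ? -1 : v;
        }
        while (*p != ' ' && *p != '\0') p++; /* skip this field */
    }
}
```

For a process renamed to `redis server`, field 24 is still found correctly because the space inside the parentheses is never treated as a separator.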
2022-05-22 17:10:31 +03:00
David CARLIER
bdcd4b3df8
zmalloc_get_rss implementation for haiku. (#10687)
also fixing already defined constants build warning while at it.

Co-authored-by: Oran Agra <oran@redislabs.com>
2022-05-08 15:12:17 +03:00
David CARLIER
834fa5870c
zmalloc_get_rss openbsd implementation (#10149)
Add support for getting the RSS in OpenBSD
2022-01-19 20:56:12 +02:00
David CARLIER
50fa627b90
zmalloc_get_rss netbsd impl fix proposal. (#10116)
Seems like the previous implementation was broken (always returning 0)

since kinfo_proc2 is used the KERN_PROC2 sysctl oid is more appropriate
and also the query's length was not necessarily accurate (6 here).
2022-01-16 10:03:09 +02:00
filipe oliveira
5dd15443ac
Added INFO LATENCYSTATS section: latency by percentile distribution/latency by cumulative distribution of latencies (#9462)
# Short description

The Redis extended latency stats track per-command latencies and enable:
- exporting the per-command percentile distribution via the `INFO LATENCYSTATS` command.
  **(percentile distribution is not mergeable between cluster nodes).**
- exporting the per-command cumulative latency distributions via the `LATENCY HISTOGRAM` command.
  Using the cumulative distribution of latencies we can merge several stats from different cluster nodes
  to calculate aggregate metrics.

By default, the extended latency monitoring is enabled since the overhead of keeping track of the
command latency is very small.
 
If you don't want to track extended latency metrics, you can easily disable it at runtime using the command:
 - `CONFIG SET latency-tracking no`

By default, the exported latency percentiles are the p50, p99, and p999.
You can alter them at runtime using the command:
- `CONFIG SET latency-tracking-info-percentiles "0.0 50.0 100.0"`


## Some details:
- The total size per histogram should sit around 40 KiB. We only allocate those 40KiB when a command
  was called for the first time.
- With regards to the write overhead, there is no measurable overhead on the achievable
  ops/sec or on the full latency spectrum on the client, based on redis-benchmark measurements
  for unstable vs this branch.
- We track from 1 nanosecond to 1 second ( everything above 1 second is considered +Inf )

## `INFO LATENCYSTATS` exposition format

   - Format: `latency_percentiles_usec_<CMDNAME>:p0=XX,p50....` 

## `LATENCY HISTOGRAM [command ...]` exposition format

Return a cumulative distribution of latencies in the format of a histogram for the specified command names.

The histogram is composed of a map of time buckets:
- Each representing a latency range, between 1 nanosecond and roughly 1 second.
- Each bucket covers twice the previous bucket's range.
- Empty buckets are not printed.
- Everything above 1 sec is considered +Inf.
- At most there will be log2(1000000000)≈30 buckets

We reply with a map for each command in the format:
`<command name> : { calls: <total command calls> , histogram : { <bucket 1> : latency , <bucket 2> : latency, ... } }`
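
The bucket layout described above can be sketched as follows. This only illustrates power-of-two buckets and cumulative counts; it makes no claim about how Redis stores the histogram internally, and the function names are hypothetical.

```c
#include <stdint.h>

#define MAX_LATENCY_NS 1000000000ULL  /* 1 second; anything above is +Inf */

/* Returns the index i of the bucket whose upper bound 2^i is the smallest
 * power of two >= latency_ns, or -1 for the +Inf bucket. */
int latency_bucket(uint64_t latency_ns) {
    if (latency_ns > MAX_LATENCY_NS) return -1;
    int i = 0;
    uint64_t upper = 1;
    while (upper < latency_ns) {
        upper <<= 1;
        i++;
    }
    return i;
}

/* Turns per-bucket counts into a cumulative distribution in place, so that
 * counts[i] becomes the number of samples with latency <= 2^i ns. */
void make_cumulative(uint64_t *counts, int n) {
    for (int i = 1; i < n; i++) counts[i] += counts[i - 1];
}
```

In this sketch a latency of exactly 1 second lands in bucket 30, since 2^30 is the first power of two above 10^9. Cumulative counts from several nodes can be merged by adding them bucket by bucket, which is what makes this representation mergeable while percentiles are not.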

Co-authored-by: Oran Agra <oran@redislabs.com>
2022-01-05 14:01:05 +02:00
sundb
e725d737fb
Add --large-memory flag for REDIS_TEST to enable tests that consume more than 100mb (#9784)
This is a preparation step in order to add a new test in quicklist.c see #9776
2021-11-16 08:55:10 +02:00
DarrenJiang13
8ab33c18e4
fix a compilation error around madvise when make with jemalloc on MacOS (#9350)
We only use MADV_DONTNEED on Linux, that's where it was tested.
2021-08-10 11:32:27 +03:00
Wang Yuan
d4bca53cd9
Use madvise(MADV_DONTNEED) to release memory to reduce COW (#8974)
## Background
As we know, after `fork`, a process copies pages when it writes data to them (CoW),
while the other process still keeps the old pages, so together they cost more memory.
For redis, we have seen redis consume a lot of memory while the fork child is serializing
key/values, which may even cause OOM.

But in the redis fork child, the child process doesn't need to keep some of the
memory that the parent process may write or update. For example, the child process will never
access a key-value again once it is serialized, but users may update it in the parent process.
So we think COW can be reduced if the child process releases memory that it no longer needs.

## Implementation
To release a key-value in the child process, we might think of calling `decrRefCount` to free the memory,
but I found that the fork child process still uses a lot of memory even when we don't write any data to redis,
and that it costs much more time, which slows down bgsave. This may be because the memory allocator doesn't
really release memory to the OS, and it may modify some internal data for the free operation, especially
when we free small objects.

Moreover, CoW works on pages, so a simple approach is to free only memory blocks that are
not smaller than the kernel page size. madvise(MADV_DONTNEED) can quickly release the specified region's
pages to the OS, bypassing the memory allocator; the allocator still considers this memory in use
and doesn't change its internal data.

There are some buffers we can release in the fork child process:
- **Serialized key-values**
  The fork child process never accesses serialized key-values again, so we try to free them.
  Because we can only release large blocks of memory, and it is time consuming to iterate over all
  items/members/fields/entries of a complex data type, we decide to iterate over them and
  try to release them only when the average size of an item/member/field/entry is more
  than the OS page size.
- **Replication backlog**
  Because the replication backlog is a circular buffer, it changes quickly if redis has heavy
  write traffic, but the fork child process doesn't need to access it.
- **Client buffers**
  If clients send requests while the fork child process exists, the clients' buffers also change
  frequently. This memory includes the client query buffer, the output buffer, and the memory used by the client struct.

To get child process peak private dirty memory, we need to count peak memory instead
of last used memory, because the child process may continue to release memory (since
COW used to only grow till now, the last was equivalent to the peak).
Also we're adding a new `current_cow_peak` info variable (to complement the existing
`current_cow_size`)

Co-authored-by: Oran Agra <oran@redislabs.com>
2021-08-04 23:01:46 +03:00
Yossi Gottlieb
c3df27d1ea
Fix slowdown due to child reporting CoW. (#8645)
Reading CoW from /proc/<pid>/smaps can be slow with large processes on
some platforms.

This measures the time it takes to read CoW info and limits the duty
cycle of future updates to roughly 1/100.

As current_cow_size no longer represents a current, fixed interval value
there is also a new current_cow_size_age field that provides information
about the age of the size value, in seconds.
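
The duty-cycle idea can be sketched like this: after an update that took `cost_us`, the next one is not allowed until `ratio` times that cost has passed since it started. The struct and function names are hypothetical and this is not the code from the commit.

```c
/* State for rate-limiting an expensive measurement. All times are in
 * microseconds from any monotonic clock. */
typedef struct {
    long long next_allowed_us;  /* earliest time the next update may run */
} duty_limiter;

/* Returns 1 if an update may run at time now_us, 0 otherwise. */
int duty_may_run(const duty_limiter *d, long long now_us) {
    return now_us >= d->next_allowed_us;
}

/* Records that an update started at start_us and took cost_us, and schedules
 * the next one so that updates use roughly 1/ratio of the elapsed time. */
void duty_record(duty_limiter *d, long long start_us, long long cost_us,
                 int ratio) {
    d->next_allowed_us = start_us + cost_us * ratio;
}
```

With `ratio` set to 100, a read that takes 5 ms blocks further reads for 500 ms from its start, which is also why the reported value can be stale and an age field becomes useful.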
2021-03-22 13:25:58 +02:00
sundb
95d6297db8
Add run all test support with define REDIS_TEST (#8570)
1. Add `redis-server test all` support to run all tests.
2. Add redis test to daily ci.
3. Add `--accurate` option to run slow tests for more iterations (so that
   by default we run fewer cycles: shorter time, and fewer prints).
4. Move dict benchmark to REDIS_TEST.
5. fix some leaks in tests
6. make quicklist tests run on a specific fill set of options rather than huge ranges
7. move some prints in quicklist test outside their loops to reduce prints
8. removing sds.h from dict.c since it is now used in both redis-server and
   redis-cli (uses hiredis sds)
2021-03-10 09:13:11 +02:00
Yossi Gottlieb
af2175326c
Fix memory info on FreeBSD. (#8620)
The obtained process_rss was incorrect (the OS reports pages, not
bytes), resulting in many other fields getting corrupted.

This has been tested on FreeBSD but not other platforms.
2021-03-09 11:33:32 +02:00
Yossi Gottlieb
3ea4c43add
Cleanup usage of malloc_usable_size. (#8554)
* Add better control of malloc_usable_size() usage.
* Use malloc_usable_size on alpine libc daily job.
* Add no-malloc-usable-size daily jobs.
* Fix zmalloc(0) when HAVE_MALLOC_SIZE is undefined.

In order to align with the jemalloc behavior, this should never return
NULL or OOM panic.
2021-02-25 09:24:41 +02:00
Yossi Gottlieb
dd885780d6
Fix compile errors with no HAVE_MALLOC_SIZE. (#8533)
Also adds a new daily CI test, relying on the fact that we don't use malloc_size() on alpine libmusl.

Fixes #8531
2021-02-23 17:08:49 +02:00
Yossi Gottlieb
d32f2e9999
Fix integer overflow (CVE-2021-21309). (#8522)
On 32-bit systems, setting the proto-max-bulk-len config parameter to a high value may result in an integer overflow and a subsequent heap overflow when parsing an input bulk (CVE-2021-21309).

This fix has two parts:

- Set a reasonable limit to the config parameter.
- Add additional checks to prevent the problem in other potential but unknown code paths.
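
As an illustration of the kind of check involved, the sketch below rejects a length above a configured limit and refuses any size computation that would wrap around `size_t`. It is a generic pattern with hypothetical function names, not the actual patch for this CVE.

```c
#include <stddef.h>
#include <stdint.h>

/* Adds a and b, storing the result in *out. Returns 0 on success and -1 if
 * the sum would not fit in size_t (only 32 bits wide on 32-bit systems). */
int size_add_checked(size_t a, size_t b, size_t *out) {
    if (a > SIZE_MAX - b) return -1;
    *out = a + b;
    return 0;
}

/* Hypothetical example: computes the buffer size needed for a bulk payload
 * of bulk_len bytes plus `overhead` bytes of header and terminator. Rejects
 * lengths above max_bulk_len and sizes that would wrap around. */
int bulk_alloc_size(size_t bulk_len, size_t overhead, size_t max_bulk_len,
                    size_t *out) {
    if (bulk_len > max_bulk_len) return -1;
    return size_add_checked(bulk_len, overhead, out);
}
```

Without the second check, a length close to `SIZE_MAX` plus a small overhead wraps to a tiny allocation size, and the subsequent copy overflows the heap buffer.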
2021-02-22 15:41:32 +02:00
Oran Agra
8dd16caec8
Fix last COW INFO report, Skip test on non-linux platforms (#8301)
- the last COW report wasn't always read from the pipe
  (receiveLastChildInfo wasn't used)
- but in fact, there's no reason we won't always try to drain that pipe
  so i'm unifying receiveLastChildInfo with receiveChildInfo
- adjust threshold of the COW test when run in accurate mode
- add some prints in case this test fails again
- fix indentation, page size, and PID! in MacOS proc info

p.s. it seems that pri_pages_dirtied is always 0
2021-01-08 23:35:30 +02:00
Yossi Gottlieb
86e3395c11
Several (mostly Solaris-related) cleanups (#8171)
* Allow runtest-moduleapi use a different 'make', for systems where GNU Make is 'gmake'.
* Fix issue with builds on Solaris re-building everything from scratch due to CFLAGS/LDFLAGS not stored.
* Fix compile failure on Solaris due to atomicvar and a bunch of warnings.
* Fix garbled log timestamps on Solaris.
2020-12-13 17:09:54 +02:00
David CARLIER
ec951cdc15
Solaris based system rss size report. (#8138) 2020-12-06 15:30:29 +02:00
Oran Agra
7ca00d694d Sanitize dump payload: fail RESTORE if memory allocation fails
When RDB input attempts to make a huge memory allocation that fails,
RESTORE should fail gracefully rather than die with panic
2020-12-06 14:54:34 +02:00
David CARLIER
d428de590f
DragonFlyBSD resident memory amount (almost) similar as FreeBSD. (#8023) 2020-11-08 09:16:14 +02:00
Yossi Gottlieb
9824fe3e39
Fix wrong zmalloc_size() assumption. (#7963)
When using a system with no malloc_usable_size(), zmalloc_size() assumed
that the heap allocator always returns blocks that are long-padded.

This may not always be the case, and will result in zmalloc_size()
returning a size that is bigger than allocated. At least in one case
this leads to an out-of-bounds write, a process crash and a potential security
vulnerability.

Effectively this does not affect the vast majority of users, who use
jemalloc or glibc.

This problem along with a (different) fix was reported by Drew DeVault.
2020-10-26 14:49:08 +02:00
Oran Agra
3945a32177
performance and memory reporting improvement - sds take control of its internal frag (#7875)
This commit has two aspects:
1) improve memory reporting for all the places that use sdsAllocSize to compute
   memory used by a string, in this case it'll include the internal fragmentation.
2) reduce the need for realloc calls by making the sds implicitly take over
   the internal fragmentation of the block it allocated.
2020-10-02 08:19:44 +03:00
David CARLIER
ce8bfc56ad
getting rss size implementation for netbsd (#7293) 2020-09-29 08:49:35 +03:00
Wang Yuan
445a4b669a
Implement redisAtomic to replace _Atomic C11 builtin (#7707)
Redis 6.0 introduced I/O threads, which are cool and efficient, and we use C11
_Atomic to establish inter-thread synchronization without a mutex. But this means
only compilers that support C11 _Atomic can compile redis code, which brings a
lot of inconvenience since some common platforms, such as CentOS7, don't support
it by default. So we want to implement a redis atomic type to make it more portable.

We have implemented our atomic variable for redis that only has 'relaxed'
operations in src/atomicvar.h, so we implement some operations with
'sequentially-consistent', just like the default behavior of C11 _Atomic that
can establish inter-thread synchronization. And we replace all uses of C11
_Atomic with redis atomic variable.

Our implementation of the redis atomic variable uses C11 _Atomic, __atomic or
__sync macros if available. It supports most common platforms, and we
detect automatically which feature to use. In the Makefile we use a dummy file to
detect if the compiler supports C11 _Atomic. Now for gcc, we can in theory compile redis
code if the gcc version is not less than 4.1.2 (which started to support
__sync_xxx operations). We also removed the mutex fallback implementation of the
redis atomic variable, for performance and testing reasons. You will get compile errors
if your compiler doesn't support any of the features above.
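
A reduced sketch of the fallback selection, covering only the `__atomic` and `__sync` cases with GCC/Clang builtins. The macro and function names are hypothetical and do not match src/atomicvar.h.

```c
/* Relaxed and sequentially consistent increments on a plain long, using the
 * GCC/Clang __atomic builtins when available and the older __sync builtins
 * otherwise. */
#if defined(__ATOMIC_RELAXED)
#define my_atomic_incr(var, n) \
    __atomic_add_fetch(&(var), (n), __ATOMIC_RELAXED)
#define my_atomic_incr_seqcst(var, n) \
    __atomic_add_fetch(&(var), (n), __ATOMIC_SEQ_CST)
#define my_atomic_get_seqcst(var) \
    __atomic_load_n(&(var), __ATOMIC_SEQ_CST)
#else
/* __sync builtins are always full barriers, so both variants map to them. */
#define my_atomic_incr(var, n) __sync_add_and_fetch(&(var), (n))
#define my_atomic_incr_seqcst(var, n) __sync_add_and_fetch(&(var), (n))
#define my_atomic_get_seqcst(var) __sync_add_and_fetch(&(var), 0)
#endif

/* Increments a counter n times with relaxed ordering and reads it back with
 * sequentially consistent ordering. Single-threaded, for illustration only. */
long count_to(long n) {
    long counter = 0;
    for (long i = 0; i < n; i++) my_atomic_incr(counter, 1);
    return my_atomic_get_seqcst(counter);
}
```

The relaxed form is enough for statistics counters, while the sequentially consistent form is the one that can be used to publish data between threads.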

To cover the redis atomic variable in tests, we add CI jobs that build redis on
CentOS6 and CentOS7, and daily workflow jobs that run the tests on them.
For them, we just install the default gcc in order to cover different compiler
versions: gcc is 4.4.7 in the default installation on CentOS6 and 4.8.5 on CentOS7.

We restore the ability to test redis with Helgrind to find data race
errors. But you first need to install Valgrind in the default path configuration
before running your tests, since we use macros in helgrind.h to tell Helgrind
about inter-thread happens-before relationships explicitly, to avoid false positives.
Please open an issue on github if you find data race errors related to this commit.

Unrelated:
- Fix redefinition of typedef 'RedisModuleUserChangedFunc'
  For some old version compilers, they will report errors or warnings, if we
  re-define function type.
2020-09-17 16:01:45 +03:00
Oran Agra
50f5181488
Remove dead code from update_zmalloc_stat_alloc (#7589)
this seems like leftover from before 6eb51bf
2020-07-31 13:01:39 +03:00
antirez
4092a75d85 Avoid collision with MacOS LIST_HEAD macro after #6384. 2019-12-02 09:13:29 +01:00
Salvatore Sanfilippo
e5b5f9a2f6
Merge pull request #6384 from devnexen/apple_smaps_impl
Getting region date per process in Darwin
2019-12-02 09:02:08 +01:00
Oran Agra
bf759cc9c3 Merge remote-tracking branch 'antirez/unstable' into jemalloc_purge_bg 2019-10-04 13:53:40 +03:00
Oran Agra
2e19b94113 RED-31295 - redis: avoid race between dlopen and thread creation
It seems that since I added the creation of the jemalloc thread redis
sometimes fails to start with the following error:

Inconsistency detected by ld.so: dl-tls.c: 493: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= GL(dl_tls_generation)' failed!

This seems to be due to a race bug in ld.so, in which TLS creation on the
thread collides with dlopen.

Move the creation of BIO and jemalloc threads to after modules are loaded.

plus small bugfix when trying to disable the jemalloc thread at runtime
2019-10-02 15:39:44 +03:00
David Carlier
5a8a005026 Adding AnonHugePages case + comments 2019-09-20 11:01:36 +01:00
David Carlier
819a661be5 Getting region date per process in Darwin 2019-09-15 14:05:00 +01:00
David Carlier
f1c6c658ac Updating resident memory request impl on FreeBSD. 2019-07-28 14:33:57 +01:00
Oran Agra
09f99c2a92 make redis purge jemalloc after flush, and enable background purging thread
jemalloc 5 doesn't immediately release memory back to the OS, instead there's a decaying
mechanism, which doesn't work when there's no traffic (no allocations).
this is most evident if there's no traffic after flushdb, the RSS will remain high.

1) enable jemalloc background purging
2) explicitly purge in flushdb
2019-06-02 15:33:14 +03:00
antirez
9dabbd1ab0 Alter coding style in #4696 to conform to Redis code base. 2019-03-21 12:18:55 +01:00
Salvatore Sanfilippo
5c47e2e964
Merge pull request #4696 from oranagra/zrealloc_fix
Fix zrealloc to behave similarly to je_realloc when size is 0
2019-03-21 12:18:04 +01:00
Bruce Merry
8fd1031b10 Fix incorrect memory usage accounting in zrealloc
When HAVE_MALLOC_SIZE is false, each call to zrealloc causes used_memory
to increase by PREFIX_SIZE more than it should, due to mis-matched
accounting between the original zmalloc (which includes PREFIX size in
its increment) and zrealloc (which misses it from its decrement).
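
The accounting rule can be shown with a toy size-prefixed allocator in which every increment and decrement of the counter includes the prefix. All names are hypothetical; this is not zmalloc.c.

```c
#include <stdlib.h>

#define PREFIX_SIZE sizeof(size_t)

static size_t used_memory = 0;  /* bytes handed out, including prefixes */

/* Allocates size bytes and stores the requested size in a hidden prefix. */
void *toy_malloc(size_t size) {
    size_t *real = malloc(size + PREFIX_SIZE);
    if (real == NULL) return NULL;
    *real = size;
    used_memory += size + PREFIX_SIZE;
    return (char *)real + PREFIX_SIZE;
}

void toy_free(void *ptr) {
    if (ptr == NULL) return;
    size_t *real = (size_t *)((char *)ptr - PREFIX_SIZE);
    used_memory -= *real + PREFIX_SIZE;
    free(real);
}

void *toy_realloc(void *ptr, size_t size) {
    if (ptr == NULL) return toy_malloc(size);
    size_t *real = (size_t *)((char *)ptr - PREFIX_SIZE);
    size_t old = *real;
    size_t *newreal = realloc(real, size + PREFIX_SIZE);
    if (newreal == NULL) return NULL;
    *newreal = size;
    /* Decrement and increment must both include PREFIX_SIZE. Leaving it out
     * of the decrement is the kind of mismatch described above. */
    used_memory -= old + PREFIX_SIZE;
    used_memory += size + PREFIX_SIZE;
    return (char *)newreal + PREFIX_SIZE;
}

size_t toy_used_memory(void) { return used_memory; }
```

If the decrement used `old` alone, every call to `toy_realloc()` would leave the counter `PREFIX_SIZE` bytes too high, and it would not return to zero after the final free.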

I've also supplied a command-line test to easily demonstrate the
problem. It's not wired into the test framework, because I don't know
TCL so I'm not sure how to automate it.
2018-09-30 11:49:03 +02:00
Oran Agra
780815dd6e fix recursion typo in zmalloc_usable 2018-07-22 10:17:35 +03:00
Oran Agra
bf680b6f8c slave buffers were wasteful and incorrectly counted causing eviction
A) slave buffers didn't count internal fragmentation and sds unused space,
   this caused them to induce eviction although we didn't mean for it.

B) slave buffers were consuming about twice the memory of what they actually needed.
- this was mainly due to sdsMakeRoomFor growing to twice as much as needed each time
  but networking.c not storing more than 16k (partially fixed recently in 237a38737).
- besides it wasn't able to store half of the new string into one buffer and the
  other half into the next (so the above mentioned fix helped mainly for small items).
- lastly, the sds buffers had up to 30% internal fragmentation that was wasted,
  consumed but not used.

C) inefficient performance due to starting from a small string and reallocing many times.

what i changed:
- creating dedicated buffers for reply list, counting their size with zmalloc_size
- when creating a new reply node, preallocate it to at least 16k.
- when appending a new reply to the buffer, first fill all the unused space of the
  previous node before starting a new one.
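
The append strategy from the list above can be sketched as a linked list of buffers. The struct layout and function names are hypothetical and much simpler than the reply list in networking.c.

```c
#include <stdlib.h>
#include <string.h>

#define MIN_NODE_SIZE (16 * 1024)  /* preallocate at least 16k per node */

typedef struct reply_node {
    struct reply_node *next;
    size_t size;   /* capacity of buf */
    size_t used;   /* bytes of buf already filled */
    char buf[];    /* payload, allocated together with the header */
} reply_node;

typedef struct {
    reply_node *head, *tail;
    size_t nodes;
} reply_list;

/* Appends len bytes to the list. Returns 0 on success, -1 on OOM (in which
 * case the part that fit into the last node has already been appended). */
int reply_append(reply_list *l, const char *data, size_t len) {
    /* First fill whatever space is left in the last node. */
    if (l->tail != NULL) {
        size_t avail = l->tail->size - l->tail->used;
        size_t n = len < avail ? len : avail;
        memcpy(l->tail->buf + l->tail->used, data, n);
        l->tail->used += n;
        data += n;
        len -= n;
    }
    if (len == 0) return 0;

    /* Put the remainder in a new node of at least MIN_NODE_SIZE bytes. */
    size_t size = len > MIN_NODE_SIZE ? len : MIN_NODE_SIZE;
    reply_node *node = malloc(sizeof(*node) + size);
    if (node == NULL) return -1;
    node->next = NULL;
    node->size = size;
    node->used = len;
    memcpy(node->buf, data, len);
    if (l->tail != NULL) l->tail->next = node; else l->head = node;
    l->tail = node;
    l->nodes++;
    return 0;
}

/* Frees all nodes and resets the list. */
void reply_free(reply_list *l) {
    reply_node *n = l->head;
    while (n != NULL) {
        reply_node *next = n->next;
        free(n);
        n = next;
    }
    l->head = l->tail = NULL;
    l->nodes = 0;
}
```

Splitting a reply across the end of one node and the start of the next is what avoids leaving the tail of each node unused.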

other changes:
- expose mem_not_counted_for_evict info field for the benefit of the test suite
- add a test to make sure slave buffers are counted correctly and that they don't cause eviction
2018-07-16 16:43:42 +03:00
Jack Drogon
93238575f7 Fix typo 2018-07-03 18:19:46 +02:00
Fuxin Hao
a4f658b2b5 Fix update_zmalloc_stat_alloc in zrealloc 2018-06-14 16:44:19 +08:00
Salvatore Sanfilippo
e2a9ea0405
Merge pull request #4901 from KFilipek/zmalloc_typo_fix
HW_PHYSMEM typo in preprocessor condition
2018-06-11 16:32:40 +02:00
Remi Collet
9561fec496 include stdint.h for uint64_t definition 2018-05-30 15:33:06 +02:00
Oran Agra
ad133e1023 Active defrag fixes for 32bit builds
problems fixed:
* failing to read fragmentation information from jemalloc
* overflow in jemalloc fragmentation hint to the defragger
* test suite not triggering eviction after population
2018-05-17 09:52:00 +03:00
Krzysztof Filipek
fd9177dd33 Typo in preprocessor condition 2018-05-06 20:18:48 +02:00
Oran Agra
806736cdf9 Adding real allocator fragmentation to INFO and MEMORY command + active defrag test
other fixes / improvements:
- LUA script memory isn't taken from zmalloc (taken from libc malloc)
  so it can cause high fragmentation ratio to be displayed (which is false)
- there was a problem with "fragmentation" info being calculated from
  RSS and used_memory sampled at different times (now sampling them together)

other details:
- adding a few more allocator info fields to INFO and MEMORY commands
- improve defrag test to measure defrag latency of big keys
- increasing the accuracy of the defrag test (by looking at real frag info)
  this way we can use an even lower threshold and still avoid false positives
- keep the old (total) "fragmentation" field unchanged, but add new ones for specific things
- add these to the MEMORY DOCTOR command
- deduct LUA memory from the rss in case of non jemalloc allocator (one for which we don't have "allocator active/used")
- reduce sampling rate of the rss and allocator info
2018-03-12 15:08:52 +02:00
Oran Agra
5def65008f Fix zrealloc to behave similarly to je_realloc when size is 0
According to C11, the behavior of realloc with size 0 is now deprecated.
it can either behave as free(ptr) and return NULL, or return a valid pointer.
but in zmalloc it can lead to zmalloc_oom_handler and panic.
and that can affect modules that use it.

It looks like both glibc allocator and jemalloc behave like so:
  realloc(malloc(32),0) returns NULL
  realloc(NULL,0) returns a valid pointer

This commit changes zmalloc to behave the same
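
A minimal wrapper with the two behaviors listed above could look like this. The name `my_realloc` is hypothetical and the sketch leaves out the memory accounting and OOM handler that zmalloc has.

```c
#include <stdlib.h>

/* Wrapper with the behavior listed above:
 *   my_realloc(ptr, 0)  frees ptr and returns NULL
 *   my_realloc(NULL, 0) returns a valid pointer
 * It never passes size 0 to realloc() itself, so it does not depend on what
 * the underlying allocator does in that case. */
void *my_realloc(void *ptr, size_t size) {
    if (size == 0 && ptr != NULL) {
        free(ptr);
        return NULL;
    }
    if (ptr == NULL) {
        /* malloc(0) may legally return NULL, so ask for one byte instead. */
        return malloc(size ? size : 1);
    }
    return realloc(ptr, size);
}
```

A caller that treats a NULL return as out-of-memory has to exclude the `size == 0` case first, which is the situation that could previously end in the OOM handler.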
2018-02-21 11:04:13 +02:00
antirez
6eb51bf1ec zmalloc.c: remove thread safe mode, it's the default way. 2017-05-09 16:59:51 +02:00
antirez
2a51bac44e Simplify atomicvar.h usage by having the mutex name implicit. 2017-05-04 17:01:00 +02:00
antirez
f47607af02 Fix preprocessor if/else chain broken in order to fix #3927. 2017-04-11 16:54:27 +02:00
antirez
aa5b4be02e Fix zmalloc_get_memory_size() ifdefs to actually use the else branch.
Close #3927.
2017-04-11 16:45:11 +02:00
antirez
173d692bc2 Defrag: activate it only if running modified version of Jemalloc.
This commit also includes minor aesthetic changes like removal of
trailing spaces.
2017-01-10 11:25:39 +01:00