futriix/tests/integration/logging.tcl
meiravgri 2e854bccc6
Fix async safety in signal handlers (#12658)
See the discussion that followed after https://github.com/redis/redis/pull/12453 was
merged.
----
This PR replaces calls that are not considered async-signal-safe
(AS-safe) in the signal handlers with safe alternatives.

#### **1. serverLog() and serverLogFromHandler()**
`serverLog` uses unsafe calls. It was decided that we will **avoid**
`serverLog` calls from the signal handlers when:
* The signal is not fatal, such as SIGALRM. In these cases, we prefer
using `serverLogFromHandler`, the safe version of `serverLog`.
Note they have different prompts:
`serverLog`: `62220:M 26 Oct 2023 14:39:04.526 # <msg>`
`serverLogFromHandler`: `62220:signal-handler (1698331136) <msg>`
* The code was added recently. Calls to `serverLog` from the signal
handler have been there for as long as Redis has existed and haven't
caused problems so far. To avoid regressions, new code should use
`serverLogFromHandler` from now on (a minimal sketch follows this list).
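
For illustration (this is not the actual `serverLogFromHandler` code), a
handler-safe logger restricts itself to AS-safe calls: it pre-builds the
line with plain memory copies and emits it with a single `write(2)`:

```c
/* Illustrative sketch only, not the Redis implementation: log from a
 * signal handler using only async-signal-safe calls (no snprintf, no malloc). */
#include <string.h>
#include <unistd.h>

static void log_from_handler(int fd, const char *msg) {
    char buf[512];
    size_t len = 0;
    const char *prefix = "signal-handler ";     /* hypothetical prompt */

    /* memcpy()/strlen() are AS-safe since POSIX.1-2008 TC2; write(2) always was. */
    memcpy(buf + len, prefix, strlen(prefix));
    len += strlen(prefix);
    size_t msglen = strlen(msg);
    if (msglen > sizeof(buf) - len - 1) msglen = sizeof(buf) - len - 1;
    memcpy(buf + len, msg, msglen);
    len += msglen;
    buf[len++] = '\n';

    if (write(fd, buf, len) == -1) {
        /* Nothing AS-safe and useful can be done on failure. */
    }
}
```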

#### **2. `snprintf`, `fgets` and `strtoul` (base = 16) --------> `_safe_snprintf`, `fgets_async_signal_safe`, `string_to_hex`**
The safe version of `snprintf` was taken from
[here](8cfc4ca5e7/src/mc_util.c (L754))
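
For example, the `strtoul(..., 16)` replacement can be a hand-rolled hex
parser built only from loops and arithmetic (a hypothetical sketch; the
real `string_to_hex` may differ in naming and edge-case handling):

```c
/* Sketch: parse a hexadecimal string without strtoul(), which is not on
 * the async-signal-safe list (it may set errno and consult locale state). */
static unsigned long hex_to_ulong(const char *s) {
    unsigned long v = 0;
    while (*s) {
        char c = *s++;
        if (c >= '0' && c <= '9')      v = (v << 4) | (unsigned long)(c - '0');
        else if (c >= 'a' && c <= 'f') v = (v << 4) | (unsigned long)(c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') v = (v << 4) | (unsigned long)(c - 'A' + 10);
        else break; /* stop at the first non-hex character */
    }
    return v;
}
```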

#### **3. fopen(), fgets(), fclose() --------> open(), read(), close()**
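
A minimal sketch of this substitution, assuming the goal is to slurp a
small file (e.g. an entry under `/proc`) from the handler using only
AS-safe syscalls:

```c
/* Sketch only: read a small file with the raw open/read/close syscalls,
 * which are async-signal-safe, instead of the stdio fopen/fgets/fclose
 * family, which is not. */
#include <fcntl.h>
#include <unistd.h>

static ssize_t read_file_async_safe(const char *path, char *buf, size_t size) {
    if (size == 0) return -1;
    int fd = open(path, O_RDONLY);
    if (fd == -1) return -1;

    ssize_t total = 0, n;
    while ((size_t)total < size - 1 &&
           (n = read(fd, buf + total, size - 1 - total)) > 0)
        total += n;
    buf[total] = '\0';   /* callers can then scan for lines in place */

    close(fd);
    return total;
}
```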

#### **4. opendir(), readdir(), closedir() --------> open(),
syscall(SYS_getdents64), close()**
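
A sketch of the directory walk, assuming Linux only (the entry layout is
described in getdents64(2); names here are illustrative):

```c
/* Sketch: enumerate directory entries (e.g. thread ids under /proc/self/task)
 * with the raw getdents64 syscall, avoiding opendir/readdir, which may
 * allocate memory and are not async-signal-safe. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

struct linux_dirent64 {
    uint64_t       d_ino;
    int64_t        d_off;
    unsigned short d_reclen;
    unsigned char  d_type;
    char           d_name[];   /* null-terminated entry name */
};

static void walk_dir(const char *path) {
    int fd = open(path, O_RDONLY | O_DIRECTORY);
    if (fd == -1) return;

    char buf[4096];
    long n;
    while ((n = syscall(SYS_getdents64, fd, buf, sizeof(buf))) > 0) {
        for (long off = 0; off < n;) {
            struct linux_dirent64 *d = (struct linux_dirent64 *)(buf + off);
            /* d->d_name is the entry name; process it here. */
            off += d->d_reclen;
        }
    }
    close(fd);
}
```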

#### **5. Threads_mngr sync mechanisms**
* waiting for the thread to generate a stack trace: semaphore -------->
busy-wait (see the sketch after this list)
* `globals_rw_lock` was removed: as we no longer use malloc and the
semaphore, we don't need to protect `ThreadsManager_cleanups`.
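
A hypothetical sketch of the busy-wait (function and variable names are
illustrative, not the Redis ones): the coordinating thread spins on an
atomic counter with a deadline instead of blocking on a semaphore:

```c
/* Sketch: workers bump an atomic counter when done; the coordinator spins
 * until all reported or a timeout elapsed. clock_gettime() is AS-safe. */
#include <stdatomic.h>
#include <time.h>

static atomic_int threads_done;

/* Called by each worker thread after it wrote its stack trace. */
void mark_done(void) { atomic_fetch_add(&threads_done, 1); }

/* Returns 1 if all threads reported within timeout_ms, 0 otherwise. */
int wait_all_threads(int nthreads, long timeout_ms) {
    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC, &start);
    while (atomic_load(&threads_done) < nthreads) {
        clock_gettime(CLOCK_MONOTONIC, &now);
        long elapsed_ms = (now.tv_sec - start.tv_sec) * 1000 +
                          (now.tv_nsec - start.tv_nsec) / 1000000;
        if (elapsed_ms >= timeout_ms) return 0; /* timed out */
    }
    return 1;
}
```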

#### **6. Stacktraces buffer**
The initial problem was that we were not able to safely call malloc
within the signal handler.
To solve that we created a buffer on the stack of `writeStacktraces` and
saved it in a global pointer, assuming that under normal circumstances,
the function `writeStacktraces` would complete before any thread
attempted to write to it. However, **if threads lag behind, they might
access this global pointer after it no longer belongs to the
`writeStacktraces` stack, potentially corrupting memory.**
To address this, various solutions were discussed
[here](https://github.com/redis/redis/pull/12658#discussion_r1390442896).
Eventually, we decided to **create a pipe** at server startup that will
remain valid as long as the process is alive.
We chose this solution due to its minimal memory usage and because
`write()` and `read()` on a pipe are atomic (for chunks of up to
`PIPE_BUF` bytes), which ensures that stack traces from different threads
won't mix.

**The stacktraces collection process is now as follows:**
* Clean the pipe to eliminate writes of late threads from previous runs.
* Each thread writes its stacktrace to the pipe.
* Wait for all the threads to mark completion, or until a timeout (2 sec)
is reached.
* Read from the pipe to print the stacktraces (a sketch of this flow
follows the list).
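
A hypothetical sketch of this flow (function and variable names are
illustrative, not the actual Redis ones):

```c
/* Sketch of the pipe-based stacktrace collection. */
#include <fcntl.h>
#include <unistd.h>

static int stacktrace_pipe[2];

/* Called once at server startup; the pipe stays valid for the whole process
 * lifetime. Non-blocking mode is an assumption of this sketch so that the
 * read loops below can never hang. */
int setup_stacktrace_pipe(void) {
    if (pipe(stacktrace_pipe) == -1) return 0;
    fcntl(stacktrace_pipe[0], F_SETFL, fcntl(stacktrace_pipe[0], F_GETFL) | O_NONBLOCK);
    fcntl(stacktrace_pipe[1], F_SETFL, fcntl(stacktrace_pipe[1], F_GETFL) | O_NONBLOCK);
    return 1;
}

/* Each thread emits its whole trace in a single write(2); writes of up to
 * PIPE_BUF bytes are atomic, so traces from different threads don't mix. */
void write_one_stacktrace(const char *trace, size_t len) {
    if (write(stacktrace_pipe[1], trace, len) == -1) {
        /* nothing safe to do from a signal context */
    }
}

/* Coordinator: drain stale writes from late threads of a previous run,
 * trigger and wait for the threads (see section 5), then read and log. */
void collect_stacktraces(char *out, size_t out_size) {
    char scratch[4096];
    while (read(stacktrace_pipe[0], scratch, sizeof(scratch)) > 0);

    /* ... ask every thread for its trace and busy-wait up to ~2 seconds ... */

    ssize_t n = read(stacktrace_pipe[0], out, out_size - 1);
    out[n > 0 ? n : 0] = '\0';
}
```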

#### **7. Changes that were considered and eventually dropped**
* Replace the watchdog timer with a POSIX timer:
according to the [setitimer man page](https://linux.die.net/man/2/setitimer),

> POSIX.1-2008 marks getitimer() and setitimer() obsolete, recommending
the use of the POSIX timers API
([timer_gettime](https://linux.die.net/man/2/timer_gettime)(2),
[timer_settime](https://linux.die.net/man/2/timer_settime)(2), etc.)
instead.

However, although the POSIX timers API is supposed to conform to the
POSIX standard, it is not supported on Mac.
You can take a look at the Linux implementation
[here](c7562ee135).
To avoid messing up the code, and given the uncertainty regarding
compatibility, it was decided to drop it for now.

* Avoid using sds (which uses malloc) in `logConfigDebugInfo`:
it was considered to print the config info directly instead of using sds;
however, `logConfigDebugInfo` does more than just print the sds, so it was
decided that this fix is out of the scope of this issue.

#### **8. Fix the signal mask check**
The check `signum & sig_mask`, intended to indicate whether the signal is
blocked by the thread, was incorrect: the bit position in the signal mask
corresponds to the signal number. We fixed this by changing the condition
to `sig_mask & (1L << (sig_num - 1))` (see the example below).
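
For example, with a mask parsed from the `SigBlk:`/`SigIgn:` lines of
`/proc/<pid>/status` (a hex bitmask in which bit `sig_num - 1` represents
signal `sig_num`), the corrected check looks like this (illustrative
sketch, not the exact Redis code):

```c
/* Sketch: test whether a given signal is set in a blocked/ignored mask. */
#include <stdio.h>

static int signal_in_mask(unsigned long sig_mask, int sig_num) {
    return (sig_mask & (1UL << (sig_num - 1))) != 0;
}

int main(void) {
    /* Example value as it might be parsed from a "SigBlk:" line. */
    unsigned long sigblk = 0x0000000000010000UL;
    /* Bit 16 is set, i.e. signal 17 (SIGCHLD on Linux/x86-64) is blocked. */
    printf("SIGCHLD blocked: %d\n", signal_in_mask(sigblk, 17));
    printf("SIGALRM blocked: %d\n", signal_in_mask(sigblk, 14)); /* bit 13: not set */
    return 0;
}
```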

#### **9. Unrelated changes**
Both `fork.tcl` and `util.tcl` implemented a function called
`count_log_message`, each expecting different parameters. This caused
confusion when trying to run the daily tests with additional parameters
to run a specific test.
The `count_log_message` in `fork.tcl` was removed and its calls were
replaced with calls to the `count_log_message` located in `util.tcl`.

---------

Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
2023-11-23 13:22:20 +02:00

139 lines
5.1 KiB
Tcl

tags {"external:skip"} {
    set system_name [string tolower [exec uname -s]]
    set backtrace_supported 0
    set threads_mngr_supported 0 ;# Do we support printing stack trace from all threads, not just the one that got the signal?

    # We only support darwin or Linux with glibc
    if {$system_name eq {darwin}} {
        set backtrace_supported 1
    } elseif {$system_name eq {linux}} {
        set threads_mngr_supported 1
        # Avoid the test on libmusl, which does not support backtrace
        # and on static binaries (ldd exit code 1) where we can't detect libmusl
        catch {
            set ldd [exec ldd src/redis-server]
            if {![string match {*libc.*musl*} $ldd]} {
                set backtrace_supported 1
            }
        }
    }

    # look for the DEBUG command in the backtrace, used when we triggered
    # a stack trace print while we know redis is running that command.
    proc check_log_backtrace_for_debug {log_pattern} {
        # search for the final line in the stacktraces generation to make sure it was completed.
        set pattern "* STACK TRACE DONE *"
        set res [wait_for_log_messages 0 \"$pattern\" 0 100 100]

        set res [wait_for_log_messages 0 \"$log_pattern\" 0 100 100]
        if {$::verbose} { puts $res }

        # If the stacktrace is printed more than once, it means redis crashed during crash report generation
        assert_equal [count_log_message 0 "STACK TRACE -"] 1

        upvar threads_mngr_supported threads_mngr_supported

        # the following checks are only done if we support printing stack trace from all threads
        if {$threads_mngr_supported} {
            assert_equal [count_log_message 0 "setupStacktracePipe failed"] 0
            assert_equal [count_log_message 0 "failed to open /proc/"] 0
            assert_equal [count_log_message 0 "failed to find SigBlk or/and SigIgn"] 0
            # the following are skipped since valgrind is slow and a timeout can happen
            if {!$::valgrind} {
                assert_equal [count_log_message 0 "wait_threads(): waiting threads timed out"] 0
                # make sure redis prints stack trace for all threads. we know 3 threads are idle in bio.c
                assert_equal [count_log_message 0 "bioProcessBackgroundJobs"] 3
            }
        }

        set pattern "*debugCommand*"
        set res [wait_for_log_messages 0 \"$pattern\" 0 100 100]
        if {$::verbose} { puts $res }
    }

    # used when backtrace_supported == 0
    proc check_crash_log {log_pattern} {
        set res [wait_for_log_messages 0 \"$log_pattern\" 0 50 100]
        if {$::verbose} { puts $res }
    }

    # test the watchdog and the stack trace report from multiple threads
    if {$backtrace_supported} {
        set server_path [tmpdir server.log]
        start_server [list overrides [list dir $server_path]] {
            test "Server is able to generate a stack trace on selected systems" {
                r config set watchdog-period 200
                r debug sleep 1
                check_log_backtrace_for_debug "*WATCHDOG TIMER EXPIRED*"
                # make sure redis is still alive
                assert_equal "PONG" [r ping]
            }
        }
    }

    # Valgrind will complain that the process terminated by a signal, skip it.
    if {!$::valgrind} {
        if {$backtrace_supported} {
            set check_cb check_log_backtrace_for_debug
        } else {
            set check_cb check_crash_log
        }

        # test being killed by a SIGABRT from outside
        set server_path [tmpdir server1.log]
        start_server [list overrides [list dir $server_path crash-memcheck-enabled no]] {
            test "Crash report generated on SIGABRT" {
                set pid [s process_id]
                r deferred 1
                r debug sleep 10 ;# so that we see the function in the stack trace
                r flush
                after 100 ;# wait for redis to get into the sleep
                exec kill -SIGABRT $pid
                $check_cb "*crashed by signal*"
            }
        }

        # test DEBUG SEGFAULT
        set server_path [tmpdir server2.log]
        start_server [list overrides [list dir $server_path crash-memcheck-enabled no]] {
            test "Crash report generated on DEBUG SEGFAULT" {
                catch {r debug segfault}
                $check_cb "*crashed by signal*"
            }
        }

        # test DEBUG SIGALRM being non-fatal
        set server_path [tmpdir server3.log]
        start_server [list overrides [list dir $server_path]] {
            test "Stacktraces generated on SIGALRM" {
                set pid [s process_id]
                r deferred 1
                r debug sleep 10 ;# so that we see the function in the stack trace
                r flush
                after 100 ;# wait for redis to get into the sleep
                exec kill -SIGALRM $pid
                $check_cb "*Received SIGALRM*"
                r read
                r deferred 0
                # make sure redis is still alive
                assert_equal "PONG" [r ping]
            }
        }
    }

    # test DEBUG ASSERT
    if {$backtrace_supported} {
        set server_path [tmpdir server4.log]
        # Use exit() instead of abort() upon assertion so Valgrind tests won't fail.
        start_server [list overrides [list dir $server_path use-exit-on-panic yes crash-memcheck-enabled no]] {
            test "Generate stacktrace on assertion" {
                catch {r debug assert}
                check_log_backtrace_for_debug "*ASSERTION FAILED*"
            }
        }
    }
}