V3 Race Condition Hunt Page

From rsyslog wiki
Jump to: navigation, search

This page is for the hunt after a race condition that happens to exist in v3/4 versions.

As of now (2009-01-29), it looks like the bug has been found. Final confirmation is outstanding and probably will for the next two to four weeks (we need to see no more reports during that time frame), however, it very much looks like it is solved. Be sure to read Rainer's analysis for the details: http://blog.gerhards.net/2009/01/rsyslog-data-race-analysis.html

What's the bug

There seems to be a race condition that causes rsyslog to abort on very fast systems. So far, it has been seen on systems with a minimum of four cores only. Even there, it seems not to exist in all configurations, so besides a fast machine it is configuration dependent. When the race happens, rsyslog usually segfaults, but there was one report of a rsyslog hang.

Help hunt it down

One problem with the bug is that the developers so far could not reproduce it on any development system. There are also only limited bug reports. So if you experience it, please report. This page here is meant as a pivot point around the bug, and will contain all information we know about it. As this may require updates of existing information, Rainer considers the wiki a better resource than the bug tracker.

Configurations in which the problem occured

Lorenzo's

configure options (somewhat stripped down):

./configure  --enable-mysql --enable-pgsql --enable-mail --enable-imfile 
--enable-debug --enable-rtinst --enable-valgrind

config file, excerpt of what I find relevant:

#$ModLoad omrelp 
$WorkDirectory /var/log/rsyslog 
$ActionQueueType LinkedList   # use asynchronous processing
$ActionQueueFileName srvrfwd  # set file name, also enables disk mode 
$ActionResumeRetryCount -1    # infinite retries on insert failure
$ActionQueueSaveOnShutdown on # save in-memory data if rsyslog shuts down
mail.* @@<remote_addr>:514
$ActionExecOnlyIfPreviousIsSuspended on
& /var/log/rsyslog/failover.log
$ActionExecOnlyIfPreviousIsSuspended of

Rainer's

It looks like I finally managed to get a repro, under Debian Lenny. As it turned out, running with exactly two queues was vital in my system. Running with four, for example, did make the problem disappear.

I use a minimalistic config file:

$ModLoad omuxsock
$ActionQueueType LinkedList
*.* /logs/logfile1
$ActionQueueType LinkedList
*.* /logs/logfile1

The $ActionQueueType seems to be vital to be able to reproduce the issue.

I use the traffic generator program posted below to generate messages and I run a couple of those generators in parallel to rsyslogd. I do not use debug output, I did not see the problem occur with that. rsyslogd is configured with --enable-debug and --enable-rtinst and the abort is actually done by an assert(), which detects that a message (msg_t) object is invalid. The abort location is pretty stable, right at the beginning of queueChkDiscardMsg(), line 1393 in the current code base. Bug reports from other folks, however, stated a different location (often in msgDestruct()).

Program to generate local log traffic

I wrote the following quick and dirty program to generate some traffic on the system's local log socket. However, I did not yet have any success in reproducing the problem via it.

#include <stdio.h>
#include <syslog.h>
void main() {
       int i = 0;
       while(i < 1000000000) {
               syslog(LOG_WARNING, "Message %d", i++);
       }
}