what do a walrus and tupperware have in common?

Writing a desktop indexing system is hard for a lot of reasons. One is that you have to wade through all of the user’s data to extract metadata and text from them, and people have all kinds of crazy files on their system. In Beagle we’ve had problems with certain file types — usually Microsoft Office files — that would cause our indexer to get into an infinite loop, using 100% CPU and wedging the indexing process. Thanks a lot, proprietary file formats.

Anyway, we’ve ironed out most of the problems, but there is still the occasional hang. Debugging these problems has always been a pain, because it’s been difficult to pinpoint exactly what file is causing the trouble. While the upstream Beagle tarball releases ship with the default logging level at its highest setting, most of the distributions (including Fedora and openSUSE) patch it down to a lower level. This makes a lot of sense: why waste a lot of disk space on log files?

With the log level on its default Debug setting, Beagle prints out the URI for each document as it’s indexed:

Debug: +email://local@local/Archive;uid=106075
Debug: Generated 2 child indexables from email://local@local/Archive;uid=106075 (filtered with Beagle.Filters.FilterMail)
Debug: +email://local@local/Archive;uid=106090
Debug: Generated 1 child indexable from email://local@local/Archive;uid=106090 (filtered with Beagle.Filters.FilterMail)
Debug: +email://local@local/Archive;uid=106092
Debug: +email://local@local/Archive;uid=106094
Debug: +file:///home/joe/cvs/hal/hal/hald/logger.h
Debug: +file:///home/joe/cvs/hal/hal/hald/hald.h
Debug: +file:///home/joe/cvs/hal/hal/hald/osspec.h
Debug: +file:///home/joe/cvs/hal/hal/hald/logger.c

With this much information it’s pretty easy to find out which file is causing the CPU to spin: the last one. But if the log level is below “Debug”, you won’t see these messages, and you’ll have no idea.

A while ago I introduced the ability to send Beagle SIGUSR1 to turn the debugging level up to “Debug” while the daemon was running. This has helped us find bugs without forcing the user to restart the daemon with the --debug command-line option (and possibly losing important running state!), but it doesn’t help us in the situation above. Earlier today I checked in code so that the indexer reports the file it’s filtering, and how long it has been filtering, so that we can more easily identify when these issues occur, and specifically what file causes them.

To do this, you send SIGUSR2 to the index helper process:

Warn: Handling signal 12 (SIGUSR2)
Warn: Filtering status (0s ago): filtering email://1083884280.3779.4@posthaste/INBOX;uid=52#0 with Beagle.Filters.FilterHtml

Now finding these types of bugs should be a piece of cake. Why didn’t I think of this before?

Lots more great stuff coming in the 0.2.14 release, which should be out before the weekend.

2 comments

Comments are now closed.