down at the dogtrack

This week, after feeling like Beagle’s search results were a little sluggish and having watched Federico’s excellent FOSDEM talk on profiling applications [258 megs, Ogg/Theora], I decided to take a look into optimizing what is unquestionably the most important codepath in the entire project.

Beagle’s search is already pretty fast, especially when you query a single backend, like our Nautilus or Yelp integration does. For a single-term search with a few dozen results, search times with full metadata are already in the sub-second range. Slowdowns always creep into the code, and they’ve become particularly noticeable in larger — many thousand result — queries. Without going into too much detail, here are some of the things dBera and I fixed:

  • We were needlessly walking the list of results an extra time to remove our Beagle-private metadata before returning it to the client. dBera solved this by removing these when we serialize into our message format, since we have to walk then anyway.
  • Instead of running queries to match up URIs between our two Lucene indexes, we now walk the list of terms in the URI field, which is a lot faster on large result sets and memory friendly.
  • Buffer our serialization before sending it out on the network, this way we don’t do a bunch of slow writes and can blast out our messages in nice 8k chunks.
  • And the biggest one, we now flush our hits out to clients 25 at a time. Previously we were holding on to all of them before we sent them out. When you were asking for 100 hits (the default) this was fine, but any more and this became a burden. This drastically reduces the amount of time for a client to see the first hit, but keeps the overall search time the same. Interactivity good.

Since Jorge told me I need to post more graphs, here’s one showing the results of this work:

Improved search speed graph

About the setup: this is an index of my home directory with 177,658 files. The vast majority of those files are source code, along with a hundred or so regular documents (PDFs, OOo docs, MS Office docs), probably a few thousand text files, maybe a thousand images, a few hundred MP3s, and maybe a dozen videos. This is probably on the extreme edge of most home directories; most people probably don’t have more than 20,000 files even with all their music and photos. I was searching for the term “gnome”, which matched 13,107 documents, and asking for the 1,000 most recent results. All “hits”, as they’re called in Beagle, contain the full metadata. So a .desktop file result looked like this:

  Uri: file:///home/joe/svn/beagle/search/beagle-search.desktop
PaUri: (null)
 Snip: (null)
 Type: File
MimeT: application/x-desktop
  Src: Files
Score: 6.97836828231812
 Time: 2007-02-23 19:16:40 (Utc)
    beagle:ExactFilename = 'beagle-search.desktop'
    beagle:Filename = 'beagle-search'
    beagle:FilenameExtension = '.desktop'
    beagle:FileType = 'application'
    beagle:HitType = 'File'
    beagle:InternalUri = 'uid:m1teVAmMMUCsjYxgo6JhBw'
    beagle:IsChild = 'false'
    beagle:MimeType = 'application/x-desktop'
    beagle:NoPunctFilename = 'beagle search'
    beagle:Source = 'Files'
    beagle:SplitFilename = 'beagle search'
    fixme:Categories = 'Application'
    fixme:Categories = 'Core'
    fixme:Categories = 'Filesystem'
    fixme:Categories = 'GNOME'
    fixme:Categories = 'GTK'
    fixme:Categories = 'Utility'
    fixme:Comment = 'Search for data on your desktop'
    fixme:Exec = 'beagle-search'
    fixme:Icon = 'system-search'
    fixme:Name = 'Search'
    fixme:Type = 'Application'

(The “fixme” namespace is because there’s no defined namespace for .desktop files; we’d like to change this at some point.)

As you can see from the graph, the time to display the first hit dropped in half, and the overall speed improved by 0.6s, or 20%. Not bad. I also included the overall search time for a simple libbeagle-based C application, and it takes even less time. This indicates to me that the C# message deserialization could use some tuning next.

I think we can get this even lower. Some next steps:

  • Lucene 2.1 is out now, and it includes a nice FieldSelector API which lets us pick and choose the fields we want to pull from the index, which is nice because that’s probably the slowest path in using Lucene. Sometimes all you want from the search service is a URI, and you don’t care about the metadata, like with the Nautilus integration, for instance. We should add a path where clients can ask only for the URIs and nothing else.
  • Right now backends are searched in parallel. This ends up being slower than searching them serially, because (a) most people don’t have multicore or multi-CPU machines and (b) the indexes are all stored on a single hard drive, so you incur a seek penalty as you hit multiple indexes at the same time. We need to be smarter about how we schedule this, so that we can balance the CPU usage and IO wait effectively. This shouldn’t be too hard.
  • Message serialization is apparently slow. Now that we have a good managed D-Bus implementation and there is the Wasabi draft specification floating around, it might be a good time to start re-evaluating using D-Bus for our IPC. Right now we’re just shoving XML over a Unix domain socket.

Lastly, I saw that Neil J. Patel has been doing some nice stuff with a metadata details pane for Nautilus. It’s somewhat similar to the details pane in beagle-search, but his is a lot prettier and is probably more reusable in other applications. I’m hoping he releases the code soon so I can add Beagle support to it — it only supports Tracker right now.