Skip to content

look out, google

27 October 2004

Technical and boring entry follows…

Beagle is pretty quick for small searches. A couple hundred hits, no problem. But once you start doing rather generic searches on mail, it gets very very slow. The other day at lunch we were discussing where the potential bottlenecks were, speculating wildly about the trouble spots. Yesterday I took a look at some of them.

All of my queries were against the mail backend, which had about 140,000 items indexed. My test search returned approximately 20,000 results. This search took 52 seconds. Very obviously too long.

We’re using Lucene.Net as our indexing and search engine, and since we don’t want to hack it if we can avoid it, it seemed like the best place to start the profiling. We wrap all of the querying stuff in a class called LuceneDriver. I added some instrumentation in there and found that the time to search Lucene, filter the results, and return a list of Hits was taking 17 seconds. How much of that was the penalty of the actual Lucene query? I’m glad you asked. Turns out it’s 0.02 seconds. So, um, yeah, that’s not the problem.

That time is a little misleading though, because there is some actual work in pulling the data out of the Lucene document, and to just iterate across the list of documents takes about 1 second.

The biggest layer of fat turned out to be a workaround for a bug that was later fixed elsewhere in the code. Originally we would just reindex a file whenever it was updated. But we wouldn’t remove or update the previously indexed version, so you could end up getting the same URI back multiple times for a search. We had a function which filtered out these duplicates before returning them. But since we actually now check and update at indexing time, there’s no need for this at searching time. Cutting out this work saved us 8 seconds.

So searches within the LuceneDriver are cut down to 8 seconds and that’s a great improvement, but it’s still too long. There are a few options we can do here:

  • Limit the search results to a certain number. Let’s face it, 20,000 results is too many. It might as well be 1 million. At a certain point (100 maybe?), no one is going to go looking through all that crap. They’d refine their search. So as long as we return only the most relevant results, it’s probably okay to cut back.
  • Don’t convert Lucene hits into our own Beagle hit class structure. Trow says this is primarily a holdover from when Beagle was all a single process. But now we marshal these hits into an intermediate format before we push them over the wire to the client. So there’s absolutely nothing gained except that the marshalling is done in one place instead of a couple different places. From my profiling, though, the bulk of the remaining 8 seconds is in doing this, so we can probably save 5 or 6 seconds by doing this.
  • Return results asynchronously from the LuceneDriver. These can be marshalled in batches and sent over the wire. This would also make the time to marshal and demarshal a chunk of data smaller, and result in faster UI updates, so it’s definitely worth looking into. The downside is that it will add complexity to the infrastructure, as asynchronous stuff always does.

I am definitely going to do the second thing. I’ll probably avoid the first if it can be avoided , and the third will depend on how much we can speed up the marshalling and demarshalling. Right now we’re serializing into XML, and that’s painful. Doing some preliminary profiling, it takes 8.5 seconds to serialize all that, and 11 seconds to deserialize it. That’s a big chunk we can take out of the overall 44 seconds. And there’s still a good 20 seconds or so unaccounted for, so theres still a ton of work to be done.