(17:04:01) Robert Love: ugh my eye is spasming
(17:04:19) Robert Love: I cannot see
(17:07:22) Joe Shaw: if it offends thee, cut it out
(17:08:00) Robert Love: its my good eye!
(17:09:23) Joe Shaw: not anymore it’s not
Over the past couple days Fredrik and I have been struggling to figure out why in the hell Beagle indexing was taking an extremely long time for very large OpenOffice documents. These documents spit out their filtered text in about 5 seconds using the beagle-extract-content tool, but indexing them within the daemon took upwards of 6 minutes. Fredrik also noticed that if he stored the streamed tokens to disk and then fed them into Lucene, it’d also take about 5 seconds.
I was able to see that Lucene was short circuiting on our documents, only indexing the first 10,000 tokens. If I bumped that up to a large enough number, say 5 million, it’d pick up the speed. But I also noticed that if we tried to extract “hot” content (bold text, etc.) before regular content, it’d take 6 minutes even with beagle-extract-content. That removed Lucene from the equation. The bug had to be in either the OpenOffice filter, the filter architecture, or the PullingReader class. Fredrik and I both looked over the code but didn’t see anything obvious. I started running the Mono profiler, but it is so slow that it took 30 minutes for the normal 5-second case, and had been going for 3 hours before I aborted it in the 6-minute case.
We were totally stumped and asked Jon to take a look at it. As is often the case, a fresh pair of eyes quickly found a place where we were unnecessarily doing an O(n²) operation in the filter. That code had been used to profile memory usage in the filters and hadn’t been useful recently, so we nuked it. That brought the indexing time down to about a minute, which was better but still too long. After chatting with Jon about it some more, we found that the PullingReader was doing some pretty inefficient things and worked out some ways to both speed things up and reduce the number of memory allocations. After hacking it up, what previously took 6 minutes and then 1 minute took 6 seconds. And what previously took 5 seconds now takes 3.5. So there is quite a win in all cases, and indexing is noticeably faster now than it was yesterday. Yay for teamwork!
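The actual PullingReader changes aren’t shown here, so what follows is only a generic sketch, in Python rather than Beagle’s C#, of how a pull-style reader can quietly go quadratic and how keeping an offset avoids it. `SlowPullingReader`, `FastPullingReader`, `make_source` and `read_all` are all made-up names for illustration, and the `chars_copied` counter is just a rough stand-in for allocation and copy cost. None of this is Beagle’s real code.

```python
from typing import Callable, Iterable, Optional

# A "pull" callback returns the next chunk of text, or None when exhausted.
Pull = Callable[[], Optional[str]]


class SlowPullingReader:
    """Hypothetical reader that re-copies its whole pending buffer on every read."""

    def __init__(self, pull: Pull) -> None:
        self._pull = pull
        self._pending = ""
        self.chars_copied = 0  # rough cost counter, for the demonstration only

    def read(self, count: int) -> str:
        while len(self._pending) < count:
            chunk = self._pull()
            if chunk is None:
                break
            self._pending += chunk
        out = self._pending[:count]
        # Slicing off the front allocates a new string holding everything
        # that is left, so many small reads of a big buffer cost O(n^2).
        self._pending = self._pending[count:]
        self.chars_copied += len(self._pending)
        return out


class FastPullingReader:
    """Hypothetical reader that tracks an offset instead of re-copying."""

    def __init__(self, pull: Pull) -> None:
        self._pull = pull
        self._chunk = ""
        self._offset = 0
        self.chars_copied = 0

    def read(self, count: int) -> str:
        parts = []
        needed = count
        while needed > 0:
            if self._offset >= len(self._chunk):
                chunk = self._pull()
                if chunk is None:
                    break
                self._chunk, self._offset = chunk, 0
                continue
            # Copy only the characters actually being handed out.
            piece = self._chunk[self._offset:self._offset + needed]
            self._offset += len(piece)
            needed -= len(piece)
            parts.append(piece)
            self.chars_copied += len(piece)
        return "".join(parts)


def make_source(chunks: Iterable[str]) -> Pull:
    """Wrap an iterable of chunks as a pull callback."""
    it = iter(chunks)
    return lambda: next(it, None)


def read_all(reader, count: int) -> str:
    """Drain a reader with fixed-size reads, the way a tokenizer might."""
    out = []
    while True:
        piece = reader.read(count)
        if not piece:
            break
        out.append(piece)
    return "".join(out)


text = "word " * 2000  # one big chunk, as a filter might hand over
slow = SlowPullingReader(make_source([text]))
fast = FastPullingReader(make_source([text]))
assert read_all(slow, 16) == read_all(fast, 16) == text
print("slow copied:", slow.chars_copied, "fast copied:", fast.chars_copied)
```

Both readers return identical text. The difference is only in how much gets copied along the way, which is the kind of thing that stays invisible until the document gets large.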
The only downside to this story is that the bottleneck probably would have been more obvious to us if the profiler were more useful. That’s the main weakness in the Mono development platform right now: the tools. C# is a wonderful language, the compiler is fast, the toolchain isn’t a total mess like C’s, the hackers fix our bugs, the community is vibrant and active, but man… the tools. So all of you out there, if you’re looking for something interesting and very helpful to hack on, take a look at tackling some development tools for Mono: the debugger, the built-in profiler, deadlock detection, thread profiling, heap profiling, etc. I will buy you several drinks of a refreshing nature.
Trackback link: http://joeshaw.org/2005/07/26/167/trackback