My blog seems to be quickly becoming a series of posts about debugging various tricky Beagle issues. Is this useful and interesting to people?
Here’s another one that I tracked down yesterday with the help of a great bug reporter: bug 354161 - Too Large Music Directory To Index. The reporter had a directory of over 3000 MP3s, but for whatever reason Beagle only ever saw a little more than 300. What’s going on?
There are a lot of warts in the .NET class libraries, but one of the worst is the lack of an IEnumerable interface to iterate over files and directories. The best you get are methods like Directory.GetFiles() which return an array of files in a directory. This is fine if you’re dealing with a directory of 10 or even 200 files, but really hammers the CPU and memory if you’re dealing with directories of 3000 or 30,000.
To get around this in Beagle, we created the DirectoryWalker class, whose methods return an IEnumerable for easy and safe iteration. We implement this on top of the POSIX readdir(3) call. This requires us to P/Invoke into native code, always a fun experience.
We wrote some C glue to make the interface a little nicer — we only care about the name of the file and not anything else — and we used an out parameter for this. Because strings in .NET (and thus Mono) are immutable, there is special support in the runtime for passing in a StringBuilder to a native method which takes a string by reference. For example, this C code:
int beagled_utils_readdir (void *dir, char *name, int max_len);
translates to this stub in our C# code:
[DllImport ("libbeagleglue", EntryPoint = "beagled_utils_readdir", SetLastError = true)]
private static extern int sys_readdir (IntPtr dir, [Out] StringBuilder name, int max_len);
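For a sense of what such glue involves, here is a minimal sketch of a readdir wrapper with the same shape. It is not Beagle's actual helper: the name sketch_readdir, the return convention (0 on success, -1 at end of directory or on error) and the use of snprintf are all assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical sketch of a readdir() wrapper; the real Beagle glue may differ.
   Copies only the entry's name into the caller's buffer. */
int sketch_readdir (void *dir, char *name, int max_len)
{
    struct dirent *entry;

    if (name == NULL || max_len <= 0)
        return -1;

    errno = 0;
    entry = readdir ((DIR *) dir);
    if (entry == NULL) {
        name[0] = '\0';
        return -1;  /* end of directory, or an error if errno is set */
    }

    /* snprintf writes at most max_len bytes including the NUL, so a
       buffer that is too small quietly yields a shortened name. */
    snprintf (name, (size_t) max_len, "%s", entry->d_name);
    return 0;
}
```

Note that a caller passing a max_len smaller than the name gets a shortened string back and no error at all.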
We use this class throughout Beagle, including our file system crawling code, so we’re pretty sure it works. Which is why this bug was so baffling. With the help of the bug reporter, we found some really interesting behavior with the strings returned from our readdir wrapper:
...
Zubin Mehta; New York Philharmonic - Symphony No.2, Op. 43 - III. Vivacissimo. Finale- Allegro moderto.mp3
Raul Midon - I Would Do Anything.mp3
Steve Wynn - State Trooper.mp3
Tony Bennett - Jeepers Creepers.mp3
Louis Armstrong - I Didn't Know Until You Told Me.mp3
Red Hot Chili Peppers - Savior.mp3
Los Lobos - Oh Yeah.mp3
Jimmy Durante - If I Ha
Mary J. Blige Feat. Bro
Elvis Costello & Allen
Andrew Weil MD & Mark F
...
Whoa! Check that out: right in the middle of the run, filenames start getting truncated. Not coincidentally, this happens right around the ~300 mark, where Beagle stopped seeing files. That's because the last thing DirectoryWalker does before returning a file is check to see if it still exists. Since the truncated filenames don’t exist, they’re not returned.
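To make that failure mode concrete, here is a small sketch of the two steps involved: copying a name into a bounded buffer, then checking that the resulting path exists. Both function names are hypothetical, and entry_exists is only a stand-in for whatever check DirectoryWalker really performs.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/stat.h>

/* Copy an entry name into a bounded buffer, as a readdir() wrapper with a
   max_len argument might. Returns 1 if the name was cut short, 0 if it fit. */
int copy_entry_name (const char *d_name, char *out, int max_len)
{
    int needed;

    if (max_len <= 0)
        return 1;
    needed = snprintf (out, (size_t) max_len, "%s", d_name);
    return needed >= max_len;
}

/* Hypothetical stand-in for the walker's final "does it still exist?" check. */
int entry_exists (const char *dir, const char *name)
{
    char path[4096];
    struct stat st;
    int n = snprintf (path, sizeof path, "%s/%s", dir, name);

    if (n < 0 || (size_t) n >= sizeof path)
        return 0;
    return lstat (path, &st) == 0;
}
```

With a 24-byte buffer (a size picked purely for illustration), "Red Hot Chili Peppers - Savior.mp3" comes back as "Red Hot Chili Peppers -". No file by that name is on disk, so the existence check fails and the entry is dropped without any error.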
So now we know why files aren’t showing up. Time to find out why filenames are being truncated. Going back to our C readdir() helper, the third argument is max_len, the size of the buffer that is passed in as the second argument. In the C# code, we have this line that calls the C function:
r = sys_readdir (dir, buffer, buffer.Capacity);
buffer is the StringBuilder. We create the StringBuilder when our FileEnumerator is created, like so:
StringBuilder name_buffer = new StringBuilder (256);
The 256 there is the capacity of the string buffer. It’s important to set this to something large enough when you’re passing a StringBuilder to unmanaged code, to prevent a buffer overrun. NAME_MAX on Linux is 255, so 256 leaves room for the longest possible file name plus its terminating NUL and is a good choice. Additional debug output in the test program confirmed that the capacity was being reset to something pretty small. A simple fix to ensure that the capacity was at least 256 before calling our readdir() helper was all that was needed:
name_buffer.EnsureCapacity (256);
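The same guard could also live on the native side. As a sketch, and not something Beagle's glue is known to do, a hypothetical buffer_fits_any_name helper could compare max_len against NAME_MAX from <limits.h> (255 on Linux), so that a shrunken buffer gets reported instead of silently producing truncated names.

```c
#define _POSIX_C_SOURCE 200809L
#include <limits.h>

#ifndef NAME_MAX
#define NAME_MAX 255  /* assumption: fallback for systems that leave it undefined */
#endif

/* Returns 1 if a buffer of max_len bytes can hold any single file name
   plus its terminating NUL, 0 if some names would be truncated. */
int buffer_fits_any_name (int max_len)
{
    return max_len >= NAME_MAX + 1;
}
```

A wrapper could call this first and return an error when it fails, which would have turned the missing files into a loud failure at the call site.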
Under the hood, the StringBuilder was resetting itself to a rather small size because the average size of the items was below some threshold. The reason I was never able to reproduce this issue myself is that I was creating extremely large directories of files whose names were only 5 or 6 characters long.
So, is this a bug in Mono? Should the capacity of a StringBuilder never shrink below what it was manually set to (either via the constructor, the Capacity property, or the EnsureCapacity() method)? The behavior appears to be undefined; nothing on MSDN mentions anything about the capacity shrinking. On the other hand, maybe this is exactly why the EnsureCapacity() method exists in the first place.
