I just came across this boneheaded blog post about Google’s newly open-sourced Protocol Buffers:
They claim they could not use XML because ‘it isn’t going to be efficient enough for this scale’. WTF??? If this statement came from someone else, I would understand, but these guys are supposed to KNOW markup.
Speed in a system in [sic] NOT just optimizing loops in code! It is the architecture: messaging, storage, and re-use. Yes, XML can be fat, but so can any other language. And if they took the time to improve the processing libraries instead of creating their own special methods, we would ALL benefit on projects that used XML, not just this so-called ‘protocol’.
And he had the audacity to title the post ‘Google hates XML’.
While the author is right that any attention paid to improving the performance of widely used XML libraries would benefit everyone, he completely ignores protobuf’s strengths, aims, and specific use cases. I suspect he didn’t bother to read any of the documentation. After all, when you’re dealing with XML, everything looks like a nail. Or something like that.
Protobufs aren’t meant to replace XML for publishing data broadly or for providing human-readable document formats. I don’t think anyone is suggesting that we replace HTML with this. They’re for (mostly) well-defined interfaces and for serializing data in a compact, low-latency way. The author at one point suggests:
I bet I could make XML run circles around their system just by simplifying their schema. I once invented a technique called ‘XmlZip’ that would transform long element names and attributes to smaller symbols for faster transfer - why not try that?
But he obviously didn’t read the section on encoding:
Let’s say you have the following very simple message definition:
message Test1 {
  required int32 a = 1;
}
In an application, you create a Test1 message and set a to 150. You then serialize the message to an output stream. If you were able to examine the encoded message, you’d see three bytes:
08 96 01
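If you want to see where those three bytes come from, here is a minimal Python sketch of the wire format as I understand it from the encoding documentation: a key byte built from the field number and wire type, followed by the value as a base-128 varint. The function names encode_varint and encode_int32_field are my own, not part of any protobuf library, and the sketch only handles non-negative values.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (sketch only)."""
    if value < 0:
        raise ValueError("this sketch handles non-negative values only")
    out = bytearray()
    while True:
        low7 = value & 0x7F          # take the lowest 7 bits
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def encode_int32_field(field_number: int, value: int) -> bytes:
    """Encode one varint-typed field: key first, then the value."""
    WIRE_TYPE_VARINT = 0
    key = (field_number << 3) | WIRE_TYPE_VARINT
    return encode_varint(key) + encode_varint(value)


# Field number 1 ("a") set to 150, as in the Test1 example.
encoded = encode_int32_field(1, 150)
print(encoded.hex(" "))  # 08 96 01
```

The 08 is the key (field 1, varint wire type), and 96 01 is 150 split into 7-bit groups, least significant group first.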
Three bytes! You can’t do anything in XML in three bytes. The simplest XML document you can have, which conveys no information, is 4 bytes: <a/>. That same message definition would look something like this in XML:
<?xml version="1.0"?>
<Test1>
<int32 value="150" />
</Test1>
That’s 61 bytes by my count, and even if you condensed it into tight, human-unreadable XML, you wouldn’t get anywhere near 3 bytes. And if your messages really are that small, gzip’s overhead is counterproductive and actually produces a bigger file. (82 bytes, from my testing.) If the messages were large enough for gzip to save space, you’d pay additional latency for the CPU time spent compressing and decompressing.
When you’re talking about pushing huge amounts of data on a near-saturated gigabit Ethernet link, an order of magnitude makes a big difference.
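If you’d like to check the sizes yourself, here is a rough Python sketch. It assumes the XML is laid out one element per line with no indentation and a trailing newline, which comes to 61 bytes. The gzip figure depends on the implementation and on what goes into the header (a stored filename, for instance), so don’t expect it to match my 82 exactly. The point is only that it comes out larger than the uncompressed input.

```python
import gzip

# The XML rendering from the post: one element per line, trailing newline.
# This whitespace layout is an assumption; other layouts change the count.
xml_bytes = (
    '<?xml version="1.0"?>\n'
    "<Test1>\n"
    '<int32 value="150" />\n'
    "</Test1>\n"
).encode("ascii")

# The three protobuf bytes quoted from the encoding documentation.
protobuf_bytes = bytes([0x08, 0x96, 0x01])

# gzip wraps the compressed stream in a fixed header and trailer,
# which dominates when the input is this small.
gzipped_xml = gzip.compress(xml_bytes)

print("protobuf:  ", len(protobuf_bytes), "bytes")
print("xml:       ", len(xml_bytes), "bytes")
print("xml + gzip:", len(gzipped_xml), "bytes")
print("xml / protobuf ratio:", round(len(xml_bytes) / len(protobuf_bytes), 1))
```

For this one-field message, the ratio line shows the plain XML at roughly twenty times the size of the protobuf encoding.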
Protocol buffers aren’t going to replace XML — they’re not even really aimed at the same problem — but they are a better solution for certain use cases. Such a huge part of software development is using the right tool for the job, and just because XML can solve the problem doesn’t mean it should by default. Do the research and weigh the pros and cons. Otherwise, like using an O(n^2) sorting algorithm when there are vastly better alternatives, it’s just lazy programming.
