Google releases Protocol Buffers

Google has announced the release of its "Protocol Buffers" code under the Apache license. "Protocol Buffers allow you to define simple data structures in a special definition language, then compile them to produce classes to represent those structures in the language of your choice. These classes come complete with heavily-optimized code to parse and serialize your message in an extremely compact format." In other words, it's another XDR/RPC/pickle implementation, but tuned to Google's performance needs.

Google releases Protocol Buffers

Posted Jul 8, 2008 15:59 UTC (Tue) by psadauskas (subscriber, #46534) [Link]

Can somebody explain to me how this is any better than, or even different from, JSON? Looks
like the only real difference is they use '=' rather than ':' to separate the key-value pairs.

Google releases Protocol Buffers

Posted Jul 8, 2008 16:09 UTC (Tue) by pharm (guest, #22305) [Link]

As I understand it, the Google code serializes the data structures to a binary interchange
format (like ASN.1 does). JSON on the other hand is text based.

Both have their place of course.

Google releases Protocol Buffers

Posted Jul 8, 2008 16:10 UTC (Tue) by pharm (guest, #22305) [Link]

NB: That should be 'like ASN.1 BER does' .

Google releases Protocol Buffers

Posted Jul 8, 2008 18:01 UTC (Tue) by wahern (subscriber, #37304) [Link]

Why do we keep re-inventing ASN.1? It's a stable standard, widely used. Unfortunately, the
only decent Free Software ASN.1 compiler is asn1c. Which, BTW, looks to be preferable to
Protocol Buffers because it supports stateful parsing (i.e. trickling reads/writes), whereas
it seems like Protocol Buffers block on I/O--non-starter for me. And many more languages have
ASN.1 libraries (albeit most of them suck...).

Oh well. One day... when I get around to it... ;)


Google releases Protocol Buffers

Posted Jul 8, 2008 19:15 UTC (Tue) by jkohen (subscriber, #47486) [Link]

Well, Protocol Buffers have a really simple syntax, closer to the C language. That's a plus
when you need to work with them all the time but can't be bothered to learn yet another
definition language.

Each thing has its purpose, that's why we keep reinventing some of them.

Google releases Protocol Buffers

Posted Jul 8, 2008 21:51 UTC (Tue) by dw (subscriber, #12017) [Link]

For a start, because there's no correlation between a huge, somewhat incomprehensible,
verbose standard written in very generalised English and buggy, breakage-prone
implementations that have resulted in more security bugs than you can shake a stick at (I
believe the kernel had an ASN.1 BER related bounds checking error in the past month?)

The syntax is also immediately grokkable (much the same as Facebook's Thrift), unlike ASN.1.
It's also much less featureful, but then, somehow a large proportion of all the engineers that
work at that company have found it "just enough". It's a useable 80% solution, whereas ASN.1
approaches 100% unuseable.

As an engineer who has worked for that company, protocol buffers were one of the cooler,
"wasn't that bloody obvious?" enabling technologies that I saw while I was there. It makes
extremely light work of several problems (inter-language interoperability, building "good
enough" file formats, building robust parsers). Despite not being anything new, the fact
Google are endorsing this throws much needed weight behind non-textual data interchange
formats (ie. not XML).

I notice over at Tom White's blog (Hadoop guy), he's already talking about how this might be
integrated.

A while ago a friend and I were toying with ways of combining SQL queries with the web,
in the form of set-returning functions that took URLs and scanned the XML within for RSS-style
containers, or extracted instances of some microformat. The idea was to take a 50%+ understood
technology and mix it with the "real coolness" of the web to allow time challenged developers
to build data apps that really pack a punch. We wanted something like:

SELECT A.name AS "Common Friend" FROM my_addressbook AS A JOIN
hcard_url("http://other/friend/addresses.xml") AS B ON A.email = B.email;

Where that URL could be accessed using some imaginary authentication scheme, and discovered
using some imaginary data/service discovery (e.g. a combination of oAuth and YADIS). We found
toy examples that worked, e.g. using the cartesian product of 2 ATOM feeds to discover what
keywords authors both published about, but for anything *really* cool, the technology simply
isn't there yet.

For something like running a "web query" joining the names of all bankrupt individuals
(sourced from imaginary court web site) with directors names obtained from a Companies House
feed, in addition to some simple solution for not retrieving the same data set over and over
again, the problem of added bandwidth costs for someone like Companies House to expose such a
potentially large dataset might negate the use of XML entirely. Going by Google's FAQ, this is
clearly a limiting problem as one of the reasons they give for open sourcing protocol buffers
is to allow customers to use it to talk to them more efficiently.

We're so tantalisingly close to an Internet where data flows freely, yet in some ways so far,
far away. Even the UK cabinet office has recently released a data mashup challenge
(<http://news.bbc.co.uk/2/hi/technology/7484131.stm>), yet the newly released data comes as
ZIP files, Excel spreadsheets and God knows what else. A simple binary interchange format
isn't a magic cure-all, but it will surely aid in some CTO's attempt at releasing hitherto
unforeseen data to the world for the benefit of all.

Apologies for the rant, but data interoperability is *really* important for everyone. :)

Google releases Protocol Buffers

Posted Jul 8, 2008 16:34 UTC (Tue) by psadauskas (subscriber, #46534) [Link]

Oh, I missed that. I thought I saw a sample of the output that looked like:

{
  foo = "bar"
}

...and assumed that was the format sent over the wire. Now I see the binary encoding page.
http://code.google.com/apis/protocolbuffers/docs/encoding...


Google releases Protocol Buffers

Posted Jul 8, 2008 17:45 UTC (Tue) by zooko (subscriber, #2589) [Link]

Hm...  This suggests another alternative: a compiler that produces and consumes packed binary
encodings of your JSON structures, or possibly even converts your JSON structures to Google
Protocol Buffer structures and vice versa.

Google releases Protocol Buffers

Posted Jul 8, 2008 16:30 UTC (Tue) by jwb (guest, #15467) [Link]

Are you joking?  JSON is 100% strings.  It's a deeply inefficient encoding.  In JSON the field
"blah:12345678" is represented as 13 bytes.  In this wire format, the same field is
represented in 5 bytes (a 1-byte key plus a 4-byte varint).  Smaller numbers like "blah:1" in
JSON would be only 2 bytes in this format.
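To make the byte counts concrete, here is a minimal Python sketch of how an integer field is laid out according to the encoding page: a key holding the field number and wire type, followed by the value as a varint. Assigning field number 1 to "blah" is an assumption made only for this illustration, and the helper names are hypothetical rather than part of Google's library.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer 7 bits per byte, low-order group first."""
    if value < 0:
        raise ValueError("this sketch handles only non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow: set the top bit
        else:
            out.append(low7)         # last byte: top bit clear
            return bytes(out)


WIRE_VARINT = 0  # wire type used for varint-encoded integers


def encode_int_field(field_number: int, value: int) -> bytes:
    """Key (field number and wire type in one varint) followed by the value."""
    key = (field_number << 3) | WIRE_VARINT
    return encode_varint(key) + encode_varint(value)


# Field number 1 for "blah" is an assumption for this example.
small = encode_int_field(1, 1)
large = encode_int_field(1, 12345678)

print(len("blah:12345678"))        # 13 characters as text
print(len(small), small.hex())     # 2 bytes: 0801
print(len(large), large.hex())     # 5 bytes: 08cec2f105
```

Under these assumptions the value 12345678 takes a 4-byte varint plus a 1-byte key, and the value 1 takes one byte plus the same key.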

The advance here is not really in the wire format, which is very similar to past formats.  The
new thing here is the compiler which generates the marshalling and unmarshalling code for a
variety of languages.

Google releases Protocol Buffers

Posted Jul 8, 2008 16:43 UTC (Tue) by jhs (subscriber, #12429) [Link]

I haven't looked at the code yet; but my hope is that this is something similar to the Construct Python toolkit (Construct website). The idea is that you declare how to parse data and then it builds the code to pull it off the wire directly into your data structures.

I believe that, at least for Construct, the whole idea came about because of the author's disgust at how low-level and by-hand the Wireshark dissectors and plugins were. Apparently, everything is just procedural C code with direct writing to the UI layer, etc.

Google releases Protocol Buffers

Posted Jul 8, 2008 16:49 UTC (Tue) by jhs (subscriber, #12429) [Link]

Whoops, I meant to say imperative C code there.

Google releases Protocol Buffers

Posted Jul 8, 2008 16:52 UTC (Tue) by NAR (subscriber, #1313) [Link]

It seems that they didn't reinvent CORBA, only parts of it :-)

Google releases Protocol Buffers

Posted Jul 8, 2008 23:08 UTC (Tue) by alextingle (subscriber, #20593) [Link]

Yeah, they reinvented the only parts of CORBA that CORBA got right!

Google releases Protocol Buffers

Posted Jul 9, 2008 21:55 UTC (Wed) by njs (subscriber, #40338) [Link]

Err, I seem to recall there being all sorts of horrible decisions in IIOP's encoding design?
Length-prefixing without chunking, screwy games with endianness negotiation, structure
*padding* -- in a *network protocol* -- that doesn't even match any real machine's padding...?

Code Generation

Posted Jul 8, 2008 19:17 UTC (Tue) by larryr (guest, #4030) [Link]

I would be more inclined to advocate this if it did not prefer code generation, instead just providing libraries to read the schema files and (un)pack the data, with a separate option for static-typing lovers (only) to generate code.
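As a rough idea of what the library-only approach could look like, here is a Python sketch that packs a record at run time from a schema held as plain data, with no generated classes. The SCHEMA dictionary stands in for a parsed schema file, and the names SCHEMA and pack are hypothetical; this is not how Google's released library is structured, and only two field kinds are handled.

```python
def encode_varint(value: int) -> bytes:
    """Base-128 varint: 7 bits per byte, top bit set on all but the last byte."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


# Hypothetical in-memory schema, as if loaded from a schema file at run time:
# field name -> (field number, kind)
SCHEMA = {
    "id": (1, "varint"),
    "name": (2, "string"),
}


def pack(schema: dict, record: dict) -> bytes:
    """Serialize `record` by looking each field up in `schema`."""
    out = bytearray()
    for name, value in record.items():
        number, kind = schema[name]
        if kind == "varint":
            out += encode_varint((number << 3) | 0)   # wire type 0: varint
            out += encode_varint(value)
        elif kind == "string":
            raw = value.encode("utf-8")
            out += encode_varint((number << 3) | 2)   # wire type 2: length-delimited
            out += encode_varint(len(raw))
            out += raw
        else:
            raise ValueError(f"kind {kind!r} not handled in this sketch")
    return bytes(out)


packed = pack(SCHEMA, {"id": 150, "name": "testing"})
print(packed.hex())
```

The trade-off is the usual one: nothing to regenerate when the schema changes, at the cost of a dictionary lookup per field and no static type checking.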

Larry@Riedel.org

Code Generation

Posted Jul 10, 2008 7:21 UTC (Thu) by k8to (subscriber, #15413) [Link]

Well the nice thing of course is if this "just enough" solution proves to catch on, you'll get
your wish fulfilled.

Google releases Protocol Buffers

Posted Jul 8, 2008 22:29 UTC (Tue) by endecotp (guest, #36428) [Link]

This is one of those areas where the (perceived) benefits of adopting an existing format are
frequently outweighed by the (perceived) benefits of inventing a new one.  So we have a
profusion of formats, which need not be a bad thing except that (a) it's easier to document
your format if all you need to say is "this is well-known format XYZ", and (b) as your needs
change, you may find yourself re-inventing stuff that you could have got from elsewhere.

A couple of things strike me about this Google format:
- There doesn't seem to be any explicit support for sequences (except that strings are
sequences of characters).  I would think that most protocols have some sort of repetition in
them somewhere.
- They must find the variable-length integer feature very beneficial, since otherwise they
wouldn't have included it.  [They encode integers with 7 bits per byte, with all but the last
byte having the top bit set.]  Perhaps we should think about using this for some in-memory
data too.

Google releases Protocol Buffers

Posted Jul 9, 2008 1:58 UTC (Wed) by pflugstad (subscriber, #224) [Link]

> - There doesn't seem to be any explicit support for sequences (except that strings are sequences of characters). I would think that most protocols have some sort of repetition in them somewhere.

They have the repeated keyword...

> - They must find the variable-length integer feature very beneficial, since otherwise they wouldn't have included it. [They encode integers with 7 bits per byte, with all but the last byte having the top bit set.] Perhaps we should think about using this for some in-memory data too.

Some of the ASN.1 encoding schemes do this, and it's probably older than that. In most protocols, you see 0 and small values on the wire A LOT, so this does save space. They also have fixed sized integers.

Google releases Protocol Buffers

Posted Jul 9, 2008 21:52 UTC (Wed) by njs (subscriber, #40338) [Link]

> [They encode integers with 7 bits per byte, with all but the last byte having the top bit set.]

This sounds like the format that DWARF (IIRC) calls "uleb128" = "unsigned little endian
base-128".  It's nice because when putting integers into network protocols, you always have
this little voice in your head saying "it should be efficient!  use as few bytes as possible!"
and another voice saying "but what if someone needs to encode a large number!  remember 32-bit
file offsets!  use a 64-bit field just in case!".  But if you have this kind of variable width
format, you can just pick that and make both voices happy and get on with actual work, instead
of trying to make some agonizing Scylla vs. Charybdis guess about *exactly* how large the
numbers will be in practice.  And then for your actual code you can just unpack it into a
32-bit quantity in memory or whatever, because it's easy to recompile your code later if that
turns out to be wrong, but hard to change an interchange protocol.
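The decoding side is short enough to sketch in Python. The function below reads one base-128 integer as described in this thread (7 data bits per byte, top bit set on every byte except the last), and a second helper illustrates the point about choosing the in-memory width separately from the wire format. Both function names are hypothetical.

```python
from typing import Tuple


def decode_uleb128(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (value, next_pos) for the base-128 integer starting at data[pos]."""
    result = 0
    shift = 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated integer")
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift   # low-order 7-bit group comes first
        if not byte & 0x80:                # top bit clear: this was the last byte
            return result, pos
        shift += 7


def fits_in_bits(value: int, bits: int) -> bool:
    """True if `value` can be stored in an unsigned field of `bits` bits."""
    return value < (1 << bits)


# A small value costs one byte on the wire; 2**32 costs five.
small, _ = decode_uleb128(b"\x01")
medium, _ = decode_uleb128(b"\xac\x02")
big, _ = decode_uleb128(b"\x80\x80\x80\x80\x10")

print(small, medium, big)
print(fits_in_bits(medium, 32), fits_in_bits(big, 32))
```

The wire format carries 2**32 without any change; only the receiving program's choice of a 32-bit variable would need revisiting.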

Google releases Protocol Buffers

Posted Jul 10, 2008 11:45 UTC (Thu) by nix (subscriber, #2304) [Link]

It's the UTF-8 encoding scheme applied to arbitrary 8-bit-byte numeric 
data, pretty much.

Google releases Protocol Buffers

Posted Jul 10, 2008 23:36 UTC (Thu) by njs (subscriber, #40338) [Link]

Actually, that's what I thought until I looked up the actual UTF-8 encoding, which turns out
to be quite different, and more complicated.  UTF-8 is a kind of length-encoding (in unary!)
stuck into the high bits of the first byte, then uses leftover bits of that byte plus the
low-order 6 bits of following bytes to actually store the numeric data.  It isn't defined
(without some extension, anyway) for data that needs more than 7 bytes, because you overflow
that first byte's ability to encode the length.  Compared to uleb128, it's space inefficient
and somewhat annoying to handle (there are multiple representations for the same numbers so
you have to worry about canonicalization), etc.

This is all for good reason, because UTF-8 has to make sure that single-byte characters look
like ASCII, but no byte of a multibyte sequence looks like valid ASCII, and that if you lose the
beginning of a stream you can still figure out where the character boundaries are, and that it
will never contain the UTF-16 endianness signature bytes 0xFF and 0xFE, that memcmp() will sort
by Unicode codepoint, etc. etc.  It makes it a rather poor general-purpose integer encoding,
though.
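A few of these UTF-8 properties can be checked from Python using only the built-in codec. The helper names below are hypothetical, and the sample code points are arbitrary choices for illustration.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Read the unary length prefix: the count of leading 1 bits in the first byte."""
    ones = 0
    mask = 0x80
    while first_byte & mask:
        ones += 1
        mask >>= 1
    if ones == 0:
        return 1            # 0xxxxxxx: a single ASCII byte
    if ones == 1:
        raise ValueError("10xxxxxx is a continuation byte, not a first byte")
    return ones


def is_valid_utf8(raw: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts `raw`."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Arbitrary sample code points needing 1, 2, 3 and 4 bytes.
samples = [chr(0x41), chr(0xE9), chr(0x20AC), chr(0x1F600)]
encoded = [s.encode("utf-8") for s in samples]

for raw in encoded:
    # The first byte alone announces how long the sequence is.
    assert utf8_sequence_length(raw[0]) == len(raw)
    if len(raw) > 1:
        # No byte of a multibyte sequence falls in the ASCII range.
        assert all(b >= 0x80 for b in raw)
    # 0xFE and 0xFF never occur.
    assert 0xFE not in raw and 0xFF not in raw

# Sorting the encoded bytes gives the same order as sorting by code point.
by_bytes = sorted(samples, key=lambda s: s.encode("utf-8"))
by_codepoint = sorted(samples, key=ord)

# The two-byte "overlong" spelling of U+0000 is rejected by a strict decoder.
overlong_accepted = is_valid_utf8(b"\xc0\x80")
print(by_bytes == by_codepoint, overlong_accepted)
```

The rejected overlong form is the canonicalization concern in practice: a decoder has to refuse it rather than treat it as another spelling of the same number.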

Google releases Protocol Buffers

Posted Jul 9, 2008 4:36 UTC (Wed) by pphaneuf (subscriber, #23480) [Link]

A few things I haven't seen mentioned yet are that it is designed for a very compact representation (in fact, if all of your fields are optional, a protocol buffer with none of them set is zero bytes!), and for incremental improvements with backward compatibility (if you only add optional fields, a program expecting the new version will be happy, and there might be something for forward compatibility, I don't remember).
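The compatibility story follows from the wire format: every field carries its own number and wire type, so a reader can step over fields it does not recognise. Here is a minimal Python sketch of such a reader, handling only varint and length-delimited fields. The field names and the OLD_READER / NEW_READER tables are hypothetical stand-ins for two versions of a message definition.

```python
from typing import Dict, Tuple


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Return (value, next_pos) for the varint starting at buf[pos]."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_known(buf: bytes, known: Dict[int, str]) -> dict:
    """Decode the fields listed in `known`; silently skip everything else."""
    fields = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        number, wire_type = key >> 3, key & 0x7
        if wire_type == 0:            # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:          # length-delimited (strings, bytes, sub-messages)
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if number in known:
            fields[known[number]] = value
    return fields


# Hypothetical schema versions: the newer one adds optional field 2.
OLD_READER = {1: "id"}
NEW_READER = {1: "id", 2: "name"}

old_message = bytes.fromhex("089601")                      # field 1 = 150
new_message = bytes.fromhex("089601120774657374696e67")    # plus field 2 = "testing"

print(parse_known(b"", NEW_READER))           # nothing set: zero bytes on the wire
print(parse_known(old_message, NEW_READER))   # new reader, old data
print(parse_known(new_message, OLD_READER))   # old reader, new data
```

In this sketch the new reader simply sees the added field as absent when given old data, and the old reader skips the field it has never heard of.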

It's nothing all that complicated, is well done, so if it's useful to you, huzzah, if not, then oh well.

Google releases Protocol Buffers

Posted Jul 18, 2008 10:20 UTC (Fri) by willp (subscriber, #52971) [Link]

The only bits that disappoint me are that:
* it doesn't handle framing
* it doesn't provide message type identification in the wire encoding
* asynchronous RPC is left out

No Framing:
Streaming multiple messages over a socket, or into a file, has to include a parsable header.
And you can't just define a Header message, because its length would be variable, so you need
framing to exist outside the messages to indicate their lengths (just like a packet header).
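For illustration, the simplest form of such framing is a fixed-size length prefix in front of each serialized message. The Python sketch below uses a 4-byte big-endian length; that width and byte order are assumptions of this example, not anything Protocol Buffers specifies, and the function names are hypothetical.

```python
import io
import struct


def write_frame(stream, payload: bytes) -> None:
    """Write a 4-byte big-endian length, then the serialized message bytes."""
    stream.write(struct.pack("!I", len(payload)))
    stream.write(payload)


def read_frames(stream):
    """Yield each payload in turn until the stream is exhausted."""
    while True:
        header = stream.read(4)
        if not header:
            return                      # clean end of stream
        if len(header) < 4:
            raise ValueError("truncated frame header")
        (length,) = struct.unpack("!I", header)
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated frame payload")
        yield payload


# Three already-serialized messages, including an empty one.
messages = [bytes.fromhex("089601"), b"", bytes.fromhex("0801")]

buffer = io.BytesIO()
for message in messages:
    write_frame(buffer, message)

buffer.seek(0)
recovered = list(read_frames(buffer))
print(recovered == messages)
```

The empty message is the case that makes an external length unavoidable: with nothing on the wire, only the frame header tells the reader a message was sent at all.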

No Message Type Identification:
When you get a new message over a socket, you have to know *in advance*, exactly which kind of
message it is, to route the bits to the correct class decoder.  This means asynchronous RPC
with more than one message type requires you to embed a message type identifier into a header
(next to the frame length field, I'd imagine).  Also, reading .protos out of a file means you
need to either stick to one single message, or use one big encapsulating message{} which
contains optional sub-message{}es, turning your class definition into a gigantic "union"...
Very hard to do OOP with the limitation that you must support every kind of message in a
single class- even with nested internal classes, it gets bulky.

In a .proto, "message GetName { ... }" and "message GetAge { ... }" don't get numeric
identifiers themselves, so there's no way to define *in the .proto* file itself, a numeric
identifier that you can re-use in code, later, for identifying one message from another.  If
this was solved 

No Asynchronous RPC (builtin)
In addition to the frame length and message type number, you also need a per-request sequence
number to really implement asynchronous messaging, or else you will have trouble sending more
than one request via the same message type to a server.  How else to properly match up server
replies (possibly out of order!) to your requests, without a request identifier?

These aren't really hard problems to solve.  A framing header of 12 bytes (4 for length,
4 for message ID, 4 for sequence ID) would nail it, but it pushes the wallpaper bubble into a
new realm:
numeric message type ID management.  The direct path would be to define an "enum { ... }" in
the .proto which has the mapping, but it feels a bit redundant, and renaming a message means
you have to change the name in two places, or possibly have a syntactically valid (protoc will
compile it), but internally inconsistent and unusable .proto definition.
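A Python sketch of the 12-byte header described above, with replies matched to outstanding requests by sequence ID. The numeric message IDs (GET_NAME = 1, GET_AGE = 2), the big-endian layout and the helper names are all assumptions for this example; they are exactly the hand-maintained mapping the comment is worried about.

```python
import struct

# length, message ID, sequence ID: three unsigned 32-bit big-endian integers
HEADER = struct.Struct("!III")

# Hypothetical numeric IDs, maintained by hand alongside the .proto
GET_NAME = 1
GET_AGE = 2


def pack_message(message_id: int, sequence_id: int, payload: bytes) -> bytes:
    """Prefix a serialized message with the 12-byte header."""
    return HEADER.pack(len(payload), message_id, sequence_id) + payload


def unpack_messages(data: bytes):
    """Yield (message_id, sequence_id, payload) for each framed message."""
    pos = 0
    while pos < len(data):
        length, message_id, sequence_id = HEADER.unpack_from(data, pos)
        pos += HEADER.size
        yield message_id, sequence_id, data[pos:pos + length]
        pos += length


# Two requests are outstanding, keyed by their sequence IDs.
pending = {7: "GetName", 8: "GetAge"}

# The server answers out of order: sequence 8 first, then 7.
wire = pack_message(GET_AGE, 8, b"42") + pack_message(GET_NAME, 7, b"alice")

matched = []
for message_id, sequence_id, payload in unpack_messages(wire):
    request = pending.pop(sequence_id)   # the sequence ID finds the original request
    matched.append((request, message_id, payload))

print(HEADER.size, matched, pending)
```

The message ID tells the receiver which decoder to hand the payload to, and the sequence ID tells it which request the reply belongs to; neither can be recovered from the payload bytes alone.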

My only other complaint is no support for hashmaps/dictionaries as a primitive...  Facebook's
Thrift has a much richer set of primitives...  But the (current) binary encoding on Thrift is
much fatter than Google's ProtocolBuffers, and yet doesn't solve the Message type ID problem
either, so neither one fits what I want for platform-independent RPC code generation for
modern high level languages...

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds