JMAP — reinventing IMAP
Posted Mar 27, 2016 9:41 UTC (Sun) by kentonv (subscriber, #92073)
In reply to: JMAP — reinventing IMAP by Cyberax
Parent article: JMAP — reinventing IMAP
I guess you haven't worked with Protobuf or Cap'n Proto, which are binary formats that handle versioning arguably more cleanly than JSON, yet have type-safe schemas unlike BSON?
Posted Mar 27, 2016 20:09 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
I worked with protobuf quite a bit (haven't had a chance to play with Cap'n Proto yet). For example, I recently reverse engineered the Android Auto protocol ( https://github.com/Cyberax/aauto ), which is totally protobuf-based.
Protobuf doesn't really have a schema. It's actually more like JSON: the structures simply use tag numbers instead of field names, and there are slightly more data types available. Versioning support is mostly non-existent - unknown fields are just ignored, and conflicting definitions cause havoc.
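The tags-instead-of-names point is easy to see at the byte level. Here is a minimal sketch of a raw decoder, assuming the standard wire-format rules (field key = field_number << 3 | wire_type, base-128 varints); it is illustrative only, not a full implementation:

```python
# Minimal sketch of the protobuf wire format: only numeric tags and
# values appear on the wire -- field names never do.

def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, new_pos)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7

def decode_raw(buf):
    """Yield (field_number, value) pairs from a serialized message."""
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field, wire_type = key >> 3, key & 7
        if wire_type == 0:            # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:          # length-delimited (bytes/string/submessage)
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise NotImplementedError(f"wire type {wire_type}")
        yield field, value

# 0x08 0x96 0x01 is the canonical "field 1 = 150" example from the
# protobuf docs; field 2 here carries the bytes b"hi".
msg = bytes([0x08, 0x96, 0x01, 0x12, 0x02]) + b"hi"
print(list(decode_raw(msg)))   # [(1, 150), (2, b'hi')]
```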
And there's now a canonical JSON mapping for protobuf as well.
Several years ago I read a post about protobuf by a Google engineer - he wrote that protobuf actually slightly predates JSON, and that these days they would have just used pure JSON instead.
Posted Mar 28, 2016 17:34 UTC (Mon)
by kentonv (subscriber, #92073)
> Protobuf doesn't really have a schema.
That's not true at all. https://developers.google.com/protocol-buffers/docs/proto
The schema is not transmitted on the wire. The schema is used to encode/decode on each end.
The raw wire format is numeric tags and values, but no one actually uses protobuf without schemas.
"Reverse engineering" protobufs usually means:
1. Feed a few messages to protoc --decode_raw.
2. Guess the meaning of each numeric tag.
3. Write a .proto file assigning names and types to the tags.
Or, better yet:
1. Yank the protobuf descriptors out of the app binary, which provide the complete schema.
I don't see any .proto files in your repo nor any reference to libprotobuf. Did you reverse engineer from the byte level and write your own decoder?
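Step 3 typically produces something like the following sketch. The tag numbers come from the wire; the message and field names are pure guesses, invented here for illustration:

```proto
// Hypothetical result of reverse engineering: tags 1 and 2 were
// observed on the wire; every name and even the types are guesses.
syntax = "proto2";

message ChannelOpen {
  optional int32 channel_id = 1;   // guessed meaning of tag 1
  optional int32 priority   = 2;   // guessed meaning of tag 2
}
```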
> Versioning support is mostly non-existent - unknown fields are just ignored
Yes, that's the best way to do versioning, and is exactly the same strategy people use with JSON. Using actual version numbers is a pain as you end up with lots of branchy code to handle every version. That's what Google had before protobuf, and protobuf was explicitly designed to fix it.
Protobuf is better than JSON, though, because with Protobuf you have a schema where you can declare default values to replace things missing on the wire, whereas with JSON you have to check for the existence and type of every single field before accessing it or otherwise risk unexpected exceptions and security bugs.
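The difference shows up in the calling code. A rough sketch, with field names invented for illustration:

```python
import json

# With plain JSON you must defensively check existence and type of
# every field before using it (names invented for illustration):
def retry_count_json(raw):
    obj = json.loads(raw)
    v = obj.get("retry_count")
    if not isinstance(v, int):
        return 3          # hand-rolled default, repeated at every call site
    return v

# A schema-driven decoder centralizes the same defaulting once --
# roughly what protobuf's generated accessors do for missing fields:
SCHEMA = {"retry_count": (int, 3), "host": (str, "localhost")}

def decode_with_schema(raw, schema=SCHEMA):
    obj = json.loads(raw)
    out = {}
    for name, (typ, default) in schema.items():
        v = obj.get(name)
        out[name] = v if isinstance(v, typ) else default
    return out

print(retry_count_json('{}'))                       # 3
print(decode_with_schema('{"retry_count": 7}'))     # {'retry_count': 7, 'host': 'localhost'}
```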
> and conflicting definitions cause havoc.
In my experience this is not a problem people run into often. You have one owner of the protocol who decides which changes become official. If you want a protocol to be extensible, you use "extensions" which allow third parties to extend the protocol without conflicting (or in proto3 you use the "Any" type).
(Moreover, conflicting definitions would cause an equal amount of havoc for JSON.)
> he wrote that protobuf actually slightly predates JSON and these days they would have just used pure JSON instead.
Sorry, that post was wrong. That's definitely not the opinion held by senior engineers at Google.
1. The importance of type-safety and the implicit documentation provided by schemas is widely recognized inside Google.
2. The vast majority of code at Google is written in statically-typed languages (C++ and Java) where JSON is highly annoying to use due to being dynamic. Protobuf uses schemas to generate classes with pleasant interfaces.
3. Protobuf parsing as-is is responsible for several percent of CPU cycles fleet-wide -- some servers report 30% or more. That equates to many millions of dollars annually. JSON would be an order of magnitude slower, which would cost a massive amount of money.
4. Google stores petabytes of data in Protobuf format. This data would be much larger as JSON. No, compression doesn't magically fix it (compressed protobufs are still much smaller than compressed JSON, particularly for highly-structured (i.e. not textual) data).
5. Similarly, the network bandwidth overhead would be unacceptable.
6. The latency cost of encoding and compressing JSON would be unacceptable for many systems in Google even if there were CPU cycles to spare.
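On point 4, a toy illustration of the size gap. Fixed-width packed records stand in here for protobuf's varint encoding, so the numbers are only indicative, and the record shape is invented:

```python
import json, struct, zlib

# 1000 invented records with three small integer fields each.
records = [{"user_id": i, "score": i * 37 % 1000, "flags": i % 8}
           for i in range(1000)]

as_json = json.dumps(records).encode()
# "<IHB" = little-endian uint32 + uint16 + uint8 = 7 bytes per record.
as_packed = b"".join(struct.pack("<IHB", r["user_id"], r["score"], r["flags"])
                     for r in records)

print(len(as_json), len(as_packed))   # packed is several times smaller
print(len(zlib.compress(as_json)), len(zlib.compress(as_packed)))
```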
Posted Mar 28, 2016 19:26 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
> The raw wire format is numeric tags and values, but no one actually uses protobuf without schemas.
Well, I do.
> Yank the protobuf descriptors out of the app binary, which provide the complete schema.
That might not be legal, though.
> I don't see any .proto files in your repo nor any reference to libprotobuf. Did you reverse engineer from the byte level and write your own decoder?
Yep. Protobuf is self-describing so it wasn't complicated. And adding .proto files into the mix was just not worth it.
> Yes, that's the best way to do versioning, and is exactly the same strategy people use with JSON.
But that's exactly my point. JSON is semantically very similar to the protobuf wire format - the differences are really minor (tags instead of names, and more integer types in protobuf).
All the protobuf "smarts" are in the mapping layer which enforces the schema, provides default values and so on. This mapping layer can be built atop JSON just as easily (as many people have done, many times in many companies) to provide nice statically-typed interfaces.
You can _almost_ treat protobufs as an encoding for JSON. Contrast it with ASN.1 PER, where you actually have to use the schema to decode raw messages, as field types and offsets are not transmitted on the wire.
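The mapping-layer idea described above is straightforward to sketch: a thin schema-enforcing wrapper over JSON that supplies types and defaults. Field names are invented for illustration:

```python
import json
from dataclasses import dataclass, fields

# Sketch of a statically-typed mapping layer over JSON: the dataclass
# plays the role of the schema, supplying types and default values.
@dataclass
class MailboxInfo:
    name: str = "INBOX"
    unread: int = 0

    @classmethod
    def from_json(cls, raw):
        obj = json.loads(raw)
        kwargs = {}
        for f in fields(cls):
            v = obj.get(f.name, f.default)
            if not isinstance(v, f.type):
                raise TypeError(f"{f.name}: expected {f.type.__name__}")
            kwargs[f.name] = v
        return cls(**kwargs)

# Unknown fields are ignored, missing ones get schema defaults:
m = MailboxInfo.from_json('{"name": "Spam", "ignored_new_field": true}')
print(m)   # MailboxInfo(name='Spam', unread=0)
```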
Posted Mar 28, 2016 20:04 UTC (Mon)
by kentonv (subscriber, #92073)
But once you have those generated classes, it makes no difference to the developer whether the bytes on the wire were JSON or Protobuf. Protobuf can produce a textual representation for debugging as needed. Why waste cycles and bytes encoding human-readable text all the time?
It sounds like you're the kind of person who reads raw network dumps a lot, but that you're also the kind of person who doesn't like to use the tools provided to you for this purpose, so I guess that would explain a preference for human-readable messages on the wire. But, I think between using the tools and spending millions of dollars on additional computer hardware, using the tools seems more reasonable.
> And adding .proto files into the mix was just not worth it.
Well, using libprotobuf and a .proto file would have saved you from writing quite a bit of code.
Posted Mar 28, 2016 21:19 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
> Why waste cycles and bytes encoding human-readable text all the time?
But it doesn't do it normally. The amount of wasted cycles isn't noticeable even at 10G wire speeds.
> It sounds like you're the kind of person who reads raw network dumps a lot
I don't read network dumps normally, but I do have to do it now and then - usually during a high-stress breakage situation.
> but that you're also the kind of person who doesn't like to use the tools provided to you for this purpose, so I guess that would explain a preference for human-readable messages on the wire.
I actually like tools that map a domain object model into messages/database/whatever. But they do have their price in being opaque when you have to diagnose a problem. JSON helps in this regard by making the whole stack less opaque.
> Well, using libprotobuf and a .proto file would have saved you from writing quite a bit of code.
Probably not in this case.
Posted Mar 28, 2016 21:47 UTC (Mon)
by kentonv (subscriber, #92073)
Uh, what? The absolute fastest JSON parsers top out around a gigabit per second, consuming 100% of CPU time on parsing, in idealized benchmark scenarios. You're asserting that you can do 10gbps and the CPU usage isn't even noticeable?
Posted Mar 28, 2016 22:05 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
Yes, when compared to parsing protobufs. The fastest possible JSON parser is based on bitslicing and SSE: http://parabix.costar.sfu.ca/ - with it you can get 10G performance.
Posted Mar 28, 2016 22:32 UTC (Mon)
by kentonv (subscriber, #92073)
Generally the fastest JSON parser is RapidJSON. Protobuf beats it handily (3x or more) in most benchmarks, e.g.
https://github.com/erickt/rust-serialization-benchmarks
(Cap'n Proto in turn handily beats Protobuf and can in fact saturate a 10gbps link.)
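For a feel of why skipping text parsing wins, here is a toy micro-benchmark. It compares Python's json module against a fixed-width binary unpack, not real protobuf or Cap'n Proto, and absolute numbers are machine-dependent:

```python
import json, struct, timeit

# Decode the same three integers from JSON text and from a binary
# record. The point is only that text parsing costs extra work.
json_payload = '{"a": 123456, "b": 789, "c": 42}'
bin_payload = struct.pack("<iii", 123456, 789, 42)

t_json = timeit.timeit(lambda: json.loads(json_payload), number=100_000)
t_bin = timeit.timeit(lambda: struct.unpack("<iii", bin_payload), number=100_000)
print(f"json: {t_json:.3f}s  binary: {t_bin:.3f}s")
```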
Posted Mar 29, 2016 0:18 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
> Generally the fastest JSON parser is RapidJSON. Protobuf beats it handily (3x or more) in most benchmarks, e.g.
They actually have a full XML parser with the same performance characteristics; there's a JSON parser there as an example.
I have my own Parabix-based JSON parser that is used in production to do switching for a JSON-based UDP protocol. It saturates multiple 10G links (though it's also multithreaded).
And yeah, even a 3x performance difference in _parsing_ is pretty much negligible these days.