JMAP — reinventing IMAP
Posted Mar 27, 2016 9:41 UTC (Sun) by kentonv (subscriber, #92073)
In reply to: JMAP — reinventing IMAP by Cyberax
Parent article: JMAP — reinventing IMAP
I guess you haven't worked with Protobuf or Cap'n Proto, which are binary formats that handle versioning arguably more cleanly than JSON, yet have type-safe schemas unlike BSON?
Posted Mar 27, 2016 20:09 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
I worked with protobuf quite a bit (haven't had a chance to play with Cap'n Proto yet). For example, I recently reverse engineered the Android Auto protocol ( https://github.com/Cyberax/aauto ), which is totally protobuf-based.
Protobuf doesn't really have a schema. It's actually more like JSON: the structures simply use tag numbers instead of field names, and there are slightly more data types available. Versioning support is mostly non-existent - unknown fields are just ignored, and conflicting definitions cause havoc.
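The tags-instead-of-names point is easy to see at the byte level. Here is a minimal sketch of a raw decoder, assuming the standard wire-format rules (field key = field_number << 3 | wire_type, base-128 varints); it is illustrative only, not a full implementation:

```python
# Minimal sketch of the protobuf wire format: only numeric tags and
# values appear on the wire -- field names never do.

def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, new_pos)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7

def decode_raw(buf):
    """Yield (field_number, value) pairs from a serialized message."""
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field, wire_type = key >> 3, key & 7
        if wire_type == 0:            # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:          # length-delimited (bytes/string/submessage)
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise NotImplementedError(f"wire type {wire_type}")
        yield field, value

# 0x08 0x96 0x01 is the canonical "field 1 = 150" example from the
# protobuf docs; field 2 here carries the bytes b"hi".
msg = bytes([0x08, 0x96, 0x01, 0x12, 0x02]) + b"hi"
print(list(decode_raw(msg)))   # [(1, 150), (2, b'hi')]
```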
And there's now a canonical JSON mapping for protobuf as well.
Several years ago I read a post about protobuf by a Google engineer - he wrote that protobuf actually slightly predates JSON, and that these days they would have just used pure JSON instead.
Posted Mar 28, 2016 17:34 UTC (Mon)
by kentonv (subscriber, #92073)
> Protobuf doesn't really have a schema.
That's not true at all. https://developers.google.com/protocol-buffers/docs/proto
The schema is not transmitted on the wire. The schema is used to encode/decode on each end.
The raw wire format is numeric tags and values, but no one actually uses protobuf without schemas.
"Reverse engineering" protobufs usually means:
1. Feed a few messages to protoc --decode_raw.
2. Guess the meaning of each numeric tag.
3. Write a .proto file assigning names and types to the tags.
Or, better yet:
1. Yank the protobuf descriptors out of the app binary, which provide the complete schema.
I don't see any .proto files in your repo nor any reference to libprotobuf. Did you reverse engineer from the byte level and write your own decoder?
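Step 3 typically produces something like the following sketch. The tag numbers come from the wire; the message and field names are pure guesses, invented here for illustration:

```proto
// Hypothetical result of reverse engineering: tags 1 and 2 were
// observed on the wire; every name and even the types are guesses.
syntax = "proto2";

message ChannelOpen {
  optional int32 channel_id = 1;   // guessed meaning of tag 1
  optional int32 priority   = 2;   // guessed meaning of tag 2
}
```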
> Versioning support is mostly non-existent - unknown fields are just ignored
Yes, that's the best way to do versioning, and is exactly the same strategy people use with JSON. Using actual version numbers is a pain as you end up with lots of branchy code to handle every version. That's what Google had before protobuf, and protobuf was explicitly designed to fix it.
Protobuf is better than JSON, though, because with Protobuf you have a schema where you can declare default values to replace things missing on the wire, whereas with JSON you have to check for the existence and type of every single field before accessing it or otherwise risk unexpected exceptions and security bugs.
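The difference shows up in the calling code. A rough sketch, with field names invented for illustration:

```python
import json

# With plain JSON you must defensively check existence and type of
# every field before using it (names invented for illustration):
def retry_count_json(raw):
    obj = json.loads(raw)
    v = obj.get("retry_count")
    if not isinstance(v, int):
        return 3          # hand-rolled default, repeated at every call site
    return v

# A schema-driven decoder centralizes the same defaulting once --
# roughly what protobuf's generated accessors do for missing fields:
SCHEMA = {"retry_count": (int, 3), "host": (str, "localhost")}

def decode_with_schema(raw, schema=SCHEMA):
    obj = json.loads(raw)
    out = {}
    for name, (typ, default) in schema.items():
        v = obj.get(name)
        out[name] = v if isinstance(v, typ) else default
    return out

print(retry_count_json('{}'))                       # 3
print(decode_with_schema('{"retry_count": 7}'))     # {'retry_count': 7, 'host': 'localhost'}
```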
> and conflicting definitions cause havoc.
In my experience this is not a problem people run into often. You have one owner of the protocol who decides which changes become official. If you want a protocol to be extensible, you use "extensions" which allow third parties to extend the protocol without conflicting (or in proto3 you use the "Any" type).
(Moreover, conflicting definitions would cause an equal amount of havoc for JSON.)
> he wrote that protobuf actually slightly predates JSON and these days they would have just used pure JSON instead.
Sorry, that post was wrong. That's definitely not the opinion held by senior engineers at Google.
1. The importance of type-safety and the implicit documentation provided by schemas is widely recognized inside Google.
2. The vast majority of code at Google is written in statically-typed languages (C++ and Java) where JSON is highly annoying to use due to being dynamic. Protobuf uses schemas to generate classes with pleasant interfaces.
3. Protobuf parsing as-is is responsible for several percent of CPU cycles fleet-wide -- some servers report 30% or more. That equates to many millions of dollars annually. JSON would be an order of magnitude slower, which would cost a massive amount of money.
4. Google stores petabytes of data in Protobuf format. This data would be much larger as JSON. No, compression doesn't magically fix it (compressed protobufs are still much smaller than compressed JSON, particularly for highly-structured (i.e. not textual) data).
5. Similarly, the network bandwidth overhead would be unacceptable.
6. The latency cost of encoding and compressing JSON would be unacceptable for many systems in Google even if there were CPU cycles to spare.
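On point 4, a toy illustration of the size gap. Fixed-width packed records stand in here for protobuf's varint encoding, so the numbers are only indicative, and the record shape is invented:

```python
import json, struct, zlib

# 1000 invented records with three small integer fields each.
records = [{"user_id": i, "score": i * 37 % 1000, "flags": i % 8}
           for i in range(1000)]

as_json = json.dumps(records).encode()
# "<IHB" = little-endian uint32 + uint16 + uint8 = 7 bytes per record.
as_packed = b"".join(struct.pack("<IHB", r["user_id"], r["score"], r["flags"])
                     for r in records)

print(len(as_json), len(as_packed))   # packed is several times smaller
print(len(zlib.compress(as_json)), len(zlib.compress(as_packed)))
```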
Posted Mar 28, 2016 19:26 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
> The raw wire format is numeric tags and values, but no one actually uses protobuf without schemas.
Well, I do.
> Yank the protobuf descriptors out of the app binary, which provide the complete schema.
That might not be legal, though.
> I don't see any .proto files in your repo nor any reference to libprotobuf. Did you reverse engineer from the byte level and write your own decoder?
Yep. Protobuf is self-describing so it wasn't complicated. And adding .proto files into the mix was just not worth it.
> Yes, that's the best way to do versioning, and is exactly the same strategy people use with JSON.
But that's exactly my point. JSON is semantically very similar to the protobuf wire format - the differences are really minor (tags instead of names, and more integer types in protobuf).
All the protobuf "smarts" are in the mapping layer which enforces the schema, provides default values and so on. This mapping layer can be built atop JSON just as easily (as many people have done, many times in many companies) to provide nice statically-typed interfaces.
You can _almost_ treat protobufs as an encoding for JSON. Contrast it with ASN.1 PER, where you actually have to use the schema to decode raw messages, as field types and offsets are not transmitted on the wire.
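The mapping-layer idea described above is straightforward to sketch: a thin schema-enforcing wrapper over JSON that supplies types and defaults. Field names are invented for illustration:

```python
import json
from dataclasses import dataclass, fields

# Sketch of a statically-typed mapping layer over JSON: the dataclass
# plays the role of the schema, supplying types and default values.
@dataclass
class MailboxInfo:
    name: str = "INBOX"
    unread: int = 0

    @classmethod
    def from_json(cls, raw):
        obj = json.loads(raw)
        kwargs = {}
        for f in fields(cls):
            v = obj.get(f.name, f.default)
            if not isinstance(v, f.type):
                raise TypeError(f"{f.name}: expected {f.type.__name__}")
            kwargs[f.name] = v
        return cls(**kwargs)

# Unknown fields are ignored, missing ones get schema defaults:
m = MailboxInfo.from_json('{"name": "Spam", "ignored_new_field": true}')
print(m)   # MailboxInfo(name='Spam', unread=0)
```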
Posted Mar 28, 2016 20:04 UTC (Mon)
by kentonv (subscriber, #92073)
But once you have those generated classes, it makes no difference to the developer whether the bytes on the wire were JSON or Protobuf. Protobuf can produce a textual representation for debugging as needed. Why waste cycles and bytes encoding human-readable text all the time?
It sounds like you're the kind of person who reads raw network dumps a lot, but that you're also the kind of person who doesn't like to use the tools provided to you for this purpose, so I guess that would explain a preference for human-readable messages on the wire. But, I think between using the tools and spending millions of dollars on additional computer hardware, using the tools seems more reasonable.
> And adding .proto files into the mix was just not worth it.
Well, using libprotobuf and a .proto file would have saved you from writing quite a bit of code.
Posted Mar 28, 2016 21:19 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
> Why waste cycles and bytes encoding human-readable text all the time?
But it doesn't do it normally. The amount of wasted cycles isn't noticeable even at 10G wire speeds.
> It sounds like you're the kind of person who reads raw network dumps a lot
I don't read network dumps normally, but I do have to do it now and then - usually during a high-stress breakage situation.
> but that you're also the kind of person who doesn't like to use the tools provided to you for this purpose, so I guess that would explain a preference for human-readable messages on the wire.
I actually like tools that map a domain object model into messages/database/whatever. But they do have their price in being opaque when you have to diagnose a problem. JSON helps in this regard by making the whole stack less opaque.
> Well, using libprotobuf and a .proto file would have saved you from writing quite a bit of code.
Probably not in this case.
Posted Mar 28, 2016 21:47 UTC (Mon)
by kentonv (subscriber, #92073)
Uh, what? The absolute fastest JSON parsers top out around a gigabit per second, consuming 100% of CPU time on parsing, in idealized benchmark scenarios. You're asserting that you can do 10gbps and the CPU usage isn't even noticeable?
Posted Mar 28, 2016 22:05 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
Yes, when compared to parsing protobufs. The fastest possible JSON parser is based on bitslicing and SSE: http://parabix.costar.sfu.ca/ - with it you can get 10G performance.
Posted Mar 28, 2016 22:32 UTC (Mon)
by kentonv (subscriber, #92073)
Generally the fastest JSON parser is RapidJSON. Protobuf beats it handily (3x or more) in most benchmarks, e.g.
https://github.com/erickt/rust-serialization-benchmarks
(Cap'n Proto in turn handily beats Protobuf and can in fact saturate a 10gbps link.)
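For a feel of why skipping text parsing wins, here is a toy micro-benchmark. It compares Python's json module against a fixed-width binary unpack, not real protobuf or Cap'n Proto, and absolute numbers are machine-dependent:

```python
import json, struct, timeit

# Decode the same three integers from JSON text and from a binary
# record. The point is only that text parsing costs extra work.
json_payload = '{"a": 123456, "b": 789, "c": 42}'
bin_payload = struct.pack("<iii", 123456, 789, 42)

t_json = timeit.timeit(lambda: json.loads(json_payload), number=100_000)
t_bin = timeit.timeit(lambda: struct.unpack("<iii", bin_payload), number=100_000)
print(f"json: {t_json:.3f}s  binary: {t_bin:.3f}s")
```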
Posted Mar 29, 2016 0:18 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
> Generally the fastest JSON parser is RapidJSON. Protobuf beats it handily (3x or more) in most benchmarks, e.g.
They actually have a full XML parser with the same performance characteristics; there's a JSON parser there as an example.
I have my own Parabix-based JSON parser that is used in production to do switching for a JSON-based UDP protocol. It saturates multiple 10G links (though it's also multithreaded).
And yeah, even a 3x performance difference in _parsing_ is pretty much negligible these days.