
HTTP and microservices

Posted Apr 12, 2017 9:25 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)
In reply to: HTTP and microservices by eternaleye
Parent article: Connecting Kubernetes services with linkerd

> Instead, I'm suggesting that using a richer protocol, one that handles asynchrony in a more cohesive manner, can make orchestration systems _more effective_.
Ok. I'll bite.

Cap'n Proto is not a good idea for distributed services, ever. It promotes stateful systems that can't be made robust in the face of network/service degradation. It doesn't deal well with intermittent disconnections and retries. It's only good when you need to implement a (possibly bidirectional) communication channel between two reliable endpoints.

> Also, your response doesn't really match how Cap'n Proto promises/futures work; think of it more like "sending a request also creates a place to store the response once it is ready, and a way to realize when that happens".
I implemented a part of the Cap'n Proto wire protocol... Cap'n Proto's capability system does not allow one to send a promise to a third party. It's possible in theory, but in practice it'll lead to pain, suffering and CORBA.

Services have to be made as stateless as possible. So you pretty much have to use classic request/reply protocols, and HTTP is as good as any in this case.

> "Getting the result multiple times" doesn't hit the network multiple times at all.
It most definitely can happen. Imagine that service A calls service B that calls service C. Service C browns out and service B starts doing retries. That causes the call from A to B to error out and start doing retries. And pretty soon the whole system is locked up doing work that will never get used.
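
Concretely (a toy Python sketch with hypothetical services; the only point is how attempts multiply across hops):

    RETRIES = 3
    calls_to_c = 0

    def call_c():
        global calls_to_c
        calls_to_c += 1
        raise TimeoutError("C is browning out")

    def with_retries(fn):
        for _ in range(RETRIES):
            try:
                return fn()
            except TimeoutError:
                pass
        raise TimeoutError("gave up")

    try:
        with_retries(lambda: with_retries(call_c))  # A retries B, B retries C
    except TimeoutError:
        pass
    print(calls_to_c)  # 9: three attempts per hop multiply to nine calls to C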

In the case of Cap'n Proto, errors will simply look like disconnection exceptions.



HTTP and microservices

Posted Apr 12, 2017 21:58 UTC (Wed) by kentonv (subscriber, #92073) [Link] (4 responses)

> It promotes stateful systems that can't be made robust in the face of network/service degradation. It doesn't deal well with intermittent disconnections and retries.

No, Cap'n Proto lets you choose your trade-offs to fit the problem. If your problem is fundamentally stateless then go ahead and do request/response like you would with HTTP (but benefit from faster serialization and the fact that you can multiplex on a connection without head-of-line blocking).
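
To illustrate the multiplexing point (a hypothetical Python sketch of ID-tagged framing, not Cap'n Proto's actual wire format): because every message carries an ID, responses can come back in any order over one connection, so a slow call doesn't stall the ones behind it.

    import json

    pending = {}  # request id -> callback waiting for the result

    def send(req_id, method, callback):
        pending[req_id] = callback
        return json.dumps({"id": req_id, "method": method})  # all on one pipe

    def on_response(raw):
        msg = json.loads(raw)
        pending.pop(msg["id"])(msg["result"])  # matched by id, in any order

    send(1, "slowQuery", lambda r: print("slow:", r))
    send(2, "fastQuery", lambda r: print("fast:", r))
    on_response(json.dumps({"id": 2, "result": "done"}))  # 2 finishes first;
    on_response(json.dumps({"id": 1, "result": "done"}))  # 1 never blocked it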

If, on the other hand, you have a stateful problem, then in a stateless model you will end up building something that sucks. The typical naive approach is to have every request push back into a database, which means a sequence of state changes will be very slow as you're waiting for an fsync on every one. More involved approaches tend to involve some sort of caching, timeouts, etc. that make everything much more complicated and buggy.

Cap'n Proto lets you express: "Let's set up some state, do a few things to it, then push it back -- and if we have a network failure somewhere in the middle then the server can easily discard that state while the client starts over fresh."
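
Roughly this pattern, as a hypothetical Python sketch (EditSession and persist() are made-up names, not Cap'n Proto API):

    class EditSession:
        """Server-side object holding one client's in-memory state."""
        def __init__(self, doc):
            self.doc = list(doc)        # working copy, nothing durable yet

        def insert(self, pos, text):
            self.doc.insert(pos, text)  # cheap in-memory change, no fsync

        def commit(self):
            persist(self.doc)           # one durable write for the batch

    def persist(doc):
        print("fsync once for:", "".join(doc))

    session = EditSession("ab")         # in real life: a remote capability
    try:
        session.insert(1, "X")          # several cheap calls...
        session.insert(2, "Y")
        session.commit()                # ...one durable write at the end
    except ConnectionError:
        pass                            # server discards the session's state;
                                        # the client starts over fresh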

It turns out that while, yes, networks are unstable, they're not nearly as unstable as we are designing for today. We're wasting a whole lot of IOPS and latency designing for networks that are one-9 reliable when what we have is more like five-9's.

Of course, "stateless" HTTP services don't magically mean you don't have to worry about network errors. You need to design all your network interactions to be idempotent, or you need to think about what to do if the connection drops between the request and response. Cap'n Proto is really no different, except that you can more easily batch multiple operations into one interaction.

> Cap'n Proto's capability system does not allow one to send a promise to a third party.

Actually, it does. The level 3 RPC protocol specifies how to forward capabilities (which may be promises) and also how to forward call results directly to a third party.

> It most definitely can happen. Imagine that service A calls service B that calls service C. Service C browns out and service B starts doing retries. That causes the call from A to B to error out and start doing retries. And pretty soon the whole system is locked up doing work that will never get used.

Generally you'll want to let the disconnect exception flow through to the initiator, retrying only at the "top level" rather than at every hop, to avoid storms.
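
As a sketch of that policy (retry_top_level is a hypothetical helper; the inner hops simply let the exception propagate):

    import time

    def retry_top_level(fn, attempts=3, base_delay=0.1):
        """Retry only at the initiator, with exponential backoff."""
        for attempt in range(attempts):
            try:
                return fn()
            except ConnectionError:
                if attempt == attempts - 1:
                    raise               # out of attempts: surface the error
                time.sleep(base_delay * 2 ** attempt)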

HTTP and microservices

Posted Apr 13, 2017 1:09 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

> No, Cap'n Proto lets you choose your trade-offs to fit the problem. If your problem is fundamentally stateless then go ahead and do request/response like you would with HTTP
Here we're discussing enterprise-scale orchestration systems and in this case stateful systems are pretty much a BadIdea(tm).

> but benefit from faster serialization and the fact that you can multiplex on a connection without head-of-line blocking
HTTP doesn't specify serialization, and multiplexing is an anti-pattern for services (it makes sense for applications like browsers, where connections are short-lived and every WAN round trip counts).

> If, on the other hand, you have a stateful problem, then in a stateless model you will end up building something that sucks. The typical naive approach is to have every request push back into a database, which means a sequence of state changes will be very slow as you're waiting for an fsync on every one.
A real application will get an idempotency token, check whether the request has already been processed (for requests with side effects), write the idempotency token (possibly using it as a lock), proceed with the changes, log them, unlock the idempotency token, and return the result to the client (a sketch follows below).

That's for a start, disregarding downstream calls and so on. Hence the first rule of distributed programming: "Don't".
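
The skeleton of that workflow, as a hypothetical Python sketch (a real token store would be a durable database with atomic conditional writes, not a dict):

    import uuid

    store = {}  # stand-in for a durable idempotency-token table

    def handle(token, request):
        if token in store:              # already processed (or in flight)
            return store[token]         # replay the recorded result
        store[token] = None             # claim the token as a lock
        result = {"status": "ok"}       # ...the actual side effects here...
        store[token] = result           # log the outcome, release the lock
        return result

    token = str(uuid.uuid4())           # client-generated, reused on retry
    handle(token, {"op": "charge"})
    handle(token, {"op": "charge"})     # retried call: no double charge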

> Cap'n Proto lets you express: "Let's set up some state, do a few things to it, then push it back -- and if we have a network failure somewhere in the middle then the server can easily discard that state while the client starts over fresh."
Cap'n Proto has no support in its protocol for retries and idempotency checks. Its only approach is to throw exceptions and hope that the application does everything else. It's no better than HTTP.

> It turns out that while, yes, networks are unstable, they're not nearly as unstable as we are designing for today. We're wasting a whole lot of IOPS and latency designing for networks that are one-9 reliable when what we have is more like five-9's.
Networks are reliable if you have two servers connected to the same switch. But with Twitter-scale systems "the network" is NOT reliable. You cannot assume that your downstream service will be available without hiccups caused by deployments, route flaps, bugs, brown-outs, throttles and so on.

You HAVE to design with the assumption that your downstream services will randomly fail.

This is absolutely fundamental in large-scale systems. There's no way around it.

> Actually, it does. The level 3 RPC protocol specifies how to forward capabilities (which may be promises) and also how to forward call results directly to a third party.
There's no level 3 RPC for Cap'n Proto. In the current protocol, promises are simple 32-bit references valid only within the context of a single stream. So it won't scale as-is to multiple systems, as the entire state would have to be transferred.

The alternative is introducing URL-like constructs that encode the endpoint and the context within it. But at that point you'll just be reinventing HTTP and REST.

> Generally you'll want to let the disconnect exception flow through to the initiator, retrying only at the "top level" rather than at every hop, to avoid storms.
Then you have another source of vicious loops: service A makes 200 calls to service B. Call 199 fails. Then the top-level system retries the whole request to service A again.

HTTP and microservices

Posted Apr 13, 2017 15:04 UTC (Thu) by kentonv (subscriber, #92073) [Link] (2 responses)

> Here we're discussing enterprise-scale orchestration systems and in this case stateful systems are pretty much a BadIdea(tm).

No, you can't just generalize like that. Some use cases are stateful. For example, you can't implement real-time collaboration with stateless services in front of a standard database. You need a coordinator service for operational transforms.

I'm not sure we're talking about the same thing when you say "enterprise-scale orchestration system", but having written an orchestration system from scratch I'd say it's a pretty stateful problem. You can't start up a new container for every request, after all.

> HTTP doesn't specify serialization

It does for the headers. And transfer-encoding: chunked is pretty ugly, too.

> and multiplexing is an anti-pattern for services (it makes sense for applications like browsers where connections are short-lived and every WAN roundtrip counts).

The ability to do multiple independent requests in parallel is an anti-pattern?

> A real application will get an idempotency token, check if the request has already been processed (for requests with side-effects), write the idempotency token (possibly using it as a lock), proceed with changes, log them, unlock the idempotency token, return the request to the client.

That sounds over-engineered. For most apps you don't need two-phase commit on every operation.

But if you do, this goes back to my point: All those steps make the operation take a long time, and if you have to do it again for every subsequent operation, it's going to be very slow.

> Cap'n Proto has no support in its protocol for retries and idempotency checks. Its only approach is to throw exceptions and hope that the application does everything else.

Yes. That is the correct thing to do.

> You HAVE to design with the assumption that your downstream services will randomly fail.

Of course you do. I never said otherwise.

But do they fail 1/10 of the time or 1/10000 of the time? These call for different optimization trade-offs.

> There's no level 3 RPC for Cap'n Proto.

The protocol is defined but it's true that 3-party handoff hasn't been implemented yet. (Though it's based on CapTP, which has been implemented.)

> In the current protocol promises are simple 32 bit references valid only within the context of a stream. So it won't scale as-is to multiple systems, as the entire state would have to be transferred.

You mean the current implementation. I don't know what you mean by "the entire state would have to be transferred", but currently in three-party interactions there tends to be proxying.

> Then you have another source of vicious loops: service A does 200 calls to service B. Call 199 fails. Then the top-level system retries the whole request to service A again.

Sure, you should use good judgment in deciding where to retry. This is why the infrastructure can't do it automatically -- it's almost never the right place to retry. Retrying in your network library is just another version of trying to hide network unreliability from apps.

HTTP and microservices

Posted Apr 13, 2017 21:09 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

> No, you can't just generalize like that. Some use cases are stateful. For example, you can't implement real-time collaboration with stateless services in front of a standard database. You need a coordinator service for operational transforms.
Most modern enterprise systems are mostly stateless - a server typically retrieves the state required for a user request from some kind of storage/cache subsystem on every request. And even the storage subsystems themselves are usually "stateless" - they don't keep long-lived sessions with clients.

> The ability to do multiple independent requests in parallel is an anti-pattern?
Correct. Multiplexing unrelated requests inside one TCP stream is a bad idea in general - it defeats the OS-level flow-control logic, can suffer from head-of-line blocking, and has other issues. It makes sense when you want to avoid the overhead of additional round trips for TCP's three-way handshake.

> That sounds over-engineered. For most apps you don't need two-phase commit on every operation.
Nope. It's pretty much a required workflow if you need to involve multiple services.

> But if you do, this goes back to my point: All those steps make the operation take a long time, and if you have to do it again for every subsequent operation, it's going to be very slow.
Only for mutating operations, though.

> Of course you do. I never said otherwise.
> But do they fail 1/10 of the time or 1/10000 of the time? These call for different optimization trade-offs.
You have to design for 1/10 failure rate (at least!) if you want your service to be resilient.

> You mean the current implementation. I don't know what you mean by "the entire state would have to be transferred", but currently in three-party interactions there tends to be proxying.
This means that you're reimplementing the highly stateful ORB from CORBA. History never teaches people...

And no, linkerd is not stateful. It does not have to track the content of passed data, only the overall streams.

> Sure, you should use good judgment in deciding where to retry. This is why the infrastructure can't do it automatically -- it's almost never the right place to retry.
And how do you decide that you should stop doing retries because the overall global call rate is spiking?

And these issues are not theoretical. For a real-world example of a retry-driven vicious loop you can read this: https://aws.amazon.com/message/5467D2/
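
The machinery that large systems end up needing is something like a retry budget. A hypothetical token-bucket sketch of the idea: normal requests earn fractional retry tokens, each retry spends a whole one, so when failures spike the bucket empties and callers fail fast instead of piling on.

    class RetryBudget:
        def __init__(self, earn_ratio=0.1, cap=100.0):
            self.tokens = cap
            self.cap = cap
            self.earn_ratio = earn_ratio  # retries allowed per success

        def on_success(self):
            self.tokens = min(self.cap, self.tokens + self.earn_ratio)

        def try_retry(self):
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False                  # budget exhausted: don't retry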

HTTP and microservices

Posted Apr 13, 2017 21:33 UTC (Thu) by kentonv (subscriber, #92073) [Link]

Seems our debate has been reduced to "nuh-uh" vs. "uh-huh", with both of us presuming ourselves to be more knowledgeable/authoritative than the other.

