Gould, one of the original authors of linkerd, used to work for Twitter in production operations during its crazy growth phase, when the site was down a lot. During the 2010 World Cup, every time a goal was scored, Twitter went down. He was a Twitter user and, after finding himself rooting for 0-0 draws because they would keep the site up, realized that Twitter had operations problems and that he could probably help. So he went to work for them.
In those days, Twitter's main application was a single, monolithic program, written in Ruby on Rails, known internally as the monorail. This architecture was already known to be undesirable; attempts were being made to split the application up, but to preserve stability everything had a slow release cycle, with new code often taking weeks to get into production. The one exception was the monorail, which was released daily. So anything that anyone wanted to see in production on any reasonable timescale got shoehorned into the monorail, which didn't help the move to microservices. It also didn't help that the people who were trying to deploy microservices had to reinvent their own infrastructure (load balancing, handling retries and timeouts, and the like); these are not easy problems, so some of them were not doing it very well.
So Gould wrote a tool called Finagle, which is a fault-tolerant, protocol-agnostic remote procedure call system that provides all these services. It helped, so Twitter ended up fixing a lot of extant problems inside Finagle, and finally everything at Twitter ended up running on top of it. There are a number of consequent benefits to this; Finagle sees nearly everything, so you have a natural instrumentation point for metrics and tracing. However, Finagle is written in Scala, which Gould concedes is "not for everyone".
He left Twitter convinced that well-instrumented glue that is built to be easily usable can be helpful; turning his attention to the growing use of Docker and Kubernetes, he wrote linkerd to provide Finagle-like functionality for HTTP requests by acting as an intelligent web proxy. The fundamental idea is that applications shouldn't have to know who they need to talk to; they should ask linkerd for a service, and linkerd should take care of tracking who is currently offering that service, selecting the best provider, transporting the request to that provider, and returning the answer.
Facilities that linkerd provides to assist with this include service discovery, load balancing, encryption, tracing and logging, handling retries, expiration and timeouts, back-offs, dynamic routing, and metrics. One of the more elegant wrinkles Gould mentioned was that it can do per-request routing; for example, an application can send an HTTP header informing linkerd that this particular request should go via some alternative path, possibly a staging or testing path. Many statistics are exported; projects like linkerd-viz give a dashboard-style view of request volumes, latencies, and success rates.
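To make the per-request routing idea concrete, here is a minimal sketch in Python of a proxy choosing a destination for a single request. It is not linkerd's implementation or configuration format: the routing table, the `x-route-override` header name, and the `name=target` syntax are all invented for illustration.

```python
# Hypothetical routing table: logical service name -> default provider address.
DEFAULT_ROUTES = {
    "users": "users.prod:8080",
    "billing": "billing.prod:8080",
}

# Hypothetical header name and syntax, invented for this sketch.
OVERRIDE_HEADER = "x-route-override"


def parse_overrides(header_value):
    """Parse 'users=users.staging:8080;billing=...' into a dict."""
    overrides = {}
    for part in header_value.split(";"):
        name, sep, target = part.strip().partition("=")
        if sep and name.strip() and target.strip():
            overrides[name.strip()] = target.strip()
    return overrides


def resolve(service, headers):
    """Pick the destination for one request, honoring any override header."""
    overrides = parse_overrides(headers.get(OVERRIDE_HEADER, ""))
    return overrides.get(service, DEFAULT_ROUTES[service])


# Only requests that carry the header are diverted to the staging path.
print(resolve("users", {}))
print(resolve("users", {OVERRIDE_HEADER: "users=users.staging:8080"}))
```

Because the override travels with the request, a tester can exercise a staging instance without anyone reconfiguring the routing that ordinary traffic uses.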
Deadlines are something a microservice connector needs to care about. The simplistic approach of having each individual service have its own timeouts and retry budgets doesn't really work when multiple services contribute to the provision of a business feature. If the top service's timeout triggers, the fact that a subordinate service is merrily retrying the database for the third time according to its own timeout and retry rules is completely lost; the top service times out and the end-user is disappointed, while the subordinate transactions may still be needlessly trying to complete. Linkerd, because it is mediating all these transactions, allows the setting of per-feature timeouts, so that each service contributing toward that feature has its execution time deducted from the feature timeout, and the whole chain can be timed out when this expires. Services that are used in providing more than one feature can take advantage of more generous timeouts when they are invoked to provide important features, without having to permit such a long wait when they're doing something quick and dirty.
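The per-feature timeout described above can be sketched as a single time budget that every service in the chain draws from. This is an illustrative Python model only; the `Deadline` class, the service names, and the simulated costs are assumptions made for the example, not anything linkerd exposes.

```python
class Deadline:
    """A time budget shared by every service that contributes to one feature."""

    def __init__(self, budget_ms):
        self.remaining_ms = budget_ms

    def spend(self, cost_ms):
        """Deduct one service's execution time; False means the budget is gone."""
        if cost_ms > self.remaining_ms:
            self.remaining_ms = 0
            return False
        self.remaining_ms -= cost_ms
        return True


def run_feature(budget_ms, chain):
    """Walk the chain of (service, simulated cost in ms) under one shared budget."""
    deadline = Deadline(budget_ms)
    completed = []
    for name, cost_ms in chain:
        if not deadline.spend(cost_ms):
            # The whole chain stops here; nothing further down keeps retrying.
            return completed, False
        completed.append(name)
    return completed, True


# Hypothetical chain with made-up costs.
chain = [("frontend", 20), ("profile", 30), ("database", 90)]

print(run_feature(100, chain))  # quick feature: budget runs out at the database
print(run_feature(500, chain))  # important feature: same services, larger budget
```

The same three services serve both calls; only the budget attached to the feature differs, which is the point made above about services shared between quick and important features.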
Retries are also of concern. The simplistic approach of telling a service to retry after failure a finite number of times (say three) fails when things go bad, because each retry decision is taken in isolation. Just as the system is being stressed, the under-responsive service will be hit with four times the quantity of requests it normally gets, as everyone retries it. Linkerd, seeing all these requests as it does, can set a retry budget, allowing up to (say) 20% of requests to retry, thus capping the load on that service at 1.2 times normal. It makes no sense to set a traditional retry limit at a non-integer value like 1.2; this can only meaningfully be done by an overlord which sees and mediates everything.
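The arithmetic is easy to check with a small simulation. The sketch below is a simplified model, not linkerd's code: it counts retries against all requests seen so far, where a real budget would presumably use a sliding time window, and the `RetryBudget` class and its method names are invented for the example.

```python
class RetryBudget:
    """Permit retries only up to a fixed percentage of the requests seen."""

    def __init__(self, percent):
        self.percent = percent  # e.g. 20 means retries may add 20% extra load
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_retry(self):
        """Grant a retry only if it keeps retries within the budget."""
        # Integer arithmetic avoids floating-point surprises at the boundary.
        if (self.retries + 1) * 100 <= self.requests * self.percent:
            self.retries += 1
            return True
        return False


def load_with_budget(n_requests, percent):
    """Calls reaching a service that fails every request, under a retry budget."""
    budget = RetryBudget(percent)
    calls = 0
    for _ in range(n_requests):
        budget.record_request()
        calls += 1
        while budget.try_retry():
            calls += 1
    return calls


def load_with_fixed_retries(n_requests, max_retries):
    """Calls reaching the same service when every client retries independently."""
    return n_requests * (1 + max_retries)


print(load_with_fixed_retries(1000, 3))  # 4000: four times normal load
print(load_with_budget(1000, 20))        # 1200: capped at 1.2 times normal
```

The budget only works because one component counts every request; an individual client cannot know whether its retry is the one that pushes the total over 20%.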
This high-level view also allows linkerd to propagate backpressure. Consider a feature provided by several stacked microservices, each of which invokes the next one down the stack. When a service somewhere down in the stack has reached capacity, applying backpressure allows that service to propagate the problem as far up the stack as possible. This allows users whose requests will exceed system capacity to quickly see a response informing them that their request will not be serviced, and thus add no further (pointless) load to the feature stack, instead of sitting there waiting for a positive response that will never come, and overloading the feature while they do so. At this point in the talk, an incredulous question from the audience prompted Gould to confirm that all this functionality is in the shipping linkerd; it's not vaporware intended for some putative future version.
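A toy model of that propagation, assuming each service has a fixed concurrency limit, might look like the following. The `Service` class, the capacities, and the `Overloaded` exception are all hypothetical; the sketch only shows the rejection travelling up the stack immediately instead of the caller waiting.

```python
class Overloaded(Exception):
    """Raised at once when a service cannot accept more work."""


class Service:
    def __init__(self, name, capacity, downstream=None):
        self.name = name
        self.capacity = capacity      # maximum concurrent requests
        self.in_flight = 0
        self.downstream = downstream  # next service down the stack, if any

    def begin(self):
        """Admit a request through the whole stack, or reject it immediately."""
        if self.in_flight >= self.capacity:
            raise Overloaded(self.name)
        self.in_flight += 1
        try:
            if self.downstream is not None:
                self.downstream.begin()
        except Overloaded:
            self.in_flight -= 1  # release our slot and pass the signal upward
            raise


# Hypothetical three-layer stack whose bottom layer is the bottleneck.
db = Service("db", capacity=2)
api = Service("api", capacity=10, downstream=db)
web = Service("web", capacity=10, downstream=api)

web.begin()
web.begin()
try:
    web.begin()
except Overloaded as exc:
    print("rejected at the top; saturated layer:", exc)
print(web.in_flight, api.in_flight, db.in_flight)  # 2 2 2
```

The third request consumes no capacity in the upper layers once it has been refused, so the rejected user adds no lasting load to the stack.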
Gould's personal pick for most important feature in linkerd is request-aware load balancing. Because linkerd mediates each request, it knows how long each takes to complete, and it uses this information to load-balance services on an exponentially-weighted moving average (EWMA) basis, developed at Twitter. New nodes are drip-fed an increasing amount of traffic until responsiveness suffers, at which point traffic is backed off sharply. He presented data from a test evaluating latencies for three different load-balancing algorithms: round-robin, queue depth, and EWMA, in an application where large numbers of requests were distributed between many nodes, one of which was forced to deliver slow responses. Each algorithm failed to deliver prompt responses for a certain percentage of requests, but the percentage in question varied notably between algorithms.
The round-robin approach only succeeded for 95% of requests; Gould noted: "Everywhere I've been on-call, 95% is a wake-me-up success rate, and I really, really don't like being woken up." Queue-depth balancing, where new requests are sent to the node that is currently servicing the fewest requests, improved things: 99% of clients got a typically fast response. EWMA did better still, with more than 99.9% of clients seeing no sharp increase in latency.
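The core of latency-aware balancing can be sketched in a few lines of Python. This is a deliberately stripped-down illustration, not the algorithm developed at Twitter or the one linkerd ships: the smoothing factor, the fixed per-node latencies, and the rule of always picking the lowest average are assumptions, and the sketch omits the gradual ramp-up of new nodes and any way for a slow node to recover.

```python
class Node:
    def __init__(self, name, alpha=0.3):
        self.name = name
        self.alpha = alpha    # assumed smoothing factor; higher reacts faster
        self.ewma_ms = 0.0    # untried nodes start at 0, so each is tried once

    def observe(self, latency_ms):
        """Fold one observed latency into the moving average."""
        self.ewma_ms = self.alpha * latency_ms + (1 - self.alpha) * self.ewma_ms


def pick(nodes):
    """Send the next request to the node with the lowest average latency."""
    return min(nodes, key=lambda n: n.ewma_ms)


def simulate(latencies_ms, n_requests):
    """Route requests to nodes that each respond with a fixed latency."""
    nodes = [Node(name) for name in latencies_ms]
    counts = {name: 0 for name in latencies_ms}
    for _ in range(n_requests):
        node = pick(nodes)
        node.observe(latencies_ms[node.name])
        counts[node.name] += 1
    return counts


# Two healthy nodes and one forced to be slow, loosely echoing the test setup.
counts = simulate({"a": 10, "b": 10, "slow": 500}, 300)
print(counts["slow"], counts["a"] + counts["b"])  # 1 299
```

Round-robin would have sent 100 of those 300 requests to the slow node; here a single bad observation is enough to steer traffic away from it.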
Linkerd is relatively lightweight, using about 100MB of memory in normal use. It can be deployed in a number of ways, including either a centralized resilient cluster of linkerds, or one linkerd per node. Gould noted that the best deployment depends on what you're trying to do with linkerd, but that many people prefer one linkerd per node because TLS is one of the many infrastructural services that linkerd provides, so one-per-node lets you encrypt all traffic between nodes without applications having to worry about it.
One limitation of linkerd is that it only supports HTTP (and HTTPS) requests; it functions as a web proxy, and not every service is provided that way. Gould was very happy to announce the availability of linkerd-tcp, a more-generic proxy which tries to extend much of linkerd's functionality into general TCP-based services. It's still in beta, but attendees were encouraged to play with it.
Gould was open about the costs of a distributed architecture: "Once you're in a microservice environment, you have applications talking to each other over the network. Once you have a network, you have many, many, many, many more failures than you did when you just linked to a library. So if you don't have to do it, you really shouldn't... Microservices are something you have to do to keep your organization fast when managing builds gets too hard."
He was equally open about linkerd having costs of its own, not least in complexity. In response to being asked at what scale point the pain of not having linkerd is likely to outweigh the pain of having it, he replied that it was when your application is complex enough that it can't all fit in one person's head. At that point, incident responses become blame games, and you need something that does the job of intermediating between different bits of the application in a well-instrumented way, or you won't be able to find out what's wrong. While it was nice to hear another speaker being open about containerization not being some panacea, if I had a large, complex ecosystem of microservices to keep an eye on, I'd be very interested in linkerd.
[Thanks to the Linux Foundation, LWN's travel sponsor, for assistance in getting to Berlin for CNC and KubeCon.]
Brief items
Anyway, this discussion prompted me to get off my bum and look at why unattended-upgrades wasn't working. Turns out the default install has "label=Debian-Security", and all these laptops are running testing. I guess the assumption that people running testing have the wherewithal to configure their machines properly isn't unreasonable.
Page editor: Rebecca Sobol
Copyright © 2017, Eklektix, Inc.