Learning about Go internals at GopherCon
GopherCon is the major conference for the Go language, attended by 1600 dedicated "gophers", as the members of its community like to call themselves. Held for the last five years in Denver, it attracts programmers, open-source contributors, and technical managers from all over North America and the world. GopherCon's highly technical program is an intense mix of Go internals and programming tutorials, a few of which we will explore in this article.
Internals talks included one on the scheduler and one on memory allocation; programming talks included why not to base your authorization strategy on hash-based message authentication codes (HMACs). But first, here's a little about upcoming changes to Go itself.
Go 2 features
On the morning of the first full conference day, a video from Go language team lead Russ Cox was played. In it, he announced planned changes for the next major iteration of the language, nicknamed "Go 2". Attending the conference only by video is apparently usual for Cox, who did the same in 2017.
The plans he announced covered three items: improved error handling, generic types, and a new architecture for managing third-party modules and dependencies. Error handling would be improved primarily by adding a built-in "check expression" that reduces boilerplate code when testing for failure conditions and printing an error message. The new generic types will be based on "contract-based specifications", where the user defines the generic as fulfilling specific conditions.
The modules announcement was more controversial. Modules are based on a Go fork called "vgo". One of the Go project's primary issues for the last couple of years has been designing a better way to package third-party libraries. However, the new modules architecture wholesale replaces the "dep" system that a community team had worked on for more than a year. Members of that team are quite unhappy about how that change was managed, as Sam Boyer, one of the team members, outlined in a Twitter thread.
According to Cox, the Go 2 changes won't be one big replacement, but will be gradually added to the language as they are ready. You can follow along with them on the Go 2 Drafts page.
The scheduler saga
Kavya Joshi kicked off the conference with an explanation of how the Go scheduler works. To make it easier to understand, she presented it by talking through creating a new, simple scheduler from scratch and ended up with a simplified version of the real scheduler.
Go is primarily meant for running concurrent applications. Each instance of a concurrently running function in Go is called a "goroutine", which is considered the basic unit of work. Goroutines run as lightweight, user-space threads, which are smaller and cheaper than kernel threads. As an example, the initial goroutine stack is 2KB, compared with 8KB for kernel threads. This design supports fast creation and destruction of goroutines. Several goroutines are run on each kernel thread, managed by the Go scheduler.
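To make the "cheap to create" point concrete, here is a minimal sketch that starts one goroutine per input value and waits for all of them. The function name sumSquares and the workload are hypothetical and chosen only for illustration; they do not come from Joshi's talk.

```go
package main

import (
	"fmt"
	"sync"
)

// sumSquares starts one goroutine per input value and adds up the squares.
// The name and the workload are hypothetical, used only for illustration.
func sumSquares(values []int) int {
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		total int
	)
	for _, v := range values {
		wg.Add(1)
		go func(n int) { // each call runs as its own goroutine
			defer wg.Done()
			mu.Lock() // protect the shared total
			total += n * n
			mu.Unlock()
		}(v)
	}
	wg.Wait() // block until every goroutine has finished
	return total
}

func main() {
	// Starting ten thousand goroutines is routine in Go programs.
	values := make([]int, 10000)
	for i := range values {
		values[i] = 1
	}
	fmt.Println(sumSquares(values))
}
```

The scheduler decides which kernel thread runs each of these goroutines; the program never creates threads itself.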
In order to maximize work, the default limit on the number of threads is equal to the number of cores on the system. Instead of using a shared run queue of pending work, which would become a scaling bottleneck, each thread has its own run queue. The threads then independently take goroutines off their queue, and then put them back on whenever they are waiting on system calls (such as file reads).
However, this leads to a problem: how can we optimize work if some threads empty their queues faster than others? The solution that Go chose was to implement "work stealing", where a thread with an empty queue will "steal" half the tasks from another, randomly selected thread's queue.
Goroutines that execute long-running CPU-bound tasks are another problem. Go runs a watchdog called "sysmon" that looks for this, and if a task takes more than 10ms, it gets preempted and placed on a global, shared, run queue. This queue is lower priority than the threads' individual run queues, so they only take goroutines from it when they are done with their own queues.
The Go scheduler has some notable limitations. First, by design, all run queues are first-in-first-out (FIFO), with no support for priorities. Go also doesn't have strong preemption, sometimes causing latency, but this is the target of a current design proposal. Another design proposal addresses the lack of hardware topology (NUMA) support.
As you might expect, the real Go scheduler is more complex than the simple model presented by Joshi. Its core principles are the ones she presented, but there are additional caveats and special circumstances. However, even starting from a simple foundation using published architectures doesn't work out for everyone, as the team at Chain found out.
Don't make macaroons
Tess Rinearson, VP of engineering at Chain, shared how its experiment with "macaroons" for authorization failed, and why. She started by filling in the basic issues around authorization on web and API-driven platforms. When she started at Chain, they were using "basic auth", meaning username/password for authentication. Authorization (that is, checking what a user was allowed to do) was performed by checking a long list of specific permission "grants", on the server, every time the user made an API call. However, this caused some limitations in the platform that the company was unhappy with.
As Rinearson explained, there are two approaches to authorization: capability-first and identity-first. Capability-first platforms use tokens, which behave like car keys, to authorize actions; if you have the token, you're authorized. The problem with these is that the token can be stolen and, like a stolen car key, can then be used any way the thief wishes. Identity-first platforms rely on checking the user's identity against an access-control list or policy server whenever the user performs an action. Chain's initial approach was identity-first.
In addition to server load created by repeated policy checks, another problem with identity-first systems is an inability to delegate actions to third parties, even when the user wishes to. This is equivalent to being unable to have a valet park your car because it uses "face ID" to unlock. According to a paper by Tyler Close, all systems of identity delegation are vulnerable to hijacking by causing a delegate to take a wrong action. This is known as the "confused deputy" problem.
In 2014, several Google researchers published a solution to the identity/authorization problem, called "macaroons", so named because they are "cookies with layers". Rinearson and the team at Chain decided to deploy macaroons in order to solve their problem with authorization for their developer API.
The idea behind a macaroon is to use HMAC cryptographically authenticated cookies to identify users. When the user wants to delegate a capability, such as allowing a spouse to share their photos on the site, the system would "layer" the HMAC token by encrypting it with a second HMAC key. This second key would contain limitations on the authorization in the token, such as "may only share if authenticated against this specific Google account". This allows a rich set of policy-based authorizations without any need to check against a central policy server.
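The layering can be sketched with Go's crypto/hmac package. This follows the chained-HMAC construction from the published macaroons design, in which each added restriction (a "caveat") is signed using the previous signature as the key. It is not Chain's implementation, and the type and function names (macaroon, mint, attenuate, verify) are hypothetical.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// sign returns HMAC-SHA256(key, msg).
func sign(key, msg []byte) []byte {
	mac := hmac.New(sha256.New, key)
	mac.Write(msg)
	return mac.Sum(nil)
}

// macaroon is a toy token: an identifier, a list of caveats, and a
// signature chained over both.
type macaroon struct {
	id      string
	caveats []string
	sig     []byte
}

// mint creates a token; only the server knows rootKey.
func mint(rootKey []byte, id string) macaroon {
	return macaroon{id: id, sig: sign(rootKey, []byte(id))}
}

// attenuate adds a restriction. Any holder of the token can do this
// without the root key, because the old signature serves as the new key.
func (m macaroon) attenuate(caveat string) macaroon {
	caveats := append(append([]string{}, m.caveats...), caveat)
	return macaroon{id: m.id, caveats: caveats, sig: sign(m.sig, []byte(caveat))}
}

// verify recomputes the chain from the root key and compares signatures in
// constant time. It checks integrity only; a real server would also have to
// check that every caveat is actually satisfied by the request.
func verify(rootKey []byte, m macaroon) bool {
	sig := sign(rootKey, []byte(m.id))
	for _, c := range m.caveats {
		sig = sign(sig, []byte(c))
	}
	return hmac.Equal(sig, m.sig)
}

func main() {
	root := []byte("server-side secret") // placeholder key for the example
	full := mint(root, "user:alice")
	shared := full.attenuate("action = share-photos")
	fmt.Println(verify(root, full), verify(root, shared))

	shared.caveats = nil // try to strip the restriction
	fmt.Println(verify(root, shared))
}
```

Because each signature depends on everything before it, a delegate cannot remove a caveat without invalidating the token, and the server can verify the whole chain using only its root key.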
Macaroons did not work out well for Chain. First, its implementation required every initial authentication to hit the "dashboard", a Rails application that previously hadn't been mission-critical, in order to obtain the core HMAC token. Second, the long chains of code required to unwrap the successive layers of HMAC-encrypted policies were too cumbersome for developers used to being able to write ad-hoc code for the Chain platform instead of depending completely on the SDK. Third, revoking tokens or even changing user roles was extremely slow since applications weren't checking the policy server regularly. There were also formatting and character encoding issues with the extremely long base64-encoded HMAC tokens.
The final objection came from the French members of Chain's staff, who pointed out that layered cookies are actually spelled "macaron", not "macaroon".
The team decided to remove macaroons from the code base, in favor of an enhanced basic auth implementation that Rinearson calls "pumpkin spice auth". However, macaroons proved much harder to remove than to implement. While they were deployed in two weeks originally, deprecating them took over five months due to outstanding tokens and SDK code.
Memory allocation
Eben Freeman shared what he has learned about the internals of Go memory management over the last year of working on improving performance at Honeycomb.io. Its service involves users creating ad-hoc queries to comb through hundreds of millions of log entries, looking for bugs. Freeman's team managed to improve throughput by about 2.5x via multiple techniques to reduce their application's use of memory.
He started by explaining how the Go memory allocator actually works. Go is a "managed memory" language, which means that it automatically allocates and deallocates memory for the programmer. This memory exists in two places: the stack, which is a small memory area that each goroutine gets for local objects in the current function, and the heap, where everything else is stored. Since the majority of memory usage is in the heap, that's also where most optimization is needed.
The Go allocator tries to satisfy three design goals:
- satisfy allocations of any given size while avoiding fragmentation
- avoid locking
- enable reclaiming memory
For example, to prevent fragmentation, the allocator tries to reserve memory for groups of like-sized objects in blocks. To avoid locking, Go uses per-CPU caches. Reclaiming memory involves running a garbage collector (GC) concurrently with the running Go program.
Memory is allocated in large blocks called "arenas", usually 64MB in size. These arenas, in turn, are divided up into multiple "spans", each of which is a run of memory pages containing objects of a fixed-size class, such as 64-byte objects. A CPU running a goroutine is given an arena with one span of each of the 70 different size classes, which is usually enough to store most variable operations.
Every time an object is allocated in a span, Go updates a bitmap in memory to show which slots are in use. These bitmaps are used by the garbage collector, which uses a "tricolor concurrent mark-sweep" algorithm to free memory. During the mark phase, every object starts out white; objects reachable from the program's roots are marked grey, and a grey object is marked black once the objects it points to have themselves been marked grey. Objects that are still white when marking finishes are unreachable. During the sweep phase, white objects are freed, generally just meaning that the memory is designated as available for reuse.
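In the textbook formulation of tricolor marking, objects start white, become grey when first reached from a root, and become black once their outgoing pointers have been followed. The sketch below implements that on a toy object graph. It is a stop-the-world illustration with hypothetical types (object, color), not the Go runtime's concurrent collector, and it has no write barrier.

```go
package main

import "fmt"

// object is a toy heap object that may point at other objects.
type object struct {
	name string
	refs []*object
}

type color int

const (
	white color = iota // not reached (yet)
	grey               // reached, outgoing pointers not yet followed
	black              // reached and fully scanned
)

// mark colors every object in heap, starting from roots.
func mark(heap, roots []*object) map[*object]color {
	colors := make(map[*object]color, len(heap))
	for _, o := range heap {
		colors[o] = white
	}
	var worklist []*object
	for _, r := range roots {
		if colors[r] == white {
			colors[r] = grey
			worklist = append(worklist, r)
		}
	}
	for len(worklist) > 0 {
		o := worklist[len(worklist)-1]
		worklist = worklist[:len(worklist)-1]
		for _, ref := range o.refs {
			if colors[ref] == white {
				colors[ref] = grey
				worklist = append(worklist, ref)
			}
		}
		colors[o] = black // all of o's pointers have been followed
	}
	return colors
}

// sweep returns the objects that survive; white objects are dropped.
func sweep(heap []*object, colors map[*object]color) (live []*object) {
	for _, o := range heap {
		if colors[o] == black {
			live = append(live, o)
		}
	}
	return live
}

func main() {
	a := &object{name: "a"}
	b := &object{name: "b"}
	c := &object{name: "c"} // nothing points at c
	a.refs = []*object{b}
	heap := []*object{a, b, c}
	for _, o := range sweep(heap, mark(heap, []*object{a})) {
		fmt.Println("live:", o.name)
	}
}
```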
While the GC is designed to be highly concurrent, creative use of pointers to objects can mess up the marking system. For this reason, Go adds a write barrier on writes to pointers during the marking phase, which can add significant latency overhead. Another place that GC can cause a lot of overhead is in "mark assist". If the garbage collector is unable to keep up with concurrently running goroutines in order to mark memory for freeing, it can require the goroutines themselves to do some memory marking before they can proceed with execution.
Freeman showed that you can measure the overhead of the allocator and the garbage collector by disabling them, using the Go execution parameters GOGC=off and GODEBUG=sbrk=1. A better way to see memory activity is to run a pprof CPU profile to see time spent in runtime.mallocgc. He suggested using the "flamegraph" visualization built into pprof's web interface. For example, time spent in runtime.(*mcache).refill shows time spent refilling span caches, and time spent in runtime.gcAssistAlloc shows time spent in mark assist.
Possibly the best tool for checking memory efficiency is the Go execution tracer. It collects granular runtime events over a short period of time. This is a lot of data, but you can drill down into what the garbage collector is doing, like mark assist periods. It also has a statistic called "minimum mutator utilization" that shows how much time the program spent doing things other than GC and allocation.
So, how did the Honeycomb staff learn to speed up execution? One key thing was to limit use of pointers in Go code. Sometimes this is obvious because it's your own pointer, but standard library types like "Time" contain their own, built-in pointers. This can make it useful to use int64 instead of Time if you don't care about actual clock time. In particular, avoid allocating a pointer inside a loop.
Another thing Honeycomb did was to approximate "slab allocation", which is when you allocate a large block of memory and then subdivide it. While allocating individual objects is fast, allocating many objects at once is faster per object. In Go, you can make large "slices" (resizable arrays) behave like slab allocations. Beware, though, that even a single variable still in use can keep the whole allocation from being garbage collected.
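A rough sketch of the slice-as-slab idea: instead of allocating each object separately, allocate one slice of values and hand out pointers into it. The row type and function names are hypothetical and are not Honeycomb's code.

```go
package main

import "fmt"

// row is a hypothetical small record type.
type row struct {
	key, value int64
}

// allocEach performs one heap allocation per row.
func allocEach(n int) []*row {
	out := make([]*row, n)
	for i := range out {
		out[i] = &row{key: int64(i)}
	}
	return out
}

// allocSlab allocates all rows in one backing array and hands out
// pointers into it.
func allocSlab(n int) []*row {
	slab := make([]row, n) // the "slab": a single allocation for n rows
	out := make([]*row, n)
	for i := range slab {
		slab[i].key = int64(i)
		out[i] = &slab[i]
	}
	return out
}

func main() {
	a, b := allocEach(1000), allocSlab(1000)
	fmt.Println(len(a), len(b), a[999].key == b[999].key)
}
```

The caveat from the text applies directly: as long as any one of the pointers returned by allocSlab is still reachable, the entire backing array stays alive.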
A more sophisticated technique that worked for Honeycomb was to recycle specific objects to hold new objects of an identical size. Honeycomb had a filter-then-aggregate process for running queries, and Freeman realized that the row buffer structs (complex variables) could be reused to hold the aggregation data. This cut the amount of work the garbage collector had to do considerably. The sync.Pool type in Go's standard library offers a more general version of this solution.
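For the general version, a minimal sync.Pool sketch might look like the following. The rowBuffer type, its capacity, and the sumRow function are assumptions made for illustration; they are not Honeycomb's actual structures.

```go
package main

import (
	"fmt"
	"sync"
)

// rowBuffer is a hypothetical reusable scratch buffer.
type rowBuffer struct {
	cols []int64
}

// bufPool hands out rowBuffers and creates a new one only when it is empty.
var bufPool = sync.Pool{
	New: func() interface{} {
		return &rowBuffer{cols: make([]int64, 0, 64)}
	},
}

// sumRow borrows a buffer, uses it as scratch space, and returns it to the
// pool so that later calls can reuse the same memory.
func sumRow(values []int64) int64 {
	buf := bufPool.Get().(*rowBuffer)
	defer bufPool.Put(buf)

	buf.cols = append(buf.cols[:0], values...) // reset length, keep capacity
	var total int64
	for _, v := range buf.cols {
		total += v
	}
	return total
}

func main() {
	fmt.Println(sumRow([]int64{1, 2, 3}))
	fmt.Println(sumRow([]int64{10, 20}))
}
```

Note that a pool gives no guarantee that a given buffer will be reused, since the runtime may drop pooled objects during garbage collection. Code using it must therefore reset any state it reads, as sumRow does by truncating the slice.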
Overall, these optimizations more than doubled the throughput for user queries while keeping overall memory usage constant.
Summary
As with every good conference, there were many other talks and activities at GopherCon, including a gobots (robots controlled in Go) hackathon, a community contributor day, and many hands-on workshops. There were even sessions that weren't about Go internals, such as one on creating data charts and graphs in Go, and another on building interesting Textual User Interfaces (TUIs). GopherCon is also the annual gathering of Women Who Go, an international educational association for women and non-binary people using Go.
At the end of the conference, attendees were surprised to be told that, for the first time, GopherCon would be moving away from Denver. Next year, it will be held in San Diego in July. If you write Go regularly, you should probably consider attending.
[Josh Berkus is an employee of Red Hat.]
