An NTP pool storm

December 21, 2016

This article was contributed by Tom Yates

Since mid-December, the global NTP pool network has been under heavy load from what seems to be an unintended consequence of programming choices. The offending piece of software has been identified, a fixed version is slowly rolling out around the world, and the NTP pool project has emerged unscathed from the incident. In addition, some things that needed improvement have gotten it.

Starting around December 14, some NTP pool server operators saw a significant increase in traffic. Not all server operators saw the increase; it seemed to be restricted to servers in the UK, US, Australia, and New Zealand zones. But for affected servers, the increased loads stood out.

My pool server tracks the per-second rate of unique clients, averaged over a five-minute period, and it normally sees 50-100 clients/sec. As the graph above shows, during the affected period I saw loads over ten times that rate, with a five-minute peak just under ten thousand individual clients per second, which is one hundred to two hundred times normal service levels. Even such high levels of client activity didn't overwhelm my server (ntpd peaked at about 50% of a single CPU, and neither memory nor bandwidth was a significant problem), but the knock-on effects proved problematic for some operators. One operator found that, although his server could cope with the load, the firewall in front of it was having its state tables flooded by the number of distinct clients, so he had to cease offering service for the duration of the incident.
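The article does not say how that per-second figure is computed. As a rough sketch of one plausible reading, the following counts the distinct client addresses seen in each five-minute window and divides by the window length. The function name, the input format, and the windowing scheme are all assumptions made for illustration, not a description of the author's monitoring setup.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # five-minute averaging period (assumed fixed windows)


def unique_client_rates(events, window=WINDOW_SECONDS):
    """Map each window start time to distinct clients per second.

    events: iterable of (unix_timestamp, client_ip) pairs, for example
    parsed from a packet capture or server log (hypothetical input).
    """
    clients = defaultdict(set)
    for timestamp, ip in events:
        # Align each event to the start of its window.
        start = int(timestamp) // window * window
        clients[start].add(ip)
    return {start: len(ips) / window for start, ips in sorted(clients.items())}


# Tiny synthetic example: two distinct clients in the first window,
# one in the second.
sample = [(0, "192.0.2.1"), (1, "192.0.2.2"), (1, "192.0.2.1"), (300, "192.0.2.3")]
rates = unique_client_rates(sample)
```

Under this reading, a repeat query from the same address within a window does not raise the figure; only new addresses do, which is also what a stateful firewall's connection table would feel.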

Because public NTP servers offer time data to anyone who asks for it, server abuse has been an occasional problem in the past. The NTP pool system tries to mitigate the effect of proprietary devices and software products that ship with hard-coded pool servers by the use of vendor zones; since those in charge of the pool DNS infrastructure can see how many queries are coming into each zone, a vendor zone that is suddenly sustaining unexpectedly high lookups can easily be identified and, if necessary, contained. But in the case of this latest rise, it was clear that no vendor pool was involved. Moreover, the observed rise in DNS queries to the pool zone was nowhere near as large as the observed rise in NTP queries, so whatever was responsible was clearly doing something more problematic than merely looking up a few servers and asking some of them for the time.

Early on, the problem had started to be quantifiable, but the results were unhelpful: the traffic was coming from everywhere. I did a few cursory reverse-DNS lookups on my newly enlarged client pool and quite rapidly saw clients from six of the seven continents, though I don't recall seeing any traffic from Antarctica. There was nothing odd or sinister about any of the lookups, no evidence that anyone was mounting some kind of distributed denial-of-service attack. All the NTP community knew was that a lot of people suddenly really wanted to know what the time was.

By December 19, an explanation had been found for why DNS queries had not risen in step with NTP queries. Whatever was doing this was querying a group of individual pool names, specifically {0,0.uk,0.us,asia,europe,north-america,south-america,oceania,africa,europe}.pool.ntp.org, each of which would normally result in four IP addresses being returned. The culprit would then send an NTP query to every single IP address so discovered. On-list estimates were that each client performed 10 DNS lookups, followed immediately by 35-60 NTP queries.
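The fan-out described above can be checked with a little arithmetic. The sketch below is a toy model, not the ios-ntp code: the hostnames and addresses are synthetic placeholders (documentation-range IPs), and it simply assumes ten lookups that each return four addresses, all of which are then queried.

```python
def count_ntp_queries(resolved, dedupe=False):
    """Count NTP queries sent if a client queries every resolved address.

    resolved: dict mapping each looked-up name to its list of addresses.
    dedupe:   if True, query each distinct address only once.
    """
    addresses = [ip for ips in resolved.values() for ip in ips]
    if dedupe:
        addresses = set(addresses)
    return len(addresses)


# Synthetic stand-in for ten pool names returning four addresses each.
resolved = {
    f"name{i}.pool.example": [f"192.0.2.{4 * i + j}" for j in range(4)]
    for i in range(10)
}

dns_lookups = len(resolved)
ntp_queries = count_ntp_queries(resolved)
amplification = ntp_queries / dns_lookups
```

With these assumptions, 10 lookups turn into 40 NTP queries, which sits inside the 35-60 range estimated on the list. Overlapping answers would push the count down and names returning more addresses would push it up.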

By late on December 19, the iPhone Snapchat client had been identified as the guilty party. Attempts to contact Snap Inc. (the parent company) had succeeded by December 20 and, later that day, Snap Inc. confirmed that it had submitted a fixed version of its client software to Apple for inspection and release to the App Store. Snap Inc. had also emphasized to Apple the urgency of this new client's release to the world to mitigate the effects as quickly as possible.

Interestingly, Snap Inc. also confirmed that its client currently doesn't need to do NTP at all. The functionality came from the ios-ntp library, which was linked into the Snapchat application. It is not clear at this point whether the library, or the way it was used by Snapchat, is responsible for the strange decision to query every single address returned by querying the library's built-in list of NTP servers, but the combination was a serious problem.

The NTP pool project is quite clear on the use of the pool by anything other than people configuring their own computers:

Anyone distributing an appliance, operating system or some other kind of software using NTP [...] must absolutely not use the default pool.ntp.org zone names as the default configuration in [their] application or appliance
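As an illustration of what a better-behaved client might look like, here is a minimal sketch that resolves a single hostname once and sends one request to one of the returned addresses, rather than to all of them. The vendor zone name is hypothetical (a real one would be assigned by the pool project), the helper names are invented for this example, and the code handles IPv4 only with no retries or validation of the reply.

```python
import random
import socket
import struct

# Hypothetical vendor zone name, used here purely for illustration.
VENDOR_HOST = "0.examplevendor.pool.ntp.org"
NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01


def build_sntp_request():
    """Return a 48-byte SNTP client request (LI=0, VN=3, Mode=3)."""
    return b"\x1b" + 47 * b"\0"


def pick_one_address(host, resolve=socket.gethostbyname_ex):
    """Resolve the host once and pick a single address, not all of them."""
    _name, _aliases, addresses = resolve(host)
    return random.choice(addresses)


def query_time(host=VENDOR_HOST, timeout=2.0):
    """Send one request to one server and return Unix time from the reply."""
    address = pick_one_address(host)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_sntp_request(), (address, 123))
        reply, _ = sock.recvfrom(48)
    # The transmit timestamp's seconds field occupies bytes 40-43.
    seconds = struct.unpack("!I", reply[40:44])[0]
    return seconds - NTP_EPOCH_OFFSET
```

The resolver is passed in as a parameter so the address-selection step can be exercised without touching the network. An application that does not actually need network time, as Snap Inc. said of its client, is better off sending no NTP queries at all.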

The library's developer has clearly heard about this incident, because on December 20 a new version was released that doesn't have the long list of pool server hostnames embedded in it, bringing the library into compliance.

Some good has come out of this; we now know that the NTP pool is robust enough to survive an event like this, which, like any other disaster that can happen, was bound to happen sooner or later. A library with the potential for causing problems has been identified and fixed. The total number of servers in the pool doesn't seem to have taken a hit, and may have actually increased; Snap Inc. donated two pool servers by way of apology for the incident.

For me, the biggest lesson to learn is that the law of unintended consequences is still in effect: it was never the library's author's intention that every iPhone should bombard the NTP pool infrastructure, nor was it Snapchat's intention that its client should start an NTP firestorm. But if you create a piece of code with the capacity to do something foolish, sooner or later someone will ask it to do just that; the result may not be all that pretty.


An NTP pool storm

Posted Dec 22, 2016 21:41 UTC (Thu) by hmh (subscriber, #3838)

And we should be very thankful this happened due to a piece of code from the mobile app space, where updates do actually happen.

Were it coming from code in an IoT device, or even a firmware preload...

An NTP pool storm

Posted Dec 27, 2016 13:35 UTC (Tue) by faramir (subscriber, #2327)

Sounds like the whole thing was handled well by all concerned. I would like to comment on your last line about "capacity to do something foolish" though. This would seem to encourage an attempt to write software that can't do anything foolish. Unfortunately, I think "foolish" is very context-dependent, and what is clearly a bad idea in one context may very well be the best solution in another. Please consider this a plea to developers to continue to write software that is flexible.