Improving syncookies

April 9, 2008

This article was contributed by Patrick McManus

Back in 1997 TCP SYN flood attacks were all the rage among script kiddies. A SYN flood is a denial of service attack that uses up server resources by initiating, but not completing, a connection. Attacks via this method remain a problem today, though they are now more likely to be launched by sophisticated botnets than by an individual. A first-line defense against SYN floods is the syncookie. The syncookie was not designed for Linux specifically but found its way into kernel 2.1.44 via a patch from Andi Kleen.

This long-time feature generated some recent discussion when a patch was submitted adding syncookie support to IPv6. The patch has now been queued for acceptance but in discussion along the way the community also began to tackle some longstanding limitations of syncookies and reaffirmed how relevant the feature continues to be.

To fully describe syncookies, some background on how TCP uses a three-way handshake to establish a connection is in order. The first packet of any TCP session received by the server is known as the SYN packet because it carries the synchronize control flag. The SYN flag indicates that its sender wishes to open a new connection. That flag is only used during the opening sequence. The server responds with a packet also containing the SYN flag because the connection needs to be opened in both directions. This second packet also carries the ACK flag and is known as the SYN-ACK. It serves both to open the connection from the server to the client and to acknowledge receipt of the opening packet from the other host. Finally, the client sends a bare ACK packet to the server to acknowledge receipt of the server-to-client SYN-ACK and the connection is then fully established.

During a SYN flood a server receives the first packet of the three-way TCP handshake and responds with a SYN-ACK but no further data is ever received from the initiating client. When the SYN-ACK is generated most servers will also create an entry in the SYN queue. This queue is the waiting area for half-open connections awaiting handshake completion. The attacker intentionally orphans those entries and instead generates more SYN packets which in turn take up more entries in the queue. The server needs to wait for a long timeout before giving up and recovering the connection resources. During this time the attacker can flood it with many more half-open connections. Eventually the server runs out of resources and cannot accept any new connections without dropping some, perhaps legitimate, connection from the queue. Simple solutions such as placing a quota on the number of partially open connections per peer or using dynamically adjusted packet filters do not work because the SYN packets are easy to forge with fake source addresses.
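
The exhaustion dynamic can be illustrated with a toy model. The Python sketch below is not kernel code: the SynQueue class, the 60-second timeout, and the forged address scheme are assumptions made for the example, while the 1024-entry backlog and the 8000 SYN/s flood rate echo figures quoted from the testers later in this article.

```python
from collections import OrderedDict


class SynQueue:
    """Toy model of a bounded queue of half-open connections."""

    def __init__(self, capacity, timeout):
        self.capacity = capacity
        self.timeout = timeout          # seconds before an orphan is reclaimed
        self.entries = OrderedDict()    # (src_ip, src_port) -> arrival time

    def expire(self, now):
        # Reclaim entries whose handshake never completed in time.
        stale = [k for k, t in self.entries.items() if now - t >= self.timeout]
        for key in stale:
            del self.entries[key]

    def on_syn(self, peer, now):
        """Return True if the SYN got a queue slot, False if it was dropped."""
        self.expire(now)
        if len(self.entries) >= self.capacity:
            return False
        self.entries[peer] = now
        return True

    def on_ack(self, peer):
        """Handshake completed: free the slot. True if the peer was queued."""
        return self.entries.pop(peer, None) is not None


def simulate(capacity=1024, timeout=60.0, flood_rate=8000):
    """One second of forged SYNs, then a single legitimate SYN."""
    queue = SynQueue(capacity, timeout)
    for i in range(flood_rate):
        # Every forged SYN claims a different (fake) source address.
        forged = ("10.%d.%d.%d" % ((i >> 16) & 255, (i >> 8) & 255, i & 255), 1024)
        queue.on_syn(forged, now=i / flood_rate)
    # Did the legitimate client find a free slot?
    return queue.on_syn(("192.0.2.7", 40000), now=1.0)
```

In this model simulate() returns False: the forged entries occupy every slot long before the timeout can reclaim any of them, so the legitimate SYN is dropped. Because each forged SYN uses a different source, a per-peer quota would not have helped either.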

A syncookie allows the server to defer using up any resources until the third packet in the three-way handshake has been received. At that time the peer's address has been mildly authenticated because the final packet in the handshake contains a reference to the sequence number that was sent by the server in the second packet. With this assurance, packet filters and resource quotas keyed to the peer's address will again be useful defenses against resource attacks.

The basic mechanism of the syncookie works by carefully manipulating the initial sequence number value of the connection instead of choosing it at random. Upon receiving a SYN the server encodes the vital information that would have been stored as state in the SYN queue. This encoded information is cryptographically hashed with a secret key to form the sequence number of the SYN-ACK and sent to the client. The third packet of a legitimate handshake, which is the ACK from the client back to the server, contains this sequence number (plus one) in its acknowledgment number field. The server validates it by recomputing the keyed hash from the fields of the arriving ACK and checking the result against the acknowledged value, so nothing ever needs to be decrypted. In this way all the information necessary to fully open the connection is presented back to the server without having to maintain state while the handshake is being completed.
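
A minimal sketch of the idea follows. It is a simplified illustration under stated assumptions, not the kernel's algorithm: the bit layout (5-bit time counter, 3-bit data field, 24-bit keyed hash), the use of HMAC-SHA256, the 64-second counter period, and all function names are hypothetical choices made for the example.

```python
import hashlib
import hmac
import socket
import struct

SECRET = b"hypothetical per-boot secret"
COUNTER_BITS = 5                              # coarse clock so old cookies expire
DATA_BITS = 3                                 # small encoded option value
MAC_BITS = 32 - COUNTER_BITS - DATA_BITS      # remaining 24 bits: keyed hash
PERIOD = 64                                   # seconds per counter tick (assumption)


def _mac(conn, client_isn, counter, data):
    """Keyed hash over everything the server would otherwise have to remember."""
    saddr, sport, daddr, dport = conn
    msg = struct.pack("!4sH4sHIBB", socket.inet_aton(saddr), sport,
                      socket.inet_aton(daddr), dport, client_isn, counter, data)
    digest = hmac.new(SECRET, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") & ((1 << MAC_BITS) - 1)


def make_cookie(conn, client_isn, data, now):
    """Build the 32-bit initial sequence number sent in the SYN-ACK."""
    data &= (1 << DATA_BITS) - 1
    counter = (int(now) // PERIOD) % (1 << COUNTER_BITS)
    mac = _mac(conn, client_isn, counter, data)
    return (counter << (DATA_BITS + MAC_BITS)) | (data << MAC_BITS) | mac


def check_cookie(conn, client_isn, ack_number, now, max_age=2):
    """Validate the final ACK; return the encoded data, or None if rejected."""
    cookie = (ack_number - 1) & 0xFFFFFFFF    # the ACK carries cookie + 1
    counter = cookie >> (DATA_BITS + MAC_BITS)
    data = (cookie >> MAC_BITS) & ((1 << DATA_BITS) - 1)
    mac = cookie & ((1 << MAC_BITS) - 1)
    current = (int(now) // PERIOD) % (1 << COUNTER_BITS)
    if (current - counter) % (1 << COUNTER_BITS) > max_age:
        return None                           # cookie is too old
    if mac != _mac(conn, client_isn, counter, data):
        return None                           # hash mismatch: forged or corrupted
    return data
```

Note that the server never decodes the hash itself. The counter and data bits travel in the clear inside the sequence number, and the hash only proves that the server issued them for this particular connection.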

The major downside to syncookies is that they only have space to encode the most basic of TCP handshake options. At the time of initial syncookie deployment this was not a large problem because the only option prominently in use was the Maximum Segment Size (MSS) option. This option helps the peer avoid unnecessary fragmentation: it keeps the peer from sending packets that the other end of the connection knows a priori are too large to cross its network. This is exactly the kind of information that is normally stored as state in the SYN queue. The syncookie designers knew that this option was important to performance and found 3 bits for it in the encoded syncookie. These bits are used to approximate the real value of the option to one of 8 common values.
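
The 3-bit approximation can be sketched as a table lookup. The eight values below are illustrative assumptions rather than a statement of what any particular kernel version uses, and the function names are hypothetical.

```python
import bisect

# Eight ascending MSS values, one per 3-bit index (illustrative assumption).
MSS_TABLE = [64, 256, 512, 536, 1024, 1440, 1460, 4312]


def mss_to_index(mss):
    """Pick the largest table entry that does not exceed the client's MSS.

    Rounding down is the safe direction: a smaller MSS costs some efficiency,
    while a larger one could cause the fragmentation the option exists to avoid.
    """
    idx = bisect.bisect_right(MSS_TABLE, mss) - 1
    return max(idx, 0)


def index_to_mss(idx):
    """Recover the approximated MSS from the 3 bits echoed back in the cookie."""
    return MSS_TABLE[idx & 0x7]
```

With this table a client advertising an MSS of 1400 would be treated as if it had advertised 1024, which shows the price of squeezing the option into three bits.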

In the intervening years new options have come into prominence and these are not syncookie compatible. The most important of these are the window scaling and Selective Acknowledgment (SACK) options. These features respectively allow the TCP window to grow beyond 64KB and make recovery more efficient in the case of minor packet losses from those large windows. Without using these features it is impossible to get good transfer rates on networks with large bandwidth or large latency. Many household broadband links require at least the window scaling option to fully utilize the network connection. Due to this limitation, and the modest computation overhead of the cryptographic hash, the Linux stack only resorts to syncookie based connections when the number of half-open connections exceeds a high watermark controlled by the net.ipv4.tcp_max_syn_backlog sysctl. These connections are less featureful than normal connections but they are only resorted to when the queue would otherwise require active pruning.
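
The cost of losing window scaling can be estimated with simple arithmetic: a sender can have at most one window of data in flight per round trip. The sketch below is a back-of-the-envelope model with hypothetical function names, and the 100 ms round trip time is an assumed example value.

```python
UNSCALED_MAX_WINDOW = 0xFFFF   # largest value of the 16-bit TCP window field


def scaled_window(window_field, wscale):
    """Effective window in bytes for a window field and a scale shift count."""
    return window_field << wscale


def max_throughput_bps(window_bytes, rtt_seconds):
    """Upper bound on throughput in bits per second: one window per round trip."""
    return window_bytes * 8 / rtt_seconds


# Assumed example: a path with a 100 ms round trip time.
without_scaling = max_throughput_bps(UNSCALED_MAX_WINDOW, 0.100)
with_scale_6 = max_throughput_bps(scaled_window(UNSCALED_MAX_WINDOW, 6), 0.100)
```

On that assumed path the unscaled window caps a connection at about 5.2 Mbit/s regardless of link speed, while a scale factor of 6 raises the ceiling to roughly 336 Mbit/s.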

It turns out that the cookie mechanism is only implemented for IPv4. Recently, Glenn Griffin posted patches that add IPv6 support for syncookies. Andi Kleen, author of the original syncookie patch, wondered if the mechanism should be continued at all, much less added to IPv6:

Syncookies are discouraged these days. They disable too many valuable TCP features (window scaling, SACK) and even without them the kernel is usually strong enough to defend against syn floods and systems have much more memory than they used to be. So I don't think it makes much sense to add more code to it, sorry.

Andi's argument was three-pronged. His first point was about the reduced abilities of cookie-initiated connections as already described in this article. Over time the value of these options has increased and therefore the cost of using syncookies has increased too. His second point was that Linux no longer uses all of the memory necessary for a full connection until the new connection is fully open. Instead it uses a "minisock" for that period. The minisock is a 96 byte struct tcp_request_sock structure holding the minimum state necessary to get the connection fully opened. The fully established struct tcp_sock is 1616 bytes. Both structure size measurements refer to a 64-bit kernel. Finally, Andi pointed out that the queue management routines for an overloaded SYN queue are more sophisticated now than the dumb head drop algorithm that was in place when syncookies were first deployed. The suggestion was that in aggregate these advances might make Linux robust enough without syncookies that they could be removed altogether.
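
The memory side of that argument is easy to quantify. The snippet below simply multiplies the structure sizes quoted above by a queue length; the function name is hypothetical and the figures ignore allocator overhead.

```python
MINISOCK_BYTES = 96      # struct tcp_request_sock on a 64-bit kernel (from the text)
FULL_SOCK_BYTES = 1616   # struct tcp_sock on a 64-bit kernel (from the text)


def half_open_memory(backlog, per_entry=MINISOCK_BYTES):
    """Bytes pinned by a completely full SYN queue of the given length."""
    return backlog * per_entry


default_queue = half_open_memory(1024)                     # 98304 bytes
as_full_socks = half_open_memory(1024, FULL_SOCK_BYTES)    # 1654784 bytes
large_queue = half_open_memory(63000)                      # 6048000 bytes
```

A full default-sized queue of 1024 minisocks pins under 100 KB, and even the 63000-entry backlog used in one of the tests below stays near 6 MB, which is why memory alone is no longer the limiting resource.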

Instead of engaging in a theoretical discussion some readers set up and ran their own experiments. One of the best parts of the Linux community is its tendency to put real data behind arguments. While there is often disagreement over the realism of the measured scenarios, the data points always help us better understand the dynamics of kernel code.

Willy Tarreau: My tests on an AMD LX800 with max_syn_backlog at 63000 on an HTTP reverse proxy consisted in injecting 250 hits/s of legitimate traffic with 8000 SYN/s of noise.[..] Without SYN cookies, the average response time was about 1.5 second and unstable (due to retransmits), and the CPU was set to 60%. With SYN cookies enabled, the response time dropped to 12-15ms only, but CPU usage jumped to 70%. The difference appears at a higher legitimate traffic rate.

Ross Vandegrift: Under no SYN flood, the server handles 750 HTTP requests per second, measured via httping in flood mode. With a default tcp_max_syn_backlog of 1024, I can trivially prevent any inbound client connections with 2 threads of syn flood. Enabling tcp_syncookies brings the connection handling back up to 725 fetches per second.

This data compellingly supports the continued value of the syncookie and that position seems to have won the day. The IPv6 syncookie patches are now queued within the network 2.6.26 development tree.

However, the biggest news is probably that this discussion brought renewed energy to the problem of lost handshake options. Florian Westphal and Glenn Griffin have recently presented a solution to the most damaging aspect of that problem too.

Their solution is to leverage the echoed TCP timestamp option in a way similar to the way classic syncookies leverage the echoing of the SYN-ACK sequence number in the subsequent ACK. The timestamp option was introduced with RFC 1323 and is widely deployed on modern Linux, Windows, and FreeBSD (including OS X) systems. Its main purpose is to be able to increase the frequency of round trip time measurements in the presence of large congestion control windows.

Using the timestamp to preserve the window scale and SACK option values requires modifying the timestamp of the SYN-ACK packet to include the state necessary to support them. During a normal handshake the client will echo the modified timestamp value of the SYN-ACK packet back to the server as part of the timestamp option on the third packet of the handshake and thus propagate the SACK and window scale information without keeping any state on the server.

In order to make room in the timestamp for this new information the least significant 9 bits of the timestamp are shaved off. The encoded representations of the window scale and SACK options are then transferred back and forth at the minor cost of reduced granularity of TCP timestamps during the handshake exchange. With this approach timestamps lose 9 bits of precision, so they are only accurate to within 512 jiffies.
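
One way to picture the encoding is the sketch below. The field layout (4 bits for each window scale plus 1 bit for SACK), the 0xF "no window scaling" marker, and the function names are assumptions chosen for illustration; the article only states that 9 low-order bits are used.

```python
TS_OPT_BITS = 9
TS_OPT_MASK = (1 << TS_OPT_BITS) - 1
NO_WSCALE = 0xF   # hypothetical marker meaning "peer did not offer window scaling"


def encode_timestamp(ts_now, snd_wscale, rcv_wscale, sack_ok, wscale_ok=True):
    """Replace the low 9 bits of the server timestamp with option state."""
    options = snd_wscale if wscale_ok else NO_WSCALE
    options |= rcv_wscale << 4
    options |= int(sack_ok) << 8
    ts = (ts_now & ~TS_OPT_MASK) | options
    if ts > ts_now:
        # Never send a timestamp from the future: step back one 512-tick block.
        ts = (ts - (1 << TS_OPT_BITS)) & 0xFFFFFFFF
    return ts


def decode_timestamp(tsecr):
    """Recover the option state from the timestamp echoed in the final ACK."""
    options = tsecr & TS_OPT_MASK
    snd = options & 0xF
    return {
        "wscale_ok": snd != NO_WSCALE,
        "snd_wscale": 0 if snd == NO_WSCALE else snd,
        "rcv_wscale": (options >> 4) & 0xF,
        "sack_ok": bool(options >> 8),
    }
```

Under this assumed layout, a connection with both window scales set to 6 and SACK enabled yields the low-order bits 0x166. Stepping backward rather than forward when the encoded value would exceed the current clock keeps the server's timestamps from running ahead of real time.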

Below are two different TCP handshakes completed with syncookies and the timestamp patch. The handshakes take place at different points in time, so the timestamp values in the two SYN-ACKs differ, but because each handshake uses the same SACK and window scaling options the lower nine bits of both SYN-ACK timestamps share the same 0x166 value.

13:51:04.582464 IP 127.0.0.1.57985 > 127.0.0.1.4050: S 1061746051:1061746051(0)
           win 32792 <mss 16396,sackOK,timestamp 0xfffea013 0,nop,wscale 6>
13:51:04.582478 IP 127.0.0.1.4050 > 127.0.0.1.57985: S 2800702917:2800702917(0)
           ack 1061746052 win 32768 <mss 16396,sackOK,timestamp 0xfffe9f66 0xfffea013,nop,wscale 6>
13:51:04.582480 IP 127.0.0.1.57985 > 127.0.0.1.4050: . 
           ack 1 win 513 <nop,nop,timestamp 0xfffea013 0xfffe9f66>

13:59:19.047306 IP 127.0.0.1.45979 > 127.0.0.1.4050: S 218483035:218483035(0) 
           win 32792 <mss 16396,sackOK,timestamp 0x0001bed4 0,nop,wscale 6>
13:59:19.047320 IP 127.0.0.1.4050 > 127.0.0.1.45979: S 1141094138:1141094138(0)
           ack 218483036 win 32768 <mss 16396,sackOK,timestamp 0x0001bd66 0x0001bed4,nop,wscale 6>
13:59:19.047322 IP 127.0.0.1.45979 > 127.0.0.1.4050: . 
           ack 1 win 513 <nop,nop,timestamp 0x0001bed4 0x0001bd66>
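
The claim about the shared low-order bits can be checked directly against the two SYN-ACK timestamps in the traces. This is a small verification snippet; the variable names are arbitrary.

```python
MASK_9_BITS = (1 << 9) - 1

# SYN-ACK timestamp values copied from the two handshakes above.
synack_timestamps = [0xfffe9f66, 0x0001bd66]

# Low nine bits carry the encoded options; the rest is the coarse clock.
option_bits = [ts & MASK_9_BITS for ts in synack_timestamps]
clock_bits = [ts >> 9 for ts in synack_timestamps]

print([hex(bits) for bits in option_bits])   # ['0x166', '0x166']
```

The option bits are identical while the remaining clock bits differ, as expected for two handshakes with the same options made several minutes apart.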

While there is no guarantee that the timestamp option will be supported by every TCP peer, timestamps are widely deployed on the most common operating systems. Additionally, because timestamps, window scaling, and selective acknowledgments are all features related to high-latency and high-bandwidth networks, it would be unlikely to find an implementation that supported only a subset of these options.

One shortcoming of the scheme is that it is not general enough to be future-proof as new handshake based options may continue to be deployed. At this time the MSS, SACK, window scaling, and timestamp options are the only handshake options seen with any regularity other than the NOP option which is just used for packet alignment. However, the whole point of an extensible option scheme is to leave room for future improvements. The IANA registry that records option values was last updated in February 2007 to reserve option code 27 for use with Experimental RFC 4782 "Quick Start for TCP and IP". Only time will tell if that particular option will be the next challenge to the syncookie scheme or if something else will rise first.

The timestamp patch has only been posted very recently, and there has been little discussion of it beyond the developers who worked directly on it. It is not clear whether or not it will be accepted right away into the mainline, but it certainly seems to address a well known core problem with the syncookie at a minor cost.

With the updates for IPv6 and modern TCP option schemes syncookies appear primed to keep providing sweet relief in their somewhat esoteric networking security niche. Perhaps they will keep chugging away for another 10 years without having to be re-baked.

Improving syncookies

Posted Apr 10, 2008 4:46 UTC (Thu) by skissane (subscriber, #38675)

Maybe the solution is to add a "syncookie" option? Basically like this:
- client sends SYN with arbitrary options
- server encrypts all the options it understands + any other info it needs and returns them as
an option to SYN-ACK
- client sends ACK, echoing that encrypted option
- server decrypts it and uses it as the syn queue info

Of course, this would be useless without changes to the client OS as well as the server. But
it would give all the advantages of syn cookies (no need to retain a syn queue in memory), but
at the same time work with arbitrary TCP options....

Improving syncookies

Posted Apr 10, 2008 13:39 UTC (Thu) by mcmanus (guest, #4569)

The timestamp option patch has been accepted into the 2.6.26 development tree this morning.

Improving syncookies

Posted Apr 10, 2008 22:52 UTC (Thu) by smoogen (subscriber, #97)

I am glad they didn't 'remove' the Syncookie thing.. we still get script-kiddie SYN attacks in
the wild.. and they still take down a newer Linux box.. [at least in a DOS situation.. the
server doesnt panic and die...]

Compression?

Posted Apr 10, 2008 23:40 UTC (Thu) by jzbiciak (guest, #5246) (2 responses)

If I understood correctly, we have a dozen or so bits in the syncookie to store TCP options.
Right now, the options are being stored with an ad hoc encoding that doesn't evolve.

Syncookies only kick in when there's a certain connection backlog, and so represent a graceful
failure strategy.  As a result, most connections are legitimate, and the kernel could keep
some statistics on what option sets are "popular" among legitimate connections.  From this, it
could build a table with the "N" most popular option sets, in essence defining an ad hoc
encoding rather than a rigorous encoding.  Such an ad hoc encoding would evolve as protocols
evolve.

Within 12 bits, we can represent 4096 such sets.  Even if an option set required 64 bits (8
bytes) (that's including its "popularity histogram"), that's only 32K of storage.  If we
reserve some subset of these as "static", for mapping connections onto when syncookies are
enabled and there's no perfect match, then we have a graceful fallback mechanism that also
evolves.  You'd probably need some additional storage to keep track of "most popular recent
misses" to allow new entries to climb their way into the table.

Since I suspect there's strong correlation between certain feature combinations, I imagine
such a table will be fairly stable most of the time.

Or is this too crazy?

Compression?

Posted Apr 11, 2008 15:14 UTC (Fri) by Randakar (guest, #27808) (1 responses)


One consideration is that with DOS attacks the attacker is trying to make the receiving end do
as much work as possible for as little cost to the attacker as possible. So with this
implementation he'd use an odd combination of option flags to make your server burn as much
bandwidth as possible. More than he is using in sending out SYN packets.

You can't really put more data in your ACK than he is putting in his SYN or you will lose.

Good security requires careful thought :-)

Compression?

Posted Apr 14, 2008 20:42 UTC (Mon) by jzbiciak (guest, #5246)

I don't know that such a lookup is terribly expensive at all.  If someone shows up with
options that aren't in your "greatest hits" list, you don't need to update anything until the
person makes a complete connection.

Initially, syncookies don't even get engaged at all anyway.  Once they do get engaged, all
you're left with is a table lookup to find hit/miss in the greatest hits table, and a version
of the existing cryptographic hash if the pattern's a miss.  With a properly designed
(non-cryptographic) hash for the "greatest hits" lookup, that should go very quickly and
cheaply, and can even save you from doing the cryptographic hash for the syncookie.  You could
actually end up ahead of the curve in terms of computational load.

encrypted, or hashed?

Posted Apr 14, 2008 19:38 UTC (Mon) by astrophoenix (guest, #13528) (1 responses)

forgive me if I sound ignorant, but this sentence doesn't make sense to 
me:

"This encoded information is cryptographically hashed with a secret key 
to form the sequence number of the SYN-ACK and sent to the client."

Shouldn't it read something like "encrypted with a secret key", rather 
than "cryptographically hashed with a secret key"? I was thinking if it 
was hashed, the kernel wouldn't be able to decode it when the ack comes 
in.

encrypted, or hashed?

Posted Jul 14, 2008 7:39 UTC (Mon) by hso (guest, #24163)

> Shouldn't it read something like "encrypted with a secret key", rather 
> than "cryptographically hashed with a secret key"? I was thinking if it 
> was hashed, the kernel wouldn't be able to decode it when the ack comes 
> in.

No. It's using a hash algorithm with a key. Saying "encrypted with a secret key" would be
incorrect. Hash algorithms definitely != cipher algorithms.