April 9, 2008
This article was contributed by Patrick McManus
Back in 1997 TCP SYN flood attacks were all the rage among script
kiddies. A SYN flood is a denial of service attack that uses up server
resources by initiating, but not completing, a connection. Such attacks
remain a problem today, though they are now more likely to be launched
by sophisticated botnets than by an individual. A first line of defense
against SYN floods is the syncookie. The syncookie was not designed for
Linux specifically
but found its way into kernel 2.1.44 via a patch from Andi Kleen.
This long-standing feature generated some recent discussion when a
patch was submitted adding syncookie support to IPv6. The patch has
now been queued for acceptance, but in the discussion along the way
the community also began to tackle some longstanding limitations of
syncookies and reaffirmed how relevant the feature continues to be.
To fully describe syncookies, some background on how TCP uses a
three-way handshake to establish a connection is in order. The first packet
of any TCP session received by the server is known as the SYN packet
because it carries the synchronize control flag. The SYN flag
indicates that its sender wishes to open a new connection. That flag
is only used during the opening sequence. The server responds with a
packet also containing the SYN flag because the connection needs to be
opened in both directions. This second packet also carries the ACK
flag and is known as the SYN-ACK. It serves to both open the
connection from the server to the client and to acknowledge receipt of
the opening packet from the other host. Finally, the client sends a
bare ACK packet to the server to acknowledge receipt of the
server-to-client SYN-ACK, and the connection is then fully established.
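The sequence and acknowledgment arithmetic of the handshake can be
sketched as a toy model in Python. Nothing here touches the network;
the Segment type and the handshake() function are hypothetical names
introduced only for illustration.

```python
from dataclasses import dataclass

SEQ_MOD = 2 ** 32  # TCP sequence numbers are 32 bits wide and wrap around


@dataclass
class Segment:
    flags: frozenset  # control flags carried by the packet
    seq: int          # sequence number
    ack: int = 0      # acknowledgment number (meaningful only with ACK)


def handshake(client_isn, server_isn):
    """Return the three packets of a handshake as Segment objects."""
    # 1. Client asks to open the client-to-server direction.
    syn = Segment(frozenset({"SYN"}), seq=client_isn)
    # 2. Server opens the reverse direction and acknowledges the SYN.
    syn_ack = Segment(frozenset({"SYN", "ACK"}), seq=server_isn,
                      ack=(syn.seq + 1) % SEQ_MOD)
    # 3. Client acknowledges the server's SYN; the connection is open.
    ack = Segment(frozenset({"ACK"}), seq=syn_ack.ack,
                  ack=(syn_ack.seq + 1) % SEQ_MOD)
    return [syn, syn_ack, ack]


packets = handshake(client_isn=1000, server_isn=5000)
for p in packets:
    print(sorted(p.flags), p.seq, p.ack)
```

The detail that matters for syncookies is in the last packet: its
acknowledgment number is always the server's initial sequence number
plus one, so whatever the server chose as its sequence number comes
back to it.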
During a SYN flood a server receives the first packet of the three-way
TCP handshake and responds with a SYN-ACK but no further data is ever
received from the initiating client. When the SYN-ACK is generated
most servers will also create an entry in the SYN queue. This queue is
the waiting area for half-open connections awaiting handshake
completion. The attacker intentionally orphans those entries and
instead generates more SYN packets which in turn take up more entries
in the queue. The server needs to wait for a long timeout before
giving up and recovering the connection resources. During this time
the attacker can flood it with many more half-open connections.
Eventually the server runs out of resources and cannot accept any new
connections without dropping some, perhaps legitimate, connection from
the queue. Simple solutions such as placing a quota on the number of
partially open connections per peer or using dynamically adjusted
packet filters do not work because the SYN packets are easy to forge
with fake source addresses.
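How quickly a bounded SYN queue fills can be illustrated with a small
simulation. This is a deliberately simplified model, not how any real
stack behaves: the queue simply refuses new entries when full, and the
capacity, timeout, and flood rate used below are assumed values chosen
for the example.

```python
from collections import deque


def simulate_syn_queue(capacity, timeout, syn_times):
    """Count SYNs refused because the half-open queue was full.

    syn_times is an ascending list of arrival times in seconds. No
    handshake is ever completed, as in a SYN flood, so an entry only
    leaves the queue when its timeout expires.
    """
    expiries = deque()  # expiry time of each half-open entry, oldest first
    refused = 0
    for now in syn_times:
        # Reclaim entries whose timeout has passed.
        while expiries and expiries[0] <= now:
            expiries.popleft()
        if len(expiries) >= capacity:
            refused += 1
        else:
            expiries.append(now + timeout)
    return refused


# One second of flooding at 8000 SYN/s against a 1024-entry queue
# with an assumed 60 second timeout.
flood = [i / 8000 for i in range(8000)]
print(simulate_syn_queue(capacity=1024, timeout=60.0, syn_times=flood))  # 6976
```

Once the first 1024 forged SYNs have arrived, every later SYN, forged
or legitimate, finds the queue full until the timeouts start to expire.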
A syncookie allows the server to defer using up any resources
until the third packet in the three-way handshake has been
received. At that time the peer's address has been mildly
authenticated because the final packet in the handshake contains
a reference to the sequence number that was sent by the server in the
second packet. With this assurance, packet filters and resource quotas
keyed to the peer's address will again be useful defenses against
resource attacks.
The basic mechanism of the syncookie works by carefully manipulating
the initial sequence number value of the connection instead of
choosing it at random. Upon receiving a SYN the server encodes
the vital information that would otherwise have been stored as state in
the SYN queue. This encoded information is cryptographically hashed
with a secret key to form the sequence number of the SYN-ACK and sent
to the client. The third packet of a legitimate handshake, which is
the ACK from the client back to the server, contains this sequence
number (plus one) in its acknowledgment number field. In this way all
the information necessary to fully open the connection is presented
back to the server without having to maintain state while the
handshake is being completed.
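The idea can be sketched in Python as follows. This is a minimal
illustration under stated assumptions and not the kernel's cookie
layout: the bit split (29 bits of hash, 3 bits of data), the use of
HMAC-SHA256, the coarse time counter, and the names make_cookie() and
check_cookie() are all choices made for this example.

```python
import hashlib
import hmac
import struct

SECRET = b"example-secret"  # hypothetical; a real server uses a random key
DATA_BITS = 3               # small payload carried in the cookie itself
DATA_MASK = (1 << DATA_BITS) - 1
MAC_MASK = (1 << (32 - DATA_BITS)) - 1


def _mac(conn, client_isn, counter, data):
    """Keyed hash over the connection identity, truncated to 29 bits."""
    saddr, sport, daddr, dport = conn
    msg = struct.pack("!4sH4sHIIB", saddr, sport, daddr, dport,
                      client_isn, counter % 2 ** 32, data)
    digest = hmac.new(SECRET, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") & MAC_MASK


def make_cookie(conn, client_isn, counter, data):
    """Build the 32-bit value sent as the SYN-ACK sequence number."""
    return (_mac(conn, client_isn, counter, data) << DATA_BITS) | data


def check_cookie(conn, client_isn, counter, ack_number, max_age=1):
    """Validate the final ACK; return the data bits, or None if invalid."""
    cookie = (ack_number - 1) % 2 ** 32  # the ACK carries cookie + 1
    data = cookie & DATA_MASK
    # Accept cookies minted during the current or a recent counter tick.
    for age in range(max_age + 1):
        if make_cookie(conn, client_isn, counter - age, data) == cookie:
            return data
    return None


conn = (bytes([192, 0, 2, 1]), 57985, bytes([192, 0, 2, 2]), 80)
cookie = make_cookie(conn, client_isn=1000, counter=100, data=5)
print(check_cookie(conn, 1000, 100, (cookie + 1) % 2 ** 32))  # 5
```

The server stores nothing between the SYN-ACK and the ACK; everything
it needs is either in the returning packet or recomputed from the
secret. The time counter keeps old cookies from being replayed forever.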
The major downside to syncookies is that they only have space to
encode the most basic of TCP handshake options. At the time of initial
syncookie deployment this was not a large problem because the only option
prominently in use at the time was the Maximum Segment Size (MSS)
option. This option is provided to help the peer avoid unnecessary
fragmentation by sending packets that the other end of the connection
knows a priori are too large to cross its network. This is exactly the kind
of information that is normally stored as state in the SYN queue. The
syncookie designers knew that this option was important to performance
and found 3 bits for it in the encoded syncookie. These bits are used to
approximate the real value of the option to one of 8 common values.
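A three-bit MSS encoding amounts to a lookup in an eight-entry table.
The sketch below shows the idea; the table values are illustrative
assumptions and are not claimed to match any particular kernel
version. Rounding down is chosen here so that the server never assumes
a larger segment size than the client announced.

```python
import bisect

# Eight ascending entries, so an index fits in 3 bits (assumed values).
MSS_TABLE = [64, 256, 512, 536, 1024, 1440, 1460, 4312]


def encode_mss(mss):
    """Return the index of the largest table entry not exceeding mss."""
    index = bisect.bisect_right(MSS_TABLE, mss) - 1
    # Values below the smallest entry fall back to index 0.
    return max(index, 0)


def decode_mss(index):
    """Recover the approximated MSS from the 3-bit index."""
    return MSS_TABLE[index & 0x7]


for announced in (1460, 1400, 16396):
    idx = encode_mss(announced)
    print(announced, "->", idx, "->", decode_mss(idx))
```

An announced MSS of 1400 comes back as 1024 with this table, which
shows the cost of the approximation: the connection works, but with
smaller segments than the path could have carried.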
In the intervening years new options have come into prominence and
these are not syncookie compatible. The most important of these are the
window scaling and Selective Acknowledgment (SACK) options. Window
scaling allows the TCP window to grow beyond 64KB, and SACK makes
recovery more efficient when a few packets are lost from such a large
window. Without these features it is impossible to get good
transfer rates on networks with large bandwidth or large latency. Many
household broadband links require at least the window scaling option
to fully utilize the network connection. Because of this limitation, and
the modest computational overhead of the cryptographic hash, the
Linux stack only resorts to syncookie-based connections when the
number of half-open connections exceeds a high watermark controlled by
the net.ipv4.tcp_max_syn_backlog sysctl. These connections are less
featureful than normal connections, but they are used only when
the queue would otherwise require active pruning.
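The fallback policy described above can be sketched as a single
comparison. The kernel's real test is more involved; should_send_cookie()
is a hypothetical name, and read_sysctl() is a small helper written for
this example that reads a value from /proc/sys and returns None where
that file is unavailable.

```python
from pathlib import Path


def read_sysctl(name, root="/proc/sys"):
    """Read an integer sysctl such as net.ipv4.tcp_max_syn_backlog."""
    path = Path(root) / name.replace(".", "/")
    try:
        return int(path.read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None  # not Linux, or the value is not a plain integer


def should_send_cookie(half_open, max_syn_backlog, syncookies_enabled):
    """Use a syncookie only once the queue has passed the watermark."""
    return bool(syncookies_enabled) and half_open > max_syn_backlog


backlog = read_sysctl("net.ipv4.tcp_max_syn_backlog")
enabled = read_sysctl("net.ipv4.tcp_syncookies")
print("backlog:", backlog, "syncookies:", enabled)
print(should_send_cookie(half_open=2000, max_syn_backlog=1024,
                         syncookies_enabled=1))  # True
```

Below the watermark every connection gets a normal queue entry and
keeps its full set of options; only the overflow is handled statelessly.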
It turns out that the cookie mechanism is only implemented for
IPv4. Recently, Glenn Griffin posted patches that add IPv6 support
for syncookies. Andi Kleen, author of the original syncookie patch,
wondered if the mechanism should be continued at all, much less added
to IPv6:
Syncookies are discouraged these days. They disable too many
valuable TCP features (window scaling, SACK) and even without them
the kernel is usually strong enough to defend against syn floods
and systems have much more memory than they used to be.
So I don't think it makes much sense to add more code to it, sorry.
Andi's argument was three-pronged. His first point was about the
reduced abilities of cookie initiated connections as already described
in this article. Over time the value of these options has increased
and therefore the cost of using syncookies has increased too. His
second point was that Linux no longer uses all of the memory necessary
for a full connection until the new connection is fully open. Instead
it uses a "minisock" for that period. The minisock is a 96-byte
struct tcp_request_sock structure holding the minimum state
necessary to get the connection fully opened. The fully established
struct tcp_sock is 1616 bytes. Both structure size
measurements refer to a 64-bit kernel. Finally, Andi pointed out that
the queue management routines for an overloaded SYN queue are more
sophisticated now than the dumb head drop algorithm that was in place
when syncookies were first deployed. The suggestion was that in
aggregate these advances might make Linux robust enough without
syncookies that they could be removed altogether.
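The memory side of this argument is simple arithmetic. The sketch
below uses the two structure sizes given above and, as an assumed
queue length, the default tcp_max_syn_backlog of 1024 mentioned in one
of the test reports quoted in this article.

```python
REQUEST_SOCK_BYTES = 96   # struct tcp_request_sock, 64-bit kernel
FULL_SOCK_BYTES = 1616    # struct tcp_sock, 64-bit kernel


def queue_memory(entries, bytes_per_entry):
    """Bytes needed to hold the given number of half-open connections."""
    return entries * bytes_per_entry


backlog = 1024  # assumed queue length for the example
mini = queue_memory(backlog, REQUEST_SOCK_BYTES)
full = queue_memory(backlog, FULL_SOCK_BYTES)
print(mini, full, round(full / mini, 1))  # 98304 1654784 16.8
```

A full queue of minisocks costs under 100KB, roughly one seventeenth
of what the same queue would need with complete socket structures.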
Instead of engaging in a theoretical discussion some readers set up and
ran their own experiments. One of the best parts of the Linux
community is its tendency to put real data behind
arguments. While there is often disagreement over the realism of the
measured scenarios, the data points always help us better understand
the dynamics of kernel code.
Willy Tarreau: My tests on an AMD LX800 with max_syn_backlog at 63000 on an HTTP
reverse proxy consisted in injecting 250 hits/s of legitimate traffic
with 8000 SYN/s of noise.[..] Without SYN cookies, the average
response time was about 1.5 second and unstable (due to retransmits),
and the CPU was set to 60%. With SYN cookies enabled, the response
time dropped to 12-15ms only, but CPU usage jumped to 70%. The
difference appears at a higher legitimate traffic rate.
Ross Vandegrift:
Under no SYN flood, the server handles 750 HTTP requests per second,
measured via httping in flood mode. With a default tcp_max_syn_backlog
of 1024, I can trivially prevent any inbound client connections with 2
threads of syn flood. Enabling tcp_syncookies brings the connection
handling back up to 725 fetches per second.
This data compellingly supports the continued value of the syncookie
and that position seems to have won the day. The IPv6 syncookie
patches are now queued within the network 2.6.26 development tree.
However, the biggest news is probably that this discussion brought
renewed energy to the problem of lost handshake options. Florian
Westphal and Glenn Griffin have recently presented a solution to the
most damaging aspect of that problem too.
Their solution is to leverage
the echoed TCP timestamp option in a way similar to the way classic
syncookies leverage the echoing of the SYN-ACK sequence number in the
subsequent ACK. The timestamp option was introduced with RFC 1323 and
is widely deployed on modern Linux, Windows, and FreeBSD (including OS
X) systems. Its main purpose is to allow more frequent round-trip
time measurements in the presence of large congestion control
windows.
Using the timestamp to preserve the window scale and SACK option
values requires modifying the timestamp of the SYN-ACK packet to
include the state necessary to support them. During a normal handshake the
client will echo the modified timestamp value of the SYN-ACK packet back
to the server as part of the timestamp option on the third packet of the
handshake and thus propagate the SACK and window scale information
without keeping any state on the server.
In order to make room in the timestamp for this new information the
least significant 9 bits of the timestamp are shaved off. The encoded
representation of the window scale and SACK options is then
transferred back and forth at the minor cost of reduced granularity of
TCP timestamps during the handshake exchange: with the low 9 bits
given over to option state, timestamps are only accurate to within 512
jiffies.
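A sketch of how option state might be packed into those nine bits
follows. The field layout (two 4-bit window scale values and one SACK
bit), the step back of 512 ticks so that the result never runs ahead of
the clock, and all function names are assumptions made for this
example; they are not taken from the patch.

```python
TS_OPT_BITS = 9
TS_OPT_MASK = (1 << TS_OPT_BITS) - 1


def encode_options(snd_wscale, rcv_wscale, sack_ok):
    """Pack two 4-bit window scales and a SACK flag into 9 bits."""
    return (int(sack_ok) << 8) | ((rcv_wscale & 0xF) << 4) | (snd_wscale & 0xF)


def make_tsval(now, options):
    """Replace the low 9 bits of the clock value with option state."""
    ts = (now & ~TS_OPT_MASK) | options
    if ts > now:
        # Assumed rule: never advertise a timestamp from the future.
        ts -= 1 << TS_OPT_BITS
    return ts & 0xFFFFFFFF


def decode_tsecr(tsecr):
    """Recover the option state from the timestamp echoed in the ACK."""
    opts = tsecr & TS_OPT_MASK
    return {
        "snd_wscale": opts & 0xF,
        "rcv_wscale": (opts >> 4) & 0xF,
        "sack_ok": bool(opts >> 8),
    }


options = encode_options(snd_wscale=6, rcv_wscale=6, sack_ok=True)
tsval = make_tsval(now=0x0001BED4, options=options)
print(hex(options), hex(tsval), decode_tsecr(tsval))
```

Because the client echoes the server's timestamp unchanged, decoding
the echoed value on the third packet returns exactly what was encoded,
with no per-connection state held in between.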
Below are two different TCP handshakes completed with syncookies and
the timestamp patch. The handshakes take place at different points in
time, so the timestamp values in the two SYN-ACKs differ. Because
each handshake uses the same SACK and window scaling options,
however, the lower nine bits of both SYN-ACK timestamps share the
same 0x166 value.
13:51:04.582464 IP 127.0.0.1.57985 > 127.0.0.1.4050: S 1061746051:1061746051(0)
win 32792 <mss 16396,sackOK,timestamp 0xfffea013 0,nop,wscale 6>
13:51:04.582478 IP 127.0.0.1.4050 > 127.0.0.1.57985: S 2800702917:2800702917(0)
ack 1061746052 win 32768 <mss 16396,sackOK,timestamp 0xfffe9f66 0xfffea013,nop,wscale 6>
13:51:04.582480 IP 127.0.0.1.57985 > 127.0.0.1.4050: .
ack 1 win 513 <nop,nop,timestamp 0xfffea013 0xfffe9f66>
13:59:19.047306 IP 127.0.0.1.45979 > 127.0.0.1.4050: S 218483035:218483035(0)
win 32792 <mss 16396,sackOK,timestamp 0x0001bed4 0,nop,wscale 6>
13:59:19.047320 IP 127.0.0.1.4050 > 127.0.0.1.45979: S 1141094138:1141094138(0)
ack 218483036 win 32768 <mss 16396,sackOK,timestamp 0x0001bd66 0x0001bed4,nop,wscale 6>
13:59:19.047322 IP 127.0.0.1.45979 > 127.0.0.1.4050: .
ack 1 win 513 <nop,nop,timestamp 0x0001bed4 0x0001bd66>
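The claim about the low nine bits can be checked directly against the
two SYN-ACK timestamps in the traces above. The snippet only masks the
values; the variable names are chosen for this example.

```python
OPTION_MASK = (1 << 9) - 1  # the nine least significant bits

# TSval fields of the two SYN-ACK packets in the traces.
syn_ack_tsvals = [0xFFFE9F66, 0x0001BD66]

low_bits = [ts & OPTION_MASK for ts in syn_ack_tsvals]
high_bits = [ts >> 9 for ts in syn_ack_tsvals]

for ts, low in zip(syn_ack_tsvals, low_bits):
    print(hex(ts), "->", hex(low))
```

Both values mask to 0x166 while the remaining high bits differ, which
is what one would expect if the high bits carry the clock and the low
bits carry the option state.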
While there is no guarantee that the timestamp option will be
supported by every TCP peer, timestamps are widely deployed on the most
common operating systems. Additionally, because timestamps, window
scaling, and selective acknowledgments are all features aimed at
high-latency, high-bandwidth networks, it would be unusual to find an
implementation that supported only a subset of these options.
One shortcoming of the scheme is that it is not general enough to be
future-proof as new handshake based options may continue to be
deployed. At this time the MSS, SACK, window scaling, and timestamp
options are the only handshake options seen with any regularity other
than the NOP option which is just used for packet alignment. However,
the whole point of an extensible option scheme is to leave room for
future improvements. The IANA registry that records option values was
last updated in February 2007 to reserve option code 27 for use with
Experimental RFC 4782 "Quick Start for TCP and IP". Only time will
tell if that particular option will be the next challenge to the
syncookie scheme or if something else will rise first.
The timestamp patch has only been posted very recently, and there has
been little discussion of it beyond the developers who worked directly
on it. It is not clear whether or not it will be accepted right
away into the mainline, but it certainly seems to address a well known
core problem with the syncookie at a minor cost.
With the updates for IPv6 and modern TCP option schemes syncookies
appear primed to keep providing sweet relief in their somewhat
esoteric networking security niche. Perhaps they will keep chugging
away for another 10 years without having to be re-baked.