Brief items
The current 2.6 development kernel remains 2.6.25-rc8; there have
been no kernel releases over the last week. The patch rate into the
mainline repository has slowed considerably, but the current
regression list suggests that
the 2.6.25 release is not imminent quite yet.
Kernel development news
The big item (in more ways than one) for this release is the
addition of s390 support. As it is not actually provided in the
tarball, you will need to use git to fetch it. You will also need
a mainframe.
-- Avi Kivity finally brings virtualization to s390
err = -ENOBUFS; /* PS. You suck! */
-- Rusty Russell invents enhanced error codes
Greg Kroah-Hartman has sent out a lengthy report on the state of the Linux
Driver Project. "The main problem is a lack of projects. It turns out that
there really isn't much hardware that Linux doesn't already support.
Almost all new hardware produced is coming with a Linux driver already
written by the company, or by the community with help from the
company."
By Jonathan Corbet
April 7, 2008
People who put their Linux systems under a certain amount of memory stress
- and who look at their logfiles - may notice an occasional message
indicating that a "page allocation failure" has occurred,
followed by a scary backtrace. These people may also notice that,
despite the apocalyptic appearance of this message, the world often fails
to end. In fact, the system tends to carry on just fine. For this reason,
Dave Jones, who probably gets ten emails for every backtrace generated on a
Fedora system, has
suggested that these
messages are simply noise which should be removed. Whether that should
really happen is not entirely clear, though; understanding why requires a
bit of background.
In general, the kernel's memory allocator does not like to fail. So, when
kernel code requests memory, the memory management code will work hard to
satisfy the request. If this work involves pushing other pages out to swap
or removing data from the page cache, so be it. A big exception happens,
though, when an atomic allocation (using the GFP_ATOMIC flag) is
requested. Code requesting atomic allocations is generally not in a
position where it can wait around for a lot of memory housecleaning work;
in particular, such code cannot sleep. So if the memory manager is unable
to satisfy an atomic allocation with the memory it has in hand, it has no
choice except to fail the request.
Such failures are quite rare, especially when single pages are
requested. The kernel works to keep some spare pages around at all times,
so the memory stress must be severe before a single-page allocation will
fail. Multi-page allocations are harder, though; the kernel's memory
management code tends to fragment pages, making groups of
physically-contiguous pages hard to find. In particular, if the system is
under pressure to the point that there is not much free memory available at
all, the chances of successfully allocating two (or more) contiguous pages
drop considerably.
Multi-page allocations are not often used in the kernel; they are avoided
whenever possible. There are situations where they are necessary,
though. One example is network drivers which (1) support the
transmission and reception of packets too large to fit into a single page,
and which (2) drive hardware which cannot perform scatter/gather I/O
on a single packet. In this situation, the DMA buffers used for packets
must be larger than one page, and they must be physically contiguous. This
is a situation which will become less
pressing over time; scatter/gather capability in the hardware is
increasingly common, and drivers are being rewritten to make use of this
capability. With sufficiently smart hardware, the need for multi-page
allocations goes down considerably.
But all of that skirts around the main point, which is that kernel code is
supposed to handle allocation failures properly. There is never any
guarantee that memory will be available, so kernel code must be written
defensively. Allocation failures must be handled without losing any
more capability than is strictly necessary. If one assumes that kernel
code is written correctly, there should be no need to issue warnings on
allocation failures. Things should just continue to work, perhaps without
users noticing at all.
And, in fact, things often do just work. But the discussion resulting from
Dave's suggestion makes it clear that few developers are confident that all
kernel code does the right thing in the face of memory allocation
problems. In cases where an allocation failure is not dealt with
correctly, the system may go down in random places, leaving few clues as to
what really happened. In that kind of situation, the allocation failure
warning may be the only useful information which survives the crash. For
this reason, some people want to see the warnings left in place.
As it happens, the memory allocator supports a special bit
(__GFP_NOWARN) which causes the warning not to be emitted if a
specific allocation fails. So it has been suggested that the allocations
made from code which is known to handle failures properly have __GFP_NOWARN
set. That would kill the warnings in code known to do the right thing
while leaving it for all other callers, presumably limiting the warnings to
places where there might truly be a problem. Jeff Garzik strongly opposed this idea, though, saying
that it clutters up the code and "punishes good behavior."
The other reason given for keeping the warnings in place is to make it
clear when a system is running under persistent memory pressure. Such
systems will not be performing optimally; often there are changes which can
be made to relieve the pressure and help the system to run more smoothly.
So it has been suggested that the warning could be reduced in frequency and
made less scary. Nick Piggin suggests:
So I think that the messages should stay, and they should print out
some header to say that it is only a warning and if not happening
too often then it is not a problem, and if it is continually
happening then please try X or Y or post a message to lkml...
An alternative idea would be to keep some sort of counter somewhere which
could be queried by curious system administrators.
Of course, the real solution is to ensure that all kernel code is
robust in the face of allocation failures. This can be hard to do, since
the error recovery paths in any code are not often exercised or tested.
Fortunately, the fault injection
framework can help in this situation. Kernel developers can use this
framework to simulate allocation failures in specific regions of code, then
watch to see what happens. Your editor's impression, though, is that
relatively few developers are using this tool. So confidence in the
kernel's handling of allocation failures may remain low, and the desire to
keep the warning around may remain high.
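The defensive style described above, and the idea of forcing failures to exercise the recovery path, can be sketched in user space. This is only an analogy, not kernel code: try_alloc(), fail_countdown, and struct rx_ring are hypothetical names invented for the illustration, and malloc() stands in for an atomic allocation.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical fault-injection knob: when it reaches zero the next
 * allocation fails; a negative value disables injection. */
int fail_countdown = -1;

/* Stand-in for an atomic allocation: it may fail and never waits. */
void *try_alloc(size_t size)
{
    if (fail_countdown == 0) {
        fail_countdown = -1;
        return NULL;
    }
    if (fail_countdown > 0)
        fail_countdown--;
    return malloc(size);
}

/* Hypothetical driver state made of two separate allocations. */
struct rx_ring {
    void *descriptors;
    void *buffers;
};

/* Set up both parts of the ring or neither, so a failure leaves
 * nothing behind for the caller to clean up. */
int rx_ring_init(struct rx_ring *ring, size_t entries)
{
    ring->descriptors = try_alloc(entries * 16);
    if (!ring->descriptors)
        goto err_out;

    ring->buffers = try_alloc(entries * 2048);
    if (!ring->buffers)
        goto err_free_descriptors;

    return 0;

err_free_descriptors:
    free(ring->descriptors);
    ring->descriptors = NULL;
err_out:
    ring->buffers = NULL;
    return -ENOMEM;
}

void rx_ring_destroy(struct rx_ring *ring)
{
    free(ring->buffers);
    free(ring->descriptors);
    ring->buffers = NULL;
    ring->descriptors = NULL;
}
```

Setting fail_countdown to 0 or 1 before calling rx_ring_init() drives each error path in turn, which is the kind of coverage that ordinary testing rarely provides.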
April 9, 2008
This article was contributed by Patrick McManus
Back in 1997 TCP SYN flood attacks were all the rage among script
kiddies. A SYN flood is a denial of service attack that uses up server
resources by initiating, but not completing, a connection. Attacks via
this method still remain a problem today though
they are now more likely to be launched by sophisticated botnets
rather than an individual. A first line defense against SYN floods is
the syncookie. The syncookie was not designed for Linux specifically
but found its way into kernel 2.1.44 via a patch from Andi Kleen.
This long-time feature generated some recent discussion when a patch was submitted adding
syncookie support to
IPv6. The patch has now been queued for acceptance but in
discussion along the way the community also began to tackle some
longstanding limitations of syncookies and reaffirmed how relevant the
feature continues to be.
To fully describe syncookies some background on how TCP uses a three
way handshake to establish a connection is in order. The first packet
of any TCP session received by the server is known as the SYN packet
because it carries the synchronize control flag. The SYN flag
indicates that its sender wishes to open a new connection. That flag
is only used during the opening sequence. The server responds with a
packet also containing the SYN flag because the connection needs to be
opened in both directions. This second packet also carries the ACK
flag and is known as the SYN-ACK. It serves to both open the
connection from the server to the client and to acknowledge receipt of
the opening packet from the other host. Finally, the client sends a
bare ACK packet to the server to acknowledge receipt of the
server-to-client SYN-ACK and the connection is then fully established.
During a SYN flood a server receives the first packet of the three-way
TCP handshake and responds with a SYN-ACK but no further data is ever
received from the initiating client. When the SYN-ACK is generated
most servers will also create an entry in the SYN queue. This queue is
the waiting area for half-open connections awaiting handshake
completion. The attacker intentionally orphans those entries and
instead generates more SYN packets which in turn take up more entries
in the queue. The server needs to wait for a long timeout before
giving up and recovering the connection resources. During this time
the attacker can flood it with many more half-open connections.
Eventually the server runs out of resources and cannot accept any new
connections without dropping some, perhaps legitimate, connection from
the queue. Simple solutions such as placing a quota on the number of
partially open connections per peer or using dynamically adjusted
packet filters do not work because the SYN packets are easy to forge
with fake source addresses.
A syncookie allows the server to defer using up any resources
until the third packet in the three-way handshake has been
received. At that time the peer's address has been mildly
authenticated because the final packet in the handshake contains
a reference to the sequence number that was sent by the server in the
second packet. With this assurance, packet filters and resource quotas
keyed to the peer's address will again be useful defenses against
resource attacks.
The basic mechanism of the syncookie works by carefully manipulating
the initial sequence number value of the connection instead of
choosing it at random. Upon receiving a SYN the server carefully
encodes the vital information that would have been stored as state in
the SYN queue. This encoded information is cryptographically hashed
with a secret key to form the sequence number of the SYN-ACK and sent
to the client. The third packet of a legitimate handshake, which is
the ACK from the client back to the server, contains this sequence
number (plus one) in its acknowledgment number field. In this way all
the information necessary to fully open the connection is presented
back to the server without having to maintain state while the
handshake is being completed.
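A toy version of that round trip might look like the following. It is a sketch under loud assumptions: toy_hash() is an FNV-1a style mixer used as a placeholder and is not cryptographically secure, the bit layout (hash in the upper 29 bits, three data bits at the bottom) is invented for the illustration, and none of the names correspond to the kernel's own syncookie code.

```c
#include <stdint.h>

/* Everything that identifies the connection attempt. */
struct conn_id {
    uint32_t saddr, daddr;   /* IPv4 addresses */
    uint16_t sport, dport;   /* TCP ports */
    uint32_t client_isn;     /* sequence number from the client's SYN */
};

/* Placeholder keyed mixer (FNV-1a style). NOT cryptographically
 * secure; a real implementation needs a proper keyed hash. */
uint32_t toy_hash(uint32_t secret, const struct conn_id *id, uint32_t data)
{
    uint32_t words[6] = {
        secret, id->saddr, id->daddr,
        ((uint32_t)id->sport << 16) | id->dport,
        id->client_isn, data
    };
    uint32_t h = 2166136261u;

    for (int i = 0; i < 6; i++) {
        for (int b = 0; b < 4; b++) {
            h ^= (words[i] >> (8 * b)) & 0xffu;
            h *= 16777619u;
        }
    }
    return h;
}

/* Build the SYN-ACK sequence number: hash in the upper 29 bits,
 * three bits of encoded state (assumed layout) in the lower bits. */
uint32_t cookie_make(uint32_t secret, const struct conn_id *id, uint32_t data)
{
    data &= 7u;
    return (toy_hash(secret, id, data) & ~7u) | data;
}

/* Validate the acknowledgment number from the final ACK. Returns the
 * three recovered data bits, or -1 if the cookie does not match. */
int cookie_check(uint32_t secret, const struct conn_id *id, uint32_t ack_seq)
{
    uint32_t cookie = ack_seq - 1;   /* the ACK carries cookie + 1 */
    uint32_t data = cookie & 7u;

    if (cookie_make(secret, id, data) != cookie)
        return -1;
    return (int)data;
}
```

The server stores nothing between cookie_make() and cookie_check(); the secret and the fields of the arriving ACK are enough to recompute and verify the value.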
The major downside to syncookies is that they only have space to
encode the most basic of TCP handshake options. At the time of initial
syncookie deployment this was not a large problem because the only option
prominently in use at the time was the Maximum Segment Size (MSS)
option. This option helps the peer avoid unnecessary fragmentation: each
end advertises the largest segment it can accept, so the peer does not send
packets that the advertising end knows a priori are too large to cross its
network. This is exactly the kind
of information that is normally stored as state in the SYN queue. The
syncookie designers knew that this option was important to performance
and found 3 bits for it in the encoded syncookie. These bits are used to
approximate the real value of the option to one of 8 common values.
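The approximation can be pictured as a small table lookup. The eight values below are illustrative choices for this sketch and are not claimed to be the kernel's actual table; mss_to_index() and index_to_mss() are hypothetical names.

```c
#include <stdint.h>

/* Eight representative MSS values, smallest first (illustrative). */
static const uint16_t mss_table[8] = {
    64, 256, 512, 536, 1024, 1440, 1460, 4312
};

/* Pick the largest table entry that does not exceed the advertised
 * MSS. Rounding down is the safe direction: the server will never
 * send a segment larger than the client asked for. */
unsigned int mss_to_index(uint16_t mss)
{
    unsigned int i = 7;

    while (i > 0 && mss_table[i] > mss)
        i--;
    return i;
}

/* Recover the approximate MSS from the three bits in the cookie. */
uint16_t index_to_mss(unsigned int index)
{
    return mss_table[index & 7u];
}
```

With this table an advertised MSS of 1500 is stored as index 6 and recovered as 1460, so the connection loses a little efficiency but remains correct.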
In the intervening years new options have come into prominence and
these are not syncookie compatible. The most important of these are the window scaling and Selective
Acknowledgment (SACK) options. These features respectively allow the
TCP congestion control window to grow beyond 64KB and be more
efficient in the case of minor packet losses from those large
windows. Without using these features it is impossible to get good
transfer rates on networks with large bandwidth or large latency. Many
household broadband links require at least the window scaling option
to fully utilize the network connection. Due to this limitation, and
the modest computation overhead of the cryptographic hash, the
Linux stack only resorts to syncookie based connections when the
number of half-open connections exceeds a high watermark controlled by
the net.ipv4.tcp_max_syn_backlog sysctl. These connections are less
featureful than normal connections but they are only resorted to when
the queue would otherwise require active pruning.
It turns out that the cookie mechanism is only implemented for
IPv4. Recently, Glenn Griffin posted patches that add IPv6 support
for syncookies. Andi Kleen, author of the original syncookie patch,
wondered if the mechanism should be continued at all much less added
to IPv6:
Syncookies are discouraged these days. They disable too many
valuable TCP features (window scaling, SACK) and even without them
the kernel is usually strong enough to defend against syn floods
and systems have much more memory than they used to be.
So I don't think it makes much sense to add more code to it, sorry.
Andi's argument was three pronged. His first point was about the
reduced abilities of cookie initiated connections as already described
in this article. Over time the value of these options has increased
and therefore the cost of using syncookies has increased too. His
second point was that Linux no longer uses all of the memory necessary
for a full connection until the new connection is fully open. Instead
it uses a "minisock" for that period. The minisock is a 96 byte
struct tcp_request_sock structure holding the minimum state
necessary to get the connection fully opened. The fully established
struct tcp_sock is 1616 bytes. Both structure size
measurements refer to a 64-bit kernel. Finally, Andi points out that
the queue management routines for an overloaded SYN queue are more
sophisticated now than the dumb head drop algorithm that was in place
when syncookies were first deployed. The suggestion was that in
aggregate these advances might make Linux robust enough without
syncookies so that they could therefore be removed altogether.
Instead of engaging in a theoretical discussion some readers set up and
ran their own experiments. One of the best parts of the Linux
community is the tendency to put real data behind their
arguments. While there is often disagreement over the realism of the
measured scenarios, the data points always help us better understand
the dynamics of kernel code.
Willy Tarreau: My tests on an AMD LX800 with max_syn_backlog at 63000 on an HTTP
reverse proxy consisted in injecting 250 hits/s of legitimate traffic
with 8000 SYN/s of noise.[..] Without SYN cookies, the average
response time was about 1.5 second and unstable (due to retransmits),
and the CPU was set to 60%. With SYN cookies enabled, the response
time dropped to 12-15ms only, but CPU usage jumped to 70%. The
difference appears at a higher legitimate traffic rate.
Ross Vandegrift:
Under no SYN flood, the server handles 750 HTTP requests per second,
measured via httping in flood mode. With a default tcp_max_syn_backlog
of 1024, I can trivially prevent any inbound client connections with 2
threads of syn flood. Enabling tcp_syncookies brings the connection
handling back up to 725 fetches per second.
This data compellingly supports the continued value of the syncookie
and that position seems to have won the day. The IPv6 syncookie
patches are now queued within the network 2.6.26 development tree.
However, the biggest news is probably that this discussion brought
renewed energy to the problem of lost handshake options. Florian
Westphal and Glenn Griffin have recently presented a solution to the
most damaging aspect of that problem too.
Their solution is to leverage
the echoed TCP timestamp option in a way similar to the way classic
syncookies leverage the echoing of the SYN-ACK sequence number in the
subsequent ACK. The timestamp option was introduced with RFC 1323 and
is widely deployed on modern Linux, Windows, and FreeBSD (including OS
X) systems. Its main purpose is to be able to increase the frequency of round
trip time measurements in the presence of large congestion control
windows.
Using the timestamp to preserve the window scale and SACK option
values requires modifying the timestamp of the SYN-ACK packet to
include the state necessary to support them. During a normal handshake the
client will echo the modified
timestamp value of the SYN-ACK packet back to the server as part of
the timestamp option on the third part of the handshake and thus
propagate the SACK and window scale information without keeping any
state on the server.
In order to make room in the timestamp for this new information the
least significant 9 bits of the timestamp are shaved off. The encoded
representation of the window scale and SACK options are then
transferred back and forth at the minor cost of reduced granularity of
TCP timestamps during the handshake exchange. With nine bits given over
to option data, the timestamp's resolution drops to 512 jiffies.
Below are two different TCP handshakes completed with syncookies and
the timestamp patch. Note that the lowest bits of the SYN-ACK
timestamp are the same in each handshake even at different points in
time because each handshake uses the same SACK and window scaling
options. As a result the timestamp values in
each SYN-ACK are different but the lower nine bits share the same 0x166
value.
13:51:04.582464 IP 127.0.0.1.57985 > 127.0.0.1.4050: S 1061746051:1061746051(0)
win 32792 <mss 16396,sackOK,timestamp 0xfffea013 0,nop,wscale 6>
13:51:04.582478 IP 127.0.0.1.4050 > 127.0.0.1.57985: S 2800702917:2800702917(0)
ack 1061746052 win 32768 <mss 16396,sackOK,timestamp 0xfffe9f66 0xfffea013,nop,wscale 6>
13:51:04.582480 IP 127.0.0.1.57985 > 127.0.0.1.4050: .
ack 1 win 513 <nop,nop,timestamp 0xfffea013 0xfffe9f66>
13:59:19.047306 IP 127.0.0.1.45979 > 127.0.0.1.4050: S 218483035:218483035(0)
win 32792 <mss 16396,sackOK,timestamp 0x0001bed4 0,nop,wscale 6>
13:59:19.047320 IP 127.0.0.1.4050 > 127.0.0.1.45979: S 1141094138:1141094138(0)
ack 218483036 win 32768 <mss 16396,sackOK,timestamp 0x0001bd66 0x0001bed4,nop,wscale 6>
13:59:19.047322 IP 127.0.0.1.45979 > 127.0.0.1.4050: .
ack 1 win 513 <nop,nop,timestamp 0x0001bed4 0x0001bd66>
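The arithmetic behind those SYN-ACK timestamps can be reproduced with a few lines of C. The bit layout here is an assumption made for this sketch (four bits for each window scale and one for SACK); it was chosen because it yields 0x166 for "SACK permitted, window scale 6", and the patch itself may pack the bits differently. The sketch also assumes that the server's clock read the same value as the client's timestamp, which is plausible on the loopback interface used for the traces.

```c
#include <stdint.h>

#define TS_OPT_BITS 9u
#define TS_OPT_MASK ((1u << TS_OPT_BITS) - 1u)   /* 0x1ff */

struct ts_opts {
    unsigned int snd_wscale;   /* 0..15 */
    unsigned int rcv_wscale;   /* 0..15 */
    unsigned int sack_ok;      /* 0 or 1 */
};

/* ASSUMED layout: bits 0-3 and 4-7 hold the two window scales,
 * bit 8 records whether SACK was negotiated. */
uint32_t ts_opts_pack(const struct ts_opts *o)
{
    return (o->snd_wscale & 0xfu) |
           ((o->rcv_wscale & 0xfu) << 4) |
           ((o->sack_ok & 1u) << 8);
}

void ts_opts_unpack(uint32_t tsval, struct ts_opts *o)
{
    o->snd_wscale = tsval & 0xfu;
    o->rcv_wscale = (tsval >> 4) & 0xfu;
    o->sack_ok = (tsval >> 8) & 1u;
}

/* Replace the low nine bits of the clock with the option bits. If
 * that would move the timestamp into the future, step back by one
 * 512-tick unit so the value never runs ahead of the real clock. */
uint32_t ts_encode(uint32_t now, const struct ts_opts *o)
{
    uint32_t ts = (now & ~TS_OPT_MASK) | ts_opts_pack(o);

    if ((now & TS_OPT_MASK) < (ts & TS_OPT_MASK))
        ts -= 1u << TS_OPT_BITS;
    return ts;
}
```

Under these assumptions a clock value of 0xfffea013 encodes to 0xfffe9f66 and 0x0001bed4 encodes to 0x0001bd66, matching the SYN-ACK timestamps in the two traces.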
While there is no guarantee that the timestamp option will be
supported by every TCP peer, timestamps are widely deployed on the most
common operating systems. Additionally, because timestamps, window
scaling, and selective acknowledgments are all features related to
high latency and bandwidth networks it would be unlikely to find an
implementation that supported only a subset of these options.
One shortcoming of the scheme is that it is not general enough to be
future-proof as new handshake based options may continue to be
deployed. At this time the MSS, SACK, window scaling, and timestamp
options are the only handshake options seen with any regularity other
than the NOP option which is just used for packet alignment. However,
the whole point of an extensible option scheme is to leave room for
future improvements. The IANA registry that records option values was
last updated in February 2007 to reserve option code 27 for use with
Experimental RFC 4782 "Quick Start for TCP and IP". Only time will
tell if that particular option will be the next challenge to the
syncookie scheme or if something else will rise first.
The timestamp patch has only been posted very recently, and there has
been little discussion of it beyond the developers who worked directly
on it. It is not clear whether or not it will be accepted right
away into the mainline, but it certainly seems to address a well known
core problem with the syncookie at a minor cost.
With the updates for IPv6 and modern TCP option schemes syncookies
appear primed to keep providing sweet relief in their somewhat
esoteric networking security niche. Perhaps they will keep chugging
away for another 10 years without having to be re-baked.
By Jonathan Corbet
April 7, 2008
One of the core features of the (now stalled) kevent subsystem was a
circular buffer intended for efficient movement of data between the kernel
and user space. Kevent may have run out of steam, but the ring buffer idea
is back via a different path. Rusty Russell is now proposing a new system
call (called vringfd()) which turns some of the virtio work into a new
kernel-to-user ring buffer interface. The submitted patch is breathtaking
in its lack of documentation on this new system call, especially
considering that its author is quite good with that sort of writing.
Your editor has taken this omission as a personal challenge and, as a
result, has set about reverse engineering the (somewhat complex)
vringfd() interface.
A user-space process which wishes to set up a vring for communication with
the kernel must create a slightly complicated data structure first. One
starts by deciding how many entries the ring should have; this number must
be a power of two which fits into an unsigned, 16-bit value. Given this
number (we'll call it RING_SIZE), the data structure looks like
this:
struct messy_vring_thing {
    struct vring_desc descriptors[RING_SIZE];
    struct vring_avail available;
    char padding[up-to-next-page-boundary];
    struct vring_used used[RING_SIZE];
};
The page alignment for the used array is important - that array
might be mapped separately into kernel space. The array must fit into a
single page, which puts a practical limit of 256 entries for
RING_SIZE on systems with 4096-byte pages. If this API goes
forward, chances are good that a way will be found to raise this limit.
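The 256-entry figure follows from simple arithmetic, sketched below. The sizes are derived from the structure definitions given in this article (a 16-byte descriptor, a four-byte flags/idx header, two-byte available entries, eight-byte used entries), and the sketch assumes that the used area is one header followed by RING_SIZE elements. The function names are hypothetical.

```c
#include <stddef.h>

#define DESC_SIZE       16u  /* addr (8) + len (4) + flags (2) + next (2) */
#define RING_HDR_SIZE    4u  /* flags (2) + idx (2) */
#define AVAIL_ELEM_SIZE  2u  /* one 16-bit descriptor index */
#define USED_ELEM_SIZE   8u  /* id (4) + len (4) */

/* Bytes occupied by the available ring. */
size_t vring_avail_size(size_t ring_size)
{
    return RING_HDR_SIZE + AVAIL_ELEM_SIZE * ring_size;
}

/* Bytes occupied by the used ring; this must fit in one page. */
size_t vring_used_size(size_t ring_size)
{
    return RING_HDR_SIZE + USED_ELEM_SIZE * ring_size;
}

/* Offset of the used ring: descriptors plus available ring,
 * rounded up to the next page boundary. */
size_t vring_used_offset(size_t ring_size, size_t page_size)
{
    size_t raw = DESC_SIZE * ring_size + vring_avail_size(ring_size);

    return (raw + page_size - 1) / page_size * page_size;
}

/* Total size of the block user space must allocate. */
size_t vring_total_size(size_t ring_size, size_t page_size)
{
    return vring_used_offset(ring_size, page_size) +
           vring_used_size(ring_size);
}
```

With 4096-byte pages, a 256-entry used ring needs 2052 bytes and fits, while the next power of two, 512 entries, needs 4100 bytes and does not.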
Individual descriptors in the ring are described with this structure:
struct vring_desc
{
    __u64 addr;   /* Address of the buffer */
    __u32 len;    /* Length of the buffer */
    __u16 flags;  /* some flags */
    __u16 next;   /* Next buffer in the chain */
};
For a simple buffer, the application would simply point addr at
the beginning and set len to the appropriate value. If the buffer
is to be written to by the kernel, the application should also set
VRING_DESC_F_WRITE in the flags field.
Things can get more complicated than that, though, in that the
vringfd() interface supports multipart scatter/gather buffers. To
set up such a buffer, user space would use one vring_desc entry
for each segment of the buffer. For all but the final segment, the
VRING_DESC_F_NEXT flag (saying "use the next descriptor too")
should be set, and next should be the index of the next
descriptor. When the kernel grabs a buffer, it will follow the chain and
use all segments found until the final one (which lacks the
VRING_DESC_F_NEXT flag) is encountered.
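A user-space helper for building such a chain might look like the sketch below. The flag values are the ones used by the virtio ring header and are assumed to carry over to this interface; struct segment, chain_build(), and chain_total_len() are hypothetical names, and the sketch assumes the write flag is set on every segment of a kernel-writable buffer.

```c
#include <stdint.h>

#define VRING_DESC_F_NEXT  1   /* values assumed from the virtio ring header */
#define VRING_DESC_F_WRITE 2

struct vring_desc {
    uint64_t addr;    /* address of the buffer */
    uint32_t len;     /* length of the buffer */
    uint16_t flags;
    uint16_t next;    /* next descriptor in the chain */
};

/* Hypothetical helper type: one piece of a scatter/gather buffer. */
struct segment {
    void *base;
    uint32_t len;
};

/* Link segs[0..count-1] into consecutive descriptors starting at
 * 'first'. Returns the index of the head descriptor. */
uint16_t chain_build(struct vring_desc *table, uint16_t first,
                     const struct segment *segs, uint16_t count,
                     int kernel_writes)
{
    for (uint16_t i = 0; i < count; i++) {
        struct vring_desc *d = &table[first + i];

        d->addr = (uint64_t)(uintptr_t)segs[i].base;
        d->len = segs[i].len;
        d->flags = kernel_writes ? VRING_DESC_F_WRITE : 0;
        d->next = 0;
        if (i + 1 < count) {          /* every segment but the last */
            d->flags |= VRING_DESC_F_NEXT;
            d->next = (uint16_t)(first + i + 1);
        }
    }
    return first;
}

/* Follow the chain the way a consumer would, summing the lengths
 * until a descriptor without VRING_DESC_F_NEXT is reached. */
uint32_t chain_total_len(const struct vring_desc *table, uint16_t head)
{
    uint32_t total = 0;
    uint16_t i = head;

    for (;;) {
        total += table[i].len;
        if (!(table[i].flags & VRING_DESC_F_NEXT))
            break;
        i = table[i].next;
    }
    return total;
}
```

Using consecutive descriptors is only a convenience here; the next field allows the segments of a chain to live anywhere in the descriptor array.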
Before the kernel will use buffers set up by the application, though, user
space must indicate that the buffer is ready. That is done through the
vring_avail structure:
struct vring_avail
{
    __u16 flags;
    __u16 idx;
    __u16 ring[RING_SIZE];
};
The ring array holds indexes into the descriptors array.
The idx field should always be the index of the last valid entry
in ring. When a new buffer is ready for transfer to or from the
kernel, the application will store the index of the first descriptor into
ring[idx+1], then increment idx. When the ring is first
established, the kernel remembers the position of idx, so the
first buffer should be added here after the vringfd()
system call is made.
The kernel will consume buffers from the available ring as
needed. Once the requested operation has been performed on the buffer and
the kernel is done with it, the buffer will show up in the used
area, which is structured this way:
struct vring_used_elem
{
    __u32 id;
    __u32 len;
};
struct vring_used
{
    __u16 flags;
    __u16 idx;
    struct vring_used_elem ring[RING_SIZE];
};
In the vring_used structure, idx is the index of the next
entry in ring which may be written by the kernel; it will be
incremented after the ring is updated. When a buffer is placed in the used
ring, the id field will be the index of the descriptor, and
len will be the actual length of the data transferred.
Note that the flags fields in the vring_avail and
vring_used structures appear to be unused.
Once the application has this whole data structure set up, it can establish
the ring buffer with the kernel with the new system call:
long vringfd(void *addr, unsigned int ring_size, u16 *last_used);
Here, addr is the base address of the data structure described
above, ring_size is the number of descriptors in the ring, and
last_used is a 16-bit unsigned integer indicating which entry in
the used ring was last consumed by the application. Failure to
keep last_used current will not slow things down, but it will keep
poll() from working properly.
The return value will be a file descriptor associated with the ring.
Creating the vring is only part of the job, though. The next step is to
connect it with a kernel subsystem for the transfer of data. Rusty's patch
includes vring support in the tun virtual network driver; to use that
support, an application makes a special ioctl() call to provide
the vring file descriptor to the tun driver. Any other subsystem will need
a similar mechanism to support vring.
If the application is using the ring to transfer data into the kernel, it
must (1) set up one or more descriptors for full data buffers in the
available ring, then (2) make a write() call to the
vring file descriptor. The buffer and length passed to write()
are ignored; all that matters is that a write was done to that file
descriptor. When write() returns the operation will have been set
in motion, but it cannot be considered to be complete until the ring
descriptors show up in the used ring.
For data transfers from the kernel to user space, the application simply
puts buffers into the available ring, then waits until they show
up in the used ring. A poll() on the vring file
descriptor will block until buffers are available. The kernel determines
whether unconsumed buffers exist in used by comparing the
vring_used->idx index against the application-supplied
last_used value. It's worth noting that, depending on how the
relevant kernel subsystem works, buffers may not actually make it into the
used ring until the poll() call is made.
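The comparison between vring_used->idx and last_used suggests how an application might drain the used ring after poll() returns. The sketch below assumes that both values are free-running 16-bit counters which are reduced modulo RING_SIZE to find a slot; the article's source does not spell that out. drain_used() and count_bytes() are hypothetical names, and a real implementation would also need memory barriers, which are omitted here.

```c
#include <stdint.h>

#define RING_SIZE 8   /* example value; must be a power of two */

struct vring_used_elem {
    uint32_t id;      /* index of the head descriptor */
    uint32_t len;     /* bytes actually transferred */
};

struct vring_used {
    uint16_t flags;
    uint16_t idx;     /* next entry the kernel will write */
    struct vring_used_elem ring[RING_SIZE];
};

/* Called once for every completed buffer. */
typedef void (*used_fn)(uint32_t id, uint32_t len, void *ctx);

/* Consume every entry the producer has published since the last
 * call, advancing *last_used so that poll() sees the ring as empty
 * again. Returns the number of buffers processed. */
unsigned int drain_used(const struct vring_used *used, uint16_t *last_used,
                        used_fn fn, void *ctx)
{
    unsigned int n = 0;

    while (*last_used != used->idx) {
        const struct vring_used_elem *e =
            &used->ring[*last_used % RING_SIZE];

        fn(e->id, e->len, ctx);
        (*last_used)++;               /* wraps naturally at 65536 */
        n++;
    }
    return n;
}

/* Example callback: add up the bytes transferred. */
void count_bytes(uint32_t id, uint32_t len, void *ctx)
{
    (void)id;
    *(uint32_t *)ctx += len;
}
```

An application would typically call drain_used() each time poll() reports the vring file descriptor as readable, then recycle the returned descriptors into the available ring.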
On the kernel side, a developer wanting to add vring support to a subsystem
will start by creating a set of vring_ops:
struct vring_ops
{
    void (*destroy)(void *);
    int (*pull)(void *);
    int (*push)(void *);
};
All of these functions take a private pointer given when the subsystem
attaches to the vring (to be described shortly). The pull()
callback is invoked when the application calls poll(); if there is
any descriptor processing which must be done with user space accessible,
this is the place to do it. If pull() adds any buffers to the
used ring, it should return the number of buffers; it can also
return a negative error code. push() is called from a
write() call indicating that there are buffers ready to be
transferred into the kernel; it returns zero or a negative error code. The
destroy() callback is called when the vring file descriptor is
closed. All of these callbacks are optional.
Attaching to a vring is done with:
struct vring_info *vring_attach(int fd, const struct vring_ops *ops,
                                void *data, bool atomic_use);
For this call, fd is a file descriptor corresponding to a vring,
ops is the operations structure described above, data is
a private data pointer which is passed into the vring_ops
callbacks, and atomic_use is nonzero if the kernel needs to be
able to add buffers to the used ring in atomic context. The
return value is a pointer to an internal vring data structure or an
ERR_PTR() value if something goes wrong.
To obtain a buffer from the available ring, a call is made to:
int vring_get_buffer(struct vring_info *vr,
                     struct iovec *in_iov,
                     unsigned int *num_in, unsigned long *in_len,
                     struct iovec *out_iov,
                     unsigned int *num_out, unsigned long *out_len);
This function will fill in an array of iovec structures
corresponding to the next available buffer. If the kernel expects to write
to the buffer, it should set in_iov to the iovec array,
num_in pointing to the length of in_iov, and
in_len pointing to a location to store the total length of the
buffer (or NULL if that information is not useful). For transfers
into the kernel, out_iov, num_out, and out_len
should be set similarly. Note that the addresses stored in the
iovec arrays are user-space addresses; vring_get_buffer()
does not validate them, so the caller must do so.
It is possible to pass both in_iov
and out_iov; in this case, one of the two will be set, depending
on whether the next buffer in the available ring has the
VRING_DESC_F_WRITE flag set. In most cases, though, only one of
the two sets of parameters will have non-NULL values. The
apparent intent of the API is that, if bidirectional transfers between
user space and the kernel are needed, two separate vrings should be used.
The return value from vring_get_buffer will be one of (1) a
positive descriptor index, (2) zero, indicating that no buffers are
available, or (3) a negative error code.
The descriptor index should be saved for the final step, which is indicating
that the kernel is done with a specific buffer:
void vring_used_buffer(struct vring_info *vr, int id, u32 len);
void vring_used_buffer_atomic(struct vring_info *vr, int id, u32 len);
Either one of these functions indicates that the buffer indicated by
id should be put into the used ring; len is the
amount of data actually transferred. If sleeping is not possible,
vring_used_buffer_atomic() should be used - but the vring must
have been attached with the atomic_use flag set.
There does not appear to be a way for a subsystem to detach from a vring;
it must, instead, wait for the application to close the associated file
descriptor.
This interface is in an early stage, and the code has a number of
limitations and FIXME comments. So things seem likely to evolve before
vringfd() is seriously considered for merging into the mainline
kernel. The idea of a ring buffer for this kind of communication seems to
come around on a regular basis, though, so it would seem that there is a
demand for this kind of API.
Page editor: Jonathan Corbet