Brief items
The current 2.6 development kernel remains 2.6.29-rc7; no new
prepatches have been released over the last week. About 160 fixes have
been merged into the mainline since the 2.6.29-rc7 release; a -rc8 prepatch
is likely sometime in the very near future.
The current stable 2.6 kernel remains 2.6.28.7; no stable updates
have been released since February 20.
Kernel development news
Today's other accomplishment was spending long enough looking at
Toshiba ACPI dumps to figure out how to enable hotkey reporting
without needing to poll. Of course, I then found that the FreeBSD
driver has done the same thing since 2004. Never mind.
-- Matthew Garrett
The real difference between KVM and Xen is that Xen is a separate Operating
System dedicated to virtualization. In many ways, it's a fork of Linux
since it uses quite a lot of Linux code.
The argument for Xen as a separate OS is no different than the argument for
a dedicated Real Time Operating System, a dedicated OS for embedded
systems, or a dedicated OS for a very large system.
Having the distros ship Xen was a really odd thing from a Linux
perspective. It's as if Red Hat started shipping VXworks with a Linux
emulation layer as Real Time Linux.
-- Anthony Liguori
You say, "You never know when your MB, CPU, PS" may bite the
dust. Sure, but you also never know when your RAID controller will
bite the dust and start writing data blocks whenever it's supposed
to be reading from the RAID (yes, we had an Octel voice mailbox
server fail in just that way at MIT once). And you never know when
a hard drive will fail. So if you have those sorts of very high
levels of reliability requirements, then you will probably be
disappointed with any commodity hardware solution. I can direct you
to an IBM salesperson who will be very happy to sell you an IBM
mainframe, however.
-- Ted Ts'o
By Jonathan Corbet
March 11, 2009
The ext4 filesystem offers a number of useful features. It has been
stabilizing quickly, but that does not mean that it will work perfectly for
everybody. Consider this example:
Ubuntu's bug tracker contains an entry titled "ext4 data loss", wherein a
luckless ext4 user reports:
Today, I was experimenting with some BIOS settings that made the
system crash right after loading the desktop. After a clean reboot
pretty much any file written to by any application (during the
previous boot) was 0 bytes.
Your editor had not intended to write (yet) about this issue, but quite a
few readers have suggested that we take a look at it. Since there is
clearly interest, here is a quick look at what is going on.
Early Unix (and Linux) systems were known for losing data on a system
crash. The buffering of filesystem writes within the kernel, while being
very good for performance, causes the buffered data to be lost should the
system go down unexpectedly. Users of Unix systems used to be quite aware
of this possibility; they worried about it, but the performance loss
associated with synchronous writes was generally not seen to be worth it.
So application writers took great pains to ensure that any data which
really needed to be on the physical media got there quickly.
More recent Linux users may be forgiven for thinking that this problem has
been entirely solved; with the ext3 filesystem, system crashes are far less
likely to result in lost data. This outcome is almost an accident
resulting from some decisions made in the design of ext3. What's happening
is this:
- By default, ext3 will commit changes to its journal every five
seconds. What that means is that any filesystem metadata
changes will be saved, and will persist even if the system
subsequently crashes.
- Ext3 does not (by default) save data written to files in the journal.
But, in the (default) data=ordered mode, any modified data
blocks are forced out to disk before the metadata changes are
committed to the journal. This forcing of data is done to ensure
that, should the system crash, a user will not be able to read the
previous contents of the affected blocks - it's a security feature.
- The end result is that data=ordered pretty much guarantees
that data written to files will actually be on disk five seconds
later. So, in general, only five seconds worth of writes might be
lost as the result of a crash.
In other words, ext3 provides a relatively high level of crash resistance,
even though the filesystem's authors never guaranteed that behavior, and
POSIX certainly does not require it. As Ted put it in his
excruciatingly clear and understandable explanation of the situation:
Since ext3 became the dominant filesystem for Linux, application
writers and users have started depending on this, and so they
become shocked and angry when their system locks up and they lose
data --- even though POSIX never really made any such guarantee.
Accidental or not, the avoidance of data loss in a crash seems like a nice
feature for a filesystem to have. So one might well wonder just what would
have inspired the ext4 developers to take it away. The answer, of course,
is performance - and delayed allocation in particular.
"Delayed allocation" means that the filesystem tries to delay the
allocation of physical disk blocks for written data for as long as
possible. This policy brings some important performance benefits. Many
files are short-lived; delayed allocation can keep the system from writing
fleeting temporary files to disk at all. And, for longer-lived files,
delayed allocation allows the kernel to accumulate more data and to
allocate the blocks for data contiguously, speeding up both the write and
any subsequent reads of that data. It's an important optimization which is
found in most contemporary filesystems.
But, if blocks have not been allocated for a file, there is no need to
write them quickly as a security measure. Since the blocks do not yet
exist, it is not possible to read somebody else's data from them. So ext4
will not (cannot) write out unallocated blocks as part of the next journal
commit cycle. Those blocks will, instead, wait until the kernel decides to
flush them out; at that point, physical blocks will be allocated on disk
and the data will be made persistent. The kernel doesn't like to let file
data sit unwritten for too long, but it can still take a minute or so (with
the default settings) for that data to be flushed - far longer than
the five seconds normally seen with ext3. And that is why a crash can
cause the loss of quite a bit more data when ext4 is being used.
The real solution to this problem is to fix the applications which are
expecting the filesystem to provide more guarantees than it really is.
Applications which frequently rewrite numerous small files seem to be
especially vulnerable to this kind of problem; they should use a smarter
on-disk format. Applications which want to be sure that their files have
been committed to the media can use the fsync() or
fdatasync() system calls; indeed, that's exactly what those system
calls are for. Bringing the applications back into line with what the
system is really providing is a better solution than trying to fix things up
at other levels.
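The fsync() advice above can be sketched in C as follows. This is only a minimal sketch and is not taken from any particular application: the helper name save_file_durably() and the ".tmp" suffix are assumptions made for illustration. It writes the new contents to a temporary file, calls fsync() on it, and only then renames it over the old file, so that a crash should leave either the old contents or the new ones rather than a zero-length file.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: replace the file at `path` with `len` bytes from `buf`.
 * Returns 0 on success, -1 on failure.
 */
int save_file_durably(const char *path, const void *buf, size_t len)
{
    char tmp[4096];
    const char *p = buf;
    size_t done = 0;
    int n, fd;

    /* Build the temporary name; the ".tmp" suffix is an arbitrary choice. */
    n = snprintf(tmp, sizeof(tmp), "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof(tmp))
        return -1;

    fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* write() may write less than requested, so loop until done. */
    while (done < len) {
        ssize_t w = write(fd, p + done, len - done);
        if (w < 0) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        done += (size_t)w;
    }

    /* Force the data to the media before the rename makes it visible. */
    if (fsync(fd) < 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) < 0) {
        unlink(tmp);
        return -1;
    }

    /* rename() atomically replaces the old file with the new one. */
    if (rename(tmp, path) < 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```

The sketch does not retry on EINTR and does not fsync() the containing directory; an application with strict durability requirements would likely want both.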
That said, it would be nice to improve the robustness of the system while
we're waiting for application developers to notice that they have some work
to do. One possible solution is, of course, to just run ext3. Another is
to shorten the system's writeback time,
which is stored in a couple of sysctl variables:
/proc/sys/vm/dirty_expire_centisecs
/proc/sys/vm/dirty_writeback_centisecs
The first of these variables (dirty_expire_centisecs) controls
how long written data can sit in the page cache before it's considered
"expired" and queued to be written to disk; it defaults to
30 seconds. The value of dirty_writeback_centisecs
(5 seconds, default) controls how often the pdflush process wakes
up to actually flush expired data to disk. Lowering these values will
cause the system to flush data to disk more aggressively, with a cost in
the form of reduced performance.
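As a rough illustration of how these two knobs can be inspected, the following sketch reads each file and converts the value from centiseconds to seconds. The function names are hypothetical; only the two /proc paths come from the text above. On a system where /proc is not available the reader function simply reports failure.

```c
#include <stdio.h>

/* Convert a value in centiseconds (1/100 s) to whole seconds. */
long centisecs_to_seconds(long centisecs)
{
    return centisecs / 100;
}

/*
 * Read one integer from a sysctl file under /proc/sys.
 * Returns the value, or -1 if the file cannot be read or parsed.
 */
long read_sysctl_long(const char *path)
{
    FILE *f = fopen(path, "r");
    long value;

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Print both writeback knobs, in centiseconds and in seconds. */
void print_writeback_settings(void)
{
    static const char *const knobs[] = {
        "/proc/sys/vm/dirty_expire_centisecs",
        "/proc/sys/vm/dirty_writeback_centisecs",
    };
    int i;

    for (i = 0; i < 2; i++) {
        long cs = read_sysctl_long(knobs[i]);

        if (cs < 0)
            printf("%s: unavailable\n", knobs[i]);
        else
            printf("%s: %ld (%ld s)\n", knobs[i], cs,
                   centisecs_to_seconds(cs));
    }
}
```

With the defaults quoted above, the stored values would be 3000 and 500 respectively. Changing them requires root privileges and writing a new number to the same files.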
A third, partial solution exists in a set of patches queued for 2.6.30; they add a
set of heuristics which attempt to protect users from being badly burned in
certain situations. They are:
- A patch adding a new EXT4_IOC_ALLOC_DA_BLKS
ioctl() command. When issued on a file, it will force ext4
to allocate any delayed-allocation blocks for that file. That will
have the effect of getting the file's data to disk relatively quickly
while avoiding the full cost of the (heavyweight) fsync()
call.
- The second patch sets a special flag on any file which has been
truncated; when that file is closed, any delayed allocations will be
forced. That should help to prevent the "zero-length
files" problem reported at the beginning.
- Finally, this patch forces block allocation when one file is renamed on top of
another. This, too, is aimed at the problem of frequently-rewritten
small files.
Together, these patches should mitigate the worst of the data loss problems
while preserving the performance benefits that come with delayed
allocation. They have not been proposed for merging at this late stage in
the 2.6.29 release cycle, though; they are big enough that they will have
to wait for 2.6.30. Distributors shipping earlier kernels can, of course,
backport the patches, and some may do so. But they should also note the
lesson from this whole episode: ext4, despite its apparent stability,
remains a very young filesystem. There may yet be a surprise or two
waiting to be discovered by its early users.
By Jonathan Corbet
March 11, 2009
Many kernel developers may work through their entire career without
encountering a buffer_head structure. But the buffer head (often called
"bh") sits at the core of the kernel's memory management and filesystem
layers. Simply put, a bh maintains a mapping between a specific page (or
portion thereof) in RAM and its corresponding block on disk. In the 2.4
days, the bh structure was also a key part of the block I/O layer, but 2.6
broke that particular association. That notwithstanding, the lowly,
much-maligned bh still plays a crucial role in contemporary kernels.
Why "much-maligned"? Buffer heads are difficult to manage, to the point
that they can create significant memory pressure on some systems. They
deal in very small units of I/O (512 bytes), so you need a pile of them
to represent even a single page. And there is a certain sense of antiquity
that one encounters when dealing with them; the buffer head code is some of
the oldest code in the core kernel. But it is important and tricky code,
so few developers dare to try to improve it.
Nick Piggin is the daring type. But Nick, too, is not trying to improve
the bh layer; instead, he would like to replace it outright. The result is
an intimidating set of large patches known as "fsblock." This code was
first posted in 2007, making it fairly young by the standards of
memory-management patches. This patch set was reposted in early March; it
has shown a number of improvements on the way. Nick says
"I'm pretty intent on getting it merged sooner or later," so
we'll likely be seeing more of this code in the future.
The core data structure is struct fsblock, which represents one
block:
struct fsblock {
    unsigned int    flags;
    unsigned int    count;
#ifdef BDFLUSH_FLUSHING
    struct rb_node  block_node;
#endif
    sector_t        block_nr;
    void            *private;
    struct page     *page;
};
This structure, notes Nick, is about 1/3 the size of struct buffer_head, but it serves
roughly the same purpose: tracking the association between an in-memory
block (found in page) and its on-disk version, indexed by
block_nr. The flags field describes the state of this
block: whether it's up-to-date (memory and disk versions match), locked,
dirty, in writeback, etc. Some of these flags (the dirty state, for
example) match the state stored with
the in-memory page; the fsblock layer (unlike the buffer_head code) takes
great care to keep those flags in sync.
There are a couple of interesting flags in the fsblock structure
which one does not find associated
with buffer heads. One of them is not a flag at all: BL_bits_mask
describes a subfield giving the size of the block. In fsblock, "blocks"
are not limited to the standard 512-byte sector size; they can, in fact,
even be larger than a page. These "superpage" blocks have been on some
filesystem developers' wish lists for some time; they would make it easy to
create filesystems with large blocks which, in turn, would perform better
in a number of situations. But the superpage feature may be removed for
any initial merge of fsblock in an attempt to make the code easier to
understand and review. Besides, large blocks are a bit of a controversial
topic, so it makes sense to address that issue separately.
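The idea of a size subfield living inside the flags word can be illustrated with a small user-space sketch. Everything here is an assumption made for illustration: the DEMO_ names, the bit positions, and the choice to store the base-2 logarithm of the block size are not taken from the fsblock patches.

```c
/*
 * Hypothetical layout: the low 5 bits hold log2(block size),
 * the bits above them hold state flags.
 */
#define DEMO_BITS_MASK  0x1fu
#define DEMO_UPTODATE   (1u << 5)
#define DEMO_DIRTY      (1u << 6)
#define DEMO_METADATA   (1u << 7)

/* Store the size subfield without disturbing the state flags. */
static inline unsigned int demo_set_size_bits(unsigned int flags,
                                              unsigned int bits)
{
    return (flags & ~DEMO_BITS_MASK) | (bits & DEMO_BITS_MASK);
}

/* Recover the block size in bytes from the size subfield. */
static inline unsigned long demo_block_size(unsigned int flags)
{
    return 1ul << (flags & DEMO_BITS_MASK);
}
```

Under this layout, a size field of 9 describes a 512-byte block, 12 describes a 4096-byte block, and 16 describes a 64KB block, which would be a "superpage" block on a system with 4K pages.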
The flags field also holds a flag called BL_metadata;
this flag indicates a block which holds filesystem metadata instead of file
data. In this case, the block is actually part of a larger structure which
(edited slightly) looks like this:
struct fsblock_meta {
    struct fsblock block;
    union {
#ifdef VMAP_CACHE
        /* filesystems using vmap APIs should not use ->data */
        struct vmap_cache_entry *vce;
#endif
        /*
         * data is a direct mapping to the block device data, used by
         * "intermediate" mode filesystems.
         */
        char *data;
    };
};
In short, this structure makes it easy for filesystem code to deal directly
with metadata blocks. Finally, the fsblock_sb structure ties a
filesystem superblock into the fsblock subsystem.
A filesystem can, at mount time, set things up with a call to:
int fsblock_register_super(struct super_block *sb,
struct fsblock_sb *fsb_sb);
The superblock can then be read in with a call to sb_mbread():
struct fsblock_meta *sb_mbread(struct fsblock_sb *fsb_sb,
sector_t blocknr);
There's only one little problem: before fsblock can perform block I/O
operations, it must have access to the superblock. So, thus far,
filesystems which have been converted to fsblock must still use the buffer
head API to read the superblock. One assumes that this little glitch will
be taken care of at some point.
A tour of the full fsblock API would require a few articles - it is a lot
of code. Hopefully a quick overview will provide a sense for how it all
works. To start with, blocks are reference-counted objects in fsblock, so
there is the usual set of functions for incrementing and decrementing the
counts:
void block_get(struct fsblock *block);
void block_put(struct fsblock *block);
void mblock_get(struct fsblock_meta *block);
void mblock_put(struct fsblock_meta *block);
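For readers unfamiliar with the get/put convention, here is a minimal single-threaded user-space sketch of a reference-counted block. The demo_ names are hypothetical, and the rule that the object is freed when the last reference is dropped is the usual convention rather than a confirmed description of fsblock's lifetime rules; kernel code would also use atomic operations instead of a plain counter.

```c
#include <stdlib.h>

struct demo_block {
    unsigned int  count;     /* number of active references */
    unsigned long block_nr;  /* on-disk block number */
};

/* Allocate a block holding one reference for the caller. */
struct demo_block *demo_block_alloc(unsigned long block_nr)
{
    struct demo_block *b = malloc(sizeof(*b));

    if (!b)
        return NULL;
    b->count = 1;
    b->block_nr = block_nr;
    return b;
}

/* Take an additional reference. */
void demo_block_get(struct demo_block *b)
{
    b->count++;
}

/*
 * Drop a reference; frees the block when the count reaches zero.
 * Returns 1 if the block was freed, 0 otherwise.
 */
int demo_block_put(struct demo_block *b)
{
    if (--b->count == 0) {
        free(b);
        return 1;
    }
    return 0;
}
```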
There's a whole set of functions for performing I/O on blocks and metadata
blocks; some of these are:
struct fsblock_meta *mbread(struct fsblock_sb *fsb_sb, sector_t blocknr,
unsigned int size);
int mblock_read_sync(struct fsblock_meta *mb);
int sync_block(struct fsblock *block);
Note that, while there are a number of functions for reading blocks, there
are fewer write functions. Instead, code will use a function like
set_block_dirty() or mark_mblock_dirty(), then leave it
up to the memory management code to decide when the actual I/O should take
place.
There is a lot more than this to fsblock, including functions to lock
blocks, look up in-memory blocks, perform page I/O, truncate pages,
implement mmap(), and more. One assumes that Nick will certainly
write exhaustive documentation for this API sometime soon.
Beyond that little documentation task, there are a few other things to do,
including supporting direct I/O and fixing a number of known bugs. But,
even now, fsblock seems to have a lot of potential; it updates the old
buffer head API in a way which is more efficient and more robust. It also
appears to perform better with the ext2 filesystem - a fact which appears
to be surprising to Nick. So something like fsblock will almost certainly
be merged sooner or later. A lot could happen in the meantime, though.
Core memory-management-related patches like this are notoriously slow to
get through the merging process, and, despite its age, fsblock has not seen a great
deal of review to date. So there's likely to be plenty of time and
opportunity for other developers to find things to disagree with before
fsblock hits the mainline.
March 11, 2009
This article was contributed by Goldwyn Rodrigues
As storage devices become bigger and bigger in capacity, the areal
density (number of bits packed per physical square inch) increases;
hard drives are now hitting the limits. Hard drive manufacturers are now
pushing to increase the basic unit of data transfer in hard drives -
physical sector size - from 512 bytes to 4096 bytes (or
4KB) to improve storage efficiency and performance.
However, there are a lot of subsystems affected by this change
that are currently not ready to accept a 4K sector size.
The first hard drive, the RAMAC, was shipped on September 13, 1956. It
weighed 2,140 pounds and held a total of 5 megabytes (MB) of data on
fifty 24-inch platters. It was available for lease for $35,000 USD,
the equivalent of approximately $300,000 in today's dollars.
We have come a long way since then. Hard drive capacities are now
measured in terabytes, but some legacy parameters, such as the sector size,
have remained unchanged. The sector size is wired into a lot of data structures
in the kernel; for example, the i_blocks field of struct inode stores the
number of 512-byte blocks that a file occupies on the media. Even
though the core kernel deals with 512-byte sectors, the block
layer is capable of handling hardware with different sector sizes.
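The same 512-byte convention is visible from user space: on Linux, the st_blocks field returned by stat() counts 512-byte units regardless of the filesystem block size. The sketch below, with hypothetical helper names, converts between bytes and those units.

```c
#include <sys/stat.h>

/* Bytes of storage allocated to the file at `path`, or -1 on error. */
long long allocated_bytes(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    /* On Linux, st_blocks is expressed in 512-byte units. */
    return (long long)st.st_blocks * 512;
}

/* Number of 512-byte units needed to hold `bytes` bytes, rounded up. */
unsigned long long units_512(unsigned long long bytes)
{
    return (bytes + 511) / 512;
}
```

A single 4096-byte filesystem block therefore shows up as eight units in st_blocks.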
Why the Change?
Any sort of data communication must contend with noise. This noise is also
present during the data transfer from the magnetic surface of the
physical hard drive platter to the head of the hard drive. Noise can
be introduced by physical defects on the hard drive platter. Noise
such as this is measured with respect to the signal strength, more
commonly known as Signal to Noise Ratio (SNR). As disk drive areal
density increases, the signal to noise ratio decreases, thereby
creating increased sensitivity to defects.
Hard disk drives store special reserved bits in addition to the packed data,
called the Error-Correcting Code (ECC) bits. On the physical medium, the
data bytes of each sector are followed by, among other fields, the ECC
bytes. ECC is responsible for the reliability of the data
transferred. Usually the Reed-Solomon algorithm is used to compute the ECC
bits, which make it possible to detect and, to a certain extent, correct
read errors; it is an efficient algorithm for correcting errors which come
in bursts. The ECC bits are placed immediately after the data bytes (as
shown in the diagram below), so the error, if any, can be corrected as the
disk spins.
Besides the ECC, the disk also has bits reserved before
the data bits for the preamble and the data sync mark, and for the Inter
Sector Gap (ISG) after the ECC bits.
With the increase in areal density, more bits are packed into each square
inch of physical surface. A physical defect of, say, 100 nanometers
therefore damages more bits, and requires more ECC bits to correct, than it
would at lower densities. The defect induces more noise relative to the
signal, hence the SNR decreases. This requires more bytes packed into the
ECC field of each sector to compensate for the decrease in SNR and ensure
the reliability of the data stored on the disk. For example, on disks with
a density of 215 kbpi (kilobits per inch), a 512-byte data sector requires
24 bytes of ECC, giving a format efficiency (number of user data bytes vs.
total number of bytes on disk) of 92%. With an increase in density to
750 kbpi, each 512-byte sector requires 40 bytes of ECC to achieve the same
level of disk reliability. The format efficiency of such a drive is 87%.
A sector size of 4096 bytes requires 100 bytes of ECC to
maintain the same level of reliability at a density of
750 kbpi; that yields a format efficiency of 96%. As areal densities in disk drives
continue to increase, the physical size of each sector on the surface
of the disk becomes smaller. If the mean size and number of disk
defects and scratches do not scale at the same rate, then we expect
more sectors to be corrupted, and we expect the resulting burst errors
to more easily exceed the error correction capability of each sector.
Larger sectors allow such burst errors to be handled by a single ECC field
covering more data, hence decreasing the total ECC overhead.
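The ECC figures quoted above can be compared with a few lines of arithmetic. The sketch below uses hypothetical function names and counts only the ECC bytes, so its efficiency figure is an upper bound: the format efficiencies quoted in the text are lower because they also account for the other per-sector fields.

```c
/* ECC bytes as a percentage of the user data bytes in a sector. */
double ecc_overhead_percent(unsigned int data_bytes, unsigned int ecc_bytes)
{
    return 100.0 * ecc_bytes / data_bytes;
}

/*
 * Upper bound on format efficiency, counting ECC as the only overhead.
 * Real drives also spend bytes on the preamble, sync mark and gap.
 */
double ecc_only_efficiency_percent(unsigned int data_bytes,
                                   unsigned int ecc_bytes)
{
    return 100.0 * data_bytes / (data_bytes + ecc_bytes);
}

/* ECC bytes spent to protect 4096 bytes of user data. */
unsigned int ecc_bytes_per_4k(unsigned int sector_bytes,
                              unsigned int ecc_per_sector)
{
    return (4096 / sector_bytes) * ecc_per_sector;
}
```

At 750 kbpi, eight 512-byte sectors need 8 x 40 = 320 bytes of ECC to protect 4096 bytes of data, while a single 4096-byte sector needs 100 bytes for the same job.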
Besides the ECC, the disk also has bits reserved before the data bits,
for the preamble, data sync mark, and the Inter Sector Gap (ISG).
Increasing the sector size from 512 bytes to 4K would decrease the
number of occurrences of these fields, thus improving the format efficiency
further.
For all of these reasons, the storage industry wants to move to larger
sector sizes. The International Disk
Drive Equipment and Materials Association (IDEMA) was formed to
increase co-operation among competing hard drive brands. IDEMA is
responsible for the smooth transition of sector size from 512 bytes to
4K bytes. Also, bigsector.org was set up to maintain documentation of the
transition; its documentation section contains more information.
Transition
This change affects a lot of areas in the storage system chain:
from the drive interface, the host interface, BIOS, OS to
applications such as partition managers. A change affecting so many subsystems
might not be readily acceptable to the market. To make a smooth
transition, the following stages are planned:
- 512-byte logical with 512-byte physical sector size. This is the current
state of hard drives.
- 512-byte logical with 4096-byte physical sector size. This would
facilitate a smooth transition from 512-byte to 4096-byte sector
sizes.
- 4096-byte logical with 4096-byte physical sectors. This would be done once
all hardware and software would be aware of the underlying change and
geometry with respect to sector size. This change would first be seen
in SCSI devices and later in ATA devices.
During the transition phase (step 2), drives are planned to use
512-byte emulation, known as read-modify-write (RMW). Read-modify-write
is a technique used to emulate 512-byte sector size over a 4K physical
sector size. Written data which does not correspond to full 4K sectors
would result in the drive first reading the existing 4K sector, modifying
the part of data which changed, and writing the 4K sector data back to
the drive. More information on RMW and its implementation can be
found in this set of slides. Needless to say, RMW decreases the throughput
of the device, though the shorter ECC will (hopefully) compensate by giving
better overall performance. Such drives are expected to be commercially
available in the first quarter of 2011.
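The read-modify-write cycle described above can be modeled in user space. This is a toy sketch under stated assumptions: the "drive" is an in-memory array of four physical sectors, the demo_ names are hypothetical, and real firmware will differ in many ways (caching, for one). The counters show the extra physical read that a sub-4K write costs.

```c
#include <string.h>

#define LOGICAL_SIZE          512
#define PHYSICAL_SIZE         4096
#define LOGICAL_PER_PHYSICAL  (PHYSICAL_SIZE / LOGICAL_SIZE)
#define DEMO_PHYS_SECTORS     4   /* size of the toy drive */

struct demo_drive {
    unsigned char media[DEMO_PHYS_SECTORS][PHYSICAL_SIZE];
    unsigned long phys_reads;   /* physical sector reads performed */
    unsigned long phys_writes;  /* physical sector writes performed */
};

/*
 * Write one 512-byte logical sector using read-modify-write.
 * Returns 0 on success, -1 if the address is out of range.
 */
int demo_write_logical(struct demo_drive *d, unsigned long lba,
                       const unsigned char *data)
{
    unsigned long phys = lba / LOGICAL_PER_PHYSICAL;
    unsigned long off = (lba % LOGICAL_PER_PHYSICAL) * LOGICAL_SIZE;
    unsigned char buf[PHYSICAL_SIZE];

    if (phys >= DEMO_PHYS_SECTORS)
        return -1;

    memcpy(buf, d->media[phys], PHYSICAL_SIZE);   /* read the 4K sector */
    d->phys_reads++;
    memcpy(buf + off, data, LOGICAL_SIZE);        /* modify 512 bytes */
    memcpy(d->media[phys], buf, PHYSICAL_SIZE);   /* write the sector back */
    d->phys_writes++;
    return 0;
}

/*
 * Write a full, aligned 4K physical sector: no prior read is needed.
 * Returns 0 on success, -1 if the address is out of range.
 */
int demo_write_full_physical(struct demo_drive *d, unsigned long phys,
                             const unsigned char *data)
{
    if (phys >= DEMO_PHYS_SECTORS)
        return -1;
    memcpy(d->media[phys], data, PHYSICAL_SIZE);
    d->phys_writes++;
    return 0;
}
```

Writing a single logical sector costs one physical read plus one physical write, while writing eight aligned logical sectors as one unit costs only a write; that difference is the throughput penalty mentioned above.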
Matthew Wilcox recently posted a patch to support 4K
sectors according to the ATA-8 standard (PDF). The patch adds an interface
function named sector_size_supported(). Individual drivers are required to
implement this function and return the sector size used by the hardware. The size
returned is stored in the sect_size field of the ata_device structure.
This function returns 512 if the device does not recognize the ATA-8
command, or the driver does not implement the interface.
The sect_size is used instead of ATA_SECT_SIZE when the data
transfer is a multiple of 512-byte sectors.
The partitioning system and the bootloader will also require changes
because they rely on the fact that partitions start from the 63rd
sector of the drive, which is misaligned with the 4K sector boundary.
This problem will be solved, in the short term, by using drives with 4K
physical and 512-byte logical sectors. The 512-byte sectors are aligned in
such a way that the 1st logical sector starts from the 1st octant of the
1st physical 4K sector, as shown below.
This scheme of aligning the logical and physical sectors to optimize
data storage and transfer is known as odd-aligned physical/logical
sectors.
It can lead to other problems, though:
odd-aligned sectors might misalign the data with
respect to filesystem blocks. Assuming a 4K page size, a
random read would then require two 4K sector reads. This
is the reason applications such as bootloaders and partitioning
systems should be ready for 4K sector size hard drives (step 3), for
overall throughput efficiency.
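The alignment arithmetic can be made concrete with a short sketch. It rests on an assumption about the layout: that logical sector N lives in physical sector (N + shift) / 8, with shift being 0 for a naturally aligned drive and 1 for the odd-aligned layout described above. The function name is hypothetical.

```c
/* Logical 512-byte sectors per 4K physical sector. */
#define SECTORS_PER_4K 8

/*
 * Number of 4K physical sectors touched by `count` logical sectors
 * starting at logical sector `lba`. `shift` is the assumed offset, in
 * logical-sector slots, of logical sector 0 within the first physical
 * sector: 0 for natural alignment, 1 for the odd-aligned layout.
 */
unsigned long physical_sectors_touched(unsigned long lba,
                                       unsigned long count,
                                       unsigned int shift)
{
    unsigned long first, last;

    if (count == 0)
        return 0;
    first = (lba + shift) / SECTORS_PER_4K;
    last = (lba + shift + count - 1) / SECTORS_PER_4K;
    return last - first + 1;
}
```

Under this model, a 4K filesystem block at the start of a partition beginning at sector 63 spans two physical sectors on a naturally aligned drive but only one on an odd-aligned drive. The reverse holds for a block starting at sector 64, which is how the odd-aligned scheme can misalign data that was laid out with 4K boundaries in mind.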
An increased sector size is required by hard drives to break the current limits
of hard drive capacity while minimizing the overhead of
error checking data. However, a smooth transition
will decide the acceptability of these drives in the market. The previous transition,
which broke the 8.4GB limit using Logical Block Addressing (LBA), was easily
accepted. However, with so many drives in use currently, the
transition will be determined by the co-operation of various
subsystems of the data supply chain, such as filesystems and
applications dealing with hard drives.
Page editor: Jonathan Corbet