
fallocate()

Applications do not normally worry about the allocation of blocks for files they create; instead, they simply write the data and assume that the kernel will do a proper job of finding a home for that data. There are times when it is useful to take a more active role in block allocation, though. If an application knows how much data it will be writing, it can request the needed blocks ahead of time, enabling the kernel to allocate them all at once, contiguously on the disk. Application developers concerned about reliability may also want to know that the needed disk space has already been procured before beginning a critical operation.

Unix systems have not traditionally provided a way for applications to control block allocation. An application on a current Linux kernel has only one way to force allocation: write a stream of data to the relevant portion of the file. This technique works, but it loses one of the advantages of preallocation: letting the kernel do all the work at once and ensure that the blocks are contiguous on disk if possible. Writing useless data to the disk solely for the purpose of forcing block allocation is also wasteful.

The POSIX way of preallocating disk space is the posix_fallocate() function, defined as:

    int posix_fallocate(int fd, off_t offset, off_t len);

On success, this call will ensure that the application can write up to len bytes to fd starting at the given offset and know that the disk space is there for it.
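As a minimal sketch of how an application might use this interface before a critical write: the helper name write_with_reservation() is hypothetical, and the code relies on whatever posix_fallocate() the C library provides. Note that posix_fallocate() returns an error number directly rather than setting errno.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the article): reserve space for len bytes
 * at the start of the file, then write the data.
 * Returns 0 on success or an errno-style error number on failure.
 */
int write_with_reservation(int fd, const void *data, size_t len)
{
    const char *p = data;
    off_t offset = 0;

    /* Fail early (for example with ENOSPC) before any data is written. */
    int err = posix_fallocate(fd, 0, (off_t)len);
    if (err != 0)
        return err;

    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, offset);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            return errno;
        }
        p += n;
        offset += n;
        len -= (size_t)n;
    }
    return 0;
}
```

The point of the ordering is that an out-of-space condition shows up at reservation time, not halfway through the write.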

Linux does not currently have an implementation of posix_fallocate() in the kernel. This patch by Amit Arora may change that situation, however. Amit's patch has been through a couple of rounds of review which have changed the interface considerably; the current form of the proposed system call is:

    long fallocate(int fd, int mode, loff_t offset, loff_t len);

The fd, offset, and len arguments have the same meaning as with posix_fallocate(), making it easy for the C library to implement the standard interface. The additional mode argument changes the way the call operates; normal usage will be to specify FA_ALLOCATE, which causes the requested blocks to be allocated. If, instead, FA_DEALLOCATE is given, the requested block range will be deallocated, allowing an application to punch a hole in the file.
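The difference between the two modes can be illustrated with a toy user-space model. Nothing here is kernel code: the numeric values given to FA_ALLOCATE and FA_DEALLOCATE, the toy_file structure, the block size, and the rule that deallocation only frees blocks lying wholly inside the range are all assumptions made for the sake of the sketch.

```c
#include <errno.h>
#include <stdbool.h>

/* Mode names come from the article; the numeric values are placeholders. */
#define FA_ALLOCATE   1
#define FA_DEALLOCATE 2

#define TOY_BLOCK_SIZE 4096L
#define TOY_NBLOCKS    16

/* Toy "file": one flag per block saying whether it is backed by storage. */
struct toy_file {
    bool allocated[TOY_NBLOCKS];
};

/* Model of the proposed call; returns 0 or a negative error number. */
long toy_fallocate(struct toy_file *f, int mode, long offset, long len)
{
    long first, last;

    if (offset < 0 || len <= 0)
        return -EINVAL;
    if (mode != FA_ALLOCATE && mode != FA_DEALLOCATE)
        return -EINVAL;
    if ((offset + len - 1) / TOY_BLOCK_SIZE >= TOY_NBLOCKS)
        return -ENOSPC;                  /* range runs past the toy device */

    if (mode == FA_ALLOCATE) {
        /* Every block touched by the range must be backed. */
        first = offset / TOY_BLOCK_SIZE;
        last = (offset + len - 1) / TOY_BLOCK_SIZE;
    } else {
        /* Assumption: only blocks entirely inside the range are freed. */
        first = (offset + TOY_BLOCK_SIZE - 1) / TOY_BLOCK_SIZE;
        last = (offset + len) / TOY_BLOCK_SIZE - 1;
    }

    for (long b = first; b <= last; b++)
        f->allocated[b] = (mode == FA_ALLOCATE);
    return 0;
}
```

Allocating four blocks and then deallocating the middle two leaves a hole with allocated blocks on either side, which is the "punch" behavior the mode argument makes possible.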

Internally, the system call does not do much of the work; instead, it calls the new fallocate() inode operation. Thus, each filesystem must implement its own fallocate() support. Future plans call for a possible generic implementation for filesystems which lack fallocate() support, but the generic version would almost certainly have to rely on writing zeroes to the file. By pushing the operation into the filesystem itself, the kernel gives the filesystem the opportunity to satisfy the allocation in a more efficient way, without the need to write filler data. Filesystems do need to be sure that applications cannot use fallocate() to read old data from the allocated blocks, though.
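The dispatch through an inode operation can be modeled in user space along the following lines. This is a sketch, not the patch itself: every toy_ name is hypothetical, and returning -ENOSYS when a filesystem supplies no operation is an assumption based on reader discussion of the patch. The model also relies on a C language guarantee: members left out of an initializer for an object with static storage duration are set to zero, so an operations table written before fallocate() existed ends up with a NULL pointer there.

```c
#include <errno.h>
#include <stddef.h>

struct toy_inode;

/* Toy stand-in for the kernel's table of per-inode operations. */
struct toy_inode_operations {
    long (*truncate)(struct toy_inode *inode, long size);
    long (*fallocate)(struct toy_inode *inode, int mode, long offset, long len);
};

struct toy_inode {
    const struct toy_inode_operations *i_op;
    long size;
    long reserved;       /* bytes this toy filesystem has set aside */
};

static long toy_truncate(struct toy_inode *inode, long size)
{
    inode->size = size;
    return 0;
}

/* A filesystem with native support just records the reservation here. */
static long toy_native_fallocate(struct toy_inode *inode, int mode,
                                 long offset, long len)
{
    (void)mode;
    (void)offset;
    inode->reserved += len;
    return 0;
}

/* Filesystem that implements the new operation. */
static const struct toy_inode_operations toy_new_fs_ops = {
    .truncate = toy_truncate,
    .fallocate = toy_native_fallocate,
};

/* Older filesystem: .fallocate is never mentioned, so C makes it NULL. */
static const struct toy_inode_operations toy_old_fs_ops = {
    .truncate = toy_truncate,
};

/* Model of the system call: a thin wrapper around the inode operation. */
long toy_sys_fallocate(struct toy_inode *inode, int mode, long offset, long len)
{
    if (inode->i_op == NULL || inode->i_op->fallocate == NULL)
        return -ENOSYS;  /* assumed error code for "no filesystem support" */
    return inode->i_op->fallocate(inode, mode, offset, len);
}
```

A generic fallback, if one were added, would slot in where the sketch returns -ENOSYS.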

For now, filesystem-level support is scarce. There are patches circulating which add fallocate() support to ext4. The XFS filesystem has supported preallocation (through a special ioctl() call) for some time, but will need to be modified to do preallocation through the new inode operation. It's not clear when other filesystems may get native support; the tracking of allocated but unwritten blocks is a significant addition. So, for the near future, the efficiency benefits of fallocate() may be unavailable for most users.
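To see why tracking allocated-but-unwritten blocks matters, consider a toy model in which each block carries a state. All names and sizes below are hypothetical and chosen only for illustration; real filesystems track this per extent, not per block. The key rule is that a preallocated block that has never been written must read back as zeroes, even if the underlying storage still holds somebody else's old data.

```c
#include <string.h>

#define TOY_BLK     8    /* tiny block size, for illustration only */
#define TOY_BLOCKS  4

enum toy_blk_state { TOY_HOLE, TOY_UNWRITTEN, TOY_WRITTEN };

struct toy_fs_file {
    enum toy_blk_state state[TOY_BLOCKS];
    char disk[TOY_BLOCKS][TOY_BLK];      /* may contain stale data */
};

/* Preallocate a block: claim the storage but do not touch its contents. */
void toy_prealloc(struct toy_fs_file *f, int blk)
{
    if (f->state[blk] == TOY_HOLE)
        f->state[blk] = TOY_UNWRITTEN;
}

/* Writing real data moves the block to the "written" state. */
void toy_write_block(struct toy_fs_file *f, int blk, const char *src)
{
    memcpy(f->disk[blk], src, TOY_BLK);
    f->state[blk] = TOY_WRITTEN;
}

/* Reads of holes and unwritten blocks return zeroes, never stale bytes. */
void toy_read_block(const struct toy_fs_file *f, int blk, char *dst)
{
    if (f->state[blk] == TOY_WRITTEN)
        memcpy(dst, f->disk[blk], TOY_BLK);
    else
        memset(dst, 0, TOY_BLK);
}
```

The state array is the extra bookkeeping: without it, the filesystem would have to zero every preallocated block on disk, which is exactly the cost fallocate() is meant to avoid.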




fallocate()

Posted Mar 22, 2007 11:50 UTC (Thu) by nix (subscriber, #2304) [Link] (3 responses)

FA_DEALLOCATE is probably easier to implement, at least for filesystems supporting sparse files at all.

fallocate()

Posted Mar 22, 2007 21:48 UTC (Thu) by landley (guest, #6789) [Link] (2 responses)

Punch! Finally!

http://www.uwsg.iu.edu/hypermail/linux/kernel/0312.0/0889...

fallocate()

Posted Mar 23, 2007 14:37 UTC (Fri) by nix (subscriber, #2304) [Link]

Quite so! punch should really have been in POSIX from the start, except
that sparse files were never really in POSIX per se: they were just an
efficiency hack atop seeking, so nobody seems to have thought `hey, what
if we want to make an existing file sparser than it is?'

fallocate()

Posted Mar 23, 2007 14:50 UTC (Fri) by k8to (guest, #15413) [Link]

Just when almost all of the use cases had dried up ;-)

Incomplete patch?

Posted Mar 23, 2007 5:21 UTC (Fri) by ldo (guest, #40946) [Link] (4 responses)

It seems to me Amit Arora's patch is incomplete. It adds a new field to the inode_operations structure, but I can't see any code for initializing this field to any value. Presumably there are mandatory patches for all filesystems to at least set this field to NULL if they don't support this operation. Otherwise, an attempt to invoke this new system call could crash your system.

Incomplete patch?

Posted Mar 23, 2007 9:56 UTC (Fri) by khim (subscriber, #9252) [Link] (3 responses)

> I can't see any code for initializing this field to any value

Have you looked at GCC sources? That's where this code is, after all...

P.S. Hint: what does the C standard say about initialization of static and global structures?

Incomplete patch?

Posted Mar 23, 2007 20:46 UTC (Fri) by giraffedata (guest, #1954) [Link] (2 responses)

> Have you looked at GCC sources? That's where this code is, after all...

No, it's not. It's in the filesystem drivers, in the statement that declares the inode_operations variables. Looking to GCC source code for the setting of this member to NULL is like looking at GCC source code to see the setting of 'a' to 7 in a program that contains the line "int a = 7;"

> Hint: what does the C standard say about initialization of static and global structures?

An even better hint is that all the filesystem drivers (I hope) initialize the inode operations field by assigning from a static constant inode_operations variable. Because that's not obvious, and is essential to this patch working.

Incomplete patch?

Posted Mar 26, 2007 17:44 UTC (Mon) by shishir (subscriber, #20844) [Link] (1 responses)

I guess this might be a case where we assume that if a filesystem does not explicitly initialise the field, it is NULL, as the system call code does check for this field to be non-NULL. If it is NULL, it returns -ENOSYS. So, I guess, you are right in saying that there are certain per-filesystem patches that need to be installed.

Incomplete patch?

Posted Mar 27, 2007 15:50 UTC (Tue) by giraffedata (guest, #1954) [Link]

> I guess this might be a case where we assume that if a filesystem does not explicitly initialise the field, it is NULL, as the system call code does check for this field to be non-NULL. If it is NULL, it returns -ENOSYS. So, I guess, you are right in saying that there are certain per-filesystem patches that need to be installed.

So you're saying that the assumption that if a filesystem driver does not explicitly initialize the field, it is NULL, is wrong? What makes you think that?

fallocate() - ignoring POSIX

Posted Mar 26, 2007 14:28 UTC (Mon) by rwmj (subscriber, #5474) [Link] (1 responses)

I don't know if I'm reading this right, but are they proposing to ignore the POSIX standard?

Rich.

fallocate() - ignoring POSIX

Posted Mar 29, 2007 6:55 UTC (Thu) by pjdc (guest, #6906) [Link]

Surely the C library can easily implement posix_fallocate() in terms of fallocate().


Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds