A superficial introduction to fsblock
Why "much-maligned"? Buffer heads are difficult to manage, to the point that they can create significant memory pressure on some systems. They deal in very small units of I/O (512 bytes), so you need a pile of them to represent even a single page. And there is a certain sense of antiquity that one encounters when dealing with them; the buffer head code is some of the oldest code in the core kernel. But it is important and tricky code, so few developers dare to try to improve it.
Nick Piggin is the daring type.  But Nick, too, is not trying to improve
the bh layer; instead, he would like to replace it outright.  The result is
an intimidating set of large patches known as "fsblock."  This code was
first posted in 2007, making it fairly young by the standards of
memory-management patches.  This patch set was reposted in early March; it
has shown a number of improvements on the way.  Nick says "I'm pretty
intent on getting it merged sooner or later", so we'll likely be seeing
more of this code in the future.
The core data structure is struct fsblock, which represents one block:
    struct fsblock {
        unsigned int    flags;
        unsigned int    count;
    #ifdef BDFLUSH_FLUSHING
        struct rb_node  block_node;
    #endif
        sector_t        block_nr;
        void            *private;
        struct page     *page;
    };
This structure, notes Nick, is about 1/3 the size of struct buffer_head, but it serves roughly the same purpose: tracking the association between an in-memory block (found in page) and its on-disk version, indexed by block_nr. The flags field describes the state of this block: whether it's up-to-date (memory and disk versions match), locked, dirty, in writeback, etc. Some of these flags (the dirty state, for example) match the state stored with the in-memory page; the fsblock layer (unlike the buffer_head code) takes great care to keep those flags in sync.
There are a couple of interesting flags in the fsblock structure which one does not find associated with buffer heads. One of them is not a flag at all: BL_bits_mask describes a subfield giving the size of the block. In fsblock, "blocks" are not limited to the standard 512-byte sector size; they can, in fact, even be larger than a page. These "superpage" blocks have been on some filesystem developers' wish lists for some time; they would make it easy to create filesystems with large blocks which, in turn, would perform better in a number of situations. But the superpage feature may be removed for any initial merge of fsblock in an attempt to make the code easier to understand and review. Besides, large blocks are a bit of a controversial topic, so it makes sense to address that issue separately.
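To make the idea of a size subfield concrete, here is a minimal user-space sketch of packing a block size into the low bits of a flags word. It is not taken from the fsblock patches: the log2 encoding, the mask value, the BL_DIRTY bit, and the struct and function names are all assumptions made for illustration.

```c
#include <assert.h>

/* Hypothetical layout: the low five bits of flags hold log2(block size). */
#define BL_BITS_MASK	0x1fu		/* stands in for BL_bits_mask; value is assumed */
#define BL_DIRTY	(1u << 5)	/* hypothetical state flag above the subfield */

struct blk {
	unsigned int flags;
};

/* Store a power-of-two block size without disturbing the state flags. */
static void blk_set_size(struct blk *b, unsigned int size)
{
	unsigned int bits = 0;

	assert(size != 0 && (size & (size - 1)) == 0);	/* power of two only */
	while ((1u << bits) < size)
		bits++;
	b->flags = (b->flags & ~BL_BITS_MASK) | bits;
}

/* Recover the block size from the subfield. */
static unsigned int blk_size(const struct blk *b)
{
	return 1u << (b->flags & BL_BITS_MASK);
}
```

With an encoding like this, a 512-byte block and a 64KB "superpage" block cost the same few bits, which is one plausible reason to fold the size into the flags word instead of giving it a field of its own.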
The flags field also holds a flag called BL_metadata; this flag indicates a block which holds filesystem metadata instead of file data. In this case, the block is actually part of a larger structure which (edited slightly) looks like this:
    struct fsblock_meta {
        struct fsblock block;
        union {
    #ifdef VMAP_CACHE
            /* filesystems using vmap APIs should not use ->data */
            struct vmap_cache_entry *vce;
    #endif
            /*
             * data is a direct mapping to the block device data, used by
             * "intermediate" mode filesystems.
             */
            char *data;
        };
    };
In short, this structure makes it easy for filesystem code to deal directly with metadata blocks. Finally, the fsblock_sb structure ties a filesystem superblock into the fsblock subsystem.
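Because the generic block is embedded as a member of the metadata structure, code holding a pointer to the inner block can recover the enclosing structure with pointer arithmetic, in the style of the kernel's container_of(). The sketch below shows that pattern in user space; the trimmed-down structures, the flag's bit value, and the to_meta() helper are hypothetical and are not part of the fsblock API.

```c
#include <stddef.h>

#define BL_METADATA	(1u << 0)	/* stands in for BL_metadata; bit value is assumed */

struct fsblock_sketch {
	unsigned int flags;
};

struct fsblock_meta_sketch {
	struct fsblock_sketch block;	/* embedded generic block */
	char *data;			/* direct mapping of the block's contents */
};

/* Return the enclosing metadata structure, or NULL for a plain data block. */
static struct fsblock_meta_sketch *to_meta(struct fsblock_sketch *b)
{
	if (!(b->flags & BL_METADATA))
		return NULL;
	return (struct fsblock_meta_sketch *)
		((char *)b - offsetof(struct fsblock_meta_sketch, block));
}
```

Checking the flag first matters: a plain data block has no surrounding structure, so the conversion is only valid for blocks marked as metadata.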
A filesystem can, at mount time, set things up with a call to:
    int fsblock_register_super(struct super_block *sb, 
                               struct fsblock_sb *fsb_sb);
The superblock can then be read in with a call to sb_mbread():
    struct fsblock_meta *sb_mbread(struct fsblock_sb *fsb_sb, 
                                   sector_t blocknr);
There's only one little problem: before fsblock can perform block I/O operations, it must have access to the superblock. So, thus far, filesystems which have been converted to fsblock must still use the buffer head API to read the superblock. One assumes that this little glitch will be taken care of at some point.
A tour of the full fsblock API would require a few articles - it is a lot of code. Hopefully a quick overview will provide a sense for how it all works. To start with, blocks are reference-counted objects in fsblock, so there is the usual set of functions for incrementing and decrementing the counts:
    void block_get(struct fsblock *block);
    void block_put(struct fsblock *block);
    void mblock_get(struct fsblock_meta *block);
    void mblock_put(struct fsblock_meta *block);
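The get/put pattern behind functions like these can be sketched in user space as follows. This is only an illustration: the structure and function names are made up, the counter is not atomic (kernel code would need proper atomicity or locking), and freeing the object when the count reaches zero is an assumption about policy, not a description of what fsblock does.

```c
#include <assert.h>
#include <stdlib.h>

struct blk_ref {
	unsigned int count;
};

/* Allocate an object that starts with a single reference held by the caller. */
static struct blk_ref *blk_ref_alloc(void)
{
	struct blk_ref *b = malloc(sizeof(*b));

	if (b)
		b->count = 1;
	return b;
}

/* Take an additional reference. */
static void blk_ref_get(struct blk_ref *b)
{
	b->count++;
}

/* Drop a reference; returns 1 if this was the last one and the object was freed. */
static int blk_ref_put(struct blk_ref *b)
{
	assert(b->count > 0);
	if (--b->count == 0) {
		free(b);
		return 1;
	}
	return 0;
}
```

The return value from blk_ref_put() exists only so the sketch can be checked; the real block_put() shown above returns void.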
There's a whole set of functions for performing I/O on blocks and metadata blocks; some of these are:
    struct fsblock_meta *mbread(struct fsblock_sb *fsb_sb, sector_t blocknr,
                                unsigned int size);
    int mblock_read_sync(struct fsblock_meta *mb);
    int sync_block(struct fsblock *block);
Note that, while there are a number of functions for reading blocks, there are fewer write functions. Instead, code will use a function like set_block_dirty() or mark_mblock_dirty(), then leave it up to the memory management code to decide when the actual I/O should take place.
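The "mark it dirty now, write it later" model can be illustrated with a small user-space sketch. Nothing here comes from the fsblock patches: the structure, the SK_DIRTY flag, and both functions are hypothetical stand-ins, and the writeback pass simply clears the flag at the point where real code would submit I/O.

```c
#include <stddef.h>

#define SK_DIRTY	(1u << 0)	/* hypothetical dirty flag */

struct sk_block {
	unsigned int flags;
	unsigned long block_nr;
};

/* The caller's only job: record that the in-memory copy has changed. */
static void sk_set_dirty(struct sk_block *b)
{
	b->flags |= SK_DIRTY;
}

/*
 * Stand-in for a later writeback pass: visit every block, "write" the
 * dirty ones, and return how many were cleaned.
 */
static size_t sk_writeback(struct sk_block *blocks, size_t n)
{
	size_t i, written = 0;

	for (i = 0; i < n; i++) {
		if (!(blocks[i].flags & SK_DIRTY))
			continue;
		/* Real code would submit I/O for blocks[i].block_nr here. */
		blocks[i].flags &= ~SK_DIRTY;
		written++;
	}
	return written;
}
```

Separating the two steps lets whatever drives writeback batch and order the writes, which is the point of leaving the timing to the memory-management code.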
There is a lot more than this to fsblock, including functions to lock blocks, look up in-memory blocks, perform page I/O, truncate pages, implement mmap(), and more. One assumes that Nick will certainly write exhaustive documentation for this API sometime soon.
Beyond that little documentation task, there are a few other things to do,
including supporting direct I/O and fixing a number of known bugs.  But,
even now, fsblock seems to have a lot of potential; it updates the old
buffer head API in a way which is more efficient and more robust.  It also
appears to perform better with the ext2 filesystem - a fact which appears
to be surprising to Nick.  So something like fsblock will almost certainly
be merged sooner or later.  A lot could happen in the meantime, though.
Core memory-management-related patches like this are notoriously slow to
get through the merging process, and, despite its age, fsblock has not seen a great
deal of review to date.  So there's likely to be plenty of time and
opportunity for other developers to find things to disagree with before
fsblock hits the mainline.