Fiemap, an extent mapping ioctl

From:  Mark Fasheh <mfasheh@suse.com>
To:  linux-fsdevel@vger.kernel.org
Subject:  [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl
Date:  Sat, 24 May 2008 17:01:48 -0700
Message-ID:  <20080525000148.GJ8325@wotan.suse.de>
Cc:  Andreas Dilger <adilger@shaw.ca>, Kalpak Shah <Kalpak.Shah@sun.com>, Eric Sandeen <sandeen@redhat.com>, Josef Bacik <jbacik@redhat.com>

Hello,

	The following patches are the latest attempt at implementing a
fiemap ioctl, which can be used by userspace software to get extent
information for an inode in an efficient manner.

	These patches are against Linus' latest tree. While the core vfs
patch seems to be approaching feature-completeness, most of the series
should still be considered as being incomplete. The fs patches in particular
need some more attention. I think there's enough here however, that it makes
sense to start posting to fsdevel for general comments.

	Testing so far has been light, typically consisting of me running a
bare-bones ioctl wrapper program by hand:

   http://www.kernel.org/pub/linux/kernel/people/mfasheh/fie...

	We definitely need some more rigorous testing software, which I
believe Eric is working on. Additionally, a port of the 'filefrag'
application still needs to be completed.

	A lot has changed since the last fiemap patch was posted. Mostly,
the vfs<->fs api is more fleshed out, with suitable abstractions and helper
functions to aid implementation of ->fiemap. Some checks were added in the
vfs patch to catch things like overflow and violations of fs limits. Automatic
trimming of the request happens now so the fs doesn't have to worry about
ranges being larger than it can handle.

	Some changes were also made to the user API with the goal of
simplifying things so that it was easier for client file systems to
implement a callback. My hope is that a simpler API means file systems will
provide ->fiemap() quicker, and will be less likely to return results that
are wrong, or worse, slightly different from other implementations.

- Except for 'fm_flags', the various in/out fields on struct fiemap got
  turned into a single 'out' field - the number of mapped extents
  (fm_mapped_extents). This gives the kernel code that handles struct fiemap
  fewer 'moving parts' to deal with.

- Extent flags were cleaned up, and some new ones got added.

- Instead of forcing the user to add up all extent lengths before a given
  one to figure its logical offset, an 'fe_logical' field was added to
  fiemap_extent. This is a lot more obvious and straightforward in my
  opinion, and is well worth the tradeoff of a few bytes. It also obviates
  the need to describe holes as their existence is easily implied now. Also,
  fm_start and fm_length no longer have to be 'out' variables, which goes
  back to the 1st listed change.

- Handling of incompatible flags was simplified to just return -EBADR and
  the set of not-understood flags in fm_flags.

- Documentation/filesystems/fiemap.txt has been added in the 1st patch.


Below this I will include the contents of fiemap.txt to make it more
convenient for folks to get details on the API.
	--Mark


Fiemap Ioctl
============

The fiemap ioctl is an efficient method for userspace to get file
extent mappings. Instead of block-by-block mapping (such as bmap), fiemap
returns a list of extents.


Request Basics
--------------

A fiemap request is encoded within struct fiemap:

struct fiemap {
	__u64	fm_start;	 /* logical offset (inclusive) at
				  * which to start mapping (in) */
	__u64	fm_length;	 /* logical length of mapping which
				  * userspace cares about (in) */
	__u32	fm_flags;	 /* FIEMAP_FLAG_* flags for request (in) */
	__u32	fm_extent_count; /* size of fm_extents array (in) */
	__u32	fm_mapped_extents; /* number of extents that were
				    * mapped (out) */
	__u32	fm_reserved;
	struct fiemap_extent	fm_extents[0];
};


fm_start and fm_length specify the logical range within the file
which the process would like mappings for. Extents returned mirror
those on disk - that is, the logical offset of the 1st returned extent
may start before fm_start, and the range covered by the last returned
extent may end after fm_start + fm_length. All offsets and lengths are
in bytes.

Certain flags to modify the way in which mappings are looked up can be
set in fm_flags. If the kernel doesn't understand some particular
flags, it will return EBADR and the contents of fm_flags will contain
the set of flags which caused the error. If the kernel is compatible
with all flags passed, the contents of fm_flags will be unmodified.
It is up to userspace to determine whether rejection of a particular
flag is fatal to its operation. This scheme is intended to allow the
fiemap interface to grow in the future but without losing
compatibility with old software.

Currently, there are five flags which can be set in fm_flags:

* FIEMAP_FLAG_NUM_EXTENTS
If this flag is set, extent information will not be returned via the
fm_extents array and the value of fm_extent_count will be
ignored. Instead, the total number of extents covering the range will
be returned via fm_mapped_extents. This is useful for programs which
only want to count the number of extents in a file, but don't care
about the actual extent layout.

* FIEMAP_FLAG_SYNC
If this flag is set, the kernel will sync the file before mapping extents.

* FIEMAP_FLAG_HSM_READ
If the extent is offline, retrieve it before mapping and do not flag
it as FIEMAP_EXTENT_SECONDARY. This flag has no effect if the file
system does not support HSM.

* FIEMAP_FLAG_XATTR
If this flag is set, the extents returned will describe the inode's
extended attribute lookup tree, instead of its data tree.

* FIEMAP_FLAG_LUN_ORDER
If the file system stripes file data, this will return contiguous
regions of physical allocation, sorted by LUN. Logical offsets may not
make sense if this flag is passed. If the file system does not support
multiple LUNs, this flag will be ignored.


Extent Mapping
--------------

Note that all of this is ignored if FIEMAP_FLAG_NUM_EXTENTS is set.

Extent information is returned within the embedded fm_extents array
which userspace must allocate along with the fiemap structure. The
total number of fiemap_extents available should be passed via
fm_extent_count. The number of extents mapped by the kernel will be returned via
fm_mapped_extents. If the number of fiemap_extents allocated is less
than would be required to map the requested range, the maximum number
of extents that can be mapped in available memory will be returned and
fm_mapped_extents will be equal to fm_extent_count. In that case, the
last extent in the array will not complete the requested range and
will not have the FIEMAP_EXTENT_LAST flag set (see the next section on
extent flags).
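
A minimal sketch of how userspace might size the request buffer and detect
a filled array is shown here. The structures are local copies of the
layouts given in this document. The value of FIEMAP_EXTENT_LAST and the
helpers fiemap_alloc() and fiemap_may_need_more() are assumptions for
illustration, not part of the proposed API.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Local copies of the structures as laid out in this document. */
struct fiemap_extent {
	uint64_t fe_logical;
	uint64_t fe_physical;
	uint64_t fe_length;
	uint32_t fe_flags;
	uint32_t fe_lun;
};

struct fiemap {
	uint64_t fm_start;
	uint64_t fm_length;
	uint32_t fm_flags;
	uint32_t fm_extent_count;
	uint32_t fm_mapped_extents;
	uint32_t fm_reserved;
	struct fiemap_extent fm_extents[];	/* C99 flexible array member */
};

#define FIEMAP_EXTENT_LAST 0x00000001u	/* hypothetical value */

/* Allocate a zeroed request with room for 'count' extents, NULL on failure. */
static struct fiemap *fiemap_alloc(uint64_t start, uint64_t length,
				   uint32_t count)
{
	struct fiemap *fm;

	fm = calloc(1, sizeof(*fm) +
		       (size_t)count * sizeof(struct fiemap_extent));
	if (!fm)
		return NULL;
	fm->fm_start = start;
	fm->fm_length = length;
	fm->fm_extent_count = count;
	return fm;
}

/*
 * After a call returns: nonzero if the array filled up, the last extent
 * is not the last one in the file, and it stops short of the end of the
 * requested range, so another call is needed to finish the range.
 */
static int fiemap_may_need_more(const struct fiemap *fm)
{
	const struct fiemap_extent *last;
	uint64_t req_end;

	assert(fm->fm_mapped_extents <= fm->fm_extent_count);
	if (fm->fm_mapped_extents == 0 ||
	    fm->fm_mapped_extents < fm->fm_extent_count)
		return 0;
	last = &fm->fm_extents[fm->fm_mapped_extents - 1];
	if (last->fe_flags & FIEMAP_EXTENT_LAST)
		return 0;
	req_end = fm->fm_start + fm->fm_length;
	if (req_end < fm->fm_start)
		req_end = UINT64_MAX;	/* saturate on overflow */
	return last->fe_logical + last->fe_length < req_end;
}
```

The range comparison is there because FIEMAP_EXTENT_LAST marks the end of
the file, not the end of the request: a full array whose last extent
already reaches the end of the requested range needs no follow-up call.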

Each extent is described by a single fiemap_extent structure as
returned in fm_extents.

struct fiemap_extent {
	__u64	fe_logical;/* logical offset in bytes for the start of
			    * the extent */
	__u64	fe_physical; /* physical offset in bytes for the start
			      * of the extent */
	__u64	fe_length; /* length in bytes for the extent */
	__u32	fe_flags;  /* returned FIEMAP_EXTENT_* flags for the extent */
	__u32	fe_lun;	   /* logical device number for extent (starting at 0)*/
};

All offsets and lengths are in bytes and mirror those on disk - it is
valid for an extent's logical offset to start before the request or
its logical length to extend past the request. Unless
FIEMAP_EXTENT_NOT_ALIGNED is returned, fe_logical, fe_physical and
fe_length will be aligned to the block size of the file system.
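
Because every extent carries its own fe_logical, unmapped gaps in a file
can be derived from a returned array without any explicit hole records.
The sketch below assumes the extents are sorted by fe_logical; struct hole
and find_holes() are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local copy of the extent structure as laid out in this document. */
struct fiemap_extent {
	uint64_t fe_logical;
	uint64_t fe_physical;
	uint64_t fe_length;
	uint32_t fe_flags;
	uint32_t fe_lun;
};

struct hole {
	uint64_t start;		/* logical offset of the gap, in bytes */
	uint64_t length;	/* length of the gap, in bytes */
};

/*
 * Report gaps between 'start' and the first extent, and between
 * consecutive extents. Stores at most max_holes entries in 'out' and
 * returns the total number of gaps found.
 */
static size_t find_holes(const struct fiemap_extent *ext, size_t n,
			 uint64_t start, struct hole *out, size_t max_holes)
{
	uint64_t pos = start;	/* end of the area covered so far */
	size_t i, found = 0;

	assert(out != NULL || max_holes == 0);
	for (i = 0; i < n; i++) {
		if (ext[i].fe_logical > pos) {
			if (found < max_holes) {
				out[found].start = pos;
				out[found].length = ext[i].fe_logical - pos;
			}
			found++;
		}
		if (ext[i].fe_logical + ext[i].fe_length > pos)
			pos = ext[i].fe_logical + ext[i].fe_length;
	}
	return found;
}
```

Note that an extent which starts before 'start' simply advances 'pos'
without producing a gap, which matches the rule that returned extents may
begin before the request.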

The fe_flags field contains flags which describe the extent
returned. A special flag, FIEMAP_EXTENT_LAST, is always set on the last
extent in the file so that the process making fiemap calls can
determine when no more extents are available.
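
One way a process could walk a whole file in fixed-size batches, stopping
at FIEMAP_EXTENT_LAST, is sketched below. To keep the example runnable
without a kernel, the ioctl itself is replaced by a function pointer, and
fake_fiemap() is a stand-in that serves extents from a table. Those names,
struct fake_file, walk_extents() and the FIEMAP_EXTENT_LAST value are all
assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct fiemap_extent {
	uint64_t fe_logical, fe_physical, fe_length;
	uint32_t fe_flags, fe_lun;
};

struct fiemap {
	uint64_t fm_start, fm_length;
	uint32_t fm_flags, fm_extent_count, fm_mapped_extents, fm_reserved;
	struct fiemap_extent fm_extents[];
};

#define FIEMAP_EXTENT_LAST 0x00000001u	/* hypothetical value */

/* Stand-in for the real ioctl call. */
typedef int (*fiemap_call)(struct fiemap *fm, void *ctx);

struct fake_file {
	const struct fiemap_extent *ext;	/* sorted by fe_logical */
	uint32_t n;
};

/* Fake mapper: copy extents overlapping the request until the array fills. */
static int fake_fiemap(struct fiemap *fm, void *ctx)
{
	const struct fake_file *f = ctx;
	uint32_t i;

	for (i = 0; i < f->n && fm->fm_mapped_extents < fm->fm_extent_count; i++) {
		const struct fiemap_extent *e = &f->ext[i];

		if (e->fe_logical + e->fe_length <= fm->fm_start)
			continue;	/* entirely before the request */
		if (e->fe_logical >= fm->fm_start &&
		    e->fe_logical - fm->fm_start >= fm->fm_length)
			break;		/* entirely after the request */
		fm->fm_extents[fm->fm_mapped_extents++] = *e;
	}
	return 0;
}

/* Walk the file 'batch' extents at a time; returns total extents or -1. */
static long walk_extents(fiemap_call call, void *ctx, uint32_t batch)
{
	struct fiemap *fm;
	long total = 0;
	int done = 0;

	assert(batch > 0);
	fm = calloc(1, sizeof(*fm) + (size_t)batch * sizeof(struct fiemap_extent));
	if (!fm)
		return -1;
	fm->fm_length = UINT64_MAX;
	fm->fm_extent_count = batch;

	while (!done) {
		const struct fiemap_extent *last;

		fm->fm_mapped_extents = 0;
		if (call(fm, ctx) < 0) {
			total = -1;
			break;
		}
		if (fm->fm_mapped_extents == 0)
			break;		/* nothing mapped past this point */
		total += fm->fm_mapped_extents;
		last = &fm->fm_extents[fm->fm_mapped_extents - 1];
		done = (last->fe_flags & FIEMAP_EXTENT_LAST) != 0;
		/* Resume right after the last extent we were given. */
		fm->fm_start = last->fe_logical + last->fe_length;
		fm->fm_length = UINT64_MAX - fm->fm_start;
	}
	free(fm);
	return total;
}
```

Restarting at fe_logical + fe_length of the last returned extent, rather
than at a fixed stride, avoids mapping the same extent twice when extents
begin before or end after the requested range.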

Some flags are intentionally vague and will always be set in the
presence of other more specific flags. This way a program looking for
a general property does not have to know all existing and future flags
which imply that property.

For example, if FIEMAP_EXTENT_DATA_INLINE or FIEMAP_EXTENT_DATA_TAIL
are set, FIEMAP_EXTENT_NOT_ALIGNED will also be set. A program looking
for inline or tail-packed data can key on the specific flag. Software
which simply cares not to try operating on non-aligned extents
however, can just key on FIEMAP_EXTENT_NOT_ALIGNED, and not have to
worry about all present and future flags which might imply unaligned
data. Note that the opposite is not true - it would be valid for
FIEMAP_EXTENT_NOT_ALIGNED to appear alone.

* FIEMAP_EXTENT_LAST
This is the last extent in the file. A mapping attempt past this
extent will return nothing.

* FIEMAP_EXTENT_UNKNOWN
The location of this extent is currently unknown. This may indicate
the data is stored on an inaccessible volume or that no storage has
been allocated for the file yet.

* FIEMAP_EXTENT_SECONDARY
  - This will also set FIEMAP_EXTENT_UNKNOWN.
The data for this extent is in secondary storage.

* FIEMAP_EXTENT_DELALLOC
  - This will also set FIEMAP_EXTENT_UNKNOWN.
Delayed allocation - while there is data for this extent, its
physical location has not been allocated yet.

* FIEMAP_EXTENT_NO_DIRECT
Direct access to the data in this extent is illegal or will have
undefined results.

* FIEMAP_EXTENT_NET
  - This will also set FIEMAP_EXTENT_NO_DIRECT.
The data for this extent is not stored in a locally-accessible device.

* FIEMAP_EXTENT_DATA_COMPRESSED
  - This will also set FIEMAP_EXTENT_NO_DIRECT.
The data in this extent has been compressed by the file system.

* FIEMAP_EXTENT_DATA_ENCRYPTED
  - This will also set FIEMAP_EXTENT_NO_DIRECT.
The data in this extent has been encrypted by the file system.

* FIEMAP_EXTENT_NOT_ALIGNED
Extent offsets and length are not guaranteed to be block aligned.

* FIEMAP_EXTENT_DATA_INLINE
  - This will also set FIEMAP_EXTENT_NOT_ALIGNED.
Data is located within a metadata block.

* FIEMAP_EXTENT_DATA_TAIL
  - This will also set FIEMAP_EXTENT_NOT_ALIGNED.
Data is packed into a block with data from other files.

* FIEMAP_EXTENT_UNWRITTEN
Unwritten extent - the extent is allocated but its data has not been
initialized.
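
The "specific flag implies general flag" relationships listed above can be
summarized in a few lines of C. This is only a sketch: the bit values are
invented for the example, and fiemap_imply_general_flags() and
extent_is_directly_usable() are hypothetical helpers, the latter encoding
one possible userspace policy rather than anything the API mandates.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit values; only the flag names come from the text above. */
#define FIEMAP_EXTENT_LAST		0x0001u
#define FIEMAP_EXTENT_UNKNOWN		0x0002u
#define FIEMAP_EXTENT_SECONDARY		0x0004u
#define FIEMAP_EXTENT_DELALLOC		0x0008u
#define FIEMAP_EXTENT_NO_DIRECT		0x0010u
#define FIEMAP_EXTENT_NET		0x0020u
#define FIEMAP_EXTENT_DATA_COMPRESSED	0x0040u
#define FIEMAP_EXTENT_DATA_ENCRYPTED	0x0080u
#define FIEMAP_EXTENT_NOT_ALIGNED	0x0100u
#define FIEMAP_EXTENT_DATA_INLINE	0x0200u
#define FIEMAP_EXTENT_DATA_TAIL		0x0400u
#define FIEMAP_EXTENT_UNWRITTEN		0x0800u

/* Add every general flag implied by the specific flags already set. */
static uint32_t fiemap_imply_general_flags(uint32_t flags)
{
	uint32_t out = flags;

	if (flags & (FIEMAP_EXTENT_SECONDARY | FIEMAP_EXTENT_DELALLOC))
		out |= FIEMAP_EXTENT_UNKNOWN;
	if (flags & (FIEMAP_EXTENT_NET | FIEMAP_EXTENT_DATA_COMPRESSED |
		     FIEMAP_EXTENT_DATA_ENCRYPTED))
		out |= FIEMAP_EXTENT_NO_DIRECT;
	if (flags & (FIEMAP_EXTENT_DATA_INLINE | FIEMAP_EXTENT_DATA_TAIL))
		out |= FIEMAP_EXTENT_NOT_ALIGNED;

	assert((out & flags) == flags);	/* flags are only ever added */
	return out;
}

/*
 * Example policy: a tool that reads blocks straight from the device only
 * needs to test the three general flags, not every specific one.
 */
static int extent_is_directly_usable(uint32_t flags)
{
	return !(flags & (FIEMAP_EXTENT_UNKNOWN | FIEMAP_EXTENT_NO_DIRECT |
			  FIEMAP_EXTENT_NOT_ALIGNED));
}
```

A program written this way keeps working if a later kernel adds another
specific flag that implies, say, FIEMAP_EXTENT_NO_DIRECT.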


VFS -> File System Implementation
---------------------------------

File systems wishing to support fiemap must implement a ->fiemap
callback (on struct inode_operations):

struct inode_operations {
       ...

       int (*fiemap) (struct inode *, struct fiemap_extent_info *, u64 start,
                      u64 len);
       ...
};

->fiemap is passed struct fiemap_extent_info which describes the
fiemap request:

struct fiemap_extent_info {
	unsigned int	fi_flags;		/* Flags as passed from user */
	unsigned int	fi_extents_mapped;	/* Number of mapped extents */
	unsigned int	fi_extents_max;		/* Size of fiemap_extent array */
	char		*fi_extents_start;	/* Start of fiemap_extent array */
};

It is intended that the file system should only need to access
fi_flags directly. Aside from checking fi_flags to modify callback
behavior, the file system can write any flags it cannot handle
into fieinfo->fi_flags. In this case, the file system *must* return
-EBADR so that ioctl_fiemap() can write them into the userspace
buffer.

For each extent in the request range, the file system should call
the helper function, fiemap_fill_next_extent():

int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
			    u64 phys, u64 len, u32 flags, u32 lun);

fiemap_fill_next_extent() will use the passed values to populate the
next free extent in the fm_extents array. 'General' extent flags will
automatically be set from specific flags on behalf of the calling file
system so that the userspace API is not broken.

fiemap_fill_next_extent() returns 0 on success, and 1 when the
user-supplied fm_extents array is full. If an error is encountered
while copying the extent to user memory, -EFAULT will be returned.
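
To make the return convention concrete, here is a userspace model of the
helper and of a callback loop built on it. It is a sketch only:
model_fill_next_extent() and model_fs_fiemap() are hypothetical names, the
copy uses memcpy() into ordinary memory (so the -EFAULT path cannot occur
here), and returning 1 once the final slot has been filled is this
example's reading of "full", not a statement about the patch itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct fiemap_extent {
	uint64_t fe_logical, fe_physical, fe_length;
	uint32_t fe_flags, fe_lun;
};

struct fiemap_extent_info {
	unsigned int	fi_flags;		/* Flags as passed from user */
	unsigned int	fi_extents_mapped;	/* Number of mapped extents */
	unsigned int	fi_extents_max;		/* Size of fiemap_extent array */
	char		*fi_extents_start;	/* Start of fiemap_extent array */
};

/* Model: store one extent; 0 if room remains, 1 if the array is full. */
static int model_fill_next_extent(struct fiemap_extent_info *info,
				  uint64_t logical, uint64_t phys,
				  uint64_t len, uint32_t flags, uint32_t lun)
{
	struct fiemap_extent ext = { logical, phys, len, flags, lun };

	assert(info != NULL);
	if (info->fi_extents_mapped >= info->fi_extents_max)
		return 1;	/* no free slot, nothing stored */
	memcpy(info->fi_extents_start +
	       (size_t)info->fi_extents_mapped * sizeof(ext), &ext, sizeof(ext));
	info->fi_extents_mapped++;
	return info->fi_extents_mapped == info->fi_extents_max;
}

/*
 * Model of a ->fiemap callback: report every extent from 'src' (sorted by
 * fe_logical) that overlaps [start, start + len), stopping when full.
 */
static int model_fs_fiemap(struct fiemap_extent_info *info,
			   const struct fiemap_extent *src, size_t n,
			   uint64_t start, uint64_t len)
{
	size_t i;
	int ret;

	for (i = 0; i < n; i++) {
		if (src[i].fe_logical + src[i].fe_length <= start)
			continue;	/* ends before the request */
		if (src[i].fe_logical >= start &&
		    src[i].fe_logical - start >= len)
			break;		/* starts after the request */
		ret = model_fill_next_extent(info, src[i].fe_logical,
					     src[i].fe_physical,
					     src[i].fe_length,
					     src[i].fe_flags, src[i].fe_lun);
		if (ret < 0)
			return ret;	/* would be -EFAULT in the kernel */
		if (ret == 1)
			break;		/* user array is full */
	}
	return 0;
}
```

The callback treats a return of 1 as a normal stopping condition rather
than an error, so a full user array still results in a successful call
with fi_extents_mapped equal to fi_extents_max.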

If the request has the FIEMAP_FLAG_NUM_EXTENTS flag set, then calling
this helper is not necessary and fi_extents_mapped can be set
directly.


