CacheFS documentation

[Posted September 1, 2004 by corbet]
			  ===========================
			  CacheFS: Caching Filesystem
			  ===========================

========
OVERVIEW
========

CacheFS is a general purpose cache for network filesystems, though it could be
used for caching other things such as ISO9660 filesystems too.

CacheFS uses a block device directly rather than a directory under a
filesystem. This means it can perform its own journalling more efficiently, and
is not beholden to the underlying filesystem. If necessary, however, a file can
be loopback mounted as a cache.

CacheFS does not follow the idea of completely loading every netfs file opened
into the cache before it can be operated upon, and then serving the pages out
of the cachefs rather than the netfs because:

 (1) It must be practical to operate without a cache.

 (2) The size of any accessible file must not be limited to the size of the
     cache.

 (3) The combined size of all opened files (this includes mapped libraries)
     must not be limited to the size of the cache.

 (4) The user should not be forced to download an entire file just to do a
     one-off access of a small portion of it.

Instead, it serves the cache out in PAGE_SIZE chunks as and when requested by
any netfs using it.


CacheFS provides the following facilities:

 (1) More than one block device can be mounted as a cache.

 (2) Caches can be mounted / unmounted at any time.

 (3) The netfs is provided with an interface that allows either party to
     withdraw caching facilities from a file (required for (2)).

 (4) The interface to the netfs returns as few errors as possible, preferring
     rather to let the netfs remain oblivious.

 (5) Cookies are used to represent files and indexes to the netfs. The simplest
     cookie is just a NULL pointer - indicating nothing cached there.

 (6) The netfs is allowed to propose - dynamically - any index hierarchy it
     desires, though it must be aware that the index search function is
     recursive and stack space is limited.

 (7) Data I/O is done direct to and from the netfs's pages. The netfs indicates
     that page A is at index B of the data-file represented by cookie C, and
     that it should be read or written. CacheFS may or may not start I/O on
     that page, but if it does, a netfs callback will be invoked to indicate
     completion.

 (8) Cookies can be "retired" upon release. At this point CacheFS will mark
     them as obsolete and the index hierarchy rooted at that point will get
     recycled.

 (9) The netfs provides a "match" function for index searches. In addition to
     saying whether a match was made or not, this can also specify that an
     entry should be updated or deleted.

(10) All metadata modifications (this includes index contents) are performed
     as journalled transactions. These are replayed on mounting.


======================
GENERAL ON-DISC LAYOUT
======================

The filesystem is divided into a number of parts:

  0	+---------------------------+
	|        Superblock         |
  1	+---------------------------+
	|      Update Journal       |
	+---------------------------+
	|     Validity Journal      |
	+---------------------------+
	|    Write-Back Journal     |
	+---------------------------+
	|                           |
	|           Data            |
	|                           |
 END	+---------------------------+

The superblock contains the filesystem ID tags and pointers to all the other
regions.

The update journal consists of a set of entries of sector size that keep track
of what changes have been made to the on-disc filesystem, but not yet
committed.

The validity journal contains records of data blocks that have been allocated
but not yet written. Upon journal replay, all these blocks will be detached
from their pointers and recycled.

The writeback journal keeps track of changes that have been made locally to
data blocks, but that have not yet been committed back to the server. This is
not yet implemented.

The journals are replayed upon mounting to make sure that the cache is in a
reasonable state.

The data region holds a number of things:

  (1) Index Files

      These are files of entries used by CacheFS internally and by filesystems
      that wish to cache data here (such as AFS) to keep track of what's in
      the cache at any given time.

      The first index file (inode 1) is special. It holds the cachefs-specific
      metadata for every file in the cache (including direct, single-indirect
      and double-indirect block pointers).

      The second index file (inode 2) is also special. It has an entry for
      each filesystem that's currently holding data in this cache.

      Every allocated entry in an index has an inode bound to it. This inode is
      either another index file or it is a data file.

  (2) Cached Data Files

      These are caches of files from remote servers. Holes in these files
      represent blocks not yet obtained from the server.

  (3) Indirection Blocks

      Should a file have more blocks than can be pointed to by the few
      pointers in its storage management record, then indirection blocks will
      be used to point to further data or indirection blocks.

      Two levels of indirection are currently supported:

	- single indirection
	- double indirection

  (4) Allocation Nodes and Free Blocks

      The free blocks of the filesystem are kept in two single-branched
      "trees". One tree is the blocks that are ready to be allocated, and the
      other is the blocks that have just been recycled. When the former tree
      becomes empty, the latter tree is decanted across.

      Each tree is arranged as a chain of "nodes"; each node points to the next
      node in the chain (unless it's at the end) and also up to 1022 free
      blocks.
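
The two-tree scheme in (4) can be sketched in userspace C as below. This is
only an illustration of the decanting idea, not the CacheFS code: the
structure names, the nleaves counter, the use of in-memory pointers in place
of on-disc block numbers, and the convention of returning 0 when nothing is
free are all assumptions made for the example. Only the figure of 1022 leaves
per node comes from the text above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define EXAMPLE_LEAVES_PER_NODE 1022	/* figure taken from the text above */

/* Hypothetical in-memory stand-in for an on-disc allocation node. */
struct example_alloc_node {
	struct example_alloc_node *next;	/* next node in chain, or NULL */
	uint32_t nleaves;			/* valid entries in leaves[] */
	uint32_t leaves[EXAMPLE_LEAVES_PER_NODE]; /* free block numbers */
};

struct example_free_space {
	struct example_alloc_node *ready;	/* blocks ready to allocate */
	struct example_alloc_node *recycled;	/* blocks just recycled */
};

/* Add a freed block to the recycling chain.  Returns 0, or -1 on OOM. */
static int example_free_block(struct example_free_space *fs, uint32_t block)
{
	struct example_alloc_node *node = fs->recycled;

	if (!node || node->nleaves == EXAMPLE_LEAVES_PER_NODE) {
		node = calloc(1, sizeof(*node));
		if (!node)
			return -1;
		node->next = fs->recycled;
		fs->recycled = node;
	}
	node->leaves[node->nleaves++] = block;
	return 0;
}

/* Take a block from the ready chain, decanting the recycled chain across
 * when the ready chain is empty.  Returns 0 if nothing is free; block 0 is
 * the superblock, so it is never a valid allocation. */
static uint32_t example_alloc_block(struct example_free_space *fs)
{
	struct example_alloc_node *node;
	uint32_t block;

	if (!fs->ready) {
		fs->ready = fs->recycled;
		fs->recycled = NULL;
	}
	node = fs->ready;
	if (!node)
		return 0;

	assert(node->nleaves > 0);	/* empty nodes are never left chained */
	block = node->leaves[--node->nleaves];
	if (node->nleaves == 0) {
		fs->ready = node->next;
		free(node);
	}
	return block;
}
```

Freed blocks only become allocatable again once the ready chain has run dry,
which is the behaviour the two-tree description implies.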

Note that all blocks are PAGE_SIZE in size. The blocks are numbered starting
with the superblock at 0. Using 32-bit block pointers, a maximum number of
0xffffffff blocks can be accessed, meaning that the maximum cache size is ~16TB
for 4KB pages.
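
The size limit quoted above is plain arithmetic and can be checked with a few
lines of C. The function name is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Largest cache addressable with 32-bit block pointers, in bytes.  The
 * block count 0xffffffff is the figure given in the text above. */
static uint64_t example_max_cache_bytes(uint32_t page_size)
{
	assert(page_size > 0);
	return (uint64_t)0xffffffffu * page_size;
}
```

For 4096-byte pages this gives 17592186040320 bytes, which is one page short
of 2^44 bytes (16TB counted in binary units).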


========
MOUNTING
========

Since CacheFS is actually a quasi-filesystem, it requires a block device behind
it. The way to give it one is to mount it as cachefs type on a directory
somewhere. The mounted filesystem will then present the user with a set of
directories outlining the index structure resident in the cache. Indexes
(directories) and files can be turfed out of the cache by the sysadmin through
the use of rmdir and unlink.

For instance, if a cache contains AFS data, the user might see the following:

	root>mount -t cachefs /dev/hdg9 /cache-hdg9
	root>ls -1 /cache-hdg9
	afs
	root>ls -1 /cache-hdg9/afs
	cambridge.redhat.com
	root>ls -1 /cache-hdg9/afs/cambridge.redhat.com
	root.afs
	root.cell

However, a block device that's going to be used for a cache must be prepared
before it can be mounted initially. This is done very simply by:

	echo "cachefs___" >/dev/hdg9

During the initial mount, the basic structure will be scribed into the cache,
and then a background thread will "recycle" the as-yet unused data blocks.


======================
NETWORK FILESYSTEM API
======================

There is, of course, an API by which a network filesystem can make use of the
CacheFS facilities. This is based around a number of principles:

 (1) Every file and index is represented by a cookie. This cookie may or may
     not have anything associated with it, but the netfs doesn't need to care.

 (2) Barring the top-level index (one entry per cached netfs), the index
     hierarchy for each netfs is structured according to the whim of the netfs.

 (3) Any netfs page being backed by the cache must have a small token
     associated with it (possibly pointed to by page->private) so that CacheFS
     can keep track of it.

This API is declared in <linux/cachefs.h>.


NETWORK FILESYSTEM DEFINITION
-----------------------------

CacheFS needs a description of the network filesystem. This is specified using
a record of the following structure:

	struct cachefs_netfs {
		const char			*name;
		unsigned			version;
		struct cachefs_netfs_operations	*ops;
		struct cachefs_cookie		*primary_index;
		...
	};

The first three fields should be filled in before registration, and the fourth
will be filled in by the registration function; any other fields should just be
ignored and are for internal use only.

The fields are:

 (1) The name of the netfs (used as the key in the toplevel index).

 (2) The version of the netfs (if the name matches but the version doesn't, the
     entire on-disc hierarchy for this netfs will be scrapped and begun
     afresh).

 (3) The operations table is defined as follows:

	struct cachefs_netfs_operations {
		struct cachefs_page *(*get_page_cookie)(struct page *page);
	};

     The functions here must all be present. Currently the only one is:

     (a) get_page_cookie(): Get the token used to bind a page to a block in a
         cache. This function should allocate it if it doesn't exist.

	 Return -ENOMEM if there's not enough memory and -ENODATA if the page
	 just shouldn't be cached.

	 Set *_page_cookie to point to the token and return 0 if there is now a
	 cookie. Note that the netfs must keep track of the cookie itself (and
	 free it later). page->private can be used for this (see below).

 (4) The cookie representing the primary index will be allocated according to
     another parameter passed into the registration function.

For example, kAFS (linux/fs/afs/) uses the following definitions to describe
itself:

	static struct cachefs_netfs_operations afs_cache_ops = {
		.get_page_cookie	= afs_cache_get_page_cookie,
	};

	struct cachefs_netfs afs_cache_netfs = {
		.name			= "afs",
		.version		= 0,
		.ops			= &afs_cache_ops,
	};


INDEX DEFINITION
----------------

Indexes are used for two purposes:

 (1) To speed up the finding of a file based on a series of keys (such as AFS's
     "cell", "volume ID", "vnode ID").

 (2) To make it easier to discard a subset of all the files cached based around
     a particular key - for instance to mirror the removal of an AFS volume.

However, since it's unlikely that any two netfs's are going to want to define
their index hierarchies in quite the same way, CacheFS tries to impose as few
restraints as possible on how an index is structured and where it is placed in
the tree. The netfs can even mix indexes and data files at the same level, but
it's not recommended.

There are some limits on indexes:

 (1) All entries in any given index must be the same size. An array of such
     entries needn't fit exactly into a page, but they will not be laid across
     a page boundary.

     The netfs supplies a blob of data for each index entry, and CacheFS
     provides an inode number and a flag.

 (2) The entries in one index can be of a different size to the entries in
     another index.

 (3) The entry data must be journallable, and thus must be able to fit into an
     update journal entry - this limits the maximum size to a little over 400
     bytes at present.

 (4) The index data must start with the key. The layout of the key is described
     in the index definition, and this is used to display the key in some
     appropriate way.

 (5) The depth of the index tree should be judged with care as the search
     function is recursive. Too many layers will run the kernel out of stack.
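
Limit (1) implies a simple way of locating an entry within an index file. The
sketch below shows the arithmetic under the assumption that entries are packed
from the start of each page and any leftover space at the end of the page is
left unused. The names are made up for the example, and entry_size stands for
the full on-disc entry size (netfs blob plus whatever CacheFS adds), which
the text does not spell out.

```c
#include <assert.h>
#include <stddef.h>

struct example_entry_pos {
	size_t page;	/* page of the index file holding the entry */
	size_t offset;	/* byte offset of the entry within that page */
};

/* Locate fixed-size entry number entry_no, never straddling a page. */
static struct example_entry_pos
example_locate_entry(size_t entry_no, size_t entry_size, size_t page_size)
{
	struct example_entry_pos pos;
	size_t per_page;

	assert(entry_size > 0 && entry_size <= page_size);
	per_page = page_size / entry_size;	/* remainder stays unused */
	pos.page = entry_no / per_page;
	pos.offset = (entry_no % per_page) * entry_size;
	return pos;
}
```

With 100-byte entries and 4096-byte pages, 40 entries fit in each page and
the last 96 bytes of the page are wasted, so entry 41 sits at offset 100 of
page 1.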

To define an index, a structure of the following type should be filled out:

	struct cachefs_index_def
	{
		uint8_t name[8];
		uint16_t data_size;
		struct {
			uint8_t type;
			uint16_t len;
		} keys[4];

		cachefs_match_val_t (*match)(void *target_netfs_data,
					     const void *entry);

		void (*update)(void *source_netfs_data, void *entry);
	};

This has the following fields:

 (1) The name of the index (NUL terminated unless all 8 chars are used).

 (2) The size of the data blob provided by the netfs.

 (3) A definition of the key(s) at the beginning of the blob. The netfs is
     permitted to specify up to four keys. The total length must not exceed the
     data size. It is assumed that the keys will be laid end to end in order,
     starting at the first byte of the data.

     The type field specifies the way the data should be displayed. It can be
     one of:

	(*) CACHEFS_INDEX_KEYS_NOTUSED	- key field not used
	(*) CACHEFS_INDEX_KEYS_BIN	- display byte-by-byte in hex
	(*) CACHEFS_INDEX_KEYS_ASCIIZ	- NUL-terminated ASCII
	(*) CACHEFS_INDEX_KEYS_IPV4ADDR	- display as IPv4 address
	(*) CACHEFS_INDEX_KEYS_IPV6ADDR	- display as IPv6 address

 (4) A function to compare an in-page-cache index entry blob with the data
     passed to the cookie acquisition function. This function can also be used
     to extract data from the blob and copy it into the netfs's structures.

     The values this function can return are:

	(*) CACHEFS_MATCH_FAILED - failed to match
	(*) CACHEFS_MATCH_SUCCESS - successful match
	(*) CACHEFS_MATCH_SUCCESS_UPDATE - successful match, entry needs update
	(*) CACHEFS_MATCH_SUCCESS_DELETE - entry should be deleted

     For example, in linux/fs/afs/vnode.c:

	static cachefs_match_val_t
	afs_vnode_cache_match(void *target, const void *entry)
	{
		const struct afs_cache_vnode *cvnode = entry;
		struct afs_vnode *vnode = target;

		if (vnode->fid.vnode != cvnode->vnode_id)
			return CACHEFS_MATCH_FAILED;

		if (vnode->fid.unique != cvnode->vnode_unique ||
		    vnode->status.version != cvnode->data_version)
			return CACHEFS_MATCH_SUCCESS_DELETE;

		return CACHEFS_MATCH_SUCCESS;
	}

 (5) A function to initialise or update an in-page-cache index entry blob from
     netfs data passed to CacheFS by the netfs. This function should not assume
     that there's any data yet in the in-page-cache.

     Continuing the above example:

	static void afs_vnode_cache_update(void *source, void *entry)
	{
		struct afs_cache_vnode *cvnode = entry;
		struct afs_vnode *vnode = source;

		cvnode->vnode_id	= vnode->fid.vnode;
		cvnode->vnode_unique	= vnode->fid.unique;
		cvnode->data_version	= vnode->status.version;
	}

To finish the above example, the index definition for the "vnode" level is as
follows:

	struct cachefs_index_def afs_vnode_cache_index_def = {
		.name		= "vnode",
		.data_size	= sizeof(struct afs_cache_vnode),
		.keys[0]	= { CACHEFS_INDEX_KEYS_BIN, 4 },
		.match		= afs_vnode_cache_match,
		.update		= afs_vnode_cache_update,
	};

The first element of struct afs_cache_vnode is the vnode ID.

And for contrast, the cell index definition is:

	struct cachefs_index_def afs_cache_cell_index_def = {
		.name			= "cell_ix",
		.data_size		= sizeof(afs_cell_t),
		.keys[0]		= { CACHEFS_INDEX_KEYS_ASCIIZ, 64 },
		.match			= afs_cell_cache_match,
		.update			= afs_cell_cache_update,
	};

The cell index is the primary index for kAFS.
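
The rule that keys are laid end to end from the first byte of the blob, and
that their total length must not exceed the data size, can be checked
mechanically. The following is a sketch only: the structure is a cut-down,
hypothetical mirror of the keys[] member shown earlier, and
EXAMPLE_KEY_NOTUSED is a stand-in for CACHEFS_INDEX_KEYS_NOTUSED, whose actual
value is not given here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_MAX_KEYS	4	/* "up to four keys" */
#define EXAMPLE_KEY_NOTUSED	0	/* assumed value, for illustration */

/* Hypothetical mirror of one keys[] slot in an index definition. */
struct example_key_def {
	uint8_t type;
	uint16_t len;
};

/* Compute the starting offset of each key within the blob.  Returns the
 * total key length, or -1 if the keys would overrun data_size. */
static int example_key_offsets(const struct example_key_def *keys,
			       uint16_t data_size,
			       size_t offsets[EXAMPLE_MAX_KEYS])
{
	size_t total = 0;
	int i;

	assert(keys != NULL && offsets != NULL);
	for (i = 0; i < EXAMPLE_MAX_KEYS; i++) {
		offsets[i] = total;
		if (keys[i].type != EXAMPLE_KEY_NOTUSED)
			total += keys[i].len;
	}
	return total <= data_size ? (int)total : -1;
}
```

For a definition with a 4-byte key followed by an 8-byte key, the second key
starts at offset 4 and the keys occupy 12 bytes in total, so any data_size
below 12 would be rejected.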


NETWORK FILESYSTEM (UN)REGISTRATION
-----------------------------------

The first step is to declare the network filesystem to the cache. This also
involves specifying the layout of the primary index (for AFS, this would be the
"cell" level).

The registration function is:

	int cachefs_register_netfs(struct cachefs_netfs *netfs,
				   struct cachefs_index_def *primary_idef);

It just takes pointers to the netfs definition and the primary index
definition. It returns 0 or an error as appropriate.

For kAFS, registration is done as follows:

	ret = cachefs_register_netfs(&afs_cache_netfs,
				     &afs_cache_cell_index_def);

The last step is, of course, unregistration:

	void cachefs_unregister_netfs(struct cachefs_netfs *netfs);


INDEX REGISTRATION
------------------

The second step is to inform cachefs about part of an index hierarchy that can
be used to locate files. This is done by requesting a cookie for each index in
the path to the file:

	struct cachefs_cookie *
	cachefs_acquire_cookie(struct cachefs_cookie *iparent,
			       struct cachefs_index_def *idef,
			       void *netfs_data);

This function creates an index entry in the index represented by iparent,
loading the associated blob by calling iparent's update method with the
supplied netfs_data.

It also creates a new index inode, formatted according to the definition
supplied in idef. The new cookie is then returned.

Note that this function never returns an error - all errors are handled
internally. It may also return CACHEFS_NEGATIVE_COOKIE. It is quite acceptable
to pass this token back to this function as iparent (or even to the relinquish
cookie, read page and write page functions - see below).

Note also that no indexes are actually created on disc until a data file needs
to be created somewhere down the hierarchy. Furthermore, an index may be
created in several different caches independently at different times. This is
all handled transparently, and the netfs doesn't see any of it.

For example, with AFS, a cell would be added to the primary index. This index
entry would have a dependent inode containing a volume location index for the
volume mappings within this cell:

	cell->cache =
		cachefs_acquire_cookie(afs_cache_netfs.primary_index,
				       &afs_vlocation_cache_index_def,
				       cell);

Then when a volume location was accessed, it would be entered into the cell's
index and an inode would be allocated that acts as a volume type and hash chain
combination:

	vlocation->cache =
		cachefs_acquire_cookie(cell->cache,
				       &afs_volume_cache_index_def,
				       vlocation);

And then a particular flavour of volume (R/O for example) could be added to
that index, creating another index for vnodes (AFS inode equivalents):

	volume->cache =
		cachefs_acquire_cookie(vlocation->cache,
				       &afs_vnode_cache_index_def,
				       volume);


DATA FILE REGISTRATION
----------------------

The third step is to request a data file be created in the cache. This is
almost identical to index cookie acquisition. The only difference is that a
NULL index definition is passed.

	vnode->cache =
		cachefs_acquire_cookie(volume->cache,
				       NULL,
				       vnode);



PAGE ALLOC/READ/WRITE
---------------------

And the fourth step is to propose a page be cached. There are two functions
that are used to do this.

Firstly, the netfs should ask CacheFS to examine the caches and read the
contents cached for a particular page of a particular file if present, or else
allocate space to store the contents if not:

	typedef
	void (*cachefs_rw_complete_t)(void *cookie_data,
				      struct page *page,
				      void *end_io_data,
				      int error);

	int cachefs_read_or_alloc_page(struct cachefs_cookie *cookie,
				       struct page *page,
				       cachefs_rw_complete_t end_io_func,
				       void *end_io_data,
				       unsigned long gfp);

The cookie argument must specify a data file cookie, the page specified will
have the data loaded into it (and is also used to specify the page number), and
the gfp argument is used to control how any memory allocations made are
satisfied.

If the cookie indicates the inode is not cached:

 (1) The function will return -ENOBUFS.

Else if there's a copy of the page resident on disc:

 (1) The function will submit a request to read the data off the disc directly
     into the page specified.

 (2) The function will return 0.

 (3) When the read is complete, end_io_func() will be invoked with:

     (*) The netfs data supplied when the cookie was created.

     (*) The page descriptor.

     (*) The data passed to the above function.

     (*) An argument that's 0 on success or negative for an error.

     If an error occurs, it should be assumed that the page contains no usable
     data.

Otherwise, if there's not a copy available on disc:

 (1) A block may be allocated in the cache and attached to the inode at the
     appropriate place.

 (2) The validity journal will be marked to indicate this page does not yet
     contain valid data.

 (3) The function will return -ENODATA.
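
The three outcomes above suggest how a netfs might react to the return value
of cachefs_read_or_alloc_page(). The sketch below is one possible reading, not
code from any netfs: the enum and function names are invented, and treating
any other negative value like -ENOBUFS is an assumption, since the text does
not list further error codes.

```c
#include <assert.h>
#include <errno.h>

enum example_read_action {
	EXAMPLE_WAIT_FOR_CACHE,	 /* read submitted; end_io_func() will run */
	EXAMPLE_FETCH_AND_STORE, /* fetch from server, then write to cache */
	EXAMPLE_FETCH_ONLY,	 /* no cache backing; just use the server */
};

/* Map a cachefs_read_or_alloc_page() return value to a netfs action. */
static enum example_read_action example_classify_read(int ret)
{
	assert(ret <= 0);
	switch (ret) {
	case 0:
		return EXAMPLE_WAIT_FOR_CACHE;
	case -ENODATA:
		return EXAMPLE_FETCH_AND_STORE;
	case -ENOBUFS:
	default:	/* assumption: handle unknown errors as "not cached" */
		return EXAMPLE_FETCH_ONLY;
	}
}
```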


Secondly, if the netfs changes the contents of the page (either due to an
initial download or if a user performs a write), then the page should be
written back to the cache:

	int cachefs_write_page(struct cachefs_cookie *cookie,
			       struct page *page,
			       cachefs_rw_complete_t end_io_func,
			       void *end_io_data,
			       unsigned long gfp);

The cookie argument must specify a data file cookie, the page specified should
contain the data to be written (and is also used to specify the page number),
and the gfp argument is used to control how any memory allocations made are
satisfied.

If the cookie indicates the inode is not cached then:

 (1) The function will return -ENOBUFS.

Else if there's a block allocated on disc to hold this page:

 (1) The function will submit a request to write the data to the disc directly
     from the page specified.

 (2) The function will return 0.

 (3) When the write is complete:

     (a) Any associated validity journal entry will be cleared (the block now
	 contains valid data as far as CacheFS is concerned).

     (b) end_io_func() will be invoked with:

	 (*) The netfs data supplied when the cookie was created.

	 (*) The page descriptor.

	 (*) The data passed to the above function.

	 (*) An argument that's 0 on success or negative for an error.

	 If an error happens, it can be assumed that the page has been
	 discarded from the cache.
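
A completion routine of the cachefs_rw_complete_t shape might look something
like the sketch below. The bookkeeping structure passed as end_io_data is
hypothetical, and struct page is left opaque so that the example builds
outside the kernel.

```c
#include <assert.h>
#include <stddef.h>

struct page;	/* opaque here; only a pointer is passed through */

/* Hypothetical per-request record handed over as end_io_data. */
struct example_io_result {
	int done;	/* set once the I/O has finished */
	int error;	/* 0 on success, negative on failure */
};

/* Has the same parameter list as cachefs_rw_complete_t. */
static void example_io_complete(void *cookie_data, struct page *page,
				void *end_io_data, int error)
{
	struct example_io_result *res = end_io_data;

	(void)cookie_data;	/* netfs data given at cookie acquisition */
	(void)page;
	assert(res != NULL);
	res->error = error;
	res->done = 1;
}
```

On a negative error the netfs would then treat the page as absent from the
cache, in line with the notes above.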


PAGE UNCACHING
--------------

To uncache a page, this function should be called:

	void cachefs_uncache_page(struct cachefs_cookie *cookie,
				  struct page *page);

This detaches the page specified from the data file indicated by the cookie and
unbinds it from the underlying block.

Note that pages can't be explicitly detached from a data file. The whole
data file must be retired (see the relinquish cookie function below).

Furthermore, note that this does not cancel the asynchronous read or write
operation started by the read/alloc and write functions.


INDEX AND DATA FILE UPDATE
--------------------------

To request an update of the index data for an index or data file, the following
function should be called:

	void cachefs_update_cookie(struct cachefs_cookie *cookie);

This function will refer back to the netfs_data pointer stored in the cookie by
the acquisition function to obtain the data to write into each revised index
entry. The update method in the parent index definition will be called to
transfer the data.


INDEX AND DATA FILE UNREGISTRATION
----------------------------------

To get rid of a cookie, this function should be called.

	void cachefs_relinquish_cookie(struct cachefs_cookie *cookie,
				       int retire);

If retire is non-zero, then the index or file will be marked for recycling, and
all copies of it will be removed from all active caches in which it is present.

If retire is zero, then the inode may be available again the next time the
acquisition function is called.

One very important note - relinquish must NOT be called unless all "child"
indexes, files and pages have been relinquished first.


PAGE TOKEN MANAGEMENT
---------------------

As previously mentioned, the netfs must keep a token associated with each page
currently actively backed by the cache. This is used by CacheFS to go from a
page to the internal representation of the underlying block and back again. It
is particularly important for managing the withdrawal of a cache whilst it is
in active service (eg: it got unmounted).

The token is this:

	struct cachefs_page {
		...
	};

Note that all fields are for internal CacheFS use only.

The token only needs to be allocated when CacheFS asks for it. This it will do
by calling the get_page_cookie() method in the netfs definition ops table. Once
allocated, the same token should be presented every time the method is called
again for a particular page.

The token should be retained by the netfs, and should be deleted only after the
page has been uncached.

One way to achieve this is to attach the token to page->private (and set the
PG_private bit on the page) once allocated. Shortcut routines are provided by
CacheFS to do this. Firstly, to retrieve if present and allocate if not:

	struct cachefs_page *cachefs_page_get_private(struct page *page,
						      unsigned gfp);

Secondly to retrieve if present and BUG if not:

	static inline
	struct cachefs_page *cachefs_page_grab_private(struct page *page);
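
The "retrieve if present, allocate if not" behaviour, and the requirement that
the same token be presented on every call for a given page, can be modelled in
userspace as follows. This is not the CacheFS implementation: struct page and
struct cachefs_page here are minimal stand-ins, has_private plays the part of
the PG_private bit, and both function names are invented.

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace stand-ins; the real structures live in the kernel. */
struct cachefs_page { int unused; };	/* real fields are CacheFS-internal */
struct page {
	unsigned long private;
	int has_private;		/* stands in for PG_private */
};

/* Return the page's token, allocating one on first use.  Returns NULL if
 * no memory is available. */
static struct cachefs_page *example_get_page_token(struct page *page)
{
	struct cachefs_page *token;

	assert(page != NULL);
	if (page->has_private)
		return (struct cachefs_page *)page->private;

	token = calloc(1, sizeof(*token));
	if (!token)
		return NULL;
	page->private = (unsigned long)token;
	page->has_private = 1;
	return token;
}

/* Drop the token; only to be done after the page has been uncached. */
static void example_put_page_token(struct page *page)
{
	if (page->has_private) {
		free((void *)page->private);
		page->private = 0;
		page->has_private = 0;
	}
}
```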

To clean up the tokens, the netfs inode hosting the page should be provided
with address space operations that circumvent the buffer-head operations for a
page. For instance:

	struct address_space_operations afs_fs_aops = {
		...
		.sync_page	= block_sync_page,
		.set_page_dirty	= __set_page_dirty_nobuffers,
		.releasepage	= afs_file_releasepage,
		.invalidatepage	= afs_file_invalidatepage,
	};

	static int afs_file_invalidatepage(struct page *page,
					   unsigned long offset)
	{
		struct afs_vnode *vnode = AFS_FS_I(page->mapping->host);

		BUG_ON(!PageLocked(page));
		if (!PagePrivate(page))
			return 1;
		cachefs_uncache_page(vnode->cache, page);
		if (offset == 0)
			return 1;
		BUG_ON(!PageLocked(page));
		if (PageWriteback(page))
			return 0;
		return page->mapping->a_ops->releasepage(page, 0);
	}

	static int afs_file_releasepage(struct page *page, int gfp_flags)
	{
		struct cachefs_page *token;
		struct afs_vnode *vnode = AFS_FS_I(page->mapping->host);

		if (PagePrivate(page)) {
			cachefs_uncache_page(vnode->cache, page);
			token = (struct cachefs_page *) page->private;
			page->private = 0;
			ClearPagePrivate(page);
			if (token)
				kfree(token);
		}
		return 0;
	}


INDEX AND DATA FILE INVALIDATION
--------------------------------

There is no direct way to invalidate an index subtree or a data file. To do
this, the caller should relinquish and retire the cookie they have, and then
acquire a new one.