
Kernel development

Brief items

Kernel release status

The current development kernel remains 2.6.34-rc1; no new prepatches have been released over the last week.

Stable updates: 2.6.32.10 and 2.6.33.1 were released on March 15. They are both massive, with 145 and 123 patches, respectively.


Quotes of the week

May be I should start to stick posters with photos of modules entitled "I want to believe" everywhere in my flat. Or perhaps I'm going to buy electronic glasses that display modules advertizing in the street. I'm not sure yet but I'll find a way.
-- Frederic Weisbecker

I thought everyone learned the lesson behind SystemTap's failure (and to a certain degree this was behind Oprofile's failure as well): when it comes to tooling/instrumentation we dont want to concentrate on the fancy complex setups and abstract requirements drawn up by CIOs, as development isnt being done there. Concentrate on our developers today, and provide no-compromises usability to those who contribute stuff.

If we dont help make the simplest (and most common) use-case convenient then we are failing on a fundamental level.

-- Ingo Molnar

Jan suggests that we not surprise users by having delalloc enabled when ext3 is mounted with the ext4 driver. However there are other behavior differences as well, mballoc behavior comes to mind at least. What about the 32000 subdir limit? If we go back to ext3 is it ok with the subsecond timestamps and creation time etc? Maybe so... have we tested any of this?

At what point do we include the phase of the moon as worth considering when describing ext4.ko behavior?

-- Eric Sandeen


After the merge window closed...

By Jonathan Corbet
March 16, 2010
Toward the end of the 2.6.33 development cycle, Linus suggested that he might make the next merge window a little shorter than usual. And, indeed, 2.6.34-rc1 came out on March 8, twelve days after the 2.6.33 release. A number of trees got caught out in the cold as a result of that change, and that appears to be a result that suits Linus just fine.

That said, some trees have been pulled after the -rc1 release. These include the trivial tree, with the usual load of spelling fixes and other small changes. There was a large set of ARM changes, including support for a number of new boards and devices. The memory usage controller got a new threshold feature allowing for finer-grained control of (and information about) memory usage. And so on; all told, nearly 1,000 changes have been merged (as of this writing) since the 2.6.34-rc1 release.

When the final SCSI pull request came along, though, Linus found his moment to draw a line in the sand. Linus, it seems, is getting a little tired of what he sees as last-minute behavior from some subsystem maintainers:

I've told people before. The merge window is for _merging_, not for doing development. If you send me something the last day, then there is no "window" any more. And it is _really_ annoying to have fifty pull requests on the last day. I'm not going to take it any more.

So, Linus says, he plans to be even more unpredictable in the future. Evidently determinism in this part of the process leads to behavior he doesn't like, so, in the future, developers won't really be able to know how long the merge window will be. In such an environment, most subsystem maintainers will end up working as if the merge window had been reduced to a single week - an idea which had been discussed and rejected at the 2009 Kernel Summit.


Big reader locks

By Jonathan Corbet
March 16, 2010
Nick Piggin's VFS scalability patches have been a work in progress for some time - as is often the case for this sort of low-level, performance-oriented work. Recently, Nick has begun to break the patch set into smaller pieces, each of which solves one part of the problem and each of which can be considered independently. One of those pieces introduces an interesting new mutual exclusion mechanism called the big reader lock, or "brlock."

Readers of the patch can be forgiven for wondering what is going on; anything which combines tricky locking and 30-line preprocessor macros is going to raise eyebrows. But the core concept here is simple: a brlock tries to make read-only locking as fast as possible through the creation of a per-CPU array of spinlocks. Whenever a CPU needs to acquire the lock for read-only access, it takes its own dedicated lock. So read-locking is entirely CPU-local, involving no cache line bouncing. Since contention for a per-CPU spinlock should really be zero, this lock will be fast.

Life gets a little uglier when the lock must be acquired for write access. In short: the unlucky CPU must go through the entire array, acquiring every CPU's spinlock. So, on a 64-processor system, 64 locks must be acquired. That will not be fast, even if none of the locks are contended. So this kind of lock should be used rarely, and only in cases where read-only use predominates by a large margin.

One such case - the target for this new lock - is vfsmount_lock, which is required (for read access) in pathname lookup operations. Lookups are frequent events, and are clearly performance-critical. On the other hand, write access is only needed when filesystems are being mounted or unmounted - a much rarer occurrence. So a brlock is a good fit here, and one small piece (out of many) of the VFS scalability puzzle has been put into place.
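
The locking scheme described above can be modelled in a few lines of user-space C. This is only a sketch of the idea, not Nick's patch: the names (brlock_sketch, NR_SLOTS, slot_lock() and friends) are made up for illustration, C11 atomic flags stand in for kernel spinlocks, and a real per-CPU implementation would also keep each lock in its own cache line.

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_SLOTS 4	/* stand-in for the number of CPUs (assumption) */

/* Hypothetical model of a brlock: one spinlock per CPU slot. */
struct brlock_sketch {
	atomic_flag slot[NR_SLOTS];
};

void brlock_sketch_init(struct brlock_sketch *l)
{
	for (int i = 0; i < NR_SLOTS; i++)
		atomic_flag_clear(&l->slot[i]);
}

/* Spin until the flag has been set by this caller. */
void slot_lock(atomic_flag *f)
{
	while (atomic_flag_test_and_set_explicit(f, memory_order_acquire))
		;
}

void slot_unlock(atomic_flag *f)
{
	atomic_flag_clear_explicit(f, memory_order_release);
}

/* Reader: touch only the lock belonging to the caller's own slot. */
void brlock_sketch_read_lock(struct brlock_sketch *l, int cpu)
{
	assert(cpu >= 0 && cpu < NR_SLOTS);
	slot_lock(&l->slot[cpu]);
}

void brlock_sketch_read_unlock(struct brlock_sketch *l, int cpu)
{
	assert(cpu >= 0 && cpu < NR_SLOTS);
	slot_unlock(&l->slot[cpu]);
}

/* Writer: take every slot's lock, always in ascending order. */
void brlock_sketch_write_lock(struct brlock_sketch *l)
{
	for (int i = 0; i < NR_SLOTS; i++)
		slot_lock(&l->slot[i]);
}

void brlock_sketch_write_unlock(struct brlock_sketch *l)
{
	for (int i = NR_SLOTS - 1; i >= 0; i--)
		slot_unlock(&l->slot[i]);
}
```

The asymmetry is visible in the loops: the read side performs one uncontended operation on local data, while the write side costs NR_SLOTS acquisitions and shuts out every reader until it is done.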


Kernel development news

Who let the hogs out?

By Jonathan Corbet
March 16, 2010
As a normal rule of business, the kernel tries to avoid using more system resources than are absolutely necessary; system time is better spent running user-space programs. So Tejun Heo's cpuhog patch may come across as a little surprising; it creates a mechanism by which the kernel can monopolize one or more CPUs with high-priority processes doing nothing. But there is a good reason behind this patch set; it should even improve performance in some situations.

Suppose you wanted to take over one or more CPUs on the system. The first step is to establish a hog function:

    #include <linux/cpuhog.h>

    typedef int (*cpuhog_fn_t)(void *arg);

When hog time comes, this function will be called at the highest possible priority. If the intent is truly to hog the CPU, the function should probably spin in a tight loop. But one should take care to ensure that this loop will end at some point; one does not normally want to take the CPU out of commission permanently.

The monopolization of processors is done with any of:

    int hog_one_cpu(unsigned int cpu, cpuhog_fn_t fn, void *arg);
    void hog_one_cpu_nowait(unsigned int cpu, cpuhog_fn_t fn, void *arg,
			    struct cpuhog_work *work_buf);
    int hog_cpus(const struct cpumask *cpumask, cpuhog_fn_t fn, void *arg);
    int try_hog_cpus(const struct cpumask *cpumask, cpuhog_fn_t fn, void *arg);

A call to hog_one_cpu() will cause the given fn() to be run on cpu in full hog mode; the calling process will wait until fn() returns, at which point the return value from fn() will be passed back. Should there be other useful work to do (on a different CPU, one assumes), hog_one_cpu_nowait() can be called instead; it will return immediately, while fn() may still be running. The work_buf structure must be allocated by the caller and be unused, but the caller need not worry about it beyond that.

Sometimes, total control over one CPU is not enough; in that case, hog_cpus() can be called to run fn() simultaneously on all CPUs indicated by cpumask. The try_hog_cpus() variant is similar, but, unlike hog_cpus(), it will not wait if somebody else got in and started hogging CPUs first.

So what might one use this mechanism for? One possibility is stop_machine(), which is called to ensure that absolutely nothing of interest is happening anywhere in the system for a while. Calls to stop_machine() usually happen when fundamental changes are being made to the system - examples include the insertion of dynamic probes, loading of kernel modules, or the removal of CPUs. It has always worked in the same way as the CPU hog functions do - by running a high-priority thread on each processor.

The new stop_machine() implementation, naturally, uses hog_cpus(). Unlike the previous implementation, though (which used workqueues), the new code takes advantage of the CPU hog threads which already exist. That eliminates a performance bug reported by Dimitri Sivanich, whereby the amount of time required to boot a system would be doubled by the extra overhead of various stop_machine() calls.

Another use for this facility is to force all CPUs to quickly go through the scheduler; that can be useful if the system wants to force a transition to a new read-copy-update grace period. Formerly, this task was bundled into the migration thread, which already runs on each CPU, in a bit of an awkward way; now it's a straightforward CPU hog call.

The migration thread itself is also a user of the single-CPU hogging function. This thread comes into play when the system wants to migrate a process which is running on a given CPU. The first thing that needs to happen is to force that process out of the CPU - a job for which the CPU hog is well suited. Once the hog has taken over the CPU, the just-displaced process can be moved to its new home.

The end result is the removal of a fair amount of code, a cleaned-up migration thread implementation, and improved performance in stop_machine(). Some concerns were raised that passing a blocking function as a CPU hog could create problems in some situations. But blocking in a CPU hog seems like an inherently contradictory thing to do; one assumes that the usual response will be "don't do that". And, in fact, version 2 of the patch disallows sleeping in hog functions. Of course, the "don't do that" response will also apply to most uses of CPU hogs in general; taking over processors in the kernel is still considered to be an antisocial thing to do most of the time.


Huge pages part 4: benchmarking with huge pages

March 17, 2010

This article was contributed by Mel Gorman

[Editor's note: this is part 4 of Mel Gorman's series on support for huge pages in Linux. Parts 1, 2, and 3 are available for those who have not read them yet.]

In this installment, a small number of benchmarks are configured to use huge pages - STREAM, sysbench, SpecCPU 2006 and SpecJVM. In doing so, we show that utilising huge pages is a lot easier than in the past. In all cases, there is a heavy reliance on hugeadm to simplify the machine configuration and hugectl to configure libhugetlbfs.

STREAM is a memory-intensive benchmark and, while its reference pattern has poor spatial and temporal locality, it can benefit from reduced TLB references. Sysbench is a simple OnLine Transaction Processing (OLTP) benchmark that can use Oracle, MySQL, or PostgreSQL as database backends. While there are better OLTP benchmarks out there, Sysbench is very simple to set up and reasonable for illustration. SpecCPU 2006 is a computational benchmark of interest to high-performance computing (HPC) and SpecJVM benchmarks basic classes of Java applications.

1 Machine Configuration

The machine used for this study is a Terrasoft Powerstation described in the table below.

    Architecture          PPC64
    CPU                   PPC970MP with altivec
    CPU Frequency         2.5GHz
    # Physical CPUs       2 (4 cores)
    L1 Cache per core     32K Data, 64K Instruction
    L2 Cache per core     1024K Unified
    L3 Cache per socket   N/A
    Main Memory           8 GB
    Mainboard             Machine model specific
    Superpage Size        16MB
    Machine Model         Terrasoft Powerstation

Configuring the system for use with huge pages was a simple matter of performing the following commands.

    $ hugeadm --create-global-mounts
    $ hugeadm --pool-pages-max DEFAULT:8G 
    $ hugeadm --set-recommended-min_free_kbytes
    $ hugeadm --set-recommended-shmmax
    $ hugeadm --pool-pages-min DEFAULT:2048MB
    $ hugeadm --pool-pages-max DEFAULT:8192MB

2 STREAM

STREAM [mccalpin07] is a synthetic memory bandwidth benchmark that measures the performance of four long vector operations: Copy, Scale, Add, and Triad. It can be used to calculate the number of floating point operations that can be performed during the time for the “average” memory access. Simplistically, more bandwidth is better.

The C version of the benchmark was selected and used three statically allocated arrays for calculations. Modified versions of the benchmark using malloc() and get_hugepage_region() were found to have similar performance characteristics.

The benchmark has two parameters: N, the size of the array, and OFFSET, the number of elements padding the end of the array. A range of values for N was used to generate workloads between 128K and 3GB in size. For each size of N chosen, the benchmark was run 10 times and an average taken. The benchmark is sensitive to cache placement and optimal layout varies between architectures; where the standard deviation of 10 iterations exceeded 5% of the throughput, OFFSET was increased to add one cache-line of padding between the arrays and the benchmark for that value of N was rerun. High standard deviations were only observed when the total working set was around the size of the L1, L2 or all caches combined.
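
The rerun rule is easy to state in code. What follows is a minimal sketch rather than the harness actually used for these runs; the function names are invented, and it assumes that "standard deviation" means the sample standard deviation (n - 1 denominator) and that "the throughput" means the mean over the iterations.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n throughput samples. */
double mean_of(const double *x, size_t n)
{
	double sum = 0.0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		sum += x[i];
	return sum / (double)n;
}

/* Sample variance (n - 1 denominator); the choice is an assumption. */
double variance_of(const double *x, size_t n)
{
	double m = mean_of(x, n);
	double acc = 0.0;

	assert(n > 1);
	for (size_t i = 0; i < n; i++)
		acc += (x[i] - m) * (x[i] - m);
	return acc / (double)(n - 1);
}

/*
 * Returns nonzero when the standard deviation exceeds 5% of the mean,
 * meaning that OFFSET should be increased and this value of N rerun.
 */
int needs_rerun(const double *throughput, size_t n)
{
	double limit = 0.05 * mean_of(throughput, n);

	/* stddev > limit is equivalent to variance > limit squared */
	return variance_of(throughput, n) > limit * limit;
}
```

Comparing the variance against the squared limit avoids a call to sqrt(); both sides are non-negative, so the comparison is equivalent.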

The benchmark avoids data re-use, be it in registers or in the cache. Hence, benefits from huge pages would be due to fewer faults, a slight reduction in TLB misses as fewer TLB entries are needed for the working set and an increase in available cache as less translation information needs to be stored.

To use huge pages, the benchmark was first compiled with the libhugetlbfs ld wrapper to align the text and data sections to a huge page boundary [libhtlb09] such as in the following example.

   $ gcc -DN=1864135 -DOFFSET=0 -O2 -m64                     \
        -B /usr/share/libhugetlbfs -Wl,--hugetlbfs-align     \
        -Wl,--library-path=/usr/lib                          \
        -Wl,--library-path=/usr/lib64                        \
        -lhugetlbfs stream.c                                 \
        -o stream

   # Test launch of benchmark
   $ hugectl --text --data --no-preload ./stream	

[STREAM benchmark result]

This page contains plots showing the performance results for a range of sizes running on the test machine; one of them is reproduced here. Performance improvements range from 11.6% to 16.59% depending on the operation in use. Performance improvements would be typically lower for an X86 or X86-64 machine, likely in the 0% to 4% range.

3 SysBench

SysBench is an OnLine Transaction Processing (OLTP) benchmark representing a general class of workload where clients perform a sequence of operations whose end result must appear to be an indivisible operation. TPC-C is considered an industry standard for the evaluation of OLTP but requires significant capital investment and is extremely complex to set up. SysBench is a system performance benchmark comprising file I/O, scheduler, memory allocation, and threading tests; it also includes an OLTP benchmark. The setup requirements are less complicated and SysBench works for MySQL, PostgreSQL, and Oracle databases.

PostgreSQL was used for this experiment on the grounds that it uses a shared memory segment similar to Oracle, making it a meaningful comparison with a commercial database server. Sysbench 0.4.12 and Postgres 8.4.0 were built from source.

Postgres was configured to use a 756MB shared buffer and an effective cache of 150MB; a maximum of 6*NR_CPUs clients were allowed to connect. Note that the maximum number of clients allowed is greater than the number of clients used in the test. This is because a typical configuration would allow more connections than the expected number of clients to allow administrative processes to connect. The update_process_title parameter was turned off as a small optimisation. Options that checkpoint, fsync, log, or synchronise were turned off to avoid interference from I/O. The system was configured to allow the postgres user to use huge pages with shmget() as described in part 3. Postgres uses System V shared memory so pg_ctl was invoked as follows.

   $ hugectl --shm bin/pg_ctl -D `pwd`/data -l logfile start

For the test itself, the table size was 10 million rows, read-only to avoid I/O, and the test type was “complex”, which means each operation by the client is a database transaction. Tests were run varying the number of clients accessing the database from one to four times the number of CPU cores in the system. For each thread count, the test was run multiple times until at least five iterations completed with a confidence level of 99% that the estimated mean is within 2% of the true mean. In practice, the initial iteration gets discarded due to increased I/O and faults incurred during the first run.

[SysBench benchmark result]

The plot shows the performance results for different numbers of threads with performance improvements ranging in the 1%-3.5% mark. Unlike STREAM, the performance improvements would tend to be similar on X86 and X86-64 machines running this particular test configuration. The exact reasoning for this is beyond the scope of the article but it comes down to the fact that STREAM exhibits a very poor locality of reference, making cache behaviour a significant factor in the performance of the workload. As workloads would typically have a greater degree of reference locality than STREAM, the expectation would be that performance gains across different architectures would be similar.

4 SpecCPU 2006

SpecCPU 2006 v1.1 is a standardised CPU-intensive benchmark used in evaluations for HPC that also stresses the memory subsystem. A --reportable run was made comprising “test”, “train”, and three “ref” sets of input data. Three sets of runs compare base pages, huge pages backing just the heap, and huge pages backing text, data, and the heap. Only base tuning was used with no special compile options other than what was required to compile the tests.

To back the heap using huge pages, the tests were run with:

    hugectl --heap runspec ...

To also back the text and data, the SPEC configuration file was modified to build SPEC in the same way as the STREAM build described above; the --text --data --bss switches were then also specified to hugectl.

[SpecCPU benchmark result]

This plot shows the performance results running the integer SpecCPU test; the floating-point test results are available separately. As is clear, there are very large fluctuations depending on what the reference pattern of the workload was, but many of the improvements are quite significant, averaging around 13% for the Integer benchmarks and 7-8% for the floating-point operations. An interesting point to note is that for the Fortran applications, performance gains were similar whether text/data was backed or the heap. This heavily implies that the Fortran applications were using dynamic allocation. On older Fortran applications, relinking to back the text and data with huge pages may be required to see any performance gains.

5 SpecJVM (JVM/General)

Java is used in an increasing number of scenarios, including real time systems, and it dominates in the execution of business-logic related applications. Particularly within application servers, the Java Virtual Machine (JVM) uses large quantities of virtual address space that can benefit from being backed by huge pages. SpecJVM 2008 is a benchmark suite for Java Runtime Environments (JRE). According to the documentation, the intention is to reflect the performance of the processor and memory system with a low dependence on file or network I/O. Crucially for HPC, it includes SCIMark, which is a Java benchmark for scientific and numerical computing.

The 64-bit version of IBM Java Standard Edition Version 6 SP 3 was used, but support for huge pages is available in other JVMs. The JVM was configured to use a maximum of 756MB for the heap. Unlike the other benchmarks, the JVM is huge-page-aware and uses huge-page-backed shared memory segments when -Xlp is specified. An example invocation of the benchmark is as follows.

   $ java -Xlp -Xmx756m -jar SPECjvm2008.jar 120 300 --parseJvmArgs -i 1 --peak

[SpecJVM benchmark result]

This plot shows the performance results running the full range of SpecJVM tests. The results are interesting as they show performance gains were not universal, with the serial benchmark being spectacularly poor. Despite this, performance was improved on average by 4.43% with very minimal work required on behalf of the administrator.

6 Summary

In this installment, it was shown that with minimal amounts of additional work, huge pages can be easily used to improve benchmarks. For the database and JVM benchmarks, the same configurations could easily be applied to a real-world deployment rather than just to a benchmarking situation. For other benchmarks, the effort can be hidden with minimal use of initialisation scripts. Using huge pages on Linux in the past was a tricky affair but these examples show this is no longer the case.


A critical look at sysfs attribute values

March 17, 2010

This article was contributed by Neil Brown

One of the many memorable lines from Douglas Adams's famous work The Hitchhiker's Guide to the Galaxy was the accusation, probably leveled by supporters of the Encyclopedia Galactica, that the Hitchhiker's Guide was "unevenly edited" and "contains many passages which simply seemed to its editors like a good idea at the time." With small modifications, such as replacing "edited" with "reviewed", this description seems very relevant to the Linux kernel, and undoubtedly many other bodies of software, whether open or closed, free or proprietary. Review is at best "uneven".

It isn't hard to find complaints that the code in the Linux kernel isn't being reviewed enough, or that we need more reviewers. The creation of tags like "Reviewed-by" for patches was in part an attempt to address this by giving more credit to reviewers and thereby encouraging more people to get involved in that role.

However, one can equally well find complaints about too much review, where developers cannot make progress with some feature because, every time they post a revision, someone new complains about something else and so, in the pursuit of perfection, the good is lost. Similarly, though it does not seem to be a problem lately, there have been times when lots of review would simply result in complaints about white-space inconsistency and spelling mistakes -- things that are worth correcting, but not worth burying a valuable contribution under.

Finding the right topic, the right level, and the right forum for review is not easy (and finding the time can be even harder). This article doesn't propose to address those questions directly, but rather to present a sample of review - a particular topic at a particular level on a particular forum, in the hope that it will be useful. The topic chosen, largely because it is something that your author has needed to work with lately without completely understanding, is "sysfs", the virtual filesystem that provides access to some of the internals of the Linux kernel. And in particular, the attribute files that expose the fine detail of that access.

The level chosen is a high-level or holistic view, asking whether the implementation matches the goals, and at the same time asking whether the goals are appropriate. And the forum is clearly the present publication.

Sysfs and attribute files

Sysfs has an interesting history and a number of design goals, both of which are worth understanding, but neither of which will be examined here except inasmuch as they bear specifically on the chosen topic: attribute files. The key design goal relating to attribute files is the stipulation - almost a mantra - of "one file, one value" or sometimes "one item per file". The idea here is that each attribute file should contain precisely one value. If multiple values are needed, then multiple files should be used.

A significant part of the history behind this stipulation is the experience of "procfs" or /proc. /proc is a beautiful idea that unfortunately grew in an almost cancerous way to become widely despised. It is a virtual filesystem that originally had one directory for each process that was running, and that directory contained useful information about the running process in various files.

There is clearly more than just processes that could usefully be put in a virtual filesystem, and, with no clear reason to the contrary, things started being added to procfs. With no real design or structure, more and more information was shoe-horned into procfs until it became an unorganised mess. Even inside the per-process directories procfs isn't a pretty sight. Some files (e.g. limits) contain tables with column headers, others (e.g. mounts) have tables without headers, and still others (e.g. status) have rows labeled rather than columns. Some files have single values (e.g. wchan) while others have lots of assorted and inconsistently formatted values (e.g. mountstats).

Against this background of disorganisation and the attendant difficulty of adding new fields without breaking applications, sysfs was declared to have a new policy - one item per file. In fact, in his excellent (though now somewhat out-dated) article on the Driver Model Core, Greg Kroah-Hartman even asserted that this rule was "enforced" (see the side bar on "sysfs").

It would not be fair to hold Greg accountable to what could have been a throw-away line from years ago, and I don't wish to do that. However that comment serves well in providing a starting point and a focus for reviewing the usage of attribute files in sysfs. We can ask if the rule really is being enforced, whether the rule is sufficient to avoid past mistakes, and whether the rule even makes sense in all cases.

As you might guess, the answers will be "no", "no" and "no", but the explanation is far more enlightening than the answer.

Is it enforced?

The best way to test if the rule has been enforced is to survey the contents of sysfs - do files contain simple values, or something more? As a very rough assessment of the complexity of the contents of sysfs attribute files, we can issue a simple command:

 find /sys -mount -type f | xargs wc -w | grep -v ' total$'

to get a count of the number of words in each attribute file (the "-mount" is important if you have /sys/kernel/debug mounted, as reading things in there can cause problems).

Processing these results from your author's (Linux 2.6.32) notebook shows that of the 9254 files, 1189 are empty and 7168 have only one word. It seems reasonable to assume these represent only one value (though many of the empty files are probably write-only and this mechanism gives no information about what value or values can be written). This leaves 897 (nearly 10%) which need further examination. They range from two words (487 cases) to 297 words (one case).
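
The three-way split used here (empty, one word, more than one word) amounts to the following classification. This is a sketch with invented names, counting words the way "wc -w" does (runs of non-whitespace characters); it is not the script that produced the numbers above.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum attr_class { ATTR_EMPTY, ATTR_SINGLE, ATTR_MULTI };

/* Count whitespace-separated words in the first len bytes of buf. */
size_t count_words(const char *buf, size_t len)
{
	size_t words = 0;
	int in_word = 0;

	assert(buf != NULL || len == 0);
	for (size_t i = 0; i < len; i++) {
		if (isspace((unsigned char)buf[i])) {
			in_word = 0;
		} else if (!in_word) {
			in_word = 1;
			words++;
		}
	}
	return words;
}

/* Sort an attribute's contents into one of the three survey buckets. */
enum attr_class classify_attr(const char *buf, size_t len)
{
	size_t words = count_words(buf, len);

	if (words == 0)
		return ATTR_EMPTY;
	return words == 1 ? ATTR_SINGLE : ATTR_MULTI;
}
```

A string like "Dell Inc." lands in the ATTR_MULTI bucket even though it is a single value, which is why the multi-word files need a closer look rather than a verdict.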

While there are nearly 900 files, there are fewer than 100 base names. If we filter out some common patterns (e.g. gpe%X), the number of distinct attributes is closer to 62, which is a number that can reasonably be examined manually (with a little help from some scripting). Several of these multi-word attribute files contain non-ASCII data and so are almost certainly single values in some reasonable sense. Others contain strings for which a space is a legal character, such as "Dell Inc.", "i8042 KBD port" or "write back". So they clearly are not aberrations from the rule.

There is a small class of files where the single item stored in the file is of an enumerated type. It is common for the file in these cases to contain all of the possible values listed, which still seems to hold true to the "one item per file" rule. However there are three variations on this theme:

  • In some cases, such as the "queue/scheduler" attribute of a block device, or the "trigger" attribute of an LED device, all of the possible options are listed, and the currently active one is enclosed in brackets, thus:
       noop anticipatory deadline [cfq]
    

  • In the second variation there are two files, one which contains the list of possibilities, as with "cpufreq/scaling_available_governors" and one which contains the currently-selected value, "cpufreq/scaling_governor".

  • Finally, and this could be just a special case of one of the above, we have "/sys/power/state" for which there is no current value, so it just contains a list of the possible values.

These are all examples of attribute files that do clearly contain just one value or item, but happen to use multiple words in various ways to describe those values. They are false-positives of our simplistic tool for finding complex attribute values.
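
For the first variation, a program that wants the current setting has to pick the bracketed word out of the list. A small sketch of that step follows; the function name and return convention are invented for illustration, and attributes in the other two styles will simply report that no bracketed entry was found.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed entry of an enumerated attribute into out.
 * Returns 0 on success, or -1 if there is no "[...]" entry or if
 * out (of size outlen) is too small to hold it.
 */
int active_choice(const char *attr, char *out, size_t outlen)
{
	const char *open, *close;
	size_t n;

	assert(attr != NULL && out != NULL);
	open = strchr(attr, '[');
	close = open ? strchr(open, ']') : NULL;
	if (!close)
		return -1;

	n = (size_t)(close - open) - 1;	/* characters between the brackets */
	if (n + 1 > outlen)
		return -1;
	memcpy(out, open + 1, n);
	out[n] = '\0';
	return 0;
}
```

Given the scheduler line shown above, this yields "cfq"; given the contents of a file like /sys/power/state, it returns -1 because nothing there is marked as current.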

However there are other multi-word attribute files that are not so easily explained away. /sys/class/bluetooth contains some class attributes such as rfcomm, l2cap and sco. Each of these contains structured data, one record per line with 3 to 9 different datums per record (depending on the particular file), the first datum looking rather like the BD address of a local Bluetooth interface.

This appears to be a clear violation of the "one item per file" policy. The files do appear to be very well structured and so easy to parse, so it is tempting to think that they should be safe enough. However sysfs attribute files are limited in size to one page - typically 4KB. If the number of entries in these files ever gets too large (about 70 lines in the l2cap file), accesses to the file will start corrupting memory, or crashing. Hopefully that will never happen, but "hope" is not normally an acceptable basis for good engineering. From a conversation with the Bluetooth maintainer it appears that there are plans to move these files to "debugfs" where they can benefit from the "seq_file" implementation, also used widely in /proc, which allows arbitrarily large files.
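
The arithmetic behind the "about 70 lines" figure, and the kind of bounds check that keeps a multi-record attribute from running off the end of its page, can be sketched as follows. This is a user-space illustration only, not the Bluetooth code; SKETCH_PAGE_SIZE and append_line() are invented names, and the 58-byte record length used in the note below is an assumption chosen to match the estimate.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096	/* typical page size, as noted above */

/*
 * Append line to a page-sized buffer that already holds *used bytes.
 * Returns 0 on success, or -1 (leaving the buffer untouched) if the
 * line plus its terminating NUL would not fit within the page.
 */
int append_line(char *page, size_t *used, const char *line)
{
	size_t len;

	assert(page != NULL && used != NULL && line != NULL);
	len = strlen(line);
	if (*used + len + 1 > SKETCH_PAGE_SIZE)
		return -1;
	memcpy(page + *used, line, len + 1);	/* copies the NUL as well */
	*used += len;
	return 0;
}
```

With 58-byte records, 70 of them occupy 4060 bytes and fit; the 71st would need 4118 bytes and is refused. Code that formats records without such a check is exactly what runs past the end of the page.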

Some other examples include "/sys/devices/system/node/node0/meminfo" which appears to be a per-node version of "/proc/meminfo" and is clearly multiple values, and the "options" attributes in /sys/devices/pnp*/* which appear to contain exactly the sort of ad hoc formatting of multiple values of multiple types that people find so unacceptable in /proc. The pnp "resources" files are similarly multi-valued, though to a lesser extent.

As a final example of a lack of enforcement, the PCI device directory for the (Intel 3945) wireless network in this notebook contains a file called "statistics" which contains a hex dump of 240 bytes of data, complete with ASCII decoding at the end of each line such as:

    02 00 03 00 d9 05 00 00 28 03 00 00 45 02 00 00  ........(...E...
    0d 00 00 00 00 00 00 00 00 00 00 00 d6 00 00 00  ................
    b1 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 67 00 00 00 00 00 00 00  ........g.......

This is surely not the sort of thing that sysfs was intended to report. If anything, this looks like it should be a binary attribute, not a doubly-encoded ASCII file.

So to answer our opening question, "no", the one item per file rule is not enforced in any meaningful way. Certainly the vast majority of attribute files do contain just one item and that is good. But there are a number which contain multiple values in a variety of different ways. And this number is only likely to grow as people either copy the current bad examples, or find new use cases that don't seem to fit the existing patterns, so invent new approaches which don't take the holistic view into account.

Is the rule sufficient?

Our next question to ask is whether the stated rule for sysfs attributes is sufficient to avoid an increasingly unorganised and ad hoc sysfs following the unfortunate path of procfs. We have already seen at least one case where it isn't. We do not have a standardised way of representing an enumerated type in a sysfs attribute, and so we have at least two implementations as already mentioned. There is at least one more implementation (exposed in the "md/level" attribute of md/raid devices) where just the current value is visible and the various options are not. Having a standard here would be good for consistency and encourage optimal functionality. But we have no standard.

A similar issue arises with simple numerical values that represent measurable items such as storage size or time. It would be nice if these were reported using standard units, probably bytes and seconds. But we find that this is not the case. Amounts of storage are sometimes reported as bytes (/sys/devices/system/memory/block_size_bytes), sometimes as sectors (/sys/class/block/*/size), and sometimes as kilobytes (block/*/queue/read_ahead_kb).

As these particular examples show, one way to avoid ambiguity is to include the name of the units (bytes or kb here) as part of the attribute name, a practice known as Hungarian notation. However, this is far from uniformly applied; the examples given above are more the exception than the rule.
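
A small sketch shows how much a unit suffix helps a generic reader. It assumes that a sector is 512 bytes and that "kb" means 1024 bytes; the table and function names are invented for this example.

```python
# Assumed multipliers: sector counts are in 512-byte units, and "kb" is
# taken to mean 1024 bytes.
UNIT_BYTES = {"bytes": 1, "sectors": 512, "kb": 1024}


def unit_from_name(attribute_name):
    """Return the unit encoded as a name suffix, or None if there is none."""
    suffix = attribute_name.rsplit("_", 1)[-1]
    return suffix if suffix in UNIT_BYTES else None


def to_bytes(value, unit):
    """Convert a value in the given unit to bytes."""
    return value * UNIT_BYTES[unit]
```

With this, "read_ahead_kb" and "block_size_bytes" describe themselves, while "size" yields None and the reader must simply know that it counts sectors.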

Measures of duration face the same problem. Many of the durations that the kernel needs to know about are substantially less than one second. However, rather than use the tried-and-true decimal point notation for sub-unit values, some attribute files report in milliseconds (unload_heads in libata devices), some in microseconds (cpuidle/state*/time), and some are even in seconds (/sys/class/firmware/timeout). As an extra confusion there are some (.../bridge/hello_time) which use a unit that varies depending on architecture, from centiseconds to mibiseconds (if that is a valid name for the 1/1024th part of a second). It is probably fortunate that there is no metric/imperial difference in units for time, else we would probably find both of those represented too.
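
The cost of an architecture-dependent unit is easy to show with exact arithmetic. In this sketch the same raw number is interpreted at 100 and at 1024 ticks per second; the tick rates and function names are illustrative assumptions, not values read from any particular system.

```python
from fractions import Fraction


def ticks_to_seconds(ticks, ticks_per_second):
    """Convert a raw tick count to an exact number of seconds."""
    return Fraction(ticks, ticks_per_second)


def format_seconds(seconds, digits=6):
    """Render a duration in seconds using plain decimal point notation."""
    return f"{float(seconds):.{digits}f}"


# The same raw value means very different things under the two tick rates.
at_centiseconds = ticks_to_seconds(200, 100)    # exactly 2 seconds
at_1024_per_sec = ticks_to_seconds(200, 1024)   # a little under 0.2 seconds
```

Reporting seconds with a decimal point would make the attribute mean the same thing everywhere.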

And then there are truth values: On, on, 1, Off, off, 0.
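
A tolerant reader for the spellings listed above might look like the following sketch. The function name is hypothetical, and the accepted set is limited to the six forms just mentioned.

```python
_TRUE_WORDS = {"on", "1"}
_FALSE_WORDS = {"off", "0"}


def parse_switch(text):
    """Map On/on/1 to True and Off/off/0 to False; reject anything else."""
    word = text.strip().lower()
    if word in _TRUE_WORDS:
        return True
    if word in _FALSE_WORDS:
        return False
    raise ValueError(f"not a recognised truth value: {text!r}")
```

Every tool that reads such attributes needs some equivalent of this, which is exactly the duplication a single agreed spelling would avoid.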

So it would seem that the answer to our second question is "no" too, though it is harder to be positive about this as there is no clearly stated goal that we can measure against. If the goal is to have a high degree of uniformity in the representation of values in attributes, then we clearly don't meet that goal.

Does the requirement always make sense?

So the guiding principle of one item per file is not uniformly enforced, and it isn't really enough to avoid needless inconsistencies, but were it to be uniformly applied, would it really give us what we want, or is it too simplistic or too vague to be useful as a strict rule?

A good place to start exploring this question is the "capabilities/key" attribute of "input" devices. The content of this file is a bitmap listing which key-press events the input device can possibly generate. The bitmap is presented in hexadecimal with a space every 64 bits. Clearly this is a single value - a bitmap - but it is also an array of bits. Or maybe an array of "long"s. Does that make it multiple values in a single attribute?
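
The two readings of the bitmap can be seen side by side in a short sketch. It assumes 64-bit words written in hexadecimal with the most significant word first; both the word order and the function name are assumptions made for this example.

```python
def parse_bitmap(text, word_bits=64):
    """Return the set of bit numbers that are set in a spaced hex bitmap."""
    value = 0
    # Assumption: words are listed most significant first.
    for word in text.split():
        value = (value << word_bits) | int(word, 16)
    return {bit for bit in range(value.bit_length()) if (value >> bit) & 1}


# One value (a single large integer) or many (a set of event numbers)?
bits = parse_bitmap("10000 0 5")
```

Internally the parser builds one integer, yet what it returns to the caller is a collection, which is the ambiguity in miniature.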

While that is a trivial example which we surely would all accept as being a single value despite being many bits long, it isn't hard to find examples that aren't quite as clear cut. Every block device has an attribute called "inflight" which contains two numbers, the number of read requests that are in-flight (have been submitted, but not yet completed) and the number of write requests that are in-flight. Is this a single array, like the bitmap, or two separate values? There would have been little cost in implementing "inflight" as two separate attributes, thus clearly following the rule, but maybe there would have been little value either.

The "cpufreq/stats/time_in_state" attribute goes one step further. It contains pairs, one per line, of CPU frequencies (pleasingly in HZ) and the total time spent at that frequency (unfortunately in microseconds). Thus it is more of a dictionary than an array. On reflection, this is really the same as the previous two examples. For both "key" and "inflight" the key is an enumerated type that just happens to be mapped to a zero-based sequence of integers. So in each case we see a dictionary. In this last case the keys are explicit rather than implicit.
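
The dictionary view can be sketched for both attributes. The key names "read" and "write" are chosen here for illustration (the file itself carries no labels), and the sample numbers are made up.

```python
def parse_inflight(text):
    """Two positional numbers: the keys are implicit in the ordering."""
    reads, writes = (int(word) for word in text.split())
    return {"read": reads, "write": writes}


def parse_time_in_state(text):
    """One 'frequency time' pair per line: the keys are explicit."""
    table = {}
    for line in text.splitlines():
        if line.strip():
            frequency, time_spent = line.split()
            table[int(frequency)] = int(time_spent)
    return table


inflight = parse_inflight("3 1\n")
states = parse_time_in_state("2000000 1200\n1000000 340\n")
```

Both parsers return the same kind of object; the only difference is whether the file or the reader supplies the keys.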

If we contrast this last example with the "statistics" directory in any "net" device (net/*/statistics) we see that it is quite possible to put individual statistics in individual files. Were these 23 different values put into one file, one per line with labels, it is unlikely that anyone would accept that there was just one item in that file.
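
For comparison, reading a directory of single-value files produces the same kind of dictionary, with the file names serving as keys. This is a minimal sketch that assumes every regular file in the directory holds one decimal integer; the function name is hypothetical.

```python
from pathlib import Path


def read_statistics(directory):
    """Build {file name: integer value} from a directory of one-value files."""
    stats = {}
    for entry in sorted(Path(directory).iterdir()):
        if entry.is_file():
            stats[entry.name] = int(entry.read_text().strip())
    return stats
```

On a system with a network interface this could be pointed at that interface's "statistics" directory. The result has the same shape as a parsed multi-line file would have; only the packaging differs.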

So the question here is: where do we draw the line? In each of these 4 cases (capabilities/key, inflight, time_in_state, statistics) we have a 'dictionary' mapping from an enumerated type to a scalar value. In the first case the scalar value is a truth value represented by a single bit, in the others the scalar is an integer. The size of the dictionary ranges from 2 to 23 to several hundred for "capabilities/key". Is it rational to draw a line based on the size of the dictionary, or on the size of the value? Or should it be left to the developer - a direction that usually produces disastrous results for uniformity?

The implication of these explorations seems to be that we must allow structured data to be stored in attributes, as there is no clear line between structured and non-structured data. "One item per file" is a great heuristic that guides us well most of the time, but as we have seen there are numerous times where developers find that it is not suitable and so deviate from the rules with a disheartening lack of consistency.

It could even be that the firmly stated rule has a negative effect here. Faced with a strong belief that a collection of numbers really forms a single attribute, and the strongly stated rule that multi-valued attributes are not allowed, the path of least resistance is often to quietly implement a multi-valued attribute without telling anyone. There is a reasonable chance that such code will not get reviewed until it is too late to make a change. This can lead multiple developers to solve the same problem in different ways, thus exacerbating a problem that the rule was intended to avoid.

So to answer our third question, "no", the "one item per file" rule doesn't always make sense because it isn't always clear what "one item" is, and those places of uncertainty are holes for chaos to creep into our kernel.

Can we do better?

A review that finds problems without even suggesting a fix is a poor review indeed. The above identifies a number of problems; here we at least discuss solutions.

The problem of existing attributes that are inappropriately complex or inconsistent in their formatting does not permit a quick fix. We cannot just change the format. At best we could provide new ways to access the same information, and then deprecate the old attributes. It is often stated that once something enters the kernel-userspace interface (which includes all of sysfs) it cannot be changed. However the existence of CONFIG_SYSFS_DEPRECATED_V2 disproves this claim. A policy that permits and supports deprecation and removal of sysfs attributes on an on-going basis may cause some pain but would be of long-term benefit to the kernel, especially if we expect our grandchildren to continue developing Linux.

The problem that there is a clear need for structured data in sysfs attributes is probably best addressed by providing for it rather than ignoring or refuting it. Creating a format for representing arbitrarily structured data is not hard. Agreeing on one is much more of a challenge. XML has been enthusiastically suggested and vehemently opposed. Something more akin to the structure initialisations in C might be more pleasing to kernel developers (who already know C).

Your author is currently pondering how best to communicate a list of "known bad blocks" on devices in a RAID between kernel and userspace. sysfs is the obvious place to manage the data, but one file per block would be silly, and a single file listing all bad blocks would hit the one-page maximum at about 300-400 entries, which is many fewer than we want to support. Having support for structured sysfs attributes would help a lot here.
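
The 300-400 figure follows from simple arithmetic, sketched below. The 4096-byte page and the "sector length" text format for each entry are assumptions made for illustration; no such format has been settled.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def entry_length(sector, length):
    """Bytes used by one entry in a hypothetical 'sector length' line format."""
    return len(f"{sector} {length}\n")


def max_entries(bytes_per_entry, page_size=PAGE_SIZE):
    """How many fixed-size entries fit in a single page."""
    return page_size // bytes_per_entry


# Entries of 10 to 13 bytes bracket the range quoted above.
upper = max_entries(10)
lower = max_entries(13)
```

Larger sector numbers make each entry longer, so the limit falls as the bad blocks appear further into the device.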

The final problem is how to enforce whatever rules we do come up with. Even with a very simple rule that is easily and often repeated and is heard by many, knowing the rule is not enough to cause people to follow the rule. This we have just seen.

The implementation of sysfs attribute files allows each developer to provide an arbitrary text string which is then included in the sysfs file for them. This incredible flexibility is a great temptation to variety rather than uniformity. While it may not be possible to remove that implementation, it could be beneficial to make it a lot easier to build sysfs attributes of particular well supported types: for example duration, temperature, switch, enum, storage-size, brightness, dictionary, etc. We already have a pattern for this in that module parameters are much easier to define when they are of a particular type - as can be seen when exploring include/linux/moduleparam.h. The moduleparam implementation focuses more on basic types such as int, short, long etc. For sysfs we are more interested in higher level types, however the concept is the same.
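
The idea of typed attributes can be modelled in a few lines. The following is only a userspace sketch of the concept, not a proposed kernel interface (which would be written in C); every name in it is hypothetical, and the output conventions (decimal seconds, bytes, on/off, bracketed enum) are just one possible choice.

```python
def show_duration_us(microseconds):
    """Durations: seconds with a decimal point (non-negative input assumed)."""
    return f"{microseconds // 1_000_000}.{microseconds % 1_000_000:06d}\n"


def show_storage_bytes(size):
    """Storage sizes: always a plain count of bytes."""
    return f"{size}\n"


def show_switch(enabled):
    """Truth values: always spelled 'on' or 'off'."""
    return "on\n" if enabled else "off\n"


def show_enum(current, options):
    """Enumerations: list every option and bracket the current one."""
    if current not in options:
        raise ValueError(f"{current!r} is not one of {options!r}")
    return " ".join(f"[{o}]" if o == current else o for o in options) + "\n"


# A driver would pick a type by name instead of formatting its own string.
SHOW = {
    "duration": show_duration_us,
    "storage-size": show_storage_bytes,
    "switch": show_switch,
    "enum": show_enum,
}
```

Because all output passes through a handful of formatters, any attribute that bypasses them stands out immediately.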

If most of sysfs were converted over to using an interface that enforces standardised appearance, it would become fairly easy to find non-standard attributes and then either challenge them, or enhance the standard interface to support them.

In Closing

It must be said that hindsight gives much clearer vision than foresight. It is easy to see these issues in retrospect, but it would have been harder to guard against them from the start. While sysfs could possibly have had a better design, it could certainly have had a worse one. Creating imperfect solutions and then needing to fix them is an acknowledged part of the continuous development approach we use in the Linux kernel.

For entirely internal subsystems, we can and do fix things regularly without any concern for legacy support. For external interfaces, fixing things isn't as easy. We need to either carry unsightly baggage around indefinitely or work to remove that which doesn't work, and encourage the creation only of that which does. Is it wrong to dream that our grandchildren might work with a uniform and consistent /sys and maybe even a /proc which only contains processes?

Page editor: Jonathan Corbet
Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds