Kernel development [LWN.net]

Kernel release status

The 2.6.39 merge window remains open as of this writing; see the separate summary below for details on what has been merged over the last week.

Stable updates: The 2.6.32.34, 2.6.37.5, and 2.6.38.1 stable kernel updates were released on March 23; each contains a number of important fixes.

2.6.33.8 was released on March 21. 2.6.33 had gone out of maintenance, but Greg Kroah-Hartman has resumed creating updates because the realtime preemption patch set is stuck at this release.


Quotes of the week

I tend to view arch specific embedded code as rather like very dubious parties. What goes on in other peoples' house out of sight is none of my business.

The 8250 however is core code so it should keep its clothes on and behave in a manner befitting its status.

-- Alan Cox

I also believe that Greg spends lots of lonely nights looking into git commits, drinking his favorite Latte and cursing at all the git patches that fix a bug that didn't have a Cc: stable tag attached. He use to have a huge mane of hair on his head before taking over as stable maintainer.

-- Steven Rostedt explains the stable series

If it's some desperate cry for attention by somebody, I just wish those people would release their own sex tapes or something, rather than drag the Linux kernel into their sordid world.

-- Linus Torvalds

I do get the impression that you're extremely unhappy with the way ARM stuff works, and I've no real idea how to solve that. I think much of it is down to perception rather than anything tangible.

Maybe the only solution is for ARM to fork the kernel, which is something I really don't want to do - but from what I'm seeing its the only solution which could come close to making you happy.

-- Russell King

This was discussed before, and it was felt that perhaps 75000 lines of ocaml code was not really appropriate for the Linux source tree.

-- Julia Lawall


Report from the V4L2 Warsaw Brainstorming Meeting

A group of Video4Linux2 developers recently gathered in Warsaw to discuss a number of topics of interest to that subsystem; Hans Verkuil has posted a report from that gathering. Issues discussed include an API for compressed video formats, subdev hierarchies, cropping and composing, pipeline configuration, HDMI support, and more.


Converting strings to integers

By Jonathan Corbet
March 23, 2011

Kernel developers might rightly complain about being confused over which functions should be used to convert strings to integer types. Old functions like simple_strtoul() will silently ignore junk at the end of an integer value, so "100xx" successfully converts to an unsigned integer type. Alternatives like strict_strtoul() have been encouraged instead, but they have problems too, including the lack of overflow checks. So what's a kernel hacker to do?

As of 2.6.39, there is a new set of string-to-integer converters which is expected to be used in preference to all others.

Unsigned conversions can be done with any of kstrtoull(), kstrtoul(), kstrtouint(), kstrtou64(), kstrtou32(), kstrtou16(), or kstrtou8().
Conversions to signed integers can be done with kstrtoll(), kstrtol(), kstrtoint(), kstrtos64(), kstrtos32(), kstrtos16(), or kstrtos8().

All of these functions are marked __must_check, so callers are expected to check to ensure that the conversion happened successfully. The older functions are marked deprecated, and will eventually be removed. These new kstrto*() functions are now the Official Best Way To Convert Strings, so developers need wonder no longer.
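The strictness described above (no trailing junk, overflow detected, result always checked) can be illustrated outside the kernel. What follows is only a user-space sketch built on the C library's strtoul(); strict_parse_ulong() is a hypothetical name, and its behavior (a negative errno return, an optional leading plus sign, one tolerated trailing newline) is modeled on how kstrtoul() is understood to behave rather than copied from the kernel source.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical user-space analogue of kstrtoul(); not the kernel code.
 * Returns 0 and stores the value in *res on success, -EINVAL if the
 * string is not entirely a number, -ERANGE if the value overflows.
 */
int strict_parse_ulong(const char *s, unsigned int base, unsigned long *res)
{
    char *end;
    unsigned long val;

    if (s[0] == '+')            /* a single leading plus sign is allowed */
        s++;
    /* strtoul() would skip whitespace and accept '-'; refuse both here. */
    if (!isalnum((unsigned char)s[0]))
        return -EINVAL;

    errno = 0;
    val = strtoul(s, &end, (int)base);
    if (end == s)               /* no digits were consumed */
        return -EINVAL;
    if (errno == ERANGE)        /* value does not fit in unsigned long */
        return -ERANGE;
    if (*end == '\n')           /* tolerate one trailing newline */
        end++;
    if (*end != '\0')           /* anything else is junk: "100xx" fails */
        return -EINVAL;

    *res = val;
    return 0;
}
```

With this sketch, "100" and "100\n" both convert to 100, while "100xx" and "-1" are rejected with -EINVAL and a value too large for an unsigned long yields -ERANGE. A caller is expected to test the return value before using the result, which is the habit that __must_check enforces in the kernel.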


2.6.39 merge window, part 2

By Jonathan Corbet
March 23, 2011

As of this writing, some 5,500 non-merge changesets have been merged into the mainline since last week's 2.6.39 merge window summary. A wide-ranging set of new features, cleanups, and performance improvements has been added to the kernel. Some of the more significant user-visible changes include:

The ipset mechanism has been merged. Ipset allows the creation of groups of IP addresses, port numbers, and MAC addresses in a way which can be quickly matched in iptables rules.
The size of the initial congestion window in the TCP stack has been increased, a change which should lead to shorter latencies for the loading of web pages and other server-oriented tasks. See this article for details.
There is a new system call:
```
    int syncfs(int fd);
```
It behaves like sync() with the exception that only the filesystem containing fd will be flushed to persistent storage.
The USB core has gained support for USB 3.0 hubs.
The transcendent memory core has been added to the staging tree. Along with it came "zcache," a compressed in-memory caching mechanism.
There is a new "multi-queue priority scheduler" queueing discipline in the networking layer which enables the offloading of quality-of-service processing work to suitably capable hardware.
The CHOKe flow scheduler and the Stochastic Fair Blue scheduler have been added to the networking code.
RFC 4303 extended IPSEC sequence numbers are now supported.
Support for the UniCore 32-bit RISC architecture has been merged.
New drivers include:
- Processors and systems: VIA/WonderMedia VT8500/WM85xx System-on-Chips, IMX27 IPCAM boards, and MX51 Genesi Efika Smartbook systems.
- Block: Broadcom NetXtreme II FCoE controllers and Freescale MXS Multimedia Card interfaces.
- Graphics: Intel GMA500 controllers (2D acceleration only), USB-connected graphics devices, MXS LCD framebuffer devices, and LD9040 AMOLED panels.
- Input: Hyper-V virtualized mice, Roccat Kova[+] mouse devices, Roccat Arvo keyboards, Wolfson WM831x PMIC touchscreen controllers, Atmel AT42QT1070 touch sensor chips, and Texas Instruments TSC2005 touchscreen controllers.
- Networking: Texas Instruments WiLink7 bluetooth controllers (graduated from staging), Bosch C_CAN controllers, Faraday FTMAC100 10/100 Ethernet controllers, and the Xen "netback" back-end driver.
- Miscellaneous: Faraday FUSB300 USB peripheral controllers, OMAP USBHS host controllers, NVIDIA Tegra USB host controllers, Texas Instruments PRUSS-connected devices, MSM UARTs, Maxim MAX517/518/519 DACs, RealTek PCI-E card readers, Analog Devices ad7606, ad7606-6, and ad7606-4 analog to digital converters, Maxim MAX6639 temperature monitors, Maxim MAX8688, MAX16064, MAX34440 and MAX34441 hardware monitoring chips, Lineage compact power line power entry modules, PMBus-compliant hardware monitoring devices, Linear Technology LTC4151 high voltage I2C current and voltage monitors, Intel SCU watchdog devices, Ingenic jz4740 SoC hardware watchdogs, Xen watchdog devices, NVIDIA Tegra internal I2C controllers, Freescale i.MX28 I2C interfaces, MXS Application UART (AUART) ports, SuperH SPI controllers, Altera SPI controllers, OpenCores tiny SPI controllers, SMSC SCH5627 Super-I/O hardware monitoring chips, Texas Instruments ADS1015 12-bit 4-input ADC devices, Diolan U2C-12 USB adapters, SPEAr13XX PCIe controllers (in "gadget" mode), and Freescale MXS-based SoC i.MX23/28 DMA engines.
- Sound: Firewire-connected sound devices, Wolfson Micro WM8991 codecs, Cirrus CS4271 codecs, Freescale SGTL5000 codecs, TI tlv320aic32x4 codecs, Maxim MAX9850 codecs, and TerraTec 6fire DMX USB interfaces.
- Outgoing: A number of TTY drivers (epca, ip2, istallion, riscom8, serial167, specialix, stallion, generic_serial, rio, ser_a2232, sx, and vme_scc) have been moved to the staging tree in anticipation of removal in 2.6.41. The smbfs and autofs3 filesystems, which were moved to staging in 2.6.37, have now been moved out of the kernel entirely.
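Of the changes listed above, the new syncfs() system call is the easiest to try from user space. The sketch below invokes it through the generic syscall() interface, on the assumption that the installed kernel headers define SYS_syncfs; flush_filesystem_of() is a hypothetical helper name. C libraries that provide their own syncfs() wrapper can call that directly instead.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes syscall() in <unistd.h> on glibc */
#endif
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Flush only the filesystem that contains `path` to persistent storage.
 * Returns 0 on success and -1 on failure.
 */
int flush_filesystem_of(const char *path)
{
    int fd = open(path, O_RDONLY);
    int ret;

    if (fd < 0)
        return -1;
    /* Any descriptor on the target filesystem will do. */
    ret = (int)syscall(SYS_syncfs, fd);
    close(fd);
    return ret;
}
```

Calling flush_filesystem_of("/home") would, under these assumptions, write back dirty data for the filesystem mounted at /home without forcing a flush of every other mounted filesystem, as sync() would.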

Changes visible to kernel developers include:

After many years of work by a large number of developers, the big kernel lock has been removed from the kernel.
The dynamic debug mechanism has some new control flags allowing for control over whether the function name, line number, module name, and current thread ID are printed.
The kernel can export raw DMI table data via sysfs, making it available in user space without needing to go through /dev/mem.
Network drivers can now enable hardware support for receive flow steering via the new ndo_rx_flow_steer() method.
The "pstore" filesystem provides access to platform-specific persistent storage which can be used to carry information across reboots.
The EXTRA_CFLAGS and EXTRA_AFLAGS makefile variables have been replaced with ccflags-y, ccflags-m, asflags-y, and asflags-m.
kmem_cache_name(), which returned the name of a slab cache, has been removed from the kernel.
The SLUB memory allocator now has a lockless fast path for allocations, speeding performance considerably. "Sadly this does nothing for the slowpath which is where the main issues with performance in slub are but the best case performance rises significantly."
Kernel threads can be created on a specific NUMA node with the new kthread_create_on_node() function.
The new function delete_from_page_cache() does what its name implies; unlike remove_from_page_cache() (which has now been deleted), it also decrements the page's reference count. It thus more closely mirrors add_to_page_cache().
There is a whole new set of functions which are the preferred way to convert strings to integer values; see this article for details.
The new "hwspinlock" framework allows the implementation of synchronization primitives on systems where different cores are running different operating systems. See Documentation/hwspinlock.txt for more information.

If the usual two-weeks rule holds, the 2.6.39 merge window can be expected to close on March 28. Watch this space next week for a summary of the final changes merged for this development cycle.


The dynamic debugging interface

By Jonathan Corbet
March 22, 2011

The kernel's "dynamic debugging" interface saw some minor changes for 2.6.39. As it happens, LWN has never written about how dynamic debug works, so this seems like an opportune time to fill in the gap.

It can be nice to instrument kernel code with abundant print statements that illustrate what is going on inside. The problem, of course, is that those statements can generate vast amounts of output which is usually not of interest. These statements can be left commented out most of the time, but that leads to situations where an edit/rebuild/reboot cycle is needed to get the output. In response, many developers have created mechanisms which enable or disable specific print statements at run time. The dynamic debugging interface was added as a way of providing a uniform control interface for debugging output while avoiding cluttering the kernel with various hand-rolled alternatives.

Dynamic debug operates on print statements written with either of:

    pr_debug(char *format, ...);
    dev_dbg(struct device *dev, char *format, ...);

If the CONFIG_DYNAMIC_DEBUG option is not set, the above functions will be turned into normal printk() statements at the KERN_DEBUG level. If the option is enabled, though, the code sets aside a special descriptor for every call site, noting the module, function, and file names, along with the line number and format string. At system boot, all of these debug statements are turned off, so their output will not appear even if debug-level kernel messages are routed somewhere useful by the syslog daemon.
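The per-call-site descriptor idea can be sketched in ordinary user-space C. This is an illustration only: every name here (sketch_debug(), struct sketch_site, and so on) is hypothetical, and the sketch records each site in a list the first time it executes, which is an assumption made to keep the example portable rather than a description of how the kernel collects its descriptors.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* One record per debug call site; all sites start out disabled. */
struct sketch_site {
    const char *file;
    const char *function;
    const char *format;
    unsigned int line;
    int enabled;
    int registered;
    struct sketch_site *next;
};

static struct sketch_site *sketch_sites;    /* sites seen so far */
static unsigned int sketch_emitted;         /* messages actually printed */

static void sketch_register(struct sketch_site *site)
{
    site->registered = 1;
    site->next = sketch_sites;
    sketch_sites = site;
}

static void sketch_emit(const char *fmt, ...)
{
    va_list ap;

    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    sketch_emitted++;
}

/* Picks the format string out of the macro's argument list. */
#define SKETCH_FIRST(first, ...) first

#define sketch_debug(...)                                           \
    do {                                                            \
        static struct sketch_site site = {                          \
            __FILE__, __func__, SKETCH_FIRST(__VA_ARGS__, ""),      \
            __LINE__, 0, 0, NULL                                    \
        };                                                          \
        if (!site.registered)                                       \
            sketch_register(&site);                                 \
        if (site.enabled)                                           \
            sketch_emit(__VA_ARGS__);                               \
    } while (0)

/* Enable every known site in `function`; returns how many matched. */
int sketch_enable_function(const char *function)
{
    struct sketch_site *site;
    int matched = 0;

    for (site = sketch_sites; site; site = site->next) {
        if (strcmp(site->function, function) == 0) {
            site->enabled = 1;
            matched++;
        }
    }
    return matched;
}

/* A stand-in for driver code containing one debug statement. */
void probe_device(int id)
{
    sketch_debug("device %d detected\n", id);
}
```

The first call to probe_device() prints nothing, because its site is disabled; after sketch_enable_function("probe_device"), later calls print the message. The cost at a disabled site is a test of one flag.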

Turning on dynamic debug causes a new virtual file to appear at /sys/kernel/debug/dynamic_debug/control (modulo any individual preferences for the location of debugfs, naturally). Writing to that file will enable or disable specific debugging functions, as specified by a simple but flexible language.

As an example, drivers/char/tpm/tpm_nsc.c contains the following code at line 346:

    dev_dbg(&pdev->dev, "NSC TPM detected\n");

Turning on that specific line could be done with a line like:

    echo file tpm_nsc.c line 346 +p > .../dynamic_debug/control

(where the full path to debugfs has been replaced with "..."). As it happens, that dev_dbg() line does not stand alone - there is a long series of them providing information on the newly-detected device. One could enter a series of lines like the above to enable them all individually, but either of the following would also work:

    echo file tpm_nsc.c line 346-373 +p > .../dynamic_debug/control
    echo file tpm_nsc.c function init_nsc +p > .../dynamic_debug/control

Along with selection by file name, line number, and function name, the interface also allows "module name" to select a specific module, and "format fmt" to select any line whose format string contains "fmt". If more than one selector is given, all must match for a given statement to be enabled.

Commands to the control file must end with a "flags" operation telling the system what to do; "+p" turns on printk() output, while "-p" turns it off. There is also a set of flags (new for 2.6.39) controlling information added to each output line: "f" adds the function name, "l" adds the line number, "m" adds the module name, and "t" adds the thread ID. One can use "=" to set the full mask of flags to a specific value - "=plm" will enable printing with line numbers and module names while disabling thread ID and function output regardless of their prior setting. The only way to clear all of the flags is with "-pflmt".
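The selection and flag rules can be modeled in a few lines of C. The sketch below is not the kernel's parser: it assumes a hand-written table of call sites, and the names struct site, struct query, and apply_query() are hypothetical. It shows only the two rules stated above: every selector that is given must match, and the "+", "-", and "=" operators add, remove, or replace flags.

```c
#include <stddef.h>
#include <string.h>

enum {
    F_PRINT = 1 << 0,   /* p */
    F_FUNC = 1 << 1,    /* f */
    F_LINE = 1 << 2,    /* l */
    F_MODULE = 1 << 3,  /* m */
    F_TID = 1 << 4      /* t */
};

struct site {
    const char *module, *file, *function, *format;
    unsigned int line;
    unsigned int flags;
};

/* A NULL string or a zero first_line means "selector not given". */
struct query {
    const char *module, *file, *function, *format;
    unsigned int first_line, last_line;
};

static int site_matches(const struct site *s, const struct query *q)
{
    if (q->module && strcmp(q->module, s->module) != 0)
        return 0;
    if (q->file && strcmp(q->file, s->file) != 0)
        return 0;
    if (q->function && strcmp(q->function, s->function) != 0)
        return 0;
    if (q->format && strstr(s->format, q->format) == NULL)
        return 0;
    if (q->first_line && (s->line < q->first_line || s->line > q->last_line))
        return 0;
    return 1;           /* every selector that was given matched */
}

static int flag_mask(const char *letters, unsigned int *mask)
{
    *mask = 0;
    for (; *letters; letters++) {
        switch (*letters) {
        case 'p': *mask |= F_PRINT; break;
        case 'f': *mask |= F_FUNC; break;
        case 'l': *mask |= F_LINE; break;
        case 'm': *mask |= F_MODULE; break;
        case 't': *mask |= F_TID; break;
        default: return -1;
        }
    }
    return 0;
}

/*
 * Apply a flags operation such as "+p", "-pflmt", or "=plm" to every
 * matching site. Returns the number of sites changed, or -1 if the
 * operation string is malformed.
 */
int apply_query(struct site *sites, size_t n, const struct query *q,
                const char *flags_op)
{
    unsigned int mask;
    int matched = 0;
    size_t i;

    if (!strchr("+-=", flags_op[0]) || flags_op[0] == '\0' ||
        flag_mask(flags_op + 1, &mask) != 0)
        return -1;

    for (i = 0; i < n; i++) {
        if (!site_matches(&sites[i], q))
            continue;
        if (flags_op[0] == '+')
            sites[i].flags |= mask;
        else if (flags_op[0] == '-')
            sites[i].flags &= ~mask;
        else
            sites[i].flags = mask;
        matched++;
    }
    return matched;
}
```

In this model, a query with file "tpm_nsc.c" and function "init_nsc" combined with "=plm" leaves each matching site with exactly the print, line, and module flags set, whatever it held before; a later "-pflmt" clears them all again.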

Reading the control file will produce a list of all currently-enabled call sites.

Sometimes the interesting action happens before the system reaches a point where the control file can be accessed. Dynamic debug output can be turned on early in the boot process with the ddebug_query boot parameter.

More information on how to use this facility can be found in Documentation/dynamic-debug-howto.txt. Dynamic debug has been in the kernel since 2.6.30, but it is still common to see code submitted which contains its own, home-brewed mechanism for controlling debug output. Chances are that reviewers will ask for such mechanisms to be taken out before the code is merged. Given the flexibility and ease of use of the in-kernel implementation, it makes sense to use it from the beginning.


Persistent storage for a kernel's "dying breath"

By Jake Edge
March 23, 2011

When Linux systems crash, there are various ways to find out what went wrong, but generally those rely on writing to log files on disk. For some systems, disk may not be available, or trusted in the case of a crash, so a way to poke some data into a platform-specific place for use by a subsequent kernel boot would be useful. That's exactly what the pstore filesystem, which was just added during the current 2.6.39 merge window, will provide.

The idea for pstore came out of a conversation between Tony Luck and Thomas Gleixner at last year's Linux Plumbers Conference. Luck wanted to use the ACPI error record serialization table (ERST) to store crash information across a reboot. The ERST is a mechanism specified by the ACPI specification [PDF] (in section 17.4, page 519) that allows saving and retrieving hardware error information to and from a non-volatile location (like flash).

Rather than just doing something specific for the x86 architecture, he decided to create a more general framework so that other platforms could use whatever persistent storage they had available. It would be, as Luck put it, "a generic layer for persistent storage usable to pass tens or hundreds of kilobytes of data from the dying breath of a crashing kernel to its successor".

There have been a number of iterations of the code since Luck first posted it for comments back in November. After Alan Cox's suggestion, pstore moved from its original firmware driver with a sysfs interface to a more straightforward filesystem-based implementation.

The basic idea is that a platform can register the availability of a persistent storage location with a call to pstore_register() and pass a pointer to a struct pstore_info, which looks like:

    struct pstore_info {
        struct module   *owner;
        char            *name;
        struct mutex    buf_mutex;      /* serialize access to 'buf' */
        char            *buf;
        size_t          bufsize;
        size_t          (*read)(u64 *id, enum pstore_type_id *type,
                                struct timespec *time);
        u64             (*write)(enum pstore_type_id type, size_t size);
        int             (*erase)(u64 id);
    };

The platform driver needs to provide three I/O routines and a buffer. There is also a mutex present to protect against simultaneous access to the buffer. With that, pstore will implement a filesystem that can be accessed from the kernel—or from user space once it has been mounted. The underlying ERST storage is record oriented, and Luck posits that other platform storage areas will be also, so the I/O interface is record oriented as well.

In addition to the pstore framework, the ERST driver was modified to take advantage of pstore; that change was also merged, so there is an in-kernel user of pstore. The pstore_info buffer is allocated and managed by drivers/acpi/apei/erst.c, and is larger than the bufsize advertised to account for the record and section headers required by ERST. Users of the I/O interface either fill the buffer before calling pstore_info.write() or read the data from the buffer after a call to pstore_info.read().

Each item is stored with a type, either PSTORE_TYPE_DMESG for log messages (likely oops output), PSTORE_TYPE_MCE for hardware errors, or PSTORE_TYPE_UNKNOWN for other undefined types. When stored, each item gets a record ID associated with it, which gets returned from the pstore_info.write() call. That ID can then be used in read() and erase() operations, but it also appears in the filenames in the pstore filesystem.
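The record-oriented protocol (fill the shared buffer, call write() to get an ID, then use that ID with read() and erase()) can be mimicked in user space. The following is a mock with a small in-memory store, not kernel code: all sketch_* names are hypothetical, the owner and mutex fields are left out, and the rule that read() walks the records in order and returns 0 after the last one is an assumption made for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

enum sketch_type { SKETCH_TYPE_DMESG, SKETCH_TYPE_MCE, SKETCH_TYPE_UNKNOWN };

#define SKETCH_BUFSIZE 64
#define SKETCH_SLOTS   4

struct sketch_record {
    int used;
    uint64_t id;
    enum sketch_type type;
    struct timespec when;
    size_t size;
    char data[SKETCH_BUFSIZE];
};

static struct sketch_record slots[SKETCH_SLOTS];
static char shared_buf[SKETCH_BUFSIZE];     /* the buffer callers fill/read */
static uint64_t next_id = 1;
static size_t cursor;                       /* next slot for ram_read() */

/* Store what the caller placed in shared_buf; returns the ID, 0 on error. */
static uint64_t ram_write(enum sketch_type type, size_t size)
{
    size_t i;

    if (size == 0 || size > SKETCH_BUFSIZE)
        return 0;
    for (i = 0; i < SKETCH_SLOTS; i++) {
        if (slots[i].used)
            continue;
        slots[i].used = 1;
        slots[i].id = next_id++;
        slots[i].type = type;
        slots[i].size = size;
        slots[i].when.tv_sec = time(NULL);
        slots[i].when.tv_nsec = 0;
        memcpy(slots[i].data, shared_buf, size);
        return slots[i].id;
    }
    return 0;                               /* store is full */
}

/* Copy the next record into shared_buf; returns its size, 0 when done. */
static size_t ram_read(uint64_t *id, enum sketch_type *type,
                       struct timespec *when)
{
    while (cursor < SKETCH_SLOTS) {
        struct sketch_record *r = &slots[cursor++];

        if (!r->used)
            continue;
        *id = r->id;
        *type = r->type;
        *when = r->when;
        memcpy(shared_buf, r->data, r->size);
        return r->size;
    }
    cursor = 0;                             /* rewind for the next pass */
    return 0;
}

/* Free the record with the given ID; returns 0, or -1 if not found. */
static int ram_erase(uint64_t id)
{
    size_t i;

    for (i = 0; i < SKETCH_SLOTS; i++) {
        if (slots[i].used && slots[i].id == id) {
            slots[i].used = 0;
            return 0;
        }
    }
    return -1;
}

/* Cut-down counterpart of struct pstore_info for this mock. */
struct sketch_pstore_info {
    const char *name;
    char *buf;
    size_t bufsize;
    size_t (*read)(uint64_t *id, enum sketch_type *type,
                   struct timespec *when);
    uint64_t (*write)(enum sketch_type type, size_t size);
    int (*erase)(uint64_t id);
};

struct sketch_pstore_info sketch_ram_backend = {
    "ram", shared_buf, SKETCH_BUFSIZE, ram_read, ram_write, ram_erase
};
```

A caller would copy its data into sketch_ram_backend.buf, call write() with a type and length, and keep the returned ID; erasing that ID later frees the slot for the next crash record, which is the role rm plays in the real filesystem.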

The filesystem can be mounted using:

    # mount -t pstore - /dev/pstore

Files will appear there with names based on the type, name of the storage driver, and the id, so the first dmesg record for ERST would be /dev/pstore/dmesg-erst-1. The typical scenario would be for the filesystem to be mounted at boot time, then some user-space process would check for any files there, copy them to some more permanent place, and delete the files with rm. That will allow the storage facility driver to reclaim the space in order to be ready for other crashes or hardware errors.

By default, pstore will register a dump handler with kmsg_dump to write the last 10K bytes of data from the kernel log to the pstore device when there is a kernel oops or panic. The amount of data to store can be configured at mount time using the kmsg_bytes parameter.

Luck has also put out an RFC patch to disable dumping information into pstore for some kinds of kmsg_dump reasons (e.g. KMSG_DUMP_HALT or KMSG_DUMP_RESTART), but various other developers weren't so sure. Seiji Aguchi pointed to two use cases (1, 2) he has found for needing to store the tail of the kernel log messages in most of those cases. In addition, Artem Bityutskiy pointed out that having pstore decide which kmsg_dump reasons to handle "smells like policy in the kernel". Adding more options to control that behavior is certainly possible, but Luck seems to be of a mind to wait a bit before making any change.

There are other persistent storage methods for kernel log messages, notably drivers/mtd/mtdoops.c and drivers/char/ramoops.c. But those are targeted at the embedded space where NVRAM devices are prevalent or for platforms where RAM can be reserved that will not be cleared on a restart. Pstore is more flexible, as it can store more than just kernel logs, while the two *oops devices are wired into storing the output of kmsg_dump.

Now that pstore has been merged, others will likely start using it. David Miller has already indicated that he will use it for sparc64, where a region of memory can be set aside to persist across reboots. One would guess that other architectures that have hardware support for similar mechanisms will as well.


Patches and updates

- Greg KH: Linux 2.6.38.1
- Greg KH: Linux 2.6.37.5
- Greg KH: Linux 2.6.33.8
- Greg KH: Linux 2.6.32.34
- Heiko Schocher: powerpc, 52xx: add charon board support
- Torben Hohn: RFC genirq: Add IRQ timestamping
- Chad Talbott: cfq-iosched: Fair cross-group preemption
- Trinabh Gupta: [RFC PATCH V1 0/2] cpuidle: global registration of idle states with per-cpu statistics
- Paul Turner: CFS Bandwidth Control V5
- KAMEZAWA Hiroyuki: A forkbomb killer and mm tracking system
- Tejun Heo: ptrace,signal: Improve ptrace and job control interaction
- Tejun Heo: mutex: Apply adaptive spinning on mutex_trylock()
- Christian Dietrich: undertaker 1.1
- Dave Jones: checkpatch: introduce --nocs to disable CodingStyle warnings.
- Po-Yu Chuang: net: add Faraday FTGMAC100 Gigabit Ethernet driver
- adharmap@codeaurora.org: pm8921 core and subdevices
- Waldemar Rymarkiewicz: NFC: Driver for Inside Secure MicroRead NFC chip
- Nicholas A. Bellinger: tcm_loop: Add multi-fabric Linux/SCSI LLD fabric module
- Carl Vanderlip: video: msm: Adding Support for MDP3.1
- Oren Weil: char/mei: Intel MEI Driver
- Jamie Iles: Support for picoxcell OTP memory
- Theodore Ts'o: ext4 bigalloc patchset
- Miao Xie: btrfs: implement delayed inode items operation
- Arne Jansen: btrfs: scrub
- Dave Chinner: vfs: inode lock breakup
- Justin TerAvest: [PATCH v2 0/8] Provide cgroup isolation for buffered writes.
- Valerie Aurora: Union mounts version something or other
- Hidetoshi Seto: THP: page collapsing vs. poisoning
- Prasad Joshi: [RFC][PATCH v3 00/22] __vmalloc: Propagating GFP allocation flag inside __vmalloc()
- Stephen Wilson: enable writing to /proc/pid/mem
- Stephen Hemminger: iproute 2.6.38
- Greg KH: usbutils 002 release
- Tony Ibbs: RFC: KBUS messaging subsystem
- Jonathan Cameron: Introduce kstrtobool function
