
Kernel development

Brief items

Kernel release status

The 4.2 kernel was released on August 30; Linus noted that the one-week delay was perhaps not strictly necessary. "So judging by how little happened this week, it wouldn't have been a mistake to release 4.2 last week after all, but hey, there's certainly a few fixes here, and it's not like delaying 4.2 for a week should have caused any problems either." Headline features in this release include the security module stacking patches, the delay-gradient congestion-control algorithm, improvements to writeback management in control groups, a lot of important persistent-memory infrastructure, and more.

The 4.3 merge window is open; see the separate article below for a summary of what has been merged so far.

Stable updates: none have been released in the last week.


Quotes of the week

Getting a vendor to see that they have sound business reasons for giving back is a specialised skill usually practised over long time period and only successfully by a few people in the industry. It doesn't happen automatically and certainly not because you have a dummy spit over someone taking something you advertised as "for free" and having the temerity to expect not to have to pay you.
James Bottomley

There is a difference between killing bug classes and killing exploitation methods. Our kernel has way too few of the latter.
Kees Cook


Kernel development news

4.3 Merge window, part 1

By Jonathan Corbet
September 2, 2015
As of this writing, just over 4,000 non-merge changesets have been pulled into the mainline kernel repository for the 4.3 development cycle. This merge window, in other words, is just getting started. But that is enough to begin to show the shape of this development cycle: useful incremental changes, but not much, thus far, in the way of high-profile features.

The user-visible changes merged so far include:

  • The kernel now supports the attachment of BPF programs to uprobes, making more flexible tracing of user-space code possible. There is also a new libbpf library that is meant to ease the process of working with BPF scripts; its first user is the perf tool.

  • There is a new "PIDs" controller for the control-group subsystem; it enforces a limit on the number of processes contained within the group. This controller thus serves as a sort of defense against fork bombs and similar attacks. See Documentation/cgroups/pids.txt for details.

  • The perf tool has gained the ability to work with Intel processor trace streams.

  • The s390 architecture has gained "fake NUMA" support. This allows a large system to be configured into a set of emulated NUMA nodes, making it easier to partition workloads and, in some situations, improving performance.

  • The CONFIG_VM86 option provides access to the legacy 16-bit virtual 8086 mode on x86 systems. Its use has been in decline for years, there are no known recently released tools that need it, and it has recently been shown to have a number of unpleasant problems, some of which are security-related. In 4.3, this option will be renamed (to CONFIG_X86_LEGACY_VM86) and disabled by default. Hopefully nobody actually needs VM86 mode and it can be removed entirely in the near future.

  • New hardware support includes:

    • Industrial I/O: ROHM RPR0521 ambient-light and proximity sensors, Texas Instruments OPT3001 light sensors, and TXC PA12203001 light and proximity sensors.

    • Miscellaneous: Qualcomm coincell battery chargers, Qualcomm SMD based RPM regulators, UltraChip UC1611 LCD controllers, MediaTek MT6311 power-management ICs, MediaTek SPI controllers, MediaTek SCPSYS power domain controllers, MediaTek MT8173 CPU-frequency controllers, Netlogic XLP SPI controllers, Allwinner Security System cryptographic accelerators, Intel DH895xCC crypto accelerators, ARM PrimeCell PL172 multiport memory controllers, and NVIDIA Tegra124 CPU-frequency controllers.

    • MOST: The MOST specification is a standard for media networking aimed at the automotive industry. The 4.3 kernel will include (in the staging tree) a new MOST subsystem with support for network, sound, media drivers and more. See this document for some introductory information.

    • USB: Qualcomm APQ8016/MSM8916 USB transceiver controllers, Allwinner sun4i A10 musb DRC/OTG controllers, and NXP LPC18xx/43xx SoC USB OTG PHYs.

Changes visible to kernel developers include:

  • There is a new driver framework for nonvolatile memory devices (EEPROMs and the like); see Documentation/nvmem/nvmem.txt for some details.

  • DocBook comments for structures can now be split into multiple chunks within the structure, easing the process of documenting the fields of especially large structures. The HTML document generator can also now create internal cross-reference links automatically.

One pull request that has not yet been acted upon by Linus is Jan Kara's request deleting the ext3 filesystem, as was covered here in July. Linus is worried that the change will force ext3 users to upgrade their filesystems in non-backward-compatible ways, but, as Ted Ts'o explained, that should not happen. Your editor would hazard a guess that this removal will go through before the merge window closes.

If the normal schedule holds, that closure should happen on September 13. As usual, LWN will follow the commit stream and call out the most interesting changes as they happen.


Thread-level management in control groups

By Jonathan Corbet
September 1, 2015
Progress on the multi-year task of reworking the kernel's control group ("cgroup") mechanism might appear to have slowed down recently, but that work continues and, occasionally, surfaces on the mailing lists. Recently, cgroup maintainer Tejun Heo posted a proposal for the CPU-scheduler controller interface in the new cgroup subsystem; it changes a number of control knobs, makes time units consistent across the interface, and so on. This proposal generated quite a bit of discussion, but it wasn't the contents of the new interface that were controversial. Instead, it became clear that some users are not at all happy about a feature that is absent from this interface — and which may have to be restored before this work can go forward.

The new "unified hierarchy" cgroup interface was a topic of discussion at the 2013 Kernel Summit. At that gathering, Tejun stated his intent to make cgroups handle membership at the process level, but not below. If a process is made up of multiple threads, all of those threads will be placed into the same group as a unit. That is a change from the current cgroup implementation, which allows different threads to be placed in different groups. Eliminating that capability, Tejun said, makes the implementation much more straightforward and, in any case, most controllers only make sense at the process level anyway.

At the time, there were some expressions of concern from the gathered developers, not all of whom were convinced that losing the ability to split a process's threads across control groups would be acceptable. No definitive conclusion on the issue was reached at the time; further discussion was deferred until the code itself made an appearance. Two years later, that code is out but the worries have not gone away; scheduler developer Peter Zijlstra was quick to raise the issue again.

A few users of thread-level cgroup control surfaced in the ensuing discussion; the most vocal of them was Paul Turner of Google, who asserted that this ability is an important part of how systems are managed there. One use case mentioned was the division of a job into work and support threads. The threads doing the "real work" should get the bulk of the available CPU time, but an application will typically want to guarantee a minimum of time to the support threads as well. Putting the two types of threads into different control groups allows this policy to be implemented in a fairly straightforward way.

Tejun's response took a few different forms. One was to question the importance of this use case; he described it as "super duper fringe" at one point. He also suggested that the problem could be solved using process priorities, but Paul clearly stated that priorities are not a suitable solution to the problem, while cgroups are. It seems clear that a number of users beyond Google employ control groups in this manner now and they would not be happy if this ability were to be left out of the new cgroup interface. If nothing else, leaving it out would tend to inhibit movement away from the old interface which, in turn, would make that API's eventual removal an even more distant prospect.

The other significant point argued by Tejun is that the cgroup interface is not a good way for applications to manage their environment. It may work as a system-administration interface, he said, but application developers should be given a more programmatic, system-call-based interface. Such an interface would be more easily used by those developers, he said, and separating the administrative and application interfaces would help to prevent conflicts between the administrator and the application over thread-level management.

In this message Tejun briefly sketched out the "resource group" API that he has in mind. These groups could be created and managed within an application with a new set_resource() system call.

Finally, Tejun argued that, rather than using cgroups for grouping of threads, the kernel should just employ the normal process hierarchy for that purpose; the set_resource() system call follows that guideline. Additionally, new threads could be created with a special clone() flag that would cause them to be placed into a new resource group. The process hierarchy is already understood by application developers, he said, and can be manipulated with existing system calls. If developers use that hierarchy to partition their applications, they will have better results and the complexity of supporting thread-level cgroup membership can be avoided.

The API itself was not discussed much; the discussion was more about identifying the problem than nailing down the details of its solution. There appears to be some concern about moving away from the cgroup API for thread-level management, but developers could probably be convinced on that score if the new API looked good enough; the current API has few overt admirers, after all. There was some resistance to the idea of limiting grouping to the process hierarchy, though. It seems that a number of use cases involve moving threads from one control group to another, depending on just what a specific thread is doing at any given time. A grouping mechanism that was strictly based on the process hierarchy would not be able to move threads in that way.

The end of the discussion came when Ingo Molnar and Peter both indicated that they would block further work on the CPU cgroup controller until the problem of per-thread control had been resolved. The issue, they said, is fundamental to the design of the subsystem, and it is not reasonable to expect that a solution can be retrofitted in after this code is merged. Tejun has not, as of this writing, indicated how he intends to proceed, whether it be by allowing per-thread control-group membership or through a separate control API. Either way, further progress in this area cannot be expected until a solution to this particular problem is presented and accepted by the relevant maintainers.


Persistent memory, with and without page structures

By Jonathan Corbet
September 2, 2015
Persistent memory offers the prospect of large amounts (e.g. terabytes) of directly attached memory that retains its contents over a system reboot or power cycle. It also offers a number of interesting design problems with regard to how it should be managed; persistent memory looks a lot like ordinary memory, but it differs in a number of important ways. As a result, there has been a long discussion over how to deal with this memory and, in particular, whether the kernel should use page structures to describe it or not. As shown in some recent patch sets, the discussion continues to evolve, and it seems to be heading toward an interesting answer to the struct page question.

For those needing a quick recap, struct page is the kernel's fundamental memory-management data structure; one page structure exists for each page of memory present in the system. In current kernels, though, persistent memory does not have accompanying page structures for a simple reason: the amount of memory required to hold all of those structures looks prohibitive. Storing them in the persistent memory array itself is possible (and discussed further below), but page structures change frequently, making them a poor fit for persistent memory storage, which (1) tends to be slow for writes, and (2) will wear out more quickly if subjected to sustained frequent writes.
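To put numbers on the memory-consumption concern: with 64-byte page structures (a typical size on x86-64) describing 4KB pages, the bookkeeping overhead is about 1.6% of the array being described, which adds up quickly at terabyte scale. A back-of-the-envelope sketch (the structure size is an assumption for illustration):

```c
/* Rough cost of describing a memory range with page structures,
 * assuming 4KB pages and a 64-byte struct page (typical on x86-64). */
#define PAGE_SIZE            4096ULL
#define SIZEOF_STRUCT_PAGE   64ULL

unsigned long long page_array_bytes(unsigned long long mem_bytes)
{
    /* One struct page per page of memory. */
    return (mem_bytes / PAGE_SIZE) * SIZEOF_STRUCT_PAGE;
}
```

For a 1TB persistent-memory array, that works out to 16GB of page structures.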

As long as a persistent-memory array is treated like a disk drive, there is no need for page structures. But if persistent memory is to take part in DMA or direct I/O operations, it currently needs those structures; for that reason, such operations do not currently work on persistent memory. This problem is widely seen as needing a fix.

When we last looked at the discussion in May, there was a push toward using page-frame numbers (PFNs) as a replacement for page structures in various I/O paths. A PFN is easily derived from a page's physical address, so it is an easy and obvious way to refer to a physical page of memory — if the additional information stored in the page structure is not needed for any given operation. In May, though, it was becoming clear that this information cannot always be done without, and, thus, that this approach had its limitations, especially when it came to supporting direct I/O, which is the most scalable I/O mode that the kernel offers.

Using page-frame numbers

Nonetheless, work continues on the PFN-based approach. Christoph Hellwig posted this patch series adapting the DMA subsystem so that it could manage scatter/gather lists containing PFNs. A scatter/gather list describes an I/O operation that is spread across multiple regions of memory; these lists are used for almost all nontrivial I/O operations, since I/O buffers are rarely situated in a single, physically contiguous block of memory. Making scatter/gather lists work without page structures would, for the most part, solve the problem of doing DMA on buffers stored in persistent memory. Christoph's patch doesn't do that, but it abstracts out the references to page structures, making it easy to use PFNs instead in the future.

Beyond this preparatory work, though, the kernel needs the ability to work more extensively with PFNs. Happily, on the same day, Dan Williams posted a new revision of his patch series implementing the __pfn_t type for the management of pages by PFN. The new __pfn_t type is simpler than it was the last time around:

    typedef struct {
    	unsigned long val;
    } __pfn_t;

There is no more trickery with storing PFNs and struct page pointers in the same structure. There are, however, a few bits of val that are used for related purposes: to chain entries in scatter/gather lists, for example, and, in the case of the PFN_DEV bit, to indicate that the PFN has no associated page structure in the system. There is a set of helper functions to do things like get the actual PFN number (__pfn_t_to_pfn()) or the physical address (__pfn_t_to_phys()) associated with a __pfn_t value.
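A minimal user-space model of this layout might look like the following; the exact flag positions are assumptions for illustration (the patch set defines its own layout), but the helpers behave analogously:

```c
/* Sketch of the __pfn_t type: a bare PFN with a few flag bits folded
 * into val. The bit positions chosen here are illustrative assumptions. */
#define BITS_PER_LONG   ((int)(8 * sizeof(unsigned long)))
#define PAGE_SHIFT      12

#define PFN_SG_CHAIN    (1UL << (BITS_PER_LONG - 1))  /* scatterlist chaining */
#define PFN_DEV         (1UL << (BITS_PER_LONG - 2))  /* no struct page backing */
#define PFN_FLAGS_MASK  (PFN_SG_CHAIN | PFN_DEV)

typedef struct {
    unsigned long val;
} __pfn_t;

/* Strip the flag bits to recover the page-frame number. */
static inline unsigned long __pfn_t_to_pfn(__pfn_t p)
{
    return p.val & ~PFN_FLAGS_MASK;
}

/* The physical address is just the PFN shifted by the page size. */
static inline unsigned long __pfn_t_to_phys(__pfn_t p)
{
    return __pfn_t_to_pfn(p) << PAGE_SHIFT;
}
```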

One common use for a page structure is to map the associated page into the kernel's address space with kmap_atomic(); that allows the kernel to manipulate that page directly. For code dealing with PFN values instead of page structures, Dan's patch set adds kmap_atomic_pfn_t() to do the same job; it will work regardless of whether the PFN it is given refers to ordinary or persistent memory. Interestingly, when successful, kmap_atomic_pfn_t() returns with the RCU read lock held, and kunmap_atomic_pfn_t() expects to release that lock.

The final patch in the series converts the scatter/gather DMA code over to the use of PFNs instead of page structures. That should enable the DMA code to work on persistent memory, though it seems that there may be some remaining issues on a few architectures.

The PFN-based approach is not universally admired; in particular, there has been some resistance from Boaz Harrosh, who believes that page structures should always be used with persistent memory — and who posted a patch set to that effect one year ago. Boaz's patches don't seem to have been developed since then, though, and his objections do not appear to be slowing things down much. David Miller has also expressed discomfort with the idea of memory without page structures, for what it's worth.

Back to page structures

These misgivings notwithstanding, persistent-memory developers clearly see value in providing access to this memory without associated page structures. So it may have come as a surprise to some when Dan also posted this patch series adding none other than struct page support for persistent memory. In the end, it seems, there are certain things that simply cannot be done without page structures; direct I/O and remote DMA are two features at the top of that list. This patch set allows the creation of these structures on systems where they are needed while allowing the rest to avoid the associated overhead.

This patch set adds a new type of block device that sits on top of the existing pmem driver that was merged for the 4.1 kernel. The driver for this new device will add a persistent-memory range to the system's memory map, using the memory hotplugging mechanism. The memory goes into a special zone (ZONE_DEVICE) created for this purpose, though, so it will not be made available to the rest of the system like ordinary memory. As part of this process, the driver allocates an array of page structures to describe this memory range.

Allocating that array brings us back to the problem of memory consumption: a large persistent-memory block will require a large array of page structures to describe it. One possible solution to this problem is to introduce a new structure for variably sized pages, or to simply use huge pages, but Dan's patch set sticks to the ordinary page structure describing 4KB pages. So the memory-consumption problem remains.

The original version of the patch offered an interesting approach to that problem: the decision of where these page structures should live was pushed out to user space. By tweaking a sysfs attribute, the system administrator could direct those structures into ordinary memory, or could instead cause them to be stored in the persistent-memory array itself. So large arrays could host their own page structures. As mentioned above, persistent memory may not be the best place to store those structures, but, for many use cases, it may work well enough, and this approach does make the problem of page structures taking up too much RAM go away.

The current version of this patch set drops that feature, though, and instead stores page structures in RAM unconditionally. That change simplifies the memory-management changes, making it easier to get the patch set reviewed and merged. Expect the store-in-persistent-memory option to return in the future, though, as the huge arrays we've been promised finally start to show up in the mass market.

Meanwhile, we now have a set of patches that make persistent memory behave almost entirely like ordinary memory with regard to management within the kernel. That means that, assuming this work is merged, Linux is essentially ready to support the use of persistent memory for a wide variety of use cases. What remains, at this point, is to see just what developers will do once they have terabyte-sized arrays of persistent memory available to play with.


Porting Linux to a new processor architecture, part 2: The early code

September 2, 2015

This article was contributed by Joël Porquet

In part 1 of this series, we laid the groundwork for porting Linux to a new processor architecture by explaining the (non-code-related) preliminary steps. This article continues from there, delving into the boot code: what needs to be written to get from the early assembly boot code to the creation of the first kernel thread.

The header files

As briefly mentioned in the previous article, the arch header files (in my case, located under linux/arch/tsar/include/) constitute the two interfaces between the architecture-specific and architecture-independent code required by Linux.

The first portion of these headers (subdirectory asm/) is part of the kernel interface and is used internally by the kernel source code. The second portion (uapi/asm/) is part of the user interface and is meant to be exported to user space—even though the various standard C libraries tend to reimplement the headers instead of including the exported ones. These interfaces are not completely airtight, as many of the asm headers are used by user space.

Together, these two interfaces typically comprise more than a hundred header files, which is why headers represent one of the biggest tasks in porting Linux to a new processor architecture. Fortunately, over the past few years, developers noticed that many processor architectures share similar code (because they often exhibit the same behaviors), so the majority of this code has been aggregated into a generic layer of header files (in linux/include/asm-generic/ and linux/include/uapi/asm-generic/).

The real benefit is that it is possible to refer to these generic header files, instead of providing custom versions, by simply writing appropriate Kbuild files. For example, the first few lines of a typical include/asm/Kbuild look like:

    generic-y += atomic.h
    generic-y += barrier.h
    generic-y += bitops.h
    ...

When porting Linux, I'm afraid there is no other choice than to make a list of all of the possible headers and examine them one by one in order to decide whether the generic version can be used or if it requires customization. Such a list can be created from the generic headers already provided by Linux as well as the customized ones implemented by other architectures.

Basically, a specific version must be developed for all of the headers that are related to the details of an architecture, as defined by the hardware or by the software through the ABI: cache (asm/cache.h) and TLB management (asm/tlbflush.h), the ELF format (asm/elf.h), interrupt enabling/disabling (asm/irqflags.h), page table management (asm/page.h, asm/pgalloc.h, asm/pgtable.h), context switching (asm/mmu_context.h, asm/ptrace.h), byte ordering (uapi/asm/byteorder.h), and so on.
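As a tiny example of a header that must be architecture-specific, asm/cache.h typically does little more than encode the hardware's cache-line size; the 32-byte value here is an illustrative assumption, not TSAR's actual geometry:

```c
/* Sketch of a minimal arch-specific asm/cache.h. The line size is a
 * property of the target hardware; 32 bytes is assumed for illustration. */
#define L1_CACHE_SHIFT  5
#define L1_CACHE_BYTES  (1 << L1_CACHE_SHIFT)
```

Small as it is, the value matters: structure-alignment annotations and DMA buffer padding throughout the kernel are derived from it.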

Boot sequence

As explained in part 1, figuring out the boot sequence helps to understand the minimal set of architecture-specific functions that must be implemented—and in which order.

The boot sequence always starts with a function that must be written manually, usually in assembly code (in my case, this function is called kernel_entry() and is located in arch/tsar/kernel/head.S). It is defined as the main entry point of the kernel image, which indicates to the bootloader where to jump after loading the image in memory.

The following trace shows an excerpt of the sequence of functions that is executed during the boot (starred functions are the architecture-specific ones that will be discussed later in this article):

    kernel_entry*
    start_kernel
        setup_arch*
        trap_init*
        mm_init
            mem_init*
        init_IRQ*
        time_init*
        rest_init
            kernel_thread
            kernel_thread
            cpu_startup_entry

Early assembly boot code

The early assembly boot code has this special aura that scared me at first (as I'm sure it did many other programmers), since it is often considered one of the most complex pieces of code in a port. But even though writing assembly code is usually not an easy ride, this early boot code is not magic. It is merely a trampoline to the first architecture-independent C function and, to this end, only needs to perform a short and defined list of tasks.

When the early boot code begins execution, it knows nothing about what has happened before: Has the system been rebooted or just been powered on? Which bootloader has just loaded the kernel in memory? And so forth. For this reason, it is safer to put the processor into a known state. Resetting one or several system registers usually does the trick, making sure that the processor is operating in kernel mode with interrupts disabled.

Similarly, not much is known about the state of the memory. In particular, there is no guarantee that the portion of memory representing the kernel’s bss section (the section containing uninitialized data) was reset to zero, which is why this section must be explicitly cleared.
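The clearing itself amounts to a memset() between two linker-provided symbols. This user-space sketch models the bss region as a static buffer (the names mirror the usual linker-script symbols, but the buffer itself is purely illustrative):

```c
#include <string.h>

/* Stand-in for the kernel's bss region; in a real port __bss_start and
 * __bss_stop come from the linker script, not from C code. */
static char fake_bss[256] = { 1, 2, 3 };
static char *__bss_start = fake_bss;
static char *__bss_stop  = fake_bss + sizeof(fake_bss);

/* Zero everything between the two boundary symbols, as the early boot
 * code must do before any C code relies on zero-initialized globals. */
void clear_bss(void)
{
    memset(__bss_start, 0, (size_t)(__bss_stop - __bss_start));
}
```

In practice this loop is often written in assembly, since it runs before the C environment is fully set up.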

Often Linux receives arguments from the bootloader (in the same way that a program receives arguments when it is launched). For example, this could be the memory address of a flattened device tree (on ARM, MicroBlaze, openRISC, etc.) or some other architecture-specific structure. Often such arguments are passed using registers and need to be saved into proper kernel variables.

At this point, virtual memory has not been activated and it is interesting to note that kernel symbols, which are all defined in the kernel's virtual address space, have to be accessed through a special macro: pa() in x86, tophys() in OpenRISC, etc. Such a macro translates the virtual memory address for symbols into their corresponding physical memory address, thus acting as a temporary software-based translation mechanism.
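Such a translation macro is usually just an offset subtraction between the kernel's link-time virtual base and its physical load address. A sketch, assuming the common 32-bit 0xC0000000 split and a kernel loaded at physical address zero (both assumptions for illustration):

```c
/* Assumed virtual base of the kernel's direct mapping. */
#define PAGE_OFFSET 0xC0000000UL

/* Virtual-to-physical translation for kernel symbols, usable before
 * the MMU is enabled; va() is the inverse. */
#define pa(x) ((unsigned long)(x) - PAGE_OFFSET)
#define va(x) ((unsigned long)(x) + PAGE_OFFSET)
```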

Now, in order to enable virtual memory, a page table structure must be set up from scratch. This structure usually exists as a static variable in the kernel image, since at this stage it is nearly impossible to allocate memory. For the same reason, only the kernel image can be mapped by the page table at first, using huge pages if possible. According to convention, this initial page table structure is called swapper_pg_dir and is thereafter used as the reference page table structure throughout the execution of the system.

On many processor architectures, including TSAR, mapping the kernel has an interesting wrinkle: it actually needs to be mapped twice. The first mapping implements the expected direct-mapping strategy described in part 1 (i.e. access to virtual address 0xC0000000 redirects to physical address 0x00000000). However, another mapping is temporarily required for the window when virtual memory has just been enabled but the code execution flow hasn't yet jumped to a virtually mapped location. This second mapping is a simple identity mapping (i.e. access to virtual address 0x00000000 redirects to physical address 0x00000000).
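Assuming a two-level page table with 4MB huge-page directory entries (an x86-style layout; the index shift, flag value, and table size are all illustrative assumptions), installing the two mappings might look like:

```c
#define PAGE_OFFSET 0xC0000000UL  /* assumed kernel virtual base */
#define PGDIR_SHIFT 22            /* 4MB per page-directory entry (assumed) */

/* The initial reference page table, a static array in the kernel image. */
unsigned long swapper_pg_dir[1024];

/* Map the kernel twice: once at its virtual address (direct mapping)
 * and once at its physical address (temporary identity mapping). */
void map_kernel(unsigned long kernel_phys, unsigned long pde_flags)
{
    /* Identity mapping: virt == phys, needed right after the MMU is on. */
    swapper_pg_dir[kernel_phys >> PGDIR_SHIFT] = kernel_phys | pde_flags;

    /* Direct mapping: virt = phys + PAGE_OFFSET, used from then on. */
    swapper_pg_dir[(PAGE_OFFSET + kernel_phys) >> PGDIR_SHIFT] =
        kernel_phys | pde_flags;
}
```

Once execution has jumped to a virtually mapped address, the identity entry can be torn down again, as setup_arch() does later in the boot.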

With an initialized page table structure, it is now possible to enable virtual memory, meaning that the kernel is fully executing in the virtual address space and that all of the kernel symbols can be accessed normally by their name, without having to use the translation macro mentioned earlier.

One of the last steps is to set up the stack register with the address of the initial kernel stack so that C functions can be properly called. In most processor architectures (SPARC, Alpha, OpenRISC, etc.), another register is also dedicated to containing a pointer to the current thread's information (struct thread_info). Setting up such a pointer is optional, since it can be derived from the current kernel stack pointer (the thread_info structure is usually located at the bottom of the kernel stack) but, when allowed by the architecture, it enables much faster and more convenient access.
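When thread_info lives at the bottom of a THREAD_SIZE-aligned kernel stack, deriving it from the stack pointer is a single mask operation. A sketch (the 8KB stack size and the structure's contents are assumptions for illustration):

```c
#define THREAD_SIZE 8192UL  /* assumed: two-page kernel stacks */

struct thread_info {
    int cpu;
    /* ... per-thread bookkeeping ... */
};

/* Round the stack pointer down to the nearest THREAD_SIZE boundary;
 * thread_info sits at the bottom of that aligned region. */
static inline struct thread_info *current_thread_info(unsigned long sp)
{
    return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
}
```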

The last step of the early boot code is to jump to the first architecture-independent C function that Linux provides: start_kernel().

En route to the first kernel thread

start_kernel() is where many subsystems are initialized, from the various virtual filesystem (VFS) caches and the security framework to time management, the console layer, and so on. Here, we will look at the main architecture-specific functions that start_kernel() calls during boot before it finally calls rest_init(), which creates the first two kernel threads and morphs into the boot idle thread.

setup_arch()

While it has a rather generic name, setup_arch() can actually do quite a bit, depending on the architecture. Yet examining the code for different ports reveals that it generally performs the same tasks, albeit rarely in the same order or the same way. For a simple port (with device tree support), there is a simple skeleton that setup_arch() can follow.

One of the first steps is to discover the memory ranges in the system. A device-tree-based system can quickly skim through the flattened device tree given by the bootloader (using early_init_devtree()) to discover the physical memory banks available and to register them into the memblock layer. Then, parsing the early arguments (using parse_early_param()) that were either given by the bootloader or directly included in the device tree can activate useful features such as early_printk(). The order is important here, as the device tree might contain the physical address of the terminal device used for printing and thus needs to be scanned first.

Next the memblock layer needs some more configuration before it is possible to map the low memory region, which enables memory to be allocated. First, the regions of memory occupied by the kernel image and the device tree are set as being reserved in order to remove them from the pool of free memory, which is later released to the buddy allocator. The boundary between low memory and high memory (i.e. which portion of the physical memory should be included in the direct mapping region) needs to be determined. Finally the page table structure can be cleaned up (by removing the identity mapping created by the early boot code) and the low memory mapped.

The last step of the memory initialization is to configure the memory zones. Physical memory pages can be associated with different zones: ZONE_DMA for pages compatible with the old ISA 24-bit DMA address limitation, and ZONE_NORMAL and ZONE_HIGHMEM for low- and high-memory pages, respectively. Further reading on memory allocation in Linux can be found in Linux Device Drivers [PDF].

Finally, the kernel memory segments are registered using the resource API and a tree of struct device_node entries is created from the flattened device tree.

If early_printk() is enabled, here is an example of what appears on the terminal at this stage:

    Linux version 3.13.0-00201-g7b7e42b-dirty (joel@joel-zenbook) \
        (gcc version 4.8.3 (GCC) ) #329 SMP Thu Sep 25 14:17:56 CEST 2014
    Model: UPMC/LIP6/SoC - Tsar
    bootconsole [early_tty_cons0] enabled
    Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 65024
    Kernel command line: console=tty0 console=ttyVTTY0 earlyprintk

trap_init()

The role of trap_init() is to configure the hardware and software architecture-specific parts involved in the interrupt/exception infrastructure. Up to this point, an exception would either cause the system to crash immediately or it would be caught by a handler that the bootloader might have set up (which would eventually result in a crash as well, but perhaps with more information).

Behind (the actually simple) trap_init() hides another of the more complex pieces of code in a Linux port: the interrupt/exception handling manager. A big part of it has to be written in assembly code because, as with the early boot code, it deals with specifics that are unique to the targeted processor architecture. On a typical processor, a possible overview of what happens on an interrupt is as follows:

  • The processor automatically switches to kernel mode, disables interrupts, and its execution flow is diverted to a special address that leads to the main interrupt handler.
  • This main handler retrieves the exact cause of the interrupt and usually jumps to a sub-handler specialized for this cause. Often an interrupt vector table is used to associate an interrupt sub-handler with a specific cause, and on some architectures there is no need for a main interrupt handler, as the routing between the actual interrupt event and the interrupt vector is done transparently by hardware.
  • The sub-handler saves the current context, which is the state of the processor that can later be restored in order to resume exactly where it stopped. It may also re-enable the interrupts (thus making Linux re-entrant) and usually jumps to a C function that is better able to handle the cause of the exception. For example, such a C function can, in the case of an access to an illegal memory address, terminate the faulty user program with a SIGBUS signal.

Once all of this interrupt infrastructure is in place, trap_init() merely initializes the interrupt vector table and configures the processor via one of its system registers to reflect the address of the main interrupt handler (or of the interrupt vector table directly).
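A minimal trap_init() could therefore look like the sketch below. It is loosely modeled on existing ports; write_vector_base() stands in for whatever system-register accessor the architecture provides, and handle_int/handle_syscall are assembly entry points assumed to be defined elsewhere in the port:

    /* Hypothetical sketch: point the processor at the main assembly
     * handler and fill the software interrupt vector table.
     * write_vector_base() is an invented accessor for the relevant
     * system register. */
    void __init trap_init(void)
    {
            extern char except_vec_start[];

            write_vector_base((unsigned long)except_vec_start);
            set_except_vector(0, handle_int);
            set_except_vector(8, handle_syscall);
    }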

mem_init()

The main role of mem_init() is to release the free memory from the memblock layer to the buddy allocator (aka the page allocator). This represents the last memory-related task before the slab allocator (i.e. the cache of commonly used objects, accessible via kmalloc()) and the vmalloc infrastructure can be started, as both are based on the buddy allocator.
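On a kernel of the 3.x era, a minimal mem_init() might resemble the following sketch (free_all_bootmem() is the interface of that period; current kernels use memblock_free_all() instead):

    /* Sketch of a minimal mem_init(): hand all free pages over to
     * the buddy allocator and print the memory banner. */
    void __init mem_init(void)
    {
            max_mapnr = max_low_pfn;
            high_memory = (void *)__va(max_low_pfn << PAGE_SHIFT);
            free_all_bootmem();
            mem_init_print_info(NULL);  /* prints the "Memory: ..." line */
    }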

Often mem_init() also prints some information about the memory system:

    Memory: 257916k/262144k available (1412k kernel code, \
        4228k reserved, 267k data, 84k bss, 169k init, 0k highmem)
    Virtual kernel memory layout:
        vmalloc : 0xd0800000 - 0xfffff000 ( 759 MB)
        lowmem  : 0xc0000000 - 0xd0000000 ( 256 MB)
          .init : 0xc01a5000 - 0xc01ba000 (  84 kB)
          .data : 0xc01621f8 - 0xc01a4fe0 ( 267 kB)
          .text : 0xc00010c0 - 0xc01621f8 (1412 kB)

init_IRQ()

Interrupt networks can be of very different sizes and complexities. In a simple system, the interrupt lines of a few hardware devices are directly connected to the interrupt inputs of the processor. In complex systems, the numerous hardware devices are connected to multiple programmable interrupt controllers (PICs) and these PICs are often cascaded to each other, forming a multilayer interrupt network. The device tree helps a great deal by easily describing such networks (and especially the routing) instead of having to specify them directly in the source code.

In init_IRQ(), the main task is to call irqchip_init() in order to scan the device tree and find all the nodes identified as interrupt controllers (e.g. PICs). It then finds the associated driver for each node and initializes it. Unless the targeted system uses an already-supported interrupt controller, that typically means the first device driver will need to be written.

Such a driver contains a few major functions: an initialization function that maps the device in the kernel address space and maps the controller-local interrupt lines to the Linux IRQ number space (through the irq_domain mapping library); a mask/unmask function that can configure the controller in order to mask or unmask the specified Linux IRQ number; and, finally, a controller-specific interrupt handler that can find out which of its inputs is active and call the interrupt handler registered with this input (for example, this is how the interrupt handler of a block device connected to a PIC ends up being called after the device has raised an interrupt).
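Putting those pieces together, a skeleton of such a driver might look like this. The register layout (PIC_MASK_SET, PIC_MASK_CLR), the "vendor,simple-pic" compatible string, and the 32-input controller are all invented for the example; the controller-specific interrupt handler is omitted for brevity:

    static void __iomem *pic_base;
    static struct irq_domain *pic_domain;

    static void pic_mask_irq(struct irq_data *d)
    {
            writel(BIT(d->hwirq), pic_base + PIC_MASK_SET);
    }

    static void pic_unmask_irq(struct irq_data *d)
    {
            writel(BIT(d->hwirq), pic_base + PIC_MASK_CLR);
    }

    static struct irq_chip pic_chip = {
            .name       = "simple-pic",
            .irq_mask   = pic_mask_irq,
            .irq_unmask = pic_unmask_irq,
    };

    /* Called for each controller-local line as it is mapped into the
     * Linux IRQ number space. */
    static int pic_map(struct irq_domain *d, unsigned int irq,
                       irq_hw_number_t hw)
    {
            irq_set_chip_and_handler(irq, &pic_chip, handle_level_irq);
            return 0;
    }

    static const struct irq_domain_ops pic_domain_ops = {
            .map   = pic_map,
            .xlate = irq_domain_xlate_onecell,
    };

    static int __init pic_init(struct device_node *np,
                               struct device_node *parent)
    {
            pic_base = of_iomap(np, 0);
            pic_domain = irq_domain_add_linear(np, 32, &pic_domain_ops, NULL);
            return 0;
    }
    IRQCHIP_DECLARE(simple_pic, "vendor,simple-pic", pic_init);

The IRQCHIP_DECLARE() macro is what allows irqchip_init() to match the driver against the device-tree node by its compatible string.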

time_init()

The purpose of time_init() is to initialize the architecture-specific aspects of the timekeeping infrastructure. A minimal version of this function, which relies on the use of a device tree, only involves two function calls.

First, of_clk_init() will scan the device tree and find all the nodes identified as clock providers in order to initialize the clock framework. A very simple clock-provider node only has to define a fixed frequency directly specified as one of its properties.
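Such a node, using the standard "fixed-clock" binding, could look like this illustrative device-tree fragment (the 40 MHz frequency is arbitrary):

    clk_cpu: clock {
            compatible = "fixed-clock";
            #clock-cells = <0>;
            clock-frequency = <40000000>;  /* 40 MHz, arbitrary */
    };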

Then, clocksource_of_init() will parse the clock-source nodes of the device tree and initialize their associated driver. As described in the kernel documentation, Linux actually needs two types of timekeeping abstraction (which are actually often both provided by the same device): a clock-source device provides the basic timeline by monotonically counting (for example it can count system cycles), and a clock-event device raises interrupts on certain points on this timeline, typically by being programmed to count periods of time. Combined with the clock provider, it allows for precise timekeeping.

The driver of a clock-source device can be extremely simple, especially for a memory-mapped device for which the generic MMIO clock-source driver only needs to know the address of the device register containing the counter. For the clock event, it is slightly more complicated as the driver needs to define how to program a period and how to acknowledge it when it is over, as well as provide an interrupt handler for when a timer interrupt is raised.
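For the memory-mapped case, the registration might be sketched as below; timer_base, the TIMER_COUNT register offset, the rate variable, and soc_timer_set_next_event() are placeholders, and the clock-event callbacks and interrupt handler are omitted:

    /* Sketch: register a free-running counter with the generic MMIO
     * clock-source driver (32 bits wide, counting up), then a
     * clock-event device for programmable timer interrupts. */
    clocksource_mmio_init(timer_base + TIMER_COUNT, "soc-timer", rate,
                          300, 32, clocksource_mmio_readl_up);

    static struct clock_event_device soc_clockevent = {
            .name           = "soc-timer",
            .features       = CLOCK_EVT_FEAT_PERIODIC | CLOCK_EVT_FEAT_ONESHOT,
            .rating         = 300,
            .set_next_event = soc_timer_set_next_event,
    };

    clockevents_config_and_register(&soc_clockevent, rate, 1, 0xffffffff);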

Conclusion

One of the main tasks performed by start_kernel() later on is to calibrate the number of loops per jiffy, which is the number of times the processor can execute an internal delay loop in one jiffy—an internal timer period that normally ranges from one to ten milliseconds. A successful calibration indicates that the different infrastructures and drivers set up by the architecture-specific functions we just presented are working, since the calibration makes use of most of them.

In the next article, we will present the last portion of the port: from the creation of the first kernel thread to the init process.

Comments (2 posted)

Patches and updates

Kernel trees

Linus Torvalds Linux 4.2 is out
Kamal Mostafa Linux 3.19.8-ckt6
Jiri Slaby Linux 3.12.47

Architecture-specific

Core kernel code

Device drivers

Device driver infrastructure

Documentation

Filesystems and block I/O

Christoph Hellwig Persistent Reservation API V3

Memory management

Networking

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2015, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds