Kernel development
Brief items
Kernel release status
The 4.2 kernel was released on August 30; Linus noted that the one-week delay was perhaps not strictly necessary. "So judging by how little happened this week, it wouldn't have been a mistake to release 4.2 last week after all, but hey, there's certainly a few fixes here, and it's not like delaying 4.2 for a week should have caused any problems either." Headline features in this release include the security module stacking patches, the delay-gradient congestion-control algorithm, improvements to writeback management in control groups, a lot of important persistent-memory infrastructure, and more.
The 4.3 merge window is open; see the separate article below for a summary of what has been merged so far.
Stable updates: none have been released in the last week.
Quotes of the week
Kernel development news
4.3 Merge window, part 1
As of this writing, just over 4,000 non-merge changesets have been pulled into the mainline kernel repository for the 4.3 development cycle. This merge window, in other words, is just getting started. But that is enough to begin to show the shape of this development cycle: useful incremental changes, but not much, thus far, in the way of high-profile features.
The user-visible changes merged so far include:
- The kernel now supports the attachment of BPF programs to uprobes, making more flexible tracing of
user-space code possible. There is also a new libbpf library that
is meant to ease the process of working with BPF scripts; its first
user is the perf tool.
- There is a new "PIDs" controller for the control-group subsystem; it
enforces a limit on the number of processes contained within the
group. This controller thus serves as a sort of defense against fork
bombs and similar attacks. See Documentation/cgroups/pids.txt for
details.
- The perf tool has gained the ability to work with Intel processor trace streams.
- The s390 architecture has gained "fake NUMA" support. This allows a
large system to be configured into a set of emulated NUMA nodes,
making it easier to partition workloads and, in some situations,
improving performance.
- The CONFIG_VM86 option provides access to the 16-bit legacy
mode on x86 systems. Its use has been in decline for years, there
are no known recently released tools that need it, and it
has been recently shown to have a number of unpleasant problems, some
of which are security-related. In 4.3, this option will be renamed
(to CONFIG_X86_LEGACY_VM86) and disabled by default.
Hopefully nobody actually needs the VM86 mode and it can be removed
entirely in the near future.
- New hardware support includes:
- Industrial I/O:
ROHM RPR0521 ambient-light and proximity sensors,
Texas Instruments OPT3001 light sensors, and
TXC PA12203001 light and proximity sensors.
- Miscellaneous:
Qualcomm coincell battery chargers,
Qualcomm SMD based RPM regulators,
UltraChip UC1611 LCD controllers,
MediaTek MT6311 power-management ICs,
MediaTek SPI controllers,
MediaTek SCPSYS power domain controllers,
MediaTek MT8173 CPU-frequency controllers,
Netlogic XLP SPI controllers,
Allwinner Security System cryptographic accelerators,
Intel DH895xCC crypto accelerators,
ARM PrimeCell PL172 multiport memory controllers, and
NVIDIA Tegra124 CPU-frequency controllers.
- MOST:
The MOST
specification is a standard
for media networking aimed at the automotive industry. The 4.3
kernel will include (in the staging tree) a new MOST subsystem
with support for network, sound, media drivers and more. See this document for some introductory
information.
- USB: Qualcomm APQ8016/MSM8916 USB transceiver controllers, Allwinner sun4i A10 musb DRC/OTG controllers, and NXP LPC18xx/43xx SoC USB OTG PHYs.
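For readers wondering what using the new PIDs controller might look like from a program, here is a minimal sketch. It assumes that the limit is set by writing a number to the group's pids.max file, as described in Documentation/cgroups/pids.txt; the helper name set_pids_limit() and the choice to pass the group's directory as a parameter are inventions for this example, since the location of the cgroup hierarchy varies from one system to the next.

```c
#include <stdio.h>

/*
 * Write "limit" to the pids.max file found in cgroup_dir.
 * Returns 0 on success, -1 on failure.
 * Hypothetical helper: cgroup_dir is wherever the group lives on this system.
 */
static int set_pids_limit(const char *cgroup_dir, long limit)
{
    char path[4096];
    FILE *f;
    int n;

    n = snprintf(path, sizeof(path), "%s/pids.max", cgroup_dir);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;              /* path did not fit in the buffer */

    f = fopen(path, "w");
    if (!f)
        return -1;              /* no such group, or no permission */

    if (fprintf(f, "%ld\n", limit) < 0) {
        fclose(f);
        return -1;
    }
    return fclose(f) == 0 ? 0 : -1;
}
```

A caller would pass the directory of an existing group along with the desired limit; once the limit is in place, attempts to create processes beyond it should fail rather than take down the system.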
Changes visible to kernel developers include:
- There is a new driver framework for nonvolatile memory devices
(EEPROMs and the like); see Documentation/nvmem/nvmem.txt for some
details.
- DocBook comments for structures can now be split into multiple chunks within the structure, easing the process of documenting the fields of especially large structures. The HTML document generator can also now create internal cross-reference links automatically.
One pull request that has not yet been acted upon by Linus is Jan Kara's request to delete the ext3 filesystem, as was covered here in July. Linus is worried that the change will force ext3 users to upgrade their filesystems in non-backward-compatible ways, but, as Ted Ts'o explained, that should not happen. Your editor would hazard a guess that this removal will go through before the merge window closes.
If the normal schedule holds, that closure should happen on September 13. As usual, LWN will follow the commit stream and call out the most interesting changes as they happen.
Thread-level management in control groups
Progress on the multi-year task of reworking the kernel's control group ("cgroup") mechanism might appear to have slowed down recently, but that work continues and, occasionally, surfaces on the mailing lists. Recently, cgroup maintainer Tejun Heo posted a proposal for the CPU-scheduler controller interface in the new cgroup subsystem; it changes a number of control knobs, makes time units consistent across the interface, and so on. This proposal generated quite a bit of discussion, but it wasn't the contents of the new interface that were controversial. Instead, it became clear that some users are not at all happy about a feature that is absent from this interface, one that may have to be restored before this work can go forward.
The new "unified hierarchy" cgroup interface was a topic of discussion at the 2013 Kernel Summit. At that gathering, Tejun stated his intent to make cgroups handle membership at the process level, but not below. If a process is made up of multiple threads, all of those threads will be placed into the same group as a unit. That is a change from the current cgroup implementation, which allows different threads to be placed in different groups. Eliminating that capability, Tejun said, makes the implementation much more straightforward and, in any case, most controllers only make sense at the process level anyway.
At the time, there were some expressions of concern from the gathered developers, not all of whom were convinced that losing the ability to split a process's threads across control groups would be acceptable. No definitive conclusion on the issue was reached at the time; further discussion was deferred until the code itself made an appearance. Two years later, that code is out but the worries have not gone away; scheduler developer Peter Zijlstra was quick to raise the issue again.
A few users of thread-level cgroup control surfaced in the ensuing discussion; the most vocal of them was Paul Turner of Google, who asserted that this ability is an important part of how systems are managed there. One use case mentioned was the division of a job into work and support threads. The threads doing the "real work" should get the bulk of the available CPU time, but an application will typically want to guarantee a minimum of time to the support threads as well. Putting the two types of threads into different control groups allows this policy to be implemented in a fairly straightforward way.
Tejun's response took a few different forms. One was to question the
importance of this use case; he described
it as "super duper fringe" at one point. He also suggested
that the problem could be solved using process priorities, but Paul clearly stated that priorities are not a
suitable solution to the problem, while cgroups are. It seems clear that a
number of users beyond Google employ control groups in this manner now
and they would not be happy if this ability were to be left out of the new
cgroup interface. If nothing else, leaving it out would tend to inhibit
movement away from the old interface which, in turn, would make that API's
eventual removal an even more distant prospect.
The other significant point argued by Tejun is that the cgroup interface is not a good way for applications to manage their environment. It may work as a system-administration interface, he said, but application developers should be given a more programmatic, system-call-based interface. Such an interface would be more easily used by those developers, he said, and separating the administrative and application interfaces would help to prevent conflicts between the administrator and the application over thread-level management.
In this message Tejun briefly sketched out the "resource group" API that he has in mind. These groups could be created and managed within an application with a new set_resource() system call.
Finally, Tejun argued that, rather than using cgroups for grouping of threads, the kernel should just employ the normal process hierarchy for that purpose; the set_resource() system call follows that guideline. Additionally, new threads could be created with a special clone() flag that would cause them to be placed into a new resource group. The process hierarchy is already understood by application developers, he said, and can be manipulated with existing system calls. If developers use that hierarchy to partition their applications, they will have better results and the complexity of supporting thread-level cgroup membership can be avoided.
The API itself was not discussed much; the discussion was more about identifying the problem than nailing down the details of its solution. There appears to be some concern about moving away from the cgroup API for thread-level management, but developers could probably be convinced on that score if the new API looked good enough; the current API has few overt admirers, after all. There was some resistance to the idea of limiting grouping to the process hierarchy, though. It seems that a number of use cases involve moving threads from one control group to another, depending on just what a specific thread is doing at any given time. A grouping mechanism that was strictly based on the process hierarchy would not be able to move threads in that way.
The end of the discussion came when Ingo Molnar and Peter both indicated that they would block further work on the CPU cgroup controller until the problem of per-thread control had been resolved. The issue, they said, is fundamental to the design of the subsystem, and it is not reasonable to expect that a solution can be retrofitted in after this code is merged. Tejun has not, as of this writing, indicated how he intends to proceed, whether it be by allowing per-thread control-group membership or through a separate control API. Either way, further progress in this area cannot be expected until a solution to this particular problem is presented and accepted by the relevant maintainers.
Persistent memory, with and without page structures
Persistent memory offers the prospect of large amounts (e.g. terabytes) of directly attached memory that retains its contents over a system reboot or power cycle. It also offers a number of interesting design problems with regard to how it should be managed; persistent memory looks a lot like ordinary memory, but it differs in a number of important ways. As a result, there has been a long discussion over how to deal with this memory and, in particular, whether the kernel should use page structures to describe it or not. As shown in some recent patch sets, the discussion continues to evolve, and it seems to be heading toward an interesting answer to the struct page question.
For those needing a quick recap, struct page is the kernel's fundamental memory-management data structure; one page structure exists for each page of memory present in the system. In current kernels, though, persistent memory does not have accompanying page structures for a simple reason: the amount of memory required to hold all of those structures looks prohibitive. Storing them in the persistent memory array itself is possible (and discussed further below), but page structures change frequently, making them a poor fit for persistent memory storage, which (1) tends to be slow for writes, and (2) will wear out more quickly if subjected to sustained frequent writes.
As long as a persistent-memory array is treated like a disk drive, there is no need for page structures. But if persistent memory is to take part in DMA or direct I/O operations, it currently needs those structures; for that reason, such operations do not currently work on persistent memory. This problem is widely seen as needing a fix.
When we last looked at the discussion in May, there was a push toward using page-frame numbers (PFNs) as a replacement for page structures in various I/O paths. A PFN is easily derived from a page's physical address, so it is an easy and obvious way to refer to a physical page of memory — if the additional information stored in the page structure is not needed for any given operation. In May, though, it was becoming clear that this information cannot always be done without, and, thus, that this approach had its limitations, especially when it came to supporting direct I/O, which is the most scalable I/O mode that the kernel offers.
Using page-frame numbers
Nonetheless, work continues on the PFN-based approach. Christoph Hellwig posted this patch series adapting the DMA subsystem so that it could manage scatter/gather lists containing PFNs. A scatter/gather list describes an I/O operation that is spread across multiple regions of memory; these lists are used for almost all nontrivial I/O operations, since I/O buffers are rarely situated in a single, physically contiguous block of memory. Making scatter/gather lists work without page structures would, for the most part, solve the problem of doing DMA on buffers stored in persistent memory. Christoph's patch doesn't do that, but it abstracts out the references to page structures, making it easy to use PFNs instead in the future.
Beyond this preparatory work, though, the kernel needs the ability to work more extensively with PFNs. Happily, on the same day, Dan Williams posted a new revision of his patch series implementing the __pfn_t type for the management of pages by PFN. The new __pfn_t type is simpler than it was the last time around:
typedef struct {
    unsigned long val;
} __pfn_t;
There is no more trickery with storing PFNs and struct page pointers in the same structure. There are, however, a few bits of val that are used for related purposes: to chain entries in scatter/gather lists, for example, and, in the case of the PFN_DEV bit, to indicate that the PFN has no associated page structure in the system. There is a set of helper functions to do things like get the actual PFN (__pfn_t_to_pfn()) or the physical address (__pfn_t_to_phys()) associated with a __pfn_t value.
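To make the idea of a PFN sharing a word with a few flag bits more concrete, here is a small user-space model. It is only a sketch: the demo_ names, the choice of the two low bits for flags, and the 4KB page size are all assumptions made for illustration, and they should not be taken as the layout used in Dan's patches.

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT   12            /* assumes 4KB pages */
#define DEMO_FLAG_BITS    2             /* hypothetical: two low bits hold flags */
#define DEMO_PFN_SG_CHAIN (1UL << 0)    /* hypothetical "chain entry" flag */
#define DEMO_PFN_DEV      (1UL << 1)    /* hypothetical "no page structure" flag */
#define DEMO_FLAGS_MASK   ((1UL << DEMO_FLAG_BITS) - 1)

typedef struct {
    unsigned long val;
} demo_pfn_t;

/* Pack a page-frame number and its flags into a single word. */
static demo_pfn_t demo_pfn_make(unsigned long pfn, unsigned long flags)
{
    demo_pfn_t p = { (pfn << DEMO_FLAG_BITS) | (flags & DEMO_FLAGS_MASK) };
    return p;
}

/* Recover the plain page-frame number by dropping the flag bits. */
static unsigned long demo_pfn_to_pfn(demo_pfn_t p)
{
    return p.val >> DEMO_FLAG_BITS;
}

/* A physical address is the page-frame number shifted by the page size. */
static uint64_t demo_pfn_to_phys(demo_pfn_t p)
{
    return (uint64_t)demo_pfn_to_pfn(p) << DEMO_PAGE_SHIFT;
}

/* Nonzero if the frame is backed by a page structure in this model. */
static int demo_pfn_has_page(demo_pfn_t p)
{
    return !(p.val & DEMO_PFN_DEV);
}
```

Wrapping the value in a structure, rather than using a bare unsigned long, means that the compiler will complain if a flagged value is passed where a plain PFN is expected; that is presumably part of the appeal of the typedef shown above.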
One common use for a page structure is to map the associated page into the kernel's address space with kmap_atomic(); that allows the kernel to manipulate that page directly. For code dealing with PFN values instead of page structures, Dan's patch set adds kmap_atomic_pfn_t() to do the same job; it will work regardless of whether the PFN it is given refers to ordinary or persistent memory. Interestingly, when successful, kmap_atomic_pfn_t() returns with the RCU read lock held, and kunmap_atomic_pfn_t() expects to release that lock.
The final patch in the series converts the scatter/gather DMA code over to the use of PFNs instead of page structures. That should enable the DMA code to work on persistent memory, though it seems that there may be some remaining issues on a few architectures.
The PFN-based approach is not universally admired; in particular, there has been some resistance from Boaz Harrosh, who believes that page structures should always be used with persistent memory — and who posted a patch set to that effect one year ago. Boaz's patches don't seem to have been developed since then, though, and his objections do not appear to be slowing things down much. David Miller has also expressed discomfort with the idea of memory without page structures, for what it's worth.
Back to page structures
These misgivings notwithstanding, persistent-memory developers clearly see value in providing access to this memory without associated page structures. So it may have come as a surprise to some when Dan also posted this patch series adding none other than struct page support for persistent memory. In the end, it seems, there are certain things that simply cannot be done without page structures; direct I/O and remote DMA are two features at the top of that list. This patch set allows the creation of these structures on systems where they are needed while allowing the rest to avoid the associated overhead.
This patch set adds a new type of block device that sits on top of the existing pmem driver that was merged for the 4.1 kernel. The driver for this new device will add a persistent-memory range to the system's memory map, using the memory hotplugging mechanism. The memory goes into a special zone (ZONE_DEVICE) created for this purpose, though, so it will not be made available to the rest of the system like ordinary memory. As part of this process, the driver allocates an array of page structures to describe this memory range.
Allocating that array brings us back to the problem of memory consumption: a large persistent-memory block will require a large array of page structures to describe it. One possible solution to this problem is to introduce a new structure for variably sized pages, or to simply use huge pages, but Dan's patch set sticks to the ordinary page structure describing 4KB pages. So the memory-consumption problem remains.
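The scale of the problem is easy to work out. The sketch below assumes a 64-byte page structure, which is a common size on 64-bit systems but which varies with the kernel configuration; the function name is likewise made up for this example.

```c
#include <stdint.h>

/*
 * Bytes of page structures needed to describe mem_bytes of memory,
 * given the page size and the size of one page structure.
 */
static uint64_t page_struct_overhead(uint64_t mem_bytes,
                                     uint64_t page_size,
                                     uint64_t struct_page_size)
{
    return (mem_bytes / page_size) * struct_page_size;
}
```

Under those assumptions, page_struct_overhead(1ULL << 40, 4096, 64) works out to 16GB of page structures for a 1TB array; describing the same array with 2MB huge pages would need only 32MB.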
The original version of the patch offered an interesting approach to that problem: the decision of where these page structures should live was pushed out to user space. By tweaking a sysfs attribute, the system administrator could direct those structures into ordinary memory, or could instead cause them to be stored in the persistent-memory array itself. So large arrays could host their own page structures. As mentioned above, persistent memory may not be the best place to store those structures, but, for many use cases, it may work well enough, and this approach does make the problem of page structures taking up too much RAM go away.
The current version of this patch set drops that feature, though, and instead stores page structures in RAM unconditionally. That change simplifies the memory-management changes, making it easier to get the patch set reviewed and merged. Expect the store-in-persistent-memory option to return in the future, though, as the huge arrays we've been promised finally start to show up in the mass market.
Meanwhile, we now have a set of patches that make persistent memory behave almost entirely like ordinary memory with regard to management within the kernel. That means that, assuming this work is merged, Linux is essentially ready to support the use of persistent memory for a wide variety of use cases. What remains, at this point, is to see just what developers will do once they have terabyte-sized arrays of persistent memory available to play with.
Porting Linux to a new processor architecture, part 2: The early code
In part 1 of this series, we laid the groundwork for porting Linux to a new processor architecture by explaining the (non-code-related) preliminary steps. This article continues from there to delve into the boot code. This includes what code needs to be written in order to get from the early assembly boot code to the creation of the first kernel thread.
The header files
As briefly mentioned in the previous article, the arch header files (in my
case, located under linux/arch/tsar/include/) constitute the two interfaces
between the architecture-specific and architecture-independent code required by Linux.
The first portion of these headers (subdirectory asm/) is part of the kernel
interface and is used internally by the kernel source code. The second portion
(uapi/asm/) is part of the user interface and is meant to be exported to
user space—even though the various standard C libraries tend to
reimplement the headers instead of including the exported ones. These interfaces
are not completely airtight, as many of the asm headers are
used by user space.
Together, the two interfaces typically amount to more than a hundred header files,
which is why headers represent one of the biggest tasks in porting Linux to a
new processor architecture. Fortunately, over the past few years,
developers noticed
that many processor architectures were sharing similar code (because they often
exhibited the same behaviors), so the majority of this code has been
aggregated
into a generic layer of header files (in
linux/include/asm-generic/ and linux/include/uapi/asm-generic/).
The real benefit is that it is possible to refer to these generic header files,
instead of providing custom versions, by simply writing appropriate Kbuild
files. For example, the first few lines of a typical include/asm/Kbuild look
like:
generic-y += atomic.h
generic-y += barrier.h
generic-y += bitops.h
...
When porting Linux, I'm afraid there is no other choice than to make a list of all of the possible headers and examine them one by one in order to decide whether the generic version can be used or if it requires customization. Such a list can be created from the generic headers already provided by Linux as well as the customized ones implemented by other architectures.
Basically, a specific version must be developed for all of the headers that are related to the details of an
architecture, as defined by the hardware or by the software through the ABI: cache (asm/cache.h) and TLB management
(asm/tlbflush.h), the ELF format (asm/elf.h), interrupt enabling/disabling
(asm/irqflags.h), page table management (asm/page.h, asm/pgalloc.h,
asm/pgtable.h), context switching (asm/mmu_context.h, asm/ptrace.h), byte
ordering (uapi/asm/byteorder.h), and so on.
Boot sequence
As explained in part 1, figuring out the boot sequence helps to understand the minimal set of architecture-specific functions that must be implemented—and in which order.
The boot sequence always starts with a function that must be written manually,
usually in assembly code (in my case, this function is called kernel_entry() and
is located in arch/tsar/kernel/head.S). It is defined as the main entry
point of the kernel image, which indicates to the bootloader where to jump
after loading the image in memory.
The following trace shows an excerpt of the sequence of functions that is executed during the boot (starred functions are the architecture-specific ones that will be discussed later in this article):
kernel_entry*
start_kernel
setup_arch*
trap_init*
mm_init
mem_init*
init_IRQ*
time_init*
rest_init
kernel_thread
kernel_thread
cpu_startup_entry
Early assembly boot code
The early assembly boot code has this special aura that scared me at first (as I'm sure it did many other programmers), since it is often considered one of the most complex pieces of code in a port. But even though writing assembly code is usually not an easy ride, this early boot code is not magic. It is merely a trampoline to the first architecture-independent C function and, to this end, only needs to perform a short and defined list of tasks.
When the early boot code begins execution, it knows nothing about what has happened before: Has the system been rebooted or just been powered on? Which bootloader has just loaded the kernel in memory? And so forth. For this reason, it is safer to put the processor into a known state. Resetting one or several system registers usually does the trick, making sure that the processor is operating in kernel mode with interrupts disabled.
Similarly, not much is known about the state of the memory. In particular, there
is no guarantee that the portion of memory representing the kernel’s bss
section (the section containing uninitialized data) was reset to zero, which is why
this section must be explicitly cleared.
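In a real port, this clearing is normally done in assembly, between two symbols that the linker script places at the start and end of the section. The logic itself is trivial, though; here is a sketch of it in C, with clear_region() being a hypothetical name chosen for this example.

```c
#include <stddef.h>

/*
 * Zero every byte in [start, end). A plain loop is used because no
 * library routine can be relied upon this early in the boot.
 */
static void clear_region(volatile unsigned char *start,
                         volatile unsigned char *end)
{
    while (start < end)
        *start++ = 0;
}
```

The early boot code would call the equivalent of this function once, with the boundaries of the bss section, before any C code that depends on zero-initialized variables is run.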
Often Linux receives arguments from the bootloader (in the same way that a program receives arguments when it is launched). For example, this could be the memory address of a flattened device tree (on ARM, MicroBlaze, OpenRISC, etc.) or some other architecture-specific structure. Often such arguments are passed using registers and need to be saved into proper kernel variables.
At this point, virtual memory has not been activated and it is interesting
to note that kernel symbols, which are all defined in the kernel's virtual
address space, have to be accessed through a special macro: pa() in
x86, tophys() in OpenRISC, etc. Such a macro translates the
virtual memory address for symbols into their corresponding physical memory
address, thus acting as a temporary software-based translation mechanism.
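Such a macro usually boils down to a single subtraction. The following sketch assumes a 32-bit system where the kernel is linked to run at virtual address 0xC0000000 and loaded at physical address 0; the demo_ names are made up for this example and do not come from any particular port.

```c
#include <stdint.h>

/* Assumed start of the kernel's virtual address range. */
#define DEMO_PAGE_OFFSET 0xC0000000u

/* Virtual address of a kernel symbol to its physical address. */
#define demo_pa(vaddr) ((uint32_t)(vaddr) - DEMO_PAGE_OFFSET)

/* The reverse translation, from physical back to virtual. */
#define demo_va(paddr) ((uint32_t)(paddr) + DEMO_PAGE_OFFSET)
```

With these definitions, demo_pa(0xC0001000u) evaluates to 0x00001000, which is where the early code must actually look for a symbol that the linker placed at 0xC0001000.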
Now, in order to enable virtual memory, a page table structure must be set
up from scratch. This structure usually exists as a static variable in the
kernel image, since at this stage it is nearly impossible to allocate
memory. For the same reason, only the kernel image can be mapped by the page
table at first, using huge pages if possible. According to convention, this
initial page table structure is called swapper_pg_dir and is
thereafter used as the reference page table structure throughout the execution
of the system.
On many processor architectures, including TSAR, mapping the kernel has an
interesting twist:
it actually needs to be mapped twice. The first
mapping implements the expected direct-mapping strategy as described in part 1
(i.e. access to virtual address 0xC0000000 redirects to physical address
0x00000000). However, another mapping is temporarily required for when
virtual memory has just been enabled but the code execution flow still hasn't
jumped to a virtually mapped location. This second mapping is a simple identity
mapping (i.e. access to virtual address 0x00000000 redirects to physical
address 0x00000000).
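The double mapping can be illustrated with a toy model of a one-level page table. Everything about the format here is an assumption made for the sake of the example (1024 entries that each map a 4MB huge page, a single "valid" bit, a kernel image loaded at physical address 0); real page-table formats are dictated by the hardware.

```c
#include <stdint.h>

#define DEMO_PGDIR_SHIFT   22            /* hypothetical: each entry maps 4MB */
#define DEMO_PTRS_PER_PGD  1024
#define DEMO_HUGE_MASK     ((1u << DEMO_PGDIR_SHIFT) - 1)
#define DEMO_PAGE_OFFSET   0xC0000000u   /* start of the direct mapping */
#define DEMO_ENTRY_VALID   0x1u          /* hypothetical "present" bit */

/* Statically allocated, since nothing can be allocated this early. */
static uint32_t demo_pg_dir[DEMO_PTRS_PER_PGD];

/* Map the 4MB region containing vaddr to the 4MB region containing paddr. */
static void demo_map_huge(uint32_t vaddr, uint32_t paddr)
{
    demo_pg_dir[vaddr >> DEMO_PGDIR_SHIFT] =
        (paddr & ~DEMO_HUGE_MASK) | DEMO_ENTRY_VALID;
}

/* Install both mappings for an image of image_size bytes loaded at 0. */
static void demo_map_kernel(uint32_t image_size)
{
    uint32_t off;

    for (off = 0; off < image_size; off += 1u << DEMO_PGDIR_SHIFT) {
        demo_map_huge(DEMO_PAGE_OFFSET + off, off);   /* direct mapping */
        demo_map_huge(off, off);                      /* temporary identity mapping */
    }
}

/* Software page-table walk: returns 0 and stores the physical address. */
static int demo_translate(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t entry = demo_pg_dir[vaddr >> DEMO_PGDIR_SHIFT];

    if (!(entry & DEMO_ENTRY_VALID))
        return -1;
    *paddr = (entry & ~DEMO_HUGE_MASK) | (vaddr & DEMO_HUGE_MASK);
    return 0;
}
```

After demo_map_kernel() has run, both 0xC0001234 and 0x00001234 translate to physical address 0x00001234, so the instruction that follows the enabling of virtual memory can still be fetched from the address at which the processor was already executing.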
With an initialized page table structure, it is now possible to enable virtual memory, meaning that the kernel is fully executing in the virtual address space and that all of the kernel symbols can be accessed normally by their name, without having to use the translation macro mentioned earlier.
One of the last steps is to set up the stack register with the address of the
initial kernel stack so that C functions can be properly called. In most
processor architectures (SPARC, Alpha, OpenRISC, etc.), another register is also
dedicated to containing a pointer to the current thread's information (struct
thread_info). Setting up such a pointer is optional, since it can
be derived from the current kernel stack pointer (the thread_info structure is
usually located at the bottom of the kernel stack) but, when allowed by the
architecture, it enables much faster and more convenient access.
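Deriving the thread_info pointer from the stack pointer is a matter of masking off the low bits. This sketch assumes an 8KB kernel stack that is aligned on its own size, with the structure sitting at the lowest address; the names are hypothetical.

```c
#include <stdint.h>

/* Assumed kernel stack size; it must be a power of two. */
#define DEMO_THREAD_SIZE 8192u

/*
 * Round the stack pointer down to the base of its stack area, which
 * is where the thread_info structure lives in this model.
 */
static uintptr_t demo_thread_info_addr(uintptr_t sp)
{
    return sp & ~((uintptr_t)DEMO_THREAD_SIZE - 1);
}
```

A stack pointer of 0x12345678, for example, yields 0x12344000. The computation is cheap, but a dedicated register avoids having to repeat it on every access.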
The last step of the early boot code is to jump to the first
architecture-independent C function that Linux provides: start_kernel().
En route to the first kernel thread
start_kernel() is where many subsystems are initialized, from
the various virtual filesystem (VFS) caches and the security framework to time
management, the console layer, and so on. Here, we will look at the main
architecture-specific
functions that start_kernel() calls during boot before it
finally calls rest_init(), which creates the
first two kernel threads and morphs into the boot idle thread.
setup_arch()
While it has a rather generic name, setup_arch()
can actually do quite a bit,
depending on the architecture. Yet examining the code for different ports
reveals that it generally performs the same tasks, albeit never in the same
order nor the same way. For a simple port (with device tree support), there is a
simple skeleton that setup_arch() can follow.
One of the first steps is to discover the memory ranges in the system.
A device-tree-based system can quickly skim through the flattened device
tree given by the bootloader (using early_init_devtree()) to
discover the
physical memory banks available and to register them into the
memblock layer. Then, parsing the early
arguments (using parse_early_param()) that were either given
by the bootloader or
directly included in the device tree can activate useful features such as
early_printk(). The order is important here, as the device
tree might contain the
physical address of the terminal device used for printing and thus needs to be
scanned first.
Next the memblock layer needs some more configuration before it is possible to map the low memory region, which enables memory to be allocated. First, the regions of memory occupied by the kernel image and the device tree are set as being reserved in order to remove them from the pool of free memory, which is later released to the buddy allocator. The boundary between low memory and high memory (i.e. which portion of the physical memory should be included in the direct mapping region) needs to be determined. Finally the page table structure can be cleaned up (by removing the identity mapping created by the early boot code) and the low memory mapped.
The last step of the memory initialization is to configure the memory
zones. Physical memory pages can be associated with different zones: ZONE_DMA
for pages compatible with the old ISA 24-bit DMA address limitation, and
ZONE_NORMAL and
ZONE_HIGHMEM for low- and high-memory pages, respectively. Further reading on
memory allocation in Linux can be found in Linux Device Drivers
[PDF].
Finally, the kernel memory segments are registered using the resource
API and a tree of struct device_node
entries is created from the flattened device tree.
If early_printk() is enabled, here is an example of what
appears on the terminal at
this stage:
Linux version 3.13.0-00201-g7b7e42b-dirty (joel@joel-zenbook) \
(gcc version 4.8.3 (GCC) ) #329 SMP Thu Sep 25 14:17:56 CEST 2014
Model: UPMC/LIP6/SoC - Tsar
bootconsole [early_tty_cons0] enabled
Built 1 zonelists in Zone order, mobility grouping on. Total pages: 65024
Kernel command line: console=tty0 console=ttyVTTY0 earlyprintk
trap_init()
The role of trap_init() is to configure the hardware and software
architecture-specific parts involved in the interrupt/exception infrastructure. Up to
this point, an exception would either cause the system to crash immediately or
it would be caught by a handler that the bootloader might have set up
(which would
eventually result in a crash as well, but perhaps with more information).
Behind (the actually simple) trap_init() hides another of
the more complex
pieces of code in a Linux port: the interrupt/exception handling manager. A big
part of it has to be written in assembly code because, as with the early boot
code, it deals with specifics that are unique to the targeted processor
architecture. On a typical processor, a possible overview of what happens on an
interrupt is as follows:
- The processor automatically switches to kernel mode, disables interrupts, and its execution flow is diverted to a special address that leads to the main interrupt handler.
- This main handler retrieves the exact cause of the interrupt and usually jumps to a sub-handler specialized for this cause. Often an interrupt vector table is used to associate an interrupt sub-handler with a specific cause, and on some architectures there is no need for a main interrupt handler, as the routing between the actual interrupt event and the interrupt vector is done transparently by hardware.
- The sub-handler saves the current context, which is the state of the
processor
that can later be restored in order to resume exactly where it stopped. It may
also re-enable the interrupts (thus making Linux re-entrant) and usually jumps
to a C function that is better able to handle the cause of the exception.
For example, such a C function can, in the case of an access to an
illegal memory address, terminate the faulty user program with a
SIGBUS signal.
Once all of this interrupt infrastructure is
in place, trap_init() merely
initializes the interrupt vector table and configures the processor via one of
its system registers to reflect the address of the main interrupt handler (or
of the interrupt vector table directly).
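The routing described above can be sketched as a small user-space model. This is only an illustration, not kernel code: the cause codes, the handler names, and the trap_entry_reg variable (which stands in for the processor's system register) are all hypothetical, and a real port would do the first steps in assembly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cause codes; a real processor defines its own encoding. */
enum trap_cause { CAUSE_IRQ, CAUSE_SYSCALL, CAUSE_BAD_ADDR, NR_CAUSES };

typedef int (*trap_handler_t)(void);

/* The interrupt vector table: one sub-handler per cause. */
static trap_handler_t vector_table[NR_CAUSES];

/* Stands in for the system register that holds the entry address. */
static int (*trap_entry_reg)(unsigned int cause);

/* Sub-handlers only return a tag here so that the routing can be observed. */
static int handle_irq(void)      { return 1; }
static int handle_syscall(void)  { return 2; }
static int handle_bad_addr(void) { return 3; } /* real code might send SIGBUS */
static int handle_unknown(void)  { return -1; }

/* Main handler: retrieve the cause and jump to the matching sub-handler. */
static int main_trap_handler(unsigned int cause)
{
	if (cause >= (unsigned int)NR_CAUSES)
		return handle_unknown();
	assert(vector_table[cause] != NULL); /* trap_init_model() must run first */
	return vector_table[cause]();
}

/* Model of trap_init(): fill the table, then publish the entry point. */
static void trap_init_model(void)
{
	for (size_t i = 0; i < NR_CAUSES; i++)
		vector_table[i] = handle_unknown;
	vector_table[CAUSE_IRQ] = handle_irq;
	vector_table[CAUSE_SYSCALL] = handle_syscall;
	vector_table[CAUSE_BAD_ADDR] = handle_bad_addr;
	trap_entry_reg = main_trap_handler;
}
```

After trap_init_model() has run, calling trap_entry_reg(CAUSE_BAD_ADDR) reaches handle_bad_addr(), while an out-of-range cause falls through to handle_unknown().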
mem_init()
The main role of mem_init() is to release the free memory from the
memblock layer to the buddy allocator (aka the page
allocator). This represents the last memory-related task before
the slab allocator (i.e. the cache of commonly used objects, accessible via
kmalloc()) and the vmalloc infrastructure can be started, as both are based on
the buddy allocator.
Often mem_init() also prints some information about the memory system:
Memory: 257916k/262144k available (1412k kernel code, \
4228k reserved, 267k data, 84k bss, 169k init, 0k highmem)
Virtual kernel memory layout:
vmalloc : 0xd0800000 - 0xfffff000 ( 759 MB)
lowmem : 0xc0000000 - 0xd0000000 ( 256 MB)
.init : 0xc01a5000 - 0xc01ba000 ( 84 kB)
.data : 0xc01621f8 - 0xc01a4fe0 ( 267 kB)
.text : 0xc00010c0 - 0xc01621f8 (1412 kB)
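The hand-over from the early allocator to the page allocator can be modeled in a few lines of user-space C. The sketch below is not the kernel's implementation; the structures and the mem_init_model() name are hypothetical, and it assumes 4 kB pages and a single reserved region. With the figures from the output above (262144k total, 4228k reserved), it arrives at the same 257916k of available memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE_KB 4u /* assumption: 4 kB pages */

/* A range of page frames that must not be handed to the page allocator. */
struct region {
	size_t start_pfn;
	size_t nr_pages;
};

/* Hypothetical stand-in for the early (memblock-like) view of memory. */
struct early_mem {
	size_t total_pages;
	const struct region *reserved;
	size_t nr_reserved;
};

/* Hypothetical stand-in for the buddy allocator: only counts free pages. */
struct page_allocator {
	size_t free_pages;
};

static bool pfn_is_reserved(const struct early_mem *em, size_t pfn)
{
	for (size_t i = 0; i < em->nr_reserved; i++) {
		const struct region *r = &em->reserved[i];

		if (pfn >= r->start_pfn && pfn - r->start_pfn < r->nr_pages)
			return true;
	}
	return false;
}

/* Release every non-reserved page; return the number of pages released. */
static size_t mem_init_model(const struct early_mem *em,
			     struct page_allocator *pa)
{
	size_t released = 0;

	assert(em != NULL && pa != NULL);
	for (size_t pfn = 0; pfn < em->total_pages; pfn++) {
		if (pfn_is_reserved(em, pfn))
			continue;
		pa->free_pages++;
		released++;
	}
	return released;
}
```

For a 65536-page system with 1057 reserved pages, mem_init_model() releases 64479 pages, which is 257916 kB.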
init_IRQ()
Interrupt networks can be of very different sizes and complexities. In a simple system, the interrupt lines of a few hardware devices are directly connected to the interrupt inputs of the processor. In complex systems, the numerous hardware devices are connected to multiple programmable interrupt controllers (PICs) and these PICs are often cascaded to each other, forming a multilayer interrupt network. The device tree helps a great deal by easily describing such networks (and especially the routing) instead of having to specify them directly in the source code.
In init_IRQ(), the main task is to call
irqchip_init() in order to scan the device tree and find all
the nodes identified as interrupt controllers (e.g. PICs). It then finds the
associated driver for each node and initializes it. Unless the targeted system
uses an already-supported interrupt controller, that typically means
the first device driver will need to be written.
Such a driver contains a few major functions: an initialization
function that maps the device in the kernel address space and maps the
controller-local interrupt lines to the Linux IRQ number space (through the
irq_domain mapping library); a mask/unmask function that can
configure the controller in order to mask or unmask the specified Linux IRQ
number; and, finally, a controller-specific interrupt handler that can find out
which of its inputs is active and call the interrupt handler registered with
this input (for example, this is how the interrupt handler of a block device
connected to a PIC ends up being called after the device has raised an
interrupt).
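The three parts of such a driver can be illustrated with a user-space model. None of the names below come from the kernel; struct pic_model, the pic_*() helpers, and the fixed LINUX_IRQ_BASE offset (a crude substitute for what the irq_domain library does) are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PIC_NR_INPUTS  8u
#define LINUX_IRQ_BASE 16u /* hypothetical start of this controller's IRQ range */

typedef void (*irq_handler_t)(unsigned int irq, void *dev);

struct pic_model {
	uint32_t pending; /* models the status register: one bit per input */
	uint32_t masked;  /* bit set = input masked */
	irq_handler_t handler[PIC_NR_INPUTS];
	void *dev[PIC_NR_INPUTS];
};

/* Map a controller-local input to a Linux IRQ number, and back. */
static unsigned int pic_map(unsigned int hwirq)
{
	assert(hwirq < PIC_NR_INPUTS);
	return LINUX_IRQ_BASE + hwirq;
}

static unsigned int pic_hwirq(unsigned int irq)
{
	assert(irq >= LINUX_IRQ_BASE && irq - LINUX_IRQ_BASE < PIC_NR_INPUTS);
	return irq - LINUX_IRQ_BASE;
}

/* Initialization: start with every input masked and no handlers. */
static void pic_init(struct pic_model *pic)
{
	pic->pending = 0;
	pic->masked = (1u << PIC_NR_INPUTS) - 1u;
	for (unsigned int i = 0; i < PIC_NR_INPUTS; i++) {
		pic->handler[i] = NULL;
		pic->dev[i] = NULL;
	}
}

static void pic_mask(struct pic_model *pic, unsigned int irq)
{
	pic->masked |= 1u << pic_hwirq(irq);
}

static void pic_unmask(struct pic_model *pic, unsigned int irq)
{
	pic->masked &= ~(1u << pic_hwirq(irq));
}

static void pic_request(struct pic_model *pic, unsigned int irq,
			irq_handler_t fn, void *dev)
{
	unsigned int hw = pic_hwirq(irq);

	pic->handler[hw] = fn;
	pic->dev[hw] = dev;
}

/* Controller handler: find the active inputs and call their handlers. */
static unsigned int pic_handle_irq(struct pic_model *pic)
{
	uint32_t active = pic->pending & ~pic->masked;
	unsigned int handled = 0;

	for (unsigned int hw = 0; hw < PIC_NR_INPUTS; hw++) {
		if (!(active & (1u << hw)))
			continue;
		pic->pending &= ~(1u << hw); /* acknowledge the input */
		if (pic->handler[hw]) {
			pic->handler[hw](pic_map(hw), pic->dev[hw]);
			handled++;
		}
	}
	return handled;
}

/* Example device handler: counts the interrupts it receives. */
static void count_irq(unsigned int irq, void *dev)
{
	(void)irq;
	++*(int *)dev;
}
```

A device that registers count_irq on input 3 only sees its handler called once that input is both pending and unmasked; a masked input stays pending until pic_unmask() is called.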
time_init()
The purpose of time_init() is to initialize the architecture-specific
aspects of the timekeeping infrastructure. A minimal version of this function,
which relies on the use of a device tree, only involves two function calls.
First, of_clk_init() will scan the device tree and find all the
nodes identified as clock providers in order to initialize the clock
framework. A very simple clock-provider node only has to define a fixed
frequency directly specified as one of its properties.
Then, clocksource_of_init() will parse the clock-source
nodes of the device tree and initialize their associated drivers. As described in
the kernel documentation, Linux needs two types of timekeeping
abstraction (which are often both provided by the same device): a
clock-source device provides the basic timeline by monotonically counting
(for example, it can count system cycles), and a clock-event device raises
interrupts at certain points on this timeline, typically by being programmed to
count periods of time. Combined with the clock provider, they allow for precise
timekeeping.
The driver of a clock-source device can be extremely simple, especially for a memory-mapped device for which the generic MMIO clock-source driver only needs to know the address of the device register containing the counter. For the clock event, it is slightly more complicated as the driver needs to define how to program a period and how to acknowledge it when it is over, as well as provide an interrupt handler for when a timer interrupt is raised.
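The division of labor between the two abstractions can be shown with a small user-space model of a single device that provides both. This is a sketch under stated assumptions, not a real driver: struct timer_model and every function name are hypothetical, and timer_advance() simply stands in for the hardware counting on its own.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical device acting as both clock source and clock event. */
struct timer_model {
	uint64_t counter;      /* clock source: only ever counts up */
	uint64_t period;       /* clock event: programmed period, in cycles */
	uint64_t next_event;   /* clock event: next deadline on the timeline */
	int irq_pending;       /* set when the deadline has been reached */
	unsigned long ticks;   /* timer interrupts handled so far */
};

/* Clock source: reading the timeline is just reading the counter. */
static uint64_t clocksource_read_model(const struct timer_model *t)
{
	return t->counter;
}

/* Clock event: program a period relative to the current counter value. */
static void clockevent_set_period_model(struct timer_model *t, uint64_t cycles)
{
	assert(cycles > 0);
	t->period = cycles;
	t->next_event = t->counter + cycles;
}

/* Stands in for the hardware counting cycles by itself. */
static void timer_advance(struct timer_model *t, uint64_t cycles)
{
	t->counter += cycles;
	if (t->period && t->counter >= t->next_event)
		t->irq_pending = 1;
}

/* Interrupt handler: acknowledge the event and arm the next period. */
static void timer_interrupt_model(struct timer_model *t)
{
	if (!t->irq_pending)
		return;
	t->irq_pending = 0;
	t->next_event += t->period;
	t->ticks++;
}
```

With a period of 1000 cycles, no interrupt is pending after 999 cycles; one more cycle raises it, and the handler acknowledges it and moves the deadline to cycle 2000.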
Conclusion
One of the main tasks performed by start_kernel() later on is to calibrate the
number of loops per jiffy, which is the number of times the processor can
execute an internal delay loop in one jiffy (an internal timer period that
normally ranges from one to ten milliseconds). Successfully performing this
calibration should mean that the different infrastructures and drivers set up by
the architecture-specific functions we just presented are working, since the
calibration makes use of most of them.
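To make the arithmetic concrete, the sketch below shows how a calibrated loops-per-jiffy value relates to the timer frequency and to short delays. It is a user-space illustration only; the function names are hypothetical, HZ is passed as a parameter, and the loops-per-jiffy value used in the example is made up rather than measured.

```c
#include <assert.h>
#include <stdint.h>

/* Length of one jiffy in milliseconds for a given timer frequency (HZ). */
static unsigned int jiffy_period_ms(unsigned int hz)
{
	assert(hz > 0 && hz <= 1000);
	return 1000u / hz;
}

/*
 * Number of delay-loop iterations needed to wait for "usecs" microseconds,
 * given "lpj" loops per jiffy and "hz" jiffies per second:
 * loops per second = lpj * hz, so loops = usecs * lpj * hz / 1000000.
 */
static uint64_t delay_loops_for_usecs(uint64_t lpj, unsigned int hz,
				      uint64_t usecs)
{
	return usecs * lpj * hz / 1000000u;
}
```

With HZ set to 100, a jiffy lasts 10 milliseconds; if calibration found a (hypothetical) value of 50000 loops per jiffy, a 10-microsecond delay would take 50 iterations and a full 10000-microsecond jiffy would take, as expected, 50000.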
In the next article, we will present the last portion of the port: from the
creation of the first kernel thread to the init process.
Page editor: Jonathan Corbet