Shrinking the kernel with a hammer

March 2, 2018

This article was contributed by Nicolas Pitre

This is the fourth article of a series discussing various methods of reducing the size of the Linux kernel to make it suitable for small environments. Reducing the kernel binary has its limits and we have pushed them as far as possible at this point. Still, our goal, which is to be able to run Linux entirely from the on-chip resources of a microcontroller, has not been reached yet. This article will conclude this series by looking at the problem from the perspective of making the kernel and user space fit into a resource-limited system.

A microcontroller is a self-contained system with peripherals, memory, and a CPU. It is typically small, inexpensive, and has low power-consumption characteristics. Microcontrollers are designed to accomplish one task and run one specific program. Therefore, the dynamic memory content of a microcontroller is usually much smaller than its static content. This is why it is common to find microcontrollers equipped with many times more ROM than RAM.

For example, the ATmega328 (a popular Arduino target) comes with 32KB of flash memory and only 2KB of static memory (SRAM). Now for something that can boot Linux, the STM32F767BI comes with 2MB of flash and 512KB of SRAM. So we'll aim for that resource profile and figure out how to move as much content as possible from RAM to ROM.

Kernel XIP

The idea of eXecute-In-Place (XIP) is to have the CPU fetch instructions directly from the ROM or flash memory where it is stored and avoid loading them into RAM altogether. XIP is a given in the microcontroller world where RAM is small, as we've seen. But XIP is used a bit less on larger systems where RAM is plentiful and simply executing everything from RAM is often simpler; executing from RAM is also faster due to high-performance caches. This is why most Linux targets don't support XIP. In fact, XIP in the kernel appears to be supported only on ARM, and its introduction predates the Git era.

For kernel XIP, it is necessary to have ROM or flash memory directly accessible in a range of the processor's memory address space, alongside system RAM, without the need for any software drivers. NOR flash is often used for that purpose as it offers random access, unlike the block-addressed NAND flash. Then, the kernel must be specially linked so the text and read-only data sections are allocated in the flash address range. All we need to do is enable CONFIG_XIP_KERNEL and the build system will prompt for the desired kernel physical address location in flash. Only the writable kernel data will be copied to RAM.

It is therefore highly desirable with an XIP kernel to have as much code and data as possible placed in flash memory. The more that remains in flash, the less will be copied to the precious RAM. By default, functions are put in flash, along with any data annotated with the const qualifier. It is convenient that all the "constification" work that took place in recent kernel releases, mainly for hardening purposes, directly benefits the XIP kernel case too.

User-space XIP and filesystems

User space is a huge consumer of RAM. But, just like the kernel, user-space binaries have read-write and read-only segments. It would be nice to have the user-space read-only segments stored in the same flash memory, and executed directly from there rather than being loaded into RAM. However, unlike the kernel, which is a static binary loaded or mapped only once from well-known ROM and RAM addresses, user-space executables are organized into a filesystem, making things more complicated.

Could we get rid of the filesystem? Certainly we could. In fact, this is what most small realtime operating systems do: they link their application code directly with the kernel, bypassing the filesystem layer entirely. And that wouldn't be completely revolutionary even for Linux, as kernel threads are more or less treated like user-space applications: they have an execution context of their own, they are scheduled alongside user applications, they can be signaled, they appear in the task list, etc. And kernel threads have no filesystem under them. An application made into a kernel thread could crash the entire kernel, but in a microcontroller environment lacking a memory-management unit (MMU), this is already the case for pure user-space applications.

However, having a filesystem around for user-space applications still has many advantages we don't want to lose:

Compatibility with full-fledged Linux systems, so our application can be developed and tested natively on a workstation;
The convenience of having multiple unrelated applications together;
The ability to develop and update the kernel and user space independently of each other;
A clear boundary that identifies application code as not being a derived work of the kernel in the context of the GPL.

This being said, we want the smallest and simplest filesystem possible. Let's not forget that our flash memory budget is only 2MB, and our kernel (see the previous article in this series) weighs about 1MB already. That pretty much rules out writable filesystems due to their inherent overhead, and we don't want to be writing to the same flash where the kernel and user space live as this would render all the flash content inaccessible during write operations and crash any code executing from it.

Side note: It is possible to write to the actual flash memory being used for XIP with CONFIG_MTD_XIP but this is tricky, currently available only for Intel and AMD flash memory, and requires target-specific support.

So our choices for small, read-only filesystems are:

Squashfs: highly scalable, compressed by default, somewhat complex code, no XIP support
Romfs: small and simple code, no compression, partial (only on systems without an MMU) XIP support
Cramfs: small and simple code, compressed, partial (MMU-only) out-of-tree XIP support

I settled on cramfs as the small amount of available flash memory warrants compression that romfs doesn't have, and cramfs's simple code base made it easier to add no-MMU XIP support quickly more than it would be for squashfs. Also, cramfs can be used with the block-device subsystem configured out entirely.

However the early attempts at adding XIP to cramfs were rather crude and lacking in a fundamental way. It was an all-or-nothing affair: each file was either completely uncompressed for XIP purposes, or entirely compressed. In reality, executables are made of both code and data, and since writable data has to be copied to RAM anyway, it is wasteful to keep that part uncompressed in flash. So I took upon myself to completely redesign cramfs XIP support for both the MMU and no-MMU cases. I included the needed ability to mix compressed and uncompressed blocks of arbitrary alignments, and did so in a way to meet quality standards for upstream inclusion (available in mainline since Linux v4.15).

I later (re)discovered that the almost 10-year-old AXFS filesystem (still maintained out of tree) could have been a good fit. I had forgotten about it though, and in any case I prefer to work with mainline code.

One may wonder why DAX was not used here. DAX is like XIP on steroids; it is tailored for large writable filesystems and relies on the presence of an MMU (which the STM32 processor lacks) to page in and out data as needed. Its documentation also mentions another shortcoming: "The DAX code does not work correctly on architectures which have virtually mapped caches such as ARM, MIPS and SPARC". Because cramfs with XIP is read-only and small enough to always be entirely mapped in memory, it is possible to achieve the intended result with a much simpler approach, making DAX somewhat overkill in this context.

User-space XIP and executable binary formats

Now that we're set with an XIP-capable filesystem, it is time to populate it. I'm using a static build of BusyBox to keep things simple. Using a target with an MMU, we can see how things are mapped in memory:

    # cat /proc/self/maps
    00010000-000a5000 r-xp 08101000 1f:00 1328       /bin/busybox
    000b5000-000b7000 rw-p 00095000 1f:00 1328       /bin/busybox
    000b7000-000da000 rw-p 00000000 00:00 0          [heap]
    bea07000-bea28000 rw-p 00000000 00:00 0          [stack]
    bebc1000-bebc2000 r-xp 00000000 00:00 0          [sigpage]
    bebc2000-bebc3000 r--p 00000000 00:00 0          [vvar]
    bebc3000-bebc4000 r-xp 00000000 00:00 0          [vdso]
    ffff0000-ffff1000 r-xp 00000000 00:00 0          [vectors]

The clue that gives XIP away is shown in bold in the third column on the first output line. It is meant to be the file offset for that mapping, except that remap_pfn_range(), used to establish an XIP mapping, overwrites the file offset in the virtual memory area (VMA) structure (vma->vm_pgoff) with the physical address for that mapping. We can see that 0x08101000 would be way too big for a file offset here; instead, it corresponds to a location in the physical address range for the flash memory. Cramfs may also use vm_insert_mixed() in some cases, and then this physical address reporting wouldn't be available. A reliable way to display XIP mappings in all cases would be useful.

The second /bin/busybox mapping (the .data section) is flagged read-write (rw-p), unlike the first one (the .text section) which is read-only and executable (r-xp). Writable segments cannot be mapped to the flash memory and, therefore, have to be loaded in RAM in the usual way.

The MMU makes it easy for a program to see its code at the absolute address it expects regardless of the actual memory used. Things aren't that simple in the no-MMU case, where user executables must be able to run at any memory address; position-independent code (PIC) is therefore a requirement. This ability is offered by the bFLT flat file format, and has been available for quite a long time with uClinux targets. However, this format has multiple limitations that make XIP, shared libraries, or the combination of both, unwieldy.

Fortunately there is a variant of ELF, called ELF FDPIC, that overcomes all those limitations. Because FDPIC segments are position-independent with no predetermined relative offset between them, it is possible to share common .text segments across multiple executable instances just like standard ELF-on-MMU targets, and those .text segments may be XIP as well. ELF FDPIC support was added to the ARM architecture (also available in mainline since Linux v4.15).

On my STM32 target, with the combination of a XIP-enabled cramfs and ELF FDPIC user-space binaries, the BusyBox mapping now looks like this:

    # cat /proc/self/maps
    00028000-0002d000 rw-p 00037000 1f:03 1660       /bin/busybox
    0002d000-0002e000 rw-p 00000000 00:00 0
    0002e000-00030000 rw-p 00000000 00:00 0          [stack]
    081a0760-081d8760 r-xs 00000000 1f:03 1660       /bin/busybox

Due to the lack of an MMU, the XIP segment is even more obvious as there is no address translation and the flash-memory address is clearly visible. The no-MMU memory mapping support requires shared mappings for XIP, hence the "r-xs" annotation.

Hammering static memory down

Okay, now we're all set for some hammering. We've seen that our XIP BusyBox above already saved 229,376 bytes of RAM, or 56 memory pages. That represents 44% of our total budget of 128 pages if we want to target 512KB of RAM. From now on, it is important to closely track where memory allocations go and determine how useful that precious memory is. Let's start by looking at the kernel itself, using a trimmed-down configuration from the previous article, but with CONFIG_XIP_KERNEL=y (and LTO disabled for now as it takes too long to build). We get:

       text    data     bss     dec     hex filename
    1016264   97352  169568 1283184  139470 vmlinux

The 1,016,264 bytes of text are located in flash so we can ignore them for a while. The 266,920 bytes of data and BSS, though, represent 51% of our RAM budget. Let's find out what is responsible for it with some scripting on the System.map file:

    #!/bin/sh
    {
        read addr1 type1 sym1
        while read addr2 type2 sym2; do
            size=$((0x$addr2 - 0x$addr1))
            case $type1 in
            b|B|d|D)
                echo -e "$type1 $size\t$sym1"
                ;;
            esac
            type1=$type2
            addr1=$addr2
            sym1=$sym2
        done
    } < System.map | sort -n -r -k 2

The first output lines are:

    B 133953016     _end
    b 131072        __log_buf
    d 8192  safe_print_seq
    d 8192  nmi_print_seq
    D 8192  init_thread_union
    d 4288  timer_bases
    b 4100  in_lookup_hashtable
    b 4096  ucounts_hashtable
    d 3960  cpuhp_ap_states
    [...]

Here we ignore _end because its apparent huge size comes from the fact that the end of kernel static allocation in RAM comes before the next kernel symbol located in flash — much higher in the address space. It is always good to go back to System.map to make sense of some weird cases like this.

However, we do have a clearly identifiable memory allocation to pound on. Looking at the declaration of __log_buf we see:

    /* record buffer */
    #define LOG_ALIGN __alignof__(struct printk_log)
    #define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
    static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);

This one is easy. Because we don't want to configure the whole of printk() support out just yet, we'll set CONFIG_LOG_BUF_SHIFT=12 (the smallest allowed value). And while there, we'll also set the configuration symbol CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT to its minimum of ten. The result is:

       text    data     bss     dec     hex filename
    1016220   83016   42624 1141860  116c64 vmlinux

Our RAM usage went down from 266,920 to 125,640 bytes with a couple of simple configuration tweaks. Let's see our symbol-size list again:

    B 134092280     _end
    D 8192  init_thread_union
    d 4288  timer_bases
    b 4100  in_lookup_hashtable
    b 4096  ucounts_hashtable
    b 4096  __log_buf
    d 3960  cpuhp_ap_states
    [...]

The next contender is init_thread_union. This one is interesting because its size is derived from THREAD_SIZE_ORDER, which determines how many stack pages each kernel task gets. The first task (the init task) happens to have its stack statically allocated in the .data segment, which is why we see it here. Changing this from two pages to one page should be perfectly fine for our tiny environment, and this will also save one page per task with dynamically allocated stacks.

To reduce the size of timer_bases we'll tweak the value of LVL_BITS down from six to four. To reduce in_lookup_hashtable we change IN_LOOKUP_SHIFT from ten to five. And so on for a few more random kernel constants.

Nailing down dynamic memory allocations

Figuring out and reducing static memory allocations is easy as we've seen. But dynamic allocations must be dealt with as well, and for that we have to instrument our target and boot it. The first dynamic allocations come from the memblock allocator, as the usual kernel memory allocators are not up and running yet. All the instrumentation we need is already there; it suffices to provide "memblock=debug" on the kernel command line to activate it. Here's what it shows:

    memblock_reserve: [0x00008000-0x000229f7] arm_memblock_init+0xf/0x48
    memblock_reserve: [0x08004000-0x08007dbe] arm_memblock_init+0x1d/0x48

Here we have our static RAM being reserved, followed by our kernel code and read-only data in flash (which is mapped starting at 0x08004000). If the kernel code were in RAM then it would make sense to reserve that too. In this case this is just a useless but harmless reservation since the flash will never be allocated for any other purpose anyway.

Now for actual dynamic allocations:

    memblock_virt_alloc_try_nid_nopanic: 131072 bytes align=0x0 nid=0
    from=0x0 max_addr=0x0 alloc_node_mem_map.constprop.6+0x35/0x5c
      Normal zone: 32 pages used for memmap
      Normal zone: 4096 pages, LIFO batch:0

This is our memmap array taking up 131,072 bytes (32 pages) in order to manage 4096 pages. By default this target uses the full 16MB of external RAM available on the board. So if we reduce the number of available pages, to say 512KB, then this will shrink significantly.

The next significant allocation is:

    memblock_virt_alloc_try_nid_nopanic: 32768 bytes align=0x1000 nid=-1
    from=0xffffffff max_addr=0x0 setup_per_cpu_areas+0x21/0x64
    pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768

32KB of per-CPU memory pool for a uniprocessor system with less than a megabyte of RAM? Nah. Here are a few tweaks to include/linux/percpu.h to reduce that to a single page:

    -#define PCPU_MIN_UNIT_SIZE             PFN_ALIGN(32 << 10)
    +#define PCPU_MIN_UNIT_SIZE             PFN_ALIGN(4 << 10)

    -#define PERCPU_DYNAMIC_EARLY_SLOTS     128
    -#define PERCPU_DYNAMIC_EARLY_SIZE      (12 << 10)
    +#define PERCPU_DYNAMIC_EARLY_SLOTS     32
    +#define PERCPU_DYNAMIC_EARLY_SIZE      (4 << 10)

    +#undef PERCPU_DYNAMIC_RESERVE
    +#define PERCPU_DYNAMIC_RESERVE         (4 << 10)

It is worth noting that only the SLOB memory allocator (CONFIG_SLOB) still works after these changes.

Moving on to the next major allocation:

    memblock_virt_alloc_try_nid_nopanic: 8192 bytes align=0x0 nid=-1
    from=0x0 max_addr=0x0 alloc_large_system_hash+0x119/0x1a4
    Dentry cache hash table entries: 2048 (order: 1, 8192 bytes)
    memblock_virt_alloc_try_nid_nopanic: 4096 bytes align=0x0 nid=-1
    from=0x0 max_addr=0x0 alloc_large_system_hash+0x119/0x1a4
    Inode-cache hash table entries: 1024 (order: 0, 4096 bytes)

Who said this is a large system? Yes, you should get the idea by now; a couple more small tweaks are needed, but they're omitted from this article for the sake of keeping it reasonably short.

After that, the usual kernel memory allocators such as kmalloc() take over, and allocations ultimately end up down in __alloc_pages_nodemask(). The same kind of tracing and tweaks may be applied until the boot is complete. Sometimes it is just a matter of configuring out more stuff, such as the sysfs filesystem, whose memory needs are a bit excessive for our budget, and so on.

Back to user space

Now that we have hammered down the kernel's RAM usage, we're ready to flash and boot it again. The minimum amount of RAM required for a successful boot to user space at this point is 800KB ("mem=800k" on the kernel command line). Let's explore our small world:

    BusyBox v1.7.1 (2017-09-16 02:45:01 EDT) hush - the humble shell

    # free
                 total       used       free     shared    buffers     cached
    Mem:           672        540        132          0          0          0
    -/+ buffers/cache:        540        132

    # cat /proc/maps
    00028000-0002d000 rw-p 00037000 1f:03 1660       /bin/busybox
    0002d000-0002e000 rw-p 00000000 00:00 0
    0002e000-00030000 rw-p 00000000 00:00 0
    00030000-00038000 rw-p 00000000 00:00 0
    0004d000-0004e000 rw-p 00000000 00:00 0
    00061000-00062000 rw-p 00000000 00:00 0
    0006c000-0006d000 rw-p 00000000 00:00 0
    0006f000-00070000 rw-p 00000000 00:00 0
    00070000-00078000 rw-p 00000000 00:00 0
    00078000-0007d000 rw-p 00037000 1f:03 1660       /bin/busybox
    081a0760-081d8760 r-xs 00000000 1f:03 1660       /bin/busybox

Here we can see two four-page RAM mappings from offset 0x37000 of /bin/busybox. Those are two data instances, one for the shell process, and one for the cat process. They both share the busybox XIP code segment at 0x081a0760, which is good. There are also two anonymous eight-page RAM mappings among much smaller ones though, and that eats our page budget pretty quickly. They correspond to a 32KB stack space for each of those processes. That certainly can be tweaked down:

    --- a/fs/binfmt_elf_fdpic.c
    +++ b/fs/binfmt_elf_fdpic.c
    @@ -337,6 +337,7 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm)
            retval = -ENOEXEC;
            if (stack_size == 0)
                    stack_size = 131072UL; /* same as exec.c's default commit */

    +       stack_size = 8192;

            if (is_constdisp(&interp_params.hdr))
                    interp_params.flags |= ELF_FDPIC_FLAG_CONSTDISP;

That's certainly a quick and nasty hack; properly changing the stack size in the ELF binary's header is the way to go. It would also require careful validation, say on a MMU system with a fixed-size stack where any stack overflow could be caught. But hey, it wouldn't be our first hack at this point and that will do for now.

Still, before rebooting, let's explore some more:

    # ps
      PID USER       VSZ STAT COMMAND
        1 0          300 S    {busybox} sh
        2 0            0 SW   [kthreadd]
        3 0
    ps invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL),
    nodemask=(null), order=0, oom_score_adj=0
    [...]
    Out of memory: Kill process 19 (ps) score 5 or sacrifice child

The intervention of the out-of-memory killer was bound to happen at some point, of course. However the out-of-memory report also provided this piece of information from the buddy allocator:

    Normal: 2*4kB (U) 3*8kB (U) 2*16kB (U) 2*32kB (UM)
            0*64kB 0*128kB 0*256kB = 128kB

The ps process tried to perform a memory allocation with order=0 (a single 4KB page) and this failed despite having 128KB still available. Why is that? It turns out that the page allocator does not like performing normal memory allocations when there isn't at least a certain small amount of free memory available, as enforced by zone_watermark_ok(). This is to avoid possible deadlocks if a failed memory allocation results in the killing of a process — an operation that may require memory allocations of its own. Even though this watermark is supposed to be small, in our tiny environment this is still something we don't need and can't afford. So let's lower those watermarks slightly:

    --- a/mm/page_alloc.c
    +++ b/mm/page_alloc.c
    @@ -7035,6 +7035,10 @@ static void __setup_per_zone_wmarks(void)
                    zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
                    zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;

    +               zone->watermark[WMARK_MIN] = 0;
    +               zone->watermark[WMARK_LOW] = 0;
    +               zone->watermark[WMARK_HIGH] = 0;
    +
                    spin_unlock_irqrestore(&zone->lock, flags);
            }

Finally we're able to reboot with "mem=768k" on the kernel command line:

    Linux version 4.15.0-00008-gf90e37b6fb-dirty (nico@xanadu.home) (gcc version 6.3.1 20170404
 		  (Linaro GCC 6.3-2017.05)) #634 Fri Feb 23 14:03:34 EST 2018
    CPU: ARMv7-M [410fc241] revision 1 (ARMv7M), cr=00000000
    CPU: unknown data cache, unknown instruction cache
    OF: fdt: Machine model: STMicroelectronics STM32F469i-DISCO board
    On node 0 totalpages: 192
      Normal zone: 2 pages used for memmap
      Normal zone: 0 pages reserved
      Normal zone: 192 pages, LIFO batch:0
    random: fast init done
    [...]

    BusyBox v1.27.1 (2017-09-16 02:45:01 EDT) hush - the humble shell

    # free
                 total       used       free     shared    buffers     cached
    Mem:           644        532        112          0          0         24
    -/+ buffers/cache:        508        136

    # ps
      PID USER       VSZ STAT COMMAND
        1 0          276 S    {busybox} sh
        2 0            0 SW   [kthreadd]
        3 0            0 IW   [kworker/0:0]
        4 0            0 IW<  [kworker/0:0H]
        5 0            0 IW   [kworker/u2:0]
        6 0            0 IW<  [mm_percpu_wq]
        7 0            0 SW   [ksoftirqd/0]
        8 0            0 IW<  [writeback]
        9 0            0 IW<  [watchdogd]
       10 0            0 IW   [kworker/0:1]
       11 0            0 SW   [kswapd0]
       12 0            0 SW   [irq/31-40002800]
       13 0            0 SW   [irq/32-40004800]
       16 0            0 IW   [kworker/u2:1]
       21 0            0 IW   [kworker/u2:2]
       23 0          260 R    ps

    # grep -v " 0 kB" /proc/meminfo
    MemTotal:            644 kB
    MemFree:              92 kB
    MemAvailable:         92 kB
    Cached:               24 kB
    MmapCopy:             92 kB
    KernelStack:          64 kB
    CommitLimit:         320 kB

Here it is! Not exactly our target of 512KB of RAM but 768KB is getting pretty close. Some microcontrollers already have more than that amount of available on-chip SRAM.

Easy improvements are still possible. We can see above that 14 out of the 16 tasks are kernel threads, each with their 4KB stack; some of them could certainly go. Going through another round of memory page tracking would reveal yet more things that could be optimized out, etc. Yet, a dedicated application that doesn't spawn child processes is likely to require less RAM to run as well, unlike this generic shell environment. After all, some popular microcontrollers that are able to connect to the Internet have less total RAM than the remaining free RAM we have here.

Conclusion

There is at least one important lesson to be learned from the work on this project: shrinking the kernel's RAM usage is much easier than shrinking its code size. The code tends to be highly optimized already, because it has a direct influence on system performance, even on big systems. That is not necessarily the case for actual memory usage though. RAM comes relatively cheap on big systems, and wasting some of it really doesn't matter much in practice. Therefore, much low-hanging fruit can be found when optimizing RAM usage for small systems.

Other than the small tweaks and quick hacks presented here, all the major pieces relied upon in this article (XIP kernel, XIP user space, even some device tree memory usage reduction) are available in the mainline already. But further work beyond this proof of concept is still needed to make Linux on tiny devices really useful. Progression of this work will depend, as always, on people's desire to use it and willingness to form a community to promote its development.

[Thanks to Linaro for allowing me to work on this project and to write this article series.]

Index entries for this article
Kernel	Embedded systems
GuestArticles	Pitre, Nicolas

Shrinking the kernel with a hammer

Posted Mar 2, 2018 11:47 UTC (Fri) by atelszewski (guest, #111673) [Link] (1 responses)

Hi,

Yet again excellent article in the series!

My comments:
> The ability to develop and update the kernel and user space independently of each other;

If you mean it for development phase, then I totally agree. It's invaluable.
But production systems, in my opinion, are better updated with a single firmware image
containing the whole system (kernel+userspace).
This allows for easier tracking of what is the actual update status of a particular system,
especially if you're managing significant number of them.

When it comes to memory usage, I think RAM is the biggest challenge as of today.
This opinion is based on the desire to have the PCB layout as simple as possible.
With the recent addition of possibility to execute from QSPI memories, the Flash memory
can be extended quite easily with no much wiring.
But RAM is on the opposite. It's clunky, i.e. it requires quite some PCB traces and microcontroller's GPIOs to get started with.

--
Best regards,
Andrzej Telszewski

Shrinking the kernel with a hammer

Posted Jun 11, 2018 12:48 UTC (Mon) by meyert (subscriber, #32097) [Link]

Wow, I started to read this article series with slight interest, but in the episode it got really cool and everything did come together and made sense now.

Thanks for this article series and this cool final. Gave me a new perspective on memory usage.

Shrinking the kernel with a hammer

Posted Mar 2, 2018 14:45 UTC (Fri) by seebe (guest, #114212) [Link] (2 responses)

> All we need to do is enable CONFIG_XIP_KERNEL and the build system will prompt for the desired kernel physical address location in flash.

Since most ARM MMU-based systems are now multiplatform now in the kernel, you need to hack a line in the ARM Kconfig in order to enable CONFIG_XIP_KERNEL for MMU systems...but it will work (for some systems anyway)

> I later (re)discovered that the almost 10-year-old AXFS filesystem (still maintained out of tree) could have been a good fit. I had forgotten about it though, and in any case I prefer to work with mainline code.

One thing that AXFS has (aside from allowing for a larger file system size) is the ability individually select page by page the portions you want to have XIP-ed (uncompressed) and leave the rest as compressed. This is very helpful because there is a lot of executable and const portions in a file that are simple never run, or only run once at startup or shutdown. So, by only XIP-ing the pages you will commonly use, you can reduce the Flash image size while still retaining low runtime RAM usage. There is profiling tool built into the AXFS driver that will tell you what pages were used (by putting some logging code in the page fault handler) so you can record that information and then input it back into the mkfs.axfs tool.

Maybe at some point we can add this type of functionality into cramfs (since trying to mainline AXFS would be much more work).

Shrinking the kernel with a hammer

Posted Mar 2, 2018 19:55 UTC (Fri) by npitre (subscriber, #5680) [Link] (1 responses)

The XIP support for CramFS available in Linux v4.15 already has the ability to work with individually selected XIP-ed pages. It's just a matter of adding a profile-based page selection mechanism to mkcramfs.

Right now mkcramfs enables XIP only for pages that correspond to loadable ELF segments that are flagged readable and/or executable, and not writable. That could be easily extended to e.g. media files that are inherently compressed.

Shrinking the kernel with a hammer

Posted Jan 17, 2023 10:18 UTC (Tue) by sammythesnake (guest, #17693) [Link]

[thread necromancy alert!]

Presumably when mkcramfs compresses files/blocks it has a case to store incompressible things unmodified - marking those files/blocks for XIP might be give the bulk of the benefit of a more sophisticated version for a lot less effort...

Shrinking the kernel with a hammer

Posted Mar 2, 2018 17:29 UTC (Fri) by rbanffy (guest, #103898) [Link]

XIP-like ideas are handy on very parallel machines (thinking Xeon Phi-like, but any other single-image box - or rack - with a lot of cores would fit). Knowing a given memory range is immutable after you're up and running would make it easy to use core-local memory without worrying about it being consistent across the whole machine and no need to go across the motherboard-side bus. It's not a problem other cores can't modify your core-local memory because nobody is supposed to do it anyway.

Of course, core-local memory is useful for a whole lot of things besides that, but having to be concerned a process stays local to a specific core makes everything more complicated. If all cores have duplicates of frequently code memory that can be read faster than the main system memory can, all cores can spend less time memory-starved.

Shrinking the kernel with a hammer

Posted Mar 5, 2018 19:32 UTC (Mon) by flussence (guest, #85566) [Link] (6 responses)

We can almost say “640KB is enough for Linux” :-)

The kernel's obviously not going to run on an original IBM even if it's squeezed into RAM, but I wonder if there's any other fun applications of this stuff in the same vein... maybe it'd boot on old desktop ARM machines? Those had a whopping 2MB IIRC.

Shrinking the kernel with a hammer

Posted Mar 5, 2018 20:17 UTC (Mon) by farnz (subscriber, #17727) [Link] (5 responses)

The first commercial ARM desktops (Acorn Archimedes) had either 1 MiB (A310 and A410 models) or 4 MiB RAM (A440); the 512 KiB model (A305) was announced at the same time, but shipped later, and the 2 MiB RAM model (A420) also came along later.

Of course, they won't run Linux now, even if you did the soldering job needed to fit 16 MiB RAM - Linux does not support ARMv2 or ARMv2a CPUs (the ARM2 and ARM3 silicon that you can fit in these machines), and you can't fit an ARMv3 or later chip (the ARM6/7 silicon that the RiscPC used, on to modern AArch64 chips).

Shrinking the kernel with a hammer

Posted Mar 6, 2018 7:09 UTC (Tue) by epa (subscriber, #39769) [Link] (4 responses)

When was the support dropped? I know Russell King's original port was to these CPUs.

Shrinking the kernel with a hammer

Posted Mar 6, 2018 10:23 UTC (Tue) by farnz (subscriber, #17727) [Link] (3 responses)

About 10 years ago, give or take - I don't have a git clone to hand to go spelunking, but you're looking for the removal of include/asm-arm26 to see when it was deleted.

Shrinking the kernel with a hammer

Posted Mar 6, 2018 12:51 UTC (Tue) by farnz (subscriber, #17727) [Link] (2 responses)

Found it:

commit 99eb8a550dbccc0e1f6c7e866fe421810e0585f6
Author: Adrian Bunk <bunk@stusta.de>
Date:   Tue Jul 31 00:38:19 2007 -0700

    Remove the arm26 port

    The arm26 port has been in a state where it was far from even compiling
    for quite some time.

    Ian Molton agreed with the removal.

Shrinking the kernel with a hammer

Posted Mar 7, 2018 21:43 UTC (Wed) by flussence (guest, #85566) [Link] (1 responses)

Ah, yeah. I can understand nobody wanting to maintain a 26-bit(!) arch.

Shrinking the kernel with a hammer

Posted Mar 8, 2018 8:48 UTC (Thu) by epa (subscriber, #39769) [Link]

It's a full 32-bit CPU with 32-bit registers, 32-bit address space, and 32-bit data bus; just the program counter is 26 bits. So executable code needs to be in the lower 64 mebibytes of the address space. Data doesn't have that restriction (in the instruction set architecture; I don't think any machine was built using a CPU with this instruction set and more than 16 megs of RAM).

(The same register contained the 26-bit program counter and six flag bits. I believe this was to reduce the amount of saving and restoring needed for responding to interrupts: you could save the whole CPU state apart from the registers in a single 32-bit operation. With 64-bit CPUs I wonder whether the same technique could make a a comeback: I can see the need for a huge address space for data, but surely it wouldn't be much of a hardship if executable code had to be located in the bottom 281 terabytes...)

Like it! How about a script?

Posted Mar 6, 2018 17:22 UTC (Tue) by david.a.wheeler (subscriber, #72896) [Link]

I'd love to see some sort of script that could take the 'current' kernel + tools like busybox and generate the final result. Basically a "starter kit" that people could diverge from.

Bonus points: Put that in a CI environment, so that every update to the Linux kernel or busybox would create a new image, test the image, and report new sizes (including size regressions).

Shrinking the kernel with a hammer

Posted Mar 7, 2018 2:13 UTC (Wed) by abufrejoval (guest, #100159) [Link] (14 responses)

<old-memories>
I remember running Microport Unix on my 80286 (fully loaded with 640K of base RAM) with an Intel Above Board that added 1.5MB of RAM, I believe (could have been 2MB). It also gave me a free 8087 math co-processor as a gift and a set of disks labelled "Microsoft Windows 1.01".

Since I dual booted it with DOS and the mapping between expanded (paged in a 64K "BIOS area" window in real mode) and extended (above 1MB range, available only in protected mode) mapping of RAM was set via DIP switches, I allocated around 50% to each. It meant a little more than 1MB of RAM overall for Microport.

Full UNIX (TM) file system, full multi-user (via serial ports), very much like a PDP-11 in fact, where Unix was born. The 286 had a MMU but not at a page, but segment level, again pretty much like a PDP-11. Of course UNIX System V, Release 2 didn't have 400 system calls and the kernel was statically built and linked. I did some fiddling with the serial drivers to have them support the internal queues of the UARTs, that avoided having to interrupt after every character set or received. That's what made 115kbit possible. Also fiddled with Adaptec bus master SCSI drivers.

Ah and it ran a DOS box, long before OS/2 ever did, one single DOS task which run in the lower 640k by resetting the 286 CPU via the keyboard controller on timer interrupts. The BIOS would then know via a CMOS byte, that the computer had in fact not just been turned on, but come back from protected mode: A glorious hack made possible by IBM for the PC-AT, so it could perform a RAM test after boot on systems which had more than 640K of RAM installed.

For kicks I ran a CP/M emulator in the DOS box, while running a compile job on the Unix...
</old-memories>

<other-old-memories>
A couple of years later I had to port X11R4 to a Motorolla 68020 system that ran AX, a µ-kernel OS somewhere between QNX and Mach. Basically a fixed demo system, that ran a couple of X applications on a true color HDTV display using a TI TMS34020 TIGA board. Had to make X11R4 true-color capable, too: It was only 1 and 8 bit color depth at that point.

MMU wasn't enabled on 68020, there was no file system and the µ-kernel just gave me task scheduling. So I had to write a Unix emulator, basically a library that emulated all system calls required by X11. The "file system" was a memory mapped archive (uncompressed) that just got included into the BLOB along with everything else.

Used a GCC 1.31 cross compiler on a Sun SPARC Station. After several months of working through the X11 source code to make it true color and run some accellerated routines on the TIGA GPU (TMS 34020 was a full CPU that used bit instead of byte addressing for up to 16MB of RAM with 32-bit addresses!) it just worked perfectly at first launch! Without any debugging facilities I'd have been screwed if it didn't...

Dunno what the 68020 had for RAM, but I doubt it was more than 1 or 2MB.
</other-old-memories>

So where it all comes together is that the process you describe is somewhat similar to turning Linux plus a payload into a Unikernel or Library OS, where everything not needed by the payload app is removed from the image.

I sure wouldn't mind if Linux could support that out of the box, including randomization of the LTO phase for RoP protection. And yes, I believe the GPL is not a good match for that.

The Linux build process must be one of the most wasteful things you can do on a computer, starting with the endless parsing of source files which have motivated Google to Go.

I keep dreaming about an AI that can take the Linux source code and convert it into something that is a Unikernel/LibraryOS image, which is only incrementally compiled/merged where needed when you change some kernel or application code at run-time.

I believe I'd call it Multics.

Shrinking the kernel with a hammer

Posted Mar 7, 2018 18:34 UTC (Wed) by nix (subscriber, #2304) [Link] (10 responses)

Fascinating reminiscences, but..

The Linux build process must be one of the most wasteful things you can do on a computer

Oh God no. Compiler build processes with multiple-stage bootstrapping is the first thing that springs to mind (GCC building is *far* harder on a machine than Linux kernel building and most of it is thrown away); but then you look at new stuff like Rust, with, uh, no support for separate compilation or non-source libraries to speak of so everything you depend on is recompiled and relinked in statically for every single thing you build... the Linux build process is nice and trim. The oddest thing it does is edit module object files in a few simple ways after building.

Shrinking the kernel with a hammer

Posted Mar 9, 2018 16:43 UTC (Fri) by fratti (guest, #105722) [Link] (9 responses)

not to forget things like compiling a modern web browser. Chromium needs more than 20 GiB of disk space just to build, and you'll be at it for several hours on a modern system. Sure, ccache can save you some time, but yikes, talk about bad first contributor experiences.

Firefox has also been getting worse now that they're using some Rust. I genuinely hope Rust gets its ABI stuff sorted so that we do not end up living in a world where everything is a >2 MiB static binary in need of recompilation with every dependency update.

Shrinking the kernel with a hammer

Posted Mar 9, 2018 17:56 UTC (Fri) by excors (subscriber, #95769) [Link]

And don't forget Android, which often builds the Linux kernel, and most of a web browser or two, and some of Clang, and a thousand other things. I have several Android trees at about 200GB each. But a reasonable PC can still build the entire thing in under an hour, so it's not too bad really. The kernel itself is trivial.

Shrinking the kernel with a hammer

Posted Mar 10, 2018 0:48 UTC (Sat) by pabs (subscriber, #43278) [Link] (7 responses)

> static binary in need of recompilation with every dependency update.

Is that considered a feature in the Rust community like it is with Go?

Shrinking the kernel with a hammer

Posted Mar 10, 2018 6:42 UTC (Sat) by fratti (guest, #105722) [Link] (5 responses)

Yes, I've been told their religious beliefs state that the dynamic linker is Unsafe™.

Shrinking the kernel with a hammer

Posted Mar 10, 2018 8:13 UTC (Sat) by jdub (guest, #27) [Link] (3 responses)

Hrm, no, practically all Rust Linux binaries dynamically link to glibc by default (and by design), and you can easily dynamically link to C ABI shared libraries. If you want to build a static executable, you have to go out of your way to use the musl target.

There's nothing "unsafe" about dynamic linking, just the challenge of safety across C ABI boundaries (which exists for statically linked code as well) and the lack of a stable Rust ABI (which is pretty reasonable).

Shrinking the kernel with a hammer

Posted Mar 10, 2018 9:40 UTC (Sat) by fratti (guest, #105722) [Link] (2 responses)

I can understand that there is no stable Rust ABI, after all there's no stable C++ ABI either, but the issue is exaggerated by the Rust ecosystem's obsession with microdependencies (there are modules which are really just one function, à la npm), and the fast speed at which the Rust compiler moves.

>practically all Rust Linux binaries dynamically link to glibc by default (and by design)

Indeed, though as far as I know they statically link the Rust standard library. Despite the glibc being dynamically linked, e.g. oxipng still clocks in at 2.8M. Compare that to 86K for optipng.

Shrinking the kernel with a hammer

Posted Mar 11, 2018 17:10 UTC (Sun) by mathstuf (subscriber, #69389) [Link]

> no stable C++ ABI

There isn't in the ISO standard sense, but there are de facto ABIs. GCC and MSVC have declared their ABIs long ago and stick to them. The Rust compiler does not commit to any given ABI between two releases. I suspect there may be one eventually, but it's not in the same area as C++.

Shrinking the kernel with a hammer

Posted Mar 12, 2018 10:48 UTC (Mon) by iq-0 (subscriber, #36655) [Link]

> [...] but the issue is exaggerated by the Rust ecosystem's obsession with microdependencies (there are modules which are really just one function, à la npm),

The reason you have to compile a lot of crates (rust libraries) while the thing you're building only uses a few parts of a few crates directly, has to do with how coherency-rules effectively cause many crates to depend on other crates in order to offer possibly relevant implementation of traits for there types or implementations of their traits on it's types.

To minimize the pain of these type/trait dependencies, and also to ease semver stability guarantees, a number of projects have extracted their basic types and/or traits in single purpose (and thus relatively small) crates. This helps these common crates to have few changes and reduce their compile times.

The fact that the crate dependency explosion often seems worse is due to different crates being able to have different (incompatible) dependencies on different versions of the same crate. Rust often handles these issues gracefully, which in many programming languages would have been painfull version conflicts, at the cost of sitting through additional crate compilations.

But to counter that, they only get build once for a project, unless you switch compiler versions, and thus often have the effect of reducing rebuild times. First time builds can be pretty long, but you only incur that cost occasionally. You do want to keep this in mind when configuring possible CI so that you cache these compiled dependencies.

> and the fast speed at which the Rust compiler moves.

Unless you really depend on the unstable (nightly) rust version the compiler normally is only updated every six weeks.

If you're using the unstable channel, you get to pick when you want to go through the bother of updating and thus recompiling everything. But I agree that that's hardly a consolation.

> Indeed, though as far as I know they statically link the Rust standard library. Despite the glibc being dynamically linked, e.g. oxipng still clocks in at 2.8M. Compare that to 86K for optipng.

All rust dependencies are, by default, statically linked, though LTO will prevent 90% of the standard library and other dependencies from being included in the final binary. A very large part of the resultant binary is debugging information (Rust's multi-versioning, types and module support has a big impact on the symbol length) and unwind information (in order to perform gracefull panics as opposed to plain aborts).

Both can be disabled and, with some effort, Rust binaries can be reasonably small. But things like monomorphization, while generating more optimized code, will almost always result in more code being generated. For most applications this usually isn't a big problem as the larger binaries don't really have a performance impact and greatly aid in error message information and debugging possibilities.

Luckily the people working on Rust support in Debian are working at making Rust programs integrate better with their distribution philosophy (dynamic linking, separating debug info and each dependency in a dedicated package), and I really hope that a number of their requirements and solutions will find their way back to the upstream Rust project.

Shrinking the kernel with a hammer

Posted Mar 10, 2018 8:19 UTC (Sat) by bof (subscriber, #110741) [Link]

"Yes, I've been told their religious beliefs state that the dynamic linker is Unsafe™."

Recently having had openSUSE tumbleweed running crond coredump on me until restarted due to weird DL loading of PAM stuff which was apparently updated, again makes me strongly sympatise with that sentiment...

Shrinking the kernel with a hammer

Posted Mar 11, 2018 17:15 UTC (Sun) by mathstuf (subscriber, #69389) [Link]

It is a feature for deployment, not so much for things one would expect from a distribution (i.e., development tools). One could do the same with C or C++ deployments, but it's a PITA to wrangle build systems in that stack without embedding dependencies, so "no one" does it. I suspect Rust (and not Go[1]) will get dynamic linking before C or C++ have viable "everything static" deployment solutions.

[1]AFAICT, Go has much more of a "non-Go code doesn't exist" mentality than Rust folks do for non-Rust code.

Shrinking the kernel with a hammer

Posted Mar 7, 2018 22:42 UTC (Wed) by anselm (subscriber, #2796) [Link] (2 responses)

The Linux build process must be one of the most wasteful things you can do on a computer

Crypto-“currency” mining?

Shrinking the kernel with a hammer

Posted Mar 8, 2018 8:52 UTC (Thu) by epa (subscriber, #39769) [Link] (1 responses)

In Linuscoin the computational challenge is to start with the SHA256 of a Linux kernel image and work out the combination of build options needed to produce it.

Shrinking the kernel with a hammer

Posted Mar 9, 2018 16:46 UTC (Fri) by fratti (guest, #105722) [Link]

Something semi-related:

https://marcan.st/2017/12/debugging-an-evil-go-runtime-bug/ (heading "Hash-based differential compilation")