
Leading items

Welcome to the LWN.net Weekly Edition for March 15, 2018

This edition contains the following feature content:

  • Discussing PEP 572: a Python proposal for statement-local name bindings and a debate about where such discussions should be held.
  • An introduction to RISC-V: a look at the free and open instruction-set architecture and its ecosystem.
  • Variable-length arrays and the max() mess: the effort to remove VLAs from the kernel runs into an interesting snag.
  • Designing ELF modules: a review of the proposed kernel modules that run as user-space programs.
  • Time-based packet transmission: letting applications specify when their packets should be transmitted.
  • JupyterLab: ready for users: the new Jupyter interface provides a more integrated and extensible environment for scientific computation.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Discussing PEP 572

By Jake Edge
March 14, 2018

As is often the case, the python-ideas mailing list hosted a discussion about a Python Enhancement Proposal (PEP) recently. In some sense, this particular PEP was created to try to gather together the pros and cons of a feature idea that regularly crops up: statement-local bindings for variable names. But the discussion of the PEP went in enough different directions that it led to calls for an entirely different type of medium in which to have those kinds of discussions.

PEP 572 ("Syntax for Statement-Local Name Bindings") has been through two revisions at this point, with robust discussion of both on python-ideas. The idea is that values with short lifetimes (for the duration of a single statement) may need to be referred to more than once during that lifetime. The canonical example from the PEP is:

    stuff = [ [f(x), f(x)] for x in range(5) ]

That works just fine, but if f() is costly or has side effects, that price will be paid twice in creating the list of lists. PEP 572 proposes new syntax that would allow any Python expression expr to be replaced with (expr as name); it would have the same value as the expression but also bind that value to name. That binding would only last until the end of the statement that contained it. Using that would allow the above example to be written as follows:

    stuff = [ [(f(x) as y), y] for x in range(5) ]

The comments on the first posting went in a bunch of different directions, some of which were anticipated by PEP author Chris Angelico as "open questions" in the PEP. Mostly those questions concerned corner cases of various sorts, such as how to handle the statement-local name bindings (SLNBs) for names that have already been used:

    (x, (1 as x), x)

Should the SLNB x shadow the x in the wider scope? Furthermore, there were questions (and a variety of answers) on what the following code should do:

    b = (3 as b)

The existing prototype implementation creates the SLNB for b before doing the lookup for the assignment; thus it assigns to the statement-local b, which goes out of scope at the end of the statement. So it effectively does nothing—which could easily be confusing.

The list comprehension example given in the PEP is not the truly interesting use case, at least for some participating in the thread. The SLNB is active until the end of the statement, which includes blocks, so the following would also be possible:

    if (re.search(pat, text) as match):
        print("Found:", match.group(0))

    while (sock.read() as data):
        print("Received data:", data)

For those, match and data would go out of scope after the completion of their blocks.

As is fairly normal for discussion on a mailing list, the thread had several lengthy sub-threads with comments (and replies several levels deep) that touched on multiple pieces of the PEP. That did not sit well with Robert Vanden Eynde, who asked: "Isn't it easier to talk on a forum?" As might be guessed, the answer from several was "no", but there was more to it than that. Vanden Eynde said:

We are currently like a dozen of people talking about multiple sections of a single subject.

[...] A forum (or just few "issues" thread on github) is where we could have different thread in parallel, in my messages I end up with like 10 comments not all related, in a forum we could talk about everything and it would still be organized by subjects.

Also, it's more interactive than email on a global list, people can talk to each other in parallel, if I want to answer about a mail that was 10 mail ago, it gets quickly messy.

We could all discuss on a gist or some "Issues" thread on GitHub.

Several participants expanded on their opposition to a forum-style discussion; many have built up a great deal of tooling to handle email, much of which is not usable in a web-based forum. Stephen J. Turnbull, who is one of the developers of the Mailman mailing-list manager—used by python-ideas and others—had a lengthy reply that acknowledged the longtime tension between forums and mailing lists. Like the subject of the PEP, discussion media has been batted around in the Python community for quite some time. According to Turnbull, it boils down to this:

The main advantage to using email is that management of multiple high-traffic streams requires smart, highly configurable clients. These have been available for mail since the late 80s. They've only gotten better since. Forums, on the other hand, are based on thin clients (ie web browsers), and so you get only the configuration the forum offers, and you have to tell the forum all about it. Of course we are hackers: all the main browsers are scriptable, the language is javascript, so we could write our own forum clients. But that's exactly what forum advocates want to avoid, at least if you call it "email".

The problem for mail advocates is the flip side of that. That is, AFAIK there is no good mail software that runs in a browser (let alone a smartphone), and change of mail clients, like religions and programmer's editors, is advocated only at risk to limb and life.

Oleg Broytman also offered up his analysis of web forums versus mailing lists in the thread; it reiterates some of what Turnbull said while adding some other points. Forums may have a lower barrier to entry, but it simply isn't possible for everyone to apply their own customizations to any particular forum installation—not to mention trying to get a single, cohesive "look and feel" across the many different forum implementations available. Email may lack for some things, but it is a simple format that has a myriad of ways to access it—including from the web.

But there is a hybrid solution available: Mailman 3 (MM3). Currently python-ideas is managed by Mailman 2, but some smaller Python mailing lists have made the switch and python-ideas will likely follow soon. Switching to MM3 brings the HyperKitty archiver along with it. HyperKitty provides a web-forum-style archiver, where users can browse, post, and reply to mailing list messages. As Guido van Rossum said: "I think this is the best of both worlds."

Nick Coghlan suggested that python-ideas be used as a guinea pig for migrating some of the larger mailing lists (including, eventually, python-dev) to MM3. Few seemed opposed to the idea, though it will bifurcate the archives; after the switch, all of the newly archived messages will have a different URL. Someone could perhaps convert the old archives to the new (while preserving the existing links) at some point.

This is a discussion that goes on in various parts of our communities with some frequency. We looked at a similar discussion regarding python-ideas back in February 2016. We will certainly see it again. Email is increasingly falling out of favor as a communications mechanism, so the push for alternatives is only likely to grow stronger over time. It is not that hard to imagine that new developers in five or ten years (or ...) will not have even heard of email except as something the old-timers talk about—like gopher or rotary telephones.

Meanwhile, back at the PEP. Angelico posted a second version based on the feedback he had received. It provided more examples, a list of syntax alternatives, and a discussion of the consequences of Python execution order. Several people in both threads thanked Angelico for creating the PEP as a place to hold the various pieces of the discussion of this feature. The clear implication is that many think PEP 572 will be rejected, but that it is valuable nevertheless. Angelico will not be surprised if that happens:

That's fine :) I'm under no illusions that everyone will adore this proposal (hey, I'm no more than +0.5 on it myself). The greatest goal of this PEP is to record the arguments for and against each of the viable alternatives, as a long-term document.

Most of those commenting on the revision are either against the idea of SLNBs entirely or would like to see a different syntax used to specify them. There is no major agreement on which of the half-dozen alternatives is best, though, so it is a little hard to see Van Rossum (who has been conspicuously silent on the feature) jumping in to resolve the syntax question in some kind of positive ruling on the PEP. A much more likely outcome is that PEP 572 gets rejected, but that python-ideas moves in a direction that makes it more friendly for those who prefer forums for discussion. The next time the SLNB feature rears its head in email or on the web, though, folks can point to PEP 572 and not have to rehash it all again.

Comments (16 posted)

An introduction to RISC-V

March 14, 2018

This article was contributed by Richard W.M. Jones

LWN has covered the open RISC-V ("risk five") processor architecture before, most recently in this article. As the ecosystem and tools around RISC-V have started coming together, a more detailed look is in order. In a series of two articles, I will look at what RISC-V is and follow up with an article on how we can now port Linux distributions to run on it.

The words "Free and Open RISC Instruction Set Architecture" are emblazoned across the web site of the RISC-V Foundation along with the logos of some possibly surprising companies: Google, hard disk manufacturer Western Digital, and notable ARM licensees Samsung and NVIDIA. An instruction set architecture (ISA) is a specification for the instructions or machine code that you feed to a processor and how you encode those instructions into a binary form, along with many other precise details about how a family of processors works. Modern ISAs are huge and complex specifications. Perhaps the most famous ISA is Intel's x86 — that specification runs to ten volumes.

More importantly, ISAs are covered by aggressive copyright, patent, and trademark rules. Want to independently implement an x86-compatible processor? Almost certainly you simply cannot do that without making arrangements with Intel — something the company rarely does. Want to create your own ARM processor? You will need to pay licensing fees to Arm Holdings up front and again for every core you ship.

In contrast, open ISAs, of which RISC-V is only one of the newest, have permissive licenses. RISC-V's specifications, covering the user-space instructions and the privileged instructions, are licensed under a Creative Commons license (CC BY 4.0). Furthermore, researchers have determined that all RISC-V instructions have prior art and are now patent-free. (Note this is different from saying that implementations will be open or patent-free — almost certainly the highest-end chips will be closed and implementations patented). There are also several "cores" — code that compiles to Verilog and can be programmed into an FPGA or (with a great deal more effort) made into a custom chip — licensed under the three-clause BSD license.

Unlike earlier open ISAs, RISC-V's main features are that it is scalable and that it is primarily a specification that allows for multiple implementations. RISC-V starts with a choice of 32-, 64-, or 128-bit integer-only specifications that we call "RV32I", "RV64I", or "RV128I". (I'm not going to cover the 128-bit ISA any further in this article because it is still in the design phase and there is only one software implementation, written by the inimitable Fabrice Bellard.) The "I" stands for "integer" and includes the basic processor features like loads, stores, jumps, and integer arithmetic. The architecture, however, is scalable, and other extensions are common. Most Linux-capable RISC-V chips will be "RV32IMAFDC" or "RV64IMAFDC", where the letters mean:

    I   Integer and basic instructions
    M   Multiply and divide
    A   Atomics
    F   IEEE floating point (single precision)
    D   IEEE floating point (double precision)
    C   Compressed instructions

For convenience "IMAFD" can be written "G" (for "general purpose") and so you will more commonly see those chips described as "RV32GC" or "RV64GC".

Most Linux-capable designs have skipped 32-bit variants entirely; in the second article I will describe Fedora on RISC-V, which is entirely concentrating on RV64GC. For completeness I should also say there is a cut-down embedded specification called "RV32E" that has half the number of general-purpose registers but is otherwise identical to RV32I. Since RV32E machines are likely to have only a few kilobytes of RAM and lack a "supervisor" mode, they are unlikely to ever run Linux.

RISC-V has 31 general purpose registers (15 for RV32E), approximately double the number visible to the programmer on x86-64. This simple unoptimized loop counting to 1000 demonstrates some features of the instruction set:

       - binary -    - mnemonic -

        fe042623    sw     zero,-20(fp) # store zero into stack slot
        a031        j      L2           # compressed jump
    L1:
        fec42783    lw     a5,-20(fp)   # load stack slot into a5
        2785        addiw  a5,a5,1      # compressed increment
        fef42623    sw     a5,-20(fp)   # store back to stack
    L2:
        fec42783    lw     a5,-20(fp)   # load stack slot into a5
        0007871b    sext.w a4,a5        # sign extend a5 into a4
        3e700793    li     a5,999       # load immediate
        fee7d5e3    ble    a4,a5,L1     # compare and branch

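A plausible C source for such a loop (a hypothetical reconstruction; the original source is not shown above) is:

    /* hypothetical reconstruction of the counting loop; compiled without
     * optimization, the counter stays in a stack slot, as in the listing above */
    void count_to_1000(void)
    {
        int i;

        for (i = 0; i < 1000; i++)
            ;   /* empty body: the loop only counts */
    }
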
Registers are named x1 through x31 (with x0 being logically wired to zero), but the assembler provides a set of names like a0-a7 for function arguments and return values, t0-t6 for temporaries, fp for the frame pointer, sp for the stack pointer, zero for the zero register, and others. These are just aliases for the x-names. The floating-point extensions (if present) add 32 more registers, and it is expected that future extensions like vectorization will add more.

Instructions are variable length, with the basic length being 32 bits. Many common instructions can be compressed to 16 bits when using the compressed extension (that is expected to be present in all Linux-class chips). Longer instructions are possible too, with the more obscure extensions expected to use them. Unlike x86, variable length does not have to mean "horribly complex to decode". The encoding ensures that the processor can easily see the length of every instruction in its prefetch queue by decoding a few bits in a uniform location. This is even the case where the code is using extensions that the processor does not understand (e.g. for handing them off to a co-processor or to trap and emulate them).
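
As a rough sketch of that uniform length encoding (based on the base instruction-length rules, not on any particular implementation's code), the length of an instruction can be read from the low bits of its first 16-bit parcel:

    /* sketch: decoding the instruction length from the first 16-bit parcel */
    #include <stdint.h>

    static int insn_length(uint16_t parcel)
    {
        if ((parcel & 0x3) != 0x3)
            return 2;    /* compressed 16-bit instruction */
        if ((parcel & 0x1c) != 0x1c)
            return 4;    /* standard 32-bit instruction */
        return 0;        /* longer encodings are signalled by further low bits
                            (not handled in this sketch) */
    }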

Although the architecture is (by design) simple, boring, and similar to others that have gone before, one interesting area is the approach to complex instructions such as specialized instructions for string handling, video decoding, or encryption. Some of these may be implemented in future extensions. For others, the designers have expressed a preference not to add complex instructions to the specification but instead to rely on macro-op fusion for performance. (Note there is a patent claim on a limited version of this technique, although it expires in 2022.) Processors are expected to detect sequences of simpler instructions that together perform some complex operation (e.g. copying a string) and fuse them together at run time into a single more efficient macro operation. How this wish will meet reality is yet to be seen, but it does mean that, for now, writing a RISC-V emulator is relatively easy because there are only simple instructions.

To make a real computer you need a lot more than just a core, and RISC-V is at least beginning to supply more of those pieces. Code is available for an L1 cache; a cache-coherence and inter-core communication protocol called TileLink; ChipLink, an inter-socket version of TileLink; an external hardware debugging interface; and the beginnings of an interrupt controller. But there are many missing pieces: everything from DDR4 interfaces for memory, to Ethernet, to GPUs. In the first silicon, and perhaps for a long time to come, these will all be proprietary even if paired with open-source CPUs.

Linux kernel 4.15 added basic RISC-V support, which is sufficient to boot but not much else (there are no interrupts and hence no significant device support). For now you have to use the out-of-tree riscv-linux kernel, although it is expected that most things will be upstream by 4.17. GCC and binutils support has been upstream for over a year, but at least GCC 7.3.1 and binutils 2.30 are recommended.

The final missing piece for Linux was a stable glibc ABI, which was added in February 2018 with glibc 2.27. This allows Linux distributions to start to compile packages, knowing that we won't have to recompile everything from scratch if there's a change to the glibc ABI.

And finally, where can you get RISC-V hardware to run Linux on? At the time of this writing almost no hardware is available. A few lucky people have SiFive's HiFive Unleashed development board, which has four 64-bit application cores (RV64GC) plus a power-management core (RV32IMAC), but it costs at least $999. However, there is support in QEMU 2.12 that can be used to run Fedora. There are also plenty of FPGA implementations, although you will find that they run much more slowly than QEMU and have limited RAM and device support.

It's expected that the hardware landscape will change quickly in the coming year, with much cheaper iterations of the HiFive Unleashed and several other companies announcing hardware. One surprise, though: you might have a RISC-V chip in your PC in the near future. Western Digital has announced that it will transition the cores used in its hard disks and other storage devices to RISC-V; currently it ships over a billion cores each year.

Look for the second article in this series, where I will cover how Fedora was ported to RISC-V.

Comments (14 posted)

Variable-length arrays and the max() mess

By Jonathan Corbet
March 12, 2018
Variable-length arrays (VLAs) have a non-constant size that is determined (and which can vary) at run time; they are supported by the ISO C99 standard. Use of VLAs in the kernel has long been discouraged but not prohibited, so there are naturally numerous VLA instances to be found. A recent push to remove VLAs from the kernel entirely has gained momentum, but it ran into an interesting snag on the way.

A VLA is simply an array that is declared with a non-constant dimension. For example, btree_merge() begins like this:

    int btree_merge(struct btree_head *target, struct btree_head *victim,
		    struct btree_geo *geo, gfp_t gfp)
    {
	unsigned long key[geo->keylen];
	unsigned long dup[geo->keylen];

The length of both the key and dup arrays is determined by the value stored in the keylen field of the passed-in geo structure. The compiler cannot know what that value will be at compile time, so those arrays must be allocated at run time. For this reason, VLAs must be automatic variables, allocated on the stack.

As an extension to the language, GCC also allows the use of VLAs within structures. The LLVM Clang developers have refused to add this extension, though; that has led developers interested in building the kernel with Clang to work to remove VLAs from structures in the kernel. C99 VLAs are supported by Clang, though, so developers working on Clang compatibility have not paid much attention to them.

Since VLAs are standard C, one might wonder why there is a desire to remove them. One reason is that they add a bit of run-time overhead, since the size of a VLA must be calculated every time that the function declaring it is called. But the bigger issue has to do with stack usage. Stacks in the kernel are small, so every kernel developer must be constantly aware of how much stack space each function is using. VLAs, since they are automatic variables whose size is determined at run time, add a degree of uncertainty to a function's stack usage. Changes in distant code might result in a VLA growing in surprising ways. If an attacker finds a way to influence the size of a VLA, the potential for all kinds of mischief arises.

For these reasons, Linus Torvalds recently declared that "using VLA's is actively bad not just for security worries, but simply because VLA's are a really horribly bad idea in general in the kernel". That added some energy to the VLA-removal work that was already underway. In the process, though, Kees Cook discovered an interesting surprise: a number of VLAs in the kernel are not actually variable in length and were never meant to be seen as such by the compiler.

Accidental VLAs and the difficult story of max()

A useful tool in the quest to remove VLAs from the kernel is the GCC -Wvla option, which issues warnings when VLAs are declared. Cook found, though, that it was issuing warnings for arrays that were meant to be of constant size; one of them was this bit of code from lib/vsprintf.c:

    #define RSRC_BUF_SIZE	((2 * sizeof(resource_size_t)) + 4)
    #define FLAG_BUF_SIZE	(2 * sizeof(res->flags))
    #define DECODED_BUF_SIZE	sizeof("[mem - 64bit pref window disabled]")
    #define RAW_BUF_SIZE	sizeof("[mem - flags 0x]")
	char sym[max(2*RSRC_BUF_SIZE + DECODED_BUF_SIZE,
		     2*RSRC_BUF_SIZE + FLAG_BUF_SIZE + RAW_BUF_SIZE)];

The length of sym is clearly constant and can be determined at compile time, but GCC warns that sym is a VLA anyway. The problem turns out to be the kernel's max() macro, which generates an expression that is not recognized as constant by the compiler.

If one looks on page 87 of the original edition of The C Programming Language by Kernighan and Ritchie, one will find the classic definition of the max() macro:

    #define max(A, B)  ((A) > (B) ? (A) : (B))

There are some problems with this definition that make it unsuitable for kernel use, including the double-evaluation of the arguments and the lack of any sort of type checking. So a lot of effort has gone into the kernel's special version of max():

    #define __max(t1, t2, max1, max2, x, y) ({		\
	t1 max1 = (x);					\
	t2 max2 = (y);					\
	(void) (&max1 == &max2);			\
	max1 > max2 ? max1 : max2; })

    #define max(x, y)					\
	__max(typeof(x), typeof(y),			\
	      __UNIQUE_ID(max1_), __UNIQUE_ID(max2_),	\
	      x, y)

For the curious, the outer max() macro uses __UNIQUE_ID() to generate two unique variable names. Those names, along with the types of the two values and the values themselves, are passed to __max(). That macro declares two new variables (using the unique names) and assigns the passed-in values to them; this is done to prevent x and y from being evaluated more than once. Pointers to those two variables are then compared while discarding the result; this is done to force the compiler to issue an error if the two operands do not have compatible types. Finally, the two values themselves are compared to determine which one is greater.
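
The double-evaluation problem that those temporary variables avoid is easy to demonstrate with the naive macro (a small, hypothetical example):

    #include <stdio.h>

    #define NAIVE_MAX(A, B)  ((A) > (B) ? (A) : (B))

    int main(void)
    {
        int i = 5;
        int m = NAIVE_MAX(i++, 3);    /* i++ is evaluated twice: once in the
                                         comparison, once to produce the result */

        printf("m=%d i=%d\n", m, i);  /* prints "m=6 i=7", not "m=5 i=6" */
        return 0;
    }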

It was initially assumed that GCC was simply not smart enough to understand that the result of this whole mess was still constant, so it created a VLA instead. Torvalds eventually figured out the real problem, though: the C standard makes a distinction between a "constant value" and a "constant expression". Array dimensions are required to be constant expressions, but the max() macro does not qualify as such. As Torvalds pointed out, the warning from GCC is thus somewhat misleading; while the compiler is emitting the VLA code for these arrays, they are not actually variable in length, and the problem is more a matter of syntax.
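
A minimal, hypothetical illustration of that distinction: within a function, GCC accepts a statement expression as an array dimension, but it is not a constant expression in the standard's sense, so the array is treated as a VLA and -Wvla complains even though its size is known at compile time:

    /* both arrays are 20 bytes, but only the first uses a constant expression */
    void example(void)
    {
        char plain[10 > 20 ? 10 : 20];                          /* ordinary array */
        char tricky[({ int a = 10, b = 20; a > b ? a : b; })];  /* warns with -Wvla */

        (void) plain;
        (void) tricky;
    }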

Regardless of the reason for their creation, it clearly would be a good thing to get rid of these "accidental" VLAs. What followed was an effort to rewrite max(); it was the sort of quest that the kernel community is so good at mustering. At one point, Cook came up with this:

    #define __single_eval_max(t1, t2, max1, max2, x, y) ({	\
 	t1 max1 = (x);					\
 	t2 max2 = (y);					\
 	(void) (&max1 == &max2);			\
 	max1 > max2 ? max1 : max2; })

    #define __max(t1, t2, x, y)						\
	__builtin_choose_expr(__builtin_constant_p(x) &&		\
			      __builtin_constant_p(y),			\
			      (t1)(x) > (t2)(y) ? (t1)(x) : (t2)(y),	\
			      __single_eval_max(t1, t2,			\
						__UNIQUE_ID(max1_),	\
						__UNIQUE_ID(max2_),	\
						x, y))

    #define max(x, y)	__max(typeof(x), typeof(y), x, y)

Essentially, this version uses more macro trickery to short out much of the max() logic when the two operands are constants. It worked well — on recent compilers. The results turned out not to be so pleasing on older compilers, though. Initially there was some thought of revisiting the discussion on deprecating support for older compilers, but then GCC 4.8.5 was reported to fail as well. At that point, Cook threw up his hands and gave up on creating a better max(). The solution going forward is likely to be what he suggested when he first discovered the problem: create a new SIMPLE_MAX() macro that is really just the original Kernighan and Ritchie max() with a different name. It is good enough for constant array dimensions.
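
Such a macro would presumably look something like the following (a sketch of the suggestion, not code from the kernel):

    /* sketch of the suggested SIMPLE_MAX(): the classic, double-evaluating
     * macro, which is fine when both operands are compile-time constants */
    #define SIMPLE_MAX(x, y)  ((x) > (y) ? (x) : (y))

Used in the sym declaration shown earlier, it would keep the array's constant size without tripping -Wvla.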

Finishing the job

Now that this discussion has run its course, the process of eliminating all of the non-accidental VLAs can continue. There are just over 200 of them in the kernel; many of them are easily replaced with ordinary fixed-length arrays. Several developers have posted patches eliminating VLAs from various kernel subsystems. Getting rid of all of them will probably take some time; a few instances may require fairly deep analysis to determine the real maximum length and, perhaps, internal API changes. Eventually, though, we will get to a point where a -Wvla build generates no warnings.

Comments (30 posted)

Designing ELF modules

By Jonathan Corbet
March 13, 2018
The bpfilter proposal posted in February included a new type of kernel module that would run as a user-space program; its purpose is to parse and translate iptables rules under the kernel's control but in a contained, non-kernel setting. These "ELF modules" were reposted for review as a standalone patch set in early March. That review has happened; it is a good example of how community involvement can improve a special-purpose patch and turn it into a more generally useful feature.

ELF modules look like ordinary kernel modules in a number of ways. They are built from source that is (probably) shipped with the kernel itself, they are compiled to a file ending in .ko, and they can be loaded into the kernel with modprobe. Rather than containing a real kernel module, though, that .ko file holds an ordinary ELF binary, as a user-space program would. When the module is "loaded", a special process resembling a kernel thread is created to run that program in user mode. The program will then provide some sort of service to the kernel that is best not run within the kernel itself.

In general, the community's reaction to this feature may have been expressed best by Greg Kroah-Hartman: "this is crazy stuff, but I like the idea and have no objection to it overall". ELF modules give the kernel a controlled way to run user-space helper code, and they make it easy to develop and distribute that code with the kernel itself. That latter aspect, in particular, distinguishes ELF modules from the existing "usermode helper" mechanism, which depends on programs developed and shipped separately from the kernel. It's clear that some developers see uses for this feature beyond the bpfilter subsystem, and would like for those uses to be supported as well.

Beyond rule translation

Consider, for example, one branch of the discussion where Andy Lutomirski raised concerns that the current implementation might break systems that load an ELF module during system boot. Alexei Starovoitov, the author of the patches, responded: "There is no intent to use umh modules during boot process. This is not a replacement for drivers and kernel modules". Instead, he said, this feature is aimed at one specific use: converting iptables rules to BPF programs. But some developers, including Kroah-Hartman, are clearly looking further ahead:

You are creating a very generic, new, user/kernel api that a whole bunch of people are going to want to use. Let's not hamper the ability for us all to use this right from the beginning please.

In particular, he sees uses for these modules as a way to implement USB drivers in user space, perhaps bringing some existing user-space drivers into the kernel tree in the process.

Making ELF modules serve the more general use case may require a number of changes to the patch set. As Linus Torvalds pointed out, there is a significant difference between standard kernel modules and the current implementation of ELF modules. When the process of loading a standard module completes, that module has registered itself with all of the requisite subsystems and is ready to respond to requests from the kernel or user space. The end of the loading process for an ELF module, though, only indicates that the program in the module has started executing. It may not yet be ready to answer requests or provide services and, should something go wrong in its initialization process, it may crash and never get to that point.

The answer to this problem (and a couple of others), according to Torvalds, is to make the execution of ELF modules synchronous, in that a modprobe invocation would not complete until the process that was started to run the module's code has exited. For short-duration tasks, the final exit status could reflect the success of the operation itself, which is not possible in the current implementation. For a long-running module, the code could fork and return a success status once initialization is complete, giving a clear indication that the module is ready to do its work.
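
For the long-running case, that pattern is just the familiar daemon-style fork; a minimal, hypothetical sketch of an ELF module's user-space side might look like this:

    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
        /* ... perform initialization here; on failure, return nonzero so that
         * a synchronous modprobe would report the error ... */

        pid_t pid = fork();
        if (pid < 0)
            return EXIT_FAILURE;    /* could not detach: report failure */
        if (pid > 0)
            return EXIT_SUCCESS;    /* parent exits: the module is initialized */

        /* child continues as the long-lived module process */
        for (;;)
            pause();
    }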

Some other changes would be required to make ELF modules suitable for other use cases. Currently there is no means of communication between the module and the kernel beyond the standard system calls. If ELF modules are to be used for tasks like driving a new device, there will need to be a way to pass control of that device to the module from the kernel, among other things. A number of these issues could apparently be handled by opening a pipe between the kernel and the module when it is launched and using it for communications between the two.

A trickier problem may have to do with modules that need some sort of filesystem access to operate. The access itself can be provided, but it can be difficult to write such code in a way that doesn't assume some sort of filesystem layout (the existence and contents of /dev, for example) in the underlying system. The kernel tries hard not to impose such policies on user space, and nobody would like to see that change with ELF modules.

Security concerns

Another issue that came up in the conversation is security. Kees Cook argued that there were a number of security issues with ELF modules. They run with full privileges regardless of the privilege level of the process that caused them to be loaded, and they run in the root namespace even if they were loaded in response to a request from inside a container. Most of the security concerns have been pushed aside for a simple reason: standard kernel modules run with full privileges inside the kernel itself. Even a process running as root is not as privileged as a normal kernel module, so it is unlikely that adding this feature will make the system less secure, especially if module signing is used to limit the modules that can be loaded.

One interesting exception did turn up later in the conversation, though. As Torvalds pointed out, there is a race window between the time that the module signature is checked and when the code is actually loaded into memory and executed; an attacker with the CAP_SYS_MODULE capability could exploit this window to replace the code between those two steps. That escalates the ability to run an existing, signed module into the ability to run arbitrary code as root. One way of addressing this issue would be the synchronous behavior described above. The kernel could take control of the file containing the module, marking it as non-writable, for the duration of the module's execution.

Another possible solution would be to load the code into kernel memory first, perform the check, then execute from that copy of the code. Lutomirski, in a separate part of the discussion, had suggested a mechanism where the code would be stored as a binary blob within a standard kernel module; the kernel would then execute the contents of the blob after loading the module. This approach, too, would avoid the race window described above. It would also make the ELF-module functionality work in non-modular kernels (assuming the module is built in, of course) and enable tighter integration with the rest of the kernel.

The downside of these approaches is that they load the module code into kernel memory, which is not pageable. For tiny modules that would not be a problem, but ELF modules, like other kernel code, seem likely to grow over time. Lutomirski suggested that the module code could be backed up by a tmpfs filesystem; Kroah-Hartman responded that it would be "tricky" but that it could be a good solution. "Micro-kernel here we come!" But no such implementation exists now.

There were few solid conclusions from the discussion, due in part, at least, to a general hostility to the changes on Starovoitov's part. Some of that is understandable; it can be frustrating to create a mechanism to solve a specific problem, only to be told that it needs to be generalized so that it is better suited to unrelated problems as well. But the kernel exists to address the entire community's problems, so this process of making features more generally useful is a vital part of the kernel's long-term success. At least some of the points raised in the discussion will need to be addressed before ELF modules can find their way into the mainline kernel.

Comments (10 posted)

Time-based packet transmission

By Jonathan Corbet
March 8, 2018
Normally, when an application sends data over the network, it wants that data to be transmitted as quickly as possible; the kernel's network stack tries to oblige. But there are applications that need their packets to be transmitted within specific time windows. This behavior can be approximated in user space now, but a better solution is in the works in the form of the time-based packet transmission patch set.

There are a number of situations where outgoing data should not necessarily be transmitted immediately. One example would be any sort of isochronous data stream — an audio or video stream, maybe — where each packet of data is relevant at a specific point in time. For such streams, transmitting ahead of time and buffering at the receiving side generally works well enough. But realtime control applications can be less flexible. Commands for factory-floor or automotive systems, for example, should be transmitted within a narrow period of time. Realtime applications can wait until the window opens before queuing data for transmission, of course, but any sort of latency that creeps in (due to high network activity, for example) may then cause the data to be transmitted too late.

Naturally, the network-standards community has been working on solutions for this particular problem; one of them is called P802.1Qbv. Should that name prove to be a mouthful, there is the more concise alternative: "Standard for Local and Metropolitan Area Networks-Media Access Control (MAC) Bridges and Virtual Bridged Local Area Networks Amendment: Enhancements for Scheduled Traffic". It defines a mechanism for the draining of queues of packets such that each packet is transmitted by its specific deadline. When P802.1Qbv is in use, applications can queue packets whenever they are ready, but those packets will not actually hit the wire until their deadline approaches.

The patch set implementing time-based transmission on Linux has a few separate components to it. The first is an API addition to allow applications to request this behavior. That is done by setting the new SO_TXTIME option with the setsockopt() system call. Packets intended for timed transmission should be sent with sendmsg(), with a control-message header (of type SCM_TXTIME) indicating the transmission deadline as a 64-bit nanoseconds value.

There are a couple of other control-message parameters that can be set with sendmsg(). SCM_DROP_IF_LATE instructs the network stack to simply drop a packet if, for some reason, it cannot be transmitted by the given deadline. The SCM_CLOCKID message can be used to specify which clock should be used for packet timing; the default is CLOCK_MONOTONIC. This parameter does not appear to actually be used in the current implementation, though, with one small exception described below.
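
Putting the pieces together, the send path using this API might look roughly like the following sketch; it follows the interface as described in the patch set, so the constant names (and the placeholder value used when the installed headers lack them) are assumptions that could change before merging:

    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    #ifndef SO_TXTIME
    #define SO_TXTIME   61          /* placeholder: not yet in installed headers */
    #define SCM_TXTIME  SO_TXTIME
    #endif

    /* send one packet with a transmission deadline given in nanoseconds on the
     * relevant clock (CLOCK_MONOTONIC by default); SO_TXTIME must already have
     * been enabled on the socket with setsockopt() as described above */
    static ssize_t send_at(int sock, const void *buf, size_t len, uint64_t txtime_ns)
    {
        char cbuf[CMSG_SPACE(sizeof(txtime_ns))];
        struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
        struct msghdr msg = {
            .msg_iov = &iov,
            .msg_iovlen = 1,
            .msg_control = cbuf,
            .msg_controllen = sizeof(cbuf),
        };
        struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);

        cm->cmsg_level = SOL_SOCKET;
        cm->cmsg_type = SCM_TXTIME;
        cm->cmsg_len = CMSG_LEN(sizeof(txtime_ns));
        memcpy(CMSG_DATA(cm), &txtime_ns, sizeof(txtime_ns));

        return sendmsg(sock, &msg, 0);
    }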

These changes to the core network stack allow the specification of time-based behavior, but the core itself does not implement that behavior. That, instead, is an add-on feature. One way to get it is with the tbs queuing discipline, which is also part of the patch set. It can be configured to use time-based scheduling on a specific queue, with a couple of additional parameters. Here, too, the clock ID can be specified; if the clock ID also appears in individual packets, the two must match or the packets will be dropped. There is also a delta parameter to configure how far in advance of the deadline each packet should be sent to the network interface for transmission. This parameter and the deadline for each packet thus define the window in which the packet should hit the wire.

The delta and the SCM_DROP_IF_LATE flag can be used to obtain two distinctly different behaviors. If the flag is set and delta is reasonably large, the semantics are that the packet must be transmitted before the given deadline. Instead, with a small (or zero) delta and with SCM_DROP_IF_LATE not set, the behavior is to not transmit the packet until after the given deadline.

The tbs queuing discipline, by itself, is a "best-effort" implementation, since there is still the possibility that packets could be delayed after tbs releases them to the interface. The real intent behind P802.1Qbv, however, appears to be implementation in the network adapters themselves. If the adapter is aware of packet deadlines, it can schedule its own transmission activities to ensure that the packets hit the wire at the right time.

The tbs queuing discipline thus supports offloading time-based transmission to the hardware; the patch set includes an implementation for the Intel igb Ethernet driver. In a full offload scenario, the delta and clock-ID parameters are not used; instead, all deadlines are assumed to be relative to the clock running within the adapter itself, so the adapter takes full responsibility for packet timing. If those parameters are specified, instead, tbs will sort the packets and send them to the interface at the beginning of the transmission window, with the interface still taking responsibility for getting them out before the deadline passes. Since this mode uses both a kernel-based clock and the adapter's own clock, the two must be running in sync or the results will not be as desired.

The patch set is now in its third revision; the initial version was posted by Richard Cochran but it is now being posted by Jesus Sanchez-Palencia, who has made a number of changes and added the hardware offload capability. There is still some disagreement over how the API should work and, in particular, if the ability to specify different clocks is really needed. Storing a clock ID with each packet makes the network stack's sk_buff structure larger, which is something that the networking developers have been resisting strongly for some time now. Working that out is likely to take at least one more revision, so it's not clear if this patch set will be ready by the 4.17 merge window or not.

Comments (13 posted)

JupyterLab: ready for users

March 13, 2018

This article was contributed by Lee Phillips

In the recent article about Jupyter and its notebooks, we mentioned that a new interface, called JupyterLab, existed in what its developers described as an "early preview" stage. About two weeks after that article appeared, Project Jupyter made a significant announcement: JupyterLab is "ready for users". Users will find a more integrated environment for scientific computation that is also more easily extended. JupyterLab takes the Jupyter Notebook to a level of functionality that will propel it well into the next decade—and beyond.

While JupyterLab is still in beta, it is stable and functional enough to be used in daily work, and steadily approaching a 1.0 release. From the point of view of developers working on extensions or other projects that use the JupyterLab API, however, the beta status serves as a caution that its developer interfaces are still in flux; they should plan for the possibility of breaking changes.

JupyterLab arose in 2015 from the desire to incorporate the "classic" (as it is known now) Jupyter Notebook into something more like an integrated development environment running in the browser. In addition, the user was to have the ability to extend the environment by creating new components that could interact with each other and with the existing ones. The 2011 web technology that the Jupyter Notebook was built upon was not quite up to this task. Although existing JavaScript libraries, such as React, suggested a way forward, none of them had the power and flexibility, particularly in the area of interprocess communication, that was required. The JupyterLab team addressed this problem by developing a new JavaScript framework called PhosphorJS. JupyterLab and PhosphorJS are co-developed, with capabilities added to the JavaScript framework as they are needed for JupyterLab.

One drawback to the new, rewritten Notebook framework is that, while it is mostly backward compatible with existing Jupyter Notebooks, most JavaScript extensions will not work. This is because the Notebook component of JupyterLab has a new and incompatible extension mechanism, based on TypeScript, which is a superset of JavaScript that features optional type annotations and compiles to JavaScript. However, some JavaScript extensions, such as the interactive widgets that were demonstrated in the previous article, already have JupyterLab versions; note that Node.js is required to install them.

For those interested in more backstory, Karlijn Willems provides a good rundown of the history of the Jupyter Notebook and its relationship with IPython and similar notebook projects, from both the commercial and free-software worlds.

Using JupyterLab

Let's go through a brief rundown of JupyterLab's main features, to provide an idea of what it's like to use the new environment. Start JupyterLab by typing jupyter lab at the command line on a system with it installed (more about that below). If you do this from within a directory that contains an existing Jupyter (or JupyterLab) Notebook, JupyterLab will make it convenient to open the Notebook using the file browser. A new tab or window should open in the default browser (recent versions of Firefox, Chrome, and Safari are supported) that contains the JupyterLab environment.

[Startup screen]

If this is the first time starting JupyterLab, the window will appear as in the figure above, with a file browser on the left and the "launcher" on the right. The launcher's contents depend on what kernels are installed. Kernels provide the interface between Jupyter (or JupyterLab) and various programming languages, such as Julia, R, Go, and, of course, Python.

Besides the Python 3 and Python 2 kernels, I have a gnuplot kernel installed. I had been using this kernel with the Jupyter Notebook, and it works flawlessly with JupyterLab.

As can be seen, the launcher allows opening notebooks or consoles for any of the installed kernels. The consoles are REPLs (read-eval-print loops) attached to the given kernels, but gain more power, as we shall see, from their incorporation into the JupyterLab environment, because they can communicate with its other components. You can also launch a terminal, which behaves just like a normal xterm but is embedded in the browser. The "Text Editor" button opens an editor that can have various key-bindings, including Vim and Emacs, applied using the Settings menu. These are not actual instances of Vim, Emacs, etc., but can be useful for simple editing tasks; as is the case with the other components, the editor gains additional usefulness due to its integration with the JupyterLab system.

New notebook features

Opening a notebook presents a familiar vista if you've previously used Jupyter. The JupyterLab Notebook looks and behaves just as the classic notebook, but is more convenient to use. Now cells can be collapsed and expanded individually with a mouse click, or in groups with a selection from the View menu. This menu also has items to collapse or expand all code or all output cells. You can rearrange the order of cells by dragging them with a mouse, or drag cells between notebooks (dragging an input cell takes its associated output cell along with it). The classic notebook required selection, copy, and paste to get the same results.

Copying cells between notebooks is made practical by the ability to arrange components in almost any pattern within the browser window by dragging the component tabs. Notebooks, editors, consoles, and all other components can be arranged horizontally or vertically in any way that is convenient, and resized as well. This allows you, for example, to use one notebook freely as a scratchpad while dragging selected cells to another notebook beside it to construct an orderly narrative. I found the graphical interface for arranging components to be fluid and responsive, with no glitches or delays in redrawing the browser window.

[Tiling components]

The figure above shows an example of tiling the workspace with a notebook, a terminal session running the process monitor htop, and a console running IPython. The console is running an instance of IPython separate from the IPython kernel backing the open notebook. This is made clear in the area on the left, where the Running tab is selected to show all the kernel and terminal sessions backing the components in the workspace. The components can be shut down from the Running tab as well. (You can also quit the browser while keeping kernels running, and reattach to them later, which can be a great convenience.) The left-hand area can be collapsed to get more working space, and any component can be temporarily set to a quasi-full-screen mode by hitting ctrl-shift-return while it has the focus (the same key combination restores the tab arrangement).

You can also create a second, synchronized view of a notebook using the context menu (traditionally accessed with a mouse right-click). By placing the two views side-by-side, you can have access to different parts of a single notebook simultaneously. A related trick is using a cell's context menu to create a new view for the output just of that cell. For certain types of output, this allows you to interact with its representation in a separate panel, while preserving the embedded notebook version unchanged; for example, you can pan and zoom a map, or rotate a 3D model.

A quirk of both the old and new versions of the Notebook is that its cells can be executed in any order, leaving no explicit record of what that order happened to be. Also, there is normally no way to interact with the notebook's kernel, say to check the values of variables, without creating or changing notebook cells. JupyterLab allows you to address these issues by attaching a console to the same kernel backing a notebook, using another context menu, "New Console for Notebook". Now any time a cell is executed, the calculation is recorded in the console, and you can turn to the console to check variables and perform calculations that will not be recorded in the notebook.

File viewers

The first figure above shows that JupyterLab uses special icons in the file viewer for the file types that it knows about, which it identifies from the filename extensions. Aside from the notebook files, JupyterLab has special viewers for a handful of file types, with more likely to be added in short order. In this article we talk about two of these, Markdown and comma separated value (CSV) files (which can actually be tab or semicolon separated, but still need to have the .csv extension).

[CSV viewers]

CSV files provide a good introduction to a concept central to understanding JupyterLab: different views into the same file. The figure above shows the result of opening two views onto the same million-line by six-column CSV file; on the left, an editor, and, on the right, a table viewer. If you change values using the editor view, they are reflected in the table view (after a short delay for this particular file). The table view is high-performance, able to scroll through the file with no hesitation, jumping from the top to the bottom as fast as you can move the mouse. The columns can be resized manually, which is also a smooth and instantaneous operation.

However, some glitches appeared on my machine: the editor view became corrupted when I attempted to scroll to the bottom. While some JupyterLab presentations have shown table views of files with billions and even trillions of rows, I found that attempting to view files with more than about 10 million rows crashed my browser. This may be because I was conducting these experiments on a laptop with "only" 4GB of RAM; it probably indicates that there are still problems in the CSV viewers that need to be addressed, as well.

Another example of multiple views into the same file is JupyterLab's handling of Markdown files. The figure below shows two views of the same Markdown file, an editor on the left and the formatted output on the right. Editing the file produces a fairly rapid result on the right-hand side, which is called the "Markdown preview" in JupyterLab terminology. Observe the equation, which is typeset using MathJax.

[Markdown viewers]

Components can be combined in various other ways. For example, you can attach a kernel to an editor, with another quick selection in a context menu. This opens a console with a REPL talking to the selected kernel. You can then select blocks of code in the editor and hit shift-enter, just as when executing a cell in a notebook. The code is copied to the console and executed in the REPL. If you happen to be editing a Markdown file, code blocks are recognized, so you merely need to have the cursor within a block of code, with no need to select it.

JupyterLab is designed to be extended; new components and file viewers are already appearing. Some examples are a graphical diagram editor and a LaTeX editor with a live preview.

JupyterLab may remind some of the Apple technology from the 1990s called OpenDoc, and similar efforts, which also allowed the user to arrange sundry disparate components within a single frame, each with its own editor and viewer. To the chagrin of many, this was one of several promising technologies killed off by Steve Jobs upon his return to Apple.

Installation

It is unlikely that your distribution's package-management system provides a recent enough version of JupyterLab; its Jupyter Notebook package, which is a prerequisite, may not be sufficiently up to date either. The two main ways of installing this software are either to use the pip command or to download the Anaconda distribution. In keeping with the latest fashion, the latter has its own package-management system; install JupyterLab using the conda installer. You will need some of the components installed with the Jupyter Notebook from version 5.3 or newer. Of course, you also need Python and NumPy (required for all numerical or scientific work with Python), and you may want to install SciPy if you don't already have it.

A 1.0 release of JupyterLab is planned for later this year; shortly after that, development work on the classic Jupyter Notebook will dwindle; it will be deprecated in favor of JupyterLab. Hence if you plan to work with the Notebook for some time, you may be compelled, as a practical matter, to switch to JupyterLab eventually.

The Jupyter Notebook has already won over many scientists and educators because of the ease with which it allows one to explore, experiment, and share. JupyterLab makes the Notebook part of a more complete, powerful, and extensible environment for pursuing computational science and disseminating the results, leaving little doubt that this free-software project will win over an even larger portion of the scientific community. I've tried to give some idea of the power and convenience of the JupyterLab interface, but to really appreciate this technology, you need to try it out yourself. Fortunately, this is easy to do, as it's simple to install and intuitive enough to get started without reading documentation—and it happens to be a great deal of fun.

Comments (3 posted)

Page editor: Jonathan Corbet


Copyright © 2018, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds