
LWN.net Weekly Edition for July 6, 2023

Welcome to the LWN.net Weekly Edition for July 6, 2023

This edition contains the following feature content:

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.


Emacs for Android

By Jonathan Corbet
June 29, 2023
The Emacs editor is not tied to the Linux kernel; indeed, it was created some years before Linux existed. The Emacs code base is intended to be portable, and the editor runs, with varying levels of support, on a wide variety of systems. Recently, an energetic developer has worked to extend the set of supported systems to Android; the result is a working port, but whether that port will be accepted into the Emacs mainline is the topic of ongoing conversation.

On the last day of 2022, Po Lu announced that a preliminary Emacs port had been pushed to a feature branch in the Emacs Git repository. It was "about 14,000 lines of stuff" with basic functionality working; Lu asked for help with some of the remaining issues. Lu came back in January with more suggestions for projects that others could take on. In February, the port was declared to be "more or less feature complete" and, at the beginning of March, Lu requested that this work be merged into the Emacs trunk "in the next couple of days". Emacs maintainer Eli Zaretskii proved reluctant at the time and, as of this writing, that merge has not yet taken place.

[Emacs on Android]

There is a build of the Android Emacs port in the F-Droid repository, but that build is old and Lu does not recommend it. There is also a frequently updated set of binary builds on SourceForge that seems to be the best option for users who do not want to build their own.

That build seemingly works on your editor's Android systems, though it is somewhat painful to use. The icons are too small to see or for the fat-fingered among us to hit reliably. There must surely be a way to obtain finger-twisting three-key chords like M-$ (which runs ispell-word) with the Android keyboard, but it is not obvious and is unlikely to be convenient for regular use. Anybody wanting to make serious use of Emacs on this platform would be well advised to have an external keyboard. Even then, the Android permissions scheme (described as "fascist" by Lu) makes it impossible to access files on the system as a whole.

Still, there seems to be some interest in this port, and Lu is not the only one who would like to see it merged. Zaretskii, though, remains unconvinced; in mid-June, he started a conversation by saying that, while there would be advantages to merging the Android code, there would be a number of disadvantages as well. It would grow the Emacs distribution significantly, introduce Java code into Emacs, complicate a number of existing internal APIs, and, most significantly, would add a dependency on a single maintainer who understands the port and can keep it working. Given these problems, along with the fact that Android is a proprietary platform, he asked, might it not be better to keep Android support out of Emacs?

Lu responded that this port runs on free versions of Android, such as Replicant, as well, while conceding that "those systems are used so rarely that they should not be put into consideration". The NeXTSTEP/macOS port uses Objective-C, so there is a precedent for bringing in a new language that a platform needs, and the Android port has been carefully done to minimize the amount of Java code needed. With regard to maintenance, Lu said that the changes needed are not huge, that there are many free-software developers with Android experience, and that he intends to stick around and keep things working in any case.

The conversation went on at length, including a separate sub-thread on the challenges of contributing to Emacs in general that will not be covered here. Numerous participants said that they would like to have the Android port; Dr. Arne Babenhauserheide, for example, remarked that this would be the best way to make Org Mode available on Android devices. Zaretskii, though, was firm that any discussion of the advantages was off-topic:

I have no doubt whatsoever that having Emacs on Android will benefit Android users; my doubts are whether we as the project should and can take upon ourselves this additional maintenance burden, and promise the Android users that we will maintain, let alone develop, this port for the years to come.

The maintenance concern is not entirely unfounded. A look at the feature/android branch shows 620 commits, 619 of which were written by Lu (the other commit being a typo fix). Lu's requests for help earlier this year, in other words, did not result in any other contributors working on the port. Emacs has a number of other ports that are not seen as being in great shape; that includes the macOS port, the maintainer of which left the community. Adding another poorly supported port would not improve the state of Emacs overall.

Lu is adamant that the Android port would be supported well, though. For Richard Stallman, that is good enough: "But if Po Lu says that he wants to keep working on it and make it good enough to achieve popularity, I suggest that makes the effort of installing it a good investment." Zaretskii, though, repeated his concerns, saying that the Emacs project has been burned this way in the past and may be repeating an old mistake. But he left the door open to merging the Android support anyway:

But maybe I'm the only one who is bothered by the fact that we never, as a project, raise the head above the water level and look farther into the future than just tomorrow or the day after? In which case I'll stop talking about this and accept the fact the others aren't bothered. After all, I don't own this project, I'm just the steward; if the community decides to go with this, the community will bear the consequences, whether good or bad.

A reading of the conversation suggests that, indeed, few people other than Zaretskii are bothered by the prospect of merging a port that subsequently loses its sole maintainer. Perhaps that outcome reflects the fact that the responses were mostly from users of Emacs, while other developers have mostly kept their peace. Maintenance burdens are less daunting when they are shouldered by somebody else, after all. There are risks involved in accepting a large body of new code, but there are also risks inherent in failing to support a popular platform and disappointing a prolific contributor. Which risk the Emacs community will choose remains to be seen.


The first half of the 6.5 merge window

By Jonathan Corbet
June 30, 2023
The first days of the 6.5 merge window have been a bit calmer than usual, with "only" 4,000 non-merge changesets having been pulled into the mainline repository. Those changesets include a fair amount of significant work, though. Read on for LWN's summary of the first set of changes merged for the next major kernel release.

Architecture-specific

  • X86 systems can now parallelize much of the process of bringing up all of the CPUs, reducing the time to get all processors online by as much as a factor of ten.
  • Intel's "Topology Aware Register and PM Capsule Interface" (abbreviated "TPMI") is now supported. This is an interface that provides a better way of managing power-management features.
  • The arm64 permission-indirection extension is now supported. There is no new functionality resulting from this support now, but it is needed for some upcoming features.

Core kernel

  • The io_uring subsystem has gained the ability to store the rings and submission queue in user-space memory, rather than having the kernel allocate that memory. This allows user space to allocate the needed memory as huge pages, hopefully improving performance. This changelog has a little more information.
  • The kernel's Rust support has been upgraded to the Rust 1.68.2 release, the first such upgrade since that support was merged. Other than that, the changes to Rust support were relatively minor this time around; this merge message has the details.
  • The kernel has gained support for unaccepted memory — the protocol by which secure guest systems accept memory allocated by the host. The merged code includes the (somewhat) controversial protocol to automatically accept all provided memory in the firmware when running a guest kernel without support for memory acceptance.
  • The BPF subsystem has gained the ability to attach filter functions to kfuncs; the filter can limit the contexts from which the kfunc can be invoked. The initial use is to restrict callers of bpf_sock_destroy() to programs of the BPF_TRACE_ITER type.
  • Pinning of BPF objects can now be done using O_PATH file descriptors as an alternative to providing the path name for the target directory.

Filesystems and block I/O

  • It is now possible to mount a filesystem underneath an existing mount on the same mount point; this feature is useful for the provision of seamless updates within containers. See this article, this article, and the merge message for details.
  • The new cachestat() system call can query the page-cache state of files and directories, allowing user space to determine which of its file pages are currently in RAM. See this article for details and this commit for a man page.

Hardware support

  • Miscellaneous: Renesas RZ/V2M clocked serial interfaces.
  • Networking: Fintek F81604 USB to 2CAN interfaces, Microchip LAN865x Rev.B0 10BASE-T1S Internal PHYs, Realtek RTL8192FU interfaces, Realtek 8723DS SDIO wireless network adapters, Realtek 8851BE PCI wireless network adapters, and MediaTek SoC Ethernet PHYs.
  • Regulator: TI TPS6287x power regulators, TI TPS6594 power-management chips, Rockchip RK806 power-management chips, and Renesas RAA215300 power-management ICs.

Miscellaneous

  • The nolibc library has gained stack protector support, a number of architecture-specific improvements, and more.

Networking

  • The passing of process credentials, as done with the SCM_CREDENTIALS control message, has been enhanced with a new SCM_PIDFD type. As might be expected from the name, this message passes a pidfd rather than a process ID. There is also a new SO_PEERPIDFD option to getsockopt() that obtains the pidfd of the peer process.
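
As a rough illustration of the second of those interfaces, the sketch below asks a connected Unix-domain socket for a pidfd referring to its peer. The helper name peer_pidfd() is hypothetical, and the fallback value for SO_PEERPIDFD is an assumption taken from the generic socket header for builds against older user-space headers; some architectures number socket options differently. On kernels older than 6.5 the call can be expected to fail, so callers need to check errno.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>

#ifndef SO_PEERPIDFD
# define SO_PEERPIDFD 77   /* assumed asm-generic value for older headers */
#endif

/*
 * Hypothetical helper: return a pidfd referring to the peer of a
 * connected AF_UNIX socket, or -1 with errno set (for example on a
 * kernel that predates the option).
 */
static int peer_pidfd(int sock)
{
    int pidfd = -1;
    socklen_t len = sizeof(pidfd);

    assert(sock >= 0);
    if (getsockopt(sock, SOL_SOCKET, SO_PEERPIDFD, &pidfd, &len) < 0)
        return -1;
    return pidfd;
}
```

Unlike a numeric process ID, the returned descriptor keeps referring to the same process even if the ID is later recycled.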

Security-related

  • The "secretmem" facility, in the form of the memfd_secret() system call, is now enabled by default. This change was made after some research determined that secretmem use does not hurt performance as had been thought.

Internal kernel changes

  • The workqueue subsystem will now automatically detect CPU-intensive work items (defined as running for at least 10ms by default) and mark them. This will prevent such items from blocking the execution of other work items. There is a new configuration debugging option to enable the reporting of CPU-intensive work items detected in this way.
  • The kernel is now built with the -fstrict-flex-arrays=3 compiler option, adding more warnings around the use of flexible arrays. See this article for more details on this work.
  • The new attribute macro __counted_by() can be used to document which field in a structure contains the number of elements stored in a flexible array (in the same structure). The documentation is useful, but it can also eventually be used for bounds checks as well.

The 6.5 merge window can be expected to remain open until July 9. LWN will be back shortly after that with a summary of the changes pulled in the second half; stay tuned.


Documenting counted-by relationships in kernel data structures

By Jonathan Corbet
July 3, 2023
The C language is expressive in many ways, but it still does not have ways to express many of the relationships between fields in a data structure. That gap can be at least partially filled, though, if one is willing to create and use non-standard extensions. The adoption of those extensions, in the form of the __counted_by() macro, has been merged for the 6.5 kernel release, even though the compiler feature it depends on has not yet been finalized.

Flexible arrays (known formally as flexible array members, and distinct from C99 variable-length arrays) are arrays defined at the end of a structure with a length that is only known at run time:

    struct flex {
        int count;
        struct some_item items[];
    };

When a structure of this type is allocated at run time, the number of items to be stored within it will be known; enough memory will be allocated to hold an items array of the expected size. Normally such structures will include a field saying how long the array is; the count field in the above example could be used this way. But there is no way for the compiler (or any other tool) to know about the association between count and the length of items.
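
The allocation pattern just described can be sketched in user-space C. This is a minimal illustration rather than kernel code: struct some_item is given a placeholder body, and flex_alloc() and flex_get() are hypothetical helper names. The bounds check in flex_get() has to be written by hand precisely because nothing tells the compiler that count describes items.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Placeholder element type; the article does not define some_item. */
struct some_item {
    int value;
};

struct flex {
    int count;
    struct some_item items[];   /* flexible array: no size given here */
};

/* Hypothetical helper: allocate a struct flex with room for n items. */
static struct flex *flex_alloc(int n)
{
    struct flex *f;

    /* Reject negative counts and sizes that would overflow size_t. */
    if (n < 0 || (size_t)n > (SIZE_MAX - sizeof(*f)) / sizeof(f->items[0]))
        return NULL;

    /* One allocation holds the header plus the n trailing elements. */
    f = calloc(1, sizeof(*f) + (size_t)n * sizeof(f->items[0]));
    if (f)
        f->count = n;
    return f;
}

/* Hypothetical helper: hand-written bounds check against count. */
static struct some_item *flex_get(struct flex *f, int idx)
{
    assert(f != NULL);
    if (idx < 0 || idx >= f->count)
        return NULL;
    return &f->items[idx];
}
```

Any code that indexes items directly, bypassing a helper like flex_get(), gets no such check.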

Flexible arrays, by their nature, are particularly prone to a number of memory-safety bugs. It is, thus, not surprising that work has been ongoing for some time in the kernel-hardening community to clean up and regularize the code that deals with these arrays in the kernel. As of the 6.5 release, warnings will be generated for code that uses anything other than the standard notation to declare a flexible array (array[] rather than the once-common array[0] or even array[1]). But flexible arrays are still opaque to code that wants to check whether a given reference falls within or outside of the allocated memory, for the simple reason that the actual size of the array is determined at run time and is not known to the compiler or other tools.

That information usually is available, though; it's just that the compiler does not know where to find it. In an attempt to fill in that information, requests were filed with both the GCC and LLVM communities to support a new variable attribute to indicate which structure field contains the length of a flexible array. Using this attribute, the above structure could be declared as:

    struct flex {
        int count;
        struct some_item items[] __attribute__((element_count("count")));
    };

Here, the new element_count attribute says that the length of items (in elements, not bytes) is stored in the field count in the same structure. The compiler can use that information to calculate the size of the array; that, in turn, can be used to provide run-time bounds checking for accesses to the array. The result should be a kernel that is a little harder to exploit and better documentation of how the structure's fields relate to each other.

In the kernel, this new attribute is hidden behind a macro:

    # define __counted_by(member) __attribute__((__element_count__(#member)))

This macro makes the code more concise, which is nice, but it is needed for another reason as well. The actual naming of the element_count attribute is not yet set in stone, and might well change (probably to counted_by) before compilers with support for it are released. Once the name settles down, the macro can be changed to match.
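
Outside the kernel, the same idea can be tried with a macro that degrades to nothing where the compiler lacks the attribute. The sketch below assumes the attribute ends up spelled counted_by and takes a bare member name (the renaming anticipated above) rather than the element_count spelling with a string argument. COUNTED_BY(), struct flex_counted, and the two helpers are hypothetical names chosen to avoid clashing with reserved identifiers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Assumption: the attribute is spelled counted_by and takes a member
 * name. Where the compiler does not know it, the macro expands to
 * nothing and the annotation survives as documentation only.
 */
#if defined(__has_attribute)
# if __has_attribute(__counted_by__)
#  define COUNTED_BY(member) __attribute__((__counted_by__(member)))
# endif
#endif
#ifndef COUNTED_BY
# define COUNTED_BY(member)
#endif

struct flex_counted {
    size_t count;
    int items[] COUNTED_BY(count);   /* length of items is in count */
};

/* Hypothetical helper: allocate room for n items and record the count. */
static struct flex_counted *flex_counted_alloc(size_t n)
{
    struct flex_counted *f;

    if (n > (SIZE_MAX - sizeof(*f)) / sizeof(f->items[0]))
        return NULL;
    f = calloc(1, sizeof(*f) + n * sizeof(f->items[0]));
    if (!f)
        return NULL;
    f->count = n;   /* set the count before any access to items[] */
    return f;
}

/* Hypothetical helper: sum the elements, trusting count as the bound. */
static long flex_counted_sum(const struct flex_counted *f)
{
    long sum = 0;

    assert(f != NULL);
    for (size_t i = 0; i < f->count; i++)
        sum += f->items[i];
    return sum;
}
```

Setting count immediately after allocation matters once checking is enabled, since a bounds-checking compiler would otherwise see a zero-length array.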

Kees Cook, who has done the work of supporting this attribute in the kernel, is ready to go with the next step: annotating over 150 files with the new attribute. Those are the relatively easy cases, found with the Coccinelle tool; others are sure to follow.

Christoph Hellwig, while welcoming the feature in general, worried that it was being introduced too soon:

But this feels a bit premature to me, not only due to the ongoing discussions on the syntax, but more importantly because I fear it will be completely misused before we have a compiler actually supporting available widely enough that we have it in the usual test bots.

Cook answered that he has test systems running with the compiler patches and should be able to catch any incorrect annotations that show up in the near future. Meanwhile, though, he wants to get started marking up the code:

This has been a pain point for years as we've been doing flexible array conversions, since when doing the work it usually becomes clear which struct member is tracking the element count, but that information couldn't be reliably recorded anywhere. Now we can include the annotation (which is the really important part). [...]

But I really want to start capturing the associations _now_, and get us all into the habit of doing it, and I want it to be through some kind of regular syntax (now that there are patches to both GCC and Clang that can validate the results), not just comments.

That reasoning was clearly enough for Linus Torvalds, who pulled this change into the mainline during the 6.5 merge window. This new macro is another example of the kernel community extending the version of the C language it uses in an attempt to address some of C's legendary safety issues. We should all gain a slightly more secure and better documented kernel as a result.


Converting NFSD to use iomap and folios

By Jake Edge
July 4, 2023

LSFMM+BPF

Chuck Lever led a filesystem session at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit on the Linux NFS server, which is also known as NFSD. He wanted to talk about converting the network filesystem to use iomap; that kind of conversion was the topic of the previous session at the summit. Beyond that, he wanted to discuss using folios, which has been a frequent topic at recent LSFMM+BPF gatherings, including this year.

Lever began with the announcement that NFSD is "under new management". Bruce Fields, who had been the maintainer since 2007 or so, has taken a sabbatical from the IT world ("he is well, I am not trying to cover anything up there"). Lever became the maintainer of NFSD for the kernel in January 2022 and Jeff Layton joined him as co-maintainer in July 2022.

The Linux NFSD has some features that no other implementation in the industry has, including NFS over RDMA, with support for "just about any fabric you can imagine"; the NFS client also works over RDMA. Support for NFS v4.2, which is pretty rare in other implementations, is also present; "those are things that we can be proud of and I hope I can extend that winning streak a little bit".

[Chuck Lever]

His first priority is functionality, thus making sure that Linux stays at the top of the list for NFS. Next is security; to that end, he has been working on both GSS/Kerberos and RPC using TLS. The latter is a way to do in-transit encryption of NFS traffic without using Kerberos; the cloud people have been asking for it since 2018 and he thinks the NFSD project is just about in a position to deliver it. His third priority is performance and scalability for the server, which is the topic of the talk. Fourth is the ability to trace the operation of the live server and diagnose problems with it, but without impacting its operation. He is "way into tracepoints" and has been putting them into the server; he has not yet gotten into BPF, though he plans to.

He has gotten some anecdotal reports that NFS reads from the server are slow; for 20 years or so, the server has used a "pipe-splice mechanism" for reads; that mechanism is "poorly documented and we broke it pretty badly last year" in a few different ways. Al Viro broke it with his pipe-iterator work and Lever broke it when he removed some code that had no documentation and looked unnecessary. "Now we know what we need it for", he said with a chuckle.

He has not measured these read performance problems himself, but he would like to pay some attention to them soon. Meanwhile, though, NFSD wants to join with some of the other Linux filesystems to "support folios and iomap and all of those wonderful things". There are some unrelated problems with write performance, he said. Both read and write rely on the struct xdr_buf structure, which he put up as his only "slide"; it is the "basic way that we track the assembly of RPC messages". It contains a pointer to an array of pages for the data, along with two struct kvec entries for the RPC header and the tail information (such as a checksum or padding to a four-byte boundary). There are some other entries to support zero-copy operations as well.
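
The layout described there, and the four-byte padding rule for the tail, can be illustrated with a simplified user-space stand-in. The structure below is an assumption made for illustration only; it mirrors just the fields mentioned in the session and is not the kernel's definition of struct xdr_buf. Here struct iovec plays the role of struct kvec, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Simplified stand-in, NOT the kernel's struct xdr_buf. */
struct xdr_buf_sketch {
    struct iovec head;   /* RPC header */
    void **pages;        /* array of page pointers holding the payload */
    size_t page_len;     /* bytes of payload stored in the pages */
    struct iovec tail;   /* checksum or padding after the payload */
};

/* Number of pad bytes needed to reach the next four-byte boundary. */
static size_t xdr_pad_bytes(size_t len)
{
    return (4 - (len & 3)) & 3;
}

/* Total message length: header, then page payload, then tail. */
static size_t xdr_sketch_len(const struct xdr_buf_sketch *buf)
{
    assert(buf != NULL);
    return buf->head.iov_len + buf->page_len + buf->tail.iov_len;
}
```

A 4097-byte payload, for example, needs three bytes of padding in the tail to bring the message back to a four-byte boundary.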

There is a struct bio_vec entry in the xdr_buf, which was put in when the NFS developers thought that "bio_vecs were the wave of the future". The NFS client uses that entry, but "the server kind of ... doesn't"; one of the things that has stopped him from using the bio_vec in the server is that the APIs for some things, like RDMA, do not support using it. The socket APIs do support bio_vec but he has not made that switch.

iomap

Meanwhile, he has heard that the iomap interface provides a feature that NFSD would like to have: the ability to read a local sparse file without triggering the mechanism that fills in the missing pages with zeroes. In the past, Dave Chinner had told him that reading an unallocated extent (i.e. "hole") in a sparse file will cause the system to allocate blocks on disk to hold the hole and fill it in with zeroes; that is not something that he wants an NFS read to do, especially for large files.

Viro came in over the remote link to ask how a read could cause that behavior; he noted that maybe it is XFS-specific, but that reads should not normally cause blocks to be allocated. Jan Kara said that it was a misunderstanding; the system will not allocate blocks on disk, but it will allocate zero pages in the page cache, which could be avoided using iomap. Lever said that NFSD wants the behavior provided by iomap; there is a "read-plus" operation that can distinguish between data and holes—it is effectively a "sparse read" operation. The client can ask for a range of data and the server can send the data, if it is present, or a compact reply simply telling the client that there is no data on the server in that range (or part of it).

But, Layton said, iomap is something that the underlying filesystem would have to support; NFSD cannot just call directly into iomap. Matthew Wilcox said that filesystems that support iomap will need to indicate that they do and provide operations for NFSD to call. Kara said that it sounded a bit like the existing FIEMAP (and SEEK_HOLE for lseek()) APIs, which could perhaps be used to find holes in files. There are some races with FIEMAP, though, Lever said, which is why NFSD is not using it now.

Ted Ts'o said that there will be filesystems, such as ext4, that are supporting iomap gradually, so they will need a way to say that a given file does not support iomap. Perhaps the iomap operation would just return EOPNOTSUPP or the like and the caller would then have to fall back to using the existing mechanism. The ext4 developers plan to support iomap for the easy cases first, then add it for the more complicated cases.

Lever said that maybe NFSD would just wait until filesystems completely support iomap before trying to use the API, but Ts'o cautioned that there may be a long tail, where 99% of file types are handled just fine. It would be a shame if NFSD could not take advantage of iomap for the vast majority of files on ext4 filesystems, he said. Lever said that there is already a bifurcation in the NFSD read code, because sometimes it can use the pipe-splice mechanism, but sometimes it cannot and an iterator has to be used.

The read-plus operation is going to have to consult the underlying filesystem, so that it can report any holes to the client. Avoiding races in that reporting is desirable. Layton said that an "atomic sparse-read" operation is what is needed; Lever agreed and said that is what he would like to get from iomap.

Wilcox wondered how useful the page cache is for NFSD and whether it could use direct I/O instead. Layton said it was workload dependent and Lever said that there is no easy way for the server to determine whether the page cache is needed for a particular file or workload. He said that there are some other servers that try to make that kind of determination, but that the Linux NFSD always uses the page cache.

Kara asked about the atomicity needed for the sparse read; Layton said that when they had tried to use FIEMAP, the map could change out from under them due to racing with other processes. Viro said that the operation needed to be atomic with respect to hole punching and truncate() at a minimum. Over the Zoom link, Anna Schumaker said that when she encountered the races, she had not actually used FIEMAP but had used lseek() with SEEK_HOLE/SEEK_DATA instead; that is the "same thing" as FIEMAP, though, she and Kara agreed.

Another remote participant, Darrick Wong, asked what would be done with the information about the holes, given that iomap would not give any information about what is or is not in the page cache. Lever said that the server can use the information about where the holes are to read only from places where data is expected and to construct the read-plus reply from that. But Wong cautioned that there may be dirty pages in the page cache that correspond to pages in a hole; the SEEK_HOLE approach would actually notice that was the case, unlike iomap.

Kara said that using SEEK_HOLE was the better interface, but there are race conditions that will need to be handled. The i_version field of the inode could be used to detect that a change has been made. Lever suggested that maybe the read-plus operation would not promise a completely consistent view of the file, but Wilcox did not like that at all.
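
For readers unfamiliar with the interface under discussion, here is a minimal user-space sketch of walking a file's data extents with lseek(). It is not how NFSD does (or would do) this; map_data_extents(), make_sparse_example(), and struct data_extent are hypothetical names, and the fallback definitions of SEEK_DATA and SEEK_HOLE are the Linux values, supplied in case the headers hide them. Nothing prevents the file from changing between the two lseek() calls or after the function returns, which is exactly the race described in the session.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SEEK_DATA
# define SEEK_DATA 3   /* Linux values, in case the headers hide them */
# define SEEK_HOLE 4
#endif

/* One contiguous run of data within a file. */
struct data_extent {
    off_t start;
    off_t len;
};

/*
 * Fill out[] with up to max data extents found in fd. Returns the
 * number of extents stored, or -1 on error. The result is only a
 * snapshot; a concurrent writer can make it stale immediately.
 */
static int map_data_extents(int fd, struct data_extent *out, int max)
{
    off_t pos = 0;
    int n = 0;

    assert(out != NULL);
    while (n < max) {
        off_t data = lseek(fd, pos, SEEK_DATA);
        if (data < 0) {
            if (errno == ENXIO)   /* no more data before end of file */
                break;
            return -1;
        }
        off_t hole = lseek(fd, data, SEEK_HOLE);
        if (hole < 0)
            return -1;
        out[n].start = data;
        out[n].len = hole - data;
        n++;
        pos = hole;
    }
    return n;
}

/* Usage sketch: a 1MiB sparse temporary file with four bytes of data. */
static int make_sparse_example(void)
{
    char path[] = "/tmp/sparse-XXXXXX";
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, 1 << 20) < 0 ||
        pwrite(fd, "data", 4, 512 * 1024) != 4) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Filesystems without real hole support report the whole file as a single data extent, so callers cannot assume that holes will be found.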

He said that the page cache could be changed so that it could directly represent file holes with a "special entry that says 'no data here'; that's a lot of work, but it is certainly something that I have been thinking very seriously about doing". Schumaker said that would also help in the NFS client code. Wong wondered if what was really desired was an operation to read from the next non-hole part of the file and to return the data and the offset where it was found. Layton said that Ceph has a sparse-read operation that returns a table of offsets and lengths, followed by all of the data; it would be nice to be able to do something like that with a VFS call.

Lever said that he does not see how the race can be avoided; something can always come along and write data into the hole while the read-plus operation is in-progress. The server cannot promise a consistent view and if the client needs that, it should lock the file. There are some problems with the NFS tests if that promise is not kept, Schumaker said. But Viro pointed out that there is no way to stop something local to the server writing to the hole while the read operation is being sent to the client; Lever agreed and said that the problem affects regular reads as well. "If folks are going to do something stupid, they deserve what they get ... it's glib, but I guess it's a fact of life."

Folios

Lever circled back to the struct xdr_buf up on the screen and noted that he had invited Wilcox in the hopes of getting some ideas for converting NFSD to use folios; Lever wondered where that support would get plumbed in. On the receive side of the NFS server, there is an array of anonymous pages that get filled in by the network layer. On the send side, at least for sockets, the anonymous pages are completely handed off to the network code to be sent and then freed; new anonymous pages are created for the next request. So, he wondered, how do folios fit into that picture?

Wilcox said that he does not want to dictate how the NFSD code should be written, but could try to help the developers understand "how you work well with the MM [memory-management] layer and the filesystem layer". The idea behind folios is to manage memory in chunks that are larger than a page; so you can request an order-5 folio (i.e. 32 pages in length), but if you then break it up into single pages, it is wasted effort; the MM layer could have allocated those single pages directly much more efficiently.

He encourages developers to allocate folios in larger sizes, which helps reduce fragmentation, but only if they do not break the folios up. He suggested using larger folios even if a given use only needs part of it. If a particular request only needs 23 pages, say, he recommended not over-optimizing by splitting up the folio in order to use the other nine pages for something else; the next request may require the whole folio.
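
The arithmetic behind those numbers is simple enough to sketch. The helpers below are plain user-space C with hypothetical names, not kernel interfaces; they compute the smallest order that can hold a given number of pages and how many pages of that folio would go unused.

```c
#include <assert.h>

/* Smallest order whose folio (2^order pages) holds nr_pages pages. */
static unsigned int order_for_pages(unsigned long nr_pages)
{
    unsigned int order = 0;

    /* The upper bound keeps the shift below from overflowing. */
    assert(nr_pages > 0 && nr_pages <= (~0UL >> 1) + 1);
    while ((1UL << order) < nr_pages)
        order++;
    return order;
}

/* Pages left unused when nr_pages are placed in a folio of that order. */
static unsigned long spare_pages(unsigned long nr_pages)
{
    return (1UL << order_for_pages(nr_pages)) - nr_pages;
}
```

For the 23-page request mentioned above, order_for_pages() yields 5 and spare_pages() yields 9; Wilcox's advice amounts to leaving those nine pages alone.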

Lever said the main place where page-at-a-time behavior is happening is on the send side when handing off a page array to the network layer; maybe NFSD can simply hand over a folio containing those pages instead. Wilcox said that he wished David Howells was at the session because he is familiar with what the network layer is expecting. In general, though, the idea is that all parts of the system will eventually be able to work with a folio of any size. Passing the first page (or in some cases, any page) of the folio to existing code will often just work, though you "have to be a bit brave to do that".

Lever said that Howells wanted NFSD to switch from using the kernel sendpage() operation to sendmsg() with an iterator instead. Wilcox agreed that made sense and Lever asked if he or Howells were planning to implement an iterator that could take a folio parameter and "deal with it". "Absolutely", Wilcox said; the send-message takes a bio_vec, which can contain folios. Viro said that iov_iter_bvec() already handles folios, so it should all work now.

But, Viro said that Howells wants to make iterators that can work with either bio_vec or kvec, which is "a complete nightmare" because it will add "a bunch of overhead for no good reason". The head and tail kvec entries could be converted to use bio_vec instead, Lever said. There are some pitfalls to using memory that comes from kmalloc(), but he said the NFSD developers just need to be careful and switch to using memory from the page allocator. In most cases, the server just uses a single page to hold both the head and the tail of the response. Howells showed up just as the session was ending; to some smiles and chuckling, Lever said that Howells had missed some discussion of how to deal with the head and tail kvec entries in xdr_buf, but that they had figured out what to do without his input.


Improving i_version

By Jake Edge
July 5, 2023

LSFMM+BPF

The i_version field in struct inode is meant to track changes to the data or metadata of a file. There are some problems with the way that i_version is being handled in the kernel, so Jeff Layton led a filesystem session at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit to discuss them and what to do about them. For the most part, there are solutions in the works that will resolve most of the larger issues.

Layton's motivation for improving the state of i_version handling is NFS. Currently, the NFSv3 code watches file/directory timestamps (modification time, or mtime, and change time, or ctime) to indicate when its cache should be invalidated. But those times are recorded with one-jiffy (1-10ms) resolution; a lot can happen in a jiffy on today's hardware. That can lead to problems with the client thinking that its cache is up-to-date when it really is not.

[Jeff Layton]

For NFSv4, a new "change attribute" was added; it is a 64-bit unsigned quantity that must change any time the ctime would be updated. Originally, it was considered to be an opaque value, but, over time, the advantages of a monotonically increasing value became apparent. In particular, clients can determine whether certain updates have been performed by seeing if the change attribute is higher or lower than the value they expect.

NFS servers report what kind of change attribute they use; the client can then decide how to treat the values that it gets. Right now, Linux reports an "undefined" type for its change attribute, but Layton would like to be able to report that the change attribute is monotonically increasing. The inode's i_version field can be used for the NFS change attribute; Layton seemed to use the terms i_version and "change attribute" mostly interchangeably throughout the session.

The change attribute must be changed whenever the ctime in the metadata would be changed, as mentioned; some servers can ensure that the attribute change is atomic with respect to the file change that caused it. The Linux server is not able to provide that atomicity, so there is a question of when the attribute should be changed. Right now, for write operations, i_version is changed before the file write is visible, which means that someone could race with the server by seeing the new i_version value that was caused by a write, then doing a read operation that gives the data currently in the page cache, so the data and i_version are out of sync. The client will not update its cache unless another change is made to the file, so this condition can persist for some time.

Changing i_version after the file change becomes visible is "still a little racy", but any synchronization problem should not last long as the client should catch up fairly quickly. Another possibility is to increment the value before and after the file change.
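The ordering problem can be made concrete with a toy model. The sketch below is not kernel code; the function name run_write() and the two step names are hypothetical. It performs one write as two server steps, lets a client read in between, and reports whether the client ends up holding stale data under a version number that will not change again.

```python
def run_write(order):
    """Simulate one write as two server steps with a client read in between.

    order is a pair of step names, "bump" (increment the version) and
    "copy" (make the new data visible), in the order the server runs them.
    """
    state = {"version": 1, "data": "old"}

    def bump():
        state["version"] += 1

    def copy():
        state["data"] = "new"

    steps = {"bump": bump, "copy": copy}

    steps[order[0]]()
    # The client reads between the two steps and caches what it sees.
    cached = (state["version"], state["data"])
    steps[order[1]]()
    final = (state["version"], state["data"])

    # The client only refetches when the version differs from its cached
    # copy, so "same version, different data" stays stale indefinitely.
    stuck_stale = cached[0] == final[0] and cached[1] != final[1]
    return cached, stuck_stale
```

With the version bumped first, the client caches the new version alongside the old data and never notices the mismatch; with the bump done afterward, the worst case is one unnecessary refetch.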

Layton then looked at the i_version field in a bit more detail. It is an unsigned 64-bit value stored in struct inode that comes in two flavors. The first is "server managed" and is used by network-filesystem clients (e.g. NFS or CephFS clients); the value stored in the local inode is whatever value the server has. Local filesystems use a kernel-managed i_version; the kernel increments the value when it updates the metadata timestamps for the file. There is VFS infrastructure that filesystems can opt into using the SB_I_VERSION flag; so far, ext4, Btrfs, XFS, and tmpfs are using it.

The kernel-managed i_version is the more interesting one, he said; the infrastructure is already enabled in the four filesystems he mentioned and GFS2 plans to enable it, though work has not yet started. Originally, it was a simple counter that got incremented whenever ctime was updated, but that turned out to be costly for ext4 and XFS because each increment needed to be logged to disk even if nothing else changed.

Around 2018, there was a switch to a new scheme that sacrificed the low-order bit of the counter for a "queried" flag. If the i_version value was queried, the bit was set. When it was time to increment the counter, the flag was checked; if the flag was set, the counter needed to be incremented, but if not, the counter could stay at the same value. That change allowed the filesystem developers to regain the performance that was lost in the original scheme.
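A minimal sketch of that scheme follows, assuming a single-threaded model with no atomics; the class and attribute names are hypothetical and the writes_logged counter merely stands in for the on-disk logging cost that the scheme avoids.

```python
QUERIED = 1      # low-order bit: set when someone has looked at the value
INCREMENT = 2    # the counter proper lives in the remaining bits


class VersionCounter:
    """Toy model of a change counter with a "queried" flag."""

    def __init__(self):
        self.raw = 0
        self.writes_logged = 0  # stands in for costly journal updates

    def query(self):
        """Return the counter value and remember that it was observed."""
        self.raw |= QUERIED
        return self.raw >> 1

    def maybe_increment(self, force=False):
        """Bump the counter only if it was queried since the last bump."""
        if not force and not (self.raw & QUERIED):
            return False  # nobody saw the old value; skip the update
        self.raw = (self.raw + INCREMENT) & ~QUERIED
        self.writes_logged += 1
        return True
```

Any number of changes between two queries collapse into a single increment, which is where the performance is recovered.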

But there are still some problems, Layton said. A while back, he noticed that XFS and ext4 were updating i_version based on atime updates, but it does not make sense to invalidate the cache for a file because someone simply read it; that has been fixed in ext4, but XFS uses i_version for some other things so some other solution must be found for that.

For file writes, i_version is being updated before copying the changes to the page cache, which can lead to the problems he described earlier. There is the potential for losing updates due to crashes because NFSD does not wait for the updated value to be written by the filesystem before it starts presenting it to clients. That could lead to a client with an i_version and file data that correspond to the data lost in the crash, while another client has the same value but different data. NFSD mitigates this problem by using the ctime value to differentiate the two file versions for filesystems that need it. In addition, the i_version behavior is difficult to test because there is no way to query it from user space without changing its behavior (i.e. setting the "queried" bit).

Generally, i_version is updated alongside the ctime update. For directories, that means it is updated after the operation completes, but for writes to files, it is generally done before the data is copied. One way to be more consistent would be to separate the ctime and i_version updates; there is resistance to changing when ctime updates are done, with good reason, but i_version updates could be moved to after the operation is performed. There are still some possible races, but that would be better.

Another possibility is to bump i_version both before and after the operation. In nearly all cases, it is a no-op anyway because of the queried bit. Meanwhile, though, XFS does not need any changes of this sort because it serializes buffered reads and writes. Ext4, Btrfs, and tmpfs, though, should probably also increment i_version after the operation completes.

Layton said that an idea from Dave Chinner for a multi-grain timestamp for ctime could be used. Chinner suggested that NFSD use ctime for its change attribute (instead of i_version), but that the updates to it be done at jiffy resolution except when the value has been queried. At that point, the ctime value gets updated with a fine-grained (much higher than jiffy-resolution) timestamp. Layton has posted some patches to implement the idea; there are some test failures, but they are due to faulty tests, he said.
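The idea can be illustrated with a small model; this is a sketch under stated assumptions, not Layton's patches. The 4ms jiffy, the class name, and the max() clamp that keeps the timestamp from moving backward are all assumptions made for the example, and the current time is passed in explicitly so that the behavior is deterministic.

```python
JIFFY_NS = 4_000_000  # assumed tick length (4ms); real systems vary


class MultigrainCtime:
    """Toy model of a ctime that is coarse unless someone has looked."""

    def __init__(self):
        self.ctime_ns = 0
        self.queried = False

    def query(self):
        """Report the ctime and note that it has been observed."""
        self.queried = True
        return self.ctime_ns

    def update(self, now_ns):
        """Record a change to the file at time now_ns."""
        if self.queried:
            stamp = now_ns  # fine-grained: must differ from what was seen
            self.queried = False
        else:
            stamp = now_ns - now_ns % JIFFY_NS  # cheap, jiffy resolution
        # Assumption: never let the recorded time move backward.
        self.ctime_ns = max(self.ctime_ns, stamp)
```

Two updates within one jiffy are indistinguishable unless a query intervenes, in which case the next update gets a distinct fine-grained value.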

For the future, though, he thinks it would be quite useful to expose the change attribute to user space, perhaps via statx() (e.g. with a STATX_CHANGE_COOKIE type). It would allow creating a "gated write" that stores the attribute, reads the data and modifies it in memory, then only writes it if the attribute value has not changed; otherwise, it tries again starting with retrieving the change attribute. That is similar to something he did in CephFS a ways back and it provides consistency without requiring locks.
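The gated-write loop might look something like the following sketch. It runs against an in-memory stand-in rather than a real file: VersionedFile, write_if_unchanged(), and gated_update() are hypothetical names, and the atomic check-and-write step is an assumption of the model, not an existing system call.

```python
class VersionedFile:
    """In-memory stand-in for a file that carries a change attribute."""

    def __init__(self, data=b""):
        self.data = data
        self.change_cookie = 0

    def read(self):
        """Return the change attribute together with the current data."""
        return self.change_cookie, self.data

    def write_if_unchanged(self, expected_cookie, new_data):
        """Write only if no change happened since expected_cookie was read."""
        if self.change_cookie != expected_cookie:
            return False
        self.data = new_data
        self.change_cookie += 1
        return True


def gated_update(f, modify, max_tries=10):
    """Apply modify() to the file's data, retrying if someone else wrote."""
    for attempt in range(1, max_tries + 1):
        cookie, data = f.read()          # store the attribute, read the data
        new_data = modify(data)          # modify it in memory
        if f.write_if_unchanged(cookie, new_data):
            return attempt               # number of tries that were needed
    raise RuntimeError("gave up after too many conflicting writes")
```

No lock is held while modify() runs; a conflicting writer simply causes another trip around the loop.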

Ted Ts'o thought that decoupling the cookie from i_version, so that it was only an in-memory value, might be better, but Layton said that the same value needs to be used by NFSD for crash resilience, so it needs to be on the disk as well. Christian Brauner wanted to make sure that there were clearly defined, consistent semantics for the cookie value if it were to be added to statx(). He complained that the meaning of the f_fsid value for the statfs() system call is amorphous; "nobody knows what it is supposed to mean". Layton agreed that it will be important to spell that out.

There was some discussion of the differences between change attributes as defined for NFS versus the ones that the Andrew filesystem (AFS) uses; the latter is only for changes to the data, so metadata changes do not update its change attribute. Meanwhile, though, NFS has to handle the case of local modifications of the filesystem, while AFS does not; NFSD itself cannot fully manage the updates to the change attribute because the value needs to be updated when local modifications are made. The NFS server will not even know that the modification has occurred.

In the end, it was generally agreed that the multi-grain timestamp approach for ctime should be pursued. It will give user space sufficient information, so the change cookie for statx() likely will not be needed. Layton said he would be working on adding that functionality, but that he needs to fix a number of tests as part of that work.

Termux: Linux applications on Android

July 5, 2023

This article was contributed by Sam Sloniker

Termux is an Android app that provides a Linux environment and terminal emulator for such devices. Most command-line software can be used quite easily with Termux, and GUI software can be run by installing a few extra apps. It is an excellent option for Android users who want to run Linux software occasionally on a device more portable than a laptop but do not want to use a dedicated Linux phone due to the cost or limitations of such devices.

The Android operating system runs on the Linux kernel, but it cannot run most desktop Linux software on its own because its user space is almost entirely different. One of the most important differences is the absence of the GNU C library (glibc); Android uses Google's custom Bionic C library implementation, released under the three-clause BSD license, instead. Android's filesystem layout is also different from that of a typical Linux system, necessitating adaptations to run standard Linux software.

Because of these differences, software generally must be compiled specifically for Termux. The app's developers maintain repositories containing rebuilds of widely-used command-line Linux software, as well as many GUI programs.

Command-line software

[Termux Vim]

The simplest use case for Termux is running standard command-line Linux software. Most programs work fine with only the base Termux app installed. One common use is SSH. By installing either OpenSSH or Dropbear, it is possible to use Termux to remotely access another machine. This works in exactly the same way as using SSH from a desktop or laptop. Another program that I often run is the Python REPL. Programming on a phone is not ideal but is not too difficult for simple scripts; the Python console also works well as a calculator. Command-line text editors also work; I use Vim regularly in Termux (screen shot on the right), and it works as well as it does on a laptop or the PinePhone.

Of course, typical phone keyboards are not suitable for command-line use due to their lack of keys such as Ctrl, Alt, Tab, and the arrow keys. Termux does use the phone's standard keyboard, but it provides these keys in a row above the main keyboard. The hyphen and forward slash are also included because it usually takes several taps to access them in a phone keyboard, but they are used quite often in typing command-line options and file paths. Termux disables the phone keyboard's auto-correct and predictive text, because these features would interfere with typing commands.

Installation

Termux can be installed from F-Droid or downloaded as an APK from GitHub. Releases are infrequent, but this is because the app itself does relatively little and needs few updates. The project itself is actively developed. Using the F-Droid version is recommended due to its automatic-update functionality. There is a listing for Termux on the Google Play Store, but the version is several years out of date and should not be used.

Termux works on most of the Android devices in active use. It runs on Android 5.0 through 12.0, supports 32- and 64-bit Arm, i686, and x86_64 devices, and requires only 300MB of disk space for a minimal installation.

Once Termux is installed, most users will want to install more software, of course. The default distribution is quite minimal with only some basic command-line utilities (grep, Bash, tar, curl, etc.) and the nano text editor. The package manager, pkg, is a wrapper around Debian's apt. The packages come from a custom repository run by the Termux developers. pkg supports most basic apt commands, but allows abbreviations as long as they are unambiguous; for example, "pkg in vim" is equivalent to "apt install vim".

Termux has three repositories: main, X11, and root. The main repository contains command-line software and common libraries, the X11 repository contains GUI software and X11-specific libraries, and the root repository contains software and libraries that are only useful on rooted devices. Due to Termux policy, hacking and penetration-testing software is not included in any of the repositories.

It is possible to build or develop other software for Termux in many different languages. The libc problem can be handled by rebuilding packages using the Android Native Development Kit (NDK), which is largely released under FOSS-license terms, though the full license picture is not entirely clear. The NDK automatically uses Bionic for its C library; the majority of Linux software can be compiled using other C library implementations, including Bionic. The filesystem layout is accounted for by patching the package sources to refer to files in the locations that Termux uses rather than the standard locations used on desktop distributions.

Termux:API

Although the main Termux app is quite powerful, it does have several limitations. It mostly creates an isolated environment with little ability to interact with the main Android operating system. However, Termux can integrate with several add-ons, which are separate apps that connect to Termux and provide extra functionality. One of these apps is Termux:API, which allows scripts running in Termux to integrate with Android itself, by providing command-line programs for Android API calls. In addition to installing the Termux:API app, it is necessary to install the package termux-api in Termux (pkg in termux-api).

Termux:API provides several commands related to communication. Content can be shared from within Termux; either a file or text from standard input can be shared, and once the Android share dialog opens, it works the same as sharing from any other app. Termux can be used to receive and send SMS messages. Sending requires that the Android SEND_SMS permission be granted to Termux:API; I did this using Android Debug Bridge (ADB). It is also possible to make outgoing calls, but not detect or answer incoming calls, using Termux:API.

Termux can take photos using the device's camera, though there is no way to record videos. Audio recordings can be both made and played; the audio player does accept videos but only plays the audio.

Termux:API provides a variety of commands for changing device settings, such as screen brightness, volume, and wallpaper image. It also allows access to the device's shared storage; after running termux-setup-storage, a directory called storage will be created in the Termux home directory with symlinks to the shared-storage root and several often-used subdirectories. This provides an easy way to share files between Termux and other apps.

Termux:X11

[Termux X11]

Termux:X11 allows for the use of X11 software in Termux (as can be seen in the screen shot on the right). It is still in early development and has little documentation; those seeking a more stable experience may prefer one of the other methods to run GUI software. For me, however, Termux:X11 works quite well, although it does still have several bugs.

Termux:X11 has no stable release yet; the only prebuilt APKs available are the GitHub Actions build artifacts. The app is under active development, and new APKs are built with each commit. Unlike the other add-ons, Termux:X11 does not need to be installed from the same source as the main Termux app; the APKs from GitHub work fine with the main app from F-Droid. After installing the app, some software also needs to be installed in Termux:

    $ pkg in x11-repo
    $ pkg in termux-x11-nightly xfce4

The installation must be done in two commands because the first command enables the repository and the second one installs the software. The xfce4 package includes only a minimal desktop; other packages, such as xfce4-terminal or firefox, can be appended to the command.

The command that I use to launch the desktop is:

    $ am start --user 0 com.termux.x11/com.termux.x11.MainActivity; \
    rm -f .ICEauthority; \
    sleep 3; \
    termux-x11 :1 -xstartup "xfce4-session"

I put that all in a script, however. The am command automatically launches the Termux:X11 app and brings it to the front. On my phone, if the .ICEauthority file is not deleted, the desktop disappears shortly after loading and an "ICE I/O Error" shows in Termux. (The same error also happens if you switch out of the Termux:X11 app and return; once it happens, it is necessary to return to Termux and re-run the command.)

The phone keyboard can be opened and closed using the device's back button. Termux:X11, like Termux itself, disables predictive text and shows a bar with extra keys. It seems that this bar covers the bottom section of the display rather than shrinking the display, but there is an option in preferences (accessed using a button shown when Termux:X11 is opened without a client) to make the volume-down button toggle the extra key bar. The mouse cursor is controlled by using the phone screen as if it were a laptop touchpad.

Of course, running non-mobile-optimized software on a smartphone does not provide the best experience; many of the same problems that exist with desktop software on the PinePhone also exist with Termux:X11. Additionally, because smartphones have a higher pixel density than typical desktops, most items on the screen, including text, are extremely small. Termux:X11 has a scaling option in its preferences, which helps with this problem. The image is slightly blurred when scaling is enabled, and there is, of course, less screen space available when objects are twice as large, but it will probably still be better for most users to enable scaling; the text is almost unreadable at the default size. The scale option in Xfce display settings does not work; instead of scaling up the display, it shrinks the usable area to only a fourth of the screen and makes objects even smaller.

Termux's selection of GUI software is much less extensive than its selection of command-line software; for example, LibreOffice is not available, but AbiWord is, and Chromium is not available, though Firefox is. Despite the limited selection, the applications that I have tested all work well.

Conclusion

Termux provides many of the benefits of Linux smartphones, such as the PinePhone, without some of the drawbacks. It is not a perfect solution, and X11 support can be particularly problematic, but for those who primarily use Android apps and only use Linux software occasionally, it is likely a better option than a dedicated Linux phone. But those who use mobile-Linux software extensively would probably find a Linux phone to be a better choice than an Android phone with Termux. Many Linux phone users also have another phone due to the limitations of current Linux phones, such as a lack of apps for mobile Linux and the performance/stability of the PinePhone; for these users, having Termux installed on the other phone can be helpful when the Linux phone is not available for some reason.

Page editor: Jonathan Corbet

Copyright © 2023, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds