LWN.net Weekly Edition for July 16, 2020

Welcome to the LWN.net Weekly Edition for July 16, 2020

This edition contains the following feature content:

LibreOffice: the next five years: the LibreOffice community considers changes to help the commercial ecosystem.
Operations restrictions for io_uring: safely sharing an asynchronous I/O primitive with untrusted processes.
Microsoft drops support for PHP: somebody else will have to build Windows PHP packages in the future.
Managing tasks with Org mode and iCalendar: two more task-management approaches.
Creating open data interfaces with ODPi: an Open Source Summit session on data interchange.
What's new in Lua 5.4: features in the first major release of the Lua language in five years.

This week's edition also includes these inner pages:

Brief items: Brief news items from throughout the community.
Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

LibreOffice: the next five years

By Jonathan Corbet
July 9, 2020

The LibreOffice project would seem to be on a roll. It produces what is widely seen as the leading free office-productivity suite, and has managed to move out of the shadow of the moribund (but brand-recognized) Apache OpenOffice project. The LibreOffice 7 release is coming within a month, and the tenth anniversary of the founding of the Document Foundation arrives in September. Meanwhile, LibreOffice Online is taking off and, seemingly, seeing some market success. So it is a bit surprising to see the project's core developers in a sort of crisis mode while users worry about a tag that showed up in the project's repository.

LibreOffice was based on firm free-software principles, and is an egalitarian organization overall, so it is not surprising that the appearance of a "Personal Edition" tag in a recent 7.0 release candidate raised some eyebrows. Company-dominated projects will often withhold features for "enterprise" customers, delaying their arrival into the second-class "community edition" or keeping them entirely proprietary. But LibreOffice is not supposed to be such a project; it is owned by an independent foundation and its development is driven by a few companies. LibreOffice is freely shared by everybody, or at least it has been so far.

The fuss quickly reached a level that required the Document Foundation's board to issue a statement about what was going on. The board emphasized that there will be no license change for LibreOffice, and no changes to "the license, the availability, the permitted uses and/or the functionalities". But there is still something going on:

This "Personal Edition" tag line is part of a wider 5 years marketing plan we are preparing and it has the purpose of differentiating the current, free and community supported LibreOffice from a LibreOffice Enterprise set of products and services provided by the members of our ecosystem.

So, while nothing is going to change, there is still a plan to create different versions of LibreOffice, some of which will need to be paid for.

Some problems

The driving force behind the changes is easy enough to understand; it is one that many successful free-software projects face. LibreOffice is a huge program, and developing it takes a lot of work. According to this marketing plan [PDF] put together by the project, nearly 70% of the changes to LibreOffice come from developers paid by "ecosystem companies"; those companies pay about 40 people to work on LibreOffice. That is not a small expense; the companies involved will only be able to sustain that level of development if LibreOffice is bringing in a corresponding amount of revenue.

In a lengthy post titled "Some problems", project co-founder Michael Meeks explained that this revenue is not coming in. Part of the problem is that Microsoft provides "poor to non-existent support to the majority of users" of Office, he said, so nobody thinks in terms of buying support for any office suite:

It is routinely the case that I meet organizations that have deployed free LibreOffice without long term support, with no security updates etc. Try the Cabinet Office in the UK (at the center of UK Government), or a large European Gov't Department I recently visited - 15,000 seats - with some great FLOSS enthusiasm, but simply no conceptual frame that deploying un-supported FLOSS in the enterprise hurts the software that they then rely on.

Companies think of LibreOffice, he said, in the same way that they think about web browsers, which are available for free and are well supported. But work on web browsers is paid for by advertising, which is not a model that works for an office suite.

The problem is compounded by companies that sell inexpensive "support" for LibreOffice, but which are not involved in its development and are not really able to provide that support. Those companies "file all their tickets up-stream and hope they are fixed for free". Companies working in that mode have no problem pricing their offerings below those of the companies doing the actual work (and thus winning much of the business that does exist). In addition, they simply call their offerings "LibreOffice", which actually looks more authentic than services from other companies, which are trying to build their own brands around LibreOffice support.

LibreOffice, he concluded, has tried to do something unique and is finding its path to be difficult:

As we look around the industry we see tons of organizations exploring ways to solve similar problems. LibreOffice's is ‑particularly‑ challenging, because we aspire to being a vendor neutral project. There are reasonably well-known ways to build a company controlled, branded, FLOSS project - we know and love lots of them: openSUSE, Fedora, Nextcloud, ownCloud etc. this is the norm. With TDF we tried to do something far harder - to create an vendor neutral ecosystem that can help retain the community spirit while delivering on our mission. That has proved extraordinarily harder.

The result of all this is that the LibreOffice ecosystem "is under long term stress".

The plan

In response to these problems, members of the LibreOffice community have been working on a five-year marketing plan, the core of which can be seen in the slides linked above. The intent is to create differentiated versions of LibreOffice while avoiding open-core or proprietary business models. Part of that involves getting a better handle on the LibreOffice brand.

The plan starts by creating the concept of the "LibreOffice Engine", which is a term to describe the core LibreOffice code. It is meant to be a way to enable products selling under their own brand to associate themselves with LibreOffice while maintaining their own identity. "LibreOffice Engine" is described in the plan as a sort of equivalent to the highly successful "Intel Inside" branding effort. Presumably this term would be trademarked by the Document Foundation; the plan does not get into what constraints would be put on who could use the trademark (and how).

Then, there is the Personal Edition, which would be "forever free" and only available from the Document Foundation. This release would be tagged, according to the plan, "volunteer supported, not suggested for production environments or strategic documents". The alternative would be "LibreOffice Enterprise", which would only be available from "ecosystem members". This version would come with commercial support and a corresponding price tag.

LibreOffice Online seems to be a place where a lot of tension resides, perhaps unsurprisingly, since that is where the bulk of the money is being made with LibreOffice now. Companies would like to keep parts of LibreOffice Online to themselves, but that threatens to disrupt the volunteer part of the development community. The plan involves the same split between "personal" and "enterprise" offerings, but adds a little note: "There will be an X month gap between the release of the two versions: LibreOffice Online Enterprise and LibreOffice Online Personal".

The hope is that this plan will give the true "ecosystem members" something attractive to sell and, to an extent, free them from the difficult challenge of competing with the free LibreOffice offering. It is, in many ways, reminiscent of the path Red Hat took years ago to differentiate its Enterprise Linux offering, complete with insinuations that the free version might not be fully trustworthy. That approach has clearly worked well for Red Hat; it would be hard to argue that it has not worked well for the wider Linux community too.

Free software is an inherently challenging base upon which to try to build a company. Many in the free-software community are happily indifferent to the fate of companies working with the code, but without successful companies we would not have much of the code that we depend on every day. As Meeks pointed out, LibreOffice without companies would look a lot like the cobweb-strewn OpenOffice project; it is hard to see that as a win for anybody. So one can only wish LibreOffice and the Document Foundation luck as they seek a way to solve this problem while remaining true to the free-software principles that sparked the project's launch in the first place. Ten years of LibreOffice is nowhere near enough.

Operations restrictions for io_uring

By Jonathan Corbet
July 15, 2020

The io_uring subsystem is not much over one year old, having been merged for the 5.1 kernel in May 2019. It was initially added as a better way to perform asynchronous I/O from user space; over time it has gained numerous features and support for functionality beyond just moving bits around. What it has not yet gained is any sort of security mechanism beyond what the kernel already provides for the underlying system calls. That may be about to change, though, as the result of this patch set from Stefano Garzarella adding a set of user-configurable restrictions to io_uring.

As one might expect from its name, io_uring is based around a ring buffer shared between the kernel and user space that allows user space to submit operations to the kernel. There is a second ring that is filled with the results of those operations. Each operation can be thought of as a way of expressing a system call; operations may read or write buffers, open files, send network messages, or request any of a number of other actions. Operations can be made contingent on the successful completion of previous operations. In short, the operation stream feeding into the kernel is a sort of language expressing a program that the kernel should execute asynchronously.
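As a rough illustration of the two-ring model described above, the following Python sketch plays both the application and the kernel in user space. It is a toy model only: `ToyRing`, `Op`, and the "negative result means failure" convention are hypothetical names chosen for this example, not the kernel's interface, and real rings live in memory shared with the kernel. It does mirror one behavior of linked operations: when one member of a chain fails, the rest of that chain completes with -ECANCELED.

```python
import errno
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass
class Op:
    """One submission entry: a name, some work, and an optional link flag."""
    name: str
    action: Callable[[], int]   # returns >= 0 on success, negative on failure
    link: bool = False          # True: the next op depends on this one


class ToyRing:
    """User-space stand-in for a submission ring plus a completion ring."""

    def __init__(self) -> None:
        self.submission: deque = deque()
        self.completion: deque = deque()

    def submit(self, op: Op) -> None:
        self.submission.append(op)

    def run(self) -> None:
        """Play the kernel's part: drain submissions, post results."""
        chain_failed = False
        while self.submission:
            op = self.submission.popleft()
            result = -errno.ECANCELED if chain_failed else op.action()
            self.completion.append((op.name, result))
            if op.link:
                # A failure poisons every later member of this chain.
                chain_failed = chain_failed or result < 0
            else:
                # An op without the link flag ends the chain.
                chain_failed = False


ring = ToyRing()
ring.submit(Op("open", lambda: 3, link=True))
ring.submit(Op("read", lambda: -errno.EIO, link=True))
ring.submit(Op("close", lambda: 0))        # last member of the chain
ring.submit(Op("unrelated", lambda: 0))    # independent operation
ring.run()
results = dict(ring.completion)
```

Here the failed "read" causes "close" to be cancelled, while "unrelated" still runs because it is not part of the chain.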

Operations executed by io_uring result in calls to the code within the kernel that implements the corresponding system calls; an IORING_OP_READV operation, for example, ends up in the same place as a readv() system call. That code will perform the usual privilege checks, using the credentials of the process that created the ring in the first place. So, in the absence of bugs, a process can do nothing with io_uring that it would not be allowed to do with direct system calls — with the exception that seccomp() filters do not apply to io_uring. This model has worked well for io_uring so far, but it turns out that there is a use case that could use a bit more control.

In particular, what happens if a process wants to create a ring and hand it over to another, less-trusted process? For example, I/O from within virtualized guests could perhaps be accelerated considerably if it used io_uring. This I/O, which often goes through the Virtio mechanism now, involves a certain amount of data copying and context switching that could be avoided this way. The hypervisor could create whatever file descriptors the guest would need, which would correspond to specific devices or open network connections, then let the guest handle things directly through the ring from there.

The problem with this idea is that the guest could then perform any operation that io_uring supports. Remember that the ring retains the credentials of the creator, which would be the hypervisor in this case; giving such a ring to a guest would open the door to actions like accessing other file descriptors opened by the hypervisor or opening new files with the hypervisor's credentials. This is likely to prove extremely disappointing to anybody counting on virtualization as a security barrier.

The answer to this problem, according to Garzarella, is to allow the registration of restrictions on what a specific ring can do. He adds a new registration opcode (IORING_REGISTER_RESTRICTIONS) for this purpose. There are a few types of restrictions that can be added:

IORING_RESTRICTION_REGISTER_OP

Provides a list of registration operations that can be carried out with this ring. Registration operations install file descriptors and buffers in the ring, optimizing their use in subsequent operations. These are, in other words, setup operations for the ring itself that do not actually perform I/O.

IORING_RESTRICTION_SQE_OP

The operations (actual system calls) that will be allowed in this ring are provided as a list. It's called a "whitelist" within the code, a term that seems more than usually likely to change before the patches find their way into the mainline. Any operation that does not appear in this list will be disallowed on the restricted ring.

IORING_RESTRICTION_FIXED_FILES_ONLY

If this restriction is applied, only file descriptors that have been previously registered in the ring can be used in operations. In other words, this restriction can be used to limit a ring to operating on a specific set of known files.

Most of the "restrictions" above are thus actually permissions; they specify the things that the ring is allowed to do. Among other things, the allowlist approach here will help prevent future surprises when new operations are inevitably added to the io_uring roster. Restrictions can be applied exactly once, after which they are fixed for as long as the ring exists.

One final piece, suggested by io_uring maintainer Jens Axboe in response to a previous version of the patch set, is a new flag (IORING_SETUP_R_DISABLED) that can be provided when the ring is first created. When present, that flag causes the ring to start in a disabled state; registration operations will still succeed, but any other operations will fail. That allows the ring creator to perform the necessary registrations and add restrictions without having to worry about any other thread starting to use the ring for I/O. Once the registration phase is complete, the IORING_REGISTER_ENABLE_RINGS registration operation will complete the ring setup and enable all (allowed) operations.
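The setup sequence described above (create the ring disabled, register files, add restrictions, then enable) can be modeled in a few lines of Python. This is a toy model under stated assumptions, not the kernel interface: the class `ToyRestrictedRing` and its methods are hypothetical, opcodes are represented as plain strings, and the model assumes that the allowlists only start to bind once the ring has been enabled.

```python
class ToyRestrictedRing:
    """Toy model of a ring that starts disabled and can be locked down."""

    def __init__(self) -> None:
        self.enabled = False            # models IORING_SETUP_R_DISABLED
        self.restrictions_set = False
        self.allowed_register_ops: set = set()
        self.allowed_sqe_ops: set = set()
        self.fixed_files_only = False
        self.fixed_files: list = []

    def register_restrictions(self, register_ops=(), sqe_ops=(),
                              fixed_files_only=False) -> None:
        if self.restrictions_set:
            raise PermissionError("restrictions can be applied only once")
        self.allowed_register_ops = set(register_ops)
        self.allowed_sqe_ops = set(sqe_ops)
        self.fixed_files_only = fixed_files_only
        self.restrictions_set = True

    def register_files(self, fds) -> None:
        # Assumption: the allowlist binds only after the ring is enabled.
        if (self.enabled and self.restrictions_set
                and "IORING_REGISTER_FILES" not in self.allowed_register_ops):
            raise PermissionError("registration operation not allowed")
        self.fixed_files.extend(fds)

    def enable(self) -> None:
        self.enabled = True             # models IORING_REGISTER_ENABLE_RINGS

    def submit(self, opcode: str, fd: int, fixed: bool = False) -> str:
        if not self.enabled:
            raise RuntimeError("ring is disabled")
        if self.restrictions_set:
            if opcode not in self.allowed_sqe_ops:
                raise PermissionError(f"{opcode} is not on the allowlist")
            if self.fixed_files_only and not fixed:
                raise PermissionError("only registered files may be used")
        if fixed and not 0 <= fd < len(self.fixed_files):
            raise IndexError("no such registered file")
        return f"{opcode} accepted"


# Hypervisor-side setup, performed while the ring is still disabled.
ring = ToyRestrictedRing()
ring.register_files([7, 9])
ring.register_restrictions(
    sqe_ops={"IORING_OP_READV", "IORING_OP_WRITEV"},
    fixed_files_only=True,
)
ring.enable()
```

After `enable()`, a guest holding this ring could submit `IORING_OP_READV` against registered file 0 or 1, but an `IORING_OP_OPENAT` request, or a read on an arbitrary file descriptor, would be refused.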

This restrictions mechanism appears to be sufficient for the described use case of allowing restricted access to a specific set of file descriptors. It seems probable that somebody will want to add more sophisticated policy mechanisms at some point; a proposal to add a BPF hook for security decisions seems unavoidable. For the near future, though, the proposed restriction mechanism may help to speed up I/O in virtual machines or other untrusted environments, which seems like a useful improvement.

Microsoft drops support for PHP

By John Coggeshall
July 11, 2020

For years, Windows PHP users have enjoyed builds provided directly by Microsoft. The company has contributed to the PHP project in many ways, with the binaries made available on windows.php.net being the most visible. Recently Microsoft Project Manager Dale Hirt announced that, beginning with PHP 8.0, Microsoft support for PHP on Windows would end.

The PHP community is still reacting to the news, but so far appears to be largely unconcerned with the decision. Hirt explained that Microsoft would continue to support the PHP builds it currently maintains for Windows for the length of time that PHP's official support policy dictates, but would not be doing so for PHP 8.0 or beyond:

We know that the current cadence is 2 years from release for bug fixes, and 1 year after that for security fixes. This means that PHP 7.2 will be going out of support in November. PHP 7.3 will be going into security fix mode only in November. PHP 7.4 will continue to have another year of bug fix and then one year of security fixes. We are committed to maintaining development and building of PHP on Windows for 7.2, 7.3 and 7.4 as long as they are officially supported. We are not, however, going to be supporting PHP for Windows in any capacity for version 8.0 and beyond.
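The cadence Hirt describes is simple date arithmetic: two years of bug fixes from the release date, then one more year of security fixes. A minimal Python sketch follows; the function names are made up for this example, and the PHP 7.4 release date of 2019-11-28 is supplied here as an assumption rather than taken from the announcement.

```python
from datetime import date


def add_years(d: date, years: int) -> date:
    """Return the same calendar day `years` later (Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def support_windows(release: date) -> tuple:
    """Apply the stated policy: 2 years of bug fixes, then 1 year of security fixes."""
    bugfix_end = add_years(release, 2)
    security_end = add_years(bugfix_end, 1)
    return bugfix_end, security_end


def phase(release: date, today: date) -> str:
    """Classify a release as of `today` under that policy."""
    bugfix_end, security_end = support_windows(release)
    if today < bugfix_end:
        return "bug fixes"
    if today < security_end:
        return "security fixes only"
    return "unsupported"


PHP_7_4_RELEASE = date(2019, 11, 28)   # assumed release date, see above
bugfix_end, security_end = support_windows(PHP_7_4_RELEASE)
status_at_publication = phase(PHP_7_4_RELEASE, date(2020, 7, 16))
```

Under these assumptions, PHP 7.4 would receive bug fixes until late November 2021 and security fixes for a year after that, which matches the "another year of bug fix and then one year of security fixes" in the announcement.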

Beyond the announcement, neither Hirt nor Microsoft has provided any of the reasoning behind the decision. PHP 8.0 release manager Sara Golemon responded to the news by thanking Microsoft for its support over the years, and expressed confidence that the PHP project would find a solution by the PHP 8.0 release:

First, let me convey all our appreciation for the work Microsoft has put into supporting PHP on Windows over the years. Thank you also for letting us know in advance to not expect 8.0 builds. I guess this decision must have only been made very recently since 8.0.0alpha1 and alpha2 builds were produced already.

I won't say I'm not bummed, of course. Nevertheless, y'all gotta do what you gotta do. I'm sure we can work out an alternative by the end of the year.

Microsoft has been supporting the community for a long time, and specifically providing resources toward producing Windows builds for PHP (including PECL extensions) for over a decade. The company's contributions to the project are more than simply providing some binaries for distribution, though. For example, starting with PHP 7.2, the build environment tools enabling Windows builds using Microsoft compilers have been maintained by the company. There are other contributions it has made too, including the PHP extension for its SQL Server product line.

The announcement seems to make it clear that some large chunk of this work will need to be picked up by someone else in the PHP project. How much of the work Microsoft is abandoning exactly, however, is not clear. The announcement made by Hirt seems to be focused on PHP builds for Windows — but the phrase "any capacity" does imply all support from Microsoft for PHP is ending.

On Reddit, Golemon added to her initial comments on the internals mailing list, providing some context to what this news meant in her view. In her post, she indicated that this would likely not affect end users of the language.

Most likely the project will dust off a machine somewhere in the cloud running Windows (likely using a free license generously provided by MS, btw) and setup some automated build processes to make these "in house".

These machine(s) may even be setup/maintained by the same people who were doing the official builds at Microsoft (such as cmb [Christoph M. Becker] who is also one of the 7.3 RMs).

We're still in initial reaction phase here, but the bottom line is there will likely be very little change for Windows users.

Short term, with PHP 8.0 Alpha 2 just released and less than two weeks before Alpha 3, the next release seems like the most immediate concern this news has for the project. So far, all of the PHP 8.0 releases have also included Windows builds thanks to Microsoft.

Longer term, the project shouldn't have much difficulty providing Windows builds of PHP 8.0 — it is something the community used to do itself before Microsoft stepped in. A potentially more significant concern, however, might be support for SQL Server. A PHP application coupled with SQL Server certainly isn't a significant percentage of use cases, but those who do rely on it will not be happy to lose it, either. Ongoing maintenance for the extension could be a challenge if Microsoft is unwilling to lend a hand. Hopefully, Microsoft (or the PHP project) will address these concerns in the near future so that those who currently depend on these technologies can plan accordingly.

Managing tasks with Org mode and iCalendar

July 14, 2020

This article was contributed by Martin Michlmayr

In an earlier article, I reviewed the todo.txt and Taskwarrior task managers. This article continues the process of examining task managers by looking at tools for Org mode, which is a system originally created for Emacs, as well as at tools that make use of the iCalendar standard. It is time to find out whether I can find a system that meets my needs.

Not just for Emacs

Org mode is an Emacs mode for note-taking and project planning, though Org's workflow and file format have found adoption outside of Emacs, as we'll see. Org mode makes it easy to keep notes, maintain to-do lists, plan projects, and more in Emacs. Worg, a community site for Org, describes it as a "powerful system for organizing your complex life with simple plain-text files". This sounds rather appealing since many readers probably appreciate the power of simple text files and might agree that modern life is getting increasingly complex.

What makes Org mode interesting is that it's not merely a task manager, but a system to organize your life. Org mode can also be used to keep a variety of notes, such as ideas, quotes, a list of links, or code snippets. What I noticed is that I often jot down thoughts and ideas throughout the day as I perform a range of activities, such as working on a problem, reading articles, or interacting with others. Some of those notes might just be random observations that I want to preserve, while others may lead to specific tasks later. Keeping both notes and tasks in the same document seems natural from this perspective.

Org mode offers a rich set of features, such as folding sections (i.e. hiding information under a particular heading), keeping a time record for tasks (clocking in and out), capturing notes or tasks from within Emacs or other applications (such as a web browser or PDF viewer), maintaining tables (including support for text spreadsheets), and exporting to other formats (such as HTML, LaTeX, or Open Document Format). In terms of tasks, Org mode sports features commonly found in task managers, such as states (e.g. TODO and DONE), task dependencies (expressed via sub-tasks), priorities (e.g. [#A] for the highest priority), and tags (e.g. :@home:).

These features can be used like this:

    * LWN
    ** DONE Write article about task managers, part 1
    ** TODO [#A] Write article about task managers, part 2 :@home:
    *** TODO [#A] Review Org mode
    SCHEDULED: <2020-07-09 Thu 12:00> DEADLINE: <2020-07-09 Thu 17:00>

A star denotes a headline and can be used for sub-sections (and sub-tasks for tasks). Since Org mixes tasks and notes, a note essentially becomes a task if it has a state (like TODO). Each section can have a planning line with keywords like SCHEDULED, DEADLINE, and CLOSED. A more detailed example, serving as a tutorial of the Org syntax, is available.
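Part of the format's appeal is that other tools can read it easily. As a hedged sketch, the following Python parses just the subset of Org syntax shown in the example above: headlines with an optional state, priority, and tags, plus planning lines. It assumes only the TODO and DONE states and is not a complete Org parser; `parse_org` and the dictionary layout are invented for this illustration.

```python
import re

# Headline: stars, optional state, optional [#X] priority, title, optional :tags:
HEADLINE = re.compile(
    r"^(?P<stars>\*+)\s+"
    r"(?:(?P<state>TODO|DONE)\s+)?"
    r"(?:\[#(?P<priority>[A-Z])\]\s+)?"
    r"(?P<title>.*?)"
    r"(?:\s+:(?P<tags>[^\s]+):)?\s*$"
)
# Planning keywords followed by an active <...> or inactive [...] timestamp
PLANNING = re.compile(r"(SCHEDULED|DEADLINE|CLOSED):\s*[<\[]([^>\]]+)[>\]]")


def parse_org(text: str) -> list:
    """Return one dict per headline; planning lines attach to the last headline."""
    entries = []
    for line in text.splitlines():
        match = HEADLINE.match(line)
        if match:
            entries.append({
                "level": len(match["stars"]),
                "state": match["state"],
                "priority": match["priority"],
                "title": match["title"],
                "tags": match["tags"].split(":") if match["tags"] else [],
                "planning": {},
            })
        elif entries:
            for keyword, stamp in PLANNING.findall(line):
                entries[-1]["planning"][keyword] = stamp
    return entries


SAMPLE = """\
* LWN
** DONE Write article about task managers, part 1
** TODO [#A] Write article about task managers, part 2 :@home:
*** TODO [#A] Review Org mode
SCHEDULED: <2020-07-09 Thu 12:00> DEADLINE: <2020-07-09 Thu 17:00>
"""

entries = parse_org(SAMPLE)
open_tasks = [e["title"] for e in entries if e["state"] == "TODO"]
```

The "LWN" headline comes back with no state, which reflects the point above: a section is a plain note until it carries a state like TODO.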

Org mode also has built-in support for agendas with scheduled tasks and deadlines. There are a number of tutorials showing how to implement task-management systems in Org mode, such as for the popular Getting Things Done (GTD) system.

Even though Org started as a mode for Emacs, the Org file format has been implemented by many other applications. Worg has a list of other tools, although that page seems somewhat out of date. Pandoc can convert Org files to other markup formats, Orga offers an Org parser in JavaScript, GitHub and GitLab render Org files automatically, and so on.

My editor of choice, Vim, offers a plugin based on the Org format called Vim-OrgMode (seen above). It supports basic functionality from Emacs's Org mode, such as folding, changing the state of tasks, and displaying an agenda. However, the functionality is more limited compared to Emacs (e.g. there's no support for tables). Another Vim plugin, Vim Do Too, also looks promising, but it is more "inspired by Org mode" than a fully compatible replacement.

For those who want to add tasks on the web or from a mobile device, org-web is one option. The web app runs in the browser instead of on a server, has support for Dropbox and Google Drive, and offers a lot of features, including an agenda view and tables. One potential shortcoming is that org-web is specifically optimized for mobile use. This is addressed by organice, a friendly fork of org-web, which is built for mobile and desktop browsers. Organice offers extensive documentation, and its comparison page highlights a number of advantages, such as support for WebDAV (which can be used to sync with Nextcloud).

Finally, Orgzly (seen at right) is a popular app for Android. Despite Orgzly having one core developer, development is pretty active. Orgzly supports saving Org files on Dropbox, using WebDAV, or to local storage. It offers many features one would expect from a task manager for a mobile device, such as reminders of scheduled events and deadlines. While adding geolocation information to tasks is supported, Orgzly doesn't yet have reminders based on the location. It does have a number of built-in filters (agenda, next three days, scheduled), and custom filters ("searches" in Orgzly) can be defined.

Overall, the Org ecosystem looks quite attractive. The file format is simple, yet powerful. While Org mode for Emacs is the most mature, there are a number of interesting solutions that expand the scope of Org beyond just Emacs.

Standards, standards, ... and iCalendar

After reviewing three different ways of storing tasks, one starts to wonder why there isn't a single standard to describe tasks. Of course, cynics will point to the xkcd comic about standards. As it turns out, though, there is an actual standard in the form of iCalendar, also known as RFC 5545. Many readers may be familiar with the ICS files sent out by email from applications such as Microsoft Outlook and Google Calendar, which contain event information in the form of VEVENT entries. But the standard also defines tasks, called VTODO entries, with the usual features needed to support them (priorities, progress, scheduled and due dates, dependencies, etc.).
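To give a feel for what a VTODO entry looks like, here is a minimal Python sketch that writes one by hand using only the standard library. The property names (UID, DTSTAMP, SUMMARY, DTSTART, DUE, PRIORITY, PERCENT-COMPLETE, STATUS) come from RFC 5545; the function names, the PRODID value, and the UID are placeholders invented for this example, and line folding for long lines is deliberately left out.

```python
from datetime import datetime, timezone


def escape_text(value: str) -> str:
    """Escape backslash, semicolon, comma, and newline for a TEXT value."""
    return (value.replace("\\", "\\\\")
                 .replace(";", "\\;")
                 .replace(",", "\\,")
                 .replace("\n", "\\n"))


def make_vtodo(uid, summary, start, due, priority=0, percent=0, stamp=None):
    """Build a one-task calendar; priority 1 is highest, 0 means undefined."""
    stamp = stamp or datetime.now(timezone.utc)
    if percent >= 100:
        status = "COMPLETED"
    elif percent > 0:
        status = "IN-PROCESS"
    else:
        status = "NEEDS-ACTION"
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example//toy-vtodo//EN",   # placeholder product identifier
        "BEGIN:VTODO",
        f"UID:{uid}",
        f"DTSTAMP:{stamp:%Y%m%dT%H%M%SZ}",    # creation stamp, in UTC
        f"SUMMARY:{escape_text(summary)}",
        f"DTSTART:{start:%Y%m%dT%H%M%S}",     # floating local time
        f"DUE:{due:%Y%m%dT%H%M%S}",
        f"PRIORITY:{priority}",
        f"PERCENT-COMPLETE:{percent}",
        f"STATUS:{status}",
        "END:VTODO",
        "END:VCALENDAR",
    ]
    # iCalendar content lines are terminated by CRLF.
    return "\r\n".join(lines) + "\r\n"


ics = make_vtodo(
    uid="part2@example.invalid",              # placeholder identifier
    summary="Write article about task managers, part 2",
    start=datetime(2020, 7, 9, 12, 0),
    due=datetime(2020, 7, 9, 17, 0),
    priority=1,
    stamp=datetime(2020, 7, 9, 8, 0, tzinfo=timezone.utc),
)
```

Even this small entry needs a wrapper, a unique identifier, a timestamp, and escaping for the comma in the summary, so the result is noticeably more verbose than a single line in todo.txt or Org.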

iCalendar is also a text-based format, although it's more complicated than todo.txt or Org, and is not usually edited manually. One attraction of iCalendar over the alternatives is that it has been designed with synchronization in mind; CalDAV can be used to synchronize calendars (including tasks). Calendars can be stored on Nextcloud (self-hosted or via one of the many providers) or on a self-hosted server using Radicale or Xandikos. Vdirsyncer is a command-line tool to synchronize calendars and contacts; DAVx⁵ does the same on Android.

There are many applications built around iCalendar. On Android, I first found OpenTasks. The app looks quite attractive and offers basic task management, but many important features are missing, including recurring tasks and sub-tasks. The GitHub issue for sub-tasks suggests that they would be implemented after recurrence, but that comment was made in 2017 and recurring tasks are still not implemented.

A more actively maintained Android app, although also having only one core developer, is Tasks, which is seen at the right. It is available on F-Droid and as Tasks.org on Google Play. Tasks originates from Astrid, which was a popular application many years ago, but its development stopped after the company behind it was acquired by Yahoo. (The Grumpy Editor reviewed Astrid back in 2011.) Tasks feels quite polished and it has some nice gimmicks, such as the ability to assign not just a color to a tag but also an icon. You can hide tasks until a later date, define custom filters, and receive reminders based on your location. Tasks implements the core functionality of a task manager, although there are, of course, many open feature requests, such as for swipe gestures, progress bars, and Markdown rendering.

Since iCalendar is a standard, it's easy to use multiple applications to manage tasks. While I find Tasks useful on my phone to get reminders and to add tasks on the go, I prefer something on my desktop for more in-depth management. For this purpose, Nextcloud Tasks (seen below) looks quite appealing. It supports the core functionality of task management, such as sub-tasks, priorities, scheduled and due dates, and progress. Unfortunately, development is quite slow, although the developer (who works on Nextcloud Tasks in his spare time) actively comments on tickets and invites people to submit pull requests. Furthermore, the application would be vastly more useful with better integration with Nextcloud, such as implementing Nextcloud's concept of projects and integrating with other Nextcloud apps, such as Deck, a Kanban style organization tool.

There are also some iCalendar solutions for the command line, although the offerings are not as mature as what's available for Taskwarrior. Todoman can be used to list, edit, and create tasks from the command line. There is also calcurse, a curses-based calendar which can display tasks.

In summary, iCalendar is a standard for events and tasks, which is implemented in a large number of tools. In addition to those introduced here, KDE's Korganizer, GNOME To Do, and others use the iCalendar format. One downside is that the format, while text-based, is more complex and hard to edit manually.

Summary

After reviewing four different systems to manage tasks, it's time to draw conclusions. First of all, I observed a trend all too common with open-source projects. While one attraction of open source is the sheer diversity of what's available, it's also a curse: many of the reviewed task-management systems had several half-baked solutions instead of one solution for each category (command-line, web, and mobile). If more developers joined forces to focus on one solution, we might have more mature and longer-lasting projects.

Having said that, I feel that all of the four systems have some appeal and offer robust solutions. Personally, todo.txt seems too simple to me. While the format is definitely quite attractive to some users, I find Org's format to be more expressive, yet simple enough to edit by hand. The simple addition of headers in Org, compared to todo.txt, by itself makes a huge difference, since you can structure your file much better. A better todo.txt editor could potentially structure the file to provide similar functionality, however, perhaps by creating virtual sections based on tags.

When I started my review, I was fairly sure I was looking for a combination of one mobile and one desktop (native or web) application to manage tasks. While I'm traditionally a command-line person, it seems that task management would benefit from a modern, visual design that allows you to easily sort through tasks, separating important tasks from the rest. Yet, I was surprised by Taskwarrior's highly capable text-based tools that made the management of tasks easy. I also found Org appealing since I spend a lot of time editing texts; managing tasks in my editor feels right. Unfortunately, the Org plugins for Vim are not nearly as featureful as Org mode for Emacs.

The most future-proof choice seems to be iCalendar since it's an actual standard; its level of adoption will likely increase over time. Unfortunately, the format isn't easy enough for routine manual editing and I'd rather store an Org file in Git than an iCalendar file. While both are plain-text files, a diff on the Org file will be easier to read.

In an ideal world, I would have Taskwarrior with an iCalendar backend, so I could combine the robust text-based tools from Taskwarrior with the attractive web and mobile solutions for iCalendar. And even though I've mostly been leaning toward an iCalendar-based solution so far, I can't stop thinking about Org's powerful, yet simple system, which feels very natural.

While I haven't fully made up my mind, one conclusion is clear. Apart from todo.txt, which wouldn't work for me personally, I find all of the systems to be useful and interesting in their own ways. I could make any of the three systems work. The only question is which one best fits into my workflow and the only way to find out is to enter my tasks and see how I get along. In the worst case, I'll add a task to revisit this topic in another year.

Comments (12 posted)

Creating open data interfaces with ODPi

July 10, 2020

This article was contributed by Sean Kerner

Connecting one source of data to another isn't always easy because of different standards, data formats, and APIs to contend with, among the many challenges. One of the groups that is trying to help with the challenge of data interoperability is the Linux Foundation's Open Data Platform initiative (ODPi). At the 2020 Open Source Summit North America virtual event on July 2, ODPi Technical Steering Committee chairperson Mandy Chessell outlined the goals of ODPi and the projects that are part of it. She also described how ODPi is taking an open-source development approach to make data more easily accessible.

While perhaps not as well-known as other Linux Foundation efforts, ODPi has actually been around since 2015. Chessell explained that ODPi's initial role was to help different vendors using Apache Hadoop to interoperate, since each had its own set of data connectors. As usage and the number of Hadoop vendors have declined in recent years, ODPi defined a broader vision to be an initiative focused on creating open-source data standards to help users understand and make use of data across different platforms.

ODPi has multiple projects

ODPi is an umbrella organization that comprises multiple sub-projects, including OpenDS4All, BI & AI, and Egeria. The governance structure of ODPi follows a similar pattern to other initiatives within the Linux Foundation, with a Board of Directors handling the business operations and a Technical Steering Committee (TSC). Chessell described the role of the TSC as providing mentoring to the projects that comprise ODPi, as well as helping the leaders of the projects with best practices and ideas for technical improvement.

While code is an important part of ODPi, technology alone isn't the only way to make effective use of data. Chessell noted that as organizations try to become more data-driven, they face cultural, organizational, and technology problems. "For many organizations they operate as a sort of hierarchy that creates silos between each of the different IT systems, and to make use of data you have to sort of break down those silos and allow data and collaboration to flow laterally across the organization," she said.

The OpenDS4All project is one such non-code effort within ODPi. OpenDS4All is an open data-science project that is focused entirely on education, creating materials that educators and organizations can use to build a data-science curriculum. The project got started in February 2020 based on materials originally created by professors at the University of Pennsylvania.

The material that OpenDS4All provides includes Python-based tools that can help students to learn about different aspects of data science, including data modeling, integration, and analysis. The project makes use of Jupyter Notebooks to set up the data-science environments. (LWN has looked at the use of Jupyter notebooks for education and collaboration in the past.) All of the OpenDS4All components are available via the project's GitHub repository.

BI & AI

ODPi is also host to the Business Intelligence and Artificial Intelligence (BI & AI) project. Chessell explained that BI is a marketing term used to describe data platforms that are used by companies to create reports about their operations. BI platforms also typically include the ability to make various charts from the data, as well as dashboards for companies to use to analyze their data. Chessell remarked that BI platforms include a large range of data sources that have been specially created to support the reporting process. The data that is created for BI can also be useful for AI use cases.

The goal of the BI & AI project is to help make BI data more accessible so that it can be used by AI frameworks. She added that the BI & AI project team members are working on defining a standard interface to allow AI models to be plugged into a BI platform. "So the first phase they're working on now is around the specification for this bridge between AI and BI," Chessell said. "The second phase will be actually to build a reference implementation of that bridge so that the vendors can choose to demonstrate the bridge operating with their platform."

ODPi Egeria

Chessell spent most of the time during her presentation talking about the Egeria project, which is all about metadata: data that describes some collection of data. The project includes code, as well as best practices and educational elements, to help users better understand and create data systems that can interoperate with different types of metadata. She noted that many organizations use software tools that can store information about the data that they have; that metadata is used to build what's often referred to as a data catalog. The metadata can be used to provide different types of information about the data structure and how it can be organized.

Typically, each vendor's metadata repository is proprietary and locked down, such that if a company wants to move to a new tool, there is no easy way to migrate the existing metadata. "So the result was that different business units within an organization, were operating in isolated silos and knowledge wasn't being shared," Chessell said. "A data-driven organization needs knowledge and data to flow laterally between the different silos."

Chessell added that although some vendors have tried to create open APIs with their own technologies, competitive vendors are often reluctant to integrate with each other. That's where the Egeria project fits in; it is an open-source initiative for data APIs and protocols that brings together vendors to work cooperatively on metadata interoperability. She emphasized that the goal with Egeria is not to create a central metadata repository that every tool connects to, but rather to enable a peer-to-peer communication between different metadata repositories.

Metadata is not one format or data type either, which adds to the complexity of management and interoperability. Chessell said that the Egeria project mapped out the types of metadata used by organizations and found 500 different metadata types. So what the Egeria project has done is create an approach to classifying what a given piece of metadata is being used for and how it relates to other data. According to Chessell, by having a way to define what a specific piece of metadata is about, there is a common language that can be used as the basis for creating an interoperable metadata system. "As each vendor maps to the same model, we start to understand the correspondence between the different technologies and the data that they store," she said.

How Egeria works

A core part of Egeria is developing the Open Metadata and Governance (OMAG) Server Platform. Chessell explained that OMAG servers can be deployed in on-premises or cloud environments to integrate a series of services for metadata discovery, governance, and interoperability.

There are three main types of OMAG servers; "Cohort Members" are a type that is used for doing peer-to-peer exchange with a metadata access point. Cohort Members also include a conformance test server that helps to validate that a server adheres to the OMAG specifications and can properly share metadata information. "View Servers" provide REST interfaces to connect with data repositories behind firewalls. Then there are "Governance Servers", which provide a layer of management and security services for handling metadata. Governance Servers also provide discovery services that can enable deduplication when there are multiple copies of the same data asset.

Egeria is still in a state of active development. Chessell noted that the developer pieces, which include metadata repository services, are production-ready, though she said that the View Server component, which enables integration with backend metadata servers, is currently less mature.

The promise of ODPi is nothing short of making data better (which was the title of Chessell's presentation). In order to make use of data, it's painfully obvious that organizations need to be able to get access to that data, and it's surprising in some respects that so much of it is still in silos and locked behind proprietary interfaces. With the continued effort of ODPi and its constituent projects, hopefully more data will be open and accessible in the years to come.

Comments (none posted)

What's new in Lua 5.4

July 15, 2020

This article was contributed by Ben Hoyt

Lua version 5.4 was released at the end of June; it is the fifteenth major version of the lightweight scripting language since its creation in 1993. New in 5.4 is a generational mode for the garbage collector, which performs better for programs with lots of short-lived allocations. The language now supports "attributes" on local variables, allowing developers to mark variables as constant (const) or resources as closeable (close). There were also significant performance improvements over 5.3 along with a host of minor changes.

Lua is a programming language optimized for embedding inside other applications, with notable users such as Redis and Adobe Lightroom. It has been used as a scripting language for many computer games, including big names such as World of Warcraft and Angry Birds; Lua was the most-used scripting language in a 2009 survey of the game industry. Part of the reason Lua is good for embedding is that it is small: in these days of multi-megabyte downloads for even the simplest applications, the entire Lua 5.4 distribution (source plus docs) is a 349KB archive. To build a Lua interpreter with the default configuration, a developer can type make and wait about five seconds for compilation; the result is a self-contained 200-300KB binary.

Major versions of Lua are released every few years, not on any particular release cycle. The previous major version, 5.3, was released over five years ago, in January 2015, with the addition of a separate integer type (previously Lua used only floating-point numbers), bitwise operators, a basic UTF-8 library, and many minor features.

Language changes

One of the interesting new features in Lua 5.4 is the addition of local variable "attributes". When declaring a local (block-scoped) variable, a developer can add <const> or <close> after a variable name to give it that attribute. The const attribute is straightforward: similar to const in C, it means that the specified variable cannot be reassigned after the initialization in its declaration. The const attribute does not make a data structure immutable: a developer is not prevented from changing entries in a table stored in a const variable, but the variable name cannot be assigned again. The const attribute provides a small amount of compile-time safety, as the compiler will give an error if a constant is accidentally reassigned:

    do
        local x <const> = 42
        x = x+1
    end
    -- ERROR: attempt to assign to const variable 'x'
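
The distinction between the variable and the value it holds can be shown with a minimal sketch (Lua 5.4 only; const_demo and settings are names made up for this illustration). It changes a table held in a const variable, then uses load() to show that a reassignment is rejected when the chunk is compiled rather than when it runs:

```lua
-- Requires Lua 5.4; const_demo and settings are hypothetical names.
function const_demo()
    local settings <const> = { retries = 3 }
    settings.retries = 5     -- allowed: changes an entry in the table
    settings.timeout = 30    -- allowed: adds a new entry
    return settings
end

-- load() only compiles the chunk; it returns nil plus a message because
-- the reassignment is caught by the compiler, before any code runs.
local chunk, message = load("local x <const> = 42; x = x + 1")

local s = const_demo()
print(s.retries, s.timeout)  -- prints 5 and 30
print(chunk, message)        -- nil, followed by the compiler's error message
```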

Perhaps more useful (though with similarly unusual syntax) is the close attribute. This tells Lua to call the object's __close() "metamethod" when the variable goes out of scope. Similar to RAII in C++ or the with statement in Python, it is a way to ensure that memory is freed, files are closed, or other resources are shut down in a deterministic way. For example, the file object returned by the built-in io.open() function can be used with <close>:

    do
        local f <close> = io.open("/etc/fstab", "r")
        -- read from file 'f'
    end
    -- file is automatically closed here

The <close> attribute can also be used with user-defined objects:

    function new_thing()
        local thing = {}
        setmetatable(thing, {
            __close = function()
                          print("thing closed")
                      end
        })
        return thing
    end

    do
        local x <close> = new_thing()
        print("use thing")
    end
    -- "thing closed" is printed here after "use thing"

Previously, developers would have to use the __gc() metamethod for this purpose, but that is only called when the object is garbage collected some time later, not deterministically at the end of the block.
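
Two further details of the close attribute come from the Lua 5.4 reference manual rather than from the examples above: variables are closed in the reverse order of their declaration, and __close() receives the error object as its second argument when the block is exited by an error. A small sketch of both behaviors follows; close_demo and resource are hypothetical names chosen for this illustration:

```lua
-- Requires Lua 5.4; close_demo and resource are hypothetical names.
function close_demo()
    local log = {}

    -- Build an object whose __close metamethod records how it was closed.
    local function resource(name)
        return setmetatable({}, {
            __close = function(_, err)
                local how = err and " closed after error" or " closed"
                log[#log + 1] = name .. how
            end
        })
    end

    -- pcall catches the error so that the log can be inspected afterward.
    local ok = pcall(function()
        local a <close> = resource("a")
        local b <close> = resource("b")
        error("boom")
    end)
    return ok, log
end

local ok, log = close_demo()
print(ok, table.concat(log, ", "))
-- ok is false; "b" is logged before "a", both as closed after an error
```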

Generational GC

Version 5.4 also brings a new generational garbage collection (GC) mode, which performs better for certain kinds of programs where objects usually have a short lifetime. A generational GC — based on the observation that "most objects die young" — scans "young" objects frequently and frees them if they are not referenced, but scans older objects (those that are still referenced after one or more GC passes) less frequently. Interestingly, Roberto Ierusalimschy (one of the creators of Lua) noted in 2017 that Lua previously had a generational GC:

Lua has an incremental garbage collector since 5.1. It was the generational collector that was introduced in 5.2 as an experiment and then removed in 5.3. It will come again in 5.4, this time probably to stay.

Ierusalimschy gave a talk in 2019 (PDF slides and YouTube video) that goes into more detail about how incremental GC works, as well as why the 5.2 generational GC didn't perform that well, and what the Lua team replaced it with. In 5.2's version, objects only had to survive a single GC cycle (collector pass) before becoming "old", but in 5.4 they have to survive two GC cycles, which is a more accurate model for real-world Lua programs. The two-cycle approach is more complicated to implement but gives better GC performance for many programs — though not all. Ierusalimschy notes that programs which build large data structures won't benefit. Possibly for that reason, the Lua team didn't change the default: in 5.4 the default is still to use the incremental collector; a developer needs to add "collectgarbage("generational")" to their program in order to turn on the generational GC.
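
A program does not have to commit to one collector for its whole run. The following is a minimal sketch of opting in for a single allocation-heavy phase; it assumes Lua 5.4, where collectgarbage("generational") and collectgarbage("incremental") return the name of the previously active mode. The helper with_generational_gc is a hypothetical name, and it does not restore the mode if fn raises an error:

```lua
-- Requires Lua 5.4; with_generational_gc is a hypothetical helper.
function with_generational_gc(fn)
    -- Switch modes, remembering the name of the mode that was active.
    local previous = collectgarbage("generational")
    local result = fn()
    -- "incremental" and "generational" are both valid options here.
    collectgarbage(previous)
    return result, previous
end

local total, previous = with_generational_gc(function()
    local sum = 0
    for i = 1, 100000 do
        local t = { i }      -- short-lived garbage, one table per iteration
        sum = sum + t[1]
    end
    return sum
end)
print(total, previous)
```

Whether the generational mode is actually faster depends on the workload, so it is worth timing both modes on a real program before switching.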

On the lua-l mailing list, Gé Weijers described how the generational GC, with its "minor collections" (frequent GC passes to collect young objects) ties into the new <close> feature (which used to be called "toclose"):

The garbage collector in 5.4 implements a generational mode. If an object survives the minor collections it may take a very, very long time before its __gc metamethod gets called after is becomes inaccessible, especially if your program mostly creates short lived objects. This makes __gc less useful as a poor man's RAII replacement.

The new "toclose" feature is much more useful to release resources and unlock locks in a timely matter.

Faster

One of the unsung features in 5.4 is a significantly faster interpreter, which the release notes overlook. In a test I did on my 64-bit macOS machine using Gabriel de Quadros Ligneul's Lua Benchmarks suite, I found that version 5.4 was an average of 40% faster than version 5.3 across 11 benchmarks included in the suite.

Similar gains are shown in Elmar Klausmeier's performance comparison. Admittedly, both of these are rather artificial benchmarks — when using Lua in something like a game engine, performance-sensitive code like graphics or matrix multiplication will no doubt be written in C. Still, an improvement of this magnitude for number-centric code (which most of these benchmarks are) is not to be scoffed at. Dibyendu Majumdar described some of the reasons for these improvements on the lua-l mailing list back in 2018: 5.4 added new and optimized bytecode instructions for numeric operations that Lua's parser can use when it can infer that the types involved are numbers. For example, GETI and SETI are two new instructions that are used for table lookups when the index is a constant integer.

Those who need much higher performance can use Mike Pall's LuaJIT, a just-in-time compiler for Lua 5.1 that is significantly faster than the stock Lua interpreter. However, LuaJIT hasn't added any of Lua's new features since version 5.1 (which was released in 2006). Doing so would be quite an undertaking due to many breaking changes, including new scoping rules in 5.2 and the new integer type in 5.3. For this reason, Pall has been a vocal critic of the backward-incompatible changes that the Lua team makes.

This does seem to be a real problem, and not only with obscure edge cases: two of the benchmarks in the Lua Benchmarks suite failed in 5.4 with a "C stack overflow" error (though they work fine in 5.3), so I had to remove them before running the suite. The ack and fixpoint-fact benchmarks fail, presumably due to different handling of recursive tail calls in 5.4. Most of the incompatibilities in 5.4 are documented, but the length of that list may still cause a fair bit of pain for those trying to upgrade large Lua scripts. My guess is that this is why tools that need long-term stability, like Redis and World of Warcraft, lock in a specific older version of Lua (in the case of both of those, version 5.1). It seems like there's something of a split in the community, with some users sticking to 5.1 because it has a JIT compiler and because the changes since then are relatively minor.

Incompatibilities between Lua versions may also contribute to the problem of Lua not having a unified standard library, which LWN wrote about back in February. If a library author has to do a bunch of work to upgrade when a new Lua version comes out, they may be less likely to keep it up to date. That makes it more likely that someone will create a fork that works on the new Lua version or simply write a new library.

Smaller changes

In addition to the larger changes, Lua 5.4 adds many smaller features, including a new random number generator using the xoshiro256** algorithm instead of using the underlying C library's rand() function. There is now a simple warning system used when there's an error in a finalizer or __close() method. Also added is the ability for Lua values with "userdata" to have multiple user values (userdata is a pointer to a memory block created with the Lua C API, so this feature allows objects created by C extensions to have multiple memory blocks associated with them).
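
Two of these additions are easy to try from a script. The sketch below assumes Lua 5.4 and the stock standalone interpreter; roll is a hypothetical helper. It shows that seeding the generator with the same integer reproduces the same sequence, and that warn() stays silent until warnings are switched on with the "@on" control message:

```lua
-- Requires Lua 5.4; roll is a hypothetical helper.
function roll(seed, n)
    math.randomseed(seed)    -- an explicit integer seed makes the sequence repeatable
    local rolls = {}
    for i = 1, n do
        rolls[i] = math.random(1, 6)
    end
    return rolls
end

local first = roll(2020, 5)
local second = roll(2020, 5)
print(table.concat(first, " ") == table.concat(second, " "))  -- true

-- Warnings are off by default in the standalone interpreter.
warn("@on")
-- The pieces are concatenated; the stock interpreter writes the result
-- to stderr with a "Lua warning: " prefix.
warn("cache ", "is cold")
```

An application embedding Lua can install its own warning handler through the C API, so the destination and format of the message may differ there.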

There were some minor changes in semantics as well: slightly different handling of edge cases with wrap-around in for loops and adjustment of string-to-number coercion for integers (for example, "10"+1 is the integer 11 in 5.4, but the floating-point number 11.0 in 5.3).
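
Both changes can be checked with a few lines. This sketch uses math.type() and math.maxinteger, which exist in 5.3 and later; the results in the comments are for 5.4, and the function names are made up for this illustration:

```lua
-- Results in the comments are for Lua 5.4; function names are hypothetical.
function coercion_kinds()
    -- A string that looks like an integer now stays an integer in arithmetic;
    -- a string with a fractional part still produces a float.
    return math.type("10" + 1), math.type("10.5" + 1)
end

function count_to_maxinteger()
    local n = 0
    -- In 5.4 the control variable never wraps around, so a loop that ends
    -- at the largest integer terminates after exactly two iterations.
    for i = math.maxinteger - 1, math.maxinteger do
        n = n + 1
    end
    return n
end

print(coercion_kinds())        -- integer and float
print(count_to_maxinteger())   -- 2
```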

Overall, Lua seems like a good language for its domain (embedding into larger systems or applications); the release of 5.4 shows that it is receiving continual improvements from the core team. Lua has no clear roadmap, so it's hard to know at this early stage what changes are being planned for 5.5, or when it is likely to be released (Lua developer Pierre Chapuis even speculates the next version may be "a very impacting change" with a 6.0 version number). In any event, the new features in 5.4 will probably be fairly minor for most users, but the performance improvements will prove to be a nice win.

Comments (1 posted)

Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

Briefs: openSUSE board removal vote fails; Ubuntu package tracking; LO marketing plan; Quotes; ...
Announcements: Newsletters; conferences; security updates; kernel patches; ...
