
A Plumber's Wish List for Linux

From:  Kay Sievers <kay.sievers-AT-vrfy.org>
To:  linux-kernel-AT-vger.kernel.org
Subject:  A Plumber’s Wish List for Linux
Date:  Fri, 07 Oct 2011 01:17:02 +0200
Message-ID:  <1317943022.1095.25.camel@mop>
Cc:  lennart-AT-poettering.net, harald-AT-redhat.com, david-AT-fubar.dk, greg-AT-kroah.com
Archive‑link:  Article

We'd like to share our current wish list of plumbing layer features we
are hoping to see implemented in the near future in the Linux kernel and
associated tools. Some items we can implement on our own, others are not
our area of expertise, and we will need help getting them implemented.

Acknowledging that this wish list of ours only gets longer and not
shorter, even though we have implemented a number of other features on
our own in the previous years, we are posting this list here, in the
hope to find some help.

If you happen to be interested in working on something from this list or
able to help out, we'd be delighted. Please ping us in case you need
clarifications or more information on specific items.


Thanks,
Kay, Lennart, Harald, in the name of all the other plumbers



And here's the wish list, in no particular order:

* (ioctl based?) interface to query and modify the label of a mounted
FAT volume:
A FAT label is implemented as a hidden directory entry in the file
system which needs to be renamed when changing the file system label;
this is impossible to do from userspace without unmounting. Hence we'd
like to see a kernel interface that is available on the mounted file
system mount point itself. Of course, bonus points if this new interface
can be implemented for other file systems as well, and also covers fs
UUIDs in addition to labels.

* CPU modaliases in /sys/devices/system/cpu/cpuX/modalias:
useful to allow module auto-loading of e.g. cpufreq drivers and KVM
modules. Andi Kleen has a patch to create the alias file itself. CPU
'struct sysdev' needs to be converted to 'struct device' and a 'struct
bus_type cpu' needs to be introduced to allow proper CPU coldplug event
replay at bootup. This is one of the last remaining places where
automatic hardware-triggered module auto-loading is not available. And
we'd like to see that fixed to make numerous ugly userspace work-arounds
to achieve the same go away.

* expose CAP_LAST_CAP somehow in the running kernel at runtime:
Userspace needs to know the highest valid capability of the running
kernel, which right now cannot reliably be retrieved from header files
only. The fact that this value cannot be detected properly right now
creates various problems for libraries compiled on newer header files
which are run on older kernels. They assume capabilities are available
which actually aren't. Specifically, libcap-ng claims that all running
processes retain the higher capabilities in this case due to the
"inverted" semantics of CapBnd in /proc/$PID/status.

* export 'struct device_type fb/fbcon' of 'struct class graphics'
Userspace wants to easily distinguish 'fb' and 'fbcon' from each other
without the need to match on the device name.

* allow changing argv[] of a process without mucking with environ[]:
Something like setproctitle() or a prctl() would be ideal. Of course it
is questionable if services like sendmail make use of this, but otoh for
services which fork but do not immediately exec() another binary being
able to rename these child processes in ps is of importance.

* module-init-tools: provide a proper libmodprobe.so from
module-init-tools:
Early boot tools, installers, driver install disks want to access
information about available modules to optimize bootup handling.

* fork throttling mechanism as basic cgroup functionality that is
available in all hierarchies independent of the controllers used:
This is important to implement race-free killing of all members of a
cgroup, so that cgroup member processes cannot fork faster than a cgroup
supervisor process could kill them. This needs to be recursive, so that
not only a cgroup but all its subgroups are covered as well.

* proper cgroup-is-empty notification interface:
The current call_usermodehelper() interface is an inefficient and
ugly hack. Tools would prefer anything more lightweight like a netlink,
poll() or fanotify interface.

* allow user xattrs to be set on files in the cgroupfs (and maybe
procfs?)

* simple, reliable and future-proof way to detect whether a specific pid
is running in a CLONE_NEWPID container, i.e. not in the root PID
namespace. Currently, there are a few ugly hacks available to detect
this (for example a process wanting to know whether it is running in a
PID namespace could just look for a PID 2 being around and named
kthreadd which is a kernel thread only visible in the root namespace),
however all these solutions encode information and expectations that
better shouldn't be encoded in a namespace test like this. This
functionality is needed in particular since the removal of the ns
cgroup controller which provided the namespace membership information to
user code.

* allow making use of the "cpu" cgroup controller by default without
breaking RT. Right now creating a cgroup in the "cpu" hierarchy that
shall be able to take advantage of RT is impossible for the generic case
since it needs an RT budget configured which is from a limited resource
pool. What we want is the ability to create cgroups in "cpu" whose
processes get a non-RT weight applied, but for RT take advantage of the
parent's RT budget. We want the separation of RT and non-RT budget
assignment in the "cpu" hierarchy, because right now, you lose RT
functionality in it unless you assign an RT budget. This issue severely
limits the usefulness of "cpu" hierarchy on general purpose systems
right now.

* Add a timerslack cgroup controller, to allow increasing the timer
slack of user session cgroups when the machine is idle.

* An auxiliary meta data message for AF_UNIX called SCM_CGROUPS (or
something like that), i.e. a way to attach sender cgroup membership to
messages sent via AF_UNIX. This is useful in case services such as
syslog shall be shared among various containers (or service cgroups),
and the syslog implementation needs to be able to distinguish the
sending cgroup in order to separate the logs on disk. Of course
SCM_CREDENTIALS can be used to look up the PID of the sender followed by
a check in /proc/$PID/cgroup, but that is necessarily racy, and actually
a very real race in real life.

* SCM_COMM, with a similar use case as SCM_CGROUPS. This auxiliary
control message should carry the process name as available
in /proc/$PID/comm.
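
As an illustration of the racy lookup described in the SCM_CGROUPS item above, here is a minimal sketch in C, assuming Linux with glibc. SCM_CREDENTIALS, SO_PASSCRED and struct ucred are existing interfaces; the helper names recv_sender_pid, demo_sender_pid and cgroup_path_for are made up for this example and are not part of any proposal.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Receive one message on fd and return the sender PID carried in the
 * SCM_CREDENTIALS control message, or -1 if none was attached. */
pid_t recv_sender_pid(int fd)
{
    char data[64];
    union {
        char buf[CMSG_SPACE(sizeof(struct ucred))];
        struct cmsghdr align;   /* forces suitable alignment */
    } control;
    struct iovec iov = { .iov_base = data, .iov_len = sizeof data };
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = control.buf,
        .msg_controllen = sizeof control.buf,
    };
    struct cmsghdr *cmsg;

    if (recvmsg(fd, &msg, 0) < 0)
        return -1;
    for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL;
         cmsg = CMSG_NXTHDR(&msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET &&
            cmsg->cmsg_type == SCM_CREDENTIALS) {
            struct ucred cred;
            memcpy(&cred, CMSG_DATA(cmsg), sizeof cred);
            return cred.pid;
        }
    }
    return -1;
}

/* Hypothetical demo: send a datagram to ourselves over a socketpair and
 * return the PID the kernel reports for the sender. */
pid_t demo_sender_pid(void)
{
    int sv[2];
    int on = 1;
    pid_t pid = -1;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;
    /* Ask the kernel to attach sender credentials to received messages. */
    if (setsockopt(sv[1], SOL_SOCKET, SO_PASSCRED, &on, sizeof on) == 0 &&
        send(sv[0], "hello", 5, 0) == 5)
        pid = recv_sender_pid(sv[1]);
    close(sv[0]);
    close(sv[1]);
    return pid;
}

/* Build the path a log daemon would have to read next. The sender may have
 * exited, and its PID may have been reused, before this file is opened:
 * that window is the race the wish list item refers to. */
int cgroup_path_for(pid_t pid, char *out, size_t outlen)
{
    return snprintf(out, outlen, "/proc/%ld/cgroup", (long)pid);
}
```

A log daemon would call recv_sender_pid() on its listening socket and then open the path built by cgroup_path_for(); nothing ties the second step to the process that actually sent the message, which is what an SCM_CGROUPS-style message would avoid.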





+1 on setproctitle

Posted Oct 7, 2011 16:52 UTC (Fri) by dskoll (subscriber, #1630) [Link] (13 responses)

I would like to see setproctitle or equivalent. It's very useful if you have a concurrent server that forks and handles requests. If each child sets the process title to reflect what it's doing, ps becomes a pretty effective monitoring and diagnostic tool.

+1 on setproctitle

Posted Oct 7, 2011 17:09 UTC (Fri) by ajb (subscriber, #9694) [Link] (7 responses)

What's the difference between that and "prctl(PR_SET_NAME,name,0,0,0);", which we already have?

+1 on setproctitle

Posted Oct 7, 2011 18:08 UTC (Fri) by sionescu (subscriber, #59410) [Link] (6 responses)

An upper limit of more than 16 chars, hopefully

+1 on setproctitle

Posted Oct 7, 2011 20:16 UTC (Fri) by dskoll (subscriber, #1630) [Link] (5 responses)

Yep. 16 characters is too small to be useful. It's nice to get output like this:

postgres 23453 0.2 0.2 1159236 47684 ? Rs 16:05 0:01 postgres: user dbname 127.0.0.1(43135) SELECT
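
The 16-byte limit mentioned above can be seen with a small sketch in C, assuming Linux. PR_SET_NAME and PR_GET_NAME are the existing prctl() options; the wrapper name set_and_get_name is made up for this example. Note that this sets the short "comm" name (as in /proc/$PID/comm), not the full command line that ps normally prints.

```c
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>

/* Set the calling thread's comm name and read it back into out, which must
 * hold at least 16 bytes. Returns 0 on success, -1 on error. */
int set_and_get_name(const char *name, char out[16])
{
    /* The kernel silently truncates the name to 15 characters plus NUL. */
    if (prctl(PR_SET_NAME, (unsigned long)name, 0, 0, 0) != 0)
        return -1;
    if (prctl(PR_GET_NAME, (unsigned long)out, 0, 0, 0) != 0)
        return -1;
    return 0;
}
```

Passing the title from the ps line above, "postgres: user dbname 127.0.0.1(43135) SELECT", leaves only "postgres: user " in the buffer, which is why this interface is not a substitute for setproctitle().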

±0 on setproctitle

Posted Oct 7, 2011 23:16 UTC (Fri) by jengelh (subscriber, #33263) [Link] (2 responses)

setproctitle you already have, sort of.
void setproctitle(int argc, char **argv, char *title)
{
    strlcpy(argv[0], title, &argv[argc-1][strlen(argv[argc-1])] - argv[0]);
}
or so is how it is currently done, for the glibc-linux platform.

+1 on setproctitle

Posted Oct 8, 2011 15:31 UTC (Sat) by dskoll (subscriber, #1630) [Link] (1 responses)

It's not that easy. Read the source code to some programs that allow changing of the process title (eg, PostgreSQL, Perl, Sendmail.) They all use horrible hacks to get it to work.

+½ on setproctitle

Posted Oct 8, 2011 15:52 UTC (Sat) by jengelh (subscriber, #33263) [Link]

The pgsql code does the same. Plus a handful of extra checks to ensure linearity etc. of course, because it wants to be multi-platform and safe. Now let's rather think about why glibc still does not have the feature (implementing setproctitle via clobbering of argv).
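
The "linearity" check mentioned above might look roughly like the following sketch in C. This is not the PostgreSQL code; the function name contiguous_argv_bytes is made up for this example, and it only verifies that the argument strings sit back to back in memory before anything overwrites them.

```c
#include <stddef.h>
#include <string.h>

/* Return the number of bytes from argv[0] through the terminating NUL of
 * the last argument if the strings are laid out back to back, or 0 if they
 * are not (in which case overwriting from argv[0] would be unsafe). */
size_t contiguous_argv_bytes(int argc, char **argv)
{
    char *expected;
    int i;

    if (argc <= 0 || argv[0] == NULL)
        return 0;
    expected = argv[0];
    for (i = 0; i < argc; i++) {
        if (argv[i] != expected)
            return 0;   /* gap or reordering: not one linear area */
        expected = argv[i] + strlen(argv[i]) + 1;
    }
    return (size_t)(expected - argv[0]);
}
```

A caller would use the returned byte count as the upper bound for the new title and refuse to clobber argv when the result is 0.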

+1 on setproctitle

Posted Oct 8, 2011 9:27 UTC (Sat) by fdr (guest, #57064) [Link] (1 responses)

Color me confused. I have an Ubuntu 10.04 LTS machine, and I have long proctitles:

postgres: archiver process last was 00000003000004B2000000F5

Along with full on user, database, IP(socket) statement-kind information.

Am I missing something?

+1 on setproctitle

Posted Oct 8, 2011 10:11 UTC (Sat) by neilbrown (subscriber, #359) [Link]

The bit you are missing is "without mucking with environ[]".

When a process is started, the top of memory contains:

all the args, each nul terminated
an extra nul
all the environment, each entry nul terminated
maybe another nul (not sure).

setproctitle copies new text over the args. If the new text is longer than the old text, it copies over the environment as well. This can be a problem if you later want to use the environment.

The kernel remembers where the args and environment start and end:
unsigned long arg_start, arg_end, env_start, env_end;
(mm_types.h)
proc_pid_cmdline (fs/proc/base.c) knows that if the extra nul isn't there at arg_end, it should look for more text in the env_{start,end} range too.

This feature could be implemented by allowing arg_start, arg_end to be set by some user-space system call. So the process would allocate some memory, fill it with the text to appear in /proc/$pid/cmdline, and "give" that memory to the kernel.

It might be easier to just support a 'write' request on /proc/self/cmdline. Then the kernel would need to allocate a page to store the text, but that isn't a big deal...
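
The layout described above (each argument NUL terminated, packed one after another) is what a reader of /proc/$pid/cmdline sees. A minimal sketch in C of reading and splitting it, assuming /proc is mounted; the function names count_packed_strings and read_own_cmdline are made up for this example.

```c
#include <stdio.h>
#include <string.h>

/* Count the NUL-terminated strings packed into buf[0..len). A trailing run
 * without a terminator is counted as one more string. */
size_t count_packed_strings(const char *buf, size_t len)
{
    size_t n = 0, pos = 0;

    while (pos < len) {
        const char *nul = memchr(buf + pos, '\0', len - pos);
        n++;
        if (nul == NULL)
            break;                        /* unterminated tail */
        pos = (size_t)(nul - buf) + 1;    /* skip the string and its NUL */
    }
    return n;
}

/* Read /proc/self/cmdline into buf; returns bytes read, or 0 on failure. */
size_t read_own_cmdline(char *buf, size_t cap)
{
    FILE *f = fopen("/proc/self/cmdline", "r");
    size_t got;

    if (f == NULL)
        return 0;
    got = fread(buf, 1, cap, f);
    fclose(f);
    return got;
}
```

For an unmodified process the count equals argc; after a title has been written over the argument area, the same file simply returns whatever bytes are there now.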

Re: +1 on setproctitle

Posted Oct 8, 2011 11:49 UTC (Sat) by ldo (guest, #40946) [Link] (3 responses)

Here’s a routine I came up with a while back:
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <stdarg.h>

extern char **
    environ;
static char *
    argstart = NULL;
static size_t
    maxarglen; /* maximum available size of argument area */
static bool
    envmoved = false;

void setproctitle
  (
    char ** argv, /* argv as passed to main, so args can be moved if necessary */
    const char * fmt,
    ...
  )
  /* something as close as possible to BSD setproctitle(3), but for Linux.
    Note I need argv as passed to main, in order to be able to poke the process
    arguments area. Also don't call routines like putenv(3) or setenv(3)
    prior to using this routine. */
  {
  /* Theory of operation: the command-line arguments and environment variables
    for a process can be found from fields arg_start, arg_end, env_start and
    env_end of the mm_struct object defined in include/linux/mm_types.h in the
    current kernel sources. These areas are set up in different ways depending
    on the executable format; for ELF, see the routine create_elf_tables in
    fs/binfmt_elf.c. This puts the envp pointer array immediately following
    the null entry at the end of the argv pointer array, both in the userspace
    stack. The actual strings are stored contiguously in the area beginning at
    mm_struct.arg_start for the argument strings, and mm_struct.env_start
    (immediately follows mm_struct.arg_end) for the environment strings.
    The current process command line is made visible (to utilities like ps etc) via
    the proc_pid_cmdline routine in fs/proc/base.c. This routine checks that the
    byte at mm_struct.arg_end is still a null; if not, it assumes the process
    has overwritten its argument and environment area with an extra-long title,
    and appends the extra data beginning at mm_struct.env_start as well.

    Limitations: this routine can only use the available argument and environment
    area. If the command-line arguments and environment are small to begin with, then
    that limits the length of process title that can be set. Also libc library routines
    like putenv(3) do their own relocation of the environ array and strings; if they
    are used before the first call of this routine that needs to overflow into the
    environment area, then I won't be able to find the original location of the latter.
  */
    char title[512]; /* big enough? */
    ssize_t titlelen;
      {
        va_list args;
        va_start(args, fmt);
        titlelen = vsnprintf(title, sizeof title, fmt, args);
        va_end(args);
        if (titlelen < 0)
          {
            titlelen = 0; /* ignore error */
            title[0] = 0;
          } /*if*/
        titlelen += 1; /* including trailing nul */
        if (titlelen > sizeof title)
          {
            title[sizeof title - 1] = '\0'; /* do I need to do this? */
            titlelen = sizeof title;
          } /*if*/
      }
    if (argstart == NULL)
      {
      /* first call, find and initialize argument area */
        char ** thisarg = argv;
        maxarglen = 0;
        argstart = *thisarg;
        while (*thisarg != NULL)
          {
            maxarglen += strlen(*thisarg++) + 1; /* including terminating nul */
          } /*while*/
        memset(argstart, 0, maxarglen); /* clear it all out */
      } /*if*/
    if (titlelen > maxarglen && !envmoved)
      {
      /* relocate the environment strings and use that area for the command line
        as well */
        char ** srcenv;
        char ** dstenv;
        char ** newenv;
        size_t envlen = 0;
        size_t nrenv = 1; /* nr env strings + 1 for terminating NULL pointer */
        if (argstart + maxarglen == environ[0]) /* not already moved by e.g. libc */
          {
            srcenv = environ;
            while (*srcenv != NULL)
              {
                envlen += strlen(*srcenv++) + 1; /* including terminating nul */
                ++nrenv; /* count 'em up */
              } /*while*/
            newenv = (char **)malloc(sizeof(char *) * nrenv); /* new env array, never freed! */
            srcenv = environ;
            dstenv = newenv;
            while (*srcenv != NULL)
              {
              /* copy the environment strings */
                *dstenv++ = strdup(*srcenv++);
              } /*while*/
            *dstenv = NULL; /* mark end of new environment array */
            memset(environ[0], 0, envlen); /* clear old environment area */
            maxarglen += envlen; /* this much extra space now available */
            environ = newenv; /* so libc etc pick up new environment location */
          } /*if*/
        envmoved = true;
      } /*if*/
    if (titlelen > maxarglen)
      {
        titlelen = maxarglen; /* truncate to fit available area */
      } /*if*/
    if (titlelen > 0)
      {
      /* set the new title */
        const size_t oldtitlelen = strlen(argstart) + 1; /* including trailing nul */
        memcpy(argstart, title, titlelen);
        argstart[titlelen - 1] = '\0'; /* if not already done */
        if (oldtitlelen > titlelen)
          {
          /* wipe out remnants of previous title */
            memset(argstart + titlelen, 0, oldtitlelen - titlelen);
          } /*if*/
      } /*if*/
  } /*setproctitle*/

Re: +1 on setproctitle

Posted Oct 9, 2011 3:26 UTC (Sun) by neilbrown (subscriber, #359) [Link]

Neat !!

As long as no-one uses getenv() before-hand and holds on to the string - or uses putenv as you say.

I would probably have a setproctitle_prepare() which is called first-thing in main() and relocates both argv and environ so that the original arg space is not referenced by anything. But that might just be paranoia.

Also - ld-linux.*.so uses environment variables. I wonder if it holds on to any that it looked up before main() was called.. Probably not, that would be asking for trouble.

Re: +1 on setproctitle

Posted Oct 10, 2011 16:57 UTC (Mon) by mezcalero (subscriber, #45103) [Link]

This is more or less what my own Avahi does as well. But it's really broken as it breaks /proc/$PID/environ and suchlike. And it doesn't work at all if the env block passed to you is too small.

It's a hack, nothing more. And what we have been asking for is a nicer solution, that isn't just a hack.

this seems tangential to the desired effect.

Posted Oct 25, 2011 12:59 UTC (Tue) by quanstro (guest, #77996) [Link]

perhaps i'm not following along, but one assumes that changing
argv for internal purposes is already trivial, and requires no
further interface. and i don't see what the process'
view of argv/environment has to do with ps(1)'s view. why not
allow /proc/$pid/args or similar to be writable?

+1 on setproctitle

Posted Oct 26, 2011 19:52 UTC (Wed) by cras (guest, #7000) [Link]

I tried to get setproctitle() added a few years ago, but I got tired after a while and just implemented the ugly environ hack. If someone wants to continue, this is probably the last patch: http://lwn.net/Articles/365733/ Someone didn't like the prctl() way anyway and preferred a separate setproctitle() syscall.

A Plumber's Wish List for Linux

Posted Oct 7, 2011 17:40 UTC (Fri) by Lovechild (guest, #3592) [Link] (5 responses)

With Lennart's recent comments on FANOTIFY and its shortcomings for Tracker I would have expected that to be on the list as well. Ah well, the list is bound to continue to grow.

A Plumber's Wish List for Linux

Posted Oct 7, 2011 18:07 UTC (Fri) by krakensden (guest, #72039) [Link] (3 responses)

Do you have a link? I'm working on a project that's having issues with a huge number of inotify watches, and was thinking about how wonderful fanotify was going to be when I got some free time to poke at it.

A Plumber's Wish List for Linux

Posted Oct 7, 2011 18:31 UTC (Fri) by foom (subscriber, #14868) [Link]

Don't know about the particular reference, but last time I looked, fanotify looked to be completely useless. I still can't believe it was actually merged.

http://lwn.net/Articles/421643/

Tracker

Posted Oct 7, 2011 18:34 UTC (Fri) by corbet (editor, #1) [Link] (1 responses)

He may be thinking of this fedora-desktop discussion.

Tracker

Posted Oct 7, 2011 22:52 UTC (Fri) by krakensden (guest, #72039) [Link]

Thanks for the link. I'm always surprised that developer conferences aren't sponsored by whiskey manufacturers.

A Plumber's Wish List for Linux

Posted Oct 10, 2011 17:00 UTC (Mon) by mezcalero (subscriber, #45103) [Link]

I don't work on tracker, so getting renames and unprivileged access to fanotify are not really important to me, though I'd welcome that if they existed...

PID

Posted Oct 7, 2011 18:27 UTC (Fri) by sytoka (guest, #38525) [Link] (9 responses)

- random PID numbering scheme.

- add capabilities to LXC container (right now, drop capabilities which is a security aberration).

PID

Posted Oct 7, 2011 23:20 UTC (Fri) by mathstuf (subscriber, #69389) [Link] (3 responses)

How about udev working in LXC?

PID

Posted Oct 9, 2011 9:30 UTC (Sun) by arekm (guest, #4846) [Link] (1 responses)

And systemd. Or does it work currently in lxc/vserver guests?

PID

Posted Oct 10, 2011 17:04 UTC (Mon) by mezcalero (subscriber, #45103) [Link]

I regularly test systemd in Linux containers (though only in systemd-nspawn, not LXC, since the latter is borked on Fedora, and has been since about always). It should boot up properly, however you'll see a couple of warnings and error messages on the way, for example because sysctl cannot be applied and so on.

udev

Posted Oct 28, 2011 20:33 UTC (Fri) by gvy (guest, #11981) [Link]

udev in LXC is very much needed probably, but I hardly can imagine a single heavyweight use case so far.

PID

Posted Oct 8, 2011 0:22 UTC (Sat) by Richard_J_Neill (subscriber, #23093) [Link] (4 responses)

What about increasing the size of the pid maximum number? Otherwise, there is a real danger that the following code:

#!/bin/bash
process2 &
PID=$!
#do stuff for several hours.
kill $PID
exit

could end up killing the wrong process (if process2 exits, and the PID gets reused).

At least sequentially assigned PIDs make this not too probable: if the PIDs were randomly allocated, without increasing the possible range, it is riskier.

PID

Posted Oct 8, 2011 1:14 UTC (Sat) by neilbrown (subscriber, #359) [Link]

Is "echo 4000000 > /proc/sys/kernel/pid_max" enough for you?

Though to be fair I don't fully understand the memory usage implications of doing this. A quick look suggests that most things scale with the number of processes, not the number of possible processes.
There is a bitmap to track which pids are used, so if used-pids are not dense you would get more wastage in that data structure.

You could try it and see (??).

PID

Posted Oct 8, 2011 11:53 UTC (Sat) by abacus (guest, #49001) [Link]

There's no need to increase PID size in order to solve the race in your program. Just use the following:

#!/bin/bash
process2 &
jobspec="%1"
#do stuff for several hours.
kill $jobspec
exit

Safe killing of a child process

Posted Oct 8, 2011 12:10 UTC (Sat) by pjm (guest, #2080) [Link]

Increasing the pid maximum number won't remove the danger, it would only make the bug rarer.

What you want is ‘kill %process2’, or have a look at bash's job control documentation.

If job control stuff won't do what you're looking for, then you can add a trap on SIGCHLD and keep track of liveness yourself. The ‘jobs’ and ‘wait’ builtins might be useful in some cases.

All of the above should work on any Free Unix kernel, including Linux-2.6.xx and earlier; it doesn't require adding any Linux-3.x-specific features.

PID

Posted Oct 8, 2011 12:35 UTC (Sat) by nix (subscriber, #2304) [Link]

You can't eliminate the race here, but you can narrow it radically by looking at the ppid:

#!/bin/bash
process2 &
PID=$!
# do stuff for several hours.
[[ "$(ps -o ppid= -p $PID)" == $BASHPID ]] && kill $PID
exit

Now this will only lose if $PID dies and a new process starts in that narrow window, or if this bash has many children and one of them started after $PID dies and has the same PID (which you can solve, if you really need to, by remembering PPID->PID mappings in an associative array).

A Plumber's Wish List for Linux

Posted Oct 7, 2011 18:32 UTC (Fri) by jra (subscriber, #55261) [Link] (29 responses)

Working POSIX aio, supported by the kernel (sigh).

Samba has needed this for 5+ years - works fine on Solaris. Don't see why I have to be a second class citizen on Linux.

A Plumber's Wish List for Linux

Posted Oct 7, 2011 18:59 UTC (Fri) by aliguori (subscriber, #30636) [Link] (28 responses)

POSIX AIO is not a very nice interface. Signal based completions are awkward to use at best. In addition, aiocb only contains a flat buffer, not a scatter gather list. For something like QEMU, that makes it a useless interface since it can't support zero-copy I/O.

I'd rather see the effort put into making linux-aio actually work in a meaningful way with buffered I/O. linux-aio is at least a sane interface and is not restricted to the extremely slow process of POSIX standardization.

linux-aio has a proper file descriptor based completion mechanism and support for vectored I/O.

A Plumber's Wish List for Linux

Posted Oct 7, 2011 23:52 UTC (Fri) by jra (subscriber, #55261) [Link] (27 responses)

aliguori wrote:

> POSIX AIO is not a very nice interface. Signal based completions are
> awkward to use at best. In addition, aiocb only contains a flat buffer,
> not a scatter gather list. For something like QEMU, that makes it a
> useless interface since it can't support zero-copy I/O.

But guess what - it's a *STANDARD* that is supported on many other systems.

That's what makes it valuable to have working. We already have code that implements it.

I'm not saying *only* implement the standard. Do a nice Linux-specific interface underneath that the libc code then calls if you like, but don't ignore the standard interface as there's code out there that already uses it.

Of course it needs to work with buffered aio, that goes without saying :-).

A Plumber's Wish List for Linux

Posted Oct 8, 2011 16:44 UTC (Sat) by aliguori (subscriber, #30636) [Link] (24 responses)

You don't need kernel support for POSIX AIO. The reason it's so bad is that glibc serializes each request for reasons I don't understand. I tried to ask Uli but never got a response. We ended up reimplementing POSIX AIO in QEMU and the performance is quite good. We've also added additional interfaces. The kernel would have to use a thread pool anyway unless every file system is rewritten.

A Plumber's Wish List for Linux

Posted Oct 10, 2011 0:35 UTC (Mon) by jra (subscriber, #55261) [Link] (1 responses)

I fixed the glibc code to remove the serialization requirement and tested it. The results were interesting. A micro benchmark given to me by a Samba OEM vendor more than doubled in speed, but then when testing inside Samba with the Intel NAS benchmark things went slower. Still not sure why. Can you point me to your implementation in QEMU? Maybe that will help.

A Plumber's Wish List for Linux

Posted Oct 30, 2011 11:21 UTC (Sun) by Lennie (subscriber, #49641) [Link]

I'm not familiar with the specifics but could something like that code be added to eglibc or another project ?

Maybe as a separate API ?

If the performance really is improved that much it is silly not to be able to use it.

A Plumber's Wish List for Linux

Posted Oct 10, 2011 0:46 UTC (Mon) by jra (subscriber, #55261) [Link] (20 responses)

Found the QEMU aio code. It appears to be licensed GPLv2 *only*. WTF ??? Why would anyone do that ? Just to shaft more forward-looking GPL projects and force people to fork or re-implement ? Can you give me an explanation for this incredibly anti-social act, or if I'm wrong please show me the GPLv2+ text I've missed.

A Plumber's Wish List for Linux

Posted Oct 10, 2011 11:39 UTC (Mon) by liljencrantz (guest, #28458) [Link] (14 responses)

Huh? Why is licensing your code under GPLv2 an «incredibly anti-social act»? Many people, including the Linux kernel developers have chosen GPLv2 as their preferred license.

A Plumber's Wish List for Linux

Posted Oct 10, 2011 16:06 UTC (Mon) by jra (subscriber, #55261) [Link] (13 responses)

It's not licensing under GPLv2 that's anti-social, Samba also used to be licensed under GPLv2. It's licensing GPLv2-*ONLY* that's the anti-social part (and yes I've spoken to many Linux kernel developers about this).

The simple act of licensing GPLv2+ allows other projects that have moved to GPLv3 to re-use that code. It doesn't force the project that wants GPLv2 to move, it simply allows code reuse by other projects that have upgraded the license.

A Plumber's Wish List for Linux

Posted Oct 10, 2011 16:50 UTC (Mon) by mpr22 (subscriber, #60784) [Link] (12 responses)

Releasing a work as GPLv2-or-later means effectively handing over control of the licence terms on the covered versions of the work to a third party in perpetuity; it's entirely reasonable to find that notion objectionable even if one regards the FSF as a trustworthy organization. Just because one happens to find the terms of GPLv2 suitable doesn't mean that one will find the terms of GPLv3 suitable.

A Plumber's Wish List for Linux

Posted Oct 10, 2011 17:00 UTC (Mon) by jra (subscriber, #55261) [Link] (11 responses)

True, but forcing yourself to be "v2-only" isolates your project from the rest of the community. Now if that's what you want to do, then it's your choice, but it's not a social act, more a selfish one.

It also risks the continuance of your project if fatal flaws are found in your 'XX-only' license, as many believe to be the case with GPLv2. You might find your code used in ways you were specifically trying to avoid by your original choice of license.

Picking a 'XX-only' license is a touching faith in the bug-free nature of legal code. As programmers I would hope we were wiser than that.

A Plumber's Wish List for Linux

Posted Oct 10, 2011 17:05 UTC (Mon) by mjg59 (subscriber, #23239) [Link] (5 responses)

Choosing any GPL license isolates you from a large part of the community. I think describing it as anti-social is rather extreme.

A Plumber's Wish List for Linux

Posted Oct 10, 2011 17:45 UTC (Mon) by jra (subscriber, #55261) [Link] (4 responses)

I'm sorry, I didn't define my terms correctly - that's my fault.

When I use the word community, I'm considering the GPL-code-writing community - that's the group I feel a part of. There are other communities, and a broader FLOSS community using different licenses, but they're not ones that I would choose so I don't feel as much a part of them.

Within that narrower definition (although it is 70% of all Free Software out there) then choosing an XX-only license is an anti-social act, as it prevents wider reuse within GPL projects, which IMHO is the purpose of that license.

A Plumber's Wish List for Linux

Posted Oct 10, 2011 18:41 UTC (Mon) by mjg59 (subscriber, #23239) [Link] (3 responses)

It's a community that's based on a large number of non-GPL components (plus one fairly significant GPLv2-only component). If you define "Community" in such a way that, say, X.org isn't part of the community, I think you're aligning yourself pretty differently to the majority of people contributing to free software development.

A Plumber's Wish List for Linux

Posted Oct 10, 2011 21:20 UTC (Mon) by jra (subscriber, #55261) [Link] (2 responses)

Maybe so, but that's the community I feel the most affinity with (the GPL software producing community).

A Plumber's Wish List for Linux

Posted Oct 13, 2011 16:35 UTC (Thu) by bronson (subscriber, #4806) [Link]

'Antisocial' refers to society as a whole, not just the community you personally feel the most affinity with.

A Plumber's Wish List for Linux

Posted Oct 16, 2011 18:01 UTC (Sun) by obi (guest, #5784) [Link]

As it's supposed to be a community, you could simply ask for this specific code to be released GPLv2+.

Or even better, understand their apprehensions about not automatically trusting all future GPL versions, and ask for this code to be released v2 + v3 only.

If people would ask me to release some of my code under an additional or more liberal license so it can be reused I doubt I'd deny them.

So just ask in a friendly way.

A Plumber's Wish List for Linux

Posted Oct 10, 2011 19:58 UTC (Mon) by fperrin (subscriber, #61941) [Link]

> It also risks the continuance of your project if fatal flaws are found in your 'XX-only' license, as many believe to be the case with GPLv2. You might find your code used in ways you were specifically trying to avoid by your original choice of license.

If a fatal flaw in GPLv2 allows Stallman to come to your house and kill your kittens, then a GPLv2+ licensing enables you to release future versions of your software under GPLv3 where you are protected; but previous versions of your software are still available under the GPLv2 and Stallman can still get an old version and come slaughter your kittens (if he wants to; in the GPLv2+ scenario, the choice of the exact license is down to the user).

A Plumber's Wish List for Linux

Posted Oct 10, 2011 21:59 UTC (Mon) by dlang (guest, #313) [Link] (3 responses)

by the same logic, choosing GPLv3+ isolates your project from parts of the community and so is anti-social

face it, when the FSF created the GPLv3 they split the community between those who use GPLv2 and those who use GPLv3, there are some who allow both, but in those cases the code can only move one way to GPLv3+ projects.

A Plumber's Wish List for Linux

Posted Oct 10, 2011 23:33 UTC (Mon) by jra (subscriber, #55261) [Link] (2 responses)

Choosing GPL at all is in itself a statement that you approve of that license and the copyleft provisions within. That in and of itself is a fairly strong statement in support of the FSF who are the creators and maintainers of the license.

Saying "I support GPL and copyleft, but I don't trust the people that created it" strikes me as a bit silly (and anti-social). It's the FSF that are moving it forward to deal with modern threats such as DRM-locked down hardware and software patents that simply didn't exist with earlier versions.

Anyway, this is getting distracting from the fact that the Linux kernel doesn't have properly working POSIX AIO, and the real technical discussions I'd prefer to be having, so I'm going to leave my politics at this comment, and get back to complaining that the kernel doesn't give me the free pony that *I* want :-).

A Plumber's Wish List for Linux

Posted Oct 11, 2011 0:14 UTC (Tue) by vonbrand (subscriber, #4458) [Link]

How is that silly? GPLv2 is out there, you can look it over and satisfy yourself that it does express what you want. GPLv3 doesn't, for a lot of copyleft advocates. And nobody can even hint at the changes GPLv4 might bring...

A Plumber's Wish List for Linux

Posted Oct 12, 2011 22:10 UTC (Wed) by Baylink (guest, #755) [Link]

> Saying "I support GPL and copyleft, but I don't trust the people that created it" strikes me as a bit silly (and anti-social)

I disagree with you, but I'm pretty certain you're not Hitler. :-)

I think it's equivalent to "Love your country, but never trust its government". If v2, which you could read and choose, suited you, and you weren't sure later versions would, then declining to take advantage of the "or later version" language seems perfectly sane to me.

And gimme back my initials; it's confusing? ;-)

A Plumber's Wish List for Linux

Posted Oct 10, 2011 12:07 UTC (Mon) by mpr22 (subscriber, #60784) [Link]

The only licence choices that aren't "antisocial" by the definition apparently in use here are "CC-0" (i.e. as close to public domain as jurisdictions with certain inalienable creators' rights will let you get) and "new BSD".

Licenses?

Posted Oct 10, 2011 16:14 UTC (Mon) by vonbrand (subscriber, #4458) [Link] (3 responses)

Perhaps they just don't trust the FSF to respect the philosophy of the license they are comfortable with. The whole GPLv3 fracas showed clearly that this can be the case.

Licenses?

Posted Oct 10, 2011 23:39 UTC (Mon) by jra (subscriber, #55261) [Link] (2 responses)

I'm old enough to remember when the GPLv2 was beyond the pale and considered pure evil and communism (which is only a dirty word in the USA, and possibly China :-), no commercial company could *possibly* work with code under such a horrible and business-unfriendly license.

My, how times have changed. The same will happen with GPLv3 (probably after GPLv4 is announced :-)

If you don't trust the creators of the license, who do you trust to maintain it ? Do you think it doesn't need maintenance ? I know several lawyers in proprietary software companies that would disagree with you on that fact.

Licenses?

Posted Oct 11, 2011 0:19 UTC (Tue) by vonbrand (subscriber, #4458) [Link] (1 responses)

At least as far as I understand many Linux authors, they are fine with GPLv2 and don't agree with GPLv3, won't give up their copyright (or other rights, including the "change the license" right) to anybody within the Linux community, and definitely will never give the "change the license" right to somebody outside said group. Just remember the flamewars that erupted when it was suggested to move Linux to GPLv3.

Licenses?

Posted Oct 12, 2011 22:13 UTC (Wed) by Baylink (guest, #755) [Link]

And FWIW, there's nothing saying an author could not later relicense their own code as GPLv4, if v3 changes they didn't approve of were rolled back...

Strictly speaking, you're supposed to distribute the code under the version of the license it came with, but if a sole developer later relicensed his code under a newer license, and you took the old package from a 3rd party under that newer license, I'd be hard pressed to see anyone give you crap about it...

A Plumber's Wish List for Linux

Posted Oct 10, 2011 0:49 UTC (Mon) by jra (subscriber, #55261) [Link]

I actually did ask Ulrich and got a response (originally). He told me it was to throttle user space processes who might use too many threads. When I pointed out that the field inside the struct passed to aio_init() satisfied exactly this purpose, *then* he stopped responding to me :-).

A Plumber's Wish List for Linux

Posted Oct 8, 2011 17:48 UTC (Sat) by jcm (subscriber, #18262) [Link] (1 responses)

+1

We have this wonderful delusion in the Linux community that we don't need no stinkin standards and things would be so much better if we just threw things away and did everything differently. And that's precisely what proprietary UNIX did for many years before efforts like POSIX.

Jon.

A Plumber's Wish List for Linux

Posted Oct 8, 2011 17:48 UTC (Sat) by jcm (subscriber, #18262) [Link]

Oh sorry, right, we're all Open Source so it doesn't matter. Puh-lease.

A Plumber's Wish List for Linux

Posted Oct 7, 2011 20:33 UTC (Fri) by kpfleming (subscriber, #23250) [Link] (7 responses)

Some method of being able to poll()/epoll() on both fds *and* synchronization objects (pthread_cond_t objects, for example). Right now the only method to achieve thread-safe wakeup is to use a pipe() and poke it so that the poll() will wake up. Unless I'm mistaken, all of these mechanisms use the same underlying basis in the kernel anyway for waking up processes, so this should be possible. It wouldn't be part of POSIX, of course.

A Plumber's Wish List for Linux

Posted Oct 7, 2011 21:12 UTC (Fri) by cmccabe (guest, #60281) [Link] (2 responses)

Also, the ability to use epoll together with Linux AIO.

A Plumber's Wish List for Linux

Posted Oct 7, 2011 22:07 UTC (Fri) by wahern (subscriber, #37304) [Link] (1 responses)

Linux has a (struct iocb).aio_resfd member which should be set to an eventfd descriptor on which readiness will be signaled.

A Plumber's Wish List for Linux

Posted Oct 8, 2011 16:54 UTC (Sat) by cmccabe (guest, #60281) [Link]

Thanks. It looks like you can integrate POSIX aio with epoll as well by using signalfd() or similar.

A Plumber's Wish List for Linux

Posted Oct 7, 2011 21:51 UTC (Fri) by wahern (subscriber, #37304) [Link] (1 responses)

Use eventfd() instead of a pipe. In fact, use it instead of a POSIX mutex when you want to be able to poll on it. That's what it was invented for--rolling your own pollable semaphores.

A Plumber's Wish List for Linux

Posted Oct 8, 2011 14:08 UTC (Sat) by kpfleming (subscriber, #23250) [Link]

Thanks for that tip... that might be just what I've been looking for (and signalfd() looks quite handy too).

A Plumber's Wish List for Linux

Posted Oct 7, 2011 21:58 UTC (Fri) by HelloWorld (guest, #56129) [Link]

I think that this was the purpose of FUTEX_FD, see man 2 futex. It was removed in 2.6.26 though, due to being inherently racy.

A Plumber's Wish List for Linux

Posted Oct 8, 2011 9:26 UTC (Sat) by helge.bahmann (subscriber, #56804) [Link]

How about doing it the other way around, allow poll/aio/whatever to signal synchronisation objects?

http://chaoticmind.net/~hcb/kfutex/kfutex.pdf

(Yes I'm biased).

A Plumber's Wish List for Linux

Posted Oct 7, 2011 21:56 UTC (Fri) by dpquigl (guest, #52852) [Link] (5 responses)

I wonder why they want user xattrs on cgroupfs and procfs. I can see the other xattr types but it would be interesting to hear the use cases for the user xattr namespace.

A Plumber's Wish List for Linux

Posted Oct 7, 2011 22:13 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

Easy. xattrs allow cgroups to be labeled for use with SELinux.

A Plumber's Wish List for Linux

Posted Oct 7, 2011 22:17 UTC (Fri) by dpquigl (guest, #52852) [Link] (1 responses)

Those aren't user xattrs though. The user. xattr namespace is unprivileged. The SMACK and SELinux xattrs are in the security. namespace.

A Plumber's Wish List for Linux

Posted Oct 7, 2011 23:30 UTC (Fri) by dpquigl (guest, #52852) [Link]

It's worth noting that procfs supports the security name space but cgroupfs doesn't yet. cgroupfs security xattr support should be implemented in the same way that the sysfs xattr support is done instead of proc.

A Plumber's Wish List for Linux

Posted Oct 10, 2011 17:17 UTC (Mon) by mezcalero (subscriber, #45103) [Link] (1 responses)

Having user xattrs on cgroupfs and procfs allows userspace to attach meta information to processes and services (the latter because in systemd each service gets a cgroup of its own). This can be useful for a multitude of things. For example, in systemd we'd like to allow processes to mark themselves as "don't kill me on shutdown during killall" (which some borked DM software might need), and it would be really pretty if they could just set "trusted.dont-kill-me" or so as an xattr on their procfs dir /proc/self, so that it actually is really the process that is marked that way, instead of having a side channel for this. But the fact that this way we can attach meta info to processes and services has a lot of other benefits too. For example, Gtk programs could expose their app name and icon via an xattr on /proc/self and gnome-system-monitor could use it to show a pretty name in the process view and so on.

In fact, attaching meta information to OS objects like cgroups and processes is probably a lot more useful than simply attaching it to normal files, as we have supported for so long now.

A Plumber's Wish List for Linux

Posted Oct 11, 2011 1:00 UTC (Tue) by ebiederm (subscriber, #35028) [Link]

Shudder. I hate to think about what it would take to implement xattrs on files under /proc/$PID. And only for a little convenience. Shudder.

wait() on a PID that is not your child

Posted Oct 8, 2011 0:27 UTC (Sat) by Richard_J_Neill (subscriber, #23093) [Link] (11 responses)

There are lots of uses for this! It's usually possible to work around, but sometimes not. A contrived example might be for a user to want kdialog to tell him as soon as apache quits; at the moment, it's necessary to poll.

wait() on a PID that is not your child

Posted Oct 8, 2011 1:33 UTC (Sat) by HelloWorld (guest, #56129) [Link] (10 responses)

That is inherently racy. In order to do that, you first need to find out about the relevant process's PID, and then start waiting for it. But when you start waiting, the process may already have exited, and a new process might have its PID now, so you may end up waiting for another process than the one you intended. In order to fix this, larger PIDs would be necessary, so that every new process gets a PID that was never used before.
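
The polling workaround, and the place where the race described above bites, can be sketched roughly as follows. This is only an illustration: `pid_exists` and `wait_for_exit` are hypothetical helper names, and the check relies on the standard behaviour that sending signal 0 performs the existence and permission checks without delivering anything.

```python
import os
import time


def pid_exists(pid):
    """Return True if some process currently holds this PID."""
    try:
        os.kill(pid, 0)  # signal 0: checks only, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the PID is in use, but by another user's process
    return True


def wait_for_exit(pid, interval=0.1, timeout=None):
    """Poll until the PID disappears; return False if the timeout expires.

    Caveat (the race from the thread): if the PID is recycled between two
    checks, this keeps waiting on an unrelated process.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while pid_exists(pid):
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True
```

Nothing in the loop can tell "the same process is still running" from "a different process now has that number", which is the gap a real wait-on-any-PID interface would need to close.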

wait() on a PID that is not your child

Posted Oct 8, 2011 3:10 UTC (Sat) by neilbrown (subscriber, #359) [Link] (9 responses)

I would suggest that the best way to do this is not to use 'wait' at all but to support 'poll' on some file in /proc/$pid.
I would suggest /proc/$pid/status. It wouldn't be hard to get a poll on this file to report POLLERR when the process dies or changes state.
For extra points you could add an 'exit status' field that only appears after the process has exited - not sure if that is necessary though.

wait() on a PID that is not your child

Posted Oct 8, 2011 11:08 UTC (Sat) by HelloWorld (guest, #56129) [Link] (8 responses)

That doesn't fix the problem I pointed out.

wait() on a PID that is not your child

Posted Oct 9, 2011 3:10 UTC (Sun) by neilbrown (subscriber, #359) [Link] (2 responses)

You are correct, it doesn't. But it could.

I was thinking that /proc/$PID was somehow linked to the actual process so that when the process died, that directory would become empty and would stay empty. However it isn't.
/proc/$PID is linked to $PID so if a new process appeared with the same pid, its details would appear in the same directory.
i.e. if you "cd /proc/$PID", then "kill -9 $PID", the directory will appear empty (or give an error on readdir) but if another process gets called $PID, "ls ." will start showing things again.

However this could easily be "fixed" for example by using a generation number similar to that used by NFS. Each new process gets a random generation number assigned to it and when you open /proc/$PID that number gets copied into the inode that is created. Then accesses to a process through that inode always check that the generation number is correct as well as the pid. About a dozen lines of code.

With that in place, your race would be trivial to avoid. Just "chdir" to the /proc/$PID directory, check again that this is the process that you are interested in, then open "status" and 'poll' for POLLERR.

wait() on a PID that is not your child

Posted Oct 9, 2011 7:25 UTC (Sun) by ebiederm (subscriber, #35028) [Link] (1 responses)

/proc/$PID/ is linked to the process and it very much becomes empty when a process dies.

If a new process gets the same pid a different directory is created.

The tricky bit is actually the way process death updates are implemented internally. The data structures are backwards and need to be turned around so poll on the file descriptor could be implemented.

There is still the race of changing into the directory at the top of this thread but pid reuse is typically slow enough that race should be hard to hit.

wait() on a PID that is not your child

Posted Oct 9, 2011 10:54 UTC (Sun) by neilbrown (subscriber, #359) [Link]

hmm... I must have missed an important piece in the code. I just tested and the old empty directory definitely stays empty as you said. Thanks for the correction.

I don't think the race at the top is real. Whatever mechanism was used to determine which pid to wait for can be repeated after the "cd /proc/$pid" to see if it is still the same. If it is the same, then it is perfectly safe to wait for files to disappear (if/when there is a mechanism to do that).
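
A rough sketch of the "pin the directory, then re-check" idea, assuming a Linux system with /proc mounted. The function names are invented for illustration, and reading `comm` is just one possible identity check; whatever mechanism originally picked the PID would be repeated here. It leans on the behaviour reported in this thread: a handle on the old /proc/$PID directory goes empty when the process dies and is not repopulated if the PID is reused.

```python
import os


def pin_process(pid):
    """Open /proc/<pid> as a directory fd that stays tied to this process."""
    return os.open(f"/proc/{pid}", os.O_RDONLY | os.O_DIRECTORY)


def read_comm(proc_dirfd):
    """Read the command name through the pinned directory (identity check)."""
    fd = os.open("comm", os.O_RDONLY, dir_fd=proc_dirfd)
    try:
        return os.read(fd, 64).decode().strip()
    finally:
        os.close(fd)


def still_running(proc_dirfd):
    """True while lookups inside the pinned directory still succeed.

    Assumption: once the process is gone, lookups in the stale directory
    fail with an OSError. A zombie that has not been reaped still counts.
    """
    try:
        os.stat("status", dir_fd=proc_dirfd)
    except OSError:
        return False
    return True
```

A caller would pin the directory first, compare `read_comm()` (or any stronger check) against what it expects, and only then start watching `still_running()`. Until the kernel offers a way to block on the directory, the watching part remains a polling loop.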

wait() on a PID that is not your child

Posted Oct 10, 2011 17:21 UTC (Mon) by mezcalero (subscriber, #45103) [Link] (4 responses)

I wonder if this could be fixed by actually having 64bit PIDs. It's not that easy making things overrun 2^64.

wait() on a PID that is not your child

Posted Oct 10, 2011 19:01 UTC (Mon) by HelloWorld (guest, #56129) [Link] (2 responses)

64 bit PIDs ought to be enough for anybody. If the system creates 10000
processes per second, the PIDs would overflow after 2^64/(60*60*24*365*10000) = 58494241 years.
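
The arithmetic above is easy to re-run for other assumptions. In this small sketch the function name and the 365-day year are illustrative choices; the spawn rate is the 10000 processes per second used in the comment.

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365


def years_until_wrap(pid_bits, spawns_per_second):
    """Whole years before a never-reused PID counter of pid_bits bits wraps."""
    return 2**pid_bits // (SECONDS_PER_YEAR * spawns_per_second)


print(years_until_wrap(64, 10_000))  # prints 58494241
```

At the same spawn rate a 32-bit counter would wrap in roughly five days, which is why plain 32-bit PIDs cannot simply be declared "never reused".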

wait() on a PID that is not your child

Posted Oct 12, 2011 22:15 UTC (Wed) by Baylink (guest, #755) [Link] (1 responses)

> should be enough for anybody.

You didn't *really* expect to get away with that, here, did you?

:-)

wait() on a PID that is not your child

Posted Oct 12, 2011 22:49 UTC (Wed) by HelloWorld (guest, #56129) [Link]

It was worth a try.

wait() on a PID that is not your child

Posted Oct 11, 2011 1:13 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

They would look quite ugly, though.

Blocking access to removable drives that haven't spun up yet

Posted Oct 8, 2011 0:31 UTC (Sat) by Richard_J_Neill (subscriber, #23093) [Link] (5 responses)

If you put a CD/DVD into the optical drive, and then immediately [without waiting ~ 5 seconds] launch VLC/Mplayer/etc, you usually get an error message saying "no disk present".

It takes about 5 seconds for the optical drive to spin up and read a disk. During this time, the drive surely 'knows' that something is happening; access to it should therefore block for a few seconds.

Blocking access to removable drives that haven't spun up yet

Posted Oct 8, 2011 23:49 UTC (Sat) by Los__D (guest, #15263) [Link] (4 responses)

But does the drive tell the host controller that something is happening, and does the controller have the ability to pass that information on? Without that, the OS can't do much.

Blocking access to removable drives that haven't spun up yet

Posted Oct 9, 2011 11:44 UTC (Sun) by tialaramex (subscriber, #21167) [Link] (3 responses)

I won't swear to it, but I have a feeling my older optical drives (when I still used one more than once in a blue moon) would actually block an I/O operation while spinning up with a new disc.

So I suspect this is a drive firmware decision and not amenable to influence by the operating system.

Blocking access to removable drives that haven't spun up yet

Posted Oct 10, 2011 22:55 UTC (Mon) by smoogen (subscriber, #97) [Link] (2 responses)

I believe this can happen with the new Green drives also. Your drive is spun down and you are talking to its "cache", which is good enough if what you want is something it fetched recently. Otherwise you are blocked until the drive spins up.

Blocking access to removable drives that haven't spun up yet

Posted Oct 12, 2011 7:09 UTC (Wed) by Los__D (guest, #15263) [Link] (1 responses)

Yes, that is pretty standard, and I guess that the drive just lets the controller wait for the data. Problem is that CD/DVD/Blu-Ray drives don't do that, they send a no-disc (or something to that effect) instead, even while they are spinning up the disc.

I have no idea why it's done that way, it seems rather silly when the drive knows that data is (probably) on the way.

Blocking access to removable drives that haven't spun up yet

Posted Oct 15, 2011 0:11 UTC (Sat) by giraffedata (guest, #1954) [Link]

I'm not sure which drives we're talking about, but I've had CD drives for about 10 years that spin down after a period of inactivity, and when the client sends a read command, the drive spins up while the command executes.

However, when you load a disk, before it spins up and the drive otherwise loads the disk, commands fail immediately.

I don't know any reason a "green" drive would be different.

But it does seem like an obvious idea to have the drive hold off on completing any commands while loading is taking place.

A Plumber's Wish List for Linux

Posted Oct 8, 2011 3:00 UTC (Sat) by fest3er (guest, #60379) [Link] (4 responses)

I would like to see a netfilter/iptables feature similar to one found in ipset: the ability to create and populate a new chain, then swap that chain with an existing chain and delete the (now old) chain. This would be a boon for firewall administration and security in that it would reduce code complexity and greatly reduce the amount of time that rules are 'missing'.

A Plumber's Wish List for Linux

Posted Oct 8, 2011 14:07 UTC (Sat) by maxximino (subscriber, #80685) [Link] (1 responses)

It's already possible.
Check iptables-restore: it applies the bunch of rules you give it ATOMICALLY.
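
As a hedged illustration of what "the bunch of rules you give it" looks like, the sketch below renders a rule set in the text format that iptables-save emits and iptables-restore reads: a `*table` line, `:chain` declarations, the rules, then `COMMIT`. The function name, the chain name `fw-rules` and the sample rule are made up, and the sketch only builds the text; it does not invoke iptables.

```python
def render_restore_input(table, user_chains, rules):
    """Build iptables-restore input for one table as a single string."""
    lines = [f"*{table}"]
    for chain in user_chains:
        # User-defined chains have no policy, hence "-"; counters start at 0.
        lines.append(f":{chain} - [0:0]")
    lines.extend(rules)
    # Everything since the "*table" line is applied together at COMMIT.
    lines.append("COMMIT")
    return "\n".join(lines) + "\n"


ruleset = render_restore_input(
    "filter",
    ["fw-rules"],
    ["-A fw-rules -p tcp --dport 22 -j ACCEPT"],
)
```

The resulting string would be fed to iptables-restore on standard input. Built-in chains such as INPUT would need a real policy (for example ACCEPT) in place of the `-`.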

A Plumber's Wish List for Linux

Posted Jul 20, 2012 18:25 UTC (Fri) by fest3er (guest, #60379) [Link]

Finally had some time to ponder this. How *many* rules can be restored atomically? In previous playing with iptables-restore, I'd found that periodic COMMITs (every 15-25k rules) were needed. Doesn't a COMMIT terminate/end the atomicity?

A Plumber's Wish List for Linux

Posted Oct 11, 2011 0:34 UTC (Tue) by nybble41 (subscriber, #55106) [Link] (1 responses)

I'm hardly an expert on iptables, but it seems that, apart from using iptables-restore, you could also use an intermediate chain as a sort of "function pointer" to switch from the old rules to the new ones with a single update:

# set up the initial rules
iptables -N real-chain-1
iptables -A real-chain-1 ...

# create the indirect chain
iptables -N replaceable-chain
iptables -A replaceable-chain -g real-chain-1

# use it
iptables ... -j replaceable-chain

# later...

# set up the new rules
iptables -N real-chain-2
iptables -A real-chain-2 ...

# switch to the new rules
iptables -R replaceable-chain 1 -g real-chain-2

# clean up
iptables -F real-chain-1
iptables -X real-chain-1

A Plumber's Wish List for Linux

Posted Jul 20, 2012 18:31 UTC (Fri) by fest3er (guest, #60379) [Link]

Yes, that's generally possible. But it requires the chain name change to be tracked externally. (OK, I have to change the rule set again. Am I, right now, using chain_0 or chain_1?)

Refreshing the groups of a running process

Posted Oct 11, 2011 9:57 UTC (Tue) by DaD (guest, #80726) [Link] (1 responses)

One of the oldest things that annoys me is the inability to refresh the groups of a running process, especially for X.

me@debian:~$ sudo adduser me plugdev
Adding user `dad' to group `plugdev' ...
Adding user dad to group plugdev
Done.
me@debian:~$ grep '^Groups' /proc/$$/status # Does not list the group ID of plugdev.

Regards.
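
The mismatch shown in the transcript above can also be checked from inside a process. In this small sketch (the function name is made up) the groups the user database currently grants are compared with the groups the running process actually holds; anything in the difference was added after the process, or the login session it descends from, started.

```python
import os
import pwd


def missing_groups():
    """Group IDs granted by the user database but not held by this process."""
    try:
        entry = pwd.getpwuid(os.getuid())
    except KeyError:
        # No passwd entry for this UID (e.g. some containers): nothing to compare.
        return set()
    configured = set(os.getgrouplist(entry.pw_name, entry.pw_gid))
    held = set(os.getgroups()) | {os.getegid()}
    return configured - held
```

A non-empty result is the situation complained about here: the database has been updated, but the process keeps the group list it started with.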

Refreshing the groups of a running process

Posted Oct 13, 2011 10:47 UTC (Thu) by cortana (subscriber, #24596) [Link]

At least groups are no longer necessary for most things on the desktop; udev/policykit take care of adding/removing ACLs on device nodes as users log in/log out/activate/deactivate their terminals.

Being able to revoke access to a file after the user has opened it, however, would still be useful...

A Plumber's Wish List for Linux

Posted Oct 13, 2011 12:40 UTC (Thu) by etienne (guest, #25256) [Link]

If the subject was not to find people to implement the Wish List, but to increase the Wish List itself, I would have added:

- an official interface to get the disk/sector numbers from the block number of a filesystem (the latter obtained via FIBMAP/FIEMAP).
Sometimes it is easy to get (simple partition table) but other times nearly impossible (multi-disk RAID with LVM on top).


Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds