
What goes into default Debian?

By Jake Edge
February 17, 2021

The venerable locate file-finding utility has long been available for Linux systems, though its origins are in the BSD world. It is a generally useful tool, but does have a cost beyond just the disk space it occupies in the filesystem; there is a periodic daemon program (updatedb) that runs to keep the file-name database up to date. As a recent debian-devel discussion shows, though, people have differing ideas of just how important the tool is—and whether it should be part of the default installation of Debian.

There are several variants of locate floating around at this point. The original is described in a ;login: article from 1983; a descendant of that code lives on in the GNU Find Utilities alongside find and xargs. After that came Secure Locate (slocate), which checks permissions to only show file names that users have access to, and its functional successor, mlocate, which does the same check but also merges new changes into the existing database, rather than recreating it, for efficiency and filesystem-cache preservation. On many Linux distributions these days, mlocate is the locate of choice.

But Steinar H. Gunderson has created another variant, plocate, which he has suggested should be the standard locate for the upcoming Debian 11 ("Bullseye") release. He said that plocate can completely supplant mlocate:

Now, there is plocate (written and maintained by myself). It is orders of magnitude faster than mlocate (both on SSDs and HDDs alike), has the same security model, a smaller database (typically half the size), and fixes a few long-standing mlocate bugs. It is nearly fully command-line compatible with mlocate, so most users should feel right at home.

He pointed out that mlocate used to be installed by default, but that was changed for Debian 10 ("Buster"). He would like to see locate return as part of the default install, but to use plocate instead:

I believe a good, fast locate is something that we should have in our base install; it is fine to keep it out of the cloud image (as today), but it is genuinely useful for both desktops and servers, and with a low cost.

Bernd Zeimetz agreed that plocate should be part of the default install, as did Paul Wise, but Wise was concerned about the cost of keeping the database updated. Gunderson said that plocate (like mlocate) tries to be smarter than simply walking the whole filesystem. "It keeps track of the mtime of each directory, and doesn't do the readdir()/getdents() if it hasn't changed." But Josh Triplett argued that while plocate is a good choice for the default locate for Debian, it should not be part of the default (or base) install:

However, I don't think *any* locate should be in the base install, as long as that entails having any kind of regularly scheduled task that indexes the filesystem, even if that task has been made relatively cheaper than other implementations. locate is a purely user-facing tool, not really usable for portable scripting, since neither its presence nor its functioning can be assumed. Many users won't even know it exists (locate has far fewer users than find), and for all of those non-users, the effort spent building the database will go entirely to waste.

Beyond that, he pointed out that desktop environments often provide similar functionality, "typically based on a change-watching API (e.g. inotify) rather than a regularly scheduled update". Gunderson said that the amount of wasted time generally amounts to "a few seconds every night". Triplett noted that there was a counterexample to that figure in the thread, but he also made a broader point about defaults in the distribution:

[...] It's not a cost we should impose on the majority of Debian installations just so that someone can run locate without installing it first.

The defaults need to cater for 1) the broadest set of users, and 2) users who are less likely to change the defaults. Most users don't run locate, and those who do are more likely to be users who can and do change the defaults. `apt install plocate` isn't hard for someone who uses locate to do.

But, as Gunderson pointed out, systems that are shared by multiple users could benefit from having locate available—without having to ask an administrator to install it. He suggested that locate is a standard tool for users of the command line, as well. Adrian Bunk said that shared systems are "pretty rare", but Russ Allbery pointed out that "rare" might not be the right characterization: "They've become *rarer*, but they're still very common in the academic and scientific research world."

Bunk also noted that many non-technical Linux users never actually touch the shell. Gunderson wondered why Debian installs a whole host of other utilities (e.g. netcat, lsof, the PCI utilities) that are really only useful for technical, shell-aware Linux users. Those utilities are "expected to be on a typical Linux system by almost every technically-knowledgeable Linux user", Marvin Renich said, but locate does not quite rise to that level. Others disagreed, of course. Bjørn Mork drew the line differently, noting that administrators can always add tools that they need, but sometimes users cannot:

I also use Linux systems I don't [administrate]. I'd hate to have to ask the admin for every basic Unix tool I need. Some of the standard tools you mention are only relevant for admins. Those don't need to be standard. But the ones that are relevant for users should be.

In addition, not having locate available by default is "yet another step away from being a proper Unix system", Mork said. Holger Levsen suggested that it was perhaps time for someone to create a package that installs a "proper Unix system". There would, of course, be differing opinions on what constitutes such a system, which is part of what is driving the question of a default locate as well. One suspects that hair length or color would not actually play much of a role as was (jokingly) suggested, but several packages with different choices could be accommodated, as Levsen noted.

So it would seem that plocate will be the default locate for Debian, though mlocate will still be available, but neither will be installed by default. Or at least there was no huge groundswell of support to change the current practice. That is in keeping with the practice of other Linux distributions (Fedora, openSUSE, Ubuntu, and RHEL, at least), but it is understandable that locate might be missed. find is a reasonable substitute, but it lacks the instant feedback that locate provides. For me, that's worth installing an extra package and expending a few seconds per day—your mileage may vary.




What goes into default Debian?

Posted Feb 17, 2021 20:43 UTC (Wed) by mathstuf (subscriber, #69389) [Link] (27 responses)

I've forgotten about locate myself. I remember fighting with `updatedb` for disk access at times too (no idea which implementation it was either). Personally, I've found `fzf`[1] to be of far more use since it uses thread pools and has interactive filtering. But I also tend to know the ballpark of where the files I'm interested in are, so maybe that's another axis.

Also, using mtimes seems weird for system packages. If I downgrade a package or install a newer version that happened to be built before the older one (copr repositories on Fedora, AUR on Arch, or whatever Ubuntu is providing…can't remember the name), does `updatedb` get confused (AFAIU, timestamps tend to come from the package, not install time)?

[1] https://github.com/junegunn/fzf/

What goes into default Debian?

Posted Feb 17, 2021 21:14 UTC (Wed) by juliank (guest, #45896) [Link]

I'm not sure if mtime works like that for directories. But also you can just do != instead of > to see if you need to rescan.
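
The != comparison can be sketched in a few lines of Python. This is only an illustration of the idea, not the logic of any particular updatedb; the `needs_rescan` helper and the choice to record the mtime in nanoseconds are assumptions made for the example. Note that a directory's mtime changes when entries are added, removed, or renamed in it, not when the files it contains are modified.

```python
import os
import tempfile

def needs_rescan(directory, recorded_mtime_ns):
    """Return True if the directory's mtime differs from the recorded one.

    Comparing with != rather than > means that a timestamp which moved
    backwards also triggers a rescan.
    """
    return os.stat(directory).st_mtime_ns != recorded_mtime_ns

with tempfile.TemporaryDirectory() as scratch:
    recorded = os.stat(scratch).st_mtime_ns
    print("untouched:", needs_rescan(scratch, recorded))
    # Push the directory's mtime ten seconds into the past.
    os.utime(scratch, ns=(recorded, recorded - 10 * 10**9))
    print("mtime moved backwards:", needs_rescan(scratch, recorded))
```

A check using > would treat the second case as "nothing new", which is the confusion the parent comment worries about.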

What goes into default Debian?

Posted Feb 17, 2021 22:08 UTC (Wed) by warrax (subscriber, #103205) [Link] (24 responses)

It's a relic from a time long gone. Even spinning metal and most network file systems are plenty fast enough that just doing a find / ... is usually fine. (IIRC the standard updatedb even goes so far as to exclude networked file systems, so that's moot, I guess.)

I mean, if your busy loop is 'find a file' then maybe something like it makes sense, but if you need to do that then you should find a better way to do it than calling locate.

What goes into default Debian?

Posted Feb 17, 2021 22:25 UTC (Wed) by Sesse (subscriber, #53779) [Link] (23 responses)

“find / -name \*LWN\*” on my main machine takes 9 minutes 11 seconds.

“plocate LWN” takes 8 milliseconds.

I wrote plocate because mlocate's slowness was a real impediment to my (volunteer) sysadmin tasks. find is fine if you have a tiny system or a narrow search scope, but even on my laptop's SSD with a pretty small installation, it takes 2–3 seconds to run.

What goes into default Debian?

Posted Feb 17, 2021 23:28 UTC (Wed) by clump (subscriber, #27801) [Link] (5 responses)

The -iname flag for GNU find will do case-insensitive searches. Just in case you didn't know about it.

I like the ease of locate but switched to find years ago because find is always up to date.

What goes into default Debian?

Posted Feb 18, 2021 6:12 UTC (Thu) by dowdle (subscriber, #659) [Link]

locate shouldn't be more than 24 hours out of date... and if you are looking for something more recent, you most likely created it and it's in your home directory... and you can find it quickly enough. But for the vast majority of the filesystem outside of your home directory, locate smokes find.

If I've done a lot of package installs or updates... I'll often run updatedb before using locate. The updatedb action usually only takes a second or two. So making locate as up-to-date as find, is still way, way faster than find.

Yes, there are feature differences because locate only matches file/dir names whereas find has a whole slew of properties you can search for. It does NOT need to be an either or... or one is better than the other. Use both, they are both great.

What goes into default Debian?

Posted Feb 18, 2021 7:29 UTC (Thu) by anton (subscriber, #25547) [Link] (3 responses)

"rlocate is an implementation of the ``locate'' command that is always up-to-date." Except that rlocate itself is not up-to-date; it was written before inotify/fanotify, so it uses its own kernel module instead. But maybe one of the current locate implementors can add an always-up-to-date feature based on fanotify.

It's funny that some people argue that updatedb is too costly while others argue that "find /" (which costs hardly less) is fast enough.

What goes into default Debian?

Posted Feb 18, 2021 9:20 UTC (Thu) by smcv (subscriber, #53363) [Link] (2 responses)

It depends on how much you use it. Assume updatedb runs once a day, for example. If you run locate multiple times a day, then the daily updatedb is definitely "cheaper" than using find every time; but if you only run locate once a year, then you're reading the whole filesystem hierarchy 365 times as often as you need to.
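
That trade-off is simple arithmetic, sketched below. The model is deliberately crude and rests on two assumptions that are not from the thread: every scheduled updatedb run counts as one full tree walk (ignoring the mtime shortcut), and every search done with find instead also counts as one full walk. The function name is hypothetical.

```python
def yearly_tree_walks(searches_per_year, updatedb_runs_per_year=365):
    """Return (walks with a scheduled updatedb, walks with find on demand)."""
    with_updatedb = updatedb_runs_per_year  # one walk per scheduled update
    with_find = searches_per_year           # one walk per search
    return with_updatedb, with_find

for searches in (1, 365, 3650):
    indexed, on_demand = yearly_tree_walks(searches)
    print(f"{searches:>5} searches/year: updatedb={indexed} walks, find={on_demand} walks")
```

Under these assumptions the break-even point is one search per day; below that, the nightly index costs more walks than it saves.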

What goes into default Debian?

Posted Feb 18, 2021 9:39 UTC (Thu) by anton (subscriber, #25547) [Link] (1 responses)

It also depends on how you value the user's time vs. the computer's time. However, on my personal system I indeed do not run updatedb automatically, because last time I did (long ago) it would run right on system startup (i.e., every morning) and make the system sluggish.

What goes into default Debian?

Posted Feb 18, 2021 9:44 UTC (Thu) by Sesse (subscriber, #53779) [Link]

While this is a valid concern, do note that updatedb in plocate is run with the “idle” I/O class, so it should get lower priority than your interactive use. (At least that's what the kernel claims!)

What goes into default Debian?

Posted Feb 18, 2021 6:02 UTC (Thu) by atai (subscriber, #10977) [Link] (3 responses)

Funny how some people reacted to daemons of locate-like indexers:

how to disable GNOME Tracker: https://www.linuxquestions.org/questions/ubuntu-63/how-to...

how to disable KDE Baloo: https://askubuntu.com/questions/1214572/how-do-i-stop-and...

What goes into default Debian?

Posted Feb 18, 2021 8:45 UTC (Thu) by Wol (subscriber, #4433) [Link]

Well, when the default install basically screws your system ...

I'm a bit like that - my grief was with Akonadi, I think the longest I waited to log in was about 36 hours, I ended up installing xfce in order to get a usable system.

(That 36 hours - that wasn't "login until usable desktop", it was "login until I killed the system in frustration". For someone who uses their PC as a desktop, ie switch it off every night, login times like that just aren't acceptable. Well, they're not acceptable full stop, but ...)

Cheers,
Wol

What goes into default Debian?

Posted Feb 19, 2021 20:18 UTC (Fri) by clump (subscriber, #27801) [Link]

Tracker is obnoxious. People have been asking for a simple way to disable it for years.

What goes into default Debian?

Posted Feb 25, 2021 10:56 UTC (Thu) by oak (guest, #2786) [Link]

Trackers index file *contents*, not just the file names. That’s at least an order of magnitude heavier than what locate does.

What goes into default Debian?

Posted Feb 18, 2021 8:52 UTC (Thu) by josh (subscriber, #17465) [Link]

Personally, I would love to have an optional mechanism for "find" to cache some information *when run*, so that it can reduce the amount of work it needs to do while still giving up-to-date results.

What goes into default Debian?

Posted Feb 19, 2021 18:48 UTC (Fri) by jond (subscriber, #37669) [Link]

I saw your blog post and it was the first I’d heard of plocate, but I am interested. I worked on the mlocate package this cycle because I wanted to fix some issues I had with it, but it hadn’t occurred to me to explore alternatives at the time. We should definitely evaluate the situation for bullseye+1.

What goes into default Debian?

Posted Feb 25, 2021 13:02 UTC (Thu) by Hello71 (subscriber, #103412) [Link] (3 responses)

this seems like a bogus comparison if you don't specify -xdev to skip searching devtmpfs, procfs, and sysfs, the latter of which contains deeply nested trees. unless plocate also searches /sys?
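
For readers who have not used -xdev: it tells find not to descend into directories that live on a different filesystem from the starting point. A rough Python equivalent compares device numbers, as sketched below. The `walk_one_filesystem` helper is hypothetical and written only for illustration; it is not how find or plocate is implemented.

```python
import os
import tempfile

def walk_one_filesystem(root):
    """Yield paths under root without descending into other filesystems."""
    root_dev = os.lstat(root).st_dev
    stack = [root]
    while stack:
        directory = stack.pop()
        try:
            entries = list(os.scandir(directory))
        except OSError:
            continue  # unreadable directory: skip it
        for entry in entries:
            yield entry.path  # a mount point itself is still reported
            try:
                if (entry.is_dir(follow_symlinks=False)
                        and entry.stat(follow_symlinks=False).st_dev == root_dev):
                    stack.append(entry.path)
            except OSError:
                continue

# Small demonstration on a scratch directory.
with tempfile.TemporaryDirectory() as scratch:
    os.makedirs(os.path.join(scratch, "a", "b"))
    for path in walk_one_filesystem(scratch):
        print(os.path.relpath(path, scratch))
```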

What goes into default Debian?

Posted Feb 25, 2021 13:52 UTC (Thu) by zdzichu (subscriber, #17118) [Link] (2 responses)

Those filesystems are virtual, "find" goes through them in single-digit milliseconds.

What goes into default Debian?

Posted Feb 25, 2021 14:03 UTC (Thu) by pizza (subscriber, #46) [Link] (1 responses)

On the system I'm writing this from, 'time find /sys > /dev/null' claims 81ms (with 55042 entries)

(So it's double-digit-ms speeds..)

What goes into default Debian?

Posted Feb 25, 2021 15:33 UTC (Thu) by zdzichu (subscriber, #17118) [Link]

Before posting I've checked on my work environment (Fedora in HyperV on Windows 10, on ~5 years old ThinkPad T560). I was getting 5-8 ms for /dev, /proc and /sys.
It completely doesn't matter while we are talking about "find / …" taking over 9 minutes, so let's end this subthread here.

What goes into default Debian?

Posted Mar 13, 2021 18:10 UTC (Sat) by nix (subscriber, #2304) [Link] (6 responses)

I was really happy when I saw plocate existed... until I noticed it didn't support multiple databases. This makes it a complete non-starter for larger installations exporting large numbers of files over NFS, the very sorts of installations in which a fast locate is most necessary: with locate, slocate and mlocate this is easily supportable by putting a locatedb covering each filesystem at the root of each exported filesystem and pointing the LOCATE_PATH through all of them, but with plocate there is no replacement except for skipping all the remote filesystems entirely (my $HOME is on one of them: that's out) or traversing them from the client (they're huge: no). Even though plocate has a two-phase db generation process, you can't even pass multiple dbs in to the second phase to emit a single plocate db that is the union of all of them.

I know nobody cares about people with networked filesystems any more, but this made me sad :(

What goes into default Debian?

Posted Mar 20, 2021 8:45 UTC (Sat) by Sesse (subscriber, #53779) [Link] (5 responses)

There's nothing preventing plocate from searching multiple databases, especially if you just want them to be searched serially. It's literally a for loop in the client—you probably want some sort of path rewriting in updatedb, but that could be done, too. (Then there's the question on whether you want to try to be maximally clever by pruning away the prefix or not. YMMV.) Due to io_uring, it should be reasonably performant to search databases on NFS, although I haven't tried.

You could probably even just make a shell script that calls plocate multiple times. The main reason I've never done it is that it's such a niche case nobody's ever asked for it—it requires a lot of admin intervention.
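
A wrapper of that kind might look roughly like the following Python sketch. It relies on plocate's -d option for selecting a database and on a colon-separated list in the style of LOCATE_PATH; the function names and the database paths in the example are made up for illustration, and no path rewriting is attempted.

```python
import subprocess

def database_list(locate_path):
    """Split a colon-separated LOCATE_PATH-style string, dropping empty entries."""
    return [db for db in locate_path.split(":") if db]

def build_commands(pattern, locate_path):
    """Build one plocate invocation per database, to be run one after another."""
    return [["plocate", "-d", db, pattern] for db in database_list(locate_path)]

def search_all(pattern, locate_path):
    """Run the searches serially and collect the matches (needs plocate installed)."""
    matches = []
    for command in build_commands(pattern, locate_path):
        result = subprocess.run(command, capture_output=True, text=True)
        matches.extend(result.stdout.splitlines())
    return matches

# Hypothetical database locations, shown without running plocate.
for command in build_commands("wombat", "/srv/a/plocate.db:/srv/b/plocate.db"):
    print(" ".join(command))
```

Searching serially keeps the wrapper trivial; whether that is fast enough for databases on NFS is exactly the open question in the comment above.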

What goes into default Debian?

Posted Mar 23, 2021 20:12 UTC (Tue) by nix (subscriber, #2304) [Link] (4 responses)

Aha! If it's that easy, I might work on some patches. I do like the idea of plocate, but unless it works with a setup that allows N databases, one per NFS export point, it's kinda hard to make it do anything useful on a system where most filesystems you would like to run locate over are on NFS. (In my case, I usually run locate on the desktop and almost all the things I ever want it to find are on the server. Obviously the server's locate databases have to be built *on* the server: even with 10GbE I don't want updatedb throwing the stat data for thirty million files over the network every night :) and that means multiple databases if I want locate on the client to scan everything visible from the client, whether it's server-side or not).

What goes into default Debian?

Posted Mar 28, 2021 10:39 UTC (Sun) by Sesse (subscriber, #53779) [Link] (3 responses)

I pushed code for searching multiple databases to the git repository. Please give it a go.

What goes into default Debian?

Posted Mar 28, 2021 10:52 UTC (Sun) by zdzichu (subscriber, #17118) [Link]

That's why I love open source community!

What goes into default Debian?

Posted Apr 27, 2021 12:54 UTC (Tue) by nix (subscriber, #2304) [Link] (1 responses)

Ooh, excellent! I'll give this a try this weekend :) (yes, sometimes I do go months between catching up with LWN. Mea culpa etc.)

What goes into default Debian?

Posted Mar 27, 2022 15:50 UTC (Sun) by nix (subscriber, #2304) [Link]

Aaaand months and months after I said I'd try it, I finally did. This thing is fantastic :)

Before, with GNU findutils: 40 mins to build the locatedb, 10 mins if everything was in cache. Afterwards (hot cache figures only): 56 seconds. DBs about five times smaller. As for times:

% /usr/bin/time locate wombat
[...]
8.09user 0.08system 0:08.25elapsed 99%CPU (0avgtext+0avgdata 2084maxresident)k

% /usr/bin/time locate -r wombat
[...]
19.75user 0.40system 0:20.38elapsed 98%CPU (0avgtext+0avgdata 2196maxresident)k

% /usr/bin/time locate -r womb.t
[...]
24.98user 0.03system 0:25.08elapsed 99%CPU (0avgtext+0avgdata 2184maxresident)k

Afterwards:
% /usr/bin/time locate wombat
[...]
0.00user 0.00system 0:00.02elapsed 71%CPU (0avgtext+0avgdata 4140maxresident)k

% /usr/bin/time locate -r wombat
[...]
4.95user 0.10system 0:01.68elapsed 299%CPU (0avgtext+0avgdata 10952maxresident)k

% /usr/bin/time locate -r womb.t
[...]
5.15user 0.06system 0:01.72elapsed 302%CPU (0avgtext+0avgdata 11012maxresident)k

This is with a LOCATE_PATH with 19 databases in it, so I think we can safely say that the 20-fold increases in plocate time implied by this are... well... still pretty insignificant :)

What goes into default Debian?

Posted Feb 22, 2021 9:29 UTC (Mon) by amarao (guest, #87073) [Link]

As a cloud operator, I have always hated all locate-like software. It causes a lot of cold reads from many unused and otherwise idle machines. On shared storage it causes a spike in cold I/O, which causes a spike in latency.

Before crontab randomization was introduced, it was even noticed by the electrical operator of the data center. He asked, 'what is happening at 4:00 every night?' It was crontab, doing cron.daily on all machines. I suspect the locate update was part of that electricity spike too.

What goes into default Debian?

Posted Feb 17, 2021 21:14 UTC (Wed) by smoogen (subscriber, #97) [Link]

I find the need for a locate to be a given for my shell access on most of my remote systems. Sometimes it is the easiest way to find out where some blasted utility is NOW versus where I had hard coded it elsewhere :).

What goes into default Debian?

Posted Feb 17, 2021 21:23 UTC (Wed) by pebolle (guest, #35204) [Link] (2 responses)

Debian, the gift that keeps on giving!

Please, DD's (Debian Developers?) make this into something that involves the TC (Technical Committee?) and, joy of all joys, a GR (General Resolution?). Life's all too boring under lock down, curfew, and all other brilliant measures. Do keep us entertained.

Please?

What goes into default Debian?

Posted Feb 17, 2021 22:51 UTC (Wed) by Sesse (subscriber, #53779) [Link] (1 responses)

Hate to disappoint you, but the thread is dead, and there's nobody who has indicated they want to ask the TC.

What goes into default Debian?

Posted Feb 18, 2021 0:19 UTC (Thu) by Liskni_si (guest, #91943) [Link]

To counterbalance the troll: Thank you Steinar for plocate. It's made my life a tiny but noticeable bit better. Good work!

What goes into default Debian?

Posted Feb 17, 2021 21:58 UTC (Wed) by josh (subscriber, #17465) [Link]

As a historical note, here's the thread from 2007 that led to splitting locate out of findutils, as well as a discussion about whether mlocate or another locate implementation should be in "standard": https://lists.debian.org/debian-devel/2007/11/threads.htm...

What goes into default Debian?

Posted Feb 17, 2021 22:03 UTC (Wed) by Sesse (subscriber, #53779) [Link]

My thread made LWN! That's a first. :-)

FWIW, the concept of “default locate in Debian” does not really exist, as long as nothing is installed by default. If you do apt install locate, you get the locate package, which is GNU findutils. To get plocate, you need to explicitly do apt install plocate.

The only way plocate gets preference is if you happen to install multiple ones. If both e.g. plocate and findutils locate are installed, the /usr/bin/locate symlink will (by default) point to plocate, not locate.findutils.

What goes into default Debian?

Posted Feb 18, 2021 3:55 UTC (Thu) by blastwave (guest, #129935) [Link] (2 responses)

It is always a sadness that the very first thing that I must do with a new Debian install is to install ed, time, bc, dc, as well as sharutils. These are essentials that have always been required in POSIX compliant systems and they are sadly absent by default. However, for reasons far beyond my comprehension there seems to be the nano visual editor. This is something I will never understand.

What goes into default Debian?

Posted Feb 18, 2021 7:16 UTC (Thu) by re:fi.64 (subscriber, #132628) [Link]

Nano is useful for cases of trying to walk newer-ish users through fixing issues with boot or GPU drivers.

What goes into default Debian?

Posted Feb 18, 2021 9:42 UTC (Thu) by amacater (subscriber, #790) [Link]

There is also the issue of maintenance: there's no maintainer for nvi at the moment and vim is too large and complex an infrastructure now with plugins etc. There's absolutely nothing to prevent anyone installing nvi / vim and updating the alternatives.

What goes into default Debian?

Posted Feb 18, 2021 4:11 UTC (Thu) by pabs (subscriber, #43278) [Link]

A correction: updatedb has never been a daemon, it has only ever been a one-shot service or cron job.

Now that the Linux kernel finally has the full list of VFS events available via fanotify, it should be feasible to change that though. I think Windows and macOS also have that feature, but I'm not sure about the BSD kernels or things like Hurd/Genode/Redox/seL4 though.

What goes into default Debian?

Posted Feb 18, 2021 9:18 UTC (Thu) by tchernobog (guest, #73595) [Link] (10 responses)

About the argument of multi-user systems: wouldn't locate potentially expose a list of filenames in the folder of another user? This might leak private information. I would think, from a privacy and security perspective, installing a tool creating a database of all filenames should be a conscious decision by the administrator.

So, in my opinion `locate` should NOT be part of the standard install. Most people familiar with the command line can install it if they are admins, or resort to `find` if they are not (which will not crawl folders for which the user has no permissions).

What goes into default Debian?

Posted Feb 18, 2021 9:54 UTC (Thu) by Sesse (subscriber, #53779) [Link] (9 responses)

slocate, mlocate and plocate will all show you only files in directories that you have access to (i.e., the same that find / would find).
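
The idea can be sketched as a filter applied to database entries before they are printed. This is a simplified illustration of the concept, not the actual check performed by slocate, mlocate, or plocate; `visible_to_caller` and `filter_entries` are hypothetical helpers.

```python
import os
import tempfile

def visible_to_caller(path):
    """True if the caller may list the directory that contains path.

    os.access() resolves the whole path, so a missing search permission
    on any ancestor directory also makes this return False.
    """
    parent = os.path.dirname(os.path.abspath(path))
    return os.access(parent, os.R_OK | os.X_OK)

def filter_entries(entries):
    """Keep only the entries the calling user could have found anyway."""
    return [path for path in entries if visible_to_caller(path)]

with tempfile.TemporaryDirectory() as scratch:
    reachable = os.path.join(scratch, "notes.txt")
    unreachable = os.path.join(scratch, "no-such-dir", "secret.txt")
    print(filter_entries([reachable, unreachable]))
```

Because the database is built by a privileged process, the filter has to run at query time, with the permissions of the user doing the search.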

What goes into default Debian?

Posted Feb 18, 2021 10:05 UTC (Thu) by pabs (subscriber, #43278) [Link] (8 responses)

I wonder if you could use timing information to figure out if specific paths exist.

What goes into default Debian?

Posted Feb 18, 2021 12:27 UTC (Thu) by niner (subscriber, #26151) [Link] (7 responses)

If your filenames are that sensitive, then you should set permissions accordingly. If that means updatedb can't index your files, then that's a good thing.

What goes into default Debian?

Posted Feb 18, 2021 13:03 UTC (Thu) by eehakkin (subscriber, #92008) [Link] (6 responses)

> If your filenames are that sensitive, then you should set permissions accordingly. If that means updatedb can't index your files, then that's a good thing.

That is not possible. In the case of mlocate and plocate, updatedb runs as root so that it can index all files. It is not possible to set permissions so that the root user would not be able to see and index them.

What goes into default Debian?

Posted Feb 18, 2021 14:25 UTC (Thu) by Sesse (subscriber, #53779) [Link] (5 responses)

You can set anything so sensitive in PRUNEPATHS.

What goes into default Debian?

Posted Feb 18, 2021 21:35 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (4 responses)

Is that a system setting that only the administrator can set or can users add paths to it somehow?

What goes into default Debian?

Posted Feb 19, 2021 5:55 UTC (Fri) by flussence (guest, #85566) [Link] (3 responses)

The default for mlocate's PRUNENAMES contains "CVS", so just invent a clever backronym for that and put your sensitive files in a directory named such. If that's still not sufficient then you have someone with root access limiting themselves to making malicious edits to updatedb.conf, which is a bizarre threat model to care about.

What goes into default Debian?

Posted Feb 19, 2021 14:16 UTC (Fri) by mathstuf (subscriber, #69389) [Link] (2 responses)

My question was a response to:

> You can set anything so sensitive in PRUNEPATHS.

I was asking if the "You" here needs root privileges to do that. I think anyone worrying about root should know that they should encrypt any files they wish to hide from them (assuming root isn't actively spying on in-use memory). I'm more thinking about someone writing code that ignored permission checks (say, a custom patched build of `locate`) when querying the database (I'm not sure if that is done on the `locate` side or somehow embedded into the database itself).

What goes into default Debian?

Posted Feb 19, 2021 16:52 UTC (Fri) by Sesse (subscriber, #53779) [Link] (1 responses)

You can custom-build your own locate without the access checks, but it needs to be installed sgid to get access to the locate database, so it wouldn't help you.

What goes into default Debian?

Posted Feb 19, 2021 22:53 UTC (Fri) by mathstuf (subscriber, #69389) [Link]

Ah, that sounds reasonable enough. Thanks.

Sorry to rain on the parade...

Posted Feb 18, 2021 10:34 UTC (Thu) by Herve5 (subscriber, #115399) [Link]

I switched to Debian years ago, I came from macintoshes where everything is indexed.
I immediately installed Recoll, that similarly helps me finding anything INSIDE files.
x-locate looking only for file titles just is abysmally useless for my needs, sorry.
I do have to retrieve things from 10 years ago, with keywords inside, whatever the format (inside EXIF tags in a photo, pdfs, proprietary fossil word/excels...). Recoll does it.
My only concern is that it seems a single-man development with no competing app.

What goes into default Debian?

Posted Feb 18, 2021 11:37 UTC (Thu) by chris.sykes (subscriber, #54374) [Link] (9 responses)

It would be nice to have a centralised / shared database to answer "what changed recently", or "what changed since I last checked" that different tools could all use.

A shared service so that *locate, backup tools, mandb, desktop content indexers etc. don't each end up doing their own trawl over file-system(s) to determine what changed.

Log structured file-systems like btrfs should be able to help do this efficiently.

Perhaps such a service already exists? Or there might be very good reasons why it wouldn't work - I've not put much thought into it.

(this is just asking for someone to invoke xkcd 927 of course)

What goes into default Debian?

Posted Feb 18, 2021 20:01 UTC (Thu) by grawity (subscriber, #80596) [Link]

Something like btrfs subvolume find-new?

(Other operating systems: Windows' NTFS is neither log-structured nor CoW, but it kind of has the "USN Journal" serving the same purpose, though this journal is maintained by the OS "voluntarily" and probably misses changes done via ntfs-3g.)

What goes into default Debian?

Posted Feb 19, 2021 6:11 UTC (Fri) by pabs (subscriber, #43278) [Link] (6 responses)

The Linux kernel's fanotify interface should be enough for that, although it does require all those things to have a daemon running constantly storing the filesystem events that happen.

What goes into default Debian?

Posted Feb 19, 2021 11:44 UTC (Fri) by chris.sykes (subscriber, #54374) [Link] (4 responses)

Yeah, the multiple independent daemons all sat in *notify was what I thought it might be nice to avoid.

An analogy might be something like crond, but for file-system changes. It would maintain persistent state between boots, and could implement fs specific optimisations.

But as I say, I haven't really thought this through :-)

What goes into default Debian?

Posted Feb 19, 2021 14:17 UTC (Fri) by mathstuf (subscriber, #69389) [Link] (1 responses)

> But as I say, I haven't really thought this through :-)

The easy hole is "the filesystem can be mounted outside of your machine". Unless you're doing Secure Boot chaining to the unlocking of your disk or something, guaranteeing the filesystem is untouched while your daemon is not running is probably…very difficult.

What goes into default Debian?

Posted Feb 20, 2021 7:22 UTC (Sat) by pabs (subscriber, #43278) [Link]

You would just have to re-scan the full system on every daemon start.

What goes into default Debian?

Posted Feb 20, 2021 7:25 UTC (Sat) by pabs (subscriber, #43278) [Link] (1 responses)

There are various inotify-based cron-like daemons, but your idea is beginning to sound like it should become a core systemd service.

What goes into default Debian?

Posted Feb 20, 2021 14:51 UTC (Sat) by mathstuf (subscriber, #69389) [Link]

There are .path units that can trigger other units on various state changes already. They don't work with directory trees so well though IIRC.

What goes into default Debian?

Posted Feb 20, 2021 5:20 UTC (Sat) by zblaxell (subscriber, #26385) [Link]

fanotify receivers can easily fall behind if there are too many events.

I once tried to make a build system with inferred dependency tracking using fanotify. On paper it's straightforward: fanotify gives a stream of events like "pid P opened file F for reading and G for writing", and from that we infer G depends on F. So we run the build and log all the file accesses and then process the log later and we have at least a rough idea of what file outputs use what other files as inputs. Easy.

Now try this with a build that has 4000 processes opening a million files on a dozen CPU cores. In nonblocking mode, the events can't be dequeued fast enough (there certainly isn't time to resolve filenames), and in blocking mode it dramatically slows down the build.

Same problem for incremental backups: the fanotify monitor plays along for a while, then a lot of files get updated at once and the monitor has to say "nope, can't do it, you'll have to do a full scan to find out what I missed."
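The "fall back to a full scan" behaviour can be sketched with a toy bounded queue. This is a simulation under stated assumptions, not fanotify itself: `BoundedEventQueue` is a hypothetical stand-in for the kernel queue, and its `overflowed` flag plays the role that an overflow event such as `FAN_Q_OVERFLOW` plays for a real listener.

```python
import os
from collections import deque


class BoundedEventQueue:
    """Toy stand-in for a fixed-capacity kernel notification queue."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()
        self.overflowed = False

    def push(self, path):
        if len(self.events) >= self.capacity:
            # The event is lost; all we can record is that something was lost.
            self.overflowed = True
            return
        self.events.append(path)

    def drain(self):
        events, overflowed = list(self.events), self.overflowed
        self.events.clear()
        self.overflowed = False
        return events, overflowed


def full_scan(root):
    """Walk the whole tree and return the set of file paths found."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.join(dirpath, name))
    return found


def update_index(index, queue, root):
    """Apply queued events, or rescan everything if any event was dropped."""
    events, overflowed = queue.drain()
    if overflowed:
        return full_scan(root)
    return index | set(events)
```

The incremental path is cheap, but a single overflow costs as much as having no monitor at all.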

What goes into default Debian?

Posted Feb 20, 2021 5:21 UTC (Sat) by zblaxell (subscriber, #26385) [Link]

> Log structured file-systems like btrfs should be able to help do this efficiently.

btrfs is not log structured. It is a wandering tree. Under normal conditions and default settings, btrfs ruthlessly destroys all the historical metadata you're looking for after it has been on disk for a few minutes.

If you make a snapshot, btrfs can do a fast(ish) diff between the snapshot and the current filesystem contents, and that data could be fed into locatedb. This does require keeping a snapshot containing a complete copy of all data since the last locatedb update lying around on disk, though you might already be doing that for other reasons (like backups).

find-new can find new files since a transaction number. It doesn't provide a way to know whether files have been removed, but maybe that doesn't matter for a locate tool--it could stat every filename just before printing, and skip output of any that no longer exist, or maybe dumping out the entire filesystem tree through TREE_SEARCH is fast enough that incremental updates of the DB aren't necessary.
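The "stat every filename just before printing" idea is small enough to sketch. Here `db_paths` is a hypothetical in-memory list standing in for the locate database, and matching is a plain substring test.

```python
import os


def existing_matches(db_paths, pattern):
    """Yield database entries that contain `pattern` and still exist on disk."""
    for path in db_paths:
        # lexists() also accepts dangling symlinks, which are still names
        # present in the filesystem.
        if pattern in path and os.path.lexists(path):
            yield path
```

The cost is one extra system call per match, paid at query time rather than at index time.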

What goes into default Debian?

Posted Feb 18, 2021 18:57 UTC (Thu) by flussence (guest, #85566) [Link]

>Many users won't even know it exists (locate has far fewer users than find)

locate used to have far *more* users than find: GTK+(2)'s file-open box used to transparently use it if installed to speed up its search feature significantly.

What goes into default Debian?

Posted Feb 18, 2021 23:24 UTC (Thu) by benjamir (subscriber, #133607) [Link] (6 responses)

Why did they argue about what has to be onboard?
I always start from a (sadly, involuntarily non-free) minimal install and customize from there on via ansible anyway.
I doubt a "default" does fit even 80% of the bill (how about offering more convenient "tasks" via tasksel for those who are not into CM yet?). Beginners don't need netcat and the like from the start; that's wasteful.
Who's the target audience of "default"? Are those the packages the majority of pop contest users install? Does the data from popularity contest back the arguments mentioned in the article?
Would be nice to see Debian reach an agreement based on neither guesswork nor egos.

What goes into default Debian?

Posted Feb 19, 2021 8:29 UTC (Fri) by dvdeug (guest, #10998) [Link]

How is netcat wasteful? It's a little more than 100KB installed. Raspberry Pi wants at least a 4GB SD card, and even the most basic PCs seem to have 100 GB drives. Not including a utility whose only cost is a few extra files and 100KB of hard drive space (that has no cron jobs and no setuid) is not a win for beginners or anyone.

What goes into default Debian?

Posted Feb 20, 2021 5:20 UTC (Sat) by zblaxell (subscriber, #26385) [Link] (4 responses)

> Beginners don't need netcat and the like from the start

If they are stuck at a shell prompt in the middle of the installer and they need help, netcat is a tool for exfiltrating dmesg to support personnel over the network, and for importing updated packages or other system rescue software.

What goes into default Debian?

Posted Feb 21, 2021 17:44 UTC (Sun) by mathstuf (subscriber, #69389) [Link] (3 responses)

That's an argument for keeping it in the *installer*, but not necessarily the default *install*.

What goes into default Debian?

Posted Feb 22, 2021 7:58 UTC (Mon) by zblaxell (subscriber, #26385) [Link] (2 responses)

Fine, "if they are stuck at a shell prompt during or at some point after completing a default install..."

What goes into default Debian?

Posted Feb 22, 2021 13:52 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (1 responses)

That's what rescue Live CDs are for. Guiding someone at a shell prompt is not my idea of a fun day over the phone. At least with a Live CD, they can share their screen and maybe even let me SSH in to help with the problem.

What goes into default Debian?

Posted Mar 1, 2021 14:02 UTC (Mon) by zblaxell (subscriber, #26385) [Link]

That only works if they can attach a live USB to the host (I'm assuming you don't mean a literal CD, there haven't been CD drives in new hardware for a decade), and that the live USB doesn't suffer from the same problem that needed shell commands to fix or work around.

Sometimes you don't get to pick the target support environment.

What goes into default Debian?

Posted Feb 20, 2021 5:21 UTC (Sat) by zblaxell (subscriber, #26385) [Link] (8 responses)

I've been uninstalling updatedb on sight since 1993. I'm glad it's no longer the default, shame it took over 25 years to go away.

My favorite failure modes were the ones that make updatedb take more than 24 hours to run: failing to filter out a network filesystem, running on a busy build server at nice 20, or running on a fileserver with more than a hundred million files on spinning disks. Then the next day's cron job starts, there are multiple updatedb's running, and everything gets worse every day until someone notices and makes it stop.
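The pile-up of overlapping runs is the classic case for a non-blocking lock: a new run that cannot take the lock simply exits. A minimal sketch for Linux, assuming a hypothetical lock file path chosen by the caller:

```python
import fcntl
import os


def try_lock(lock_path):
    """Return an fd holding an exclusive lock, or None if a run is in progress."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # LOCK_NB makes flock() fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd
```

The lock is released when the fd is closed or the process exits, so a crashed run does not leave a stale lock behind.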

Usually if we have a requirement for file-level indexing, it comes with a requirement to index file metadata because all the filenames are like "00000001.dat". So we find a tool that does that indexing and install that instead.

What goes into default Debian?

Posted Feb 20, 2021 17:53 UTC (Sat) by Sesse (subscriber, #53779) [Link] (7 responses)

I assume you are aware that most of these issues look very different now than they did in 1993, right? Network filesystems are filtered by default, merging updatedb makes 100M+ files entirely feasible, systemd can deal with locking...

And if updatedb is not fast enough, how would indexing into the files be?

What goes into default Debian?

Posted Feb 21, 2021 10:11 UTC (Sun) by zblaxell (subscriber, #26385) [Link] (6 responses)

Yes, for the first decade or so I was optimistic that updatedb could get better. I would allow updatedb to be installed on new machines, and I even tried to help resolve some of the issues the first five or six times they came up. The same issues kept coming up over and over, and they don't seem to have gone away today.

updatedb doesn't use whitelists--it indexes everything not on a blacklist. Network filesystems are filtered with a blacklist that had to be updated when nfs4 came along, then again when smbfs^H^H^H^H^Hcifs came along, then again and again and again when a thousand fuse filesystems came along (the horror...semi-infinite generated file namespaces with expensive iterators popping up at random, vs. updatedb). If I start using a new filesystem tomorrow, I can have an updatedb-related disaster the following day.
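The difference between the two policies shows up as soon as an unknown filesystem type appears. The sketch below filters a mount table in the `/proc/mounts` column layout (device, mount point, type); the type sets and the `fuse.newthing` entry are made up for illustration.

```python
def mount_points(mounts_text, keep):
    """Return mount points whose filesystem type satisfies the `keep` predicate."""
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and keep(fields[2]):
            points.append(fields[1])
    return points


SAMPLE = """\
/dev/sda1 / ext4 rw 0 0
server:/export /mnt/nfs nfs4 rw 0 0
newthing /mnt/cloud fuse.newthing rw 0 0
"""

EXCLUDED = {"nfs", "nfs4", "cifs"}   # blacklist: must anticipate every bad type
ALLOWED = {"ext4", "xfs", "btrfs"}   # whitelist: unknown types are skipped

by_blacklist = mount_points(SAMPLE, lambda t: t not in EXCLUDED)
by_whitelist = mount_points(SAMPLE, lambda t: t in ALLOWED)
```

With these inputs the blacklist lets `/mnt/cloud` through because nobody has listed its type yet, while the whitelist keeps only `/`.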

updatedb would also index any directory under / that wasn't explicitly excluded by a blacklist. The whole point of using novel directories under / that no existing software knows about is that existing software will stay the hell out of them until explicitly directed there. Not updatedb! That thing goes looking for trouble, and as long as trouble isn't in a default list of a half-dozen excluded directories, it'll find it!

Most of the problems would have been trivially solved if updatedb used a whitelist of FHS directories (/etc, /bin, /lib, /lib64, /sbin, /srv, /usr, /var, /opt, /home if local, all with -xdev in find) and searched only those until told to do otherwise. A "normal user" will not store anything outside of $HOME anyway. Users who attach huge file stores to their machines can add the mount points to updatedb.conf, or use a standard (for Debian) path like /srv for the big file store. Users who use 'locate' every day can edit /etc/updatedb.conf.
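A whitelist-driven traversal along these lines might look like the following sketch. It is not how any existing updatedb works; `walk_allowlisted` and `same_device` are hypothetical names, and the device comparison imitates what `-xdev` does in find.

```python
import os


def same_device(path, dev):
    """True if `path` lives on the device numbered `dev`."""
    try:
        return os.lstat(path).st_dev == dev
    except OSError:
        return False


def walk_allowlisted(roots):
    """Yield file paths under each root without crossing filesystem boundaries."""
    for root in roots:
        try:
            root_dev = os.lstat(root).st_dev
        except OSError:
            continue  # this root does not exist on this host
        for dirpath, dirnames, filenames in os.walk(root):
            # Editing dirnames in place stops os.walk from descending into
            # directories that sit on another device.
            dirnames[:] = [
                d for d in dirnames
                if same_device(os.path.join(dirpath, d), root_dev)
            ]
            for name in filenames:
                yield os.path.join(dirpath, name)
```

Anything not reachable from a listed root, such as a novel directory directly under /, is never visited.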

My host configuration logs indicate locking didn't start happening in Debian until 2015 (give or take a year) with the introduction of updatedb.mlocate. Even then, the locking had an obvious bug until 2017. This was long after I had stopped caring what the locate package maintainer did any more--by 2002 I was already using /etc/apt/preferences to ensure the locate package could not be installed on production machines, and a few years later I stopped testing new versions.

I don't know what systemd or plocate does, but if it doesn't start with bind-mounting whitelisted filesystems in a private namespace and running updatedb chroot in that namespace, I don't need to see the rest of it. I've seen updatedb's blacklisting accidentally defeated by users and junior sysadmins and upstream software updates, and I've seen nothing to prevent this from happening again.

When we index files, we define a service profile, purchase or assign hardware from inventory, task that hardware with running the service, i.e. storing hundreds of millions of files, and providing an indexing service for them. We assign staff and robots to operate and monitor the hardware, periodically check that the hardware is healthy and services in the profile are all running correctly, check that the indexes are correct and up to date and indexing files in scope and not indexing files not in scope, and ensure there is enough storage for the index and enough free iops to update it with whatever frequency the service profile says we need. In other words, these indexers are _supervised_.

None of this happens when updatedb is installed by default. It's a production risk and wasted cost until you turn it off or take control of it. If you turn your back on it, it inevitably fills up /var without warning, and burns power and media lifetime even when it's working normally--and when it's not, it can take big servers all the way down.

As far as I can tell, among the standard Unix packages, this is a unique property of *locate? Off the top of my head I can't think of any other past or present default-installed service that potentially or actually does scheduled work proportional to the number of files your host can access.

What goes into default Debian?

Posted Mar 13, 2021 18:18 UTC (Sat) by nix (subscriber, #2304) [Link] (5 responses)

> When we index files, we define a service profile, purchase or assign hardware from inventory, task that hardware with running the service, i.e. storing hundreds of millions of files, and providing an indexing service for them. We assign staff and robots to operate and monitor the hardware, periodically check that the hardware is healthy and services in the profile are all running correctly, check that the indexes are correct and up to date and indexing files in scope and not indexing files not in scope, and ensure there is enough storage for the index and enough free iops to update it with whatever frequency the service profile says we need. In other words, these indexers are _supervised_.

It mystifies me that anyone running at this sort of scale would expect findutils' locate, or any comparable tool, to do a decent job. This is just way outside its design parameters, and putting it inside its design parameters would likely make it so complex that it would be unusable for its intended purpose.

What goes into default Debian?

Posted Mar 14, 2021 2:50 UTC (Sun) by zblaxell (subscriber, #26385) [Link] (4 responses)

You are building your argument from the opposite direction, but we agree at the conclusion: updatedb should not be actively wandering around unsupervised across every available path reachable from /, looking for things that it wasn't designed to handle, especially if there is any possibility it will be installed and enabled by default.

That means sandboxing it to prevent accidents, and determining what it can access by whitelist to minimize surprises. Possibly also disabling it by default, but that might not be necessary if the default sandbox is sufficient.

systemd can set up a chroot filesystem namespace, or a few lines of shell script can set up bind mounts to map whitelisted filesystems into a chroot tree, then run some more robust version of 'chroot $sandbox_path updatedb'. This isn't rocket science--it's how modern system services should be designed.

What goes into default Debian?

Posted Mar 16, 2021 19:00 UTC (Tue) by nix (subscriber, #2304) [Link] (3 responses)

I'd say, rather, that if you're running truly massive distributed filesystems you should probably have a different set of software installed on clients than the vast majority of people who are not. Don't penalize those people by removing stuff because it doesn't work in your unusual giant-fs use case, any more than all inotify-using software should be removed from default installs because inotify doesn't work very well over NFS.

What goes into default Debian?

Posted Mar 19, 2021 7:17 UTC (Fri) by zblaxell (subscriber, #26385) [Link] (2 responses)

The risk profile is all wrong with the current updatedb.conf design on Debian. It penalizes ordinary users more than anyone else.

Giant filesystems are run by professionals who know what updatedb is and have either "purge updatedb" or "have a plan to manage updatedb" on their deployment checklist. We don't have to worry about them. I mean, they'll obviously be annoyed by having to add a new package name to their blacklists every few years, but they've lived with this for a quarter century already. They're fine.

Cloud nodes and IoT devices are built by professionals who know that updatedb strictly wastes energy and shortens device lifespan with zero upside. Nobody ever logs into one of these hosts, so nobody benefits from locate--they do everything by orchestration, or by dropping a root filesystem image built somewhere else onto the device. If they use locate at all, they use it on their development system with the prototype filesystem tree--every node has an identical copy of that. "Don't install updatedb" is burned into their toolchain. We don't need to worry about them either.

Managed clients are somewhere between totally ad-hoc and an IoT device, depending on what the manager permits and how strictly permissions are enforced. Here we have to rely on the manager to make a good decision: either not install updatedb, or permit only whitelisted filesystems and mount points on the host that updatedb can safely handle. This is a reasonable expectation of a client manager, and often required for other reasons like security. If the manager doesn't manage well, then these hosts fall into the next category.

The problem happens when everyone else runs updatedb: ad-hoc servers and desktops run by people who are not aware updatedb exists, or how it will interact with whatever random third-party thing they've bought and installed, and who also buy and install random third-party things. These are the ones that pick all-default options during install. These are also the ones that are most likely to combine something new (recall that lots of things are new to a stable Debian system) with updatedb's default configuration, and trigger a bespoke flavor of disaster. This is the most common kind of person to have an updatedb failure case in my experience.

Every failure is slightly different, and easily fixed by extending the blacklists in updatedb.conf, but none of the individual fixes help any other user (they are always something like "exclude /DavidsBigAndUnreliableUSBFilePile", which doesn't work for users named Peter or Alice). There's no possible patch to send upstream to prevent the problem from happening to anyone else, other than "throw out updatedb.conf and start over." None of the failures could have happened in a properly configured sandbox with a whitelist, but Debian's updatedb.conf syntax provides no way to whitelist anything. Predictable and avoidable failures just keep happening, year after year. Those who are most affected are least equipped to deal with them.

Fix the design so updatedb is properly whitelisted and sandboxed, or don't install it at all. Those are the two safe options for most users. Based on how it's going so far, I think I'm just going to have to keep repeating that every 5 to 10 years until the updatedb maintainers finally get it.

What goes into default Debian?

Posted Mar 20, 2021 1:49 UTC (Sat) by nix (subscriber, #2304) [Link] (1 responses)

I don't see how your properly configured sandbox is even *implementable*, nor what sort of horrible failure happens especially to random users with non-complex setups and no huge networked filesystems (I mean, what on earth is going to happen? What harm is one traversal of a filesystem a day on a modest system going to do?)

I do think the default updatedb and locate configurations could do with updating to make lightly-distributed setups employing NFS work better (ensuring no traversals of remotely-mounted filesystems by updatedb, while still making them searchable by locate). I have patches to do that: I should submit them... alas they require multiple databases, so no plocate support.

What goes into default Debian?

Posted Mar 26, 2021 4:04 UTC (Fri) by zblaxell (subscriber, #26385) [Link]

> I don't see how your properly configured sandbox is even *implementable*,

bind mounts, private mount namespace, and chroot? If those aren't for sandboxing dodgy software (or protecting critical software from dodgy users), then I've been using them wrong for years. The chroot gets a curated (whitelisted) list of filesystems and mount points imported into it, from either a preconfigured list or /etc/fstab. Anything else--user mounts, removable media, new filesystems, giant external file stores--is not merely ignored, but not even accidentally accessible inside the sandbox. Can't accidentally wipe out a big tree of files from the locate db by mounting something in the middle of /home, since a mount like that is not propagated into the sandbox namespace (this is more of a problem for backups than locate, but we do run backups sandboxed this way to avoid that problem).
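To make the recipe concrete, here is a sketch that only builds the command lines for such a sandbox and executes nothing. The sandbox path, the list of allowed mounts, and the function name are assumptions for illustration; the commands themselves (`mount --bind`, a read-only bind remount, `chroot`) would need root and would normally be run inside a private mount namespace, for example one created with `unshare --mount`. A real setup would also have to make the indexer binary and its libraries available inside the chroot, which this sketch ignores.

```python
import os
import shlex


def sandbox_plan(sandbox, allowed_mounts, command=("updatedb",)):
    """Return the shell command lines for a whitelist-only chroot; run nothing."""
    plan = []
    for src in allowed_mounts:
        dst = os.path.join(sandbox, src.lstrip("/"))
        plan.append(["mkdir", "-p", dst])
        # A plain (non-recursive) bind mount does not bring along anything
        # that is mounted below `src`.
        plan.append(["mount", "--bind", src, dst])
        # Make the imported view read-only for the indexer.
        plan.append(["mount", "-o", "remount,bind,ro", dst])
    plan.append(["chroot", sandbox, *command])
    return [shlex.join(cmd) for cmd in plan]
```

For example, `sandbox_plan("/run/sb", ["/home", "/usr"])` produces three lines per allowed mount followed by the final `chroot` line.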

If your question is "how to implement it in a way that is backward compatible with updatedb.conf", I don't have an answer. plocate is an amazing technical leap forward from traditional updatedb, and yet the 1% of updatedb that wasn't rewritten from scratch for plocate is the 1% of updatedb that causes all the problems in practice. There is a robust supply of competing file indexers out there, and I've always just used one that doesn't reimplement the worst 1% of updatedb.

I thought I had seen all the ways updatedb can fail, but I ran plocate on a test VM for a while, and discovered a new (to me) one: it spends most of its time indexing trees that will not exist at locate time. updatedb.plocate traverses the filesystem with the openat() family of functions, which means it will block umounts and snapshot deletion until it's finished indexing the entire tree--then it will close its FD, and the tree will cease to exist. Unlike traditional updatedb, updatedb.plocate still has access to the umounted or deleted tree through its open directory FDs, and I didn't see any sign of updatedb.plocate periodically checking to see if the tree it is indexing is still reachable from /. That can multiply indexing time for snapshots (especially if you are using one of snapper's default configs which creates 24 new snapshots every day), and interfere with umounts if the user were trying to disconnect or remount that filesystem.
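One way a traversal could perform the periodic check described here is to compare the directory it holds open with whatever the original path resolves to now. This is a sketch of the idea only, not something plocate is known to do; `still_reachable` is a hypothetical helper.

```python
import os


def still_reachable(dir_fd, path):
    """True if `path` still names the same directory that `dir_fd` has open."""
    try:
        now = os.stat(path)
    except OSError:
        return False  # the path no longer resolves at all
    held = os.fstat(dir_fd)
    # Same device and inode means the open directory is still the one at `path`.
    return (held.st_dev, held.st_ino) == (now.st_dev, now.st_ino)
```

A traversal could call this every so often on the root of the tree it is indexing and abandon the tree when the check fails.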

What goes into default Debian?

Posted Feb 23, 2021 21:15 UTC (Tue) by ju3Ceemi (subscriber, #102464) [Link]

netcat, lsof and the like are really useful for debugging a running (but broken) server.
If I am in a corner, I can get help with those tools.
In such a scenario, locate can easily be replaced by find.

And booting a live system is not a solution.

Above all, specific tools like locate should not be part of the base system.
If someone rarely installs Debian, they can install locate by hand.
If someone installs Debian a lot, they should really do some scripting / automation to install and configure all their software.

Ergo, this is a non-issue


Copyright © 2021, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds