What goes into default Debian?
The venerable locate file-finding utility has long been available for Linux systems, though its origins are in the BSD world. It is a generally useful tool, but does have a cost beyond just the disk space it occupies in the filesystem; there is a periodically run program (updatedb) that keeps the file-name database up to date. As a recent debian-devel discussion shows, though, people have differing ideas of just how important the tool is, and whether it should be part of the default installation of Debian.
There are several variants of locate floating around at this point. The original is described in a ;login: article from 1983; a descendant of that code lives on in the GNU Find Utilities alongside find and xargs. After that came Secure Locate (slocate), which checks permissions to only show file names that users have access to, and its functional successor, mlocate, which does the same check but also merges new changes into the existing database, rather than recreating it, for efficiency and filesystem-cache preservation. On many Linux distributions these days, mlocate is the locate of choice.
But Steinar H. Gunderson has created another variant, plocate, which he has suggested should be the standard locate for the upcoming Debian 12 ("Bookworm") release. He said that plocate can completely supplant mlocate:
He pointed out that mlocate used to be installed by default, but that was changed for Debian 10 ("Buster"). He would like to see locate return as part of the default install, but to use plocate instead:
Bernd Zeimetz agreed that plocate should be part of the default install, as did Paul Wise, but Wise was concerned about the cost of keeping the database updated. Gunderson said that plocate (like mlocate) tries to be smarter than simply walking the whole filesystem. "It keeps track of the mtime of each directory, and doesn't do the readdir()/getdents() if it hasn't changed." But Josh Triplett argued that while plocate is a good choice for the default locate for Debian, it should not be part of the default (or base) install:
Beyond that, he pointed out that desktop environments often provide similar functionality, "typically based on a change-watching API (e.g. inotify) rather than a regularly scheduled update". Gunderson said that the amount of wasted time generally amounts to "a few seconds every night". Triplett noted that there was a counterexample to that figure in the thread, but he also made a broader point about defaults in the distribution:
The defaults need to cater for 1) the broadest set of users, and 2) users who are less likely to change the defaults. Most users don't run locate, and those who do are more likely to be users who can and do change the defaults. `apt install plocate` isn't hard for someone who uses locate to do.
But, as Gunderson pointed out, systems that are shared by multiple users could benefit from having locate available, without having to ask an administrator to install it. He suggested that locate is a standard tool for users of the command line, as well. Adrian Bunk said that shared systems are "pretty rare", but Russ Allbery pointed out that "rare" might not be the right characterization: "They've become *rarer*, but they're still very common in the academic and scientific research world."
Bunk also noted that many non-technical Linux users never actually touch the shell. Gunderson wondered why Debian installs a whole host of other utilities (e.g. netcat, lsof, the PCI utilities) that are really only useful for technical, shell-aware Linux users. Those utilities are "expected to be on a typical Linux system by almost every technically-knowledgeable Linux user", Marvin Renich said, but locate does not quite rise to that level.
Others disagreed, of course. Bjørn Mork drew the line differently, noting that administrators can always add tools that they need, but sometimes users cannot:
In addition, not having locate available by default is "yet another step away from being a proper Unix system", Mork said. Holger Levsen suggested that it was perhaps time for someone to create a package that installs a "proper Unix system". There would, of course, be differing opinions on what constitutes such a system, which is part of what is driving the question of a default locate as well. One suspects that hair length or color would not actually play much of a role as was (jokingly) suggested, but several packages with different choices could be accommodated, as Levsen noted.
So it would seem that plocate will be the default locate for Debian, though mlocate will still be available, but neither will be installed by default. Or at least there was no huge groundswell of support to change the current practice. That is in keeping with the practice of other Linux distributions (Fedora, openSUSE, Ubuntu, and RHEL, at least), but it is understandable that locate might be missed. find is a reasonable substitute, but it lacks the instant feedback that locate provides. For me, that's worth installing an extra package and expending a few seconds per day—your mileage may vary.
Posted Feb 17, 2021 20:43 UTC (Wed)
by mathstuf (subscriber, #69389)
(27 responses)
Also, using mtimes seems weird for system packages. If I downgrade a package or install a newer version that happened to be built before the newer one (copr repositories on Fedora, AUR on Arch, or whatever Ubuntu is providing…can't remember the name), does `updatedb` get confused (AFAIU, timestamps tend to come from the package, not install time)?
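On the mtime question: the check described in the article looks at the mtime of each directory, which on typical POSIX filesystems is updated whenever an entry is created, removed, or renamed in that directory, whatever timestamps end up on the files themselves. What follows is a minimal Python sketch of that idea, not mlocate's or plocate's actual code; `update_index` and the layout of `index` are hypothetical names chosen for illustration.

```python
import os

def update_index(root, index):
    """Refresh `index` in place and return how many directories were re-read.

    `index` maps a directory path to (mtime_ns, entry_names, subdir_paths).
    A directory whose mtime is unchanged is not listed again; its cached
    entries are reused, but its subdirectories are still checked.
    """
    reread = 0
    seen = set()
    stack = [root]
    while stack:
        directory = stack.pop()
        try:
            mtime = os.lstat(directory).st_mtime_ns
        except OSError:
            continue  # vanished or unreadable: drop it from the index below
        seen.add(directory)
        cached = index.get(directory)
        if cached is not None and cached[0] == mtime:
            subdirs = cached[2]  # unchanged: skip the directory listing
        else:
            names, subdirs = [], []
            try:
                with os.scandir(directory) as entries:
                    for entry in entries:
                        names.append(entry.name)
                        if entry.is_dir(follow_symlinks=False):
                            subdirs.append(entry.path)
            except OSError:
                continue
            index[directory] = (mtime, names, subdirs)
            reread += 1
        stack.extend(subdirs)
    # Forget directories that are no longer reachable from root.
    for stale in set(index) - seen:
        del index[stale]
    return reread
```

Every known directory still costs one stat call per run, so the saving is in directory listings rather than in the walk itself.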
Posted Feb 17, 2021 21:14 UTC (Wed)
by juliank (guest, #45896)
Posted Feb 17, 2021 22:08 UTC (Wed)
by warrax (subscriber, #103205)
(24 responses)
I mean, if your busy loop is 'find a file' then maybe something like it makes sense, but if you need to do that then you should find a better way to do it than calling locate.
Posted Feb 17, 2021 22:25 UTC (Wed)
by Sesse (subscriber, #53779)
(23 responses)
“plocate LWN” takes 8 milliseconds.
I wrote plocate because mlocate's slowness was a real impediment to my (volunteer) sysadmin tasks. find is fine if you have a tiny system or a narrow search scope, but even on my laptop's SSD with a pretty small installation, it takes 2–3 seconds to run.
Posted Feb 17, 2021 23:28 UTC (Wed)
by clump (subscriber, #27801)
(5 responses)
I like the ease of locate but switched to find years ago because find is always up to date.
Posted Feb 18, 2021 6:12 UTC (Thu)
by dowdle (subscriber, #659)
If I've done a lot of package installs or updates... I'll often run updatedb before using locate. The updatedb action usually only takes a second or two. So making locate as up-to-date as find, is still way, way faster than find.
Yes, there are feature differences because locate only matches file/dir names whereas find has a whole slew of properties you can search for. It does NOT need to be an either or... or one is better than the other. Use both, they are both great.
Posted Feb 18, 2021 7:29 UTC (Thu)
by anton (subscriber, #25547)
(3 responses)
It's funny that some people argue that updatedb is too costly while others argue that "find /" (which costs hardly less) is fast enough.
Posted Feb 18, 2021 9:20 UTC (Thu)
by smcv (subscriber, #53363)
(2 responses)
Posted Feb 18, 2021 9:39 UTC (Thu)
by anton (subscriber, #25547)
(1 response)
Posted Feb 18, 2021 9:44 UTC (Thu)
by Sesse (subscriber, #53779)
Posted Feb 18, 2021 6:02 UTC (Thu)
by atai (subscriber, #10977)
(3 responses)
how to disable GNOME Tracker: https://www.linuxquestions.org/questions/ubuntu-63/how-to...
how to disable KDE Baloo: https://askubuntu.com/questions/1214572/how-do-i-stop-and...
Posted Feb 18, 2021 8:45 UTC (Thu)
by Wol (subscriber, #4433)
I'm a bit like that - my grief was with Akonadi, I think the longest I waited to log in was about 36 hours, I ended up installing xfce in order to get a usable system.
(That 36 hours - that wasn't "login until usable desktop", it was "login until I killed the system in frustration". For someone who uses their PC as a desktop, ie switch it off every night, login times like that just aren't acceptable. Well, they're not acceptable full stop, but ...)
Cheers,
Posted Feb 19, 2021 20:18 UTC (Fri)
by clump (subscriber, #27801)
Posted Feb 25, 2021 10:56 UTC (Thu)
by oak (guest, #2786)
Posted Feb 18, 2021 8:52 UTC (Thu)
by josh (subscriber, #17465)
Posted Feb 19, 2021 18:48 UTC (Fri)
by jond (subscriber, #37669)
Posted Feb 25, 2021 13:02 UTC (Thu)
by Hello71 (subscriber, #103412)
(3 responses)
Posted Feb 25, 2021 13:52 UTC (Thu)
by zdzichu (subscriber, #17118)
(2 responses)
Posted Feb 25, 2021 14:03 UTC (Thu)
by pizza (subscriber, #46)
(1 response)
(So it's double-digit-ms speeds..)
Posted Feb 25, 2021 15:33 UTC (Thu)
by zdzichu (subscriber, #17118)
Posted Mar 13, 2021 18:10 UTC (Sat)
by nix (subscriber, #2304)
(6 responses)
I know nobody cares about people with networked filesystems any more, but this made me sad :(
Posted Mar 20, 2021 8:45 UTC (Sat)
by Sesse (subscriber, #53779)
(5 responses)
You could probably even just make a shell script that calls plocate multiple times. The main reason I've never done it is that it's such a niche case nobody's ever asked for it—it requires a lot of admin intervention.
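The "call plocate multiple times" idea could look roughly like the following, sketched in Python rather than shell. It assumes plocate's `-d`/`--database` option for selecting a database file; `locate_all` is a hypothetical helper, and the database paths in the usage note are made up.

```python
import subprocess

def locate_all(pattern, databases, run=subprocess.run):
    """Query each plocate database in turn and concatenate the matches.

    `run` is injectable so the function can be exercised without plocate
    installed; by default it is subprocess.run.
    """
    matches = []
    for db in databases:
        # plocate exits non-zero when nothing matches, so the exit status
        # is deliberately not checked here.
        proc = run(["plocate", "-d", db, pattern],
                   capture_output=True, text=True)
        matches.extend(proc.stdout.splitlines())
    return matches
```

A call might look like `locate_all("wombat", ["/var/lib/plocate/plocate.db", "/srv/nfs/plocate.db"])`, where the second path stands in for a database built on a file server. Each database would still have to be built and kept current separately, which is presumably the admin intervention mentioned above.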
Posted Mar 23, 2021 20:12 UTC (Tue)
by nix (subscriber, #2304)
(4 responses)
Posted Mar 28, 2021 10:39 UTC (Sun)
by Sesse (subscriber, #53779)
(3 responses)
Posted Mar 28, 2021 10:52 UTC (Sun)
by zdzichu (subscriber, #17118)
Posted Apr 27, 2021 12:54 UTC (Tue)
by nix (subscriber, #2304)
(1 response)
Posted Mar 27, 2022 15:50 UTC (Sun)
by nix (subscriber, #2304)
Before, with GNU findutils: 40 mins to build the locatedb, 10 mins if everything was in cache. Afterwards (hot cache figures only): 56 seconds. DBs about five times smaller. As for times:
% /usr/bin/time locate wombat
% /usr/bin/time locate -r wombat
% /usr/bin/time locate -r womb.t
Afterwards:
% /usr/bin/time locate -r wombat
% /usr/bin/time locate -r womb.t
This is with a LOCATE_PATH with 19 databases in it, so I think we can safely say that the 20-fold increases in plocate time implied by this are... well... still pretty insignificant :)
Posted Feb 22, 2021 9:29 UTC (Mon)
by amarao (guest, #87073)
Before crontab randomization was introduced, it was even noticed by the electrical operator of a data center. He asked, 'What is happening at 4:00 every night?' It was crontab, doing cron.daily on all machines. I suspect the locate update was part of that electricity spike too.
Posted Feb 17, 2021 21:14 UTC (Wed)
by smoogen (subscriber, #97)
Posted Feb 17, 2021 21:23 UTC (Wed)
by pebolle (guest, #35204)
(2 responses)
Please, DD's (Debian Developers?) make this into something that involves the TC (Technical Committee?) and, joy of all joys, a GR (General Resolution?). Life's all too boring under lock down, curfew, and all other brilliant measures. Do keep us entertained.
Please?
Posted Feb 17, 2021 22:51 UTC (Wed)
by Sesse (subscriber, #53779)
(1 response)
Posted Feb 18, 2021 0:19 UTC (Thu)
by Liskni_si (guest, #91943)
Posted Feb 17, 2021 21:58 UTC (Wed)
by josh (subscriber, #17465)
Posted Feb 17, 2021 22:03 UTC (Wed)
by Sesse (subscriber, #53779)
FWIW, the concept of “default locate in Debian” does not really exist, as long as nothing is installed by default. If you do apt install locate, you get the locate package, which is GNU findutils. To get plocate, you need to explicitly do apt install plocate.
The only way plocate gets preference is if you happen to install multiple ones. If both e.g. plocate and findutils locate are installed, the /usr/bin/locate symlink will (by default) point to plocate, not locate.findutils.
Posted Feb 18, 2021 3:55 UTC (Thu)
by blastwave (guest, #129935)
(2 responses)
Posted Feb 18, 2021 7:16 UTC (Thu)
by re:fi.64 (subscriber, #132628)
Posted Feb 18, 2021 9:42 UTC (Thu)
by amacater (subscriber, #790)
Posted Feb 18, 2021 4:11 UTC (Thu)
by pabs (subscriber, #43278)
Now that the Linux kernel finally has the full list of VFS events available via fanotify, it should be feasible to change that though. I think Windows and macOS also have that feature, but I'm not sure about the BSD kernels or things like Hurd/Genode/Redox/seL4 though.
Posted Feb 18, 2021 9:18 UTC (Thu)
by tchernobog (guest, #73595)
(10 responses)
So, in my opinion `locate` should NOT be part of the standard install. Most people familiar with the command line can install it if they are admins, or resort to `find` if they are not (which will not crawl folders for which the user has no permissions).
Posted Feb 18, 2021 9:54 UTC (Thu)
by Sesse (subscriber, #53779)
(9 responses)
Posted Feb 18, 2021 10:05 UTC (Thu)
by pabs (subscriber, #43278)
(8 responses)
Posted Feb 18, 2021 12:27 UTC (Thu)
by niner (subscriber, #26151)
(7 responses)
Posted Feb 18, 2021 13:03 UTC (Thu)
by eehakkin (subscriber, #92008)
(6 responses)
Posted Feb 18, 2021 14:25 UTC (Thu)
by Sesse (subscriber, #53779)
(5 responses)
Posted Feb 18, 2021 21:35 UTC (Thu)
by mathstuf (subscriber, #69389)
(4 responses)
Posted Feb 19, 2021 5:55 UTC (Fri)
by flussence (guest, #85566)
(3 responses)
Posted Feb 19, 2021 14:16 UTC (Fri)
by mathstuf (subscriber, #69389)
(2 responses)
> You can set anything so sensitive in PRUNEPATHS.
I was asking if the "You" here needs root privileges to do that. I think anyone worrying about root should know that they should encrypt any files they wish to hide from them (assuming root isn't actively spying on in-use memory). I'm more thinking about someone writing code that ignored permission checks (say, a custom patched build of `locate`) when querying the database (I'm not sure if that is done on the `locate` side or somehow embedded into the database itself).
Posted Feb 19, 2021 16:52 UTC (Fri)
by Sesse (subscriber, #53779)
(1 response)
Posted Feb 19, 2021 22:53 UTC (Fri)
by mathstuf (subscriber, #69389)
Posted Feb 18, 2021 10:34 UTC (Thu)
by Herve5 (subscriber, #115399)
Posted Feb 18, 2021 11:37 UTC (Thu)
by chris.sykes (subscriber, #54374)
(9 responses)
A shared service so that *locate, backup tools, mandb, desktop content indexers etc. don't each end up doing their own trawl over file-system(s) to determine what changed.
Log structured file-systems like btrfs should be able to help do this efficiently.
Perhaps such a service already exists? Or there might be very good reasons why it wouldn't work - I've not put much thought into it.
(this is just asking for someone to invoke xkcd 927 of course)
Posted Feb 18, 2021 20:01 UTC (Thu)
by grawity (subscriber, #80596)
Something like (Other operating systems: Windows' NTFS is neither log-structured nor CoW, but it kind of has the "USN Journal" serving the same purpose, though this journal is maintained by the OS "voluntarily" and probably misses changes done via ntfs-3g.)
Posted Feb 19, 2021 6:11 UTC (Fri)
by pabs (subscriber, #43278)
(6 responses)
Posted Feb 19, 2021 11:44 UTC (Fri)
by chris.sykes (subscriber, #54374)
(4 responses)
An analogy might be something like crond, but for file-system changes. It would maintain persistent state between boots, and could implement fs specific optimisations.
But as I say, I haven't really thought this through :-)
Posted Feb 19, 2021 14:17 UTC (Fri)
by mathstuf (subscriber, #69389)
(1 response)
The easy hole is "the filesystem can be mounted outside of your machine". Unless you're doing Secure Boot chaining to the unlocking of your disk or something, guaranteeing the filesystem is untouched while your daemon is not running is probably…very difficult.
Posted Feb 20, 2021 7:22 UTC (Sat)
by pabs (subscriber, #43278)
Posted Feb 20, 2021 7:25 UTC (Sat)
by pabs (subscriber, #43278)
(1 response)
Posted Feb 20, 2021 14:51 UTC (Sat)
by mathstuf (subscriber, #69389)
Posted Feb 20, 2021 5:20 UTC (Sat)
by zblaxell (subscriber, #26385)
I once tried to make a build system with inferred dependency tracking using fanotify. On paper it's straightforward: fanotify gives a stream of events like "pid P opened file F for reading and G for writing", and from that we infer G depends on F. So we run the build and log all the file accesses and then process the log later and we have at least a rough idea of what file outputs use what other files as inputs. Easy.
Now try this with a build that has 4000 processes opening a million files on a dozen CPU cores. In nonblocking mode, the events can't be dequeued fast enough (there certainly isn't time to resolve filenames), and in blocking mode it dramatically slows down the build.
Same problem for incremental backups: the fanotify monitor plays along for a while, then a lot of files get updated at once and the monitor has to say "nope, can't do it, you'll have to do a full scan to find out what I missed."
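The "easy on paper" inference step can be sketched in Python as follows. This works on an already-collected log; the `(pid, path, mode)` tuple format is a hypothetical simplification, not the structure fanotify actually delivers, and the sketch does nothing about the queue-overflow problem described above.

```python
from collections import defaultdict

def infer_dependencies(events):
    """Infer which files each output was (possibly) built from.

    `events` is an iterable of (pid, path, mode) tuples, where mode is
    "r" for an open-for-read and "w" for an open-for-write. Every file a
    process wrote is assumed to depend on every file that same process read.
    """
    reads = defaultdict(set)
    writes = defaultdict(set)
    for pid, path, mode in events:
        (writes if mode == "w" else reads)[pid].add(path)

    deps = defaultdict(set)
    for pid, outputs in writes.items():
        for out in outputs:
            # A file opened for both reading and writing is not its own input.
            deps[out] |= reads[pid] - {out}
    return dict(deps)
```

For a log in which process 1 reads `a.c` and `a.h` and writes `a.o`, and process 2 reads `a.o` and writes `prog`, this yields `a.o` depending on `a.c` and `a.h`, and `prog` depending on `a.o`. The result is only a rough over-approximation: a process that reads many files and writes several outputs links all of them together.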
Posted Feb 20, 2021 5:21 UTC (Sat)
by zblaxell (subscriber, #26385)
btrfs is not log structured. It is a wandering tree. Under normal conditions and default settings, btrfs ruthlessly destroys all the historical metadata you're looking for after it has been on disk for a few minutes.
If you make a snapshot, btrfs can do a fast(ish) diff between the snapshot and the current filesystem contents, and that data could be fed into locatedb. This does require keeping a snapshot containing a complete copy of all data since the last locatedb update lying around on disk, though you might already be doing that for other reasons (like backups).
find-new can find new files since a transaction number. It doesn't provide a way to know whether files have been removed, but maybe that doesn't matter for a locate tool--it could stat every filename just before printing, and skip output of any that no longer exist, or maybe dumping out the entire filesystem tree through TREE_SEARCH is fast enough that incremental updates of the DB aren't necessary.
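The "stat every filename just before printing" idea is simple enough to sketch in Python. This is only an illustration under the assumption that the database yields plain path strings; `existing_matches` is a hypothetical name, and the substring match stands in for whatever pattern matching a real locate does.

```python
import os

def existing_matches(indexed_paths, pattern):
    """Yield indexed paths that contain `pattern` and still exist on disk.

    os.path.lexists() is used so that a dangling symlink still counts as
    present: the name exists even if its target does not.
    """
    for path in indexed_paths:
        if pattern in path and os.path.lexists(path):
            yield path
```

The cost is one extra stat-like call per match rather than per indexed file, so it only hurts for queries with very many hits.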
Posted Feb 18, 2021 18:57 UTC (Thu)
by flussence (guest, #85566)
locate used to have far *more* users than find: GTK+(2)'s file-open box used to transparently use it if installed to speed up its search feature significantly.
Posted Feb 18, 2021 23:24 UTC (Thu)
by benjamir (subscriber, #133607)
(6 responses)
Posted Feb 19, 2021 8:29 UTC (Fri)
by dvdeug (guest, #10998)
Posted Feb 20, 2021 5:20 UTC (Sat)
by zblaxell (subscriber, #26385)
(4 responses)
If they are stuck at a shell prompt in the middle of the installer and they need help, netcat is a tool for exfiltrating dmesg to support personnel over the network, and for importing updated package or other system rescue software.
Posted Feb 21, 2021 17:44 UTC (Sun)
by mathstuf (subscriber, #69389)
(3 responses)
Posted Feb 22, 2021 7:58 UTC (Mon)
by zblaxell (subscriber, #26385)
(2 responses)
Posted Feb 22, 2021 13:52 UTC (Mon)
by mathstuf (subscriber, #69389)
(1 response)
Posted Mar 1, 2021 14:02 UTC (Mon)
by zblaxell (subscriber, #26385)
Sometimes you don't get to pick the target support environment.
Posted Feb 20, 2021 5:21 UTC (Sat)
by zblaxell (subscriber, #26385)
(8 responses)
My favorite failure modes were the ones that make updatedb take more than 24 hours to run: failing to filter out a network filesystem, running on a busy build server at nice 20, or running on a fileserver with more than a hundred million files on spinning disks. Then the next day's cron job starts, there are multiple updatedb's running, and everything gets worse every day until someone notices and makes it stop.
Usually if we have a requirement for file-level indexing, it comes with a requirement to index file metadata because all the filenames are like "00000001.dat". So we find a tool that does that indexing and install that instead.
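The pile-up of overlapping runs described above is the kind of thing a non-blocking lock is meant to prevent. Here is a minimal Python sketch of the general technique using `flock()`; it says nothing about how any particular updatedb implements its locking, and `try_lock` and the lock path are hypothetical.

```python
import fcntl
import os

def try_lock(lock_path):
    """Try to take an exclusive lock without waiting.

    Returns an open file descriptor holding the lock, or None if another
    holder already has it. The lock is released when the descriptor is
    closed, including when the process exits or crashes.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd
```

A nightly job would call `try_lock("/run/lock/indexer.lock")` (an assumed path) at startup and exit immediately on `None`, so that a run lasting more than 24 hours delays the next one instead of competing with it.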
Posted Feb 20, 2021 17:53 UTC (Sat)
by Sesse (subscriber, #53779)
(7 responses)
And if updatedb is not fast enough, how would indexing into the files be?
Posted Feb 21, 2021 10:11 UTC (Sun)
by zblaxell (subscriber, #26385)
(6 responses)
updatedb doesn't use whitelists--it indexes everything not on a blacklist. Network filesystems are filtered with a blacklist that had to be updated when nfs4 came along, then again when smbfs^H^H^H^H^Hcifs came along, then again and again and again when a thousand fuse filesystems came along (the horror...semi-infinite generated file namespaces with expensive iterators popping up at random, vs. updatedb). If I start using a new filesystem tomorrow, I can have an updatedb-related disaster the following day.
updatedb would also index any directory under / that wasn't explicitly excluded by a blacklist. The whole point of using novel directories under / that no existing software knows about is that existing software will stay the hell out of them until explicitly directed there. Not updatedb! That thing goes looking for trouble, and as long as trouble isn't in a default list of a half-dozen excluded directories, it'll find it!
Most of the problems would have been trivially solved if updatedb used a whitelist of FHS directories (/etc, /bin, /lib, /lib64, /sbin, /srv, /usr, /var, /opt, /home if local, all with -xdev in find) and searched only those until told to do otherwise. A "normal user" will not store anything outside of $HOME anyway. Users who attach huge file stores to their machines can add the mount points to updatedb.conf, or use a standard (for Debian) path like /srv for the big file store. Users who use 'locate' every day can edit /etc/updatedb.conf.
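As a rough illustration of that whitelist approach, here is a Python sketch that walks only the listed roots and, like `find -xdev`, does not descend into directories on a different device. The function names are hypothetical, and this is not how any existing updatedb behaves.

```python
import os

def walk_one_filesystem(root):
    """Yield every path below root without crossing onto another device."""
    try:
        root_dev = os.lstat(root).st_dev
    except OSError:
        return
    stack = [root]
    while stack:
        directory = stack.pop()
        try:
            with os.scandir(directory) as it:
                entries = list(it)
        except OSError:
            continue  # unreadable or vanished directory
        for entry in entries:
            # A mount point is listed but not entered, as with find -xdev.
            yield entry.path
            try:
                same_dev = entry.stat(follow_symlinks=False).st_dev == root_dev
                if same_dev and entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
            except OSError:
                continue

def index_whitelist(roots):
    """Yield paths under the whitelisted roots only; missing roots are skipped."""
    for root in roots:
        if os.path.isdir(root):
            yield root
            yield from walk_one_filesystem(root)
```

With `roots` set to something like `["/etc", "/usr", "/var", "/opt", "/srv"]`, a newly mounted FUSE or network filesystem elsewhere in the tree is never visited unless someone adds it to the list.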
My host configuration logs indicate locking didn't start happening in Debian until 2015 (give or take a year) with the introduction of updatedb.mlocate. Even then, the locking had an obvious bug until 2017. This was long after I had stopped caring what the locate package maintainer did any more--by 2002 I was already using /etc/apt/preferences to ensure the locate package could not be installed on production machines, and a few years later I stopped testing new versions.
I don't know what systemd or plocate does, but if it doesn't start with bind-mounting whitelisted filesystems in a private namespace and running updatedb chroot in that namespace, I don't need to see the rest of it. I've seen updatedb's blacklisting accidentally defeated by users and junior sysadmins and upstream software updates, and I've seen nothing to prevent this from happening again.
When we index files, we define a service profile, purchase or assign hardware from inventory, task that hardware with running the service, i.e. storing hundreds of millions of files, and providing an indexing service for them. We assign staff and robots to operate and monitor the hardware, periodically check that the hardware is healthy and services in the profile are all running correctly, check that the indexes are correct and up to date and indexing files in scope and not indexing files not in scope, and ensure there is enough storage for the index and enough free iops to update it with whatever frequency the service profile says we need. In other words, these indexers are _supervised_.
None of this happens when updatedb is installed by default. It's a production risk and wasted cost until you turn it off or take control of it. If you turn your back on it, it inevitably fills up /var without warning, and burns power and media lifetime even when it's working normally--and when it's not, it can take big servers all the way down.
As far as I can tell, among the standard Unix packages, this is a unique property of *locate? Off the top of my head I can't think of any other past or present default-installed service that potentially or actually does scheduled work proportional to the number of files your host can access.
Posted Mar 13, 2021 18:18 UTC (Sat)
by nix (subscriber, #2304)
(5 responses)
Posted Mar 14, 2021 2:50 UTC (Sun)
by zblaxell (subscriber, #26385)
(4 responses)
That means sandboxing it to prevent accidents, and determining what it can access by whitelist to minimize surprises. Possibly also disabling it by default, but that might not be necessary if the default sandbox is sufficient.
systemd can set up a chroot filesystem namespace, or a few lines of shell script can set up bind mounts to map whitelisted filesystems into a chroot tree, then run some more robust version of 'chroot $sandbox_path updatedb'. This isn't rocket science--it's how modern system services should be designed.
Posted Mar 16, 2021 19:00 UTC (Tue)
by nix (subscriber, #2304)
(3 responses)
Posted Mar 19, 2021 7:17 UTC (Fri)
by zblaxell (subscriber, #26385)
(2 responses)
Giant filesystems are run by professionals who know what updatedb is and have either "purge updatedb" or "have a plan to manage updatedb" on their deployment checklist. We don't have to worry about them. I mean, they'll obviously be annoyed by having to add a new package name to their blacklists every few years, but they've lived with this for a quarter century already. They're fine.
Cloud nodes and IoT devices are built by professionals who know that updatedb strictly wastes energy and shortens device lifespan with zero upside. Nobody ever logs into one of these hosts, so nobody benefits from locate--they do everything by orchestration, or by dropping a root filesystem image built somewhere else onto the device. If they use locate at all, they use it on their development system with the prototype filesystem tree--every node has an identical copy of that. "Don't install updatedb" is burned into their toolchain. We don't need to worry about them either.
Managed clients are somewhere between totally ad-hoc and an IoT device, depending on what the manager permits and how strictly permissions are enforced. Here we have to rely on the manager to make a good decision: either not install updatedb, or permit only whitelisted filesystems and mount points on the host that updatedb can safely handle. This is a reasonable expectation of a client manager, and often required for other reasons like security. If the manager doesn't manage well, then these hosts fall into the next category.
The problem happens when everyone else runs updatedb: ad-hoc servers and desktops run by people who are not aware updatedb exists, or how it will interact with whatever random third-party thing they've bought and installed, and who also buy and install random third-party things. These are the ones that pick all-default options during install. These are also the ones that are most likely to combine something new (recall that lots of things are new to a stable Debian system) with updatedb's default configuration, and trigger a bespoke flavor of disaster. This is the most common kind of person to have an updatedb failure case in my experience.
Every failure is slightly different, and easily fixed by extending the blacklists in updatedb.conf, but none of the individual fixes help any other user (they are always something like "exclude /DavidsBigAndUnreliableUSBFilePile", which doesn't work for users named Peter or Alice). There's no possible patch to send upstream to prevent the problem from happening to anyone else, other than "throw out updatedb.conf and start over." None of the failures could have happened in a properly configured sandbox with a whitelist, but Debian's updatedb.conf syntax provides no way to whitelist anything. Predictable and avoidable failures just keep happening, year after year. Those who are most affected are least equipped to deal with them.
Fix the design so updatedb is properly whitelisted and sandboxed, or don't install it at all. Those are the two safe options for most users. Based on how it's going so far, I think I'm just going to have to keep repeating that every 5 to 10 years until the updatedb maintainers finally get it.
Posted Mar 20, 2021 1:49 UTC (Sat)
by nix (subscriber, #2304)
(1 response)
I do think the default updatedb and locate configurations could do with updating to make lightly-distributed setups employing NFS work better (ensuring no traversals of remotely-mounted filesystems by updatedb, while still making them searchable by locate). I have patches to do that: I should submit them... alas they require multiple databases, so no plocate support.
Posted Mar 26, 2021 4:04 UTC (Fri)
by zblaxell (subscriber, #26385)
bind mounts, private mount namespace, and chroot? If those aren't for sandboxing dodgy software (or protecting critical software from dodgy users), then I've been using them wrong for years. The chroot gets a curated (whitelisted) list of filesystems and mount points imported into it, from either a preconfigured list or /etc/fstab. Anything else--user mounts, removable media, new filesystems, giant external file stores--is not merely ignored, but not even accidentally accessible inside the sandbox. Can't accidentally wipe out a big tree of files from the locate db by mounting something in the middle of /home, since a mount like that is not propagated into the sandbox namespace (this is more of a problem for backups than locate, but we do run backups sandboxed this way to avoid that problem).
If your question is "how to implement it in a way that is backward compatible with updatedb.conf", I don't have an answer. plocate is an amazing technical leap forward from traditional updatedb, and yet the 1% of updatedb that wasn't rewritten from scratch for plocate is the 1% of updatedb that causes all the problems in practice. There is a robust supply of competing file indexers out there, and I've always just used one that doesn't reimplement the worst 1% of updatedb.
I thought I had seen all the ways updatedb can fail, but I ran plocate on a test VM for a while, and discovered a new (to me) one: it spends most of its time indexing trees that will not exist at locate time. updatedb.plocate traverses the filesystem with the openat() family of functions, which means it will block umounts and snapshot deletion until it's finished indexing the entire tree--then it will close its FD, and the tree will cease to exist. Unlike traditional updatedb, updatedb.plocate still has access to the umounted or deleted tree through its open directory FDs, and I didn't see any sign of updatedb.plocate periodically checking to see if the tree it is indexing is still reachable from /. That can multiply indexing time for snapshots (especially if you are using one of snapper's default configs which creates 24 new snapshots every day), and interfere with umounts if the user is trying to disconnect or remount that filesystem.
Posted Feb 23, 2021 21:15 UTC (Tue)
by ju3Ceemi (subscriber, #102464)
And booting a live system is not a solution..
Above all, specific tools like locate should not be part of the base system
Ergo, this is a non-issue
"rlocate is an implementation of the ``locate'' command that is always up-to-date." Except that rlocate itself is not up-to-date; it was written before inotify/fanotify, so it uses its own kernel module instead. But maybe one of the current locate implementors can add an always-up-to-date feature based on fanotify.
It also depends on how you value the user's time vs. the computer's time. However, on my personal system I indeed do not run updatedb automatically, because last time I did (long ago) it would run right on system startup (i.e., every morning) and make the system sluggish.
Wol
What goes into default Debian?
It completely doesn't matter while we are talking about "find / …" taking over 9 minutes, so let's end this subthread here.
What goes into default Debian?
[...]
8.09user 0.08system 0:08.25elapsed 99%CPU (0avgtext+0avgdata 2084maxresident)k
[...]
19.75user 0.40system 0:20.38elapsed 98%CPU (0avgtext+0avgdata 2196maxresident)k
[...]
24.98user 0.03system 0:25.08elapsed 99%CPU (0avgtext+0avgdata 2184maxresident)k
% /usr/bin/time locate wombat
[...]
0.00user 0.00system 0:00.02elapsed 71%CPU (0avgtext+0avgdata 4140maxresident)k
[...]
4.95user 0.10system 0:01.68elapsed 299%CPU (0avgtext+0avgdata 10952maxresident)k
[...]
5.15user 0.06system 0:01.72elapsed 302%CPU (0avgtext+0avgdata 11012maxresident)k
What goes into default Debian?
> If your filenames are that sensitive, then you should set permissions accordingly. If that means updatedb can't index your files, then that's a good thing.
That is not possible. In the case of mlocate and plocate, updatedb runs as root so that it can index all files. It is not possible to set permissions so that the root user would not be able to see and index them.
What goes into default Debian?
Sorry to rain on the parade...
I immediately installed Recoll, which similarly helps me find anything INSIDE files.
x-locate, which looks only at file names, is just abysmally useless for my needs, sorry.
I do have to retrieve things from 10 years ago by the keywords inside them, whatever the format (EXIF tags in a photo, PDFs, proprietary fossil Word/Excel files...). Recoll does it.
My only concern is that it seems to be a one-person project with no competing app.
It would be nice to have a centralised / shared database to answer "what changed recently", or "what changed since I last checked" that different tools could all use.
What goes into default Debian?
btrfs subvolume find-new?
What goes into default Debian?
I always start from a (sadly, involuntarily non-free) minimal install and customize from there on via ansible anyway.
I doubt a "default" does fit even 80% of the bill (how about offering more convenient "tasks" via tasksel for those who are not into CM yet?). Beginners don't need netcat and the like from the start; that's wasteful.
Who's the target audience of "default"? Are those the packages the majority of pop contest users install? Does the data from popularity contest back the arguments mentioned in the article?
It would be nice to see Debian reach an agreement based on neither guesswork nor egos.
What goes into default Debian?
When we index files, we define a service profile, purchase or assign hardware from inventory, task that hardware with running the service, i.e. storing hundreds of millions of files, and providing an indexing service for them. We assign staff and robots to operate and monitor the hardware, periodically check that the hardware is healthy and services in the profile are all running correctly, check that the indexes are correct and up to date and indexing files in scope and not indexing files not in scope, and ensure there is enough storage for the index and enough free iops to update it with whatever frequency the service profile says we need. In other words, these indexers are _supervised_.
It mystifies me that anyone running at this sort of scale would expect findutils' locate, or any comparable tool, to do a decent job. This is just way outside its design parameters, and putting it inside its design parameters would likely make it so complex that it would be unusable for its intended purpose.
You are building your argument from the opposite direction, but we agree at the conclusion: updatedb should not be actively wandering around unsupervised across every available path reachable from /, looking for things that it wasn't designed to handle, especially if there is any possibility it will be installed and enabled by default.
What goes into default Debian?
> I don't see how your properly configured sandbox is even *implementable*,
What goes into default Debian?
If I am in a corner, I can get help with those tools.
In such a scenario, locate can easily be replaced by find.
If someone rarely installs Debian, he can install locate by hand.
If someone installs Debian a lot, he should really do some scripting / automation to install and configure all his software.