LWN: Comments on "GNU grep's new features (Linux.com)" https://lwn.net/Articles/185755/ This is a special feed containing comments posted to the individual LWN article titled "GNU grep's new features (Linux.com)". en-us Fri, 19 Sep 2025 22:00:34 +0000 Fri, 19 Sep 2025 22:00:34 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net GNU grep's new features (Linux.com) https://lwn.net/Articles/187064/ https://lwn.net/Articles/187064/ jzbiciak Some read()s will mmap(), others will just do typical buffering w/ read system calls, still others will be more daft. GNU has control over glibc, but not other libc implementations.<br> <p> I doubt gauging the page size to optimize reads really helps for much other than pipes, and even then I don't know how much it helps. <br> Sat, 10 Jun 2006 21:59:09 +0000 GNU grep's new features (Linux.com) https://lwn.net/Articles/186743/ https://lwn.net/Articles/186743/ dougm Shouldn't the libc stdio functions already be doing that? They're supposed to buffer input with the <br> "appropriate" buffer size...<br> Thu, 08 Jun 2006 12:37:02 +0000 GNU grep's new features (Linux.com) https://lwn.net/Articles/186742/ https://lwn.net/Articles/186742/ nix busybox isn't specified by POSIX nor is it a GNU project (nor does it even slightly conform to the GNU Coding Standards).<br> <p> busybox is also a bit of a special case, in that 'size is everything', so the heaps-of-symlinks approach actually makes sense.<br> <p> (But you don't necessarily need any of the symlinks. If you're running everything from a busybox shell, you can tell it to find busybox commands first regardless of the absence of symlinks.)<br> <p> The BSDs have a tool called crunchgen which smashes a bunch of programs into one which conditionalizes off argv[0] in much the same way. 
(Except that busybox is smaller and doesn't penalize the rest of the system by forcing the *default* tools to be little featureless ones.)<br> Thu, 08 Jun 2006 12:34:49 +0000 GNU grep's new features (Linux.com) https://lwn.net/Articles/186689/ https://lwn.net/Articles/186689/ lysse Where does that leave BusyBox?<br> Thu, 08 Jun 2006 07:41:01 +0000 do we need it all in on utility? yes please! https://lwn.net/Articles/186686/ https://lwn.net/Articles/186686/ lysse "Intellisense is so 90s."<br> <p> Likewise, I guess writing code that doesn't assume infinite memory and CPU cycles and actually takes care to conserve its resource usage is so 80s...<br> <p> "I prefer using a tool with a user interface designed in this century"<br> <p> Whereas a goodly number of developers prefer using a tool whose UI has been under constant development for 20 years and is customisable to a degree where it works with their muscle memory. Good engineering simply doesn't become obsolete, and I'd rather have an intuitive command set than a pretty picture any day.<br> <p> Thu, 08 Jun 2006 07:38:43 +0000 purity vs. functionality https://lwn.net/Articles/186247/ https://lwn.net/Articles/186247/ coriordan Sometimes there is a conflict between the goals of design purity, and giving the user what they expect. I don't think either goal is perfect, so decisions or compromises have to be made.<br> <p> It's worth noting that the problem with multiple implementations is not as big a problem as "design purity" people sometimes claim it is. Factoring the regex code out into a library and standardising on that (as GNU usually does) greatly reduces problems such as inconsistency.<br> Mon, 05 Jun 2006 12:49:34 +0000 GNU grep's new features (Linux.com) https://lwn.net/Articles/186235/ https://lwn.net/Articles/186235/ nix The reasoning is entirely sensible: it is counterintuitive to have a program act differently simply because you mv'ed it to a different name. 
Among other things, having grep behave differently simply because you called it egrep *bans* you from making egrep a wrapper script, unless there is some other way to get at egrep's behaviour.<br> <p> POSIX agreed: hence grep -E and grep -F, and the deprecation of look-at-argv[0] programs.<br> Mon, 05 Jun 2006 09:56:20 +0000 do we need it all in on utility? yes please! https://lwn.net/Articles/186181/ https://lwn.net/Articles/186181/ vonbrand <p> What happened to the Unix philosophy of <em>small tools that do one thing, and do it well, that can be combined endlessly</em>? <p> Sure, it is nice to have all this in one package, but on the other hand it is infuriating to see that the "same functionality" (regular expression patterns, processing directories recursively, the list is seemingly endless) is implemented differently in several tools, and what would be handy to combine with other tools sometimes can't be done as it is bound to a specific program. Sun, 04 Jun 2006 02:55:41 +0000 GNU grep's new features (Linux.com) https://lwn.net/Articles/186175/ https://lwn.net/Articles/186175/ vonbrand <blockquote> ...having a single program that acts differently depending on its name is a violation of the GNU Coding Standards, ... </blockquote> <p> Yet another reason to consider the coding standard to be braindamaged. Sun, 04 Jun 2006 02:21:20 +0000 do we need it all in on utility? yes please! https://lwn.net/Articles/186095/ https://lwn.net/Articles/186095/ micampe <blockquote><p>I just got the cedet kit for emacs and although I've not had the time to configure it yet, it provides (the semantic part of it) 'intellisense' completion (that is, based on the code, not the regular completion emacs has based on what you type).</p></blockquote> <p>There's more than just intellisense. 
Intellisense is so 90s.</p> <p>If that does it for you, fine by me, but I prefer using a tool with a user interface designed in this century that doesn't require configuration to be used and three years to master (and I still doubt that tool can be considered on par with Eclipse JDT).</p> <blockquote><p>As for real-time error checking, I dunno. I guess I'd prefer the actual compiler to check this out. It's pretty good at that.</p></blockquote> <p>It <em>is</em> the <a href="http://www.eclipse.org/jdt/core/index.php">actual compiler doing that</a>.</p> <p>Eclipse (like most other IDEs) is much more powerful than a text editor launching a compiler.</p> Fri, 02 Jun 2006 22:08:04 +0000 ok, the longer version then https://lwn.net/Articles/186088/ https://lwn.net/Articles/186088/ jabby I agree. Access to source is a huge advantage. And keeping source code in a version control system goes a long way toward monitoring changes and preventing even the fully baked Ken Thompson exploit.<br> <p> And your paragraph in the context of GCC is not incorrect. It's absolutely true that Free Software helps to prevent source-borne trojans. Only in the context of the whole ACM article does this argument fall short and, as you say, that was not your aim in your short "top 10" list.<br> Fri, 02 Jun 2006 20:32:54 +0000 ok, the longer version then https://lwn.net/Articles/186086/ https://lwn.net/Articles/186086/ coriordan I agree with Ken that no one can verify all the code, but access to the source is better than no access to the source, and knowing that everyone has access to the source, and can analyse it in any way they want, and that if one person finds a trojan, they can remove it and publish the patch, is probably as good as it gets.<br> <p> It's not perfect, and some trust is still required, but that is a fact of life and cannot be avoided. 
All we can do is aim for "as good as it gets" - and that involves the four freedoms.<br> <p> When I was writing that paragraph in my blog, I wondered if I should go into the explanation, but I decided against because it was supposed to be a paragraph about GCC.<br> Fri, 02 Jun 2006 20:17:22 +0000 do we need it all in on utility? yes please! https://lwn.net/Articles/186073/ https://lwn.net/Articles/186073/ carcassonne <i>In short, it builds a model of your code, to provide smart completion suggestions, real-time error checking, refactoring and more.</i> <p> I just got the <a href=http://cedet.sourceforge.net/>cedet</a> kit for emacs and although I've not had the time to configure it yet, it provides (the semantic part of it) 'intellisense' completion (that is, based on the code, not the regular completion emacs has based on what you type). <p> And it does not seem to add any more memory use to emacs. <p> For refactoring, there's xref for emacs, but that's a commercial product. <p> As for real-time error checking, I dunno. I guess I'd prefer the actual compiler to check this out. It's pretty good at that. Fri, 02 Jun 2006 17:06:46 +0000 GNU grep's new features (Linux.com) https://lwn.net/Articles/186067/ https://lwn.net/Articles/186067/ nix This is true in the upstream distro, and has been true since at least 2002. 
(Before that they were three separate binaries with different behaviours; having a single program that acts differently depending on its name is a violation of the GNU Coding Standards, so it wasn't implemented like that except for a brief period in 2002.)<br> Fri, 02 Jun 2006 16:14:16 +0000 trust, GCC, and Ken Thompson's compiler trojan thesis https://lwn.net/Articles/186066/ https://lwn.net/Articles/186066/ nix The bar is raised yet more if you initially cross-compile your bootstrap GCC using a completely different compiler, preferably on a different architecture.<br> <p> It's still not infinitely high, but it's higher.<br> Fri, 02 Jun 2006 16:11:25 +0000 trust, GCC, and Ken Thompson's compiler trojan thesis https://lwn.net/Articles/186052/ https://lwn.net/Articles/186052/ jabby I read your 10 favorite tools in which you refer to the Ken Thompson article on the C compiler/login trojan in the context of GCC. You seem to be missing his point, though...<br> <p> Ken makes this very clear: "No amount of source-level verification or scrutiny will protect you from using untrusted code." GCC is a Free compiler for C, written in C and is thus just as vulnerable to this hack as any other self-referential code.<br> <p> Anyone could download GCC, follow the steps that Ken outlined and eventually install a version on their system that contains the trojan but with no trace in the source code. If that person were an insider at the place that compiles the binaries for your GNU/Linux distribution of choice, it wouldn't matter that you had access to the source code. Once you accept the binary from that trusted source, you are vulnerable. If you were to recompile the compiler from pristine source code with the trojaned gcc binary, you would still get a trojaned gcc!<br> <p> Admittedly, having an entirely free system helps tremendously in raising the bar of trust, but depending on a wide and far-flung community also means casting a wide net of trust. 
I trust the Free Software community, but the four freedoms do not prevent this particular hack. It all comes down to trust.<br> <p> Fri, 02 Jun 2006 14:27:05 +0000 do we need it all in on utility? yes please! https://lwn.net/Articles/186051/ https://lwn.net/Articles/186051/ micampe In short, it builds a model of your code, to provide smart completion suggestions, real-time error checking, refactoring and more.<br> Fri, 02 Jun 2006 13:29:51 +0000 perl https://lwn.net/Articles/186048/ https://lwn.net/Articles/186048/ stijn <p>I was thinking of this:</p> <blockquote> .. a version with French quotes «» that does interpolation before splitting into words </blockquote> <p> Which is taking it a little step further. I am sure I don't like it. </p> <p> (The comment editor does not let me enter <tt>&amp;#171;&amp;#187;</tt> alas so I cut and pasted French quotes into the comment - apparently that works) </p> Fri, 02 Jun 2006 13:10:40 +0000 perl https://lwn.net/Articles/186045/ https://lwn.net/Articles/186045/ niner Even perl 5 supports unicode in program text (not just string constants, but all identifiers). Just use utf8;<br> Fri, 02 Jun 2006 12:40:43 +0000 why "gpg -c" https://lwn.net/Articles/186046/ https://lwn.net/Articles/186046/ coriordan I like the 'gpg -c' functionality because I think most people don't realise that GnuPG can do that. I reckon most people think that you can only use it for public/private key encryption, and that's too complicated for some (at least as a first step).<br> <p> You know what needs to be done? 
'tar' needs to be given support for 'gpg -c', just like it supports gzip and bzip2 compression.<br> <p> Of course password stuff isn't as secure as public key encryption, but it's great when you don't want your local proxy to keep an unencrypted copy of something, and it's good to encrypt something and tell the password to the recipient when you meet them in person.<br> <p> I haven't heard of people doing this with 'openssl enc' - I'll look into it.<br> Fri, 02 Jun 2006 12:40:11 +0000 GNU grep's new features (Linux.com) https://lwn.net/Articles/186039/ https://lwn.net/Articles/186039/ jond egrep and fgrep are just shell scripts that call grep -E and grep -F on my system (sarge). I found this out the hard way, when I was using fgrep in a busy loop.<br> Fri, 02 Jun 2006 12:00:07 +0000 Unicode normalization https://lwn.net/Articles/186027/ https://lwn.net/Articles/186027/ arcticwolf If you <tt>use utf8</tt>, Perl 5 already allows you to use Unicode in identifiers, for example, actually. Fri, 02 Jun 2006 10:32:05 +0000 GNU grep's new features (Linux.com) https://lwn.net/Articles/186025/ https://lwn.net/Articles/186025/ stijn s/1/2/ to make it read(2)s.<br> Fri, 02 Jun 2006 09:56:44 +0000 GNU grep's new features (Linux.com) https://lwn.net/Articles/186024/ https://lwn.net/Articles/186024/ stijn On a side note, when implementing file reads I had a look at the grep code because it was so astoundingly fast. It works by gauging the native page size (getpagesize.h in the grep source code) and then doing read(1)s of that size. It is probably a well known fact of life - very useful.<br> Fri, 02 Jun 2006 09:53:50 +0000 Unicode normalization https://lwn.net/Articles/186020/ https://lwn.net/Articles/186020/ stijn <p> At one time I worked at PICA in the Netherlands. Together with a colleague (Geert-Jan van Opdorp) I worked on implementing Unicode search with support for wild cards etc. 
We made extensive use of the ICU libraries (IBM's International Components for Unicode), and IIRC Geert-Jan implemented Udi Manber's (et al) search algorithms to work with Unicode. That was quite a feat. All this was of course in the context of indexes and indexes to indexes, and it built on the already existing infrastructure. But it is doable.</p> <p> One of my previous projects was the development of a macro language + its processor. It is currently byte (and even ASCII) based. Someone once enquired about 'Unicode support'. I still wonder what possible meanings Unicode support could take on in that context, and I wonder to what extent Unicode should permeate the command line. IIRC perl6 might have Unicode tokens. Is that sane (whether it's true or not)? I am attracted to the idea of keeping Unicode for content, but perhaps that assumes a distinction that cannot be maintained. </p> Fri, 02 Jun 2006 09:47:31 +0000 Unicode normalization https://lwn.net/Articles/186022/ https://lwn.net/Articles/186022/ ibukanov <font class="QuotedText">&gt; Google might have something of the sort. </font><br> <p> It is not necessary for Google to know anything about combined characters etc. since Google search is strictly a word search. So they just need to assemble the list of all forms for a particular word and map them to the same index entry.<br> Fri, 02 Jun 2006 09:40:19 +0000 do we need it all in on utility? yes please! https://lwn.net/Articles/186019/ https://lwn.net/Articles/186019/ sitaram Rats! Forgot the formatting... <pre> Enc Dec Enc+Dec Blowfish 9.39 6.95 8.2 AES256 6.09 4.05 5.08 CAST5 4.87 2.67 3.7 </pre> Fri, 02 Jun 2006 08:57:58 +0000 do we need it all in on utility? yes please! https://lwn.net/Articles/186017/ https://lwn.net/Articles/186017/ sitaram I went to that link (10 fav software tools), and I noticed that "gpg -c" is a favourite!<br> <p> I prefer "openssl enc" to "gpg -c" -- it's almost an order of magnitude faster sometimes! 
Here're some speed ratios from my machine:<br> <p> Enc Dec Enc+Dec<br> Blowfish 9.39 6.95 8.2<br> AES256 6.09 4.05 5.08<br> CAST5 4.87 2.67 3.7<br> <p> You may want to evaluate it yourself and consider this if you are doing this often or to large files.<br> Fri, 02 Jun 2006 08:56:48 +0000 do we need it all in on utility? yes please! https://lwn.net/Articles/186016/ https://lwn.net/Articles/186016/ nix XEmacs has grown, because of all the packages: there's 120Mb of Lisp alone shipped with it.<br> <p> But of course not all of that is *loaded* at once :)<br> Fri, 02 Jun 2006 08:47:59 +0000 do we need it all in on utility? yes please! https://lwn.net/Articles/186013/ https://lwn.net/Articles/186013/ davidw Yeah, I still can't quite figure out what Eclipse *does* with all that memory. It's an order of magnitude bigger than a fully-loaded Emacs session.<br> Fri, 02 Jun 2006 08:16:43 +0000 Unicode normalization https://lwn.net/Articles/186009/ https://lwn.net/Articles/186009/ MortFurd Google does a decent job with that kind of thing - at least for what I do.<br> <p> German has vowels with the umlaut (the two dots above the character). The standard way to type these on a keyboard that doesn't have the umlauted characters is to substitute a two-character combination (ae for umlaut a, ue for umlaut u, etc.). Google properly finds words containing the umlaut characters, and also finds matches to the two-character substitute if you give it an umlaut (my home computer has a German keyboard, my work computer has an American keyboard, so I get to see both sides of the problem).<br> Fri, 02 Jun 2006 06:40:55 +0000 Unicode normalization https://lwn.net/Articles/186007/ https://lwn.net/Articles/186007/ dvdeug Unicode normalization may not be enough for localized searching, but it's the only _correct_ way to search Unicode text. 
LATIN CAPITAL LETTER E followed by COMBINING ACUTE ACCENT is the exact same thing as LATIN CAPITAL LETTER E WITH ACUTE according to the Unicode standard, and a program that will match one and not the other is not conforming to the standard. It's not unreasonable to ask for grep to at least provide an option to conform to the standard and work the way that users expect. Only a character set geek will understand why those two items don't match, and only such a person should have to understand that.<br> Fri, 02 Jun 2006 06:11:24 +0000 eight megs and constantly swapping https://lwn.net/Articles/186006/ https://lwn.net/Articles/186006/ xoddam <pre> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5915 jmaddox 15 0 14192 11m 8752 S 0.0 1.1 0:56.04 emacs </pre> <p>Fourteen, eight ... I guess libc used to be smaller :-)</p> Fri, 02 Jun 2006 05:27:23 +0000 do we need it all in on utility? yes please! https://lwn.net/Articles/186000/ https://lwn.net/Articles/186000/ flewellyn Emacs, of course, is everyone's favorite target for complaining about bloat. Remember "Eight <br> Megs And Constantly Swapping"? But the funny thing is, Emacs has remained relatively constant <br> in memory usage over the years, so that nowadays, that eight megs does not look so bad. <br> Compare it to Eclipse, or Mozilla, or even some individual component programs of GNOME or <br> KDE, and, well...gee. Emacs suddenly doesn't look so bloated.<br> Fri, 02 Jun 2006 00:42:50 +0000 Unicode normalization https://lwn.net/Articles/185992/ https://lwn.net/Articles/185992/ kingdon Sure it is complicated, but is anyone really doing much work on the problem (either in grep or in a separate tool)?<br> <p> Google might have something of the sort. 
I know I've searched for non-ASCII strings but haven't played extensively with things like a-with-an-accent (as one character) versus a plus accent-which-combines (as two characters).<br> <p> But if Lucene does anything like this, the Lucene FAQ doesn't seem to say so (it just says that Lucene uses Unicode and doesn't elaborate).<br> <p> Oh, and having the search behave differently based on locale is the wrong approach (IMHO). It is a common case that you have a lot of documents, some in one language, some in another, and some in more than one. Sure, giving up locales might cause you to lose some rules where language A treats character X one way, and language B treats it differently (hopefully obscure, but I'm not expert enough to say). Most of the time it would work to just look at the characters in the document and the search string, and ignore the locale.<br> <p> Thu, 01 Jun 2006 23:00:58 +0000 Unicode normalization https://lwn.net/Articles/185991/ https://lwn.net/Articles/185991/ mjr Fair points. Drawing the line would perhaps be less straightforward than I thought.<br> Thu, 01 Jun 2006 22:36:46 +0000 Unicode normalization https://lwn.net/Articles/185986/ https://lwn.net/Articles/185986/ tialaramex Once you start thinking along these lines you really want a completely localised search feature. Unicode normalisation only reduces identifiable Unicode characters to a single representation (e.g. forcing LATIN CAPITAL LETTER E followed by COMBINING ACUTE ACCENT to LATIN CAPITAL LETTER E WITH ACUTE, or vice versa) in a reproducible manner. That's maybe useful in a search program, but it's not enough to make it suitable for many non-English and particularly non-European languages. IMO it's fine to provide a separate tool for pre-processing text into one of the accepted Unicode normalisations.<br> <p> Some languages have minor distinctions between characters (think like "case sensitivity" in English) that sometimes need to be ignored when searching. 
Some have special rules that treat several characters as one in certain circumstances (e.g. imagine if "qu" was not treated as a "q" and a "u" in English, but as a unit "qu", while both "q" and "u" continued to exist separately in other words so that a search for "u" would not match "queen"). Some have the reverse (imagine if English treated W the same as VV so that a search for "veek" would match "week").<br> <p> On the whole this is a big enough can of worms to deserve a completely new piece of software, one specifically aimed at locale-sensitive searching.<br> Thu, 01 Jun 2006 22:27:59 +0000 Unicode normalization https://lwn.net/Articles/185982/ https://lwn.net/Articles/185982/ mjr I've yet to need it myself, but for a while it's been on my mind that grep should really be able to normalize Unicode strings for search purposes. After all, often we'd like to get at matches incorporating the given characters, not just the literal octet streams.<br> <p> Probably not a problem most of the time, and one can always normalize the files separately. A bit cumbersome, though.<br> Thu, 01 Jun 2006 21:35:39 +0000 GNU grep's new features (Linux.com) https://lwn.net/Articles/185977/ https://lwn.net/Articles/185977/ nix --only-matching's code is 24 lines long. Bloat? Not hardly.<br> <p> As for grep -P, it's implemented using PCRE, so no bloat at all.<br> <p> (Of course there will never be a pgrep or `prep' (ugh) as suggested in the article: egrep and fgrep are *already* obsolete, with grep -E and grep -F their preferred forms. Introducing more obsoleteness on top seems... peculiar.)<br> Thu, 01 Jun 2006 20:53:23 +0000 do we need it all in on utility? yes please! https://lwn.net/Articles/185975/ https://lwn.net/Articles/185975/ coriordan <p> I remember hearing a BSD guy complaining about GNU. He didn't like the number of extra options that GNU <code>ls</code> has. 
As if it proved his argument, he pointed out that BSD <code>ls</code> was only 18kb while GNU <code>ls</code> was a massive 75kb. Who cares if one tiny binary is 5 times the size of the other tiny binary? It's not "bloat", it's the stuff I want so that I can get my work done more easily. </p> <p>It also reminds me of the <a href="http://www.gnu.org/fun/jokes/ed.msg.html">Real men use 'ed'</a> mail.</p> <p> Speaking of nifty free software projects, I put my <a href="http://fsfe.org/en/fellows/ciaran/weblog/10_great_free_software_tools">10 favourite software tools</a> online today. </p> Thu, 01 Jun 2006 20:53:02 +0000
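
[Editor's note] The normalization point debated in the thread above (LATIN CAPITAL LETTER E plus COMBINING ACUTE ACCENT vs. the precomposed LATIN CAPITAL LETTER E WITH ACUTE) can be sketched in a few lines of Python. The `nfc_contains` helper is hypothetical, not anything grep provides; it uses the standard `unicodedata` module to compare both strings in NFC form, as the commenters suggest a normalizing search would:

```python
import unicodedata

def nfc_contains(haystack: str, needle: str) -> bool:
    """Substring search that ignores combining-vs-precomposed differences.

    Both sides are normalized to NFC, so "e" + COMBINING ACUTE ACCENT
    (U+0065 U+0301) and the precomposed "é" (U+00E9) compare equal.
    """
    return unicodedata.normalize("NFC", needle) in unicodedata.normalize("NFC", haystack)

decomposed = "caf\u0065\u0301"  # "cafe" + combining acute accent
precomposed = "caf\u00e9"       # "café" with precomposed é

assert decomposed != precomposed        # a raw byte/code-point match fails
assert nfc_contains(decomposed, precomposed)  # a normalized match succeeds
```

This only addresses canonical equivalence; as the thread notes, locale-sensitive matching (case folding, digraphs like "qu", W vs. VV) needs much more than normalization.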
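
[Editor's note] The buffering strategy one commenter attributes to grep (sizing read(2) calls to the native page size rather than reading line by line) can be illustrated roughly. This is a sketch of the idea only, not GNU grep's actual implementation; `count_matches` is a hypothetical helper:

```python
import os

def count_matches(path: str, pattern: bytes) -> int:
    """Count lines containing `pattern`, reading in page-size chunks.

    Reads several pages per read(2) call and handles lines that
    straddle chunk boundaries by carrying the partial tail forward.
    """
    page = os.sysconf("SC_PAGE_SIZE")  # native page size, per the thread
    count = 0
    leftover = b""
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            chunk = os.read(fd, page * 8)  # a few pages per syscall
            if not chunk:
                break
            lines = (leftover + chunk).split(b"\n")
            leftover = lines.pop()  # last element may be a partial line
            count += sum(1 for line in lines if pattern in line)
    finally:
        os.close(fd)
    if leftover and pattern in leftover:  # file may lack a final newline
        count += 1
    return count
```

Whether page-sized reads still matter much is exactly what the thread disputes: stdio already buffers, and the kernel's readahead does most of the heavy lifting for regular files.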