Joe 'Zonker' Brockmeier introduces glark on Linux.com. "What is glark? Basically, it's a utility that's similar to grep, but it has a few features that grep does not. This includes complex expressions, Perl-compatible regular expressions, and excluding binary files. It also makes showing contextual lines a bit easier."
Is Glark a Better Grep? (Linux.com)
Posted Aug 9, 2011 19:54 UTC (Tue) by nix (subscriber, #2304)
Um... GNU grep excludes binary files by default and has supported perl-compatible regexes since the release of grep 2.5 in 2002. The 'or' thing is done via multiple -e arguments (though it can't colourize in multiple colours)
Things that are genuinely new in glark that the article mentions: GNU grep has no analogue of --and, --before, or --after. GNU grep also has no configuration file. And App::Ack and 'git grep' still seem more useful than both for 90% of my uses of either.
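A minimal sketch of the multiple -e form of 'or' mentioned above. The sample file and its contents are assumptions made up for illustration:

```shell
# Throwaway sample data; the file name and contents are hypothetical.
tmp=$(mktemp -d)
printf '%s\n' 'alpha one' 'beta two' 'gamma three' > "$tmp/sample.txt"

# Each -e supplies another pattern; a line is printed if any of them matches.
or_matches=$(grep -e alpha -e gamma "$tmp/sample.txt")
printf '%s\n' "$or_matches"

rm -rf "$tmp"
```

This should print the lines containing either word and skip the 'beta' line.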
Is Glark a Better Grep? (Linux.com)
Posted Aug 9, 2011 20:09 UTC (Tue) by epa (subscriber, #39769)
The addition of 'and' and 'or' operators would make regular expressions significantly more powerful. 'not' would also be cool.
Absent support for these in the regexp language itself, having the Boolean operators as part of the command-line syntax for grep would still help a great deal. (As you mention, it supports 'or' with -e, but not the other two.) I did file a feature request for this a while back.
I haven't benchmarked but I suspect that ack is faster than glark, and nearly as fast as grep, based on the general performance of perl's regexp engine. Still, with today's CPU speeds searching will be I/O-bound anyway, so does it matter?
Is Glark a Better Grep? (Linux.com)
Posted Aug 9, 2011 20:10 UTC (Tue) by epa (subscriber, #39769)
Duh... of course regexps do have the | operator which provides 'or'. It is only 'and' and 'not' which are lacking. Perl's extended regexp syntax gets you some ability to do that, but it has limitations about fixed match widths.
Is Glark a Better Grep? (Linux.com)
Posted Aug 9, 2011 20:49 UTC (Tue) by SimonO (subscriber, #56318)
Uhm, I think regex also has some form of and:
(a|b).*(c|d) means: a or b followed by c or d (in which followed can be read as and)
This is not exactly the same as (a|b) and (c|d), but it's close. If necessary you can write ((a|b).*(c|d))|((c|d).*(a|b)), but that becomes hard to read.
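As a rough sketch, the both-orderings trick can be tried with grep -E. The words and sample lines here are hypothetical stand-ins for a, b, c and d:

```shell
# Hypothetical sample lines: two contain a word from each group, two do not.
tmp=$(mktemp -d)
printf '%s\n' 'apple and cherry' 'date with banana' 'apple only' 'date only' > "$tmp/sample.txt"

# "(apple OR banana) AND (cherry OR date)", spelled out in both orders.
and_matches=$(grep -E '(apple|banana).*(cherry|date)|(cherry|date).*(apple|banana)' "$tmp/sample.txt")
printf '%s\n' "$and_matches"

rm -rf "$tmp"
```

Only the first two lines should be printed. The pattern doubles in size for every extra term, which is the readability problem noted above.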
Is Glark a Better Grep? (Linux.com)
Posted Aug 9, 2011 20:56 UTC (Tue) by martinfick (subscriber, #4455)
Yeah, you can think of REs as defaulting to "and" (with ordering).
Is Glark a Better Grep? (Linux.com)
Posted Aug 9, 2011 21:26 UTC (Tue) by ballombe (subscriber, #9523)
The set of regular languages is closed under intersection and complementation, but the standard regexp syntax does not provide them. There are simple rules to negate a regexp (and also for intersection), but applying them is tedious and leads to very large expressions.
grep "and" and "not"
Posted Aug 9, 2011 22:32 UTC (Tue) by stevenj (guest, #421)
Of course, grep has been able to do "and" and "not" forever, via pipes (grep a | grep b is "a and b", and grep a | grep -v b is "a not b").
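A small sketch of the two pipelines, run against made-up log lines (the file name and contents are assumptions):

```shell
# Hypothetical log lines for illustration.
tmp=$(mktemp -d)
printf '%s\n' 'error disk full' 'error network down' 'warning disk slow' > "$tmp/log.txt"

# "error AND disk": the second grep filters the output of the first.
and_matches=$(grep error "$tmp/log.txt" | grep disk)

# "error NOT disk": -v inverts the second match.
not_matches=$(grep error "$tmp/log.txt" | grep -v disk)

printf 'and: %s\nnot: %s\n' "$and_matches" "$not_matches"

rm -rf "$tmp"
```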
grep "and" and "not"
Posted Aug 9, 2011 22:33 UTC (Tue) by stevenj (guest, #421)
(Sorry, just noticed that someone else had posted this below.)
grep "and" and "not"
Posted Aug 10, 2011 10:38 UTC (Wed) by epa (subscriber, #39769)
I've often used that simple trick with pipes but it doesn't work very well when you want grep to print context or to highlight matches.
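To illustrate the context problem with a sketch (sample lines are hypothetical, and -C is assumed to be available, as it is in GNU and BSD grep):

```shell
# Hypothetical three-line file: the match sits between two context lines.
tmp=$(mktemp -d)
printf '%s\n' 'mount /dev/sda1' 'error disk full' 'retrying write' > "$tmp/log.txt"

# The first grep prints one line of context on each side of the match.
with_context=$(grep -C 1 error "$tmp/log.txt")

# The second grep knows nothing about context, so it throws those lines away.
narrowed=$(printf '%s\n' "$with_context" | grep disk)

printf 'first stage:\n%s\nafter second stage:\n%s\n' "$with_context" "$narrowed"

rm -rf "$tmp"
```

The first stage yields all three lines, but only the matching line survives the second stage.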
grep "and" and "not"
Posted Aug 11, 2011 3:11 UTC (Thu) by martinfick (subscriber, #4455)
Hmm, so perhaps grep should take a page from the find/xargs playbook and add -ogrep and -igrep (and -iogrep for both) options to grep! The -ogrep option would make grep output in a markup style that -igrep could read and understand. This format could contain extra context info, line #s, file names... so that greps further down the pipeline would not lose that info.
Is Glark a Better Grep? (Linux.com)
Posted Aug 9, 2011 20:33 UTC (Tue) by xbobx (subscriber, #51363)
Posted Aug 9, 2011 20:54 UTC (Tue) by martinfick (subscriber, #4455)
Which have the likely advantage of scaling better on multi cores, since a pipe makes a very good parallelization divider! Who needs map reduce when you have pipes? ;)
Line and field matches
Posted Aug 12, 2011 1:53 UTC (Fri) by Richard_J_Neill (subscriber, #23093)
What would be really nice is a way to extract a certain subset of a line, matched by a different RE, in a single process. Eg this prints "123":
echo -e "Hello 123\nWorld 456" | grep Hello | grep -oE '[0-9]+'
Line and field matches
Posted Aug 12, 2011 9:51 UTC (Fri) by jwakely (subscriber, #60262)
sed -nr '/Hello/s/[^0-9]*([0-9]+).*/\1/p'
"... now you have two problems" ;-)
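A quick side-by-side sketch of the two approaches on the same input. It assumes a sed that accepts -E for extended regexps (used here in place of -r) and a grep that supports -o:

```shell
# The same two-line input as in the parent comment.
input=$(printf 'Hello 123\nWorld 456')

# Two processes: select the line, then extract the digits.
via_pipeline=$(printf '%s\n' "$input" | grep Hello | grep -oE '[0-9]+')

# One process: the /Hello/ address selects the line, the substitution keeps the digits.
via_sed=$(printf '%s\n' "$input" | sed -nE '/Hello/s/[^0-9]*([0-9]+).*/\1/p')

printf 'pipeline: %s\nsed: %s\n' "$via_pipeline" "$via_sed"
```

Both variables should end up holding 123.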
Line and field matches
Posted Aug 12, 2011 17:30 UTC (Fri) by nix (subscriber, #2304)
Of course you can use backreferences to reshuffle data arbitrarily and even to implement a symbol table and turn sed into a really ugly real programming language. (see dc.sed.)
Is Glark a Better Grep? (Linux.com)
Posted Aug 9, 2011 20:52 UTC (Tue) by martinfick (subscriber, #4455)
> Still, with today's CPU speeds searching will be I/O-bound anyway, so does it matter?
I wouldn't make that assumption. I have found gawk to be twice as fast as perl for some simple matching/substitutions on some very large data sets where it mattered. Benchmark, don't assume, if you care about speed.
Is Glark a Better Grep? (Linux.com)
Posted Aug 9, 2011 21:54 UTC (Tue) by dlang (✭ supporter ✭, #313)
it depends on the complexity of your grep. if you are doing a grep -f file1 file2 where there are a number of patterns in file1, you can easily be bottlenecked by CPU, even nowadays.
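For reference, a tiny sketch of that form of invocation. The file names and contents are made up, and a real CPU-bound case would have far more patterns and data:

```shell
# Hypothetical pattern list (one pattern per line) and data file.
tmp=$(mktemp -d)
printf '%s\n' 'foo' 'bar' > "$tmp/patterns.txt"
printf '%s\n' 'foo line' 'baz line' 'bar line' > "$tmp/data.txt"

# -f reads the patterns from a file; a line matches if any pattern matches.
f_matches=$(grep -f "$tmp/patterns.txt" "$tmp/data.txt")
printf '%s\n' "$f_matches"

rm -rf "$tmp"
```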
Is Glark a Better Grep? (Linux.com)
Posted Aug 9, 2011 22:55 UTC (Tue) by nix (subscriber, #2304)
Hey, thanks to UTF-8's variable-width nature making every single regexp a variable-width matching nightmare, until very recently almost any grep in a UTF-8 locale would be CPU-bound. (This is now, thankfully, fixed, at least for those situations in which the regexp itself is unibyte.)
Is Glark a Better Grep? (Linux.com)
Posted Aug 10, 2011 3:20 UTC (Wed) by quanstro (guest, #77996)
i got an account just to reply to this. gnu grep was about 80x slower for utf-8 locales for a long time. this was due to a malloc(3) for every byte of input. gnu grep was broken; this has nothing to do with the merits of utf-8. gnu grep is now fixed.
as far as grep is concerned, the variable-width nature of utf-8 almost never is an issue. consider a codepoint represented as a 3-byte sequence. the only time this is at all different than 3 consecutive ascii characters is (a) for "." and (b) in a character class, but i don't see why that should be a big problem.
utf-8's first home was plan 9. plan 9 grep has always been very fast, and since 1992 or so has been utf-8 only. it's unfortunate that a bug in gnu grep has given utf-8 a bad rap.
Is Glark a Better Grep? (Linux.com)
Posted Aug 10, 2011 10:54 UTC (Wed) by nix (subscriber, #2304)
Well, . and in particular .* are quite common characters in regexps. It seems to me the variable-width thing is fairly often relevant, or why did GNU grep need to optimize that case at all?
Is Glark a Better Grep? (Linux.com)
Posted Aug 10, 2011 13:37 UTC (Wed) by quanstro (guest, #77996)
ah, even "." can be sneaky and add internal states rather than complicate the input path. since speed matters when the input is >> the re itself, this can be a big win.
after having another look at thompson's grep, i see that . is translated as [^\n], so there's really just one case. and character classes are burst into ranges corresponding to the number of bytes in the encoding, so the input path can still match byte wise.
the reason gnu grep got tripped up in older versions was because (1) unix took the path of wide characters with many different encodings, (2) thompson's technique requires knowledge of the character set, and (3) the mbtowc functions were allocating memory for each character of input.
Is Glark a Better Grep? (Linux.com)
Posted Aug 9, 2011 21:24 UTC (Tue) by wahern (subscriber, #37304)
GNU grep will tend to be significantly faster than glark and similar utilities (especially ones that merely use Perl, Ruby, or libpcre) because GNU grep implements a blazingly fast DFA engine. Not only that, but it also pre-filters input with a Boyer-Moore search before even getting to a regular expression. You can't do captures--which you need for highlighting--with a DFA. To do captures efficiently you have to match with the DFA, then run an NFA over the matching lines to get the captures. (Perl, et al, only implement backtracking NFAs, AFAIK.) I believe this is how GNU grep does syntax highlighting. RE2 also first runs a DFA before switching to an NFA if the expression has captures, and I can say first hand that RE2 blows Perl out of the water, not only in the worst case but generally.
Regular expressions cannot do `not' or `and'. For these you need something more powerful, like PEGs (parsing expression grammars) and the predicate logic. Some regex engines have zero-width assertions which approximate this, but these are bolted onto the underlying conceptual model and can seriously impact performance. A lookahead assertion, AFAIU, would basically be equivalent to a second, distinct regex; not at all unlike pipelining two grep invocations (which would at least give you SMP benefits, too!). A lookbehind would be a really awkward lookahead, where you clumsily walk backwards from a match, trying to apply the lookbehind.
Someone please correct me if I'm wrong on any of these points.
Here's a very good series of articles on the subject. They detail the development of Russ Cox's RE2:
One of the coolest little regular expression libraries I've ever seen is SLRE. It's small and readable: http://slre.sourceforge.net/
Chapter 1 of "Beautiful Code" includes an even simpler, yet quite sophisticated regex implementation. This chapter was the sample chapter for Amazon Kindle last time I checked.
Regex "and" and "or"
Posted Aug 10, 2011 16:53 UTC (Wed) by PO8 (guest, #41661)
"Regular expressions cannot do `not' or `and'."
As noted above, if you interpret "not" as complement and "and" as intersection (a fairly natural interpretation, and one consistent with interpreting "or" as alternation), it is straightforward to add these operators to regular expressions without disturbing any machinery. I've sometimes wished for a "reverse" operator: reversing a regexp by hand is also a bit of a pain, and regular languages are closed under reversal as well.
Is Glark a Better Grep? (Linux.com)
Posted Aug 10, 2011 0:08 UTC (Wed) by intgr (subscriber, #39733)
> Still, with today's CPU speeds searching will be I/O-bound anyway, so does it matter?
With today's memory sizes, almost every disk access is cached anyway. So yes, it can make a big difference. I frequently set "LC_ALL=C" in my shell just to make grep run faster (because comparisons in a UTF-8 locale are more expensive).
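A sketch of the per-command variant of that habit, which sets the locale for a single grep rather than the whole shell. The sample file is hypothetical, and whether this is actually faster depends on the grep version, so it is worth benchmarking:

```shell
# Hypothetical sample data.
tmp=$(mktemp -d)
printf '%s\n' 'needle here' 'nothing' 'another needle' > "$tmp/sample.txt"

# The assignment prefix applies LC_ALL=C to this one command only.
# In the C locale grep matches bytes, not multibyte characters.
count=$(LC_ALL=C grep -c needle "$tmp/sample.txt")
printf '%s\n' "$count"

rm -rf "$tmp"
```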
Is Glark a Better Grep? (Linux.com)
Posted Aug 10, 2011 5:18 UTC (Wed) by jwb (guest, #15467)
This seems like an odd thing to say. The ratio of memory to disk, at least in an ordinary PC, is lower now than ever. It used to be common to have, say, 16MB of RAM and 540MB of disk. Now you have 4GB of RAM and 2TB of disk.
Maybe in a laptop you might still see something like 4GB of RAM and 80GB of disk, which is a nice ratio, but not so much for non-portables.
Is Glark a Better Grep? (Linux.com)
Posted Aug 10, 2011 8:38 UTC (Wed) by jezuch (subscriber, #52988)
> The ratio of memory to disk, at least in an ordinary PC, is lower now than ever.
But do you grep randomly over the disk or just over a specific subset that you are currently working on?
Is Glark a Better Grep? (Linux.com)
Posted Aug 10, 2011 14:47 UTC (Wed) by farnz (guest, #17727)
Back when I first used Linux, I had 4MB RAM, 80MB of HDD, and the full kernel source (just the source, no VCS information, no build cruft) was on the order of 40MB. Now, I have 8GB RAM, 1.5TB of HDD, and yet the Linux kernel source is under 2GB for a full git clone plus build output.
So, on the one hand, I now have over 150 times as much disk as RAM, whereas I used to have just 20 times as much disk. Against that, when I started playing with Linux, a single interesting project was 10 times my total RAM; it's now under 0.25 times my total RAM.
For (hopefully) obvious reasons, I've never been in the habit of grepping my entire disk - I tend to grep inside a single project at a time. When I started, grepping the entire Linux source was I/O bound every time I needed to do it. Now, grepping the entire Linux source is not I/O bound after the first time - or if I'm grepping after a build.
Is Glark a Better Grep? (Linux.com)
Posted Aug 10, 2011 15:20 UTC (Wed) by nix (subscriber, #2304)
Quite so. I can't grep a whole *distro's* source code entirely in RAM, but most smaller greps are entirely in cache after the first run (especially if you don't grep things like .git and .svn directories).
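One way to keep a recursive grep out of VCS directories, sketched with a made-up tree and assuming a grep that supports --exclude-dir (GNU grep does):

```shell
# Hypothetical project tree: one source file and one VCS metadata file.
tmp=$(mktemp -d)
mkdir -p "$tmp/src" "$tmp/.git"
printf '%s\n' 'needle in source' > "$tmp/src/main.c"
printf '%s\n' 'needle in metadata' > "$tmp/.git/packed"

# -r recurses, -l lists matching files, --exclude-dir skips directories by name.
found=$(grep -rl --exclude-dir=.git needle "$tmp")
printf '%s\n' "$found"

rm -rf "$tmp"
```

Only the file under src/ should be listed.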
Is Glark a Better Grep? (Linux.com)
Posted Aug 10, 2011 5:41 UTC (Wed) by JoeBuck (subscriber, #2330)
As of version 2.7 (released 2010-09-16), grep is orders of magnitude faster for UTF-8 matches than it used to be. Unfortunately the web is very good at remembering obsolete information, like 'GNU grep is very slow under UTF-8, use the C locale even if some UTF-8 characters are present', and the search engines don't find newer, updated information because so many respected sites link to the out-of-date information.
Those of you who are running some "enterprise" version of GNU/Linux might be using an older, much slower GNU grep. If so, download from a GNU mirror and start running the latest.
Is Glark a Better Grep? (Linux.com)
Posted Aug 11, 2011 3:49 UTC (Thu) by jwb (guest, #15467)
Is that true on every platform, or only where it can take advantage of Intel SSE 4.2 instructions, or some third thing?
Is Glark a Better Grep? (Linux.com)
Posted Aug 11, 2011 14:31 UTC (Thu) by nix (subscriber, #2304)
On every platform. Grep doesn't contain nonportable assembler (the very idea!)
Is Glark a Better Grep? (Linux.com)
Posted Aug 11, 2011 18:10 UTC (Thu) by cmccabe (subscriber, #60281)
Grep is a pretty widely used tool. Putting in some basic assembly level optimizations seems like a pretty good idea, provided they're optional and the source can still be compiled without them.
Is Glark a Better Grep? (Linux.com)
Posted Aug 11, 2011 23:07 UTC (Thu) by nix (subscriber, #2304)
If they went anywhere, they would go in gnulib, but they're not needed because grep spends nearly all of its time inside the Boyer-Moore preliminary filter, not inside the regex matcher. The majority of that, except in pathological cases, is likely to be spent hunting for a matching first letter inside kwset.c:bmexec(), i.e. inside memchr(). memchr() is already optimized by any half-decent libc (certainly including glibc) to use SSE2 and the like. In practice this will be memory-bandwidth bound.
If we try to defeat this optimization by searching for 'a[^[l-z]]{2}b|c' in several thousand files each consisting of half a million 'a's followed by a 'b', grep spends most of its time inside, well, the libc's regex implementation (and if we weren't using glibc, we'd spend it inside gnulib's regex implementation, which is derived from the same source and uses the same algorithms). Half of that time is spent backtracking. And I can't really see a way to optimize *that* search using SSE :( it is intrinsically backtrack-heavy as far as I can tell.
Is Glark a Better Grep? (Linux.com)
Posted Aug 18, 2011 21:46 UTC (Thu) by walex (subscriber, #69836)
As usual, please consider the "'cat -v' considered harmful" argument before extolling the wonders of more features in a tool when those can be provided with pipelined tools (I find recursive 'grep' particularly regrettable, but also coloring). In this respect 'glark' seems overdesigned. GNU 'grep' as well, but that's the philosophy of GNU tools.
However, there are a few other pattern match tools, and a particularly useful one is 'agrep' and similar tools like 'tre' (as the old 'agrep' used to be distributed under a restrictive license).
The particularly useful bits of 'agrep' are treating any sequence of lines separated by a given delimiter as a single "line", approximate pattern matching, and true 'and'.
Is Glark a Better Grep? (Linux.com)
Posted Aug 10, 2011 13:09 UTC (Wed) by ssam (subscriber, #46587)
looks interesting.
also recently discovered ack-grep, which is very handy for grepping source code.
Is Glark a Better Grep? (Linux.com)
Posted Aug 12, 2011 14:51 UTC (Fri) by Wummel (subscriber, #7591)
ack-grep has a very nice feature which ignores VCS directories (.svn, CVS, etc.) by default. This is why I switched to ack-grep not so long ago. Unfortunately glark does not advertise such a feature.
Is Glark a Better Grep? (Linux.com)
Posted Aug 12, 2011 22:25 UTC (Fri) by bronson (subscriber, #4806)
Grep can do that too. I run with this in my .bashrc:
Posted Aug 15, 2011 11:46 UTC (Mon) by ssam (subscriber, #46587)
ack-grep is nothing more fancy than that. it's just a wrapper around grep that does the "right thing" when you are grepping through source code.
Is Glark a Better Grep? (Linux.com)
Posted Aug 15, 2011 17:15 UTC (Mon) by bronson (subscriber, #4806)
According to Ack's home page:
> ack is pure Perl, so it runs on Windows just fine. It has no dependencies other than Perl 5.
So it's not just a wrapper around the grep executable, it's a 100% perl reimplementation. And, in my experience, ack is at least 5-10X slower than gnu grep when processing massive logfiles.
Is Glark a Better Grep? (Linux.com)
Posted Aug 15, 2011 23:48 UTC (Mon) by petdance (guest, #78383)
> And, in my experience, ack is at least 5-10X slower than gnu grep when processing massive logfiles.
Which is why you don't use ack on massive logfiles, because that's not why it was created. It takes longer to drive nails with a circular saw, too.
Is Glark a Better Grep? (Linux.com)
Posted Aug 16, 2011 1:24 UTC (Tue) by bronson (subscriber, #4806)
Obviously. That was only intended as rather conclusive evidence for the previous poster.
Is Glark a Better Grep? (Linux.com)
Posted Aug 17, 2011 14:33 UTC (Wed) by petdance (guest, #78383)
> ack-grep is nothing more fancy than that.
That's not true. There are plenty of things that ack does that grep does not.
* ack supports filetypes to search, and not just groups of extensions.
* ack lists files by filetypes without actually searching, so you can inventory a tree
* ack allows you to use custom output for the matches, taking advantage of match groups in the Perl regular expressions.
* ack groups matches by file
* ack has a pass-thru option where it will show non-matching lines as well as matching lines, which is useful for tail -f'ing log files.
* ack has a -1 option that stops after the first match
etc etc etc
Is Glark a Better Grep? (Linux.com)
Posted Aug 22, 2011 4:24 UTC (Mon) by sitaram (subscriber, #5959)
Ack may have surprises in store for people who manage to *totally* replace grep in their mind/muscle memory.
For example, try 'ack sda /proc/mounts' (or anything in /proc).
It used to be even more idiosyncratic; at one time, an *empty* input (like 'cat /dev/null | ack some-pattern') used to cause ack to recurse into $PWD (i.e., behave like 'ack some-pattern $PWD')! Thankfully, that's been fixed.
Is Glark a Better Grep? (Linux.com)
Posted Aug 31, 2011 15:18 UTC (Wed) by nix (subscriber, #2304)
Quite. If only it ran its searches in parallel I could use it to entirely replace 'git grep', but right now 'git grep' wins because it's so very, very fast (due to the autoparallelization it does on multicore machines).
Why "glark"?
Posted Aug 10, 2011 16:57 UTC (Wed) by PO8 (guest, #41661)
We know why "grep" is called "grep". Why is "glark" called "glark"? I couldn't find an explanation on the linked page, and it's not exactly mnemonic...
Why "glark"?
Posted Aug 11, 2011 11:34 UTC (Thu) by sorpigal (subscriber, #36106)