|
|
Subscribe / Log in / New account

A JIT for grepping: jrep and rejit

March 5, 2014

This article was contributed by Alexandre Rames

Jrep is a grep-like program powered by the rejit library. So it is "just another" grepping program, except that it is quite fast. Opportunities to improve regular expression matching performance are interesting for the speed gain itself, but also for the technical aspects behind the scenes. This article introduces jrep and rejit by showing the time taken by jrep and GNU grep to recursively grep through the Linux kernel source for different regular expressions.

Note that jrep is a sample program implemented to showcase the rejit library. The rejit library is actually a Just-In-Time (JIT) compiler: at runtime it generates tailored machine code to match each regular expression. Both jrep and rejit are still in development, so jrep is far from being as fully featured as GNU grep (one of my favorite programs), and undoubtedly not as stable or as tested. Now that this is clear, let's look at what jrep and rejit can do.

Benchmark results

Benchmarks were run on an Ubuntu 13.10 machine, with 32GB of RAM, and a 12-core Intel i7-4930K CPU @ 3.40GHz (which supports SSE4.2). The engines are timed to grep through the Linux kernel 3.13.5 source, using a script (available in the rejit repository) generating commands like:

    $ time -p engine --recursive --with-filename --line-number regexp linux-3.13.5/ > /dev/null
Results show the best out of five runs for each engine and regular expression. Running multiple times ensures that the filesystem caches the files we are processing. The engine versions used are:
    $ cd <rejit_repo> && git rev-parse HEAD
    7830c93befade62933bb068d3556a9256903a86e
    $ grep --version | head -n1
    grep (GNU grep) 2.17

[Simple
regular expression graph]

The graphs presented show the time required by grep and jrep to search through the Linux kernel source for different regular expressions (regexps), indicated just above their associated graph bars in the extended regular expression syntax. The total time required by each engine is split between time spent in user and time spent in sys.

The user time is spent in user space, in our case mostly processing the files. The sys time is spent in the kernel. Jrep spends 75% of that time (computed for the regexp foobar) walking the file tree (using ftw) and mapping files to memory. Not considering kernel or ftw improvements, this 75% of sys can be seen as incompressible time required to get the files ready for processing. The other 25% are used elsewhere in the "matching" section of the program. Grep operates under the same constraint (using fts and manual buffer reading), but consistently shows less time spent in sys, so there are likely things to improve on this side for jrep.

When looking for unsigned, jrep slows down significantly due to the high number of matches. Some profiling is needed to figure out where the time is lost in this situation. For other strings it behaves well. The second regexp éphémère shows matching for non-ASCII characters. Rejit currently only partially supports UTF-8: non ASCII-characters are not yet supported in character classes (e.g. [çé0-9]).

Simple alternation of words show a similar trend:

[Alternation
graph]

The third alternation allows for common substrings extraction: the three alternated words contain the substring def. It allows searching for the substring, and when it is found completing the match from there. Both grep and jrep do so (via different mechanisms, out of scope for this article). Many engines do not perform such extractions, or sometimes only perform prefix and/or suffix extraction.

Both jrep and grep handle more complex regular expressions quite well:

[Complex regular
expression graph]

Many engines would try matching the regexps from the beginning, which can be inefficient. Similarly to the substring extraction above, both grep and jrep manage to look for the easier part of the regexps first and complete the match from there. For [a-z]{3,10}.*gnu.*[a-z]{3,10}, jrep generates code that first looks for gnu, and after it is found, goes backward and then forward to match the whole expression.

More complicated alternations show an area where rejit could improve:

[Slow alternations graph]

The handling of more complex alternations is a known (relatively) weak point of jrep (more precisely of rejit) in need of improvement. Grep uses a smart Boyer-Moore algorithm. To look for aaa|bbb|ccc at position p, it looks up the character at p + 2, and if it is not a, b, or c, knows it can jump three characters ahead to p + 3 (and then look at the character at p + 5).

On the other hand, like for single strings, rejit handles alternations simply: it applies brute force. But it does so relatively efficiently, so the performance is still good. To search for aaa|bbb|ccc at some position p in the text, rejit performs operations like:

    loop:
      find 'aaa' at position p
      if found goto match
      find 'bbb' at position p
      if found goto match
      find 'ccc' at position p
      if found goto match
      increment position and goto loop
    match:
The complexity is proportional to the number of alternations. Worse, when the number of alternated expressions exceeds a threshold (i.e. when the compiler cannot allocate a register per alternated expression), rejit falls back to some slow default code. This is what happens for the two regexps with eight or more alternated strings. The code generation should be fixed to allow an arbitrary number of alternated strings.

In this situation, the initial multi-threading support in jrep can help. Multi-threading can be enabled with the -j (an alias for --jobs) option. Passing -jN will make jrep run with one thread listing the files to process in a queue, and N threads popping filenames from this list and processing them separately. Access to stdout is protected by a lock to ensure the results are printed "by file" and not interleaved.

As the results show, enabling multi-threading must be done carefully. Regular expressions requiring a lot of time matching in user space see improved performance, but regular expressions spending little time doing matching work may not benefit — or, worse, see decreased performance — from the multiple threads. When using multi-threading only the real time (time observed by the user) is reported, as sys and user are the sum of times spent for all cores, which would exceed the observed running time. All the regexps in the previous graphs are not conducive to multi-threading, except for the longer running "unsigned" case. The multi-threading support is not mature yet, and again profiling is needed to see what is happening. Ideally, useless additional threads should stay idle and not hurt performance.

GNU grep does not provide a multi-threading option.

About rejit and jrep

Not counting the underlying library, jrep only has about 500 lines of code. After jrep parses the command-line options, the file tree is traversed, each file is mapped to memory buffer and passed to rejit to find all the matches at once. If any matches are found jrep prints the results and carries on to the next file.

Note again that rejit is a prototype library. Many things are still unsupported. Notably, it only supports the x86_64 architecture. That being said, it has a few interesting features.

Like grep, it is non-backtracking, using finite automata to represent regular expressions, which was introduced into programming by Ken Thompson. For details see this excellent article by Russ Cox, or in short: it does not use an algorithm with exponential complexity. So, on the machine I am writing this article on, to check if the regexp (a?){25}a{25} matches the string aaaaaaaaaaaaaaaaaaaaaaaaa ("a" 25 times), rejit takes less than 0.001 second, while Google v8's Javascript engine (the backtracking engine used in Chrome and Android) takes about three seconds. If we try with 26 "a"s instead v8 would take twice as long, and again twice as long for 27 "a"s, etc.

The specifications of the machine do not matter here. The point is that some regular expressions can't be handled well by backtracking engines. In practice, the type of regular expressions causing this slow behaviour may not be very useful, but this is simply a risk if the program does not control what regular expressions are run (e.g. user search allowing regexps).

The common feature of back-references ((foo|bar)\1 matches foofoo and barbar) cannot be implemented without backtracking (an NP-complete problem). However, as noted by Cox in more detail, it does not justify the choice of a backtracking for the general algorithm. Backtracking can be used locally to handle back-references.

As seen in the benchmark results above, rejit is also (or at least can be) fast. Generating the code on the fly allows generating tailored code for every regular expression processed. In particular, the JIT allows detecting at runtime what CPU instructions are available and generating efficient code for the performance-critical sections. On this machine, rejit uses the Intel SSE4.2 extension. Except for those critical sections, the code generated is still rather poor and could benefit from a lot of optimization — but that is work for later.

The project was started as a JIT because I knew v8's regexp engine was a backtracking JIT, and I wanted to see if I could implement a non-backtracking one. Since I was unable to support all architectures at once, I had to first chose one architecture to support; I had quite a bit of experience with ARM, but not much with Intel, so x86_64 it was.

With hindsight I believe that the best solution would be to implement an architecture-independent engine (C/C++), and then locally extend it with JIT code. While producing an engine that would work on all platforms, most of the current performance could be achieved by implementing just a few pieces of assembly; more assembly stubs could then come in to bring faster speed to normally slow-to-match regular expressions.

If the project finds the resources to do so, that plan would be ideal. Otherwise I will likely focus half of my time on fixing bugs, and half on implementing features that I find interesting. The next area will probably be sub-matches and back-references. Some missing features and ideas are on the project TODO list. The library and sample programs (including jrep) are available under the GPLv3 license, and help is welcome.

The home page contains more information, some benchmarks showing other situations where rejit performs well (e.g. DNA matching), and some documentation. There is also an article that introduces the mechanisms used in rejit. I would be glad to discuss more about the project on the mailing list or by email (see the website).


Index entries for this article
GuestArticlesRames, Alexandre


to post comments

A JIT for grepping: jrep and rejit

Posted Mar 6, 2014 4:17 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (7 responses)

It seems the past few years have seen quite a few grep replacements cropping up (ack is older, but ag, a node.js-based one I forget the name of, and I know I've seen other floating by…usually in $petlanguage after getting "fed up" with grep's performance).

Out of curiosity, how did "git grep" do? Also, I've been trying to get myself to use word-level regexes more and more (mainly to trim down bogus results in large code bases).

A JIT for grepping: jrep and rejit

Posted Mar 6, 2014 9:56 UTC (Thu) by arames (guest, #88205) [Link] (6 responses)

Would you know of a good set of regexps to benchmark?

I'm sure there are many engines out there I don't know about. It's just a fun thing to work on! I haven't looked, but I suspect that node.js must use the Google v8 regexp engine, in which case it would be bactracking and would not perform dominating regexp extraction, so rather slow on complex regexps.

I just benchmarked git-grep on the same regexps, and it performs super well. It takes generally slightly more time on 'sys' and 'user' results, but the 'real' times are generally better (multi-threading behaves superbly).

(See http://pastebin.com/k2L5dDYh for the aligned tables...)

regexp engine sys user real
foobar git-grep 0.51 0.56 0.23
foobar taskset 1 git-grep 0.32 0.48 0.81
foobar jrep 0.23 0.08 0.32
foobar grep-2.17 0.18 0.51 0.70

However it looks like it does not perform dominating regexps extraction, so the performance on complex regexps is only excellent (with a risk of getting worse).

regexp engine sys user real
^\s*(void|unsigned|int)\s+func\s+\( git-grep 0.31 5.67 0.85
^\s*(void|unsigned|int)\s+func\s+\( taskset 1 git-grep 0.28 4.10 4.38
^\s*(void|unsigned|int)\s+func\s+\( jrep 0.25 0.12 0.35
^\s*(void|unsigned|int)\s+func\s+\( grep-2.17 0.18 1.15 1.33

So out of the box it is really good. That tells me I should be looking more at the multi-threading!

A JIT for grepping: jrep and rejit

Posted Mar 6, 2014 10:58 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (3 responses)

I don't have a set handy, but if you have the time, bandwidth, and disk space, grepping the WebKit tree (5G download; 12G checkout if you use git gc) might show git grep having an overhead with large repos. I use :Ggrep -w ^R^W in vim (with vim-fugitive) to grep for symbols all the time, so picking words from a source (particularly those with the project prefix) could be a good start.

I imagine the node.js isn't the fastest in town for complicated regexes, but it'd be interesting to throw into the mix to show where backtracking gets you (and possibly constructing some regexes better written with backtracking and comparing if against de-backtracked regexes (e.g., "(foofoo|barbar)") for the other tools) in terms of cost/benefit.

I'd also imagine git grep probably does much better with a very dirty tree (e.g., with build artifacts) than other tools, but it also has knowledge the others don't.

A JIT for grepping: jrep and rejit

Posted Mar 6, 2014 11:32 UTC (Thu) by tao (subscriber, #17563) [Link] (2 responses)

I'm not sure I fully understand your use-case about grepping the entire source tree from vim, but wouldn't your situation benefit from using ctags?

A JIT for grepping: jrep and rejit

Posted Mar 6, 2014 11:53 UTC (Thu) by mathstuf (subscriber, #69389) [Link]

Possibly, but WebKit is a *large* repo. I don't know where I'd put money on things failing…the ctag file being too slow to keep up-to-date, Vim choking on loading it into memory, or Vim choking on actually using the data. I mean, git isn't being the happiest with it (30 seconds to *decide* what to do for a rebase) on an SSD and Vim isn't exactly the fastest once you pile all my plugins on there and open a few hundred buffers (searching, fugitive, quick fix, etc.)… I'll have to run some experiments to see how many searches it takes for ctags to come out ahead of git grep (and the API isn't stable, so every update is probably going to need a response of the ctags).

A JIT for grepping: jrep and rejit

Posted Mar 9, 2014 15:07 UTC (Sun) by mgedmin (subscriber, #34497) [Link]

Crags only gives you definitions, and sometimes you meet to find where names are used.

GNU id-utils might help there (like ctags it builds an index and then queries it). Or cscope, which AFAIU is similar.

Personally, once I got an SSD, I'm reasonably happy with all the tools (especially git grep) and haven't found the need to go back to id-utils.

A JIT for grepping: jrep and rejit

Posted Apr 25, 2014 8:37 UTC (Fri) by arames (guest, #88205) [Link] (1 responses)

Updated benchmarks for GNU grep, git grep, and jrep, available at http://coreperf.com/technical/2014/04/25/rejit_jrep_updat...

A JIT for grepping: jrep and rejit

Posted Apr 26, 2014 22:52 UTC (Sat) by nix (subscriber, #2304) [Link]

That's looking even more impressive than before!

How does jrep compare with RE2?

Posted Mar 6, 2014 11:06 UTC (Thu) by jnareb (subscriber, #46500) [Link] (4 responses)

How does jrep compare with other regular expression engines, like RE2 (https://code.google.com/p/re2/)?

How does jrep compare with RE2?

Posted Mar 6, 2014 11:17 UTC (Thu) by arames (guest, #88205) [Link] (3 responses)

See here for benchmarks with other engines.

How does jrep compare with RE2?

Posted Mar 6, 2014 12:03 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (2 responses)

Wow, v8 is really whacked on there. Does it get the right answers?

How does jrep compare with RE2?

Posted Mar 6, 2014 13:08 UTC (Thu) by arames (guest, #88205) [Link] (1 responses)

The v8 numbers should be interpreted with care. In a couple of graphs the performance jumps for higher text sizes; this is bogus and I need to have a look to see what is going on. The others should be alright but I need to spend some time to double check.

How does jrep compare with RE2?

Posted Apr 25, 2014 8:35 UTC (Fri) by arames (guest, #88205) [Link]

A JIT for grepping: jrep and rejit

Posted Mar 6, 2014 11:10 UTC (Thu) by jezuch (subscriber, #52988) [Link] (1 responses)

I'm actually surprised that a JIT for regexes didn't show up much earlier. This seems like a perfect use-case for JITs...

Also: if we're talking about performance, what about grep -i and its counterpart in jrep? :) (Yes, I'm alluding to the recent announcement that its performance in grep was improved tenfold in some cases.)

A JIT for grepping: jrep and rejit

Posted Mar 6, 2014 11:24 UTC (Thu) by arames (guest, #88205) [Link]

There are already JIT regexp engines. v8's engine is one, and there are others. Actually even the original implementation from Ken Thompson was generating native IBM 7094 code!

Rejit does not support case-insensitive matching. This is on the TODO list. But I suspect that SIMD code will turn out very helpful for this.

A JIT for grepping: jrep and rejit

Posted Mar 8, 2014 9:39 UTC (Sat) by mb (subscriber, #50428) [Link] (2 responses)

What are the chances of GNU grep using the rejit library as a backend on x86_64? Does the GNU grep code architecture allow that?

A JIT for grepping: jrep and rejit

Posted Mar 8, 2014 22:36 UTC (Sat) by arames (guest, #88205) [Link] (1 responses)

Plugging rejit in GNU grep would surely be doable with some work, however I don't think this would be a good idea. In its current state Rejit is not mature enough, and is still missing crucial features. That would cause too many regressions in GNU grep.
There are alternative possibilities though. Or more work could go on the 'jrep' tool to make it compliant with UNIX grep. Or GNU grep could also be augmented with some small JIT. Or maybe that should be reconsidered when rejit has become super mature and featured.

A JIT for grepping: jrep and rejit

Posted Mar 9, 2014 0:43 UTC (Sun) by nix (subscriber, #2304) [Link]

GNU grep 2.15+ already *has* a JIT: just use grep -P with pcre 8.20+ on a supported architecture.

A JIT for grepping: jrep and rejit

Posted Mar 9, 2014 14:02 UTC (Sun) by andersg (guest, #25522) [Link]

How does it compare to redgrep?

A JIT for grepping: jrep and rejit

Posted Jan 10, 2015 15:14 UTC (Sat) by evandrix (guest, #84408) [Link] (1 responses)

Where's the link to download jrep to evaluate? I only see the one to rejit.

A JIT for grepping: jrep and rejit

Posted Jan 10, 2015 15:16 UTC (Sat) by evandrix (guest, #84408) [Link]

nvm, found it together with rejit.


Copyright © 2014, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds