A JIT for grepping: jrep and rejit
Jrep is a grep-like program powered by the rejit library. So it is "just another" grepping program, except that it is quite fast. Opportunities to improve regular expression matching performance are interesting for the speed gain itself, but also for the technical aspects behind the scenes. This article introduces jrep and rejit by showing the time taken by jrep and GNU grep to recursively grep through the Linux kernel source for different regular expressions.
Note that jrep is a sample program implemented to showcase the rejit library. The rejit library is actually a Just-In-Time (JIT) compiler: at runtime it generates tailored machine code to match each regular expression. Both jrep and rejit are still in development, so jrep is far from being as fully featured as GNU grep (one of my favorite programs), and undoubtedly not as stable or as tested. Now that this is clear, let's look at what jrep and rejit can do.
Benchmark results
Benchmarks were run on an Ubuntu 13.10 machine with 32GB of RAM and a 6-core (12-thread) Intel i7-4930K CPU @ 3.40GHz (which supports SSE4.2). The engines were timed searching through the Linux kernel 3.13.5 source, using a script (available in the rejit repository) that generates commands like:
$ time -p engine --recursive --with-filename --line-number regexp linux-3.13.5/ > /dev/null

Results show the best out of five runs for each engine and regular expression. Running multiple times ensures that the filesystem caches the files being processed. The engine versions used are:
$ cd <rejit_repo> && git rev-parse HEAD
7830c93befade62933bb068d3556a9256903a86e
$ grep --version | head -n1
grep (GNU grep) 2.17
The graphs presented show the time required by grep and jrep to search through the Linux kernel source for different regular expressions (regexps), indicated just above their associated graph bars in the extended regular expression syntax. The total time required by each engine is split between time spent in user and time spent in sys.
The user time is spent in user space, in our case mostly processing the files. The sys time is spent in the kernel. Jrep spends 75% of its sys time (measured for the regexp foobar) walking the file tree (using ftw) and mapping files to memory. Short of kernel or ftw improvements, this 75% can be seen as incompressible time required to get the files ready for processing. The remaining 25% is spent elsewhere in the "matching" part of the program. Grep operates under the same constraint (using fts and manual buffer reading), but consistently shows less time spent in sys, so there are likely things to improve on this side for jrep.
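For readers curious about what that preparation step looks like, here is a minimal sketch (not jrep's actual code) of the nftw-plus-mmap pattern described above. The process_buffer() function is a placeholder of my own standing in for the matching work:

// Minimal sketch of the "walk the tree, mmap each file" step (not jrep's code).
// process_buffer() is a placeholder for the actual matching work.
#include <ftw.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

static void process_buffer(const char* name, const char* data, size_t size) {
  (void)data;  // A real program would run the compiled regexp over 'data' here.
  printf("%s: %zu bytes mapped\n", name, size);
}

static int visit(const char* path, const struct stat* sb, int type, struct FTW*) {
  if (type != FTW_F || sb->st_size == 0)
    return 0;                                  // Only regular, non-empty files.
  int fd = open(path, O_RDONLY);
  if (fd == -1)
    return 0;                                  // Skip unreadable files.
  void* data = mmap(nullptr, sb->st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd);                                   // The mapping outlives the fd.
  if (data == MAP_FAILED)
    return 0;
  process_buffer(path, static_cast<const char*>(data), sb->st_size);
  munmap(data, sb->st_size);
  return 0;
}

int main(int argc, char** argv) {
  // Walk the given directory (or the current one), keeping up to 16 fds open.
  return nftw(argc > 1 ? argv[1] : ".", visit, 16, FTW_PHYS);
}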
When looking for unsigned, jrep slows down significantly due to the high number of matches; some profiling is needed to figure out where the time goes in this situation. For other strings it behaves well. The second regexp, éphémère, shows matching of non-ASCII characters. Rejit currently only partially supports UTF-8: non-ASCII characters are not yet supported in character classes (e.g. [çé0-9]).
Simple alternations of words show a similar trend:
The third alternation allows for common-substring extraction: the three alternated words all contain the substring def. The engine can search for that substring and, once it is found, complete the match from there. Both grep and jrep do so (via different mechanisms, out of scope for this article). Many engines do not perform such extractions, or only perform prefix and/or suffix extraction.
Both jrep and grep handle more complex regular expressions quite well:
Many engines would try matching the regexps from the beginning, which can be inefficient. Similarly to the substring extraction above, both grep and jrep manage to look for the easier part of the regexps first and complete the match from there. For [a-z]{3,10}.*gnu.*[a-z]{3,10}, jrep generates code that first looks for gnu, and after it is found, goes backward and then forward to match the whole expression.
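As an illustration of that strategy (and not rejit's generated code, which works directly on mapped buffers rather than C strings), here is a simplified line-at-a-time matcher for [a-z]{3,10}.*gnu.*[a-z]{3,10}: it locates the literal gnu first and only then checks for the pieces around it. The helpers line_matches() and has_lowercase_run() are my own, not part of rejit:

// Illustrative only: matching a line against [a-z]{3,10}.*gnu.*[a-z]{3,10}
// by locating the literal "gnu" first, then completing the match around it.
#include <cstring>
#include <cctype>
#include <cstdio>

// True if [begin, end) contains a run of at least 'n' consecutive lowercase
// letters. (Any run of 3 or more letters contains a run of 3 to 10, so the
// upper bound of the quantifier does not matter for a "does the line match"
// check.)
static bool has_lowercase_run(const char* begin, const char* end, int n) {
  int run = 0;
  for (const char* p = begin; p != end; ++p) {
    run = islower(static_cast<unsigned char>(*p)) ? run + 1 : 0;
    if (run >= n) return true;
  }
  return false;
}

static bool line_matches(const char* line) {
  const char* end = line + strlen(line);
  // Try each occurrence of the extracted literal in turn.
  for (const char* hit = strstr(line, "gnu"); hit != nullptr;
       hit = strstr(hit + 1, "gnu")) {
    if (has_lowercase_run(line, hit, 3) &&       // backward part: [a-z]{3,10}.*
        has_lowercase_run(hit + 3, end, 3))      // forward part:  .*[a-z]{3,10}
      return true;
  }
  return false;
}

int main() {
  printf("%d\n", line_matches("see the gnu manual pages"));  // prints 1
  printf("%d\n", line_matches("GNU gnu g n u"));             // prints 0
  return 0;
}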
More complicated alternations show an area where rejit could improve:
The handling of more complex alternations is a known (relatively) weak point of jrep (more precisely of rejit) in need of improvement. Grep uses a smart Boyer-Moore algorithm. To look for aaa|bbb|ccc at position p, it looks up the character at p + 2, and if it is not a, b, or c, knows it can jump three characters ahead to p + 3 (and then look at the character at p + 5).
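The following sketch shows that skip idea in isolation for aaa|bbb|ccc. It is illustrative only: grep generalizes this with Boyer-Moore-style shift tables rather than a hard-coded check, and find_alternation() is a name of my own:

// Sketch of the skip described above, for the alternation aaa|bbb|ccc.
// For any of the three strings to start at p, the character at p + 2 must be
// 'a', 'b' or 'c'; otherwise no match can start before p + 3.
#include <cstring>
#include <cstdio>

static const char* find_alternation(const char* text, size_t size) {
  for (size_t p = 0; p + 3 <= size; ) {
    char c = text[p + 2];
    if (c != 'a' && c != 'b' && c != 'c') {
      p += 3;                                  // The whole window can be skipped.
      continue;
    }
    if (memcmp(text + p, "aaa", 3) == 0 ||
        memcmp(text + p, "bbb", 3) == 0 ||
        memcmp(text + p, "ccc", 3) == 0)
      return text + p;
    ++p;                                       // Could not rule out p + 1 or p + 2.
  }
  return nullptr;
}

int main() {
  const char* text = "xxxyyyzzbbbqq";
  const char* hit = find_alternation(text, strlen(text));
  printf("%s\n", hit ? hit : "no match");      // prints "bbbqq"
  return 0;
}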
On the other hand, like for single strings, rejit handles alternations simply: it applies brute force. But it does so relatively efficiently, so the performance is still good. To search for aaa|bbb|ccc at some position p in the text, rejit performs operations like:
loop:
  find 'aaa' at position p
  if found goto match
  find 'bbb' at position p
  if found goto match
  find 'ccc' at position p
  if found goto match
  increment position and goto loop
match:

The complexity is proportional to the number of alternations. Worse, when the number of alternated expressions exceeds a threshold (i.e. when the compiler cannot allocate a register per alternated expression), rejit falls back to some slow default code. This is what happens for the two regexps with eight or more alternated strings. The code generation should be fixed to allow an arbitrary number of alternated strings.
In this situation, the initial multi-threading support in jrep can help. Multi-threading can be enabled with the -j (an alias for --jobs) option. Passing -jN makes jrep run one thread that lists the files to process into a queue, and N threads that pop filenames from this queue and process them separately. Access to stdout is protected by a lock so that the results are printed "by file" rather than interleaved.
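This is the classic producer/consumer pattern. The sketch below is not jrep's code, but shows the shape of it with C++11 threads (compile with -pthread); the file listing and the matching step are stubbed out with placeholders of my own:

// Sketch of the -jN scheme: one thread lists files into a queue, N worker
// threads pop file names and process them, and a lock on stdout keeps the
// per-file output from interleaving.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>
#include <cstdio>

static std::queue<std::string> files;
static std::mutex queue_mutex;
static std::condition_variable queue_cv;
static bool done_listing = false;
static std::mutex stdout_mutex;

static void list_files() {
  // Placeholder for the file-tree walk: push a few fake names.
  for (int i = 0; i < 8; i++) {
    std::lock_guard<std::mutex> lock(queue_mutex);
    files.push("file_" + std::to_string(i) + ".c");
    queue_cv.notify_one();
  }
  { std::lock_guard<std::mutex> lock(queue_mutex); done_listing = true; }
  queue_cv.notify_all();
}

static void worker(int id) {
  while (true) {
    std::string name;
    {
      std::unique_lock<std::mutex> lock(queue_mutex);
      queue_cv.wait(lock, [] { return !files.empty() || done_listing; });
      if (files.empty()) return;               // Listing finished, queue drained.
      name = files.front();
      files.pop();
    }
    // Placeholder for mapping and matching 'name'. Results are printed while
    // holding the stdout lock, so each file's output stays in one block.
    std::lock_guard<std::mutex> lock(stdout_mutex);
    printf("[worker %d] %s: matches would be printed here\n", id, name.c_str());
  }
}

int main() {
  const int n_jobs = 4;                        // What -j4 would request.
  std::thread lister(list_files);
  std::vector<std::thread> workers;
  for (int i = 0; i < n_jobs; i++) workers.emplace_back(worker, i);
  lister.join();
  for (auto& w : workers) w.join();
  return 0;
}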
As the results show, enabling multi-threading must be done carefully. Regular expressions that require a lot of matching time in user space see improved performance, but regular expressions that spend little time matching may not benefit from the extra threads, or may even run slower. When multi-threading is used, only the real time (the time observed by the user) is reported, since sys and user are sums of the time spent on all cores and would exceed the observed running time. None of the regexps in the previous graphs are conducive to multi-threading, except for the longer-running unsigned case. The multi-threading support is not mature yet, and again profiling is needed to see what is happening. Ideally, useless additional threads should stay idle and not hurt performance.
GNU grep does not provide a multi-threading option.
About rejit and jrep
Not counting the underlying library, jrep is only about 500 lines of code. After jrep parses the command-line options, the file tree is traversed; each file is mapped into a memory buffer and passed to rejit, which finds all the matches at once. If any matches are found, jrep prints the results and carries on to the next file.
Note again that rejit is a prototype library. Many things are still unsupported. Notably, it only supports the x86_64 architecture. That being said, it has a few interesting features.
Like grep, it is non-backtracking: it uses the finite-automaton approach to regular expression matching that Ken Thompson introduced to programming. For details see this excellent article by Russ Cox; in short, it does not use an algorithm with exponential complexity. So, on the machine I am writing this article on, to check whether the regexp (a?){25}a{25} matches the string aaaaaaaaaaaaaaaaaaaaaaaaa ("a" 25 times), rejit takes less than 0.001 seconds, while Google's V8 JavaScript engine (the backtracking engine used in Chrome and Android) takes about three seconds. With 26 "a"s instead, V8 takes about twice as long, twice as long again for 27 "a"s, and so on.
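The effect is easy to observe with the small harness below, which builds the pattern (a?){n}a{n} and matches it against n "a"s using std::regex. With a backtracking std::regex implementation (the common case), the time climbs steeply as n grows; a non-backtracking engine answers essentially instantly:

// Build (a?){n}a{n}, match it against "a" repeated n times, and time the match.
// With a backtracking regex implementation the time typically roughly doubles
// with each extra "a"; start with a small n and increase it gradually.
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <regex>
#include <string>

int main(int argc, char** argv) {
  int n = argc > 1 ? atoi(argv[1]) : 20;       // Try 21, 22, ... and watch the time.
  std::string as(n, 'a');
  std::string pattern =
      "(a?){" + std::to_string(n) + "}a{" + std::to_string(n) + "}";
  std::regex re(pattern);

  auto start = std::chrono::steady_clock::now();
  bool matched = std::regex_match(as, re);     // The pattern does match the string.
  auto stop = std::chrono::steady_clock::now();

  double seconds = std::chrono::duration<double>(stop - start).count();
  printf("n=%d matched=%d time=%.3fs\n", n, matched, seconds);
  return 0;
}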
The specifications of the machine do not matter here. The point is that some regular expressions cannot be handled well by backtracking engines. In practice, the regular expressions that trigger this slow behaviour may not be very useful, but it is a real risk whenever the program does not control what regular expressions are run (e.g. a user-facing search that accepts regexps).
The common back-reference feature ((foo|bar)\1 matches foofoo and barbar) cannot be implemented without backtracking (it is an NP-complete problem). However, as Cox notes in more detail, that does not justify choosing backtracking as the general algorithm: backtracking can be used locally to handle back-references.
As seen in the benchmark results above, rejit is also (or at least can be) fast. Generating the code on the fly allows generating tailored code for every regular expression processed. In particular, the JIT allows detecting at runtime what CPU instructions are available and generating efficient code for the performance-critical sections. On this machine, rejit uses the Intel SSE4.2 extension. Except for those critical sections, the code generated is still rather poor and could benefit from a lot of optimization — but that is work for later.
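As an example of the kind of code SSE4.2 enables (illustrative only; rejit emits this sort of thing as JIT-compiled machine code rather than C++ intrinsics), the PCMPESTRI instruction can scan a text 16 bytes at a time for a place where a short literal could start. The function find_literal_sse42() is a name of my own; compile with -msse4.2:

// Sketch of literal search with SSE4.2 string instructions.
// Assumes the needle is at most 16 bytes long.
#include <nmmintrin.h>   // SSE4.2 intrinsics (_mm_cmpestri).
#include <cstring>
#include <cstdio>

static int find_literal_sse42(const char* hay, int hay_len,
                              const char* needle, int needle_len) {
  if (needle_len == 0) return 0;
  alignas(16) char nbuf[16] = {0};
  memcpy(nbuf, needle, needle_len);
  __m128i n = _mm_load_si128(reinterpret_cast<const __m128i*>(nbuf));

  for (int i = 0; i < hay_len; ) {
    int chunk = hay_len - i < 16 ? hay_len - i : 16;
    // Copy the chunk so we never read past the end of the buffer.
    alignas(16) char hbuf[16] = {0};
    memcpy(hbuf, hay + i, chunk);
    __m128i h = _mm_load_si128(reinterpret_cast<const __m128i*>(hbuf));
    // Index of the first position in the chunk where the needle starts to
    // match (possibly only partially, at the end of the chunk), or 16.
    int idx = _mm_cmpestri(n, needle_len, h, chunk,
                           _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ORDERED);
    if (idx == 16) { i += 16; continue; }
    int candidate = i + idx;
    if (candidate + needle_len > hay_len) return -1;  // Not enough bytes left.
    if (memcmp(hay + candidate, needle, needle_len) == 0) return candidate;
    i = candidate + 1;                                // False or partial match.
  }
  return -1;
}

int main() {
  const char* text = "a grep-like program powered by the rejit library";
  printf("%d\n", find_literal_sse42(text, (int)strlen(text), "rejit", 5));  // 35
  return 0;
}

A production version would load 16 bytes directly instead of copying, handle longer needles, and fall back to plain code when SSE4.2 is absent, which is exactly the kind of decision the JIT can make at runtime.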
The project was started as a JIT because I knew v8's regexp engine was a backtracking JIT, and I wanted to see if I could implement a non-backtracking one. Since I was unable to support all architectures at once, I first had to choose one architecture; I had quite a bit of experience with ARM, but not much with Intel, so x86_64 it was.
In hindsight, I believe the best solution would be to implement an architecture-independent engine (in C/C++) and then locally extend it with JIT code. That would produce an engine working on all platforms, while most of the current performance could be achieved by implementing just a few pieces of assembly; more assembly stubs could then be added to speed up regular expressions that are normally slow to match.
If the project finds the resources to do so, that plan would be ideal. Otherwise I will likely focus half of my time on fixing bugs, and half on implementing features that I find interesting. The next area will probably be sub-matches and back-references. Some missing features and ideas are on the project TODO list. The library and sample programs (including jrep) are available under the GPLv3 license, and help is welcome.
The home page contains more information, some benchmarks showing other situations where rejit performs well (e.g. DNA matching), and some documentation. There is also an article that introduces the mechanisms used in rejit. I would be glad to discuss more about the project on the mailing list or by email (see the website).
Index entries for this article
GuestArticles: Rames, Alexandre
Posted Mar 6, 2014 4:17 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link] (7 responses)
Out of curiosity, how did "git grep" do? Also, I've been trying to get myself to use word-level regexes more and more (mainly to trim down bogus results in large code bases).
Posted Mar 6, 2014 9:56 UTC (Thu)
by arames (guest, #88205)
[Link] (6 responses)
I'm sure there are many engines out there I don't know about. It's just a fun thing to work on! I haven't looked, but I suspect that node.js uses the Google v8 regexp engine, in which case it would be backtracking and would not perform dominating regexp extraction, so it would be rather slow on complex regexps.
I just benchmarked git-grep on the same regexps, and it performs super well. It takes generally slightly more time on 'sys' and 'user' results, but the 'real' times are generally better (multi-threading behaves superbly).
(See http://pastebin.com/k2L5dDYh for the aligned tables...)
regexp  engine              sys   user  real
foobar  git-grep            0.51  0.56  0.23
foobar  taskset 1 git-grep  0.32  0.48  0.81
foobar  jrep                0.23  0.08  0.32
foobar  grep-2.17           0.18  0.51  0.70
However it looks like it does not perform dominating regexps extraction, so the performance on complex regexps is only excellent (with a risk of getting worse).
regexp                               engine              sys   user  real
^\s*(void|unsigned|int)\s+func\s+\(  git-grep            0.31  5.67  0.85
^\s*(void|unsigned|int)\s+func\s+\(  taskset 1 git-grep  0.28  4.10  4.38
^\s*(void|unsigned|int)\s+func\s+\(  jrep                0.25  0.12  0.35
^\s*(void|unsigned|int)\s+func\s+\(  grep-2.17           0.18  1.15  1.33
So out of the box it is really good. That tells me I should be looking more at the multi-threading!
Posted Mar 6, 2014 10:58 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link] (3 responses)
I imagine node.js isn't the fastest in town for complicated regexes, but it'd be interesting to throw into the mix to show where backtracking gets you (and possibly constructing some regexes better written with backtracking and comparing them against de-backtracked regexes (e.g., "(foofoo|barbar)") for the other tools) in terms of cost/benefit.
I'd also imagine git grep probably does much better with a very dirty tree (e.g., with build artifacts) than other tools, but it also has knowledge the others don't.
Posted Mar 6, 2014 11:32 UTC (Thu)
by tao (subscriber, #17563)
[Link] (2 responses)
Posted Mar 6, 2014 11:53 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link]
Posted Mar 9, 2014 15:07 UTC (Sun)
by mgedmin (subscriber, #34497)
[Link]
GNU id-utils might help there (like ctags it builds an index and then queries it). Or cscope, which AFAIU is similar.
Personally, once I got an SSD, I'm reasonably happy with all the tools (especially git grep) and haven't found the need to go back to id-utils.
Posted Apr 25, 2014 8:37 UTC (Fri)
by arames (guest, #88205)
[Link] (1 responses)
Posted Apr 26, 2014 22:52 UTC (Sat)
by nix (subscriber, #2304)
[Link]
Posted Mar 6, 2014 11:06 UTC (Thu)
by jnareb (subscriber, #46500)
[Link] (4 responses)

How does jrep compare with RE2?
Posted Mar 6, 2014 11:17 UTC (Thu)
by arames (guest, #88205)
[Link] (3 responses)

See here for benchmarks with other engines.
Posted Mar 6, 2014 12:03 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link] (2 responses)
Posted Mar 6, 2014 13:08 UTC (Thu)
by arames (guest, #88205)
[Link] (1 responses)
Posted Mar 6, 2014 11:10 UTC (Thu)
by jezuch (subscriber, #52988)
[Link] (1 responses)
Also: if we're talking about performance, what about grep -i and its counterpart in jrep? :) (Yes, I'm alluding to the recent announcement that its performance in grep was improved tenfold in some cases.)
Posted Mar 6, 2014 11:24 UTC (Thu)
by arames (guest, #88205)
[Link]
Rejit does not support case-insensitive matching. This is on the TODO list. But I suspect that SIMD code will turn out to be very helpful for this.
Posted Mar 8, 2014 9:39 UTC (Sat)
by mb (subscriber, #50428)
[Link] (2 responses)
Posted Mar 8, 2014 22:36 UTC (Sat)
by arames (guest, #88205)
[Link] (1 responses)

There are alternative possibilities though. Or more work could go on the 'jrep' tool to make it compliant with UNIX grep. Or GNU grep could also be augmented with some small JIT. Or maybe that should be reconsidered when rejit has become super mature and featured.
Posted Mar 9, 2014 0:43 UTC (Sun)
by nix (subscriber, #2304)
[Link]
Posted Mar 9, 2014 14:02 UTC (Sun)
by andersg (guest, #25522)
[Link]
Posted Jan 10, 2015 15:14 UTC (Sat)
by evandrix (guest, #84408)
[Link] (1 responses)
Posted Jan 10, 2015 15:16 UTC (Sat)
by evandrix (guest, #84408)
[Link]