
Development

A JIT for grepping: jrep and rejit

March 5, 2014

This article was contributed by Alexandre Rames

Jrep is a grep-like program powered by the rejit library. So it is "just another" grepping program, except that it is quite fast. Opportunities to improve regular expression matching performance are interesting both for the speed gain itself and for the technical aspects behind the scenes. This article introduces jrep and rejit by showing the time taken by jrep and GNU grep to recursively grep through the Linux kernel source for different regular expressions.

Note that jrep is a sample program implemented to showcase the rejit library. The rejit library is actually a Just-In-Time (JIT) compiler: at runtime it generates tailored machine code to match each regular expression. Both jrep and rejit are still in development, so jrep is far from being as fully featured as GNU grep (one of my favorite programs), and undoubtedly not as stable or as tested. Now that this is clear, let's look at what jrep and rejit can do.

Benchmark results

Benchmarks were run on an Ubuntu 13.10 machine, with 32GB of RAM, and a 12-core Intel i7-4930K CPU @ 3.40GHz (which supports SSE4.2). The engines are timed to grep through the Linux kernel 3.13.5 source, using a script (available in the rejit repository) generating commands like:

    $ time -p engine --recursive --with-filename --line-number regexp linux-3.13.5/ > /dev/null

Results show the best out of five runs for each engine and regular expression. Running multiple times ensures that the filesystem caches the files being processed. The engine versions used are:
    $ cd <rejit_repo> && git rev-parse HEAD
    7830c93befade62933bb068d3556a9256903a86e
    $ grep --version | head -n1
    grep (GNU grep) 2.17

[Simple regular expression graph]

The graphs presented show the time required by grep and jrep to search through the Linux kernel source for different regular expressions (regexps), indicated just above their associated graph bars in the extended regular expression syntax. The total time required by each engine is split between time spent in user and time spent in sys.

The user time is spent in user space, in our case mostly processing the files. The sys time is spent in the kernel. Jrep spends 75% of that time (measured for the regexp foobar) walking the file tree (using ftw) and mapping files to memory. Short of kernel or ftw improvements, this 75% of sys can be seen as incompressible time required to get the files ready for processing. The other 25% is spent elsewhere in the "matching" section of the program. Grep operates under the same constraint (using fts and manual buffer reading), but consistently shows less time spent in sys, so there are likely things for jrep to improve on this side.
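For reference, the user/sys split that time -p reports can also be read from inside a process via POSIX getrusage(); a minimal Python sketch (not jrep code):

```python
import resource

def cpu_times():
    """Return (user, sys) CPU seconds consumed by this process --
    the same split that time -p reports as user and sys."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_utime, usage.ru_stime
```

File-tree walking and memory-mapping calls show up mostly in the sys column; the matching loop itself accumulates in user.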

When looking for unsigned, jrep slows down significantly due to the high number of matches; some profiling is needed to figure out where the time is lost in this situation. For other strings it behaves well. The second regexp, éphémère, shows matching of non-ASCII characters. Rejit currently only partially supports UTF-8: non-ASCII characters are not yet supported in character classes (e.g. [çé0-9]).

Simple alternations of words show a similar trend:

[Alternation graph]

The third alternation allows for common-substring extraction: the three alternated words all contain the substring def. An engine can search for that substring and, once it is found, complete the match from there. Both grep and jrep do so (via different mechanisms that are out of scope for this article). Many engines do not perform such extraction, or only perform prefix and/or suffix extraction.
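The idea can be sketched in Python (the word list here is illustrative, and the mechanisms grep and rejit actually use differ): find a substring shared by all alternatives, scan for it, and complete the match around each hit.

```python
def common_substring_search(text, words):
    """Sketch of common-substring extraction: find the longest
    substring present in every alternated word, scan the text for it,
    then complete the match backward/forward around each occurrence."""
    # Longest substring of the shortest word that appears in every word.
    shortest = min(words, key=len)
    common = ""
    for i in range(len(shortest)):
        for j in range(len(shortest), i, -1):
            cand = shortest[i:j]
            if len(cand) > len(common) and all(cand in w for w in words):
                common = cand
    # Scan for the common substring and complete matches around it.
    hits = []
    pos = text.find(common)
    while pos != -1:
        for w in words:
            start = pos - w.find(common)
            if start >= 0 and text.startswith(w, start):
                hits.append((start, w))
        pos = text.find(common, pos + 1)
    return hits
```

Only positions where the common substring occurs are ever examined, so the expensive full-match work runs on a small fraction of the text.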

Both jrep and grep handle more complex regular expressions quite well:

[Complex regular expression graph]

Many engines would try matching the regexps from the beginning, which can be inefficient. As with the substring extraction above, both grep and jrep manage to look for the easier part of the regexp first and complete the match from there. For [a-z]{3,10}.*gnu.*[a-z]{3,10}, jrep generates code that first looks for gnu and, once it is found, goes backward and then forward to match the whole expression.
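A toy Python version of that strategy for this particular regexp (rejit's generated code works at the machine-code level; this only models the idea): locate the literal gnu, then complete the match backward and forward.

```python
import re

LETTER_RUN = re.compile(r'[a-z]{3,10}')

def line_matches(line):
    """Anchor-first matching of [a-z]{3,10}.*gnu.*[a-z]{3,10}: find
    the literal "gnu", then check backward for a letter run before it
    and forward for a letter run after it."""
    pos = line.find("gnu")
    while pos != -1:
        if LETTER_RUN.search(line[:pos]) and LETTER_RUN.search(line[pos + 3:]):
            return True
        pos = line.find("gnu", pos + 1)
    return False
```

When the literal is rare, almost all positions are rejected by a fast substring search instead of a full regexp attempt.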

More complicated alternations show an area where rejit could improve:

[Slow alternations graph]

The handling of more complex alternations is a known (relatively) weak point of jrep (more precisely, of rejit) and is in need of improvement. Grep uses a smart Boyer-Moore-style algorithm: to look for aaa|bbb|ccc at position p, it examines the character at p + 2; if that character is not a, b, or c, then no alternative can start at p, p + 1, or p + 2, so it can jump three characters ahead to p + 3 (and then look at the character at p + 5).
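A minimal Python model of that skip loop (grep's real implementation is more general and table-driven):

```python
def bm_alternation_search(text):
    """Boyer-Moore-flavoured scan for aaa|bbb|ccc: probe the character
    at p + 2; if it is not 'a', 'b', or 'c', no three-letter alternative
    can start at p, p + 1, or p + 2, so jump straight to p + 3."""
    words = ("aaa", "bbb", "ccc")
    p = 0
    while p + 3 <= len(text):
        if text[p + 2] not in "abc":
            p += 3                      # safe three-character jump
        elif any(text.startswith(w, p) for w in words):
            return p                    # match found at position p
        else:
            p += 1                      # promising probe, advance by one
    return -1
```

On text where the probe character rarely belongs to the alternatives, two out of every three positions are never touched at all.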

On the other hand, like for single strings, rejit handles alternations simply: it applies brute force. But it does so relatively efficiently, so the performance is still good. To search for aaa|bbb|ccc at some position p in the text, rejit performs operations like:

    loop:
      find 'aaa' at position p
      if found goto match
      find 'bbb' at position p
      if found goto match
      find 'ccc' at position p
      if found goto match
      increment position and goto loop
    match:

The complexity is proportional to the number of alternated expressions. Worse, when that number exceeds a threshold (i.e. when the compiler cannot allocate a register per alternated expression), rejit falls back to slow default code. This is what happens for the two regexps with eight or more alternated strings. The code generation should be fixed to allow an arbitrary number of alternated strings.
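The generated-code loop above translates almost directly into Python (a model of the behavior, not of the machine code); the inner loop over the words is what makes the cost proportional to the number of alternatives:

```python
def brute_force_alternation(text, words):
    """Model of rejit's brute-force alternation loop: try every
    alternated word at every position, O(len(text) * len(words))."""
    for p in range(len(text)):
        for w in words:                   # "find <word> at position p"
            if text.startswith(w, p):
                return p, w               # goto match
    return -1, None
```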

In this situation, the initial multi-threading support in jrep can help. Multi-threading is enabled with the -j (an alias for --jobs) option. Passing -jN makes jrep run with one thread listing the files to process into a queue, and N threads popping filenames from that queue and processing them separately. Access to stdout is protected by a lock to ensure the results are printed "by file" and not interleaved.
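That producer/consumer scheme can be sketched in Python (the names here are illustrative, not jrep's actual internals):

```python
import queue
import re
import threading

def parallel_grep(paths, pattern, jobs=4):
    """Sketch of jrep's -j scheme: a producer fills a queue with file
    names, `jobs` worker threads pop and search them, and a lock keeps
    each file's results contiguous on stdout."""
    regexp = re.compile(pattern)
    work = queue.Queue()
    print_lock = threading.Lock()

    def worker():
        while True:
            path = work.get()
            if path is None:            # sentinel: no more files
                return
            try:
                with open(path, errors="replace") as f:
                    hits = [(n, line) for n, line in enumerate(f, 1)
                            if regexp.search(line)]
            except OSError:
                hits = []
            if hits:
                with print_lock:        # print one file's output together
                    for n, line in hits:
                        print(f"{path}:{n}:{line}", end="")

    threads = [threading.Thread(target=worker) for _ in range(jobs)]
    for t in threads:
        t.start()
    for path in paths:                  # producer
        work.put(path)
    for _ in threads:
        work.put(None)
    for t in threads:
        t.join()
```

The lock serializes only the printing, so the matching work itself still runs in parallel.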

As the results show, multi-threading must be enabled carefully. Regular expressions that spend a lot of time matching in user space see improved performance, but those that spend little time matching may not benefit from the extra threads, or may even run slower. When multi-threading is used, only the real time (the time observed by the user) is reported, since sys and user are summed across all cores and would exceed the observed running time. None of the regexps in the previous graphs are conducive to multi-threading, except for the longer-running unsigned case. The multi-threading support is not mature yet, and again profiling is needed to see what is happening. Ideally, useless additional threads should stay idle and not hurt performance.

GNU grep does not provide a multi-threading option.

About rejit and jrep

Not counting the underlying library, jrep is only about 500 lines of code. After jrep parses the command-line options, the file tree is traversed; each file is mapped to a memory buffer and passed to rejit, which finds all the matches at once. If any matches are found, jrep prints the results and carries on to the next file.
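A rough Python model of that flow (jrep itself is C++ built on ftw and the rejit API; os.walk and Python's re stand in here):

```python
import mmap
import os
import re

def grep_tree(root, pattern):
    """Walk a file tree, map each file into memory, find all matches
    at once, and report (path, line number) pairs -- a model of
    jrep's traverse/mmap/match/print loop."""
    regexp = re.compile(pattern.encode())
    results = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f, \
                     mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
                    for m in regexp.finditer(buf):
                        # line number = newlines before the match + 1
                        line = buf[:m.start()].count(b"\n") + 1
                        results.append((path, line))
            except (OSError, ValueError):   # unreadable or empty file
                continue
    return results
```

Mapping the whole file lets the engine scan one contiguous buffer instead of managing read() chunks and matches that straddle buffer boundaries.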

Note again that rejit is a prototype library. Many things are still unsupported. Notably, it only supports the x86_64 architecture. That being said, it has a few interesting features.

Like grep, it is non-backtracking, using finite automata to represent regular expressions, an approach introduced into programming by Ken Thompson. For details see this excellent article by Russ Cox; in short, it does not use an algorithm with exponential complexity. So, on the machine I am writing this article on, to check whether the regexp (a?){25}a{25} matches the string aaaaaaaaaaaaaaaaaaaaaaaaa ("a" 25 times), rejit takes less than 0.001 seconds, while Google's v8 JavaScript engine (the backtracking engine used in Chrome and Android) takes about three seconds. With 26 "a"s, v8 takes about twice as long, twice as long again for 27 "a"s, and so on.
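The behavior is easy to reproduce with any backtracking engine; here Python's re module (also backtracking) stands in for v8, using the expanded form of the pattern:

```python
import re
import time

def backtracking_time(n):
    """Time Python's backtracking re engine on the pathological pattern
    a?a?...a?aaa...a (the expanded form of (a?){n}a{n}) against
    "a" * n. Each optional a? can consume a character or not, so up
    to 2**n combinations may be explored before the match succeeds."""
    pattern = re.compile("a?" * n + "a" * n)
    start = time.perf_counter()
    matched = pattern.match("a" * n) is not None
    return matched, time.perf_counter() - start
```

Roughly, backtracking_time(n + 1) takes twice as long as backtracking_time(n), while a non-backtracking engine such as rejit or GNU grep stays essentially flat.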

The specifications of the machine do not matter here. The point is that some regular expressions cannot be handled well by backtracking engines. In practice, the regular expressions causing this slow behaviour may not be very useful, but they are a real risk whenever the program does not control which regular expressions are run (e.g. a user-facing search that accepts regexps).

The common feature of back-references ((foo|bar)\1 matches foofoo and barbar) cannot be implemented without backtracking (matching with back-references is an NP-complete problem). However, as Cox notes in more detail, that does not justify choosing backtracking as the general algorithm: backtracking can be used locally to handle back-references.
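The example from the text, run through Python's (backtracking) re module:

```python
import re

# \1 must match whichever alternative the group actually matched --
# information a pure finite automaton cannot carry, so this part of
# the pattern needs backtracking (here, Python's engine provides it).
BACKREF = re.compile(r'(foo|bar)\1')

print(bool(BACKREF.fullmatch("foofoo")))   # True
print(bool(BACKREF.fullmatch("barbar")))   # True
print(bool(BACKREF.fullmatch("foobar")))   # False
```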

As seen in the benchmark results above, rejit is also (or at least can be) fast. Generating the code on the fly allows generating tailored code for every regular expression processed. In particular, the JIT allows detecting at runtime what CPU instructions are available and generating efficient code for the performance-critical sections. On this machine, rejit uses the Intel SSE4.2 extension. Except for those critical sections, the code generated is still rather poor and could benefit from a lot of optimization — but that is work for later.

The project was started as a JIT because I knew v8's regexp engine was a backtracking JIT, and I wanted to see if I could implement a non-backtracking one. Since I was unable to support all architectures at once, I had to first choose one architecture to support; I had quite a bit of experience with ARM, but not much with Intel, so x86_64 it was.

With hindsight, I believe the best solution would be to implement an architecture-independent engine (in C/C++) and then locally extend it with JIT code. While producing an engine that would work on all platforms, most of the current performance could be achieved by implementing just a few pieces of assembly; more assembly stubs could then be added to speed up normally slow-to-match regular expressions.

If the project finds the resources to do so, that plan would be ideal. Otherwise I will likely focus half of my time on fixing bugs, and half on implementing features that I find interesting. The next area will probably be sub-matches and back-references. Some missing features and ideas are on the project TODO list. The library and sample programs (including jrep) are available under the GPLv3 license, and help is welcome.

The home page contains more information, some benchmarks showing other situations where rejit performs well (e.g. DNA matching), and some documentation. There is also an article that introduces the mechanisms used in rejit. I would be glad to discuss more about the project on the mailing list or by email (see the website).

Comments (21 posted)

Brief items

Quotes of the week

In particular, many of us never knew – or are in the process of forgetting – how dependent we used to be on proprietary software.

We didn’t get here because we failed in our duty to protect a prelapsarian software commons, but because we succeeded in creating one. That is worth remembering.

Eric S. Raymond (hat tip to Paul Wise)

If we had a blue box with a madman inside, who we could persuade to bring our browsing history back from the future, implementing Bélády's algorithm might be possible. Failing a perfect solution we are reduced to making a judgement based on the information available to us.
Vincent Sanders

Comments (none posted)

Bash-4.3 available

Bash version 4.3 has been released, incorporating a number of important bugfixes and new features. Among the bugs fixed, the most important "is the reworking of signal handling to avoid running signal and trap handlers in a signal handler context. This led to issues with glibc, which uses internal locks extensively and handles longjmps from user code very poorly." Among the most noteworthy new features are the "globasciiranges" option, "which forces the pattern matching code to treat [a-z] as if in the C locale," improvements to the `direxpand' option introduced in Bash 4.2, and support for negative subscripts when assigning and referencing indexed array elements.

Full Story (comments: none)

Buildroot 2014.02 released

Release 2014.02 of the Buildroot cross-compilation tool is available. This release cleans up a number of environment variable names, adds support for external packages, and adds the support infrastructure necessary for Python and Luarocks packages.

Full Story (comments: none)

GNU autoconf archive 2014.02.28 available

Version 2014.02.28 of the GNU autoconf archive has been released. Many new macros have been added, in addition to new options for existing macros like AX_PERL_EXT, AX_EXT, and AX_LUA.

Full Story (comments: none)

PulseAudio 5.0 available

Version 5.0 of the PulseAudio sound server has been released. The release notes highlight a number of new features, including a new implementation of tunnel modules and support for BlueZ 5.0's Advanced Audio Distribution Profile (A2DP). Notably, though, the A2DP support comes at a price: "The BlueZ project also decided to drop support for the HSP and HFP profiles, which were the profiles responsible for handling telephony audio. If you have a headset, its microphone won't work with BlueZ 5, because the microphone is only supported by the HSP and HFP profiles."

Full Story (comments: none)

Krita 2.8.0 released

Version 2.8.0 of the Krita painting application is out. New features include improved tablet support, high-quality scaling, integration with the "Gemini" sketch application, a new wrap-around mode, and much more.

Comments (none posted)

Newsletters and articles

Development newsletters from the past week

Comments (none posted)

What is good video editing software on Linux? (Xmodulo)

Xmodulo presents a brief overview of ten video editing applications available for Linux. "I will not cover subjective merits such as usability or interface design, but instead highlight notable features of each video editor."

Comments (18 posted)

Page editor: Nathan Willis


Copyright © 2014, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds