<?xml version="1.0" encoding="UTF-8"?>

<rdf:RDF 
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://purl.org/rss/1.0/"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:syn="http://purl.org/rss/1.0/modules/syndication/"
>

  <channel rdf:about="http://lwn.net/headlines/378384/">
    <title>LWN: Comments on "Google's RE2 regular expression library"</title>
    <link>http://lwn.net/Articles/378384/</link>
    <description>
This is a special feed containing comments posted
to the individual LWN article titled &quot;Google's RE2 regular expression library&quot;.

    </description>

    <syn:updatePeriod>hourly</syn:updatePeriod>
    <syn:updateFrequency>2</syn:updateFrequency>
    <items>
      <rdf:Seq>
	<rdf:li resource="http://lwn.net/Articles/379616/rss" />
	<rdf:li resource="http://lwn.net/Articles/379499/rss" />
	<rdf:li resource="http://lwn.net/Articles/379131/rss" />
	<rdf:li resource="http://lwn.net/Articles/378789/rss" />
	<rdf:li resource="http://lwn.net/Articles/378625/rss" />
	<rdf:li resource="http://lwn.net/Articles/378580/rss" />
	<rdf:li resource="http://lwn.net/Articles/378574/rss" />
	<rdf:li resource="http://lwn.net/Articles/378573/rss" />
	<rdf:li resource="http://lwn.net/Articles/378572/rss" />
	<rdf:li resource="http://lwn.net/Articles/378562/rss" />
	<rdf:li resource="http://lwn.net/Articles/378553/rss" />
	<rdf:li resource="http://lwn.net/Articles/378534/rss" />
	<rdf:li resource="http://lwn.net/Articles/378533/rss" />
	<rdf:li resource="http://lwn.net/Articles/378532/rss" />
	<rdf:li resource="http://lwn.net/Articles/378513/rss" />
	<rdf:li resource="http://lwn.net/Articles/378510/rss" />
	<rdf:li resource="http://lwn.net/Articles/378501/rss" />
	<rdf:li resource="http://lwn.net/Articles/378492/rss" />
	<rdf:li resource="http://lwn.net/Articles/378488/rss" />
	<rdf:li resource="http://lwn.net/Articles/378481/rss" />
	<rdf:li resource="http://lwn.net/Articles/378476/rss" />
	<rdf:li resource="http://lwn.net/Articles/378457/rss" />
	<rdf:li resource="http://lwn.net/Articles/378455/rss" />
	<rdf:li resource="http://lwn.net/Articles/378442/rss" />
	<rdf:li resource="http://lwn.net/Articles/378440/rss" />
	<rdf:li resource="http://lwn.net/Articles/378435/rss" />
	<rdf:li resource="http://lwn.net/Articles/378428/rss" />
	<rdf:li resource="http://lwn.net/Articles/378423/rss" />
	<rdf:li resource="http://lwn.net/Articles/378419/rss" />
	<rdf:li resource="http://lwn.net/Articles/378401/rss" />
	<rdf:li resource="http://lwn.net/Articles/378404/rss" />
	<rdf:li resource="http://lwn.net/Articles/378398/rss" />
      
      </rdf:Seq>
    </items>

  </channel>
    <item rdf:about="http://lwn.net/Articles/379616/rss">
      <title>Google's RE2 regular expression library</title>
      <link>http://lwn.net/Articles/379616/rss</link>
      <dc:date>2010-03-20T00:38:34+00:00</dc:date>
      <dc:creator>rsc</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;
&lt;font class=&quot;QuotedText&quot;&gt;&amp;gt; I want to see a regular expression engine that doesn't have a&lt;/font&gt;&lt;br&gt;
&lt;font class=&quot;QuotedText&quot;&gt;&amp;gt; 5x slowdown when used with a UTF-8 locale.&lt;/font&gt;&lt;br&gt;
&lt;p&gt;
You're looking at it.  &lt;a rel=&quot;nofollow&quot; href=&quot;http://code.google.com/p/re2/&quot;&gt;http://code.google.com/p/re2/&lt;/a&gt;&lt;br&gt;
&lt;p&gt;
&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/379499/rss">
      <title>GNU Grep has its bad days</title>
      <link>http://lwn.net/Articles/379499/rss</link>
      <dc:date>2010-03-19T12:16:08+00:00</dc:date>
      <dc:creator>walles</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;
My experience isn't that grep is &quot;pretty fast&quot;; here's a bug report where both Ruby and Perl runs circles around Grep:&lt;br&gt;
&lt;a href=&quot;http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=445215&quot;&gt;http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=445215&lt;/a&gt;&lt;br&gt;
&lt;p&gt;
This performance issue is one that I'd really like to see fixed, so if somebody would step up to fix it that would be great!&lt;br&gt;
&lt;p&gt;
&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/379131/rss">
      <title>Google's RE2 regular expression library</title>
      <link>http://lwn.net/Articles/379131/rss</link>
      <dc:date>2010-03-17T21:39:02+00:00</dc:date>
      <dc:creator>dthurston</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;
I think it's more that they're giving the standard, fast performance of a proper RegExp engine to almost all of the various convenient extensions that Perl, Python, etc. have.  So it's an advance in speed over Perl et al, and an advance in convenience over awk, etc.&lt;br&gt;
&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/378789/rss">
      <title>Google's RE2 regular expression library</title>
      <link>http://lwn.net/Articles/378789/rss</link>
      <dc:date>2010-03-16T16:21:09+00:00</dc:date>
      <dc:creator>wahern</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;
Ragel, which is amazing. I use it all over the place. I've written C parsers for HTTP, RTSP, SMTP, SDP, MySQL wire protocol, and many more w/ the help of Ragel.&lt;br&gt;
&lt;p&gt;
&lt;a href=&quot;http://www.complang.org/ragel/&quot;&gt;http://www.complang.org/ragel/&lt;/a&gt;&lt;br&gt;
&lt;p&gt;
&lt;p&gt;
&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/378625/rss">
      <title>Google's RE2 regular expression library</title>
      <link>http://lwn.net/Articles/378625/rss</link>
      <dc:date>2010-03-15T17:31:21+00:00</dc:date>
      <dc:creator>droundy</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;
I guess I saw the articles more as a denouncement of the poor &lt;br&gt;
implementations in common use than the claim of a drastically new and better &lt;br&gt;
implementations, and the choice of a pathological regular expression made &lt;br&gt;
sense as such, since the point is that such a beast exists with backtracking &lt;br&gt;
implementations, and doesn't exist with good implementations.&lt;br&gt;
&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/378580/rss">
      <title>Google's RE2 regular expression library</title>
      <link>http://lwn.net/Articles/378580/rss</link>
      <dc:date>2010-03-15T09:47:19+00:00</dc:date>
      <dc:creator>PO8</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;
I guess my point was just that this work seems to be presented as a significant improvement on the state of the art in RE-matching performance.  As far as I can tell it's only an improvement over RE implementations that have never been held up as the state of the art for performance.  Further, its performance seems to have been measured mostly on a particularly pathological microbenchmark.&lt;br&gt;
&lt;p&gt;
Just grabbed the source code and stared at it for a bit.  It's big; 14K lines of C++ code, really well commented.  It does do lazy DFA compilation with a state cache, which is nice.  Couldn't find any evidence of Boyer-Moore, though.  This is really a major omission if performance is an issue.&lt;br&gt;
&lt;p&gt;
Don't really have time right now to do more benchmarking of this code.  There's some benchmarks there already, but I haven't figured out how to play with them yet.  More if and when I get some time.&lt;br&gt;
&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/378574/rss">
      <title>Google's RE2 regular expression library</title>
      <link>http://lwn.net/Articles/378574/rss</link>
      <dc:date>2010-03-15T06:34:17+00:00</dc:date>
      <dc:creator>dlang</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;
if you are going to be doing a lot of string character position based manipulation _and_ expect to be dealing with a lot of non-ASCII data then it's worth converting the strings when you read them in.&lt;br&gt;
&lt;p&gt;
on the other hand, many programs copy strings around a lot, but don't actually manipulate them much.&lt;br&gt;
&lt;p&gt;
and for many things, the data being used really is almost entirely ASCII.&lt;br&gt;
&lt;p&gt;
in these cases it is far better to leave things in UTF-8 variable length encoding and just walk the string when needed.&lt;br&gt;
&lt;p&gt;
in the case of regex matching, if you are going to start at the beginning of the string and walk though it looking for matches, then you may as well leave it in UTF-8, you aren't doing anything that would benifit from knowing ahead of time where a particular character position starts, and the fact that the string is almost always going to be smaller is a win.&lt;br&gt;
&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/378573/rss">
      <title>Google's RE2 regular expression library</title>
      <link>http://lwn.net/Articles/378573/rss</link>
      <dc:date>2010-03-15T02:45:25+00:00</dc:date>
      <dc:creator>tkil</dc:creator>
      <description>
      &lt;blockquote&gt;
&lt;p&gt;It's easy enough if you convert internally while reading the data into 
UCS-2 or 4.&lt;/p&gt;
&lt;p&gt;The main problem is that UTF encodings are variable length, so you never 
really know where the characters are until you go there and look.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I believe that all you really need is:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;canonicalize both the regex and the text you intend to match it 
against; and&lt;/li&gt;
&lt;li&gt;have a regex engine that understands &lt;em&gt;characters&lt;/em&gt;, not 
&lt;em&gt;octets&lt;/em&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Granted, going to UCS-4 solves the latter issue, but incurs a pretty 
hefty memory cost.  The canonicalization is required regardless.&lt;/p&gt;

&lt;p&gt;Unicode-compliant regex 
engines will be slower than 8-bit regex engines, no matter what: all the 
character 
classes are larger, implying longer time to read in from disk, and more 
cache hit when in use.  Also, their size tends to remove the capability to 
use simple table lookups like we could for 8-bit datasets, forcing the 
engine to use fancier (and likely slower) techniques such as tries, 
etc.&lt;/p&gt;
      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/378572/rss">
      <title>Google's RE2 regular expression library</title>
      <link>http://lwn.net/Articles/378572/rss</link>
      <dc:date>2010-03-15T02:37:40+00:00</dc:date>
      <dc:creator>tkil</dc:creator>
      <description>
      &lt;blockquote&gt;
&lt;p&gt;What's so special about a regex, then?&lt;/p&gt;
&lt;p&gt;If the VM has the code half-compiled, there are probably many other 
optimizations to do under the cover.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's pretty much the reason: it's a tradeoff between upfront regex 
parsing / compilation / optimizing, and the ongoing cost of matching 
against that regular expression.&lt;/p&gt;

&lt;p&gt;If you are only going to match against a regex once or twice, then 
there's no point in spending a huge amount of time optimizing that regex 
(and certainly not in doing it every time it's seen!).  If, on the other 
hand, you are going to match many things against that regex, it makes sense 
to spend more time up front so that each match is a bit faster.&lt;/p&gt;

&lt;p&gt;It's a continuum: on the one end, the regex engine can just parse the 
regex text as it goes, and do no compilation or optimization whatsoever 
(this would be analogous to a true interpreter).  As the next step, the 
interpreter could process the regex into some internal representation, but 
not do any optimizations.  A step further, that internal representation 
could be optimized in various ways.  Finally, the optimized internal 
representation could be compiled down to machine code (with possible 
further optimizations applied).  Each of those steps takes longer up front, 
though, so you have to amortize the up-front regex processing time across 
the number of matches.&lt;/p&gt;
      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/378562/rss">
      <title>Google's RE2 regular expression library</title>
      <link>http://lwn.net/Articles/378562/rss</link>
      <dc:date>2010-03-14T19:38:44+00:00</dc:date>
      <dc:creator>droundy</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;
Somehow it seems a bit odd to argue that comparison with perl, python and &lt;br&gt;
ruby and pcre is irrelevant.  True, they are slow, but regexp matching is &lt;br&gt;
supposed to be one of their strengths (esp of perl).  And most people think &lt;br&gt;
of them as being only maybe 10 or 1000 times slower than C, precisely when &lt;br&gt;
the computation is dominated by library calls such as regular expression &lt;br&gt;
matching, not 1e6 times slower than C.  Of course, the point is that the &lt;br&gt;
algorithm is at issue, not the language.&lt;br&gt;
&lt;p&gt;
Admittedly, it seems like pretty weird regular expressions are needed in &lt;br&gt;
order to trigger this exponential behavior.  But then again, taking a minute &lt;br&gt;
to match a 29 character string with a 29 character regular expression does &lt;br&gt;
seem pretty excessive...&lt;br&gt;
&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/378553/rss">
      <title>Google's RE2 regular expression library</title>
      <link>http://lwn.net/Articles/378553/rss</link>
      <dc:date>2010-03-14T16:01:54+00:00</dc:date>
      <dc:creator>zlynx</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;
It's easy enough if you convert internally while reading the data into UCS-2 or 4.&lt;br&gt;
&lt;p&gt;
The main problem is that UTF encodings are variable length, so you never really know where the characters are until you go there and look.&lt;br&gt;
&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/378534/rss">
      <title>Full regular expressions?</title>
      <link>http://lwn.net/Articles/378534/rss</link>
      <dc:date>2010-03-14T10:50:04+00:00</dc:date>
      <dc:creator>ballombe</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;
&lt;font class=&quot;QuotedText&quot;&gt;&amp;gt; Don't some of the back-referencing extensions actually take it from regular deep into LR(1) languages?&lt;/font&gt;&lt;br&gt;
&lt;p&gt;
This is much much worse. Languages defined with general back-references  are undecidable, and even the simplest kind of back reference can cause languages not to be algebraic, for example ([ab]*)\1  and (a*)(b*)\1\2 are not.&lt;br&gt;
&lt;p&gt;
Now, all true regular expressions can be matched in linear time and bounded memory. However both the memory bound and the setup time (before starting the match) are in the worse case exponential in the size of the expression.&lt;br&gt;
&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/378533/rss">
      <title>Google's RE2 regular expression library</title>
      <link>http://lwn.net/Articles/378533/rss</link>
      <dc:date>2010-03-14T10:05:18+00:00</dc:date>
      <dc:creator>tzafrir</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;
So this is yet another part of Project UNG (UNG is Not GNU)?&lt;br&gt;
&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/378532/rss">
      <title>Google's RE2 regular expression library</title>
      <link>http://lwn.net/Articles/378532/rss</link>
      <dc:date>2010-03-14T10:02:44+00:00</dc:date>
      <dc:creator>tzafrir</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;
What's so special about a regex, then?&lt;br&gt;
&lt;p&gt;
If the VM has the code half-compiled, there are probably many other optimizations to do under the cover.&lt;br&gt;
&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/378513/rss">
      <title>Google's RE2 regular expression library</title>
      <link>http://lwn.net/Articles/378513/rss</link>
      <dc:date>2010-03-13T19:24:54+00:00</dc:date>
      <dc:creator>tkil</dc:creator>
      <description>
      &lt;blockquote&gt;Something like perl's &lt;code&gt;qr//&lt;/code&gt; operator?&lt;/blockquote&gt;

&lt;p&gt;
Not really; &lt;code&gt;qr//&lt;/code&gt; simply lets the programmer control when the 
double-quotish regex string is turned into a regex object.  It's more like 
generating an opcode tree for the regex, as opposed to compiling it down to 
bytecode or machine instructions.
&lt;/p&gt;

&lt;p&gt;
This is still a win, since the regex object creation step is often the most 
expensive part of using a regex: it involves interpolation, compilation to 
opcodes, and various optimizations.  Doing that only once vs. every time 
through a loop is a huge help.
&lt;/p&gt;

&lt;p&gt;
Being able to compile down to machine instruction (with all the 
optimizations along the way) is just going further along the scale of &quot;more 
effort up front&quot; vs. &quot;better speed on each match&quot;.  When you're working 
with long-lived C++ apps and the regexes are static, it makes perfect sense 
to try to compile them down as far as possible; when working with short-
lived dynamic scripts, it isn't always a win.
&lt;/p&gt;
      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/378510/rss">
      <title>Google's RE2 regular expression library</title>
      <link>http://lwn.net/Articles/378510/rss</link>
      <dc:date>2010-03-13T17:10:37+00:00</dc:date>
      <dc:creator>andikleen</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;
It already happened multiple times of course.&lt;br&gt;
&lt;p&gt;
You don't really need all the infrastructure in a full compiler like LLVM&lt;br&gt;
for it either, regular expressions are rather simple.&lt;br&gt;
&lt;p&gt;
One old style version (non JIT) is re2c that simply compiles regexprs to C.&lt;br&gt;
Then the V8 java script engine in google chrome which uses an own compiler for their JS regexprs. Undoubtedly there are others too. e.g. in a language with built in compiler like Common Lisp it's rather easy to do.&lt;br&gt;
&lt;p&gt;
&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/378501/rss">
      <title>Google's RE2 regular expression library</title>
      <link>http://lwn.net/Articles/378501/rss</link>
      <dc:date>2010-03-13T14:38:41+00:00</dc:date>
      <dc:creator>intgr</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;
Meh. I want to see a regular expression engine that doesn't have a 5x slowdown when used with a UTF-8 locale.&lt;br&gt;
&lt;p&gt;
&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/378492/rss">
      <title>Google's RE2 regular expression library</title>
      <link>http://lwn.net/Articles/378492/rss</link>
      <dc:date>2010-03-13T10:23:10+00:00</dc:date>
      <dc:creator>PO8</dc:creator>
      <description>
      &lt;p&gt;This all seems a little weird to me.  The things chosen as reference points are mostly strange; most were never particularly touted as examples of speed.  GNU grep is pretty fast, though; indeed, if you look at their own &lt;a href=&quot;http://pdos.csail.mit.edu/~rsc/regexp-img/grep1p.png&quot;&gt;graph&lt;/a&gt; you might wonder why they cut it off where they did&amp;mdash;grep certainly looks like it's almost constant-time in this example, and about to cross the y axis.  Anyhow, the RE and target given in that second paper are a lousy benchmark for normal use; they are only intended to illustrate the thesis that backtracking loses on some REs and targets.  In short, I'm not sure why backtracking matchers are any more than a strawman when building a fast modern RE engine.  The benchmarking in the first paper is only against PCRE.&lt;/p&gt;

&lt;p&gt;It doesn't appear that any kind of Boyer-Moore is being used; this can be a major performance penalty for fixed strings.  If I recall correctly, one can do Boyer-Moore-like things with general REs these days also.  This is a performance no-brainer as far as I know.&lt;/p&gt;

&lt;p&gt;I'd be willing to wager a small amount of money that ripping Mike Haertel's RE engine out of GNU grep and clipping out the backreference support (which wouldn't be very hard) would give a much faster engine than re2, with an equally good memory footprint.  The problem with this approach for Google, I suspect, is the GPL.  Reimplementing a more sophisticated DFA-based RE engine doesn't seem that hard either, though.  Certainly one could pay Mike to do it on a contract basis.&lt;/p&gt;

&lt;p&gt;Here's some stories about Mike and grep and RE matching: &lt;a href=&quot;http://ridiculousfish.com/blog/archives/2006/05/30/old-age-and-treachery/&quot;&gt;(1)&lt;/a&gt; (and note the comments), &lt;a href=&quot;http://fob.po8.org/node/173&quot;&gt;(2)&lt;/a&gt;.&lt;/p&gt;


      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/378488/rss">
      <title>Full regular expressions?</title>
      <link>http://lwn.net/Articles/378488/rss</link>
      <dc:date>2010-03-13T04:29:50+00:00</dc:date>
      <dc:creator>greenfie</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;
RE2 will fail to compile certain regular expressions due to its memory &lt;br&gt;
limitations, but during matching (when it builds up its DFA) it has bounded &lt;br&gt;
memory. It'll just run slower on perverse regexes.&lt;br&gt;
&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/378481/rss">
      <title>Google's RE2 regular expression library</title>
      <link>http://lwn.net/Articles/378481/rss</link>
      <dc:date>2010-03-13T02:26:40+00:00</dc:date>
      <dc:creator>emk</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;
Doesn't .NET compile regular expressions using System.Reflection.Emit, &lt;br&gt;
which is basically Microsoft's closest approximation of LLVM?&lt;br&gt;
&lt;p&gt;
In any case, you might not need all of LLVMthat kind of simplified control &lt;br&gt;
flow can occasionally be handled quite nicely by specialized code generators.&lt;br&gt;
&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/378476/rss">
      <title>Google's RE2 regular expression library</title>
      <link>http://lwn.net/Articles/378476/rss</link>
      <dc:date>2010-03-12T22:57:53+00:00</dc:date>
      <dc:creator>tzafrir</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;
Something like perl's qr// operator?&lt;br&gt;
&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/378457/rss">
      <title>Google's RE2 regular expression library</title>
      <link>http://lwn.net/Articles/378457/rss</link>
      <dc:date>2010-03-12T19:17:49+00:00</dc:date>
      <dc:creator>ajross</dc:creator>
      <description>
      Mostly because constant factor execution speed (not the algorithmic stuff the 
linked article is talking about) doesn't matter for regexes.  Anything doing 
performance critical work on text data in the modern world is going to be I/O 
bound at some level (even if it's only DRAM latency).  Cycle counting doesn't 
help in that regime.
      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/378455/rss">
      <title>Full regular expressions?</title>
      <link>http://lwn.net/Articles/378455/rss</link>
      <dc:date>2010-03-12T19:13:13+00:00</dc:date>
      <dc:creator>Pc5Y9sbv</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;
Don't some of the back-referencing extensions actually take it from regular deep into LR(1) languages?&lt;br&gt;
&lt;p&gt;
If I remember way back from school, in practice, many human-comprehensible non-determinstic REs can be emulated in closer to linear time. Running a set of DFAs, growing on each non-deterministic branch point, doesn't grow as wildly as it could in theory, because so many of the DFAs are culled relatively soon on dead-end input transitions.  Is that no longer considered a valid heuristic strategy?&lt;br&gt;
&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/378442/rss">
      <title>Full regular expressions?</title>
      <link>http://lwn.net/Articles/378442/rss</link>
      <dc:date>2010-03-12T18:42:47+00:00</dc:date>
      <dc:creator>ballombe</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;
Well, they only say 'bounded stack space', not 'bounded heap space'.&lt;br&gt;
&lt;p&gt;
&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/378440/rss">
      <title>Google's RE2 regular expression library</title>
      <link>http://lwn.net/Articles/378440/rss</link>
      <dc:date>2010-03-12T18:34:50+00:00</dc:date>
      <dc:creator>robert_s</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;
You know, ever since LLVM appeared, I've been expecting someone to come along and write a regex engine that dynamically compiles its matchers to native code.&lt;br&gt;
&lt;p&gt;
Wonder why that's never happened.&lt;br&gt;
&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/378435/rss">
      <title>Full regular expressions?</title>
      <link>http://lwn.net/Articles/378435/rss</link>
      <dc:date>2010-03-12T18:18:05+00:00</dc:date>
      <dc:creator>vonbrand</dc:creator>
      <description>
      &lt;p&gt;
There are regular expressions that require an exponentially large deterministic automaton to recognize (alternatively, if they simulate a nondeterministic one directly, an exponentially large set of states to remember), so they'll cut it short somewhere.
      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/378428/rss">
      <title>It handles only regular expressions, while Perl regexps give more</title>
      <link>http://lwn.net/Articles/378428/rss</link>
      <dc:date>2010-03-12T17:51:20+00:00</dc:date>
      <dc:creator>rriggs</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;
Actually, I think this will be widely adopted precisely because it can always make those guarantees.  I would really like to see this as an algorithm selection option in the Boost Regex library.&lt;br&gt;
&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/378423/rss">
      <title>Try 'New BSD License'</title>
      <link>http://lwn.net/Articles/378423/rss</link>
      <dc:date>2010-03-12T17:15:49+00:00</dc:date>
      <dc:creator>southey</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;
The licence is given on the project homepage as 'New BSD License' and &lt;br&gt;
is linked it to the www.opensource.org for details.&lt;br&gt;
&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/378419/rss">
      <title>3-clause BSD</title>
      <link>http://lwn.net/Articles/378419/rss</link>
      <dc:date>2010-03-12T17:07:52+00:00</dc:date>
      <dc:creator>robla</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;
I just had this dilemma on a piece of software I released.  I wanted to say &lt;br&gt;
it was &quot;BSD&quot;, but the anal retentive side of me couldn't bring myself to &lt;br&gt;
call it &quot;BSD&quot; when neither the abbreviation &quot;BSD&quot; nor even the word &lt;br&gt;
&quot;Berkeley&quot; appear in the license.&lt;br&gt;
&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/378401/rss">
      <title>Google's RE2 regular expression library</title>
      <link>http://lwn.net/Articles/378401/rss</link>
      <dc:date>2010-03-12T16:19:55+00:00</dc:date>
      <dc:creator>HelloWorld</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;
PCRE's default algorithm does backtracking, but PCRE also offers an alternative algorithm (see man pcrematching) which doesn't. I wonder how it compares to this new implementation. &lt;br&gt;
&lt;/div&gt;

      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/378404/rss">
      <title>It handles only regular expressions, while Perl regexps give more</title>
      <link>http://lwn.net/Articles/378404/rss</link>
      <dc:date>2010-03-12T15:57:46+00:00</dc:date>
      <dc:creator>epa</dc:creator>
      <description>
      The project page notes that
&lt;blockquote&gt;Unlike most automata-based engines, RE2 implements almost all the common Perl and PCRE features and syntactic sugars. It also finds the leftmost-first match, the same match that Perl would, and can return submatch information. The one significant exception is that RE2 drops support for backreferences and generalized zero-width assertions, because they cannot be implemented efficiently.&lt;/blockquote&gt;This is quite correct.  Once you allow backreferences, you no longer have a regular language but some more powerful class of language, which can't make the same runtime guarantees.
&lt;p&gt;
However it would be nice if the library could have two modes: a safe, regular-expression-only mode, and an unsafe mode where it will also accept backreferences and other funky Perl-like features, at the cost of fewer guarantees on run time.  It might then be more widely adopted.
      
      </description>
    </item>
    <item rdf:about="http://lwn.net/Articles/378398/rss">
      <title>3-clause BSD</title>
      <link>http://lwn.net/Articles/378398/rss</link>
      <dc:date>2010-03-12T15:40:45+00:00</dc:date>
      <dc:creator>coriordan</dc:creator>
      <description>
      &lt;div class=&quot;FormattedComment&quot;&gt;
I was a bit worried by the &quot;ish&quot; regarding the licence, but it seems to be under a normal 3-clause BSD licence:&lt;br&gt;
&lt;p&gt;
&lt;a rel=&quot;nofollow&quot; href=&quot;http://code.google.com/p/re2/source/browse/LICENSE&quot;&gt;http://code.google.com/p/re2/source/browse/LICENSE&lt;/a&gt;&lt;br&gt;
&lt;/div&gt;

      
      </description>
    </item>
</rdf:RDF>

