Google's RE2 regular expression library

Posted Mar 14, 2010 16:01 UTC (Sun) by zlynx (subscriber, #2285)
In reply to: Google's RE2 regular expression library by intgr
Parent article: Google's RE2 regular expression library

It's easy enough if you convert internally while reading the data into UCS-2 or 4.

The main problem is that UTF encodings are variable length, so you never really know where the characters are until you go there and look.
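To make the "go there and look" point concrete, here is a minimal Python sketch (not taken from RE2 or any particular engine). The helper name `byte_offset_of_char` is made up for illustration. It finds where the N-th code point starts in a UTF-8 buffer by scanning from the beginning, whereas in a fixed-width encoding the offset is simple arithmetic.

```python
def byte_offset_of_char(data: bytes, index: int) -> int:
    """Return the byte offset where code point number `index` starts.

    Hypothetical helper for illustration; assumes `data` is valid UTF-8.
    """
    seen = 0
    for offset, byte in enumerate(data):
        # Bytes of the form 10xxxxxx are continuation bytes;
        # every other byte begins a new code point.
        if byte & 0xC0 != 0x80:
            if seen == index:
                return offset
            seen += 1
    raise IndexError("code point index out of range")


text = "naïve café"
utf8 = text.encode("utf-8")
utf32 = text.encode("utf-32-le")

# UTF-8: the offset of character 9 depends on what came before it,
# so we have to scan.
offset_utf8 = byte_offset_of_char(utf8, 9)

# UTF-32: every code point is 4 bytes, so the offset is just 4 * index.
offset_utf32 = 4 * 9
```

The scan is linear in the length of the prefix, which is the cost being discussed; the fixed-width buffer pays for its constant-time indexing with a larger footprint.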



Google's RE2 regular expression library

Posted Mar 15, 2010 2:45 UTC (Mon) by tkil (subscriber, #1787)

> It's easy enough if you convert internally while reading the data into UCS-2 or 4.
>
> The main problem is that UTF encodings are variable length, so you never really know where the characters are until you go there and look.

I believe that all you really need is:

  1. canonicalize both the regex and the text you intend to match it against; and
  2. have a regex engine that understands characters, not octets

Granted, going to UCS-4 solves the latter issue, but incurs a pretty hefty memory cost. The canonicalization is required regardless.
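As a rough illustration of the canonicalization step, here is a Python sketch using the standard `unicodedata` and `re` modules. The helper `normalized_search` is a hypothetical name, and it assumes the pattern is a plain literal: normalizing a pattern that contains character classes or other metacharacters could change its meaning, so a real engine would need to do this more carefully.

```python
import re
import unicodedata


def normalized_search(pattern: str, text: str, form: str = "NFC"):
    """Normalize a literal pattern and the text to the same form, then match.

    Hypothetical helper; only safe for patterns without regex metacharacters.
    """
    return re.search(unicodedata.normalize(form, pattern),
                     unicodedata.normalize(form, text))


precomposed = "caf\u00e9"   # "é" as a single code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent

# Both strings render identically, but they differ code point by code point,
# so a naive match fails.
naive = re.search(precomposed, decomposed)

# After normalizing both sides to the same form, the match succeeds.
canonical = normalized_search(precomposed, decomposed)
```

Either NFC or NFD would work here, as long as the pattern and the text are put into the same form.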

Unicode-compliant regex engines will be slower than 8-bit regex engines, no matter what: all the character classes are larger, implying longer time to read in from disk and more cache misses when in use. Their size also tends to rule out the simple table lookups we could use for 8-bit data, forcing the engine to use fancier (and likely slower) techniques such as tries.
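A small Python sketch of the difference, under stated assumptions: the 8-bit class is a 256-entry table indexed directly by the byte, while the Unicode class is stored as sorted code point ranges and searched with binary search. Binary search over ranges is swapped in here as a simpler stand-in for the tries mentioned above, and `DIGIT_RANGES` is a deliberately incomplete, illustrative list of decimal-digit ranges, not a full Unicode property table.

```python
from bisect import bisect_right

# 8-bit case: one flag per possible byte value, indexed directly.
ASCII_DIGIT_TABLE = [ord("0") <= b <= ord("9") for b in range(256)]


def is_digit_8bit(byte: int) -> bool:
    return ASCII_DIGIT_TABLE[byte]


# Unicode case: a table with one entry per code point would need over a
# million entries, so store sorted, non-overlapping ranges instead.
# Illustrative subset only (ASCII, Arabic-Indic, Devanagari, fullwidth).
DIGIT_RANGES = [
    (0x0030, 0x0039),
    (0x0660, 0x0669),
    (0x0966, 0x096F),
    (0xFF10, 0xFF19),
]
RANGE_STARTS = [lo for lo, _ in DIGIT_RANGES]


def is_digit_unicode(code_point: int) -> bool:
    # Find the last range that starts at or before the code point,
    # then check whether the code point falls inside it.
    i = bisect_right(RANGE_STARTS, code_point) - 1
    return i >= 0 and code_point <= DIGIT_RANGES[i][1]
```

The 8-bit lookup is a single indexed load; the range search costs a logarithmic number of comparisons and touches more memory as the class grows.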

Google's RE2 regular expression library

Posted Mar 15, 2010 6:34 UTC (Mon) by dlang (✭ supporter ✭, #313)

if you are going to be doing a lot of character-position-based string manipulation _and_ expect to be dealing with a lot of non-ASCII data, then it's worth converting the strings when you read them in.

on the other hand, many programs copy strings around a lot, but don't actually manipulate them much.

and for many things, the data being used really is almost entirely ASCII.

in these cases it is far better to leave things in UTF-8 variable length encoding and just walk the string when needed.

in the case of regex matching, if you are going to start at the beginning of the string and walk through it looking for matches, then you may as well leave it in UTF-8: you aren't doing anything that would benefit from knowing ahead of time where a particular character position starts, and the fact that the string is almost always going to be smaller is a win.
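a minimal Python sketch of what "just walk the string" can look like: a generator that decodes code points one after another from a UTF-8 buffer, which is all a left-to-right matcher needs. the name `iter_code_points` is hypothetical, and the sketch assumes the input is already valid UTF-8 (it does not check continuation bytes, overlong forms, or truncation).

```python
def iter_code_points(data: bytes):
    """Yield (byte_offset, code_point) pairs, walking UTF-8 from the start.

    Hypothetical helper; assumes `data` is valid UTF-8.
    """
    i = 0
    while i < len(data):
        lead = data[i]
        # The lead byte tells us how many bytes this code point occupies.
        if lead < 0x80:
            length, cp = 1, lead
        elif lead >> 5 == 0b110:
            length, cp = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:
            length, cp = 3, lead & 0x0F
        elif lead >> 3 == 0b11110:
            length, cp = 4, lead & 0x07
        else:
            raise ValueError(f"unexpected byte 0x{lead:02x} at offset {i}")
        # Each continuation byte contributes its low 6 bits.
        for cont in data[i + 1:i + length]:
            cp = (cp << 6) | (cont & 0x3F)
        yield i, cp
        i += length


decoded = list(iter_code_points("aé€".encode("utf-8")))

# Size comparison for mostly-ASCII text.
sample = "log line: café"
size_utf8 = len(sample.encode("utf-8"))
size_utf32 = len(sample.encode("utf-32-le"))
```

for the mostly-ASCII sample, the UTF-8 buffer is a little over a quarter of the size of the UTF-32 one, which is the size win mentioned above.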
