Google's RE2 regular expression library
Posted Mar 15, 2010 2:45 UTC (Mon) by tkil
In reply to: Google's RE2 regular expression library
Parent article: Google's RE2 regular expression library
It's easy enough if you convert internally while reading the data into
UCS-2 or 4.
The main problem is that UTF encodings are variable length, so you never
really know where the characters are until you go there and look.
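
To make the variable-length point concrete, here is a small Python sketch. The sample string and the helper name `utf8_char_offsets` are made up for illustration. It walks a string and records the byte offset at which each character would start in UTF-8, which is exactly the information you cannot get without scanning:

```python
def utf8_char_offsets(text):
    """Return (char, byte_offset, byte_length) for each code point in text."""
    offsets = []
    pos = 0
    for ch in text:
        width = len(ch.encode("utf-8"))  # 1 to 4 bytes per code point
        offsets.append((ch, pos, width))
        pos += width
    return offsets


# 'a', e-acute, euro sign, musical G clef: 1, 2, 3 and 4 bytes respectively
sample = "a\u00e9\u20ac\U0001d11e"
for ch, pos, width in utf8_char_offsets(sample):
    print(f"U+{ord(ch):04X} starts at byte {pos}, {width} byte(s) long")
```

Four characters occupy ten bytes here, and the offset of the Nth character depends on every character before it.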
I believe that all you really need is:
- canonicalize both the regex and the text you intend to match it against
- have a regex engine that understands characters, not bytes
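
A rough sketch of those two points in Python, whose `re` module already matches on code points rather than bytes. The helper `canonical_search` is a hypothetical name, and NFC is just one possible choice of canonical form. Note the simplification: normalizing a pattern as if it were plain text is only safe for literal patterns, since composing characters inside something like a bracket expression would change what it means.

```python
import re
import unicodedata


def canonical_search(pattern, text, form="NFC"):
    """Normalize pattern and text to the same form, then search."""
    norm_pattern = unicodedata.normalize(form, pattern)
    norm_text = unicodedata.normalize(form, text)
    return re.search(norm_pattern, norm_text)


pattern = "caf\u00e9"         # precomposed U+00E9
text = "un cafe\u0301 noir"   # 'e' followed by combining acute U+0301

print(re.search(pattern, text))         # None: the code points differ
print(canonical_search(pattern, text))  # a match once both sides are NFC
```

Without the canonicalization step the two spellings of the same visible word never match, no matter how character-aware the engine is.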
Granted, going to UCS-4 solves the latter issue, but incurs a pretty
hefty memory cost. The canonicalization is required regardless.
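
For a sense of scale, this sketch compares the size of the same text in UTF-8 and in UTF-32 (which is what UCS-4 amounts to in practice). The function name `encoded_sizes` and the sample string are illustrative only:

```python
def encoded_sizes(text):
    """Return byte counts for the same text in UTF-8 and in UTF-32."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-32": len(text.encode("utf-32-le")),  # -le: no byte-order mark
    }


ascii_text = "hello, world"
print(encoded_sizes(ascii_text))  # {'utf-8': 12, 'utf-32': 48}
```

Mostly-ASCII data quadruples in size; text dominated by characters outside the ASCII range pays a smaller penalty.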
Unicode-aware regex engines will be slower than 8-bit regex engines, no matter what: all the
character classes are larger, implying longer time to read in from disk, and more
cache pressure when in use. Also, their size tends to remove the capability to
use simple table lookups like we could for 8-bit datasets, forcing the
engine to use fancier (and likely slower) techniques such as tries,
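
A sketch of the contrast being drawn, in Python. This is not how any particular engine does it: the names are invented, the Unicode side uses a binary search over sorted ranges as a simpler stand-in for a trie, and only three of the many decimal-digit ranges are listed.

```python
import bisect

# 8-bit case: one flag per possible byte value, indexed directly.
DIGIT_TABLE = [False] * 256
for b in range(ord("0"), ord("9") + 1):
    DIGIT_TABLE[b] = True


def is_digit_byte(b):
    return DIGIT_TABLE[b]  # a single array index


# Unicode case: sorted, non-overlapping (start, end) code point ranges.
# ASCII, Arabic-Indic and Devanagari digits only, for illustration.
DIGIT_RANGES = [(0x0030, 0x0039), (0x0660, 0x0669), (0x0966, 0x096F)]
RANGE_STARTS = [start for start, _ in DIGIT_RANGES]


def is_digit_codepoint(cp):
    # Find the last range starting at or before cp, then check its end.
    i = bisect.bisect_right(RANGE_STARTS, cp) - 1
    return i >= 0 and DIGIT_RANGES[i][0] <= cp <= DIGIT_RANGES[i][1]
```

The byte version is one memory access; the code point version needs a search whose cost grows with the number of ranges in the class.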