Google's RE2 regular expression library
Posted Mar 15, 2010 2:45 UTC (Mon) by tkil (subscriber, #1787)
In reply to: Google's RE2 regular expression library by zlynx
Parent article: Google's RE2 regular expression library
It's easy enough if you convert internally while reading the data into
UCS-2 or 4.
The main problem is that UTF encodings are variable length, so you never
really know where the characters are until you go there and look.
I believe that all you really need is:
- canonicalize both the regex and the text you intend to match it against; and
- have a regex engine that understands characters, not octets.
Granted, going to UCS-4 solves the latter issue, but at a pretty hefty
memory cost: four bytes per character, even for plain ASCII text. The
canonicalization is required regardless.
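To make the two requirements concrete, here is a minimal sketch in Python. It uses Python's `re` and `unicodedata` modules purely for illustration (not RE2), and the helper name `nfc` is made up for this example. Normalizing the whole pattern string as shown is a simplifying assumption; a real engine would have to take more care with character classes and escapes.

```python
import re
import unicodedata

def nfc(s: str) -> str:
    """Canonicalize to NFC so canonically equivalent sequences compare equal.
    (Hypothetical helper for this sketch.)"""
    return unicodedata.normalize("NFC", s)

precomposed = "caf\u00e9"   # 'é' as a single code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

pattern = "caf\u00e9$"

# Without canonicalization, two spellings of the same text disagree.
raw_match = re.search(pattern, decomposed) is not None

# Canonicalize both sides, then match on characters (str), not octets.
canon_match = re.search(nfc(pattern), nfc(decomposed)) is not None

# On octets, '.' consumes one byte: only the first half of the UTF-8 'é'.
octet_dot = re.match(b"caf.", precomposed.encode("utf-8")).group()
# On characters, '.' consumes the whole 'é'.
char_dot = re.match("caf.", precomposed).group()

# Storage cost of the same four characters in UTF-8 versus a
# fixed-width 4-byte encoding.
utf8_size = len(precomposed.encode("utf-8"))
ucs4_size = len(precomposed.encode("utf-32-le"))
```

Here `raw_match` is false and `canon_match` is true, and the four-character string takes 5 bytes in UTF-8 against 16 in the fixed-width form.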
Unicode-compliant regex engines will be slower than 8-bit regex engines,
no matter what: all the character classes are larger, implying a longer
time to read them in from disk and more cache pressure when in use. Also,
their size tends to rule out the simple table lookups we could use for
8-bit datasets, forcing the engine to use fancier (and likely slower)
techniques such as tries.
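As a rough illustration of that difference, the sketch below contrasts a 256-entry table for an 8-bit "digit" class with a lookup over sorted code point ranges. Note that it uses binary search over ranges rather than a trie, simply because it is shorter to show; the range list is a small illustrative subset of the Unicode decimal digits, and all names are made up for this example.

```python
from bisect import bisect_right

# 8-bit case: one table index answers "is this byte a digit?".
DIGIT_TABLE = bytes(1 if 0x30 <= b <= 0x39 else 0 for b in range(256))

def is_digit_byte(b: int) -> bool:
    return DIGIT_TABLE[b] == 1

# Unicode case: a few of the decimal-digit ranges (illustrative subset,
# not the full class), kept sorted by starting code point.
DIGIT_RANGES = [
    (0x0030, 0x0039),  # ASCII digits
    (0x0660, 0x0669),  # Arabic-Indic digits
    (0x0966, 0x096F),  # Devanagari digits
    (0xFF10, 0xFF19),  # Fullwidth digits
]
RANGE_STARTS = [lo for lo, _ in DIGIT_RANGES]

def is_digit_cp(cp: int) -> bool:
    # Find the last range starting at or before cp, then check its upper end.
    i = bisect_right(RANGE_STARTS, cp) - 1
    return i >= 0 and cp <= DIGIT_RANGES[i][1]
```

The byte version is a single memory access; the code point version needs several comparisons per character. A flat table covering every code point would need 0x110000 (1,114,112) entries per class, which is where the size argument comes from.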