Posted Mar 14, 2010 16:01 UTC (Sun) by zlynx (subscriber, #2285)
It's easy enough if you convert internally while reading the data into UCS-2 or 4.
The main problem is that UTF encodings are variable length, so you never really know where the characters are until you go there and look.
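To make the "go there and look" point concrete, here is a minimal Python sketch, assuming well-formed UTF-8 input. The helper name `byte_offset_of_char` is hypothetical. Finding the n-th character means scanning every byte before it, because only the bytes themselves tell you where each character starts.

```python
def byte_offset_of_char(data: bytes, n: int) -> int:
    """Return the byte offset of the n-th code point (0-based).

    Assumes `data` is well-formed UTF-8; no validation is done.
    """
    count = 0
    for i, b in enumerate(data):
        # Continuation bytes have the form 0b10xxxxxx; any other byte
        # starts a new character.
        if (b & 0xC0) != 0x80:
            if count == n:
                return i
            count += 1
    raise IndexError("fewer than n + 1 code points in data")


text = "naïve café".encode("utf-8")
offset_of_v = byte_offset_of_char(text, 3)       # 'v' follows the 2-byte 'ï'
offset_of_e_acute = byte_offset_of_char(text, 9)  # the final 'é'
```

With a fixed-width encoding the same lookup would be a single multiplication, which is the trade-off being discussed.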
Google's RE2 regular expression library
Posted Mar 15, 2010 2:45 UTC (Mon) by tkil (subscriber, #1787)
> It's easy enough if you convert internally while reading the data into
> UCS-2 or 4.
> The main problem is that UTF encodings are variable length, so you never
> really know where the characters are until you go there and look.
I believe that all you really need is:
- canonicalize both the regex and the text you intend to match it against; and
- have a regex engine that understands characters, not octets
Granted, going to UCS-4 solves the latter issue, but incurs a pretty
hefty memory cost. The canonicalization is required regardless.
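As a rough illustration of why canonicalization matters, here is a small Python sketch using the standard `unicodedata` and `re` modules. The choice of NFC and the helper name `nfc` are assumptions made for the example, and normalizing a pattern as a plain string is naive: it is only safe here because the pattern contains no regex syntax that normalization could disturb.

```python
import re
import unicodedata


def nfc(s: str) -> str:
    """Normalize to NFC (canonical composition)."""
    return unicodedata.normalize("NFC", s)


pattern = "caf\u00e9"   # 'é' as one precomposed code point
text = "cafe\u0301"     # 'e' followed by a combining acute accent

# Without canonicalization the two spellings of 'é' do not match.
raw_match = re.search(pattern, text) is not None

# After normalizing both sides to the same form, they do.
canonical_match = re.search(nfc(pattern), nfc(text)) is not None
```

Both strings render identically, yet only the normalized comparison finds the match.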
Unicode-compliant regex engines will be slower than 8-bit regex engines, no
matter what: all the character classes are larger, implying longer time to
read in from disk and more cache pressure when in use. Also, their size
tends to rule out the simple table lookups we could use for 8-bit datasets,
forcing the engine to use fancier (and likely slower) techniques such as
tries.
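The difference between the two lookup styles can be sketched in Python as follows. This is an illustration under stated assumptions, not how any particular engine works: it uses a binary search over sorted ranges rather than a trie, and `DIGIT_RANGES` is a small hand-picked subset of decimal-digit blocks, not the full Unicode class.

```python
import bisect

# 8-bit case: one flag per possible byte value, indexed directly.
ASCII_DIGIT_TABLE = [ord("0") <= b <= ord("9") for b in range(256)]

# Unicode case: a flat table would need over a million entries, so store
# sorted, non-overlapping (first, last) ranges instead.
# Illustrative subset only: ASCII, Arabic-Indic, Devanagari, fullwidth.
DIGIT_RANGES = [
    (0x0030, 0x0039),
    (0x0660, 0x0669),
    (0x0966, 0x096F),
    (0xFF10, 0xFF19),
]
DIGIT_STARTS = [first for first, _ in DIGIT_RANGES]


def in_digit_ranges(cp: int) -> bool:
    """Binary-search the range table for code point `cp`."""
    i = bisect.bisect_right(DIGIT_STARTS, cp) - 1
    return i >= 0 and cp <= DIGIT_RANGES[i][1]
```

The byte table answers in one indexed load, while the range table costs a search per character, which is the kind of overhead described above.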
Google's RE2 regular expression library
Posted Mar 15, 2010 6:34 UTC (Mon) by dlang (✭ supporter ✭, #313)
if you are going to be doing a lot of string manipulation based on character position _and_ expect to be dealing with a lot of non-ASCII data, then it's worth converting the strings when you read them in.
on the other hand, many programs copy strings around a lot, but don't actually manipulate them much.
and for many things, the data being used really is almost entirely ASCII.
in these cases it is far better to leave things in UTF-8 variable length encoding and just walk the string when needed.
in the case of regex matching, if you are going to start at the beginning of the string and walk through it looking for matches, then you may as well leave it in UTF-8. you aren't doing anything that would benefit from knowing ahead of time where a particular character position starts, and the fact that the string is almost always going to be smaller is a win.
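A sequential walk of this kind can be sketched in Python as below, assuming well-formed UTF-8 (the decoder does no validation). The generator name `iter_code_points` is hypothetical. It decodes one character at a time from left to right, which is all a front-to-back matcher needs, and the size comparison at the end shows the saving for mostly-ASCII data.

```python
def iter_code_points(data: bytes):
    """Yield (byte_offset, code_point) pairs, walking left to right.

    Assumes `data` is well-formed UTF-8.
    """
    i = 0
    while i < len(data):
        lead = data[i]
        # The lead byte determines the sequence length and its payload bits.
        if lead < 0x80:
            length, cp = 1, lead
        elif lead < 0xE0:
            length, cp = 2, lead & 0x1F
        elif lead < 0xF0:
            length, cp = 3, lead & 0x0F
        else:
            length, cp = 4, lead & 0x07
        # Each continuation byte contributes six more bits.
        for cont in data[i + 1:i + length]:
            cp = (cp << 6) | (cont & 0x3F)
        yield i, cp
        i += length


decoded = list(iter_code_points("a\u00e9\u20ac".encode("utf-8")))

# Storage for a mostly-ASCII string in each encoding.
sample = "hello, world"
utf8_size = len(sample.encode("utf-8"))
utf32_size = len(sample.encode("utf-32-le"))
```

No character offsets are computed ahead of time, and the ASCII sample takes a quarter of the space it would in a 4-byte fixed-width form.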
Google's RE2 regular expression library
Posted Mar 20, 2010 0:38 UTC (Sat) by rsc (guest, #64555)
> I want to see a regular expression engine that doesn't have a
> 5x slowdown when used with a UTF-8 locale.