Unicode normalization

Posted Jun 1, 2006 23:00 UTC (Thu) by kingdon (guest, #4526)
In reply to: Unicode normalization by tialaramex
Parent article: GNU grep's new features (Linux.com)

Sure, it is complicated, but is anyone really doing much work on the problem (either in grep or in a separate tool)?

Google might have something of the sort. I know I've searched for non-ASCII strings but haven't played extensively with things like a-with-an-accent (as one character) versus a plus accent-which-combines (as two characters).
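To make the one-character versus two-character distinction concrete, here is a minimal Python sketch using the standard `unicodedata` module. The helper name `nfc_contains` is made up for illustration, and this is not a claim about how grep or Google handle the problem; it only shows that the two spellings compare unequal until both sides are normalized to the same form.

```python
import unicodedata

precomposed = "\u00e1"   # a-with-acute as a single code point
combining = "a\u0301"    # plain 'a' followed by COMBINING ACUTE ACCENT

def nfc_contains(haystack: str, needle: str) -> bool:
    """Substring test after normalizing both strings to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", needle) in unicodedata.normalize("NFC", haystack)

# The two spellings render the same but are different code point sequences.
print(precomposed == combining)                      # raw comparison
print(nfc_contains("uno ma\u0301s", "m\u00e1s"))     # comparison after NFC
```

A raw substring search misses the match because the text holds two code points where the pattern holds one; normalizing both to NFC (or both to NFD) removes the difference.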

But if Lucene does anything like this, the Lucene FAQ doesn't seem to say so (it just says that Lucene uses Unicode and doesn't elaborate).

Oh, and having the search behave differently based on locale is the wrong approach (IMHO). It is a common case that you have a lot of documents, some in one language, some in another, and some in more than one. Sure, giving up locales might cause you to lose some rules where language A treats character X one way, and language B treats it differently (hopefully obscure, but I'm not expert enough to say). Most of the time it would work to just look at the characters in the document and the search string, and ignore the locale.
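A locale-independent match along these lines could be sketched as follows, assuming Python's `unicodedata.normalize` and `str.casefold`, both of which ignore the process locale. The names `match_key` and `matches` are hypothetical. One rule known to be lost this way is Turkish casing, where uppercase I pairs with dotless ı rather than with i.

```python
import unicodedata

def match_key(text: str) -> str:
    """Build a comparison key from the characters alone, ignoring the locale."""
    folded = unicodedata.normalize("NFC", text).casefold()
    # Case folding can leave text unnormalized, so normalize once more.
    return unicodedata.normalize("NFC", folded)

def matches(document: str, query: str) -> bool:
    """Locale-independent, case-insensitive substring test (a sketch)."""
    return match_key(query) in match_key(document)

print(matches("Die STRASSE ist lang", "straße"))
print(matches("CAFE\u0301 au lait", "caf\u00e9"))
# Locale-specific rule that is lost: 'I' always folds to 'i', never to
# the Turkish dotless 'ı' (U+0131).
print(match_key("I") == "\u0131")
```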

Unicode normalization

Posted Jun 2, 2006 6:40 UTC (Fri) by MortFurd (guest, #9389)

Google does a decent job with that kind of thing - at least for what I do.

German has vowels with an umlaut (the two dots above the character). The standard way to type these on a keyboard that doesn't have the umlauted characters is to substitute a two-character combination (ae for umlaut a, ue for umlaut u, etc.). Google properly finds words containing the umlaut characters, and also finds matches for the two-character substitute if you give it an umlaut. (My home computer has a German keyboard and my work computer has an American keyboard, so I get to see both sides of the problem.)
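One way to get that behavior is to fold both spellings to the same key before comparing. The sketch below is only an illustration of the substitution rule described above, not a description of what Google actually does; the table `UMLAUT_FOLD` and the function `german_key` are hypothetical names.

```python
import unicodedata

# Umlauted vowel -> conventional two-letter substitute (ß -> ss added for completeness).
UMLAUT_FOLD = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})

def german_key(word: str) -> str:
    """Fold umlauts to their two-letter substitutes and lowercase the result."""
    # Compose 'u' + combining diaeresis into 'ü' first so the table sees it.
    composed = unicodedata.normalize("NFC", word)
    return composed.translate(UMLAUT_FOLD).casefold()

print(german_key("Müller") == german_key("Mueller"))
```

Folding in this direction is lossy: a word such as "aktuell" contains "ue" that was never an umlaut, so a search for a hypothetical "aktüll" would also match it. That may or may not be acceptable for a given tool.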

Unicode normalization

Posted Jun 2, 2006 9:40 UTC (Fri) by ibukanov (subscriber, #3942)

> Google might have something of the sort.

It is not necessary for Google to know anything about combining characters, etc., since Google search is strictly a word search. So they just need to assemble the list of all forms of a particular word and map them to the same index entry.
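The word-list idea can be sketched as a tiny inverted index in Python. Everything here is hypothetical (the `FORMS` table, the function names, the sample documents) and says nothing about Google's real implementation; the point is only that the index needs a table of known spellings, not any knowledge of Unicode composition.

```python
from collections import defaultdict

# Hypothetical hand-assembled table: index entry -> every known spelling.
FORMS = {
    "müller": ["m\u00fcller", "mu\u0308ller", "mueller"],
}
# Invert it so each spelling points at its shared index entry.
CANONICAL = {form: entry for entry, forms in FORMS.items() for form in forms}

def index_key(word: str) -> str:
    """Map a word to its index entry; unknown words index as themselves."""
    lowered = word.lower()
    return CANONICAL.get(lowered, lowered)

def build_index(docs: dict) -> dict:
    """Build a word -> set of document ids mapping."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[index_key(word)].add(doc_id)
    return index

def search(index: dict, word: str) -> set:
    return set(index.get(index_key(word), set()))

docs = {"a": "Herr M\u00fcller", "b": "Herr Mueller", "c": "Frau Schmidt"}
index = build_index(docs)
print(sorted(search(index, "Mu\u0308ller")))
```

The trade-off is that this only works for whole words that appear in the table, which is why it suits a word search engine better than a substring tool like grep.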