
State of Text Rendering

Behdad Esfahbod recently presented a talk on the State of Text Rendering at the Gran Canaria Desktop Summit. The presentation slides [PDF] are also available. "Text is the primary means of communication in computers, and is bound to be so for the decades to come. With the widespread adoption of Unicode as the canonical character set for representing text a whole new domain has been opened up in a desktop system software design. Gone are the days that one would need to input, render, print, search, spell-check, ... one language at a time. The whole concept of internationalization (i18n) on which Unicode is based is "all languages, all the time"." (Thanks to Nicolas Mailhot).


State of Text Rendering

Posted Jul 10, 2009 11:08 UTC (Fri) by macson_g (guest, #12717) [Link] (1 responses)

The next-to-last slide is notable.

Second-last slide

Posted Jul 10, 2009 21:53 UTC (Fri) by midg3t (guest, #30998) [Link]

The second-last slide says "Where is my vote?" in a speech bubble on a green background.

State of Text Rendering

Posted Jul 14, 2009 13:32 UTC (Tue) by liljencrantz (guest, #28458) [Link] (16 responses)

Actually, the concept of Unicode is «all languages, all the time, except for yellow people». Because of Han unification, one needs to switch fonts to move between Chinese, Korean, and Japanese. The motivation usually given for this is that these character sets share a common origin and were historically rendered similarly to each other, an argument that is equally true of e.g. Greek and Latin letters.

State of Text Rendering

Posted Jul 14, 2009 19:33 UTC (Tue) by behdad (guest, #18708) [Link] (1 responses)

This problem is not limited to Han unification, though. Persian and Arabic, for example, also use the same Unicode characters, but have differing font preferences. The bottom line is: for good text, you need language annotation. And if you have that, East Asian text is not much different anymore.
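
To make the Persian/Arabic point concrete, here is a minimal Python check (standard library only; the example word is mine): the codepoints themselves say only "Arabic", so a renderer cannot tell a Persian document from an Arabic one without an annotation.

    import unicodedata

    # "door" -- spelled with the same codepoints in Persian and in Arabic
    for ch in "باب":
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

    # U+0628 ARABIC LETTER BEH
    # U+0627 ARABIC LETTER ALEF
    # U+0628 ARABIC LETTER BEH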

State of Text Rendering

Posted Jul 15, 2009 7:05 UTC (Wed) by liljencrantz (guest, #28458) [Link]

(Un)Cool. Didn't know that. Thanks for enlightening me.

State of Text Rendering

Posted Jul 14, 2009 23:08 UTC (Tue) by jordanb (guest, #45668) [Link] (13 responses)

You mean the "problem" that only Japan, of all the CJK countries, seems to have?

Also, the Greek and Latin letters have far more semantic difference than the Han characters that were combined in Unicode. It's more akin to complaining that the English 'a' and the French 'a' have the same codepoint.
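
A quick Python sanity check (standard library only) shows the parallel:

    import unicodedata

    # One codepoint for 'a', whether the surrounding text is English or French
    print(hex(ord("a")))               # 0x61

    # Likewise one codepoint for a unified Han character, whether the text
    # is Japanese or Chinese, even though the preferred glyph shapes differ
    print(unicodedata.name("\u76f4"))  # CJK UNIFIED IDEOGRAPH-76F4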

State of Text Rendering

Posted Jul 15, 2009 7:20 UTC (Wed) by liljencrantz (guest, #28458) [Link] (12 responses)

The second match for «Han unification» on Google seems to contradict that assertion. This is from a Korean guy:

«If I view the same page saved as Unicode (UTF-8), the browser will suddenly show my characters with different styles, taken from different fonts. This is a technical problem--the browser doesn't have a font for Unicode, so it will map each character to some other character set, such as GB, CNS, JIS, or KSC, and the web page appears in a patchwork of styles. This problem is caused by the Han unification, and it is a serious problem. Not only does the patchwork look ugly, people also don't like to see alien fonts.»

So... no. As to your comparison with «a» in different languages, it is plainly wrong on the surface because, unlike many glyphs suffering from Han unification, the French and English «a» are rendered identically. But if you think a bit more about it, it's kind of like stripping all local variants such as the grave and acute accents from letters (à, á). I'm thinking French speakers around the world would be less than pleased by that.
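
The analogy can be made concrete with a short Python sketch (standard library only): decompose the text, then drop the combining marks.

    import unicodedata

    def strip_accents(s):
        # NFD splits "à" into "a" + U+0300; marks (category Mn) are dropped
        return "".join(c for c in unicodedata.normalize("NFD", s)
                       if unicodedata.category(c) != "Mn")

    print(strip_accents("à côté"))  # -> "a cote"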

State of Text Rendering

Posted Jul 15, 2009 15:32 UTC (Wed) by foom (subscriber, #14868) [Link] (10 responses)

His main complaint seems to be that he doesn't have an appropriate Unicode-supporting font which covers all the necessary characters, and so his system is relying on fallback to a large collection of different fonts with different styles for the same run of text. Of course that's not going to look nice.

He also notes that it's entirely feasible to make an appropriate font.

I'd then note the date on that article: 1999. One would at least hope that, a decade later, a decent font covering all the characters has indeed been made.

State of Text Rendering

Posted Jul 15, 2009 17:00 UTC (Wed) by anselm (subscriber, #2796) [Link] (8 responses)

The problem with Han unification is that the CJK character at a given code point would look subtly different in China, Japan, or Korea. (For example, Korean script uses ovals -- for want of a better word -- in places where Japanese script would use boxes with sharp corners.) This makes it difficult to use the same font to display a document that mixes Chinese, Japanese, and Korean text. This issue is independent of whether you have a single font with complete Unicode coverage or piece your coverage together from several fonts. (The problem doesn't arise with Latin script as an English, French, and German »A« look 100% identical for all practical purposes.)

On the other hand, replicating a few tens of thousands of code points several times over, when they are semantically substantially the same, doesn't look like a very enticing alternative either.
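
The language dependence is easy to observe on a Linux desktop. A sketch assuming fontconfig's fc-match is installed (what it prints depends entirely on which fonts you have):

    import subprocess

    # Ask fontconfig which serif font it would pick for the same
    # characters under different language tags.
    for lang in ("zh-cn", "ja", "ko"):
        result = subprocess.run(["fc-match", f"serif:lang={lang}"],
                                capture_output=True, text=True)
        print(lang, "->", result.stdout.strip())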

State of Text Rendering

Posted Jul 15, 2009 18:30 UTC (Wed) by foom (subscriber, #14868) [Link] (1 responses)

Well, it was also difficult to mix together Chinese, Japanese, and Korean text with different encodings, pre-Unicode.

With Unicode, at least you can clearly understand what the characters are, there's no ambiguity, and it's easy to process. However, in order to get optimal rendering, you do need to have a document which can specify a font for certain blocks of text. This hardly seems a problem in today's world -- how much software is there that can't deal with rich text, really?

Furthermore, note the following quote from the referenced article:

For instance, the style called Mincho in Japan, Myeongjo in Korea, and Song in China originated in Ming-dynasty China and is universally acceptable in the CJK countries. The Japanese fontmaker Typebank has a Chinese Mincho font with PRC-simplified characters approved by the Chinese government that shares glyphs with their Japanese Mincho font, and it is quite feasible to make a Mincho font that will serve the mainland Chinese/Korean/Japanese users decently.

It sounds to me like the default font can be chosen to be neutral, and different fonts can be chosen as desired for optimal rendering of particular pieces of text. But even if you fail to do so, the text will be recognizable.

It seems to me that the most important thing is to make sure you have similar-looking fonts covering all the necessary characters, so you don't end up with a random mix of different-looking fallback fonts for different characters, with no rhyme or reason...

State of Text Rendering

Posted Jul 16, 2009 12:18 UTC (Thu) by nim-nim (subscriber, #34454) [Link]

> However, in order to get optimal rendering, you do need to have a document
> which can specify a font for certain blocks of text.

Actually, no. You need a document that can specify a *language* for certain blocks of text. The text stack can take care of the rest.
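
For instance, with Pango this is a one-liner of markup. A minimal sketch, assuming PyGObject and pycairo are installed (the markup and output filename are mine): the same codepoint, U+76F4, is annotated with two different languages, and the text stack is then free to pick an appropriately flavoured font for each span.

    import gi
    gi.require_version("Pango", "1.0")
    gi.require_version("PangoCairo", "1.0")
    from gi.repository import Pango, PangoCairo
    import cairo

    surface = cairo.ImageSurface(cairo.FORMAT_ARGB32, 200, 60)
    context = cairo.Context(surface)
    layout = PangoCairo.create_layout(context)

    # Same codepoint, different language annotation: the stack can now
    # prefer a Japanese- or Chinese-flavoured font for each span.
    layout.set_markup('<span lang="ja">直</span> <span lang="zh-cn">直</span>', -1)
    PangoCairo.show_layout(context, layout)
    surface.write_to_png("han.png")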

Using font names as hints to specify language does not work because the fonts available on your system today are unlikely to be available on other systems or even in your own system in the future. And even if they *are* available, they may not have the same coverage or they may have been deprecated in favour of a better font since.

A few years ago having Luxi on hand was a given. It's become a minority font since.

State of Text Rendering

Posted Jul 16, 2009 12:04 UTC (Thu) by nim-nim (subscriber, #34454) [Link] (5 responses)

Actually, French and English preferences for numbers like 1 and 7 are different (a 7 will always have a middle bar when written by a Frenchman), and the problem is even more acute for Cyrillic languages (a few letters are not written the same way in Russia and in the Balkans, but have the same codepoints).

And I had an ugly patchwork effect on Linux systems well after 1999 when rendering French (because someone thought it was a good idea to use an ASCII-only, diacritics-less PS1 (PostScript Type 1) font as the default in some apps).

So really, CJK is not special at all. It's more visible, but the problems are generic. The main one, as Behdad stated, is that apps typically do not give language info to the text stack, so it has to guess all the time. Windows is much better architected in this regard (the IM switcher is not limited to input methods, but also specifies a language).

State of Text Rendering

Posted Jul 16, 2009 13:59 UTC (Thu) by dgm (subscriber, #49227) [Link] (4 responses)

nim-nim,

I think a language indication would not be the right solution for this problem. The fact that Koreans prefer ovals and Japanese squares, or that the French want their 7 with a bar, is clearly an aesthetic choice, much along the lines of preferring fonts with or without serifs.

So, I think the real solution is specifying the proper font.

On the other hand, specifying the language would be useful when the text has to be _interpreted_, as when using a spell checker.

State of Text Rendering

Posted Jul 16, 2009 14:57 UTC (Thu) by nim-nim (subscriber, #34454) [Link] (3 responses)

It is not just an æsthetic choice, or you would not have such resistance to Unicode in Asian countries. You have glyphs which have not yet drifted apart enough for the Unicode.org consortium to feel they deserve a separate codepoint, but for which local use already differs enough that it is very user-hostile to use the wrong variant for one of the affected languages.

It has nothing in common with serif vs. sans-serif. An English speaker will read both a serif and a sans-serif font, and find both pleasant, even though he will recognize that they are different. A user of one of the affected languages, OTOH, will be deeply offended if the wrong variant is used, and will take it as foreign mis-rendering (like an American reading a long text in British English spelling, except several orders of magnitude worse, even discounting the Japanese/Chinese/Korean and Arab/Iranian historical conflicts which are still very much alive nowadays).

Also, a serif font and a sans-serif font will have almost no glyphs in common. OTOH, you may have two fonts with 99.9% identical glyphs, except for a few dozen that change to cater for local variance. In fact, modern OpenType smart fonts can include different variants of the same glyph in the same font file precisely for this reason, and the text stack can select the right one to use, as long as the app told it what language to render (this is another thing crappy legacy font formats like PS1 cannot do).
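
These per-language rules are visible in the font file itself. A sketch using fontTools (the font path is hypothetical): list the script and language systems for which a font's GSUB table carries substitution rules; this is where per-language 'locl' variants live.

    from fontTools.ttLib import TTFont

    font = TTFont("SomeCJKFont.otf")  # hypothetical font file
    gsub = font["GSUB"].table

    # Each OpenType script may declare per-language lookup lists
    for record in gsub.ScriptList.ScriptRecord:
        langs = [ls.LangSysTag for ls in record.Script.LangSysRecord]
        print(record.ScriptTag, langs)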

Modern text rendering and fonts are much more complex than you assume. They are an interpretation process in themselves, and one of the input variables necessary to get correct results is the language to target.

Also, you've ignored what I wrote about fonts. Their availability and capabilities do change over time and across systems; it is very wrong to expect a particular font name to give specific results once a document has left its origin system (except if you systematically bundle the fonts with documents, which, apart from the weight and legal problems, is a great way to have buggy fonts linger long after they've been fixed or replaced). This is acceptable as long as the changes are (to take your words) æsthetic only, but that is not what people are complaining about here.

State of Text Rendering

Posted Jul 16, 2009 16:02 UTC (Thu) by jordanb (guest, #45668) [Link] (1 responses)

"Asian Opposition" to Unicode was really opposition within certain segments of the Japanese computing industry, who preferred the old model of foreign companies finding it difficult to work with their baroque, flaky, and mutually incompatible local character sets.

Han unification was always a cause célèbre for people whose real motive was to ensure that internationalization remained difficult. It didn't hurt them any that Unicode was championed mostly by American companies, with strong support in that part of the world from China, making it easy to paint the entire process as Sino-American imperialism.

Regarding your example of Americans seeing British spelling, that's not how it works. If you're a Chinese person, you will have your computer localized for China. If an individual from Japan sends you some text that contains unified characters, and language specification isn't available or isn't used, then you will see the unified characters rendered with Chinese glyphs.

A more accurate analogy to the situation would be an American receiving an email from a British person where their computer silently changed all the British spellings to American ones. There certainly would be the possibility of confusion in some contrived circumstances. For instance, if the British person were to write "We spell things differently than you, like colour" and the American saw "color", the American might wonder what was going on. But that would hardly present any possibility of "deep offense."

State of Text Rendering

Posted Jul 16, 2009 17:04 UTC (Thu) by nim-nim (subscriber, #34454) [Link]

> "Asian Opposition" to Unicode was really opposition within certain
> segments of the Japanese computing industry, who preferred the old model
> of foreign companies finding it difficult to work with their baroque,
> flaky, and mutually incompatible local character sets.

Sure, but they managed to convince many third parties in the meantime.

> Regarding your example of Americans seeing British spelling, that's not
> how it works. If you're a Chinese person you will have your computer
> localized for China.

That's not true anymore. Localized systems, like non-networked systems, are increasingly a thing of the past (especially in the Linux world, where distros ship the same image to everyone). When your system supports many different languages with conflicting requirements (or when many users use the en_US locale because they're not satisfied with their localization, either because it's bad or because they're so saturated with English technical terms that they don't recognise their translations anymore), you can't rely on defaults to mask the lack of language awareness at the application level.

I don't think Fedora has managed a single release in the past 2-3 years where Chinese, Japanese, or Korean users didn't complain that the general non-local defaults were too biased in favour of one of the others (there is also the problem that they want different processing for Latin than for the other scripts, as pointed out in Behdad's paper).

State of Text Rendering

Posted Jul 17, 2009 14:54 UTC (Fri) by nim-nim (subscriber, #34454) [Link]

> Modern text rendering and fonts are much more complex than you assume.

Sorry, that came out a bit more strongly than intended (story of my life). But you should know that extracting a letter from a font is no longer just looking up a position indexed by a code point in isolation, like it used to be. OpenType permits variable behaviour that depends on the codepoint's environment.

If this environment is wrong (including the language info), the end result will be wrong too. The severity of the resulting problems varies from language to language, but they are not limited to exotic Asian scripts. Even plain old European languages are affected (except perhaps English, because the defaults used in the absence of a correct environment are English-oriented).
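
Shaping libraries expose that environment directly. A sketch with the uharfbuzz Python bindings (the font path is hypothetical, and whether the glyph actually changes depends on the font carrying a 'locl' rule for the language): the same Cyrillic codepoint can shape to a different glyph when the run is tagged as Serbian rather than Russian.

    import uharfbuzz as hb

    blob = hb.Blob.from_file_path("SomeFont.ttf")  # hypothetical font file
    face = hb.Face(blob)
    font = hb.Font(face)

    buf = hb.Buffer()
    buf.add_str("б")  # italic forms of this letter differ between Serbia and Russia
    buf.guess_segment_properties()
    buf.language = "sr"  # tag the run as Serbian before shaping
    hb.shape(font, buf)
    print([info.codepoint for info in buf.glyph_infos])  # glyph indices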

State of Text Rendering

Posted Jul 16, 2009 12:13 UTC (Thu) by nim-nim (subscriber, #34454) [Link]

> I'd then note the date on that article: 1999. One would at least hope that
> by a decade later, a decent font covering all the characters has indeed
> been made.

Some fonts have been improved, but too many FLOSS apps insist on shipping/supporting fonts with bad coverage, or in a format like PS1 which does not even support the features needed for good i18n (both i18n in the Far East sense and in the sense of supporting less common European languages).

When apps such as OO.o and TeX at last support modern OpenType formats properly, it will be possible to dump all the old fonts with their quirks and problems. But as long as they don't, people will take them as an argument that it's better to ship those fonts, since they are good enough for English users, than to ship modern fonts that would solve i18n problems but have less mature app support.

State of Text Rendering

Posted Jul 16, 2009 0:35 UTC (Thu) by jordanb (guest, #45668) [Link]

That guy's decade-old rant can be boiled down to this, basically:

"When I receive text from a friend in a different Asian country (Japan in this case) and the locale is left unspecified, then the unified glyphs will be rendered in a local font, which can introduce subtle differences. In the old system, where every country had incompatible character sets, the locale was implicit in the character set being used (ShiftJIS means Japanese and Big5 means Taiwanese/Chinese, for instance)."

It's true that Unicode unifies characters where possible. It's also true that language specification is "outside the scope" of the standard. Language specification is "outside the scope" of most legacy encodings as well, but it was sort of a perverse benefit of international fragmentation that the encoding would imply a particular country, especially in Asia.

The solution to the problem is to ensure that your data format has the ability to specify a language or a font as well as an encoding. This is not uncommon in modern formats.
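
Most modern formats already have that hook. A tiny standard-library Python illustration (the snippet is mine), using XML's built-in xml:lang attribute to carry the language alongside the encoding:

    import xml.etree.ElementTree as ET

    doc = ET.Element("doc")
    span = ET.SubElement(doc, "span", {"xml:lang": "ja"})
    span.text = "直"  # a unified Han codepoint, now tagged as Japanese
    print(ET.tostring(doc, encoding="unicode"))
    # <doc><span xml:lang="ja">直</span></doc>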

