State of Text Rendering
Text is the primary means of communication in computers, and is bound to be so for the decades to come. With the widespread adoption of Unicode as the canonical character set for representing text, a whole new domain has been opened up in desktop system software design. Gone are the days when one would need to input, render, print, search, spell-check, ... one language at a time. The whole concept of internationalization (i18n) on which Unicode is based is "all languages, all the time". (Thanks to Nicolas Mailhot.)
Posted Jul 10, 2009 11:08 UTC (Fri)
by macson_g (guest, #12717)
[Link] (1 responses)

The next to last slide is notable.
Posted Jul 10, 2009 21:53 UTC (Fri)
by midg3t (guest, #30998)
[Link]

Second-last slide
Posted Jul 14, 2009 13:32 UTC (Tue)
by liljencrantz (guest, #28458)
[Link] (16 responses)
Posted Jul 14, 2009 19:33 UTC (Tue)
by behdad (guest, #18708)
[Link] (1 responses)
Posted Jul 15, 2009 7:05 UTC (Wed)
by liljencrantz (guest, #28458)
[Link]
Posted Jul 14, 2009 23:08 UTC (Tue)
by jordanb (guest, #45668)
[Link] (13 responses)
Also the Greek and Latin letters have far more semantic difference than the Han letters that were combined in Unicode. It's more akin to complaining that both English and French 'a's have the same codepoint.
Posted Jul 15, 2009 7:20 UTC (Wed)
by liljencrantz (guest, #28458)
[Link] (12 responses)
The second match for «Han unification» on Google seems to contradict that assertion. This from a Korean guy:
«If I view the same page saved as Unicode (UTF-8), the browser will suddenly show my characters with different styles, taken from different fonts. This is a technical problem--the browser doesn't have a font for Unicode, so it will map each character to some other character set, such as GB, CNS, JIS, or KSC, and the web page appears in a patchwork of styles. This problem is caused by the Han unification, and it is a serious problem. Not only does the patchwork look ugly, people also don't like to see alien fonts.»
So... no. As to your comparison about «a» in different languages, it is plainly wrong on the surface, because unlike many glyphs suffering from Han unification, the French and English «a» are rendered identically. But if you think a bit more about it, it's rather as if we stripped all local variants like the grave and acute accents from letters (à, á). I'm thinking French speakers around the world would be less than pleased by that.
Posted Jul 15, 2009 15:32 UTC (Wed)
by foom (subscriber, #14868)
[Link] (10 responses)
If you read the rest of that page, his real complaint is that he has no single font that covers all the characters necessary, and so his system is relying on fallback to a large collection of different fonts with different styles, for the same run of text. Of course that's not going to look nice.

He also notes that it's entirely feasible to make an appropriate font.

I'd then note the date on that article: 1999. One would at least hope that by a decade later, a decent font covering all the characters has indeed been made.
Posted Jul 15, 2009 17:00 UTC (Wed)
by anselm (subscriber, #2796)
[Link] (8 responses)
The problem with Han unification is that the CJK character at a given code
point would look subtly different in China, Japan, or Korea. (For example,
Korean script uses ovals -- for want of a better word -- in places where
Japanese script would use boxes with sharp corners.) This makes it
difficult to use the same font to display a document that mixes Chinese,
Japanese, and Korean text. This issue is independent of whether you have a
single font with complete Unicode coverage or piece your coverage together
from several fonts. (The problem doesn't arise with Latin script as an
English, French, and German »A« look 100% identical for all practical
purposes.)
On the other hand, replicating a few tens of thousands of code points several times over when they are, semantically, substantially the same doesn't look like a very enticing alternative, either.
Posted Jul 15, 2009 18:30 UTC (Wed)
by foom (subscriber, #14868)
[Link] (1 responses)

Well, it's also difficult to mix together Chinese, Japanese, and Korean text with different encodings, pre-Unicode.
With Unicode, at least you can clearly understand what the characters are, there's no ambiguity,
and it's easy to process. However, in order to get optimal rendering, you do need to have a
document which can specify a font for certain blocks of text. This hardly seems a problem in
today's world -- how much software is there that can't deal with rich text, really?
Furthermore, note the following quote from the referenced article:

"For instance, the style called Mincho in Japan, Myeongjo in Korea, and Song in China originated in Ming-dynasty China and is universally acceptable in the CJK countries. The Japanese fontmaker Typebank has a Chinese Mincho font with PRC-simplified characters approved by the Chinese government that shares glyphs with their Japanese Mincho font, and it is quite feasible to make a Mincho font that will serve the mainland Chinese/Korean/Japanese users decently."
It sounds to me like the default font can be chosen to be neutral, and different fonts can be
chosen as desired for optimal rendering of particular pieces of text. But even if you fail to do so,
the text will be recognizable.
It seems to me that the most important thing is to make sure you have similar-looking fonts
covering all the necessary characters, so you don't end up with a random mix of different-looking fallback fonts for different characters, with no rhyme or reason...
Posted Jul 16, 2009 12:18 UTC (Thu)
by nim-nim (subscriber, #34454)
[Link]
> which can specify a font for certain blocks of text.

Actually, no. You need a document that can specify a *language* for certain blocks of text. The text stack can take care of the rest.
Using font names as hints to specify language does not work because the fonts available on your system today are unlikely to be available on other systems or even in your own system in the future. And even if they *are* available, they may not have the same coverage or they may have been deprecated in favour of a better font since.
A few years ago having Luxi on hand was a given. It's become a minority font since.
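To make the "text stack can take care of the rest" point concrete, here is a minimal sketch (not from the thread) of language-driven font selection through fontconfig's C API; the concrete families it prints depend entirely on the fonts installed on the system:

    /* lang_match.c -- minimal sketch of language-based font matching.
       Build: cc lang_match.c $(pkg-config --cflags --libs fontconfig) */
    #include <stdio.h>
    #include <fontconfig/fontconfig.h>

    static void match_for_lang(const char *lang)
    {
        FcPattern *pat = FcPatternCreate();
        /* Ask for a generic serif face, but tag the request with a language. */
        FcPatternAddString(pat, FC_FAMILY, (const FcChar8 *)"serif");
        FcPatternAddString(pat, FC_LANG, (const FcChar8 *)lang);
        FcConfigSubstitute(NULL, pat, FcMatchPattern);
        FcDefaultSubstitute(pat);

        FcResult result;
        FcPattern *match = FcFontMatch(NULL, pat, &result);
        FcChar8 *family = NULL;
        if (match && FcPatternGetString(match, FC_FAMILY, 0, &family) == FcResultMatch)
            printf("%-6s -> %s\n", lang, (const char *)family);
        if (match)
            FcPatternDestroy(match);
        FcPatternDestroy(pat);
    }

    int main(void)
    {
        FcInit();
        /* The same abstract request resolves to different concrete fonts
           depending on the language tag alone. */
        match_for_lang("ja");     /* Japanese */
        match_for_lang("ko");     /* Korean */
        match_for_lang("zh-cn");  /* Simplified Chinese */
        FcFini();
        return 0;
    }

No font names are hard-coded: the document only has to carry the language tag, and the matching stays correct as the installed fonts change underneath it.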
Posted Jul 16, 2009 12:04 UTC (Thu)
by nim-nim (subscriber, #34454)
[Link] (5 responses)
And I had an ugly patchwork effect on Linux systems well after 1999 when rendering French (because someone thought it was a good idea to use an ASCII, diacritics-less PS1 font as the default in some apps).

So really, CJK is not special at all. It's more visible, but the problems are generic. The main one, as Behdad stated, is that apps typically do not give language info to the text stack, so it has to guess all the time. Windows is much better architected in this regard (the IM switcher is not limited to IMs but also specifies a language).
Posted Jul 16, 2009 13:59 UTC (Thu)
by dgm (subscriber, #49227)
[Link] (4 responses)
I think a language indication would not be the right solution for this problem. The fact that Koreans prefer ovals and Japanese squares, or that the French want their 7 with a bar, is clearly an aesthetic choice, much in the line of preferring fonts with or without serifs.
So, I think the real solution is specifying the proper font.
On the other hand, specifying the language would be useful when the text has to be _interpreted_, as when using a spell checker.
Posted Jul 16, 2009 14:57 UTC (Thu)
by nim-nim (subscriber, #34454)
[Link] (3 responses)
It has nothing in common with serif vs sans-serif. An English speaker will read both a serif and a sans-serif font, and they'll both be pleasant, even though he will recognize they are different. A user of one of the affected languages, OTOH, will be deeply offended if the wrong variant is used, and will take it as foreign mis-rendering (like an American who reads a long text in British English spelling, except several orders of magnitude worse, even discounting the Japanese/Chinese/Korean and Arab/Iranian historical conflicts which are still very much alive nowadays).
Also, a serif font and a sans-serif font will have almost no glyph in common. OTOH you may have two fonts with 99.9% identical glyphs, except for a few dozen that change to cater for local variance. In fact, modern OpenType smart fonts can include different variants of the same glyph in the same font file precisely for this reason, and the text stack can select the right one to use, as long as the app told it which language to render (this is another thing crappy legacy font formats like PS1 cannot do).
Modern text rendering and fonts are much more complex than you assume. They are an interpretation process in themselves, and one of the input variables necessary to get correct results is the language to target.
Also, you've ignored what I wrote about fonts. Their availability and capabilities change over time and across systems; it is very wrong to expect a particular font name to give specific results once a document has left its origin system (except if you systematically bundle the fonts with documents, which, apart from the weight and legal problems, is a great way to have buggy fonts linger long after they've been fixed or replaced). This is acceptable as long as the changes are (to take your words) aesthetic only, but this is not what people complain of here.
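To illustrate the glyph-variant selection described above, here is a minimal sketch using HarfBuzz's C API (whose current incarnation postdates this discussion) together with FreeType. The font path is hypothetical, and the glyph IDs printed depend on the font, but with a pan-CJK font carrying an OpenType 'locl' feature, the same codepoint yields different glyphs under different language tags:

    /* locl_demo.c -- shape the same Han codepoint under two language tags.
       Build: cc locl_demo.c $(pkg-config --cflags --libs harfbuzz freetype2) */
    #include <stdio.h>
    #include <hb.h>
    #include <hb-ft.h>
    #include <ft2build.h>
    #include FT_FREETYPE_H

    static void shape(hb_font_t *font, const char *utf8, const char *lang)
    {
        hb_buffer_t *buf = hb_buffer_create();
        hb_buffer_add_utf8(buf, utf8, -1, 0, -1);
        hb_buffer_set_script(buf, HB_SCRIPT_HAN);
        hb_buffer_set_language(buf, hb_language_from_string(lang, -1));
        hb_buffer_guess_segment_properties(buf);  /* fills in direction */

        hb_shape(font, buf, NULL, 0);

        unsigned int n;
        hb_glyph_info_t *info = hb_buffer_get_glyph_infos(buf, &n);
        printf("%-6s ->", lang);
        for (unsigned int i = 0; i < n; i++)
            printf(" glyph %u", info[i].codepoint);  /* glyph ID, post-shaping */
        printf("\n");
        hb_buffer_destroy(buf);
    }

    int main(void)
    {
        FT_Library ft;
        FT_Face face;
        FT_Init_FreeType(&ft);
        /* Hypothetical path: substitute any pan-CJK font with 'locl' variants. */
        FT_New_Face(ft, "/usr/share/fonts/NotoSansCJK.ttc", 0, &face);
        hb_font_t *font = hb_ft_font_create(face, NULL);

        shape(font, "\xE9\xAA\xA8", "ja");     /* U+9AA8, Japanese form */
        shape(font, "\xE9\xAA\xA8", "zh-cn");  /* same codepoint, Chinese form */

        hb_font_destroy(font);
        FT_Done_Face(face);
        FT_Done_FreeType(ft);
        return 0;
    }

The application only supplies the language tag; the shaper and the font do the variant selection between them.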
Posted Jul 16, 2009 16:02 UTC (Thu)
by jordanb (guest, #45668)
[Link] (1 responses)
Han Unification was always a cause célèbre for people whose real motive was to ensure that internationalization remained difficult. That included segments of the Japanese computing industry, who preferred the old model of foreign companies finding it difficult to work with their baroque, flaky, and mutually incompatible local character sets. It didn't hurt them any that Unicode was championed mostly by American companies with strong support in that part of the world from China, making it easy to paint the entire process as Sino-American imperialism.
Regarding your example of Americans seeing British spelling, that's not how it works. If you're a Chinese person you will have your computer localized for China. If an individual from Japan sends you some text, and it contains unified characters, and language specification isn't available or not used, then you will see the unified characters with Chinese glyphs.
A more accurate analogy to the situation would be an American receiving an email from a British person, where their computer silently changed all the British spellings to American ones. There certainly would be the possibility of confusion in some contrived circumstances. For instance, if the British person were to write "We spell things differently than you, like colour" and the American saw "color", the American might wonder what was going on. But that would hardly present any possibility of "deep offense."
Posted Jul 16, 2009 17:04 UTC (Thu)
by nim-nim (subscriber, #34454)
[Link]
> segments of the Japanese computing industry, who preferred the old model
> of foreign companies finding it difficult to work with their baroque,
> flaky, and mutually incompatible local character sets.

Sure, but they managed to convince many third parties in the meanwhile.

> Regarding your example of Americans seeing British spelling, that's not
> how it works. If you're a Chinese person you will have your computer
> localized for China.
That's not true anymore. Localized systems, like non-networked systems, are increasingly a thing of the past (especially in the Linux world, where distros ship the same image to everyone). When your system supports many different languages with conflicting requirements (or when many users use the en_US locale because they're not satisfied with their localization, either because it's bad or because they're so saturated with English technical terms that they don't recognise their translations anymore), you can't rely on defaults to mask the lack of language awareness at the application level.

I don't think Fedora has managed a single release in the past 2-3 years where Chinese, Japanese, or Korean users didn't complain that the general non-local defaults were too biased in favour of one of the others (there is also the problem that they want different processing for Latin than the others, as pointed out in Behdad's paper).
Posted Jul 17, 2009 14:54 UTC (Fri)
by nim-nim (subscriber, #34454)
[Link]
Sorry, that came out a bit more strongly than intended (story of my life). But you should know that extracting a letter from a font is not just looking up a position indexed by a code point in isolation, like it used to be. OpenType permits variable behaviour that depends on the codepoint environment.
If this environment is wrong (including the language info), the end result will be wrong too. The severity of the resulting problems varies from language to language but they are not limited to exotic Asian scripts. Even plain old European languages are affected (except perhaps for English, because the defaults used in the absence of a correct environment are English-oriented).
Posted Jul 16, 2009 12:13 UTC (Thu)
by nim-nim (subscriber, #34454)
[Link]
> by a decade later, a decent font covering all the characters has indeed
> been made.

Some fonts have been improved, but too many FLOSS apps insist on shipping/supporting fonts with bad coverage, or in a format like PS1 which does not even support the features needed for good i18n (both i18n in the Far East sense and in the sense of supporting less common European languages).

When apps such as OO.o and TeX at last start properly supporting modern OpenType formats, it will be possible to dump all the old fonts with their quirks and problems. But as long as they don't, people will use them as an argument that it's better to ship those fonts, since they are good enough for English users, than modern fonts that would solve i18n problems but have less mature app support.
Posted Jul 16, 2009 0:35 UTC (Thu)
by jordanb (guest, #45668)
[Link]
"When I receive text from a friend in a different Asian country (Japan in this case) and the locale is left unspecified, then the unified glyphs will be rendered in a local font, which can introduce subtle differences. In the old system, where every country had incompatible character sets, the locale was implicit in the character set being used (ShiftJIS means Japanese and Big5 means Taiwanese/Chinese, for instance)."
It's true that Unicode combines glyphs where possible. It's also true that language specification is "outside the scope" of the standard. Language specification is "outside the scope" of most legacy encodings as well, but it was sort of a perverse benefit of international fragmentation that the encoding would imply a particular country, especially in Asia.
The solution to the problem is to ensure that your data format has the ability to specify a language or a font as well as an encoding. This is not uncommon in modern formats.
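As a concrete illustration of such a format: Pango's markup language already carries a per-run lang attribute, and the layout engine picks fonts accordingly. A minimal sketch (the rendered forms depend on the fonts installed locally):

    /* lang_markup.c -- one line mixing Japanese- and Chinese-tagged runs
       of the same unified codepoint (U+9AA8), rendered via pango/cairo.
       Build: cc lang_markup.c $(pkg-config --cflags --libs pangocairo) */
    #include <pango/pangocairo.h>

    int main(void)
    {
        const char *markup =
            "<span lang=\"ja\">\xE9\xAA\xA8</span> vs "
            "<span lang=\"zh-cn\">\xE9\xAA\xA8</span>";

        cairo_surface_t *surface =
            cairo_image_surface_create(CAIRO_FORMAT_ARGB32, 200, 50);
        cairo_t *cr = cairo_create(surface);

        PangoLayout *layout = pango_cairo_create_layout(cr);
        pango_layout_set_markup(layout, markup, -1);  /* lang travels with the text */
        pango_cairo_show_layout(cr, layout);

        cairo_surface_write_to_png(surface, "han.png");

        g_object_unref(layout);
        cairo_destroy(cr);
        cairo_surface_destroy(surface);
        return 0;
    }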