Google and Adobe's pan-CJK open font

By Nathan Willis
October 1, 2014

ATypI 2014

At the 2014 ATypI conference in Barcelona, a pair of talks described the recent joint effort by Google and Adobe to build what amounts to the largest font project ever undertaken. The font in question is free software, available under the Apache 2 license. Its branding varies between the companies—Google's version is known as Noto Sans CJK, while Adobe's is Source Han Sans—but, in either case, the font is the first open-source "pan CJK" (Chinese, Japanese, Korean) typeface. The character set it implements maxes out the possible size of an OpenType file's internal tables, with 65,535 characters in each file. Including all of the weights and variants, the project developed nearly 500,000 characters—an effort that pushed the limits of the design process and of the font-building process alike.

On the first day of the event, Caleb Belohlavek from Adobe and Stuart Gill from Google presented the completed font itself and discussed some of the technical challenges involved in the project. On the fourth day, Masataka Hattori, Ryoko Nishizuka, and Taro Yamamoto spoke about the design and publication process.

Abundant characters

Noto/Source Han covers all of the symbols needed to write the four major languages that commonly use Chinese "Han" characters: Traditional Chinese (written in Hong Kong, Taiwan, and Macau), Simplified Chinese (written in mainland China and Singapore), Japanese, and Korean. Simplified Chinese, as the name might suggest, uses the same characters as Traditional Chinese, but it incorporates numerous intentional reductions in the number of strokes and it substitutes structural simplifications. Altogether, written Chinese contains tens of thousands of characters; there is not even full agreement on the exact number. A fair number of them are unique to proper names, and many are homophones—but that fact does not eliminate the need to support more than one of the characters—Belohlavek and Gill likened it to the difference between "Smith" and "Smythe."

Japanese and Korean each have their own scripts, of course (Kana and Hangul), but they regularly intermix Chinese glyphs as well. The Japanese and Korean Han variants, however, were first adopted from Chinese centuries ago, and today they differ in important ways, both from each other and from Chinese. The structures of the forms vary from language to language, but so do many stylistic details, such as how individual strokes are terminated or joined together. The Kana and Hangul character sets are fairly small, but the need to support Chinese characters as well makes any CJK font a complicated beast.

In practice, the sheer size of the CJK character set means that most publishers, online and off, are forced to mix-and-match multiple fonts in a document—particularly when there is a need for multiple sizes of text or varying levels of bold for emphasis. There have been precious few pan-CJK fonts ever made that could provide the full character set. In its coverage of the Noto/Source Han release, CNET found only three others, all with hefty (four-figure) price tags attached.

In 2009, this situation struck Adobe's Ken Lunde as one needing immediate attention, and he raised the issue at the 2009 Unicode Conference. Lunde's comments were noticed by the team at Google working on that company's Noto font "superfamily." Noto is Google's effort to create a high-quality typeface that covers the entire Unicode standard. The design (at least of the Latin characters) is visually similar to the Droid font developed for Android and to Open Sans, which Google uses in its branding and in several web applications.

After Lunde's talk, Google approached Adobe with the idea of commissioning a pan-CJK typeface that would harmonize with Noto and would also be usable with Adobe's Source Pro family. As the speakers put it, "then two lawyers entered a room together and two years went by." Finally, the legal details were sorted out, and in 2012 the project began in earnest.

Collaborative development

Since the goal was to develop a font that appealed to local users, Google and Adobe decided to commission three type foundries with expertise in the local languages. At ATypI 2012 in Hong Kong, the companies met with Changzhou SinoType (from China), Iwata Corporation (from Japan), and Sandoll Communication (from Korea), and began hashing out a development plan.

[Stuart Gill and Caleb Belohlavek at ATypI 2014]

Early on, it became clear that developing the font would push the boundaries of existing formats. Using the TrueType format, the team estimated, generating hinting for the fonts would, by itself, require two years. That left the Compact Font Format (CFF), a PostScript-derived format, as the only real choice—since it relies on the rendering engine to perform pixel-grid alignment, rather than requiring hints to be embedded for each individual glyph. That realization, Belohlavek said, was one of the major contributing factors in Adobe's eventual decision to donate its CFF rendering engine to the FreeType project. Without the Adobe CFF renderer, he said, the Noto/Source Han font family would not have been viable.

The next hurdle was the number of characters that can be included in any single file. OpenType, which can be a wrapper around either TrueType or CFF glyphs, supports only 65,535 glyphs in a single file. Unicode 1.0 introduced the concept of Han unification, which was intended to map the CJK characters from all four languages into a single character set. But, as mentioned earlier, the actual characters are often not drawn identically in the four languages.
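The arithmetic behind that ceiling can be sketched in a few lines. OpenType stores a font's glyph count in a 16-bit field, which is where the 65,535 figure comes from; multiplying by the seven weights reported for the first release gives a rough upper bound on the family's size. The function names here are illustrative, and the assumption that every weight is filled to the limit is a simplification.

```python
# Glyph counts in an OpenType file are held in a 16-bit unsigned field.
GLYPH_INDEX_BITS = 16


def max_glyphs(index_bits=GLYPH_INDEX_BITS):
    """Largest glyph count the article cites for a field of this width."""
    return 2 ** index_bits - 1


def family_total(weights, glyphs_per_font=None):
    """Upper bound on glyphs drawn, assuming every weight is full."""
    if glyphs_per_font is None:
        glyphs_per_font = max_glyphs()
    return weights * glyphs_per_font


print(max_glyphs())     # 65535
print(family_total(7))  # 458745
```

The second figure is consistent with the "nearly 500,000" total quoted for the project as a whole.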

Thus, an extensive undertaking was required to sort out which of the tens of thousands of "unified" glyphs could be reused in more than one of the languages, then to map those reuse relationships into a set of OpenType locl (for "locale") substitution tables. Ultimately, however, even that was not enough to squeeze under the 65,535-character limit, Gill and Belohlavek explained, "so you must get choosy; eventually it all comes down to opinions." A variety of the trade-offs that were involved, from working with Unicode Variation Sequences to which code points are mapped for which language, are described in the release notes.
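The idea of sharing glyphs and substituting only where a language needs its own form can be modeled with a toy lookup. This is a sketch of the concept, not the project's actual tables: the two characters, the glyph names, and the choice of which languages get a distinct form are all hypothetical, and the language tags are used only as dictionary keys.

```python
# Toy model of locale-based glyph substitution. All glyph names are
# hypothetical; a real font stores this mapping in binary tables.

# Default mapping from character to glyph, shared by every language.
CMAP = {
    "\u4e00": "uni4E00",  # assumed to be drawn identically everywhere
    "\u76f4": "uni76F4",  # assumed to need language-specific forms
}

# Per-language overrides, applied only where the default form is wrong.
LOCL_SUBSTITUTIONS = {
    "ZHS": {"uni76F4": "uni76F4.zhs"},
    "ZHT": {"uni76F4": "uni76F4.zht"},
    "JAN": {},  # assumed here to use the default forms
}


def shape(text, language):
    """Map text to glyph names, applying the language's overrides."""
    overrides = LOCL_SUBSTITUTIONS.get(language, {})
    return [overrides.get(CMAP[ch], CMAP[ch]) for ch in text]


def glyphs_needed():
    """Count distinct glyphs when shared forms are stored only once."""
    names = set(CMAP.values())
    for overrides in LOCL_SUBSTITUTIONS.values():
        names.update(overrides.values())
    return len(names)


print(shape("\u4e00\u76f4", "JAN"))
print(shape("\u4e00\u76f4", "ZHS"))
print(glyphs_needed())
```

In this toy case, drawing both characters separately for three languages would take six glyphs, while sharing the common forms takes four. The saving is what makes it worth sorting out the reuse relationships first.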

Similarly, when the team set out to decide on a character encoding for each language, there were (as there are for most languages) many to choose from. The team eventually chose to offer multiple character encodings for each language, hoping to be as widely useful as possible. It is not possible to please everyone, of course, as the speakers found out. During the Q&A session, one audience member took issue with the decision to implement Taiwan's Ministry of Education (MOE) encoding; eventually the speakers had to concede that there were trade-offs with any such choice, and they relied on the regional expertise of the various type foundries to make the right decision.
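The general point that a single character can be represented differently under each regional standard is easy to see with the legacy codecs that ship with Python. This is only an illustration of why there were many encodings to choose from; it says nothing about which encodings the font actually supports or how the fonts were built.

```python
# Encode one widely shared Han character under several legacy CJK
# encodings from Python's standard library and compare the bytes.
LEGACY_CODECS = ["gb2312", "big5", "shift_jis", "euc_jp", "euc_kr"]


def encode_everywhere(char):
    """Return the byte sequence for char under each legacy codec."""
    return {name: char.encode(name) for name in LEGACY_CODECS}


encoded = encode_everywhere("\u4e2d")  # U+4E2D, "middle"
for name, raw in encoded.items():
    print(f"{name:10} {raw.hex()}")
```

Each codec decodes its own bytes back to the same Unicode character, but the byte values themselves are not interchangeable between standards.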

From design to deployment

Yamamoto, Hattori, and Nishizuka spoke in more detail about the design process in their session. Nishizuka was the principal designer on the project; Yamamoto heads Adobe's type team in Tokyo, and Hattori worked both on type design and production issues. The overall workflow involved Nishizuka developing a set of designs for Japanese, which were then sent to Sandoll to be adapted for Korean, and from there were sent to SinoType for the Traditional and Simplified Chinese work. Drawing tens of thousands of characters even once (much less in multiple weights) is a daunting proposition. Nishizuka said she started by building a library of around 120 reusable stroke components, which were then used to build "a few hundred" core characters to be circulated to the other foundries.

Over many iterations, that process was used to build up the completed character set. But the sheer scale involved, it seems, made everything difficult. Nishizuka noted that removing the serif-like stroke endings, although a small change, radically reduced the file size when multiplied by 60,000 characters. Among the other challenges the designers cited was accounting for differences between the way the languages are used today: Japanese documents, for example, are increasingly required to mix vertical and horizontal writing. It was also not easy to develop a style that fit the comparatively open Japanese Kana characters, the compound Chinese characters, and the rather geometric forms of Korean Hangul.

Altogether, the development process took two years. Along the way, the team even found previously undiscovered bugs in the various encoding standards—such as a reversed component in the Unicode charts and mistakes in Taiwan's MOE standard. Lunde has written detailed blog posts about several of these issues, which make for highly educational reading about the dangers of extremely large specifications.

The first public release was made on July 15, 2014, and included the full character set in seven different weights. In addition to source, installable packages were built in a variety of formats: the full character set, a monolingual version for each language, region-specific versions (which are essentially a workaround to cope with the fact that not all software supports OpenType locl features), and sets that combine multiple .OTF fonts into OpenType Collection (.OTC) files, a rarely used file format that saves a bit of space when packaging multiple fonts together by allowing the fonts to share a common set of feature tables. The file sizes range from 15MB to well over 100MB, depending on the version selected. Even in the packaging, it seems, the Noto/Source Han project is pushing the limits.
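For readers curious about what a collection file looks like, the header is simple enough to read by hand. A collection starts with the tag "ttcf", a version, the number of fonts, and one offset per font pointing at that font's table directory; table sharing works because several directories can point at the same table data. The sketch below parses only that header. The function name is illustrative, and the sample bytes are synthetic rather than taken from any Noto/Source Han package.

```python
import struct


def parse_collection_header(data):
    """Read the fixed part of a font collection header plus its offsets."""
    # Big-endian: 4-byte tag, two 16-bit version fields, 32-bit font count.
    tag, major, minor, num_fonts = struct.unpack_from(">4sHHI", data, 0)
    if tag != b"ttcf":
        raise ValueError("not a font collection (missing 'ttcf' tag)")
    # One 32-bit offset per font follows the 12-byte fixed header.
    offsets = struct.unpack_from(f">{num_fonts}I", data, 12)
    return {
        "version": (major, minor),
        "num_fonts": num_fonts,
        "offsets": list(offsets),
    }


# Synthetic header describing three fonts; the offsets are made up.
sample = struct.pack(">4sHHI3I", b"ttcf", 1, 0, 3, 24, 300, 600)
print(parse_collection_header(sample))
```

Pointing the same function at the first few dozen bytes of a real .OTC file should report how many fonts it bundles, though that has not been verified here against the released packages.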

The font has already been spotted in the wild, and the speakers noted that enough feedback had come in from users that an update was released in September. Users who can get by with text written in European languages may regard the Noto/Source Han project as impressive largely for the scope of the engineering effort that it required. In fact, during their talk Belohlavek and Gill displayed a door-sized poster showing the entire 65,535-character set, which proved to be quite a popular curiosity; even at that size, individual characters were all but unreadably small. But for the millions of users who write one of the four CJK languages, the availability of a high-quality font family as free software is undoubtedly a win.


Posted Oct 2, 2014 7:36 UTC (Thu) by Seegras (guest, #20463)

I don't exactly see where Korean fits in. It's the most designed-through writing system in the world, with only 24 glyphs. With absolutely NO relationship to the Chinese or Japanese writing systems.

http://en.wikipedia.org/wiki/Hangul

Just because "it looks Asian"? Or "because they once used Chinese glyphs there"?

Posted Oct 2, 2014 8:07 UTC (Thu) by rahulsundaram (subscriber, #21946)

The latter

https://en.wikipedia.org/wiki/CJK_characters
https://en.wikipedia.org/wiki/Hanja

Posted Oct 2, 2014 9:00 UTC (Thu) by Sho (subscriber, #8956)

Hangul has 24 letters, but it doesn't have "24 glyphs" in essentially any digital font in widespread use. First off, those letters sometimes form digraphs and trigraphs that essentially are letter-like in practical use (to the point where some have their own keys on the keyboard). Further, multiple letters are grouped into syllabic blocks (also morphemic in practice, since blocks often happen to map to morphemes) in a number of different arrangements based on how the letters tetris together[1]. Unicode contains code points for both the individual letters and the pre-composed blocks (in fact, you can use a neat little formula to map from the former to the latter). Fonts usually contain glyphs for both as well; almost no software implementation attempts to compose them from combining characters at runtime.
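For anyone who wants to try that formula, here is a minimal sketch. It follows the arithmetic composition defined in the Unicode Standard for conjoining jamo (19 leading consonants, 21 vowels, 27 optional trailing consonants); the function name and error handling are just for illustration.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable composition.
S_BASE = 0xAC00  # first precomposed syllable
L_BASE = 0x1100  # first leading consonant
V_BASE = 0x1161  # first vowel
T_BASE = 0x11A7  # one before the first trailing consonant
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28  # T_COUNT includes "no trailing"


def compose_hangul(lead, vowel, tail=None):
    """Compose conjoining jamo into one precomposed syllable block."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = 0 if tail is None else ord(tail) - T_BASE
    if not 0 <= l_index < L_COUNT:
        raise ValueError("not a leading consonant")
    if not 0 <= v_index < V_COUNT:
        raise ValueError("not a vowel")
    if tail is not None and not 1 <= t_index < T_COUNT:
        raise ValueError("not a trailing consonant")
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


block = compose_hangul("\u1112", "\u1161", "\u11ab")
print(block, hex(ord(block)))           # the syllable at U+D55C
print(L_COUNT * V_COUNT * T_COUNT)      # 11172 possible blocks
```

The result matches what NFC normalization produces for the same three jamo, and the 11,172 total is where the "more than 10k blocks" figure comes from.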

That means font designers/implementors have more than 10k blocks to take care of - because of their regularity (since they *do* compose down to letters) it's not quite a Han-scale problem, but it means some similar challenges (the stroke density for more complex blocks is more similar to hanja than to individual Latin letters, say).

And as rahulsundaram points out Koreans do still sometimes (rarely now) use hanja, so they need to fit nicely aesthetically.

1 = http://upload.wikimedia.org/wikipedia/commons/8/82/Hangeu...

Posted Oct 2, 2014 14:43 UTC (Thu) by n8willis (subscriber, #43041)

I would also add that, as one of the speakers pointed out (Yamamoto, I believe, but I'm not 100% sure), due to the significantly increased trade and tourism within the region, many many documents, web sites, advertisements, and signs need to be written in more than one language. So while hanja usage may be in decline when it comes to writing (new) books entirely in Korean, writing something solely in Korean is far from being the only situation that users find themselves in.

Nate

Posted Oct 2, 2014 17:37 UTC (Thu) by Sho (subscriber, #8956)

Our desktops also cope fairly poorly with mixed character sets, cf. https://blogs.kde.org/2014/09/11/beyond-unicode-closing-g...

Posted Oct 2, 2014 9:54 UTC (Thu) by NAR (subscriber, #1313)

I never thought that there would be that much difference between Kana, Hangul and Chinese characters. When I saw the "entrance" characters on a Seoul castle gate, they looked the same as the characters I saw in Japan on tram doors. But it looks like it is more complicated...

Posted Oct 3, 2014 12:50 UTC (Fri) by alan (subscriber, #4018)

You may have been looking at a Japanese translation for tourists. This again highlights the need to be able to write multiple languages together in an aesthetically consistent way.

Posted Oct 3, 2014 15:58 UTC (Fri) by NAR (subscriber, #1313)

Actually I was looking at a restored 15th century sign.

Posted Oct 3, 2014 19:36 UTC (Fri) by Sho (subscriber, #8956)

Korean used to be written using Chinese characters. The rules involved were complicated and limited literacy to an elite, which is one reason why the Hangul alphabet was introduced in the 1440s. For various political reasons, however, it took a few more centuries for the population to achieve mass literacy using Hangul.