Google and Adobe's pan-CJK open font
At the 2014 ATypI conference in Barcelona, a pair of talks described the recent joint effort by Google and Adobe to build what amounts to the largest font project ever undertaken. The font in question is free software, available under the Apache 2 license. Its branding varies between the companies (Google's version is known as Noto Sans CJK, while Adobe's is Source Han Sans) but, in either case, the font is the first open-source "pan-CJK" (Chinese, Japanese, Korean) typeface. The character set it implements maxes out the possible size of an OpenType file's internal tables, with 65,535 glyphs in each file. Including all of the weights and variants, the project developed nearly 500,000 glyphs, an effort that pushed the limits of the design process and of the font-building process alike.
On the first day of the event, Caleb Belohlavek from Adobe and Stuart Gill from Google presented the completed font itself and discussed some of the technical challenges involved in the project. On the fourth day, Masataka Hattori, Ryoko Nishizuka, and Taro Yamamoto spoke about the design and publication process.
Abundant characters
Noto/Source Han covers all of the symbols needed to write the four major languages that commonly use Chinese "Han" characters: Traditional Chinese (written in Hong Kong, Taiwan, and Macau), Simplified Chinese (written in mainland China and Singapore), Japanese, and Korean. Simplified Chinese, as the name might suggest, is based on the same characters as Traditional Chinese, but many of them are deliberately written with fewer strokes or with simplified structures. Altogether, written Chinese contains tens of thousands of characters; there is not even full agreement on the exact number. A fair number of them are unique to proper names, and many are homophones, but that fact does not eliminate the need to support each of them; Belohlavek and Gill likened it to the difference between "Smith" and "Smythe."
![[Variants in Han characters]](https://static.lwn.net/images/2014/09-atypi-han-differences-sm.png)
Japanese and Korean each have their own scripts, of course (Kana and Hangul), but they regularly intermix Chinese glyphs as well. The Japanese and Korean Han variants, however, were first adopted from Chinese centuries ago, and today they differ in important ways, both from each other and from Chinese. The structures of the forms vary from language to language, but so do many stylistic details, such as how individual strokes are terminated or joined together. The Kana and Hangul character sets are fairly small, but the need to support Chinese characters as well makes any CJK font a complicated beast.
In practice, the sheer size of the CJK character set means that most publishers, online and off, are forced to mix and match multiple fonts in a document, particularly when there is a need for multiple sizes of text or varying levels of bold for emphasis. There have been precious few pan-CJK fonts ever made that could provide the full character set. In its coverage of the Noto/Source Han release, CNET found only three others, all with hefty (four-figure) price tags attached.
In 2009, this situation struck Adobe's Ken Lunde as one needing immediate attention, and he raised the issue at the 2009 Unicode Conference. Lunde's comments were noticed by the team at Google working on that company's Noto font "superfamily." Noto is Google's effort to create a high-quality typeface that covers the entire Unicode standard. The design (at least of the Latin characters) is visually similar to the Droid font developed for Android and to Open Sans, which Google uses in its branding and in several web applications.
After Lunde's talk, Google approached Adobe with the idea of commissioning a pan-CJK typeface that would harmonize with Noto and would also be usable with Adobe's Source Pro family. As the speakers put it, "then two lawyers entered a room together and two years went by." Finally, the legal details were sorted out, and in 2012 the project began in earnest.
Collaborative development
Since the goal was to develop a font that appealed to local users, Google and Adobe decided to commission three type foundries with expertise in the local languages. At ATypI 2012 in Hong Kong, the companies met with Changzhou SinoType (from China), Iwata Corporation (from Japan), and Sandoll Communication (from Korea), and began hashing out a development plan.
![Stuart Gill and Caleb Belohlavek at ATypI 2014](https://static.lwn.net/images/2014/09-atypi-gill-sm.jpg)
Early on, it became clear that developing the font would push the boundaries of existing formats. Using the TrueType format, the team estimated, generating hinting for the fonts would, by itself, require two years. That left the Compact Font Format (CFF), a PostScript-derived format, as the only real choice—since it relies on the rendering engine to perform pixel-grid alignment, rather than requiring hints to be embedded for each individual glyph. That realization, Belohlavek said, was one of the major contributing factors in Adobe's eventual decision to donate its CFF rendering engine to the FreeType project. Without the Adobe CFF renderer, he said, the Noto/Source Han font family would not have been viable.
The next hurdle was the number of characters that can be included in any single file. OpenType, which can be a wrapper around either TrueType or CFF glyphs, supports only 65,535 glyphs in a single file. Unicode 1.0 introduced the concept of Han unification, which was intended to map the CJK characters from all four languages into a single character set. But, as mentioned earlier, the actual characters are often not drawn identically in the four languages.
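The arithmetic behind that ceiling is short. OpenType records the number of glyphs in a font as an unsigned 16-bit value, which is where the figure of 65,535 comes from. The sketch below works that out and, assuming all seven weights of the first release are filled to the limit, shows roughly where a total of "nearly 500,000" comes from. The variable names are illustrative only.

```python
# The glyph count of an OpenType font is stored as an unsigned 16-bit value.
GLYPH_COUNT_BITS = 16
MAX_GLYPHS_PER_FONT = 2**GLYPH_COUNT_BITS - 1

# Assumption for illustration: seven weights, each filled to the limit.
WEIGHTS = 7
total_glyphs = MAX_GLYPHS_PER_FONT * WEIGHTS

print(f"{MAX_GLYPHS_PER_FONT:,} glyphs per font")               # 65,535
print(f"{total_glyphs:,} glyphs across {WEIGHTS} weights")      # 458,745
```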
Thus, an extensive undertaking was required to sort out which of the tens of thousands of "unified" glyphs could be reused in more than one of the languages, then to map those reuse relationships into a set of OpenType locl (for "localized forms") substitution tables. Ultimately, however, even that was not enough to squeeze under the 65,535-glyph limit. As Gill and Belohlavek explained, "so you must get choosy; eventually it all comes down to opinions." A variety of the trade-offs that were involved, from working with Unicode Variation Sequences to which code points are mapped for which language, are described in the release notes.
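To make the reuse mapping concrete, here is a minimal Python sketch of the idea behind locl substitution: every code point maps to one default glyph, and a per-language table swaps in a regional form only where the shapes differ. The glyph names, the choice of a Japanese form as the default, the specific substitutions, and the `glyph_for` and `unique_glyphs` helpers are all hypothetical. The real font expresses this through OpenType layout tables, not Python dictionaries.

```python
from typing import Dict, Optional, Set

# Default mapping from code point to glyph name (hypothetical names).
# Assumption for illustration: the default glyph is the Japanese form.
DEFAULT_CMAP: Dict[int, str] = {
    0x9AA8: "uni9AA8-JP",  # a character drawn differently by region
    0x4E00: "uni4E00",     # a character whose shape is shared by all languages
}

# Per-language substitutions, keyed by OpenType language system tag.
# Only glyphs whose regional form differs need an entry.
LOCL_SUBSTITUTIONS: Dict[str, Dict[str, str]] = {
    "ZHS": {"uni9AA8-JP": "uni9AA8-CN"},
    "ZHT": {"uni9AA8-JP": "uni9AA8-TW"},
    "KOR": {},
}


def glyph_for(codepoint: int, language: Optional[str] = None) -> str:
    """Return the glyph to draw for a code point in a given language."""
    glyph = DEFAULT_CMAP[codepoint]
    if language is None:
        return glyph
    # Fall back to the default glyph when no regional form is defined.
    return LOCL_SUBSTITUTIONS.get(language, {}).get(glyph, glyph)


def unique_glyphs() -> Set[str]:
    """Count every distinct glyph the font must contain."""
    glyphs = set(DEFAULT_CMAP.values())
    for table in LOCL_SUBSTITUTIONS.values():
        glyphs.update(table.values())
    return glyphs


print(glyph_for(0x9AA8, "ZHS"))  # regional form is substituted
print(glyph_for(0x4E00, "ZHS"))  # shared glyph is reused
print(len(unique_glyphs()))      # 4 glyphs cover 2 code points in 4 languages
```

Every shared shape costs one glyph slot instead of four, which is why deciding what can be reused matters so much under a fixed glyph budget.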
Similarly, when the team set out to decide on a character encoding for each language, there were (as there are for most languages) many to choose from. It eventually chose to offer multiple character encodings for each language, hoping to be as widely useful as possible. It is not possible to please everyone, of course, as the speakers found out. During the Q&A session, one audience member took issue with the decision to implement Taiwan's Ministry of Education (MOE) encoding; eventually the speakers had to concede that there were trade-offs with any such choice, and they relied on the regional expertise of the various type foundries to make the right decision.
From design to deployment
Yamamoto, Hattori, and Nishizuka spoke in more detail about the design process in their session. Nishizuka was the principal designer on the project; Yamamoto heads Adobe's type team in Tokyo, and Hattori worked both on type design and production issues. The overall workflow involved Nishizuka developing a set of designs for Japanese, which were then sent to Sandoll to be adapted for Korean, and from there were sent to SinoType for the Traditional and Simplified Chinese work. Drawing tens of thousands of characters even once (much less in multiple weights) is a daunting proposition. Nishizuka said she started by building a library of around 120 reusable stroke components, which were then used to build "a few hundred" core characters to be circulated to the other foundries.
![Ryoko Nishizuka [Ryoko Nishizuka at ATypI 2014]](https://static.lwn.net/images/2014/09-atypi-nishizuka-sm.jpg)
Over many iterations, that process was used to build up the completed character set. But the sheer scale involved, it seems, made everything difficult. Nishizuka noted that removing the serif-like stroke endings, although a small change, radically reduced the file size when multiplied by 60,000 characters. Among the other challenges the designers cited was accounting for differences between the way the languages are used today: Japanese documents, for example, are increasingly required to mix vertical and horizontal writing. It was also not easy to develop a style that fit the comparatively open Japanese Kana characters, the compound Chinese characters, and the rather geometric forms of Korean Hangul.
Altogether, the development process took two years. Along the way, the team even found previously undiscovered bugs in the various encoding standards—such as a reversed component in the Unicode charts and mistakes in Taiwan's MOE standard. Lunde has written detailed blog posts about several of these issues, which make for highly educational reading about the dangers of extremely large specifications.
The first public release was made on July 15, 2014, and included the full character set in seven different weights. In addition to source, installable packages were built in a variety of formats: the full character set, a monolingual version for each language, region-specific versions (which are essentially a workaround to cope with the fact that not all software supports OpenType locl features), and sets that combine multiple .OTF fonts into OpenType Collection (.OTC) files, a rarely used file format that saves a bit of space when packaging multiple fonts together by allowing the fonts to share a common set of feature tables. The file sizes range from 15MB to well over 100MB, depending on the version selected. Even in the packaging, it seems, the Noto/Source Han project is pushing the limits.
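The space saving from a collection file comes from storing identical tables once and letting each font point at the shared copy. The following Python sketch models that idea with made-up table contents. The `Font` alias, the `separate_size` and `collection_size` helpers, and the byte strings are assumptions for illustration, and a real .OTC file also carries headers and per-font table directories that are ignored here.

```python
import hashlib
from typing import Dict, List

# A font is modeled as a mapping from table tag to raw table bytes.
Font = Dict[str, bytes]


def separate_size(fonts: List[Font]) -> int:
    """Total bytes when every font ships its own copy of every table."""
    return sum(len(data) for font in fonts for data in font.values())


def collection_size(fonts: List[Font]) -> int:
    """Total bytes when byte-identical tables are stored only once."""
    unique: Dict[str, int] = {}
    for font in fonts:
        for data in font.values():
            digest = hashlib.sha256(data).hexdigest()
            unique[digest] = len(data)
    return sum(unique.values())


# Hypothetical data: two fonts share one feature table but differ elsewhere.
shared_features = b"F" * 1000
japanese: Font = {"GSUB": shared_features, "cmap": b"J" * 200}
korean: Font = {"GSUB": shared_features, "cmap": b"K" * 200}

fonts = [japanese, korean]
print(separate_size(fonts))    # 2400
print(collection_size(fonts))  # 1400
```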
The font has already been spotted in the wild, and the speakers noted that enough feedback had come in from users that an update was released in September. Users who can get by with text written in European languages may regard the Noto/Source Han project as impressive largely for the scope of the engineering effort that it required. In fact, during their talk Belohlavek and Gill displayed a door-sized poster showing the entire 65,535-character set, which proved to be quite a popular curiosity; even at that size, individual characters were all but unreadably small. But for the millions of users who write one of the four CJK languages, the availability of a high-quality font family as free software is undoubtedly a win.
Index entries for this article | |
---|---|
Conference | ATypI/2014 |
Posted Oct 2, 2014 7:36 UTC (Thu)
by Seegras (guest, #20463)
http://en.wikipedia.org/wiki/Hangul
Just because "it looks Asian"? Or "because they once used Chinese glyphs there"?
Posted Oct 2, 2014 8:07 UTC (Thu)
by rahulsundaram (subscriber, #21946)
Posted Oct 2, 2014 9:00 UTC (Thu)
by Sho (subscriber, #8956)
That means font designers/implementors have more than 10k blocks to take care of - because of their regularity (since they *do* compose down to letters) it's not quite a Han-scale problem, but it means some similar challenges (the stroke density for more complex blocks is more similar to hanja than to individual Latin letters, say).
And as rahulsundaram points out Koreans do still sometimes (rarely now) use hanja, so they need to fit nicely aesthetically.
1 = http://upload.wikimedia.org/wikipedia/commons/8/82/Hangeu...