
Leading items

A look at The Machine

By Jake Edge
August 26, 2015

LinuxCon North America

In what was perhaps one of the shortest keynotes on record (ten minutes), Keith Packard outlined the hardware architecture of "The Machine"—HP's ambitious new computing system. That keynote took place at LinuxCon North America in Seattle and was thankfully followed by an hour-long technical talk by Packard the following day (August 18), which looked at both the hardware and software for The Machine. It is, in many ways, a complete rethinking of the future of computers and computing, but there is a fairly long way to go between here and there.

The hardware

The basic idea of the hardware is straightforward. Many of the "usual computation units" (i.e. CPUs or systems on chip—SoCs) are connected to a "massive memory pool" using photonics for fast interconnects. That leads to something of an equation, he said: electrons (in CPUs) + photons (for communication) + ions (for memory storage) = computing. Today's computers transfer a lot of data and do so over "tiny little pipes". The Machine, instead, can address all of its "amazingly huge pile of memory" from each of its many compute elements. One of the underlying principles is to stop moving memory around to use it in computations—simply have it all available to any computer that needs it.

[Keith Packard]

Some of the ideas for The Machine came from HP's DragonHawk systems, which were traditional symmetric multiprocessing systems, but packed a "lot of compute in a small space". DragonHawk systems would have 12TB of RAM in an 18U enclosure, while the nodes being built for The Machine will have 32TB of memory in 5U. It is, he said, a lot of memory and it will scale out linearly. All of the nodes will be connected at the memory level so that "every single processor can do a load or store instruction to access memory on any system".

Nodes in this giant cluster do not have to be homogeneous, as long as they are all hooked to the same memory interconnect. The first nodes that HP is building will be homogeneous, just for pragmatic reasons. There are two circuit boards on each node, one for storage and one for the computer. Connecting the two will be the "next generation memory interconnect" (NGMI), which will also connect both parts of the node to the rest of the system using photonics.

The compute part of the node will have a 64-bit ARM SoC with 256GB of purely local RAM along with a field-programmable gate array (FPGA) to implement the NGMI protocol. The storage part will have four banks of memory (each with 1TB), each with its own NGMI FPGA. A given SoC can access memory elsewhere without involving the SoC on the node where the memory resides—the NGMI bridge FPGAs will talk to their counterpart on the other node via the photonic interface. Those FPGAs will eventually be replaced by application-specific integrated circuits (ASICs) once the bugs are worked out.

ARM was chosen because it was easy to get ARM vendors to talk with the project, Packard said. There is no "religion" about the instruction set architecture (ISA), so other ISAs may be used down the road.

Eight of these nodes can be collected up into a 5U enclosure, which gives eight processors and 32TB of memory. Ten of those enclosures can then be placed into a rack (80 processors, 320TB) and multiple racks can all be connected on the same "fabric" to allow addressing up to 32 zettabytes (ZB) from each processor in the system.

The storage and compute portions of each node are powered separately. The compute piece has two 25Gb network interfaces that are capable of remote DMA. The storage piece will eventually use some kind of non-volatile/persistent storage (perhaps even the fabled memristor), but is using regular DRAM today, since it is available and can be used to prove the other parts of the design before switching.

SoCs in the system may be running more than one operating system (OS) and for more than one tenant, so there are some hardware protection mechanisms built into the system. In addition, the memory-controller FPGAs will encrypt the data at rest so that pulling a board will not give access to the contents of the memory even if it is cooled (à la cold boot) or when some kind of persistent storage is used.

At one time, someone said that 640KB of memory should be enough, Packard said, but now he is wrestling with the limits of the 48-bit addresses used by the 64-bit ARM and Intel CPUs. That only allows addressing up to 256TB, so memory will be accessed in 8GB "books" (or, sometimes, 64KB "booklettes"). Beyond the SoC, the NGMI bridge FPGA (which is also called the "Z bridge") deals with two different kinds of addresses: 53-bit logical Z addresses and 75-bit Z addresses. Those allow addressing 8PB and 32ZB respectively.
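
Those sizes follow directly from the address widths (using the binary prefixes quoted in the talk); as a quick check:

    \begin{align*}
    2^{48}\ \text{bytes} &= 256\,\text{TB} && \text{(SoC physical address space)} \\
    2^{53}\ \text{bytes} &= 8\,\text{PB}   && \text{(logical Z addresses)} \\
    2^{75}\ \text{bytes} &= 32\,\text{ZB}  && \text{(full Z addresses)}
    \end{align*}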

The logical Z addresses are used by the NGMI firewall to determine the access rights to that memory for the local node. Those access controls are managed outside of whatever OS is running on the SoC. So the mapping of memory is handled by the OS, while the access controls for the memory are part of the management of The Machine system as a whole.

NGMI is not intended to be a proprietary fabric protocol, Packard said, and the project is trying to see if others are interested. A memory transaction on the fabric looks much like a cache access. The Z address is presented and 64 bytes are transferred.

The software

Packard's group is working on GPL operating systems for The Machine, but others can certainly be supported. If some "proprietary Washington company" wanted to port its OS to The Machine, it certainly could. Meanwhile, though, other groups are working on other free systems, while his group is made up of "GPL bigots" who are working on Linux for the system. There will not be a single OS (or even distribution or kernel) running on a given instance of The Machine—it is intended to support multiple different environments.

Probably the biggest hurdle for the software is that there is no cache coherence within the enormous memory pool. Each SoC has its own local memory (256GB) that is cache coherent, but accesses to the "fabric-attached memory" (FAM) between two processors are completely uncoordinated by hardware. That has implications for applications and the OS that are using that memory, so OS data structures should be restricted to the local, cache-coherent memory as much as possible.

For the FAM, there is a two-level allocation scheme that is arbitrated by a "librarian". It allocates books (8GB) and collects them into "shelves". The hardware protections provided by the NGMI firewall are done on book boundaries. A shelf could be a collection of books that are scattered all over the FAM in a single load-store domain (LSD—not Packard's favorite acronym, he noted), which is defined by the firewall access rules. That shelf could then be handed to the OS to be used for a filesystem, for example. That might be ext4, some other Linux filesystem, or the new library filesystem (LFS) that the project is working on.

Talking to the memory in a shelf uses the POSIX API. A process does an open() on a shelf and then uses mmap() to map the memory into the process. Underneath, it uses the direct access (DAX) support to access the memory. For the first revision, LFS will not support sparse files. Also, locking will not be global throughout an LSD, but will be local to an OS running on a node.
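
A minimal sketch of that flow in C, assuming a hypothetical shelf path (the talk does not give the actual LFS naming scheme) and a mapping the size of one 8GB book:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* A shelf appears as a file in the library filesystem (LFS);
         * the path here is purely illustrative. */
        int fd = open("/lfs/my-shelf", O_RDWR);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        size_t len = 8UL << 30;   /* one 8GB book, for example */

        /* With DAX underneath, loads and stores through this mapping go
         * straight to fabric-attached memory; there is no page cache. */
        uint8_t *fam = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
        if (fam == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        fam[0] = 42;              /* a store that lands in FAM */

        munmap(fam, len);
        close(fd);
        return 0;
    }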

For management of the FAM, each rack will have a "top of rack" management server, which is where the librarian will run. That is a fairly simple piece of code that just does bookkeeping and keeps track of the allocations in a SQLite database. The SoCs are the only parts of the system that can talk to the firewall controller, so other components communicate with a firewall proxy that runs in user space, which relays queries and updates. There are a "whole bunch of potential adventures" in getting the memory firewall pieces all working correctly, Packard said.

The lack of cache coherence makes atomic operations on the FAM problematic, as traditional atomics rely on that feature. So the project has added some hardware to the bridges to do atomic operations at that level. There is a fam_atomic library to access the operations (fetch and add, swap, compare and store, and read), which means that each operation is done at the cost of a system call. Once again, this is just the first implementation; other mechanisms may be added later. One important caveat is that the FAM atomic operations do not interact with the SoC cache, so applications will need to flush those caches as needed to ensure consistency.
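
A rough sketch of the resulting call pattern and its caveat; fam_fetch_and_add_64() is a hypothetical stand-in for the corresponding fam_atomic library call (the talk does not give exact signatures), stubbed out here so the example is self-contained:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical stand-in for the fam_atomic fetch-and-add operation.
     * The real call traps into the kernel, which asks the NGMI bridge to
     * perform the operation on fabric-attached memory. */
    static int64_t fam_fetch_and_add_64(volatile int64_t *fam_addr, int64_t inc)
    {
        int64_t old = *fam_addr;   /* placeholder only */
        *fam_addr += inc;
        return old;
    }

    int main(void)
    {
        static int64_t counter;    /* imagine this lives in a mapped shelf */

        int64_t old = fam_fetch_and_add_64(&counter, 1);
        printf("old value: %lld\n", (long long)old);

        /* Because the bridge atomics bypass the SoC cache, any cached copy
         * of this location must be flushed or invalidated before trusting
         * a plain load from the same address. */
        return 0;
    }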

Physical addresses at the SoC level can change, so there needs to be support for remapping those addresses. But the SoC caches and DAX both assume static physical mappings. A subset of the physical address space will be used as an aperture into the full address space of the system and books can be mapped into that aperture.

Flushing the SoC cache line by line would "take forever", so a way to flush the entire cache when the physical address mappings change has been added. In order to do that, two new functions have been added to the Intel persistent memory library (libpmem): one to check for the presence of non-coherent persistent memory (pmem_needs_invalidate()) and another to invalidate the CPU cache (pmem_invalidate()).
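
The talk gives only the names of the two new calls, so the prototypes below are assumptions modeled on libpmem's usual (address, length) style; the sketch shows where they would fit once a remapping has happened:

    #include <stddef.h>

    /* Assumed prototypes; the names come from the talk, the signatures
     * do not. */
    int  pmem_needs_invalidate(const void *addr, size_t len);
    void pmem_invalidate(const void *addr, size_t len);

    void aperture_remapped(void *fam, size_t len)
    {
        /* After the aperture has been pointed at a different book, any
         * cache lines that still refer to the old mapping are stale. */
        if (pmem_needs_invalidate(fam, len))
            pmem_invalidate(fam, len);
    }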

In a system of this size, with the huge amounts of memory involved, there needs to be well-defined support for memory errors, Packard said. Read is easy—errors are simply signaled synchronously—but writes are trickier because the actual write is asynchronous. Applications need to know about the errors, though, so SIGBUS is used to signal an error. The pmem_drain() call will act as a barrier, such that errors in writes before that call will signal at or before the call. Any errors after the barrier will be signaled post-barrier.
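
A sketch of that reporting model, reusing the kind of mapping shown earlier: pmem_flush() and pmem_drain() are existing libpmem calls, and the SIGBUS handler is where a failed asynchronous write would surface (a real application would need a recovery strategy rather than simply exiting):

    #include <libpmem.h>
    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    static void bus_handler(int sig, siginfo_t *info, void *ctx)
    {
        /* A write error has been reported; si_addr (not examined here)
         * points into the affected mapping. */
        (void)sig; (void)info; (void)ctx;
        static const char msg[] = "persistent-memory write error\n";
        write(STDERR_FILENO, msg, sizeof(msg) - 1);
        _exit(1);
    }

    void write_record(char *fam_dst, const char *src, size_t len)
    {
        memcpy(fam_dst, src, len);   /* store into fabric-attached memory */
        pmem_flush(fam_dst, len);    /* push the stores out of the CPU cache */
        pmem_drain();                /* barrier: earlier write errors are
                                      * signaled at or before this point */
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = bus_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGBUS, &sa, NULL);

        /* ... map a shelf as in the earlier example, then call
         * write_record() on it ... */
        return 0;
    }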

There are various areas where the team is working on free software, he said, including persistent memory and DAX. There is also ongoing work on concurrent/distributed filesystems and non-coherent cache management. Finally, reliability, availability, and serviceability (RAS) are quite important to the project, so free software work is proceeding in that area as well.

Even with two separate sessions, it was a bit of a whirlwind tour of The Machine. As he noted, it is an environment that is far removed from the desktop world Packard had previously worked in. By the sound of it, there are plenty of challenges to overcome before The Machine becomes a working computing device—it will be an interesting process to watch.

[I would like to thank the Linux Foundation for travel assistance to Seattle for LinuxCon North America.]


Data visualizations in text

By Nathan Willis
August 26, 2015

TypeCon

Data visualization is often thought of in terms of pixels; considerable work goes into shaping large data sets into a form where spatial relationships are made clear and where colors, shapes, intensity, and point placement encode various quantities for rapid understanding. At TypeCon 2015 in Denver, though, researcher Richard Brath presented a different approach: taking advantage of readers' familiarity with the written word to encode more information into text itself.

Brath is a PhD candidate at London South Bank University where, he said, "I turn data into shapes and color and so on." Historically speaking, though, typography has not been a part of that equation. He showed a few examples of standard data visualizations, such as "heatmap" diagrams. Even when there are multiple variables under consideration, the only typography involved is plain text labels. "Word clouds" are perhaps the only well-known example of visualizations that involve altering text based on data, but even that is trivial: the most-frequent words or tags are simply bigger. More can certainly be done.

[Richard Brath]

Indeed, more has been done—at least on rare occasion; Brath has cataloged and analyzed instances where other type attributes have been exploited to encode additional information in visualizations. An oft-overlooked example, he said, is cartography: subtle changes in spacing, capitalization, and font weight are used to indicate many distinct levels of map features. The reader may not consciously recognize it, but the variations give cues as to which neighboring text labels correspond to which map features. Some maps even incorporate multiple types of underline and reverse italics in addition to regular italics (two features that are quite uncommon elsewhere).

Brath also showed several historical charts and diagrams (some dating back to the 18th century) that used typographic features to encode information. Royal family trees, for example, would sometimes vary the weight, slant, and style of names to indicate the pedigree and status of various family members. A far more modern example of signifying information with font attributes, he said, can be seen in code editors, where developers take it for granted that syntax highlighting will distinguish between symbols, operators, and structures—hopefully without adversely impacting readability.

On the whole, though, usage of these techniques is limited to specific niches. Brath set out to catalog the typographic features that were employed, then to try to apply them to entirely new data-visualization scenarios. The set of features available for encoding information included standard properties like weight and slant, plus capitalization, x-height, width (i.e., condensed through extended), spacing, serif type, stroke contrast, underline, and the choice of typeface itself. Naturally, some of those attributes map well to quantitative data (such as weight, which can be varied continuously throughout a range), while others would only be useful for encoding categorical information (such as whether letters are slanted or upright).

He then began creating and testing a variety of visualizations in which he would encode information by varying some of the font attributes. Perhaps the most straightforward example was the "text-skimming" technique: a preprocessor varies the weight of individual words in a document based on their overall frequency in the language used. Unusual words are bolder, common words are lighter, with several gradations incorporated. Articles and pronouns can even be put into italics to further differentiate them from the more critical parts of the text. The result is a paragraph that, in user tests, readers can skim through at significantly higher speed; it is somewhat akin to overlaying a word-frequency cloud on top of the text itself.
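
A minimal sketch of the mapping step; the frequency thresholds and weight values are arbitrary choices for illustration rather than Brath's:

    /* Map a word's corpus frequency (0.0 to 1.0) to a CSS-style font
     * weight: rare words come out heavy, common words light. */
    int weight_for_frequency(double freq)
    {
        if (freq < 0.0001) return 700;   /* rare: bold */
        if (freq < 0.001)  return 600;
        if (freq < 0.01)   return 500;
        if (freq < 0.1)    return 400;   /* regular text weight */
        return 300;                      /* very common: light */
    }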

[Movie review visualization]

A bit further afield was Brath's example of encoding numeric data linearly in a line of text. He took movie reviews from the Rotten Tomatoes web site and used each reviewer's numeric rating as the percentage of the text rendered in bold. The result, when all of the reviews for a particular film are placed together, effectively maps a histogram of the reviews onto the reviews themselves. In tests, he said, participants typically found it easier to extract information from this form than from Rotten Tomatoes's default layout, which places small numbers next to review quotes in a grid, intermingled with various images.

He also showed examples of visualization techniques that varied multiple font attributes to encode more than one variable. The first was a response to limitations of choropleth maps—maps where countries or other regions are colored (or shaded) to indicate a particular score on some numeric scale. While choropleths work fine for single variables, it is hard to successfully encode multiple variables using the technique, and looking back and forth between multiple single-variable choropleth maps makes it difficult for the reader to notice any correlations between them.

[Map data visualization]

Brath's technique encoded three variables (health expenditure as a percentage of GDP, life expectancy, and prevalence of HIV) into three font attributes (weight, case, and slant), using the three-letter ISO country codes as the text for each nation on the map. The result makes it easier to zero in on particular combinations of the variables (for example, countries with high health expenditures and short life expectancies) or, at least, easier than flipping back and forth between three maps.

His final example of multi-variable encoding used x-height and font width to encode musical notation into text. The use case presented was how to differentiate singing from prose within a book. Typically, the only typographic technique employed in a book is to offset the sung portion of the text and set it in italics. Brath, instead, tested varying the heights of the letters to indicate note pitch and the widths to indicate note duration.

The reaction to this technique from the audience at TypeCon was, to say the least, mixed. While it is clear that the song text encodes some form of rhythmic changes and variable intensity, it does not map easily to notes, and the rendered text is not exactly easy to look at. Brath called it a work in progress; his research is far from over.

He ended the session by encouraging audience members to visit his research blog and take the latest survey to test the impact of some of the visualization techniques firsthand. He also posed several questions to the crowd, such as why there are many font families that come with a variety of weights, but essentially none that offer multiple x-height options or italics with multiple angles of slant.

Brath's blog makes for interesting reading for anyone concerned with data visualizations or text. He often explores practical issues—for example, how overuse of color can negatively impact text legibility, which could have implications for code markup tools, or the difficulties of slanting text at multiple angles. Programmers, who spend much of their time staring at text, are no doubt already familiar with many ways in which typographic features can encode supplementary information (in this day and age, who does not associate a hyperlink closely with an underline, after all?). But there are certainly still many places where the attributes of text might be used to make data easier to find or understand.


Topics from the LLVM microconference

By Jake Edge
August 26, 2015

Linux Plumbers Conference

A persistent theme throughout the LLVM microconference at the 2015 Linux Plumbers Conference was that of "breaking the monopoly" of GCC, the GNU C library (glibc), and other tools that are relied upon for building Linux systems. One could quibble with the "monopoly" term, since it is self-imposed and not being forced from the outside, but the general idea is clear: using multiple tools to build our software will help us in numerous ways.

Kernel and Clang

Most of the microconference was presentation-oriented, with relatively little discussion. Jan-Simon Möller kicked things off with a status report on the efforts to build a Linux kernel using LLVM's Clang C compiler. The number of patches needed for building the kernel has dropped from around 50 to 22 "small patches", he said. Most of those are in the kernel build system or are for little quirks in driver code. Of those, roughly two-thirds can likely be merged upstream, while the others are "ugly hacks" that will probably stay in the LLVM-Linux tree.

There are currently five patches needed in order to build a kernel for the x86 architecture. Two of those are for problems building the crypto code (the AES_NI assembly code will not build with the LLVM integrated assembler and there are longstanding problems with variable-length arrays in structures). The integrated assembler also has difficulty handling some "assembly" code that is used by the kernel build system to calculate offsets; GCC sees it as a string, but the integrated assembler tries to actually assemble it.
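
The construct in question looks roughly like the DEFINE() macro from the kernel's include/linux/kbuild.h, shown here in simplified form. The file is compiled with -S and a sed script scrapes the "->" lines out of the resulting assembly, so the string is never meant to be assembled at all:

    /* Emits lines like
     *     ->THREAD_SIZE $16384 sizeof(union thread_union)
     * into the generated .s file, which the build system turns into
     * #defines.  GCC passes the string through untouched; Clang's
     * integrated assembler tries to parse "->" as an instruction. */
    #define DEFINE(sym, val) \
            asm volatile("\n->" #sym " %0 " #val : : "i" (val))

    #define BLANK() \
            asm volatile("\n->" : : )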

The goal of building an "allyesconfig" kernel has not yet been realized, but a default configuration (defconfig) can be built using the most recent Git versions of LLVM and Clang. It currently requires disabling the integrated assembler for the entire build, but the goal is to disable it just for the files that need it.

Other architectures (including ARM for the Raspberry Pi 2) can be built using roughly half-a-dozen patches per architecture, Möller said. James Bottomley was concerned about the "Balkanization" of kernel builds once Linus Torvalds and others start using Clang for their builds; obsolete architectures and those not supported by LLVM may stop building altogether, he said. But microconference lead Behan Webster thought that to be an unlikely outcome. Red Hat and others will always build their kernels using GCC, he said, so that will be supported for quite a long time, if not forever.

Using multiple compilers

Kostya Serebryany is a member of the "dynamic testing tools" team at Google, whose goal is to provide tools that let the company's C++ developers find bugs without any help from the team. He was also one of the proponents of the "monopoly" term for GCC, since it is used to build the kernel, glibc, and all of the distribution binaries. But, he said, making all of that code buildable using other compilers will allow various other tools to also be run on the code.

For example, the AddressSanitizer (ASan) can be used to detect memory errors such as stack buffer overflows, use after free, use of stack memory after a function has returned, and so on. Likewise, ThreadSanitizer (TSan), MemorySanitizer (MSan), and UndefinedBehaviorSanitizer (UBSan) can find various kinds of problems in C and C++ code. But all are based on Clang and LLVM, so only code that can be built with that compiler suite can be sanitized using these tools.
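
As a small illustration, here is the kind of bug ASan reports at run time; building the same file with -fsanitize=thread, -fsanitize=memory, or -fsanitize=undefined enables the other checkers instead:

    /* Build with:  clang -g -fsanitize=address use-after-free.c
     * Running the program makes AddressSanitizer print a heap-use-after-free
     * report with stack traces for the allocation, the free, and the access. */
    #include <stdlib.h>

    int main(void)
    {
        int *p = malloc(10 * sizeof(*p));
        p[0] = 1;
        free(p);
        return p[0];   /* heap use after free, caught by ASan */
    }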

GCC already has some similar tools and the Linux kernel has added some as well (the kernel address sanitizer, for example), which have found various bugs, including quite a few security bugs. GCC's support has largely come about because of the competition with LLVM and still falls short in some areas, he said.

There are also techniques that go beyond "best effort" tools like the sanitizers. For example, fuzzing and hardening can be used to either find more bugs or to eliminate certain classes of bugs. He said that coverage-guided fuzzing can be used to zero in on problem areas in the code; LLVM's LibFuzzer can be used to perform that kind of fuzzing. He noted that the Heartbleed bug can be "found" using LibFuzzer in roughly five seconds on his laptop.
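
A LibFuzzer fuzz target is just a function with a fixed entry point that hands each generated input to the code under test; the toy parse_input() below (with a deliberate bug) stands in for real code, and the exact way the libFuzzer runtime is linked in varies by LLVM version:

    /* The libFuzzer runtime provides main() and calls this function over and
     * over with coverage-guided, mutated inputs (with newer Clang versions:
     * clang -g -fsanitize=address,fuzzer fuzz_target.c). */
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical code under test, containing a bug the fuzzer can find. */
    static void parse_input(const uint8_t *data, size_t size)
    {
        char buf[4];
        if (size >= 2 && data[0] == 'H' && data[1] == 'I')
            buf[size - 1] = 0;   /* out of bounds once size > 4 */
    }

    int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
    {
        parse_input(data, size);
        return 0;                /* non-zero return values are reserved */
    }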

Two useful hardening techniques are also available with LLVM: control flow integrity (CFI) and SafeStack. CFI will abort the program when it detects certain kinds of undesired behavior—for example, that a program's virtual function table has been altered. SafeStack protects against stack overflows by placing local variables on a separate stack, so that an overflowed buffer and the return address are no longer contiguous in memory.
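
A small sketch of the sort of code SafeStack is aimed at, with the relevant Clang invocations in the comment (the program is deliberately unsafe):

    /* SafeStack:  clang -fsanitize=safe-stack overflow.c
     *   buf[] is moved to a separate "unsafe" stack, so overflowing it
     *   cannot overwrite the return address on the regular stack.
     * CFI is enabled separately and needs LTO, roughly:
     *   clang++ -flto -fsanitize=cfi -fvisibility=hidden app.cc */
    #include <string.h>

    void copy_name(const char *name)
    {
        char buf[16];
        strcpy(buf, name);   /* potential overflow of a local buffer */
    }

    int main(int argc, char **argv)
    {
        if (argc > 1)
            copy_name(argv[1]);
        return 0;
    }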

Serebryany said that it was up to the community to break the monopoly. He was not suggesting simply switching to using LLVM exclusively, but to ensuring that the kernel, glibc, and distributions all could be built with it. Furthermore, he said that continuous integration should be set up so that all of these pieces can always be built with both compilers. When other compilers arrive, they should also be added into the mix.

To that end, Webster asked if Google could help getting the kernel patches needed to build with Clang upstream. Serebryany said that he thought that, by showing some of the advantages of being able to build with Clang (such as the fuzzing support), Google might be able to help get those patches merged.

BPF and LLVM

The "Berkeley Packet Filter" (BPF) language has expanded its role greatly over the years, moving from simply being used for packet filtering to now providing the in-kernel virtual machine for security (seccomp), tracing, and more. Alexei Starovoitov has been the driving force behind extending the BPF language (into eBPF) as well as expanding its scope in the kernel. LLVM can be used to compile eBPF programs for use by the kernel, so Starovoitov presented about the language and its uses at the microconference.

He began by noting wryly that he "works for Linus Torvalds" (in the same sense that all kernel developers do). He merged his first patches into GCC some fifteen years ago, but he has "gone over to Clang" in recent years.

The eBPF language is supported by both GCC and LLVM using backends that he wrote. He noted that the GCC backend is half the size of the LLVM version, but that the latter took much less time to write. "My vote goes to LLVM for the simplicity of the compiler", he said. The LLVM-BPF backend has been used to demonstrate how to write a backend for the compiler. It is now part of LLVM stable and will be released as part of LLVM 3.7.

GCC is built for a single backend, so a BPF-targeted compiler has to be built specifically, but LLVM makes all of its backends available through a command-line argument (--target bpf). LLVM also has an integrated assembler, so Clang can take the C code describing the BPF program and turn it into in-memory BPF bytecode that can be loaded into the kernel.
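
For instance, a trivial program can be compiled for the BPF target with a stock Clang; loading the resulting object is left to a separate loader, and on older LLVM versions the object may need to be produced in two steps via llc instead:

    /* Build with:  clang -O2 -target bpf -c minimal_bpf.c -o minimal_bpf.o
     * The result is an ELF object holding BPF bytecode that a loader can
     * hand to the kernel with the bpf() system call. */
    int minimal_prog(void *ctx)
    {
        /* a real program would examine ctx (a packet, kprobe context, ...) */
        return 0;
    }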

BPF for tracing is currently a hot area, Starovoitov said. It is a better alternative to SystemTap and runs two to three times faster than Oracle's DTrace. Part of that speed comes from LLVM's optimizations plus the kernel's internal just-in-time compiler for BPF bytecode.

Another interesting tool is the BPF Compiler Collection (BCC). It makes it easy to write and run BPF programs by embedding them into Python (either directly as strings in the Python program or by loading them from a C file). Underneath the Python "bpf" module is LLVM, which compiles the program before the Python code loads it into the kernel. The effect is like adding a simple printk() to the kernel without recompiling it (or rebooting). He noted that Brendan Gregg has added a bunch of example tools to show how to use the C+Python framework.
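
The embedded C fragment can be tiny; the probe below is modeled on BCC's hello-world example. The kprobe__ naming convention tells BCC which kernel function to attach to, and the Python side wraps this text with BPF(text=...) and reads the output with trace_print():

    /* Attached by BCC as a kprobe on sys_clone(); bpf_trace_printk() is a
     * helper provided by the BCC runtime that writes to the kernel trace
     * pipe.  This fragment is compiled by BCC, not as a standalone file. */
    int kprobe__sys_clone(void *ctx)
    {
        bpf_trace_printk("sys_clone() called\n");
        return 0;
    }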

Under the covers, the framework uses libbpfprog, which compiles a C source file into BPF bytecode using Clang/LLVM. It can also load the bytecode and any BPF maps into the kernel using the bpf() system call and attach the program(s) to various types of hooks (e.g. kprobes, tc classifiers/actions). The Python bpf module simply provides bindings for the library.

The presentation was replete with examples, which are available in the slides [PDF] as well.

Alternatives for the core

There was a fair amount of overlap between the last two sessions I was able to sit in on. Both Bernhard Rosenkraenzer and Khem Raj were interested in replacing more than just the compiler in building a Linux system. Traditionally, building a Linux system starts with GCC, glibc, and binutils, but there are now alternatives to those. How much of a Linux system can be built using those alternatives?

Some parts of binutils are still needed, Rosenkraenzer said. The binutils gold linker can be used instead of the traditional ld. (Other linker options were presented in Mark Charlebois's final session of the microconference, which I unfortunately had to miss.) The gas assembler from binutils can be replaced with Clang's integrated assembler for the most part, but there are still non-standard assembly constructs that require gas.

Tools like nm, ar, ranlib, and others will need to understand three different formats: regular object files, LLVM bitcode, and the GCC intermediate representation. Rosenkraenzer showed a shell-script wrapper that could be used to add this support to various utilities.

For the most part, GCC can be replaced by Clang. OpenMandriva switched to Clang as its primary compiler in 2014. The soon-to-be-released OpenMandriva 3 is almost all built with Clang 3.7. Some packages are still built with gcc or g++, however. OpenMandriva still needed to build GCC, though, to get required libraries such as libgcc, libatomic, and others (including, possibly, libstdc++).

The GCC compatibility claimed by Clang is too conservative, Rosenkraenzer said. The GCC version advertised by Clang's __GNUC__ macros is 4.2.1, but switching that to 4.9 produces better code. There are a couple of thoughts on why Clang chose 4.2.1, and they are related: 4.2.1 was the last GPLv2 release of GCC, so some people may not be allowed to look at later versions; in addition, GCC 4.2.1 was the last version used to build the BSD portions of OS X.
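
The practical effect shows up in version checks like the following: code gated on a newer GCC is silently skipped when Clang reports itself as 4.2.1, even if Clang actually supports the feature in question, which is why bumping the advertised version can produce better code:

    #include <stdio.h>

    int main(void)
    {
    #if defined(__clang__)
        /* Clang defines the GCC version macros as 4.2.1 by default. */
        printf("Clang advertising GCC %d.%d.%d\n",
               __GNUC__, __GNUC_MINOR__, __GNUC_PATCHLEVEL__);
    #endif

    #if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 9)
        printf("using the GCC >= 4.9 code paths\n");
    #else
        printf("falling back to the conservative code paths\n");
    #endif
        return 0;
    }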

There is a whole list of GCC-isms that should be avoided for compatibility with Clang. Rosenkraenzer's slides [PDF] list many of them. He noted that there have been a number of bugs found via Clang warnings or errors when building various programs—GCC did not complain about those problems.

Another "monopoly component" that one might want to replace would be glibc. The musl libc alternative is viable, but only if binary compatibility with other distributions is not required. But musl cannot be built with Clang, at least yet.

Replacing GCC's libstdc++ with LLVM's libc++ is possible but, again, binary compatibility is sacrificed. That is a bigger problem than it is for musl, though, Rosenkraenzer said. Using both is possible, but there are problems when a library (e.g. Qt) is linked against, say, libc++ while a binary-only Qt program uses libstdc++; that leads to crashes. libc++ is roughly half the size of libstdc++, however, so environments like Android (which never used libstdc++) are making the switch.

Cross-compiling under LLVM/Clang is easier since all of the backends are present and compilers for each new target do not need to be built. There is still a need to build the cross-toolchains, though, for binutils, libatomic, and so on. Rosenkraenzer has been working on a tool to do automated bootstrapping of the toolchain and core system.

Conclusion

It seems clear that use of LLVM within Linux is growing and that growth is having a positive effect. The competition with GCC is helping both to become better compilers, while building our tools with both is finding bugs in critical components like the kernel. Whether it is called "breaking the monopoly" or "diversifying the build choices", this trend is making beneficial changes to our ecosystem.

[I would like to thank the Linux Plumbers Conference organizing committee for travel assistance to Seattle for LPC.]


Reviving the Hershey fonts

By Nathan Willis
August 26, 2015

TypeCon

At the 2015 edition of TypeCon in Denver, Adobe's Frank Grießhammer presented his work reviving the famous Hershey fonts from the Mid-Century era of computing. The original fonts were tailor-made for early vector-based output devices but, although they have retained a loyal following (often as a historical curiosity), they have never before been produced as an installable digital font.

Grießhammer started his talk by acknowledging his growing reputation for obscure topics—in 2013, he presented a tool for rapid generation of the Unicode box-drawing characters—but argued that the Hershey fonts were overdue for proper recognition. He first became interested in the fonts and their peculiar history in 2014, when he was surprised to find a well-designed commercial font that used only straight line segments for its outlines. The references indicated that this choice was inspired by the Hershey fonts, which led Grießhammer to dig into the topic further.

[Frank Grießhammer]

The fonts are named for their creator, Allen V. Hershey (1910–2004), a physicist working at the US Naval Weapons Laboratory in the 1960s. At that time, the laboratory used one of the era's most advanced computers, the IBM Naval Ordnance Research Calculator (NORC), a vacuum-tube and magnetic-tape based machine. NORC's output was provided by the General Dynamics S-C 4020, which could either plot on a CRT display or directly onto microfilm. It was groundbreaking for the time, since the S-C 4020 could plot diagrams and charts directly, rather than simply outputting tables that had to be hand-drawn by draftsmen after the fact.

By default, the S-C 4020 would output text by projecting light through a set of letter stencils, but Hershey evidently saw untapped potential in the S-C 4020's plotting capabilities. Using the plotting functions, he designed a set of high-quality Latin fonts (both upright and italics), followed by Greek, a full set of mathematical and technical symbols, blackletter and Lombardic letterforms, and an extensive set of Japanese glyphs—around 2,300 characters in total. Befitting the S-C 4020's plotting capabilities, the letters were formed entirely by straight line segments.

The format used to store the coordinates of the curves is, to say the least, unusual. Each coordinate point is stored as a pair of ASCII letters, where the numeric value of each letter is found by taking its offset from the letter R. That is, "S" has a value of +1, while "L" has a value of -6. The points are plotted with the origin in the center of the drawing area, with x increasing to the right and y increasing downward.
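
Decoding is straightforward once the convention is known; a small sketch:

    #include <stdio.h>

    /* Decode one Hershey coordinate pair: each letter's value is its offset
     * from 'R', so 'R' is 0, 'S' is +1, and 'L' is -6.  The origin is the
     * center of the drawing area; x grows rightward and y grows downward. */
    void decode_pair(char cx, char cy, int *x, int *y)
    {
        *x = cx - 'R';
        *y = cy - 'R';
    }

    int main(void)
    {
        int x, y;
        decode_pair('S', 'L', &x, &y);
        printf("\"SL\" -> (%d, %d)\n", x, y);   /* prints (1, -6) */
        return 0;
    }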

[Hershey font sample]

Typographically, Hershey's designs were commendable; he drew his characters based on historical samples, implemented his own ligatures, and even created multiple optical sizes. Hershey then proceeded to develop four separate styles that each used different numbers of strokes (named "simplex," "duplex," "complex," and "triplex").

The project probably makes Hershey the inventor of "desktop publishing" if not "digital type" itself, Grießhammer said, but Hershey himself is all but forgotten. There is scant information about him online, Grießhammer said; he has still not even been able to locate a photograph (although, he added, Hershey may be one of the unnamed individuals seen in group shots of the NORC room, which can be found online).

Hershey's vector font set has lived on as a subject for computing enthusiasts, however. The source files are in the public domain (a copy of the surviving documents is available from the Ghostscript project, for example) and there are a number of software projects online that can read their peculiar format and reproduce the shapes. At his GitHub page, Grießhammer has links to several of them, such as Kamal Mostafa's libhersheyfont. Inkscape users may also be familiar with the Hershey Text extension, which can generate SVG paths based on a subset of the Hershey fonts. In that form, the paths are suitable for use with various plotters, laser-cutters, or CNC mills; the extension was developed by Evil Mad Scientist Laboratories for use with such devices.

Nevertheless, there has never been an implementation of the designs in PostScript, TrueType, or OpenType format, so they cannot be used to render text in standard widgets or elements. Consequently, Grießhammer set out to create his own. He wrote a script to convert the original vector instructions into Bézier paths in UFO format, then had to associate the resulting shapes with the correct Unicode codepoints—Hershey's work having predated Unicode by decades.

The result is not quite ready for release, he said. Hershey's designs are zero-width paths, which makes sense for drawing with a CRT, but is not how modern outline fonts work. To be usable in TrueType or OpenType form, each line segment needs to be traced in outline form to make a thin rectangle. That can be done, he reported, but he is still working out what outlining options create the most useful final product. The UFO files, though, can be used to create either TrueType or OpenType fonts.

When finished, Grießhammer said, he plans to release the project under an open source license at github.com/adobe-fonts/hershey. He hopes that it will not only be useful, but will also bring some more attention to Hershey himself and his contribution to modern digital publishing.


Page editor: Jonathan Corbet


Copyright © 2015, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds