
File-format analysis tools for archivists

May 25, 2016

This article was contributed by Gary McGath

Preserving files for the long term isn't as easy as just putting them on a drive. As xkcd points out, in its subtle way, some other issues are involved. Will the software of the future be able to read the files of today without losing information? If it can, will people be able to tell what those files contain and where they came from?

Digital archives and libraries store files for future generations, just as physical ones store books, photographs, and art; the digital institutions have a similar responsibility for the preservation of electronic documents. In a way, digital data is more problematic, since file formats change more quickly than human languages. On the other hand, effective use of metadata lets a file carry its history with it.

For these reasons, detailed characterization of files is important. The file command just isn't enough, so developers have created a variety of open-source tools to check the quality of documents going into archives. These tools analyze files, reporting those that are outright broken or might cause problems, and showing how forthcoming or reticent the files are about describing themselves. We can break the concerns down into several issues:

  • Exact format identification: Knowing the MIME type isn't enough. The version can make a difference in software compatibility, and formats come in different "profiles," or restrictions of the format for special purposes. For instance, PDF/A is a profile of PDF that requires a file to have certain structural features but no external dependencies. PDF/A is better for archiving (which is what the "A" stands for) than most other PDF files.
  • Format durability: Software that can read any given format fades into obsolescence if there isn't enough interest to keep it updated. Which formats will fare best is a guessing game, but open and widely known formats are a safer bet than proprietary or obscure ones.
  • Strict validation: Many software projects follow Postel's Law: "Be liberal in what you accept and conservative in what you send." Archiving software, though, stands on both sides of the fence. It accepts files in order to give them to an audience that doesn't even exist yet. This means it should be conservative in what it accepts.
  • Metadata extraction: A file with a lot of identifying metadata, such as XMP or Exif, is a better candidate for an archive than one with very little. An archive adds a lot of value if it makes rich, searchable metadata available.

A number of open-source applications, some of which we will look at below, address these concerns. Most of them come from software developers in the library and preservation communities. Some focus on a small number of formats in intense detail; others cover lots of formats but generally don't go as deep. Some just identify files, while others pull out metadata.

JHOVE

[JHOVE output]

JHOVE (JSTOR-Harvard Object Validation Environment) is the most demanding and obsessive of the lot. It covers a small number of formats in a nitpicking way, which is useful for making sure that software in the future won't have problems. It examines files exhaustively, analyzing them for validity, identifying versions and profiles, and pulling out lots of metadata. I worked on it for a decade, joining the project at the Harvard University Libraries in 2003, writing the bulk of the code, and continuing to support it after I left Harvard. It's now in the hands of the Open Preservation Foundation, which has just released version 1.14.

JHOVE is written in Java and is available under the GNU LGPL license (v2.1 or later). It includes modules for fifteen formats, including image, audio, text-based, and PDF formats. New in version 1.14 (and not yet listed in the documentation) are PNG, GZIP, and WARC.

Each module does extensive analysis on files, looking for any violations of the specification. A file that conforms to the syntactic requirements is considered "well-formed." If it also meets the semantic requirements, it's "valid." For instance, an XML file is well-formed when its tags are all properly matched and nested, etc., and it's valid when it matches its schema, if any.

The fallback format is "Bytestream," which is just a stream of bytes, in other words, any file. In the default configuration, JHOVE applies all of its modules against a file and reports the first one to declare it well-formed and valid. If no other module matches, it reports that the file is a Bytestream. It's also possible to run JHOVE to apply just a single module, for the format that a file is supposed to be. This is useful with defective files, since it will report how they aren't well-formed or valid. That's more helpful than simply declaring them Bytestreams.
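
A minimal run in that default mode (assuming jhove is on the PATH, and using a hypothetical file of uncertain origin) is simply:

    jhove mystery.dat

The status line in the resulting report says whether the file is well-formed and valid according to whichever module claimed it.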

If a file is valid, JHOVE will report the version of the format, any profiles that it satisfies, and lots of file metadata. The output can be in plain text or XML. The GUI version shows its output as an expandable outline and can save it as text or XML.

To examine a known TIFF file and get output in XML, the command might be:

    jhove -m TIFF-hul -h xml example.tif

Other Java applications can call JHOVE through its API.

JHOVE is strict, but it isn't designed to examine the data streams in a file, only the file's structure. For instance, in an LZW-compressed TIFF file, it will check that all the tags are well-formed, including StripOffsets and StripByteCounts, but it won't check that the actual strips (i.e., the compressed pixel data) are well-formed LZW data. Thus, JHOVE will catch subtle errors, but it won't find all defects.

DROID and PRONOM

Archivists often have large batches of files to process and need a big picture of what they have: how many files in each format, how many risky files, changes in format usage by year or month, how widely older format versions are still being used, and so on. This is where DROID shows its strength. It's available from the UK National Archives under the three-clause BSD license. Its main purpose is to screen and identify files as they're being ingested into an archive. It works with the National Archives' PRONOM database of formats, identifying files on the basis of their signature or "magic number."

[DROID GUI]

In this regard, it's similar to file, but it makes finer-grained distinctions among formats. For example, within the TIFF family, PRONOM distinguishes Digital Negative (DNG), a universal raw camera format based on TIFF; TIFF-FX for fax images; and Exif files, which are TIFF metadata without an image.

DROID is good at processing large batches of files. Analyzing them involves two steps. First the user "profiles" a set of files, collecting information on them into a single document. From the command line, the user can specify filters telling DROID which files to profile. Unfortunately, the filter language is difficult to figure out, and the documentation isn't as helpful as it might be, but fortunately there's a Google group where people can answer questions. The second step is to generate a report. One command can do both of these. Here's a relatively simple example with a filter that accepts only PDF files and generates a report as a CSV file.

    droid.sh -p "result1.droid" -e "result1.csv" -F "file_ext contains 'pdf'"
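
The two steps can also be run separately: one command to build the profile from a directory of files, a second to export it. The sketch below assumes DROID's -a (add resource) and -R (recurse into folders) options; that matches my recollection of the current command line, but the details may differ between releases:

    droid.sh -R -a "/data/incoming" -p "result1.droid"
    droid.sh -p "result1.droid" -e "result1.csv"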

Running DROID as a GUI application is easier. In this case, profile creation and report generation are separate steps.

DROID doesn't do much validation or metadata extraction, but it's strong on identifying the format of a file by looking at its signature. This is valuable when processing a large number of files for an archive and weeding out the files that aren't in suitable formats.

ExifTool

Phil Harvey's ExifTool has a different focus. Its specialty is fiddling with metadata and, in spite of its name, it knows about lots of metadata types, not just Exif. It can modify as well as view files, and it's adept at tricks like assigning an author to a group of files or fixing a timestamp that's in the wrong time zone. Its main interest for archivists is its ability to grab and report the metadata in files.
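
To give one example of those editing tricks, assigning an author to a whole batch of files is a one-liner (the name and the files here are invented, and the files need a writable Author tag):

    exiftool -Author="Jane Archivist" *.pdf

By default, ExifTool keeps each original file with an _original suffix, so a mistake can be backed out.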

It deals mostly, but not exclusively, with audio, image, and video formats. It does simple signature-based format identification, along with just enough validation to identify the metadata in a file. ExifTool is available under the Perl license.

It's a versatile piece of software with extensive scripting capabilities. Perl applications can use it through Image::ExifTool. Other code can use its command-line interface as an API, using the -@ and -stay_open options to feed it commands through standard input or an argument file. In addition, a library wraps the command-line interface for use in C++ programs.
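
A minimal sketch of that command-line-as-API mode, assuming an argument file named cmds.txt that the calling program appends to:

    exiftool -stay_open True -@ cmds.txt

The caller writes one argument per line to cmds.txt and ends each request with a line containing -execute; writing -stay_open followed by False shuts the process down.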

ExifTool treats all file properties and metadata as "tags." A command can request specific tags or tag groups. The following command will return a file's type, MIME type, and usual format extension:

    exiftool -filetype -mimetype -filetypeextension sample.png

The output for this would be as follows, assuming it's really a PNG file:

    File Type                       : PNG
    MIME Type                       : image/png
    File Type Extension             : png

A variety of export options are available, including HTML, RDF XML, JSON, and plain text. Output can be sorted, and some tags have formatting options.
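
For instance, to get the same three tags as JSON instead of the aligned text above:

    exiftool -json -filetype -mimetype -filetypeextension sample.png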

Putting it all together: FITS

What if you want a second opinion on a file? Maybe even a third or fourth?

There are lots of free-software tools for file identification and metadata extraction, and space doesn't allow discussing all of them here. Others include MediaInfo, which extracts metadata from audio and video files; the National Library of New Zealand (NLNZ) Metadata Extraction Tool, which specializes in a few archive-friendly formats; and Apache Tika, which extracts metadata from over a thousand formats.

All of these applications report different information, and they don't always agree with each other. Some produce more fine-grained identification than others, and some are fussier than others about whether a file is valid. It's desirable to use more than one tool, in case one of them doesn't handle certain cases well. The Harvard Library's File Information Tool Set (FITS) allows using a dozen different tools together.

FITS originally served as a gatekeeper for Harvard's Digital Repository Service (DRS), and it still does. Other institutions now use it too. I worked only briefly on FITS, but my efforts played a significant role in moving it from a Harvard-only tool to one with a larger user and support community. It is available under the LGPLv3.

DROID, ExifTool, and JHOVE are all parts of the repertoire of FITS. So are Tika, file, MediaInfo, the NLNZ Metadata Extractor, an unsupported but still sometimes useful tool called ffident, and several in-house tools.

For all its complexity, running FITS is fairly simple. Here's the simplest useful command, which processes the given file with all of the bundled tools:

    fits -i sample.png

Combining all the tools is tricky for several reasons. They're written in different languages; FITS is in Java, and it invokes non-Java software such as ExifTool through the command-line interface. Their output comes in a variety of formats, and each tool uses its own terminology.

Where the component tools can produce XML, FITS uses XSLT to convert it to "FITS XML," and then consolidates the outputs into a single XML file. Optionally, it will convert FITS XML to metadata schemas that archives and libraries commonly use, such as MIX, TextMD, and AES Audio Object.
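
If I recall the options correctly, that conversion is requested on the command line with -x (or -xc to keep the FITS XML alongside the standard-schema output); treat the exact flags as an assumption, since they may vary between FITS releases:

    fits -i sample.png -x -o sample-mix.xml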

Often the tools won't completely agree about the file, and FITS tries to do conflict resolution. The identification section of the FITS XML output lists the tools that identified the file; if they disagree, it will have the attribute status=CONFLICT. Those who just want one answer can select an ordering preference for the tools and set the conflict reporting configuration element to false. The first tool to give an answer wins.

Because FITS incorporates so many tools into a single application, each with its own development cycle, it's a complicated piece of software to manage. Sometimes it has to stay with an older version of a tool until FITS itself can be updated to work with the latest release.

Final thoughts

Identifying formats and characterizing files is a tricky business. Specifications are sometimes ambiguous. Practices that differ from the letter of the spec may become common; for instance, TIFF's requirement for even-byte alignment is deemed archaic. People have different views on how much error, if any, is acceptable. Being too fussy can ban perfectly usable files from archives.

Specialists are passionate about the answers, and there often isn't one clearly correct answer. It's not surprising that different tools with different philosophies compete, and that the best approach can be to combine and compare their outputs.





File-format analysis tools for archivists

Posted May 26, 2016 10:26 UTC (Thu) by pabs (subscriber, #43278)

A useful resource on file formats is the Archive Team's file formats wiki:

http://fileformats.archiveteam.org/

File-format analysis tools for archivists

Posted May 26, 2016 14:57 UTC (Thu) by oever (guest, #987)

File formats are a very interesting topic. While working on Microsoft Office import filters for Calligra, I wrote a small grammar for writing file format schemas. From that one can generate (de)serialization code for those file formats. There is an effort to write such grammars for more file formats, using a language called the Data Format Description Language (DFDL).

Writing grammars for file formats is the best way of doing IO. Writing IO by hand is just silly. If a file format has a grammar, it's easier to support it in new software.

