Some useful tools for binary formats

February 28, 2023

This article was contributed by Koen Vervloesem

Linux users often work with text files; tools like grep, awk, and sed are standard utilities in their toolbox. However, these tools fall short when it comes to extracting or editing data in binary files, analyzing corrupt media files, or parsing a binary data format. FOSDEM 2023 in Brussels had a whole binary tools devroom dedicated to open-source programs that deal with binary data.

Line-based text files can be handled with the standard tools, but even better tools exist for data formats that store structured data in text, like JSON, YAML, and XML. For JSON, the command-line processor jq has become popular. It was also the inspiration for at least two tools called yq that handle YAML, JSON, XML, and other text-based formats: one by Mike Farah and another by Andrey Kislyuk.

fq, or jq for binary formats

At FOSDEM, Mattias Wadman spoke about his tool fq, which he calls "jq for binary formats". He developed it because he works with media files a lot in his job and wanted to have a command-line tool to help with debugging broken media files, looking for unusual values or data structures, or just to automate data-extraction tasks over multiple media files.

Wadman explained that he liked jq a lot because of its CLI-friendly syntax to query, display, and transform JSON data. The syntax is terse and composable, it's easy to iterate and recurse over elements in a data structure, and it has powerful means to select and transform data. He wanted the same possibilities for data in a binary format, so he wrote fq. It is written in Go, and based on a Go implementation of jq. Both projects are MIT-licensed.

Fq has an expressive syntax. In its simplest form, fq d file recursively shows a tree of decoded data structures, truncating long arrays, whereas fq dd file shows all bytes. With fq . file, only the top level of the tree is shown. It's also possible to select values that fulfill some condition, extract specific parts of the file's data structures, or compute histograms of values.
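The kind of query described here (select the values that meet a condition, then tally them) can be pictured in plain Python. This is only an illustration of the idea, not fq's syntax or API; the frames list, its field names, and the select() and histogram() helpers are hypothetical stand-ins for a decoded tree and the operations applied to it.

```python
from collections import Counter

# Hypothetical decoded tree: a list of frame headers, as a decoder might expose them.
frames = [
    {"bitrate": 128000, "sample_rate": 44100},
    {"bitrate": 128000, "sample_rate": 44100},
    {"bitrate": 64000, "sample_rate": 44100},
    {"bitrate": 128000, "sample_rate": 48000},
]


def select(items, predicate):
    """Return only the items for which predicate(item) is true."""
    return [item for item in items if predicate(item)]


def histogram(items, key):
    """Count how often each value of `key` occurs."""
    return Counter(item[key] for item in items)


# Frames whose sample rate differs from the 44100 Hz used by the others.
unusual = select(frames, lambda f: f["sample_rate"] != 44100)
bitrate_counts = histogram(frames, "bitrate")

print(unusual)
print(dict(bitrate_counts))
```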

This way, fq can be used for quick data-manipulation tasks on the command line or as part of a shell script. It can also be started as an interactive read-eval-print loop (REPL) using fq -i . file; the REPL offers autocompletion, which makes it easy to discover structures in the file and navigate through the tree. Another way to use fq is as a script interpreter. Someone who regularly runs the same fq command can put it in a script like:

    #!/usr/bin/env -S fq -d mp3 -rf
    # env -S splits the rest of the line into separate arguments (Linux passes it as one)
    [.frames[].header | .sample_count / .sample_rate] | add

This example computes the duration of an MP3 file by selecting the header of all frames, for each of these headers dividing the number of samples in the frame by its sample rate, and then adding all those results. In the same way, files in many other formats can be handled. The project also has some documentation about implementing a decoder for a new format.
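For readers less familiar with jq syntax, the same arithmetic can be written out in Python. This is a sketch of what the expression computes, not of how fq works; the frames list is hypothetical sample data, and 1152 samples per frame at 44100 Hz are simply common values for MPEG-1 Layer III audio.

```python
# Hypothetical decoded MP3: each frame carries a header with the two
# fields used by the fq script.
frames = [
    {"header": {"sample_count": 1152, "sample_rate": 44100}},
    {"header": {"sample_count": 1152, "sample_rate": 44100}},
    {"header": {"sample_count": 1152, "sample_rate": 44100}},
]


def duration_seconds(frames):
    """Sum the per-frame durations (samples divided by samples per second)."""
    return sum(
        frame["header"]["sample_count"] / frame["header"]["sample_rate"]
        for frame in frames
    )


print(f"{duration_seconds(frames):.4f} seconds")
```

Computing the duration per frame, rather than once for the whole file, also copes with files in which the sample rate or frame size changes along the way.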

Kaitai Struct

While fq is a useful command-line tool, Kaitai Struct focuses more on creating parsers for binary structures in a declarative and language-neutral way. Petr Pučil, a developer and maintainer of Kaitai Struct, talked about how he discovered the project. He wanted to create a Musical Instrument Digital Interface (MIDI) editor and, as part of this, he started writing a parser for SoundFont 2 (.sf2) files. "This was hard, and no fun", Pučil admitted. But when he found Kaitai Struct, it allowed him to do in one day what had taken him two months before.

So what is Kaitai Struct? Fundamentally, it's a declarative language that is used to describe arbitrary binary data structures. A data format is described in a .ksy file, using the YAML format. This is particularly interesting, because it can be used as a formal specification for the format. At the time of this writing, Kaitai Struct's format gallery has 181 specifications, including the ZIP archive format, the GPT partition table, the Executable and Linkable Format (ELF) for executable files, and the Audio Video Interleave (AVI) multimedia container format.
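The appeal of a declarative description is that the format is written down as data and a generic routine does the decoding. The Python sketch below illustrates that idea with the standard struct module. It is neither Kaitai Struct's .ksy syntax nor its run-time library; the HEADER_SPEC table, the parse() helper, and the "DEMO" format are made up for the example.

```python
import struct

# Hypothetical format description: (field name, struct format code).
# With the "<" prefix used below, fields are little-endian and unpadded:
# "4s" is four raw bytes, "H" an unsigned 16-bit integer,
# and "I" an unsigned 32-bit integer.
HEADER_SPEC = [
    ("magic", "4s"),
    ("version", "H"),
    ("entry_count", "I"),
]


def parse(spec, data, offset=0):
    """Decode `data` according to `spec` and return a dict of field values."""
    fmt = "<" + "".join(code for _, code in spec)
    values = struct.unpack_from(fmt, data, offset)
    return dict(zip((name for name, _ in spec), values))


# Ten bytes of sample data: magic, version 2, seven entries.
sample = b"DEMO" + (2).to_bytes(2, "little") + (7).to_bytes(4, "little")
header = parse(HEADER_SPEC, sample)
print(header)
```

Because the description is plain data, the same table could drive documentation or code generation for other languages, which is the direction Kaitai Struct takes much further.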

On top of that, Kaitai Struct is also a parser generator. When Mikhail Yakshin released its first source code in 2016, Kaitai Struct was able to compile the .ksy files into Java and Ruby code to parse the structures. By 2017, the project already supported 8 languages, and at this writing it is able to create parsers in 11 languages: C++/STL, C#, Go, Java, JavaScript, Lua, Nim, Perl, PHP, Python, and Ruby. In his talk, Pučil announced that Rust, C, and Julia parser generators are planned.

"It's really a write once, use everywhere approach", Pučil explained. Once a binary format is described in a .ksy file, Kaitai Struct's compiler is able to generate source code for parsers in these 11 programming languages. The compiler, ksc, is released under the GPLv3 license, but the run-time libraries for the generated code use the MIT or Apache-2.0 license. Thus generated code can even be used in proprietary applications.

Ksc has an interesting feature: one of its targets is not a programming language, but Graphviz's dot language. The resulting .dot file can be converted with Graphviz to a diagram showing the data structures defined in the .ksy file. This is an easy way to visualize a binary data format, for example as part of its documentation.

Kaitai Struct comes with a couple of other utilities. With the ksv command, files can be parsed using a .ksy file, with their structure and the raw data visualized side by side. The user can open and close parts of the tree structure to interactively explore the data. Another command that comes with the ksv package is ksdump. This also visualizes the tree structure, but non-interactively: it just dumps the parsed structure for the given data in YAML, JSON, or XML format to the terminal.

Another tool that comes in handy while developing a .ksy file is the Kaitai Web IDE. This is an online editor and visualizer that shows the raw data and the parsed object tree for a .ksy file while editing it. This immediate feedback can speed up developing a format specification. The Web IDE is licensed under the GPLv3 and can also be run locally.

Pučil ended his talk with a recent development: the Kaitai Struct team has been working for six months on serialization support. Serialization is the opposite of parsing: while parsing creates a structure from binary data (so it reads binary data), serialization converts a structure to binary data (so it writes binary data). The work on serialization has been financially supported by the NLnet Foundation. Currently, Kaitai Struct has a serialization generator for Java working and, according to Pučil, Python and C# implementations will be ready in two months.
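The relationship between the two directions is easy to show on a tiny record. The following Python sketch uses the standard struct module and is not code generated by Kaitai Struct; the Record type and its two-field layout are assumptions made for the example.

```python
import struct
from dataclasses import dataclass

# Hypothetical record layout: u16 identifier, then u32 size, little-endian.
RECORD = struct.Struct("<HI")


@dataclass
class Record:
    ident: int
    size: int


def parse(data: bytes) -> Record:
    """Parsing: binary data in, structure out."""
    return Record(*RECORD.unpack(data))


def serialize(record: Record) -> bytes:
    """Serialization: structure in, binary data out."""
    return RECORD.pack(record.ident, record.size)


raw = bytes([0x01, 0x00, 0x10, 0x00, 0x00, 0x00])
record = parse(raw)          # ident 1, size 16
record.size = 32             # edit the structure, not the bytes
edited = serialize(record)
print(edited.hex())
```

Parsing followed by serialization without any change reproduces the input bytes, which is a handy property to test for when both directions exist.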

With serialization, Kaitai Struct can be used to edit existing files in a binary format, to create new files from a structure, or even to convert a file between different formats. However, the current scope of the serialization implementation is rather narrow. The user must set everything in the data structure explicitly, including lengths, offsets, and magic numbers. Kaitai Struct only checks the consistency of the serialized data. Another limitation is that once a stream of data is created for serialization, it’s not possible to resize it later.
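What "setting everything explicitly" means in practice can be sketched in Python. The function below is a hypothetical illustration of such a workflow, not Kaitai Struct's serialization API: the caller supplies the magic number and the length field, the code only checks that they are consistent with the payload, and the output buffer has a fixed size that is chosen up front.

```python
import struct


def serialize_chunk(magic: bytes, length: int, payload: bytes, stream_size: int) -> bytes:
    """Write magic, a u32 little-endian length, and payload into a fixed-size buffer.

    Nothing is derived automatically: the caller supplies `length`, and this
    function only verifies that it agrees with the payload.
    """
    if len(magic) != 4:
        raise ValueError("magic must be exactly 4 bytes")
    if length != len(payload):
        raise ValueError(f"length field says {length}, payload has {len(payload)} bytes")
    needed = 4 + 4 + len(payload)
    if needed > stream_size:
        raise ValueError("fixed-size stream is too small and cannot grow")

    buf = bytearray(stream_size)                      # size is decided once, here
    struct.pack_into("<4sI", buf, 0, magic, length)   # header at offset 0
    buf[8:8 + len(payload)] = payload                 # payload right after the header
    return bytes(buf)


chunk = serialize_chunk(b"DEMO", 3, b"abc", stream_size=16)
print(chunk.hex())
```

A more convenient serializer would compute the length from the payload itself; the sketch deliberately does not, to mirror the limitation described above.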

GNU poke

The GNU poke project had two talks in FOSDEM's binary tools devroom. José E. Marchesi gave an introduction to the tool and a status update of the project. GNU poke is an interactive, extensible command-line editor for structured binary data. In contrast to a simple hex editor, GNU poke lets the user edit the data not only as a raw stream of bytes, but also in a structured way. Marchesi called GNU poke especially useful for reverse engineering and prototyping.
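The difference between raw and structured editing can be illustrated outside of GNU poke. The Python sketch below is not Poke code and does not use libpoke; the eight-byte header layout and the HeaderView class are hypothetical. It applies the same change twice: once by poking a byte at a known offset, and once through a named field that knows its own offset and encoding.

```python
import struct

# Hypothetical 8-byte header: a 4-byte tag followed by a u32 little-endian count.
data = bytearray(b"HDR0" + (5).to_bytes(4, "little"))

# Raw edit: the user has to know that the count lives at offset 4
# and that its least significant byte comes first.
raw = bytearray(data)
raw[4] = 6


class HeaderView:
    """Structured view that maps a named field onto the underlying bytes."""

    def __init__(self, buf: bytearray):
        self._buf = buf

    @property
    def count(self) -> int:
        return struct.unpack_from("<I", self._buf, 4)[0]

    @count.setter
    def count(self, value: int) -> None:
        struct.pack_into("<I", self._buf, 4, value)


# Structured edit: assign to the field and let the view handle the encoding.
structured = bytearray(data)
HeaderView(structured).count = 6

print(raw == structured)
```

The structured form keeps working for values that do not fit in one byte, where a raw edit would require touching several bytes in the right order.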

Of course, the poke program needs to know the data format the user is working with, and that's why the program comes with almost 50 "pickles". These are scripts in the Poke language that can read and write specific binary formats. Some pickles are even complex enough that they are distributed in their own packages, such as poke-elf to edit ELF object files, executables, shared libraries, and core dumps.

Users can also write their own pickles to edit files in other binary formats. The Poke language is statically typed and garbage-collected. A pickle includes the definitions of types, variables, and functions needed to parse and serialize the data format. So, by writing a custom pickle and loading it into GNU poke, users can create their own custom binary data editor.

The basic functionality of GNU poke is implemented in a C library, libpoke, which allows its features to be integrated into other tools. Marchesi demonstrated an example of this, where libpoke was integrated into GDB. "With GDB good at debugging and GNU poke good at poking at data, combining the two results in a tool that excels at both tasks", Marchesi concluded.

Apart from GNU poke's home page, there are two other web sites to learn more about the tool: Pokology is a community-driven web site maintained by poke users and developers, and Applied Pokology is Marchesi's blog about GNU poke.

GNU poke contributor Mohammad-Reza Nabipoor spoke about alternative user interfaces for the tool. In the recent GNU poke 3.0 release, Nabipoor contributed the poke daemon, poked. It links with libpoke and exposes its functionality to other programs over Unix sockets. This way, multiple clients can interact with poked simultaneously.

Nabipoor, who uses GNU poke to manipulate Bluetooth formats, built upon poked to create pacme, which he calls "an acme-inspired GNU poke interface". The acme he refers to is Rob Pike's text editor from the distributed operating system Plan 9 from Bell Labs.

Pacme consists of a number of small C programs, called pokelets, that interact with poked. They execute Poke code, process the results, and implement several components of the user interface, such as the REPL, byte dumps, a tree viewer, and an editor of Poke data structures. For its user interface, pacme uses the terminal multiplexer tmux, with each pokelet shown in its own pane. There are some predefined layouts; alternatively, the user can open and arrange tmux panes manually and then start pokelets in them for a custom interface.

Binary toolkit

All in all, these tools can come in handy when working with various binary data formats. For extracting and filtering data on the command line, fq offers an accessible approach. For editing binary data, GNU poke is a powerful tool. For parsing, and perhaps serializing, binary data in one of the supported programming languages, Kaitai Struct offers a flexible approach.

While these three tools have their own use cases, it's a little unfortunate that each has its own format library. There's a lot of overlap between the formats so there is a fair amount of duplicated work. Kaitai Struct is actively collaborating with the Construct Python data parser and builder project, however, and there are ideas flowing in both directions. The Kaitai Struct compiler has Construct as one of its targets, which allows converting a .ksy file to a Python file describing the same data format using Construct's declarative language. More of this type of collaboration between various projects in the space of binary data formats would be fruitful.

Posted Mar 1, 2023 4:49 UTC (Wed) by anton.kochkov (guest, #161204)

There is also the realm of working with executable binary formats in a UNIX-y way, not only binary data formats. One good example is the Rizin framework (and Cutter, a Qt GUI for it). Apart from the standard features of a reverse-engineering tool, like disassembly, analysis, signatures, decompilation, and binary diffing, it offers some features that are suitable for reversing data formats as well: inspecting raw data, calculating hashes, entropy inspection, magic search, diffing, and so on.

See https://rizin.re for more information. Disclaimer: I am one of the developers of the tool.

Posted Mar 1, 2023 8:16 UTC (Wed) by sur5r (subscriber, #61490)

The "cutter" name left me confused at first, but it's indeed a fork of the radare2 stuff. Might be worth noting.

Posted Mar 1, 2023 13:07 UTC (Wed) by anton.kochkov (guest, #161204)

We forked multiple years ago, and there are significant differences in the code and features. We haven't been "just a fork" for quite a while. I recommend checking the blog articles to get a glimpse: https://rizin.re/posts/

Posted Mar 1, 2023 14:26 UTC (Wed) by sur5r (subscriber, #61490)

I didn't mean to undermine your work, sorry if it sounded like that. It was just that my mind connected "Cutter" and "Reverse Engineering" with radare2 and I was trying to understand if it was related or just a name collision.

Posted Mar 1, 2023 21:52 UTC (Wed) by bartoc (guest, #124262)

Yeah, I was surprised this was left out; it's my go-to when I need to mess around with binary formats because its data-specification mini-language is so terse but covers so much ground. It's also just really nice to be able to interactively get all kinds of different sorts of printouts. I kinda wish there was something like this "Kaitai Struct" program that used that instead of YAML; Kaitai's YAML is so very verbose I almost feel like I'm not saving much time.

Actually, I think ASN.1 is basically the ultimate language for binary format description, but it's really the kitchen sink and I don't know of any parser generator implementation that can really deal with the whole specification.

Posted Mar 1, 2023 6:41 UTC (Wed) by LtWorf (subscriber, #124958)

Nice article!

> the generated code use the MIT or Apache-2.0 license. Thus generated code can even be used in proprietary applications.

This had me confused. I was under the impression that generated code just carries the copyright of the source. Which is why gcc's output can have any license the author desires.

But I guess the situation here is that the generated code links/imports a certain library that is using that license, rather than being under that license in itself.

Posted Mar 1, 2023 11:24 UTC (Wed) by donaldh (subscriber, #151569)

The article says:

> the run-time libraries for the generated code use the MIT or Apache-2.0 license

So, yes, it's just the libraries that the generated code needs, not the generated code itself. You can presumably license the generated code any way you like.

Posted Mar 1, 2023 17:06 UTC (Wed) by iabervon (subscriber, #722)

There's a similar point with libgcc: sometimes the compiler chooses to generate a call to a non-obvious implementation of a common pattern instead of a direct transformation of your source, and then the license of that implementation becomes relevant. Such things tend to be permissively licensed, so that the compiler is as widely useable as possible. On the other hand, things like regular expression libraries tend to have more runtime support and also tend to expect that a system will include the compiler and not just its output, so they don't tend to produce something that is only a derived work of the input to the compiler. So it can be a question of whether the project thinks of itself like GCC or libpcre as far as how some of the code is licensed.

Kaitai not (really) included in Fedora

Posted Mar 1, 2023 14:17 UTC (Wed) by pebolle (guest, #35204)

Kaitai looks interesting. I wanted to play around with it on Fedora. Turns out there is only a python3-kaitaistruct package. Fedora doesn't seem to ship ksc, the compiler.

I guess Fedora doesn't ship the compiler because Fedora last contained SBT, the Scala build tool, in Fedora 29. That's about five years ago. And SBT is mentioned as a requirement for building the compiler. So I fear playing with this in Fedora requires a lot of yak shaving, involving a language (Scala) that I know only by name.

Has anyone tried to do this nevertheless?

(It's a bit odd that Fedora ships that Python package, because it apparently only allows you to run ksc-generated Python code, but not to generate that code by itself, as you seem to need the compiler to do that.)

So my playing around with kaitai stopped after discovering that Fedora doesn't include the tools to do it easily. Which took me an embarrassing amount of time.

Thanks

Posted Mar 2, 2023 6:47 UTC (Thu) by buck (subscriber, #55985)

Have to agree with a prior comment: Nice article!

A whole ecosystem of tools I had no knowledge of.

Have enjoyed many of your past contributions as well, and appreciate their variety, clarity, deeply informative nature, and ubiquity of links. Please keep up the good work.

(Which is to take nothing away from our humble editor and the rest of the crew, who continue to faithfully pound the kernel and software development beats, with like aplomb and trademark wit)