ML vs HDF5

Posted Mar 17, 2024 11:25 UTC (Sun) by summentier (subscriber, #100638)
In reply to: Insecurity and Python pickles by gspr
Parent article: Insecurity and Python pickles

My view is certainly somewhat contrarian, but I don't think HDF5 is a good file format for scientific data.

First of all, the HDF5 spec is enormous for a file format. This is not just a set of tensors organized in a tree structure; there is all sorts of additional stuff: attributes, compression, data layout, custom data types, you name it. For this reason, I disagree that there are “mature implementations in pretty much every language”. There really is only one feature-complete implementation: libhdf5, written in C, which pretty much every other language wraps. (Yes, there is jHDF for Java, which can only read, and there are JLD2 for Julia and some Rust crates, but none of them supported the full spec last time I checked.)

Because the HDF5 spec is so large and complex, you essentially have to tool around libhdf5 or use an HDF5 viewer every time you want to look at the data: hex dumps are of no use to you. But this also means that things like memory-mapping parts of a large dataset become a problem; libhdf5 to this day does not support this properly. Writing and reading HDF5 files is thus quite cumbersome and tends to be slow. Compare that with a simple binary file format like numpy's, where you simply have a text header followed by some binary data, and all of this becomes trivial.
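To make the "text header followed by binary data" point concrete, here is a minimal sketch in Python that parses a .npy header by hand and then memory-maps the same file. The helper names read_npy_header and demo are made up for illustration, and the sketch assumes a plain numeric array (no object dtype) and a NumPy recent enough that np.fromfile accepts an offset argument.

```python
import ast
import os
import struct
import tempfile

import numpy as np


def read_npy_header(path):
    """Hypothetical helper: parse a .npy header without numpy's own reader.

    Returns the header dict and the byte offset where the raw data starts.
    """
    with open(path, "rb") as fh:
        if fh.read(6) != b"\x93NUMPY":
            raise ValueError("not a .npy file")
        major, _minor = fh.read(2)
        # Format 1.0 stores the header length in 2 bytes, later versions in 4,
        # little-endian in both cases.
        if major == 1:
            (header_len,) = struct.unpack("<H", fh.read(2))
        else:
            (header_len,) = struct.unpack("<I", fh.read(4))
        # The header itself is a Python dict literal padded with spaces.
        header = ast.literal_eval(fh.read(header_len).decode("latin1"))
        return header, fh.tell()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "a.npy")
        np.save(path, np.arange(12, dtype="<f8").reshape(3, 4))

        header, offset = read_npy_header(path)

        # Everything after the header is the raw array, readable directly.
        raw = np.fromfile(path, dtype=header["descr"], offset=offset)
        raw = raw.reshape(header["shape"])

        # Memory-map the file and copy out a single row.
        view = np.load(path, mmap_mode="r")
        row = np.array(view[1])
        del view  # drop the mapping before the temporary directory is removed
        return header, offset, raw, row


if __name__ == "__main__":
    header, offset, raw, row = demo()
    print(header, offset, row)
```

The header can be read with nothing but the standard library, and the data that follows it is a flat dump of the array, which is why a hex dump or a few lines in almost any language are enough to get at it.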

What about HDF5 as an archiving format, then? Well, OK, there is a spec, but what use is that if, say, libhdf5 ceases to be maintained? In this case, how on Earth are we going to get the data out of a compressed data set with custom data types nestled deep into a tree hierarchy without reimplementing the spec? And even then, since there is essentially only one implementation that everyone uses, we have to pray that libhdf5 actually followed the HDF5 spec ...

In summary, I consider a tarball of binary files with text headers a la numpy a vastly superior file format to HDF5. It is clear, universally understood, and easy to view and tool around. (Of course, HDF5 not being a good format does not excuse using pickle ...)
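As a rough sketch of what such a tarball could look like in practice, the Python below packs a few named arrays into an in-memory tar archive as .npy members and reads them back. The function names pack_arrays and unpack_arrays and the member naming scheme (slashes standing in for a tree hierarchy) are assumptions made for this example, not an established convention. Pickle support is switched off on both the write and the read side.

```python
import io
import tarfile

import numpy as np


def pack_arrays(arrays):
    """Hypothetical helper: store each array as '<name>.npy' in a tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, arr in arrays.items():
            payload = io.BytesIO()
            np.save(payload, arr, allow_pickle=False)
            info = tarfile.TarInfo(name + ".npy")
            info.size = payload.tell()
            payload.seek(0)
            tar.addfile(info, payload)
    return buf.getvalue()


def unpack_arrays(blob):
    """Hypothetical helper: load every '.npy' member back into a dict."""
    out = {}
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".npy"):
                data = tar.extractfile(member).read()
                out[member.name[:-4]] = np.load(io.BytesIO(data), allow_pickle=False)
    return out


if __name__ == "__main__":
    arrays = {
        "run1/temperature": np.linspace(0.0, 1.0, 5),
        "run1/grid": np.arange(6, dtype=np.int32).reshape(2, 3),
    }
    restored = unpack_arrays(pack_arrays(arrays))
    print(sorted(restored))
```

Each member can also be pulled out with plain tar and inspected on its own, which is the property being argued for here.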



Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds