Insecurity and Python pickles

Posted Mar 13, 2024 21:35 UTC (Wed) by NYKevin (subscriber, #129325)
In reply to: Insecurity and Python pickles by rav
Parent article: Insecurity and Python pickles

I'm a bit confused about why people keep reinventing the "tabular data in a binary file" wheel. Are SQLite and/or DuckDB unsatisfactory for this purpose?

Looking through your code, it appears that one of the things you're doing is moving data between the file and NumPy. I'm guessing that a major sticking point here is that SQLite provides no obvious way to efficiently move large quantities of data between itself and NumPy. Interestingly, it looks like DuckDB does actually try to support that use case to some extent.[1] But I don't know enough about your use case to say whether that's actually good enough.

[1]: https://duckdb.org/docs/api/python/overview.html and pages linked from there.
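
For reference, a minimal sketch of that kind of round trip, assuming the third-party duckdb package; the table and column names here are invented, and fetchnumpy() is the result-conversion call described in the DuckDB Python docs linked above:

    import duckdb
    import numpy as np

    con = duckdb.connect("results.duckdb")   # or ":memory:" for a scratch database
    con.execute("CREATE TABLE IF NOT EXISTS samples (x DOUBLE, y DOUBLE)")

    xs = np.random.rand(1000)
    ys = np.random.rand(1000)
    # Row-by-row insertion works but is the slow path; DuckDB can also query
    # pandas/Arrow objects directly, which is what its docs recommend for bulk data.
    con.executemany("INSERT INTO samples VALUES (?, ?)",
                    list(zip(xs.tolist(), ys.tolist())))

    # fetchnumpy() returns a dict mapping column names to NumPy arrays.
    cols = con.execute("SELECT x, y FROM samples WHERE x > 0.5").fetchnumpy()
    print(cols["x"].shape, cols["x"].dtype)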


Insecurity and Python pickles

Posted Mar 13, 2024 22:05 UTC (Wed) by Wol (subscriber, #4433) [Link] (12 responses)

> I'm a bit confused about why people keep reinventing the "tabular data in a binary file" wheel. Are SQLite and/or DuckDB unsatisfactory for this purpose?

Maybe because a two-dimensional table is unusual, unnatural, and constricting?

Is SQLite capable of storing a 4th-normal-form structure in a single row?

Cheers,
Wol

Insecurity and Python pickles

Posted Mar 13, 2024 22:31 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (7 responses)

SQLite can store any individual thing you want, as long as you can serialize it to bytes. Of course, a table full of TEXT objects is much less useful than a properly normalized table, but it does provide all of the supporting infrastructure (i.e. actually writing data out to files, shared and exclusive locking, write-ahead logging, etc.) for free, so it's still better than hand-rolled code.

Insecurity and Python pickles

Posted Mar 14, 2024 1:09 UTC (Thu) by intelfx (subscriber, #130118) [Link] (6 responses)

> shared and exclusive locking, write-ahead logging, etc.

I don't quite see how any of this is useful in the context of a data _interchange_ file format. All of these files are written exactly once and then read or distributed. If something goes wrong while writing such a file, the partial result is simply discarded and recomputed, because it has no meaning.

Insecurity and Python pickles

Posted Mar 14, 2024 3:55 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (5 responses)

There is a problem with applying YAGNI here: YAGNI is supposed to *reduce* the amount of work you have to do, not increase it.

With SQLite: import sqlite3, then write a few lines of SQL. Done.

Without SQLite: You have to write out this JSON stuff by hand, make sure your format is unambiguous, parse it back in, etc., and probably you also want to write tests for all of that functionality.
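
To make that concrete, here is roughly what the "few lines of SQL" look like (a minimal sketch; the file, table, and column names are invented for illustration):

    import sqlite3

    con = sqlite3.connect("experiment.db")
    con.execute("CREATE TABLE IF NOT EXISTS runs (name TEXT, temperature REAL, notes TEXT)")
    con.execute("INSERT INTO runs VALUES (?, ?, ?)", ("run-1", 293.15, "baseline"))
    con.commit()

    for row in con.execute("SELECT name, temperature FROM runs"):
        print(row)
    con.close()

SQLite handles the on-disk format, locking, and atomicity; the script never has to touch bytes on disk directly.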

Insecurity and Python pickles

Posted Mar 14, 2024 3:58 UTC (Thu) by NYKevin (subscriber, #129325) [Link]

(And before anyone asks: sqlite3 is a standard library module. It is already installed in every reasonably modern version of Python. You do not have to download it, take a dependency on it, or faff about with pip.)

Insecurity and Python pickles

Posted Mar 14, 2024 7:36 UTC (Thu) by Wol (subscriber, #4433) [Link]

> Without SQLite: You have to write out this JSON stuff by hand, make sure your format is unambiguous, parse it back in, etc., and probably you also want to write tests for all of that functionality.

You're assuming your JSON doesn't have a schema/definition.

There's a whole bunch of JSON-like stuff (XML/DTD, Pick/MultiValue) where having a schema is optional but enforceable.

If you *declare* that JSON/XML/MV etc without a schema is broken, then all this stuff can be automated extremely easily.
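
As a rough illustration of what "enforceable" can look like in Python, a sketch using the third-party jsonschema package (the schema itself is invented):

    import json
    from jsonschema import validate, ValidationError  # pip install jsonschema

    # Hypothetical schema: every document must carry a version and a list of samples.
    SCHEMA = {
        "type": "object",
        "required": ["version", "samples"],
        "properties": {
            "version": {"type": "integer"},
            "samples": {"type": "array", "items": {"type": "number"}},
        },
    }

    doc = json.loads('{"version": 1, "samples": [0.1, 0.2, 0.3]}')
    try:
        validate(instance=doc, schema=SCHEMA)
    except ValidationError as err:
        raise SystemExit("rejecting JSON that does not match the schema: " + err.message)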

Cheers,
Wol

Insecurity and Python pickles

Posted Mar 14, 2024 9:14 UTC (Thu) by atnot (subscriber, #124910) [Link] (2 responses)

> With SQLite: import sqlite3, then write a few lines of SQL. Done.

That's not quite it. You need to laboriously map all of the objects you have into an SQL model first. Then learn about prepared statements, etc. if you don't already know all of this stuff, which as the average scientist you don't. That's easily a dozen lines of code.

> Without SQLite: You have to write out this JSON stuff by hand, make sure your format is unambiguous, parse it back in, etc., and probably you also want to write tests for all of that functionality.

All of this needs to be done for SQL too. You don't just magically get the valid Python objects you put in back out again. Even if you use a third-party ORM-like thing, what about third-party objects that were never designed for this? And tests are needed for all of this stuff.

It's not like Rust etc., where there's a de facto standard for ser/des that everything implements; all of this is real work.

Meanwhile with pickle: you import pickle, just give it the Python object you want to save, and it works. One line. And it's built right into the language. Sure, it's insecure, but maybe you'll fix that once this paper is out.
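
For comparison, the pickle version really is this short, which is exactly why it keeps getting reached for despite the security caveat:

    import pickle

    state = {"weights": [0.1, 0.2, 0.3], "epoch": 42}   # any Python object graph

    with open("state.pkl", "wb") as f:
        pickle.dump(state, f)

    # Loading is equally short - and this is the dangerous step: unpickling
    # untrusted data can execute arbitrary code.
    with open("state.pkl", "rb") as f:
        restored = pickle.load(f)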

Insecurity and Python pickles

Posted Mar 14, 2024 10:32 UTC (Thu) by aragilar (subscriber, #122569) [Link]

It depends on what you're working on and which libraries you're using, but tools like pandas make it fairly easy to dump out an sqlite file (see https://pandas.pydata.org/docs/user_guide/io.html#sql-que...). The larger Python web frameworks either provide serialisation support, or recommend third-party libraries and demonstrate their use in their docs. There isn't a universal library like serde, but I personally wouldn't use serde for HPC (wrong design), so I'm not sure this is the actual reason. My expectation is that people are using notebooks and want to pick up where they left off; pickle is fine as a "dump the current state of my work to a file" tool, but people then start sending that state around, and it gets embedded in workflows.
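
A sketch of the pandas route mentioned above (DataFrame contents and file names invented; to_sql()/read_sql() are the pandas entry points the linked docs cover):

    import sqlite3
    import pandas as pd

    df = pd.DataFrame({"wavelength": [500.1, 501.2, 502.3],
                       "flux": [1.3, 1.1, 0.9]})

    con = sqlite3.connect("catalogue.db")
    df.to_sql("spectra", con, if_exists="replace", index=False)   # dump in one line
    back = pd.read_sql("SELECT * FROM spectra WHERE flux > 1.0", con)
    con.close()
    print(back)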

Insecurity and Python pickles

Posted Mar 14, 2024 16:58 UTC (Thu) by NYKevin (subscriber, #129325) [Link]

> That's not quite it. You need to laboriously map all of the objects you have into an SQL model first. Then learn about prepared statements, etc. if you don't already know all of this stuff, which as the average scientist you don't. That's easily a dozen lines of code.

This is a game of "don't read the thread." I made that comment in response to an assertion that some data could not be mapped into SQL because it was not 2D. In that case, you already have to turn it into bytes anyway (e.g. with numpy.ndarray.tofile() into a BytesIO object, which was already being done in the code I was commenting on in the first place). My point is that you can put metadata and other such stuff into "real" SQL columns, and store anything that doesn't easily map to SQL objects as TEXT, and then you can skip the nonsense with JSON. You have not meaningfully responded to that assertion; you've simply talked past me.
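
A sketch of that pattern, with invented names; here the array is serialized with tobytes() so it can go straight into a BLOB column, with dtype and shape kept as ordinary columns:

    import sqlite3
    import numpy as np

    arr = np.arange(12, dtype=np.float64).reshape(3, 4)

    con = sqlite3.connect("arrays.db")
    con.execute("""CREATE TABLE IF NOT EXISTS tensors
                   (name TEXT PRIMARY KEY, dtype TEXT, shape TEXT, data BLOB)""")
    con.execute("INSERT OR REPLACE INTO tensors VALUES (?, ?, ?, ?)",
                ("weights", str(arr.dtype), ",".join(map(str, arr.shape)), arr.tobytes()))
    con.commit()

    dtype, shape, blob = con.execute(
        "SELECT dtype, shape, data FROM tensors WHERE name = ?", ("weights",)).fetchone()
    restored = np.frombuffer(blob, dtype=dtype).reshape(
        tuple(int(s) for s in shape.split(",")))
    assert np.array_equal(arr, restored)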

Insecurity and Python pickles

Posted Mar 14, 2024 9:28 UTC (Thu) by aragilar (subscriber, #122569) [Link] (3 responses)

My understanding of "tabular" is that it implies 2D (i.e. an array of records). In my experience, tabular/catalogue data makes sense in a database. Naturally, higher-dimensional data requires different tools (e.g. HDF5).

Insecurity and Python pickles

Posted Mar 14, 2024 16:54 UTC (Thu) by Wol (subscriber, #4433) [Link] (2 responses)

> My understanding of "tabular" is that it implies 2D (i.e. an array of records). In my experience, tabular/catalogue data makes sense in a database.

Actually, I would argue that tabular data does NOT make sense in a database. It starts with the very definition of data.

Relational data is defined as "coming in rows and columns". If it doesn't fit that definition, it's not data. Relational dictates what is acceptable.

My definition (the one I use in my databases) is "data is what the user gives you". I'm happy with whatever I'm given.

Now let's define metadata. I don't know what the Relational definition of metadata is, probably "it's the schema". My definition (that I use in my databases) is "metadata is data I have inferred about the data the user gave me". And one of my cardinal rules is NEVER EVER EVER MIX data and metadata in the same table !!!!!!

But that's exactly what the job of a relational data analyst is - to infer metadata about the user data, then promptly mix both data and metadata up in the same table. How else would you represent a list in a 2D table?

> Naturally, higher dimensional data requires different tools (e.g. HDF5).

No. Higher dimensional data requires a multi-dimensional database. Preferably with a query language that is multi-dimensional-aware.

SQL contains huge amounts of database functionality, because it was designed to query unstructured data. So the query had to be able to express the structure. Get a database where the schema can DEscribe the structure, and the query language can be simultaneously more expressive, more powerful, and simpler, because you've pushed a lot of the complexity into the database where it belongs.

SQL puts huge amounts of complexity in the wrong place. Some complexity is unavoidable, but dealing with it in the wrong place causes much AVOIDABLE complexity.

Just look back at my rants here about relational. Just don't set off another one, the regulars here will have their heads in their hands :-)

The best way to show roughly where I'm coming from is that I see everything as similar to an XML/DTD pair. That is EASY to manipulate with automated tools, and those tools are heavily optimised for fast, efficient processing. Okay, that's not an exact description of MultiValue, but it's close. Oh - and if I store one object per XML table, the tool makes it dead easy to link different objects together.

Cheers,
Wol

Insecurity and Python pickles

Posted Mar 15, 2024 8:43 UTC (Fri) by aragilar (subscriber, #122569) [Link] (1 responses)

I think we're using the same words to mean different things. The data I've worked with has come in two different forms:
* arrays of records (and collections of these arrays): generally having a db makes it easier and faster to do more complex queries over these vs multiple files (or a single file with multiple arrays), and formats designed for efficient use of "tabular" data (e.g. parquet) are better than random CSV/TSV.
* n-dimensional arrays: these represent images/cubes/higher moments of physical data (vs metadata), and so are different in kind to the arrays of records. This is where HDF5, netCDF and FITS (if you're doing observational astronomy) come in; a sketch of both shapes follows below.
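
A sketch of both shapes, assuming pandas with pyarrow for the Parquet side and h5py for the HDF5 side (file names and keys invented):

    import numpy as np
    import pandas as pd
    import h5py   # third-party wrapper around libhdf5

    # Array of records: a columnar format like Parquet fits well.
    records = pd.DataFrame({"id": [1, 2, 3], "flux": [0.8, 1.1, 0.9]})
    records.to_parquet("catalogue.parquet")   # needs pyarrow or fastparquet

    # n-dimensional array: HDF5 via h5py.
    cube = np.random.rand(16, 64, 64)
    with h5py.File("cube.h5", "w") as f:
        f.create_dataset("cube", data=cube, compression="gzip")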

I think the data you're talking about is more graph-like, right (and feels like the kind of thing where you want to talk about the structure of how data is related)? That feels different in kind to both of the above, and so naturally tools designed for other types of data don't match?

My understanding of ML/AI is that the data is generally pushed into one of the two bins above, but that may be a bias based on the data I encounter.

Insecurity and Python pickles

Posted Mar 15, 2024 9:52 UTC (Fri) by Wol (subscriber, #4433) [Link]

> I think we're using the same words to mean different things. The data I've worked with has come in two different forms:

No surprise ...

> * arrays of records (and collections of these arrays): generally having a db makes it easier and faster to do more complex queries over these vs multiple files (or a single file with multiple arrays), and formats designed for efficient use of "tabular" data (e.g. parquet) are better than random CSV/TSV.

So are your records one-dimensional? That makes your "arrays of records" two-dimensional - what I think of as your typical relational database table.

And what do you mean by "a complex query"? In MV that doesn't make sense. Everything that makes SQL complicated belongs in an MV Schema - joins, case, calculations, etc. Pretty much ALL MV queries boil down to the equivalent of "select * from table".

> * n-dimensional arrays: these represent images/cubes/higher moments of physical data (vs metadata), and so are different in kind to the arrays of records. This is where HDF5, netCDF and FITS (if you're doing observational astronomy) come in.

And if n=2? That's just your standard relational database aiui.

It's strange you mention astronomy. Ages back there was a shoot-out between Oracle and Cache (not sure whether it was Cache/MV). The acceptance criterion was to hit 100K inserts/hr or whatever - I don't know what these speeds are; I'm generally limited by the speed people can type. Oracle had to struggle to hit the target - all sorts of optimisations like disabling indices on insert and running an update later, etc. Cache won, went into production, and breezed through 250K within weeks ...

> I think the data you're talking about is more graph-like, right (and feels like the kind of thing where you want to talk about the structure of how data is related)? That feels different in kind to both of the above, and so naturally tools designed for other types of data don't match?

Graph-like? I'm not a visual person so I don't understand what you mean (and my degree is Chemistry/Medicine).

To me, I have RECORDs - which are the n-dimensional 4NF representation of an object, and the equivalent of a relational row!

I then have FILEs which are a set of RECORDS, and the equivalent of a relational table.

All the metadata your relational business analyst shoves in the data, I shove in the schema.

With the result that all the complexities of a SQL query, and all the multiple repetitions across multiple queries, just disappear because they're in the schema! (And with a simple translation layer defined in the schema, I can run SQL over my FILEs.)

I had cause to look up the "definition" of NoSQL recently. Version 1 was the name of a particular database. Version 2 was the use I make of it - defined by the MV crowd, "Not only SQL" - databases that can be queried by SQL but it's not their native language (in MV's case because it predates relational). Version 3 is the common one now, web stuff like JSON that doesn't really have a schema, and what there is is embedded with the data.

So I understand talking past each other with the same language is easy.

But all I have to do is define N as two (if my records just naturally happen to be 1NF), and I've got all the speed and power of your HDF5-whatever, operating like a relational database. But I don't have the complexity, because 90% of SQL has been moved into the database schema.

Cheers,
Wol

Insecurity and Python pickles

Posted Mar 14, 2024 7:16 UTC (Thu) by gspr (subscriber, #91542) [Link] (2 responses)

> I'm a bit confused about why people keep reinventing the "tabular data in a binary file" wheel. Are SQLite and/or DuckDB unsatisfactory for this purpose?

For ML model weights as discussed here, I've always been baffled that HDF5 isn't used more. It's an established, efficient and flexible standard with mature implementations in pretty much every language.
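
As a rough sketch of what that looks like for a (hypothetical) weight dictionary, using h5py:

    import numpy as np
    import h5py

    weights = {"layer1/kernel": np.random.rand(128, 64),
               "layer1/bias": np.random.rand(64)}

    with h5py.File("model.h5", "w") as f:
        for name, tensor in weights.items():
            f.create_dataset(name, data=tensor)   # "/" in the name creates nested groups

    # Reading back is plain data access - no code execution, unlike unpickling.
    with h5py.File("model.h5", "r") as f:
        restored = {name: f[name][()] for name in weights}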

Insecurity and Python pickles

Posted Mar 14, 2024 9:09 UTC (Thu) by aragilar (subscriber, #122569) [Link]

HDF5 is used by at least some of the AI/ML projects (though I'm guessing most people working in the space are fairly new, and making it up as they go).

I'm not sure, though, that you could say there are multiple mature implementations. Most libraries wrap libhdf5, and those that don't tend to be pretty limited (because why reinvent the wheel?). That's not to say it's a bad format, but usually using libhdf5 is more than sufficient.

ML vs HDF5

Posted Mar 17, 2024 11:25 UTC (Sun) by summentier (guest, #100638) [Link]

My view is certainly somewhat contrarian, but I don't think HDF5 is a good file format for scientific data.

First of all, the HDF5 spec is enormous for a file format. It is not just a set of tensors organized in a tree structure; it comes with all sorts of additional stuff: attributes, compression, data layout, custom data types, you name it. For this reason, I disagree that there are “mature implementations in pretty much every language”; there really is only one feature-complete implementation: libhdf5, written in C, which pretty much every other language wraps. (Yes, there is jHDF for Java, which can only read; there are JLD2 and some Rust crates, but none of them supported the full spec last time I checked.)

Because the HDF5 spec is so large and complex, you essentially have to tool around libhdf5 or use an HDF5 viewer every time you want to look at the data – hex dumps are of no use to you. But this also means that things like mem-mapping parts of a large dataset become a problem – libhdf5 to this day does not support this properly. Writing and reading HDF5 files is thus quite cumbersome and tends to be slow. Compare that with a simple binary file format like NumPy's, where you simply have a text header followed by some binary data: there, all of this becomes trivial.
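
The mem-mapping point is easy to illustrate with NumPy's own .npy format (the file name is invented):

    import numpy as np

    big = np.arange(10_000_000, dtype=np.float32)
    np.save("big.npy", big)                 # short text header + raw binary data

    # Memory-map instead of reading: only the touched pages are pulled from disk.
    view = np.load("big.npy", mmap_mode="r")
    print(view[5_000_000:5_000_010])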

What about HDF5 as an archiving format, then? Well, OK, there is a spec, but what use is that if, say, libhdf5 ceases to be maintained? In this case, how on Earth are we going to get the data out of a compressed data set with custom data types nestled deep into a tree hierarchy without reimplementing the spec? And even then, since there is essentially only one implementation that everyone uses, we have to pray that libhdf5 actually followed the HDF5 spec ...

In summary, I consider a tarball of binary files with text headers a la numpy a vastly superior file format to HDF5. It is clear, universally understood, and easy to view and tool around. (Of course, HDF5 being not a good format does not excuse using pickle ...)

