
Insecurity and Python pickles

Posted Mar 14, 2024 3:55 UTC (Thu) by NYKevin (subscriber, #129325)
In reply to: Insecurity and Python pickles by intelfx
Parent article: Insecurity and Python pickles

There is a problem with applying YAGNI here: YAGNI is supposed to *reduce* the amount of work you have to do, not increase it.

With SQLite: import sqlite3, then write a few lines of SQL. Done.

Without SQLite: You have to write out this JSON stuff by hand, make sure your format is unambiguous, parse it back in, etc., and probably you also want to write tests for all of that functionality.
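Concretely, the SQLite path is about this long (a minimal sketch; the table and column names here are invented for illustration):

```python
import sqlite3

# ":memory:" keeps this example self-contained; pass a filename
# like "results.db" to actually persist to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS runs (name TEXT, score REAL)")

# Parameterized inserts sidestep quoting bugs and SQL injection.
conn.execute("INSERT INTO runs VALUES (?, ?)", ("trial-1", 0.93))
conn.commit()

# Reading it back is a one-liner.
rows = conn.execute("SELECT name, score FROM runs").fetchall()
```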



Insecurity and Python pickles

Posted Mar 14, 2024 3:58 UTC (Thu) by NYKevin (subscriber, #129325)

(And before anyone asks: sqlite3 is a standard library module. It is already installed in every reasonably modern version of Python. You do not have to download it, take a dependency on it, or faff about with pip.)

Insecurity and Python pickles

Posted Mar 14, 2024 7:36 UTC (Thu) by Wol (subscriber, #4433)

> Without SQLite: You have to write out this JSON stuff by hand, make sure your format is unambiguous, parse it back in, etc., and probably you also want to write tests for all of that functionality.

You're assuming your JSON doesn't have a schema/definition.

There's a whole bunch of JSON-like stuff (XML/DTD, Pick/MultiValue) where having a schema is optional but enforceable.

If you *declare* that JSON/XML/MV etc. without a schema is broken, then all of this stuff can be automated extremely easily.
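A sketch of what that enforcement could look like in plain Python (the schema format here is hand-rolled for illustration; a real deployment would more likely use something like JSON Schema):

```python
import json

# Hand-rolled "schema": required keys mapped to expected types.
SCHEMA = {"name": str, "score": float}

def load_checked(text, schema=SCHEMA):
    """Parse JSON, rejecting anything that doesn't match the schema."""
    obj = json.loads(text)
    if not isinstance(obj, dict):
        raise ValueError("top-level value must be an object")
    for key, typ in schema.items():
        if key not in obj:
            raise ValueError(f"missing required key: {key}")
        if not isinstance(obj[key], typ):
            raise ValueError(f"{key} must be {typ.__name__}")
    return obj

record = load_checked('{"name": "trial-1", "score": 0.93}')
```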

Cheers,
Wol

Insecurity and Python pickles

Posted Mar 14, 2024 9:14 UTC (Thu) by atnot (subscriber, #124910)

> With SQLite: import sqlite3, then write a few lines of SQL. Done.

That's not quite it. You need to laboriously map all of the objects you have into an SQL model first, then learn about prepared statements and so on, if you don't already know all of this stuff, which as the average scientist you don't. That's easily a dozen lines of code.

> Without SQLite: You have to write out this JSON stuff by hand, make sure your format is unambiguous, parse it back in, etc., and probably you also want to write tests for all of that functionality.

All of this needs to be done for SQL too. You don't just magically get the valid Python objects you put in back out again. Even if you use a third-party ORM-like thing, what about third-party objects that were never designed for this? And tests are needed for all of this as well.

It's not like Rust etc., where there's a de facto standard for ser/des that everything implements; all of this is real work.

Meanwhile, with pickle: you import pickle and just give it the Python object you want to save, and it works. One line. And it's built into the language. Sure, it's insecure, but you'll fix that maybe once this paper is out.
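The pickle path really is that short, which is exactly why it is so tempting (a minimal sketch; the filename is invented):

```python
import pickle

# Any picklable object, however nested; no schema, no mapping step.
state = {"weights": [0.1, 0.2], "epoch": 7}

# Saving: one call.
with open("state.pkl", "wb") as f:
    pickle.dump(state, f)

# Loading: one call -- but this can execute arbitrary code if the
# file is attacker-controlled, which is the whole point of the article.
with open("state.pkl", "rb") as f:
    restored = pickle.load(f)
```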

Insecurity and Python pickles

Posted Mar 14, 2024 10:32 UTC (Thu) by aragilar (subscriber, #122569)

It depends what you're working on and which libraries you're using, but tools like pandas make it fairly easy to dump out an sqlite file (see https://pandas.pydata.org/docs/user_guide/io.html#sql-que...). The larger Python web frameworks either provide serialisation support, or recommend third-party libraries and demonstrate their use in their docs. There isn't a universal library like serde, but I personally wouldn't use serde for HPC (wrong design), so I'm not sure this is the actual reason. My expectation is that people are using notebooks and want to pick up where they left off; pickle is fine as a "dump the current state of my work to a file" tool, but people then start sending this state around, and it gets embedded in workflows.
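For the pandas route, the round trip through SQLite is only a couple of calls (a sketch; the frame contents and table name are invented, and pandas is a third-party dependency, albeit a near-universal one in scientific Python):

```python
import sqlite3
import pandas as pd  # third-party, but ubiquitous in scientific code

df = pd.DataFrame({"name": ["trial-1", "trial-2"], "score": [0.93, 0.88]})

# Dump the frame into an SQLite table and read it back.
conn = sqlite3.connect(":memory:")
df.to_sql("runs", conn, index=False)
restored = pd.read_sql("SELECT * FROM runs", conn)
conn.close()
```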

Insecurity and Python pickles

Posted Mar 14, 2024 16:58 UTC (Thu) by NYKevin (subscriber, #129325)

> That's not quite it. You need to laboriously map all of the objects you have into an SQL model first. Then learn about prepared statements, etc. if you don't already know all of this stuff, which as the average scientist you don't. That's easily a dozen lines of code.

This is a game of "don't read the thread." I made that comment in response to an assertion that some data could not be mapped into SQL because it was not 2D. In that case, you already have to turn it into bytes anyway (e.g. with numpy.ndarray.tofile() into a BytesIO object, which was already being done in the code I was commenting on in the first place). My point is that you can put metadata and other such stuff into "real" SQL columns, store anything that doesn't easily map to SQL objects as TEXT, and then skip the nonsense with JSON. You have not meaningfully responded to that assertion; you've simply talked past me.
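The metadata-in-columns, payload-as-bytes approach can be sketched like this (using numpy.ndarray.tobytes() rather than tofile() for brevity, and storing the payload as a BLOB; the table and column names are invented):

```python
import json
import sqlite3
import numpy as np

arr = np.arange(24, dtype=np.float64).reshape(2, 3, 4)  # not 2D

conn = sqlite3.connect(":memory:")
# Real metadata goes in real columns; the non-tabular payload is raw bytes.
conn.execute("CREATE TABLE arrays (name TEXT, dtype TEXT, shape TEXT, data BLOB)")
conn.execute(
    "INSERT INTO arrays VALUES (?, ?, ?, ?)",
    ("calibration", str(arr.dtype), json.dumps(arr.shape), arr.tobytes()),
)

# Reconstructing the array needs only the stored metadata.
name, dtype, shape, data = conn.execute("SELECT * FROM arrays").fetchone()
restored = np.frombuffer(data, dtype=dtype).reshape(json.loads(shape))
conn.close()
```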


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds