LWN.net Logo

A report from OpenSQLCamp

A report from OpenSQLCamp

Posted Nov 3, 2010 1:09 UTC (Wed) by ringerc (subscriber, #3071)
Parent article: A report from OpenSQLCamp

Standard machine-friendly representation of SQL

I'm an application developer who likes SQL - for reporting and and complex queries. It's a powerful language with great features, and I'd like to use it more in my code. In reality, I still use a query builder library or even an ORM much of the time, because what I hate and fear is assembling SQL from horrid little snippets for any non-trivial dynamic query. I'd love to see a standardized machine-friendly form of SQL that's:

  • Designed for fast, simple and unambigious parsing/processing;
  • Structured and friendly for machine processing. Use XML, use JSON, use whatever, but use something that's not free-form text;
  • Capable of being easily restructured to add and remove clauses, embed another query as a subquery, etc; and
  • Semantically identical to SQL, and translatable to text SQL queries for databases that don't accept it as an an alternate wire representation

Half the reason people "hate and fear" SQL is because it's a free-form, pseudo-natural-language query system that's an absolute bastard to manipulate in code. People are driven to add-on query builder APIs or ORMs to get library interfaces that make producing dynamic queries less excruciating than battling SQL snippets.

ORM extensions to the query language and result representation

The other thing I'd really love to see in SQL would be extensions to support ORMs. They're not going away, and the pain of getting from objects to/from relational data is greatly increased by the SQL interface that must be used. SQL is great for reporting and for complex queries, but ORMs are very useful for working with the data during small, highly dynamic transaction processing.

A set of ORM support extensions to SQL that defines a heirachical or graph representation for the returned data rather than a "flat" tabular/relational representation would massively reduce the pain involved in using ORMs with SQL databases, letting you stick to full SQL where its power is desirable and use a simplified "entity selector" when you just want to grab a bunch of records from a bunch of different relations according to some relationship. This would help avoid all that horrid left-outer-join abuse and many small selects that ORMs often do, by permitting them to request (say) "customer 1 and all customer 1's invoices and bookings" in a single query that'd return a single structured representation of the result data. For simple cases multiple tabular result sets from a query would be good enough (ie: a 1-record result for the customer, followed by an n-record result for invoices and an m-record result for bookings). Ideally you'd want to be able to return what's essentially a graph representation from the database, so you could slurp in database-produced JSON/XML/whatever and easily turn it into native language objects. Or, for that matter, return serialized forms of the language's native objects directly from the database (python pickle, Java serialize, etc) using a language runtime embedded in the database engine.

Imagine how much nicer it'd be to use ORMs if the database still stored data in relational form (so you get all those nice constraints, foreign key relationships, and the power of SQL to query it all) - but knew how to backward-traverse foreign key relationships and produce object graphs when you want an object representation of the data not a tabular representation. Oh, bliss.


(Log in to post comments)

A report from OpenSQLCamp

Posted Nov 3, 2010 10:22 UTC (Wed) by dgm (subscriber, #49227) [Link]

what's so hard about doing SELECT FROM invoices WHERE customer = '1'?

A report from OpenSQLCamp

Posted Nov 3, 2010 16:16 UTC (Wed) by iabervon (subscriber, #722) [Link]

Updating the code that reads the result set to deal with the column that you added for a different feature entirely (since that gives you all of the columns in the order they were declared in the schema definition)? Dealing with the possibilities of single quotes in the string if it comes from a potential attacker? The fact that your customer column is probably actually a number?

But the real issue is that applications will generally need some sort of search, and that will often involve optional clauses, which means that a fixed SQL statement, even with bind variables, will not be sufficient to implement some of the queries the application needs to do. And once you do pretty much anything dynamic, you're into a lot of annoyance with the need to have one less item of punctuation than list item, and needing to understand how join order can matter. And it's not long before you have to join the same table for two different purposes, getting different rows for each, and you have to worry about what you're naming the table in the query, and getting the right names for the columns from the two copies. The problem with a lot of the convenience features of SQL is that you can't use them with the queries that most need to be comprehensible.

A report from OpenSQLCamp

Posted Nov 3, 2010 19:33 UTC (Wed) by nix (subscriber, #2304) [Link]

Also, as soon as joins turn up, you're into hell. WITH makes things a bit simpler (assuming your database supports it), but encoding knowledge of the frankly somewhat bizarre tree-walk you have to do in order to figure out what names your columns and tables have got in *this* particular subclause is not very pleasant.

But that's not why I hate SQL, at all. That's just icing on the bad-tasting cake.

A report from OpenSQLCamp

Posted Nov 3, 2010 16:27 UTC (Wed) by alankila (subscriber, #47141) [Link]

Apart from not being syntactically correct, using string quotes for field that is likely an integer, and hardcoding a variable value in query rather than using a host parameter? Nothing.

From viewpoint of someone who constructs a query dynamically based on, say a hash-map's keys, the idiocy already starts from the fact that the first condition needs to start with "WHERE" and the other conditions need to start with "AND", just to construct a query that says "WHERE col1 = :v1 AND col2 = :v2". Lazy people type code like "WHERE 1=1" just to avoid having to deal with that issue.

Some other things that I hate personally are having to deal with the NULL value specifically; "foo = :some_value", but "foo IS NULL" instead of "foo = NULL". I'm sure there is some tired reason for this, but I don't care anymore. Additionally, if you want to negate a condition generally, you have to come up with schemes like (NOT foo = x) or (NOT foo IS NULL) because otherwise you have to generate foo != x vs. foo IS NOT NULL. (At least that is possible generically, though. A small victory, if you accept the use of different operators to begin with.)

Finally, the actual value is often presenting complexities of its own. Many databases flat out refuse a comparison like id = '5' if the id is in fact numerical. This is mitigated by host variables and intelligent cooperation between a dynamic language and the underlying SQL engine so automatic type conversions become possible, and in case of a static language it is a feature that the query crashes if wrong type of parameter is bound to a host variable.

It is this type of thoughtless hostility that the GP is complaining about, I believe. I have often wished having a programmable API for query construction that wouldn't require laying out a complex query out of tiny bricks. When designing one, I decided that such a thing must have a SQL parser because usually there is no way better to represent a SQL query than to spell it out in a string, and despite starting out in this way you still want to add and remove conditions later on without having to care about whether the condition must start with "WHERE" or "AND" or if the null value needs to be compared with operator = or IS, or whether the field in question needs a value that is a string or integer type, etc.

A report from OpenSQLCamp

Posted Nov 3, 2010 17:46 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

You're in luck:

List%lt;SomeObject%gt; objects=sess.from(someObject)
  .where(someObject.someField.lt(10))
  .where(someObject.anotherLinkedObject().anotherField.gt(10))
  .where(someObject.name.in("Joe", "Smith", "Bubba")
  .list(someObject);
or even:
HibernateQuery query=sess.from(someObject)
  .where(someObject.someField.lt(10))
  .where(someObject.anotherLinkedObject().anotherField.gt(10));

if (weNeedToLimitByName)
    query.where(someObject.name.in("Joe", "Smith", "Bubba"));

if (someAnotherCondition)
    query.where(someObject.id.eq(10));

query.list(someObject);
Secret sauce is QueryDSL. LINQ in C# has even more powerful capabilities.

A report from OpenSQLCamp

Posted Nov 3, 2010 22:10 UTC (Wed) by alankila (subscriber, #47141) [Link]

It looks reasonable on the surface, at least. I normally use the Criteria API when dealing with Hibernate on Java, but it is not especially smart or type safe. It's a pity that code generator is required for this.

A report from OpenSQLCamp

Posted Nov 4, 2010 1:02 UTC (Thu) by ringerc (subscriber, #3071) [Link]

Null handling isn't too bad if you're not dealing with brain-dead databases where you have to severely limit the used feature set. You can use "NOT (A IS DISTINCT FROM B)" to cover equality and nullity in one expression. It's a wee bit verbose, but better than a properly null-safe equality comparison, which is just horrid.

I'm of two minds on NULL. On one hand, it's semantically different to the null/NULL/None/undef/whatever in most programming languages, and behaves differently. On the other hand, it's a really useful concept that's intellectually appealing. It's such a PITA to actually work with that I'm not sure it's worth it, though, especially once you get to the inconsistencies in its application in various parts of the spec.

A report from OpenSQLCamp

Posted Nov 4, 2010 11:45 UTC (Thu) by alankila (subscriber, #47141) [Link]

Yeah, NULLs sound useful at first, but the only real place where I use them today is for indicating a missing value in a numeric field. For string fields, I find that empty strings are good substitutes for NULLs, and many dates, especially those that mark some start and end epoch can be set to year 0000 or year 9999 initially instead of NULL depending on what is the semantically most useful way to view the value.

A report from OpenSQLCamp

Posted Nov 3, 2010 17:49 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

Now suppose that you want to fetch 10 customer's address records, last 5 coupons, and 20 recent buys in the same request.

You'll get hit by Cartesian product in the form of 10*5*20=1000 records. Good hierarchical query should return a tree with 35 nodes.

A report from OpenSQLCamp

Posted Nov 3, 2010 19:40 UTC (Wed) by andresfreund (subscriber, #69562) [Link]

Why would you want to?

A report from OpenSQLCamp

Posted Nov 3, 2010 19:46 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

To minimize number of database roundtrips, as usual

A report from OpenSQLCamp

Posted Nov 7, 2010 13:27 UTC (Sun) by zorro (subscriber, #45643) [Link]

You can also send the three individual queries to the database in a single batch.

A report from OpenSQLCamp

Posted Nov 3, 2010 16:46 UTC (Wed) by iabervon (subscriber, #722) [Link]

I think one mistake that was made with SQL that you're still making is the idea that queries should normally be constructed as a sequence of characters, like program source. Since applications will almost always have to make queries with systematic structural variations, it isn't possible to put it all in the program source, and there's not really any other particular benefit to the sequence of characters for programmatic use (of course, it can be convenient to interact with the database manually over a text terminal, but that's an operational issue rather than a development issue). What I think would be really nice is something that is to SQL what DOM is to XML: something that can be manipulated at a logical granularity and doesn't need to be unparsed and parsed at all except when it has to go between independent systems (which wouldn't happen at all in most applications). I think it might even be fine if the unparsed form was still SQL, just because it would hardly matter (of course, it would mean having a syntax that isn't entirely predictable from the parsed form's API, which would mean that extensions would have to be defined in both formats if they are to be possible to use over a text terminal).

A report from OpenSQLCamp

Posted Nov 3, 2010 20:00 UTC (Wed) by nix (subscriber, #2304) [Link]

i.e. what we want is Sexpy Query Language? I'd support such an endeavour :)

A report from OpenSQLCamp

Posted Nov 10, 2010 17:08 UTC (Wed) by jeremiah (subscriber, #1221) [Link]

My current working solution is to define queries as an XML tree. This tree defines both the query, and what the resulting XML should look like. I'm not interested in storing full XML documents in the DB, but breaking them up into atomic pieces. That a subsequent query, can the reconstruct across documents. I had really considered hacking the Postres source to do this. I've just have a parser that builds sql, and one that processes result sets into XML. Based on a rather large assumption here, I would think it would be possible to create an API that would take XML based queries and parse them into Postgres's internal query representation, and have postgres's normal query optimization apply w/o much work. The difficult part would be processing the result sets in a hierarchical fashion w/o iterating over the redundant outer left join redundancies. There is a lot of wasted time there, but I'd imagine that the internal optimizations that produce results are not designed in a tree friendly way, but in a flat way. So you'd probably be making rather severe architectural changes, to the point of it not being worth it. I guess you could always filter the flat results at the DB into the XML, but that would mean processing the result tree twice. Which most of us do already, but atleast it would already be in mem, and avoid transport time.
</unedited train of thought>

In a nutshell, it seems like everyone wants to be able to send a query or data tree in, and get a result tree out.

A report from OpenSQLCamp

Posted Nov 10, 2010 17:33 UTC (Wed) by jeremiah (subscriber, #1221) [Link]

btw. NeoDB does an almost decent job of this, but too much of the processing and map reducing is done in code as opposed to by the DB itself. Although it's tree based, the query structure was not dynamically defined enough for my uses.

A report from OpenSQLCamp

Posted Nov 10, 2010 21:59 UTC (Wed) by nix (subscriber, #2304) [Link]

That would be lovely. Downside: XML is horribly verbose and ugly as sin.

(Has anyone written an XML -> sexp bidirectional filter? If not, I should: it wouldn't be hard and would make this stuff *so* much easier for humans to read and write. It's quite an achievement, actually, coming up with something *less* readable than Lisp's sexps... the SGML/XML people should be proud.)

A report from OpenSQLCamp

Posted Nov 10, 2010 23:40 UTC (Wed) by jeremiah (subscriber, #1221) [Link]

So I'm trying to visualize this, and failing. I don't suppose you'd toss up an example? I'm curious to see how you'd address a mix of attribute value pairs and element names and content. It's been at least 15 years since I messed with lisp.

I deal with XML every day, and agree with the verbosity and ugliness comment. The only way i can make it palatable is lots of indenting, and syntax highlighting, and even then it's pretty rough some times. I like it's flexibility and ubiquitousness though. I don't hate it nearly as much as I dislike the SQL language.

A report from OpenSQLCamp

Posted Nov 11, 2010 0:11 UTC (Thu) by nix (subscriber, #2304) [Link]

There are lots of possible unambiguous mappings. Instead of

<foo bar="baz">quux<wob>foo</wob>blah</foo>

you could say

(foo :bar "baz" (:content "quux" (wob (:content "foo")) (:content "blah")))

perhaps. Or, if you don't like the ":content", you could say

(foo (:attr bar "baz") "quux" (wob "foo") "blah")

(yeah, I think the latter is much easier to read. Attributes are less common than content in decent XML, so attributes should be more verbose in the sexpy syntax.)

Note that both these syntaxes render it impossible to close elements in the wrong order and impossible to do the analogue of missing a / out. Those two things on their own make it worthwhile, I think.

(I am by no means the first person to have thought about this. I'm not even the five thousandth.)

A report from OpenSQLCamp

Posted Nov 11, 2010 1:13 UTC (Thu) by jeremiah (subscriber, #1221) [Link]

I most definitely prefer the second, and thanks for the example. I'm going to roll it around in my head for a bit. I am curious to know why you think "Attributes are less common than content in decent XML" though. Most of the XML I deal with is coming out of a DB and generally the table name corresponds to the element name, and the column names to the attribute names etc. This makes for fairly decently readable XML, when a person has to look at it. Sure you run into issues with large content that doesn't really fit with in an attribute, and should be content instead. But for the majority of data that I have to deal with I have very very few content only elements. Do you have any pointers as to why this is a bad thing?

A report from OpenSQLCamp

Posted Nov 11, 2010 2:11 UTC (Thu) by jeremiah (subscriber, #1221) [Link]

Never mind. There was some missed subtlety in my reading of your post.

A report from OpenSQLCamp

Posted Nov 11, 2010 7:17 UTC (Thu) by nix (subscriber, #2304) [Link]

I think it depends on where your data's coming from. I think for the application you were discussing, you're probably right. I tend to deal with XML as a data-interchange format, and while a few of those are slightly self-describing, with datatypes in an attribute, most of them hand you a (generally very buggy) schema and then have no attributes at all in the content.

Note: a nice thing about using a real programming language rather than XML is that you could go the whole hog if you want and eliminate redundancy, although this looks uglier than it should because all my indentation is being eaten by the comment form and because I've only got one line of XML generated, much too little to be worth using this technique:

(let ((type-string '(:attr type "string"))
(type-int '(:attr type "int")))
(foo type-string "quux" (num-foos type-int 6)))

is equivalent to

<foo type="string">quux<num-foos type="int">6</num-foos></foo>

If you wanted to have 'type's with parameters, let alone optional ones, you'd have to do something a little different:

(flet ((type-float (precision) `(:attr type "float" `(:attr type ,(concat "float-" ,precision)))))
(foo (type-float 6) 14.172))

is equivalent to

<foo type="float-6">14.172</foo>

Note that the latter is horribly hard to typecheck properly in XML, but having something typecheck the Lisp notation is trivial. Also note that your generator and parser don't need to understand Lisp (although it is so easy the parser might as well understand a restricted subset): it just needs to know how to generate stereotyped let and (for parameterized attributes) flet expressions.

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds