Malcolm: SQL for the command line: "show"
"This got me thinking. We have many different log formats, and many different sources of data. All of our tools seem to have different interfaces. [...] For example, why should I write regular expressions and shell pipelines to get at my logs? Why do I have to learn a custom syntax ("rpm -qa --queryformat='various things'") for looking at the software I have installed? Why does e.g. the audit subsystem have its own query format? [...] Why can't I just use SQL, and write SELECT statements to drill down into all of this data?"
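(For concreteness, the contrast being drawn looks roughly like this. The rpm query tags below are real; the "rpm" table name and its column names are only guesses at what such a tool's rpm backend might expose:

$ rpm -qa --queryformat '%{NAME} %{SIZE}\n' | sort -k2 -n
$ show name, size from rpm order by size

The first line is the custom per-tool syntax the post complains about; the second is the SQL-ish query it is asking for.)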
Posted Mar 23, 2009 16:27 UTC (Mon)
by sjj (guest, #2020)
[Link]
Posted Mar 23, 2009 16:29 UTC (Mon)
by clugstj (subscriber, #4020)
[Link] (14 responses)
Because SQL sucks as a general programming language.
Posted Mar 23, 2009 16:33 UTC (Mon)
by flewellyn (subscriber, #5047)
[Link] (13 responses)
But, that's not what he wants to use it for. He only wants to use it for its problem domain: querying a data store of some kind. And in that domain, it's perhaps not the best thing ever, but it's quite suitable and certainly well-understood.
I just wonder how he'll handle issues like complex queries, with things like subqueries and joins and the like.
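(For illustration, this is the sort of query in question, written against invented log tables; whether the tool could support anything like it is exactly the open question:

SELECT a.host, COUNT(*) AS hits, SUM(a.size) AS bytes
FROM access_log a
JOIN hosts h ON h.addr = a.host
WHERE h.addr NOT IN (SELECT addr FROM local_networks)
GROUP BY a.host;
)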
Posted Mar 23, 2009 16:53 UTC (Mon)
by Wol (subscriber, #4433)
[Link] (8 responses)
But seriously, if your data naturally fits a flat table, SQL is a good fit. Most data, however, doesn't.
Cheers,
Wol
Posted Mar 23, 2009 18:09 UTC (Mon)
by mrshiny (guest, #4266)
[Link]
Posted Mar 23, 2009 18:51 UTC (Mon)
by KGranade (guest, #56052)
[Link] (6 responses)
Secondly, you recommend using something like English, but I have no idea how the examples you present would be used. That means learning a specialized syntax, in which case I might as well use a generalized syntax that is a moderately straightforward mapping of the concepts of database retrieval onto English... which is a decent description of SQL.
<Insert overly-verbose rant about the pitfalls of "natural language programming" here>
Posted Mar 24, 2009 0:21 UTC (Tue)
by Wol (subscriber, #4433)
[Link] (5 responses)
Sorry, but I did *not* say "use English". I said "use ENGLISH" (ENGLISH being a dedicated data access language).
ENGLISH is the original Pick data query language, and is a very good NFNF (non-first-normal-form) query tool. (It's also called ENGLISH because it is, actually, very similar to English!) For example:
SELECT INVOICE WITH INVOICE.TOTAL EQ 1600 AND WHERE LINE.ITEM EQ 215
will select all invoices where the invoice value is 1600 and any individual line is 215.
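For contrast, a rough sketch of the same query against a normalized relational schema (the invoices/invoice_lines tables and their columns are invented here, purely to show the shape) needs a join:

SELECT DISTINCT i.invoice_no
FROM invoices i
JOIN invoice_lines l ON l.invoice_no = i.invoice_no
WHERE i.total = 1600
  AND l.item = 215;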
SQL is *not* a "moderately straightforward mapping of the concepts of database retrieval to English" - no way would I describe it as "moderately straightforward", and it is very relational-oriented. Using it to query a non-relational database is *horrid*.
Cheers,
Wol
Posted Mar 24, 2009 0:44 UTC (Tue)
by flewellyn (subscriber, #5047)
[Link] (4 responses)
Posted Mar 24, 2009 15:51 UTC (Tue)
by Wol (subscriber, #4433)
[Link] (3 responses)
So yes, SQL is probably a good language for querying them. But then, so is ENGLISH, because it's n-dimensional (actually, it doesn't work that well if n hits 4 or more :-( ). So, horses for courses: I'd use ENGLISH because that's what I'm comfortable with.
Cheers,
Wol
Posted Mar 24, 2009 16:24 UTC (Tue)
by flewellyn (subscriber, #5047)
[Link] (2 responses)
Posted Mar 26, 2009 14:06 UTC (Thu)
by Wol (subscriber, #4433)
[Link] (1 responses)
IBM are on record as saying that their version (the U2 databases) is the fastest-growing product in their database VAR sales.
And International Spectrum is holding their conference right now: against a background of collapsing conference attendances (typically down 25-50%), they're holding their own - I think they were down 7% (or was it up?).
imho relational *theory* is great. Unfortunately, relational practice falls foul of Einstein's corollary to Occam: the practice is TOO simple, so system complexity (as in all the stuff *round* the database) rises sharply as a result. SQL queries are a classic example :-)
Cheers,
Wol
Posted Mar 28, 2009 0:55 UTC (Sat)
by nix (subscriber, #2304)
[Link]
[...] instance. In case the relational model is 'too hard', now you can turn your DB into a tiny spreadsheet and bash at it in the query. How relational...

(actually it *is* useful, but that doesn't mean it's not totally bizarre and screwy. The real problem here is SQL's halfassed incapable implementation of half the relational calculus in a non-Turing-complete fashion. But it paid for my house so I can't complain *too* terribly hard.)
Posted Mar 23, 2009 20:30 UTC (Mon)
by clugstj (subscriber, #4020)
[Link] (1 responses)
Posted Mar 23, 2009 20:42 UTC (Mon)
by flewellyn (subscriber, #5047)
[Link]
Posted Mar 24, 2009 21:35 UTC (Tue)
by skx (subscriber, #14652)
[Link] (1 responses)
It is funny how people get hooked on using SQL for querying logfiles. Last year or so I wrote asql which I use for producing adhoc statistics from Apache logfiles. Simple usage is:

$ asql
asql v1.2 - type 'help' for help.
asql> load /home/www/www.steve.org.uk/logs/access.log
Loading: /home/www/www.steve.org.uk/logs/access.log
asql> SELECT source,SUM(size) AS Number FROM logs GROUP BY source ORDER BY Number DESC, source LIMIT 0,10;
67.195.37.112 4681922
74.6.17.185 2353628
87.120.8.52 2066975
77.36.6.72 1859180
...

Finding the top ten referers becomes:

asql> SELECT referer,COUNT(referer) AS number from logs WHERE referer NOT LIKE '%steve.org.uk%' GROUP BY referer ORDER BY number DESC,referer LIMIT 0,10;
- 1888
http://www.gnu.org/software/gnump3d/download.html 12
http://community.livejournal.com/lotr_tattoos/?skip=20 5
http://lua-users.org/wiki/LibrariesAndBindings 4

Although SQL is often not the most natural way to query things, I found it very useful and natural in this context.
Posted Mar 24, 2009 21:50 UTC (Tue)
by flewellyn (subscriber, #5047)
[Link]
Truth be told, my ideal query language would be something Lisplike, with query operations specified by functions and special operators (or macros), and "views" just being newly defined querying functions or macros.
Hmmm...now there's an idea...
Posted Mar 23, 2009 20:00 UTC (Mon)
by epa (subscriber, #39769)
[Link] (3 responses)
show host, "count(*)", "total(size)" from /var/log/httpd/*access_log*

And what happens when there is a file called 'where'? OK, in this example it would be /var/log/httpd/where, which presumably wouldn't trip up the parser. But using this thing in your current directory could be flaky.
Posted Mar 24, 2009 15:26 UTC (Tue)
by efexis (guest, #26355)
[Link] (2 responses)
show host, "count(*)", "total(size)" < /var/log/httpd/access_log
or
cat /var/log/httpd/*access_log* | show host, "count(*)", "total(size)"
would solve that
Posted Mar 24, 2009 21:36 UTC (Tue)
by skx (subscriber, #14652)
[Link]
Or just use asql which has a nice built-in shell for live queries, or the ability to run queries from the command line.
Posted Mar 28, 2009 11:22 UTC (Sat)
by jengelh (guest, #33263)
[Link]
Posted Mar 24, 2009 2:48 UTC (Tue)
by jordanb (guest, #45668)
[Link] (2 responses)
# select gender, race, sum(income)/count(income);
for instance. You have to say:
# select gender, race, industry, sum(income)/count(id) group by gender, race, industry;
even though there's only one rational way to group the data. In fact, if you leave off any term in the group by statement, most servers will raise an error.
While there's silly redundancy in the select statement, the others have horrible and dangerous defaults, for instance this:
# delete from records;
doesn't raise an error or simply do nothing, either of which would be sensible (you didn't specify what to delete!). Instead, it deletes *everything* in the table. It'd be as if rm's behavior without arguments were to delete every file in the current directory.
Then there's insert's annoying positional syntax, which is both tediously redundant *and* error prone. Plus the 'shortcut' way to do an insert ignores the fact that nearly every table has as its first column an auto-incrementing ID, forcing you to either go to the horrible long-form or use a non-standard workaround if you're lucky enough to be using a server that has one.
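To spell the complaint out (the table and column names here are invented): the positional shortcut makes you supply a value for every column, auto-incrementing id included, while the long form has to name each column explicitly:

-- shortcut form: the value list must line up with *all* columns, id first
INSERT INTO people VALUES (1, 'Alice', 30);
-- long form: name the columns so the id can be left to the server
INSERT INTO people (name, age) VALUES ('Alice', 30);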
Anyway, while his tool is fairly neat, I think it might have been more useful if he'd made something to coerce the data into sqlite. It'd have been less work, and you wouldn't get stuck with the shell argument-escaping horror he demonstrates there.
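A minimal sketch of that alternative, assuming an Apache common-format access log (the field positions and file paths are illustrative only):

$ awk '{print $1 "," $10}' /var/log/httpd/access_log > /tmp/access.csv
$ sqlite3 /tmp/access.db
sqlite> CREATE TABLE logs (host TEXT, size INTEGER);
sqlite> .mode csv
sqlite> .import /tmp/access.csv logs
sqlite> SELECT host, COUNT(*), SUM(size) FROM logs GROUP BY host ORDER BY SUM(size) DESC LIMIT 10;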
Posted Mar 26, 2009 23:56 UTC (Thu)
by zlynx (guest, #2285)
[Link]
For example, the list of values isn't limited to one thing. Values can just go on and on, inserting as many records as you like.
Or, you can put a SELECT statement there instead of VALUES.
Now what seemed redundant is necessary.
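For instance (table names invented):

-- one statement, several rows
INSERT INTO people (name, age) VALUES ('Alice', 30), ('Bob', 25), ('Carol', 41);
-- or populate from another query instead of a VALUES list
INSERT INTO people (name, age) SELECT name, age FROM imported_people;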
I won't dispute that DELETE and UPDATE assuming all records when there's no WHERE clause is dangerous and stupid, though. :-)
Posted Mar 27, 2009 20:09 UTC (Fri)
by marcH (subscriber, #57642)
[Link]
Hey, how consistent is this?
- delete from TABLE where ROWfilter
- select COLnames from TABLE where ROWfilter
- insert into TABLE values COLvalues
Every time I go back to SQL I have to google again for examples...

Higher level concerns:
"Why SQL Sucks":
http://perlmonks.org/?node_id=515776
Posted Mar 24, 2009 13:32 UTC (Tue)
by pierre (guest, #663)
[Link]
Posted Mar 26, 2009 0:08 UTC (Thu)
by dave_malcolm (subscriber, #15013)
[Link] (2 responses)
Follow-up posting here: http://dmalcolm.livejournal.com/2009/03/24/ with info on git repo etc.
Thanks for the feedback. As some have noted, this isn't intended just for log files. Note that in the original post it already had "rpm" and "proc" backends, for querying the local rpm database and /proc respectively.
It can now parse many of the files in /etc, using the Augeas library (http://augeas.net)
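(Purely as an illustration of what a query against an /etc file might look like; the column names and whether the parser accepts this form are guesses, not the tool's documented syntax:

$ show key, value from /etc/ssh/sshd_config where key = "PermitRootLogin"
)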
I also just added tcpdump support, so you can look at e.g. a wireshark dump and run something like this:
$ show "count(*)", "total(length)", src_mac, dst_mac from test.pcap group by src_mac, dst_mac
(though this is merely a messy proof-of-concept hack at this stage)
Ideas for other backends most welcome.
Posted Apr 1, 2009 18:53 UTC (Wed)
by ortalo (guest, #4654)
[Link] (1 responses)
But the actual reason for my comment was another suggestion. Have you considered implementing the same kind of backends inside a full-fledged database? It seems to me that at least PostgreSQL should offer enough extensibility to allow this. It could free you from dealing with the intricacies/limitations of an "SQL-like" parser and may open the door to more complex treatments (dunno if writing would be feasible).
Or maybe you would find this too heavyweight for your intended usage? (I routinely have to deal with >30GB of compressed log files, so even a full-fledged database engine does not always seem overkill.)
Gonna look at your tool anyway. Thanks for the contribution.
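A sketch of what that could look like with PostgreSQL's foreign-data-wrapper machinery (file_fdw is a real contrib module, though it postdates this discussion; the server and table names are made up):

CREATE EXTENSION file_fdw;
CREATE SERVER logfiles FOREIGN DATA WRAPPER file_fdw;
CREATE FOREIGN TABLE access_log (line text)
  SERVER logfiles
  OPTIONS (filename '/var/log/httpd/access_log', format 'text');
SELECT count(*) FROM access_log;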
Posted Apr 1, 2009 22:38 UTC (Wed)
by nix (subscriber, #2304)
[Link]
$ show "count(*)", "total(length)", src_mac, dst_mac from test.pcap group by src_mac, dst_mac
(though this is merely a messy proof-of-concept hack at this stage)
[...] protocol to pass the format string and arguments separately (as well as the formatted whole), so that syslog-ng can use its existing facilities to dump the lot in a database. Then we can *really* do log analysis, with variable and fixed parts spliced out. (The problem is the break of the syslog protocol, though. I considered analyzing log messages to attempt to retrospectively determine which parts are format string and which are arguments, but that rapidly gets into a pattern-matching tarpit.)
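A toy illustration of the payoff (the schema and messages are invented): if the constant format string and the variable arguments arrive separately, they can be stored in separate columns, and the "variable parts" become directly queryable:

CREATE TABLE log (ts TIMESTAMP, format TEXT, arg1 TEXT, arg2 TEXT);
-- "connection from 10.0.0.1 port 2222" stored as format plus arguments
INSERT INTO log VALUES (CURRENT_TIMESTAMP, 'connection from %s port %s', '10.0.0.1', '2222');
-- top talkers for this message type, no regex required
SELECT arg1, COUNT(*) FROM log WHERE format = 'connection from %s port %s' GROUP BY arg1 ORDER BY COUNT(*) DESC;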