January 13, 2010
This article was contributed by Robert Fekete
Operating systems, applications, and network devices generate text
messages of the events that happen to them: a user logs in, a file is
created, a network connection is opened to a remote host, etc. These
messages, called log messages, can be used to detect security incidents,
operational problems, and policy violations, and are
useful in auditing and forensics situations. Traditionally, classifying log messages has been done external to the syslog system, with various log file analysis utilities, but a new feature in syslog-ng seeks to do that processing within the syslog daemon itself. By using a simpler syntax for describing log messages, along with a fast mechanism for recognizing them, message classification in syslog-ng can decrease the need for log file post-analysis, which will help ease the burden for system administrators.
Log messages do not have predefined content; they can be
straightforward or obscure, depending on the attitude of the developer who
wrote them. Either way, most of the time they are written with human
readers in mind. This ignores the fact that these days more and more companies
and organizations collect the log messages of their computers on a central
log server and try to process them automatically to detect break-in attempts, network errors, and other issues.
Classifying messages with syslog-ng attempts to remedy this situation by
making it possible to add metadata (e.g., event type like user login,
hardware error) to the log messages. It can also extract the relevant data
(like the username) from the messages and determine what to do or where to
store the log message based on this information. For example, if you need
to create reports about specific events, you can collect the messages of
the relevant events into a separate log file, which can be used as the
basis of the reports.
A brief introduction to syslog and syslog-ng
Applications usually send their log messages to the system logging daemon of the operating system, which delivers the messages to the place where the log messages are stored: to log files on the local machine (found typically under /var/log/), or to a remote server. Most UNIX and Linux operating systems use the syslogd application as the system logging daemon. The syslog daemon adds some meta-information (called the syslog header) to the received log messages, like the date and time the message was received, or the name or address of the host where it was created.
The nine-year-old syslog-ng project is a popular, alternative syslog
daemon — licensed under GPLv2 — that has established its name
with reliable message transfer and flexible message filtering and sorting
capabilities. In that time it has gained many new features, including direct
logging to SQL databases, TLS-encrypted message transport, and the ability
to parse and modify the content of log messages. The SUSE and openSUSE
distributions use syslog-ng as their default syslog daemon.
In syslog-ng 3.0 a new message-parsing and classifying feature (dubbed
pattern database or patterndb) was introduced. With recent improvements in
3.1 and the increasing demand for processing and analyzing log messages, a
look at the syslog-ng capabilities is warranted.
The main task of a central syslog-ng log server is to collect the messages sent by the clients and route the messages to their appropriate destinations depending on the information received in the header of the syslog message or within the log message itself. Using various filters, it is possible to build even complex, tree-like log routes.
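The idea of a tree-like log route can be sketched outside syslog-ng. The Python fragment below is only an illustration and is not syslog-ng configuration: the Message fields, the filter predicates, and the destination names are all hypothetical. Each node holds a filter, and its children only ever see the messages that passed it.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Message:
    host: str      # taken from the syslog header
    program: str   # taken from the syslog header
    text: str      # the message body


@dataclass
class Route:
    """One node of the routing tree (hypothetical structure)."""
    accepts: Callable[[Message], bool]
    destination: Optional[str] = None
    children: List["Route"] = field(default_factory=list)


def route(msg, node):
    """Return every destination that msg reaches in the tree rooted at node."""
    if not node.accepts(msg):
        return []
    found = [node.destination] if node.destination else []
    for child in node.children:
        found.extend(route(msg, child))
    return found


# Made-up policy: everything is archived, web servers get finer-grained routes.
tree = Route(
    accepts=lambda m: True,
    destination="all.log",
    children=[
        Route(
            accepts=lambda m: m.host.startswith("web"),
            children=[
                Route(lambda m: m.program == "sshd", "web-ssh.log"),
                Route(lambda m: "error" in m.text.lower(), "web-errors.log"),
            ],
        ),
        Route(lambda m: m.program == "kernel", "kernel.log"),
    ],
)

ssh_dests = route(Message("web1", "sshd", "Accepted password for joe"), tree)
kernel_dests = route(Message("db1", "kernel", "Disk error"), tree)
```

A message from an sshd on a web server ends up in both the archive and the web-ssh file, while a kernel message from a database host bypasses the web branch entirely.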
It is equally simple to modify the messages by using rewrite rules instead of filters if needed. Rewrite rules can do simple search-and-replace, but can also set a field of the message to a specific value: this comes in handy when a client does not properly format its log messages to comply with the syslog RFCs. (This is surprisingly common with routers and switches.) Version 3.1 of syslog-ng makes it possible to rewrite the structured data elements in messages that use the latest syslog message format (RFC5424).
Artificial ignorance
Classifying and identifying log messages has many uses. It can be useful
for reporting and compliance, but can also be important from the security
and system maintenance point of view. The syslog-ng pattern database is
also advantageous if you are using the "artificial ignorance" log processing
method, which was described by Marcus
J. Ranum (MJR):
Artificial Ignorance - a process whereby you throw
away the log entries you know aren't interesting. If there's anything left
after you've thrown away the stuff you know isn't interesting, then the
leftovers must be interesting.
Artificial ignorance is a method to detect the anomalies in a working
system. In log analysis, this means recognizing and ignoring the regular,
common log messages that result from the normal operation of the system,
and therefore are not too interesting. However, new messages that have not
appeared in the logs before can signify important events, and should
therefore be investigated.
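As a rough illustration of the method (not of how syslog-ng implements it), the following Python sketch throws away every line that matches a list of known, routine patterns and keeps the rest. The patterns and the sample log lines are made up for the example; a real list would be built up from the messages a given system actually produces.

```python
import re

# Hypothetical patterns for messages considered routine on this system.
KNOWN_BORING = [
    re.compile(r"^Accepted \S+ for \S+ from \S+ port \d+"),
    re.compile(r"^session (opened|closed) for user \S+"),
    re.compile(r"^\(root\) CMD \("),
]


def leftovers(lines):
    """Yield only the lines that match none of the known patterns."""
    for line in lines:
        if not any(p.search(line) for p in KNOWN_BORING):
            yield line


# Made-up sample input.
sample = [
    "Accepted password for joe from 10.50.0.247 port 42156 ssh2",
    "session opened for user joe by (uid=0)",
    "segfault at 0 ip 00007f3a error 4 in libfoo.so",
    "(root) CMD (run-parts /etc/cron.hourly)",
]

interesting = list(leftovers(sample))
```

Only the segfault line survives the filter, which is exactly the kind of leftover that deserves a closer look.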
The syslog-ng pattern database
The syslog-ng application can compare the contents of the received log messages to a set of predefined message patterns. That way, syslog-ng is able to identify the exact log message and assign a class to the message that describes the event that triggered the log message. By default, syslog-ng uses the unknown, system, security, and violation classes, but this can be customized, and further tags can also be assigned to the identified messages.
The traditional approach to identifying log messages is to use regular
expressions (as the
logcheck project does, for example). The syslog-ng pattern database uses radix trees for this task, and that has the following important advantages:
- Classifying messages is fast, much faster than with methods based on regular expressions. The speed of processing a message is practically independent of the total number of patterns. What matters is the length of the message and the number of "similar" messages, as this affects the number of junctions in the radix tree.
- Regular-expression based methods become slower as the number of patterns increases. Radix trees scale very well, because only a relatively small number of simple comparisons must be performed to parse the messages.
- The syslog-ng message patterns are easy to write, understand, and maintain.
For example, compare the following:
A log message from an OpenSSH server:
Accepted password for joe from 10.50.0.247 port 42156 ssh2
A regular expression that describes this log message and its variants:
Accepted \
(gssapi(-with-mic|-keyex)?|rsa|dsa|password|publickey|keyboard-interactive/pam) \
for [^[:space:]]+ from [^[:space:]]+ port [0-9]+( (ssh|ssh2))?
An equivalent pattern for the syslog-ng pattern database:
Accepted @QSTRING:auth_method: @ for @QSTRING:username: @ from \
@QSTRING:client_addr: @ port @NUMBER:port:@ @QSTRING:protocol_version: @
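To make the comparison concrete, here is the regular expression above transcribed for Python's re module and applied to the sample message. Two things differ from the original and are choices made for this sketch: the POSIX class [^[:space:]] is written as \S, and named groups have been added, with names borrowed from the syslog-ng pattern, so that the extracted values can be inspected.

```python
import re

# Python transcription of the regular expression shown above.
# The named groups are an addition for this example.
SSH_ACCEPTED = re.compile(
    r"Accepted "
    r"(?P<auth_method>gssapi(?:-with-mic|-keyex)?|rsa|dsa|password|publickey"
    r"|keyboard-interactive/pam) "
    r"for (?P<username>\S+) from (?P<client_addr>\S+) "
    r"port (?P<port>[0-9]+)(?: (?P<protocol_version>ssh2?))?"
)

message = "Accepted password for joe from 10.50.0.247 port 42156 ssh2"
match = SSH_ACCEPTED.fullmatch(message)

# Every captured value is a string; the port is not converted to a number.
fields = match.groupdict() if match else {}
```

Even in this form the expression has to spell out each alternative and each separator by hand, which is the maintenance burden the pattern syntax tries to avoid.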
Obviously, log messages describing the same event can be different: they
can contain data that varies from message to message, like usernames, IP
addresses, timestamps, and so on. This is what makes parsing log messages
with regular expressions so difficult. In syslog-ng, these parts of the
messages can be covered with special fields called parsers, which are the
constructs between '@' in the example. Such parsers process a specific type of data like a string (@STRING@), a number (@NUMBER@ or @FLOAT@), or an IP address (@IPV4@, @IPV6@, or @IPVANY@). Also, parsers can be given a name and referenced in filters or as a macro in the names of log files or database tables.
It is also possible to parse the message until a specific ending character or string using the @ESTRING@ parser, or the text between two custom characters with the @QSTRING@ parser.
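The combination of a lookup tree and typed parsers can be imitated in a few lines of Python. The sketch below is a toy and makes several simplifying assumptions that do not hold for syslog-ng itself: it builds a word-level tree rather than a character-level radix tree, it treats every parser as matching exactly one whitespace-separated token, and it uses a reduced @TYPE:name@ notation without the parameter field seen in the pattern above. All function and class names are hypothetical.

```python
import ipaddress


def _is_ipv4(token):
    try:
        ipaddress.IPv4Address(token)
        return True
    except ValueError:
        return False


# Toy stand-ins for three parser types; each one checks a single token.
PARSERS = {"STRING": str.isalnum, "NUMBER": str.isdigit, "IPV4": _is_ipv4}


class Node:
    def __init__(self):
        self.literals = {}    # exact token -> child Node
        self.parsers = []     # (parser type, field name, child Node)
        self.rule_id = None   # set where a complete pattern ends


def add_pattern(root, pattern, rule_id):
    """Insert a pattern; patterns with a common prefix share nodes."""
    node = root
    for token in pattern.split():
        if len(token) > 2 and token.startswith("@") and token.endswith("@"):
            ptype, _, name = token[1:-1].partition(":")
            for known_type, known_name, child in node.parsers:
                if (known_type, known_name) == (ptype, name):
                    node = child
                    break
            else:
                child = Node()
                node.parsers.append((ptype, name, child))
                node = child
        else:
            node = node.literals.setdefault(token, Node())
    node.rule_id = rule_id


def classify(root, message):
    """Return (rule_id, fields), or (None, {}) if no pattern matches."""
    def walk(node, tokens, fields):
        if not tokens:
            return (node.rule_id, fields) if node.rule_id else None
        head, rest = tokens[0], tokens[1:]
        child = node.literals.get(head)
        if child is not None:
            hit = walk(child, rest, fields)
            if hit:
                return hit
        for ptype, name, child in node.parsers:
            if PARSERS[ptype](head):
                hit = walk(child, rest, {**fields, name: head})
                if hit:
                    return hit
        return None

    return walk(root, message.split(), {}) or (None, {})


TAIL = ("@STRING:auth_method@ for @STRING:username@ from "
        "@IPV4:client_addr@ port @NUMBER:port@ @STRING:protocol_version@")
root = Node()
add_pattern(root, "Accepted " + TAIL, "ssh-accepted")
add_pattern(root, "Failed " + TAIL, "ssh-failed")

rule, fields = classify(
    root, "Accepted password for joe from 10.50.0.247 port 42156 ssh2")
```

The lookup walks the message once, token by token, and at each step only looks at the children of the current node. Adding patterns for unrelated messages therefore adds branches elsewhere in the tree but does not lengthen this walk, which is the intuition behind the scaling claim made earlier.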
A syslog-ng pattern database is an XML file that stores patterns and
various metadata about the patterns. The message
patterns are sample messages that are used to identify the incoming
messages; while metadata can include descriptions, custom tags, a message
class (which is just a special type of tag), and name-value pairs (which are yet another type of tag).
The syslog-ng application has built-in macros for using the results of
the classification: the .classifier.class macro contains the class
assigned to the message (e.g., violation, security, or unknown) and the
.classifier.rule_id macro contains the identifier of the message
pattern that matched the message. It is also possible to filter on the
tags assigned to a message. As with syslog, these routing rules are
specified in the syslog-ng.conf file.
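How the classification results might drive routing can be sketched in Python as well. This is not syslog-ng configuration: the ${name} placeholder syntax, the directory layout, and the policy of sending unclassified messages to a review file are assumptions made for the example. Only the two macro names come from the description above.

```python
import re

# Placeholder syntax chosen for this sketch: ${name}
MACRO = re.compile(r"\$\{([^}]+)\}")


def expand(template, values):
    """Replace each ${name} with values[name]; unknown names become empty."""
    return MACRO.sub(lambda m: values.get(m.group(1), ""), template)


def destination_for(values):
    """Pick a log file based on the classification results (made-up policy)."""
    if values.get(".classifier.class", "unknown") == "unknown":
        return "/var/log/review/unclassified.log"
    return expand(
        "/var/log/classified/${.classifier.class}/${.classifier.rule_id}.log",
        values,
    )


classified = destination_for(
    {".classifier.class": "system", ".classifier.rule_id": "ssh-accepted"})
unclassified = destination_for({})
```

Keeping the unclassified messages in a file of their own fits the artificial ignorance approach: whatever lands there is, by construction, something no pattern has accounted for yet.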
Using syslog-ng
To use these features, get syslog-ng 3.1; older versions use an earlier and less complete database format. As most distributions still package version 2.x, you will probably have to download it from the syslog-ng download page.
The syntax of the pattern database file might seem a bit intimidating at
first, but most of the elements are optional. Check The
syslog-ng 3.1 Administrator Guide [PDF] and the sample database files to start with, and write to the mailing list if you run into problems.
A small utility called pdbtool is available in syslog-ng 3.1 to help with testing and managing pattern databases. It allows you to quickly check whether a particular log message is recognized by the database, and also to merge the XML files into a single XML file for syslog-ng. See pdbtool --help for details.
Closing remarks
The syslog-ng pattern database provides a powerful framework for classifying messages, but it is powerless without the message patterns that make it work. IT systems consist of several components running many applications, which means a lot of message patterns to create. This clearly calls for a community effort to build the critical mass of patterns needed to make all of this usable.
To start with, BalaBit, the developer of syslog-ng, has made a number of experimental pattern databases available. Currently, these files contain over 8000 patterns for over 200 applications and devices, including Apache, Postfix, Snort, and various common firewall appliances.
The syslog-ng pattern databases are freely available for use under the terms of the Creative Commons Attribution-Noncommercial-Share Alike 3.0 (CC BY-NC-SA) license.
A community site for sharing pattern databases is reportedly also under construction, but until this becomes a reality, pattern database related discussions and inquiries should go to the general syslog-ng mailing list.