Log message classification with syslog-ng

January 13, 2010

This article was contributed by Robert Fekete

Operating systems, applications, and network devices generate text messages of the events that happen to them: a user logs in, a file is created, a network connection is opened to a remote host, etc. These messages, called log messages, can be used to detect security incidents, operational problems, policy violations, and are useful in auditing and forensics situations. Traditionally, classifying log messages has been done external to the syslog system, with various log file analysis utilities, but a new feature in syslog-ng seeks to do that processing within the syslog daemon itself. By using a simpler syntax for describing log messages, along with a fast mechanism for recognizing them, message classification in syslog-ng can decrease the need for log file post-analysis, which will help ease the burden for system administrators.

Log messages do not have predefined content: they can be straightforward or obscure, depending on the attitude of the developer who wrote them. Either way, most of the time they are written with human readers in mind. This ignores the fact that these days more and more companies and organizations collect the log messages of their computers on a central log server and try to process them automatically to detect break-in attempts, network errors, and other issues.

Classifying messages with syslog-ng attempts to remedy this situation by making it possible to add metadata (e.g., event type like user login, hardware error) to the log messages. It can also extract the relevant data (like the username) from the messages and determine what to do or where to store the log message based on this information. For example, if you need to create reports about specific events, you can collect the messages of the relevant events into a separate log file, which can be used as the basis of the reports.

A brief introduction to syslog and syslog-ng

Applications usually send their log messages to the system logging daemon of the operating system, which delivers the messages to the place where the log messages are stored: to log files on the local machine (found typically under /var/log/), or to a remote server. Most UNIX and Linux operating systems use the syslogd application as the system logging daemon. The syslog daemon adds some meta-information (called the syslog header) to the received log messages, like the date and time the message was received, or the name or address of the host where it was created.

The nine-year-old syslog-ng project is a popular alternative syslog daemon, licensed under GPLv2, that has established its name with reliable message transfer and flexible message filtering and sorting capabilities. In that time it has gained many new features, including direct logging to SQL databases, TLS-encrypted message transport, and the ability to parse and modify the content of log messages. The SUSE and openSUSE distributions use syslog-ng as their default syslog daemon.

In syslog-ng 3.0 a new message-parsing and classifying feature (dubbed pattern database or patterndb) was introduced. With recent improvements in 3.1 and the increasing demand for processing and analyzing log messages, a look at the syslog-ng capabilities is warranted.

The main task of a central syslog-ng log server is to collect the messages sent by the clients and route the messages to their appropriate destinations depending on the information received in the header of the syslog message or within the log message itself. Using various filters, it is possible to build even complex, tree-like log routes.
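The idea of a tree-like log route can be sketched in a few lines of Python. This is only an illustration of the concept, not syslog-ng configuration syntax: the message fields (host, program, text) and the destination file names are made up for the example. Each node holds a filter, an optional destination, and nested child routes that are only consulted when the parent filter accepts the message.

```python
def route(message, node):
    """Return the destinations of every nested route that accepts the message."""
    accepts, destination, children = node
    if not accepts(message):
        return []          # a rejected parent also hides all of its children
    found = [destination] if destination else []
    for child in children:
        found.extend(route(message, child))
    return found

# Hypothetical route tree: (filter, destination or None, child routes).
ROUTES = (
    lambda m: True, None, [
        (lambda m: m["program"] == "sshd", "sshd.log", [
            (lambda m: "Failed" in m["text"], "sshd-failures.log", []),
        ]),
        (lambda m: m["host"].startswith("router"), "network.log", []),
    ],
)

ssh_msg = {"host": "web1", "program": "sshd", "text": "Failed password for joe"}
router_msg = {"host": "router1", "program": "kernel", "text": "link down"}

ssh_destinations = route(ssh_msg, ROUTES)
router_destinations = route(router_msg, ROUTES)
print(ssh_destinations, router_destinations)
```

The nesting means the "Failed" filter is evaluated only for messages that have already been identified as coming from sshd, which is what makes such routes tree-like.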

It is equally simple to modify the messages by using rewrite rules instead of filters if needed. Rewrite rules can do simple search-and-replace, but can also set a field of the message to a specific value: this comes in handy when a client does not properly format its log messages to comply with the syslog RFCs. (This is surprisingly common with routers and switches.) Version 3.1 of syslog-ng makes it possible to rewrite the structured data elements in messages that use the latest syslog message format (RFC5424).

Artificial ignorance

Classifying and identifying log messages has many uses. It can be useful for reporting and compliance, but can also be important from the security and system maintenance point of view. The syslog-ng pattern database is also advantageous if you are using the "artificial ignorance" log processing method, which was described by Marcus J. Ranum (MJR):

Artificial Ignorance - a process whereby you throw away the log entries you know aren't interesting. If there's anything left after you've thrown away the stuff you know isn't interesting, then the leftovers must be interesting.

Artificial ignorance is a method to detect the anomalies in a working system. In log analysis, this means recognizing and ignoring the regular, common log messages that result from the normal operation of the system, and therefore are not too interesting. However, new messages that have not appeared in the logs before can signify important events, and should therefore be investigated.
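A minimal sketch of the method in Python might look like the following. The list of "boring" patterns and the sample messages are invented for the illustration, and regular expressions are used only to keep the example short; they are not how the syslog-ng pattern database recognizes messages.

```python
import re

# Hypothetical patterns for messages known to be routine on this system.
KNOWN_BORING = [
    re.compile(r"^Accepted (password|publickey) for \S+ from \S+ port \d+"),
    re.compile(r"^session (opened|closed) for user \S+"),
]

def leftovers(messages):
    """Return the messages that match none of the known-boring patterns."""
    return [m for m in messages
            if not any(p.search(m) for p in KNOWN_BORING)]

sample = [
    "Accepted password for joe from 10.50.0.247 port 42156 ssh2",
    "session opened for user joe",
    "kernel: EXT3-fs error (device sda1): read failure",
]

interesting = leftovers(sample)
print(interesting)
```

Whatever survives the filtering is, by definition, something nobody has declared uninteresting yet, so it is worth a look.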

The syslog-ng pattern database

The syslog-ng application can compare the contents of the received log messages to a set of predefined message patterns. That way, syslog-ng is able to identify the exact log message and assign a class to the message that describes the event that has triggered the log message. By default, syslog-ng uses the unknown, system, security, and violation classes, but this can be customized, and further tags can also be assigned to the identified messages.

The traditional approach to identify log messages is to use regular expressions (as the logcheck project does for example). The syslog-ng pattern database uses radix trees for this task, and that has the following important advantages:

Classifying messages is fast, much faster than with methods based on regular expressions. The speed of processing a message is practically independent of the total number of patterns. What matters is the length of the message and the number of "similar" messages, as this affects the number of junctions in the radix tree.
Regular-expression based methods become increasingly slow as the number of patterns increases. Radix trees scale very well, because only a relatively small number of simple comparisons must be performed to parse the messages.
The syslog-ng message patterns are easy to write, understand, and maintain.
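
To see why lookup cost depends on the message rather than on the number of patterns, consider the following Python sketch. It is not syslog-ng's implementation: it uses a plain character trie (an uncompressed relative of a radix tree), and the WORD and NUMBER parser kinds, the rule names, and the patterns are simplified stand-ins invented for the example.

```python
class Node:
    def __init__(self):
        self.children = {}   # literal character -> Node
        self.parsers = []    # (kind, field name, Node) edges for variable parts
        self.rule_id = None  # set where a complete pattern ends

def add_pattern(root, tokens, rule_id):
    """Insert a pattern given as literal strings and (kind, name) tuples."""
    node = root
    for tok in tokens:
        if isinstance(tok, str):
            for ch in tok:
                node = node.children.setdefault(ch, Node())
            continue
        for kind, name, child in node.parsers:
            if (kind, name) == tok:      # reuse an existing parser edge
                node = child
                break
        else:
            child = Node()
            node.parsers.append((tok[0], tok[1], child))
            node = child
    node.rule_id = rule_id

def consume(kind, msg, pos):
    """Return the index just past the variable part that starts at pos."""
    end = pos
    if kind == "NUMBER":
        while end < len(msg) and msg[end].isdigit():
            end += 1
    elif kind == "WORD":
        while end < len(msg) and not msg[end].isspace():
            end += 1
    return end

def match(node, msg, pos=0, fields=None):
    """Walk the trie along msg; return (rule_id, fields) or None."""
    fields = {} if fields is None else fields
    if pos == len(msg):
        return (node.rule_id, fields) if node.rule_id else None
    child = node.children.get(msg[pos])
    if child is not None:
        found = match(child, msg, pos + 1, fields)
        if found:
            return found
    for kind, name, child in node.parsers:
        end = consume(kind, msg, pos)
        if end > pos:
            found = match(child, msg, end, {**fields, name: msg[pos:end]})
            if found:
                return found
    return None

root = Node()
add_pattern(root, ["Accepted password for ", ("WORD", "username"), " from ",
                   ("WORD", "client_addr"), " port ", ("NUMBER", "port"),
                   " ssh2"], "ssh-login")
add_pattern(root, ["Failed password for ", ("WORD", "username"), " from ",
                   ("WORD", "client_addr")], "ssh-failed")

result = match(root, "Accepted password for joe from 10.50.0.247 port 42156 ssh2")
print(result)
```

Patterns that share a prefix share a path in the tree, so adding more patterns mostly adds branches that a given message never visits.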

For example, compare the following:

A log message from an OpenSSH server:

    Accepted password for joe from 10.50.0.247 port 42156 ssh2

A regular expression that describes this log message and its variants:

    Accepted \ 
        (gssapi(-with-mic|-keyex)?|rsa|dsa|password|publickey|keyboard-interactive/pam) \
        for [^[:space:]]+ from [^[:space:]]+ port [0-9]+( (ssh|ssh2))?

An equivalent pattern for the syslog-ng pattern database:

    Accepted @QSTRING:auth_method: @ for @QSTRING:username: @ from \ 
        @QSTRING:client_addr: @ port @NUMBER:port:@ @QSTRING:protocol_version: @

Obviously, log messages describing the same event can be different: they can contain data that varies from message to message, like usernames, IP addresses, timestamps, and so on. This is what makes parsing log messages with regular expressions so difficult. In syslog-ng, these parts of the messages can be covered with special fields called parsers, which are the constructs between '@' in the example. Such parsers process a specific type of data like a string (@STRING@), a number (@NUMBER@ or @FLOAT@), or IP address (@IPV4@, @IPV6@, or @IPVANY@). Also, parsers can be given a name and referenced in filters or as a macro in the names of log files or database tables.

It is also possible to parse the message until a specific ending character or string using the @ESTRING@ parser, or the text between two custom characters with the @QSTRING@ parser.
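
The behavior described for these two parsers can be approximated in Python as follows. The function names and return convention are invented for this sketch, and details such as whether the ending string is consumed are assumptions; the Administrator Guide is the authority on the exact semantics.

```python
def estring(text, pos, end_str):
    """Take everything from pos up to end_str; return (value, next position).

    Assumption: the ending string itself is consumed but not captured.
    """
    stop = text.find(end_str, pos)
    if stop == -1:
        return None
    return text[pos:stop], stop + len(end_str)

def qstring(text, pos, open_ch, close_ch):
    """Take the text between open_ch at pos and the next close_ch."""
    if pos >= len(text) or text[pos] != open_ch:
        return None
    stop = text.find(close_ch, pos + 1)
    if stop == -1:
        return None
    return text[pos + 1:stop], stop + 1

until_from = estring("joe from 10.50.0.247", 0, " from ")
quoted = qstring('user="joe" ok', 5, '"', '"')
print(until_from, quoted)
```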

A syslog-ng pattern database is an XML file that stores patterns and various metadata about the patterns. The message patterns are sample messages that are used to identify the incoming messages, while metadata can include descriptions, custom tags, a message class (which is just a special type of tag), and name-value pairs (which are yet another type of tag).
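
To give a feel for the structure, the Python sketch below reads a tiny pattern database and lists its rules. The XML shown is a hypothetical, heavily simplified layout: the element and attribute names, rule identifiers, and patterns are assumptions made for the example, and the real schema described in the Administrator Guide has more elements and attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified database; not the complete syslog-ng schema.
SAMPLE_DB = """\
<patterndb>
  <ruleset name="sshd">
    <rules>
      <rule id="example-ssh-accepted" class="system">
        <patterns>
          <pattern>Accepted @QSTRING:auth_method: @ for @QSTRING:username: @</pattern>
        </patterns>
      </rule>
      <rule id="example-ssh-failed" class="violation">
        <patterns>
          <pattern>Failed password for @QSTRING:username: @</pattern>
        </patterns>
      </rule>
    </rules>
  </ruleset>
</patterndb>
"""

def list_rules(xml_text):
    """Return one dict per rule with its ruleset, id, class, and patterns."""
    root = ET.fromstring(xml_text)
    rules = []
    for ruleset in root.iter("ruleset"):
        for rule in ruleset.iter("rule"):
            rules.append({
                "ruleset": ruleset.get("name"),
                "id": rule.get("id"),
                "class": rule.get("class"),
                "patterns": [p.text for p in rule.iter("pattern")],
            })
    return rules

rules = list_rules(SAMPLE_DB)
for r in rules:
    print(r["ruleset"], r["id"], r["class"], r["patterns"])
```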

The syslog-ng application has built-in macros for using the results of the classification: the .classifier.class macro contains the class assigned to the message (e.g., violation, security, or unknown) and the .classifier.rule_id macro contains the identifier of the message pattern that matched the message. It is also possible to filter on the tags assigned to a message. As with syslog, these routing rules are specified in the syslog-ng.conf file.
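
The effect of using such macros in a file name can be imitated in Python. In this sketch the ${name} placeholder syntax, the HOST field, the rule identifier, and the directory layout are all assumptions made for the illustration; only the .classifier.class and .classifier.rule_id names come from syslog-ng itself.

```python
import re

def expand(template, message):
    """Replace each ${name} in template with the message's value for name."""
    return re.sub(r"\$\{([^}]+)\}",
                  lambda m: message.get(m.group(1), ""),
                  template)

# Hypothetical classified message, represented as name-value pairs.
msg = {
    "HOST": "web1",
    ".classifier.class": "violation",
    ".classifier.rule_id": "example-rule-1",
}

path = expand("/var/log/classified/${HOST}/${.classifier.class}.log", msg)
print(path)
```

Because the class becomes part of the path, every message class ends up in its own file without a separate route having to be written for each one.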

Using syslog-ng

In order to use these features, get syslog-ng 3.1; older versions use an earlier and less complete database format. As most distributions still package version 2.x, you will probably have to download it from the syslog-ng download page.

The syntax of the pattern database file might seem a bit intimidating at first, but most of the elements are optional. Check The syslog-ng 3.1 Administrator Guide [PDF] and the sample database files to start with, and write to the mailing list if you run into problems.

A small utility called pdbtool is available in syslog-ng 3.1 to help with testing and managing pattern databases. It allows you to quickly check if a particular log message is recognized by the database, and also to merge the XML files into a single XML file for syslog-ng. See pdbtool --help for details.

Closing remarks

The syslog-ng pattern database provides a powerful framework for classifying messages, but it is powerless without the message patterns that make it work. IT systems consist of several components running many applications, which means a lot of message patterns to create. This clearly calls for community effort to create a critical mass of patterns where all this becomes usable.

To start with, BalaBit, the developer of syslog-ng, has made a number of experimental pattern databases available. Currently, these files contain over 8000 patterns for over 200 applications and devices, including Apache, Postfix, Snort, and various common firewall appliances. The syslog-ng pattern databases are freely available for use under the terms of the Creative Commons Attribution-Noncommercial-Share Alike 3.0 (CC BY-NC-SA) license.

A community site for sharing pattern databases is reportedly also under construction, but until this becomes a reality, pattern database related discussions and inquiries should go to the general syslog-ng mailing list.


Maybe less is more?

Posted Jan 15, 2010 20:57 UTC (Fri) by eparis123 (guest, #59739)

Sometimes I'm not very comfortable with these "central" sorts of approaches, as they get over-complicated over time.

Maybe it's the fear of change, but I like the simple sparse list of configuration files under /etc rather than the GNOME gconf XML repository.

By the same token, while the files under /var/log are a bit chaotic compared to the syslog-ng approach, they are much simpler, and simplicity is good.

The tool seems to shine for server farms though.

Maybe less is more?

Posted Jan 17, 2010 7:26 UTC (Sun) by frobert (guest, #62734)

You are absolutely right, using patterndb for a single host might be an overkill. It is mainly aimed at larger networks.

Artificial Stupidity ~= antilogging

Posted Jan 16, 2010 0:47 UTC (Sat) by davecb (subscriber, #1574)

Oh cool, it was Marcus who thought of this approach! I wrote up the "antilog" variant of it in
"Sherlock Holmes on Log Files",
http://datacenterworks.com/stories/antilog.html.

I'll drop him a line...

--dave

it would be nice to see a comparison with rsyslog

Posted Feb 14, 2010 22:12 UTC (Sun) by dlang (guest, #313)

as most distros are moving from traditional syslog to rsyslog, the baseline capabilities are climbing.

rsyslog doesn't have a pattern database, but it does have quite a bit of filtering and re-writing capability.