Sanitizing log file output

By Jake Edge
June 29, 2011

Handling user-controlled data properly is one of the basic principles of computer security. Various kernel log messages allow user-controlled strings to be placed into the messages via the "%s" format specifier, which could be used by an attacker to potentially confuse administrators by inserting control characters into the strings. So Vasiliy Kulikov has proposed a patch that would escape certain characters that appear in those strings. There is some question as to which characters should be escaped, but the bigger question is an age-old one in security circles: whitelisting vs. blacklisting.

The problem stems from the idea that administrators will often use tools like tail and more to view log files on a TTY. If a user can insert control characters (and, in particular, escape sequences) into the log file, they could potentially cause important information to be overlooked—or cause other kinds of confusion. In the worst case, escape sequences could potentially exploit some hole in the terminal emulator program to execute code or cause other misbehavior. In the patch, Kulikov gives the following example: "Control characters might fool root viewing the logs via tty, e.g. using ^[1A to suppress the previous log line." For characters that are filtered, the patch simply replaces them with "#xx", where xx is the hex value of the character.

It's a fairly minor issue, at some level, but it's not at all clear that there is any legitimate use of control characters in those user-supplied strings. The strings could come from various places; two that were mentioned in the discussion were filenames or USB product ID strings. The first version of the patch clearly went too far by escaping characters above 0x7e (in addition to control characters), which would exclude Unicode and other non-ASCII characters. But after complaints about that, Kulikov's second version just excludes control characters (i.e. < 0x20) with the exception of newline and tab.
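
To make the rule in the second version concrete, here is a minimal Python sketch of the same idea. It is not the kernel patch itself, which is C code; the function name `escape_control` and the lowercase two-digit hex formatting are assumptions made for illustration.

```python
def escape_control(text: str) -> str:
    """Replace control characters below 0x20 with "#xx", keeping newline and tab.

    Hypothetical helper: mirrors the rule described for the second version
    of the patch, not its actual implementation.
    """
    out = []
    for ch in text:
        code = ord(ch)
        if code < 0x20 and ch not in "\n\t":
            out.append(f"#{code:02x}")  # e.g. ESC (0x1b) becomes "#1b"
        else:
            out.append(ch)  # printable ASCII and non-ASCII pass through
    return "".join(out)


# The example sequence from the patch description, ESC followed by "[1A":
print(escape_control("usb product: \x1b[1Aharmless"))
```

With this rule, the ESC byte is rendered as `#1b` and the rest of the sequence becomes inert text, while characters above 0x7e are left alone.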

That didn't sit well with Ingo Molnar, however, who thought that rather than whitelisting the known-good characters, blacklisting those known to be potentially harmful should be done instead:

Also, i think it would be better to make this opt-out, i.e. exclude the handful of control characters that are harmful (such as backline and console escape), instead of trying to include the known-useful ones.

[...] It's also the better approach for the kernel: we handle known harmful things and are permissive otherwise.

But, in order to create a blacklist, one must carefully determine the effects of the various control characters on all the different terminal emulators, whereas the whitelist approach has the advantage of being simpler by casting a much wider net. As Kulikov notes, figuring out which characters are problematic is not necessarily simple:

Could you instantly answer without reading the previous discussion what control characters are harmful, what are sometimes harmful (on some ttys), and what are always safe and why (or even answer why it is harmful at all)? I'm not a tty guy and I have to read console_codes(4) or similar docs to answer this question, the majority of kernel devs might have to read the docs too.

The disagreement between Molnar and Kulikov is one that has gone on in the security world for many years. There is no right answer as to which is better. As with most things in security (and software development for that matter), there are tradeoffs between whitelists and blacklists. In general, for user-supplied data (in web applications for example), the consensus has been to whitelist known-good input, rather than attempting to determine all of the "bad" input to exclude. At least in this case, though, Molnar does not see whitelists as the right approach:

A black list is well-defined: it disables the display of certain characters because they are *known to be dangerous*.

A white list on the other hand does it the wrong way around: it tries to put the 'burden of proof' on the useful, good guys - and that's counter-productive really.

It won't come as a surprise that Kulikov disagreed with that analysis: "What do you do with dangerous characters that are *not yet known* to be dangerous?" While there is little question that whitelisting the known-good characters is more secure, it is less flexible if there is a legitimate use for other control characters in the user-supplied strings. In addition, Molnar is skeptical that there are hidden dangers lurking in the ASCII control characters: "This claim is silly - do you claim some 'unknown bug' in the ASCII printout space?"
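
The blacklist side of the argument can be sketched the same way. The set of "known harmful" characters below (backspace and ESC) is a hypothetical choice made for this example, not a list taken from the discussion, and the names are invented.

```python
# Hypothetical blacklist: only backspace (0x08) and ESC (0x1b) are escaped.
BLACKLIST = frozenset({"\x08", "\x1b"})


def escape_blacklisted(text: str, blacklist=BLACKLIST) -> str:
    """Escape only the characters that appear on the blacklist."""
    return "".join(
        f"#{ord(ch):02x}" if ch in blacklist else ch
        for ch in text
    )


# ESC is on the list, so it is neutralized.
print(repr(escape_blacklisted("a\x1b[1Ab")))

# Carriage return is not on this particular list, so it passes through
# untouched; on many terminals it moves the cursor back to column 0.
print(repr(escape_blacklisted("real message\rfake")))
```

Whether a character like carriage return belongs on the list is exactly the kind of judgment a blacklist requires for every terminal in use; the whitelist avoids the question by escaping it along with everything else below 0x20.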

In this particular case, either solution should be just fine, as there aren't any good reasons to include those characters, but Molnar is probably right that there aren't hidden dangers in ASCII. There is a question as to whether this change is needed at all, however. The concern that spawned the patch is that administrators might miss important messages or get fooled by carefully crafted input (Willy Tarreau provides an interesting example of the latter). Linus Torvalds is not convinced that it is really a problem that needs addressing:

I really think that user space should do its own filtering - nobody does a plain 'cat' on dmesg. Or if they do, they really have themselves to blame.

And afaik, we don't do any escape sequence handling at the console level either, so you cannot mess up the console with control characters.

And the most dangerous character seems to be one that you don't filter: the one we really do react to is '\n', and you could possibly make confusing log messages by embedding a newline in your string and then trying to make the rest look like something bad (say, an oops).
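
Torvalds's point about newlines can be illustrated with a short sketch. It assumes a filter that follows the rule described above (escape everything below 0x20 except newline and tab); the log prefix and the forged text are invented for the example.

```python
def escape_control(text: str) -> str:
    """Escape control characters below 0x20, keeping newline and tab."""
    return "".join(
        f"#{ord(ch):02x}" if ord(ch) < 0x20 and ch not in "\n\t" else ch
        for ch in text
    )


def log_message(prefix: str, user_string: str) -> str:
    """Build a log message the way a "%s" format would, after filtering."""
    return f"{prefix}: {escape_control(user_string)}"


# A user-controlled string with an embedded newline survives the filter...
message = log_message("usb product", "Widget\nOops: forged by the device")

# ...so a reader sees two lines, the second one entirely attacker-chosen.
lines = message.split("\n")
for line in lines:
    print(line)
```

Because the newline is deliberately left alone, the filter does nothing to stop the second, fabricated line from appearing as if it were a separate kernel message.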

Given Torvalds's skepticism, it doesn't seem all that likely this patch will go anywhere even if it were changed to a blacklisting approach as advocated by Molnar. It is, or should be, a fairly minor concern, but the question about blacklisting vs. whitelisting is one we will likely hear again. There are plenty of examples of both techniques being used in security (and other) contexts. It often comes down to a choice between more security (whitelisting typically) or more usability (blacklisting). This case is no different, really, and others are sure to crop up.



Sanitizing log file output

Posted Jun 30, 2011 7:57 UTC (Thu) by dlang (✭ supporter ✭, #313) [Link]

in the case of escaping characters, another reason to whitelist instead of blacklist is that the resulting code is shorter (allow a handful of control characters, then everything above a particular value is a printable character vs a growing case of 'if the value is X' point conditions)

I have seen strange things happen when control characters hit a terminal that's not expecting them (including commands getting executed). I've seen this happen because the terminal and the system sending data to it had different opinions on what character encoding was in use; it doesn't take malicious people to cause problems.

Sanitizing log file output

Posted Jun 30, 2011 10:34 UTC (Thu) by etienne (subscriber, #25256) [Link]

If ESC, i.e. '\033', is forbidden, you may also want to forbid CSI, i.e. '\233', which was treated as a short version of "\033[" in VT200+ in 8-bit mode.
The trick is to set the name of the terminal to "sudo rm -rf /" and then ask the terminal name, but that is not supported by most terminal emulators.
Also, one could forbid writing XON/XOFF to the console, if filtering is needed.

Sanitizing log file output

Posted Jun 30, 2011 14:08 UTC (Thu) by cesarb (subscriber, #6266) [Link]

Wait, isn't 0x9B perfectly valid as part of a normal UTF-8 character? You would have to know whether the terminal which will display the log output (which could be on a different machine running a different operating system on the other side of the world) is in UTF-8 mode (in which case CSI is 0xC2 0x9B, and other sequences containing 0x9B should not be filtered out), or in ISO-8859-1 mode (in which case 0x9B should always be filtered out), or in some other mode (I have no idea how other multibyte encodings represent CSI).
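
The encoding point can be checked directly. The sketch below uses Python's standard codecs; the `naive_filter` function is a hypothetical byte-level filter invented to show the failure mode, and U+011B is just one example of a character whose UTF-8 form contains 0x9B.

```python
# CSI is the code point U+009B. Its byte form depends on the encoding.
csi = "\u009b"
csi_latin1 = csi.encode("latin-1")  # single byte 0x9b
csi_utf8 = csi.encode("utf-8")      # two bytes 0xc2 0x9b

# U+011B (e with caron) is an ordinary letter whose UTF-8 encoding
# happens to contain 0x9b as a continuation byte.
letter_utf8 = "\u011b".encode("utf-8")  # bytes 0xc4 0x9b


def naive_filter(data: bytes) -> bytes:
    """Hypothetical filter that escapes every 0x9b byte, ignoring encoding."""
    return data.replace(b"\x9b", b"#9b")


# Applied to UTF-8 text, the byte-level filter breaks the letter apart.
damaged = naive_filter(letter_utf8)
try:
    damaged.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False

print(csi_latin1, csi_utf8, letter_utf8, damaged, still_valid)
```

A filter that works on raw bytes therefore has to know, or guess, which encoding the eventual reader's terminal will use.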

Sanitizing log file output

Posted Jun 30, 2011 14:34 UTC (Thu) by dgm (subscriber, #49227) [Link]

I think Vasiliy is wrong, but so is Ingo.

The burden of escaping characters should be in the tool used to display the logs, as it is in the position to know what is dangerous and what not.
So, let the kernel save whatever the user enters, but tell administrators not to use cat to output directly to the terminal.

Sanitizing log file output

Posted Jul 1, 2011 9:29 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Ok. What tool should be used on BusyBox to view logs, then? Userspace filtering here is just WRONG.

Or maybe the whole syslog subsystem should be redesigned to store formatting string and parameters separately (like Windows does, btw). It would also allow easier log analysis.
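
As a rough illustration of what storing the format string and parameters separately could look like, here is a small Python sketch. The `LogRecord` type, the `render` function and the escaping rule are all hypothetical; nothing here describes how syslog or Windows actually stores events.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class LogRecord:
    """Hypothetical record: a trusted format plus untrusted parameters."""
    fmt: str
    args: Tuple[str, ...]


def escape_control(text: str) -> str:
    """Escape control characters below 0x20, keeping newline and tab."""
    return "".join(
        f"#{ord(ch):02x}" if ord(ch) < 0x20 and ch not in "\n\t" else ch
        for ch in text
    )


def render(record: LogRecord, escape: Callable[[str], str]) -> str:
    """Escape only the parameters at display time, then format."""
    return record.fmt % tuple(escape(arg) for arg in record.args)


record = LogRecord("device %s attached", ("x\x1b[1Ay",))
print(render(record, escape_control))
```

Since the parameters stay separate until display, each viewer could apply whatever escaping suits its output, and analysis tools could match on the format string alone.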

Sanitizing log file output

Posted Jul 1, 2011 14:25 UTC (Fri) by geofft (subscriber, #59789) [Link]

Is it actually unusual for people to run "dmesg" instead of e.g. "dmesg | cat -A"?

Should dmesg(1) be patched to sanitize output for the current terminal/locale settings? (Is it already?)

What format is the raw kernel buffer, anyway? UTF-8?

Sanitizing log file output

Posted Jul 1, 2011 19:06 UTC (Fri) by nix (subscriber, #2304) [Link]

I can safely say that I have never run dmesg | cat -A in my life, nor would I ever have thought of doing so had it not been for this thread. I don't think syslog-ng has any defence against this yet, either (though doubtless baszi has already added it by now!)

Sanitizing log file output

Posted Jul 1, 2011 20:09 UTC (Fri) by dlang (✭ supporter ✭, #313) [Link]

for what little it's worth, rsyslog does escape control characters by default.

this is a place that I think a bit of paranoia is good.

what does it hurt to change some characters to hex codes? the readability suffers a tiny bit, but this isn't a novel that you are reading, it's just a log message. as long as the escaping is done consistently, does it really matter whether you see fooXbar (where X is something other than an ASCII printable character) or foo#xxxbar (or however you present the escaping)? in either case what you are really going to end up doing is searching or matching the string, and it really doesn't matter which you use for that purpose.

Sanitizing log file output

Posted Jul 1, 2011 20:06 UTC (Fri) by jrn (subscriber, #64214) [Link]

I would think the usual case is "dmesg | less".

Sanitizing log file output

Posted Jul 1, 2011 20:29 UTC (Fri) by nix (subscriber, #2304) [Link]

Indeed. This sanitizes things fine as long as you have the right things in $LESS.

Sanitizing log file output

Posted Jul 3, 2011 0:21 UTC (Sun) by mgedmin (subscriber, #34497) [Link]

I find myself doing "dmesg | tail" more often.

Sanitizing log file output

Posted Jul 2, 2011 22:25 UTC (Sat) by malor (subscriber, #2973) [Link]

The concept of trying to enumerate badness is probably the source of more security bugs than any other idea in computing.

Always enumerate safety. It's usually slightly less convenient, but when you're building something that will be used for generations (as much of this code is likely to be), it's the responsible approach.

What happens when the kernel goes fully to Unicode? It's going to happen someday, if it hasn't already. Thinking "we can blacklist because it's just ASCII" is short-term thinking. It won't always be just ASCII.

Sanitizing log file output

Posted Jul 6, 2011 11:32 UTC (Wed) by dsommers (subscriber, #55274) [Link]

I've been pondering this for a couple of days ... When does the kernel need to log data which is non-7bit-ASCII? Obviously it is related to messages which are non-English or to other binary data. But seriously, when does this *really* happen? Can someone point me to some kernel code where this is an important feature? In which other scenarios would non-7bit-ASCII values be valuable? When something above 0x7f needs to be logged, it's often binary data - where a hex notation might even make more sense.

UTF-8 also supports the complete 7 bit ASCII range, so displaying purified 7bit ASCII strings on UTF-8 terminals is not a problem.

In this perspective, it makes most sense to whitelist \t, \n and 0x20-0x7f ... IMO, anything outside this range should be considered potentially harmful, and can and should therefore be escaped.
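
A byte-level version of this stricter whitelist might look like the sketch below. One deliberate difference is an assumption of this example: the range stops at 0x7e rather than 0x7f, because 0x7f is DEL, itself a control character. The function name is invented.

```python
def escape_non_ascii(data: bytes) -> str:
    """Keep tab, newline and printable ASCII; escape every other byte as "#xx".

    Hypothetical helper. The upper bound 0x7e (not 0x7f) is an assumption
    made here so that DEL is escaped as well.
    """
    out = []
    for byte in data:
        if byte in (0x09, 0x0A) or 0x20 <= byte <= 0x7E:
            out.append(chr(byte))
        else:
            out.append(f"#{byte:02x}")
    return "".join(out)


# UTF-8 text loses its readable form but stays unambiguous to match on.
print(escape_non_ascii("caf\u00e9".encode("utf-8")))
```

The cost is visible in the output: every non-ASCII character turns into a run of hex escapes, which is the readability tradeoff raised against the first version of the patch.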

And *if* this causes an issue later on with UTF-8, it is easier to expand a valid range - than to do the reverse. Which is kind of the "pain" in this discussion, reducing the number of valid characters is always painful.

Of course user-space should do its own sanitation as well. *But* security is about layers. If the kernel can provide sane log data, and user-space can also filter out gibberish which sneaked under the kernel radar, that's when you have redundant security. Tossing the ball back and forth claiming user-space or kernel-space should have the responsibility is just silly, especially the day when something breaks.

Sanitizing log file output

Posted Jul 6, 2011 12:17 UTC (Wed) by malor (subscriber, #2973) [Link]

Agreed wholeheartedly. Whitelisting with a very basic set of characters is absolutely the right way to go.

I think the people saying "you shouldn't cat your dmesg" are being idiotic.

Sanitizing log file output

Posted Jul 6, 2011 12:23 UTC (Wed) by dsommers (subscriber, #55274) [Link]

I won't call anyone idiots, as I firmly believe that even the dmesg and cat user-space binaries should also do some kind of sanitation of the data they process. But that sanitation needs to be done according to those programs' needs and requirements. Hence, dmesg can most likely be much stricter than cat about what it passes on.

Sanitizing log file output

Posted Jul 6, 2011 12:39 UTC (Wed) by malor (subscriber, #2973) [Link]

Well, sure, but it just makes sense to do it properly at the source. Security works best in layers. *Everyone* should fix the problem, both in kernel and in userspace. Saying "you shouldn't do that" is inadequate, when pretty much everyone in the entire world is doing that.

Sanitizing log file output

Posted Jul 6, 2011 13:49 UTC (Wed) by nix (subscriber, #2304) [Link]

cat cannot possibly do sanitization by default. A major use is in pipelines, in which it is sometimes used to stream all sorts of arbitrary binary data to other processes which never send it to the screen at all.

It could do it if its stdout isatty() I suppose, but that has so many holes it's nearly not worth it for a security thing (ls(1) uses this to tell how many columns to use, and note how easy it is to get it to switch to one-column mode accidentally).
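
The isatty() idea can be sketched as follows. This is an illustration only, not how cat or ls behave internally; `show_log`, `escape_control` and the `FakeTty` stand-in are names invented for the example.

```python
import io
import sys


def escape_control(text: str) -> str:
    """Escape control characters below 0x20, keeping newline and tab."""
    return "".join(
        f"#{ord(ch):02x}" if ord(ch) < 0x20 and ch not in "\n\t" else ch
        for ch in text
    )


def show_log(text: str, stream=None) -> None:
    """Write text unchanged to pipes and files, escaped to terminals."""
    stream = sys.stdout if stream is None else stream
    if stream.isatty():
        text = escape_control(text)
    stream.write(text)


class FakeTty(io.StringIO):
    """Stand-in for a terminal so the behavior can be shown without one."""

    def isatty(self) -> bool:
        return True


pipe, tty = io.StringIO(), FakeTty()
show_log("a\x1bb\n", pipe)  # not a terminal: bytes pass through untouched
show_log("a\x1bb\n", tty)   # reports itself as a terminal: ESC is escaped
```

The weakness nix describes is visible in the design: the decision rests entirely on what the output stream reports about itself, so any intermediate pipe or wrapper silently turns the protection off.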

Sanitizing log file output

Posted Jul 6, 2011 15:15 UTC (Wed) by malor (subscriber, #2973) [Link]

I'd actually hit on that in my prior comment, and then felt I was probably digressing a bit too much, and deleted that paragraph. Just as well, because you were more specific anyway.

As you say, it kind of breaks the whole idea of cat, which is to take a stream of bytes from somewhere and echo it to stdout, without changing it. Cat's useful in a zillion different places, and if that filtering code got triggered by accident, it'd make a hell of a mess.

Cat is simple and reliable code, and adding in all that complexity to sanitize something that should have been sanitized in the first place is fundamentally a broken idea. And what about all the other (hundreds?) of programs that might touch dmesg and send it to the console?

In my view, 'don't use cat for dmesg' isn't reasonable. The devs making this argument are saying that the most fundamental Unix tool for echoing text to a screen, is not suitable for echoing text to a screen.

Sanitizing log file output

Posted Jul 7, 2011 16:59 UTC (Thu) by dgm (subscriber, #49227) [Link]

But the truth is you shouldn't. Especially when the fix is so easy: use less instead of cat.

It's not the kernel's responsibility to know that some data is dangerous to the program you use to display it. "Sane data" is something that's completely different if you are using a VT-100 or a web browser. Trying to force this kind of policy where it doesn't belong is clearly shortsighted.

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds