One-stop performance analysis using atop

May 12, 2010

This article was contributed by Jan Christiaan van Winkel & Gerlof Langeveld

Linux system administrators often receive complaints about the performance of their systems. It can be rather difficult to track down these problems and to find why, when, and how often they happen. Being able to zoom in on the processes that are responsible, and to see what has happened in the past, is very valuable. The atop utility was written with just these things in mind.

Performance analysis tools

Linux has a rich set of tools for performance analysis, but each has its own capabilities and limitations. In developing atop, the following features were considered desirable:

The tool should obviously be able to show the current situation. However, many resource problems don't occur "now". Often complaints will come in about the system performance "last night" or "last week". Therefore the tool must be able to look in the past. Being able to look in the future would be a "nice to have" but was deemed too difficult to implement.
It should show the load of the four main resources on a system level: CPU, memory, disk I/O, and network usage.
The four main resources are consumed by or on behalf of processes, so the tool should be able to show which processes (over)load the four resources.
A monitoring tool takes snapshots of the system, using a certain interval. If a process used resources since the last snapshot but has exited before the current snapshot, the tool should still be able to show which processes loaded which resources. In other words: the sum of resource usage by the processes should be equal to the system wide reported resource usage.

Looking at this list of requirements, none of the existing standard analysis tools fits the bill. sar shows extensive data regarding CPU, memory, disk and network usage from the past and the present. However, it cannot "zoom in" on processes: it only shows resource usage on a system level. vmstat and iostat can only show CPU, memory and disk usage on a system level; they cannot show usage data from the past. Finally top, one of the most used performance monitors, does show CPU and memory usage on a system level and on a process level. However, it only shows the current situation; it cannot show usage data from the past. It also does not show the resource consumption of exited processes, so with top it is possible that on a system level the CPU is shown as 90% busy, while the sum of all CPU consumption on a process level is only 40% (the other 50% might have been used by processes that exited between the previous and the current snapshot).
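The gap described for top can be made concrete with a little arithmetic. The sketch below is purely illustrative: the function name and the per-process figures are hypothetical and are not taken from any tool.

```python
def unaccounted_cpu(system_busy_pct, per_process_pct):
    """Return the part of the system-wide CPU usage that the listed
    processes do not explain (hypothetical helper for illustration)."""
    return system_busy_pct - sum(per_process_pct)


# Figures from the text: the system is 90% busy, but the processes that
# are still alive at snapshot time only add up to 40%.
visible_processes = [25.0, 10.0, 5.0]
gap = unaccounted_cpu(90.0, visible_processes)
print(f"CPU used by processes that are no longer visible: {gap:.0f}%")
```

A tool that only looks at living processes has no way to attribute that remainder to anything.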

This chart compares the characteristics of these other analysis tools with atop:

atop is free software, and can be downloaded from the web site, though many Linux distributions include atop in their repositories. After installing atop, the command atopsar is also available. It can be compared to sar but references the same log files that are generated and used by atop.

Characteristics of `atop`

atop was created mainly because the other tools don't report about processes that exit between snapshots. When using "process accounting", the kernel writes a record to a log file for every process that exits. atop will use these records to make a process activity list that is as complete as possible, including processes that exited since the last snapshot.
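The idea of combining live processes with accounting records can be sketched as follows. This is a simplified model under stated assumptions, not atop's actual implementation: the data layout (dictionaries of cumulative CPU seconds, tuples for accounting records) is invented for the example, and PID reuse is ignored.

```python
def activity_list(prev, curr, exited):
    """Build per-process CPU usage for one interval.

    prev, curr: dict mapping pid -> (name, cumulative CPU seconds) at the
                previous and current snapshot.
    exited:     list of (pid, name, total CPU seconds) taken from accounting
                records written during the interval (hypothetical format).
    """
    rows = []
    for pid, (name, cpu) in curr.items():
        before = prev.get(pid, (name, 0.0))[1]
        rows.append((pid, name, cpu - before))
    for pid, name, total in exited:
        before = prev.get(pid, (name, 0.0))[1]
        # Angle brackets mark a process that no longer exists.
        rows.append((pid, f"<{name}>", total - before))
    # Busiest process first.
    return sorted(rows, key=lambda row: row[2], reverse=True)


prev = {1: ("init", 1.0), 10: ("sshd", 2.0)}
curr = {1: ("init", 1.0), 10: ("sshd", 2.5)}
exited = [(42, "bzip2", 6.1)]  # started and finished between snapshots

for pid, name, used in activity_list(prev, curr, exited):
    print(f"{pid:>5} {name:<10} {used:.2f}s")
```

Without the accounting records, the bzip2 row would be missing and most of the CPU time in this interval would be unexplained.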

atop shows the load of the CPUs, memory, disks, and network on a system level. Apart from the network, atop also shows which processes consume these resources (for network utilization per process, a kernel patch is provided). By default, atop shows generic information about processes (like PID, name, CPU utilization, memory utilization, disk utilization, and status). However, more information about the process's memory usage, disk I/O, and scheduling characteristics is available by using single-character keystrokes (for example, s for scheduling characteristics).

Users can always override the default sorting order that atop uses. For example, for more information about a process's memory usage, the M subcommand sorts the processes in descending order of their resident memory usage. But, these processes can also be sorted on their disk I/O usage by using the D subcommand. Typing A will let atop determine what the most sensible sorting order would be given the most heavily used resource at the moment. In the system overview (the top half of the screen) a line will be highlighted if that particular resource is overloaded.
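One way to picture the automatic sorting order is to pick the busiest resource first and then sort the process list on the matching column. The sketch below is only an illustration of that idea; the field names, the utilization figures, and the selection rule are assumptions and do not describe atop's real heuristic.

```python
# Hypothetical mapping from a system-level resource to a per-process field.
SORT_KEYS = {"cpu": "cpu_pct", "memory": "rss_kib", "disk": "io_kib"}


def pick_sort_key(utilization):
    """Return the per-process field that belongs to the busiest resource."""
    busiest = max(utilization, key=utilization.get)
    return SORT_KEYS[busiest]


def sort_processes(processes, key):
    """Sort processes in descending order of the given field."""
    return sorted(processes, key=lambda proc: proc[key], reverse=True)


utilization = {"cpu": 35.0, "memory": 60.0, "disk": 92.0}  # percent busy
processes = [
    {"name": "httpd", "cpu_pct": 20.0, "rss_kib": 90000, "io_kib": 120},
    {"name": "tar", "cpu_pct": 5.0, "rss_kib": 1200, "io_kib": 48000},
    {"name": "mysqld", "cpu_pct": 10.0, "rss_kib": 300000, "io_kib": 9000},
]

key = pick_sort_key(utilization)
for proc in sort_processes(processes, key):
    print(proc["name"], proc[key])
```

With the disk as the most heavily used resource, tar ends up at the top of the list.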

Obviously, not all data about all resources can be shown on the screen at once. Therefore, if the window is resized, atop will automatically show more (or less) data depending on the room available. Configurable priorities are used to determine what data is no longer shown if there is too little space.

Using `atop` on a system level

The default screen of atop looks like this:

[atop screen]

Just like top, the screen is divided into two parts: the top half shows system-wide data, and the bottom half is used for per-process data.

On a system level, not only CPU and memory statistics are visible, but also disk and network usage data. In the example above, the line with label CPU shows a total of (27+61+25+214+73)=400% CPU capacity, so there are 4 CPUs in this system. The lines labeled cpu show the individual CPUs (each rated at 100%). The CPUs are listed in order of busyness. The CPU is considered busy when it is in system mode, user mode, or handling interrupts, and it is idle when idle or waiting for I/O (wait). Therefore, in this case the sort order for the CPUs is 3, 1, 2, 0 as shown in the last column, just in front of the wait percentage "w".

The header line shows that ten seconds have elapsed since the previous snapshot. During this time, four CPUs provide a total of 40 seconds of computing capacity. The line with label PRC shows the sum of the CPU time used by processes, i.e. 5.20 seconds in system mode (corresponds to 27% sys + 25% irq in the line labeled CPU) and 6.20 in user mode (corresponds to 61% user in the line labeled CPU).
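The relation between the percentages on the CPU line and the seconds on the PRC line can be checked with a short calculation. The numbers come from the example screen discussed above; the assignment of 214 and 73 to idle and wait is assumed from the order of the sum in the text.

```python
interval_s = 10
num_cpus = 4
capacity_s = interval_s * num_cpus  # 40 CPU-seconds available in total

# Percentages from the CPU line; 100% equals one fully used CPU.
cpu_line = {"sys": 27, "user": 61, "irq": 25, "idle": 214, "wait": 73}
assert sum(cpu_line.values()) == num_cpus * 100


def pct_to_seconds(pct, interval_s):
    """Convert a CPU percentage over an interval into CPU-seconds."""
    return pct * interval_s / 100


system_s = pct_to_seconds(cpu_line["sys"] + cpu_line["irq"], interval_s)
user_s = pct_to_seconds(cpu_line["user"], interval_s)

print(f"capacity: {capacity_s}s, system mode: {system_s:.2f}s, user mode: {user_s:.2f}s")
```

System mode works out to 5.20 seconds, matching the PRC line. User mode works out to 6.10 seconds, which is close to the 6.20 seconds reported there.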

A line labeled DSK gives information about a physical disk that has been active in the past interval. It shows the name of the disk, the I/O busy percentage, the number of read and write requests, and the average service time per request (avio). By making the screen wider, more data is shown: the disk bandwidth for reads and writes (in MiB/s) and the average queue length for that disk.
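The busy percentage and the average service time can both be derived from the time a disk spent handling I/O during the interval. The sketch below assumes a busy-time counter in milliseconds, such as the "time spent doing I/Os" field in /proc/diskstats; the function and the sample values are hypothetical, and atop's own calculation may differ in detail.

```python
def disk_metrics(io_ms, reads, writes, interval_s):
    """Return (busy percentage, average ms per request) for one interval.

    io_ms:  milliseconds the disk was busy during the interval (assumed input)
    reads:  read requests completed during the interval
    writes: write requests completed during the interval
    """
    requests = reads + writes
    busy_pct = 100.0 * io_ms / (interval_s * 1000)
    avio_ms = io_ms / requests if requests else 0.0
    return busy_pct, avio_ms


# Made-up sample: the disk was busy for 2.5 of the last 10 seconds.
busy, avio = disk_metrics(io_ms=2500, reads=100, writes=400, interval_s=10)
print(f"busy {busy:.0f}%  avio {avio:.2f} ms")
```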

If the system uses LVM or MD software RAID volumes, the same information is shown for each active logical volume or MD volume. In the figure below, it is clear that several writes (111 plus 1) to logical volumes may be combined into fewer writes to the physical disk (54). The combined transfers are larger and therefore have a longer service time per transfer.

[LVM disk output]

In the same way, data for memory availability, usage, paging, and page scanning is shown. The last lines in the system overview show network related data, per interface, on IP-layer level and on transport level.

Using `atop` on a process level

It is useful to be able to see how busy the system is, but if a system is too busy, the tool needs to be able to zoom in to find the culprit. This is where atop shines. atop tries to make sure the books balance. Other tools do not take processes that exited into account. For example, with top it is possible that on a system level the CPUs are 99% busy, even though top shows only two active processes that together have used only 5% of the CPU in the sample period. One notorious example of this is a kernel compilation: lots of short-lived processes that eat up all of the CPU, but only a few of them show up in top's output.

atop uses process accounting to take into account processes that have exited. In the first full screenshot shown at system level, we can see bzip2 having used 61% of the CPU time. atop lists it in angle brackets, to show that the process has exited. In addition, the exit code (column EXC) is shown. It is interesting to see if a process is eating up CPU power in system mode (system call and interrupt handling) or in user mode. Fortunately, atop shows you both, per process.

The Linux process scheduler determines which process gets the CPU. Scheduling information can be seen by using the s subcommand:

[scheduling info]

One can see how many threads are in the "running" state (TRUN), "sleeping interruptible" (TSLPI), or "sleeping uninterruptible" (TSLPU). The scheduling policy (normal, round-robin realtime, fifo realtime, etc.) is also shown.
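The three thread counters correspond to the one-letter task states that Linux reports, for instance in /proc/<pid>/task/<tid>/stat: R for running, S for interruptible sleep, and D for uninterruptible sleep. The following sketch counts them from a list of state letters; how atop collects the states is not shown here, and the helper is hypothetical.

```python
from collections import Counter

# Kernel task state letter -> column name used in the scheduling view.
STATE_COLUMNS = {"R": "TRUN", "S": "TSLPI", "D": "TSLPU"}


def count_thread_states(states):
    """Count threads per column; states outside the mapping are ignored."""
    counts = Counter({column: 0 for column in STATE_COLUMNS.values()})
    for state in states:
        column = STATE_COLUMNS.get(state)
        if column is not None:
            counts[column] += 1
    return dict(counts)


# A process with six threads: one running, four sleeping, one waiting on I/O.
print(count_thread_states(["R", "S", "S", "S", "S", "D"]))
```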

CPU is not the only scarce resource in the system. Therefore, atop can also show per-process usage statistics for memory, disk bandwidth, and network bandwidth. Zooming in on disk statistics (using the d subcommand), atop shows the following:

[disk usage info]

Recent kernels are often configured with the option "Per-task storage I/O accounting", so the kernel keeps track of how much data is passed by the write and read system calls related to disk I/O (WRDSK and RDDSK respectively). However, not all write system calls will lead to physical writes on disk. For example, if a region of a file is written and then overwritten before the data was flushed from the page cache to disk, the first writes are shown by atop even though they have never been written to disk. In this case, the column WCANCL shows the amount of data whose physical write was canceled. In the example above, the actions of tar canceled 172KiB worth of writes.
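On kernels with per-task I/O accounting, the counters are also visible in /proc/<pid>/io as read_bytes, write_bytes, and cancelled_write_bytes. The sketch below parses text in that format and computes how much written data was not canceled. It assumes that these counters are the ones behind RDDSK, WRDSK, and WCANCL, and it uses a made-up sample rather than a live process.

```python
def parse_proc_io(text):
    """Parse 'key: value' lines as found in /proc/<pid>/io into a dict."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if value.strip():
            fields[key.strip()] = int(value)
    return fields


# Made-up sample; 176128 bytes equals the 172 KiB mentioned in the text.
SAMPLE = """\
rchar: 3000000
wchar: 2500000
syscr: 120
syscw: 95
read_bytes: 1048576
write_bytes: 2097152
cancelled_write_bytes: 176128
"""

io = parse_proc_io(SAMPLE)
net_written_kib = (io["write_bytes"] - io["cancelled_write_bytes"]) // 1024
print(f"written: {io['write_bytes'] // 1024} KiB, "
      f"canceled: {io['cancelled_write_bytes'] // 1024} KiB, "
      f"net: {net_written_kib} KiB")
```

To inspect a real process, the same function could be fed the contents of /proc/self/io, provided the kernel was built with the accounting option.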

Extra information with patches

The kernel does not register network bandwidth usage per process. Patches are available that make the kernel keep track of network usage per process. After receiving the n subcommand, atop will show network related data per process:

[network usage info]

The TCPRCV/TCPSND and UDPRCV/UDPSND columns show the number of packets being received and sent per process by these transport layers. The RAWRCV and RAWSND columns show the number of "raw" packets received and sent. These are packets that go directly from the application to the IP layer, not passing through TCP or UDP. For example, the ping program sends ICMP ECHO REQUEST packets directly through the ICMP layer and receives ICMP ECHO REPLY packets in return.

The TCPSASZ column shows the average send transfer size. If the screen is wide enough, the average receive transfer size is also shown, both for TCP and for UDP.

Unfortunately, these patches are not part of the mainline kernel. In 2008 an attempt was made to merge them, but the modifications conflicted with other new features (like cgroups) that were under development at the time.

Back to the future

atop is useful as a tool for the here and now. But what if the system was slow in the past? The normal installation of the atop package starts an atop daemon nightly. This daemon takes snapshots and writes them to a log file (/var/log/atop/atop_YYYYMMDD). The default snapshot interval for a logging atop is 10 minutes, but obviously this is configurable. Every logfile is preserved for a month (also configurable), so performance events a full month back can still be observed.
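Given the naming scheme above, finding the log file for a particular day is straightforward. The helper names below are hypothetical; the directory and the 10-minute default are those mentioned in the text and may differ on a system that has been configured otherwise.

```python
from datetime import date

LOG_DIR = "/var/log/atop"  # default location named in the article


def log_path(day):
    """Return the daily log file name for the given date."""
    return f"{LOG_DIR}/atop_{day:%Y%m%d}"


def samples_per_day(interval_minutes=10):
    """Number of snapshots a full day holds at the given interval."""
    return 24 * 60 // interval_minutes


print(log_path(date(2010, 5, 12)))  # /var/log/atop/atop_20100512
print(samples_per_day())            # 144
```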

The log file can be viewed using atop -r log_filename. The subcommand t forwards to the next sample in the atop logfile (i.e. 10 minutes by default), subcommand T rewinds one sample. Subcommand b branches to a specific time in the current logfile. All other subcommands to zoom in on specific resources also work. The logfile that the atop daemon creates can also be viewed using a sar-like interface using the command atopsar.
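The navigation through a log file can be thought of as a cursor moving over a sorted list of sample times. The class below is a toy model of the t, T, and b subcommands, not atop's code. In particular, it assumes that branching to a time lands on the first sample at or after that time, which the text does not specify.

```python
from bisect import bisect_left


class SampleCursor:
    """Toy cursor over sample times (minutes since midnight)."""

    def __init__(self, timestamps):
        self.timestamps = sorted(timestamps)
        self.pos = 0

    def current(self):
        return self.timestamps[self.pos]

    def forward(self):
        """Like 't': move to the next sample, stopping at the last one."""
        self.pos = min(self.pos + 1, len(self.timestamps) - 1)
        return self.current()

    def back(self):
        """Like 'T': move to the previous sample, stopping at the first one."""
        self.pos = max(self.pos - 1, 0)
        return self.current()

    def branch(self, when):
        """Like 'b': jump to the first sample at or after the given time."""
        index = bisect_left(self.timestamps, when)
        self.pos = min(index, len(self.timestamps) - 1)
        return self.current()


cursor = SampleCursor([0, 10, 20, 30, 40])
print(cursor.forward())   # 10
print(cursor.branch(25))  # 30
print(cursor.back())      # 20
```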

One-stop analysis

Performance analysis is a cyclic process of measuring, drawing conclusions, measuring in more detail, drawing more detailed conclusions, and so on until you can really pinpoint the aching spot. Performance problems do not only occur in the here and now, but also in the past, so a performance analysis tool should be able to give information in both situations. As described in this article, atop has been designed to be a complete one-stop tool to guide you at least through the first cycles (system and process level for the four most critical resources: CPU, memory, disk I/O and network).

More information about atop can be found on the atop website, as well as in the atop man page [PDF]. There is also a case study [PDF] available that shows how atop can be used to analyze a problem with processes leaking memory.

Index entries for this article
GuestArticles	Langeveld, Gerlof
GuestArticles	Van Winkel, Jan Christiaan

One-stop performance analysis using atop

Posted May 13, 2010 14:49 UTC (Thu) by HappyCamp (guest, #29230) [Link] (3 responses)

Thanks a lot for the article. I am going to try this tool out.

The project's website is somewhat disappointing. I could not find any information on a mailing list, IRC channel, source code repository, who the author is, etc...

The only contact method was a web form :(

Main reason I was looking for this info was I was trying to see if it was possible to make the display fill the whole screen, since it chops off the command line arguments on my display, even though it could go much wider.

One-stop performance analysis using atop

Posted May 13, 2010 18:53 UTC (Thu) by hazmat (subscriber, #668) [Link] (1 responses)

latest release (1.2.5) seems to work fine with scaling to larger terminal windows.

One-stop performance analysis using atop

Posted May 13, 2010 21:26 UTC (Thu) by HappyCamp (guest, #29230) [Link]

Thanks!

Fedora 12 has 1.23. And looking at the changelog it appears that 1.24 added better ability to size the atop program based on the display.

One-stop performance analysis using atop

Posted May 14, 2010 11:42 UTC (Fri) by jcgl (guest, #66101) [Link]

The authors' email address is listed in the atop man page.
( http://atoptool.nl/download/man_atop.pdf )

One-stop performance analysis using atop

Posted May 14, 2010 9:34 UTC (Fri) by roryi (subscriber, #25099) [Link]

Let me add another note of thanks for this article.

I've long used a mix of sar, dstat, and iotop - atop looks like a great replacement for all three.

One-stop performance analysis using atop

Posted May 16, 2010 10:18 UTC (Sun) by fjalvingh (guest, #4803) [Link]

A very useful addition to the *top commands! Thanks for the article!

System Activity in KDE

Posted May 19, 2010 11:45 UTC (Wed) by johnflux (guest, #58833) [Link]

Hi all,

I'm the author of System Activity in KDE. It pops up if you press ctrl+esc in KDE.

It has some of the features of atop. For example, it shows the disk IO usage, lets you see and set the CPU scheduler and I/O scheduler, etc.

It also has javascript scripting support which is used, for example, to show detailed memory information for a process etc. And shows the window title for each process - very useful if you have multiple firefox programs running etc.

Anyway, check it out. My aim has been to make it as simple to use as possible.

One-stop performance analysis using atop

Posted May 20, 2010 20:52 UTC (Thu) by oak (guest, #2786) [Link] (1 responses)

htop is also pretty nice, easily configurable and although ncurses based, supports mouse & colors. It shows only the current state, not history (and no disk or network usage), but from TIME+ column one can check whether process CPU ticks value increase.

Latest version allows stracing selected process (it should use "-f" arg for strace though, so that all threads are tracked) or using lsof.

The problem with these fancier top programs (e.g. compared to simple Busybox top) is that they take more CPU and are slower to start. If your system is sometimes _really_ slow, but only temporarily, it matters a lot how fast "top" starts and can update its screen.

One-stop performance analysis using atop

Posted Jun 14, 2010 0:47 UTC (Mon) by Tobu (subscriber, #24111) [Link]

The problem with these fancier top programs (e.g. compared to simple Busybox top) is that they take more CPU and are slower to start. If your system is sometimes _really_ slow, but only temporarily, it matters a lot how fast "top" starts and can update its screen.

The nice thing with atop is that it has per-process, historical info. Assuming you use the debian package which sets up cronjobs and logrotate, you can run sudo atop -r, use t and T to navigate the day in 10s slices, find the time the system slowed down, and pinpoint the problem process and the scarce resource. Thanks to process accounting, data is logged reliably under load.

One-stop performance analysis using atop

Posted May 30, 2010 22:34 UTC (Sun) by sfink (guest, #6405) [Link] (1 responses)

I love atop and have been using it quite a bit over the last year. One warning, though -- the binary format changes rather frequently, and newer versions of the program are generally not backward-compatible with older formats. This can be a big problem if you are using it to capture performance regression data for test runs. The newer versions often add enough juicy features to make the upgrade worthwhile, but then you can no longer write a single tool that will allow comparing results with your older runs.

For a while, I dealt with it by writing a wrapper script that autodetected the appropriate version and used the text dump command-line options, but eventually it was enough of a nuisance that I patched in backwards compatibility. The patch wasn't accepted, I forget why. (I think the reason was reasonable, though.)

One-stop performance analysis using atop

Posted May 31, 2010 16:31 UTC (Mon) by jcvw (subscriber, #50475) [Link]

This problem has been solved as of version 1.25.

One-stop performance analysis using atop

Posted Mar 14, 2012 11:07 UTC (Wed) by liljencrantz (guest, #28458) [Link]

Great idea for an article.

I read about atop years ago, and it sounded very useful, but I never got round to ditching good old top - familiarity is sometimes powerful. This article provided me a little kick, and I just apt-get:ed atop. Hopefully I'll stick with it.
