
Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.34-rc2, released (without announcement) on March 20. A lot of changes went in since the -rc1 release; see the short-form changelog for an overview, or see the full changelog for all the details.

Quotes of the week

With netlink you can do whatever you like - it is like ioctl but without the guilt.
-- Neil Brown

What you've created is no longer a single project, it is called a distro, and you're being short-sighted and anti-social to think you can garner more support than all of those individual packages you forked. This is why most developers work upstream and let the goodness propagate down from the top like molten sugar of each granular package on a flan where it is collected from the rich custard channel sitting on a distribution plate below before the big hungry mouth of the consumer devours it and incorporates it into their infrastructure.
-- Zachary Amsden (Thanks to Michael S. Tsirkin)

What happens is that hundreds of bug reports land in my inbox and I get to route them to various maintainers, most of whom don't exist, so warnings keep on landing in my inbox. Please send a mailing address for my invoices.

It would be more practical, more successful and quicker to hunt down the miscreants and send them rude emails. Plus it would save you money.

-- Andrew Morton

I guess you are talking to the wrong person as i actually have implemented ls functionality in the kernel, using async IO concepts and extreme threading ;-) It was a bit crazy, but was also the fastest FTP server ever running on this planet.
-- Ingo Molnar

Ceph distributed filesystem merged for 2.6.34

Linus's allegedly shorter-than-usual merge window has seemingly mutated into one of the longest merge windows in recent times. Along with big trees for the Microblaze and Blackfin architectures and the SCSI subsystem, the kernel has just gained the Ceph distributed filesystem, a high-performance filesystem intended to scale into the petabyte range.

SystemTap 1.2 released

SystemTap 1.2 - a dynamic tracing system for kernel and user space - is out. The summary reads: "prototype perf event and hw-breakpoint probing, security fixes, error tolerance script language extensions, optimizations, tapsets, interesting new sample scripts, kernel versions 2.6.9 through 2.6.34-rc." The support for perf events and hardware breakpoints should make a number of tracing tasks easier.

SSL on kernel.org

John "Warthog9" Hawley has announced the availability of SSL encryption (i.e. https) for kernel.org. The kernel bugzilla, wikis, account requests, and the Patchwork patch tracker have all been defaulted to https via an http redirect. In addition, the www, boot, git, and android.git subdomains of kernel.org can use SSL if the user specifies https in the URL. There are no plans to support SSL for mirrors.kernel.org, because "these machines move a large amount of data to a large number of users and it would be difficult, and memory intensive, to provide SSL for this service." Hawley also notes that Thawte donated signed SSL certificates, which "alleviates a large amount of support effort that self-signed certificates would have incurred".

Piecemeal tracepoints?

By Jake Edge
March 24, 2010

On March 23, Jan Kara proposed a patch that would enable tracepoints selectively for different subsystems at build time. His concern was that debugging one particular area using tracepoints would end up "polluting" other kernel paths with tracepoint checks. Allowing tracing for a particular subsystem, without the potential performance degradation from tracepoint tests in other subsystems, is the goal. But various other kernel hackers saw things differently.

Quite a bit of work has gone into making disabled-but-present tracepoints have a very minimal impact on performance. Frederic Weisbecker described it this way: "each tracepoint is a lightweight thing and induce a tiny overhead, probably hard to notice, and this is going to be even more the case after the jmp label optimization patches." There are lots of benefits to having tracepoints be an "all or none" proposition as well. As part of developing tracepoints, Mathieu Desnoyers thought about and rejected the idea:

When I considered if it was worth it to create such a per-tracepoint group compile-time disabling in the first place, I decided not to do it precisely due to the added-value that comes with the availability of system-wide tracepoints. And I think with the static jump patching, we are now at a point where the overhead is stunningly low.

Ted Ts'o sees that "a lot of the value of tracepoints goes away if people are compiling kernels without them and we need to get a special 'tracing kernel' installed before we can debug a problem". Both Ingo Molnar and Steven Rostedt also agreed, making the prospects for this change rather dim. While piecemeal tracepoints seem attractive at first glance, the value of tracepoints comes, at least partially, from having them all available at once. The belief and hope is that they are built into nearly every kernel, so that when problems arise, they are there, ready to be used.

The end for Video4Linux1

By Jonathan Corbet
March 24, 2010

The Video4Linux1 (V4L1) ABI is deprecated, and has been for a long time; it was ostensibly replaced by Video4Linux2 in the 2.5 development series. But, as has been discovered many times, an ABI is a hard thing to get rid of. So the kernel still supports V4L1 applications; indeed, there are still V4L1-only drivers in current kernels. That situation has persisted for a long time, but it may now be coming to an end.

Hans Verkuil has posted a multi-stage proposal for the removal of V4L1 from the kernel. The first phase involves the conversion of the remaining V4L1 drivers - of which there are several - to the newer ABI. Some of those drivers have since been supplanted by GSPCA and may just be deleted outright. All told, this is a bit of much-needed janitorial work.

Phase 2 may be a bit more controversial, though, in that it calls for the removal of the V4L1 compatibility layer in the kernel. This code allows V4L1 applications to work with V4L2 drivers - most of the time. It was an important bit of backward compatibility support, but it has also helped to delay the updating of a number of old V4L1 applications. Given that these applications do still exist (many distributions still ship xawtv, for example), it might be a bit surprising that this layer is slated for removal, perhaps as soon as 2.6.36.

There are problems with the compatibility layer. It cannot provide access to much of the functionality of contemporary hardware and drivers, it cannot always do the right thing in response to application requests, and it has been a long time since anybody had any interest in maintaining this code. So the V4L developers would like to push it out into user space, and into the libv4l1 library in particular. Supporting old applications would then be a matter of a quick edit (replacing ioctl() calls with v4l1_ioctl(), for example) and a rebuild against the library. Some old applications may be pulled into the V4L project, since their original maintainers have almost certainly long since lost interest.

It's not a perfect solution; old, binary applications will cease to work on newer kernels. It is an ABI break, plain and simple, and it is possible that there will be enough of an uproar to prevent this change from happening in the end. But it may also be that nobody really cares about running binary V4L1 applications on new kernels, and that it is truly time for this old interface to pass into history.

Kernel development news

KVM, QEMU, and kernel project management

By Jonathan Corbet
March 23, 2010

The KVM virtualization subsystem is seen as one of the great success stories of contemporary kernel development. KVM came from nowhere into a situation with a number of established players - both free and proprietary - and promptly found a home in the kernel and in the marketing plans of a number of Linux companies. Both the code and its development model are seen as conforming much more closely to the Linux way of doing things than the alternatives; KVM is expected to be the long-term virtualization solution for Linux. So, one might well wonder, why has KVM been the topic of one of the more massive and less pleasant linux-kernel discussions in some time?

Yanmin Zhang was probably not expecting to set off a flame war with the posting of a patch adding a set of KVM-related commands to the "perf" tool. The value of this patch seems obvious: beyond allowing a host to collect performance statistics on a running guest, it enables the profiling of the host/guest combination as a whole. One can imagine that there would be value to being able to see how the two systems interact.

The problem, it seems, is that this feature requires that the host have access to specific information from the running KVM guest: at a minimum, it needs the guest kernel's symbol table. More involved profiling will require access to files in the guest's namespaces. To this end, Ingo Molnar suggested that life would be easier if the host could mount (read-only) all of the filesystems which were active in the guest. It would also be nice, he said elsewhere, if the host could easily enumerate running guests and assign names to them.

The response he got was "no way." Various security issues were raised, despite the fact that the filesystems on the host would not be world-readable, and despite the fact that, in the end, the host has total control over the guest anyway. Certainly there are some interesting questions, especially when frameworks like SELinux are thrown into the mix. But Ingo took that answer as a statement of unwillingness to cooperate with other developers to improve the usability of KVM, especially on developers' desktop systems. What followed was a sometimes acrimonious and often repetitive discussion between Ingo and KVM developer Avi Kivity, with a small group of supporting actors on both sides.

Ingo's position is that any development project, to be successful, must make life easy for users who contribute code. So, he says, the system should be most friendly toward developers who want to run KVM on their desktop. Beyond that, he claims that a stronger desktop orientation is crucial to our long-term success in general:

I.e. the kernel can very much improve quality all across the board by providing a sane default (in the ext3 case) - or, as in the case of perf, by providing a sane 'baseline' tooling. It should do the same for KVM as well.

If we don't do that, Linux will eventually stop mattering on the desktop - and some time after that, it will vanish from the server space as well. Then, may it be a decade down the line, you won't have a KVM hacking job left, and you won't know where all those forces eliminating your project came from.

Avi, needless to say, sees things differently:

It's a fact that virtualization is happening in the data center, not on the desktop. You think a kvm GUI can become a killer application? fine, write one. You don't need any consent from me as kvm maintainer (if patches are needed to kvm that improve the desktop experience, I'll accept them, though they'll have to pass my unreasonable microkernelish filters). If you're right then the desktop kvm GUI will be a huge hit with zillions of developers and people will drop Windows and switch to Linux just to use it.

But my opinion is that it will end up like virtualbox, a nice app that you can use to run Windows-on-Linux, but is not all that useful.

Ingo's argument is not necessarily that users will flock to the platform, though; what seems to be important is attracting developers. A KVM which is easier to work with should inspire developers to work with it, improving its quality further. Anthony Liguori, though, points out that the much nicer desktop experience provided by VirtualBox has not yet brought in a flood of developers to fix its performance problems.

Another thing that Ingo is unhappy with is the slow pace of improvement, especially with regard to the QEMU emulator used to provide a full system environment for guest systems. A big part of the problem, he says, is the separation between KVM and QEMU, despite the fact that they are fairly tightly coupled components. Ingo claimed that this separation is exactly the sort of problem which brought down Xen, and that the solution is to pull QEMU into the kernel source tree:

If you want to jump to the next level of technological quality you need to fix this attitude and you need to go back to the design roots of KVM. Concentrate on Qemu (as that is the weakest link now), make it a first class member of the KVM repo and simplify your development model by having a single repo.

From Ingo's point of view, such a move makes perfect sense. KVM is the biggest user of the QEMU project which, he says, was dying before KVM came along. Bundling the two components would allow ABI work to be done simultaneously on both sides of the interface, with simultaneous release dates. Kernel and user-space developers would be empowered to improve the code on both sides of the boundary. Bringing perf into the kernel tree, he says, grew the external developer community from one to over 60 in less than one year. Indeed, integration into the kernel tree is the reason why perf has been successful:

If you are interested in the first-hand experience of the people who are doing the perf work then here it is: by far the biggest reason for perf success and perf usability is the integration of the user-space tooling with the kernel-space bits, into a single repository and project.

Clearly, Ingo believes that integrating QEMU into the kernel tree would have similar effects there. Just as clearly, the KVM and QEMU developers disagree. To them, this proposal looks like a plan to fork QEMU development - though, it should be said, KVM already uses a forked version of QEMU. This fork, Avi says, is "definitely hurting." According to Anthony, moving QEMU into the kernel tree would widen that fork:

We lose a huge amount of users and contributors if we put QEMU in the Linux kernel. As I said earlier, a huge number of our contributions come from people not using KVM.

The KVM/QEMU developers are unconvinced that they will get more developers by moving the code into the kernel tree, and they seem frankly amused by the notion that kernel developers might somehow produce a more desktop-oriented KVM. They see the separation of the projects as not being a problem, and wonder where the line would be drawn; Avi suggested that the list of projects which don't belong in the kernel might be shorter in the end. In summary, they see a system which does not appear to be broken - QEMU is said to be improving quickly - and they do not believe that "fixing" it by merging repositories is warranted.

Particular exception was taken to Ingo's assertion that a single repository allows for quicker and better development of the ABI between the components. Slower, says Zachary Amsden, tends to be better in these situations:

This is actually a Good Thing (tm). It means you have to get your feature and its interfaces well defined and able to version forwards and backwards independently from each other. And that introduces some complexity and time and testing, but in the end it's what you want. You don't introduce a requirement to have the feature, but take advantage of it if it is there.

Ingo, though, sees things differently based on his experience over time:

It didn't work, trust me - and i've been around long enough to have suffered through the whole 2.5.x misery. Some of our worst ABIs come from that cycle as well... And you can also see the countless examples of carefully drafted, well thought out, committee written computer standards that were honed for years, which are not worth the paper they are written on.

'extra time' and 'extra bureaucratic overhead to think things through' is about the worst thing you can inject into a development process.

As the discussion wound down, it seemed clear that neither side had made much progress in convincing the other of anything. That means that the status quo will prevail; if the KVM maintainers are not interested in making a change, the rest of the community will be hard-put to override them. Such things have happened - the x86 and x86-64 merger is a classic example - but to override a maintainer in that way requires a degree of consensus in the community which does not appear to be present here. Either that, or a decree from Linus - and he has been silent in this debate.

So the end result looks like this:

Please consider 'perf kvm' scrapped indefinitely, due to lack of robust KVM instrumentation features: due to lack of robust+universal vcpu/guest enumeration and due to lack of robust+universal symbol access on the KVM side. It was a really promising feature IMO and i invested two days of arguments into it trying to find a workable solution, but it was not to be.

Whether that's really the end for "perf kvm" remains to be seen; it's a clearly useful feature that may yet find a way to get into the kernel. But this disconnect between the KVM developers and the perf developers is a clear roadblock in the way of getting this sort of feature merged for now.

Using the TRACE_EVENT() macro (Part 1)

March 24, 2010

This article was contributed by Steven Rostedt

Throughout the history of Linux, people have been wanting to add static tracepoints — functions that record data at a specific site in the kernel for later retrieval — to the kernel. Those efforts weren't very successful because of the fear that tracepoints would sacrifice performance. Unlike the Ftrace function tracer, a tracepoint can record more than just the function being entered. A tracepoint can record local variables of the function. Over time, various strategies for adding tracepoints have been tried, with varying success, and the TRACE_EVENT() macro is the latest way to add kernel tracepoints.

History

Mathieu Desnoyers worked on adding a very low overhead tracer hook called trace markers. Even though the trace markers solved the performance issue by using cleverly crafted macros, the information that the trace marker would record was embedded at the location in the core kernel as a printf format. This upset several core kernel developers as it made the core kernel code look like debug code was left scattered throughout.

In trying to appease the kernel developers, Mathieu came up with tracepoints. The tracepoint included a function call in the kernel code that, when enabled, would call a callback function, passing the parameters of the tracepoint to that function as if the callback function were called with those parameters. This was much better than the trace markers since it allowed the passing of typecast pointers that the callback functions could dereference, as opposed to the marker interface, which required the callback function to parse a string. With the tracepoint, the callback function could efficiently take whatever it needed from the structures.

Although this was an improvement over trace markers, it was still too tedious for developers to create a callback for every tracepoint they wanted to add, so that a tracer would output its data. The kernel needed a more automated way to connect a tracer to the tracepoints. That would require automating the creation of the callback and also format its data, much like what the trace marker did, but it should be done in the callback, and not at the tracepoint site in the kernel code.

To solve this issue of automating the tracepoints, the TRACE_EVENT() macro was born. Inspired by Tom Zanussi's zedtrace, this macro was specifically made to allow a developer to add tracepoints to their subsystem and have Ftrace automatically be able to trace them. The developer need not understand how Ftrace works; they only need to create their tracepoint using the TRACE_EVENT() macro. In addition, they need to follow some guidelines on how to create a header file, and they then gain full access to the Ftrace tracer. Another objective of the design of the TRACE_EVENT() macro was to not couple it to Ftrace or any other tracer. It is agnostic to the tracers that use it, which is apparent now that TRACE_EVENT() is also used by perf, LTTng and SystemTap.

The anatomy of the TRACE_EVENT() macro

Automating tracepoints had several requirements that needed to be fulfilled:

  • It must create a tracepoint that can be placed in the kernel code.

  • It must create a callback function that can be hooked to this tracepoint.

  • The callback function must be able to record the data passed to it into the tracer ring buffer in the fastest way possible.

  • It must create a function that can parse the data recorded to the ring buffer and translate it to a human readable format that the tracer can display to a user.

To accomplish that, the TRACE_EVENT() macro is broken into six components, which correspond to the parameters of the macro:

   TRACE_EVENT(name, proto, args, struct, assign, print)

  • name - the name of the tracepoint to be created.

  • proto - the prototype for the tracepoint callbacks.

  • args - the arguments that match the prototype.

  • struct - the structure that a tracer could use (but is not required to) to store the data passed into the tracepoint.

  • assign - the C-like way to assign the data to the structure.

  • print - the way to output the structure in human readable ASCII format.

A good example of a tracepoint definition, for sched_switch, can be found here. That definition will be used below to describe each of the parts of TRACE_EVENT() macro.

All parameters except the first one are encapsulated with another macro (TP_PROTO, TP_ARGS, TP_STRUCT__entry, TP_fast_assign and TP_printk). These macros give more control in processing and also allow commas to be used within the TRACE_EVENT() macro.

Name

The first parameter is the name.

   TRACE_EVENT(sched_switch,

This is the name used to call this tracepoint. The actual tracepoint that is used has trace_ prefixed to the name (i.e. trace_sched_switch).

Prototype

The next parameter is the prototype.

    TP_PROTO(struct rq *rq, struct task_struct *prev, struct task_struct *next),

The prototype is written as if you were to declare the tracepoint directly:

    trace_sched_switch(struct rq *rq, struct task_struct *prev,
                       struct task_struct *next);

It is used as the prototype for both the tracepoint added to the kernel code and for the callback function. Remember, a tracepoint calls the callback functions as if the callback functions were being called at the location of the tracepoint.

Arguments

The third parameter is the arguments used by the prototype.

    TP_ARGS(rq, prev, next),

It may seem strange that this is needed, but it is not only required by the TRACE_EVENT() macro, it is also required by the tracepoint infrastructure underneath. The tracepoint code, when activated, will call the callback functions (more than one callback may be assigned to a given tracepoint). The macro that creates the tracepoint must have access to both the prototype and the arguments. Below is an illustration of what a tracepoint macro would need to accomplish this:

    #define TRACE_POINT(name, proto, args) \
       void trace_##name(proto)            \
       {                                   \
               if (trace_##name##_active)  \
                       callback(args);     \
       }
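The need for both the prototype and the argument list can be tried out in user space. The following is only a sketch under stated assumptions: DEMO_TRACE_POINT, DEMO_PROTO, DEMO_ARGS and the callback pointer are hypothetical names invented for this illustration, and the code is not the kernel's tracepoint implementation. It shows that "proto" is needed to declare the parameters while "args" is needed to forward them to the callback.

```c
/* Wrappers that let a comma-separated list travel as one macro argument. */
#define DEMO_PROTO(...) __VA_ARGS__
#define DEMO_ARGS(...)  __VA_ARGS__

/*
 * Hypothetical user-space stand-in for a tracepoint: "proto" declares the
 * parameters, "args" forwards them to the registered callback (if any).
 */
#define DEMO_TRACE_POINT(name, proto, args)             \
	static void (*demo_##name##_cb)(proto);         \
	static void trace_##name(proto)                 \
	{                                               \
		if (demo_##name##_cb)                   \
			demo_##name##_cb(args);         \
	}

DEMO_TRACE_POINT(task_switch,
		 DEMO_PROTO(int prev_pid, int next_pid),
		 DEMO_ARGS(prev_pid, next_pid))

static int last_prev, last_next, hits;

/* The callback has exactly the prototype that was given to DEMO_PROTO(). */
static void record_switch(int prev_pid, int next_pid)
{
	last_prev = prev_pid;
	last_next = next_pid;
	hits++;
}

/* Fires the tracepoint once without and once with a callback attached. */
static int demo_run(void)
{
	trace_task_switch(1, 2);               /* no callback: nothing recorded */
	demo_task_switch_cb = record_switch;   /* "enable" the tracepoint */
	trace_task_switch(42, 43);             /* callback sees (42, 43) */
	return hits;
}
```

Calling demo_run() leaves a single recorded hit with the values from the second call; the first call is dropped because no callback was attached yet.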

Structure

The fourth parameter is a bit more complex.

    TP_STRUCT__entry(
		__array(	char,	prev_comm,	TASK_COMM_LEN	)
		__field(	pid_t,	prev_pid			)
		__field(	int,	prev_prio			)
		__field(	long,	prev_state			)
		__array(	char,	next_comm,	TASK_COMM_LEN	)
		__field(	pid_t,	next_pid			)
		__field(	int,	next_prio			)
    ),

This parameter describes the structure layout of the data that will be stored in the tracer's ring buffer. Each element of the structure is defined by another macro. These macros are used to automate the creation of a structure and are not function-like. Notice that the macros are not separated by any delimiter (no comma nor semicolon).

The macros used by the sched_switch tracepoint are:

  • __field(type, name) - this defines a normal structure element, like int var; where type is int and name is var.

  • __array(type, name, len) - this defines an array item, equivalent to int name[len]; where the type is int, the name of the array is name, and the number of items in the array is len.

There are other element macros that will be described in a later article. The definition from the sched_switch tracepoint would produce a structure that looks like:

    struct {
	      char   prev_comm[TASK_COMM_LEN];
	      pid_t  prev_pid;
	      int    prev_prio;
	      long   prev_state;
	      char   next_comm[TASK_COMM_LEN];
	      pid_t  next_pid;
	      int    next_prio;
    };

Note that the spacing used in the TP_STRUCT__entry definition breaks the rules outlined by checkpatch.pl. That is done because these macros are not function-like but, instead, are used to define a structure. The spacing follows the rules of structure spacing and not of function spacing, so that the names line up in the structure declaration. Needless to say, checkpatch.pl fails horribly when processing changes to TRACE_EVENT() definitions.
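One way such element macros can turn into a structure is sketched below in user-space C. The names DEMO_FIELD, DEMO_ARRAY, DEMO_STRUCT_ENTRY, DEMO_DECLARE_ENTRY and DEMO_COMM_LEN are stand-ins made up for this example (with int used in place of pid_t), and the expansion shown is an assumption about the general technique, not a copy of the kernel's definitions. Each element macro supplies its own semicolon, which is why no delimiter is needed between them.

```c
#include <stddef.h>

#define DEMO_COMM_LEN 16   /* stand-in for TASK_COMM_LEN */

/* Each element macro expands to one member declaration, semicolon included. */
#define DEMO_FIELD(type, item)       type item;
#define DEMO_ARRAY(type, item, len)  type item[len];

/* The wrapper simply passes its contents through. */
#define DEMO_STRUCT_ENTRY(...)       __VA_ARGS__

/* Wraps the generated member list in a struct named after the event. */
#define DEMO_DECLARE_ENTRY(name, tstruct) \
	struct demo_entry_##name { tstruct }

DEMO_DECLARE_ENTRY(sched_switch,
	DEMO_STRUCT_ENTRY(
		DEMO_ARRAY(	char,	prev_comm,	DEMO_COMM_LEN	)
		DEMO_FIELD(	int,	prev_pid			)
		DEMO_FIELD(	int,	prev_prio			)
		DEMO_FIELD(	long,	prev_state			)
		DEMO_ARRAY(	char,	next_comm,	DEMO_COMM_LEN	)
		DEMO_FIELD(	int,	next_pid			)
		DEMO_FIELD(	int,	next_prio			)
	));

/* Size of the generated record, padding included. */
static size_t demo_entry_size(void)
{
	return sizeof(struct demo_entry_sched_switch);
}
```

The result is an ordinary struct demo_entry_sched_switch whose members can be inspected with sizeof and offsetof like those of any hand-written structure.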

Assignment

The fifth parameter defines the way the data from the parameters is saved to the ring buffer.

    TP_fast_assign(
		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
		__entry->prev_pid	= prev->pid;
		__entry->prev_prio	= prev->prio;
		__entry->prev_state	= prev->state;
		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
		__entry->next_pid	= next->pid;
		__entry->next_prio	= next->prio;
    ),

The code within the TP_fast_assign() is normal C code. A special variable __entry represents the pointer to a structure type defined by TP_STRUCT__entry and points directly into the ring buffer. The TP_fast_assign is used to fill all fields created in TP_STRUCT__entry. The variable names of the parameters defined by TP_PROTO and TP_ARGS can then be used to assign the appropriate data into the __entry structure.
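The idea of assigning straight into reserved buffer space can be sketched in user space as follows. Everything here is hypothetical and simplified for illustration: struct demo_task, struct demo_entry, the eight-slot demo_ring array and the DEMO_FAST_ASSIGN wrapper are inventions of this example, and a real tracer's ring buffer is considerably more involved.

```c
#include <string.h>

#define DEMO_COMM_LEN 16   /* stand-in for TASK_COMM_LEN */
#define DEMO_RING_SLOTS 8

/* Minimal stand-in for the parts of a task that the event records. */
struct demo_task {
	char comm[DEMO_COMM_LEN];
	int pid;
	int prio;
};

struct demo_entry {
	char prev_comm[DEMO_COMM_LEN];
	int prev_pid;
	int prev_prio;
	char next_comm[DEMO_COMM_LEN];
	int next_pid;
	int next_prio;
};

/* A toy "ring buffer": fixed slots handed out in order, wrapping around. */
static struct demo_entry demo_ring[DEMO_RING_SLOTS];
static unsigned int demo_head;

/* The wrapper pastes its body in unchanged, as plain C statements. */
#define DEMO_FAST_ASSIGN(...) __VA_ARGS__

static struct demo_entry *demo_record_switch(const struct demo_task *prev,
					     const struct demo_task *next)
{
	/* entry points directly at the reserved slot; there is no staging copy */
	struct demo_entry *entry = &demo_ring[demo_head++ % DEMO_RING_SLOTS];

	DEMO_FAST_ASSIGN(
		memcpy(entry->prev_comm, prev->comm, DEMO_COMM_LEN);
		entry->prev_pid  = prev->pid;
		entry->prev_prio = prev->prio;
		memcpy(entry->next_comm, next->comm, DEMO_COMM_LEN);
		entry->next_pid  = next->pid;
		entry->next_prio = next->prio;
	)
	return entry;
}
```

Because the assignment body writes through the pointer into the slot itself, recording an event costs only the stores that the body performs.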

Print

The last parameter defines how a printk() can be used to print out the fields from the TP_STRUCT__entry structure.

	TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s ==> " \
 		  "next_comm=%s next_pid=%d next_prio=%d",
		__entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
		__entry->prev_state ?
		  __print_flags(__entry->prev_state, "|",
				{ 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" },
				{ 16, "Z" }, { 32, "X" }, { 64, "x" },
				{ 128, "W" }) : "R",
		__entry->next_comm, __entry->next_pid, __entry->next_prio)

Once again the variable __entry is used to reference the pointer to the structure that contains the data. The format string is just like any other printf format. The __print_flags() is part of a set of helper functions that come with TRACE_EVENT(), and will be covered in another article. Do not create new tracepoint-specific helpers, because that will confuse user-space tools that know about the TRACE_EVENT() helper macros but will not know how to handle ones created for individual tracepoints.
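To make the role of a flag-printing helper concrete, here is a small user-space sketch that turns a bitmask into a delimiter-separated string using the same mask/name pairs as the sched_switch definition. The function demo_print_flags(), its signature and struct demo_flag_name are hypothetical; this is not the kernel's __print_flags() implementation, which is left to the later article.

```c
#include <stddef.h>
#include <stdio.h>

struct demo_flag_name {
	unsigned long mask;
	const char *name;
};

/* Writes the names of all set flags into buf, separated by delim. */
static const char *demo_print_flags(char *buf, size_t len, unsigned long flags,
				    const char *delim,
				    const struct demo_flag_name *names, size_t n)
{
	size_t used = 0;
	size_t i;

	if (len == 0)
		return buf;
	buf[0] = '\0';
	for (i = 0; i < n && used < len; i++) {
		int w;

		if (!(flags & names[i].mask))
			continue;
		/* prefix the delimiter for every name after the first */
		w = snprintf(buf + used, len - used, "%s%s",
			     used ? delim : "", names[i].name);
		if (w < 0)
			break;
		used += (size_t)w;
	}
	return buf;
}

/* Mask/name pairs taken from the TP_printk() example above. */
static const struct demo_flag_name demo_task_states[] = {
	{ 1, "S" }, { 2, "D" }, { 4, "T" }, { 8, "t" },
	{ 16, "Z" }, { 32, "X" }, { 64, "x" }, { 128, "W" },
};

/* Mirrors the "state ? flags : R" expression used by sched_switch. */
static const char *demo_state_str(char *buf, size_t len, long state)
{
	if (!state)
		return "R";
	return demo_print_flags(buf, len, (unsigned long)state, "|",
				demo_task_states,
				sizeof(demo_task_states) / sizeof(demo_task_states[0]));
}
```

With a state of 0 the helper is bypassed and "R" is returned; a state of 3 has bits 1 and 2 set and yields "S|D".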

Format file

The sched_switch TRACE_EVENT() macro produces the following format file in /sys/kernel/debug/tracing/events/sched/sched_switch/format:

   name: sched_switch
   ID: 33
   format:
	field:unsigned short common_type;	offset:0;	size:2;
	field:unsigned char common_flags;	offset:2;	size:1;
	field:unsigned char common_preempt_count;	offset:3;	size:1;
	field:int common_pid;	offset:4;	size:4;
	field:int common_lock_depth;	offset:8;	size:4;

	field:char prev_comm[TASK_COMM_LEN];	offset:12;	size:16;
	field:pid_t prev_pid;	offset:28;	size:4;
	field:int prev_prio;	offset:32;	size:4;
	field:long prev_state;	offset:40;	size:8;
	field:char next_comm[TASK_COMM_LEN];	offset:48;	size:16;
	field:pid_t next_pid;	offset:64;	size:4;
	field:int next_prio;	offset:68;	size:4;

   print fmt: "task %s:%d [%d] (%s) ==> %s:%d [%d]", REC->prev_comm, REC->prev_pid,
   REC->prev_prio, REC->prev_state ? __print_flags(REC->prev_state, "|", { 1, "S"} ,
   { 2, "D" }, { 4, "T" }, { 8, "t" }, { 16, "Z" }, { 32, "X" }, { 64, "x" }, { 128,
   "W" }) : "R", REC->next_comm, REC->next_pid, REC->next_prio

Note: Newer kernels may also display a signed entry for each field.

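Notice that __entry is replaced with REC in the format file. The first set of fields (common_*) does not come from the TRACE_EVENT() macro, but is added to all events by Ftrace, which created this format file; other tracers could add different fields. The print fmt string shown here is worded differently from the TP_printk() format above, evidently because the two listings were taken from different kernel versions; the arguments are the same. The format file provides user-space tools the information needed to parse the binary output containing sched_switch entries.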
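As an illustration of how a user-space tool might use this information, the sketch below pulls the sched_switch fields out of a raw record using the offsets and sizes listed in the format file above. The offsets are hard-coded here for brevity, which is an assumption of this example (a real tool would read them from the format file, since they can differ between kernels and architectures), and struct demo_switch and the demo_* helpers are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

/* Offsets and sizes copied from the format file listing above. */
enum {
	DEMO_COMM_LEN   = 16,
	OFF_PREV_COMM   = 12,
	OFF_PREV_PID    = 28,
	OFF_PREV_PRIO   = 32,
	OFF_PREV_STATE  = 40,
	OFF_NEXT_COMM   = 48,
	OFF_NEXT_PID    = 64,
	OFF_NEXT_PRIO   = 68,
	DEMO_RECORD_LEN = 72,
};

struct demo_switch {
	char    prev_comm[DEMO_COMM_LEN + 1];
	int32_t prev_pid;
	int32_t prev_prio;
	int64_t prev_state;
	char    next_comm[DEMO_COMM_LEN + 1];
	int32_t next_pid;
	int32_t next_prio;
};

/* memcpy() avoids alignment assumptions; values are in host byte order. */
static int32_t demo_get_s32(const unsigned char *rec, size_t off)
{
	int32_t v;

	memcpy(&v, rec + off, sizeof(v));
	return v;
}

static int64_t demo_get_s64(const unsigned char *rec, size_t off)
{
	int64_t v;

	memcpy(&v, rec + off, sizeof(v));
	return v;
}

/* comm fields are fixed-size; terminate explicitly in case they are full. */
static void demo_get_comm(const unsigned char *rec, size_t off, char *dst)
{
	memcpy(dst, rec + off, DEMO_COMM_LEN);
	dst[DEMO_COMM_LEN] = '\0';
}

/* rec must point to at least DEMO_RECORD_LEN bytes of one record. */
static void demo_parse_switch(const unsigned char *rec, struct demo_switch *out)
{
	demo_get_comm(rec, OFF_PREV_COMM, out->prev_comm);
	out->prev_pid   = demo_get_s32(rec, OFF_PREV_PID);
	out->prev_prio  = demo_get_s32(rec, OFF_PREV_PRIO);
	out->prev_state = demo_get_s64(rec, OFF_PREV_STATE);
	demo_get_comm(rec, OFF_NEXT_COMM, out->next_comm);
	out->next_pid   = demo_get_s32(rec, OFF_NEXT_PID);
	out->next_prio  = demo_get_s32(rec, OFF_NEXT_PRIO);
}
```

The common_* fields at offsets 0 through 11 are skipped here; a tool that needs the event type or PID of the recording context would read them the same way.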

The header file

The TRACE_EVENT() macro cannot just be placed anywhere in the expectation that it will work with Ftrace or any other tracer. The header file that contains the TRACE_EVENT() macro must follow a certain format. These header files typically are located in the include/trace/events directory but do not need to be. If they are not located in this directory, then other configurations are necessary.

The first line in the TRACE_EVENT() header is not the normal #ifndef _TRACE_SCHED_H, but instead has:

   #undef TRACE_SYSTEM
   #define TRACE_SYSTEM sched

   #if !defined(_TRACE_SCHED_H) || defined(TRACE_HEADER_MULTI_READ)
   #define _TRACE_SCHED_H

This example is for scheduler trace events; other event headers would use something other than sched and _TRACE_SCHED_H. The TRACE_HEADER_MULTI_READ test allows this file to be included more than once; this is important for the processing of the TRACE_EVENT() macro. The TRACE_SYSTEM must also be defined for the file and must be outside the guard of the #if. The TRACE_SYSTEM defines what group the TRACE_EVENT() macros in the file belong to. This is also the directory name that the events will be grouped under in the debugfs tracing/events directory. This grouping is important for Ftrace as it allows the user to enable or disable events by group.

The file then includes any headers required by the contents of the TRACE_EVENT() macro (e.g. #include <linux/sched.h>). The tracepoint.h file is required.

   #include <linux/tracepoint.h>

All the trace events can now be defined with TRACE_EVENT() macros. Please include comments that describe the tracepoint above the TRACE_EVENT() macros. Look at include/trace/events/sched.h as an example. The file ends with:

   #endif /* _TRACE_SCHED_H */

   /* This part must be outside protection */
   #include <trace/define_trace.h>

The define_trace.h is where all the magic lies in creating the tracepoints. The explanation of how this file works will be left to another article. For now, it is sufficient to know that this file must be included at the bottom of the trace header file outside the protection of the #endif.
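Why a header would ever need to be read more than once can be shown with a generic preprocessor technique. The sketch below is not a description of define_trace.h, whose workings are left to a later article; it only illustrates, under that caveat, how one set of event definitions can be expanded several times with a different meaning each time. The list macro DEMO_EVENTS stands in for re-including a header, and all names (including the demo_other event) are hypothetical.

```c
/* Stand-in for the contents of a trace header: a list of event definitions. */
#define DEMO_EVENTS \
	DEMO_EVENT(sched_switch) \
	DEMO_EVENT(demo_other)

/* First pass: each event becomes an enum constant. */
#define DEMO_EVENT(name) DEMO_ID_##name,
enum demo_event_id { DEMO_EVENTS DEMO_EVENT_COUNT };
#undef DEMO_EVENT

/* Second pass: the same list becomes a table of printable names. */
#define DEMO_EVENT(name) #name,
static const char *const demo_event_names[] = { DEMO_EVENTS };
#undef DEMO_EVENT

/* Looks up the name generated for an event ID. */
static const char *demo_event_name(enum demo_event_id id)
{
	if ((unsigned int)id >= DEMO_EVENT_COUNT)
		return "?";
	return demo_event_names[id];
}
```

The event list is written once, yet it produces both the IDs and the name table, so the two cannot drift apart when an event is added.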

Using the tracepoint

Defining the tracepoint is meaningless if it is not used anywhere. To use the tracepoint, the trace header must be included, but one C file (and only one) must also define CREATE_TRACE_POINTS before including the trace header. This will cause define_trace.h to create the functions needed to produce the tracing events. In kernel/sched.c the following is defined:

   #define CREATE_TRACE_POINTS
   #include <trace/events/sched.h>

If another file needs to use tracepoints that were defined in the trace file, then it only needs to include the trace file, and does not need to define CREATE_TRACE_POINTS. Defining it more than once for the same header file will cause linker errors when building. For example, in kernel/fork.c only the header file is included:

   #include <trace/events/sched.h>

Finally, the tracepoint is used in the code just as it was defined in the TRACE_EVENT() macro:

   static inline void
   context_switch(struct rq *rq, struct task_struct *prev,
	          struct task_struct *next)
   {
	   struct mm_struct *mm, *oldmm;

	   prepare_task_switch(rq, prev, next);
	   trace_sched_switch(rq, prev, next);
	   mm = next->mm;
	   oldmm = prev->active_mm;

Coming soon

This article explained all that is needed to create a basic tracepoint within the core kernel code. Part 2 will describe how to consolidate tracepoints to keep the tracing footprint small, along with information about the TP_STRUCT__entry macros and TP_printk helper functions (like __print_flags). Part 3 will look at defining tracepoints outside of the include/trace/events directory (for modules and architecture-specific tracepoints) as well as a look at how the TRACE_EVENT() macro does its magic. Both articles will have a few practical examples of how to use tracepoints. Stay tuned ...

Huge pages part 5: A deeper look at TLBs and costs

March 23, 2010

This article was contributed by Mel Gorman

[Editor's note: this is the fifth and final installment in Mel Gorman's series on the use of huge pages in Linux. Parts 1, 2, 3 and 4 are available for those who have not read them yet. Many thanks to Mel for letting us run this series at LWN.]

This chapter is not necessary to understand how huge pages are used, and the performance benefits of huge pages are often easiest to measure using an application-specific benchmark. However, there are rare cases where a deeper understanding of the TLB can be enlightening. In this chapter, a closer look is taken at TLBs and at analysing performance from a huge page perspective.

1 TLB Size and Characteristics

First off, it can be useful to know what sort of TLB the system has. On X86 and X86-64, the tool x86info can be used to discover the TLB size.

    $ x86info -c
      ...
      TLB info
       Instruction TLB: 4K pages, 4-way associative, 128 entries.
       Instruction TLB: 4MB pages, fully associative, 2 entries
       Data TLB: 4K pages, 4-way associative, 128 entries.
       Data TLB: 4MB pages, 4-way associative, 8 entries
      ...
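
As a rough illustration of what these figures mean, the amount of memory a TLB can map without missing (its reach) is the number of entries multiplied by the page size. The sketch below applies that to the data TLB lines in the output above; the function name tlb_reach is purely illustrative and is not part of x86info.

```python
# Illustrative only: TLB reach = number of entries * page size.
KIB = 1024
MIB = 1024 * KIB


def tlb_reach(entries, page_size):
    """Return the number of bytes a TLB can map when every entry is in use."""
    return entries * page_size


# Data TLB figures copied from the x86info output above.
small = tlb_reach(128, 4 * KIB)  # 4K pages, 128 entries
huge = tlb_reach(8, 4 * MIB)     # 4MB pages, 8 entries

print(f"4K data TLB reach:  {small // KIB} KiB")
print(f"4MB data TLB reach: {huge // MIB} MiB")
```

On this machine the huge page data TLB has far fewer entries, yet it covers 64 times as much memory as the 4K data TLB.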

On the PPC64 architecture, there is no automatic means of determining the number of TLB slots. PPC64 uses multiple translation-related caches of which the TLB is at the lowest layer. It is safe to assume on older revisions of POWER - such as the PPC970 - that 1024 entries are available. POWER 5+ systems will have 2048 entries and POWER 6 does not use a TLB. On PPC64, the topmost translation layer uses an Effective to Real Address Translation (ERAT) cache. On POWER 6, it supports 4K and 64K entries but typically the default huge page size of 16MB consumes multiple ERAT entries. Hence, the article will focus more on the TLB than on ERAT.

2 Calculating TLB Translation Cost

When deciding whether huge pages will be of benefit, the first step is estimating how much time is being spent translating addresses. This will approximate the upper-boundary of performance gains that can be achieved using huge pages. This requires that the number of TLB misses that occurred is calculated as well as the average cost of a TLB miss.

On much modern hardware, there is a Performance Measurement Unit (PMU) which provides a small number of hardware-based counters. The PMU is programmed to increment when a specific low-level event occurs and interrupt the CPU when a threshold, called the sample period, is reached. In many cases, there will be one low-level event that corresponds to a TLB miss so a reasonable estimate can be made of the number of TLB misses.

On Linux, the PMU can be programmed with oprofile on almost any kernel currently in use, or with perf on recent kernels. Unfortunately, perf is not suitable for the analysis we need in this installment. Perf maps high-level requests, such as cache misses, to suitable low-level events. However it is not currently able to map certain TLB events, such as the number of cycles spent walking a page table. It is technically possible to specify a raw event ID to perf, but figuring out the raw ID is error-prone and tricky to verify. Hence, we will be using oprofile to program the PMU in this installment.

A detailed examination of the hardware specification may yield an estimate for the cost of a TLB miss, but it is time-consuming and documentation is not always sufficient. Broadly speaking, there are three means of estimating the TLB cost in the absence of documentation. The simplest case is where the TLB is software-filled and the operating system is responsible for filling the TLB. Using a profiling tool, the number of times the TLB miss handler was called and the time spent can be recorded. This gives an average cost of the TLB miss but software-filled TLBs are not common in mainstream machines. The second method is to use an analysis program such as Calibrator [manegold04] that guesses characteristics of the cache and the TLB. While other tools exist that claim to be more accurate [yotov04a][yotov04b], Calibrator has the advantage of still being available for download and it works very well for X86 and X86-64 architectures. Its use is described below.

Calibrator does not work well on PPC64 as the TLB is the lowest layer, whereas Calibrator measures the cost of an ERAT miss at the highest layer. On PPC64, there is a hardware counter that calculates the number of cycles spent doing page table walks. Hence, when automatic measurement fails, it may be possible to measure the TLB cost using the PMU as described in Section 2.3, below.

Once the number of TLB misses and the average cost of a miss is known, the percentage time spent servicing TLB misses is easily calculated.

2.1 Estimating Number of TLB Misses

Oprofile can be used to estimate the number of TLB misses using the PMU. This article will not go in-depth on how PMUs and oprofile work but, broadly speaking, the PMU counts low-level events such as a TLB miss. To avoid excessive overhead, not every event is recorded; instead, one sample is taken each time the counter reaches the sample period. At that point an interrupt is raised and oprofile records the details of the event. An estimate of the real number of TLB misses that occurred is then:

EstimatedTLBMisses = TLBMissesSampled * SamplePeriod

The output below shows an example oprofile session that sampled Data-TLB (DTLB) misses within a benchmark.

  $ opcontrol --setup --event PM_CYC_GRP22:50000 --event PM_DTLB_MISS_GRP22:1000 \
              --vmlinux=/vmlinux
  $ opcontrol --start
  Using 2.6+ OProfile kernel interface.
  Reading module info.
  Using log file /var/lib/oprofile/samples/oprofiled.log
  Daemon started.
  Profiler running.
  $ ./benchmark
  $ opcontrol --stop
  $ opcontrol --dump
  $ opreport
  CPU: ppc64 970MP, speed 2500 MHz (estimated)
  Counted PM_CYC_GRP22 events ((Group 22 pm_pe_bench4) Processor cycles)
          with a unit mask of 0x00 (No unit mask) count 50000
  Counted PM_DTLB_MISS_GRP22 events ((Group 22 pm_pe_bench4) Data TLB misses)
          with a unit mask of 0x00 (No unit mask) count 1000
  PM_CYC_GRP22:5...|PM_DTLB_MISS_G...|
    samples|      %|  samples|      %|
  ------------------------------------
     622512 98.4696      9651 97.8506 benchmark
       4170  0.6596        11  0.1115 libc-2.9.so
       3074  0.4862         1  0.0101 oprofiled
        840  0.1329         4  0.0406 bash
        731  0.1156       181  1.8351 vmlinux-2.6.31-rc5
        572  0.0905        14  0.1419 ld-2.9.so

Note in the figure that 9651 samples were taken and the sample period was 1000. Therefore it is reasonable to assume, using the equation above, that the benchmark incurred 9,651,000 DTLB misses. Analysis of a more complex benchmark would also include misses incurred by libraries.
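
A minimal sketch of that scaling follows, using the DTLB sample counts from the report above. The helper name estimate_events is hypothetical, and attributing all of the libc and ld.so samples to the benchmark is an assumption, since the system-wide report does not split library samples by process.

```python
def estimate_events(samples, sample_period):
    """Scale a sample count back up to an estimated event count."""
    return samples * sample_period


# DTLB sample counts copied from the opreport output above.
dtlb_samples = {
    "benchmark": 9651,
    "libc-2.9.so": 11,
    "ld-2.9.so": 14,
}
SAMPLE_PERIOD = 1000  # from "--event PM_DTLB_MISS_GRP22:1000"

benchmark_only = estimate_events(dtlb_samples["benchmark"], SAMPLE_PERIOD)
with_libraries = estimate_events(sum(dtlb_samples.values()), SAMPLE_PERIOD)

print(benchmark_only)   # 9651000
print(with_libraries)   # 9676000
```

For this benchmark the library contribution is small enough that it barely changes the estimate.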

2.2 Estimating TLB Miss Cost using Calibrator

Calibrator should be used on machines where the TLB is the primary cache for translating virtual to physical addresses. This is the case for X86 and X86-64 machines but not for PPC64 where there are additional translation layers. The first step is to set up a working directory and obtain the calibrator tool.

  $ wget http://homepages.cwi.nl/~manegold/Calibrator/v0.9e/calibrator.c
  $ gcc calibrator.c -lm -o calibrator
  calibrator.c:131: warning: conflicting types for built-in function 'round'

The warning is harmless. Note the lack of compiler optimisation options specified which is important so as not to skew the results reported by the tool. Running Calibrator with no parameters gives:

  $ ./calibrator 
  Calibrator v0.9e
  (by Stefan.Manegold@cwi.nl, http://www.cwi.nl/~manegold/)

  ! usage: './calibrator <MHz> <size>[k|M|G] <filename>' !

The CPU MHz parameter is used to estimate the time in nanoseconds a TLB miss costs. The information is not automatically retrieved from /proc/ as the tool was intended to be usable on Windows, but this shell script should discover the MHz value on many Linux installations. The size parameter is the size of the work array to allocate. It must be sufficiently large that the cache and TLB reach are both exceeded to have any chance of accuracy, but in practice much higher values were required. The poorly named parameter filename is the prefix given to the output graphs and gnuplot files.
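
For readers who want to see the idea without fetching the script, here is a small sketch of one way to find the MHz value. It is not the script referred to above; it assumes an x86-style /proc/cpuinfo with a "cpu MHz" field, and the function name parse_cpu_mhz is made up for this example.

```python
import re


def parse_cpu_mhz(cpuinfo_text):
    """Return the first "cpu MHz" value in /proc/cpuinfo-style text, or None."""
    match = re.search(r"^cpu MHz\s*:\s*([0-9.]+)", cpuinfo_text, re.MULTILINE)
    return float(match.group(1)) if match else None


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            mhz = parse_cpu_mhz(f.read())
    except OSError:
        mhz = None
    print(mhz if mhz is not None else "cpu MHz not found; pass the value by hand")
```

Other architectures format /proc/cpuinfo differently, and with frequency scaling enabled the reported value may not be the nominal clock rate, so the result is worth a sanity check before handing it to Calibrator.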

This page contains a wrapper script around Calibrator that outputs the approximate cost of a TLB miss as well as how many TLB misses must occur to consume a second of system time. An example running the script on an Intel Core Duo T2600 is as follows:

  $ ./run-calibrator.sh
  Running calibrator with size 13631488: 19 cycles 8.80 ns 
  Running calibrator with size 17563648: 19 cycles 8.80 ns matched 1 times
  Running calibrator with size 21495808: 19 cycles 8.80 ns matched 2 times
  Running calibrator with size 25427968: 19 cycles 8.80 ns matched 3 times

  TLB_MISS_LATENCY_TIME=8.80
  TLB_MISS_LATENCY_CYCLES=19
  TLB_MISSES_COST_ONE_SECOND=114052631
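
The last value in that output is consistent with dividing the clock rate by the miss cost. The sketch below shows that arithmetic; it is a reconstruction rather than the wrapper script's own code, and the 2167 MHz clock rate is an assumption taken from the tlbmiss_cost.sh trace shown in Section 3, which reports the same 19-cycle cost.

```python
def misses_per_second(cpu_mhz, miss_cost_cycles):
    """Number of TLB misses whose combined cost is one second of CPU time."""
    cycles_per_second = cpu_mhz * 1_000_000
    return cycles_per_second // miss_cost_cycles


# 19 cycles comes from the output above; 2167 MHz is assumed (see lead-in).
print(misses_per_second(2167, 19))  # 114052631
```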

In this specific example, the estimated cost of a TLB miss is 19 clock cycles or 8.80ns. It is interesting to note that the cost of an L2 cache miss on the target machine is 210 cycles, making it likely that the hardware is hiding most of the latency cost using pre-fetching or a related technique. Compare the output with the following from an older generation machine based on the AMD Athlon 64 3000+, which has a two-level TLB structure:

  $ ./run-calibrator.sh 
  Running calibrator with size 13631488: 16 cycles 8.18 ns 
  Running calibrator with size 17563648: 19 cycles 9.62 ns 
  Running calibrator with size 21495808: 19 cycles 9.54 ns matched 1 times
  Running calibrator with size 25427968: 19 cycles 9.57 ns matched 2 times
  Running calibrator with size 29360128: 34 cycles 16.96 ns 
  Running calibrator with size 33292288: 34 cycles 16.99 ns matched 1 times
  Running calibrator with size 37224448: 37 cycles 18.17 ns 
  Running calibrator with size 41156608: 37 cycles 18.17 ns matched 1 times
  Running calibrator with size 45088768: 36 cycles 18.16 ns matched 2 times
  Running calibrator with size 49020928: 37 cycles 18.17 ns matched 3 times

  TLB_MISS_LATENCY_TIME=18.17
  TLB_MISS_LATENCY_CYCLES=37
  TLB_MISSES_COST_ONE_SECOND=54297297

While calibrator will give a reasonable estimate of the cost, some manual adjustment may be required based on observation.

2.3 Estimating TLB Miss Cost using Hardware

When the TLB is not the topmost translation layer, Calibrator is not suitable to measure the cost of a TLB miss. In the specific case of PPC64, Calibrator measures the cost of an ERAT miss but the ERAT does not always support all the huge page sizes. In the event a TLB exists on POWER, it is the lowest level of translation and it supports huge pages. Due to this, measuring the cost of a TLB miss requires help from the PMU.

Two counters are minimally required - one to measure the number of TLB misses and a second to measure the number of cycles spent walking page tables. The exact names of the counters will vary but, for the PPC970MP, the PM_DTLB_MISS_GRP22 counter for TLB misses and the PM_DATA_TABLEWALK_CYC_GRP30 counter for page table walk cycles are suitable.

To use the PMU, a consistent test workload is required that generates a relatively fixed number of TLB misses per run. The simplest workload to use in this case is STREAM. First, download and build stream:

  $ wget http://www.cs.virginia.edu/stream/FTP/Code/stream.c
  $ gcc -O3 -DN=44739240 stream.c -o stream

The value of N is set such that the total working set of the benchmark will be approximately 1GB.
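
The choice of N can be checked with a few lines of arithmetic. This sketch assumes the classic stream.c layout of three arrays of N double-precision values (a, b and c) with no extra offset; the constant and function names are illustrative.

```python
BYTES_PER_DOUBLE = 8
STREAM_ARRAYS = 3  # assumed: STREAM's a[], b[] and c[] arrays


def stream_working_set(n):
    """Total bytes occupied by the STREAM arrays for a given N."""
    return STREAM_ARRAYS * BYTES_PER_DOUBLE * n


wss = stream_working_set(44739240)
print(wss)          # 1073741760
print(wss / 2**30)  # fraction of 1 GiB, just under 1.0
```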

Ideally, the number of DTLB misses and cycles spent walking page tables would be measured at the same time but, due to limitations of the PPC970MP, they must be measured in two separate runs. Because of this, it is very important that cycles be sampled alongside the event of interest in each run, and it is essential that the cycle samples taken in the two runs are approximately the same. This will require you to scale the sample rate for the DTLB and page table walk events appropriately. Here are two oprofile reports based on running STREAM.

  CPU: ppc64 970MP, speed 2500 MHz (estimated)
  Counted PM_CYC_GRP30 events ((Group 30 pm_isource) Processor cycles)
          with a unit mask of 0x00 (No unit mask) count 50000
  Counted PM_DATA_TABLEWALK_CYC_GRP30 events ((Group 30 pm_isource) Cycles
	  doing data tablewalks) with a unit mask of 0x00 (No unit mask)
	  count 10000
  PM_CYC_GRP30:5...|PM_DATA_TABLEW...|
    samples|      %|  samples|      %|
  ------------------------------------
     604695 97.9322    543702 99.3609 stream

  CPU: ppc64 970MP, speed 2500 MHz (estimated)
  Counted PM_CYC_GRP23 events ((Group 23 pm_hpmcount1) Processor cycles)
          with a unit mask of 0x00 (No unit mask) count 50000
  Counted PM_DTLB_MISS_GRP23 events ((Group 23 pm_hpmcount1) Data TLB misses)
          with a unit mask of 0x00 (No unit mask) count 1000
  PM_CYC_GRP23:5...|PM_DTLB_MISS_G...|
    samples|      %|  samples|      %|
  ------------------------------------
     621541 98.5566      9644 98.0879 stream

The first point to note is that the samples taken for PM_CYC_GRP are approximately the same. This required that the sample period for PM_DATA_TABLEWALK_CYC_GRP30 be 10000 instead of the minimum allowed of 1000. The average cost of a DTLB miss is now trivial to estimate.

    PageTableWalkCycles = CyclesSampled * SamplePeriod
                        = 543702 * 10000
                        = 5437020000

    TLBMisses = TLBMissSampled * SamplePeriod
              = 9644 * 1000
              = 9644000

    TLBMissCost = PageTableWalkCycles / TLBMisses
                = 5437020000 / 9644000
                = ~563 cycles
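
The same calculation can be written as a small helper, which is convenient when repeating the measurement on other machines. This is only a sketch; the function name and argument order are invented for the example.

```python
def tlb_miss_cost(walk_samples, walk_period, miss_samples, miss_period):
    """Average page table walk cycles per TLB miss, from two oprofile runs."""
    walk_cycles = walk_samples * walk_period
    misses = miss_samples * miss_period
    return walk_cycles / misses


# Sample counts and periods copied from the two reports above.
cost = tlb_miss_cost(543702, 10000, 9644, 1000)
print(int(cost))  # 563
```

The exact quotient is a little under 564; the figure of 563 used in the rest of this section is the truncated value.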

Here the TLB-miss cost on PPC64 is observed to be much higher than on comparable X86 hardware. However, take into account that the ERAT translation cache hides most of the cost of translating addresses and its miss cost is comparable. This is similar in principle to having two levels of TLB.

2.4 Estimating Percentage Time Translating

Once the TLB miss cost estimate is available, estimates for any workload depend on a profile showing cycles spent within the application and the DTLB samples such as the following report.

  CPU: ppc64 970MP, speed 2500 MHz (estimated)
  Counted PM_CYC_GRP22 events ((Group 22 pm_pe_bench4) Processor cycles)
          with a unit mask of 0x00 (No unit mask) count 50000
  Counted PM_DTLB_MISS_GRP22 events ((Group 22 pm_pe_bench4) Data TLB misses)
          with a unit mask of 0x00 (No unit mask) count 1000
  PM_CYC_GRP22:5...|PM_DTLB_MISS_G...|
    samples|      %|  samples|      %|
  ------------------------------------
     156295 95.7408      2425 96.4215 stream

The calculation of the percentage of time spent servicing TLB misses is then as follows:

    CyclesExecuted = CyclesSamples * SampleRateOfCycles
                   = 156295 * 50000
                   = 7814750000 cycles

    TLBMissCycles = TLBMissSamples * SampleRateOfTLBMiss * TLBMissCost
                  = 2425 * 1000 * 563
                  = 1365275000

    PercentageTimeTLBMiss = (TLBMissCycles * 100) / CyclesExecuted
                          = 17.47%

Hence, the best possible performance gain we might expect from using huge pages with this workload is about 17.47%.
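
A minimal sketch of this calculation, using the sample counts from the report above (156295 cycle samples and 2425 DTLB samples) and the 563-cycle miss cost from Section 2.3. The function name is hypothetical.

```python
def percent_time_in_tlb_misses(cycle_samples, cycle_period,
                               miss_samples, miss_period, miss_cost):
    """Estimated percentage of executed cycles spent servicing TLB misses."""
    cycles_executed = cycle_samples * cycle_period
    miss_cycles = miss_samples * miss_period * miss_cost
    return 100.0 * miss_cycles / cycles_executed


pct = percent_time_in_tlb_misses(156295, 50000, 2425, 1000, 563)
print(f"{pct:.1f}%")  # 17.5%
```

Feeding the same function the base-page STREAM report from Section 2.5 (599073 cycle samples and 9492 DTLB samples) gives roughly 17.84%, which agrees with the prediction quoted there.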

2.5 Verifying Accuracy

Once a TLB miss cost has been estimated, it should be validated. The easiest means of doing this is with the STREAM benchmark, modified using this patch to use malloc() and rebuilt. The system must then be minimally configured to use huge pages with the benchmark. The huge page size on PPC64 is 16MB so the following commands will configure the system adequately for the validation. Note that the huge page pool allocation here represents roughly 1GB of huge pages for the STREAM benchmark.

    $ hugeadm --create-global-mounts
    $ hugeadm --pool-pages-min 16M:1040M
    $ hugeadm --pool-list
        Size  Minimum  Current  Maximum  Default
    16777216       65       65       65        *
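
The page count in the pool listing follows directly from the requested pool size. The sketch below shows the arithmetic; rounding up to a whole page is an assumption of this example rather than documented hugeadm behaviour, and it makes no difference here because 1040MB is an exact multiple of 16MB.

```python
MIB = 1024 * 1024


def pool_pages(pool_bytes, huge_page_bytes):
    """Number of huge pages needed to back pool_bytes, rounded up."""
    return -(-pool_bytes // huge_page_bytes)


pages = pool_pages(1040 * MIB, 16 * MIB)
print(pages)             # 65
print(pages * 16 * MIB)  # 1090519040 bytes
```

A 1GB working set needs 64 pages of 16MB, so the pool configured above leaves one page to spare.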

We then run STREAM with base pages and profiling to make a prediction on what the hugepage overhead will be.

  $ oprofile_start.sh --sample-cycle-factor 5 --event timer --event dtlb_miss
  [ ... profiler starts ... ]
  $ /usr/bin/time ./stream
  [ ...]
  Function      Rate (MB/s)   Avg time     Min time     Max time
  Copy:        2783.1461       0.2585       0.2572       0.2594
  Scale:       2841.6449       0.2530       0.2519       0.2544
  Add:         3080.5153       0.3499       0.3486       0.3511
  Triad:       3077.4167       0.3498       0.3489       0.3510
  12.10user 1.36system 0:13.69elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (0major+262325minor)pagefaults 0swaps

  $ opcontrol --stop
  $ opreport
  CPU: ppc64 970MP, speed 2500 MHz (estimated)
  Counted PM_CYC_GRP23 events ((Group 23 pm_hpmcount1) Processor cycles)
          with a unit mask of 0x00 (No unit mask) count 50000
  Counted PM_DTLB_MISS_GRP23 events ((Group 23 pm_hpmcount1) Data TLB misses)
          with a unit mask of 0x00 (No unit mask) count 1000
  PM_CYC_GRP23:5...|PM_DTLB_MISS_G...|
    samples|      %|  samples|      %|
  ------------------------------------
     599073 98.2975      9492 97.1844 stream

Using the methods described earlier, it is predicted that 17.84% of time is spent translating addresses. Note that /usr/bin/time reported that the benchmark took 13.69 seconds to complete. Now rerun the benchmark using huge pages.

  $ oprofile_start.sh --sample-cycle-factor 5 --event timer --event dtlb_miss
  [ ... profiler starts ... ]
  $ hugectl --heap /usr/bin/time ./stream
  [ ...]
  Function      Rate (MB/s)   Avg time     Min time     Max time
  Copy:        3127.4279       0.2295       0.2289       0.2308
  Scale:       3116.6594       0.2303       0.2297       0.2317
  Add:         3596.7276       0.2988       0.2985       0.2992
  Triad:       3604.6241       0.2982       0.2979       0.2985
  10.92user 0.82system 0:11.95elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (0major+295minor)pagefaults 0swaps

  $ opcontrol --stop
  $ opreport
  CPU: ppc64 970MP, speed 2500 MHz (estimated)
  Counted PM_CYC_GRP23 events ((Group 23 pm_hpmcount1) Processor cycles)
          with a unit mask of 0x00 (No unit mask) count 50000
  Counted PM_DTLB_MISS_GRP23 events ((Group 23 pm_hpmcount1) Data TLB misses)
          with a unit mask of 0x00 (No unit mask) count 1000
  PM_CYC_GRP23:5...|PM_DTLB_MISS_G...|
    samples|      %|  samples|      %|
  ------------------------------------
     538776 98.4168         0       0 stream

DTLB misses are now negligible within the STREAM benchmark and it completes in 11.95 seconds instead of 13.69, which is about 12% faster. Of the four operations, Copy is now 12.37% faster, Scale is 9.67% faster, Add is 16.75% faster and Triad is 17.13% faster. Hence, the estimate of 563 cycles for DTLB misses on this machine is reasonable.
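
The per-operation figures can be reproduced from the two STREAM outputs with a short script. This is a sketch for checking the comparison, with the rates and elapsed times copied from the runs above; the helper name percent_faster is illustrative.

```python
def percent_faster(base_rate, huge_rate):
    """Percentage increase in bandwidth when moving from base to huge pages."""
    return 100.0 * (huge_rate - base_rate) / base_rate


# (base page rate, huge page rate) in MB/s, from the two STREAM runs.
rates = {
    "Copy": (2783.1461, 3127.4279),
    "Scale": (2841.6449, 3116.6594),
    "Add": (3080.5153, 3596.7276),
    "Triad": (3077.4167, 3604.6241),
}

for op, (base, huge) in rates.items():
    print(f"{op}: {percent_faster(base, huge):.2f}% faster")

# Elapsed wall-clock times reported by /usr/bin/time.
elapsed_base, elapsed_huge = 13.69, 11.95
reduction = 100.0 * (elapsed_base - elapsed_huge) / elapsed_base
print(f"elapsed time reduced by {reduction:.1f}%")
```

The per-operation results land within rounding of the figures quoted above, and the reduction in elapsed time comes out a little under 13%. All of the gains fall below the predicted ceiling of 17.84%, as expected for an upper bound.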

3 Calculating TLB Miss Cost with libhugetlbfs

The methods described in this section for measuring TLB costs were incorporated into libhugetlbfs as of release 2.7 in a script called tlbmiss_cost.sh and a manual page is included. It automatically detects whether calibrator or oprofile should be used to measure the cost of a TLB miss and optionally will download the necessary additional programs to use for the measurement. By default, it runs silently but in the following example where a miss cost of 19 cycles was measured, verbose output is enabled to show details of it working.

    $ tlbmiss_cost.sh -v
    TRACE: Beginning TLB measurement using calibrator
    TRACE: Measured CPU Speed: 2167 MHz
    TRACE: Starting Working Set Size (WSS): 13631488 bytes
    TRACE: Required tolerance for match: 3 cycles
    TRACE: Measured TLB Latency 19 cycles within tolerance. Matched 1/3
    TRACE: Measured TLB Latency 19 cycles within tolerance. Matched 2/3
    TRACE: Measured TLB Latency 19 cycles within tolerance. Matched 3/3
    TLB_MISS_COST=19

4 Summary

While a deep understanding of the TLB and oprofile is not necessary to take advantage of huge pages, it can be instructive to know more about the TLB and the expected performance benefits before any modifications are made to a system configuration. Using oprofile, reasonably accurate predictions can be made in advance.

Conclusion

While virtual memory is an unparalleled success in engineering terms, it is not totally free. Despite multiple page sizes being available for over a decade, support within Linux was historically tricky to use and avoided by even skilled system administrators. Over the last number of years, effort within the community has brought huge pages to the point where they are relatively painless to configure and use with applications, even to the point of requiring no source level modifications to the applications. Using modern tools, it was shown that performance can be improved with minimal effort and a high degree of reliability.

In the future, there will still be a push for greater transparent support of huge pages, particularly for use with KVM. Patches are currently being developed by Andrea Arcangeli aiming at the goal of greater transparency. This represents a promising ideal but there is little excuse for avoiding huge page usage as they exist today.

Happy Benchmarking.

Bibliography

libhtlb09
Various Authors. libhugetlbfs 2.8 HOWTO. Packaged with the libhugetlbfs source. http://sourceforge.net/projects/libhugetlbfs, 2009.

casep78
Richard P. Case and Andris Padegs. Architecture of the IBM system/370. Commun. ACM, 21(1):73--96, 1978.

denning71
Peter J. Denning. On modeling program behavior. In AFIPS '71 (Fall): Proceedings of the November 16-18, 1971, fall joint computer conference, pages 937--944, New York, NY, USA, 1971. ACM.

denning96
Peter J. Denning. Virtual memory. ACM Comput. Surv., 28(1):213--216, 1996.

gorman09a
Mel Gorman. http://www.itwire.com/content/view/30575/1090/1/0. http://www.csn.ul.ie/~mel/docs/stream-api/, 2009.

henessny90
Hennessy, J. L. and Patterson, D. A. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, 1990.

manegold04
Stefan Manegold and Peter Boncz. The Calibrator (v0.9e), a Cache-Memory and TLB Calibration Tool. http://homepages.cwi.nl/~manegold/Calibrator/calibrator.shtml, 2004.

mccalpin07
John D. McCalpin. STREAM: Sustainable Memory Bandwidth in High Performance Computers. In a continually updated technical report. http://www.cs.virginia.edu/stream/, 2007.

smith82
Smith, A. J. Cache memories. ACM Computing Surveys, 14(3):473--530, 1982.

yotov04a
Kamen Yotov, Keshav Pingali, and Paul Stodghill. Automatic measurement of memory hierarchy parameters. Technical report, Cornell University, nov 2004.

yotov04b
Kamen Yotov, Keshav Pingali, and Paul Stodghill. X-ray : Automatic measurement of hardware parameters. Technical report, Cornell University, oct 2004.

Page editor: Jonathan Corbet


Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds