Are all patches created equal?

By Jonathan Corbet
July 20, 2011

Last week's Kernel Page included an article on the top contributors to the 3.0 kernel, an extended version of our traditional look at who participated in each kernel development cycle. Since this article is traditional, it tends not to draw a lot of attention. This time was different: quite a few publications have picked up on the fact that Microsoft was one of the top contributors of changesets by virtue of a long series of cleanups to its "hv" driver in the staging tree. Those sites, seemingly, got a lot more mileage out of those results than LWN did, an amusing outcome which can be expected occasionally with subscriber-only content. That said, this outcome is a bit dismaying for other reasons.

Some background: the hv driver helps Linux to function as a virtualized guest under Windows. It is useful code which, with luck, will soon move out of the staging tree and into the mainline kernel proper. After a period of relative neglect, developers at Microsoft have started cleaning up hv with a vengeance - 366 hv patches were merged for the 3.0 development cycle. This work has clear value; it is aimed at getting this code ready to graduate from staging; it is worth having.

That said, let's look at the actual patches. 47 of them simply move functions around to eliminate the need for forward declarations; 39 of them rename functions and variables; 135 of them take the form "get rid of X" for some value of (usually unused) X. Clearly this 15,000-line driver needed a lot of cleaning, and it's good that people are doing the work. But it also seems somewhat uncontroversial to say that this particular body of work does not constitute one of the more significant contributions to the 3.0 kernel.

Part of the problem, beyond any doubt, is the creation of lists of top changeset contributors in the first place. The number of changes is an extremely poor metric if one is interested in how much real, interesting work was contributed to the kernel. Some changes are clearly more interesting than others. Highlighting changeset counts may have ill effects beyond misleading readers; if the number of changesets matters, developers will have an incentive to bloat their counts through excessive splitting of changes - an activity which, some allege, has been going on for a while now.

LWN does post another type of statistic - the number of lines changed. As with changeset counts, this number does contain a modicum of information. But as a metric for the value of kernel contributions, it is arguably even worse than changeset counts. Picking a favorite Edsger Dijkstra quote is a challenging task, but this one would be a contender:

If we wish to count lines of code, we should not regard them as "lines produced" but as "lines spent".

Just because a patch changes (or adds) a lot of code does not mean that there is a lot of value to be found therein.

Given these problems, one might be tempted to just stop producing these statistics at all. Yet these metrics clearly have enough value to be interesting. When LWN first started posting these numbers, your editor was approached at conferences by representatives from two large companies who wanted to discuss discrepancies between those numbers and the ones they had been generating internally. We are routinely contacted by companies wanting to be sure that all of their contributions are counted properly. Developers have reported receiving job offers as a result of their appearance in the lists of top contributors. Changeset counts are also used to generate the initial list of nominees to the Kernel Summit. For better or for worse, people want to know who the most significant contributors to the kernel are.

So it would be good to find some kind of metric which yields that information in a more useful way than a simple count of changesets or lines of code. People who understand the code can usually look at a patch and come to some sort of internal assessment - though your editor might be disqualified by virtue of having once suggested that merging devfs would be a good idea. But the reason why even that flawed judgment is not used in LWN's lists is simple: when a typical development cycle brings in 10,000 changesets, the manual evaluation process simply does not scale.

So we would like to have a metric which would try, in an automated fashion, to come up with an improved estimate of the value of each patch. That does not sound like an easy task. One could throw out some ideas for heuristics as a place to start; here are a few:

Changes to core code (in the kernel, mm, and fs directories, say) affect more users and are usually more heavily reviewed; they should probably be worth more.
Bug fixes have value. A heuristic could try to look to see if the changelog contains a bug ID, whether the patch appears in a stable update, or whether it is a revert of a previous change.
Patches that shuffle code but add no functional change generally have relatively low value. Patches adding documentation, instead, are priceless.
Patches that remove code are generally good. A patch that adds code to a common directory and removes it from multiple other locations may be even better. Patches adding significant code which appears to be cut-and-pasted from elsewhere may have negative value.
Changes merged late in the development cycle may not have a high value, but, if the development process is working as it should be, they should be worth something above the minimum.
Changes merged directly by Linus presumably have some quality which caught his attention.

Once a scoring system was in place, one could, over time, try to develop a series of rules like the above in an attempt to better judge the true value of a developer's contribution.

That said, any such metric will certainly be seen as unfair by at least some developers - and rightly so, since it will undoubtedly be unfair. This problem has no solution that will be universally seen as correct. So, while we may well play with some of these ideas, it seems likely that we are stuck with changeset and lines-changed counts for the indefinite future. These metrics, too, are unfair, but at least they are unfair in an easily understood and objectively verifiable way.

Are all patches created equal?

Posted Jul 21, 2011 2:20 UTC (Thu) by neilbrown (subscriber, #359) [Link] (1 responses)

Attribute change sets and lines to the "Reviewed-by" reviewer rather than the "Author".

Well, maybe "as well as" in a separate statistic instead of "rather than".
If counting things motivates people to produce them, then counting "Reviewed-by" certainly seems like a good idea.

Are all patches created equal?

Posted Jul 28, 2011 7:01 UTC (Thu) by eliezert (subscriber, #35757) [Link]

I would generalize this into a "did anyone care?" metric, someone cared if:
They reviewed it (maybe more for the second and above)
they reported it
They tested it

Also if this code is included by someone it matters more

generally, the higher up on the kernel source directory tree it is, it matters to more people.

Are all patches created equal?

Posted Jul 21, 2011 7:52 UTC (Thu) by Los__D (guest, #15263) [Link]

Patches that shuffle code but add no functional change generally have relatively low value.
Do NOT agreee. Code cleanup is one of the most important tasks.

Are all patches created equal?

Posted Jul 21, 2011 8:10 UTC (Thu) by zuki (subscriber, #41808) [Link]

Patches which have a long commit message are usually worth more than patches with a short commit message or even just the subject line. This is especially true for bugfix patches which often have a few paragraphs of explanation and diff changing a few characters.

Are all patches created equal?

Posted Jul 21, 2011 8:23 UTC (Thu) by Felix.Braun (guest, #3032) [Link] (1 responses)

To me as a non engineer, this seems to be a typical attempt to find a technical solution to a social problem. Clearly, the worth of a patch is a highly subjective measure: the patch that fixes a bug that prevents me from booting clearly is worth more to me than one, that fixes my audio (unless of course, I'm an audio artist and depend on the working of my audio system for a living) which in turn is worth more than a patch that fixes a race condition leading to frequent memory corruption but only on SMP-systems, because I don't own one of those.

Maybe the success of the "Who was kernel X made by?"-series of articles stems from the fact that the statistics used -- unfair as they are -- are seen to have just enough information to be valuable. Everybody knows how these statistics were created (transparency clearly is of value here) and knows to take them with appropriate amounts of salt. So introducing some weighting into these statistics, which -- as our editor admits -- must necessarily be subjective, may in fact reduce their value.

Then again I might just be my ordinary nay-saying lawyerly self, that doesn't see the opportunity to improve the world. Tough being me.

Social metrics

Posted Jul 26, 2011 5:23 UTC (Tue) by man_ls (guest, #15091) [Link]

On the contrary, I think you are right on target as to why the "Who wrote ..." articles are popular. On the other hand, a weighted metric such as the one our editor is seeking is not widely used in software engineering; creating one would be pushing the envelope and therefore interesting to watch.

Are all patches created equal?

Posted Jul 21, 2011 8:25 UTC (Thu) by wsa (guest, #52415) [Link] (1 responses)

Dropping 'staging' from the stats (or make separate staging-stats once in a while) would be a step in the right direction IMO. There _is_ valuable work done there, yet the majority of patches are adding/removing/cleaning stuff and are more of the "lower value" kind. While dropping that from the stats might be unfair to some, it is less unfair in general, I'd think.

Are all patches created equal?

Posted Jul 22, 2011 17:46 UTC (Fri) by Lovechild (guest, #3592) [Link]

I think it would unfairly unvalue people such as Microsoft who have clearly invested heavily in cleaning up their code and making it conform to the Linux coding standards.

We set these standards for a reason and we should "reward" those who make a serious effort to live up to them.

Are all patches created equal?

Posted Jul 21, 2011 9:14 UTC (Thu) by ortalo (guest, #4654) [Link] (1 responses)

If you really need to do something, I suggest adding as a warning a permanent initial reference in these articles to http://www.york.ac.uk/depts/maths/histstat/lies.htm ; the classical article/citation "Lies, Damn Lies and... Statistics" (which 90% of people attribute to Mark Twain... ;-).

But, like one of the commenters above, I think you should not change the stats unless you actually think that they do not fit *your* needs or more generally feel the need to do so *yourself*.

First, it may prove convenient one day. With generous acks from core kernel developpers, harmless, big and numerous NO-OP kernel patches and their subsequent removal may be a nice way to feed starving hackers and/or influence companies. It could even become a successful variant of classical obfuscating contests (which are not rewarded enough IMHO).

More seriously, if external people (certainly managers) are dumb enough to select employees or company-wide strategies just by looking at the tables *only* and *not* reading the article... I really do not see what you can do about it except insisting on the fact that this is an *article*, not an accounting report.
Furthermore, even if you change your statistics, they will certainly find new ways to misuse them.

Are all patches created equal?

Posted Jul 21, 2011 9:36 UTC (Thu) by ortalo (guest, #4654) [Link]

BTW, that link may be more useful to your readers than the former:
http://cancerguide.org/median_not_msg.html

Are all patches created equal?

Posted Jul 21, 2011 11:15 UTC (Thu) by sitaram (guest, #5959) [Link] (3 responses)

"Bug fixes have value": yes, but only if they're someone else's bugs or bugs in long standing code. You can't submit buggy code today, send in 10 fixes tomorrow, and have a score of 10.

If I were to do this, I'd introduce a lag. Count those commits/changesets that are in the released kernel for more than 3 months and have not suffered any subsequent fixups.

Delayed gratification is not very popular though.

Are all patches created equal?

Posted Jul 21, 2011 16:46 UTC (Thu) by hpro (subscriber, #74751) [Link] (2 responses)

This reminds me of something Scott Adams wrote in one of the Dilbert books. A software company introduced a bounty system where QA people would get a small bonus for finding a bug, and coders would get another small bonus for fixing the bug.

What no one thought of was that it was the same people introducing the bugs as were fixing them.

Needless to say, a black market in bugs immediately formed.

Are all patches created equal?

Posted Jul 28, 2011 14:56 UTC (Thu) by joeytsai (guest, #3480) [Link] (1 responses)

I think this is the related comic, I've always remembered it:

http://dilbert.com/strips/comic/1995-11-14/

What a long way we've come!

Are all patches created equal?

Posted Jul 29, 2011 20:28 UTC (Fri) by knobunc (guest, #4678) [Link]

The one from the previous day was the one that stuck with me:
http://dilbert.com/strips/comic/1995-11-13/

"I'm gonna write me a new minivan this afternoon!"

Are all patches created equal?

Posted Jul 21, 2011 13:14 UTC (Thu) by canatella (subscriber, #6745) [Link] (4 responses)

Measure the entropy added by the patch to the kernel code ;)

Are all patches created equal?

Posted Jul 21, 2011 13:39 UTC (Thu) by ortalo (guest, #4654) [Link]

I second this very nice idea too!

Are all patches created equal?

Posted Jul 21, 2011 22:07 UTC (Thu) by jspaleta (subscriber, #50639) [Link]

that's actually a very very clever idea. It would reward both people who do code clean ups by removal of duplicate code snippets as well people who inject non copy-pasted code snippets, including usable documentation and comemnts.

Like everything else its subject to being gamed. Obviously random bits of obfuscated code in a commit would show up with high entropy and thus high value. However such a piece of code would probably be selected against in human review/sign off as it would be highly unreadable. So I think the entropy metric strikes a good balance there.

I think as this idea is probably make a great comparison metric with just the raw commits as a positive impact assessment that is understandable to the main audience without getting too bogged down into very detailed scoring algorithms. From an information theory pov, commits which caused the overall codebase to go up in entropy is a higher positive impact than those which lower entropy. We can actually start asking the question who is just pumping out trivial commits that are overall low impact, and find a way to encourage them to take the leap to something high impact.

The side effect of that of course is that, over time, if people strive to reach the top of the entropy impact metric, we could see the kernel codebase become less and less compressible. Hopefully this is offset by large amounts of cleanup that shrinks the overall size of the codebase to offset the entropy increase leading to less compression.

-jef

Are all patches created equal?

Posted Jul 22, 2011 11:28 UTC (Fri) by droundy (subscriber, #4559) [Link] (1 responses)

But wouldn't certain valuable changes consistently decrease the entropy of the kernel? I'm thinking of things like an ARM rewrite that removes board-specific code...

Are all patches created equal?

Posted Jul 22, 2011 16:25 UTC (Fri) by rgmoore (✭ supporter ✭, #75) [Link]

I think it depends a lot on exactly how you measure entropy. You can either look at the total entropy of the code or you can look at some kind of entropy density like entropy per line or character of code. Adding code by copy/paste/edit will increase total entropy but decrease entropy per size. Refactoring the mass of duplicate code into something more compact may reduce the total entropy a little, but the drop in size should be enough for it to cause an increase in entropy per size. You'll probably want to look at both measures, just as the current system looks at both lines and changesets.

The problem with entropy based measurements is that they're really only useful for looking at the quantity of function; they don't say anything about quality. A one line patch that fixes a nasty bug in critical code has very little effect on entropy but a huge effect on correct function. And removing dead or deprecated code can be very valuable from a maintainability standpoint precisely because it reduces the total entropy.

Are all patches created equal?

Posted Jul 21, 2011 15:20 UTC (Thu) by civodul (guest, #58311) [Link] (2 responses)

Instead of unidiff line counts, some form of semantic patch would provide more insight on the actual amount of changes.

For instance, whitespace changes, function moves, renames, etc. would count as little since they don't change the AST.

Are all patches created equal?

Posted Jul 24, 2011 19:41 UTC (Sun) by jnareb (subscriber, #46500) [Link] (1 responses)

Instead of unidiff line counts, some form of semantic patch would provide more insight on the actual amount of changes.
For instance, whitespace changes, function moves, renames, etc. would count as little since they don't change the AST.

Are there any tools, preferably open source, that can turn unidiff patch + preimage into some form of semantic patch (like e.g. Coccinelle)?
Without this I don't see how it would be any better than manual analysis...

Are all patches created equal?

Posted Jul 24, 2011 20:48 UTC (Sun) by civodul (guest, #58311) [Link]

Unfortunately I don't know of any tool that would create a semantic patch from a unidiff patch (the opposite of Coccinelle.)

Obviously a C parser is needed to turn source code into a canonical form (roughly, its AST). It may be possible to write a plug-in to abuse GCC to that end (see GCC-XML). Then some sort of tree diff on alpha-renamed ASTs would do it. However, I don't know what algorithms exist to compute the differences between two trees.

Are all patches created equal?

Posted Jul 21, 2011 15:56 UTC (Thu) by rgmoore (✭ supporter ✭, #75) [Link] (1 responses)

One potentially interesting thing to look at would be the amount of discussion associated with each patch on their appropriate mailing list. Changes that result in small scale discussion or just a pull request are probably minor, uncontroversial changes. Ones that result in big discussions are likely to be either technically challenging changes that require a lot of discussion to get right, or controversial changes that require a lot of argument to convince people they're worth adding. Either way, they're likely to be far more interesting than the ones that pass with barely a whisper.

Are all patches created equal?

Posted Jul 28, 2011 11:40 UTC (Thu) by nix (subscriber, #2304) [Link]

In my experience, changes that trigger almost no discussion are either dead boring, or really challenging and important but so arcane that nobody reviews them (more true in fs/ than mm/ it seems). Changes that trigger a lot of discussion are either really challenging and important, or simply prone to bikeshedding (what version numbering scheme should we give the kernel again?). So 'amount of discussion' trends both ways...

Are all patches created equal?

Posted Jul 21, 2011 15:59 UTC (Thu) by mjthayer (guest, #39183) [Link] (1 responses)

In the lists of kernel contributors, you could list the subsystems where the majority of the changesets (90%?) were made after the names, or even be more precise if the changes were very localised (e.g. "Microsoft (drivers/staging/hv)"). This would give readers more information to judge the "value" of the contributions without you having to do so for them.

Are all patches created equal?

Posted Jul 30, 2011 10:21 UTC (Sat) by oak (guest, #2786) [Link]

> In the lists of kernel contributors, you could list the subsystems where the majority of the changesets (90%?) were made after the names, or even be more precise if the changes were very localised (e.g. "Microsoft (drivers/staging/hv)").

Excellent idea. I think it would be enough to keep >50% as the limit for where majority of changes were done. And if the changes are split more, the table could just state in how many directories the changes were applied.

Are all patches created equal?

Posted Jul 21, 2011 18:58 UTC (Thu) by david.a.wheeler (subscriber, #72896) [Link] (2 responses)

"These metrics... are unfair, but at least they are unfair in an easily understood and objectively verifiable way."

If you can make better metrics, that'd be nice, but even these "unfair" metrics still help give some insight. The fact that Microsoft was working to get its contributions into the *mainline* Linux kernel was far more obvious from the statistics. So even though all measurements have weaknesses, they're still useful, and I'm glad LWN is doing them.

Are all patches created equal?

Posted Jul 21, 2011 20:38 UTC (Thu) by jspaleta (subscriber, #50639) [Link] (1 responses)

Let's break that out a bit more.

As a community we value devs you "do the right thing" and mainline their new ideas. Can we capture that as a broken out metric? Activity in staging broken out would be the thing to look at for that.

As a community we strongly dislike un-maintained code dumps. So how can we capture that and produce either a stat that lifts up good maintainers or a shaming stat for bad maintainers of existing code? Do we trend commits that attempt to fix reported bugs for this metric?

What other specific behaviors do we really like to see as best practices?

Also perhaps its time to move away from fine scale ordering and instead think about chunking stats at a less granular level. No more simple rank ordering, but group highly productive devs/companies, into pools which equate to a curved grade scale weighted to a median value. X number of companies/devs get A grades (doing great) in the maintainership stat, Y companies get C grades (doing poorly), Z companies get B grades (so-so) and just report that. Stop with the micro-managed class ranking and have a pool of honors students that can be viewed collectively as peers who set the standards for best practices.

-jef

Are all patches created equal?

Posted Jul 21, 2011 21:49 UTC (Thu) by dlang (guest, #313) [Link]

Jef brings up an interesting point.

how to the contributions bunch.

we've been looking at the top few in detail, but how does it taper off?

Code reach

Posted Jul 21, 2011 20:07 UTC (Thu) by cesarb (subscriber, #6266) [Link]

An interesting (but hard to calculate) metric would be the "reach" of the change, measured by "how many machines are using code which is affected by this change".

For instance, a change to drivers/net/sc92031.c will only affect those with a PCI device matching its short list of PCI IDs, and which have that driver compiled as a module. Changes to code in it outside the module init/exit functions (which are one line long each) will only affect those with a PCI device on the list, even if the driver was compiled as built-in. Changes to core code affect everyone which has the right CONFIG_ options which enable that code. And so on. What matters is not if the code is actually executed, but that it could be executed on that particular machine.

The data needed for that calculation would be a database of machines, their hardware (PCI/USB/DMI/etc IDs), and their config file for the currently running kernel. But that would be the easy part; matching which parts of the kernel are enabled or disabled depending on the configuration options and present hardware, and which of these each patch touches, would be the hard part.

With that metric, the changes to the hv driver would count only in proportion to the small amount of people who run that specific virtual machine technology.

Are all patches created equal?

Posted Jul 21, 2011 23:11 UTC (Thu) by iabervon (subscriber, #722) [Link] (1 responses)

Obviously, Microsoft shouldn't get extra credit for writing a driver that needed at least 366 patches of cleaning and then doing that cleaning. Ultimately, the total value of the original submission, cleanup done before the 3.0 cycle, these 366 patches, and the rest of the cleanup that needs to be done is one useful driver (with no useful development history). In the end, Microsoft will have contributed one good thing, and done it substantially in 3.0 and substantially in the original submission. Whatever you count positive here, you have to deduct from the earlier value.

Are all patches created equal?

Posted Jul 22, 2011 0:01 UTC (Fri) by dlang (guest, #313) [Link]

one thing that complicates this issue is that they are the big contributer to the 3.0 release

if this was to the 2.6.41 release, few people would care.

but the fact that the number is incrementing to 3.0 makes people assume that there are orders of magnitude more changes going into 3.0 than went into 2.6.39

the "what's new in 3.0" document that lists all the changes since 2.6.0 helps confuse the issue as well.

Blame, i.e. surviving lines of code

Posted Jul 24, 2011 19:50 UTC (Sun) by jnareb (subscriber, #46500) [Link]

One metric that is easy to calculate (though a bit computationally intensive), would be to calculate blame for each changed file with previous release as boundary, and count attributed surviving lines of code.

I think that GitStats includes such statistics.

Are all patches created equal?

Posted Jul 25, 2011 8:50 UTC (Mon) by mlankhorst (subscriber, #52260) [Link]

Honestly I think every system has its flaws here. I'd say keep the statistics as is and report on reasons for certain outliers. If people want to ignore that and shout 'Microsoft is the biggest corporate contributor to Linux', no amount of fair metrics is going to change that..