Are all patches created equal?
Some background: the hv driver helps Linux to function as a virtualized guest under Windows. It is useful code which, with luck, will soon move out of the staging tree and into the mainline kernel proper. After a period of relative neglect, developers at Microsoft have started cleaning up hv with a vengeance - 366 hv patches were merged for the 3.0 development cycle. This work has clear value; it is aimed at getting this code ready to graduate from staging; it is worth having.
That said, let's look at the actual patches. 47 of them simply move functions around to eliminate the need for forward declarations; 39 of them rename functions and variables; 135 of them take the form "get rid of X" for some value of (usually unused) X. Clearly this 15,000-line driver needed a lot of cleaning, and it's good that people are doing the work. But it also seems somewhat uncontroversial to say that this particular body of work does not constitute one of the more significant contributions to the 3.0 kernel.
Part of the problem, beyond any doubt, is the creation of lists of top changeset contributors in the first place. The number of changes is an extremely poor metric if one is interested in how much real, interesting work was contributed to the kernel. Some changes are clearly more interesting than others. Highlighting changeset counts may have ill effects beyond misleading readers; if the number of changesets matters, developers will have an incentive to bloat their counts through excessive splitting of changes - an activity which, some allege, has been going on for a while now.
LWN does post another type of statistic - the number of lines changed. As with changeset counts, this number does contain a modicum of information. But as a metric for the value of kernel contributions, it is arguably even worse than changeset counts. Picking a favorite Edsger Dijkstra quote is a challenging task, but this one would be a contender:

    My point today is that, if we wish to count lines of code, we should not regard them as "lines produced" but as "lines spent": the current conventional wisdom is so foolish as to book that count on the wrong side of the ledger.

Just because a patch changes (or adds) a lot of code does not mean that there is a lot of value to be found therein.
Given these problems, one might be tempted to just stop producing these statistics at all. Yet these metrics clearly have enough value to be interesting. When LWN first started posting these numbers, your editor was approached at conferences by representatives from two large companies who wanted to discuss discrepancies between those numbers and the ones they had been generating internally. We are routinely contacted by companies wanting to be sure that all of their contributions are counted properly. Developers have reported receiving job offers as a result of their appearance in the lists of top contributors. Changeset counts are also used to generate the initial list of nominees to the Kernel Summit. For better or for worse, people want to know who the most significant contributors to the kernel are.
So it would be good to find some kind of metric which yields that information in a more useful way than a simple count of changesets or lines of code. People who understand the code can usually look at a patch and come to some sort of internal assessment - though your editor might be disqualified by virtue of having once suggested that merging devfs would be a good idea. But the reason why even that flawed judgment is not used in LWN's lists is simple: when a typical development cycle brings in 10,000 changesets, the manual evaluation process simply does not scale.
So we would like to have a metric which would try, in an automated fashion, to come up with an improved estimate of the value of each patch. That does not sound like an easy task. One could throw out some ideas for heuristics as a place to start; here are a few (a rough sketch of how such rules might be scored follows the list):
- Changes to core code (in the kernel, mm, and fs directories, say) affect more users and are usually more heavily reviewed; they should probably be worth more.
- Bug fixes have value. A heuristic could try to look to see if the changelog contains a bug ID, whether the patch appears in a stable update, or whether it is a revert of a previous change.
- Patches that shuffle code but add no functional change generally have relatively low value. Patches adding documentation, instead, are priceless.
- Patches that remove code are generally good. A patch that adds code to a common directory and removes it from multiple other locations may be even better.
- Patches adding significant code which appears to be cut-and-pasted from elsewhere may have negative value.
- Changes merged late in the development cycle may not have a high value, but, if the development process is working as it should be, they should be worth something above the minimum.
- Changes merged directly by Linus presumably have some quality which caught his attention.
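As a very rough illustration (not a proposal), here is a minimal Python sketch of what a scorer built on rules like these might look like. The Patch fields, the weights, and the example patches are all invented for the purpose of the example.

```python
# A toy patch scorer based on a few of the heuristics listed above.
from dataclasses import dataclass

CORE_PREFIXES = ("kernel/", "mm/", "fs/")

@dataclass
class Patch:
    files: list                    # paths touched by the patch
    lines_added: int
    lines_removed: int
    changelog: str
    in_stable: bool = False        # did the patch land in a stable update?
    merged_by_linus: bool = False  # committed directly by Linus?
    days_before_release: int = 60  # how early in the cycle it was merged

def score(p: Patch) -> float:
    s = 1.0                                          # every merged patch starts equal
    if any(f.startswith(CORE_PREFIXES) for f in p.files):
        s += 2.0                                     # core code: more users, more review
    text = p.changelog.lower()
    if p.in_stable or "revert" in text or "bugzilla" in text:
        s += 1.5                                     # smells like a bug fix
    if any(f.startswith("Documentation/") for f in p.files):
        s += 1.0                                     # documentation bonus
    if p.lines_removed > p.lines_added:
        s += 0.5                                     # net code removal is generally good
    if p.days_before_release < 7:
        s -= 0.5                                     # late changes are worth a bit less
    if p.merged_by_linus:
        s += 0.5                                     # caught Linus's attention somehow
    return max(s, 0.1)                               # never below a token minimum

# Invented examples: a rename-only staging cleanup vs. a core bug fix.
cleanup = Patch(["drivers/staging/hv/hv.c"], 10, 10, "Staging: hv: rename foo to bar")
bugfix = Patch(["mm/page_alloc.c"], 2, 1,
               "mm: fix counter overflow (bugzilla #12345)", in_stable=True)
print(score(cleanup), score(bugfix))   # 1.0 vs. 4.5
```

The mechanics are trivial; picking weights that anyone would accept as fair is the hard part, as the conclusion below concedes.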
Once a scoring system was in place, one could, over time, try to develop a series of rules like the above in an attempt to better judge the true value of a developer's contribution.
That said, any such metric will certainly be seen as unfair by at least some developers - and rightly so, since it will undoubtedly be unfair. This problem has no solution that will be universally seen as correct. So, while we may well play with some of these ideas, it seems likely that we are stuck with changeset and lines-changed counts for the indefinite future. These metrics, too, are unfair, but at least they are unfair in an easily understood and objectively verifiable way.
Posted Jul 21, 2011 2:20 UTC (Thu)
by neilbrown (subscriber, #359)
[Link] (1 responses)
Well, maybe "as well as" in a separate statistic instead of "rather than".
Posted Jul 28, 2011 7:01 UTC (Thu)
by eliezert (subscriber, #35757)
[Link]
Also, if this code is included by someone, it matters more. Generally, the higher up in the kernel source directory tree a change is, the more people it matters to.
Posted Jul 21, 2011 7:52 UTC (Thu)
by Los__D (guest, #15263)
[Link]
Patches that shuffle code but add no functional change generally have relatively low value.
Posted Jul 21, 2011 8:10 UTC (Thu)
by zuki (subscriber, #41808)
[Link]
Posted Jul 21, 2011 8:23 UTC (Thu)
by Felix.Braun (guest, #3032)
[Link] (1 responses)
Maybe the success of the "Who was kernel X made by?"-series of articles stems from the fact that the statistics used -- unfair as they are -- are seen to have just enough information to be valuable. Everybody knows how these statistics were created (transparency clearly is of value here) and knows to take them with appropriate amounts of salt. So introducing some weighting into these statistics, which -- as our editor admits -- must necessarily be subjective, may in fact reduce their value.
Then again I might just be my ordinary nay-saying lawyerly self, who doesn't see the opportunity to improve the world. Tough being me.
Posted Jul 26, 2011 5:23 UTC (Tue)
by man_ls (guest, #15091)
[Link]
Posted Jul 21, 2011 8:25 UTC (Thu)
by wsa (guest, #52415)
[Link] (1 responses)
Posted Jul 22, 2011 17:46 UTC (Fri)
by Lovechild (guest, #3592)
[Link]
We set these standards for a reason and we should "reward" those who make a serious effort to live up to them.
Posted Jul 21, 2011 9:14 UTC (Thu)
by ortalo (guest, #4654)
[Link] (1 responses)
But, like one of the commenters above, I think you should not change the stats unless you actually think that they do not fit *your* needs or more generally feel the need to do so *yourself*.
First, it may prove convenient one day. With generous acks from core kernel developers, harmless, big and numerous NO-OP kernel patches and their subsequent removal may be a nice way to feed starving hackers and/or influence companies. It could even become a successful variant of classical obfuscating contests (which are not rewarded enough IMHO).
More seriously, if external people (certainly managers) are dumb enough to select employees or company-wide strategies just by looking at the tables *only* and *not* reading the article... I really do not see what you can do about it except insisting on the fact that this is an *article*, not an accounting report.
Posted Jul 21, 2011 9:36 UTC (Thu)
by ortalo (guest, #4654)
[Link]
Posted Jul 21, 2011 11:15 UTC (Thu)
by sitaram (guest, #5959)
[Link] (3 responses)
If I were to do this, I'd introduce a lag. Count those commits/changesets that are in the released kernel for more than 3 months and have not suffered any subsequent fixups.
Delayed gratification is not very popular though.
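Purely as an illustration of the mechanics, here is a sketch of how such a lagged count might be approximated from git history. It assumes a kernel git tree, uses the real v2.6.38/v2.6.39/v3.0 tags, and treats "suffered a fixup" as "referenced by a later Fixes: tag or revert message", which is only one rough approximation of the idea.

```python
# Count commits from one release that no later commit claims to fix or revert.
import re
import subprocess

def git_log(args):
    return subprocess.run(["git", "log", *args],
                          capture_output=True, text=True, check=True).stdout

def unfixed_commits(release_range="v2.6.38..v2.6.39", later_range="v2.6.39..v3.0"):
    """Commits merged in release_range with no known fixup in later_range."""
    candidates = git_log(["--format=%H", release_range]).splitlines()
    later_text = git_log(["--format=%s%n%b", later_range])
    # SHAs referenced by "Fixes:" tags or revert messages in the later range.
    fixed = set(re.findall(r"Fixes:\s*([0-9a-f]{7,40})", later_text))
    fixed |= set(re.findall(r"This reverts commit ([0-9a-f]{7,40})", later_text))
    return [c for c in candidates if not any(c.startswith(f) for f in fixed)]

print(len(unfixed_commits()), "commits from 2.6.39 with no known later fixup")
```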
Posted Jul 21, 2011 16:46 UTC (Thu)
by hpro (subscriber, #74751)
[Link] (2 responses)
What no one thought of was that it was the same people introducing the bugs as were fixing them.
Needless to say, a black market in bugs immediately formed.
Posted Jul 28, 2011 14:56 UTC (Thu)
by joeytsai (guest, #3480)
[Link] (1 responses)
http://dilbert.com/strips/comic/1995-11-14/
What a long way we've come!
Posted Jul 29, 2011 20:28 UTC (Fri)
by knobunc (guest, #4678)
[Link]
"I'm gonna write me a new minivan this afternoon!"
Posted Jul 21, 2011 13:14 UTC (Thu)
by canatella (subscriber, #6745)
[Link] (4 responses)
Posted Jul 21, 2011 13:39 UTC (Thu)
by ortalo (guest, #4654)
[Link]
Posted Jul 21, 2011 22:07 UTC (Thu)
by jspaleta (subscriber, #50639)
[Link]
Like everything else, it's subject to being gamed. Obviously, random bits of obfuscated code in a commit would show up with high entropy and thus high value. However, such a piece of code would probably be selected against in human review/sign-off, as it would be highly unreadable. So I think the entropy metric strikes a good balance there.
I think this idea would probably make a great comparison metric alongside the raw commit counts - a positive-impact assessment that is understandable to the main audience without getting too bogged down in very detailed scoring algorithms. From an information theory point of view, commits which cause the overall codebase to go up in entropy have a higher positive impact than those which lower it. We could actually start asking who is just pumping out trivial commits that are overall low impact, and find a way to encourage them to take the leap to something high impact.
The side effect, of course, is that if people strive over time to reach the top of the entropy-impact metric, we could see the kernel codebase become less and less compressible. Hopefully this is offset by large amounts of cleanup that shrink the overall size of the codebase.
-jef
Posted Jul 22, 2011 11:28 UTC (Fri)
by droundy (subscriber, #4559)
[Link] (1 responses)
Posted Jul 22, 2011 16:25 UTC (Fri)
by rgmoore (✭ supporter ✭, #75)
[Link]
I think it depends a lot on exactly how you measure entropy. You can either look at the total entropy of the code or you can look at some kind of entropy density like entropy per line or character of code. Adding code by copy/paste/edit will increase total entropy but decrease entropy per size. Refactoring the mass of duplicate code into something more compact may reduce the total entropy a little, but the drop in size should be enough for it to cause an increase in entropy per size. You'll probably want to look at both measures, just as the current system looks at both lines and changesets.
The problem with entropy based measurements is that they're really only useful for looking at the quantity of function; they don't say anything about quality. A one line patch that fixes a nasty bug in critical code has very little effect on entropy but a huge effect on correct function. And removing dead or deprecated code can be very valuable from a maintainability standpoint precisely because it reduces the total entropy.
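For what it's worth, both measurements are easy to approximate with a compressor. The sketch below simply uses zlib-compressed size as the "entropy" of a file tree, both in total and per byte; the file trees are invented for illustration.

```python
# Approximate total "entropy" and "entropy density" of a {path: bytes} tree.
import zlib

def compressed_size(data: bytes) -> int:
    """Approximate information content by zlib-compressed size."""
    return len(zlib.compress(data, 9))

def entropy_metrics(tree: dict):
    """Return (total 'entropy', 'entropy' per byte) for a {path: contents} tree."""
    blob = b"".join(tree[path] for path in sorted(tree))
    total = compressed_size(blob)
    return total, total / max(len(blob), 1)

# Invented example: heavy copy/paste vs. the refactored equivalent.
duplicated = {"drivers/a.c": b"int foo(void) { return 42; }\n" * 200}
refactored = {"lib/foo.c": b"int foo(void) { return 42; }\n"}

for name, tree in (("duplicated", duplicated), ("refactored", refactored)):
    total, density = entropy_metrics(tree)
    print(f"{name}: total={total} bytes, density={density:.3f}")
```

On trees like these, the copy/pasted version typically comes out higher on total compressed size but far lower on density, which is exactly the trade-off described above.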
Posted Jul 21, 2011 15:20 UTC (Thu)
by civodul (guest, #58311)
[Link] (2 responses)
For instance, whitespace changes, function moves, renames, etc. would count as little since they don't change the AST.
Posted Jul 24, 2011 19:41 UTC (Sun)
by jnareb (subscriber, #46500)
[Link] (1 responses)
Posted Jul 24, 2011 20:48 UTC (Sun)
by civodul (guest, #58311)
[Link]
Obviously a C parser is needed to turn source code into a canonical form (roughly, its AST). It may be possible to write a plug-in to abuse GCC to that end (see GCC-XML). Then some sort of tree diff on alpha-renamed ASTs would do it. However, I don't know what algorithms exist to compute the differences between two trees.
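As a small proof of concept of that pipeline, here is a sketch using pycparser (one freely available C parser; it needs already-preprocessed code) that serializes an AST with identifier names replaced by placeholders, so that a pure rename produces an identical dump. A real tool would still need a proper tree-diff step on top of this.

```python
# Normalize a C AST so that renames compare equal; a tree diff would go on top.
from pycparser import c_parser, c_ast

class Normalizer(c_ast.NodeVisitor):
    """Flatten an AST into a list of node descriptions, hiding identifier names."""
    def __init__(self):
        self.out = []

    def generic_visit(self, node):
        attrs = []
        for attr in node.attr_names:
            value = getattr(node, attr)
            # Identifier-like attributes are replaced so renames compare equal.
            attrs.append("<id>" if attr in ("name", "declname") else repr(value))
        self.out.append(f"{type(node).__name__}({', '.join(attrs)})")
        for _, child in node.children():
            self.generic_visit(child)

def normalized_dump(source: str):
    tree = c_parser.CParser().parse(source)
    n = Normalizer()
    n.generic_visit(tree)
    return n.out

# A pure rename: the normalized dumps match, so a diff of them scores near zero.
old = "int add(int a, int b) { return a + b; }"
new = "int sum(int x, int y) { return x + y; }"
print(normalized_dump(old) == normalized_dump(new))   # True
```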
Posted Jul 21, 2011 15:56 UTC (Thu)
by rgmoore (✭ supporter ✭, #75)
[Link] (1 responses)
One potentially interesting thing to look at would be the amount of discussion associated with each patch on their appropriate mailing list. Changes that result in small scale discussion or just a pull request are probably minor, uncontroversial changes. Ones that result in big discussions are likely to be either technically challenging changes that require a lot of discussion to get right, or controversial changes that require a lot of argument to convince people they're worth adding. Either way, they're likely to be far more interesting than the ones that pass with barely a whisper.
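A crude version of this is not hard to prototype. The sketch below counts messages per (normalized) subject line in a local mbox archive of the list; the mbox path and the subject-normalization rules are made up, and real threading would want to follow In-Reply-To/References headers rather than subjects.

```python
# Rank patches by how many list messages their subject attracted.
import mailbox
import re
from collections import Counter

def thread_sizes(mbox_path: str) -> Counter:
    """Count messages per thread, keyed by the normalized patch subject."""
    counts = Counter()
    for msg in mailbox.mbox(mbox_path):
        subject = str(msg.get("Subject", "") or "")
        subject = re.sub(r"\s+", " ", subject)
        subject = re.sub(r"^(re:\s*)+", "", subject, flags=re.IGNORECASE)
        # Fold "[PATCH v3 2/7]"-style prefixes so all replies land in one bucket.
        subject = re.sub(r"\[patch[^\]]*\]\s*", "", subject, flags=re.IGNORECASE)
        if subject:
            counts[subject.strip().lower()] += 1
    return counts

# Hypothetical usage: the ten most-discussed patches in a month of list traffic.
for subject, n in thread_sizes("lkml-2011-07.mbox").most_common(10):
    print(f"{n:4d}  {subject}")
```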
Posted Jul 28, 2011 11:40 UTC (Thu)
by nix (subscriber, #2304)
[Link]
Posted Jul 21, 2011 15:59 UTC (Thu)
by mjthayer (guest, #39183)
[Link] (1 responses)
Posted Jul 30, 2011 10:21 UTC (Sat)
by oak (guest, #2786)
[Link]
Excellent idea. I think it would be enough to keep >50% as the threshold for where the majority of changes were done. And if the changes are split more widely than that, the table could just state how many directories the changes touched.
Posted Jul 21, 2011 18:58 UTC (Thu)
by david.a.wheeler (subscriber, #72896)
[Link] (2 responses)
If you can make better metrics, that'd be nice, but even these "unfair" metrics still help give some insight. The fact that Microsoft was working to get its contributions into the *mainline* Linux kernel was far more obvious from the statistics. So even though all measurements have weaknesses, they're still useful, and I'm glad LWN is doing them.
Posted Jul 21, 2011 20:38 UTC (Thu)
by jspaleta (subscriber, #50639)
[Link] (1 responses)
As a community we value devs who "do the right thing" and mainline their new ideas. Can we capture that as a broken-out metric? Activity in staging, broken out on its own, would be the thing to look at for that.
As a community we strongly dislike un-maintained code dumps. So how can we capture that and produce either a stat that lifts up good maintainers or a shaming stat for bad maintainers of existing code? Do we trend commits that attempt to fix reported bugs for this metric?
What other specific behaviors do we really like to see as best practices?
Also, perhaps it's time to move away from fine-scale ordering and instead think about chunking the stats at a less granular level. No more simple rank ordering; instead, group highly productive devs/companies into pools which equate to a curved grade scale weighted to a median value: X companies/devs get A grades (doing great) in the maintainership stat, Y companies get C grades (doing poorly), Z companies get B grades (so-so), and just report that. Stop with the micro-managed class ranking and have a pool of honors students that can be viewed collectively as peers who set the standards for best practices.
-jef
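A toy sketch of the pooling idea, assuming some per-contributor score already exists; the scores and the grade cutoffs below are entirely made up.

```python
# Bucket contributors into coarse grade pools around the median score.
from statistics import median

def grade_pools(scores: dict) -> dict:
    """Assign a coarse A/B/C grade relative to the median score."""
    mid = median(scores.values())
    grades = {}
    for who, score in scores.items():
        if score >= 2 * mid:
            grades[who] = "A"      # well above the pack
        elif score >= 0.5 * mid:
            grades[who] = "B"      # roughly in line with the median
        else:
            grades[who] = "C"      # well below it
    return grades

# Hypothetical usage with made-up maintainership scores.
print(grade_pools({"company1": 9.0, "company2": 4.1, "company3": 3.8, "company4": 0.7}))
```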
Posted Jul 21, 2011 21:49 UTC (Thu)
by dlang (guest, #313)
[Link]
how do the contributions bunch?
we've been looking at the top few in detail, but how does it taper off?
Posted Jul 21, 2011 20:07 UTC (Thu)
by cesarb (subscriber, #6266)
[Link]
For instance, a change to drivers/net/sc92031.c will only affect those with a PCI device matching its short list of PCI IDs, and who have that driver compiled as a module. Changes to code in it outside the module init/exit functions (which are one line long each) will only affect those with a PCI device on the list, even if the driver was compiled as built-in. Changes to core code affect everyone who has the right CONFIG_ options enabled. And so on. What matters is not whether the code is actually executed, but whether it could be executed on that particular machine.
The data needed for that calculation would be a database of machines, their hardware (PCI/USB/DMI/etc IDs), and their config file for the currently running kernel. But that would be the easy part; matching which parts of the kernel are enabled or disabled depending on the configuration options and present hardware, and which of these each patch touches, would be the hard part.
With that metric, the changes to the hv driver would count only in proportion to the small number of people who run that specific virtual machine technology.
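A back-of-the-envelope sketch of that weighting follows, with everything hypothetical: the machine database, the file-to-CONFIG_ mapping (which in reality would have to be derived from Kconfig and the build system), and the patch file lists.

```python
# Weight a patch by the fraction of known machines whose config enables its code.
def reach(patch_files, file_to_config, machines):
    """Fraction of machines whose config enables any file the patch touches."""
    needed = {file_to_config[f] for f in patch_files if f in file_to_config}
    if not needed or not machines:
        return 0.0
    hits = sum(1 for enabled in machines if needed & enabled)
    return hits / len(machines)

# Hypothetical data: two machines, only one of which runs as a Hyper-V guest.
file_to_config = {
    "drivers/staging/hv/channel.c": "CONFIG_HYPERV",
    "mm/page_alloc.c": "CONFIG_MMU",
}
machines = [{"CONFIG_MMU"}, {"CONFIG_MMU", "CONFIG_HYPERV"}]

print(reach(["drivers/staging/hv/channel.c"], file_to_config, machines))  # 0.5
print(reach(["mm/page_alloc.c"], file_to_config, machines))               # 1.0
```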
Posted Jul 21, 2011 23:11 UTC (Thu)
by iabervon (subscriber, #722)
[Link] (1 responses)
Posted Jul 22, 2011 0:01 UTC (Fri)
by dlang (guest, #313)
[Link]
if this was to the 2.6.41 release, few people would care.
but the fact that the number is incrementing to 3.0 makes people assume that there are orders of magnitude more changes going into 3.0 than went into 2.6.39
the "what's new in 3.0" document that lists all the changes since 2.6.0 helps confuse the issue as well.
Posted Jul 24, 2011 19:50 UTC (Sun)
by jnareb (subscriber, #46500)
[Link]
I think that GitStats includes such statistics.
Posted Jul 25, 2011 8:50 UTC (Mon)
by mlankhorst (subscriber, #52260)
[Link]
If counting things motivates people to produce them, then counting "Reviewed-by" certainly seems like a good idea.
- They reviewed it (maybe more for the second and above)
- They reported it
- They tested it
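Counting those credits is straightforward if the history is at hand. A sketch, assuming a kernel git tree and using the real v2.6.39..v3.0 range (the output formatting is arbitrary):

```python
# Count Reviewed-by/Reported-by/Tested-by credits per person over a range.
import re
import subprocess
from collections import Counter

TAGS = ("Reviewed-by", "Reported-by", "Tested-by")

def credit_counts(rev_range="v2.6.39..v3.0"):
    """Count how often each person appears in each credit tag over a range."""
    log = subprocess.run(["git", "log", "--format=%b", rev_range],
                         capture_output=True, text=True, check=True).stdout
    counts = {tag: Counter() for tag in TAGS}
    for tag in TAGS:
        for person in re.findall(rf"^{tag}:\s*(.+)$", log, flags=re.MULTILINE):
            counts[tag][person.strip()] += 1
    return counts

# Hypothetical usage: top reviewers for the 3.0 cycle.
for person, n in credit_counts()["Reviewed-by"].most_common(5):
    print(f"{n:5d}  {person}")
```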
Do NOT agree. Code cleanup is one of the most important tasks.
On the contrary, I think you are right on target as to why the "Who wrote ..." articles are popular. On the other hand, a weighted metric such as the one our editor is seeking is not widely used in software engineering; creating one would be pushing the envelope and therefore interesting to watch.
Furthermore, even if you change your statistics, they will certainly find new ways to misuse them.
http://cancerguide.org/median_not_msg.html
http://dilbert.com/strips/comic/1995-11-13/
Instead of unidiff line counts, some form of semantic patch would provide more insight on the actual amount of changes.
Are there any tools, preferably open source, that can turn unidiff patch + preimage into some form of semantic patch (like e.g. Coccinelle)?
For instance, whitespace changes, function moves, renames, etc. would count as little since they don't change the AST.
Without this I don't see how it would be any better than manual analysis...
One metric that is easy to calculate (though a bit computationally intensive) would be to run blame for each changed file, with the previous release as the boundary, and count the attributed surviving lines of code.
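A sketch of that calculation for a single file, assuming a kernel git tree: git blame accepts a revision range, so lines older than the previous release show up as boundary commits and can be skipped. The file path in the usage line is only an example; summing over all files changed in the release would give the per-author totals described.

```python
# Per-author count of a file's lines that were written since the previous release.
import subprocess
from collections import Counter

def surviving_lines(path, old="v2.6.39", new="v3.0"):
    """Lines of `path` at `new` that were written since `old`, counted per author."""
    out = subprocess.run(
        ["git", "blame", "--line-porcelain", f"{old}..{new}", "--", path],
        capture_output=True, text=True, check=True).stdout
    counts = Counter()
    author, boundary = None, False
    for line in out.splitlines():
        if line.startswith("author "):
            author = line[len("author "):]
        elif line == "boundary":          # line predates `old`; don't credit it
            boundary = True
        elif line.startswith("\t"):       # the source line itself closes the record
            if not boundary and author:
                counts[author] += 1
            author, boundary = None, False
    return counts

print(surviving_lines("drivers/staging/hv/channel.c").most_common(5))
```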