Representative samples: the Holy Grail

Posted Apr 18, 2005 18:20 UTC (Mon) by Max.Hyre (subscriber, #1054)
In reply to: Linux wins on security in survey of 6,000+ software developers by jwb
Parent article: Linux wins on security in survey of 6,000+ software developers

If I took anything away from my statistics courses, it's that the absolutely hardest part to get right is sampling. (Though figuring out the right statistical analysis to use is close behind.)

It's hard because you have to:

  • First, figure out what your sample population is: Sysadmins? Developers? CIOs? A mixture of them in various proportions? Can you determine the subset that knows what they're talking about?
  • Then you have to figure out how to generate a random sample of the members of your population, which is not trivial.
  • Next, how do you reach that set of the population? Always have Dewey vs. Truman floating in front of your eyes. (They reached their sample population [the electorate] by telephone, heavily biasing it [in 1948] toward the well-off. For gory details, google for `truman dewey poll'.)
  • Finally, after doing a good job of all of the above, you have to get your sample to respond to you. How many will be on vacation in Lower Slobovia? How many pick up their voicemail, or look at their e-mail, frequently, responding in time to do you any good? How many will downright refuse to have anything to do with you? Discarding these sample points, either by not counting them, or choosing someone else in their stead, puts a real dent in randomness.
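
To put a rough number on that last point, here is a minimal sketch of non-response bias. Everything in it is made up for illustration: the population size, the true 40% share, and the response rates (90% for people who prefer option A, 50% for everyone else) are assumptions, and `simulate_poll` is a hypothetical helper, not anything BZ Research used.

```python
import random


def simulate_poll(pop_size=100_000, sample_size=6_000, true_share=0.40,
                  respond_if_yes=0.9, respond_if_no=0.5, seed=0):
    """Compare a full random sample with the subset that actually responds.

    All parameters are illustrative assumptions, not real survey data.
    """
    rng = random.Random(seed)

    # Hypothetical population: True means "prefers option A".
    n_yes = int(pop_size * true_share)
    population = [True] * n_yes + [False] * (pop_size - n_yes)

    # Step 1: a genuinely random sample of the population.
    sample = rng.sample(population, sample_size)
    full_estimate = sum(sample) / sample_size

    # Step 2: only some of the sample answers, and the chance of
    # answering depends on the very opinion being measured.
    respondents = [
        prefers_a for prefers_a in sample
        if rng.random() < (respond_if_yes if prefers_a else respond_if_no)
    ]
    responded_estimate = sum(respondents) / len(respondents)

    return full_estimate, responded_estimate


if __name__ == "__main__":
    full, responded = simulate_poll()
    print(f"estimate from the full random sample: {full:.3f}")
    print(f"estimate from respondents only:       {responded:.3f}")
```

With these assumed rates the full sample lands close to the true 40%, while the respondents-only figure drifts well above 50%, even though the sample itself was drawn perfectly at random.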

So, just as you understand ``surf over here and answer some questions'', or ``dial in to tell whether you prefer Princess Di or Camilla'' polls to be nothing more than a form of entertainment, any poll like BZ Research's has to be taken with many grains of salt.

The whole thing is dubious without a clear description of all the above criteria, analyzed by a knowledgeable, disinterested observer. Look at research reports in Science or Nature to see the sort of detail I mean. I'd bet a candy bar that the ``2.5 percentage points'' is nothing more than the number they looked up in a table for a sample size of 6k.
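
For reference, the kind of table I have in mind is usually built from the textbook margin-of-error formula for a proportion. A minimal sketch, assuming a simple random sample, a 95% confidence level (z = 1.96), and the worst-case proportion of 0.5; the function name is mine, and I have no idea which assumptions BZ Research actually used.

```python
import math


def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Textbook margin of error for a proportion.

    Assumes a simple random sample; z = 1.96 corresponds to a 95%
    confidence level, and proportion = 0.5 is the worst case.
    """
    return z * math.sqrt(proportion * (1.0 - proportion) / sample_size)


if __name__ == "__main__":
    moe = margin_of_error(6000)
    print(f"n = 6000: +/- {moe * 100:.2f} percentage points")
```

Under those assumptions a sample of 6,000 works out to roughly 1.3 percentage points. Either way, a figure like this only covers the luck of the draw in a truly random sample; it says nothing about who was in the population, how they were reached, or who declined to answer.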

And now, for some entirely different bias, look no further than the polls on the nightly news. They tend to be self-fulfilling prophecies: ``Well, if everyone feels like that, why should I bother to vote / call my Senator / complain to the Planning & Zoning Board?'' ``Hmmm, if no one's using Linux, I should hold off.''

I hope I've loosened your faith in polls somewhat. :-/


Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds