LWN.net Weekly Edition for February 4, 2010

Mozilla and CNNIC

By Jake Edge
February 3, 2010

Adding a new Certificate Authority (CA) to a browser's list of accepted CAs is typically a quiet affair; the browser team vets the CA based on their criteria and adds those who pass the test. For Mozilla, the criteria and vetting process are not private, but the process generally happens behind the scenes. Users find out that new CAs have been added by looking at the CA store after a browser upgrade, though it is likely a very rare user that actually looks. When Mozilla followed its policies and added the China Internet Network Information Center (CNNIC) CA, things took a very different path—a firestorm of protest resulted.

CAs are the issuing authority for Secure Sockets Layer (SSL) certificates that are used to authenticate encrypted HTTP (i.e. HTTPS) sessions. A CA that has been accepted into a browser's "root store" can then sign SSL certificates for domains and those certificates will be accepted as valid by the browser. Much like self-signed certificates, SSL certificates that are signed by a CA that is not in the root store will cause the browser to emit scary security warnings.

As seen in the Mozilla bugzilla entry, Liu Yan of CNNIC requested addition to the root store in February 2009. Public discussion was opened on October 13. There were some technical concerns discussed, which CNNIC fixed, and the discussion closed on October 22. A bug was filed to actually get CNNIC's root certificate added to the root store (which is in the separate Network Security Services component). That bug was closed in mid-December once CNNIC verified that the proper certificate was added.

That is presumably how most new CAs get added: a somewhat bureaucratic process is followed, the certificate gets added, and everyone goes on their merry way. For CNNIC, though, things went a little differently. With at least some folks in the Chinese IT world, CNNIC has a terrible reputation. Starting on January 27, they were not shy about giving their opinion of CNNIC, and of Mozilla's decision to include it, on the original bug report and in a thread in the mozilla.dev.security.policy group.

The main complaints seem to stem from the accusation that CNNIC has been involved in distributing malware/spyware that is used by the Chinese government to monitor its citizens. It is also alleged to be involved with China's "Great Firewall" that censors specific web sites when accessed from China. In addition, Liu asserted that CNNIC is "not a Chinese Government organization" as part of the application process, but various commenters dispute that.

There are some 60 comments on the bug, along with more than 100 messages in the thread, many of them very passionate and/or heated requests to remove CNNIC. It is perfectly understandable that Chinese people are concerned about the possibility of government action against them because of what they might say on the internet. But it is not clear that adding CNNIC as a CA has any bearing on that. Certainly CNNIC (or any CA) could abuse its position and issue SSL certificates for domains that it shouldn't, but, if it does, that act will provide clear evidence of wrongdoing.

In order for an SSL certificate to be accepted, it must be sent to the browser. Anyone visiting gmail.com, for example, and getting a certificate signed by anyone other than Thawte (the CA that signed Gmail's certificate), has proof of malfeasance. If CNNIC is abusing its position, it should be relatively easy to prove. As Mozilla's Johnathan Nightingale puts it:

What I have asked for here, and am asking for again, is specific, concrete evidence that this CA has acted in a way that contravenes our root policy. An illegitimate certificate would be the single, best example of such evidence.
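The kind of check that would turn up such a certificate can be sketched in a few lines of Python. This is only an illustration: the dictionary layout mirrors what the standard library's ssl.SSLSocket.getpeercert() returns, while the EXPECTED_ISSUERS table and the function names are hypothetical and chosen for this example.

```python
# Sketch: flag a certificate whose issuer differs from the one on record.
# The cert layout follows ssl.SSLSocket.getpeercert(); the table of
# expected issuers is a hypothetical, hand-maintained mapping.

EXPECTED_ISSUERS = {"gmail.com": "Thawte"}  # illustrative entry only


def issuer_organization(cert):
    """Return the issuer's organizationName from a getpeercert()-style dict."""
    for rdn in cert.get("issuer", ()):
        for key, value in rdn:
            if key == "organizationName":
                return value
    return None


def issuer_is_expected(host, cert, expected=EXPECTED_ISSUERS):
    """True only if the issuer organization matches the entry for this host."""
    wanted = expected.get(host)
    org = issuer_organization(cert)
    if wanted is None or org is None:
        return False
    return wanted.lower() in org.lower()
```

A certificate for gmail.com that fails this check would not prove anything on its own (sites do change CAs), but saving it would give exactly the kind of artifact Nightingale is asking for.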

To many of the commenters, though, there is abundant proof of CNNIC's involvement with malware, and that, along with its "lies" about its governmental status, should be enough, in their eyes, to remove CNNIC as a CA in Mozilla browsers. But being affiliated with a government is not a reason that Mozilla would reject a CA (there are several others already in the root store, for Japan, Taiwan, and others). It also isn't clear that distributing malware, separate from its CA activities, would be enough to remove a CA from the root store.

Other CAs have misbehaved along the way. Verisign's poorly-named Site Finder scheme redirected DNS queries in violation of the RFC, and in ways that were roundly criticized. But that action was separate from its CA business and there were no calls to remove it from any browser's root store. While Site Finder is a relatively minor transgression compared to the accusations leveled against CNNIC, it is difficult to punish organizations in a particular realm except based on their behavior within that realm. Thus the calls for evidence of CA abuse.

It is quite possible that an outcry back in October, as part of the public comment period, might have slowed or stopped the inclusion of CNNIC. But that didn't happen; CNNIC complied with the policy and was added. So, the question now is "whether we should review" that decision, Nightingale said. In order to do that, some evidence needs to be presented, he suggested:

It feels to me like that makes our next step clear, here. It won't help to tally up the complainants (there will be many), and it won't help to demand assurances from CNNIC (since the alleged governmental pressure would trump those anyhow). It certainly won't help to cite wikipedia.

If there's truth to the allegation, here, then it should be possible to produce a cert. It should be possible to produce a certificate, signed by CNNIC, which impersonates a site known to have some other issuer. A live MitM attack, a paypal cert issued by CNNIC for example.

Mozilla's Kathleen Wilson announced the creation of a draft policy for changing a root certificate that has been added to the root store. This would provide a means for handling just this kind of dispute. Eddy Nigg of Startcom, who is part of the team that reviews root inclusion requests, has specifically asked Wilson to start a review of CNNIC.

In the meantime, though, there are several technical measures that users can take to protect themselves. To start with, in "Edit -> Preferences -> Advanced -> Encryption" in Firefox, one can remove particular CAs from the root store. There are also two different Firefox addons that could help. Certificate Patrol permanently stores each SSL certificate that the browser encounters, and alerts the user when one changes. Perspectives instead uses "network notaries" that store certificates for particular hosts and can help users decide whether a self-signed or other certificate is valid.
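The idea behind an add-on like Certificate Patrol, remembering what was seen before and complaining when it changes, can be illustrated with a small Python sketch. It is not how the add-on itself is implemented: the class and method names are hypothetical, the store is kept in memory rather than on disk, and the certificate bytes are assumed to be the DER encoding (what ssl.SSLSocket.getpeercert(binary_form=True) returns).

```python
import hashlib


def fingerprint(der_bytes):
    """SHA-256 fingerprint of a DER-encoded certificate, as a hex string."""
    return hashlib.sha256(der_bytes).hexdigest()


class CertificateLog:
    """Hypothetical in-memory record of the first certificate seen per host."""

    def __init__(self):
        self._seen = {}

    def check(self, host, der_bytes):
        """Return 'new', 'unchanged', or 'changed' for this host's certificate."""
        current = fingerprint(der_bytes)
        previous = self._seen.get(host)
        if previous is None:
            self._seen[host] = current  # first sighting: remember it
            return "new"
        # The stored fingerprint is deliberately left alone on a mismatch,
        # so the warning repeats until the user decides what to trust.
        return "unchanged" if previous == current else "changed"
```

A "changed" result is only a prompt for the user to look closer, since certificates are also replaced for legitimate reasons such as expiry.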

It is instructive to take a look at the long list of CAs that are installed with Firefox. Many are for high-profile companies, but there are quite a few for seemingly obscure organizations. There are certainly enough different CAs that a government—or criminal organization—that wished to apply some pressure could get its hands on a forged SSL certificate. In truth, the pressure only need be applied to an employee who has access to the signing key. That risk exists whether or not CNNIC, or any other particular CA, is on the list.

It is certainly unfortunate that the accusations against CNNIC only surfaced after the inclusion process had already been completed. Depending on what evidence is compiled, Mozilla is likely to have a difficult decision to make. But the controversy, along with other recent security concerns that may involve the Chinese government, is likely to further raise the profile of internet censorship. It is something that many governments like to condemn on one hand and implement with the other—the only defense against it is keeping it in the public eye.


HTML5 video element codec debate reignited

February 3, 2010

This article was contributed by Nathan Willis

On January 20, YouTube publicly unveiled a video player that allows site visitors to watch videos embedded directly into each page as HTML 5 video elements, replacing the plugin-based Flash player; second-tier video sharing site Vimeo quickly followed suit. But both sites serve up HTML 5 video files only in the patented and royalty-bearing H.264 format. By sheer coincidence, the announcement neatly overlapped with the release of Firefox 3.6, which lacks H.264 support, and was followed days later by Apple's press event showcasing its iPad gadget, which lacks Flash support. What followed was a furious multi-way debate all about Flash, licensing, web video, and H.264 versus Ogg Theora. For the open source community, there is nothing to celebrate yet, but the high profile of the argument has opened the door for discussion of the real underlying issue: patented web standards.

Rewind

The root of the entire controversy is HTML 5's video element, which allows a web developer to include video content in a web page in any file format, obviating the need to wrap such content in a Flash player useful only because of the Flash plugin's ubiquity. But it is up to the browser to include support for the formats it chooses in its built-in video player. The HTML 5 standard does not mandate that support be included for any particular format in order to qualify as compliant, however, so a public war is underway between format proponents for de-facto dominance.

On one side is the ISO Moving Picture Experts Group (MPEG), pushing for adoption of its H.264 format. The H.264 codec is part of the broader MPEG-4 family, is patented, and all parties wishing to include support for it are required to pay licensing fees to the patent holders through a consortium called the MPEG-LA — the licensing requirement applies to encoders and decoders, hardware and software, and includes both original manufacturers and downstream redistributors.

Many on the other side are supporters of the free Theora format, which requires no royalties to implement in hardware or in software, thanks to irrevocable free licenses on the original patents granted by its original creator. The reference encoder and decoder are developed by Xiph.org and are available under a BSD-style license.

Theora proponents emphasize the need for HTML 5 to include a free-to-implement format, insulating the next decade of web development from the nightmare caused by the GIF patent enforcement debacle. H.264 supporters claim that Theora's quality-per-bitrate performance is behind H.264's, and that some unknown third-party might hold secret patents on one or more techniques used in Theora, and subsequently sue implementers for patent infringement if the format is made part of the standard (the so-called "submarine" patent threat).

The major web browsers are divided on format support. Apple's Safari ships with H.264 support only; Google's Chrome supports both H.264 and Theora; Firefox and Opera support only Theora. Microsoft's Internet Explorer does not support HTML 5 video at all. Confusing the mix slightly is the fact that both Safari and Chrome implement H.264 playback because their parent companies pay licensing fees to MPEG-LA; consequently the open source browser projects WebKit and Chromium do not support H.264, because the license fees paid do not cover these downstream derivatives.

Players

That, then, was the situation when YouTube and Vimeo announced their H.264 HTML 5 video player support. What should have been a red-letter day for open web standards instead resulted in complaints to Mozilla from users (and pundits) that Firefox 3.6 "did not support HTML 5." In fact, Firefox has supported HTML 5 video since version 3.5, but it does not include an H.264 decoder.

Video expert Silvia Pfeiffer traced the problem back to numbers. According to Statcounter's market share statistics, Firefox accounts for 22.57% of the browsers in the world, with Chrome and Safari totaling 8.53%. Thus, of all the HTML 5-capable browsers in the field, Firefox makes up nearly 73% — and that 73% could not watch any of the YouTube or Vimeo video. It should be no surprise, then, that some of those users complained.
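The 73% figure follows directly from the two market-share numbers quoted above. A short Python sketch of the arithmetic, assuming (as the quoted figures do) that Firefox, Chrome, and Safari are the only HTML 5-capable browsers being counted; the function name is made up for this example:

```python
def share_of_capable(firefox, others):
    """Percentage of HTML 5-capable browsers that are Firefox."""
    return 100 * firefox / (firefox + others)


# Statcounter figures quoted in the article, in percent of all browsers.
firefox_share = share_of_capable(22.57, 8.53)
print(f"{firefox_share:.1f}%")  # prints 72.6%
```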

Mozilla's Christopher Blizzard responded to the news with a detailed analysis of the H.264 ownership and patent problem. The situation is precisely the same as the GIF disaster of a decade earlier, and as the MP3 situation from the early 2000s, but with considerably higher stakes. H.264 is patented, pure and simple, and the patent owners charge royalties today and will continue to do so until their patents expire. If H.264 becomes a de facto standard, the patent owners will have the freedom to hike the price of licenses, and they will no doubt do so.

Blizzard goes on to examine the terms of H.264 licensing and its effects on corporate and independent producers of web content. To include an H.264 decoder in Firefox, Mozilla would have to pay a license fee (perhaps $5 million per year), but such a move would also undermine Mozilla's founding principles of supporting and promoting free formats and standards.

Flash, we hardly knew ye

The other big news from the last week of January was Apple's iPad launch party. The iPad, like its diminutive siblings the iPhone and iPod Touch, uses a Safari-based web browser, and includes Apple's licensed H.264 decoder for HTML 5 video. But also like the smaller devices, the iPad does not include Flash support.

Coming from Apple, that decision was hailed by some in the media as a death knell for Flash. Once the preferred format for incorporating animation and interactive page elements into web content, in recent years its usage has shrunk to the point where it is used almost exclusively as a platform to deliver online video (and for irritating advertising, of course, although strictly speaking that would not be considered "content" by most).

No one seems to lament the possibility of Flash's demise. Apple has suggested that Flash is the cause of most of the Safari crashes reported through its OS X Crash Reporter utility. Mozilla said in October of 2009 that third-party plugins cause at least 30% of all Firefox crashes, a statistic supported by the popularity of Flash-blocking add-ons.

Apple's Steve Jobs even went so far as to publicly call Flash too buggy for use in a town hall meeting last week, declaring HTML 5 the way of the future.

What's a site owner to do?

Flash may indeed have no fans remaining outside of Adobe, a fact that magnifies the importance of the HTML 5 video codec battle. The plugin has survived as long as it has for one reason alone: its availability on almost every browser on almost every operating system. Long after AJAX became popular for interactive content functionality, a web developer could implement video playback in a Flash element and feel secure that it would work on virtually every browser that would encounter it.

The same cannot be said of HTML 5 video, and certainly not of HTML 5 video with H.264 content. If Theora becomes the dominant format (or officially sanctioned in the HTML 5 specification), it will be possible again, but that is simply not true of H.264. Both encoders and decoders require licensing, a fact often overlooked in the debate about browser support, but one which Blizzard addresses in his blog entry. Anyone can set up a site delivering CSS, HTML, and even Theora using free, legal tools, and without asking or signing for permission; H.264 would change that.

The only question is whether or not the web development community will recognize that and rally behind Theora or another free alternative. The H.264 patent owners' attacks on Theora are not substantive; the quality comparison is highly subjective (and, in fact, comparing video encoding quality is inherently subjective), and as Xiph.org points out, submarine patents are an equal threat to free and non-free codecs alike. The original patents on Theora technology are known and licensed freely; if a patent owner possessed sufficient evidence to kneecap Theora with an infringement lawsuit regarding other patents, it surely would have happened already.

Moreover, the HTML 5 video element includes support for multiple source files, so content providers can offer each video in multiple formats; the fight is only the H.264 patent holders trying to prevent a rival format from being blessed as part of the standard. Those patent holders would take the same tactics with any other video format.
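What offering a video in multiple formats looks like in markup can be shown with a small Python helper that emits a video element with one source child per format; the browser plays the first source it can decode. The helper name, file names, and fallback text are hypothetical, and the sketch only generates the markup without checking that the files exist or are encoded as claimed.

```python
from html import escape


def video_element(sources, fallback="Your browser does not support HTML 5 video."):
    """Build a <video> element with one <source> child per (url, mime_type)."""
    lines = ["<video controls>"]
    for url, mime_type in sources:
        lines.append(
            f'  <source src="{escape(url, quote=True)}" '
            f'type="{escape(mime_type, quote=True)}">'
        )
    lines.append(f"  {escape(fallback)}")  # shown by browsers without <video>
    lines.append("</video>")
    return "\n".join(lines)


# Hypothetical file names: Theora listed first, H.264 second.
print(video_element([("clip.ogv", "video/ogg"), ("clip.mp4", "video/mp4")]))
```

Listing the Theora file first means a browser that can play both will pick the free format.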

Some critics have suggested that another free video codec is needed, and Theora is certainly not the only option. Sun has been developing its own patent-avoiding video codec through the Open Media Commons project for several years, although the project is rather quiet. Blizzard suggests that Google may have a video patent play of its own in mind with its recent attempts to acquire On2, the company that developed the VP3 codec from which Theora descended. Dan Glidden, formerly of the Open Media Commons project, is a proponent of the MPEG-RF movement to change MPEG policy to establish a royalty-free option as a "baseline" codec for MPEG-4.

The debate is far from over. YouTube and Vimeo may have changed one aspect of it, however — unlike in years past when the fight took place almost entirely within World Wide Web Consortium working groups, this time it is being fought in public. Consequently, more people are getting a look at what HTML 5 video is in practice, and can better understand the difference between the HTML element and video format delivered, which can only be a good thing.

In the meantime, small web developers who want to serve up HTML 5 video content still have choices. The simplest option is to include multiple video source files, but a better alternative is to use the Cortado applet from Xiph.org, a streaming media Java applet that decodes Theora. It is open source, works transparently on any platform that includes Java support, and does not require encoding multiple source files, so there is no need to spread unnecessary H.264 content inadvertently. But no one should hold their breath waiting for YouTube to implement it, of course.


Samba with Active Directory: getting closer

February 3, 2010

This article was contributed by Don Marti

From one point of view, Samba is open source high drama at its finest: an early adopter of version 3 of the GNU General Public License, and the recipient of an unprecedented release of formerly proprietary Microsoft documentation, thanks to a high-profile anti-trust case. Meanwhile, though, it's the low-profile software that implements the Server Message Block (SMB) file-sharing protocol, sometimes known as CIFS. Samba powers every inexpensive NAS device in the computer store—without even a mention on the box—and comes with all the common Linux distributions and with Apple's Mac OS X Server. Today, as Samba comes closer to implementing a key Microsoft directory protocol, the two aspects are being forced together.

Samba creator Andrew Tridgell, better known as Tridge, posted to his blog, "There has been a lot of progress recently in the development of the directory server capabilities of Samba4." In a half-hour screencast video, he demonstrated a development version of Samba acting as a Microsoft Active Directory domain controller in a mixed environment. "We are making very rapid progress now", he added.

Active Directory (AD) is a central repository for all the administrative information that a modern Microsoft Windows site needs. Besides user names and passwords, AD functions as a DNS server, stores network configuration policy such as firewall rules, and acts as a back-end for applications' configuration. Microsoft Exchange, for example, is completely dependent on it.

AD is made up of "domains" which are data structures that contain groups of objects, which might represent everything from an individual printer to the entire company sales force. Domains can then be collected up into "forests". A company might have many AD domains within its forest, and everything in the forest can be managed by the same administrators. Because AD is such a critical service, Windows sites typically install multiple AD servers, which replicate their data using a formerly secret protocol.

The Samba team received Active Directory documentation, including the server-to-server protocol, as part of an agreement made in response to a European Commission antitrust case in 2007. The documents have helped the project, Tridge said:

Stefan Metzmacher had managed to decode some very important parts of the protocol as part of his thesis work, but we were still missing some key parts of the puzzle. The documentation from Microsoft filled in many of these key elements, and perhaps more importantly, Microsoft has been very willing to engage with us to fill in any gaps that we find, including working directly with traces of Samba talking to Windows domain controllers to enable us to debug our implementation.

The documentation effort was a huge undertaking on the Microsoft side. Tridge described it this way:

I think it is fair to say that the WSPP/MCPP documentation effort is one of the largest efforts in IT history to document a set of network protocols. The sheer scale of the effort means that there are inevitably errors and omissions. We have been pleased at how Microsoft has responded to our reports of these errors by providing us with additional documentation where needed.

In the video, Tridge demonstrates provisioning an Active Directory domain on a Samba server, running a development version of Samba from shortly before Samba 4 alpha 11. Once the Samba server is running, he then starts a copy of Microsoft Windows Server 2008R2 Standard as a guest under VirtualBox, and runs the Windows "dcpromo" command to have it join the domain as a domain controller.

A few clicks and entries in the "Active Directory Domain Services Installation Wizard" later, the Windows box is ready to reboot and come up as part of the domain originally created on Samba. It takes about 30 seconds to synchronize key information for the newly-created domain. This step might take hours on a larger, longer-running domain.

Samba 4 has a few limitations, compared to a Windows AD server. There is only one domain per forest, and only one site per domain, but Tridge says that removing those limitations is a near-future priority. Windows administrators, like sysadmins everywhere, fall all over the "lumpers" vs. "splitters" spectrum, and anyone but extreme lumpers with simple configurations will need the ability to define separate domains, for departments and roles, and separate sites, for physical locations.

The remaining manual step is to add the Windows domain controller to the DNS zonefile on the DNS server. Microsoft's Active Directory handles DNS duties itself, while Samba relies on the system nameserver. A change to a Samba AD domain requires a corresponding change to a zonefile on the nameserver. "What we don't yet support in Samba 4 is the ability to create arbitrary DNS names within a Bind9 server using Kerberos authenticated DNS requests," he said. "Microsoft stores DNS within Active Directory. We can't join a Windows domain controller as a new DNS server, so have to rely on the Unix machines to provide DNS," he added. After recording the screencast, Tridge did write a script to automate the needed zonefile changes, he said.
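To give a feel for what automating that step involves, here is a minimal Python sketch that formats a few zonefile lines for a new domain controller. It is not Tridge's script, which is not described here, and it makes several assumptions: the host name, domain, and address are placeholders, the function name is made up, and only an address record plus LDAP and Kerberos SRV records (in the standard SRV layout of priority, weight, port, target) are generated. A real Active Directory domain expects more records than these.

```python
def dc_zone_records(hostname, domain, address, ttl=900):
    """Format illustrative zonefile lines for one domain controller.

    Assumption: only A, LDAP SRV, and Kerberos SRV records are produced;
    this is not the full set a working AD domain would need.
    """
    fqdn = f"{hostname}.{domain}."
    return [
        f"{fqdn} {ttl} IN A {address}",
        # SRV fields: priority weight port target
        f"_ldap._tcp.{domain}. {ttl} IN SRV 0 100 389 {fqdn}",
        f"_kerberos._tcp.{domain}. {ttl} IN SRV 0 100 88 {fqdn}",
    ]


# Placeholder values from the documentation address and name ranges.
for line in dc_zone_records("win2008", "example.com", "192.0.2.10"):
    print(line)
```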

Tridge's screencast shows the Windows box successfully syncing with the Samba server, and a user added on the Windows side shows up quickly in a search of the Samba server. Samba 4 is also able to join an existing AD domain. A tool called "vampire" is the Samba-side equivalent of the "dcpromo" command on Windows. Tridge demonstrated using it to add a second Samba server to the domain, ending up with a domain with two Samba servers and one Windows server. This ability means that an administrator could soon add a Samba appliance to an existing AD network, reducing the number of actual Windows servers needed.

Integration and the "Franky" concept

Samba 4 is an ambitious rewrite, which has been in progress since 2003. Meanwhile, Samba 3 has been through many releases with incremental improvements, and currently works well as a member, but not a domain controller, of an Active Directory domain. Samba 3 is "closer and closer to Windows compatibility in timestamps and Windows ACLs. It's harder and harder to tell us from a Windows box," Samba team member Jeremy Allison said. Thanks to extensive usage and bug reports, Samba 3 has gained the ability to handle real-world client quirks, while Samba 4 has focused on the big AD problem but not faced the day-to-day beatings of production use.

Tridge said that in addition to remaining AD work, "we also need to find out exactly how we will achieve our stated goal of re-integrating the great file sharing and printing work that has been done in the Samba3 branch with all of the work on Active Directory server support in Samba4."

Samba developers have been discussing ideas for combining the new functionality in Samba 4 with the existing Samba 3 code. One design for a combined project, called "Franky," short for "Frankenstein," would run Samba 3, listening on the SMB ports (139 and 445), along with Samba 4 listening on the ports required for AD support. Another alternative would be to run Samba 3, but pass AD-related requests through to Samba 4. "Obviously this will require quite a lot of merge work, but we believe this may be possible to achieve in 2010", Jeremy said on the Samba team blog.

Tridge said:

We need to have a single common file server component and printing component again. The strain on the team of having two implementations of the file serving component is too great. One way of achieving that is via something like the 'Franky' approach, but that has a significant downside of making deployment and administration of Samba more difficult. We need to put more thought into how we can make it easy for administrators, while also offering the best set of features from both branches.

"I'm expecting a fairly heated discussion at SambaXP this year," said John Terpstra, Samba team member and chief software architect of ClearCenter, which produces a web-administered distribution for small and medium businesses. The SambaXP conference is scheduled for May 3rd - 7th, 2010 in Göttingen, Germany.

Licensing and downstream

Samba with Active Directory is still not on downstream roadmaps. Simo Sorce, Principal Software Engineer at Red Hat, who maintains Samba packages for Fedora, said that the project is looking at including Samba 3.5.0 in Fedora 13, if it's ready in time. But AD is still in the future. For future releases, "We will wait until the solution is stable enough that upgrades won't mean your server has a good chance of breaking," he said.

ClearCenter's ClearOS combines network gateway with VPN, web and mail filtering, Samba file server, Kolab groupware, and web-based administration tools into a package designed for resellers to deploy at small businesses and branch offices. Samba is a key part of the company's product, which competes with Microsoft Small Business Server but with a monthly subscription bill instead of an up-front license price. ClearOS is based on CentOS, a rebuild of Red Hat Enterprise Linux, but includes Samba 3.4 in place of CentOS's 3.0 package. "ClearOS 6 is going to ship pretty quickly after Samba 4 ships," John said.

Samba adopted version 3 of the GPL in 2007. One effect of the new license was to prohibit downstream Samba resellers from entering into new patent license agreements covering Samba, like the controversial Novell-Microsoft patent deal of 2006. Samba's license change doesn't affect Novell, whose contract predates the GPLv3 cutoff date, but according to the Samba web site, "Patent covenant deals done after 28 March 2007 are explicitly incompatible with the license if they are 'discriminatory' under section 11 of the GPLv3."

No GPLv2 fork has emerged, and, Jeremy says, the license change "has essentially been a complete non-issue". Downstream vendors ship Samba on everything from tiny NAS devices that connect to a USB drive, up to IBM's Scale Out File Services, which runs clustered Samba on top of IBM's proprietary General Parallel File System (GPFS). "What Samba does is it turns the CIFS server into a commodity, allowing people to compete on back-end scaled clustered filesystems," Jeremy said.

All of the Samba code is under individual copyrights, without assignment. "It's completely impossible to be bought out," Jeremy said. "No one can get any advantage over anyone else in the Samba code." As part of the agreement with Microsoft, the company must disclose any of its patents that it believes are necessary to implement its protocols, and it has not added any to its list since reaching the agreement. Microsoft has been "very cautious about breaking compatibility," Jeremy said. "With Windows 7, Microsoft made sure that it would work with a Samba 3 domain controller." Microsoft ended support for Windows NT 4, the last of its OS products to implement the old NT Directory Services system, at the end of 2004, and Windows 7 does not work with an NT4 domain controller, he added.

Help wanted

As you might expect, the Samba team is looking for help. Tridge invites new contributors: "Join the #samba-technical IRC channel (on the FreeNode network, irc.freenode.net), join the samba-technical mailing list, and get involved with the development process. Point out what the priorities are for Samba4 before you would consider deploying it, and help us to prioritize our development to meet your needs."

Jeremy asks would-be redistributors and SMB appliance vendors to work on functionality they anticipate needing. "If you're planning on a product within the next 18 months, the earlier you get involved the more chance you get to steer it to do the things you need to do," he said. "If you need Samba to interface with a particular filesystem, give us a VFS module that will let us do that," Jeremy said. Contributions to Samba itself have to be licensed under the GPLv3, but the team does want to be able to run Samba on the user's choice of clustered filesystem.

Then, as Jeremy posted, "Once we have a merged code-base, we'll declare victory, ship Samba4 and have the biggest darn release party since Duke Nukem Forever shipped and revolutionized computer gaming ! :-)." Samba 3 has served well as an essential file server, and Samba 4 has broken new ground in Microsoft protocol discovery, but eventually, one way or another, there will be one Samba again.


Gathering web site statistics with Piwik

February 3, 2010

This article was contributed by Joe 'Zonker' Brockmeier.

Many sites these days depend on Google Analytics to measure traffic, but there's something to be said for keeping control of one's data. Piwik bills itself as an open source alternative to Google Analytics, but does it actually measure up? Piwik isn't quite a full-on replacement for Google Analytics, but it's mature and complete enough for many users.

Piwik is the successor to phpMyVisites. It lacks a few features that were in phpMyVisites, such as PDF export and mail reports, but adds a plugin architecture, a better API, a cleaner user interface, and better performance and scalability.

We looked at the current stable release, Piwik 0.5.4. Piwik is very simple to set up for anyone used to installing Web applications. It requires MySQL 4.1 or later, PHP 5.1.3 or later, the pdo and pdo_mysql PHP extensions, and the PHP GD extension to get the "sparkline" graphs in Piwik. Part of the install process is a system check that shows the system requirements and what, if anything, is missing. On the test server running WordPress, the GD extension was the only bit that wasn't already present. Assuming the requirements are met, it's a simple process of navigating to the URL where Piwik is installed, filling in a few bits of info, and clicking "next" a few times. In all, it shouldn't take more than five to ten minutes to install.

The slightly harder piece is integrating Piwik with the site. Piwik depends on a piece of JavaScript code that runs on each page to be counted. Some popular blogging software and content management systems have plugins to work with Piwik, so it's not necessary to insert the code into site templates manually. We used the Piwik Analytics plugin to integrate it with WordPress. Once Piwik is installed and configured, results are visible almost immediately.

Because Piwik depends on JavaScript to track visitors, it will miss some percentage of traffic, depending on how many users hit the site with JavaScript turned off. It won't track visitors who get site information via RSS/Atom feeds, and it will also miss some file downloads. Piwik tracks clicks on certain URLs that end with recognized filetypes, but if someone follows a link to, say, a PDF hosted on the site without visiting a page that carries the Piwik tracking script, that download will be missed.
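
As a rough illustration of the extension-based matching described above, here is a minimal Python sketch. It is not Piwik's code (Piwik's tracker is JavaScript); the extension list and the function name `is_download_link` are assumptions made up for this example.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical set of "recognized" extensions; Piwik's actual list may differ.
DOWNLOAD_EXTENSIONS = {"pdf", "zip", "tar", "gz", "mp3", "doc"}


def is_download_link(url):
    """Return True if the URL's path ends in a recognized file extension."""
    path = urlparse(url).path  # drops the query string and fragment
    extension = posixpath.splitext(path)[1].lstrip(".").lower()
    return extension in DOWNLOAD_EXTENSIONS


print(is_download_link("http://example.com/files/report.PDF?x=1"))
print(is_download_link("http://example.com/about.html"))
```

A check like this can only run when a visitor clicks a link on a page that loads the tracking script, which is why a direct request for the PDF never reaches the tracker.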

The Piwik interface is easy to use and provides quite a bit of flexibility. Users can customize the main dashboard by adding an assortment of widgets that track visitor actions (like what links are clicked), referrers, or visitor settings (resolution, browser, etc.). The widgets themselves can display data as bar charts, sparklines, pie charts, or just raw numbers. Data can also be exported from each widget as an image of the graph or as CSV, JSON, or PHP.

Some users don't like Google Analytics because of the site's dependence on Flash. The good news is that Piwik requires far less use of Flash than Google Analytics, and many of the widgets have table displays that don't require it at all. But if you want the pretty graphs, Flash is required.

While Piwik has the advantage of putting web site owners in control of their own data, it has the disadvantage of putting additional load on the server. For low traffic sites, this probably won't be an issue. The test system we tried Piwik on had no problems with the additional load from Piwik, but the site typically had fewer than 1,000 page views per day (at least according to Piwik). Note that it's not necessary to run Piwik on the same server as the tracked sites.

Comparing Piwik directly to Google Analytics is something of an apples-to-oranges exercise. Both tools give a good sense of traffic on a web site, and they mostly agree on traffic numbers, though as a rule Google seems to track about six or seven percent fewer visits than Piwik. By default, Piwik doesn't (yet) have an option to discard visits from the admin users, but the WordPress plugin does provide this option, so it's not clear what traffic Google is missing or discounting that Piwik does count. Both trackers show visitor breakdowns by browser, region, operating system, resolution, and more.

Though Piwik provides webmasters with control of their own data, visitors might be uneasy if they were aware of how much data Piwik harvests about them. The visitor log report displays the visitor's IP address, the keyword used to find the site, the date and time of the visit, the referring URL, the duration of the visit, the operating system, browser, screen resolution, and the browser plugins detected.

Piwik does a respectable job identifying keywords that lead visitors to a site, the pages that are most popular, returning visitors, time spent on site, and so forth. For amateur Webmasters who just want to see how their site performs, Piwik gives all the tools that one might want. Depending on how demanding the business needs are, Piwik should be suitable for Webmasters who need a general sense of site traffic and performance. For users who specifically need to focus on site performance as a major business goal, Piwik might not be enough.

Hands down, Google does a much better job of showing geographic data than Piwik. Users who are curious as to the exact location of their traffic will want to use Google Analytics. It's possible to drill down all the way to the city level in some cases. Piwik, by contrast, shows visitors by country and provider, and that's about it. Users who want to know whether traffic is coming from Nuremberg or Frankfurt, or Los Angeles or New York, need to use Google Analytics, try out one of the third-party plugins (which require a fair amount of configuration), or write their own.

A full list of plugins is available on the Piwik Developer Zone page, though the list is simply a Trac search. One might find some interesting plugins, but it will take some digging.

Google Analytics also has more features for Webmasters trying to improve site traffic and compete with other sites. For instance, if one chooses to opt in to data sharing, Google will compare a site's traffic with aggregate data from other sites that share their data. Of course, Google already has the data, but this feature requires an extra step to allow it to be aggregated. This allows a Webmaster to track site performance against all aggregate traffic, or specific industry verticals. For example, it was possible to compare the test site traffic against other open source sites that are tracked by Google Analytics.

While Google may have features that Piwik doesn't (and vice-versa), Google Analytics is less friendly to the do-it-yourself approach. Piwik features a plugin architecture that allows developers to create their own features, and most of Piwik's own features are enabled via plugins. The plugin interface could do a better job of pointing users to more information. Each plugin is listed with a short description, version number, and links to activate or deactivate the plugin, but in most cases there is no link to further information about it. The "Live Visitors!" plugin, for example, is particularly unhelpful with only "Live Visitors!" as a description.

The Piwik roadmap indicates that 1.0 should be released sometime in 2010. Features planned for 1.0 include the ability to anonymize IPs stored in the Piwik database and to export widgets that display limited data rather than all Website data, along with improved performance and scaling and better documentation.
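
The roadmap does not say how IP anonymization would work. One common approach is to zero the low-order bits of each address before storing it; the Python sketch below shows that idea under stated assumptions. The function name `anonymize_ip` and the prefix lengths (24 bits for IPv4, 48 for IPv6) are illustrative choices, not Piwik's design.

```python
import ipaddress


def anonymize_ip(address, v4_prefix=24, v6_prefix=48):
    """Zero the host bits of an address, keeping only the leading prefix.

    The prefix lengths are assumptions for illustration only.
    """
    ip = ipaddress.ip_address(address)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    # strict=False allows host bits to be set in the input address.
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


print(anonymize_ip("192.0.2.123"))
print(anonymize_ip("2001:db8:1234:5678::1"))
```

Masking keeps enough of the address for country or provider reports while no longer pointing at a single host.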

But what won't be in Piwik is just as telling. The roadmap warns that the Piwik team doesn't plan to provide "advanced web analytics features found in other commercial products: custom report generator, custom segments and real time segmentation, funnel analysis, advanced ecommerce reporting, etc." Instead, the team suggests that these could be added as plugins, and that the goal of Piwik is to create an "open web analytics framework" that could be used to implement these features if the community desires.

To get the most complete picture possible, it's probably a good idea to combine Piwik with a package like AWstats that will analyze Apache logs. If data privacy and using an open tool aren't concerns, Google Analytics might be a better choice for now, because it does offer a wider selection of features. But users seeking an open source solution, and those who don't want to turn data over to Google or another third party, should look seriously at Piwik. There's no conflict in setting up each of the tools to run concurrently on a site, and having all of the packages at one's fingertips provides all the information any Webmaster could want.
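
To illustrate why log analysis complements a JavaScript tracker, here is a minimal Python sketch that counts successful requests per path from Apache access-log lines. It is not how AWstats works internally; the regular expression assumes the common/combined log format, and the sample lines and the function name `count_successful_requests` are invented for this example.

```python
import re
from collections import Counter

# Matches the leading fields of Apache common/combined log lines (assumed format).
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)


def count_successful_requests(lines):
    """Count requests per path, keeping only those answered with status 200."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match and match.group("status") == "200":
            counts[match.group("path")] += 1
    return counts


# Invented sample lines: a direct PDF download, a feed fetch, and a 404.
SAMPLE = [
    '192.0.2.10 - - [03/Feb/2010:10:00:00 +0000] "GET /paper.pdf HTTP/1.1" 200 51234 "-" "Wget/1.12"',
    '192.0.2.11 - - [03/Feb/2010:10:00:05 +0000] "GET /feed.atom HTTP/1.1" 200 2048 "-" "FeedReader"',
    '192.0.2.10 - - [03/Feb/2010:10:01:00 +0000] "GET /paper.pdf HTTP/1.1" 200 51234 "-" "Wget/1.12"',
    '192.0.2.12 - - [03/Feb/2010:10:02:00 +0000] "GET /missing HTTP/1.1" 404 312 "-" "Mozilla/5.0"',
]

for path, hits in count_successful_requests(SAMPLE).most_common():
    print(path, hits)
```

The direct PDF downloads and the feed fetch in the sample are the kinds of requests a JavaScript tracker never sees, since no page carrying the tracking script is loaded.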

Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

Security: Security in the 20-teens; New vulnerabilities in bltk, kernel, moodle, zabbix,...
Kernel: Improving readahead; The x86_64 DOS hole; Lockdep-RCU
Distributions: Fedora's privilege escalation policy proposal; new releases from Debian, openSUSE, Owl, Tiny Core and Ubuntu; Noteworthy Mandriva Cooker changes; Jono Bacon: Connecting The Opportunistic Dots.
Development: Mozilla Weave 1.0 makes the browser experience portable, new versions of MySQL, BusyBox, Tahoe, Apache, flashrom, Non DAW and Non Mixer, GNOME, Claws Mail, KMid2, Lashstudio, Minicomputer, HipHop, Cython, Scripy, Arduino, libfishsound, Mercurial.
Announcements: iPad and freedom, ATI Catalyst 10.1 drivers, Oracle's plans, Lessig on copyright, FSF on Google Book Search, IFOSLR, Android and kernel mods, Shuttleworth on copyright assignment, Free Technology Academy, Akademy cfp, Collaboration Summit cfp, GUADEC volunteers, Panama MiniDebConf.
