Weekly Edition for February 9, 2012

Tracking users

By Jake Edge
February 8, 2012

User tracking is always contentious. There are real advantages to gathering lots of information on how an application is used, but there are also serious drawbacks in terms of privacy. Many applications or distributions have "opt-in" mechanisms that report back, but that makes the data somewhat suspect because it comes from a self-selected group. But "opt-out" data gathering is frowned upon by privacy advocates and privacy-conscious users. As a recent discussion in the Mozilla dev-planning group shows, though, there are some who find that the need for data may outweigh some privacy concerns.

Mozilla is understandably concerned with Firefox's decline in market share and would like to try to determine what the underlying causes are. That has led to a proposal for a feature called MetricsDataPing that would collect a wide variety of information about the browser, its add-ons, and how it is used. That information would be sent to Mozilla over HTTPS each day that the browser is used. Crucially, the proposal is that MetricsDataPing would be an opt-out feature, which would require users to know about the feature and disable it if they didn't want to share that data.

This stands in contrast to features like Telemetry, which gathers data on browser performance but differs from MetricsDataPing in two crucial ways. First, it is opt-in, so users have to actively enable it; second, it tries to avoid gathering any personally identifiable information (PII). It does not store IP addresses (though it does geolocate each address and store the result), and it generates a new ID every time the browser is restarted.

MetricsDataPing, on the other hand, would gather a much wider range of information, such that "fingerprinting" a user based solely on the data gathered would be a real possibility. The list of installed add-ons alone is probably nearly unique; adding the installation date for each add-on, as MetricsDataPing does, would almost certainly make it unique. Information about search sources used, number of searches done, and that sort of thing also rings alarm bells for those concerned about privacy. MetricsDataPing also uses a "document ID" to identify the data sent to the server, which would allow users to delete their data from the Mozilla servers. But the document ID could also essentially serve as a unique user ID, because the previous document ID is always sent with the current update so that the older document can be deleted.
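
The two tracking concerns described above can be illustrated with a minimal Python sketch. It is not Mozilla's code: the function names, the payload layout, and the add-on names used in any example call are hypothetical, chosen only to show why a list of add-ons with install dates behaves like a fingerprint and why sending the previous document ID lets a server link successive submissions.

```python
import hashlib
import json
import uuid


def addon_fingerprint(addons):
    """Hash a list of (addon_name, install_date) pairs into a stable digest.

    Sorting first means the digest does not depend on the order in which
    the add-ons are listed, only on the set of pairs itself.
    """
    canonical = json.dumps(sorted(addons), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_ping(payload, previous_id=None):
    """Build one submission with a fresh document ID plus the previous one.

    The field names are assumptions made for this sketch.
    """
    return {"id": str(uuid.uuid4()), "previous_id": previous_id, "payload": payload}


def link_installs(pings):
    """Group pings into per-install chains by following previous_id links.

    Pings are assumed to arrive in chronological order.
    """
    chain_of = {}  # document ID -> ID of the first document in its chain
    chains = {}    # first document ID -> all document IDs in that chain
    for ping in pings:
        label = chain_of.get(ping["previous_id"], ping["id"])
        chain_of[ping["id"]] = label
        chains.setdefault(label, []).append(ping["id"])
    return list(chains.values())
```

Even though every submission carries a new random ID, `link_installs` reassembles the full history of each installation, which is the sense in which the document ID can act as a persistent identifier.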

There are efforts to anonymize the data that would be stored, but, as we have seen before, it is very difficult to truly anonymize collected data. Some of that is also true for Telemetry, because it has added fingerprintable data after its initial roll-out, but the key difference is that users have willingly chosen to share that data. That's the main difficulty that some see with the MetricsDataPing proposal. Benjamin Smedberg started off the discussion with a posting of his concerns:

It seems as if we are saying that since we already collect most of this data via various product features, that makes it ok to also collect this data in a central place and attach an ID to it. Or, that because we *need* this data in order to make the product better, it's ok to collect it. This makes me intensely uncomfortable. At this point I think we'd be better off either collecting only the data which cannot be used to track individual installs, or not implementing this feature at all.

But others, especially on the Mozilla metrics team, believe that the information gathered is critical. Blake Cutler described it this way:

The Metrics Data Ping is an attempt to apply scientific principles to product design and development. Mozilla relies too much on gut decisions, which directly translates to poor product decisions. Firefox analytics are stuck in the dark ages. It shows.

Ben Bucksch made several suggestions on how to improve the privacy of the data gathered, but he is also worried that gathering data to figure out why Firefox usage is declining will actually result in more users leaving because of a perception that the browser is intruding on their privacy. While the data may be important and useful, there are other considerations according to Justin Lebar:

Yeah, it sucks that we can't tell why people stop using Firefox. But our [principles] are more important than that.

To that end, the discussion shouldn't center on why these metrics are important or difficult to obtain another way. The discussion is about whether we can at once collect the proposed metrics and stay true to our values. If we can't, then we can't collect the data, no matter how important it may be.

There was some discussion of technical measures to try to reduce the PII content of the messages, but there are still problems with things like fingerprinting. If you gather enough information (of the kind the metrics team thinks it needs), you are very likely to be able to track users. Even if the data is massaged in some fashion (aggregated for example), the perception of privacy invasion will still be present as Boris Zbarsky pointed out:

One problem is that some people will assume that if data is being sent then it's being used, no matter what we actually do with it and say we do with it. So if we _can_ design things such that we couldn't misuse them even if we were to want to, we should. I understand that in general this is pretty difficult....

Even for opt-in services like Telemetry, gathering additional information requires user agreement. When the list of add-ons was added to the information that Telemetry supplied, users were required to opt back in to Telemetry after being informed of that change. As Lebar noted: "So again, here we have a decision made about sending the list of add-ons in a ping-type thing, that we cannot do it without explicit permission, even for people who already opted in to data collection." But MetricsDataPing would, seemingly, gather that information without asking the user even once.

Early in the thread, Mike Beltzner pointed to a posting on the Mozilla privacy blog that committed Mozilla "to a basic policy of 'no surprises, real choices, sensible settings, limited data, and user control'", he said. It's a bit hard to see how MetricsDataPing fits into that framework. For some Linux distributions (which is probably not really where Mozilla is focused on market share) it could easily be seen as a misfeature that should be removed from the code—though that might lead to more "iceweasels" due to Mozilla trademark issues.

In the end, Mozilla may need to find a way to satisfy its data needs with an opt-in feature, or find a very convincing argument for the impossibility of user tracking with the data it does collect. There is also the argument that there is a subtle self-unselection bias that is introduced with an opt-out feature. In what ways does the data get skewed by eliminating the very privacy-conscious? It is certainly understandable that the metrics team (and Mozilla as a whole) wants the data, but, like Linux distributions, it may have to settle for indirect measurements or some self-selection bias.


Scribus 1.4.0 released

February 8, 2012

This article was contributed by Nathan Willis

The Scribus project announced the release of version 1.4.0 in January, its first new stable release in more than four years. The new release incorporates a suitably long list of changes from that time span, covering new functionality that will be of interest to print-publishing diehards, and simplifications in the tool set that may make the application more accessible for those who are new to desktop publishing (DTP).

[Scribus UI]

Fans of Scribus have had access to unstable development builds for much of the time that 1.4.0 has been in development and testing (LWN covered the release of 1.3.5, the first release in the series that became 1.4, in August 2009), but the "stable" branding is one the project was intent on not rushing. The 1.3.5/1.4 series introduces changes to the file format that make newer files incompatible with the old stable release, so running the stable and development series side-by-side was a risky proposition for anyone using Scribus for production work.

The new release is available in binary form for Debian-based Linux systems (32 and 64 bit), Windows, Mac OS X, and (so they tell us...) OS/2. In the past, RPM packages have also been provided on the site, but they do not appear to have landed yet for 1.4.0. Scribus pulls in a hefty list of dependencies, due to its need to support a variety of embedded content types, but nothing out of the ordinary for a modern distribution.

New underpinnings

The biggest change to the code base in Scribus 1.4.0 is the migration to the Qt 4 framework — and, according to developer Peter Linnell, it was also the source of the long development cycle. But Qt 4 also brings with it the project's first stable builds for Mac OS X, which is important because of its position as the dominant DTP operating system. As Linnell explained it, "we spent a lot of time optimizing the code for Qt4, and also we wanted a really, really solid release." Porting to OS X involved plenty of GUI and human interface guidelines (HIG) work in order to make the application fit in, but it also entailed integrating with the OS X font, color management, and printing subsystems. At the same time, Qt 4 allowed the project to make its first appearance on the *BSDs and OpenSolaris.

A seemingly minor change in version 1.4 is the unification of Undo/Redo tracking across the application. The Undo history now tracks all sorts of edits that were previously not undoable, such as text formatting changes. But that effort also uncovered a lot of "dodgy code" in need of refactoring, Linnell said.

The marquee feature in 1.4.0 is the Render Frame content type. In Scribus, as in most other DTP applications, the editing model consists of a set of individual pages onto which you place and rearrange the objects that make up your document — blocks of text, images, lines and other features, footnotes, etc. Scribus calls these objects frames, and it has long supported an impressive list of file formats for frame content — including the native formats of Adobe Photoshop, Illustrator, and other proprietary applications.

Render frames are different in that the content they contain is not a static file, but rather the output generated by an external renderer that will be called when the Scribus project is printed or exported. Any application that can be called from the command line and produce PostScript, PDF, or PNG output can be used as a renderer. This allows Scribus documents to pull in generated content like graphs and complex mathematical formulas, without requiring them to first be exported to a separate image file. That means the external files can be updated at any point without re-building the Scribus document, and it means the final product can be rendered at the appropriate resolution (for print or on-screen viewing) without extra effort. The default set of render frames in 1.4.0 includes TeX and LaTeX, Gnuplot, dot/Graphviz, Lilypond, and POV-Ray.
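
The general idea of a render frame (hand the frame's source to a command-line program, then pick up the file it produces) can be sketched in a few lines of Python. This is an illustration only, not how Scribus implements the feature: the `{input}` and `{output}` placeholders, the function name, and the stand-in renderer are all assumptions made for this example.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# A stand-in "renderer" that just copies its input file to its output file,
# so the sketch can be run without TeX, Gnuplot, or similar tools installed.
COPY_RENDERER = [
    sys.executable, "-c",
    "import sys, shutil; shutil.copyfile(sys.argv[1], sys.argv[2])",
    "{input}", "{output}",
]


def render_frame(command, source_text, suffix=".png"):
    """Run an external renderer on source_text and return the output bytes.

    `command` is an argv list in which the placeholders "{input}" and
    "{output}" are replaced by temporary file paths (a hypothetical
    convention chosen for this sketch).
    """
    with tempfile.TemporaryDirectory() as workdir:
        src = Path(workdir) / "frame_source.txt"
        out = Path(workdir) / ("frame_output" + suffix)
        src.write_text(source_text, encoding="utf-8")
        argv = [arg.replace("{input}", str(src)).replace("{output}", str(out))
                for arg in command]
        # check=True raises CalledProcessError if the renderer exits non-zero.
        subprocess.run(argv, check=True)
        return out.read_bytes()
```

Because the renderer is invoked at output time rather than when the frame is created, the same frame source can be re-rendered at a different resolution simply by changing the command.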


Render frames represent a conceptual left-turn from typical DTP thinking, but they open the door to a powerful new set of uses. On the other hand, the typical DTP document-building approach differs so much from word processing and general text editing that many new users find it difficult to get started. On that front, Scribus 1.4.0 introduces some changes that should make the learning curve less intimidating, primarily by making far more on-screen objects directly editable.

[Direct editing]

In earlier releases, adding text to a document took two steps: dropping the text frame into position on the page, and opening the text frame in the "story editor" component. For a lot of new users, that was difficult to grasp, so it is likely to be a popular move that 1.4.0 enables direct text editing on the page. The story editor component is still available, and offers access to more features like saved paragraph and character styles, but it is not required.

Similar improvements have landed for working with image content. Almost any object can now be directly edited on the canvas, including vectors and raster images. Transformation tools and Boolean operations like those you would find in GIMP and Inkscape are provided, and when those will not suffice, images can be opened in an external editor.

Scribus's image manager — which serves as a browser and inspector for all of the image objects linked into a project — also received a significant upgrade for 1.4. From the manager, you can see image details for each object in the project (including the file path, original dimensions, and scaled size), look up each instance where an image is used in the document, and apply non-destructive image effects. Although a single-page document may not be difficult to keep tabs on without use of the image manager, it is an indispensable aid for multi-page reports or booklets.

[Printing options]

Finally, Scribus is focused on producing printed (or at least, print-worthy) output, and 1.4 adds some features to assist users at print time. Users can toggle a number of features on or off in the print previewer, with live updates to reflect the changes. These include anti-aliasing (which shows a smoother preview, but takes more time), transparency (which can reveal problems with image backgrounds that need to be masked-out), and spot-color-to-CMYK conversion (which is necessary when printing spot colors on normal, desktop printers). Both the printing system and PDF exporter can output registration, crop, bleed marks, and color bars. An extra nice touch is the ability to simulate how the output will appear to people with four types of color-blindness.

Functional additions

Over the span of its development, the new stable release of Scribus picked up a wealth of individual new features — some, but not all, of which were present at the release of 1.3.5. Among the noteworthy additions are support for more advanced features in Adobe Photoshop files, support for export to the PDF 1.5 format, advanced typography features, and a large collection of new swatches, scripts, and templates.

Photoshop files, like it or not, are the most common application-specific raster images used by graphic designers. Although converting them to a standardized export format like TIFF is usually preferable, there are times when Scribus needs to link them in as-is. The new release supports multi-layer Photoshop files, and those with embedded clipping paths. PDF 1.5 likewise introduces some important new features, such as animation transitions (which are useful for creating PDF presentations), multi-layer documents, and PDFs that embed other PDF or EPS files. On the latter feature, older versions of Scribus could import such PDF and EPS content only by rasterizing it, with the resulting loss in quality.

Scribus is slowly but surely adding support for advanced font and typesetting features; it is one of the few open source applications to properly do drop caps and discretionary ligatures (i.e. at the option of the user), for example. The new release expands the typesetting feature set to include "optical margins" and "glyph extension," both of which are subtle techniques to make the ragged edge of a text block appear more naturally aligned. Optical margins allow non-letter bits like hyphens and punctuation to hang off past the end of the line, so that the letters on adjacent rows appear to be lined up with each other. Glyph extension allows the font renderer to slightly widen individual characters on a line of fully-justified text, rather than only expanding the spaces between letters. The result is easier to read.
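
To make the glyph-extension idea concrete, here is a small Python sketch of how the slack on a justified line might be split between widening glyphs and widening spaces. It is not Scribus's algorithm; the function name, the parameters, and the 2% stretch limit are assumptions picked for illustration.

```python
def justify_line(glyph_width, space_count, space_width, target_width,
                 max_stretch=0.02):
    """Return (glyph_scale, extra_per_space) for one fully-justified line.

    glyph_width  -- total natural width of all glyphs on the line
    space_count  -- number of stretchable spaces on the line
    space_width  -- natural width of one space
    target_width -- the column width the line must fill
    max_stretch  -- largest fraction by which glyphs may be widened
                    (an assumed limit, not a Scribus default)
    """
    natural = glyph_width + space_count * space_width
    slack = target_width - natural
    # Nothing to distribute, or nowhere to put it: leave the line alone.
    if slack <= 0 or space_count == 0 or glyph_width <= 0:
        return 1.0, 0.0
    # Let the glyphs absorb as much slack as the stretch limit allows ...
    absorbed = min(slack, glyph_width * max_stretch)
    glyph_scale = 1.0 + absorbed / glyph_width
    # ... and spread whatever is left evenly over the spaces.
    extra_per_space = (slack - absorbed) / space_count
    return glyph_scale, extra_per_space
```

With `max_stretch=0` the function falls back to plain space-only justification, so comparing the two results shows how much smaller the gaps become once glyphs are allowed to take up part of the slack.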

Finally, 1.4.0 ships with an expanded set of design elements like pattern fills, defines more gradient types, and includes more document template styles. It also includes new scripts, some of which add complex, useful functionality. An example is the Autoquote script, which you run on a text frame after you have finished editing its contents. The script parses the text and intelligently converts "dumb" quotation marks into "smart quotes," correctly recognizing and accounting for nested quote styles, and producing output that is tailored to the punctuation style of the language that you specify.
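
The kind of conversion described here can be sketched as follows. This is not the Autoquote script itself, and it does not use the Scribus scripting API; it is a standalone Python illustration in which the function name, the style table, and the simple "opening or closing" heuristic are all assumptions. It handles only straight double quotes and leaves apostrophes alone.

```python
# Outer and inner quote pairs per language. French guillemets are normally
# set with spacing inside them; that detail is omitted to keep the sketch short.
QUOTE_STYLES = {
    "en": [("\u201c", "\u201d"), ("\u2018", "\u2019")],
    "de": [("\u201e", "\u201c"), ("\u201a", "\u2018")],
    "fr": [("\u00ab", "\u00bb"), ("\u201c", "\u201d")],
}


def smarten(text, lang="en"):
    """Replace straight double quotes with typographic ones, tracking nesting."""
    levels = QUOTE_STYLES[lang]
    out = []
    depth = 0                  # how many quotations are currently open
    last_was_opening = False   # was the previous quote mark an opening one?
    for i, ch in enumerate(text):
        if ch != '"':
            out.append(ch)
            continue
        prev = text[i - 1] if i > 0 else " "
        # Heuristic: a quote opens after whitespace, an opening bracket,
        # or directly after another opening quote.
        opening = (prev.isspace() or prev in "([{"
                   or (prev == '"' and last_was_opening))
        if opening:
            out.append(levels[depth % len(levels)][0])
            depth += 1
        else:
            depth = max(depth - 1, 0)
            out.append(levels[depth % len(levels)][1])
        last_was_opening = opening
    return "".join(out)
```

For example, `smarten('"He said "no" to me"')` gives the outer quotation double curly quotes and the nested one single curly quotes, while passing `lang="de"` switches to the low-opening German style.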

Now the fun begins

The release notes say that Scribus 1.4.0 closes more than 2000 bugs and feature requests, which is a weighty bill for any project. It has been a long time since 1.3.5, at which point we were told that the pace of development would pick up — but then again, DTP is a very convoluted task. It covers proper format handling for everything from text to images, all the way from import of vendor-specific files to printed output, and users expect fine-grained control over every aspect of the positioning and characteristics of each element. Considering those challenges, it is impressive that Scribus is as full-featured as it is.

The new release is also remarkably stable. I have been running pre-release builds of 1.4.0 for several months, and have yet to experience a crash or a corrupted file — at least on Linux. Back when 1.3.5 was released, I commented that the first native builds for Mac OS X would potentially have the biggest impact on the project. I still think that is true, given the hold that OS X has among graphic designers.

Convincing OS X users to try Scribus is a prerequisite, of course, but that is not really a problem with a technical solution. In the meantime, the project has more work cut out for it. Better support for OpenType features and page impositioning are common requests, and even though PDF 1.5 support is an important milestone, Adobe has pushed the format through several revisions since then. Of course, if the target stayed completely still, it probably wouldn't be as much fun to develop.


FOSDEM: Infrastructure as an open source project

February 8, 2012

This article was contributed by Koen Vervloesem

The open source development model has many interesting properties, so it's not surprising that it has also been applied in domains other than software. In his talk at FOSDEM (Free and Open Source Software Developers' European Meeting) 2012 in Brussels, Ryan Lane explained how the Wikimedia Foundation is treating its infrastructure as an open source project, which enables the community to help run the Wikimedia web sites, including the popular Wikipedia.

Ryan Lane is an Operations Engineer at the Wikimedia Foundation and the Project Lead of Wikimedia Labs, a project aimed at improving the involvement of volunteers in operations and software development for Wikimedia projects. These projects, like Wikipedia, Wikibooks, and Wikimedia Commons, are well-known because of their large community of volunteers contributing content. Moreover, MediaWiki, the wiki software originally developed for Wikipedia and now also used in many other wikis, is an open source project.

In the early days, Wikimedia volunteers had their say not only on content and software, but also on infrastructure. There was no operations staff; the server infrastructure was managed entirely by volunteers. Since then, however, operations has been professionalized, and now it's all done by staff. Ryan's message in his talk was: "We want to change this, because operations is currently a bottleneck: it doesn't scale as well as software. That's why we had the idea to re-open our infrastructure to volunteers." But how do you give volunteers access to an infrastructure?

Puppet repositories

Wikimedia has already shared a lot of knowledge about its infrastructure on wikitech. This public wiki describes their network and server infrastructure in detail, including the open source software they use, such as Ubuntu, Apache, Squid, PowerDNS, Memcached, MySQL, and the configuration management tool Puppet to maintain a consistent configuration for all their servers.

Ryan's approach to opening up Wikimedia's infrastructure even more was twofold. First, Wikimedia's system administrators spent a few weeks cleaning up Wikimedia's Puppet configuration. After stripping all private and sensitive information, they published the Puppet files in a public Git repository. The sensitive stuff was moved to a private repository that is only available to Wikimedia staff and volunteers with root access.

But Ryan wanted more than just sharing knowledge about how Wikimedia manages its servers (the information in wikitech and the public Puppet repository): he wanted to treat operations as a real open source project where community members could edit Wikimedia's server architecture just like they did with Wikimedia's content and software. So he had to build a self-sustaining operations community around Wikimedia. For this to happen without sacrificing the reliability of Wikimedia's servers, a group of volunteers created a clone of the production cluster, which is mostly set up now. Thanks to this, staff and community operations engineers can push their changes to a test branch of the Puppet repository to try out new things on the cloned cluster. After a code review of the changes by the staff operations engineers, the code is evaluated by a test suite. If the code passes the tests, the changes are pushed to the production branch of the Puppet repository and hence the production systems are managed by the new Puppet configuration.
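
The path a change takes, from the test branch through staff review and the test suite to production, can be modeled as a short series of gates. The Python sketch below is only an illustration of that flow; the class, the stage names, and the check functions are hypothetical and are not part of Wikimedia's tooling.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    TEST_BRANCH = "test"
    REVIEWED = "reviewed"
    PRODUCTION = "production"
    REJECTED = "rejected"


@dataclass
class Change:
    author: str
    description: str
    stage: Stage = Stage.TEST_BRANCH
    history: list = field(default_factory=list)


def promote(change, approved_by_staff, checks):
    """Move a change toward production, stopping at the first failed gate.

    `checks` is a list of (name, function) pairs; each function takes the
    change and returns True if that test passes.
    """
    # Gate 1: code review by the staff operations engineers.
    if not approved_by_staff:
        change.stage = Stage.REJECTED
        change.history.append("review: rejected")
        return change
    change.stage = Stage.REVIEWED
    change.history.append("review: approved")
    # Gate 2: the test suite; every check must pass.
    for name, check in checks:
        if not check(change):
            change.stage = Stage.REJECTED
            change.history.append(f"test {name}: failed")
            return change
        change.history.append(f"test {name}: passed")
    # Both gates cleared: the change reaches the production branch.
    change.stage = Stage.PRODUCTION
    return change
```

The point of the ordering is that nothing a volunteer pushes can reach the production branch without first passing through both the human review and the automated tests.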

Wikimedia Labs is using OpenStack as a private cloud to run their server instances (virtual machines). At the moment, there are 83 instances running in the test cluster, managed by various Puppet classes, including base (for the configuration that applies to every server instance), exim::simple-mail-sender (for every server that has to send email), nfs::server (for an NFS server), misc::apache2 (for a web server), and so on.

Managing projects

There are also 47 projects defined in the Wikimedia Labs project, each of them implementing a specific task such as adding a new feature, adding monitoring, or puppetizing infrastructure that has been set up manually in the past. For instance, there are projects for bots, the continuous integration tool Jenkins, Nginx, Search, Deployment-prep which implements the clone of the production infrastructure, and so on. Each project has a project page on the wiki with documentation, the group members, and other information.

The interesting thing about these project pages is that most of the information is automatically generated. For example, when a server instance is running for a project, this instance is automatically shown at the bottom of the wiki page. And when someone types the command !log <project> <message> on the #wikimedia-labs IRC channel, the message is automatically logged on the project page under the heading "Server Admin Log", which is subdivided by day. That way, volunteer server administrators can explain what they did, so other volunteers, who may be living in a different timezone on the other side of the world, can follow what is happening in the project.
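
A bot that handles this kind of command has to do little more than parse the line and file the message under the right project and day. The following Python sketch shows one way that could look; it is not the code of the actual Wikimedia Labs bot, and the function names, the entry format, and the choice of UTC for the day boundaries are assumptions.

```python
import re
from collections import defaultdict
from datetime import datetime, timezone

# "!log", a project name without spaces, then a non-empty free-form message.
LOG_COMMAND = re.compile(r"^!log\s+(?P<project>\S+)\s+(?P<message>.+)$")


def new_admin_log():
    """Create an empty log: project name -> ISO day -> list of entries."""
    return defaultdict(lambda: defaultdict(list))


def parse_log_command(line):
    """Return (project, message) for a !log line, or None for anything else."""
    match = LOG_COMMAND.match(line.strip())
    if match is None:
        return None
    return match.group("project"), match.group("message")


def add_entry(admin_log, nick, line, when):
    """File a !log line under its project and UTC day; return True if logged.

    `when` should be a timezone-aware datetime.
    """
    parsed = parse_log_command(line)
    if parsed is None:
        return False
    project, message = parsed
    utc = when.astimezone(timezone.utc)
    admin_log[project][utc.date().isoformat()].append(
        f"{utc:%H:%M} {nick}: {message}")
    return True
```

Keying the entries by UTC day gives volunteers in different timezones a single shared timeline, which matches the purpose described above.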

The power of the community

So now that anyone has been able to push changes from ideas to production on Wikimedia's cluster for a couple of months, what are the results? According to Ryan, there are 105 users now in the Wikimedia Labs project who have contributed a variety of Puppet configurations:

One volunteer puppetized our existing Nagios monitoring setup (which was not managed by Puppet) in a very neat way. The bot infrastructure has also been improved much by volunteers. And at the San Francisco hackathon in January 2012 we had a project created, implemented, tested, and deployed to production during the hackathon. We have a custom UDP logging module written for nginx, and it had a couple of bugs in the format. Abe Music built an instance, installed our nginx source package, added the change to fix the formatting, then pushed them up for review. We reviewed the change, then pushed it to production. All of this happened during the hackathon.

So has this ambitious experiment been successful? According to Ryan, the original goal to lessen the bottleneck of the operations team definitely succeeded. However, he points out that the bottleneck has shifted: "We have to do these code reviews now, but fortunately it takes less time to review code than it does to make a lot of changes." Another issue Ryan sees is trust: "Giving out root to volunteers is dangerous, so we have to audit our infrastructure often. Moreover, there's always the danger of social engineering: newcomers can try to build trust to have us give them sensitive information about our infrastructure." But luckily the staff can count on a core of community people whom they trust to do these code reviews and audits.

All in all, Ryan thinks that the same model as Wikimedia Labs uses can also be used in other organizations to set up a volunteer-driven infrastructure. In particular, non-profits or software development projects that rely on a big infrastructure could profit from treating operations as an open source project. In addition to being able to tap into the potential of technical talents in the community, opening operations is also a great way to identify skilled and passionate people to hire for a staff position.


Page editor: Jonathan Corbet


Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds