Development

Data mining with Orange

By Nathan Willis
July 4, 2012

Orange is a GPLv3 Python module for mining, classifying, and visualizing data. The main problem it endeavors to help you solve is machine learning: analyzing and modeling a set of example data so that you can use the resulting model to make predictions about new data collected in the wild. Although you can use it to write standard interpreted Python scripts, the project also comes with a "visual programming" interface. Whether visual programming proves useful may depend as much on the programmer as on the data, but Orange makes it simple to explore your data set either way.

The Orange project site provides nightly-build tar downloads, as well as a .deb package repository. In addition to Python (2.6 or 2.7 only), Orange uses the Graphviz library extensively to build visualizations. Orange Canvas, the visual programming tool, requires Qt4.

The fundamentals

Orange is designed to ingest text-based data files; it understands the C4.5 file format popular in the machine learning crowd, but it has a native, tab-delimited file format, too. In Orange's format, the first three lines are reserved for domain information: the first line holds the attribute names, the second line holds the data type for each attribute, and the third line lets you denote special features of specific attributes. The most important special feature is class, which designates an attribute as the distinguishing characteristic of the statistical classes you are out to investigate. The remainder of the file is data, one case per line.
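
To make the three-line header concrete, here is a minimal sketch that builds such a file with nothing but the Python standard library (it does not use Orange, and is written for Python 3). The attribute names, the data rows, and the spelled-out type keywords "discrete" and "continuous" are assumptions chosen for illustration; check the Orange documentation for the exact keywords your version accepts.

```python
import csv
import io

# Hypothetical forum data; names and values are illustrative only.
HEADER = ["country", "posts", "age", "avatar", "os", "submittedPatch"]
# Second line: one data type per attribute (keyword spellings are assumed).
TYPES = ["discrete", "continuous", "continuous",
         "discrete", "discrete", "discrete"]
# Third line: special features; only the class attribute is flagged here.
FLAGS = ["", "", "", "", "", "class"]

ROWS = [
    ["US", 42, 31, "yes", "linux", "yes"],
    ["UK", 3, 24, "no", "windows", "no"],
    ["Germany", 17, 45, "yes", "linux", "no"],
]

def to_tab(header, types, flags, rows):
    """Return the table as tab-delimited text: three domain lines, then data."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(header)   # line 1: attribute names
    writer.writerow(types)    # line 2: attribute types
    writer.writerow(flags)    # line 3: special features such as "class"
    writer.writerows(rows)    # remainder: one case per line
    return buf.getvalue()

text = to_tab(HEADER, TYPES, FLAGS, ROWS)
```

Writing `text` to a file named forum_users.tab would give a file laid out in the way described above, with submittedPatch marked as the class.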

For example, if you have collected data on people who have registered for the forum on your project site, you may have a range of attributes including their country of origin, number of posts, age, whether they use a custom avatar, number of "thumbs up" votes, and OS. If you are interested in exploring what makes a forum member eventually become a contributor, however, a submittedPatch attribute is the one you would mark as a class. That way you can have Orange examine all of the other attributes to find out which ones (or which combinations) accurately predict what will turn a forum member into a contributor.

Orange provides functions to automatically compute simple statistical summaries of your data: means, mean square errors, frequencies and other basic facts. For example, the following code loads a data file from forum_users.tab, calls the orange.DomainDistributions() function on it, and iterates through the data's discrete attributes, reporting the frequency with which each value occurs.

    import orange

    # Load forum_users.tab; the file is given by its base name.
    data = orange.ExampleTable("forum_users")
    # Compute a value distribution for each attribute in the domain.
    datastats = orange.DomainDistributions(data)

    print "Distributions:"
    for i, a in enumerate(data.domain.attributes):
        # Only discrete attributes have a fixed list of values to count.
        if a.varType == orange.VarTypes.Discrete:
            print "%s:" % a.name
            for j, value in enumerate(a.values):
                print "  %s: %d" % (value, int(datastats[i][j]))

The output would be of the form:

    country:
      US: 123
      UK: 87
      Germany: 38
      El Salvador: 19
    os:
      linux: 200
      windows: 15
      osx: 9
      vms: 36
      unknown: 7

Orange exposes continuous attributes (i.e., those with floating point data) in a similar fashion through orange.VarTypes.Continuous. There are functions for paring down your data set by filtering on attribute values or by pseudo-random sampling. You can even take a "stratified" sample, which means that Orange will grab a sample set in which each class value appears in the same proportion as in the entire set.
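
The idea behind a stratified sample can be sketched in a few lines of plain Python 3. This is not Orange's sampling code, only an illustration of the technique; the function name, the row layout, and the sample data are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(rows, class_index, fraction, seed=0):
    """Sample `fraction` of the rows from each class separately,
    so that class proportions in the sample match the full set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[class_index]].append(row)
    sample = []
    for members in by_class.values():
        # Round per class; a very small class still contributes one row.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical data: (os, submittedPatch), 80 non-contributors, 20 contributors.
rows = [("linux", "no")] * 80 + [("linux", "yes")] * 20
subset = stratified_sample(rows, class_index=1, fraction=0.1)
```

With this input, the ten-row subset keeps the 80/20 split of the full set: eight "no" rows and two "yes" rows. A plain random sample of ten rows could easily contain no contributors at all.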

Classifying, predicting, and evaluating

You can also see from the example code that Orange tips its hand a little in its class names. The Python class into which you load your data is named ExampleTable rather than (say) DataTable. That is because Orange is intended to load in "example" data sets that have already been classified (i.e., there is at least one class attribute), so that you can use them to generate a classifier algorithm that can correctly predict to which class items will belong.

Orange includes several different learning algorithms into which you can feed data sets in order to deduce a good classifier. Called learners in Orange-speak, some of them are simplistic and are useful mostly for testing, such as the k-nearest neighbor algorithm, while others are more robust, such as the naive Bayes classifier. But the project's centerpiece is its own implementation of a decision tree algorithm, which lives in a separate module named orngTree.

A decision tree classifies a sample by stepping through each attribute — hopefully in the most efficient order possible based on the available examples. The edges of the tree take you either to a final decision (such as "this user will not become a contributor") or to the next attribute to evaluate. In Orange, you create the tree by "training" the learner in one fell swoop with the orngTree.TreeLearner() function, passing it your data set and any options you require. This function creates and returns a classifier that you can subsequently call (on new, unclassified data) through orngTree.TreeClassifier.
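
How a tree builder decides which attribute to test first can be illustrated with information gain, one common split criterion. The sketch below is plain Python 3 and does not use orngTree; Orange's learner offers its own choice of measures, so treat this as an illustration of the general idea rather than a description of what TreeLearner does internally. The rows and attribute names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attr, target):
    """Reduction in class entropy obtained by splitting `rows` on `attr`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def best_attribute(rows, attrs, target):
    """The attribute a greedy tree builder would test first."""
    return max(attrs, key=lambda a: information_gain(rows, a, target))

# Hypothetical examples: avatar predicts the class perfectly, os not at all.
rows = [
    {"avatar": "yes", "os": "linux", "submittedPatch": "yes"},
    {"avatar": "yes", "os": "windows", "submittedPatch": "yes"},
    {"avatar": "no", "os": "linux", "submittedPatch": "no"},
    {"avatar": "no", "os": "windows", "submittedPatch": "no"},
]
first = best_attribute(rows, ["avatar", "os"], "submittedPatch")
```

Here `first` is "avatar": splitting on it leaves two pure groups (a gain of one bit), while splitting on os leaves the classes as mixed as before (a gain of zero). A full tree builder repeats this choice on each resulting subset until the groups are pure or no attributes remain.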

Or at least that is how it is supposed to work. In reality, you typically have to train, evaluate, tweak, re-train, and re-evaluate several times. Orange provides a slew of regression tests, statistical and probability features, and tuning options for better modeling your data. Most of them will feel vaguely familiar to people whose last statistics class was more than a couple of years ago.

There are also a number of related features for digging into your data set and finding correlations and hidden relationships between attributes. Among the available options are association rules (i.e., learning that certain combinations of values tend to occur together), clustering (attempting to partition the data into discrete groups), and self-organizing maps (which attempt to find patterns in the data by examining its topological features in 2D or 3D). In some cases, the end goal of these analytical techniques might be to build or optimize your classifier, but you may simply be out to find unusual properties in the data set or locate interesting outliers.
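
Association rules are usually judged by two numbers, support and confidence. The following plain Python 3 sketch computes both for a single rule; it is an illustration of the concept and not Orange's association-rule code, and the member records are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Of the transactions matching `lhs`, the fraction also matching `rhs`."""
    both = support(transactions, set(lhs) | set(rhs))
    return both / support(transactions, lhs)

# Hypothetical forum members, each reduced to a set of attribute=value facts.
members = [
    {"os=linux", "avatar=yes", "patch=yes"},
    {"os=linux", "avatar=yes", "patch=yes"},
    {"os=linux", "avatar=no", "patch=no"},
    {"os=windows", "avatar=yes", "patch=no"},
]

# Rule under test: os=linux and avatar=yes  ->  patch=yes
rule_support = support(members, {"os=linux", "avatar=yes", "patch=yes"})
rule_confidence = confidence(members, {"os=linux", "avatar=yes"}, {"patch=yes"})
```

For this toy data the rule holds for half of all members (support 0.5) and for every member who matches its left-hand side (confidence 1.0). Rule miners search for rules whose support and confidence both exceed chosen thresholds.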

A lot of the data mining options implemented in Orange are outside my personal or professional experience, although the reference documentation does an admirable job of providing background information. That said, I was a bit disappointed that the tutorial section on the Orange site covers only a smattering of the feature set. I worked through as many as I could, however, and I will say Orange provides a very easy-to-explore data mining tool set. Most (and perhaps all) of the statistical functions are available in other open source packages (such as R), but both the convenience of working in Python and the number of built-in analysis functions make getting started simple.

Visualize!

[Orange Canvas]

The ease-of-getting-started point goes double for Orange Canvas, the project's visual programming interface. Or rather, the data mining process is easy once you figure out the "visual programming" paradigm itself. Orange Canvas works by letting you drag function blocks from the toolbar onto an infinite-in-two-dimensions workspace, then connect the blocks together with hoses. The output from A goes to the input for B, and so on.

It is the same basic idea as a dozen other visual programming editors. My only real criticism of it is that too few of the blocks' properties are visible on the canvas; you must hover the mouse over each block to see more about it and right-click it to open a property editor.

The reason that this matters is that many of Orange's classes have complex relationships; they often require multiple input connections (such as both data and a learner) that look essentially the same on screen. Some of the block structures struck me as counter-intuitive, too. Conceptually, when you are writing in Python, the TreeLearner is a function that accepts input, and creates a TreeClassifier as its output. But in the visual interface, a "Classification Tree" block has a "learner" as its output node. The reason is that in the block-and-hose design you are intended to hook the learner output directly into a "prediction" block and make use of it, not to manipulate it on its own. But it takes some getting used to.

Still, the visual interface has one killer feature: the ability to instantly hook up blocks to visualizations and manipulate them. Everything from scatterplots to "sieve multigrams" is provided, including a number of chart types you may need to look up in a reference book. Double-clicking on any of the visualization blocks opens a separate window, in which you can manipulate the variables included (and most other properties) and get a live-updated graph of the results. For example, you can plot various attributes against each other in a scatterplot and look for clumps, which would tell you that those attributes are tightly correlated. In contrast, with your own Python code you naturally must edit and re-execute your script to look at each new permutation of attributes.

[Orange Canvas scatterplot]

The Graphviz library does the heavy lifting on the visualizations. When you launch a visualization, you can also use VizRank, a feature that iteratively searches through the possible parameters looking for the ones that produce the most interesting results. For instance, in our forum users example, VizRank will iterate through all possible pairs of attributes (age versus OS, age versus avatar, avatar versus OS, etc.) and rank the pairings by clumpiness. The upshot is that you are spared the time it takes to plot every pair and assess the outcome (not to mention the work of ensuring that you did not accidentally forget a permutation). You can export the visualizations (or selected regions) to image files with a single click.
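
The pair-by-pair search can be sketched in plain Python 3. The scoring used here, leave-one-out accuracy of a nearest-neighbor classifier in each two-attribute projection, is a stand-in chosen for illustration; this article does not establish how VizRank actually scores a plot, and the data, attribute names, and function names below are hypothetical.

```python
from itertools import combinations

def loo_nn_accuracy(points, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier in 2D.
    High values mean points of the same class clump together."""
    hits = 0
    for i, (x, y) in enumerate(points):
        nearest = min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: (points[j][0] - x) ** 2 + (points[j][1] - y) ** 2,
        )
        hits += labels[nearest] == labels[i]
    return hits / len(points)

def rank_pairs(table, attrs, labels):
    """Score every attribute pair once and return them best first."""
    scored = []
    for a, b in combinations(attrs, 2):
        points = [(row[a], row[b]) for row in table]
        scored.append((loo_nn_accuracy(points, labels), a, b))
    return sorted(scored, reverse=True)

# Hypothetical values, assumed to be rescaled to comparable ranges.
# Only "posts" separates the classes; "age" and "votes" do not.
table = [
    {"posts": 1, "age": 2, "votes": 3},
    {"posts": 2, "age": 6, "votes": 7},
    {"posts": 9, "age": 2, "votes": 4},
    {"posts": 8, "age": 6, "votes": 6},
]
labels = ["no", "no", "yes", "yes"]
ranking = rank_pairs(table, ["posts", "age", "votes"], labels)
```

Using `combinations` guarantees that each pair is visited exactly once, which is the bookkeeping the automated search spares you. In this toy data both pairs that include posts score 1.0, while age versus votes scores 0.0 and lands at the bottom of the ranking.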

I am one of those "visual learners" you hear about every now and then; to me, searching through the plots looking for patterns is a far more revealing way of getting to know a data set. I worked with data sets in the hundreds-of-entries range, which does not give one a feel for how Orange might perform on terabyte-scale "big data" mining. That does not mean that Orange is incapable of scaling that high; I simply cannot vouch for it. But that is hardly the point: hundreds and thousands of records are more than enough that a specialized data exploration tool is well worth your while. Orange gives you access to a good Python data-mining toolbox, and Orange Canvas gives you a nearly foolproof way to examine it visually. The project has an active community and is running multiple Google Summer of Code projects to develop new tools, so if the module does not provide the classification or visualization method you need, the odds are good that you can still make it work.

Brief items

Quotes of the week

There’s absolutely no way that the Mozilla Foundation can personally host events to teach web making to the world. But I do think that if we do this right, we can build a movement that others build on, and make their own.

We can teach people to teach people to teach people to fish. And soon there will be no fish left in the sea. (But in this case, that means “everyone learns web making” so it’s not quite so ecological disastery. … I hope.)

Michelle Levesque

I'm all for code cleanups, UI cleanups, and anything that can improve our state of affairs. Forgive me for using a strong word, but destroying working functionality without explaining exactly what made it bad, and without trying to fix it first, is just vandalism.
Federico Mena Quintero

A quick clarification - -building- LibreOffice tends to discover poor thermal management in systems better than anything else I've seen ;-)
Michael Meeks

GNU C library 2.16 released

Version 2.16 of the GNU C library is out. Significant changes include support for the x32 ABI, various bits of ISO C11 support, a number of performance improvements, and lots of bug fixes. This version of glibc is not supported on Linux kernels prior to 2.6.

GRUB 2.00 released

Version 2.00 of the GNU GRUB bootloader has been released. "Since this version has a round number it has been paid special attention to, and hopefully, represents higher quality." Improvements include ports to a number of relatively obscure architectures (Itanium, Fuloong2F, ...), improved device and filesystem support, some additional boot protocols, and more.

Rakudo Star release 2012.06

The Rakudo Perl distribution has released "Rakudo Star," which it describes as "a useful, usable, "early adopter" distribution of Perl 6." There are numerous improvements listed, including enhanced list and .map handling and the addition of the same regular expression engine used in user-space, which fixes several parsing bugs.

Newsletters and articles

Development newsletters from the last week

Linksvayer: 5 years of GPLv3

Mike Linksvayer shares his reflections on the importance of the GPLv3, which was released five years ago today. "I suggest that number (add qualifiers of and scaling by importance, quality, etc, as you wish) of works under GPLv3 or use of GPLv3 relative to other licenses are less important markers of GPLv3's success, and that of the broader FLOSS community, than the number and preponderance of works under GPLv3-compatible terms."

Vinyl cutting on Linux: the real deal (Libre Graphics World)

Libre Graphics World is running a comparison of Linux applications used to drive cutting machines for vinyl, cardstock, and other materials. "Most cutting devices rely on HPGL printer control language and its versions such as CAMM-HPGL (Roland). So the job is, essentially, to take a vector graphics file and convert it to HPGL, then send it to the device along with control commands such as blade speed and pressure." Looks like several options are available, including Inkscape extensions and stand-alone programs.

Why learn C? (O'Reilly Radar)

O'Reilly Radar has published an interview with author David Griffiths on the continued relevance of teaching C. The main interview is in video form, with text excerpts highlighting specific points, such as "For example, it teaches how memory works in a more profound way (a concept systems programmers will likely already know, though new programmers in specialized fields might not)" and "It's an important, foundational language that requires you to understand the full stack of the technology."

Page editor: Nathan Willis

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds