
Data mining with Orange

By Nathan Willis
July 4, 2012

Orange is a GPLv3 Python module for mining, classifying, and visualizing data. The main problem it endeavors to help you solve is machine learning — analyzing and modeling a set of test data so that you can use it to make predictions about new data collected in the wild. Although you can use it to write standard interpreted Python scripts, the project also comes with a "visual programming" interface. Whether visual programming proves useful may depend as much on the programmer as on the data, but Orange makes it simple to explore your data set either way.

The Orange project site provides nightly-build tar downloads, as well as a .deb package repository. In addition to Python (2.6 or 2.7 only), Orange uses the Graphviz library extensively to build visualizations. Orange Canvas, the visual programming tool, requires Qt4.

The fundamentals

Orange is designed to ingest text-based data files; it understands the C4.5 file format popular in the machine learning crowd, but it has a native, tab-delimited file format, too. In Orange's format, the first three lines are reserved for domain information: the first line holds the attribute names, the second line holds the data type for each attribute, and the third line lets you denote special features of specific attributes. The most important special feature is class, which designates an attribute as the distinguishing characteristic of the statistical classes you are out to investigate. The remainder of the file is data, one case per line.
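A small, hypothetical fragment of such a file (the attribute names here are invented for illustration, and the columns are tab-separated, shown aligned with spaces for readability) might look like this:

```
country     posts       submittedPatch
discrete    continuous  discrete
                        class
US          123         no
DE          15          yes
```

The first row names the attributes, the second gives each one's type, and the `class` marker in the third row flags submittedPatch as the class attribute; every subsequent row is one case.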

For example, if you have collected data on people who have registered for the forum on your project site, you may have a range of attributes including their country of origin, number of posts, age, whether they use a custom avatar, number of "thumbs up" votes, and OS. If you are interested in exploring what makes a forum member eventually become a contributor, however, a submittedPatch attribute is the one you would mark as a class. That way you can have Orange examine all of the other attributes to find out which ones (or which combinations) accurately predict what will turn a forum member into a contributor.

Orange provides functions to automatically compute simple statistical summaries of your data: means, mean square errors, frequencies, and other basic facts. For example, the following code loads a data file from forum_users.tab, calls the orange.DomainDistributions() function on it, and iterates through the data's discrete attributes, reporting the frequency with which each value occurs.

    import orange
    data = orange.ExampleTable("forum_users")
    datastats = orange.DomainDistributions(data)

    print "Distributions:"
    for i in range(len(data.domain.attributes)):
        a = data.domain.attributes[i]
        if a.varType == orange.VarTypes.Discrete:
            print "%s:" % a.name
            for j in range(len(a.values)):
                print "  %s: %d" % (a.values[j], int(datastats[i][j]))

The output would be of the form:

    country:
      US: 123
      UK: 87
      Germany: 38
      El Salvador: 19
    os:
      linux: 200
      windows: 15
      osx: 9
      vms: 36
      unknown: 7

Orange exposes continuous attributes (i.e., those with floating point data) in a similar fashion through orange.VarTypes.Continuous. There are functions for paring down your data set by filtering on attribute values or by pseudo-random sampling. You can even take a "stratified" sample, which means that Orange will grab a sample set that has the same proportions of each class attribute as the entire set.
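The idea behind stratified sampling is simple enough to sketch in plain Python. This is not Orange's API, just an illustration of the technique: split the cases by class, then sample the same fraction from each group so the class proportions are preserved.

```python
# Plain-Python sketch of stratified sampling (not Orange's API).
import random
from collections import defaultdict

def stratified_sample(rows, class_of, fraction, seed=0):
    """Sample `fraction` of the rows from each class separately,
    preserving the class proportions of the full data set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[class_of(row)].append(row)
    sample = []
    for group in by_class.values():
        k = int(round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample
```

With 10 "yes" and 90 "no" cases, a 20% stratified sample contains exactly 2 "yes" and 18 "no" cases, where a plain random sample could easily miss the minority class entirely.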

Classifying, predicting, and evaluating

You can also see from the example code that Orange tips its hand a little in its class names. The Python class into which you load your data is named ExampleTable rather than (say) DataTable. That is because Orange is intended to load in "example" data sets that have already been classified (i.e., there is at least one class attribute), so that you can use them to generate a classifier algorithm that can correctly predict to which class items will belong.

Orange includes several different learning algorithms into which you can feed data sets in order to deduce a good classifier. Called learners in Orange-speak, some of them are simplistic and useful mostly for testing, such as the k-nearest neighbor algorithm, while others are more robust, such as the naive Bayes algorithm. But the project's centerpiece is its own decision tree algorithm, implemented in a module named orngTree.

A decision tree classifies a sample by stepping through each attribute — hopefully in the most efficient order possible based on the available examples. The edges of the tree take you either to a final decision (such as "this user will not become a contributor") or to the next attribute to evaluate. In Orange, you create the tree by "training" the learner in one fell swoop with the orngTree.TreeLearner() function, passing it your data set and any options you require. This function creates and returns a classifier that you can subsequently call (on new, unclassified data) through orngTree.TreeClassifier.
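The core step a tree learner repeats while training can be sketched in a few lines of plain Python. This is an illustration of the general technique, not Orange's implementation: at each node, pick the attribute whose split leaves the class labels purest, measured here by information gain.

```python
# Illustrative sketch of decision-tree attribute selection
# (not Orange's implementation): choose the attribute with the
# highest information gain over the class labels.
from collections import Counter
from math import log

def entropy(labels):
    n = float(len(labels))
    return -sum((c / n) * log(c / n, 2) for c in Counter(labels).values())

def best_attribute(rows, labels, n_attrs):
    base = entropy(labels)
    scores = {}
    for a in range(n_attrs):
        # Partition the labels by this attribute's values.
        parts = {}
        for row, lab in zip(rows, labels):
            parts.setdefault(row[a], []).append(lab)
        # Weighted entropy remaining after the split.
        remainder = sum(len(p) / float(len(labels)) * entropy(p)
                        for p in parts.values())
        scores[a] = base - remainder   # information gain
    return max(scores, key=scores.get)
```

A real tree learner applies this recursively: split on the winning attribute, then repeat on each resulting partition until the leaves are pure enough.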

Or at least that is how it is supposed to work. In reality, you typically have to train, evaluate, tweak, re-train, and re-evaluate several times. Orange provides a slew of regression tests, statistical and probability features, and tuning options for better modeling your data. Most of them will feel at least vaguely familiar, even to people whose last statistics class was more than a couple of years ago.

There are also a number of related features for digging into your data set and finding correlations and hidden relationships between attributes. Among the available options are association rules (i.e., learning that certain combinations of values tend to occur together), clustering (attempting to partition the data into discrete groups), and self-organizing maps (which attempt to find patterns in the data by examining its topological features in 2D or 3D). In some cases, the end goal of these analytical techniques might be to build or optimize your classifier, but you may simply be out to find unusual properties in the data set or locate interesting outliers.
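The two numbers at the heart of association rules are easy to show in plain Python (again, a sketch of the concept, not Orange's API): the support of a rule is how often the value combination occurs at all, and the confidence is how often the consequent holds when the antecedent does.

```python
# Toy sketch of association-rule statistics (not Orange's API).
def rule_stats(rows, antecedent, consequent):
    """antecedent and consequent are (column_index, value) pairs.
    Returns (support, confidence) for the rule antecedent -> consequent."""
    ai, av = antecedent
    ci, cv = consequent
    n_ante = sum(1 for r in rows if r[ai] == av)
    n_both = sum(1 for r in rows if r[ai] == av and r[ci] == cv)
    support = n_both / float(len(rows))
    confidence = n_both / float(n_ante) if n_ante else 0.0
    return support, confidence
```

A rule miner like the one in Orange enumerates candidate rules and keeps those whose support and confidence clear thresholds you choose.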

A lot of the data mining options implemented in Orange are outside my personal or professional experience, although the reference documentation does an admirable job of providing background information. That said, I was a bit disappointed that the tutorial section on the Orange site covers only a smattering of the feature set. I worked through as many as I could, however, and I will say Orange provides an easy-to-explore data mining tool set. Most (and perhaps all) of the statistical functions are available in other open source packages (such as R), but both the convenience of working in Python and the number of built-in analysis functions make getting started simple.

Visualize!

[Orange Canvas]

The ease-of-getting-started point goes double for Orange Canvas, the project's visual programming interface. Or rather, the data mining process is easy once you figure out the "visual programming" paradigm itself. Orange Canvas lets you drag function blocks from the toolbar onto an infinite-in-two-dimensions workspace, then connect the blocks together with hoses: the output from A goes to the input of B, and so on.

It is the same basic idea as a dozen other visual programming editors. My only real criticism of it is that too few of each block's properties are visible on the canvas; you must hover the mouse over a block to see more about it and right-click it to open a property editor.

The reason that this matters is that many of Orange's classes have complex relationships; they often require multiple input connections (such as both data and a learner) that look essentially the same on screen. Some of the block structures struck me as counter-intuitive, too. Conceptually, when you are writing in Python, the TreeLearner is a function that accepts input, and creates a TreeClassifier as its output. But in the visual interface, a "Classification Tree" block has a "learner" as its output node. The reason is that in the block-and-hose design you are intended to hook the learner output directly into a "prediction" block and make use of it, not to manipulate it on its own. But it takes some getting used to.

Still, the visual interface has one killer feature: the ability to instantly hook up blocks to visualizations and manipulate them. Everything from scatterplots to "sieve multigrams" is provided, including a number of chart types you may need to look up in a reference book. Double-clicking on any of the visualization blocks opens a separate window, in which you can manipulate the variables included (and most other properties) and get a live-updated graph of the results. For example, you can plot various attributes against each other in a scatterplot and look for clumps, which would tell you that those attributes are tightly correlated. In contrast, with your own Python code you must edit and re-execute your script to look at each new permutation of attributes.

[Orange Canvas scatterplot]

The Graphviz library does the heavy lifting on the visualizations. When you launch a visualization, you can also use VizRank, a feature that iteratively searches through the possible parameters looking for the ones that produce the most interesting results. For instance, in our forum users example, VizRank will iterate through all possible pairs of attributes (age versus OS, age versus avatar, avatar versus OS, etc.) and rank the pairings by clumpiness. The upshot is that you are spared the time it takes to plot every pair and assess the outcome (not to mention the work of ensuring that you did not accidentally forget a permutation). You can export the visualizations (or selected regions) to image files with a single click.
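The outer loop of that search is straightforward to sketch in plain Python. This is a hedged illustration of the idea only (VizRank's actual scoring is far more sophisticated): try every attribute pair and rank the pairs by how cleanly the two-dimensional projection separates the classes, here scored by simple majority-class purity.

```python
# Hedged sketch of a VizRank-style search (the real scoring differs):
# rank attribute pairs by how well they separate the class labels.
from itertools import combinations

def purity(rows, labels, a, b):
    """Fraction of cases that the majority class of each (a, b) cell
    would classify correctly -- 1.0 means perfect separation."""
    cells = {}
    for row, lab in zip(rows, labels):
        cells.setdefault((row[a], row[b]), []).append(lab)
    correct = sum(max(labs.count(l) for l in set(labs))
                  for labs in cells.values())
    return correct / float(len(rows))

def rank_pairs(rows, labels, n_attrs):
    pairs = combinations(range(n_attrs), 2)
    return sorted(pairs, key=lambda p: -purity(rows, labels, *p))
```

The best-scoring pairs come out first, which is exactly the time-saver the article describes: you look at the top few projections instead of plotting every permutation by hand.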

I am one of those "visual learners" you hear about every now and then; to me, searching through the plots looking for patterns is a far more revealing way of getting to know a data set. I worked with data sets in the hundreds-of-entries range, which does not give one a feel for how Orange might perform on terabyte-scale "big data" mining. That does not mean that Orange is incapable of scaling that high; I simply cannot vouch for it. But that is hardly the point: hundreds and thousands of records are more than enough that a specialized data exploration tool is well worth your while. Orange gives you access to a good Python data-mining toolbox, and Orange Canvas gives you a nearly foolproof way to examine it visually. The project has an active community and is running multiple Google Summer of Code projects to develop new tools, so if the module does not provide the classification or visualization method you need, the odds are good that you can still make it work.




VizRank

Posted Jul 5, 2012 4:04 UTC (Thu) by wahern (subscriber, #37304) [Link]

I had never heard of VizRank before:

VizRank is a tool that finds interesting two-dimensional projections of class-labeled data. When applied to multi-dimensional functional genomics data sets, VizRank can systematically find relevant biological patterns.

That sounds incredibly cool and useful. Now I just need to hunt around for a project to apply this solution to =)

scikit-learn

Posted Jul 5, 2012 19:36 UTC (Thu) by southey (guest, #9466) [Link] (1 responses)

Another Python option is scikit-learn

scikit-learn

Posted Jul 9, 2012 14:40 UTC (Mon) by davide.del.vento (guest, #59196) [Link]

For large scale projects there is also http://www.shogun-toolbox.org/ which is not pure Python but has Python bindings.


Copyright © 2012, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds