By Nathan Willis
July 4, 2012
Orange is a GPLv3 Python module for
mining, classifying, and visualizing data. The main problem it
endeavors to help you solve is machine learning — analyzing and
modeling a set of example data so that you can use it to make predictions
about new data collected in the wild. Although you can use it to write
standard interpreted Python scripts, the project also comes with a
"visual programming" interface. Whether visual programming proves
useful may depend as much on the programmer as on the data, but Orange
makes it simple to explore your data set either way.
The Orange project site provides nightly-build tar downloads, as well as a .deb
package repository. In addition to Python (2.6 or 2.7 only), Orange
uses the Graphviz library
extensively to build visualizations. Orange Canvas, the visual
programming tool, requires Qt4.
The fundamentals
Orange is designed to ingest text-based data files; it understands the
C4.5
file format popular in the machine learning crowd, but it has a
native, tab-delimited file format, too. In Orange's format, the first
three lines are reserved for domain information: the first line holds
the attribute names, the second line holds the data type for each
attribute, and the third line lets you denote special features of
specific attributes. The most important special feature is
class, which designates an attribute as the distinguishing
characteristic of the statistical classes you are out to investigate.
The remainder of the file is data, one case per line.
For example, if you have collected data on people who have registered
for the forum on your project site, you may have a range of attributes
including their country of origin, number of posts, age, whether they
use a custom avatar, number of "thumbs up" votes, and OS. If you are
interested in exploring what makes a forum member eventually become a
contributor, however, a submittedPatch attribute is the one
you would mark as a class. That way you can have Orange
examine all of the other attributes to find out which ones (or which
combinations) accurately predict what will turn a forum member into a
contributor.
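As a rough illustration of that layout, the sketch below builds a tiny
forum_users-style file in memory and reads its three header lines back
using only the Python standard library, so it runs without Orange
installed. The attribute names and rows are invented for this example,
and the type keywords (discrete, continuous) are an assumption about
what Orange accepts on the second line; check the reference
documentation before relying on them.

```python
import csv
import io

# Hypothetical forum data; names and values are made up for illustration.
TAB_TEXT = (
    # Line 1: attribute names
    "country\tposts\tage\tavatar\tos\tsubmittedPatch\n"
    # Line 2: data type of each attribute (assumed keywords)
    "discrete\tcontinuous\tcontinuous\tdiscrete\tdiscrete\tdiscrete\n"
    # Line 3: special flags; only the last attribute is marked as the class
    "\t\t\t\t\tclass\n"
    # Remaining lines: one case per line
    "US\t120\t34\tyes\tlinux\tyes\n"
    "UK\t4\t27\tno\twindows\tno\n"
    "Germany\t57\t41\tyes\tlinux\tyes\n"
)

def parse_tab(text):
    """Split tab-delimited text into (names, types, flags, rows)."""
    lines = list(csv.reader(io.StringIO(text), delimiter="\t"))
    names, types, flags = lines[0], lines[1], lines[2]
    rows = lines[3:]
    return names, types, flags, rows

names, types, flags, rows = parse_tab(TAB_TEXT)
class_attr = names[flags.index("class")]
print("class attribute:", class_attr)
print("cases:", len(rows))
```

Every other attribute is left unflagged, which makes it an ordinary
input for whatever analysis comes next.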
Orange provides functions to automatically compute simple statistical
summaries of your data: means, mean square errors, frequencies and
other basic facts. For example, the following code loads a
data file from forum_users.tab, calls the
orange.DomainDistributions() function on it, and iterates
through the data's discrete attributes, reporting the frequency with
which each value occurs.
    import orange
    data = orange.ExampleTable("forum_users")
    datastats = orange.DomainDistributions(data)
    print "Distributions:"
    for i in range(len(data.domain.attributes)):
        a = data.domain.attributes[i]
        if a.varType == orange.VarTypes.Discrete:
            print "%s:" % a.name
            for j in range(len(a.values)):
                print " %s: %d" % (a.values[j], int(datastats[i][j]))
The output would be of the form:
    country:
     US: 123
     UK: 87
     Germany: 38
     El Salvador: 19
    os:
     linux: 200
     windows: 15
     osx: 9
     vms: 36
     unknown: 7
Orange exposes continuous attributes (i.e., those with floating
point data) in a similar fashion through
orange.VarTypes.Continuous. There are functions for paring
down your data set by filtering on attribute values or by
pseudo-random sampling. You can even take a "stratified" sample,
which means that Orange will grab a sample set that has the same
proportion of each class value as the entire set.
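The idea behind a stratified sample can be sketched in a few lines of
plain Python. This is not Orange's own sampling API; the
stratified_sample() helper, its parameters, and the data are
hypothetical and exist only to show what preserving class proportions
means.

```python
import random
from collections import defaultdict

def stratified_sample(rows, class_of, fraction, seed=0):
    """Return a sample holding roughly `fraction` of each class's rows."""
    rng = random.Random(seed)  # fixed seed keeps the sample repeatable
    by_class = defaultdict(list)
    for row in rows:
        by_class[class_of(row)].append(row)
    sample = []
    for label in sorted(by_class):
        members = by_class[label]
        k = int(round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical data: 80 non-contributors and 20 contributors.
rows = [("user%d" % i, "no") for i in range(80)]
rows += [("user%d" % i, "yes") for i in range(80, 100)]

sample = stratified_sample(rows, class_of=lambda r: r[1], fraction=0.25)
counts = {}
for _, label in sample:
    counts[label] = counts.get(label, 0) + 1
print(counts)
```

The 4:1 ratio between the two classes in the full set carries over to
the sample, which a purely random draw would only approximate.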
Classifying, predicting, and evaluating
You can also see from the example code that Orange tips its hand a
little in its class names. The Python class into which you load your
data is named ExampleTable rather than (say)
DataTable. That is because Orange is intended to load in
"example" data sets that have already been classified (i.e., there is
at least one class attribute), so that you can use them to
generate a classifier algorithm that can correctly predict to
which class items will belong.
Orange includes several different learning algorithms into which you
can feed data sets in order to deduce a good classifier. Called
learners in Orange-speak, some of them are simplistic and are
useful mostly for testing, such as the k-nearest
neighbor algorithm, while others are more robust, such as the naive
Bayes classifier. But the project's centerpiece is its own
implementation of a decision
tree algorithm, which is implemented in its own module named
orngTree.
A decision tree classifies a sample by stepping through each attribute
— hopefully in the most efficient order possible based on the
available examples. The edges of the tree take you either to a final
decision (such as "this user will not become a contributor") or to the
next attribute to evaluate. In Orange, you create the tree by
"training" the learner in one fell swoop with the
orngTree.TreeLearner() function, passing it your data set and
any options you require. This function creates and returns a
classifier that you can subsequently call (on new, unclassified data)
through orngTree.TreeClassifier.
Or at least that is how it is supposed to work. In reality, you
typically have to train, evaluate, tweak, re-train, and re-evaluate
several times. Orange provides a slew of regression tests,
statistical and probability features, and tuning options for better
modeling your data. Most of them will feel vaguely familiar to people
whose last statistics class was more than a couple of years ago.
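To make the idea of training a tree and then calling it as a
classifier concrete, here is a toy ID3-style learner in plain Python.
It is a sketch only, not Orange's implementation: the train() and
classify() names, the tuple-based tree layout, and the six example rows
are all invented for illustration. With Orange itself you would hand
your ExampleTable to orngTree.TreeLearner() instead.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = float(len(labels))
    return -sum((n / total) * math.log(n / total, 2)
                for n in Counter(labels).values())

def best_attribute(rows, attrs, target):
    """Pick the attribute whose split leaves the least weighted entropy."""
    def remainder(attr):
        groups = {}
        for row in rows:
            groups.setdefault(row[attr], []).append(row[target])
        return sum(len(g) / float(len(rows)) * entropy(g)
                   for g in groups.values())
    return min(attrs, key=remainder)

def train(rows, attrs, target):
    """Return a leaf (a class label) or an (attr, branches, default) node."""
    labels = [row[target] for row in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:
        return majority
    attr = best_attribute(rows, attrs, target)
    rest = [a for a in attrs if a != attr]
    branches = {}
    for value in set(row[attr] for row in rows):
        subset = [row for row in rows if row[attr] == value]
        branches[value] = train(subset, rest, target)
    return (attr, branches, majority)

def classify(tree, row):
    """Walk from the root to a leaf, following the row's attribute values."""
    while isinstance(tree, tuple):
        attr, branches, default = tree
        tree = branches.get(row[attr], default)  # unseen value: use default
    return tree

# Hypothetical, already-classified examples.
examples = [
    {"os": "linux",   "avatar": "yes", "submittedPatch": "yes"},
    {"os": "linux",   "avatar": "no",  "submittedPatch": "yes"},
    {"os": "windows", "avatar": "yes", "submittedPatch": "no"},
    {"os": "windows", "avatar": "no",  "submittedPatch": "no"},
    {"os": "osx",     "avatar": "yes", "submittedPatch": "yes"},
    {"os": "osx",     "avatar": "no",  "submittedPatch": "no"},
]

tree = train(examples, ["os", "avatar"], "submittedPatch")
prediction = classify(tree, {"os": "osx", "avatar": "yes"})
print(prediction)
```

In this made-up data the OS attribute separates the classes best, so it
ends up at the root, and the avatar attribute is consulted only for the
one OS value where the examples disagree.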
There are also a number of related features for digging into your data
set and finding correlations and hidden relationships between
attributes. Among the available options are association
rules (i.e., learning that certain combinations of values tend to
occur together), clustering
(attempting to partition the data into discrete groups), and self-organizing
maps (which attempt to find patterns in the data by examining its
topological features in 2D or 3D). In some cases, the end goal of
these analytical techniques might be to build or optimize your
classifier, but you may simply be out to find unusual properties in
the data set or locate interesting outliers.
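For a feel of what an association rule measures, the following
plain-Python sketch computes the two usual figures for a single rule:
support (how often the whole combination occurs) and confidence (how
often the right-hand side holds when the left-hand side does). The
rule_stats() helper and the four rows are hypothetical; Orange's own
association-rule functions are not used here.

```python
def rule_stats(rows, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    def matches(row, conditions):
        return all(row.get(k) == v for k, v in conditions.items())

    n_ante = sum(1 for r in rows if matches(r, antecedent))
    n_both = sum(1 for r in rows
                 if matches(r, antecedent) and matches(r, consequent))
    support = n_both / float(len(rows))
    confidence = n_both / float(n_ante) if n_ante else 0.0
    return support, confidence

# Hypothetical forum members.
rows = [
    {"os": "linux",   "avatar": "yes"},
    {"os": "linux",   "avatar": "yes"},
    {"os": "linux",   "avatar": "no"},
    {"os": "windows", "avatar": "no"},
]

support, confidence = rule_stats(rows, {"os": "linux"}, {"avatar": "yes"})
print("support=%.2f confidence=%.2f" % (support, confidence))
```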
A lot of the data mining options implemented in Orange are outside my
personal or professional experience, although the reference
documentation does an admirable job of providing background
information. That said, I was a bit disappointed that the tutorial
section on the Orange site covers only a smattering of the feature
set. I worked through as many as I could, however, and I will say that
Orange provides a very easy-to-explore data mining tool set. Most (and
perhaps all) of the statistical functions are available in other open
source packages (such as R), but both the convenience of working in
Python and the number of built-in analysis functions make getting
started simple.
Visualize!
The ease-of-getting-started point goes double for Orange Canvas, the
project's visual programming interface. Rather, the data mining
process is easy, once you figure out the "visual programming" paradigm
itself. Orange Canvas works by letting you drag function blocks from
the toolbar onto an infinite-in-two-dimensions workspace, then connect
the blocks together with hoses. The output from A goes to the input
for B, and so on.
It is the same basic idea as a dozen other visual programming
editors. My only real criticism of it is that too few of the
properties of the blocks in question are visible on the canvas; you
must hover the mouse over each block to see more about it and
right-click it to open a property editor.
The reason that this matters is that many of Orange's classes have
complex relationships; they often require multiple input connections
(such as both data and a learner) that look essentially the same on
screen. Some of the block structures struck me as counter-intuitive,
too. Conceptually, when you are writing in Python, the TreeLearner is
a function that accepts input, and creates a TreeClassifier as its
output. But in the visual interface, a "Classification Tree" block
has a "learner" as its output node. The reason is that in
the block-and-hose design you are intended to hook the learner output
directly into a "prediction" block and make use of it, not to
manipulate it on its own. But it takes some getting used to.
Still, the visual interface has one killer feature: the ability to
instantly hook up blocks to visualizations and manipulate them.
Everything from scatterplots to "sieve multigrams" is provided,
including a number of chart types you may need to look up in a
reference book. Double-clicking on any of the visualization blocks
opens a separate window, in which you can manipulate the variables
included (and most other properties) and get a live-updated graph of
the results. For example, you can plot various attributes against
each other in a scatterplot and look for clumps, which would tell you
that those attributes are tightly correlated. In contrast, with your
own Python code you naturally must edit and re-execute your script to
look at each new permutation of attributes.
The Graphviz library does the heavy lifting on the visualizations.
When you launch a visualization, you can also use VizRank, a feature
that iteratively searches through the possible parameters looking for
the ones that produce the most interesting results. For instance, in
our forum users example, VizRank will iterate through all possible
pairs of attributes (age versus OS, age versus avatar, avatar versus
OS, etc.) and rank the pairings by clumpiness. The upshot is that you
are spared the time it takes to plot every pair and assess the outcome
(not to mention the work of ensuring that you did not accidentally
forget a permutation). You can export the visualizations (or selected
regions) to image files with a single click.
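The pair-by-pair search described above can be sketched with
itertools.combinations. The scoring function here, the absolute Pearson
correlation, is only a stand-in for whatever measure of clumpiness
VizRank actually uses, and the rank_pairs() helper and the column data
are invented for illustration.

```python
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = float(len(xs))
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def rank_pairs(columns, score):
    """Score every unordered pair of columns and sort best-first."""
    scored = [(abs(score(columns[a], columns[b])), a, b)
              for a, b in combinations(sorted(columns), 2)]
    return sorted(scored, reverse=True)

# Hypothetical continuous attributes for six forum members.
columns = {
    "age":    [21, 25, 30, 35, 40, 45],
    "posts":  [5, 12, 20, 33, 41, 48],
    "thumbs": [9, 2, 7, 1, 8, 3],
}

ranking = rank_pairs(columns, pearson)
for value, a, b in ranking:
    print("%s vs %s: %.3f" % (a, b, value))
```

Because combinations() enumerates each unordered pair exactly once,
there is no risk of skipping or double-counting a pairing.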
I am one of those "visual learners" you hear about every now and then;
to me, searching through the plots looking for patterns is a far more
revealing way of getting to know a data set. I worked with data sets
in the hundreds-of-entries range, which does not give one a feel for
how Orange might perform on terabyte-scale "big data" mining. That
does not mean that Orange is incapable of scaling that high; I simply
cannot vouch for it. But that is hardly the point: hundreds and
thousands of records are more than enough that a specialized data
exploration tool is well worth your while. Orange gives you access to
a good Python data-mining toolbox, and Orange Canvas gives you a
nearly foolproof way to examine it visually. The project has an
active community and is running multiple Google Summer of Code
projects to develop new tools, so if the module does not provide the
classification or visualization method you need, the odds are good
that you can still make it work.
Brief items
There’s absolutely no way that the Mozilla Foundation can personally host events to teach web making to the world. But I do think that if we do this right, we can build a movement that others build on, and make their own.
We can teach people to teach people to teach people to fish. And soon there will be no fish left in the sea. (But in this case, that means “everyone learns web making” so it’s not quite so ecological disastery. … I hope.)
—
Michelle Levesque
I'm all for code cleanups, UI cleanups, and anything that can improve
our state of affairs. Forgive me for using a strong word, but
destroying working functionality without explaining exactly what made it
bad, and without trying to fix it first, is just vandalism.
—
Federico Mena Quintero
A quick clarification - -building- LibreOffice tends to discover poor
thermal management in systems better than anything else I've seen ;-)
—
Michael Meeks
Version 2.16 of the GNU C library is out. Significant changes include
support for the
x32 ABI,
various bits of ISO C11 support, a number of performance improvements, and lots
of bug fixes. This version of glibc is not supported on Linux kernels
prior to 2.6.
Version 2.00 of the GNU GRUB bootloader has been released. "Since
this version has a round number it has been paid special attention to, and
hopefully, represents higher quality." Improvements include ports
to a number of relatively obscure architectures (Itanium, Fuloong2F, ...),
improved device and filesystem support, some additional boot protocols, and more.
The Rakudo Perl distribution has released "Rakudo Star," which it describes as "a useful, usable, "early adopter" distribution of Perl 6." There are numerous improvements listed, including enhanced list and .map handling and the addition of the same regular expression engine used in user-space, which fixes several parsing bugs.
Newsletters and articles
Mike Linksvayer shares his reflections on the importance of the GPLv3, which was released five years ago today. "I suggest that number (add qualifiers of and scaling by importance, quality, etc, as you wish) of works under GPLv3 or use of GPLv3 relative to other licenses are less important markers of GPLv3′s success, and that of the broader FLOSS community, than the number and preponderance of works under GPLv3-compatible terms."
Libre Graphics World is running a comparison of Linux applications used to drive cutting machines for vinyl, cardstock, and other materials. "Most cutting devices rely on HPGL printer control language and its versions such as CAMM-HPGL (Roland). So the job is, essentially, to take a vector graphics file and convert it to HPGL, then send it to the device along with control commands such as blade speed and pressure." Looks like several options are available, including Inkscape extensions and stand-alone programs.
O'Reilly Radar has published an interview with author David Griffiths on the continued relevance of teaching C. The main interview is in video form, with text excerpts highlighting specific points, such as "For example, it teaches how memory works in a more profound way (a concept systems programmers will likely already know, though new programmers in specialized fields might not)" and "It's an important, foundational language that requires you to understand the full stack of the technology."
Page editor: Nathan Willis