This year, FOSDEM had a Data Analytics developer room, which turned out to be quite popular with the assembled geeks in Brussels: during many of the talks the room was fully packed. This first meeting about analyzing and learning from data had talks covering information retrieval, large-scale data processing, machine learning, text mining, data visualization, and Linked Open Data, all implemented using open source tools.
Mapping WikiLeaks cables
One of the most inspiring talks in the data analytics track, which
showed just how much you can do with open source tools in data visualization, was Mapping WikiLeaks' Cablegate using Python, mongoDB, Neo4j and Gephi by Elias Showk and Julian Bilcke, two software engineers at the Centre National de la Recherche Scientifique (National Center for Scientific Research in France). Their goal was to analyze the full text of all published WikiLeaks diplomatic cables, to produce occurrence and co-occurrence networks of topics and cables, and finally to visualize how the discussions in the cables relate to each other. In short, they did this by analyzing the 3,300 cables with Python and some data extraction libraries, then they used MongoDB and Neo4j to store the documents and generate graphs, and finally they visualized and explored the graphs with Gephi.
The first step in this process, presented by Showk, is importing the cables. Luckily, the WikiLeaks cables follow a simple structure that makes this relatively easy. Showk based his work on the cablegate Python code by Mark Matienzo, which scrapes data from the cables in HTML form and converts them to Python objects. For the HTML scraping, the code uses Beautiful Soup, a well-known Python HTML/XML parser that automatically converts web pages to Unicode and can cope with errors in the HTML tree. Moreover, with a SoupStrainer object, you can tell the Beautiful Soup parser to target a specific part of the document and skip the boilerplate, such as the header, footer, sidebars, and supporting information.
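As a rough sketch of that scraping step, a SoupStrainer can restrict parsing to just the element that holds the cable text (the HTML and the "text" class here are made up for the example; the real cable pages have their own markup):

```python
from bs4 import BeautifulSoup, SoupStrainer

html = """<html><head><title>09STATE12345</title></head>
<body><div class="header">site boilerplate</div>
<div class="text"><p>Cable body text here.</p></div>
<div class="footer">more boilerplate</div></body></html>"""

# Only parse the div holding the cable text; the header, footer,
# and sidebars never even enter the parse tree.
only_text = SoupStrainer("div", attrs={"class": "text"})
soup = BeautifulSoup(html, "html.parser", parse_only=only_text)
body = soup.get_text(strip=True)
```

Besides cutting out the boilerplate, parsing only the interesting fragment keeps memory use down when you run this over thousands of cables.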
After the parsing, the Python natural language toolkit NLTK is used on the text body to bring more structure to the word scramble, with the goal of extracting some topics. The first step is tokenization: NLTK makes it easy to break a text into sentences and each sentence into its separate words. Then the stem of each word is determined, which means that all words are grouped by their root. For example, to analyze the topics of the WikiLeaks cables, it doesn't matter whether a text contains the word "language" or "languages", so both are grouped under their root "languag". A SHA-256 hash value of each stem is then used as a database index.
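A minimal sketch of that stemming and indexing step might look as follows (a plain regular expression stands in for NLTK's word_tokenize, which first needs the "punkt" model downloaded; the exact pipeline used by Showk may differ):

```python
import hashlib
import re

from nltk.stem import PorterStemmer  # NLTK's classic stemming algorithm

stemmer = PorterStemmer()

def index_terms(text):
    """Tokenize a text, stem each word, and key the counts by the
    SHA-256 hash of the stem, as the cablegate pipeline does."""
    # Crude tokenizer: lowercase and keep runs of letters
    words = re.findall(r"[a-z]+", text.lower())
    terms = {}
    for word in words:
        stem = stemmer.stem(word)   # "language"/"languages" -> "languag"
        key = hashlib.sha256(stem.encode("utf-8")).hexdigest()
        entry = terms.setdefault(key, {"stem": stem, "count": 0})
        entry["count"] += 1
    return terms
```

Hashing the stem gives every topic a fixed-length, collision-resistant identifier, which is convenient as a database key.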
MongoDB, a document-oriented database, is used as storage for all this data. MongoDB transparently inserts and reads records as Python dictionaries, automatically serializing and deserializing the objects. Showk then queried the MongoDB database to extract the most frequent occurrences and co-occurrences of words, and converted the result into a graph using the Neo4j graph database.
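The co-occurrence counting at the heart of that query can be sketched in plain Python (in the real pipeline the per-cable topic sets come out of MongoDB and the resulting weighted edges go into Neo4j; the cable IDs and topics below are made up):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(cables):
    """Count how often each pair of topics appears in the same cable.

    `cables` maps a cable identifier to the set of topic stems found
    in that cable's text.
    """
    pairs = Counter()
    for topics in cables.values():
        # Every unordered pair of topics in one cable is one co-occurrence;
        # sorting makes (a, b) and (b, a) the same edge.
        for a, b in combinations(sorted(topics), 2):
            pairs[(a, b)] += 1
    return pairs

cables = {
    "09STATE001": {"nuclear", "iran", "sanction"},
    "09STATE002": {"iran", "sanction"},
}
edges = cooccurrence_counts(cables)
```

Each counted pair becomes an edge between two topic nodes, with the count as the edge weight; dropping low-weight edges is then a simple way to keep only the "heaviest" co-occurrences.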
For the final step, visualizing and analyzing the data, Bilcke used Gephi, an open source desktop application for the visualization of complex networks. Gephi, to which Bilcke is an active contributor, is a research-oriented graph visualization tool that has been used in the past to visualize some interesting graphs, like open source communities and social networks on LinkedIn. It's based on Java and OpenGL, but it also has a headless library, the Gephi Toolkit.
So Bilcke imported the graph from the Neo4j graph database into Gephi, and then did some manual data cleaning. The graph is quite dense and has a lot of meaningless content, so some post-processing was needed, such as sorting and filtering. Bilcke chose the OpenOrd layout, one of the few force-directed layout algorithms that can scale to over 1 million nodes, which makes it ideal for the WikiLeaks graph. He only had to remove some artifacts, tweak the appearance slightly, and finally export the graph to PDF and GEXF (Gephi's native file format).
In total, the two French researchers did a full week of coding, during which they wrote 600 lines of code using four external libraries, two database systems, and one visualization program. All the tools they used are open source, as is their code, so this is quite a nice testament to what can be done with open source tools in the field of big data visualization. Running the whole workflow from the original WikiLeaks HTML files to the final graph takes around five hours.
Showk and Bilcke did all this in their free time in order to learn about these technologies. Their goal was to show that any hacker can convert a corpus of textual data into a graph that makes its topics easier to explore. This could be used to find some interesting new things, but the two researchers lacked the time to do so and were more interested in the technical side. In an email, Bilcke clarified:
Since we worked on the publicly released cables, we didn't expect any more secrets than what had been already published by media like The Guardian. So, we didn't find any unexpected secrets. Moreover, don't forget that there is a publication bias: we only see what has been released and censored by WikiLeaks.
Our maps are mostly a tool for exploration, to help people dig into large datasets of intricate topics. In a sense, our visualization helps seeing the general topics and dig into the hierarchy, level after level. You can see potentially interesting cable stories at a glance, just by looking at what seem to be clusters (sub-networks) in the map, and zoom in for details. We believe this can be used as a complement to other cablegate tools we have seen so far.
The result is published in the form of two graphs, which can be explored by anyone who wants to dig into the WikiLeaks cables. One graph, with 43,179 nodes and 237,058 edges, links topics to the cables they occur in. The other graph, with 39,808 nodes and 177,023 edges, only shows the topics and links them when they co-occur in the same cable. Interested readers can view the PDF or SVG files, but the best way is to load the .gephi files into Gephi, so you can interactively explore the graphs. For graphs of this size, though, the Gephi system requirements suggest 2 GB of RAM.
Semantic data heaven
One of the other talks was about Datalift, an experimental research project
funded by the already mentioned National Center for Scientific Research in
France. Its goal is to convert structured data in various formats (like
relational databases, CSV, or XML) to semantic data that can be
interlinked. According to François Scharffe, only when open data meets the semantic web will we truly see a data revolution: big chunks of data sitting on an island are difficult to reuse, while data in a common format with semantic information (like RDF) paves the way for richer web applications, more precise search engines, and many other advanced applications. Scharffe referred to Tim Berners-Lee's five-star rating system:
If your data is available on the web with an open license, it gets one star. If it's available as machine-readable structured data, such as an Excel file instead of an image scan of a table, it gets two stars. The interesting things begin when you use a non-proprietary format like CSV, which gives the data three stars. To become part of the semantic web, you need open standards to identify and search for things, like RDF and SPARQL, which earns four stars. And to reach data heaven (five stars), you finally need to link your data to other people's data, so you benefit from the network effect.
The Datalift project is currently developing tools to facilitate all steps in the process from raw data to published linked data: selecting the right vocabulary (e.g. FOAF for persons, or GeoNames for geographic locations), converting the data to RDF, publishing it on the web, and interlinking it with other existing data sources. There are already open source
solutions for all these steps. For example, D2R Server
maps a relational database on-the-fly to RDF, and Triplify does the same for web applications
like a blog or content management system. For the publication of RDF in a
human-readable form, there is the Tabulator Firefox
extension. The Datalift project is trying to streamline this whole process.
No shortage of tools
All of the talks in the data analytics developer room were quite short, from 15 to 30 minutes, which allowed a lot of projects to be covered. Apart from the WikiLeaks and Datalift talks, there were talks about graph databases and NoSQL databases, about query languages, about analyzing and understanding large corpora of text data using Apache Hadoop, about various tools and methods for data extraction from HTML pages, about machine learning with Python, and about a real-time search engine using Apache Lucene and S4.
The whole data analytics track showed that there's no shortage of open
source tools to deal with big amounts of data. That's good news for
statisticians and other "data scientists", whose job Google's Hal Varian called "the sexy job in the next ten years". In an article in The McKinsey Quarterly from January 2009, he wrote: "The ability to take data - to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it - that's going to be a hugely important skill in the next decades." Looking at all the talks at
the data analytics track at FOSDEM, it's clear that open source software
will play a big role in this trend. If the track is hosted again at FOSDEM 2012, it's going to need a bigger room.