
LWN.net Weekly Edition for March 4, 2010

SCALE 8x: Relational vs. non-relational

By Jake Edge
March 3, 2010

PostgreSQL hacker Josh Berkus set out to do some "mythbusting" about differences in database technologies in his talk at SCALE 8x. While there are plenty of differences between the various approaches taken by database systems, those are not really the ones that are being highlighted by the technical press. In particular, the so-called "NoSQL movement" makes for a great soundbite, but is "not very informative or accurate". Berkus went on to survey the current database landscape while giving advice on how to approach choosing a database for a particular application.

This is a "more exciting time" to be a "database geek" than ever before, he said. Looking back seven years to 2003, he noted that there were essentially seven different free choices, all of which were SQL-based. In 2010, there are "dozens of new databases breeding like rabbits", with some 60 choices available. As an example of how quickly things are moving, Berkus noted that while he was in New Zealand at linux.conf.au, where a colleague was giving a related talk, two new databases were released.

Mythbusting

Berkus likened the NoSQL term to a partition that is created by putting dolphins, clown fish, and 1958 Cadillacs on one side and octopuses, Toyota Priuses, and redwood trees on the other, with the second group labeled "NoFins". The non-relational databases that are lumped together as NoSQL have "radically different" organizations and use cases. But that's not just true of the non-relational databases; it's true of the various relational databases as well.

Another myth that he pointed out was the "revolutionary" tag that gets associated with all of the new types of databases. Once again, that is a convenient soundbite that isn't accurate. He has not seen a new database algorithm since 2000, and all of the new crop of database systems are new implementations and combinations of earlier techniques. The new systems are not revolutionary, just evolutionary.

As an example, he put up a slide with the following description of a database: "A database storing application-friendly formatted objects, each containing collections of attributes which can be searched through a document ID, or the creation of ad-hoc indexes as needed by the application." He noted that it applies equally well to one of his current favorites, CouchDB, which was created in 2007, and to the Pick database system—the original object of the description—which was created in 1965.
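The slide's description can be made concrete with a toy in-memory store. This is only a minimal sketch of the general idea (documents fetched by ID, plus indexes built on demand); the class and method names are invented for illustration and do not reflect the actual interface of CouchDB or Pick.

```python
from collections import defaultdict


class DocumentStore:
    """Toy document store: lookup by document ID plus ad-hoc indexes.

    Hypothetical illustration only; not the API of any real database.
    """

    def __init__(self):
        self.docs = {}      # document ID -> dict of attributes
        self.indexes = {}   # attribute name -> {value -> set of document IDs}

    def put(self, doc_id, doc):
        # If the document is being overwritten, drop its stale index entries.
        old = self.docs.get(doc_id)
        if old is not None:
            for attr, index in self.indexes.items():
                if attr in old:
                    index[old[attr]].discard(doc_id)
        self.docs[doc_id] = doc
        # Keep every existing index up to date with the new document.
        for attr, index in self.indexes.items():
            if attr in doc:
                index[doc[attr]].add(doc_id)

    def get(self, doc_id):
        return self.docs.get(doc_id)

    def create_index(self, attr):
        # Scan all stored documents once to build the index.
        index = defaultdict(set)
        for doc_id, doc in self.docs.items():
            if attr in doc:
                index[doc[attr]].add(doc_id)
        self.indexes[attr] = index

    def find(self, attr, value):
        # The index is created the first time the application asks for it.
        if attr not in self.indexes:
            self.create_index(attr)
        return sorted(self.indexes[attr].get(value, set()))
```

A call such as `store.find("customer", "acme")` builds the "customer" index on first use and reuses it afterward, which is the "ad-hoc indexes as needed by the application" part of the description.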

Instead of a revolution, what we are seeing now is a "renaissance of non-relational databases". That description is far more accurate, Berkus said, and is a better way to view the change. It is a "big thing" that is going to "change the way that people use databases", so it is important to label it correctly.

Another myth is that non-relational databases are "toys", which is something that is often pushed by people who work on relational systems. Berkus pointed out that many SCALE sponsors would disagree: Google using Bigtable, Facebook using Memcached, Amazon with Dynamo, and so on.

The other side of that myth is that relational databases will become obsolete. Unsurprisingly, that myth is often promulgated by those who work on non-relational databases, and it is something that the relational community has heard before. Berkus pointed to a keynote speech in 2001 proclaiming that relational databases would be replaced with XML databases. He then asked if anyone even remembered or used XML databases; when even the crickets were silent, he pointed out that various relational and non-relational databases had hybridized with XML databases, incorporating the best features of XML databases into existing systems. He predicted that "over the next five years, we will see more hybridization" between different types of database technologies.

"Relational databases are for when you need ACID transactions" was myth number five. Support for transactions is "completely orthogonal" to the relational vs. non-relational question. There are systems like Berkeley DB and Amazon Dynamo that provide robust transactions in non-relational databases, as well as MS Access and MySQL that provide SQL without transactions.

The final myth that needs busting is the Lord of the Rings inspired "one ring theory of database use", Berkus said. There is "absolutely no reason" to choose one database for all of one's projects. He recommends choosing the database system that fits the needs of the application, or using more than one, such as MySQL with Memcached or PostgreSQL with CouchDB. Another alternative is to use a hybrid, like MySQL NDB, which puts a distributed object database as a back-end to MySQL, or HadoopDB which puts PostgreSQL behind the Hadoop MapReduce implementation.
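One common way of pairing a relational database with a key-value cache is to read through the cache and fall back to SQL on a miss. The sketch below is an assumption about how such a pairing might look, not something shown in the talk: a plain dictionary stands in for a Memcached client, SQLite stands in for MySQL, and the `CachedUserStore` class and `users` table are hypothetical.

```python
import sqlite3


class CachedUserStore:
    """Cache-aside reads: try the cache first, then the SQL database."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}     # stand-in for a Memcached client (assumption)
        self.db_reads = 0   # counts how often the database had to be queried

    def get_name(self, user_id):
        key = f"user:{user_id}"
        if key in self.cache:
            return self.cache[key]
        # Cache miss: query the relational database and remember the answer.
        self.db_reads += 1
        row = self.conn.execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        name = row[0] if row else None
        if name is not None:
            self.cache[key] = name
        return name

    def set_name(self, user_id, name):
        # The connection context manager commits, or rolls back on error.
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO users (id, name) VALUES (?, ?)",
                (user_id, name),
            )
        # Invalidate the cached copy so the next read sees the new value.
        self.cache.pop(f"user:{user_id}", None)


def make_demo_store():
    """Build an in-memory database with one user, for experimentation."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    store = CachedUserStore(conn)
    store.set_name(1, "alice")
    return store
```

The database remains the authoritative copy in this arrangement; the cache only absorbs repeated reads, which is why writes invalidate the cached entry rather than trusting it.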

So, what about relational vs. non-relational?

Relational databases provide better transaction support than non-relational databases do, mostly because of the age and maturity of relational databases, Berkus said. Transaction support is something that many open source people don't know about because the most popular database (MySQL) doesn't implement it. Relational databases enforce data constraints and consistency because that is the basis of the relational model. There are other benefits of today's relational databases, he said, including complex reporting capabilities and vertical scaling to high-end hardware. He also noted that horizontal scaling was not that well-supported and that relational databases tend to have a high administrative overhead.

On the question of SQL vs. Not-SQL, Berkus outlined the tradeoffs. SQL promotes portability and multiple-application access, and it has ways to manage database changes over time. There are many mature tools to work with SQL, but SQL is a full programming language that must be learned to take advantage of it. Not-SQL allows fast interfaces to the data, without impedance-matching layers, which in turn allows for faster development. Typically, there are no separate database administrators (DBAs) for Not-SQL databases, with programmers acting in that role.

"It's always a tradeoff", Berkus said, but one place that a SQL-relational database makes the most sense is where you have "immortal data". If the data being stored has a life independent of the specific application and needs to be available to new applications down the road, SQL-relational is probably the right choice.

How to choose

For other situations, you need to define the "features you actually need to solve that particular problem" plus another list of features you'd like, "then go shopping". Chances are, he said, there is a database or combination of databases that fits your needs. He then went on to some specific application requirements, suggesting possible choices of database or databases to satisfy them.

  • I need a database for my blog: "use anything", including MySQL, PostgreSQL, SQLite, CouchDB, flat files, DBase III, etc. Pick "whatever is easiest to install" because "it doesn't matter".

  • I need my database to unify several applications and keep them consistent: For example, a data warehousing application written in C/C++ with reporting tools in Ruby on Rails should use an OLTP SQL-Relational database like PostgreSQL. He also couldn't resist noting that the PostgreSQL 9 alpha was released the day before: "download it and test it out".

  • I need my application to be location aware: a geographical database, such as PostGIS, is needed. Geographical databases allow queries like "what's near" and "what's inside".

  • I need to store thousands of event objects per second on embedded hardware: db4o (from db4objects) is probably the right choice, but SQLite might also be considered.

  • I need to access 100K objects per second over thousands of web connections: Memcached is a distributed in-memory key-value store, which is used by all of the biggest social networks. It can be used as a supplement to a back-end relational database. He also mentioned Redis and Tokyo Tyrant as possible alternatives.

  • I have hundreds of government documents I need to serve on the web and mine for data: It's hard to get the government to release the data, so the structure of the data may not come with it, which means that the structure must be derived from examining the documents. For that, he suggests CouchDB.

  • I have a social application and I need to know who-knows-who-knows-who-knows-who-knows-who: This is a very hard problem for relational databases and what's needed is a graph database such as Neo4j. Long chains of relationships are difficult for relational databases, but graph databases, used in conjunction with another database, can handle these kinds of queries, as well as queries to find items "you may also like".

  • and so on ...
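The who-knows-who case in the list above is, at bottom, a graph traversal. As a rough illustration of why it suits a graph representation, here is a breadth-first search over an adjacency list; it is a sketch of the query's shape only, with a hypothetical `within_hops` helper, and says nothing about how Neo4j stores or traverses data.

```python
from collections import deque


def within_hops(graph, start, max_hops):
    """Return {person: hop count} for everyone reachable within max_hops.

    `graph` maps each person to a list of people they know directly.
    """
    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        # Do not expand beyond the requested chain length.
        if distance[person] == max_hops:
            continue
        for friend in graph.get(person, ()):
            if friend not in distance:
                distance[friend] = distance[person] + 1
                queue.append(friend)
    del distance[start]  # the starting person is not part of the answer
    return distance
```

Each additional hop is one more step along stored edges here, whereas a relational schema would typically need one more self-join on the friendship table for every link in the chain.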

The slides [PDF] from Berkus's talk have additional examples. The basic idea is that "different database systems do better at different tasks" and it is impossible for any database system to do everything well, "no matter what a vendor or project leader may claim". For those who are looking for open source solutions, he recommended the Open Source Database survey which Selena Deckelmann has put together. While it is, as yet, incomplete, it does list around a dozen lesser-known database systems.

It is clear from the talk that it is an exciting time to be a database developer, or user for that matter. There are many different options to choose from, each with its own strengths and weaknesses, some of which can be combined in interesting ways. It is also very clear that there are many more axes to the database graph than just the overly simplified SQL vs. NoSQL axis that seems to dominate coverage of these up-and-coming database systems.


Apple's patent attack

By Jonathan Corbet
March 2, 2010
Software patents have long been the source of a great deal of concern in the free software community; patents are by far the biggest restraint on our ability to program our own computers. Those who worry about these things have expected that attacks might come from patent trolls, or from software companies with fading prospects. Apple's lawsuit against HTC shows that the real threat may come from a different direction.

HTC is not normally thought of as a Linux company; it is a Taiwanese manufacturer which provides cellular phone handsets to a number of other companies. HTC has only recently begun promoting phones under its own name; as it happens, a number of those run Android. Since Android increasingly looks like the base for some of the strongest competition against Apple's products, this suit certainly has the look of an attack against Android and not just an action against one hardware manufacturer. Indeed, Android is named specifically in both components of the attack.

There are some 20 patents named in Apple's actions. Ten of them are named in the patent infringement suit filed in Delaware:

  1. #7,362,331: Time-based, non-constant translation of user interface objects between states. Filed in 2001, this patent covers basic animated movement of objects in graphical user interfaces; the core "innovation" seems to be that the function for determining the object's velocity is not constant. Apple has patented acceleration of objects on the screen.

  2. #7,479,949: Touch screen device, method, and graphical user interface for determining commands by applying heuristics. This patent was filed in April, 2008; Steven Jobs is the first on a long list of inventors. This patent claims the use of heuristics to determine whether a finger movement on a touchscreen display is vertical, diagonal, or is a "next item" selection.

  3. #7,657,849: Unlocking a device by performing gestures on an unlock image. This patent (2005) covers pretty much what it says; its requirement for "moving an unlock image" along the path suggests a fairly straightforward workaround might be possible.

  4. #7,469,381: List scrolling and document translation, scaling, and rotation on a touch-screen display (2007). This one is complex, but seems to cover the practice of "bouncing" the display when scrolled past the end of a document or list.

  5. #5,920,726: System and method for managing power conditions within a digital camera device (1997). This is a hardware-related patent covering the process of powering down a digital camera in response to a low-power situation.

  6. #7,633,076: Automated response to and sensing of user activity in portable devices (2006). This is a technique for filtering out touchscreen events resulting from putting a phone to one's ear. It requires the existence of a "proximity sensor" to determine whether a human is sufficiently close to the device.

  7. #5,848,105: GMSK signal processors for improved communications capacity and quality (1996) is a signal-processing algorithm meant to improve interference rejection.

  8. #7,383,453: Conserving power by reducing voltage supplied to an instruction-processing portion of a processor (2005). This hardware patent appears to be well described by its title; it covers a processor which can turn off its clock and reduce its operating voltage.

  9. #5,455,599: Object-oriented graphic system (1995). By a broad reading, this patent would appear to cover just about any graphical system which maps between objects stored in memory and a representation on the display.

  10. #6,424,354: Object-oriented event notification system with listener registration of both interests and methods (1999). The highly innovative technique of allowing one object to register an interest in changes to a second object and receive notifications is covered. This patent is owned by the "Object Technology Licensing Corporation" which is located at 1 Infinite Loop, Cupertino - strangely enough, that's where Apple is located too.

Additionally, Apple has filed with the US International Trade Commission with the purpose of blocking the import of HTC's products into the US. That filing names a different, generally older, and more fundamental set of patents:

  1. #5,481,721: Method for providing automatic and dynamic translation of object oriented programming language-based message passing into operation system message passing using proxy objects (1994). This patent covers sending messages between two objects in separate processes by way of "proxy objects" which translate the message for transmission. Remote procedure calls, in other words.

  2. #5,519,867: Object-oriented multitasking system (1993) covers the entirely non-obvious technique of supplying an object-oriented wrapper around a procedural operating system's process creation and manipulation system calls.

  3. #5,566,337: Method and apparatus for distributing events in an operating system (1994). Here Apple claims the technique of maintaining a list of events and processes interested in those events, then distributing notifications to the processes when the events happen. Broadly read, this patent could cover Unix signals, the select() system call, or the X Window System event notification mechanism - all of which predate the patent by years.

  4. #5,929,852: Encapsulated network entity reference of a network component system (1998). An object is created to provide a graphical representation of a "network resource." When the user clicks on the representation, information about the resource is displayed.

  5. #5,946,647: System and method for performing an action on a structure in computer-generated data (1996). This technique covers "recognizing structures" in data and allowing users to act upon those structures. Think, for example, of recognizing a phone number on a web page, then allowing the user to call the number or store it in a contacts list.

  6. #5,969,705: Message protocol for controlling a user interface from an inactive application program (1997). This one covers the idea of an interactive program forking a worker process to do some processing and letting that worker process provide information which is shown in the user interface.

  7. #6,275,983: Object-oriented operating system (1998). Another Object Technology Licensing Corp. special, this one covers the concept of providing object-oriented wrappers to procedural system calls; the one additional twist is that those wrappers are dynamically loaded at run time if need be.

  8. #6,343,263: Real-time signal processing system for serially transmitted data (1994). A computer with a "realtime signal processing subsystem" and a programming API allowing that subsystem to be used. Something that looks, say, like a computer with a cellular network radio attached.

  9. #5,915,131: Method and apparatus for handling I/O requests utilizing separate programming interfaces to access separate I/O services (1995). This patent appears to cover the idea of providing different APIs for access to different types of devices. Something like ioctl(), perhaps.

  10. #RE39,486: Extensible, replaceable network component system (2003, a reissue of 6,212,575 from 1995). Essentially, this is the technique of building objects around different network protocols so that they all appear the same to higher-level software and users.

A few of the patents are hardware-related and don't have much to do with Linux. Many of the rest, however, purport to cover fundamental programming techniques. It would appear that Apple wants to take Android out of the picture - or at least extract substantial rents for its continued existence. But many of these patents, if upheld, could have an influence far beyond Android.
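To give a sense of how basic some of these techniques are, the listener-registration idea that appears in several of the patents above can be written in a few lines. The sketch below is a generic illustration of the well-known observer pattern, with hypothetical names (`EventHub`, `register`, `notify`); it is not an analysis of any patent's claims.

```python
from collections import defaultdict


class EventHub:
    """Minimal observer pattern: register interest, get notified later."""

    def __init__(self):
        self._listeners = defaultdict(list)  # event name -> list of callbacks

    def register(self, event, callback):
        # A listener declares which event it cares about and how to be called.
        self._listeners[event].append(callback)

    def notify(self, event, payload):
        # Iterate over a copy so callbacks may register new listeners safely.
        for callback in list(self._listeners.get(event, ())):
            callback(payload)
```

Any object can call `register()` with a function of its own, and every registered function runs when `notify()` is called for that event name.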

Needless to say, the validity of many of these patents is questionable. Proving a patent invalid is a lengthy, expensive, and highly risky process, though; it's not something that one can automatically expect a litigation defendant to jump into. So there is no saying how HTC will react, or what sort of assistance HTC will get from the rest of the industry.

In summary: this may be the software patent battle that many of us have feared for a long time. An outright victory by Apple could well leave it "owning" much of the computing and mobile telephony industry - in the US, at least. One assumes that the rest of the industry is going to take note of what is happening here. Nokia is already involved in its own patent disputes with Apple, but this battle could spread well beyond Nokia and HTC. It will be in few companies' interest to let Apple prevail on these claims and entrench their validity. This battle is going to be an interesting one to watch.


SCALE 8x: Color management for everyone

March 2, 2010

This article was contributed by Nathan Willis

On Sunday at SCALE 8x, Inkscape developer Jon Cruz presented a talk entitled "Why Color Management matters to Open Source and to You," putting the need for color management into real-world terms for the average Linux user, outlining current development work on the subject at the application and toolkit levels, and giving example color-managed workflows for print and web production. Color management is sometimes unfairly characterized as a topic of interest only to print shops and video editors, but as Cruz explained at the top of his talk, anyone who shares digital content wants it to look correct, and everyone who uses more than one device knows how tricky that can be.

"If you have eyes and a display, you need color management"

Color management, broadly speaking, is the automatic transformation of image colors so as to provide a uniformly accurate representation across devices. This includes output-only devices such as televisions and printers, as well as CRT and LCD displays, which are used for editing as well as for viewing final output. The first problem is that every device is capable of generating a different spectrum of colors: different hues, different ranges of white-to-black values, and different degrees of saturation. Collectively, the color capabilities of the device are its gamut, which can be represented by a three-dimensional volume in one of several mathematical color models (or "color spaces").

[Cruz juggling]

The second problem is that digital files store the color of each pixel as a numeric triple that may or may not represent coordinates in some specified color space. If the color space that the file references is known, mapping each triple from its stored value into the gamut of the output device is a simple transformation, and the user can visually examine the full range of pixel data. Without that transformation, multiple colors outside the display device's gamut get mapped to the boundaries, causing artifacts and loss of detail, and the entire image can get mapped too dark or too light, misrepresenting the scene.
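The loss of detail at the gamut boundary is easy to demonstrate numerically. In the toy sketch below, a device can show channel values from 0.0 to 1.0 and the pixel values are made up; clipping collapses distinct out-of-range colors into one, while a uniform scale keeps them apart at the cost of darkening everything. Real color management systems use more sophisticated mappings; the function names here are hypothetical.

```python
def clip(value, low=0.0, high=1.0):
    """Force a single channel value into the displayable range."""
    return max(low, min(high, value))


def clip_to_gamut(pixels):
    """Naive handling: every out-of-range channel lands on the boundary."""
    return [tuple(clip(c) for c in p) for p in pixels]


def compress_to_gamut(pixels):
    """Scale all channels uniformly so the largest value lands on 1.0.

    Assumes non-negative channel values; this is a deliberately crude
    stand-in for a real gamut-mapping transformation.
    """
    peak = max(max(p) for p in pixels)
    scale = 1.0 / peak if peak > 1.0 else 1.0
    return [tuple(c * scale for c in p) for p in pixels]
```

With three reds of increasing intensity such as `(1.2, 0.5, 0.1)`, `(1.5, 0.5, 0.1)`, and `(2.0, 0.5, 0.1)`, clipping yields three identical pixels, while the scaled version keeps three distinct ones.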

Although it is clear that graphics professionals need color managed displays and printers, Cruz said, the explosion of user-generated digital content in recent years makes it a problem for everyone.

Home users want to be able to edit video and share it online, knowing that what appears appropriately bright on-screen will not look washed-out or too dark on DVD or YouTube. They also want to drop off family photos at the corner drugstore kiosk and not be disappointed by a red or green cast to the skin-tones. Photo kiosks may be inexpensive per-print, he said, but online vendors like Apple and Google's Picasa are increasingly offering more elaborate services, such as hardbound books, with correspondingly higher prices. Consumers might shrug off paying a few cents for a bad-looking 4x6 print, but getting burned on an expensive book is considerably more aggravating.

Just as importantly, Cruz added, business users need to care about the professionalism of their presentations, both for aesthetic reasons, and because a mis-colored partner logo could accidentally sour the opinion of the executive at the table who recently spent months determining the "perfect shade of puce" to represent the company image. Finally, he said, anyone who sells products online should know that the number one reason for returned consumer purchases is mismatched colors — if the product shots on the web site make the red shirts look orange, the seller is financially at risk for the cost of returns.

In addition to these use cases, Cruz explained that users need color management support in their desktop applications to cope with the variety of different display devices they use over the course of a day. Multiple computers are commonplace, from desktops to laptops to netbooks to hand-held devices, and each has different display characteristics. Laptop screens have noticeably smaller gamuts than desktop LCDs, which are in turn smaller than CRTs, and different also from the displays of consumer HDTVs. Mobile devices, based on different graphics hardware, may not even support full 8-bit-per-channel color. Presenting a consistent display across these platforms cannot be left to chance.

Status report

Fortunately for Linux users, Cruz continued, color management support in Linux is in good shape, although more still needs to be done. Most creative graphics applications support color management already, thanks in large part to the collaborative efforts of the Create project at Freedesktop.org. These include Gimp, Krita, Inkscape, Scribus, Digikam, F-Spot, and Rawstudio, as well as several image viewing utilities.

Enabling users to acquire good ICC profiles (tables measuring the device's attributes against points in a known color space, thus allowing for interpolation of color data) or to build their own is one of the key areas of current color work. Projects like Argyll and Oyranos handle tasks such as precisely measuring monitor color output through hardware colorimeters, creating profiles for printers, scanners, and cameras through color targets, and linking profiles for advanced usage.
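The "table plus interpolation" idea behind a profile can be shown in one dimension. The sketch below assumes a made-up table of (input value, measured output) pairs for a single channel and linearly interpolates between the measured points; actual ICC profiles are considerably richer than this, and the `interpolate` helper is purely illustrative.

```python
from bisect import bisect_right


def interpolate(table, x):
    """Linearly interpolate in a list of (input, measured) pairs.

    `table` must be sorted by input value; inputs outside the measured
    range are pinned to the nearest measured point.
    """
    xs = [point[0] for point in table]
    if x <= xs[0]:
        return table[0][1]
    if x >= xs[-1]:
        return table[-1][1]
    # Find the two measured points that bracket x.
    i = bisect_right(xs, x)
    (x0, y0), (x1, y1) = table[i - 1], table[i]
    t = (x - x0) / (x1 - x0)
    return y0 + t * (y1 - y0)
```

Given measurements at only 0.0, 0.5, and 1.0, the function still produces an estimate for an input like 0.25, which is what lets a finite set of measurements describe a device's response across its whole range.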

A simpler solution aimed at the home user is GNOME Color Manager (GCM); unlike the previous two examples, GCM does not attempt to be a complete ICC profile management tool, but focuses on easily enabling users to correctly assign a profile to their monitor. Default profiles are usually available from the manufacturer, either through the web or on the "driver" CDs in the box, and for normal usage they are an excellent first step. Developers from these and several related projects collaborate on common goals in the OpenICC project.

Developers interested in adding color management to their applications should start with LittleCMS, Cruz advised, noting that he personally added Inkscape's color management support in less than one week's time with LittleCMS. LittleCMS is a library that handles the mathematical transformations between color spaces automatically, quickly, and with very little overhead.

[Jon Cruz]

Currently, however, one drawback of the Linux color management scene is that most color-aware applications work in isolation from one another, requiring the user to choose display, output, and working ICC profiles in each program — whether through LittleCMS or with in-house routines. Ongoing work to bring color management to a wider range of programs includes adding support to the Cairo vector graphics rendering library, attaching display profiles to X displays, and building color management into GTK+ itself. The latter, in particular, would enable "dumb" applications to automatically be rendered in color-corrected form on the monitor, while still allowing "smart" applications to manage their own color. This is important because graphics and video editing applications need to be able to switch between different profiles for tasks like soft-proofing (simulating a printer's output on-screen by rendering with a different ICC profile) or testing for out-of-gamut color.

To the work!

Finally, Cruz showed several example workflows for print and web graphics, first illustrating potential problem points when working in a non-color-managed environment, then explaining how using a color-aware setup would trap and eliminate the problem.

For web graphics, the example scenario was a simple photo color-correction. Over-correcting the color balance on an improperly-managed monitor easily leads to site visitors seeing a wildly distorted image. In addition, Windows and Macs use different system gamuts, which leads to photos looking either too bright on Macs or too dark on Windows. With a managed workflow, users should target the sRGB color space, previewing the results with Windows, Mac OS X 10.4 and Mac OS X 10.5 profiles (due to changes introduced by Apple in 10.5), as well as mobile devices under different conditions. Because most web site audiences do not have color-corrected displays, he said, not everything is under the designer's control — but if the end user's monitor is broken and the artwork is broken, the problems multiply.

For print graphics, the workflow is more complicated, starting with the fact that — despite the popularity of the term — there is no single, standard "CMYK" color space. All process-color spaces are device-dependent, including common four-ink CMYK printers, CcMmYK photo printers, Hexachrome, and others; there is not even an analogous color space to the "Web safe" sRGB standard. Process color's small gamut makes it very easy to produce poor output when not using color management to edit and proof.

Fortunately, Inkscape and other SVG-capable editing tools can take advantage of the fact that SVG allows different color profiles to be attached to different objects in a drawing. A CMYK profile for the target printer can be used for most of the drawing, with a separate spot-color profile attached to specific objects that need careful attention, and corrective profiles for embedded RGB elements like raster graphics. A test run is always the best idea, Cruz said, but having proofing profiles available on the system saves both money and time.

Conclusion

Color management on Linux has come a long way in the last four years. The application support in the basic graphics suite is good, and for professionals, tools like Argyll and Oyranos open the door to complete solutions; as Cruz observed in his talk, the colorimeter hardware that used to cost thousands of dollars and lack support on free operating systems is now cheap and well-supported.

Still, the average desktop Linux distribution does not install in a color-managed state, which is unfortunate. Proper support for transforming pixels from one color space to another is straightforward math that, much like window translucency, smooth widget animation, and audio mixing, should happen without requiring the user to stop and think about it. It is promising that headway is being made on that front as well, with GCM and GTK+; perhaps in a few release cycles Linux will have full color management out-of-the-box.


Page editor: Jonathan Corbet


Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds