LWN.net Logo

Opportunistic Garbage Collection Obviousness

Opportunistic Garbage Collection Obviousness

Posted Apr 22, 2011 1:28 UTC (Fri) by jd (guest, #26381)
In reply to: Opportunistic Garbage Collection Obviousness by orcmid
Parent article: Google Linux servers hit with $5m patent infringement verdict (The Register)

Back wne I was at Uni, in the 80s, we looked at database theory. Standard technique for indexed sequential databases where you really don't want to fragment the file: when you "delete" a record, you don't remove it, you merely set a flag. When a new record is created, if a flagged record is of equal or slightly greater size you overwrite it.

Every so often, you vaccuum the database to eliminate gaps and remove any flagged records still left.

This gives you a situation where a record is not deleted immediately but is deleted when performing a different operation (an insert or a vaccuum).

This was considered a fairly basic, crude system that was a holdover from sequential media like magnetic tape.

This system would infringe the patent, despite the fact that I seriously doubt anyone has used magnetic tape for live database work within the lifetime of anyone involved in writing said patent or enforcing it in court.

The part that makes me concerned, though, is that this caswe is giving the impression to other companies that patent trolling is easy money and that actual software development is high-risk. That's an extremely dangerous mindset. When you start believing that producing is bad and privateering is good (privateering is piracy in the cutlass-and-walk-the-plank sense that has been legalized by a government), what do you imagine happens?


(Log in to post comments)

Opportunistic Garbage Collection Obviousness

Posted Apr 22, 2011 2:15 UTC (Fri) by orcmid (guest, #74478) [Link]

That is not an infringement of the patent unless your index uses hashed-key linked lists and, instead of an off-line cleanup, you did some bounded cleanup of only those deleted ones with the same hash when that particular list was being revisited for some useful purpose.

The essential claims in the patent are very specific about the situation in which they arise.

With regard to other observations about what is obvious or not, nothing we say here will impact the validity of the patent.

Google attempted to convince a jury that the patent was invalid and failed. I don't know what their argument was and whether they had any prior art to offer.

Google also failed to convince the jury that the patent was not being infringed, valid or not. They would have had to parade out the code and experts to explain it to accomplish that. Don't know how they failed at that either.

Now, we know that there appear to be some jurisdictions in Texas that are very friendly to patent-infringement suits. (This is where Microsoft ran up against i4i, for example.) But even so, it is not clear, based on the experience of others, that there is much room in the appeal process to over-rule the jury's findings of fact.

Whatever Google and other defendants end up paying, the only thing that will keep Linux as unencumbered FOSS is if the alleged infringement is cured by working around the essential claims of the patent so there is no infringement. If the patent were to be over-turned, put the code back if it is appropriate. Otherwise, just move on, I say.

I am far more concerned about Linux staying unfettered than I am about Google's pocketbook.

Opportunistic Garbage Collection Obviousness

Posted Apr 22, 2011 3:27 UTC (Fri) by jd (guest, #26381) [Link]

Indexed sequential: You hash then you sequentially search. Ok, so clearly we are indeed talking about items that hash to the same value.

Opportunistic overwriting: This is online, so we're clearly talking about bounded cleanup of items that hash to the same value.

Please, please read what you are replying to BEFORE replying to it.

Opportunistic Garbage Collection Obviousness

Posted Apr 22, 2011 5:48 UTC (Fri) by orcmid (guest, #74478) [Link]

That's not the definition of indexed sequential that I ever heard of (my experience going back to the original ISAM).

The "sequential" in indexed sequential means that the file can be accessed in record order (sequenced by key) or by direct access by key. In general, it matters that the keys are ordered. Typically, an index is used to map keys to the external storage blocks where the corresponding record is to be found. It is common for the index to be organized as some for of wide tree and how the index is cached matters.

The technique described in the patent is strictly about hashed indexing by keys and there is no information or assumption about being able to somehow navigate the hashed records in some key-based sequence nor is it presumed that such be possible.

I see that there is a variation of indexed-sequential in which the index is built using a hash technique (see http://en.wikipedia.org/wiki/Indexed_Sequential_Access_Me...). I have no idea what is hashed for that to work and how that helps with how the index itself is managed. But since you built one, perhaps you can enlighten us where hashing was done as part of working down to the location of a record have a particular exact key?

In any case, the Sequential in Indexed Sequential is not the sequential in the sense that the linked list of list items for records having the same hash code is considered sequential. Indeed, there is generally no requirement that the positions of those records in the list be sequenced in some way, although Knuth points out the advantage of doing so, since the detection that the desired record is not present can happen more efficiently.

There is also considerable importance, to ISAM, that the index for random access be loadable into memory in some manner or otherwise cached, since the records themselves were often on external storage and even insertions might be rather costly unless buckets of records retained slack for some reasonable number of insertions, with an overflow mechanism (not related to what we mean by hash overflow) for any excess.

And of course, in common usage, the only way to clean up an ISAM file was to reorganize the whole thing with some sort of utility, not unlike disk defragmenting.


Opportunistic Garbage Collection Obviousness

Posted Apr 22, 2011 6:10 UTC (Fri) by orcmid (guest, #74478) [Link]

I did go look at Berkeley DB because it does use a hashing scheme. What is revealed there is that hashing does not support indexed-sequential operation, but the B+Tree access method does. (This gets determined when you build the DB.)

The hashing method is extended linear hashing, where the hash function is adjusted dynamically to allow growing of the hash table. Linear hashing is mentioned on pp.548-549 (just before the exercises of section 6.4) of Don Knuth's "Art of Computer Programming," vol.3 Sorting and Searching, second edition. There is not enough information there to know how collisions on the same hash are dealt with.

Opportunistic Garbage Collection Obviousness

Posted Apr 22, 2011 18:32 UTC (Fri) by Wol (guest, #4433) [Link]

In linear hashing (typically used in Pick-based systems) it's normal to have a bucket that can hold multiple records. Records with the same hash are chained in a bucket, and should the bucket overflow, the records are often chained on the end of the file. So let's say your "modulo" is 20 - buckets 0-19 - the first bucket to overflow chains into 20, then next into 21, etc.

This then causes a bit of grief when the modulo increments, because you have to shove bucket 20 out of the way to create a new bucket 20 as non-overflow.

At least one implementation (Prime Information) used a special file/directory type, so each SEGSAM file contained subfiles, one per bucket, and they could grow as big as required.

Cheers,
Wol

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds