The claim is not about opportunistic garbage collection. It is about deleting no-longer-needed records from a linked list of same-hash records while visiting that list for some other purpose (finding, inserting, or deleting something).
It is a rather specific circumstance.
The first patent was applied for in 1989 and the last one (specific to the linked-list case) in 1997.
The patent applications list such texts as Knuth "Art of Computer Programming," vol.3 (1973). I only have the second edition (1998) and there is only limited mention of keeping expired ("deleted") items on their list and I could find nothing about incorporating their clean-up in other operations carried out via the hashed-index, apart from possible reuse for later insertions.
One thing that is not clear from my cursory inspection of the patent and Knuth has to do with whether or not the list threads through the records or the linked list is actually of pointers to the relevant records. This can matter in what is going on and when what portions are reclaimed.
Here's a garbage-collection type of situation where this kind of delayed deletion might matter. Consider a pure Lisp type memory of CONS cells. One can do any variety of garbage collection techniques with regard to how those cells depend on each other (including cyclically in the case of LISP), but you can only garbage-collect cells that are unreachable. If some cells are reachable from a hashed-table (say of symbolic names for some purpose), a cell is only removable when it is not reachable that way either. It usually takes some other means to actually release the connection from the hashed-table, and that may depend on it being known this is the only way something is left being reachable.
OK, I sort of hate that contrived example, but hope it is good enough to point out that the problem is different than opportunistic garbage collection as an abstracted-from-context known technique.
Posted Apr 22, 2011 0:44 UTC (Fri) by Pc5Y9sbv (guest, #41328)
[Link]
I think a number of us would think that the mere existence of incremental garbage collection strategies in the 80s and 90s, not to mention prior algorithmic work like heaps, ought to be prior art to kill such a patent. The important point is being able to loosen representation invariants and defer some maintenance work; this method has been applied in many systems to either increase the size of work (bigger batches) or to allow incremental work units (less jitter). (And sometimes both occur based on adaptive thresholds, e.g. batch small requests and break up bulk requests.)
The fact that overly specialized corner cases of generally know technique can be enumerated and patented is part of the problem with the current patent system. It is obvious to one skilled in the art, and it amounts to a land rush to try to lay claim before everyone else gets there. It is not innovation that needs protecting. It is disgusting.
Defered Work Obviousness
Posted Apr 22, 2011 1:43 UTC (Fri) by flewellyn (subscriber, #5047)
[Link]
Yeah, this is what I was going for. It's just a specific application of very well-known techniques.
Opportunistic Garbage Collection Obviousness
Posted Apr 22, 2011 1:28 UTC (Fri) by jd (guest, #26381)
[Link]
Back wne I was at Uni, in the 80s, we looked at database theory. Standard technique for indexed sequential databases where you really don't want to fragment the file: when you "delete" a record, you don't remove it, you merely set a flag. When a new record is created, if a flagged record is of equal or slightly greater size you overwrite it.
Every so often, you vaccuum the database to eliminate gaps and remove any flagged records still left.
This gives you a situation where a record is not deleted immediately but is deleted when performing a different operation (an insert or a vaccuum).
This was considered a fairly basic, crude system that was a holdover from sequential media like magnetic tape.
This system would infringe the patent, despite the fact that I seriously doubt anyone has used magnetic tape for live database work within the lifetime of anyone involved in writing said patent or enforcing it in court.
The part that makes me concerned, though, is that this caswe is giving the impression to other companies that patent trolling is easy money and that actual software development is high-risk. That's an extremely dangerous mindset. When you start believing that producing is bad and privateering is good (privateering is piracy in the cutlass-and-walk-the-plank sense that has been legalized by a government), what do you imagine happens?
Opportunistic Garbage Collection Obviousness
Posted Apr 22, 2011 2:15 UTC (Fri) by orcmid (guest, #74478)
[Link]
That is not an infringement of the patent unless your index uses hashed-key linked lists and, instead of an off-line cleanup, you did some bounded cleanup of only those deleted ones with the same hash when that particular list was being revisited for some useful purpose.
The essential claims in the patent are very specific about the situation in which they arise.
With regard to other observations about what is obvious or not, nothing we say here will impact the validity of the patent.
Google attempted to convince a jury that the patent was invalid and failed. I don't know what their argument was and whether they had any prior art to offer.
Google also failed to convince the jury that the patent was not being infringed, valid or not. They would have had to parade out the code and experts to explain it to accomplish that. Don't know how they failed at that either.
Now, we know that there appear to be some jurisdictions in Texas that are very friendly to patent-infringement suits. (This is where Microsoft ran up against i4i, for example.) But even so, it is not clear, based on the experience of others, that there is much room in the appeal process to over-rule the jury's findings of fact.
Whatever Google and other defendants end up paying, the only thing that will keep Linux as unencumbered FOSS is if the alleged infringement is cured by working around the essential claims of the patent so there is no infringement. If the patent were to be over-turned, put the code back if it is appropriate. Otherwise, just move on, I say.
I am far more concerned about Linux staying unfettered than I am about Google's pocketbook.
Opportunistic Garbage Collection Obviousness
Posted Apr 22, 2011 3:27 UTC (Fri) by jd (guest, #26381)
[Link]
Indexed sequential: You hash then you sequentially search. Ok, so clearly we are indeed talking about items that hash to the same value.
Opportunistic overwriting: This is online, so we're clearly talking about bounded cleanup of items that hash to the same value.
Please, please read what you are replying to BEFORE replying to it.
Opportunistic Garbage Collection Obviousness
Posted Apr 22, 2011 5:48 UTC (Fri) by orcmid (guest, #74478)
[Link]
That's not the definition of indexed sequential that I ever heard of (my experience going back to the original ISAM).
The "sequential" in indexed sequential means that the file can be accessed in record order (sequenced by key) or by direct access by key. In general, it matters that the keys are ordered. Typically, an index is used to map keys to the external storage blocks where the corresponding record is to be found. It is common for the index to be organized as some for of wide tree and how the index is cached matters.
The technique described in the patent is strictly about hashed indexing by keys and there is no information or assumption about being able to somehow navigate the hashed records in some key-based sequence nor is it presumed that such be possible.
I see that there is a variation of indexed-sequential in which the index is built using a hash technique (see http://en.wikipedia.org/wiki/Indexed_Sequential_Access_Me...). I have no idea what is hashed for that to work and how that helps with how the index itself is managed. But since you built one, perhaps you can enlighten us where hashing was done as part of working down to the location of a record have a particular exact key?
In any case, the Sequential in Indexed Sequential is not the sequential in the sense that the linked list of list items for records having the same hash code is considered sequential. Indeed, there is generally no requirement that the positions of those records in the list be sequenced in some way, although Knuth points out the advantage of doing so, since the detection that the desired record is not present can happen more efficiently.
There is also considerable importance, to ISAM, that the index for random access be loadable into memory in some manner or otherwise cached, since the records themselves were often on external storage and even insertions might be rather costly unless buckets of records retained slack for some reasonable number of insertions, with an overflow mechanism (not related to what we mean by hash overflow) for any excess.
And of course, in common usage, the only way to clean up an ISAM file was to reorganize the whole thing with some sort of utility, not unlike disk defragmenting.
Opportunistic Garbage Collection Obviousness
Posted Apr 22, 2011 6:10 UTC (Fri) by orcmid (guest, #74478)
[Link]
I did go look at Berkeley DB because it does use a hashing scheme. What is revealed there is that hashing does not support indexed-sequential operation, but the B+Tree access method does. (This gets determined when you build the DB.)
The hashing method is extended linear hashing, where the hash function is adjusted dynamically to allow growing of the hash table. Linear hashing is mentioned on pp.548-549 (just before the exercises of section 6.4) of Don Knuth's "Art of Computer Programming," vol.3 Sorting and Searching, second edition. There is not enough information there to know how collisions on the same hash are dealt with.
Opportunistic Garbage Collection Obviousness
Posted Apr 22, 2011 18:32 UTC (Fri) by Wol (guest, #4433)
[Link]
In linear hashing (typically used in Pick-based systems) it's normal to have a bucket that can hold multiple records. Records with the same hash are chained in a bucket, and should the bucket overflow, the records are often chained on the end of the file. So let's say your "modulo" is 20 - buckets 0-19 - the first bucket to overflow chains into 20, then next into 21, etc.
This then causes a bit of grief when the modulo increments, because you have to shove bucket 20 out of the way to create a new bucket 20 as non-overflow.
At least one implementation (Prime Information) used a special file/directory type, so each SEGSAM file contained subfiles, one per bucket, and they could grow as big as required.