Still not a copyright infringement

Posted Aug 22, 2003 3:28 UTC (Fri) by BrucePerens (subscriber, #2510)
Parent article: Maybe SCO had a point

If this code appears in System V with the two ASSERT statements, something that I have not verified because I don't have access to System V, then it is two lines different from a version which was not copyrighted. The difference is not sufficient to be copyrightable.

But there is no denying that someone, probably at SGI, was sloppy, and that we lucked out that this wasn't much worse. Thus, it behooves all of you who have access to licensed Unix code to take a copy of the latest development kernel and cross-check the code base.

Bruce


Still not a copyright infringement

Posted Aug 22, 2003 5:28 UTC (Fri) by rjamestaylor (guest, #339)

Question: if I had access to a licensed UNIX codebase (I don't) and I "diffed" it against Linux, how would I alert the Linux development community about my findings without:
  • Breaking non-disclosure agreements by clearly demonstrating the copying from the original
  • Opening the door for unscrupulous people to "claim" such-and-such a module infringes when it may actually not
Is the best way to make such a disclosure the way ESR has done, i.e., to present a minimal diff, or perhaps a Linux-side diff only?

I know I'd want more than someone's word that code in Linux was copied from licensed UNIX before the community got in gear to remove that code. What's the standard of proof?

Still not a copyright infringement

Posted Aug 22, 2003 5:38 UTC (Fri) by BrucePerens (subscriber, #2510)

More than one person we know has access to the System V code base. Some of them are known to us and trusted enough to give us no more than a list of file names and line numbers in the Linux kernel that might be questionable. That discloses none of the SCO art. And of course that is not a signal to just remove the code. We would first check the provenance of the code, and would often find that it was something used in System V but not owned by SCO, like the BPF code I reported on.

Of course, you can even cross-check one friend against another, or ask one to verify what another has reported.

Don't you think this is already going on, quietly? I would have expected that folks at a number of companies would have been on this months ago.

And by the way, if there was a significant infringement in a piece of code that people cared about - not in a prototype SGI driver for models that were never sold - someone would already have noticed.

Thanks

Bruce

Still not a copyright infringement

Posted Aug 22, 2003 13:33 UTC (Fri) by dwalters (guest, #4207)

"if there was a significant infringement in a piece of code that people cared about - not in a prototype SGI driver for models that were never sold - someone would already have noticed."

This is a good point, and Linus has effectively said that the open source development model allows this to happen.

"Don't you think this is already going on, quietly?"

That depends on just how many people actually

  • legally have access to the Sys V source code
  • have the time and motivation to perform detailed "pattern analysis" between the two code bases

I don't know what tools already exist to do this, but it strikes me that a useful tool for the community to develop would be a program to statistically analyse the two code bases and come up with suspicious matches. Of course it would be better for SCO to just make public what files, version and line numbers in Linux they think are infringing, but apart from a couple of examples, they don't appear to be doing that.

Still not a copyright infringement

Posted Aug 22, 2003 14:54 UTC (Fri) by jmitchel (guest, #11611)

"Don't you think this is already going on, quietly?"

I believe this is more in a class of plausible deniability than it is a wishful statement. Bruce's post sounds far more like a description of what is happening, framed in a deniable way, than it does a plan for action.

For that matter, if I cared enough I could probably get access to some generation of SYSV source, having worked with some decently connected people at Bell Labs. Now I think of it, it would be tempting to try to find it. In the long run it would be far more edifying for me and dog+world to try matching Version 7 vintage code, to develop tools that others can use to do fast, efficient searches.

Still not a copyright infringement

Posted Aug 22, 2003 12:44 UTC (Fri) by sommere (guest, #14168)

There is a largely quiet field of CS research on how to find CS students who plagiarize. The researchers, for the most part, don't publicize their findings, so that students can't check whether they have obfuscated their code enough.

I did some poking around to see if I could figure out how they work, and here is what I found:

1) They typically remove all tokens (words) except the keywords (so variable names don't matter).
2) They often equate equivalent keywords (for and while can be used in equivalent ways).
3) They usually use an algorithm called "Running Karp-Rabin" to find strings of matching tokens in two files. This algorithm is resistant to just reordering the functions (so it finds strings of tokens of length 6 or longer which match anywhere in the file, for example).
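The three steps above can be sketched roughly as follows. This is only an illustration, not the code of any real plagiarism detector: the keyword set, the `EQUIVALENT` map, and the function names are all assumptions made up for the example, and it covers only the Karp-Rabin fingerprinting of fixed-length token windows, not the tiling that a full tool would layer on top.

```python
import re

# Hypothetical keyword set and equivalence map, chosen only for illustration.
KEYWORDS = {"if", "else", "for", "while", "do", "return",
            "switch", "case", "break", "continue"}
EQUIVALENT = {"while": "for", "do": "for"}  # step 2: loops count as one token
TOKEN_ID = {kw: i + 1 for i, kw in enumerate(sorted(KEYWORDS))}

BASE = 257
MOD = (1 << 61) - 1


def keyword_tokens(source):
    """Steps 1 and 2: keep only keywords, folding equivalents together."""
    words = re.findall(r"[A-Za-z_]\w*", source)
    return [EQUIVALENT.get(w, w) for w in words if w in KEYWORDS]


def rolling_hashes(tokens, k):
    """Yield (start, hash) for every window of k tokens, Karp-Rabin style."""
    if len(tokens) < k:
        return
    ids = [TOKEN_ID[t] for t in tokens]
    high = pow(BASE, k - 1, MOD)  # weight of the token leaving the window
    h = 0
    for x in ids[:k]:
        h = (h * BASE + x) % MOD
    yield 0, h
    for i in range(k, len(ids)):
        # Drop the oldest token, shift, and append the newest one.
        h = ((h - ids[i - k] * high) * BASE + ids[i]) % MOD
        yield i - k + 1, h


def matching_runs(a_src, b_src, k=6):
    """Step 3: (a_start, b_start) token offsets of k-token runs found in both."""
    a, b = keyword_tokens(a_src), keyword_tokens(b_src)
    index = {}
    for start, h in rolling_hashes(a, k):
        index.setdefault(h, []).append(start)
    matches = []
    for b_start, h in rolling_hashes(b, k):
        for a_start in index.get(h, []):
            # Compare the tokens themselves to rule out hash collisions.
            if a[a_start:a_start + k] == b[b_start:b_start + k]:
                matches.append((a_start, b_start))
    return matches
```

Because the windows of one file are indexed by hash and looked up from any position in the other, moving a function elsewhere in the file or renaming its variables does not hide a match. As noted further down the thread, a hit is only a place for a human to start looking.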


This is likely what SCO's pattern matching team is doing, and someone on our side with access to System V should be doing it too.

I wrote a Java program to test whether this algorithm actually finds cheaters (it did). Feel free to e-mail me (lwn at ethanet.com) for more info/source.

Still not a copyright infringement

Posted Aug 22, 2003 19:03 UTC (Fri) by dark (✭ supporter ✭, #8483)

Umm, that algorithm sounds like it will find loads of false positives. Most student assignments are pretty simple, with only one or two obvious ways of deriving a correct solution.

Still not a copyright infringement

Posted Aug 22, 2003 22:17 UTC (Fri) by sommere (guest, #14168)

The programs don't expel students automatically :) Yes, it takes a human to use common sense and figure out if one was actually copied from the other. But it gives you somewhere to start.

Still not a copyright infringement

Posted Sep 17, 2003 1:10 UTC (Wed) by Zakaelri (guest, #15087)

I would think this is (approximately) how the whole process would be
done:

1) Do a unified diff on Linux vs System V
2) Do a unified diff on Linux vs System 32
3) Compare the diffs, remove any duplicate lines (this would extract the
things that are identical between System V, System 32, and Linux...
leaving only the changes.)
4) Inspect the remaining lines in the System V diff, then figure out where
they came from in the Linux codebase (line numbers for each file).
5) Send the resultant line numbers/files/etc to the kernel list, so they
can investigate where they all came from, and change anything that should
be changed.
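As a rough sketch of steps 4 and 5 only, the snippet below compares one pair of files with Python's standard `difflib` and reports nothing but line ranges on the Linux side, so no text from the other tree appears in the output. The function names, the `min_lines` threshold, and the report format are assumptions for illustration; the filtering in steps 1 to 3 is not covered.

```python
import difflib


def _normalize(lines):
    """Collapse whitespace so that re-indented copies still compare equal."""
    return [" ".join(line.split()) for line in lines]


def linux_side_matches(linux_lines, other_lines, min_lines=3):
    """Return 1-based (first, last) line ranges in the Linux file that
    also occur, as a run of at least min_lines lines, in the other file."""
    matcher = difflib.SequenceMatcher(
        None, _normalize(linux_lines), _normalize(other_lines), autojunk=False
    )
    ranges = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_lines:
            ranges.append((block.a + 1, block.a + block.size))
    return ranges


def report(path, linux_text, other_text, min_lines=3):
    """Format matches as 'path:first-last' strings suitable for posting."""
    ranges = linux_side_matches(
        linux_text.splitlines(), other_text.splitlines(), min_lines
    )
    return ["%s:%d-%d" % (path, first, last) for first, last in ranges]
```

Each reported range is only a candidate. The provenance of the code still has to be checked before anything is changed.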

The only problem with this technique is that the files would have to have
the same names/etc for it to work... and in some of the cases (such as the
SGI code case, already dissected by Bruce Perens and (by this point)
probably numerous others, if I recall correctly) this is not the case.

If there are files that are renamed, or code that is copyrighted but used
elsewhere in the same code, then the algorithm will probably escalate
pretty quickly to be NP-complete.

Still not a copyright infringement

Posted Sep 17, 2003 1:20 UTC (Wed) by Zakaelri (guest, #15087)

I am very sorry for the unreadability of the third and last paragraph in
the above post... Let's see if I can make sense of them:

The only problem with this technique is that the files would have to have
the same names/etc for it to work... If memory serves, there were some
cases (such as the SGI driver code, previously dissected and discussed by
Bruce Perens) where the copied code was actually in a different file than
its System V equivalent. This makes the problem of finding copyright
infringement in the current codebase unmanageable... the complexity
of the algorithm necessary to complete such a task might as well be
NP-complete.

See srcdupchk to compare source trees

Posted Aug 22, 2003 15:34 UTC (Fri) by emk (subscriber, #1128)

I wrote a utility called srcdupchk to encourage SCO to do the right thing and report suspicious code--so far, they haven't. It's available on freshmeat. If there's somebody who has the legal rights to run srcdupchk, and has a legal use for its output, please let them know about it. It uses the rolling hash technique I saw proposed in an article a few months ago.
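For readers who want the general idea of comparing whole trees by hashing, here is a minimal sketch. It is not srcdupchk's code and makes no claim about how that tool works internally: it takes a plain SHA-1 digest of every window of `k` normalized lines rather than using a true rolling hash, and it represents each tree as an in-memory dict of path to file text. All names and the default window size are assumptions.

```python
import hashlib


def window_digests(text, k):
    """Yield (first_line_number, digest) for each run of k consecutive
    non-blank lines, with whitespace collapsed before hashing."""
    lines = [(n, " ".join(raw.split()))
             for n, raw in enumerate(text.splitlines(), 1)]
    lines = [(n, line) for n, line in lines if line]
    for i in range(len(lines) - k + 1):
        chunk = "\n".join(line for _, line in lines[i:i + k])
        yield lines[i][0], hashlib.sha1(chunk.encode("utf-8")).hexdigest()


def suspicious_locations(reference_tree, checked_tree, k=5):
    """Return (path, line) pairs in checked_tree where a k-line window
    also occurs anywhere in reference_tree, whatever the file is called."""
    seen = set()
    for text in reference_tree.values():
        seen.update(digest for _, digest in window_digests(text, k))
    hits = []
    for path, text in sorted(checked_tree.items()):
        for line, digest in window_digests(text, k):
            if digest in seen:
                hits.append((path, line))
    return hits
```

Since every window from the reference tree lands in a single set, a copied block is found even when it sits in a differently named file, and the result names only locations in the checked tree.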

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds