New tool screens spam, digitizes books (ZDNet)

[Posted May 25, 2007 by ris]

ZDNet looks at the ReCaptcha project. "A group of Carnegie Mellon University programmers has launched a service called ReCaptcha that can help cut down on spam while letting people digitize books. The project is a variation of the widely used "Captcha" technique to weed out computer abuse such as e-mailing spam or posting spam on blog comments. Captchas require users to pass little pattern recognition tests, commonly reading distorted or obscured words."

something doesn't make sense here.

Posted May 25, 2007 19:29 UTC (Fri) by dlang (guest, #313) [Link] (2 responses)

Captchas work by presenting text that's hard for a machine to read and checking whether the response matches the known text.

When digitizing text, you don't know what the real text is ahead of time, so how in the world can you tell if you got the right answer? And once you have the right answer, it's no longer a benefit for digitizing books.

something doesn't make sense here.

Posted May 25, 2007 19:42 UTC (Fri) by stephen_pollei (guest, #23348) [Link]

Maybe you overlap the testing area with text that you know and text that you don't know.

So you take a 100 page book and you scan it in. You do one paragraph from about every 10 pages. Then you give the "human" three paragraphs worth of text. The sample should be big enough to hopefully contain 5 errors or more.

You are really testing whether they found the errors you know about, but they also report a few errors you didn't know about before.

That would be my guess; I'll read the fine article and see if I was correct.

something doesn't make sense here.

Posted May 25, 2007 21:27 UTC (Fri) by khim (subscriber, #9252) [Link]

The service presents users with two words, one from a conventional Captcha test and the other an unknown word that a computerized optical character recognition couldn't figure out. If the user correctly identifies the known word, he or she is presumed to have decoded the unknown one.

Simple idea, don't know why other projects are not doing it...
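For anyone who wants to see the two-word check spelled out, here is a rough sketch in Python. It is only my reading of the description above, not ReCaptcha's actual code: the names (Challenge, VoteStore, check), the image ID, and especially the idea of waiting for several matching guesses before accepting a reading of the unknown word are assumptions that the article does not confirm.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Challenge:
    control_word: str      # word whose correct answer is already known
    unknown_image_id: str  # hypothetical ID of a scanned word OCR could not read


@dataclass
class VoteStore:
    # Maps an unknown word's image ID to a tally of human guesses.
    votes: dict = field(default_factory=dict)

    def record(self, image_id: str, guess: str) -> None:
        self.votes.setdefault(image_id, Counter())[guess] += 1

    def consensus(self, image_id: str, min_votes: int = 3):
        # Assumption: accept a reading only once min_votes users agree on it.
        counts = self.votes.get(image_id)
        if not counts:
            return None
        word, n = counts.most_common(1)[0]
        return word if n >= min_votes else None


def normalize(text: str) -> str:
    return text.strip().lower()


def check(challenge: Challenge, control_answer: str,
          unknown_answer: str, store: VoteStore) -> bool:
    # Only the control word decides pass/fail; the unknown word cannot be graded.
    if normalize(control_answer) != normalize(challenge.control_word):
        return False
    # The user passed the known word, so their guess at the unknown one is kept.
    store.record(challenge.unknown_image_id, normalize(unknown_answer))
    return True
```

The point is that the user is only ever graded on the word with a known answer, while the guess at the other word is collected as a vote. That also answers the original question: the unknown word stays useful for digitizing precisely because nobody needs to know its answer in order to run the test.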

Privacy

Posted May 27, 2007 22:56 UTC (Sun) by Tobu (subscriber, #24111) [Link]

I don't see their privacy policy.

They are hosting the captchas on their own website, which gets pinged whenever someone writes a comment on a recaptcha-enabled blog.

They certainly need to be clear that they won't put any tracking cookies or feed an indexing bot with this.

Not gonna fly

Posted May 28, 2007 4:51 UTC (Mon) by jimmybgood (guest, #26142) [Link] (1 responses)

People are already annoyed by having to solve captchas. I'm no different than most people and if all of a sudden I have to solve two, I'm going to find a place where I only need to solve one.

Of course, websites could identify one as optional, but I bet most folks won't bother, particularly if it's difficult.

It sounds more like a tax than a research project. 150,000 free hours a day? Why not just recruit some people to try to improve OCR logic?

Not gonna fly

Posted May 28, 2007 19:24 UTC (Mon) by Los__D (guest, #15263) [Link]

"People are already annoyed by having to solve captchas. I'm no different than most people and if all of a sudden I have to solve two, I'm going to find a place where I only need to solve one."
Then you go do that.

I know no one (besides you, whom I don't know except for your general negativity about most things) who is annoyed by them.