
Faces of Open Source

Posted Feb 18, 2022 10:34 UTC (Fri) by Ross (guest, #4065)
In reply to: Faces of Open Source by amk
Parent article: Lorinda Cherry RIP

It's sad to learn of your passing, but nice to learn about the person behind some wonderful tools I've used for so long. Thank you, Lorinda! But in the writeup, I'm struck by all the tools that I don't use and didn't know existed:

Lorinda had a hand in "typo", too, a Morris invention that found gross spelling mistakes by statistical analysis. Sorting the words of a document by the similarity of their trigrams to those in the rest of the document tended to bring typos to the front of the list. [...]
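As a rough illustration of the trigram idea in that paragraph, here is a minimal Python sketch. It is not the original "typo" algorithm: the boundary markers, the log-count scoring, and the names `trigrams` and `rank_by_oddness` are all assumptions made for the example. Each distinct word is scored by how often its letter trigrams occur in the rest of the text, and the lowest-scoring words are listed first.

```python
import math
import re
from collections import Counter


def trigrams(word):
    """Return the letter trigrams of a word, padded with boundary markers."""
    padded = f"^{word}$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def rank_by_oddness(text):
    """Return the distinct words of text, most unusual-looking first.

    Assumed scoring (not taken from the original program): the mean of
    log(1 + count) over a word's trigrams, where each count excludes the
    occurrences contributed by the word itself.
    """
    words = re.findall(r"[a-z]+", text.lower())
    word_counts = Counter(words)
    tri_counts = Counter(t for w in words for t in trigrams(w))

    def score(word):
        tris = trigrams(word)
        own = Counter(tris)
        occurrences = word_counts[word]
        # Subtract this word's own contribution so that only the rest of
        # the document is used as evidence.
        rest = [tri_counts[t] - own[t] * occurrences for t in tris]
        return sum(math.log(1 + c) for c in rest) / len(rest)

    # Low score means the trigrams are rare elsewhere, so list those first.
    return sorted(word_counts, key=lambda w: (score(w), w))


if __name__ == "__main__":
    import sys
    for word in rank_by_oddness(sys.stdin.read())[:20]:
        print(word)
```

Saved under a hypothetical name such as typo_sketch.py, it can be run as `python3 typo_sketch.py < document.txt`. On short texts many perfectly good words will also look unusual, so the ranking only becomes useful on longer documents.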

[...] It all began as a one-woman skunk-works project. Noticing the very slow progress in natural-language processing, she identified a useful subtask that could be carved out of the larger problem: identifying parts of speech. Using a vocabulary of function words (articles, pronouns, prepositions and conjunctions) and rules of inflection, she was able to classify parts of speech in running text with impressive accuracy.

When Rutgers professor William Vesterman proposed a style-assessing program, with measures such as the frequencies of adjectives, subordinate clauses, or compound sentences, Lorinda was able to harness her "parts" program to implement the idea in a couple of weeks. Subsequently Nina MacDonald, with Lorinda's support, incorporated it into a larger suite that checked and made suggestions about other stylistic issues such as cliches, malapropisms, and redundancy.

Another aspect of text processing that Lorinda addressed was topic identification. Terms (often word pairs) that occur with abnormal frequency are likely to describe the topic at hand. She used this idea to construct first drafts of indexes. One in-house application was to the Unix manual, which up until that time had only a table of contents, but no index. [...]
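The word-pair idea can be sketched in a few lines of Python as well. The writeup does not say how "abnormal frequency" was measured, so this example assumes one simple reading: compare how often a pair of adjacent words actually occurs with how often it would be expected to occur if the two words were independent. The tiny `FUNCTION_WORDS` list, the `min_count` threshold, and the name `candidate_index_terms` are all made up for illustration.

```python
import re
from collections import Counter

# A deliberately tiny stop list for illustration; a real one would be longer.
FUNCTION_WORDS = {
    "a", "an", "and", "the", "of", "to", "in", "is", "it", "on",
    "for", "with", "that", "this", "by", "as", "are", "be",
}


def candidate_index_terms(text, min_count=2):
    """Return adjacent word pairs, most over-represented first.

    Assumed measure: observed pair count divided by the count expected if
    the two words occurred independently of each other in this text.
    """
    words = re.findall(r"[a-z]+", text.lower())
    unigrams = Counter(words)
    pairs = Counter(zip(words, words[1:]))
    total = len(words)

    scored = []
    for (first, second), seen in pairs.items():
        # Skip rare pairs and pairs that include a function word.
        if seen < min_count:
            continue
        if first in FUNCTION_WORDS or second in FUNCTION_WORDS:
            continue
        expected = unigrams[first] * unigrams[second] / total
        scored.append((seen / expected, f"{first} {second}"))

    # Highest ratio first; ties are broken alphabetically.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [term for _, term in scored]
```

Calling `candidate_index_terms(open("manual.txt").read())` (with a hypothetical file name) gives a list of candidate terms that a person would still need to prune, which matches the "first drafts of indexes" description.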

I miss having these kinds of tools. How did we end up without them being common utilities like "dc" and "bc"? I would love to be able to apply these kinds of tools to text files on the command line, or inside of vim.



Faces of Open Source

Posted Feb 18, 2022 11:40 UTC (Fri) by Wol (subscriber, #4433)

Changing topic slightly, I'm just mucking about with OCR. Why can't OCR process a PLAIN TEXT FILE CORRECTLY?

My document looks like it came out of nroff, with nicely aligned sections and indents. I don't know how it was OCR'd (probably with one of Google's Artificial Stupidity tools), but the OCR software has decided that the document is formatted in columns (no, it isn't), so the text fragments are all over the place and it's a fing nightmare trying to put it all back together again!

If it's a simple document, ffs OCR it as a simple document! Left-to-right, top-to-bottom, DON'T assume fancy formatting that isn't there!

Cheers,
Wol


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds