Semi-automatic document scanning with Paperwork

By Nathan Willis
March 2, 2016

The paperless office, much like the year of the Linux desktop, always seems to remain just out of reach. In large part, this is because there is little that one person can do to prevent other people from printing and sending hard-copy documents. But there are tools to help convert these unwanted papers into digital form that can be integrated, to one degree or another, with the filesystems and databases that keep track of everything else. One such tool is Paperwork, a graphical tool to scan, extract text from, and automatically index paper documents.

Scanning, itself, is close to being treated as a solved problem on Linux desktops. If the goal is merely to scan images of paper documents, there are loads of existing applications to do just that. Where Paperwork makes its claims of superiority, though, is in its support for recognizing text on the page, extracting it into a searchable form, then intelligently assigning labels that reflect semantic meaning.

Making this work requires generating some metadata from each scanned page, and storing it in a systematic fashion inside the directory structure alongside the scan. Paperwork uses the Tesseract optical character recognition (OCR) library to detect and extract page text. It then stores the extracted text in a foo.words file. Next, the scanned text is indexed and added to Paperwork's database (which allows the user to search across the document collection). Finally, a Bayesian text classifier is run on the text to suggest relevant labels. The user can also manually attach keywords to any scanned document; these keywords and the labels are also stored in files within the directory, making them, too, accessible to external programs.
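To make the label-suggestion step more concrete, here is a minimal sketch of how a Bayesian text classifier can propose labels. It is not Paperwork's code (the article does not show its classifier); the `LabelSuggester` class, its method names, and the sample texts are all hypothetical, and it is written for Python 3 using only the standard library. For each known label, it compares how well the new document's words fit the documents that carry the label against those that do not.

```python
import math


class LabelSuggester:
    """Hypothetical per-label naive Bayes classifier (illustration only)."""

    def __init__(self):
        self.docs = []  # list of (set of words, set of labels)

    def train(self, text, labels):
        """Remember a document's words together with its user-assigned labels."""
        self.docs.append((set(text.lower().split()), set(labels)))

    def _log_score(self, words, group):
        """log P(group) + sum of log P(word | group), with add-one smoothing."""
        if not group:
            return float("-inf")
        score = math.log(len(group) / len(self.docs))
        for word in words:
            hits = sum(1 for doc_words in group if word in doc_words)
            score += math.log((hits + 1) / (len(group) + 2))
        return score

    def suggest(self, text):
        """Return every label whose 'has label' score beats 'lacks label'."""
        words = set(text.lower().split())
        all_labels = set().union(*(labels for _, labels in self.docs))
        suggestions = []
        for label in sorted(all_labels):
            with_label = [w for w, labels in self.docs if label in labels]
            without = [w for w, labels in self.docs if label not in labels]
            if self._log_score(words, with_label) > self._log_score(words, without):
                suggestions.append(label)
        return suggestions


suggester = LabelSuggester()
suggester.train("electricity bill amount due kilowatt", ["utilities"])
suggester.train("water bill amount due litres", ["utilities"])
suggester.train("bank statement account balance interest", ["bank"])
suggester.train("bank account transfer statement", ["bank"])
print(suggester.suggest("electricity bill due"))
```

Note that, in this sketch, a label carried by every document seen so far is always suggested, since there is no counter-example to weigh against it; a classifier of this kind only becomes useful once the corpus holds documents both with and without each label.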

Paperwork is written in Python 2 and is available for installation from the Python Package Index. This should pull in most of the dependencies, although there is a dependency checker (called paperwork-chkdeps) included in the package. The latest release is version 0.3.1, from February 26. Once installed, the application is launched with paperwork. Any SANE-supported scanner should be usable, but Paperwork can also import image files and PDFs, so having a scanner on hand is not required.

Paperwork's workflow is largely automated. First, the user creates a new document in the internal collection, then they must scan pages to add to that document (or, alternatively, import pages from external files). Once each page has been added to the document, though, Paperwork tries to automatically determine the orientation of the text, then runs the Tesseract OCR engine on the page in the background. Whenever the user hovers the cursor over the page, the interface pops up the words that Tesseract has detected; there is also a menu entry to show all of the detected text. The user can continue to scan and add new pages until all have been added, then save the result.

After completing the OCR pass over all the pages in a new document, Paperwork attempts to automatically assign labels to categorize the text. In practice, the classifier engine that takes this step needs a corpus of documents with existing labels in order to do its job. Thus, the user will have to manually assign every label to the first document, and will likely have to delete labels assigned overzealously to the first several documents. In the long run, the usefulness of the labels hinges on having a set of somewhat related documents. A weird outlier will either end up unlabeled or mislabeled. One regrettable missing feature is an ability to re-run the classifier on one or more documents; there is a feature to re-run the OCR step, so perhaps there is hope for the future.

The search feature lets the user type words into the search box, and presents a live-updated list of matching documents below. Click on any of the matching documents in the list, and Paperwork will open it in a viewer window highlighting the location of each search term. Multi-word search terms are not supported, nor are logical operators.
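The behavior described above, a single typed term matched live against the document collection, can be approximated with a simple inverted index. The following is only a sketch under stated assumptions: the article does not describe how Paperwork's index is implemented, so `build_index()`, `search()`, and the sample documents are hypothetical, and matching on word prefixes is one guess at what "live-updated" means.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return index


def search(index, typed):
    """Return documents containing a word that starts with the typed text."""
    typed = typed.strip().lower()
    if not typed:
        return []
    matches = set()
    for word, names in index.items():
        if word.startswith(typed):
            matches |= names
    return sorted(matches)


documents = {
    "warranty": "Limited warranty for your phone",
    "invoice": "Phone invoice total due",
}
index = build_index(documents)
print(search(index, "pho"))
print(search(index, "limited warranty"))
```

Because the index stores individual words only, the second query finds nothing: no single word begins with "limited warranty". Supporting phrases would require the index to record word positions as well.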

The simplicity of the search feature might sound like a big limitation—after all, with a lot of documents, some way to narrow down the search would be nice. But the quality of the OCR stage is likely to be more important. Tesseract has come a long way in recent years; I tested it shortly after its initial open-source release ten years ago, when the scan results from other OCR engines were likely to be riddled with elementary spelling mistakes. Tesseract has always been better than the competition, but in those early years it tended to perform poorly on small font sizes, and it tended to erroneously flag images and shadows as text blocks.

For the sake of comparison, I ran Paperwork on some of the smallest text I could find: the legalese in the warranty booklet for a cell phone. Tesseract did admirably; it tagged several shadows, but with a single exception it did not "recognize" letters in them as it might have in the past. I only found two instances where it split a word into two segments, and none where it combined two words into one. Moreover, I do not think there were any spelling mistakes, which is a noteworthy achievement. Tesseract also seems to be geared toward detecting black-on-white text, to the point where it occasionally misses lighter print or text within images.

That said, free-software OCR is capable of doing more. Tesseract outputs text in the hOCR format, an XML format that stores each detected word along with the coordinates of its bounding box in the source image. Several other projects exist that can build on this word-level information to detect lines, paragraphs, and even larger page structure (think columns of text that wrap around images). The OCRopus engine does this internally, while GNOME's OCRFeeder can do page-layout detection on top of Tesseract output. While Paperwork does not need to detect page layout in order to run simple searches, at least reconstructing sentences would allow users to search for multi-word phrases.
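As a rough illustration of how word-level bounding boxes can be turned back into lines of text, consider the sketch below. It rests on several assumptions: the sample markup is hand-written in the style of hOCR (word spans with the class `ocrx_word` and a `title` attribute carrying `bbox x0 y0 x1 y1`), the `WordBoxParser` and `group_into_lines()` names are hypothetical, and the grouping heuristic is deliberately naive. Real layout analysis, as done by OCRopus or OCRFeeder, is considerably more involved.

```python
from html.parser import HTMLParser


class WordBoxParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every hOCR word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bounding box of the word span being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            # The title attribute holds ";"-separated properties such as bbox.
            for prop in (attrs.get("title") or "").split(";"):
                parts = prop.split()
                if len(parts) == 5 and parts[0] == "bbox":
                    self._bbox = tuple(int(p) for p in parts[1:])
                    self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def group_into_lines(words):
    """Naively join words whose vertical centres are close into text lines."""
    lines = []  # each entry: [centre_y of first word, [(x0, text), ...]]
    for text, (x0, y0, x1, y1) in sorted(words, key=lambda w: w[1][1]):
        centre = (y0 + y1) / 2
        if lines and abs(centre - lines[-1][0]) < (y1 - y0) / 2:
            lines[-1][1].append((x0, text))
        else:
            lines.append([centre, [(x0, text)]])
    # Within each line, order the words from left to right.
    return [" ".join(t for _, t in sorted(items)) for _, items in lines]


SAMPLE = """
<span class='ocrx_word' title='bbox 120 50 210 72; x_wconf 93'>warranty</span>
<span class='ocrx_word' title='bbox 30 48 110 70; x_wconf 96'>Limited</span>
<span class='ocrx_word' title='bbox 30 90 100 112; x_wconf 91'>applies</span>
"""

parser = WordBoxParser()
parser.feed(SAMPLE)
print(group_into_lines(parser.words))
```

With the words restored to reading order on each line, a phrase search could be run against the joined line text rather than against isolated words.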

Paperwork bills itself as a "scan and forget" tool, so it may be asking too much to expect it to detect page layout. As a mostly automated "document ingest" application, it certainly succeeds where other scanning programs for Linux desktops fall flat. What remains to be seen is whether scanning and forgetting works well over the long haul. The project is a small operation at present, without posted plans for where it intends to go in the future. The automatic labeling of contents will surely improve over time as Paperwork gets more familiar with a corpus of documents, but the bare-bones search functionality is likely to become a pain as one accumulates more and more scanned documents.

Semi-automatic document scanning with Paperwork

Posted Mar 10, 2016 12:53 UTC (Thu) by jpritikin73 (guest, #107608)

How does paperwork compare to gscan2pdf? Anybody used both?

Semi-automatic document scanning with Paperwork

Posted Mar 14, 2016 8:56 UTC (Mon) by faheem (guest, #47574)

I do use gscan2pdf. Paperwork does not seem to be currently in Debian. If/when it arrives, I will give it a try.

OCRmyPDF

Posted Mar 11, 2016 4:44 UTC (Fri) by spwhitton (subscriber, #71678)

A nice alternative for incorporating into one's own shell scripts is OCRmyPDF, which recently hit Debian unstable.

Semi-automatic document scanning with Paperwork

Posted Mar 14, 2016 19:33 UTC (Mon) by bronson (subscriber, #4806)

I have a related scanning need... I'd like a tool where I can drop 4 3x5 prints onto the flatbed scanner and hit the button. It would detect, rotate, crop, name, and save each picture it finds. Cmdline or gui, doesn't matter to me.

I last tried setting something up about six years ago and gave up in frustration... Is there anything closer to that today?

(yes, I know I can ship them to a sweatshop for $1 for five photos. maybe I'll resort to that one day...)

Semi-automatic document scanning with Paperwork

Posted Mar 14, 2016 20:03 UTC (Mon) by ersi (guest, #64521)

> Scanning, itself, is close to being treated as a solved problem on Linux desktops. If the goal is merely to scan images of paper documents, there are loads of existing applications to do just that.

I'd really like to get more information about the scanning part.

Sure, there's plenty of applications - but finding a decent scanner that is decently supported and suited for the type of scanning you're trying to do.. is hard! At least when it comes to scanning receipts, instead of full size papers.

Does anyone have any scanner and technique recommendations for scanning receipts?

Semi-automatic document scanning with Paperwork

Posted Mar 14, 2016 20:56 UTC (Mon) by bronson (subscriber, #4806)

I like the Fujitsu ScanSnap s1300i. Its tiny footprint means its sheet feeder is not as easy to adjust as larger scanners but it's a tradeoff I'd make again -- it works fine and it spends 99.9% of its life turned off. SANE works.

I scanned an 800 page double-sided tech document in a few hours with it. While working, every few minutes drop another 20ish pages on the feeder. It can take some abuse.

The color is not as nice as the flatbed but for grayscale it's great.

Semi-automatic document scanning with Paperwork

Posted Mar 14, 2016 21:01 UTC (Mon) by bronson (subscriber, #4806)

Oop, re-read your question. I don't do many receipts, mostly documents. And I've been mostly using it on a Mac because Linux OCR was terrible last I tried a few years ago. Just couldn't get it to work well.

Looks like I'll have to give it another shot.

Semi-automatic document scanning with Paperwork

Posted Mar 15, 2016 0:07 UTC (Tue) by cortana (subscriber, #24596)

I recently bought a Canon P-208II for scanning documents and receipts and it works really well. It works with SANE (although I use the Windows software it came with because it gave me better image quality out of the box).