Semi-automatic document scanning with Paperwork
The paperless office, much like the year of the Linux desktop, always seems to remain just out of reach. In large part, this is because there is little that one person can do to prevent other people from printing and sending hard-copy documents. But there are tools to help convert these unwanted papers into digital form that can be integrated, to one degree or another, with the filesystems and databases that keep track of everything else. One such tool is Paperwork, a graphical tool to scan, extract text from, and automatically index paper documents.
Scanning, itself, is close to being treated as a solved problem on Linux desktops. If the goal is merely to scan images of paper documents, there are loads of existing applications to do just that. Where Paperwork makes its claims of superiority, though, is in its support for recognizing text on the page, extracting it into a searchable form, then intelligently assigning labels that reflect semantic meaning.
Making this work requires generating some metadata from each scanned page, and storing it in a systematic fashion inside the directory structure alongside the scan. Paperwork uses the Tesseract optical character recognition (OCR) library to detect and extract page text. It then stores the extracted text in a foo.words file. Next, the scanned text is indexed and added to Paperwork's database (which allows the user to search across the document collection). Finally, a Bayesian text classifier is run on the text to suggest relevant labels. The user can also manually attach keywords to any scanned document; these keywords and the labels are also stored in files within the directory, making them, too, accessible to external programs.
![Paperwork page-orientation detection](https://static.lwn.net/images/2016/03-paperwork-detect-sm.png)
Paperwork is written in Python 2 and is available for installation from the Python Package Index. This should pull in most of the dependencies, although there is a dependency checker (called paperwork-chkdeps) included in the package. The latest release is version 0.3.1, from February 26. Once installed, the application is launched with paperwork. Any SANE-supported scanner should be usable, but Paperwork can also import image files and PDFs, so having a scanner on hand is not required.
Paperwork's workflow is largely automated. First, the user creates a new document in the internal collection, then they must scan pages to add to that document (or, alternatively, import pages from external files). Once each page has been added to the document, though, Paperwork tries to automatically determine the orientation of the text, then runs the Tesseract OCR engine on the page in the background. Whenever the user hovers the cursor over the page, the interface pops up the words that Tesseract has detected; there is also a menu entry to show all of the detected text. The user can continue to scan and add new pages until all have been added, then save the result.
![Paperwork OCR](https://static.lwn.net/images/2016/03-paperwork-ocr-sm.png)
After completing the OCR pass over all the pages in a new document, Paperwork attempts to automatically assign labels to categorize the text. In practice, the classifier engine that takes this step needs a corpus of documents with existing labels in order to do its job. Thus, the user will have to manually assign every label to the first document, and will likely have to delete labels assigned overzealously to the first several documents. In the long run, the usefulness of the labels hinges on having a set of somewhat related documents. A weird outlier will either end up unlabeled or mislabeled. One regrettable missing feature is an ability to re-run the classifier on one or more documents; there is a feature to re-run the OCR step, so perhaps there is hope for the future.
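To make the corpus dependency concrete, here is a minimal sketch of a Bayesian label suggester. It is not Paperwork's code: the `LabelSuggester` class, its one-model-per-label design, the add-one smoothing, and the sample documents are all assumptions chosen for illustration. It does show why an empty collection yields no suggestions and why a label only becomes reliable once a few related documents carry it.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class LabelSuggester:
    """Hypothetical sketch: one naive Bayes decision per label (has it or not)."""

    def __init__(self):
        self.doc_count = 0
        self.label_docs = Counter()              # label -> documents carrying it
        self.word_counts = defaultdict(Counter)  # label -> word counts in those documents
        self.all_words = Counter()               # word counts over every document

    def train(self, text, labels):
        """Record one document whose labels were assigned by the user."""
        words = tokenize(text)
        self.doc_count += 1
        self.all_words.update(words)
        for label in labels:
            self.label_docs[label] += 1
            self.word_counts[label].update(words)

    def suggest(self, text):
        """Return the labels that are more probable than not for this text."""
        words = tokenize(text)
        vocab = len(self.all_words)
        total_all = sum(self.all_words.values())
        suggested = []
        for label, n_with in self.label_docs.items():
            with_counts = self.word_counts[label]
            total_with = sum(with_counts.values())
            total_without = total_all - total_with
            # Smoothed priors for "has the label" and "does not have it".
            score_with = math.log((n_with + 1) / (self.doc_count + 2))
            score_without = math.log(
                (self.doc_count - n_with + 1) / (self.doc_count + 2))
            for word in words:
                # Add-one smoothing keeps unseen words from zeroing a score.
                score_with += math.log(
                    (with_counts[word] + 1) / (total_with + vocab + 1))
                without = self.all_words[word] - with_counts[word]
                score_without += math.log(
                    (without + 1) / (total_without + vocab + 1))
            if score_with > score_without:
                suggested.append(label)
        return sorted(suggested)


suggester = LabelSuggester()
print(suggester.suggest("invoice amount due"))  # [] since nothing is labeled yet

suggester.train("Electricity invoice amount due kilowatt usage", ["bill"])
suggester.train("Invoice for water usage amount due", ["bill"])
suggester.train("Warranty terms for your phone, limited liability", ["warranty"])
suggester.train("Phone warranty booklet liability terms", ["warranty"])

print(suggester.suggest("invoice amount due for gas usage"))  # ['bill']
```

A document whose vocabulary overlaps with none of the labeled ones gets scored almost entirely from the smoothing terms, which is one way to picture the "weird outlier" problem.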
The search feature lets the user type words into the search box, and presents a live-updated list of matching documents below. Click on any of the matching documents in the list, and Paperwork will open it in a viewer window highlighting the location of each search term. Multi-word search terms are not supported, nor are logical operators.
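The behavior described here is what one would expect from a plain inverted index that maps each word to the documents containing it. The sketch below is an illustration under that assumption, not a description of Paperwork's actual index; the `WordIndex` class, the prefix matching used to mimic live updating, and the document names are all hypothetical.

```python
import re
from collections import defaultdict


class WordIndex:
    """Inverted index: each word maps to the set of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add_document(self, doc_id, text):
        """Index every word of the document's extracted text."""
        for word in re.findall(r"\w+", text.lower()):
            self.postings[word].add(doc_id)

    def search(self, typed):
        """Return sorted ids of documents with a word starting with `typed`."""
        prefix = typed.strip().lower()
        if not prefix:
            return []
        hits = set()
        for word, doc_ids in self.postings.items():
            if word.startswith(prefix):
                hits |= doc_ids
        return sorted(hits)


index = WordIndex()
index.add_document("20160301_phone", "Limited warranty for your phone")
index.add_document("20160305_power", "Electricity invoice, amount due")

print(index.search("warr"))            # ['20160301_phone']
print(index.search("invoice"))         # ['20160305_power']
print(index.search("phone warranty"))  # [] because no single word contains a space
```

Because the index stores only which words occur in which document, and not where they sit on the page, it has nothing to answer a phrase query with.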
The simplicity of the search feature might sound like a big limitation—after all, with a lot of documents, some way to narrow down the search would be nice. But the quality of the OCR stage is likely to be more important. Tesseract has come a long way in recent years; I tested it shortly after its initial open-source release ten years ago, when the scan results from other OCR engines were likely to be riddled with elementary spelling mistakes. Tesseract has always been better than the competition, but in those early years it tended to perform poorly on small font sizes, and it tended to erroneously flag images and shadows as text blocks.
![Paperwork OCR errors](https://static.lwn.net/images/2016/03-paperwork-error-sm.png)
For the sake of comparison, I ran Paperwork on some of the smallest text I could find: the legalese in the warranty booklet for a cell phone. Tesseract did admirably; it tagged several shadows, but with a single exception it did not "recognize" letters in them as it might have in the past. I only found two instances where it split a word into two segments, and none where it combined two words into one. Moreover, I do not think there were any spelling mistakes, which is a noteworthy achievement. It also seems to be geared toward detecting black-on-white text, to the point where it occasionally misses lighter print or text within images.
That said, free-software OCR is capable of doing more. Tesseract outputs text in the hOCR format, an XHTML-based format that stores each detected word along with the coordinates of its bounding box in the source image. Several other projects exist that can build on this word-level information to detect lines, paragraphs, and even larger page structure (think columns of text that wrap around images). The OCRopus engine does this internally, while GNOME's OCRFeeder can do page-layout detection on top of Tesseract output. While Paperwork does not need to detect page layout in order to run simple searches, at least reconstructing sentences would allow users to search for multi-word phrases.
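As a rough illustration of how little is needed to get from word boxes to searchable lines, the sketch below reads hOCR word elements (class `ocrx_word`, with a `title` attribute of the form `bbox x0 y0 x1 y1`) and groups them by vertical position. The sample markup is hand-written for this example rather than real Tesseract output, and the grouping heuristic is a simple assumption, far cruder than what OCRopus or OCRFeeder do.

```python
import re
from html.parser import HTMLParser

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordBoxParser(HTMLParser):
    """Collect (text, x0, y0, x1, y1) for every element of class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bounding box of the word element being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(n) for n in match.groups())

    def handle_data(self, data):
        if self._bbox and data.strip():
            self.words.append((data.strip(), *self._bbox))
            self._bbox = None


def group_into_lines(words):
    """Join words into lines when a word's vertical center falls in the line's span."""
    lines = []  # each entry is [top, bottom, words_on_line]
    for word in sorted(words, key=lambda w: w[2]):
        _, _, y0, _, y1 = word
        center = (y0 + y1) / 2
        if lines and lines[-1][0] <= center <= lines[-1][1]:
            lines[-1][2].append(word)
            lines[-1][1] = max(lines[-1][1], y1)
        else:
            lines.append([y0, y1, [word]])
    # Within a line, order the words left to right by x0.
    return [" ".join(w[0] for w in sorted(ws, key=lambda w: w[1]))
            for _, _, ws in lines]


# Hand-written fragment for illustration only.
SAMPLE_HOCR = """
<div class='ocr_page'>
 <span class='ocrx_word' title='bbox 40 12 110 30'>Limited</span>
 <span class='ocrx_word' title='bbox 120 10 210 31'>warranty</span>
 <span class='ocrx_word' title='bbox 40 50 90 68'>applies</span>
 <span class='ocrx_word' title='bbox 100 51 130 69'>to</span>
</div>
"""

parser = WordBoxParser()
parser.feed(SAMPLE_HOCR)
lines = group_into_lines(parser.words)
print(lines)  # ['Limited warranty', 'applies to']
print(any("limited warranty" in line.lower() for line in lines))  # True
```

Once the words are back in reading order, a phrase search reduces to a substring test on each line; phrases that wrap across lines or columns would need the fuller layout analysis mentioned above.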
Paperwork bills itself as a "scan and forget" tool, so it may be asking too much to expect it to detect page layout. As a mostly automated "document ingest" application, it certainly succeeds where other scanning programs for Linux desktops fall flat. What remains to be seen is whether scanning and forgetting works well over the long haul. The project is a small operation at present, without posted plans for where it intends to go in the future. The automatic labeling of contents will surely improve over time as Paperwork gets more familiar with a corpus of documents, but the bare-bones search functionality is likely to become a pain as one accumulates more and more scanned documents.
Posted Mar 10, 2016 12:53 UTC (Thu) by jpritikin73 (guest, #107608)
Posted Mar 14, 2016 8:56 UTC (Mon) by faheem (guest, #47574)
Posted Mar 11, 2016 4:44 UTC (Fri) by spwhitton (subscriber, #71678)
Posted Mar 14, 2016 19:33 UTC (Mon) by bronson (subscriber, #4806)
I last tried setting something up about six years ago and gave up in frustration... Is there anything closer to that today?
(yes, I know I can ship them to a sweatshop for $1 for five photos. maybe I'll resort to that one day...)
Posted Mar 14, 2016 20:03 UTC (Mon) by ersi (guest, #64521)
I'd really like to get more information about the scanning part.
Sure, there's plenty of applications - but finding a decent scanner that is decently supported and suited for the type of scanning you're trying to do.. is hard! At least when it comes to scanning receipts, instead of full size papers.
Does anyone have any scanner and technique recommendations for scanning receipts?
Posted Mar 14, 2016 20:56 UTC (Mon) by bronson (subscriber, #4806)
I scanned an 800 page double-sided tech document in a few hours with it. While working, every few minutes drop another 20ish pages on the feeder. It can take some abuse.
The color is not as nice as the flatbed but for grayscale it's great.
Posted Mar 14, 2016 21:01 UTC (Mon) by bronson (subscriber, #4806)
Looks like I'll have to give it another shot.
Posted Mar 15, 2016 0:07 UTC (Tue) by cortana (subscriber, #24596)
I recently bought a Canon P-208II for scanning documents and receipts and it works really well. It works with SANE (although I use the Windows software it came with because it gave me better image quality out of the box).
A nice alternative for incorporating into one's own shell scripts is OCRmyPDF, which recently hit Debian unstable.