OCRthyPDF - User interface for OCRmyPDF

Unfortunately, it happens again and again that the original files of scanned texts are no longer available (or can no longer be found). In these cases, text recognition must be used to generate editable text from the image files. The command line tool OCRmyPDF by James Barlow has often made my life easier when dealing with scanned text in PDF documents.

Since I couldn’t find a simple graphical user interface, I came up with the idea of OCRthyPDF: a user interface that gives users who are not used to command line tools access to the basic functions of OCRmyPDF.

OCRthyPDF GUI

The splitter function extends the text recognition provided by OCRmyPDF. It allows scanned documents to be separated at separator pages – defined by a QR code – before text recognition. A QR code can mark a separator-only page that is discarded. Alternatively, in sticker mode, the QR code defines the first page of a new document and is retained.

If you like the results created with OCRthyPDF but need more flexibility, I suggest you give OCRmyPDF a try on the command line. 🙂

How To Install

If you are using Ubuntu or any other distribution that comes with snap pre-installed, you can install it directly from the Snap Store (GNOME Software) or you can type

sudo snap install ocrthypdf

in your terminal.

If your distribution does not have snap / GNOME Software pre-installed, you can find installation instructions here.

Troubleshooting

Snaps run in a restricted environment and need permissions to access files on your computer (similar to apps on your smartphone). So first check whether you have set the correct permissions in the Snap Store user interface.

You can start OCRthyPDF from the terminal with ocrthypdf --log INFO or ocrthypdf --log DEBUG in order to get more info in case the application does not work as expected.

Info about subprocesses like OCRmyPDF, Splitter, Ghostscript, etc. is displayed in the console tab. Set ‘Loglevel’ to DEBUG and ‘Limit console …’ to ‘no’ for detailed information.

📌 Still no clue what went wrong? Report an issue here.

How To Use It

OCRthyPDF GUI Options-Tab

First you need to select a single PDF or a folder containing PDF files that should be processed by OCRmyPDF’s character recognition. Then you specify a folder where the new PDFs will be saved. If no output folder is selected, the input folder is used as the output folder as well.

The switches in the “Options” tab correspond to the values described in the OCRmyPDF cookbook and work exactly the same way. Not all combinations are useful or allowed. OCRthyPDF does not prevent you from setting such combinations. In most cases, OCR will simply refuse to start or abort with an error message. See the Console tab for detailed information about what went wrong.
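As a rough illustration of how GUI switches could map to OCRmyPDF command line options, here is a minimal Python sketch. The function `build_command` and its parameters are hypothetical and not taken from OCRthyPDF's source. The flags `--deskew`, `--rotate-pages`, `--force-ocr`, `--skip-text` and `--redo-ocr` are OCRmyPDF options; the last three exclude each other, which is one example of a combination that is not allowed.

```python
from typing import List, Optional

# OCRmyPDF accepts at most one of these three processing modes per run.
OCR_MODES = {"force-ocr", "skip-text", "redo-ocr"}


def build_command(source: str, target: str, *, deskew: bool = False,
                  rotate_pages: bool = False,
                  mode: Optional[str] = None) -> List[str]:
    """Assemble an ocrmypdf argument list (hypothetical helper)."""
    if mode is not None and mode not in OCR_MODES:
        raise ValueError(f"unknown OCR mode: {mode}")
    cmd = ["ocrmypdf"]
    if deskew:
        cmd.append("--deskew")
    if rotate_pages:
        cmd.append("--rotate-pages")
    if mode is not None:
        # A single 'mode' parameter rules out conflicting mode flags.
        cmd.append(f"--{mode}")
    cmd += [source, target]
    return cmd
```

The resulting list could be passed to `subprocess.run(cmd, check=True)`. Taking the mode as a single value rather than three independent switches is a design choice of this sketch; OCRthyPDF itself, as described above, does not prevent conflicting settings.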

📌 Caution: If you leave the postfix field blank and the output is written to the input folder, you will overwrite your source file! 🤦
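The naming rule behind this warning can be sketched in a few lines of Python. The names `output_path` and `would_overwrite` are hypothetical, and the sketch assumes that the postfix is appended to the file name just before the extension; OCRthyPDF's actual implementation may differ.

```python
from pathlib import Path
from typing import Optional


def output_path(source: Path, postfix: str,
                out_dir: Optional[Path] = None) -> Path:
    """Build <out_dir>/<stem><postfix><suffix> (assumed naming scheme).

    If no output folder is given, the input folder is used.
    """
    target_dir = out_dir if out_dir is not None else source.parent
    return target_dir / f"{source.stem}{postfix}{source.suffix}"


def would_overwrite(source: Path, postfix: str,
                    out_dir: Optional[Path] = None) -> bool:
    """True if the output would land on the source file itself."""
    return output_path(source, postfix, out_dir) == source
```

With a blank postfix and no separate output folder, `would_overwrite` returns `True`, which is exactly the situation the caution describes. The comparison is purely textual, so mixing relative and absolute paths would need a `resolve()` call first.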

Start the OCR with the “Start OCR” button. You can press the “Stop OCR” button to stop all running OCR jobs.

The activity indicator bar flashes while OCR is running.

OCRthyPDF GUI Splitter-Tab

The splitter enables you to split a PDF-file into separate files based on a separator barcode / QR-Code. This is very handy if you have to scan a lot of (multi-page) documents and don’t want to scan each document separately. Just put a separator page between each document and scan them at once!

To activate the splitter, set “Run splitter prior to OCR” to “yes”.

In the next field you have to specify a separator text. The splitter tries to find a QR code on each page and compares its content with this text.

The next switch selects the separator mode. By default, the separation page is omitted and not included in the output files. In Sticker Mode a QR-Code starts a new segment and the page will be added to the output. Each segment/document will be saved with a segment number as postfix. You can download standard QR-Codes with text “NEXT” here.
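The difference between the two separator modes can be shown with a small Python sketch. It works on a list of already decoded QR texts (one entry per page, `None` where no code was found) and assumes an exact match against the separator text. `split_pages` is a hypothetical name, not part of OCRthyPDF's splitter.

```python
from typing import List, Optional, Sequence


def split_pages(payloads: Sequence[Optional[str]], separator: str,
                sticker_mode: bool = False) -> List[List[int]]:
    """Group page indices into segments at separator pages."""
    segments: List[List[int]] = []
    current: List[int] = []
    for index, payload in enumerate(payloads):
        if payload == separator:
            if current:
                segments.append(current)
            # Default mode drops the separator page; Sticker Mode keeps
            # it as the first page of the new segment.
            current = [index] if sticker_mode else []
        else:
            current.append(index)
    if current:
        segments.append(current)
    return segments
```

For a seven-page scan with codes on pages 2 and 4 (counting from 0), the default mode yields three segments without the coded pages, while Sticker Mode yields three segments that each begin with the coded page.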

In Sticker Mode, you can use the pattern <separator text>|<custom postfix> in your QR code to add a custom postfix to the filename. Use a different postfix in each code, since no segment numbers are added in this mode if a custom postfix is found.

📌 Examples of useful QR codes in Sticker Mode:

  • NEXT|CoverLetter – NEXT|Attachments
  • NEXT|CoverLetter_Miller – NEXT|CoverLetter_Smith
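How such codes might be turned into file names can be sketched as follows. Both function names are hypothetical, and the underscore that joins the file name and the postfix is an assumption of this sketch, not documented behavior.

```python
from typing import Optional


def custom_postfix(payload: str, separator: str) -> Optional[str]:
    """Return the part after '|' if the code starts with the separator text."""
    head, bar, tail = payload.partition("|")
    if head != separator or not bar or not tail:
        return None
    return tail


def segment_name(stem: str, number: int, postfix: Optional[str]) -> str:
    """Prefer the custom postfix; otherwise fall back to the segment number."""
    suffix = postfix if postfix is not None else str(number)
    return f"{stem}_{suffix}.pdf"
```

A code with the text NEXT|CoverLetter_Miller would give the postfix `CoverLetter_Miller`, while a plain NEXT code gives `None` and the segment number is used instead.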

📌 If you select the option not to use the source filename in the output filename, you can set the filenames entirely through the custom postfixes (provided you leave the postfix field in the Options tab blank).

Before Splitter starts analyzing the pages of a PDF file, the source PDF file is rewritten with Ghostscript to work around some common problems with PDF files created by scanners/MFPs. Splitter looks for QR codes in the rewritten file, but assembles the split files directly from the source file. You can use the “Assemble split files from rewritten source file?” option to tell Splitter to take the pages from the rewritten/repaired version. If you are splitting scanned documents that contain bitmap images, this should be safe. If you split documents that contain other elements (text, fonts, vector drawings, etc.), the result may differ from the source!
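For readers curious what such a rewrite looks like, the following Python sketch builds a typical Ghostscript call that re-writes a PDF through the `pdfwrite` device. The exact options OCRthyPDF passes to Ghostscript are not stated here, so treat this as an assumption; `ghostscript_rewrite_command` is a hypothetical helper.

```python
from typing import List


def ghostscript_rewrite_command(source: str, rewritten: str) -> List[str]:
    """Argument list for rewriting a PDF with Ghostscript's pdfwrite device."""
    return [
        "gs",
        "-o", rewritten,        # output file; -o also implies batch mode without pauses
        "-sDEVICE=pdfwrite",    # re-emit the document as a new PDF
        source,
    ]
```

The list can be run with `subprocess.run(cmd, check=True)` on a system where Ghostscript is installed. Because `pdfwrite` re-interprets and re-emits every page, text, fonts and vector elements may come out differently from the source, which matches the warning above.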

OCRthyPDF GUI Language-Tab

The Language tab lets you select the languages present in your documents. The default selection is English and the language of your desktop environment. Since the result of OCR strongly depends on this selection, you should select all languages you need and deselect all languages you don’t!
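On the command line, OCRmyPDF takes the selected languages as Tesseract language codes joined with `+` (for example `-l eng+deu`). A minimal Python sketch of building that argument from a selection, with `language_argument` as a hypothetical helper name:

```python
from typing import Iterable, List


def language_argument(codes: Iterable[str]) -> List[str]:
    """Turn selected Tesseract language codes into an ocrmypdf '-l' argument."""
    unique = list(dict.fromkeys(codes))  # drop duplicates, keep selection order
    if not unique:
        raise ValueError("select at least one language")
    return ["-l", "+".join(unique)]
```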

OCRthyPDF GUI Console-Tab

In the console you can see the output of the processes working “under the hood”. This is helpful in case the results are different than expected or the OCR terminates with an error code. You can select the log levels “INFO” (status messages when everything works as expected) and “DEBUG” (a lot of detailed information). By default, the console shows the output of the last subprocess and is cleaned up when a new subprocess is started. You can set the console to show the information of all subprocesses without cleanup.

The two bars at the bottom indicate the status of the Split Job queue and the OCR job queue. “Queue” refers to documents waiting to be processed.

Links

Alternative software and further information on this topic can be found here:

2 thoughts on “OCRthyPDF - User interface for OCRmyPDF”

  1. M

    Hello,
    I would love to use OCRthyPDF and it’s exactly what I have been looking for, but I just can’t get Snap to work reliably under EndeavourOS (Arch based).

    Is there a chance to release OCRthyPDF via the AUR or as an AppImage or anything BUT as a snap package?

    Kind regards.
    M

    1. Björn Post author

      Hi M,

      I currently do not plan to release the program in a different format.
      There is one guy who plans to put it on homebrew.

      The whole application consists of two Python scripts (ocrthypdf.py and splitter.py) + dependencies. So basically you install the dependencies, download both scripts from GitHub, put them in a directory and run OCRthyPDF.py from the command line.

      Dependencies needed that can be installed with apt:
      python3.9 python3-tk python3-pip tcl ghostscript icc-profiles-free liblept5 libxml2 pngquant tesseract-ocr-all unpaper qpdf zlib1g imagemagick openjdk-8-jre
      Modules needed that can be installed with pip3:
      PySimpleGUI ocrmypdf zxing pikepdf

      If you do not want to download all tesseract languages, change tesseract-ocr-all to tesseract-ocr. You can download additional language packs separately. Use apt search tesseract-ocr to see the available languages.
