Since it unfortunately happens again and again that the original files of scanned texts are no longer available (or can no longer be found), text recognition must be used in these cases to generate editable text from the image files. The command line tool OCRmyPDF by James Barlow has often made my life easier when dealing with scanned text files in PDF documents.
Since I couldn’t find a simple graphical user interface, I came up with the idea of OCRthyPDF – a user interface that allows users – who are not used to command line tools – to access the basic functions of OCRmyPDF.
The splitter function extends the text recognition provided by OCRmyPDF. It allows scanned documents to be separated at separator pages – defined by a QR code – before text recognition. A QR code can mark a separator-only page that is discarded. Alternatively, in sticker mode, the QR code defines the first page of a new document and is retained.
If you like the results created with OCRthyPDF but need more flexibility I suggest you give OCRmyPDF a try on the command line. 🙂
How To Install
If you are using Ubuntu or any other distro that comes with snap pre-installed, you can install it directly from the Snap Store (GNOME Software) or you can type
sudo snap install ocrthypdf
in your terminal.
If your distro does not have snap / GNOME Software pre-installed, you can find installation instructions here.
Snaps run in a restricted environment and need permissions to access files on your computer (similar to apps on your smartphone). So first check whether you have set the correct permissions in the Snap Store user interface.
You can start OCRthyPDF from the terminal with
ocrthypdf --log INFO
or
ocrthypdf --log DEBUG
to get more information if the application does not work as expected.
Info about subprocesses like OCRmyPDF, Splitter, Ghostscript, etc. is displayed in the console tab. Set ‘Loglevel’ to DEBUG and ‘Limit console …’ to ‘no’ for detailed information.
Still no clue what went wrong? Report an issue here.
How To Use It
First you need to select a single PDF or a folder containing PDF files that should be processed by OCRmyPDF’s character recognition. Then you specify a folder where the new PDFs will be saved. If no output folder is selected, the input folder is used as the output folder as well.
The switches in the “Options” tab correspond to the values described in the OCRmyPDF cookbook and work exactly the same way. Not all combinations are useful or allowed. OCRthyPDF does not prevent you from setting such combinations. In most cases, OCR will simply refuse to start or abort with an error message. See the Console tab for detailed information about what went wrong.
Caution: If you leave the postfix field blank and the output is written to the input folder, you will overwrite your source file!
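The overwrite risk described above can be illustrated with a minimal sketch. The helper names and the exact way the output name is built (source name + postfix + extension) are assumptions for illustration, not OCRthyPDF's actual code.

```python
from pathlib import Path
from typing import Optional


def planned_output_path(source: Path, output_dir: Optional[Path], postfix: str) -> Path:
    """Hypothetical helper: derive the output path for one source PDF."""
    # No output folder selected: fall back to the input folder.
    target_dir = output_dir if output_dir is not None else source.parent
    # Assumed naming scheme: <source name><postfix>.pdf
    return target_dir / f"{source.stem}{postfix}{source.suffix}"


def would_overwrite(source: Path, output_dir: Optional[Path], postfix: str) -> bool:
    """True if the planned output path is the source file itself."""
    return planned_output_path(source, output_dir, postfix) == source


source = Path("/scans/invoice.pdf")
print(would_overwrite(source, None, ""))      # True: blank postfix, same folder
print(would_overwrite(source, None, "_ocr"))  # False: output is invoice_ocr.pdf
```

A check like this before starting a job is one way to warn about the blank-postfix case.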
Start the OCR with the “Start OCR” button. You can press the “Stop OCR” button to stop all running OCR jobs.
The activity indicator bar flashes while OCR is running.
The splitter enables you to split a PDF-file into separate files based on a separator barcode / QR-Code. This is very handy if you have to scan a lot of (multi-page) documents and don’t want to scan each document separately. Just put a separator page between each document and scan them at once!
To activate the splitter, set “Run splitter prior to OCR” to “yes”.
In the next field you have to specify a separator text. The splitter tries to find a QR code on each page and compares its content with this text.
The next switch selects the separator mode. By default, the separation page is omitted and not included in the output files. In Sticker Mode a QR-Code starts a new segment and the page will be added to the output. Each segment/document will be saved with a segment number as postfix. You can download standard QR-Codes with text “NEXT” here.
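The difference between the two separator modes can be sketched as follows. This is only an illustration of the behaviour described above, not the Splitter's real implementation. The function name is hypothetical, and the per-page QR content is assumed to be already decoded (None for pages without a code).

```python
from typing import List, Optional


def split_pages(page_codes: List[Optional[str]], separator: str,
                sticker_mode: bool = False) -> List[List[int]]:
    """Group page indices into segments, given the decoded QR text of each page."""
    segments: List[List[int]] = []
    current: List[int] = []
    for page_index, code in enumerate(page_codes):
        if code == separator:
            # A separator closes the segment collected so far.
            if current:
                segments.append(current)
            # Sticker mode keeps the page as the first page of the new segment;
            # the default mode discards it.
            current = [page_index] if sticker_mode else []
        else:
            current.append(page_index)
    if current:
        segments.append(current)
    return segments


codes = [None, None, "NEXT", None, "NEXT", None, None]
print(split_pages(codes, "NEXT"))                     # [[0, 1], [3], [5, 6]]
print(split_pages(codes, "NEXT", sticker_mode=True))  # [[0, 1], [2, 3], [4, 5, 6]]
```

In the default mode pages 2 and 4 disappear from the output. In Sticker Mode they open the second and third segment.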
In Sticker Mode you can use the pattern <separator text>|<custom postfix> in your QR code to add a custom postfix to the filename. Use a different postfix in each code, since no segment numbers are added in this mode if a custom postfix is found.
Examples for useful QR-Codes in Sticker Mode:
- NEXT|CoverLetter – NEXT|Attachments
- NEXT|CoverLetter_Miller – NEXT|CoverLetter_Smith
If you select the option not to use the source filename in the output filename, you can set the filenames entirely through the custom postfixes (provided you leave the postfix field in the Options tab blank).
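As a rough sketch of how QR content such as NEXT|CoverLetter could be interpreted and turned into a filename: both function names are hypothetical, and joining the name parts with an underscore is an assumption made for this example only.

```python
from typing import Optional, Tuple


def parse_sticker_code(content: str, separator: str) -> Tuple[bool, Optional[str]]:
    """Return (is_separator, custom_postfix) for one decoded QR code."""
    text, _, postfix = content.partition("|")
    if text != separator:
        return False, None
    # An empty part after "|" counts as "no custom postfix".
    return True, postfix or None


def segment_filename(source_stem: str, segment_number: int,
                     custom_postfix: Optional[str],
                     use_source_name: bool = True) -> str:
    """Build an output filename; a custom postfix replaces the segment number."""
    postfix = custom_postfix if custom_postfix else str(segment_number)
    if use_source_name:
        return f"{source_stem}_{postfix}.pdf"  # assumed "_" as the joining character
    return f"{postfix}.pdf"


print(parse_sticker_code("NEXT|CoverLetter", "NEXT"))  # (True, 'CoverLetter')
print(segment_filename("scan", 2, None))               # scan_2.pdf
print(segment_filename("scan", 2, "CoverLetter_Miller", use_source_name=False))
```

The last call shows why each code needs its own postfix: without the source name and without a segment number, the postfix alone decides the filename.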
Before Splitter starts analyzing the pages of a PDF file, the source PDF file is rewritten with Ghostscript to work around some common problems with PDF files created by scanners/MFPs. Splitter looks for QR codes in the rewritten file, but assembles the split files directly from the source file. You can use the “Assemble split files from rewritten source file?” option to tell Splitter to take the pages from the rewritten/repaired version. If you are splitting scanned documents that contain bitmap images, this should be safe. If you split documents that contain other elements (text, fonts, vector drawings, etc.), the result may differ from the source!
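For readers curious what such a rewrite step can look like, here is a minimal sketch. The Ghostscript options shown (-o and the pdfwrite device) are standard, but whether Splitter uses exactly these options is an assumption. The function names are hypothetical.

```python
from typing import List


def ghostscript_rewrite_command(source_pdf: str, rewritten_pdf: str) -> List[str]:
    """Build a Ghostscript call that re-writes a PDF through the pdfwrite device."""
    return [
        "gs",
        "-o", rewritten_pdf,   # output file; -o also implies -dBATCH -dNOPAUSE
        "-sDEVICE=pdfwrite",   # emit a freshly written PDF
        source_pdf,
    ]


def assembly_source(source_pdf: str, rewritten_pdf: str,
                    assemble_from_rewritten: bool) -> str:
    """Pick the file the split documents take their pages from."""
    # QR codes are always searched in the rewritten file; only the page
    # source depends on the option.
    return rewritten_pdf if assemble_from_rewritten else source_pdf


print(" ".join(ghostscript_rewrite_command("scan.pdf", "scan_rewritten.pdf")))
print(assembly_source("scan.pdf", "scan_rewritten.pdf", assemble_from_rewritten=False))
```

The command list could be passed to subprocess.run(..., check=True) on a system where Ghostscript is installed.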
The Language tab lets you select the languages present in your documents. The default selection is English and the language of your desktop environment. Since the result of OCR strongly depends on this selection, you should select all languages you need and deselect all languages you don’t!
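On the command line, OCRmyPDF receives the selected languages as Tesseract language codes joined with a plus sign. The sketch below only builds such a command. The function name is hypothetical, and how OCRthyPDF assembles its call internally is not shown here.

```python
from typing import List, Sequence


def ocrmypdf_command(input_pdf: str, output_pdf: str,
                     languages: Sequence[str]) -> List[str]:
    """Build an OCRmyPDF call for the given Tesseract language codes."""
    if not languages:
        raise ValueError("select at least one language")
    # Several languages are passed as one argument, e.g. "eng+deu".
    return ["ocrmypdf", "--language", "+".join(languages), input_pdf, output_pdf]


print(" ".join(ocrmypdf_command("in.pdf", "out.pdf", ["eng", "deu"])))
# ocrmypdf --language eng+deu in.pdf out.pdf
```

Each code in the list must match an installed Tesseract language pack, which is one more reason to select only the languages you actually need.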
In the console you can see the output of the processes working “under the hood”. This is helpful in case the results are different than expected or the OCR terminates with an error code. You can select the log levels “INFO” (status messages when everything works as expected) and “DEBUG” (a lot of detailed information). By default, the console shows the output of the last subprocess and is cleaned up when a new subprocess is started. You can set the console to show the information of all subprocesses without cleanup.
The two bars at the bottom indicate the status of the Split Job queue and the OCR job queue. “Queue” refers to documents waiting to be processed.
Alternative software and further information on this topic can be found here:
Congratulations! After searching for ages, I have finally found a perfectly working solution for text recognition in PDFs under UBUNTU!
On the web, I have only ever found construction sites, but no stable *solutions*. Even the titles are recognised cleanly and formatted sensibly!
The icing on the cake would be the following function:
The characters are recognised correctly in the correct country code. Would it be possible to replace the blurred scanned images of the characters with original characters in the output file?
This would massively reduce the size of the output files and make them much easier to read!
Nevertheless: THANK YOU, you have succeeded in creating a great app!
thx for your positive feedback!
OCRthyPDF is merely a frontend for ocrmypdf (except for the splitter) and therefore does not have the feature suggested.
And I am not sure if it would be that useful since e.g. all the fonts would be lost and you have no chance to spot misinterpreted characters / numbers.
I would love to use OCRthyPDF and it’s exactly what I have been looking for, but I just can’t get Snap to work reliably under EndeavourOS (Arch based).
Is there a chance to release OCRthyPDF via the AUR or as an AppImage or anything BUT as a snap package?
I currently do not plan to release the program in a different format.
There is one guy who plans to put it on homebrew.
The whole application consists of two Python scripts (ocrthypdf.py and splitter.py) + dependencies. So basically you install the dependencies, download both scripts from GitHub, put them in a directory and run ocrthypdf.py from the command line.
Dependencies needed that can be installed with apt:
python3.9 python3-tk python3-pip tcl ghostscript icc-profiles-free liblept5 libxml2 pngquant tesseract-ocr-all unpaper qpdf zlib1g imagemagick openjdk-8-jre
Modules needed that can be installed with pip3:
PySimpleGUI ocrmypdf zxing pikepdf
If you do not want to download all tesseract languages, change tesseract-ocr-all to tesseract-ocr. You can download additional language packs separately. Use
apt search tesseract-ocr
to see the available languages.