finereader in Medak, Sekulic & Mertens 2014


vely good free software solution - Tesseract
(http://code.google.com/p/tesseract-ocr/) - that has solid performance, good language data and can
be trained for an even better performance, although it has its problems. Proprietary solutions (e.g.
Abby FineReader) sometimes provide superior results.
Tesseract supports as input format primarily .tiff files. It produces a plain text file that can be, with
the help of other tools, embedded as a separate layer under the original graphic image of the text in
a PDF


used because of its purported superiority, but it is far less popular.
The export to PDF can be done again with a number of tools. In our case we'll complete the optical
character recognition and PDF export in gscan2pdf. Again, the proprietary Abbyy FineReader will
produce a bit smaller PDFs.
If you prefer to use an e-book format that works better with e-book readers, obviously you will have
to remove some of the elements that appear in the book - headers, footers, footnotes and pagination.

This can be d


'Start OCR'. Once the OCR is
finished, export the graphic files and the OCR text to PDF by selecting 'Save as'.
However, given that sometimes the proprietary solutions produce better results, these tasks can also
be done, for instance, on the Abbyy FineReader running on a Windows operating system running
inside the Virtual Box. The prerequisites are that you have both Windows and Abbyy FineReader
you can install in the Virtual Box. If using Virtual Box, once you've got both installed, you need to
designate a shared folder in your Virtual Box and place the .tiff files there. You can now open them
from the Abbyy FineReader running in the Virtual Box, OCR them and export them into a PDF.
To use Abbyy FineReader transfer the output files in your 'out' out folder to the shared folder of the
VirtualBox. Then start the VirtualBox, start Windows image and in Windows start Abbyy
FineReader. Open the files and let the Abbyy FineReader read the files. Once it's done, output the
result into PDF.

VI. CATALOGING AND SHARING THE E-BOOK
Your road from a book on paper to an e-book is complete. If you want to maintain your library you
can use Calibre, a free software tool for e-book libr


menu 'Tools',
select the Tesseract engine and your language, start the process
- once OCR is finished and to output to a PDF, go under 'File' and select 'Save', edit the
metadata and select the format, save
If using non-free software:
2) open Abbyy FineReader in VirtualBox (note: only Abby FineReader 10 installs and works with some limitations - under GNU/Linux)
- transfer files in the 'out' folder to the folder shared with the VirtualBox
- point it to the readied .tiff files and it will complete the OCR
- save the file

REFERENCES
For more info

 

Display 200 300 400 500 600 700 800 900 1000 ALL characters around the word.