Tesseract 3.02 running on Gnome Terminal 3.8.0. "input_image.tif" is the input document which will be rendered as "output_text.txt" by Tesseract.
|Original author(s)||Ray Smith, Hewlett-Packard|
|Stable release||3.02 / October 28, 2012|
|Written in||C and C++|
|Operating system||Linux (32 & 64-bit), Windows (32-bit), and, unofficially, Mac OS X (x86)|
|Type||Optical character recognition|
|License||Apache License v2.0|
Tesseract is an optical character recognition engine for various operating systems. It is free software, released under the Apache License, Version 2.0, and development has been sponsored by Google since 2006. Tesseract is considered one of the most accurate open source OCR engines currently available.
The Tesseract engine was originally developed as proprietary software at Hewlett-Packard labs in Bristol, England and Greeley, Colorado between 1985 and 1994, with further changes made in 1996 to port it to Windows and some migration from C to C++ in 1998. Much of the code was written in C, with later portions in C++; all of it has since been converted to at least compile with a C++ compiler. Very little work was done in the following decade. Hewlett-Packard and the University of Nevada, Las Vegas (UNLV) released it as open source in 2005, and Google has sponsored its development since 2006.
Tesseract was among the top three OCR engines in terms of character accuracy in 1995. It is available for Linux, Windows and Mac OS X; however, due to limited resources, only Windows and Ubuntu are rigorously tested by the developers.
Tesseract up to and including version 2 could only accept TIFF images of simple one-column text as input. These early versions did not include layout analysis, so multi-column text, images, or equations produced garbled output. Since version 3.00 Tesseract has supported output text formatting, hOCR positional information and page layout analysis. Support for a number of new image formats was added using the Leptonica library. Tesseract can detect whether text is monospaced or proportional.
The initial versions of Tesseract could only recognize English language text. Starting with version 2 Tesseract was able to process English, French, Italian, German, Spanish, Brazilian Portuguese and Dutch. Starting with version 3 it can recognize Arabic, English, Bulgarian, Catalan, Czech, Chinese (Simplified and Traditional), Danish, German (standard and Fraktur script), Greek, Finnish, French, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak (standard and Fraktur script), Slovenian, Spanish, Serbian, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian and Vietnamese. Tesseract can be trained to work in other languages too.
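As an illustration of the language and output options described above, the following is a minimal sketch of how the command-line tool might be invoked from Python. It assumes a `tesseract` executable from the 3.x series is on the PATH, that the requested language data (for example `eng` or `deu`) is installed, and that the `hocr` config file shipped with the installation is available; the helper names `build_tesseract_command` and `run_tesseract` are hypothetical and not part of Tesseract.

```python
import subprocess


def build_tesseract_command(image_path, output_base, language="eng", hocr=False):
    """Assemble the argument list for one Tesseract run (hypothetical helper).

    Tesseract appends the file extension itself, so an output base of
    "output_text" yields "output_text.txt" for plain-text output.
    """
    command = ["tesseract", image_path, output_base, "-l", language]
    if hocr:
        # Assumption: the installation ships an "hocr" config file that
        # switches the output to hOCR positional information.
        command.append("hocr")
    return command


def run_tesseract(image_path, output_base, language="eng", hocr=False):
    """Run Tesseract and raise CalledProcessError if it exits with an error."""
    subprocess.run(
        build_tesseract_command(image_path, output_base, language, hocr),
        check=True,
    )


if __name__ == "__main__":
    # Only print the command here; call run_tesseract() on a machine
    # where Tesseract is actually installed.
    print(" ".join(build_tesseract_command("input_image.tif", "output_text")))
```

Keeping the command construction separate from its execution makes it possible to inspect or log the exact invocation without Tesseract being installed.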
Tesseract's output will be of very poor quality if the input images are not preprocessed to suit it. Images (especially screenshots) must be scaled up so that the text x-height is at least 20 pixels; any rotation or skew must be corrected, or no text will be recognized; low-frequency changes in brightness must be high-pass filtered, or Tesseract's binarization stage will destroy much of the page; and dark borders must be removed manually, or they will be misinterpreted as characters.
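Two of these preprocessing steps can be sketched in a few lines of Python. This is an illustrative sketch only, not Tesseract's own code: it assumes the x-height has already been measured by some other means, represents a grayscale image as a plain list of rows, and uses subtraction of a local box-blur mean as one simple form of high-pass filter. The function names and the mid-gray offset of 128 are hypothetical choices.

```python
MIN_X_HEIGHT_PX = 20  # minimum x-height in pixels, as stated in the text


def upscale_factor(measured_x_height_px, minimum_px=MIN_X_HEIGHT_PX):
    """Return the enlargement factor needed for the x-height to reach minimum_px.

    Images that already meet the minimum are left at their original size.
    """
    if measured_x_height_px <= 0:
        raise ValueError("x-height must be positive")
    return max(1.0, minimum_px / measured_x_height_px)


def high_pass(pixels, radius):
    """Remove slow brightness drift by subtracting a local mean from each pixel.

    pixels is a list of rows of grayscale values in 0..255 (assumed format).
    The result is re-centred on mid-gray (128) and clamped to 0..255.
    """
    height, width = len(pixels), len(pixels[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Window bounds are clipped at the image edges.
            y0, y1 = max(0, y - radius), min(height, y + radius + 1)
            x0, x1 = max(0, x - radius), min(width, x + radius + 1)
            window = [pixels[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            local_mean = sum(window) / len(window)
            value = pixels[y][x] - local_mean + 128
            row.append(int(min(255, max(0, round(value)))))
        result.append(row)
    return result
```

For example, text with a measured x-height of 8 pixels would need to be enlarged by a factor of 20 / 8 = 2.5, while a uniformly lit region carries no detail and is mapped to flat mid-gray by the filter.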
There are several separate projects which provide a GUI for Tesseract:
- FreeOCR – a Windows Tesseract GUI. However, it has been widely reported to install malware along with the OCR program.
- gImageReader – GTK GUI frontend for Tesseract that supports selecting columns and parts of the document. It can open multipage PDF files or images, supports all formats, can transmit a selected area to Tesseract for recognition and spell check the output.
- gscan2pdf – A GUI to produce PDFs or DjVus from scanned documents
- k2pdfopt – An open-source, cross-platform program to optimize PDF files for e-readers. It can add a Tesseract-based OCR layer to a scanned PDF. The MS-Windows version offers a GUI.
- OCRFeeder – Features a complete GTK graphical user interface that allows users to correct any unrecognized characters, define or correct bounding boxes, set paragraph styles, clean the input images, import PDFs, save and load the project, export everything to multiple formats, etc.
- OcrGui – A Linux GUI written in C using the GLib and GTK+ frameworks. It supports both Tesseract and GOCR, and includes spell checking using Hunspell, an open source spell checker.
- Qiqqa – A freeware PDF reference management tool that uses Tesseract to interpret scanned PDFs for full-index searching.
- Tesseract GUI – A free software GUI for Mac OS X
- TextRipper – a Linux GUI for Tesseract and/or Ocrad with multi-page, multi-column, and multi-file selection support.
- VietOCR – A Java-based cross-platform GUI that includes a language pack for Vietnamese and special post-processing tools for Vietnamese. It can be used for recognizing text in all languages supported by Tesseract by downloading the appropriate language data files.
- YAGF – Graphical front-end (Qt 4.x) for CuneiForm and Tesseract for Linux
Libraries using Tesseract engine
- ABCocr.NET – an OCR component for Microsoft's .NET Framework, with support for 64-bit systems, built around a custom version of the Tesseract 3 engine.
- hOcr2Pdf.NET – a .NET library to convert Tesseract recognized images into PDF with search capabilities using HtmlAgilityPack and iTextSharp.
- Tess4J – a Java wrapper for the Tesseract API.
- ruby-tesseract-ocr – a Ruby wrapper for the Tesseract API.
- PyPI search – a number of Python modules that wrap the Tesseract API.
In a July 2007 article on Tesseract, Anthony Kay of Linux Journal termed it "a quirky command-line tool that does an outstanding job". At that time he noted "Tesseract is a bare-bones OCR engine. The build process is a little quirky, and the engine needs some additional features (such as layout detection), but the core feature, text recognition, is drastically better than anything else I've tried from the Open Source community. It is reasonably easy to get excellent recognition rates using nothing more than a scanner and some image tools, such as The GIMP and Netpbm."
- Google (2008). "tesseract-ocr". Retrieved 2008-07-12.
- Kay, Anthony (July 2007). "Tesseract: an Open-Source Optical Character Recognition Engine". Linux Journal. Retrieved 28 September 2011.
- Vincent, Luc (August 2006). "Announcing Tesseract OCR". Retrieved 2008-06-26.
- Canonical Ltd. (February 2011). "OCR". Retrieved 2011-02-11.
- Announcing Tesseract OCR - The official Google blog
- Willis, Nathan (September 2006). "Google's Tesseract OCR engine is a quantum leap forward". Retrieved 2008-07-18.
- Rice, Stephen V.; Jenkins, Frank R.; Nartker, Thomas A. "The Fourth Annual Test of OCR Accuracy". expervision.com. Retrieved 21 May 2013.
- Tesseract Project (February 2011). "Issue 263: patch to enable hOCR output". Retrieved 26 February 2011.
- "TrainingTesseract3". Retrieved 9 October 2011.
- Announcing the OCRopus Open Source OCR System (Thomas Breuel, OCRopus Project Leader)
- "FAQ - tesseract-ocr - Frequently Asked Questions - An OCR Engine that was developed at HP Labs between 1985 and 1995... and now at Google. - Google Project Hosting". Code.google.com. Retrieved 2014-05-30.
- "ImproveQuality - tesseract-ocr - Advice on improving the quality of your output. - An OCR Engine that was developed at HP Labs between 1985 and 1995... and now at Google. - Google Project Hosting". Code.google.com. 2014-01-27. Retrieved 2014-05-30.
- Google Code – Tesseract Readme
- "3rdParty - tesseract-ocr - GUIs and Other Projects using Tesseract OCR.". Code.google.com. Retrieved 2013-05-21.
- "FreeOCR". 2010. Retrieved January 2010.
- Softi Software. "FreeOCR - Free download and software reviews - CNET Download.com". Download.cnet.com. Retrieved 2014-02-05.
- SourceForge (2010). "gImageReader". Retrieved 12 July 2010.
- "gscan2pdf". 2010. Retrieved September 2010.
- "k2pdfopt". 2014. Retrieved November 2014.
- Gnome.org (August 2010). "OCRFeeder". Retrieved 8 August 2010.
- emanueles (2010). "OcrGui". Retrieved 27 August 2010.
- "Qiqqa". 2011. Retrieved 26 January 2011.
- "Tesseract GUI". Retrieved 27 April 2011.
- "TextRipper". 2011. Retrieved January 2011.
- SourceForge (June 2010). "VietOCR". Retrieved 12 July 2010.
- "YAGF". September 2011. Retrieved 1 September 2011.
- "ABCocr.NET". July 2012. Retrieved 11 July 2012.
- "hOcr2Pdf.NET". 2011. Retrieved April 2011.
- "Tess4J". 2013.
- "ruby-tesseract". 2013.
- Official website
- Hacking Tesseract V0.04 – C/C++ structure of Tesseract extracted from Doxyfied source code (based on Tesseract V1.03)
- Tesseract OCR Engine What it is, where it came from, where it is going.