Tesseract (software)

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by Azimout (talk | contribs) at 08:52, 31 December 2011 (google since 2006). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Tesseract
Original author(s): Ray Smith, Hewlett-Packard[1]
Developer(s): Google
Stable release: 3.01 / October 21, 2011[1]
Written in: C and C++
Operating system: Linux (32 & 64-bit), Windows (32-bit), and, unofficially, Mac OS X (x86)
Available in:
  Interface: English
  Recognition: Bulgarian, Catalan, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Indonesian, Italian, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Tagalog, Turkish, Ukrainian & Vietnamese (more can be added using included training files)
Type: Optical character recognition
License: Apache License v2.0
Website: http://code.google.com/p/tesseract-ocr

Tesseract is a free software optical character recognition engine for various operating systems.[2]

Originally developed as proprietary software at Hewlett-Packard between 1985 and 1995, it saw very little work in the following decade. It was released as open source in 2005 by Hewlett-Packard and the University of Nevada, Las Vegas (UNLV). Tesseract development has been sponsored by Google since 2006. It is released under the Apache License, Version 2.0.[1][3][4]

Tesseract is considered one of the most accurate free software OCR engines currently available.[4][5]

History

The Tesseract engine was developed at Hewlett-Packard Laboratories in Bristol and at Hewlett-Packard Co. in Greeley, Colorado, between 1985 and 1994, with further changes made in 1996 to port it to Windows and some migration from C to C++ in 1998. Much of the code was originally written in C, with later additions in C++. Since then, all of the code has been converted so that it at least compiles with a C++ compiler.[3]

Currently Tesseract builds under Linux with GCC 2.95 or later and under Windows with Visual C++ 6. The C++ code makes heavy use of a list system using macros. This predates the C++ Standard Template Library and may be more efficient than Standard Template Library lists, but is reportedly harder to debug in the event of a segmentation fault. Another side-effect of the C/C++ split is that the C++ data structures get converted to C data structures to call the low-level C code. The migration to C++ is a step towards eliminating this conversion, though it is not yet complete.[citation needed]

Features

Tesseract was among the top three OCR engines in terms of character accuracy in 1995. It is available for Linux, Windows and Mac OS X; however, due to limited resources, only Windows and Ubuntu are rigorously tested by the developers.[3][4][6]

Tesseract up to and including version 2 could only accept TIFF images of simple one-column text as input. These early versions did not include layout analysis, so multi-column text, images, or equations produced garbled output. Since version 3.00, Tesseract has supported output text formatting, hOCR positional information and page layout analysis. Support for a number of new image formats was added using the Leptonica library. Tesseract can detect whether text is monospaced or proportional.[4]
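The hOCR positional information mentioned above is HTML in which each recognized element carries its bounding box in a `title` attribute of the form `bbox x0 y0 x1 y1`. The following Python sketch shows one way such output might be read back. The class name `HocrWordParser`, the helper `parse_hocr_words`, the sample markup, and the default word class `ocrx_word` are illustrative assumptions and are not part of Tesseract; the class used for word spans can differ between Tesseract versions, so it is left as a parameter.

```python
from html.parser import HTMLParser
import re

# hOCR keeps coordinates in the title attribute, e.g. title="bbox 36 92 96 130".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) pairs for every word-level span.

    Hypothetical helper: the word class name is an assumption and may need
    to be changed to match the hOCR produced by a given Tesseract version.
    """

    def __init__(self, word_class="ocrx_word"):
        super().__init__()
        self.word_class = word_class
        self.words = []
        self._bbox = None   # bounding box of the word span being read, if any
        self._chunks = []   # text fragments seen inside that span
        self._depth = 0     # number of spans nested inside the word span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._bbox is not None:
            self._depth += 1  # a span nested inside the current word
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        if self.word_class in classes and match:
            self._bbox = tuple(int(g) for g in match.groups())
            self._chunks = []
            self._depth = 0

    def handle_data(self, data):
        if self._bbox is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or self._bbox is None:
            return
        if self._depth > 0:
            self._depth -= 1  # closing a nested span, the word is still open
            return
        self.words.append(("".join(self._chunks).strip(), self._bbox))
        self._bbox = None


def parse_hocr_words(hocr, word_class="ocrx_word"):
    """Return a list of (text, bbox) tuples found in an hOCR string."""
    parser = HocrWordParser(word_class)
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hand-written sample fragment, not actual Tesseract output.
SAMPLE = (
    '<span class="ocr_line" title="bbox 36 92 618 130">'
    '<span class="ocrx_word" title="bbox 36 92 96 130">Hello</span> '
    '<span class="ocrx_word" title="bbox 110 92 200 130">world</span>'
    '</span>'
)

if __name__ == "__main__":
    for text, bbox in parse_hocr_words(SAMPLE):
        print(text, bbox)
```

Only the standard library is used here; a frontend that needs lines or paragraphs rather than words could pass a different class name, such as `ocr_line`.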

The initial versions of Tesseract could only recognize English-language text. Starting with version 2, Tesseract was able to process English, French, Italian, German, Spanish, Brazilian Portuguese and Dutch. Starting with version 3, it can recognize Arabic, English, Bulgarian, Catalan, Czech, Chinese (Simplified and Traditional), Danish (standard and Fraktur script), German, Greek, Finnish, French, Hebrew, Croatian, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak (standard and Fraktur script), Slovenian, Spanish, Serbian, Swedish, Tagalog, Thai, Turkish, Ukrainian and Vietnamese. Tesseract can also be trained to work in other languages.[4]

If Tesseract is used to process right-to-left text such as Arabic or Hebrew, the results are ordered as though the text were left-to-right.[7]

Tesseract is suitable for use as a backend, and can be used for more complicated OCR tasks including layout analysis by using a frontend such as OCRopus.[8]

User interfaces

Figure: Tesseract configuration window in OCRFeeder

Tesseract does not come with a GUI and is instead run from the command-line interface.[9]
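Because Tesseract is driven from the command line, frontends typically invoke it as a subprocess. The Python sketch below assumes the commonly documented invocation form for the 3.x series, `tesseract imagename outputbase [-l lang] [configfile...]`, and assumes that a `tesseract` executable is on the `PATH`. The function names, file names and the use of an `hocr` config file are illustrative assumptions rather than part of Tesseract itself.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang=None, configs=()):
    """Build the argument list for a Tesseract run without executing it.

    Assumed form: tesseract imagename outputbase [-l lang] [configfile...]
    """
    cmd = ["tesseract", image_path, output_base]
    if lang:
        cmd += ["-l", lang]   # language code, for example "eng" or "deu"
    cmd += list(configs)      # optional config file names, for example "hocr"
    return cmd


def run_tesseract(image_path, output_base, lang=None, configs=()):
    """Run Tesseract and raise CalledProcessError on a non-zero exit status.

    Tesseract writes its result to a file derived from output_base, so
    nothing is returned here.
    """
    cmd = build_tesseract_command(image_path, output_base, lang, configs)
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Show the command that would be run for a hypothetical scanned page.
    print(" ".join(build_tesseract_command("page.tif", "page", lang="eng")))
```

Keeping command construction separate from execution lets a frontend log or inspect the exact command before running it.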

There are several separate projects which provide a GUI for Tesseract:

  • FreeOCR – a Windows Tesseract GUI[10]
  • gImageReader – A GTK GUI frontend for Tesseract that supports selecting columns and parts of the document. It can open multipage PDF files or images, supports all formats, can transmit a selected area to Tesseract for recognition, and can spell check the output.[11]
  • gscan2pdf – A GUI to produce PDFs or DjVus from scanned documents[12]
  • OCRFeeder – Features a complete GTK graphical user interface that allows users to correct any unrecognized characters, define or correct bounding boxes, set paragraph styles, clean the input images, import PDFs, save and load the project, and export everything to multiple formats.[13]
  • OcrGui – A Linux GUI written in C using the GLib and GTK+ frameworks. It supports both Tesseract and GOCR and includes spell checking using Hunspell, an open source spell checker.[14]
  • Qiqqa – A freeware PDF reference management tool that uses Tesseract to interpret scanned PDFs for full-index searching.[15]
  • Tesseract GUI – A Mac OS X free software GUI[16]
  • TextRipper – A Linux GUI for Tesseract and/or Ocrad with multi-page, multi-column, and multi-file selection support.[17]
  • VietOCR – A Java-based cross-platform GUI that includes a language pack for Vietnamese and special post-processing tools for Vietnamese[18]
  • YAGF – A graphical front-end (Qt 4.x) for CuneiForm and Tesseract[19]

Libraries using Tesseract engine

  • ABCocr .NET – an OCR component for Microsoft's .NET Framework, with support for 64-bit systems, built around a custom version of the Tesseract 3 engine.[citation needed]
  • hOcr2Pdf.NET – a .NET library to convert Tesseract recognized images into PDF with search capabilities using HtmlAgilityPack and iTextSharp.[20]

Reception

In a July 2007 article on Tesseract, Anthony Kay of Linux Journal termed it "a quirky command-line tool that does an outstanding job". At that time he noted "Tesseract is a bare-bones OCR engine. The build process is a little quirky, and the engine needs some additional features (such as layout detection), but the core feature, text recognition, is drastically better than anything else I've tried from the Open Source community. It is reasonably easy to get excellent recognition rates using nothing more than a scanner and some image tools, such as The GIMP and Netpbm."[2]

References

  1. Google (2008). "tesseract-ocr". Retrieved 2008-07-12.
  2. Kay, Anthony (2007). "Tesseract: an Open-Source Optical Character Recognition Engine". Linux Journal. Retrieved 28 September 2011.
  3. Vincent, Luc (2006). "Announcing Tesseract OCR". Retrieved 2008-06-26.
  4. Canonical Ltd. (2011). "OCR". Retrieved 2011-02-11.
  5. Willis, Nathan (2006). "Google's Tesseract OCR engine is a quantum leap forward". Retrieved 2008-07-18.
  6. Tesseract Project (2011). "Issue 263: patch to enable hOCR output". Retrieved 26 February 2011.
  7. "TrainingTesseract3". Retrieved 9 October 2011.
  8. Announcing the OCRopus Open Source OCR System (Thomas Breuel, OCRopus Project Leader)
  9. Google Code – Tesseract Readme
  10. "FreeOCR". 2010. Retrieved January 2010.
  11. SourceForge (2010). "gImageReader". Retrieved 12 July 2010.
  12. "gscan2pdf". 2010. Retrieved September 2010.
  13. Gnome.org (2010). "OCRFeeder". Retrieved 8 August 2010.
  14. emanueles (2010). "OcrGui". Retrieved 27 August 2010.
  15. "Qiqqa". 2011. Retrieved 26 January 2011.
  16. "Tesseract GUI". Retrieved 27 April 2011.
  17. "TextRipper". 2011. Retrieved January 2011.
  18. SourceForge (2010). "VietOCR". Retrieved 12 July 2010.
  19. "YAGF". 2011. Retrieved 1 September 2011.
  20. "hOcr2Pdf.NET". 2011. Retrieved April 2011.

See also