Comparison of optical character recognition software

From Wikipedia, the free encyclopedia

This comparison of optical character recognition software includes:

  • OCR engines, which perform the actual character identification
  • Layout analysis software, which divides scanned documents into zones suitable for OCR
  • Graphical interfaces to one or more OCR engines
  • Software development kits that are used to add OCR capabilities to other software (e.g. forms processing applications, document imaging management systems, e-discovery systems, records management solutions)
Name Founded year Latest stable version Release year License Online Windows Mac OS X Linux BSD Programming language SDK? Languages Fonts Output Formats Notes
Tesseract 1985 3.02 2012 Apache No Yes Yes Yes Yes C++, C Yes 35+[1] ? Text, hOCR,[2] others with different user interfaces[3] or the API Created by Hewlett-Packard; under further development by Google.[4] It was one of the top 3 engines in the 1995 UNLV Accuracy test.
ExperVision[5] TypeReader & RTK 1987 7.1.170.1125 2010 Proprietary Yes Yes Yes Yes Yes C/C++ Yes 21 2618 Won the highest marks in the independent testing performed by UNLV for X consecutive years (in 1994).[6][citation needed] According to PC Magazine, "The speed of ExperVision’s OpenRTK is four to eight times faster than competition."[7] PC Magazine also wrote: "Not as accurate as rival products, clumsy interface, limited options for proofreading, couldn't open some files in standard PDF or image formats."[8]
ABBYY FineReader 1989 11 2011 Proprietary Yes Yes Yes Yes Yes C/C++ Yes 198[9] ? DOC, DOCX, XLS, XLSX, PPTX, RTF, PDF, HTML, CSV, TXT, ODT, DjVu, EPUB, FB2[10] ABBYY also supplies SDKs for embedded and mobile devices. Professional, Corporate and Site License Editions for Windows, Express Edition for Mac.[11]
AnyDoc Software 1989 ? ? Proprietary No Yes No No No VBScript ? ? ? Works with structured, semi-structured, and unstructured documents.
LEADTOOLS[12] 1990[13] 18.0 2013 Proprietary Yes Yes Yes Yes No C/C++, .NET, Objective-C, Java, JavaScript Yes 56[14] Any printed font PDF, PDF/A, DOC, DOCX, XLS, XPS, RTF, HTML, ANSI Text, Unicode Text, CSV[15] Supports Latin, Asian, Arabic, and MICR character sets.[12] For full page, zonal, and form image processing. Includes OCR, barcode, OMR and forms recognition.[16] ICR (handwritten text recognition) is supported.[17]
CuneiForm 1996 12 2007 BSD variant No Yes Yes Yes Yes C/C++ Yes 28 Any printed font HTML, hOCR, native, RTF, TeX, TXT[18] Enterprise-class system that can save text formatting and recognize complicated tables of any structure
Transym OCR 2000 3.3 2011 Proprietary No Yes No No No C#, C/C++, VB, VB.NET Yes 11 ?
SimpleOCR 2002 3.5 2008 Proprietary No Yes No No No ? ? ? ?
Dynamsoft OCR SDK 2003 8.2 2012 Proprietary Yes Yes No No No C/C++ Yes 40+[19] ? PDF, TXT Dynamsoft also provides image capture SDKs and version control tools.
OmniPage 1970s 19 2013 Proprietary Yes Yes Yes Yes No C/C++, C#[20] Yes 125[21] Machine and handprinted fonts DOC/DOCX, XLS/XLSX, PPTX, RTF, PDF, PDF/A, Searchable PDF, HTML, Text, XML, ePUB, MP3 Product of Nuance Communications
Microsoft Office OneNote 2007 2007 ? 2007 Proprietary No Yes No No No ? ? ? ?
FreeOCR ? 4.2 August 2012 Proprietary No Yes No No No ? ? ? ? [22]
GOCR ? 0.50 2013 GPL Yes[23] Yes Yes Yes Yes C ? ? ?
Ocrad ? 0.22[24] 2013 GPL Yes Yes Yes Yes Yes C++ Yes Latin alphabet ? Command line
SmartScore ? ? ? Proprietary No Yes Yes No No ? ? ? ? For musical scores
Microsoft Office Document Imaging ? Office 2007 2007 Proprietary No Yes No No No ? ? ? ? Uses OmniPage[citation needed]
Puma.NET ? ? ? BSD No Yes No No No C# Yes 28 Any printed font .NET OCR SDK based on Cognitive Technologies' CuneiForm recognition engine. Wraps Puma COM server and provides simplified API for .NET applications
ReadSoft ? ? ? Proprietary No Yes No No No ? ? ? ? Scan, capture and classify business documents such as invoices, forms and purchase orders integrated with business processes.
Scantron ? ? ? Proprietary No Yes No No No ? ? ? ? For working with localized interfaces, corresponding language support is required.
OCRFeeder ? 0.7.11 2009 GPL No No No Yes No Python ? ? ? Features a full user interface and has a command-line tool for automatic operations. Has its own segmentation algorithm but uses system-wide OCR engines like Tesseract or Ocrad
OCRopus ? 0.6 2012 Apache No No No Yes No Python ? ? ? hOCR, HTML, TXT[25] Pluggable framework under active development, used for Google Books
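
Several entries in the table (Tesseract, CuneiForm, OCRopus) list hOCR as an output format. hOCR embeds recognition results in HTML, so it can be read with an ordinary HTML parser. The following is a minimal sketch in Python, assuming the common hOCR convention in which each word is a span of class ocrx_word whose title attribute carries a "bbox x0 y0 x1 y1" property. The sample markup and the names parse_bbox, HocrWordParser and extract_words are illustrative assumptions, not part of any listed product's API, and real output from a given engine may differ in detail.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    # The title holds semicolon-separated properties, e.g.
    # "bbox 36 92 96 116; x_wconf 95".
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from elements of class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []       # finished (text, bbox) tuples
        self._in_word = False  # True while inside a word span
        self._bbox = None      # bbox of the word being read
        self._text = []        # text fragments of the word being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocrx_word" in classes:
            self._in_word = True
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumption: word spans do not contain nested spans.
        if tag == "span" and self._in_word:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._in_word = False


def extract_words(hocr):
    """Parse an hOCR string and return its words with bounding boxes."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hand-written sample in the hOCR style (not output from a real engine).
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 640 480'>
  <span class='ocr_line' title='bbox 36 92 210 116'>
    <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
    <span class='ocrx_word' title='bbox 110 92 210 116; x_wconf 91'>world</span>
  </span>
</div>
"""

if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE):
        print(text, bbox)
```

Because the positional data travels with the text, a downstream tool can use the bounding boxes to place recognized words over the page image, which is how hOCR output is typically turned into searchable documents.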

References

  1. ^ Based on count of language training files for version 3.x on 14 December 2010. Available at the download page.
  2. ^ Usage explained in the Tesseract Readme and FAQ
  3. ^ Such as PDF & DjVu with gscan2pdf and ODF with OCRFeeder
  4. ^ "tesseract-ocr - An OCR Engine that was developed at HP Labs between 1985 and 1995... and now at Google. - Google Project Hosting". Code.google.com. Retrieved 2013-09-12. 
  5. ^ "OpenRTK – ExperVision OCR SDK | OCR Software, OCR SDK & Toolkit, OCR Service – ExperVision OCR". Expervision.com. Retrieved 2013-09-12. 
  6. ^ http://www.isri.unlv.edu/downloads/AT-1994.pdf
  7. ^ "Expervision TypeReader Desktop 7.0". Retrieved 2010-11-15. 
  8. ^ Mendelson, Edward. "TypeReader 2008". PC Magazine. 
  9. ^ "ABBYY FineReader 11: Full Feature List". Finereader.abbyy.com. Retrieved 2013-09-12. 
  10. ^ "ABBYY FineReader 11: Technical Specifications". Finereader.abbyy.com. Retrieved 2013-09-12. 
  11. ^ "Top OCR Software". Ocrworld.com. 2010-03-30. Retrieved 2013-09-12. 
  12. ^ a b "Ocr Sdk". Leadtools. Retrieved 2013-09-12. 
  13. ^ "LEAD Technologies, Inc. Corporate Information". Leadtools.com. Retrieved 2013-09-12. 
  14. ^ "Ocr Sdk". Leadtools. Retrieved 2013-09-12. 
  15. ^ "OCR SDK Output Formats". Leadtools. Retrieved 2013-09-12. 
  16. ^ "LEADTOOLS Recognition Imaging Developer Toolkit". Leadtools.com. Retrieved 2013-09-12. 
  17. ^ "Icr Sdk". Leadtools. Retrieved 2013-09-12. 
  18. ^ Debian manual page for Cuneiform for Linux version 1.1.0
  19. ^ "OCR SDK Language Packages Download". Dynamsoft.com. Retrieved 2013-09-12. 
  20. ^ "OmniPage CSDK - OCR Document Capture Toolkit | Document Imaging & OCR". Nuance. Retrieved 2013-09-12. 
  21. ^ "OmniPage Standard Document Conversion". Nuance. Retrieved 2014-02-25. 
  22. ^ "Free OCR Software - Optical Character Recognition Software for Windows import from PDF and Twain Scanners". Paperfile.net. Retrieved 2013-09-12. 
  23. ^ "GOCR". Jocr.sourceforge.net. Retrieved 2013-09-12. 
  24. ^ Diaz, Antonio (2013-07-12). "Version 0.22 of GNU Ocrad released". info-gnu. http://lists.gnu.org/archive/html/info-gnu/2013-07/msg00004.html.
  25. ^ OCRopus includes the ocropus-hocr tool which produces hOCR from the recognition results.