Talk:Optical character recognition
This is the talk page for discussing improvements to the Optical character recognition article.
This article may be too technical for most readers to understand. (September 2010)
Text from this version of Optical character recognition was copied or moved into List of optical character recognition software with this edit. The former page's history now serves to provide attribution for that content in the latter page, and it must not be deleted so long as the latter page exists.
- 1 Missing an overview of where OCR fits into a document processing solution
- 2 Zip codes
- 3 Open source programs
- 4 CJK Support?
- 5 MICR
- 6 Merge
- 7 Section "Optical Character Recognition in Unicode"
- 8 OCR for mathematical documents
- 9 Wrong word?
- 10 MAP
- 11 Tesseract
- 12 Citations?
- 13 unknown characters
- 14 Character 0x244B
- 15 Strongly suggest a 'software - last release date' column in table
- 16 Missing
- 17 This article doesn't even mention Cyrillic OCR!!!
- 18 Uses of OCR
- 19 Simplicity
- 20 Adobe Acrobat
- 21 Zonal OCR
- 22 Removing non-notable and promotional links again
- 23 Mac OS support
- 24 A solved problem?
- 25 Proposal to Split - software tables
- 26 Typical accuracy rates are inaccurate and need a citation
- 27 Did GF Handel REALLY live over 200 years and invent an OCR algorithm?
- 28 IT technology
- 29 Does not describe different aspects/problems/algorithms/approaches to OCR
- 30 Robotics attention needed
Missing an overview of where OCR fits into a document processing solution
Key to a good OCR rate is the quality of the input images and their pre-processing. This needs to be added to the article. For example, thresholding low-resolution images of text is critical for good OCR results. This leads into topics such as background removal, background normalization, Otsu thresholding, median filtering, demosaicing, etc. A simple chart of OCR recognition rates for various scan DPI settings would help. Commercial products like ABBYY FineReader suggest that characters should be at least 20 pixels high to be OCR'd with good results.
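To illustrate the thresholding step mentioned above, here is a minimal pure-Python sketch of Otsu's method on a flat list of 8-bit grayscale values. The function names and the sample pixel values are made up for illustration; a real pipeline would more likely use an image-processing library, and this is not taken from any particular OCR product.

```python
def otsu_threshold(pixels):
    """Return a threshold t (0-255) that maximizes between-class variance.

    Pixels <= t are treated as one class (e.g. ink), pixels > t as the other.
    `pixels` is any iterable of integers in the range 0..255.
    """
    hist = [0] * 256
    total = 0
    for p in pixels:
        hist[p] += 1
        total += 1

    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels with value <= t
    sum0 = 0    # sum of pixel values <= t

    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue            # no pixels in the dark class yet
        w1 = total - w0
        if w1 == 0:
            break               # every pixel is already in the dark class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to True (background, brighter than threshold) or False (ink)."""
    return [p > threshold for p in pixels]


# Hypothetical sample: dark text pixels around 30-40, light paper around 200-210.
sample = [30] * 50 + [40] * 50 + [200] * 60 + [210] * 40
t = otsu_threshold(sample)
background = binarize(sample, t)
print("threshold:", t, "background pixels:", sum(background))
```

Background normalization and median filtering would normally run before this step, since a single global threshold assumes fairly even lighting across the page.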
A chart giving the resulting character size in pixels based on character point size and scan DPI would also help. E.g., 75 dpi scans of 10-point text produce horrible results, whilst 300 dpi scans of 10-point text produce excellent results. —Preceding unsigned comment added by 126.96.36.199 (talk) 21:00, August 25, 2007 (UTC)
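The chart asked for above comes down to one conversion: a point is 1/72 of an inch, so nominal size in pixels is point size times DPI divided by 72. The sketch below is only a rough guide; it computes the nominal font size (the em height), and actual letter heights are smaller than that. Applying the 20-pixel guideline to the nominal size is an assumption made here for illustration, and the function names are hypothetical.

```python
POINTS_PER_INCH = 72  # 1 point = 1/72 inch


def font_size_in_pixels(point_size, dpi):
    """Nominal font size (em height) in pixels for a given scan resolution."""
    return point_size * dpi / POINTS_PER_INCH


def min_dpi_for_height(point_size, min_pixels):
    """Lowest DPI at which the nominal font size reaches min_pixels."""
    return min_pixels * POINTS_PER_INCH / point_size


# Small table of nominal sizes for 10-point text at common scan resolutions.
for dpi in (75, 150, 300, 600):
    print(f"10 pt at {dpi:>3} dpi: {font_size_in_pixels(10, dpi):.1f} px")

print("DPI needed for 10 pt to reach 20 px:", min_dpi_for_height(10, 20))
```

For 10-point text this gives roughly 10 pixels at 75 dpi and roughly 42 pixels at 300 dpi, which is consistent with the poor and excellent results described above.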
Open source programs
Are there any open source OCR programs available?
- I see that http://simpleocr.com/ is free for "personal use"; is it really open source?
Section about software
- Kooka - default scanning application in KDE. It uses GOCR for OCR
- Tesseract is an open source OCR, initially developed by HP, and released under the Apache License, Version 2.0. It can be compiled using MSVC 6.0 or GCC (~120000 LOC)
- Clara (~50000 LOC)
- GOCR - (~20000 LOC + Unpaper + Socrates) - GOCR included in Debian and other distributions (not for Windows)
- Ocrad (~9900 LOC) - "is an OCR [...] program based on a feature extraction method".
- Simple OCR - freeware application available, as well as royalty free SDK and source code.
- ISRI Software - some experimental OCR tools
- OCRchie - dormant since 1996
- OOCR - an OCR program still in development, under the GPL.
- phpOCR - a base implementation for an OCR tool in PHP
- Kognition
CJK Support?
This article doesn't mention anything about OCR support for Chinese, Japanese, and Korean, though that information would be very valuable, especially if there is free software with CJK support. Theshibboleth 00:11, 10 May 2006 (UTC)
- Seconded. I'm disappointed in you all! Astarica 09:37, 6 September 2007 (UTC)
MICR
The reference to MICR seems strangely disjointed, as though it is written in the context of human reading rather than machine reading. I am minded to amend it. Would anyone object? Tom 00:00, 5 June 2006 (UTC)
Merge
I am proposing the merge. Neither article is unduly long and it would be much more convenient to the reader to have all the relevant information in one place. BlueValour 17:22, 5 November 2006 (UTC)
- Agreed, it should be in this article under a subsection, makes it easier to find. —Preceding unsigned comment added by 188.8.131.52 (talk • contribs)
Section "Optical Character Recognition in Unicode"
It's not at all clear from the article what those characters are used for. 184.108.40.206 19:47, 18 March 2007 (UTC)
I can't find a definitive source, but it appears to me that the codes for OCR DASH and OCR CUSTOMER ACCOUNT NUMBER are swapped. According to http://theorem.ca/~mvcorks/cgi-bin/unicode.pl.cgi?start=2440&end=245f , OCR DASH is 0x2448 and OCR CUSTOMER ACCOUNT NUMBER is 0x2449. —Preceding unsigned comment added by 220.127.116.11 (talk) 07:02, 6 July 2008 (UTC)
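One way to check the code point assignments discussed above without relying on a third-party page is to ask the Unicode character database that ships with Python. The sketch below lists the named characters in the range 0x2440 to 0x245F; the helper name is hypothetical, and the result reflects whatever Unicode version the local Python was built with. It shows only the assigned names, not whether the article's glyphs match them.

```python
import unicodedata


def ocr_block_names(start=0x2440, end=0x245F):
    """Return {code point: character name} for assigned characters in the range."""
    names = {}
    for cp in range(start, end + 1):
        # The second argument is the default returned for unassigned code points.
        name = unicodedata.name(chr(cp), None)
        if name is not None:
            names[cp] = name
    return names


for cp, name in ocr_block_names().items():
    print(f"U+{cp:04X}  {name}")
```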
OCR for mathematical documents
Searching a bit on the web for a taste of OCR for maths led me to this page: http://www.inftyproject.org Although it's labelled 'free software', going by the license it's obviously just freeware. Anyone know of free/open source alternatives? I'm surprised that there isn't any major software project for this, with (cheap) tablet PCs around the corner and Google's plans to digitise the planet being applied to books.
Most mathematical formulas have been set using TeX, so it shouldn't be that difficult to scan it back in again correctly, right? Merctio 23:01, 11 April 2007 (UTC)
- For the InftyReader/open source alternatives: I believe there are no alternatives yet. The OSS world is still struggling with straight Latin. For the InftyEditor: actually quite common, f.ex. OpenOffice Math. For the rest: mixing audio with math text, I've never heard of the idea, and I couldn't envision it by myself, except possibly as my really mad ideas of combo-TV-garden-rake. I thought math was simply unspeakable! Said: Rursus ☺ ★ 09:56, 19 July 2007 (UTC)
Wrong word?
Should this say "handwritten" instead of "hand-printed"?
"These methods cannot be used in software that scans paper documents, so accurate recognition of hand-printed documents is still largely an open problem."
Matthias Röder 10:56, 6 July 2007 (UTC)
MAP
Open-source software map, feel free to modify:
- Inwiki: GOCR, Ocrad - cmdline?, OCRopus - new, merger?
- Exwiki: Tesseract on g8gle - used by OCRopus, OCRopus on g8gle - nice link to transsurf, Leptonica - unknown whatis,
- Known: ClaraOCR - almost no info there, inactive since 2003.
- Is this related to a project of some sort? It's not really appropriate for this talk page... Chris Cunningham 11:57, 19 July 2007 (UTC)
Tesseract
It's nice that Tesseract is free etc., but trying to use it seems rather technically challenging at this point. Does anyone offer it to try as a free online conversion tool? FreeOCR may be a more user-friendly version, but they may all require 2K/XP for the Windows version, so older OSes are out of luck. The only free online OCR I can find is scanR, but using it seems quite awkward (must email JPEGs, get activation codes, etc.) -18.104.22.168 12:48, 2 October 2007 (UTC)
Where do these OCR characters come from:
- Maybe you weren't looking in Unicode version 6.1? Can't explain all of them, but as for their ultimate provenance some come from MICR and some from OCR-A font. Maybe the remainder are from OCR B? The Unicode documentation I could find unfortunately doesn't really trace ancestry. -- Beland (talk) 06:06, 28 March 2013 (UTC)
Strongly suggest a 'software - last release date' column in table
The software list is misleading, given that many of the open source OCR packages have not had a release in many years and that some of them are in pre-alpha status (Tesseract).—Preceding unsigned comment added by 22.214.171.124 (talk • contribs)
Missing
- Optical mark recognition link
- Glyph recognition with user interaction (e.g., training an OCR package to learn to OCR Latin texts)
- Document preprocessing before OCR (deskew, threshold, etc.)
- OCR test results to give a basic understanding of scan quality, character size and OCR effectiveness
- Mention output formats for OCR documents (plain text, PDF text on top of the original image, etc.)
- Voting techniques for character recognition (i.e., comparing all letter 'e' glyphs on a page to help classify unknown glyphs as the letter 'e')
—Preceding unsigned comment added by 126.96.36.199 (talk • contribs)
- I added some content along these lines; more details are welcome. -- Beland (talk) 05:50, 28 March 2013 (UTC)
This article doesn't even mention Cyrillic OCR!!!
The HP scanner I bought for about $50 five years ago came bundled with software that can OCR Cyrillic text about as well as Roman. Apparently Russians have been making use of these capabilities to put huge amounts of writing from the tsarist and Soviet periods online, in honor of "samizdat" traditions!
Apparently the newest versions of HP's bundled software also OCR Greek, Chinese (simplified or traditional), Arabic, Hebrew and Korean. The only really big omission in contemporary terms seems to be Indic scripts (including variants used outside the subcontinent for Tibetan, Burmese, Thai, Laotian and Cambodian).
- Wow!!! Find some reliable sources and add it to the article. Of course, Cyrillic really is a variation of the Roman alphabet (well, the Latin-Greek-Cyrillic superalphabet), especially from the perspective of OCR.--Prosfilaes (talk) 13:26, 25 May 2008 (UTC)
Uses of OCR
Wow, come on. There are so many. In general, the largest uses of OCR today are related to document management in large institutions, for storage and management of paperless processes. For instance: claims processing (going from paper health insurance claims to digital claims management without the need for manual data entry). Many others are listed if you search "document OCR paperless applications" on Google.com. One example is legal services. D'Artagnol (talk) 22:20, 20 March 2009 (UTC)
Simplicity
If you could be specific about what parts you can't understand, that would help us a lot.
Adobe Acrobat
I can confirm that: you can choose to have it when you scan documents. It takes quite some time, so if you have a lot of documents to scan and don't need it, turn it off. It makes the files bigger too, but adds features to them.
See Scanning options - Make Searchable (Run OCR) at: http://help.adobe.com/en_US/Acrobat/9.0/3D/WS58a04a822e3e50102bd615109794195ff-7f71.w.html --188.8.131.52 (talk) 19:54, 22 January 2009 (UTC)
The OCR feature in Adobe Acrobat is provided by ReadIRIS, which is already listed in OCR Software. Please read http://www.irislink.com/Documents/pdf/200609191402/adobe_en.pdf Ankit (talk) 03:59, 11 October 2009 (UTC)
- Thanks Ankit. From your source we can now clearly state that Adobe uses I.R.I.S.' OCR technology, so there is no need to add Adobe to the list. Ivan Akira (talk) 07:59, 11 October 2009 (UTC)
Removing non-notable and promotional links again
In November I removed all the entries from the OCR software section which did not have their own articles or were obviously promotional. Unfortunately, once again the table is full of indiscriminate examples which in some cases appear blatantly promotional. I'm going to remove these all again in the future. Chris Cunningham (not at work) - talk 19:23, 13 March 2009 (UTC)
Mac OS support
TypeReader does not appear to sell a Mac OS compatible version any longer.
OmniPage does offer a Mac OS version, but it hasn't been updated in years. It lists the system requirements as Mac OS 9 or Mac OS X 10.1. The system requirements on the Nuance web page do not say whether it works with Mac OS X 10.2 or later (current Mac OS X is 10.5).
I believe both of those should either have Mac OS removed from the supported OS columns or a footnote added saying Mac OS support is deprecated or discontinued. 184.108.40.206 (talk) 17:03, 22 March 2009 (UTC) 2009-03-21. Geoff Strickler
A solved problem?
Proposal to Split - software tables
I strongly support the proposal, made in October 2009, to split the OCR software table into a separate article. I suggest that both tables, OCR software and OCR software language support, be moved into a separate page, along with the relevant talk sections. I also suggest that the page be entitled Comparison of OCR software and placed into the category Software comparisons. Artemgy (talk) 08:21, 28 November 2009 (UTC)
I also suggest splitting the OCR software table. Vcgupta 20:05, 28 December 2009
Typical accuracy rates are inaccurate and need a citation
The accuracy rate in industrial applications is less than 95%. The article suggests a 99% accuracy, which might be achieved in a lab under unrealistic conditions. I suggest rephrasing this sentence and researching the accuracy rate under different circumstances. —Preceding unsigned comment added by Mudx77 (talk • contribs) 09:42, 24 January 2010 (UTC)
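Part of the difficulty with the figures above is that "accuracy" needs a definition before rates can be compared. One common convention, assumed here rather than taken from the article, is character accuracy: one minus the edit distance between the OCR output and a reference transcription, divided by the reference length. The sketch below uses that convention; the function names and test strings are illustrative only.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def character_accuracy(reference, ocr_output):
    """Character accuracy relative to a reference transcription (assumed metric)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 1 - edit_distance(reference, ocr_output) / len(reference)


def expected_errors(accuracy, characters):
    """Expected number of character errors at a given accuracy."""
    return (1 - accuracy) * characters


print(character_accuracy("case", "ease"))
print(round(expected_errors(0.99, 2000)), round(expected_errors(0.95, 2000)))
```

Under this definition the gap between 99% and 95% is larger than it sounds: on a page of 2000 characters it is the difference between about 20 and about 100 errors.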
Did GF Handel REALLY live over 200 years and invent an OCR algorithm?
IT technology
Doesn't "With IT technology development" seem strange? IT technology??? Is it okay to say information technology technology? — Preceding unsigned comment added by 220.127.116.11 (talk) 00:00, 23 December 2011 (UTC)
- Definitely! Though this has since been removed from the article. -- Beland (talk) 05:36, 28 March 2013 (UTC)
Does not describe different aspects/problems/algorithms/approaches to OCR
For example: identifying paragraphs, lines, and word borders; using directed acyclic graphs (DAGs) of possible letter recognitions, i.e. encoding the different possible character sequences for a word's image (dam/darn, case/ease, and more complicated examples such as [[vv/w][c/e][t/i/l][c/e][o/0][rn/m][c/e]] for "welcome", which are most compactly described by a DAG); how individual characters are identified (at high dpi: tracing the outlines of the characters; at low dpi: pattern matching, perhaps autocorrelation or neural networks?); identifying images; ...
There is also no description of the current state of approaches to different kinds of characters: which methods work better for low dpi/high dpi, handwritten/typeset text, different kinds of alphabets, dealing with layout, ... — Preceding unsigned comment added by 18.104.22.168 (talk) 14:09, 9 January 2012 (UTC)
- I found the same information lacking, so I scraped some together and added it. -- Beland (talk) 05:35, 28 March 2013 (UTC)
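The "welcome" example in the comment above can be sketched in a few lines. The version below treats the bracket notation as a simple chain of per-position alternatives, which is a special case of a DAG, then expands every path and keeps those found in a word list. The vocabulary is a made-up stand-in for a real dictionary, and real OCR engines typically score paths rather than enumerate them all.

```python
from itertools import product

# Per-position alternatives for the image of the word "welcome",
# following the bracket notation [[vv/w][c/e][t/i/l][c/e][o/0][rn/m][c/e]].
lattice = [
    ["vv", "w"],
    ["c", "e"],
    ["t", "i", "l"],
    ["c", "e"],
    ["o", "0"],
    ["rn", "m"],
    ["c", "e"],
]


def candidate_strings(lattice):
    """Yield every character sequence encoded by the chain of alternatives."""
    for choice in product(*lattice):
        yield "".join(choice)


def dictionary_matches(lattice, vocabulary):
    """Return the candidate readings that appear in the vocabulary, sorted."""
    return sorted({s for s in candidate_strings(lattice) if s in vocabulary})


# Hypothetical word list; a real system would use a full dictionary or language model.
vocabulary = {"welcome", "dam", "darn", "case", "ease"}

print(len(list(candidate_strings(lattice))), "candidate readings")
print(dictionary_matches(lattice, vocabulary))
```

The lattice stores 15 alternatives but encodes 2 x 2 x 3 x 2 x 2 x 2 x 2 = 192 readings, which is the compactness the comment refers to.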
Robotics attention needed
- Refs - large chunks of text have no refs
- MoS compliance checks
- Content - all topics covered?