Optical character recognition
Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic conversion of scanned images of handwritten, typewritten or printed text into machine-encoded text. It is widely used as a form of data entry from some sort of original paper data source, whether documents, sales receipts, mail, or any number of printed records. It is a common method of digitizing printed texts so that they can be electronically searched, stored more compactly, displayed on-line, and used in machine processes such as machine translation, text-to-speech and text mining. OCR is a field of research in pattern recognition, artificial intelligence and computer vision.
Early versions needed to be programmed with images of each character, and worked on one font at a time. "Intelligent" systems with a high degree of recognition accuracy for most fonts are now common. Some systems are capable of reproducing formatted output that closely approximates the original scanned page including images, columns and other non-textual components.
History
Early optical character recognition can be traced to work on two problems: expanding telegraphy and creating reading devices for the blind.[1] In 1914, Emanuel Goldberg developed a machine that read characters and converted them into standard telegraph code.[citation needed] Around the same time, Edmund Fournier d'Albe developed the Optophone, a handheld scanner that, when moved across a printed page, produced tones that corresponded to specific letters or characters.
Goldberg continued to develop OCR technology for data entry. Later, he proposed photographing data records and then, using photocells, matching the photos against a template containing the desired identification pattern. In 1929, Gustav Tauschek had similar ideas and obtained a patent on OCR in Germany. Paul W. Handel obtained a US patent on such template-matching OCR technology in 1933 (U.S. patent 1,915,993). In 1935, Tauschek was also granted a US patent on his method (U.S. patent 2,026,329).
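The template-matching idea can be illustrated with a minimal sketch. It is a modern software analogy, not a reconstruction of the photocell devices described in the patents: the 3x3 binary templates, the names TEMPLATES, match_score and recognize, and the pixel-agreement score are all assumptions made for illustration.

```python
# Hypothetical 3x3 binary templates ("1" = ink, "0" = blank).
# Real systems would use far larger images; these are illustrative only.
TEMPLATES = {
    "I": ("010", "010", "010"),
    "L": ("100", "100", "111"),
    "T": ("111", "010", "010"),
}


def match_score(glyph, template):
    """Count the pixels on which the glyph and the template agree."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(glyph):
    """Return the label of the template with the highest pixel agreement."""
    return max(TEMPLATES, key=lambda label: match_score(glyph, TEMPLATES[label]))


# A "T" with one flipped pixel in the bottom-right corner.
noisy_t = ("111", "010", "011")
print(recognize(noisy_t))
```

Because every stored template is tied to one rendering of a character, a scheme like this only works for the font it was built for, which is consistent with early systems handling one font at a time.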
In 1949, RCA engineers worked on the first primitive computer-type OCR to help blind people for the US Veterans Administration. Rather than stopping at converting the printed characters to machine language, their device converted them to machine language and then spoke the letters, making it an early text-to-speech technology. It proved far too expensive and was not pursued after testing.[2]
In 1950, David H. Shepard, a cryptanalyst at the Armed Forces Security Agency in the United States, addressed the problem of converting printed messages into machine language for computer processing and built a machine to do this, called "Gismo".[3] He received a patent for his "Apparatus for Reading" in 1953 (U.S. patent 2,663,758). "Gismo" could read 23 letters of the English alphabet, comprehend Morse Code, read musical notations, read aloud from printed pages, and duplicate typewritten pages. Shepard went on to found Intelligent Machines Research Corporation (IMR), which soon developed the world's first commercial OCR systems.
In 1955, the first commercial system was installed at the Reader's Digest, which used OCR to input sales reports into a computer. It converted the typewritten reports into punched cards for input into the computer in the magazine's subscription department, to help process the shipment of 15-20 million books a year.[4] The second system was sold to the Standard Oil Company for reading credit card imprints for billing purposes. Other systems sold by IMR during the late 1950s included a bill stub reader for the Ohio Bell Telephone Company and a page scanner for the United States Air Force, used for reading typewritten messages and transmitting them by teletype. IBM and others later licensed Shepard's OCR patents.
In about 1965, Reader's Digest and RCA collaborated to build an OCR document reader designed to digitize the serial numbers on Reader's Digest coupons returned from advertisements. The documents were printed by an RCA drum printer using the OCR-A font. The reader was connected directly to an RCA 301 computer (one of the first solid-state computers). This reader was followed by a specialised document reader installed at TWA, where it processed airline ticket stock. The readers processed documents at a rate of 1,500 documents per minute and checked each document, rejecting those they were not able to process correctly. The product became part of the RCA product line as a reader designed to process "turnaround documents" such as the utility and insurance bills returned with payments.
The United States Postal Service has been using OCR machines to sort mail since 1965 based on technology devised primarily by the prolific inventor Jacob Rabinow. The first use of OCR in Europe was by the British General Post Office (GPO). In 1965 it began planning an entire banking system, the National Giro, using OCR technology, a process that revolutionized bill payment systems in the UK. Canada Post has been using OCR systems since 1971[citation needed]. OCR systems read the name and address of the addressee at the first mechanized sorting center, and print a routing bar code on the envelope based on the postal code. To avoid confusion with the human-readable address field which can be located anywhere on the letter, special ink (orange in visible light) is used that is clearly visible under ultraviolet light. Envelopes may then be processed with equipment based on simple bar code readers.
Importance of OCR to the blind
In 1974, Ray Kurzweil started the company Kurzweil Computer Products, Inc. and continued development of omni-font OCR, which could recognize text printed in virtually any font.[5] He decided that the best application of this technology would be a reading machine for the blind, which would allow blind people to have a computer read text to them out loud. This device required the invention of two enabling technologies: the CCD flatbed scanner and the text-to-speech synthesizer. On January 13, 1976, the finished product was unveiled during a widely reported news conference headed by Kurzweil and the leaders of the National Federation of the Blind.[citation needed] In 1978, Kurzweil Computer Products began selling a commercial version of the optical character recognition computer program. LexisNexis was one of the first customers, and bought the program to upload paper legal and news documents onto its nascent online databases. Two years later, Kurzweil sold his company to Xerox, which had an interest in further commercializing paper-to-computer text conversion. Xerox eventually spun it off as Scansoft, which merged with Nuance Communications.[citation needed]
OCR software
Desktop and server OCR software
OCR software and ICR (intelligent character recognition) software are analytical artificial intelligence systems that consider sequences of characters rather than whole words or phrases. Based on the analysis of sequential lines and curves, OCR and ICR make "best guesses" at characters, using database look-up tables to closely associate or match the strings of characters that form words.
WebOCR & OnlineOCR
As information technology has developed, the platform on which people use software has changed from the single PC to multiple platforms: PC, Web-based, cloud computing and mobile devices. After 30 years of development on the desktop, OCR software started to adapt to new application requirements. WebOCR, also known as Online OCR or Web-based OCR service, has become a new trend that serves larger volumes and larger groups of users. Internet and broadband technologies have made WebOCR and OnlineOCR practically available to both individual users and enterprise customers. Since 2000, some major OCR vendors have offered WebOCR and online software. A number of new entrants seized the opportunity to develop innovative Web-based OCR services, some of which are free of charge.
Application-oriented OCR
As OCR technology has been applied more and more widely in paper-intensive industries, it has had to deal with more complex images from the real world. Examples include complicated backgrounds, degraded images, heavy noise, paper skew, picture distortion, low resolution, interference from grids and lines, and text images containing special fonts, symbols and glossary words. All of these factors affect the stability of an OCR product's recognition accuracy.
In recent years, the major OCR technology providers began to develop dedicated OCR systems, each for a special type of image. To improve recognition accuracy, they combine various optimization methods related to the special image, such as business rules, standard expressions, glossaries or dictionaries, and the rich information contained in color images.
This strategy of customizing OCR technology is called "Application-Oriented OCR" or "Customized OCR". It is widely used in fields such as Business-card OCR, Invoice OCR, Screenshot OCR, ID card OCR, Driver-license OCR and Auto plant OCR.
Current state of OCR technology
Commissioned by the U.S. Department of Energy (DOE), the Information Science Research Institute (ISRI) had the mission of fostering the improvement of automated technologies for understanding machine-printed documents,[citation needed] and it conducted the Annual Test of OCR Accuracy, the most authoritative such test, for five consecutive years in the mid-1990s.
Recognition of Latin-script, typewritten text is still not 100% accurate even where clear imaging is available. One study based on recognition of 19th- and early 20th-century newspaper pages concluded that character-by-character OCR accuracy for commercial OCR software varied from 71% to 98%;[6] total accuracy can be achieved only by human review. Other areas—including recognition of hand printing, cursive handwriting, and printed text in other scripts (especially those East Asian language characters which have many strokes for a single character)—are still the subject of active research.
Accuracy rates can be measured in several ways, and the method of measurement can greatly affect the reported rate. For example, if word context (essentially a lexicon of words) is not used to correct output containing non-existent words, a character error rate of 1% (99% accuracy) may result in a word error rate of 5% (95% accuracy) or worse when the measurement is based on whether each whole word was recognized with no incorrect letters.[7]
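The relationship between the two figures can be checked with a short calculation. The sketch below assumes that character errors are independent and that no lexicon correction is applied; both are simplifications, and the function name word_error_rate is chosen here only for illustration. Under those assumptions, a word of n characters is read correctly with probability (1 - p)^n, where p is the character error rate.

```python
def word_error_rate(char_error_rate, word_length):
    """Probability that a word contains at least one wrong character.

    Assumes character errors are independent of each other, which is a
    simplification: real OCR errors tend to cluster on degraded regions.
    """
    return 1.0 - (1.0 - char_error_rate) ** word_length


# Effect of a 1% character error rate on words of different lengths.
for length in (3, 5, 8):
    rate = word_error_rate(0.01, length)
    print(f"{length}-letter word: {rate:.2%} chance of at least one error")
```

For a five-letter word, a 1% character error rate gives a word error rate of about 4.9%, in line with the 5% figure quoted above; longer words fare worse.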
On-line character recognition is sometimes confused with Optical Character Recognition[8] (see Handwriting recognition). OCR is an instance of off-line character recognition, where the system recognizes the fixed static shape of the character, while on-line character recognition instead recognizes the dynamic motion during handwriting. For example, on-line recognition, such as that used for gestures in the PenPoint OS or the Tablet PC, can tell whether a horizontal mark was drawn right-to-left or left-to-right. On-line character recognition is also referred to by other terms such as dynamic character recognition, real-time character recognition, and Intelligent Character Recognition or ICR.
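The difference can be made concrete with a small sketch. It assumes, purely for illustration, that an on-line system receives each stroke as a time-ordered list of (x, y) pen samples, while an off-line system sees only which points carry ink; the function names are hypothetical and do not come from any particular product.

```python
def stroke_direction(stroke):
    """Classify a roughly horizontal stroke from its time-ordered samples."""
    start_x, end_x = stroke[0][0], stroke[-1][0]
    if end_x > start_x:
        return "left-to-right"
    if end_x < start_x:
        return "right-to-left"
    return "undetermined"


def rasterize(stroke):
    """Discard the sampling order and keep only the set of inked points,
    which is all that an off-line (scanned image) system has to work with."""
    return frozenset(stroke)


forward = [(0, 5), (1, 5), (2, 5), (3, 5)]   # pen moved left to right
backward = list(reversed(forward))           # same mark, drawn the other way

print(stroke_direction(forward), stroke_direction(backward))
print(rasterize(forward) == rasterize(backward))
```

The two strokes are distinguishable from their sample order but produce the same static image, so the direction is information that only the on-line system can use.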
On-line systems for recognizing hand-printed text on the fly have become well known as commercial products in recent years (see Tablet PC history). Among these are the input devices for personal digital assistants such as those running Palm OS. The Apple Newton pioneered this kind of product. The algorithms used in these devices take advantage of the fact that the order, speed, and direction of individual line segments at input are known. Also, the user can be retrained to use only specific letter shapes. These methods cannot be used in software that scans paper documents, so accurate recognition of hand-printed documents is still largely an open problem. Accuracy rates of 80% to 90% on neat, clean hand-printed characters can be achieved, but that accuracy rate still translates to dozens of errors per page, making the technology useful only in very limited applications.
Recognition of cursive text is an active area of research, with recognition rates even lower than those of hand-printed text. Higher rates of recognition of general cursive script will likely not be possible without the use of contextual or grammatical information. For example, recognizing entire words from a dictionary is easier than trying to parse individual characters from script. Reading the amount line of a cheque (which is always a written-out number) is an example where using a smaller dictionary can increase recognition rates greatly. Knowledge of the grammar of the language being scanned can also help determine whether a word is likely to be a verb or a noun, for example, allowing greater accuracy. The shapes of individual cursive characters themselves simply do not contain enough information to accurately (greater than 98%) recognise all handwritten cursive script.
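The benefit of a small dictionary can be sketched as follows. This is an illustrative example rather than a description of any real cheque reader: the word list AMOUNT_LEXICON and the function snap_to_lexicon are assumptions, and the matching uses the similarity ratio provided by Python's standard difflib module instead of a purpose-built handwriting model.

```python
import difflib

# Hypothetical lexicon for the written-out amount line of a cheque.
AMOUNT_LEXICON = [
    "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
    "ten", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
    "eighty", "ninety", "hundred", "thousand", "and", "dollars",
]


def snap_to_lexicon(raw_word, lexicon=AMOUNT_LEXICON, cutoff=0.6):
    """Replace a noisy character-level reading with the closest lexicon word.

    Returns None when no lexicon word is similar enough, so the caller can
    route the word to human review instead of guessing.
    """
    matches = difflib.get_close_matches(raw_word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# "fcrty" stands for a reading in which the "o" was mistaken for a "c".
print(snap_to_lexicon("fcrty"))
print(snap_to_lexicon("hundrcd"))
```

With only a few dozen candidate words, a reading with one or two wrong characters usually still has a single clearly closest match, which is why restricting the vocabulary raises recognition rates.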
OCR is also a base technology within advanced scanning applications. As a result, an advanced scanning solution can be unique and patented, and not easily copied, even though it builds on basic OCR technology.
For more complex recognition problems, intelligent character recognition systems are generally used, as artificial neural networks can be made indifferent to both affine and non-linear transformations.[9]
One technique that has had considerable success with difficult words and character groups, in documents that are otherwise amenable to computer OCR, is to submit them automatically to humans through the reCAPTCHA system.
See also
- AI effect
- Applications of artificial intelligence
- Automatic number plate recognition
- Book scanning
- CAPTCHA
- Computational linguistics
- Computer vision
- Digital Library
- Digital pen
- Digital Mailroom
- Handwriting recognition
- Institutional repository
- Machine learning
- Music OCR
- Optical mark recognition
- Raster to vector
- Raymond Kurzweil
- Sketch recognition
- Speech recognition
- Voice recording
- Lists
- Comparison of optical character recognition software
- List of emerging technologies
- Outline of artificial intelligence
References
- ^ Herbert Schantz, The History of OCR. Manchester Center, VT: Recognition Technologies Users Association, 1982.
- ^ "Reading Machine Speaks Out Loud", February 1949, Popular Science.
- ^ Washington Daily News, April 27, 1951; New York Times, December 26, 1953
- ^ Schantz, The History of OCR.
- ^ Kurzweil is often credited with inventing omnifont OCR, but it was in use by companies, including CompuScan, in the late 1960s and 1970s. See Schantz, The History of OCR; Data processing magazine, Volume 12 (1970), p. 46
- ^ Holley, Rose (April 2009). "How Good Can It Get? Analysing and Improving OCR Accuracy in Large Scale Historic Newspaper Digitisation Programs". D-Lib Magazine. Retrieved 5 January 2011.
- ^ Suen, C.Y.; et al. (1987-05-29). "Future Challenges in Handwriting and Computer Applications". 3rd International Symposium on Handwriting and Computer Applications, Montreal, May 29, 1987. Retrieved 2008-10-03.
- ^ Tappert, Charles C.; et al. (August 1990). "The State of the Art in On-line Handwriting Recognition". IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol 12 No 8, August 1990, pp 787-ff. Retrieved 2008-10-03.
- ^ LeNet-5, Convolutional Neural Networks
External links
- Unicode OCR - Hex Range: 2440-245F Optical Character Recognition in Unicode