Optical character recognition

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by 210.193.53.1 (talk) at 07:25, 25 April 2007 (→‎External links). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Optical Character Recognition, usually abbreviated to OCR, is a type of computer software designed to translate images of handwritten or typewritten text (usually captured by a scanner) into machine-editable text, or to translate pictures of characters into a standard encoding scheme representing them (e.g. ASCII or Unicode). OCR began as a field of research in pattern recognition, artificial intelligence and machine vision. Though academic research in the field continues, the focus on OCR has shifted to implementation of proven techniques.

Optical character recognition (using optical techniques such as mirrors and lenses) and digital character recognition (using scanners and computer algorithms) were originally considered separate fields. Because very few applications survive that use true optical techniques, the optical character recognition term has now been broadened to cover digital character recognition as well.

Early systems required training (the provision of known samples of each character) to read a specific font. "Intelligent" systems with a high degree of recognition accuracy for most fonts are now common. Some systems are even capable of reproducing formatted output that closely approximates the original scanned page including images, columns and other non-textual components.

History

In 1929, Gustav Tauschek obtained a patent on OCR in Germany, followed by Handel, who obtained a US patent on OCR in 1933 (U.S. Patent 1,915,993). In 1935 Tauschek was also granted a US patent on his method (U.S. Patent 2,026,329).

Tauschek's machine was a mechanical device that used templates. A photodetector was placed so that when the template and the character to be recognised were lined up for an exact match and a light was directed towards them, no light would reach the photodetector.
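The matching principle can be illustrated with a loose digital analogue. The sketch below is not a model of Tauschek's mechanism: the bitmaps, the TEMPLATES table and the function names are all hypothetical, and "leaked light" is simplified to a count of cells where template and character disagree.

```python
# Each glyph is a tiny binary bitmap: '#' is ink, '.' is blank paper.
# These three templates are invented for illustration only.
TEMPLATES = {
    "I": ("###",
          ".#.",
          ".#.",
          ".#.",
          "###"),
    "L": ("#..",
          "#..",
          "#..",
          "#..",
          "###"),
    "T": ("###",
          ".#.",
          ".#.",
          ".#.",
          ".#."),
}


def leaked_light(template, glyph):
    """Count the cells where the two bitmaps disagree.

    This stands in for the light reaching the photodetector:
    zero means the character and the template line up exactly.
    """
    return sum(
        t != g
        for t_row, g_row in zip(template, glyph)
        for t, g in zip(t_row, g_row)
    )


def recognise(glyph):
    """Return the label of the template that matches exactly, or None."""
    for label, template in TEMPLATES.items():
        if leaked_light(template, glyph) == 0:
            return label
    return None


print(recognise(TEMPLATES["L"]))                       # an exact match
print(leaked_light(TEMPLATES["I"], TEMPLATES["T"]))    # cells that differ
```

Because only an exact match is accepted, a single stray pixel defeats recognition, which is one reason such template systems were tied to a specific font and a precise alignment.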

In 1950, David Shepard, a cryptanalyst at the Armed Forces Security Agency in the United States, was asked by Frank Rowlett, who had broken the Japanese PURPLE diplomatic code, to work with Dr. Louis Tordella to recommend data automation procedures for the Agency. This included the problem of converting printed messages into machine language for computer processing. Shepard decided it must be possible to build a machine to do this, and, with the help of his friend Harvey Cook, built "Gismo" in his attic during evenings and weekends. This was reported in the Washington Daily News on April 27, 1951, and in the New York Times on December 26, 1953, after his U.S. Patent Number 2,663,758 was issued. Shepard then founded Intelligent Machines Research Corporation (IMR), which went on to deliver the first several OCR systems in the world to be used in commercial operation. Both Gismo and the later IMR systems used image analysis, as opposed to character matching, and could accept some font variation. Gismo, however, was limited to reasonably close vertical registration, whereas the commercial IMR scanners that followed analyzed characters anywhere in the scanned field, a practical necessity on real-world documents.

The first commercial system was installed at Reader's Digest in 1955. Many years later, Reader's Digest donated it to the Smithsonian, where it was put on display. The second system was sold to the Standard Oil Company of California for reading credit card imprints for billing purposes, and many more systems were sold to other oil companies. Other systems sold by IMR during the late 1950s included a bill stub reader for the Ohio Bell Telephone Company and a page scanner for the United States Air Force, used for reading typewritten messages and transmitting them by teletype. IBM and others were later licensed under Shepard's OCR patents.

The United States Postal Service has been using OCR machines to sort mail since 1965, based on technology devised primarily by the prolific inventor Jacob Rabinow. The first use of OCR in Europe was by the British General Post Office (GPO). In 1965 it began planning an entire banking system, the National Giro, using OCR technology, a process that revolutionized bill payment systems in the UK. Canada Post has been using OCR systems since 1971. OCR systems read the name and address of the addressee at the first mechanized sorting center and print a routing bar code on the envelope based on the postal code. At later centers the letters can then be sorted by less expensive sorters that need only read the bar code. To avoid interference with the human-readable address field, which can be located anywhere on the letter, a special ink is used that is clearly visible under ultraviolet light. This ink looks orange in normal lighting conditions. Envelopes marked with the machine-readable bar code may then be processed.

Current state of OCR technology

The accurate recognition of Latin-script, typewritten text is now considered largely a solved problem. Typical accuracy rates exceed 99%, although certain applications demanding even higher accuracy require human review for errors.

Recognition of hand printing, cursive handwriting, and even the printed or typewritten forms of some other scripts (especially those with a very large number of characters) is still the subject of active research.

Systems for recognizing hand-printed text on the fly have enjoyed commercial success in recent years. Among these are the input devices for personal digital assistants such as those running Palm OS. The Apple Newton pioneered this technology. The algorithms used in these devices take advantage of the fact that the order, speed, and direction of individual line segments at input are known. Also, the user can be retrained to use only specific letter shapes. These methods cannot be used in software that scans paper documents, so accurate recognition of hand-printed documents is still largely an open problem. Accuracy rates of 80% to 90% on neat, clean hand-printed characters can be achieved, but that accuracy rate still translates to dozens of errors per page, making the technology useful only in very limited contexts. This variety of OCR is now commonly known in the industry as ICR, or Intelligent Character Recognition.
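The link between a per-character accuracy rate and the number of errors on a page is simple arithmetic, sketched below. The page sizes are hypothetical values chosen only for illustration, and the function assumes that every character is misread independently with the same probability, which real documents only approximate.

```python
def expected_errors(chars_per_page, accuracy_percent):
    """Expected number of misread characters on one page.

    Assumes a uniform, independent per-character error rate.
    """
    return chars_per_page * (100 - accuracy_percent) / 100


# Hypothetical page sizes, not taken from any measurement.
HAND_PRINTED_FORM = 300    # characters on a filled-in form
TYPEWRITTEN_PAGE = 2000    # characters on a dense typed page

for label, chars, accuracy in [
    ("hand-printed, 80%", HAND_PRINTED_FORM, 80),
    ("hand-printed, 90%", HAND_PRINTED_FORM, 90),
    ("typewritten, 99%", TYPEWRITTEN_PAGE, 99),
]:
    print(f"{label}: about {expected_errors(chars, accuracy):.0f} errors")
```

Under these assumptions even a short hand-printed form yields 30 to 60 misread characters, and a dense typewritten page at 99% still yields about 20, which is why applications that demand higher accuracy add human review.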

Recognition of cursive text is an active area of research, with recognition rates even lower than those for hand-printed text. Higher rates of recognition of general cursive script will likely not be possible without the use of contextual or grammatical information. For example, recognizing entire words from a dictionary is easier than trying to parse individual characters from script. Reading the amount line of a cheque (which is always a written-out number) is an example where using a smaller dictionary can increase recognition rates greatly. Knowledge of the grammar of the language being scanned can also help determine whether a word is likely to be a verb or a noun, for example, allowing greater accuracy. The shapes of individual cursive characters themselves simply do not contain enough information to recognize all handwritten cursive script accurately (at a rate greater than 98%).
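The benefit of a small dictionary can be sketched as follows. This is a minimal illustration, not a description of any production cheque reader: it takes a noisy character-level guess and snaps it to the closest entry in a restricted lexicon, using Python's standard difflib module as a stand-in for a real recognizer's scoring. The word list, the function name correct_word and the cutoff value are assumptions made for the example.

```python
import difflib

# A restricted lexicon for the written-out amount on a cheque.
# The list is illustrative and deliberately incomplete.
AMOUNT_WORDS = [
    "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
    "fifteen", "sixteen", "seventeen", "eighteen", "nineteen",
    "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
    "eighty", "ninety", "hundred", "thousand", "and",
]


def correct_word(raw, lexicon=AMOUNT_WORDS, cutoff=0.6):
    """Map a noisy recognizer guess to the closest lexicon word.

    Returns None when nothing in the lexicon is similar enough,
    so the word can be passed on for human review.
    """
    matches = difflib.get_close_matches(raw.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Character-level guesses with typical confusions (e read as c).
for guess in ["fivc", "hundrcd", "qqqq"]:
    print(guess, "->", correct_word(guess))
```

With only a few dozen candidate words, a guess that is wrong in one or two characters usually still has a single clear nearest neighbour. The same guess checked against a general dictionary would have many more plausible neighbours, which is the effect the cheque example describes.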

A particularly difficult problem for both computers and humans is that of old church baptismal and marriage records, which contain mostly names. The pages may be damaged by age, water or fire, and the names may be obsolete or contain rare spellings. Another research area is cooperative approaches, where computers assist humans and vice versa. Computer image processing techniques can assist humans in reading extremely difficult texts such as the Archimedes Palimpsest or the Dead Sea Scrolls.

For more complex recognition problems, neural networks are commonly used, as they can generally be made indifferent to both affine and non-linear transformations.[1]

Music OCR

Early research into recognition of printed sheet music was performed in the mid-1970s at MIT and other institutions. Successive efforts were made to localize and remove musical staff lines, leaving symbols to be recognized and parsed. The first proprietary music-scanning program, MIDISCAN, was released in 1991. Three proprietary products are currently available. At this time, OCR software does not recognize handwritten scores.

MICR

One area where the accuracy and speed of computer input of character information exceed those of humans is magnetic ink character recognition (MICR), where error rates are around one read error for every 20,000 to 30,000 checks.

Optical Character Recognition in Unicode

In Unicode, Optical Character Recognition symbol characters are placed in the hexadecimal range 0x2440–0x245F, as shown below (see also Unicode Symbols):

Symbol  Hex     Name
⑀       0x2440  OCR Hook
⑁       0x2441  OCR Chair
⑂       0x2442  OCR Fork
⑃       0x2443  OCR Inverted Fork
⑄       0x2444  OCR Belt Buckle
⑅       0x2445  OCR Bow Tie
⑆       0x2446  OCR Branch Bank Identification
⑇       0x2447  OCR Amount Of Check
⑈       0x2448  OCR Dash
⑉       0x2449  OCR Customer Account Number
⑊       0x244A  OCR Double Backslash
        0x244B–0x244E  Not defined

The names given for 0x2448 and 0x2449 are the formal Unicode character names, which are swapped relative to the meaning of the glyphs in MICR (the on-us symbol and the dash symbol, respectively).
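The assigned characters in this range can also be listed programmatically. The sketch below uses Python's standard unicodedata module. The names it prints come from the Unicode Character Database bundled with the interpreter, so they reflect that version of the standard rather than any particular printed table, and the helper name ocr_block is an assumption made for the example.

```python
import unicodedata


def ocr_block(start=0x2440, end=0x245F):
    """Return (code point, character, name) for each assigned character.

    Code points with no name in the bundled Unicode database are
    treated as unassigned and skipped.
    """
    rows = []
    for code_point in range(start, end + 1):
        char = chr(code_point)
        name = unicodedata.name(char, None)
        if name is not None:
            rows.append((f"U+{code_point:04X}", char, name))
    return rows


for code, char, name in ocr_block():
    print(code, char, name)
```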

External links

  • ICDAR: one of the most comprehensive conferences on all aspects of document recognition, including OCR, held every two years.
  • DRR: SPIE DRR is an annual conference on OCR and document retrieval.
  • Asprise OCR: an OCR software package.