Talk:Optical character recognition software

Added Abbyy

I added Abbyy to the list, which seems to be the software of choice for most.

Not to troll, but aren't the OSS solutions mostly experimental? I will try some out, before posting any quality comments/comparisons in the article proper. -- Erik

Typo

http://www.ocr.com/download.shtml The name of the software is "Cuneiform Pro OCR", not "Cuneinform". It gets its name from cuneiform, the ancient form of writing. Oldspammer 12:48, 15 September 2006 (UTC)

Cuneiform Critique

I recently tried this software. It is fast, but page images cannot be in GIF or LZW-compressed TIF format.

Part way through one page I got an application-modal error dialog saying that some kind of line had been found that could not be resolved, and recognition failed completely. The image was a book page with page curl from the binding and shadow outlines along the page border.

Once a page is free of page-boundary shadows and in a supported file format (uncompressed 8-bit grayscale TIF or PackBits TIF), recognition is very fast (sometimes under 0.1 seconds per page, other times 2 seconds, depending on the complexity of the language, e.g. Russian / Cyrillic with underlines) and accurate. However, line endings do not match the original page unless the paragraph format mode is unchecked, and in that mode paragraph indentation is lost.

The recognized text view cannot be zoomed for closer scrutiny of small print. If the recognized text's font size is large, the right margin on the ruler cannot be adjusted to stop the text from wrapping prematurely: the margin setting always snaps back to its original position after you drag it with the mouse.

Centered or multi-indented / tabbed-in text loses its positioning and appears at the left margin. Font size changes and italics are usually lost even when the settings are to preserve them. (I just discovered that if recognition is done with "Russian & English" as the selected language, more italic sequences are properly recognized, but recognition slows down a bit.)

The toolbars cannot be customized by right-clicking them to bring up a context menu, which is the common user-interface method for this. Instead, toolbar customization is done via main menu > OCR > Options > View tab, then check boxes, then OK. This is slow and clumsy because the Options item is at the very bottom of the OCR menu, which takes extra mouse motion and care to select. Large toolbars are also not dockable, so they tend to stack up vertically and considerably reduce the size of the view windows for both the text editor and the OCR image window. Oldspammer 12:48, 15 September 2006 (UTC)
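
For anyone who wants to check a scan before feeding it to the program, the file-format restriction described above can be tested without opening an image editor. The following is only a rough sketch: it reads the Compression tag (tag 259 in the TIFF 6.0 specification; 1 = none, 5 = LZW, 32773 = PackBits) from the first image directory of a TIFF file. The function names and the "reported as supported" set are made up for illustration and reflect only the behaviour reported in this thread, not any vendor documentation.

```python
import struct

# Compression tag (259) values as defined by the TIFF 6.0 specification.
COMPRESSION_NAMES = {1: "uncompressed", 5: "LZW", 32773: "PackBits"}

# Assumption: taken from the comment above, not from vendor documentation.
REPORTED_OK = {1, 32773}


def tiff_compression(data: bytes) -> int:
    """Return the Compression tag value of the first image in a TIFF byte string."""
    if len(data) < 8:
        raise ValueError("too short to be a TIFF file")
    if data[:2] == b"II":
        endian = "<"  # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        endian = ">"  # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    (entry_count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    for i in range(entry_count):
        entry = ifd_offset + 2 + 12 * i  # each directory entry is 12 bytes
        (tag,) = struct.unpack_from(endian + "H", data, entry)
        if tag == 259:
            # Compression is a SHORT held in the first two bytes of the value field.
            (value,) = struct.unpack_from(endian + "H", data, entry + 8)
            return value
    return 1  # the tag defaults to 1 (no compression) when it is absent


def reported_as_supported(data: bytes) -> bool:
    """True if the compression is one this thread reports as working."""
    return tiff_compression(data) in REPORTED_OK
```

To check a file on disk, read it in binary mode and pass the bytes in, for example `reported_as_supported(open("page.tif", "rb").read())`, where `page.tif` is a placeholder name. The sketch does not check bit depth, and a GIF or other non-TIFF file simply raises `ValueError`.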

Cuneiform General Info

Comments characterizing Cuneiform primarily indicate that its speed is very high (their site claims that it is 3x to 10x faster than its competitors). The site makes no specific claim about OCR accuracy compared with other software. It is a Microsoft Windows application, as opposed to Mac / Linux / Unix. Version 6.0 was released in the latter part of 2003. OCR software pricing on their web site started at US$69 ($60 off until September 30) for a single-user license, with discounts for larger numbers of users.

The California-based www.ocr.com site is headed "Cognitive Enterprises" / "Cognitive Technologies Corporation" in Corte Madera. A superficial Google search indicates that Cuneiform 1999 V5.0 was their previous release, so maybe the next release is due in 2007?

The http://www.ocr.com/ company is also peddling a Win32 OCR set of DLLs for developers, named Tiger OCR Library. Pricing for this developer software was US$3,000 according to a web page linked from the one given above. Trial versions of their OCR package, one covering 9 European languages and one single-language (English), are available from the page given above (about 8.5 MB and 4.8 MB respectively). Oldspammer 12:48, 15 September 2006 (UTC)

www.ocr.com Development of Cuneiform Critique

This software is not as sophisticated as it could be. It sometimes gives up when it shouldn't. From this it appears that ocr.com is not actively developing the software or frequently releasing new versions. However, they appear to be actively marketing it and its developer libraries via the web. The intellectual property is excellent. They should budget some ongoing effort for sprucing up the product: eliminate needless shortcomings such as the lack of LZW image decoding support and the poor user interface, and add some smarts to handle non-ideal page scans / images without producing error popups. Oldspammer 00:48, 18 September 2006 (UTC)

Color and Grey Scale Recognition

Say you want to host some color product brochures on your commercial web site. Scanning and OCR-ing these glossy color pages can be tricky if the art department that produced them used fancy transparent backgrounds, gradient fills, or font color inversions along a diagonal curve in the middle of blocks of text. It would be nice to have a set of such sample color pages for rating the various OCR software packages. Depending on their sophistication, some packages would do OK, while others would fail badly on this kind of work. Oldspammer 12:48, 15 September 2006 (UTC)

What About Adobe?

Doesn't Adobe Acrobat have some conversion tools to convert printed documents into editable e-books, with hot-linked, cross-referenced tables of contents, indices, etc.? Should mention be made of this software too? Oldspammer 12:48, 15 September 2006 (UTC)

Spelling Errors / Other Typos

The article would benefit from being copied and pasted into a proper word processor and spell-checked, because in a few places the original author's fingers slipped on the keyboard, so that the letters of some words are transposed, along with other such mistakes. They also did not bother to preview and scrutinize their article / update. Oldspammer 00:48, 18 September 2006 (UTC)

free software

The following entry is in the "Proprietary software" section: SimpleOCR a relatively simple freeware (supports English, French and Dutch language recognition)

That section is followed by the "Free and open source OCR software" section.

Since SimpleOCR is free, it should either be listed in the second section, or the "Free" should be removed from the second section's header, i.e. rename it to "Open source OCR software". -- The preceding unsigned comment was added by 193.32.3.83 (talk) 18:14, 8 December 2006 (UTC).

  • "Free software" and "freeware" are two different things, so the article's categorisation is correct. The distinction is explained in both those articles. Unfortunately, the English language uses the word "free" to mean both "at liberty" and "at no charge", which causes confusion; other languages use terms like "libre" and "gratis" respectively. Rwxrwxrwx 22:15, 8 December 2006 (UTC)