Scene text


Scene text is text that appears in an image captured by a camera in an outdoor environment.

[Image: the coach category displayed as text; the coach belongs to the Sleeper category.]

The detection and recognition of scene text in camera-captured images are computer vision tasks that became important after smartphones with good cameras became ubiquitous. The text in scene images varies in shape, font, colour and position. The recognition of scene text is sometimes further complicated by non-uniform illumination and focus.

To improve scene text recognition, the International Conference on Document Analysis and Recognition (ICDAR) conducts a robust reading competition once every two years. The competition was held in 2003 and 2005[1][2][3] and at every ICDAR conference.[4][5][6] The International Association for Pattern Recognition (IAPR) has created a list of reading systems datasets.[7]

Text detection

Text detection is the process of locating the text present in an image and enclosing it in a rectangular bounding box. Text detection can be carried out using image-based or frequency-based techniques.

In image-based techniques, an image is divided into multiple segments. Each segment is a connected component of pixels with similar characteristics. The statistical features of the connected components are utilised to group them and form the text. Machine learning approaches such as support vector machines and convolutional neural networks are used to classify the components as text or non-text.
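The image-based pipeline can be illustrated with a minimal sketch, assuming a tiny hand-made binary image. The names `IMAGE`, `connected_components`, `looks_like_text` and `group_boxes` are hypothetical, and the area and aspect-ratio rule is only a stand-in for a trained classifier such as a support vector machine or a convolutional neural network.

```python
from collections import deque

# Toy binary image (assumption): '#' is foreground. It contains two
# letter-like blobs, one long horizontal line and one isolated speck.
IMAGE = [
    "............",
    ".#.#..###...",
    ".#.#...#....",
    ".###...#....",
    ".#.#...#....",
    ".#.#..###...",
    "............",
    "###########.",
    "............",
    "..........#.",
]

def connected_components(image):
    """Return (top, left, bottom, right, area) for each 4-connected blob."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != "#" or seen[r][c]:
                continue
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right, area = r, c, r, c, 0
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                area += 1
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] == "#" and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right, area))
    return boxes

def looks_like_text(box, min_area=3, max_aspect=4.0):
    """Hand-written stand-in for a learned text/non-text classifier."""
    top, left, bottom, right, area = box
    height, width = bottom - top + 1, right - left + 1
    aspect = max(height, width) / min(height, width)
    return area >= min_area and aspect <= max_aspect

def group_boxes(boxes):
    """Merge component boxes into one enclosing rectangular bounding box."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

components = connected_components(IMAGE)
text_components = [b for b in components if looks_like_text(b)]
text_box = group_boxes(text_components)  # (top, left, bottom, right)
```

In this toy case the long line is rejected for its extreme aspect ratio and the speck for its small area, so only the two letter-like components are grouped into the bounding box.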

In frequency-based techniques, the discrete Fourier transform (DFT) or the discrete wavelet transform (DWT) is used to extract the high-frequency coefficients. It is assumed that the text present in an image has high-frequency components, so selecting only the high-frequency coefficients separates the text from the non-text regions of the image.
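As a rough illustration of the frequency-based idea, the sketch below applies one level of a 2D Haar wavelet transform (in its simple averaging form) to each 2×2 block of a toy grey-level image and keeps the blocks whose detail coefficients carry high energy. The image, the function names and the threshold value are all assumptions chosen for the example, not values taken from any published method.

```python
def haar_detail_energy(image):
    """Energy of the three Haar detail coefficients for each 2x2 block."""
    rows, cols = len(image), len(image[0])
    energy = []
    for r in range(0, rows - 1, 2):
        row_energy = []
        for c in range(0, cols - 1, 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            across_columns = (a - b + d - e) / 4.0  # left/right change
            across_rows = (a + b - d - e) / 4.0     # top/bottom change
            diagonal = (a - b - d + e) / 4.0        # diagonal change
            row_energy.append(across_columns ** 2 + across_rows ** 2
                              + diagonal ** 2)
        energy.append(row_energy)
    return energy

def high_frequency_mask(image, threshold=100.0):
    """Mark blocks whose detail energy exceeds an assumed threshold."""
    return [[value > threshold for value in row]
            for row in haar_detail_energy(image)]

# Toy image (assumption): a flat left half and a striped right half that
# mimics the sharp stroke edges of text.
TOY_IMAGE = [[100, 100, 100, 100, 0, 255, 0, 255] for _ in range(4)]

mask = high_frequency_mask(TOY_IMAGE)
```

Here the flat blocks have zero detail energy and are discarded, while the striped blocks are kept as candidate text regions. A practical system would use a full multi-level transform and a threshold tuned on data.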

Word recognition

In word recognition, the text is assumed to have already been detected and located, so the rectangular bounding box containing the text is available. The task is to recognise the word inside the bounding box. The available methods can be broadly classified into top-down and bottom-up approaches.

In the top-down approaches, a set of words from a dictionary is used to identify which word best suits the given image.[8][9][10] Images are not segmented in most of these methods. Hence, the top-down approach is sometimes referred to as segmentation-free recognition.
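One way to picture a lexicon-driven, segmentation-free recogniser is sketched below. It assumes that a sliding window has already produced a dictionary of character probabilities at each horizontal position (the hypothetical `WINDOW_SCORES`), and it scores every lexicon word by aligning it to the whole window sequence with dynamic programming, without committing to character boundaries beforehand. The scores, the lexicon and the function names are invented for illustration and do not reproduce any of the cited methods.

```python
import math

def word_score(window_scores, word, floor=1e-6):
    """Best log-probability of aligning `word` to all windows in order.

    Each character must cover one or more consecutive windows.
    """
    n, m = len(window_scores), len(word)
    neg_inf = float("-inf")
    best = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, min(i, m) + 1):
            # Either extend the current character or start the next one.
            previous = max(best[i - 1][j], best[i - 1][j - 1])
            if previous == neg_inf:
                continue
            probability = window_scores[i - 1].get(word[j - 1], floor)
            best[i][j] = previous + math.log(probability)
    return best[n][m]

def recognise_with_lexicon(window_scores, lexicon):
    """Pick the lexicon word that best explains the whole word image."""
    return max(lexicon, key=lambda word: word_score(window_scores, word))

# Hypothetical per-window character probabilities for one word image.
WINDOW_SCORES = [
    {"T": 0.6, "I": 0.3, "F": 0.1},
    {"A": 0.5, "R": 0.3, "H": 0.2},
    {"A": 0.4, "X": 0.4, "K": 0.2},
    {"X": 0.5, "K": 0.3, "Y": 0.2},
    {"I": 0.5, "L": 0.4, "T": 0.1},
]
LEXICON = ["TAXI", "TAKE", "TALL", "EXIT"]

best_word = recognise_with_lexicon(WINDOW_SCORES, LEXICON)
```

The lexicon resolves the ambiguity in the middle windows: no single window decides the answer, but only one dictionary word is consistent with the evidence across the whole image.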

In the bottom-up approaches, the image is segmented into multiple components and the segmented image is passed through a recognition engine.[11][12][13] Either an off-the-shelf optical character recognition (OCR) engine[14][15][16] or a custom-trained one is used to recognise the text.
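The bottom-up flow can be sketched on a toy binarised word image: the image is first cut into character components at blank columns, and each component is then passed to a recogniser. In the sketch below the recogniser is a nearest-template matcher over a four-letter alphabet, which is only a hypothetical stand-in for an OCR engine; `TEMPLATES`, `WORD_IMAGE` and the function names are assumptions made for the example.

```python
# Hypothetical 5x3 glyph templates standing in for an OCR engine's models.
TEMPLATES = {
    "H": ["#.#", "#.#", "###", "#.#", "#.#"],
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
    "L": ["#..", "#..", "#..", "#..", "###"],
}

# Toy word image (assumption): "HIT" with one noisy pixel in the last glyph.
WORD_IMAGE = [
    "#.#.###.###",
    "#.#..#...#.",
    "###..#..##.",
    "#.#..#...#.",
    "#.#.###..#.",
]

def segment_columns(image):
    """Split the word image into glyphs at columns that contain no ink."""
    width = len(image[0])
    has_ink = [any(row[x] == "#" for row in image) for x in range(width)]
    segments, start = [], None
    for x, ink in enumerate(has_ink + [False]):  # sentinel closes last run
        if ink and start is None:
            start = x
        elif not ink and start is not None:
            segments.append([row[start:x] for row in image])
            start = None
    return segments

def classify(glyph):
    """Return the template label with the fewest mismatching pixels."""
    def distance(template):
        if (len(template) != len(glyph)
                or len(template[0]) != len(glyph[0])):
            return float("inf")  # sizes differ: not comparable in this toy
        return sum(a != b
                   for t_row, g_row in zip(template, glyph)
                   for a, b in zip(t_row, g_row))
    return min(TEMPLATES, key=lambda label: distance(TEMPLATES[label]))

def recognise_word(image):
    """Segment first, then recognise each component independently."""
    return "".join(classify(glyph) for glyph in segment_columns(image))

recognised = recognise_word(WORD_IMAGE)
```

Because each component is recognised on its own, the result depends heavily on the quality of the segmentation: touching or broken characters would defeat the simple blank-column split used here.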


  1. Lucas, S. M. (2005). "ICDAR 2005 text locating competition results". Proc. 8th ICDAR. Vol. 1. pp. 80–84. doi:10.1109/ICDAR.2005.231. ISBN 978-0-7695-2420-7. S2CID 1842569.
  2. ICDAR 2005 Competitions.
  3. Lucas, Simon M.; Panaretos, Alex; Sosa, Luis; Tang, Anthony; Wong, Shirley; Young, Robert; Ashida, Kazuki; Nagai, Hiroki; Okamoto, Masayuki; Yamamoto, Hiroaki; Miyao, Hidetoshi; Zhu, Junmin; Ou, Wuwen; Wolf, Christian; Jolion, Jean-Michel; Todoran, Leon; Worring, Marcel; Lin, Xiaofan (2005). "ICDAR 2003 Robust Reading Competitions: Entries, Results, and Future Directions". International Journal of Document Analysis and Recognition. 7 (2–3): 105–122. doi:10.1007/s10032-004-0134-3. S2CID 2250003.
  4. ICDAR 2013.
  5. ICDAR 2017.
  6. ICDAR 2011 Robust Reading Competition.
  7. IAPR TC11 Reading Systems-Datasets List.
  8. Weinman, J. J.; Learned-Miller, E.; Hanson, A. R. (2009). "Scene text recognition using similarity and a lexicon with sparse belief propagation". IEEE Transactions on Pattern Analysis and Machine Intelligence. 31 (10): 1733–1746. doi:10.1109/TPAMI.2009.38. PMC 3021989. PMID 19696446.
  9. Mishra, A.; Alahari, K.; Jawahar, C. V. (2012). "Scene Text Recognition using Higher Order Language Priors" (PDF). Proc. BMVC.
  10. Novikova, Tatiana; Barinova, Olga; Kohli, Pushmeet; Lempitsky, Victor (2012). "Large-Lexicon Attribute-Consistent Text Recognition in Natural Images". Computer Vision – ECCV 2012. Lecture Notes in Computer Science. Vol. 7577. pp. 752–765. doi:10.1007/978-3-642-33783-3_54. ISBN 978-3-642-33782-6.
  11. Kumar, Deepak; Ramakrishnan, A. G. (2012). "Power-law transformation for enhanced recognition of born-digital word images". Proc. 9th SPCOM. pp. 1–5. doi:10.1109/SPCOM.2012.6290009. ISBN 978-1-4673-2014-6. S2CID 13876092.
  12. Kumar, D.; Anil Prasad, M. N.; Ramakrishnan, A. G. (2012). "MAPS: Midline analysis and propagation of segmentation". Proc. 8th ICVGIP. doi:10.1145/2425333.2425348. S2CID 13303734.
  13. Kumar, Deepak; Anil Prasad, M. N.; Ramakrishnan, A. G. (2013). "NESP: Nonlinear enhancement and selection of plane for optimal segmentation and recognition of scene word images". In Zanibbi, Richard; Coüasnon, Bertrand (eds.). Document Recognition and Retrieval XX. Vol. 8658. p. 865806. doi:10.1117/12.2008519. S2CID 13848101.
  14. Abbyy Fine Reader.
  15. Nuance Omnipage Reader.
  16. Tesseract OCR Engine.