The structural similarity (SSIM) index is an Emmy Award-winning method for predicting the perceived quality of digital images and videos. It was first developed in the Laboratory for Image and Video Engineering (LIVE) at The University of Texas at Austin, in subsequent collaboration with New York University. The first version of SSIM, called the Universal Quality Index (UQI), or Wang-Bovik index, was developed by Zhou Wang and Al Bovik in 2001. It was modified into the current version of SSIM (many variations now exist) together with Hamid Sheikh and Eero Simoncelli, and described in the paper "Image quality assessment: From error visibility to structural similarity," published in the IEEE Transactions on Image Processing in April 2004. SSIM is used for measuring the similarity between two images. The SSIM index is a full-reference metric; in other words, the measurement or prediction of image quality is based on an initial uncompressed or distortion-free image as reference. SSIM is designed to improve on traditional methods such as peak signal-to-noise ratio (PSNR) and mean squared error (MSE), which have proven to be inconsistent with human visual perception. The 2004 SSIM paper has been cited more than 10,000 times according to Google Scholar, making it one of the most highly cited papers in the image processing and video engineering fields. It received the IEEE Signal Processing Society Best Paper Award for 2009, and the inventors of SSIM received a Primetime Engineering Emmy Award in 2015.
Unlike previously mentioned techniques such as MSE or PSNR, which estimate absolute errors, SSIM is a perception-based model that treats image degradation as perceived change in structural information, while also incorporating important perceptual phenomena, including both luminance masking and contrast masking terms. Structural information is the idea that pixels have strong inter-dependencies, especially when they are spatially close; these dependencies carry important information about the structure of the objects in the visual scene. Luminance masking is a phenomenon whereby image distortions tend to be less visible in bright regions, while contrast masking is a phenomenon whereby distortions become less visible where there is significant activity or "texture" in the image.
The SSIM index is calculated on various windows of an image. The measure between two windows $x$ and $y$ of common size $N \times N$ is:

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$

with:
- $\mu_x$ the average of $x$;
- $\mu_y$ the average of $y$;
- $\sigma_x^2$ the variance of $x$;
- $\sigma_y^2$ the variance of $y$;
- $\sigma_{xy}$ the covariance of $x$ and $y$;
- $c_1 = (k_1 L)^2$ and $c_2 = (k_2 L)^2$, two variables to stabilize the division with weak denominator;
- $L$ the dynamic range of the pixel values (typically $2^{\#\text{bits per pixel}} - 1$);
- $k_1 = 0.01$ and $k_2 = 0.03$ by default.
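As a concrete illustration, the formula above can be transcribed directly into a short function. This is a minimal sketch assuming 8-bit grayscale windows passed as NumPy arrays; the function and variable names are ours, not from the original paper.

```python
import numpy as np

def ssim(x, y, L=255, k1=0.01, k2=0.03):
    """SSIM between two equal-size grayscale windows.

    A direct transcription of the formula above: L is the dynamic
    range (255 for 8-bit pixels), k1 and k2 the default constants.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2        # stabilizing constants
    mu_x, mu_y = x.mean(), y.mean()              # averages
    var_x, var_y = x.var(), y.var()              # variances
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()    # covariance
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```

Identical windows score 1, and any distortion (noise, contrast change, luminance shift) pulls the score below 1.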
To evaluate image quality, this formula is usually applied only on luma, although it may also be applied on color (e.g., RGB) values or chromatic (e.g., YCbCr) values. The resulting SSIM index is a decimal value between -1 and 1, where 1 is reachable only for two identical sets of data. Typically it is calculated on window sizes of 8×8. The window can be displaced pixel-by-pixel across the image, but the authors propose using only a subgroup of the possible windows to reduce the complexity of the calculation.
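The windowed computation described above can be sketched as follows. This is a self-contained illustration assuming 8-bit grayscale images as equal-size NumPy arrays; the helper names are our own.

```python
import numpy as np

def window_ssim(x, y, L=255, k1=0.01, k2=0.03):
    # SSIM of one pair of equal-size windows (the formula given earlier)
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (x.var() + y.var() + c2))

def mean_ssim(a, b, win=8, stride=8):
    """Mean SSIM over win x win windows of two images.

    stride=1 displaces the window pixel-by-pixel; a larger stride
    evaluates only a subgroup of the possible windows, reducing the
    cost of the calculation as the authors suggest.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    h, w = a.shape
    scores = [window_ssim(a[i:i + win, j:j + win], b[i:i + win, j:j + win])
              for i in range(0, h - win + 1, stride)
              for j in range(0, w - win + 1, stride)]
    return float(np.mean(scores))
```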
A more advanced form of SSIM, called Multi-Scale SSIM (MS-SSIM), is computed over multiple scales through multiple stages of sub-sampling, reminiscent of multiscale processing in the early vision system. Both SSIM and MS-SSIM correlate very highly with human judgments, as measured on the most widely used public image quality databases, including the LIVE Image Quality Database and the TID Database. Most competitive image quality models are some form or variation of the SSIM concept.
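As a rough illustration of the multi-scale idea, the sketch below averages a whole-image SSIM score over dyadic scales obtained by 2×2 average pooling. This is a simplification for exposition only: the published MS-SSIM combines per-scale contrast and structure terms with calibrated per-scale weights rather than a plain average.

```python
import numpy as np

def global_ssim(x, y, L=255, k1=0.01, k2=0.03):
    # Single-scale SSIM computed over the whole image (formula given earlier)
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (x.var() + y.var() + c2))

def halve(img):
    # 2x2 average pooling: the sub-sampling step between scales
    h, w = (img.shape[0] // 2) * 2, (img.shape[1] // 2) * 2
    img = img[:h, :w]
    return (img[0::2, 0::2] + img[0::2, 1::2]
            + img[1::2, 0::2] + img[1::2, 1::2]) / 4.0

def multiscale_ssim(x, y, scales=3):
    # Simplified multi-scale score: average single-scale SSIM across
    # dyadic scales (the real MS-SSIM uses weighted per-scale terms).
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    vals = []
    for _ in range(scales):
        vals.append(global_ssim(x, y))
        x, y = halve(x), halve(y)
    return float(np.mean(vals))
```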
Owing to its strong performance and low computational cost, SSIM has become very widely used in the cable and satellite television industries, where it is a dominant method of measuring video quality in broadcast and post-production houses. SSIM is the basis for a number of video quality measurement tools used globally, including those marketed by Video Clarity, National Instruments, Rohde & Schwarz, and SSIMWave. Overall, SSIM and its variants, such as Multi-Scale SSIM, are the most successful and widely used full-reference perceptual image and video quality models.
Structural dissimilarity (DSSIM) is a distance measure derived from SSIM, commonly defined as $\mathrm{DSSIM}(x, y) = (1 - \mathrm{SSIM}(x, y)) / 2$ (though the triangle inequality is not necessarily satisfied, so it is not a true metric).
Discussions over performance
Owing to its great commercial and academic success in an extremely competitive field, SSIM has been a subject of controversy and adverse claims. For example, some have suggested that SSIM is not as accurate as claimed. However, experiments conducted by numerous international research laboratories on the world's largest and most popular image quality databases have shown otherwise. SSIM is probably the most scrutinized and tested image quality model ever developed.
Other authors dispute the perceptual basis of SSIM, suggesting that its formula does not contain any elaborate visual perception modeling and that SSIM may rely on non-perceptual computations; for example, the human visual system does not compute a product between the mean values of two images. However, as shown in the original SSIM paper, SSIM embodies significant perceptual aspects of image quality. Through mathematical manipulation, the SSIM equation is expressed in a succinct and efficient form that does not "look perceptual" to the inexpert interpreter.
- Z. Wang, A. C. Bovik, H. R. Sheikh and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, Apr. 2004.
- Loza et al., "Structural Similarity-Based Object Tracking in Video Sequences", Proc. of the 9th International Conf. on Information Fusion, 2006.