The structural similarity (SSIM) index is a method for predicting the perceived quality of digital television and cinematic pictures, as well as other kinds of digital images and videos. It was first developed in the Laboratory for Image and Video Engineering (LIVE) at The University of Texas at Austin and in subsequent collaboration with New York University.
SSIM is used for measuring the similarity between two images. The SSIM index is a full reference metric; in other words, the measurement or prediction of image quality is based on an initial uncompressed or distortion-free image as reference. SSIM is designed to improve on traditional methods such as peak signal-to-noise ratio (PSNR) and mean squared error (MSE), which have proven to be inconsistent with human visual perception.
The first version of SSIM, called the Universal Quality Index (UQI) or Wang–Bovik Index, was developed by Zhou Wang and Al Bovik in 2001. Together with Hamid Sheikh and Eero Simoncelli, they revised it into the current version of SSIM (of which many variations now exist), described in the paper "Image quality assessment: From error visibility to structural similarity", published in the IEEE Transactions on Image Processing in April 2004.
The 2004 SSIM paper has been cited more than 12,000 times according to Google Scholar, making it one of the most highly cited papers in the image processing and video engineering fields. It received the IEEE Signal Processing Society Best Paper Award for 2009. The inventors of SSIM each received an individual Primetime Engineering Emmy Award in 2015.
Unlike MSE or PSNR, which estimate absolute errors, SSIM is a perception-based model that treats image degradation as a perceived change in structural information, while also incorporating important perceptual phenomena, including both luminance masking and contrast masking terms. Structural information is the idea that pixels have strong inter-dependencies, especially when they are spatially close. These dependencies carry important information about the structure of the objects in the visual scene. Luminance masking is a phenomenon whereby image distortions (in this context) tend to be less visible in bright regions, while contrast masking is a phenomenon whereby distortions become less visible where there is significant activity or "texture" in the image.
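The SSIM index is calculated on various windows of an image. The measure between two windows $x$ and $y$ of common size N×N is:

$$\mathrm{SSIM}(x,y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$

with:
- $\mu_x$ the average of $x$;
- $\mu_y$ the average of $y$;
- $\sigma_x^2$ the variance of $x$;
- $\sigma_y^2$ the variance of $y$;
- $\sigma_{xy}$ the covariance of $x$ and $y$;
- $c_1 = (k_1 L)^2$, $c_2 = (k_2 L)^2$, two variables to stabilize the division with weak denominator;
- $L$ the dynamic range of the pixel-values (typically this is $2^{\#\text{bits per pixel}} - 1$);
- $k_1 = 0.01$ and $k_2 = 0.03$ by default.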
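The single-window formula can be illustrated with a short sketch. This is a minimal pure-Python version, not a reference implementation: the function name `ssim_window`, the flat-list window representation, and the use of population (divide-by-n) statistics are assumptions made for illustration.

```python
def ssim_window(x, y, dynamic_range=255, k1=0.01, k2=0.03):
    """SSIM between two equally sized windows given as flat sequences of pixel values."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("windows must be non-empty and of equal size")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    # Population (divide-by-n) statistics; this normalization is an assumption.
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Stabilizing constants c1 = (k1 * L)^2 and c2 = (k2 * L)^2.
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


if __name__ == "__main__":
    window = [10, 20, 30, 40]
    reversed_window = [40, 30, 20, 10]
    print(ssim_window(window, window))           # identical windows
    print(ssim_window(window, reversed_window))  # same values, inverted structure
```

Identical windows give a score of 1, while the reversed window has the same mean and variance but a negative covariance, so its score is lower.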
The SSIM index satisfies the condition of symmetry:

$$\mathrm{SSIM}(x,y) = \mathrm{SSIM}(y,x)$$
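The SSIM formula is based on three comparison measurements between the samples of $x$ and $y$: luminance ($l$), contrast ($c$) and structure ($s$). The individual comparison functions are:

$$l(x,y) = \frac{2\mu_x\mu_y + c_1}{\mu_x^2 + \mu_y^2 + c_1}$$

$$c(x,y) = \frac{2\sigma_x\sigma_y + c_2}{\sigma_x^2 + \sigma_y^2 + c_2}$$

$$s(x,y) = \frac{\sigma_{xy} + c_3}{\sigma_x\sigma_y + c_3}$$

with, in addition to above definitions:

$$c_3 = c_2 / 2$$

SSIM is then a weighted combination of those comparative measures:

$$\mathrm{SSIM}(x,y) = l(x,y)^{\alpha} \cdot c(x,y)^{\beta} \cdot s(x,y)^{\gamma}$$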
Setting the weights to 1, the formula can be reduced to the form shown at the top of this section.
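The reduction can be checked in a few lines. The derivation below assumes the standard definitions from the 2004 paper: luminance, contrast and structure terms $l$, $c$, $s$ with stabilizing constants $c_1$, $c_2$ and $c_3 = c_2/2$, and exponents $\alpha = \beta = \gamma = 1$. The key step is that the factor $2\sigma_x\sigma_y + c_2$ cancels between the contrast and structure terms:

$$
\begin{aligned}
c(x,y)\,s(x,y)
&= \frac{2\sigma_x\sigma_y + c_2}{\sigma_x^2 + \sigma_y^2 + c_2} \cdot \frac{\sigma_{xy} + c_2/2}{\sigma_x\sigma_y + c_2/2} \\
&= \frac{2\sigma_x\sigma_y + c_2}{\sigma_x^2 + \sigma_y^2 + c_2} \cdot \frac{2\sigma_{xy} + c_2}{2\sigma_x\sigma_y + c_2} \\
&= \frac{2\sigma_{xy} + c_2}{\sigma_x^2 + \sigma_y^2 + c_2}
\end{aligned}
$$

Multiplying by the luminance term then gives

$$
l(x,y)\,c(x,y)\,s(x,y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}
$$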
Application of the formula
To evaluate image quality, this formula is usually applied only to luma, although it may also be applied to color (e.g., RGB) values or chromatic (e.g., YCbCr) values. The resultant SSIM index is a decimal value between -1 and 1; the value 1 is reached only when the two sets of data are identical. Typically it is calculated on window sizes of 8×8. The window can be displaced pixel-by-pixel on the image, but the authors propose to use only a subgroup of the possible windows to reduce the complexity of the calculation.
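The windowed application described above can be sketched as follows. This is an illustrative pure-Python version under stated assumptions: images are lists of rows of luma values, the names `mean_ssim`, `win` and `stride` are hypothetical, and the per-window scores are combined by a plain average. A `stride` of 1 displaces the window pixel-by-pixel, while a larger stride evaluates only a subgroup of the possible windows.

```python
def ssim_window(x, y, dynamic_range=255, k1=0.01, k2=0.03):
    """SSIM between two equally sized windows given as flat lists of pixel values."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1, c2 = (k1 * dynamic_range) ** 2, (k2 * dynamic_range) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )


def mean_ssim(ref, dist, win=8, stride=8, dynamic_range=255):
    """Average SSIM over win x win windows of two images (lists of rows)."""
    height, width = len(ref), len(ref[0])
    if height != len(dist) or width != len(dist[0]):
        raise ValueError("images must have the same shape")
    if height < win or width < win:
        raise ValueError("images must be at least as large as the window")
    scores = []
    for top in range(0, height - win + 1, stride):
        for left in range(0, width - win + 1, stride):
            rows = range(top, top + win)
            cols = range(left, left + win)
            # Flatten the current window of each image into a list of pixels.
            x = [ref[r][c] for r in rows for c in cols]
            y = [dist[r][c] for r in rows for c in cols]
            scores.append(ssim_window(x, y, dynamic_range))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Synthetic 16x16 test image and a copy with a checkerboard brightness error.
    ref = [[(r * 16 + c) % 256 for c in range(16)] for r in range(16)]
    dist = [[min(255, v + (30 if (r + c) % 2 else 0)) for c, v in enumerate(row)]
            for r, row in enumerate(ref)]
    print(mean_ssim(ref, ref))             # identical images
    print(mean_ssim(ref, dist))            # non-overlapping 8x8 windows
    print(mean_ssim(ref, dist, stride=1))  # window displaced pixel-by-pixel
```

The number of windows, and so the cost, grows roughly with the inverse square of the stride, which is the trade-off behind evaluating only a subgroup of windows.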
A more advanced form of SSIM, called Multiscale SSIM, is computed over multiple scales through a process of multiple stages of sub-sampling, reminiscent of multiscale processing in the early vision system. Both SSIM and Multiscale SSIM correlate very highly with human judgments, as measured on widely used public image quality databases, including the LIVE Image Quality Database and the TID Database. Most competitive image quality models are some form or variation of the SSIM concept.
Three-component SSIM (3-SSIM) is a form of SSIM that takes into account the fact that the human eye can see differences more precisely on textured or edge regions than on smooth regions. The resulting metric is calculated as a weighted average of SSIM for three categories of regions: edges, textures, and smooth regions. The proposed weighting is 0.5 for edges and 0.25 each for textured and smooth regions. The authors mention that a 1/0/0 weighting (ignoring anything but edge distortions) leads to results that are closer to subjective ratings. This suggests that edge regions play a dominant role in image quality perception.
This SSIM variant gives results which are more consistent with human subjective perception. It distinguishes distortions like DCT-based compression (all JPEG/MPEG-like algorithms) and blur more strongly than the original SSIM, while being more forgiving to noise.
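The weighted average used by 3-SSIM can be shown as a small worked example. The sketch below takes the mean SSIM of each region category as given inputs; how pixels are classified into edge, texture and smooth regions is not covered here, and the function name `three_component_ssim` is hypothetical.

```python
def three_component_ssim(ssim_edges, ssim_textures, ssim_smooth,
                         weights=(0.5, 0.25, 0.25)):
    """Weighted average of per-region mean SSIM values (edges, textures, smooth)."""
    w_edges, w_textures, w_smooth = weights
    total = w_edges + w_textures + w_smooth
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = (w_edges * ssim_edges
                + w_textures * ssim_textures
                + w_smooth * ssim_smooth)
    # Normalize so that weights not summing to 1 still give an average.
    return weighted / total


if __name__ == "__main__":
    # Hypothetical per-region scores, for illustration only.
    print(three_component_ssim(0.8, 0.9, 1.0))                     # 0.5/0.25/0.25
    print(three_component_ssim(0.8, 0.9, 1.0, weights=(1, 0, 0)))  # edges only
```

With the 1/0/0 weighting the result is simply the edge-region score, which makes the dominance of edge distortions explicit.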
Structural dissimilarity (DSSIM) is a distance metric derived from SSIM (though the triangle inequality is not necessarily satisfied).
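One commonly used definition maps SSIM to a dissimilarity as DSSIM = (1 − SSIM) / 2, so that identical images give 0 and an SSIM of −1 gives 1. The sketch below assumes that definition; other scalings appear in practice, and the function name `dssim` is illustrative.

```python
def dssim(ssim_value):
    """Structural dissimilarity, assuming the definition (1 - SSIM) / 2."""
    if not -1.0 <= ssim_value <= 1.0:
        raise ValueError("SSIM values are expected to lie in [-1, 1]")
    return (1.0 - ssim_value) / 2.0


if __name__ == "__main__":
    for score in (1.0, 0.5, -1.0):
        print(score, dssim(score))
```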
Video quality metrics
The original version of SSIM was designed to measure the quality of still images. It does not contain any parameters directly related to temporal effects of human perception and human judgment. However, several temporal variants of SSIM have been developed.
A simple application of SSIM to estimate video quality would be to calculate the average SSIM value over all frames in the video sequence.
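Frame averaging can be sketched as follows. This is a minimal illustration, not a temporal quality model: frames are assumed to be flat lists of luma values, the per-frame metric is passed in as a function, and `frame_ssim` is a simplified whole-frame SSIM (one global window) included only to keep the example self-contained.

```python
def frame_ssim(x, y, dynamic_range=255, k1=0.01, k2=0.03):
    """Whole-frame SSIM using a single global window (a simplifying assumption)."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1, c2 = (k1 * dynamic_range) ** 2, (k2 * dynamic_range) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )


def video_ssim(ref_frames, dist_frames, frame_metric=frame_ssim):
    """Average a per-frame quality score over all frames of a sequence."""
    if len(ref_frames) != len(dist_frames) or not ref_frames:
        raise ValueError("sequences must be non-empty and of equal length")
    scores = [frame_metric(r, d) for r, d in zip(ref_frames, dist_frames)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Two tiny synthetic frames; the second distorted frame has inverted structure.
    reference = [[10, 20, 30, 40], [50, 60, 70, 80]]
    distorted = [[10, 20, 30, 40], [80, 70, 60, 50]]
    print(video_ssim(reference, reference))
    print(video_ssim(reference, distorted))
```

Because every frame contributes equally, a short burst of severe distortion is diluted by the surrounding clean frames, which is one reason temporal variants exist.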
Owing to its excellent performance and extremely low computation cost, SSIM has become very widely used in the broadcast, cable and satellite television industries, where it is a dominant method of measuring video quality in broadcast and post-production houses. This was the basis for the team's Emmy Award.
SSIM is the basis for a number of video quality measurement tools used globally, including those marketed by Video Clarity, National Instruments, Rohde & Schwarz, and SSIMWave. Overall, SSIM and its variants, such as Multiscale SSIM, are amongst the most widely used full-reference perceptual image and video quality models throughout the world, as evidenced by high citation count, wide industry acceptance, and significant industry recognition and awards.
Discussions over performance
A paper by Dosselmann and Yang suggests that SSIM is not as precise as it claims to be. They claim that SSIM provides quality scores which are no more correlated with human judgment than mean squared error (MSE) values. They dispute the perceptual basis of SSIM, suggesting that its formula does not contain any elaborate visual perception modelling and that SSIM possibly relies on non-perceptual computations.
However, as shown in the original 2004 paper, the SSIM model and algorithm include models of key elements of visual distortion perception, including luminance masking and contrast masking mechanisms. It has also been repeatedly shown to outperform MSE in accuracy.
- Wang, Zhou; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. (2004-04-01). "Image quality assessment: from error visibility to structural similarity". IEEE Transactions on Image Processing. 13 (4): 600–612. doi:10.1109/TIP.2003.819861. ISSN 1057-7149.
- "IEEE Signal Processing Society, Best Paper Award" (PDF).
- Wang, Z.; Simoncelli, E.P.; Bovik, A.C. (2003-11-01). "Multiscale structural similarity for image quality assessment". Conference Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, 2004. 2: 1398–1402. doi:10.1109/ACSSC.2003.1292216.
- Dosselmann, Richard; Yang, Xue Dong (2009-11-06). "A comprehensive assessment of the structural similarity index". Signal, Image and Video Processing. 5 (1): 81–91. doi:10.1007/s11760-009-0144-1. ISSN 1863-1703.