Correlation coefficient

From Wikipedia, the free encyclopedia

Correlation coefficient may refer to:

  • Pearson product-moment correlation coefficient, also known as r, R, or Pearson's r, a measure of the strength and direction of the linear relationship between two variables that is defined as the (sample) covariance of the variables divided by the product of their (sample) standard deviations.
  • Intraclass correlation, a descriptive statistic that can be used when quantitative measurements are made on units that are organized into groups; describes how strongly units in the same group resemble each other.
  • Rank correlation, the study of relationships between rankings of different variables or different rankings of the same variable

Related concepts:

  • Correlation and dependence, a broad class of statistical relationships between two or more random variables or observed data values
  • Goodness of fit, any of several measures of how well a statistical model fits observations, summarizing the discrepancy between observed values and the values expected under the model in question
  • Coefficient of determination, a measure of the proportion of variability in a data set that is accounted for by a statistical model; often called R²; equal in a single-variable linear regression to the square of Pearson's product-moment correlation coefficient.

Correlation Coefficient

The correlation coefficient of two variables, sometimes simply called their correlation, is the covariance of the two variables divided by the product of their individual standard deviations. It is a normalized measurement of how the two variables are linearly related.

The population correlation coefficient is defined as follows, where \sigma_X and \sigma_Y are the population standard deviations, and \mathrm{cov}(X,Y) is the population covariance:

\rho_{X,Y}=\mathrm{corr}(X,Y)={\mathrm{cov}(X,Y) \over \sigma_X \sigma_Y} ={E[(X-\mu_X)(Y-\mu_Y)] \over \sigma_X\sigma_Y}.

The sample correlation coefficient instead uses the sample covariance s_{xy} and the sample standard deviations s_x and s_y, computed from n paired observations (x_i, y_i), and is defined as follows:


r_{xy}
      ={s_{xy} \over s_x s_y}
      ={\frac{1}{n-1}\sum\limits_{i=1}^n (x_i-\bar{x})(y_i-\bar{y}) \over s_x s_y}

Typically, the population means are not known, so the sample means \bar{x} and \bar{y} are used in their place and the covariance is estimated with the unbiased divisor n-1. Expanding the sample standard deviations in the same way gives:


r_{xy}
      =\frac{\sum\limits_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{(n-1) s_x s_y}
      =\frac{\sum\limits_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}
            {\sqrt{\sum\limits_{i=1}^n (x_i-\bar{x})^2 \sum\limits_{i=1}^n (y_i-\bar{y})^2}}.
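
The two forms of the sample formula above can be checked against each other numerically. The following is a minimal Python sketch, not a reference implementation: the function names sample_correlation and sample_correlation_sums and the data are made up for illustration, and it assumes neither variable is constant (otherwise the denominator is zero).

```python
import math
from statistics import mean, stdev


def sample_correlation(xs, ys):
    """Sample correlation using the (n - 1) * s_x * s_y denominator."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of cross-deviations: sum((x_i - x_bar) * (y_i - y_bar))
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # statistics.stdev uses the n - 1 divisor, matching s_x and s_y
    return cross / ((n - 1) * stdev(xs) * stdev(ys))


def sample_correlation_sums(xs, ys):
    """Same quantity, using the sums of squared deviations directly."""
    x_bar, y_bar = mean(xs), mean(ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)


# Illustrative data (hypothetical, chosen so the sums are easy to check by hand)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

r_first = sample_correlation(xs, ys)
r_second = sample_correlation_sums(xs, ys)
print(r_first, r_second)
```

For this data the cross-deviation sum is 6 and the sums of squared deviations are 10 and 6, so both functions should return 6/\sqrt{60}, roughly 0.77, up to floating-point rounding.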

If the correlation coefficient is close to 1, the variables are positively linearly related and the points of the scatter plot fall almost along a straight line with positive slope. If it is close to -1, the variables are negatively linearly related and the points fall almost along a straight line with negative slope. If it is close to zero, the linear relationship between the variables is weak.
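
The three cases can be illustrated with small constructed data sets. This is a sketch only: the helper pearson_r and the data are hypothetical, picked so that the results are easy to verify by hand. The last case uses a symmetric parabola to show that a coefficient of zero describes the absence of a linear relationship, even though y is fully determined by x.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation from sums of deviations (assumes non-constant inputs)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Points exactly on a line with positive slope
r_up = pearson_r(xs, [2 * x + 1 for x in xs])

# Points exactly on a line with negative slope
r_down = pearson_r(xs, [-3 * x + 4 for x in xs])

# Points on a parabola symmetric about x = 0: no linear trend
r_curve = pearson_r(xs, [x * x for x in xs])

print(r_up, r_down, r_curve)
```

The first two values should come out as 1 and -1, and the third as 0, up to floating-point rounding.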