Hotelling's T-squared distribution

In statistics, Hotelling's T-squared distribution is a univariate distribution proportional to the F-distribution that arises as the distribution of a set of statistics which are natural generalizations of the statistics underlying Student's t-distribution. In particular, the distribution arises in multivariate statistics in tests of differences between the (multivariate) means of different populations, where the corresponding univariate problems would use a t-test.

The distribution is named for Harold Hotelling, who developed it[1] as a generalization of Student's t-distribution.

The distribution

If the vector \mathbf{d} is Gaussian multivariate-distributed with zero mean and unit covariance matrix, \mathbf{d}\sim\mathcal{N}_p(\mathbf{0},\mathbf{I}_p), and \mathbf{M} is a p\times p random matrix with a Wishart distribution with unit scale matrix and m degrees of freedom, \mathbf{M}\sim\mathcal{W}_p(\mathbf{I}_p,m), then the quadratic form m(\mathbf{d}'\mathbf{M}^{-1}\mathbf{d}) has a Hotelling T^2 distribution with dimensionality parameter p and m degrees of freedom.[2]

Let T^2_{p,m} denote Hotelling's T-squared distribution with parameters p and m. If a random variable X has this distribution,

X \sim T^2_{p,m}

then[1]


\frac{m-p+1}{pm} X\sim F_{p,m-p+1}

where F_{p,m-p+1} is the F-distribution with parameters p and m−p+1.
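
This relationship can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the original article, assuming NumPy and SciPy; the values p = 3, m = 10 and the simulation size are arbitrary illustrative choices. It draws the Gaussian vector and Wishart matrix from the definition above, forms X = m(d'M^{-1}d), and compares the rescaled values with F_{p,m-p+1}.

import numpy as np
from scipy import stats

p, m, n_sim = 3, 10, 20000          # illustrative choices, not from the article
rng = np.random.default_rng(0)

# d ~ N_p(0, I_p) and M ~ W_p(I_p, m), as in the definition above
d = rng.standard_normal((n_sim, p))
M = stats.wishart(df=m, scale=np.eye(p)).rvs(size=n_sim, random_state=rng)

# X = m * d' M^{-1} d has a Hotelling T^2_{p,m} distribution
X = m * np.einsum('ij,ijk,ik->i', d, np.linalg.inv(M), d)

# (m - p + 1) / (p m) * X should follow F with (p, m - p + 1) degrees of freedom
scaled = (m - p + 1) / (p * m) * X
print(stats.kstest(scaled, 'f', args=(p, m - p + 1)))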

Hotelling's T-squared statistic

Hotelling's T-squared statistic is a generalization of Student's t statistic that is used in multivariate hypothesis testing, and is defined as follows.[1]

Let \mathcal{N}_p(\boldsymbol{\mu},{\mathbf \Sigma}) denote a p-variate normal distribution with location \boldsymbol{\mu} and covariance {\mathbf \Sigma}. Let

{\mathbf x}_1,\dots,{\mathbf x}_n\sim \mathcal{N}_p(\boldsymbol{\mu},{\mathbf \Sigma})

be n independent random variables, which may be represented as p\times1 column vectors of real numbers. Define

\overline{\mathbf x}=\frac{\mathbf{x}_1+\cdots+\mathbf{x}_n}{n}

to be the sample mean. It can be shown that


n(\overline{\mathbf x}-\boldsymbol{\mu})'{\mathbf \Sigma}^{-1}(\overline{\mathbf x}-\boldsymbol{\mu})\sim\chi^2_p ,

where \chi^2_p is the chi-squared distribution with p degrees of freedom. To show this, use the fact that \overline{\mathbf x}\sim \mathcal{N}_p(\boldsymbol{\mu},{\mathbf \Sigma}/n) and then derive the characteristic function of the random variable \mathbf y=n(\overline{\mathbf x}-\boldsymbol{\mu})'{\mathbf \Sigma}^{-1}(\overline{\mathbf x}-\boldsymbol{\mu}). This is done below:

\phi_{\mathbf y}(\theta) = \operatorname{E} e^{i \theta \mathbf y}
= \operatorname{E} e^{i \theta n(\overline{\mathbf x}-\boldsymbol{\mu})'{\mathbf \Sigma}^{-1}(\overline{\mathbf x}-\boldsymbol{\mu})}
= \int e^{i \theta n(\overline{\mathbf x}-\boldsymbol{\mu})'{\mathbf \Sigma}^{-1}(\overline{\mathbf x}-\boldsymbol{\mu})} (2\pi)^{-\frac{p}{2}}|\boldsymbol\Sigma/n|^{-\frac{1}{2}}\, e^{ -\frac{1}{2}n(\overline{\mathbf x}-\boldsymbol\mu)'\boldsymbol\Sigma^{-1}(\overline{\mathbf x}-\boldsymbol\mu) }\,d\overline{x}_1\cdots d\overline{x}_p
= \int (2\pi)^{-\frac{p}{2}}|\boldsymbol\Sigma/n|^{-\frac{1}{2}}\, e^{ -\frac{1}{2}n(\overline{\mathbf x}-\boldsymbol\mu)'(\boldsymbol\Sigma^{-1}-2 i \theta \boldsymbol\Sigma^{-1})(\overline{\mathbf x}-\boldsymbol\mu) }\,d\overline{x}_1\cdots d\overline{x}_p
= |(\boldsymbol\Sigma^{-1}-2 i \theta \boldsymbol\Sigma^{-1})^{-1}/n|^{\frac{1}{2}} |\boldsymbol\Sigma/n|^{-\frac{1}{2}} \int (2\pi)^{-\frac{p}{2}} |(\boldsymbol\Sigma^{-1}-2 i \theta \boldsymbol\Sigma^{-1})^{-1}/n|^{-\frac{1}{2}} \, e^{ -\frac{1}{2}n(\overline{\mathbf x}-\boldsymbol\mu)'(\boldsymbol\Sigma^{-1}-2 i \theta \boldsymbol\Sigma^{-1})(\overline{\mathbf x}-\boldsymbol\mu) }\,d\overline{x}_1\cdots d\overline{x}_p
= |\mathbf I_p-2 i \theta \mathbf I_p|^{-\frac{1}{2}}
= (1-2 i \theta)^{-\frac{p}{2}},

which is the characteristic function of the chi-squared distribution with p degrees of freedom. ~\blacksquare
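
As a quick numerical illustration, not from the article and assuming NumPy and SciPy, the sketch below simulates many samples with a known covariance \Sigma (the particular \mu, \Sigma, n and simulation size are arbitrary), forms n(\overline{\mathbf x}-\boldsymbol{\mu})'{\mathbf \Sigma}^{-1}(\overline{\mathbf x}-\boldsymbol{\mu}) for each, and checks the agreement with \chi^2_p.

import numpy as np
from scipy import stats

p, n, n_sim = 3, 50, 10000            # illustrative choices
rng = np.random.default_rng(1)
mu = np.zeros(p)
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])   # an arbitrary known covariance

stat = np.empty(n_sim)
for s in range(n_sim):
    x = rng.multivariate_normal(mu, Sigma, size=n)
    diff = x.mean(axis=0) - mu
    # n (xbar - mu)' Sigma^{-1} (xbar - mu), with Sigma known
    stat[s] = n * diff @ np.linalg.solve(Sigma, diff)

print(stats.kstest(stat, 'chi2', args=(p,)))   # should be consistent with chi-squared, p d.o.f.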

However, {\mathbf \Sigma} is often unknown and we wish to test hypotheses about the location \boldsymbol{\mu}.

Sum of p squared t's

Define

{\mathbf W}=\frac{1}{n-1}\sum_{i=1}^n (\mathbf{x}_i-\overline{\mathbf x})(\mathbf{x}_i-\overline{\mathbf x})'

to be the sample covariance. Here we denote transpose by an apostrophe. It can be shown that \mathbf W is positive-definite and (n-1)\mathbf W follows a p-variate Wishart distribution with n−1 degrees of freedom.[3] Hotelling's T-squared statistic is then defined[4] to be


t^2=n(\overline{\mathbf x}-\boldsymbol{\mu})'{\mathbf W}^{-1}(\overline{\mathbf x}-\boldsymbol{\mu})

and, also from above,

t^2 \sim T^2_{p,n-1}

i.e.

\frac{n-p}{p(n-1)}t^2 \sim F_{p,n-p} ,

where F_{p,n-p} is the F-distribution with parameters p and n−p. To calculate a p-value, multiply the t^2 statistic by the above constant and use the F-distribution.
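
A minimal sketch of this recipe in Python follows; it is not part of the original article, NumPy and SciPy are assumed, and the function name hotelling_one_sample together with the synthetic data is purely illustrative. It computes \overline{\mathbf x} and {\mathbf W}, forms t^2, rescales to an F statistic, and reads the p-value from F_{p,n-p}.

import numpy as np
from scipy import stats

def hotelling_one_sample(x, mu0):
    # x: (n, p) data matrix; mu0: hypothesized mean under H0
    n, p = x.shape
    xbar = x.mean(axis=0)
    W = np.cov(x, rowvar=False)              # unbiased sample covariance (divisor n - 1)
    diff = xbar - mu0
    t2 = n * diff @ np.linalg.solve(W, diff)
    f_stat = (n - p) / (p * (n - 1)) * t2    # ~ F(p, n - p) under H0
    p_value = stats.f.sf(f_stat, p, n - p)
    return t2, f_stat, p_value

# synthetic example data
rng = np.random.default_rng(2)
x = rng.multivariate_normal([0.2, -0.1, 0.0], np.eye(3), size=40)
print(hotelling_one_sample(x, np.zeros(3)))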

Hotelling's two-sample T-squared statistic

If {\mathbf x}_1,\dots,{\mathbf x}_{n_x}\sim N_p(\boldsymbol{\mu},{\mathbf V}) and {\mathbf y}_1,\dots,{\mathbf y}_{n_y}\sim N_p(\boldsymbol{\mu},{\mathbf V}), with the samples independently drawn from two independent multivariate normal distributions with the same mean and covariance, and we define

\overline{\mathbf x}=\frac{1}{n_x}\sum_{i=1}^{n_x} \mathbf{x}_i \qquad \overline{\mathbf y}=\frac{1}{n_y}\sum_{i=1}^{n_y} \mathbf{y}_i

as the sample means, and

{\mathbf W}= \frac{\sum_{i=1}^{n_x}(\mathbf{x}_i-\overline{\mathbf x})(\mathbf{x}_i-\overline{\mathbf x})'
+\sum_{i=1}^{n_y}(\mathbf{y}_i-\overline{\mathbf y})(\mathbf{y}_i-\overline{\mathbf y})'}{n_x+n_y-2}

as the unbiased pooled covariance matrix estimate, then Hotelling's two-sample T-squared statistic is

t^2 = \frac{n_x n_y}{n_x+n_y}(\overline{\mathbf x}-\overline{\mathbf y})'{\mathbf W}^{-1}(\overline{\mathbf x}-\overline{\mathbf y})
\sim T^2(p, n_x+n_y-2)

and it can be related to the F-distribution by[3]

\frac{n_x+n_y-p-1}{(n_x+n_y-2)p}t^2 \sim F(p,n_x+n_y-1-p).
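
As with the one-sample case, a short sketch (not from the article; NumPy and SciPy assumed, and hotelling_two_sample together with the synthetic data is an illustrative choice) shows how the two-sample statistic and its p-value would be computed.

import numpy as np
from scipy import stats

def hotelling_two_sample(x, y):
    # x: (n_x, p) and y: (n_y, p) data matrices
    nx, p = x.shape
    ny = y.shape[0]
    diff = x.mean(axis=0) - y.mean(axis=0)
    # unbiased pooled covariance estimate W from the article
    W = ((nx - 1) * np.cov(x, rowvar=False) +
         (ny - 1) * np.cov(y, rowvar=False)) / (nx + ny - 2)
    t2 = nx * ny / (nx + ny) * diff @ np.linalg.solve(W, diff)
    f_stat = (nx + ny - p - 1) / ((nx + ny - 2) * p) * t2
    p_value = stats.f.sf(f_stat, p, nx + ny - 1 - p)
    return t2, f_stat, p_value

# synthetic example under the null hypothesis of equal means
rng = np.random.default_rng(3)
x = rng.multivariate_normal(np.zeros(3), np.eye(3), size=30)
y = rng.multivariate_normal(np.zeros(3), np.eye(3), size=40)
print(hotelling_two_sample(x, y))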

The non-null distribution of this statistic is the noncentral F-distribution (the ratio of a noncentral chi-squared random variable and an independent central chi-squared random variable, each divided by its degrees of freedom),

\frac{n_x+n_y-p-1}{(n_x+n_y-2)p}t^2 \sim F(p,n_x+n_y-1-p;\delta),

with

\delta = \frac{n_x n_y}{n_x+n_y}\boldsymbol{\nu}'\mathbf{V}^{-1}\boldsymbol{\nu},

where \boldsymbol{\nu} is the difference vector between the population means.
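
For power calculations, the noncentrality parameter \delta and the corresponding rejection probability can be evaluated numerically. The sketch below is not from the article; NumPy and SciPy are assumed, and the values of n_x, n_y, \boldsymbol{\nu}, \mathbf{V} and the significance level \alpha are illustrative. It uses SciPy's noncentral F distribution.

import numpy as np
from scipy import stats

p, nx, ny, alpha = 3, 30, 40, 0.05
V = np.eye(p)                        # common covariance matrix (illustrative)
nu = np.array([0.5, 0.0, -0.3])      # assumed difference between the population means

# noncentrality parameter delta as defined above
delta = nx * ny / (nx + ny) * nu @ np.linalg.solve(V, nu)

dfn, dfd = p, nx + ny - 1 - p
crit = stats.f.ppf(1 - alpha, dfn, dfd)        # critical value of the central F distribution
power = stats.ncf.sf(crit, dfn, dfd, delta)    # rejection probability under the non-null distribution
print(delta, power)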

References

  1. Hotelling, H. (1931). "The generalization of Student's ratio". Annals of Mathematical Statistics 2 (3): 360–378. doi:10.1214/aoms/1177732979.
  2. Weisstein, Eric W. CRC Concise Encyclopedia of Mathematics, Second Edition. Chapman & Hall/CRC, 2003, p. 1408.
  3. Mardia, K.V., Kent, J.T. and Bibby, J.M. (1979). Multivariate Analysis. Academic Press.
  4. http://www.itl.nist.gov/div898/handbook/pmc/section5/pmc543.htm
