Talk:Multivariate normal distribution/Archive 0


Conditional Distribution

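I think where $\bar{\mu}$ and $\bar{\Sigma}$ come from in the conditional distribution section is a little confusing. I suggest the following update.

If $X \sim N(\mu, \Sigma)$ is partitioned such that

$X = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$ with sizes $\begin{bmatrix} q \times 1 \\ (N-q) \times 1 \end{bmatrix}$, where
$\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}$ with sizes $\begin{bmatrix} q \times 1 \\ (N-q) \times 1 \end{bmatrix}$
$\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}$ with sizes $\begin{bmatrix} q \times q & q \times (N-q) \\ (N-q) \times q & (N-q) \times (N-q) \end{bmatrix}$,

then the distribution of $X_1$ conditional on $X_2 = a$ is multivariate normal, $(X_1 \mid X_2 = a) \sim N(\bar{\mu}, \bar{\Sigma})$, where

$\bar{\mu} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (a - \mu_2)$

and covariance matrix

$\bar{\Sigma} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$. (talk) 23:43, 21 August 2008 (UTC)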
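
For readers who want to try the conditional-distribution formulas numerically, here is a minimal NumPy sketch. It assumes the standard partitioned form (the first q coordinates are conditioned on the remaining ones); the function name `conditional_mvn` and the example numbers are made up for illustration and do not come from the article.

```python
import numpy as np

def conditional_mvn(mu, sigma, q, a):
    """Mean and covariance of X1 | X2 = a, where X1 is the first q coordinates.

    Hypothetical helper for illustration; assumes the block Sigma_22 is invertible.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    a = np.asarray(a, dtype=float)
    mu1, mu2 = mu[:q], mu[q:]
    s11 = sigma[:q, :q]
    s21, s22 = sigma[q:, :q], sigma[q:, q:]
    # gain = Sigma_12 Sigma_22^{-1}, computed by solving instead of inverting
    # (uses the symmetry of Sigma: Sigma_12 = Sigma_21^T).
    gain = np.linalg.solve(s22, s21).T
    cond_mean = mu1 + gain @ (a - mu2)
    cond_cov = s11 - gain @ s21
    return cond_mean, cond_cov

# Illustrative bivariate example: condition the first coordinate on X2 = 2.
mu = [0.0, 1.0]
sigma = [[2.0, 0.8], [0.8, 1.0]]
cond_mean, cond_cov = conditional_mvn(mu, sigma, q=1, a=[2.0])
print(cond_mean, cond_cov)
```

Note that the conditional covariance does not depend on the observed value a, only the conditional mean does.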

possible Kullback-Leibler Divergence error

I believe you have the KL divergence backwards. It should be from N1 to N0 based on your formula and the language on the KL Divergence page. It is easy to check in the case where observations are independent; see Bozdogan 1987 for such a case. Troutinthemilk 20:05, 7 June 2007 (UTC)

Just another remark, wouldn't it be better to write Id instead of N for the identity matrix? If N means the identity matrix, otherwise would be a good idea to make clear what it stands for.—Preceding unsigned comment added by (talk) 07:20, 16 June 2010 (UTC)

Hi. There is an error in the formula for the KL. This is related to the comment above also. The order of the means in the quadratic form should be reversed. It should read:

$(m_0 - m_1)' S_1^{-1} (m_0 - m_1)$

I will edit the formula if it is not corrected in a few days, but since this page is (presumably) somebody else's baby, I thought I should leave a comment first. Thanks and great work, this page is very useful. Peter Halpin, PhD, University of Amsterdam -- Preceding unsigned comment added by (talk) 13:02, 12 October 2010 (UTC)

Peter, really, $(a - b)^2 = (b - a)^2$, so there is no error there...  // stpasha »  21:09, 12 October 2010 (UTC)

In the KL formula, should the log-of-ratio-of-determinants term be outside the parenthesis, ie not be multiplied by one half? It seems to me that this is what the cited reference says (in the one-dimensional case), and this is also what I get when trying to carry out the KL integrations myself (again, in the 1D case). OMHalck (talk) 15:27, 12 June 2011 (UTC)

Ignore my confusion above; I mixed up the upper-case Sigmas here (variances in the 1D case) with the lower-case sigmas in the reference (square roots of variances).

OMHalck (talk) 05:26, 14 June 2011 (UTC)
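
To make the points in this thread concrete, here is a hedged sketch of the KL divergence between two non-degenerate multivariate normals, using the standard closed form with the one-half multiplying every term, including the log-determinant ratio. The function name `kl_mvn` is hypothetical, and the 1D check uses the textbook expression written with standard deviations rather than variances.

```python
import numpy as np

def kl_mvn(mu0, sigma0, mu1, sigma1):
    """KL divergence D(N0 || N1) for non-degenerate multivariate normals.

    Hypothetical helper; both covariance matrices are assumed positive definite.
    """
    mu0, mu1 = np.asarray(mu0, dtype=float), np.asarray(mu1, dtype=float)
    sigma0, sigma1 = np.asarray(sigma0, dtype=float), np.asarray(sigma1, dtype=float)
    k = mu0.shape[0]
    diff = mu1 - mu0  # the sign does not matter: the quadratic form is even
    trace_term = np.trace(np.linalg.solve(sigma1, sigma0))
    quad_term = diff @ np.linalg.solve(sigma1, diff)
    _, logdet0 = np.linalg.slogdet(sigma0)
    _, logdet1 = np.linalg.slogdet(sigma1)
    return 0.5 * (trace_term + quad_term - k + logdet1 - logdet0)

# 1D check: variances 1 and 4, i.e. standard deviations 1 and 2.
kl_1d = kl_mvn([0.0], [[1.0]], [1.0], [[4.0]])
expected_1d = np.log(2.0 / 1.0) + (1.0 + (0.0 - 1.0) ** 2) / (2.0 * 4.0) - 0.5

# The divergence is not symmetric in its two arguments.
kl_reversed = kl_mvn([1.0], [[4.0]], [0.0], [[1.0]])
print(kl_1d, expected_1d, kl_reversed)
```

The 1D expression has ln(σ1/σ0) outside any one-half, which matches one-half times the log of the variance ratio, the source of the confusion resolved above.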

error on the normalization

the normalization in the N-dimensional case should be:


see also Gaussian_blur and Difference of Gaussians

That is the same as $(2\pi)^{-N/2}\det(\Gamma)^{-1/2}$. --Zerotalk 12:17, 24 January 2007 (UTC)


Note as of 2010/12/24 eqn shows: $(2\pi)^{-N/2}\det(\Gamma)^{1/2}$ (due to det(Γ) already being in denominator, so neg power cancels out..) Fixing. SimonFunk (talk) 05:41, 25 December 2010 (UTC)

No, I put the minus sign on purpose. Move the whole determinant to the numerator, if you prefer. In December, I checked with Mathematica and noticed that the minus was missing. And if I am not completely confused, it is wrong now in the way that you left it. (I think when Zero talks about the normalisation factor (above), he means the denominator.) I am not undoing your undo now, as I am too lazy to double-check a second time that I am really right here, but if you haven't done so before removing my minus, please check again. Simon A. (talk) 23:03, 10 January 2011 (UTC)
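
One way to settle the sign of the exponent without Mathematica is to integrate the density numerically. The sketch below is only an illustration under assumed values (a made-up 2×2 covariance and a zero mean): it sums the unnormalized Gaussian on a fine grid and compares the two candidate constants.

```python
import numpy as np

gamma = np.array([[2.0, 0.6], [0.6, 1.0]])  # illustrative covariance, N = 2
n_dim = gamma.shape[0]
gamma_inv = np.linalg.inv(gamma)
det_gamma = np.linalg.det(gamma)

# Grid wide enough that the probability mass outside it is negligible.
step = 0.02
axis = np.arange(-12.0, 12.0, step)
xx, yy = np.meshgrid(axis, axis)
points = np.stack([xx, yy], axis=-1)

# Quadratic form x^T Gamma^{-1} x at every grid point.
quad = np.einsum("...i,ij,...j->...", points, gamma_inv, points)
unnormalized = np.exp(-0.5 * quad)

const_minus = (2 * np.pi) ** (-n_dim / 2) * det_gamma ** (-0.5)
const_plus = (2 * np.pi) ** (-n_dim / 2) * det_gamma ** (0.5)

total_minus = const_minus * unnormalized.sum() * step**2
total_plus = const_plus * unnormalized.sum() * step**2
print(total_minus, total_plus)
```

With the exponent −1/2 on the determinant the total is 1; with +1/2 the total comes out as det(Γ) instead, so only the former is a normalization constant.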

Question on Bivariate Normal

If X is normally distributed and Y is normally distributed, and z = X * Y, is z bivariate normally distributed?


I moved the following condition from the main page:


Is u the same as μ here? AxelBoldt

The vector u is not the expected vector. The characteristic functional of X is the expectation value of $\exp[i(u_1 X_1 + \dots + u_n X_n)]$. I write $\varphi_X(u) = E[e^{i u \cdot X}]$. The expected vector is the gradient of φ at u=0, divided by i. I made a mistake: the correct characterization is


This characterization is necessary as a technical step in the proof of equivalence. As far as I know, it is the only way to show that, if every linear combination of the Xi is Gaussian, then the Xi are jointly Gaussian. -- Miguel

ok, I'll put it back in. AxelBoldt

The motivation of this page is that it is a prerequisite to defining a "Gaussian stochastic process". The best way to do this is to say that every linear functional of the random function is a Gaussian random variable. -- Miguel

The "characterization"

is actually not correct because it implies that the various Xi are uncorrelated.

yup, you're right; I'll mention that in the article. AxelBoldt

Also, it seems that Gaussian is capitalized because it is the name of a person. -- Miguel

There is also a program used in computational chemistry whose name is Gaussian (with the G) -- people used to this program may expect lower-case g for other uses of "gaussian". And according to the ACS Style Guide, the trend is toward lower-casing surnames that are used as units. But then again, this is math, not chemistry... -- Marj Tiefert, Monday, May 6, 2002
Oh, didn't know that "Gaussian" is based on a surname -- that's probably why I was not able to find many examples of the term lowercased on Google. A redirect with the term in lowercase already points here, so I think that is enough. --maveric149
I agree that units should be lower-case. For example, the "gauss" is a unit of magnetic field strength, in honour of Gauss' work on magnetism.
Now that we're at it and maveric is in this discussion, maybe we should agree on the best way to disambiguate gauss (unit) from Gauss (prince of mathematicians), and gaussian (computer program) from Gaussian (random variable) -- Miguel

Hum... Interesting quandary. First let's start with the personality since that is the easiest. It really isn't necessary to list Gauss the man on a disambiguation page located at Gauss since nobody with half a brain would simply link to Gauss and expect that link to go directly to an article about Carl Friedrich Gauss (which is named correctly BTW). The same would be true about Smith and Adam Smith -- it is a misuse of disambiguation pages to list people who had X for a last name unless they were primarily only known by their last name and other things are also known by that name (wikipedia is not a name directory). A good example of this would be Seneca which is both the name of a first century philosopher and the name of a Native American tribe (some disambiguation is needed at Seneca I see...).

However, Gauss is already redirected to Carl Friedrich Gauss, which is not surprising since mathematicians (and, more generally, scientists) are almost universally known by a single surname. This leads to confusion: there were five Bernoullis, two Banachs, two Pearsons... To confuse matters more, there is not only "gauss (unit)", which is a unit of magnetic field intensity, but also "Gauss units", which is a particular choice of normalization of the Maxwell equations and the elementary charge (there are also "Heaviside units") -- Miguel

As for the "gaussian" issue: I wasn't sure about this one since I didn't know there was a computer program with the same name, so I did a little Googling. Found out that <gaussian> got 2/3rds of a million hits and <gaussian "computer program"> got less than 1% that number of hits.

Searching for "Gaussian computational chemistry" gives 16200 hits  :-) (Marj)

This tells me that Gaussian the computer program is far less widely known than gaussian the variable -- thus confirming my first reaction. Since one usage is far more widely known and expected than the other we should have an article titled gaussian that is only about the mathematics term. A link to either Gaussian computer program or Gaussian (computer program) can then be placed at the bottom of that page (in the same way as Paris, Texas is linked at the bottom of the Paris entry -- which is about Paris, France BTW). This is what I like to call 'weak disambiguation'.

Not sure what the name of the article about the computer program should be... Would it sound odd to use "Gaussian computer program" in a sentence talking about Gaussian the computer program? Or is this computer program almost always referred to simply as "Gaussian"?

Gaussian is produced by Gaussian, Inc., who refer to it as simply "Gaussian". Their website looks like they have more of an academic than a big-corporation mindset, however - like, I didn't notice whether they'd trademarked this use of "Gaussian" (if they in fact could have). Among computational chemists, I've always heard it referred to as "Gaussian", but there wasn't any ambiguity, since they were talking about computational chemistry. Probably the program makes use of the mathematical species of "Gaussian", or "gaussian".  ;-) -- Marj Tiefert, Wednesday, May 8, 2002
You can see "the mathematical species of Gaussian" in their logo. -- Miguel

This is important since a major part of our naming conventions deals with easy-linking and whenever a disambiguation issue like this arises, we first really should look for alternatives that are also widely used yet less ambiguous. Who wants to have to write [[Gaussian (computer program)|Gaussian]] each time they link to that article? However, if the use of the term "Gaussian computer program" makes for contrived and odd sounding sentences then we might just as well place that article at [[Gaussian (computer program)]] so as not to needlessly imply that "computer program" is part of its name. The use of parentheses in disambiguation is what I like to call 'strong disambiguation' and is something to be used only as a last resort. Hope this helps.

BTW, I'm still not sure about a general rule for capitalizing units that are derived from surnames... As it is, I am beginning to lean in favor of making them lowercase. However, we might want to explore whether there might be any exceptions where a capitalized term would be used. For example the unit newton is commonly expressed in lowercase form, but then Celsius is usually shown with a capital 'C' (along with the other two common temperature scales).... Any other thoughts?--maveric149

Celsius and Fahrenheit obey the rule because they are strictly 'degrees Celsius' and 'degrees Fahrenheit', where the first word of the unit is not capitalised. When written explicitly, something like 'ten kelvin' should be written with a lowercase 'k'.

Miguel, I don't think the first and the second condition given in the article are equivalent. Take for instance X=(X1,X2) where X1 is standard normal and X2 is uniform on [0,1]. Then the first condition is not satisfied, but the second is, using the matrix A = (1 0). I claim A needs to be square (and will then automatically be invertible.) --AxelBoldt

You're right. Thanks for pointing that out. I reversed the relation between X and Z. The result, with a rectangular A, is correct. The reason the original Z=A(X-μ) doesn't work is that the covariance matrix of Z doesn't have the right rank. If Z=A X and the covariance matrix of X is Γ, then the covariance matrix of Z must be $A \Gamma A^T$. But the rank of this is at most the rank of Γ and we are requiring the components of Z to be independent N[0,1]. That's why Z needs to have a smaller dimension. But, as you point out, this doesn't work either.
As far as the current statement goes, the number of components of Z could be arbitrarily large, but not smaller than the rank of Γ. Miguel

We still have serious problems with the definition here. First, do we consider a variable that's constant 0 to be normally distributed? If not, then the first two statements are not equivalent. Also, in the third statement, should we go to a positive semidefinite Γ? AxelBoldt 06:13 Jan 24, 2003 (UTC)

We definitely need to consider a constant (not only 0) to be normally distributed (with variance 0, of course), and we need to eliminate the words "unless all ai are 0". The reason is that we need to allow singular variance matrices, and once that happens we have some nonzero linear combinations of non-degenerate normals adding up to a constant. Example: the residuals (which are not independent, and must not be confused with the errors, which are independent) from the simplest sort of ordinary linear regression are constrained to lie within a space of codimension 2. That vector of residuals has a singular variance matrix. The distribution of its sum of squares is chi-square with n-2 degrees of freedom. The whole discussion leading to that conclusion would be horribly complicated if we're forbidden to speak of normal distributions whose variance is a singular matrix. Michael Hardy 17:19 Jan 24, 2003 (UTC)


  • there is a vector μ=(μ1,...,μn) and a symmetric, positive semidefinite matrix Γ such that X has density
$f_X(x_1,\dots,x_n)\,dx_1 \cdots dx_n = (\det(2\pi\Gamma))^{-1/2} \exp\left(-\tfrac{1}{2}(x-\mu)^T \Gamma^{-1} (x-\mu)\right) dx_1 \cdots dx_n$


  • there is a vector μ=(μ1,...,μn) and a symmetric, positive definite matrix Γ such that X has density
$f_X(x_1,\dots,x_n)\,dx_1 \cdots dx_n = (\det(2\pi\Gamma))^{-n/2} \exp\left(-\tfrac{1}{2}(x-\mu)^T \Gamma^{-1} (x-\mu)\right) dx_1 \cdots dx_n$

(semidefinite -> definite, 1 -> n) or should I stick to things I know something about? — user:

positive semidefinite means that we are allowing zero variance (i.e., a random variable that always takes the same value). See the discussion just above your question.
The determinant of Γ takes into account the variances and covariances of all variables, and so it need not be raised to the nth power.
Last but not least, if you know enough to ask these questions, you actually "know something about" this ;-) — Miguel 17:44, 2004 Feb 24 (UTC)
I agree with the non-logged-in user's criticism. Multivariate normal distributions exist in which the variance is a positive semi-definite matrix of determinant zero. In a coordinate system in which the components are independent, one or more components has variance zero. But: such a distribution has no density with respect to the usual n-dimensional Lebesgue measure; no density function should be attributed to such distributions unless it is with respect to a measure on a space of lower dimension. Michael Hardy 21:07, 24 Feb 2004 (UTC)
You're completely right, as usual :-) Miguel 21:24, 2004 Feb 24 (UTC)
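
As a short worked step on why the determinant is not raised to the nth power: the dimension already enters through the factor 2π inside the determinant. This is a standard identity, written here for an n×n matrix Γ as in the bullets above.

```latex
\det(2\pi\Gamma) = (2\pi)^{n}\,\det(\Gamma)
\quad\Longrightarrow\quad
\bigl(\det(2\pi\Gamma)\bigr)^{-1/2} = (2\pi)^{-n/2}\,\det(\Gamma)^{-1/2}.
% Special case n = 1, \Gamma = \sigma^2:
% (2\pi\sigma^2)^{-1/2} = \frac{1}{\sigma\sqrt{2\pi}},
% the usual univariate normalization.
```

So the exponent −1/2 on det(2πΓ) agrees with the normalization (2π)^(−n/2) det(Γ)^(−1/2) discussed in the normalization thread.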

proposed rearranged first section

I propose the following rearrangement and partial rewrite of the intro section of this article. The main motivation is that the general case can be understood at an informal level without the need to be familiar with characteristic functions. Comments, please. --Zero 12:37, 15 Sep 2004 (UTC)

Since the definition you single out applies only to non-degenerate multivariate normals, you need to mention degeneracy explicitly in the following paragraph.
I would call X a "random vector", not a "random variable".
Make the paragraphs after "A formal definition" the first section of the body of the article, called "Formal definition".
IMHO, the most intuitively compelling and informally understandable definition is the one that says every linear combination of the coordinates is normally distributed.
Miguel 19:59, 2004 Sep 15 (UTC)
IMHO this copy is better than the copy on the current instance of the page. Is there a reason it's not adopted? I think it strikes a good balance between giving the familiar pdf characterization of the (non-degenerate) multivariate normal, as well as neatly setting up the more general random vector characterization. In contrast, the current wiki page has a confusing bullet that attempts to give a density for the degenerate case (using pseudo-inverses and such), but qualifies it with only when the support is full, i.e. that this formula is never helpful at all. Because this is an ongoing point of confusion maybe we need to clear it up with another section or perhaps a link to another page that explains that not all random vectors have densities; perhaps a version of the Radon–Nikodym_theorem page with some exposition? Unfortunately that material is too advanced. Here's my best shot at something brief/helpful; help me put this where it belongs and/or give commentary:
If the covariance matrix $\Sigma$ is not full rank, then the multivariate normal distribution is degenerate and does not have a density. More precisely, it does not have a density with respect to $k$-dimensional Lebesgue measure (which is the usual measure assumed in calculus-level probability courses). Only random vectors whose distributions are absolutely continuous with respect to a measure are said to have densities (with respect to that measure). If we restrict attention to the $\operatorname{rank}(\Sigma)$-dimensional affine subspace of $\mathbb{R}^k$ where the Gaussian distribution is supported, i.e. $\{\mu + \Sigma^{1/2} v : v \in \mathbb{R}^k\}$, then, with respect to Lebesgue measure disintegrated to this subspace, the distribution has the density:
$f(x) = \left(\det{}^{*}(2\pi\Sigma)\right)^{-1/2} \exp\left(-\tfrac{1}{2}(x-\mu)^T \Sigma^{+} (x-\mu)\right)$
where $\Sigma^{+}$ is the Generalized inverse and det* is the Pseudo-determinant. To talk about densities but avoid dealing with these complications it can be simpler to restrict attention to a subset of $\operatorname{rank}(\Sigma)$ of the coordinates of $x$ such that the covariance matrix for this subset is positive definite; then the other coordinates may be thought of as an affine function of the selected coordinates.
Perhaps there is a page about probability calculus on manifolds somewhere to link as well?
Marc.coram (talk) 08:47, 28 October 2011 (UTC)
I have copied in an edited version of both bits. Can we continue any more discussion at the bottom of this page in a new section? Mixing it in with old stuff is confusing. Melcombe (talk) 09:02, 28 October 2011 (UTC)
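
As a rough illustration of the degenerate-case density discussed in this thread, the sketch below evaluates a density on the support using the pseudo-inverse and the pseudo-determinant (the product of the nonzero eigenvalues). The function name `degenerate_mvn_density` and the tolerance `tol` are assumptions for this example; the function does not check that x actually lies in the supporting affine subspace.

```python
import numpy as np

def degenerate_mvn_density(x, mu, sigma, tol=1e-10):
    """Density on the support, w.r.t. Lebesgue measure on that affine subspace.

    Hypothetical helper: the caller must pass an x that lies in the support.
    """
    x, mu, sigma = (np.asarray(v, dtype=float) for v in (x, mu, sigma))
    eigvals = np.linalg.eigvalsh(sigma)
    positive = eigvals[eigvals > tol]  # drop (numerically) zero eigenvalues
    # log of the pseudo-determinant of 2*pi*Sigma
    log_pdet = np.sum(np.log(2.0 * np.pi * positive))
    diff = x - mu
    quad = diff @ np.linalg.pinv(sigma) @ diff
    return np.exp(-0.5 * (log_pdet + quad))

# Degenerate example: the vector (Z, Z) with Z standard normal.
sigma_zz = np.array([[1.0, 1.0], [1.0, 1.0]])
value_on_line = degenerate_mvn_density([1.0, 1.0], [0.0, 0.0], sigma_zz)

# Full-rank sanity check against the ordinary density formula.
sigma_full = np.array([[2.0, 0.6], [0.6, 1.0]])
x_full = np.array([0.3, -0.2])
value_full = degenerate_mvn_density(x_full, [0.0, 0.0], sigma_full)
ordinary = np.exp(-0.5 * x_full @ np.linalg.solve(sigma_full, x_full)) / (
    2.0 * np.pi * np.sqrt(np.linalg.det(sigma_full))
)
print(value_on_line, value_full, ordinary)
```

For a full-rank covariance the pseudo-inverse and pseudo-determinant reduce to the ordinary inverse and determinant, so the two values agree.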

In probability theory and statistics, a multivariate normal distribution, also sometimes called a multivariate Gaussian distribution in honor of Carl Friedrich Gauss, is a generalization of the normal distribution to several dimensions.

In the case of a random variable X with a non-degenerate multivariate normal distribution, there is a vector $\mu \in \mathbb{R}^n$ and a symmetric, positive definite $n \times n$ matrix $\Gamma$ such that X has density

$f_X(x) = (2\pi)^{-n/2} \, |\Gamma|^{-1/2} \exp\left(-\tfrac{1}{2}(x-\mu)^T \Gamma^{-1} (x-\mu)\right)$

where $|\Gamma|$ is the determinant of $\Gamma$. Note how the equation above reduces to that of the univariate normal distribution if $\Gamma$ is a $1 \times 1$ matrix (i.e. a real number).

More generally, a multivariate normal distribution in $n$ dimensions consists of a non-degenerate multivariate normal distribution sitting inside some $k$-dimensional affine subspace (a linear subspace possibly shifted from the origin) for some $k \le n$. For example, if Z is a 1-dimensional normal distribution, then the vector (Z,Z) whose components are equal has a multivariate normal distribution which sits inside the subspace $\{(x, y) : x = y\}$.

A formal definition is that an n-dimensional random variable X = (X1, ..., Xn) has a multivariate normal distribution if it satisfies the following equivalent conditions:

  • there is a random vector Z=(Z1, ..., Zm), whose components are independent standard normal random variables, a vector μ = (μ1, ..., μn) and an n×m matrix A such that X = A Z + μ.
  • there is a vector μ and a symmetric, positive semi-definite matrix Γ such that the characteristic function of X is
$\varphi_X(u) = \exp\left(i \mu^T u - \tfrac{1}{2} u^T \Gamma u\right)$.

The vector μ in these conditions is the expected value of X and the matrix $\Gamma = A A^T$ is the covariance matrix of the components Xi.

Note that the Xi are in general not independent; they can be seen as the result of applying the linear transformation A to a collection of independent Gaussian variables Z.
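
The construction X = A Z + μ in the first condition can be sketched directly in NumPy. The numbers below are made up for illustration; A is taken to be 3×2 on purpose, so that the resulting covariance A Aᵀ is singular and the distribution is degenerate yet still multivariate normal under this definition.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0, 0.5])  # n = 3
A = np.array([[1.0, 0.0],
              [0.5, 2.0],
              [1.0, 1.0]])        # n x m with m = 2

num_samples = 200_000
# Each row of Z is one draw of m independent standard normals.
Z = rng.standard_normal((num_samples, A.shape[1]))
X = Z @ A.T + mu                  # each row is A z + mu

gamma = A @ A.T                   # covariance implied by the construction
sample_mean = X.mean(axis=0)
sample_cov = np.cov(X, rowvar=False)

print(np.linalg.matrix_rank(gamma))  # rank 2 < 3: no density on R^3
print(np.round(sample_cov - gamma, 2))
```

The sample mean and covariance approach μ and A Aᵀ as the number of draws grows, even though no density exists in three dimensions.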

What is the N at the end of the Kullback-Leibler distance? It would make sense to add what the value N signifies in the formulae. And where does the formula come from? Any references would help too.

The N is the dimension of both multivariate normal distributions, as defined above. But I will make it more clear. Unfortunately, I have not yet found any reference to confirm whether the formula is correct.

A counterexample

Would it not make the article clearer to merge "A counterexample" and "correlation and independence" into one section? — ciphergoth 07:09, 2005 Apr 29 (UTC)

I second this suggestion. The subsection A Counterexample does not have any context. A counterexample to what? I suggest moving "A counterexample" into Correlations and independence. (talk) 18:07, 21 August 2008 (UTC)

That section states explicitly what it's a counterexample to. Read the first sentence in the section. It's there. Michael Hardy (talk) 15:03, 4 May 2009 (UTC)

Are we sure X and Y are uncorrelated if Y=X/Y=-X in this way? —Preceding unsigned comment added by (talk) 04:47, 25 February 2010 (UTC)
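
The question can be checked by simulation. The sketch below assumes the counterexample is the threshold construction Y = X when |X| > c and Y = −X otherwise, with c chosen so that the covariance vanishes; that reading of "Y=X/Y=-X" is an assumption, as is the helper name `truncated_second_moment`. Since Cov(X, Y) = 1 − 2·E[X²; |X| < c], the required c solves E[X²; |X| < c] = 1/2.

```python
import math
import numpy as np

def truncated_second_moment(c):
    """E[X^2 ; |X| < c] for a standard normal X (closed form via integration by parts)."""
    phi_c = math.exp(-0.5 * c * c) / math.sqrt(2.0 * math.pi)
    return math.erf(c / math.sqrt(2.0)) - 2.0 * c * phi_c

# Bisection for the threshold c with E[X^2 ; |X| < c] = 1/2.
lo, hi = 0.0, 5.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if truncated_second_moment(mid) < 0.5:
        lo = mid
    else:
        hi = mid
c = 0.5 * (lo + hi)

rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)
y = np.where(np.abs(x) > c, x, -x)   # Y is again standard normal by symmetry

corr = np.corrcoef(x, y)[0, 1]
zero_sum_fraction = np.mean(x + y == 0.0)
print(c, corr, zero_sum_fraction)
```

Under this construction the sample correlation is close to zero, yet X + Y is exactly zero whenever |X| < c, so X + Y has an atom at zero and the pair cannot be bivariate normal.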


Confused by the Transposed on this page. If lambda is a row vector (1xn) then lambda^T Sigma^{-1} does not seem to be defined as it would be (nx1)(nxn). 09:53, 14 March 2007 (UTC)Ulrich

The usual convention is that vectors are columns unless otherwise specified. --Zerotalk 10:49, 14 March 2007 (UTC)
Shouldn't then lambda=(lambda_1, ...)^T, X=...^T etc... (Is this Nitpicking??) 12:52, 14 March 2007 (UTC)Ulrich
You seem to be right. Anyone disagree? --Zerotalk 13:21, 14 March 2007 (UTC)
I've fixed the transposing errors in the "General case" section. Please check if there are others to be corrected! Oli Filth 19:38, 16 March 2007 (UTC)

Online calculator malfunction

The Online real-time Bivariate Normal Distribution Calculator, by Razvan Pascalau, Univ. of Alabama, as referenced at the end of this article doesn't seem to work fully.

For instance, enter x=-4, y=2, =0 and the probability comes out as p=-0.01. 15:10, 13 July 2007 (UTC)

The curse of ...

The dimension is denoted in some places and in others. It should be consistent throughout the article. Does anyone have a preference on which one to use? Steve8675309 02:38, 24 July 2007 (UTC)

recent revert

Hello. Sorry to revert a good-faith edit, but the original was very clearly non-gaussian and the newer version was indeed bivariate Gaussian, albeit degenerate. Best wishes, Robinh 07:41, 17 September 2007 (UTC)

Question 2 on Bivariate Normal

What happens to joint distribution if correlation coefficient ρxy is 1 ? Thanks to everybody helping Abayirli1 03:44, 31 October 2007 (UTC)

Then you have a degenerate case: the covariance matrix is singular and you don't have a regular density, but you need a Dirac delta if you want to write a density functional.

Specifically, if the correlation coefficient is 1 then you can show there is an affine linear combination ax + by = c. This is related to the Cauchy-Schwarz inequality, where if the inequality is saturated the two vectors are proportional to each other. But a linear combination of random variables is a random variable; it just so happens that ax + by is a constant random variable, with zero variance, and doesn't have a pdf (unless you allow a Dirac delta). The "orthogonal" direction a y - b x carries all the variance of this singular bivariate normal.

I hope that makes sense. Miguel 10:29, 8 November 2007 (UTC)
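
Here is a small numerical sketch of the ρ = 1 case described above, using assumed standard deviations and means. The eigenvector of the covariance matrix with zero eigenvalue gives the coefficients (a, b) of the constant combination, and the other eigenvector carries all the variance.

```python
import numpy as np

sigma_x, sigma_y, rho = 2.0, 3.0, 1.0      # illustrative values
mu = np.array([1.0, -1.0])
cov = np.array([[sigma_x**2, rho * sigma_x * sigma_y],
                [rho * sigma_x * sigma_y, sigma_y**2]])

eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
null_dir = eigvecs[:, 0]                   # (a, b): direction with zero variance
spread_dir = eigvecs[:, 1]                 # direction carrying all the variance

rng = np.random.default_rng(0)
# Sampling needs no density: with rho = 1, (X, Y) = mu + (sigma_x, sigma_y) * Z.
z = rng.standard_normal(10_000)
samples = mu + np.outer(z, [sigma_x, sigma_y])

constant_combo = samples @ null_dir        # a*x + b*y for every sample
spread_combo = samples @ spread_dir

print(eigvals)
print(constant_combo.std(), spread_combo.var())
```

The first combination has (numerically) zero spread and equals the same constant for every draw, while the variance along the other direction is the nonzero eigenvalue, here σx² + σy² = 13.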

Script N

I think that the notation for the normal distribution ($\mathcal{N}$ versus $N$) should be standardized within this article and that for the normal distribution. It should probably be the latter, $N$, because it seems more common. (talk) 23:43, 21 August 2008 (UTC)

or just Normal(blah, blah) could be used. Given the wide range of people who read this, it is usually best to err on the conservative side (even if it means writing it out more). I would say the same for other distributions such as Gamma(blah) and Beta(blah), etc. -- Preceding unsigned comment added by (talk) 14:21, 18 December 2008 (UTC)

Question 3 on Bivariate Normal

If two variables X and Y are normal, and they are correlated, does that imply they are bivariate normal? —Preceding unsigned comment added by Humble2000 (talkcontribs) 05:36, 5 December 2008 (UTC)

No. Only if infinitely countable linear combinations of X and Y are normal then [X,Y] is bivariate normal. By definition. Unless they are independent, of course. Omrit (talk) 09:50, 19 April 2009 (UTC)

Omrit, I think your answer is confused. "Infinitely countable linear combinations" sounds like "countably infinite linear combinations", which means things like
$a_1 X_1 + a_2 X_2 + a_3 X_3 + \cdots$
Maybe you meant only if every linear combination of the two is normal. That idea certainly does not involve any mention of countability. Michael Hardy (talk) 15:01, 4 May 2009 (UTC)

Question 4 on Bivariate normal

If the sum of two normally distributed random variables is still normal, does that imply the two random variables are bivariate normal? —Preceding unsigned comment added by Humble2000 (talkcontribs) 06:26, 5 December 2008 (UTC)

Of course not. Only if infinitely countable linear combinations of the two variables is normal then they are bivariate normal. By definition. Omrit (talk) 09:42, 19 April 2009 (UTC)

It is not true that the sum of two normally distributed random variables is in every case normal.
What is true is that if the pair is bivariate normal, then the sum is normal. The converse—that if the two are separately normal, then the pair is bivariate normal—is false. A counterexample can be found at normally distributed and uncorrelated does not imply independent.
Omrit, I think your answer is confused. "Infinitely countable linear combinations" sounds like "countably infinite linear combinations", which means things like
$a_1 X_1 + a_2 X_2 + a_3 X_3 + \cdots$
Maybe you meant only if every linear combination of the two is normal. That idea certainly does not involve any mention of countability. Michael Hardy (talk) 14:59, 4 May 2009 (UTC)

Minor nitpick on "Drawing values from the distribution"

Quoting the article, "Compute the Cholesky decomposition of Σ, that is, find the unique lower triangular matrix $A$ such that $A A^T = \Sigma$. Any other matrix for which this equation holds is also feasible." Since the Cholesky decomposition is indeed unique, the second sentence seems redundant. Ryg (talk) 11:22, 11 March 2009 (UTC)

Now rephrased. Melcombe (talk) 10:48, 12 March 2009 (UTC)
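
For context on the quoted sentence, the sketch below draws samples using two different factors of the same covariance: the lower-triangular Cholesky factor and a symmetric square root built from the eigendecomposition. The covariance and mean are made-up values. The Cholesky factor is unique among lower-triangular matrices with positive diagonal, but other non-triangular matrices A with A Aᵀ = Σ exist and serve equally well for sampling.

```python
import numpy as np

sigma = np.array([[4.0, 1.2, 0.4],
                  [1.2, 2.0, 0.3],
                  [0.4, 0.3, 1.0]])   # illustrative positive definite covariance
mu = np.array([0.0, 1.0, -1.0])

chol = np.linalg.cholesky(sigma)      # lower triangular, chol @ chol.T == sigma

# A different factor: the symmetric square root from the eigendecomposition.
eigvals, eigvecs = np.linalg.eigh(sigma)
sym_root = eigvecs @ np.diag(np.sqrt(eigvals)) @ eigvecs.T

rng = np.random.default_rng(0)
z = rng.standard_normal((100_000, 3))  # rows of independent standard normals
draws_chol = mu + z @ chol.T
draws_root = mu + z @ sym_root.T

cov_chol = np.cov(draws_chol, rowvar=False)
cov_root = np.cov(draws_root, rowvar=False)
print(np.round(cov_chol, 2))
print(np.round(cov_root, 2))
```

Both sets of draws have sample covariance close to Σ, although the individual draws differ because the two factors are different matrices.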

Definition of "Jointly Normal"?

Does the phrase "Jointly Normal" mean the same thing as "Multivariate Normal". If so should the page say that somewhere? —Preceding unsigned comment added by (talk) 06:53, 10 June 2009 (UTC)

It does, but it's used slightly differently. Usually you say that a vector x = (x, y) has a multivariate normal distribution, but that the scalars x and y are jointly normally distributed. In the first case you have a single (vector) random variable which follows a MVN distribution, in the second case you look at the same thing and instead see it as two (scalar) random variables which happen to have something in common when considered together, namely being jointly normally distributed. Same thing, slightly different point of view. You can also say that the joint distribution of x and y is a multivariate normal distribution; maybe "jointly normal" is short for "the joint distribution being (multivariate) normal" or something like that (I'm not a linguist). -- Coffee2theorems (talk) 14:21, 6 July 2009 (UTC)

Distribution of norms of residuals/errors?

Suppose I take n samples from a multivariate normal distribution and look at the distribution of the norm of their error. Similarly, suppose I have a zero-mean multivariate normal distribution and look at the distribution of norms of samples. Obviously for a uni-variate normal distribution, the norms are distributed like the right half of a normal distribution (or is it a folded normal distribution?). But for bi-variate distributions, a sample will almost certainly not have both components near zero, so the distribution of the norm will look more like a Poisson distribution, with no counts at zero norm, then a peak, then a tail. As you add more dimensions, this continues and you seem to get something that looks increasingly normal.

I ask because I am looking at norms of multivariate residuals and trying to figure out how likely each one is. With uni-variate residuals, most are nearly zero, and so normalizing by their RMS seems right. Maybe I just answered my own question... Thoughts? —Ben FrantzDale (talk) 16:19, 29 July 2009 (UTC)

I think you're looking for the chi distribution. It's the distribution of the norm of a standard MVN variate, or distribution of the square root of a chi-square random variate if you're wondering about the name.
The chi-square distribution tends to normal due to the central limit theorem, and while it does so its bulk mass moves to the right (mean being n). Since the square root function tends to a flat line on the right (derivative tends to 0), it transforms the shape of the bulk mass less and less as n increases, which would explain why the chi distribution is also close to normal for reasonable values of n.
FWIW, such questions are more suitable for the reference desk, people are much quicker to respond there. -- Coffee2theorems (talk) 08:15, 17 August 2009 (UTC)
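
A quick simulation can illustrate the chi-distribution answer. The sketch below compares the simulated mean norm of standard multivariate normal samples with the known mean of the chi distribution with k degrees of freedom, √2 Γ((k+1)/2) / Γ(k/2). The dimensions tried and the helper name `chi_mean` are arbitrary choices for this example.

```python
import math
import numpy as np

def chi_mean(k):
    """Mean of the chi distribution with k degrees of freedom."""
    return math.sqrt(2.0) * math.gamma((k + 1) / 2.0) / math.gamma(k / 2.0)

rng = np.random.default_rng(0)
results = {}
for k in (1, 2, 10):
    # Norms of 100,000 draws from a standard normal in k dimensions.
    norms = np.linalg.norm(rng.standard_normal((100_000, k)), axis=1)
    results[k] = (norms.mean(), chi_mean(k))

for k, (simulated, exact) in results.items():
    print(k, round(simulated, 3), round(exact, 3))
```

For k = 1 the chi distribution is the half-normal (folded normal with zero mean), which matches the univariate observation in the question.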

Affine transformation doubt

I think that the vector c in this section should have dimensions of M x N, instead of N x 1, since the product BX should have dimension M x N. Anyone agree? -- Preceding unsigned comment added by (talk) 20:58, 1 February 2010 (UTC)
(1) Y is a vector so if Y = c + ... then c must be a vector. (2) B is m x n but X is a vector, so BX is a vector. 018 (talk) 22:07, 1 February 2010 (UTC)

But I think that Y is NOT a vector; instead it's a matrix, and the same goes for X. If you look at the dimensions implied in the variance of X: B x \sigma x B', you can tell that X is a matrix with dimension NxN. Thus, c and Y should be matrices as well. -- Preceding unsigned comment added by (talk) 01:18, 2 February 2010 (UTC)

According to the article, $X \sim N(\mu, \Sigma)$. This implies (1) the variance-covariance of X is $\Sigma$; (2) if X is a matrix then I guess $\mu$ is a matrix, but what is $\Sigma$? 018 (talk) 01:35, 2 February 2010 (UTC)
This is an article about the vector normal distribution. So X is an n×1 vector, μ = E[X] is also an n×1 vector, and Σ = E[(X−μ)(X−μ)′] is an n×n variance matrix of vector X. Similarly, both c and Y are m×1 vectors, B is an m×n transformation matrix, and the variance of r.v. Y is equal to BΣB′, which is again an m×m matrix. If you are interested in the case when X is a matrix, see the “matrix normal distribution” page.  … stpasha »  05:29, 2 February 2010 (UTC)
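
The dimension bookkeeping in this thread can be checked with explicit array shapes. The sketch below uses made-up numbers with n = 3 and m = 2 and keeps all vectors as column arrays, so that the shapes match the n×1 and m×1 description above.

```python
import numpy as np

n, m = 3, 2
mu = np.array([[1.0], [0.0], [-1.0]])        # n x 1 mean of X
sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])           # n x n variance matrix of X
B = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])              # m x n transformation
c = np.array([[0.5], [-0.5]])                 # m x 1 shift

# Y = c + B X has mean c + B mu and variance B Sigma B'.
mean_y = c + B @ mu                           # m x 1
cov_y = B @ sigma @ B.T                       # m x m

print(mean_y.shape, cov_y.shape)
print(mean_y.ravel(), cov_y)
```

The variance B Σ B′ is m×m even though X and Y are both vectors, which is where the matrix dimensions in the formula come from.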

Transpose vector notation

Is there a reason why, within the Definition section and in the top table, a transposed vector is denoted with u' instead of $u^T$? The remainder of the article seems to always use the proper notation $u^T$. --Marra (talk) 05:44, 1 May 2010 (UTC)

Fisher information

The formula stated in the “Fisher information matrix” section is wrong. That is, it might be technically correct for the parametric submodel N(μ(θ), Σ(θ)), but this submodel is only distantly related to the family of distributions that are the subject of this article. The Fisher information for the multivariate normal N(μ, Σ) distribution is the expected Hessian of the log-density with respect to the unknown parameter which is (μ, Σ).  // stpasha »  22:27, 3 May 2010 (UTC)

This is not a submodel, it is a generalization: You can define θ to be a vector of any length you want, and if you want it can simply be a trivial parametrization where each of the elements of μ and Σ are free parameters. Anyway, I agree that this doesn't belong here, and anyway already appears in Fisher information matrix#Multivariate normal distribution, so I'll remove it (and place a link). --Zvika (talk) 13:39, 6 July 2010 (UTC)

Can the Vectors be in bold

The capital sigma is ambiguous (summation, not covariance matrix, springs to mind). I checked a textbook, but it spells it out and calls it V. Can it be changed to lowercase sigma in bold? Can µ, being a vector, be bold too, or does this rule not apply? --Squidonius (talk) 20:24, 3 June 2010 (UTC)

Higher order moments

There are no references in here, could that be addressed please. Also the examples are not exactly straightforward to follow, going from 6th order to 4th order, but the major issue is the lack of reference, i.e., to a book that shows how to do this. Thanks. —Preceding unsigned comment added by (talk) 15:15, 21 June 2010 (UTC)

Cleanup / New headings?

This article should be better organized. Any suggestions? Some material should perhaps be deleted. Ulner (talk) 22:16, 7 July 2010 (UTC)

N vs. k

both $N$ and $k$ are used for dimension. can we pick one and stick with it? —Preceding unsigned comment added by (talk) 22:07, 27 September 2010 (UTC)

Sure, go for it. I'd pick k, because N might be the normal distribution. However, in one section there is a matrix, so it needs two dimensions and NxM makes sense there. 018 (talk) 23:03, 27 September 2010 (UTC)

Why nonnegative definite instead of positive semidefinite?

Aren't the two terms equivalent, but the latter (positive semidefinite) far more common? — Preceding unsigned comment added by Lleeoo (talkcontribs) 00:54, 19 January 2011 (UTC)

They are the same, and I would vote for changing it to positive semidefinite, except that doesn't it need to be positive definite, not just positive semidefinite? If xᵀΣx = 0 for some nonzero x, then some eigenvalue of Σ must be 0, so the matrix isn't invertible. If the matrix isn't invertible, then the pdf formula is undefined. So....shouldn't we change it to positive definite? BlueScreenD (talk) 04:23, 7 February 2011 (UTC)

It's positive semidefinite. Consider the covariance matrix of three points in R2: ((-1,-1), (0,0), (1,1)). —Ben FrantzDale (talk) 17:14, 7 February 2011 (UTC)
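The three-point example can be checked numerically. The following is a minimal NumPy sketch, not part of the original comment; the variable names are illustrative. It computes the sample covariance of the three collinear points and its eigenvalues, one of which is zero, so the matrix is positive semidefinite but not positive definite.

```python
import numpy as np

# Three collinear points in R^2, one point per row.
points = np.array([[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]])

# Sample covariance; rowvar=False means the columns are the variables.
cov = np.cov(points, rowvar=False)

# eigvalsh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues = np.linalg.eigvalsh(cov)

print(cov)          # every entry equals 1
print(eigenvalues)  # one eigenvalue is 0 (up to rounding), the other is 2
```

Because one eigenvalue is zero, the matrix has no inverse, which is why the usual density formula cannot be applied to it directly.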

Bloated definition of multivariate normal

I feel that the given definition of the multivariate normal distribution is unnecessarily bloated. The current definition section has four bullet points plus a paragraph with an afterthought about the covariance matrix. A definition should be a minimal set of axioms that fully describe something. In this case, I think the best minimal set of axioms is the formula for the probability density function. The other bullet points should be moved to the "Properties" section. Thoughts? BlueScreenD (talk) 04:07, 7 February 2011 (UTC)

This approach would only apply for the full-rank / non-degenerate multivariate normal. So another way to phrase your suggestion is that we should split any discussion of degenerate normal distributions to another page. I'm certainly open to that myself, since that's the definition I carry around in my head, but I'm not sure. The balance of my opinion is to change the copy to Zero's copy, so that the pages can stay together and this page can cover the general case. If I had to pick a single definition I'd choose the characteristic function based definition, but I'd motivate that definition with the construction. Marc.coram (talk) 08:43, 28 October 2011 (UTC)

The density

Previously, the article said that "pdf exists only for positive-definite Σ", which is not right. Now it is correct. By the way, the extended determinant is the product of the nonzero eigenvalues, i.e. the pseudo-determinant. Jackzhp (talk) 14:56, 20 February 2011 (UTC)

This is definitely not as easy as you wrote it. Say, if you consider the bivariate pdf with ρ = 0 and σ_y → 0, then the pdf will approach 0 for y ≠ μ_y, and ∞ for y = μ_y. In this sense we say that the pdf does not exist. Now if you want to make it exist, you need to consider the pdf with respect to the Lebesgue measure on a subspace spanned by the columns of Σ, and you probably need to replace Σ⁻¹ with the pseudo-inverse. In any case, it is probably much easier to simply say that the pdf does not exist for singular Σ's, and restrict everything else to its own subsection.  // stpasha »  20:50, 20 February 2011 (UTC)
Yes, the pseudo-inverse is needed, rather than the usual inverse; I forgot to change it. Now it is right. For a real random vector, the pdf is always w.r.t. the Lebesgue measure. Jackzhp (talk) 02:45, 22 February 2011 (UTC)
I am glad we agree on that. We can also agree that this pdf w.r.t. (standard) Lebesgue measure does not exist if Σ is singular. You have to redefine the notion of Lebesgue measure and restrict it to a linear subspace of Rᵏ spanned by the columns of Σ in order to make your definition work. But that would make a very non-standard definition, which should be explained in greater detail, and probably should not be in the infobox.  // stpasha »  04:20, 22 February 2011 (UTC)
Hi, I just realized that you are much more professional than me, sorry that I edited the article frivolously. I don't want to spend any more time on this. So please explain it a little bit more here, then change the article accordingly. Jackzhp (talk) 23:22, 22 February 2011 (UTC)
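The limiting behaviour described in this thread (ρ = 0, σ_y → 0) can be illustrated numerically. This is only a sketch under stated assumptions: the function name bivariate_pdf and the chosen evaluation points are hypothetical, and the density is the ordinary one with respect to Lebesgue measure on the plane.

```python
import numpy as np

def bivariate_pdf(x, y, mu_x=0.0, mu_y=0.0, sigma_x=1.0, sigma_y=1.0):
    """Density of a bivariate normal with rho = 0 (independent components)."""
    zx = (x - mu_x) / sigma_x
    zy = (y - mu_y) / sigma_y
    return np.exp(-0.5 * (zx**2 + zy**2)) / (2.0 * np.pi * sigma_x * sigma_y)

# Shrink sigma_y towards 0 and evaluate the density on and off the line y = mu_y.
sigmas = [1.0, 0.1, 0.01]
at_mean = [bivariate_pdf(0.0, 0.0, sigma_y=s) for s in sigmas]   # grows like 1 / sigma_y
off_mean = [bivariate_pdf(0.0, 0.5, sigma_y=s) for s in sigmas]  # collapses towards 0

print(at_mean)
print(off_mean)
```

The values on the line y = μ_y grow without bound while the values off the line shrink to zero, so no ordinary function on the plane can serve as the limiting density.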

Sum of Multivariate Normals and Multivariate Normal as a Sum

I can think of two scenarios for adding two independent multivariate normal random variables X and Y. If they had the same dimension one could take the vector sum. If not, one could also consider forming a vector (X,Y) where the first group of components of it were from X and the second were from Y. It would be useful if the article had an explicit statement about whether these methods of summation produce another multivariate normal.

By an unwarranted extension of my intuition about the bivariate normal to n dimensions, it is tempting to think that any n-dimensional multivariate normal distribution could be represented as a sum of n mutually orthogonal unit vectors that are multiplied by random scalars, with the scalars being independent univariate normal random variables. It would be useful to know if this is true, especially if it is an equivalent statement to the definition of a multivariate normal. Doesn't the algorithm in the section "Drawing values from the distribution" take this approach?

Tashiro (talk) 17:11, 26 March 2011 (UTC)
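The orthogonal-vector representation asked about above can be sketched with a spectral decomposition: the eigenvectors of Σ are mutually orthogonal unit vectors, and scaling each by an independent normal scalar with variance equal to its eigenvalue gives a vector with covariance Σ. The example below is a minimal NumPy sketch; the particular μ and Σ are arbitrary illustrative values, not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mean vector and positive-definite covariance matrix.
mu = np.array([1.0, -2.0, 0.5])
sigma = np.array([[2.0, 0.6, 0.0],
                  [0.6, 1.0, 0.3],
                  [0.0, 0.3, 0.5]])

# Spectral decomposition: the columns of U are orthonormal eigenvectors of sigma.
eigenvalues, U = np.linalg.eigh(sigma)

# Independent univariate normal scalars, one per eigenvector, with variance lambda_i.
n_samples = 200_000
scalars = rng.standard_normal((n_samples, 3)) * np.sqrt(eigenvalues)

# Each sample is mu + sum_i scalars[i] * U[:, i].
samples = mu + scalars @ U.T

sample_cov = np.cov(samples, rowvar=False)
print(np.round(sample_cov, 2))  # close to sigma, up to sampling error
```

On the first question: for independent X and Y of the same dimension, the vector sum is multivariate normal with mean μ_X + μ_Y and covariance Σ_X + Σ_Y, and the stacked vector (X, Y) is multivariate normal with a block-diagonal covariance matrix.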

Picture on the right

What's plotted is a sample from the multivariate normal distribution, not its probability density function. ПБХ (talk) 14:42, 28 March 2011 (UTC)

Good point. Fixed. —Ben FrantzDale (talk) 15:38, 28 March 2011 (UTC)

Procedure for "Drawing values from the distribution"

Can someone add a reference or pointer to further reading on this topic? Thanks. --Msalganik (talk) 17:31, 21 July 2011 (UTC)


In the probability distribution template, the notation "span" is used. This seems not to be defined and seems to be not used elsewhere in the article. Melcombe (talk) 09:42, 25 July 2011 (UTC)

"span" is of course a standard word in linear algebra; we could probably just link that, right? However, I don't think this is quite right anyway. I think that it's more accurate to say that the support is the closure of , where is the pseudo inverse. For a suitable definition of as a "square root" of , we could describe this as . In the non-degenerate case, the support is all of anyway. Marc.coram (talk) 18:24, 28 October 2011 (UTC) On reflection, I see that I made it a bit too complicated. Writing we see that the set is correct. So the only correction needed is the shift by . I haven't found a reference for this yet. In the nonsingular case, the support is, of course. Should we put the general answer or just the answer in the non-singular case into the summary? Marc.coram (talk) 21:48, 31 October 2011 (UTC)

Building a consensus

Are we heading toward one page for the Multivariate normal distribution that covers both the non-degenerate and degenerate cases or not? The current state of the wikipage is incoherent. I see a number of "votes" on this talk page that argue to simplify discussion by only discussing the classical definition for the positive-definite case. Then other "votes" that insist on the importance of handling degenerate covariances. In section 5 of this talk page Talk:Multivariate_normal_distribution#proposed_rearranged_first_section Zero wrote a section that synthesizes the general case and the simple case. I wrote some comments there and suggested a paragraph (edited) that could be used to explain about the meaning of "densities" in the degenerate case. That paragraph was just recently (things move fast) incorporated by Melcombe into the live page. However, it doesn't seem right to present this material in the "definitions" section. The density should not be presented as the definition in the general case. I'm new to wikipedia procedure. How should we go forward here? Marc.coram (talk) 18:24, 28 October 2011 (UTC)

There's sufficient forward motion on the page, I'm going to operate under the assumption that the consensus is that we want a combined page and make some fixes. Marc.coram (talk) 19:56, 28 October 2011 (UTC)

I went ahead and moved the description of the density into the "properties" section and gave the general characterization as the definition. Marc.coram (talk) 20:18, 28 October 2011 (UTC)

I made a variety of other little corrections throughout the article. I do not have a reference for all of these changes. My primary reference is (unpublished?) lecture notes from Professor Carl Morris. In his definition, he treats the positive definite case ONLY, and derives the properties of the multivariate normal through the moment generating function. Some extrapolation was therefore required to give some of the corrections I made for the degenerate case, but I don't think I introduced anything controversial; I just made a stab at making some of the material more accurate than it was in the extant version of the page. Perhaps Billingsley treats the degenerate case? I do not have time to check now. Marc.coram (talk) 21:27, 28 October 2011 (UTC) Yes, Billingsley does treat it (cf. page 384). Naturally, he treats the characteristic function of the distribution as the primary object and derives the density in the non-singular case as a special case. He then explains that in the singular case "the distribution can have no density", by which he means with respect to Lebesgue measure, of course. This still leaves me without a reference for defining the density in the singular case with respect to the measure that is the disintegration of Lebesgue measure on the support of the distribution. Marc.coram (talk) 21:35, 28 October 2011 (UTC)

Things are moving along OK. But you might want to see MOS:MATH for accepted guidance on maths-type articles. There is a need to start from the basis that Wikipedia is not a formal maths/probability/statistics textbook and that an article can/should start with the simple stuff and then move on to the mathematically sophisticated stuff, even if this means some repetition. There is also some old guidance on "probability distribution" articles at Wikipedia:WikiProject Probability/Probability distribution article structure. You might look at the Johnson, Kotz et al. volume on Multivariate Continuous Distributions as a possible source of inspiration/references. The nondegenerate/degenerate cases can both be included in the article but they needn't both be treated from the outset in parallel. At some stage it may be worth thinking about a separate article on characterisation of the multivariate normal distribution, with more extensive details, suitable for inclusion in Category:Characterization of probability distributions. A minor point is that it helps others if some brief info about changes is included in the edit summary box ... see the history page of an article to see how this looks. Melcombe (talk) 09:50, 1 November 2011 (UTC)
Can't the degenerate case always be thought of as a non-degenerate case in a lower-dimensional space? If its degeneracy is axis-aligned, then it's trivial, otherwise you need to rotate the coordinate system, but still. Shouldn't that be an easy way to deal with it? —Ben FrantzDale (talk) 11:23, 1 November 2011 (UTC)
There is a problem with that. For example, one needs to speak of the probabilistic independence of the least-squares estimate of the slope of a regression line and the chi-square random variable that is the sum of squares of residuals. That independence is needed in proving that a certain random variable has a Student's t-distribution, which is used in forming confidence intervals for the slope. Obviously the fact that the vector of residuals has an (n − p)-dimensional normal distribution but lives in a larger n-dimensional space must be dealt with. Michael Hardy (talk) 06:04, 22 November 2011 (UTC)
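The residual example can be made concrete. Under the usual linear model with i.i.d. normal errors of variance σ², the residual vector has covariance σ²(I − H), where H is the hat matrix, and this matrix has rank n − p. The sketch below assumes a small illustrative design matrix (an intercept and a slope, n = 6, p = 2); the names are hypothetical.

```python
import numpy as np

n, p = 6, 2

# Illustrative design matrix: an intercept column and one regressor.
X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])

# Hat matrix H = X (X^T X)^{-1} X^T projects onto the column space of X.
H = X @ np.linalg.solve(X.T @ X, X.T)

# The residual covariance is sigma^2 * M with M = I - H.
M = np.eye(n) - H

# An explicit tolerance keeps rounding noise from inflating the numerical rank.
rank = np.linalg.matrix_rank(M, tol=1e-10)
print(rank)  # n - p = 4, so the n x n covariance of the residuals is singular
```

The residual vector is therefore multivariate normal only under a definition that permits a singular covariance matrix.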
@Ben, yes, I agree so far as understanding the distribution that's all there is to it. I have some language to that effect in the density discussion and maybe you can suggest an improvement. We agree, though, that this doesn't describe the density of the whole vector (with respect to an appropriate dominating measure), just the density of the subvector. That's fine; doing anything else is an advanced topic that probably doesn't belong here so we should simply say that the pdf for the whole vector doesn't exist. Marc.coram (talk) 03:48, 9 November 2011 (UTC)
@Melcombe, thank you for the links to the guidelines and policies. In your interpretation of the guidelines, then, we should describe the multivariate normal ONLY in the non-degenerate case, at least at the beginning of the article, correct? That's fine, and there's probably a version somewhere in the history along that line far enough back. So then the positive definite case should be given as THE formal definition? That's fine also, and quite common. I don't have Johnson/Kotz at hand but since their title is continuous distributions, I imagine that's the definition they give. Where then is the transition? Is it just a paragraph with a link saying "Some authors, especially in advanced treatments of probability, define the multivariate normal more generally and relax the requirement that Σ is positive definite. See characterization of the multivariate normal for more information."? Is this the sort of change that we need in order to meet the guidelines? If so, I would propose that the current article is a pretty decent version of the "characterization" article that you are describing, so it should be copied there and meanwhile the MVN page should have all the complicated bits redacted. Is that right? It's not really clear to me. The MOS:MATH guidelines also say to use "exact definitions," which would usually mean the more general formal definition, would it not? Marc.coram (talk) 03:48, 9 November 2011 (UTC)
The version I put in had a pointer to a section on "characterization" to cover the degenerate case, within the same article, but mainly to use existing text. However, there needs to be care here, as the results usually meant by "characterization of the multivariate normal" start from having some definition of multivariate normal, and then say that if a set of random variables have some distribution-preserving properties under the set of all linear combinations, then they must be multivariate normal. Melcombe (talk) 13:47, 10 November 2011 (UTC)
With a view to rounding out the discussion, please note that User:Michael_Hardy (and others) have argued quite vigorously on this discussion page for including the degenerate case in the definition of multivariate normal. "The whole discussion leading to that conclusion [about residuals from a linear regression] would be horribly complicated if we're forbidden to speak of normal distributions whose variance is a singular matrix." To be sure, this is an argument for why it makes sense for textbook writers about multivariate regression to use a more general definition, and not in itself an argument for why it's the right choice for Wikipedia, but I think that linear regression is one of the main reasons people care about multivariate normal distributions at all, so it seems compelling to me. Marc.coram (talk) 04:04, 9 November 2011 (UTC)
I don't see the singular MVN mentioned or used in this article, am I missing something? 018 (talk) 04:57, 9 November 2011 (UTC)
The current article's definition of the multivariate normal (I called it MVN for short) requires that the covariance be non-negative definite; it does not require positive definite. I.e. it allows there to be a nonzero vector v such that vᵀΣv = 0. In such a case v is an eigenvector of eigenvalue 0 and Σ is singular. The text refers to this as a degenerate case in the description of the density. Did I clear it up? Marc.coram (talk) 07:14, 9 November 2011 (UTC)
Sorry, I wasn't clear. I see it defined, but I don't understand the Michael Hardy quote. What whole discussion? Here is my issue: I could not send a college student here and expect them to understand the definition given. Why make it so general and not have it be a generalization applied later down the page? 018 (talk) 19:13, 9 November 2011 (UTC)
Oh! Maybe copying the quote out of context was confusing. That quote occurs at the end of the section "Question on Bivariate Normal" on this discussion page (see also Zero's writing afterwards). The whole discussion he's referring to is the discussion that appears in a regression book about residuals. If the linear model is true, so in particular the errors are MVN, then the residuals are not MVN by any definition that requires positive definiteness, but they are by the more general definition. PS I certainly want a college student to be able to understand things. Do you argue that the only definition they are likely to understand is one via the pdf in the positive definite case? So should we have two "definitions" for the same distribution in the same page? This seems pretty awkward to me, but good writing could make it clear. [Personally I think defining it as X = AZ + μ, where Z is a vector of independent N(0,1)'s and where A has full rank, is more foundational and is something that they should be able to understand too, at least as well as they understand an opaque pdf. It's true that they'll get confused that if you generalize to consider A that aren't full rank, the resulting random vector doesn't have a density. To me this argues for putting the current content on a different page with a special name, multivariate normal distribution (general case).] Marc.coram (talk) 10:36, 10 November 2011 (UTC)
Thought needs to be given to how this article fits in with other articles. One particular concern is the Wishart distribution, and presumably other related ones, which only deals with the non-singular case. Melcombe (talk) 13:47, 10 November 2011 (UTC)
Ok, so the consensus is to start this article with a discussion/definition for the non-singular case, yes? Objections? If we do that, the cleanest thing to do in my opinion is discuss the multivariate normal distribution (general case) on another page, so that it's always clear which assumptions are being used. For example, the entropy calculation doesn't apply for the singular case, and the summary facts get complicated (e.g. the support is Rᵏ, except when it isn't). But is this a good idea? I'm reluctant for there to be two pages to maintain and I worry about causing further confusion. Marc.coram (talk) 11:19, 11 November 2011 (UTC)
I agree with your conclusion. I think it would be worth canvassing over at the statistics project page for opinions before doing this. 018 (talk) 00:44, 12 November 2011 (UTC)
Sounds sensible. I mentioned this discussion on the page Wikipedia_talk:WikiProject_Statistics#Discussion_about_how_to_restructure_the_multivariate_normal_page. Marc.coram (talk) 06:25, 12 November 2011 (UTC)
I think we have a green light. The real question is what to name the various files and what to do about hat notes. 018 (talk) 17:54, 22 November 2011 (UTC)
Is there something wrong with the current approach? The definition section is very short and fairly clear. The X = AZ + μ approach is particularly lucid, although it could be explained much more accessibly than it currently is. If we really need to get only one simple definition into the definition section, then how about taking the AZ + μ one, and explaining it in many small words - maybe something along the lines of the last two steps of the "Drawing values from the distribution" section? Lots of people find simulations intuitive, and this is the closest definition to it, and the simulation has no problems with degeneracy.
After that, you could move the rest of the stuff from the definitions section into the properties section and extract the "density function" section into its own section before the properties section (subsubheadings are not easy to distinguish from headings, and there are plenty of "properties" not in the properties section anyway). The "Drawing values" section could be merged into the definition section (really, you draw values by directly using the definition..), and the density section could be the first section saying that densities are tricksy for the degenerate case (like it does already, and it's plenty short). That way the easiest stuff would be at the beginning without resorting to lies to children. -- Coffee2theorems (talk) 18:39, 17 December 2011 (UTC)

Adding a Likelihood_function section - please help expand

Hello all, I've started a new section:

It needs more expansion, but I think it will add to the article. I would love to know what you think. Tal Galili (talk) 20:49, 13 November 2011 (UTC)

Nice! However, I think this would be better under Multivariate_normal_distribution#Estimation of parameters (as is done in the article Normal Distribution). Regards, --Voorlandt (talk) 21:08, 6 December 2011 (UTC)

Drawing values; non-pd?

From the article:

Find any real matrix A such that AAᵀ = Σ. When Σ is positive-definite, the Cholesky decomposition is typically used. In the more general nonnegative-definite case, one can use the matrix A = UΛ^(1/2) obtained from a spectral decomposition Σ = UΛUᵀ of Σ.

But aren't covariance matrices always positive-definite? --Gerrit CUTEDH 12:05, 23 July 2012 (UTC)

No. A matrix is a covariance matrix if and only if it is symmetric and positive-semidefinite. See the covariance matrix article. Sample covariance matrices, for example, are singular if there are fewer samples than dimensions. Same goes for uniform discrete distributions on sets of points fewer than the number of dimensions — which is basically the same thing, except that "the sample is the whole population". -- Coffee2theorems (talk) 09:54, 28 July 2012 (UTC)
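The quoted step can be sketched in NumPy for both cases. This is a minimal illustration, not the article's own code: the helper names factor_covariance and draw are hypothetical, and the two covariance matrices are arbitrary examples (one positive definite, one singular).

```python
import numpy as np

def factor_covariance(sigma):
    """Return A with A @ A.T equal to sigma (up to rounding), using sigma = U diag(lam) U.T."""
    lam, U = np.linalg.eigh(sigma)
    lam = np.clip(lam, 0.0, None)  # remove tiny negative eigenvalues caused by rounding
    return U * np.sqrt(lam)        # same as U @ diag(sqrt(lam))

def draw(mu, sigma, n_samples, rng):
    """Draw rows x = mu + A z, with z a vector of independent standard normals."""
    A = factor_covariance(sigma)
    z = rng.standard_normal((n_samples, len(mu)))
    return mu + z @ A.T

# Positive-definite case: the Cholesky factor is the usual choice of A.
sigma_pd = np.array([[2.0, 0.5], [0.5, 1.0]])
L = np.linalg.cholesky(sigma_pd)

# Singular (positive-semidefinite) case: the spectral factor still works.
sigma_singular = np.array([[1.0, 1.0], [1.0, 1.0]])
A = factor_covariance(sigma_singular)

rng = np.random.default_rng(0)
samples = draw(np.zeros(2), sigma_singular, 1000, rng)
print(samples[:3])  # the two coordinates of each draw coincide (up to rounding)
```

With the singular matrix every draw lies on the line x₁ = x₂, which is the support of that degenerate distribution.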

Singular variance case

I don't think it is important to include the singular variance case as part of the definition here. The fact that it is useful in certain circumstances does not provide adequate justification. It will be, I believe, unimportant to nearly all people reading this wiki article. As an analogy, the fact that adding the point at infinity to the complex plane simplifies much of the discussion of complex analysis does not justify defining complex numbers themselves as including this point. Or consider the discussion on Dirac_delta; nowhere is it suggested that Dirac_delta is a type of normal distribution. Further, the more general definition is difficult to understand due to the lack of adequate back-up in other wikipedia pages. Consider the Probability_distribution, where it says "Additionally, some authors define a distribution generally as the probability measure induced by a random variable X on its range - the probability of a set B is P(X^{-1}(B))." This makes it seem that X = AZ + \mu is a random variable on the range of X, which is not all of R^N for the singular variance case. Is X not a random variable on R^N? I understand that you can describe a measure so that the singular case makes sense, but this can be done for many distributions, e.g. the Cauchy distribution. Should the (for example univariate) Cauchy distribution be defined as a random variable on R^N having characteristic function \phi(u; x_0, \gamma, v) = exp(i x_0 <v,u> - \gamma |<v,u>|) for some unit vector v? No login, Fri Jan 12 11:50:13 EST 2007.
17:52, 29 January 2013 (UTC)

Covariance definition

Definition and Geometric interpretation refer to the covariance matrix as Σ = UΛUᵀ and Σ = AA′ respectively. However, these don't include a normalisation term, i.e. 1/(n-1) where n is the number of samples. The normalisation is needed to ensure that the covariance matrix has expected values (see Covariance estimation). The effect of missing this off is that as the number of samples increases so do the eigenvalues. Johndavidbustard (talk) 10:54, 1 February 2013 (UTC)
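The distinction at issue here can be sketched numerically: Σ = AA′ in the article is the population covariance, which involves no sample size, whereas the 1/(n − 1) factor belongs to the sample estimate. The NumPy sketch below uses an arbitrary illustrative Σ and a hypothetical helper name; it shows that the unnormalised scatter matrix grows with n while the normalised estimate stays near Σ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative population covariance and a factor A with A @ A.T == sigma.
sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
A = np.linalg.cholesky(sigma)

def scatter_and_covariance(n):
    """Return the unnormalised scatter matrix and the 1/(n-1)-normalised estimate."""
    x = rng.standard_normal((n, 2)) @ A.T
    centered = x - x.mean(axis=0)
    scatter = centered.T @ centered
    return scatter, scatter / (n - 1)

scatter_small, cov_small = scatter_and_covariance(1_000)
scatter_large, cov_large = scatter_and_covariance(100_000)

print(np.trace(scatter_large) / np.trace(scatter_small))  # roughly 100: grows with n
print(np.round(cov_large, 2))                             # stays close to sigma
```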

please clarify this

There exists a random ℓ-vector Z, whose components are independent normal random variables, a k-vector μ, and a k×ℓ matrix A, such that X = AZ + μ. Here ℓ is the rank of the covariance matrix Σ = AA′. If the covariance matrix is of full rank, then the linear operator A is simply ______.

Please add the sentence in bold with correct information, thank you.
17:52, 29 January 2013 (UTC)