Talk:Principal component analysis/Archive 1

Separate articles on arg max and arg min notations

We probably need a small article on the arg max and arg min notations.

— Preceding unsigned comment added by The Anome (talkcontribs) 09:11, 21 July 2003

Missing crucial details

The article seems to be missing crucial details. I can't see where the actual dimension reduction is happening. Is the idea that you have several samples of the measurement vector x and you use these to estimate the expectations? 130.188.8.9 16:49, 20 Aug 2003 (UTC)

There should now be a clue. However, the article still needs work — Preceding unsigned comment added by Sboehringer (talkcontribs) 18:47, 22 March 2004‎

Wrong arithmetic in Table 2.2 on page 7 of the 4th external link (A tutorial on PCA by Lindsay I. Smith). The sum of the numbers in the second table is 1352.46, not 1149.89. Jeanbrincat (talk) 12:29, 29 August 2008 (UTC)

Requested move 2004

Principal components analysis is better known as Principal component analysis (singular). This should be the main title, and the plural form should be a synonym redirecting to this page (unfortunately I do not know how to do it). — Preceding unsigned comment added by Sboehringer (talkcontribs) 18:47, 22 March 2004

I've always heard it with the plural. I have a PhD in statistics. I'm not saying the singular could never be used, but the plural is certainly the one that's frequently heard. Michael Hardy 21:18, 22 Mar 2004 (UTC)
The only monograph solely dedicated to PCA is, to my knowledge, Jolliffe's, and it is titled "Principal component analysis". The naming issue is discussed in its introduction differently from what you indicate. Then again, naming issues are conventions and vary across the globe. Sboehringer
Google says: "Principal component analysis": 103,000 hits, "Principal components analysis": 46,300 hits. MH 13:48, 25 Mar 2004 (UTC)
I have that monograph and you are correct. It seems, however, that the analysis elucidates the principal components, plural, and so unless one is only interested in one principal component at a time, the plural appears to be more appropriate. — Preceding unsigned comment added by 24.10.224.158 (talkcontribs) 18:07, 21 August 2004‎
In various scientific papers and books I have seen it spelled "Principal Component Analysis". But as long as it refers to the same content, I won't lose any sleep over it.
Until now I have never seen it in the plural, not in scientific papers either. Do you ever write a plural before "analysis"? "Houses analysis", "cars analysis", "components analysis"... I think the plural is wrong, but I'm no native English speaker. Anoko moonlight 13:21, 30 July 2007 (UTC)
House prices analysis? Car sales analysis? Jheald 14:49, 30 July 2007 (UTC)
Okay, reasons to change it to singular: Primary PCA journal uses singular, more Google hits with singular, more Web of Science hits with singular. Reasons to keep it as plural: some wikipedia user says "the plural appears to be more appropriate." Well, I think that's enough for a change! 76.69.33.144 (talk) 15:20, 29 February 2008 (UTC)
There's a simple reason why it's usually called "principal component analysis". It's not that we use the singular, it's that in English when a plural noun is used attributively, you don't put an "s" on it. Actually it goes back to the Old English genitive plural, which ended in "-um" rather than "-s", and later the "-um" dropped off. Eric Kvaalen (talk) 17:44, 1 November 2010 (UTC)

Article needs serious improvement

Moving Michael Hardy's comments to Talk:

This article needs some serious revamping, to say the least. One cannot assume without loss of generality that the expectation is zero. If the expectation were observable, one could subtract it from x and get something with zero expectation, and so no generality would be lost by this assumption. In practice the expectation is never observable, and one must consider the probability distribution of the difference between x and an estimate, based on data, of the expectation of x.

Excuse me, but that is absurd. If the mean were observable, then one could simply subtract the mean from X, getting something with zero mean, and then indeed no generality would be lost by assuming that. In practice, one must use a data-based and therefore uncertain estimate of the mean, and one must therefore consider the probability distribution of the difference between X and the estimate of the mean of X.

If I may respond --- PCA is a technique that is applied to empirical data sets. PCA eigendecomposes the maximum likelihood covariance matrix. Indeed, there is a distribution of PCA decompositions about the "true" decomposition that you would get in the infinite data limit. But, that does not make it absurd. Or rather, no more absurd than any other maximum likelihood estimate. Any ML technique will have a variance around the estimate from infinite data.
Are you objecting because ML is not mentioned in the article? Or is it something else? -- hike395 04:39, 5 May 2004 (UTC)
Something else. Several something elses. It doesn't seem like that good an article. I'll probably drastically edit it within a few months; it's on my list. Michael Hardy 16:31, 5 May 2004 (UTC)

PCR and PLS?

Would it be redundant to include some discussion of principal components regression? I don't think so, but I don't feel qualified to explain it. — Preceding unsigned comment added by Robotica (talkcontribs) 17:00, 9 August 2004

It would also be nice to have a piece on Partial Least Squares. Geladi and Kowalski Analytica Chimica Acta 185 (1986) 1-17 may serve as a starting point. — Preceding unsigned comment added by 12.18.36.40 (talkcontribs) 04:03, 22 March 2005‎

I disagree --- PLS and PCR are both forms of linear regression, which is supervised learning. PCA is density estimation, which is unsupervised learning. Very different sorts of algorithms --- hike395 04:35, 22 Mar 2005 (UTC)
Principal components regression is used when the predictor variables are correlated, that is, cov(xi, xj) ≠ 0 for some i ≠ j. When this happens, we have multicollinearity, which reduces the power of the inference. PCA is applied to the independent variables, and a regression model is then fitted on the chosen principal components. The new estimated parameters are biased, but uncorrelated, and the variance of the new model is smaller.
I think that at least a discussion of the NIPALS algorithm would be useful here, since it is the main method by which PCs are calculated for very large datasets in the 'omics sciences. The Geladi reference is seminal. --Amaher (talk) 01:47, 15 January 2010 (UTC)
I've added the beginnings of a section on the NIPALS method - will expand over time. —Preceding unsigned comment added by Amaher (talkcontribs) 02:02, 15 January 2010 (UTC)
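For readers following the NIPALS discussion, here is a minimal sketch of the iteration in Python with NumPy. It is an illustration under stated assumptions, not the code behind the article section: the function name `nipals_pca`, the tolerance, and the rows-as-samples layout are all choices made here. Each component is found by alternating regressions (a power iteration), then removed from the data before the next one is extracted.

```python
import numpy as np

def nipals_pca(X, n_components, tol=1e-12, max_iter=1000):
    """Hypothetical helper: extract principal components one at a time.

    X is (n_samples, n_variables) and is mean-centered internally.
    Returns scores T (n_samples, k) and loadings P (n_variables, k).
    """
    R = X - X.mean(axis=0)              # residual matrix, starts as centered data
    scores, loadings = [], []
    for _ in range(n_components):
        # Start from the column with the largest variance
        t = R[:, np.argmax(R.var(axis=0))].copy()
        for _ in range(max_iter):
            p = R.T @ t / (t @ t)       # regress the columns of R on the scores
            p /= np.linalg.norm(p)      # normalise the loading vector
            t_new = R @ p               # updated scores
            converged = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        R = R - np.outer(t, p)          # deflate: remove the component just found
        scores.append(t)
        loadings.append(p)
    return np.column_stack(scores), np.column_stack(loadings)

# Usage on made-up data: 50 samples, 4 variables with different spreads
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
T_demo, P_demo = nipals_pca(X_demo, n_components=2)
```

Up to sign, the loadings should agree with the leading right singular vectors of the centered data; the appeal of NIPALS is that only the first few components are computed.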

PCA & Least Squares

Is PCA the same as a least squares fit? (Furthermore, is either the same as finding the principal moment of inertia of an n-dimensional body?) —BenFrantzDale 23:53, August 3, 2005 (UTC)

No. A least-squares fit minimizes (the squares of) the residuals, the vertical distances from the fit line (hyperplane) to the data. PCA minimizes the orthogonal projections to the hyperplane. (Or something like that; I don't really know what I'm talking about.) As for moments of inertia, well, physics isn't exactly my area of expertise. —Caesura(t) 18:44, 14 December 2005 (UTC)
Yes. PCA is equivalent to finding the principal axes of inertia for N point masses in m dimensions, and then throwing all but l of the new transformed co-ordinates away. It's also mathematically the same problem as Total Least Squares (errors in all variables), rather than Ordinary Least Squares (errors only in y, not x), if you can scale it so the errors in all the variables are uncorrelated and the same size. You're then finding the best l dimensional hyperplane that your data ought to sit on through the m dimensional space. The real power tool behind all of this to get a feel for is Singular Value Decomposition. PCA is just SVD applied to your data. -- Jheald 19:40, 12 January 2006 (UTC).
I understand it better now. The article should clarify more about SVD of the (zero-mean) data matrix versus eigendecomposition of the covariance matrix. The latter approach seems most intuitive, but both are valid. Somewhere (here or on another page) we should have the fact that the squared singular values divided by n-1 give the principal variances... —Ben FrantzDale (talk) 23:30, 27 April 2010 (UTC)
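The relation between the two routes can be checked numerically. The following is a small sketch with NumPy on made-up data (the mixing matrix and sample size are arbitrary assumptions): the eigenvalues of the sample covariance matrix equal the squared singular values of the centered data matrix divided by n-1, and the eigenvectors match the right singular vectors up to sign.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: 100 samples (rows) of 3 correlated variables (columns)
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.5, 0.0],
                                           [0.0, 1.0, 0.3],
                                           [0.0, 0.0, 0.2]])
n = X.shape[0]
Xc = X - X.mean(axis=0)                     # zero-mean data matrix

# Route 1: eigendecomposition of the sample covariance matrix
C = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(C)        # eigh returns ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Route 2: SVD of the centered data matrix itself
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
variances_from_svd = s**2 / (n - 1)         # squared singular values / (n - 1)

print(eigvals)
print(variances_from_svd)
```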

Derivation of PCA

Shouldn't the constraint that we are looking for the maximum variance appear somewhere in that derivation ? I cannot understand it clearly as it is right now. --Raistlin 12:49, 24 August 2005 (UTC)

It is my understanding that the first principal component is the least squares fit to a multidimensional configuration of points, which happens to also be the axis of maximum variance. The second principal component is also a least squares fit to the configuration, with the additional constraint that it must be orthogonal to the first principal component. The third, fourth, fifth, etc, principal components are also least squares fits, except that they are each constrained to be orthogonal to all of the principal components before them. 24.221.60.71 05:03, 21 May 2007 (UTC)

Exactly right. The more of the variance that can be put into the first n components, ie the n-component subspace fitted, the less is the variance (sum of squares) of the points' residuals orthogonal to that subspace. Jheald 15:46, 21 May 2007 (UTC)
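On where the maximum-variance constraint enters, here is a short sketch of the standard argument. It assumes zero-mean data with sample covariance matrix C and a unit-length direction w; the symbols w and λ are introduced here for illustration and are not taken from the article. The variance of the data projected onto w is w^T C w, so the first principal direction solves

${\displaystyle \mathbf {w} _{1}=\arg \max _{\mathbf {w} ^{T}\mathbf {w} =1}\;\mathbf {w} ^{T}\mathbf {C} \mathbf {w} }$

Introducing a Lagrange multiplier λ for the unit-length constraint gives

${\displaystyle {\mathcal {L}}(\mathbf {w} ,\lambda )=\mathbf {w} ^{T}\mathbf {C} \mathbf {w} -\lambda \left(\mathbf {w} ^{T}\mathbf {w} -1\right)}$

and setting the gradient with respect to w to zero yields

${\displaystyle 2\mathbf {C} \mathbf {w} -2\lambda \mathbf {w} =0\quad \Longrightarrow \quad \mathbf {C} \mathbf {w} =\lambda \mathbf {w} }$

So w must be an eigenvector of C, and the variance attained is w^T C w = λ, which is largest for the eigenvector with the largest eigenvalue. The link to the least-squares view comes from Pythagoras: for each centered point x,

${\displaystyle \|\mathbf {x} \|^{2}=(\mathbf {w} ^{T}\mathbf {x} )^{2}+\|\mathbf {x} -(\mathbf {w} ^{T}\mathbf {x} )\mathbf {w} \|^{2}}$

The left-hand side does not depend on w, so maximizing the projected variance is the same as minimizing the sum of squared orthogonal residuals. Later components repeat the argument with the extra constraint of orthogonality to the earlier ones.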

Conjugate transpose

and * T represents the conjugate transpose operation.

Why conjugate transpose instead of a normal transpose ? Does it even work with complex numbers ? Taw 04:18, 31 December 2005 (UTC)

As you probably know, conjugate transpose is a generalization of plain old transpose that allows these operations to work on complex numbers instead of just real numbers. If the source data X consists entirely of real numbers, then the conjugate operation is completely transparent, since the conjugate of a real number is the number itself. But if the source data includes complex numbers, then the conjugate operation is absolutely essential for the matrix operations to yield meaningful results. As far as I can tell, it does work on complex numbers. As an example where you might have complex numbers as source data, you might want to use PCA on the Fourier components of a real, discrete-time signal, which are in general complex. -- Metacomet 18:59, 1 January 2006 (UTC)
I have added a motivation paragraph at Conjugate_transpose#Motivation to try to show why it is so natural for the conjugate transpose to turn up, whenever the matrix you're transposing includes complex numbers. Hope it's helpful. -- Jheald 20:14, 12 January 2006 (UTC).
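A quick numerical illustration of why the conjugate matters, as a sketch on made-up complex data (the shapes and the variables-in-rows layout are assumptions made here). With the conjugate transpose the covariance matrix is Hermitian, so its eigenvalues are real and can be read as variances; with the plain transpose it is not.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: 3 complex variables (rows) observed 200 times (columns)
B = rng.normal(size=(3, 200)) + 1j * rng.normal(size=(3, 200))
B = B - B.mean(axis=1, keepdims=True)    # subtract each variable's mean
N = B.shape[1]

C_conj = B @ B.conj().T / (N - 1)        # conjugate transpose: Hermitian
C_plain = B @ B.T / (N - 1)              # plain transpose: not Hermitian in general

print(np.allclose(C_conj, C_conj.conj().T))
print(np.allclose(C_plain, C_plain.conj().T))
print(np.linalg.eigvalsh(C_conj))        # real eigenvalues, usable as variances
```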

Computation -- surely this is not the right way to go ?

The section on computation looks to make a real meal of things, IMO; and to be pretty dubious too, as regards its numerical analysis. As soon as you square the data matrix, you're going to reduce the accuracy of your SVD from double precision to single precision.

Is there any reason to prefer either of the methods in the text, compared to choosing which bits of the SVD you actually want to keep, and then just wheeling out R-SVD ? (Which I imagine is quicker, too). -- Jheald 19:05, 12 January 2006 (UTC).

I agree that this article is unreadable. The lengthy "PCA algorithm" section is one of the main reasons - it is too long, and it doesn't agree with the equations in the introduction (where did we divide by N-1? why? what about the empirical standard deviations?). It doesn't even say what the output of the algorithm is, AFAICT. A5 13:32, 6 March 2006 (UTC)
I am working on improving the algorithm section to make it more readable. In the end, the section will still be quite long, because the algorithm is rather complicated and I think it is important to include enough detail so that people can actually implement it in software. After I have completed this upgrade, please make specific suggestions for further improvements. -- Metacomet 21:37, 9 March 2006 (UTC)
I am done for now. There is still more work to do, but it's a good start. Please provide comments and suggestions for improvement. Thanks. -- Metacomet 23:12, 9 March 2006 (UTC)
The improvement I would suggest is to delete the whole entire section completely, starting from the table, and then everything following it; and instead tell people to use SVD.
A standard SVD routine will be better written, better tested, faster, and more numerically stable.
IMO it is totally irresponsible for the article to be suggesting inefficient homespun routines, actually leading people away from the standard SVD routines. -- Jheald 00:03, 10 March 2006 (UTC).
I'm no expert Jheald, but I don't see what you're so worried about. Algorithms for SVD that I have seen on the WWW basically consist of the same algorithm that is listed on this PCA page, only done twice, once for left handed eigenvalues, once for right. Is there some other algorithm for SVD that is much preferable? --Chinasaur 08:40, 25 May 2006 (UTC)
I'm sort of an expert - I have a PhD in computer science, in algorithms, although numerical algorithms are not my specific thing - and yeah, actually, the algorithms for SVD that you'll find in a package like LAPACK, R or Mathematica actually are different from the one described here. They avoid computing the covariance matrix for the reason Jheald suggested. ProfessorSpice 01:07, 29 June 2007 (UTC)
I am really glad that you took some time to carefully review the work that I did and make some thoughtful recommendations. Thanks for the constructive feedback. Oh yes, that is sarcasm, in case you were wondering. -- Metacomet 00:48, 10 March 2006 (UTC)
"...totally irresponsible..." Don't you think that is just a wee bit of hyperbole?
"...homespun routines..." Are you referring to calculating the mean, the standard deviation, or the covariance? No, that can't be right, those are well-known and well-established procedures from statistics. Or perhaps eigenvectors and eigenvalues? Hmmmm, those are standard routines in linear algebra. Sorting the basis vectors by energy content and keeping only the ones with the highest contribution? No, that's also a standard concept called the 80-20 rule (or Pareto's principle). I guess I just don't understand what you mean by homespun routines....
-- Metacomet 01:36, 10 March 2006 (UTC)

BTW, I am pretty sure that dividing by N-1 is correct, which means the introduction needs to be fixed, not the algorithm. The reason the algorithm needs to divide by N-1 is that it is computing the expected value of the product, not the product itself. -- Metacomet 21:50, 9 March 2006 (UTC)
I don't know anything about maths, but all the pages about the covariance matrix use N, so maybe N-1 is not so correct...? -- IC 18:48, 18 November 2006 (GMT+1)

A mathematical derivation with eigenvalues and eigenvectors is OK but such methods should not be called algorithms. The practical computation should be SVD. Squaring the matrix to get the covariance is harmful. By the way, homegrown SVD is harmful as well and I must support Jheald on both counts. A professional implementation should use SVD from some efficient and stable library, just as one should never write matrix-matrix multiplication except in a college homework. LAPACK is a de-facto standard for all that. (There might be justified exceptions such as programming something exotic with small memory.) Even if some engineering textbooks might have algorithms like here with the covariance created explicitly, their authors are obviously not professional numerical analysts or developers or they do not care about numerical aspects. Jmath666 07:28, 16 March 2007 (UTC)

I would like to support the suggestion of using an SVD function rather than a generic eigensolver. While the algorithm described is a mathematically correct way of doing PCA, not all mathematically correct algorithms are equally good. Two big issues for numerical algorithms are accuracy (how much does it lose in round-off error?) and efficiency. People spend their lives worrying about these issues, and those are the people who write programs like LAPACK, SciPy, R, MATLAB or Mathematica. Since the person implementing PCA is going to install some package like this to compute the eigenvalues anyway, s/he might as well use the SVD function in the package instead of computing the covariance matrix and then using the eigen-solver function. SVD is more "special purpose" - it takes B, and computes the decomposition directly, without going through the step of computing BB^T. The SVD function will certainly be more accurate (as Jheald said, computing the covariance matrix loses digits to round-off error) and I believe almost always more efficient. And it makes the PCA implementation shorter and easier to do! Just to drag in a real big-gun reference, Golub and Van Loan's textbook Matrix Computations gets 14,000+ citations on Google Scholar, and it recommends against computing the covariance matrix when computing SVD, in Section 8.3. ProfessorSpice 01:07, 29 June 2007 (UTC)

You all make a very strong case for SVD; however, I hope we can all agree that the SVD article lacks any sort of decent step-by-step explanation of an algorithm to produce it. Until there is one, I will use the method outlined here, as LAPACK is not an option for me. Please don't complain about what's up here until you know there is something better elsewhere on Wikipedia; right now that's just not the case. Jordyhoyt 10:46, 1 August 2007 (UTC)
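The precision argument in this thread can be seen on a tiny example. This is a sketch with NumPy on a hand-made 3x2 matrix (not centered data, and not taken from the article): its exact singular values are sqrt(2 + eps^2) and eps. Forming X^T X requires 1 + eps^2, which rounds to 1 in double precision, so the small singular value can no longer be recovered from the squared matrix, while the SVD of X itself still finds it.

```python
import numpy as np

eps = 1e-10
X = np.array([[1.0, 1.0],
              [eps, 0.0],
              [0.0, eps]])   # exact singular values: sqrt(2 + eps**2) and eps

# Route 1: SVD applied directly to the data matrix
s_svd = np.linalg.svd(X, compute_uv=False)

# Route 2: eigenvalues of the squared matrix, then square roots.
# 1 + eps**2 rounds to 1.0, so the information about eps is already gone.
G = X.T @ X
s_eig = np.sqrt(np.clip(np.linalg.eigvalsh(G), 0.0, None))[::-1]

print("direct SVD:       ", s_svd)
print("via X^T X + eigh: ", s_eig)
print("true small value: ", eps)
```

The exact digits printed for the second route depend on the linear algebra library, which is why no expected output is shown; the point is only that the small value is not reproduced.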

Question on reduced-space data matrix

The article states: and then obtaining the reduced-space data matrix Y by projecting X down into the reduced space defined by only the first L singular vectors, ${\displaystyle \mathbf {W_{L}} }$:

${\displaystyle \mathbf {Y} =\mathbf {W_{L}} ^{T}\mathbf {X} =\mathbf {\Sigma _{L}} \mathbf {V_{L}} ^{T}}$

I believe that the correct formula is:

${\displaystyle \mathbf {Y} =\mathbf {X} \mathbf {V_{L}} =\mathbf {W_{L}} \mathbf {\Sigma _{L}} }$

Can anyone verify this? — Preceding unsigned comment added by 216.113.168.141 (talkcontribs) 20:37, 16 June 2006‎

Afraid not. The way things are set up in the article, the data matrix X, of size M x N, consists of N column vectors, each representing a different sampling event; with each sampling made up of measurements of M different variables, so giving the matrix M different rows.
With the reduced space, we want to find a smaller set of L new variables, which for each sampling preserves as much of the information as possible out of the original M variables.
So we're looking for an L x N matrix, with the same number of columns (the same number of samples), but a smaller number of rows (so each sample is described by fewer variables).
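Matrix ${\displaystyle \mathbf {W_{L}} }$ is an M x L matrix, so ${\displaystyle \mathbf {W_{L}} \mathbf {\Sigma _{L}} }$ is also an M x L matrix - not the shape we're looking for. But ${\displaystyle \mathbf {\Sigma _{L}} \mathbf {V_{L}} ^{T}}$ is the desired L x N shape.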
Hope this helps. -- Jheald 11:02, 17 June 2006 (UTC).
Yes, that clarifies. Thanks Jheald! I thought that X row vectors were the sampling events and the column vectors were the variables -- since the definition of X is in fact the transpose of what I thought, then everything makes sense. -- 12:33, 26 June 2006
On a related note, am I misunderstanding, or is this inconsistent with the article text? It says,
"each row represents a different repetition of the experiment, and each column gives the results from a particular probe"
how is that compatible with what was just said,
"the data matrix X, of size M x N, consists of N column vectors, each representing a different sampling event; with each sampling made up of measurements of M different variables, so giving the matrix M different rows."? 140.163.0.5 (talk) 16:03, 14 August 2009 (UTC)
That sounds contradictory to me. FWIW, Matlab's cov function returns the covariance matrix for row vectors. That is, cov(mxn) is n-by-n. Obviously, you can define any of these either way, but consistency would be nice, particularly between this page and covariance matrix. —Ben FrantzDale (talk) 20:01, 16 August 2009 (UTC)
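Both formulas in this thread are correct under their own layout, which a short check makes concrete. This is a sketch with NumPy on random data; the sizes M, N, L are arbitrary. With samples in columns the reduced data is L x N, and with samples in rows it is the transpose, N x L, with the roles of the left and right singular vectors swapped.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, L = 4, 30, 2                        # variables, samples, kept components
X = rng.normal(size=(M, N))               # columns are samples
X = X - X.mean(axis=1, keepdims=True)     # center each variable (row)

# Samples-in-columns convention: X = W Sigma V^T
W, s, Vt = np.linalg.svd(X, full_matrices=False)
Y_cols = W[:, :L].T @ X                   # L x N
assert np.allclose(Y_cols, np.diag(s[:L]) @ Vt[:L, :])

# Samples-in-rows convention: the data matrix is X^T = V Sigma W^T,
# so its left singular vectors are V and its right singular vectors are W
Xr = X.T                                  # N x M
Y_rows = Xr @ W[:, :L]                    # N x L
assert np.allclose(Y_rows, Vt[:L, :].T @ np.diag(s[:L]))

print(Y_cols.shape, Y_rows.shape)
```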

Eigenvector/eigenvalue ordering

Under "Find the eigenvectors and eigenvalues of the covariance matrix", the article says "The eigenvalues and eigenvectors are ordered and paired." But then the next section says to order the columns by decreasing eigenvalue. Maybe I'm misunderstanding the previous section but this seems contradictory. 71.199.186.28 (talk) 00:06, 30 March 2008 (UTC)

I've used PCA for classification without reordering at this stage successfully. It may be that it is simply trying to preempt the later stage discussed. 134.225.217.52 (talk) —Preceding comment was added at 02:20, 25 April 2008 (UTC)
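On "ordered and paired" versus "order by decreasing eigenvalue": the two steps are compatible as long as the same permutation is applied to both the eigenvalues and the eigenvector columns. A minimal sketch with NumPy, using a small hand-made symmetric matrix as a stand-in for a covariance matrix:

```python
import numpy as np

C = np.array([[2.0, 0.8, 0.0],
              [0.8, 1.0, 0.0],
              [0.0, 0.0, 3.0]])          # stand-in for a covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)     # paired: column i goes with eigvals[i]
order = np.argsort(eigvals)[::-1]        # indices for decreasing eigenvalue
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]              # same permutation keeps the pairing

print(eigvals)
```

`numpy.linalg.eigh` happens to return eigenvalues in ascending order, so an explicit reordering is needed to put the largest-variance component first; skipping it does not break the pairing, only the ranking.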

Simplification

Could someone put one sentence at the top explaining this in layman's terms? It looks to me like a very fancy and statistically smart way to average a whole heap of data into some sort of dataset common to all of the data -- is this at all a correct impression? --Fastfission 04:07, 28 January 2006 (UTC)

Agreed! Understanding the introductory paragraphs requires one to first read a half dozen other articles. It ought to be possible to provide a simple explanation of PCA, perhaps with an illustrative example of its application, before diving headlong into statistician's jargon. I'm reasonably well educated and after reading this article, I have almost no clue what PCA is. --Anonymous —Preceding unsigned comment added by 192.55.52.9 (talk) 19:51, 6 November 2008 (UTC)
I agree with this! I have a science PhD, but unfortunately, minimal requisite statistics to process biological data, and this article simply confounded me. I request starting with the objective of PCA, then moving to what it provides, then moving to how it's done followed by theory. This is a classical teaching flow that many lessons follow.

Best of Luck! —Preceding unsigned comment added by 129.59.121.18 (talk) 15:01, 14 June 2010 (UTC)

Principal Component Analysis is a statistical method based on concepts from linear algebra. It seeks to summarize as much of the variation in a set of observations as possible with a small number of dimensions. Given a set of n observations (or cases), each of which contains a numerical value from each of a set of m variables, we regard each observation as a point in an m-dimensional vector space. Principal component analysis constructs an alternative orthogonal basis for the space, with dimensions chosen so that the first captures as much of the variance in the original data as possible, the second captures as much of the residual variance as possible after removing the first, and so on. Because the new dimensions are ordered from most to least "important" one can take a reduced basis consisting of the first k dimensions -- these are the "principal components" of the original multidimensional distribution. —Preceding unsigned comment added by 98.178.173.192 (talk) 00:13, 5 November 2010 (UTC)
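To make the description above concrete, here is a small sketch with NumPy on made-up data (the hidden two-direction structure, the noise level, and the sizes are assumptions chosen for illustration). It computes the new orthogonal basis, the fraction of variance each dimension captures, and the reduced representation using the first k dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 200, 5, 2
# Made-up data: 5 observed variables driven mostly by 2 hidden directions
latent = rng.normal(size=(n, k)) * np.array([3.0, 1.5])
mixing = rng.normal(size=(k, m))
X = latent @ mixing + 0.1 * rng.normal(size=(n, m))

Xc = X - X.mean(axis=0)                   # each observation is a point in m dimensions
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = s**2 / np.sum(s**2)           # fraction of total variance per dimension
scores = Xc @ Vt[:k].T                    # n x k: data in the first k new dimensions
X_approx = scores @ Vt[:k]                # mapped back into the original m dimensions

rel_error = np.linalg.norm(Xc - X_approx) / np.linalg.norm(Xc)
print(explained.round(3))
print(rel_error)
```

The squared relative reconstruction error equals the fraction of variance left in the discarded dimensions, which is the sense in which the first k components summarize the data.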

If one is dealing with an MxN data set, i.e. N factors and M observations of each, the resulting cov matrix will be NxN, not MxM.

It seems like everything from the mean vector subtraction to the covariance matrix calculation is done as if the data are organized as M rows of variables and N columns of observations. This is not properly explained in the "organizing the data" section, and is kind of opposite what most people would expect. I'm inclined to reverse everything. --Chinasaur 22:39, 19 May 2006 (UTC)
Yeah, this whole covariance matrix thing seems completely wrong. It states:
${\displaystyle \mathbf {C} ={1 \over N}\mathbf {B} \cdot \mathbf {B} ^{*}}$
And this is inconsistent on two levels. First of all the covariance matrix of B is NxN, not MxM, as others have stated above. This is in direct contradiction to what this section of the article states, and to what the "organizing data section" states. Secondly, assuming that each data set is in a column (so 3 datasets of 5 points each is organized into a 5x3 [M=5, N=3] matrix), the covariance matrix is NOT ${\displaystyle {1 \over N}\mathbf {B} \cdot \mathbf {B} ^{*}}$ but is actually ${\displaystyle {1 \over N}\mathbf {B} ^{*}\cdot \mathbf {B} }$. So the equation given above is for the transpose of B. And in any case, that's not even the covariance matrix of the transpose of B: the 1/N is wrong, it should be 1/(N-1). Unfortunately I don't know enough about it to make the correction, and the Wikipedia article that I came to to learn about it is quite inadequate. Anyways, I will be putting a contradiction tag on this article because of this. This is a very, very poorly written article, and the original author really deserves a sound spanking. Once I actually do find a correct, concise source of information regarding PCA, I'll be redoing it. --JCipriani 22:50, 13 April 2007 (UTC)
This article seems taken from some confused engineering textbooks that make it too complicated because they try to be elementary and try to teach other things at the same time. I had to wade through the mess myself trying to learn about PCA not too long ago but I never found an acceptable source. In fact, it is very simple: PCA is the spectral decomposition of the sample covariance matrix. It is best computed by SVD. It can be proved that the eigenvectors have certain optimal properties regarding the variance. It is very short, really. The Karhunen-Loeve decomposition is something a bit else (see Loeve Probability theory ISBN 0-387-90262-7) and it is done in advanced graduate courses in probability theory; but once you know that you can just say that "PCA is KL decomposition with covariance replaced by the sample covariance". All else is crud. I may write it up one day if I have the time. If you want to give it a shot these are on the clearer side: Holmes et al ISBN 0-521-55142-0, and Liang, Y. C. et al Proper orthogonal decomposition and its applications. I Theory, Journal of Sound and Vibration, 252 (2002) 527--544. Jmath666 04:51, 22 April 2007 (UTC)
Should the sample covariance matrix be used here instead of the population covariance? It seems the calculation should be;
${\displaystyle \mathbf {C} ={1 \over N-1}\mathbf {B} \cdot \mathbf {B} ^{T}}$
p.484 of David Lay's Linear Algebra and its Applications, 3rd ed ISBN 0-201-70970-8 and p.5 of the paper "A Tutorial on Principal Component Analysis" by Lindsay Smith support this. When you are performing PCA its typically on a sample of the population, right? Zhroth (talk) 16:21, 19 February 2008 (UTC)
Comments: Outer product between matrices is not a commonly used math term. Outer product is usually understood as an operator between two vectors. Anyway, the use of outer product here only makes the description seem more formal and complicated.
—Preceding unsigned comment added by 68.147.165.202 (talk) 18:46, 8 March 2008 (UTC)
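The size and normalization questions in this thread can be checked directly. The sketch below uses NumPy with the 5x3 layout mentioned above (M = 5 observations in rows, N = 3 variables in columns); the random data is made up. The covariance matrix is N x N, and the divisor counts observations: M - 1 for the sample covariance, M for the population form.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 5, 3                                   # M observations (rows), N variables (columns)
X = rng.normal(size=(M, N))
B = X - X.mean(axis=0)                        # subtract each variable's mean

C_sample = B.T @ B / (M - 1)                  # N x N, sample covariance
C_population = B.T @ B / M                    # N x N, population-style normalization

# np.cov divides by (observations - 1) by default; bias=True divides by observations
assert np.allclose(C_sample, np.cov(X, rowvar=False))
assert np.allclose(C_population, np.cov(X, rowvar=False, bias=True))

print(C_sample.shape)
```

With the transposed layout (variables in rows, observations in columns) the same matrix is B B^T divided by the number of columns minus one, which is where the two formulas in the thread come from.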

Cov Matrix size

The size of the cov matrix C is still unclear. From the section “Find the eigenvectors and eigenvalues of the covariance matrix” on, it is considered to be NxN, while in the section “Find the covariance matrix” it is MxM, which I think is the right size, since the matrix B is MxN. 133.6.156.71 12:07, 6 June 2006 (UTC)

Shouldn't it read "inner product" instead of "outer product"? Taking C as the outer product ${\displaystyle B\cdot B^{*}}$ would make it an ${\displaystyle M\times N\times N\times M}$ tensor.

outer product is right it is just they switched the meaning of the dimensions as some of the previous comments have indicated. —Preceding unsigned comment added by 74.192.1.156 (talk) 11:46, 18 October 2007 (UTC)

This isn't really working!

The first point, I wondered about, is "Calculate the empirical mean". I think the mean is not calculated in the right way. The mean is calculated over each dimension M. Isn't that sophisticating the data. I think you have to take the mean over each observation (N-vector).
The second point is the size, first of the covariance and then the size of the eigenvalue-matrix. By calculating the eigenvalues you get one for each variable in the data set. So, the size of this matrix should be MxM. And before, to reach this result, the covariance Matrix must have the same size.
... Has anybody an idea how it's really working?

Subtracting the mean of the observation is nonsense. If I have features X1 X2 where X1 is on the order of 10^20 and X2 is on the order of 10^-20, subtracting the mean of the observation will just make X2 a hugely negative value and X1 close to 0. Subtracting the mean of the dimension makes sense because you are trying to shift the problem back to the origin (if it were plotted). —Preceding unsigned comment added by 74.192.1.156 (talk) 11:50, 18 October 2007 (UTC)
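The difference between the two ways of subtracting a mean shows up clearly on features with very different scales. A minimal sketch with NumPy, using four made-up observations (rows) of two features (columns):

```python
import numpy as np

# Made-up data: feature X1 is of order 1e20, feature X2 of order 1e-20
X = np.array([[1.0e20, 1.0e-20],
              [2.0e20, 3.0e-20],
              [3.0e20, 2.0e-20],
              [2.0e20, 2.0e-20]])

# Per-feature centering: subtract each column's mean (shifts the cloud to the origin)
per_feature = X - X.mean(axis=0)

# Per-observation centering: subtract each row's mean (mixes the two scales)
per_observation = X - X.mean(axis=1, keepdims=True)

print(per_feature)
print(per_observation)
```

After per-observation centering the second column is just the negative of the first, so whatever X2 measured has been overwritten by X1; per-feature centering keeps the variation of both features.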

What's the difference between PCA and ICA?

Just wondering. This isn't clear to me from these articles. --137.215.6.53 12:18, 3 August 2006 (UTC)

Principal components analysis versus exploratory factor analysis

I suggest including a subsection discussing the differences between PCA and exploratory factor analysis. In my experience working in a Stat Lab, students and clients get them confused. Perhaps a description of the differences between PCA and EFA may be included. This can be added to common factor analyses. Below is my understanding of the differences. I did not want to use "greek" symbols so that it may be more accessible to non-mathematicians. What do you think?

Exploratory factor analysis (EFA) and principal component analysis (PCA) may differ in their utility. The goal in using EFA is factor structure interpretation and also data reduction (reducing a large set of variables to a smaller set of new variables), whereas the goal for PCA is usually only data reduction.

EFA is used to determine the number and the nature of latent factors which may account for a large part of the correlations among a large number of measured variables. On the other hand, PCA is used to reduce scores on a large set of observed (or measured) variables to a smaller set of linear composites of the original (or observed) variables that retain as much information as possible from the original (or observed) variables. That is, the components (linear combinations of the observed items) serve as reduced set of the observed variables.

Moreover, the core theoretical assumptions are different for both methods. EFA is based on the common factor model (FA), whereas, PCA is not.

1. Common and unique variances

Common Factor Model (FA): Factors are latent variables that explain the covariances (or correlations) among the observed variables (items). That is, each observed item is a linear equation of the common factors (i.e., single or multiple latent factors) and one unique factor (latent construct affiliated with the observed variable). The latent factors are viewed as the causes of the observed variables.
Note: Total variance of variable = common variance + unique variance (in which, unique variance = specific + error variance).
Principal Components (PCA): In contrast, PCA does not distinguish between common or unique variances. The components are estimated to represent the variances of the observed variables in as economical a fashion as possible (i.e., in as small a number of dimensions as possible), and no latent (or common) variables underlying the observed variables need to be invoked. Instead, the principal components are optimally weighted sums of the observed variables (i.e., components are linear combinations of the observed items). So, in a sense, the observed variables are the causes of the composite variables.

2. Reproduction of observed variables

FA: Underlying factor structure tries to reproduce the correlations among the items
PCA: Composites reproduce the variances of observed variables

3. Assumption concerning communalities & the matrix type.

FA: Assumes that a variable's variance is composed of common variance and unique variance. For this reason, we analyze the matrix of correlations among measured variables with communality estimates (i.e., proportion of variance accounted for in each variable by the rest of the variables) on the main diagonal. This matrix is called R_reduced.
Note: Principal Axis factoring (PAF) = principal component analysis on R_reduced.
PCA: There is no place for unique variance and all variance is common. Hence, we analyze the matrix of correlations (R_xx) among measured variables with 1.0s (representing all of the variance of the observed variables) on the main diagonal. The variance of each measured variable is entirely accounted for by the linear combination of principal components.
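
To make the difference between the two matrices concrete, here is a minimal NumPy sketch. It assumes a made-up 3x3 correlation matrix and uses squared multiple correlations as the communality estimates, which is one common choice rather than the only one. It performs a single, non-iterated principal-axis step, so it is an illustration, not a full factor-analysis routine.

```python
import numpy as np

# Hypothetical correlation matrix R_xx for three measured variables.
R = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])

# Squared multiple correlation of each variable with the others:
# SMC_i = 1 - 1 / (R^-1)_ii. Used here as the communality estimate.
smc = 1.0 - 1.0 / np.diag(np.linalg.inv(R))

# R_reduced: same off-diagonal correlations, communalities on the diagonal.
R_reduced = R.copy()
np.fill_diagonal(R_reduced, smc)

# PCA analyses R (1.0s on the diagonal); PAF analyses R_reduced.
pca_eigvals = np.linalg.eigvalsh(R)
paf_eigvals = np.linalg.eigvalsh(R_reduced)

print("PCA total variance:", pca_eigvals.sum())   # number of variables
print("PAF common variance:", paf_eigvals.sum())  # sum of communalities
```

The PCA eigenvalues sum to the number of variables, while the eigenvalues of R_reduced sum only to the estimated common variance.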

Also See factor analysis

(please bear with me, I am new to using Wikipedia).

RicoStatGuy 15:53, Sept 30, 2006(UTC)

Orthogonality of components

According to this PDF, the eigenvectors of a covariance matrix are orthogonal. The eigenvectors of an arbitrary matrix are not necessarily orthogonal, as seen in the leading picture on the eigenvector page. So what gives? Why are these eigenvectors necessarily orthogonal? —Ben FrantzDale 14:44, 7 September 2006 (UTC)

According to Symmetric matrix, "Another way of stating the spectral theorem is that the eigenvectors of a symmetric matrix are orthogonal." That explains that. 128.113.54.151 20:00, 7 September 2006 (UTC)
If the multiplicity of every eigenvalue of the covariance matrix is 1, then the eigenvectors will by necessity be orthogonal.
If there exists an eigenvalue of the covariance matrix with multiplicity greater than 1, say of dimension r, then this corresponds to an r-dimensional subspace of Rn (n being the dimension of the covariance matrix). Then the corresponding eigenvectors can be in principle any basis of this subspace. But generally speaking, the basis is chosen to be orthogonal.
So to answer the question, in some cases they must be orthogonal, and in some cases they do not all have to be, but are usually chosen to be so.
On a side note, all software packages I am aware of will return orthogonal eigenvectors in the multiplicity case. I suspect that this is because the algorithms implicitly force this by recursively projecting Rn into the nullspace of the most recent eigenvector, or something equivalent. Baccyak4H (talk) 17:56, 20 November 2006 (UTC)
Actually, 128.113.54.151 is exactly correct. Because covariance matrices are symmetric, they are necessarily normal. The complex spectral theorem tells us that in ALL cases a normal operator on a complex vector space has an orthonormal basis of eigenvectors. In fact, the theorem tells us that such an orthonormal basis exists if and only if the operator is normal. If we restrict ourselves to the reals, the Real Spectral Theorem tells us that a matrix has an orthonormal collection of eigenvectors if and only if it is self-adjoint. Covariance matrices are self-adjoint, so again the theorem holds. The statement above by Baccyak4H that in some cases the eigenvectors do not have to be orthogonal is incorrect when we are talking about covariance matrices. His assertion that an arbitrary collection of eigenvectors can be reshaped into an orthogonal collection of eigenvectors is also incorrect. 167.206.189.3 20:44, 19 June 2007 (UTC)
No, Baccyak4H is correct. As you say, it is always possible to find a basis of orthonormal eigenvectors for a real symmetric matrix. However if two eigenvectors u1 and u2 share the same common eigenvalue λ, then any arbitrary linear combinations v1 = α1u1 + β1u2 and v2 = α2u1 + β2u2 are also eigenvectors with the same eigenvalue. So yes, it is possible to find vectors u1 and u2 which are orthogonal, but if they share an eigenvalue then one can also find an infinite number of pairs of valid eigenvectors v1 and v2 which are not orthogonal. Jheald 21:06, 19 June 2007 (UTC)
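
A small NumPy sketch of the repeated-eigenvalue case discussed above. The 3x3 matrix is a made-up symmetric example with eigenvalues 1, 1 and 4, chosen only for illustration. It shows two valid eigenvectors for the repeated eigenvalue that are not orthogonal, and that `numpy.linalg.eigh` nevertheless returns an orthonormal set.

```python
import numpy as np

# Hypothetical symmetric (covariance-like) matrix with eigenvalues 1, 1, 4.
C = np.array([[2.0, 1.0, 1.0],
              [1.0, 2.0, 1.0],
              [1.0, 1.0, 2.0]])

# Two eigenvectors for the repeated eigenvalue 1 that are NOT orthogonal.
v1 = np.array([1.0, -1.0, 0.0])
v2 = np.array([1.0, 0.0, -1.0])
is_eigvec_1 = np.allclose(C @ v1, 1.0 * v1)
is_eigvec_2 = np.allclose(C @ v2, 1.0 * v2)
dot_v1_v2 = float(v1 @ v2)  # nonzero, so v1 and v2 are not orthogonal

# The library routine for symmetric matrices picks an orthonormal basis.
eigvals, V = np.linalg.eigh(C)
gram = V.T @ V  # identity matrix if the columns of V are orthonormal

print(is_eigvec_1, is_eigvec_2, dot_v1_v2)
print(np.round(gram, 10))
```

So an orthonormal eigenbasis always exists for a symmetric matrix, but within the eigenspace of a repeated eigenvalue, orthogonality is a choice made by the algorithm rather than a necessity.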

Cluttered

The article seems terribly cluttered. In particular, I dislike the table of symbols. Sboehringer 18:17, 14 December 2006‎

Rows and columns

I think our convention for the Data matrix is probably the wrong way round. At the end of the day, it would probably be more natural if our "principal components" vector was a column vector.

I also think that confusion between the two conventions is one of the things that has been making the article more difficult than it needs to be.

I propose to go ahead and make this change, unless anyone thinks it's a bad idea ? Jheald 16:53, 31 January 2007 (UTC).

Percent variance??

Presumably one wants to compare the sum of the leading eigenvalues to the sum of all eigenvalues. The example of comparing to a threshold of 90% doesn't make much sense otherwise
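
A minimal pure-Python sketch of that comparison: the running sum of the leading eigenvalues is divided by the sum of all eigenvalues, and the first k that reaches the threshold is kept. The function name and the eigenvalues are made up for illustration.

```python
def components_for_threshold(eigenvalues, threshold=0.90):
    """Smallest k whose leading eigenvalues reach `threshold` of the total."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return k
    return len(ordered)


# Hypothetical eigenvalues; cumulative fractions are 0.5, 0.8, 0.95, 1.0.
eigenvalues = [5.0, 3.0, 1.5, 0.5]
k_90 = components_for_threshold(eigenvalues, 0.90)
k_75 = components_for_threshold(eigenvalues, 0.75)
print(k_90, k_75)
```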

Cumulative energy

This term for the contributions of components seems to be from some other field. Is there a better general term for this ? Shyamal 08:25, 12 March 2007 (UTC)

Terminology used

Many users of PCA expect certain terminology such as the decomposition into "loadings" and "scores". The term loading itself is never used in the article and this can be confusing. The following is a mechanical statement for PCA in Matlab.

For a dataset X we can use Eigenvalue decomposition to produce
1) An Eigenvector matrix V whose columns are Eigenvectors and
2) Eigenvalue matrix D (diagonal) such that
X*V=V*D
and X=V*D*inv(V)=V*D*V'

Depending on the algorithm, the elements of D may be in ascending, descending, or unsorted order, but the elements of D and the columns of V may be sorted consistently without changing these identities. Matlab's eig() function, for instance, returns the D values in ascending order, although descending order is often preferred.

If X consists of samples in rows and variables in columns, then
X'X gives the covariance matrix (up to a factor of 1/(n-1), where n is the number of samples) if X is mean centered.

PCA can be done on the covariance matrix or using X'X even without mean centering

Cov=X'X

Cov now is a square matrix with the dimension being the number of variables or columns

The scores can be obtained with

Scores=X * V(:,m-k+1:m)

for k components, m is the number of variables

It can be verified that

Instead of using the Covariance matrix X'X, one can also compute PCA using just the X matrix. Here the Singular Value Decomposition
algorithm (SVD) may be used. In Matlab

[U,S,V]=svd(X)

The V here is identical to the V (loadings) obtained by Eigenvalue decomposition and the Scores are now equal to U*S

Hope someone can use the above suitably formatted in the article with explanation of the terms scores and loadings. Shyamal 09:15, 12 March 2007 (UTC)
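
For comparison, a rough NumPy translation of the Matlab recipe above. It assumes samples in rows, mean-centred columns, and made-up random data. It computes the scores both ways and checks that they agree up to the sign of each component, which neither decomposition fixes.

```python
import numpy as np

# Made-up data: 50 samples (rows) of 4 variables (columns), mean centred.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
X = X - X.mean(axis=0)

k = 2  # number of components to keep

# Route 1: eigendecomposition of X'X. eigh returns ascending eigenvalues,
# so reorder to descending before taking the first k columns (loadings).
eigvals, V = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]
scores_eig = X @ V[:, :k]

# Route 2: SVD of X itself. Singular values come back in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
scores_svd = (U * s)[:, :k]  # same as U @ diag(s), first k columns

# Each column may differ by a factor of -1, so compare absolute values.
print(np.allclose(np.abs(scores_eig), np.abs(scores_svd)))
print(np.allclose(eigvals, s**2))
```

The eigenvalues of X'X are the squared singular values of X, which is why the two routes produce the same components.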

I'm starting to wonder if this terminology is common in some fields and not so in others. Usually when I see PCA mentioned in computer science (not that I see it all that much there), people talk about it using more obvious terms, such as eigenvectors, basis vectors, component vectors, etc. Someone added a reference for "results are usually discussed in terms of component scores and loadings" (some stats for environmental sciences book), though. Maybe these are more common in the softer sciences?
The terminology is also rather opaque, so explaining its origins would also be good. (what is being loaded and where? is "score" used because some people normalize PCA results often enough to almost consider normalization as part of the PCA itself, and score then comes from standard score, or something?). -- Coffee2theorems (talk) 09:35, 13 February 2009 (UTC)

Merge POD and PCA

These seem to be just different terms used in different circles/applications for the same thing. Jmath666 01:53, 16 March 2007 (UTC)

Agree. Shyamal 08:48, 16 March 2007 (UTC)
Agree. Algorithms 16:57, 7 June 2007 (UTC)
Disagree. - I would find it very confusing. Though I wouldn't object the other way around if POD is really the same thing. --MatthewKarlsen 17:21, 16 July 2007 (UTC)

Request information on how to choose how many components to retain

Could someone include information on the proper method for choosing the number of components to retain? I've done some searching and haven't found any 'rules'. Part of my interest is that in MBH98 they retain only the first PC, but they apparently did incorrect centering. If it is correctly centered then a similar result is achieved if including the first 4 PCs. http://en.wikipedia.org/wiki/Hockey_stick_controversy It might be useful for individuals familiar with PCA to add some details to the above link.

LetterRip 10:49, 22 March 2007 (UTC)

In the past, we usually used PCA to try to represent 95% of the variability involved. I was involved in software metrics for a while and we'd collect a number of metrics that measured similar, but not quite the same, items. Using PCA, we could reduce the measures from 20ish to maybe 4 and account for 95% or more of the variability. This way we could have a 95% confidence saying things like "modules with this level of complexity have a far higher rate of bugs than modules with that level of complexity." Tangurena 04:15, 18 September 2007 (UTC)
• Of course after I post the question I find a good reference :)

"Component retention in principal component analysis with application to cDNA microarray data"

"Many methods, both heuristic and statistically based, have been proposed to determine the number k, that is, the number of "meaningful" components. Some methods can be easily computed while others are computationally intensive. Methods include (among others): the broken stick model, the Kaiser-Guttman test, Log-Eigenvalue (LEV) diagram, Velicer's Partial Correlation Procedure, Cattell's SCREE test, cross-validation, bootstrapping techniques, cumulative percentage of total of variance, and Bartlett's test for equality of eigenvalues. For a description of these and other methods see [[7], Section 2.8] and [[9], Section 6.1]. For convenience, a brief overview of the techniques considered in this paper is given in the appendices.

Most techniques either suffer from an inherent subjectivity or have a tendency to under estimate or over estimate the true dimension of the data [20]. Ferré [21] concludes that there is no ideal solution to the problem of dimensionality in a PCA, while Jolliffe [9] notes "... it remains true that attempts to construct rules having more sound statistical foundations seem, at present, to offer little advantage over simpler rules in most circumstances." A comparison of the accuracy of certain methods based on real and simulated data can be found in [20-24]."

http://www.biology-direct.com/content/2/1/2

LetterRip 11:11, 22 March 2007 (UTC)

Question concerning subsection "Convert the source data to z-scores"

Is it correct to transform normalized source data using a PCA which is based on the covariance matrix? Would it not be necessary to use a PCA based on correlation matrix instead (which corresponds to the covariance matrix of normalized source data)?

The covariance matrix based on z scores is the correlation matrix. Technically, one might need to worry about whether the covariance matrix is the empirical moments or "unbiased estimators", which differ by a factor of n/(n-1). There is a related chapter in Jolliffe. —Preceding unsigned comment added by Dfarrar (talkcontribs) 14:16, 6 March 2008 (UTC)
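
A short NumPy check of both statements, using made-up data on very different scales. With the sample standard deviation (ddof=1) in the z-scores, the covariance of the z-scores equals the correlation matrix; with the population standard deviation (ddof=0), the two differ by the factor n/(n-1).

```python
import numpy as np

# Made-up data: 30 samples of 3 variables on very different scales.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3)) * np.array([1.0, 10.0, 100.0])
n = X.shape[0]

# z-scores using the sample standard deviation (divides by n - 1).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
cov_of_z = np.cov(Z, rowvar=False)        # np.cov also divides by n - 1
corr_of_x = np.corrcoef(X, rowvar=False)

# z-scores using the population standard deviation (divides by n).
Z_pop = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)
cov_of_z_pop = np.cov(Z_pop, rowvar=False)

print(np.allclose(cov_of_z, corr_of_x))
print(np.allclose(cov_of_z_pop, corr_of_x * n / (n - 1)))
```

The scalar factor does not change the eigenvectors, only the scale of the eigenvalues.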

What is h in the z-scores section? —Preceding unsigned comment added by 216.184.13.6 (talk) 17:45, 30 September 2007 (UTC)

Not necessarily orthogonal

I removed the bit about 'assumption that the principal components are orthogonal' I believe that two things got mixed up:

If the noise is not white, principal components are not orthogonal, such that PCA is not optimal, or canonical, or anything. If the distribution of the noise is known, you may apply a linear transform to whiten the noise or (equivalently) apply a Generalized SVD or Restricted SVD.

I believe ICA applies to nonnormal variables. If the noise is jointly normally distributed, then a covariance of zero implies independence (See http://en.wikipedia.org/wiki/Normally_distributed_and_uncorrelated_does_not_imply_independent). Even though PCA is not optimal or canonical or anything, I believe this means that PCA does find uncorrelated, and hence independent variables in this case. Hence, I think it should be concluded that PCA is a form of ICA for jointly normal variables. —Preceding unsigned comment added by 130.89.67.57 (talk) 14:41, 17 March 2008 (UTC)

Fixed basis?

What does this sentence from the Details section mean: "Unlike other linear transforms, PCA does not have a fixed set of basis vectors. Its basis vectors depend on the data set." A linear transformation does not possess a basis at all. Does it mean that there is no standard choice of basis, with respect to which to compute the coefficients (matrix) of the linear transformation? 137.22.3.172 (talk) 18:46, 20 March 2008 (UTC)

Good point. I removed that sentence. —Ben FrantzDale (talk) 00:36, 21 March 2008 (UTC)
Presumably what was meant was that PCA cannot be represented by a particular fixed matrix operator.
If it did have a fixed matrix operator, eg like a Fourier transform, you could use SVD to identify a particular characteristic set of "input" basis directions, a set of "output" basis directions, and a corresponding set of scalings.
But a PCA is not that kind of a transformation. (It is not linear in the data). Jheald (talk) 09:09, 21 March 2008 (UTC)
But PCA might be considered as approximately linear in the data if the sample size is thought large enough for the covariance matrix to be essentially fixed, and if the effects on only a limited number of data points are considered. Melcombe (talk) 17:43, 22 April 2008 (UTC)

Would it be worth putting the algorithm in a separate article, and maintaining this article as a discussion of PCA theory? 134.225.217.52 (talk) —Preceding comment was added at 02:33, 25 April 2008 (UTC)

You mean the section "Computing PCA using the Covariance Method". I'm not sure if it should be in Wikipedia at all (at least in its current form). It's rather cookbookish and Wikipedia is not a cookbook. It is also absurdly long for the little content in it. For instance, it should take at most a couple of lines to describe computation of the covariance matrix, not a whole screenful with three headings. Similarly for postprocessing. The only involved part, the eigendecomposition, tells the reader to use ready-made software. Maybe this should really be a short code listing that works in octave/matlab or numpy (at least it would be far clearer and shorter, and the actual mathematics is adequately explained elsewhere in the article). If it can't be shortened to less than a screenful, then I guess it'd be better off in its own article (or in an open source project somewhere), instead of keeping this article unreadable.
Incidentally, the section "Table of symbols and abbreviations" is also an eyesore. If an encyclopedia article requires a full-screen table describing the notation used in the article, something is very badly wrong. -- Coffee2theorems (talk) 10:38, 13 February 2009 (UTC)
Agree that the article needs to be split into two: one for the technical/mathematical details of how it's calculated and another for the more practical/applied aspects. Tayste (edits) 21:05, 24 August 2011 (UTC)

Diagram

To whoever made this request, what kind of diagram do you want? --pfctdayelise (talk) 17:04, 2 August 2008 (UTC)

Karhunen-Loève transform?

The Karhunen-Loève transform is referred to several times, in a way that implies the reader knows what it is. It is not hyperlinked, and indeed the Wiki topic for Karhunen-Loève transform redirects to this page. It makes those areas very unclear. For example:

The Karhunen-Loève transform is therefore equivalent to finding the singular value decomposition of the data matrix X,...

Personally I think it deserves its own short Wiki page. I still don't know what it is. From the definition-like equation in the subsection Project the z-scores of the data onto the new basis, it sounds like KLT(X) is defined as being the projection of the data X onto the PCA basis obtained by thresholding cumulative variance at 90%. Apparently if I threshold at 91%, I'm no longer doing the KLT? Or should it be parameterized, e.g., KLT(X,90%)? Also, I take it from the article that the z-score conversion/normalization is also required to qualify as a KLT?

Other comments while I'm at it:

The Discussion section is in pretty horrible shape. It reads in a very disconnected, jumpy fashion. It tries to pursue a derivation of sorts, but lacks any description of what is being shown, and then jumps unexpectedly into connections with other topic areas. All the connections to other areas (EOFs, ANNs, LDA) should be separated from the derivation, and made clear that the purpose of the section is to identify connections to other topics.

In the subsection Compute the covariance matrix, shouldn't the equation be the sample covariance, i.e.:

${\displaystyle \mathbf {C} =...={1 \over {N-1}}\mathbf {B} \cdot \mathbf {B} ^{*}}$

Why the subtitle: Compute the cumulative energy content for each eigenvector? What is with the use of the term energy? Why not variance and cumulative variance? —Preceding unsigned comment added by 98.207.54.162 (talk) 22:52, 8 January 2009 (UTC)

Requested move

Regarding the above attempted counter-examples, "prices" and "sales" are things being analysed; Principal Component Analysis is the name of a method. The naming of statistical methods is very frequently in the singular like this. Surely even Dr Hardy would not refer to the method of "Factors Analysis (sic)"? Ged.R (talk) 19:52, 18 March 2009 (UTC)

Moved to new section. See also Talk:Principal components analysis#Requested move 2004 above, for earlier discussion on this topic. 199.125.109.126 (talk) 20:58, 18 March 2009 (UTC)

As the move was done in March, I have removed the move-request tag. Melcombe (talk) 09:14, 23 April 2009 (UTC)

Regularized PCA

Several authors discuss regularized principal component analysis, or regularized singular value decomposition (which presumably could help with regularized PCA). Does anyone have something they could put in this article: theory, computation, refs?

dfrankow (talk) 19:19, 23 January 2009 (UTC)

Section on "Computing Principal Components with Expectation Maximization"

I believe what is discussed in this section is more commonly referred to as the "power method" for calculating eigenvectors. This is what it is called in Golub and van Loan and other standard numerical linear algebra books. There may be some connection to the EM algorithm, but it is not apparent to me. In any case, I don't think this is an advisable way to carry out PCA. There are a number of good algorithms for computing the SVD, including Lanczos-type methods if only a few of the PC's are needed. However I don't think this page should get into SVD algorithms at all, focusing instead on how PCA is motivated, how it is used, strengths/limitations, and alternative/related techniques. Skbkekas (talk) 03:40, 11 May 2009 (UTC)

I agree that the caption is a bit hyped. The algorithm shown is however more likely to be NIPALS since it seems to be working on X rather than X'X which the power method uses. No reliable reference for this, but the difference is mentioned here. And yes I think the whole slew of methods should be just listed out and exact SVD and Eigenvalue computation algorithms moved to the SVD and Eigenvalue articles. Shyamal (talk) 05:20, 11 May 2009 (UTC)
Thanks for pointing this out, I wasn't aware of the distinction. Skbkekas (talk) 14:17, 12 May 2009 (UTC)

Can you have negative weightings (even if unphysical)?

Is it automatically unphysical to have a PCA reconstruction that has some stations negatively weighted? Would think that it could occur for both degeneracy and anticorrelation with the average (actual physical effects). Of course the summation must be positive, but is it automatically wrong if some of the stations have negative weights?

This is being debated on these blog threads. Unfortunately, the debate has muddled particular examination of the Steig Antarctic PCA-based recon with general absolute claims that negative weightings are bad, bad, bad.

See here:

http://noconsensus.wordpress.com/2009/06/07/antarctic-warming-the-final-straw/#comment-6727

http://noconsensus.wordpress.com/2009/06/09/tired-and-wrong-again/#comment-6726

http://wattsupwiththat.com/2009/06/10/quote-of-the-week-9-negative-thermometers/#more-8362

Also, if you can point to an academic expert who could answer this question, (negative weightings allowed?) would appreciate it.

P.s. Wiki police: this might have relevance to the article. —Preceding unsigned comment added by 69.250.46.136 (talk) 16:54, 19 June 2009 (UTC)

Negative weightings are not only allowed, they are routine.
For example, the most significant axis might be the difference between two stations.
Remember, the PCA is identifying signals that are correlated *deviations from the average*. (You have to centre the data first, by subtracting off the average for each station). Now if one of the most systematic deviations is that whenever station A reports high, then station B reports low, and vice-versa, then (Station A - Station B) could well be the kind of signal that PCA would extract. Jheald (talk) 17:02, 19 June 2009 (UTC)
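
A toy NumPy illustration of that point, with three made-up "stations" (not real data). Station B is constructed to read low whenever station A reads high, and the first principal component then gives them weights of opposite sign.

```python
import numpy as np

# Made-up readings: A and C follow a common signal, B moves against it.
t = np.linspace(0.0, 4.0 * np.pi, 200)
signal = np.sin(t)
stations = np.column_stack([
    10.0 + signal,        # station A
    5.0 - signal,         # station B, anti-correlated with A
    7.0 + 0.5 * signal,   # station C, weakly correlated with A
])

# Centre each station on its own average, then form the covariance matrix.
centred = stations - stations.mean(axis=0)
cov = centred.T @ centred / (len(t) - 1)

# eigh returns eigenvalues in ascending order; PC1 is the last column.
eigvals, eigvecs = np.linalg.eigh(cov)
pc1 = eigvecs[:, -1]

print(pc1)  # weights for A and B have opposite signs
```

The overall sign of `pc1` is arbitrary; only the relative signs of the station weights carry meaning.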
Appreciate a little more detail (citations and/or more looking at the actual details of the debate here, which is more than just a PCA, but use of the factors afterwards). I like that you agree with me, but think I need more. BTW, we can take this conversation to your talk page if the wiki police feel this is too forum-like. 69.250.46.136 (talk) 17:35, 19 June 2009 (UTC)
Okay, so having looked at the blog posts, I think the thing to remember is that PCA isn't setting out to produce an "average" temperature. Instead, what PC1 represents is the direction of strongest correlation in the data. So if it systematically occurs that whenever there are higher temperatures than average at Station A, that there are lower temperatures than average at station B, then if station A has got a positive weight, station B will get a negative weight.
So the PCA-weighted sum isn't giving you an "average" temperature. Rather, it's trying to give you an indication of the strength of the deviation in the correlation direction specified by PC1. So a lower temperature than usual, measured at a station which tends to record unusually low temperatures when other stations record unusually high temperatures, will add to the strength being indicated for the effect.
Which seems to be what's happened here. The mistake then is to interpret the PCA-weighted sum as an averaged temperature, rather than as an indication of the strength of a particular trend, which may have +ve effects in some places, -ve effects in others.
It should be remembered, however, that PCA is quite a rough and ready tool. In particular it can mix together the effects of different causes. So for example, the strongest driver correlating most stations together might be the long term secular trend. But for some pairs of stations the most significant correlation might be their response to the effects of say short term storm-fronts. The #1 PCA vector just records the direction of strongest correlation overall, which can mix together the responses to different causes. Other, more sophisticated techniques, for example Independent Component Analysis (ICA), try to produce vector directions which maximally separate out the effects of different causes. It might be of interest to run the raw station data through ICA to see how different the resulting vectors are from those produced by PCA. Looking at the plot of the PCA station weights, it's clear that for the most part the whole continent is moving in step, though with the temperatures in some parts apparently systematically moving faster than others. The "rogue" stations may be actually ones where the temperature reacts the other way to the same overall stimulus; or they may be anti-correlated for other reasons. You'd probably need to have a look at the detailed data (or know about the typical weather patterns of those parts) to say for sure.
Finally, ridge regression doesn't really make any difference to the overall picture. The idea is that some of the variation in PC1 is just "noise" power. You can get an estimate by computing the power (ie average mean-square variation) for each of the subsequent principal component directions. The idea is that the later Principal Component weightings become increasingly random and geographically meaningless, so the degree of power in the component from a real geophysical cause falls off as the PC number goes up, until the later PCs in effect only reflect power due to noise. You can therefore use them to estimate the "noise floor" -- ie the background noise power in each of the components; and then use maths in effect identical to a Wiener filter to scale back each of the PCs by an amount appropriate to remove the amount of noise and leave only the amount of genuine signal.
Anyhow, I hope that helps a little bit. The main point is that the PCA-weighted sum is being used to estimate the size of a signal, (without a-priori knowing what that signal is or means), and not an 'average' temperature. Jheald (talk) 01:38, 20 June 2009 (UTC)
Cool. So, though. Am I right, that we could have theoretically recons with some stations negative weighted. Or the other guys who jump up and down and make pictures of negative thermometers and tell me it is shit, shit, shit, to have a station negative weighted. That it is wrong in all situations.
Oh, and are you a stats expert? Can you come down and just rule on this by way of credentials or something? 69.250.46.136 (talk) 02:48, 20 June 2009 (UTC)
Sorry if my last comment was a bit rambling. It was knocking on 3am here local time, and I really should have been asleep! And no, I'm not going to put credentials on the line for you.
As to who is right, the answer is somewhere in the middle. As I tried to clarify last night, PCA is trying to pull out a signal from the data, by seeing what measured deviations from the average systematically correlate together.
And it has done this, quite successfully. It thinks there is a signal in the multi-station temperature data, expressed particularly strongly in one category of stations, less strongly in another. ((i.e. one area is warming up more quickly, and PCA has noticed this))
But the potential problem comes if we then make a statement something like, "the clearest signal which comes out of the data (according to PCA) is a secular long-term warming trend in the typical temperatures".
This is pretty much true -- the signal is pretty much going up in a straight line with time, so it is appropriate to identify it as relating to a "secular long-term trend". And the strong majority of stations are weighted with the same sign -- so for most of them a positive value of the signal does represent warming.
But it's not quite the whole story, because for some of the stations a positive value of the signal represents cooling. Now, there are two suggestions I've made as to why this might have been found: (i) these stations may actually have seen a long-term temperature fall, which correlates to the temperature rise elsewhere, and has some shared physical driver; or (ii) the PCA is telling us that for these stations, the most important correlation with their neighbours is not the long term picture at all, but (say) a marked anti-correlation in deviations in the short term.
Either way, that therefore makes it too simple to talk about the signal being plotted against time just as an "average temperature" -- the combination, with its negative weights for some stations, is more complex than that; and PCA wasn't being told to find an average temperature.
But it may be true that a common average long-term temperature trend (albeit going up faster in some places than others) is, for at least most of the stations, the most powerful correlation between different stations that PCA has found. Jheald (talk) 07:25, 20 June 2009 (UTC)
(Sorry if that may be a bit more nuanced than you were looking for!) Jheald (talk) 07:25, 20 June 2009 (UTC)
But my question is if it is just wrong, like violating a law of physics if a recon has some negative weights. I'm NOT asking if Steig is a good recon. But just this basic and GENERAL question. If you say "in the middle", aren't you really saying it's WRONG to make the general statement "no negative thermometers"? 69.250.46.136 (talk) 14:03, 20 June 2009 (UTC)
What do you mean by "recon"? My understanding is that what Steig has produced is not what I would think of as a recon, but perhaps you can clarify what the word means. Jheald (talk) 18:27, 20 June 2009 (UTC)

(Unindenting) He is doing a reconstruction of Antarctic temperature. He uses the correlation of satellite areas (spanning the entire continent) to fixed stations, during the period of overlap (post 1982) to figure out predictors for the interior surfaces (based on the stations) during the period 1957-1982. This allows him to reconstruct what the temp was in those interior areas (and he posits that this is more accurate than doing simple distance weighting because it captures patterns and the like). Here is a link to a description of the algorithm (sorry, by a denialist, but still it's a helpful explanation). http://noconsensus.wordpress.com/2009/04/06/updated-flow-chart-for-antarctic-paper/ 69.250.46.136 (talk) 21:57, 20 June 2009 (UTC)

I don't know who is writing this but I believe the meaning of 'negative weights' has been confused. In the Antarctic reconstruction being discussed at noconsensus above, the negative weights are unrelated to PCA. They are the final result of the reconstruction where 34 thermometer signals are added together to create a weighted average. This is different from the eigenvalue weights used to reconstruct the PCA. Therefore: The noconsensus links above relate to the weighting of a group of thermometers rather than the weighting of PCA. I am an engineer and I can say to a great deal of certainty that an upside-down thermometer signal in a weighted average is not good. JeffId1 (talk) 12:41, 21 June 2009 (UTC)Jeff Id
You're quite right. My apologies. I hadn't looked closely enough at the blog posts to see what the argument was about. This is a controversy about negative weights appearing in a quadrature, completely unrelated to positive or negative weights being in the PCA.
On the other hand, negative weights in quadrature may not be quite so beyond the pale as you seem to be propounding.
Consider a straight line, y = m x + c, and suppose we want the area under it between x=0 and x=5, but we only have measurements of y at x=0 and x=1.
Straightforwardly, by integration, the area A = (25/2)m + 5c.
Now putting in our measurements, we find
A = (25/2)(y1 - y0) + 5y0
= (25/2) y1 - (15/2) y0
So the presence of a negative weight in a quadrature estimate is not necessarily a red stop light; at most, it's a warning light that the result may be quite dependent on the accuracy of some extrapolation. Jheald (talk) 09:02, 22 June 2009 (UTC)
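
The arithmetic in this example can be checked with a few lines of plain Python. The slope and intercept below are arbitrary values picked for the check, and the function names are made up for illustration.

```python
def exact_area(m, c, upper=5.0):
    """Integral of y = m*x + c from x = 0 to x = upper."""
    return 0.5 * m * upper**2 + c * upper


def area_from_two_readings(y0, y1):
    """Quadrature rule derived above: A = (25/2)*y1 - (15/2)*y0."""
    return 12.5 * y1 - 7.5 * y0


# Arbitrary line y = 2x + 3, "measured" only at x = 0 and x = 1.
m, c = 2.0, 3.0
y0 = c        # reading at x = 0
y1 = m + c    # reading at x = 1

print(exact_area(m, c))                # 40.0
print(area_from_two_readings(y0, y1))  # 40.0, despite the negative weight on y0
```

The weight on y0 is negative because the rule extrapolates the slope from the two readings well beyond x = 1.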
I'm unfamiliar with the reply syntax of wiki - so I apologize for that.
The problem as I've pointed out on my thread to TCO is not in the fact that a negative value is not mathematically explanatory. In fact I fully believe we get a better regression fit. The problem is in that we are talking about temperature data, and in a weighted average of thermometer data, the negative thermometer is a non-physical value. This likely results in improper overweighting of similar positive thermometers and improper distribution of trend in a location critical reconstruction.
I'm actually sorry that TCO has taken your time with this as it is off topic for the thread it started on.
To make a simple point, suppose you have three thermometers recording temperature in your back yard for 50 years. How would you determine the average temperature over 50 years? If someone told you that one thermometer's entire record should be flipped and arbitrarily (without explanation) given a negative sign, would you agree or question the rationale for the negative sign? Consider that there are only 34 thermometers in the entire Antarctic during this reconstruction, and under 20 lead back to the beginning and also consider they are thermometers and fairly good at recording temperature.
Would you consider flipping 5 of 34 to determine temp reasonable?
This is an amazingly simple slam dunk issue which has been deliberately expanded by a truly famous troll (thousands of people know TCO). He went searching for experts to find who would take the bait contacting at least a few different people. I'm sorry again that it wasted your time. - -Jeff
Jeff: Let the guy engage, Jeff. It's a privilege to have his attention. Don't try to cut him off. You can still speak your piece on the content. I welcome that, as I think it will help the expert engage and correct you. Oh and ixnay on teh OTC-ay abel-ay. I am either blocked for several months of weight loss (according to one admin) or perma-blocked according to another. So...be weeery quiet with the naming.
Jheald: I worked through your example. Thanks. I understand that negative weighting in the recon is not axiomatically wrong, despite "thermometers" indicating positive above absolute zero. I'm not familiar with quadrature as a term. Have not seen people using that term in climate science papers or blog posts. 72.82.44.253 (talk) 15:31, 24 June 2009 (UTC)

Starting with Correlation vs. Covariance

Though you can normalize a covariance matrix, it seems easier to just start with a correlation matrix instead. This is discussed in the PCA book by George Dunteman (and likely others).

To recap, if your data contains values that are on radically different scales, the covariance has to be carefully corrected, so that one number doesn't swamp the other simply because it's naturally larger.

A quick example: Suppose you're dealing with national financial statistics. One of your columns represents interest rates, measured in fractions of a percent, so let's say it varies between 2 samples by 1/4 of a percent, 0.0025, or 2.5e-3. But let's say you've also got data points measuring imports and exports in dollars, and let's say the same two samples differ by \$25 billion, or 2.5e10. The native scales of these variables happen to differ by 13 orders of magnitude. And we'll assume you've got 20 other variables at different scales, perhaps some other numbers in the millions and billions, so this difference isn't obvious at first glance.

As I understand it, starting with covariance is dangerous. Without special care, even a change of interest rate of 10% (quite dramatic) would be overwhelmed by the other numbers in the millions and billions.

The article DOES talk about normalization, but I didn't notice this issue pointed out by that name. And perhaps one of those formulas is equivalent to converting covariance to correlation? If so, maybe point that out? I do see the mention of the R matrix in the table of terms.
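A minimal sketch of the scale problem, using made-up numbers for just two variables (interest rates as fractions, trade volumes in dollars). All data and helper names here are hypothetical. With only two variables the leading eigenvector of a symmetric 2x2 matrix has a closed form, so no linear-algebra library is assumed. The last step illustrates that standardizing each variable and then taking the covariance gives the same number as the correlation.

```python
import math
from statistics import mean

# Hypothetical data: interest rates as fractions, trade volumes in dollars.
rates = [0.0200, 0.0225, 0.0250, 0.0300, 0.0275]
trade = [1.0e10, 3.5e10, 2.0e10, 6.0e10, 4.5e10]


def covariance(a, b):
    """Sample covariance with the usual n - 1 denominator."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def leading_direction(sxx, sxy, syy):
    """Unit leading eigenvector of the symmetric matrix [[sxx, sxy], [sxy, syy]]."""
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return math.cos(theta), math.sin(theta)


def standardize(a):
    """Z-scores: subtract the mean, divide by the sample standard deviation."""
    m = mean(a)
    s = math.sqrt(covariance(a, a))
    return [(x - m) / s for x in a]


s_rr = covariance(rates, rates)
s_rt = covariance(rates, trade)
s_tt = covariance(trade, trade)

# First component from the covariance matrix: the dollar-scale variable dominates.
cov_w_rate, cov_w_trade = leading_direction(s_rr, s_rt, s_tt)

# First component from the correlation matrix [[1, r], [r, 1]].
r = s_rt / math.sqrt(s_rr * s_tt)
cor_w_rate, cor_w_trade = leading_direction(1.0, r, 1.0)

# Covariance of the standardized variables reproduces the correlation.
z_cov = covariance(standardize(rates), standardize(trade))

print(f"covariance-based weights:  rate={cov_w_rate:.2e}, trade={cov_w_trade:.6f}")
print(f"correlation-based weights: rate={cor_w_rate:.6f}, trade={cor_w_trade:.6f}")
print(f"r={r:.6f}, cov(z_rate, z_trade)={z_cov:.6f}")
```

In the two-variable case the correlation-based component always weights both variables equally in magnitude; with 20 more variables the weights would differ, but no variable would dominate merely because of its units.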

Ttennebkram (talk) 18:06, 26 July 2009 (UTC)

Derivation of PCA using the covariance method

This section was edited 2011-02-23. See the old version versus the new version.

I think the new version has some improvements; it's much shorter, for one. Also, the transposing of P makes things simpler and neater. However, I also think it's too dense and uses unnecessary terminology (and in particular, mixes terminology unnecessarily). The old version did this too, but there it was less problematic since the formulae were more explicit. One example is the usage of unitary vs. orthonormal transformation matrix.

I think it might be clearer to reduce the usage of terminology (use orthogonal matrix rather than unitary+orthonormal transformation matrix), and to reintroduce the explicit diagonalization rather than just mention it. AFAIK the WLOG assumption that mean(X) = 0 isn't necessary, or am I missing something?
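To make the suggestion concrete, here is a small sketch of an explicit diagonalization for two variables, written in plain Python with hypothetical helper names and illustrative data. It assumes the covariance is computed with the mean subtracted explicitly, in which case shifting the data leaves the covariance (and hence P) unchanged. P is a plain rotation, so "orthogonal matrix" is all the terminology it needs.

```python
import math


def sample_cov(rows):
    """2x2 sample covariance of (x, y) rows; the mean is subtracted explicitly."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    sxx = sum((x - mx) ** 2 for x, _ in rows) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in rows) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]


def rotation_diagonalizing(c):
    """Orthogonal P (a rotation) whose columns are eigenvectors of symmetric 2x2 c."""
    theta = 0.5 * math.atan2(2.0 * c[0][1], c[0][0] - c[1][1])
    cs, sn = math.cos(theta), math.sin(theta)
    return [[cs, -sn], [sn, cs]]


def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def transpose(a):
    """Transpose of a 2x2 matrix stored as nested lists."""
    return [[a[j][i] for j in range(2)] for i in range(2)]


# Illustrative data only.
rows = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0)]

C = sample_cov(rows)
P = rotation_diagonalizing(C)
D = matmul(transpose(P), matmul(C, P))  # P^T C P, diagonal up to rounding
PtP = matmul(transpose(P), P)           # identity up to rounding, so P is orthogonal

# Shifting every observation by a constant does not change the covariance.
shifted = [(x + 100.0, y - 50.0) for x, y in rows]
C_shifted = sample_cov(shifted)

print("D =", D)
print("P^T P =", PtP)
```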

Thoughts?

Eamon Nerbonne (talk) 10:06, 4 April 2011 (UTC)

"Software/source code" woes

The "software/source code" section seems unwieldy and unhelpful, and is riddled with questionable external links. Any stats package worth its salt will have support for principal component analysis; do we really need all these disorganized links? 188.74.104.106 removed the external links tag, but I'm reinstating it because this section leaves a lot to be desired in that respect. Thoughts? Statisfactions (talk) 00:55, 28 November 2011 (UTC)

Looks like it could be trimmed down a bit. Just we should be careful not to be too draconian. Some of the links are quite useful/illustrative/unique. Kevin Baastalk 14:23, 30 November 2011 (UTC)
Agree. I think it's a very good article. The subject is non-trivial. Perfection is unattainable. Dratman (talk) 02:43, 31 December 2011 (UTC)

Mistake in details section

The Details section is inconsistent. It states that X has n rows and m columns, i.e. X is an n x m matrix. It then goes on to say that ${\displaystyle \mathbf {X} =\mathbf {W\Sigma V} ^{\top }}$, where W is m x m, Sigma is m x n and V is n x n, which would make X an m x n matrix. I think the latter is the correct form as opposed to the former. It is also consistent with the Table of Symbols and Abbreviations, and with the reference http://www.snl.salk.edu/~shlens/pca.pdf. Can someone confirm this? — Preceding unsigned comment added by 131.111.185.68 (talkcontribs) 11:50, 15 April 2011

It says that X^T is n-by-m, which makes X an m-by-n matrix, so there's no mistake. -- 195.178.200.68 (talk) 09:44, 16 April 2011 (UTC)
I agree that this part is formally correct. But I strongly recommend pointing out in the text that it is X^T that is defined, not X. I also misread that, and it took me 2 hours (and a look at this discussion page) to understand what I got wrong, because if you read X instead of X^T, everything becomes inconsistent with other literature.
I just added a note on the topic. — Preceding unsigned comment added by Ga29sic (talkcontribs) 14:13, 4 December 2011 (UTC)
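For anyone else tripping over this, a short dimension check may help. It uses only the sizes quoted in this thread: X^T is n-by-m, so X is m-by-n, and the factors then chain consistently:

${\displaystyle \underbrace {\mathbf {X} } _{m\times n}=\underbrace {\mathbf {W} } _{m\times m}\,\underbrace {\mathbf {\Sigma } } _{m\times n}\,\underbrace {\mathbf {V} ^{\top }} _{n\times n}}$

Reading X itself as n-by-m would instead require W to be n-by-n and V to be m-by-m, which is where the apparent inconsistency comes from.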

Mistake in Find a covariance section

How can we start with the population covariance, i.e. the mathematical expectation operator, and then move to the empirical measure, i.e. the sample mean, in the same line? That does not look right to me. — Preceding unsigned comment added by 78.145.18.55 (talkcontribs) 15:33, 13 February 2012

This is indeed wrong and I will change this now. We cannot move from one to the other without invoking asymptotic theorems. — Preceding unsigned comment added by 192.193.116.137 (talkcontribs) 11:21, 7 March 2013
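For reference, a sketch of the two quantities in question. The notation is assumed here rather than taken from the article: μ is the population mean, and x̄ is the sample mean of n observations x_1, ..., x_n.

Population covariance: ${\displaystyle \mathbf {\Sigma } =\operatorname {E} \left[(\mathbf {x} -{\boldsymbol {\mu }})(\mathbf {x} -{\boldsymbol {\mu }})^{\top }\right]}$

Sample covariance: ${\displaystyle \mathbf {S} ={\frac {1}{n-1}}\sum _{i=1}^{n}(\mathbf {x} _{i}-{\bar {\mathbf {x} }})(\mathbf {x} _{i}-{\bar {\mathbf {x} }})^{\top }}$

S is an estimator of Σ, not equal to it. Under independent, identically distributed sampling with finite second moments, S converges to Σ as n grows (law of large numbers), and that convergence is the asymptotic step that an equals sign between the two would skip.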

Wrong Indices at X?

At the beginning it is written that X^T has n rows, but later in the table that X has N rows. I think that the later description is correct. — Preceding unsigned comment added by 193.174.63.68 (talk) 13:07, 12 July 2012 (UTC)