# Talk:Least squares/Archive 1

## too complicated

I was in a hurry and I came here to find the formula for the gradient and the y-intercept. This page is too much hassle. Please just write the formula somewhere easily visible for the gradient and y-intercept, instead of filling the page with all this long-winded bs 129.78.64.100 (talk) 01:12, 3 June 2010 (UTC)

You want to go to the simple linear regression page probably.  // stpasha »  05:13, 3 June 2010 (UTC)

At the behest of the common mass, and in light of the fact that 50% of the populace is below the mean in terms of quantifiable intellect (myself included); an introductory summation might acomodate those who feel disinclined to achieve a math degree to fricking understand any of this. Id est, Lets dumb this sucker down with at least an introductory paragraph so that at least 25% of the population can understand something here. Many wikipedia readers are college students looking for help, not researchers reviewing each others work. While I really appreciate all the work that has gone into this page,... lets see what can be done to at least put a little description up at the top for dummies who have a computer program to do their forecasts for them.  :-D Insightfullysaid (talk) 20:25, 8 February 2012 (UTC)

I have moved some that information that belongs in least squares instead of in the article on linear regression. The article Least-squares estimation of linear regression coefficients should also be examined and merged with this.Woollymammoth 02:14, 24 February 2007 (UTC)

Regarding "least squares" ("Least squares" means that the overall solution minimizes the sum of the squares of the errors made in solving every single equation.), the following is not clear to me: 1. how does the "overall solution minimize" ? 2. What does it minimize? I.e., what is "the sum of the squares of the errors"? 3. There is a logical inconsistency, since if you know what the squares (^2, resumably) of the errors are, then you also know what the error is -- then why not just remove the error from the result, thereby making it correct?

Needless to say, this introductory section needs to be made more accessible, less filled with statistical jargon. —Preceding unsigned comment added by 75.203.177.90 (talk) 20:03, 4 September 2010 (UTC)

## political science

A number of politics articles goes here, can this article be made more relevant to the political usage please? --Gregstephens 02:20, 25 Mar 2005 (UTC)

I have no idea what the political usage is, and I am unable to find those political articles linking here (I clicked on "What links here"). Unless it's very much related to this topic, I think its inclusion would be inappropriate; it should be in a separate article titled least squares (politics) or the like. Michael Hardy 03:02, 25 Mar 2005 (UTC)
It is linked to from proportional representation, it is a measurement ot figure out how disproportional an electoral system is. The formula used is:
Lsq= square root of (half the sum of (Vi-Si)^2)

Is that similar to what is used here? Or is it different? If it is, I will start an article on measures of disproportionality, as that can cover more. --Gregstephens 04:52, 25 Mar 2005 (UTC)

I find the remarks above cryptic. What, in particular, are the quantities you are calling Vi and Si? Michael Hardy 30 June 2005 21:11 (UTC)

Firstly, the formula cited above seems to be basically the same as the usual definition, save for some proportionality constant. So there is no need to involve politics here. Secondly, there is no such formula in proportional representation at this date. It was removed on the 5th of June. --HelgeStenstrom 12:37, 21 November 2005 (UTC)

The Gallagher Index is sometimes refered to as the least squares method of calculating disproportionality. --LeftyG 07:24, 6 January 2006 (UTC)

## Recursion

Could someone put in a short discussion of how to solve the problem using recursion? I filled a request for Recursive least squares algorithm by redirecting to this article, but perhaps someone whoever requested it wouldn't find all they're looking for here.--Joel 03:26, 25 Jun 2005 (UTC)

I don't think that redirect is very helpful as the current article does not talk about solving the least squares problem recursively at all. Unfortunately, I can't add it to the article since I have no idea how the algorithm is supposed to work. -- Jitse Niesen (talk) 11:30, 25 Jun 2005 (UTC)
Isn't that at Solving linear recurrence relations in Recurrence relation? 212.102.225.147 30 June 2005 12:54 (UTC)
I am sorry, but I don't see anything there about least squares problems. Could you please be more specific? -- Jitse Niesen (talk) 2 July 2005 23:52 (UTC)
No (sorry from my part), my response was more based on a hunch feeling then in-depth study of the article. 212.102.225.147 16:21, 11 July 2005 (UTC)

## Ordinate

Ah, the residuals are measured along the ordinate (assuming no error in the x-values) now allows me to understand what you mean by ordinate. But it wasn't at all obvious. Doesn't between the fitted function and the data. make this clear anyway, and rather more clear? William M. Connolley 09:25:26, 2005-08-24 (UTC).

"Between the fitted function and the data" is certainly erroneous. When "distance" is measured between a point and a particular line (in this case the fitted function), the outcome will vary with the choosen direction along which you want to measure. In geometry "distance" would create no problem, because by definition the measurement is assumed to be in orthogonal direction. However, in OLS, this is certainly not the case. The objective in OLS (as well as the optimization of the underlying partial derivative problem) is to minimize the error in the Y dimension. As a consequence, in OLS the "distance" is measured along the direction of the Y-axis (= the ordinate).

A challenge: try to define a formula for Orthogonal Least Squares, and you'll understand immediately why Ordinary LS is such a popular method. User:Witger

You have any info about this Witger? I am using OR at the moment (or Total Least Squares regression to be more precise, isn't that the same?) Until now I haven't been able to find any 'basic' information about Orthogonal Regression except: SIAM Review 36 (2) 258-264 Jeroenemans 16:42, 5 December 2006 (UTC)

## Nonrubustness of Least Squares and IRLS

I have noticed that outside certain fields (computer vision) there seems to be little discussion of robust statistics. Least squares is utterly crippled by any form of asymmetric outliers, and I believe that more of this article should be spent on discussing the limitations of the method and pointing to similar methods (Iterated Reweighted Least Squares, or Least Median Squares, or consensus based algorithms like RANSAC... interestingly wikipedia doesn't have any articles on IRLS or Least Median Squares, despite their importance in high noise environments... perhaps I'll write some). Anyone have any thoughts on this? - JustinWick 18:09, 5 December 2005 (UTC)

## Merge

I support the merge of Least mean squares into this page. Possible with a section on adaptive filters. --Pfafrich 23:13, 10 January 2006 (UTC)

I won't mind. But that text needs a good cleanup before that. Anybody willing to do the work? Oleg Alexandrov (talk) 00:13, 11 January 2006 (UTC)
No, LMS is a largely unrelated topic, an adaptive filter algorithm. Dicklyon (talk) 06:52, 5 December 2007 (UTC)
I oppose. The article Coefficient of determination will not be comprehensive, if this article disappears. A.L.

I support the merge of RSS with "Least Squares" (This article). Could we reference the definition as a sub-section titled "RSS"? Having a reference for the term would be helpful in other articles as well. What's required before a merge is allowed? Yoderj 19:19, 14 February 2007 (UTC)

• Merge for Explained_sum_of_squares, and Total sum of squares since both have no helpful links to them and are just silly little articles that should just be on this page. Coefficient of determination is not comprehensive as is, the importatn info from those other articles really should be on that page for it to be clear and understandable. Pdbailey (talk) 04:06, 5 December 2007 (UTC)
Then better start a new merge proposal instead of messing with this old one about something else. Dicklyon (talk) 06:52, 5 December 2007 (UTC)
This is listed as a proposed merger on Wikipedia:Proposed_mergers, where am I supposed to comment? Pdbailey (talk) 14:09, 5 December 2007 (UTC)
I took that 2006 item out of there, since it was totally out of process; see the top of that page about process. Start a new merge proposal if you want to. Dicklyon (talk) 15:47, 6 December 2007 (UTC)
What Dicklyon meant to say was, "Thanks for bringing up this important issue. Please note that the standard way to propose a merge is to add {{mergefrom}} and {{mergeto}} templates on the pages you think should be merged, and start a discussion in a new section on the talk page of the merge target. Take a look at WP:MERGE for more information. Thanks again, and have a nice day!" (Wasn't it, Dick? :-) ) --Zvika (talk) 20:00, 6 December 2007 (UTC)
Zvika, thanks for being kind and explaining Dicklyon's remark. I'm very familiar with the merge process and don't need any help navigating it. I'd encourage you to find the person who put the proposal there if you want to be of assistance. Thanks again for the kinder wording. Pdbailey (talk) 23:02, 6 December 2007 (UTC)
Dicklyon, do you really think that you are AGF when you do that? Maybe trying to help along the newby, or at least telling them what you are doing when you blast what they think is a valid proposal would be best. Pdbailey (talk) 23:02, 6 December 2007 (UTC)
I'm not following you. Zvika's over-interpretation of my terse remark is not too far off. But I'm not sure how to reconcile your familiarity with merge process with your calling this a "valid proposal". I simply said that if you want to make it a valid proposal, do so. Dicklyon (talk) 07:48, 7 December 2007 (UTC)
Dicklyon, sorry for the confusion, I had assumed that since this article was listed on Wikipedia:Proposed_mergers that meant someone tried to list it there and was paying attention. Now I understand that is was just detritus. Thanks for sticking with me. Pdbailey (talk) 03:37, 8 December 2007 (UTC)

## Redirect

Should not LSQ redirect/be disambiguation to this article? —Preceding unsigned comment added by 130.238.7.41 (talk) 05:16, 5 October 2007 (UTC)

## History: Fact Checking

The submission by 81.173.155.84 (diff) reads, in part:

On New Year's Day in 1801, the italien astronomer Giuseppe Piazzi discovered the asteroid Ceres. He was able to track its path for 40 days, until Ceres disappeared behind the sun.

I'm removing the wording "until Ceres disappeared behind the sun" on the grounds of error. Modern astronomy software confirms that on 10 Feb 1801 Ceres was still more than 90 degrees away from the sun, therefore not behind the Sun, and not even lost in the Sun's glare. (Jean Meeus, in Mathematical Astronomy Morsels, p. 300, agrees.) Conjunction with the Sun did not occur until July 1801. A much more likely guess is that Ceres, which had dimmed from magnitude 7.8 to 8.5 and was also progressively lower in the western sky each evening by the end of twilight, became too dim to be seen, but the date at which this happened depends on the aperture of Piazzi's telescope. Perhaps also Ceres became lost in the star fields between the Pleiades and Hyades. My own experience is that Ceres is difficult to track with 80mm binoculars in the best of conditions, so I favor the dimness explanation. I suggest someone find Piazzi's notebook and see what he said. - JEBrown87544 16:05, 22 July 2006 (UTC)

Another issue is the claim that this article claims Gauss described the method of least squares in 1794, two years before his first important work, the constructability of the regular 17-gon and similar polygons. I believe the actual first use of the algorithm by Gauss was in the Ceres prediction above. That would make the date 1801 or so. -- 70.57.21.66 (talk) 16:08, 18 December 2010 (UTC)

Thanks to JEBrown for explaining why Piazzi lost sight of the planetoid. I have seen it variously described (clouds, sun) and always wondered what really happened. Jfgrcar (talk) 02:49, 4 November 2011 (UTC)

He is chiefly remembered for his formulation of the method of least squares, published in 1808. Adrain certainly did not know of the work of C.F. Gauss on least squares (published 1809), although it is possible that he had read A.M. Legendre's article on the topic (published 1804).

## Wording in intro

I don't like saying LS is "a mathematical optimization technique." I would prefer something like that it is a criterion of agreement between the data and the fitted curve. Once a criterion has been decided upon, LS or whatever, one needs a mathematical optimization technique such as Levenberg-Marquandt, Gauss-Newton, etc.

Dfarrar 00:25, 2 March 2007 (UTC)

## problems with "least mean squares" in intro

There are several problems with the passage in the intro on least mean squares:

"The approach is known as least mean squares (LMS) when the number of measured data is 1 and the gradient descent method is used to minimize the squared residual. LMS is known to minimize the expectation of the squared residual, with the smallest operations (per iteration). But it requires a large number of iterations to converge."

(1) For purposes of an intro, this seems to be too much on one specific application of least squares. In particular specific numerical optimization algorithms, and problems with numerical optimization, probably do not belong in an introduction. (2) I don't see why the number of observations should be 1. (3) This appears to be from signal processing methodology, which should have been stated. Dfarrar 12:43, 8 April 2007 (UTC)

I am comforted to know that I am not the only one having trouble understanding the "least mean squares" in the intro. RE: ("Least squares" means that the overall solution minimizes the sum of the squares of the errors made in solving every single equation.), the following is not clear to me: 1. how does the "overall solution minimize" ? 2. What does it minimize? I.e., what is "the sum of the squares of the errors"? 3. There is a logical inconsistency, since if you know what the squares (^2, resumably) of the errors are, then you also know what the error is -- then why not just remove the error from the result, thereby making it correct?

Needless to say, this introductory section needs to be made more accessible, less filled with statistical jargon. —Preceding unsigned comment added by 75.203.177.90 (talk) 20:06, 4 September 2010 (UTC)

## Role of Gauss-Markov

While there may be generalizations to the nonlinear case, the basic Gauss-Markov theorem applies to the linear case. It has to do with the conditions for the least squares estimates to be best in terms of mean square error, among linear unbiased estimators. In the nonlinear case the estimator is not generally linear or unbiased. The rationale for the method needs to be developed. Part of it is computational convenience, particularly in the linear case. Also, LS estimators are maximum likelihood estimators in particular situations involving Gaussian errors. Dfarrar 14:07, 11 April 2007 (UTC)

## Gauss-Markov, put it in but do it right

The article should have a correct cross-reference to Guass-Markov, integrated into a statement of statistical properties of least squares, recognizing differences between linear and nonlinear cases. Dfarrar 22:26, 13 April 2007 (UTC)

## here is the stuff I deleted from intro

I think this is too technical for an intro, without further development. I think the intro should provide a pretty informal statement of what least squares is all about. The material will be unfamiliar to a lot of good statisticians. Where does it come from? Following is what I took out.

"A related method is the 'least mean squares (LMS) method. It occurs when the number of measured data is 1 and the gradient descent method is used to minimize the squared residual. LMS is known to minimize the expectation of the squared residual, with the smallest number of operations per iteration). However, it requires a large number of iterations to converge."

"Furthermore, many other types of optimization problems can be expressed in a least squares form, by either minimizing energy or maximizing entropy."

Dfarrar 19:11, 24 May 2007 (UTC)

## Generalize the number of parameters?

The Least squares and regression analysis section gives a generalization where there are 2 parameters, when in practice there can be an arbitrary number of parameters to estimate. An example model using 3 parameters can be found here Linear_least_squares#Applications. I'm not sure exactly how to fix this situation, other than giving multiple examples based on different analytic forms of models.

The article becomes misleading about always modelling a line because of the sub-sections that call beta the slope, and alpha the intercept. It is correct for the model presented, but at first glance, it appears that a line is the only thing that can be modelled. The Problem statement section is clear that this can estimate any number of linear parameters, but i think that explanation is too technical for people unfamiliar with the subject.

Straha 206th 21:27, 12 July 2007 (UTC)

## Perpendicular Offsets?

I see no mention of least squares to minimize perpendicular offsets, which is discussed on MathWorld. Perhaps that could be added. 155.212.242.34 13:08, 3 October 2007 (UTC)

Least squares is enormously useful and has thousands of applications. I do not think it makes sense to list these in the "see also" section, unless they are exceedingly well-known examples. Doing so would unnecessarily clutter the page. I have therefore removed the link to Vanicek analysis. --Zvika 20:05, 9 October 2007 (UTC)

## Ambiguous math notation

${\displaystyle S=\sum _{i=1}^{n}(y_{i}-f({\vec {x}}_{i},{\vec {a}}))^{2}}$

Just to clear things up, which of the following C functions is the closest to the above?

 #include <math.h>

/* The model function */
extern double f(double x, double *a);

double S(double *y, double *x, double *a, size_t n) {
double result=0.0;
size_t i;

for(i=0; i < n; i++) {
result += pow(y[i] - f(x[i], a), 2.0);
}
return result;
}

double S_alt(double *y, double *x, double *a, size_t n) {
double result=0.0;
size_t i;

for(i=0; i < n; i++) {
result += y[i] - f(x[i], a);
}
return pow(result, 2.0);
}


216.215.128.49 (talk) 11:30, 17 November 2007 (UTC)

Hi, ${\displaystyle S}$ is the sum of the squares, not the square of the sums. So, your first function is correct. One could add a pair of brackets to the formula to make that clear: ${\displaystyle S=\sum _{i=1}^{n}((y_{i}-f({\vec {x}}_{i},{\vec {a}}))^{2})}$ GluonBall (talk) 12:17, 17 November 2007 (UTC)
But the notation is perfectly standard and clear without the extra brackets, so don't. Dicklyon (talk) 16:11, 17 November 2007 (UTC)
It should also be noted that ${\displaystyle {\vec {a}}}$ and ${\displaystyle {\vec {x}}}$ have different lengths, because ${\displaystyle {\vec {a}}}$ is a list of parameters that need to be fiddled with, while ${\displaystyle {\vec {x}}}$ is a list of input values for which ${\displaystyle f()}$ should return the corresponding value from ${\displaystyle {\vec {y}}}$. Therefore, the C program needs two size arguments, not just one, and ${\displaystyle f()}$ needs to receive the size argument for ${\displaystyle {\vec {a}}}$ to keep it from using unallocated memory. This would be so much easier if the program was written in D:
 import std.math;

double S(double[] y, double[] x, double[] a, double delegate(double, double[]) f) {
double result=0.0;

assert(y.length == x.length);

for(size_t i=0; i < x.length; i++) {
result += pow(y[i] - f(x[i], a), 2.0);
}
return result;
}


## A major proposal

Wikipedia coverage of the subject of least squares analysis is currently somewhat chaotic. I would like to propose a major re-structuring on the following basis (please respond in the relevant section)

### Principal article

This one, amplified and serving as an introduction to the main articles

### Subsidiary articles

Regression Least squares
isotonic regression Coefficient of determination
Generalized linear model Curve fitting
Least-squares estimation of linear regression coefficients Least mean squares filter
Non-linear regression Least-squares spectral analysis
Partial least squares regression Linear model
Robust regression Moving least squares
Segmented regression Numerical smoothing and differentiation
Outliers
Overdetermined system
Polynomial interpolation?
Recursive least squares
Sum of squares
Trend estimation
Total least squares or errors-in-variables model
Others?
Applications?

### Articles to merge

I would agree to merge Total sum of squares and Residual sum of squares, and also Levenberg-Marquardt algorithm. I would be reluctant to merge the others. Yes, the articles will be much shorter once you delete the redundant material, but this doesn't mean that they won't be expanded in the future. I think decisions to merge should be made on the basis of maximum potential length of the page at a reasonable point in the future. Least squares is already a rather long page and I'd hate to see it get much longer, and it would if these topics were expanded at all. Also, the material on the pages Weighted least squares is more technical, and I am a big fan of compartmentalizing technical material on separate pages that refer back to more elementary pages. Cazort (talk) 21:57, 22 January 2008 (UTC)
I don't think they are appropriate to merge — any of them, e.g., the Gauss-Newton algorithm is an optimization method, while non-linear regression is something that may involve an optimization/estimation method but would also involve, e.g., prediction. — fnielsen (talk) 23:49, 22 January 2008 (UTC)

### Template

A template with two (or more) rows (regression and least squares) would be a helpful navigation aid, but I don't feel competent to construct one. Petergans (talk) 17:17, 22 January 2008 (UTC)

I think this is one of the few cases where I would not be inclined to include a template/navbox. Least squares connects and relates to a broad variety of different topics, and I think a template could potentially downplay alternatives to least squares. My biggest issue with how least squares is presented (both on wikipedia and in most textbooks) is that it's presented without enough question and skepticism, and I think a template could be dangerously moving in the direction of less skepticism. Cazort (talk) 21:47, 22 January 2008 (UTC)
I could see a navbox that breaks down LS type algos by assumption. But a lot of thought would have to go into this. Pdbailey (talk) 16:46, 24 January 2008 (UTC)

I've dug out some more links. There are three main strands Maths, Statistics and applications (mainly science). The total collect now stands as

Surely some navigation aid is desirable? This would not downplay other optimization topics, but would make it easier to access all the articles relating to least squares and regression. Petergans (talk) 14:20, 26 January 2008 (UTC)

### Notation

The notation used in some articles is clearly conflicting. For example, X is used for coefficients in some articles, for parameters in others. Can a uniform notation be agreed? I also propose that the notation of linear algebra should be the main one used. Here is a table of actual usage.

Article ind. variable observations parameter
Generalized linear model, linear model ${\displaystyle X\,}$ ${\displaystyle {\vec {Y}}\,}$ ${\displaystyle \beta \,}$
Least-squares estimation of linear regression coefficients ${\displaystyle X\,}$ ${\displaystyle {\vec {Y}}\,}$ ${\displaystyle \theta \,}$
Nonlinear regression ${\displaystyle x\,}$ ${\displaystyle y\,}$ ${\displaystyle \theta \,}$
curve fitting ${\displaystyle x\,}$ ${\displaystyle y\,}$ ${\displaystyle a,b,\,}$
linear least squares, total least squares ${\displaystyle A\,}$ ${\displaystyle b\,}$ ${\displaystyle x\,}$
non-linear least squares ${\displaystyle f\,}$ ${\displaystyle S\,}$ ${\displaystyle p\,}$
regression analysis ${\displaystyle x\,}$ ${\displaystyle y\,}$ ${\displaystyle \alpha ,\beta ...\,}$
linear regression ${\displaystyle X\,}$ ${\displaystyle Y\,}$ ${\displaystyle \beta \,}$
numerical smoothing and differentiation ${\displaystyle x\,}$ ${\displaystyle y\,}$ ${\displaystyle a\,}$
proposal ${\displaystyle X\,}$ ${\displaystyle y\,}$ ${\displaystyle p\,}$

Any objections to the proposal that this notation should be used in most articles relating to least squares, that is, posing the problem as

${\displaystyle \mathbf {Xp=y\,} }$

with solution

${\displaystyle \mathbf {{\hat {p}}=\left(X^{T}X\right)^{-1}X^{T}y} \,}$?
I disagree slightly with the idea of the problem and slightly with the proposed solution. Some of the notational differences make sense (i.e. a scalar, vector, and matrix are often typeset differently), and there are times when it makes sense to breakout the coefficients from the vector form (specifically in regression analysis) how about a proposal of one notation for scalar, one for vector, and one for matrix? Also, p is not nearly as good as ${\displaystyle \beta }$ or ${\displaystyle \theta }$, just in that the latter two are canonical. Pdbailey (talk) 16:39, 24 January 2008 (UTC)
Using X for a matrix and p for a vector is very confusing. Besides, there is no need to unify the notation. Wikipedia articles are supposed to stand on their own. As long as a given article has good and self-consistent notation, that is good enough as far as I see.
I'd suggest you focus on improving individual articles rather than make them read as one (Wikipedia is not a book). One obvious area of improvement could be adding more references to this article for example. Oleg Alexandrov (talk) 16:40, 24 January 2008 (UTC)
I can see the value of unified notation -- very often I read one article and then click on another to get further information, and having to figure out the notation anew each time is time-consuming. I have seen β, θ, and x used for the parameter in books, so I also think it should be one of those rather than p. --Zvika (talk) 18:48, 24 January 2008 (UTC)
Bear in mind that there's no way that all mathematics articles on Wikipedia can have unified notation, there's simply too many, with their own standards in their respective fields. e.g. there's frequently conflict in cross-links between engineering and physics articles; the first discipline tends to use j for ${\displaystyle {\sqrt {-1}}}$ and f as the independent variable in Fourier transforms, the other uses i and ${\displaystyle \omega }$. There's no way that complete consistency can ever be achieved. Oli Filth(talk) 18:53, 24 January 2008 (UTC)
That is obviously true. But if we have a small group of related articles, and an editor willing and able to unify the notation in them, then I really don't see what the problem is. --Zvika (talk) 19:18, 24 January 2008 (UTC)

I agree it's impossible to be wholly consistent. That's why I originally said "most" rather than "all", since I believe that some greater consistency is desirable, as indicated by Zvika. I can accept ${\displaystyle \beta }$ for the parameters, even though it clashes with stability constant (my area of specialization). I was taught upper case for matrices and lower case for vectors, bold for whole items, italics for an element. Although it won't satisfy everyone, it's also the most commonly used notation, at this time, in the least-squares subject area. Lastly I never intended a wholesale revision; it has to be applied judiciously as indicated by Pdbailey. Petergans (talk) 20:17, 24 January 2008 (UTC)

I very much disagree with using X for a matrix. The expression Xp=y is just so confusing. Use A or B for matrices. Oleg Alexandrov (talk) 04:38, 25 January 2008 (UTC)
y=Xβ is standard notation in many statistics books (e.g. Lehmann and Casella), but I agree it is confusing for people in other fields (such as myself), so I too think that it should be avoided. In electrical engineering, I believe H is often used for the matrix (e.g. the textbook by Kay), but A or B will also do. --Zvika (talk) 08:44, 25 January 2008 (UTC)

OK. ${\displaystyle \mathbf {A\beta =y} \,}$ seems to be the preferred notation. However, ${\displaystyle \mathbf {X} \,}$ appears to be standard in regression articles, so I won't change the notation in those articles. Petergans (talk) 09:34, 25 January 2008 (UTC)

Hmm, ${\displaystyle \mathbf {A\beta =y} \,}$ just looks off to me because it is different than almost all my statistics books. I the standard in statistics books is ${\displaystyle \mathbf {X} \,}$, but mathematicians who are not statisticians often prefer the early letters. I'd say the plurality of the articles above use ${\displaystyle \mathbf {X} \,}$ and you should stick with that or not change it. Pdbailey (talk) 13:59, 25 January 2008 (UTC)

Final decision: ${\displaystyle \mathbf {X\beta =y} \,}$. The clincher for me is the fact that this is much the same notation as used in linear regression. This will help to make it clear that regression and least squares are not really different topics, but different ways of looking at the same problem. Petergans (talk) 14:20, 26 January 2008 (UTC)

Wow! I really love this table. Suggest it be put in the article with the comment that notation varies among various disciplines. Jfgrcar (talk) 02:46, 4 November 2011 (UTC)

### Draft replacement for linear least squares

I have prepared a thorough revision of this article in User:Petergans/a/sandbox. The main points to note are

• Use of linear algebra notation
• Inclusion of weights right from the start
• Shift of emphasis from pure mathematics to applications
• Addition of sections relating to errors
• Significant selection of references

Unless there are strong objections to the contrary, I shall replace the article by this draft as a first step towards the overall objective of rationalizing the treatment of least squares analysis in Wikipedia. Petergans (talk) 17:17, 22 January 2008 (UTC)

That article is WAAAAY ahead of the current one. I would be in favor of replacing it; we can tweak it further then! Cazort (talk) 22:51, 22 January 2008 (UTC)
The added information in the new version is definitely an improvement. However, I wonder if it is wise to begin the article with the most general case and with a long sequence of definitions. In the proposed version, the reader has to follow through 4 display equations and the definitions of the Jacobian and residual, before getting to what least-squares is about. Even then, the definition is a general case and the usefulness might not be clear. Compare this with the current version, in which the article begins with a very simple, informal explanation of the goal ("we want to find a solution ${\displaystyle {\hat {\mathbf {x} }}}$ for the 'equation' ${\displaystyle A{\hat {\mathbf {x} }}\approx \mathbf {b} }$"). This goal is something that is good to keep in mind while working through the math. I would suggest having a short "Motivation" section with an informal discussion such as the one in the first sentence of the Definition section in the current version. Just my two cents... --Zvika (talk) 11:55, 23 January 2008 (UTC)

User:Petergans/a/sandbox Now in revised notation and expanded. I accept that the introduction is a bit fierce, but I regard it as essential a) to include weights from the start and b) to state the minimization problem in the context of modelling experimental data. Petergans (talk) 14:20, 26 January 2008 (UTC)

### Discussion

#### Computations

Can I suggest that computation details might sit most naturally in two articles,

the first treating Cholesky, QR and SVD as applied to least squares; the second giving an overview of nonlinear least squares via Newton's method, Gauss-Newton, Gradient descent, Conjugate gradient, and Levenberg-Marquardt ? Jheald (talk) 17:50, 22 January 2008 (UTC)

## Comment on proposal

I disagre with merging Gauss-Newton algorithm and Levenberg-Marquardt algorithm because they are different (albeit related) algorithms.

I am very weary of grand schemes, and I do not support the proposal above. Let's start simple. You want to rewrite linear least squares, so let's discuss that and shelve everything else for later.

In short, I feel you are proposing too many changes, I don't feel they are all justified, and that it will be easier for everybody if we proceed one step at a time to avoid wasted efforts later on. Oleg Alexandrov (talk) 04:36, 23 January 2008 (UTC)

## Another proposal

I have prepared a draft of Non-linear Least Squares in User:Petergans/b/sandbox. I propose that this article should replace, with re-name, Gauss-Newton algorithm. Petergans (talk) 12:59, 30 January 2008 (UTC)

### If not merge then what?

What is to be done about Levenberg-Marquardt algorithm? There is clearly too much repetition of basics with the proposed Non-linear least squares as it stands. I disagree with Oleg that it is a separate algorithm as the Marquardt parameter is only ever invoked in the context of a non-linear least squares refinement. After all, it is used to modify the least squares normal equations. On the other hand, Levenberg-Marquardt algorithm does contain some useful references etc. Petergans (talk) 12:59, 30 January 2008 (UTC)

## The rewrite of linear least squares

See Wikipedia talk:WikiProject Mathematics#Least squares objectionable rewrite. Oleg Alexandrov (talk) 16:03, 31 January 2008 (UTC)

## Proposal: least squares is linear algebra and statistics is an application

The article as it is now [1] is introducing a number of concepts that will only confuse a reader who is not well versed in statistics. Yet the least squares problem is just very simple linear algebra with no need for statistics whatsoever. The statistical interpretation and motivation should be kept separate. Thus, the following topics should be treated separately, for different audiences, and crossreferenced:

• basic formulation and solution of least squares problem
• use in statistics and motivation from statistics
• advanced mathematical topics, including how to actually compute the least squares solution in a numerically stable and efficient way (which is not from the textbook formulas by any means), relations to QR, SVD, and pseudoinverse

This could be in separate sections or separate articles. Thoughts? Jmath666 (talk) 01:58, 2 February 2008 (UTC)

I think this article is still OK, having some motivation, history, and a gentle introduction to how the problem is set up. It is the article on linear least squares which starts right away with an in-depth foray in statistical data estimation likely to confuse anybody who is not a specialist.
And it is the article on linear least squares which is more about linear algebra than this one. So, how about starting there? :) Oleg Alexandrov (talk) 04:32, 2 February 2008 (UTC)

Indeed, what a mess. This article might be moved to Least squares (statistics) and then some parts of linear least squares split off to Least squares (linear algebra).

Actually, I find this article more confusing than linear least squares. Maybe I am too much of a specialist. Jmath666 (talk) 05:00, 2 February 2008 (UTC) Jmath666 (talk) 05:00, 2 February 2008 (UTC)

## Least squares: implementation of proposal

which contain more technical details, but it has sufficient detail to stand on its own.

In addition Gauss-Newton algorithm has been revised. The earlier article contained a serious error regarding the validity of setting second derivatives to zero. Points to notice include:

• Adoption of a standard notation in all four articles mentioned above. This makes for easy cross-referencing. The notation also agrees with many of the articles on regression
• Weighted least squares should be deleted. The first section is adequately covered in Linear least squares and Non-linear least squares. The second section (Linear Algebraic Derivation) is rubbish.

This completes the fist phase of restructuring of the topic of least squares analysis. From now on I envisage only minor revision of related articles. May I suggest that comments relating to more than one article be posted on talk: least squares and that comments relating to a specific article be posted on the talk page of that article. This note is being posted an all four talk pages and Wikipedia talk:WikiProject Mathematics.

Petergans (talk) 09:52, 8 February 2008 (UTC)

That's a big improvement. I appreciate your hard work on this. A couple of constructive (i hope) comments and a question:
• I think the fact the distinction between linear and non-linear least squares needs to be emphasised more - those new to least-squares often find it confusing that "linear" here means linear in the parameters, not the independent variables xi, so that linear least squares includes polynomial fitting in particular. (Maybe I should go ahead and add something on this myself).
• Some material is repeated in more than one article, in particular least squares#Differences between linear and non-linear least squares and non-linear least squares#Differences between linear and non-linear least squares. This is hard to maintain (the two will tend to drift apart) so I think a cross-reference is much preferable.
• Those sections say both that divergence is "quite rare" in LLSQ (unlike NLLSQ) and that it's non-iterative. I grant that LLSQ is not immune from numerical issues (especially if poorly programmed) but how can a non-iterative process ever be said to diverge?
Qwfp (talk) 12:56, 8 February 2008 (UTC)
I have not always been sure where to place material. Do you think it would be enough for least squares#Differences between linear and non-linear least squares to be referenced in the non-linear article? Also I moved the numerical example from least squares into linear least squares but I'm not sure if that was the right thing to do.
An earlier draft made the distinction as follows: linear - ${\displaystyle f=\alpha +\beta x+\gamma x^{2}}$, non-linear ${\displaystyle f=\alpha +\beta x+\gamma ^{2}x}$ but I discarded it as too confusing!
LLSQ can diverge if the sum of squares increases. I think I have seen this happen, but obviously it would have been due to ill-conditioning and/or round-off error. Petergans (talk) 14:31, 8 February 2008 (UTC)
I removed the part about linear least squares not being iterative. "Iterative methods for linear systems" is a fairly huge section of numerical linear algebra. The Jacobi method and Gauss-Seidel method are fairly ancient examples of this. A particularly common 20th century example is Conjugate gradient method, but Lanczos methods are popular in the research community (sometimes called Krylov methods). JackSchmidt (talk) 14:55, 8 February 2008 (UTC)
I'm not sure how to smoothly add this to the article, but an incredibly important aspect of least squares is solving under-determined systems with the solution minimized. The techniques are more or less identical, but the ability to choose the solution this way is important for many other ideas and algorithms. At some point though perhaps the right thing to do is to find the linear solver articles and ensure that important aspects are mentioned in both places. I agree with Qwfp though that a particular good discussion should only be in one article with references from the others. Every article can have a sentence or two about it in their own language, though. JackSchmidt (talk) 15:04, 8 February 2008 (UTC)
Most versions of linear least squares I've seen on wikipedia jump straight to the normal equations. I admit I don't understand why they are mentioned at all on least squares, and I think they are overemphasized at linear least squares. The normal equations are nice for proofs, but so is the SVD. The benefit of a more SVD based presentation is that the geometry is more evident and the mathematical significance of the solutions is clear. By geometry, I mean the inner product which defines angles is more obviously relevant, so that for instance the "weight" matrix now can be understood as describing the cost of movement in a non-isotropic medium. This was a problem with the pre-Petergans version as well. JackSchmidt (talk) 15:26, 8 February 2008 (UTC)
A minor problem is that Linear least squares might lead someone to believe calculating the SVD is a wise way to solve least squares problems. I have no objection to this, as there are many methods where the truncated SVD provides the theoretical analysis of the algorithm, but these methods do not produce the SVD as a by-product. Some of them are just slightly-modified QR factorizations. "Incomplete" methods often fall into this category. In some sense this is how the Jacobi method variants work too. At any rate, "this method is the most computationally intensive" scares me. Perhaps it is not so far from the truth as these truncated methods are usually applied to large problems, but they are applied only because they are so amazingly cheaper computationally than the other methods. JackSchmidt (talk) 15:26, 8 February 2008 (UTC)
I agree with this. Truncation by SVD is a very useful form of data compression, but surely too technical for WP? The words "computationally intensive" are carried over from the earlier versions. I left them there because calculating the singular values is indeed a somewhat lengthier process compared with the other methods.

Petergans (talk) 20:08, 8 February 2008 (UTC)

### Gauss-Newton

I wrote a comment on the rewrite of that article at Talk:Gauss-Newton algorithm#Response to rewrite. Oleg Alexandrov (talk) 03:24, 9 February 2008 (UTC)

## Constraints

The LASSO method is a welcome addition; it draws attention to the fact that WP has little on constrained optimization. I did consider including, in least squares, something on non-negativity constraints (applied by a change of variable such as a=b2) and equality constraints, but decided that it was too technical. Any comments? Petergans (talk) 09:22, 11 February 2008 (UTC)

Disagree. What has Lasso got to do with least squares? AFAIK it's just another ad hoc functional that some people like for some particular applications. Especially since it involves an L1 norm and does not belong here. Non-negativity constraints also should be treated as a form of regularization - these topics should be moved to [Regularization](mathematics or machine learning), although those sections are a bit of a disaster at present. 68.101.160.213 (talk) 07:27, 17 April 2008 (UTC)

## Error?

Under the section "Solving the Least Squares" beneath the text "r_i=y_i-f(x_i,\boldsymbol \beta)\, the gradient equations become" I believe there is an extra r_i in the equation. It's been a while since I've studied math, so I apologize if my interpretation is incorrect. —Preceding unsigned comment added by 75.180.35.23 (talk) 05:22, 2 March 2008 (UTC)

The text is correct. Petergans (talk) 07:36, 2 March 2008 (UTC)
Yep, ${\displaystyle {\frac {d}{dx}}(f(x))^{2}=2f(x)f'(x).}$ —Preceding unsigned comment added by Oleg Alexandrov (talkcontribs) 16:12, 2 March 2008 (UTC)

## An mistake on the index?

Main article, subtopic Linear Least Squares on the line with the double summation Should the indexing for j be from 1 to m instead of to n? I am reading this for the first time but it makes sense to be m equations. if it indeed is to n could someone please give short explaination. Thanks abunch!

Tao —Preceding unsigned comment added by 159.83.154.204 (talk) 20:47, 15 September 2008 (UTC)

You were correct. Correction now made. Melcombe (talk) 09:13, 16 September 2008 (UTC)

## Equations and observations should be equal?

It says "n equations for the m observations". In my opinion there is n measurements, n observations, and n equations (one for each measurement). Or is there a reason one of these would be something else than n? --82.203.173.200 (talk) 23:40, 22 May 2009 (UTC)

## Unclear sentence

I find the following sentence unclear: The methods of least squares and regression analysis are conceptually different. As I understand it the method of least squares is a type of regression analysis, so the two cannot be said to be different! Maybe the auther meant to say something else with this sentense? —Preceding unsigned comment added by 130.225.22.254 (talk) 16:25, 23 July 2009 (UTC)

## Upper case or lower case phi

In section "Linear least squares", there is a math png, where the phi looks upper case. Just below in unicode plain text phi is in lower case. Velle (talk) 15:15, 13 January 2010 (UTC)

## Is LASSO an acronym ?

If LASSO is an acronym, it needs to be spelled out. If LASSO is not an acronym, that needs to be mentioned. 74.195.224.102 (talk) 16:39, 23 September 2010 (UTC)

Lasso = Least absolute shrinkage and selection operator. ~ Elitropia (talk) 09:38, 24 September 2010 (UTC)

## the complex case

The least squares method finds its optimum when the sum, S, of squared residuals
${\displaystyle S=\sum _{i=1}^{n}{r_{i}}^{2}}$
is a minimum. A residual is defined as the difference between the value predicted by the model and the actual value of the dependent variable
${\displaystyle r_{i}=y_{i}-f(x_{i},\beta ).}$

This is not quite accurate when the data have complex values; in that case the quantity after the Σ can be expressed as the product of ${\displaystyle r_{i}}$ with its conjugate. This distinction has consequences! —Tamfang (talk) 21:37, 26 October 2010 (UTC)

## This concept is simple enough that it should be illustrated in such a way that a high school senior can understand it

The least squares method does not have to be expressed in such a complex, jargon-laden way. It is a pretty simple concept that doesn't really benefit from vocabulary brought in from esoteric graduate mathematics. If you can understand the way the least squares method was presented here, then you don't need to be reading about it... —Preceding unsigned comment added by 71.22.123.181 (talk) 09:27, 2 January 2011 (UTC)

## Fused Lasso

If you know something about fused lasso would you add it to the article? Thank you -- Leegrc (talk) 15:10, 27 January 2014 (UTC)

## minimizing error/non linear case

say x1,x2,x3...xn are values measured of a function f(x,y)=0 with corresponding y1,y2,y3..yn, A curve that passes through all these points is given by: Π((ai(x-xi))^2+(bi(y-yi))^2)=0 implying a fantastic coef of determination/zero chi-square. Can such solutions be considered valid for modelling purposes? logically no because only xi,yi satisfy the above equation, but can someone think of a way to add a variation to this? Once one plugs in known xi and know yi the function is of the form f(xn,y)=0 where we know xn but do not know y for that xn Can one solve it by equating the above to (or it's integral) to e(x,y), where e(x,y) is the error in measurement and solving for (e`(x,y))^2 being a minima etc?

In other words can one say Π(((x-xi))^2+((y-yi))^2)=e^2 where e is the error and then minimize the error? — Preceding unsigned comment added by Alokdube (talkcontribs) 13:53, 12 February 2014 (UTC)

-Alok 13:37, 12 February 2014 (UTC) — Preceding unsigned comment added by Alokdube (talkcontribs)