Talk:Simple linear regression

WikiProject Statistics (Rated C-class, Mid-importance)

This article is within the scope of the WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion. This article has been rated as C-Class on the quality scale and as Mid-importance on the importance scale.

WikiProject Mathematics (Rated C-class, Mid-importance)

This article is within the scope of WikiProject Mathematics, a collaborative effort to improve the coverage of Mathematics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks. Mathematics rating: C-Class, Mid-importance. Field: Probability and statistics.


I edited the page for math content (although I'm no expert on LaTeX, either), and added a section on inference and a numerical example. I don't know if the numerical example is helpful. EconProf86 22:04, 31 July 2007 (UTC)

Regression articles discussion July 2009

A discussion of content overlap of some regression-related articles has been started at Talk:Linear least squares#Merger proposal but it isn't really just a question of merging and no actual merge proposal has been made. Melcombe (talk) 11:45, 14 July 2009 (UTC)

Linear regression assumptions

Perhaps there should be a straightforward section on assumptions, and how to check them, like this:

   * 1) That there is a linear relationship between independent and dependent variable(s).
    How to Check: Make an XY scatter plot, then look for data grouping along a line, instead of along a curve.
   * 2) That the data are homoskedastic, meaning that errors do not tend to get bigger (or smaller), as a trend, as independent variables change.
    How to Check: Make a residual plot, then see if it is symmetric, or make an XY scatter plot, and see if the points do not tend to spread as they progress toward the left, or toward the right. If the scatter plot points look like they get farther apart as they go from left to right (or vice versa), then the data are not homoskedastic.
   * 3) That the data are normally distributed, which would meet the three following conditions:
   * a) Unimodal:
    How to Check: Make a histogram of the data, then look for only one major peak, instead of many.
   * b) Symmetric, or Unskewed Data Distribution:
    How to Check Skewness: Make that same histogram, then compare the left and right tails - Do they look to be the same size? Or is the graph 'leaning' one way or another?
   * c) Kurtosis is approximately Zero:
    How to Check: Make that same histogram, then compare its peakedness to a normal distribution. Is it 'peakier', or less 'peaky'? Are the data points more clustered than a normal distribution?

Briancady413 (talk)
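The histogram checks for symmetry and peakedness listed above can also be approximated numerically. The following is only a rough sketch: the helper names and the sample values are made up for illustration, and the moment-based skewness and excess kurtosis shown are one of several possible definitions.

```python
# Rough numerical companions to the histogram checks (illustrative only).

def mean(values):
    return sum(values) / len(values)

def central_moment(values, k):
    # k-th central moment, using the divisor n.
    m = mean(values)
    return sum((v - m) ** k for v in values) / len(values)

def skewness(values):
    # Third standardized moment: 0 for a perfectly symmetric sample.
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    # Fourth standardized moment minus 3: 0 for a normal distribution,
    # negative for flatter samples, positive for more peaked ones.
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

sample = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up values: symmetric and flat
print("skewness:", skewness(sample))
print("excess kurtosis:", excess_kurtosis(sample))
```

Values near zero for both quantities are compatible with a normal shape, although they do not prove it.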

1) is false. That's not what 'linear' means. The linearity refers to a linear relationship between y and the PARAMETERS, i.e. the alpha and the betas. A quadratic relationship between y and x is still (a little paradoxically) a 'linear regression'. In mathematics an equation is said to be linear or nonlinear if it's linear or nonlinear in the UNKNOWNS, and in the regression setting it's the alpha and betas that are unknown.

Blaise (talk) 14:00, 13 September 2013 (UTC)
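The point about linearity in the parameters can be illustrated with a small sketch. The data below are made up and the function name fit_line is hypothetical; the relationship y = 3 + 2x^2 is quadratic in x, yet the ordinary straight-line formulas recover alpha and beta once x^2 is used as the regressor.

```python
def fit_line(z, y):
    """Ordinary least squares for y = alpha + beta * z (one regressor)."""
    n = len(z)
    z_bar = sum(z) / n
    y_bar = sum(y) / n
    s_zy = sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, y))
    s_zz = sum((zi - z_bar) ** 2 for zi in z)
    beta = s_zy / s_zz
    alpha = y_bar - beta * z_bar
    return alpha, beta

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [3.0 + 2.0 * xi ** 2 for xi in x]  # made-up data, quadratic in x

# The model is nonlinear in x but linear in the unknowns alpha and beta,
# so regressing y on z = x^2 is still a linear regression.
z = [xi ** 2 for xi in x]
alpha, beta = fit_line(z, y)
print(alpha, beta)
```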

Broken link to sample correlation

The link Correlation#Sample_correlation in the first section is broken (there's no such anchor in the page).

Where should this point to? Correlation#Pearson.27s_product-moment_coefficient ?

Nuno J. Silva (talk) 18:21, 21 June 2010 (UTC)


Can anyone more mathematically literate tell me if there's a reason the α and β subscripts of standard error s have carets on top of them in the section "normality assumption", but not in the numerical example? - Kyle at 6:35pm CST, 4 April 2011 —Preceding unsigned comment added by (talk) 23:38, 5 April 2011 (UTC)

Numerical Example

There is an unfortunate basic error in the data. If plotted, an odd cadence in the x-positions can be noticed. This is because the original data were in inches, and the heights have been converted to metres, with rounding to the nearest centimetre. This is wrong, and has visible effects. Also, the line fit parameters change.

 Slope       Const.     Heights used
 61.2722     -39.062    wrongly rounded to the nearest centimetre
 61.6746     -39.7468   converted without rounding

And naturally, all the confidence intervals, etc. change as well. The fix is simple enough. Replace the original x by round(x/0.0254)*0.0254. I had reported this problem in 2008, but in the multiple reorganisations of the linear regression article, this was lost. There is not much point in discussing fancy calculations if the data are corrupt.
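The suggested fix can be tried out with a short sketch. Two assumptions are made here: the heights are taken to be the example's centimetre-rounded values (58 to 72 inches converted to metres), which are not quoted in this thread, and the weights are the pound figures listed in this thread divided by 2.20462234. The function name fit_line is hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = alpha + beta * x."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    beta = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    alpha = (sy - beta * sx) / n
    return alpha, beta

# ASSUMPTION: heights as rounded to the nearest centimetre in the example.
heights_m = [1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
             1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83]
# ASSUMPTION: weights back-computed from the pound values in this thread.
weights_kg = [52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
              63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46]

# The proposed repair: recover whole inches, then convert without rounding.
inches = [round(h / 0.0254) for h in heights_m]
heights_fixed = [i * 0.0254 for i in inches]

alpha_rounded, beta_rounded = fit_line(heights_m, weights_kg)
alpha_fixed, beta_fixed = fit_line(heights_fixed, weights_kg)
print("rounded heights:", beta_rounded, alpha_rounded)
print("repaired heights:", beta_fixed, alpha_fixed)
```

Under these assumptions the recovered heights are the whole inches 58 to 72, which is what produces the uneven spacing of the rounded values.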

Later, it occurred to me to consider whether the weights might have been given in pounds. The results were odd in another way. Using the conversion 1 kg = 2.20462234 lb used in the USA, the weights are

115.1033 117.1095 120.1078 123.1061 126.1044 129.1247 132.123 135.1213 139.1337 142.132 146.1224 150.1348 154.1472 159.1517 164.1562
114.862  116.864  119.856  122.848  125.84   128.854  131.846 134.838  138.842  141.834 145.816  149.82   153.824  158.818  163.812

The second row is for the approximate conversion 1 kg = 2.2 lb. I am puzzled by the fractional parts. NickyMcLean (talk) 22:43, 5 September 2011 (UTC)
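The two rows and their fractional parts can be tabulated with a few lines. This is a sketch under one assumption: the kilogram values are back-computed from the pound figures above and are not quoted in this thread.

```python
# ASSUMPTION: kilogram weights back-computed from the pound rows above.
weights_kg = [52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
              63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46]

LB_PER_KG = 2.20462234     # conversion factor quoted in the comment
LB_PER_KG_APPROX = 2.2     # approximate factor used for the second row

row_exact = [round(w * LB_PER_KG, 4) for w in weights_kg]
row_approx = [round(w * LB_PER_KG_APPROX, 3) for w in weights_kg]

# Distance of each converted weight from the nearest whole pound.
offsets = [round(lb - round(lb), 4) for lb in row_exact]

for kg, lb, off in zip(weights_kg, row_exact, offsets):
    print(kg, lb, off)
```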

Broken link to Total least squares

The link goes to Deming regression instead, not the total least squares page. LegendCJS (talk) 16:48, 28 September 2011 (UTC)

By using calculus

\text{Find }\min_{\alpha,\,\beta}Q(\alpha,\beta),\text{ where } Q(\alpha,\beta) = \sum_{i=1}^n\hat{\varepsilon}_i^{\,2} = \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2

It is not obvious to me how to solve that equation to derive the forms below it in the "Fitting the regression line" section. The text says "By using either calculus, the geometry of inner product spaces or simply expanding to get a quadratic in α and β, it can be shown that the values of α and β that minimize the objective function Q are..."

How do I use these methods to show that? I'd like to see this expanded. Spell out the steps in the derivation more explicitly. Perhaps I could figure out how to do it using the "geometry of inner product spaces" method if I read the linked article. To solve with calculus, I would differentiate by α and β and find where the derivative was equal to zero and the second derivative was positive. I... forgot how to do this with two variables, and I especially don't know how to do this with that summation.

Ideally, I'd like to see a link to an article that describes these methods (like the "geometry of inner product spaces" method), and also at least some of the intermediate steps in the derivation. Gnebulon (talk) 02:57, 9 November 2011 (UTC)

It's a matter of ploughing ahead. The general idea begins with minimising

 E = \sum_{i=1}^N [y_i - f(x_i)]^2

for N data points, so the plan is to minimise E by choice of \alpha, \beta in

 E = \sum_{i=1}^N [y_i - (\beta x_i + \alpha)]^2

The first step is to expand the contents of the summation, thus

 E = \sum_{i=1}^N [y_i^2 - 2y_i(\beta x_i + \alpha) + (\beta x_i + \alpha)^2]

There are endless variations on this, with \alpha, \beta, a, b (or vice-versa) and m, c.

Then further expansion,

 E = \sum_{i=1}^N [y_i^2 - 2y_i\beta x_i - 2y_i\alpha + \beta^2 x_i^2 + 2\beta x_i\alpha + \alpha^2]

Now apply the rules of the calculus, with a separate (partial) differentiation for each of the parameters. As usual, the extremum is to be found where the slope has the value zero. I prefer to avoid the horizontal bar in dy/dx as it is not a normal sort of fraction, so you shouldn't cancel the ds for example. But that's just me.

Anyway,

 \frac{\partial E}{\partial\alpha} = \sum_{i=1}^N [0 - 0 - 2y_i + 0 + 2\beta x_i + 2\alpha]

Setting this to zero, the twos can be factored out, so

 \sum_{i=1}^N [-y_i + \beta x_i + \alpha] = 0

Remembering that \sum_{i=1}^N \alpha = N\alpha, a rearrangement gives

 \alpha = \frac{\sum (y_i - \beta x_i)}{N}

or equivalently,

 \alpha = \bar{y} - \beta \bar{x}

Which is to say that the line (of whatever slope) goes through the average point (\bar{x},\bar{y}) because

 \bar{y} = \alpha + \beta \bar{x}

Notice that the second differential with respect to \alpha is constant, 2N: a positive number. So this extremum is a minimum.

Minimising with respect to \alpha is the first half. The second is to minimise with respect to \beta and now it becomes clear why collecting the terms for E would have been a waste of effort.

 \frac{\partial E}{\partial\beta} = \sum_{i=1}^N [0 - 2y_i x_i - 0 + 2\beta x_i^2 + 2x_i \alpha + 0]

As before, this is set to zero and the twos can be factored out, so

 \sum_{i=1}^N [-y_i x_i + \beta x_i^2 + x_i \alpha] = 0

The second differential with respect to \beta is 2\sum_{i=1}^N x_i^2, which must be positive (unless every x_i is zero), and so this extremum is also a minimum. (Strictly, with two parameters one should also check that N \sum x_i^2 - (\sum x_i)^2 > 0, which holds whenever the x_i are not all equal.)

Remembering that

 \sum_{i=1}^N (a_i + b_i + c_i) = \sum_{i=1}^N a_i + \sum_{i=1}^N b_i + \sum_{i=1}^N c_i

the condition becomes

 \sum_{i=1}^N \beta x_i^2 + \sum_{i=1}^N x_i \alpha = \sum_{i=1}^N y_i x_i

Remembering that \sum_{i=1}^N c x_i = c\sum_{i=1}^N x_i and substituting for \alpha,

 \beta \sum x_i^2 + (\bar{y} - \beta \bar{x})\sum x_i = \sum y_i x_i

Multiplying out and re-arranging, with \bar{x} = (\sum x_i)/N and \bar{y} = (\sum y_i)/N,

 \beta \left[ \sum x_i^2 - \frac{(\sum x_i)^2}{N} \right] = \sum y_i x_i - \frac{(\sum y_i)(\sum x_i)}{N}

 \beta = \frac{\sum y_i x_i - \frac{(\sum y_i)(\sum x_i)}{N}}{\sum x_i^2 - \frac{(\sum x_i)^2}{N}}

Multiplying top and bottom by N renders this less typographically formidable. Other variations are possible via the use of \bar{x} and \bar{y} as appropriate.

 \beta = \frac{N \sum y_i x_i - (\sum y_i)(\sum x_i)}{N \sum x_i^2 - (\sum x_i)^2}

NickyMcLean (talk) 04:54, 22 September 2012 (UTC)
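The closed-form result of the derivation above can be checked numerically. The sketch below uses made-up data and hypothetical names; it computes beta from the final formula, alpha from the mean point, and then confirms that both partial derivatives of E vanish at that solution.

```python
def fit_line(x, y):
    """Closed-form least squares for y = alpha + beta * x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    # beta = (N*sum(xy) - sum(x)*sum(y)) / (N*sum(x^2) - (sum(x))^2)
    beta = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    # alpha = y_bar - beta * x_bar: the line passes through the mean point.
    alpha = sum_y / n - beta * sum_x / n
    return alpha, beta

def gradient(alpha, beta, x, y):
    """Partial derivatives of E = sum (y - alpha - beta*x)^2."""
    d_alpha = sum(-2.0 * (yi - alpha - beta * xi) for xi, yi in zip(x, y))
    d_beta = sum(-2.0 * xi * (yi - alpha - beta * xi) for xi, yi in zip(x, y))
    return d_alpha, d_beta

x = [1.0, 2.0, 3.0, 4.0]  # made-up data
y = [2.0, 3.0, 5.0, 4.0]

alpha, beta = fit_line(x, y)
d_alpha, d_beta = gradient(alpha, beta, x, y)
print(alpha, beta)
print(d_alpha, d_beta)
```

Both derivatives come out at zero up to rounding error, which is the stationarity condition the derivation starts from.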

Fitting the regression line

In ==Fitting the regression line== shouldn't the expression immediately before the one containing Cov & Var have 1/n as the multiplier before the - x.y and before the - x^2 terms? This is the form of the expression which is often used in computing to generate a straight-line fit to a set of "bumpy" data. As expressed here it does not work and moreover does not follow mathematically from the preceding expression! However, with the 1/n terms in place it appears to produce the correct result. 1/n is unique to these two terms ONLY and therefore does NOT cancel?!? But then I am a physicist and not a mathematician, so I may have missed something??? Chris B. 6:45pm PST on 24th. Mar. 2013. - Preceding unsigned comment added by (talk) 00:48, 25 March 2013 (UTC)

Have a look at the section below "By using calculus" which steps through a derivation and also mentions multiplying top and bottom by N. Incidentally, I also studied Physics. NickyMcLean (talk) 02:24, 25 March 2013 (UTC)

Notation can lead to mistakes

The expression  \frac{ \operatorname{Cov}[x,y] }{ \operatorname{Var}[x] } can lead to mistakes if you use the sample variance instead of the variance. Since every spreadsheet gives you the sample variance, it is likely that people will use this formula incorrectly (as one of my students just did in an assignment). It would be better to stress that Var is not the sample variance. - Preceding unsigned comment added by 25pietro (talk • contribs) 07:11, 13 June 2014 (UTC)
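A small check of this concern, using made-up data and hypothetical helper names, suggests where the error arises. The ratio Cov/Var appears to be unaffected when the covariance and the variance share the same divisor (both n or both n - 1); the slope goes wrong when the two are mixed, for example a covariance with divisor n and a spreadsheet-style sample variance with divisor n - 1.

```python
def centered_sums(x, y):
    """Return sum((x - x_bar)(y - y_bar)) and sum((x - x_bar)^2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy, s_xx

def slope(x, y, cov_divisor, var_divisor):
    # Slope as Cov[x, y] / Var[x] with explicitly chosen divisors.
    s_xy, s_xx = centered_sums(x, y)
    return (s_xy / cov_divisor) / (s_xx / var_divisor)

x = [1.0, 2.0, 3.0, 4.0]  # made-up data
y = [2.0, 3.0, 5.0, 4.0]
n = len(x)

both_population = slope(x, y, n, n)       # divisor n for both
both_sample = slope(x, y, n - 1, n - 1)   # divisor n - 1 for both
mixed = slope(x, y, n, n - 1)             # inconsistent divisors

print(both_population, both_sample, mixed)
```

In the mixed case the slope is scaled by (n - 1)/n, so the discrepancy is largest for small samples.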

beta hat

I was confused by the formula for \hat\beta. I wonder if the second one should have more parentheses around the sums, like  \frac{ \sum_{i=1}^{n}{x_{i}y_{i}} - \frac1n (\sum_{i=1}^{n}{x_{i}})(\sum_{j=1}^{n}{y_{j}})}{ \sum_{i=1}^{n}({x_{i}^2}) - \frac1n (\sum_{i=1}^{n}{x_{i}})^2 } (as product has precedence over summation), but I am neither a mathematician nor English, so perhaps I do not know the conventions. Can someone look at it and possibly fix it? I know it is logical to at least assume the parentheses, but this is introductory and should be as precise as possible. Drabek (talk) 19:08, 21 March 2014 (UTC)