- 1 Endogenous/exogenous
- 2 Name: Regression? Or Linear Models, or Linear Statistical Models, etc.
- 3 Regression articles discussion July 2009
- 4 Rename?
- 5 Merge Trend estimation to here
- 6 Typography Disambiguation
- 7 Re: Assumptions
- 8 Example in the introductory section
- 9 Epidemiology example
- 10 Likely inaccurate portrayal of weighted linear regression
- 11 Should this article be renamed?
- 12 Move?
- 13 First sentence
- 14 Meaning of "regression coefficient"
- 15 Random variables
- 16 Missing terms: point of averages, regression line, graph of averages
- 17 big
- 18 a misnomer
- 19 Median-median line?
- 20 Statistics or linear algebra
- 21 Least distance fit
- 22 Response variables
- 23 Errors and Residuals, confusing
The uses of "endogenous" and "exogenous" variables here are not consistent with the only way I've ever heard them used. Exogenous means outside of the model -- i.e., a latent/hidden variable. Endogenous describes a variable that IS accounted for by your model, be it independent OR dependent. See Wiki entry for "exogenous," which supports this.
I recommend that these two words be deleted from the list of alternate names for predictor and criterion variables. (unsigned comments by 184.108.40.206)
- I have found that economists use exogenous in linear models to mean non-response variables. The contrast is with endogenous variables, which appear on the right-hand side of one or more other variables' equations but also on the left-hand side of their own regression. Pdbailey 00:06, 7 October 2006 (UTC)
- In economics exogenous means something determined outside of the model, such as X which is determined by God, or random chance, or anything but not the model itself. On the other hand Y is endogenous, since it is determined within the model, via the equation Y = Xβ + ε. This terminology becomes more useful when discussing simultaneous equations models. Also within the context of IV whenever one of the X's is correlated with ε's we will call such X endogenous as well. Stpasha (talk) 19:10, 30 June 2009 (UTC)
I would like to add that in economic models there is another type of variable: the predetermined variable. Predetermined variables, as the name implies, are often lagged endogenous variables or lagged dependent variables. —Preceding unsigned comment added by Daonng (talk • contribs) 06:54, 10 May 2011 (UTC)
Name: Regression? Or Linear Models, or Linear Statistical Models, etc.
There are substantial portions of the literature which have moved away from the use of the term "regression". The term "regression" is used for historical reasons but does not capture the meaning of what is actually going on.
Terms such as "Linear Models", "Linear Statistical Models" are becoming at least as widely used as "linear regression" in the literature, and their meaning is more descriptive of what is actually going on. I think we ought to consider renaming this article and have "Linear regression" forward to the new page. At the very least, we ought to discuss the issues regarding naming on this page. Cazort 19:24, 17 October 2007 (UTC)
The term 'linear model' is wider than 'linear regression' as the latter implies that the predictor variable is numeric, while the former allows for numeric or categorical variables (i.e. an analysis of variance model.) Blaise (talk) 13:07, 31 March 2013 (UTC)
Regression articles discussion July 2009
A discussion of content overlap of some regression-related articles has been started at Talk:Linear least squares#Merger proposal but it isn't really just a question of merging and no actual merge proposal has been made. Melcombe (talk) 11:33, 14 July 2009 (UTC)
- I wonder what do you all think about renaming this article into the Linear regression model? It seems to me that such name better expresses the topic of the article: some people use the word “regression” to refer to the process of estimation, while for others “regression” means the statistical model itself; however combination “regression model” is unambiguous. ... stpasha » talk » 01:17, 21 July 2009 (UTC)
- I reopened this topic for discussion at the WPStatistics talk page. Please leave your comments there. // stpasha » 23:54, 21 April 2010 (UTC)
Merge Trend estimation to here
The article Trend estimation does not go beyond linear models and contains much that is actually generic for linear regression and dealt with better here. It appears to me that Trend estimation could simply become a redirect to Linear regression#Trend line (which might be renamed Linear regression#Trend estimation) after merging any useful stuff from there to here. --Lambiam 19:17, 27 July 2009 (UTC)
- I think it would be better left separate, but changed to better reflect actual trend estimation rather than saying it is essentially equivalent to least squares regression, which it shouldn't be. Melcombe (talk) 10:50, 29 July 2009 (UTC)
- But who is going to execute that change and when? Don't you agree that until someone actually creates an article on trend estimation going beyond linear regression to find a trend line (which is more specialized than least squares regression, which might apply to other than linear trend models), the reader is better served by the proposed redirect? --Lambiam 12:48, 2 August 2009 (UTC)
But isn't Linear regression just a tool used in Trend estimation? How can the overarching topic be listed under a tool? It would be like putting an article on trees or woodworking inside another article that talks only about hammers. You use hammers to work wood, but you also use saws and other things that don't fit under the category of hammers; nevertheless, saws and the other tools are key to woodworking. Therefore, just put a link for Trend estimation at the bottom of the Linear regression page. If people want to read it, they can click the link. ~ Talon SFSU 12 September 2009
No, it isn't. It's a very general model for data. Trend estimation is just one application. Interpolation is another. Also, multilevel models (random effects) can be viewed as composed of multiple linear regression models. Blaise (talk) 13:14, 31 March 2013 (UTC)
Typography Disambiguation
The ' character is used in several different contexts without clarification.
The contextual meaning of the character should be explicitly stated, whether it means "transpose" or is used to aggregate individual variables into rows and vectors.
Use of the T notation would be less ambiguous in all cases.
- Aah, I see, they dropped the second subscript to denote the entire row.
- So the confusion still derives from a lack of explicit definition of the typography!
- The article states that x_i is a p-vector in the first sentence of the Definition section; then it defines the ' notation right after the formula: it says "x_i'β is an inner product between two vectors". Maybe we could state more explicitly that ' denotes transposition.
- As for the T notation, it isn't less ambiguous, as it could be mistaken for raising the matrix to the T-th power. ... stpasha » talk » 14:52, 31 July 2009 (UTC)
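For readers following this thread, the two notations under discussion can be written out side by side; this is a minimal illustration under the usual conventions, not text quoted from the article:

```latex
% The prime and the superscript T both denote transposition, so
% x_i'\beta and x_i^{\mathsf{T}}\beta are the same inner product:
\[
  x_i'\beta \;=\; x_i^{\mathsf{T}}\beta \;=\; \sum_{j=1}^{p} x_{ij}\,\beta_j ,
\]
% i.e. the scalar formed from the p-vector of regressors x_i and the
% parameter vector beta.
```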
Re: Assumptions
Amadhila Leonard, a student from the University of Namibia (Ogongo Campus), thinks this could be written to be more usable to more people, along the lines of the following:
- 1) Linear relationship between independent and dependent variable(s)
- How to Check: Make an XY scatter plot, then look for data grouping along a line, instead of along a curve.
- 2) Homoskedastic, meaning that, as independent variables change, errors do not tend to get bigger or smaller.
- How to Check: a residual plot is symmetric, or points on an XY scatter plot do not tend to spread toward the left, or toward the right.
- 3) Normal Distribution of Data, which meets the three following conditions:
- a) Unimodal:
- How to Check: Make a histogram, then look for only one major peak, instead of many.
- b) Symmetric, or Unskewed Data Distribution:
- How to Check: Make that histogram, then compare the left and right tails for size, etc.
- c) Excess kurtosis is about zero:
- How to Check: Make that histogram, then compare its peakedness to a normal distribution.
- These assumptions are unnecessary for linear regression, that is, they are too strong. Well, except for the first one. But the “making an XY scatterplot” recipe really works only in the case of simple linear regression. Besides, the approach suggested here contradicts the WP:NOTHOWTO policy. … stpasha » 19:52, 30 November 2009 (UTC)
Assumption 1) has been misunderstood here (but not in the article) to mean a linear relationship between X and Y. That's not what's meant by linear. If Y is proportional to X^2 it's still a linear model, because Y is linearly related to the betas. THAT'S the relationship that must be linear. Also, in the usual case Y is normally distributed (the X's don't have to be) and this implies unimodality, symmetry and kurtosis equals zero. Blaise (talk) 13:23, 31 March 2013 (UTC)
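To make the "linear in the betas" point above concrete, here is a small illustrative sketch (my own example, not taken from the article): a model that is quadratic in x is still fitted by ordinary linear least squares, because it is linear in the coefficients.

```python
import numpy as np

# Illustrative data: y depends on x quadratically, plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.3, size=200)

# Design matrix with columns 1, x, x^2.  The model is still *linear*
# in the betas, so ordinary least squares applies unchanged.
X = np.column_stack([np.ones_like(x), x, x**2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # approximately [1.0, 2.0, -0.5]
```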
This article, as with virtually all mathematically oriented articles on Wikipedia, has been written BY and FOR people who already know the material but are struggling with communicating it. What the group of you has forgotten is that people who come to Wikipedia are NOT mathematics experts (real or self-imagined) and need a clear explanation of the subject. This article is so filled with jargon and links to other pages that the process of trying to form an understanding of the subject is nearly impossible. Please think about this, and consider translating this material for the audience that uses Wikipedia. Before I am condescended to by the mathematical cognoscenti, I will merely observe that I hold a Ph.D. myself, albeit in a different field. Give it some thought, folks! — Preceding unsigned comment added by 220.127.116.11 (talk) 00:23, 19 June 2014 (UTC)
I absolutely agree with the previous comment. This article is not understandable by anyone who does not already know what is going on. In particular, I suggest it be written without matrix notation; anyone who understands matrix notation will just look it up in the textbook they learned matrix notation from. This is an important topic for many people who have no idea what a matrix is - and they shouldn't need to learn about matrices to understand it. (Despite the fact that it is so "simple" using matrices - if you already know matrices.) David Poole (talk) 10:51, 29 January 2015 (UTC)
Example in the introductory section
Shouldn't it be mentioned in the example paragraph that, in general, predictor variables of type x, x^2, x^3 and so on are mutually correlated? I know that this recipe is frequently given, but I think interpreting the results without accounting for these correlations makes it a dangerous recipe. Perhaps a hint on how to normalize the variables to a certain interval (which might be useful for numerical reasons, too) and on how to use a set of independent polynomials on that interval could be provided. Of course, interpreting the resulting coefficients may be much more complex. ChaosSchorsch (talk) 17:42, 18 February 2010 (UTC)
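As a numerical illustration of the correlation issue raised above (a sketch with made-up data, not article text): on a strictly positive interval the raw powers of x are highly correlated, and simply centering x already removes much of that; orthogonal polynomials on the interval would remove it entirely.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=1000)

# Raw powers on a positive interval: off-diagonal correlations close to 1.
raw = np.column_stack([x, x**2, x**3])
print(np.corrcoef(raw, rowvar=False).round(3))

# Centering x makes odd and even powers nearly uncorrelated; orthogonal
# polynomials would make all the regressors exactly uncorrelated.
xc = x - x.mean()
centered = np.column_stack([xc, xc**2, xc**3])
print(np.corrcoef(centered, rowvar=False).round(3))
```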
Epidemiology example
I am a bit confused by the inclusion of the example of tobacco smoking given in the section on applications of linear regression. Is it not more likely that the model used in these analyses was logistic regression? Jimjamjak (talk) 15:03, 26 March 2010 (UTC)
- The specific nature of the dependent variable is not given. If it were lifespan (measured in years), then linear regression would be perfectly appropriate. If it were "ever diagnosed with lung cancer" then it would probably be a logistic regression analysis. But the points made in this section mainly focus on issues with observational studies versus randomized experiments. So this is not a major point here. Skbkekas (talk) 23:07, 26 March 2010 (UTC)
A line has form y=mx + b, where m is the slope and b is the y-intercept. The current exposition seems to assume that the y-intercept is zero in all cases; that is, it says the form of the points is y_i = beta * x_i + epsilon_i, where epsilon_i is the "noise". There is no mention of the y-intercept, so it looks to me like the data is assumed to be centered at the origin. However, the figure at the top of the page clearly shows that the best-fit line does not need to pass through the origin. So what am I missing? —Preceding unsigned comment added by 18.104.22.168 (talk) 21:55, 6 April 2010 (UTC)
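A note that may answer the question above (my own explanation, not a quote from the article): the intercept is normally handled by including a constant regressor, so the compact form does not force the line through the origin.

```latex
% If one of the regressors is the constant 1, the intercept is simply one of
% the betas, so y = mx + b is covered by the general notation:
\[
  y_i \;=\; \beta_0 \cdot 1 + \beta_1 x_i + \varepsilon_i ,
\]
% here the "1" is itself a regressor; stacking it into x_i = (1, x_i)' recovers
% the compact form y_i = x_i'\beta + \varepsilon_i with \beta = (\beta_0, \beta_1)',
% and the fitted line need not pass through the origin.
```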
Likely inaccurate portrayal of weighted linear regression
Quote: "GLS can be viewed as applying a linear transformation to the data so that the assumptions of OLS are met for the transformed data. "
This seems to be incorrect. Firstly, because a linear transformation of the data cannot make it meet the assumptions for OLS, and secondly because the introduction of the weights into the equation does not correspond to a linear transformation of the *data*. The intuitive explanation of weighted linear regression that makes sense to me is that higher weighted data items have more impact on the result, as if they were replicated in the data set, but there may be better explanations than that. Grevillea (talk) 04:22, 20 April 2010 (UTC)
- You take the linear regression equation y_i = x_i'β + ε_i and multiply it by a constant c_i inversely proportional to the standard deviation of ε_i. Then you apply OLS to the transformed data: c_i y_i = (c_i x_i)'β + η_i, where η_i = c_i ε_i. In this regression η is already homoscedastic, so the "assumptions of OLS" are met. // stpasha » 09:09, 20 April 2010 (UTC)
- The above argument is only relevant if all the elements of Ω are known, otherwise the new "observations" depend on unknown parameters, and often Ω is not fully known. While, for the simplest applications of "weighted regression", the weights may be known, this is not always the case (depending on exactly how "weighted regression" is defined): however, even in this case, the idea of "replicated observations" doesn't fully work because of the difficulty of treating fractional observations. In the "transformation approach", the idea of "more highly weighted observations" is treated by taking the initial formal regression model, in which an observation has an error variance which is smaller than for others, and creating the transformed model in which the regression equation for that observation is replaced by one in which each term (observation, dependent variables and error) is multiplied by a factor such that the new error term (factor × old error) has a constant variance across observations. Melcombe (talk) 16:27, 17 May 2010 (UTC)
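A small numerical sketch of the transformation discussed above (illustrative only; the data and variable names are mine): when the per-observation error standard deviations are known, rescaling each row by their reciprocals and running plain OLS on the transformed data reproduces the weighted fit.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.uniform(0, 10, size=n)
sigma = 0.5 + 0.3 * x                       # known, non-constant error std. devs.
y = 3.0 + 1.5 * x + rng.normal(scale=sigma)

X = np.column_stack([np.ones(n), x])

# Multiply each observation (row of X and entry of y) by c_i = 1/sigma_i;
# the transformed errors are homoscedastic, so the OLS assumptions hold.
c = 1.0 / sigma
beta_wls, *_ = np.linalg.lstsq(X * c[:, None], y * c, rcond=None)
print(beta_wls)   # approximately [3.0, 1.5]
```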
Should this article be renamed?
- User:Stpasha has proposed renaming this article, from linear regression to linear regression model. (He mentioned this in a thread higher up on this page where no one is likely to see it.) The discussion is at Wikipedia_talk:WikiProject_Statistics#Rename_suggestion. Michael Hardy (talk) 23:33, 22 April 2010 (UTC)
- I have placed a move request to revert the change that was made to rename the article to linear regression model without any backing here or on the Stats project talk page, and in the face of a previous revert of the same change. See [requests]. Melcombe (talk) 12:59, 17 May 2010 (UTC)
First sentence
- No. The sentence is the topic sentence of a paragraph that defines the term. 018 (talk) 23:33, 10 September 2010 (UTC)
- What I meant was that the phrase "any approach to modeling the relationship between a scalar variable y and one or more variables denoted X" sounds like the definition of regression, not the definition of linear regression. —Preceding unsigned comment added by 22.214.171.124 (talk) 00:31, 14 September 2010 (UTC)
I agree with the elimination of the word "linear". To support my agreement, the following reasons are given: (i) linearity or straight lines are pure human imagination; there is no such thing as a straight line in nature; and (ii) linearity leads to many misunderstandings of models used in statistical or econometric research, resulting in many misspecified models and the downgrading of statistical and econometric models based on time series. A grave misspecification of statistical and econometric models often cited in the literature is the introduction of a "linear time trend", which is one of the most famous "unknowns" in statistical models, yet it appears most often and has been criticized most often. These criticisms have stimulated many econometricians in their search for more creative approaches to modelling that avoid the use of the "linear time trend" in the estimation of time series models. One of the novel approaches involves unit root tests and the cointegration technique in econometrics. In fact, when a linear time trend (represented by the variable To, To+1, To+2, ..., To+n, where To is the time base and n is the number of observations) is included, the estimated coefficient associated with this linear time trend variable is often interpreted as a measure of the impact of a number of known and unknown unmeasurable factors (subjectively, as a matter of fact) on the dependent variable in one unit of time. Logically, and strictly speaking, that interpretation is applicable to the estimation time periods only; outside the estimation periods, one does not know how those unmeasurable factors behave, either qualitatively or quantitatively. Furthermore, the linearity of the time trend poses many questions: (i) why should it be linear? (ii) if the trend is non-linear, then under what conditions does its inclusion not influence the magnitude as well as the statistical significance of the estimates of other parameters in the model? (iii) the commonly accepted law of nature, especially in economics, is "what goes up must come down one day, and the reverse is also true", so why include the linear time trend in your model, which blatantly violates this law when n --> infinity? Some known efforts of mathematicians, statisticians, econometricians and economists have been published in journals to respond to those questions (e.g. the work of John Blatt on the mathematical meaning of a time trend; C. Granger and many other econometricians on unit root testing, co-integration and related issues; and Ho-Trieu & Tucker on the logarithmic time trend, which is non-linear, with results alluding to a proof rejecting the existence of a linear trend, linear trend being just a misnomer for a special form of cyclical trend when the periodicity is large; please see http://ideas.repec.org/a/ags/remaae/12288.html for further details). To conclude, I support the use of just "regression".
Meaning of "regression coefficient"
The section "Introduction to linear regression" contains the passage " is a p-dimensional parameter vector. Its elements are also called effects, or regression coefficients." But doesn't the term "regression coefficient" conventionally refer to the estimated values of the betas, rather than the betas themselves? Duoduoduo (talk) 16:25, 24 November 2010 (UTC)
Missing terms: point of averages, regression line, graph of averages
Point of averages: the point whose x value is the average of all x values, and whose y value is the average of all y values. Source: http://www.youtube.com/watch?v=T7tj2-2r2Gk, at 4:30-5:00.
Graph of averages: if the x values are discrete, then for each distinct x value take the average of the corresponding y values; the set of these points constitutes the graph of averages. I believe this is defined for continuous data too, if the x axis is divided into intervals. Note that there is a difference between the graph of averages of y values and the graph of averages of x values. Source: http://www.youtube.com/watch?v=T7tj2-2r2Gk, at 5:00-5:30.
Regression line: "a smoothed version of the graph of averages". The regression line always passes through the point of averages. Source: http://www.youtube.com/watch?v=T7tj2-2r2Gk, at 6:00 - 7:00.
It would be nice if these were defined in the article.
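For anyone who wants to check the "passes through the point of averages" claim numerically, here is a small sketch (my own example, not from the cited video or the article):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=300)
y = 2.0 + 0.7 * x + rng.normal(scale=1.0, size=300)

# Simple least-squares fit y ~ a + b*x.
b = np.cov(x, y, bias=True)[0, 1] / np.var(x)
a = y.mean() - b * x.mean()

# Evaluating the fitted line at the point of averages returns mean(y),
# i.e. the regression line passes through (mean(x), mean(y)).
print(np.isclose(a + b * x.mean(), y.mean()))   # True
```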
big
I see no difference between the two renderings of the formula. Why the "\big"?
"Linear Regression" is a misonomer. Francis Galton first spoke of reversion, then regression in reference to generational changes in heights of men (from father to son). Shorter sons having regressed so to speak.
"..in 1877, Galton first referred to "reversion" in a lecture on the relationship between physical characteristics of parent and offspring seeds. The "law of reversion" was the first formal specification of what Galton later renamed "regression." Thirteen Ways to Look at the Correlation Coefficient
Galton's law of reversion is about genetics and has nothing to do with the mathematics of least squares, although least squares IS the applied mathematics. But if you say "linear regression" over the phone from the consultant's office, it sounds more impressive, which might explain why it sticks. — Preceding unsigned comment added by 126.96.36.199 (talk) 19:58, 21 August 2012 (UTC)
Median-median line?
There doesn't seem to be a section here (or any article) about the median-median line, even though that's a popular regression technique. Is there a reason for that? -- Spireguy (talk) 19:23, 2 October 2012 (UTC)
Statistics or linear algebra
I would have written the first sentence to classify it as linear algebra, rather than statistics. Is one more standard than the other? I learned this in my linear algebra class. Mythirdself (talk) 19:15, 30 March 2013 (UTC)
Least distance fit
Anyone have an opinion on this?
In the section "Least-squares estimation and related techniques" I guess it would be appropriate to add least distance fitting. Actually the result is rather simple:
Slope: beta = stdev(y) / stdev(x), possibly with a minus sign
Offset: epsilon = mean(y) - beta * mean(x)
Least distance fitting is practically useful when fitting data which is noisy in both x and y, for example correlation plots.
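A minimal sketch of the rule quoted above (slope = stdev(y)/stdev(x) with the sign of the correlation, plus the offset the comment calls "epsilon"); the data and names are illustrative, and whether this coincides with a true perpendicular least-distance fit depends on the error structure, which is probably worth checking before adding it to the article:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=400)
y = 0.8 * x + rng.normal(scale=0.6, size=400)   # noise in y (and, in practice, in x)

# Slope by the quoted rule: ratio of standard deviations, signed like the correlation.
r = np.corrcoef(x, y)[0, 1]
beta = np.sign(r) * y.std() / x.std()
offset = y.mean() - beta * x.mean()             # the "epsilon" of the comment above
print(beta, offset)
```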
"Constant variance (aka homoscedasticity). This means that different response variables have the same variance in their errors, regardless of the values of the predictor variables." This was a confusing passage for me and I spent half an hour searching different sources to try and clear this up. Is it "different response variables" or "different values of the response variable"? The definition states there's only one response variable. Other sources also confirm that homoscedasticity refers to the variance in errors for the same variable. It's very confusing if you try to figure out what the sentence means if you were to have different response variables and you were comparing the variance in their errors. — Preceding unsigned comment added by 188.8.131.52 (talk) 09:10, 21 January 2014 (UTC)
Errors and Residuals, confusing
I think the term "error" is a little confusing and mixed in this article.
Sometimes it means the error (standard deviation) of all experimental y for a given x.
Sometimes it means the residuals: the distance from E[y|x] to the line, that is, the distance from the mean value of y for a given x to the line.
And the standard deviation sometimes refers to the first error, sometimes to the second.
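For what it is worth, the usual textbook distinction behind this confusion can be written out explicitly (a clarifying note, not a quote from the article):

```latex
% Errors are deviations from the unknown true regression line;
% residuals are deviations from the fitted line computed from the sample.
\[
  \varepsilon_i \;=\; y_i - (\beta_0 + \beta_1 x_i)
  \qquad\text{(error: unobservable)}
\]
\[
  e_i \;=\; y_i - (\hat\beta_0 + \hat\beta_1 x_i)
  \qquad\text{(residual: computable from the data)}
\]
```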