
Regression analysis

From Wikipedia, the free encyclopedia

In statistics, regression analysis examines the dependence of a random variable, called the dependent variable (also response variable or regressand), on other random or deterministic variables, called independent variables (predictors). The mathematical model of their relationship is the regression equation. Well-known types of regression equation are linear regression, logistic regression for discrete responses (both special cases of the generalized linear model), and nonlinear regression.

Regression equations contain one or more unknown regression parameters ("constants") that quantitatively link the dependent and independent variables. These parameters are estimated from data, which in practical applications may come from any combination of public or private sources.

Applications of regression include curve fitting, forecasting of time series, modeling of causal relationships, and testing scientific hypotheses about relationships between variables.

Introduction

Regression analysis estimates the strength of a modeled relationship between one or more response variables (also called dependent variables, explained variables, predicted variables, or regressands), usually named Y, and the predictors (also called independent variables, explanatory variables, control variables, or regressors), usually named X. Given that the model is correct, these strengths of relationship are parameters of the model, which are estimated from a sample. Other parameters which are sometimes specified include error variances and covariances of the variables. The theoretical population parameters are commonly designated by Greek letters (e.g. β), their estimated values by a "hatted" Greek letter (e.g. β̂), and the sample coefficients by a Latin letter (e.g. b). This stresses the fact that the sample coefficients are not the same as the population parameters, but the distribution of those parameters in the population can be inferred from the estimates and the sample size. This allows researchers to test for the statistical significance of estimated parameters and to measure the goodness of fit of the model.

Still more generally, regression may be viewed as a special case of density estimation. The joint distribution of the response and explanatory variables can be constructed from the conditional distribution of the response variable and the marginal distribution of the explanatory variables. In some problems, it is convenient to work in the other direction: from the joint distribution, the conditional distribution of the response variable can be derived. Regression lines can be extrapolated, where the line is extended to fit the model for values of the explanatory variables outside their original range. However extrapolation may be very inaccurate and can only be used reliably in certain instances.

History of regression

The term "regression" was used in the nineteenth century to describe a biological phenomenon, namely that the progeny of exceptional individuals tend on average to be less exceptional than their parents, and more like their more distant ancestors. Francis Galton studied this phenomenon and applied the slightly misleading term "regression towards mediocrity" to it. For Galton, regression had only this biological meaning, but his work[1] was later extended by Udny Yule and Karl Pearson to a more general statistical context.[2]

Definitions and notation used in regression

The measured variable, y, is conventionally called the "response variable". Other terms include "endogenous variable," "output variable," "criterion variable," and "dependent variable." The controlled or manipulated variables, x, are called the explanatory variables. Other terms include "exogenous variables," "input variables," "predictor variables" and "independent variables."

Types of regression

Several types of regression analysis can be distinguished; all of these can be seen as special cases of the Generalized Linear Model.

Linear regression

Linear regression is a method for determining the parameters of a linear system, that is, a system that can be expressed as follows:

y = \theta_1 f_1(x) + \theta_2 f_2(x) + \cdots + \theta_p f_p(x)

where the \theta_j are called the parameters and each f_j is a function of x only. This can be rewritten in matrix form as

y = f(x)\,\theta

where f(x) is a row vector that contains each of the functions, that is, f(x) = (f_1(x), f_2(x), \ldots, f_p(x)), and \theta is a column vector containing the parameters, that is, \theta = (\theta_1, \theta_2, \ldots, \theta_p)^\mathsf{T}.

The explanatory and response variables may be scalars or vectors. When both the explanatory and response variables are scalars, the resulting regression is called simple linear regression. When there is more than one explanatory variable, the resulting regression is called multiple linear regression. The general formulae are the same for both cases.

Two common techniques for fitting linear regression models are least squares analysis and robust regression.
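As an illustration, the following minimal sketch fits a simple linear regression by ordinary least squares; NumPy is assumed and the data values are hypothetical, serving only to show the mechanics of building the design matrix f(x) and solving for θ.

```python
import numpy as np

# Hypothetical observations used only to illustrate the mechanics.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Design matrix: one column per basis function, here f_1(x) = 1 (intercept)
# and f_2(x) = x, i.e. simple linear regression.
F = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: theta_hat minimises ||y - F theta||^2.
theta_hat, residuals, rank, _ = np.linalg.lstsq(F, y, rcond=None)
print(theta_hat)  # estimated intercept and slope
```

Adding further columns to F (for example extra explanatory variables) turns the same call into multiple linear regression.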

Nonlinear regression models

A number of nonlinear regression techniques may be used to obtain a more accurate fit. An often-used alternative is to transform the variables so that the relationship between the transformed variables is again linear.
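For instance, a relationship of the form y = a·exp(b·x) becomes linear in the parameters after taking logarithms, log y = log a + b·x, so ordinary linear least squares can be applied to the transformed response. A minimal sketch with hypothetical data (NumPy assumed):

```python
import numpy as np

# Hypothetical data roughly following y = a * exp(b * x).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 8.2, 21.5, 60.3])

# Fit a straight line to (x, log y): the slope estimates b, the intercept estimates log a.
b_hat, log_a_hat = np.polyfit(x, np.log(y), 1)
a_hat = np.exp(log_a_hat)
print(a_hat, b_hat)
```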

Non-continuous variables

If the dependent variable is not continuous, specific techniques are available. For binary (zero or one) variables, there are the probit and logit models. The multivariate probit model makes it possible to estimate jointly the relationship between several binary dependent variables and some independent variables. For categorical variables with more than two values there is the multinomial logit. For ordinal variables with more than two values, there are the ordered logit and ordered probit models. An alternative to such procedures is linear regression based on polychoric or polyserial correlations between the categorical variables. Such procedures differ in the assumptions made about the distribution of the variables in the population. If the variable is a positive count with low values, representing the number of occurrences of an event, count models such as Poisson regression or the negative binomial model may be appropriate.
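As an example of the binary case, the sketch below fits a logit (logistic regression) model; the admission data are hypothetical, and scikit-learn is just one convenient choice of tool.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical binary outcome: admission (1) or rejection (0) as a function of a test score.
scores = np.array([[45], [52], [61], [68], [74], [80], [85], [91]])
admitted = np.array([0, 0, 0, 1, 0, 1, 1, 1])

# Logit model: P(admitted = 1 | score) = 1 / (1 + exp(-(b0 + b1 * score))).
model = LogisticRegression().fit(scores, admitted)
print(model.intercept_, model.coef_)   # estimated b0 and b1
print(model.predict_proba([[70]]))     # estimated probabilities at score 70
```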

Other models

Although these are the most common types of regression, other approaches also exist, such as methods from supervised learning and unit-weighted regression.
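Unit-weighted regression, for example, standardises each predictor and assigns it a fixed weight of +1 or −1 instead of an estimated coefficient. A minimal sketch (NumPy assumed, with hypothetical predictors and sign choices):

```python
import numpy as np

def unit_weighted_score(X, signs):
    """Composite score from unit-weighted regression.

    Each predictor (column of X) is standardised and then summed with a
    fixed weight of +1 or -1, rather than an estimated coefficient.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    return Z @ np.asarray(signs, dtype=float)

# Hypothetical predictors: two expected to act positively, one negatively.
X = np.array([[1.0, 10.0, 5.0],
              [2.0, 12.0, 3.0],
              [3.0, 11.0, 4.0],
              [4.0, 15.0, 1.0]])
print(unit_weighted_score(X, [+1, +1, -1]))
```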

Nonparametric regression

The models described above are called parametric because the researcher must specify the nature of the relationships between the variables in advance. Several nonparametric techniques may also be used to estimate the impact of an explanatory variable on a dependent variable. Nonparametric regressions, like kernel regression, require a large number of observations and are computationally intensive.
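A minimal sketch of one such technique, Nadaraya–Watson kernel regression, is given below (NumPy assumed; the bandwidth and the data are illustrative):

```python
import numpy as np

def kernel_regression(x_query, x_obs, y_obs, bandwidth):
    """Nadaraya-Watson estimate of E[Y | X = x_query] with a Gaussian kernel."""
    # Weight each observation by its kernel distance to the query point ...
    w = np.exp(-0.5 * ((x_query - x_obs) / bandwidth) ** 2)
    # ... and return the weighted average of the observed responses.
    return np.sum(w * y_obs) / np.sum(w)

# Illustrative data: noisy observations of an unknown smooth relationship.
x_obs = np.linspace(0.0, 10.0, 50)
y_obs = np.sin(x_obs) + 0.1 * np.random.randn(50)
print(kernel_regression(5.0, x_obs, y_obs, bandwidth=0.5))
```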

Multicollinearity

See multicollinearity.

Regression and Bayesian statistics

Bayesian methods can also be used to estimate regression models. Such methods may be adopted for several reasons. One is to improve the estimation by incorporating prior information. Another is the researcher's epistemological standpoint.

See also Bayesian linear regression.
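As a minimal sketch of the idea, the following computes the posterior mean and covariance of the coefficients for a Gaussian linear model with a Gaussian prior and known noise variance; the prior and the data are hypothetical, and NumPy is assumed.

```python
import numpy as np

def bayesian_linear_regression(X, y, prior_mean, prior_cov, noise_var):
    """Posterior of beta for y = X beta + eps, eps ~ N(0, noise_var),
    with prior beta ~ N(prior_mean, prior_cov); conjugacy gives a Gaussian posterior."""
    prior_prec = np.linalg.inv(prior_cov)
    post_cov = np.linalg.inv(prior_prec + X.T @ X / noise_var)
    post_mean = post_cov @ (prior_prec @ prior_mean + X.T @ y / noise_var)
    return post_mean, post_cov

# Hypothetical data and a weakly informative prior centred at zero.
X = np.column_stack([np.ones(5), np.array([1.0, 2.0, 3.0, 4.0, 5.0])])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
mean, cov = bayesian_linear_regression(X, y, np.zeros(2), 10.0 * np.eye(2), noise_var=1.0)
print(mean)  # posterior mean shrinks the least-squares estimate towards the prior mean
```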

Examples

To illustrate the goals of regression, we give an example of predicting future observations.

Prediction of future observations

The following data set gives the average heights and weights for American women aged 30-39 (source: The World Almanac and Book of Facts, 1975).

Height (in) 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
Weight (lbs) 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164

We would like to see how the weight of these women depends on their height. We are therefore looking for a function f such that Y = f(X), where Y is the weight of the women and X their height. Intuitively, we can guess that if the women's proportions and density are constant, then their weight must depend on the cube of their height.

[Figure: plot of the data set (weight against height), which confirms this supposition.]

X will denote the vector containing all the measured heights (X = (x_1, x_2, \ldots, x_{15})) and Y the vector containing all the measured weights. We can suppose that the errors are uncorrelated, with zero mean and constant variance, which means the Gauss–Markov assumptions hold. We can therefore use the least-squares estimator, i.e. we are looking for coefficients \theta_1, \theta_2 and \theta_3 satisfying as well as possible (in the sense of the least-squares estimator) the equation:

Y = \theta_1 + \theta_2 X + \theta_3 X^3

Geometrically, what we will be doing is an orthogonal projection of Y on the subspace generated by the variables 1, X and X^3. The matrix X is constructed simply by putting a first column of 1's (the constant term in the model), a column with the original values (the X in the model), and a third column with these values cubed (X^3). The realization of this matrix (i.e. for the data at hand) can be written:

1 58 195112
1 59 205379
1 60 216000
1 61 226981
1 62 238328
1 63 250047
1 64 262144
1 65 274625
1 66 287496
1 67 300763
1 68 314432
1 69 328509
1 70 343000
1 71 357911
1 72 373248

The matrix X^\mathsf{T} X (sometimes called the "information matrix" or "dispersion matrix") is:

X^\mathsf{T} X = \begin{pmatrix} n & \sum_i x_i & \sum_i x_i^3 \\ \sum_i x_i & \sum_i x_i^2 & \sum_i x_i^4 \\ \sum_i x_i^3 & \sum_i x_i^4 & \sum_i x_i^6 \end{pmatrix}

The least-squares estimate of the parameter vector is therefore:

\hat\theta = (X^\mathsf{T} X)^{-1} X^\mathsf{T} Y

hence the fitted regression function is \hat{y} = \hat\theta_1 + \hat\theta_2 x + \hat\theta_3 x^3.

A plot of this function shows that it lies quite close to the data set.

The confidence intervals for the parameters are computed using:

\hat\theta_j \pm t_{n-p;\,1-\alpha/2}\,\hat\sigma\sqrt{\left[(X^\mathsf{T} X)^{-1}\right]_{jj}}

with:

\hat\sigma^2 = \frac{1}{n-p}\,\lVert Y - X\hat\theta\rVert^2, \qquad n = 15, \; p = 3.

Taking \alpha = 0.05 gives the 95% confidence intervals for \theta_1, \theta_2 and \theta_3.
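The computation above can be reproduced numerically; the sketch below uses NumPy and SciPy (one possible choice of tools) with the height and weight data from the table.

```python
import numpy as np
from scipy.stats import t

# Data from the table above: heights in inches, weights in pounds.
heights = np.arange(58, 73, dtype=float)
weights = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                    139, 142, 146, 150, 154, 159, 164], dtype=float)

# Design matrix with columns 1, x and x^3, as in the text.
X = np.column_stack([np.ones_like(heights), heights, heights ** 3])

# Least-squares estimate: theta_hat = (X^T X)^{-1} X^T Y.
theta_hat, *_ = np.linalg.lstsq(X, weights, rcond=None)

# Residual variance and 95% confidence intervals for the parameters.
n, p = X.shape
resid = weights - X @ theta_hat
sigma2_hat = resid @ resid / (n - p)
cov = sigma2_hat * np.linalg.inv(X.T @ X)
half_width = t.ppf(0.975, n - p) * np.sqrt(np.diag(cov))
for est, hw in zip(theta_hat, half_width):
    print(f"{est:.6g} +/- {hw:.3g}")
```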

See also

References

  1. ^ Francis Galton. "Typical laws of heredity", Nature 15 (1877), 492-495, 512-514, 532-533. (Galton uses the term "reversion" in this paper, which discusses the size of peas.); Francis Galton. Presidential address, Section H, Anthropology. (1885) (Galton uses the term "regression" in this paper, which discusses the height of humans.)
  2. ^ G. Udny Yule. "On the Theory of Correlation", J. Royal Statist. Soc., 1897, p. 812-54. Karl Pearson, G. U. Yule, Norman Blanchard, and Alice Lee. "The Law of Ancestral Heredity", Biometrika (1903). In the work of Yule and Pearson, the joint distribution of the response and explanatory variables is assumed to be Gaussian. This assumption was weakened by R.A. Fisher in his works of 1922 and 1925 (R.A. Fisher, "The goodness of fit of regression formulae, and the distribution of regression coefficients", J. Royal Statist. Soc., 85, 597-612 from 1922 and Statistical Methods for Research Workers from 1925). Fisher assumed that the conditional distribution of the response variable is Gaussian, but the joint distribution need not be. In this respect, Fisher's assumption is closer to Gauss's formulation of 1821.


Other sources

  • Audi, R., Ed. (1996). The Cambridge Dictionary of Philosophy. Cambridge, Cambridge University Press. curve fitting problem, pp.172-173.
  • Birkes, David and Yadolah Dodge, Alternative Methods of Regression. ISBN 0-471-56881-3
  • Chatfield, C. (1993) "Calculating Interval Forecasts," Journal of Business and Economic Statistics, 11. pp. 121-135.
  • Fox, J. (1997). Applied Regression Analysis, Linear Models and Related Methods. Sage
  • Hardle, W., Applied Nonparametric Regression (1990), ISBN 0-521-42950-1
  • Meade, N. and T. Islam (1995) "Prediction Intervals for Growth Curve Forecasts," Journal of Forecasting, 14, pp. 413-430.
  • Charles Darwin. (1869). The Variation of Animals and Plants under Domestication. (Chapter XIII describes what was known about reversion in Galton's time. Darwin uses the term "reversion".)
  • Draper, N.R. and Smith, H. (1998).Applied Regression Analysis Wiley Series in Probability and Statistics
  • Francis Galton. "Regression Towards Mediocrity in Hereditary Stature," Journal of the Anthropological Institute, 15, pp. 246-263 (1886). (Facsimile at: [1])
  • Lindley, D.V. (1987). "Regression and correlation analysis," New Palgrave: A Dictionary of Economics, v. 4, pp. 120-23.

Software