# Talk:Generalized linear model

WikiProject Statistics (Rated C-class, Top-importance)

This article is within the scope of the WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.


This text should be read with care because of errors in both the text and the formulas.

## Cleanup

There are no errors here, but this article needs to be cleaned up and fleshed out a bit. I will try to work on it soon... --shaile 00:33, 14 April 2006 (UTC)

I don't like: "In statistics the generalized linear model (GLM) generalizes the ordinary least squares regression." Surely OLS is an estimation method, not a model? In my view, it should say "In statistics the generalized linear model (GLM) generalizes the linear model." Blaise 21:45, 23 September 2006 (UTC)
You better believe OLS implies a model. Granted, a model that most would agree is wrong for their data, but some models are useful, so we use them. I'd suggest merging those two, they cover the exact same subject. Pdbailey 13:39, 26 September 2006 (UTC)
I agree with Blaise, though I must say the one he suggests does sound a bit like a tautology. Wherami (talk) 17:03, 30 September 2011 (UTC)
I modified the sentence. As it turns out linear regression and least squares do not necessarily go together. Wherami (talk) 17:26, 30 September 2011 (UTC)

The second sentence of the introduction is terribly muddled. It should be clarified. I would do so, but I am currently struggling to understand this subject, so I shan't try just yet. Thomas Tvileren 08:44, 15 October 2007 (UTC)

## proposed changes

The edit I made (which was reverted) did trim the overall size of the page and reduce the number of examples, but as it stands it is very difficult to understand what a GLM is. The basic question is: what is a GLM? Once we have answered that, we need to say why you might want to use one, and then I think we should get a little into how parameters are estimated. I'm going to revert it back and expand substantially on what I wrote last time.

It would be great if we could expand on the part about using any CDF. Here we have the alternative example where $Y = 1$ if $\eta > 0$ and $Y = 0$ otherwise. Pdbailey

Would you explain what you meant by this sentence: "Because the variance portion is not constant, there is no sense to least squares, instead the parameters must be estimated with maximum likelihood or quasi maximum likelihood." It is not clear at all... Thanks! (Also, PLEASE sign your comments! All it takes is 4 ~'s) --shaile 22:21, 20 April 2006 (UTC)
Sorry, I don't have time to fix it up right now. The general idea is that OLS won't work because none of the assumptions hold (independence, constant variance). Pdbailey 00:11, 21 April 2006 (UTC)

## on using $\eta$

So the reason that I like the $\eta$ terminology is that it separates out the linear part of the equation ($X\beta$) from the random part ($Y$). It makes it clear that there is a linear model in there. Also, the way the second equation is now ($E(Y_i) = \mu_i = g^{-1}(X_i^T\beta),$ so that $g(\mu_i) = X_i^T\beta.$) it is a bit of a garbled mess. Pdbailey 02:21, 21 April 2006 (UTC)

I guess I see your point. On the other hand, I think maybe it's better to keep a similar notation to that of Linear model, thus dropping the $\eta$. As for the second equation, I think it should be either $g(\mu_i) = X_i^T\beta,$ or $g(E(Y_i)) = X_i^T\beta.$ You're right, the inverse g function is a bit confusing, and I think $g(\mu_i)=X_i^T\beta$ is the more standard notation. I need to find the paper for this, I have it around somewhere... --shaile 04:21, 21 April 2006 (UTC)
I'm working off lecture notes from McCullagh (author of one of the References) and I'll have to admit that my notes are not an example of clarity (hence $g$ instead of $g^{-1}$). However, I really like the separation that McCullagh emphasizes between the linear part of the model (which describes the mean behavior with a linear equation) and the variance part of the model, which describes the dispersion of the $Y$ values. In fact, he argues that it is always clearer to write a model in this fashion -- to separate out the expected value from the dispersion. In a GLM, when they are separated it is clear that the link, well, links the two portions of the model. It gives some perspective into the relevance of the link function and separates out clearly the three components that are present in all GLMs. Some people might think that probit and logit are worlds apart, but in this framework, it is clear that they are minor variants on each other.
This form is also echoed in programs such as Stata, which has a linear model, a link function, and a variance function. But I'm afraid that we just disagree on this one -- you like the simplicity of having it all in one formula, I like seeing all three components separated. Pdbailey 05:19, 21 April 2006 (UTC)
I just changed the page in light of this discussion not to rv but to state the model more clearly in the way I'm arguing for. Pdbailey 05:27, 21 April 2006 (UTC)
Actually, this is fine. It was just more confusing the way you had it before.  :) --shaile 13:21, 21 April 2006 (UTC)
Based on the objections that the two of you raised, I changed it back to look more like it used to. I think that it is easier to get a handle on quickly this way. Pdbailey 16:10, 27 April 2006 (UTC)
Would someone clarify this: $\epsilon_i = f( g^{-1}(X_i^T \beta) )$? It's not at all clear, and I don't see how the error term is a function of the other parts; actually, I don't think it should be stated this way. Also, we need more details in there. It was better when the three parts of the GLM were clarified separately. Any objections? --shaile 19:22, 27 April 2006 (UTC)
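
For anyone following this thread, here is a minimal sketch of the three-part view argued for above (linear predictor, link, variance function), written in Python with NumPy. The design matrix `X`, the coefficients `beta`, and the helper names are made up for illustration; none of this is taken from the article or from McCullagh's notes.

```python
import math
import numpy as np

def inv_logit(eta):
    """Inverse of the logit link: maps the linear predictor to a mean in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-eta))

def inv_probit(eta):
    """Inverse of the probit link: the standard normal CDF."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(eta / math.sqrt(2.0)))

def bernoulli_variance(mu):
    """Variance function of a Bernoulli response, V(mu) = mu * (1 - mu)."""
    return mu * (1.0 - mu)

# Hypothetical design matrix (intercept plus one covariate) and coefficients.
X = np.array([[1.0, -2.0], [1.0, 0.0], [1.0, 2.0]])
beta = np.array([0.0, 1.0])

eta = X @ beta                            # 1. linear predictor
mu_logit = inv_logit(eta)                 # 2. mean via the inverse logit link
mu_probit = inv_probit(eta)               # 2'. same linear predictor, probit link
var_logit = bernoulli_variance(mu_logit)  # 3. variance as a function of the mean
```

Swapping `inv_logit` for `inv_probit` changes only the link step; the linear predictor is untouched, which is the sense in which logit and probit are minor variants of each other.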

## Reorganization and exponential family detail

I have been reorganizing this article a bit, starting at the top. I wish however to point out a small but important change I made to the definition of exponential family here. Where before it contained a term $a(y)b(\theta)$, in McCullagh & Nelder (p28) it clearly shows $a(y)\theta$. I have made this change and eliminated the reference to the b function. If there is a more reputable source than M&N (hard to imagine) which has the more general $a(y)b(\theta)$ form, feel free to revert but please include the source. Baccyak4H 18:50, 27 October 2006 (UTC)

I think that's a good change you made to the definition of the exponential family. Do you think we could include how this relates to the link function? (I know how to do that, but I'd have to do a few to remember exactly where the link function comes from...) --shaile 20:57, 27 October 2006 (UTC)

Thanks. I was planning to address some more advantages of this form (M&N's form) including sufficient statistics, variance as function of a, c and d, and the canonical parameter. But that must wait; my copy is elsewhere now ;-). I certainly will be continuing my reorganization (mostly to make different editors' contributions sound like they came from one editor -- my pet peeve). While I may do a lot of tweaking, please feel free to improve my efforts. Baccyak4H 03:06, 28 October 2006 (UTC)

While M&N use that unusual exponential family form in their book, the two forms are equivalent (in the sense that you point out), and I think it makes sense for Wikipedia to use the broader definition. Which is to say that the notational convenience that it affords one book may not be as nice in an encyclopedia. Also, this article is for a much broader audience than the book and covers the topic in a lot less detail. Pdbailey 03:51, 28 October 2006 (UTC)

I suppose in the end a more general formula would be better. Is there a version like the current version which also contains the so-called dispersion parameter? M&N has that ($\phi$), and given it appears in an overdispersion context twice (one of which I just added), I see some merit in including it in the formula: one of the great merits as I see it is the unification that the theory provides - another reason I may describe the canonical parameter some more (Note: I removed it from the link table title, not because it was wrong (it was not), but because it was unexplained. I plan on returning to fix that). Baccyak4H 03:00, 29 October 2006 (UTC)

Well, here is a possible generalization of the exponential family formula which includes a dispersion parameter $\phi$. The inclusion of $\phi$ is in the spirit of M&N's definition, but the rest of the formula is the same as the current one.

Old: $f_y(y; \theta) = \exp{(a(y)b(\theta) + c(\theta) + d(y) )} \,\!$.
New: $f_y(y; \theta) = \exp{\left(\frac{a(y)b(\theta) + c(\theta)}{h(\phi)} + d(y,\phi) \right)} \,\!$.

I tried to reword the discussion there to apply to this. Baccyak4H 14:58, 30 October 2006 (UTC)

Just wanted to add a note thanking those responsible for that glm formula under 'model components'. It really helps a lot, Baccyak4. 158.121.165.13 (talk) 14:59, 19 October 2009 (UTC)

Just dawned on me that for the $b(\theta)$ formula, if b is invertible then it is exactly equivalent to the version without b; they merely represent a reparameterization of each other. Baccyak4H 16:34, 31 October 2006 (UTC)
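
To spell out the reparameterization, here is a short sketch using the Old formula above. The symbols $\theta^*$ and $c^*$ are introduced here only for illustration. Assuming $b$ is invertible, define
$\theta^* = b(\theta), \qquad c^*(\theta^*) = c(b^{-1}(\theta^*)).$
Then
$\exp{(a(y)b(\theta) + c(\theta) + d(y))} = \exp{(a(y)\theta^* + c^*(\theta^*) + d(y))},$
which is the form without $b$, with $\theta^*$ playing the role of the parameter.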

## History and Motivation

1. There is no mention of the probit link. From a passage in McCullagh and Nelder, the probit work is historically important, in particular the presentation of the scoring algorithm in an appendix written by R.A. Fisher for a paper by the toxicologist Bliss.

2. Mention that in practice GLMs provide an important way to address heteroscedasticity.

My apologies in the event I have overlooked some passage that addresses these concerns.

Dfarrar 04:48, 20 February 2007 (UTC)

There is an article (stub) under development for probit. Dfarrar 04:51, 20 February 2007 (UTC)

Should the probit link be included in the table of canonical link functions, or is it not considered a canonical link function? Bill Jefferys (talk) 23:43, 20 December 2007 (UTC)

It isn't a canonical link for any of the distributions there. I do suspect there is a distribution that makes it such, but don't know off the top of my head what it might be. Baccyak4H (Yak!) 02:23, 21 December 2007 (UTC)

## Boldface vectors

Shouldn't beta be in boldface throughout? At present some of the betas are bold and others aren't. This is particularly disturbing when it happens in Xβ.

--84.9.83.26 (talk) 11:46, 19 December 2007 (UTC)

Yes, consistency is a good thing. Nice catch. Baccyak4H (Yak!) 14:47, 19 December 2007 (UTC)
Update: I did that fix; sorry if I missed any. I used wikimarkup (multiple single quotes) rather than html markup (<tags>) for consistency with the rest of the article. Baccyak4H (Yak!) 15:00, 19 December 2007 (UTC)

Thanks! :-)

Another one. In a vector context, shouldn't the linear predictor η be boldface too? η = X β

--84.9.73.5 (talk) 13:03, 21 December 2007 (UTC) (formerly 84.9.83.26)

This would depend on context. For one observation or data point, no, since X is a row vector. For the entire data, yes. I am not sure what you mean by "vector context", but I hope those two scenarios clarify things. Baccyak4H (Yak!) 16:26, 21 December 2007 (UTC)

Several types of regression which fall under this topic have been added to the see also section. However, links to these already appear where they are discussed higher up in the article. I propose to remove them in the see also section. Anyone with me here? Baccyak4H (Yak!) 19:12, 2 February 2008 (UTC)

## A couple of points

I think the technical term for the family of distributions is the Exponential Dispersion Family, though unfortunately I have no sources to confirm it other than my hazy memory. Can anyone confirm this?

Technically one doesn't even need to specify a distribution to fit a GLM; only a variance function is required (though specifying a distribution means one can estimate the dispersion parameter by maximum likelihood). However, I don't know if that can be worked into the article without making it more confusing.

There is no mention in the article of iteratively re-weighted least squares (IRLS or IWLS depending on who you talk to), the method used for estimating the parameters, and the current article in that location doesn't seem relevant to GLMs.

Thoughts? -3mta3 (talk) 00:17, 8 April 2008 (UTC)

3mta3, good points. One does need to specify an exponential family to fit with ML, but you are right that there are pseudo-ML options as well. I think this could be a new section. There also really should be a section on IRLS and Fisher steps; this is a big part of what brings the GLM together, since they can all be fit with the same general solver. Pdbailey (talk) 02:48, 8 April 2008 (UTC)
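
As a starting point for such a section, here is a minimal sketch of IRLS in Python with NumPy. It assumes a Poisson response with the log link purely as an example; the function name `irls_poisson`, the starting values `y + 0.1`, and the toy data are choices made for this sketch, not anything from the article or from M&N.

```python
import numpy as np

def irls_poisson(X, y, n_iter=50, tol=1e-10):
    """Fit a Poisson GLM with log link by iteratively reweighted least squares."""
    mu = y + 0.1                  # crude starting means, kept strictly positive
    eta = np.log(mu)              # corresponding linear predictor
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # Working response z = eta + (y - mu) * g'(mu), with g'(mu) = 1 / mu.
        z = eta + (y - mu) / mu
        # Working weights 1 / (V(mu) * g'(mu)^2), which equals mu here.
        w = mu
        XtW = X.T * w             # scales each observation's column by its weight
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        eta = X @ beta
        mu = np.exp(eta)
        if converged:
            break
    return beta

# Toy data: intercept plus one covariate, small counts.
X = np.column_stack([np.ones(5), np.array([0.0, 0.5, 1.0, 1.5, 2.0])])
y = np.array([1.0, 2.0, 2.0, 4.0, 7.0])
beta_hat = irls_poisson(X, y)
```

Only the lines computing `z`, `w`, and `mu` depend on the chosen distribution and link; the weighted least-squares solve is the same for every GLM, which is the "same general solver" point made above.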

## Variance & Weighting

When you apply a link function to observed data, aren't you implicitly re-weighting points? For example, if I have samples for $f(x)$ and expect it to be of the form $f(x)=\beta x^2$, I could take the square root of my data and try to find $\beta$ to minimize the norm of the residual of

$\sqrt{f(x)} = \sqrt{\beta}\, x\qquad\qquad(1)$

but if I had equal variance on all of my samples, then a simple least-squares fit of (1) would be biased, putting more emphasis on a sample at, say, x=0.1 than at x=2 (if I have my head screwed on). I see some talk of this in the article and in this discussion page, but it isn't clear to me. How is this issue dealt with? Are the samples re-weighted based on the link function? —Ben FrantzDale (talk) 14:54, 26 August 2008 (UTC)

This is a good question for the Math Reference desk. But in lieu of that, you could check one of the references (McCullagh/Nelder is my fave): there you'll find the weighting issue is handled very elegantly by the maximization process of the likelihood. Baccyak4H (Yak!)
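
A rough sketch of how the weighting comes out of the likelihood maximization, following the standard iteratively reweighted least squares presentation. The symbols $z_i$, $w_i$ and the variance function $V$ are notation assumed here, not used earlier in this thread. Note first that in a GLM the link is applied to the mean $\mu_i = \operatorname{E}(Y_i)$, not to the observed data. At each iteration a weighted least-squares regression of a working response on the covariates is run, with
$z_i = \eta_i + (y_i - \mu_i)\,g'(\mu_i), \qquad w_i = \frac{1}{V(\mu_i)\,g'(\mu_i)^2}.$
So the points are indeed re-weighted, by both the slope of the link and the variance function. With constant variance and the identity link, all $w_i$ are equal and this reduces to ordinary least squares.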

## Formula of density function related to mean and variance

Formerly, the formula for the density function was
$f_Y(y; \theta, \tau) = \exp{\left(\frac{a(y)b(\theta) + c(\theta)} {h(\tau)} + d(y,\tau) \right)}.$
With that density function, if a and b are both the identity function, the mean of the distribution is
$\mu = \operatorname{E}(Y) = -c'(\theta)$
and the variance is
$\operatorname{Var}(Y) = -c''(\theta) h(\tau).$

Now, in the formula for the density function, the sign in front of c is minus. The formula is
$f_Y(y; \theta, \tau) = \exp{\left(\frac{a(y)b(\theta) - c(\theta)} {h(\tau)} + d(y,\tau) \right)}.$
I am fine with the change, because the formula more closely resembles the one that I saw in a book. But with that change in the density function, the formulas for the mean and variance are now incorrect.

So, to make the formulas of mean and variance correct, there are two alternatives:

• The formula for the density function is changed back (the sign in front of c is plus); the formulas for the mean and variance are left unchanged.
• The formula for the density function is kept as it is now (the sign in front of c is minus); the formulas for the mean and variance are changed (the minus signs removed).

--Anreto (talk) 04:57, 8 October 2008 (UTC)
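
For reference, a sketch of the calculation behind the second alternative, taking $a$ and $b$ to be identity functions and assuming that differentiation under the integral sign is allowed (replace the integral by a sum for a discrete response). Since the density integrates to one,
$\int \exp{\left(\frac{y\theta - c(\theta)}{h(\tau)} + d(y,\tau)\right)}\,dy = 1.$
Differentiating once with respect to $\theta$ gives
$\operatorname{E}\left(\frac{Y - c'(\theta)}{h(\tau)}\right) = 0, \quad\text{so}\quad \mu = \operatorname{E}(Y) = c'(\theta).$
Differentiating a second time gives
$\operatorname{E}\left(\frac{(Y - c'(\theta))^2}{h(\tau)^2}\right) - \frac{c''(\theta)}{h(\tau)} = 0, \quad\text{so}\quad \operatorname{Var}(Y) = c''(\theta)\,h(\tau).$
With a plus sign in front of $c$, the same steps give the earlier formulas with the minus signs.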

## essentially due to Fisher

This is not as clearly spelled out in the book, but I would encourage you to read sections 2.5 and 2.5.1 before you pass judgment on the claim. Now, I can see that the claim might, in order to be completely correct, be taken down a notch. But in gaining precision it loses the ease of reading which is important in a lead. Fisher proposed a method of approximating the Hessian, and this is the key to using Newton's method for this set of problems. Certainly, Nelder and Wedderburn's paper found that the method could be expanded to the exponential family form. But the Nelder and Wedderburn paper is about fitting the model and less about its application (which the MC&N book covers in more detail) and, again, Fisher made one of the key insights on this front. Now, if you read this, read section 2.5 of MC&N, and understand it and still want it out, then by all means, take it out. Alternatively, it might make more sense to point it out in the body instead of in the lead; that is fine too. PDBailey (talk) 21:43, 11 January 2009 (UTC)

I've just skimmed those sections of McC & N and read the 'Bibliographic notes' in section 2.6 of McC&N, which i think make things fairly clear. As I understand it: Fisher introduced one method ('Fisher scoring') of maximizing the likelihood of the probit model. Wedderburn & Nelder generalized this to all exponential-family distributions and also generalized further by introducing the link function. These two generalizations together constitute the "generalized" bit of generalized linear models, and Fisher had nothing to do with them. I thought your edit implied that Fisher essentially proposed GLMs themselves, while I think he proposed a method (extended by N&W) used in fitting GLMs, which is why I reverted (but i'm happy we're discussing it). I think it would be worth explaining Fisher's contribution in a section on fitting GLMs, but the article doesn't have such a section at present. Qwfp (talk) 23:00, 12 January 2009 (UTC)
Agreed, fitting is the right spot for the point to be made. PDBailey (talk) 23:17, 12 January 2009 (UTC)

Wow, there really is no fitting section, I'll try to rectify that soon. PDBailey (talk) 03:32, 27 January 2009 (UTC)

"It relates the random distribution of the measured variable of the experiment (the distribution function) to the systematic (non-random) portion of the experiment (the linear predictor) through a function called the link function."

I think that is an accurate description (IIRC I helped write it). However, it might be considered a little wordy or technical. I was thinking of rephrasing it to segue with the first sentence, about being a generalization of least squares regression (linear regression). Something along the lines of, rather than just equating the mean to the linear predictor, it allows for particular relationships between them (in statspeak, the link function is being generalized). Thoughts? Baccyak4H (Yak!) 17:25, 6 February 2009 (UTC)

## Generalized linear model

So if i'm not mistaken, this model can be expressed using a standard notation (that is, comparable with the rest of regression models articles) as

$y_i = g^{-1}(x_i'\beta) + \varepsilon_i,$

where g is some monotonic "link" function (it might be helpful to add the reasoning for why we have $g^{-1}$ instead of simply g), and ε belongs to an exponential family. Then it proceeds with an explanation of how MLE can be applied to obtain the estimates.

This "generalized linear model" is in fact a sub-case of slightly more general non-linear regression model

$y_i = g(x_i,\,\beta) + \varepsilon_i,$

and therefore it is also solvable by all the standard methods used for non-linear models, such as

• non-linear least squares, in which case distributional assumption for ε's is not needed
• non-parametric regression methods, in which case preliminary knowledge of the shape of function g is unnecessary

It seems however that neither this connection, nor alternative approaches to estimation are ever mentioned in this article. // Stpasha (talk) 10:47, 3 July 2009 (UTC)

I think you're mistaken, though it's a common enough source of confusion that maybe it merits a mention in the article. The model is
$\operatorname{E}(Y_i) = g^{-1}(x_i'\beta)$
where Y belongs to an exponential family. Although this can be expressed in the form you give:
$Y_i = g^{-1}(x_i'\beta) + \varepsilon_i,$
by defining ε =Y − E(Y), this is not very useful, as then ε does not, in general, belong to an exponential family (as you claim above), or in fact to any standard probability distribution, and in general its distribution will vary between observations. For example, consider the case when y is Bernoulli. You could still fit this model by non-linear least squares, but that will give biased estimates without the asymptotic efficiency of the MLE, and still requires an iterative solution.
Why $g^{-1}$ instead of g? I guess it's fairly arbitrary whether you write the mean as a function of the linear predictor or the linear predictor as a function of the mean. From a quick glance at Nelder & Wedderburn's paper I think this convention may originate from the connection between the canonical link function and the natural parameter of the exponential family, but it also fits in with the usual way of writing logistic regression and Poisson regression, which predate GLMs. Agree it could be better explained. Qwfp (talk) 16:41, 3 July 2009 (UTC)
Ah yes, thank you, I see it now. The case when Y is Bernoulli is known as Binary response models, including Logit and Probit as particular cases. For those models there are also estimators which allow for an unspecified link function, such as Manski's Maximum Score Estimator and Cosslett's Generalized MLE. // Stpasha (talk) 21:17, 3 July 2009 (UTC)
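
To make the Bernoulli example in this thread concrete, here is a small Python sketch of the distribution of ε = Y − E(Y) for a few values of the mean. The helper name `bernoulli_residual_distribution` and the chosen means are made up for illustration.

```python
import numpy as np

def bernoulli_residual_distribution(mu):
    """Return the support and probabilities of eps = Y - mu for Y ~ Bernoulli(mu)."""
    support = np.array([-mu, 1.0 - mu])   # eps when Y = 0 and when Y = 1
    probs = np.array([1.0 - mu, mu])      # matching probabilities
    return support, probs

for mu in (0.1, 0.5, 0.9):
    support, probs = bernoulli_residual_distribution(mu)
    mean = float(support @ probs)                    # always zero by construction
    var = float(((support - mean) ** 2) @ probs)     # equals mu * (1 - mu)
    print(f"mu={mu}: eps takes values {support}, mean={mean:.3f}, var={var:.3f}")
```

The residual is a two-point distribution whose support and variance both move with the mean, so it changes from observation to observation and is not itself a member of a standard family.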

## Atomic mass values

If the reported atomic values were reported and related to the 2 atomic parameters: 1, Deuteron number (=2 times the Z number), and 2, Extra neutron number (= the n - Z number), what would be the best regression method to assign the variance to these 2 factors and determine the best regression equation? WFPM (talk) 03:05, 10 December 2011 (UTC)

It should be noted that a regression of the incremental deuteron mass values results in the creation of 2 separate populations of mass increase values, which are: 1, an increased even numbered deuteron mass value and 2, an increased odd numbered deuteron mass value. So maybe the analysis requires the regression to be for more than just 2 parameters. WFPM (talk) 16:43, 11 December 2011 (UTC)

## Multinomial belongs under examples, not under extensions

An extension is something that expands the generalized linear model: the multinomial models shown are simply special cases of distribution and link. — Preceding unsigned comment added by 24.34.200.147 (talk) 01:17, 28 May 2012 (UTC)

## Conflation of models and methods for fitting models

The section on HGLMs incorrectly suggested that these models differ from GLMMs. If what differs is the method for fitting the models, i.e., if there is no difference in the mathematical expression of the relationship between the covariates and the outcomes, then there is no need for a new acronym or to claim a new model has been expressed. I removed the section. — Preceding unsigned comment added by 24.34.200.147 (talk) 01:24, 28 May 2012 (UTC)

## Incorrect claims regarding what a GLMM means

A line implied that GLMMs "assume" normal random effects. In fact, there is no such limitation, though abilities to relax this assumption (other than via MCMC) are limited. The GLMM does not include any such assumption. Similarly, the GLMM is not (as was implied) limited to single-level models, with multilevel or hierarchical linear models implying a further generalization. I simplified the "extensions" section to correct this. — Preceding unsigned comment added by 24.34.200.147 (talk) 01:27, 28 May 2012 (UTC)

## Notation "μ"

Small point perhaps, but I found it a bit puzzling on first encounter: can someone "in the know" add a small note about the notation "μ"? I find it a bit confusing here, and I found the same some years ago when I first read about GLMs in a textbook. Towards the start here it seems to be implied that "μ" will, in the general case, stand for the expected value of the response variable. But then later, this notation is not always adhered to strictly. E.g. it seems to be used that way for a Bernoulli response, but not for Binomial or Multinomial (unless you redefine the response to be the proportion, rather than the count, which is not clearly and consistently done). — Preceding unsigned comment added by 83.217.170.175 (talk) 01:11, 18 September 2012 (UTC)

### Conditional mean...

Also, it might be a slightly confusing notation because the mean varies with $x$ --- I think it is confusing to leave out $x$ as an argument. Superpronker (talk) 10:26, 19 October 2012 (UTC)

## Binomial or bernoulli ?

The part about binomial regression shows the link function for a Bernoulli variable, without really saying why this works for a binomial. For instance, if Y is Binom(n,p), then E(Y) = np and g(E(Y)) = log((np) / (1-np)), which doesn't make sense. — Preceding unsigned comment added by Statr (talk • contribs) 19:41, 9 April 2013 (UTC)
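
One way to read this, sketched here under the assumption that the response is taken to be the proportion $Y/n$ rather than the count $Y$: if $Y \sim \operatorname{Binom}(n, p)$ then $\operatorname{E}(Y/n) = p$ and the logit link applies directly,
$g(\operatorname{E}(Y/n)) = \log\left(\frac{p}{1-p}\right).$
Equivalently, for the count with mean $\mu = np$, the link has to carry the known $n$:
$g(\mu) = \log\left(\frac{\mu}{n-\mu}\right) = \log\left(\frac{np}{n-np}\right) = \log\left(\frac{p}{1-p}\right).$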