Talk:Akaike information criterion

WikiProject Statistics (Rated C-class, Mid-importance)

WikiProject Mathematics (Rated C-class, Mid-priority)
Field: Probability and statistics

AICc formula

Hi all, I recently made a small change to the AICc formula to add the simplified version of the AICc (i.e. not in terms of the AIC equation above, but the direct formula) so that readers could see both the AICc's relation to the AIC (as a "correction") and as the formula recommended by Burnham and Anderson for general use [edit number 618051554].

An unnamed user reverted the edit, citing "invalid simplification." Does anyone (the reverting editor from IP 2.25.180.99 included) have any reason not to want the edit I made to stand? It adds the second line to the equation below:

\begin{align}AIC_c &= AIC + \frac{2k(k + 1)}{n - k - 1}\\ &= \frac{2kn}{n-k-1} - 2 \ln{(L)}\end{align}

The main point I have in favor of the change is that Burnham and Anderson point out in multiple locations that the AIC should be thought of as the asymptotic version of the AICc, rather than thinking of the AICc as a correction to the AIC. This second equation shows easily (by comparison with the AIC formula) how that asymptotic relationship holds, and it shows how to compute the AICc itself directly.

The point against it, as far as I can see, is just that there is now a second line of math (which I can imagine some people being opposed to...)

Any opinions? If not, I'll restore my edit in a few days/weeks. Dgianotti (talk) 18:03, 31 July 2014 (UTC)

The second equation is certainly valid. I do not see why the proposed new formula is a simplification though; for me, the first (current) formula is simpler. Regarding the asymptotic relationship, this is shown much more clearly by the first formula: as $n \to \infty$, the second term plainly goes to 0; so the formula becomes just AIC, as Burnham and Anderson state. Hence I prefer the current formula. 86.152.236.37 (talk) 21:11, 13 September 2014 (UTC)
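For anyone who wants to check the two points above numerically, here is a minimal sketch. The values of k and ln(L) are hypothetical, chosen only for illustration; the function names are mine, not from any library. It evaluates both lines of the AICc equation and shows the correction term shrinking as n grows.

```python
import math


def aic(k, log_l):
    """AIC = 2k - 2 ln(L)."""
    return 2 * k - 2 * log_l


def aicc_correction_form(k, n, log_l):
    """First line of the equation: AIC plus the small-sample correction."""
    return aic(k, log_l) + 2 * k * (k + 1) / (n - k - 1)


def aicc_direct_form(k, n, log_l):
    """Second line of the equation: the direct formula."""
    return 2 * k * n / (n - k - 1) - 2 * log_l


if __name__ == "__main__":
    k, log_l = 3, -42.0  # hypothetical values, not from any real fit
    for n in (10, 100, 10_000):
        a = aicc_correction_form(k, n, log_l)
        b = aicc_direct_form(k, n, log_l)
        # The two forms agree; the gap to plain AIC shrinks as n grows.
        print(n, math.isclose(a, b), a - aic(k, log_l))
```

Both forms require n > k + 1; the denominator is zero or negative otherwise.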

Fitting statistical models

When analyzing data, a general question of both absolute and relative goodness-of-fit of a given model arises. In the general case, we are fitting a model of K parameters to the observed data x_1, x_2 ... x_N. In the case of fitting AR, MA, ARMA, or ARIMA models, the question we are concerned with is what K is, i.e. how many parameters to include in the model.

The parameters are routinely estimated by minimizing the residual sum of squares, or by maximizing the log-likelihood of the data. For normally distributed errors, least squares and maximum likelihood yield identical parameter estimates.

These techniques are, however, unusable for estimation of optimal K. For that, we use information criteria which also justify the use of the log likelihood above.
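A small sketch of why the fit statistic alone cannot choose K: for nested models the RSS never increases as parameters are added, so minimizing RSS always picks the largest model, while AIC adds a penalty of 2 per parameter. The RSS values below are hypothetical, and the Gaussian form AIC = n ln(RSS/n) + 2K (up to a constant) is assumed.

```python
import math


def aic_from_rss(k, n, rss):
    """Gaussian-error AIC up to an additive constant: n ln(RSS/n) + 2k."""
    return n * math.log(rss / n) + 2 * k


n = 50
# Hypothetical RSS for nested models with K = 1..4 parameters.
rss_by_k = {1: 120.0, 2: 60.0, 3: 58.5, 4: 58.0}

aic_by_k = {k: aic_from_rss(k, n, rss) for k, rss in rss_by_k.items()}

best_k_by_rss = min(rss_by_k, key=rss_by_k.get)  # always the largest K here
best_k_by_aic = min(aic_by_k, key=aic_by_k.get)  # penalised choice

if __name__ == "__main__":
    print("K chosen by RSS:", best_k_by_rss)
    print("K chosen by AIC:", best_k_by_aic)
```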

More TBA... Note: The entropy link should be changed to Information entropy

Deviance Information Criterion

I've written a short article on DIC, please look it over and edit. Bill Jefferys 22:55, 7 December 2005 (UTC)

Great, thanks very much. I've edited a bit for style, no major changes though. Cheers, --MarkSweep (call me collect) 23:18, 7 December 2005 (UTC)

Justification/derivation

Is AIC *derived* from anything or is it just a hack? BIC is at least derivable from some postulate. Why would you ever use AIC over BIC or, better, cross validation?

There is a link on the page ([1]) which shows a proof that AIC can be derived from the same postulate as BIC and vice versa. Cross validation is good but computationally expensive compared to A/BIC - a problem for large scale optimisations. The actual discussion over BIC/AIC as a weapon of choice seems to be long, immensely technical/theoretical and not a little boring 128.240.229.7 12:37, 28 February 2007 (UTC)

Does the definition of AIC make sense with respect to dimension? That is...why would the log of the likelihood function have the same dimension as the number of parameters, so that subtracting them would make sense? Cazort 20:00, 14 November 2007 (UTC)

Origin of Name

AIC was said to stand for "An Information Criterion" by Akaike, not "Akaike Information Criterion". Yoderj 19:39, 16 February 2007 (UTC)

This is similar to Peter Ryom developing the RV-index for the works of Vivaldi - instead of the official Répertoire Vivaldi, his index of course became known as the Ryom Verzeichnis... I am inclined to believe this is not unintentional on the part of the developer; it's probably (false?) modesty. Classical geographer 12:18, 2 April 2007 (UTC)

Travis Gee

I have sent an e-mail to this Mr. Gee who is cited as a possible reference, with the following text:

Dear Mr. Gee,
For some time now your name is mentioned in the Wikipedia article on the Akaike Information Criterion (http://en.wikipedia.org/wiki/Akaike_information_criterion). You are cited as having developed a pseudo-R2 derived from the AIC. However, no exact reference is given. I'd be glad to hear from you whether you actually developed this, and where, if anywhere, you have published this measure.

However, he has not answered. I will remove the reference. Classical geographer 12:18, 2 April 2007 (UTC)

That measure ($R^2_{AIC} = 1 - \frac{AIC_0}{AIC_i}$) doesn't make sense to me. $R^2$ values range from 0 to 1. If model i is better than the null model, its AIC should be smaller. The numerator $AIC_0$ is then larger than the denominator $AIC_i$, so (for positive AIC values) the ratio exceeds 1 and $R^2_{AIC}$ is less than 0. This is saying that better models will generate a negative $R^2_{AIC}$.

It would make sense if the measure were: $R^2_{AIC} = 1 - \frac{AIC_i}{AIC_0}$
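A quick numerical illustration of the sign problem, using hypothetical positive AIC values (the function names are made up for this sketch). Note that AIC can also be negative, in which case neither ratio behaves like an R^2, so this only covers the positive case discussed above.

```python
def r2_aic_as_cited(aic_null, aic_model):
    """The form cited in the article: 1 - AIC_0 / AIC_i."""
    return 1 - aic_null / aic_model


def r2_aic_swapped(aic_null, aic_model):
    """The form suggested above: 1 - AIC_i / AIC_0."""
    return 1 - aic_model / aic_null


if __name__ == "__main__":
    # Hypothetical values: the fitted model improves on the null model.
    aic_null, aic_model = 200.0, 150.0
    print(r2_aic_as_cited(aic_null, aic_model))  # negative for a better model
    print(r2_aic_swapped(aic_null, aic_model))   # between 0 and 1
```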

Denoting pronunciation

Please write the pronunciation using the International phonetic alphabet, as specified in Wikipedia:Manual of style (pronunciation). -Pgan002 05:09, 10 May 2007 (UTC)

Bayes' Factor

I'm not an expert in model selection, but in my field (molecular phylogenetics) model selection is an increasingly important problem in methods involving Bayesian inference (e.g. MrBayes, BEAST), and AIC is apparently 'not appropriate' for these models [3]

Any thoughts anyone? I've also posted this on the model selection[4] page. Thanks.--Comrade jo (talk) 12:19, 19 December 2007 (UTC)

Confusion

The RSS in the definition is not a likelihood function! However, it turns out that the log likelihood looks similar to RSS. —Preceding unsigned comment added by 203.185.215.144 (talk) 23:12, 7 January 2008 (UTC)

I agree. What's written is actually the special case of AIC with least squares estimation with normally distributed errors (as stated in Burnham, Anderson, "Model selection and inference", p. 48). Furthermore, you can factor out ln(2pi)*n and an increase in K due to using least squares. These are both constant for a given data set, so they can be ignored. The AIC, as Burnham and Anderson present it, is really a tool for ranking possible models, with the one with the lowest AIC being the best; the actual AIC value is of less importance. EverGreg (talk) 15:27, 29 April 2008 (UTC)

Relevance to $\chi^2$ fitting

I have contributed a modified AIC, valid only for models with the same number of data points. It is quite useful though. Velocidex (talk) 09:17, 8 July 2008 (UTC)

Could you please supply some references on this one? Many variations of AIC have been proposed, but as e.g. Burnham and Anderson stress, only a few of these are grounded in the likelihood theory that AIC is derived from. I'm away from my books and in "summer-mode", so it could very well be that I just can't see how the section's result follows smoothly from the preceding derivation using RSS. :-)

EverGreg (talk) 11:22, 8 July 2008 (UTC)

For $\chi^2$ fitting, the likelihood is given by $L=\left[\prod_i \left(\frac{1}{2 \pi \sigma_i^2}\right)^{1/2}\right] \exp \left( -\sum_i\frac{(y_i-f(\mathbf{x}))^2}{2\sigma_i^2}\right)$
i.e. $\ln L = \ln\left(\prod_i\left(\frac{1}{2\pi\sigma_i^2}\right)^{1/2}\right) - \frac{1}{2}\sum_i \frac{(y_i-f(\mathbf{x}))^2}{\sigma_i^2}$
$= C - \chi^2/2$, where $\chi^2 = \sum_i (y_i-f(\mathbf{x}))^2/\sigma_i^2$ and C is a constant independent of the model used, and dependent only on the use of particular data points, i.e. it does not change if the data do not change.
The AIC is given by $AIC = 2k - 2\ln(L) = 2k - 2(C-\chi^2/2) = 2k -2C + \chi^2$. As only differences in AIC are meaningful, the constant $-2C$ can be omitted provided n does not change. This is the result I had before, which was correct. Velocidex (talk) 19:39, 18 May 2009 (UTC)
I should also say RSS is used by people who can't estimate their errors. If any error estimate is available for the data points, $\chi^2$ fitting should be used. Unweighted linear regression is dangerous because it uses the data points to estimate the errors by assuming a good fit. You get no independent estimate of the probability that your fit is good, Q. Velocidex (talk) 19:53, 18 May 2009 (UTC)
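The claim that the constant C drops out of AIC differences can be checked with a minimal sketch. The data, error bars and model predictions below are hypothetical, and the predictions are assumed to be the fitted values at the maximum of the likelihood for each model.

```python
import math


def chi_square(y, f, sigma):
    """chi^2 = sum_i ((y_i - f_i) / sigma_i)^2 with known error bars sigma_i."""
    return sum(((yi - fi) / si) ** 2 for yi, fi, si in zip(y, f, sigma))


def log_likelihood(y, f, sigma):
    """Gaussian log-likelihood: ln L = C - chi^2 / 2."""
    const = -0.5 * sum(math.log(2 * math.pi * si ** 2) for si in sigma)
    return const - 0.5 * chi_square(y, f, sigma)


def aic_full(k, y, f, sigma):
    """AIC = 2k - 2 ln L, constant included."""
    return 2 * k - 2 * log_likelihood(y, f, sigma)


def aic_chi2(k, y, f, sigma):
    """AIC with the model-independent constant -2C dropped: 2k + chi^2."""
    return 2 * k + chi_square(y, f, sigma)


# Hypothetical data with known per-point errors.
y = [1.1, 1.9, 3.2, 3.9]
sigma = [0.2, 0.2, 0.3, 0.3]
fit_a = [2.5, 2.5, 2.5, 2.5]  # hypothetical constant model, k = 1
fit_b = [1.0, 2.0, 3.0, 4.0]  # hypothetical straight line, k = 2

delta_full = aic_full(1, y, fit_a, sigma) - aic_full(2, y, fit_b, sigma)
delta_chi2 = aic_chi2(1, y, fit_a, sigma) - aic_chi2(2, y, fit_b, sigma)

if __name__ == "__main__":
    # Same data and same sigma_i, so the two differences coincide.
    print(delta_full, delta_chi2)
```

The equality relies on both models being compared on the same data points with the same sigma_i; if either changes, C no longer cancels.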

I think the link that appeared at the bottom "A tool for fitting distributions, times series and copulas using AIC with Excel by Vose Software" is not too relevant and only one of many tools that may incorporate AIC. I am not certain enough to remove it myself. Dirkjot (talk) 16:36, 17 November 2008 (UTC)

Confusion 2

The equation given here for determining AIC when error terms are normally distributed does not match the equation given by Burnham and Anderson on page 63 of their 2002 book. Burnham and Anderson's equation is identical except that it does not include a term with pi. Anyone know why this is? Tcadam (talk) 03:13, 17 December 2008 (UTC)Tcadam (talk) 03:14, 17 December 2008 (UTC)

Hi, I took the liberty of formatting your question. This is touched on in the "Confusion" section above. I assume you mean this equation:
$\mathit{AIC}=2k + n[\ln(2\pi \mathit{RSS}/n) + 1]\,.$
We should really fix it. Since ln(x*y) = ln(x) + ln(y), you can split off the 2pi term, so that AIC = Burnham and Anderson's equation + n ln(2pi). Since that term is constant for a given data set, it can be removed. This is because AIC is used to rank alternatives as best, second best, etc. Adding or subtracting the same constant from the AIC score of every alternative can't change the ranking between them. EverGreg (talk) 12:18, 17 December 2008 (UTC)
Added the simplified version in the article and emphasized ranking-only some more. By the way, did Burnham and Anderson skip the + 1 term too? EverGreg (talk) 12:39, 17 December 2008 (UTC)
I don't understand why you are using the term RSS/n. I don't see that in at least two of the references I am looking at. It is just RSS. 137.132.250.11 (talk) 09:29, 29 April 2010 (UTC)
Exactly. The 1/n term can be factored out and removed just like the 2pi term, using ln(x/y) = ln(x) - ln(y). It makes no difference if it's there or not, so most books should really go with $\mathit{AIC}=2k + n[\ln(\mathit{RSS})]\,$, as we have done in the article. The reason we see 2pi and 1/n at all is that they turn up when you take the general formula $\mathit{AIC} = 2k - 2\ln(L)\,$ and add the RSS assumption. We should probably add how this is derived, but I don't have a source on that nearby. EverGreg (talk) 13:19, 29 April 2010 (UTC)
Oh, I didn't check what you did on the article page. Thanks for spotting that! EverGreg (talk) 13:22, 29 April 2010 (UTC)
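To make the "constants don't change the ranking" argument concrete, here is a minimal sketch comparing the full Gaussian formula with the reduced one. The candidate models and their RSS values are hypothetical; the key assumption is that every model is fitted to the same n data points.

```python
import math


def aic_gaussian_full(k, n, rss):
    """AIC = 2k + n [ln(2 pi RSS / n) + 1]."""
    return 2 * k + n * (math.log(2 * math.pi * rss / n) + 1)


def aic_gaussian_reduced(k, n, rss):
    """AIC with the constants dropped: 2k + n ln(RSS)."""
    return 2 * k + n * math.log(rss)


n = 50
# Hypothetical candidates: name -> (number of parameters k, RSS).
candidates = {"A": (2, 61.0), "B": (3, 58.5), "C": (5, 57.9)}

full = {m: aic_gaussian_full(k, n, rss) for m, (k, rss) in candidates.items()}
reduced = {m: aic_gaussian_reduced(k, n, rss) for m, (k, rss) in candidates.items()}

rank_full = sorted(full, key=full.get)
rank_reduced = sorted(reduced, key=reduced.get)

# The two scores differ by n [ln(2 pi / n) + 1], which is the same for every model.
offsets = {m: full[m] - reduced[m] for m in candidates}

if __name__ == "__main__":
    print(rank_full, rank_reduced)
    print(offsets)
```

If models were fitted to different numbers of data points, the offset would differ between them and the reduced formula could no longer be used for comparison.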

Further confusion: Is there a discrepancy between AIC defined from the $\chi^2$: $AIC = 2k -2C + \chi^2$ and the RSS version: $\mathit{AIC}=2k + n[\ln(\mathit{RSS})]\,$? Don't they differ with an extra $\ln$? —Preceding unsigned comment added by 152.78.192.25 (talk) 15:27, 13 May 2011 (UTC)

Both formulas are valid, but the second one uses the additional assumption of a linear model. You can read that in [1]. It is a bit confusing because they do not state the assumption of the linear model on p. 63, but on p. 12 they derive the log-likelihood for the case of linear models and that makes it clear (I'm referring to page numbers in the second edition as found here). Frostus (talk) 11:04, 16 September 2014 (UTC)
I reverted my change with the linear model. Although it is shown for a linear model in Burnham & Anderson, 2002,[2] this assumption is not needed to derive the equation $AIC = n \ln( RSS/n) + 2 k + C$. I removed the part with "if the RSS is available", as it can always be calculated. Frostus (talk) 14:12, 18 September 2014 (UTC)
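A sketch of how the RSS formula follows from $\mathit{AIC} = 2k - 2\ln(L)$, under the assumptions that the errors are independent and normal with a common unknown variance $\sigma^2$, that $\sigma^2$ is estimated by maximum likelihood, and that k counts $\sigma^2$ as a parameter. No linearity of the mean function is used, and this is not taken from a particular source:

```latex
\begin{align}
\ln L(\theta,\sigma^2) &= -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{\mathit{RSS}(\theta)}{2\sigma^2} \\
\hat\sigma^2 &= \frac{\mathit{RSS}}{n}
  \quad\text{(maximising over } \sigma^2 \text{ at the fitted } \hat\theta\text{)} \\
\ln \hat L &= -\frac{n}{2}\ln\!\left(\frac{2\pi\,\mathit{RSS}}{n}\right) - \frac{n}{2} \\
\mathit{AIC} = 2k - 2\ln \hat L
  &= 2k + n\left[\ln\!\left(\frac{2\pi\,\mathit{RSS}}{n}\right) + 1\right] \\
  &= n\ln\!\left(\frac{\mathit{RSS}}{n}\right) + 2k + C,
  \qquad C = n\left[\ln(2\pi) + 1\right]
\end{align}
```

The $\chi^2$ version differs because there the $\sigma_i$ are known in advance rather than estimated from the residuals, which is why it has no logarithm.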

I suspect that the whole derivation concerning chi-square is wrong, since it uses the likelihood function instead of the maximum of the likelihood function in the AIC. — Preceding unsigned comment added by 141.14.232.254 (talk) 19:22, 14 February 2012 (UTC)

Controversy?!

What on earth is this section? It should be properly explained, with real references, or permanently deleted! I would like to see a book on model selection which describes AIC in detail, but also points out these supposed controversies! True bugman (talk) 11:50, 7 September 2010 (UTC)

This is a good question, but there is probably something that does need to be said about properties of AIC, not necessarily under "Controversy". For example, in this online dissertation, I found "AIC and other constant penalties notoriously include too many irrelevant predictors (Breiman and Freedman, 1983)" with the reference being: L. Breiman and D. Freedman. "How many variables should be entered in a regression equation?" Journal of the American Statistical Association, pages 131–136, 1983. There are similar results for using AIC to select a model order in time series analysis. But these results just reflect the penalty on large models that is inherent in AIC, and arises from the underlying derivation of AIC as something to optimise. Melcombe (talk) 16:04, 7 September 2010 (UTC)
Time series data is only one of many data types where modelling and AIC are used together. Something like this should be included in a special section dedicated to time series data. True bugman (talk) 12:09, 8 September 2010 (UTC)
The point was that the supposed problem with AIC is known to occur for both regression and time series, in exactly the same way, so it would be silly to have to say it twice in separate sections. Melcombe (talk) 16:51, 8 September 2010 (UTC)
Regression, as an example, is not covered in this article. Neither is time series. But yes, AIC is not perfect, and yes this should probably be discussed. But in a neutral way, this is by no means a controversy. I believe the entire 'controversy' section should be deleted. These are all recent changes from different IP addresses (110.32.136.51, 150.243.64.1, 99.188.106.28, 130.239.101.140) unsupported by citation, irrelevant for the article (the controversial topics discussed are not even in the article), and it is very poorly written (again there is no connection to the article). What does "crossover design", "given to us a priori by pre-testing", and "Monte Carlo testing" even mean? This section is written as an attack on the technique rather than a non-biased source of information. It is not verifiable WP:V nor written with a neutral point of view WP:NPOV. It must go. True bugman (talk) 17:19, 8 September 2010 (UTC)

Takeuchi information criterion

I removed the part on Takeuchi information criterion (based on matrix trace), because this seemed to give credit to Claeskens & Hjort. There could be a new section on TIC, if someone wanted to write one; for now, I included a reference to the 1976 paper. Note that Burnham & Anderson (2002) discuss TIC at length, and a section on TIC should cite their discussion. TIC is rarely useful in practice; rather, it is an important intermediate step in the most-general derivation of AIC and AICc.  86.170.206.175 (talk) 16:24, 14 April 2011 (UTC)

Biased Tone of BIC Comparison Section

I made a few minor edits in the BIC section to try to keep it a *little* more neutral, but it still reads with a very biased tone. I imagine a bunch of AIC proponents had a huge argument with BIC proponents and then decided to write that section as pro-AIC propaganda. You can find just as many papers in the literature that unjustifiably argue that BIC is "better" than AIC, as you can find papers that unjustifiably argue AIC is "better" than BIC. Furthermore, if AIC can be derived from the BIC formalism by just taking a different prior, then one might argue AIC is essentially contained within "generalized BIC", so how can BIC, in general, be "worse" than AIC if AIC can be derived through the BIC framework?

The truth is that neither AIC nor BIC is inherently "better" or "worse" than the other until you define a specific application (and by AIC, I include AICc and minor variants, and by BIC I include variants also to be fair). You can find applications where AIC fails miserably and BIC works wonderfully, and vice versa. To argue that this or that method is better in practice, because of asymptotic results or because of a handful of research papers, is flawed since, for most applications, you never get close to the fantasy world of "asymptopia" where asymptotic results can actually be used for justification, and you can almost always find a handful of research papers that argue method A is better than method B when, in truth, method A is only better than method B for the specific application they were working on. — Preceding unsigned comment added by 173.3.109.197 (talk) 17:44, 15 April 2012 (UTC)

AIC for nested models?

The article states that AIC is applicable for nested and non-nested models, with a reference to Anderson (2008). However, looking up the source, there's no explicit indication that the AIC should be used for nested models. Instead, the indicated reference just states that the AIC can be valuable for non-nested models. Are there other sources that might be more explicit? — Preceding unsigned comment added by Redsilk09 (talk • contribs) 10:11, 18 July 2012 (UTC)

I agree with the above comment. I've tried using the AIC for nested models as specified by the article, and the results were nonsensical. — Preceding unsigned comment added by 152.160.76.249 (talk) 20:14, 1 August 2012 (UTC)

BIC Section

The BIC section claims Akaike derived BIC independently and credits him as much as anyone else in discovering BIC. However, I have always read in the history books that Akaike was very excited when he first saw a BIC derivation (Schwarz's?), and that seeing it inspired him to develop his own Bayesian version of AIC. I thought it was well-documented historically that this was the case, and that he was a very graceful man who didn't think of BIC as a competitor to him, but thought of it as just yet another very useful and interesting result. His only disappointment, many accounts do claim, was that he didn't think of it himself earlier. Isn't that the standard way that all the historical accounts read?

Your version of events seems right to me. Akaike found his Bayesian version of AIC after seeing Schwarz's BIC. BIC and the Bayesian version of AIC turned out to be the same thing. (Maybe you should edit the article?) — Preceding unsigned comment added by 86.156.204.205 (talk) 14:10, 14 December 2012 (UTC)

Removed confusing sentence

I removed the following sentence: "This form is often convenient, because most model-fitting programs produce $\chi^2$ as a statistic for the fit." The $\chi^2$ statistic produced by many model-fitting programs is in fact the RSS (e.g. Origin [5]). But the RSS cannot simply replace $\chi^2$ in these equations. Either the $\sigma_i$ have to be known, or the following formula should be used: $AIC = n \ln(RSS/n) + 2k + C$. — Preceding unsigned comment added by 129.67.70.165 (talk) 14:34, 21 February 2013 (UTC)

Example?

The example from U. Georgia is no longer found; so I deleted it. It was:

I added the best example I could find with a Google search: [Akaike example filetype:pdf]. Done. Charles Edwin Shipp (talk) 13:31, 11 September 2013 (UTC)

Error in Reference

AIC was introduced by Akaike in 1971/1972 in "Information theory and an extension of the maximum likelihood principle", not in 1974. Please correct it. — Preceding unsigned comment added by 31.182.64.248 (talk • contribs) 01:05, 23 November 2014