# Talk:Akaike information criterion

WikiProject Statistics (Rated C-class, Mid-importance)
WikiProject Mathematics (Rated C-class, Mid-priority)

## AICc formula

Hi all, I recently made a small change to the AICc formula to add the simplified version of the AICc (i.e. not in terms of the AIC equation above, but the direct formula) so that readers could see both the AICc's relation to the AIC (as a "correction") and as the formula recommended by Burnham and Anderson for general use [edit number 618051554].

An unnamed user reverted the edit, citing "invalid simplification." Does anyone (the reverting editor from ip 2.25.180.99 included) have any reason to not want the edit I made to stand? It adds the second line to the equation below,

{\displaystyle {\begin{aligned}AIC_{c}&=AIC+{\frac {2k(k+1)}{n-k-1}}\\&={\frac {2kn}{n-k-1}}-2\ln {(L)}\end{aligned}}}

The main point I have in favor of the change is that Burnham and Anderson point out in multiple locations that the AIC should be thought of as the asymptotic version of the AICc, rather than thinking of the AICc as a correction to the AIC. This second equation shows easily (by comparison with the AIC formula) how that asymptotic relationship holds, and it shows how to compute the AICc itself directly.
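For what it's worth, the equivalence of the two forms is easy to verify numerically (a quick sketch; the values of k, n, and ln L below are arbitrary illustrations, not taken from any real fit):

```python
import math

# Hypothetical values for illustration: k parameters, n observations,
# and a maximized log-likelihood ln(L).
k, n, lnL = 3, 40, -57.2

aic = 2 * k - 2 * lnL

# AICc written as a correction to AIC ...
aicc_correction = aic + (2 * k * (k + 1)) / (n - k - 1)

# ... and the direct form 2kn/(n-k-1) - 2 ln(L).
aicc_direct = (2 * k * n) / (n - k - 1) - 2 * lnL

assert math.isclose(aicc_correction, aicc_direct)
```

As n grows with k fixed, 2kn/(n−k−1) tends to 2k, which is the asymptotic relationship with AIC that the second line is meant to display.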

The point against it, as far as I can see, is just that there is now a second line of math (which I can imagine some people being opposed to...)

Any opinions? If none, I'll change my edit back in a few days/weeks. Dgianotti (talk) 18:03, 31 July 2014 (UTC)

The second equation is certainly valid. I do not see why the proposed new formula is a simplification though; for me, the first (current) formula is simpler. Regarding the asymptotic relationship, this is shown much more clearly by the first formula: as n→ ∞, the second term plainly goes to 0; so the formula becomes just AIC, as Burnham and Anderson state. Hence I prefer the current formula. 86.152.236.37 (talk) 21:11, 13 September 2014 (UTC)

## Fitting statistical models

When analyzing data, a general question of both absolute and relative goodness-of-fit of a given model arises. In the general case, we are fitting a model of K parameters to the observed data x_1, x_2 ... x_N. In the case of fitting AR, MA, ARMA, or ARIMA models, the question we are concerned with is what K is, i.e. how many parameters to include in the model.

The parameters are routinely estimated by minimizing the residual sum of squares or by maximizing the log likelihood of the data. For normally distributed errors, the least-squares method and the maximum-likelihood method yield identical parameter estimates.

These techniques are, however, unusable for estimating the optimal K. For that, we use information criteria, which also justify the use of the log likelihood above.
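The procedure sketched above can be illustrated with synthetic data (a hedged example; the quadratic model, noise level, and all numbers are invented for illustration, and for normal errors the least-squares AIC reduces, up to a model-independent constant, to n·ln(RSS/n) + 2K):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a quadratic trend plus normal noise (illustrative values).
n = 100
x = np.linspace(0.0, 1.0, n)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.1, size=n)

# For normally distributed errors, maximizing the log likelihood is the
# same as minimizing RSS, and (dropping model-independent constants)
#   AIC = n * ln(RSS / n) + 2 * K.
aic = {}
for degree in range(5):
    coeffs = np.polyfit(x, y, degree)
    rss = float(np.sum((y - np.polyval(coeffs, x)) ** 2))
    k = degree + 2  # polynomial coefficients plus the error variance
    aic[degree] = n * np.log(rss / n) + 2 * k

best = min(aic, key=aic.get)  # candidate order with the smallest AIC
```

Underfitted models (degree 0 or 1) are heavily penalized through their large RSS, so the criterion trades fit against the 2K penalty rather than simply rewarding more parameters.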

## More TBA...

More TBA... Note: The entropy link should be changed to Information entropy

## Deviance Information Criterion

I've written a short article on DIC, please look it over and edit. Bill Jefferys 22:55, 7 December 2005 (UTC)

Great, thanks very much. I've edited a bit for style, no major changes though. Cheers, --MarkSweep (call me collect) 23:18, 7 December 2005 (UTC)

## Justification/derivation

Is AIC *derived* from anything, or is it just a hack? BIC is at least derivable from some postulate. Why would you ever use AIC over BIC or, better, cross-validation?

There is a link on the page ([1]) which shows a proof that AIC can be derived from the same postulate as BIC and vice versa. Cross validation is good but computationally expensive compared to A/BIC - a problem for large scale optimisations. The actual discussion over BIC/AIC as a weapon of choice seems to be long, immensely technical/theoretical and not a little boring 128.240.229.7 12:37, 28 February 2007 (UTC)

Does the definition of AIC make sense with respect to dimension? That is...why would the log of the likelihood function have the same dimension as the number of parameters, so that subtracting them would make sense? Cazort 20:00, 14 November 2007 (UTC)

## Origin of Name

AIC was said to stand for "An Information Criterion" by Akaike, not "Akaike information Criterion" Yoderj 19:39, 16 February 2007 (UTC)

This is similar to Peter Ryom developing the RV-index for the works of Vivaldi - instead of the official Répertoire Vivaldi, his index of course became known as the Ryom Verzeichnis... I am inclined to believe this is not unintentional on the part of the developer; it's probably (false?) modesty. Classical geographer 12:18, 2 April 2007 (UTC)

This criterion is alternately called the WAIC: Watanabe-Akaike Information Criterion[2], or the widely-applicable information criterion [3][4]. — Preceding unsigned comment added by 167.220.148.12 (talk) 10:48, 29 April 2016 (UTC)

WAIC and AIC are not the same. I have now listed a paper for WAIC in the "Further reading" section. SolidPhase (talk) 07:31, 16 May 2016 (UTC)

## Travis Gee

I have sent an e-mail to this Mr. Gee who is cited as a possible reference, with the following text:

Dear Mr. Gee,
For some time now your name is mentioned in the Wikipedia article on the Akaike Information Criterion (http://en.wikipedia.org/wiki/Akaike_information_criterion). You are cited as having developed a pseudo-R2 derived from the AIC. However, no exact reference is given. I'd be glad to hear from you whether you actually developed this, and where, if anywhere, you have published this measure.

However, he has not answered. I will remove the reference. Classical geographer 12:18, 2 April 2007 (UTC)

That measure ( R^2_{AIC}= 1 - \frac{AIC_0}{AIC_i} ) doesn't make sense to me. R^2 values range from 0 to 1. If the model is better than the null model, its AIC should be smaller. But then the numerator is larger than the denominator, so R^2_{AIC} will be negative. This is saying that better models will generate a negative R^2_{AIC}.

It would make sense if the model were: R^2_{AIC}= 1 - \frac{AIC_i}{AIC_0}
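A quick numeric illustration of the point (the AIC values below are hypothetical, chosen only to show the sign behaviour of the two formulas):

```python
# Hypothetical AIC values: the null model and a better-fitting model.
aic_null, aic_model = 250.0, 200.0

# With the formula as cited, a better model (smaller AIC) gives a
# negative value, outside the usual [0, 1] range of an R-squared:
r2_cited = 1 - aic_null / aic_model      # 1 - 250/200 = -0.25

# Swapping numerator and denominator behaves more like an R-squared:
r2_swapped = 1 - aic_model / aic_null    # 1 - 200/250 = 0.20

assert r2_cited < 0 < r2_swapped < 1
```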

## Denoting pronunciation

Please write the pronunciation using the International phonetic alphabet, as specified in Wikipedia:Manual of style (pronunciation). -Pgan002 05:09, 10 May 2007 (UTC)

## Bayes' Factor

I'm not an expert in model selection, but in my field (molecular phylogenetics) model selection is an increasingly important problem in methods involving Bayesian inference (e.g. MrBayes, BEAST), and AIC is apparently 'not appropriate' for these models [6].

Any thoughts anyone? I've also posted this on the model selection[7] page. Thanks.--Comrade jo (talk) 12:19, 19 December 2007 (UTC)

## Confusion

The RSS in the definition is not a likelihood function! However, it turns out that the log likelihood looks similar to RSS. —Preceding unsigned comment added by 203.185.215.144 (talk) 23:12, 7 January 2008 (UTC)

I agree. What's written is actually the special case of AIC for least squares estimation with normally distributed errors (as stated in Burnham & Anderson, "Model Selection and Inference", p. 48). Furthermore, you can factor out ln(2π)·n and an increase in K due to using least squares. These are both constant when the available data are given, so they can be ignored. The AIC, as Burnham and Anderson present it, is really a tool for ranking candidate models, with the one with the lowest AIC being the best; the actual AIC value is of less importance. EverGreg (talk) 15:27, 29 April 2008 (UTC)

## Relevance to ${\displaystyle \chi ^{2}}$ fitting

I have contributed a modified AIC, valid only for models with the same number of data points. It is quite useful though. Velocidex (talk) 09:17, 8 July 2008 (UTC)

Could you please supply some references on this one? Many variations of AIC have been proposed, but as e.g. Burnham and Anderson stress, only a few of these are grounded in the likelihood theory that AIC is derived from. I'm away from my books and in "summer-mode", so it could very well be that I just can't see how the section's result follows smoothly from the preceding derivation using RSS. :-)

EverGreg (talk) 11:22, 8 July 2008 (UTC)

For ${\displaystyle \chi ^{2}}$ fitting, the likelihood is given by ${\displaystyle L=\prod \left({\frac {1}{2\pi \sigma _{i}^{2}}}\right)^{1/2}\exp \left(-\sum {\frac {(y_{i}-f(\mathbf {x} ))^{2}}{2\sigma _{i}^{2}}}\right)}$
i.e. ${\displaystyle \ln L=\ln \left(\prod \left({\frac {1}{2\pi \sigma _{i}^{2}}}\right)^{1/2}\right)-{\frac {1}{2}}\sum {\frac {(y_{i}-f(\mathbf {x} ))^{2}}{\sigma _{i}^{2}}}}$
${\displaystyle =C-\chi ^{2}/2}$, where C is a constant independent of the model used, and dependent only on the use of particular data points. i.e. it does not change if the data do not change.
The AIC is given by ${\displaystyle AIC=2k-2\ln(L)=2k-2(C-\chi ^{2}/2)=2k-2C+\chi ^{2}}$. As only differences in AIC are meaningful, this constant can be omitted provided n does not change. This is the result I had before, which was correct. Velocidex (talk) 19:39, 18 May 2009 (UTC)
I should also say RSS is used by people who can't estimate their errors. If any error estimate is available for the data points, ${\displaystyle \chi ^{2}}$ fitting should be used. Unweighted linear regression is dangerous because it uses the data points to estimate the errors by assuming a good fit. You get no independent estimate of the probability that your fit is good, Q. Velocidex (talk) 19:53, 18 May 2009 (UTC)
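The claim that the constant C drops out of AIC comparisons can be checked numerically (a sketch; the data points, errors, fitted values, and parameter counts below are all made up for illustration):

```python
import math

# Illustrative setup: data y_i with known errors sigma_i, and two
# hypothetical candidate models' fitted values.
y     = [1.1, 1.9, 3.2, 3.9]
sigma = [0.1, 0.1, 0.1, 0.1]
fit_a = [1.0, 2.0, 3.0, 4.0]     # model A, k_a parameters
fit_b = [1.05, 1.95, 3.1, 3.95]  # model B, k_b parameters
k_a, k_b = 2, 3

def chi2(fit):
    return sum((yi - fi) ** 2 / si**2 for yi, fi, si in zip(y, fit, sigma))

def neg2lnL(fit):
    # -2 ln L for independent Gaussian errors: chi^2 plus the sum of
    # ln(2 pi sigma_i^2), a term that does not depend on the model.
    return chi2(fit) + sum(math.log(2 * math.pi * si**2) for si in sigma)

# AIC with the constant kept, and with it dropped:
full_a, full_b = 2 * k_a + neg2lnL(fit_a), 2 * k_b + neg2lnL(fit_b)
short_a, short_b = 2 * k_a + chi2(fit_a), 2 * k_b + chi2(fit_b)

# The difference between models is identical either way.
assert math.isclose(full_a - full_b, short_a - short_b)
```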

I am unhappy with this section. It says "where C is a constant independent of the model used, and dependent only on the use of particular data points, i.e. it does not change if the data do not change."

But this is only true if the ${\displaystyle \sigma _{i}}$'s are the same for the two models. And under "Equal-variances case" it explicitly says that ${\displaystyle \sigma }$ is unknown, hence is estimated by the models. For instance, if we compare two nested linear models, then the larger will estimate ${\displaystyle \sigma }$ to a smaller value than the smaller model. In this case it is the converse: the "constant" C will differ between models, whereas the exponential terms will cancel out (they will both be exp(−1)).

The formula with RSS is correct, but the derivation is wrong for the above reason.

All this needs to be fixed. (Harald Lang, 9/12/2015) — Preceding unsigned comment added by 46.39.98.125 (talk) 11:36, 9 December 2015 (UTC)

Your point seems valid to me. Additionally, it is notable that the subsection "General case" does not have any references (unlike the subsection "Equal-variances case"). Moreover, I have just skimmed through Burnham & Anderson (2002), and did not see any supportive discussion that could be cited.
The editor who first added the text for the "General case" has been inactive for over a year; so asking them would probably not lead anywhere. Does anyone have a justification for keeping the "General case" subsection? If not, I will delete that subsection, and revise the "Equal-variances case" subsection.
SolidPhase (talk) 22:43, 10 December 2015 (UTC)

I think the link that appeared at the bottom "A tool for fitting distributions, times series and copulas using AIC with Excel by Vose Software" is not too relevant and only one of many tools that may incorporate AIC. I am not certain enough to remove it myself. Dirkjot (talk) 16:36, 17 November 2008 (UTC)

## Confusion 2

The equation given here for determining AIC when error terms are normally distributed does not match the equation given by Burnham and Anderson on page 63 of their 2002 book. Burnham and Anderson's equation is identical except that it does not include a term with pi. Anyone know why this is? Tcadam (talk) 03:13, 17 December 2008 (UTC)Tcadam (talk) 03:14, 17 December 2008 (UTC)

Hi, I took the liberty to format your question. this is touched on in the "confusion" paragraph above. I assume you mean this equation:
${\displaystyle {\mathit {AIC}}=2k+n[\ln(2\pi {\mathit {RSS}}/n)+1]\,.}$
We should really fix it. Since for logarithms ln(x·y) = ln(x) + ln(y), you can factor out the 2π term, so that AIC = Burnham and Anderson's equation + a 2π term. Since the 2π term is a constant, it can be removed. This is because AIC is used to rank alternatives as best, second best, etc. Adding or subtracting a constant from the AIC score of all alternatives can't change the ranking between them. EverGreg (talk) 12:18, 17 December 2008 (UTC)
Added the simplified version in the article and emphasized ranking-only some more. By the way, did Burnham and Anderson skip the +1 term too? EverGreg (talk) 12:39, 17 December 2008 (UTC)
I don't understand why you are using the term RSS/n. I don't see that in at least two of the references I am looking at. It is just RSS. 137.132.250.11 (talk) 09:29, 29 April 2010 (UTC)
Exactly. The 1/n term can be factored out and removed just like the 2π term, using that ln(x/y) = ln(x) − ln(y). It makes no difference whether it's there or not, so most books should really go with ${\displaystyle {\mathit {AIC}}=2k+n[\ln({\mathit {RSS}})]\,}$, as we have done in the article. The reason we see 2π and 1/n at all is that they turn up when you take the general formula ${\displaystyle {\mathit {AIC}}=2k-2\ln(L)\,}$ and add the RSS assumption. We should probably add how this is derived, but I don't have a source on that nearby. EverGreg (talk) 13:19, 29 April 2010 (UTC)
Oh, I didn't check what you did on the article page. Thanks for spotting that! EverGreg (talk) 13:22, 29 April 2010 (UTC)
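The constant-term argument in this exchange can be verified directly (a sketch; the RSS values, parameter counts, and n below are invented for illustration):

```python
import math

n = 30
# Hypothetical residual sums of squares and parameter counts (RSS, k)
# for two candidate models fitted to the same n data points.
models = {"A": (12.4, 3), "B": (9.8, 5)}

def aic_full(rss, k):
    # Form with the 2*pi, 1/n, and +1 terms kept.
    return 2 * k + n * (math.log(2 * math.pi * rss / n) + 1)

def aic_short(rss, k):
    # Form with all model-independent constants dropped.
    return 2 * k + n * math.log(rss)

d_full = aic_full(*models["A"]) - aic_full(*models["B"])
d_short = aic_short(*models["A"]) - aic_short(*models["B"])

# The dropped terms contribute the same constant to every model
# (same n), so AIC differences -- and hence rankings -- agree.
assert math.isclose(d_full, d_short)
```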

Further confusion: Is there a discrepancy between AIC defined from the ${\displaystyle \chi ^{2}}$: ${\displaystyle AIC=2k-2C+\chi ^{2}}$ and the RSS version: ${\displaystyle {\mathit {AIC}}=2k+n[\ln({\mathit {RSS}})]\,}$? Don't they differ by an extra ${\displaystyle \ln }$? —Preceding unsigned comment added by 152.78.192.25 (talk) 15:27, 13 May 2011 (UTC)

Both formulas are valid, but the second one uses the additional assumption of a linear model. You can read that in [1]. It is a bit confusing because they do not state the assumption of the linear model on p. 63, but on p. 12 they derive the log-likelihood for the case of linear models, and that makes it clear (I'm referring to page numbers in the second edition as found here). Frostus (talk) 11:04, 16 September 2014 (UTC)
I reverted my change with the linear model. Although it is shown for a linear model in Burnham & Anderson, 2002,[2] this assumption is not needed to derive the equation ${\displaystyle AIC=n\ln(RSS/n)+2k+C}$. I removed the part with "if the RSS is available", as it can always be calculated Frostus (talk) 14:12, 18 September 2014 (UTC)

I suspect that the whole derivation concerning chi-square is wrong, since it uses the likelihood function instead of the maximum of the likelihood function in the AIC. — Preceding unsigned comment added by 141.14.232.254 (talk) 19:22, 14 February 2012 (UTC)


## Controversy?!

What on earth is this section? It should be properly explained, with real references, or permanently deleted! I would like to see a book on model selection which describes AIC in detail, but also points out these supposed controversies! True bugman (talk) 11:50, 7 September 2010 (UTC)

This is a good question, but there is probably something that does need to be said about properties of AIC, not necessarily under "Controversy". For example, in this online dissertation, I found "AIC and other constant penalties notoriously include too many irrelevant predictors (Breiman and Freedman, 1983)" with the reference being: L. Breiman and D. Freedman. "How many variables should be entered in a regression equation?" Journal of the American Statistical Association, pages 131–136, 1983. There are similar results for using AIC to select a model order in time series analysis. But these results just reflect the penalty on large models that is inherent in AIC, and arises from the underlying derivation of AIC as something to optimise. Melcombe (talk) 16:04, 7 September 2010 (UTC)
Time series data is only one of many data types where modelling and AIC are used together. Something like this should be included in a special section dedicated to time series data. True bugman (talk) 12:09, 8 September 2010 (UTC)
The point was that the supposed problem with AIC is known to occur for both regression and time series, in exactly the same way, so it would be silly to have to say it twice in separate sections. Melcombe (talk) 16:51, 8 September 2010 (UTC)
Regression, as an example, is not covered in this article. Neither is time series. But yes, AIC is not perfect, and yes this should probably be discussed. But in a neutral way, this is by no means a controversy. I believe the entire 'controversy' section should be deleted. These are all recent changes from different IP addresses (110.32.136.51, 150.243.64.1, 99.188.106.28, 130.239.101.140) unsupported by citation, irrelevant for the article (the controversial topics discussed are not even in the article), and it is very poorly written (again there is no connection to the article). What does "crossover design", "given to us a priori by pre-testing", and "Monte Carlo testing" even mean? This section is written as an attack on the technique rather than a non-biased source of information. It is not verifiable WP:V nor written with a neutral point of view WP:NPOV. It must go. True bugman (talk) 17:19, 8 September 2010 (UTC)

## Takeuchi information criterion

I removed the part on Takeuchi information criterion (based on matrix trace), because this seemed to give credit to Claeskens & Hjort. There could be a new section on TIC, if someone wanted to write one; for now, I included a reference to the 1976 paper. Note that Burnham & Anderson (2002) discuss TIC at length, and a section on TIC should cite their discussion. TIC is rarely useful in practice; rather, it is an important intermediate step in the most-general derivation of AIC and AICc.  86.170.206.175 (talk) 16:24, 14 April 2011 (UTC)

## Biased Tone of BIC Comparison Section

I made a few minor edits in the BIC section to try to keep it a *little* more neutral, but it still reads with a very biased tone. I imagine a bunch of AIC proponents had a huge argument with BIC proponents and then decided to write that section as pro-AIC propaganda. You can find just as many papers in the literature that unjustifiably argue that BIC is "better" than AIC, as you can find papers that unjustifiably argue AIC is "better" than BIC. Furthermore, if AIC can be derived from the BIC formalism by just taking a different prior, then one might argue AIC is essentially contained within "generalized BIC", so how can BIC, in general, be "worse" than AIC if AIC can be derived through the BIC framework?

The truth is that neither AIC nor BIC is inherently "better" or "worse" than the other until you define a specific application (and by AIC, I include AICc and minor variants, and by BIC I include variants also to be fair). You can find applications where AIC fails miserably and BIC works wonderfully, and vice versa. To argue that this or that method is better in practice, because of asymptotic results or because of a handful of research papers, is flawed since, for most applications, you never get close to the fantasy world of "asymptopia" where asymptotic results can actually be used for justification, and you can almost always find a handful of research papers that argue method A is better than method B when, in truth, method A is only better than method B for the specific application they were working on. — Preceding unsigned comment added by 173.3.109.197 (talk) 17:44, 15 April 2012 (UTC)

## AIC for nested models?

The article states that AIC is applicable for nested and non-nested models, with a reference to Anderson (2008). However, looking up the source, there's no explicit indication that the AIC should be used for nested models. Instead, the indicated reference just states that the AIC can be valuable for non-nested models. Are there other sources that might be more explicit? — Preceding unsigned comment added by Redsilk09 (talkcontribs) 10:11, 18 July 2012 (UTC)

I agree with the above comment. I've tried using the AIC for nested models as specified by the article, and the results were nonsensical. — Preceding unsigned comment added by 152.160.76.249 (talk) 20:14, 1 August 2012 (UTC)

## BIC Section

The BIC section claims Akaike derived BIC independently and credits him as much as anyone else in discovering BIC. However, I have always read in the history books that Akaike was very excited when he first saw a BIC derivation (Schwarz's?), and that seeing it inspired him to develop his own Bayesian version of AIC. I thought it was well documented historically that this was the case, and that he was a very graceful man who didn't think of BIC as a competitor to him, but thought of it as just yet another very useful and interesting result. His only disappointment, many accounts do claim, was that he didn't think of it himself earlier. Isn't that the standard way that all the historical accounts read?

Your version of events seems right to me. Akaike found his Bayesian version of AIC after seeing Schwarz's BIC. BIC and the Bayesian version of AIC turned out to be the same thing. (Maybe you should edit the article?) — Preceding unsigned comment added by 86.156.204.205 (talk) 14:10, 14 December 2012 (UTC)

## Removed confusing sentence

I removed the following sentence: "This form is often convenient, because most model-fitting programs produce ${\displaystyle \chi ^{2}}$ as a statistic for the fit." The ${\displaystyle \chi ^{2}}$ statistic produced with many model-fitting programs is in fact the RSS (e.g. Origin [8]). But the RSS cannot simply replace ${\displaystyle \chi ^{2}}$ in these equations. Either the σi has to be known or the following formula should be used AIC = n ln(RSS/n) + 2k + C. — Preceding unsigned comment added by 129.67.70.165 (talk) 14:34, 21 February 2013 (UTC)

## Example?

The example from U. Georgia is no longer found; so I deleted it. It was:

I added the best example I could find with a Google-search: [Akaike example filetype:pdf] DoneCharles Edwin Shipp (talk) 13:31, 11 September 2013 (UTC)

## Error in Reference

AIC was introduced by Akaike in 1971/1972 in "Information theory and an extension of the maximum likelihood principle", not in 1974. Please correct it. — Preceding unsigned comment added by 31.182.64.248 (talkcontribs) 01:05, 23 November 2014

## Recent edits by Tayste

Explain yourself. The "relative quality of a model" is ungrammatical - relative to what? This must be "models" plural. As for measuring "quality" - this sounds like higher values of AIC mean greater quality, but the reverse is true, so this should be made clear up front in the lead. Why remove that? Thirdly, WP:HEADINGS states that "Headings should not refer redundantly to the subject of the article". Lastly, it seems (to me) better to talk about how AIC works before discussing its limitations. Tayste (edits) 18:07, 18 June 2015 (UTC)

@Tayste: Okay, how about removing the word "relative" from the first sentence?
About "models" plural, will you elaborate? AIC gives the value for a single model; so I think that singular is appropriate.
I disagree with mentioning about higher/lower values in the lead. Following WP:LEAD, this looks like clutter that someone who read only the lead would not benefit from. The issue is discussed in the second paragraph of the body: in the first sentence, and italicized.
Which heading referred redundantly to the subject of the article?
Does your last point pertain to my edit?
SolidPhase (talk) 18:28, 18 June 2015 (UTC)

As an interim measure (only), I have restored the body to my last edit, but kept your lead section. SolidPhase (talk) 19:47, 18 June 2015 (UTC)

It has now been over four days.
Regarding the sentence "Lower values of AIC indicate higher quality and therefore better models", as above I think that including this is clutter, which will be especially distracting for people who only read the lead. Additionally, there are many activities where the minimum is the optimum, e.g. golf. Moreover, in the field of Optimization, the canonical examples are minimization. I definitely believe that the sentence should be removed; so I have now done that.
Regarding the grammatical changes that you made, I do not agree. Back in March, though, you found a grammatical error: and you were correct, of course. Hence I get the impression that you have a really good grammatical knowledge. I do not understand what you find grammatically wrong about the previous version, though, or why your version is correct. Simply put, I am confused about this(!). Your edits to the grammar remain as you made them, but I would really appreciate it if we could discuss this issue further. Will you explain the reasons for your grammatical change more?
SolidPhase (talk) 19:19, 22 June 2015 (UTC)

I've stayed away (partly) to give other editors an opportunity to chip in. The AIC value for a single model is completely meaningless in isolation. It tells absolutely nothing about the quality of that model. AIC values are only useful when the differences in values are taken for pairs of models fitted to the same data set. So the word "relative" must be there. Thank you for retaining the plural in the first sentence.
Despite your specific counter examples, the generally understood meaning of the verb to measure is that it assigns higher numbers for greater amounts of the aspect being measured. In terms of relative measurement, AIC measures not the quality of models but their lack of quality, since higher values mean worse. I disagree that it clutters the lead to state the direction in which AIC works. It is a fundamentally important point to get across early for anyone wishing to understand what AIC is. Tayste (edits) 20:56, 22 June 2015 (UTC)
To me (a far-from-expert in grammar), the phrase “relative quality of statistical models” seems inappropriate, because “quality” is singular and “models” is plural.
The lead currently states that AIC “offers a relative estimate of the information lost”; so the less information lost, the better. Regarding measure, this is a formal term in mathematics, and the definition requires that all measures be nonnegative. What about replacing the term “measure” by something else?—e.g. “AIC provides a means for assessing the relative quality of statistical models”. Could something like that be okay?
SolidPhase (talk) 22:17, 22 June 2015 (UTC)
"Quality" is indeed singular, but "relative quality" necessitates a comparison involving at least a pair of models. I'd be happy with "means" but I think "measure" here was being used in the more general sense (anywhere on the Real line) rather than that specific mathematical definition. The point about the information lost is actually a better definition than "quality". Tayste (edits) 22:44, 22 June 2015 (UTC)
Quick comments. The lead is confusing and misrepresents. AIC is a number that is calculated without reference to other models. It is a metric that does not depend on other models. That is, the value of AIC ignores all other models. AIC's value is not "relative to each of the other models".
AIC can be used to rank different models under the AIC metric. Using that metric does not guarantee the earlier statement that "Lower values of AIC indicate higher quality and therefore better models." The notion of "better" is tempered by the metric. AIC might deprecate an exact model due to its complexity.
I don't care much about which scores are better. For fits, lower chi square values are better, so smaller is better is not a foreign concept.
I don't know if AIC has some value as an absolute metric. For example, if input variances are known, then reduced chi square near 1 suggests a good model.
Glrx (talk) 05:46, 23 June 2015 (UTC)
I agree that the lead is confusing. I have spent 1–2 hours trying to come up with something better, but so far I have got nothing constructive to propose. Including the sentence about lower values indicating higher quality makes the lead more confusing, which is why I have been advocating keeping the sentence out.
I agree that metric is more appropriate than measure, considering the formal mathematical definitions. I also think that the mathematical definitions are highly relevant, given that AIC is part of some fairly advanced mathematical statistics. One problem with "metric", though, is that the word is not commonly known. Hence, if the word were used, people without the requisite mathematical background would be confused.
SolidPhase (talk) 09:42, 23 June 2015 (UTC)

Are you quite sure that AIC is ranked with the best as lowest? Here are rankings from a Mathematica case of the 5 best models for a problem that uses BIC for ranking:

| BIC | AIC | HQIC |
| --- | --- | --- |
| 3.841 | 3.857 | 3.845 |
| 3.815 | 3.825 | 3.818 |
| 3.735 | 3.746 | 3.738 |
| 3.732 | 3.742 | 3.735 |
| 3.458 | 3.468 | 3.461 |

Note that they go from highest as best to lowest as worst. I checked on this and it seems that some programs output −AIC, not AIC. However, the word is "index." In addition to being accurate, it is in common usage, and people with no higher mathematical training understand it. Please change this, post an objection to the change, or otherwise I will change it. If you then change it back without discussion, which is typical, we will have a dispute, as I will keep changing it back until there is a dispute settlement. CarlWesolowski (talk) 14:21, 11 July 2016 (UTC) CarlWesolowski (talk) 18:37, 14 July 2016 (UTC)

I strongly oppose using the word "index", because the word would be confusing here.
Your claim that I undo your edits "without discussion, which is typical" is false, and slanderous.
Your threat to start an edit war is in violation of Wikipedia norms.
SolidPhase (talk) 14:30, 15 July 2016 (UTC)

## Assessment comment

The comment(s) below were originally left at Talk:Akaike information criterion/Comments, and are posted here for posterity. Following several discussions in past years, these subpages are now deprecated. The comments may be irrelevant or outdated; if so, please feel free to remove this section.

Hello, I am not a statistician and therefore can only remark on what I was unable to understand. My concern is about "k": in the paragraph "Definition" it is said that k is the number of parameters in the statistical model, while in the paragraph "AICc and AICu" it is said that k denotes the number of model parameters + 1. If these two k's are different, then why give them the same name? If they are not, there is a problem of definition somewhere.

Last edited at 13:27, 7 October 2009 (UTC). Substituted at 19:44, 1 May 2016 (UTC)

## Recent edits by CarlWesolowski

Some of the edits are ungrammatical, e.g. "AIC use as one means of model selection". Some of the edits introduce technical invalidity, e.g. "each candidate model has residuals that are normal distributions". I have undone the edits.

CarlWesolowski has been sporadically making edits to this article since at least 20 March 2015. Each time, those edits have been undone. My suggestion is this: if CarlWesolowski wants to make changes to the article, then he should discuss those changes on this Talk page, and get a consensus of editors to agree to the changes.
SolidPhase (talk) 06:18, 8 July 2016 (UTC)

Fine, let us talk about it. AIC is only one of many criteria for model selection, and often suggested for use when it is inappropriate. The introduction reads like a commercial for cigarettes. Mathematica uses BIC, as one of several tests whose combined score ranks models, and, there are lots of good folks who may compute AIC but not use it to rank. In addition to AIC, other methods used include BIC, step-wise partial probability ANOVA, HQIC, log likelihood, complexity error, factor analysis, and goodness of fit testing with Pearson Chi-squared, Cramer Von Mises probabilities and others. And, without looking at those other measurements, any pronouncements made with respect to model selection AIC should be ignored. What difference does it make to say for example that a straight line is a worse approximation to a non-trivial cubic than a non-trivial quadratic, for a special case, when both answers are generally incorrect, and improbable even in an example case?

In the section that says, in rather poor-quality English, "Sometimes, each candidate model assumes that the residuals are distributed according to independent identical normal distributions (with zero mean). That gives rise to least squares model fitting," let us take this one statement at a time.

Candidate models do not "assume"; people assume. Ordinary least squares regression need not be associated with a normal distribution of residuals; only homoscedasticity of residuals is required, with a few additional requirements, for example, that the mean residual is not undefined. It is very common to obtain Student's-t residuals, and that becomes problematic for least squares regression when the degrees of freedom are less than 2. For example, a Student's-t distribution with 1 df is Cauchy, for which the mean is undefined. The requirement for normally distributed residuals is unnecessary: for example, a homoscedastic uniform distribution of residuals would be "content" (using your fondness for anthropomorphism) to be regressed with ordinary least squares. It is convenient when residuals are normally distributed, and in my personal observation that happens approximately 10% of the time.
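The point above about 1-df Student's-t (Cauchy) residuals having no mean can be seen in a few lines of Monte Carlo; the seed and sample sizes below are illustrative only:

```python
import numpy as np

# Monte Carlo sketch: the sample mean of Cauchy draws (Student's t with
# 1 degree of freedom) does not settle down as n grows, because the
# distribution has no mean.
rng = np.random.default_rng(0)
n, reps = 10_000, 200

normal_means = rng.normal(size=(reps, n)).mean(axis=1)
cauchy_means = rng.standard_cauchy(size=(reps, n)).mean(axis=1)

# Normal sample means concentrate near 0 at rate 1/sqrt(n); Cauchy sample
# means are themselves standard Cauchy, no matter how large n is.
spread_normal = normal_means.std()
spread_cauchy = cauchy_means.std()
print(spread_normal, spread_cauchy)
```

The normal spread is about 1/sqrt(10000) = 0.01, while the Cauchy spread is orders of magnitude larger and does not shrink with n.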

Next fuzzy thought: "independent identical normal distributions." Talk about loaded phrases: 1) the data are the same; only the models fit to the data differ. Thus, the residuals from different models are correlated, not independent. 2) "Identical": they do not have the same moments and are not zero-centered. The residuals from two different fitted equations, even if both happen to be normally distributed (a very unlikely occurrence, BTW about 1%), would certainly have different standard deviations, or it would be difficult to tell the models apart.

"(with zero mean)": yeah, you wish. One outlier is enough to blow that bubble to smithereens. I suppose you trim all your data so that this is convenient? Again I am seeing a confusion of assumptions with their supposed implementation. What do you do with the error of finding the mean value, ignore it? Indeed, this seems to be an assumption, not something relevant to working with the actual data. The error of finding the mean can always be seen using Monte Carlo simulation; that, along with simulation slope testing, is another way of comparing models.

"That gives rise to least squares model fitting." Well, no, it does not. I should not have to teach you college-level mathematics. The assumptions for OLS do, however, include homoscedasticity and fixed intervals on the x-axis; otherwise, OLS is only approximate. Summarizing, I really think that you should consider pulling back from this cigarette commercial and injecting some perspective into this very sloppy article. And you finished with that shot across my bow about my grammar, without understanding just how sloppy the mess I was trying to fix is. Fine, do that. I question whether it is a good idea to have only those with special interests as advocates of a technique functioning as editors. Not that I am suggesting that there is somehow a better form of governance; however, Narcissus would be darn right proud to be associated with this article on AIC, and it is unbalanced to the point of tears. You will not let me fix it, so fix it yourselves. — (talkcontribs) 03:08, 9 July 2016 (UTC)

To begin with: "The properties listed so far are all valid regardless of the underlying distribution of the error terms. However if you are willing to assume that the normality assumption holds (that is, that ε ~ N(0, σ2In)), then additional properties of the OLS estimators can be stated." (from Ordinary Least Squares [1]). Those properties include the maximum-likelihood context, with the chicken-and-egg problem that AIC has: model misspecification shows up as non-normality, and AIC assumes normality. Thus, in order to have a more general treatment of the problem, the context that needs inserting includes quasi-maximum likelihood (QML), based upon [2], which relaxes the odious requirement for normally distributed residuals. It is currently totally unclear what the statistical use of AIC is in the article; it is vacuous, so fix it. I am not doing this for jollies: the current article is dangerous, because it promotes AIC without any insight into appropriate usage; that is why I called it a cigarette commercial. I can see a use for finding distributions of residuals, but BIC may be more useful. AIC does not have a role that I can see for inverse problems, and inverse problems are much more general than goodness-of-fit. Please understand that the role of a model is to extract useful parameters, not to "fit a curve," so that it is irrelevant what the residual structure is for the primary curve; what matters in general is how well the target parameters are estimated, and there is much more involved than a visually pleasing curve fit. In fact, in the more general context of inverse problems, curve fitting is counterproductive. CarlWesolowski (talk) 17:09, 10 July 2016 (UTC)

Let me elaborate on that last point so that some light is shed on the subject. Suppose that one has data too sparse to allow a full physical model to characterize it, the number of parameters being prohibitively large compared to the available information for any sensible statistical solution. In that case, a physically motivated model containing the most important parameters can be used, but only with the understanding that heteroscedasticity then obviates the use of goodness of fit, because the model is under-configured on purpose. We then need to do something other than least squares in any form, as the problem is ill-posed. Suppose that we desire to obtain the area under the curve, as opposed to the curve fit. We can then treat the problem as an ill-posed integral, minimizing the error of the area under the curve. When we have done so, we have the right area but the wrong curve fit. However, we knew that the curve fit HAD to be wrong to begin with in order to extract the correct parameters for the incomplete model. The resulting incomplete model then approximates the correct parameters, agreeing with the corresponding complete model's parameters that we were unable to fit due to sparse data, and only the complete model can be judged with respect to goodness-of-fit. Given the sparse data, we can never test the goodness-of-fit of the complete model; it is unachievable. One can, however, show all of this with simulations. There have been attempts to apply AIC to inverse problems; to use the vernacular, however, that is "nuts." It will not produce the desired result, as the ND assumption will not survive the first look at the residuals. And yes, if you are not looking at the residuals to check assumptions, you are not doing modelling properly; improper modelling is, unfortunately, the rule. CarlWesolowski (talk) 17:59, 10 July 2016 (UTC)

The word "a" is used as an indefinite article, so it is clear that there might be more than one criterion. The paragraph also links to model selection, which lists 13 such criteria.
Some of your remarks about model assumptions might be appropriate for the article on model selection. They are, however, not specific to AIC.
It is colloquial to talk about models (rather than people) assuming something.
It is common to assume that "the residuals are distributed according to independent identical normal distributions (with zero mean)". That assumption "gives rise to least squares model fitting", as the article states. Other assumptions can also give rise to least squares model fitting, but that is irrelevant in the context.
SolidPhase (talk) 19:15, 10 July 2016 (UTC)
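The claim that the iid zero-mean normal assumption "gives rise to least squares" can be made concrete: the Gaussian log-likelihood is, up to constants, a negative multiple of the sum of squared residuals, so the two objectives select the same parameters. A small grid-search sketch in Python (the data and the no-intercept setup are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 40)
y = 3.0 * x + rng.normal(0.0, 0.5, size=x.size)

# Candidate slopes for a no-intercept line y = b * x.
slopes = np.linspace(0.0, 6.0, 601)
sse = np.array([np.sum((y - b * x) ** 2) for b in slopes])

# Gaussian log-likelihood with any fixed sigma: a constant minus
# SSE / (2 sigma^2), so it is a decreasing function of the SSE.
sigma = 0.5
log_lik = np.array([np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                           - (y - b * x) ** 2 / (2 * sigma**2))
                    for b in slopes])

# The slope minimizing the squared error maximizes the likelihood.
print(slopes[sse.argmin()], slopes[log_lik.argmax()])
```

Other error distributions lead to other objectives in the same way, e.g. iid Laplace errors give least absolute deviations; that is the sense in which the distributional assumption, not least squares itself, is primary.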

Thank you for responding. However, the "a" is too soft; in the matter of implication, hinting at something is not as good as saying it. This article has that problem throughout, and it is less useful in that form than it would be if it were more clearly written. For example, let us take the infamous sentence "Sometimes, each candidate model assumes that the residuals are distributed according to independent identical normal distributions (with zero mean). That gives rise to least squares model fitting." It took me a very long time to figure out what you are trying to say and, BTW, you do not say it. Consider "that gives rise to least squares...": it is unclear that it does, and most people who have studied least squares would still not know what you are getting at. Consider saying something relevant rather than making the reader study the phrase to make any sense of it; namely, note [3] that "There are several different frameworks in which the linear regression model can be cast in order to make the OLS technique applicable. Each of these settings produces the same formulas and same results. The only difference is the interpretation and the assumptions which have to be imposed in order for the method to give meaningful results. The choice of the applicable framework depends mostly on the nature of data in hand, and on the inference task which has to be performed." You do not say what you are assuming, and that is not a problem for the editors, but it is a big problem for the readers.

When you use the "colloquialism," as you call it, you not only depreciate the language but also mask the fact that you have imposed an assumption, which does not help the reader understand what you are saying. The phrase "the residuals are distributed according to independent identical normal distributions (with zero mean)" is so inaccurate that it is nearly unintelligible. I think perhaps that you are obliquely referring to ML with ${\displaystyle X}$ ~ ND, where ${\displaystyle X=\left\{x_{1},x_{2},x_{3},\ldots ,x_{n-1},x_{n}\right\}}$ and ${\displaystyle {\text{Maximize}}\left[p(X)=\prod _{i=1}^{n}p_{i}\left(x_{i}\right)\right]}$, or some such. Take a look at [4]; it is much more clearly written than this Wikipedia entry. It is not misleading, it is not oversold, and it gives a much better indication of where AIC sits in the universe of methods. Try to emulate that level of clarity, please. What happens when the residuals are not ND? Surely you realize that that is most of the time. What does AIC mean in that context, anything? CarlWesolowski (talk) 23:50, 10 July 2016 (UTC) CarlWesolowski (talk) 00:54, 11 July 2016 (UTC) CarlWesolowski (talk) 18:51, 14 July 2016 (UTC)