# Talk:Bayesian information criterion

WikiProject Statistics (Rated Start-class, Mid-importance)

This article is within the scope of the WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.

WikiProject Mathematics (Rated Start-class, Mid-importance)
This article is within the scope of WikiProject Mathematics, a collaborative effort to improve the coverage of Mathematics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Field: Probability and statistics

## SIC vs BIC

As far as I know, SC and BIC are different. What is described is BIC.

The formula given matches that derived by Schwarz (1978)
There does seem to be some confusion here though, as e.g. Bengtsson and Cavanaugh define SIC as given here, and BIC differently.
Can anyone provide some authoritative references to the modern use of the terms BIC and SIC?
--Ged.R 15:18, 22 January 2007 (UTC)
Should "SIC" in the first formula be "BIC"? The acronym "SIC" is never defined in the article.

The first formula is bizarre. geez. "The formula for the BIC is exp(-SIC/2) ???" What's up with that?

The entire article sounds like it is written by wannabe experts who aren't exactly certain of what they're talking about. Read this from the perspective of someone who is trying to ascertain the basic formula for BIC. (That is my situation. I have a software package that purports to calculate AIC and BIC scores, but the package doesn't say exactly what it is calculating. Neither does this article.) It introduces all these terms, xbar, etc., and never says what they are. A "constant becoming trivial" is a pretty weird notion if you ask me. This could potentially be a very important article as statistical model selection invades more and more fields, but as it currently stands it is useless as a first approximation. You should keep the constants that become trivial. An article written by people who know this stuff, sort of, for people who also sort of know this stuff is useless. —Preceding unsigned comment added by 67.0.90.202 (talk) 23:42, 3 January 2011 (UTC)

81.231.127.12 (talk) 22:09, 18 December 2010 (UTC)
I've just reverted a change to that first formula that was made on 15 November, so it now agrees with the reference given and the standard definition of BIC. I agree that this article is in need of attention but this is not my area of expertise so I'll flag it as in need of expert attention from a suitable statistician. Thanks for pointing this out. To be honest I'm surprised the article was left like this for over a month. Qwfp (talk) 09:58, 4 January 2011 (UTC)
A problem is that it is unclear what source is actually being used for any of these formulae. For example, the Priestley reference has a different formula for BIC than that for which it is supposedly being used as a source. But it does give formulae relating AIC, SIC and BIC all within the same context. It seems that Priestley uses S (here SIC) for what is here called BIC, and has a rather more complicated formula for his BIC. Thus Priestley uses BIC for "Akaike's BIC", and S for Schwarz's criterion ("Schwarz's BIC" although he doesn't call it that), where Schwarz's criterion is what is here called BIC. So a first question is what are good sources for current terminology, and is there a consistent usage? The above discussion mentions "the standard definition of BIC" ... are there good sources for such a thing? Melcombe (talk) 16:08, 5 January 2011 (UTC)

## Schwartz criterion

I would like to redirect Schwartz criterion to Schwartz set rather than here, since this is a term used in voting theory. Is there any objection to this? It seems to me that it only redirects here in case of a spelling mistake. CRGreathouse 02:21, 20 July 2006 (UTC)

Well, unfortunately this is a very frequent spelling mistake. There are loads of books that use the wrong spelling. And the Schwarz Bayesian IC is rather important afaik. I'd prefer to leave the redirection like this or to create a disambiguation page... Gtx, Frank1101 11:00, 20 July 2006 (UTC)
From what I can tell (and what I was taught in my MS in stats program) BIC is the more common moniker for this. I think the article should reflect this. --Chrispounds 00:51, 29 October 2006 (UTC)
I agree with Chrispounds, and so does google:
Any objection to the article being renamed to "Bayesian information criterion", replacing the current redirect with no history? John Vandenberg 07:24, 31 October 2006 (UTC)
I have made a proposal to reduce the confusion between Schwartz set and Schwarz criterion on Talk:Schwartz set#Schwarz criterion. John Vandenberg 01:14, 9 November 2006 (UTC)

The BIC is the Schwarz criterion. His name was Gideon Ernst Schwarz, so there is no reason why it should be under the mistaken "Schwartz". — Preceding unsigned comment added by 79.177.15.127 (talk) 08:29, 18 May 2013 (UTC)

## Linear Model expression

The second formula:

Under the assumption that the model errors or disturbances are normally distributed, this becomes:
$\mathrm{SIC} = n\ln\left({\mathrm{RSS} \over n}\right) + k \ln(n).$


seems wrong to me, $-2 \cdot \ln{L} = \left({\mathrm{RSS} \over \sigma^2}\right)$, right? And not $n\ln\left({\mathrm{RSS} \over n}\right)$ as stated here. --Ged.R 15:18, 22 January 2007 (UTC)

--

$n\ln\left({\mathrm{RSS} \over n}\right)$ is correct because we are dealing with the maximized likelihood. For a linear model, we have $\hat{\sigma}^2 = {\mathrm{RSS} \over n}$. The loglikelihood is of the form:
$l(\beta, \sigma^2; Y) = -\frac{n}{2} \log (2 \pi \sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n \varepsilon_i^2$
Evaluating this at the maximum likelihood estimates for $\sigma^2$ and $\beta$, we obtain:
$l(\hat{\beta}, \hat{\sigma}^2; Y) = -\frac{n}{2} \log (2 \pi \hat{\sigma}^2) - \frac{1}{2\hat{\sigma}^2} \sum_{i=1}^n \hat{\varepsilon}_i^2$
$= -\frac{n}{2} \log (2 \pi {\mathrm{RSS} \over n}) - \frac{1}{2} \frac{n}{\mathrm{RSS}} \sum_{i=1}^n \hat{\varepsilon}_i^2$
$= -\frac{n}{2} \log (2 \pi {\mathrm{RSS} \over n}) - \frac{1}{2} \frac{n}{\mathrm{RSS}} \mathrm{RSS}$
$= -\frac{n}{2} \log (2 \pi {\mathrm{RSS} \over n}) - \frac{n}{2}$
This gives the above expression for $-2 \cdot \ln{L}$, up to an additive constant that depends only on $n$.
Wolf87 (talk) 01:40, 5 October 2008 (UTC)
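
For anyone who wants to check the derivation above numerically, here is a minimal sketch. The simulated data, seed, and coefficients are arbitrary assumptions of mine, not from any source. It fits a linear model by least squares, evaluates the Gaussian log-likelihood at the ML estimates, and compares $-2\ln L$ with $n\ln(\mathrm{RSS}/n)$ plus the constant $n(1+\ln 2\pi)$.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one regressor
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)

# Least squares gives the maximum likelihood estimate of beta under normal errors.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
rss = float(resid @ resid)
sigma2_hat = rss / n  # ML estimate of the error variance (divides by n)

# Log-likelihood evaluated at the ML estimates.
loglik = -0.5 * n * np.log(2 * np.pi * sigma2_hat) - rss / (2 * sigma2_hat)

lhs = -2 * loglik
rhs = n * np.log(rss / n) + n * (1 + np.log(2 * np.pi))
print(lhs, rhs)  # the two numbers should agree up to rounding
```

The extra term $n(1+\ln 2\pi)$ is the same for every model fitted to the same $n$ observations, which is why it can be dropped when models are only being compared with each other.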

Does anybody else find it fishy that the BIC here depends on the scaling of the data? Actually, wouldn't this be a problem using the likelihood function of any continuous domain probability distribution? —Preceding unsigned comment added by 216.15.124.160 (talk) 02:09, 6 December 2008 (UTC)

BIC does not depend upon the scaling of the data. BIC is defined only up to an additive constant that will be the same across all models being compared; that constant incorporates the scaling (at least in the linear model case given above) because any scaling factors come out as additive constants from the $\log ( 2 \pi {\mathrm{RSS} \over n})$ term. --Wolf87 (talk) 21:02, 14 March 2009 (UTC)
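
A small numerical illustration of this point, as a sketch under my own assumptions (simulated data, a hypothetical unit change by a factor `c`, and `k` taken as the number of regression coefficients): rescaling the response shifts the linear-model BIC of every model by the same amount, $2n\ln c$, so differences between models are unchanged.

```python
import numpy as np

def bic_linear(X, y):
    """n*ln(RSS/n) + k*ln(n) for a least-squares fit; k counts the columns of X."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + k * np.log(n)

rng = np.random.default_rng(1)  # arbitrary seed
n = 40
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)

X_small = np.ones((n, 1))                  # intercept only
X_big = np.column_stack([np.ones(n), x])   # intercept + slope

c = 1000.0  # hypothetical change of units, e.g. metres to millimetres
diff_original = bic_linear(X_big, y) - bic_linear(X_small, y)
diff_rescaled = bic_linear(X_big, c * y) - bic_linear(X_small, c * y)
shift = bic_linear(X_big, c * y) - bic_linear(X_big, y)

print(diff_original, diff_rescaled)   # same model comparison before and after
print(shift, 2 * n * np.log(c))       # common additive shift
```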

## Bayesian?

This seems to be rather unbayesian, notably in the use of maximum likelihood, no prior distribution, the absence of any integration, and more. Compare this with Bayes factor. --Henrygb 17:08, 15 March 2007 (UTC)

It's Bayesian to the extent that it represents an approximation to integrating over the detailed parameters of the model (which are assumed to have a flat prior), to give the marginal likelihood for the model as a whole. The argument is that in the limit of infinite data, the BIC would approach the Bayesian marginal likelihood. That contrasts with the Akaike criterion, which attempts to find the most probable model parametrisation, rather than the most probable model. It also contrasts with frequentists, who cannot integrate over nuisance parameters to compute marginal likelihoods.
But I'd agree, the article should spell out much more clearly how, exactly, BIC is an approximation to the Bayesian marginal likelihood. Jheald 18:00, 15 March 2007 (UTC).
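
To make the "approximation to the marginal likelihood" claim concrete, here is a toy check. Everything in it is an assumption of mine chosen so that the integral has a closed form: observations $x_i \sim N(\mu, 1)$ and a proper $N(0,1)$ prior on $\mu$ (rather than a flat prior). In that case $-2\ln p(x) = n\ln 2\pi + \sum_i (x_i-\bar{x})^2 + \ln(n+1) + n\bar{x}^2/(n+1)$, while the BIC with $k=1$ is $n\ln 2\pi + \sum_i (x_i-\bar{x})^2 + \ln n$.

```python
import numpy as np

def exact_minus2_log_marginal(x):
    """-2 ln p(x) for x_i ~ N(mu, 1) with prior mu ~ N(0, 1), in closed form."""
    n = x.size
    xbar = x.mean()
    ss = np.sum((x - xbar) ** 2)
    return n * np.log(2 * np.pi) + ss + np.log(n + 1) + n * xbar**2 / (n + 1)

def bic(x):
    """-2 ln L(mu_hat) + k ln n, with k = 1 and mu_hat equal to the sample mean."""
    n = x.size
    ss = np.sum((x - x.mean()) ** 2)
    return n * np.log(2 * np.pi) + ss + np.log(n)

rng = np.random.default_rng(2)  # arbitrary seed
for n in (10, 1000, 100000):
    x = rng.normal(loc=0.5, size=n)  # true mean 0.5 is an arbitrary choice
    exact = exact_minus2_log_marginal(x)
    print(n, exact, bic(x), exact - bic(x))
```

In this toy case the gap between the two quantities stays bounded (roughly $\bar{x}^2$) while both grow in proportion to $n$, which is the sense in which the BIC tracks the marginal likelihood for large samples.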

## What is the "dependent variable"?

I am confused by this sentence:

"It is important to keep in mind that the BIC can be used to compare estimated models only when the numerical values of the dependent variable are identical for all estimates being compared."

What is the dependent variable? And it is unclear why "variable" is singular and everything else is plural.

Imran 09:02, 12 April 2007 (UTC)

I agree this sentence makes no sense to me. My best guess is that they meant the "independent variable values", i.e. we must estimate the same data points in each model. Because I would interpret the "dependent variable values" as meaning the "model estimates", which then obviously doesn't make sense. Moo (talk) 22:09, 14 May 2012 (UTC)

Unlikely to be the only desideratum, as there need not be any independent variables at all. I think it means:
• using exactly the same data-set, in terms of number of values and treatment of outliers/missing values and, if there are independent variables, then missing values among these cannot prevent any of the dependent variables being included in the likelihood for some of the models being compared ... you can't leave out some of the data for some models and not for others
• using the same measurement scale (units) for the data representing the dependent variable in all models, as changing units affects the likelihood to the extent of an additive constant ... which can only be ignored if the units are the same for all models
• using the same transformation of underlying data for the dependent variable in calculating the likelihood function for each model being compared.... thus there may be a choice between models that are most conveniently represented in terms of either y or log y, but the likelihood function must be evaluated in a consistent way.
It would be good to find a good reference/source that properly covers the basics such as this. Melcombe (talk) 00:45, 15 May 2012 (UTC)
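
The third bullet above can be sketched in code. This is only an illustration under my own assumptions (simulated positive data, Gaussian errors, and `k` counted as regression coefficients plus the error variance). If one model is fitted to y and another to log y, the second model's likelihood has to be expressed as a density for y, which adds the Jacobian term $2\sum_i \ln y_i$ to $-2\ln L$ before the two BIC values are comparable.

```python
import numpy as np

def gaussian_minus2_loglik(X, z):
    """-2 * maximised Gaussian log-likelihood of a least-squares fit of z on X."""
    n = z.size
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    rss = float(np.sum((z - X @ beta) ** 2))
    return n * np.log(2 * np.pi * rss / n) + n

rng = np.random.default_rng(3)  # arbitrary seed
n = 200
x = rng.uniform(1.0, 3.0, size=n)
y = np.exp(0.3 + 0.8 * x + rng.normal(scale=0.2, size=n))  # positive response
X = np.column_stack([np.ones(n), x])
k = X.shape[1] + 1  # regression coefficients plus the error variance

# Model A: y itself is normal around a linear predictor.
bic_a = gaussian_minus2_loglik(X, y) + k * np.log(n)

# Model B: log(y) is normal around a linear predictor. Its likelihood for y
# carries the Jacobian factor prod(1 / y_i), i.e. +2 * sum(log y_i) in -2 ln L.
bic_b_log_scale = gaussian_minus2_loglik(X, np.log(y)) + k * np.log(n)
bic_b = bic_b_log_scale + 2 * np.sum(np.log(y))

print(bic_a, bic_b, bic_b_log_scale)
```

Comparing `bic_a` with `bic_b_log_scale` directly would compare likelihoods of two different variables; `bic_a` and `bic_b` are both likelihoods of the same data y.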

Neil Frazer, 10 March 2013 I think that by "dependent variable" they mean the data, y_i. In other words, you can only compare models using the same data. — Preceding unsigned comment added by Neil Frazer (talk • contribs) 00:43, 11 March 2013 (UTC)

## This formula for BIC may potentially confuse people who read the AIC entry.

The version of BIC as described here is not compatible with the definition of AIC in wikipedia. There is a divisor n stated with BIC, but not AIC in the Wikipedia entries. It would save confusion if they were consistently defined!

I would favour not dividing by n: i.e.

BIC = -2log L + k ln(n)

AIC = -2log L + 2k

One can then clearly compare the two, and see they are similar for small n, but BIC favours more parsimonious models for large n. —The preceding unsigned comment was added by 128.243.220.42 (talk) 13:47, 10 May 2007 (UTC).
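
A minimal sketch of the two undivided forms, for comparison. The function names `aic` and `bic` are just illustrative, and `loglik` stands for the maximised log-likelihood $\ln L$. The per-parameter penalties are 2 and $\ln n$, so they coincide at $n = e^2 \approx 7.39$ and the BIC penalty is the larger one from $n = 8$ onwards.

```python
import math

def aic(loglik, k):
    """AIC = -2 ln L + 2k, with loglik the maximised log-likelihood."""
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    """BIC = -2 ln L + k ln(n), with n the number of observations."""
    return -2.0 * loglik + k * math.log(n)

# Per-parameter penalty: 2 for AIC, ln(n) for BIC.
for n in (5, 7, 8, 100, 10000):
    print(n, round(math.log(n), 3), "BIC penalty exceeds AIC penalty:", math.log(n) > 2)
```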

In fact I have noticed that the formula was only changed recently on 21st April, 2007. It really needs changing back I think to what it was before!

--

I also believe that the definition without n is more common. See for example http://xxx.adelaide.edu.au/pdf/astro-ph/0701113, which gives a lucid, accessible review and comparison of the AIC, AICc, BIC and Deviance Information Criterion (DIC).

Every paper I have seen has it without the n.

The standard simplification for using $\chi^2$ for model selection has been pointed out above, namely that $-2\log L = \chi^2$. I think this is worth including on the page, as I had to go look in several journal articles to satisfy myself that this is the proper definition of log-likelihood.

Velocidex 12:54, 25 June 2007 (UTC)
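
On the $\chi^2$ point, a short numerical check may help. It is a sketch that assumes independent Gaussian errors with known standard deviations; the data and the "model" here are made up. Under those assumptions $-2\ln L$ equals $\chi^2$ plus a term that depends only on the known error bars, so the identity $-2\ln L = \chi^2$ holds up to a constant that is the same for every model.

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed
n = 30
sigma = rng.uniform(0.5, 2.0, size=n)   # known per-point measurement errors
model = np.linspace(0.0, 1.0, n)        # hypothetical model prediction
data = model + sigma * rng.normal(size=n)

chi2 = np.sum(((data - model) / sigma) ** 2)
loglik = np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                - 0.5 * ((data - model) / sigma) ** 2)
const = np.sum(np.log(2 * np.pi * sigma**2))  # does not depend on the model

print(-2 * loglik, chi2 + const)  # equal up to rounding
```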

## Definition of L

Hi,

Is L in the formula for the BIC really the log-likelihood? It seems to me that L is the likelihood, s.t. ln L would be the log-likelihood and the -2 ln L term is the same term as in the AIC. Am I missing something?

Mpas76 01:05, 17 October 2007 (UTC)

I think you are right. L is the likelihood function and -2*ln(L) is the same as that in AIC formula. —Preceding unsigned comment added by Shaohuawu (talk • contribs) 16:37, 27 October 2007 (UTC)

## Error variance

Possibly I haven't understood this properly, but surely the formula for so-called 'error variance' in the article is wrong:

$\hat{\sigma_e^2}=\frac{1}{n}\sum_{i=1}^n (x_i-\overline{x})^2$

If the x's are datapoints this appears to be the variance of the datapoints, whereas what we want is something like the RSS of the AIC article, presumably the mean squared error: $\hat{\sigma_e^2}=\frac{1}{n}\sum_{i=1}^n (x_i-\hat{x_i})^2$

93.96.236.8 (talk) 16:42, 11 September 2010 (UTC)
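
The distinction raised above can be seen with a few made-up numbers. In this sketch the observations are called `y` and are regressed on a separate regressor `x`, which is my notation rather than the article's. The variance of the data measures spread about the overall mean, while the error variance that enters the BIC is the mean squared residual about the fitted values, RSS/n.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # regressor (hypothetical data)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])    # observations

slope, intercept = np.polyfit(x, y, 1)      # least-squares straight line
fitted = intercept + slope * x

var_of_data = np.mean((y - y.mean()) ** 2)    # spread of the data themselves
error_variance = np.mean((y - fitted) ** 2)   # RSS / n, used in the BIC formula

print(var_of_data, error_variance)
```

The two only coincide when the model is a constant mean, in which case the fitted values are all equal to the sample mean.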

## Exponential family

The article explicitly mentions that the investigated model should be a member of the exponential family. This was the original assumption in the demonstration of the BIC's asymptotic property. However, the derivation of this property was extended to less restrictive conditions. See, for example, Cavanaugh and colleague in “Generalizing the Derivation of the Schwarz Information Criterion”. —Preceding unsigned comment added by 195.220.100.11 (talk) 18:11, 4 March 2011 (UTC)

In Section "Characteristics of the Bayesian information criterion" of this Wikipedia article, a citation is needed for point 5 ("[BIC] can be used to choose the number of clusters according to the intrinsic complexity present in a particular dataset."). Checking the Cavanaugh et al. paper, it seems to me that the loosened conditions are sufficient to guarantee the validity of BIC at least for mixtures of exponential family distributions, which in turn would cover a remarkable variety of clustering methods as special cases. But I could not find articles where validity of BIC is explicitly shown for mixture (or clustering) models. — Preceding unsigned comment added by Lmlahti (talk • contribs) 11:17, 2 April 2011 (UTC)

## BIC penalizes larger data set?

Has anybody noticed that minimization of the objective function:

${-2 \cdot \ln{p(x|k)}} \approx \mathrm{BIC} = {-2 \cdot \ln{L} + k \ln(n) }.$

seems to lead to a penalty on larger data sets.

For example, given a data set D1, and another nested data set D2 (same features, but a subset of samples of D1). BIC seems to suggest that D2 should have more parameters than D1. This is ridiculous. 147.8.182.107 (talk) 05:02, 8 January 2012 (UTC)

The penalty gets bigger, but, if the parameters are helping, the likelihood improvement on the larger datasets due to the extra parameters should more than compensate. 41.151.113.108 (talk) 08:45, 9 February 2012 (UTC)
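
A deterministic sketch of this reply, under assumptions of my own: a weak true slope of 0.2, an alternating ±1 sequence standing in for noise so the result is reproducible, and the linear-model form $n\ln(\mathrm{RSS}/n) + k\ln n$. The penalty per parameter grows like $\ln n$, but the improvement in fit from a genuinely useful parameter grows roughly in proportion to $n$, so the extra parameter is rejected for small samples and accepted for large ones.

```python
import numpy as np

def bic_linear(X, y):
    """n*ln(RSS/n) + k*ln(n) for a least-squares fit; k counts the columns of X."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + k * np.log(n)

def bic_gain_from_slope(n, slope=0.2):
    """BIC(intercept only) - BIC(intercept + slope); positive favours the slope."""
    x = np.linspace(-1.0, 1.0, n)
    noise = np.where(np.arange(n) % 2 == 0, 1.0, -1.0)  # deterministic stand-in for noise
    y = slope * x + noise
    X0 = np.ones((n, 1))
    X1 = np.column_stack([np.ones(n), x])
    return bic_linear(X0, y) - bic_linear(X1, y)

for n in (20, 200, 2000):
    # Columns: sample size, penalty per parameter, net BIC gain from adding the slope.
    print(n, round(np.log(n), 2), bic_gain_from_slope(n))
```

So more data does not push BIC towards fewer parameters as such; it only demands that each parameter earn a fit improvement of more than $\ln n$.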

Yes, the penalty with sample size is larger with SIC than with AIC. With AIC, the selected model tends to have more parameters as the sample size increases. The justification for the SIC was to find a penalty to neutralize that effect.

Also, both AIC and SIC are more general than described in this article, applying to many different statistical models. 98.95.133.18 (talk) 13:30, 31 March 2013 (UTC)

## K

There doesn't appear to be a definition of what k is.

And it seems odd that in p(x|k) ... the probability of the observed data is calculated only on the number of the "free parameters" without any consideration for their value ... — Preceding unsigned comment added by Eep1mp (talk • contribs) 16:39, 3 February 2012 (UTC)

The section "Mathematically" has:
...and (although there are no details), p(x|k) and the reason for using it are based on Bayesian arguments wherein the distribution of the k unknown parameters is effectively integrated out. This leads to "their value" being summarised in the maximised likelihood. Melcombe (talk) 17:21, 3 February 2012 (UTC)
This is not the usual notation for a likelihood. The typical notation would be p(y|θ) where θ is a k×1 vector of parameters. In the BIC, this vector would usually be the maximum likelihood estimate (MLE) for the parameters. The MLE is equivalent to the mode of the posterior distribution with a non-informative prior. For some, ignoring the prior is a feature, for others it is a bug.
It should be obvious that not all models with k parameters are created equal, so p(x|k) is misleading. Suppose we are interested in blood pressure as the dependent variable and have a choice of two different datasets with k=3: The first has {gender, age, body mass index} and the second has {SAT score, typing speed, favorite color}. Suppose we pick the first dataset, we still could have different likelihoods depending on whether we want a continuous measure of diastolic pressure, or if we want a probit model that categorizes people as "high blood pressure" or not.
I'll try to edit the main article, but I'm not very familiar with the math code so it might take a bit. Frank MacCrory (talk) 21:06, 5 March 2012 (UTC)

## Akaike impressed?

The source for the sentence "Akaike was so impressed with Schwarz's Bayesian formalism that he developed his own Bayesian formalism, ..." does not actually support that statement. The source is from 1977, while Schwarz's work is from 1978.

Are the facts maybe twisted here? What is clear is that Akaike developed the AIC before Schwarz developed the BIC (1974 vs. 1978). Schwarz also cites Akaike's work in his 1978 paper. Is ABIC different from AIC and if yes, how? Can someone please add a correct source?

Georg Stillfried (talk) 14:07, 2 August 2013 (UTC)