# Talk:Kullback–Leibler divergence


## real world EXAMPLE needed immediately following motivation

Suggestion: The average person can understand word relationships, and KLD has been applied to countless NLP problems. An example using KL distance for images or other high-dimensional feature spaces could quickly turn off the reader.

## To minus or not to minus

I think it is not equivalent to turn around the P and Q distributions inside the logarithm, and then get rid of the minus. If the measure-based definition is taken as the definition, then that should consistently induce all of the definitions for the special cases, such as the discrete and continuous cases. In particular, because absolute continuity is only required in one direction, the switch is not legal. Therefore, I am being bold and changing the definition of the discrete KL-divergence to be consistent with the measure-based definition. --Kaba3 (talk) 22:19, 8 October 2012 (UTC)

Ok, let's not be bold :) I was too quick to look: the current version is consistent. Instead, the formula on the Radon–Nikodym theorem page suffers from not using the minus-form instead. I'll go fix that instead. --Kaba3 (talk) 22:46, 8 October 2012 (UTC)

Hmm.. Something is still wrong. According to Sergio Verdu, to whose video lecture there is a link on the article, the definition of Kullback-Leibler divergence is given by

$D_{\mathrm{KL}}(P\|Q) = \int_X \ln \frac{{\rm d}P}{{\rm d}Q} \,{\rm d}P,$

if P is absolutely continuous with respect to Q (P << Q), and infinite otherwise. This is not equivalent to appending a minus, and switching to dQ/dP, since that would require Q << P. Fixing this makes everything consistent, which I'll now go do. --Kaba3 (talk) 23:13, 8 October 2012 (UTC)
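To make the one-directional requirement concrete, here is a minimal Python sketch for the discrete case. The function name `kl_divergence` and the two example distributions are illustrative assumptions, not anything taken from the article or from Verdu's lecture. It returns infinity exactly when P puts mass where Q has none, and shows that swapping the arguments is not harmless.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats for discrete distributions given as equal-length lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # convention: a term with p = 0 contributes 0
        if qi == 0.0:
            return math.inf  # P is not absolutely continuous w.r.t. Q
        total += pi * math.log(pi / qi)
    return total

# Hypothetical example: P lives on the first two outcomes only.
p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]

print(kl_divergence(p, q))  # finite, because P << Q holds
print(kl_divergence(q, p))  # inf, because Q << P fails
```

So a formula written with dQ/dP would need Q << P, which fails in this example even though D(P||Q) itself is finite.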

## 2004-2006 discussions

To whom it may concern. Recently the statement "KL(p,q) = 0 iff p=q" was added. I suspect that's not quite the case; maybe we want "KL(p,q) = 0 iff p=q (except for a set of measure 0 wrt p)" ?? Happy editing, Wile E. Heresiarch 01:41, 21 Oct 2004 (UTC)

I added "KL(p,q) = 0 iff p=q", which is a stronger claim than "KL(p,p)=0" in an earlier revision. My own understanding of measure theory is pretty limited; moreover, the article does not explicitly mention measures in connection with the integrals. However, Kullback and Leibler in their 1951 paper (lemma 3.1) did consider this and say that divergence is equal to zero if and only if the measurable functions p and q are equivalent with respect to the measure (i.e. p and q are equal except on a null set). That would include the case you mentioned, wouldn't it? --MarkSweep 03:29, 21 Oct 2004 (UTC)
Yes, that's what I'm getting at. What's not clear to me is that when we say p and q are equal except for a set of FOO-measure 0, which measure FOO are we talking about? I guessed the measure induced by p; but K & L must have specified which one in their paper. Wile E. Heresiarch 00:59, 22 Oct 2004 (UTC)
I checked the K&L paper again, but they simply define a compound measure λ as the pair of the measures associated with the functions p and q. The lemma is stated in terms of the measure λ. --MarkSweep 09:28, 24 Oct 2004 (UTC)

You could say "KL(p,q) = 0 iff p=q almost everywhere" which is concise and says what I think you're trying to say. - grubber 01:57, 17 October 2005 (UTC)
The K-L divergence is between two probability measures on the same space. The support of one measure must be contained in the support of the other measure for the K-L divergence to be defined: For $D_{KL}(\mathbb P \| \mathbb Q)$ to be defined, it is necessary that $\mathrm{supp}\, \mathbb P \subseteq \mathrm{supp}\, \mathbb Q .$ Otherwise there is an unavoidable zero in the denominator in the integral. Here the "support" is the intersection of all closed sets of measure 1, and has itself measure 1 because the space of real numbers is second-countable. The integral should actually be taken over $\mathrm{supp}\, \mathbb P$ rather than over all the reals. The "$\log N$" that appears in some of the equations in the article should be the logarithm of the cardinality of the support of some probability measure on a discrete space. -- 130.94.162.61 20:35, 25 February 2006 (UTC)
Isn't the necessary condition $\mathbb P << \mathbb Q$ and isn't $D_{KL}(\mathbb P \| \mathbb Q) = \int_{\mathrm{supp}\,\mathbb P} \log\frac{d\mathbb P}{d\mathbb Q}\,d\mathbb P = \int_{\mathrm{supp}\,\mathbb P} \frac{d\mathbb P}{d\mathbb Q}\log\frac{d\mathbb P}{d\mathbb Q}\,d\mathbb Q$? -- 130.94.162.61 19:15, 27 February 2006 (UTC)
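A short worked step for the question above, assuming $\mathbb P \ll \mathbb Q$ and writing $f = d\mathbb P/d\mathbb Q$ (the symbol $f$ is introduced here only for illustration). The change-of-measure property of the Radon–Nikodym derivative, $\int g \, d\mathbb P = \int g f \, d\mathbb Q$, applied with $g = \log f$ gives

$\int \log f \, d\mathbb P = \int (\log f)\, f \, d\mathbb Q = \int f \log f \, d\mathbb Q ,$

so the two integrals agree. The set where $f = 0$ contributes nothing to either side under the convention $0 \log 0 = 0$, which is why restricting the domain of integration does not change the value.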
Re reversion: Why have a whole article on "cross entropy", then, if it's not significant? -- 130.94.162.61 04:18, 2 March 2006 (UTC)
Excellent article, by the way. Has the context, clear explanation of technical details, related concepts, everything that is needed in a technical article. -- 130.94.162.61 16:37, 11 March 2006 (UTC)
I agree that the article is excellent. I have several comments and a question.
Regarding the comment of 25 February 2006 and the follow-on, this has special importance when analyzing natural data (as opposed to data composed of man-made alphabets). If the P distribution is based on a corpus of pre-existing data, then, as soon as you discover a new "letter" in an observed sample (upon which the Q distribution is based), you can no longer use Dkl, because then there will be a zero in the denominator.
To give a concrete example, suppose you are looking at amino-acid distributions in proteins and are using Dkl as a measure of how different the composition of a certain class of proteins (Q) is from that of a broad sample (P) comprising many classes. If the Q set lacks some amino acids that the P set contains, you can still compute a Dkl. But suppose that all of a sudden you discover a new amino acid in the Q set; this isn't as far-fetched as it sounds, if you admit somatically modified species. Then, no matter how infrequent that new species is, Dkl goes to infinity (or, if you prefer, hell in a handbasket).
This may be one of the reasons why K&L's symmetric measure has not met with favor: as the above example shows, you could easily come across a message that lacks some of the characters in the alphabet that P was computed over. In this case, the reverse -- and therefore the symmetric -- measure cannot be computed, even though the forward one can.
For example, imagine trying to create a Dkl for the composition of a single protein, compared to the composition of a broad set of proteins. It would be quite possible that some amino acid present in the large sample might not be represented in the small sample. But the one-way Dkl can still be computed and is useful.
Question: Is there a literature on expected error bounds on a Dkl estimate due to finite sample size (as there is for the Shannon entropy?) --Shenkin 03:05, 4 July 2006 (UTC)

## regarding recent revert

Regarding Jheald's recent revert: I spent a lot of time reorganizing the article and wanted to discuss the changes. Maybe some of them can be reimplemented? Here's a list:

• you cannot have two probability distributions on the same random variable -- it's nonsense! What you have are two random variables -- but why even talk about that. Just say given two discrete probability distributions.
• Absolutely you can, if they are conditioned on different information, or reflect different individuals' different knowledge, or different degrees of belief; or if one distribution is based on deliberate approximation. P(X|I) is different from P(X|D,I), but they are both distributions on the random variable X.
I understand what you mean. How does this mesh with the definition at random variable though? It says that every random variable follows a (single?) distribution. Probability theory and measurable function say that random variables are functions that map outcomes to real numbers (in the discrete case at least).
Strictly speaking, the random variable is the mapping X: Ω -> R. That is a function that can be applied to many probability spaces, distinguished by different measures: (Ω,Σ,P), (Ω,Σ,Q), (Ω,Σ,R), (Ω,Σ,S) etc. More loosely, we tend to talk of the random variable X as a quantity to which we can assign probability distribution(s), where these are the distributions induced from the measures P,Q,R,S by applying the mapping X to Ω. Either way, it is entirely conventional to talk about "the probability distribution of the random variable X". Jheald 21:59, 3 March 2007 (UTC)
• use distinguish template instead of clouding the text
• Don't use ugly hat notes for mere asides (and that template is particularly ugly). Nobody is going to type in Kullback-Leibler if they want an article on vector calculus.
You're probably right :) I was just trying to get rid of the note "(not to be confused with..." from the text.
• list of alternative names all together instead of mixed into the text at the author's whim
• Lists which are too long are hard to read, and break up the flow. Better to stress only information divergence, information gain, and relative entropy first. Information gain and relative entropy are particularly important because they are different ways to think about the K-L divergence. They should stand out. On the other hand K-L distance is just an obvious loose abbreviation.
Okay... :)
• Generalizing the two examples is a new paragraph
• Unnecessary, and visually less appealing. Better rhythm without the break.
Yeah... but poor grammar :(
• Gibbs inequality is the most basic property and comes first
• No, the most important property is that this functional means something. The anchor for that meaning is the Kraft-McMillan theorem. And that meaning informs what the other properties mean.
Hmmm. that's a really good point. I didn't see it that way before.
• the motivation, properties and terminology section is split up into three sections
• Two-line long sections should suggest over-division. Besides, the whole point of calling it a "directed" divergence was about the non-symmetry of D(P||Q) and D(Q||P).
• the note about the KL divergence being well-defined for continuous distributions is superfluous given that it is defined for continuous distributions in the introduction.
• Not superfluous. It is hugely important in this context, in that the Shannon entropy does not have a reliably interpretable meaning for continuous distributions. The K-L divergence (a.k.a. relative entropy) does.
Oh! :) We could make that clear: "Unlike Shannon entropy, the KL divergence remains..."

--MisterSheik 13:51, 3 March 2007 (UTC)

Sheik, backing revisions out wholesale is not something I do lightly. I can see from the log that you spent time thinking about them. But in this case, as with Beta distribution, I thought that not one of your edits was positive for the article. That contrasts with Probability distribution, Quantities of information, Probability theory and Information theory, where although I have issues with some of the changes you made, I thought some of the steps were in the right direction. Jheald 15:37, 3 March 2007 (UTC)
Thanks for getting back to me Jheald. I'm glad that backing out revisions isn't something you take lightly :) I'm going to take a break from this article so that I can think it over some more. I'll add any ideas to the talk page so that we can discuss them.
Also, I'm glad you were okay with (most of) my changes to those other articles. I didn't actually add any information to probability theory; I just brought together information that was spread over many pages and mostly duplicated. I think it's unfortunate that some pages (like the information entropy pages) seem disorganized (individual pages are organized, but the group as a whole is hard to read: you end up reading the same thing on many pages). Do you think that these pages could be organized? Do you have any idea how we could start?

MisterSheik 18:28, 3 March 2007 (UTC)

## motivation

I read the motivation, but I am not really sure what it means. The first two sentences have nothing to do with the rest of the article. Could someone make this clearer? —Preceding unsigned comment added by Forwardmeasure (talkcontribs) 03:38, 25 March 2007

In fact, there is no "motivation" in there at all. That section really should be renamed. MisterSheik 03:54, 25 March 2007 (UTC)

## f-divergence

The link to f-divergence I placed in the opening section was removed and not placed anywhere else in the article. Is there a reason for this? This family of divergences is a fairly important generalisation and may lead readers who find the KL-divergence unsuitable to something more appropriate. Should I put it back in somewhere else? If not, a reason would be helpful as I can't understand the motivation to remove it completely. MDReid (talk) 11:26, 15 January 2008 (UTC)

Is there any relation between Kullback–Leibler divergence and information value? (Nav75 (talk) 14:54, 12 November 2008 (UTC))


-> I would say that the Bayesian approach with entropy gives a pretty good idea of this measure in terms of information.

## KL divergence and Bayesian updating

In this section the demonstration is not sufficient. Some further approximations are needed to get the results, and they are not addressed. The implicit assumption here is that $p(x|i) \approx p(y|i)$, which is true if knowing y is close to knowing x in terms of likelihood and entropy, and that is not always the case... —Preceding unsigned comment added by 81.194.28.5 (talk) 16:16, 25 February 2009 (UTC)

The formula given is exact. It comes straight from the definition of DKL. But it assumes you have observed and now know the exact value of y.
If you haven't observed y, you can calculate the expected information gain about x from y by summing over the possible values of y weighted by the various probabilities P(y|I). That gives the result that the expected (i.e. average) information gain is the mutual information. Jheald (talk) 17:30, 25 February 2009 (UTC)
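The claim that the expected information gain equals the mutual information can be checked numerically. Below is a minimal Python sketch. The joint distribution `joint` is a made-up example and the helper name `gain_given_y` is hypothetical; neither comes from the article.

```python
import math

# Hypothetical joint distribution P(x, y) over two binary variables.
joint = {
    (0, 0): 0.4, (0, 1): 0.1,
    (1, 0): 0.2, (1, 1): 0.3,
}
xs = [0, 1]
ys = [0, 1]

# Marginals P(x) and P(y).
p_x = {x: sum(joint[(x, y)] for y in ys) for x in xs}
p_y = {y: sum(joint[(x, y)] for x in xs) for y in ys}

def gain_given_y(y):
    """D_KL( P(x|y) || P(x) ) in bits: information gained about x once this y is known."""
    total = 0.0
    for x in xs:
        posterior = joint[(x, y)] / p_y[y]
        if posterior > 0:
            total += posterior * math.log2(posterior / p_x[x])
    return total

# Average the exact gain over the possible observations y, weighted by P(y).
expected_gain = sum(p_y[y] * gain_given_y(y) for y in ys)

# Mutual information I(X; Y) computed directly from its definition.
mutual_information = sum(
    pxy * math.log2(pxy / (p_x[x] * p_y[y]))
    for (x, y), pxy in joint.items() if pxy > 0
)

print(expected_gain, mutual_information)  # the two values agree up to rounding
```

The per-observation gain `gain_given_y(y)` is exact for each y; only the averaging step turns it into the mutual information.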

## Reorganization

The "Motivations, properties, etc" is a hodgepodge of ideas. I actually added to it, to complement the standard "it's not a metric" boilerplate, but I don't think I made it any more confusing than it already was. The way ideas are ordered makes it almost trivial to add subtitles with minor reorganization, I just ran out of time today. —Preceding unsigned comment added by Dnavarro (talkcontribs) 15:04, 29 April 2009 (UTC)

## Renyi reference

It would be nice to have a full citation for the Renyi (1961) reference - to which paper does it refer? 94.192.230.5 (talk) 09:59, 14 June 2009 (UTC)

In the paper (p. 554) he characterises his generalised alpha divergence as "the information of order α obtained if the distribution P is replaced with the distribution Q" (note he is using a P and Q defined the other way round to how we're defining them in the article).
This isn't quite the snappier term "information gain", but it is very close, and the thought behind it is exactly the same: the information gained by doing an experiment which allows you to improve your probability distribution. The actual term "information gain" may come from his book Probability -- I'll have to check. Jheald (talk) 11:28, 14 June 2009 (UTC)

## Simplifying needed

The sentence "KL measures the expected number of extra bits required to code samples from P when using a code based on Q" doesn't mean much to a guy like me. It should be simplified. 217.132.229.177 (talk) 07:39, 28 August 2009 (UTC)

Umm, that's fairly clear actually.
Maybe it wouldn't seem so clear to you if you were using a different code. Not that I think it is a bad explanation: maybe change "based on" for "designed for encoding" though. —Preceding unsigned comment added by 139.184.30.134 (talk) 21:24, 20 October 2010 (UTC)

## Book recommendations

Any good book recommendations on this topic? Thanks —Preceding unsigned comment added by 129.133.94.143 (talk) 22:05, 10 January 2010 (UTC)

## P to Q or Q to P?

Is it standard terminology to refer to $D_{\mathrm{KL}}(P\|Q)$ as the Kullback-Leibler divergence from P to Q, as this article does? Cover and Thomas, for instance, refer to it only as the divergence between P and Q.

I ask mainly because it seems the wrong way round. In the Bayesian interpretation of the KL-divergence, Q is the prior and P is the posterior, and it seems very strange to be talking about something being "from" the posterior "to" the prior. If it's completely accepted terminology then fair enough I guess, but it does seem confusing, and a quick google revealed only this page that uses the from/to terminology at all. Can someone supply a reference for it?

Cover and Thomas' terminology isn't perfect either, since it seems to imply symmetry - but to my mind this is less confusing than putting the "from" and the "to" the way round that this article does.

Nathaniel Virgo (talk) 14:38, 6 April 2010 (UTC)

My view is that if you identify P with "the truth" (or, at least, our current best estimate of it), then P is distinguished: there is one P, but there may be many Qs. That is the reason I think it makes sense to speak of the KL divergence of Q, from P.
Also, I think it is useful to emphasise the asymmetry.
Bayes' theorem finds the new P which minimises the divergence of the previous Q from the new P. I don't see a problem with that.
But I accept it's not cut and dried. Looking at some of the other synonyms, information gain does strongly suggest information gained by moving from Q to P; and it is customary to speak of the relative entropy of P relative to Q (though I'm not sure, when you think about it, that that conveys at all the right idea).
So I stand on the idea that, at least mentally, you are fixing P and then thinking of the divergence of Q from it, hence "from" and "to" as per the article. Jheald (talk) 17:26, 6 April 2010 (UTC)
Using the word "and" as well as the word "between" in "the divergence between P and Q" implies that KL(P||Q) = KL(Q||P). This is incorrect. In a sense you are applying the projective mapping q onto the "real" probability manifold p, and the KL divergence gives you the average divergence of that q from the "true" distribution p (in bits or nats), which is not equal to the divergence of p from q. So the words "onto" and "from" would be appropriate, but "between" is not, nor is "and", as neither implies the directionality of the relationship, which is not reversible (/symmetric). And I hope I didn't get my p's and q's flipped. (Mind your p's and q's! (What the hell are my p's and q's?)) Kevin Baastalk 00:07, 7 April 2010 (UTC)
I guess I can sort-of see your argument, Jheald, although in variational Bayes it's the prior Q that's fixed and the posterior P that's varied. Personally, I see $D_{\mathrm{KL}}(P\|Q)$ as representing the information gained in going from the prior Q to the posterior P, and hence calling it the divergence from P to Q seems the wrong way around. To me the Bayesian-style interpretation is far more fundamental than the interpretation to do with correct and incorrect codings, since coding theory is just one application of information theory and the KL divergence has applications in variational Bayes and hypothesis testing that have nothing to do with it. I find the terminology in this article makes discussions about the subject awkward because I'm always having to explain that I'm updating from a prior to a posterior but calculating the divergence from the posterior to the prior.
However, what matters is not what any of us thinks makes sense but what is standard in the literature. Otherwise it would be "original research", and hence not suitable for a Wikipedia article. I agree that the use of "between...and" implies symmetry but I repeat that it's the only terminology I've seen used in published literature.
If someone can point to some use or justification of this terminology in the established literature on the subject then fair enough I guess, but otherwise I think the article should be changed to use the more standard (though sadly not ideal) "between...and" terminology. Nathaniel Virgo (talk) 01:15, 8 April 2010 (UTC)

Ok, so I've just found a definite published example of the "from P to Q" usage (i.e. the opposite from the article): http://www.kent.ac.uk/secl/philosophy/jw/2009/deFinetti.pdf (a book review in Philosophia Mathematica). I will put a note in the article to the effect that the terminology is not standard and, if nobody can come up with a published example of the "from Q to P" usage I will purge it from the article in a couple of weeks' time (when I'm less busy). Nathaniel Virgo (talk) 21:38, 17 May 2010 (UTC)

Is it possible to standardize this and put the evaluation probability measure last? I mean changing all the notation to be consistent with the article F-divergence. Does anyone disagree? Jackzhp (talk) 14:23, 23 February 2011 (UTC)

Using "between" or something like that is just wrong, as it implies symmetry and this is not symmetric. KL(P||Q) is not equal to KL(Q||P). Though it is a distance measure, particularly from the distribution p to the distribution q. You can think of it roughly as what level of magnification you need to read language Q' if you see digitally and the pixels in your eyes are arranged according to P'. In this sense it's quite literally from your eye, P', to the script, Q'. Whatever semantics you use, the wording must maintain the asymmetry. Ideally it would also be visually intuitive; otherwise it's not really communicating and thus isn't really language. If you find a textbook that uses anything like "between p and q", then that textbook is wrong, and I suggest you not use it as it might contain other such errors. Kevin Baastalk 15:24, 23 February 2011 (UTC)
Agree, so let's remove the word "between", and explicitly state that we should not say the word "between". Jackzhp (talk) 18:09, 23 February 2011 (UTC)

This discussion is intensely frustrating. Jheald has it in his head backwards, and simply reverts changes that anyone else makes (as he does with other edits he doesn't like), and no one finds it worth fighting with him. I first tried changing this in 2009 with a reference to Wolfram MathWorld (http://mathworld.wolfram.com/RelativeEntropy.html) which was insufficient of a source for Jheald—search the edit logs for "Mathworld is wrong" to see what I mean. That's when I gave up trying to fix it—I didn't have the energy for a fight. I left it alone, hoping Jheald would go away and let someone who knew what they were doing fix it. I come back now, and the article is still wrong, and Jheald is still arguing it backwards.

Rather than try to persuade each other as to what sounds right or what sounds wrong, why not use the asymmetric nature of the metric to compute the answer? Consider two distributions:

A = {X:25%, Y:25%, Z:25%, Q:25%}, and
B = {X:33%, Y:33%, Z:33%, Q:0%},

where "33%" is shorthand for "one third". Now one of these has a higher KL divergence with respect to the other than the other does with respect to the one. Which is which? If you assume A is truth, then B has some divergence from that truth. If you assume B is truth, then A has infinite divergence because there should be no Q under any circumstances. So, does this not mean that the relative entropy/divergence of A wrt B is infinity, while the relative entropy/divergence of B wrt A is finite? I think it does. Imagine a large bucket of balls, evenly distributed across four colors: that's A. B is a distribution you could get from selecting 3 balls from that bucket (or 6 balls, or 9 balls, though it obviously becomes less likely as the number goes up.) B (the sample) thus diverges a bit from A (the truth). Now consider a different bucket of balls of three colors, evenly distributed: that's B. If you pull out 4 balls and get 4 different colors, you couldn't have been reaching into bucket B: A (the sample) has diverged infinitely from the distribution of B (the truth).

If you do not think this describes "the divergence of A wrt B" and "the divergence of B wrt A", then argue why it's the other way around, please.

Computing D(A||B) and D(B||A) is straightforward and unambiguous. Using ".33" for "1/3" to avoid too many fraction slashes:

D(B||A) = .33*log(.33/.25) + .33*log(.33/.25) + .33*log(.33/.25) + 0*log(0/.25); By convention 0*log0 = 0; so we have D(B||A) = log(.33/.25) = .1249... with a base-10 log.
D(A||B) = .25*log(.25/.33) + .25*log(.25/.33) + .25*log(.25/.33) + .25*log(.25/0); log(.25/0) goes to infinity, so the total sum goes to infinity.

If you really don't like division by 0, set the probability of Q in distribution B to be ε = 1/1E+50. That doesn't change the value of D(B||A) in the first 4 significant digits, but D(A||B) goes to 12.2590... Now set ε = 1/1E+500, or ε = 1/1E+50_000_000.

My conclusion: D(B||A) is the divergence of B wrt A. D(A||B) is the divergence of A wrt B. MathWorld got it right—big shock! Wikipedia has it wrong. This is why I quit editing anything other than typos in Wikipedia. 108.28.163.61 (talk) 20:24, 15 June 2011 (UTC)
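The arithmetic above is easy to reproduce. Here is a minimal Python sketch, using exact thirds rather than ".33"; the function name `kl10` and the renormalised distribution `B_eps` are illustrative choices, not part of the original comment.

```python
import math

def kl10(p, q):
    """D(P || Q) with base-10 logs; a p = 0 term counts as 0, p > 0 with q = 0 gives inf."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log10(pi / qi)
    return total

A = [0.25, 0.25, 0.25, 0.25]
B = [1 / 3, 1 / 3, 1 / 3, 0.0]

d_ba = kl10(B, A)  # about 0.1249
d_ab = kl10(A, B)  # inf

# Replace the zero in B by a tiny epsilon and renormalise the other three entries.
eps = 1e-50
B_eps = [(1 - eps) / 3] * 3 + [eps]
d_ba_eps = kl10(B_eps, A)  # essentially unchanged from d_ba
d_ab_eps = kl10(A, B_eps)  # about 12.256 with exact thirds

print(d_ba, d_ab, d_ba_eps, d_ab_eps)
```

With exact thirds the ε = 1E-50 case comes out near 12.256. The 12.2590 quoted above appears to correspond to plugging in 0.33 literally. Either way the qualitative point stands: D(B||A) barely moves, while D(A||B) grows without bound as ε shrinks.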

Is it possible that my earlier argument convinced everyone? Can we switch it around now? 108.28.163.61 (talk) 00:46, 9 July 2011 (UTC)

I think you're right, and I was wrong. I have made the change you suggest, and will now go through the articles that link here, making sure they're consistent with the (now revised) usage here. If there are any that I've missed, please do fix them. Jheald (talk) 19:37, 7 November 2012 (UTC)
I have made the changes now; but I am having some qualms of second (third?) thoughts, having now looked through the articles that link here, wondering if perhaps I was right the first time after all.
I suspect that where books do specify a direction, they talk about the KL divergence of the posterior from the prior. And ultimately, the balance of the usage in the real world has to be what ought to most strongly guide us here. But it might be worth surveying. The first book I've just happened to look at (Gelman et al 1995 Bayesian Data Analysis 1st ed, p.485) writes of "the Kullback information of the model relative to the true distribution" -- ie DKL of q relative to p; although admittedly this is in a context where they are holding p fixed and varying q, which may make language this way round more natural. But if that's a first data point, it could well be worth surveying further.
On an aesthetic basis, I suppose it comes down to weighing whether it's from p (and/or with respect to p), because it's the distribution p we're taking as fundamental and weighting all our expectations in proportion to -- or whether it's from q, because it is zero values in the distribution of q which have such a determinative impact.
One article I noticed was cross-entropy, H(P,Q) = H(P) + D(P||Q) -- the average number of bits needed to code a datum if coded according to q rather than p. It does seem to make sense to consider this as the entropy of P plus an 'extra' number of bits for going from P to Q.
I was also struck by a number of articles that I've now changed to describe the "divergence of a distribution p from a reference prior m" using the new language -- but I can't help having the feeling that if m is a reference baseline that things are being calculated from, then doesn't that sort of suggest that it's m that the expectations should be being calculated with respect to? -- a sense that goes away if we write "divergence from the distribution to a reference prior m".
Looking at some of the synonyms for the divergence, "information gain" does strongly suggest moving from Q to P. But perhaps Akaike Information Criterion is nearer the mark when it suggests that what the KL divergence is really about is the amount of information loss in giving up P and moving back to Q (while still using P as the measuring stick to assess expectations).
Admittedly "discrimination information" goes the other way. If we consider P having a narrower support than Q, it may be hard to discriminate P samples from a priorly assumed Q (small D(P||Q) away from Q), but easier to discriminate Q samples from a priorly assumed P (large D(Q||P)). So here it does make sense to talk of discriminating a true distribution P from a prior distribution Q.
And then what about "relative information"/"relative entropy". Having got used to talking about H(p) as the Shannon entropy of p, then when we want to bring in a reference measure m, it then feels sort of appropriate to be talking about D(p||m) as the entropy of p "relative to" the measure m. But is this really what we mean?
... to be given more thought. Jheald (talk) 23:42, 7 November 2012 (UTC)
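The cross-entropy relation mentioned above, H(P,Q) = H(P) + D(P||Q), can be checked with a few lines of Python. This is only a sketch: the function names and the two distributions are illustrative assumptions, and it assumes Q is positive wherever P is.

```python
import math

def entropy(p):
    """Shannon entropy H(P) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) in bits: average code length for data from P under a code built for Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """D(P || Q) in bits; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions over three symbols.
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

print(entropy(p))           # 1.5 bits
print(kl(p, q))             # 0.25 extra bits per symbol
print(cross_entropy(p, q))  # 1.75 bits
```

In this reading D(P||Q) is the per-symbol penalty for coding P-distributed data with a Q-optimal code, with all averages taken under P.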
I have now done what I should have done to start with, namely look at a straw poll from Google books, to see what form the world really is using. Searching for "Kullback-Leibler divergence of", "Kullback-Leibler distance of", and "Kullback-Leibler divergence from" I'm getting 26 - 8 for "q from p", i.e. "approximation from truth". That may not be particularly scientific, since (i) Google fights are considered dodgy anyway (ii) it's only quite a small sample, (iii) many are of very little authority (iv) I had to ignore entries where Google only gave me snippets and I couldn't see the context, and (v) I got tired with the last lot and didn't get to the end. But I hope it's a good enough straw poll to decide that it should be the "from p to q" form that we go with. (And so now I have to go back and undo all the changes I made yesterday!)
I liked the explicit gloss given by Burnham and Anderson, as adapted in the paper in Kurdila: "The Kullback-Leibler distance of q from p is a measure of the information lost when q is used to approximate p" [1]. I think that would be well worth incorporating into the lead. Jheald (talk) 11:41, 8 November 2012 (UTC)

Details of the Google hits below, put into a show/hide box for convenience. Jheald (talk) 11:41, 8 November 2012 (UTC)

## Question

These two statements seem to contradict each other:

(1) The K-L divergence is only defined when $P>0$ and $Q>0$ for all values of i, and when P and Q both sum to 1.

(2) The self-information ... is the KL divergence of the probability distribution P(i) from a Kronecker delta...

They appear contradictory because a Kronecker delta puts probability zero on all events except one. So if (1) is correct, a KL divergence from a Kronecker delta should be undefined. Could someone knowledgeable please correct the page if necessary? (Or help me understand why no correction is necessary...) Thanks! Rinconsoleao (talk) 16:27, 15 August 2010 (UTC)

The first requirement is unnecessarily strict. It should be that $Q(i)>0$ for all i for which $P(i)>0$. One then makes the usual information theory limit, that we can take $0 \; \log \; 0$ to be zero. Jheald (talk) 17:41, 15 August 2010 (UTC)
Thanks! I just edited the page for consistency with your comment. I will look for an appropriate page-specific reference, but if you can provide one that would be great. Rinconsoleao (talk) 12:41, 16 August 2010 (UTC)
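
For anyone checking the two statements numerically, here is a minimal sketch of the relaxed definition discussed above. The function name `kl_divergence` and the example distribution `q` are illustrative assumptions, not taken from the article; terms with $P(i)=0$ are skipped (the $0 \log 0 = 0$ convention), and the result is reported as infinite when $Q(i)=0$ while $P(i)>0$.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in bits for two discrete distributions given as lists.

    Assumed conventions: a term with p[i] == 0 contributes 0, and the
    divergence is infinite if q[i] == 0 at some i where p[i] > 0.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue            # 0 * log(0 / q) is taken to be 0
        if qi == 0:
            return math.inf     # P puts mass where Q puts none
        total += pi * math.log2(pi / qi)
    return total

# Kronecker delta on outcome 2, compared with an arbitrary (made-up) distribution.
delta = [0.0, 0.0, 1.0, 0.0]
q = [0.1, 0.2, 0.25, 0.45]

d = kl_divergence(delta, q)              # well defined despite the zeros in delta
self_information = -math.log2(q[2])      # self-information of outcome 2 under q
reverse = kl_divergence(q, delta)        # infinite: q > 0 where delta == 0
```

With the delta in the first argument the divergence equals the self-information of the selected outcome; with the arguments swapped it is infinite, which is why the direction of the support condition matters.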

Another question: in the properties section, it is written that the K-L divergence is non-negative, yet the plots shown at the beginning have negative values. What is wrong? —Preceding unsigned comment added by 148.247.183.97 (talk) 22:25, 7 September 2010 (UTC)

Nothing is wrong. I assume you're referring to the image captioned "Illustration of the Kullback–Leibler (KL) divergence for two normal Gaussian distributions". The KL divergence is the total (signed) area under the curve shown. Although parts of the curve may be negative, the total area is never negative, and it is strictly positive unless the two distributions coincide. As mentioned in the article, this is guaranteed by Gibbs' inequality. Kevin Baastalk 15:44, 8 September 2010 (UTC)
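
A rough numerical check of this point, as a sketch only: the two Gaussians below (P = N(0, 1), Q = N(1, 1)) are assumed for illustration and are not necessarily those in the article's figure, and the helper names are hypothetical. The integrand $p(x) \log(p(x)/q(x))$ goes negative wherever $q(x) > p(x)$, yet a simple Riemann sum over the whole curve comes out positive.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std. dev. sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def kl_integrand(x, mu_p, sigma_p, mu_q, sigma_q):
    """Pointwise contribution p(x) * ln(p(x) / q(x)), in nats."""
    p = normal_pdf(x, mu_p, sigma_p)
    q = normal_pdf(x, mu_q, sigma_q)
    return p * math.log(p / q)

# Assumed example: P = N(0, 1), Q = N(1, 1), integrated over [-10, 10].
n = 20000
step = 20.0 / n
xs = [-10.0 + i * step for i in range(n + 1)]
values = [kl_integrand(x, 0.0, 1.0, 1.0, 1.0) for x in xs]

most_negative = min(values)      # the curve does dip below zero
integral = sum(values) * step    # Riemann-sum estimate of the KL divergence
closed_form = (0.0 - 1.0) ** 2 / (2.0 * 1.0 ** 2)  # equal-variance case
```

For equal variances the closed form is $(\mu_P - \mu_Q)^2 / (2\sigma^2)$, so the estimate can be compared against it directly.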

## What About P > 0 and Q = 0? What About P = 0 Everywhere Q > 0

It seems to me that KL-divergence has the nice property that when P = Q, we end up with a "distance" of 0. But when P = 0 everywhere that Q > 0, we end up with a "distance" of 0 also, even though the distributions are obviously, in some sense, about as "far apart" as possible.

Furthermore, it seems that the KL-divergence would be of limited use to the practitioner who, in computer implementations for practical problems, will often end up having P > 0 and Q = 0 at some points. I'm sure this is a standard "problem", but how does one address this? If KL-divergence can't be "patched" for these cases, it seems useless for the LARGE number of applications where at least one sample will have P > 0 and Q = 0. — Preceding unsigned comment added by 99.69.50.190 (talk) 04:14, 1 November 2011 (UTC)

If P = 0 somewhere (or everywhere) that Q > 0, then somewhere else you must have P > Q, so that the total probability for each probability distribution can still both be 1. So the KL divergence will not be zero.
As to your second question: the moral is, only set Q(x)=0 if you are really really sure the case is impossible. Compare Cromwell's Rule for discussion. Jheald (talk) 10:10, 1 November 2011 (UTC)
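
Both points can be checked on a toy example. This is a sketch under stated assumptions: the distributions and counts are made up, `kl_divergence` and `smoothed` are hypothetical helper names, and additive (Laplace) smoothing is shown only as one common way to avoid assigning Q(x) = 0 to an outcome that is merely unobserved; it is not something the article prescribes.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in bits; 0*log(0) terms are 0, and q == 0 where p > 0 gives inf."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log2(pi / qi)
    return total

# P = 0 somewhere that Q > 0: P must then exceed Q elsewhere, so D(P||Q) > 0.
p = [0.0, 0.5, 0.5]
q = [0.5, 0.25, 0.25]
d_pq = kl_divergence(p, q)   # positive, not zero

# Swapping roles, the approximating distribution is 0 where the other is positive.
d_qp = kl_divergence(q, p)   # infinite

def smoothed(counts, alpha=1.0):
    """Additive smoothing: add alpha to every count before normalising."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]

# An estimate built from counts where outcome 0 was never observed.
model_counts = [0, 3, 7]
model = smoothed(model_counts)           # no exact zeros remain
d_smooth = kl_divergence(q, model)       # finite after smoothing
```

Whether smoothing is appropriate depends on whether the zero reflects genuine impossibility or just a small sample, which is the point of Cromwell's Rule.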

## Symmetrised divergence

Any objections if I change "Kullback and Leibler themselves actually defined the divergence as" to "Kullback and Leibler cite Jeffreys' divergence as"? (Source: Kullback & Leibler (1951), p. 81.) Wrecktaste (talk) 17:59, 1 November 2011 (UTC)

## Definitions

In my opinion, mentioning a random variable in the definition is superfluous and somewhat misleading. The definition concerns only two discrete or two absolutely continuous probability measures. Nijdam (talk) 17:51, 29 April 2012 (UTC)

## Correct Accent Marks?

I am just wondering if there are any accent marks on Kullback or Leibler. I suspect, but do not know, that there is supposed to be an umlaut over the 'u' in Kullback. Maybe there are no accent marks at all? I know this is trivial, but I would assume there are others like me who respect accent marks when possible. — Preceding unsigned comment added by 68.228.41.185 (talk) 21:18, 5 April 2013 (UTC)

There should be no accent marks. Solomon Kullback was American. Jheald (talk) 08:21, 6 April 2013 (UTC)

## Entropy formula

Just a very basic question: I was confused by the equation after "This distribution has a new entropy". Isn't it missing a minus sign? — Preceding unsigned comment added by 130.88.91.196 (talk) 09:46, 14 May 2014 (UTC)

## cdot Notation in Bayesian section

In the Bayesian updating section, a notation involving \cdot is used without any introduction, explanation, or link. Can someone who knows what's going on there fix that? Thanks. — Preceding unsigned comment added by 174.101.176.129 (talk) 10:42, 11 March 2015 (UTC)

## Data differencing

A clearer description of relative entropy with respect to data differencing could be written. For absolute entropy, the $-\sum_i p(i) \log p(i)$ formula works pretty well in approximating a theoretical limit on file compression. However, I cannot see how the relative entropy formula places a theoretical limit on the size of a patch between two files.

I've taken two webcam images of the same scene under identical lighting conditions. 1.jpg is 64139 bytes long. 2.jpg is 64555 bytes, differing in what I assume to be image noise only. The relative entropy formula then gives 330 bits as a limiting patch/delta size. It seems low. I've also created a patch using the xdelta open source utility. It comes out as 60939 bytes, which intuitively seems realistic. This is several orders of magnitude different from the 330 bits.

Saying that relative entropy forms a background for determining minimum patch size causes confusion if the numbers cannot be made to support it. Perhaps alternative language / a numerical example needs to be used?
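
As a starting point for such a numerical example, here is a sketch of what the formula computes when applied to symbol frequency distributions. How the 330-bit figure above was obtained is not stated, so this is an assumption about the setup, and the distributions, the symbol count, and the function names are all made up for illustration. Read this way, relative entropy is the expected number of extra bits per symbol incurred when data drawn from P is coded with a code that is optimal for Q, i.e. cross-entropy minus entropy.

```python
import math

def entropy_bits(p):
    """Shannon entropy H(P) in bits per symbol."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy_bits(p, q):
    """Expected code length in bits per symbol when P-distributed symbols
    are coded with a code optimal for Q (assumes q > 0 wherever p > 0)."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

# Made-up symbol frequencies for a "target" (P) and a "source" (Q).
p = [0.5, 0.25, 0.125, 0.125]
q = [0.25, 0.25, 0.25, 0.25]

h_p = entropy_bits(p)                    # best achievable rate for P
h_pq = cross_entropy_bits(p, q)          # rate when using Q's code instead
extra_per_symbol = h_pq - h_p            # equals D_KL(P || Q) in bits

n_symbols = 1000                         # hypothetical message length
extra_total_bits = n_symbols * extra_per_symbol
```

Note that a quantity computed from frequency distributions alone depends only on how often each symbol occurs, not on where in the file the symbols differ, so it need not resemble the size of a positional patch such as the one xdelta produces.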