Talk:Jensen–Shannon divergence

This article is within the scope of WikiProject Mathematics, a collaborative effort to improve the coverage of mathematics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks. This article has been rated as Low-priority on the project's priority scale.

This article is within the scope of WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks. This article has been rated as Low-importance on the importance scale.

Simplify definition of JSD Domain / co-domain?

I find the definition of JSD using a sigma algebra unnecessarily formal. Couldn't it be simplified by stating that the probability distributions have a common domain? 86.15.19.235 (talk) 22:08, 25 November 2016 (UTC)

Bad edits

If you look at the first edits, they are very different from what the article is now... I don't know what is correct, but it must be looked at. gren グレン 17:40, 2 June 2006 (UTC)

I decided to revert the page back to the last good version. I don't know what happened, but recent versions were essentially broken. --Dan|^(talk) 00:15, 22 June 2006 (UTC)

Codomain

This article mentions that the codomain of JSD is [0,1] but does not explain why. I fail to see how the upper bound holds; could someone explain it? The second referenced paper (Dagan, Ido; Lillian Lee, Fernando Pereira (1997). "Similarity-Based Methods For Word Sense Disambiguation". Proceedings of the Thirty-Fifth Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics: pp. 56–63.) mentions a [0, 2log2] bound, again without providing any explanation. Which is right, and why? Pflaquerre (talk) 08:12, 15 July 2008 (UTC)
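One possible reading of the two bounds, offered as a sketch rather than a proof: with the 1/2 factors used in this article, the JSD is the mutual information between a fair coin (which of P or Q was sampled) and the sample, so it cannot exceed ln 2 nats, i.e. 1 bit with base-2 logarithms. The bound is reached when P and Q have disjoint support. Dropping the 1/2 factors doubles the maximum to 2 ln 2; whether Dagan et al. use exactly that unhalved form is an assumption here. The helper names `kl` and `jsd` are illustrative.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence in nats, with the convention 0 * ln(0/x) = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q, base=math.e):
    """Jensen-Shannon divergence with the 1/2 factors, in units set by `base`."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return (0.5 * kl(p, m) + 0.5 * kl(q, m)) / math.log(base)

# Two distributions with disjoint support: the most dissimilar case.
p = [1.0, 0.0]
q = [0.0, 1.0]

jsd_nats = jsd(p, q)          # natural logarithm: equals ln 2
jsd_bits = jsd(p, q, base=2)  # base-2 logarithm: equals 1
unhalved = 2 * jsd_nats       # KL(p||m) + KL(q||m) without the 1/2 factors: 2 ln 2

print(jsd_nats, jsd_bits, unhalved)
```

So the [0,1] range would correspond to base-2 logarithms with the 1/2 factors, and [0, 2log2] to natural logarithms without them.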

Removal of citations by David Carmel et al.

Hi,

I was kindly asked by David Carmel to remove his citations. Sorry for that. 192.114.107.4, 02:40, 25 November 2010

Why? If an author has published, can this be a valid request? Unless the author has retracted their paper... Is this the case, is there a retraction? ... Ohhh, ahhh, judging from the titles of the papers, perhaps they should not be considered to be authoritative references for the topic. Not citing them makes sense, in this case. linas (talk) 18:19, 10 July 2012 (UTC)

Fisher information metric?

A recent edit connects the square root of the JSD with the Fisher information metric. That page does seem to mention the JSD, but says that sqrt(JSD) is the Fisher information metric times sqrt(8). I'm not an expert on Fisher information, so I could do with some help here (this is actually the first time I've seen this connection), but would it then be more correct to say in the lead section that sqrt(JSD) is proportional to the Fisher information metric, rather than equal to it? Best, --Amkilpatrick (talk) 21:02, 10 July 2012 (UTC)

Yes, to say 'proportional' would be more correct. Insofar as different authors often define exactly the same thing, but with different notations, different normalizations and constants in front of them, one gets in the habit of saying 'is' instead of 'proportional', with the latter being understood when the former is written. linas (talk) 21:17, 10 July 2012 (UTC)

Great, updated accordingly - thanks for your help, --Amkilpatrick (talk) 06:45, 11 July 2012 (UTC)
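For what the proportionality looks like locally, here is a rough numerical sketch. It assumes natural logarithms, the article's JSD with the 1/2 factors, and the Fisher quadratic form $\sum_i \delta_i^2 / p_i$ for a categorical distribution; none of these normalisations are fixed by the discussion above, and the example distribution and perturbation direction are arbitrary. For nearby distributions the ratio of the JSD to that quadratic form appears to settle near 1/8, which is consistent with a sqrt(8) constant between sqrt(JSD) and the Fisher line element.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence in nats, with the convention 0 * ln(0/x) = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence with the 1/2 factors, in nats."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def fisher_quadratic_form(p, delta):
    """Sum_i delta_i^2 / p_i: squared Fisher line element for a categorical distribution."""
    return sum(d * d / pi for d, pi in zip(delta, p))

p = [0.2, 0.3, 0.5]
direction = [1.0, -2.0, 1.0]  # sums to zero, so p + eps * direction stays normalised

ratios = []
for eps in (1e-2, 1e-3, 1e-4):
    delta = [eps * d for d in direction]
    q = [pi + di for pi, di in zip(p, delta)]
    ratios.append(jsd(p, q) / fisher_quadratic_form(p, delta))

print(ratios)  # the ratio approaches 1/8 as eps shrinks
```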

Relationship to Jeffrey Divergence

Apparently, Jensen-Shannon divergence and Jeffrey divergence (mentioned in Divergence (statistics), although I mostly saw formulations that look much more like J-S divergence) are essentially the same (maybe except for the weight factors $\pi$ ?). Can anyone research / elaborate? Thanks. --88.217.93.154 (talk) 11:33, 6 October 2013 (UTC)

___

I don't understand this comment but it looks interesting...! I don't know about Jeffreys' divergence except what I've just read in Divergence (statistics). Why do you think it looks like it's the same as J-S? What other formulation have you seen?

At least one difference seems clear: JS is well-behaved as individual probabilities tend to zero, since its kernel function has x log x terms, whereas Jeffreys' looks like it isn't, since it has log x terms. RichardThePict (talk) 09:05, 9 October 2013 (UTC)
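The point about zero probabilities can be illustrated with a small sketch. It assumes discrete distributions, natural logarithms, the usual convention 0 ln 0 = 0, and Jeffreys' divergence taken as the sum of the two KL divergences; the helper names and example vectors are illustrative only.

```python
import math

def kl(p, q):
    """KL divergence in nats; 0 * ln(0/x) = 0, and infinite if q_i = 0 where p_i > 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

def jsd(p, q):
    """Jensen-Shannon divergence with the 1/2 factors, in nats."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def jeffreys(p, q):
    """Symmetrised KL divergence, KL(p||q) + KL(q||p)."""
    return kl(p, q) + kl(q, p)

# Each distribution puts zero mass on an outcome the other one supports.
p = [0.5, 0.5, 0.0]
q = [0.5, 0.0, 0.5]

print(jsd(p, q))       # finite, because the mixture m is positive wherever p or q is
print(jeffreys(p, q))  # infinite, because of the ln(p_i / q_i) terms with q_i = 0
```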

Yes, based on the definition in Divergence (statistics) they are essentially the same:

${\begin{aligned}D_{J}(p\parallel q)&=\int (p(x)-q(x)){\big (}\ln p(x)-\ln q(x){\big )}dx\\&=\int (p(x)-q(x))\ln {\frac {p(x)}{q(x)}}dx\\&=\int p(x)\ln {\frac {p(x)}{q(x)}}dx-\int q(x)\ln {\frac {p(x)}{q(x)}}dx\\&=\int p(x)\ln {\frac {p(x)}{q(x)}}dx+\int q(x)\ln {\frac {q(x)}{p(x)}}dx\\&=D_{KL}(p\parallel q)+D_{KL}(q\parallel p)\\&=2\operatorname {JSD} (p\parallel q)\end{aligned}}$

This is assuming that everything is well-behaved. Qorilla (talk) 15:16, 5 June 2019 (UTC)
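A quick numerical check of the chain above, as a sketch with arbitrarily chosen, strictly positive example distributions (the values of `p` and `q` are illustrative assumptions, and sums stand in for the integrals). The steps up to $D_{KL}(p\parallel q)+D_{KL}(q\parallel p)$ check out numerically, but for this pair the sum does not equal twice the JSD as defined in the article via $M={\frac {1}{2}}(P+Q)$, so the last equality may need revisiting.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence in nats for strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def jsd(p, q):
    """Jensen-Shannon divergence as in the article: mean KL to the mixture M."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.1, 0.9]
q = [0.5, 0.5]

# First line of the derivation: sum of (p - q)(ln p - ln q).
jeffreys_direct = sum((pi - qi) * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))
# Second-to-last line: symmetrised KL divergence.
symmetrised_kl = kl(p, q) + kl(q, p)
# Last line: twice the JSD.
twice_jsd = 2 * jsd(p, q)

print(jeffreys_direct, symmetrised_kl, twice_jsd)
```

Note also that the JSD is bounded above by ln 2, while the symmetrised KL divergence is unbounded, so the two cannot agree up to a constant factor in general.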

Algebra for definition

For those interested, the formula given as a definition (the last step, #16, below, to within a factor of 1/2 as discussed) is here derived from formula 5.1 in Lin,^[1] which matches exactly the "more general definition" given in this article:

1.

{\rm {JSD}}_{\pi _{1},\ldots ,\pi _{n}}(P_{1},P_{2},\ldots ,P_{n})=H\left(\sum _{i=1}^{n}\pi _{i}P_{i}\right)-\sum _{i=1}^{n}\pi _{i}H(P_{i})

As in the article, two equally weighted ( $\pi _{1}=\pi _{2}={\frac {1}{2}}$ ) vectors are used, $P_{1}=P$ and $P_{2}=Q$ .

Substituting that in and expanding the sums,

2.

{\rm {JSD}}(P,Q)=H\left({\frac {1}{2}}P+{\frac {1}{2}}Q\right)-({\frac {1}{2}}H(P)+{\frac {1}{2}}H(Q))

Factoring out the 1/2 in the first part and distributing the negative in the second,

3.

{\rm {JSD}}(P,Q)=H\left({\frac {1}{2}}(P+Q)\right)-{\frac {1}{2}}H(P)-{\frac {1}{2}}H(Q)

Plugging in the definition of Shannon entropy $H(X)=-\sum _{i}(X_{i}*\ln(X_{i}))$ ,

4.

{\rm {JSD}}(P,Q)=-\sum _{i}(({\frac {1}{2}}(P_{i}+Q_{i}))*\ln({\frac {1}{2}}(P_{i}+Q_{i})))+{\frac {1}{2}}\sum _{i}(P_{i}*\ln(P_{i}))+{\frac {1}{2}}\sum _{i}(Q_{i}*\ln(Q_{i}))

Distributing that first 1/2,

5.

{\rm {JSD}}(P,Q)=-\sum _{i}(({\frac {1}{2}}P_{i}+{\frac {1}{2}}Q_{i})*\ln({\frac {1}{2}}(P_{i}+Q_{i})))+{\frac {1}{2}}\sum _{i}(P_{i}*\ln(P_{i}))+{\frac {1}{2}}\sum _{i}(Q_{i}*\ln(Q_{i}))

Simplifying presentation by the Wikipedia article's definition of $M={\frac {1}{2}}(P+Q)$ :

6.

{\rm {JSD}}(P,Q)=-\sum _{i}(({\frac {1}{2}}P_{i}+{\frac {1}{2}}Q_{i})*\ln(M_{i}))+{\frac {1}{2}}\sum _{i}(P_{i}*\ln(P_{i}))+{\frac {1}{2}}\sum _{i}(Q_{i}*\ln(Q_{i}))

Distributing the first logarithm,

7.

{\rm {JSD}}(P,Q)=-\sum _{i}({\frac {1}{2}}P_{i}*\ln(M_{i})+{\frac {1}{2}}Q_{i}*\ln(M_{i}))+{\frac {1}{2}}\sum _{i}(P_{i}*\ln(P_{i}))+{\frac {1}{2}}\sum _{i}(Q_{i}*\ln(Q_{i}))

Splitting the first sum,

8.

{\rm {JSD}}(P,Q)=-\sum _{i}({\frac {1}{2}}P_{i}*\ln(M_{i}))-\sum _{i}({\frac {1}{2}}Q_{i}*\ln(M_{i}))+{\frac {1}{2}}\sum _{i}(P_{i}*\ln(P_{i}))+{\frac {1}{2}}\sum _{i}(Q_{i}*\ln(Q_{i}))

Taking the ${\frac {1}{2}}$ out of those now first two sums,

9.

{\rm {JSD}}(P,Q)=-{\frac {1}{2}}\sum _{i}(P_{i}*\ln(M_{i}))-{\frac {1}{2}}\sum _{i}(Q_{i}*\ln(M_{i}))+{\frac {1}{2}}\sum _{i}(P_{i}*\ln(P_{i}))+{\frac {1}{2}}\sum _{i}(Q_{i}*\ln(Q_{i}))

Putting the third term first and the fourth term third,

10.

{\rm {JSD}}(P,Q)=+{\frac {1}{2}}\sum _{i}(P_{i}*\ln(P_{i}))-{\frac {1}{2}}\sum _{i}(P_{i}*\ln(M_{i}))+{\frac {1}{2}}\sum _{i}(Q_{i}*\ln(Q_{i}))-{\frac {1}{2}}\sum _{i}(Q_{i}*\ln(M_{i}))

Factoring out the ${\frac {1}{2}}$ from the first and second pairs of terms,

11.

{\rm {JSD}}(P,Q)={\frac {1}{2}}(\sum _{i}(P_{i}*\ln(P_{i}))-\sum _{i}(P_{i}*\ln(M_{i})))+{\frac {1}{2}}(\sum _{i}(Q_{i}*\ln(Q_{i}))-\sum _{i}(Q_{i}*\ln(M_{i})))

Combining a couple sums,

12.

{\rm {JSD}}(P,Q)={\frac {1}{2}}(\sum _{i}(P_{i}*\ln(P_{i})-P_{i}*\ln(M_{i})))+{\frac {1}{2}}(\sum _{i}(Q_{i}*\ln(Q_{i})-Q_{i}*\ln(M_{i})))

Removing some parentheses for clearer notation,

13.

{\rm {JSD}}(P,Q)={\frac {1}{2}}\sum _{i}(P_{i}*\ln(P_{i})-P_{i}*\ln(M_{i}))+{\frac {1}{2}}\sum _{i}(Q_{i}*\ln(Q_{i})-Q_{i}*\ln(M_{i}))

Factoring out the $P_{i}$ in the first sum and $Q_{i}$ in the second,

14.

{\rm {JSD}}(P,Q)={\frac {1}{2}}\sum _{i}(P_{i}*(\ln(P_{i})-\ln(M_{i})))+{\frac {1}{2}}\sum _{i}(Q_{i}*(\ln(Q_{i})-\ln(M_{i})))

Observing that $\ln(X)-\ln(Y)=\ln({\frac {X}{Y}})$ ,

15.

{\rm {JSD}}(P,Q)={\frac {1}{2}}\sum _{i}(P_{i}*(\ln({\frac {P_{i}}{M_{i}}})))+{\frac {1}{2}}\sum _{i}(Q_{i}*(\ln({\frac {Q_{i}}{M_{i}}})))

The definition given for the Kullback–Leibler divergence (on Wikipedia or equation 2.1 in Lin's paper^[1]) is $D_{\mathrm {KL} }(P\|Q)=\sum _{i}P_{i}\,\ln {\frac {P_{i}}{Q_{i}}}$ . This transforms the previous step to:

16.

{\rm {JSD}}(P,Q)={\frac {1}{2}}D_{\mathrm {KL} }(P\|M)+{\frac {1}{2}}D_{\mathrm {KL} }(Q\|M)

Which is what is given as the definition of the Jensen–Shannon divergence in this article.
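The end points of the derivation (step 3, the entropy form, and step 16, the KL form) can be compared numerically. This is only a sanity check under assumed example distributions with strictly positive entries and natural logarithms; the function names are illustrative.

```python
import math

def entropy(p):
    """Shannon entropy in nats, H(X) = -sum_i X_i ln X_i."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl(p, q):
    """Kullback-Leibler divergence in nats, D_KL(P||Q) = sum_i P_i ln(P_i / Q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd_entropy_form(p, q):
    """Step 3: H((P + Q) / 2) - H(P) / 2 - H(Q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return entropy(m) - 0.5 * entropy(p) - 0.5 * entropy(q)

def jsd_kl_form(p, q):
    """Step 16: D_KL(P||M) / 2 + D_KL(Q||M) / 2 with M = (P + Q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.1, 0.4, 0.5]
q = [0.3, 0.3, 0.4]

difference = abs(jsd_entropy_form(p, q) - jsd_kl_form(p, q))
print(difference)  # zero up to floating-point rounding
```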

That's pretty close to Lin's definition of JS-divergence in terms of KL-divergence, equation 3.4 in the paper^[1], but has an extra factor of 1/2 that Lin doesn't have. Equation 2.2 in the paper, the symmetric version of KL-divergence, is similarly missing that same scaling factor of 1/2. The scaling factor clearly makes sense, as the symmetric version is conceptually just averaging the asymmetric divergences, but it does give rise to a slight discrepancy between this article and Lin's in terms of the absolute values of any computed measures.

--WBTtheFROG (talk) 14:49, 4 June 2015 (UTC)

The Endres and Schindelin paper^[2] includes the 1/2 in the definition of Jensen-Shannon divergence (start p. 1860) but explicitly uses the version that is twice that (i.e. without the one half) before taking the square root shown there to be a metric. They indirectly cite ^[3] which also includes the 1/2 as part of the definition of the Jensen-Shannon divergence. --WBTtheFROG (talk) 22:57, 4 June 2015 (UTC)
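Since the factor of 2 is a constant, the square root of either version satisfies the triangle inequality if the other does. A spot check on random distributions, as a sketch only: the sampling scheme, the seed, and the number of trials are arbitrary assumptions, and a finite sample can illustrate the metric property but not prove it.

```python
import math
import random

def kl(p, q):
    """Kullback-Leibler divergence in nats, with the convention 0 * ln(0/x) = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence with the 1/2 factors, in nats."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def distance(p, q):
    """Square root of twice the JSD, i.e. the version without the 1/2 factors."""
    return math.sqrt(max(0.0, 2 * jsd(p, q)))  # clamp tiny negative rounding errors

def random_distribution(rng, n):
    """Normalise n positive uniform draws into a probability vector."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
violations = 0
for _ in range(1000):
    p, q, r = (random_distribution(rng, 4) for _ in range(3))
    # Triangle inequality: d(p, r) <= d(p, q) + d(q, r), with a small tolerance.
    if distance(p, r) > distance(p, q) + distance(q, r) + 1e-12:
        violations += 1

print(violations)
```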

References

[1] Lin, J. (1991). "Divergence Measures Based on the Shannon Entropy". IEEE Transactions on Information Theory. 37 (1): 145–151. doi:10.1109/18.61115.
[2] Endres, D. M.; Schindelin, J. E. (2003). "A new metric for probability distributions". IEEE Transactions on Information Theory. 49 (7): 1858–1860. doi:10.1109/TIT.2003.813506.
[3] El-Yaniv, Ran; Fine, Shai; Tishby, Naftali (1997). "Agnostic Classification of Markovian Sequences". Advances in Neural Information Processing Systems 10. NIPS '97. MIT Press. pp. 465–471. Retrieved 2015-06-04.

The equation for the JSD of two Bernoullis does not appear to be correct.

I believe that (p-q)(logit(p) - logit(q))/2 is Jeffreys' divergence. The JSD does not usefully simplify from H((p+q)/2) - (H(p) + H(q))/2. 2603:7000:602:7BDE:F5C7:533D:8570:AFE6 (talk) 00:34, 11 October 2022 (UTC)
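A numerical comparison for two Bernoulli distributions, as a sketch with arbitrarily chosen parameters and natural logarithms. For this pair the logit expression matches half of $D_{KL}(p\parallel q)+D_{KL}(q\parallel p)$, i.e. Jeffreys' divergence up to the factor-of-2 normalisation convention, and it differs from the JSD computed as H((p+q)/2) - (H(p) + H(q))/2.

```python
import math

def logit(x):
    """Log-odds of a probability x in (0, 1)."""
    return math.log(x / (1 - x))

def binary_entropy(x):
    """Entropy in nats of a Bernoulli(x) distribution, for x in (0, 1)."""
    return -x * math.log(x) - (1 - x) * math.log(1 - x)

def kl_bernoulli(p, q):
    """KL divergence in nats between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

p, q = 0.2, 0.7  # illustrative parameters

logit_expr = (p - q) * (logit(p) - logit(q)) / 2
half_symmetrised_kl = (kl_bernoulli(p, q) + kl_bernoulli(q, p)) / 2
jsd = binary_entropy((p + q) / 2) - (binary_entropy(p) + binary_entropy(q)) / 2

print(logit_expr, half_symmetrised_kl, jsd)
```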
