User:Robert.Baruch/Sparse

To begin looking for an algorithm that induces a sparse coding, a probabilistic description is constructed. Suppose that there is some distribution of actual input vectors where the probability of a particular input vector ξ appearing is P*(ξ). The goal is to determine basis vectors b and coefficients s such that the distribution of reconstructed input vectors closely approximates the actual distribution. Call this approximate distribution P(ξ | b), which is the probability of a particular k-dimensional input vector ξ appearing given a particular set of n k-dimensional basis vectors b. By the law of total probability, this probability is obtained by summing (integrating) over all possible combinations of coefficients s:

P(\xi \mid b) = \int P(\xi \mid b, s) \, P(s) \, ds
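Since the coefficients are continuous, this integral generally has no closed form. As a rough illustration only, the Python sketch below estimates P(ξ | b) by Monte Carlo, drawing coefficient vectors from the prior and averaging the likelihood; the callables sample_prior and likelihood are hypothetical placeholders for the densities introduced in the rest of this section.

```python
import numpy as np

def estimate_marginal_likelihood(xi, b, sample_prior, likelihood,
                                 num_samples=10000, rng=None):
    """Monte Carlo estimate of P(xi | b) = integral of P(xi | b, s) P(s) ds.

    sample_prior(rng) draws one coefficient vector s from P(s), and
    likelihood(xi, b, s) evaluates P(xi | b, s); both are placeholders for
    the densities defined below.
    """
    rng = np.random.default_rng() if rng is None else rng
    total = 0.0
    for _ in range(num_samples):
        s = sample_prior(rng)           # draw coefficients from the prior P(s)
        total += likelihood(xi, b, s)   # weight by the likelihood P(xi | b, s)
    return total / num_samples          # the sample mean approximates the integral over s
```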
Recall that an input vector ξ is approximated by a reconstructed vector that is a linear combination of the n basis vectors weighted by the n coefficients. We assume that the difference between the input vector and the reconstructed vector is a zero-mean Gaussian with variance σν². More formally,

\xi = \sum_{i=1}^{n} s_i b_i + \nu = b s + \nu
where ν is the Gaussian noise. From this, we can derive P(ξ | b,s), the probability of an input vector ξ appearing given a particular set of basis vectors and coefficients:

P(\xi \mid b, s) = \frac{1}{Z_{\sigma_\nu}} \exp\!\left( -\frac{\left| \xi - b s \right|^2}{2 \sigma_\nu^2} \right)

where

\left| \xi - b s \right|^2 = \sum_{j=1}^{k} \left( \xi_j - \sum_{i=1}^{n} b_{ji} s_i \right)^2

that is, the squared Euclidean distance between the input vector ξ and the reconstructed vector bs, and Zσν is a normalizing constant that makes the density integrate to 1. This equation states that the error between the input and reconstructed vectors is Gaussian-distributed.
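As a concrete sketch of this likelihood, the illustrative function below evaluates P(ξ | b, s) for a k-dimensional input ξ, a k×n basis matrix b, and an n-dimensional coefficient vector s. It assumes an isotropic Gaussian, for which the normalizing constant is (2πσν²)^(k/2); the text above only states that Zσν normalizes the density, so this particular form is an assumption.

```python
import numpy as np

def gaussian_likelihood(xi, b, s, sigma_nu=1.0):
    """Evaluate P(xi | b, s) under the Gaussian noise model.

    xi is a length-k input vector, b is a k x n matrix whose columns are the
    basis vectors, s is a length-n coefficient vector, and sigma_nu is the
    noise standard deviation. An isotropic Gaussian is assumed, so the
    normalizing constant is (2 * pi * sigma_nu**2) ** (k / 2).
    """
    residual = xi - b @ s                  # difference between input and reconstruction
    sq_error = float(residual @ residual)  # squared Euclidean distance |xi - b s|^2
    k = xi.shape[0]
    z_sigma = (2.0 * np.pi * sigma_nu ** 2) ** (k / 2.0)
    return np.exp(-sq_error / (2.0 * sigma_nu ** 2)) / z_sigma
```

In practice one usually works with the logarithm of this quantity, since the density itself underflows quickly as k grows.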

We also make the assumption that the coefficients are independent of each other, so that the probability of a given set of coefficients is simply the product of the individual probabilities of each coefficient:

P(s) = \prod_{i=1}^{n} P(s_i)

As stated above, it is desired that the distribution of each coefficient be very peaky, so that in a typical set of coefficients most are zero, a few (or none) have small absolute value, and hardly any have large absolute value. The probability density function of each coefficient can be parameterized as follows to allow different kinds of peakiness:

P(s_i) = \frac{1}{Z_\beta} \exp\!\left( -\beta S(s_i) \right)

where S determines the shape of the distribution, β determines its steepness, and Zβ is the usual normalizing constant that makes the density integrate to 1.
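The text does not fix a particular S. As an illustration only, the sketch below uses two shape functions that are common in the sparse coding literature, S(x) = |x| (Laplacian-like) and S(x) = log(1 + x²) (Cauchy-like), evaluates the resulting coefficient density, and combines the independent coefficients into the joint log prior from the previous paragraph; the function names are hypothetical.

```python
import numpy as np
from scipy.integrate import quad

def s_abs(x):
    """S(x) = |x|: gives a Laplacian-like, sharply peaked prior."""
    return np.abs(x)

def s_log(x):
    """S(x) = log(1 + x^2): gives a heavier-tailed, Cauchy-like prior."""
    return np.log(1.0 + x * x)

def z_beta(shape_fn, beta):
    """Normalizing constant Z_beta = integral of exp(-beta * S(u)) over all u."""
    value, _ = quad(lambda u: np.exp(-beta * shape_fn(u)), -np.inf, np.inf)
    return value

def coefficient_density(x, shape_fn, beta):
    """P(s_i) = exp(-beta * S(s_i)) / Z_beta for a single coefficient."""
    return np.exp(-beta * shape_fn(x)) / z_beta(shape_fn, beta)

def log_joint_prior(s, shape_fn, beta):
    """log P(s) for independent coefficients: the sum of the individual log densities."""
    log_z = np.log(z_beta(shape_fn, beta))
    return float(np.sum(-beta * shape_fn(np.asarray(s)) - log_z))
```

Larger values of β concentrate more of the probability mass near zero, which produces the peakiness described above.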

Since we wish to measure the difference between P*(ξ) and P(ξ | b) and minimize it, we choose the Kullback–Leibler (KL) divergence, which gives a measure of how different two distributions are:

D_{KL}\!\left( P^* \,\|\, P(\cdot \mid b) \right) = \int P^*(\xi) \log \frac{P^*(\xi)}{P(\xi \mid b)} \, d\xi = \left( -\int P^*(\xi) \log P(\xi \mid b) \, d\xi \right) - \left( -\int P^*(\xi) \log P^*(\xi) \, d\xi \right)

Since P*(ξ) is fixed, the second term (the entropy of P*(ξ)) is also fixed, and so to minimize the KL divergence one need only minimize the first term (the cross entropy of P*(ξ) with P(ξ | b)).
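Minimizing the cross entropy over the basis vectors is equivalent to maximizing the average log-likelihood of the data under the model. As a rough sketch of that reading, the function below estimates the cross entropy from a sample of input vectors drawn from P*(ξ); log_marginal_likelihood is a hypothetical placeholder that evaluates log P(ξ | b), for example via the Monte Carlo estimate sketched earlier.

```python
import numpy as np

def cross_entropy_estimate(inputs, b, log_marginal_likelihood):
    """Estimate the cross entropy -E[log P(xi | b)] over inputs drawn from P*(xi).

    inputs is an iterable of sample input vectors from the actual distribution,
    and log_marginal_likelihood(xi, b) is a placeholder returning log P(xi | b).
    Because the entropy of P*(xi) does not depend on b, minimizing this
    quantity over b also minimizes the KL divergence.
    """
    log_probs = [log_marginal_likelihood(xi, b) for xi in inputs]
    return -float(np.mean(log_probs))
```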