Sparse dictionary learning

Sparse dictionary learning is a representation learning method which aims at finding a sparse representation of the input data (also known as coding) in the form of a linear combination of basis elements as well as those basis elements themselves. These elements are called atoms and they compose a dictionary. Atoms in the dictionary are not required to be orthogonal. This problem setup also allows the dimensionality of representation space to be higher than the one of the input space. The above two properties lead to having seemingly redundant atoms that allow multiple reconstruction ways but also provide an improvement in sparsity and flexibility of the representation.

One of the key principles of dictionary learning is that the dictionary has to be inferred from the input data. The emergence of sparse dictionary learning methods was stimulated by the fact that in signal processing one typically wants to represent the input data using as few components as possible. Before this approach the general practice was to use predefined dictionaries (such as fourier or wavelet transforms). However, in certain cases the dictionary learned to fit the input data can significantly improve the sparsity and thus the results of signal processing.

This method is applied to data decomposition, compression and analysis and has been used in the fields of image denoising and classification, video and audio processing.

Problem statement

Given the input dataset $X=[x_{1},...,x_{K}],x_{i}\in \mathbb {R} ^{d}$ we wish to find a dictionary $\mathbf {D} \in \mathbb {R} ^{d\times n}:D=[d_{1},...,d_{n}]$ and a representation $R=[r_{1},...,r_{K}],r_{i}\in \mathbb {R} ^{n}$ such that both $\|X-\mathbf {D} R\|_{F}^{2}$ is minimized and the representations $r_{i}$ are sparse enough. This can be formulated as a following optimization problem:

${\underset {\mathbf {D} \in {\mathcal {C}},r_{i}\in \mathbb {R} ^{n}}{\text{argmin}}}\sum _{i=1}^{k}\|x_{i}-\mathbf {D} r_{i}\|_{2}^{2}+\lambda \|r_{i}\|_{0}$ , where ${\mathcal {C}}\equiv \{\mathbb {D} \in \mathbb {R} ^{d\times n}:\|d_{i}\|_{2}\leq 1\,\,\forall i=1,...,n\}$

${\mathcal {C}}$ is required to constrain $\mathbf {D}$ so that its atoms would not reach arbitrarily high values allowing for arbitrarily low (but non-zero) values of $r_{i}$ .

The minimization problem above is not convex because of the ℓ₀-"norm" and solving this problem is NP-hard.^[1] In some cases L¹-norm is known to ensure sparsity^[2] and so the above becomes a convex optimization problem with respect to each of the variables $\mathbf {D}$ and $R$ in case the other one of them is currently fixed.

Properties of the dictionary

The dictionary $\mathbf {D}$ defined above can be "undercomplete" if $n<d$ or "overcomplete" in case $n>d$ with the latter being a typical assumption for a sparse dictionary learning problem. The case of a complete dictionary does not provide any improvement from a representational point of view and thus isn't considered.

Undercomplete dictionaries represent the setup in which the actual input data lies in a lower-dimensional space. This case is strongly related to dimensionality reduction and techniques like principal component analysis which require atoms $d_{1},...,d_{n}$ to be orthogonal. Orthogonal dictionaries are appealing from a computational point of view because they allow to compute the representation coefficients by calculating a scalar product between the input data and the atoms, though their main downside is limiting the choice of atoms.

Overcomplete dictionaries, however, do not require the atoms to be orthogonal thus allowing for more flexible dictionaries and richer data representations.

Algorithms

As the optimization problem described above can be solved with respect to either dictionary or sparse coding while the other one of the two is fixed, most of the algorithms are based on the idea of iteratively updating one and the other.

The problem of finding an optimal sparse coding $R$ with a given dictionary $\mathbf {D}$ is known as sparse approximation (or sometimes just sparse coding problem). There has been developed a number of algorithms to solve it (such as orthogonal matching pursuit and LASSO) which are incorporated into the algorithms described below.

Method of optimal directions (MOD)

The method of optimal directions (or MOD) was one of the first methods introduced to tackle the sparse dictionary learning problem.^[3] The core idea of it is to solve the minimization problem subject to the limited number of non-zero components of the representation vector:

$\min _{\mathbf {D} ,R}\{\|X-\mathbf {D} R\|_{F}^{2}\}\,\,{\text{s.t.}}\,\,\forall i\,\,\|r_{i}\|_{0}\leq T$

MOD alternates between getting the sparse coding using one of the methods mentioned above and updating the dictionary by computing the analytical solution of the problem given by $\mathbf {D} =XR^{+}$ where $R^{+}$ is a Moore-Penrose pseudoinverse. After this update $\mathbf {D}$ is renormalized to fit the constraints and the new sparse coding is obtained again. The process is repeated until convergence.

MOD has proved to be a very efficient method for low-dimensional input data $X$ requiring just a few iterations to converge. However, due to the high complexity of the inversion operation, computing the pseudoinverse in high-dimensional cases is in many cases intractable. This shortcoming has inspired the development of other dictionary learning methods.

K-SVD

K-SVD is an algorithm that performs SVD at its core to update the atoms of the dictionary one by one and basically is a generalization of K-means. It enforces that each element of the input data $x_{i}$ is encoded by a linear combination of not more than $T_{0}$ elements in a way identical to the MOD approach:

$\min _{\mathbf {D} ,R}\{\|X-\mathbf {D} R\|_{F}^{2}\}\,\,{\text{s.t.}}\,\,\forall i\,\,\|r_{i}\|_{0}\leq T_{0}$

This algorithm's essence is to first fix the dictionary, find the best possible $R$ under the above constraint (using OMP) and then iteratively update the atoms of dictionary $\mathbf {D}$ in the following manner:

$\|X-\mathbf {D} R\|_{F}^{2}=\left|X-\sum _{i=1}^{K}d_{i}x_{T}^{i}\right|_{F}^{2}=\|E_{k}-d_{k}x_{T}^{k}\|_{F}^{2}$

Next steps of the algorithm include rank-1 approximation of the residual matrix $E_{k}$ , updating $d_{k}$ and enforcing the sparsity of $x_{k}$ after the update. This algorithm is considered to be standard for dictionary learning and is used in a variety of applications. However, it shares weaknesses with MOD being efficient only for signals with relatively low dimensionality and having the possibility for a solution to be stuck at local minima.

Stochastic gradient descent

One can also apply a widespread stochastic gradient descent method with iterative projection to solve this problem.^[4]^[5] The idea of this method is to update the dictionary using the first order stochastic gradient and project it on the constraint set ${\mathcal {C}}$ . The step that occurs at i-th iteration is described by this expression:

$\mathbf {D} _{i}={\text{proj}}_{\mathcal {C}}\{\mathbf {D} _{i-1}-\delta _{i}\nabla _{\mathbf {D} }\sum _{i\in S}\|x_{i}-\mathbf {D} r_{i}\|_{2}^{2}+\lambda \|r_{i}\|_{1}\}$ , where $S$ is a random subset of $\{1...K\}$ and $\delta _{i}$ is a gradient step.

Lagrange dual method

An algorithm based on solving a dual Lagrangian problem provides an efficient way to solve for the dictionary having no complications induced by the sparsity function.^[6] Consider the following Lagrangian:

${\mathcal {L}}(\mathbf {D} ,\Lambda )={\text{trace}}\left((X-\mathbf {D} R)^{T}(X-\mathbf {D} R)\right)+\sum _{j=1}^{n}\lambda _{i}({\sum _{i=1}^{d}\mathbf {D} _{ij}^{2}-c})$ , where $c$ is a constraint on the norm of the atoms and $\lambda _{i}$ are the so-called dual variables forming the diagonal matrix $\Lambda$ .

We can then provide an analytical expression for the Lagrange dual after minimization over $\mathbf {D}$ :

${\mathcal {D}}(\Lambda )=\min _{\mathbf {D} }{\mathcal {L}}(\mathbf {D} ,\Lambda )={\text{trace}}(X^{T}X-XR^{T}(RR^{T}+\Lambda )^{-1}(XR^{T})^{T}-c\Lambda )$ .

After applying one of the optimization methods to the value of the dual (such as Newton's method or conjugate gradient) we get the value of $\mathbf {D}$ :

$\mathbf {D} ^{T}=(RR^{T}+\Lambda )^{-1}(XR^{T})^{T}$

Solving this problem is less computational hard because the amount of dual variables $n$ is a lot of times much less than the amount of variables in the primal problem.

Parametric training methods

Parametric training methods are aimed to incorporate the best of both worlds — the realm of analytically constructed dictionaries and the learned ones.^[7] This allows to construct more powerful generalized dictionaries that can potentially be applied to the cases of arbitraty-sized signals. Notable approaches include:

Translation-invariant dictionaries.^[8] These dictionaries are composed by the translations of the atoms originating from the dictionary constructed for a finite-size signal patch. This allows the resulting dictionary to provide a representation for the arbitrary-sized signal.
Multiscale dictionaries.^[9] This method focuses on constructing a dictionary that is composed of differently scaled dictionaries to improve sparsity.
Sparse dictionaries.^[10] This method focuses on not only providing a sparse representation but also constructing a sparse dictionary which is enforced by the expression $\mathbf {D} =\mathbf {B} \mathbf {A}$ where $\mathbf {B}$ is some pre-defined analytical dictionary with desirable properties such as fast computation and $\mathbf {A}$ is a sparse matrix. Such formulation allows to directly combine the fast implementation of analytical dictionaries with the flexibility of sparse approaches.

Online dictionary learning

Many common approaches to sparse dictionary learning rely on the fact that the whole input data $X$ is available for the algorithm. However, this might not be the case in the real-world scenario as the size of the input data might be too big to fit it into memory. The other case where this assumption can not be made is when the input data comes in a form of a stream. Such cases lie in the field of study of online learning which essentially suggests iteratively updating the model upon the new data points $x$ becoming available.

A dictionary can be learned in an online manner the following way:^[11]

For $t=1...T:$
Draw a new sample $x_{t}$
Find a sparse coding using LARS: $r_{t}={\underset {r\in \mathbb {R} ^{n}}{\text{argmin}}}\left({\frac {1}{2}}\|x_{t}-\mathbf {D_{t-1}r} \|+\lambda \|r\|_{1}\right)$
Update dictionary using block-coordinate approach: $\mathbf {D} _{t}={\underset {\mathbf {D} \in {\mathcal {C}}}{\text{argmin}}}{\frac {1}{t}}\sum _{i=1}^{t}\left({\frac {1}{2}}\|x_{i}-\mathbf {D} r_{i}\|_{2}^{2}+\lambda \|r_{i}\|_{1}\right)$

This method allows us to gradually update the dictionary as new data becomes available for sparse representation learning and helps drastically reduce the amount of memory needed to store the dataset (which often has a huge size).

Applications

The dictionary learning framework, namely the linear decomposition of an input signal using a few basis elements learned from data itself, has led to the state-of-art results in various image and video processing tasks. This technique can be applied to classification problems in a way that if we have built specific dictionaries for each class, the input signal can be classified by finding the dictionary corresponding to the sparsest representation.

It also has properties that are useful for signal denoising since usually one can learn a dictionary to represent the meaningful part of the input signal in a sparse way but the noise in the input will have a much less sparse representation.^[12]

Sparse dictionary learning has been successfully applied to various image, video and audio processing tasks as well as to texture synthesis^[13] and unsupervised clustering^[14]

References

^ A. M. Tillmann, "On the Computational Intractability of Exact and Approximate Dictionary Learning", IEEE Signal Processing Letters 22(1), 2015: 45–49.
^ Donoho, David L. (2006-06-01). "For most large underdetermined systems of linear equations the minimal 𝓁1-norm solution is also the sparsest solution". Communications on Pure and Applied Mathematics. 59 (6): 797–829. doi:10.1002/cpa.20132. ISSN 1097-0312.
^ Engan, K.; Aase, S.O.; Hakon Husoy, J. (1999-01-01). "Method of optimal directions for frame design". , 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999. Proceedings. 5: 2443–2446 vol.5. doi:10.1109/ICASSP.1999.760624.
^ Aharon, Michal; Elad, Michael. "Sparse and Redundant Modeling of Image Content Using an Image-Signature-Dictionary". SIAM Journal on Imaging Sciences. 1 (3): 228–247. doi:10.1137/07070156x.
^ Pintér, János D. (2000-01-01). "Yair Censor and Stavros A. Zenios, Parallel Optimization — Theory, Algorithms, and Applications. Oxford University Press, New York/Oxford, 1997, xxviii+539 pages. ISBN 0-19-510062-X (US $ 85.00)". Journal of Global Optimization. 16 (1): 107–108. doi:10.1023/A:1008311628080. ISSN 0925-5001.
^ Lee, Honglak, et al. "Efficient sparse coding algorithms." Advances in neural information processing systems. 2006.
^ Rubinstein, R.; Bruckstein, A.M.; Elad, M. (2010-06-01). "Dictionaries for Sparse Representation Modeling". Proceedings of the IEEE. 98 (6): 1045–1057. doi:10.1109/JPROC.2010.2040551. ISSN 0018-9219.
^ Engan, Kjersti; Skretting, Karl; Husøy, John H\a akon (2007-01-01). "Family of Iterative LS-based Dictionary Learning Algorithms, ILS-DLA, for Sparse Signal Representation". Digit. Signal Process. 17 (1): 32–49. doi:10.1016/j.dsp.2006.02.002. ISSN 1051-2004.
^ Mairal, J.; Sapiro, G.; Elad, M. (2008-01-01). "Learning Multiscale Sparse Representations for Image and Video Restoration". Multiscale Modeling & Simulation. 7 (1): 214–241. doi:10.1137/070697653. ISSN 1540-3459.
^ Rubinstein, R.; Zibulevsky, M.; Elad, M. (2010-03-01). "Double Sparsity: Learning Sparse Dictionaries for Sparse Signal Approximation". IEEE Transactions on Signal Processing. 58 (3): 1553–1564. doi:10.1109/TSP.2009.2036477. ISSN 1053-587X.
^ Mairal, Julien; Bach, Francis; Ponce, Jean; Sapiro, Guillermo (2010-03-01). "Online Learning for Matrix Factorization and Sparse Coding". J. Mach. Learn. Res. 11: 19–60. ISSN 1532-4435.
^ Aharon, M, M Elad, and A Bruckstein. 2006. "K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation." Signal Processing, IEEE Transactions on 54 (11): 4311-4322
^ Peyré, Gabriel (2008-11-06). "Sparse Modeling of Textures". Journal of Mathematical Imaging and Vision. 34 (1): 17–31. doi:10.1007/s10851-008-0120-3. ISSN 0924-9907.
^ Ramirez, Ignacio; Sprechmann, Pablo; Sapiro, Guillermo (2010-01-01). "Classification and clustering via dictionary learning with structured incoherence and shared features". 2014 IEEE Conference on Computer Vision and Pattern Recognition. 0. Los Alamitos, CA, USA: IEEE Computer Society: 3501–3508. doi:10.1109/CVPR.2010.5539964. ISBN 978-1-4244-6984-0.

[1] A. M. Tillmann, "On the Computational Intractability of Exact and Approximate Dictionary Learning", IEEE Signal Processing Letters 22(1), 2015: 45–49.

[2] Donoho, David L. (2006-06-01). "For most large underdetermined systems of linear equations the minimal 𝓁1-norm solution is also the sparsest solution". Communications on Pure and Applied Mathematics. 59 (6): 797–829. doi:10.1002/cpa.20132. ISSN 1097-0312.

[3] Engan, K.; Aase, S.O.; Hakon Husoy, J. (1999-01-01). "Method of optimal directions for frame design". , 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999. Proceedings. 5: 2443–2446 vol.5. doi:10.1109/ICASSP.1999.760624.

[4] Aharon, Michal; Elad, Michael. "Sparse and Redundant Modeling of Image Content Using an Image-Signature-Dictionary". SIAM Journal on Imaging Sciences. 1 (3): 228–247. doi:10.1137/07070156x.

[5] Pintér, János D. (2000-01-01). "Yair Censor and Stavros A. Zenios, Parallel Optimization — Theory, Algorithms, and Applications. Oxford University Press, New York/Oxford, 1997, xxviii+539 pages. ISBN 0-19-510062-X (US $ 85.00)". Journal of Global Optimization. 16 (1): 107–108. doi:10.1023/A:1008311628080. ISSN 0925-5001.

[6] Lee, Honglak, et al. "Efficient sparse coding algorithms." Advances in neural information processing systems. 2006.

[7] Rubinstein, R.; Bruckstein, A.M.; Elad, M. (2010-06-01). "Dictionaries for Sparse Representation Modeling". Proceedings of the IEEE. 98 (6): 1045–1057. doi:10.1109/JPROC.2010.2040551. ISSN 0018-9219.

[8] Engan, Kjersti; Skretting, Karl; Husøy, John H\a akon (2007-01-01). "Family of Iterative LS-based Dictionary Learning Algorithms, ILS-DLA, for Sparse Signal Representation". Digit. Signal Process. 17 (1): 32–49. doi:10.1016/j.dsp.2006.02.002. ISSN 1051-2004.

[9] Mairal, J.; Sapiro, G.; Elad, M. (2008-01-01). "Learning Multiscale Sparse Representations for Image and Video Restoration". Multiscale Modeling & Simulation. 7 (1): 214–241. doi:10.1137/070697653. ISSN 1540-3459.

[10] Rubinstein, R.; Zibulevsky, M.; Elad, M. (2010-03-01). "Double Sparsity: Learning Sparse Dictionaries for Sparse Signal Approximation". IEEE Transactions on Signal Processing. 58 (3): 1553–1564. doi:10.1109/TSP.2009.2036477. ISSN 1053-587X.

[11] Mairal, Julien; Bach, Francis; Ponce, Jean; Sapiro, Guillermo (2010-03-01). "Online Learning for Matrix Factorization and Sparse Coding". J. Mach. Learn. Res. 11: 19–60. ISSN 1532-4435.

[12] Aharon, M, M Elad, and A Bruckstein. 2006. "K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation." Signal Processing, IEEE Transactions on 54 (11): 4311-4322

[13] Peyré, Gabriel (2008-11-06). "Sparse Modeling of Textures". Journal of Mathematical Imaging and Vision. 34 (1): 17–31. doi:10.1007/s10851-008-0120-3. ISSN 0924-9907.

[14] Ramirez, Ignacio; Sprechmann, Pablo; Sapiro, Guillermo (2010-01-01). "Classification and clustering via dictionary learning with structured incoherence and shared features". 2014 IEEE Conference on Computer Vision and Pattern Recognition. 0. Los Alamitos, CA, USA: IEEE Computer Society: 3501–3508. doi:10.1109/CVPR.2010.5539964. ISBN 978-1-4244-6984-0.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]