Proximal gradient methods are a generalized form of projection used to solve non-differentiable convex optimization problems.

Many interesting problems can be formulated as convex optimization problems of the form

${\displaystyle \operatorname {min} \limits _{x\in \mathbb {R} ^{N}}\sum _{i=1}^{n}f_{i}(x)}$

where ${\displaystyle f_{i}:\mathbb {R} ^{N}\rightarrow \mathbb {R} ,\ i=1,\dots ,n}$ are convex functions, some of which may be non-differentiable. This rules out conventional smooth optimization techniques such as the steepest descent method and the conjugate gradient method, but proximal gradient methods can be used instead. These methods proceed by splitting, in that the functions ${\displaystyle f_{1},...,f_{n}}$ are used individually so as to yield an easily implementable algorithm. They are called proximal because each non-smooth function among ${\displaystyle f_{1},...,f_{n}}$ is involved via its proximity operator. The iterative shrinkage-thresholding algorithm,[1] projected Landweber, projected gradient, alternating projections, the alternating-direction method of multipliers, and alternating split Bregman are special instances of proximal algorithms.[2] For the theory of proximal gradient methods from the perspective of, and with applications to, statistical learning theory, see proximal gradient methods for learning.
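As a concrete illustration of the splitting idea, the sketch below runs the iterative shrinkage-thresholding algorithm (ISTA) on a small lasso problem ${\displaystyle \min _{x}{\tfrac {1}{2}}\|Ax-b\|_{2}^{2}+\lambda \|x\|_{1}}$: the smooth term is handled by a gradient step and the non-smooth ${\displaystyle \ell _{1}}$ term by its proximity operator. The matrix, vector, step size, and iteration count are illustrative assumptions, not part of the text above.

```python
import math

def soft_threshold(v, t):
    """Proximity operator of t*||.||_1, applied componentwise."""
    return [math.copysign(max(abs(vi) - t, 0.0), vi) for vi in v]

def ista(A, b, lam, step, iters=500):
    """Proximal gradient (ISTA) iteration for the lasso problem."""
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        # gradient of the smooth term: A^T (A x - b)
        r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
        g = [sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        # forward (gradient) step, then the proximal (shrinkage) step
        x = soft_threshold([x[j] - step * g[j] for j in range(n)], step * lam)
    return x

A = [[1.0, 0.0], [0.0, 1.0]]  # identity, so the minimizer is soft_threshold(b, lam)
b = [3.0, -0.5]
x = ista(A, b, lam=1.0, step=0.5)
print(x)  # close to [2.0, 0.0]
```

With ${\displaystyle A=I}$ the exact solution is the soft-thresholded vector, which makes the iteration easy to check by hand.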

## Notations and terminology

Let ${\displaystyle \mathbb {R} ^{N}}$, the ${\displaystyle N}$-dimensional Euclidean space, be the domain of the function ${\displaystyle f:\mathbb {R} ^{N}\rightarrow (-\infty ,+\infty ]}$. Suppose ${\displaystyle C}$ is a non-empty convex subset of ${\displaystyle \mathbb {R} ^{N}}$. Then, the indicator function of ${\displaystyle C}$ is defined as

${\displaystyle \iota _{C}:x\mapsto {\begin{cases}0&{\text{if }}x\in C\\+\infty &{\text{if }}x\notin C\end{cases}}}$

The ${\displaystyle p}$-norm, denoted ${\displaystyle \|\cdot \|_{p}}$, is defined as
${\displaystyle \|x\|_{p}=(|x_{1}|^{p}+|x_{2}|^{p}+\cdots +|x_{N}|^{p})^{1/p}}$
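The formula above can be checked directly; the sample vector and the values of ${\displaystyle p}$ below are arbitrary illustrative choices.

```python
def p_norm(x, p):
    """p-norm of a vector: (|x_1|^p + ... + |x_N|^p)^(1/p)."""
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

x = [3.0, -4.0]
print(p_norm(x, 1))  # 7.0 (sum of absolute values)
print(p_norm(x, 2))  # 5.0 (Euclidean norm)
```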

The distance from ${\displaystyle x\in \mathbb {R} ^{N}}$ to ${\displaystyle C}$ is defined as

${\displaystyle D_{C}(x)=\min _{y\in C}\|x-y\|_{2}}$

If ${\displaystyle C}$ is closed and convex, the projection of ${\displaystyle x\in \mathbb {R} ^{N}}$ onto ${\displaystyle C}$ is the unique point ${\displaystyle P_{C}x\in C}$ such that ${\displaystyle D_{C}(x)=\|x-P_{C}x\|_{2}}$.
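For one concrete closed convex set these quantities have closed forms: for the Euclidean ball of radius ${\displaystyle r}$ centred at the origin, the projection rescales any outside point onto the boundary. The radius and test point below are illustrative assumptions.

```python
import math

def project_ball(x, r):
    """Projection P_C onto the closed ball C = {y : ||y||_2 <= r}."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    if norm <= r:
        return list(x)                    # x is already in C
    return [r * xi / norm for xi in x]    # scale back onto the boundary

def distance_to_ball(x, r):
    """Distance D_C(x) = ||x - P_C(x)||_2."""
    p = project_ball(x, r)
    return math.sqrt(sum((xi - pi) ** 2 for xi, pi in zip(x, p)))

x = [3.0, 4.0]                    # ||x||_2 = 5
print(project_ball(x, 1.0))       # [0.6, 0.8]
print(distance_to_ball(x, 1.0))   # approximately 4.0
```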

The subdifferential of ${\displaystyle f}$ at ${\displaystyle x}$ is given by

${\displaystyle \partial f(x)=\{u\in \mathbb {R} ^{N}\mid \forall y\in \mathbb {R} ^{N},\ (y-x)^{\mathrm {T} }u+f(x)\leq f(y)\}}$

## Projection onto convex sets (POCS)

One of the widely used convex optimization algorithms is projection onto convex sets (POCS). This algorithm is employed to recover or synthesize a signal that simultaneously satisfies several convex constraints. Let ${\displaystyle f_{i}}$ be the indicator function of a non-empty closed convex set ${\displaystyle C_{i}}$ modeling a constraint. The problem then reduces to a convex feasibility problem, which requires finding a solution lying in the intersection of all the convex sets ${\displaystyle C_{i}}$. In the POCS method each set ${\displaystyle C_{i}}$ is incorporated via its projection operator ${\displaystyle P_{C_{i}}}$, so in each iteration ${\displaystyle x}$ is updated as

${\displaystyle x_{k+1}=P_{C_{1}}P_{C_{2}}\cdots P_{C_{n}}x_{k}}$
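A toy POCS run with two convex sets makes the update concrete: take the closed unit ball ${\displaystyle C_{1}}$ and the half-space ${\displaystyle C_{2}=\{x:x_{1}\geq 0.5\}}$. Both sets and the starting point are illustrative assumptions; each iteration applies the two projections in turn, exactly as in the update above.

```python
import math

def proj_ball(x):                       # P_{C1}: closed unit ball
    n = math.sqrt(sum(v * v for v in x))
    return list(x) if n <= 1.0 else [v / n for v in x]

def proj_halfspace(x):                  # P_{C2}: {x : x[0] >= 0.5}
    return [max(x[0], 0.5)] + list(x[1:])

x = [2.0, 2.0]
for _ in range(100):
    x = proj_ball(proj_halfspace(x))    # x_{k+1} = P_{C1} P_{C2} x_k

print(x)  # a point (approximately) in the intersection C1 ∩ C2
```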

For problems beyond convex feasibility, however, projection operators are no longer appropriate, and more general operators are required. Among the various generalizations of the convex projection operator, proximity operators are the best suited for this purpose.

## Definition

The proximity operator of a convex function ${\displaystyle f}$ at ${\displaystyle x}$ is defined as the unique minimizer

${\displaystyle \operatorname {argmin} \limits _{y}{\bigg (}f(y)+{\frac {1}{2}}\left\|x-y\right\|_{2}^{2}{\bigg )}}$

and is denoted ${\displaystyle \operatorname {prox} _{f}(x)}$.

${\displaystyle \operatorname {prox} _{f}:\mathbb {R} ^{N}\rightarrow \mathbb {R} ^{N}}$

Note that in the specific case where ${\displaystyle f}$ is the indicator function ${\displaystyle \iota _{C}}$ of some convex set ${\displaystyle C}$

${\displaystyle {\begin{aligned}\operatorname {prox} _{\iota _{C}}(x)&=\operatorname {argmin} \limits _{y}{\begin{cases}{\frac {1}{2}}\left\|x-y\right\|_{2}^{2}&{\text{if }}y\in C\\+\infty &{\text{if }}y\notin C\end{cases}}\\&=\operatorname {argmin} \limits _{y\in C}{\frac {1}{2}}\left\|x-y\right\|_{2}^{2}\\&=P_{C}(x)\end{aligned}}}$

showing that the proximity operator is indeed a generalisation of the projection operator.
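Beyond indicator functions, many proximity operators have closed forms. A classical example (an assumption here, not stated above) is ${\displaystyle f(y)=\lambda |y|}$ in one dimension, whose proximity operator is the soft-thresholding map. The brute-force grid search below numerically confirms the closed form against the defining minimization for one input.

```python
def prox_abs(x, lam):
    """Closed-form prox of f(y) = lam*|y|: shrink x toward 0 by lam."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def prox_by_search(x, lam, lo=-10.0, hi=10.0, steps=200001):
    """Minimize lam*|y| + 0.5*(x - y)**2 over a fine grid."""
    best_y, best_v = lo, float("inf")
    for k in range(steps):
        y = lo + (hi - lo) * k / (steps - 1)
        v = lam * abs(y) + 0.5 * (x - y) ** 2
        if v < best_v:
            best_y, best_v = y, v
    return best_y

print(prox_abs(3.0, 1.0))        # 2.0
print(prox_by_search(3.0, 1.0))  # 2.0, up to grid resolution
```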

The proximity operator of ${\displaystyle f}$ is characterized by the inclusion

${\displaystyle p=\operatorname {prox} _{f}(x)\Leftrightarrow x-p\in \partial f(p)\qquad (\forall (x,p)\in \mathbb {R} ^{N}\times \mathbb {R} ^{N})}$

If ${\displaystyle f}$ is differentiable, then the above inclusion reduces to

${\displaystyle p=\operatorname {prox} _{f}(x)\Leftrightarrow x-p=\nabla f(p)\quad (\forall (x,p)\in \mathbb {R} ^{N}\times \mathbb {R} ^{N})}$
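For a differentiable example, take ${\displaystyle f(y)={\tfrac {1}{2}}ay^{2}}$ in one dimension (an illustrative choice): since ${\displaystyle \nabla f(p)=ap}$, the condition ${\displaystyle x-p=\nabla f(p)}$ yields the closed form ${\displaystyle \operatorname {prox} _{f}(x)=x/(1+a)}$, which the snippet below verifies numerically.

```python
def prox_quadratic(x, a):
    """Prox of f(y) = 0.5*a*y**2, from solving x - p = a*p."""
    return x / (1.0 + a)

a, x = 3.0, 8.0
p = prox_quadratic(x, a)
print(p)             # 2.0
print(x - p, a * p)  # both sides of x - p = grad f(p): 6.0 6.0
```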

## Examples

Special instances of proximal gradient methods include the projected Landweber method, the projected gradient method, alternating projections, the alternating-direction method of multipliers (ADMM), the alternating split Bregman method, and the iterative shrinkage-thresholding algorithm (ISTA).