Universal approximation theorem

In the mathematical theory of artificial neural networks, the universal approximation theorem states[1] that a feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of ${\displaystyle \mathbb {R} ^{n}}$, under mild assumptions on the activation function. The theorem thus states that simple neural networks can represent a wide variety of interesting functions when given appropriate parameters; however, it does not touch upon the algorithmic learnability of those parameters.

One of the first versions of the theorem was proved by George Cybenko in 1989 for sigmoid activation functions.[2] It was later shown in [3] that the class of deep neural networks is a universal approximator if and only if the activation function is not polynomial.

Kurt Hornik showed in 1991[4] that it is not the specific choice of the activation function, but rather the multilayer feedforward architecture itself which gives neural networks the potential of being universal approximators. The output units are always assumed to be linear.

Although feed-forward networks with a single hidden layer are universal approximators, the width of such networks may have to be exponentially large. In 2017 Lu et al.[5] proved a universal approximation theorem for width-bounded deep neural networks. In particular, they showed that networks of width n+4 with ReLU activation functions can approximate any Lebesgue-integrable function on n-dimensional input space with respect to ${\displaystyle L^{1}}$ distance if the network depth is allowed to grow. They also showed that expressive power is limited when the width is less than or equal to n: all Lebesgue-integrable functions, except for a set of measure zero, cannot be approximated by width-n ReLU networks.

A later refinement of the result in [5] showed that ReLU networks of width n+1 are sufficient to approximate any continuous function of n-dimensional input variables.[6]

Formal statement

The universal approximation theorem can be expressed mathematically:[2][4][7][8]

Unbounded Width Case

Let ${\displaystyle \varphi :\mathbb {R} \to \mathbb {R} }$ be a nonconstant, bounded, and continuous function (called the activation function). Let ${\displaystyle I_{m}}$ denote the m-dimensional unit hypercube ${\displaystyle [0,1]^{m}}$. The space of real-valued continuous functions on ${\displaystyle I_{m}}$ is denoted by ${\displaystyle C(I_{m})}$. Then, given any ${\displaystyle \varepsilon >0}$ and any function ${\displaystyle f\in C(I_{m})}$, there exist an integer ${\displaystyle N}$, real constants ${\displaystyle v_{i},b_{i}\in \mathbb {R} }$ and real vectors ${\displaystyle w_{i}\in \mathbb {R} ^{m}}$ for ${\displaystyle i=1,\ldots ,N}$, such that we may define:

${\displaystyle F(x)=\sum _{i=1}^{N}v_{i}\varphi \left(w_{i}^{T}x+b_{i}\right)}$

as an approximate realization of the function ${\displaystyle f}$; that is,

${\displaystyle |F(x)-f(x)|<\varepsilon }$

for all ${\displaystyle x\in I_{m}}$. In other words, functions of the form ${\displaystyle F(x)}$ are dense in ${\displaystyle C(I_{m})}$.

This still holds when replacing ${\displaystyle I_{m}}$ with any compact subset of ${\displaystyle \mathbb {R} ^{m}}$.
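The form of ${\displaystyle F(x)}$ can be illustrated numerically. The following is a minimal sketch, not part of the theorem or its proof: it assumes ${\displaystyle m=1}$, takes ${\displaystyle \varphi =\tanh }$ (which is nonconstant, bounded, and continuous), draws the ${\displaystyle w_{i}}$ and ${\displaystyle b_{i}}$ at random, and chooses the ${\displaystyle v_{i}}$ by least squares. The target function, the sampling ranges, and the helper names `target`, `F` and `sup_error` are illustrative assumptions. The theorem only asserts that suitable parameters exist; it does not say that this procedure finds them.

```python
import numpy as np

def target(X):
    # f(x) = sin(2*pi*x) on I_1 = [0, 1]; an arbitrary continuous test function.
    return np.sin(2.0 * np.pi * X[:, 0])

def F(X, v, W, b):
    # F(x) = sum_i v_i * phi(w_i^T x + b_i) with phi = tanh.
    # X has shape (n_points, m), W has shape (N, m), b and v have shape (N,).
    return np.tanh(X @ W.T + b) @ v

def sup_error(n_hidden, seed=0):
    # Approximate sup-norm error of a network with N = n_hidden neurons.
    rng = np.random.default_rng(seed)
    m = 1
    # Assumption: random hidden-layer parameters (a heuristic, not the theorem).
    W = rng.normal(0.0, 5.0, size=(n_hidden, m))
    b = rng.uniform(-5.0, 5.0, size=n_hidden)
    # Choose the output weights v_i by least squares on a grid in [0, 1].
    X_train = np.linspace(0.0, 1.0, 200).reshape(-1, m)
    H = np.tanh(X_train @ W.T + b)
    v, *_ = np.linalg.lstsq(H, target(X_train), rcond=None)
    # Measure max |F(x) - f(x)| on a finer grid as a proxy for the supremum.
    X_test = np.linspace(0.0, 1.0, 1001).reshape(-1, m)
    return float(np.max(np.abs(F(X_test, v, W, b) - target(X_test))))
```

Calling `sup_error(2)` and `sup_error(50)` compares a very small network with a larger one; in this setup the larger network attains a much smaller error on the grid, in line with the statement that ${\displaystyle N}$ may have to grow as ${\displaystyle \varepsilon }$ shrinks.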

Bounded Width Case

The universal approximation theorem for width-bounded networks can be expressed mathematically as follows:[5]

For any Lebesgue-integrable function ${\displaystyle f:\mathbb {R} ^{n}\rightarrow \mathbb {R} }$ and any ${\displaystyle \epsilon >0}$, there exists a fully-connected ReLU network ${\displaystyle {\mathcal {A}}}$ with width ${\displaystyle d_{m}\leq {n+4}}$, such that the function ${\displaystyle F_{\mathcal {A}}}$ represented by this network satisfies

${\displaystyle \int _{\mathbb {R} ^{n}}\left|f(x)-F_{\mathcal {A}}(x)\right|\mathrm {d} x<\epsilon }$
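A toy instance of ${\displaystyle L^{1}}$ approximation by a narrow ReLU network can be sketched as follows. This is not the construction used in [5]; it is a hand-built example under stated assumptions: ${\displaystyle n=1}$, the target ${\displaystyle f}$ is the indicator function of ${\displaystyle [0,1]}$ (integrable but discontinuous), and the network is a trapezoid with ramps of length `delta` built from four ReLU units, so its width is ${\displaystyle 4\leq n+4}$. In this simple case one hidden layer happens to suffice, whereas the theorem in general requires the depth to grow. The names `trapezoid_net`, `indicator` and `l1_distance` are hypothetical, and the integral is estimated with a midpoint Riemann sum on ${\displaystyle [-1,2]}$, outside of which both functions vanish for `delta` below 1.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def trapezoid_net(x, delta):
    # One hidden layer with 4 ReLU units (width 4 <= n + 4 = 5 for n = 1).
    # Output: 0 outside [-delta, 1 + delta], 1 on [0, 1], linear ramps between.
    w = np.ones(4)
    b = np.array([delta, 0.0, -1.0, -1.0 - delta])
    v = np.array([1.0, -1.0, -1.0, 1.0]) / delta
    hidden = relu(np.outer(x, w) + b)  # shape (len(x), 4)
    return hidden @ v

def indicator(x):
    # Target f: indicator function of the interval [0, 1].
    return ((x >= 0.0) & (x <= 1.0)).astype(float)

def l1_distance(delta, lo=-1.0, hi=2.0, n_cells=300_000):
    # Midpoint Riemann sum for the integral of |f(x) - F(x)| over [lo, hi].
    h = (hi - lo) / n_cells
    x = lo + h * (np.arange(n_cells) + 0.5)
    return float(np.sum(np.abs(indicator(x) - trapezoid_net(x, delta))) * h)
```

The exact ${\displaystyle L^{1}}$ error of this trapezoid is the area of its two ramps, which equals `delta`, so any ${\displaystyle \epsilon >0}$ can be met by choosing `delta` smaller than ${\displaystyle \epsilon }$; `l1_distance(delta)` estimates the same quantity numerically.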