# Universal approximation theorem

In the mathematical theory of artificial neural networks, universal approximation theorems are results that establish the density of an algorithmically generated class of functions within a given function space of interest. Typically, these results concern the approximation capabilities of the feedforward architecture on the space of continuous functions between two Euclidean spaces, and the approximation is with respect to the compact convergence topology. However, there are also results for non-Euclidean spaces and for other commonly used architectures and, more generally, algorithmically generated sets of functions, such as the convolutional neural network (CNN) architecture, radial basis functions, or neural networks with specific properties. Most universal approximation theorems fall into two classes. The first quantifies the approximation capabilities of neural networks with an arbitrary number of artificial neurons ("arbitrary width" case), and the second focuses on the case with an arbitrary number of hidden layers, each containing a limited number of artificial neurons ("arbitrary depth" case).

Universal approximation theorems imply that neural networks can represent a wide variety of interesting functions when given appropriate weights. On the other hand, they typically do not provide a construction for the weights, but merely state that such a construction is possible.

## History

One of the first versions of the arbitrary width case was proved by George Cybenko in 1989 for sigmoid activation functions. Kurt Hornik showed in 1991 that it is not the specific choice of the activation function, but rather the multilayer feed-forward architecture itself, that gives neural networks the potential to be universal approximators. Moshe Leshno et al. in 1993 and later Allan Pinkus in 1999 showed that the universal approximation property is equivalent to having a nonpolynomial activation function.

The arbitrary depth case was also studied by a number of authors, such as Zhou Lu et al. in 2017, Boris Hanin and Mark Sellke in 2018, and Patrick Kidger and Terry Lyons in 2020. The result on the minimal width per layer was subsequently refined.

Several extensions of the theorem exist, such as to discontinuous activation functions, noncompact domains, certifiable networks, and alternative network architectures and topologies. A full characterization of the universal approximation property on general function spaces has been given by A. Kratsios.

## Arbitrary Width Case

The classical form of the universal approximation theorem for arbitrary width and bounded depth is as follows. It extends the classical results of George Cybenko and Kurt Hornik.

Universal Approximation Theorem: Fix a continuous function $\sigma :\mathbb {R} \rightarrow \mathbb {R}$ (activation function) and positive integers $d,D$. The function $\sigma$ is not a polynomial if and only if, for every continuous function $f:\mathbb {R} ^{d}\to \mathbb {R} ^{D}$ (target function), every compact subset $K$ of $\mathbb {R} ^{d}$, and every $\epsilon >0$, there exists a continuous function $f_{\epsilon }:\mathbb {R} ^{d}\to \mathbb {R} ^{D}$ (the network output) with representation

$f_{\epsilon }=W_{2}\circ \sigma \circ W_{1},$ where $W_{2},W_{1}$ are composable affine maps and $\sigma$ is applied component-wise, such that the approximation bound

$\sup _{x\in K}\,\|f(x)-f_{\epsilon }(x)\|<\epsilon$ holds. Since $\epsilon$ may be chosen arbitrarily small, the distance from $f$ to $f_{\epsilon }$ on $K$ can be made as small as desired.

The theorem states that the output $f_{\epsilon }$ of a network with a single hidden layer can approximate any continuous function $f$ uniformly on a compact set. Such a function can also be approximated by a network of greater depth by using the same construction for the first layer and approximating the identity function with later layers.
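As an illustration of the form $f_{\epsilon }=W_{2}\circ \sigma \circ W_{1}$, the following minimal sketch builds a one-hidden-layer network for $d=D=1$, using ReLU as the non-polynomial activation. It is only one possible construction (piecewise-linear interpolation on a uniform grid) and is not the argument used in the proofs of the theorem. The helper names `build_shallow_relu_net`, `evaluate`, and `sup_error` are hypothetical, and the supremum over $K=[a,b]$ is only estimated on a finite sample grid.

```python
import math


def relu(z):
    """ReLU activation, a continuous non-polynomial function."""
    return max(0.0, z)


def build_shallow_relu_net(f, a, b, width):
    """Return (w1, b1, w2, b2) for a one-hidden-layer ReLU network that
    interpolates f at width + 1 equally spaced knots on [a, b].

    Assumption: this interpolation scheme is an illustrative choice only.
    """
    knots = [a + (b - a) * i / width for i in range(width + 1)]
    slopes = [
        (f(knots[i + 1]) - f(knots[i])) / (knots[i + 1] - knots[i])
        for i in range(width)
    ]
    # Affine map W1: x -> (x - t_i) for each hidden neuron i.
    w1 = [1.0] * width
    b1 = [-knots[i] for i in range(width)]
    # Affine map W2: h -> f(a) + sum_i c_i * h_i, where c_i is the change
    # of slope at knot t_i.
    w2 = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, width)]
    b2 = f(a)
    return w1, b1, w2, b2


def evaluate(net, x):
    """Compute W2(sigma(W1(x))) with sigma applied component-wise."""
    w1, b1, w2, b2 = net
    hidden = [relu(w * x + b) for w, b in zip(w1, b1)]
    return b2 + sum(c * h for c, h in zip(w2, hidden))


def sup_error(f, net, a, b, samples=2001):
    """Estimate sup over [a, b] of |f(x) - net(x)| on a uniform sample grid."""
    xs = [a + (b - a) * k / (samples - 1) for k in range(samples)]
    return max(abs(f(x) - evaluate(net, x)) for x in xs)
```

For example, with `math.sin` on $[0,\pi ]$, increasing `width` from 10 to 100 reduces the estimated sup-error, in line with the arbitrary width statement: a smaller $\epsilon$ is met by adding hidden neurons, not layers.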

## Arbitrary Depth Case

The 'dual' versions of the theorem consider networks of bounded width and arbitrary depth. A variant of the universal approximation theorem for the arbitrary depth case was proved by Zhou Lu et al. in 2017. They showed that networks of width $n+4$ with ReLU activation functions can approximate any Lebesgue integrable function on an $n$-dimensional input space with respect to the $L^{1}$ distance if the network depth is allowed to grow. They also showed that expressive power is limited if the width is less than or equal to $n$: all Lebesgue integrable functions, except for a set of measure zero, cannot be approximated by ReLU networks of width $n$. In the same paper it was shown that ReLU networks of width $n+1$ are sufficient to approximate any continuous function of $n$-dimensional input variables. The following refinement specifies the optimal minimum width for which such an approximation is possible.

Universal Approximation Theorem (L1 distance, ReLU activation, arbitrary depth, minimal width). For any Bochner-Lebesgue p-integrable function $f:\mathbb {R} ^{n}\rightarrow \mathbb {R} ^{m}$ and any $\epsilon >0$, there exists a fully-connected ReLU network $F$ of width exactly $d_{m}=\max\{{n+1},m\}$, satisfying

$\int _{\mathbb {R} ^{n}}\left\|f(x)-F(x)\right\|^{p}\mathrm {d} x<\epsilon$.
Moreover, there exists a function $f\in L^{p}(\mathbb {R} ^{n},\mathbb {R} ^{m})$ and some $\epsilon >0$ for which there is no fully-connected ReLU network of width less than $d_{m}=\max\{{n+1},m\}$ satisfying the above approximation bound.
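To give a feel for what bounded width and growing depth can express, the following minimal sketch builds a ReLU network with $n=m=1$ whose hidden layers all have width $2=\max\{n+1,m\}$. It composes the tent map $T(x)=2\,\mathrm {ReLU} (x)-4\,\mathrm {ReLU} (x-0.5)$ with itself, so the number of linear pieces on $[0,1]$ doubles with each added layer while the width stays fixed. This is an illustrative assumption only, not the construction used to prove the theorem, and the names `narrow_deep_net` and `count_peaks` are hypothetical.

```python
def relu(z):
    """ReLU activation."""
    return max(0.0, z)


def narrow_deep_net(x, depth):
    """Evaluate a ReLU network with `depth` hidden layers of width 2.

    On [0, 1] the network equals the tent map composed with itself
    `depth` times (depth >= 1).
    """
    # Input layer to first hidden layer: two neurons.
    h = (relu(x), relu(x - 0.5))
    for _ in range(depth - 1):
        # Affine map between consecutive hidden layers, then ReLU.
        t = 2.0 * h[0] - 4.0 * h[1]
        h = (relu(t), relu(t - 0.5))
    # Affine output layer with identity activation.
    return 2.0 * h[0] - 4.0 * h[1]


def count_peaks(depth):
    """Count dyadic grid points j / 2**depth in [0, 1] where the output is 1."""
    n = 2 ** depth
    return sum(1 for j in range(n + 1) if narrow_deep_net(j / n, depth) == 1.0)
```

With `depth` hidden layers the output is a sawtooth with `2 ** (depth - 1)` peaks on $[0,1]$. A single hidden layer would need a number of neurons growing with the number of linear pieces to represent the same function, whereas here only the depth grows.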

Combining the central results of earlier works yields the following general universal approximation theorem for networks with bounded width, between general input and output spaces.

Universal Approximation Theorem (non-affine activation, arbitrary depth, Non-Euclidean). Let ${\mathcal {X}}$ be a compact topological space, $({\mathcal {Y}},d_{\mathcal {Y}})$ be a metric space, $\phi :{\mathcal {X}}\rightarrow \mathbb {R} ^{n}$ be a continuous and injective feature map and let $\rho :\mathbb {R} ^{m}\rightarrow {\mathcal {Y}}$ be a continuous readout map, with a section, having dense image $Im(\rho )$ with (possibly empty) collared boundary. Let $\sigma :\mathbb {R} \to \mathbb {R}$ be any non-affine continuous function which is continuously differentiable at at least one point, with non-zero derivative at that point. Let ${\mathcal {N}}_{\phi ,\rho }^{\sigma }$ denote the space of feed-forward neural networks with $n$ input neurons, $m$ output neurons, and an arbitrary number of hidden layers each with $n+m+2$ neurons, such that every hidden neuron has activation function $\sigma$ and every output neuron has the identity as its activation function, with input layer $\phi$ and output layer $\rho$. Then given any $\varepsilon >0$ and any $f\in C({\mathcal {X}},{\mathcal {Y}})$, there exists $F\in {\mathcal {N}}_{\phi ,\rho }^{\sigma }$ such that

$\sup _{x\in {\mathcal {X}}}\,d_{\mathcal {Y}}(F(x),f(x))<\varepsilon .$ In other words, ${\mathcal {N}}_{\phi ,\rho }^{\sigma }$ is dense in $C({\mathcal {X}},{\mathcal {Y}})$ with respect to the uniform distance.

Certain necessary conditions for the bounded width, arbitrary depth case have been established, but there is still a gap between the known sufficient and necessary conditions.