# User:Kfriston/sandbox

Generalized filtering is a generic Bayesian filtering scheme for nonlinear state-space models [1]. It is based on a variational principle of least action, formulated in generalized coordinates of motion [2]. Generalized filtering furnishes posterior densities over hidden states (and parameters) generating observed data using a generalized gradient descent on variational free energy, under the Laplace assumption. Unlike classical (e.g., Kalman-Bucy or particle) filtering, generalized filtering eschews Markovian assumptions about random fluctuations. Furthermore, it operates online, assimilating data to approximate the posterior density over unknown quantities, without the need for a backward pass. Special cases include variational filtering, dynamic expectation maximisation and generalised predictive coding.

## Definition

Definition: Generalized filtering rests on the tuple ${\displaystyle (\Omega ,X,S,U,p,q)}$:

• A sample space ${\displaystyle \Omega }$ from which random fluctuations ${\displaystyle \omega \in \Omega }$ are drawn
• Hidden states ${\displaystyle X:X\times U\times \Omega \to \mathbb {R} }$ – that cause sensor states and depend on control states
• Sensor states ${\displaystyle S:X\times U\times \Omega \to \mathbb {R} }$ – a probabilistic mapping from hidden and control states
• Control states ${\displaystyle U\in \mathbb {R} }$ – that act as external causes, input or forcing terms
• Generative density ${\displaystyle p({\tilde {s}},{\tilde {x}},{\tilde {u}}|m)}$ – over sensory, hidden and control states under a generative model
• Variational density ${\displaystyle q({\tilde {x}},{\tilde {u}}|m)}$ – over hidden and control states with mean ${\displaystyle {\tilde {\mu }}\in \mathbb {R} }$

Here ~ denotes a variable in generalised coordinates of motion: ${\displaystyle {\tilde {u}}=[u,u',u'',...]^{T}}$

### Generalised filtering

The objective is to approximate the posterior density over hidden and control states, given sensor states and a generative model – and estimate the (path integral of) model evidence ${\displaystyle p({\tilde {s}}(t)\vert m)}$ to compare different models. This generally involves an intractable marginalization over hidden states, so model evidence (or marginal likelihood) is replaced with a variational free energy bound [3]:

${\displaystyle {\tilde {\mu }}(t)={\underset {\tilde {\mu }}{\operatorname {arg\,min} }}\{F({\tilde {s}}(t),{\tilde {\mu }})\}}$

${\displaystyle F(t)=E_{q}[G({\tilde {s}},{\tilde {x}},{\tilde {u}})]-H[q({\tilde {x}},{\tilde {u}}\vert {\tilde {\mu }})]=-\ln p({\tilde {s}}\vert m)+D_{KL}[q({\tilde {x}},{\tilde {u}}\vert {\tilde {\mu }})\vert \vert p({\tilde {x}},{\tilde {u}}\vert {\tilde {s}},m)]}$

${\displaystyle G(t)=-\ln p({\tilde {s}},{\tilde {x}},{\tilde {u}}\vert m)}$

The second equality shows that minimizing variational free energy (i) minimizes the Kullback-Leibler divergence between the variational and true posterior density and (ii) renders the variational free energy an upper bound on (and approximation to) the negative log evidence (because the divergence can never be less than zero) [4]. Under the Laplace assumption ${\displaystyle q({\tilde {x}},{\tilde {u}}\vert {\tilde {\mu }})={\mathcal {N}}({\tilde {\mu }},C)}$ the variational density is Gaussian and the precision that minimizes free energy is ${\displaystyle C^{-1}=\Pi =\partial _{{\tilde {\mu }}{\tilde {\mu }}}G({\tilde {\mu }})}$. This means that free energy can be expressed in terms of the variational mean [5] (omitting constants):

${\displaystyle F=G({\tilde {\mu }})+\textstyle {1 \over 2}\ln \vert \partial _{{\tilde {\mu }}{\tilde {\mu }}}G({\tilde {\mu }})\vert }$
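As a worked illustration of this expression, the following minimal sketch evaluates the Laplace free energy for a scalar toy model that is not part of the source: a single observation $s=x+\omega$ with noise precision `pi_s` and a Gaussian prior on $x$ with mean `eta` and precision `pi_x`. All function and variable names are hypothetical. Because this toy model is linear and Gaussian, the Laplace assumption is exact here and the minimiser is the familiar precision-weighted average.

```python
import math

def laplace_free_energy(s, mu, eta, pi_s, pi_x):
    """Laplace free energy (constants omitted) for the assumed toy model
    s = x + noise (precision pi_s) with prior x ~ N(eta, 1 / pi_x)."""
    # G(mu) = -ln p(s, mu | m), up to additive constants
    G = 0.5 * pi_s * (s - mu) ** 2 + 0.5 * pi_x * (mu - eta) ** 2
    # Curvature of G with respect to mu, i.e. the posterior precision
    curvature = pi_s + pi_x
    # F = G(mu) + 1/2 ln |d^2 G / d mu^2|
    return G + 0.5 * math.log(curvature)

def posterior_mean(s, eta, pi_s, pi_x):
    """Minimiser of F for this toy model: a precision-weighted average."""
    return (pi_s * s + pi_x * eta) / (pi_s + pi_x)

s, eta, pi_s, pi_x = 1.0, 0.0, 4.0, 1.0
mu_star = posterior_mean(s, eta, pi_s, pi_x)
F_star = laplace_free_energy(s, mu_star, eta, pi_s, pi_x)
print(mu_star, F_star)
```

In this example the curvature does not depend on the mean, so the log-determinant term shifts the free energy without moving its minimum; in nonlinear models the curvature generally varies with the mean.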

The variational means that minimize the (path integral of) free energy can now be recovered by solving the generalized filter:

${\displaystyle {\dot {\tilde {\mu }}}=D{\tilde {\mu }}-\partial _{\tilde {\mu }}F({\tilde {s}},{\tilde {\mu }})}$

where ${\displaystyle D}$ is a block matrix derivative operator of identity matrices such that ${\displaystyle D{\tilde {u}}=[{u}',{u}'',...]^{T}}$
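To make the derivative operator concrete, the sketch below builds $D$ for a generalised vector truncated at a finite number of orders and applies it to the generalised coordinates of $u(t)=\sin(t)$. The truncation (the highest order is mapped to zero), the test signal and the helper names are assumptions chosen for illustration.

```python
import math

def derivative_operator(n_orders, n_states=1):
    """Block shift matrix D as nested lists: identity blocks on the first
    block superdiagonal, so D maps [u, u', u''] to [u', u'', 0]."""
    size = n_orders * n_states
    D = [[0.0] * size for _ in range(size)]
    for row in range(size - n_states):
        D[row][row + n_states] = 1.0
    return D

def matvec(A, v):
    """Plain matrix-vector product for nested lists."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

# Generalised coordinates of u(t) = sin(t) at t = 0.3, truncated at order 3
t = 0.3
u_tilde = [math.sin(t), math.cos(t), -math.sin(t), -math.cos(t)]
D = derivative_operator(n_orders=4)
Du = matvec(D, u_tilde)
print(Du)  # each entry is replaced by the next higher order of motion
```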

### Variational basis

Generalised filtering is based on the following lemma: The self-consistent solution to ${\displaystyle {\dot {\tilde {\mu }}}=D{\tilde {\mu }}-\partial _{\tilde {\mu }}F({\tilde {s}},{\tilde {\mu }})}$ satisfies the variational principle of stationary action, where action is the path integral of variational free energy

${\displaystyle S=\int {dt}F({\tilde {s}}(t),{\tilde {\mu }}(t))}$

Proof: self-consistency requires the motion of the mean to be the mean of the motion and (by the fundamental lemma of variational calculus)

${\displaystyle {\dot {\tilde {\mu }}}=D{\tilde {\mu }}\Leftrightarrow \partial _{\tilde {\mu }}F({\tilde {s}},{\tilde {\mu }})=0\Leftrightarrow \delta _{\tilde {\mu }}S=0}$

Put simply, small perturbations to the path of the mean do not change variational free energy and it has the least action of all possible (local) paths.

Remarks: Heuristically, generalised filtering performs a gradient descent on variational free energy in a moving frame of reference: ${\displaystyle {\dot {\tilde {\mu }}}-D{\tilde {\mu }}=-\partial _{\tilde {\mu }}F({\tilde {s}},{\tilde {\mu }})}$, where the frame itself minimises variational free energy. For a related example in statistical physics, see Kerr and Graham [6], who use ensemble dynamics in generalised coordinates to provide a generalised phase-space version of Langevin and associated Fokker-Planck equations.

In practice, generalised filtering uses local linearization [7] over intervals ${\displaystyle \Delta t}$ to recover discrete updates

${\displaystyle {\begin{aligned}\Delta {\tilde {\mu }}&=(\exp(\Delta t\cdot J)-I)J^{-1}{\dot {\tilde {\mu }}}\\J&=\partial _{\tilde {\mu }}{\dot {\tilde {\mu }}}=D-\partial _{{\tilde {\mu }}{\tilde {\mu }}}F({\tilde {s}},{\tilde {\mu }})\end{aligned}}}$

This updates the means of hidden variables at each interval (usually the interval between observations).
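The local linearization update can be sketched in the scalar case, where the matrix exponential reduces to an ordinary exponential. The example assumes a hypothetical linear flow $\dot{\mu}=-k(\mu-m)$ (as would arise from a quadratic free energy with no $D$ term), for which the update reproduces the exact solution over the interval, whereas a plain Euler step does not. The names `flow`, `jacobian`, `k` and `target` are illustrative only.

```python
import math

def local_linearisation_step(flow, jacobian, mu, dt):
    """Scalar form of the update: d_mu = (exp(dt * J) - 1) / J * flow(mu)."""
    J = jacobian(mu)
    f = flow(mu)
    if abs(J) < 1e-12:
        # Limit J -> 0: the update reduces to a plain Euler step
        return mu + dt * f
    # expm1(x) = exp(x) - 1, evaluated accurately for small x
    return mu + math.expm1(dt * J) / J * f

# Assumed linear flow with rate k towards a fixed point `target`
k, target = 2.0, 1.5
flow = lambda mu: -k * (mu - target)
jacobian = lambda mu: -k

mu0, dt = 0.0, 0.25
mu1 = local_linearisation_step(flow, jacobian, mu0, dt)
exact = target + (mu0 - target) * math.exp(-k * dt)
euler = mu0 + dt * flow(mu0)
print(mu1, exact, euler)
```

For nonlinear flows the Jacobian changes along the path, so the update is only locally exact; the vector case replaces the scalar exponential with a matrix exponential.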

## Generative (state-space) models in generalised coordinates

Usually, the generative density or model is specified in terms of a nonlinear input-state-output model with continuous nonlinear functions:

${\displaystyle {\begin{aligned}s&=g(x,u)+\omega _{s}\\{\dot {x}}&=f(x,u)+\omega _{x}\\\end{aligned}}}$

The corresponding generalised model (under local linearity assumptions) obtains from the chain rule

${\displaystyle {\begin{aligned}{\tilde {s}}&={\tilde {g}}({\tilde {x}},{\tilde {u}})+{\tilde {\omega }}_{s}\\\\s&=g(x,u)+\omega _{s}\\{s}'&=\partial _{x}g\cdot {x}'+\partial _{u}g\cdot {u}'+{\omega }'_{s}\\{s}''&=\partial _{x}g\cdot {x}''+\partial _{u}g\cdot {u}''+{\omega }''_{s}\\\vdots \\\end{aligned}}\qquad {\begin{aligned}{\dot {\tilde {x}}}&={\tilde {f}}({\tilde {x}},{\tilde {u}})+{\tilde {\omega }}_{x}\\\\{\dot {x}}&=f(x,u)+\omega _{x}\\{\dot {x}}'&=\partial _{x}f\cdot {x}'+\partial _{u}f\cdot {u}'+{\omega }'_{x}\\{\dot {x}}''&=\partial _{x}f\cdot {x}''+\partial _{u}f\cdot {u}''+{\omega }''_{x}\\\vdots \\\end{aligned}}}$
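The chain-rule construction can be sketched for a scalar observation function. The example below assumes a hypothetical $g(x,u)=\sin(x)+2u$ with hand-coded partial derivatives and returns the noise-free generalised prediction ${\tilde {g}}$; none of these names come from the source. Under local linearity, every order above the zeroth reuses the same first-order gradients evaluated at $(x,u)$, and curvature terms are neglected.

```python
import math

def generalised_prediction(g, dg_dx, dg_du, x_tilde, u_tilde):
    """Generalised prediction [g, g', g'', ...] under local linearity:
    higher orders use the gradients of g evaluated at the zeroth order."""
    x, u = x_tilde[0], u_tilde[0]
    gx, gu = dg_dx(x, u), dg_du(x, u)
    prediction = [g(x, u)]
    for x_n, u_n in zip(x_tilde[1:], u_tilde[1:]):
        # n-th order motion: dg/dx * x^(n) + dg/du * u^(n)
        prediction.append(gx * x_n + gu * u_n)
    return prediction

# Assumed observation function and its partial derivatives
g = lambda x, u: math.sin(x) + 2.0 * u
dg_dx = lambda x, u: math.cos(x)
dg_du = lambda x, u: 2.0

x_tilde = [0.0, 1.0, -0.5]   # [x, x', x'']
u_tilde = [0.5, 0.0, 0.2]    # [u, u', u'']
s_pred = generalised_prediction(g, dg_dx, dg_du, x_tilde, u_tilde)
print(s_pred)
```

The same function applies to the equations of motion by passing $f$ and its gradients in place of $g$.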

Gaussian assumptions about the random fluctuations ${\displaystyle \omega }$ then prescribe the likelihood and empirical priors on the motion of hidden states

${\displaystyle {\begin{aligned}p\left({{\tilde {s}},{\tilde {x}},{\tilde {u}}\vert m}\right)&=p\left({{\tilde {s}}\vert {\tilde {x}},{\tilde {u}},m}\right)p\left({D{\tilde {x}}\vert x,{\tilde {u}},m}\right)p(x\vert m)p({\tilde {u}}\vert m)\\p\left({{\tilde {s}}\vert {\tilde {x}},{\tilde {u}},m}\right)&={\mathcal {N}}({\tilde {g}}({\tilde {x}},{\tilde {u}}),{\tilde {\Sigma }}({\tilde {x}},{\tilde {u}})_{s})\\p\left({D{\tilde {x}}\vert x,{\tilde {u}},m}\right)&={\mathcal {N}}({\tilde {f}}({\tilde {x}},{\tilde {u}}),{\tilde {\Sigma }}({\tilde {x}},{\tilde {u}})_{x})\\\end{aligned}}}$

The covariances ${\displaystyle {\tilde {\Sigma }}=V\otimes \Sigma }$ factorise into a covariance ${\displaystyle \Sigma }$ among variables and a correlation matrix ${\displaystyle V}$ among generalised fluctuations that encodes their autocorrelation:

${\displaystyle V={\begin{bmatrix}1&0&{{\ddot {\rho }}(0)}&\cdots \\0&{-{\ddot {\rho }}(0)}&0\ &\ \\{{\ddot {\rho }}(0)}\ &0\ &{{\ddot {\ddot {\rho }}}(0)}\ &\ \\\vdots \ &\ &\ &\ddots \ \\\end{bmatrix}}}$

Here, ${\displaystyle {\ddot {\rho }}(0)}$ is the second derivative of the autocorrelation function evaluated at zero. This is a ubiquitous measure of roughness in the theory of stochastic processes [8]. Crucially, the precision (inverse variance) of high order derivatives falls to zero fairly quickly, which means it is only necessary to model relatively low order generalised motion (usually between two and eight) for any given or parameterized autocorrelation function.
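As an illustration, the matrix $V$ can be written out for one particular choice of autocorrelation function. The sketch below assumes a Gaussian form $\rho(h)=\exp(-h^{2}/2\sigma^{2})$, which the text above does not prescribe, and uses the standard result for stationary processes that the covariance between the $i$-th and $j$-th derivatives is $(-1)^{i}\rho^{(i+j)}(0)$; this reproduces the sign pattern of the matrix above. Function names and the value of `sigma` are hypothetical.

```python
def gaussian_autocorr_derivative(order, sigma):
    """order-th derivative at lag zero of rho(h) = exp(-h**2 / (2 sigma**2)).
    Odd derivatives vanish; even ones are (-1)**k (2k - 1)!! / sigma**(2k)."""
    if order % 2 == 1:
        return 0.0
    k = order // 2
    double_factorial = 1
    for m in range(1, 2 * k, 2):
        double_factorial *= m
    return (-1) ** k * double_factorial / sigma ** (2 * k)

def temporal_correlation_matrix(n_orders, sigma):
    """V[i][j] = (-1)**i * rho^(i+j)(0) for orders i, j = 0 .. n_orders - 1."""
    return [[(-1) ** i * gaussian_autocorr_derivative(i + j, sigma)
             for j in range(n_orders)]
            for i in range(n_orders)]

V_unit = temporal_correlation_matrix(3, sigma=1.0)
V_rough = temporal_correlation_matrix(3, sigma=0.5)
for row in V_rough:
    print(row)
```

With `sigma = 0.5` the diagonal entries (the variances of successive orders of motion) are 1, 4 and 48, so under this assumed autocorrelation the marginal precision of each higher order shrinks rapidly.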

## Special cases

### Filtering discrete time series

When time series are observed as a discrete sequence of ${\displaystyle N}$ observations, the implicit sampling is treated as part of the generative process, where (using Taylor's theorem)

${\displaystyle {[s_{1},\dots ,s_{N}]^{T}=(E\otimes I)\cdot {\tilde {s}}(t)}:\qquad {E_{ij}={\frac {(i-t)^{(j-1)}}{(j-1)!}}}}$
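The Taylor expansion behind $E$ can be checked numerically for a scalar signal, in which case $E\otimes I$ reduces to $E$. The sketch assumes a hypothetical quadratic signal, for which three orders of generalised motion reproduce the samples exactly; the function name and the chosen signal are illustrative.

```python
import math

def embedding_matrix(n_samples, n_orders, t):
    """E[i][j] = (i - t)**(j - 1) / (j - 1)!  for i = 1..N and j = 1..n_orders."""
    return [[(i - t) ** (j - 1) / math.factorial(j - 1)
             for j in range(1, n_orders + 1)]
            for i in range(1, n_samples + 1)]

# Assumed signal s(tau) = 1 + 2 tau + 3 tau**2 and its derivatives at time t
signal = lambda tau: 1.0 + 2.0 * tau + 3.0 * tau ** 2
t = 2.0
s_tilde = [signal(t), 2.0 + 6.0 * t, 6.0]   # [s, s', s''] evaluated at t

E = embedding_matrix(n_samples=3, n_orders=3, t=t)
samples = [sum(e * d for e, d in zip(row, s_tilde)) for row in E]
print(samples)                               # reconstructed s_1, s_2, s_3
print([signal(i) for i in (1, 2, 3)])        # direct evaluation for comparison
```

For signals that are not polynomial, truncating the expansion at a finite order makes the mapping approximate, with errors that grow with distance from the expansion point.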

In principle, the entire sequence could be used to estimate hidden variables at each point in time. However, the precision of samples in the past and future falls quickly, so distant samples can be ignored. This allows the scheme to assimilate data online, using local observations around each time point (typically between two and eight).

### Generalised filtering and model parameters

For any slowly varying model parameters of the equations of motion ${\displaystyle f(x,u,\theta )}$ or precision ${\displaystyle {\tilde {\Pi }}(x,u,\theta )}$ generalised filtering takes the following form (where ${\displaystyle \mu }$ corresponds to the variational mean of the parameters)

${\displaystyle {\begin{aligned}{\dot {\mu }}&={\mu }'\\{\dot {{\mu }'}}&=-\partial _{\mu }F({\tilde {s}},\mu )-\kappa {\mu }'\\\end{aligned}}}$

Here, the solution ${\displaystyle {\dot {\tilde {\mu }}}=0}$ minimizes variational free energy, when the motion of the mean is small. This can be seen by noting ${\displaystyle {\dot {\mu }}={\dot {\mu }}'=0\Rightarrow \partial _{\mu }F=0\Rightarrow \delta _{\mu }S=0}$. It is straightforward to show that this solution corresponds to a classical Newton update [9].
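The second-order dynamics for parameters behave like a damped descent on free energy. The sketch below integrates them with a forward Euler scheme for an assumed quadratic free energy $F=\tfrac{1}{2}(\mu-2)^{2}$; the integrator, the value of $\kappa$, the step size and the free energy itself are illustrative assumptions rather than part of the scheme described above.

```python
def integrate_parameter_dynamics(dF_dmu, mu0, kappa, dt, n_steps):
    """Forward Euler integration of
        d(mu)/dt  = mu'
        d(mu')/dt = -dF/dmu - kappa * mu'
    starting from mu0 with zero initial motion."""
    mu, mu_prime = mu0, 0.0
    for _ in range(n_steps):
        mu_dot = mu_prime
        mu_prime_dot = -dF_dmu(mu) - kappa * mu_prime
        mu += dt * mu_dot
        mu_prime += dt * mu_prime_dot
    return mu, mu_prime

# Gradient of the assumed free energy F = 0.5 * (mu - 2)**2
dF_dmu = lambda mu: mu - 2.0

mu_final, mu_prime_final = integrate_parameter_dynamics(
    dF_dmu, mu0=0.0, kappa=2.0, dt=0.01, n_steps=3000)
print(mu_final, mu_prime_final)
```

At convergence both the motion and its rate of change vanish, which is the condition $\partial _{\mu }F=0$ noted above; the damping $\kappa$ controls how quickly oscillations around the minimum die away.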

## Relationship to Bayesian filtering and predictive coding

### Generalised filtering and Kalman filtering

Classical filtering under Markovian or Wiener assumptions is equivalent to assuming the precision of the motion of random fluctuations is zero. In this limiting case, one only has to consider the states and their first derivative ${\displaystyle {\tilde {\mu }}=(\mu ,{\mu }')}$. This means generalised filtering takes the form of a Kalman-Bucy filter, with prediction and correction terms:

${\displaystyle {\begin{aligned}{\dot {\mu }}&={\mu }'-\partial _{\mu }F(s,{\tilde {\mu }})\\{\dot {{\mu }'}}&=-\partial _{{\mu }'}F(s,{\tilde {\mu }})\\\end{aligned}}}$

Substituting this first-order filtering into the discrete update scheme above gives the equivalent of (extended) Kalman filtering [10].

### Generalised filtering and particle filtering

Particle filtering is a sampling-based scheme that relaxes assumptions about the form of the variational or approximate posterior density. The corresponding generalised filtering scheme is called variational filtering [11]. In variational filtering, an ensemble of particles diffuses over the free energy landscape in a frame of reference that moves with the expected (generalised) motion of the ensemble. This provides a relatively simple scheme that eschews Gaussian (unimodal) assumptions. Unlike particle filtering, it does not require proposal densities -- or the elimination or creation of particles.

### Generalised filtering and variational Bayes

Variational Bayes rests on a mean field partition of the variational density:

${\displaystyle q({\tilde {x}},{\tilde {u}},\theta \dots \vert {\tilde {\mu }},\mu )=q({\tilde {x}},{\tilde {u}}\vert {\tilde {\mu }})q(\theta \vert \mu )\dots }$

This partition induces a variational update or step for each marginal density -- that is usually solved analytically using conjugate priors. In generalised filtering, this leads to dynamic expectation maximisation [12], which comprises a D-step that optimises the sufficient statistics of unknown states, an E-step for parameters and an M-step for precisions.

### Generalised filtering and predictive coding

Generalised filtering is usually used to invert hierarchical models of the following form

${\displaystyle {\begin{aligned}{\tilde {s}}&={\tilde {g}}^{(1)}({\tilde {x}}^{(1)},{\tilde {u}}^{(1)})+{\tilde {\omega }}_{s}^{(1)}\\{\dot {\tilde {x}}}^{(1)}&={\tilde {f}}^{(1)}({\tilde {x}}^{(1)},{\tilde {u}}^{(1)})+{\tilde {\omega }}_{x}^{(1)}\\\vdots \\{\tilde {u}}^{(i-1)}&={\tilde {g}}^{(i)}({\tilde {x}}^{(i)},{\tilde {u}}^{(i)})+{\tilde {\omega }}_{u}^{(i)}\\{\dot {\tilde {x}}}^{(i)}&={\tilde {f}}^{(i)}({\tilde {x}}^{(i)},{\tilde {u}}^{(i)})+{\tilde {\omega }}_{x}^{(i)}\\\vdots \\\end{aligned}}}$

The ensuing generalised gradient descent on free energy can then be expressed compactly in terms of prediction errors, where (omitting high order terms):

${\displaystyle {\begin{aligned}{\dot {\tilde {\mu }}}_{u}^{(i)}&=D{\tilde {\mu }}_{u}^{(i)}-\partial _{u}{\tilde {\varepsilon }}^{(i)}\cdot \Pi ^{(i)}{\tilde {\varepsilon }}^{(i)}-\Pi ^{(i+1)}{\tilde {\varepsilon }}_{u}^{(i+1)}\\{\dot {\tilde {\mu }}}_{x}^{(i)}&=D{\tilde {\mu }}_{x}^{(i)}-\partial _{x}{\tilde {\varepsilon }}^{(i)}\cdot \Pi ^{(i)}{\tilde {\varepsilon }}^{(i)}\\\\{\tilde {\varepsilon }}_{u}^{(i)}&={\tilde {\mu }}_{u}^{(i-1)}-{\tilde {g}}^{(i)}\\{\tilde {\varepsilon }}_{x}^{(i)}&=D{\tilde {\mu }}_{x}^{(i)}-{\tilde {f}}^{(i)}\\\end{aligned}}}$

Here, ${\displaystyle \Pi ^{(i)}}$ is the precision of random fluctuations at the $i$-th level. This is known as generalised predictive coding [11], with linear predictive coding as a special case.
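The structure of these updates can be illustrated with a deliberately reduced example: a single scalar cause, no hidden states and only the zeroth order of motion, so the $D$ term vanishes and the scheme becomes an ordinary gradient descent driven by precision-weighted prediction errors. The linear mapping $g(\mu)=\theta\mu$, the prior expectation `eta` supplied by the level above, the step size and all names are assumptions made for this sketch.

```python
def predictive_coding_descent(s, eta, theta, pi_1, pi_2, rate=0.05, n_steps=2000):
    """Descent on precision-weighted prediction errors for one scalar cause.
    Level 1 predicts the data with g(mu) = theta * mu; the level above
    predicts the cause itself with a fixed expectation eta."""
    mu = eta
    for _ in range(n_steps):
        eps_1 = s - theta * mu      # sensory prediction error
        eps_2 = mu - eta            # prediction error at the level above
        d_eps_1 = -theta            # gradient of eps_1 with respect to mu
        # Zeroth-order form of the update: the D term is zero here
        mu_dot = -d_eps_1 * pi_1 * eps_1 - pi_2 * eps_2
        mu += rate * mu_dot
    return mu

s, eta, theta, pi_1, pi_2 = 3.0, 0.0, 2.0, 4.0, 1.0
mu_hat = predictive_coding_descent(s, eta, theta, pi_1, pi_2)
closed_form = (theta * pi_1 * s + pi_2 * eta) / (theta ** 2 * pi_1 + pi_2)
print(mu_hat, closed_form)
```

The fixed point balances the two prediction errors according to their precisions: raising `pi_1` pulls the estimate towards the value that explains the data, while raising `pi_2` pulls it towards the expectation from the level above.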

## Applications

Generalised filtering has been primarily applied to biological time series -- in particular functional magnetic resonance imaging and electrophysiological data. This is usually in the context of dynamic causal modelling to make inferences about the underlying architectures of (neuronal) systems generating data [13]. It is also used to simulate inference in terms of generalised (hierarchical) predictive coding in the brain [14].

## References

1. ^ K Friston, K Stephan, B Li, and J Daunizeau, "Generalised Filtering," Mathematical Problems in Engineering, vol. 2010, p. 621670, 2010.
2. ^ B Balaji and K Friston, "Bayesian state estimation using generalized coordinates," Proc. SPIE, p. 80501Y, 2011.
3. ^ R P Feynman, Statistical mechanics. Reading MA: Benjamin, 1972.
4. ^ M J Beal, "Variational Algorithms for Approximate Bayesian Inference," PhD. Thesis, University College London, 2003.
5. ^ K Friston, J Mattout, N Trujillo-Barreto, J Ashburner, and W Penny, "Variational free energy and the Laplace approximation," NeuroImage, vol. 34, no. 1, pp. 220-34, 2007.
6. ^ W C Kerr and A J Graham, "Generalised phase space version of Langevin equations and associated Fokker-Planck equations," Eur. Phys. J. B., vol. 15, pp. 305-11, 2000.
7. ^ T Ozaki, "A bridge between nonlinear time-series models and nonlinear stochastic dynamical systems: A local linearization approach," Statistica Sin., vol. 2, pp. 113-135, 1992.
8. ^ D R Cox and H D Miller, The theory of stochastic processes. London: Methuen, 1965.
9. ^ K Friston, K Stephan, B Li, and J Daunizeau, "Generalised Filtering," Mathematical Problems in Engineering, vol. 2010, p. 621670, 2010.
10. ^ K J Friston, N Trujillo-Barreto, and J Daunizeau, "DEM: A variational treatment of dynamic systems," Neuroimage, vol. 41, no. 3, pp. 849-85, 2008.
11. ^ K J Friston, "Variational filtering," Neuroimage, vol. 41, no. 3, pp. 747-66, 2008.
12. ^ K J Friston, N Trujillo-Barreto, and J Daunizeau, "DEM: A variational treatment of dynamic systems," Neuroimage, vol. 41, no. 3, pp. 849-85, 2008.
13. ^ J Daunizeau, O David, and K E Stephan, "Dynamic causal modelling: a critical review of the biophysical and statistical foundations," Neuroimage, vol. 58, no. 2, pp. 312-22, 2011.
14. ^ K Friston, "Hierarchical models in the brain," PLoS Comput Biol., vol. 4, no. 11, p. e1000211, 2008.