Fisher kernel

In statistical classification, the Fisher kernel, named in honour of Sir Ronald Fisher, is a function that measures the similarity of two objects on the basis of sets of measurements for each object and a statistical model. In a classification procedure, the class for a new object (whose real class is unknown) can be estimated by minimising, across classes, an average of the Fisher kernel distance from the new object to each known member of the given class.
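The classification rule described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the text does not define the "Fisher kernel distance", so the sketch assumes the usual kernel-induced distance $d(x, y) = \sqrt{K(x,x) - 2K(x,y) + K(y,y)}$. The names `kernel_distance`, `classify` and `dot` are hypothetical, and the plain dot product stands in for a Fisher kernel.

```python
from math import sqrt

def kernel_distance(kernel, x, y):
    """Distance induced by a kernel (assumed form, see lead-in)."""
    squared = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    return sqrt(max(squared, 0.0))  # clamp tiny negative rounding errors

def classify(kernel, new_obj, labelled):
    """Return the class whose known members are, on average, closest.

    labelled maps each class label to a non-empty list of known members.
    """
    def mean_distance(members):
        total = sum(kernel_distance(kernel, new_obj, m) for m in members)
        return total / len(members)
    return min(labelled, key=lambda label: mean_distance(labelled[label]))

def dot(u, v):
    """Placeholder kernel; a Fisher kernel would be passed here instead."""
    return sum(a * b for a, b in zip(u, v))

training = {
    "a": [[0.0, 0.0], [0.0, 1.0]],
    "b": [[5.0, 5.0], [6.0, 5.0]],
}
predicted = classify(dot, [0.5, 0.5], training)
```

Any function with the signature `kernel(x, y)` can be passed to `classify`, so the same rule applies once a model-specific Fisher kernel is available.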

The Fisher kernel was introduced in 1998.[1] It combines the advantages of generative statistical models (like the hidden Markov model) and those of discriminative methods (like support vector machines):

• generative models can process data of variable length (adding or removing data is well-supported)
• discriminative methods can have flexible criteria and yield better results.

Derivation

Fisher score

The Fisher kernel makes use of the Fisher score, defined as

$U_X = \nabla_{\theta} \log P(X|\theta)$

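where θ is the vector of model parameters. The function taking θ to log P(X|θ) is the log-likelihood of the probabilistic model, so the Fisher score is its gradient and has one component per parameter, regardless of the size of X.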
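As an illustration, the Fisher score can be written in closed form for a simple model. The sketch below assumes a univariate Gaussian with θ = (μ, σ²) and independent observations; this model choice and the name `gaussian_fisher_score` are assumptions made for the example, not part of the definition above.

```python
def gaussian_fisher_score(xs, mu, var):
    """Gradient of log P(xs | mu, var) with respect to (mu, var).

    xs is a sequence of i.i.d. observations under N(mu, var).
    """
    n = len(xs)
    # d/d(mu) of the log-likelihood: sum of (x - mu) / var
    d_mu = sum(x - mu for x in xs) / var
    # d/d(var) of the log-likelihood:
    # sum of (x - mu)^2 / (2 var^2), minus n / (2 var)
    d_var = sum((x - mu) ** 2 for x in xs) / (2 * var ** 2) - n / (2 * var)
    return [d_mu, d_var]

# Two objects with different numbers of measurements
short_seq = [0.5, 1.5]
long_seq = [0.0, 1.0, 2.0, 3.0]

score_short = gaussian_fisher_score(short_seq, mu=1.0, var=1.0)
score_long = gaussian_fisher_score(long_seq, mu=1.0, var=1.0)
```

Both scores have two components, one per parameter, even though the inputs differ in length. This is what lets a fixed-size kernel compare variable-length data.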

Fisher kernel

The Fisher kernel is defined as

$K(X_i, X_j) = U_{X_i}^T I^{-1} U_{X_j}$

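where $I$ is the Fisher information matrix, $I = \mathrm{E}_X\left[U_X U_X^T\right]$, with the expectation taken over X distributed according to P(X|θ).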
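A minimal sketch of this definition, again assuming a univariate Gaussian model with θ = (μ, σ²). For that model the per-observation Fisher information is diagonal, diag(1/σ², 1/(2σ⁴)). Using the per-observation matrix rather than that of the whole sample is a simplifying assumption of the example, and the function names are hypothetical.

```python
def gaussian_fisher_score(xs, mu, var):
    """Fisher score of i.i.d. observations xs under N(mu, var)."""
    n = len(xs)
    d_mu = sum(x - mu for x in xs) / var
    d_var = sum((x - mu) ** 2 for x in xs) / (2 * var ** 2) - n / (2 * var)
    return [d_mu, d_var]

def gaussian_fisher_kernel(xs_i, xs_j, mu, var):
    """K(X_i, X_j) = U_i^T I^{-1} U_j for the Gaussian model.

    I = diag(1/var, 1/(2 var^2)), hence I^{-1} = diag(var, 2 var^2).
    """
    u_i = gaussian_fisher_score(xs_i, mu, var)
    u_j = gaussian_fisher_score(xs_j, mu, var)
    inv_info = [var, 2 * var ** 2]  # diagonal of the inverse information
    return sum(a * w * b for a, w, b in zip(u_i, inv_info, u_j))

a = [0.5, 1.5]
b = [0.0, 1.0, 2.0, 3.0]

k_ab = gaussian_fisher_kernel(a, b, mu=1.0, var=1.0)
k_ba = gaussian_fisher_kernel(b, a, mu=1.0, var=1.0)
k_aa = gaussian_fisher_kernel(a, a, mu=1.0, var=1.0)
```

Because $I^{-1}$ is symmetric and positive definite, the resulting kernel is symmetric and K(X, X) is never negative, as expected of an inner product between scores.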

Applications

Information retrieval

The Fisher kernel is a kernel derived from a generative probabilistic model. As such, it constitutes a bridge between generative and discriminative models of documents.[2] Fisher kernels exist for numerous models, notably tf–idf,[3] Naive Bayes and probabilistic latent semantic analysis.

Image classification and retrieval

The Fisher kernel can also be applied to image representation for classification or retrieval problems. The widely used bag-of-visual-words representation suffers from sparsity and high dimensionality. A representation based on the Fisher kernel can instead be compact and dense, which is more desirable for image classification[4] and retrieval[5][6] problems.
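A dense image representation of this kind can be sketched as below. The sketch assumes that the local descriptors of an image are modelled by a Gaussian mixture with diagonal covariances, a common choice in the Fisher vector literature that the text above does not itself state. It keeps only the gradient with respect to the component means, scaled by a diagonal approximation of the Fisher information. The function name and the toy parameters are hypothetical.

```python
from math import exp, pi, sqrt

def fisher_vector_means(descriptors, weights, means, variances):
    """Normalised gradient of the average log-likelihood w.r.t. GMM means.

    descriptors: list of D-dimensional local descriptors of one image
    weights, means, variances: parameters of a K-component diagonal GMM
    Returns a list of length K * D, whatever the number of descriptors.
    """
    T, K, D = len(descriptors), len(weights), len(means[0])
    fv = [0.0] * (K * D)
    for x in descriptors:
        # Weighted density of x under each component
        dens = []
        for k in range(K):
            p = weights[k]
            for d in range(D):
                v = variances[k][d]
                p *= exp(-((x[d] - means[k][d]) ** 2) / (2 * v)) / sqrt(2 * pi * v)
            dens.append(p)
        total = sum(dens)  # assumed non-zero (no underflow handling here)
        for k in range(K):
            gamma = dens[k] / total  # posterior of component k given x
            for d in range(D):
                fv[k * D + d] += gamma * (x[d] - means[k][d]) / sqrt(variances[k][d])
    # Average over descriptors and scale by 1 / sqrt(w_k)
    return [g / (T * sqrt(weights[i // D])) for i, g in enumerate(fv)]

# Toy 2-component mixture over 2-dimensional descriptors
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

fv_one = fisher_vector_means([[1.0, 0.0]], weights, means, variances)
fv_three = fisher_vector_means(
    [[0.0, 0.0], [4.0, 4.0], [3.0, 5.0]], weights, means, variances
)
```

Images with different numbers of descriptors map to vectors of the same length, and in general every entry is non-zero. A bag-of-visual-words histogram, by contrast, only records how many descriptors fall in each visual word.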