# Multimodal learning

Information in the real world usually comes in several modalities. For example, images are often accompanied by tags and text explanations, and articles include images to express their main idea more clearly. Different modalities have very different statistical properties: images are usually represented as pixel intensities or outputs of feature extractors, while text is represented as discrete word count vectors. Because of these differences, it is important to discover the relationships between modalities. Multimodal learning models a joint representation of different modalities, and such a model can also fill in a missing modality given the observed ones. The multimodal learning model described here combines two deep Boltzmann machines, each corresponding to one modality, with an additional hidden layer placed on top of the two to give the joint representation.

## Motivation

Many models and algorithms have been developed to retrieve and classify a single type of data, e.g. images or text. However, data usually comes in several modalities, each carrying different information. For example, it is very common to caption an image to convey information not presented by the image itself. Similarly, it is sometimes more straightforward to use an image to describe information that may not be obvious from text. As a result, if different words appear with similar images, those words are very likely used to describe the same thing. Conversely, if the same words are used with different images, those images may represent the same object. Thus, it is important to devise a model that can jointly represent the information so that it captures the correlation structure between modalities. It should also be able to recover a missing modality given the observed ones, e.g. predicting a possible image object from a text description. The multimodal deep Boltzmann machine model satisfies both requirements.

## Background: Boltzmann machine

A Boltzmann machine is a type of stochastic neural network invented by Geoffrey Hinton and Terry Sejnowski in 1985. Boltzmann machines can be seen as the stochastic, generative counterpart of Hopfield nets. They are named after the Boltzmann distribution in statistical mechanics. The units in a Boltzmann machine are divided into two groups: visible units and hidden units. General Boltzmann machines allow connections between any pair of units. However, learning is impractical with general Boltzmann machines because the computation time is exponential in the size of the machine. A more efficient architecture is the restricted Boltzmann machine, in which connections are allowed only between a hidden unit and a visible unit; it is described in the next section.

### Restricted Boltzmann machine

A restricted Boltzmann machine[1] is an undirected graphical model with stochastic visible variables and stochastic hidden variables. Each visible variable is connected to each hidden variable. The energy function of the model is defined as

${\displaystyle E(\mathbf {v} ,\mathbf {h} ;\theta )=-\sum _{i=1}^{D}\sum _{j=1}^{F}W_{ij}v_{i}h_{j}-\sum _{i=1}^{D}b_{i}v_{i}-\sum _{j=1}^{F}a_{j}h_{j}}$

where ${\displaystyle \theta =\{\mathbf {W} ,\mathbf {a} ,\mathbf {b} \}}$ are the model parameters: ${\displaystyle W_{ij}}$ represents the symmetric interaction term between visible unit ${\displaystyle i}$ and hidden unit ${\displaystyle j}$; ${\displaystyle b_{i}}$ and ${\displaystyle a_{j}}$ are bias terms. The joint distribution of the system, marginalized over the hidden units, gives the probability of a visible vector:

${\displaystyle P(\mathbf {v} ;\theta )={\frac {1}{{\mathcal {Z}}(\theta )}}\sum _{\mathbf {h} }\mathrm {exp} (-E(\mathbf {v} ,\mathbf {h} ;\theta ))}$

where ${\displaystyle {\mathcal {Z}}(\theta )}$ is a normalizing constant. The conditional distributions of the hidden units ${\displaystyle \mathbf {h} }$ given ${\displaystyle \mathbf {v} }$, and of the visible units given ${\displaystyle \mathbf {h} }$, factorize and can be written as logistic functions of the model parameters.

${\displaystyle P(\mathbf {h} |\mathbf {v} ;\theta )=\prod _{j=1}^{F}p(h_{j}|\mathbf {v} )}$, with ${\displaystyle p(h_{j}=1|\mathbf {v} )=g(\sum _{i=1}^{D}W_{ij}v_{i}+a_{j})}$
${\displaystyle P(\mathbf {v} |\mathbf {h} ;\theta )=\prod _{i=1}^{D}p(v_{i}|\mathbf {h} )}$, with ${\displaystyle p(v_{i}=1|\mathbf {h} )=g(\sum _{j=1}^{F}W_{ij}h_{j}+b_{i})}$

where ${\displaystyle g(x)={\frac {1}{(1+\mathrm {exp} (-x))}}}$ is the logistic function.
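The energy function and the two logistic conditionals can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names and the toy parameter values are hypothetical, and `W[i][j]` is assumed to index visible unit `i` and hidden unit `j` as in the formulas.

```python
import math

def logistic(x):
    """Logistic function g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def energy(v, h, W, b, a):
    """E(v, h) = -sum_ij W_ij v_i h_j - sum_i b_i v_i - sum_j a_j h_j."""
    interaction = sum(W[i][j] * v[i] * h[j]
                      for i in range(len(v)) for j in range(len(h)))
    visible_bias = sum(b[i] * v[i] for i in range(len(v)))
    hidden_bias = sum(a[j] * h[j] for j in range(len(h)))
    return -interaction - visible_bias - hidden_bias

def p_hidden_given_visible(v, W, a):
    """List of p(h_j = 1 | v) for every hidden unit j."""
    return [logistic(sum(W[i][j] * v[i] for i in range(len(v))) + a[j])
            for j in range(len(a))]

def p_visible_given_hidden(h, W, b):
    """List of p(v_i = 1 | h) for every visible unit i."""
    return [logistic(sum(W[i][j] * h[j] for j in range(len(h))) + b[i])
            for i in range(len(b))]

# Toy model with D = 3 visible and F = 2 hidden units (made-up values).
W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
b = [0.0, 0.1, -0.1]
a = [0.2, -0.3]
v = [1, 0, 1]
h = [1, 0]

print(energy(v, h, W, b, a))
print(p_hidden_given_visible(v, W, a))
print(p_visible_given_hidden(h, W, b))
```

Because the hidden units are conditionally independent given the visible units, each probability is computed separately; enumerating all hidden states and normalizing `exp(-E)` gives the same values for a model this small.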

The derivative of the log-likelihood with respect to the model parameters can be decomposed as the difference between a data-dependent expectation and the model's expectation.
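For the binary restricted Boltzmann machine defined above, this decomposition takes the following standard form. The subscripts "data" and "model" are notation introduced here for illustration: the first expectation is taken over ${\displaystyle P(\mathbf {h} |\mathbf {v} ;\theta )}$ with ${\displaystyle \mathbf {v} }$ fixed to a training vector, the second over the joint distribution defined by the model.

${\displaystyle {\frac {\partial \log P(\mathbf {v} ;\theta )}{\partial W_{ij}}}=\mathbb {E} _{\text{data}}[v_{i}h_{j}]-\mathbb {E} _{\text{model}}[v_{i}h_{j}]}$

${\displaystyle {\frac {\partial \log P(\mathbf {v} ;\theta )}{\partial b_{i}}}=\mathbb {E} _{\text{data}}[v_{i}]-\mathbb {E} _{\text{model}}[v_{i}]}$, ${\displaystyle {\frac {\partial \log P(\mathbf {v} ;\theta )}{\partial a_{j}}}=\mathbb {E} _{\text{data}}[h_{j}]-\mathbb {E} _{\text{model}}[h_{j}]}$

For a single training vector the data-dependent term is available in closed form, ${\displaystyle \mathbb {E} _{\text{data}}[v_{i}h_{j}]=v_{i}\,p(h_{j}=1|\mathbf {v} )}$, whereas the model term requires a sum over all joint states and generally has to be approximated.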

### Gaussian-Bernoulli RBM

Gaussian-Bernoulli RBMs[2] are a variant of the restricted Boltzmann machine used for modeling real-valued vectors such as pixel intensities. They are usually used to model image data. The energy of the Gaussian-Bernoulli RBM is defined as

${\displaystyle E(\mathbf {v} ,\mathbf {h} ;\theta )=\sum _{i=1}^{D}{\frac {(v_{i}-b_{i})^{2}}{2\sigma _{i}^{2}}}-\sum _{i=1}^{D}\sum _{j=1}^{F}{\frac {v_{i}}{\sigma _{i}}}W_{ij}h_{j}-\sum _{j=1}^{F}a_{j}h_{j}}$

where ${\displaystyle \theta =\{\mathbf {a} ,\mathbf {b} ,\mathbf {W} ,\mathbf {\sigma } \}}$ are the model parameters. The joint distribution is defined in the same way as for the restricted Boltzmann machine. The conditional distributions now become

${\displaystyle P(\mathbf {h} |\mathbf {v} ;\theta )=\prod _{j=1}^{F}p(h_{j}|\mathbf {v} )}$, with ${\displaystyle p(h_{j}=1|\mathbf {v} )=g(\sum _{i=1}^{D}W_{ij}{\frac {v_{i}}{\sigma _{i}}}+a_{j})}$
${\displaystyle P(\mathbf {v} |\mathbf {h} ;\theta )=\prod _{i=1}^{D}p(v_{i}|\mathbf {h} )}$, with ${\displaystyle p(v_{i}|\mathbf {h} )\sim {\mathcal {N}}(\sigma _{i}\sum _{j=1}^{F}W_{ij}h_{j}+b_{i},\sigma _{i}^{2})}$

In a Gaussian-Bernoulli RBM, each visible unit conditioned on the hidden units is modeled as a Gaussian distribution.
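The following sketch illustrates the two conditionals of the Gaussian-Bernoulli RBM: hidden units remain logistic, while each visible unit is drawn from a Gaussian whose mean depends on the hidden state. The function names and parameter values are hypothetical and chosen only for illustration.

```python
import math
import random

def logistic(x):
    """Logistic function g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def p_hidden_given_visible(v, W, a, sigma):
    """p(h_j = 1 | v) = g(sum_i W_ij v_i / sigma_i + a_j)."""
    return [logistic(sum(W[i][j] * v[i] / sigma[i] for i in range(len(v))) + a[j])
            for j in range(len(a))]

def visible_mean(h, W, b, sigma):
    """Mean of p(v_i | h): sigma_i * sum_j W_ij h_j + b_i."""
    return [sigma[i] * sum(W[i][j] * h[j] for j in range(len(h))) + b[i]
            for i in range(len(b))]

def sample_visible(h, W, b, sigma, rng):
    """Draw each v_i independently from N(mean_i, sigma_i^2)."""
    means = visible_mean(h, W, b, sigma)
    return [rng.gauss(means[i], sigma[i]) for i in range(len(means))]

# Toy model with D = 2 real-valued visible units and F = 2 hidden units.
W = [[0.5, -0.2], [0.1, 0.3]]
b = [1.0, -1.0]
a = [0.0, 0.5]
sigma = [2.0, 0.5]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
h = [1, 1]
print(visible_mean(h, W, b, sigma))
print(sample_visible(h, W, b, sigma, rng))
print(p_hidden_given_visible([2.0, -0.5], W, a, sigma))
```

When all hidden units are off, the mean of each visible unit reduces to its bias `b[i]`, which matches the quadratic term of the energy function.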

### Replicated Softmax Model

The Replicated Softmax Model[3] is also a variant of the restricted Boltzmann machine and is commonly used to model word count vectors in a document. In a typical text mining problem, let ${\displaystyle K}$ be the dictionary size and ${\displaystyle M}$ be the number of words in the document. Let ${\displaystyle \mathbf {V} }$ be an ${\displaystyle M\times K}$ binary matrix with ${\displaystyle v_{ik}=1}$ only when the ${\displaystyle i^{th}}$ word in the document is the ${\displaystyle k^{th}}$ word in the dictionary. ${\displaystyle {\hat {v}}_{k}=\sum _{i=1}^{M}v_{ik}}$ denotes the count for the ${\displaystyle k^{th}}$ word in the dictionary. The energy of the state ${\displaystyle \{\mathbf {V} ,\mathbf {h} \}}$ for a document containing ${\displaystyle M}$ words is defined as

${\displaystyle E(\mathbf {V} ,\mathbf {h} )=-\sum _{j=1}^{F}\sum _{k=1}^{K}W_{jk}{\hat {v}}_{k}h_{j}-\sum _{k=1}^{K}b_{k}{\hat {v}}_{k}-M\sum _{j=1}^{F}a_{j}h_{j}}$

The conditional distributions are given by

${\displaystyle p(h_{j}=1|\mathbf {V} )=g(Ma_{j}+\sum _{k=1}^{K}{\hat {v}}_{k}W_{jk})}$
${\displaystyle p(v_{ik}=1|\mathbf {h} )={\frac {\mathrm {exp} (b_{k}+\sum _{j=1}^{F}h_{j}W_{jk})}{\sum _{q=1}^{K}\mathrm {exp} (b_{q}+\sum _{j=1}^{F}h_{j}W_{jq})}}}$
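As a rough illustration of these formulas, the sketch below builds the count vector for a tiny document and evaluates both conditionals. The helper names, the four-word dictionary and the parameter values are assumptions made for the example; `W[j][k]` follows the index order of the energy function above.

```python
import math

def logistic(x):
    """Logistic function g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def word_counts(word_ids, K):
    """Count vector v_hat: how often each of the K dictionary words occurs."""
    counts = [0] * K
    for k in word_ids:
        counts[k] += 1
    return counts

def p_hidden_given_document(counts, W, a):
    """p(h_j = 1 | V) = g(M a_j + sum_k v_hat_k W_jk), with M = document length."""
    M = sum(counts)
    return [logistic(M * a[j] + sum(counts[k] * W[j][k] for k in range(len(counts))))
            for j in range(len(a))]

def p_word_given_hidden(h, W, b):
    """Softmax over the dictionary: p(v_ik = 1 | h) for every word k."""
    scores = [b[k] + sum(h[j] * W[j][k] for j in range(len(h)))
              for k in range(len(b))]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy setup: dictionary of K = 4 words, F = 2 hidden units (made-up values).
W = [[0.2, -0.1, 0.4, 0.0], [-0.3, 0.5, 0.1, 0.2]]
b = [0.0, 0.1, -0.2, 0.3]
a = [0.05, -0.05]

document = [0, 2, 2, 3]  # word indices, so M = 4
counts = word_counts(document, K=4)
print(counts)
print(p_hidden_given_document(counts, W, a))
print(p_word_given_hidden([1, 0], W, b))
```

The hidden bias is scaled by the document length `M`, so longer documents do not simply saturate the hidden units through the larger counts.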

## Deep Boltzmann machines

A deep Boltzmann machine[4] has a sequence of layers of hidden units. There are connections only between adjacent hidden layers, and between the visible units and the hidden units in the first hidden layer. The energy function adds layer interaction terms to the energy function of the restricted Boltzmann machine; for a model with three hidden layers it is defined by

${\displaystyle {\begin{aligned}E({\mathbf {v} ,\mathbf {h} ;\theta })=&-\sum _{i=1}^{D}\sum _{j=1}^{F_{1}}W_{ij}^{(1)}v_{i}h_{j}^{(1)}-\sum _{j=1}^{F_{1}}\sum _{l=1}^{F_{2}}W_{jl}^{(2)}h_{j}^{(1)}h_{l}^{(2)}\\&-\sum _{l=1}^{F_{2}}\sum _{p=1}^{F_{3}}W_{lp}^{(3)}h_{l}^{(2)}h_{p}^{(3)}-\sum _{i=1}^{D}b_{i}v_{i}-\sum _{j=1}^{F_{1}}b_{j}^{(1)}h_{j}^{(1)}-\sum _{l=1}^{F_{2}}b_{l}^{(2)}h_{l}^{(2)}-\sum _{p=1}^{F_{3}}b_{p}^{(3)}h_{p}^{(3)}\end{aligned}}}$

The joint distribution is

${\displaystyle P(\mathbf {v} ;\theta )={\frac {1}{{\mathcal {Z}}(\theta )}}\sum _{\mathbf {h} }\mathrm {exp} (-E(\mathbf {v} ,\mathbf {h} ^{(1)},\mathbf {h} ^{(2)},\mathbf {h} ^{(3)};\theta ))}$
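The layered structure of the energy function can be made explicit in code: one interaction term per pair of adjacent layers plus one bias term per layer. The sketch below assumes binary units throughout; `dbm_energy` and the toy values are hypothetical names and numbers used only for illustration.

```python
def dbm_energy(layers, weights, biases):
    """Energy of a deep Boltzmann machine state.

    layers  : [v, h1, h2, h3], each a list of 0/1 unit states
    weights : weights[n][i][j] connects unit i of layers[n] to unit j of layers[n + 1]
    biases  : one bias vector per layer, in the same order as layers
    """
    energy = 0.0
    # Interaction terms exist only between adjacent layers.
    for lower, upper, W in zip(layers, layers[1:], weights):
        energy -= sum(W[i][j] * lower[i] * upper[j]
                      for i in range(len(lower)) for j in range(len(upper)))
    # Each layer contributes its own bias term.
    for units, bias in zip(layers, biases):
        energy -= sum(b_k * x_k for b_k, x_k in zip(bias, units))
    return energy

# Toy state: 2 visible units and hidden layers of sizes 2, 2 and 1.
layers = [[1, 0], [1, 1], [0, 1], [1]]
weights = [
    [[0.5, -0.5], [0.3, 0.3]],   # W(1): visible -> h(1)
    [[0.2, 0.4], [-0.1, 0.6]],   # W(2): h(1) -> h(2)
    [[0.7], [-0.3]],             # W(3): h(2) -> h(3)
]
biases = [[0.1, 0.2], [0.0, -0.2], [0.3, 0.1], [-0.4]]

print(dbm_energy(layers, weights, biases))
```

With a single hidden layer, i.e. `layers = [v, h]`, the same function reduces to the energy of a restricted Boltzmann machine.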

## Multimodal deep Boltzmann machines

The multimodal deep Boltzmann machine[5][6] uses an image-text bi-modal DBM in which the image pathway is modeled as a Gaussian-Bernoulli DBM and the text pathway as a Replicated Softmax DBM; each DBM has two hidden layers and one visible layer. The two DBMs are joined at an additional top hidden layer. The joint distribution over the multimodal inputs is defined as

${\displaystyle {\begin{aligned}P(\mathbf {v} ^{m},\mathbf {v} ^{t};\theta )&=\sum _{\mathbf {h} ^{(2m)},\mathbf {h} ^{(2t)},\mathbf {h} ^{(3)}}P(\mathbf {h} ^{(2m)},\mathbf {h} ^{(2t)},\mathbf {h} ^{(3)})(\sum _{\mathbf {h} ^{(1m)}}P(\mathbf {v} ^{m},\mathbf {h} ^{(1m)}|\mathbf {h} ^{(2m)}))(\sum _{\mathbf {h} ^{(1t)}}P(\mathbf {v} ^{t},\mathbf {h} ^{(1t)}|\mathbf {h} ^{(2t)}))\\&={\frac {1}{{\mathcal {Z}}_{M}(\theta )}}\sum _{\mathbf {h} }\mathrm {exp} (\sum _{kj}W_{kj}^{(1t)}v_{k}^{t}h_{j}^{(1t)}\\&+\sum _{jl}W_{jl}^{(2t)}h_{j}^{(1t)}h_{l}^{(2t)}+\sum _{k}b_{k}^{t}v_{k}^{t}+M\sum _{j}b_{j}^{(1t)}h_{j}^{(1t)}+\sum _{l}b_{l}^{(2t)}h_{l}^{(2t)}\\&-\sum _{i}{\frac {(v_{i}^{m}-b_{i}^{m})^{2}}{2\sigma _{i}^{2}}}+\sum _{ij}{\frac {v_{i}^{m}}{\sigma _{i}}}W_{ij}^{(1m)}h_{j}^{(1m)}\\&+\sum _{jl}W_{jl}^{(2m)}h_{j}^{(1m)}h_{l}^{(2m)}+\sum _{j}b_{j}^{(1m)}h_{j}^{(1m)}+\sum _{l}b_{l}^{(2m)}h_{l}^{(2m)}\\&+\sum _{lp}W_{lp}^{(3t)}h_{l}^{(2t)}h_{p}^{(3)}+\sum _{lp}W_{lp}^{(3m)}h_{l}^{(2m)}h_{p}^{(3)}+\sum _{p}b_{p}^{(3)}h_{p}^{(3)})\end{aligned}}}$

The conditional distributions over the visible and hidden units are

${\displaystyle p(h_{j}^{(1m)}=1|\mathbf {v} ^{m},\mathbf {h} ^{(2m)})=g(\sum _{i=1}^{D}W_{ij}^{(1m)}{\frac {v_{i}^{m}}{\sigma _{i}}}+\sum _{l=1}^{F_{2}^{m}}W_{jl}^{(2m)}h_{l}^{(2m)}+b_{j}^{(1m)})}$
${\displaystyle p(h_{l}^{(2m)}=1|\mathbf {h} ^{(1m)},\mathbf {h} ^{(3)})=g(\sum _{j=1}^{F_{1}^{m}}W_{jl}^{(2m)}h_{j}^{(1m)}+\sum _{p=1}^{F_{3}}W_{lp}^{(3m)}h_{p}^{(3)}+b_{l}^{(2m)})}$
${\displaystyle p(h_{j}^{(1t)}=1|\mathbf {v} ^{t},\mathbf {h} ^{(2t)})=g(\sum _{k=1}^{K}W_{kj}^{(1t)}v_{k}^{t}+\sum _{l=1}^{F_{2}^{t}}W_{jl}^{(2t)}h_{l}^{(2t)}+Mb_{j}^{(1t)})}$
${\displaystyle p(h_{l}^{(2t)}=1|\mathbf {h} ^{(1t)},\mathbf {h} ^{(3)})=g(\sum _{j=1}^{F_{1}^{t}}W_{jl}^{(2t)}h_{j}^{(1t)}+\sum _{p=1}^{F_{3}}W_{lp}^{(3t)}h_{p}^{(3)}+b_{l}^{(2t)})}$
${\displaystyle p(h_{p}^{(3)}=1|\mathbf {h} ^{(2m)},\mathbf {h} ^{(2t)})=g(\sum _{l=1}^{F_{2}^{m}}W_{lp}^{(3m)}h_{l}^{(2m)}+\sum _{l=1}^{F_{2}^{t}}W_{lp}^{(3t)}h_{l}^{(2t)}+b_{p}^{(3)})}$
${\displaystyle p(v_{ik}^{t}=1|\mathbf {h} ^{(1t)})={\frac {\mathrm {exp} (\sum _{j=1}^{F_{1}^{t}}h_{j}^{(1t)}W_{kj}^{(1t)}+b_{k}^{t})}{\sum _{q=1}^{K}\mathrm {exp} (\sum _{j=1}^{F_{1}^{t}}h_{j}^{(1t)}W_{qj}^{(1t)}+b_{q}^{t})}}}$
${\displaystyle p(v_{i}^{m}|\mathbf {h} ^{(1m)})\sim {\mathcal {N}}(\sigma _{i}\sum _{j=1}^{F_{1}^{m}}W_{ij}^{(1m)}h_{j}^{(1m)}+b_{i}^{m},\sigma _{i}^{2})}$
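The conditionals for the second hidden layers and for the shared top layer show how the two pathways exchange information. The sketch below covers only these two formulas; the function names and the toy weights are hypothetical, and the same `p_second_layer` function is assumed to serve either pathway when given that pathway's parameters.

```python
import math

def logistic(x):
    """Logistic function g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def p_second_layer(h1, h3, W2, W3, b2):
    """p(h_l^(2) = 1 | h^(1), h^(3)) for one pathway (image or text).

    W2[j][l] connects h^(1) to h^(2); W3[l][p] connects h^(2) to h^(3).
    """
    return [logistic(sum(W2[j][l] * h1[j] for j in range(len(h1)))
                     + sum(W3[l][p] * h3[p] for p in range(len(h3)))
                     + b2[l])
            for l in range(len(b2))]

def p_joint_layer(h2m, h2t, W3m, W3t, b3):
    """p(h_p^(3) = 1 | h^(2m), h^(2t)): the layer shared by both pathways."""
    return [logistic(sum(W3m[l][p] * h2m[l] for l in range(len(h2m)))
                     + sum(W3t[l][p] * h2t[l] for l in range(len(h2t)))
                     + b3[p])
            for p in range(len(b3))]

# Toy sizes: 2 image units and 3 text units in the second layers, 2 joint units.
W3m = [[0.4, -0.6], [0.2, 0.1]]
W3t = [[0.3, 0.3], [-0.5, 0.2], [0.1, 0.4]]
b3 = [0.0, 0.1]
h2m = [1, 0]
h2t = [0, 1, 1]

top = p_joint_layer(h2m, h2t, W3m, W3t, b3)
print(top)

# The joint layer feeds back into the image pathway's second hidden layer.
W2m = [[0.2, -0.1], [0.3, 0.5]]
b2m = [0.0, 0.0]
print(p_second_layer([1, 1], [1, 0], W2m, W3m, b2m))
```

Because the top layer receives input from both second layers and sends input back to each of them, the state of the text pathway can influence the image pathway and vice versa, which is what allows one modality to be inferred from the other.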

### Inference and learning

Exact maximum likelihood learning in this model is intractable, but approximate learning of DBMs can be carried out by using a variational approach, where mean-field inference is used to estimate data-dependent expectations and an MCMC based stochastic approximation procedure is used to approximate the model’s expected sufficient statistics.[7]
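Mean-field inference replaces the true posterior over hidden units with a fully factorized distribution and iterates fixed-point updates on its means. The sketch below shows this for a plain binary DBM with two hidden layers rather than the full multimodal model; the function name, the initial value of 0.5, the iteration count and the parameters are assumptions made for illustration, and the stochastic approximation step for the model's expectations is not shown.

```python
import math

def logistic(x):
    """Logistic function g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def mean_field(v, W1, W2, b1, b2, n_iters=50):
    """Approximate p(h1, h2 | v) by independent units with means mu1, mu2.

    W1[i][j] connects visible unit i to h1 unit j;
    W2[j][l] connects h1 unit j to h2 unit l.
    """
    F1, F2 = len(b1), len(b2)
    mu1 = [0.5] * F1
    mu2 = [0.5] * F2
    for _ in range(n_iters):
        # First hidden layer: input from the data below and from mu2 above.
        mu1 = [logistic(sum(W1[i][j] * v[i] for i in range(len(v)))
                        + sum(W2[j][l] * mu2[l] for l in range(F2))
                        + b1[j])
               for j in range(F1)]
        # Second hidden layer: input from the freshly updated mu1.
        mu2 = [logistic(sum(W2[j][l] * mu1[j] for j in range(F1)) + b2[l])
               for l in range(F2)]
    return mu1, mu2

# Toy model: 3 visible units, 2 units in each hidden layer (made-up values).
W1 = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
W2 = [[0.3, -0.5], [0.2, 0.4]]
b1 = [0.0, 0.1]
b2 = [-0.1, 0.2]
v = [1, 0, 1]

mu1, mu2 = mean_field(v, W1, W2, b1, b2)
print(mu1)
print(mu2)
```

The resulting means can stand in for the hidden states when computing the data-dependent expectations, for example using `v[i] * mu1[j]` in place of the expectation of `v_i h_j`.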

## Application

Multimodal deep Boltzmann machines have been successfully used in classification and missing data retrieval. The classification accuracy of the multimodal deep Boltzmann machine exceeds that of support vector machines, latent Dirichlet allocation and deep belief networks when the models are tested on data with both image and text modalities or with a single modality. The multimodal deep Boltzmann machine is also able to predict a missing modality given the observed ones with reasonably good precision.