Multimodal learning

From Wikipedia, the free encyclopedia

Multimodal learning, in the context of machine learning, is a type of deep learning that uses a combination of data modalities, such as text, audio, or images, to build a more robust model of the real-world phenomena in question. In contrast, unimodal learning analyzes text (typically represented as a feature vector) or imaging data (consisting of pixel intensities and annotation tags) independently. Multimodal machine learning combines these fundamentally different statistical analyses using specialized modeling strategies and algorithms, resulting in a model that comes closer to representing the real world.


Many models and algorithms have been developed to retrieve and classify a single type of data, such as images or text. However, data often come in several modalities, each carrying different information. For example, an image is commonly captioned to convey information not present in the image itself, and an image may in turn express information that is not obvious from text. As a result, if different words appear with similar images, these words likely describe the same thing. Conversely, if a word is used to describe seemingly dissimilar images, these images may represent the same object. Thus, when dealing with multimodal data, it is important to use a model that can jointly represent the information so that it captures the correlation structure between modalities. The model should also be able to recover missing modalities from observed ones (e.g. predicting a plausible image object from a text description). The multimodal deep Boltzmann machine model satisfies both requirements.

Multimodal transformers

Transformers can also be adapted to input or output modalities beyond text, usually by finding a way to "tokenize" the modality.

Vision transformers[1] adapt the transformer to computer vision by breaking down input images as a series of patches, turning them into vectors, and treating them like tokens in a standard transformer.
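The patch-to-token step described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the implementation from the cited paper: the function names `patchify` and `project`, the toy 4x4 image, the patch size, and the embedding width of 8 are all assumptions chosen for readability, and the random projection weights stand in for parameters that would normally be learned.

```python
import random


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch vectors."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            vector = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    vector.extend(image[row][col])  # all channels of one pixel
            tokens.append(vector)
    return tokens


def project(tokens, weights):
    """Linear projection of each patch vector to the model's embedding width."""
    return [[sum(x * w for x, w in zip(token, row)) for row in weights]
            for token in tokens]


# Toy 4x4 RGB image whose red channel stores the pixel index.
image = [[[row * 4 + col, 0, 0] for col in range(4)] for row in range(4)]
tokens = patchify(image, patch=2)  # 4 patches, each 2 * 2 * 3 = 12 numbers

random.seed(0)
embed_dim = 8  # hypothetical model width
weights = [[random.gauss(0, 0.02) for _ in range(12)] for _ in range(embed_dim)]
embeddings = project(tokens, weights)  # one vector per patch, ready to use as tokens
```

A full vision transformer would also add position information to each patch embedding before passing the sequence to the transformer layers; that step is omitted here.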

Conformer[2] and later Whisper[3] follow the same pattern for speech recognition, first turning the speech signal into a spectrogram, which is then treated like an image, i.e. broken down into a series of patches, turned into vectors and treated like tokens in a standard transformer.
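The first step, turning a speech signal into a spectrogram, can be illustrated as follows. This is a simplified sketch that assumes a plain short-time Fourier transform with a Hann window and a naive DFT; the function name `spectrogram`, the frame length, and the hop size are illustrative choices, and production speech models typically use further processing (such as mel-scaled, log-compressed features) that is not shown.

```python
import cmath
import math


def spectrogram(signal, frame_len, hop):
    """Magnitude spectrogram: one list of frequency-bin magnitudes per frame."""
    # Periodic Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        magnitudes = []
        for k in range(frame_len // 2 + 1):  # non-negative frequencies only
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n, x in enumerate(frame))
            magnitudes.append(abs(coeff))
        frames.append(magnitudes)
    return frames


# Toy signal: a sine wave completing 2 cycles per 16-sample frame.
signal = [math.sin(2 * math.pi * 2 * n / 16) for n in range(64)]
spec = spectrogram(signal, frame_len=16, hop=8)

# Index of the strongest frequency bin in each frame.
peak_bins = [max(range(len(frame)), key=frame.__getitem__) for frame in spec]
```

The result is a two-dimensional array (time frames by frequency bins), which is what allows it to be handled like an image in the subsequent tokenization step.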

Perceivers by Andrew Jaegle et al. (2021)[4][5] can learn from large amounts of heterogeneous data.

Regarding image outputs, Peebles and Xie introduced a diffusion transformer (DiT), which facilitates the use of the transformer architecture for diffusion-based image generation.[6] Google also released a transformer-centric image generator called "Muse", based on parallel decoding and masked generative transformer technology.[7] (Transformers played a less central role in earlier image-generation technologies,[8] albeit still a significant one.[9])

Multimodal large language models

Multimodality means "having several modalities", and a "modality" refers to a type of input or output, such as video, image, audio, text, proprioception, etc.[10] There have been many AI models trained specifically to ingest one modality and output another modality, such as AlexNet for image to label,[11] visual question answering for image-text to text,[12] and speech recognition for speech to text.

A common method to create multimodal models out of an LLM is to "tokenize" the output of a trained encoder. Concretely, one can construct an LLM that can understand images as follows: take a trained LLM, and take a trained image encoder E. Make a small multilayer perceptron f, so that for any image y, the post-processed vector f(E(y)) has the same dimensions as an encoded token. That is an "image token". Then, one can interleave text tokens and image tokens. The compound model is then fine-tuned on an image-text dataset. This basic construction can be applied with more sophistication to improve the model. The image encoder may be frozen to improve stability.[13]
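The construction can be sketched with toy dimensions. Everything below is an assumption made for illustration: the encoder `E` is a random linear map standing in for a trained image encoder, the perceptron `f` has a single hidden layer with untrained weights, and the vocabulary and embedding table are hypothetical. The only point is to show that the image token ends up with the same width as the text token embeddings and can therefore be placed in the same sequence.

```python
import random

random.seed(0)
N_PIXELS, D_FEATURE, D_HIDDEN, D_TOKEN = 9, 6, 5, 4  # toy sizes, chosen arbitrarily


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Stand-in for a trained (and possibly frozen) image encoder E.
E_WEIGHTS = rand_matrix(D_FEATURE, N_PIXELS)


def E(image_pixels):
    return matvec(E_WEIGHTS, image_pixels)


# Small perceptron f mapping encoder features to the token embedding width.
# In practice these weights would be learned during fine-tuning.
W1, W2 = rand_matrix(D_HIDDEN, D_FEATURE), rand_matrix(D_TOKEN, D_HIDDEN)


def f(features):
    hidden = [max(0.0, h) for h in matvec(W1, features)]  # ReLU
    return matvec(W2, hidden)


# Hypothetical token embedding table of the language model.
vocab = ["what", "is", "in", "?"]
embedding = {word: [random.uniform(-1, 1) for _ in range(D_TOKEN)] for word in vocab}

y = [random.random() for _ in range(N_PIXELS)]  # a flattened toy "image"
image_token = f(E(y))

# Interleave: text tokens, then the image token, then the remaining text.
text = [embedding[word] for word in vocab]
sequence = text[:3] + [image_token] + text[3:]
```

Real systems usually produce many image tokens per image rather than one, but the interface to the language model is the same: a sequence of vectors of the token embedding width.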

Flamingo demonstrated the effectiveness of the tokenization method, fine-tuning a pretrained language model paired with a pretrained image encoder to perform better on visual question answering than models trained from scratch.[14] Google's PaLM model was fine-tuned into the multimodal model PaLM-E using the tokenization method and applied to robotic control.[15] LLaMA models have also been made multimodal using the tokenization method, to allow image inputs[16] and video inputs.[17]

GPT-4 can use both text and images as inputs[18] (although the vision component was not released to the public until GPT-4V[19]); Google DeepMind's Gemini is also multimodal.[20]

Multimodal deep Boltzmann machines

A Boltzmann machine is a type of stochastic neural network invented by Geoffrey Hinton and Terry Sejnowski in 1985. Boltzmann machines can be seen as the stochastic, generative counterpart of Hopfield nets. They are named after the Boltzmann distribution in statistical mechanics. The units in Boltzmann machines are divided into two groups: visible units and hidden units. Each unit is like a neuron with a binary output that represents whether it is activated or not.[21] General Boltzmann machines allow connections between any units. However, learning is impractical with general Boltzmann machines because the computation time grows exponentially with the size of the machine[citation needed]. A more efficient architecture is the restricted Boltzmann machine, in which connections are allowed only between a hidden unit and a visible unit.
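One practical consequence of the restricted connectivity is that, given the visible units, the hidden units are conditionally independent of one another (and vice versa), so each layer can be sampled in a single pass. The sketch below illustrates one such alternating sampling step for binary units. The layer sizes, random weights, and function names are assumptions for illustration only, and no training is shown.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_hidden(v, W, b_h, rng):
    """Sample all hidden units given the visible units; W[i][j] links v[i] to h[j]."""
    probs = [sigmoid(b_h[j] + sum(v[i] * W[i][j] for i in range(len(v))))
             for j in range(len(b_h))]
    return [1 if rng.random() < p else 0 for p in probs], probs


def sample_visible(h, W, b_v, rng):
    """Sample all visible units given the hidden units."""
    probs = [sigmoid(b_v[i] + sum(W[i][j] * h[j] for j in range(len(h))))
             for i in range(len(b_v))]
    return [1 if rng.random() < p else 0 for p in probs], probs


rng = random.Random(0)
n_visible, n_hidden = 6, 3  # toy sizes
W = [[rng.gauss(0, 0.1) for _ in range(n_hidden)] for _ in range(n_visible)]
b_v, b_h = [0.0] * n_visible, [0.0] * n_hidden

v0 = [1, 0, 1, 1, 0, 0]                      # an observed binary pattern
h0, p_h = sample_hidden(v0, W, b_h, rng)     # visible -> hidden
v1, p_v = sample_visible(h0, W, b_v, rng)    # hidden -> visible (reconstruction)
```

In a general Boltzmann machine, units within the same group are also connected, so this layer-at-a-time sampling is not available.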

Multimodal deep Boltzmann machines can process and learn from different types of information, such as images and text, simultaneously. This can notably be done by having a separate deep Boltzmann machine for each modality, for example one for images and one for text, joined at an additional top hidden layer.[22]
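The idea of joining two modality-specific pathways at a shared top layer can be sketched as follows. This is a heavily simplified illustration, not the procedure from the cited work: the weights are random rather than learned, the names `joint_probs` and `text_from_joint` are hypothetical, and inference is reduced to a single deterministic pass, whereas inference in a deep Boltzmann machine is iterative (for example Gibbs sampling or mean-field updates).

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


rng = random.Random(0)
n_img, n_txt, n_joint = 4, 3, 5  # toy layer sizes

# One weight matrix per modality, both feeding the same joint hidden layer.
W_img = [[rng.gauss(0, 0.5) for _ in range(n_img)] for _ in range(n_joint)]
W_txt = [[rng.gauss(0, 0.5) for _ in range(n_txt)] for _ in range(n_joint)]
b_joint = [0.0] * n_joint


def joint_probs(h_img, h_txt):
    """Activation probabilities of the joint layer given both pathways' top layers."""
    return [sigmoid(b_joint[k]
                    + sum(w * h for w, h in zip(W_img[k], h_img))
                    + sum(w * h for w, h in zip(W_txt[k], h_txt)))
            for k in range(n_joint)]


def text_from_joint(h_joint):
    """Top-down pass from the joint layer to the text pathway (same weights, transposed)."""
    return [sigmoid(sum(W_txt[k][j] * h_joint[k] for k in range(n_joint)))
            for j in range(n_txt)]


h_img = [1, 0, 1, 1]  # top hidden layer of the image pathway
h_txt = [0, 1, 1]     # top hidden layer of the text pathway
p_joint = joint_probs(h_img, h_txt)

# Text missing: drive the joint layer from the image alone, then project
# down to the text pathway (single-pass simplification).
p_joint_image_only = joint_probs(h_img, [0.0] * n_txt)
p_txt_guess = text_from_joint(p_joint_image_only)
```

Because both pathways share the joint layer, information observed in one modality can influence the units of the other, which is the basis for filling in a missing modality.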


Multimodal deep Boltzmann machines have been used successfully for classification and missing-data retrieval. Their classification accuracy exceeds that of support vector machines, latent Dirichlet allocation and deep belief networks when models are tested on data with both image and text modalities or with a single modality.[citation needed] Multimodal deep Boltzmann machines are also able to predict missing modalities given the observed ones with reasonably good precision.[citation needed] Self-supervised learning has since produced more powerful multimodal models; OpenAI's CLIP and DALL-E models are prominent examples.

Multimodal deep learning has also been applied to cancer screening: at least one system under development integrates different types of data, such as histology and genomic data.[23][24]

References


  1. ^ Dosovitskiy, Alexey; Beyer, Lucas; Kolesnikov, Alexander; Weissenborn, Dirk; Zhai, Xiaohua; Unterthiner, Thomas; Dehghani, Mostafa; Minderer, Matthias; Heigold, Georg; Gelly, Sylvain; Uszkoreit, Jakob (2021-06-03). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". arXiv:2010.11929 [cs.CV].
  2. ^ Gulati, Anmol; Qin, James; Chiu, Chung-Cheng; Parmar, Niki; Zhang, Yu; Yu, Jiahui; Han, Wei; Wang, Shibo; Zhang, Zhengdong; Wu, Yonghui; Pang, Ruoming (2020). "Conformer: Convolution-augmented Transformer for Speech Recognition". arXiv:2005.08100 [eess.AS].
  3. ^ Radford, Alec; Kim, Jong Wook; Xu, Tao; Brockman, Greg; McLeavey, Christine; Sutskever, Ilya (2022). "Robust Speech Recognition via Large-Scale Weak Supervision". arXiv:2212.04356 [eess.AS].
  4. ^ Jaegle, Andrew; Gimeno, Felix; Brock, Andrew; Zisserman, Andrew; Vinyals, Oriol; Carreira, Joao (2021-06-22). "Perceiver: General Perception with Iterative Attention". arXiv:2103.03206 [cs.CV].
  5. ^ Jaegle, Andrew; Borgeaud, Sebastian; Alayrac, Jean-Baptiste; Doersch, Carl; Ionescu, Catalin; Ding, David; Koppula, Skanda; Zoran, Daniel; Brock, Andrew; Shelhamer, Evan; Hénaff, Olivier (2021-08-02). "Perceiver IO: A General Architecture for Structured Inputs & Outputs". arXiv:2107.14795 [cs.LG].
  6. ^ Peebles, William; Xie, Saining (March 2, 2023). "Scalable Diffusion Models with Transformers". arXiv:2212.09748 [cs.CV].
  7. ^ "Google AI Unveils Muse, a New Text-to-Image Transformer Model". InfoQ.
  8. ^ "Using Diffusion Models to Create Superior NeRF Avatars". January 5, 2023.
  9. ^ Islam, Arham (November 14, 2022). "How Do DALL·E 2, Stable Diffusion, and Midjourney Work?".
  10. ^ Kiros, Ryan; Salakhutdinov, Ruslan; Zemel, Rich (2014-06-18). "Multimodal Neural Language Models". Proceedings of the 31st International Conference on Machine Learning. PMLR: 595–603.
  11. ^ Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E (2012). "ImageNet Classification with Deep Convolutional Neural Networks". Advances in Neural Information Processing Systems. 25. Curran Associates, Inc.
  12. ^ Antol, Stanislaw; Agrawal, Aishwarya; Lu, Jiasen; Mitchell, Margaret; Batra, Dhruv; Zitnick, C. Lawrence; Parikh, Devi (2015). "VQA: Visual Question Answering". ICCV: 2425–2433.
  13. ^ Li, Junnan; Li, Dongxu; Savarese, Silvio; Hoi, Steven (2023-01-01). "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models". arXiv:2301.12597 [cs.CV].
  14. ^ Alayrac, Jean-Baptiste; Donahue, Jeff; Luc, Pauline; Miech, Antoine; Barr, Iain; Hasson, Yana; Lenc, Karel; Mensch, Arthur; Millican, Katherine; Reynolds, Malcolm; Ring, Roman; Rutherford, Eliza; Cabi, Serkan; Han, Tengda; Gong, Zhitao (2022-12-06). "Flamingo: a Visual Language Model for Few-Shot Learning". Advances in Neural Information Processing Systems. 35: 23716–23736. arXiv:2204.14198.
  15. ^ Driess, Danny; Xia, Fei; Sajjadi, Mehdi S. M.; Lynch, Corey; Chowdhery, Aakanksha; Ichter, Brian; Wahid, Ayzaan; Tompson, Jonathan; Vuong, Quan; Yu, Tianhe; Huang, Wenlong; Chebotar, Yevgen; Sermanet, Pierre; Duckworth, Daniel; Levine, Sergey (2023-03-01). "PaLM-E: An Embodied Multimodal Language Model". arXiv:2303.03378 [cs.LG].
  16. ^ Liu, Haotian; Li, Chunyuan; Wu, Qingyang; Lee, Yong Jae (2023-04-01). "Visual Instruction Tuning". arXiv:2304.08485 [cs.CV].
  17. ^ Zhang, Hang; Li, Xin; Bing, Lidong (2023-06-01). "Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding". arXiv:2306.02858 [cs.CL].
  18. ^ OpenAI (2023-03-27). "GPT-4 Technical Report". arXiv:2303.08774 [cs.CL].
  19. ^ OpenAI (September 25, 2023). "GPT-4V(ision) System Card" (PDF).
  20. ^ Pichai, Sundar, Google Keynote (Google I/O '23), timestamp 15:31, retrieved 2023-07-02
  21. ^ Dey, Victor (2021-09-03). "Beginners Guide to Boltzmann Machine". Analytics India Magazine. Retrieved 2024-03-02.
  22. ^ "Multimodal Learning with Deep Boltzmann Machine" (PDF). 2014. Archived (PDF) from the original on 2015-06-21. Retrieved 2015-06-14.
  23. ^ Quach, Katyanna. "Harvard boffins build multimodal AI system to predict cancer". The Register. Archived from the original on 20 September 2022. Retrieved 16 September 2022.
  24. ^ Chen, Richard J.; Lu, Ming Y.; Williamson, Drew F. K.; Chen, Tiffany Y.; Lipkova, Jana; Noor, Zahra; Shaban, Muhammad; Shady, Maha; Williams, Mane; Joo, Bumjin; Mahmood, Faisal (8 August 2022). "Pan-cancer integrative histology-genomic analysis via multimodal deep learning". Cancer Cell. 40 (8): 865–878.e6. doi:10.1016/j.ccell.2022.07.004. ISSN 1535-6108. PMC 10397370. PMID 35944502. S2CID 251456162.