Foundation models

From Wikipedia, the free encyclopedia

A foundation model (also called a base model)[1] is a large machine learning (ML) model trained on a vast quantity of data (often by self-supervised learning or semi-supervised learning)[2] such that it can be adapted to a wide range of downstream tasks.[3][4] Foundation models have helped bring about a major transformation in how artificial intelligence (AI) systems are built, for example by powering prominent chatbots and other user-facing AI. The Stanford Institute for Human-Centered Artificial Intelligence's (HAI) Center for Research on Foundation Models (CRFM) popularized the term.[3]

Early examples of foundation models were pre-trained language models (LMs), including Google's BERT[5] and various early GPT foundation models, notably OpenAI's "GPT-n" series. Such broad models can in turn be used to build task- and/or domain-specific models using targeted datasets of various kinds, such as medical codes.[6]

Beyond text, several visual and multimodal foundation models have been produced, including DALL-E, Flamingo,[7] Florence,[8] and NOOR.[9] Visual foundation models (VFMs) have been combined with text-based LLMs to develop sophisticated task-specific models.[10] There is also Segment Anything by Meta AI for general image segmentation.[11] For reinforcement learning agents, there is GATO by Google DeepMind.[12][13]


The Stanford Institute for Human-Centered Artificial Intelligence's (HAI) Center for Research on Foundation Models (CRFM) coined the term "foundation model" in August 2021, tentatively referring to "any model that is trained on broad data (generally using self-supervision at scale) that can be adapted (e.g., fine-tuned) to a wide range of downstream tasks".[14] This was based on their observation that existing overlapping terms were not adequate, submitting that "'(large) language model' was too narrow given [the] focus is not only language; 'self-supervised model' was too specific to the training objective; and 'pretrained model' suggested that the noteworthy action all happened after 'pretraining.'"[15] After considering many terms, they settled on "foundation model" to emphasize the intended function (i.e., amenability to subsequent further development) rather than modality, architecture, or implementation.

They also noted that the concept is not truly new, as it is based on deep neural networks and self-supervised learning, but asserted that the scale at which the area has developed in recent years, and the increasing potential for any given model to be used for different purposes, warranted a new term.[14]

A foundation model is a "paradigm for building AI systems" in which a model trained on a large amount of unlabeled data can be adapted to many applications.[16][17] Foundation models are "designed to be adapted (e.g., finetuned) to various downstream cognitive tasks by pre-training on broad data at scale".[18]
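As a rough illustration of adaptation, the toy sketch below keeps a stand-in "base model" frozen and trains only a small task-specific head on top of its features. Everything in it is an assumption made for illustration: `frozen_base_features` is a hypothetical fixed feature map rather than a real pretrained network, and the downstream task (deciding whether |x| > 1) is invented. It is not taken from the cited sources.

```python
import math
import random

# Hypothetical stand-in for a pretrained base model: a fixed feature map.
# It is never modified during adaptation.
def frozen_base_features(x):
    return [x, x * x, math.sin(x)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def head_score(weights, bias, x):
    feats = frozen_base_features(x)
    return sigmoid(sum(w * f for w, f in zip(weights, feats)) + bias)

def train_head(examples, steps=1000, lr=0.5):
    """Fit a small logistic-regression head with batch gradient descent.

    Only the head's weights and bias are updated; the base stays frozen.
    """
    weights = [0.0, 0.0, 0.0]
    bias = 0.0
    n = len(examples)
    for _ in range(steps):
        grad_w = [0.0, 0.0, 0.0]
        grad_b = 0.0
        for x, label in examples:
            err = head_score(weights, bias, x) - label
            for i, f in enumerate(frozen_base_features(x)):
                grad_w[i] += err * f
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict(weights, bias, x):
    return 1 if head_score(weights, bias, x) >= 0.5 else 0

# Invented downstream task: label 1 when |x| > 1, else 0.
random.seed(0)
train = [(x, 1 if abs(x) > 1 else 0)
         for x in (random.uniform(-2, 2) for _ in range(200))]

weights, bias = train_head(train)
accuracy = sum(predict(weights, bias, x) == y for x, y in train) / len(train)
print(f"training accuracy of the adapted head: {accuracy:.2f}")
```

The point of the design is that the expensive component is reused unchanged, while a few task-specific parameters are learned from a small labelled dataset. Full fine-tuning, which also updates the base model's parameters, is another form of adaptation and is not shown here.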

Key characteristics of foundation models are emergence and homogenization.[14] Because the training data is not labelled by humans, the model's behavior emerges from the data rather than being explicitly encoded, so properties that were not anticipated can appear. For example, a model trained on a large language dataset might learn to generate stories of its own, or to do arithmetic, without being explicitly programmed to do so.[19] Furthermore, these properties can sometimes be hard to predict beforehand due to breaks[20] in downstream scaling laws. Homogenization means that the same method is used in many domains, which allows for powerful advances but also creates the possibility of "single points of failure".[14]

Personalizing foundation models

Since foundation models are pre-trained on a massive dataset, they are not capable of handling specific "personal" concepts that a user may be interested in. A series of methods has been designed to augment a foundation model with personal, specific items without retraining the full model. For example, for few-shot image retrieval, it was shown how to adapt a vision-language foundation model (CLIP) by adding new concepts to its vocabulary.[21] For text-to-image generation, an approach called textual inversion[22] can similarly be used to teach the system a new concept that can later be generated in conjunction with the concepts that the foundation model is already familiar with.
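The general idea of adding a concept without retraining the model can be sketched with a toy example: the model's weights stay frozen, and only the embedding of one new vocabulary entry is optimized until the frozen model's output matches a target. This is a simplified illustration, not the procedure of the cited papers; `FROZEN_WEIGHTS`, `frozen_model`, the token name "fluffy", and the target vector are all hypothetical.

```python
# Hypothetical frozen "model": a fixed linear map from a 3-d token
# embedding to a 2-d output. These weights are never updated.
FROZEN_WEIGHTS = [[1.0, 0.5, -0.5],
                  [0.0, 1.0, 2.0]]

def frozen_model(embedding):
    return [sum(w * e for w, e in zip(row, embedding)) for row in FROZEN_WEIGHTS]

# Existing vocabulary the model already "knows" (toy embeddings).
vocabulary = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}

def learn_concept_embedding(target, steps=500, lr=0.05):
    """Gradient descent on the new embedding only, minimizing squared error
    between the frozen model's output and the target output."""
    embedding = [0.0, 0.0, 0.0]
    for _ in range(steps):
        residual = [o - t for o, t in zip(frozen_model(embedding), target)]
        # Gradient of the squared error w.r.t. the embedding: 2 * W^T * residual
        grad = [2.0 * sum(FROZEN_WEIGHTS[r][c] * residual[r]
                          for r in range(len(residual)))
                for c in range(len(embedding))]
        embedding = [e - lr * g for e, g in zip(embedding, grad)]
    return embedding

# Hypothetical target: the output the model should produce for the new concept,
# e.g. derived from a handful of user-provided examples.
target = [0.8, -1.2]
vocabulary["fluffy"] = learn_concept_embedding(target)
output = frozen_model(vocabulary["fluffy"])
print("output for new token:", [round(v, 4) for v in output])
```

Because only one small vector is learned, the existing vocabulary entries and the model weights are untouched, so the new token can be used alongside the concepts the model already handles.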

Opportunities and risks

A 2021 arXiv report listed foundation models' capabilities with regard to "language, vision, robotics, reasoning, and human interaction"; technical principles, such as "model architectures, training procedures, data, systems, security, evaluation, and theory"; their applications, for example in law, healthcare, and education; and their potential impact on society, including "inequity, misuse, economic and environmental impact, legal and ethical considerations".[14]

An article about foundation models in The Economist notes that "some worry that the technology's heedless spread will further concentrate economic and political power".[19]


  1. ^ Perrigo, Billy (13 April 2023). "The A to Z of Artificial Intelligence". Time. Retrieved 22 May 2023.
  2. ^ Goled, Shraddha (7 May 2021). "Self-Supervised Learning Vs Semi-Supervised Learning: How They Differ". Analytics India Magazine. Retrieved 22 May 2023.
  3. ^ a b "Introducing the Center for Research on Foundation Models (CRFM)". Stanford HAI. Retrieved 11 June 2022.
  4. ^ Goldman, Sharon (13 September 2022). "Foundation models: 2022's AI paradigm shift". VentureBeat. Retrieved 24 October 2022.
  5. ^ Rogers, Anna; Kovaleva, Olga; Rumshisky, Anna (2020). "A Primer in BERTology: What we know about how BERT works". arXiv:2002.12327 [cs.CL].
  6. ^ Steinberg, Ethan; Jung, Ken; Fries, Jason A.; Corbin, Conor K.; Pfohl, Stephen R.; Shah, Nigam H. (January 2021). "Language models are an effective representation learning technique for electronic health record data". Journal of Biomedical Informatics. 113: 103637. doi:10.1016/j.jbi.2020.103637. ISSN 1532-0480. PMC 7863633. PMID 33290879.
  7. ^ Tackling multiple tasks with a single visual language model, 28 April 2022, retrieved 13 June 2022
  8. ^ Yuan, Lu; Chen, Dongdong; Chen, Yi-Ling; Codella, Noel; Dai, Xiyang; Gao, Jianfeng; Hu, Houdong; Huang, Xuedong; Li, Boxin; Li, Chunyuan; Liu, Ce; Liu, Mengchen; Liu, Zicheng; Lu, Yumao; Shi, Yu; Wang, Lijuan; Wang, Jianfeng; Xiao, Bin; Xiao, Zhen; Yang, Jianwei; Zeng, Michael; Zhou, Luowei; Zhang, Pengchuan (2022). "Florence: A New Foundation Model for Computer Vision". arXiv:2111.11432 [cs.CV].
  9. ^ "Technology Innovation Institute Announces Launch of NOOR, the World's Largest Arabic NLP Model".
  10. ^ Chenfei Wu; et al. (2023). "Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models". Cornell University. arXiv:2303.04671.
  11. ^ "Segment Anything | Meta AI". Retrieved 21 June 2023.
  12. ^ "A Generalist Agent". Retrieved 21 June 2023.
  13. ^ "RoboCat: A self-improving robotic agent". Retrieved 21 June 2023.
  14. ^ a b c d e Bommasani, Rishi; Hudson, Drew A.; Adeli, Ehsan; Altman, Russ; Arora, Simran; von Arx, Sydney; Bernstein, Michael S.; Bohg, Jeannette; Bosselut, Antoine; Brunskill, Emma; Brynjolfsson, Erik; Buch, Shyamal; Card, Dallas; Castellon, Rodrigo; Chatterji, Niladri; Chen, Annie; Creel, Kathleen; Davis, Jared Quincy; Demszky, Dora; Donahue, Chris; Doumbouya, Moussa; Durmus, Esin; Ermon, Stefano; Etchemendy, John; Ethayarajh, Kawin; Fei-Fei, Li; Finn, Chelsea; Gale, Trevor; Gillespie, Lauren; Goel, Karan; Goodman, Noah; Grossman, Shelby; Guha, Neel; Hashimoto, Tatsunori; Henderson, Peter; Hewitt, John; Ho, Daniel E.; Hong, Jenny; Hsu, Kyle; Huang, Jing; Icard, Thomas; Jain, Saahil; Jurafsky, Dan; Kalluri, Pratyusha; Karamcheti, Siddharth; Keeling, Geoff; Khani, Fereshte; Khattab, Omar; Koh, Pang Wei; Krass, Mark; Krishna, Ranjay; Kuditipudi, Rohith; Kumar, Ananya; Ladhak, Faisal; Lee, Mina; Lee, Tony; Leskovec, Jure; Levent, Isabelle; Li, Xiang Lisa; Li, Xuechen; Ma, Tengyu; Malik, Ali; Manning, Christopher D.; Mirchandani, Suvir; Mitchell, Eric; Munyikwa, Zanele; Nair, Suraj; Narayan, Avanika; Narayanan, Deepak; Newman, Ben; Nie, Allen; Niebles, Juan Carlos; Nilforoshan, Hamed; Nyarko, Julian; Ogut, Giray; Orr, Laurel; Papadimitriou, Isabel; Park, Joon Sung; Piech, Chris; Portelance, Eva; Potts, Christopher; Raghunathan, Aditi; Reich, Rob; Ren, Hongyu; Rong, Frieda; Roohani, Yusuf; Ruiz, Camilo; Ryan, Jack; Ré, Christopher; Sadigh, Dorsa; Sagawa, Shiori; Santhanam, Keshav; Shih, Andy; Srinivasan, Krishnan; Tamkin, Alex; Taori, Rohan; Thomas, Armin W.; Tramèr, Florian; Wang, Rose E.; Wang, William; Wu, Bohan; Wu, Jiajun; Wu, Yuhuai; Xie, Sang Michael; Yasunaga, Michihiro; You, Jiaxuan; Zaharia, Matei; Zhang, Michael; Zhang, Tianyi; Zhang, Xikun; Zhang, Yuhui; Zheng, Lucia; Zhou, Kaitlyn; Liang, Percy (18 August 2021). On the Opportunities and Risks of Foundation Models (Report). arXiv:2108.07258.
  15. ^ "Reflections on Foundation Models". Stanford HAI. 18 October 2021. Retrieved 22 May 2023.
  16. ^ "Stanford CRFM". Retrieved 10 June 2022.
  17. ^ "What are foundation models?". IBM Research Blog. 9 February 2021. Retrieved 10 June 2022.
  18. ^ Fei, Nanyi; Lu, Zhiwu; Gao, Yizhao; Yang, Guoxing; Huo, Yuqi; Wen, Jingyuan; Lu, Haoyu; Song, Ruihua; Gao, Xin; Xiang, Tao; Sun, Hao; Wen, Ji-Rong (December 2022). "Towards artificial general intelligence via a multimodal foundation model". Nature Communications. 13 (1): 3094. arXiv:2110.14378. Bibcode:2022NatCo..13.3094F. doi:10.1038/s41467-022-30761-2. ISSN 2041-1723. PMC 9163040. PMID 35655064.
  19. ^ a b "Huge "foundation models" are turbo-charging AI progress". The Economist. ISSN 0013-0613. Retrieved 24 October 2022.
  20. ^ Caballero, Ethan; Gupta, Kshitij; Rish, Irina; Krueger, David (2022). "Broken Neural Scaling Laws". International Conference on Learning Representations (ICLR), 2023.
  21. ^ Cohen, Niv; Gal, Rinon; Meirom, Eli A.; Chechik, Gal; Atzmon, Yuval (23 October 2022). ""This is My Unicorn, Fluffy": Personalizing Frozen Vision-Language Representations". Computer Vision – ECCV 2022. Lecture Notes in Computer Science. Vol. 13680. Berlin, Heidelberg: Springer-Verlag. pp. 558–577. arXiv:2204.01694. doi:10.1007/978-3-031-20044-1_32. ISBN 978-3-031-20043-4.
  22. ^ Gal, Rinon; Alaluf, Yuval; Atzmon, Yuval; Patashnik, Or; Bermano, Amit H.; Chechik, Gal; Cohen-Or, Daniel (2 August 2022). "An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion". arXiv:2208.01618 [cs.CV].