Large language model

From Wikipedia, the free encyclopedia

A large language model (LLM) is a general-purpose language model consisting of a neural network with many parameters (i.e., billions of weights or more). LLMs trained on large quantities of unlabelled text perform well at a wide variety of tasks, a development which, since their emergence around 2018, has shifted the focus of natural language processing research away from the previous paradigm of training specialized supervised models for specific tasks.[1]

Properties

Though the term large language model has no formal definition, it generally refers to deep learning models with a parameter count on the order of billions or more.[2] LLMs are general-purpose models that excel at a wide range of tasks, as opposed to being trained for one specific task (such as sentiment analysis, named entity recognition, or mathematical reasoning).[1][3]

Between 2018 and 2020, the standard method for harnessing an LLM for a specific NLP task was to fine-tune the model with additional task-specific training. It has subsequently been found that more powerful LLMs such as GPT-3 can solve tasks without parameter updates using the technique of "few-shot prompting", in which the model is given a text prompt containing a small number of solved examples of a particular task and must complete an unsolved instance at the end of the prompt.[1] For example, a sentiment analysis task of labelling the sentiment of a movie review could be prompted as follows:[3]

Review: This movie stinks.
Sentiment: negative

Review: This movie is fantastic!
Sentiment:

If the model outputs "positive", then it has correctly solved the task. LLMs may also perform well at "zero-shot" prompts, in which they must solve a novel task presented in a text prompt without any preceding examples.[4]
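
Constructing such a prompt is straightforward to automate. The following Python sketch builds a few-shot sentiment prompt in the format shown above and passes it to a text-completion model; the generate() function is a hypothetical stand-in for whatever LLM interface is actually used, and the second labelled example is invented for illustration.

# Minimal sketch of few-shot prompting for sentiment analysis.
# `generate` is a hypothetical stand-in for an LLM text-completion call.

def build_few_shot_prompt(examples, query):
    """Concatenate labelled example reviews, then the unsolved review."""
    parts = [f"Review: {text}\nSentiment: {label}" for text, label in examples]
    parts.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(parts)

examples = [
    ("This movie stinks.", "negative"),
    ("A moving, beautifully acted film.", "positive"),  # invented example
]
prompt = build_few_shot_prompt(examples, "This movie is fantastic!")

# completion = generate(prompt)  # hypothetical model call
# The task is solved correctly if the completion begins with "positive".

The same helper covers the zero-shot case: passing an empty list of examples leaves only the unsolved review in the prompt.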

Architecture

Since 2018, large language models have generally used the transformer architecture (whereas, previously, recurrent architectures such as the LSTM were most common).[1]

LLMs are computationally expensive to train. A 2020 study estimated the cost of training a 1.5 billion parameter model (1-2 orders of magnitude smaller than the state of the art at the time) at $1.6 million.[4]

A 2020 analysis found that neural language models' capability (as measured by training loss) increased smoothly in a power-law relationship with the number of parameters, the quantity of training data, and the computation used for training.[5][6] These relationships were tested over a wide range of values (up to seven orders of magnitude), and no attenuation was observed at the highest end of the range (including for network sizes up to trillions of parameters).[6]
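
As a concrete illustration of what such a power-law relationship looks like, the Python sketch below fits a curve of the form loss(N) = a · N^(−α) to a handful of (parameter count, training loss) pairs. The data points and the resulting constants are invented for illustration only; they are not taken from the cited study.

# Illustrative sketch: fitting a power law loss(N) = a * N**(-alpha)
# to synthetic (parameter count, training loss) measurements.
# The numbers below are invented, not drawn from the cited scaling-law study.
import numpy as np

params = np.array([1e6, 1e7, 1e8, 1e9, 1e10])  # model sizes N
loss = np.array([5.0, 4.2, 3.5, 2.9, 2.4])     # hypothetical training losses

# A power law is linear in log-log space: log(loss) = log(a) - alpha * log(N),
# so a straight-line fit recovers the exponent alpha and the constant a.
slope, intercept = np.polyfit(np.log(params), np.log(loss), 1)
alpha, a = -slope, np.exp(intercept)

print(f"fitted loss ≈ {a:.2f} * N^(-{alpha:.3f})")

An analogous fit applies to training data quantity and training compute; the cited analysis reports a separate power-law exponent for each of the three quantities.[6]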

List of large language models

Name | Year | Developer | Number of parameters[a] | Notes
BERT | 2018 | Google | 340 million[7]
GPT-2 | 2019 | OpenAI | 1.5 billion[8]
GPT-3 | 2020 | OpenAI | 175 billion[4] | A fine-tuned variant of GPT-3, termed GPT-3.5, was made available to the public through a web interface called ChatGPT in 2022.[9]
GLaM (Generalist Language Model) | 2021 | Google | 1.2 trillion[10]
Megatron-Turing NLG | 2022 | Microsoft and Nvidia | 530 billion[11]
LaMDA (Language Model for Dialogue Applications) | 2022 | Google | 137 billion[12]
PaLM (Pathways Language Model) | 2022 | Google | 540 billion[13]
Chinchilla | 2022 | DeepMind | 70 billion[14]
BLOOM | 2022 | Various | 175 billion[5] | Developed by a team of around 1,000 researchers with funding from the French government and the US company Hugging Face.[5]
LLaMA (Large Language Model Meta AI) | 2023 | Meta | 65 billion[15]

Notes

  1. ^ In many cases, researchers release or report on multiple versions of a model having different sizes. In these cases, the size of the largest model is listed here.

References

  1. ^ a b c d Manning, Christopher D. (2022). "Human Language Understanding & Reasoning". Daedalus.
  2. ^ Carlini, Nicholas; Tramer, Florian; Wallace, Eric; Jagielski, Matthew; Herbert-Voss, Ariel; Lee, Katherine; Roberts, Adam; Brown, Tom B; Song, Dawn; Erlingsson, Ulfar (2021). Extracting Training Data from Large Language Models (PDF). USENIX Security Symposium. Vol. 6.
  3. ^ a b Wei, Jason. "Emergent Abilities of Large Language Models".
  4. ^ a b c Wiggers, Kyle (28 April 2022). "The emerging types of language models and why they matter". TechCrunch.
  5. ^ a b c Ananthaswamy, Anil (8 March 2023). "In AI, is bigger always better?". Nature.
  6. ^ a b Kaplan, Jared; McCandlish, Sam; Henighan, Tom; Brown, Tom B.; Chess, Benjamin; Child, Rewon; Gray, Scott; Radford, Alec; Wu, Jeffrey; Amodei, Dario (2020). "Scaling Laws for Neural Language Models". CoRR. abs/2001.08361. arXiv:2001.08361.
  7. ^ Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (11 October 2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805v2 [cs.CL].
  8. ^ "GPT-2: 1.5B Release". OpenAI. 2019-11-05. Archived from the original on 2019-11-14. Retrieved 2019-11-14.
  9. ^ "ChatGPT: Optimizing Language Models for Dialogue". OpenAI. 2022-11-30. Retrieved 2023-01-13.
  10. ^ Dai, Andrew M; Du, Nan (December 9, 2021). "More Efficient In-Context Learning with GLaM". ai.googleblog.com. Retrieved 2023-03-09.
  11. ^ Smith, Shaden; Patwary, Mostofa; Norick, Brandon; LeGresley, Patrick; Rajbhandari, Samyam; Casper, Jared; Liu, Zhun; Prabhumoye, Shrimai; Zerveas, George; Korthikanti, Vijay; Zhang, Elton; Child, Rewon; Aminabadi, Reza Yazdani; Bernauer, Julie; Song, Xia (2022-02-04). "Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model". arXiv:2201.11990.
  12. ^ Cheng, Heng-Tze; Thoppilan, Romal (January 21, 2022). "LaMDA: Towards Safe, Grounded, and High-Quality Dialog Models for Everything". ai.googleblog.com. Retrieved 2023-03-09.
  13. ^ Narang, Sharan; Chowdhery, Aakanksha (April 4, 2022). "Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance". ai.googleblog.com. Retrieved 2023-03-09.
  14. ^ Hoffmann, Jordan; Borgeaud, Sebastian; Mensch, Arthur; Sifre, Laurent (12 April 2022). "An empirical analysis of compute-optimal large language model training". DeepMind Blog.
  15. ^ "Introducing LLaMA: A foundational, 65-billion-parameter large language model". Meta AI. 24 February 2023.