= List of large language models =

A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are language models with many parameters, and are trained with self-supervised learning on a vast amount of text.

==List==
In the training cost column, 1 petaFLOP-day = 1 petaFLOP/s × 1 day = 8.64E19 FLOP. Where a model was released in several sizes, only the cost of the largest is listed.
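
The unit conversion above can be sketched in a few lines of Python. This is a minimal illustration, not part of any cited source: the function and constant names are hypothetical, and the two sample FLOP counts are the ones quoted in the Notes column for Llama 3.1 (3.8E25) and DeepSeek-LLM 67B (1e24).

```python
# 1 petaFLOP-day = 1e15 FLOP/s sustained for 86,400 s = 8.64e19 FLOP.
PETAFLOP_PER_SECOND = 1e15          # FLOP per second in one petaFLOP/s
SECONDS_PER_DAY = 24 * 60 * 60      # 86,400 seconds
FLOP_PER_PETAFLOP_DAY = PETAFLOP_PER_SECOND * SECONDS_PER_DAY


def flop_to_petaflop_days(total_flop: float) -> float:
    """Convert a total FLOP count into petaFLOP-days (hypothetical helper)."""
    return total_flop / FLOP_PER_PETAFLOP_DAY


def petaflop_days_to_flop(petaflop_days: float) -> float:
    """Convert petaFLOP-days back into a total FLOP count (hypothetical helper)."""
    return petaflop_days * FLOP_PER_PETAFLOP_DAY


if __name__ == "__main__":
    # FLOP totals as quoted in the Notes column of the table.
    quoted_flop = {
        "Llama 3.1 (405B)": 3.8e25,
        "DeepSeek-LLM (67B)": 1e24,
    }
    for name, flop in quoted_flop.items():
        days = flop_to_petaflop_days(flop)
        print(f"{name}: {flop:.1e} FLOP is about {days:,.0f} petaFLOP-days")
```

Under these assumptions the two quoted totals come out near 440,000 and 12,000 petaFLOP-days, which matches the rounded figures in the corresponding table rows.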

| Name | Release date | Developer | Number of parameters (billion) | Corpus size | Training cost (petaFLOP-day) | License | Notes |
| GPT-1 | | OpenAI | 0.117 | | 1 | | First GPT model, decoder-only transformer. Trained for 30 days on 8 P600 GPUs. |
| BERT | | Google | 0.340 | 3.3 billion words | 9 | | An early and influential language model. Encoder-only and thus not built to be prompted or generative. Training took 4 days on 64 TPUv2 chips. |
| T5 | | Google | 11 | 34 billion tokens | | | Base model for many Google projects, such as Imagen. |
| XLNet | | Google | 0.340 | 33 billion words | 330 | | An alternative to BERT; designed as encoder-only. Trained on 512 TPU v3 chips for 5.5 days. |
| GPT-2 | | OpenAI | 1.5 | 40GB (~10 billion tokens) | 28 | | Trained on 32 TPUv3 chips for 1 week. |
| GPT-3 | | OpenAI | 175 | 300 billion tokens | 3640 | | A fine-tuned variant of GPT-3, termed GPT-3.5, was made available to the public through a web interface called ChatGPT in 2022. |
| GPT-Neo | | EleutherAI | 2.7 | 825 GiB | | | The first of a series of free GPT-3 alternatives released by EleutherAI. GPT-Neo outperformed an equivalent-size GPT-3 model on some benchmarks, but was significantly worse than the largest GPT-3. |
| GPT-J | | EleutherAI | 6 | 825 GiB | 200 | | GPT-3-style language model |
| Megatron-Turing NLG | | Microsoft and Nvidia | 530 | 338.6 billion tokens | 38000 | | Trained for 3 months on over 2000 A100 GPUs on the NVIDIA Selene Supercomputer, for over 3 million GPU-hours |
| Ernie 3.0 Titan | | Baidu | 260 | 4TB | | | Chinese-language LLM. Ernie Bot is based on this model. |
| Claude | | Anthropic | 52 | 400 billion tokens | | | Fine-tuned for desirable behavior in conversations. |
| GLaM (Generalist Language Model) | | Google | 1200 | 1.6 trillion tokens | 5600 | | Sparse mixture of experts model, making it more expensive to train but cheaper to run inference compared to GPT-3. |
| Gopher | | Google DeepMind | 280 | 300 billion tokens | 5833 | | Later developed into the Chinchilla model. |
| LaMDA (Language Models for Dialog Applications) | | Google | 137 | 1.56T words, 168 billion tokens | 4110 | | Specialized for response generation in conversations. |
| GPT-NeoX | | EleutherAI | 20 | 825 GiB | 740 | | Based on the Megatron architecture. |
| Chinchilla | | Google DeepMind | 70 | 1.4 trillion tokens | 6805 | | Reduced-parameter model trained on more data. Used in the Sparrow bot. Often cited for its neural scaling law. |
| PaLM (Pathways Language Model) | | Google | 540 | 768 billion tokens | 29,250 | | Trained for ~60 days on ~6000 TPU v4 chips. |
| OPT (Open Pretrained Transformer) | | Meta | 175 | 180 billion tokens | 310 | | GPT-3 architecture with some adaptations from Megatron. Uniquely, the training logbook written by the team was published. |
| YaLM 100B | | Yandex | 100 | 1.7TB | | | English-Russian model based on Microsoft's Megatron-LM |
| Minerva | | Google | 540 | 38.5B tokens from webpages filtered for mathematical content and from papers submitted to the arXiv preprint server | | | For solving "mathematical and scientific questions using step-by-step reasoning". Initialized from PaLM models, then finetuned on mathematical and scientific data. |
| BLOOM | | Large collaboration led by Hugging Face | 175 | 350 billion tokens (1.6TB) | | | Essentially GPT-3 but trained on a multi-lingual corpus (30% English excluding programming languages) |
| Galactica | | Meta | 120 | 106 billion tokens | | | Trained on scientific text and modalities. |
| AlexaTM (Teacher Models) | | Amazon | 20 | 1.3 trillion | | | Bidirectional sequence-to-sequence architecture |
| Llama | | Meta AI | 65 | 1.4 trillion tokens | 6300 | | Corpus has 20 languages. "Overtrained" (compared to Chinchilla scaling law) for better performance with fewer parameters. |
| GPT-4 | | OpenAI | 1760 (according to rumors) | | 230,000 (estimated) | | Available to all ChatGPT users and used in several products. |
| Cerebras-GPT | | Cerebras | 13 | | 270 | | Trained with Chinchilla formula. |
| Falcon | | Technology Innovation Institute | 40 | 1 trillion tokens, from RefinedWeb (filtered web text corpus) plus some "curated corpora". | 2800 | | |
| BloombergGPT | | Bloomberg L.P. | 50 | 363 billion token dataset based on Bloomberg's data sources, plus 345 billion tokens from general purpose datasets | | | Trained on financial data from proprietary sources, for financial tasks |
| PanGu-Σ | | Huawei | 1085 | 329 billion tokens | | | |
| OpenAssistant | | LAION | 17 | 1.5 trillion tokens | | | Trained on crowdsourced open data |
| Jurassic-2 | | AI21 Labs | | | | | Multilingual |
| PaLM 2 (Pathways Language Model 2) | | Google | 340 | 3.6 trillion tokens | 85,000 | | Was used in Bard chatbot. |
| YandexGPT | | Yandex | | | | | Used in Alice chatbot. |
| Llama 2 | | Meta AI | 70 | 2 trillion tokens | 21,000 | | 1.7 million A100-hours. |
| Claude 2 | | Anthropic | | | | | Used in Claude chatbot. |
| Granite 13b | | IBM | | | | | Used in IBM Watsonx. |
| Mistral 7B | | Mistral AI | 7.3 | | | | |
| YandexGPT 2 | | Yandex | | | | | Used in Alice chatbot. |
| Claude 2.1 | | Anthropic | | | | | Used in Claude chatbot. Has a context window of 200,000 tokens, or ~500 pages. |
| Grok 1 | | xAI | 314 | | | | Used in Grok chatbot. Grok 1 has a context length of 8,192 tokens and has access to X (Twitter). |
| Gemini 1.0 | | Google DeepMind | | | | | Multimodal model, comes in three sizes. Used in the chatbot of the same name. |
| Mixtral 8x7B | | Mistral AI | 46.7 | | | | Outperforms GPT-3.5 and Llama 2 70B on many benchmarks. Mixture of experts model, with 12.9 billion parameters activated per token. |
| DeepSeek-LLM | | DeepSeek | 67 | 2T tokens | 12,000 | | Trained on English and Chinese text. 1e24 FLOPs for the 67B model; 1e23 FLOPs for the 7B model. |
| Phi-2 | | Microsoft | 2.7 | 1.4T tokens | 419 | | Trained on real and synthetic "textbook-quality" data, for 14 days on 96 A100 GPUs. |
| Gemini 1.5 | | Google DeepMind | | | | | Multimodal model, based on a Mixture-of-Experts (MoE) architecture. Context window above 1 million tokens. |
| Gemini Ultra | | Google DeepMind | | | | | |
| Gemma | | Google DeepMind | 7 | 6T tokens | | | |
| Claude 3 | | Anthropic | | | | | Includes three models, Haiku, Sonnet, and Opus. |
| DBRX | | Databricks and Mosaic ML | 136 | 12T tokens | | | Training cost 10 million USD |
| YandexGPT 3 Pro | | Yandex | | | | | Used in Alice chatbot. |
| Fugaku-LLM | | Fujitsu, Tokyo Institute of Technology, etc. | 13 | 380B tokens | | | The largest model ever trained using only CPUs, on the Fugaku supercomputer. |
| Chameleon | | Meta AI | 34 | 4.4 trillion | | | |
| Mixtral 8x22B | | Mistral AI | 141 | | | | |
| Phi-3 | | Microsoft | 14 | 4.8T tokens | | | Marketed by Microsoft as a "small language model". |
| Granite Code Models | | IBM | | | | | |
| YandexGPT 3 Lite | | Yandex | | | | | Used in Alice chatbot. |
| Qwen2 | | Alibaba Cloud | 72 | 3T tokens | | | Multiple sizes, the smallest being 0.5B. |
| DeepSeek-V2 | | DeepSeek | 236 | 8.1T tokens | 28,000 | | 1.4M hours on H800. |
| Nemotron-4 | | Nvidia | 340 | 9T tokens | 200,000 | | Trained for 1 epoch. Trained on 6144 H100 GPUs between December 2023 and May 2024. |
| Claude 3.5 | | Anthropic | | | | | Initially, only one model, Sonnet, was released. In October 2024, Sonnet 3.5 was upgraded, and Haiku 3.5 became available. |
| Llama 3.1 | | Meta AI | 405 | 15.6T tokens | 440,000 | | 405B version took 31 million hours on H100-80GB, at 3.8E25 FLOPs. |
| Grok-2 | | xAI | | | | | Originally closed-source, then re-released as "Grok 2.5" under a source-available license in August 2025. |
| OpenAI o1 | | OpenAI | | | | | Reasoning model. |
| YandexGPT 4 Lite and Pro | | Yandex | | | | | Used in Alice chatbot. |
| Sarvam 1 | | Sarvam AI | 2 | 2T tokens | | | Multilingual LLM optimized for 10+ Indic languages and English; aims for efficient inference; built on Indian infrastructure. |
| Mistral Large | | Mistral AI | 123 | | | | Upgraded over time. The latest version is 24.11. |
| Pixtral | | Mistral AI | 123 | | | | Multimodal. There is also a 12B version which is under Apache 2 license. |
| Phi-4 | | Microsoft | 14 | 9.8T tokens | | | Marketed by Microsoft as a "small language model". |
| DeepSeek-V3 | | DeepSeek | 671 | 14.8T tokens | 56,000 | | 2.788M hours on H800 GPUs. Originally released under the DeepSeek License, then re-released under the MIT License as "DeepSeek-V3-0324" in March 2025. |
| Amazon Nova | | Amazon | | | | | Includes three models, Nova Micro, Nova Lite, and Nova Pro |
| DeepSeek-R1 | | DeepSeek | 671 | | | | No separate pretraining; trained with reinforcement learning on top of V3-Base. |
| Qwen2.5 | | Alibaba | 72 | 18T tokens | | | 7 dense models, with parameter count from 0.5B to 72B. They also released 2 MoE variants. |
| MiniMax-Text-01 | | Minimax | 456 | 4.7T tokens | | | |
| Gemini 2.0 | | Google DeepMind | | | | | Three models released: Flash, Flash-Lite and Pro |
| Claude 3.7 | | Anthropic | | | | | One model, Sonnet 3.7. |
| YandexGPT 5 Lite Pretrain and Pro | | Yandex | | | | | Used in Alice Neural Network chatbot. |
| GPT-4.5 | | OpenAI | | | | | Largest non-reasoning model. |
| Grok 3 | | xAI | | | | | Training cost claimed "10x the compute of previous state-of-the-art models". |
| Gemini 2.5 | | Google DeepMind | | | | | Three models released: Flash, Flash-Lite and Pro |
| YandexGPT 5 Lite Instruct | | Yandex | | | | | Used in Alice Neural Network chatbot. |
| Llama 4 | | Meta AI | 400 | 40T tokens | | | |
| OpenAI o3 and o4-mini | | OpenAI | | | | | Reasoning models. |
| Qwen3 | | Alibaba Cloud | 235 | 36T tokens | | | Multiple sizes, the smallest being 0.6B. |
| Claude 4 | | Anthropic | | | | | Includes two models, Sonnet and Opus. |
| Sarvam-M | | Sarvam AI | 24 | | | | Hybrid reasoning model fine-tuned from the Mistral Small base model; optimized for math, programming, and Indian languages. |
| Grok 4 | | xAI | | | | | |
| GLM-4.5 | | Zhipu AI | 355 | 22T tokens | | | Released in 355B and 106B sizes. Corpus size is the sum of the 15 trillion tokens and the 7 trillion tokens of the pre-training mix. |
| GPT-OSS | | OpenAI | 117 | | | | Released in 20B and 120B sizes. |
| Claude 4.1 | | Anthropic | | | | | Includes one model, Opus. |
| GPT-5 | | OpenAI | | | | | Includes three models: GPT-5, GPT-5 mini, and GPT-5 nano. GPT-5 is available in ChatGPT and through the API. It includes thinking abilities. |
| DeepSeek-V3.1 | | DeepSeek | 671 | 15.639T tokens | | | Training size: 14.8T tokens from DeepSeek V3 plus 839B tokens from the extension phases (630B + 209B). It is a hybrid model that can switch between thinking and non-thinking modes. |
| YandexGPT 5.1 Pro | | Yandex | | | | | Used in Alice Neural Network chatbot. |
| Apertus | | ETH Zurich and EPF Lausanne | 70 | 15 trillion tokens | | | Said to be the first LLM compliant with the EU's Artificial Intelligence Act. |
| Claude Sonnet 4.5 | | Anthropic | | | | | |
| DeepSeek-V3.2-Exp | | DeepSeek | 685 | | | | Experimental model built on V3.1-Terminus that uses a custom efficient attention mechanism called DeepSeek Sparse Attention (DSA). |
| GLM-4.6 | | Zhipu AI | 357 | | | | |
| Alice AI LLM 1.0 | | Yandex | | | | | Available in Alice AI chatbot. |
| Gemini 3 | | Google DeepMind | | | | | Two models released: Deep Think and Pro |
| Claude Opus 4.5 | | Anthropic | | | | | The largest model in the Claude family. |
| GPT-5.2 | | OpenAI | | | | | Reportedly solved a previously open problem in statistical learning theory. |
| GLM-4.7 | | Zhipu AI | 355 | | | | MoE architecture. Open-source SOTA on coding benchmarks. A Flash variant (30B-A3B) was also released on January 19, 2026. |
| Qwen3-Max-Thinking | | Alibaba Cloud | | | | | Proprietary reasoning model with adaptive tool-use, test-time scaling, and iterative self-reflection. |
| Kimi K2.5 | | Moonshot AI | 1000 | 15T tokens | | | MoE with 32B active parameters per token. Agent Swarm technology coordinates up to 100 parallel sub-agents. Natively multimodal. |
| Claude Opus 4.6 | | Anthropic | | | | | |
| GPT-5.3-Codex | | OpenAI | | | | | |
| Sarvam-2B | | Sarvam AI | 2 | | | | Audio-first LLM supporting 22 Indian languages (speech focus). |
| Sovereign LLM | | Sarvam AI | 70 | | | | A foundational model sponsored by IndiaAI Mission; multiple variants planned (Large, Small, Edge). |
| GLM-5 | | Zhipu AI | 754 | | | | Specialized for agentic engineering and long-horizon tasks. Integrates DeepSeek Sparse Attention (DSA) for 200K context. Trained entirely on Chinese Huawei Ascend hardware. |
| Sarvam 105B | | Sarvam AI | 105 | 128,000 tokens | | | Strong in Indic languages. Mixture-of-experts model that uses only 9B active parameters, making it cost-effective for high-level reasoning tasks. |

== See also ==

- List of chatbots
- List of language model benchmarks
