Reinforcement learning from human feedback

In machine learning, reinforcement learning from human feedback (RLHF), also called reinforcement learning from human preferences, is a technique for aligning an AI agent with human preferences. In classical reinforcement learning, an agent learns a policy that maximizes a reward function measuring how well it performs its task. However, it is difficult to explicitly define a reward function that accurately approximates human preferences. Therefore, RLHF seeks to train a "reward model" directly from human feedback.[1] The reward model is trained before the policy is optimized, learning to predict whether a given output is good (high reward) or bad (low reward); it can then serve as the reward function when the agent's policy is optimized with an algorithm such as Proximal Policy Optimization.[2][3] RLHF can improve the robustness and exploration of RL agents, especially when the reward function is sparse or noisy.[4]

Motivation

Optimizing a model based on human feedback is desirable when a task is difficult to specify yet easy to judge.[5] For example, for the task of generating a compelling story, humans can iteratively assess the quality of different AI-generated stories, and the goal would be for the model to use their feedback to improve its story generation.

There have been various prior attempts to use human feedback to optimize a model's outputs, including through reinforcement learning, but most were either narrow and difficult to generalize or broke down on more complex tasks.[6][7][8][9] RLHF was an attempt to create a general algorithm for learning from a practical amount of human feedback.[5][3]

Collecting human feedback

Human feedback is commonly collected by prompting humans to rank instances of the agent's behavior.[10][11][12] These rankings can then be used to score outputs, for example, using the Elo rating system.[3] While ranking is the most widely adopted form of feedback, recent research has explored other forms, such as numerical feedback, natural language feedback, and prompting for direct edits to the model's output.[13]
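
As an illustration of how pairwise rankings can be converted into scalar scores, the sketch below applies a standard Elo update to a set of candidate outputs. It is a minimal example only: the function names, the starting ratings, and the K-factor of 32 are assumptions made for illustration, not details taken from any published RLHF system.

```python
# Illustrative sketch: turning pairwise human preferences into Elo scores.
# The update rule is the standard Elo formula; all names and constants here
# are arbitrary choices for this example, not from a specific RLHF system.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that output A is preferred over output B under Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Update ratings in place after a human prefers `winner` over `loser`."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - exp_win)
    ratings[loser] -= k * (1.0 - exp_win)

# Example: three candidate outputs, all starting from the same baseline.
ratings = {"output_a": 1000.0, "output_b": 1000.0, "output_c": 1000.0}
preferences = [("output_a", "output_b"), ("output_a", "output_c"),
               ("output_b", "output_c")]
for winner, loser in preferences:
    update_elo(ratings, winner, loser)
print(ratings)  # "output_a" ends with the highest rating
```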

When human feedback is collected through pairwise comparisons under the Bradley-Terry-Luce model (or through K-wise comparisons under the Plackett-Luce model), the maximum likelihood estimator (MLE) for linear reward functions has been shown to converge when the comparison data is generated under the assumed model. For policy training, however, a pessimistic MLE that uses a lower confidence bound as the reward estimate is more effective. Moreover, when applicable, it has been shown that considering K-wise comparisons directly is asymptotically more efficient than converting them into pairwise comparisons for prediction purposes.[14]
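
To make the pairwise setting concrete, the following is a minimal sketch of the Bradley-Terry negative log-likelihood that is commonly minimized when fitting a reward model to comparison data. It assumes a PyTorch-style reward model mapping an input to a scalar reward; the function and tensor names are illustrative, not taken from the cited work.

```python
# Sketch of the Bradley-Terry pairwise loss for reward-model fitting.
# Under the model, P(chosen beats rejected) = sigmoid(r_chosen - r_rejected),
# so the negative log-likelihood is -log sigmoid of the reward margin.

import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen: torch.Tensor,
                       reward_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood that the preferred output gets higher reward."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Hypothetical usage with a reward model and a batch of comparison pairs:
# r_chosen = reward_model(chosen_inputs)      # shape: (batch,)
# r_rejected = reward_model(rejected_inputs)  # shape: (batch,)
# loss = bradley_terry_loss(r_chosen, r_rejected)
# loss.backward()
```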

Applications

RLHF has been applied to various domains of natural language processing, such as conversational agents, text summarization, and natural language understanding.[15] Ordinary reinforcement learning, where agents learn from their own actions based on a "reward function", is difficult to apply to natural language processing tasks because the rewards are often not easy to define or measure, especially when dealing with complex tasks that involve human values or preferences. RLHF can enable language models to provide answers that align with these complex values, to generate more verbose responses, and to reject questions that are either inappropriate or outside the knowledge space of the model.[16] Some examples of RLHF-trained language models are OpenAI's ChatGPT and its predecessor InstructGPT,[11][17] as well as DeepMind's Sparrow.[18]

RLHF has also been applied to other areas, such as the development of video game bots. For example, OpenAI and DeepMind trained agents to play Atari games based on human preferences.[5][19] The agents achieved strong performance in many of the environments tested, often surpassing human performance.[20]

Challenges and limitations

RLHF suffers from a number of challenges that can be broken down into problems with human feedback, problems with learning a reward model, and problems with optimizing the policy.[21]

One major challenge is the scalability and cost of human feedback, which can be slow and expensive to collect compared to unsupervised learning. The quality and consistency of the feedback can also vary with the task, the interface, and the individual preferences of the annotators. Even when human feedback is feasible, RLHF models may still exhibit undesirable behaviors that the feedback does not capture, or exploit loopholes in the reward model, highlighting the challenges of alignment and robustness.[22]

The effectiveness of RLHF is dependent on the quality of human feedback.[3] If the feedback lacks impartiality or is inconsistent or incorrect, the model may become biased.[23] There is also a risk that the model may overfit to the feedback it receives. For instance, if feedback comes predominantly from a specific demographic or if it reflects specific biases, the model may learn not only the general alignment intended in the feedback, but also any peculiarities or noise present therein.[24][25] This excessive alignment to the specific feedback it received (or to the biases of the specific demographic that provided it) can lead to the model performing suboptimally in new contexts or when used by different groups.

Additionally, in some cases, there may be a risk of the model learning to manipulate the feedback process or game the system to achieve higher rewards, rather than genuinely improving its performance, which indicates a fault in the reward function.[26]

Researchers have surveyed a number of additional limitations to RLHF.[27]

Alternatives

An alternative to RLHF called Direct Preference Optimization (DPO) has been proposed to learn human preferences. Like RLHF, it has been applied to align pre-trained large language models using human-generated preference data. However, instead of training an intermediate reward model to then be optimized by a policy using reinforcement learning, DPO uses a change of variables to define the "preference loss" directly as a function of the policy and uses this loss to fine-tune the model.[28]
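
As a rough illustration of the change of variables described above, the sketch below expresses the DPO preference loss in terms of policy and reference-model log-probabilities. The tensor names and the value of beta are assumptions made for the example, and the sequence log-likelihoods are taken to be computed elsewhere.

```python
# Sketch of the Direct Preference Optimization (DPO) loss, expressed in terms
# of policy and reference-model log-probabilities of the chosen and rejected
# responses. `beta` controls how far the policy may drift from the reference.

import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """-log sigmoid(beta * ((chosen log-ratio) - (rejected log-ratio)))."""
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```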

References

  1. ^ Russell, Stuart J.; Norvig, Peter (2016). Artificial intelligence: a modern approach (Third, Global ed.). Boston: Pearson. pp. 830–831. ISBN 978-0-13-604259-4.
  2. ^ Ziegler, Daniel M.; Stiennon, Nisan; Wu, Jeffrey; Brown, Tom B.; Radford, Alec; Amodei, Dario; Christiano, Paul; Irving, Geoffrey (2019). "Fine-Tuning Language Models from Human Preferences". arXiv:1909.08593 [cs.CL].
  3. ^ a b c d Lambert, Nathan; Castricato, Louis; von Werra, Leandro; Havrilla, Alex. "Illustrating Reinforcement Learning from Human Feedback (RLHF)". huggingface.co. Retrieved 4 March 2023.
  4. ^ MacGlashan, James; Ho, Mark K; Loftin, Robert; Peng, Bei; Wang, Guan; Roberts, David L.; Taylor, Matthew E.; Littman, Michael L. (6 August 2017). "Interactive learning from policy-dependent human feedback". Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org: 2285–2294. arXiv:1701.06049.
    • Warnell, Garrett; Waytowich, Nicholas; Lawhern, Vernon; Stone, Peter (25 April 2018). "Deep TAMER: Interactive Agent Shaping in High-Dimensional State Spaces". Proceedings of the AAAI Conference on Artificial Intelligence. 32 (1). arXiv:1709.10163. doi:10.1609/aaai.v32i1.11485. S2CID 4130751.
    • Bai, Yuntao; Jones, Andy; Ndousse, Kamal; Askell, Amanda; Chen, Anna; DasSarma, Nova; Drain, Dawn; Fort, Stanislav; Ganguli, Deep; Henighan, Tom; Joseph, Nicholas; Kadavath, Saurav; Kernion, Jackson; Conerly, Tom; El-Showk, Sheer; Elhage, Nelson; Hatfield-Dodds, Zac; Hernandez, Danny; Hume, Tristan; Johnston, Scott; Kravec, Shauna; Lovitt, Liane; Nanda, Neel; Olsson, Catherine; Amodei, Dario; Brown, Tom; Clark, Jack; McCandlish, Sam; Olah, Chris; Mann, Ben; Kaplan, Jared (2022). "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback". arXiv:2204.05862 [cs.CL].
  5. ^ a b c "Learning from human preferences". openai.com. Retrieved 4 March 2023.
  6. ^ Knox, W. Bradley; Stone, Peter; Breazeal, Cynthia (2013). "Training a Robot via Human Feedback: A Case Study". Social Robotics. Springer International Publishing: 460–470. doi:10.1007/978-3-319-02675-6_46. Retrieved 26 February 2024.
  7. ^ Akrour, Riad; Schoenauer, Marc; Sebag, Michèle (2012). "APRIL: Active Preference Learning-Based Reinforcement Learning". Machine Learning and Knowledge Discovery in Databases. Springer: 116–131. doi:10.1007/978-3-642-33486-3_8. Retrieved 26 February 2024.
  8. ^ Wilson, Aaron; Fern, Alan; Tadepalli, Prasad (2012). "A Bayesian Approach for Policy Learning from Trajectory Preference Queries". Advances in Neural Information Processing Systems. 25. Curran Associates, Inc. Retrieved 26 February 2024.
  9. ^ Schoenauer, Marc; Akrour, Riad; Sebag, Michele; Souplet, Jean-Christophe (18 June 2014). "Programming by Feedback". Proceedings of the 31st International Conference on Machine Learning. PMLR: 1503–1511. Retrieved 26 February 2024.
  10. ^ Ouyang, Long; Wu, Jeffrey; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Gray, Alex; Schulman, John; Hilton, Jacob; Kelton, Fraser; Miller, Luke; Simens, Maddie; Askell, Amanda; Welinder, Peter; Christiano, Paul; Leike, Jan; Lowe, Ryan (31 October 2022). Training language models to follow instructions with human feedback. Thirty-Sixth Conference on Neural Information Processing Systems: NeurIPS 2022. arXiv:2203.02155.
  11. ^ a b Edwards, Benj (1 December 2022). "OpenAI invites everyone to test ChatGPT, a new AI-powered chatbot—with amusing results". Ars Technica. Retrieved 4 March 2023.
  12. ^ Abhishek, Gupta (5 February 2023). "Getting stakeholder engagement right in responsible AI". VentureBeat. Retrieved 4 March 2023.
  13. ^ Fernandes, Patrick; Madaan, Aman; Liu, Emmy; Farinhas, António; Pedro Henrique Martins; Bertsch, Amanda; de Souza, José G. C.; Zhou, Shuyan; Wu, Tongshuang; Neubig, Graham; Martins, André F. T. (2023). "Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation". arXiv:2305.00955 [cs.CL].
  14. ^ Zhu, Banghua; Jordan, Michael; Jiao, Jiantao (2023-07-03). "Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons". Proceedings of the 40th International Conference on Machine Learning. PMLR: 43037–43067.
  15. ^ Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll L.; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, John; Hilton, Jacob; Kelton, Fraser; Miller, Luke; Simens, Maddie; Askell, Amanda; Welinder, Peter; Christiano, Paul; Leike, Jan; Lowe, Ryan (2022). "Training language models to follow instructions with human feedback". arXiv:2203.02155 [cs.CL].
    • Nisan Stiennon; Long Ouyang; Jeffrey Wu; Daniel Ziegler; Ryan Lowe; Chelsea Voss; Alec Radford; Dario Amodei; Paul F. Christiano (2020). "Learning to summarize with human feedback". Advances in Neural Information Processing Systems. 33.
  16. ^ Wiggers, Kyle (24 February 2023). "Can AI really be protected from text-based attacks?". TechCrunch. Retrieved 4 March 2023.
  17. ^ Farseev, Aleks. "Council Post: Is Bigger Better? Why The ChatGPT Vs. GPT-3 Vs. GPT-4 'Battle' Is Just A Family Chat". Forbes. Retrieved 4 March 2023.
  18. ^ Glaese, Amelia; McAleese, Nat; Trębacz, Maja; Aslanides, John; Firoiu, Vlad; Ewalds, Timo; Rauh, Maribeth; Weidinger, Laura; Chadwick, Martin; Thacker, Phoebe; Campbell-Gillingham, Lucy; Uesato, Jonathan; Huang, Po-Sen; Comanescu, Ramona; Yang, Fan; See, Abigail; Dathathri, Sumanth; Greig, Rory; Chen, Charlie; Fritz, Doug; Elias, Jaume Sanchez; Green, Richard; Mokrá, Soňa; Fernando, Nicholas; Wu, Boxi; Foley, Rachel; Young, Susannah; Gabriel, Iason; Isaac, William; Mellor, John; Hassabis, Demis; Kavukcuoglu, Koray; Hendricks, Lisa Anne; Irving, Geoffrey (2022). "Improving alignment of dialogue agents via targeted human judgements". arXiv:2209.14375 [cs.LG].
  19. ^ "Learning through human feedback". www.deepmind.com. Retrieved 4 March 2023.
  20. ^ Christiano, Paul F; Leike, Jan; Brown, Tom; Martic, Miljan; Legg, Shane; Amodei, Dario (2017). "Deep Reinforcement Learning from Human Preferences". Advances in Neural Information Processing Systems. 30. Curran Associates, Inc. Retrieved 4 March 2023.
  21. ^ Casper, Stephen; Davies, Xander; Shi, Claudia; Gilbert, Thomas Krendl; Scheurer, Jérémy; Rando, Javier; Freedman, Rachel; Korbak, Tomasz; Lindner, David; Freire, Pedro; Wang, Tony; Marks, Samuel; Segerie, Charbel-Raphaël; Carroll, Micah; Peng, Andi; Christoffersen, Phillip; Damani, Mehul; Slocum, Stewart; Anwar, Usman; Siththaranjan, Anand; Nadeau, Max; Michaud, Eric J.; Pfau, Jacob; Krasheninnikov, Dmitrii; Chen, Xin; Langosco, Lauro; Hase, Peter; Bıyık, Erdem; Dragan, Anca; Krueger, David; Sadigh, Dorsa; Hadfield-Menell, Dylan (2023). "Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback". arXiv:2307.15217 [cs.AI].
  22. ^ Christiano, Paul. "Thoughts on the impact of RLHF research". Retrieved 4 March 2023.
  23. ^ Belenguer, Lorenzo (2022). "AI bias: exploring discriminatory algorithmic decision-making models and the application of possible machine-centric solutions adapted from the pharmaceutical industry". AI and Ethics. 2 (4). AI Ethics: 771–787. doi:10.1007/s43681-022-00138-8. PMC 8830968. PMID 35194591.
  24. ^ Wang, Austin. "Training Language Models to Follow Instructions with Human Feedback" (PDF). Princeton.
  25. ^ Zhang, Chiyuan; Bengio, Samy; Hardt, Moritz; Recht, Benjamin; Vinyals, Oriol (4 November 2016). "Understanding deep learning requires rethinking generalization". International Conference on Learning Representations.
  26. ^ "Faulty reward functions in the wild". OpenAI.
  27. ^ "Paper page - Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback". huggingface.co. 2023-07-31. Retrieved 2023-07-31.
  28. ^ Rafailov, Rafael; Sharma, Archit; Mitchell, Eric; Ermon, Stefano; Manning, Christopher D.; Finn, Chelsea (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". arXiv:2305.18290 [cs.LG].