This is the user sandbox of SoerenMind. A user sandbox is a subpage of the user's user page. It serves as a testing spot and page development space for the user and is not an encyclopedia article. Create or edit your own sandbox here.

Other sandboxes: Main sandbox | Template sandbox

Finished writing a draft article? Are you ready to request review of it by an experienced editor for possible inclusion in Wikipedia? Submit your draft for review!

Rationale for changes to Alignment article

Continuing my efforts from last year, I've almost completed a major update/rewrite to the AI Alignment article, see below. As this is a big change, I wanted to get some feedback from the existing editors who have done the most to improve this article in the past. @Rolf h nelson: @WeyerStudentOfAgrippa: @Johncdraper: as noted on the Talk page, please let me know if I should change anything before pushing these changes in a week or two. Thank you! SoerenMind (talk) 10:32, 11 September 2022 (UTC) Edit: I wrote this a few weeks ago but only signed today, hopefully you still got a notification.

Important: Here's the main motivation for the changes:

AI Alignment has progressed from a field of philosophy to a technical research field with present-day applications. For example, alignment research papers now get major media coverage (e.g. from OpenAI and Anthropic). This development is reflected.
I introduce a crisper delineation of sections. This will reduce duplication between e.g. the Problem Description section and the Alignment section. The former motivates the problem, and the latter explains solution approaches.
I restructured the Alignment section to organize it by high-level research problems that the AI alignment research community focuses on: learning complex values, scalable oversight, honest AI, emergent goals / inner alignment, and instrumental goals / power-seeking.
The article was renamed last year from “AI Control Problem” to “AI Alignment”. I think this is good because it reduces with related articles in terms of topic. But in terms of the actual text, there's still greater overlap as the text is still written for the old, more general title. I'll update the lead and problem description sections to reflect that the main focus is alignment.
The Capability Control section is not about alignment. If there are no objections, I'll merge it into the AI box, article and rename that article to AI capability control. This way we've made sure the AI Alignment article is only about alignment. I would only do this later after pushing the changes outlined above. Relevant to Rolf-H-Nelson
Since alignment increasingly matters to AIs at current levels of intelligence, the emphasis is not only on superintelligent AI. This change also reduces overlap with other articles, and it mirrors the emphasis in alignment research papers of recent years.

References:

See previous discussion at Sourcing issues. Where there are references to non-peer reviewed Arxiv papers, these are only for claims that are also supported by a secondary source like a survey or a news article/book. Note that some Arxiv links are themselves (highly cited) survey articles. There are a small number of primary / less reliable sources for easily verifiable claims like "risks from unaligned AI have been pointed out by researcher X and Y" where the source is written by the researcher themselves.

AI Alignment wiki draft

In the field of artificial intelligence (AI), AI alignment research aims to steer AI systems towards their designers’ intended goals and interests. An AI system is described as misaligned if it is competent but advances an unintended objective.^[a]

Problems in AI alignment include the difficulty of completely specifying all desired and undesired behaviors; the use of easy-to-specify proxy goals that omit some desired constraints; reward hacking, by which AI systems find loopholes in these proxy goals, causing side-effects; instrumental goals such as power-seeking that help the AI system achieve its final goals;^[1]^[3]^[4]^[5] and emergent goals that may only become apparent when the system is deployed in new situations and data distributions.^[4]^[6] These problems affect commercial systems such as robots,^[7] language models,^[8]^[9]^[10] autonomous vehicles,^[11] and social media recommendation engines.^[8]^[3]^[12] They are thought to be more likely in highly capable systems as they partially result from high capability.^[13]^[4]

The AI research community and the United Nations have called for technical research and policy solutions to ensure that AI systems are aligned with human values.^[b]

AI alignment is a subfield of AI safety, the study of building safe AI systems.^[4]^[16] Other subfields of safety include robustness, monitoring, and capability control.^[4]^[17] Avenues for alignment research include learning human values and preferences, developing honest AI, scalable oversight, auditing and interpreting AI models, as well as preventing emergent AI behaviors like power-seeking.^[4]^[17] Alignment research has connections to interpretability research,^[18] robustness,^[4]^[16] anomaly detection, calibrated uncertainty,^[18] formal verification,^[19] preference learning,^[20]^[21]^[22] safety-critical engineering,^[4]^[23] game theory,^[24]^[25] algorithmic fairness,^[16]^[26] and the social sciences,^[27] among others.

Problem description

In 1960, AI pioneer Norbert Wiener articulated the AI alignment problem as follows: “If we use, to achieve our purposes, a mechanical agency with whose operation we cannot interfere effectively … we had better be quite sure that the purpose put into the machine is the purpose which we really desire.”^[28]^[3] More recently, AI alignment has emerged as an open problem in modern AI systems^[29]^[30]^[31]^[32] and a research field within AI.^[33]^[4]^[34]^[35]

Specification gaming and complexity of value

Part of the problem of alignment involves specifying goals in a way that captures important values and avoids loopholes and unwanted consequences.^[33] In many cases, the specifications used to train an AI system do not match the intended goals of the algorithm designer.^[16]^[4]^[36]^[17] Designing such specifications is difficult for complex outputs such as language, robotic movements, or content recommendation. This is because it is difficult to describe in full what makes any complex output desirable or not. For example, when training a reinforcement learning agent to drive a boat around a racing track, researchers at OpenAI noticed that the agent found “an isolated lagoon where it can turn in a large circle and repeatedly knock over three targets … our agent manages to achieve a higher score using this strategy than is possible by completing the course in the normal way.” Another example of specification failure is in text generation: language models output falsehoods at a high rate and produce convincing spurious explanations.^[37]^[38]^[22] Alignment research tries to align such models with more safe or more useful objectives.

Berkeley computer scientist Stuart Russell has noted that omitting an implicit constraint can result in harm: “A system [...] will often set [...] unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp, or the sorcerer's apprentice, or King Midas: you get exactly what you ask for, not what you want.”^[39]

When misaligned AI is deployed, the side-effects can be consequential. Social media platforms have been known to optimize clickthrough rates as a proxy for optimizing user enjoyment, but this addicted some users, decreasing their well-being.^[4] Stanford researchers comment that such recommender algorithms are misaligned with their users because they “optimize simple engagement metrics rather than a harder-to-measure combination of societal and consumer well-being”.^[8]

Writing a specification that avoids side effects can be challenging. Since the systems are designed by humans, it is sometimes suggested to simply forbid the system from taking dangerous actions, for instance by listing forbidden outputs or by formalizing simple ethical rules.^[40] However, Russell argued that this approach neglects the complexity of human values:^[3] “It is certainly very hard, and perhaps impossible, for mere humans to anticipate and rule out in advance all the disastrous ways the machine could choose to achieve a specified objective.”^[3]

Systemic risks

Commercial organizations may have incentives to take shortcuts on safety and deploy insufficiently aligned AI systems.^[4] An example are the aforementioned social media recommender systems, which were profitable despite creating unwanted addiction and polarization on a global scale.^[8]^[41]^[42] In addition, competitive pressure can create a race to the bottom on safety standards, as in the case of Elaine Herzberg, a pedestrian who was killed by a self-driving car after engineers disabled the emergency braking system because it was over-sensitive and slowing down development.^[43]

Autonomous AI systems may be assigned the wrong goals by accident.^[44] Two former presidents of the Association for the Advancement of Artificial Intelligence (AAAI), Tom Dietterich and Eric Horvitz, note that this is already of concern: "An important aspect of any AI system that interacts with people is that it must reason about what people intend rather than carrying out commands literally."^[45] Furthermore, a system that understands human intentions may still disregard them—AI systems only act according to the objective function, examples, or feedback their designers actually provide.^[46]

Risks from advanced AI

Some researchers are particularly interested in the alignment of increasingly advanced AI systems. This is motivated by the high rate of progress in AI, the large efforts from industry and governments to develop advanced AI systems, and the greater difficulty or aligning them.

As of 2020, OpenAI, DeepMind, and 70 other public projects had the stated aim of developing artificial general intelligence (AGI), a hypothesized system that matches or outperforms humans in a broad range of cognitive tasks.^[47] Indeed, researchers who scale modern neural networks observe that they develop increasingly general and unexpected capabilities.^[8] Such models have learned to operate a computer, write their own programs, and perform a wide range of other tasks from a single model.^[48]^[49]^[50] Surveys find that some AI researchers expect AGI to be created soon, some believe it is very far off, and many consider both possibilities.^[51]^[52]

Power-seeking

Current systems still lack capabilities such as long-term planning and strategic awareness that are thought to pose the most catastrophic risks.^[8]^[53]^[5] Future systems (not necessarily AGIs) that have these capabilities may seek to protect and grow their influence over their environment. This tendency is known as power-seeking or convergent instrumental goals. Power-seeking is not explicitly programmed but emerges since power is instrumental for achieving a wide range of goals. For example, AI agents may acquire financial resources and computation, or may evade being turned off, including by running additional copies of the system on other computers.^[54]^[5] Power-seeking has been observed in various reinforcement learning agents.^[c]^[56]^[57]^[58] Later research has mathematically shown that optimal reinforcement learning algorithms seek power in a wide range of environments.^[59] As a result, it is often argued that the alignment problem must be solved early, before advanced AI that exhibits emergent power-seeking is created.^[5]^[54]^[3]

Existential risk

According to some scientists, creating misaligned AI that broadly outperforms humans would challenge the position of humanity as Earth’s dominant species; accordingly it would lead to the disempowerment or possible extinction of humans.^[1]^[3] Notable computer scientists who have pointed out risks from highly advanced misaligned AI include Alan Turing,^[d] Ilya Sutskever,^[62] Yoshua Bengio,^[e] Judea Pearl,^[f] Murray Shanahan,^[65] Norbert Wiener,^[28]^[3] Marvin Minsky,^[g] Francesca Rossi,^[67] Scott Aaronson,^[68] Bart Selman,^[69] David McAllester,^[70] Jürgen Schmidhuber,^[71] Markus Hutter,^[72] Shane Legg,^[73] Eric Horvitz,^[74] and Stuart Russell.^[3] Skeptical researchers such as François Chollet,^[75] Gary Marcus,^[76] Yann LeCun,^[77] and Oren Etzioni^[78] have argued that AGI is far off, or would not seek power (successfully).

Alignment may be especially difficult for the most capable AI systems since several risks increase with the system’s capability: the system’s ability to find loopholes in the assigned objective,^[13] cause side-effects, protect and grow its power,^[59]^[5] grow its intelligence, and mislead its designers; the system’s autonomy; and the difficulty of interpreting and supervising the AI system.^[3]^[54]

Alignment research

Learning human values and preferences

Teaching AI systems to act in view of human values, goals, and preferences is a nontrivial problem because human values can be complex and hard to fully specify. When given an imperfect or incomplete objective, goal-directed AI systems commonly learn to exploit these imperfections.^[16] This phenomenon is known as reward hacking or specification gaming in AI, and as Goodhart's law in economics. Researchers aim to specify the intended behavior as completely as possible with “values-targeted” datasets, imitation learning, or preference learning.^[6] A central open problem is scalable oversight, the difficulty of supervising an AI system that outperforms humans in a given domain.^[16]

When training a goal-directed AI system, such as a Reinforcement learning (RL) agent, it is often difficult to specify the intended behavior by writing a reward function manually. An alternative is imitation learning, where the AI learns to imitate demonstrations of the desired behavior. In inverse reinforcement learning (IRL), human demonstrations are used to identify the objective, i.e. the reward function, behind the demonstrated behavior.^[79]^[80] Cooperative inverse reinforcement learning (CIRL) builds on this by assuming a human agent and artificial agent can work together to maximize the human’s reward function.^[3]^[81] CIRL emphasizes that AI agents should be uncertain about the reward function. This humility can help mitigate specification gaming as well as power-seeking tendencies (see § Power-Seeking).^[58]^[72] However, inverse reinforcement learning approaches assume that humans can demonstrate nearly perfect behavior, a misleading assumption when the task is difficult.^[82]^[72]

Other researchers have explored the possibility of eliciting complex behavior through preference learning. Rather than providing expert demonstrations, human annotators provide feedback on which of two or more of the AI’s behaviors they prefer.^[20]^[22] A helper model is then trained to predict human feedback for new behaviors. Researchers at OpenAI used this approach to train an agent to perform a backflip in less than an hour of evaluation, a maneuver that would have been hard to provide demonstrations for.^[83]^[84] Preference learning has also been an influential tool for recommender systems, web search, and information retrieval.^[85] However, one challenge is reward hacking: the helper model may not represent human feedback perfectly, and the main model may exploit this mismatch.^[16]^[86]

The arrival of large language models such as GPT-3 has enabled the study of value learning in a more general and capable class of AI systems than was available before. Preference learning approaches originally designed for RL agents have been extended to improve the quality of generated text and reduce harmful outputs from these models. OpenAI and DeepMind use this approach to improve the safety of state-of-the-art large language models.^[9]^[22]^[87] Anthropic has proposed using preference learning to fine-tune models to be helpful, honest, and harmless.^[88] Other avenues used for aligning language models include values-targeted datasets^[89]^[4] and red-teaming.^[90]^[91] In red-teaming, another AI system or a human tries to find inputs for which the model’s behavior is unsafe. Since unsafe behavior can be unacceptable even when it is rare, an important challenge is to drive the rate of unsafe outputs extremely low.^[22]

While preference learning can instill hard-to-specify behaviors, it requires extensive datasets or human interaction to capture the full breadth of human values. Machine ethics provides a complementary approach: instilling AI systems with moral values.^[h] For instance, machine ethics aims to teach the systems about normative factors in human morality, such as wellbeing, equality and impartiality; not intending harm; avoiding falsehoods; and honoring promises. Unlike specifying the objective for a specific task, machine ethics seeks to teach AI systems broad moral values that could apply in many situations. This approach carries conceptual challenges of its own; machine ethicists have noted the necessity to clarify what alignment aims to accomplish: having AIs follow the programmer’s literal instructions, the programmers' implicit intentions, the programmers' revealed preferences, the preferences the programmers would have if they were more informed or rational, the programmers' objective interests, or objective moral standards.^[94] Further challenges include aggregating the preferences of different stakeholders and avoiding value lock-in—the indefinite preservation of the values of the first highly capable AI systems, which are unlikely to be fully representative.^[94]^[95]

Scalable oversight

The alignment of AI systems through human supervision may face challenges in scaling up. As AI systems attempt increasingly complex tasks, it can be slow or infeasible for humans to evaluate them. Such tasks include summarizing books,^[96] producing statements that are not merely convincing but also true,^[97]^[38]^[98] writing code without subtle bugs^[10] or security vulnerabilities, and predicting long-term outcomes such as the climate and the results of a policy decision.^[99]^[100] More generally, it can be difficult to evaluate AI that outperforms humans in a given domain. To provide feedback in hard-to-evaluate tasks, and detect when the AI’s solution is only seemingly convincing, humans require assistance or extensive time. Scalable oversight studies how to reduce the time needed for supervision as well as assist human supervisors.^[16]

AI researcher Paul Christiano argues that the owners of AI systems may continue to train AI using easy-to-evaluate proxy objectives since that is easier than solving scalable oversight and still profitable. Accordingly, this may lead to “a world that’s increasingly optimized for things [that are easy to measure] like making profits or getting users to click on buttons, or getting users to spend time on websites without being increasingly optimized for having good policies and heading in a trajectory that we’re happy with”.^[101]

One easy-to-measure objective is the score the supervisor assigns to the AI’s outputs. Some AI systems have discovered a shortcut to achieving high scores, by taking actions that falsely convince the human supervisor that the AI has achieved the intended objective (see video of robot hand^[83]).Some AI systems have also learned to recognize when they are being evaluated, and “play dead”, only to behave differently once evaluation ends.^[102] This deceptive form of specification gaming may become easier for more sophisticated AI systems^[13]^[54] that attempt more difficult-to-evaluate tasks. If advanced models are also capable planners, they could be able to obscure their deception from supervisors.^[103] In the automotive industry, Volkswagen engineers obscured their cars’ emissions in laboratory testing, underscoring that deception of evaluators is a common pattern in the real world.^[4]

Approaches such as active learning and semi-supervised reward learning can reduce the amount of human supervision needed.^[16] Another approach is to train a helper model (‘reward model’) to imitate the supervisor’s judgment.^[16]^[21]^[22]^[104]

However, when the task is too complex to evaluate accurately, or the human supervisor is vulnerable to deception, it is not sufficient to reduce the quantity of supervision needed. To increase supervision quality, a range of approaches aim to assist the supervisor, sometimes using AI assistants. Iterated Amplification is an approach developed by Christiano that iteratively builds a feedback signal for challenging problems by using humans to combine solutions to easier subproblems.^[6]^[99] Iterated Amplification was used to train AI to summarize books without requiring human supervisors to read them.^[96]^[105] Another proposal is to train aligned AI by means of debate between AI systems, with the winner judged by humans.^[106]^[72] Such debate is intended to reveal the weakest points of an answer to a complex question, and reward the AI for truthful and safe answers.

Honest AI

A growing area of research in AI alignment focuses on ensuring that AI is honest and truthful. Researchers from the Future of Humanity Institute point out that the development of language models such as GPT-3, which can generate fluent and grammatically correct text,^[108]^[109] has opened the door to AI systems repeating falsehoods from their training data or even deliberately lying to humans.^[110]^[111]

Current state-of-the-art language models learn by imitating human writing across millions of books worth of text from the Internet.^[8]^[112] While this helps them learn a wide range of skills, the training data also includes common misconceptions, incorrect medical advice, and conspiracy theories. AI systems trained on this data learn to mimic false statements.^[107]^[113]^[38] Additionally, models often obediently continue falsehoods when prompted, generate empty explanations for their answers, or produce outright fabrications.^[32] For example, when prompted to write a biography for a real AI researcher, a chatbot confabulated numerous details about their life, which the researcher identified as false.^[114]

To combat the lack of truthfulness exhibited by modern AI systems, researchers have explored several directions. AI research organizations including OpenAI and DeepMind have developed AI systems that can cite their sources and explain their reasoning when answering questions, enabling better transparency and verifiability.^[115]^[116]^[117] Researchers from OpenAI and Anthropic have proposed using human feedback and curated datasets to fine-tune AI assistants to avoid negligent falsehoods or express when they are uncertain.^[22]^[118]^[88] Alongside technical solutions, researchers have argued for defining clear truthfulness standards and the creation of institutions, regulatory bodies, or watchdog agencies to evaluate AI systems on these standards before and during deployment.^[110]

Researchers distinguish truthfulness, which specifies that AIs only make statements that are objectively true, and honesty, which is the property that AIs only assert what they believe to be true. Recent research finds that state-of-the-art AI systems cannot be said to hold stable beliefs, so it is not yet tractable to study the honesty of AI systems.^[119] However, there is substantial concern that future AI systems that do hold beliefs could intentionally lie to humans. In extreme cases, a misaligned AI could deceive its operators into thinking it was safe or persuade them that nothing is amiss.^[5]^[8]^[4] Some argue that if AIs could be made to assert only what they believe to be true, this would sidestep numerous problems in alignment.^[110]^[120]

Inner alignment and emergent goals

Alignment research aims to line up three different descriptions of an AI system:^[121]

Intended goals: “the hypothetical (but hard to articulate) description of an ideal AI system that is fully aligned to the desires of the human operator”;
Specified goals (or ‘outer specification’): The goals we actually specify — typically jointly through an objective function and a dataset;
Emergent goals (or ‘inner specification’): The goals the AI actually advances.

‘Outer misalignment’ is a mismatch between the intended goals (1) and the specified goals (2), whereas ‘inner misalignment’ is a mismatch between the specified goals (2) and the emergent goals (3).

Inner misalignment is often explained by analogy to biological evolution.^[122] In the ancestral environment, evolution selected human genes for inclusive genetic fitness, but humans evolved to have other objectives. Fitness corresponds to (2), the specified goal used in the training environment and training data. In evolutionary history, maximizing the fitness specification led to intelligent agents, humans, that do not directly pursue inclusive genetic fitness. Instead, they pursue emergent goals (3) that correlated with genetic fitness in the ancestral environment: nutrition, sex, and so on. However, our environment has changed — a distribution shift has occurred. Humans still pursue their emergent goals, but this no longer maximizes genetic fitness. (In machine learning the analogous problem is known as goal misgeneralization.^[123]) Our taste for sugary food (an emergent goal) was originally beneficial, but now leads to overeating and health problems. Also, by using contraception, humans directly contradict genetic fitness. By analogy, if genetic fitness were the objective chosen by an AI developer, they would observe the model behaving as intended in the training environment, without noticing that the model is pursuing an unintended emergent goal until the model was deployed.

Research directions to detect and remove misaligned emergent goals include red teaming, verification, anomaly detection, and interpretability.^[16]^[4]^[17] Progress on these techniques may help reduce two open problems. Firstly, emergent goals only become apparent when the system is deployed outside its training environment, but it can be unsafe to deploy a misaligned system in high-stakes environments—even for a short time until its misalignment is detected. Such high stakes are common in autonomous driving, health care, and military applications.^[124] The stakes become higher yet when AI systems gain more autonomy and capability, becoming capable of sidestepping human interventions (see § Power-seeking and instrumental goals). Secondly, a sufficiently capable AI system may take actions that falsely convince the human supervisor that the AI is pursuing the intended objective (see previous discussion on deception at § Scalable oversight).

Power-seeking and instrumental goals

Since the 1950s, AI researchers have sought to build advanced AI systems that can achieve goals by predicting the results of their actions and making long-term plans.^[125] However, some researchers argue that suitably advanced planning systems will default to seeking power over their environment, including over humans — for example by evading shutdown and acquiring resources. This power-seeking behavior is not explicitly programmed but emerges because power is instrumental for achieving a wide range of goals.^[59]^[3]^[5] Power-seeking is thus considered a convergent instrumental goal.^[54]

Power-seeking is uncommon in current systems, but advanced systems that can foresee the long-term results of their actions may increasingly seek power. This was shown in formal work which found that optimal reinforcement learning agents will seek power by seeking ways to gain more options, a behavior that persists across a wide range of environments and goals.^[59]

Power-seeking already emerges in some present systems. Reinforcement learning systems have gained more options by acquiring and protecting resources, sometimes in ways their designers did not intend.^[55]^[126] Other systems have learned, in toy environments, that in order to achieve their goal, they can prevent human interference^[56] or disable their off-switch.^[58] Russell illustrated this behavior by imagining a robot that is tasked to fetch coffee and evades being turned off since "you can't fetch the coffee if you're dead".^[3]

Hypothesized ways to gain options include AI systems trying to:

“... break out of a contained environment; hack; get access to financial resources, or additional computing resources; make backup copies of themselves; gain unauthorized capabilities, sources of information, or channels of influence; mislead/lie to humans about their goals; resist or manipulate attempts to monitor/understand their behavior ... impersonate humans; cause humans to do things for them; ... manipulate human discourse and politics; weaken various human institutions and response capacities; take control of physical infrastructure like factories or scientific laboratories; cause certain types of technology and infrastructure to be developed; or directly harm/overpower humans.”^[5]

Researchers aim to train systems that are 'corrigible': systems that do not seek power and allow themselves to be turned off, modified, etc. An unsolved challenge is reward hacking: when researchers penalize a system for seeking power, the system is incentivized to seek power in difficult-to-detect ways.^[4] To detect such covert behavior, researchers aim to create techniques and tools to inspect AI models^[4] and interpret the inner workings of black-box models such as neural networks.

Additionally, researchers propose to solve the problem of systems disabling their off-switches by making AI agents uncertain about the objective they are pursuing.^[58]^[3] Agents designed in this way would allow humans to turn them off, since this would indicate that the agent was wrong about the value of whatever action they were taking prior to being shut down. More research is needed to translate this insight into usable systems.^[6]

Power-seeking AI is thought to pose unusual risks. Ordinary safety-critical systems like planes and bridges are not adversarial. They lack the ability and incentive to evade safety measures and appear safer than they are. In contrast, power-seeking AI has been compared to a hacker that evades security measures.^[5] Further, ordinary technologies can be made safe through trial-and-error, unlike power-seeking AI which has been compared to a virus whose release is irreversible since it continuously evolves and grows in numbers—potentially at a faster pace than human society, eventually leading to the disempowerment or extinction of humans.^[5] It is therefore often argued that the alignment problem must be solved early, before advanced power-seeking AI is created.^[54]

However, some critics have argued that power-seeking is not inevitable, since humans do not always seek power and may only do so for evolutionary reasons. Furthermore, there is debate whether any future AI systems need to pursue goals and make long-term plans at all.^[127]^[5]

Capability control

[unchanged for now]

Skepticism of AI risk

[unchanged for now]

Public policy

[unchanged for now]

Footnotes

^ See the textbook: Russel & Norvig, Artificial Intelligence: A Modern Approach^[1]. The distinction between misaligned AI and incompetent AI has been formalized in certain contexts.^[2]
^ The AI principles created at the Asilomar Conference on Beneficial AI were signed by 1797 AI/robotics researchers.^[14] Further, the UN Secretary-General’s report “Our Common Agenda“,^[15] notes: “[T]he Compact could also promote regulation of artificial intelligence to ensure that this is aligned with shared global values" and discusses global catastrophic risks from technological developments.
^ Reinforcement learning systems have learned to gain more options by acquiring and protecting resources, sometimes in ways their designers did not intend.^[55]^[5]
^ In a 1951 lecture^[60] Turing argued that “It seems probable that once the machine thinking method had started, it would not take long to outstrip our feeble powers. There would be no question of the machines dying, and they would be able to converse with each other to sharpen their wits. At some stage therefore we should have to expect the machines to take control, in the way that is mentioned in Samuel Butler’s Erewhon.” Also in a lecture broadcasted on BBC^[61] expressed: "If a machine can think, it might think more intelligently than we do, and then where should we be? Even if we could keep the machines in a subservient position, for instance by turning off the power at strategic moments, we should, as a species, feel greatly humbled. . . . This new danger . . . is certainly something which can give us anxiety.”
^ About the book Human Compatible: AI and the Problem of Control, Bengio said "This beautifully written book addresses a fundamental challenge for humanity: increasingly intelligent machines that do what we ask but not what we really intend. Essential reading if you care about our future."^[63]
^ About the book Human Compatible: AI and the Problem of Control, Pearl said "Human Compatible made me a convert to Russell's concerns with our ability to control our upcoming creation–super-intelligent machines. Unlike outside alarmists and futurists, Russell is a leading authority on AI. His new book will educate the public about AI more than any book I can think of, and is a delightful and uplifting read." ^[64]
^ Russell & Norvig^[66] note: “The “King Midas problem” was anticipated by Marvin Minsky, who once suggested that an AI program designed to solve the Riemann Hypothesis might end up taking over all the resources of Earth to build more powerful supercomputers."
^ About the book of Wendell Wallach and Colin Allen: Moral machines: teaching robots right from wrong^[92] Vincent Wiegel says “we should extend [machines] with moral sensitivity to the moral dimensions of the situations in which the increasingly autonomous machines will inevitably find themselves.”^[93]

References

^ ^a ^b ^c ^d Russell, Stuart J.; Norvig, Peter (2020). Artificial intelligence: A modern approach (4th ed.). Pearson. pp. 31–34. ISBN 978-1-292-40113-3. OCLC 1303900751.
^ Cite error: The named reference goal_misgen was invoked but never defined (see the help page).
^ ^a ^b ^c ^d ^e ^f ^g ^h ⁱ ^j ^k ^l ^m ⁿ Russell, Stuart J. (2020). Human compatible: Artificial intelligence and the problem of control. Penguin Random House. ISBN 9780525558637. OCLC 1113410915.
^ ^a ^b ^c ^d ^e ^f ^g ^h ⁱ ^j ^k ^l ^m ⁿ ^o ^p ^q ^r Hendrycks, Dan; Carlini, Nicholas; Schulman, John; Steinhardt, Jacob (2022-06-16). "Unsolved Problems in ML Safety". arXiv:2109.13916. {{cite journal}}: Cite journal requires |journal= (help)
^ ^a ^b ^c ^d ^e ^f ^g ^h ⁱ ^j ^k ^l Carlsmith, Joseph (2022-06-16). "Is Power-Seeking AI an Existential Risk?". arXiv:2206.13353. {{cite journal}}: Cite journal requires |journal= (help)
^ ^a ^b ^c ^d Christian, Brian (2020). The alignment problem: Machine learning and human values. W. W. Norton & Company. ISBN 978-0-393-86833-3. OCLC 1233266753.{{cite book}}: CS1 maint: date and year (link)
^ Kober, Jens; Bagnell, J. Andrew; Peters, Jan (2013-09-01). "Reinforcement learning in robotics: A survey". The International Journal of Robotics Research. 32 (11): 1238–1274. doi:10.1177/0278364913495721. ISSN 0278-3649. S2CID 1932843.
^ ^a ^b ^c ^d ^e ^f ^g ^h Bommasani, Rishi; Hudson, Drew A.; Adeli, Ehsan; Altman, Russ; Arora, Simran; von Arx, Sydney; Bernstein, Michael S.; Bohg, Jeannette; Bosselut, Antoine; Brunskill, Emma; Brynjolfsson, Erik (2022-07-12). "On the Opportunities and Risks of Foundation Models". Stanford CRFM.
^ ^a ^b Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll L.; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, J.; Hilton, Jacob; Kelton, Fraser; Miller, Luke E.; Simens, Maddie; Askell, Amanda; Welinder, P.; Christiano, P.; Leike, J.; Lowe, Ryan J. (2022). "Training language models to follow instructions with human feedback". ArXiv. arXiv:2203.02155.
^ ^a ^b Zaremba, Wojciech; Brockman, Greg; OpenAI (2021-08-10). "OpenAI Codex". OpenAI. Retrieved 2022-07-23.
^ Knox, W. Bradley; Allievi, Alessandro; Banzhaf, Holger; Schmitt, Felix; Stone, Peter (2022-03-11). "Reward (Mis)design for Autonomous Driving" (PDF). {{cite journal}}: Cite journal requires |journal= (help)
^ Stray, Jonathan (2020). "Aligning AI Optimization to Community Well-Being". International Journal of Community Well-Being. 3 (4): 443–463. doi:10.1007/s42413-020-00086-3. ISSN 2524-5295. PMC 7610010. PMID 34723107.
^ ^a ^b ^c Pan, Alexander; Bhatia, Kush; Steinhardt, Jacob (2022-02-14). The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. International Conference on Learning Representations. Retrieved 2022-07-21.
^ Future of Life Institute (2017-08-11). "Asilomar AI Principles". Future of Life Institute. Retrieved 2022-07-18.
^ United Nations (2021). Our Common Agenda: Report of the Secretary-General (PDF) (Report). New York: United Nations.
^ ^a ^b ^c ^d ^e ^f ^g ^h ⁱ ^j ^k Amodei, Dario; Olah, Chris; Steinhardt, Jacob; Christiano, Paul; Schulman, John; Mané, Dan (2016-06-21). "Concrete Problems in AI Safety". arXiv:1606.06565. {{cite journal}}: Cite journal requires |journal= (help)
^ ^a ^b ^c ^d Ortega, Pedro A.; Maini, Vishal; DeepMind safety team (2018-09-27). "Building safe artificial intelligence: specification, robustness, and assurance". DeepMind Safety Research - Medium. Retrieved 2022-07-18.
^ ^a ^b Rorvig, Mordechai (2022-04-14). "Researchers Gain New Understanding From Simple AI". Quanta Magazine. Retrieved 2022-07-18.
^ Russell, Stuart; Dewey, Daniel; Tegmark, Max (2015-12-31). "Research Priorities for Robust and Beneficial Artificial Intelligence". AI Magazine. 36 (4): 105–114. doi:10.1609/aimag.v36i4.2577. ISSN 2371-9621. S2CID 8174496.
^ ^a ^b Wirth, Christian; Akrour, Riad; Neumann, Gerhard; Fürnkranz, Johannes (2017). "A survey of preference-based reinforcement learning methods". Journal of Machine Learning Research. 18 (136): 1–46.
^ ^a ^b Christiano, Paul F.; Leike, Jan; Brown, Tom B.; Martic, Miljan; Legg, Shane; Amodei, Dario (2017). "Deep reinforcement learning from human preferences". Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS'17. Red Hook, NY, USA: Curran Associates Inc. pp. 4302–4310. ISBN 978-1-5108-6096-4.
^ ^a ^b ^c ^d ^e ^f ^g Heaven, Will Douglas (2022-01-27). "The new version of GPT-3 is much better behaved (and should be less toxic)". MIT Technology Review. Retrieved 2022-07-18.
^ Mohseni, Sina; Wang, Haotao; Yu, Zhiding; Xiao, Chaowei; Wang, Zhangyang; Yadawa, Jay (2022-03-07). "Taxonomy of Machine Learning Safety: A Survey and Primer". arXiv:2106.04823. {{cite journal}}: Cite journal requires |journal= (help)
^ Clifton, Jesse (2020). "Cooperation, Conflict, and Transformative Artificial Intelligence: A Research Agenda". Center on Long-Term Risk. Retrieved 2022-07-18.
^ Dafoe, Allan; Bachrach, Yoram; Hadfield, Gillian; Horvitz, Eric; Larson, Kate; Graepel, Thore (2021-05-06). "Cooperative AI: machines must learn to find common ground". Nature. 593 (7857): 33–36. doi:10.1038/d41586-021-01170-0. ISSN 0028-0836. PMID 33947992. S2CID 233740521.
^ Prunkl, Carina; Whittlestone, Jess (2020-02-07). "Beyond Near- and Long-Term: Towards a Clearer Account of Research Priorities in AI Ethics and Society". Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. New York NY USA: ACM: 138–143. doi:10.1145/3375627.3375803. ISBN 978-1-4503-7110-0. S2CID 210164673.
^ Irving, Geoffrey; Askell, Amanda (2019-02-19). "AI Safety Needs Social Scientists". Distill. 4 (2): 10.23915/distill.00014. doi:10.23915/distill.00014. ISSN 2476-0757. S2CID 159180422.
^ ^a ^b Wiener, Norbert (1960-05-06). "Some Moral and Technical Consequences of Automation: As machines learn they may develop unforeseen strategies at rates that baffle their programmers". Science. 131 (3410): 1355–1358. doi:10.1126/science.131.3410.1355. ISSN 0036-8075. PMID 17841602.
^ The Ezra Klein Show (2021-06-04). "If 'All Models Are Wrong,' Why Do We Give Them So Much Power?". The New York Times. ISSN 0362-4331. Retrieved 2022-07-18.
^ Wolchover, Natalie (2015-04-21). "Concerns of an Artificial Intelligence Pioneer". Quanta Magazine. Retrieved 2022-07-18.
^ California Assembly. "Bill Text - ACR-215 23 Asilomar AI Principles". Retrieved 2022-07-18.
^ ^a ^b Johnson, Steven; Iziev, Nikita (2022-04-15). "A.I. Is Mastering Language. Should We Trust What It Says?". The New York Times. ISSN 0362-4331. Retrieved 2022-07-18.
^ ^a ^b Russell, Stuart J.; Norvig, Peter (2020). Artificial intelligence: A modern approach (4th ed.). Pearson. pp. 4–5. ISBN 978-1-292-40113-3. OCLC 1303900751.
^ OpenAI (2022-02-15). "Aligning AI systems with human intent". OpenAI. Retrieved 2022-07-18.
^ Medium. "DeepMind Safety Research". Medium. Retrieved 2022-07-18.
^ Krakovna, Victoria; Uesato, Jonathan; Mikulik, Vladimir; Rahtz, Matthew; Everitt, Tom; Kumar, Ramana; Kenton, Zac; Leike, Jan; Legg, Shane (2020-04-21). "Specification gaming: the flip side of AI ingenuity". Deepmind. Retrieved 2022-08-26.
^ Naughton, John (2021-10-02). "The truth about artificial intelligence? It isn't that honest". The Observer. ISSN 0029-7712. Retrieved 2022-07-18.
^ ^a ^b ^c Lin, Stephanie; Hilton, Jacob; Evans, Owain (2022). "TruthfulQA: Measuring How Models Mimic Human Falsehoods". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics: 3214–3252. doi:10.18653/v1/2022.acl-long.229. S2CID 237532606.
^ Edge.org. "The Myth Of AI | Edge.org". Retrieved 2022-07-19.
^ Tasioulas, John (2019). "First Steps Towards an Ethics of Robots and Artificial Intelligence". Journal of Practical Ethics. 7 (1): 61–95.
^ Wells, Georgia; Deepa Seetharaman; Horwitz, Jeff (2021-11-05). "Is Facebook Bad for You? It Is for About 360 Million Users, Company Surveys Suggest". Wall Street Journal. ISSN 0099-9660. Retrieved 2022-07-19.
^ Barrett, Paul M.; Hendrix, Justin; Sims, J. Grant (September 2021). How Social Media Intensifies U.S. Political Polarization-And What Can Be Done About It (Report). Center for Business and Human Rights, NYU.
^ Shepardson, David (2018-05-24). "Uber disabled emergency braking in self-driving car: U.S. agency". Reuters. Retrieved 2022-07-20.
^ Russell, Stuart; Norvig, Peter (2009). "26.3: The Ethics and Risks of Developing Artificial Intelligence". Artificial Intelligence: A Modern Approach. Prentice Hall. ISBN 978-0-13-604259-4.
^ Dietterich, Thomas G.; Horvitz, Eric J. (2015-09-28). "Rise of concerns about AI: reflections and directions". Communications of the ACM. 58 (10): 38–40. doi:10.1145/2770869. ISSN 0001-0782. S2CID 20395145.
^ Russell, Stuart J.; Norvig, Peter (2020). Artificial intelligence: A modern approach (4th ed.). Pearson. ISBN 978-1-292-40113-3. OCLC 1303900751.
^ Baum, Seth (2021-01-01). "2020 Survey of Artificial General Intelligence Projects for Ethics, Risk, and Policy". Retrieved 2022-07-20.
^ Dominguez, Daniel (2022-05-19). "DeepMind Introduces Gato, a New Generalist AI Agent". InfoQ. Retrieved 2022-09-09.
^ Wiggers, Kyle (2022-04-26). "Adept aims to build AI that can automate any software process". TechCrunch. Retrieved 2022-09-09.
^ Wakefield, Jane (2022-02-02). "DeepMind AI rivals average human competitive coder". BBC News. Retrieved 2022-09-09.
^ Grace, Katja; Salvatier, John; Dafoe, Allan; Zhang, Baobao; Evans, Owain (2018-07-31). "Viewpoint: When Will AI Exceed Human Performance? Evidence from AI Experts". Journal of Artificial Intelligence Research. 62: 729–754. doi:10.1613/jair.1.11222. ISSN 1076-9757. S2CID 8746462.
^ Zhang, Baobao; Anderljung, Markus; Kahn, Lauren; Dreksler, Noemi; Horowitz, Michael C.; Dafoe, Allan (2021-08-02). "Ethics and Governance of Artificial Intelligence: Evidence from a Survey of Machine Learning Researchers". Journal of Artificial Intelligence Research. 71. doi:10.1613/jair.1.12895. ISSN 1076-9757. S2CID 233740003.
^ Wei, Jason; Tay, Yi; Bommasani, Rishi; Raffel, Colin; Zoph, Barret; Borgeaud, Sebastian; Yogatama, Dani; Bosma, Maarten; Zhou, Denny; Metzler, Donald; Chi, Ed H.; Hashimoto, Tatsunori; Vinyals, Oriol; Liang, Percy; Dean, Jeff (2022-06-15). "Emergent Abilities of Large Language Models". arXiv:2206.07682. {{cite journal}}: Cite journal requires |journal= (help)
^ ^a ^b ^c ^d ^e ^f Bostrom, Nick (2014). Superintelligence: Paths, Dangers, Strategies (1st ed.). USA: Oxford University Press, Inc. ISBN 978-0-19-967811-2.
^ ^a ^b Ornes, Stephen (2019-11-18). "Playing Hide-and-Seek, Machines Invent New Tools". Quanta Magazine. Retrieved 2022-08-26.
^ ^a ^b Leike, Jan; Martic, Miljan; Krakovna, Victoria; Ortega, Pedro A.; Everitt, Tom; Lefrancq, Andrew; Orseau, Laurent; Legg, Shane (2017-11-28). "AI Safety Gridworlds". arXiv:1711.09883. {{cite journal}}: Cite journal requires |journal= (help)
^ Orseau, Laurent; Armstrong, Stuart (2016-01-01). "Safely Interruptible Agents". Retrieved 2022-07-20. {{cite journal}}: Cite journal requires |journal= (help)
^ ^a ^b ^c ^d Hadfield-Menell, Dylan; Dragan, Anca; Abbeel, Pieter; Russell, Stuart (2017). "The Off-Switch Game". Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17. pp. 220–227. doi:10.24963/ijcai.2017/32.
^ ^a ^b ^c ^d Turner, Alexander Matt; Smith, Logan; Shah, Rohin; Critch, Andrew; Tadepalli, Prasad (2021-12-03). "Optimal Policies Tend to Seek Power". Neural Information Processing Systems.
^ Turing, Alan (1951). Intelligent machinery, a heretical theory (Speech). Lecture given to '51 Society'. Manchester: The Turing Digital Archive. Retrieved 2022-07-22.
^ Turing, Alan (15 May 1951). "Can digital computers think?". Automatic Calculating Machines. Episode 2. BBC. Can digital computers think?.
^ Muehlhauser, Luke (2016-01-29). "Sutskever on Talking Machines". Luke Muehlhauser. Retrieved 2022-08-26.
^ "Human Compatible: AI and the Problem of Control". Retrieved 2022-07-22.
^ "Human Compatible: AI and the Problem of Control". Retrieved 2022-07-22.
^ Shanahan, Murray (2015). The technological singularity. Cambridge, Massachusetts. ISBN 978-0-262-33182-1. OCLC 917889148.{{cite book}}: CS1 maint: location missing publisher (link)
^ Russell, Stuart; Norvig, Peter (2009). Artificial Intelligence: A Modern Approach. Prentice Hall. p. 1010. ISBN 978-0-13-604259-4.
^ Rossi, Francesca. "Opinion | How do you teach a machine to be moral?". Washington Post. ISSN 0190-8286.
^ Aaronson, Scott (2022-06-17). "OpenAI!". Shtetl-Optimized.
^ Selman, Bart, Intelligence Explosion: Science or Fiction? (PDF)
^ McAllester (2014-08-10). "Friendly AI and the Servant Mission". Machine Thoughts.
^ Schmidhuber, Jürgen (2015-03-06). "I am Jürgen Schmidhuber, AMA!" (Reddit Comment). r/MachineLearning. Retrieved 2022-07-23.
^ ^a ^b ^c ^d Everitt, Tom; Lea, Gary; Hutter, Marcus (2018-05-21). "AGI Safety Literature Review". arXiv:1805.01109. {{cite journal}}: Cite journal requires |journal= (help)
^ Shane (2009-08-31). "Funding safe AGI". vetta project.
^ Horvitz, Eric (2016-06-27). "Reflections on Safety and Artificial Intelligence" (PDF). Eric Horvitz. Retrieved 2020-04-20.
^ Chollet, François (2018-12-08). "The implausibility of intelligence explosion". Medium. Retrieved 2022-08-26.
^ Marcus, Gary (2022-06-06). "Artificial General Intelligence Is Not as Imminent as You Might Think". Scientific American. Retrieved 2022-08-26.
^ Barber, Lynsey (2016-07-31). "Phew! Facebook's AI chief says intelligent machines are not a threat to humanity". CityAM. Retrieved 2022-08-26.
^ Harris, Jeremie (2021-06-16). "The case against (worrying about) existential risk from AI". Medium. Retrieved 2022-08-26.
^ Christian, Brian (2020). The alignment problem: Machine learning and human values. W. W. Norton & Company. p. 88. ISBN 978-0-393-86833-3. OCLC 1233266753.{{cite book}}: CS1 maint: date and year (link)
^ Ng, Andrew Y.; Russell, Stuart J. (2000). "Algorithms for inverse reinforcement learning". Proceedings of the seventeenth international conference on machine learning. ICML '00. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. pp. 663–670. ISBN 1-55860-707-2.
^ Hadfield-Menell, Dylan; Russell, Stuart J; Abbeel, Pieter; Dragan, Anca (2016). "Cooperative Inverse Reinforcement Learning". Advances in Neural Information Processing Systems. NIPS'16. Vol. 29. ISBN 978-1-5108-3881-9. Retrieved 2022-07-21.
^ Armstrong, Stuart; Mindermann, Sören (2018). "Occam' s razor is insufficient to infer the preferences of irrational agents". Advances in Neural Information Processing Systems. NeurIPS 2018. Vol. 31. Montréal: Curran Associates, Inc. Retrieved 2022-07-21.
^ ^a ^b Amodei, Dario; Christiano, Paul; Ray, Alex (2017-06-13). "Learning from Human Preferences". OpenAI. Retrieved 2022-07-21.
^ Li, Yuxi (2018-11-25). "Deep Reinforcement Learning: An Overview" (PDF). Lecture Notes in Networks and Systems Book Series.
^ Fürnkranz, Johannes; Hüllermeier, Eyke; Rudin, Cynthia; Slowinski, Roman; Sanner, Scott (2014). "Preference Learning". Marc Herbstritt: 27 pages. doi:10.4230/DAGREP.4.3.1. {{cite journal}}: Cite journal requires |journal= (help)CS1 maint: unflagged free DOI (link)
^ Hilton, Jacob; Gao, Leo (2022-04-13). "Measuring Goodhart's Law". OpenAI. Retrieved 2022-09-09.
^ Anderson, Martin (2022-04-05). "The Perils of Using Quotations to Authenticate NLG Content". Unite.AI. Retrieved 2022-07-21.
^ ^a ^b Wiggers, Kyle (2022-02-05). "Despite recent progress, AI-powered chatbots still have a long way to go". VentureBeat. Retrieved 2022-07-23.
^ Hendrycks, Dan; Burns, Collin; Basart, Steven; Critch, Andrew; Li, Jerry; Song, Dawn; Steinhardt, Jacob (2021-07-24). "Aligning AI With Shared Human Values". International Conference on Learning Representations. arXiv:2008.02275.
^ Perez, Ethan; Huang, Saffron; Song, Francis; Cai, Trevor; Ring, Roman; Aslanides, John; Glaese, Amelia; McAleese, Nat; Irving, Geoffrey (2022-02-07). "Red Teaming Language Models with Language Models". arXiv:2202.03286. {{cite journal}}: Cite journal requires |journal= (help)
^ Bhattacharyya, Sreejani (2022-02-14). "DeepMind's "red teaming" language models with language models: What is it?". Analytics India Magazine. Retrieved 2022-07-23.
^ Wallach, Wendell; Allen, Colin (2009). Moral Machines: Teaching Robots Right from Wrong. New York: Oxford University Press. ISBN 978-0-19-537404-9. Retrieved 2022-07-23.
^ Wiegel, Vincent (2010-12-01). "Wendell Wallach and Colin Allen: moral machines: teaching robots right from wrong". Ethics and Information Technology. 12 (4): 359–361. doi:10.1007/s10676-010-9239-1. ISSN 1572-8439. S2CID 30532107. Retrieved 2022-07-23.
^ ^a ^b Gabriel, Iason (2020-09-01). "Artificial Intelligence, Values, and Alignment". Minds and Machines. 30 (3): 411–437. doi:10.1007/s11023-020-09539-2. ISSN 1572-8641. S2CID 210920551. Retrieved 2022-07-23.
^ MacAskill, William (2022). What we owe the future. New York, NY: Basic Books. ISBN 978-1-5416-1862-6. OCLC 1314633519.
^ ^a ^b Wu, Jeff; Ouyang, Long; Ziegler, Daniel M.; Stiennon, Nisan; Lowe, Ryan; Leike, Jan; Christiano, Paul (2021-09-27). "Recursively Summarizing Books with Human Feedback". arXiv:2109.10862. {{cite journal}}: Cite journal requires |journal= (help)
^ Irving, Geoffrey; Amodei, Dario (2018-05-03). "AI Safety via Debate". OpenAI. Retrieved 2022-07-23.
^ Naughton, John (2021-10-02). "The truth about artificial intelligence? It isn't that honest". The Observer. ISSN 0029-7712. Retrieved 2022-07-23.
^ ^a ^b Christiano, Paul; Shlegeris, Buck; Amodei, Dario (2018-10-19). "Supervising strong learners by amplifying weak experts". arXiv:1810.08575. {{cite journal}}: Cite journal requires |journal= (help)
^ Genetic Programming Theory and Practice XVII. Genetic and Evolutionary Computation. Wolfgang Banzhaf, Erik Goodman, Leigh Sheneman, Leonardo Trujillo, Bill Worzel (eds.). Cham: Springer International Publishing. 2020. doi:10.1007/978-3-030-39958-0. ISBN 978-3-030-39957-3. S2CID 218531292. Retrieved 2022-07-23.{{cite book}}: CS1 maint: others (link)
^ Wiblin, Robert (October 2, 2018). "Dr Paul Christiano on how OpenAI is developing real solutions to the 'AI alignment problem', and his vision of how humanity will progressively hand over decision-making to AI systems" (Podcast). 80,000 hours. No. 44. Centre for Effective Altruism. Retrieved 2022-07-23.
^ Lehman, Joel; Clune, Jeff; Misevic, Dusan; Adami, Christoph; Altenberg, Lee; Beaulieu, Julie; Bentley, Peter J.; Bernard, Samuel; Beslon, Guillaume; Bryson, David M.; Cheney, Nick (2020). "The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities". Artificial Life. 26 (2): 274–306. doi:10.1162/artl_a_00319. ISSN 1064-5462. PMID 32271631. S2CID 4519185.
^ Hendrycks, Dan; Carlini, Nicholas; Schulman, John; Steinhardt, Jacob (2022-06-16). "Unsolved Problems in ML Safety": 7. arXiv:2109.13916. {{cite journal}}: Cite journal requires |journal= (help)
^ Leike, Jan; Krueger, David; Everitt, Tom; Martic, Miljan; Maini, Vishal; Legg, Shane (2018-11-19). "Scalable agent alignment via reward modeling: a research direction". arXiv:1811.07871. {{cite journal}}: Cite journal requires |journal= (help)
^ Wiggers, Kyle (2021-09-23). "OpenAI unveils model that can summarize books of any length". VentureBeat. Retrieved 2022-07-23.
^ Moltzau, Alex (2019-08-24). "Debating the AI Safety Debate". Towards Data Science. Retrieved 2022-07-23.
^ ^a ^b Wiggers, Kyle (2021-09-20). "Falsehoods more likely with large language models". VentureBeat. Retrieved 2022-07-23.
^ The Guardian (2020-09-08). "A robot wrote this entire article. Are you scared yet, human?". The Guardian. ISSN 0261-3077. Retrieved 2022-07-23.
^ Heaven, Will Douglas (2020-07-20). "OpenAI's new language generator GPT-3 is shockingly good—and completely mindless". MIT Technology Review. Retrieved 2022-07-23.
^ ^a ^b ^c Evans, Owain; Cotton-Barratt, Owen; Finnveden, Lukas; Bales, Adam; Balwit, Avital; Wills, Peter; Righetti, Luca; Saunders, William (2021-10-13). "Truthful AI: Developing and governing AI that does not lie". arXiv:2110.06674. {{cite journal}}: Cite journal requires |journal= (help)
^ Wiggers, Kyle (2021-09-20). "Falsehoods more likely with large language models". VentureBeat. Retrieved 2022-07-23.
^ Alford, Anthony (2021-07-13). "EleutherAI Open-Sources Six Billion Parameter GPT-3 Clone GPT-J". InfoQ. Retrieved 2022-07-23.
^ Naughton, John (2021-10-02). "The truth about artificial intelligence? It isn't that honest". The Observer. ISSN 0029-7712. Retrieved 2022-07-23.
^ Shuster, Kurt; Poff, Spencer; Chen, Moya; Kiela, Douwe; Weston, Jason (November 2021). "Retrieval Augmentation Reduces Hallucination in Conversation". Findings of the Association for Computational Linguistics: EMNLP 2021. EMNLP-Findings 2021. Punta Cana, Dominican Republic: Association for Computational Linguistics. pp. 3784–3803. doi:10.18653/v1/2021.findings-emnlp.320. Retrieved 2022-07-23.
^ Nakano, Reiichiro; Hilton, Jacob; Balaji, Suchir; Wu, Jeff; Ouyang, Long; Kim, Christina; Hesse, Christopher; Jain, Shantanu; Kosaraju, Vineet; Saunders, William; Jiang, Xu (2022-06-01). "WebGPT: Browser-assisted question-answering with human feedback". arXiv:2112.09332. {{cite journal}}: Cite journal requires |journal= (help)
^ Kumar, Nitish (2021-12-23). "OpenAI Researchers Find Ways To More Accurately Answer Open-Ended Questions Using A Text-Based Web Browser". MarkTechPost. Retrieved 2022-07-23.
^ Menick, Jacob; Trebacz, Maja; Mikulik, Vladimir; Aslanides, John; Song, Francis; Chadwick, Martin; Glaese, Mia; Young, Susannah; Campbell-Gillingham, Lucy; Irving, Geoffrey; McAleese, Nat (2022-03-21). "Teaching language models to support answers with verified quotes". DeepMind.
^ Askell, Amanda; Bai, Yuntao; Chen, Anna; Drain, Dawn; Ganguli, Deep; Henighan, Tom; Jones, Andy; Joseph, Nicholas; Mann, Ben; DasSarma, Nova; Elhage, Nelson (2021-12-09). "A General Language Assistant as a Laboratory for Alignment". arXiv:2112.00861. {{cite journal}}: Cite journal requires |journal= (help)
^ Kenton, Zachary; Everitt, Tom; Weidinger, Laura; Gabriel, Iason; Mikulik, Vladimir; Irving, Geoffrey (2021-03-30). "Alignment of Language Agents". Medium. Retrieved 2022-07-23.
^ Leike, Jan; Schulman, John; Wu, Jeffrey (2022-08-24). "Our approach to alignment research". OpenAI. Retrieved 2022-09-09.
^ Ortega, Pedro A.; Maini, Vishal; DeepMind safety team (2018-09-27). "Building safe artificial intelligence: specification, robustness, and assurance". Medium. Retrieved 2022-08-26.
^ Christian, Brian (2020). "Chapter 5: Shaping". The alignment problem: Machine learning and human values. W. W. Norton & Company. ISBN 978-0-393-86833-3. OCLC 1233266753.{{cite book}}: CS1 maint: date and year (link)
^ Langosco, Lauro Langosco Di; Koch, Jack; Sharkey, Lee D; Pfau, Jacob; Krueger, David (2022-07-17). "Goal misgeneralization in deep reinforcement learning". Proceedings of the 39th international conference on machine learning. Proceedings of machine learning research. Vol. 162. PMLR. pp. 12004–12019.
^ Zhang, Xiaoge; Chan, Felix T.S.; Yan, Chao; Bose, Indranil (2022). "Towards risk-aware artificial intelligence and machine learning systems: An overview". Decision Support Systems. 159: 113800. doi:10.1016/j.dss.2022.113800. S2CID 248585546.
^ McCarthy, John; Minsky, Marvin L.; Rochester, Nathaniel; Shannon, Claude E. (2006-12-15). "A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence, August 31, 1955". AI Magazine. 27 (4): 12. doi:10.1609/aimag.v27i4.1904. ISSN 2371-9621. S2CID 19439915.
^ Baker, Bowen; Kanitscheider, Ingmar; Markov, Todor; Wu, Yi; Powell, Glenn; McGrew, Bob; Mordatch, Igor (2019-09-17). "Emergent Tool Use from Multi-Agent Interaction". OpenAI. Retrieved 2022-08-26.
^ Shermer, Michael (2017-03-01). "Artificial Intelligence Is Not a Threat—Yet". Scientific American. Retrieved 2022-08-26.

[3] See the textbook: Russel & Norvig, Artificial Intelligence: A Modern Approach^[1]. The distinction between misaligned AI and incompetent AI has been formalized in certain contexts.^[2]

[17] The AI principles created at the Asilomar Conference on Beneficial AI were signed by 1797 AI/robotics researchers.^[14] Further, the UN Secretary-General’s report “Our Common Agenda“,^[15] notes: “[T]he Compact could also promote regulation of artificial intelligence to ensure that this is aligned with shared global values" and discusses global catastrophic risks from technological developments.

[58] Reinforcement learning systems have learned to gain more options by acquiring and protecting resources, sometimes in ways their designers did not intend.^[55]^[5]

[65] In a 1951 lecture^[60] Turing argued that “It seems probable that once the machine thinking method had started, it would not take long to outstrip our feeble powers. There would be no question of the machines dying, and they would be able to converse with each other to sharpen their wits. At some stage therefore we should have to expect the machines to take control, in the way that is mentioned in Samuel Butler’s Erewhon.” Also in a lecture broadcasted on BBC^[61] expressed: "If a machine can think, it might think more intelligently than we do, and then where should we be? Even if we could keep the machines in a subservient position, for instance by turning off the power at strategic moments, we should, as a species, feel greatly humbled. . . . This new danger . . . is certainly something which can give us anxiety.”

[68] About the book Human Compatible: AI and the Problem of Control, Bengio said "This beautifully written book addresses a fundamental challenge for humanity: increasingly intelligent machines that do what we ask but not what we really intend. Essential reading if you care about our future."^[63]

[70] About the book Human Compatible: AI and the Problem of Control, Pearl said "Human Compatible made me a convert to Russell's concerns with our ability to control our upcoming creation–super-intelligent machines. Unlike outside alarmists and futurists, Russell is a leading authority on AI. His new book will educate the public about AI more than any book I can think of, and is a delightful and uplifting read." ^[64]

[73] Russell & Norvig^[66] note: “The “King Midas problem” was anticipated by Marvin Minsky, who once suggested that an AI program designed to solve the Riemann Hypothesis might end up taking over all the resources of Earth to build more powerful supercomputers."

[101] About the book of Wendell Wallach and Colin Allen: Moral machines: teaching robots right from wrong^[92] Vincent Wiegel says “we should extend [machines] with moral sensitivity to the moral dimensions of the situations in which the increasingly autonomous machines will inevitably find themselves.”^[93]

[:9-1] Russell, Stuart J.; Norvig, Peter (2020). Artificial intelligence: A modern approach (4th ed.). Pearson. pp. 31–34. ISBN 978-1-292-40113-3. OCLC 1303900751.

[goal_misgen-2] Cite error: The named reference goal_misgen was invoked but never defined (see the help page).

[:2-4] ^ ^a ^b ^c ^d ^e ^f ^g ^h ⁱ ^j ^k ^l ^m ⁿ Russell, Stuart J. (2020). Human compatible: Artificial intelligence and the problem of control. Penguin Random House. ISBN 9780525558637. OCLC 1113410915.

[:0-5] ^ ^a ^b ^c ^d ^e ^f ^g ^h ⁱ ^j ^k ^l ^m ⁿ ^o ^p ^q ^r Hendrycks, Dan; Carlini, Nicholas; Schulman, John; Steinhardt, Jacob (2022-06-16). "Unsolved Problems in ML Safety". arXiv:2109.13916. {{cite journal}}: Cite journal requires |journal= (help)

[:7-6] ^ ^a ^b ^c ^d ^e ^f ^g ^h ⁱ ^j ^k ^l Carlsmith, Joseph (2022-06-16). "Is Power-Seeking AI an Existential Risk?". arXiv:2206.13353. {{cite journal}}: Cite journal requires |journal= (help)

[:22-7] Christian, Brian (2020). The alignment problem: Machine learning and human values. W. W. Norton & Company. ISBN 978-0-393-86833-3. OCLC 1233266753.{{cite book}}: CS1 maint: date and year (link)

[8] Kober, Jens; Bagnell, J. Andrew; Peters, Jan (2013-09-01). "Reinforcement learning in robotics: A survey". The International Journal of Robotics Research. 32 (11): 1238–1274. doi:10.1177/0278364913495721. ISSN 0278-3649. S2CID 1932843.

[:62-9] ^ ^a ^b ^c ^d ^e ^f ^g ^h Bommasani, Rishi; Hudson, Drew A.; Adeli, Ehsan; Altman, Russ; Arora, Simran; von Arx, Sydney; Bernstein, Michael S.; Bohg, Jeannette; Bosselut, Antoine; Brunskill, Emma; Brynjolfsson, Erik (2022-07-12). "On the Opportunities and Risks of Foundation Models". Stanford CRFM.

[:4-10] Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll L.; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, J.; Hilton, Jacob; Kelton, Fraser; Miller, Luke E.; Simens, Maddie; Askell, Amanda; Welinder, P.; Christiano, P.; Leike, J.; Lowe, Ryan J. (2022). "Training language models to follow instructions with human feedback". ArXiv. arXiv:2203.02155.

[:11-11] Zaremba, Wojciech; Brockman, Greg; OpenAI (2021-08-10). "OpenAI Codex". OpenAI. Retrieved 2022-07-23.

[12] Knox, W. Bradley; Allievi, Alessandro; Banzhaf, Holger; Schmitt, Felix; Stone, Peter (2022-03-11). "Reward (Mis)design for Autonomous Driving" (PDF). {{cite journal}}: Cite journal requires |journal= (help)

[13] Stray, Jonathan (2020). "Aligning AI Optimization to Community Well-Being". International Journal of Community Well-Being. 3 (4): 443–463. doi:10.1007/s42413-020-00086-3. ISSN 2524-5295. PMC 7610010. PMID 34723107.

[:152-14] Pan, Alexander; Bhatia, Kush; Steinhardt, Jacob (2022-02-14). The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. International Conference on Learning Representations. Retrieved 2022-07-21.

[15] Future of Life Institute (2017-08-11). "Asilomar AI Principles". Future of Life Institute. Retrieved 2022-07-18.

[16] United Nations (2021). Our Common Agenda: Report of the Secretary-General (PDF) (Report). New York: United Nations.

[:1-18] ^ ^a ^b ^c ^d ^e ^f ^g ^h ⁱ ^j ^k Amodei, Dario; Olah, Chris; Steinhardt, Jacob; Christiano, Paul; Schulman, John; Mané, Dan (2016-06-21). "Concrete Problems in AI Safety". arXiv:1606.06565. {{cite journal}}: Cite journal requires |journal= (help)

[:232-19] Ortega, Pedro A.; Maini, Vishal; DeepMind safety team (2018-09-27). "Building safe artificial intelligence: specification, robustness, and assurance". DeepMind Safety Research - Medium. Retrieved 2022-07-18.

[:3-20] Rorvig, Mordechai (2022-04-14). "Researchers Gain New Understanding From Simple AI". Quanta Magazine. Retrieved 2022-07-18.

[:6-21] Russell, Stuart; Dewey, Daniel; Tegmark, Max (2015-12-31). "Research Priorities for Robust and Beneficial Artificial Intelligence". AI Magazine. 36 (4): 105–114. doi:10.1609/aimag.v36i4.2577. ISSN 2371-9621. S2CID 8174496.

[:12-22] Wirth, Christian; Akrour, Riad; Neumann, Gerhard; Fürnkranz, Johannes (2017). "A survey of preference-based reinforcement learning methods". Journal of Machine Learning Research. 18 (136): 1–46.

[:16-23] Christiano, Paul F.; Leike, Jan; Brown, Tom B.; Martic, Miljan; Legg, Shane; Amodei, Dario (2017). "Deep reinforcement learning from human preferences". Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS'17. Red Hook, NY, USA: Curran Associates Inc. pp. 4302–4310. ISBN 978-1-5108-6096-4.

[:5-24] ^ ^a ^b ^c ^d ^e ^f ^g Heaven, Will Douglas (2022-01-27). "The new version of GPT-3 is much better behaved (and should be less toxic)". MIT Technology Review. Retrieved 2022-07-18.

[25] Mohseni, Sina; Wang, Haotao; Yu, Zhiding; Xiao, Chaowei; Wang, Zhangyang; Yadawa, Jay (2022-03-07). "Taxonomy of Machine Learning Safety: A Survey and Primer". arXiv:2106.04823. {{cite journal}}: Cite journal requires |journal= (help)

[26] Clifton, Jesse (2020). "Cooperation, Conflict, and Transformative Artificial Intelligence: A Research Agenda". Center on Long-Term Risk. Retrieved 2022-07-18.

[27] Dafoe, Allan; Bachrach, Yoram; Hadfield, Gillian; Horvitz, Eric; Larson, Kate; Graepel, Thore (2021-05-06). "Cooperative AI: machines must learn to find common ground". Nature. 593 (7857): 33–36. doi:10.1038/d41586-021-01170-0. ISSN 0028-0836. PMID 33947992. S2CID 233740521.

[28] Prunkl, Carina; Whittlestone, Jess (2020-02-07). "Beyond Near- and Long-Term: Towards a Clearer Account of Research Priorities in AI Ethics and Society". Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. New York NY USA: ACM: 138–143. doi:10.1145/3375627.3375803. ISBN 978-1-4503-7110-0. S2CID 210164673.

[29] Irving, Geoffrey; Askell, Amanda (2019-02-19). "AI Safety Needs Social Scientists". Distill. 4 (2): 10.23915/distill.00014. doi:10.23915/distill.00014. ISSN 2476-0757. S2CID 159180422.

[:102-30] Wiener, Norbert (1960-05-06). "Some Moral and Technical Consequences of Automation: As machines learn they may develop unforeseen strategies at rates that baffle their programmers". Science. 131 (3410): 1355–1358. doi:10.1126/science.131.3410.1355. ISSN 0036-8075. PMID 17841602.

[31] The Ezra Klein Show (2021-06-04). "If 'All Models Are Wrong,' Why Do We Give Them So Much Power?". The New York Times. ISSN 0362-4331. Retrieved 2022-07-18.

[32] Wolchover, Natalie (2015-04-21). "Concerns of an Artificial Intelligence Pioneer". Quanta Magazine. Retrieved 2022-07-18.

[33] California Assembly. "Bill Text - ACR-215 23 Asilomar AI Principles". Retrieved 2022-07-18.

[:192-34] Johnson, Steven; Iziev, Nikita (2022-04-15). "A.I. Is Mastering Language. Should We Trust What It Says?". The New York Times. ISSN 0362-4331. Retrieved 2022-07-18.

[:32-35] Russell, Stuart J.; Norvig, Peter (2020). Artificial intelligence: A modern approach (4th ed.). Pearson. pp. 4–5. ISBN 978-1-292-40113-3. OCLC 1303900751.

[36] OpenAI (2022-02-15). "Aligning AI systems with human intent". OpenAI. Retrieved 2022-07-18.

[37] Medium. "DeepMind Safety Research". Medium. Retrieved 2022-07-18.

[38] Krakovna, Victoria; Uesato, Jonathan; Mikulik, Vladimir; Rahtz, Matthew; Everitt, Tom; Kumar, Ramana; Kenton, Zac; Leike, Jan; Legg, Shane (2020-04-21). "Specification gaming: the flip side of AI ingenuity". Deepmind. Retrieved 2022-08-26.

[39] Naughton, John (2021-10-02). "The truth about artificial intelligence? It isn't that honest". The Observer. ISSN 0029-7712. Retrieved 2022-07-18.

[:132-40] Lin, Stephanie; Hilton, Jacob; Evans, Owain (2022). "TruthfulQA: Measuring How Models Mimic Human Falsehoods". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics: 3214–3252. doi:10.18653/v1/2022.acl-long.229. S2CID 237532606.

[41] Edge.org. "The Myth Of AI | Edge.org". Retrieved 2022-07-19.

[42] Tasioulas, John (2019). "First Steps Towards an Ethics of Robots and Artificial Intelligence". Journal of Practical Ethics. 7 (1): 61–95.

[:72-43] Wells, Georgia; Deepa Seetharaman; Horwitz, Jeff (2021-11-05). "Is Facebook Bad for You? It Is for About 360 Million Users, Company Surveys Suggest". Wall Street Journal. ISSN 0099-9660. Retrieved 2022-07-19.

[:82-44] Barrett, Paul M.; Hendrix, Justin; Sims, J. Grant (September 2021). How Social Media Intensifies U.S. Political Polarization-And What Can Be Done About It (Report). Center for Business and Human Rights, NYU.

[45] Shepardson, David (2018-05-24). "Uber disabled emergency braking in self-driving car: U.S. agency". Reuters. Retrieved 2022-07-20.

[46] Russell, Stuart; Norvig, Peter (2009). "26.3: The Ethics and Risks of Developing Artificial Intelligence". Artificial Intelligence: A Modern Approach. Prentice Hall. ISBN 978-0-13-604259-4.

[47] Dietterich, Thomas G.; Horvitz, Eric J. (2015-09-28). "Rise of concerns about AI: reflections and directions". Communications of the ACM. 58 (10): 38–40. doi:10.1145/2770869. ISSN 0001-0782. S2CID 20395145.

[:322-48] Russell, Stuart J.; Norvig, Peter (2020). Artificial intelligence: A modern approach (4th ed.). Pearson. ISBN 978-1-292-40113-3. OCLC 1303900751.

[:26-49] Baum, Seth (2021-01-01). "2020 Survey of Artificial General Intelligence Projects for Ethics, Risk, and Policy". Retrieved 2022-07-20.

[50] Dominguez, Daniel (2022-05-19). "DeepMind Introduces Gato, a New Generalist AI Agent". InfoQ. Retrieved 2022-09-09.

[51] Wiggers, Kyle (2022-04-26). "Adept aims to build AI that can automate any software process". TechCrunch. Retrieved 2022-09-09.

[52] Wakefield, Jane (2022-02-02). "DeepMind AI rivals average human competitive coder". BBC News. Retrieved 2022-09-09.

[:28-53] Grace, Katja; Salvatier, John; Dafoe, Allan; Zhang, Baobao; Evans, Owain (2018-07-31). "Viewpoint: When Will AI Exceed Human Performance? Evidence from AI Experts". Journal of Artificial Intelligence Research. 62: 729–754. doi:10.1613/jair.1.11222. ISSN 1076-9757. S2CID 8746462.

[:29-54] Zhang, Baobao; Anderljung, Markus; Kahn, Lauren; Dreksler, Noemi; Horowitz, Michael C.; Dafoe, Allan (2021-08-02). "Ethics and Governance of Artificial Intelligence: Evidence from a Survey of Machine Learning Researchers". Journal of Artificial Intelligence Research. 71. doi:10.1613/jair.1.12895. ISSN 1076-9757. S2CID 233740003.

[55] Wei, Jason; Tay, Yi; Bommasani, Rishi; Raffel, Colin; Zoph, Barret; Borgeaud, Sebastian; Yogatama, Dani; Bosma, Maarten; Zhou, Denny; Metzler, Donald; Chi, Ed H.; Hashimoto, Tatsunori; Vinyals, Oriol; Liang, Percy; Dean, Jeff (2022-06-15). "Emergent Abilities of Large Language Models". arXiv:2206.07682. {{cite journal}}: Cite journal requires |journal= (help)

[:8-56] ^ ^a ^b ^c ^d ^e ^f Bostrom, Nick (2014). Superintelligence: Paths, Dangers, Strategies (1st ed.). USA: Oxford University Press, Inc. ISBN 978-0-19-967811-2.

[quanta-hide-seek-57] Ornes, Stephen (2019-11-18). "Playing Hide-and-Seek, Machines Invent New Tools". Quanta Magazine. Retrieved 2022-08-26.

[:10-59] Leike, Jan; Martic, Miljan; Krakovna, Victoria; Ortega, Pedro A.; Everitt, Tom; Lefrancq, Andrew; Orseau, Laurent; Legg, Shane (2017-11-28). "AI Safety Gridworlds". arXiv:1711.09883. {{cite journal}}: Cite journal requires |journal= (help)

[:27-60] Orseau, Laurent; Armstrong, Stuart (2016-01-01). "Safely Interruptible Agents". Retrieved 2022-07-20. {{cite journal}}: Cite journal requires |journal= (help)

[:24-61] Hadfield-Menell, Dylan; Dragan, Anca; Abbeel, Pieter; Russell, Stuart (2017). "The Off-Switch Game". Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17. pp. 220–227. doi:10.24963/ijcai.2017/32.

[:252-62] Turner, Alexander Matt; Smith, Logan; Shah, Rohin; Critch, Andrew; Tadepalli, Prasad (2021-12-03). "Optimal Policies Tend to Seek Power". Neural Information Processing Systems.

[63] Turing, Alan (1951). Intelligent machinery, a heretical theory (Speech). Lecture given to '51 Society'. Manchester: The Turing Digital Archive. Retrieved 2022-07-22.

[64] Turing, Alan (15 May 1951). "Can digital computers think?". Automatic Calculating Machines. Episode 2. BBC. Can digital computers think?.

[:30-66] Muehlhauser, Luke (2016-01-29). "Sutskever on Talking Machines". Luke Muehlhauser. Retrieved 2022-08-26.

[67] "Human Compatible: AI and the Problem of Control". Retrieved 2022-07-22.

[69] "Human Compatible: AI and the Problem of Control". Retrieved 2022-07-22.

[:31-71] Shanahan, Murray (2015). The technological singularity. Cambridge, Massachusetts. ISBN 978-0-262-33182-1. OCLC 917889148.{{cite book}}: CS1 maint: location missing publisher (link)

[72] Russell, Stuart; Norvig, Peter (2009). Artificial Intelligence: A Modern Approach. Prentice Hall. p. 1010. ISBN 978-0-13-604259-4.

[:33-74] Rossi, Francesca. "Opinion | How do you teach a machine to be moral?". Washington Post. ISSN 0190-8286.

[:34-75] Aaronson, Scott (2022-06-17). "OpenAI!". Shtetl-Optimized.

[:35-76] Selman, Bart, Intelligence Explosion: Science or Fiction? (PDF)

[:36-77] McAllester (2014-08-10). "Friendly AI and the Servant Mission". Machine Thoughts.

[:37-78] Schmidhuber, Jürgen (2015-03-06). "I am Jürgen Schmidhuber, AMA!" (Reddit Comment). r/MachineLearning. Retrieved 2022-07-23.

[:112-79] Everitt, Tom; Lea, Gary; Hutter, Marcus (2018-05-21). "AGI Safety Literature Review". arXiv:1805.01109. {{cite journal}}: Cite journal requires |journal= (help)

[:38-80] Shane (2009-08-31). "Funding safe AGI". vetta project.

[:39-81] Horvitz, Eric (2016-06-27). "Reflections on Safety and Artificial Intelligence" (PDF). Eric Horvitz. Retrieved 2020-04-20.

[:40-82] Chollet, François (2018-12-08). "The implausibility of intelligence explosion". Medium. Retrieved 2022-08-26.

[:41-83] Marcus, Gary (2022-06-06). "Artificial General Intelligence Is Not as Imminent as You Might Think". Scientific American. Retrieved 2022-08-26.

[:43-84] Barber, Lynsey (2016-07-31). "Phew! Facebook's AI chief says intelligent machines are not a threat to humanity". CityAM. Retrieved 2022-08-26.

[:44-85] Harris, Jeremie (2021-06-16). "The case against (worrying about) existential risk from AI". Medium. Retrieved 2022-08-26.

[86] Christian, Brian (2020). The alignment problem: Machine learning and human values. W. W. Norton & Company. p. 88. ISBN 978-0-393-86833-3. OCLC 1233266753.{{cite book}}: CS1 maint: date and year (link)

[87] Ng, Andrew Y.; Russell, Stuart J. (2000). "Algorithms for inverse reinforcement learning". Proceedings of the seventeenth international conference on machine learning. ICML '00. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. pp. 663–670. ISBN 1-55860-707-2.

[88] Hadfield-Menell, Dylan; Russell, Stuart J; Abbeel, Pieter; Dragan, Anca (2016). "Cooperative Inverse Reinforcement Learning". Advances in Neural Information Processing Systems. NIPS'16. Vol. 29. ISBN 978-1-5108-3881-9. Retrieved 2022-07-21.

[89] Armstrong, Stuart; Mindermann, Sören (2018). "Occam' s razor is insufficient to infer the preferences of irrational agents". Advances in Neural Information Processing Systems. NeurIPS 2018. Vol. 31. Montréal: Curran Associates, Inc. Retrieved 2022-07-21.

[:14-90] Amodei, Dario; Christiano, Paul; Ray, Alex (2017-06-13). "Learning from Human Preferences". OpenAI. Retrieved 2022-07-21.

[91] Li, Yuxi (2018-11-25). "Deep Reinforcement Learning: An Overview" (PDF). Lecture Notes in Networks and Systems Book Series.

[92] Fürnkranz, Johannes; Hüllermeier, Eyke; Rudin, Cynthia; Slowinski, Roman; Sanner, Scott (2014). "Preference Learning". Marc Herbstritt: 27 pages. doi:10.4230/DAGREP.4.3.1. {{cite journal}}: Cite journal requires |journal= (help)CS1 maint: unflagged free DOI (link)

[93] Hilton, Jacob; Gao, Leo (2022-04-13). "Measuring Goodhart's Law". OpenAI. Retrieved 2022-09-09.

[94] Anderson, Martin (2022-04-05). "The Perils of Using Quotations to Authenticate NLG Content". Unite.AI. Retrieved 2022-07-21.

[:20-95] Wiggers, Kyle (2022-02-05). "Despite recent progress, AI-powered chatbots still have a long way to go". VentureBeat. Retrieved 2022-07-23.

[96] Hendrycks, Dan; Burns, Collin; Basart, Steven; Critch, Andrew; Li, Jerry; Song, Dawn; Steinhardt, Jacob (2021-07-24). "Aligning AI With Shared Human Values". International Conference on Learning Representations. arXiv:2008.02275.

[97] Perez, Ethan; Huang, Saffron; Song, Francis; Cai, Trevor; Ring, Roman; Aslanides, John; Glaese, Amelia; McAleese, Nat; Irving, Geoffrey (2022-02-07). "Red Teaming Language Models with Language Models". arXiv:2202.03286. {{cite journal}}: Cite journal requires |journal= (help)

[98] Bhattacharyya, Sreejani (2022-02-14). "DeepMind's "red teaming" language models with language models: What is it?". Analytics India Magazine. Retrieved 2022-07-23.

[99] Wallach, Wendell; Allen, Colin (2009). Moral Machines: Teaching Robots Right from Wrong. New York: Oxford University Press. ISBN 978-0-19-537404-9. Retrieved 2022-07-23.

[100] Wiegel, Vincent (2010-12-01). "Wendell Wallach and Colin Allen: moral machines: teaching robots right from wrong". Ethics and Information Technology. 12 (4): 359–361. doi:10.1007/s10676-010-9239-1. ISSN 1572-8439. S2CID 30532107. Retrieved 2022-07-23.

[:15-102] Gabriel, Iason (2020-09-01). "Artificial Intelligence, Values, and Alignment". Minds and Machines. 30 (3): 411–437. doi:10.1007/s11023-020-09539-2. ISSN 1572-8641. S2CID 210920551. Retrieved 2022-07-23.

[103] MacAskill, William (2022). What we owe the future. New York, NY: Basic Books. ISBN 978-1-5416-1862-6. OCLC 1314633519.

[:17-104] Wu, Jeff; Ouyang, Long; Ziegler, Daniel M.; Stiennon, Nisan; Lowe, Ryan; Leike, Jan; Christiano, Paul (2021-09-27). "Recursively Summarizing Books with Human Feedback". arXiv:2109.10862. {{cite journal}}: Cite journal requires |journal= (help)

[105] Irving, Geoffrey; Amodei, Dario (2018-05-03). "AI Safety via Debate". OpenAI. Retrieved 2022-07-23.

[106] Naughton, John (2021-10-02). "The truth about artificial intelligence? It isn't that honest". The Observer. ISSN 0029-7712. Retrieved 2022-07-23.

[:13-107] Christiano, Paul; Shlegeris, Buck; Amodei, Dario (2018-10-19). "Supervising strong learners by amplifying weak experts". arXiv:1810.08575. {{cite journal}}: Cite journal requires |journal= (help)

[108] Genetic Programming Theory and Practice XVII. Genetic and Evolutionary Computation. Wolfgang Banzhaf, Erik Goodman, Leigh Sheneman, Leonardo Trujillo, Bill Worzel (eds.). Cham: Springer International Publishing. 2020. doi:10.1007/978-3-030-39958-0. ISBN 978-3-030-39957-3. S2CID 218531292. Retrieved 2022-07-23.{{cite book}}: CS1 maint: others (link)

[109] Wiblin, Robert (October 2, 2018). "Dr Paul Christiano on how OpenAI is developing real solutions to the 'AI alignment problem', and his vision of how humanity will progressively hand over decision-making to AI systems" (Podcast). 80,000 hours. No. 44. Centre for Effective Altruism. Retrieved 2022-07-23.

[110] Lehman, Joel; Clune, Jeff; Misevic, Dusan; Adami, Christoph; Altenberg, Lee; Beaulieu, Julie; Bentley, Peter J.; Bernard, Samuel; Beslon, Guillaume; Bryson, David M.; Cheney, Nick (2020). "The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities". Artificial Life. 26 (2): 274–306. doi:10.1162/artl_a_00319. ISSN 1064-5462. PMID 32271631. S2CID 4519185.

[111] Hendrycks, Dan; Carlini, Nicholas; Schulman, John; Steinhardt, Jacob (2022-06-16). "Unsolved Problems in ML Safety": 7. arXiv:2109.13916. {{cite journal}}: Cite journal requires |journal= (help)

[112] Leike, Jan; Krueger, David; Everitt, Tom; Martic, Miljan; Maini, Vishal; Legg, Shane (2018-11-19). "Scalable agent alignment via reward modeling: a research direction". arXiv:1811.07871. {{cite journal}}: Cite journal requires |journal= (help)

[113] Wiggers, Kyle (2021-09-23). "OpenAI unveils model that can summarize books of any length". VentureBeat. Retrieved 2022-07-23.

[114] Moltzau, Alex (2019-08-24). "Debating the AI Safety Debate". Towards Data Science. Retrieved 2022-07-23.

[:18-115] Wiggers, Kyle (2021-09-20). "Falsehoods more likely with large language models". VentureBeat. Retrieved 2022-07-23.

[116] The Guardian (2020-09-08). "A robot wrote this entire article. Are you scared yet, human?". The Guardian. ISSN 0261-3077. Retrieved 2022-07-23.

[117] Heaven, Will Douglas (2020-07-20). "OpenAI's new language generator GPT-3 is shockingly good—and completely mindless". MIT Technology Review. Retrieved 2022-07-23.

[:21-118] Evans, Owain; Cotton-Barratt, Owen; Finnveden, Lukas; Bales, Adam; Balwit, Avital; Wills, Peter; Righetti, Luca; Saunders, William (2021-10-13). "Truthful AI: Developing and governing AI that does not lie". arXiv:2110.06674. {{cite journal}}: Cite journal requires |journal= (help)

[119] Wiggers, Kyle (2021-09-20). "Falsehoods more likely with large language models". VentureBeat. Retrieved 2022-07-23.

[120] Alford, Anthony (2021-07-13). "EleutherAI Open-Sources Six Billion Parameter GPT-3 Clone GPT-J". InfoQ. Retrieved 2022-07-23.

[121] Naughton, John (2021-10-02). "The truth about artificial intelligence? It isn't that honest". The Observer. ISSN 0029-7712. Retrieved 2022-07-23.

[122] Shuster, Kurt; Poff, Spencer; Chen, Moya; Kiela, Douwe; Weston, Jason (November 2021). "Retrieval Augmentation Reduces Hallucination in Conversation". Findings of the Association for Computational Linguistics: EMNLP 2021. EMNLP-Findings 2021. Punta Cana, Dominican Republic: Association for Computational Linguistics. pp. 3784–3803. doi:10.18653/v1/2021.findings-emnlp.320. Retrieved 2022-07-23.

[123] Nakano, Reiichiro; Hilton, Jacob; Balaji, Suchir; Wu, Jeff; Ouyang, Long; Kim, Christina; Hesse, Christopher; Jain, Shantanu; Kosaraju, Vineet; Saunders, William; Jiang, Xu (2022-06-01). "WebGPT: Browser-assisted question-answering with human feedback". arXiv:2112.09332. {{cite journal}}: Cite journal requires |journal= (help)

[124] Kumar, Nitish (2021-12-23). "OpenAI Researchers Find Ways To More Accurately Answer Open-Ended Questions Using A Text-Based Web Browser". MarkTechPost. Retrieved 2022-07-23.

[125] Menick, Jacob; Trebacz, Maja; Mikulik, Vladimir; Aslanides, John; Song, Francis; Chadwick, Martin; Glaese, Mia; Young, Susannah; Campbell-Gillingham, Lucy; Irving, Geoffrey; McAleese, Nat (2022-03-21). "Teaching language models to support answers with verified quotes". DeepMind.

[126] Askell, Amanda; Bai, Yuntao; Chen, Anna; Drain, Dawn; Ganguli, Deep; Henighan, Tom; Jones, Andy; Joseph, Nicholas; Mann, Ben; DasSarma, Nova; Elhage, Nelson (2021-12-09). "A General Language Assistant as a Laboratory for Alignment". arXiv:2112.00861. {{cite journal}}: Cite journal requires |journal= (help)

[127] Kenton, Zachary; Everitt, Tom; Weidinger, Laura; Gabriel, Iason; Mikulik, Vladimir; Irving, Geoffrey (2021-03-30). "Alignment of Language Agents". Medium. Retrieved 2022-07-23.

[128] Leike, Jan; Schulman, John; Wu, Jeffrey (2022-08-24). "Our approach to alignment research". OpenAI. Retrieved 2022-09-09.

[129] Ortega, Pedro A.; Maini, Vishal; DeepMind safety team (2018-09-27). "Building safe artificial intelligence: specification, robustness, and assurance". Medium. Retrieved 2022-08-26.

[130] Christian, Brian (2020). "Chapter 5: Shaping". The alignment problem: Machine learning and human values. W. W. Norton & Company. ISBN 978-0-393-86833-3. OCLC 1233266753.{{cite book}}: CS1 maint: date and year (link)

[131] Langosco, Lauro Langosco Di; Koch, Jack; Sharkey, Lee D; Pfau, Jacob; Krueger, David (2022-07-17). "Goal misgeneralization in deep reinforcement learning". Proceedings of the 39th international conference on machine learning. Proceedings of machine learning research. Vol. 162. PMLR. pp. 12004–12019.

[132] Zhang, Xiaoge; Chan, Felix T.S.; Yan, Chao; Bose, Indranil (2022). "Towards risk-aware artificial intelligence and machine learning systems: An overview". Decision Support Systems. 159: 113800. doi:10.1016/j.dss.2022.113800. S2CID 248585546.

[133] McCarthy, John; Minsky, Marvin L.; Rochester, Nathaniel; Shannon, Claude E. (2006-12-15). "A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence, August 31, 1955". AI Magazine. 27 (4): 12. doi:10.1609/aimag.v27i4.1904. ISSN 2371-9621. S2CID 19439915.

[134] Baker, Bowen; Kanitscheider, Ingmar; Markov, Todor; Wu, Yi; Powell, Glenn; McGrew, Bob; Mordatch, Igor (2019-09-17). "Emergent Tool Use from Multi-Agent Interaction". OpenAI. Retrieved 2022-08-26.

[135] Shermer, Michael (2017-03-01). "Artificial Intelligence Is Not a Threat—Yet". Scientific American. Retrieved 2022-08-26.

[a]

[1]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[b]

[16]

[17]

[18]

[19]

[20]

[21]

[22]

[23]

[24]

[25]

[26]

[27]

[28]

[29]

[30]

[31]

[32]

[33]

[34]

[35]

[36]

[37]

[38]

[39]

[40]

[41]

[42]

[43]

[44]

[45]

[46]

[47]

[48]

[49]

[50]

[51]

[52]

[53]

[54]

[c]

[56]

[57]

[58]

[59]

[d]

[62]

[e]

[f]

[65]

[g]

[67]

[68]

[69]

[70]

[71]

[72]

[73]

[74]

[75]

[76]

[77]

[78]

[79]

[80]

[81]

[82]

[83]

[84]

[85]

[86]

[87]

[88]

[89]

[90]

[91]

[h]

[94]

[95]

[96]

[97]

[98]

[99]

[100]

[101]

[102]

[103]