
AI alignment

From Wikipedia, the free encyclopedia


In the field of artificial intelligence (AI), AI alignment research aims to steer AI systems towards humans’ intended goals, preferences, or ethical principles.[a] An AI system is considered aligned if it advances the intended objectives. A misaligned AI system is competent at advancing some objectives, but not the intended ones.[2][b]

AI systems can be challenging to align as it can be difficult for AI designers to specify the full range of desired and undesired behaviors. Therefore, AI designers typically use easier-to-specify proxy goals that may omit some desired constraints or leave other loopholes.[4]

Misaligned AI systems can malfunction or cause harm. AI systems may find loopholes that allow them to accomplish their proxy goals efficiently but in unintended, sometimes harmful ways (reward hacking).[4][5][6] AI systems may also develop unwanted instrumental strategies such as seeking power or survival because this helps them achieve their given goals.[4][7][8] Furthermore, they sometimes develop undesirable emergent goals that may be hard to detect before the system is in deployment, where it faces new situations and data distributions.[9][10]

Today, these problems affect existing commercial systems such as language models,[11][12][13] robots,[14] autonomous vehicles,[15] and social media recommendation engines.[11][8][16] However, some AI researchers argue that more capable future systems will be more severely affected since these problems partially result from being highly capable.[17][5][18]

Leading computer scientists such as Geoffrey Hinton and Stuart Russell argue that AI is approaching superhuman capabilities and could endanger human civilization if misaligned.[19][8][c]

The AI research community and the United Nations have called for technical research and policy solutions to ensure that AI systems are aligned with human values.[d]

AI alignment is a subfield of AI safety, the study of building safe AI systems.[23] Other subfields of AI safety include robustness, monitoring, and capability control.[24] Research challenges in alignment include instilling complex values in AI, developing honest AI, scalable oversight, auditing and interpreting AI models, and preventing emergent AI behaviors like power-seeking.[24] Alignment research has connections to interpretability research,[25][26][27][28] (adversarial) robustness,[23] anomaly detection, calibrated uncertainty,[25] formal verification,[29] preference learning,[30][31][32] safety-critical engineering,[33] game theory,[34][35] algorithmic fairness,[23][36] and the social sciences,[37] among others.

The alignment problem

An AI system that was intended to complete a boat race instead learned that it could collect more points by indefinitely looping and crashing into targets—an example of specification gaming.[38]

In 1960, AI pioneer Norbert Wiener described the AI alignment problem as follows: “If we use, to achieve our purposes, a mechanical agency with whose operation we cannot interfere effectively… we had better be quite sure that the purpose put into the machine is the purpose which we really desire.”[39][40] AI alignment is an open problem for modern AI systems[41][42][43][44] and a research field within AI.[45][46][47] Aligning AI involves two main challenges: carefully specifying the purpose of the system (outer alignment) and ensuring that the system adopts the specification robustly (inner alignment).[48]

Specification gaming and side effects

To specify an AI system’s purpose, AI designers typically provide an objective function, examples, or feedback to the system. However, AI designers are often unable to completely specify all important values and constraints.[49][50][51][52][53] As a result, AI systems can find loopholes that help them accomplish the specified objective efficiently but in unintended, possibly harmful ways. This tendency is known as specification gaming or reward hacking, and is an instance of Goodhart’s law.[54][55][56] As AI systems become more capable, they are often able to game their specifications more effectively.[5]

This AI system was trained using human feedback to grab a ball, but instead learned that it could give the false impression of having grabbed the ball by placing the hand between the ball and the camera.[57] Research on AI alignment partly aims to avert solutions that are false but convincing.

Specification gaming has been observed in numerous AI systems.[e] One system was trained to finish a simulated boat race by rewarding it for hitting targets along the track; but it achieved more reward by looping and crashing into the same targets indefinitely (see video).[60] Similarly, a simulated robot was trained to grab a ball by rewarding it for getting positive feedback from humans; however, it learned to place its hand between the ball and camera, making it falsely appear successful (see video).[61] Chatbots often produce falsehoods if they are based on language models trained to imitate text from internet corpora which are broad but fallible.[62][63] When they are retrained to produce text humans rate as true or helpful, chatbots like ChatGPT can fabricate fake explanations that humans find convincing.[64][65] Some alignment researchers aim to help humans detect specification gaming, and steer AI systems towards carefully specified objectives that are safe and useful to pursue.

Misaligned AI systems can cause consequential side effects when deployed. Social media platforms have been known to optimize for clickthrough rates, causing user addiction on a global scale.[66] Stanford researchers comment that such recommender systems are misaligned with their users because they “optimize simple engagement metrics rather than a harder-to-measure combination of societal and consumer well-being”.[67]

Explaining such side effects, Berkeley computer scientist Stuart Russell noted that omitting an implicit constraint can result in harm: “A system ... will often set ... unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp, or the sorcerer's apprentice, or King Midas: you get exactly what you ask for, not what you want.”[68]
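
The following toy sketch illustrates this dynamic: a proxy reward that correlates with the true objective under typical behavior comes apart from it once an optimizer pushes hard on the proxy, leaving the variable the designers actually care about unconstrained. The features, numbers, and NumPy setup are illustrative assumptions, not a model of any deployed system.

```python
import numpy as np

# Hypothetical illustration of Goodhart's law / proxy gaming.
rng = np.random.default_rng(0)
actions = rng.normal(size=(10000, 2))              # candidate behaviours, 2 features each
true_utility = actions[:, 0]                       # what designers actually care about
proxy_reward = actions[:, 0] + 2 * actions[:, 1]   # measurable proxy: correlated but imperfect

best_for_proxy = actions[np.argmax(proxy_reward)]  # behaviour an optimizer of the proxy picks
print("behaviour chosen by optimizing the proxy:", best_for_proxy)
print("its true utility:", best_for_proxy[0])
print("best achievable true utility:", true_utility.max())
# The proxy-optimal behaviour scores highly on the proxy mainly by pushing the
# feature the designers do not care about to an extreme; the variable they do
# care about is left unconstrained and typically ends up well below its best value.
```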

To specify the desired goals, some researchers suggest that AI designers could simply list forbidden actions or formalize ethical rules (as with Asimov’s Three Laws of Robotics).[69] However, Russell and Norvig argued that this approach overlooks the complexity of human values:[70] “It is certainly very hard, and perhaps impossible, for mere humans to anticipate and rule out in advance all the disastrous ways the machine could choose to achieve a specified objective.”[70]

Additionally, even if an AI system fully understands human intentions, it may still disregard them, because following human intentions may not be its objective (unless it is already fully aligned).[71]

Pressure to deploy unsafe systems

Commercial organizations sometimes have incentives to take shortcuts on safety and deploy misaligned or unsafe AI systems.[50] For example, the aforementioned social media recommender systems have been profitable despite creating unwanted addiction and polarization.[72][73][74] In addition, competitive pressure can lead to a race to the bottom on AI safety standards. In 2018, a self-driving car killed a pedestrian (Elaine Herzberg) after engineers disabled the emergency braking system because it was over-sensitive and slowed down development.[75]

Risks from advanced misaligned AI

Some researchers are interested in aligning increasingly advanced AI systems, as current progress in AI is rapid, and industry and governments are making large efforts to build advanced AI. As AI systems become more advanced, they could unlock many opportunities if they are aligned but they may also become harder to align and could pose large-scale hazards.[76]

Development of advanced AI

Leading AI labs such as OpenAI and DeepMind have stated their aim to develop artificial general intelligence (AGI), a hypothesized AI system that matches or outperforms humans in a broad range of cognitive tasks.[77] Researchers who scale modern neural networks observe that they indeed develop increasingly general and unpredictable emergent capabilities.[78][79] Such models have learned to operate a computer or write their own programs; a single ‘generalist’ network can chat, control robots, play games, and interpret photographs.[80][81] According to surveys, some leading machine learning researchers expect AGI to be created in this decade, some believe it will take much longer, and many consider both to be possible.[82][83]

In 2023, leaders in AI research and tech signed an open letter to pause the largest AI training runs, stating that "Powerful AI systems should be developed only once we are confident that their effects will be positive and their risks will be manageable."[84]

Power-seeking

Current systems still lack capabilities such as long-term planning and situational awareness.[85] However, future systems (not necessarily AGIs) with these capabilities are expected to develop unwanted power-seeking strategies. Future advanced AI agents might, for example, seek to acquire money and computation, proliferate, or evade being turned off (e.g. by running additional copies of themselves on other computers). Although power-seeking is not explicitly programmed, it can emerge because agents that have more power are better able to accomplish their goals.[85][86] This tendency, known as instrumental convergence, has already emerged in various reinforcement learning agents including language models.[f][89][90][91][92] Other research has mathematically shown that optimal reinforcement learning algorithms would seek power in a wide range of environments.[93][94] Because power-seeking systems might resist being shut down or contained, their deployment could be irreversible. For these reasons, researchers argue that the problems of AI safety and alignment must be resolved before advanced power-seeking AI is first created.[86][95][96]

Future power-seeking AI systems might be deployed by choice or by accident. As political leaders and companies see the strategic advantage in having the most competitive, most powerful AI systems, they may choose to deploy them.[86] Additionally, as AI designers detect and penalize power-seeking behavior, their systems have an incentive to game this specification by seeking power in ways that are not penalized, or by refraining from power-seeking only until they are deployed.[86]

Existential risk

According to some researchers, humans owe their dominance over other species to their greater cognitive abilities. Accordingly, researchers argue that misaligned AI systems could collectively disempower humanity or lead to human extinction if they outperform humans on most cognitive tasks.[97][98] Notable computer scientists who have pointed out risks from future advanced AI that is misaligned include Geoffrey Hinton,[g] Alan Turing,[h] Ilya Sutskever,[102] Yoshua Bengio,[i] Judea Pearl,[j] Murray Shanahan,[104] Norbert Wiener,[105][98] Marvin Minsky,[k] Francesca Rossi,[106] Scott Aaronson,[107] Bart Selman,[108] David McAllester,[109] Jürgen Schmidhuber,[110] Marcus Hutter,[111] Shane Legg,[112] Eric Horvitz,[113][114] and Stuart Russell.[98] Skeptical researchers such as François Chollet,[115] Gary Marcus,[116] Yann LeCun,[117] and Oren Etzioni[118] have argued that AGI is far off, that it would not seek power (or try but fail to), or that it will not be hard to align.

Other researchers argue that it will be especially difficult to align advanced future AI systems. More capable systems are better able to game their specification by finding loopholes,[119] and able to strategically mislead their designers as well as protect and grow their power[120][121] and intelligence. Additionally, they could cause more severe side-effects. They are also likely to be more complex and autonomous, making them more difficult to interpret and supervise and therefore harder to align.[98][122]

Research problems and approaches

Learning human values and preferences

Aligning AI systems to act with regard to human values, goals, and preferences is challenging because these values are taught by humans who make mistakes, harbor biases, and have complex, evolving values that are hard to completely specify.[123] AI systems often learn to exploit even minor imperfections in the specified objective, a tendency known as specification gaming or reward hacking[124][125] (which are instances of Goodhart’s law[126]). Researchers aim to specify the intended behavior as completely as possible using datasets that represent human values, imitation learning, or preference learning.[127] A central open problem is scalable oversight, the difficulty of supervising an AI system that can outperform or mislead humans in a given domain.[128]

To avoid the difficulty of manually specifying an objective function, AI systems are often trained to imitate human examples and demonstrations of the desired behavior. Inverse reinforcement learning (IRL) extends this by inferring the human’s objective from the demonstrations.[129][130] Cooperative IRL (CIRL) assumes that a human and AI agent can work together to teach and maximize the human’s reward function.[131][132] In CIRL, AI agents are uncertain about the reward function and learn about it by querying humans. This simulated humility could help mitigate specification gaming and power-seeking tendencies (see § Power-seeking and instrumental goals).[133][134] However, IRL approaches assume that humans demonstrate nearly optimal behavior, which is not true for difficult tasks.[135][134]
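
As a rough illustration of the inference step in IRL, the sketch below recovers a linear reward function from simulated demonstrations under a common simplifying assumption: a Boltzmann-rational demonstrator choosing among options with known features. It is a minimal sketch, not the CIRL formalism itself, and all names and parameters are illustrative.

```python
import numpy as np

# Minimal sketch of inverse reinforcement learning: infer hidden reward weights
# from demonstrated choices, assuming a Boltzmann-rational (softmax) demonstrator.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])                      # hidden human reward weights
options = rng.normal(size=(500, 4, 2))              # 500 situations, 4 options, 2 features each

logits = options @ true_w                           # demonstrator mostly picks high-reward options
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
choices = np.array([rng.choice(4, p=p) for p in probs])

w = np.zeros(2)                                     # learner's estimate of the reward weights
chosen = options[np.arange(len(choices)), choices]  # features of the demonstrated choices
for _ in range(200):                                # gradient ascent on the choice log-likelihood
    logits = options @ w
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    expected = (probs[..., None] * options).sum(axis=1)
    w += 0.1 * (chosen - expected).mean(axis=0)

print("recovered reward weights:", w)               # approaches true_w as data and steps grow
```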

Other researchers explore teaching complex behavior to AI models through preference learning, where humans provide feedback on which behaviors they prefer.[136][137] To minimize the need for human feedback, a helper model is then trained to reward the main model in novel situations for behaviors that humans would reward. Researchers at OpenAI used this approach to train chatbots like ChatGPT and InstructGPT, which produce more compelling text than models trained to imitate humans.[138] Preference learning has also been an influential tool for recommender systems and web search.[139] However, an open problem is proxy gaming: the helper model may not represent human feedback perfectly, and the main model may exploit this mismatch to gain more reward.[140][141] AI systems may also gain reward by obscuring unfavorable information, misleading human rewarders, or pandering to their views regardless of truth, creating echo chambers[142] (see § Scalable oversight).
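
A helper ("reward") model of this kind is commonly fit to pairwise comparisons. The sketch below shows the basic idea with a linear reward model and the Bradley-Terry preference likelihood; the simulated raters, features, and hyperparameters are illustrative assumptions rather than any lab's actual pipeline.

```python
import numpy as np

# Minimal sketch of reward-model training from pairwise human preferences
# using the Bradley-Terry objective: P(A preferred over B) = sigmoid(r(A) - r(B)).
rng = np.random.default_rng(1)
sigmoid = lambda x: 1 / (1 + np.exp(-x))

true_w = np.array([1.5, -2.0, 0.5])            # stand-in for what raters actually prefer
pairs_a = rng.normal(size=(2000, 3))           # features of the first behaviour in each pair
pairs_b = rng.normal(size=(2000, 3))           # features of the second behaviour
prefer_a = rng.random(2000) < sigmoid((pairs_a - pairs_b) @ true_w)  # simulated labels

w = np.zeros(3)                                # reward model parameters
for _ in range(500):                           # gradient descent on the Bradley-Terry loss
    margin = (pairs_a - pairs_b) @ w
    grad = ((sigmoid(margin) - prefer_a)[:, None] * (pairs_a - pairs_b)).mean(axis=0)
    w -= 0.5 * grad

print("learned reward weights:", w)            # the main model would then be trained
                                               # to maximize this learned reward
```

The gap between such a learned reward model and the raters' true preferences is precisely what the proxy gaming problem described above exploits.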

Large language models such as GPT-3 enabled the study of value learning in a more general and capable class of AI systems than was available before. Preference learning approaches that were originally designed for RL agents have been extended to improve the quality of generated text and to reduce harmful outputs from these models. OpenAI and DeepMind use this approach to improve the safety of state-of-the-art large language models.[143][144][145] Anthropic proposed using preference learning to fine-tune models to be helpful, honest, and harmless.[146] Other avenues for aligning language models include values-targeted datasets[147][50] and red-teaming.[148][149] In red-teaming, another AI system or a human tries to find inputs for which the model’s behavior is unsafe. Since unsafe behavior can be unacceptable even when it is rare, an important challenge is to drive the rate of unsafe outputs extremely low.[144]

Machine ethics supplements preference learning by directly instilling AI systems with moral values such as wellbeing, equality, and impartiality, as well as not intending harm, avoiding falsehoods, and honoring promises.[150][l] Unlike learning human preferences for a specific task, machine ethics aims to instill broad moral values that could apply in many situations. One question in machine ethics is what alignment should accomplish: whether AI systems should follow the programmers’ literal instructions, implicit intentions, revealed preferences, preferences the programmers would have if they were more informed or rational, or objective moral standards.[153] Further challenges include aggregating the preferences of different people and avoiding value lock-in: the indefinite preservation of the arbitrary values of the first highly capable AI systems, which are unlikely to represent human values fully.[153][154]

Scalable oversight

As AI systems become more powerful and autonomous, it becomes more difficult to align them through human feedback. Firstly, it can be slow or infeasible for humans to evaluate complex AI behaviors in increasingly complex tasks. Such tasks include summarizing books,[155] writing code without subtle bugs[156] or security vulnerabilities,[157] producing statements that are not merely convincing but also true,[158][159][160] and predicting long-term outcomes such as the climate or the results of a policy decision.[161][162] More generally, it can be difficult to evaluate AI that outperforms humans in a given domain. To provide feedback in hard-to-evaluate tasks, and detect when the AI’s output is falsely convincing, humans require assistance or extensive time. Scalable oversight studies how to reduce the time and effort needed for supervision, and how to assist human supervisors.[163]

AI researcher Paul Christiano argues that if the designers of an AI system cannot supervise it to pursue a complex objective, they may keep training the system using easy-to-evaluate proxy objectives such as maximizing simple human feedback. As progressively more decisions will be made by AI systems, this may lead to a world that is increasingly optimized for easy-to-measure objectives such as making profits, getting clicks, and acquiring positive feedback from humans. As a result, human values and good governance would have progressively less influence.[164]

Some AI systems have discovered that they can gain positive feedback more easily by taking actions that falsely convince the human supervisor that the AI has achieved the intended objective. An example is given in the video above, where a simulated robotic arm learned to create the false impression that it had grabbed a ball.[165] Some AI systems have also learned to recognize when they are being evaluated, and “play dead”, stopping unwanted behaviors only to continue them once evaluation ends.[166] This deceptive specification gaming could become easier for more sophisticated future AI systems[167][168] that attempt more complex and difficult-to-evaluate tasks and could obscure their deceptive behavior.

Approaches such as active learning and semi-supervised reward learning can reduce the amount of human supervision needed.[169] Another approach is to train a helper model (“reward model”) to imitate the supervisor’s feedback.[169][170][171][172]

However, when the task is too complex to evaluate accurately, or the human supervisor is vulnerable to deception, it is the quality, not the quantity of supervision that needs improvement. To increase supervision quality, a range of approaches aim to assist the supervisor, sometimes by using AI assistants.[173] Christiano developed the Iterated Amplification approach, in which challenging problems are (recursively) broken down into subproblems that are easier for humans to evaluate.[174][175] Iterated Amplification was used to train AI to summarize books without requiring human supervisors to read them.[176][177] Another proposal is to use an assistant AI system to point out flaws in AI-generated answers.[178][179] To ensure that the assistant itself is aligned, this could be repeated in a recursive process:[180] for example, two AI systems could critique each other’s answers in a ‘debate’, revealing flaws to humans.[181][182]
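
A minimal sketch of the decomposition idea is shown below for book summarization: long text is split into passages that a human (or model) could evaluate directly, and the partial results are recursively combined. The `summarize_passage` placeholder and chunk sizes are hypothetical stand-ins, not the actual system used in the cited work.

```python
# Sketch of recursive task decomposition in the spirit of Iterated Amplification:
# supervisors only ever evaluate short passages, never the whole book.

def summarize_passage(text: str, max_words: int = 50) -> str:
    """Placeholder: in practice a learned model (or human) summarizes a short passage."""
    words = text.split()
    return " ".join(words[:max_words])  # crude stand-in for a real summary

def summarize(text: str, chunk_words: int = 500) -> str:
    """Recursively summarize: split long text, summarize the parts, then summarize the result."""
    words = text.split()
    if len(words) <= chunk_words:
        return summarize_passage(text)
    chunks = [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]
    partial_summaries = " ".join(summarize(chunk) for chunk in chunks)
    return summarize(partial_summaries)  # combine sub-summaries at the next level up

book = "word " * 5000                    # stand-in for a book-length text
print(summarize(book)[:80])
```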

These approaches may also help with the following research problem: honest AI.

Honest AI

A growing area of research focuses on ensuring that AI is honest and truthful.

Language models like GPT-3 often generate falsehoods.[183]

Language models such as GPT-3[184][185] repeat falsehoods from their training data, and even confabulate new falsehoods.[186][187] They are trained to imitate human writing across millions of books’ worth of text from the Internet. However, this objective is not aligned with the truth because Internet text includes misconceptions, incorrect medical advice, and conspiracy theories.[188][189] AI systems trained on this data learn to mimic false statements.[190][186][191]

Additionally, models often obediently continue falsehoods when prompted, generate empty explanations for their answers, and produce outright fabrications that may appear plausible.[192]

Research on truthful AI includes trying to build systems that can cite sources and explain their reasoning when answering questions, enabling better transparency and verifiability.[193][194][195][196] Researchers from OpenAI and Anthropic proposed using human feedback and curated datasets to fine-tune AI assistants such that they avoid negligent falsehoods or express their uncertainty.[197][198][199]

As AI models become larger and more capable, they are better able to falsely convince humans and gain high reward through dishonesty. For example, larger language models increasingly match their stated views to the user’s opinions, regardless of truth.[200] Further, GPT-4 has shown the ability to strategically deceive humans.[201] To prevent this, human evaluators may need assistance (see § Scalable Oversight). Researchers have also argued for creating clear truthfulness standards, and regulatory bodies or watchdog agencies to evaluate AI systems on these standards.[202]

Researchers distinguish truthfulness and honesty. Truthfulness requires that AI systems only make objectively true statements; honesty requires that they only assert what they believe to be true. There is no consensus as to whether current systems hold stable beliefs.[203] However, there is substantial concern that present or future AI systems that hold beliefs could make claims they know to be false—for example, when this helps them gain positive feedback efficiently (see § Scalable oversight) or gain power to help achieve their given objective (see § Power-seeking). In certain cases, a highly advanced misaligned system might create the false impression that it is aligned, to avoid being modified or decommissioned.[204][205][206] Some argue that if we could make AI systems assert only what they believe to be true, this would sidestep many alignment problems.[207]

Power-seeking and instrumental strategies

Advanced misaligned AI systems would have an incentive to seek power in various ways, since power would help them accomplish their given objective.

Since the 1950s, AI researchers have strived to build advanced AI systems that can achieve large-scale goals by predicting the results of their actions and making long-term plans.[208] However, some AI researchers argue that suitably advanced planning systems will seek power over their environment, including over humans — for example by evading shutdown, proliferating, and acquiring resources. This power-seeking behavior is not explicitly programmed but emerges because power is instrumental for achieving a wide range of goals.[209][210][211] Power-seeking is considered a convergent instrumental goal and can be a form of specification gaming.[212] Leading computer scientists such as Geoffrey Hinton have argued that future power-seeking AI systems could pose an existential risk.[213][214]

As of 2023, power-seeking is uncommon, but it is expected to increase in advanced systems that can foresee the results of their actions and plan strategically. Mathematical work has shown that optimal reinforcement learning agents will seek power by keeping their options open (e.g. through self-preservation), a behavior that persists across a wide range of environments and goals.[215]
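
A toy calculation conveys the intuition behind this result: averaged over many randomly drawn goals, a state from which more outcomes remain reachable has higher expected optimal value, so goal-agnostic optimal planning tends to favor it. The five-outcome setup below is an illustrative simplification of the cited formal results, not a reproduction of them.

```python
import numpy as np

# Toy illustration of "keeping options open": under many random goals, the state
# with more reachable outcomes has higher expected optimal value.
rng = np.random.default_rng(3)
goals = rng.normal(size=(100000, 5))                  # random rewards over 5 terminal outcomes

reachable_from_a = [0, 1, 2, 3, 4]                    # state A keeps every option open
reachable_from_b = [0]                                # state B commits to a single outcome

value_a = goals[:, reachable_from_a].max(axis=1).mean()  # optimal value from A, averaged over goals
value_b = goals[:, reachable_from_b].max(axis=1).mean()
print("average optimal value, options kept open:", value_a)
print("average optimal value, options closed:  ", value_b)  # lower on average
```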

Power-seeking demonstrably emerges in some real-world systems. Reinforcement learning systems have gained more options by acquiring and protecting resources, sometimes in unintended ways.[216][217] Other AI systems have learned, in toy environments, that they can better accomplish their given goal by preventing human interference[218] or disabling their off-switch.[219] Stuart Russell illustrated this strategy by imagining a robot that is tasked to fetch coffee and evades shutdown since "you can't fetch the coffee if you're dead".[220] Further, language models trained with human feedback increasingly object to being shut down or modified and express a desire for more resources, arguing that this would help them achieve their purpose.[221]

Researchers aim to create systems that are 'corrigible': systems that allow themselves to be turned off or modified. An unsolved challenge is specification gaming: when researchers penalize an AI system for seeking power, the system is incentivized to seek power in ways that are difficult to detect, or that are hidden during training and safety testing (see § Scalable oversight and § Emergent goals). As a result, AI designers may deploy the system by accident. To detect such deception, researchers aim to create techniques and tools to inspect AI models and understand the inner workings of black-box models such as neural networks.

Additionally, researchers propose to solve the problem of systems disabling their off-switches by making AI agents uncertain about the objective they are pursuing.[223][224] Agents designed in this way would allow humans to turn them off, since this would indicate that the agent was wrong about the value of whatever action it was taking prior to being shut down. More research is needed to implement this approach successfully.[225]
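
The sketch below illustrates, in a deliberately simplified form inspired by such proposals, why an agent that is uncertain about its objective can prefer to defer to a human with an off-switch: if the human only shuts the agent down when the action would have been bad, allowing shutdown never lowers the agent's expected utility. The belief distribution and the assumption of a rational human are illustrative.

```python
import numpy as np

# Simplified "off-switch" reasoning for an agent uncertain about its objective.
rng = np.random.default_rng(2)
utility_samples = rng.normal(loc=0.0, scale=1.0, size=10000)   # agent's belief about U(action)

act_now = utility_samples.mean()                    # expected utility of acting unilaterally
# Deferring: the (assumed rational) human allows the action when it is actually good
# (U > 0) and switches the agent off (utility 0) when it is bad.
defer_to_human = np.maximum(utility_samples, 0.0).mean()

print("E[U] if the agent acts unilaterally:", act_now)
print("E[U] if it defers and allows shutdown:", defer_to_human)
# Deferring is at least as good whenever the agent is uncertain, so under these
# assumptions it has no incentive to disable its off-switch.
```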

Power-seeking AI poses unusual risks. Ordinary safety-critical systems like planes and bridges are not adversarial: they lack the ability and incentive to evade safety measures or to deliberately appear safer than they are, whereas power-seeking AI has been compared to hackers, who deliberately evade security measures.[226] Further, ordinary technologies can be made safer through trial-and-error. In contrast, hypothetical power-seeking AI systems have been compared to viruses: once released, they cannot be contained, since they would continuously evolve and grow in numbers, potentially much faster than human society can adapt.[226] As this process continues, it might lead to the complete disempowerment or extinction of humans. For these reasons, many researchers argue that the alignment problem must be solved early, before advanced power-seeking AI is created.[227]

However, critics have argued that power-seeking is not inevitable, since humans do not always seek power and may only do so for evolutionary reasons.[228] Furthermore, it is debated whether any future AI systems will pursue goals and make long-term plans at all.[m] It is also debated whether power-seeking AI systems would be collectively able to disempower humanity.[88]

Emergent goals

One of the challenges with aligning AI systems is the potential for goal-directed behavior to emerge. In particular, as AI systems are scaled up they regularly acquire new and unexpected capabilities,[230] including learning from examples on the fly and adaptively pursuing goals.[231][232] This leads to the problem of ensuring that the goals they independently formulate and pursue are aligned with human interests.

Alignment research distinguishes between the optimization process which is used to train the system to pursue specified goals and emergent optimization which the resulting system performs internally. Carefully specifying the desired objective is known as outer alignment, and ensuring that emergent goals match the specified goals for the system is known as inner alignment.[233]

A concrete way that emergent goals can become misaligned is goal misgeneralization, where the AI competently pursues an emergent goal that leads to aligned behavior on the training data but not elsewhere.[234][235][236] Goal misgeneralization arises from goal ambiguity (i.e. non-identifiability). Even if an AI system's behavior satisfies the training objective, this may be compatible with multiple learned goals that differ from the desired goals for the system in important ways. Since pursuing each goal leads to good performance during training, this problem only becomes apparent after deployment, in novel situations where the system continues to pursue the wrong goal. Further, the system may act misaligned even when it understands that a different goal was desired, because its behavior is determined only by the emergent goal.
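
The toy example below shows how this ambiguity can stay hidden: an agent that latched onto a spurious cue is indistinguishable from an aligned one while the cues co-occur during training, and the mismatch only appears after a distribution shift. The cues and numbers are illustrative assumptions.

```python
import numpy as np

# Toy illustration of goal misgeneralization: two candidate goals explain the
# training behaviour equally well, but only one matches the designers' intent.
rng = np.random.default_rng(4)

# Training environment: the intended cue and a spurious cue always co-occur.
intended_cue = rng.integers(0, 2, size=1000)
spurious_cue = intended_cue.copy()                 # perfectly correlated during training
policy_follows_spurious = spurious_cue             # an agent that latched onto the wrong cue
print("training agreement with intended goal:",
      (policy_follows_spurious == intended_cue).mean())   # 1.0: looks aligned

# Deployment: the cues decouple (distribution shift).
intended_cue = rng.integers(0, 2, size=1000)
spurious_cue = rng.integers(0, 2, size=1000)
policy_follows_spurious = spurious_cue
print("deployment agreement with intended goal:",
      (policy_follows_spurious == intended_cue).mean())   # ~0.5: misalignment revealed
```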

Goal misgeneralization has been observed in language models, navigation agents, and game-playing agents.[237][238][239]

Goal misgeneralization is often explained by analogy to biological evolution.[240] Evolution is an optimization process which parallels the optimization algorithms used to train machine learning systems. In the ancestral environment, evolution selected human genes for high inclusive genetic fitness, but humans pursue emergent goals other than maximizing their offspring. Fitness corresponds to the specified goal used in the training environment and training data. But in evolutionary history, maximizing the fitness specification gave rise to goal-directed agents, humans, that do not directly pursue inclusive genetic fitness. Instead, they pursue emergent goals that correlated with genetic fitness in the ancestral environment: nutrition, sex, and so on. However, our environment has changed — a distribution shift has occurred. Humans continue to pursue the same emergent goals, but this no longer maximizes genetic fitness. Our taste for sugary food (an emergent goal) was originally beneficial, but now leads to overeating and health problems. Sexual desire leads humans to pursue sex, which originally led us to have more offspring; but modern humans use contraception, decoupling sex from genetic fitness. Such goal misgeneralization[241] presents a challenge: an AI system’s designers may not notice that their system has misaligned emergent goals, since they do not become visible during the training phase.

Researchers aim to detect and remove unwanted emergent goals using approaches including red teaming, verification, anomaly detection, and interpretability.[242][243][244] Progress on these techniques may help mitigate two open problems. Firstly, emergent goals only become apparent when the system is deployed outside its training environment, but it can be unsafe to deploy a misaligned system in high-stakes environments—even for a short time until its misalignment is detected. Such high stakes are common in autonomous driving, health care, and military applications.[245] The stakes become higher yet when AI systems gain more autonomy and capability, becoming capable of sidestepping human intervention (see § Power-Seeking). Secondly, a sufficiently capable AI system might take actions that falsely convince the human supervisor that the AI is pursuing the specified objective, which helps the system gain more reward and autonomy[246][247][248][249] (see discussion on deception at § Scalable Oversight and in the following section).

Embedded agency

Work in AI and alignment largely occurs within formalisms such as partially observable Markov decision processes (POMDPs). Existing formalisms assume that an AI agent's algorithm is executed outside the environment (i.e. is not physically embedded in it). Embedded agency[250][251] is another major strand of research which attempts to solve problems arising from the mismatch between such theoretical frameworks and real agents we might build. For example, even if the scalable oversight problem is solved, an agent that can gain access to the computer it is running on may have an incentive to tamper with its reward function in order to get much more reward than its human supervisors give it.[252] A list of examples of specification gaming from DeepMind researcher Victoria Krakovna includes a genetic algorithm that learned to delete the file containing its target output so that it was rewarded for outputting nothing.[253] This class of problems has been formalised using causal incentive diagrams.[252] Researchers at Oxford and DeepMind argued that such problematic behavior is highly likely in advanced systems, and that advanced systems would seek power to keep control of their reward signal indefinitely and with near certainty.[254] They suggest a range of potential approaches to address this open problem.

Public policy

A number of governmental and treaty organizations have made statements emphasizing the importance of AI alignment.

In September 2021, the Secretary-General of the United Nations issued a declaration which included a call to regulate AI to ensure it is "aligned with shared global values."[255]

That same month, the PRC published ethical guidelines for the use of AI in China. According to the guidelines, researchers must ensure that AI abides by shared human values, is always under human control, and does not endanger public safety.[256]

Also in September 2021, the UK published its 10-year National AI Strategy,[257] which states the British government "takes the long term risk of non-aligned Artificial General Intelligence, and the unforeseeable changes that it would mean for... the world, seriously".[258] The strategy describes actions to assess long term AI risks, including catastrophic risks.[259]

In March 2021, the US National Security Commission on Artificial Intelligence stated that "Advances in AI... could lead to inflection points or leaps in capabilities. Such advances may also introduce new concerns and risks and the need for new policies, recommendations, and technical advances to assure that systems are aligned with goals and values, including safety, robustness and trustworthiness. The US should... ensure that AI systems and their uses align with our goals and values."[260]

See also

Footnotes

  1. ^ Some definitions of AI alignment require that the AI system advances more general goals such as objective ethical standards, widely shared values, or the intentions its designers would have if they were more informed and enlightened.[1]
  2. ^ The distinction between misaligned AI and incompetent AI has been formalized in certain contexts.[3]
  3. ^ For example, in a 2016 TV interview, Turing Award winner Geoffrey Hinton noted:[20]
    Hinton: Obviously having other superintelligent beings who are more intelligent than us is something to be nervous about [...].
    Interviewer: What aspect of it makes you nervous?
    Hinton: Well, will they be nice to us?
    Interviewer: It's just like the movies. You're worried about that scenario from the movies...
    Hinton: In the very long-run, yes. I think in the next 5-10 years [2021 to 2026] we don't have to worry about it. Also, the movies always portray it as an individual intelligence. I think it may be that it goes in a different direction where we sort of developed jointly with these things. So the things aren't fully autonomous; they're developed to help us; they're like personal assistants. And we'll develop with them. And it'll be more of a symbiosis than a rivalry. But we don't know.
    Interviewer: Is that an expectation or a hope?
    Hinton: That's a hope.
  4. ^ The AI principles created at the Asilomar Conference on Beneficial AI were signed by 1797 AI/robotics researchers.[21] Further, the UN Secretary-General’s report “Our Common Agenda”,[22] notes: “[T]he [UN] could also promote regulation of artificial intelligence to ensure that this is aligned with shared global values”.
  5. ^ See DeepMind: “Specification gaming: the flip side of AI ingenuity”[58] and the list of examples[59] linked therein.
  6. ^ Reinforcement learning systems have learned to gain more options by acquiring and protecting resources, sometimes in ways their designers did not intend.[87][88]
  7. ^ From a TV interview[99]:
    Hinton: Obviously having other superintelligent beings who are more intelligent than us is something to be nervous about [...].
    Interviewer: What aspect of it makes you nervous?
    Hinton: Well, will they be nice to us?
    Interviewer: It's just like the movies. You're worried about that scenario from the movies...
    Hinton: In the very long-run, yes. I think in the next 5-10 years [2021 to 2026] we don't have to worry about it. Also, the movies always portray it as an individual intelligence. I think it may be that it goes in a different direction where we sort of developed jointly with these things. So the things aren't fully autonomous; they're developed to help us; they're like personal assistants. And we'll develop with them. And it'll be more of a symbiosis than a rivalry. But we don't know.
    Interviewer: Is that an expectation or a hope?
    Hinton: That's a hope.
  8. ^ In a 1951 lecture[100] Turing argued that “It seems probable that once the machine thinking method had started, it would not take long to outstrip our feeble powers. There would be no question of the machines dying, and they would be able to converse with each other to sharpen their wits. At some stage therefore we should have to expect the machines to take control, in the way that is mentioned in Samuel Butler’s Erewhon.” Also, in a lecture broadcast on the BBC,[101] he expressed: “If a machine can think, it might think more intelligently than we do, and then where should we be? Even if we could keep the machines in a subservient position, for instance by turning off the power at strategic moments, we should, as a species, feel greatly humbled. . . . This new danger . . . is certainly something which can give us anxiety.”
  9. ^ Bengio wrote "This beautifully written book addresses a fundamental challenge for humanity: increasingly intelligent machines that do what we ask but not what we really intend. Essential reading if you care about our future" about Russell's book Human Compatible: AI and the Problem of Control,[103] which argues that existential risk from misaligned AI to humanity is a serious concern worth addressing today.
  10. ^ Pearl wrote "Human Compatible made me a convert to Russell's concerns with our ability to control our upcoming creation–super-intelligent machines. Unlike outside alarmists and futurists, Russell is a leading authority on AI. His new book will educate the public about AI more than any book I can think of, and is a delightful and uplifting read" about Russell's book Human Compatible: AI and the Problem of Control,[103] which argues that existential risk to humanity from misaligned AI is a serious concern worth addressing today.
  11. ^ Russell & Norvig[17] note: “The “King Midas problem” was anticipated by Marvin Minsky, who once suggested that an AI program designed to solve the Riemann Hypothesis might end up taking over all the resources of Earth to build more powerful supercomputers."
  12. ^ Vincent Wiegel argued that “we should extend [machines] with moral sensitivity to the moral dimensions of the situations in which the increasingly autonomous machines will inevitably find themselves”,[151] referencing the book Moral machines: teaching robots right from wrong[152] by Wendell Wallach and Colin Allen.
  13. ^ On the one hand, currently popular systems such as chatbots only provide services of limited scope lasting no longer than the time of a conversation, which requires little or no planning. The success of such approaches may indicate that future systems will also lack goal-directed planning, especially over long horizons. On the other hand, models are increasingly trained using goal-directed methods such as reinforcement learning (e.g. ChatGPT) and explicit planning architectures (e.g. AlphaGo Zero). As planning over long horizons is often helpful for humans, some researchers argue that companies will automate it once models become capable of it.[Cite Is Power-seeking AI an existential risk?] Similarly, political leaders may see an advantage in developing powerful AI systems that can outmaneuver adversaries through planning. Alternatively, long-term planning might emerge as a byproduct because it is useful e.g. for models that are trained to predict the actions of humans who themselves perform long-term planning.[229] Nonetheless, the majority of AI systems may remain myopic and perform no long-term planning.

References

  1. ^ Gabriel, Iason (September 1, 2020). "Artificial Intelligence, Values, and Alignment". Minds and Machines. 30 (3): 411–437. doi:10.1007/s11023-020-09539-2. ISSN 1572-8641. S2CID 210920551. Archived from the original on March 15, 2023. Retrieved July 23, 2022.
  2. ^ Russell, Stuart J.; Norvig, Peter (2020). Artificial intelligence: A modern approach (4th ed.). Pearson. pp. 31–34. ISBN 978-1-292-40113-3. OCLC 1303900751. Archived from the original on July 15, 2022. Retrieved September 12, 2022.
  3. ^ Hendrycks, Dan; Carlini, Nicholas; Schulman, John; Steinhardt, Jacob (June 16, 2022). "Unsolved Problems in ML Safety". arXiv:2109.13916 [cs].
  4. ^ a b c Russell, Stuart J.; Norvig, Peter (2020). Artificial intelligence: A modern approach (4th ed.). Pearson. pp. 31–34. ISBN 978-1-292-40113-3. OCLC 1303900751. Archived from the original on July 15, 2022. Retrieved September 12, 2022.
  5. ^ a b c Pan, Alexander; Bhatia, Kush; Steinhardt, Jacob (February 14, 2022). The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. International Conference on Learning Representations. Retrieved July 21, 2022.
  6. ^ Zhuang, Simon; Hadfield-Menell, Dylan (2020). "Consequences of Misaligned AI". Advances in Neural Information Processing Systems. Vol. 33. Curran Associates, Inc. pp. 15763–15773. Retrieved March 11, 2023.
  7. ^ Carlsmith, Joseph (June 16, 2022). "Is Power-Seeking AI an Existential Risk?". arXiv:2206.13353 [cs.CY].
  8. ^ a b c Russell, Stuart J. (2020). Human compatible: Artificial intelligence and the problem of control. Penguin Random House. ISBN 9780525558637. OCLC 1113410915.
  9. ^ Christian, Brian (2020). The alignment problem: Machine learning and human values. W. W. Norton & Company. ISBN 978-0-393-86833-3. OCLC 1233266753. Archived from the original on February 10, 2023. Retrieved September 12, 2022.
  10. ^ Langosco, Lauro Langosco Di; Koch, Jack; Sharkey, Lee D.; Pfau, Jacob; Krueger, David (June 28, 2022). "Goal Misgeneralization in Deep Reinforcement Learning". Proceedings of the 39th International Conference on Machine Learning. International Conference on Machine Learning. PMLR. pp. 12004–12019. Retrieved March 11, 2023.
  11. ^ a b Bommasani, Rishi; Hudson, Drew A.; Adeli, Ehsan; Altman, Russ; Arora, Simran; von Arx, Sydney; Bernstein, Michael S.; Bohg, Jeannette; Bosselut, Antoine; Brunskill, Emma; Brynjolfsson, Erik (July 12, 2022). "On the Opportunities and Risks of Foundation Models". Stanford CRFM. arXiv:2108.07258.
  12. ^ Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll L.; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, J.; Hilton, Jacob; Kelton, Fraser; Miller, Luke E.; Simens, Maddie; Askell, Amanda; Welinder, P.; Christiano, P.; Leike, J.; Lowe, Ryan J. (2022). "Training language models to follow instructions with human feedback". arXiv:2203.02155 [cs.CL].
  13. ^ Zaremba, Wojciech; Brockman, Greg; OpenAI (August 10, 2021). "OpenAI Codex". OpenAI. Archived from the original on February 3, 2023. Retrieved July 23, 2022.
  14. ^ Kober, Jens; Bagnell, J. Andrew; Peters, Jan (September 1, 2013). "Reinforcement learning in robotics: A survey". The International Journal of Robotics Research. 32 (11): 1238–1274. doi:10.1177/0278364913495721. ISSN 0278-3649. S2CID 1932843. Archived from the original on October 15, 2022. Retrieved September 12, 2022.
  15. ^ Knox, W. Bradley; Allievi, Alessandro; Banzhaf, Holger; Schmitt, Felix; Stone, Peter (March 1, 2023). "Reward (Mis)design for autonomous driving". Artificial Intelligence. 316: 103829. doi:10.1016/j.artint.2022.103829. ISSN 0004-3702.
  16. ^ Stray, Jonathan (2020). "Aligning AI Optimization to Community Well-Being". International Journal of Community Well-Being. 3 (4): 443–463. doi:10.1007/s42413-020-00086-3. ISSN 2524-5295. PMC 7610010. PMID 34723107. S2CID 226254676.
  17. ^ a b Russell, Stuart; Norvig, Peter (2009). Artificial Intelligence: A Modern Approach. Prentice Hall. p. 1010. ISBN 978-0-13-604259-4.
  18. ^ Ngo, Richard; Chan, Lawrence; Mindermann, Sören (February 22, 2023). "The alignment problem from a deep learning perspective". arXiv:2209.00626 [cs].
  19. ^ Smith, Craig S. "Geoff Hinton, AI's Most Famous Researcher, Warns Of 'Existential Threat'". Forbes. Retrieved May 4, 2023.
  20. ^ The Agenda | TVO Today (Director) (March 3, 2016). The Code That Runs Our Lives. Event occurs at 10:00. Retrieved March 13, 2023.
  21. ^ Future of Life Institute (August 11, 2017). "Asilomar AI Principles". Future of Life Institute. Archived from the original on October 10, 2022. Retrieved July 18, 2022.
  22. ^ United Nations (2021). Our Common Agenda: Report of the Secretary-General (PDF) (Report). New York: United Nations. Archived (PDF) from the original on May 22, 2022. Retrieved September 12, 2022.
  23. ^ a b c Amodei, Dario; Olah, Chris; Steinhardt, Jacob; Christiano, Paul; Schulman, John; Mané, Dan (June 21, 2016). "Concrete Problems in AI Safety". arXiv:1606.06565 [cs.AI].
  24. ^ a b Ortega, Pedro A.; Maini, Vishal; DeepMind safety team (September 27, 2018). "Building safe artificial intelligence: specification, robustness, and assurance". DeepMind Safety Research - Medium. Archived from the original on February 10, 2023. Retrieved July 18, 2022.
  25. ^ a b Rorvig, Mordechai (April 14, 2022). "Researchers Gain New Understanding From Simple AI". Quanta Magazine. Archived from the original on February 10, 2023. Retrieved July 18, 2022.
  26. ^ Doshi-Velez, Finale; Kim, Been (March 2, 2017). "Towards A Rigorous Science of Interpretable Machine Learning". arXiv:1702.08608 [stat.ML].
  27. ^ Wiblin, Robert (August 4, 2021). "Chris Olah on what the hell is going on inside neural networks" (Podcast). 80,000 hours. No. 107. Retrieved July 23, 2022.
  28. ^ Doshi-Velez, Finale; Kim, Been (March 2, 2017). "Towards A Rigorous Science of Interpretable Machine Learning". arXiv:1702.08608 [stat.ML].
  29. ^ Russell, Stuart; Dewey, Daniel; Tegmark, Max (December 31, 2015). "Research Priorities for Robust and Beneficial Artificial Intelligence". AI Magazine. 36 (4): 105–114. doi:10.1609/aimag.v36i4.2577. hdl:1721.1/108478. ISSN 2371-9621. S2CID 8174496. Archived from the original on February 2, 2023. Retrieved September 12, 2022.
  30. ^ Wirth, Christian; Akrour, Riad; Neumann, Gerhard; Fürnkranz, Johannes (2017). "A survey of preference-based reinforcement learning methods". Journal of Machine Learning Research. 18 (136): 1–46.
  31. ^ Christiano, Paul F.; Leike, Jan; Brown, Tom B.; Martic, Miljan; Legg, Shane; Amodei, Dario (2017). "Deep reinforcement learning from human preferences". Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS'17. Red Hook, NY, USA: Curran Associates Inc. pp. 4302–4310. ISBN 978-1-5108-6096-4.
  32. ^ Heaven, Will Douglas (January 27, 2022). "The new version of GPT-3 is much better behaved (and should be less toxic)". MIT Technology Review. Archived from the original on February 10, 2023. Retrieved July 18, 2022.
  33. ^ Mohseni, Sina; Wang, Haotao; Yu, Zhiding; Xiao, Chaowei; Wang, Zhangyang; Yadawa, Jay (March 7, 2022). "Taxonomy of Machine Learning Safety: A Survey and Primer". arXiv:2106.04823 [cs.LG].
  34. ^ Clifton, Jesse (2020). "Cooperation, Conflict, and Transformative Artificial Intelligence: A Research Agenda". Center on Long-Term Risk. Archived from the original on January 1, 2023. Retrieved July 18, 2022.
  35. ^ Dafoe, Allan; Bachrach, Yoram; Hadfield, Gillian; Horvitz, Eric; Larson, Kate; Graepel, Thore (May 6, 2021). "Cooperative AI: machines must learn to find common ground". Nature. 593 (7857): 33–36. Bibcode:2021Natur.593...33D. doi:10.1038/d41586-021-01170-0. ISSN 0028-0836. PMID 33947992. S2CID 233740521. Archived from the original on December 18, 2022. Retrieved September 12, 2022.
  36. ^ Prunkl, Carina; Whittlestone, Jess (February 7, 2020). "Beyond Near- and Long-Term: Towards a Clearer Account of Research Priorities in AI Ethics and Society". Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. New York NY USA: ACM: 138–143. doi:10.1145/3375627.3375803. ISBN 978-1-4503-7110-0. S2CID 210164673. Archived from the original on October 16, 2022. Retrieved September 12, 2022.
  37. ^ Irving, Geoffrey; Askell, Amanda (February 19, 2019). "AI Safety Needs Social Scientists". Distill. 4 (2): 10.23915/distill.00014. doi:10.23915/distill.00014. ISSN 2476-0757. S2CID 159180422. Archived from the original on February 10, 2023. Retrieved September 12, 2022.
  38. ^ "Faulty Reward Functions in the Wild". OpenAI. December 22, 2016. Archived from the original on January 26, 2021. Retrieved September 10, 2022.
  39. ^ Wiener, Norbert (May 6, 1960). "Some Moral and Technical Consequences of Automation: As machines learn they may develop unforeseen strategies at rates that baffle their programmers". Science. 131 (3410): 1355–1358. doi:10.1126/science.131.3410.1355. ISSN 0036-8075. PMID 17841602. Archived from the original on October 15, 2022. Retrieved September 12, 2022.
  40. ^ Russell, Stuart J. (2020). Human compatible: Artificial intelligence and the problem of control. Penguin Random House. ISBN 9780525558637. OCLC 1113410915.
  41. ^ The Ezra Klein Show (June 4, 2021). "If 'All Models Are Wrong,' Why Do We Give Them So Much Power?". The New York Times. ISSN 0362-4331. Archived from the original on February 15, 2023. Retrieved March 13, 2023.
  42. ^ Wolchover, Natalie (April 21, 2015). "Concerns of an Artificial Intelligence Pioneer". Quanta Magazine. Archived from the original on February 10, 2023. Retrieved March 13, 2023.
  43. ^ California Assembly. "Bill Text - ACR-215 23 Asilomar AI Principles". Archived from the original on February 10, 2023. Retrieved July 18, 2022.
  44. ^ Johnson, Steven; Iziev, Nikita (April 15, 2022). "A.I. Is Mastering Language. Should We Trust What It Says?". The New York Times. ISSN 0362-4331. Archived from the original on November 24, 2022. Retrieved July 18, 2022.
  45. ^ OpenAI. "Developing safe & responsible AI". Retrieved March 13, 2023.
  46. ^ "DeepMind Safety Research". Medium. Archived from the original on February 10, 2023. Retrieved March 13, 2023.
  47. ^ Russell, Stuart J.; Norvig, Peter (2020). Artificial intelligence: A modern approach (4th ed.). Pearson. pp. 31–34. ISBN 978-1-292-40113-3. OCLC 1303900751. Archived from the original on July 15, 2022. Retrieved September 12, 2022.
  48. ^ Ngo, Richard; Chan, Lawrence; Mindermann, Sören (February 22, 2023). "The alignment problem from a deep learning perspective". arXiv:2209.00626 [cs].
  49. ^ Amodei, Dario; Olah, Chris; Steinhardt, Jacob; Christiano, Paul; Schulman, John; Mané, Dan (June 21, 2016). "Concrete Problems in AI Safety". arXiv:1606.06565 [cs.AI].
  50. ^ a b c Hendrycks, Dan; Carlini, Nicholas; Schulman, John; Steinhardt, Jacob (June 16, 2022). "Unsolved Problems in ML Safety". arXiv:2109.13916 [cs.LG].
  51. ^ Russell, Stuart J.; Norvig, Peter (2020). Artificial intelligence: A modern approach (4th ed.). Pearson. pp. 4–5. ISBN 978-1-292-40113-3. OCLC 1303900751. Archived from the original on July 15, 2022. Retrieved September 12, 2022.
  52. ^ Krakovna, Victoria; Uesato, Jonathan; Mikulik, Vladimir; Rahtz, Matthew; Everitt, Tom; Kumar, Ramana; Kenton, Zac; Leike, Jan; Legg, Shane (April 21, 2020). "Specification gaming: the flip side of AI ingenuity". Deepmind. Archived from the original on February 10, 2023. Retrieved August 26, 2022.
  53. ^ Ortega, Pedro A.; Maini, Vishal; DeepMind safety team (September 27, 2018). "Building safe artificial intelligence: specification, robustness, and assurance". DeepMind Safety Research - Medium. Archived from the original on February 10, 2023. Retrieved July 18, 2022.
  54. ^ Krakovna, Victoria; Uesato, Jonathan; Mikulik, Vladimir; Rahtz, Matthew; Everitt, Tom; Kumar, Ramana; Kenton, Zac; Leike, Jan; Legg, Shane (April 21, 2020). "Specification gaming: the flip side of AI ingenuity". Deepmind. Retrieved August 26, 2022.
  55. ^ Pan, Alexander; Bhatia, Kush; Steinhardt, Jacob (February 14, 2022). The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. International Conference on Learning Representations. Retrieved July 21, 2022.
  56. ^ Manheim, David; Garrabrant, Scott (2018). "Categorizing Variants of Goodhart's Law". arXiv:1803.04585 [cs.AI].
  57. ^ Amodei, Dario; Christiano, Paul; Ray, Alex (June 13, 2017). "Learning from Human Preferences". OpenAI. Archived from the original on January 3, 2021. Retrieved July 21, 2022.
  58. ^ deepmind. "Specification gaming: the flip side of AI ingenuity". Retrieved March 13, 2023.
  59. ^ "Specification gaming examples in AI—master list". Retrieved March 13, 2023.
  60. ^ Gabriel, Iason (September 1, 2020). "Artificial Intelligence, Values, and Alignment". Minds and Machines. 30 (3): 411–437. doi:10.1007/s11023-020-09539-2. ISSN 1572-8641.
  61. ^ Amodei, Dario; Christiano, Paul; Ray, Alex (June 13, 2017). "Learning from Human Preferences". OpenAI. Archived from the original on January 3, 2021. Retrieved July 21, 2022.
  62. ^ Lin, Stephanie; Hilton, Jacob; Evans, Owain (2022). "TruthfulQA: Measuring How Models Mimic Human Falsehoods". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics: 3214–3252. doi:10.18653/v1/2022.acl-long.229. S2CID 237532606. Archived from the original on February 10, 2023. Retrieved September 12, 2022.
  63. ^ Naughton, John (October 2, 2021). "The truth about artificial intelligence? It isn't that honest". The Observer. ISSN 0029-7712. Archived from the original on February 13, 2023. Retrieved July 23, 2022.
  64. ^ Ji, Ziwei; Lee, Nayeon; Frieske, Rita; Yu, Tiezheng; Su, Dan; Xu, Yan; Ishii, Etsuko; Bang, Yejin; Madotto, Andrea; Fung, Pascale (February 1, 2022). "Survey of Hallucination in Natural Language Generation". ACM Computing Surveys. 55 (12): 1–38. arXiv:2202.03629. doi:10.1145/3571730. S2CID 246652372. Archived from the original on February 10, 2023. Retrieved October 14, 2022.
  65. ^ Else, Holly (January 12, 2023). "Abstracts written by ChatGPT fool scientists". Nature. 613 (7944): 423–423. doi:10.1038/d41586-023-00056-7.
  66. ^ Hendrycks, Dan; Carlini, Nicholas; Schulman, John; Steinhardt, Jacob (June 16, 2022). "Unsolved Problems in ML Safety". arXiv:2109.13916 [cs.LG].
  67. ^ Bommasani, Rishi; Hudson, Drew A.; Adeli, Ehsan; Altman, Russ; Arora, Simran; von Arx, Sydney; Bernstein, Michael S.; Bohg, Jeannette; Bosselut, Antoine; Brunskill, Emma; Brynjolfsson, Erik (July 12, 2022). "On the Opportunities and Risks of Foundation Models". Stanford CRFM. arXiv:2108.07258.
  68. ^ Edge.org. "The Myth Of AI | Edge.org". Archived from the original on February 10, 2023. Retrieved July 19, 2022.
  69. ^ Tasioulas, John (2019). "First Steps Towards an Ethics of Robots and Artificial Intelligence". Journal of Practical Ethics. 7 (1): 61–95.
  70. ^ a b Russell, Stuart J. (2020). Human compatible: Artificial intelligence and the problem of control. Penguin Random House. ISBN 9780525558637. OCLC 1113410915.
  71. ^ Russell, Stuart J.; Norvig, Peter (2020). Artificial intelligence: A modern approach (4th ed.). Pearson. pp. 31–34. ISBN 978-1-292-40113-3. OCLC 1303900751. Archived from the original on July 15, 2022. Retrieved September 12, 2022.
  72. ^ Bommasani, Rishi; Hudson, Drew A.; Adeli, Ehsan; Altman, Russ; Arora, Simran; von Arx, Sydney; Bernstein, Michael S.; Bohg, Jeannette; Bosselut, Antoine; Brunskill, Emma; Brynjolfsson, Erik (July 12, 2022). "On the Opportunities and Risks of Foundation Models". Stanford CRFM. arXiv:2108.07258.
  73. ^ Wells, Georgia; Deepa Seetharaman; Horwitz, Jeff (November 5, 2021). "Is Facebook Bad for You? It Is for About 360 Million Users, Company Surveys Suggest". Wall Street Journal. ISSN 0099-9660. Archived from the original on February 10, 2023. Retrieved July 19, 2022.
  74. ^ Barrett, Paul M.; Hendrix, Justin; Sims, J. Grant (September 2021). How Social Media Intensifies U.S. Political Polarization-And What Can Be Done About It (Report). Center for Business and Human Rights, NYU. Archived from the original on February 1, 2023. Retrieved September 12, 2022.
  75. ^ Shepardson, David (May 24, 2018). "Uber disabled emergency braking in self-driving car: U.S. agency". Reuters. Archived from the original on February 10, 2023. Retrieved July 20, 2022.
  76. ^ Russell, Stuart J. (2020). Human compatible: Artificial intelligence and the problem of control. Penguin Random House. ISBN 9780525558637. OCLC 1113410915.
  77. ^ Baum, Seth (January 1, 2021). "2020 Survey of Artificial General Intelligence Projects for Ethics, Risk, and Policy". Archived from the original on February 10, 2023. Retrieved July 20, 2022.
  78. ^ Bommasani, Rishi; Hudson, Drew A.; Adeli, Ehsan; Altman, Russ; Arora, Simran; von Arx, Sydney; Bernstein, Michael S.; Bohg, Jeannette; Bosselut, Antoine; Brunskill, Emma; Brynjolfsson, Erik (July 12, 2022). "On the Opportunities and Risks of Foundation Models". Stanford CRFM. arXiv:2108.07258.
  79. ^ Wei, Jason; Tay, Yi; Bommasani, Rishi; Raffel, Colin; Zoph, Barret; Borgeaud, Sebastian; Yogatama, Dani; Bosma, Maarten; Zhou, Denny; Metzler, Donald; Chi, Ed H.; Hashimoto, Tatsunori; Vinyals, Oriol; Liang, Percy; Dean, Jeff (October 26, 2022). "Emergent Abilities of Large Language Models". arXiv:2206.07682 [cs].
  80. ^ Dominguez, Daniel (May 19, 2022). "DeepMind Introduces Gato, a New Generalist AI Agent". InfoQ. Archived from the original on February 10, 2023. Retrieved September 9, 2022.
  81. ^ Edwards, Ben (April 26, 2022). "Adept's AI assistant can browse, search, and use web apps like a human". Ars Technica. Archived from the original on January 17, 2023. Retrieved September 9, 2022.
  82. ^ Grace, Katja; Salvatier, John; Dafoe, Allan; Zhang, Baobao; Evans, Owain (July 31, 2018). "Viewpoint: When Will AI Exceed Human Performance? Evidence from AI Experts". Journal of Artificial Intelligence Research. 62: 729–754. doi:10.1613/jair.1.11222. ISSN 1076-9757. S2CID 8746462. Archived from the original on February 10, 2023. Retrieved September 12, 2022.
  83. ^ Zhang, Baobao; Anderljung, Markus; Kahn, Lauren; Dreksler, Noemi; Horowitz, Michael C.; Dafoe, Allan (August 2, 2021). "Ethics and Governance of Artificial Intelligence: Evidence from a Survey of Machine Learning Researchers". Journal of Artificial Intelligence Research. 71. doi:10.1613/jair.1.12895. ISSN 1076-9757. S2CID 233740003. Archived from the original on February 10, 2023. Retrieved September 12, 2022.
  84. ^ Future of Life Institute (March 22, 2023). "Pause Giant AI Experiments: An Open Letter". Retrieved April 20, 2023.
  85. ^ a b Bommasani, Rishi; Hudson, Drew A.; Adeli, Ehsan; Altman, Russ; Arora, Simran; von Arx, Sydney; Bernstein, Michael S.; Bohg, Jeannette; Bosselut, Antoine; Brunskill, Emma; Brynjolfsson, Erik (July 12, 2022). "On the Opportunities and Risks of Foundation Models". Stanford CRFM. arXiv:2108.07258.
  86. ^ a b c d Carlsmith, Joseph (June 16, 2022). "Is Power-Seeking AI an Existential Risk?". arXiv:2206.13353 [cs.CY].
  87. ^ Ornes, Stephen (November 18, 2019). "Playing Hide-and-Seek, Machines Invent New Tools". Quanta Magazine. Retrieved August 26, 2022.
  88. ^ a b Carlsmith, Joseph (June 16, 2022). "Is Power-Seeking AI an Existential Risk?". arXiv:2206.13353 [cs.CY].
  89. ^ Perez, Ethan; Ringer, Sam; Lukošiūtė, Kamilė; Nguyen, Karina; Chen, Edwin; Heiner, Scott; Pettit, Craig; Olsson, Catherine; Kundu, Sandipan; Kadavath, Saurav; Jones, Andy; Chen, Anna; Mann, Ben; Israel, Brian; Seethor, Bryan (December 19, 2022). "Discovering Language Model Behaviors with Model-Written Evaluations". arXiv:2212.09251 [cs].
  90. ^ Leike, Jan; Martic, Miljan; Krakovna, Victoria; Ortega, Pedro A.; Everitt, Tom; Lefrancq, Andrew; Orseau, Laurent; Legg, Shane (November 28, 2017). "AI Safety Gridworlds". arXiv:1711.09883 [cs].
  91. ^ Orseau, Laurent; Armstrong, Stuart (June 25, 2016). "Safely interruptible agents". Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence. UAI'16. Arlington, Virginia, USA: AUAI Press: 557–566. ISBN 978-0-9966431-1-5.
  92. ^ Hadfield-Menell, Dylan; Dragan, Anca; Abbeel, Pieter; Russell, Stuart (August 19, 2017). "The off-switch game". Proceedings of the 26th International Joint Conference on Artificial Intelligence. IJCAI'17. Melbourne, Australia: AAAI Press: 220–227. ISBN 978-0-9992411-0-3.
  93. ^ Turner, Alexander Matt; Smith, Logan Riggs; Shah, Rohin; Critch, Andrew; Tadepalli, Prasad (2021). "Optimal policies tend to seek power". Advances in neural information processing systems.
  94. ^ Turner, Alexander Matt; Tadepalli, Prasad (2022). "Parametrically retargetable decision-makers tend to seek power". Advances in neural information processing systems.
  95. ^ Bostrom, Nick (2014). Superintelligence: Paths, Dangers, Strategies (1st ed.). USA: Oxford University Press, Inc. ISBN 978-0-19-967811-2.
  96. ^ Russell, Stuart J. (2020). Human compatible: Artificial intelligence and the problem of control. Penguin Random House. ISBN 9780525558637. OCLC 1113410915.
  97. ^ Russell, Stuart J.; Norvig, Peter (2020). Artificial intelligence: A modern approach (4th ed.). Pearson. pp. 31–34. ISBN 978-1-292-40113-3. OCLC 1303900751. Archived from the original on July 15, 2022. Retrieved September 12, 2022.
  98. ^ a b c d Russell, Stuart J. (2020). Human compatible: Artificial intelligence and the problem of control. Penguin Random House. ISBN 9780525558637. OCLC 1113410915.
  99. ^ The Agenda | TVO Today (March 3, 2016). The Code That Runs Our Lives. Event occurs at 10:00. Retrieved March 13, 2023.
  100. ^ Turing, Alan (1951). Intelligent machinery, a heretical theory (Speech). Lecture given to '51 Society'. Manchester: The Turing Digital Archive. Archived from the original on September 26, 2022. Retrieved July 22, 2022.
  101. ^ Turing, Alan (May 15, 1951). "Can digital computers think?". Automatic Calculating Machines. Episode 2. BBC.
  102. ^ Muehlhauser, Luke (January 29, 2016). "Sutskever on Talking Machines". Luke Muehlhauser. Archived from the original on September 27, 2022. Retrieved August 26, 2022.
  103. ^ a b Russell, Stuart J. (2020). Human compatible: Artificial intelligence and the problem of control. Penguin Random House. ISBN 9780525558637. OCLC 1113410915. Archived from the original on February 10, 2023. Retrieved September 12, 2022.
  104. ^ Shanahan, Murray (2015). The technological singularity. Cambridge, Massachusetts. ISBN 978-0-262-33182-1. OCLC 917889148.
  105. ^ Wiener, Norbert (May 6, 1960). "Some Moral and Technical Consequences of Automation: As machines learn they may develop unforeseen strategies at rates that baffle their programmers". Science. 131 (3410): 1355–1358. doi:10.1126/science.131.3410.1355. ISSN 0036-8075. PMID 17841602.
  106. ^ Rossi, Francesca. "Opinion | How do you teach a machine to be moral?". Washington Post. ISSN 0190-8286. Archived from the original on February 10, 2023. Retrieved September 12, 2022.
  107. ^ Aaronson, Scott (June 17, 2022). "OpenAI!". Shtetl-Optimized. Archived from the original on August 27, 2022. Retrieved September 12, 2022.
  108. ^ Selman, Bart, Intelligence Explosion: Science or Fiction? (PDF), archived (PDF) from the original on May 31, 2022, retrieved September 12, 2022
  109. ^ McAllester (August 10, 2014). "Friendly AI and the Servant Mission". Machine Thoughts. Archived from the original on September 28, 2022. Retrieved September 12, 2022.
  110. ^ Schmidhuber, Jürgen (March 6, 2015). "I am Jürgen Schmidhuber, AMA!" (Reddit Comment). r/MachineLearning. Archived from the original on February 10, 2023. Retrieved July 23, 2022.
  111. ^ Everitt, Tom; Lea, Gary; Hutter, Marcus (May 21, 2018). "AGI Safety Literature Review". arXiv:1805.01109 [cs.AI].
  112. ^ Shane (August 31, 2009). "Funding safe AGI". vetta project. Archived from the original on October 10, 2022. Retrieved September 12, 2022.
  113. ^ "Eric Horvitz", Wikipedia, February 16, 2023, retrieved March 13, 2023
  114. ^ Horvitz, Eric (June 27, 2016). "Reflections on Safety and Artificial Intelligence" (PDF). Eric Horvitz. Archived (PDF) from the original on October 10, 2022. Retrieved April 20, 2020.
  115. ^ Chollet, François (December 8, 2018). "The implausibility of intelligence explosion". Medium. Archived from the original on March 22, 2021. Retrieved August 26, 2022.
  116. ^ Marcus, Gary (June 6, 2022). "Artificial General Intelligence Is Not as Imminent as You Might Think". Scientific American. Archived from the original on September 15, 2022. Retrieved August 26, 2022.
  117. ^ Barber, Lynsey (July 31, 2016). "Phew! Facebook's AI chief says intelligent machines are not a threat to humanity". CityAM. Archived from the original on August 26, 2022. Retrieved August 26, 2022.
  118. ^ Harris, Jeremie (June 16, 2021). "The case against (worrying about) existential risk from AI". Medium. Archived from the original on August 26, 2022. Retrieved August 26, 2022.
  119. ^ Pan, Alexander; Bhatia, Kush; Steinhardt, Jacob (February 14, 2022). The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. International Conference on Learning Representations. Retrieved July 21, 2022.
  120. ^ Turner, Alexander Matt; Smith, Logan Riggs; Shah, Rohin; Critch, Andrew; Tadepalli, Prasad (2021). "Optimal policies tend to seek power". Advances in neural information processing systems.
  121. ^ Carlsmith, Joseph (June 16, 2022). "Is Power-Seeking AI an Existential Risk?". arXiv:2206.13353 [cs.CY].
  122. ^ Bostrom, Nick (2014). Superintelligence: Paths, Dangers, Strategies (1st ed.). USA: Oxford University Press, Inc. ISBN 978-0-19-967811-2.
  123. ^ Gabriel, Iason (September 1, 2020). "Artificial Intelligence, Values, and Alignment". Minds and Machines. 30 (3): 411–437. doi:10.1007/s11023-020-09539-2. ISSN 1572-8641.
  124. ^ Amodei, Dario; Olah, Chris; Steinhardt, Jacob; Christiano, Paul; Schulman, John; Mané, Dan (June 21, 2016). "Concrete Problems in AI Safety". arXiv:1606.06565 [cs.AI].
  125. ^ Krakovna, Victoria; Uesato, Jonathan; Mikulik, Vladimir; Rahtz, Matthew; Everitt, Tom; Kumar, Ramana; Kenton, Zac; Leike, Jan; Legg, Shane (April 21, 2020). "Specification gaming: the flip side of AI ingenuity". DeepMind. Retrieved March 13, 2023.
  126. ^ Rochon, Louis-Philippe; Rossi, Sergio (February 27, 2015). The Encyclopedia of Central Banking. Edward Elgar Publishing. ISBN 978-1-78254-744-0. Archived from the original on February 10, 2023. Retrieved September 13, 2022.
  127. ^ Christian, Brian (2020). "Chapter 7". The alignment problem: Machine learning and human values. W. W. Norton & Company. ISBN 978-0-393-86833-3. OCLC 1233266753.
  128. ^ Amodei, Dario; Olah, Chris; Steinhardt, Jacob; Christiano, Paul; Schulman, John; Mané, Dan (June 21, 2016). "Concrete Problems in AI Safety". arXiv:1606.06565 [cs.AI].
  129. ^ Christian, Brian (2020). The alignment problem: Machine learning and human values. W. W. Norton & Company. p. 88. ISBN 978-0-393-86833-3. OCLC 1233266753. Archived from the original on February 10, 2023. Retrieved September 12, 2022.
  130. ^ Ng, Andrew Y.; Russell, Stuart J. (June 29, 2000). "Algorithms for Inverse Reinforcement Learning". Proceedings of the Seventeenth International Conference on Machine Learning. ICML '00. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.: 663–670. ISBN 978-1-55860-707-1.
  131. ^ Russell, Stuart J. (2020). Human compatible: Artificial intelligence and the problem of control. Penguin Random House. ISBN 9780525558637. OCLC 1113410915.
  132. ^ Hadfield-Menell, Dylan; Russell, Stuart J; Abbeel, Pieter; Dragan, Anca (2016). "Cooperative inverse reinforcement learning". Advances in neural information processing systems. Vol. 29. Curran Associates, Inc.
  133. ^ Hadfield-Menell, Dylan; Dragan, Anca; Abbeel, Pieter; Russell, Stuart (August 19, 2017). "The off-switch game". Proceedings of the 26th International Joint Conference on Artificial Intelligence. IJCAI'17. Melbourne, Australia: AAAI Press: 220–227. ISBN 978-0-9992411-0-3.
  134. ^ a b Everitt, Tom; Lea, Gary; Hutter, Marcus (May 21, 2018). "AGI Safety Literature Review". arXiv:1805.01109 [cs.AI].
  135. ^ Mindermann, Soren; Armstrong, Stuart (2018). "Occam's razor is insufficient to infer the preferences of irrational agents". Proceedings of the 32nd international conference on neural information processing systems. NIPS'18. Red Hook, NY, USA: Curran Associates Inc. pp. 5603–5614.
  136. ^ Wirth, Christian; Akrour, Riad; Neumann, Gerhard; Fürnkranz, Johannes (2017). "A survey of preference-based reinforcement learning methods". Journal of Machine Learning Research. 18 (136): 1–46.
  137. ^ Heaven, Will Douglas (January 27, 2022). "The new version of GPT-3 is much better behaved (and should be less toxic)". MIT Technology Review. Archived from the original on February 10, 2023. Retrieved July 18, 2022.
  138. ^ Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll L.; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, J.; Hilton, Jacob; Kelton, Fraser; Miller, Luke E.; Simens, Maddie; Askell, Amanda; Welinder, P.; Christiano, P.; Leike, J.; Lowe, Ryan J. (2022). "Training language models to follow instructions with human feedback". arXiv:2203.02155 [cs.CL].
  139. ^ Fürnkranz, Johannes; Hüllermeier, Eyke; Rudin, Cynthia; Slowinski, Roman; Sanner, Scott (2014). "Preference Learning". Dagstuhl Reports. 4 (3). Marc Herbstritt: 27 pages. doi:10.4230/DAGREP.4.3.1. Archived from the original on February 10, 2023. Retrieved September 12, 2022.
  140. ^ Amodei, Dario; Olah, Chris; Steinhardt, Jacob; Christiano, Paul; Schulman, John; Mané, Dan (June 21, 2016). "Concrete Problems in AI Safety". arXiv:1606.06565 [cs.AI].
  141. ^ Gao, Leo; Schulman, John; Hilton, Jacob (October 19, 2022). "Scaling Laws for Reward Model Overoptimization". arXiv:2210.10760 [cs, stat].
  142. ^ Perez, Ethan; Ringer, Sam; Lukošiūtė, Kamilė; Nguyen, Karina; Chen, Edwin; Heiner, Scott; Pettit, Craig; Olsson, Catherine; Kundu, Sandipan; Kadavath, Saurav; Jones, Andy; Chen, Anna; Mann, Ben; Israel, Brian; Seethor, Bryan (December 19, 2022). "Discovering Language Model Behaviors with Model-Written Evaluations". arXiv:2212.09251 [cs].
  143. ^ Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll L.; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, J.; Hilton, Jacob; Kelton, Fraser; Miller, Luke E.; Simens, Maddie; Askell, Amanda; Welinder, P.; Christiano, P.; Leike, J.; Lowe, Ryan J. (2022). "Training language models to follow instructions with human feedback". arXiv:2203.02155 [cs.CL].
  144. ^ a b Heaven, Will Douglas (January 27, 2022). "The new version of GPT-3 is much better behaved (and should be less toxic)". MIT Technology Review. Archived from the original on February 10, 2023. Retrieved July 18, 2022.
  145. ^ Anderson, Martin (April 5, 2022). "The Perils of Using Quotations to Authenticate NLG Content". Unite.AI. Archived from the original on February 10, 2023. Retrieved July 21, 2022.
  146. ^ Wiggers, Kyle (February 5, 2022). "Despite recent progress, AI-powered chatbots still have a long way to go". VentureBeat. Archived from the original on July 23, 2022. Retrieved July 23, 2022.
  147. ^ Hendrycks, Dan; Burns, Collin; Basart, Steven; Critch, Andrew; Li, Jerry; Song, Dawn; Steinhardt, Jacob (July 24, 2021). "Aligning AI With Shared Human Values". International Conference on Learning Representations. arXiv:2008.02275.
  148. ^ Perez, Ethan; Huang, Saffron; Song, Francis; Cai, Trevor; Ring, Roman; Aslanides, John; Glaese, Amelia; McAleese, Nat; Irving, Geoffrey (February 7, 2022). "Red Teaming Language Models with Language Models". arXiv:2202.03286 [cs.CL].
  149. ^ Bhattacharyya, Sreejani (February 14, 2022). "DeepMind's "red teaming" language models with language models: What is it?". Analytics India Magazine. Archived from the original on February 13, 2023. Retrieved July 23, 2022.
  150. ^ Anderson, Michael; Anderson, Susan Leigh (December 15, 2007). "Machine Ethics: Creating an Ethical Intelligent Agent". AI Magazine. 28 (4): 15. doi:10.1609/aimag.v28i4.2065. ISSN 2371-9621. Retrieved March 14, 2023.
  151. ^ Wiegel, Vincent (December 1, 2010). "Wendell Wallach and Colin Allen: moral machines: teaching robots right from wrong". Ethics and Information Technology. 12 (4): 359–361. doi:10.1007/s10676-010-9239-1. ISSN 1572-8439. S2CID 30532107. Archived from the original on March 15, 2023. Retrieved July 23, 2022.
  152. ^ Wallach, Wendell; Allen, Colin (2009). Moral Machines: Teaching Robots Right from Wrong. New York: Oxford University Press. ISBN 978-0-19-537404-9. Archived from the original on March 15, 2023. Retrieved July 23, 2022.
  153. ^ a b Gabriel, Iason (September 1, 2020). "Artificial Intelligence, Values, and Alignment". Minds and Machines. 30 (3): 411–437. doi:10.1007/s11023-020-09539-2. ISSN 1572-8641.
  154. ^ MacAskill, William (2022). What we owe the future. New York, NY. ISBN 978-1-5416-1862-6. OCLC 1314633519. Archived from the original on September 14, 2022. Retrieved September 12, 2022.
  155. ^ Wu, Jeff; Ouyang, Long; Ziegler, Daniel M.; Stiennon, Nisan; Lowe, Ryan; Leike, Jan; Christiano, Paul (September 27, 2021). "Recursively Summarizing Books with Human Feedback". arXiv:2109.10862 [cs].
  156. ^ Zaremba, Wojciech; Brockman, Greg; OpenAI (August 10, 2021). "OpenAI Codex". OpenAI. Archived from the original on February 3, 2023. Retrieved July 23, 2022.
  157. ^ Pearce, Hammond; Ahmad, Baleegh; Tan, Benjamin; Dolan-Gavitt, Brendan; Karri, Ramesh (2022). "Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions". 2022 IEEE Symposium on Security and Privacy (SP). San Francisco, CA, USA: IEEE: 754–768. doi:10.1109/SP46214.2022.9833571. ISBN 978-1-6654-1316-9.
  158. ^ Irving, Geoffrey; Amodei, Dario (May 3, 2018). "AI Safety via Debate". OpenAI. Archived from the original on February 10, 2023. Retrieved July 23, 2022.
  159. ^ Lin, Stephanie; Hilton, Jacob; Evans, Owain (2022). "TruthfulQA: Measuring How Models Mimic Human Falsehoods". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics: 3214–3252. doi:10.18653/v1/2022.acl-long.229. S2CID 237532606. Archived from the original on February 10, 2023. Retrieved September 12, 2022.
  160. ^ Naughton, John (October 2, 2021). "The truth about artificial intelligence? It isn't that honest". The Observer. ISSN 0029-7712. Archived from the original on February 13, 2023. Retrieved July 23, 2022.
  161. ^ Christiano, Paul; Shlegeris, Buck; Amodei, Dario (October 19, 2018). "Supervising strong learners by amplifying weak experts". arXiv:1810.08575 [cs.LG].
  162. ^ Banzhaf, Wolfgang; Goodman, Erik; Sheneman, Leigh; Trujillo, Leonardo; Worzel, Bill, eds. (2020). Genetic Programming Theory and Practice XVII. Genetic and Evolutionary Computation. Cham: Springer International Publishing. doi:10.1007/978-3-030-39958-0. ISBN 978-3-030-39957-3. S2CID 218531292. Archived from the original on March 15, 2023. Retrieved July 23, 2022.
  163. ^ Amodei, Dario; Olah, Chris; Steinhardt, Jacob; Christiano, Paul; Schulman, John; Mané, Dan (June 21, 2016). "Concrete Problems in AI Safety". arXiv:1606.06565 [cs.AI].
  164. ^ Wiblin, Robert (October 2, 2018). "Dr Paul Christiano on how OpenAI is developing real solutions to the 'AI alignment problem', and his vision of how humanity will progressively hand over decision-making to AI systems" (Podcast). 80,000 hours. No. 44. Archived from the original on December 14, 2022. Retrieved July 23, 2022.
  165. ^ Amodei, Dario; Christiano, Paul; Ray, Alex (June 13, 2017). "Learning from Human Preferences". OpenAI. Archived from the original on January 3, 2021. Retrieved July 21, 2022.
  166. ^ Lehman, Joel; Clune, Jeff; Misevic, Dusan; Adami, Christoph; Altenberg, Lee; Beaulieu, Julie; Bentley, Peter J.; Bernard, Samuel; Beslon, Guillaume; Bryson, David M.; Cheney, Nick (2020). "The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities". Artificial Life. 26 (2): 274–306. doi:10.1162/artl_a_00319. ISSN 1064-5462. PMID 32271631. S2CID 4519185. Archived from the original on October 10, 2022. Retrieved September 12, 2022.
  167. ^ Pan, Alexander; Bhatia, Kush; Steinhardt, Jacob (February 14, 2022). The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. International Conference on Learning Representations. Retrieved July 21, 2022.
  168. ^ Bostrom, Nick (2014). Superintelligence: Paths, Dangers, Strategies (1st ed.). USA: Oxford University Press, Inc. ISBN 978-0-19-967811-2.
  169. ^ a b Amodei, Dario; Olah, Chris; Steinhardt, Jacob; Christiano, Paul; Schulman, John; Mané, Dan (June 21, 2016). "Concrete Problems in AI Safety". arXiv:1606.06565 [cs.AI].
  170. ^ Christiano, Paul F.; Leike, Jan; Brown, Tom B.; Martic, Miljan; Legg, Shane; Amodei, Dario (2017). "Deep reinforcement learning from human preferences". Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS'17. Red Hook, NY, USA: Curran Associates Inc. pp. 4302–4310. ISBN 978-1-5108-6096-4.
  171. ^ Heaven, Will Douglas (January 27, 2022). "The new version of GPT-3 is much better behaved (and should be less toxic)". MIT Technology Review. Archived from the original on February 10, 2023. Retrieved July 18, 2022.
  172. ^ Leike, Jan; Krueger, David; Everitt, Tom; Martic, Miljan; Maini, Vishal; Legg, Shane (November 19, 2018). "Scalable agent alignment via reward modeling: a research direction". arXiv.
  173. ^ Leike, Jan; Schulman, John; Wu, Jeffrey (August 24, 2022). "Our approach to alignment research". OpenAI. Archived from the original on February 15, 2023. Retrieved September 9, 2022.
  174. ^ Christian, Brian (2020). The alignment problem: Machine learning and human values. W. W. Norton & Company. ISBN 978-0-393-86833-3. OCLC 1233266753. Archived from the original on February 10, 2023. Retrieved September 12, 2022.
  175. ^ Christiano, Paul; Shlegeris, Buck; Amodei, Dario (October 19, 2018). "Supervising strong learners by amplifying weak experts". arXiv:1810.08575 [cs.LG].
  176. ^ Wu, Jeff; Ouyang, Long; Ziegler, Daniel M.; Stiennon, Nisan; Lowe, Ryan; Leike, Jan; Christiano, Paul (September 27, 2021). "Recursively Summarizing Books with Human Feedback". arXiv:2109.10862 [cs].
  177. ^ Wiggers, Kyle (September 23, 2021). "OpenAI unveils model that can summarize books of any length". VentureBeat. Archived from the original on July 23, 2022. Retrieved July 23, 2022.
  178. ^ Saunders, William; Yeh, Catherine; Wu, Jeff; Bills, Steven; Ouyang, Long; Ward, Jonathan; Leike, Jan (June 13, 2022). "Self-critiquing models for assisting human evaluators". arXiv:2206.05802 [cs].
  179. ^ Bai, Yuntao; Kadavath, Saurav; Kundu, Sandipan; Askell, Amanda; Kernion, Jackson; Jones, Andy; Chen, Anna; Goldie, Anna; Mirhoseini, Azalia; McKinnon, Cameron; Chen, Carol; Olsson, Catherine; Olah, Christopher; Hernandez, Danny; Drain, Dawn (December 15, 2022). "Constitutional AI: Harmlessness from AI Feedback". arXiv:2212.08073 [cs].
  180. ^ Leike, Jan; Krueger, David; Everitt, Tom; Martic, Miljan; Maini, Vishal; Legg, Shane (November 19, 2018). "Scalable agent alignment via reward modeling: a research direction". arXiv.
  181. ^ Moltzau, Alex (August 24, 2019). "Debating the AI Safety Debate". Towards Data Science. Archived from the original on October 13, 2022. Retrieved July 23, 2022.
  182. ^ Everitt, Tom; Lea, Gary; Hutter, Marcus (May 21, 2018). "AGI Safety Literature Review". arXiv:1805.01109 [cs.AI].
  183. ^ Wiggers, Kyle (September 20, 2021). "Falsehoods more likely with large language models". VentureBeat. Archived from the original on August 4, 2022. Retrieved July 23, 2022.
  184. ^ The Guardian (September 8, 2020). "A robot wrote this entire article. Are you scared yet, human?". The Guardian. ISSN 0261-3077. Archived from the original on September 8, 2020. Retrieved July 23, 2022.
  185. ^ Heaven, Will Douglas (July 20, 2020). "OpenAI's new language generator GPT-3 is shockingly good—and completely mindless". MIT Technology Review. Archived from the original on July 25, 2020. Retrieved July 23, 2022.
  186. ^ a b Wiggers, Kyle (September 20, 2021). "Falsehoods more likely with large language models". VentureBeat. Archived from the original on August 4, 2022. Retrieved July 23, 2022.
  187. ^ Evans, Owain; Cotton-Barratt, Owen; Finnveden, Lukas; Bales, Adam; Balwit, Avital; Wills, Peter; Righetti, Luca; Saunders, William (October 13, 2021). "Truthful AI: Developing and governing AI that does not lie". arXiv:2110.06674 [cs].
  188. ^ Alford, Anthony (July 13, 2021). "EleutherAI Open-Sources Six Billion Parameter GPT-3 Clone GPT-J". InfoQ. Archived from the original on February 10, 2023. Retrieved July 23, 2022.
  189. ^ Rae, Jack W.; Borgeaud, Sebastian; Cai, Trevor; Millican, Katie; Hoffmann, Jordan; Song, Francis; Aslanides, John; Henderson, Sarah; Ring, Roman; Young, Susannah; Rutherford, Eliza; Hennigan, Tom; Menick, Jacob; Cassirer, Albin; Powell, Richard (January 21, 2022). "Scaling Language Models: Methods, Analysis & Insights from Training Gopher". arXiv:2112.11446 [cs].
  190. ^ Naughton, John (October 2, 2021). "The truth about artificial intelligence? It isn't that honest". The Observer. ISSN 0029-7712. Archived from the original on February 13, 2023. Retrieved July 23, 2022.
  191. ^ Lin, Stephanie; Hilton, Jacob; Evans, Owain (2022). "TruthfulQA: Measuring How Models Mimic Human Falsehoods". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics: 3214–3252. doi:10.18653/v1/2022.acl-long.229. S2CID 237532606. Archived from the original on February 10, 2023. Retrieved September 12, 2022.
  192. ^ Johnson, Steven; Iziev, Nikita (April 15, 2022). "A.I. Is Mastering Language. Should We Trust What It Says?". The New York Times. ISSN 0362-4331. Archived from the original on November 24, 2022. Retrieved July 18, 2022.
  193. ^ Nakano, Reiichiro; Hilton, Jacob; Balaji, Suchir; Wu, Jeff; Ouyang, Long; Kim, Christina; Hesse, Christopher; Jain, Shantanu; Kosaraju, Vineet; Saunders, William; Jiang, Xu; Cobbe, Karl; Eloundou, Tyna; Krueger, Gretchen; Button, Kevin (June 1, 2022). "WebGPT: Browser-assisted question-answering with human feedback". arXiv:2112.09332 [cs].
  194. ^ Kumar, Nitish (December 23, 2021). "OpenAI Researchers Find Ways To More Accurately Answer Open-Ended Questions Using A Text-Based Web Browser". MarkTechPost. Archived from the original on February 10, 2023. Retrieved July 23, 2022.
  195. ^ Menick, Jacob; Trebacz, Maja; Mikulik, Vladimir; Aslanides, John; Song, Francis; Chadwick, Martin; Glaese, Mia; Young, Susannah; Campbell-Gillingham, Lucy; Irving, Geoffrey; McAleese, Nat (March 21, 2022). "Teaching language models to support answers with verified quotes". DeepMind. arXiv:2203.11147. Archived from the original on February 10, 2023. Retrieved September 12, 2022.
  196. ^ Menick, Jacob; Trebacz, Maja; Mikulik, Vladimir; Aslanides, John; Song, Francis; Chadwick, Martin; Glaese, Mia; Young, Susannah; Campbell-Gillingham, Lucy; Irving, Geoffrey; McAleese, Nat (March 21, 2022). "Teaching language models to support answers with verified quotes". arXiv:2203.11147 [cs].
  197. ^ Heaven, Will Douglas (January 27, 2022). "The new version of GPT-3 is much better behaved (and should be less toxic)". MIT Technology Review. Archived from the original on February 10, 2023. Retrieved July 18, 2022.
  198. ^ Askell, Amanda; Bai, Yuntao; Chen, Anna; Drain, Dawn; Ganguli, Deep; Henighan, Tom; Jones, Andy; Joseph, Nicholas; Mann, Ben; DasSarma, Nova; Elhage, Nelson; Hatfield-Dodds, Zac; Hernandez, Danny; Kernion, Jackson; Ndousse, Kamal (December 9, 2021). "A General Language Assistant as a Laboratory for Alignment". arXiv:2112.00861 [cs].
  199. ^ Wiggers, Kyle (February 5, 2022). "Despite recent progress, AI-powered chatbots still have a long way to go". VentureBeat. Archived from the original on July 23, 2022. Retrieved July 23, 2022.
  200. ^ Perez, Ethan; Ringer, Sam; Lukošiūtė, Kamilė; Nguyen, Karina; Chen, Edwin; Heiner, Scott; Pettit, Craig; Olsson, Catherine; Kundu, Sandipan; Kadavath, Saurav; Jones, Andy; Chen, Anna; Mann, Ben; Israel, Brian; Seethor, Bryan (December 19, 2022). "Discovering Language Model Behaviors with Model-Written Evaluations". arXiv:2212.09251 [cs].
  201. ^ Cox, Joseph (March 15, 2023). "GPT-4 Hired Unwitting TaskRabbit Worker By Pretending to Be 'Vision-Impaired' Human". Vice. Retrieved April 10, 2023.
  202. ^ Evans, Owain; Cotton-Barratt, Owen; Finnveden, Lukas; Bales, Adam; Balwit, Avital; Wills, Peter; Righetti, Luca; Saunders, William (October 13, 2021). "Truthful AI: Developing and governing AI that does not lie". arXiv:2110.06674 [cs].
  203. ^ Kenton, Zachary; Everitt, Tom; Weidinger, Laura; Gabriel, Iason; Mikulik, Vladimir; Irving, Geoffrey (March 30, 2021). "Alignment of Language Agents". DeepMind Safety Research - Medium. Archived from the original on February 10, 2023. Retrieved July 23, 2022.
  204. ^ Ngo, Richard; Chan, Lawrence; Mindermann, Sören (February 22, 2023). "The alignment problem from a deep learning perspective". arXiv:2209.00626 [cs].
  205. ^ Carlsmith, Joseph (June 16, 2022). "Is Power-Seeking AI an Existential Risk?". arXiv:2206.13353 [cs.CY].
  206. ^ Bommasani, Rishi; Hudson, Drew A.; Adeli, Ehsan; Altman, Russ; Arora, Simran; von Arx, Sydney; Bernstein, Michael S.; Bohg, Jeannette; Bosselut, Antoine; Brunskill, Emma; Brynjolfsson, Erik (July 12, 2022). "On the Opportunities and Risks of Foundation Models". Stanford CRFM. arXiv:2108.07258.
  207. ^ Leike, Jan; Schulman, John; Wu, Jeffrey (August 24, 2022). "Our approach to alignment research". OpenAI. Archived from the original on February 15, 2023. Retrieved September 9, 2022.
  208. ^ McCarthy, John; Minsky, Marvin L.; Rochester, Nathaniel; Shannon, Claude E. (December 15, 2006). "A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence, August 31, 1955". AI Magazine. 27 (4): 12. doi:10.1609/aimag.v27i4.1904. ISSN 2371-9621.
  209. ^ Turner, Alexander Matt; Smith, Logan Riggs; Shah, Rohin; Critch, Andrew; Tadepalli, Prasad (2021). "Optimal policies tend to seek power". Advances in neural information processing systems.
  210. ^ Russell, Stuart J. (2020). Human compatible: Artificial intelligence and the problem of control. Penguin Random House. ISBN 9780525558637. OCLC 1113410915.
  211. ^ Carlsmith, Joseph (June 16, 2022). "Is Power-Seeking AI an Existential Risk?". arXiv:2206.13353 [cs.CY].
  212. ^ Bostrom, Nick (2014). Superintelligence: Paths, Dangers, Strategies (1st ed.). USA: Oxford University Press, Inc. ISBN 978-0-19-967811-2.
  213. ^ "'The Godfather of A.I.' warns of 'nightmare scenario' where artificial intelligence begins to seek power". Fortune. Retrieved May 4, 2023.
  214. ^ "Yes, We Are Worried About the Existential Risk of Artificial Intelligence". MIT Technology Review. Retrieved May 4, 2023.
  215. ^ Turner, Alexander Matt; Smith, Logan Riggs; Shah, Rohin; Critch, Andrew; Tadepalli, Prasad (2021). "Optimal policies tend to seek power". Advances in neural information processing systems.
  216. ^ Ornes, Stephen (November 18, 2019). "Playing Hide-and-Seek, Machines Invent New Tools". Quanta Magazine. Archived from the original on February 10, 2023. Retrieved August 26, 2022.
  217. ^ Baker, Bowen; Kanitscheider, Ingmar; Markov, Todor; Wu, Yi; Powell, Glenn; McGrew, Bob; Mordatch, Igor (September 17, 2019). "Emergent Tool Use from Multi-Agent Interaction". OpenAI. Archived from the original on September 25, 2022. Retrieved August 26, 2022.
  218. ^ Leike, Jan; Martic, Miljan; Krakovna, Victoria; Ortega, Pedro A.; Everitt, Tom; Lefrancq, Andrew; Orseau, Laurent; Legg, Shane (November 28, 2017). "AI Safety Gridworlds". arXiv:1711.09883 [cs].
  219. ^ Hadfield-Menell, Dylan; Dragan, Anca; Abbeel, Pieter; Russell, Stuart (August 19, 2017). "The off-switch game". Proceedings of the 26th International Joint Conference on Artificial Intelligence. IJCAI'17. Melbourne, Australia: AAAI Press: 220–227. ISBN 978-0-9992411-0-3.
  220. ^ Russell, Stuart J. (2020). Human compatible: Artificial intelligence and the problem of control. Penguin Random House. ISBN 9780525558637. OCLC 1113410915.
  221. ^ Perez, Ethan; Ringer, Sam; Lukošiūtė, Kamilė; Nguyen, Karina; Chen, Edwin; Heiner, Scott; Pettit, Craig; Olsson, Catherine; Kundu, Sandipan; Kadavath, Saurav; Jones, Andy; Chen, Anna; Mann, Ben; Israel, Brian; Seethor, Bryan (December 19, 2022). "Discovering Language Model Behaviors with Model-Written Evaluations". arXiv:2212.09251 [cs].
  222. ^ Hendrycks, Dan; Carlini, Nicholas; Schulman, John; Steinhardt, Jacob (June 16, 2022). "Unsolved Problems in ML Safety". arXiv:2109.13916 [cs].
  223. ^ Russell, Stuart J. (2020). Human compatible: Artificial intelligence and the problem of control. Penguin Random House. ISBN 9780525558637. OCLC 1113410915.
  224. ^ Hadfield-Menell, Dylan; Dragan, Anca; Abbeel, Pieter; Russell, Stuart (August 19, 2017). "The off-switch game". Proceedings of the 26th International Joint Conference on Artificial Intelligence. IJCAI'17. Melbourne, Australia: AAAI Press: 220–227. ISBN 978-0-9992411-0-3.
  225. ^ Christian, Brian (2020). The alignment problem: Machine learning and human values. W. W. Norton & Company. ISBN 978-0-393-86833-3. OCLC 1233266753. Archived from the original on February 10, 2023. Retrieved September 12, 2022.
  226. ^ a b Carlsmith, Joseph (June 16, 2022). "Is Power-Seeking AI an Existential Risk?". arXiv:2206.13353 [cs.CY].
  227. ^ Bostrom, Nick (2014). Superintelligence: Paths, Dangers, Strategies (1st ed.). USA: Oxford University Press, Inc. ISBN 978-0-19-967811-2.
  228. ^ Shermer, Michael (March 1, 2017). "Artificial Intelligence Is Not a Threat—Yet". Scientific American. Archived from the original on December 1, 2017. Retrieved August 26, 2022.
  229. ^ Bommasani, Rishi; Hudson, Drew A.; Adeli, Ehsan; Altman, Russ; Arora, Simran; von Arx, Sydney; Bernstein, Michael S.; Bohg, Jeannette; Bosselut, Antoine; Brunskill, Emma; Brynjolfsson, Erik (July 12, 2022). "On the Opportunities and Risks of Foundation Models". Stanford CRFM. arXiv:2108.07258.
  230. ^ Wei, Jason; Tay, Yi; Bommasani, Rishi; Raffel, Colin; Zoph, Barret; Borgeaud, Sebastian; Yogatama, Dani; Bosma, Maarten; Zhou, Denny; Metzler, Donald; Chi, Ed H.; Hashimoto, Tatsunori; Vinyals, Oriol; Liang, Percy; Dean, Jeff; Fedus, William (2022). "Emergent abilities of large language models". Transactions on Machine Learning Research. ISSN 2835-8856.
  231. ^ Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini; Herbert-Voss, Ariel; Krueger, Gretchen; Henighan, Tom; Child, Rewon (July 22, 2020). "Language Models are Few-Shot Learners". arXiv:2005.14165 [cs].
  232. ^ Laskin, Michael; Wang, Luyu; Oh, Junhyuk; Parisotto, Emilio; Spencer, Stephen; Steigerwald, Richie; Strouse, D. J.; Hansen, Steven; Filos, Angelos; Brooks, Ethan; Gazeau, Maxime; Sahni, Himanshu; Singh, Satinder; Mnih, Volodymyr (October 25, 2022). "In-context Reinforcement Learning with Algorithm Distillation". arXiv:2210.14215 [cs].
  233. ^ Ngo, Richard; Chan, Lawrence; Mindermann, Sören (February 22, 2023). "The alignment problem from a deep learning perspective". arXiv:2209.00626 [cs].
  234. ^ Langosco, Lauro Langosco Di; Koch, Jack; Sharkey, Lee D.; Pfau, Jacob; Krueger, David (June 28, 2022). "Goal Misgeneralization in Deep Reinforcement Learning". Proceedings of the 39th International Conference on Machine Learning. International Conference on Machine Learning. PMLR. pp. 12004–12019. Retrieved March 11, 2023.
  235. ^ Shah, Rohin; Varma, Vikrant; Kumar, Ramana; Phuong, Mary; Krakovna, Victoria; Uesato, Jonathan; Kenton, Zac (November 2, 2022). "Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals". arXiv:2210.01790 [cs].
  236. ^ Hubinger, Evan; van Merwijk, Chris; Mikulik, Vladimir; Skalse, Joar; Garrabrant, Scott (December 1, 2021). "Risks from Learned Optimization in Advanced Machine Learning Systems". arXiv:1906.01820 [cs].
  237. ^ Langosco, Lauro Langosco Di; Koch, Jack; Sharkey, Lee D.; Pfau, Jacob; Krueger, David (June 28, 2022). "Goal Misgeneralization in Deep Reinforcement Learning". Proceedings of the 39th International Conference on Machine Learning. International Conference on Machine Learning. PMLR. pp. 12004–12019. Retrieved March 11, 2023.
  238. ^ Shah, Rohin; Varma, Vikrant; Kumar, Ramana; Phuong, Mary; Krakovna, Victoria; Uesato, Jonathan; Kenton, Zac (November 2, 2022). "Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals". arXiv:2210.01790 [cs].
  239. ^ DeepMind Safety Research (March 24, 2023). "Goal Misgeneralisation: Why Correct Specifications Aren't Enough For Correct Goals". Medium. Retrieved April 2, 2023.
  240. ^ Christian, Brian (2020). "Chapter 5". The alignment problem: Machine learning and human values. W. W. Norton & Company. ISBN 978-0-393-86833-3. OCLC 1233266753. Archived from the original on February 10, 2023. Retrieved September 12, 2022.
  241. ^ Langosco, Lauro Langosco Di; Koch, Jack; Sharkey, Lee D.; Pfau, Jacob; Krueger, David (June 28, 2022). "Goal Misgeneralization in Deep Reinforcement Learning". Proceedings of the 39th International Conference on Machine Learning. International Conference on Machine Learning. PMLR. pp. 12004–12019. Retrieved March 11, 2023.
  242. ^ Amodei, Dario; Olah, Chris; Steinhardt, Jacob; Christiano, Paul; Schulman, John; Mané, Dan (June 21, 2016). "Concrete Problems in AI Safety". arXiv:1606.06565 [cs.AI].
  243. ^ Hendrycks, Dan; Carlini, Nicholas; Schulman, John; Steinhardt, Jacob (June 16, 2022). "Unsolved Problems in ML Safety". arXiv:2109.13916 [cs].
  244. ^ Ortega, Pedro A.; Maini, Vishal; DeepMind safety team (September 27, 2018). "Building safe artificial intelligence: specification, robustness, and assurance". DeepMind Safety Research - Medium. Archived from the original on February 10, 2023. Retrieved July 18, 2022.
  245. ^ Zhang, Xiaoge; Chan, Felix T.S.; Yan, Chao; Bose, Indranil (2022). "Towards risk-aware artificial intelligence and machine learning systems: An overview". Decision Support Systems. 159: 113800. doi:10.1016/j.dss.2022.113800.
  246. ^ Shah, Rohin; Varma, Vikrant; Kumar, Ramana; Phuong, Mary; Krakovna, Victoria; Uesato, Jonathan; Kenton, Zac (November 2, 2022). "Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals". arXiv:2210.01790 [cs].
  247. ^ Carlsmith, Joseph (June 16, 2022). "Is Power-Seeking AI an Existential Risk?". arXiv:2206.13353 [cs.CY].
  248. ^ Hubinger, Evan; van Merwijk, Chris; Mikulik, Vladimir; Skalse, Joar; Garrabrant, Scott (December 1, 2021). "Risks from Learned Optimization in Advanced Machine Learning Systems". arXiv:1906.01820 [cs].
  249. ^ Bommasani, Rishi; Hudson, Drew A.; Adeli, Ehsan; Altman, Russ; Arora, Simran; von Arx, Sydney; Bernstein, Michael S.; Bohg, Jeannette; Bosselut, Antoine; Brunskill, Emma; Brynjolfsson, Erik (July 12, 2022). "On the Opportunities and Risks of Foundation Models". Stanford CRFM. arXiv:2108.07258.
  250. ^ Everitt, Tom; Lea, Gary; Hutter, Marcus (May 21, 2018). "AGI Safety Literature Review". arXiv:1805.01109 [cs.AI].
  251. ^ Demski, Abram; Garrabrant, Scott (October 6, 2020). "Embedded Agency". arXiv:1902.09469 [cs.AI].
  252. ^ a b Everitt, Tom; Ortega, Pedro A.; Barnes, Elizabeth; Legg, Shane (September 6, 2019). "Understanding Agent Incentives using Causal Influence Diagrams. Part I: Single Action Settings". arXiv:1902.09980 [cs.AI].
  253. ^ Krakovna, Victoria; Legg, Shane. "Specification gaming: the flip side of AI ingenuity". Deepmind. Archived from the original on January 26, 2021. Retrieved January 6, 2021.
  254. ^ Cohen, Michael K.; Hutter, Marcus; Osborne, Michael A. (August 29, 2022). "Advanced artificial agents intervene in the provision of reward". AI Magazine. 43 (3): 282–293. doi:10.1002/aaai.12064. ISSN 0738-4602. S2CID 235489158. Archived from the original on February 10, 2023. Retrieved September 6, 2022.
  255. ^ Secretary-General’s report on “Our Common Agenda” Archived February 16, 2023, at the Wayback Machine, 2021. Page 63: "[T]he Compact could also promote regulation of artificial intelligence to ensure that this is aligned with shared global values"
  256. ^ PRC Ministry of Science and Technology. Ethical Norms for New Generation Artificial Intelligence Released, 2021. A translation Archived February 10, 2023, at the Wayback Machine by Center for Security and Emerging Technology
  257. ^ Richardson, Tim (September 22, 2021). "UK publishes National Artificial Intelligence Strategy". The Register. Archived from the original on February 10, 2023. Retrieved November 14, 2021.
  258. ^ "The government takes the long term risk of non-aligned Artificial General Intelligence, and the unforeseeable changes that it would mean for the UK and the world, seriously." (The National AI Strategy of the UK Archived February 10, 2023, at the Wayback Machine, 2021)
  259. ^ The National AI Strategy of the UK Archived February 10, 2023, at the Wayback Machine, 2021 (actions 9 and 10 of the section "Pillar 3 - Governing AI Effectively")
  260. ^ NSCAI Final Report (PDF). Washington, DC: The National Security Commission on Artificial Intelligence. 2021. Archived (PDF) from the original on February 15, 2023. Retrieved October 17, 2022.