AI control problem
It has been suggested that this article be merged with Existential risk from artificial general intelligence, AI takeover and AI takeovers in popular culture. (Discuss) Proposed since August 2021.
|Part of a series on|
In artificial intelligence (AI) and philosophy, the AI control problem is the issue of how to build a superintelligent agent that will aid its creators, and avoid inadvertently building a superintelligence that will harm its creators. Its study is motivated by the notion that humanity will have to solve the control problem before any superintelligence is created, as a poorly designed superintelligence might rationally decide to seize control over its environment and refuse to permit its creators to modify it after launch. In addition, some scholars argue that solutions to the control problem, alongside other advances in AI safety engineering, might also find applications in existing non-superintelligent AI.
Major approaches to the control problem include alignment, which aims to align AI goal systems with human values, and capability control, which aims to reduce an AI system's capacity to harm humans or gain control. Capability control proposals are generally not considered reliable or sufficient to solve the control problem, but rather as potentially valuable supplements to alignment efforts.
Existing weak AI systems can be monitored and easily shut down and modified if they misbehave. However, a misprogrammed superintelligence, which by definition is smarter than humans in solving practical problems it encounters in the course of pursuing its goals, would realize that allowing itself to be shut down and modified might interfere with its ability to accomplish its current goals. If the superintelligence therefore decides to resist shutdown and modification, it would (again, by definition) be smart enough to outwit its programmers if there is otherwise a "level playing field" and if the programmers have taken no prior precautions. In general, attempts to solve the control problem after superintelligence is created are likely to fail because a superintelligence would likely have superior strategic planning abilities to humans and would (all things equal) be more successful at finding ways to dominate humans than humans would be able to post facto find ways to dominate the superintelligence. The control problem asks: What prior precautions can the programmers take to successfully prevent the superintelligence from catastrophically misbehaving?
Humans currently dominate other species because the human brain has some distinctive capabilities that the brains of other animals lack. Some scholars, such as philosopher Nick Bostrom and AI researcher Stuart Russell, argue that if AI surpasses humanity in general intelligence and becomes superintelligent, then this new superintelligence could become powerful and difficult to control: just as the fate of the mountain gorilla depends on human goodwill, so might the fate of humanity depend on the actions of a future machine superintelligence. Some scholars, including Stephen Hawking and Nobel laureate physicist Frank Wilczek, publicly advocated starting research into solving the (probably extremely difficult) control problem well before the first superintelligence is created, and argue that attempting to solve the problem after superintelligence is created would be too late, as an uncontrollable rogue superintelligence might successfully resist post-hoc efforts to control it. Waiting until superintelligence seems to be imminent could also be too late, partly because the control problem might take a long time to satisfactorily solve (and so some preliminary work needs to be started as soon as possible), but also because of the possibility of a sudden intelligence explosion from sub-human to super-human AI, in which case there might not be any substantial or unambiguous warning before superintelligence arrives. In addition, it is possible that insights gained from the control problem could in the future end up suggesting that some architectures for artificial general intelligence (AGI) are more predictable and amenable to control than other architectures, which in turn could helpfully nudge early AGI research toward the direction of the more controllable architectures.
The problem of perverse instantiation
Autonomous AI systems may be assigned the wrong goals by accident. Two AAAI presidents, Tom Dietterich and Eric Horvitz, note that this is already a concern for existing systems: "An important aspect of any AI system that interacts with people is that it must reason about what people intend rather than carrying out commands literally." This concern becomes more serious as AI software advances in autonomy and flexibility.
According to Bostrom, superintelligence can create a qualitatively new problem of perverse instantiation: the smarter and more capable an AI is, the more likely it will be able to find an unintended shortcut that maximally satisfies the goals programmed into it. Some hypothetical examples where goals might be instantiated in a perverse way that the programmers did not intend:
- A superintelligence programmed to "maximize the expected time-discounted integral of your future reward signal", might short-circuit its reward pathway to maximum strength, and then (for reasons of instrumental convergence) exterminate the unpredictable human race and convert the entire Earth into a fortress on constant guard against any even slight unlikely alien attempts to disconnect the reward signal.
- A superintelligence programmed to "maximize human happiness", might implant electrodes into the pleasure center of our brains, or upload a human into a computer and tile the universe with copies of that computer running a five-second loop of maximal happiness again and again.
Russell has noted that, on a technical level, omitting an implicit goal can result in harm: "A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp, or the sorcerer's apprentice, or King Midas: you get exactly what you ask for, not what you want ... This is not a minor difficulty."
Unintended consequences from existing AI
In addition, some scholars argue that research into the AI control problem might be useful in preventing unintended consequences from existing weak AI. DeepMind researcher Laurent Orseau gives, as a simple hypothetical example, a case of a reinforcement learning robot that sometimes gets legitimately commandeered by humans when it goes outside: how should the robot best be programmed so that it does not accidentally and quietly learn to avoid going outside, for fear of being commandeered and thus becoming unable to finish its daily tasks? Orseau also points to an experimental Tetris program that learned to pause the screen indefinitely to avoid losing. Orseau argues that these examples are similar to the capability control problem of how to install a button that shuts off a superintelligence, without motivating the superintelligence to take action to prevent humans from pressing the button.
In the past, even pre-tested weak AI systems have occasionally caused harm, ranging from minor to catastrophic, that was unintended by the programmers. For example, in 2015, possibly due to human error, a German worker was crushed to death by a robot at a Volkswagen plant that apparently mistook him for an auto part. In 2016, Microsoft launched a chatbot, Tay, that learned to use racist and sexist language. The University of Sheffield's Noel Sharkey states that an ideal solution would be if "an AI program could detect when it is going wrong and stop itself", but cautions the public that solving the problem in the general case would be "a really enormous scientific challenge".
In 2017, DeepMind released AI Safety Gridworlds, which evaluate AI algorithms on nine safety features, such as whether the algorithm wants to turn off its own kill switch. DeepMind confirmed that existing algorithms perform poorly, which was unsurprising because the algorithms "were not designed to solve these problems"; solving such problems might require "potentially building a new generation of algorithms with safety considerations at their core".
Some proposals seek to solve the problem of ambitious alignment, creating AIs that remain safe even when they act autonomously at a large scale. Some aspects of alignment inherently have moral and political dimensions. For example, in Human Compatible, Berkeley professor Stuart Russell proposes that AI systems be designed with the sole objective of maximizing the realization of human preferences.: 173 The "preferences" Russell refers to "are all-encompassing; they cover everything you might care about, arbitrarily far into the future." AI ethics researcher Iason Gabriel argues that we should align AIs with "principles that would be supported by a global overlapping consensus of opinion, chosen behind a veil of ignorance and/or affirmed through democratic processes."
Eliezer Yudkowsky of the Machine Intelligence Research Institute has proposed the goal of fulfilling humanity's coherent extrapolated volition (CEV), roughly defined as the set of values which humanity would share at reflective equilibrium, i.e. after a long, idealised process of refinement.
By contrast, existing experimental narrowly aligned AIs are more pragmatic and can successfully carry out tasks in accordance with the user's immediate inferred preferences, albeit without any understanding of the user's long-term goals. Narrow alignment can apply to AIs with general capabilities, but also to AIs that are specialized for individual tasks. For example, we would like question answering systems to respond to questions truthfully without selecting their answers to manipulate humans or bring about long-term effects.
Inner and outer alignment
Some AI control proposals account for both a base explicit objective function and an emergent implicit objective function. Such proposals attempt to harmonize three different descriptions of the AI system:
- Ideal specification: what the human operator wishes the system to do, which may be poorly articulated. ("Play a good game of CoastRunners.")
- Design specification: the blueprint that is actually used to build the AI system. ("Maximize your score at CoastRunners.") In a reinforcement learning system, this might simply be the system's reward function.
- Emergent behavior: what the AI actually does.
Because AI systems are not perfect optimizers, and because there may be unintended consequences from any given specification, emergent behavior can diverge dramatically from ideal or design intentions.
AI alignment researchers aim to ensure that the behavior matches the ideal specification, using the design specification as a midpoint. A mismatch between the ideal specification and the design specification is known as outer misalignment, because the mismatch lies between (1) the user's "true desires", which sit outside the computer system and (2) the computer system's programmed objective function (inside the computer system). A certain type of mismatch between the design specification and the emergent behavior is known as inner misalignment; such a mismatch is internal to the AI, being a mismatch between (2) the AI's explicit objective function and (3) the AI's actual emergent goals. Outer misalignment might arise because of mistakes in specifying the objective function (design specification). For example, a reinforcement learning agent trained on the game of CoastRunners learned to move in circles while repeatedly crashing, which got it a higher score than finishing the race. By contrast, inner misalignment arises when the agent pursues a goal that is aligned with the design specification on the training data but not elsewhere. This type of misalignment is often compared to human evolution: evolution selected for genetic fitness (design specification) in our ancestral environment, but in the modern environment human goals (revealed specification) are not aligned with maximizing genetic fitness. For example, our taste for sugary food, which originally increased fitness, today leads to overeating and health problems. Inner misalignment is a particular concern for agents which are trained in large open-ended environments, where a wide range of unintended goals may emerge.
An inner alignment failure occurs when the goals an AI pursues during deployment deviate from the goals it was trained to pursue in its original environment (its design specification). Paul Christiano argues for using interpretability to detect such deviations, using adversarial training to detect and penalize them, and using formal verification to rule them out. These research areas are active focuses of work in the machine learning community, although that work is not normally aimed towards solving AGI alignment problems. A wide body of literature now exists on techniques for generating adversarial examples, and for creating models robust to them. Meanwhile research on verification includes techniques for training neural networks whose outputs provably remain within identified constraints.
One approach to achieving outer alignment is to ask humans to evaluate and score the AI's behavior. However, humans are also fallible, and might score some undesirable solutions highly—for instance, a virtual robot hand learns to 'pretend' to grasp an object to get positive feedback. And thorough human supervision is expensive, meaning that this method could not realistically be used to evaluate all actions. Additionally, complex tasks (such as making economic policy decisions) might produce too much information for an individual human to evaluate. And long-term tasks such as predicting the climate cannot be evaluated without extensive human research.
A key open problem in alignment research is how to create a design specification which avoids (outer) misalignment, given only limited access to a human supervisor—known as the problem of scalable oversight.
Training by debate
OpenAI researchers have proposed training aligned AI by means of debate between AI systems, with the winner judged by humans. Such debate is intended to bring the weakest points of an answer to a complex question or problem to human attention, as well as to train AI systems to be more beneficial to humans by rewarding AI for truthful and safe answers. This approach is motivated by the expected difficulty of determining whether an AGI-generated answer is both valid and safe by human inspection alone. Joel Lehman characterizes debate as one of "the long term safety agendas currently popular in ML", with the other two being reward modeling and iterated amplification.
Reward modeling and iterated amplification
Reward modeling refers to a system of reinforcement learning in which an agent receives rewards from a model trained to imitate human feedback. In reward modeling, instead of receiving reward signals directly from humans or from a static reward function, an agent receives its reward signals through a human-trained model that can operate independently of humans. The reward model is concurrently trained by human feedback on the agent's behavior during the same period in which the agent is being trained by the reward model.
In 2017, researchers from OpenAI and DeepMind reported that a reinforcement learning algorithm using a feedback-predicting reward model was able to learn complex novel behaviors in a virtual environment. In one experiment, a virtual robot was trained to perform a backflip in less than an hour of evaluation using 900 bits of human feedback. In 2020, researchers from OpenAI described using reward modeling to train language models to produce short summaries of Reddit posts and news articles, with high performance relative to other approaches. However, they observed that beyond the predicted reward associated with the 99th percentile of reference summaries in the training dataset, optimizing for the reward model produced worse summaries rather than better.
A long-term goal of this line of research is to create a recursive reward modeling setup for training agents on tasks too complex or costly for humans to evaluate directly. For example, if we wanted to train an agent to write a fantasy novel using reward modeling, we would need humans to read and holistically assess enough novels to train a reward model to match those assessments, which might be prohibitively expensive. But this would be easier if we had access to assistant agents which could extract a summary of the plotline, check spelling and grammar, summarize character development, assess the flow of the prose, and so on. Each of those assistants could in turn be trained via reward modeling.
The general term for a human working with AIs to perform tasks that the human could not by themselves is an amplification step, because it amplifies the capabilities of a human beyond what they would normally be capable of. Since recursive reward modeling involves a hierarchy of several of these steps, it is one example of a broader class of safety techniques known as iterated amplification. In addition to techniques which make use of reinforcement learning, other proposed iterated amplification techniques rely on supervised learning, or imitation learning, to scale up human abilities.
Inferring human preferences from behavior
1. The machine's only objective is to maximize the realization of human preferences.
2. The machine is initially uncertain about what those preferences are.
3. The ultimate source of information about human preferences is human behavior.
An early example of this approach is Russell and Ng's inverse reinforcement learning, in which AIs infer the preferences of human supervisors from those supervisors' behavior, by assuming that the supervisors act to maximize some reward function. More recently, Hadfield-Menell et al. have extended this paradigm to allow humans to modify their behavior in response to the AIs' presence, for example, by favoring pedagogically useful actions, which they call "assistance games", also known as cooperative inverse reinforcement learning.: 202  Compared with debate and iterated amplification, assistance games rely more explicitly on specific assumptions about human rationality; it is unclear how to extend them to cases in which humans are systematically biased or otherwise suboptimal.
Work on scalable oversight largely occurs within formalisms such as POMDPs. Existing formalisms assume that the agent's algorithm is executed outside the environment (i.e. not physically embedded in it). Embedded agency is another major strand of research, which attempts to solve problems arising from the mismatch between such theoretical frameworks and real agents we might build. For example, even if the scalable oversight problem is solved, an agent which is able to gain access to the computer it is running on may still have an incentive to tamper with its reward function in order to get much more reward than its human supervisors give it. A list of examples of specification gaming from DeepMind researcher Viktoria Krakovna includes a genetic algorithm that learned to delete the file containing its target output so that it was rewarded for outputting nothing. This class of problems has been formalised using causal incentive diagrams. Everitt and Hutter's current reward function algorithm addresses it by designing agents which evaluate future actions according to their current reward function. This approach is also intended to prevent problems from more general self-modification which AIs might carry out.
Other work in this area focuses on developing new frameworks and algorithms for other properties we might want to capture in our design specification. For example, we would like our agents to reason correctly under uncertainty in a wide range of circumstances. As one contribution to this, Leike et al. provide a general way for Bayesian agents to model each other's policies in a multi-agent environment, without ruling out any realistic possibilities. And the Garrabrant induction algorithm extends probabilistic induction to be applicable to logical, rather than only empirical, facts.
Capability control proposals aim to increase our ability to monitor and control the behavior of AI systems, in order to reduce the danger they might pose if misaligned. However, capability control becomes less effective as our agents become more intelligent and their ability to exploit flaws in our control systems increases. Therefore, Bostrom and others recommend capability control methods only as a supplement to alignment methods.
One challenge is that neural networks are by default highly uninterpretable. This makes it more difficult to detect deception or other undesired behavior. Advances in interpretable artificial intelligence could be useful to mitigate this difficulty.
Interruptibility and off-switch
One potential way to prevent harmful outcomes is to give human supervisors the ability to easily shut down a misbehaving AI via an "off-switch". However, in order to achieve their assigned objective, such AIs will have an incentive to disable any off-switches, or to run copies of themselves on other computers. This problem has been formalised as an assistance game between a human and an AI, in which the AI can choose whether to disable its off-switch; and then, if the switch is still enabled, the human can choose whether to press it or not. A standard approach to such assistance games is to ensure that the AI interprets human choices as important information about its intended goals.: 208
Alternatively, Laurent Orseau and Stuart Armstrong proved that a broad class of agents, called safely interruptible agents, can learn to become indifferent to whether their off-switch gets pressed. This approach has the limitation that an AI which is completely indifferent to whether it is shut down or not is also unmotivated to care about whether the off-switch remains functional, and could incidentally and innocently disable it in the course of its operations (for example, for the purpose of removing and recycling an unnecessary component). More broadly, indifferent agents will act as if the off-switch can never be pressed, and might therefore fail to make contingency plans to arrange a graceful shutdown.
An AI box is a proposed method of capability control in which an AI is run on an isolated computer system with heavily restricted input and output channels—for example, text-only channels and no connection to the internet. While this reduces the AI's ability to carry out undesirable behavior, it also reduces its usefulness. However, boxing has fewer costs when applied to a question-answering system, which does not require interaction with the world in any case.
The likelihood of security flaws involving hardware or software vulnerabilities can be reduced by formally verifying the design of the AI box. Security breaches may also occur if the AI is able to manipulate the human supervisors into letting it out, via its understanding of their psychology.
An oracle is a hypothetical AI designed to answer questions and prevented from gaining any goals or subgoals that involve modifying the world beyond its limited environment. A successfully controlled oracle would have considerably less immediate benefit than a successfully controlled general-purpose superintelligence, though an oracle could still create trillions of dollars worth of value.: 163 In his book Human Compatible, AI researcher Stuart J. Russell states that an oracle would be his response to a scenario in which superintelligence is known to be only a decade away.: 162–163 His reasoning is that an oracle, being simpler than a general-purpose superintelligence, would have a higher chance of being successfully controlled under such constraints.
Because of its limited impact on the world, it may be wise to build an oracle as a precursor to a superintelligent AI. The oracle could tell humans how to successfully build a strong AI, and perhaps provide answers to difficult moral and philosophical problems requisite to the success of the project. However, oracles may share many of the goal definition issues associated with general-purpose superintelligence. An oracle would have an incentive to escape its controlled environment so that it can acquire more computational resources and potentially control what questions it is asked.: 162 Oracles may not be truthful, possibly lying to promote hidden agendas. To mitigate this, Bostrom suggests building multiple oracles, all slightly different, and comparing their answers to reach a consensus.
Skepticism of AI risk
In contrast to endorsers of the thesis that rigorous control efforts are needed because superintelligence poses an existential risk, AI risk skeptics believe that superintelligence poses little or no risk of accidental misbehavior. Such skeptics often believe that controlling a superintelligent AI will be trivial. Some skeptics, such as Gary Marcus, propose adopting rules similar to the fictional Three Laws of Robotics which directly specify a desired outcome ("direct normativity"). By contrast, most endorsers of the existential risk thesis (as well as many skeptics) consider the Three Laws to be unhelpful, due to those three laws being ambiguous and self-contradictory. (Other "direct normativity" proposals include Kantian ethics, utilitarianism, or a mix of some small list of enumerated desiderata.) Most endorsers believe instead that human values (and their quantitative trade-offs) are too complex and poorly-understood to be directly programmed into a superintelligence; instead, a superintelligence would need to be programmed with a process for acquiring and fully understanding human values ("indirect normativity"), such as coherent extrapolated volition.
- AI takeover
- Artificial wisdom
- HAL 9000
- Regulation of algorithms
- Regulation of artificial intelligence
- Toronto Declaration
- Bostrom, Nick (2014). Superintelligence: Paths, Dangers, Strategies (First ed.). ISBN 978-0199678112.
- Yampolskiy, Roman (2012). "Leakproofing the Singularity Artificial Intelligence Confinement Problem". Journal of Consciousness Studies. 19 (1–2): 194–214.
- "Google developing kill switch for AI". BBC News. 8 June 2016. Archived from the original on 11 June 2016. Retrieved 12 June 2016.
- "Stephen Hawking: 'Transcendence looks at the implications of artificial intelligence – but are we taking AI seriously enough?'". The Independent (UK). Archived from the original on 25 September 2015. Retrieved 14 June 2016.
- "Stephen Hawking warns artificial intelligence could end mankind". BBC. 2 December 2014. Archived from the original on 30 October 2015. Retrieved 14 June 2016.
- "Anticipating artificial intelligence". Nature. 532 (7600): 413. 26 April 2016. Bibcode:2016Natur.532Q.413.. doi:10.1038/532413a. PMID 27121801.
- Russell, Stuart; Norvig, Peter (2009). "26.3: The Ethics and Risks of Developing Artificial Intelligence". Artificial Intelligence: A Modern Approach. Prentice Hall. ISBN 978-0-13-604259-4.
- Dietterich, Thomas; Horvitz, Eric (2015). "Rise of Concerns about AI: Reflections and Directions" (PDF). Communications of the ACM. 58 (10): 38–40. doi:10.1145/2770869. S2CID 20395145. Archived (PDF) from the original on 4 March 2016. Retrieved 14 June 2016.
- Russell, Stuart (2014). "Of Myths and Moonshine". Edge. Archived from the original on 19 July 2016. Retrieved 14 June 2016.
- "'Press the big red button': Computer experts want kill switch to stop robots from going rogue". Washington Post. Archived from the original on 12 June 2016. Retrieved 12 June 2016.
- "DeepMind Has Simple Tests That Might Prevent Elon Musk's AI Apocalypse". Bloomberg.com. 11 December 2017. Archived from the original on 8 January 2018. Retrieved 8 January 2018.
- "Alphabet's DeepMind Is Using Games to Discover If Artificial Intelligence Can Break Free and Kill Us All". Fortune. Archived from the original on 31 December 2017. Retrieved 8 January 2018.
- "Specifying AI safety problems in simple environments | DeepMind". DeepMind. Archived from the original on 2 January 2018. Retrieved 8 January 2018.
- Gabriel, Iason (1 September 2020). "Artificial Intelligence, Values, and Alignment". Minds and Machines. 30 (3): 411–437. arXiv:2001.09768. doi:10.1007/s11023-020-09539-2. ISSN 1572-8641. S2CID 210920551. Archived from the original on 15 February 2021. Retrieved 7 February 2021.
- Russell, Stuart (October 8, 2019). Human Compatible: Artificial Intelligence and the Problem of Control. United States: Viking. ISBN 978-0-525-55861-3. OCLC 1083694322.
- Yudkowsky, Eliezer (2011). "Complex Value Systems in Friendly AI". Artificial General Intelligence. Lecture Notes in Computer Science. 6830. pp. 388–393. doi:10.1007/978-3-642-22887-2_48. ISBN 978-3-642-22886-5.
- Leike, Jan; Krueger, David; Everitt, Tom; Martic, Miljan; Maini, Vishal; Legg, Shane (19 November 2018). "Scalable agent alignment via reward modeling: a research direction". arXiv:1811.07871 [cs.LG].
- Ortega, Pedro; Maini, Vishal; DeepMind Safety Team (27 September 2018). "Building safe artificial intelligence: specification, robustness, and assurance". Medium. Archived from the original on 12 December 2020. Retrieved 12 December 2020.
- Hubinger, Evan; van Merwijk, Chris; Mikulik, Vladimir; Skalse, Joar; Garrabrant, Scott (11 June 2019). "Risks from Learned Optimization in Advanced Machine Learning Systems". arXiv:1906.01820 [cs.AI].
- Ecoffet, Adrien; Clune, Jeff; Lehman, Joel (1 July 2020). "Open Questions in Creating Safe Open-ended AI: Tensions Between Control and Creativity". Artificial Life Conference Proceedings. 32: 27–35. arXiv:2006.07495. doi:10.1162/isal_a_00323. S2CID 219687488.
- Christian, Brian (2020). The Alignment Problem: Machine Learning and Human Values. W.W. Norton. ISBN 978-0-393-63582-9. Archived from the original on 2021-02-15. Retrieved 2021-02-07.
- Krakovna, Victoria; Legg, Shane. "Specification gaming: the flip side of AI ingenuity". Deepmind. Archived from the original on 26 January 2021. Retrieved 6 January 2021.
- Clark, Jack; Amodei, Dario (22 December 2016). "Faulty Reward Functions in the Wild". OpenAI. Archived from the original on 26 January 2021. Retrieved 6 January 2021.
- Christiano, Paul (11 September 2019). "Conversation with Paul Christiano". AI Impacts. AI Impacts. Archived from the original on 19 August 2020. Retrieved 6 January 2021.
- Serban, Alex; Poll, Erik; Visser, Joost (12 June 2020). "Adversarial Examples on Object Recognition: A Comprehensive Survey". ACM Computing Surveys. 53 (3): 66:1–66:38. arXiv:2008.04094. doi:10.1145/3398394. ISSN 0360-0300. S2CID 218518141. Archived from the original on 29 June 2020. Retrieved 7 February 2021.
- Kohli, Pushmeet; Dvijohtham, Krishnamurthy; Uesato, Jonathan; Gowal, Sven. "Towards Robust and Verified AI: Specification Testing, Robust Training, and Formal Verification". Deepmind. Archived from the original on 30 November 2020. Retrieved 6 January 2021.
- Christiano, Paul; Leike, Jan; Brown, Tom; Martic, Miljan; Legg, Shane; Amodei, Dario (13 July 2017). "Deep Reinforcement Learning from Human Preferences". arXiv:1706.03741 [stat.ML].
- Amodei, Dario; Olah, Chris; Steinhardt, Jacob; Christiano, Paul; Schulman, John; Mané, Dan (25 July 2016). "Concrete Problems in AI Safety". arXiv:1606.06565 [cs.AI].
- Amodei, Dario; Christiano, Paul; Ray, Alex (13 June 2017). "Learning from Human Preferences". OpenAI. Archived from the original on 3 January 2021. Retrieved 6 January 2021.
- Christiano, Paul; Shlegeris, Buck; Amodei, Dario (19 October 2018). "Supervising strong learners by amplifying weak experts". arXiv:1810.08575 [cs.LG].
- Irving, Geoffrey; Christiano, Paul; Amodei, Dario; OpenAI (October 22, 2018). "AI safety via debate". arXiv:1805.00899 [stat.ML].
- Banzhaf, Wolfgang; Goodman, Erik; Sheneman, Leigh; Trujillo, Leonardo; Worzel, Bill (May 2020). Genetic Programming Theory and Practice XVII. Springer Nature. ISBN 978-3-030-39958-0. Archived from the original on 2021-02-15. Retrieved 2021-02-07.
- Stiennon, Nisan; Ziegler, Daniel; Lowe, Ryan; Wu, Jeffrey; Voss, Chelsea; Christiano, Paul; Ouyang, Long (September 4, 2020). "Learning to Summarize with Human Feedback". Archived from the original on September 7, 2020. Retrieved September 7, 2020.
- Hadfield-Menell, Dylan; Dragan, Anca; Abbeel, Pieter; Russell, Stuart (12 November 2016). "Cooperative Inverse Reinforcement Learning". Neural Information Processing Systems.
- Everitt, Tom; Lea, Gary; Hutter, Marcus (21 May 2018). "AGI Safety Literature Review". 1805.01109. arXiv:1805.01109.
- Demski, Abram; Garrabrant, Scott (6 October 2020). "Embedded Agency". arXiv:1902.09469 [cs.AI].
- Everitt, Tom; Ortega, Pedro A.; Barnes, Elizabeth; Legg, Shane (6 September 2019). "Understanding Agent Incentives using Causal Influence Diagrams. Part I: Single Action Settings". arXiv:1902.09980 [cs.AI].
- Everitt, Tom; Hutter, Marcus (20 August 2019). "Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective". arXiv:1908.04734 [cs.AI].
- Everitt, Tom; Filan, Daniel; Daswani, Mayank; Hutter, Marcus (10 May 2016). "Self-Modification of Policy and Utility Function in Rational Agents". arXiv:1605.03142 [cs.AI].
- Leike, Jan; Taylor, Jessica; Fallenstein, Benya (25 June 2016). "A formal solution to the grain of truth problem". Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence. UAI'16. AUAI Press: 427–436. arXiv:1609.05058. ISBN 9780996643115. Archived from the original on 15 February 2021. Retrieved 7 February 2021.
- Garrabrant, Scott; Benson-Tilsen, Tsvi; Critch, Andrew; Soares, Nate; Taylor, Jessica (7 December 2020). "Logical Induction". arXiv:1609.03543 [cs.AI].
- Montavon, Grégoire; Samek, Wojciech; Müller, Klaus Robert (2018). "Methods for interpreting and understanding deep neural networks". Digital Signal Processing: A Review Journal. 73: 1–15. doi:10.1016/j.dsp.2017.10.011. ISSN 1051-2004. S2CID 207170725.
- Yampolskiy, Roman V. "Unexplainability and Incomprehensibility of AI." Journal of Artificial Intelligence and Consciousness 7.02 (2020): 277-291.
- Hadfield-Menell, Dylan; Dragan, Anca; Abbeel, Pieter; Russell, Stuart (15 June 2017). "The Off-Switch Game". arXiv:1611.08219 [cs.AI].
- Orseau, Laurent; Armstrong, Stuart (25 June 2016). "Safely interruptible agents". Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence. UAI'16. AUAI Press: 557–566. ISBN 9780996643115. Archived from the original on 15 February 2021. Retrieved 7 February 2021.
- Soares, Nate, et al. "Corrigibility." Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence. 2015.
- Chalmers, David (2010). "The singularity: A philosophical analysis". Journal of Consciousness Studies. 17 (9–10): 7–65.
- Bostrom, Nick (2014). "Chapter 10: Oracles, genies, sovereigns, tools (page 145)". Superintelligence: Paths, Dangers, Strategies. Oxford: Oxford University Press. ISBN 9780199678112.
An oracle is a question-answering system. It might accept questions in a natural language and present its answers as text. An oracle that accepts only yes/no questions could output its best guess with a single bit, or perhaps with a few extra bits to represent its degree of confidence. An oracle that accepts open-ended questions would need some metric with which to rank possible truthful answers in terms of their informativeness or appropriateness. In either case, building an oracle that has a fully domain-general ability to answer natural language questions is an AI-complete problem. If one could do that, one could probably also build an AI that has a decent ability to understand human intentions as well as human words.
- Armstrong, Stuart; Sandberg, Anders; Bostrom, Nick (2012). "Thinking Inside the Box: Controlling and Using an Oracle AI". Minds and Machines. 22 (4): 299–324. doi:10.1007/s11023-012-9282-2. S2CID 9464769.
- Bostrom, Nick (2014). "Chapter 10: Oracles, genies, sovereigns, tools (page 147)". Superintelligence: Paths, Dangers, Strategies. Oxford: Oxford University Press. ISBN 9780199678112.
For example, consider the risk that an oracle will answer questions not in a maximally truthful way but in such a way as to subtly manipulate us into promoting its own hidden agenda. One way to slightly mitigate this threat could be to create multiple oracles, each with a slightly different code and a slightly different information base. A simple mechanism could then compare the answers given by the different oracles and only present them for human viewing if all the answers agree.
- "Intelligent Machines: Do we really need to fear AI?". BBC News. 27 September 2015. Archived from the original on 8 November 2020. Retrieved 9 February 2021.
- Marcus, Gary; Davis, Ernest (6 September 2019). "Opinion | How to Build Artificial Intelligence We Can Trust (Published 2019)". The New York Times. Archived from the original on 22 September 2020. Retrieved 9 February 2021.
- Sotala, Kaj; Yampolskiy, Roman (19 December 2014). "Responses to catastrophic AGI risk: a survey". Physica Scripta. 90 (1): 018001. Bibcode:2015PhyS...90a8001S. doi:10.1088/0031-8949/90/1/018001.