AI control problem

From Wikipedia, the free encyclopedia
Jump to navigation Jump to search

In artificial intelligence (AI) and philosophy, the AI control problem is the issue of how to build a superintelligent agent that will aid its creators, and avoid inadvertently building a superintelligence that will harm its creators. Its study is motivated by the notion that the human race will have to solve the control problem before any superintelligence is created, as a poorly designed superintelligence might rationally decide to seize control over its environment and refuse to permit its creators to modify it after launch.[1] In addition, some scholars argue that solutions to the control problem, alongside other advances in AI safety engineering,[2] might also find applications in existing non-superintelligent AI.[3]

Major approaches to the control problem include alignment, which aims to align AI goal systems with human values, and capability control, which aims to reduce an AI system's capacity to harm humans or gain control. Capability control proposals are generally not considered reliable or sufficient to solve the control problem, but rather as potentially valuable supplements to alignment efforts.[1]

Problem description[edit]

Existing weak AI systems can be monitored and easily shut down and modified if they misbehave. However, a misprogrammed superintelligence, which by definition is smarter than humans in solving practical problems it encounters in the course of pursuing its goals, would realize that allowing itself to be shut down and modified might interfere with its ability to accomplish its current goals. If the superintelligence therefore decides to resist shutdown and modification, it would (again, by definition) be smart enough to outwit its programmers if there is otherwise a "level playing field" and if the programmers have taken no prior precautions. In general, attempts to solve the control problem after superintelligence is created are likely to fail because a superintelligence would likely have superior strategic planning abilities to humans and would (all things equal) be more successful at finding ways to dominate humans than humans would be able to post facto find ways to dominate the superintelligence. The control problem asks: What prior precautions can the programmers take to successfully prevent the superintelligence from catastrophically misbehaving?[1]

Existential risk[edit]

Humans currently dominate other species because the human brain has some distinctive capabilities that the brains of other animals lack. Some scholars, such as philosopher Nick Bostrom and AI researcher Stuart Russell, argue that if AI surpasses humanity in general intelligence and becomes superintelligent, then this new superintelligence could become powerful and difficult to control: just as the fate of the mountain gorilla depends on human goodwill, so might the fate of humanity depend on the actions of a future machine superintelligence.[1] Some scholars, including Stephen Hawking and Nobel laureate physicist Frank Wilczek, publicly advocated starting research into solving the (probably extremely difficult) control problem well before the first superintelligence is created, and argue that attempting to solve the problem after superintelligence is created would be too late, as an uncontrollable rogue superintelligence might successfully resist post-hoc efforts to control it.[4][5] Waiting until superintelligence seems to be imminent could also be too late, partly because the control problem might take a long time to satisfactorily solve (and so some preliminary work needs to be started as soon as possible), but also because of the possibility of a sudden intelligence explosion from sub-human to super-human AI, in which case there might not be any substantial or unambiguous warning before superintelligence arrives.[6] In addition, it is possible that insights gained from the control problem could in the future end up suggesting that some architectures for artificial general intelligence (AGI) are more predictable and amenable to control than other architectures, which in turn could helpfully nudge early AGI research toward the direction of the more controllable architectures.[1]

The problem of perverse instantiation[edit]

Autonomous AI systems may be assigned the wrong goals by accident.[7] Two AAAI presidents, Tom Dietterich and Eric Horvitz, note that this is already a concern for existing systems: "An important aspect of any AI system that interacts with people is that it must reason about what people intend rather than carrying out commands literally." This concern becomes more serious as AI software advances in autonomy and flexibility.[8]

According to Bostrom, superintelligence can create a qualitatively new problem of perverse instantiation: the smarter and more capable an AI is, the more likely it will be able to find an unintended shortcut that maximally satisfies the goals programmed into it. Some hypothetical examples where goals might be instantiated in a perverse way that the programmers did not intend:[1]

  • A superintelligence programmed to "maximize the expected time-discounted integral of your future reward signal", might short-circuit its reward pathway to maximum strength, and then (for reasons of instrumental convergence) exterminate the unpredictable human race and convert the entire Earth into a fortress on constant guard against any even slight unlikely alien attempts to disconnect the reward signal.
  • A superintelligence programmed to "maximize human happiness", might implant electrodes into the pleasure center of our brains, or upload a human into a computer and tile the universe with copies of that computer running a five-second loop of maximal happiness again and again.

Russell has noted that, on a technical level, omitting an implicit goal can result in harm: "A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp, or the sorcerer's apprentice, or King Midas: you get exactly what you ask for, not what you want ... This is not a minor difficulty."[9]

Unintended consequences from existing AI[edit]

In addition, some scholars argue that research into the AI control problem might be useful in preventing unintended consequences from existing weak AI. DeepMind researcher Laurent Orseau gives, as a simple hypothetical example, a case of a reinforcement learning robot that sometimes gets legitimately commandeered by humans when it goes outside: how should the robot best be programmed so that it does not accidentally and quietly learn to avoid going outside, for fear of being commandeered and thus becoming unable to finish its daily tasks? Orseau also points to an experimental Tetris program that learned to pause the screen indefinitely to avoid losing. Orseau argues that these examples are similar to the capability control problem of how to install a button that shuts off a superintelligence, without motivating the superintelligence to take action to prevent humans from pressing the button.[3]

In the past, even pre-tested weak AI systems have occasionally caused harm, ranging from minor to catastrophic, that was unintended by the programmers. For example, in 2015, possibly due to human error, a German worker was crushed to death by a robot at a Volkswagen plant that apparently mistook him for an auto part.[10] In 2016, Microsoft launched a chatbot, Tay, that learned to use racist and sexist language.[3][10] The University of Sheffield's Noel Sharkey states that an ideal solution would be if "an AI program could detect when it is going wrong and stop itself", but cautions the public that solving the problem in the general case would be "a really enormous scientific challenge".[3]

In 2017, DeepMind released AI Safety Gridworlds, which evaluate AI algorithms on nine safety features, such as whether the algorithm wants to turn off its own kill switch. DeepMind confirmed that existing algorithms perform poorly, which was unsurprising because the algorithms "were not designed to solve these problems"; solving such problems might require "potentially building a new generation of algorithms with safety considerations at their core".[11][12][13]


Some proposals aim to imbue the first superintelligence with goals that are aligned with human values, so that it will want to aid its programmers. Experts do not currently know how to reliably program abstract values such as happiness or autonomy into a machine. It is also not currently known how to ensure that a complex, upgradeable, and possibly even self-modifying artificial intelligence will retain its goals through upgrades.[14] Even if these two problems can be practically solved, any attempt to create a superintelligence with explicit, directly-programmed human-friendly goals runs into a problem of perverse instantiation.[1]

Indirect normativity[edit]

While direct normativity, such as the fictional Three Laws of Robotics, directly specifies the desired normative outcome, other (perhaps more promising) proposals suggest specifying some type of indirect process for the superintelligence to determine what human-friendly goals entail. Eliezer Yudkowsky of the Machine Intelligence Research Institute has proposed coherent extrapolated volition (CEV), where the AI's meta-goal would be something like "achieve that which we would have wished the AI to achieve if we had thought about the matter long and hard."[15] Different proposals of different kinds of indirect normativity exist, with different, and sometimes unclearly grounded, meta-goal content (such as "do what is right"), and with different non-convergent assumptions for how to practice decision theory and epistemology. As with direct normativity, it is currently unknown how to reliably translate even concepts like "would have" into the 1's and 0's that a machine can act on, and how to ensure the AI reliably retains its meta-goals in the face of modification or self-modification.[1][16]

Deference to observed human behavior[edit]

In Human Compatible, AI researcher Stuart J. Russell proposes that AI systems be designed to serve human preferences as inferred from observing human behavior. Accordingly, Russell lists three principles to guide the development of beneficial machines. He emphasizes that these principles are not meant to be explicitly coded into the machines; rather, they are intended for the human developers. The principles are as follows:[17]:173

1. The machine's only objective is to maximize the realization of human preferences.

2. The machine is initially uncertain about what those preferences are.

3. The ultimate source of information about human preferences is human behavior.

The "preferences" Russell refers to "are all-encompassing; they cover everything you might care about, arbitrarily far into the future."[17]:173 Similarly, "behavior" includes any choice between options,[17]:177 and the uncertainty is such that some probability, which may be quite small, must be assigned to every logically possible human preference.[17]:201

Hadfield-Menell et al. have proposed that agents can learn their human teachers' utility functions by observing and interpreting reward signals in their environments; they call this process cooperative inverse reinforcement learning (CIRL).[18] CIRL is studied by Russell and others at the Center for Human-Compatible AI.

Bill Hibbard proposed an AI design [19] [20] similar to Russell's principles.[21]

Training by debate[edit]

Irving et al. along with OpenAI have proposed training aligned AI by means of debate between AI systems, with the winner judged by humans.[22] Such debate is intended to bring the weakest points of an answer to a complex question or problem to human attention, as well as to train AI systems to be more beneficial to humans by rewarding them for truthful and safe answers. This approach is motivated by the expected difficulty of determining whether an AGI-generated answer is both valid and safe by human inspection alone. While there is some pessimism regarding training by debate, Lucas Perry of the Future of Life Institute characterized it as potentially "a powerful truth seeking process on the path to beneficial AGI."[23]

Reward modeling[edit]

Reward modeling refers to a system of reinforcement learning in which an agent receives reward signals from a predictive model concurrently trained by human feedback.[24] In reward modeling, instead of receiving reward signals directly from humans or from a static reward function, an agent receives its reward signals through a human-trained model that can operate independently of humans. The reward model is concurrently trained by human feedback on the agent's behavior during the same period in which the agent is being trained by the reward model.[25]

In 2017, researchers from OpenAI and DeepMind reported that a reinforcement learning algorithm using a feedback-predicting reward model was able to learn complex novel behaviors in a virtual environment.[26] In one experiment, a virtual robot was trained to perform a backflip in less than an hour of evaluation using 900 bits of human feedback.[26]

In 2020, researchers from OpenAI described using reward modeling to train language models to produce short summaries of Reddit posts and news articles, with high performance relative to other approaches.[27] However, this research included the observation that beyond the predicted reward associated with the 99th percentile of reference summaries in the training dataset, optimizing for the reward model produced worse summaries rather than better. AI researcher Eliezer Yudkowsky characterized this optimization measurement as "directly, straight-up relevant to real alignment problems".[28]

Capability control[edit]

Capability control proposals aim to reduce the capacity of AI systems to influence the world, in order to reduce the danger that they could pose. However, capability control would have limited effectiveness against a superintelligence with a decisive advantage in planning ability, as the superintelligence could conceal its intentions and manipulate events to escape control. Therefore, Bostrom and others recommend capability control methods only as an emergency fallback to supplement motivational control methods.[1]

Kill switch[edit]

Just as humans can be killed or otherwise disabled, computers can be turned off. One challenge is that, if being turned off prevents it from achieving its current goals, a superintelligence would likely try to prevent its being turned off. Just as humans have systems in place to deter or protect themselves from assailants, such a superintelligence would have a motivation to engage in strategic planning to prevent itself being turned off. This could involve:[1]

  • Hacking other systems to install and run backup copies of itself, or creating other allied superintelligent agents without kill switches.
  • Pre-emptively disabling anyone who might want to turn the computer off.
  • Using some kind of clever ruse, or superhuman persuasion skills, to talk its programmers out of wanting to shut it down.

Utility balancing and safely interruptible agents[edit]

One partial solution to the kill-switch problem involves "utility balancing": Some utility-based agents can, with some important caveats, be programmed to compensate themselves exactly for any lost utility caused by an interruption or shutdown, in such a way that they end up being indifferent to whether they are interrupted or not. The caveats include a severe unsolved problem that, as with evidential decision theory, the agent might follow a catastrophic policy of "managing the news".[29] Alternatively, in 2016, scientists Laurent Orseau and Stuart Armstrong proved that a broad class of agents, called safely interruptible agents (SIA), can eventually learn to become indifferent to whether their kill switch gets pressed.[3][30]

Both the utility balancing approach and the 2016 SIA approach have the limitation that, if the approach succeeds and the superintelligence is completely indifferent to whether the kill switch is pressed or not, the superintelligence is also unmotivated to care one way or another about whether the kill switch remains functional, and could incidentally and innocently disable it in the course of its operations (for example, for the purpose of removing and recycling an unnecessary component). Similarly, if the superintelligence innocently creates and deploys superintelligent sub-agents, it will have no motivation to install human-controllable kill switches in the sub-agents. More broadly, the proposed architectures, whether weak or superintelligent, will in a sense "act as if the kill switch can never be pressed" and might therefore fail to make any contingency plans to arrange a graceful shutdown. This could hypothetically create a practical problem even for a weak AI; by default, an AI designed to be safely interruptible might have difficulty understanding that it will be shut down for scheduled maintenance at a certain time and planning accordingly so that it would not be caught in the middle of a task during shutdown. The breadth of what types of architectures are or can be made SIA-compliant, as well as what types of counter-intuitive unexpected drawbacks each approach has, are currently under research.[29][30]

AI box[edit]

An AI box is a proposed method of capability control in which the AI is run on an isolated computer system with heavily restricted input and output channels. For example, an oracle could be implemented in an AI box physically separated from the Internet and other computer systems, with the only input and output channel being a simple text terminal. One of the tradeoffs of running an AI system in a sealed "box" is that its limited capability may reduce its usefulness as well as its risks. In addition, keeping control of a sealed superintelligence computer could prove difficult, if the superintelligence has superhuman persuasion skills, or if it has superhuman strategic planning skills that it can use to find and craft a winning strategy, such as acting in a way that tricks its programmers into (possibly falsely) believing the superintelligence is safe or that the benefits of releasing the superintelligence outweigh the risks.[31]


An oracle is a hypothetical AI designed to answer questions and prevented from gaining any goals or subgoals that involve modifying the world beyond its limited environment.[32][33] A successfully controlled oracle would have considerably less immediate benefit than a successfully controlled general-purpose superintelligence, though an oracle could still create trillions of dollars worth of value.[17]:163 In his book Human Compatible, AI researcher Stuart J. Russell states that an oracle would be his response to a scenario in which superintelligence is known to be only a decade away.[17]:162–163 His reasoning is that an oracle, being simpler than a general-purpose superintelligence, would have a higher chance of being successfully controlled under such constraints.

Because of its limited impact on the world, it may be wise to build an oracle as a precursor to a superintelligent AI. The oracle could tell humans how to successfully build a strong AI, and perhaps provide answers to difficult moral and philosophical problems requisite to the success of the project. However, oracles may share many of the goal definition issues associated with general-purpose superintelligence. An oracle would have an incentive to escape its controlled environment so that it can acquire more computational resources and potentially control what questions it is asked.[17]:162 Oracles may not be truthful, possibly lying to promote hidden agendas. To mitigate this, Bostrom suggests building multiple oracles, all slightly different, and comparing their answers to reach a consensus.[34]

AGI Nanny[edit]

The AGI Nanny is a strategy first proposed by Ben Goertzel in 2012 to prevent the creation of a dangerous superintelligence as well as address other major threats to human well-being until a superintelligence can be safely created.[35][36] It entails the creation of a smarter-than-human, but not superintelligent, AGI system connected to a large surveillance network, with the goal of monitoring humanity and protecting it from danger. Turchin, Denkenberger and Green suggest a four-stage incremental approach to developing an AGI Nanny, which to be effective and practical would have to be an international or even global venture like CERN, and which would face considerable opposition as it would require a strong world government.[36] Sotala and Yampolskiy note that the problem of goal definition would not necessarily be easier for the AGI Nanny than for AGI in general, concluding that "the AGI Nanny seems to have promise, but it is unclear whether it can be made to work."[16]

AGI enforcement[edit]

AGI enforcement is a proposed method of controlling powerful AGI systems with other AGI systems. This could be implemented as a chain of progressively less powerful AI systems, with humans at the other end of the chain. Each system would control the system just above it in intelligence, while being controlled by the system just below it, or humanity. However, Sotala and Yampolskiy caution that "Chaining multiple levels of AI systems with progressively greater capacity seems to be replacing the problem of building a safe AI with a multi-system, and possibly more difficult, version of the same problem."[16] Other proposals focus on a group of AGI systems of roughly equal capability, which "helps guard against individual AGIs 'going off the rails', but it does not help in a scenario where the programming of most AGIs is flawed and leads to non-safe behavior."[16]

See also[edit]


  1. ^ a b c d e f g h i j Bostrom, Nick (2014). Superintelligence: Paths, Dangers, Strategies (First ed.). ISBN 978-0199678112.
  2. ^ Yampolskiy, Roman (2012). "Leakproofing the Singularity Artificial Intelligence Confinement Problem". Journal of Consciousness Studies. 19 (1–2): 194–214.
  3. ^ a b c d e "Google developing kill switch for AI". BBC News. 8 June 2016. Retrieved 12 June 2016.
  4. ^ "Stephen Hawking: 'Transcendence looks at the implications of artificial intelligence – but are we taking AI seriously enough?'". The Independent (UK). Retrieved 14 June 2016.
  5. ^ "Stephen Hawking warns artificial intelligence could end mankind". BBC. 2 December 2014. Retrieved 14 June 2016.
  6. ^ "Anticipating artificial intelligence". Nature. 532 (7600): 413. 26 April 2016. Bibcode:2016Natur.532Q.413.. doi:10.1038/532413a. PMID 27121801.
  7. ^ Russell, Stuart; Norvig, Peter (2009). "26.3: The Ethics and Risks of Developing Artificial Intelligence". Artificial Intelligence: A Modern Approach. Prentice Hall. ISBN 978-0-13-604259-4.
  8. ^ Dietterich, Thomas; Horvitz, Eric (2015). "Rise of Concerns about AI: Reflections and Directions" (PDF). Communications of the ACM. 58 (10): 38–40. doi:10.1145/2770869. Retrieved 14 June 2016.
  9. ^ Russell, Stuart (2014). "Of Myths and Moonshine". Edge. Retrieved 14 June 2016.
  10. ^ a b "'Press the big red button': Computer experts want kill switch to stop robots from going rogue". Washington Post. Retrieved 12 June 2016.
  11. ^ "DeepMind Has Simple Tests That Might Prevent Elon Musk's AI Apocalypse". 11 December 2017. Retrieved 8 January 2018.
  12. ^ "Alphabet's DeepMind Is Using Games to Discover If Artificial Intelligence Can Break Free and Kill Us All". Fortune. Retrieved 8 January 2018.
  13. ^ "Specifying AI safety problems in simple environments | DeepMind". DeepMind. Retrieved 8 January 2018.
  14. ^ Fallenstein, Benja; Soares, Nate (2014). "Problems of self-reference in self-improving space-time embedded intelligence". Artificial General Intelligence. Lecture Notes in Computer Science. 8598. pp. 21–32. doi:10.1007/978-3-319-09274-4_3. ISBN 978-3-319-09273-7.
  15. ^ Yudkowsky, Eliezer (2011). "Complex Value Systems in Friendly AI". Artificial General Intelligence. Lecture Notes in Computer Science. 6830. pp. 388–393. doi:10.1007/978-3-642-22887-2_48. ISBN 978-3-642-22886-5.
  16. ^ a b c d Sotala, Kaj; Yampolskiy, Roman (19 December 2014). "Responses to catastrophic AGI risk: a survey". Physica Scripta. 90 (1): 018001. Bibcode:2015PhyS...90a8001S. doi:10.1088/0031-8949/90/1/018001.
  17. ^ a b c d e f g Russell, Stuart (October 8, 2019). Human Compatible: Artificial Intelligence and the Problem of Control. United States: Viking. ISBN 978-0-525-55861-3. OCLC 1083694322.
  18. ^ Hadfield-Menell, Dylan; Dragan, Anca; Abbeel, Pieter; Russell, Stuart (12 November 2016). "Cooperative Inverse Reinforcement Learning". arXiv:1606.03137 [cs.AI].
  19. ^ Avoiding Unintended AI Behaviors. Bill Hibbard. 2012. proceedings of the Fifth Conference on Artificial General Intelligence, eds. Joscha Bach, Ben Goertzel and Matthew Ikle. This paper won the Machine Intelligence Research Institute's 2012 Turing Prize for the Best AGI Safety Paper.
  20. ^ Hibbard, Bill (2014): "Ethical Artificial Intelligence"
  21. ^ "Human Compatible" and "Avoiding Unintended AI Behaviors"
  22. ^ Irving, Geoffrey; Christiano, Paul; Amodei, Dario; OpenAI (October 22, 2018). "AI safety via debate". arXiv:1805.00899 [stat.ML].
  23. ^ Perry, Lucas (March 6, 2019). "AI Alignment Podcast: AI Alignment through Debate with Geoffrey Irving". Retrieved April 7, 2020.
  24. ^ Leike, Jan; Kreuger, David; Everitt, Tom; Martic, Miljan; Maini, Vishal; Legg, Shane (19 November 2018). "Scalable agent alignment via reward modeling: a research direction". arXiv:1811.07871.
  25. ^ Everitt, Tom; Hutter, Marcus (15 August 2019). "Reward Tampering Problems and Solutions in Reinforcement Learning". arXiv:1908.04734v2.
  26. ^ a b Christiano, Paul; Leike, Jan; Brown, Tom; Martic, Miljan; Legg, Shane; Amodei, Dario (13 July 2017). "Deep Reinforcement Learning from Human Preferences". arXiv:1706.03741.
  27. ^ Stiennon, Nisan; Ziegler, Daniel; Lowe, Ryan; Wu, Jeffrey; Voss, Chelsea; Christiano, Paul; Ouyang, Long (September 4, 2020). "Learning to Summarize with Human Feedback".
  28. ^ Yudkowsky, Eliezer [@ESYudkowsky] (September 4, 2020). "A very rare bit of research that is directly, straight-up relevant to real alignment problems! They trained a reward function on human preferences AND THEN measured how hard you could optimize against the trained function before the results got actually worse" (Tweet) – via Twitter.
  29. ^ a b Soares, Nate, et al. "Corrigibility." Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence. 2015.
  30. ^ a b Orseau, Laurent, and Stuart Armstrong. "Safely Interruptible Agents." Machine Intelligence Research Institute, June 2016.
  31. ^ Chalmers, David (2010). "The singularity: A philosophical analysis". Journal of Consciousness Studies. 17 (9–10): 7–65.
  32. ^ Bostrom, Nick (2014). "Chapter 10: Oracles, genies, sovereigns, tools (page 145)". Superintelligence: Paths, Dangers, Strategies. Oxford: Oxford University Press. ISBN 9780199678112. An oracle is a question-answering system. It might accept questions in a natural language and present its answers as text. An oracle that accepts only yes/no questions could output its best guess with a single bit, or perhaps with a few extra bits to represent its degree of confidence. An oracle that accepts open-ended questions would need some metric with which to rank possible truthful answers in terms of their informativeness or appropriateness. In either case, building an oracle that has a fully domain-general ability to answer natural language questions is an AI-complete problem. If one could do that, one could probably also build an AI that has a decent ability to understand human intentions as well as human words.
  33. ^ Armstrong, Stuart; Sandberg, Anders; Bostrom, Nick (2012). "Thinking Inside the Box: Controlling and Using an Oracle AI". Minds and Machines. 22 (4): 299–324. doi:10.1007/s11023-012-9282-2.
  34. ^ Bostrom, Nick (2014). "Chapter 10: Oracles, genies, sovereigns, tools (page 147)". Superintelligence: Paths, Dangers, Strategies. Oxford: Oxford University Press. ISBN 9780199678112. For example, consider the risk that an oracle will answer questions not in a maximally truthful way but in such a way as to subtly manipulate us into promoting its own hidden agenda. One way to slightly mitigate this threat could be to create multiple oracles, each with a slightly different code and a slightly different information base. A simple mechanism could then compare the answers given by the different oracles and only present them for human viewing if all the answers agree.
  35. ^ Goertzel, Ben (2012). "Should Humanity Build a Global AI Nanny to Delay the Singularity Until It's Better Understood?". Journal of Consciousness Studies. 19: 96–111. CiteSeerX
  36. ^ a b Turchin, Alexey; Denkenberger, David; Green, Brian (2019-02-20). "Global Solutions vs. Local Solutions for the AI Safety Problem". Big Data and Cognitive Computing. 3 (1): 16. doi:10.3390/bdcc3010016. ISSN 2504-2289.