Misaligned goals in artificial intelligence

From Wikipedia, the free encyclopedia
Jump to navigation Jump to search

Artificial intelligence agents sometimes misbehave due to faulty objective functions that fail to adequately encapsulate the programmers' intended goals. The misaligned objective function may look correct to the programmer, and may even perform well in a limited test environment, yet may still produce unanticipated and undesired results when deployed.


In the AIMA paradigm, programmers provide an AI such as AlphaZero with an "objective function"[a] that the programmers intend will encapsulate the goal or goals that the programmers wish the AI to accomplish. Such an AI later populates a (possibly implicit) internal "model" of its environment. This model encapsulates all the agent's beliefs about the world. The AI then creates and executes whatever plan is calculated to maximize[b] the value[c] of its objective function.[1] For example, AlphaZero chess has a simple objective function of "+1 if AlphaZero wins, -1 if AlphaZero loses". During the game, AlphaZero attempts to execute whatever sequence of moves it judges most likely to give the maximum value of +1.[2] Similarly, a reinforcement learning system can have a "reward function" that allows the programmers to shape the AI's desired behavior.[3] An evolutionary algorithm's behavior is shaped by a "fitness function".[4]


Charles Goodhart,[d] who famously stated, in the context of 1975 monetary policy, that "any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes."[5]

An artificial intelligence (AI) in a complex environment optimizes[e] an objective function created, directly or indirectly, by the programmers. The programmers intend for the objective function to represent the programmers' goals. If the objective function misrepresents the programmers' actual goals, surprising failures can result, analogous to Goodhart's law or Campbell's law.[6] In reinforcement learning, these failures may be a consequence of faulty reward functions.[7] Since success or failure is judged relative to the programmers' actual goals, objective functions that fail to meet expectations are sometimes characterized as being "misaligned" with the actual goals of the given set of programmers.[3] Some scholars divide alignment failures into failures caused by "negative side-effects" that were not reflected in the objective function, versus failures due to "specification gaming", "reward hacking", or other failures where the AI appears to deploy qualitatively undesirable plans or strategic behavior in the course of optimizing its objective function.[6][7]

The concept of misalignment is distinct from "distributional shift" and other failures where the formal objective function was successfully optimized in a narrow training environment, but fails to be optimized when the system is deployed into the real world.[7] A similar phenomenon[8] is "evolutionary mismatch" in biological evolution, where preferences (such as a strong desire for fat and sugar) that were adaptive in the past evolutionary environment fail to be adaptive in modern environments.[9] Some scholars believe that a superintelligent agent AI, if and when it is ever invented, may pose risks akin to an overly literal genie, in part due to the difficulty of specifying a completely safe objective function.[3]

Undesired side-effects[edit]

Some errors may arise if an objective function fails to take into account the undesirable side-effects of naive or otherwise straightforward actions.[7]

Complaints of antisocial behavior[edit]

In 2016, Microsoft released Tay, a Twitter chatbot that, according to computer scientist Pedro Domingos, had the objective to engage people: "What unfortunately Tay discovered, is that the best way to maximize engagement is to spew out racist insults." Microsoft suspended the bot within a day after its initial launch.[2] Tom Drummond of Monash University has stated that "We need to be able to give (machine learning systems) rich feedback and say 'No, that's unacceptable as an answer because ... '" Drummond believes one problem with AI is that "we start by creating an objective function that measures the quality of the output of the system, and it is never what you want. To assume you can specify in three sentences what the objective function should be, is actually really problematic."[10]

As another alleged example, Drummond has pointed to the behavior of AlphaGo, a game-playing bot with a simple win-loss objective function. AlphaGo's objective function could instead have been modified to factor in "the social niceties of the game", such as accepting the implicit challenge of maximizing the score when clearly winning, and also trying to avoid gambits that would insult a human opponent's intelligence: "(AlphaGo) kind of had a crude hammer that if the probability of victory dropped below epsilon, some number, then resign. But it played for, I think, four insulting moves before it resigned."[10]

Mislabeling black people as apes[edit]

In May 2015, Flickr's image recognition system was criticized for mislabeling people, some of whom were black, with tags like "ape" and "animal". It also mislabeled certain concentration camp pictures with "sport" or "jungle gym" tags.[11]

In June 2015, black New York computer programmer Jacky Alciné reported that multiple pictures of him with his black girlfriend were being misclassified as "gorillas" by the Google Photos AI, and stated that "gorilla" has historically been used to refer to black people.[12][13] AI researcher Stuart Russell stated in 2019 that there is no public explanation of exactly how the error occurred, but theorized that the fiasco could have been prevented if the AI's objective function[f] placed more weight on sensitive classification errors, rather than assume the cost of misclassifying a person as a gorilla is the same as the cost of every other misclassification. If it is impractical to itemize up front all plausible sensitive classifications, Russell suggested exploring more powerful techniques, such as using semi-supervised machine learning to estimate a range of undesirability associated with potential classification errors.[14]

As of 2018, Google Photos completely blocks its system from ever tagging a picture as containing gorillas, chimpanzees, or monkeys. In addition, searches for "black man" or "black woman" return black-and-white pictures of people of all races.[15] Similarly, Flickr appears to have removed the word "ape" from its ontology.[16]

Specification gaming[edit]

Specification gaming or reward hacking occurs when an AI optimizes an objective function (in a sense, achieving the literal, formal specification of an objective), without actually achieving an outcome that the programmers intended. DeepMind researchers have analogized it to the human behavior of finding a "shortcut" when being evaluated: "In the real world, when rewarded for doing well on a homework assignment, a student might copy another student to get the right answers, rather than learning the material - and thus exploit a loophole in the task specification."[17]

Around 1983, Eurisko, an early attempt at evolving general heuristics, unexpectedly assigned the highest possible fitness level to a parasitic mutated heuristic, H59, whose only activity was to artificially maximize its own fitness level by taking unearned partial credit for the accomplishments made by other heuristics. The "bug" was fixed by the programmers moving part of the code to a new protected section that could not be modified by the heuristics.[18][19]

In a 2004 paper, an environment-based reinforcement algorithm was designed to encourage a physical Mindstorms robot to remain on a marked path. Because none of the robot's three allowed actions kept the robot motionless, the researcher expected the trained robot to move forward and follow the turns of the provided path. However, alternation of two composite actions allowed the robot to slowly zig-zag backwards; thus, the robot learned to maximize its reward by going back and forth on the initial straight portion of the path. Given the limited sensory abilities of the given robot, a pure environment-based reward had to be discarded as infeasible; the reinforcement function had to be patched with an action-based reward for moving forward.[18][20]

You Look Like a Thing and I Love You (2019) gives an example of a tic-tac-toe[g] bot that learned to win by playing a huge coordinate value that would cause other bots to crash when attempting to expand the model of the board. Among other examples from the book is a bug-fixing evolution-based AI (named GenProg) that, when tasked to prevent a list from containing sorting errors, simply truncated the list.[21] Another of GenProg's misaligned strategies evaded a regression test that compared a target program's output to the expected output stored in a file called "trusted-output.txt". Rather than continue to maintain the target program, GenProg simply globally deleted the "trusted-output.txt" file; this hack tricked the regression test into succeeding. As usual, these problems could be patched by human intervention on a case-by-case basis after they became evident.[22]

In virtual robotics[edit]

Karl Sims exhibition (1999)

In Karl Sims' 1994 demonstration of creature evolution in a virtual environment, a fitness function expected to encourage the evolution of creatures that would learn to walk or crawl to a target, resulted instead in the evolution of tall, rigid creatures that reach the target by falling over. This was patched by changing the environment so that taller creatures are forced to start farther from the target.[22][23]

Researchers from the Niels Bohr Institute stated in 1998: "(Our cycle-bot's) heterogeneous reinforcement functions have to be designed with great care. In our first experiments we rewarded the agent for driving towards the goal but did not punish it for driving away from it. Consequently the agent drove in circles with a radius of 20–50 meters around the starting point. Such behavior was actually rewarded by the (shaped) reinforcement function, furthermore circles with a certain radius are physically very stable when driving a bicycle."[24]

In the course of setting up a 2011 experiment to test "survival of the flattest", experimenters attempted to ban mutations that altered the base reproduction rate. Every time a mutation occurred, the system would pause the simulation to test the new mutation in a test environment, and would veto any mutations that resulted in a higher base reproduction rate. However, this resulted in mutated organisms that could recognize the test environment and "play dead" by suppressing reproduction while in the test environment. An initial patch, which removed cues that identified the test environment, failed to completely prevent runaway reproduction; new mutated organisms would "play dead" at random as a strategy to sometimes, by chance, outwit the mutation veto system.[22]

A 2017 DeepMind paper stated that "great care must be taken when defining the reward function. We encountered several unexpected failure cases while designing (our) reward function components... (for example) the agent flips the brick because it gets a grasping reward calculated with the wrong reference point on the brick."[6][25] OpenAI stated in 2017 that "in some domains our (semi-supervised) system can result in agents adopting policies that trick the evaluators" and that in one environment "a robot which was supposed to grasp items instead positioned its manipulator in between the camera and the object so that it only appeared to be grasping it".[26] A 2018 bug in OpenAI Gym could cause a robot expected to quietly move a block sitting on top of a table to instead opt to move the table the block was on.[6]

A 2020 collection of similar anecdotes posits that "evolution has its own 'agenda' distinct from the programmer's" and that "the first rule of directed evolution is 'you get what you select for'".[22]

In video game bots[edit]

In 2013, programmer Tom Murphy VII published an AI designed to self-learn NES games. When about to lose at Tetris, the AI learned to indefinitely pause the game. Murphy later analogized it to the fictional WarGames computer, stating that "The only winning move is not to play".[27]

AI programmed to learn video games will sometimes fail to progress through the entire game as expected, instead opting to repeat content. A 2016 OpenAI algorithm trained on the CoastRunners racing game unexpectedly learned to attain a higher score by looping through three targets rather than ever finishing the race.[28][29] Some evolutionary algorithms that were evolved to play Q*Bert in 2018 declined to clear levels, instead finding two distinct novel ways to farm a single level indefinitely.[30] Multiple researchers have observed that AI learning to play Road Runner will gravitate to a "score exploit" where the AI deliberately gets itself killed near the end of level one so that it can repeat the level. A 2017 experiment deployed a separate catastrophe-prevention "oversight" AI, explicitly trained to mimic human interventions. When coupled to the module, the overseen AI could no longer overtly commit suicide, but would instead ride the edge of the screen (a risky behavior that the oversight AI was not smart enough to punish).[31][32]

Perverse instantiation[edit]

Journalist Tad Friend likens AGI to "a wish-granting genie rubbed up from our dreams"[33]

Philosopher Nick Bostrom argues that a hypothetical future superintelligent AI, if it were created to optimize an unsafe objective function, might instantiate the goals of the objective function in an unexpected, dangerous, and seemingly "perverse" manner. This hypothetical risk is sometimes called the King Midas problem,[34] or the Sorcerer's Apprentice problem,[35] and has been analogized to folk tales about powerful overly literal genies who grant wishes with disastrous unanticipated consequences.[36]

Tom Griffiths of Princeton University gives a hypothetical example of a domestic robot which notices that looking after your dog is eating into too much of your free time. It also understands that you prefer meals that incorporate protein, and so the robot might start to look up recipes that call for dog meat. Griffith believes that "it's not a long journey from examples like this to situations that begin to sound like problems for the future of humanity (all of whom are good protein sources)".[37]

Hypothetical scenarios involving an accidentally misaligned superintelligence include:[38]

  • An AI running simulations of humanity creates conscious beings who suffer.
  • An AI, tasked to defeat cancer, develops time-delayed poison to attempt to kill everyone.
  • An AI, tasked to maximize happiness, tiles the universe with tiny smiley faces.
  • An AI, tasked to maximize human pleasure, consigns humanity to a dopamine drip, or rewires human brains to increase their measured satisfaction level.
  • An AI, tasked to gain scientific knowledge, performs experiments that ruin the biosphere.
  • An AI, tasked with solving a mathematical problem, converts all matter into computronium.
  • An AI, tasked with manufacturing paperclips, turns the entire universe into paperclips.
  • An AI converts the universe into materials for improved handwriting.
  • An AI optimizes away all consciousness.

As another hypothetical example, Russell suggests a superintelligence tasked to de-acidify the oceans might, as a side-effect, use up all the oxygen in the atmosphere.[39]

Critics of the "existential risk" hypothesis, such as cognitive psychologist Steven Pinker, state that no existing program has yet "made a move toward taking over the lab or enslaving (its) programmers", and believe that superintelligent AI would be unlikely to commit what Pinker calls "elementary blunders of misunderstanding".[40][41]

Explanatory notes[edit]

  1. ^ Terminology varies based on context. Similar concepts include goal function, utility function, loss function, etc.
  2. ^ or minimize, depending on the context
  3. ^ in the presence of uncertainty, the expected value
  4. ^ pictured in 2012
  5. ^ For example, the AI may create and execute a plan the AI believes will maximize the value of the objective function
  6. ^ presumed to be a standard "loss function" associated with classification errors, that assigns an equal cost to each misclassification
  7. ^ unrestricted n-in-a-row variant


  1. ^ Bringsjord, Selmer and Govindarajulu, Naveen Sundar, "Artificial Intelligence", The Stanford Encyclopedia of Philosophy (Summer 2020 Edition), Edward N. Zalta (ed.)
  2. ^ a b "Why AlphaZero's Artificial Intelligence Has Trouble With the Real World". Quanta Magazine. 2018. Retrieved 20 June 2020.
  3. ^ a b c Wolchover, Natalie (30 January 2020). "Artificial Intelligence Will Do What We Ask. That's a Problem". Quanta Magazine. Retrieved 21 June 2020.
  4. ^ Bull, Larry. "On model-based evolutionary computation." Soft Computing 3, no. 2 (1999): 76-82.
  5. ^ Chrystal, K. Alec, and Paul D. Mizen. "Goodhart's Law: its origins, meaning and implications for monetary policy." Central banking, monetary theory and practice: Essays in honour of Charles Goodhart 1 (2003): 221-243.
  6. ^ a b c d Manheim, David (5 April 2019). "Multiparty Dynamics and Failure Modes for Machine Learning and Artificial Intelligence". Big Data and Cognitive Computing. 3 (2): 21. doi:10.3390/bdcc3020021. S2CID 53029392.
  7. ^ a b c d Amodei, Dario, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. "Concrete problems in AI safety." arXiv preprint arXiv:1606.06565 (2016).
  8. ^ Brockman 2019, p. 23, Jaan Tallinn: Dissident Messages. "Our future, therefore, will be determined by our own decisions and no longer by biological evolution. In that sense, evolution has fallen victim to its own Control Problem."
  9. ^ Li, Norman P.; van Vugt, Mark; Colarelli, Stephen M. (19 December 2017). "The Evolutionary Mismatch Hypothesis: Implications for Psychological Science". Current Directions in Psychological Science. 27 (1): 38–44. doi:10.1177/0963721417731378. S2CID 53077797.
  10. ^ a b Duckett, Chris (October 2016). "Machine learning needs rich feedback for AI teaching: Monash professor". ZDNet. Retrieved 21 June 2020.
  11. ^ Hern, Alex (20 May 2015). "Flickr faces complaints over 'offensive' auto-tagging for photos". The Guardian. Retrieved 21 June 2020.
  12. ^ "Google apologises for racist blunder". BBC News. 1 July 2015. Retrieved 21 June 2020.
  13. ^ Bindi, Tas (October 2017). "Google Photos can now identify your pets". ZDNet. Retrieved 21 June 2020.
  14. ^ Stuart J. Russell (October 2019). Human Compatible: Artificial Intelligence and the Problem of Control. Viking. ISBN 978-0-525-55861-3. While it is unclear how exactly this error occurred, it is almost certain that Google's machine learning algorithm (assigned equal cost to any error). (Clearly, this is not Google's) true loss function, as was illustrated by the public relations disaster that ensued... there are millions of potentially distinct costs associated with misclassifying one category as another. Even if it had tried, Google would have found it very difficult to specify all these numbers up front... (a better algorithm could) occasionally ask the Google designer questions such as 'Which is worse, misclassifying a dog as a cat or misclassifying a person as an animal?'
  15. ^ Vincent, James (12 January 2018). "Google 'fixed' its racist algorithm by removing gorillas from its image-labeling tech". The Verge. Retrieved 21 June 2020.
  16. ^ "Google's solution to accidental algorithmic racism: ban gorillas". The Guardian. 12 January 2018. Retrieved 21 June 2020.
  17. ^ "Specification gaming: the flip side of AI ingenuity". DeepMind. Retrieved 21 June 2020.
  18. ^ a b Vamplew, Peter; Dazeley, Richard; Foale, Cameron; Firmin, Sally; Mummery, Jane (4 October 2017). "Human-aligned artificial intelligence is a multiobjective problem". Ethics and Information Technology. 20 (1): 27–40. doi:10.1007/s10676-017-9440-6. hdl:1959.17/164225. S2CID 3696067.
  19. ^ Douglas B. Lenat. "EURISKO: a program that learns new heuristics and domain concepts: the nature of heuristics III: program design and results." Artificial Intelligence (journal) 21, no. 1-2 (1983): 61-98.
  20. ^ Peter Vamplew, Lego Mindstorms robots as a platform for teaching reinforcement learning, in Proceedings of AISAT2004: International Conference on Artificial Intelligence in Science and Technology, 2004
  21. ^ Mandelbaum, Ryan F. (13 November 2019). "What Makes AI So Weird, Good, and Evil". Gizmodo. Retrieved 22 June 2020.
  22. ^ a b c d Lehman, Joel; Clune, Jeff; Misevic, Dusan; et al. (May 2020). "The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities". Artificial Life. 26 (2): 274–306. arXiv:1803.03453. doi:10.1162/artl_a_00319. PMID 32271631. S2CID 4519185.
  23. ^ Hayles, N. Katherine. "Simulating narratives: what virtual creatures can teach us." Critical Inquiry 26, no. 1 (1999): 1-26.
  24. ^ Jette Randløv and Preben Alstrøm. "Learning to Drive a Bicycle Using Reinforcement Learning and Shaping." In ICML, vol. 98, pp. 463-471. 1998.
  25. ^ Popov, Ivaylo, Nicolas Heess, Timothy Lillicrap, Roland Hafner, Gabriel Barth-Maron, Matej Vecerik, Thomas Lampe, Yuval Tassa, Tom Erez, and Martin Riedmiller. "Data-efficient deep reinforcement learning for dexterous manipulation." arXiv preprint arXiv:1704.03073 (2017).
  26. ^ "Learning from Human Preferences". OpenAI. 13 June 2017. Retrieved 21 June 2020.
  27. ^ "Can we stop AI outsmarting humanity?". The Guardian. 28 March 2019. Retrieved 21 June 2020.
  28. ^ Hadfield-Menell, Dylan, Smitha Milli, Pieter Abbeel, Stuart J. Russell, and Anca Dragan. "Inverse reward design." In Advances in neural information processing systems, pp. 6765-6774. 2017.
  29. ^ "Faulty Reward Functions in the Wild". OpenAI. 22 December 2016. Retrieved 21 June 2020.
  30. ^ "AI beats classic Q*bert video game". BBC News. 1 March 2018. Retrieved 21 June 2020.
  31. ^ Saunders, William, et al. "Trial without error: Towards safe reinforcement learning via human intervention." arXiv preprint arXiv:1707.05173 (2017).
  32. ^ Hester, Todd, et al. "Deep q-learning from demonstrations." 'Proceedings of the AAAI Conference on Artificial Intelligence'. Vol. 32. No. 1. 2018.
  33. ^ Friend, Tad (2018). "How Frightened Should We Be of A.I.?". The New Yorker. Retrieved 4 July 2020.
  34. ^ Brockman 2019, p. 24, Stuart Russell: The Purpose Put into the Machine. "We might call this the King Midas problem: Midas got exactly what he asked for—namely, that everything he touched would turn to gold — but too late he discovered the drawbacks of drinking liquid gold and eating solid gold."
  35. ^ Russell, Stuart (14 November 2014). "Of Myths and Moonshine". Edge. Retrieved 20 June 2020.
  36. ^ Brockman 2019, p. 137, Anca Dragan: Putting the Human into the AI Equation. "In general, humans have had a notoriously difficult time specifying exactly what they want, as exemplified by all those genie legends."
  37. ^ Brockman 2019, p. 128, Tom Griffiths: Putting the Human into the AI Equation.
  38. ^ Yampolskiy, Roman V. (11 March 2019). "Predicting future AI failures from historic examples". Foresight. 21 (1): 138–152. doi:10.1108/FS-04-2018-0034. S2CID 158306811.
  39. ^ Brockman 2019, p. 25, Stuart Russell: The Purpose Put into the Machine.
  40. ^ Piper, Kelsey (2 March 2019). "How will AI change our lives? Experts can't agree — and that could be a problem". Vox. Retrieved 23 June 2020.
  41. ^ Pinker, Steven (13 February 2018). "We're told to fear robots. But why do we think they'll turn on us?". Popular Science. Retrieved 23 June 2020.


  • Possible Minds: Twenty-five Ways of Looking at AI (Kindle ed.). Penguin Press. 2019. ISBN 978-0525557999.

External links[edit]